Titanic_PyTorch
Titanic - Machine Learning from Disaster | Kaggle
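The goal is to predict which Titanic passengers survived, which makes for a good testbed.
A DNN built with PyTorch is used for training and prediction, reaching an accuracy of about 0.772.
Judging from related write-ups, this dataset is somewhat special: the high-accuracy solutions mostly pick out strong features through feature analysis first and then classify with decision trees (random forests).
The overall process, summarized: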
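Load the data
Analyze the data
Preprocess the data (missing values, one-hot encoding, label encoding, standardization, type conversion)
Define the model (network, loss function, optimizer, hyperparameters)
Train (mini-batch SGD)
Predict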
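I ran it on both the CPU and the GPU, but the dataset is too small for the difference to show. The data also needs a further split, with a labelled validation subset held out, to find the best number of epochs.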
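One way to pick the epoch count is to hold out part of the labelled training rows, record the validation loss after every epoch, and keep the epoch where it is lowest. The sketch below assumes such a loss history already exists. `best_epoch` is a hypothetical helper, and the numbers in `val_losses` are made up for illustration; they are not results from this notebook.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1

# Hypothetical validation losses: improving at first, then overfitting
val_losses = [0.69, 0.58, 0.51, 0.47, 0.49, 0.55]
print(best_epoch(val_losses))
```

In practice the model weights from that epoch would be saved as well, so that training does not have to be repeated.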
K-fold cross-validation could also be added, and I do not yet have a good understanding of the various hyperparameters.
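As a rough idea of what K-fold would involve, the plain-Python sketch below splits the 891 training row indices into 5 folds, each used once for validation. `kfold_indices` is a hypothetical helper written for this note (it does not shuffle), not the scikit-learn API.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs; every sample is used once for validation."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(891, 5))
print([len(val) for _, val in folds])
```

The model would then be trained once per fold and the validation scores averaged, which gives a steadier estimate than a single split on a dataset this small.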
Notebook

```python
import numpy as np
import pandas as pd
import os

# List every file available under the Kaggle input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
```
/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv
```python
from tqdm import tqdm
from sklearn.preprocessing import StandardScaler

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Use the first GPU when available, otherwise fall back to the CPU
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
device
```
device(type='cpu')
load data

```python
train_data = pd.read_csv('/kaggle/input/titanic/train.csv')
test_data = pd.read_csv('/kaggle/input/titanic/test.csv')
sub = pd.read_csv('/kaggle/input/titanic/gender_submission.csv')

train_data.head(3)
```
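| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |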
analyzing data
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
```
The Age, Cabin, and Embarked columns have missing values, which are filled or encoded in the later steps.
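The size of each gap follows directly from the non-null counts reported by info(). A small worked calculation, using only the numbers printed above:

```python
total_rows = 891
# Non-null counts copied from the info() output
non_null = {"Age": 714, "Cabin": 204, "Embarked": 889}

missing = {col: total_rows - n for col, n in non_null.items()}
for col, n in missing.items():
    print(f"{col}: {n} missing ({n / total_rows:.1%})")
```

Age is missing in 177 rows, Cabin in 687, and Embarked in only 2, which is one reason Cabin is not used as a feature later on while Age is filled with the mean.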
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
This gives a further look at the data distribution and some of its characteristics.
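One thing the summary already tells us is the class balance: the mean of Survived is the survival rate. The short calculation below turns it into a survivor count and into the accuracy of always predicting the majority class. This is only a reference point computed on the training set; the test set may be balanced differently.

```python
n_rows = 891
survival_rate = 0.383838  # mean of the Survived column in the summary above

survivors = round(n_rows * survival_rate)
# Accuracy of always predicting the majority class (did not survive)
baseline_accuracy = max(survival_rate, 1 - survival_rate)
print(survivors, round(baseline_accuracy, 3))
```

So roughly 342 of the 891 training passengers survived, and a model has to beat about 0.616 accuracy to be doing better than a constant guess.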
Preprocess the data
The steps below show how the data changes at each stage:
filling the NaN values
label encoding
One-hot encoding
split the data
define features and target
Normalization
numpy to tensor
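```python
# Fill missing ages with the mean age of the training set
age_mean = train_data['Age'].dropna().mean()
# Assign the result back; fillna(inplace=True) on a selected column is deprecated in recent pandas
train_data['Age'] = train_data['Age'].fillna(age_mean)
test_data['Age'] = test_data['Age'].fillna(age_mean)

from sklearn.preprocessing import LabelEncoder

# Encode Sex as integers; the encoder is fitted on the training set only
le = LabelEncoder()
le.fit(train_data["Sex"])
train_data["Sex"] = le.transform(train_data["Sex"])
test_data["Sex"] = le.transform(test_data["Sex"])

train_data.head(5)
```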
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
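LabelEncoder assigns integer codes to the sorted unique labels, which is why female becomes 0 and male becomes 1 in the Sex column. The plain-Python sketch below reproduces that mapping for the first five rows; `fit_label_mapping` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def fit_label_mapping(values):
    """Map each distinct label to an integer code, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

sex_train = ["male", "female", "female", "female", "male"]  # first five training rows
mapping = fit_label_mapping(sex_train)
encoded = [mapping[v] for v in sex_train]
print(mapping, encoded)
```

With only two categories the integer code doubles as a binary flag, so no one-hot encoding is needed for this column.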
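```python
# Stack train and test so both get the same one-hot columns
temp = pd.concat([train_data, test_data], axis=0)
# dummy_na=True adds an extra indicator column for missing Embarked values
temp_em = pd.get_dummies(temp["Embarked"], dummy_na=True)
temp = pd.concat([temp, temp_em], axis=1)
temp.head(5)
```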
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | C | Q | S | NaN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | False | False | True | False |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | True | False | False | False |
| 2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | False | False | True | False |
| 3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | False | False | True | False |
| 4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | False | False | True | False |
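To make the four indicator columns concrete, here is a plain-Python sketch of one-hot encoding with an extra flag for missing values. `one_hot_with_na` is a hypothetical helper and the `embarked` sample is made up; it mirrors the column layout C, Q, S, NaN shown above rather than calling pandas.

```python
import math

def one_hot_with_na(values, categories):
    """One-hot encode values; the extra last column flags missing entries."""
    rows = []
    for v in values:
        is_missing = v is None or (isinstance(v, float) and math.isnan(v))
        row = [(not is_missing) and v == c for c in categories]
        row.append(is_missing)
        rows.append(row)
    return rows

embarked = ["S", "C", "S", None, "Q"]  # hypothetical sample with one missing value
encoded = one_hot_with_na(embarked, ["C", "Q", "S"])
for row in encoded:
    print(row)
```

Every row has exactly one True, so a missing port is treated as its own category instead of being dropped or imputed.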
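```python
# Split the combined frame back into the original train and test rows
train = temp.iloc[:len(train_data), :]
test = temp.iloc[len(train_data):, :]

# np.nan is the label of the missing-value indicator column created by get_dummies
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "C", "Q", "S", np.nan]
target = "Survived"

train[features].head(5)
```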
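| | Pclass | Sex | Age | SibSp | Parch | C | Q | S | NaN |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 22.0 | 1 | 0 | False | False | True | False |
| 1 | 1 | 0 | 38.0 | 1 | 0 | True | False | False | False |
| 2 | 3 | 0 | 26.0 | 0 | 0 | False | False | True | False |
| 3 | 1 | 0 | 35.0 | 1 | 0 | False | False | True | False |
| 4 | 3 | 1 | 35.0 | 0 | 0 | False | False | True | False |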
```python
train_X = np.array(train[features])
train_Y = np.array(train[target])

test_X = np.array(test[features])
test_Y = np.array(test[target])  # all NaN: the test set has no Survived labels

# Fit the scaler on the training set only and reuse its statistics for the test set
Scaler = StandardScaler()
train_X = Scaler.fit_transform(train_X)
test_X = Scaler.transform(test_X)
```
```python
# Inspect the standardized training features
temp = pd.DataFrame(train_X, columns=features)
temp
```
| | Pclass | Sex | Age | SibSp | Parch | C | Q | S | NaN |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.827377 | 0.737695 | -0.592481 | 0.432793 | -0.473674 | -0.482043 | -0.307562 | 0.619306 | -0.047431 |
| 1 | -1.566107 | -1.355574 | 0.638789 | 0.432793 | -0.473674 | 2.074505 | -0.307562 | -1.614710 | -0.047431 |
| 2 | 0.827377 | -1.355574 | -0.284663 | -0.474545 | -0.473674 | -0.482043 | -0.307562 | 0.619306 | -0.047431 |
| 3 | -1.566107 | -1.355574 | 0.407926 | 0.432793 | -0.473674 | -0.482043 | -0.307562 | 0.619306 | -0.047431 |
| 4 | 0.827377 | 0.737695 | 0.407926 | -0.474545 | -0.473674 | -0.482043 | -0.307562 | 0.619306 | -0.047431 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | -0.369365 | 0.737695 | -0.207709 | -0.474545 | -0.473674 | -0.482043 | -0.307562 | 0.619306 | -0.047431 |
| 887 | -1.566107 | -1.355574 | -0.823344 | -0.474545 | -0.473674 | -0.482043 | -0.307562 | 0.619306 | -0.047431 |
| 888 | 0.827377 | -1.355574 | 0.000000 | 0.432793 | 2.008933 | -0.482043 | -0.307562 | 0.619306 | -0.047431 |
| 889 | -1.566107 | 0.737695 | -0.284663 | -0.474545 | -0.473674 | 2.074505 | -0.307562 | -1.614710 | -0.047431 |
| 890 | 0.827377 | 0.737695 | 0.177063 | -0.474545 | -0.473674 | -0.482043 | 3.251373 | -1.614710 | -0.047431 |

891 rows × 9 columns
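The standardized values can be checked by hand with z = (x - mean) / std. One detail to keep in mind: describe() reports the sample standard deviation (ddof=1), while StandardScaler divides by the population standard deviation (ddof=0). The sketch below applies that conversion to the Pclass statistics from the earlier summary; `standardize` is a hypothetical helper, not a scikit-learn function.

```python
import math

def standardize(x, mean, sample_std, n):
    """z-score using the population std, which is what StandardScaler uses."""
    # describe() reports the sample std (ddof=1); convert it to ddof=0
    population_std = sample_std * math.sqrt((n - 1) / n)
    return (x - mean) / population_std

# Pclass statistics taken from the describe() output: mean, std, count
z_third = standardize(3, 2.308642, 0.836071, 891)
z_first = standardize(1, 2.308642, 0.836071, 891)
print(round(z_third, 4), round(z_first, 4))
```

The results agree with the 0.827377 (third class) and -1.566107 (first class) entries in the Pclass column, up to rounding of the inputs.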
```python
train_X = torch.FloatTensor(train_X[:])
train_Y = torch.LongTensor(train_Y[:])

# val_X holds the Kaggle test features; val_Y carries no real labels
val_X = torch.FloatTensor(test_X[:])
val_Y = torch.LongTensor(test_Y[:])

train_X
```
```
tensor([[ 0.8274,  0.7377, -0.5925,  ..., -0.3076,  0.6193, -0.0474],
        [-1.5661, -1.3556,  0.6388,  ..., -0.3076, -1.6147, -0.0474],
        [ 0.8274, -1.3556, -0.2847,  ..., -0.3076,  0.6193, -0.0474],
        ...,
        [ 0.8274, -1.3556,  0.0000,  ..., -0.3076,  0.6193, -0.0474],
        [-1.5661,  0.7377, -0.2847,  ..., -0.3076, -1.6147, -0.0474],
        [ 0.8274,  0.7377,  0.1771,  ...,  3.2514, -1.6147, -0.0474]])
```
Modeling
```python
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Four fully connected layers: features -> 256 -> 32 -> 8 -> 2 classes
        self.fc1 = nn.Linear(len(features), 256)
        self.fc2 = nn.Linear(256, 32)
        self.fc3 = nn.Linear(32, 8)
        self.fc4 = nn.Linear(8, 2)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x)  # raw scores; CrossEntropyLoss applies log-softmax itself
        return x


model = Net()
model.to(device)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.02)

print(model, '\n', loss_fn, '\n', optimizer)
```
```
Net(
  (fc1): Linear(in_features=9, out_features=256, bias=True)
  (fc2): Linear(in_features=256, out_features=32, bias=True)
  (fc3): Linear(in_features=32, out_features=8, bias=True)
  (fc4): Linear(in_features=8, out_features=2, bias=True)
)
 CrossEntropyLoss()
 SGD (
Parameter Group 0
    dampening: 0
    differentiable: False
    foreach: None
    lr: 0.02
    maximize: False
    momentum: 0
    nesterov: False
    weight_decay: 0
)
```
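For a sense of scale, the number of trainable parameters can be worked out from the layer shapes printed above: each Linear layer stores in_features × out_features weights plus out_features biases. A small worked calculation:

```python
# (in_features, out_features) of fc1..fc4, as printed above
layers = [(9, 256), (256, 32), (32, 8), (8, 2)]

# Each Linear layer holds in*out weights plus out biases
per_layer = [n_in * n_out + n_out for n_in, n_out in layers]
total_params = sum(per_layer)
print(per_layer, total_params)
```

That is about eleven thousand parameters for 891 training rows, so overfitting is a real possibility and a validation split is worth having.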
Training

```python
batch_size = 64
# Round up so the final, smaller batch is included
batch = (len(train_X) + batch_size - 1) // batch_size
n_epochs = 800

loop = tqdm(range(n_epochs))
for epoch in loop:
    total_loss = 0  # reset so the reported loss covers this epoch only
    for i in range(batch):
        start = i * batch_size
        # Clamp the slice end to the dataset size
        end = min(start + batch_size, len(train_X))
        x_t = train_X[start:end].to(device)
        y_t = train_Y[start:end].to(device)

        output = model(x_t)
        loss = loss_fn(output, y_t)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Weight each batch loss by its actual size
        total_loss += loss.item() * (end - start)

    total_loss = total_loss / len(train_X)
    loop.set_postfix(loss='{:6f}'.format(total_loss))
```
100%|██████████| 800/800 [00:21<00:00, 37.97it/s, loss=0.195122]
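With 891 training rows and a batch size of 64, one pass over the data takes 14 mini-batches, the last of which holds only 59 rows. The sketch below computes the slice boundaries on their own, without any tensors; `batch_slices` is a hypothetical helper used only for illustration.

```python
def batch_slices(n_samples, batch_size):
    """Return (start, end) index pairs that cover all samples in order."""
    return [
        (start, min(start + batch_size, n_samples))
        for start in range(0, n_samples, batch_size)
    ]

slices = batch_slices(891, 64)
print(len(slices), slices[-1])
```

Shuffling the row order before each epoch is a common addition, since otherwise every epoch sees the batches in exactly the same composition.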
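The output layer has two nodes, one per outcome. They hold raw scores (logits) rather than normalized probabilities; taking the index of the larger one with max() maps them to the corresponding 0/1 label.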
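The two outputs of the final Linear layer are raw scores. Applying softmax would turn them into probabilities, and because softmax preserves ordering, the label picked by the maximum is the same either way. A plain-Python sketch, with made-up `logits` for a single passenger and hypothetical helpers `softmax` and `argmax`:

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shifted = [z - max(logits) for z in logits]  # subtract the max for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [-0.4, 1.3]  # hypothetical outputs of the two nodes for one passenger
probs = softmax(logits)
print(argmax(logits), argmax(probs))
```

This is why the prediction step can skip softmax entirely when only the label is needed.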
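Predictions

```python
model.eval()
with torch.no_grad():
    # Move the inputs to the same device as the model
    test_result = model(val_X.to(device))

# Index of the larger output is the predicted class (0 or 1)
labels = torch.max(test_result, 1)[1]
survived = labels.cpu().numpy()
labels
```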
tensor([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1])
Submission

```python
submission = pd.DataFrame({'PassengerId': sub['PassengerId'], 'Survived': survived})
submission.to_csv('submission.csv', index=False)
submission
```
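| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 0 |
| ... | ... | ... |
| 413 | 1305 | 0 |
| 414 | 1306 | 1 |
| 415 | 1307 | 0 |
| 416 | 1308 | 0 |
| 417 | 1309 | 1 |

418 rows × 2 columns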