Titanic_PyTorch

Titanic - Machine Learning from Disaster | Kaggle

The goal is to predict which passengers survived the Titanic disaster, which makes it a nice sandbox for experiments.

A DNN built with PyTorch is used for training and prediction, reaching an accuracy of about 0.772.

From reading related write-ups, this dataset is somewhat special: the high-accuracy solutions generally start with feature analysis to pick strong features and then classify with decision trees (random forests).
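For contrast, a minimal random-forest baseline in that spirit might look like the following sketch (the feature subset is illustrative and untuned, not the approach taken in this notebook):

# Hypothetical random-forest baseline, assuming the same Kaggle CSV paths used below.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('/kaggle/input/titanic/train.csv')
X = pd.get_dummies(df[["Pclass", "Sex", "SibSp", "Parch"]])  # encodes Sex, keeps numeric columns as-is
y = df["Survived"]

rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
rf.fit(X, y)
print(rf.score(X, y))  # accuracy on the training set, not a generalization estimate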


A rough summary of the overall process:

  1. Load the data
  2. Analyze the data
  3. Preprocess the data (missing values, one-hot encoding, label encoding, standardization, type conversion)
  4. Define the model (network, loss function, optimizer, hyperparameters)
  5. Train (mini-batch SGD)
  6. Predict

The model was run on both CPU and GPU, but the dataset is so small that the difference is hard to see. The data also needs a further hold-out split to find the best number of epochs, as sketched below.
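A minimal sketch of such a hold-out split, assuming the train_X / train_Y tensors and the model built later in this notebook (the 80/20 ratio and the per-epoch validation-accuracy check are illustrative):

# Hold-out sketch: set aside part of the training data to pick the best epoch.
# Assumes train_X (FloatTensor), train_Y (LongTensor) and model as built below.
import torch

idx = torch.randperm(len(train_X))   # shuffled indices
split = int(0.8 * len(train_X))      # 80/20 split (arbitrary choice)
tr_X, tr_Y = train_X[idx[:split]], train_Y[idx[:split]]
va_X, va_Y = train_X[idx[split:]], train_Y[idx[split:]]

# After each training epoch, track validation accuracy and keep the best epoch:
with torch.no_grad():
    val_acc = (model(va_X).argmax(dim=1) == va_Y).float().mean().item()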

K-fold cross-validation could also be layered on top of this (see the sketch below), and the various hyperparameters were not explored in much depth.
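A K-fold sketch along those lines, assuming the NumPy arrays train_X / train_Y from the preprocessing step below (before the tensor conversion); each fold would train a fresh Net:

# K-fold cross-validation sketch; assumes numpy arrays train_X / train_Y as built below.
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr_idx, va_idx) in enumerate(kf.split(train_X)):
    x_tr, y_tr = train_X[tr_idx], train_Y[tr_idx]   # training fold
    x_va, y_va = train_X[va_idx], train_Y[va_idx]   # validation fold
    # ... train a fresh model on (x_tr, y_tr), validate on (x_va, y_va) ...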


Notebook

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv

from tqdm import tqdm  # progress bar
from sklearn.preprocessing import StandardScaler

# ------------- pytorch ---------------
import torch
import torch.nn as nn
import torch.nn.functional as F # all functions including loss and activation function
import torch.optim as optim # optimization algorithm

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') # GPU or CPU
device

device(type='cpu')

load data

# -------------- load data ------------------

train_data = pd.read_csv('/kaggle/input/titanic/train.csv')
test_data = pd.read_csv('/kaggle/input/titanic/test.csv')
sub = pd.read_csv('/kaggle/input/titanic/gender_submission.csv')
train_data.head(3)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |

analyzing data

train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

We can see that the Age, Cabin, and Embarked columns have missing values; they will be filled or encoded further below.
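A quick check of the exact missing counts (a small addition, consistent with the info() output above):

# Missing values per column: Age 177, Cabin 687, Embarked 2, everything else 0.
train_data.isnull().sum()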

train_data.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
This gives a further sense of the data distribution and some of the features.
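For example, grouping the survival rate by a column hints at how strong a feature it is (an illustrative check, not part of the original notebook; Sex is still the raw male/female strings at this point):

# Mean survival rate per sex; roughly 0.74 for female vs 0.19 for male on this dataset.
train_data.groupby('Sex')['Survived'].mean()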

Preprocess the data

The transformations applied to the data are shown step by step:

  • Filling the NaN values
  • Label encoding
  • One-hot encoding
  • Splitting the data
  • Defining features and target
  • Normalization
  • NumPy to tensor
# Age - filling the NaN values (use the training-set mean for both sets)
age_mean = train_data['Age'].dropna().mean()
train_data['Age'] = train_data['Age'].fillna(age_mean)
test_data['Age'] = test_data['Age'].fillna(age_mean)

# Sex - label encoding (female -> 0, male -> 1)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(train_data["Sex"])
train_data["Sex"] = le.transform(train_data["Sex"])
test_data["Sex"] = le.transform(test_data["Sex"])
train_data.head(5)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# Embarked - one hot encoding
temp = pd.concat([train_data, test_data], axis=0)
temp_em = pd.get_dummies(temp["Embarked"], dummy_na=True) # dummy_na adds an indicator column for missing values
temp = pd.concat([temp, temp_em], axis=1)
temp.head(5)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | C | Q | S | NaN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | False | False | True | False |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | True | False | False | False |
| 2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | False | False | True | False |
| 3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | False | False | True | False |
| 4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | False | False | True | False |
# split the data
train = temp.iloc[:len(train_data), :]
test = temp.iloc[len(train_data):, :]

# define features and target
# (np.nan is the column name that get_dummies(dummy_na=True) gave the missing-Embarked indicator)
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "C", "Q", "S", np.nan]
target = "Survived"
train[features].head(5)
| | Pclass | Sex | Age | SibSp | Parch | C | Q | S | NaN |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 22.0 | 1 | 0 | False | False | True | False |
| 1 | 1 | 0 | 38.0 | 1 | 0 | True | False | False | False |
| 2 | 3 | 0 | 26.0 | 0 | 0 | False | False | True | False |
| 3 | 1 | 0 | 35.0 | 1 | 0 | False | False | True | False |
| 4 | 3 | 1 | 35.0 | 0 | 0 | False | False | True | False |
train_X = np.array(train[features])
train_Y = np.array(train[target])
test_X = np.array(test[features])
test_Y = np.array(test[target]) # all NaN: Survived is unknown for the test set

# Normalization
Scaler = StandardScaler()
train_X = Scaler.fit_transform(train_X)
test_X = Scaler.transform(test_X) # reuse the training statistics; do not re-fit on the test data
temp = pd.DataFrame(train_X, columns=features)
temp
| | Pclass | Sex | Age | SibSp | Parch | C | Q | S | NaN |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.827377 | 0.737695 | -0.592481 | 0.432793 | -0.473674 | -0.482043 | -0.307562 | 0.619306 | -0.047431 |
| 1 | -1.566107 | -1.355574 | 0.638789 | 0.432793 | -0.473674 | 2.074505 | -0.307562 | -1.614710 | -0.047431 |
| 2 | 0.827377 | -1.355574 | -0.284663 | -0.474545 | -0.473674 | -0.482043 | -0.307562 | 0.619306 | -0.047431 |
| 3 | -1.566107 | -1.355574 | 0.407926 | 0.432793 | -0.473674 | -0.482043 | -0.307562 | 0.619306 | -0.047431 |
| 4 | 0.827377 | 0.737695 | 0.407926 | -0.474545 | -0.473674 | -0.482043 | -0.307562 | 0.619306 | -0.047431 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | -0.369365 | 0.737695 | -0.207709 | -0.474545 | -0.473674 | -0.482043 | -0.307562 | 0.619306 | -0.047431 |
| 887 | -1.566107 | -1.355574 | -0.823344 | -0.474545 | -0.473674 | -0.482043 | -0.307562 | 0.619306 | -0.047431 |
| 888 | 0.827377 | -1.355574 | 0.000000 | 0.432793 | 2.008933 | -0.482043 | -0.307562 | 0.619306 | -0.047431 |
| 889 | -1.566107 | 0.737695 | -0.284663 | -0.474545 | -0.473674 | 2.074505 | -0.307562 | -1.614710 | -0.047431 |
| 890 | 0.827377 | 0.737695 | 0.177063 | -0.474545 | -0.473674 | -0.482043 | 3.251373 | -1.614710 | -0.047431 |

891 rows × 9 columns

# numpy to torch
train_X = torch.FloatTensor(train_X)
train_Y = torch.LongTensor(train_Y)
val_X = torch.FloatTensor(test_X)
val_Y = torch.LongTensor(test_Y) # unused below: the test-set labels are NaN placeholders
train_X

tensor([[ 0.8274,  0.7377, -0.5925,  ..., -0.3076,  0.6193, -0.0474],
        [-1.5661, -1.3556,  0.6388,  ..., -0.3076, -1.6147, -0.0474],
        [ 0.8274, -1.3556, -0.2847,  ..., -0.3076,  0.6193, -0.0474],
        ...,
        [ 0.8274, -1.3556,  0.0000,  ..., -0.3076,  0.6193, -0.0474],
        [-1.5661,  0.7377, -0.2847,  ..., -0.3076, -1.6147, -0.0474],
        [ 0.8274,  0.7377,  0.1771,  ...,  3.2514, -1.6147, -0.0474]])


Modeling

  • model
  • loss_fn
  • optimizer
# ----------------- pytorch ---------------------
# ----------- define Neural Network -------------

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(len(features), 256)
        self.fc2 = nn.Linear(256, 32)
        self.fc3 = nn.Linear(32, 8)
        self.fc4 = nn.Linear(8, 2) # dead or alive

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x) # raw logits; CrossEntropyLoss applies log-softmax internally
        return x

model = Net()
model.to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr = 0.02)
print(model, '\n', loss_fn, '\n', optimizer)

Net(
  (fc1): Linear(in_features=9, out_features=256, bias=True)
  (fc2): Linear(in_features=256, out_features=32, bias=True)
  (fc3): Linear(in_features=32, out_features=8, bias=True)
  (fc4): Linear(in_features=8, out_features=2, bias=True)
)
CrossEntropyLoss()
SGD (
Parameter Group 0
    dampening: 0
    differentiable: False
    foreach: None
    lr: 0.02
    maximize: False
    momentum: 0
    nesterov: False
    weight_decay: 0
)


Training

# ----------- train ------------
# ------ mini-batch SGD --------

batch_size = 64
n_batches = (len(train_X) + batch_size - 1) // batch_size # ceiling division so the last partial batch is included
n_epochs = 800

loop = tqdm(range(n_epochs))
for epoch in loop:
    total_loss = 0 # reset at the start of every epoch
    for i in range(n_batches):
        start = i * batch_size
        end = min(start + batch_size, len(train_X)) # clamp the final batch to the data length
        # Get data to cuda if possible
        x_t = train_X[start:end].to(device)
        y_t = train_Y[start:end].to(device)

        # forward
        output = model(x_t) # prediction
        loss = loss_fn(output, y_t) # loss between predictions and answers

        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        values, labels = torch.max(output, 1) # max score and its index (the predicted label); not used further here
        total_loss += loss.item() * (end - start) # sum of per-sample losses

    total_loss = total_loss / len(train_X) # mean training loss for this epoch
    # tqdm
    loop.set_postfix(loss = '{:6f}'.format(total_loss))

100%|██████████| 800/800 [00:21<00:00, 37.97it/s, loss=0.195122]

The model's output layer has two nodes, one score per outcome; torch.max() maps them to the corresponding 0/1 labels, as illustrated below.
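A tiny standalone illustration of that mapping (hypothetical logits, reusing the torch import from above):

# torch.max(..., 1) returns (values, indices); the indices are the 0/1 labels.
logits = torch.tensor([[2.0, -1.0],   # this row favors class 0 (died)
                       [0.3,  1.7]])  # this row favors class 1 (survived)
values, labels = torch.max(logits, 1)
print(labels)  # tensor([0, 1])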


Predictions

with torch.no_grad(): # inference only, no gradients needed
    test_result = model(val_X.to(device)) # move inputs to the model's device
    labels = torch.max(test_result, 1)[1] # index of the larger logit = predicted class
    survived = labels.cpu().numpy()
labels

tensor([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1,
1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0,
1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0,
1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1,
1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1,
1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1,
0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1,
0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0,
0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,
0, 1, 0, 1, 1, 0, 1, 0, 0, 1])


Submission

submission = pd.DataFrame({'PassengerId': sub['PassengerId'], 'Survived': survived})
submission.to_csv('submission.csv', index=False)
submission
| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 0 |
| ... | ... | ... |
| 413 | 1305 | 0 |
| 414 | 1306 | 1 |
| 415 | 1307 | 0 |
| 416 | 1308 | 0 |
| 417 | 1309 | 1 |

418 rows × 2 columns