Titanic_PyTorch: Titanic - Machine Learning from Disaster (Kaggle)
The goal is to predict which Titanic passengers survived; it is a good playground problem.
A DNN built with PyTorch is used for training and prediction, reaching an accuracy of about 0.772.
From related write-ups, this dataset is somewhat special: the high-accuracy solutions mostly start with feature analysis to pick strong features and then classify with decision trees (random forests).
The overall workflow is roughly:
Load the data
Analyze the data
Preprocess the data (missing values, one-hot encoding, labels, standardization, type conversion)
Define the model (network, loss function, optimizer, hyperparameters)
Train (mini-batch SGD)
Predict
 
The notebook runs on both CPU and GPU, but the dataset is so small that the difference is hard to see; the training data should also be split further into a validation set to find the best number of epochs (a sketch of this appears after the training output below).
K-fold cross-validation could be added as well (a rough sketch is given at the end of this post), and the various hyperparameters have not really been explored.
Notebook

import numpy as np
import pandas as pd
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/kaggle/input/titanic/train.csv
from tqdm import tqdm
from sklearn.preprocessing import StandardScaler

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
device
device(type='cpu')
Load data

train_data = pd.read_csv('/kaggle/input/titanic/train.csv')
test_data  = pd.read_csv('/kaggle/input/titanic/test.csv')
sub        = pd.read_csv('/kaggle/input/titanic/gender_submission.csv')

train_data.head(3)
  
    
   PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                              female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
Analyzing data

train_data.info() gives an overview of the columns and their missing values:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
We can see that the Age, Cabin, and Embarked columns have missing values, which are filled or encoded in the preprocessing step below.
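A quick way to confirm the missing counts per column (this check is an addition, not a cell from the original run):

train_data.isnull().sum()   # Age, Cabin and Embarked are the only columns with NaN entries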
  
    
train_data.describe():

       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
This gives a feel for the distribution and range of each feature.
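As a small illustration of the feature analysis mentioned in the introduction (this cell is an addition to the original notebook), grouping the survival rate by Sex and Pclass shows why these two are usually treated as strong features:

# mean of the 0/1 Survived column = survival rate per group
train_data.groupby(['Sex', 'Pclass'])['Survived'].mean()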
Preprocess the data

The changes to the data are shown step by step:

filling the NaN values
label encoding
one-hot encoding
splitting the data
defining features and target
normalization
numpy to tensor
 
# fill missing ages in both sets with the mean age of the training set
age_mean = train_data['Age'].dropna().mean()
train_data['Age'].fillna(age_mean, inplace=True)
test_data['Age'].fillna(age_mean, inplace=True)

# encode Sex as an integer label (female -> 0, male -> 1)
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(train_data["Sex"])
train_data["Sex"] = le.transform(train_data["Sex"])
test_data["Sex"]  = le.transform(test_data["Sex"])

train_data.head(5)
  
    
   PassengerId  Survived  Pclass  Name                                                Sex  Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                             1    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  0    38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                              0    26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        0    35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                            1    35.0  0      0      373450            8.0500   NaN    S
# one-hot encode Embarked over the combined train+test frame;
# dummy_na=True adds an extra indicator column (named NaN) for missing values
temp    = pd.concat([train_data, test_data], axis=0)
temp_em = pd.get_dummies(temp["Embarked"], dummy_na=True)
temp    = pd.concat([temp, temp_em], axis=1)

temp.head(5)
  
    
   PassengerId  Survived  Pclass  Name                                                Sex  Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  C      Q      S      NaN
0  1            0.0       3       Braund, Mr. Owen Harris                             1    22.0  1      0      A/5 21171         7.2500   NaN    S         False  False  True   False
1  2            1.0       1       Cumings, Mrs. John Bradley (Florence Briggs Th...  0    38.0  1      0      PC 17599          71.2833  C85    C         True   False  False  False
2  3            1.0       3       Heikkinen, Miss. Laina                              0    26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         False  False  True   False
3  4            1.0       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        0    35.0  1      0      113803            53.1000  C123   S         False  False  True   False
4  5            0.0       3       Allen, Mr. William Henry                            1    35.0  0      0      373450            8.0500   NaN    S         False  False  True   False
# split the combined frame back into train and test parts
train = temp.iloc[:len(train_data), :]
test  = temp.iloc[len(train_data):, :]

# np.nan here refers to the dummy_na indicator column created above
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "C", "Q", "S", np.nan]
target   = "Survived"

train[features].head(5)
  
    
   Pclass  Sex  Age   SibSp  Parch  C      Q      S      NaN
0  3       1    22.0  1      0      False  False  True   False
1  1       0    38.0  1      0      True   False  False  False
2  3       0    26.0  0      0      False  False  True   False
3  1       0    35.0  1      0      False  False  True   False
4  3       1    35.0  0      0      False  False  True   False
train_X = np.array(train[features])
train_Y = np.array(train[target])
test_X  = np.array(test[features])
test_Y  = np.array(test[target])

# standardize the features: fit the scaler on the training data
# and reuse the same statistics for the test data
Scaler  = StandardScaler()
train_X = Scaler.fit_transform(train_X)
test_X  = Scaler.transform(test_X)
temp = pd.DataFrame(train_X, columns=features)
temp
  
    
     Pclass     Sex        Age        SibSp      Parch      C          Q          S          NaN
0    0.827377   0.737695  -0.592481   0.432793  -0.473674  -0.482043  -0.307562   0.619306  -0.047431
1   -1.566107  -1.355574   0.638789   0.432793  -0.473674   2.074505  -0.307562  -1.614710  -0.047431
2    0.827377  -1.355574  -0.284663  -0.474545  -0.473674  -0.482043  -0.307562   0.619306  -0.047431
3   -1.566107  -1.355574   0.407926   0.432793  -0.473674  -0.482043  -0.307562   0.619306  -0.047431
4    0.827377   0.737695   0.407926  -0.474545  -0.473674  -0.482043  -0.307562   0.619306  -0.047431
..        ...        ...        ...        ...        ...        ...        ...        ...        ...
886 -0.369365   0.737695  -0.207709  -0.474545  -0.473674  -0.482043  -0.307562   0.619306  -0.047431
887 -1.566107  -1.355574  -0.823344  -0.474545  -0.473674  -0.482043  -0.307562   0.619306  -0.047431
888  0.827377  -1.355574   0.000000   0.432793   2.008933  -0.482043  -0.307562   0.619306  -0.047431
889 -1.566107   0.737695  -0.284663  -0.474545  -0.473674   2.074505  -0.307562  -1.614710  -0.047431
890  0.827377   0.737695   0.177063  -0.474545  -0.473674  -0.482043   3.251373  -1.614710  -0.047431

891 rows × 9 columns
# convert the numpy arrays to tensors; note that the test rows have no real labels
# (Survived is NaN for them after the concat), so val_Y is only a placeholder
train_X = torch.FloatTensor(train_X[:])
train_Y = torch.LongTensor(train_Y[:])
val_X   = torch.FloatTensor(test_X[:])
val_Y   = torch.LongTensor(test_Y[:])

train_X
tensor([[ 0.8274,  0.7377, -0.5925,  …, -0.3076,  0.6193, -0.0474],
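The training loop below slices these tensors into mini-batches by hand. A more idiomatic PyTorch alternative (just a sketch, not what this notebook uses) would wrap them in a TensorDataset and let a DataLoader handle batching and shuffling:

from torch.utils.data import TensorDataset, DataLoader

train_ds     = TensorDataset(train_X, train_Y)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)

# the epoch loop would then simply iterate:
# for x_t, y_t in train_loader:
#     x_t, y_t = x_t.to(device), y_t.to(device)
#     ...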
Modeling 
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # 4 fully connected layers: input features -> 256 -> 32 -> 8 -> 2 classes
        self.fc1 = nn.Linear(len(features), 256)
        self.fc2 = nn.Linear(256, 32)
        self.fc3 = nn.Linear(32, 8)
        self.fc4 = nn.Linear(8, 2)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x)
        return x

model = Net()
model.to(device)

loss_fn   = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.02)

print(model, '\n', loss_fn, '\n', optimizer)
	Net(
Training

batch_size = 64
batch      = (len(train_X) + batch_size - 1) // batch_size   # number of mini-batches, including the last partial one
n_epochs   = 800

loop = tqdm(range(n_epochs))
for epoch in loop:
    total_loss = 0
    for i in range(batch):
        start = i * batch_size
        # clamp the last mini-batch to the end of the training set
        end = start + batch_size if (start + batch_size) < len(train_X) else len(train_X)

        x_t = train_X[start:end].to(device)
        y_t = train_Y[start:end].to(device)

        output = model(x_t)
        loss   = loss_fn(output, y_t)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        values, labels = torch.max(output, 1)    # predicted labels for this batch (not used further)
        total_loss += loss.item() * (end - start)

    # average loss per sample for this epoch
    total_loss = total_loss / len(train_X)
    loop.set_postfix(loss='{:6f}'.format(total_loss))
100%|██████████| 800/800 [00:21<00:00, 37.97it/s, loss=0.195122]
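As mentioned in the introduction, part of the training data should really be held out as a validation set to pick the best number of epochs. A minimal sketch of that idea (an addition to the notebook; the 80/20 split and the names tr_X, va_X, etc. are assumptions):

# hold out the last 20% of the training tensors for validation
n_val      = len(train_X) // 5
tr_X, tr_Y = train_X[:-n_val], train_Y[:-n_val]
va_X, va_Y = train_X[-n_val:], train_Y[-n_val:]

best_acc, best_epoch = 0.0, 0
for epoch in range(n_epochs):
    # ... run one epoch of the mini-batch SGD loop above on tr_X / tr_Y ...
    with torch.no_grad():
        preds = torch.max(model(va_X.to(device)), 1)[1]
        acc   = (preds.cpu() == va_Y).float().mean().item()
    if acc > best_acc:
        best_acc, best_epoch = acc, epoch
# best_epoch is then the epoch budget to use when retraining on all of train_X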
The output layer has two nodes, one score (logit) per class; torch.max() maps them to the corresponding 0/1 labels.
Predictions

with torch.no_grad():
    test_result = model(val_X.to(device))   # move the inputs to the same device as the model

labels   = torch.max(test_result, 1)[1]     # index of the larger logit = predicted class
survived = labels.cpu().numpy()
labels
tensor([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1,
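If class probabilities are wanted rather than hard labels (purely illustrative, not needed for the submission), a softmax over the two logits gives them:

probs = F.softmax(test_result, dim=1)   # shape (418, 2); probs[:, 1] is the predicted probability of survival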
Submission

submission = pd.DataFrame({'PassengerId': sub['PassengerId'], 'Survived': survived})
submission.to_csv('submission.csv', index=False)
submission
  
    
     PassengerId  Survived
0            892         0
1            893         0
2            894         0
3            895         0
4            896         0
..           ...       ...
413         1305         0
414         1306         1
415         1307         0
416         1308         0
417         1309         1

418 rows × 2 columns
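Finally, a rough sketch of the K-fold idea mentioned in the introduction (an addition to the notebook; sklearn's KFold is assumed, and a fresh model is trained per fold):

from sklearn.model_selection import KFold

kf        = KFold(n_splits=5, shuffle=True, random_state=42)
fold_accs = []
for tr_idx, va_idx in kf.split(train_X):
    fold_model = Net().to(device)                                   # fresh model for every fold
    fold_opt   = torch.optim.SGD(fold_model.parameters(), lr=0.02)
    # ... train fold_model on train_X[tr_idx], train_Y[tr_idx] with the loop above ...
    with torch.no_grad():
        preds = torch.max(fold_model(train_X[va_idx].to(device)), 1)[1]
        fold_accs.append((preds.cpu() == train_Y[va_idx]).float().mean().item())

print(sum(fold_accs) / len(fold_accs))                              # mean validation accuracy across folds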