Titanic_PyTorch: Titanic - Machine Learning from Disaster (Kaggle)
The goal is to predict which Titanic passengers survived; it is a good playground problem.
A DNN built with PyTorch is used for training and prediction, reaching an accuracy of about 0.772.
From related write-ups, this dataset is somewhat special: the high-accuracy solutions mostly start with feature analysis to pick strong features and then classify with decision trees (random forests).
The overall workflow is roughly:
Load the data
Analyze the data
Preprocess the data (missing values, one-hot encoding, labels, standardization, type conversion)
Define the model (network, loss function, optimizer, hyperparameters)
Train (mini-batch SGD)
Predict
 
The notebook runs on both CPU and GPU, but the dataset is so small that the difference is hard to see; the training data should also be split further into a validation set to find the best number of epochs (a sketch of this appears after the training output below).
K-fold cross-validation could be added as well (a rough sketch is given at the end of this post), and the various hyperparameters have not really been explored.
Notebook

import numpy as np
import pandas as pd
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/kaggle/input/titanic/train.csv
from tqdm import tqdm
from sklearn.preprocessing import StandardScaler

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
device
device(type='cpu')
Load data

train_data = pd.read_csv('/kaggle/input/titanic/train.csv')
test_data  = pd.read_csv('/kaggle/input/titanic/test.csv')
sub        = pd.read_csv('/kaggle/input/titanic/gender_submission.csv')

train_data.head(3)
  
    
   PassengerId  Survived  Pclass  Name                                                Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                             male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                              female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
Analyzing data

train_data.info() gives an overview of the columns and their missing values:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
We can see that the Age, Cabin, and Embarked columns have missing values, which are filled or encoded in the preprocessing step below.
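A quick way to confirm the missing counts per column (this check is an addition, not a cell from the original run):

train_data.isnull().sum()   # Age, Cabin and Embarked are the only columns with NaN entries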
  
    
train_data.describe():

       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
This gives a feel for the distribution and range of each feature.
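As a small illustration of the feature analysis mentioned in the introduction (this cell is an addition to the original notebook), grouping the survival rate by Sex and Pclass shows why these two are usually treated as strong features:

# mean of the 0/1 Survived column = survival rate per group
train_data.groupby(['Sex', 'Pclass'])['Survived'].mean()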
Preprocess the data

The changes to the data are shown step by step:

filling the NaN values
label encoding
one-hot encoding
splitting the data
defining features and target
normalization
numpy to tensor
 
# fill missing ages in both sets with the mean age of the training set
age_mean = train_data['Age'].dropna().mean()
train_data['Age'].fillna(age_mean, inplace=True)
test_data['Age'].fillna(age_mean, inplace=True)

# encode Sex as an integer label (female -> 0, male -> 1)
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(train_data["Sex"])
train_data["Sex"] = le.transform(train_data["Sex"])
test_data["Sex"]  = le.transform(test_data["Sex"])

train_data.head(5)
  
    
   PassengerId  Survived  Pclass  Name                                                Sex  Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                             1    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  0    38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                              0    26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        0    35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                            1    35.0  0      0      373450            8.0500   NaN    S
# one-hot encode Embarked over the combined train+test frame;
# dummy_na=True adds an extra indicator column (named NaN) for missing values
temp    = pd.concat([train_data, test_data], axis=0)
temp_em = pd.get_dummies(temp["Embarked"], dummy_na=True)
temp    = pd.concat([temp, temp_em], axis=1)

temp.head(5)
  
    
   PassengerId  Survived  Pclass  Name                                                Sex  Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  C      Q      S      NaN
0  1            0.0       3       Braund, Mr. Owen Harris                             1    22.0  1      0      A/5 21171         7.2500   NaN    S         False  False  True   False
1  2            1.0       1       Cumings, Mrs. John Bradley (Florence Briggs Th...  0    38.0  1      0      PC 17599          71.2833  C85    C         True   False  False  False
2  3            1.0       3       Heikkinen, Miss. Laina                              0    26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         False  False  True   False
3  4            1.0       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)        0    35.0  1      0      113803            53.1000  C123   S         False  False  True   False
4  5            0.0       3       Allen, Mr. William Henry                            1    35.0  0      0      373450            8.0500   NaN    S         False  False  True   False
# split the combined frame back into train and test parts
train = temp.iloc[:len(train_data), :]
test  = temp.iloc[len(train_data):, :]

# np.nan here refers to the dummy_na indicator column created above
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "C", "Q", "S", np.nan]
target   = "Survived"

train[features].head(5)
  
    
   Pclass  Sex  Age   SibSp  Parch  C      Q      S      NaN
0  3       1    22.0  1      0      False  False  True   False
1  1       0    38.0  1      0      True   False  False  False
2  3       0    26.0  0      0      False  False  True   False
3  1       0    35.0  1      0      False  False  True   False
4  3       1    35.0  0      0      False  False  True   False
train_X = np.array(train[features])
train_Y = np.array(train[target])
test_X  = np.array(test[features])
test_Y  = np.array(test[target])

# standardize the features: fit the scaler on the training data
# and reuse the same statistics for the test data
Scaler  = StandardScaler()
train_X = Scaler.fit_transform(train_X)
test_X  = Scaler.transform(test_X)
temp = pd.DataFrame(train_X, columns=features)
temp
  
    
     Pclass     Sex        Age        SibSp      Parch      C          Q          S          NaN
0    0.827377   0.737695  -0.592481   0.432793  -0.473674  -0.482043  -0.307562   0.619306  -0.047431
1   -1.566107  -1.355574   0.638789   0.432793  -0.473674   2.074505  -0.307562  -1.614710  -0.047431
2    0.827377  -1.355574  -0.284663  -0.474545  -0.473674  -0.482043  -0.307562   0.619306  -0.047431
3   -1.566107  -1.355574   0.407926   0.432793  -0.473674  -0.482043  -0.307562   0.619306  -0.047431
4    0.827377   0.737695   0.407926  -0.474545  -0.473674  -0.482043  -0.307562   0.619306  -0.047431
..        ...        ...        ...        ...        ...        ...        ...        ...        ...
886 -0.369365   0.737695  -0.207709  -0.474545  -0.473674  -0.482043  -0.307562   0.619306  -0.047431
887 -1.566107  -1.355574  -0.823344  -0.474545  -0.473674  -0.482043  -0.307562   0.619306  -0.047431
888  0.827377  -1.355574   0.000000   0.432793   2.008933  -0.482043  -0.307562   0.619306  -0.047431
889 -1.566107   0.737695  -0.284663  -0.474545  -0.473674   2.074505  -0.307562  -1.614710  -0.047431
890  0.827377   0.737695   0.177063  -0.474545  -0.473674  -0.482043   3.251373  -1.614710  -0.047431

891 rows × 9 columns
# convert the numpy arrays to tensors; note that the test rows have no real labels
# (Survived is NaN for them after the concat), so val_Y is only a placeholder
train_X = torch.FloatTensor(train_X[:])
train_Y = torch.LongTensor(train_Y[:])
val_X   = torch.FloatTensor(test_X[:])
val_Y   = torch.LongTensor(test_Y[:])

train_X
tensor([[ 0.8274,  0.7377, -0.5925,  …, -0.3076,  0.6193, -0.0474],
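The training loop below slices these tensors into mini-batches by hand. A more idiomatic PyTorch alternative (just a sketch, not what this notebook uses) would wrap them in a TensorDataset and let a DataLoader handle batching and shuffling:

from torch.utils.data import TensorDataset, DataLoader

train_ds     = TensorDataset(train_X, train_Y)
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)

# the epoch loop would then simply iterate:
# for x_t, y_t in train_loader:
#     x_t, y_t = x_t.to(device), y_t.to(device)
#     ...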
Modeling 
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # 4 fully connected layers: input features -> 256 -> 32 -> 8 -> 2 classes
        self.fc1 = nn.Linear(len(features), 256)
        self.fc2 = nn.Linear(256, 32)
        self.fc3 = nn.Linear(32, 8)
        self.fc4 = nn.Linear(8, 2)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x)
        return x

model = Net()
model.to(device)

loss_fn   = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.02)

print(model, '\n', loss_fn, '\n', optimizer)
	Net(
Training

batch_size = 64
batch      = (len(train_X) + batch_size - 1) // batch_size   # number of mini-batches, including the last partial one
n_epochs   = 800

loop = tqdm(range(n_epochs))
for epoch in loop:
    total_loss = 0
    for i in range(batch):
        start = i * batch_size
        # clamp the last mini-batch to the end of the training set
        end = start + batch_size if (start + batch_size) < len(train_X) else len(train_X)

        x_t = train_X[start:end].to(device)
        y_t = train_Y[start:end].to(device)

        output = model(x_t)
        loss   = loss_fn(output, y_t)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        values, labels = torch.max(output, 1)    # predicted labels for this batch (not used further)
        total_loss += loss.item() * (end - start)

    # average loss per sample for this epoch
    total_loss = total_loss / len(train_X)
    loop.set_postfix(loss='{:6f}'.format(total_loss))
100%|██████████| 800/800 [00:21<00:00, 37.97it/s, loss=0.195122]
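As mentioned in the introduction, part of the training data should really be held out as a validation set to pick the best number of epochs. A minimal sketch of that idea (an addition to the notebook; the 80/20 split and the names tr_X, va_X, etc. are assumptions):

# hold out the last 20% of the training tensors for validation
n_val      = len(train_X) // 5
tr_X, tr_Y = train_X[:-n_val], train_Y[:-n_val]
va_X, va_Y = train_X[-n_val:], train_Y[-n_val:]

best_acc, best_epoch = 0.0, 0
for epoch in range(n_epochs):
    # ... run one epoch of the mini-batch SGD loop above on tr_X / tr_Y ...
    with torch.no_grad():
        preds = torch.max(model(va_X.to(device)), 1)[1]
        acc   = (preds.cpu() == va_Y).float().mean().item()
    if acc > best_acc:
        best_acc, best_epoch = acc, epoch
# best_epoch is then the epoch budget to use when retraining on all of train_X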
The output layer has two nodes, one score (logit) per class; torch.max() maps them to the corresponding 0/1 labels.
Predictions

with torch.no_grad():
    test_result = model(val_X.to(device))   # move the inputs to the same device as the model

labels   = torch.max(test_result, 1)[1]     # index of the larger logit = predicted class
survived = labels.cpu().numpy()
labels
tensor([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1,
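If class probabilities are wanted rather than hard labels (purely illustrative, not needed for the submission), a softmax over the two logits gives them:

probs = F.softmax(test_result, dim=1)   # shape (418, 2); probs[:, 1] is the predicted probability of survival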
Submission

submission = pd.DataFrame({'PassengerId': sub['PassengerId'], 'Survived': survived})
submission.to_csv('submission.csv', index=False)
submission
  
    
     PassengerId  Survived
0            892         0
1            893         0
2            894         0
3            895         0
4            896         0
..           ...       ...
413         1305         0
414         1306         1
415         1307         0
416         1308         0
417         1309         1

418 rows × 2 columns
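Finally, a rough sketch of the K-fold idea mentioned in the introduction (an addition to the notebook; sklearn's KFold is assumed, and a fresh model is trained per fold):

from sklearn.model_selection import KFold

kf        = KFold(n_splits=5, shuffle=True, random_state=42)
fold_accs = []
for tr_idx, va_idx in kf.split(train_X):
    fold_model = Net().to(device)                                   # fresh model for every fold
    fold_opt   = torch.optim.SGD(fold_model.parameters(), lr=0.02)
    # ... train fold_model on train_X[tr_idx], train_Y[tr_idx] with the loop above ...
    with torch.no_grad():
        preds = torch.max(fold_model(train_X[va_idx].to(device)), 1)[1]
        fold_accs.append((preds.cpu() == train_Y[va_idx]).float().mean().item())

print(sum(fold_accs) / len(fold_accs))                              # mean validation accuracy across folds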