How can I detect a suspected error in one column of a dataset?

Problem description

I am trying to encode the data in the dataset named train.csv provided in this github repository. I am using the code below to do it.

import pandas as pd 
from sklearn import preprocessing
df = pd.read_csv(r'train.csv',index_col='Id')
df.head()
df['MSSubClass'].fillna(df['MSSubClass'].mean()//1)
df['MSZoning'].fillna(df['MSZoning'].mode())
label_encoder = preprocessing.LabelEncoder() 
for col in df.columns:
    if df[col].dtype == 'O':
        print(df[col])
        df[col] = label_encoder.fit_transform(df[col])
print(df) 

While encoding, I get the following output

MSSubClass
MSZoning
LotFrontage
LotArea
Street
Alley
TypeError: '<' not supported between instances of 'str' and 'float'

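But when I look at the dataset, there is no '<' anywhere in the Alley column. The preceding columns are encoded without problems, yet the Alley column raises the error. Please help me!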

This is the colab notebook of the code

Solution

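The '<' in the message is the comparison operator, not a character in your data. The Alley column still contains missing values (NaN, which is a float) next to strings, and the error most likely comes from LabelEncoder trying to order them. The underlying problem is that your missing values are not replaced in all columns, and the result of fillna has to be assigned back. I also added .iloc[0] to mode so that the first value is selected when there are two or more modes: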

import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.read_csv(r'train.csv', index_col='Id')
print(df)

# Split the columns into numeric and non-numeric groups
colsNum = df.select_dtypes(np.number).columns
colsObj = df.columns.difference(colsNum)

# fillna returns a new object, so assign the result back
df[colsNum] = df[colsNum].fillna(df[colsNum].mean()//1)
# mode() can return several rows; .iloc[0] keeps the first one
df[colsObj] = df[colsObj].fillna(df[colsObj].mode().iloc[0])

label_encoder = preprocessing.LabelEncoder()
for col in colsObj:
    print(df[col])
    df[col] = label_encoder.fit_transform(df[col])

print(df)
      MSSubClass  MSZoning  LotFrontage  LotArea  Street  Alley  LotShape  \
Id                                                                          
1             60         3         65.0     8450       1      0         3   
2             20         3         80.0     9600       1      0         3   
3             60         3         68.0    11250       1      0         0   
4             70         3         60.0     9550       1      0         0   
5             60         3         84.0    14260       1      0         0   
         ...       ...          ...      ...     ...    ...       ...   
1456          60         3         62.0     7917       1      0         3   
1457          20         3         85.0    13175       1      0         3   
1458          70         3         66.0     9042       1      0         3   
1459          20         3         68.0     9717       1      0         3   
1460          20         3         75.0     9937       1      0         3   

      LandContour  Utilities  LotConfig  ...  PoolArea  PoolQC  Fence  \
Id                                       ...                            
1               3          0          4  ...         0       2      2   
2               3          0          2  ...         0       2      2   
3               3          0          4  ...         0       2      2   
4               3          0          0  ...         0       2      2   
5               3          0          2  ...         0       2      2   
          ...        ...        ...  ...       ...     ...    ...   
1456            3          0          4  ...         0       2      2   
1457            3          0          4  ...         0       2      2   
1458            3          0          4  ...         0       2      0   
1459            3          0          4  ...         0       2      2   
1460            3          0          4  ...         0       2      2   

      MiscFeature  MiscVal  MoSold  YrSold  SaleType  SaleCondition  SalePrice  
Id                                                                              
1               2        0       2    2008         8              4     208500  
2               2        0       5    2007         8              4     181500  
3               2        0       9    2008         8              4     223500  
4               2        0       2    2006         8              0     140000  
5               2        0      12    2008         8              4     250000  
          ...      ...     ...     ...       ...            ...        ...  
1456            2        0       8    2007         8              4     175000  
1457            2        0       2    2010         8              4     210000  
1458            2     2500       5    2010         8              4     266500  
1459            2        0       4    2010         8              4     142125  
1460            2        0       6    2008         8              4     147500  

[1460 rows x 80 columns]
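
If you want to find the problem columns before encoding, something along these lines may help. It is only a minimal sketch on a made-up toy frame, not on train.csv itself: the name toy and its values are assumptions for illustration. It lists the non-numeric columns that still hold NaN, shows that sorting such values in plain Python fails with the same kind of TypeError, and shows that fillna changes nothing until the result is assigned back.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "Street": ["Pave", "Grvl", "Pave", "Pave"],
        "Alley": ["Grvl", np.nan, "Pave", np.nan],
        "LotArea": [8450, 9600, 11250, 9550],
    }
)

# Non-numeric columns that still contain missing values are the suspects:
# NaN is a float, so it cannot be ordered against the strings around it.
text_cols = toy.select_dtypes(exclude="number").columns
suspect = [col for col in text_cols if toy[col].isna().any()]
print("columns mixing text and NaN:", suspect)

# Sorting such values in plain Python fails with the same kind of TypeError.
try:
    sorted(toy["Alley"].tolist())
    sort_error = None
except TypeError as exc:
    sort_error = str(exc)
print("sorting error:", sort_error)

# fillna returns a new Series; without assignment the column is unchanged.
toy["Alley"].fillna("Grvl")
still_missing = int(toy["Alley"].isna().sum())

# Assigning the result back is what actually removes the NaN values.
toy["Alley"] = toy["Alley"].fillna(toy["Alley"].mode().iloc[0])
missing_after = int(toy["Alley"].isna().sum())
print("missing before / after assignment:", still_missing, missing_after)
```

On the real data, running the same check on df right after read_csv should list Alley among the suspect columns, along with every other text column that has gaps.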