Problem description
I am trying to encode the data in the dataset named train.csv provided in this GitHub repository. I use the code below to do this.
import pandas as pd
from sklearn import preprocessing

df = pd.read_csv(r'train.csv', index_col='Id')
df.head()
df['MSSubClass'].fillna(df['MSSubClass'].mean()//1)
df['MSZoning'].fillna(df['MSZoning'].mode())
label_encoder = preprocessing.LabelEncoder()
for col in df.columns:
    if df[col].dtype == 'O':
        print(df[col])
        df[col] = label_encoder.fit_transform(df[col])
print(df)
MSSubClass
MSZoning
LotFrontage
LotArea
Street
Alley
TypeError: '<' not supported between instances of 'str' and 'float'
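But when I look at the dataset, there is no '<' anywhere in the Alley column.
The preceding columns were encoded without problems, but the Alley column raises the error. Please help me!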
This is the colab notebook of the code
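The '<' in the message is Python's less-than operator, not a character in the data. Below is a minimal sketch of how the same kind of error can arise without the dataset. It assumes the encoder sorts the column's unique values (as the traceback suggests) and that the column holds string labels plus float NaN for missing entries. The sample values and the try_sort helper are hypothetical.

```python
import math

# Hypothetical stand-in for a column such as Alley: string labels plus a
# float NaN where the value is missing.
values = ["Grvl", math.nan, "Pave"]


def try_sort(items):
    """Return the sorted list, or the TypeError message if sorting fails."""
    try:
        return sorted(items)
    except TypeError as exc:
        return str(exc)


# Sorting a mix of str and float fails because '<' is undefined between them.
mixed_result = try_sort(values)

# Replacing the NaN with a string label first makes the values comparable.
clean_values = ["Missing" if isinstance(v, float) else v for v in values]
clean_result = try_sort(clean_values)

print(mixed_result)
print(clean_result)
```

The order of the two type names in the message depends on which pair of values is compared first, so it may read 'float' and 'str' rather than 'str' and 'float'.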
Solution
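The problem is that the missing values are never actually replaced: fillna returns a new object, so its result has to be assigned back, and this has to be done for all columns, not just two of them. The NaN values that remain are floats, so an object column such as Alley ends up mixing strings and floats, which cannot be compared with each other when the labels are sorted. I also added .iloc[0] to mode so that the first value is selected (if there are two or more values):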
import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.read_csv(r'train.csv', index_col='Id')
print(df)

# Split the columns into numeric and non-numeric (object) groups
colsNum = df.select_dtypes(np.number).columns
colsObj = df.columns.difference(colsNum)

# Assign the filled values back: floored mean for numeric columns,
# first mode for object columns
df[colsNum] = df[colsNum].fillna(df[colsNum].mean()//1)
df[colsObj] = df[colsObj].fillna(df[colsObj].mode().iloc[0])

label_encoder = preprocessing.LabelEncoder()
for col in colsObj:
    print(df[col])
    df[col] = label_encoder.fit_transform(df[col])
print(df)
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \
Id
1 60 3 65.0 8450 1 0 3
2 20 3 80.0 9600 1 0 3
3 60 3 68.0 11250 1 0 0
4 70 3 60.0 9550 1 0 0
5 60 3 84.0 14260 1 0 0
... ... ... ... ... ... ...
1456 60 3 62.0 7917 1 0 3
1457 20 3 85.0 13175 1 0 3
1458 70 3 66.0 9042 1 0 3
1459 20 3 68.0 9717 1 0 3
1460 20 3 75.0 9937 1 0 3
LandContour Utilities LotConfig ... PoolArea PoolQC Fence \
Id ...
1 3 0 4 ... 0 2 2
2 3 0 2 ... 0 2 2
3 3 0 4 ... 0 2 2
4 3 0 0 ... 0 2 2
5 3 0 2 ... 0 2 2
... ... ... ... ... ... ...
1456 3 0 4 ... 0 2 2
1457 3 0 4 ... 0 2 2
1458 3 0 4 ... 0 2 0
1459 3 0 4 ... 0 2 2
1460 3 0 4 ... 0 2 2
MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
Id
1 2 0 2 2008 8 4 208500
2 2 0 5 2007 8 4 181500
3 2 0 9 2008 8 4 223500
4 2 0 2 2006 8 0 140000
5 2 0 12 2008 8 4 250000
... ... ... ... ... ... ...
1456 2 0 8 2007 8 4 175000
1457 2 0 2 2010 8 4 210000
1458 2 2500 5 2010 8 4 266500
1459 2 0 4 2010 8 4 142125
1460 2 0 6 2008 8 4 147500
[1460 rows x 80 columns]
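The two points of the fix (assigning the fillna result back, and taking the first row of mode) can be checked on a small frame. This is only a sketch: the column names are borrowed from the question, but the values are made up, with a deliberate tie between two labels.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
df = pd.DataFrame({
    "LotFrontage": [65.0, None, 80.0],
    "Alley": ["Pave", None, "Grvl"],
})

# fillna returns a new Series; without assignment the frame is unchanged.
df["Alley"].fillna("Grvl")
still_missing = int(df["Alley"].isna().sum())

# "Grvl" and "Pave" tie for most frequent, so mode() has two rows;
# .iloc[0] keeps the first one as a Series indexed by column name.
modes = df[["Alley"]].mode()
first_mode = modes.iloc[0]

# Assign the results back so the frame is actually updated.
df[["Alley"]] = df[["Alley"]].fillna(first_mode)
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].mean() // 1)

print(still_missing)
print(df)
```

Passing the whole mode() result of a single column to fillna would align it on the row index instead, so only rows whose index matches a row of the mode result could be filled. That is why the first row is selected explicitly.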