Question
I currently have a pandas DataFrame containing property information from this Kaggle dataset. Here is a sample DataFrame from that set:
| neighborhood | borough | block | year built | ... |
------------------------------------------------------
| AnnaDale | 5 | 5425 | 2015 | ... |
| Woodside | 4 | 2327 | 1966 | ... |
| Alphabet City | 1 | 396 | 1985 | ... |
| Alphabet City | 1 | 405 | 1996 | ... |
| Alphabet City | 1 | 396 | 1986 | ... |
| Alphabet City | 1 | 396 | 1992 | ... |
| Alphabet City | 1 | 396 | 0 | ... |
| Alphabet City | 1 | 396 | 1990 | ... |
| Alphabet City | 1 | 396 | 1984 | ... |
| Alphabet City | 1 | 396 | 0 | ... |
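What I want to do is take every row where the "year built" value equals zero and replace that value with the mean of the "year built" values from the rows with the same neighborhood, borough, and block. In some cases, a {neighborhood, borough, block} set contains several rows with a zero in the "year built" column. The sample DataFrame above shows this.
To illustrate the problem, I have pulled those two rows out of the sample DataFrame.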
| neighborhood | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1 | 396 | 0 | ... |
| Alphabet City | 1 | 396 | 0 | ... |
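To solve this, I want to fill the rows whose "year built" value is zero with the mean of the "year built" values from all other rows with the same neighborhood, borough, and block. For the sample rows, the neighborhood is Alphabet City, the borough is 1, and the block is 396, so I would use the following matching rows from the sample DataFrame to compute the mean: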
| neighborhood | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1 | 396 | 1985 | ... |
| Alphabet City | 1 | 396 | 1986 | ... |
| Alphabet City | 1 | 396 | 1992 | ... |
| Alphabet City | 1 | 396 | 1990 | ... |
| Alphabet City | 1 | 396 | 1984 | ... |
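I would take the mean of the "year built" column over those rows (which is 1987.4) and replace the zeros with it. The rows that originally held zeros would end up looking like this: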
| neighborhood | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1 | 396 | 1987.4 | ... |
| Alphabet City | 1 | 396 | 1987.4 | ... |
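My code so far
All I have done so far is drop the rows with a zero in the "year built" column and compute the mean year for each {neighborhood, borough, block} set. The original DataFrame is stored in raw_data and looks like the sample DataFrame at the top of this post. The code looks like this: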
# create a copy of the data
temp_data = raw_data.copy()
# remove all rows with zero in the "year built" column
mean_year_by_location = temp_data[temp_data["YEAR BUILT"] > 0]
# group the rows into {neighborhood, borough, block} sets and take the mean of the "year built" column in those sets
mean_year_by_location = mean_year_by_location.groupby(["NEIGHBORHOOD","BOROUGH","BLOCK"], as_index = False)["YEAR BUILT"].mean()
The output looks like this:
| neighborhood | borough | block | year built |
------------------------------------------------
| .... | ... | ... | ... |
| Alphabet City | 1 | 390 | 1985.342 |
| Alphabet City | 1 | 391 | 1986.76 |
| Alphabet City | 1 | 392 | 1992.8473 |
| Alphabet City | 1 | 393 | 1990.096 |
| Alphabet City | 1 | 394 | 1984.45 |
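So how do I take those averaged "year built" values from the mean_year_by_location DataFrame and use them to replace the zeros in the original raw_data DataFrame?
I apologize for the long post. I just wanted to be very clear.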
Solution:
Use set_index with replace, then fillna with the per-group mean.
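import numpy as np

v = df.set_index(
    ['neighborhood', 'borough', 'block']
)['year built'].replace(0, np.nan)
# Series.mean(level=...) was removed in pandas 2.0; group on the index levels instead
df = v.fillna(v.groupby(level=[0, 1, 2]).mean()).reset_index()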
df
neighborhood borough block year built
0 AnnaDale 5 5425 2015.0
1 Woodside 4 2327 1966.0
2 Alphabet City 1 396 1985.0
3 Alphabet City 1 405 1996.0
4 Alphabet City 1 396 1986.0
5 Alphabet City 1 396 1992.0
6 Alphabet City 1 396 1987.4
7 Alphabet City 1 396 1990.0
8 Alphabet City 1 396 1984.0
9 Alphabet City 1 396 1987.4
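For anyone who wants to reproduce the result above, here is a minimal sketch that builds the sample rows from the question and runs the same steps. The "..." columns are left out, and the names keys, group_means, and filled are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; the "..." columns are omitted.
df = pd.DataFrame({
    "neighborhood": ["AnnaDale", "Woodside"] + ["Alphabet City"] * 8,
    "borough": [5, 4, 1, 1, 1, 1, 1, 1, 1, 1],
    "block": [5425, 2327, 396, 405, 396, 396, 396, 396, 396, 396],
    "year built": [2015, 1966, 1985, 1996, 1986, 1992, 0, 1990, 1984, 0],
})

keys = ["neighborhood", "borough", "block"]

# Move the grouping columns into the index and mark zeros as missing.
v = df.set_index(keys)["year built"].replace(0, np.nan)

# Mean per {neighborhood, borough, block}; NaN entries are ignored.
group_means = v.groupby(level=keys, sort=False).mean()

# Fill each missing year with the mean that shares its index key.
filled = v.fillna(group_means).reset_index()
print(filled)
```

Rows 6 and 9 should come out as 1987.4, matching the table above.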
Details
First, set the index and replace 0 with NaN so that the upcoming mean calculation is not affected by those values:
v = df.set_index(
['neighborhood', 'borough', 'block']
)['year built'].replace(0, np.nan)
v
neighborhood borough block
AnnaDale 5 5425 2015.0
Woodside 4 2327 1966.0
Alphabet City 1 396 1985.0
405 1996.0
396 1986.0
396 1992.0
396 NaN
396 1990.0
396 1984.0
396 NaN
Name: year built, dtype: float64
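Next, compute the mean for each group:
# v.mean(level=[0, 1, 2]) no longer works in pandas 2.0+; sort=False keeps the original group order
m = v.groupby(level=[0, 1, 2], sort=False).mean()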
m
neighborhood borough block
AnnaDale 5 5425 2015.0
Woodside 4 2327 1966.0
Alphabet City 1 396 1987.4
405 1996.0
Name: year built, dtype: float64
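One edge case worth keeping in mind: the group mean skips NaN, so a group in which every "year built" was zero has a NaN mean and its rows stay unfilled. The sketch below shows this on made-up data (the block numbers and years are hypothetical, not taken from the Kaggle dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical data: block 10 has two known years, block 20 has none.
df = pd.DataFrame({
    "block": [10, 10, 10, 20, 20],
    "year built": [1990, 2000, 0, 0, 0],
})

v = df.set_index("block")["year built"].replace(0, np.nan)

# NaN entries are skipped inside each group, so block 10 averages
# 1990 and 2000 only; block 20 has nothing to average and yields NaN.
m = v.groupby(level=0).mean()

filled = v.fillna(m)

# Rows that are still missing after the fill (all of block 20 here).
still_missing = filled[filled.isna()]
print(m)
print(still_missing)
```

If such groups exist in the real data, they need a separate decision, for example falling back to a coarser group or leaving them as missing.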
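This serves as a mapping that we pass to fillna. fillna replaces the NaNs introduced earlier with the mean that shares the same index key. Once that is done, just reset the index to restore the original structure.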
v.fillna(m).reset_index()
neighborhood borough block year built
0 AnnaDale 5 5425 2015.0
1 Woodside 4 2327 1966.0
2 Alphabet City 1 396 1985.0
3 Alphabet City 1 405 1996.0
4 Alphabet City 1 396 1986.0
5 Alphabet City 1 396 1992.0
6 Alphabet City 1 396 1987.4
7 Alphabet City 1 396 1990.0
8 Alphabet City 1 396 1984.0
9 Alphabet City 1 396 1987.4
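Note that the approach above selects only the "year built" column, so any other columns in the DataFrame are dropped from the result. If those columns need to survive, one possible alternative (not the method used above) is groupby(...).transform("mean"), which returns the group mean aligned to the original rows. In this sketch the "sale price" column and its values are hypothetical stand-ins for the "..." columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "neighborhood": ["AnnaDale", "Woodside"] + ["Alphabet City"] * 8,
    "borough": [5, 4, 1, 1, 1, 1, 1, 1, 1, 1],
    "block": [5425, 2327, 396, 405, 396, 396, 396, 396, 396, 396],
    "year built": [2015, 1966, 1985, 1996, 1986, 1992, 0, 1990, 1984, 0],
    # Hypothetical extra column standing in for the "..." columns.
    "sale price": [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
})

keys = ["neighborhood", "borough", "block"]

result = df.copy()

# Mark zeros as missing so they do not pull the group means down.
result["year built"] = result["year built"].replace(0, np.nan)

# transform("mean") yields one value per original row: that row's group mean.
group_means = result.groupby(keys)["year built"].transform("mean")

# Fill only the missing years; every other column is left untouched.
result["year built"] = result["year built"].fillna(group_means)
print(result)
```

Because the index is never changed, no reset_index step is needed and the row order stays as it was.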