How can I replace certain values in a pandas column with the mean column value of similar rows?

Question

I currently have a pandas DataFrame containing property information from this Kaggle dataset. Here is a sample DataFrame from that set:

| neighborhood  | borough | block | year built | ... |
------------------------------------------------------
| AnnaDale      | 5       | 5425  | 2015       | ... |
| Woodside      | 4       | 2327  | 1966       | ... |
| Alphabet City | 1       | 396   | 1985       | ... |
| Alphabet City | 1       | 405   | 1996       | ... |
| Alphabet City | 1       | 396   | 1986       | ... |
| Alphabet City | 1       | 396   | 1992       | ... |
| Alphabet City | 1       | 396   | 0          | ... |
| Alphabet City | 1       | 396   | 1990       | ... |
| Alphabet City | 1       | 396   | 1984       | ... |
| Alphabet City | 1       | 396   | 0          | ... |
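
For anyone who wants to reproduce the sample, the table above can be rebuilt roughly as follows. This is only a sketch: it covers the four visible columns (the ones elided as "..." are left out), and the variable names df and zero_rows are chosen here for illustration.

```python
import pandas as pd

# Only the four columns shown in the sample table are reproduced;
# the columns elided as "..." are omitted.
df = pd.DataFrame({
    "neighborhood": ["AnnaDale", "Woodside"] + ["Alphabet City"] * 8,
    "borough": [5, 4, 1, 1, 1, 1, 1, 1, 1, 1],
    "block": [5425, 2327, 396, 405, 396, 396, 396, 396, 396, 396],
    "year built": [2015, 1966, 1985, 1996, 1986, 1992, 0, 1990, 1984, 0],
})

# Rows whose "year built" is recorded as 0, i.e. the ones to be filled in.
zero_rows = df[df["year built"] == 0]
print(zero_rows)
```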

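What I want to do is take every row where the value in the "year built" column equals zero and replace the "year built" value in those rows with the mean of the "year built" values in rows with the same neighborhood, borough, and block. In some cases, several rows within a {neighborhood, borough, block} set have zero in the "year built" column. This is shown in the sample DataFrame above.

To illustrate the problem, here are the two such rows from the sample DataFrame: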

| neighborhood  | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1       | 396   | 0          | ... |
| Alphabet City | 1       | 396   | 0          | ... |

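To solve this, I want to fill in the rows where "year built" is zero using the mean of the "year built" values from all other rows with the same neighborhood, borough, and block. For the example rows, the neighborhood is Alphabet City, the borough is 1, and the block is 396, so I would use the following matching rows from the sample DataFrame to compute the mean: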

| neighborhood  | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1       | 396   | 1985       | ... |
| Alphabet City | 1       | 396   | 1986       | ... |
| Alphabet City | 1       | 396   | 1992       | ... |
| Alphabet City | 1       | 396   | 1990       | ... |
| Alphabet City | 1       | 396   | 1984       | ... |

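I would take the mean of the "year built" column from those rows (which is 1987.4) and replace the zeros with that mean. The rows that originally had zeros would end up looking like this: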

| neighborhood  | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1       | 396   | 1987.4     | ... |
| Alphabet City | 1       | 396   | 1987.4     | ... |
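
As a quick check of the arithmetic, the mean of the five matching rows can be computed directly. The values are copied from the table of matching rows; matching_years and group_mean are names made up for this sketch.

```python
import pandas as pd

# "year built" values of the five non-zero rows for
# (Alphabet City, borough 1, block 396) in the sample.
matching_years = pd.Series([1985, 1986, 1992, 1990, 1984])

# (1985 + 1986 + 1992 + 1990 + 1984) / 5
group_mean = matching_years.mean()
print(group_mean)
```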

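My code so far

What I have done so far is remove the rows with zero in the "year built" column and find the mean year for each {neighborhood, borough, block} set. The original DataFrame is stored in raw_data, and it looks like the sample DataFrame at the very top of this post. The code looks like this: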

# create a copy of the data
temp_data = raw_data.copy()

# remove all rows with zero in the "year built" column
mean_year_by_location = temp_data[temp_data["YEAR BUILT"] > 0]

# group the rows into {neighborhood, borough, block} sets and take the mean of the "year built" column in those sets
mean_year_by_location = mean_year_by_location.groupby(["NEIGHBORHOOD", "BOROUGH", "BLOCK"], as_index=False)["YEAR BUILT"].mean()

The output looks like this:

| neighborhood  | borough | block | year built | 
------------------------------------------------
| ....          | ...     | ...   | ...        |
| Alphabet City | 1       | 390   | 1985.342   | 
| Alphabet City | 1       | 391   | 1986.76    | 
| Alphabet City | 1       | 392   | 1992.8473  | 
| Alphabet City | 1       | 393   | 1990.096   | 
| Alphabet City | 1       | 394   | 1984.45    |

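So how do I take those mean "year built" values from the mean_year_by_location DataFrame and use them to replace the zeros in the original raw_data DataFrame?

I apologize for the long post. I just wanted to be very clear.

Answer:

Use set_index with replace, then fillna with the group mean.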

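import numpy as np

# Index by the grouping keys and treat 0 as missing
v = df.set_index(
    ['neighborhood', 'borough', 'block']
)['year built'].replace(0, np.nan)

# Series.mean(level=...) was removed in pandas 2.0; groupby(level=...) is the equivalent
df = v.fillna(v.groupby(level=[0, 1, 2], sort=False).mean()).reset_index()
df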

    neighborhood  borough  block  year built
0       AnnaDale        5   5425      2015.0
1       Woodside        4   2327      1966.0
2  Alphabet City        1    396      1985.0
3  Alphabet City        1    405      1996.0
4  Alphabet City        1    396      1986.0
5  Alphabet City        1    396      1992.0
6  Alphabet City        1    396      1987.4
7  Alphabet City        1    396      1990.0
8  Alphabet City        1    396      1984.0
9  Alphabet City        1    396      1987.4
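
A variation that skips the index round trip is groupby(...).transform("mean"), which returns one group mean per original row. The sketch below is an alternative to the approach in this answer, not part of it; it rebuilds the sample data so it runs on its own, and the names keys and out are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "neighborhood": ["AnnaDale", "Woodside"] + ["Alphabet City"] * 8,
    "borough": [5, 4, 1, 1, 1, 1, 1, 1, 1, 1],
    "block": [5425, 2327, 396, 405, 396, 396, 396, 396, 396, 396],
    "year built": [2015, 1966, 1985, 1996, 1986, 1992, 0, 1990, 1984, 0],
})
keys = ["neighborhood", "borough", "block"]

out = df.copy()

# Treat 0 as missing so it does not pull the group mean down.
out["year built"] = out["year built"].replace(0, np.nan)

# transform("mean") yields the mean of each row's own group (NaN is skipped),
# aligned to the original row index, so it can be passed straight to fillna.
group_means = out.groupby(keys)["year built"].transform("mean")
out["year built"] = out["year built"].fillna(group_means)

print(out)
```

With either approach, a group in which every row is zero has no mean to borrow, so its values stay NaN and need separate handling.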

Details

First, set the index and replace 0 with NaN so that the upcoming mean calculation is not affected by those values:

v = df.set_index(
    ['neighborhood', 'borough', 'block']
)['year built'].replace(0, np.nan)   

v 

neighborhood   borough  block
AnnaDale       5        5425     2015.0
Woodside       4        2327     1966.0
Alphabet City  1        396      1985.0
                        405      1996.0
                        396      1986.0
                        396      1992.0
                        396         NaN
                        396      1990.0
                        396      1984.0
                        396         NaN
Name: year built, dtype: float64

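Next, compute the mean:

# Series.mean(level=...) no longer exists in pandas 2.0+; group on the index levels instead
m = v.groupby(level=[0, 1, 2], sort=False).mean()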
m

neighborhood   borough  block
AnnaDale       5        5425     2015.0
Woodside       4        2327     1966.0
Alphabet City  1        396      1987.4
                        405      1996.0
Name: year built, dtype: float64

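This serves as a mapping that we pass to fillna. fillna aligns it on the index and replaces each NaN introduced earlier with the mean for the matching index entry. Once that is done, just reset the index to restore the original structure.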

v.fillna(m).reset_index()

    neighborhood  borough  block  year built
0       AnnaDale        5   5425      2015.0
1       Woodside        4   2327      1966.0
2  Alphabet City        1    396      1985.0
3  Alphabet City        1    405      1996.0
4  Alphabet City        1    396      1986.0
5  Alphabet City        1    396      1992.0
6  Alphabet City        1    396      1987.4
7  Alphabet City        1    396      1990.0
8  Alphabet City        1    396      1984.0
9  Alphabet City        1    396      1987.4
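
The question asks specifically how to reuse the mean_year_by_location table. One possible way, sketched here under assumptions, is to left-merge it back onto raw_data and take the mean only where the year is zero. The raw_data below is a stand-in built from the sample rows with the upper-case column names used in the question's code, and "MEAN YEAR" is a temporary column name invented for this example.

```python
import pandas as pd

# Stand-in for the real raw_data, limited to the sample rows.
raw_data = pd.DataFrame({
    "NEIGHBORHOOD": ["AnnaDale", "Woodside"] + ["Alphabet City"] * 8,
    "BOROUGH": [5, 4, 1, 1, 1, 1, 1, 1, 1, 1],
    "BLOCK": [5425, 2327, 396, 405, 396, 396, 396, 396, 396, 396],
    "YEAR BUILT": [2015, 1966, 1985, 1996, 1986, 1992, 0, 1990, 1984, 0],
})
keys = ["NEIGHBORHOOD", "BOROUGH", "BLOCK"]

# Same table as in the question: mean year per location, zeros excluded.
# The value column is renamed so it does not clash during the merge.
mean_year_by_location = (
    raw_data[raw_data["YEAR BUILT"] > 0]
    .groupby(keys, as_index=False)["YEAR BUILT"]
    .mean()
    .rename(columns={"YEAR BUILT": "MEAN YEAR"})
)

# A left merge keeps every original row, in order, with its location's mean.
merged = raw_data.merge(mean_year_by_location, on=keys, how="left")

# Keep non-zero years; where the year is 0, use the location mean instead.
years = merged["YEAR BUILT"].astype(float)
merged["YEAR BUILT"] = years.where(years != 0, merged["MEAN YEAR"])

result = merged.drop(columns="MEAN YEAR")
print(result)
```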
