Problem description
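I have a data frame with an odd column named null that holds some of the values missing from other columns. One column, named location, holds latitude/longitude coordinates; another, named level, holds integers representing the target variable. Wherever location or level is missing a value, the value that belongs there is in the null column. Here is a sample of df: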
pd.DataFrame(
{'null': {0: '43.70477575,-72.28844073',1: '2',2: '43.70637091,-72.28704334',3: '4',4: '3'},'location': {0: nan,1: nan,2: nan,3: nan,4: nan},'level': {0: nan,4: nan}
}
)
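I need to be able to filter the null column by whether each value is an integer or a string, and then use that to fill the missing values in the appropriate column with the appropriate value. I have tried .apply() with a lambda function, using .match(), .contains(), and in inside the lambda, with no luck so far.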
Solutions
Let's try to_numeric:
checker = pd.to_numeric(df.null,errors='coerce')
checker
Out[171]:
0 NaN
1 2.0
2 NaN
3 4.0
4 3.0
Name: null,dtype: float64
Then apply isnull. A NaN in checker means the string could not be parsed as a number:
isstring = checker.isnull()
isstring
Out[172]:
0 True
1 False
2 True
3 False
4 False
Name: null,dtype: bool
isnumber = checker.notnull()
Fill in the values (coordinate strings go to location, numbers go to level):
df.loc[isstring,'location'] = df['null']
df.loc[isnumber,'level'] = df['null']
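As a rough end-to-end sketch of the to_numeric approach above: the sample frame is rebuilt from the question, location is cast to object before strings are written into it (an assumption made here to avoid dtype warnings in recent pandas versions), and the parsed numbers from checker are written to level instead of the raw strings. This is one possible arrangement, not necessarily what the answer's author ran.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "null": ["43.70477575,-72.28844073", "2", "43.70637091,-72.28704334", "4", "3"],
    "location": [np.nan] * 5,
    "level": [np.nan] * 5,
})

# Values that cannot be parsed as numbers become NaN
checker = pd.to_numeric(df["null"], errors="coerce")
isnumber = checker.notna()
isstring = checker.isna()

# location starts out as an all-NaN float column; cast it to object
# before writing strings into it (assumption: avoids dtype warnings)
df["location"] = df["location"].astype(object)
df.loc[isstring, "location"] = df.loc[isstring, "null"]

# level receives the parsed numeric values
df.loc[isnumber, "level"] = checker[isnumber]

print(df)
```

Writing checker rather than df['null'] into level keeps that column numeric, so no later conversion is needed.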
Another approach uses the pandas.Series.mask method:
>>> df
null location level
0 43.70477575,-72.28844073 NaN NaN
1 2 NaN NaN
2 43.70637091,-72.28704334 NaN NaN
3 4 NaN NaN
4 3 NaN NaN
>>> df.level.mask(df.null.str.isnumeric(),other = df.null,inplace = True)
>>> df.location.where(df.null.str.isnumeric(),other = df.null,inplace = True)
>>>
>>> df
null location level
0 43.70477575,-72.28844073 43.70477575,-72.28844073 NaN
1 2 NaN 2
2 43.70637091,-72.28704334 43.70637091,-72.28704334 NaN
3 4 NaN 4
4 3 NaN 3
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.mask.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.where.html
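The mask/where idea can also be written without inplace=True, assigning the results back to the columns instead. The sketch below assumes the same sample data as the question; the is_int name and the cast to object are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "null": ["43.70477575,-72.28844073", "2", "43.70637091,-72.28704334", "4", "3"],
    "location": [np.nan] * 5,
    "level": [np.nan] * 5,
})

# True where the string consists only of digit characters
is_int = df["null"].str.isnumeric()

# mask: replace values where the condition is True
df["level"] = df["level"].astype(object).mask(is_int, df["null"])
# where: keep values where the condition is True, replace the rest
df["location"] = df["location"].astype(object).where(is_int, df["null"])

# Optional: turn the digit strings in level into numbers
df["level"] = pd.to_numeric(df["level"])

print(df)
```

Note that str.isnumeric() is False for strings such as '-3' or '2.5', so this variant only fits levels written as plain non-negative integers.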
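The simplest approach (or at least one of the simplest) is to fill all missing values in df.location and df.level with the values from df.null, and then use regular expressions to build boolean filters that set the inappropriately assigned values in df.location and df.level back to np.nan.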
Series.fillna()
import numpy as np
import pandas as pd
from numpy import nan
df = pd.DataFrame(
{'null': {0: '43.70477575,-72.28844073',1: '2',2: '43.70637091,-72.28704334',3: '4',4: '3'},'location': {0: nan,1: nan,2: nan,3: nan,4: nan},'level': {0: nan,4: nan}
}
)
# Fill every missing value in both columns from the null column
for col in ['location','level']:
    df[col] = df[col].fillna(df['null'])
Now we use string patterns to correct the values that were assigned to the wrong column.
str.contains()
# Convert the columns to type str so the string methods work
df = df.astype(str)
# Use regexes to reset values that don't belong in a column to NaN
regex = '[,]'
df.loc[df.level.str.contains(regex),'level'] = np.nan
regex = r'^\d\.?0?$'
df.loc[df.location.str.contains(regex),'location'] = np.nan
# Return `df.level` to float datatype (str is the correct
# datatype for `df.location`)
df['level'] = df.level.astype(float)
Here is the output:
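pd.DataFrame(
{'null': {0: '43.70477575,-72.28844073',1: '2',2: '43.70637091,-72.28704334',3: '4',4: '3'},'location': {0: '43.70477575,-72.28844073',1: nan,2: '43.70637091,-72.28704334',3: nan,4: nan},'level': {0: nan,1: 2.0,2: nan,3: 4.0,4: 3.0}
}
)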
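A variant of the regex idea classifies the null column first and only then writes to the target columns, which avoids filling everything and cleaning up afterwards. This is a sketch under assumptions: coord_pattern and level_pattern are hypothetical patterns chosen to fit the sample data ("lat,lon" decimals and plain integers), and real data may need looser or stricter ones.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "null": ["43.70477575,-72.28844073", "2", "43.70637091,-72.28704334", "4", "3"],
    "location": [np.nan] * 5,
    "level": [np.nan] * 5,
})

# Hypothetical patterns: "lat,lon" as signed decimals, level as digits only
coord_pattern = r"-?\d+(?:\.\d+)?,-?\d+(?:\.\d+)?"
level_pattern = r"\d+"

# fullmatch requires the whole string to match the pattern
is_coord = df["null"].str.fullmatch(coord_pattern)
is_level = df["null"].str.fullmatch(level_pattern)

# Keep each value only in the column it belongs to; the rest become NaN
df["location"] = df["null"].where(is_coord)
df["level"] = pd.to_numeric(df["null"].where(is_level))

print(df)
```

Because every row of location and level is missing in the sample, the columns are overwritten outright here; if some rows already held valid values, the results could be passed to fillna on the existing columns instead.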