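Fill missing values in selected columns with filtered values from another column

Problem description

I have a weird column named null in a dataframe that contains some of the values missing from other columns. One column, named location, holds latitude/longitude coordinates; the other, named level, holds an integer representing the target variable. Wherever location or level is missing a value, the value that should be there is in the null column. Here is a sample of df: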

import pandas as pd
from numpy import nan
df = pd.DataFrame(
     {'null': {0: '43.70477575,-72.28844073', 1: '2', 2: '43.70637091,-72.28704334', 3: '4', 4: '3'},
      'location': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
      'level': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}
     }
)

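I need to filter the null column by whether each value is an integer or a string, and then use that to fill the missing value in the appropriate column. I have tried .apply() with lambda functions, and .match(), .contains() and in inside those lambdas, with no luck so far.

Solution

Let's try to_numeric: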

checker = pd.to_numeric(df['null'], errors='coerce')
checker
Out[171]: 
0    NaN
1    2.0
2    NaN
3    4.0
4    3.0
Name: null, dtype: float64

Then apply isnull. A NaN in checker means the original string could not be parsed as a number:

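isstring = checker.isnull()
isstring
Out[172]: 
0     True
1    False
2     True
3    False
4    False
Name: null, dtype: bool
isnumber = checker.notnull()

Fill in the values:

# Cast the all-NaN float columns to object first so that writing strings
# into them does not trigger an incompatible-dtype warning or error
df = df.astype({'location': object, 'level': object})
df.loc[isstring, 'location'] = df['null']  # coordinate strings
df.loc[isnumber, 'level'] = df['null']     # numeric strings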
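
The to_numeric steps above can be pulled together into one self-contained sketch. The helper name split_mixed_column and its parameters are made up for illustration and are not part of pandas. It assumes that anything to_numeric cannot parse is a coordinate string. Unlike the snippet above, it stores level as floats rather than numeric strings.

```python
import numpy as np
import pandas as pd


def split_mixed_column(df, source="null", text_col="location", num_col="level"):
    """Return a copy of df with gaps in text_col / num_col filled from source.

    Hypothetical helper: values of `source` that parse as numbers go to
    num_col, every other non-missing value goes to text_col.
    """
    out = df.copy()
    numeric = pd.to_numeric(out[source], errors="coerce")  # NaN where not a number
    is_number = numeric.notna()
    is_text = out[source].notna() & ~is_number

    # Candidate fill values: the text entries only, NaN everywhere else
    text_fill = out[source].where(is_text)

    # Keep existing values and fill only the gaps; build the text column as
    # object dtype so no strings are written into a float column
    out[text_col] = out[text_col].astype(object).where(out[text_col].notna(), text_fill)
    out[num_col] = out[num_col].where(out[num_col].notna(), numeric)
    return out


df = pd.DataFrame({
    "null": ["43.70477575,-72.28844073", "2", "43.70637091,-72.28704334", "4", "3"],
    "location": np.nan,
    "level": np.nan,
})

result = split_mixed_column(df)
print(result)
```

Returning a new frame instead of assigning through .loc avoids upcasting a float column in place, which recent pandas versions warn about.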

Another approach uses the pandas.Series.mask method:

>>> df
                       null  location  level
0  43.70477575,-72.28844073       NaN    NaN
1                         2       NaN    NaN
2  43.70637091,-72.28704334       NaN    NaN
3                         4       NaN    NaN
4                         3       NaN    NaN
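>>> # mask replaces values where the condition is True; where replaces them where it is False
>>> df['level'] = df['level'].mask(df['null'].str.isnumeric(), other=df['null'])
>>> df['location'] = df['location'].where(df['null'].str.isnumeric(), other=df['null'])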
>>>
>>> df
                       null                  location level
0  43.70477575,-72.28844073  43.70477575,-72.28844073   NaN
1                         2                       NaN     2
2  43.70637091,-72.28704334  43.70637091,-72.28704334   NaN
3                         4                       NaN     4
4                         3                       NaN     3

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.mask.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.where.html
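
One caveat when choosing between str.isnumeric and to_numeric as the test: str.isnumeric is True only when every character is a numeric character, so a sign or a decimal point makes it False. The sketch below compares the two checks. The sample values "-3" and "4.0" are invented for illustration and do not appear in the question's data.

```python
import pandas as pd

# "2" and the coordinate string come from the question; "-3" and "4.0" are
# hypothetical values added to show where the two checks disagree
samples = pd.Series(["2", "-3", "4.0", "43.70477575,-72.28844073"])

by_isnumeric = samples.str.isnumeric()
by_to_numeric = pd.to_numeric(samples, errors="coerce").notna()

comparison = pd.DataFrame({
    "value": samples,
    "str.isnumeric": by_isnumeric,
    "to_numeric": by_to_numeric,
})
print(comparison)
```

For the data in the question both checks agree, since level holds small non-negative integers.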

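Perhaps the simplest way is to fill all missing values in df.location and df.level with the values from df.null, and then use regular expressions to build boolean filters that set the inappropriate or mis-assigned values in df.location and df.level back to np.nan.

fillna()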

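df = pd.DataFrame(
     {'null': {0: '43.70477575,-72.28844073', 1: '2', 2: '43.70637091,-72.28704334', 3: '4', 4: '3'},
      'location': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
      'level': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}
     }
)

# Fill every gap in both columns with the value from 'null';
# assign the result back instead of using inplace=True on a column
for col in ['location', 'level']:
     df[col] = df[col].fillna(df['null'])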

Now we use string expressions to correct the mis-assigned values.

str.contains()

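import numpy as np

# Converting columns to type str so string methods work
df = df.astype(str)

# Using regex to change values that don't belong in a column to NaN:
# anything containing a comma is a coordinate pair, not a level
regex = '[,]'
df.loc[df['level'].str.contains(regex), 'level'] = np.nan

# A single digit, optionally followed by '.' and '0', is a level, not a location
regex = r'^\d\.?0?$'
df.loc[df['location'].str.contains(regex), 'location'] = np.nan

# Returning `df.level` to float datatype (str is the correct
# datatype for `df.location`); astype returns a new Series, so assign it back
df['level'] = df['level'].astype(float)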
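
The location pattern is narrow, so it helps to check which strings it actually matches. The sketch below runs the same pattern over a few candidates. The values "20" and "12" are invented for illustration. The second pattern, wider_pattern, is a suggested alternative under the assumption that levels may have more than one digit. It is not part of the original answer.

```python
import pandas as pd

# Pattern from the answer: one digit, an optional '.', an optional '0'
level_like = r'^\d\.?0?$'

# Hypothetical wider pattern: any run of digits, optionally followed by '.0...'
wider_pattern = r'^\d+(?:\.0+)?$'

# "20" and "12" are invented test values; "nan" is what astype(str) makes of NaN
candidates = pd.Series(["2", "4.0", "20", "12", "43.70477575,-72.28844073", "nan"])

narrow_matches = candidates.str.contains(level_like)
wide_matches = candidates.str.contains(wider_pattern)

print(pd.DataFrame({
    "value": candidates,
    "narrow": narrow_matches,
    "wide": wide_matches,
}))
```

With the original pattern, "20" matches only because its second character happens to be 0, while "12" does not match at all. This is harmless for the single-digit levels in the question but worth knowing before reusing the pattern on other data.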

Here is the output:

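pd.DataFrame(
     {'null': {0: '43.70477575,-72.28844073', 1: '2', 2: '43.70637091,-72.28704334', 3: '4', 4: '3'},
      'location': {0: '43.70477575,-72.28844073', 1: nan, 2: '43.70637091,-72.28704334', 3: nan, 4: nan},
      'level': {0: nan, 1: 2.0, 2: nan, 3: 4.0, 4: 3.0}
     }
)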
