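Fill missing values in selected columns with filtered values from another column

Question

I have an odd column named null in a DataFrame that holds values missing from other columns. One column, location, holds latitude/longitude coordinates; the other, level, holds an integer that is the target variable. Wherever location or level is missing a value, the value that belongs there sits in the null column. Here is a sample of df: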

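import numpy as np
import pandas as pd

df = pd.DataFrame(
     {'null': {0: '43.70477575,-72.28844073', 1: '2', 2: '43.70637091,-72.28704334', 3: '4', 4: '3'},
      'location': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan},
      'level': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan}
     }
)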

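I need to filter the null column by whether each value is an integer or a string, and then use that value to fill the missing entry in the appropriate column. I have tried .apply() with a lambda function, as well as .match(), .contains() and in inside the lambda, with no luck so far.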

Solution

Let's try pd.to_numeric:

checker = pd.to_numeric(df.null,errors='coerce')
checker
Out[171]: 
0    NaN
1    2.0
2    NaN
3    4.0
4    3.0
Name: null,dtype: float64

Then apply isnull. A NaN in checker means the original string is not a number:

isstring = checker.isnull()
Out[172]: 
0     True
1    False
2     True
3    False
4    False
Name: null,dtype: bool
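isnumber = checker.notnull()

Fill in the values. The string rows hold coordinates, so they go to location; the numeric rows go to level:

df['location'] = df['location'].astype(object)  # let the column hold strings
df.loc[isstring, 'location'] = df['null']
df.loc[isnumber, 'level'] = checker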
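
The to_numeric steps above can be collected into one self-contained sketch. The helper name split_null_column is hypothetical, and the sketch assumes that every non-numeric string in null is a coordinate pair and that values already present in location or level should be kept.

```python
import numpy as np
import pandas as pd


def split_null_column(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with 'location' and 'level' filled from 'null'.

    Hypothetical helper: assumes non-numeric strings are coordinates.
    """
    out = df.copy()
    # Strings that cannot be parsed as numbers (the coordinates) become NaN.
    numeric = pd.to_numeric(out['null'], errors='coerce')
    is_string = numeric.isna()

    # Work on an object column so it can hold strings next to NaN.
    location = out['location'].astype(object)
    # Keep existing locations; fill gaps only from the string rows of 'null'.
    out['location'] = location.where(location.notna(), out['null'].where(is_string))
    # Keep existing levels; fill gaps with the parsed numbers.
    out['level'] = out['level'].fillna(numeric)
    return out


df = pd.DataFrame({
    'null': ['43.70477575,-72.28844073', '2', '43.70637091,-72.28704334', '4', '3'],
    'location': [np.nan] * 5,
    'level': [np.nan] * 5,
})
result = split_null_column(df)
print(result)
```

Because the helper works on a copy, the original df is left untouched, which makes it easy to compare the frame before and after.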

Another approach is to use pandas.Series.mask together with pandas.Series.where:

>>> df
                       null  location  level
0  43.70477575,-72.28844073       NaN    NaN
1                         2       NaN    NaN
2  43.70637091,-72.28704334       NaN    NaN
3                         4       NaN    NaN
4                         3       NaN    NaN
>>> df['level'] = df['level'].mask(df['null'].str.isnumeric(), other=df['null'])
>>> df['location'] = df['location'].where(df['null'].str.isnumeric(), other=df['null'])
>>>
>>> df
                       null                  location level
0  43.70477575,-72.28844073  43.70477575,-72.28844073   NaN
1                         2                       NaN     2
2  43.70637091,-72.28704334  43.70637091,-72.28704334   NaN
3                         4                       NaN     4
4                         3                       NaN     3

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.mask.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.where.html
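
For reference, here is a minimal runnable sketch of the mask/where idea that assigns the results back to the columns instead of relying on inplace. The variable name is_level is an assumption introduced for illustration, and the frame is cast to object up front so the empty columns can hold strings.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'null': ['43.70477575,-72.28844073', '2', '43.70637091,-72.28704334', '4', '3'],
    'location': [np.nan] * 5,
    'level': [np.nan] * 5,
}).astype(object)  # object columns can hold strings next to NaN

# True where 'null' consists only of digits, i.e. looks like a level.
is_level = df['null'].str.isnumeric()

# mask() replaces values where the condition is True.
df['level'] = df['level'].mask(is_level, df['null'])
# where() keeps values where the condition is True and replaces the rest.
df['location'] = df['location'].where(is_level, df['null'])

print(df)
```

Note that str.isnumeric() is True only for strings made up entirely of digit characters, so a value such as '2.5' or '-1' would be treated as a location here. Levels also stay strings in this sketch; convert them afterwards if numbers are needed.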


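The most straightforward way (if not the simplest one) is to fill all the missing values in df.location and df.level with the values from df.null, and then use regular expressions to build boolean filters that set the wrongly assigned values in df.location and df.level back to np.nan.

Series.fillna()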

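import numpy as np
import pandas as pd

df = pd.DataFrame(
     {'null': {0: '43.70477575,-72.28844073', 1: '2', 2: '43.70637091,-72.28704334', 3: '4', 4: '3'},
      'location': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan},
      'level': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan}
     }
)

# Fill every gap in both columns with the value from `null`;
# assigning the result back avoids relying on chained inplace=True
for col in ['location', 'level']:
     df[col] = df[col].fillna(df['null'])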

Now we use string expressions to correct the wrongly assigned values.

str.contains()

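# Convert all columns to str so the string methods work
df = df.astype(str)

# Use regex to reset values that don't belong in a column to NaN
regex = '[,]'
df.loc[df['level'].str.contains(regex), 'level'] = np.nan

regex = r'^\d+\.?0?$'
df.loc[df['location'].str.contains(regex), 'location'] = np.nan

# Return `df.level` to float dtype (str is the correct
# dtype for `df.location`); assign the result, since astype returns a copy
df['level'] = df['level'].astype(float)

This is the output:

pd.DataFrame(
     {'null': {0: '43.70477575,-72.28844073', 1: '2', 2: '43.70637091,-72.28704334', 3: '4', 4: '3'},
      'location': {0: '43.70477575,-72.28844073', 1: nan, 2: '43.70637091,-72.28704334', 3: nan, 4: nan},
      'level': {0: nan, 1: 2.0, 2: nan, 3: 4.0, 4: 3.0}
     }
)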