Problem description
I am trying to extrapolate my dataset. A snippet is shown below; a simple linear extrapolation would be fine:
Index Value
3000 NaN
4000 NaN
5000 10
6000 20
6500 33
7000 44
8300 60
9300 NaN
9400 NaN
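The extrapolation should take the index values into account. Since the pandas package only provides interpolation, I'm stuck. I looked at the scipy package, but couldn't get it to do what I have in mind. Any help is much appreciated.
Solution
I'm more familiar with scikit-learn: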
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
df = pd.DataFrame(
    [(3000, np.nan), (4000, np.nan), (5000, 10), (6000, 20), (6500, 33),
     (7000, 44), (8300, 60), (9300, np.nan), (9400, np.nan)],
    columns=['Index', 'Value'])

def extrapolate(df, X_col, y_col):
    # Fit a straight line on the rows where y is known, then predict for every row.
    df_ = df[[X_col, y_col]].dropna()
    return LinearRegression().fit(
        df_[X_col].values.reshape(-1, 1), df_[y_col]).predict(
        df[X_col].values.reshape(-1, 1))
df['Value_'] = extrapolate(df,'Index','Value')
df
You should get the following:
Index Value Value_
0 3000 NaN -23.219022
1 4000 NaN -7.314802
2 5000 10.0 8.589417
3 6000 20.0 24.493637
4 6500 33.0 32.445747
5 7000 44.0 40.397857
6 8300 60.0 61.073342
7 9300 NaN 76.977562
8 9400 NaN 78.567984
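Note that Value_ does not reproduce the known values exactly (for example 8.589417 instead of 10 at index 5000), because LinearRegression fits a single least-squares line through all five known points rather than connecting them. As a rough sketch of what that fit amounts to, the same slope and intercept can be computed in closed form with NumPy. The variable names below are illustrative only and are not part of the answer's code.

```python
import numpy as np

# Known (Index, Value) pairs from the snippet above
x = np.array([5000, 6000, 6500, 7000, 8300], dtype=float)
y = np.array([10, 20, 33, 44, 60], dtype=float)

# Ordinary least squares with one predictor:
# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Evaluate the fitted line at every index, including those outside the known range
all_x = np.array([3000, 4000, 5000, 6000, 6500, 7000, 8300, 9300, 9400], dtype=float)
fitted = slope * all_x + intercept
print(slope, intercept)
print(fitted)
```

Under these assumptions the slope comes out at roughly 0.0159 and the intercept at roughly -70.93, which should match the Value_ column above.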
# I assume you want to keep the original values and only fill in the missing ones
df['Value'] = df['Value'].fillna(df['Value_'])
df
This gives:
Index Value Value_
0 3000 -23.219022 -23.219022
1 4000 -7.314802 -7.314802
2 5000 10.000000 8.589417
3 6000 20.000000 24.493637
4 6500 33.000000 32.445747
5 7000 44.000000 40.397857
6 8300 60.000000 61.073342
7 9300 76.977562 76.977562
8 9400 78.567984 78.567984
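If the goal is instead to continue the first and last line segments of the data, rather than to use one regression line through all points, one possible alternative is scipy.interpolate.interp1d with fill_value="extrapolate". This is a different technique from the scikit-learn approach above and gives different numbers. The sketch below assumes SciPy is installed and uses plain arrays in place of the DataFrame; the names index, value, and filled are illustrative.

```python
import numpy as np
from scipy.interpolate import interp1d

index = np.array([3000, 4000, 5000, 6000, 6500, 7000, 8300, 9300, 9400], dtype=float)
value = np.array([np.nan, np.nan, 10, 20, 33, 44, 60, np.nan, np.nan])

# Build a piecewise-linear interpolant from the known points only;
# fill_value="extrapolate" extends the first and last segments beyond the data.
known = ~np.isnan(value)
f = interp1d(index[known], value[known], kind="linear", fill_value="extrapolate")

# Keep the original values where they exist and fill only the gaps
filled = np.where(known, value, f(index))
print(filled)
```

With a DataFrame like the one above, the result could be written back with df['Value'] = filled. Because only the two outermost known points on each side determine the extension, this variant is more sensitive to noise at the edges than the regression fit.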