Extrapolating a dataset based on its index

Problem description

I am trying to extrapolate my dataset. A snippet is shown below. A simple linear extrapolation would be fine:

Index Value
3000  NaN
4000  NaN
5000  10
6000  20
6500  33
7000  44  
8300  60
9300  NaN
9400  NaN

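The extrapolation should take the index values into account, since they are not evenly spaced. As far as I can tell, pandas only offers interpolation, so I am stuck. I looked at the scipy package but could not work out how to do what I want. Any help is much appreciated.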

Solution

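I am more familiar with scikit-learn, so I would fit a linear regression on the known rows and predict the missing ones: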

import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression

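# Rebuild the sample from the question; NaN marks the values to extrapolate
df = pd.DataFrame(
    [(3000, np.nan), (4000, np.nan), (5000, 10), (6000, 20), (6500, 33),
     (7000, 44), (8300, 60), (9300, np.nan), (9400, np.nan)],
    columns=['Index', 'Value'])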

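def extrapolate(df, X_col, y_col):
    # Fit only on the rows where both columns are present
    df_ = df[[X_col, y_col]].dropna()

    # Predict for every row, including those with a missing y value
    return LinearRegression().fit(
        df_[X_col].values.reshape(-1, 1), df_[y_col]).predict(
        df[X_col].values.reshape(-1, 1))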

df['Value_'] = extrapolate(df,'Index','Value')
df

You should get the following:

    Index   Value   Value_
0   3000    NaN     -23.219022
1   4000    NaN     -7.314802
2   5000    10.0    8.589417
3   6000    20.0    24.493637
4   6500    33.0    32.445747
5   7000    44.0    40.397857
6   8300    60.0    61.073342
7   9300    NaN     76.977562
8   9400    NaN     78.567984
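As a cross-check, the same least-squares line can be fitted with NumPy alone. This is only a sketch under the assumption that a single straight line through all known points is what you want; the names `index`, `value`, `known`, `fitted` and `filled` are mine, not part of the code above.

```python
import numpy as np

# Index/Value pairs from the question; NaN marks the values to extrapolate.
index = np.array([3000, 4000, 5000, 6000, 6500, 7000, 8300, 9300, 9400], dtype=float)
value = np.array([np.nan, np.nan, 10, 20, 33, 44, 60, np.nan, np.nan])

# Fit a degree-1 polynomial (a straight line) to the known points only.
known = ~np.isnan(value)
slope, intercept = np.polyfit(index[known], value[known], deg=1)

# Evaluate the line at every index, then keep the measured values where they exist.
fitted = slope * index + intercept
filled = np.where(known, value, fitted)
```

Here `fitted` should agree with the `Value_` column above, since both are ordinary least-squares fits of a line to the same five points.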

# I assume you don't want to replace the original (known) values,
# so only the NaN entries are filled from the fitted column
df['Value'] = df['Value'].fillna(df['Value_'])
df

This gives:

    Index   Value   Value_
0   3000    -23.219022  -23.219022
1   4000    -7.314802   -7.314802
2   5000    10.000000   8.589417
3   6000    20.000000   24.493637
4   6500    33.000000   32.445747
5   7000    44.000000   40.397857
6   8300    60.000000   61.073342
7   9300    76.977562   76.977562
8   9400    78.567984   78.567984
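Since the question mentions scipy: if you would rather extend the first and last line segments than fit one regression line through everything, `scipy.interpolate.interp1d` accepts `fill_value="extrapolate"`. The sketch below assumes that behaviour is acceptable for your data; it is a different technique from the regression above and produces different numbers. The variable names are again my own.

```python
import numpy as np
from scipy.interpolate import interp1d

# Index/Value pairs from the question; NaN marks the values to extrapolate.
index = np.array([3000, 4000, 5000, 6000, 6500, 7000, 8300, 9300, 9400], dtype=float)
value = np.array([np.nan, np.nan, 10, 20, 33, 44, 60, np.nan, np.nan])

# Build a piecewise-linear function from the known points. Outside the known
# range, the nearest end segment is extended as a straight line.
known = ~np.isnan(value)
f = interp1d(index[known], value[known], kind="linear", fill_value="extrapolate")

# Keep the measured values and fill only the gaps.
filled = np.where(known, value, f(index))
```

With this approach the leading gaps follow the slope between 5000 and 6000, and the trailing gaps follow the slope between 7000 and 8300, so the result is more sensitive to noise in the outermost points than the regression is.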

extrapolation pandas python scipy