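Problem description
I have a DataFrame holding several years of data, with multiple environmental parameters as columns. A sample frame looks like this: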
import pandas as pd
import numpy as np
from scipy import stats
Parameters= ['Temperature','Rain','Pressure','Humidity']
nrows = 365
daterange = pd.date_range('1/1/2019',periods=nrows,freq='D')
Vals = pd.DataFrame(np.random.randint(10,150,size=(nrows,len(Parameters))),columns=Parameters)
Vals = Vals.set_index(daterange)
print(Vals)
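I created a month-name column with Vals['Month'] = Vals.index.month_name().str.slice(stop=3), and I want to compute, for each month, the slope of the regression between two variables, Rain and Temperature, and collect the results in a DataFrame. I tried the following: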
pd.DataFrame.from_dict({y:stats.linregress(Vals['Temperature'],Vals['Rain'])[:2] for y,x in
Vals.groupby('Month')},'index').\
rename(columns={0:'Slope',1:'Intercept'})
Slope Intercept
Apr -0.016868 81.723291
Aug -0.016868 81.723291
Dec -0.016868 81.723291
Feb -0.016868 81.723291
Jan -0.016868 81.723291
Jul -0.016868 81.723291
Jun -0.016868 81.723291
Mar -0.016868 81.723291
May -0.016868 81.723291
Nov -0.016868 81.723291
Oct -0.016868 81.723291
Sep -0.016868 81.723291
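It looks like the regression is computed on the whole dataset and that single result is stored under every month's index. How can I compute the per-month statistics with a similar approach?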
Solution
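Here is some code I have used in the past. I used LinearRegression from sklearn.linear_model because I find it easy to use, but you can swap in scipy.stats if you prefer. The code uses apply and runs the linear regression inside the function linear_model.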
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
def linear_model(group):
    x, y = group.Temperature.values.reshape(-1, 1), group.Rain.values.reshape(-1, 1)
    model = LinearRegression().fit(x, y)
    m = model.coef_
    i = model.intercept_
    r_sqd = model.score(x, y)
    return pd.Series({'slope': np.squeeze(m), 'intercept': np.squeeze(i), 'r_sqd': np.squeeze(r_sqd)})
Parameters= ['Temperature','Rain','Pressure','Humidity']
nrows = 365
daterange = pd.date_range('1/1/2019',periods=nrows,freq='D')
Vals = pd.DataFrame(np.random.randint(10,150,size=(nrows,len(Parameters))),columns=Parameters)
Vals = Vals.set_index(daterange)
Vals.groupby(Vals.index.month).apply(linear_model)
Result:
Vals.groupby(Vals.index.month).apply(linear_model)
Out[15]:
slope intercept r_sqd
1 -0.06334408633973578 80.98723450432585 0.003480
2 -0.1393001910724248 85.40023995141723 0.020435
3 -0.0535505295232336 69.09958112535743 0.003481
4 0.23187299827488306 57.866651248302546 0.048741
5 -0.04813654915436082 74.31295680099751 0.001867
6 0.31976921541526526 48.496345031992746 0.089027
7 -0.1979417421554613 94.84215558468942 0.052023
8 0.22239030327077666 68.62700822940076 0.061849
9 0.054607306452220644 72.0988798639258 0.002877
10 -0.07841007716276265 91.9211204014171 0.006085
11 -0.13517307855088803 100.44769438307809 0.016045
12 -0.1967407738498068 101.7393002049148 0.042255
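The answer mentions that the sklearn model can be swapped for scipy.stats. A minimal sketch of that variant follows. The helper name linregress_model, the demo frame, and the fixed seed are assumptions made for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd
from scipy import stats


def linregress_model(group):
    # Fit Rain ~ Temperature using only the rows of this group.
    res = stats.linregress(group["Temperature"], group["Rain"])
    return pd.Series({
        "slope": res.slope,
        "intercept": res.intercept,
        "r_sqd": res.rvalue ** 2,  # squared correlation coefficient
    })


# Hypothetical demo data, seeded so that reruns give the same frame.
rng = np.random.default_rng(0)
dates = pd.date_range("2019-01-01", periods=365, freq="D")
demo = pd.DataFrame(
    rng.integers(10, 150, size=(365, 2)),
    columns=["Temperature", "Rain"],
    index=dates,
)

# One row per calendar month (1..12), one column per statistic.
monthly = demo.groupby(demo.index.month)[["Temperature", "Rain"]].apply(linregress_model)
print(monthly)
```

For a simple one-variable fit, linregress returns the slope, intercept, and correlation in a single call, so no reshaping to 2-D arrays is needed.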
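Your attempt was close. When you loop over a groupby object with a for loop, each iteration yields the group's name and the group's data. The usual convention is: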
for name, group in Vals.groupby('Month'):
    pass  # do stuff with group
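Since you named the group name y and the group data x, you can change Vals to x inside the comprehension, and the code will produce the same results as above.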
pd.DataFrame.from_dict({y:stats.linregress(x['Temperature'],x['Rain'])[:2] for y,x in
Vals.groupby('Month')},'index').\
rename(columns={0:'Slope',1:'Intercept'})
Slope Intercept
Apr 0.231873 57.866651
Aug 0.222390 68.627008
Dec -0.196741 101.739300
Feb -0.139300 85.400240
Jan -0.063344 80.987235
Jul -0.197942 94.842156
Jun 0.319769 48.496345
Mar -0.053551 69.099581
May -0.048137 74.312957
Nov -0.135173 100.447694
Oct -0.078410 91.921120
Sep 0.054607 72.098880
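To make the difference between the two comprehensions concrete, here is a small sketch that runs both on the same data. The demo frame, the seed, and the variable names wrong, right, and table are assumptions for illustration. The final reindex step is an optional addition, not part of the original answer: it puts the months in calendar order instead of the alphabetical order that groupby produces for string keys.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical demo data, seeded for repeatability.
rng = np.random.default_rng(1)
dates = pd.date_range("2019-01-01", periods=365, freq="D")
demo = pd.DataFrame(
    rng.integers(10, 150, size=(365, 2)),
    columns=["Temperature", "Rain"],
    index=dates,
)
demo["Month"] = demo.index.month_name().str.slice(stop=3)

# Original attempt: the loop variable `group` is ignored, so every
# iteration fits the regression on the full frame.
wrong = {
    name: stats.linregress(demo["Temperature"], demo["Rain"])[:2]
    for name, group in demo.groupby("Month")
}

# Corrected version: fit on the rows of the current group only.
right = {
    name: stats.linregress(group["Temperature"], group["Rain"])[:2]
    for name, group in demo.groupby("Month")
}

# The buggy dict holds the same (slope, intercept) pair for every month.
print(len(set(wrong.values())))
print(len(set(right.values())))

# Optional: build the result table and sort it in calendar order.
order = pd.date_range("2019-01-01", periods=12, freq="MS").month_name().str.slice(stop=3)
table = pd.DataFrame.from_dict(
    right, orient="index", columns=["Slope", "Intercept"]
).reindex(order)
print(table)
```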