Computing means faster with Pandas groupby + apply while condensing groups

Problem description

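I want to group by two values. If a group contains more than one element, I want to return only the first row of that group, with its value replaced by the group mean. If the group has only one element, I want to return it as is. My code is as follows: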


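def condense(df):
    # Groups with more than one row: keep the first row and overwrite c with the group mean.
    if df.shape[0] > 1:
        mean = df["c"].mean()
        record = df.iloc[[0]].copy()  # copy so the assignment does not target a view of df
        record["c"] = mean
        return record
    # Single-row groups are returned unchanged.
    return df

final = df.groupby(["a", "b"]).apply(condense).drop(["a", "b"], axis=1).reset_index()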

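Because the dataframe is large, I have 73,800 groups, and the whole groupby + apply computation takes about a minute. That is too long. Is there a way to make it run faster?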

Solution

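I think the mean of a single value is just that value, so single-row groups need no special case. You can therefore simplify the solution by using GroupBy.agg with mean for column c and first for all other columns.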
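The approach described above could be sketched roughly as follows. The sample data, the extra column d, and the helper name condense_fast are assumptions made for illustration; only the columns a, b, and c come from the question.

```python
import pandas as pd

# Hypothetical sample data; column "d" stands in for any extra columns in df.
df = pd.DataFrame({
    "a": [1, 1, 2, 2, 3],
    "b": ["x", "x", "y", "z", "z"],
    "c": [1.0, 3.0, 5.0, 7.0, 9.0],
    "d": ["p", "q", "r", "s", "t"],
})

def condense_fast(frame, keys=("a", "b"), mean_col="c"):
    """Collapse each group to one row: mean of mean_col, first of the rest."""
    keys = list(keys)
    # Default every non-key column to "first", then override the mean column.
    how = {col: "first" for col in frame.columns if col not in keys}
    how[mean_col] = "mean"
    # as_index=False keeps the grouping keys as ordinary columns.
    return frame.groupby(keys, as_index=False).agg(how)

final = condense_fast(df)
print(final)
```

Because agg with the built-in "mean" and "first" runs vectorized code instead of calling a Python function once per group, it should scale much better to tens of thousands of groups. One difference to keep in mind: "first" returns the first non-null value in each column, so the result can differ from df.iloc[[0]] when the first row of a group contains missing values.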
