pandas.apply expanding columns raises ValueError: If using all scalar values, you must pass an index

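Problem description

I want to apply a function to a DataFrame that returns several columns for each column of the original dataset. The function passed to apply returns a DataFrame with columns and an index, but the call still raises ValueError: If using all scalar values, you must pass an index.

I tried setting the name of the output DataFrame, setting the columns to a MultiIndex, and setting the index to a MultiIndex, but none of that works.

Example: I have this input DataFrame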

df_all_users = pd.DataFrame(
    [[1,2,3],[1,2,3],[1,2,3]],index=["2020-01-01","2020-01-02","2020-01-03"],columns=["user_1","user_2","user_3"])

          user_1  user_2    user_3
2020-01-01     1       2         3
2020-01-02     1       2         3
2020-01-03     1       2         3

apply_function looks like this:

def apply_function(df):
    df_out = pd.DataFrame(index=df.index)
    # these columns are in reality computed using some other functions
    df_out["column_1"] = df.values  # example: pyod.ocsvm.OCSVM.fit_predict(df.values) 
    df_out["column_2"] = - df.values  # example: pyod.knn.KNN.fit_predict(df.values)
    
    # these are the things I've tried without working
    df_out.name = df.name
    df_out.columns = pd.MultiIndex.from_tuples([(df.name,column) for column in df_out.columns],names=["user","score"])
    df_out.index = pd.MultiIndex.from_tuples([(df.name,idx) for idx in df_out.index],names=["user","date"])
    print(df_out)
    return df_out

df_all_users.apply(apply_function,axis=0,result_type="expand")

which raises the error

ValueError: If using all scalar values, you must pass an index
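
For context, this message is the one the DataFrame constructor emits when it is given a dict whose values are all scalars and no index. The sketch below only reproduces the message in isolation; it does not show the exact path apply takes internally, and the column names and values are placeholders.

```python
import pandas as pd

# A dict of plain scalars gives pandas no way to infer the number of rows.
try:
    pd.DataFrame({"column_1": 1, "column_2": 2})
    message = None
except ValueError as err:
    message = str(err)

print(message)

# Passing an index tells pandas how many rows to build, so this succeeds.
ok = pd.DataFrame({"column_1": 1, "column_2": 2}, index=["2020-01-01"])
print(ok)
```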

The output I expect looks like this:

out_df = pd.DataFrame(
    [[1,1,2,2,3,3],[1,1,2,2,3,3],[1,1,2,2,3,3]],index=["2020-01-01","2020-01-02","2020-01-03"],columns=pd.MultiIndex.from_tuples([(user,column)
                                       for user in ["user_1","user_2","user_3"]
                                       for column in ["column_1","column_2"]],names=("user","score"))
)

             user_1           user_2            user_3
           column_1 column_2 column_1 column_2 column_1 column_2
2020-01-01        1        1        2        2        3        3
2020-01-02        1        1        2        2        3        3
2020-01-03        1        1        2        2        3        3

Solution

OK, the answer is to convert the output of each call into a Series of arrays and then concatenate the results:

import numpy as np
import pandas as pd

df_all_users = pd.DataFrame(
    [[1,2,3],[1,2,3],[1,2,3]],index=["2020-01-01","2020-01-02","2020-01-03"],columns=["user_1","user_2","user_3"])

def apply_function(df):
    df_out = pd.DataFrame(index=df.index)
    df_out["column_1"] = df.values
    df_out["column_2"] = df.values

    # pack each row of scores into one array so that a single Series can be returned
    df_out = pd.Series([values for values in df_out.values],index=df.index)
    df_out.name = df.name
    return df_out

# note: groupby(...,axis=1) is deprecated in pandas 2.1 and later
df_out = df_all_users.groupby(level=0,axis=1).apply(apply_function)
# unpack the per-user arrays row by row; the labels run user first, then column,
# to match the order of the concatenated values
df_out = pd.DataFrame([np.concatenate(values,axis=0) for values in df_out.values],index=df_out.index,columns=pd.MultiIndex.from_tuples([(user,column)
                                                         for user in df_out.columns
                                                         for column in ["column_1","column_2"]
                                                        ],names=["user","algorithm"]))
df_out



user          user_1                  user_2                  user_3
algorithm   column_1    column_2    column_1    column_2    column_1    column_2
2020-01-01         1           1           2           2           3           3
2020-01-02         1           1           2           2           3           3
2020-01-03         1           1           2           2           3           3
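
As an alternative sketch, not part of the answer above, the same layout can usually be built without packing arrays into a Series: call the function once per column and let pd.concat stack the per-user frames side by side, using the dict keys as the outer column level. Here score_user is a hypothetical stand-in for the real scoring functions, and the level names "user" and "score" are taken from the expected output in the question.

```python
import pandas as pd

df_all_users = pd.DataFrame(
    [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
    index=["2020-01-01", "2020-01-02", "2020-01-03"],
    columns=["user_1", "user_2", "user_3"],
)

def score_user(series):
    # Hypothetical placeholder: in practice each column would come from
    # a separate model or scoring function applied to series.values.
    return pd.DataFrame(
        {"column_1": series.values, "column_2": series.values},
        index=series.index,
    )

# The dict keys become the outer level of the column MultiIndex;
# the columns of each per-user frame become the inner level.
out = pd.concat(
    {user: score_user(df_all_users[user]) for user in df_all_users.columns},
    axis=1,
    names=["user", "score"],
)
print(out)
```

This keeps every per-user result as an ordinary DataFrame, so there are no column labels to keep in step with the order of concatenated values by hand.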