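Problem description
I want to apply a function to a DataFrame, where the function returns several columns for each column of the original dataset. The applied function returns a DataFrame with both columns and an index, but apply still raises ValueError: If using all scalar values, you must pass an index.
I tried setting the name of the output DataFrame, making its columns a MultiIndex, and making its index a MultiIndex, but none of that worked.
Example: I have this input DataFrame: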
df_all_users = pd.DataFrame(
    [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
    index=["2020-01-01", "2020-01-02", "2020-01-03"],
    columns=["user_1", "user_2", "user_3"],
)
            user_1  user_2  user_3
2020-01-01       1       2       3
2020-01-02       1       2       3
2020-01-03       1       2       3
apply_function looks like this:
def apply_function(df):
    df_out = pd.DataFrame(index=df.index)
    # in reality these columns are computed using some other functions
    df_out["column_1"] = df.values    # example: pyod.ocsvm.OCSVM.fit_predict(df.values)
    df_out["column_2"] = -df.values   # example: pyod.knn.KNN.fit_predict(df.values)
    # these are the things I've tried, without success
    df_out.name = df.name
    df_out.columns = pd.MultiIndex.from_tuples([(df.name, column) for column in df_out.columns], names=["user", "score"])
    df_out.index = pd.MultiIndex.from_tuples([(df.name, idx) for idx in df_out.index], names=["user", "date"])
    print(df_out)
    return df_out

df_all_users.apply(apply_function, axis=0, result_type="expand")
This raises the error:
ValueError: If using all scalar values, you must pass an index
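For context, this is the message pandas raises when a DataFrame is built from a dict whose values are all scalars and no index is supplied. The sketch below only reproduces the message in isolation. That apply ends up in a comparable situation while assembling the returned frames is an assumption, not something the traceback above confirms. The names scalars_only, message and df_ok are illustrative.

```python
import pandas as pd

# A dict of plain scalars gives pandas no way to infer the row labels.
scalars_only = {"user_1": 1, "user_2": 2}

try:
    pd.DataFrame(scalars_only)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)

# Supplying an index explicitly makes the same data valid.
df_ok = pd.DataFrame(scalars_only, index=["2020-01-01"])
print(df_ok)
```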
The output I expect is this:
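out_df = pd.DataFrame(
    [[1, 1, 2, 2, 3, 3], [1, 1, 2, 2, 3, 3], [1, 1, 2, 2, 3, 3]],
    index=["2020-01-01", "2020-01-02", "2020-01-03"],
    columns=pd.MultiIndex.from_tuples(
        [(user, column)
         for user in ["user_1", "user_2", "user_3"]
         for column in ["column_1", "column_2"]],
        names=("user", "score"),
    ),
)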
             user_1            user_2            user_3
           column_1 column_2 column_1 column_2 column_1 column_2
2020-01-01        1        1        2        2        3        3
2020-01-02        1        1        2        2        3        3
2020-01-03        1        1        2        2        3        3
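The two-level column layout in this expected output can also be built with pd.MultiIndex.from_product, which varies the last list fastest and therefore gives the user-first order shown in the table. A small sketch; the variable names users, scores and expected_columns are illustrative.

```python
import pandas as pd

users = ["user_1", "user_2", "user_3"]
scores = ["column_1", "column_2"]

# Cartesian product of the two lists, outer level first.
expected_columns = pd.MultiIndex.from_product([users, scores], names=["user", "score"])

print(list(expected_columns))
```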
Solution
OK, the answer is to convert the output of the function into a Series of arrays and then concatenate the results:
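import numpy as np
import pandas as pd

df_all_users = pd.DataFrame(
    [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
    index=["2020-01-01", "2020-01-02", "2020-01-03"], columns=["user_1", "user_2", "user_3"])

def apply_function(df):
    df_out = pd.DataFrame(index=df.index)
    df_out["column_1"] = df.values
    df_out["column_2"] = df.values
    # pack each row's scores into one array so the function returns a single Series
    df_out = pd.Series([values for values in df_out.values], index=df.index)
    df_out.name = df.name
    return df_out

# note: groupby(..., axis=1) is deprecated in recent pandas releases
df_out = df_all_users.groupby(level=0, axis=1).apply(apply_function)
# unpack the per-user arrays into flat rows; the column tuples are generated
# user-first so they match the order in which the arrays are concatenated
df_out = pd.DataFrame(
    [np.concatenate(values, axis=0) for values in df_out.values],
    index=df_out.index,
    columns=pd.MultiIndex.from_tuples(
        [(user, column)
         for user in df_out.columns
         for column in ["column_1", "column_2"]],
        names=["user", "algorithm"]))
df_out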
user         user_1            user_2            user_3
algorithm  column_1 column_2 column_1 column_2 column_1 column_2
2020-01-01        1        1        2        2        3        3
2020-01-02        1        1        2        2        3        3
2020-01-03        1        1        2        2        3        3
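As an alternative sketch (not the approach used in the answer above), the same layout can be produced without packing arrays into cells: build one small DataFrame per user and join them with pd.concat, whose dict keys become the outer column level. The helper name score_user is hypothetical, and its two columns simply copy the input as in the example; real scoring functions would go there.

```python
import pandas as pd

df_all_users = pd.DataFrame(
    [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
    index=["2020-01-01", "2020-01-02", "2020-01-03"],
    columns=["user_1", "user_2", "user_3"],
)


def score_user(series):
    """Return one column per score for a single user's Series (placeholder scores)."""
    values = series.to_numpy()
    return pd.DataFrame({"column_1": values, "column_2": values}, index=series.index)


# One frame per user; the dict keys become the outer level of the columns.
per_user = {user: score_user(df_all_users[user]) for user in df_all_users.columns}
result = pd.concat(per_user, axis=1)

# Name the two column levels after the fact.
result.columns = result.columns.set_names(["user", "score"])

print(result)
```

Because each user's frame is kept intact until the concat step, the column order follows the data and does not have to be rebuilt by hand.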