Problem description
df.selection
(...)
1454 5
1458 6
1473 4
1474 4
1487 4
1491 3
1500 6
Name: selection,Length: 117,dtype: int64
and
df.value_lsts
(...)
1454 [8.4,16.0,7.4,3.96,17.5,2.6]
1458 [8.85,3.25,5.3,4.95,8.14,11.0]
1473 [9.8,5.28,11.67,15.15,4.47,3.06]
1474 [5.5,2.19,7.7,11.98,28.0,8.54]
1487 [26.6,9.74,7.71,6.46,2.28,7.58]
1491 [6.4,3.1,19.92,4.2,6.37,11.05]
1500 [3.0,22.91,8.61,13.58,3.69]
Name: value_lsts,dtype: object
which is a column of lists.
I need to create another column whose values are:
value_lsts[df.selection - 1]
For example, for row 1500 we have
df.value_lsts
1500 [3.0,3.69]
df.selection
1500 6
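so the returned value would be 3.69
I have tried everything but cannot come up with a solution. What is the pythonic way to access the correct index through the df.selection column?
Thank you very much. Piero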
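Solution
Note that putting mutable objects inside a DataFrame can be an antipattern.
If you are sure about what you are trying to achieve and you really need a column of lists, you can solve the problem like this: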
- Using the apply method:
df["new_column"] = df.apply(lambda row: row.value_lsts[row.selection - 1], axis=1)
- Using a list comprehension:
df["new_column"] = [x[y - 1] for x, y in zip(df['value_lsts'], df['selection'])]
- Using a vectorized function:
def get_by_index(value_lsts, selection):  # you may use lambda here as well
    return value_lsts[selection - 1]
df["new_column"] = np.vectorize(get_by_index)(df['value_lsts'], df['selection'])
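To make the three options above easy to try, here is a minimal self-contained sketch. It assumes a small DataFrame built from three rows copied from the question (1454, 1458 and 1473); the variable names via_apply, via_listcomp and via_vectorize are only illustrative and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Three rows copied from the question, keeping the original index labels.
df = pd.DataFrame(
    {
        "selection": [5, 6, 4],
        "value_lsts": [
            [8.4, 16.0, 7.4, 3.96, 17.5, 2.6],
            [8.85, 3.25, 5.3, 4.95, 8.14, 11.0],
            [9.8, 5.28, 11.67, 15.15, 4.47, 3.06],
        ],
    },
    index=[1454, 1458, 1473],
)


def get_by_index(value_lsts, selection):
    # selection is 1-based, Python lists are 0-based
    return value_lsts[selection - 1]


# 1. row-wise apply
via_apply = df.apply(lambda row: row.value_lsts[row.selection - 1], axis=1)

# 2. list comprehension over the two columns
via_listcomp = [x[y - 1] for x, y in zip(df["value_lsts"], df["selection"])]

# 3. np.vectorize (a convenience wrapper that still loops in Python)
via_vectorize = np.vectorize(get_by_index)(df["value_lsts"], df["selection"])

df["new_column"] = via_listcomp
print(df["new_column"].tolist())  # [17.5, 11.0, 15.15]
```

All three produce the same values here. Each of them raises an IndexError if a selection is larger than the length of the corresponding list, so the input may need to be validated first.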
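I think the choice between these options is a trade-off between readability and performance.
Let's compare the performance of the algorithms
Create a bigger dataframe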
df_1 = df.sample(100000,replace=True).reset_index(drop=True)
Timings
# 1. apply
%timeit df_1["new_column"] = df_1.apply(lambda row: row.value_lsts[row.selection - 1], axis=1)
2.77 s ± 94.7 ms per loop (mean ± std. dev. of 7 runs,1 loop each)
# 2. list comprehension:
%timeit df_1["new_column"] = [x[y-1] for x,y in zip(df_1['value_lsts'],df_1['selection'])]
33.9 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs,10 loops each)
# 3. vectorized function:
%timeit df_1["new_column"] = np.vectorize(get_by_index) (df_1['value_lsts'],df_1['selection'])
12 ms ± 302 µs per loop (mean ± std. dev. of 7 runs,100 loops each)
# 4. solution proposed by @anky using lookup
%%timeit
u = pd.DataFrame(df_1['value_lsts'].tolist(),index=df_1.index) #helper dataframe
df_1['selected_value'] = u.lookup(u.index,df_1['selection']-1)
51.9 ms ± 865 µs per loop (mean ± std. dev. of 7 runs,10 loops each)
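%timeit and %%timeit are IPython magics, so the comparison above only runs in IPython or a notebook. As a rough sketch, the first three options could be timed in a plain script with the standard-library timeit module. The data here is randomly generated, and the sizes (n_rows, list_len) and the names candidates and timings are assumptions made for illustration, so the absolute numbers will differ from those reported above.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, list_len = 5_000, 6  # hypothetical sizes, smaller than the 100,000 rows used above

# Synthetic stand-in for df_1: every list has list_len items, selection is 1-based.
df_1 = pd.DataFrame(
    {
        "selection": rng.integers(1, list_len + 1, size=n_rows),
        "value_lsts": rng.random((n_rows, list_len)).round(2).tolist(),
    }
)


def get_by_index(value_lsts, selection):
    return value_lsts[selection - 1]


candidates = {
    "apply": lambda: df_1.apply(
        lambda row: row.value_lsts[row.selection - 1], axis=1
    ),
    "list comprehension": lambda: [
        x[y - 1] for x, y in zip(df_1["value_lsts"], df_1["selection"])
    ],
    "np.vectorize": lambda: np.vectorize(get_by_index)(
        df_1["value_lsts"], df_1["selection"]
    ),
}

timings = {}
for name, func in candidates.items():
    # best of 3 single runs, in seconds
    timings[name] = min(timeit.repeat(func, number=1, repeat=3))
    print(f"{name}: {timings[name] * 1e3:.1f} ms")
```

Taking the minimum of a few repeats is a common way to reduce noise from other processes. The relative ordering is more meaningful than the absolute values.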
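If you are not sure whether you really need a column of lists, you can read about the proper way for splitting column of lists to multiple columns.
You can also use df.lookup here, after converting the series of lists to a dataframe
(note that Python indexing starts at 0, so selection-1 should be used, depending on your logic):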
u = pd.DataFrame(df['value_list'].tolist(),index=df.index) #helper dataframe
df['selected_value'] = u.lookup(u.index,df['selection']-1)
print(df)
selection value_list selected_value
1454 5 [8.4,16.0,7.4,3.96,17.5,2.6] 17.50
1458 6 [8.85,3.25,5.3,4.95,8.14,11.0] 11.00
1473 4 [9.8,5.28,11.67,15.15,4.47,3.06] 15.15
1474 4 [5.5,2.19,7.7,11.98,28.0,8.54] 11.98
1487 4 [26.6,9.74,7.71,6.46,2.28,7.58] 6.46
1491 3 [6.4,3.1,19.92,4.2,6.37,11.05] 19.92
1500 6 [3.0,22.91,8.61,13.58,3.69] 3.69
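DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so the snippet above fails on current versions. A possible replacement, sketched here and not taken from the original answer, is to index the helper dataframe's underlying NumPy array by position. The example assumes three rows copied from the output above and reuses the column names value_list and selected_value; rows and cols are illustrative names.

```python
import numpy as np
import pandas as pd

# Three rows copied from the output above, keeping the original index labels.
df = pd.DataFrame(
    {
        "selection": [5, 6, 4],
        "value_list": [
            [8.4, 16.0, 7.4, 3.96, 17.5, 2.6],
            [8.85, 3.25, 5.3, 4.95, 8.14, 11.0],
            [9.8, 5.28, 11.67, 15.15, 4.47, 3.06],
        ],
    },
    index=[1454, 1458, 1473],
)

# Helper dataframe: one column per list position
# (pandas pads shorter lists with NaN).
u = pd.DataFrame(df["value_list"].tolist(), index=df.index)

# Positional replacement for u.lookup(u.index, df['selection'] - 1):
# pick, for each row i, the element in column selection[i] - 1.
rows = np.arange(len(u))
cols = df["selection"].to_numpy() - 1
df["selected_value"] = u.to_numpy()[rows, cols]

print(df["selected_value"].tolist())  # [17.5, 11.0, 15.15]
```

Because shorter lists are padded with NaN, a selection that points past the end of a short list yields NaN here instead of raising an error, provided it is still within the longest list's length.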