Problem description
import pandas as pd
df = pd.DataFrame({'person_id': [11,11,11],'Age':[23,25,28],'Summary':['Test','Test1','Test2']})
df1 = pd.DataFrame({'person_id': [21,22,51],'Age':[26,29,22],'Order Summary':['Tep','Tst1','Tt2'],'Order Summary2':['ppp','Ttt','Tfsa']})
df2 = pd.DataFrame({'person_id': [31,31,41],'Age':[27,20,21],'Order Summary':['Tet','Tt1','Tt2'],'Order Summary1':['Tet','Tt1','Tt2']})
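However, I only want to read two columns from each file, and their names differ slightly between files. I want to build a final dataframe from two columns: person_id and Summary (which is called Order Summary in the other csv files).
I don't want to read the Age, Order Summary1 or Order Summary2 columns from the other csv files.
In short, when building the final dataframe I want to use regex/pattern matching so that only the Summary|Order Summary columns are read.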
I am trying the following, based on an SO post:
col_list = ["person_id","Summary"] # but here I don't know how to use regex
files = glob.glob("file*.csv")
dfs = [pd.read_csv(f,usecols = col_list,header=None,sep=";") for f in files]
meddata = pd.concat(dfs,ignore_index=True)
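Can you help me select columns with a regex while reading the CSV files?
I expect my final dataframe to have the columns shown below (only these 2 required columns are selected from each csv file and concatenated):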
Person_id Summary
11 Test
11 Test1
11 Test2
21 Tep
22 Tst1
51 Tt2
31 Tet
31 Tt1
41 Tt2
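Solution
Read everything. You can filter afterwards by checking whether the columns of each DF are in a list of permitted columns, for example: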
permitted = ['person_id','Summary','Order Summary']
df.loc[:,df.columns.isin(permitted)]
# person_id Summary
#0 11 Test
#1 11 Test1
#2 11 Test2
A simpler approach is to first rename the Order Summary column and then select only the 2 expected columns:
dfs = [pd.read_csv(f,sep=";").rename(columns={'Order Summary':'Summary'})[['person_id','Summary']]
for f in files]
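The rename-then-select approach above can be sketched end to end as follows. The in-memory CSV strings are hypothetical stand-ins for the file*.csv files, assuming they are semicolon-separated and have a header row; only the read_csv, rename and concat calls come from the answer.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two of the file*.csv files (semicolon-separated, with header).
csv_texts = [
    "person_id;Age;Summary\n11;23;Test\n11;25;Test1\n11;28;Test2\n",
    "person_id;Age;Order Summary;Order Summary2\n"
    "21;26;Tep;ppp\n22;29;Tst1;Ttt\n51;22;Tt2;Tfsa\n",
]

# rename() ignores keys that are absent, so it is safe for files that already use 'Summary'.
dfs = [
    pd.read_csv(io.StringIO(text), sep=";")
      .rename(columns={"Order Summary": "Summary"})[["person_id", "Summary"]]
    for text in csv_texts
]
meddata = pd.concat(dfs, ignore_index=True)
print(meddata)
```

Because every frame ends up with the same two column names, concat stacks them without introducing NaN-filled extra columns.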
Old answer:
Use DataFrame.filter with a regex that matches person_id and any column name ending with the string Summary:
print (df.filter(regex='person_id|Summary$'))
person_id Summary
0 11 Test
1 11 Test1
2 11 Test2
print (df1.filter(regex='person_id|Summary$'))
person_id Order Summary
0 21 Tep
1 22 Tst1
2 51 Tt2
print (df2.filter(regex='person_id|Summary$'))
person_id Order Summary
0 31 Tet
1 31 Tt1
2 41 Tt2
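To see why the trailing $ matters, here is a small sketch that applies the same pattern with re.search (which is how DataFrame.filter matches labels) to the column names from the question. The list of names is assembled here purely for illustration.

```python
import re

pattern = re.compile(r"person_id|Summary$")

# All column names that appear across the three example frames.
columns = ["person_id", "Age", "Summary", "Order Summary",
           "Order Summary1", "Order Summary2"]

# '$' anchors 'Summary' to the end of the name, so the numbered variants are dropped.
kept = [c for c in columns if pattern.search(c)]
print(kept)  # ['person_id', 'Summary', 'Order Summary']
```

Without the $, the pattern would also match Order Summary1 and Order Summary2.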
Another idea is to use Index.intersection with a list of possible values:
print (df[df.columns.intersection(['person_id','Summary','Order Summary'])])
print (df1[df1.columns.intersection(['person_id','Summary','Order Summary'])])
print (df2[df2.columns.intersection(['person_id','Summary','Order Summary'])])
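As a rough illustration of what Index.intersection buys you, the sketch below compares it with plain list indexing on a frame that lacks one of the candidate names. The candidates list is an assumption made for this example; it contains both spellings of the summary column.

```python
import pandas as pd

df1 = pd.DataFrame({
    "person_id": [21, 22, 51],
    "Age": [26, 29, 22],
    "Order Summary": ["Tep", "Tst1", "Tt2"],
    "Order Summary2": ["ppp", "Ttt", "Tfsa"],
})
candidates = ["person_id", "Summary", "Order Summary"]

# Plain list indexing fails because 'Summary' is not a column of df1.
try:
    df1[candidates]
    raised = False
except KeyError:
    raised = True

# intersection() keeps only the candidate names that actually exist in this frame.
cols = df1.columns.intersection(candidates)
subset = df1[cols]
print(raised, list(subset.columns))
```

The candidate list has to include every spelling you expect, otherwise the matching column is silently dropped rather than reported.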
So in your solution, also add rename so that the output is a 2-column DataFrame:
dfs = [pd.read_csv(f,sep=";").filter(regex='person_id|Summary$').rename(columns={'Order Summary':'Summary'})
for f in files]
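For the second approach you need:
dfs = []
for f in files:
    df = pd.read_csv(f,sep=";")
    # keep whichever of the candidate columns exist, then unify the name
    df1 = df[df.columns.intersection(['person_id','Summary','Order Summary'])].rename(columns={'Order Summary':'Summary'})
    dfs.append(df1)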
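If the goal is to skip the unwanted columns at read time rather than afterwards, one possible variation (not part of the answers above) is to pass a callable to the usecols parameter of read_csv, which pandas evaluates against each column name. The sketch below assumes a header row and a semicolon separator; the CSV text is a hypothetical stand-in for one file.

```python
import io
import re
import pandas as pd

pattern = re.compile(r"person_id|Summary$")

# Hypothetical stand-in for one of the file*.csv files.
text = (
    "person_id;Age;Order Summary;Order Summary2\n"
    "21;26;Tep;ppp\n22;29;Tst1;Ttt\n51;22;Tt2;Tfsa\n"
)

# usecols accepts a callable; only columns for which it returns True are parsed.
df_sel = pd.read_csv(
    io.StringIO(text),
    sep=";",
    usecols=lambda name: bool(pattern.search(name)),
).rename(columns={"Order Summary": "Summary"})
print(df_sel)
```

This only works when the header row is read, since the callable is matched against the column names taken from it.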