如何从文件名中提取元素并将其移动到不同的列?

问题描述

我有一个文件名,我将其转换为列表。该列表包含以下元素:

list = ['15253_Variation.JPG','15253_Variation_Tainted.JPG','15253_Variation_O2_Saxophone.PNG','15253_Variation_O2_Saxophone.jpg','15253_Variation_O2_Saxophone_reference.png','15253_Variation_Side1.JPG','15253_Variation_Side2.JPG']

我的目标是从此列表中提取元素并填写一个数据框,该数据框应如下所示:

enter image description here

链接到包含上述图像的Google表格:https://docs.google.com/spreadsheets/d/1kuX3M4RFCNWtNoE7Hm1ejxWMwF-Cs4p8SsjA3JzdidA/edit?usp=sharing

我完成的操作如下代码

Obj = pd.DataFrame(data = list,index = None,columns = ['file'])
new_list = []
for i in Obj['file']:
       new_list.append(i.split('_'))

但是,这个人不会留下任何空白,因此无法满足我的需求。

非常感谢您。

解决方法

根据评论数。这很痛苦,因为文件名中的标记不是完全固定的格式。很多条件逻辑

  1. 已经定义了两个附加列表mi工具和oxygen,无论它是什么。
  2. 第一遍是建立dict这是熊猫的标准格式
  3. 然后在具有基本数据帧之后通过条件逻辑进行工作
# don't name it list - it override python list()!
l = ['15253_Variation.JPG','15253_Variation_Tainted.JPG','15253_Variation_O2_Saxophone.PNG','15253_Variation_O2_Saxophone.jpg','15253_Variation_O2_Saxophone_reference.png','15253_Variation_Side1.JPG','15253_Variation_Side2.JPG']

issues = ["Tainted","Perfect"]
mi = ["Saxophone"]
oxygen = ["O2"]

# first pass using dict/list comprehensions
df = pd.DataFrame({"filename":{i:f.split(".")[0] for i,f in enumerate(l)},"Number":{i:f.split("_")[0] for i,"Name":{i:f.split("_")[1].split(".")[0] for i,"Location2":{},"Issues":{},"Oxygen":{},"Location":{},"Musical Instrument":{},"Ref":{},"Extension":{i:f.split(".")[1] for i,f in enumerate(l)}})

df = df.assign(**{
    # list of tokens for checking fixed lists against
    "Tokens":lambda dfa: dfa.apply(lambda s: s["filename"].split("_")[2:],axis=1),"Issues":lambda dfa: dfa["Tokens"].apply(lambda s: s[np.where(np.isin(s,issues))[0][0]] 
                                           if np.isin(s,issues).any() else np.nan),"Musical Instrument":lambda dfa: dfa["Tokens"].apply(lambda s: s[np.where(np.isin(s,mi))[0][0]] 
                                           if np.isin(s,mi).any() else np.nan),"Oxygen":lambda dfa: dfa["Tokens"].apply(lambda s: s[np.where(np.isin(s,oxygen))[0][0]] 
                                           if np.isin(s,oxygen).any() else np.nan),}).assign(**{
    # let's do tokens again minus ones already placed
    "Tokens":lambda dfa: dfa.apply(lambda s: [t for t in s["filename"].split("_")[2:] 
                                             if not(t==s["Issues"] 
                                                    or t==s["Musical Instrument"]
                                                   or t==s["Oxygen"])],"Location2":lambda dfa: dfa.apply(lambda s: s["Tokens"][0] if len(s["Tokens"])>0 
                                      and "Side" in s["Tokens"][0] else np.nan,"Ref":lambda dfa: dfa.apply(lambda s: s["Tokens"][0] if len(s["Tokens"])>0 
                                      and "Side" not in s["Tokens"][0] else np.nan,axis=1)

}).drop(columns=["Tokens","filename"])

print(df.to_string(index=False))

输出

Number       Name Location2   Issues Oxygen  Location Musical Instrument        Ref Extension
 15253  Variation       NaN      NaN    NaN       NaN                NaN        NaN       JPG
 15253  Variation       NaN  Tainted    NaN       NaN                NaN        NaN       JPG
 15253  Variation       NaN      NaN     O2       NaN          Saxophone        NaN       PNG
 15253  Variation       NaN      NaN     O2       NaN          Saxophone        NaN       jpg
 15253  Variation       NaN      NaN     O2       NaN          Saxophone  reference       png
 15253  Variation     Side1      NaN    NaN       NaN                NaN        NaN       JPG
 15253  Variation     Side2      NaN    NaN       NaN                NaN        NaN       JPG