Find the next row after a matching slice of a DataFrame

Question

I have a DataFrame and a list of names, as shown below.

['Amelia','Elijah','Amelia']

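When a run of consecutive rows in the DataFrame matches the given names (the list is a fixed, ordered sequence), I want to know who comes next. In this example the answer is 1990-09-01 00:00:00 James.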

import pandas as pd
from io import StringIO

to_find_list = ['Amelia','Elijah','Amelia']

short_frame = 3

csvfile = StringIO(
"""Date,Staff
1990-05-01 00:00:00,Mason
1990-06-01 00:00:00,Amelia
1990-07-01 00:00:00,Elijah
1990-08-01 00:00:00,Amelia
1990-09-01 00:00:00,James
1990-10-01 00:00:00,Benjamin
1990-11-01 00:00:00,Isabella
1990-12-01 00:00:00,Lucas
1991-01-01 00:00:00,Mason""")

# comma-separated so the space inside each Date value is not read as a delimiter
df = pd.read_csv(csvfile, sep=',')

# split the df into small frames with overlaps
list_of_dfs = [df.loc[i:i + short_frame-1,:].reset_index(drop=True) for i in range(0,len(df),short_frame - 2) if i < len(df) - 2]          

for son_df in list_of_dfs:

    first_cell = son_df.iloc[0]['Date']
    last_cell = son_df.iloc[-1]['Date']

    if son_df['Staff'].to_list() == to_find_list:
        found_date = son_df['Date'].iloc[-1]                # 1990-08-01 00:00:00 
        who = df['Staff'].loc[df['Date'] == found_date]     # Amelia

I tried using shift() to print the date and staff member that follow the last "Amelia", but it did not work.

How can I achieve this? Thanks.

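Answers

You can try str.extract() to get the index of the rows whose value appears in the list. Note that this matches each name individually; it does not check that the names occur in the listed order: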

idx=df['Staff'].str.extract(f'({"|".join(to_find_list)})',expand=False).dropna().index

Then shift that index forward and pass it to loc:

out=df.loc[[x+3 for x in idx if x+3 < len(df)]]
             #^
        # use 1 instead of 3 to get the row immediately after each match

Output of out:

    Date                    Staff
4   1990-09-01 00:00:00     James
5   1990-10-01 00:00:00     Benjamin
6   1990-11-01 00:00:00     Isabella

Or, to get only the Staff column:

out=df.loc[[x+3 for x in idx if x+3 < len(df)],'Staff']
             #^
        # use 1 instead of 3 to get the row immediately after each match

Output of out:

4       James
5    Benjamin
6    Isabella
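
To make clear what this approach selects, here is a minimal sketch of the same idea with isin() swapped in for str.extract(). This is an illustrative substitution, not the answer's code. The DataFrame is rebuilt inline from the question's sample, and the names offset, targets, and next_staff are hypothetical. Each name is matched on its own, so the order in to_find_list is not checked.

```python
import pandas as pd

# Sample data rebuilt from the question (dates as month starts).
df = pd.DataFrame({
    "Date": pd.date_range("1990-05-01", periods=9, freq="MS"),
    "Staff": ["Mason", "Amelia", "Elijah", "Amelia", "James",
              "Benjamin", "Isabella", "Lucas", "Mason"],
})
to_find_list = ["Amelia", "Elijah", "Amelia"]

# Index labels of rows whose Staff value is any of the listed names.
idx = df.index[df["Staff"].isin(to_find_list)]

# Step forward by the length of the list, dropping labels past the end.
offset = len(to_find_list)
targets = [i + offset for i in idx if i + offset < len(df)]
next_staff = df.loc[targets, "Staff"].tolist()
print(next_staff)
```

Because three separate rows match, three rows come back. If only the row after the full ordered sequence is wanted, a sequence-aware comparison is needed instead.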

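Alternatively, you can use the shift() method to create new columns that hold the values from the previous rows, then use a list comprehension to compare to_find_list with those shifted values for each row.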

>>> df['Staff_prev'] = df['Staff'].shift(1)
>>> df['Staff_prev2'] = df['Staff'].shift(2)
>>> df['Staff_prev3'] = df['Staff'].shift(3)
>>> # oldest value first, so the comparison follows the order of to_find_list
>>> df['my_row'] = [ to_find_list == [ row['Staff_prev3'],row['Staff_prev2'],row['Staff_prev'] ] for index,row in df.iterrows()  ]
>>> df.head()
                  Date   Staff Staff_prev Staff_prev2 Staff_prev3  my_row
0  1990-05-01 00:00:00   Mason        NaN         NaN         NaN   False
1  1990-06-01 00:00:00  Amelia      Mason         NaN         NaN   False
2  1990-07-01 00:00:00  Elijah     Amelia       Mason         NaN   False
3  1990-08-01 00:00:00  Amelia     Elijah      Amelia       Mason   False
4  1990-09-01 00:00:00   James     Amelia      Elijah      Amelia    True

>>> df.loc[df['my_row'], 'Date'].tolist()
['1990-09-01 00:00:00']
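
The shift idea can be generalized to a pattern of any length without iterrows(). The sketch below is one possible way to do that, not the answer's code; the helper name rows_after_sequence and its parameters are hypothetical, and the sample frame is rebuilt inline from the question.

```python
import pandas as pd


def rows_after_sequence(df, column, pattern):
    """Return the rows that directly follow each occurrence of `pattern`."""
    n = len(pattern)
    mask = pd.Series(True, index=df.index)
    for offset, name in enumerate(pattern):
        # Seen from the row after a match, pattern[0] is n rows back
        # and pattern[-1] is one row back.
        mask &= df[column].shift(n - offset).eq(name)
    return df[mask]


df = pd.DataFrame({
    "Date": pd.date_range("1990-05-01", periods=9, freq="MS"),
    "Staff": ["Mason", "Amelia", "Elijah", "Amelia", "James",
              "Benjamin", "Isabella", "Lucas", "Mason"],
})

result = rows_after_sequence(df, "Staff", ["Amelia", "Elijah", "Amelia"])
print(result)
```

Rows near the top of the frame compare against NaN after shifting, so they can never be flagged as a match.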

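Another option is to compare shifted copies of the column and keep the rows where every comparison holds:

# reverse the list so shift(0) is compared with the last name in the sequence
m = pd.concat([df['Staff'].shift(x)==y for x,y in zip(range(3),['Amelia','Elijah','Amelia'][::-1])],axis=1).all(axis=1)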
idx = m.index[m]+1
idx
Int64Index([4],dtype='int64')
df.loc[idx]
                      Date  Staff
4      1990-09-01 00:00:00  James
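
The overlapping-window idea from the question can also be finished directly with plain list slices. This is a minimal sketch under the assumption that the column fits comfortably in memory as a Python list; the helper name positions_after is hypothetical, and the sample frame is rebuilt inline from the question.

```python
import pandas as pd


def positions_after(values, pattern):
    """Return the position just past each window of `values` equal to `pattern`."""
    n = len(pattern)
    # Stop early enough that position i + n always exists.
    return [i + n for i in range(len(values) - n) if values[i:i + n] == pattern]


df = pd.DataFrame({
    "Date": pd.date_range("1990-05-01", periods=9, freq="MS"),
    "Staff": ["Mason", "Amelia", "Elijah", "Amelia", "James",
              "Benjamin", "Isabella", "Lucas", "Mason"],
})
to_find_list = ["Amelia", "Elijah", "Amelia"]

positions = positions_after(df["Staff"].tolist(), to_find_list)
next_rows = df.iloc[positions]
print(next_rows)
```

Using iloc with positions keeps this working even when the DataFrame index is not the default 0..n-1 range.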
