Problem description
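I need to:
- group the rows by identical "CODE",
- check whether "DESC" differs,
- check whether "TYPE" is the same,
- compute the difference in months between the dates of the rows that satisfy the two conditions above.
The expected output is as follows: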
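Solution
The code below uses .duplicated() and .drop_duplicates() to keep or discard rows of the DataFrame that have repeated values.
How do you want to calculate a difference in months? A month can have anywhere from 28 to 31 days. You could divide the final result by 30 to get a rough indication of the number of months, so for now I have left the difference in days.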
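As a minimal sketch of how the two calls behave with keep=False, here is a toy frame (hypothetical data, not the data from the question). With keep=False, .duplicated() flags every member of a repeated group rather than all but one, and .drop_duplicates() removes every member.

```python
import pandas as pd

# Hypothetical toy data, only for illustrating keep=False
toy = pd.DataFrame({
    'CODE': ['A', 'A', 'B', 'C', 'C'],
    'DESC': ['KO', 'OK', 'KO', 'OK', 'OK'],
})

# Keep every row whose CODE occurs more than once (B is dropped)
in_group = toy[toy.duplicated(subset=['CODE'], keep=False)]

# Drop every row whose (CODE, DESC) pair repeats (both C rows are dropped)
distinct_desc = in_group.drop_duplicates(subset=['CODE', 'DESC'], keep=False)

print(distinct_desc)
```

Only the two "A" rows survive: they share a CODE and have different DESC values.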
import pandas as pd
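# NOTE: the lists in the source were truncated to unequal lengths, which makes
# pd.DataFrame raise a ValueError. The DATE list is kept as given. CODE is
# restored from the dates and the output shown below; the TYPE and DESC values
# are an assumed reconstruction chosen so the code reproduces that output.
data = {
    'CODE': ['BBLGLC70M', 'BBLGLC70M', 'ZZTNRD77', 'ZZTNRD77', 'AACCBD', 'AACCBD', 'BCCDN', 'BCCDN', 'BCCDN'],
    'DATE': ['16/05/2019', '25/09/2019', '16/03/2020', '27/02/2020', '16/07/2020', '21/07/2020', '13/02/2020', '23/07/2020', '27/02/2020'],
    'TYPE': ['PRI', 'PRI', 'PUB', 'PUB', 'PUB', 'PUB', 'PUB', 'PRI', 'PUB'],  # assumed
    'DESC': ['KO', 'OK', 'KO', 'KO', 'KO', 'OK', 'KO', 'OK', 'OK'],  # assumed
}
df = pd.DataFrame(data)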
df['DATE'] = pd.to_datetime(df['DATE'],format = '%d/%m/%Y')
# only keep rows that have the same code and type
df = df[df.duplicated(subset=['CODE','TYPE'],keep=False)]
# throw out rows that have the same code and desc
df = df.drop_duplicates(subset=['CODE','DESC'],keep=False)
# find previous date
df = df.sort_values(by=['CODE','DATE'])
df['previous_date'] = df.groupby('CODE')['DATE'].transform('shift')
# drop rows that don't have a previous date
df = df.dropna()
# calculate the difference between current date and previous date
df['difference_in_dates'] = (df['DATE'] - df['previous_date'])
This results in the following df:
CODE       DATE        TYPE  DESC  previous_date  difference_in_dates
AACCBD     2020-07-21  PUB   OK    2020-07-16     5 days
BBLGLC70M  2019-09-25  PRI   OK    2019-05-16     132 days
BCCDN      2020-02-27  PUB   OK    2020-02-13     14 days
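If a figure in months is needed, the day difference can be converted in more than one way. The sketch below rebuilds the three date pairs from the result above and shows two options: the divide-by-30 approximation mentioned earlier, and a calendar-month count that ignores the day of the month. The column names approx_months and calendar_months are illustrative, and which definition is appropriate depends on what "a month" should mean for your data.

```python
import pandas as pd

# The three date pairs from the result above
res = pd.DataFrame({
    'DATE': pd.to_datetime(['2020-07-21', '2019-09-25', '2020-02-27']),
    'previous_date': pd.to_datetime(['2020-07-16', '2019-05-16', '2020-02-13']),
})

# Whole days between the two dates
days = (res['DATE'] - res['previous_date']).dt.days

# Option 1: approximate, treating every month as 30 days
res['approx_months'] = days / 30

# Option 2: calendar months, counting month boundaries crossed
# and ignoring the day of the month
res['calendar_months'] = (
    (res['DATE'].dt.year - res['previous_date'].dt.year) * 12
    + (res['DATE'].dt.month - res['previous_date'].dt.month)
)

print(res)
```

For the BBLGLC70M pair (132 days), the first option gives 4.4 and the second gives 4.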