A faster way to look up and compute values in Pandas

Problem description

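I have two DataFrames in pandas. For each "Name" in DF1, I want to look up the corresponding "City" and "State" in DF2.

For example, 'Dwight' in DF1 should return the corresponding values 'Miami' and 'Florida' from DF2.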

DF1


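DF1 has about 70,000 rows and 3 columns:

         Name     Age  Student
0        Dwight   20   Yes
1        Michael  30   No
2        Pam      55   No
...      ...      ...  ...
70000    Jim      27   Yes

The second DataFrame, DF2, has about 320,000 rows:

         Name     City       State
0        Dwight   Miami      Florida
1        Michael  Scranton   Pennsylvania
2        Pam      Austin     Texas
...      ...      ...        ...
325082   Jim      Scranton   Pennsylvania

Currently I have two functions that use a boolean filter to return the "City" and "State" values, and I use apply to run them over every value: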

def read_city(id):
    filt = (df2['Name'] == id)
    if filt.any():
        field = (df2[filt]['City'].values[0])
    else:
        field = ""
    return field


def read_state(id):
    filt = (df2['Name'] == id)
    if filt.any():
        field = (df2[filt]['State'].values[0])
    else:
        field = ""
    return field

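Computing the results this way takes a long time: about 18 minutes to get back df['city_list'] and df['State_list'].

Is there a faster way to compute this? I'm new to pandas, so I'd like to know whether there is an efficient way to do it.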

Solution

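I believe you can use map:

# Collect every City and State per name into lists (the column is 'Name', not 'name')
s = df2.groupby('Name')[['City', 'State']].agg(list)
df['city_list'] = df['Name'].map(s['City'])
df['State_list'] = df['Name'].map(s['State'])

Or, once you have s, do a left merge:

# add_suffix yields the columns 'City_list' and 'State_list'
df = df.merge(s.add_suffix('_list'), left_on='Name', right_index=True, how='left')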
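If only the first matching row per name is wanted, which is what the question's read_city and read_state return through .values[0], a lookup table with one row per name can be mapped directly. The following is a minimal sketch under that assumption; the sample rows and the name `lookup` are made up for illustration and are not from the original post.

```python
import pandas as pd

# Hypothetical sample data standing in for the real 70,000- and 320,000-row frames.
df = pd.DataFrame({'Name': ['Dwight', 'Michael', 'Pam', 'Toby'],
                   'Age': [20, 30, 55, 40]})
df2 = pd.DataFrame({'Name': ['Dwight', 'Michael', 'Pam', 'Dwight'],
                    'City': ['Miami', 'Scranton', 'Austin', 'Orlando'],
                    'State': ['Florida', 'Pennsylvania', 'Texas', 'Florida']})

# Keep the first row for each name, then index by name so map can look values up.
lookup = df2.drop_duplicates(subset='Name', keep='first').set_index('Name')

# Names missing from df2 map to NaN; fillna('') mirrors the empty-string fallback
# used by the original functions.
df['city_list'] = df['Name'].map(lookup['City']).fillna('')
df['State_list'] = df['Name'].map(lookup['State']).fillna('')

print(df)
```

Unlike the groupby version, each cell here holds a single string rather than a list, and the whole lookup is done in one vectorized pass instead of one filter over df2 per row.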

I think you can do the following:

import pandas as pd

# DataFrame DF1 (dummy data)
DF1 = pd.DataFrame(columns=['Name', 'Age', 'Student'],
                   data=[['Dwight', 20, 'Yes'], ['Michael', 30, 'No'],
                         ['Pam', 55, 'No'], ['Jim', 27, 'Yes']])

print("DataFrame DF1")
print(DF1)

# DataFrame DF2 (dummy data)
DF2 = pd.DataFrame(columns=['Name', 'City', 'State'],
                   data=[['Dwight', 'Miami', 'Florida'],
                         ['Michael', 'Scranton', 'Pennsylvania'],
                         ['Pam', 'Austin', 'Texas'],
                         ['Jim', 'Scranton', 'Pennsylvania']])

print("DataFrame DF2")
print(DF2)

# Merge on the 'Name' column, then rename the 'City' and 'State' columns
df = pd.merge(DF1, DF2, on=['Name']).rename(columns={'City': 'city_list', 'State': 'State_list'})
print("DataFrame final")
print(df)

Output:

DataFrame DF1
    Name     Age  Student
0   Dwight   20   Yes
1   Michael  30   No
2   Pam      55   No
3   Jim      27   Yes

DataFrame DF2
    Name     City      State
0   Dwight   Miami     Florida
1   Michael  Scranton  Pennsylvania
2   Pam      Austin    Texas
3   Jim      Scranton  Pennsylvania

DataFrame final
    Name     Age  Student  city_list  State_list
0   Dwight   20   Yes      Miami      Florida
1   Michael  30   No       Scranton   Pennsylvania
2   Pam      55   No       Austin     Texas
3   Jim      27   Yes      Scranton   Pennsylvania
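One caveat: pd.merge defaults to an inner join, so a name in DF1 with no match in DF2 is dropped from the result, and a name that appears several times in DF2 produces several output rows. A hedged sketch of a left merge that keeps every DF1 row, assuming one row per name is wanted; the sample data and the names `right` and `merged` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: 'Toby' has no match in DF2, and 'Dwight' appears twice there.
DF1 = pd.DataFrame({'Name': ['Dwight', 'Pam', 'Toby'], 'Age': [20, 55, 40]})
DF2 = pd.DataFrame({'Name': ['Dwight', 'Pam', 'Dwight'],
                    'City': ['Miami', 'Austin', 'Orlando'],
                    'State': ['Florida', 'Texas', 'Florida']})

# One row per name on the right side, so the merge cannot multiply DF1 rows.
right = DF2.drop_duplicates(subset='Name', keep='first')

merged = DF1.merge(
    right.rename(columns={'City': 'city_list', 'State': 'State_list'}),
    on='Name',
    how='left',      # keep every row of DF1, matched or not
    validate='m:1',  # raise if the right side still has duplicate names
)
print(merged)
```

Unmatched names end up with NaN in city_list and State_list; a trailing .fillna('') would reproduce the empty-string fallback from the question if that is preferred.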