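Problem description
I'm new to pandas. I'm trying to extract some useful information from (very graphically designed) weekly Excel files that record the bookings of certain rooms, 52 files per year. Each file contains the employees' names, working hours and projects. What I mainly care about is which employee worked on which project for how many days/hours.
From everything I've read about pandas, this is a difficult case because the way the data is laid out isn't well suited to extraction.
Basically, the only relevant information in the first column is the ROOM label, which is followed by 4 rows of descriptive text that aren't needed in this context. Starting from the row with the ROOM string, I need to extract 4 rows of relevant information in each date column. For my use case I don't need to know who worked in which room, but the ROOMs serve as the index.
Right now I'm stuck on how to reformat the first column so that pandas can use it in a meaningful way. My idea was to search for every row containing ROOM, build a new index column from the resulting boolean mask, use that as part of a MultiIndex, and so on. But I got stuck right at the start.
Before going further in this direction, I'd like to learn the general approach to making a document like this better suited to pandas, and whether there are best practices for handling it. Is my idea of building a MultiIndex from the 4-row room descriptions the right way to go? As I said, I'm very new to pandas, so please forgive me if I've misunderstood how it should be used, or if what I want is perfectly doable in pandas...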
import pandas as pd
from pathlib import Path
# Assign spreadsheet filename to `file`
file = './data/Week02-2019.xlsx'
# Load spreadsheet
xl = pd.ExcelFile(file)
# Remove all Sorts of crap on loading
df1 = xl.parse(sheet_name='Booking',header=1,parse_dates=[1],na_values=['xxx',''],skiprows=[2,3,4,5,6,7,8,9,10,19,28,37,46,64],usecols="A:H")
print (df1.iloc[:,0].str.contains("ROOM"),"NewIndex")
+---------------------+----------------+----------------+----------------+------------+------------+------------+------------+
| 2018 | 01.01.2018 | 02.08.2018 | 03.08.2018 | 04.08.2018 | 05.08.2018 | 06.08.2018 | 07.08.2018 |
+---------------------+----------------+----------------+----------------+------------+------------+------------+------------+
| ROOM 01 (Morning) | John Doe | Jane Doe | Donny Doe | | | | |
| Very Nice | Project# | Project# | Project# | | | | |
| Good Projector | Project Title | Project Title | Project Title | | | | |
| Telephone 1234 | 9:30-17.00 | 8-13.00 | 12-14.00 | | | | |
| --------- | ---- | ---- | ---- | --- | ---- | --- | --- |
| ROOM 01 (Afternoon) | Alan Smithee | Susi Smithee | Donald Smithee | | | | |
| Very Nice | Project# | Project# | Project# | | | | |
| Good Projector | Project Title | Project Title | Project Title | | | | |
| Telephone 1234 | 17:30-21.00 | 13.15-16.00 | 14.15-16.00 | | | | |
| ----- | ---------- | ---------- | --------- | ---- | --- | --- | |
| ROOM 02 (Morning) | Jimmy Doe | Duffy Duck | Benny Blanco | | | | |
| Not So Nice | Project# | Project# | Project# | | | | |
| whiteboard | Project Title | Project Title | Project Title | | | | |
| Telephone 5678 | 9:30-17.00 | 8-13.00 | 12-14.00 | | | | |
| --------- | ---- | ---- | ---- | --- | ---- | --- | --- |
| ROOM 02 (Afternoon) | Doris Day | Teddy Kaczinsky| Ru Paul | | | | |
| Not so Nice | Project# | Project# | Project# | | | | |
| whiteboard | Project Title | Project Title | Project Title | | | | |
| Telephone 5678 | 17:30-21.00 | 13.15-16.00 | 14.15-16.00 | | | | |
+---------------------+----------------+----------------+----------------+------------+------------+------------+------------+
Solution
Use DataFrame.set_index, DataFrame.stack and Series.unstack to reshape the data; the result has a MultiIndex:
first = df.columns[0]
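# keep only the ROOM labels in the first column and forward-fill them down each block
df[first] = df[first].where(df[first].str.contains("ROOM")).ffill()
# helper column: position of each row inside its 4-row block (assumes a default 0..n index)
df['group'] = df.index % 4
# names for the four rows of each block
d = {0:'name',1:'project',2:'project title',3: 'time'}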
df1 = (df.set_index([first,'group'])
.rename(columns = lambda x: pd.to_datetime(x,format='%d.%m.%Y'))
.stack()
.unstack(1)
.rename(columns=d)
.swaplevel(1,0)
.sort_index()
.rename_axis(index=['date','room'],columns=None)
)
print (df1)
name project project title \
date room
2018-01-01 ROOM 01 (Afternoon) Alan Smithee Project# Project Title
ROOM 01 (Morning) John Doe Project# Project Title
ROOM 02 (Afternoon) Doris Day Project# Project Title
ROOM 02 (Morning) Jimmy Doe Project# Project Title
2018-08-02 ROOM 01 (Afternoon) Susi Smithee Project# Project Title
ROOM 01 (Morning) Jane Doe Project# Project Title
ROOM 02 (Afternoon) Teddy Kaczinsky Project# Project Title
ROOM 02 (Morning) Duffy Duck Project# Project Title
2018-08-03 ROOM 01 (Afternoon) Donald Smithee Project# Project Title
ROOM 01 (Morning) Donny Doe Project# Project Title
ROOM 02 (Afternoon) Ru Paul Project# Project Title
ROOM 02 (Morning) Benny Blanco Project# Project Title
time
date room
2018-01-01 ROOM 01 (Afternoon) 17:30-21.00
ROOM 01 (Morning) 9:30-17.00
ROOM 02 (Afternoon) 17:30-21.00
ROOM 02 (Morning) 9:30-17.00
2018-08-02 ROOM 01 (Afternoon) 13.15-16.00
ROOM 01 (Morning) 8-13.00
ROOM 02 (Afternoon) 13.15-16.00
ROOM 02 (Morning) 8-13.00
2018-08-03 ROOM 01 (Afternoon) 14.15-16.00
ROOM 01 (Morning) 12-14.00
ROOM 02 (Afternoon) 14.15-16.00
ROOM 02 (Morning) 12-14.00
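The `df.index % 4` helper only lines up when the frame has a default 0..n index and every block is exactly four rows starting at a multiple of four. If that does not hold (for example after dropping separator rows without resetting the index), one alternative is to number the rows inside each block from the ROOM markers themselves. This is a minimal sketch of that idea, not part of the answer above; the index values are chosen deliberately to show the difference.

```python
import pandas as pd

# Two blocks taken from the sample table, with a non-default index on purpose
df = pd.DataFrame(
    {
        "2018": ["ROOM 01 (Morning)", "Very Nice", "Good Projector", "Telephone 1234",
                 "ROOM 01 (Afternoon)", "Very Nice", "Good Projector", "Telephone 1234"],
        "01.01.2018": ["John Doe", "Project#", "Project Title", "9:30-17.00",
                       "Alan Smithee", "Project#", "Project Title", "17:30-21.00"],
    },
    index=[10, 11, 12, 13, 20, 21, 22, 23],
)

first = df.columns[0]
# True on every row that starts a new block; na=False treats empty cells as non-matches
is_room = df[first].str.contains("ROOM", na=False)
# Running count of ROOM rows gives each block its own id
block = is_room.cumsum()
# Position of each row inside its block: 0, 1, 2, 3, 0, 1, ...
group = df.groupby(block).cumcount()

print(group.tolist())             # [0, 1, 2, 3, 0, 1, 2, 3]
print((df.index % 4).tolist())    # [2, 3, 0, 1, 0, 1, 2, 3]
```

Assigning `df['group'] = group` would then replace the modulo-based helper; the rest of the reshape stays the same.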
EDIT: The error means there are some duplicated ROOM values, so GroupBy.cumcount is needed to handle them by creating a new level in the MultiIndex:
print (df)
2018 01.01.2018 02.08.2018 03.08.2018
0 ROOM 01 (Morning) John Doe Jane Doe Donny Doe
1 Very Nice Project# Project# Project#
2 Good Projector Project Title Project Title Project Title
3 Telephone 1234 9:30-17.00 8-13.00 12-14.00
4 ROOM 01 (Morning) Alan Smithee Susi Smithee Donald Smithee
5 Very Nice Project# Project# Project#
6 Good Projector Project Title Project Title Project Title
7 Telephone 1234 17:30-21.00 13.15-16.00 14.15-16.00
8 ROOM 02 (Morning) Jimmy Doe Duffy Duck Benny Blanco
9 Not So Nice Project# Project# Project#
10 Whiteboard Project Title Project Title Project Title
11 Telephone 5678 9:30-17.00 8-13.00 12-14.00
12 ROOM 02 (Afternoon) Doris Day Teddy Kaczinsky Ru Paul
13 Not so Nice Project# Project# Project#
14 Whiteboard Project Title Project Title Project Title
15 Telephone 5678 17:30-21.00 13.15-16.00 14.15-16.00
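To see why the duplicated label matters, here is a small sketch with a cut-down Series (the values come from the sample above; the column name `value` and the level name `tmp` are only illustrative). With two identical (room, group) pairs, `unstack` cannot decide which value belongs in which cell; a counter from `GroupBy.cumcount` makes every index entry unique again.

```python
import pandas as pd

# ROOM 01 (Morning) appears twice, so each (room, group) pair is duplicated
idx = pd.MultiIndex.from_tuples(
    [("ROOM 01 (Morning)", 0), ("ROOM 01 (Morning)", 3),
     ("ROOM 01 (Morning)", 0), ("ROOM 01 (Morning)", 3)],
    names=["room", "group"],
)
s = pd.Series(["John Doe", "9:30-17.00", "Alan Smithee", "17:30-21.00"], index=idx)

error_message = None
try:
    s.unstack("group")
except ValueError as err:
    # pandas refuses to reshape when index entries are duplicated
    error_message = str(err)
print("unstack failed:", error_message)

# Number the repeats of every (room, group) pair
occurrence = s.groupby(level=["room", "group"]).cumcount()
print(occurrence.tolist())  # [0, 0, 1, 1]

# Append the counter as an extra index level, then reshape
wide = (s.to_frame("value")
         .set_index(occurrence.rename("tmp"), append=True)["value"]
         .unstack("group"))
print(wide)
```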
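first = df.columns[0]
# keep only the ROOM labels in the first column and forward-fill them
df[first] = df[first].where(df[first].str.contains("ROOM")).ffill()
df['group'] = df.index % 4
d = {0:'name',1:'project',2:'project title',3: 'time'}
df1 = (df.set_index([first,'group'])
.rename(columns = lambda x: pd.to_datetime(x,format='%d.%m.%Y'))
.stack()
.to_frame())
# counter that separates repeated (room, group) pairs
g = df1.groupby(level=[0,1]).cumcount()
df1 = (df1.set_index(g,append=True)[0]
.unstack(1)
.rename(columns=d)
.swaplevel(1,0)
.sort_index()
.rename_axis(index=['date','room','tmp'],columns=None)
)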
print (df1)
name project project title \
date room tmp
2018-01-01 ROOM 01 (Morning) 0 John Doe Project# Project Title
3 Alan Smithee Project# Project Title
ROOM 02 (Afternoon) 0 Doris Day Project# Project Title
ROOM 02 (Morning) 0 Jimmy Doe Project# Project Title
2018-08-02 ROOM 01 (Morning) 1 Jane Doe Project# Project Title
4 Susi Smithee Project# Project Title
ROOM 02 (Afternoon) 1 Teddy Kaczinsky Project# Project Title
ROOM 02 (Morning) 1 Duffy Duck Project# Project Title
2018-08-03 ROOM 01 (Morning) 2 Donny Doe Project# Project Title
5 Donald Smithee Project# Project Title
ROOM 02 (Afternoon) 2 Ru Paul Project# Project Title
ROOM 02 (Morning) 2 Benny Blanco Project# Project Title
time
date room tmp
2018-01-01 ROOM 01 (Morning) 0 9:30-17.00
3 17:30-21.00
ROOM 02 (Afternoon) 0 17:30-21.00
ROOM 02 (Morning) 0 9:30-17.00
2018-08-02 ROOM 01 (Morning) 1 8-13.00
4 13.15-16.00
ROOM 02 (Afternoon) 1 13.15-16.00
ROOM 02 (Morning) 1 8-13.00
2018-08-03 ROOM 01 (Morning) 2 12-14.00
5 14.15-16.00
ROOM 02 (Afternoon) 2 14.15-16.00
ROOM 02 (Morning) 2 12-14.00
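Once the data is in this shape, the question's actual goal (days and hours per employee and project) comes down to parsing the `time` strings and grouping. The following is a rough sketch under assumptions: the helpers `to_minutes` and `span_hours` are hypothetical, the parser only handles the formats visible in the sample ('9:30-17.00', '8-13.00', '13.15-16.00'), and the project codes "P1" and "P2" are invented stand-ins because the sample only shows the placeholder "Project#".

```python
import pandas as pd

def to_minutes(clock: str) -> int:
    """Convert '9:30', '17.00' or '8' to minutes after midnight."""
    hours, _, minutes = clock.strip().replace(".", ":").partition(":")
    return int(hours) * 60 + int(minutes or 0)

def span_hours(span: str) -> float:
    """Length in hours of a range such as '9:30-17.00'."""
    start, end = span.split("-")
    return (to_minutes(end) - to_minutes(start)) / 60

# Small stand-in for the reshaped frame (same index levels and column names)
idx = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2018-01-01"), "ROOM 01 (Morning)"),
        (pd.Timestamp("2018-08-02"), "ROOM 01 (Morning)"),
        (pd.Timestamp("2018-08-02"), "ROOM 02 (Morning)"),
    ],
    names=["date", "room"],
)
demo = pd.DataFrame(
    {
        "name": ["John Doe", "John Doe", "Jane Doe"],
        "project": ["P1", "P1", "P2"],          # invented project codes
        "time": ["9:30-17.00", "8-13.00", "13.15-16.00"],
    },
    index=idx,
)

flat = demo.reset_index()
flat["hours"] = flat["time"].map(span_hours)

# Distinct booking days and total hours per employee and project
summary = flat.groupby(["name", "project"]).agg(
    days=("date", "nunique"),
    hours=("hours", "sum"),
)
print(summary)
```

Real files will likely contain empty cells or other time formats, so the parser would need guarding (for example skipping NaN values) before being applied to a full year of data.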