Boxplot for a MultiIndex df

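Problem description

I want to do two things:

  1. I want a boxplot for each date, containing all the MeanTravelTimeSeconds values for that date. The number of MeanTravelTimeSeconds values differs from date to date (for example, one day may have 300 values while another has 400).

  2. I also want to turn the rows of the MultiIndex Series into columns, because I don't want those row labels repeated every time. If it stays this way, I'll have tens of millions of unnecessary rows.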

This is the Series that results from calling df.stack() on a df indexed by date (Date is a datetime index):

Date                                        
2016-01-02  NumericIndex                        1611664
            OriginMovementID                       4744
            DestinationMovementID                  5084
            MeanTravelTimeSeconds                  1233
            RangeLowerBoundTravelTimeSeconds        756
                                                 ...   
2020-03-31  DestinationMovementID                  3594
            MeanTravelTimeSeconds                  1778
            RangeLowerBoundTravelTimeSeconds       1601
            RangeUpperBoundTravelTimeSeconds       1973
            DayOfWeek                           Tuesday
Length: 11281655, dtype: object

When I try to draw the boxplot with seaborn, I run into lots of errors, even after trying different options.

If I try df.stack().unstack() or df.stack().T, I get the following error:

Index contains duplicate entries, cannot reshape

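How can I draw the boxplot, and how can I turn the rows into columns?

Solution

You do need to make the index unique for the functions you want to work. I suggest adding a sequence number that resets each time the other two key columns change.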
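To see why uniqueness matters, here is a minimal sketch. The tiny Series `s` and its values are made up for illustration; only the index names mirror the question. Two rows share the same (Date, Observation) pair, so pandas cannot decide which value belongs in the single cell that unstack would create.

```python
import pandas as pd

# Hypothetical toy data: two MeanTravelTimeSeconds rows share one date,
# so the (Date, Observation) index contains a duplicate pair.
idx = pd.MultiIndex.from_tuples(
    [
        ("2016-01-02", "MeanTravelTimeSeconds"),
        ("2016-01-02", "MeanTravelTimeSeconds"),
        ("2016-01-02", "OriginMovementID"),
    ],
    names=["Date", "Observation"],
)
s = pd.Series([1233, 1410, 4744], index=idx)

print(s.index.is_unique)  # False

# unstack() needs exactly one value per (row, column) cell, so it refuses
try:
    s.unstack()
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

The fuller example that follows builds the extra sequence number on randomly generated data.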

import datetime as dt
import random
import numpy as np
import pandas as pd
cat = ["NumericIndex","OriginMovementID","DestinationMovementID","MeanTravelTimeSeconds","RangeLowerBoundTravelTimeSeconds"]

df = pd.DataFrame(
[{"Date":d,"Observation":cat[random.randint(0,len(cat)-1)],"Value":random.randint(1000,10000)} 
 for i in range(random.randint(5,20)) 
 for d in pd.date_range(dt.datetime(2016,1,2),dt.datetime(2016,3,31),freq="14D")])

# starting point....
df = df.sort_values(["Date","Observation"]).set_index(["Date","Observation"])

# generate an array that is sequential within change of key
seq = np.full(df.index.shape,0)
s=0
p=""
for i,v in enumerate(df.index):
    if i==0 or p!=v: s=0
    else: s+=1
    seq[i] = s
    p=v
df["SeqNo"] = seq
# add to index - now unstack works as required
dfdd = df.set_index(["SeqNo"],append=True)
dfdd.unstack(0).loc["MeanTravelTimeSeconds"].boxplot()
print(dfdd.unstack(1).head().to_string())
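The explicit loop above can probably be replaced by `groupby(...).cumcount()`, which numbers the rows within each group starting from 0. This is an alternative sketch rather than the code used above; the small frame and the name `wide` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical long-format data standing in for the stacked frame.
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2016-01-02", "2016-01-02", "2016-01-02", "2016-01-16", "2016-01-16"]
        ),
        "Observation": [
            "MeanTravelTimeSeconds",
            "MeanTravelTimeSeconds",
            "OriginMovementID",
            "MeanTravelTimeSeconds",
            "MeanTravelTimeSeconds",
        ],
        "Value": [1233, 1410, 4744, 1778, 1601],
    }
)

# cumcount() restarts at 0 for every new (Date, Observation) pair
df["SeqNo"] = df.groupby(["Date", "Observation"]).cumcount()

# With SeqNo in the index every entry is unique, so unstack can reshape
wide = df.set_index(["Date", "SeqNo", "Observation"])["Value"].unstack("Observation")
print(wide)
```

From here, `wide["MeanTravelTimeSeconds"].unstack("Date").boxplot()` should draw one box per date, provided matplotlib is installed.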

Output

                                 Value                                                                                     
Observation      DestinationMovementID MeanTravelTimeSeconds NumericIndex OriginMovementID RangeLowerBoundTravelTimeSeconds
Date       SeqNo                                                                                                           
2016-01-02 0                       NaN                   NaN       2560.0           5324.0                           5085.0
           1                       NaN                   NaN       1066.0           7372.0                              NaN
2016-01-16 0                       NaN                6226.0          NaN           7832.0                              NaN
           1                       NaN                1384.0          NaN           8839.0                              NaN
           2                       NaN                7892.0          NaN              NaN                              NaN
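If only the per-date boxplot is needed, reshaping may not be necessary at all, since a boxplot just groups values by date and the groups can differ in size. The sketch below assumes a frame shaped like the data before df.stack(), with repeated dates in the index. The name `raw` and its values are hypothetical. It computes the quartiles a boxplot would draw, so the grouping can be checked without plotting.

```python
import pandas as pd

# Hypothetical frame: one row per observation, repeated dates in the index.
raw = pd.DataFrame(
    {"MeanTravelTimeSeconds": [1233, 1410, 980, 1778, 1601]},
    index=pd.to_datetime(["2016-01-02"] * 3 + ["2020-03-31"] * 2).rename("Date"),
)

# Per-date summary statistics; each date may hold a different number of values
summary = raw.groupby(level="Date")["MeanTravelTimeSeconds"].describe()
print(summary[["count", "25%", "50%", "75%"]])
```

With matplotlib available, `raw.reset_index().boxplot(column="MeanTravelTimeSeconds", by="Date")` should give one box per date. seaborn's `sns.boxplot(x="Date", y="MeanTravelTimeSeconds", data=raw.reset_index())` is a comparable option.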