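Problem description
I want to create a multi-way contingency table from my pandas DataFrame and store it in an xarray. It seemed to me that pandas.crosstab followed by DataFrame.to_xarray() should be straightforward enough, but in pandas v1.1.5 I get "TypeError: Cannot interpret 'interval[int64]' as a data type". (v1.0.1 gives "ValueError: arrays must all be same length".)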
In [1]: import numpy as np
...: import pandas as pd
...: pd.__version__
Out[1]: '1.1.5'
In [2]: import xarray as xr
...: xr.__version__
Out[2]: '0.17.0'
In [3]: n = 100
...: np.random.seed(42)
...: x = pd.cut(np.random.uniform(low=0,high=3,size=n),range(5))
...: x
Out[3]:
[(1,2],(2,3],(1,(0,1],...,1]]
Length: 100
Categories (4,interval[int64]): [(0,1] < (1,2] < (2,3] < (3,4]]
In [4]: x.value_counts().sort_index()
Out[4]:
(0,1] 41
(1,2] 28
(2,3] 31
(3,4] 0
dtype: int64
Note that I need my table to include empty categories such as (3,4].
In [6]: idx=pd.date_range('2001-01-01',periods=n,freq='8H')
...: df = pd.DataFrame({'x': x},index=idx)
...: df['xlag'] = df.x.shift(1,'D')
...: df['h'] = df.index.hour
...: xtab = pd.crosstab([df.h,df.xlag],df.x,dropna=False,normalize='index')
...: xtab
Out[6]:
x (0,1] (1,2] (2,3] (3,4]
h xlag
0 (0,1] 0.000000 0.700000 0.300000 0.0
(1,2] 0.470588 0.411765 0.117647 0.0
(2,3] 0.500000 0.333333 0.166667 0.0
(3,4] 0.000000 0.000000 0.000000 0.0
8 (0,1] 0.588235 0.000000 0.411765 0.0
(1,2] 1.000000 0.000000 0.000000 0.0
(2,3] 0.428571 0.142857 0.428571 0.0
(3,4] 0.000000 0.000000 0.000000 0.0
16 (0,1] 0.333333 0.250000 0.416667 0.0
(1,2] 0.444444 0.222222 0.333333 0.0
(2,3] 0.454545 0.363636 0.181818 0.0
(3,4] 0.000000 0.000000 0.000000 0.0
That's fine, but my actual application has more categories and more dimensions, so this seems like a clear use case for xarray. However, I get an error:
In [8]: xtab.to_xarray()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-aaedf730bb97> in <module>
----> 1 xtab.to_xarray()
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/pandas/core/generic.py in to_xarray(self)
2818 return xarray.DataArray.from_series(self)
2819 else:
-> 2820 return xarray.Dataset.from_dataframe(self)
2821
2822 @Substitution(returns=fmt.return_docstring)
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in from_dataframe(cls,dataframe,sparse)
5131 obj._set_sparse_data_from_dataframe(idx,arrays,dims)
5132 else:
-> 5133 obj._set_numpy_data_from_dataframe(idx,dims)
5134 return obj
5135
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in _set_numpy_data_from_dataframe(self,idx,dims)
5062 data = np.zeros(shape,values.dtype)
5063 data[indexer] = values
-> 5064 self[name] = (dims,data)
5065
5066 @classmethod
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in __setitem__(self,key,value)
1427 )
1428
-> 1429 self.update({key: value})
1430
1431 def __delitem__(self,key: Hashable) -> None:
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in update(self,other)
3897 Dataset.assign
3898 """
-> 3899 merge_result = dataset_update_method(self,other)
3900 return self._replace(inplace=True,**merge_result._asdict())
3901
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/merge.py in dataset_update_method(dataset,other)
958 priority_arg=1,
959 indexes=indexes,
--> 960 combine_attrs="override",
961 )
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/merge.py in merge_core(objects,compat,join,combine_attrs,priority_arg,explicit_coords,indexes,fill_value)
609 coerced = coerce_pandas_values(objects)
610 aligned = deep_align(
--> 611 coerced,join=join,copy=False,indexes=indexes,fill_value=fill_value
612 )
613 collected = collect_variables_and_indexes(aligned)
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/alignment.py in deep_align(objects,copy,exclude,raise_on_invalid,fill_value)
428 indexes=indexes,
429 exclude=exclude,
--> 430 fill_value=fill_value,
431 )
432
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/alignment.py in align(join,fill_value,*objects)
352 if not valid_indexers:
353 # fast path for no reindexing necessary
--> 354 new_obj = obj.copy(deep=copy)
355 else:
356 new_obj = obj.reindex(
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in copy(self,deep,data)
1218 """
1219 if data is None:
-> 1220 variables = {k: v.copy(deep=deep) for k,v in self._variables.items()}
1221 elif not utils.is_dict_like(data):
1222 raise ValueError("Data must be dict-like")
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/dataset.py in <dictcomp>(.0)
1218 """
1219 if data is None:
-> 1220 variables = {k: v.copy(deep=deep) for k,v in self._variables.items()}
1221 elif not utils.is_dict_like(data):
1222 raise ValueError("Data must be dict-like")
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/variable.py in copy(self,data)
2632 """
2633 if data is None:
-> 2634 data = self._data.copy(deep=deep)
2635 else:
2636 data = as_compatible_data(data)
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/indexing.py in copy(self,deep)
1484 # 8000341
1485 array = self.array.copy(deep=True) if deep else self.array
-> 1486 return PandasIndexAdapter(array,self._dtype)
/opt/scitools/environments/default/2021_03_18-1/lib/python3.6/site-packages/xarray/core/indexing.py in __init__(self,array,dtype)
1407 dtype_ = array.dtype
1408 else:
-> 1409 dtype_ = np.dtype(dtype)
1410 self._dtype = dtype_
1411
TypeError: Cannot interpret 'interval[int64]' as a data type
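I can avoid the error by converting x (and xlag) to some dtype other than pandas.Categorical before calling pandas.crosstab, but then I lose any empty categories, which I need to keep in my actual application.
Solution
The problem here is not the use of a CategoricalIndex, but that the category labels (x.categories) are an IntervalIndex, which xarray does not like.
To fix this, you can simply replace the categories in your x variable with their string representations, which forces x.categories to be of "object" dtype rather than "interval[int64]" dtype: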
x = (
pd.cut(np.random.uniform(low=0,high=3,size=n),range(5))
.rename_categories(str)
)
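Then compute your crosstab as you already did, and it should work!
To get your dataset into the coordinates you want (I think), all you need to do is stack everything into a MultiIndex row shape (rather than the crosstab's MultiIndex row / Index column shape).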
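As a minimal sketch of what rename_categories(str) changes, the snippet below uses a small hand-made list of values (an assumption for illustration, not the random data from the question). It shows that the category labels stop being an IntervalIndex and become plain strings, while the empty category is still counted:

```python
import pandas as pd

# Toy values chosen so that nothing falls into the last bin (3, 4].
values = [0.5, 1.5, 2.5, 0.2]

x_interval = pd.cut(values, bins=[0, 1, 2, 3, 4])
x_string = x_interval.rename_categories(str)

# Before: the category labels are an IntervalIndex.
print(type(x_interval.categories).__name__, x_interval.categories.dtype)

# After: the same labels as plain strings.
print(list(x_string.categories), x_string.categories.dtype)

# The empty category survives the renaming and is reported with a count of 0.
counts = pd.Series(x_string).value_counts(sort=False)
print(counts)
```

The exact dtype printed for the string labels may depend on the pandas version, so it is safer to rely on the labels being strings than on a specific dtype name.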
xtab = (
pd.crosstab([df.h,df.xlag],df.x,dropna=False,normalize="index")
.stack()
.reorder_levels(["x","h","xlag"])
.sort_index()
)
xtab.to_xarray()
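If you want shorter code and are willing to lose the explicit ordering of index levels, you can also use unstack instead of stack, which gives you the right ordering straight away: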
xtab = (
pd.crosstab([df.h,df.xlag],df.x,dropna=False,normalize="index")
.unstack([0,1])
)
xtab.to_xarray()
Whichever approach you use, stack() or unstack([0,1]), you get the following output:
<xarray.DataArray (x: 4,h: 3,xlag: 4)>
array([[[0.        , 0.47058824, 0.5       , 0.        ],
        [0.58823529, 1.        , 0.42857143, 0.        ],
        [0.33333333, 0.44444444, 0.45454545, 0.        ]],

       [[0.7       , 0.41176471, 0.33333333, 0.        ],
        [0.        , 0.        , 0.14285714, 0.        ],
        [0.25      , 0.22222222, 0.36363636, 0.        ]],

       [[0.3       , 0.11764706, 0.16666667, 0.        ],
        [0.41176471, 0.        , 0.42857143, 0.        ],
        [0.41666667, 0.33333333, 0.18181818, 0.        ]],

       [[0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        ]]])
Coordinates:
* x (x) object '(0,1]' '(1,2]' '(2,3]' '(3,4]'
* h (h) int64 0 8 16
* xlag (xlag) object '(0,1]' '(1,2]' '(2,3]' '(3,4]'
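To check that the stack() and unstack([0,1]) routes really agree, here is a small sketch on toy data (the DataFrame below is an assumption for illustration, built directly rather than with shift as in the question). It only needs pandas and compares the two reshaped Series before any conversion to xarray:

```python
import pandas as pd

bins = [0, 1, 2, 3, 4]

# Toy stand-in for the question's DataFrame; categories are already strings.
df = pd.DataFrame({
    "h": [0, 8, 0, 8, 0, 8],
    "xlag": pd.cut([0.5, 1.5, 2.5, 0.5, 1.5, 2.5], bins).rename_categories(str),
    "x": pd.cut([1.5, 0.5, 0.5, 2.5, 1.5, 0.2], bins).rename_categories(str),
})

table = pd.crosstab([df.h, df.xlag], df.x, dropna=False, normalize="index")

# Route 1: stack the columns into the row index, then reorder explicitly.
via_stack = table.stack().reorder_levels(["x", "h", "xlag"]).sort_index()

# Route 2: unstack both row levels, which leaves x as the outermost level.
via_unstack = table.unstack([0, 1])

print(via_stack.index.names, via_unstack.index.names)
print(via_stack.to_dict() == via_unstack.to_dict())
```

Both results are Series indexed by (x, h, xlag), which is the shape that to_xarray() turns into a single 3-d DataArray.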
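@Cameron-Riddell's answer was the key to solving my problem, but there was some additional reshaping to sort out. Applying rename_categories(str) to my x variable as he suggested, and then proceeding as in my question, the last line works: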
In [8]: xtab = pd.crosstab([df.h,df.xlag],df.x,dropna=False,normalize='index')
...: xtab.to_xarray()
Out[8]:
<xarray.Dataset>
Dimensions: (h: 3,xlag: 4)
Coordinates:
* h (h) int64 0 8 16
* xlag (xlag) object '(0,1]' '(1,2]' '(2,3]' '(3,4]'
Data variables:
(0,1] (h,xlag) float64 0.0 0.4706 0.5 0.0 ... 0.3333 0.4444 0.4545 0.0
(1,2] (h,xlag) float64 0.7 0.4118 0.3333 0.0 ... 0.25 0.2222 0.3636 0.0
(2,3] (h,xlag) float64 0.3 0.1176 0.1667 0.0 ... 0.4167 0.3333 0.1818 0.0
(3,4] (h,xlag) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
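But I want a 3-d array with one variable, not a 2-d array with four variables. To convert it, I need to apply .to_array(dim='x'). But then my dimension order is x, h, xlag, and I obviously don't want h in the middle, so I also need to transpose them: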
In [9]: xtab.to_xarray().to_array(dim='x').transpose('h','xlag','x')
Out[9]:
<xarray.DataArray (h: 3,xlag: 4,x: 4)>
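array([[[0.        , 0.7       , 0.3       , 0.        ],
        [0.47058824, 0.41176471, 0.11764706, 0.        ],
        [0.5       , 0.33333333, 0.16666667, 0.        ],
        [0.        , 0.        , 0.        , 0.        ]],

       [[0.58823529, 0.        , 0.41176471, 0.        ],
        [1.        , 0.        , 0.        , 0.        ],
        [0.42857143, 0.14285714, 0.42857143, 0.        ],
        [0.        , 0.        , 0.        , 0.        ]],

       [[0.33333333, 0.25      , 0.41666667, 0.        ],
        [0.44444444, 0.22222222, 0.33333333, 0.        ],
        [0.45454545, 0.36363636, 0.18181818, 0.        ],
        [0.        , 0.        , 0.        , 0.        ]]])
Coordinates:
* h (h) int64 0 8 16
* xlag (xlag) object '(0,1]' '(1,2]' '(2,3]' '(3,4]'
* x (x) <U6 '(0,1]' '(1,2]' '(2,3]' '(3,4]'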
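That is what I had in mind! It displays much like pd.crosstab, but it is a 3-d xarray rather than a pandas DataFrame with a MultiIndex. This will be easier to work with in the later stages of my program (the crosstab is only an intermediate step, not a result in itself).
I must say, it turned out to be more complicated than I expected. I came across a 2017 question from @kilojoules, "When to use multiindexing vs. xarray in pandas", where @Tkanno wrote an answer that begins "There does seem to be a transition to xarray for working with multi-dimensional arrays." It seems a pity to me that there isn't a version of pd.crosstab that returns an xarray. Or am I asking for more pandas-xarray integration than exists?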
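For what it's worth, the steps from both answers can be wrapped into one small function. The sketch below is only an illustration: crosstab_to_dataarray is a hypothetical name, not part of pandas or xarray, and the toy DataFrame is made up. It uses the stack() route, which puts x last so no transpose is needed, and it assumes xarray is installed, since Series.to_xarray() requires it.

```python
import pandas as pd


def crosstab_to_dataarray(index, columns, **crosstab_kwargs):
    """Hypothetical helper: pd.crosstab reshaped into an N-d xarray.DataArray."""

    def with_string_categories(series):
        # Interval-valued category labels are what triggered the TypeError in
        # the question, so replace them with their string form.
        dtype = series.dtype
        if isinstance(dtype, pd.CategoricalDtype) and isinstance(
            dtype.categories, pd.IntervalIndex
        ):
            return series.cat.rename_categories(str)
        return series

    index = [with_string_categories(s) for s in index]
    columns = with_string_categories(columns)

    # dropna=False keeps empty categories such as (3, 4] in the table.
    table = pd.crosstab(index, columns, dropna=False, **crosstab_kwargs)

    # stack() moves the column level to the end of the row MultiIndex, so the
    # dimensions come out ordered as (index names..., columns name).
    return table.stack().to_xarray()


# Toy data (an assumption for illustration, not the data from the question).
bins = [0, 1, 2, 3, 4]
df = pd.DataFrame({
    "h": [0, 8, 0, 8, 0, 8],
    "xlag": pd.cut([0.5, 1.5, 2.5, 0.5, 1.5, 2.5], bins),
    "x": pd.cut([1.5, 0.5, 0.5, 2.5, 1.5, 0.2], bins),
})

da = crosstab_to_dataarray([df["h"], df["xlag"]], df["x"], normalize="index")
print(da.dims, da.shape)
```

This has only been thought through for inputs without missing values; rows where xlag is NaN (as produced by shift in the question) may be handled differently across pandas versions.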