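How to achieve numpy-style indexing with an xarray Dataset

Problem description

I know the x and y indices into a 2D array (numpy-style indexing).

According to this documentation, xarray uses, for example, Fortran-style (orthogonal) indexing.

So when I pass, for example,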

ind_x = [1,2]
ind_y = [3,4]

I expect 2 values, one for each of the index pairs (1,3) and (2,4), but xarray returns a 2x2 matrix.
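The difference can be reproduced with a small synthetic array. This is only an illustrative sketch: the 5x6 array and the dimension names `x` and `y` are assumptions, not the data from the question.

```python
import numpy as np
import xarray as xr

# Synthetic 5x6 array (an assumption; any 2D data behaves the same way).
values = np.arange(30).reshape(5, 6)
da = xr.DataArray(values, dims=("x", "y"))

ind_x = [1, 2]
ind_y = [3, 4]

# numpy pairs the index lists element-wise: (1, 3) and (2, 4).
pointwise = values[ind_x, ind_y]

# xarray indexes each dimension independently (outer product of the lists).
orthogonal = da.isel(x=ind_x, y=ind_y)

print(pointwise.shape)   # (2,)
print(orthogonal.shape)  # (2, 2)
```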

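How can I achieve numpy-like indexing with xarray?

Note: I want to avoid loading the entire dataset into memory, so using the .values API is not part of the solution I am looking for.

Solution

You can access the underlying numpy array and index it directly: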

import xarray as xr

x = xr.tutorial.load_dataset("air_temperature")

ind_x = [1,2]
ind_y = [3,4]

print(x.air.data[0,ind_y,ind_x].shape)
# (2,)
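As a possible alternative that stays within xarray, the index lists can be wrapped in DataArray objects that share a dimension, which xarray treats as pointwise (vectorized) indexing. The sketch below uses a synthetic array in place of the tutorial dataset, and the dimension name `points` is an arbitrary choice.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the tutorial data (assumption): dims (time, lat, lon).
air = xr.DataArray(
    np.arange(2 * 5 * 6).reshape(2, 5, 6),
    dims=("time", "lat", "lon"),
)

# Indexers that share a dimension are paired element-wise.
# "points" is an arbitrary name for that new dimension.
ind_x = xr.DataArray([1, 2], dims="points")
ind_y = xr.DataArray([3, 4], dims="points")

picked = air.isel(time=0, lat=ind_y, lon=ind_x)
print(picked.shape)  # (2,)
```

The result has a single `points` dimension with one value per (lat, lon) pair.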

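Edit:

Assuming your data is in a dask-backed xarray object and you don't want to load all of it into memory, you need to use vindex on the dask array behind the xarray data object, for example x.air.data.vindex[0, ind_y, ind_x].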
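A minimal sketch of that approach, assuming dask is installed. The synthetic array stands in for the real dataset, and chunking along `time` is just one simple way to get a dask-backed array.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in (assumption) with dims (time, lat, lon).
air = xr.DataArray(
    np.arange(2 * 5 * 6).reshape(2, 5, 6),
    dims=("time", "lat", "lon"),
)

# Chunking converts the underlying data to a dask array.
air = air.chunk({"time": 1})

ind_x = [1, 2]
ind_y = [3, 4]

# vindex pairs the index lists element-wise; the result is still lazy.
extract = air.data.vindex[0, ind_y, ind_x]
print(extract.shape)  # (2,)

# Values are only materialized when compute() is called.
result = extract.compute()
```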

To compare speed, I ran a test with the different methods.

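from pathlib import Path
from typing import List

import numpy as np
import xarray
from netCDF4 import Dataset


def method_1(file_paths: List[Path], indices) -> List[np.ndarray]:
    # Index the netCDF4 variable directly, without xarray.
    data = []
    for file in file_paths:
        d = Dataset(file, 'r')
        data.append(d.variables['hrv'][indices])
        d.close()
    return data


def method_2(file_paths: List[Path], indices) -> List[np.ndarray]:
    # .values loads the whole variable into memory before indexing.
    data = []
    for file in file_paths:
        data.append(xarray.open_dataset(file, engine='h5netcdf').hrv.values[indices])
    return data


def method_3(file_paths: List[Path], indices) -> List[np.ndarray]:
    # open_mfdataset gives a dask-backed array; vindex indexes it pointwise
    # and compute() triggers the actual read.
    data = []
    for file in file_paths:
        data.append(xarray.open_mfdataset([file], engine='h5netcdf').hrv.data.vindex[indices].compute())
    return data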

In [1]: len(file_paths)
Out[1]: 4813

Results:

method_1 (using the netcdf4 library): 101.9s
method_2 (using xarray and the values API): 591.4s
method_3 (using xarray+dask): 688.7s
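The post does not show how these timings were taken. One simple way to collect such numbers is a small wall-clock helper like the sketch below; `time_call` is a hypothetical name introduced here, and `sum` is only a stand-in workload.

```python
import time
from typing import Any, Callable, Tuple


def time_call(func: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Run func(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Stand-in workload; in the benchmark this would be something like
# time_call(method_1, file_paths, indices).
result, elapsed = time_call(sum, range(1000))
print(f"{elapsed:.4f}s")
```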

My guess is that xarray+dask spends a lot of time in the .compute step.
