How do I set the "category" data type for a pyarrow Table column?

Question

I learned that the category type can be preserved when writing a pandas DataFrame to a Parquet file with to_parquet.

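In my case, I already have a pyarrow Table to start with. Can I set one of its columns to the category type? If so, how? (I could not find any hints on Google or in the pyarrow documentation.)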

Thanks for your help! Best regards

Answer

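In pyarrow, the categorical type is called the "dictionary type". A pyarrow array (or a table column, which is a ChunkedArray) can be converted to this type with the dictionary_encode() method: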

>>> import pyarrow as pa
>>> table = pa.table({'a': ['A','B','A']})
>>> table.schema
a: string

>>> table.column('a')
<pyarrow.lib.ChunkedArray object at 0x7f1f94fb9938>
[
  [
    "A","B","A"
  ]
]

>>> table.column('a').dictionary_encode()
<pyarrow.lib.ChunkedArray object at 0x7f1f94fb9b48>
[

  -- dictionary:
    [
      "A","B"
    ]
  -- indices:
    [
      0,1,0
    ]
]

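Updating the table with this newly encoded column is a little more involved, because tables are immutable and set_column() returns a new table, but it can be done like this: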

>>> table2 = table.set_column(0,"a",table.column('a').dictionary_encode())
>>> table2.schema
a: dictionary<values=string,indices=int32,ordered=0>
