Writing a Pandas df to a Pyarrow Parquet table results in an "out of bounds" timestamp error

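Problem description

When I try to convert a pandas DataFrame to a pyarrow table and write it to a Parquet dataset, I get an "out of bounds timestamp" error. After some research, I believe this happens because pandas uses nanosecond precision, whereas pyarrow only interprets timestamps to millisecond precision.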

import os

import cx_Oracle
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Credentials and the service name are read from environment variables.
connection = cx_Oracle.connect(os.getenv('USER'), os.getenv('__OPW'), os.getenv('DB_SERVICE'))
# Read the query result in chunks of 1,000 rows.
gen = pd.read_sql('SELECT * FROM myschema.mytable where rownum < 10001', con=connection, chunksize=1_000)
for df in gen:
    table = pa.Table.from_pandas(df)
    pq.write_to_dataset(table, root_path='/tmp/dataset', partition_cols=['my_part_col'])

ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 253402214400000000

When I comment out the last line:

# pq.write_to_dataset(table,partition_cols=['my_part_col'])

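...and re-run, the error no longer occurs, so it is probably caused by the conversion from the pyarrow table to Parquet.

Is there a known workaround?

Thanks.

Update:

Here is the full traceback...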

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/Users/myusername/miniconda3/envs/py38/lib/python3.8/site-packages/pyarrow/parquet.py", line 1754, in write_to_dataset
    df = table.to_pandas()
  File "pyarrow/array.pxi", line 715, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas
  File "/Users/myusername/miniconda3/envs/py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 779, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "/Users/myusername/miniconda3/envs/py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1114, in _table_to_blocks
    result = pa.lib.table_to_blocks(options, block_table,
  File "pyarrow/table.pxi", line 1028, in pyarrow.lib.table_to_blocks
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 253402214400000000

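Solution

253402214400000000 microseconds since the epoch is 9999-12-31 00:00:00 UTC, on the threshold of the year 10,000. Converted to nanoseconds, that value no longer fits in a signed 64-bit integer, whose range ends in the year 2262.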
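The arithmetic can be checked with the standard library alone. This is a small sketch, not part of the original answer; the variable names are illustrative, and the only input is the value quoted in the error message.

```python
from datetime import datetime, timedelta, timezone

# The value reported in the ArrowInvalid message, in microseconds since the Unix epoch.
value_us = 253402214400000000

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
as_datetime = epoch + timedelta(microseconds=value_us)
print(as_datetime.isoformat())  # 9999-12-31T00:00:00+00:00

# A nanosecond timestamp is stored in a signed 64-bit integer, so the latest
# representable instant is 2**63 - 1 nanoseconds after the epoch.
max_ns = 2**63 - 1
latest_ns_datetime = epoch + timedelta(microseconds=max_ns // 1000)
print(latest_ns_datetime.isoformat())  # 2262-04-11T23:47:16.854775+00:00

# Converting the value to nanoseconds exceeds that limit.
overflows = value_us * 1000 > max_ns
print(overflows)  # True
```

The value fits comfortably in microseconds; it only overflows once something casts the column to nanosecond precision, which is what the `timestamp[us]` to `timestamp[ns]` cast in the error message is attempting.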

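Few libraries support timestamps in this range. You have a few options:

Clip all out-of-range values before converting to Arrow/Parquet.
Convert the offending column to int64 or uint64 (instead of using a timestamp).
Use a date instead of a timestamp. If you are looking that far into the future, you probably do not care about the time of day, and dates have a much larger range.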
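The three options above could be sketched roughly as follows. This is only an illustration under stated assumptions: the column name `expires_at` and the sample values are hypothetical, the frame stands in for one chunk from `pd.read_sql`, and the out-of-range datetimes are assumed to arrive as plain Python `datetime` objects in an object column with no missing values.

```python
from datetime import datetime, timedelta

import pandas as pd

# Hypothetical stand-in for one chunk returned by pd.read_sql: out-of-range
# datetimes are held as plain Python datetime objects in an object column.
df = pd.DataFrame({
    "expires_at": pd.Series(
        [datetime(2020, 1, 1), datetime(9999, 12, 31)], dtype=object
    )
})

# Option 1: clip to an upper bound that fits in nanosecond precision.
# 2262-04-11 is just below pd.Timestamp.max; the exact bound is a choice.
upper = datetime(2262, 4, 11)
clipped = df["expires_at"].map(lambda v: min(v, upper))
df["expires_clipped"] = pd.to_datetime(clipped)

# Option 2: store microseconds since the Unix epoch as plain integers.
epoch = datetime(1970, 1, 1)
df["expires_us"] = [
    (v - epoch) // timedelta(microseconds=1) for v in df["expires_at"]
]

# Option 3: keep only the calendar date, which has no nanosecond limit.
df["expires_date"] = df["expires_at"].map(lambda v: v.date())

# Drop the original object column before handing the frame to pyarrow.
df = df.drop(columns=["expires_at"])
print(df.dtypes)
```

In practice you would pick one of the three columns rather than keep all of them. Clipping changes the stored value, the integer column loses its timestamp type, and the date column discards the time of day, so the right choice depends on how the data is read later.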

Edit:

If this is how your database represents invalid or missing dates, you should replace all of these dates with pd.NaT before converting to Arrow.
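A minimal sketch of that replacement, assuming the database uses 9999-12-31 as its "no date" marker. The `SENTINEL` constant, the column name `expires_at`, and the sample data are hypothetical; adjust them to whatever value the database actually stores.

```python
from datetime import datetime

import pandas as pd

# Assumption: the database marks missing dates with this sentinel value.
SENTINEL = datetime(9999, 12, 31)

# Hypothetical chunk with one real date and one sentinel, held as objects.
df = pd.DataFrame({
    "expires_at": pd.Series([datetime(2020, 1, 1), SENTINEL], dtype=object)
})

# Swap the sentinel for pd.NaT, then let pandas build a proper datetime column.
cleaned = df["expires_at"].map(lambda v: pd.NaT if v == SENTINEL else v)
df["expires_at"] = pd.to_datetime(cleaned)

print(df)
```

Once the sentinel is gone, every remaining value fits in the nanosecond range, so the column becomes a regular datetime column with a missing value where the sentinel was.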

