Problem description
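When I try to convert a pandas DataFrame to a pyarrow Table and write it to a Parquet dataset, I get an out of bounds timestamp
error message. From some research, I believe this results from pandas using nanosecond precision while pyarrow can only interpret down to millisecond precision.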
import os

import cx_Oracle
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Credentials and service name are read from environment variables.
connection = cx_Oracle.connect(os.getenv('USER'), os.getenv('__OPW'), os.getenv('DB_SERVICE'))
# Read the query result in chunks of 1,000 rows.
gen = pd.read_sql('SELECT * FROM myschema.mytable where rownum < 10001', con=connection, chunksize=1_000)
for df in gen:
    table = pa.Table.from_pandas(df)
    pq.write_to_dataset(table, root_path='/tmp/dataset', partition_cols=['my_part_col'])
ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 253402214400000000
When I comment out the last line:
# pq.write_to_dataset(table,partition_cols=['my_part_col'])
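...and re-run, the error message no longer appears, so the conversion from the pyarrow Table to Parquet is probably what triggers it.
Is there a known workaround?
Thanks.
Update:
Here is the full traceback...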
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/Users/myusername/miniconda3/envs/py38/lib/python3.8/site-packages/pyarrow/parquet.py", line 1754, in write_to_dataset
    df = table.to_pandas()
  File "pyarrow/array.pxi", line 715, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas
  File "/Users/myusername/miniconda3/envs/py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 779, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "/Users/myusername/miniconda3/envs/py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1114, in _table_to_blocks
    result = pa.lib.table_to_blocks(options, block_table,
  File "pyarrow/table.pxi", line 1028, in pyarrow.lib.table_to_blocks
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 253402214400000000
Solution
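253402214400000000 microseconds since the epoch is 9999-12-31 00:00:00 UTC, i.e. right at the year 10,000. A 64-bit nanosecond timestamp, which is what the cast to timestamp[ns] needs, only reaches 2262-04-11.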
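The arithmetic can be checked with the standard library alone. This is only a small sketch; the variable names are illustrative and do not come from the original post.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Value reported in the ArrowInvalid message, in microseconds since the epoch.
value_us = 253402214400000000
as_datetime = EPOCH + timedelta(microseconds=value_us)

# Largest instant representable as signed 64-bit nanoseconds since the epoch,
# truncated to whole microseconds.
ns_limit_us = (2**63 - 1) // 1000
ns_limit = EPOCH + timedelta(microseconds=ns_limit_us)

print(as_datetime.isoformat())   # the date behind the failing value
print(ns_limit.isoformat())      # the end of the nanosecond range
print(value_us > ns_limit_us)    # True means the cast to ns must overflow
```

The failing value lands on 9999-12-31, far past the nanosecond limit in 2262, which is why the cast from timestamp[us] to timestamp[ns] is rejected.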
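Very few libraries support timestamps in this range. You have a few options:
- Truncate all out-of-range values before converting to Arrow/Parquet
- Convert the problematic columns to int64 or uint64 (instead of using timestamps)
- Use dates instead of timestamps. If you are looking that far into the future, you probably don't care what time of day it is. Dates have a much larger range.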
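The three options above could look roughly like this in pandas. This is a sketch under assumptions: the column name valid_to, the sample rows, the helper names and the cut-off date UPPER are all hypothetical, and the values are assumed to arrive as naive Python datetime objects in an object column, which is how out-of-range dates typically come back from a database driver.

```python
import datetime as dt

import pandas as pd

# Assumed upper bound: a round date just inside the nanosecond range,
# which ends on 2262-04-11.
UPPER = dt.datetime(2262, 4, 11)
EPOCH = dt.datetime(1970, 1, 1)


def clip_to_upper(value, upper=UPPER):
    """Option 1: cap values that lie beyond the supported range."""
    if pd.isna(value):
        return None
    return min(value, upper)


def to_epoch_microseconds(value):
    """Option 2: keep the instant as integer microseconds since the epoch."""
    if pd.isna(value):
        return None
    return (value - EPOCH) // dt.timedelta(microseconds=1)


def to_date(value):
    """Option 3: drop the time of day and keep a plain date."""
    if pd.isna(value):
        return None
    return value.date()


# Hypothetical stand-in for one chunk returned by pd.read_sql.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "valid_to": pd.Series(
        [dt.datetime(2020, 1, 1), dt.datetime(9999, 12, 31), None],
        dtype=object,
    ),
})

clipped = pd.Series([clip_to_upper(v) for v in df["valid_to"]])
# The nullable "Int64" dtype keeps missing values without falling back to floats.
as_micros = pd.array([to_epoch_microseconds(v) for v in df["valid_to"]], dtype="Int64")
as_dates = pd.Series([to_date(v) for v in df["valid_to"]], dtype=object)
```

Whichever result is chosen would replace the original column before calling pa.Table.from_pandas. Clipping changes the data, the integer column loses its timestamp meaning in the Parquet schema, and the date column loses the time of day, so the right choice depends on how the values are used downstream.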
Edit:
If this is how your database represents invalid/missing dates, you should replace all of those dates with pd.NaT before converting to Arrow.
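A minimal sketch of that replacement follows. It assumes the sentinel is the 9999-12-31 value seen in the error message and that the column holds naive Python datetime objects; the column name valid_to, the sample rows and the helper sentinel_to_nat are hypothetical.

```python
import datetime as dt

import pandas as pd

# Assumed sentinel: the instant from the error message (9999-12-31 00:00:00).
SENTINEL = dt.datetime(9999, 12, 31)


def sentinel_to_nat(series, sentinel=SENTINEL):
    """Return a datetime Series with missing and sentinel values set to NaT."""
    cleaned = [pd.NaT if pd.isna(v) or v >= sentinel else v for v in series]
    return pd.to_datetime(pd.Series(cleaned, index=series.index, name=series.name))


# Hypothetical stand-in for one chunk returned by pd.read_sql.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "valid_to": pd.Series(
        [dt.datetime(2020, 1, 1), dt.datetime(9999, 12, 31), None],
        dtype=object,
    ),
})

df["valid_to"] = sentinel_to_nat(df["valid_to"])
```

After this step the column is an ordinary datetime column with missing values, so the pa.Table.from_pandas and pq.write_to_dataset calls from the question can be run on each chunk unchanged.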