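Converting a pandas DataFrame to a PySpark DataFrame with pyarrow optimization does not work

Problem description

When I try to convert a pandas DataFrame to a PySpark DataFrame like this: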

import logging

import boto3
from botocore.exceptions import ClientError


def upload_file(file_name, bucket, object_name=None):
    """Upload a file to an S3 bucket -> from aws docs

    :param file_name: File to upload
    :param bucket: Bucket to upload to
    :param object_name: S3 object name. If not specified then file_name is used
    :return: True if file was uploaded, else False
    """

    # If S3 object_name was not specified, use file_name
    if object_name is None:
        object_name = file_name

    # Upload the file; the bucket must be passed between file_name and object_name
    s3_client = boto3.client('s3')
    try:
        s3_client.upload_file(file_name, bucket, object_name,
                              ExtraArgs={'ContentType': "text/html"})
    except ClientError as e:
        logging.error(e)
        return False
    return True

The following call fails with an error:

df = spark.createDataFrame(pd.DataFrame({'a': [1, 2], 'b': [4, 5]}))

I also set ARROW_PRE_0_15_IPC_FORMAT=1, as the Spark documentation recommends for pyarrow >= 0.15.0, but that did not help.
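
For reference, here is a minimal sketch of one way to make that variable visible to both the driver and the executor-side Python workers from inside a script. The helper names legacy_arrow_env_conf and build_session are hypothetical, and the app name is a placeholder. The Spark documentation describes putting the variable in conf/spark-env.sh; setting it from Python as below is an assumption about an equivalent setup, not a confirmed fix for this error.

```python
import os

ENV_NAME = "ARROW_PRE_0_15_IPC_FORMAT"


def legacy_arrow_env_conf():
    """Set the variable for the driver and return Spark conf for executors."""
    # Driver side: set it before the SparkSession is created so that
    # processes started afterwards inherit it.
    os.environ[ENV_NAME] = "1"
    # Executor side: spark.executorEnv.<NAME> exports NAME to executor processes.
    return {"spark.executorEnv." + ENV_NAME: "1"}


def build_session(app_name="arrow-test"):
    """Create a SparkSession with the compatibility variable applied."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in legacy_arrow_env_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

If a session already exists in the process, getOrCreate() returns it unchanged, so the variable would have to be set before that first session was started.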

pyspark version: 2.4.0
pyarrow version: 0.13.0 (the error also occurs with pyarrow 0.16.0 and 1.0.1)
pandas version: 1.0.3
Java version: 1.8.0_201
Python version: 3.7.4
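
To show which of the pyarrow versions listed above fall under the pyarrow >= 0.15.0 rule that ARROW_PRE_0_15_IPC_FORMAT targets, here is a small sketch. The function names are hypothetical, and the parser assumes plain numeric release strings such as "0.16.0" (no "rc" or "dev" suffixes).

```python
def parse_version(text):
    """Turn '0.16.0' into (0, 16, 0); assumes plain numeric release versions."""
    return tuple(int(part) for part in text.split("."))


def needs_legacy_ipc_format(pyarrow_version):
    """True when pyarrow is at or above 0.15.0, the threshold named in the Spark docs."""
    return parse_version(pyarrow_version) >= (0, 15, 0)


for version in ("0.13.0", "0.16.0", "1.0.1"):
    print(version, needs_legacy_ipc_format(version))
```

By this check the variable only applies to the 0.16.0 and 1.0.1 attempts. With pyarrow 0.13.0 it should make no difference, which suggests the failure on 0.13.0 may have a different cause.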

P.S.: If 'spark.sql.execution.arrow.fallback.enabled' is set to 'true', the conversion works, but without the pyarrow optimization. Unfortunately, my pandas DataFrame is large, so I need the pyarrow optimization.
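
The two options involved can be sketched as follows. 'spark.sql.execution.arrow.enabled' is the Spark 2.x name of the switch that turns the Arrow path on; it is not quoted in the question and is included here as an assumption about the setup. The helper names arrow_settings and apply_arrow_settings are hypothetical.

```python
ARROW_ENABLED = "spark.sql.execution.arrow.enabled"
ARROW_FALLBACK = "spark.sql.execution.arrow.fallback.enabled"


def arrow_settings(allow_fallback):
    """Return the two Arrow-related options as string values."""
    return {
        ARROW_ENABLED: "true",
        ARROW_FALLBACK: "true" if allow_fallback else "false",
    }


def apply_arrow_settings(spark, allow_fallback=False):
    """Apply the options to an existing SparkSession through spark.conf.set."""
    for key, value in arrow_settings(allow_fallback).items():
        spark.conf.set(key, value)
```

Calling apply_arrow_settings(spark, allow_fallback=False) makes an Arrow failure surface as an error instead of silently switching to the slower non-Arrow conversion, which is useful while debugging a case like this one.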

pandas pyarrow pyspark python-3.x