UnicodeEncodeError when transferring ".eml" files to Google Cloud Platform with gsutil v4.6.1 on Linux

Problem description

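When I use the gsutil cp command to transfer files from a Linux system to Google Cloud Platform, it fails on some old ".eml" files that contain non-English characters not encoded in Unicode. The failure happens while gsutil is processing the file contents (not just the file name!).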

The command I tried was:

gsutil cp "/home/darsenlu/Home/mail/Pan/Fw_ japanese_lyrics.eml" gs://darsen_backup_monthly/

The error message was:

UnicodeEncodeError: 'ascii' codec can't encode character '\udca8' in position 22881: ordinal not in range(128)

gsutil rsync gives a very similar error. Position 22881 (0x5961) is near the end of the multipart email file. A hex dump of the file content at that point is shown below:

00005960: 20a8 43a4 d1b3 a320 5961 686f 6f21 a95f   .C.... Yahoo!._
00005970: bcaf 203e 2020 7777 772e 7961 686f 6f2e  .. >  www.yahoo.
00005980: 636f 6d2e 7477 0d0a                      com.tw..

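At position 0x5961 we see the byte 0xa8, which is what the error message identifies as the source of the problem. For some reason, gsutil is trying to encode the text. Opening the file in a terminal that supports Chinese characters, we see: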

< 每天都 Yahoo!奇摩 >  www.yahoo.com.tw
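The '\udca8' in the error message can be traced back to the byte 0xa8 by hand. The following is a minimal sketch, assuming the file content was decoded as ASCII with Python's surrogateescape error handler, which maps an undecodable byte 0xNN to the lone surrogate U+DCNN. The variable names are illustrative only.

```python
# Bytes copied from the hex dump above, offsets 0x5960 through 0x5972.
raw = bytes.fromhex("20a843a4d1b3a3205961686f6f21a95fbcaf20")

# Decoding as ASCII with surrogateescape turns byte 0xa8 into U+DCA8.
escaped = raw.decode("ascii", errors="surrogateescape")
print(ascii(escaped[1]))

# Re-encoding that string as strict ASCII fails at the surrogate.
try:
    escaped.encode("ascii")
except UnicodeEncodeError as exc:
    print(exc)

# Decoding the same bytes as Big-5 recovers the readable text.
text = raw.decode("big5")
print(text)
```

Because every byte before offset 0x5961 is plain ASCII, the character index reported in the error (22881) is the same as the byte offset in the file.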

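In Big-5 encoding, the first Chinese character, "每", is 0xa843. A simple workaround is to rename the file extension to something other than ".eml", for example ".eml.bak", so that gsutil does not process the file contents. Unfortunately, in a bulk transfer it is hard to know in advance which files contain such non-English text, and the whole process may stop many times.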
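One way to find the affected files before a bulk transfer is to scan for ".eml" files that contain any byte outside the ASCII range. This is only a sketch: the function name find_non_ascii_eml is hypothetical, and it assumes that any non-ASCII byte in the content is enough to trigger the failure.

```python
from pathlib import Path


def find_non_ascii_eml(root):
    """Yield (path, offset) for each .eml file under root that contains
    a non-ASCII byte; offset is the position of the first such byte."""
    for path in sorted(Path(root).rglob("*.eml")):
        data = path.read_bytes()
        try:
            data.decode("ascii")
        except UnicodeDecodeError as exc:
            # exc.start is the index of the first byte that is not ASCII.
            yield path, exc.start
```

Calling list(find_non_ascii_eml("/home/darsenlu/Home/mail")) would give the files to rename or handle separately before running gsutil cp or gsutil rsync.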

The full error output follows:

darsenlu@devmodel:~/Home$ gsutil cp "/home/darsenlu/Home/mail/Pan/Fw_ japanese_lyrics.eml" gs://darsen_backup_monthly/
Copying file:///home/darsenlu/Home/mail/Pan/Fw_ japanese_lyrics.eml [Content-Type=message/rfc822]...
Traceback (most recent call last):
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gsutil", line 21, in <module>
    gsutil.RunMain()
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gsutil.py", line 122, in RunMain
    sys.exit(gslib.__main__.main())
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/__main__.py", line 444, in main
    user_project=user_project)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/__main__.py", line 780, in _RunNamedCommandAndHandleExceptions
    _HandleUnknownFailure(e)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/__main__.py", line 639, in _RunNamedCommandAndHandleExceptions
    user_project=user_project)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/command_runner.py", line 411, in RunNamedCommand
    return_code = command_inst.RunCommand()
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/cp.py", line 1124, in RunCommand
    seek_ahead_iterator=seek_ahead_iterator)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/command.py", line 1525, in Apply
    arg_checker, should_return_results, fail_on_error)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/command.py", line 1596, in _SequentialApply
    worker_thread.PerformTask(task, self)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/command.py", line 2316, in PerformTask
    results = task.func(cls, task.args, thread_state=self.thread_gsutil_api)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/cp.py", line 709, in _CopyFuncWrapper
    preserve_posix=cls.preserve_posix_attrs)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/commands/cp.py", line 924, in CopyFunc
    preserve_posix=preserve_posix)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 3957, in PerformCopy
    gzip_encoded=gzip_encoded)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 2250, in _UploadFileToObject
    parallel_composite_upload, logger)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 2066, in _DelegateUploadFileToObject
    elapsed_time, uploaded_object = upload_delegate()
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 2227, in CallNonResumableUpload
    gzip_encoded=gzip_encoded_file)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/utils/copy_helper.py", line 1762, in _UploadFileToObjectNonResumable
    gzip_encoded=gzip_encoded)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/cloud_api_delegator.py", line 388, in UploadObject
    gzip_encoded=gzip_encoded)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/gcs_json_api.py", line 1712, line 1534, in _UploadObject
    global_params=global_params)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/gslib/third_party/storage_apitools/storage_v1_client.py", line 1182, in Insert
    upload=upload, upload_config=upload_config)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/third_party/apitools/apitools/base/py/base_api.py", line 703, in _RunMethod
    download)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/third_party/apitools/apitools/base/py/base_api.py", line 679, in PrepareHttpRequest
    upload.ConfigureRequest(upload_config, http_request, url_builder)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/third_party/apitools/apitools/base/py/transfer.py", line 763, in ConfigureRequest
    self.__ConfigureMultipartRequest(http_request)
  File "/usr/lib/google-cloud-sdk/platform/gsutil/third_party/apitools/apitools/base/py/transfer.py", line 823, in __ConfigureMultipartRequest
    g.flatten(msg_root, unixfrom=False)
  File "/usr/lib/python3.6/email/generator.py", line 116, in flatten
    self._write(msg)
  File "/usr/lib/python3.6/email/generator.py", line 181, in _write
    self._dispatch(msg)
  File "/usr/lib/python3.6/email/generator.py", line 214, in _dispatch
    meth(msg)
  File "/usr/lib/python3.6/email/generator.py", line 272, in _handle_multipart
    g.flatten(part, unixfrom=False, linesep=self._NL)
  File "/usr/lib/python3.6/email/generator.py", line 361, in _handle_message
    payload = self._encode(payload)
  File "/usr/lib/python3.6/email/generator.py", line 412, in _encode
    return s.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\udca8' in position 22881: ordinal not in range(128)

The Linux system is Ubuntu 18.04.4 LTS (GNU/Linux 4.15.0-76-generic x86_64).
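The last frames of the traceback are in Python's own email.generator module, in the handler for message/* parts, which matches the Content-Type=message/rfc822 that gsutil assigned to the ".eml" file. The sketch below tries to trigger the same exception with the standard library alone. That the upload code builds its multipart body this way is an assumption drawn from the traceback, not from the apitools source, and flatten_as_rfc822 is a hypothetical helper name.

```python
import io
from email.generator import BytesGenerator
from email.mime.multipart import MIMEMultipart
from email.mime.nonmultipart import MIMENonMultipart


def flatten_as_rfc822(content):
    """Wrap raw bytes in a message/rfc822 part and serialize the result."""
    root = MIMEMultipart("related")
    part = MIMENonMultipart("message", "rfc822")
    # set_payload() keeps bytes as str, decoded with surrogateescape.
    part.set_payload(content)
    root.attach(part)
    buffer = io.BytesIO()
    # For a message/* part with a str payload, the generator encodes the
    # payload as strict ASCII, which rejects the escaped surrogate.
    BytesGenerator(buffer, mangle_from_=False).flatten(root, unixfrom=False)
    return buffer.getvalue()


try:
    flatten_as_rfc822(b"Subject: test\r\n\r\n \xa8C Yahoo!\r\n")
except UnicodeEncodeError as exc:
    print(exc)
```

With a pure-ASCII payload the same call succeeds, which would explain why only some ".eml" files stop the transfer.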

Solution

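I replaced your string with Chinese characters and was able to reproduce your error. It was fixed after I updated to gsutil 4.62. See the merged PR and the issue tracker for reference.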

Update the Cloud SDK by running:

gcloud components update
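After updating, it may be worth confirming that the installed gsutil is at least 4.62 before retrying a large transfer. A minimal sketch, assuming that "gsutil version" prints a line of the form "gsutil version: 4.62"; the helper names and the FIXED_IN constant are illustrative.

```python
import re
import subprocess

FIXED_IN = (4, 62)  # version in which the answer reports the error as fixed


def parse_gsutil_version(output):
    """Extract (major, minor) from text such as 'gsutil version: 4.62'."""
    match = re.search(r"gsutil version:\s*(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("no gsutil version found in output")
    return int(match.group(1)), int(match.group(2))


def installed_gsutil_version():
    """Run 'gsutil version' and return the parsed (major, minor) tuple."""
    result = subprocess.run(
        ["gsutil", "version"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_gsutil_version(result.stdout)
```

Comparing installed_gsutil_version() >= FIXED_IN then tells you whether the update took effect.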