PartitionedDataSet not found when running a Kedro pipeline in Docker

Problem description

I have multiple text files in an S3 bucket that I read and process, so I defined a PartitionedDataSet in the Kedro data catalog, which looks like this:

raw_data:
  type: PartitionedDataSet
  path: s3://reads/raw
  dataset: pandas.CSVDataSet
  load_args:
    sep: "\t"
    comment: "#"

In addition, I implemented this solution so that all the secrets in the credentials file, including the AWS secret keys, come from environment variables.
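The linked solution is not reproduced here, but the gist is that the key values are not hard-coded anywhere; locally that amounts to exporting something like the following before kedro run (the values are placeholders, and AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are the standard names that s3fs/botocore also read directly):

# placeholders - substitute the real values for your bucket
export AWS_ACCESS_KEY_ID=<key-id>
export AWS_SECRET_ACCESS_KEY=<secret-key>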

Everything works fine when I run things locally with kedro run, but when I build the Docker image (using kedro-docker) and run the pipeline in the Docker environment with kedro docker run, supplying all the environment variables through the --docker-args option, the run fails.
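For reference, the invocation looks roughly like this; kedro-docker passes whatever is in --docker-args straight through to docker run, and the values below are placeholders:

# placeholder values; the real command passes every variable the pipeline needs
kedro docker run --docker-args="--env AWS_ACCESS_KEY_ID=<key-id> --env AWS_SECRET_ACCESS_KEY=<secret-key>"

It fails with the following traceback: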

Traceback (most recent call last):
  File "/usr/local/bin/kedro", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/kedro/framework/cli/cli.py", line 724, in main
    cli_collection()
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/kedro/kedro_cli.py", line 230, in run
    pipeline_name=pipeline,
  File "/usr/local/lib/python3.7/site-packages/kedro/framework/context/context.py", line 767, in run
    raise exc
  File "/usr/local/lib/python3.7/site-packages/kedro/framework/context/context.py", line 759, in run
    run_result = runner.run(filtered_pipeline, catalog, run_id)
  File "/usr/local/lib/python3.7/site-packages/kedro/runner/runner.py", line 101, in run
    self._run(pipeline, run_id)
  File "/usr/local/lib/python3.7/site-packages/kedro/runner/sequential_runner.py", line 90, in _run
    run_node(node, self._is_async, line 213, in run_node
    node = _run_node_sequential(node, line 221, in _run_node_sequential
    inputs = {name: catalog.load(name) for name in node.inputs}
  File "/usr/local/lib/python3.7/site-packages/kedro/runner/runner.py", in <dictcomp>
    inputs = {name: catalog.load(name) for name in node.inputs}
  File "/usr/local/lib/python3.7/site-packages/kedro/io/data_catalog.py", line 392, in load
    result = func()
  File "/usr/local/lib/python3.7/site-packages/kedro/io/core.py", in load
    return self._load()
  File "/usr/local/lib/python3.7/site-packages/kedro/io/partitioned_data_set.py", line 240, in _load
    raise DataSetError("No partitions found in `{}`".format(self._path))
kedro.io.core.DataSetError: No partitions found in `s3://reads/raw`

Note: if I move the files to a local directory, redefine the PartitionedDataSet accordingly, build the Docker image, and supply the environment variables via --docker-args, the pipeline runs fine in the Docker environment.

Solution

The solution (at least in my case) was to also provide the AWS_DEFAULT_REGION environment variable in the kedro docker run command.
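For illustration, the working invocation then looks roughly like this (the key, secret, and region values are placeholders):

# adding AWS_DEFAULT_REGION alongside the other variables is what fixed the S3 lookup for me
kedro docker run --docker-args="--env AWS_ACCESS_KEY_ID=<key-id> --env AWS_SECRET_ACCESS_KEY=<secret-key> --env AWS_DEFAULT_REGION=<bucket-region>"

With the region set, the PartitionedDataSet finds the partitions under s3://reads/raw as expected.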