Problem description
I have multiple text files in an S3 bucket that I read and process. I therefore defined a PartitionedDataSet in the Kedro data catalog as follows:
raw_data:
  type: PartitionedDataSet
  path: s3://reads/raw
  dataset: pandas.CSVDataSet
  load_args:
    sep: "\t"
    comment: "#"
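In addition, I implemented this solution so that all secrets in the credentials file, including the AWS secret keys, are read from environment variables.
When I run things locally with kedro run, everything works fine. But when I build a Docker image (using kedro-docker) and run the pipeline in the Docker environment with kedro docker run, passing all the environment variables through the --docker-args option, I get the following error: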
Traceback (most recent call last):
  File "/usr/local/bin/kedro", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/kedro/framework/cli/cli.py", line 724, in main
    cli_collection()
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/kedro/kedro_cli.py", line 230, in run
    pipeline_name=pipeline,
  File "/usr/local/lib/python3.7/site-packages/kedro/framework/context/context.py", line 767, in run
    raise exc
  File "/usr/local/lib/python3.7/site-packages/kedro/framework/context/context.py", line 759, in run
    run_result = runner.run(filtered_pipeline, catalog, run_id)
  File "/usr/local/lib/python3.7/site-packages/kedro/runner/runner.py", line 101, in run
    self._run(pipeline, run_id)
  File "/usr/local/lib/python3.7/site-packages/kedro/runner/sequential_runner.py", line 90, in _run
    run_node(node, self._is_async, line 213, in run_node
    node = _run_node_sequential(node, line 221, in _run_node_sequential
    inputs = {name: catalog.load(name) for name in node.inputs}
  File "/usr/local/lib/python3.7/site-packages/kedro/runner/runner.py", in <dictcomp>
    inputs = {name: catalog.load(name) for name in node.inputs}
  File "/usr/local/lib/python3.7/site-packages/kedro/io/data_catalog.py", line 392, in load
    result = func()
  File "/usr/local/lib/python3.7/site-packages/kedro/io/core.py", in load
    return self._load()
  File "/usr/local/lib/python3.7/site-packages/kedro/io/partitioned_data_set.py", line 240, in _load
    raise DataSetError("No partitions found in `{}`".format(self._path))
kedro.io.core.DataSetError: No partitions found in `s3://reads/raw`
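Note: the pipeline works fine if I move the files to a local directory, define the PartitionedDataSet on that directory, build the Docker image, and pass the environment variables through --docker-args.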
Solution
The solution (at least in my case) was to provide the AWS_DEFAULT_REGION environment variable in the kedro docker run command.
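
As a rough illustration of the fix, the sketch below assembles the --docker-args value so that the region is passed along with the credentials, and fails early if any variable is missing. The helper names (build_docker_args, build_command), the REQUIRED_AWS_VARS tuple, and the example region eu-west-1 are assumptions made for this example, not part of Kedro or kedro-docker. It also assumes the credentials are supplied through the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables; adjust the list to whatever your credentials file actually reads.

```python
import shlex
from typing import List, Mapping, Sequence

# Assumed set of variables; AWS_DEFAULT_REGION is the one that was missing above.
REQUIRED_AWS_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
)


def build_docker_args(
    env: Mapping[str, str], names: Sequence[str] = REQUIRED_AWS_VARS
) -> str:
    """Return a `--env NAME=value ...` string for the --docker-args option."""
    missing = [name for name in names if not env.get(name)]
    if missing:
        # Fail before starting the container instead of at catalog load time.
        raise ValueError("missing environment variables: " + ", ".join(missing))
    # Quote each value so it survives shell-style splitting.
    return " ".join(f"--env {name}={shlex.quote(env[name])}" for name in names)


def build_command(env: Mapping[str, str]) -> List[str]:
    """Return the full `kedro docker run` command as an argument list."""
    return ["kedro", "docker", "run", f"--docker-args={build_docker_args(env)}"]


# Placeholder values for illustration only; in practice pass os.environ.
example_env = {
    "AWS_ACCESS_KEY_ID": "my-key-id",
    "AWS_SECRET_ACCESS_KEY": "my-secret",
    "AWS_DEFAULT_REGION": "eu-west-1",
}
print(shlex.join(build_command(example_env)))
```

The returned list can be handed to subprocess.run as-is. Note that passing values on the command line exposes them in the process list, so this is only a sketch of which variables need to reach the container.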