Composer workflow fails at the Dataproc operator

Problem description

I have a Composer environment set up in GCP that runs a DAG like this:

# Tasks defined inside the `with` block are attached to the DAG automatically.
with DAG('sample-dataproc-dag',
         default_args=DEFAULT_DAG_ARGS,
         schedule_interval=None) as dag:

    # Submit the PySpark job.
    submit_pyspark = DataProcPySparkOperator(
        task_id='run_dataproc_pyspark',
        main='gs://.../dataprocjob.py',
        cluster_name='xyz',
        dataproc_pyspark_jars='gs://.../spark-bigquery-latest_2.12.jar')

    simple_bash = BashOperator(
        task_id='simple-bash',
        bash_command='ls -la')

    # Run the bash task before submitting the PySpark job.
    submit_pyspark.set_upstream(simple_bash)

Here is my dataprocjob.py:

from pyspark.sql import SparkSession


if __name__ == '__main__':
    spark = SparkSession.builder.appName('Jupyter BigQuery Storage').getOrCreate()
    table = 'projct.dataset.txn_w_ah_demo'
    df = spark.read.format('bigquery').option('table', table).load()
    df.printSchema()

My Composer pipeline fails at the Dataproc step. This is what I see in the Composer logs stored in GCS:

[2020-09-23 21:40:02,849] {taskinstance.py:1059} ERROR - <HttpError 403 when requesting https://dataproc.googleapis.com/v1beta2/projects/lt-dia-pop-dis-upr/regions/global/jobs?clusterName=dppoppr004&alt=json returned "Not authorized to requested resource.">@-@{"workflow": "sample-dataproc-dag", "task-id": "run_dataproc_pyspark", "execution-date": "2020-09-23T21:39:42.371933+00:00"}
Traceback (most recent call last):
  File "/usr/local/lib/airflow/airflow/models/taskinstance.py", line 930, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/airflow/airflow/contrib/operators/dataproc_operator.py", line 1139, in execute
    super(DataProcPySparkOperator, self).execute(context)
  File "/usr/local/lib/airflow/airflow/contrib/operators/dataproc_operator.py", line 707, in execute
    self.hook.submit(self.hook.project_id, self.job, self.region, self.job_error_states)
  File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataproc_hook.py", line 311, in submit
    num_retries=self.num_retries)
  File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataproc_hook.py", line 51, in __init__
    clusterName=cluster_name).execute()
  File "/opt/python3.6/lib/python3.6/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/opt/python3.6/lib/python3.6/site-packages/googleapiclient/http.py", line 851, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 403 when requesting https://dataproc.googleapis.com/v1beta2/projects/lt-dia-pop-dis-upr/regions/global/jobs?clusterName=dppoppr004&alt=json returned "Not authorized to requested resource.">

Answer

At first glance, it looks like the Google Cloud account calling the Dataproc API does not have sufficient permissions for the operator.


The issue you are seeing appears to come down to the Dataproc permissions granted to your application.

According to the documentation, different Dataproc operations require different permissions, for example:

dataproc.clusters.create permits the creation of Cloud Dataproc clusters in the containing project
dataproc.jobs.create permits the submission of Dataproc jobs to Dataproc clusters in the containing project
dataproc.clusters.list permits the listing of details of Dataproc clusters in the containing project

To submit a Dataproc job to an existing cluster, you need both the 'dataproc.clusters.use' and 'dataproc.jobs.create' permissions.
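Those two permission strings are easy to mistype. As a quick sanity check, a small helper like the following (hypothetical, not part of any GCP SDK) can diff the permissions an account actually holds against what job submission needs:

```python
# Permissions required to submit a job to an existing Dataproc cluster,
# per the Dataproc IAM documentation.
REQUIRED = {"dataproc.clusters.use", "dataproc.jobs.create"}

def missing_permissions(granted):
    """Return the required Dataproc permissions absent from `granted`, sorted."""
    return sorted(REQUIRED - set(granted))

# An account that can only list clusters still cannot submit jobs:
print(missing_permissions({"dataproc.clusters.list"}))
# -> ['dataproc.clusters.use', 'dataproc.jobs.create']
```

You could feed this helper the result of the Resource Manager `testIamPermissions` call for your service account to see exactly which permission the 403 is complaining about.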

To grant the correct privileges, you can follow the documentation to update the service account used in your code and add the required permissions.
