来自Apache Beam的BigQuery授权视图

问题描述

我正在尝试使用Apache Beam查询BigQuery中的视图。

该视图有权访问其引用的所有数据集。数据流/ GCE服务帐户可以访问该视图,但不能访问其基础数据集(这应该没有问题)。

当我尝试运行查询授权视图的作业时,出现如下错误

java.lang.RuntimeException: java.io.IOException: Unable to get table: test_13249,aborting after 9 retries.
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl.executeWithRetries(BigQueryServicesImpl.java:1004)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.getTable(BigQueryServicesImpl.java:491)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.getTable(BigQueryServicesImpl.java:477)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.getTable(BigQueryServicesImpl.java:471)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryQueryHelper.executeQuery(BigQueryQueryHelper.java:109)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySourceDef.getTableReference(BigQueryQuerySourceDef.java:113)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQueryQuerySource.getTabletoExtract(BigQueryQuerySource.java:65)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.extractFiles(BigQuerySourceBase.java:110)
    at org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.split(BigQuerySourceBase.java:148)
    at org.apache.beam.runners.dataflow.worker.WorkerCustomSources.splitAndValidate(WorkerCustomSources.java:290)
    at org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplitTyped(WorkerCustomSources.java:212)
    at org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplitWithApiLimit(WorkerCustomSources.java:196)
    at org.apache.beam.runners.dataflow.worker.WorkerCustomSources.performSplit(WorkerCustomSources.java:175)
    at org.apache.beam.runners.dataflow.worker.WorkerCustomSourceOperationExecutor.execute(WorkerCustomSourceOperationExecutor.java:78)
    at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:417)
    at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:386)
    at org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:311)
    at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
    at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
    at org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 403 Forbidden
{
  "code" : 403,"errors" : [ {
    "domain" : "global","message" : "Access Denied: Table my-gcp-project:bigquery_dataset_any.test_13249: User does not have bigquery.tables.get permission for table my-gcp-project:bigquery_dataset_any.test_13249.","reason" : "accessDenied"
  } ],"status" : "PERMISSION_DENIED"
}

解决方法

Beam在处理BigQuery时会变得聪明。之所以会发生此错误,是因为Beam检查了查询所引用的表,而对于授权视图而言,并非总是如此。

要变通解决此问题,可以在withQueryLocationBigQueryIO.readTableRows()中使用方法BigQueryIO.read(SerializableFunction)。这样一来,Beam就可以使用提供的查询位置,而不会推断出一个。

因此:

BigQueryIO.readTableRows()
    .fromQuery("SELECT * FROM my_authorized_view")
    .withQueryLocation("US")  // Whatever location is convenient for you
    ...

这应该可以解决您的问题。