XmlIO Apache梁中的缓冲区溢出

问题描述

我正在使用Google Dataflow中的Apache Beam 2.24.0。 我正在使用XmlIO通过这样的设置读取XML文件

pipeline
    .apply("Read Storage Bucket",XmlIO.read<XmlProduct>()
            .from(sourcePath)
            .withRootElement(xmlProductRoot)
            .withRecordElement(xmlProductRecord)
            .withRecordClass(XmlProduct::class.java)
)

但是,我们有时会通过读取随机xml文件来运行缓冲区溢出异常:

"Error message from worker: java.io.IOException: Failed to start reading from source: gs://path-to-xml-file.xml range [1722550,2684411)
    org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:610)
    org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start(ReadOperation.java:359)
    org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:194)
    org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:159)
    org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:77)
    org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:417)
    org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:386)
    org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:311)
    org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.doWork(DataflowBatchWorkerHarness.java:140)
    org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:120)
    org.apache.beam.runners.dataflow.worker.DataflowBatchWorkerHarness$WorkerThread.call(DataflowBatchWorkerHarness.java:107)
    java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.nio.BufferOverflowException
    java.base/java.nio.Buffer.nextPutIndex(Buffer.java:662)
    java.base/java.nio.HeapByteBuffer.put(HeapByteBuffer.java:196)
    org.apache.beam.sdk.io.xml.XmlSource$XMLReader.getFirstOccurenceOfRecordElement(XmlSource.java:285)
    org.apache.beam.sdk.io.xml.XmlSource$XMLReader.startReading(XmlSource.java:192)
    org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:476)
    org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:249)
    org.apache.beam.runners.dataflow.worker.WorkerCustomSources$BoundedReaderIterator.start(WorkerCustomSources.java:607)
    ... 14 more

我无法使用DirectRunner在本地重现此缓冲区溢出异常。这似乎是随机发生的。例如,对于第二个尝试,重新运行同一xml文件的数据流作业将成功。

例如XmlIO软件包中的ByteReadPacket内部是否没有正确设置某些东西?

解决方法

暂无找到可以解决该程序问题的有效方法,小编努力寻找整理中!

如果你已经找到好的解决方法,欢迎将解决方案带上本链接一起发送给小编。

小编邮箱:dio#foxmail.com (将#修改为@)