Apache Spark Structured Streaming takes a long time to print the output of the word count example

Problem description

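The following program runs a simple word count to test Spark Structured Streaming. I type words in one terminal and run the program in another. After I type the words, it takes 15 to 20 seconds for the output to appear in the second terminal. Is there a way to reduce this delay? Any help would be appreciated.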

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()
lines = spark \
    .readStream \
    .format("socket") \
    .option("host","localhost") \
    .option("port",9999) \
    .load()

# Split the lines into words
words = lines.select(
   explode(
       split(lines.value," ")
   ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

query = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()


Terminal where I listen on the port and type the words:
C:\Program Files (x86)\Nmap>ncat -lvp 9999
Ncat: Version 7.80 ( https://nmap.org/ncat )
Ncat: Listening on :::9999
Ncat: Listening on 0.0.0.0:9999
Ncat: Connection from 127.0.0.1.
Ncat: Connection from 127.0.0.1:44577.
apacheapaop





apache
spark
apache
hadoop
hello
world
hello
hello guys guys
hello
Output terminal where the word counts are printed:
Batch: 2
-------------------------------------------
+-----------+-----+
|       word|count|
+-----------+-----+
|apacheapaop|    1|
|      hello|    1|
|     apache|    2|
|      spark|    1|
|           |    5|
|     hadoop|    1|
+-----------+-----+

-------------------------------------------
Batch: 3
-------------------------------------------
+-----------+-----+
|       word|count|
+-----------+-----+
|apacheapaop|    1|
|      hello|    2|
|     apache|    2|
|      spark|    1|
|      world|    1|
|           |    5|
|     hadoop|    1|
+-----------+-----+

-------------------------------------------
Batch: 4
-------------------------------------------
+-----------+-----+
|       word|count|
+-----------+-----+
|       guys|    2|
|apacheapaop|    1|
|      hello|    3|
|     apache|    2|
|      spark|    1|
|      world|    1|
|           |    6|
|     hadoop|    1|
+-----------+-----+

-------------------------------------------
Batch: 5
-------------------------------------------
+-----------+-----+
|       word|count|
+-----------+-----+
|       guys|    2|
|apacheapaop|    1|
|      hello|    4|
|     apache|    2|
|      spark|    1|
|      world|    1|
|           |    6|
|     hadoop|    1|
+-----------+-----+

Each batch takes 15 to 20 seconds to appear in the output terminal. How can I reduce this delay?
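
One setting that is often worth checking for a small local test like this is `spark.sql.shuffle.partitions`. Its default is 200, so the `groupBy("word").count()` aggregation schedules 200 shuffle tasks in every micro-batch, even for a handful of words. The sketch below lowers it when the session is built. The value 2 and the helper names `local_test_conf` and `build_session` are illustrative assumptions, not part of the original program, and whether this accounts for the full 15 to 20 seconds on this machine has not been verified.

```python
# Assumption: a small value suits a single-machine test; Spark's default is 200.
SHUFFLE_PARTITIONS = 2


def local_test_conf(shuffle_partitions=SHUFFLE_PARTITIONS):
    """Return Spark settings for a small local test (hypothetical helper)."""
    return {"spark.sql.shuffle.partitions": str(shuffle_partitions)}


def build_session(app_name="StructuredNetworkWordCount", conf=None):
    """Create the SparkSession with the settings applied (hypothetical helper)."""
    # Imported here so the settings helper can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in (conf or local_test_conf()).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

In the program above, `spark = build_session()` would take the place of the existing `SparkSession.builder` chain, and the rest of the query would stay unchanged. The same setting can also be passed at submit time with `--conf spark.sql.shuffle.partitions=2`.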


apache-spark apache-spark-sql pyspark spark-streaming spark-structured-streaming