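Problem description
I am trying to find the average word length per paragraph. The data comes from a text file in the format 1|For more than five years..., where every line starts with a paragraph number.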
This is my code so far:
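from pyspark import SparkContext, SparkConf

sc = SparkContext('local', 'longest')
text = sc.textFile("walden.txt")
# Key each line by the paragraph ID that precedes the first "|"
lines = text.map(lambda line: (line.split("|")[0], line))
lines = lines.filter(lambda kv: len(kv[1]) > 0)
# Strip the "ID|" prefix from the text (only IDs 1-3 are handled)
words = lines.mapValues(lambda x: x.replace("1|", "").replace("2|", "").replace("3|", ""))
words = words.mapValues(lambda x: x.split())
# Turn every word into a (word length, 1) tuple
words = words.mapValues(lambda x: [(len(i), 1) for i in x])
# a + b concatenates the tuple lists of a key; it does not sum them
words = words.reduceByKey(lambda a, b: a + b)
words.saveAsTextFile("results")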
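In the current output, '1', '2', and '3' are the paragraph IDs, and each value is a list of tuples in the form (word length, 1).
What I need to do is sum the tuple values per key (paragraph ID), so that (2,1),(6,1),(1,1) becomes (9,3), and then divide the two totals (9/3) to get the average word length of each paragraph.
I have tried many different approaches but cannot get it to work. Any help is greatly appreciated!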
Solution
For your RDD, try this:
text = sc.textFile("test.txt")
# Key each line by its paragraph ID (the part before the first "|")
lines = text.map(lambda line: (line.split("|")[0], line))
lines = lines.filter(lambda kv: len(kv[1]) > 0)
# Drop the "ID|" prefix; this only covers IDs 1-3
words = lines.mapValues(lambda x: x.replace("1|", "").replace("2|", "").replace("3|", ""))
words = words.mapValues(lambda x: x.split())
# Replace each word with its length, then average the lengths of each line
words = words.mapValues(lambda x: [len(i) for i in x])
words = words.mapValues(lambda x: sum(x) / len(x))
words.collect()
[('1', 4.0), ('2', 5.4), ('3', 7.0)]
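The question asks for (word length, 1) pairs to be summed per paragraph ID and divided afterwards. A minimal sketch of that accumulation in plain Python follows, so the arithmetic can be checked without a cluster. The helper names parse_line, add_pairs, and average_word_length are hypothetical, and the sample lines are an assumption based on the input format described in the question.

```python
from collections import defaultdict


def parse_line(line):
    """Split 'ID|text' on the first '|' only, so any paragraph ID works."""
    para_id, _, text = line.partition("|")
    return para_id, text


def add_pairs(a, b):
    """Combine two (total length, word count) pairs element-wise."""
    return (a[0] + b[0], a[1] + b[1])


def average_word_length(lines):
    """Return {paragraph ID: mean word length}, skipping empty paragraphs."""
    totals = defaultdict(lambda: (0, 0))
    for line in lines:
        para_id, text = parse_line(line)
        for word in text.split():
            # Each word contributes one (word length, 1) pair to its paragraph
            totals[para_id] = add_pairs(totals[para_id], (len(word), 1))
    return {k: s / n for k, (s, n) in totals.items() if n > 0}


# Assumed sample input in the "ID|text" format
sample = [
    "1|For more than five years",
    "2|For moasdre than five asdfyears",
    "3|Fasdfor more thasdfan fidafve yearasdfs",
]
result = average_word_length(sample)
print(result)
```

In PySpark, a function like add_pairs could probably be passed to reduceByKey after a flatMapValues step that emits one (len(word), 1) pair per word, with a final mapValues(lambda t: t[0] / t[1]). That would also handle paragraphs that span several lines, but it is untested here.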
I also got it working with a DataFrame.
import pyspark.sql.functions as f

# `spark` is assumed to be an existing SparkSession
df = spark.read.option("inferSchema", "true").option("sep", "|").csv("test.txt").toDF("col1", "col2")
df.show(10, False)
+----+---------------------------------------+
|col1|col2                                   |
+----+---------------------------------------+
|1   |For more than five years               |
|2   |For moasdre than five asdfyears        |
|3   |Fasdfor more thasdfan fidafve yearasdfs|
+----+---------------------------------------+
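# Split col2 into words, then derive the word lengths, their sum, the word count, and the average.
# aggregate() needs an initial value (0 here) before the merge lambda.
df.withColumn('array', f.split('col2', r'[ ][ ]*')) \
  .withColumn('count_arr', f.expr('transform(array, x -> LENGTH(x))')) \
  .withColumn('sum', f.expr('aggregate(array, 0, (acc, x) -> acc + LENGTH(x))')) \
  .withColumn('size', f.size('array')) \
  .withColumn('avg', f.col('sum') / f.col('size')) \
  .show(10, False)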
+----+---------------------------------------+---------------------------------------------+---------------+---+----+---+
|col1|col2                                   |array                                        |count_arr      |sum|size|avg|
+----+---------------------------------------+---------------------------------------------+---------------+---+----+---+
|1   |For more than five years               |[For, more, than, five, years]               |[3, 4, 4, 4, 5]|20 |5   |4.0|
|2   |For moasdre than five asdfyears        |[For, moasdre, than, five, asdfyears]        |[3, 7, 4, 4, 9]|27 |5   |5.4|
|3   |Fasdfor more thasdfan fidafve yearasdfs|[Fasdfor, more, thasdfan, fidafve, yearasdfs]|[7, 4, 8, 7, 9]|35 |5   |7.0|
+----+---------------------------------------+---------------------------------------------+---------------+---+----+---+
I know this is a completely different approach, but it may help.
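To sanity-check the DataFrame columns by hand, the same per-row values can be recomputed in plain Python. This is only a sketch: row_columns is a hypothetical helper, and it assumes the text has no leading or trailing spaces.

```python
import re


def row_columns(text):
    """Recompute the array, count_arr, sum, size and avg columns for one row."""
    array = re.split(r"[ ][ ]*", text)  # same pattern as the f.split call
    count_arr = [len(word) for word in array]
    total = sum(count_arr)
    size = len(array)
    return {"array": array, "count_arr": count_arr,
            "sum": total, "size": size, "avg": total / size}


row = row_columns("For moasdre than five asdfyears")
print(row["count_arr"], row["sum"], row["size"], row["avg"])
```

For the second sample row this gives word lengths of 3, 7, 4, 4 and 9, a sum of 27 over 5 words, and an average of 5.4.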