Create an RDD from another RDD with only the required columns

Problem description

I have a file in Spark that contains tabular data with the following columns:

Property ID|Location|Price|bedrooms|Bathrooms|Size|Price SQ Ft|Status

I have read this file as an RDD using:

a = sc.textFile("/FileStore/tables/realestate.txt")

Now I need to create a new RDD from the one above with Property ID, Location, and Price (= Size * Price SQ Ft).

I can do this by converting it to a DataFrame, but I can't figure out how to turn it into another RDD with only the required columns.

Solution

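You can use map to keep the first two columns and compute the price as the third (the header row is replaced with the new column names):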

a = sc.textFile("/FileStore/tables/realestate.txt")
b = a.map(
    lambda x: 
    (x.split('|')[:2] + [float(x.split('|')[5]) * float(x.split('|')[6])]) 
    if x.split('|')[0] != 'Property ID'
    else ['Property ID','Location','Price']
)
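The lambda above splits each line three times and keeps the header as a row in the result. Here is a minimal sketch of a variant that splits once and drops the header instead. The helper names `is_data_row` and `to_price_row` and the sample values are made up for illustration, and a plain Python list stands in for the RDD so the logic can be checked without a cluster.

```python
HEADER_FIRST_FIELD = "Property ID"

def is_data_row(line):
    """Return True for every line except the header row."""
    return line.split("|")[0] != HEADER_FIRST_FIELD

def to_price_row(line):
    """Split once and return [property_id, location, size * price_per_sq_ft]."""
    fields = line.split("|")
    return fields[:2] + [float(fields[5]) * float(fields[6])]

# Hypothetical sample lines in the same layout as the file (values are made up).
sample_lines = [
    "Property ID|Location|Price|bedrooms|Bathrooms|Size|Price SQ Ft|Status",
    "1001|Springfield|250000|3|2|1000|250.0|Active",
    "1002|Shelbyville|180000|2|1|800|225.0|Sold",
]

# Plain-Python stand-in for: b = a.filter(is_data_row).map(to_price_row)
rows = [to_price_row(line) for line in sample_lines if is_data_row(line)]
```

With the real RDD, the same two functions should plug into `a.filter(is_data_row).map(to_price_row)`.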

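def splitfunc(x):
    array = x.split('|')
    # array[0] is the Property ID, array[1] the Location, and so on.
    # Size and Price SQ Ft are strings, so convert them before multiplying.
    return [array[0], array[1], float(array[5]) * float(array[6])]

# Drop the header row first so float() is never called on column names.
newrdd = a.filter(lambda x: not x.startswith('Property ID')).map(splitfunc)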

Use the map function: inside it, split each line on the separator (line.split('|')), then pick the columns you need from the resulting array.
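The split-then-select idea can also be written as a small reusable helper. This is only a sketch: `select_columns` is a hypothetical name, not a Spark API, and the sample line is invented.

```python
def select_columns(line, indices, sep="|"):
    """Split a delimited line and keep only the fields at the given positions."""
    fields = line.split(sep)
    return [fields[i] for i in indices]

# Hypothetical sample line in the same layout as the file (values are made up).
sample_line = "1001|Springfield|250000|3|2|1000|250.0|Active"

# Keep Property ID (0), Location (1) and Status (7).
picked = select_columns(sample_line, [0, 1, 7])
```

On the RDD this would be used as `a.map(lambda line: select_columns(line, [0, 1, 7]))`. The selected values are still strings, so convert them with `float()` before doing arithmetic such as Size * Price SQ Ft.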

apache-spark apache-spark-sql pyspark rdd