Getting the index of elements in a Spark RDD

Problem description

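I have an RDD[Employee], where Employee has the fields id, first_name, last_name, and dob. I want to set each employee's id to that element's index in the RDD. How do I do that? I can get the index with rdd.zipWithIndex(), but I don't know what to do next.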

Solution

You need to map each element to the new element you want:

rdd.zipWithIndex()
   .map{case(elem,index) => elem.copy(id = index.toInt)}

If your Employee class is not a case class, or otherwise lacks a copy method, you can build a new instance instead:

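rdd.zipWithIndex()
   // Assumes a constructor taking (id, first_name, last_name, dob); dob is passed through so it is not lost
   .map{case(elem,index) => new Employee(index.toInt,elem.first_name,elem.last_name,elem.dob)}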

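First, zipWithIndex gives you the index of each row: it turns your RDD into an RDD of tuples, where the first element is the employee (called elem in the example) and the second is its index.
Then you can create a new employee whose id is the index and whose other fields (first name, last name, and so on) match the original's.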
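
To make the two steps concrete, here is a minimal end-to-end sketch. It assumes Spark (spark-core) is on the classpath and runs in local mode. The Employee case class, the sample rows, and the renumber helper are hypothetical and only mirror the fields named in the question, with dob kept as a plain String for simplicity.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical Employee with the fields named in the question.
case class Employee(id: Int, first_name: String, last_name: String, dob: String)

object ZipWithIndexExample {
  // Step 1: zipWithIndex pairs each element with its position, giving RDD[(Employee, Long)].
  // Step 2: map each pair to a copy of the employee whose id is that position.
  def renumber(rdd: RDD[Employee]): RDD[Employee] =
    rdd.zipWithIndex().map { case (elem, index) => elem.copy(id = index.toInt) }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("zip-with-index-sketch")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data; every id starts out as a placeholder.
      val employees = sc.parallelize(Seq(
        Employee(-1, "Ada", "Lovelace", "1815-12-10"),
        Employee(-1, "Alan", "Turing", "1912-06-23"),
        Employee(-1, "Grace", "Hopper", "1906-12-09")
      ), numSlices = 2)

      renumber(employees).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Indices are assigned by partition order and then by position within each partition, so they follow the RDD's own ordering. When the RDD has more than one partition, zipWithIndex has to run a Spark job to count the elements per partition before it can assign indices.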

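Note: the index produced by zipWithIndex is a Long. Because id is an Int in my Employee class, I have to convert it with toInt. Keep in mind that toInt silently wraps around for values above Int.MaxValue. If your id is already a Long, no conversion is needed.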
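
If the RDD could ever hold more than Int.MaxValue elements, the Long-to-Int conversion in the note deserves a guard. The sketch below uses plain Scala, with no Spark required, to contrast toInt with java.lang.Math.toIntExact, which throws instead of wrapping. The toIntId helper is a hypothetical name, not part of Spark.

```scala
import scala.util.Try

object IndexConversion {
  // Hypothetical helper: convert a zipWithIndex index to an Int id,
  // throwing ArithmeticException on overflow instead of wrapping silently.
  def toIntId(index: Long): Int = Math.toIntExact(index)

  def main(args: Array[String]): Unit = {
    val tooBig = Int.MaxValue.toLong + 1L

    println(toIntId(41L))                // small indices convert unchanged
    println(tooBig.toInt)                // plain toInt wraps around to Int.MinValue
    println(Try(toIntId(tooBig)).isFailure) // the checked conversion fails instead
  }
}
```

In the map step this would read elem.copy(id = IndexConversion.toIntId(index)). Declaring id as a Long avoids the conversion altogether.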

apache-spark rdd scala