Storing multiple columns of data in Spark GraphX edges and vertices

Problem description

I am new to Spark GraphX and have the following edge DataFrame:

Dataframe : edges_main
+------------------+------------------+------------+--------+-----------+
|               src|               dst|relationship|category|subcategory|
+------------------+------------------+------------+--------+-----------+
|294201130817328347|294201131015844283|   friend   |  school|      class|
|294201131015844283|294201131007361339|  brother   |   home |     cousin|
|294201131015844283|294201131014451003|  son       |   home |   relative|
+------------------+------------------+------------+--------+-----------+

and the following vertex DataFrame:

Dataframe : vertices_main
+------------------+-----+----+
|                id|value|name|
+------------------+-----+----+
|294201130817328347| Mary|   a|
|294201131015844283| Hola|   b|
|294201131015844283| Rama|   c|
+------------------+-----+----+

I want to keep the additional attributes in GraphX so that I can access them with map. My code:

case class MyEdges(src: String,dst: String,attributes: MyEdgesLabel)
case class MyEdgesLabel(relationship:String,category: String,subcategory:String)

val edges = edges_main.as[MyEdges].rdd.map { edge =>
      Edge(
        edge.src.toLong,edge.dst.toLong,//**what to mention here(MyEdgesLabel)**//
      )}

case class MyVerticesLabel(name:String)

val vertices: RDD[(VertexId,Any)] = vertices_data.rdd.map(verticesRow => (
      verticesRow.getLong(0),verticesRow.getString(1))
//**what to mention here(MyVerticesLabel)**//
    )

The reason for this requirement is that, once the graph is created, I want to access the additional attributes directly, like this:

val g = Graph(vertices,edges)
g.vertices.map(v => v._1 + v._2 + /* additional attributes from case class MyVerticesLabel */).collect.mkString
g.edges.map(e => e.srcId + e.dstId + e.attr(/* additional attributes from case class MyEdgesLabel */)).collect.mkString

I got some hints from the URL below, but I am still confused about how to carry multiple attributes on both vertices and edges: http://www.sunlab.org/teaching/cse6250/fall2019/spark/spark-graphx.html#graph-construction

Please help with this.

Solution

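You can use one case class as the edge attribute and another case class as the vertex attribute. MyEdgesLabel is already fine for the edges; to build the edge RDD, just do the following: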

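// Assumes MyEdges is declared flat, one field per DataFrame column, so that .as[MyEdges]
// and edge.relationship resolve (.as also needs import spark.implicits._ in scope):
case class MyEdges(src: String, dst: String, relationship: String, category: String, subcategory: String)

val edges = edges_main.as[MyEdges].rdd.map { edge =>
      Edge(
        edge.src.toLong,edge.dst.toLong,MyEdgesLabel(edge.relationship,edge.category,edge.subcategory)
      )}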

For the vertices, the case class needs to include both value and name:

case class MyVerticesLabel(value: String,name: String)

Then use it to create the vertex RDD:

import org.apache.spark.graphx.VertexId
import org.apache.spark.rdd.RDD
val vertices: RDD[(VertexId, MyVerticesLabel)] = vertices_data.rdd.map { verticesRow =>
  (verticesRow.getAs[Long]("id"),
    MyVerticesLabel(verticesRow.getAs[String]("value"), verticesRow.getAs[String]("name")))
}

Now the values can be accessed easily, for example:

// String interpolation keeps the two Long ids from being added together by +
g.edges.map(e => s"${e.srcId} ${e.dstId} ${e.attr.relationship}").collect.mkString("\n")
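
Putting the pieces together, here is a minimal end-to-end sketch rather than the original poster's exact setup. It assumes a local SparkSession and that src, dst and id are LongType columns. It also passes a default MyVerticesLabel to Graph(...), because some edge endpoints in the sample data have no row in the vertex table and their attribute would otherwise be null. The object name MultiAttributeGraph, the edgeLine helper and the "unknown" default are made up for illustration, and the third vertex row is left out because its id duplicates the second.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Attribute payloads carried by each edge and each vertex
case class MyEdgesLabel(relationship: String, category: String, subcategory: String)
case class MyVerticesLabel(value: String, name: String)

object MultiAttributeGraph {

  // Hypothetical helper: formats one edge together with all of its attributes
  def edgeLine(e: Edge[MyEdgesLabel]): String =
    s"${e.srcId} -> ${e.dstId} ${e.attr.relationship} ${e.attr.category} ${e.attr.subcategory}"

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("multi-attribute-graph")
      .getOrCreate()
    import spark.implicits._

    // Sample rows taken from the question; ids are assumed to be LongType
    val edgesMain = Seq(
      (294201130817328347L, 294201131015844283L, "friend", "school", "class"),
      (294201131015844283L, 294201131007361339L, "brother", "home", "cousin"),
      (294201131015844283L, 294201131014451003L, "son", "home", "relative")
    ).toDF("src", "dst", "relationship", "category", "subcategory")

    val verticesMain = Seq(
      (294201130817328347L, "Mary", "a"),
      (294201131015844283L, "Hola", "b")
    ).toDF("id", "value", "name")

    // Edge attribute: one MyEdgesLabel per row, columns read by name
    val edges: RDD[Edge[MyEdgesLabel]] = edgesMain.rdd.map { row =>
      Edge(
        row.getAs[Long]("src"),
        row.getAs[Long]("dst"),
        MyEdgesLabel(
          row.getAs[String]("relationship"),
          row.getAs[String]("category"),
          row.getAs[String]("subcategory")))
    }

    // Vertex attribute: one MyVerticesLabel per row
    val vertices: RDD[(VertexId, MyVerticesLabel)] = verticesMain.rdd.map { row =>
      (row.getAs[Long]("id"),
        MyVerticesLabel(row.getAs[String]("value"), row.getAs[String]("name")))
    }

    // Vertices that appear only in edges receive this default instead of null
    val defaultVertex = MyVerticesLabel("unknown", "unknown")
    val g = Graph(vertices, edges, defaultVertex)

    // Every field of both case classes is reachable inside map
    g.vertices
      .map { case (id, attr) => s"$id ${attr.value} ${attr.name}" }
      .collect()
      .foreach(println)

    g.edges.map(e => edgeLine(e)).collect().foreach(println)

    // Triplets expose the source and destination vertex attributes next to the edge attribute
    g.triplets
      .map(t => s"${t.srcAttr.value} -[${t.attr.relationship}]-> ${t.dstAttr.value}")
      .collect()
      .foreach(println)

    spark.stop()
  }
}
```

The formatting happens inside map before collect, so only plain strings are brought back to the driver. If every edge endpoint is guaranteed to have a vertex row, the default attribute can be dropped and Graph(vertices, edges) used as in the question.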