Creating combinations of a value list with an existing key and summing them - Pyspark

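Problem description

My question is similar to the one given here, but I also have an additional field that I want to sum. My RDD looks like this (shown as a dataframe):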

+----------+----------------+----------------+
|    c1    |       c2       |      val       |
+----------+----------------+----------------+
|    t1    |     [a,b]      |    [11,12]     |
|    t2    |    [a,b,c]     |   [13,14,15]   |
|    t3    |   [a,b,c,d]    | [16,17,18,19]  |
+----------+----------------+----------------+

I would like to get something like this:

        +----------+----------------+----------------+
        |    c1    |       c2       |    sum(val)    |
        +----------+----------------+----------------+
        |    t1    |     [a,b]      |       23       |
        |    t2    |     [a,b]      |       27       |
        |    t2    |     [a,c]      |       28       |
        |    t2    |     [b,c]      |       29       |
        |    t3    |     [a,b]      |       33       |
        |    t3    |     [a,c]      |       34       |
        |    t3    |     [a,d]      |       35       |
        |    t3    |     [b,c]      |       35       |
        |    t3    |     [b,d]      |       36       |
        |    t3    |     [c,d]      |       37       |
        +----------+----------------+----------------+

With the following code, I get the first two columns:

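import itertools

def combinations(row):
    l = row[1]  # list of labels (c2)
    k = row[0]  # key (c1)
    m = row[2]  # list of values (val), not used yet
    # one (key, label pair) tuple per 2-combination of the labels
    return [(k, v) for v in itertools.combinations(l, 2)]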

a.map(combinations).flatMap(lambda x: x).take(5)

With this code, I tried to get the third column as well, but I get more rows than expected:

def combinations(row):
    l = row[1]
    k = row[0]
    m = row[2]
    # nested loops: every label pair is combined with every value-pair sum
    return [(k, v, x) for v in itertools.combinations(l, 2) for x in map(sum, itertools.combinations(m, 2))]
        
a.map(combinations).flatMap(lambda x: x).take(5)
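
The extra rows come from the nested comprehension: it pairs every label combination with every value sum, so a row with n combinations yields n × n tuples instead of n. A minimal pure-Python sketch, using the t2 row from the table above and no Spark, makes the count visible:

```python
import itertools

# Row t2 from the example table: key, labels, values.
k, l, m = ("t2", ["a", "b", "c"], [13, 14, 15])

# Same nested comprehension as in the attempt above.
crossed = [(k, v, x)
           for v in itertools.combinations(l, 2)
           for x in map(sum, itertools.combinations(m, 2))]

n_pairs = len(list(itertools.combinations(l, 2)))
print(n_pairs)       # number of label pairs
print(len(crossed))  # number of tuples actually produced
```

Here there are 3 label pairs but 9 tuples, because each pair is repeated once per sum.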

Thank you for your help.

Solution

Try the following:

a = sc.parallelize([
    (1,[1,2,3,4],[11,12,13,14]),(2,[3,4,5,6],[15,16,17,18]),(3,[-1,[19,20,21,22])
  ])

import itertools

def combinations(row):
    l = row[1]
    k = row[0]
    m = row[2]
    # pairs each 2-combination of l with each pairwise sum of m
    return [(k,v,x) for v in itertools.combinations(l,2) for x in map(sum,itertools.combinations(m,2))]

a.map(combinations).flatMap(lambda x: x).take(5)

Solved as follows:

def combinations(row):
    l = row[1]
    k = row[0]
    m = row[2]
    # look up each label's position in l and add the values at those positions in m
    return [(k, m[l.index(v[0])] + m[l.index(v[1])]) for v in itertools.combinations(l, 2)]

a.map(combinations).flatMap(lambda x: x).take(5)

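Since the second and third columns have the same number of elements, I looked up the matching elements and added them. Thanks to Lavesh for the answer, which helped me find the solution.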
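
As a variation on the same idea, the sketch below combines positions instead of labels, which keeps the label pair in the output (as in the desired table) and does not depend on `l.index`, which returns only the first match if a label is repeated. The function name `pair_sums` and the sample rows are illustrative assumptions, not part of the original answer; the rows are plain Python tuples so the example runs without Spark.

```python
import itertools

def pair_sums(row):
    """Return (key, label_pair, value_sum) for every 2-combination of a row."""
    k, labels, values = row
    # Combine index positions so each label stays tied to the value
    # stored at the same position in the values list.
    return [(k, (labels[i], labels[j]), values[i] + values[j])
            for i, j in itertools.combinations(range(len(labels)), 2)]

# Sample rows taken from the table in the question (t1 and t2).
rows = [("t1", ["a", "b"], [11, 12]),
        ("t2", ["a", "b", "c"], [13, 14, 15])]

# Plain-Python stand-in for flattening the per-row lists.
result = [out for row in rows for out in pair_sums(row)]
print(result)
```

On an RDD shaped like `a`, the same function could presumably be applied with `a.flatMap(pair_sums)`, which flattens the per-row lists in one step instead of `map` followed by an identity `flatMap`.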

apache-spark pyspark python rdd