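Problem description
My problem is similar to the one asked here, but I also have an additional field whose values I want to sum. My RDD looks like this (shown as a DataFrame):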
+----+-----------+---------------+
| c1 | c2        | val           |
+----+-----------+---------------+
| t1 | [a,b]     | [11,12]       |
| t2 | [a,b,c]   | [13,14,15]    |
| t3 | [a,b,c,d] | [16,17,18,19] |
+----+-----------+---------------+
I want to get something like this:
+----+-------+----------+
| c1 | c2    | sum(val) |
+----+-------+----------+
| t1 | [a,b] | 23       |
| t2 | [a,b] | 27       |
| t2 | [a,c] | 28       |
| t2 | [b,c] | 29       |
| t3 | [a,b] | 33       |
| t3 | [a,c] | 34       |
| t3 | [a,d] | 35       |
| t3 | [b,c] | 35       |
| t3 | [b,d] | 36       |
| t3 | [c,d] | 37       |
+----+-------+----------+
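Using the following code, I get the first two columns:
import itertools

def combinations(row):
    l = row[1]  # labels (c2)
    k = row[0]  # key (c1)
    m = row[2]  # values (val), not used yet
    # One (key, label_pair) tuple per 2-combination of the labels
    return [(k, v) for v in itertools.combinations(l, 2)]

a.map(combinations).flatMap(lambda x: x).take(5)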
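# Attempt to add the sums. The nested comprehension pairs every label
# combination with every pairwise sum, so it yields a cross product.
def combinations(row):
    l = row[1]
    k = row[0]
    m = row[2]
    return [(k, v, x) for v in itertools.combinations(l, 2) for x in map(sum, itertools.combinations(m, 2))]

a.map(combinations).flatMap(lambda x: x).take(5)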
Thanks for your help.
Solution
Try the following:
a = sc.parallelize([
    (1, [1, 2, 3, 4], [11, 12, 13, 14]),
    (2, [3, 4, 5, 6], [15, 16, 17, 18]),
    # (3, [-1, ...], [19, 20, 21, 22])  # second field is incomplete in the original post
])
import itertools

def combinations(row):
    l = row[1]
    k = row[0]
    m = row[2]
    return [(k, v, x) for v in itertools.combinations(l, 2) for x in map(sum, itertools.combinations(m, 2))]

a.map(combinations).flatMap(lambda x: x).take(5)
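Solved it as follows:
import itertools

def combinations(row):
    l = row[1]
    k = row[0]
    m = row[2]
    # For each label pair, find each label's position in l and add the
    # values at the same positions in m
    return [(k, m[l.index(v[0])] + m[l.index(v[1])]) for v in itertools.combinations(l, 2)]

a.map(combinations).flatMap(lambda x: x).take(5)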
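Since the second and third columns have the same number of elements, I look up the elements by position and add them. Thanks to Lavesh for the answer, which helped me find the solution.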
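The positional idea above can also be sketched without `l.index`, by zipping the labels with their values before taking combinations. This is only a minimal sketch: the function name `pair_sums` and the sample `rows` list are made up for illustration, and it assumes the two lists in each row have the same length. Unlike the code above, it also keeps the label pair in the output, to match the three columns in the desired table.

```python
import itertools

def pair_sums(row):
    """Return (key, label_pair, value_sum) for every 2-combination in a row."""
    key, labels, values = row
    # Pair each label with its value up front, so no index lookup is needed
    # and repeated labels cannot pick up the wrong value.
    paired = zip(labels, values)
    return [
        (key, (l1, l2), v1 + v2)
        for (l1, v1), (l2, v2) in itertools.combinations(paired, 2)
    ]

# Plain-Python stand-in for the RDD contents (hypothetical sample data
# taken from the first two rows of the question's table).
rows = [
    ("t1", ["a", "b"], [11, 12]),
    ("t2", ["a", "b", "c"], [13, 14, 15]),
]
result = [t for row in rows for t in pair_sums(row)]
print(result)
```

Assuming `a` is the RDD from the question, `a.flatMap(pair_sums)` should give the same rows in a single step, in place of `a.map(...).flatMap(lambda x: x)`.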