PySpark: get only the minimum values

Problem description

I only want to get the minimum values.

import pyspark as ps

spark = ps.sql.SparkSession.builder.master('local[4]')\
    .appName('some-name-here').getOrCreate()

sc = spark.sparkContext

sc.textFile('path-to.csv')\
    .map(lambda x: x.replace('"','').split(','))\
    .filter(lambda x: not x[0].startswith('player_id'))\
    .map(lambda x: (x[2] + " " + x[1],int(x[8]) if x[8] else 0))\
    .reduceByKey(lambda value1,value2: value1 + value2)\
    .sortBy(lambda price: price[1],ascending=True).collect()

This is what I get:

[('Cedric Ceballos',0),('Maurcie Cheeks',('James Foster',('Billy Gabor',('Julius Keye',('Anthony Mason',('Chuck Noble',('Theo Ratliff',('Austin Carr',('Mark Eaton',('A.C. Green',('Darrall Imhoff',('John Johnson',('Neil Johnson',('Jim King',('Max Zaslofsky',1),('Don BarksDale',('Curtis Rowe',('Caron Butler',2),('Chris gatling',2)].

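As you can see, many keys have the value 0, which is the minimum. How can I sort this out?

Solution

You can collect the minimum value into a variable and then filter the RDD for equality against that variable: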

rdd = sc.textFile('path-to.csv')\
    .map(lambda x: x.replace('"','').split(','))\
    .filter(lambda x: not x[0].startswith('player_id'))\
    .map(lambda x: (x[2] + " " + x[1],int(x[8]) if x[8] else 0))\
    .reduceByKey(lambda value1,value2: value1 + value2)\
    .sortBy(lambda price: price[1],ascending=True)

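# The RDD is sorted ascending, so the first element holds the minimum total.
minval = rdd.take(1)[0][1]
# Keep every pair whose total equals the minimum; call rdd2.collect() to fetch them.
rdd2 = rdd.filter(lambda x: x[1] == minval)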

Your data is already sorted. Use take(1) instead of collect() to get the first element, which holds the minimum value.

apache-spark pyspark python rdd