How do I count the lengths of all NP noun words using PySpark and NLTK?

Problem description

While using pyspark and nltk, I want to get the lengths of all "NP" words and sort them in descending order. I am currently stuck on navigating the subtrees.

Example subtree output

#>>>[(Tree('NP',[Tree('NBAR',[('WASHINGTON','NN')])]),1)  

I am trying to get the length of every NP word, then take those lengths and arrange them in descending order.

Each element pairs a word length with the number of words of that length, e.g.:

# example:
# [(1, 6157),  # 6157 words of length one
#  (2, 1833),  # 1833 words of length two
#  (3, 654), (4, 204), (5, 65)]
import nltk
import re

textstring = """This is just a bunch of words to use for this example.  
John gave them to me last night but Kim took them to work.  
Hi Stacy. URL:http://example.com. Jessica,Mark,Tiger,Book,Crow,Airplane,SpaceShip"""

TOKEN_RE = re.compile(r"\b[\w']+\b")

grammar = r"""
    NBAR:
        {<NN.*|JJS>*<NN.*>}
        
    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}
"""
chunker = nltk.RegexpParser(grammar)
text = sc.parallelize(textstring.split(' '))


dropURL=text.filter(lambda x: "URL" not in x)

words = dropURL.flatMap(lambda line: line.split(" "))
tree = words.flatMap(lambda w: chunker.parse(nltk.tag.pos_tag(nltk.regexp_tokenize(w, TOKEN_RE))))

#data=tree.map(lambda word: (word,len(word))).filter(lambda t : t.label() =='NBAR') -- error

#data=tree.map(lambda x: (x,len(x)))##.filter(lambda t : t[0] =='NBAR')
      
#>>>[(Tree('NP',1)  Trying to get the lengths of all NPs in descending order.

#data=tree.map(lambda x: (x,len(x))).reduceByKey(lambda x: x=='NBAR') ##this is an error but I am getting close I think
data=tree.map(lambda x: (x[0][0],len(x[0][0][0])))#.reduceByKey(lambda x : x[1] =='NP') ##Long run time.

things = data.collect()
things
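A key detail here is what `flatMap` actually produces: iterating over the result of `chunker.parse` flattens the top level of the parse tree into a mix of `Tree('NP', ...)` subtrees and plain `(word, tag)` tuples, which is why a later `map` fails without a type check. A minimal sketch illustrating this, using a hand-tagged list in place of `nltk.pos_tag` output so no tagger model download is needed:

```python
import nltk

# Same grammar as in the question.
grammar = r"""
    NBAR:
        {<NN.*|JJS>*<NN.*>}
    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}
"""
chunker = nltk.RegexpParser(grammar)

# Hand-tagged stand-in for nltk.pos_tag output.
tagged = [('Kim', 'NNP'), ('took', 'VBD'), ('them', 'PRP')]

# Iterating over the parse tree (what flatMap does) yields a mix of
# Tree('NP', ...) subtrees and plain (word, tag) tuples:
for node in chunker.parse(tagged):
    print(type(node).__name__, node)
```

The first printed element is a `Tree` with label `'NP'`; the rest are bare tuples, so any transformation applied uniformly across them must first check `isinstance(t, nltk.tree.Tree)`.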

Solution

You can add a type check on each entry to prevent the error:

result = (tree.filter(lambda t: isinstance(t,nltk.tree.Tree) and 
                                t.label() == 'NP'
                     )
              .map(lambda t: (len(t[0][0][0]),1))
              .reduceByKey(lambda x,y: x + y)
              .sortByKey()
        )

print(result.collect())
# [(2, 1), (3, 2), (4, 5), (5, …), (7, …), (8, …), (9, 1)]
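The same counting logic can be checked without Spark. The sketch below (hand-tagged input stands in for `nltk.pos_tag`, so no tagger model is needed) counts the length of every word inside each NP subtree via `tree.subtrees`, rather than only the first leaf as `t[0][0][0]` does, and sorts descending as the question asks; the Spark equivalent of that final sort would be `sortByKey(ascending=False)`:

```python
import nltk
from collections import Counter

# Same grammar as in the question.
grammar = r"""
    NBAR:
        {<NN.*|JJS>*<NN.*>}
    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}
"""
chunker = nltk.RegexpParser(grammar)

# Hand-tagged tokens stand in for nltk.pos_tag output.
tagged = [('John', 'NNP'), ('gave', 'VBD'), ('them', 'PRP'),
          ('to', 'TO'), ('Kim', 'NNP'), ('at', 'IN'), ('work', 'NN')]

tree = chunker.parse(tagged)

# Count the length of every word inside an NP subtree.
lengths = Counter(
    len(word)
    for subtree in tree.subtrees(lambda t: t.label() == 'NP')
    for word, tag in subtree.leaves()
)

# Descending by word length: John/work are length 4, Kim is length 3.
print(sorted(lengths.items(), reverse=True))
# [(4, 2), (3, 1)]
```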