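Problem description
I am building a keyword frequency ranking. However, the top entry is '', and I don't know what it is. I have already tried removing punctuation and empty values. The code is as follows: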
import nltk
from collections import Counter

# stop_words and df are not shown in the post; they are defined elsewhere.
def get_keywords(row):
    Vbtm = row['a1']
    lowered = Vbtm.lower()
    tokens = nltk.tokenize.word_tokenize(lowered)
    punctuations = ['(', ')', ';', ':', '[', ']', ',', '’', '”', '“', '.']
    extra = ['', 'null', ' ']
    keywords = [keyword for keyword in tokens
                if keyword.isalpha() and keyword not in stop_words
                and keyword not in extra and keyword not in punctuations]
    keywords_string = ','.join(keywords)
    return keywords_string

df['keywords'] = df.apply(get_keywords, axis=1)
wfq = Counter(df.keywords)
wfq_sorted = sorted(wfq.items(), key=lambda kv: kv[1], reverse=True)
for w in wfq_sorted[:30]:
    print(w)
This is the result I get:
('',2379) -- What is ''? And how do I remove it?
('reliability',134)
('reliable',129)
('good,service',54) -- Also, does anyone know why these two words are stuck together?
('service',29)
('dependable',27)
('great',24)
('dependability',23)
...
Thanks.
Solution
No verified solution to this problem has been found yet.
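A likely explanation, offered as a sketch rather than a verified answer: Counter(df.keywords) counts each row's whole comma-joined string, not individual words. A row in which every token was filtered out becomes the empty string '', which would explain ('', 2379), and a row whose keywords were "good" and "service" is counted as the single key 'good,service'. The example below uses a hypothetical list `rows` in place of df['keywords'] and a hypothetical helper `count_keywords`; neither appears in the original post.

```python
from collections import Counter


def count_keywords(keyword_strings):
    """Count individual keywords across rows of comma-joined strings."""
    counts = Counter()
    for joined in keyword_strings:
        # Split each row back into single words; drop the empty piece
        # that a row with no keywords ('') would otherwise contribute.
        counts.update(word for word in joined.split(',') if word)
    return counts


# Hypothetical sample standing in for df['keywords'].
rows = ['', 'reliability', 'good,service', 'service', '', 'good']

per_row = Counter(rows)          # what the posted code counts: whole strings
per_word = count_keywords(rows)  # what was probably intended: single words

print(per_row.most_common(3))
print(per_word.most_common(3))
```

With the posted DataFrame, the equivalent call would presumably be count_keywords(df['keywords']).most_common(30). Returning the list from get_keywords instead of joining it with ',' would avoid the split step altogether.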