Splitting column names and using wordnet.synsets on multiple words instead of a single word

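Problem description

I am trying to get a list of synonyms for every word in my column names. However, when I run wordnet.synsets(), it only works for column names that consist of a single word. How can I run it on multi-word names and get output like the desired output below? Also, is there a way to show only the first 4 results to improve readability?

Code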

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import pandas as pd

df =  ['Unnamed 0','business id','name','postal code',]

syns = {w : [] for w in df}
for k,v in syns.items():
    for synset in wordnet.synsets(k):
        for lemma in synset.lemmas():
            if lemma.name() not in syns:
                v.append(lemma.name())

pd.DataFrame([syns],columns = syns.keys())

Current output

Unnamed 0   business id   name                                                postal code
[]          []            [gens,figure,public_figure,epithet,call,i...   []

Desired output

Unnamed 0                              business id                              name                   postal code
Unnamed[definitions],0[definitions]    business[definitions],id[definitions]    [gens,public_figure]   postal[definitions],code[definitions]

Solution

A simple, straightforward approach:

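from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import nltk
import numpy as np  # needed for np.unique below
import pandas as pd

# Assumes the NLTK data for WordNet and word_tokenize has already been downloaded.
df =  ['Unnamed 0','business id','name','postal code',]
# One column per (column name, token) pair; the rows hold the sorted unique
# lemma names for that token (names containing "_" are skipped).
df = pd.DataFrame(
{tuple([k,t]):pd.Series(np.unique([l.name() 
                                     for s in wordnet.synsets(t) 
                                     for l in s.lemmas() if "_" not in l.name()])).to_dict()
 for k in df 
 for t in nltk.word_tokenize(k)
}).fillna("")
df.columns.set_names(["sentence","word"],inplace = True)
df.loc[:4] # rows 0-4, i.e. the first 5 matches; df.head(4) would show only 4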



Just change the list/dict comprehension so that it produces the format pandas expects, {"colA":[1,2],"colB":[3,4]}

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import nltk
import numpy as np  # needed for np.unique below
import pandas as pd

df =  ['Unnamed 0','business id','name','postal code',]

# largest number of space-delimited tokens in any column name
mr = max([len(k.split(" ")) for k in df])
pd.DataFrame(
    # one column for each requested space-delimited name
    # use an f-string to format as requested....
    {k:[f"{v}:{np.unique([l.name() for s in wordnet.synsets(v) for l in s.lemmas() ]).tolist()}" 
            # names with fewer tokens are padded so every list has the same length, as pandas requires
            for v in f"{k}{(mr-len(k.split(' ')))*' '}".split(" ")] 
     for k in df}).replace({":[]":""})

Output

    Unnamed 0   business id name    postal code
0   Unnamed:['nameless','unidentified','unknown'...   business:['business','business_concern','bus...   name:['advert','appoint','bring_up','call',...   postal:['postal']
1   0:['0','cipher','cypher','nought','zero']   id:['Gem_State','I.D.','ID','Idaho','id']       code:['cipher','code','codification','compu...
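
For the second part of the question (showing only the first 4 results), the list for each token can be sliced before it goes into a DataFrame. The following is only a minimal sketch, not part of the answer above: the helper names `wordnet_lemmas` and `synonyms_per_token` and the `lookup` and `limit` parameters are made up for illustration, and it assumes the NLTK WordNet corpus is installed when the default lookup is used.

```python
from typing import Callable, Dict, Iterable, List


def wordnet_lemmas(word: str) -> List[str]:
    """Sorted unique WordNet lemma names for a single word.

    Hypothetical helper; needs NLTK and its 'wordnet' corpus at call time.
    """
    # Imported lazily so synonyms_per_token can be used with another lookup.
    from nltk.corpus import wordnet

    return sorted(
        {lemma.name() for synset in wordnet.synsets(word) for lemma in synset.lemmas()}
    )


def synonyms_per_token(
    column_names: Iterable[str],
    lookup: Callable[[str], List[str]] = wordnet_lemmas,
    limit: int = 4,
) -> Dict[str, Dict[str, List[str]]]:
    """Map each column name to {token: at most `limit` synonyms}."""
    return {
        # str.split() breaks the column name on whitespace, one lookup per token
        name: {token: lookup(token)[:limit] for token in name.split()}
        for name in column_names
    }
```

Calling `synonyms_per_token(['business id', 'postal code'])` would query WordNet once per token and keep at most four names for each. Passing a different `lookup` function makes it possible to try the splitting and truncation logic without the corpus.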