Classifying text using a set of keywords

Problem description

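This is my first question. I have experience predicting outcomes with SVMs, NNs, XGBoost and the like, but I am relatively new to the Python world when it comes to text, and my current project is outside my wheelhouse. The problem seems simple. I have a file with one column of text. I need to search that column and determine whether the text in each row contains a keyword for one of the dozen or so categories we have.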

I load the data, then convert all the text to lowercase, tokenize it, remove stop words, lemmatize, and rejoin the tokens using the following code:

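import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Assumption: the post does not show how `stops` and `lemmatizer` were defined;
# these are the usual NLTK definitions.
stops = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Lowercase the text column
dfnew['Activity (original)'] = dfnew['Activity (original)'].str.lower()

# Tokenize and keep only purely alphabetic tokens
def identify_tokens(row):
    Activity_Cleaned = row['Activity (original)']
    tokens = nltk.word_tokenize(Activity_Cleaned)
    token_words = [w for w in tokens if w.isalpha()]
    return token_words
dfnew['Activity (original)'] = dfnew.apply(identify_tokens, axis=1)

# Drop stop words
def removestops(row):
    mlist = row['Activity (original)']
    keepwords = [w for w in mlist if w not in stops]
    return keepwords
dfnew['Activity (original)'] = dfnew.apply(removestops, axis=1)

# Lemmatize each remaining token
def lemmy(row):
    mlist = row['Activity (original)']
    keepwords = [lemmatizer.lemmatize(word) for word in mlist]
    return keepwords
dfnew['Activity (original)'] = dfnew.apply(lemmy, axis=1)

# Join the tokens back into a single space-separated string
def rejoin_words(row):
    my_list = row['Activity (original)']
    joined_words = " ".join(my_list)
    return joined_words
dfnew['Activity (original)'] = dfnew.apply(rejoin_words, axis=1)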
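
As a side note, the cleaning steps above could probably be folded into one function that works on a single string and is applied to the column once. The following is only a sketch under assumptions: `clean_text` is a hypothetical helper, the regex tokenizer is a dependency-free stand-in for `nltk.word_tokenize` plus the `isalpha` filter (it splits hyphenated words instead of dropping them), and `lemmatize` defaults to an identity function where `lemmatizer.lemmatize` would normally be passed in.

```python
import re

def clean_text(text, stops, lemmatize=lambda word: word):
    """Lowercase, tokenize, drop stop words, lemmatize, and rejoin one string."""
    # Stand-in tokenizer (assumption): runs of ASCII letters only,
    # which is not identical to NLTK's tokenization.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stop words first, then lemmatize what is left.
    kept = [lemmatize(tok) for tok in tokens if tok not in stops]
    return " ".join(kept)

# Tiny illustrative stop list; the post's `stops` would be used instead.
example_stops = {"the", "and", "of", "a", "to"}
cleaned = clean_text("Oversee the planning of Site Visits", example_stops)
print(cleaned)
```

With pandas this would be used roughly as `dfnew['Activity (original)'].apply(lambda s: clean_text(s, stops, lemmatizer.lemmatize))`, which avoids four separate row-wise passes over the DataFrame.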

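I then tried to follow the guidance in the question linked at the end of this paragraph. I created a list of keyword tuples, each pairing a category label with the keywords associated with that category, and a results list that is filled by going through the text in df and matching it against the keyword tuples: categorise text in column using keywords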

My code is as follows:

keywordstuple = [
    ('Management', {'in': ['management','manage','oversee','supervise','administer','direct','control','handle','lead','govern','look',' moderate','chair'], 'out': []}),
    ('Planning', {'in': ['plan','organize','arrange','design','outline','draft','prepare','schedule',' formulate','develop','set','create'], 'out': []}),
    ('Executing', {'in': ['executing','execute','implement','perform','carry','accomplish','achieve','complete','enact','do','attain','conduct'], 'out': []}),
    ('Coordinating', {'in': ['coordinating','coordinate','syncrhonize','mesh','collaborate','cooperate','pull','laise'], 'out': []}),
    ('Advancing', {'in': ['advancing','advance','facilitate','ease','smooth','enamble','assist','help','aid','expedite','accelerate','speed','promote','further','simplify','encourage','orchestrate'], 'out': []}),
    ('Maintaining', {'in': ['maintaining','maintain','continue','preserve','sustain','service','keep','nurtue','track'], 'out': []}),
    # The category label for the next keyword list is missing from the posted code, so it is left commented out:
    # {'in': ['enhancing','enhance','improve','modernize','rehabilitate','touch','reform','better','upgrade'], 'out': []}
]
               
resultstuple = []
for description in 'Activity (original)':
    categories_in = [cat[0] for cat in keywordstuple if([kw in description for kw in cat[1]['in']])]
    categories_out = [cat[0] for cat in keywordstuple if all([kw not in description for kw in cat[1]['out']])]

    categories = list(set(categories_in).intersection(categories_out))
    if len(categories) > 0:
        category = categories[0]
    else:
        category = 'NO CATEGORY'

    resultstuple.append(category)
   
dfnew['Category']=category

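For brevity I removed many of the categories and keywords from the code so this post would not be too long. When I look at df after running the code above, everything is labeled "Planning", regardless of what the text actually contains. I know I must be doing something wrong, but I am not sure what. I have tried many things, such as removing the 'out' entries from the keyword tuples, but I am missing something basic here, so I am just hacking blindly. Any suggestions would be greatly appreciated. Thank you.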
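
One possible reading of why every row ends up with the same label, based only on the code shown: `for description in 'Activity (original)'` iterates over the characters of that string literal rather than over the column; `if([...])` tests a non-empty list, which is always truthy, so every category passes (`any(...)` looks like what was intended); and `dfnew['Category']=category` assigns the last scalar `category` to every row instead of the collected `resultstuple`. The sketch below addresses those three points. `categorize` and the shortened `keywords` list are hypothetical, and matching on whole words rather than substrings is a design choice of this sketch, not something the post specifies.

```python
def categorize(description, keywords, default="NO CATEGORY"):
    """Return the first category whose 'in' keywords hit and whose 'out' keywords do not."""
    # Whole-word matching: 'do' should not match inside 'document'.
    words = set(description.split())
    for label, rules in keywords:
        # strip() tolerates entries such as ' moderate' that carry a stray space.
        has_in = any(kw.strip() in words for kw in rules["in"])
        has_out = any(kw.strip() in words for kw in rules["out"])
        if has_in and not has_out:
            return label
    return default

# Shortened, hypothetical keyword list in the same shape as `keywordstuple`.
keywords = [
    ("Management", {"in": ["manage", "oversee", "supervise"], "out": []}),
    ("Planning", {"in": ["plan", "schedule", "prepare"], "out": ["manage"]}),
]

# Stand-ins for already-cleaned rows of the text column.
descriptions = ["oversee site staff", "prepare weekly schedule", "eat lunch"]
results = [categorize(text, keywords) for text in descriptions]
print(results)
```

On the DataFrame itself, the equivalent would be roughly `dfnew['Category'] = dfnew['Activity (original)'].apply(lambda s: categorize(s, keywordstuple))`. Returning the first match in list order also avoids relying on `list(set(...))[0]`, whose order is not guaranteed.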
