Merging related words in NLP

Problem description

I want to define a new word whose count combines the counts of two (or more) different words. For example:

    Words      Frequency
0   mom        250
1   2020       151
2   the        124
3   19         82
4   mother     81
..  ...        ...
10  London     6
11  life       6
12  something  6

I want to define mother as mom + mother:

    Words      Frequency
0   mother     331
1   2020       151
2   the        124
3   19         82
..  ...        ...
9   London     6
10  life       6
11  something  6

This is another way of defining a group of words that share a meaning (at least for my purposes).

Any suggestions would be appreciated.

Solutions

Update 10-21-2020

I decided to build a Python module to handle the tasks outlined in this answer. The module is called wordhoard and can be downloaded from PyPI.
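For example, here is a minimal sketch of looking up synonyms with it. The Synonyms class and its find_synonyms() method follow the wordhoard documentation, but treat the exact API as an assumption, since it has changed between releases:

from wordhoard import Synonyms

# query several online sources for synonyms of a single keyword
# (requires a network connection; results can vary between runs)
synonym_results = Synonyms(search_string='mother').find_synonyms()
print(synonym_results)
# e.g. ['ma', 'mama', 'mamma', 'mammy', 'mom', 'momma', 'mommy', 'mum', 'mummy']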

I tried to use Word2vec and WordNet in projects where I needed to determine the frequency of a keyword (e.g., healthcare) and its synonyms (e.g., wellness program, preventive medicine). I found that most NLP libraries did not produce the results I needed, so I decided to build my own dictionary of custom keywords and synonyms. This approach has worked for analyzing and classifying text in multiple projects.

I'm sure someone skilled in NLP techniques could provide a more robust solution, but the one below is similar to solutions that have worked for me time and time again.

I coded my answer to match the word-frequency data in your question, but it can be modified to use any keyword and synonym dataset.

import string

# Python dictionary of word relationships - primary_word: synonyms.
# I created these relationships manually.
word_relationship = {"father": ['dad', 'daddy', 'old man', 'pa', 'pappy', 'papa', 'pop'],
                     "mother": ["mamma", "momma", "mama", "mammy", "mummy", "mommy", "mom", "mum"]}

# This input text is from various poems about mothers and fathers
input_text = 'The hand that rocks the cradle also makes the house a home. It is the prayers of the mother ' \
         'that keeps the family strong. When I think about my mum, I just cannot help but smile; The beauty of ' \
         'her loving heart, the easy grace in her style. I will always need my mom, regardless of my age. She ' \
         'has made me laugh, made me cry. Her love will never fade. If I could write a story, It would be the ' \
         'greatest ever told. I would write about my daddy, For he had a heart of gold. For my father, my friend, ' \
         'This to me you have always been. Through the good times and the bad, Your understanding I have had.'

# convert the input text to lowercase and split the words on whitespace
wordlist = input_text.lower().split()

# remove all punctuation from the wordlist
remove_punctuation = [''.join(ch for ch in s if ch not in string.punctuation)
                      for s in wordlist]

# list for word frequencies
wordfreq = []

# count the frequency of each word (list.count is O(n) per call, which is fine for short texts)
for w in remove_punctuation:
    wordfreq.append(remove_punctuation.count(w))

word_frequencies = dict(zip(remove_punctuation, wordfreq))

word_matches = []

# loop through both dictionaries and record keyword/synonym hits
for word, frequency in word_frequencies.items():
    for keyword, synonyms in word_relationship.items():
        match = [x for x in synonyms if word == x]
        if word == keyword or match:
            match = ' '.join(map(str, match))
            # append the keyword (mother), the synonym (mom) and the frequency
            word_matches.append([keyword, match, frequency])

# used to hold the final keywords and frequencies
final_results = {}

# list comprehension to obtain the primary keywords and their frequencies
synonym_matches = [(keyword[0], keyword[2]) for keyword in word_matches]

# iterate synonym_matches and total the frequency count for each keyword
for keyword, frequency in synonym_matches:
    final_results[keyword] = final_results.get(keyword, 0) + frequency

print(final_results)
# output
{'mother': 3, 'father': 2}

Other approaches

Below are some other approaches and their out-of-the-box output.


NLTK WORDNET

In this example, I looked up the synonyms for the word 'mother'. Note that WordNet does not link the synonyms 'mom' or 'mum' to 'mother', even though both of those words appear in the sample text above. Also note that the word 'father' is listed as a synonym for 'mother'.

from nltk.corpus import wordnet

synonyms = []
word = 'mother'
for synonym in wordnet.synsets(word):
    for item in synonym.lemmas():
        if word != synonym.name() and len(synonym.lemma_names()) > 1:
            synonyms.append(item.name())

print(synonyms)
['mother', 'female_parent', 'mother', 'fuss', 'overprotect', 'beget', 'get', 'engender', 'father', 'sire', 'generate', 'bring_forth']

PyDictionary

In this example, I used PyDictionary to query synonym.com for the synonyms of the word 'mother'. The synonyms in this example include the words 'mom' and 'mum', and the list also includes additional synonyms that WordNet did not generate.

However, PyDictionary also produced a synonym list for 'mum' that has nothing to do with the word 'mother'. It seems that PyDictionary pulled this list from the adjective section of the page rather than the noun section, and a computer has a hard time distinguishing the adjective mum from the noun mum.

from PyDictionary import PyDictionary

dictionary_mother = PyDictionary('mother')
print(dictionary_mother.getSynonyms())
# output
[{'mother': ['mother-in-law', 'female parent', 'supermom', 'mum', 'parent', 'mom', 'momma', 'para I', 'mama', 'mummy', 'quadripara', 'mommy', 'quintipara', 'ma', 'puerpera', 'surrogate mother', 'mater', 'primipara', 'mammy', 'mamma']}]

dictionary_mum = PyDictionary('mum')
print(dictionary_mum.getSynonyms())
# output
[{'mum': ['incommunicative', 'silent', 'uncommunicative']}]

Some other possible approaches are using the Oxford Dictionary API or querying thesaurus.com. Both methods have pitfalls: the Oxford Dictionary API requires an API key and a paid subscription based on query volume, and thesaurus.com misses potential synonyms that could be useful for grouping words.

https://www.thesaurus.com/browse/mother
synonyms: mom, parent, ancestor, creator, mommy, origin, predecessor, progenitor, source, child-bearer, forebearer, procreator

Update

Generating a precise list of synonyms for every potential word in your corpus is hard and will require a multi-pronged approach. The code below uses WordNet and PyDictionary to create a superset of synonyms. Like all the other answers, this combined approach also leads to some over-counting of word frequencies. I have been trying to reduce this over-counting by combining key and value pairs within my final dictionary of synonyms; one rough sketch of that idea appears after the code below. The latter problem was much harder than I anticipated and might require me to open my own question to solve. In the end, I think you need to determine which approach works best based on your use case, and you will likely need to combine several approaches.

Thank you for posting this question, because it allowed me to look at other methods for solving a complex problem.

from string import punctuation
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from PyDictionary import PyDictionary

input_text = """The hand that rocks the cradle also makes the house a home. It is the prayers of the mother
         that keeps the family strong. When I think about my mum,I just cannot help but smile; The beauty of
         her loving heart,regardless of my age. She
         has made me laugh,It would be the
         greatest ever told. I would write about my daddy,This to me you have always been. Through the good times and the bad,Your understanding I have had."""


def normalize_textual_information(text):
   # split text into tokens by white space
   token = text.split()

   # remove punctuation from each token
   table = str.maketrans('','',punctuation)
   token = [word.translate(table) for word in token]

   # remove any tokens that are not alphabetic
   token = [word.lower() for word in token if word.isalpha()]

   # filter out English stop words
   stop_words = set(stopwords.words('english'))

   # you could add additional stops like this
   stop_words.add('cannot')
   stop_words.add('could')
   stop_words.add('would')

   token = [word for word in token if word not in stop_words]

   # filter out any short tokens
   token = [word for word in token if len(word) > 1]
   return token


def generate_word_frequencies(words):
   # list to hold word frequencies
   word_frequencies = []

   # loop through the tokens and generate a word count for each token
   for word in words:
      word_frequencies.append(words.count(word))

   # aggregate the words and word_frequencies into tuples and convert them into a dictionary
   word_frequencies = dict(zip(words, word_frequencies))

   # sort the frequency of the words from low to high
   sorted_frequencies = {key: value for key, value in
                         sorted(word_frequencies.items(), key=lambda item: item[1])}

   return sorted_frequencies


def get_synonyms_internet(word):
   dictionary = PyDictionary(word)
   synonym = dictionary.getSynonyms()
   return synonym

 
words = normalize_textual_information(input_text)

all_synsets_1 = {}
for word in words:
  for synonym in wordnet.synsets(word):
    if word != synonym.name() and len(synonym.lemma_names()) > 1:
      for item in synonym.lemmas():
        if word != item.name():
          all_synsets_1.setdefault(word, []).append(str(item.name()).lower())

all_synsets_2 = {}
for word in words:
  word_synonyms = get_synonyms_internet(word)
  for synonym in word_synonyms:
    if word != synonym and synonym is not None:
      all_synsets_2.update(synonym)

word_relationship = {**all_synsets_1, **all_synsets_2}

frequencies = generate_word_frequencies(words)
word_matches = []
word_set = {}
duplication_check = set()

for word, frequency in frequencies.items():
   for keyword, synonym in word_relationship.items():
      match = [x for x in synonym if word == x]
      if word == keyword or match:
        match = ' '.join(map(str, match))
        if match not in word_set or match not in duplication_check or word not in duplication_check:
           duplication_check.add(word)
           duplication_check.add(match)
           # append the keyword, the matched synonym and the frequency
           word_matches.append([keyword, match, frequency])

# used to hold the final keywords and frequencies
final_results = {}

# list comprehension to obtain the primary keywords and their frequencies
synonym_matches = [(keyword[0], keyword[2]) for keyword in word_matches]

# iterate synonym_matches and total the frequency count for each keyword
for keyword, frequency in synonym_matches:
   final_results[keyword] = final_results.get(keyword, 0) + frequency

# do something with the final results
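As for the over-counting mentioned above, one way to attack it is to collapse entries of the combined synonym dictionary whose keys appear in each other's synonym lists. Below is a minimal sketch of that idea; the merge rule (keeping the first key encountered as the representative) is my own simplification and only follows one level of chaining:

def merge_synonym_entries(word_relationship):
    # maps an absorbed key to the key that absorbed it
    key_to_rep = {}
    merged = {}
    for key, synonyms in word_relationship.items():
        rep = key_to_rep.get(key, key)
        merged.setdefault(rep, set()).update(synonyms)
        # any synonym that is itself a dictionary key gets absorbed into this entry
        for syn in synonyms:
            if syn in word_relationship:
                key_to_rep[syn] = rep
    # drop each representative from its own synonym list
    return {key: sorted(value - {key}) for key, value in merged.items()}

# e.g. {'mom': ['mother'], 'mother': ['mom', 'mum']} -> {'mom': ['mother', 'mum']}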

This is a hard problem, and the best solution depends on the use case you are trying to solve. It is hard because combining words requires an understanding of the semantics of the words. You can group mom and mother together because they are semantically related.

One way to identify whether two words are semantically related is by using distributed word embeddings (vectors) such as word2vec, GloVe, or fastText. You can compute the cosine similarity between the vector of a given word and the vectors of all other words, pick the top 5 closest words, and create new words from them.

Example using word2vec:

# Load a pretrained word2vec model
import gensim.downloader as api
model = api.load('word2vec-google-news-300')

# word list reconstructed from the output below
words = ['mom', 'mother', 'london', 'life', 'teach', 'teacher']

vectors = [model.get_vector(w) for w in words]
for i, w in enumerate(vectors):
   # rank every word by cosine similarity to word i; index 0 is the word itself
   first_best_match = model.cosine_similarities(vectors[i], vectors).argsort()[::-1][1]
   second_best_match = model.cosine_similarities(vectors[i], vectors).argsort()[::-1][2]

   print(f"{words[i]} + {words[first_best_match]}")
   print(f"{words[i]} + {words[second_best_match]}")

Output:

mom + mother
mom + teacher
mother + mom
mother + teacher
london + mom
london + life
life + mother
life + mom
teach + teacher
teach + mom
teacher + teach
teacher + mother

You can also try putting a threshold on the cosine similarity and selecting only those words whose cosine similarity is greater than the threshold.
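For instance, here is a minimal sketch of such a threshold on top of the gensim model above (the 0.5 cutoff is an arbitrary assumption that you would tune against your corpus):

import numpy as np

THRESHOLD = 0.5  # hypothetical cutoff for treating two words as mergeable

vectors = np.array([model.get_vector(w) for w in words])
for i, word in enumerate(words):
    similarities = model.cosine_similarities(vectors[i], vectors)
    for j, score in enumerate(similarities):
        if i != j and score > THRESHOLD:
            print(f"merge candidate: {word} + {words[j]} (cosine = {score:.2f})")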

One problem with semantic similarity is that words can be semantically opposite and still end up similar in embedding space (man - woman), while another pair (man - king) is similar because the words are semantically alike.


Another quirky way to approach this problem is to use the good old PyDictionary library. You can use the

dictionary.getSynonyms()

function to loop through all the words in your list and group them. All the available synonyms listed are collected and mapped to one group. From there you can assign a final variable and sum up the synonyms. In your example, you would pick the final word as mother, which shows the final count of its synonyms; a rough sketch follows below.
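Here is a minimal sketch of that idea against the frequency data from the question. The merge rule (folding a word into the first group whose representative lists it as a synonym, or vice versa) is my own choice, and the printed totals depend on what synonym.com returns at query time:

from PyDictionary import PyDictionary

# frequency data from the question
word_frequencies = {'mom': 250, 'mother': 81, 'London': 6, 'life': 6}

# cache each word's synonyms; getSynonyms() returns a list like [{word: [synonyms]}]
synonym_lists = {}
for word in word_frequencies:
    result = PyDictionary(word).getSynonyms()
    synonym_lists[word] = set(result[0].get(word, [])) if result else set()

final_counts = {}  # representative word -> summed frequency
for word, frequency in word_frequencies.items():
    # reuse an existing group if either side lists the other as a synonym
    rep = next((r for r in final_counts
                if word in synonym_lists[r] or r in synonym_lists[word]), word)
    final_counts[rep] = final_counts.get(rep, 0) + frequency

print(final_counts)
# e.g. {'mom': 331, 'London': 6, 'life': 6}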


What you are trying to achieve is semantic textual similarity.

I would like to recommend the TensorFlow Universal Sentence Encoder.

For example:

#@title Load the Universal Sentence Encoder's TF Hub module
from absl import logging

import tensorflow as tf

import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" #@param ["https://tfhub.dev/google/universal-sentence-encoder/4","https://tfhub.dev/google/universal-sentence-encoder-large/5"]
model = hub.load(module_url)
print ("module %s loaded" % module_url)
def embed(input):
  return model(input)

def plot_similarity(labels, features, rotation):
  corr = np.inner(features, features)
  sns.set(font_scale=1.2)
  g = sns.heatmap(
      corr, xticklabels=labels, yticklabels=labels,
      vmin=0, vmax=1, cmap="YlOrRd")
  g.set_xticklabels(labels, rotation=rotation)
  g.set_title("Semantic Textual Similarity")

def run_and_plot(messages_):
  message_embeddings_ = embed(messages_)
  plot_similarity(messages_, message_embeddings_, 90)

messages = [
    "Mother", "Mom", "Mama", "Dog", "Cat"
]

run_and_plot(messages)

[Image: heatmap of the semantic textual similarity scores between the five words]
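Since np.inner already produces the pairwise scores behind that heatmap, one could go a step further and group any words scoring above a cutoff. A minimal sketch; the 0.8 threshold is an assumption to tune:

def group_by_similarity(labels, features, threshold=0.8):
    # greedily assign each label to the first earlier label whose
    # embedding similarity exceeds the threshold
    corr = np.inner(features, features)
    representative = {}
    for i, label in enumerate(labels):
        rep = next((labels[j] for j in range(i) if corr[i, j] > threshold), label)
        representative[label] = rep
    return representative

print(group_by_similarity(messages, embed(messages)))
# e.g. {'Mother': 'Mother', 'Mom': 'Mother', 'Mama': 'Mother', 'Dog': 'Dog', 'Cat': 'Cat'}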

This example is written in Python, but I have also created an example that loads the model in a JVM-based language:

https://github.com/ntedgi/universal-sentence-encoder


You can generate word embedding vectors and use a clustering algorithm to group them. At the end you will need to tune the algorithm's hyperparameters to achieve highly accurate results.

from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

import numpy as np
import spacy

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Load the large English model
nlp = spacy.load("en_core_web_lg")

tokens = nlp("dog cat banana apple teaching teacher mom mother mama mommy berlin paris")

# Generate word embedding vectors
vectors = np.array([token.vector for token in tokens])
vectors.shape
# (12, 300)

Let's use the principal component analysis algorithm to visualize our embeddings in 3-dimensional space:

pca_vecs = PCA(n_components=3).fit_transform(vectors)
pca_vecs.shape
# (12, 3)

fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111, projection='3d')
xs, ys, zs = pca_vecs[:, 0], pca_vecs[:, 1], pca_vecs[:, 2]
_ = ax.scatter(xs, ys, zs)

for x, y, z, label in zip(xs, ys, zs, tokens):
    ax.text(x + 0.3, y, z, str(label))

[Image: 3D scatter plot of the PCA-projected word vectors, labeled by token]

Let's cluster the words with the DBSCAN algorithm:

model = DBSCAN(eps=5, min_samples=1)
model.fit(vectors)

for word, cluster in zip(tokens, model.labels_):
    print(word, '->', cluster)

Output:

dog -> 0
cat -> 0
banana -> 1
apple -> 2
teaching -> 3
teacher -> 3
mom -> 4
mother -> 4
mama -> 4
mommy -> 4
berlin -> 5
paris -> 6
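To tie this back to the original question, one could sum the frequency counts within each cluster and keep a single representative word. A minimal sketch that reuses tokens and model.labels_ from above; the frequency values here are hypothetical:

# hypothetical frequencies for some of the clustered tokens
frequencies = {'mom': 250, 'mother': 81, 'mama': 10, 'mommy': 5, 'berlin': 7, 'paris': 6}

cluster_totals = {}  # cluster id -> (representative word, summed frequency)
for word, cluster in zip(tokens, model.labels_):
    text = str(word)
    if text in frequencies:
        # keep the first word seen in each cluster as its representative
        rep, total = cluster_totals.get(cluster, (text, 0))
        cluster_totals[cluster] = (rep, total + frequencies[text])

for rep, total in cluster_totals.values():
    print(rep, '->', total)
# e.g. mom -> 346 (250 + 81 + 10 + 5), berlin -> 7, paris -> 6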

matthewreagan/WebstersEnglishDictionary

The idea is to use this dictionary to identify similar words.

In short: run some knowledge-discovery algorithm on this dictionary that extracts knowledge based on English grammar.

Here is a thesaurus: it is 18 MB.

HERE is an excerpt from the thesaurus; you can try to match alternate words with an algorithm of your own.

{"word": "ma","key": "ma_1","pos": "noun","synonyms": ["mamma","mum"]}

For a quick fix there is also the option of using an external API; such APIs let you do much more, for example getting synonyms, looking up multiple definitions, finding rhyming words, and so on.

WORDAPI
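For example, a sketch of querying it over HTTP with requests. WordsAPI is hosted on RapidAPI, so the endpoint and header names below follow that hosting convention, but treat them as assumptions and supply your own API key:

import requests

API_KEY = 'your-rapidapi-key'  # hypothetical; requires a RapidAPI subscription

response = requests.get(
    'https://wordsapiv1.p.rapidapi.com/words/mother/synonyms',
    headers={'X-RapidAPI-Key': API_KEY,
             'X-RapidAPI-Host': 'wordsapiv1.p.rapidapi.com'},
)
print(response.json())
# e.g. {'word': 'mother', 'synonyms': ['female parent', 'mom', 'momma', ...]}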
