Connecting sentences with a graph

Problem description

I have a list of sentences on a couple of topics (two), like the following:

Sentences
Trump says that it is useful to win the next presidential election. 
The Prime Minister suggests the name of the winner of the next presidential election.
In yesterday's conference, the Prime Minister said that it is very important to win the next presidential election. 
The Chinese Minister is in London to discuss about climate change.
The president Donald Trump states that he wants to win the presidential election. This will require a strong media engagement.
The president Donald Trump states that he wants to win the presidential election. The UK has proposed collaboration. 
The president Donald Trump states that he wants to win the presidential election. He has the support of his electors. 

As you can see, there are similarities among the sentences.

(figure: directed graph built from the sentence-similarity matrix)

I am trying to relate multiple sentences and visualise their characteristics by using a (directed) graph. The graph is built from a similarity matrix, applying the row order of the sentences, as shown above. I created a new column, Time, to show the order of the sentences, so the first row (Trump says...) is at time 1, the second row (The Prime Minister suggests...) is at time 2, and so on. Like this:

Time    Sentences
1           Trump said that it is useful to win the next presidential election. 
2           The Prime Minister suggests the name of the winner of the next presidential election.

3           In today's conference, the Prime Minister said that it is very important to win the next presidential election. 

...

Then I want to find the relationships among them, in order to get a clear view of the topic. Multiple paths through the sentences would indicate that there are multiple pieces of information related to it. To determine the similarity between two sentences, I tried to extract nouns and verbs as follows:

from nltk import pos_tag, word_tokenize

nouns = []
verbs = []
for index, row in df.iterrows():
    # tokenize before tagging, otherwise pos_tag would tag single characters
    tags = pos_tag(word_tokenize(row[0]))
    nouns.append([word for word, pos in tags if pos.startswith('NN')])
    verbs.append([word for word, pos in tags if pos.startswith('VB')])

because they are the keywords of any sentence. So when a keyword (noun or verb) appears in sentence x but not in the other sentence, it indicates a difference between those two sentences. I think a better approach might be to use word2vec or gensim (Word Mover's Distance).
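Before moving to word2vec, the keyword overlap itself already gives a usable score. A minimal sketch using Jaccard similarity (the keyword sets below are hand-picked for illustration, not produced by the extraction code above):

```python
def jaccard(a, b):
    """Jaccard similarity: shared keywords divided by all keywords."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# hand-picked keyword sets for two of the example sentences
kw_trump = {"Trump", "election", "win"}
kw_minister = {"Minister", "election", "winner"}

print(jaccard(kw_trump, kw_minister))  # 1 shared keyword out of 5 -> 0.2
```

The same function works directly on the noun/verb lists extracted above, once they are converted to sets.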

The similarity has to be computed for every sentence. I would like to build a graph that shows the content of the sentences in the example above. Since there are two topics (Trump and the Chinese Minister), I need to look for sub-topics within each of them: for example, Trump has the sub-topic presidential election. A node in my graph should represent a sentence. The words in each node represent the differences between sentences, i.e. the new information in that sentence. For instance, the word states in the sentence at time 5 is also in the adjacent sentences at times 6 and 7.

I just want to find a way to produce results similar to the figure below. I have tried mainly noun and verb extraction, but it is probably not the right approach. What I tried to do is to take the sentence at time 1, compare it with the others, assign a similarity score (with noun/verb extraction, and with word2vec), and repeat this for all the other sentences. My problem now is how to extract the differences in order to create a graph that makes sense.
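For the difference-extraction part, one simple option is the set difference of keywords between a sentence and its neighbours: words that appear at time t but not at time t-1 are candidates for the new information carried by that node. A sketch with hand-made keyword sets for the sentences at times 5-7 (illustrative only):

```python
# hand-made keyword sets per time step (illustrative, not extracted automatically)
keywords = {
    5: {"Trump", "states", "win", "election", "media"},
    6: {"Trump", "states", "win", "election", "UK"},
    7: {"Trump", "states", "win", "election", "electors"},
}

def new_info(t, prev):
    """Keywords present at time t but absent at time prev."""
    return keywords[t] - keywords[prev]

print(new_info(6, 5))  # {'UK'}: the new information at time 6
print(new_info(7, 6))  # {'electors'}
```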

For the graph part, I would consider using networkx (DiGraph):

import networkx as nx
from pyvis.network import Network  # Network comes from pyvis

G = nx.DiGraph()
N = Network(directed=True) 

which would show the direction of the relationships.
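Putting time order and similarity together, each edge can point from the earlier sentence to the later similar one. A self-contained sketch (the similarity scores here are invented; plug in your own measure):

```python
import networkx as nx

# (time_i, time_j, similarity) triples -- scores invented for illustration
similar_pairs = [(1, 3, 0.8), (2, 3, 0.7), (5, 6, 0.9), (5, 7, 0.9)]

G = nx.DiGraph()
for i, j, score in similar_pairs:
    # the edge direction follows time: earlier sentence -> later sentence
    G.add_edge(i, j, weight=score)

print(list(G.successors(5)))  # [6, 7]
```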

I am providing a different example to make it clearer (but it is also fine if you use the previous one). Apologies for the inconvenience, but since my first question was not very clear, I had to provide a better and possibly simpler example.

Solution

I did not implement NLP for verb/noun separation, just added a list of good words. They could be extracted and normalised relatively easily with spacy. Note that walk occurs in sentences 1, 2, and 5, and forms a triad.

import re
import networkx as nx
import matplotlib.pyplot as plt

plt.style.use("ggplot")

sentences = [
    "I went out for a walk or walking.",
    "When I was walking, I saw a cat.",
    "The cat was injured.",
    "My mum's name is Marylin.",
    "While I was walking, I met John.",
    "Nothing has happened.",
]

G = nx.Graph()
# set of possible good words
good_words = {"went","walk","cat","walking"}

# remove punctuation and keep only good words inside sentences
words = [
    set(re.sub(r"[^\w\s]", "", sentence).lower().split()).intersection(good_words)
    for sentence in sentences
]

# convert sentences to dict for further labeling
sentences = dict(enumerate(sentences))

# add nodes
for i in sentences:
    G.add_node(i)

# add edges if two nodes have the same word inside
for i in range(len(words)):
    for j in range(i + 1, len(words)):
        for edge_label in words[i].intersection(words[j]):
            G.add_edge(i, j, r=edge_label)

# compute layout coords
coord = nx.spring_layout(G)

plt.figure(figsize=(20,14))

# set label coords a bit above the nodes
node_label_coords = {}
for node, coords in coord.items():
    node_label_coords[node] = (coords[0], coords[1] + 0.04)

# draw the network
nodes = nx.draw_networkx_nodes(G, pos=coord)
edges = nx.draw_networkx_edges(G, pos=coord)
edge_labels = nx.draw_networkx_edge_labels(G, pos=coord)
node_labels = nx.draw_networkx_labels(G, pos=node_label_coords, labels=sentences)
plt.title("Sentences network")
plt.axis("off")

(figure: resulting sentence network, edges labelled with the shared words)

Update
If you want to measure the similarity between different sentences, you may want to compute the difference between their sentence embeddings.
This gives you the chance to find semantic similarity between sentences that use different words, e.g. "A soccer game with multiple males playing" and "Some men are playing a sport". A near-state-of-the-art approach using BERT can be found here, and a simpler one here.
Since you have a similarity measure, just replace the add_edge block so that a new edge is added only when the similarity measure is greater than some threshold. The resulting edge-adding code will look like this:

# add edges only when the similarity exceeds a threshold
threshold = 0.90
for i in range(len(words)):
    for j in range(i + 1, len(words)):
        # suppose you have some similarity function using BERT or PCA
        similarity = check_similarity(sentences[i], sentences[j])
        if similarity > threshold:
            G.add_edge(i, j, r=similarity)
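The check_similarity function above is assumed to exist; until a BERT embedding is wired in, a stand-in such as cosine similarity over word counts works with the same threshold logic (this is only a placeholder, not a semantic measure):

```python
import math
from collections import Counter

def check_similarity(a, b):
    """Cosine similarity over word-count vectors (placeholder for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(round(check_similarity("win the election", "win the vote"), 2))  # 0.67
```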

One way to handle this is to tokenize, remove stop words, and create the vocabulary; then draw the graph based on that vocabulary. Below I show an example based on unigram tokens, but a better approach would be to identify phrases (ngrams) and use them as the vocabulary instead of unigrams. Sentences would then be depicted graphically by the nodes (and corresponding sentences) with higher in- and out-degrees.
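The phrase (ngram) idea mentioned above can be sketched with a plain bigram count; frequent bigrams such as "presidential election" then become vocabulary entries instead of unigrams (the frequency cutoff of 2 here is arbitrary):

```python
from collections import Counter

corpus = [
    "win the next presidential election",
    "the winner of the next presidential election",
    "important to win the next presidential election",
]

bigrams = Counter()
for sentence in corpus:
    tokens = sentence.lower().split()
    bigrams.update(zip(tokens, tokens[1:]))

# keep bigrams seen at least twice as candidate phrases
phrases = {" ".join(bg) for bg, count in bigrams.items() if count >= 2}
print(sorted(phrases))  # ['next presidential', 'presidential election', 'the next', 'win the']
```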

Example:

from sklearn.feature_extraction.text import CountVectorizer
import networkx as nx
import matplotlib.pyplot as plt


corpus = [
  "Trump says that it is useful to win the next presidential election",
  "The Prime Minister suggests the name of the winner of the next presidential election",
  "In yesterday conference, the Prime Minister said that it is very important to win the next presidential election",
  "The Chinese Minister is in London to discuss about climate change",
  "The president Donald Trump states that he wants to win the presidential election. This will require a strong media engagement",
  "The president Donald Trump states that he wants to win the presidential election. The UK has proposed collaboration",
  "The president Donald Trump states that he wants to win the presidential election. He has the support of his electors",
]

vectorizer = CountVectorizer(analyzer="word", ngram_range=(1, 1), stop_words="english")
vectorizer.fit_transform(corpus)


G = nx.DiGraph()
# cache the vocabulary instead of calling get_feature_names_out() per word
vocabulary = set(vectorizer.get_feature_names_out())
G.add_nodes_from(vocabulary)

all_edges = []
for s in corpus:
  edges = []
  previous = None
  for w in s.split():
    # strip punctuation so tokens like "election." match the vocabulary
    w = w.lower().strip(".,")
    if w in vocabulary:
      if previous:
        edges.append((previous, w))
      previous = w

  all_edges.append(edges)


plt.figure(figsize=(20, 20))
pos = nx.shell_layout(G)
nx.draw_networkx_nodes(G, pos, node_size=500)
nx.draw_networkx_labels(G, pos)
colors = ['r', 'g', 'b', 'y', 'm', 'c', 'k']  # one colour per sentence
for i, edges in enumerate(all_edges):
  nx.draw_networkx_edges(G, pos, edgelist=edges, edge_color=colors[i], arrows=True)
plt.show()

Output: (figure: word graph, one edge colour per sentence)