How to find a "connection" between words in order to cluster sentences

Problem description

I need to connect the words 4G, mobile phones and Internet so that sentences about technology end up clustered together. I have the following sentences:

4G is the fourth generation of broadband network.
4G is slow. 
4G is defined as the fourth generation of mobile technology
I bought a new mobile phone. 

I need the above sentences to be considered part of the same group. At the moment they are not, probably because no relation is found between 4G and mobile. I first tried to use wordnet.synsets to find synonyms connecting 4G to Internet or mobile phone, but unfortunately it found no connection (a rough sketch of that attempt appears after the code below). To cluster the sentences I do the following:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
import numpy

texts = ["4G is the fourth generation of broadband network.","4G is slow.","4G is defined as the fourth generation of mobile technology","I bought a new mobile phone."]

# vectorization of the sentences
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)
words = vectorizer.get_feature_names()
print("words",words)


n_clusters=3
number_of_seeds_to_try=10
max_iter = 300
number_of_process=2 # seed runs are distributed across processes
model = KMeans(n_clusters=n_clusters,max_iter=max_iter,n_init=number_of_seeds_to_try,n_jobs=number_of_process).fit(X)

labels = model.labels_
# indices of the top words in each cluster
ordered_words = model.cluster_centers_.argsort()[:,::-1]

print("centers:",model.cluster_centers_)
print("labels",labels)
print("intertia:",model.inertia_)

texts_per_cluster = numpy.zeros(n_clusters)
for i_cluster in range(n_clusters):
    for label in labels:
        if label==i_cluster:
            texts_per_cluster[i_cluster] +=1 

print("Top words per cluster:")
for i_cluster in range(n_clusters):
    print("Cluster:",i_cluster,"texts:",int(texts_per_cluster[i_cluster])),for term in ordered_words[i_cluster,:10]:
        print("\t"+words[term])

print("\n")
print("Prediction")

text_to_predict = "Why 5G is dangerous?"
Y = vectorizer.transform([text_to_predict])
predicted_cluster = model.predict(Y)[0]
texts_per_cluster[predicted_cluster]+=1

print(text_to_predict)
print("Cluster:",predicted_cluster,int(texts_per_cluster[predicted_cluster])),for term in ordered_words[predicted_cluster,:10]:
print("\t"+words[term])

Any help with this would be greatly appreciated.

Solution

As @sergey-bushmanov's comment suggests, dense word embeddings (for example from word2vec or similar algorithms) might help.

They turn words into dense high-dimensional vectors in which words with similar meanings/usages are close to each other. Even more: certain directions in the space often correspond, roughly, to relationships between words.

So word vectors trained on a sufficiently representative (large and varied) corpus will place the vectors for '4G' and 'mobile' near each other, and then, if your sentence representations are built from the word vectors, that may help your clustering.
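
As a rough illustration (not code from this answer), you could check that directly with gensim and a pretrained model; 'glove-wiki-gigaword-100' is just one example model, and whether '4g' is in its vocabulary depends on the corpus it was trained on:

# Rough sketch: check whether pretrained vectors place '4g' and 'mobile' near each other.
# 'glove-wiki-gigaword-100' is one example model available via gensim's downloader.
import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-100")   # returns a KeyedVectors instance

for word in ("4g", "mobile", "phone", "internet"):
    print(word, "in vocabulary:", word in kv)

if "4g" in kv:
    print("similarity('4g', 'mobile'):", kv.similarity("4g", "mobile"))
    print("most similar to '4g':", kv.most_similar("4g", topn=5))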

A quick way to model a sentence from individual word-vectors is to use the average of all of the sentence's word-vectors as the sentence-vector. That is too simple to model many shades of meaning (especially those that come from grammar and word order), but it often works as a good baseline, especially for broad topicality.
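
A rough sketch of that averaging approach, reusing the texts list from the question and a pretrained gensim KeyedVectors model kv as in the sketch above (the helper name sentence_vector is mine):

# Rough sketch: average word-vectors per sentence, then cluster the sentence-vectors.
# Assumes kv is a pretrained gensim KeyedVectors model and texts is the question's sentence list.
import numpy as np
from sklearn.cluster import KMeans

def sentence_vector(sentence, kv):
    # average the vectors of the in-vocabulary words; fall back to zeros if none are known
    tokens = [w.strip(".,!?") for w in sentence.lower().split()]
    known = [w for w in tokens if w in kv]
    if not known:
        return np.zeros(kv.vector_size)
    return np.mean([kv[w] for w in known], axis=0)

sentence_vectors = np.array([sentence_vector(t, kv) for t in texts])
model = KMeans(n_clusters=2, n_init=10).fit(sentence_vectors)   # 2 clusters chosen arbitrarily here
print("labels:", model.labels_)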

Another calculation, "Word Mover's Distance", treats a sentence as a set of word-vectors (without averaging them) and allows sentence-to-sentence distance calculations that can work somewhat better than the simple average, but it is far more expensive to compute for longer sentences.
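
Gensim exposes this as wmdistance() on its word-vector models; a rough sketch, again assuming kv and texts from the earlier sketches (depending on the gensim version, an extra optimal-transport dependency such as pyemd or POT is needed):

# Rough sketch: pairwise Word Mover's Distance between the question's sentences.
# Assumes kv is a pretrained gensim KeyedVectors model and texts is the question's sentence list.
tokenized = [[w.strip(".,!?") for w in t.lower().split()] for t in texts]

for i in range(len(tokenized)):
    for j in range(i + 1, len(tokenized)):
        distance = kv.wmdistance(tokenized[i], tokenized[j])   # smaller means more similar
        print("WMD(%d, %d) = %.3f" % (i, j, distance))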