Clustering sentence embeddings

Problem description

I'm working on a small project in which I need to remove irrelevant information (e.g. ads) from the HTML content I extract from websites. Since I'm a beginner in NLP, after doing some research I came up with a simple approach.

The language used on the websites is mainly Chinese, and I store each sentence (split on commas) in a list. I use a model called HanLP to parse the sentences into word tokens, like this:

[['萨哈夫','说',',','伊拉克','将','同','联合国','销毁','大','规模','杀伤性','武器','特别','委员会','继续','保持','合作','。'],['上海','华安','工业','(','集团',')','公司','董事长','谭旭光','和','秘书','张晚霞','来到','美国','纽约','现代','艺术','博物馆','参观','。']]
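
For illustration, a minimal sketch of how such token lists could be produced with the pyhanlp package; the segmentation call and the example sentence string are assumptions, since the original code is not shown:

from pyhanlp import HanLP   # pip install pyhanlp

sentence = "上海华安工业(集团)公司董事长谭旭光和秘书张晚霞来到美国纽约现代艺术博物馆参观。"
tokens = [term.word for term in HanLP.segment(sentence)]
print(tokens)
# roughly: ['上海', '华安', '工业', '(', '集团', ')', '公司', '董事长', ...]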

I found a pretrained Chinese word-embedding database and used it to attach a word embedding to each word in my lists. My approach is then to obtain the embedding of a sentence by averaging the embeddings of its elements. I now have a list that contains a sentence-embedding vector for every sentence I parsed.

sentence: ['各国','必须','“','”','支出','》','的','报道','称']
sentence embedding: [0.08130878633396192, -0.07660450288941237, 0.008989107615145093, 0.07014013996178453, 0.028158639980988068, 0.01821030060422014, ..., -0.04552285996297459, -0.03509725736115466, 0.02857604629190808]
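
For illustration, a minimal sketch of this averaging step; the dict word_vectors (word to numpy array) and the 200-dimensional size stand in for the pretrained embedding database and are assumptions, not something specified above:

import numpy as np

EMBED_DIM = 200        # assumed dimensionality of the pretrained embeddings
word_vectors = {}      # assumed: word -> np.ndarray of shape (EMBED_DIM,), loaded from the pretrained database

def average_embedding(tokens):
    # Collect the embedding of every token that exists in the pretrained vocabulary.
    vectors = [word_vectors[w] for w in tokens if w in word_vectors]
    if not vectors:                     # no token found in the vocabulary
        return np.zeros(EMBED_DIM)
    return np.mean(vectors, axis=0)     # element-wise average, as described above

tokenized_sentences = [['各国', '必须', '“', '”', '支出', '》', '的', '报道', '称']]
sentence_vectors = [average_embedding(tokens) for tokens in tokenized_sentences]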

The next step is to cluster these sentence-embedding vectors in order to find the sentences that, compared with the others, clearly contain no relevant content.

Does my approach even make sense? If it does, what tools can I use to cluster the sentence embeddings? I've seen methods such as K-means or computing the L2 distance, but I'm not sure how to implement them.

Thanks!

Solution

The approach makes sense if you want to get rid of sentences that don't help your downstream analysis, but element-wise averaging is probably not the best way to construct a sentence embedding. A better way is to take the individual word embeddings and combine them using tf-idf:

import numpy as np

sentence = [w1, w2, w3]
word_vectors = [v1, v2, v3]  # each v is of shape (N,), where N is the size of the embedding

term_frequency_of_word = [t1, t2, t3]
inverse_doc_freq = [idf1, idf2, idf3]

# tf-idf weight for each word in the sentence
word_weights = [tf * idf for tf, idf in zip(term_frequency_of_word, inverse_doc_freq)]

# weighted sum of the word vectors gives the sentence vector
sentence_vector = np.zeros(N)

for weight, vector in zip(word_weights, word_vectors):
    scaled_vector = vector * weight
    sentence_vector += scaled_vector

By scaling with tf-idf, your sentence embedding is pulled toward the embeddings of the most important words in the sentence, which should help when you apply a clustering algorithm to filter out the unwanted sentences.
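
As a hedged, runnable sketch of this weighting, one way to obtain the tf-idf weights is scikit-learn's TfidfVectorizer; tokenized_sentences, word_vectors and EMBED_DIM below are the same illustrative placeholders used earlier (pre-segmented sentences and a word-to-vector dict), not part of the original answer:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

EMBED_DIM = 200        # assumed embedding size
word_vectors = {}      # assumed: word -> np.ndarray of shape (EMBED_DIM,)
tokenized_sentences = [['各国', '必须', '的', '报道', '称'], ['萨哈夫', '说', '伊拉克']]

# Join the pre-segmented tokens with spaces so each sentence becomes one "document".
docs = [" ".join(tokens) for tokens in tokenized_sentences]

vectorizer = TfidfVectorizer(analyzer=str.split)   # keep the existing tokens instead of re-tokenizing
tfidf = vectorizer.fit_transform(docs)             # sparse matrix of shape (n_sentences, vocab_size)
vocab = vectorizer.get_feature_names_out()

def tfidf_sentence_vector(i):
    row = tfidf[i].toarray().ravel()
    vec = np.zeros(EMBED_DIM)
    for j in row.nonzero()[0]:                      # indices of the words present in sentence i
        if vocab[j] in word_vectors:
            vec += row[j] * word_vectors[vocab[j]]  # scale each word vector by its tf-idf weight
    return vec

sentence_matrix = np.vstack([tfidf_sentence_vector(i) for i in range(len(docs))])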

Here is a quick tutorial on TF-IDF: http://www.tfidf.com


For clustering you can try k-means, but that algorithm only works with the Euclidean metric. To use another distance (e.g. the cosine distance), k-medoids is a suitable alternative with a similar EM-style alternating procedure. In Python, you can find KMeans in the scikit-learn library. To try KMedoids, you should install the scikit-learn-extra library (https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.cluster.KMedoids.html) or this library: https://github.com/letiantian/kmedoids
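
A short sketch of both options, under the assumption that the sentence embeddings are stacked into a numpy array X; the number of clusters is just an illustrative choice:

import numpy as np
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids   # pip install scikit-learn-extra

# X: one row per sentence embedding, e.g. X = np.vstack(sentence_vectors).
X = np.random.rand(20, 200)                  # placeholder data so the sketch runs on its own

# K-means, which uses the Euclidean metric.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)

# K-medoids with the cosine distance, via scikit-learn-extra.
kmedoids = KMedoids(n_clusters=2, metric="cosine", random_state=0).fit(X)
print(kmedoids.labels_)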