Pairwise similarity of documents from different collections

Problem description

I am trying to compute cosine_similarity for documents coming from two different datasets. Each collection has 30 documents, and I want to match similar documents between documents and documents2.

My approach so far looks like this:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import os
import sys
import glob
import codecs
from collections import defaultdict
from collections import Counter
from nltk import word_tokenize
import nltk
import re
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from sklearn.metrics.pairwise import cosine_similarity
from contextlib import ExitStack

#max_df = 29 set to ignore terms that appear in more than 29 documents 
#TFIDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=29)

#Create list of documents to work with
path = "C:\\Users\\path\\Desktop\\research\\dataset\\1"
text_files = [f for f in os.listdir(path) if f.endswith('.txt')]
documents = [os.path.join(path,name) for name in text_files]

with ExitStack() as stack:
    files = [stack.enter_context(open(filename,encoding="utf-8")).read() for filename in documents]
    X = tfidf_vectorizer.fit_transform(files)

path2 = "C:\\Users\\path\\Desktop\\research\\dataset\\2"
text_files2 = [f for f in os.listdir(path2) if f.endswith('.txt')]
documents2 = [os.path.join(path2,name) for name in text_files2]

with ExitStack() as stack:
    files2 = [stack.enter_context(open(filename,encoding="utf-8")).read() for filename in documents2]
    X2 = tfidf_vectorizer.fit_transform(files2)

#X = X.reshape(-1,1)
#X2 = X2.reshape(-1,1)

sm = cosine_similarity(X,X2)

I get the following error: ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 5068 while Y.shape[1] == 4479

If I uncomment the reshape statements, I get numpy.core._exceptions.MemoryError: Unable to allocate 152. GiB for an array with shape (152040, 134370) and data type float64

Any ideas?

Solution

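The ValueError points at the two separate fit_transform calls: each call learns its own vocabulary, so X ends up with 5068 feature columns while X2 has 4479, and cosine_similarity cannot pair them up. The reshape(-1,1) workaround only flattens both sparse matrices into one-column vectors, which is why the second attempt tries to build a dense 152040 x 134370 similarity matrix and runs out of memory. Below is a minimal sketch of one way to line the dimensions up, reusing the files, files2, documents and documents2 lists (and the imports) from the code above: fit the vectorizer on the first collection, then transform the second one with the already-fitted vocabulary. Fitting a single vectorizer on the combined 60 documents is another option, but note that max_df=29 would then count document frequency across both collections. The matching loop at the end is purely illustrative.

#Fit on the first collection, reuse its vocabulary for the second one
tfidf_vectorizer = TfidfVectorizer(max_df=29)

X = tfidf_vectorizer.fit_transform(files)   #learns the vocabulary from collection 1
X2 = tfidf_vectorizer.transform(files2)     #same number of columns as X; terms unseen in collection 1 are dropped

sm = cosine_similarity(X, X2)               #shape (30, 30): sm[i, j] compares documents[i] with documents2[j]

#Illustrative only: closest match in collection 2 for each document in collection 1
best_match = sm.argmax(axis=1)
for i, j in enumerate(best_match):
    print(documents[i], "->", documents2[j], sm[i, j])

Which collection is used for fitting does matter slightly, because terms that never occur in the fitted collection are ignored when transforming the other one; if that is a concern, fit once on files + files2 and then transform each collection separately.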