How to create a TF-IDF transform function from scratch

Problem description

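Following this tutorial, I wrote the functions below to compute TF-IDF for a set of documents: one computes the term frequency (TF), one computes the inverse document frequency (IDF), and one combines the two into TF-IDF.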

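import numpy as np

def tf(freq, all_tokens):
    # Build the TF matrix: for each word, its term frequency in every document
    tf_matrix = {}
    for word in freq:  # loop over the most frequent words (our vocabulary)
        doc_tf = []
        for doc in all_tokens:
            frequency = 0
            for token in doc:
                if word == token:
                    frequency += 1
            # occurrences divided by document length; guard against empty documents
            tf_word = frequency / len(doc) if len(doc) > 0 else 0.0
            doc_tf.append(tf_word)
        tf_matrix[word] = doc_tf
    return tf_matrix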

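# IDF dictionary: one weight per word, computed over the whole corpus
def idf(freq, all_tokens):
    idf_matrix = {}
    for word in freq:
        # number of documents that contain the word
        doc_count = 0
        for doc in all_tokens:
            if word in doc:
                doc_count += 1
        # the 1 in the denominator avoids division by zero
        idf_matrix[word] = np.log(len(all_tokens) / (1 + doc_count))
    return idf_matrix
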
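# Creating the TF-IDF model
def tfidf(tokens):
    # word_count is defined elsewhere (not shown here); it is assumed to
    # return the most frequent words of the corpus
    most_freq = word_count(tokens)
    tf_matrix = tf(most_freq, tokens)
    idf_matrix = idf(most_freq, tokens)
    tfidf_matrix = []
    for word in tf_matrix.keys():
        word_scores = []  # TF-IDF of this word in every document
        for value in tf_matrix[word]:
            score = value * idf_matrix[word]
            word_scores.append(score)
        tfidf_matrix.append(word_scores)

    # Transpose so that rows are documents and columns are words
    X = np.asarray(tfidf_matrix).T
    return X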
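
For reference, the quantities computed by the three functions above can be written as follows. The symbols are introduced here only for illustration and do not appear in the original code: $N$ is the number of documents in `all_tokens`, $|d|$ is the number of tokens in document $d$, and $\mathrm{df}(w)$ is the number of documents that contain word $w$.

```latex
\mathrm{tf}(w, d) = \frac{\mathrm{count}(w, d)}{|d|}, \qquad
\mathrm{idf}(w) = \ln\frac{N}{1 + \mathrm{df}(w)}, \qquad
\text{tf-idf}(w, d) = \mathrm{tf}(w, d) \cdot \mathrm{idf}(w)
```

Note that $\mathrm{idf}(w)$ depends only on the corpus it was computed from, while $\mathrm{tf}(w, d)$ depends only on the single document $d$.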

What I would like to learn: if I have a new document, for example a test document, how do I create a transform function for it?
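
One possible approach, sketched here under stated assumptions rather than offered as a definitive answer, is to split the work into a "fit" step and a "transform" step. The fit step learns the vocabulary and the IDF weights from the training documents only; the transform step computes the term frequencies of any document (training or new) against that stored vocabulary and multiplies them by the stored IDF weights. The names `fit_tfidf` and `transform_tfidf` are hypothetical, and the vocabulary is taken to be every distinct training token because the `word_count` helper from the question is not shown. The IDF formula is the same one used in the question.

```python
import numpy as np

def fit_tfidf(train_tokens):
    """Learn the vocabulary and IDF weights from tokenized training documents."""
    # Assumption: the vocabulary is every distinct training token, sorted so
    # that the column order is stable. The question uses word_count instead.
    vocab = sorted({token for doc in train_tokens for token in doc})
    n_docs = len(train_tokens)
    idf = {}
    for word in vocab:
        doc_count = sum(1 for doc in train_tokens if word in doc)
        # Same formula as in the question: log(N / (1 + document count))
        idf[word] = np.log(n_docs / (1 + doc_count))
    return vocab, idf

def transform_tfidf(tokens, vocab, idf):
    """Turn tokenized documents into TF-IDF rows using a fitted vocab and IDF."""
    rows = []
    for doc in tokens:
        n = len(doc)
        # Words outside the vocabulary get no column, but they still count
        # toward the document length, as in the question's tf function.
        row = [(doc.count(word) / n if n else 0.0) * idf[word] for word in vocab]
        rows.append(row)
    return np.asarray(rows)

# Toy data, made up for illustration
train_docs = [["cat", "dog", "cat"], ["dog", "fish"], ["dog", "bird"], ["dog"]]
test_docs = [["cat", "zebra", "fish", "cat"]]

vocab, idf = fit_tfidf(train_docs)                   # learn from training data only
X_train = transform_tfidf(train_docs, vocab, idf)    # shape (4, 4)
X_test = transform_tfidf(test_docs, vocab, idf)      # shape (1, 4)
```

The important design choice is that the IDF weights are never recomputed on the test document: a new document is described with the columns and weights learned during fitting, so `X_train` and `X_test` have the same number of columns and can be passed to the same model.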


nlp python tf-idf