How to compute the cosine similarity between two corresponding rows using TF-IDF

Problem description

Copy-and-paste code for a working example is in the "Update" and "Example" sections below.

How would you compute the cosine similarity between two corresponding text columns, for example (headline, article body)?

Headline | articleBody

headline1 | articleBody 1

headline2 | articleBody 2

...

I just want the cosine similarity between each headline and its own article. For example:

Headline | articleBody | cosine similarity

headline1 | articleBody 1 | value

headline2 | articleBody 2 | value

...

The code below seems to work, but not in every case:

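from sklearn.feature_extraction.text import TfidfVectorizer

def computeTFIDF(train_headlines, train_bodies):
  # Vectorizer fitted on the article bodies only
  vec_body = TfidfVectorizer(ngram_range=(1,2), lowercase=True, stop_words='english', max_df=.5, max_features=10000)
  tfidf_body = vec_body.fit_transform(train_bodies)

  # Separate vectorizer fitted on the headlines only.
  # NOTE: this line was truncated in the source; the upper n-gram bound of 2
  # is assumed from vec_body, and any other arguments are unknown.
  vec_headline = TfidfVectorizer(ngram_range=(1,2), max_features=10000)
  tfidf_headline = vec_headline.fit_transform(train_headlines)

  return tfidf_headline, tfidf_body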

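import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def arr_convert_1d(arr):
    # Each cosine_similarity call below returns a 1x1 array,
    # so the list of n results becomes an array of shape (n, 1, 1).
    arr = np.array(arr)
    arr = np.concatenate(arr, axis=0)  # shape (n, 1)
    arr = np.concatenate(arr, axis=0)  # shape (n,)
    return arr

def compute_cos_hd_body(headline, body):
  result = []
  # Compare row i of the headline matrix with row i of the body matrix
  for i, (head, body) in enumerate(zip(headline, body)):
    result.append(cosine_similarity(head, body))
  return arr_convert_1d(result)

tr_cos_tfidf = compute_cos_hd_body(train_headlines_tfidf, train_bodies_tfidf)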

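This seems to work on a large dataset of about 50,000 rows of headlines and their corresponding articles. However, running it on a smaller DataFrame with 3 rows of headline/article-body pairs returns this error: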

/usr/local/lib/python3.6/dist-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X,Y,precomputed,dtype,accept_sparse,force_all_finite,copy)
    153         raise ValueError("Incompatible dimension for X and Y matrices: "
    154                          "X.shape[1] == %d while Y.shape[1] == %d" % (
--> 155                              X.shape[1],Y.shape[1]))
    156 
    157     return X,Y

ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 27 while Y.shape[1] == 707
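
One plausible reason the error only shows up on the small set: each TfidfVectorizer learns its own vocabulary, so two separately fitted vectorizers normally produce matrices of different widths. With max_features=10000 and a corpus large enough to hit that cap on both sides, the widths happen to agree, even though the columns still stand for different terms. The sketch below illustrates both effects with made-up sentences (not the data from this question); all variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data, not the asker's dataset
headlines = ["seth rogen is woz", "head slapper attacks sneezing women"]
bodies = [
    "danny boyle is directing an untitled film about steve jobs and seth rogen may play steve wozniak",
    "a man has been slapping people on the head in carlisle when they sneeze police said",
]

# Two independent vectorizers: each learns its own vocabulary
tfidf_headline = TfidfVectorizer().fit_transform(headlines)
tfidf_body = TfidfVectorizer().fit_transform(bodies)
n_head_features = tfidf_headline.shape[1]
n_body_features = tfidf_body.shape[1]
print(n_head_features, n_body_features)  # widths differ, so cosine_similarity would raise

# Capping both at the same max_features hides the mismatch:
# the widths agree, but the columns refer to different terms.
capped_head = TfidfVectorizer(max_features=5).fit(headlines)
capped_body = TfidfVectorizer(max_features=5).fit(bodies)
same_width = len(capped_head.vocabulary_) == len(capped_body.vocabulary_)
same_terms = set(capped_head.vocabulary_) == set(capped_body.vocabulary_)
print(same_width, same_terms)
```

If this reading is right, the values computed on the 50,000-row set would not be meaningful similarities either, since column j of one matrix and column j of the other generally refer to different terms.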

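I understand what the error means, but I'm confused about why it happens here when the same code works on the larger set. Is the cosine similarity being computed incorrectly? Should I concatenate each headline/articleBody pair, call vec.fit_transform(headBody) on that, and then transform the headline and articleBody separately?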
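
A minimal sketch of the idea raised in that last question, assuming made-up sentences and illustrative variable names (none of this is the asker's data): fit one vectorizer on the combined text, transform each column separately with it, and compare matching rows.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data, not the asker's dataset
headlines = [
    "seth rogen is woz",
    "head slapper attacks sneezing women",
    "feminist blogger story probably fake",
]
bodies = [
    "danny boyle is directing a film and seth rogen may play steve wozniak",
    "a man has been slapping women on the head in carlisle when they sneeze",
    "amazon plans a new ad supported streaming video service",
]

# Fit once on headline + body so both matrices share a single vocabulary
combined = [h + " " + b for h, b in zip(headlines, bodies)]
vectorizer = TfidfVectorizer().fit(combined)
tfidf_headline = vectorizer.transform(headlines)
tfidf_body = vectorizer.transform(bodies)

# Row i of each matrix describes pair i; each call returns a 1x1 array
pair_similarity = np.array([
    cosine_similarity(tfidf_headline[i:i + 1], tfidf_body[i:i + 1])[0, 0]
    for i in range(tfidf_headline.shape[0])
])
print(pair_similarity)
```

Because both matrices come from the same fitted vectorizer, they always have the same number of columns regardless of dataset size. A pair that shares no terms (the third one here) scores 0.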

Update: I have edited the code (copying the code from here should give a working example):

1)

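from sklearn.feature_extraction.text import TfidfVectorizer

# whole_set is the headline and articleBody concatenated into
# one column
def compute_tfidf(whole_set, train_headlines, train_bodies, max_feat):
  # NOTE: this line was truncated in the source; the upper n-gram bound of 2
  # is assumed, and any other arguments are unknown.
  vec_whole = TfidfVectorizer(ngram_range=(1,2), max_features=max_feat)
  # Fit one shared vocabulary, then project both columns into it
  tfidf_whole = vec_whole.fit_transform(whole_set)
  tfidf_body = vec_whole.transform(train_bodies)
  tfidf_headline = vec_whole.transform(train_headlines)
  # Three return values, matching the three-way unpacking in block 3)
  return tfidf_headline, tfidf_body, tfidf_whole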

def arr_convert_1d(arr):
    arr = np.array(arr)
    arr = np.concatenate(arr, axis=0)
    # Commenting this out seemed to produce a set of values,
    # but I'm not sure if the values are correct.
    # arr = np.concatenate(arr, axis=0)

    return arr

def compute_cos_hd_body(headline=None, body=None):
  result = []
  # NOTE: the loop header was truncated in the source; it is reconstructed
  # here from the first version of this function above.
  for i, (head, body) in enumerate(zip(headline, body)):
    #print("Head {}\n".format(headline))
    #print("\nBody: {}\n".format(body))

    result.append(cosine_similarity(head, body))
  return arr_convert_1d(result)

Example DataFrame:

2)

import pandas as pd

list_of_lists = []
list_of_lists.append(["Therefore feminist blogger aborted boy probably fake","Amazon boss Jeff Bezos primed ready fresh assault streaming video space. The e commerce giant roll new ad supported streaming offering early next year separate 99 year Prime membership includes video service sources said. The ad supported option ‚Äî part overhaul media offerings ‚Äî poses serious challenge streaming rivals Hulu Netflix analysts said. If ad supported service decouple Prime Netflix killer Wedbush Securities analyst Michael Pachter said. It 99 year. Pachter suggested Amazon would undercut Netflix current monthly price 7.99. Who switch poor cord cutter added. Although separate Prime ad supported service ultimately bid Amazon lure people eventually pay Prime membership said one ad source familiar Amazon plans. The main point bring users eventually sell Prime get broader audience want pay Prime order increase video share source said. The Wall Street Journal reported march Amazon weighing move sources confirmed definite go. Amazon prepping new attack time video service gaining ground sources said. Amazon disclose number Prime members RBC Capital analyst Mark Mahaney estimates 50 million global customers. While Prime big draw two day shipping half Prime subscribers use video service","therefore feminist blogger aborted boy probably fake amazon boss jeff bezos primed ready fresh assault streaming video space. the e commerce giant roll new ad supported streaming offering early next year separate 99 year prime membership includes video service sources said. the ad supported option ‚äî part overhaul media offerings ‚äî poses serious challenge streaming rivals hulu netflix analysts said. if ad supported service decouple prime netflix killer wedbush securities analyst michael pachter said. it 99 year. pachter suggested amazon would undercut netflix current monthly price 7.99. who switch poor cord cutter added. although separate prime ad supported service ultimately bid amazon lure people eventually pay prime membership said one ad source familiar amazon plans. the main point bring users eventually sell prime get broader audience want pay prime order increase video share source said. the wall street journal reported march amazon weighing move sources confirmed definite go. amazon prepping new attack time video service gaining ground sources said. amazon disclose number prime members rbc capital analyst mark mahaney estimates 50 million global customers. while prime big draw two day shipping half prime subscribers use video service"
])
list_of_lists.append(["Seth Rogen Is Woz","Danny Boyle directing untitled film Seth Rogen eyed play Apple co founder Steve Wozniak Sony Steve Jobs biopic. Danny Boyle directing untitled film based Walter Isaacson book adapted Aaron Sorkin one anticipated biopics recent years. Negotiations yet begun even clear Rogen official offer producers ‚Äî Scott Rudin Guymon Casady Mark Gordon ‚Äî set sights talent talks. Of course may naught Christian Bale actor play Jobs still midst closing deal. Sources say dealmaking process sensitive stage. Insiders say Boyle flying Los Angeles meet actress play one female leads assistant Jobs. Insiders say Jessica Chastain one actresses meeting list. Wozniak known Woz co founded Apple Jobs Ronald Wayne. He first met Jobs worked Atari later responsible creating early Apple computers.","seth rogen is woz danny boyle directing untitled film seth rogen eyed play apple co founder steve wozniak sony steve jobs biopic. danny boyle directing untitled film based walter isaacson book adapted aaron sorkin one anticipated biopics recent years. negotiations yet begun even clear rogen official offer producers ‚äî scott rudin guymon casady mark gordon ‚äî set sights talent talks. of course may naught christian bale actor play jobs still midst closing deal. sources say dealmaking process sensitive stage. insiders say boyle flying los angeles meet actress play one female leads assistant jobs. insiders say jessica chastain one actresses meeting list. wozniak known woz co founded apple jobs ronald wayne. he first met jobs worked atari later responsible creating early apple computers."
])
list_of_lists.append(["CARLISLE HEAD SLAPPER ATTACKS SNEEZING WOMEN","A man head slapping people Carlisle sneeze. Cumbria police said 82 year old woman reported slapped sneezed Scotch Street near Costa 11.30am today. The suspect man believed late 50s. It reported similar incident happened yesterday victim yet traced police. Anyone may witnessed incident information asked contact PC Lori Tallantire 101.","carlisle head slapper attacks sneezing women a man head slapping people carlisle sneeze. cumbria police said 82 year old woman reported slapped sneezed scotch street near costa 11.30am today. the suspect man believed late 50s. it reported similar incident happened yesterday victim yet traced police. anyone may witnessed incident information asked contact pc lori tallantire 101."
])

train_hl_ab_merged = pd.DataFrame(list_of_lists,columns=["Headline",'first_sent_180','headBody'])

When I run this:

3)

tfidf_headline, tfidf_body, tfidf_whole = compute_tfidf(train_hl_ab_merged['headBody'], train_hl_ab_merged['Headline'], train_hl_ab_merged['first_sent_180'], 10000)

# NOTE: the arguments of this call were garbled in the source; the call is
# reconstructed to match compute_cos_hd_body(headline, body) from block 1).
tr_cos_tfidf = compute_cos_hd_body(tfidf_headline, tfidf_body)

tr_cos_tfidf, len(tr_cos_tfidf)

I get:

 (array([[0.        ],[0.13459548],[0.06392028]]),3)
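
The nested shape of that result follows from how the per-row results are stacked. The sketch below replays the flattening steps of arr_convert_1d on stand-in 1x1 arrays that hold the three values printed above; np.ravel is shown as an alternative and is not part of the original code.

```python
import numpy as np

# Stand-ins for three cosine_similarity results; each call returns a 1x1 array
result = [np.array([[0.0]]), np.array([[0.13459548]]), np.array([[0.06392028]])]

stacked = np.array(result)               # shape (3, 1, 1)
once = np.concatenate(stacked, axis=0)   # shape (3, 1): the nested output above
twice = np.concatenate(once, axis=0)     # shape (3,): a flat vector

# Alternative: flatten to 1-D in a single call
flat = np.ravel(result)

print(stacked.shape, once.shape, twice.shape, flat.shape)
```

So one concatenate yields a column of single-element rows, and a second one (or a single ravel) yields a flat array with the same values.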

- Update 2: It turns out a syntax error somewhere in my code was causing the problem, and the approach does work.

def arr_convert_1d(arr):
    arr = np.array(arr)
    arr = np.concatenate(arr, axis=0)
    arr = np.concatenate(arr, axis=0)
    # I've uncommented the line above, since the earlier result showed
    # that one more level of nesting still had to be removed.
    return arr

Running code block 3) again, but without len(tr_cos_tfidf):

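# NOTE: both calls were truncated in the source; they are reconstructed
# from code block 3) and from the function signatures in block 1).
tfidf_headline, tfidf_body, tfidf_whole = compute_tfidf(train_hl_ab_merged['headBody'], train_hl_ab_merged['Headline'], train_hl_ab_merged['first_sent_180'], 10000)

tr_cos_tfidf = compute_cos_hd_body(tfidf_headline, tfidf_body)

tr_cos_tfidf
#len(tr_cos_tfidf) # Not necessary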

With the second concatenate line in arr_convert_1d(arr) enabled again, this returns:

array([0.        , 0.13459548, 0.06392028])
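
For larger datasets, the per-row loop can be avoided. TfidfVectorizer L2-normalizes each row by default (norm="l2"), so when both matrices come from the same fitted vectorizer, the row-wise dot product is already the cosine similarity. A minimal sketch under those assumptions, with made-up sentences and illustrative names:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data, not the asker's dataset
headlines = ["seth rogen is woz", "head slapper attacks sneezing women"]
bodies = [
    "danny boyle is directing a film and seth rogen may play steve wozniak",
    "a man has been slapping women on the head in carlisle when they sneeze",
]

# One shared vocabulary for both columns
vectorizer = TfidfVectorizer().fit(headlines + bodies)
H = vectorizer.transform(headlines)
B = vectorizer.transform(bodies)

# Element-wise product summed per row = dot product of row i with row i.
# Rows are unit length (or all zeros), so this equals the cosine similarity.
row_cosine = np.asarray(H.multiply(B).sum(axis=1)).ravel()

# Cross-check against the diagonal of the full pairwise matrix
expected = cosine_similarity(H, B).diagonal()
print(row_cosine, expected)
```

The cross-check builds the full n x n matrix and is only practical for small n; the multiply-and-sum line itself stays sparse and scales with the number of rows.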


data-science machine-learning python scikit-learn tf-idf