Problem description
I have some lines of text, each with a relevance weight.
Weight,Text
10,"I like apples"
20,"Someone needs apples"
Is it possible to get the word combinations while keeping the value from the weight column? Something like this:
weight,combinations
10,[I like]
10,[I apples]
10,[like apples]
20,[someone needs]
20,[someone apples]
20,[needs apples]
"Generate n-grams from Pandas column while persisting another column" is a similar question, but it is unresolved.
Thanks!!!
Solution
from itertools import combinations
import pandas as pd
df = pd.DataFrame({'Weight': [10,20],'Text': ["I like apples","Someone needs apples"]})
df['Combinations'] = df.Text.apply(lambda x : list(combinations(x.split(),2)))
df = df.explode('Combinations')
df.drop('Text',axis=1,inplace=True)
print(df)
Output:
Weight Combinations
0 10 (I,like)
0 10 (I,apples)
0 10 (like,apples)
1 20 (Someone,needs)
1 20 (Someone,apples)
1 20 (needs,apples)
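If the pairs should be plain strings such as "I like", as in the desired output in the question, the tuples can be joined before exploding. This is only a minimal sketch on the same sample data; the helper name word_pairs is made up for illustration and is not part of the original answer:

```python
from itertools import combinations

import pandas as pd


def word_pairs(text, n=2):
    # Hypothetical helper: every n-word combination, in original word order,
    # joined back into one space-separated string.
    return [" ".join(c) for c in combinations(text.split(), n)]


df = pd.DataFrame({"Weight": [10, 20],
                   "Text": ["I like apples", "Someone needs apples"]})

result = (
    df.assign(Combinations=df["Text"].apply(word_pairs))
      .explode("Combinations")   # one row per pair, Weight repeated (pandas >= 0.25)
      .drop(columns="Text")
      .reset_index(drop=True)    # replace the repeated 0,0,0,1,1,1 index
)
print(result)
```

The reset_index(drop=True) call is optional; without it each exploded row keeps the index of the row it came from.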
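pandas' explode() is a relatively new feature (added in 0.25), and it enables efficient solutions like David M.'s. It also fixes a problem that the previous typical solution had when the list column is a list of lists.
Before explode(), the typical solution looked like this: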
someDF = pd.DataFrame({col:np.repeat(someListColDF[col].values,someListColDF[someListCol].str.len()) for col in someListColDF.columns.drop(someListCol)} ).assign(**{someListCol:np.concatenate(someListColDF[someListCol].values)})[someListColDF.columns]
But this does not seem to work when someListCol is a list of lists.
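To make the idiom easier to follow, here is a small sketch of the same repeat-and-concatenate approach on a made-up frame whose list column holds flat lists (the column name "tags" and the data are assumptions for illustration only). Once each list item is itself a pair, np.concatenate returns a 2-D array instead of a 1-D one, which is where the one-liner runs into trouble:

```python
import numpy as np
import pandas as pd

# Made-up sample frame: "tags" is a column of flat (non-nested) lists.
df = pd.DataFrame({"Weight": [10, 20], "tags": [["a", "b"], ["c", "d", "e"]]})

lengths = df["tags"].str.len().values   # number of items in each list: [2, 3]

flat = pd.DataFrame({
    # Repeat each weight once per item in that row's list.
    "Weight": np.repeat(df["Weight"].values, lengths),
    # Flatten all the lists into a single 1-D array.
    "tags": np.concatenate(df["tags"].values),
})
print(flat)
```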
Here are the intermediate steps:
import itertools
import numpy as np
import pandas as pd
someDF = pd.DataFrame([[10,"I like apples"],[20,"Someone needs apples"]],columns = ["Weight","Text"])
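Get all possible 2-element permutations (assuming order matters):
someDF["Permute"] = someDF["Text"].apply(lambda x: list((itertools.permutations(x.split(),2))))
print(someDF["Permute"])
0 [(I,like),(I,apples),(like,I),(like,app...
1 [(Someone,needs),(Someone,apples),(needs,...
Repeat each weight the correct number of times (this needs the "Permute" column, so it has to come after the step above):
weightList = np.repeat(someDF["Weight"].values,someDF["Permute"].str.len())
Concatenate the permutations into a single array:
permuteList = np.concatenate(someDF["Permute"].values)
print(permuteList)
array([['I','like'],['I','apples'],['like','I'],['like','apples'],['apples','I'],['apples','like'],['Someone','needs'],['Someone','apples'],['needs','Someone'],['needs','apples'],['apples','Someone'],['apples','needs']],dtype='<U7')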
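But when I tried to glue them together in the normal way, using e.g. np.column_stack(), np.vstack() and np.concatenate(axis=1), the list of lists kept being misinterpreted, and reshaping did not seem to help.
In the end I had to resort to this kludge: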
newDF = pd.DataFrame(weightList,columns=["Weight"])
newDF["Permute"] = [i for i in permuteList]
Output:
Weight Permute
0 10 [I,like]
1 10 [I,apples]
2 10 [like,I]
3 10 [like,apples]
4 10 [apples,I]
5 10 [apples,like]
6 20 [Someone,needs]
7 20 [Someone,apples]
8 20 [needs,Someone]
9 20 [needs,apples]
10 20 [apples,Someone]
11 20 [apples,needs]
All of this is a rather roundabout way of doing it, so thanks to the pandas developers for giving us explode()!
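For comparison, the same 12-row table can be built with explode() directly. This is a minimal sketch that reuses the someDF sample data from this answer and assumes pandas 0.25 or later:

```python
from itertools import permutations

import pandas as pd

someDF = pd.DataFrame([[10, "I like apples"], [20, "Someone needs apples"]],
                      columns=["Weight", "Text"])

# Ordered 2-word permutations for each row of text.
someDF["Permute"] = someDF["Text"].apply(lambda x: list(permutations(x.split(), 2)))

newDF = (
    someDF[["Weight", "Permute"]]
    .explode("Permute")        # one output row per permutation, Weight repeated
    .reset_index(drop=True)    # fresh 0..11 index instead of repeated 0s and 1s
)
print(newDF)
```

One difference from the NumPy route: each Permute cell here is a tuple such as ('I', 'like') rather than a small NumPy array, because explode() only unpacks the outer list.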