How to build an n-gram dictionary from the strings in my DataFrame in Python

Problem description

I have a dataframe like this:

id  name        cat     subcat
-------------------------------
1   aa bb cc    A       a-a
2   bb cc dd    B       b-a
3   aa bb ee    C       c-a
4   aa gg cc    D       d-a

I want to build a dict from this dataframe that contains its most frequent two-word n-grams (bigrams) with their counts, like this:

aa bb : 2
bb cc : 2
cc dd : 1
bb ee : 1
aa gg : 1
gg cc : 1

Solution

Update: use the pairwise recipe from itertools, which yields only adjacent word pairs.

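import pandas as pd
from itertools import chain, tee

def pairwise(iterable):
    "s -> (s0,s1),(s1,s2),(s2,s3),..."
    a, b = tee(iterable)   # two independent iterators over the same items
    next(b, None)          # advance the second one by a single step
    return zip(a, b)       # pair each item with its successor

# Split each name into words, take adjacent pairs, flatten, and count them.
pd.Series(list(chain(*df['name'].str.split(' ')
                                .apply(pairwise))))\
  .value_counts()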

Output:

(aa,bb)    2
(bb,cc)    2
(cc,dd)    1
(bb,ee)    1
(aa,gg)    1
(gg,cc)    1
dtype: int64
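
Since the question asks for a dict rather than a Series, the same counts can also be collected with `collections.Counter`. The following is a minimal sketch, not part of the original answer: the list `names` stands in for `df['name']` (iterating over that column yields the same strings), and `bigram_dict` is a name chosen here for illustration.

```python
from collections import Counter
from itertools import chain, tee


def pairwise(iterable):
    """s -> (s0, s1), (s1, s2), (s2, s3), ..."""
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)


# Stand-in for df['name'], using the sample rows from the question.
names = ["aa bb cc", "bb cc dd", "aa bb ee", "aa gg cc"]

# Count adjacent word pairs across all rows, keyed as "w1 w2" strings.
bigram_counts = Counter(
    " ".join(pair)
    for pair in chain.from_iterable(pairwise(s.split()) for s in names)
)
bigram_dict = dict(bigram_counts)
```

If the Series from `value_counts()` is preferred, calling `.to_dict()` on it should give a comparable dict keyed by tuples instead of strings.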

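IIUC, you could try something like this. Note that combinations counts every pair of words within a row, not only adjacent ones: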

import pandas as pd
from itertools import combinations, chain

pd.Series(list(chain(*df['name'].str.split(' ')
                                .apply(lambda x: list(combinations(x,2))))))\
  .value_counts()

Output:

(aa,bb)    2
(aa,cc)    2
(bb,cc)    2
(bb,dd)    1
(cc,dd)    1
(aa,ee)    1
(bb,ee)    1
(aa,gg)    1
(gg,cc)    1
dtype: int64
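
Both snippets above are tied to pairs of words. If longer n-grams are needed, the adjacent-pair idea generalizes by zipping n shifted copies of the token list. This is a sketch under assumptions, not something from the original answer: `ngrams` and `count_ngrams` are hypothetical helper names, and `names` again stands in for `df['name']`.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over consecutive n-token tuples from a list."""
    # zip stops at the shortest slice, so only complete n-grams are produced.
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(texts, n=2):
    """Count space-joined n-grams over an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(" ".join(gram) for gram in ngrams(text.split(), n))
    return counts


# Stand-in for df['name'], using the sample rows from the question.
names = ["aa bb cc", "bb cc dd", "aa bb ee", "aa gg cc"]

bigram_counts = count_ngrams(names, n=2)
trigram_counts = count_ngrams(names, n=3)
```

With `n=2` this reproduces the pairwise counts; `Counter.most_common(k)` can then pick the k most frequent entries.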

n-gram python scikit-learn
