How to assign multiple categories based on conditions

Problem description

Each category has a list of words that is used to check whether a row matches:

fashion = ['bag','purse','pen']
general = ['knob','hanger','bottle','printing','tissues','book','tissue','holder','heart']
decor =['holder','decoration','candels','frame','paisley','bunting','decorations','party','candles','design','clock','sign','vintage','hanging','mirror','drawer','home','clusters','placements','willow','stickers','Box']
kitchen = ['pantry','jam','cake','glass','bowl','napkins','kitchen','baking','jar','mug','cookie','molds','coaster','placemats']
holiday = ['rabbit','ornament','christmas','trinket','party']
garden = ['lantern','hammok','garden','tree']
kids = ['children','doll','birdie','asstd','bank','soldiers','spaceboy','childs']

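Here is my code. (I check each description for keywords and assign a category to the row accordingly. I want to allow overlaps, so a row can have more than one category.)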

#check if description row contains words from one of our category lists
df['category'] = np.select(
    [
        df['description'].str.contains('|'.join(fashion)),
        df['description'].str.contains('|'.join(general)),
        df['description'].str.contains('|'.join(decor)),
        df['description'].str.contains('|'.join(kitchen)),
        df['description'].str.contains('|'.join(holiday)),
        df['description'].str.contains('|'.join(garden)),
        df['description'].str.contains('|'.join(kids))
    ],
    ['fashion','general','decor','kitchen','holiday','garden','kids'],
    'Other'
)
Current Output:

index         description         category
0         children wine glass     kids
1         candles                 decor 
2         christmas tree          holiday
3         bottle                  general
4         soldiers                kids
5         bag                     fashion


Expected Output:

index         description         category
0         children wine glass     kids,kitchen
1         candles                 decor
2         christmas tree          holiday,garden
3         bottle                  general
4         soldiers                kids
5         bag                     fashion
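
For context, np.select returns, for each row, the choice that belongs to the first condition that is true, so it can never produce more than one category per row. The sketch below illustrates this with shortened keyword lists that are made up for the example (they are not the full lists above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'description': ['children wine glass', 'christmas tree', 'bag']})

# Shortened, illustrative keyword lists (assumption: not the full lists)
kids = ['children', 'doll']
kitchen = ['glass', 'bowl']

conditions = [
    df['description'].str.contains('|'.join(kids)).to_numpy(),
    df['description'].str.contains('|'.join(kitchen)).to_numpy(),
]

# np.select checks the conditions in order and keeps the first match per row
first_match = np.select(conditions, ['kids', 'kitchen'], default='Other')
print(first_match)
```

The first row matches both lists, but only 'kids' is kept because its condition comes first. That is why a different approach is needed to collect every matching category.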

Solution

Here is one option using apply():

df = pd.DataFrame({'description': ['children wine glass','candles','christmas tree','bottle','soldiers','bag']})

def categorize(desc):
    lst = []
    for w in desc.split(' '):
        if w in fashion:
            lst.append('fashion')
        if w in general:
            lst.append('general')
        if w in decor:
            lst.append('decor')
        if w in kitchen:
            lst.append('kitchen')
        if w in holiday:
            lst.append('holiday')
        if w in garden:
            lst.append('garden')
        if w in kids:
            lst.append('kids')
    return ','.join(lst)

df.apply(lambda x: categorize(x.description),axis=1)

Output:

0      kids,kitchen
1              decor
2    holiday,garden
3            general
4               kids
5            fashion
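
One thing to watch with the function above: a description with two words from the same list (for example two kitchen words) would repeat that category. A possible variant, sketched here with a hypothetical `categories` dict and a hypothetical helper name `categorize_words`, loops over the categories instead of using one `if` per list and skips categories that were already found. The keyword sets are shortened for the example:

```python
import pandas as pd

# Hypothetical, shortened keyword sets keyed by category name
categories = {
    'fashion': {'bag', 'purse', 'pen'},
    'kitchen': {'glass', 'bowl', 'mug'},
    'holiday': {'christmas', 'party'},
    'decor': {'party', 'candles'},
    'garden': {'tree', 'lantern'},
    'kids': {'children', 'doll'},
}

def categorize_words(desc):
    """Return the matching category names, comma-separated, without repeats."""
    found = []
    for word in desc.split():
        for name, keywords in categories.items():
            if word in keywords and name not in found:
                found.append(name)
    return ','.join(found)

df = pd.DataFrame({'description': ['children wine glass', 'christmas tree', 'party party bag']})
df['category'] = df['description'].apply(categorize_words)
print(df)
```

Adding a category then only means adding one entry to the dict, and a word that belongs to several categories (like 'party' here) contributes all of them.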

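Here is how I would do it.

The comment above each line explains what I am trying to do.

Steps:

  1. Convert all the categories into key:value pairs. Use each word in a category list as the key and the category name as the value. This lets you look up a word and map it back to its category.
  2. Split the description field into multiple columns using split(expand=True).
  3. Match the words in each column against the dictionary. The result will be category names and NaNs.
  4. Join all of these back into a single column separated by ',' to get the final result, excluding the NaNs. Then apply pd.unique() to it to remove duplicate categories.

The few lines of code you need are: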

dict_keys = ['fashion','general','decor','kitchen','holiday','garden','kids']
dict_cats = [fashion,general,decor,kitchen,holiday,garden,kids]
s_dict = {val:dict_keys[i] for i,cats in enumerate(dict_cats) for val in cats}
temp = df['description'].str.split(expand=True)
temp = temp.applymap(s_dict.get)
df['new_category'] = temp.apply(lambda x: ','.join(x[x.notnull()]),axis = 1)
df['new_category'] = df['new_category'].apply(lambda x: ','.join(pd.unique(x.split(','))))

If you have more categories, just add them to dict_keys and dict_cats. Everything else stays the same.
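
One caveat with a word-to-category dictionary: a word that appears in more than one list (such as 'party' or 'holder') keeps only the category that was added last. If every category of such a word should be reported, a possible variation is to map each word to a list of categories. This is only a sketch with shortened lists; `word_to_cats` and `all_categories` are hypothetical names, not part of the code above:

```python
from collections import defaultdict
import pandas as pd

# Shortened, illustrative lists with overlapping words
general = ['holder', 'bottle']
decor = ['holder', 'party', 'candles']
holiday = ['christmas', 'party']

dict_keys = ['general', 'decor', 'holiday']
dict_cats = [general, decor, holiday]

# Single-valued mapping: later categories overwrite earlier ones
s_dict = {val: dict_keys[i] for i, cats in enumerate(dict_cats) for val in cats}

# Multi-valued mapping: every category of a word is kept
word_to_cats = defaultdict(list)
for name, words in zip(dict_keys, dict_cats):
    for word in words:
        word_to_cats[word].append(name)

def all_categories(desc):
    """Collect every category of every word, in first-seen order, without repeats."""
    seen = {}
    for word in desc.split():
        for name in word_to_cats.get(word, []):
            seen[name] = None
    return ','.join(seen)

df = pd.DataFrame({'description': ['party candles holder', 'bottle', 'wine']})
df['new_category'] = df['description'].apply(all_categories)
print(s_dict['party'])
print(df)
```

Here s_dict maps 'party' to 'holiday' only, while the list-valued mapping reports both 'decor' and 'holiday' for it.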

The full code with comments starts here:

import pandas as pd

c = ['description','category']
d = [['children wine glass','kids'],['candles','decor'],['christmas tree','holiday'],['bottle','general'],['soldiers','kids'],['bag','fashion']]
df = pd.DataFrame(d,columns = c)

fashion = ['bag','purse','pen']
general = ['knob','hanger','bottle','printing','tissues','book','tissue','holder','heart']
decor =['holder','decoration','candels','frame','paisley','bunting','decorations','party','candles','design','clock','sign','vintage','hanging','mirror','drawer','home','clusters','placements','willow','stickers','box']
kitchen = ['pantry','jam','cake','glass','bowl','napkins','kitchen','baking','jar','mug','cookie','molds','coaster','placemats']
holiday = ['rabbit','ornament','christmas','trinket','party']
garden = ['lantern','hammok','garden','tree']
kids = ['children','doll','birdie','asstd','bank','soldiers','spaceboy','childs']

#create a list of the category names and a list of all the keyword lists
dict_keys = ['fashion','general','decor','kitchen','holiday','garden','kids']
dict_cats = [fashion,general,decor,kitchen,holiday,garden,kids]

#create a dictionary with words from the list as key and category as value
s_dict = {val:dict_keys[i] for i,cats in enumerate(dict_cats) for val in cats}

#create a temp dataframe with one word for each column using split
temp = df['description'].str.split(expand=True)

#match the words in each column against the dictionary
temp = temp.applymap(s_dict.get)

#Now put them back together and you have the final list
df['new_category'] = temp.apply(lambda x: ','.join(x[x.notnull()]),axis = 1)

#Remove duplicates using pd.unique()
#Note: the previous line joins with ',' (no space) so that split(',') works here
df['new_category'] = df['new_category'].apply(lambda x: ','.join(pd.unique(x.split(','))))
print (df)

The output will be: (I kept your category column and created a new one called new_category)

           description category     new_category
0  children wine glass     kids    kids,kitchen
1              candles    decor            decor
2       christmas tree  holiday  holiday,garden
3               bottle  general          general
4             soldiers     kids             kids
5                  bag  fashion          fashion

The output with 'party candles holder' included is:

            description category     new_category
0   children wine glass     kids    kids,kitchen
1               candles    decor            decor
2        christmas tree  holiday  holiday,garden
3                bottle  general          general
4  party candles holder     None   holiday,decor
5              soldiers     kids             kids
6                   bag  fashion          fashion