How to more easily assign categories to strings in a new column when there are 50+ categories

Problem description

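I have a DataFrame with a column of open-response strings that identify a U.S. state (hopefully this will become a closed-ended question soon). I need to assign a state name to each response, and I am currently using the following code.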

import numpy as np

# `survey` is an existing pandas DataFrame with a free-text 'state' column
alabama_cat = ["alabama","al"]
alaska_cat = ["alaska","ak"]
newyork_cat = ["new york","ny","newyork"]

state_cat = [alabama_cat,alaska_cat,newyork_cat]

# Conditions for categories (one boolean mask per category, in the same order as state_cat)
conditions = [
    survey['state'].str.lower().str.contains('|'.join(alabama_cat), na=False),
    survey['state'].str.lower().str.contains('|'.join(alaska_cat), na=False),
    survey['state'].str.lower().str.contains('|'.join(newyork_cat), na=False),
]

#Names of categories
choices = ["Alabama","Alaska","New York"]

# categorize
survey['state_category'] = np.select(conditions,choices)

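I would like to know whether there is a simpler way to create the conditions variable, ideally an automated way to run each list in state_cat through (survey['state'].str.lower().str.contains('|'.join(alabama_cat),na=False)). I need to run this process for every state, possibly for territories, and for cases where people entered another country.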

Thank you very much for any insights.

Solution

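Instead of checking each category separately, you can extract whichever code appears in the response and then use map to turn it into a category name. Something like this: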

# map the codes to actual names
state_codes = {code:choice for cat,choice in zip(state_cat,choices) 
                 for code in cat}

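# one alternation pattern covering every known code
patt = '|'.join(state_codes.keys())

# lowercase first (the codes are lowercase), extract the matching code, then map it to its name
survey['state_category'] = survey['state'].str.lower().str.extract(f'({patt})',expand=False).map(state_codes)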
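
One thing to watch with plain alternation: a short code such as "al" also matches inside a longer word such as "alaska", so that response can end up as Alabama. The following is a minimal, self-contained sketch of one way to guard against that, assuming whole-word matching is acceptable for your data. The sample `survey` frame and the `state_aliases` dictionary are made up for illustration; only the extract-then-map idea comes from the answer above.

```python
import re
import pandas as pd

# Hypothetical sample data standing in for the real `survey` frame.
survey = pd.DataFrame(
    {"state": ["Alaska", "I live in NY", "alabama", "AL", None, "Texas"]}
)

# Hypothetical layout: display name -> accepted lowercase spellings.
state_aliases = {
    "Alabama": ["alabama", "al"],
    "Alaska": ["alaska", "ak"],
    "New York": ["new york", "ny", "newyork"],
}

# Invert to spelling -> display name, as in the answer's state_codes.
state_codes = {
    alias: name for name, aliases in state_aliases.items() for alias in aliases
}

# Longest spellings first, regex-escaped, wrapped in word boundaries so that
# "al" cannot match the start of "alaska".
ordered = sorted(state_codes, key=len, reverse=True)
patt = r"\b(" + "|".join(re.escape(alias) for alias in ordered) + r")\b"

survey["state_category"] = (
    survey["state"].str.lower().str.extract(patt, expand=False).map(state_codes)
)
```

Responses that match no known spelling (here the missing value and "Texas") are left as NaN, which makes them easy to list and review when you extend the dictionary to territories or other countries.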

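If you would rather keep the np.select approach from the question, the conditions list can be generated with a list comprehension instead of being written out by hand. This is only a sketch under assumptions: the `survey` sample and the `state_cat` dictionary layout are hypothetical, and the word-boundary pattern is an addition that is not in the original code. A string default is passed because mixing string choices with np.select's integer default of 0 may raise a dtype error on recent NumPy versions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the real `survey` frame.
survey = pd.DataFrame({"state": ["Alabama", "NY", "alaska", None]})

# Hypothetical layout: display name -> accepted lowercase spellings.
state_cat = {
    "Alabama": ["alabama", "al"],
    "Alaska": ["alaska", "ak"],
    "New York": ["new york", "ny", "newyork"],
}

# Lowercase once instead of once per category.
lowered = survey["state"].str.lower()

# One boolean mask per category; \b keeps "al" from matching inside "alaska".
conditions = [
    lowered.str.contains(r"\b(?:" + "|".join(aliases) + r")\b", na=False)
    for aliases in state_cat.values()
]

# np.select takes the first matching condition for each row.
survey["state_category"] = np.select(
    conditions, list(state_cat), default="Unknown"
)
```

Adding a new state, territory, or country then only means adding one entry to the dictionary.
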