Frequency plot of a pandas DataFrame

Problem description

I have a DataFrame df whose user_location column looks like this:

df['user_location'].value_counts()

India                            3741
United States                    2455
New Delhi, India                 1721
Mumbai, India                    1401
Washington, DC                   1354
                                 ...
SpaceCoast, Florida                 1
stuck in a book.                    1
Beirut, Lebanon                     1
Royston Vasey - Tralfamadore        1
Langham, Colchester                 1
Name: user_location, Length: 26920, dtype: int64

I want to know the frequency of specific countries, such as USA and India, in the user_location column. Then I want to plot the frequencies as USA, India, and Others. So I want to transform that column so that value_counts() returns only those three categories: India, USA, and Others.

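It seems I should merge the frequencies of the rows that contain the same country name and lump the rest together. But it looks complicated once city and state names are involved. What is the most efficient way to do this?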

Solution

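Adding to @Trenton_McKinney's answer in the comments: if you need to map the states/provinces of different countries to the country name, you will have to do some work to build those associations. For example, for India and the USA, you can grab the lists of their states from Wikipedia and map them onto your own data to relabel each state with its country name, as follows: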

import pandas as pd

# Get states of India and USA
in_url = 'https://en.wikipedia.org/wiki/States_and_union_territories_of_India#States_and_Union_territories'
in_states = pd.read_html(in_url)[3].iloc[:,0].tolist()
us_url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
us_states = pd.read_html(us_url)[0].iloc[:,0].tolist()
states = in_states + us_states

# Make a sample dataframe
df = pd.DataFrame({'Country': states})

    Country
0   Andhra Pradesh
1   Arunachal Pradesh
2   Assam
3   Bihar
4   Chhattisgarh
... ...
73  Virginia[E]
74  Washington
75  West Virginia
76  Wisconsin
77  Wyoming
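
Note that the scraped names can carry Wikipedia footnote markers, such as Virginia[E] in the output above, which would not match a plain "Virginia" in real data. A minimal sketch of stripping them, using a small hand-written list in place of the scraped one (the sample values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical sample standing in for the scraped state names
raw_states = pd.Series(["Virginia[E]", "Washington", "Assam", "Hawaii[F] "])

# Drop bracketed footnote markers such as "[E]", then trim whitespace
clean_states = raw_states.str.replace(r"\[[^\]]*\]", "", regex=True).str.strip()

print(clean_states.tolist())
```

The same two calls could be applied to in_states and us_states before building the lookup dictionary.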

Map the state names to country names:

# Map state names to country name
states_dict = {state: 'India' for state in in_states}
states_dict.update({state: 'USA' for state in us_states})
df['Country'] = df['Country'].map(states_dict)

    Country
0   India
1   India
2   India
3   India
4   India
... ...
73  USA
74  USA
75  USA
76  USA
77  USA
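
In this sample every row is a known state, but on real data map() leaves NaN wherever a value is missing from the dictionary. A small sketch of turning those into the "Others" bucket the question asks for; the lookup and the locations below are made-up stand-ins, not the scraped lists:

```python
import pandas as pd

# Hypothetical lookup; in the answer this is built from the scraped lists
states_dict = {"Assam": "India", "Bihar": "India", "Texas": "USA"}

locations = pd.Series(["Assam", "Texas", "Beirut", "Bihar", "Texas"])

# map() returns NaN for keys missing from the dict; label those "Others"
countries = locations.map(states_dict).fillna("Others")

counts = countries.value_counts()
print(counts)
```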

But judging from your data sample, you will also have a lot of edge cases to deal with.
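
One possible way to approach some of these edge cases is to split each location on commas and compare whole, lower-cased tokens against a lookup, instead of relying on exact matches of the full string. The lookup below is a tiny hand-written stand-in for the scraped lists, and the names PLACE_TO_COUNTRY and to_country are assumptions made for this sketch:

```python
import pandas as pd

# Hypothetical, hand-written lookup of lower-cased place names and aliases
PLACE_TO_COUNTRY = {
    "india": "India", "new delhi": "India", "mumbai": "India",
    "usa": "USA", "united states": "USA", "dc": "USA", "florida": "USA",
}

def to_country(location):
    """Return 'India', 'USA' or 'Others' by matching comma-separated tokens."""
    tokens = [tok.strip().lower() for tok in str(location).split(",")]
    # Check the last token first, since it is usually the country or state
    for tok in reversed(tokens):
        if tok in PLACE_TO_COUNTRY:
            return PLACE_TO_COUNTRY[tok]
    return "Others"

sample = pd.Series(["New Delhi,India", "Washington,DC", "SpaceCoast,Florida",
                    "Beirut,Lebanon", "stuck in a book."])
labels = sample.apply(to_country)
print(labels.tolist())
```

Matching whole tokens avoids accidental hits inside longer words, at the cost of missing free-text locations that do not use commas.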


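First, using the idea from the previous answer, I tried to collect all the locations, including cities, union territories, states, districts, and territories. Then I wrote a function checkl() that checks whether a location belongs to India or the USA and converts it to the country name. Finally, the function is applied to the DataFrame column df['user_location']: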

# Trying to get all the locations of USA and India

import pandas as pd

us_url = 'https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States'
us_tables = pd.read_html(us_url)  # download and parse the page only once
us_states = us_tables[0].iloc[:,0].tolist()
us_cities = us_tables[0].iloc[:,1].tolist() + us_tables[0].iloc[:,2].tolist() + us_tables[0].iloc[:,3].tolist()
us_Federal_district = us_tables[1].iloc[:,0].tolist()
us_Inhabited_territories = us_tables[2].iloc[:,0].tolist()
us_Uninhabited_territories = us_tables[3].iloc[:,0].tolist()
us_Disputed_territories = us_tables[4].iloc[:,0].tolist()

us = us_states + us_cities + us_Federal_district + us_Inhabited_territories + us_Uninhabited_territories + us_Disputed_territories

in_url = 'https://en.wikipedia.org/wiki/States_and_union_territories_of_India#States_and_Union_territories'
in_tables = pd.read_html(in_url)  # download and parse the page only once
in_states = in_tables[3].iloc[:,0].tolist() + in_tables[3].iloc[:,4].tolist() + in_tables[3].iloc[:,5].tolist()
in_unions = in_tables[4].iloc[:,0].tolist()
ind = in_states + in_unions

usToStr = ' '.join([str(elem) for elem in us])
indToStr = ' '.join([str(elem) for elem in ind]) 


# Country name checker function

def checkl(T): 
    TSplit_space = [x.lower().strip() for x in T.split()]
    TSplit_comma = [x.lower().strip() for x in T.split(',')]
    TSplit = list(set().union(TSplit_space,TSplit_comma))
    res_ind = [ele for ele in ind if(ele in T)]
    res_us = [ele for ele in us if(ele in T)]
  
    if 'india' in TSplit or 'hindustan' in TSplit or 'bharat' in TSplit or T.lower() in indToStr.lower() or bool(res_ind) == True :
        T = 'India'
    elif 'US' in T or 'USA' in T or 'United States' in T or 'usa' in TSplit or 'united state' in TSplit or T.lower() in usToStr.lower() or bool(res_us) == True:
        T = 'USA'
    elif len(T.split(','))>1 :
        if T.split(',')[0] in indToStr or  T.split(',')[1] in indToStr :
             T = 'India'
        elif T.split(',')[0] in usToStr or  T.split(',')[1] in usToStr :
             T = 'USA'
        else:
             T = "Others"
    else:
        T = "Others"
    return T

# Applying the function to the dataframe column

print(df['user_location'].dropna().apply(checkl).value_counts())
Others    74206
USA       47840
India     20291
Name: user_location, dtype: int64

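I am quite new to Python coding, and I think this code could be written in a better, more compact form. As mentioned in the previous answer, there are still a lot of edge cases to handle, so I have also posted it on Code Review Stack Exchange. Any criticism and suggestions for improving the efficiency and readability of my code would be greatly appreciated.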
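
Since the question asks for a plot, here is a minimal sketch of drawing the three categories as a bar chart with pandas and matplotlib. The counts are copied from the value_counts() output above; the axis labels and title are assumptions:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Counts copied from the value_counts() output shown in this answer
counts = pd.Series({"Others": 74206, "USA": 47840, "India": 20291},
                   name="user_location")

# One bar per category, with horizontal tick labels
ax = counts.plot(kind="bar", rot=0)
ax.set_xlabel("Country")
ax.set_ylabel("Frequency")
ax.set_title("user_location frequency by country")
plt.tight_layout()
# plt.show()  # uncomment in an interactive session
```

With the real data, counts can be replaced by df['user_location'].dropna().apply(checkl).value_counts().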
