python - pandas: DataFrames won't merge

I have the two DataFrames below (they can be found here and here):

df= pd.read_csv('Thesis/ExternalData/naics_conversion_data/SIC2CRPCats.csv', \
                engine='python', sep=r'\s{2,}', encoding='utf-8_sig')

I am only providing the code that reads df, because that file has some unusual formatting issues.

df.dtypes

SICcode     object
Catcode     object
Category    object
SICname     object
MultSIC     object
dtype: object

merged.dtypes

2012 NAICS Code     float64
2002to2007 NAICS    float64
SICcode              object
dtype: object

df.columns.tolist()
['SICcode', 'Catcode', 'Category', 'SICname', 'MultSIC']

merged.columns.tolist()
['2012 NAICS Code', '2002to2007 NAICS', 'SICcode']

df.head(3)

    SICcode     Catcode     Category                          SICname   MultSIC
0   111         A1500   Wheat, corn, soybeans and cash grain    Wheat   X
1   112         A1600   Other commodities (incl rice, peanuts)  Rice    X
2   115         A1500   Wheat, corn, soybeans and cash grain    Corn    X

merged.sort_values('SICcode')

    2012 NAICS Code     2002to2007 NAICS    SICcode
89  212210                       212210     1011
93  212234                       212234     1021
92  212231                       212231     1031
90  212221                       212221     1041
91  212222                       212222     1044
96  212299                       212299     1061
94  212234                       212234     1061
119 213114                       213114     1081
1770    541360                   541360     1081
233     238910                   238910     1081
95  212291                       212291     1094
97  212299                       212299     1099
3   111140                       111140     111
6   111160                       111160     112
4   111150                       111150     115
0   111110                       111110     116

I am trying to merge them with the following code: merged = pd.merge(merged, df, how='right', on='SICcode')

The result is:

2012 NAICS Code        0
2002to2007 NAICS       0
SICcode             1007
Catcode              991
Category            1007
SICname             1007
MultSIC              906
dtype: int64

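I suspect the problem lies in the formatting of df, but I don't know how to describe it (I have heard the term "whitespace", which may be relevant here) or how to fix it. Does anyone have an idea?

Solution:

I believe this is the cause of your problem: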

In [47]: merged[merged.SICcode == 'Aux']
Out[47]:
      2012 NAICS Code  2002to2007 NAICS SICcode
1828         551114.0          551114.0     Aux
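
If the offending value is not known in advance, rows like this can be located generically. The following is a minimal sketch on toy data; the frame `toy` and its values are made up for illustration and are not the real file:

```python
import pandas as pd

# Toy stand-in for `merged` (hypothetical values)
toy = pd.DataFrame({"SICcode": ["111", "112", "Aux", "1081"]})

# Try to parse every key as a number; anything unparsable becomes NaN
as_num = pd.to_numeric(toy["SICcode"], errors="coerce")

# Rows that had a value but could not be parsed as a number
bad = toy[as_num.isna() & toy["SICcode"].notna()]
print(bad)
```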

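Because of that single non-numeric value, merged.SICcode is read as strings (object), while df.SICcode is read as int64. The string '111' never equals the integer 111, so no keys match: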

In [61]: df.dtypes
Out[61]:
SICcode      int64
Catcode     object
Category    object
SICname     object
MultSIC     object
dtype: object

In [62]: merged.dtypes
Out[62]:
2012 NAICS Code     float64
2002to2007 NAICS    float64
SICcode              object
dtype: object

In [63]: df.SICcode.unique()
Out[63]: array([ 111,  112,  115, ..., 9711, 9721, 9999], dtype=int64)

In [64]: merged.SICcode.head(10).unique()
Out[64]: array(['116', '119', '111', '115', '112', '139'], dtype=object)

So you can do it like this:

url = 'https://raw.githubusercontent.com/108michael/ms_thesis/master/SIC2CRPCats.csv'
df = pd.read_csv(url, engine='python', sep=r'\s{2,}', encoding='utf-8_sig')

url='https://raw.githubusercontent.com/108michael/ms_thesis/master/test.merge'
merged = pd.read_csv(url, index_col=0)

# clean the data: non-numeric values such as 'Aux' become NaN
merged.SICcode = pd.to_numeric(merged.SICcode, errors='coerce')

mrg = df.merge(merged, on='SICcode', how='left')

mrg.head()
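
The same idea can be tried without downloading the files. This is only a sketch on toy data: the frames `df_toy` and `merged_toy` and their values are assumptions that mimic the dtypes shown above.

```python
import pandas as pd

# Toy stand-ins (hypothetical values): integer keys on one side, string keys on the other
df_toy = pd.DataFrame({"SICcode": [111, 112, 115],
                       "SICname": ["Wheat", "Rice", "Corn"]})
merged_toy = pd.DataFrame({"2012 NAICS Code": [111140.0, 111160.0, 551114.0],
                           "SICcode": ["111", "112", "Aux"]})

# Convert the string keys to numbers; 'Aux' cannot be parsed and becomes NaN
merged_toy["SICcode"] = pd.to_numeric(merged_toy["SICcode"], errors="coerce")

# Both key columns are numeric now, so the keys can match
mrg_toy = df_toy.merge(merged_toy, on="SICcode", how="left")
print(mrg_toy)
```

Here 111 and 112 pick up their NAICS codes, while 115 has no partner in the toy data and is left with NaN.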

Output:

In [51]: mrg.head()
Out[51]:
   SICcode Catcode                                       Category  \
0      111   A1500           Wheat, corn, soybeans and cash grain
1      112   A1600  Other commodities (incl rice, peanuts, honey)
2      115   A1500           Wheat, corn, soybeans and cash grain
3      116   A1500           Wheat, corn, soybeans and cash grain
4      119   A1500           Wheat, corn, soybeans and cash grain

            SICname MultSIC  2012 NAICS Code  2002to2007 NAICS
0             Wheat       X         111140.0          111140.0
1              Rice       X         111160.0          111160.0
2              Corn       X         111150.0          111150.0
3          Soybeans       X         111110.0          111110.0
4  Cash grains, NEC       X         111120.0          111120.0
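
To confirm that the keys really match after cleaning, the `indicator=True` option of `merge` can help. This is a minimal sketch on toy data; the frames `left` and `right` and the column name `NAICS` are hypothetical:

```python
import pandas as pd

# Toy frames (hypothetical values); 9999 has no partner on the right side
left = pd.DataFrame({"SICcode": [111, 112, 9999]})
right = pd.DataFrame({"SICcode": [111, 112], "NAICS": [111140, 111160]})

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'
check = left.merge(right, on="SICcode", how="left", indicator=True)

n_matched = int((check["_merge"] == "both").sum())
unmatched = check.loc[check["_merge"] == "left_only", "SICcode"].tolist()
print(n_matched, unmatched)
```

If every row comes back as 'left_only', the key columns most likely still differ in dtype or formatting.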
