Splitting a DataFrame column row-wise by a boolean condition into multiple columns with fixed headers in pandas

Problem description

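I downloaded a dataset from https://aps.dac.gov.in/APY/Public_Report1.aspx. The site requires making a selection, and based on my selection I got a file containing data for all states and all crops for 2017-2018. This is what the dataset looks like after importing it with read_excel():
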
State/Crop/District             Season  Area (Hectare)  Production (Tonnes) Yield (Tonnes/Hectare)
0   Andaman and Nicobar Islands NaN             NaN     NaN                 NaN
1   Arecanut                    NaN             NaN     NaN                 NaN
2   1.NICOBARS                  Rabi            534.10  125.23              0.234469
3   2.NORTH AND MIDDLE ANDAMAN  Rabi            1744.00 4639.44             2.66023
4   3.SOUTH ANDAMANS            Rabi            1220.20 10518.7             8.62047
6   Arhar/Tur                   NaN             NaN     NaN                 NaN
7   1.NORTH AND MIDDLE ANDAMAN  Rabi            1.20    0.6                 0.5
9   Black pepper                NaN             NaN     NaN                 NaN
10  1.NICOBARS                  Rabi            12.40   0.42                0.033871
11  2.NORTH AND MIDDLE ANDAMAN  Rabi            8.76    2.13                0.243151
12  3.SOUTH ANDAMANS            Rabi            69.46   349.72              5.03484

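The first column, named State/Crop/District, contains three different kinds of values. In my view they belong in three separate columns, but they are not stored that way. Note also that not every crop is grown in every district. My goal is to get the data into the following form: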

State                       Crop     District                 Season    Area    Production    Yield
Andaman and Nicobar Islands Arecanut Nicobars                 Rabi      534     125.23        0.2344
Andaman and Nicobar Islands Arecanut North and Middle Andaman Rabi      1744    4639.44       2.66023

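and so on. There are about 27 states (such as Andaman and Nicobar Islands), roughly 54 different crops, and numerous districts. I tried three approaches to solve this, but none of them worked.

  1. I used the pivot function in pandas, but it created a DataFrame with about 12,000 columns, which is not what I want.
  2. I used melt on the first column, with the other columns as value_vars. The result was similar to the first attempt.
  3. I wrote a custom function that scans for rows containing the word Total, which is where two crops are separated in the original file I downloaded, but I could not scale it to the other states.

Some answers suggest using stack()/unstack(), but since I cannot see a MultiIndex here, I was not able to use them. I am new to pandas and am using Python 3. Any help would be appreciated.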

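Solution

Here is a complete demonstration of how to do this. It is a bit clumsy, but I don't think there is a built-in pandas function that explodes "State/Crop/District" into the corresponding columns. Apart from the for loop, it's not too bad :)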

import pandas as pd
from io import StringIO

data = """State/Crop/District             Season  Area (Hectare)  Production (Tonnes)  Yield (Tonnes/Hectare)
Andaman and Nicobar Islands  NaN             NaN     NaN                 NaN
Arecanut                    NaN             NaN     NaN                 NaN
1.NICOBARS                  Rabi            534.10  125.23              0.234469
2.NORTH AND MIDDLE ANDAMAN  Rabi            1744.00  4639.44             2.66023
3.SOUTH ANDAMANS            Rabi            1220.20  10518.7             8.62047
Arhar/Tur                   NaN             NaN     NaN                 NaN
1.NORTH AND MIDDLE ANDAMAN  Rabi            1.20    0.6                 0.5
Black pepper                NaN             NaN     NaN                 NaN
1.NICOBARS                  Rabi            12.40   0.42                0.033871
2.NORTH AND MIDDLE ANDAMAN  Rabi            8.76    2.13                0.243151
3.SOUTH ANDAMANS            Rabi            69.46   349.72              5.03484"""

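# Columns are separated by two or more spaces; a regex separator needs the python engine.
df = pd.read_table(StringIO(data), sep=r"\s\s+", engine="python")
rows = list(zip(df.iloc[:, 0], df.iloc[:, 1]))
state = crop = None
out = []
for i, (name, season) in enumerate(rows):
    if pd.isna(season):
        # A header row directly followed by another header row is a state; otherwise it is a crop.
        if i + 1 < len(rows) and pd.isna(rows[i + 1][1]):
            state = name
        else:
            crop = name
        # Placeholder keeps `out` aligned with df's index; these rows are dropped below.
        out.append([None, None, None])
    else:
        # Strip the "1." style numeric prefix from the district name.
        out.append([state.title(), crop.title(), name.split(".", 1)[-1].title()])
df1 = pd.DataFrame(out, columns=["State", "Crop", "District"], index=df.index)
df_final = pd.concat([df1, df.iloc[:, 1:]], axis=1).dropna()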

Output (print(df_final.to_string())):

                          State          Crop                  District Season  Area (Hectare)  Production (Tonnes)  Yield (Tonnes/Hectare)
2   Andaman And Nicobar Islands      Arecanut                  Nicobars   Rabi          534.10               125.23                0.234469
3   Andaman And Nicobar Islands      Arecanut  North And Middle Andaman   Rabi         1744.00              4639.44                2.660230
4   Andaman And Nicobar Islands      Arecanut            South Andamans   Rabi         1220.20             10518.70                8.620470
6   Andaman And Nicobar Islands     Arhar/Tur  North And Middle Andaman   Rabi            1.20                 0.60                0.500000
8   Andaman And Nicobar Islands  Black Pepper                  Nicobars   Rabi           12.40                 0.42                0.033871
9   Andaman And Nicobar Islands  Black Pepper  North And Middle Andaman   Rabi            8.76                 2.13                0.243151
10  Andaman And Nicobar Islands  Black Pepper            South Andamans   Rabi           69.46               349.72                5.034840

Processing steps:

  1. Split the DataFrame on the NA rows
  2. Expand a DataFrame from the index to serve as headers
  3. Fill the NA values with the column names in the expanded columns
  4. Merge the header DataFrame with the split DataFrame
  5. Rename the columns and add a "State" column
  6. Remove the leading number from the "District" column

There is probably room for improvement, but the code below gets the job done.

method='ffill'
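
The `method='ffill'` fragment points at a forward-fill approach. As a rough sketch of how that could look (this is not the answerer's original code, and the variable name `tidy` and the shortened sample data are my own assumptions), you can mark state and crop rows with boolean masks and forward-fill their names down onto the district rows. Recent pandas versions prefer `.ffill()` over `fillna(method='ffill')`, so the sketch uses the former.

```python
import pandas as pd
from io import StringIO

# Shortened copy of the sample data from the question (two crops only).
data = """State/Crop/District  Season  Area (Hectare)  Production (Tonnes)  Yield (Tonnes/Hectare)
Andaman and Nicobar Islands  NaN  NaN  NaN  NaN
Arecanut  NaN  NaN  NaN  NaN
1.NICOBARS  Rabi  534.10  125.23  0.234469
2.NORTH AND MIDDLE ANDAMAN  Rabi  1744.00  4639.44  2.66023
Arhar/Tur  NaN  NaN  NaN  NaN
1.NORTH AND MIDDLE ANDAMAN  Rabi  1.20  0.6  0.5"""

df = pd.read_csv(StringIO(data), sep=r"\s\s+", engine="python")

name = df["State/Crop/District"]
is_header = df["Season"].isna()                       # state and crop rows carry no measurements
next_is_header = is_header.shift(-1, fill_value=False)
is_state = is_header & next_is_header                 # header directly followed by another header
is_crop = is_header & ~next_is_header                 # header directly followed by a district row

# Keep each name only on its own header row, then forward-fill it downwards.
state = name.where(is_state).ffill()
crop = name.where(is_crop).ffill()
# Assumes districts are prefixed like "1." or "12."; strip that numbering.
district = name.str.replace(r"^\d+\.", "", regex=True)

labels = pd.DataFrame({
    "State": state.str.title(),
    "Crop": crop.str.title(),
    "District": district.str.title(),
})
tidy = labels.join(df.drop(columns="State/Crop/District"))
tidy = tidy[~is_header].reset_index(drop=True)        # keep only the district rows

print(tidy.to_string())
```

This relies on the same rule as the loop in the first answer: a row with no Season is a state if the row after it also has no Season, and a crop otherwise. If the real file contains extra rows such as the Total rows mentioned in the question, those would need to be filtered out first.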
