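Problem description
I downloaded a dataset from this website: https://aps.dac.gov.in/APY/Public_Report1.aspx. You have to make some selections there, and based on mine I got a file containing data for all states and all crops for 2017-2018. This is what it looks like after loading it with read_excel():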
    State/Crop/District          Season  Area (Hectare)  Production (Tonnes)  Yield (Tonnes/Hectare)
0   Andaman and Nicobar Islands  NaN     NaN             NaN                  NaN
1   Arecanut                     NaN     NaN             NaN                  NaN
2   1.NICOBARS                   Rabi    534.10          125.23               0.234469
3   2.NORTH AND MIDDLE ANDAMAN   Rabi    1744.00         4639.44              2.66023
4   3.SOUTH ANDAMANS             Rabi    1220.20         10518.7              8.62047
6   Arhar/Tur                    NaN     NaN             NaN                  NaN
7   1.NORTH AND MIDDLE ANDAMAN   Rabi    1.20            0.6                  0.5
9   Black pepper                 NaN     NaN             NaN                  NaN
10  1.NICOBARS                   Rabi    12.40           0.42                 0.033871
11  2.NORTH AND MIDDLE ANDAMAN   Rabi    8.76            2.13                 0.243151
12  3.SOUTH ANDAMANS             Rabi    69.46           349.72               5.03484
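The first column, named State/Crop/District, contains three different kinds of values (state, crop, and district). In my view these should be in three separate columns, but they are not. Note also that not every crop is grown in every district. My goal is to get the data into the following form: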
State                        Crop      District                  Season  Area  Production  Yield
Andaman and Nicobar Islands  Arecanut  Nicobars                  Rabi    534   125.23      0.2344
Andaman and Nicobar Islands  Arecanut  North and Middle Andaman  Rabi    1744  4639.44     2.66023
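and so on. There are about 27 states (such as Andaman and Nicobar Islands), about 54 different crops, and numerous districts. I tried three approaches to solve this, but none of them worked.
- I used the pivot function in pandas, but it created a data frame with about 12,000 columns. That is not what I want.
- I used melt on the first column, with the other columns as value_vars. The result was similar to the first attempt.
- I wrote a custom function that scans for rows containing the word Total. That is where two crops are separated in the original file I downloaded, but I could not scale it to the other states.
Some answers suggested using stack() and unstack(), but since I don't see a MultiIndex, they can't be used here. I am new to pandas and am using Python 3. Any help would be appreciated.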
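Solution
Here is a complete demonstration of how to do this. It is a bit clunky, but I don't think there is a built-in pandas function that really explodes "State/Crop/District" into the corresponding columns. Apart from the for loop, it's not too bad :)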
import pandas as pd
from io import StringIO

data = """State/Crop/District  Season  Area (Hectare)  Production (Tonnes)  Yield (Tonnes/Hectare)
Andaman and Nicobar Islands  NaN  NaN  NaN  NaN
Arecanut  NaN  NaN  NaN  NaN
1.NICOBARS  Rabi  534.10  125.23  0.234469
2.NORTH AND MIDDLE ANDAMAN  Rabi  1744.00  4639.44  2.66023
3.SOUTH ANDAMANS  Rabi  1220.20  10518.7  8.62047
Arhar/Tur  NaN  NaN  NaN  NaN
1.NORTH AND MIDDLE ANDAMAN  Rabi  1.20  0.6  0.5
Black pepper  NaN  NaN  NaN  NaN
1.NICOBARS  Rabi  12.40  0.42  0.033871
2.NORTH AND MIDDLE ANDAMAN  Rabi  8.76  2.13  0.243151
3.SOUTH ANDAMANS  Rabi  69.46  349.72  5.03484"""

# Fields are separated by two or more spaces (the names contain single spaces).
df = pd.read_table(StringIO(data), sep=r"\s\s+", engine="python")

# Pair every label in the first column with its Season value.
l = list(zip(df.iloc[:, 0], df.iloc[:, 1]))
out = []
state = crop = None
for i, (j, k) in enumerate(l):
    if pd.isna(k):
        # Header row: a state if the next row is also a header, else a crop.
        if i + 1 < len(l) and pd.isna(l[i + 1][1]):
            state = j
        else:
            crop = j
        # Placeholder so that `out` stays aligned with the index of df.
        out.append([None, None, None])
    else:
        # Data row: drop the "1." style prefix from the district name.
        district = j.split(".", 1)[-1]
        out.append([state.title(), crop.title(), district.title()])

df1 = pd.DataFrame(columns=["State", "Crop", "District"], data=out)
# Header rows contain NaN on either side and are removed by the final dropna().
df_final = pd.concat([df1, df.dropna().iloc[:, 1:]], axis=1).dropna()
Output (print(df_final.to_string())):
    State                        Crop          District                  Season  Area (Hectare)  Production (Tonnes)  Yield (Tonnes/Hectare)
2   Andaman And Nicobar Islands  Arecanut      Nicobars                  Rabi    534.10          125.23               0.234469
3   Andaman And Nicobar Islands  Arecanut      North And Middle Andaman  Rabi    1744.00         4639.44              2.660230
4   Andaman And Nicobar Islands  Arecanut      South Andamans            Rabi    1220.20         10518.70             8.620470
6   Andaman And Nicobar Islands  Arhar/Tur     North And Middle Andaman  Rabi    1.20            0.60                 0.500000
8   Andaman And Nicobar Islands  Black Pepper  Nicobars                  Rabi    12.40           0.42                 0.033871
9   Andaman And Nicobar Islands  Black Pepper  North And Middle Andaman  Rabi    8.76            2.13                 0.243151
10  Andaman And Nicobar Islands  Black Pepper  South Andamans            Rabi    69.46           349.72               5.034840
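Once the data is in this long form, the unstack() idea mentioned in the question becomes usable, because State, Crop, and District can serve as a MultiIndex. A minimal sketch, assuming a few rows of the output above typed in by hand (the names tidy, area, and wide are only illustrative):

```python
import pandas as pd

# A few rows of the tidy output above, typed in by hand for illustration.
tidy = pd.DataFrame(
    [
        ("Andaman And Nicobar Islands", "Arecanut", "Nicobars", 534.10),
        ("Andaman And Nicobar Islands", "Arecanut", "South Andamans", 1220.20),
        ("Andaman And Nicobar Islands", "Black Pepper", "Nicobars", 12.40),
        ("Andaman And Nicobar Islands", "Black Pepper", "South Andamans", 69.46),
    ],
    columns=["State", "Crop", "District", "Area (Hectare)"],
)

# State/Crop/District now form a MultiIndex, so unstack() has levels to work with.
area = tidy.set_index(["State", "Crop", "District"])["Area (Hectare)"]

# One column per crop, one row per (State, District) pair.
wide = area.unstack("Crop")
print(wide)
```

If a district reports the same crop in more than one season, Season would probably need to be part of the index as well, since unstack() refuses to reshape an index with duplicate entries.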
Process:
- Split the data frame on the NA rows
- Expand the data frame from the index to use as headers
- Fill the NAs in the expanded columns with the column names
- Merge the header data frame and the split data frame
- Rename the columns and add a "State" column
- Remove the leading numbering characters from the "District" column
There may be room for improvement, but the code below does the job.
method='ffill'
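The steps listed above can be sketched with a forward fill roughly as follows. This is not the answer's original code, only one possible reading of it: the sample rows are retyped from the question, the names is_header, is_state, is_crop, headers, and tidy are made up for illustration, and ffill() is used in place of fillna(method='ffill'), which recent pandas versions deprecate.

```python
import pandas as pd

nan = float("nan")
# Sample rows retyped from the question (header rows carry no measurements).
rows = [
    ("Andaman and Nicobar Islands", None, nan, nan, nan),
    ("Arecanut", None, nan, nan, nan),
    ("1.NICOBARS", "Rabi", 534.10, 125.23, 0.234469),
    ("2.NORTH AND MIDDLE ANDAMAN", "Rabi", 1744.00, 4639.44, 2.66023),
    ("3.SOUTH ANDAMANS", "Rabi", 1220.20, 10518.7, 8.62047),
    ("Arhar/Tur", None, nan, nan, nan),
    ("1.NORTH AND MIDDLE ANDAMAN", "Rabi", 1.20, 0.6, 0.5),
    ("Black pepper", None, nan, nan, nan),
    ("1.NICOBARS", "Rabi", 12.40, 0.42, 0.033871),
    ("2.NORTH AND MIDDLE ANDAMAN", "Rabi", 8.76, 2.13, 0.243151),
    ("3.SOUTH ANDAMANS", "Rabi", 69.46, 349.72, 5.03484),
]
df = pd.DataFrame(rows, columns=[
    "State/Crop/District", "Season", "Area (Hectare)",
    "Production (Tonnes)", "Yield (Tonnes/Hectare)",
])

label = df["State/Crop/District"]
# Split on NA: rows without a Season are headers, the rest are data rows.
is_header = df["Season"].isna()
# Assumption: a header directly followed by another header is a state,
# any other header is a crop.
is_state = is_header & is_header.shift(-1, fill_value=False)
is_crop = is_header & ~is_state

# Header frame: names sit on their own rows, then get forward-filled downwards.
headers = pd.DataFrame({
    "State": label.where(is_state),
    "Crop": label.where(is_crop),
}).ffill()

# Merge the header frame with the data rows (aligned on the index) and tidy up.
tidy = headers[~is_header].join(df[~is_header])
tidy = tidy.rename(columns={"State/Crop/District": "District"})
tidy["District"] = (
    tidy["District"].str.replace(r"^\d+\.", "", regex=True).str.title()
)
tidy["State"] = tidy["State"].str.title()
tidy["Crop"] = tidy["Crop"].str.title()
print(tidy.to_string())
```

The state test relies on every state header being immediately followed by a crop header, as in the sample. A real download with extra rows (for example the Total rows mentioned in the question) would need those filtered out first.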