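1 Preface
I'm a little embarrassed to admit it: I, 知识追寻者 (Knowledge Seeker), jump around too much. I like learning a bit of everything and am not focused enough, and I still haven't finished my Java basics series. The revolution has not yet succeeded, and comrades must keep working. This article covers data processing with pandas. After reading it you will know the basics of removing and replacing data in a DataFrame or Series, an indispensable step in data processing.
WeChat official account: 知识追寻者
知识追寻者 (Inheriting the spirit of open source, Spreading technology knowledge;)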
2 Handling Duplicate Data
2.1 Building the Data
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
data = {
'user' : ['zszxz','zszxz','rose'],
'price' : [100, 100, -300],
'hobby' : ['reading','reading','hiking']
}
frame = pd.DataFrame(data)
print(frame)
    user  price    hobby
0  zszxz    100  reading
1  zszxz    100  reading
2   rose   -300   hiking
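2.2 Detecting and Handling Duplicates
duplicated() checks each row against the rows before it and returns a boolean Series in which True marks a row that repeats an earlier one.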
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
data = {
'user' : ['zszxz','zszxz','rose'],
'price' : [100, 100, -300],
'hobby' : ['reading','reading','hiking']
}
frame = pd.DataFrame(data)
# Flag duplicate rows; returns a boolean Series
print(frame.duplicated())
0    False
1     True
2    False
dtype: bool
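To see the duplicates themselves, use the result of duplicated() as a boolean mask on the DataFrame. Only the rows flagged as duplicates are returned; rows that were not flagged, including the first occurrence of each duplicated row, are not shown.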
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
data = {
'user' : ['zszxz','zszxz','rose'],
'price' : [100, 100, -300],
'hobby' : ['reading','reading','hiking']
}
frame = pd.DataFrame(data)
# Select only the rows flagged as duplicates
print(frame[frame.duplicated()])
    user  price    hobby
1  zszxz    100  reading
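The default behaviour of duplicated() can be adjusted. The following is a minimal sketch, assuming the same sample data, of the keep and subset parameters; the variable names first_mask, all_mask and user_mask are only illustrative.

```python
import pandas as pd

frame = pd.DataFrame({
    'user': ['zszxz', 'zszxz', 'rose'],
    'price': [100, 100, -300],
    'hobby': ['reading', 'reading', 'hiking'],
})

# Default keep='first': only the later copies of a row are flagged
first_mask = frame.duplicated()

# keep=False: every member of a duplicate group is flagged
all_mask = frame.duplicated(keep=False)

# subset: compare only the listed columns when looking for duplicates
user_mask = frame.duplicated(subset=['user'])

# Show both copies of the repeated row
print(frame[all_mask])
```

With keep=False the mask selects both copies of the repeated row, which is convenient when you want to inspect duplicates before deciding which copy to keep.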
Another handy function is drop_duplicates(), which removes the duplicate rows and returns the remaining data.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
data = {
'user' : ['zszxz','zszxz','rose'],
'price' : [100, 100, -300],
'hobby' : ['reading','reading','hiking']
}
frame = pd.DataFrame(data)
# Drop duplicate rows, keeping the first occurrence of each
print(frame.drop_duplicates())
    user  price    hobby
0  zszxz    100  reading
2   rose   -300   hiking
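drop_duplicates() accepts the same subset and keep parameters as duplicated(). As a rough sketch, assuming you want one row per user and prefer the last occurrence, it might look like this; deduped is a hypothetical variable name, and reset_index(drop=True) is used only to renumber the rows.

```python
import pandas as pd

frame = pd.DataFrame({
    'user': ['zszxz', 'zszxz', 'rose'],
    'price': [100, 100, -300],
    'hobby': ['reading', 'reading', 'hiking'],
})

# Keep the last row for each user, then renumber the index from 0
deduped = (
    frame.drop_duplicates(subset=['user'], keep='last')
         .reset_index(drop=True)
)
print(deduped)

# drop_duplicates() returns a new DataFrame; the original still has 3 rows
print(len(frame))
```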
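2.3 Dropping Rows or Columns
Sometimes data has to be deleted. An earlier article mentioned deleting a column with del; this time we use the drop() function to delete rows. To delete several rows, pass a list of index labels separated by commas, in the form [index0, index1, ...].
To delete a column with drop(), specify axis=1.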
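import pandas as pd
data = {
'user' : ['zszxz','zszxz','rose'],
'price' : [100, 100, -300],
'hobby' : ['reading','reading','hiking']
}
frame = pd.DataFrame(data)
# Drop the row whose index label is 0; the original frame is left unchanged
del_frame = frame.drop([0])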
print(del_frame)
    user  price    hobby
1  zszxz    100  reading
2   rose   -300   hiking
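The text above mentions dropping several rows and dropping a column with axis=1 but only shows a single row. A small sketch of both, assuming the same sample data (the result variable names are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({
    'user': ['zszxz', 'zszxz', 'rose'],
    'price': [100, 100, -300],
    'hobby': ['reading', 'reading', 'hiking'],
})

# Drop several rows by passing a list of index labels
rows_dropped = frame.drop([0, 1])

# Drop a column by label with axis=1
col_dropped = frame.drop('hobby', axis=1)

# Equivalent spelling using the columns keyword
same_col_dropped = frame.drop(columns=['hobby'])

print(rows_dropped)
print(col_dropped)
```

In each case drop() returns a new DataFrame, so frame itself keeps all three rows and columns.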
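3 Replacing Data
3.1 Replacing a Single Value
The original data contains an unreasonable price, -300, which should be replaced with a reasonable value such as 200. The replace() function replaces every occurrence of one value with another.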
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
data = {
'user' : ['zszxz','zszxz','rose'],
'price' : [100, 100, -300],
'hobby' : ['reading','reading','hiking']
}
frame = pd.DataFrame(data)
re_frame = frame.replace(-300,200)
print(re_frame)
    user  price    hobby
0  zszxz    100  reading
1  zszxz    100  reading
2   rose    200   hiking
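Note that frame.replace(-300, 200) looks for -300 in every column. As a minimal sketch, assuming you only want to touch the price column, Series.replace can be applied to that column alone; fixed is just an illustrative variable name.

```python
import pandas as pd

frame = pd.DataFrame({
    'user': ['zszxz', 'zszxz', 'rose'],
    'price': [100, 100, -300],
    'hobby': ['reading', 'reading', 'hiking'],
})

# Work on a copy so the original frame stays intact
fixed = frame.copy()

# Replace -300 only within the price column
fixed['price'] = fixed['price'].replace(-300, 200)
print(fixed)
```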
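3.2 Replacing Multiple Values
To replace several different values at once, pass two lists: the first holds the values to find and the second holds their replacements, matched by position. For example, replace every 100 with 200 and every -300 with 300.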
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
data = {
'user' : ['zszxz','zszxz','rose'],
'price' : [100, 100, -300],
'hobby' : ['reading','reading','hiking']
}
frame = pd.DataFrame(data)
re_frame = frame.replace([100,-300],[200,300])
print(re_frame)
    user  price    hobby
0  zszxz    200  reading
1  zszxz    200  reading
2   rose    300   hiking
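A list of values to find can also be paired with a single replacement value, in which case every listed value is replaced by it. A short sketch under that assumption, using 0 as an arbitrary placeholder value:

```python
import pandas as pd

frame = pd.DataFrame({
    'user': ['zszxz', 'zszxz', 'rose'],
    'price': [100, 100, -300],
    'hobby': ['reading', 'reading', 'hiking'],
})

# Both 100 and -300 are replaced by the same value, 0
zeroed = frame.replace([100, -300], 0)
print(zeroed)
```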
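3.3 Replacing with a Dictionary
replace() also accepts a dictionary whose keys are the values to find and whose values are the replacements.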
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
data = {
'user' : ['zszxz','zszxz','rose'],
'price' : [100, 100, -300],
'hobby' : ['reading','reading','hiking']
}
frame = pd.DataFrame(data)
re_frame = frame.replace({-300:200})
print(re_frame)
    user  price    hobby
0  zszxz    100  reading
1  zszxz    100  reading
2   rose    200   hiking
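The dictionary form extends naturally to several values, and a nested dictionary can restrict the replacement to one column. The following is a sketch, assuming the same sample data; multi and scoped are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame({
    'user': ['zszxz', 'zszxz', 'rose'],
    'price': [100, 100, -300],
    'hobby': ['reading', 'reading', 'hiking'],
})

# Several old -> new pairs in one dictionary
multi = frame.replace({100: 200, -300: 300})

# Nested form: {column: {old: new}} limits the replacement to that column
scoped = frame.replace({'price': {-300: 200}})

print(multi)
print(scoped)
```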
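4 Adding Data
To attach a description to each element of the hobby column, that is, to add a new column, use the map() function to look each value up in a dictionary.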
import pandas as pd
data = {
'user' : ['zszxz','zszxz','rose'],
'price' : [100, 100, -300],
'hobby' : ['reading','reading','hiking']
}
frame = pd.DataFrame(data)
# Look up each hobby value in the dictionary and store the result in a new column
add_column = {"reading":300, "running":900, "hiking":5}
frame['term'] = frame['hobby'].map(add_column)
print(frame)
    user  price    hobby  term
0  zszxz    100  reading   300
1  zszxz    100  reading   300
2   rose   -300   hiking     5
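One thing to watch for with map(): a value that has no key in the dictionary becomes NaN. The sketch below assumes a hypothetical hobby, 'swimming', that is missing from the mapping, and shows one way to fill the gap with a default of 0 (the default and the term_filled column name are assumptions for illustration).

```python
import pandas as pd

frame = pd.DataFrame({
    'user': ['zszxz', 'lily', 'rose'],           # 'lily' is a made-up user
    'hobby': ['reading', 'swimming', 'hiking'],  # 'swimming' has no mapping
})
terms = {"reading": 300, "running": 900, "hiking": 5}

# Unmatched values become NaN, so the new column is floating point
frame['term'] = frame['hobby'].map(terms)

# Fill the gaps with a default and convert back to integers
frame['term_filled'] = frame['hobby'].map(terms).fillna(0).astype(int)
print(frame)
```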
5 Renaming the Index
If the original index is not to your liking, you can adjust it by renaming the index labels with the rename() function.
import pandas as pd
data = {
'user' : ['zszxz','zszxz','rose'],
'price' : [100, 100, -300],
'hobby' : ['reading','reading','hiking']
}
frame = pd.DataFrame(data)
# Rename the index labels
reindex = {0:"user1", 1:"user2", 2:"user3"}
rename_frame = frame.rename(reindex)
print(rename_frame)
        user  price    hobby
user1  zszxz    100  reading
user2  zszxz    100  reading
user3   rose   -300   hiking
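rename() can change column labels as well as index labels, and it accepts a function in place of a dictionary. A brief sketch, assuming the same sample data; the new column name 'amount' is purely illustrative.

```python
import pandas as pd

frame = pd.DataFrame({
    'user': ['zszxz', 'zszxz', 'rose'],
    'price': [100, 100, -300],
    'hobby': ['reading', 'reading', 'hiking'],
})

# Rename index labels and one column in a single call
renamed = frame.rename(
    index={0: "user1", 1: "user2", 2: "user3"},
    columns={'price': 'amount'},
)

# A function is applied to every label; here all column names are upper-cased
upper = frame.rename(columns=str.upper)

print(renamed)
print(upper)
```

Like the other methods in this article, rename() returns a new DataFrame and leaves frame untouched.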