Reading files:

pandas can read CSV, Excel, and plain-text files with one call each:

import pandas as pd
#read csv
df_csv = pd.read_csv('1.csv')
#read excel
pd_excel = pd.read_excel('/Users/j.levis/PycharmProjects/爬豆瓣榜单/豆瓣榜单.xlsx')
#read txt
pd_txt = pd.read_table('2.txt')

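Common read parameters:

header=None means the first row is not used as the column names

index_col uses one or more columns as the index

usecols is the set of columns to read

parse_dates lists the columns to convert to datetimes

nrows is the number of data rows to read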
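
The parameters above can be tried without any file on disk. The following is a minimal sketch that feeds pd.read_csv an in-memory string; the CSV content and its column names (id, date, city, score) are made up for illustration and are not part of the original data.

```python
import io
import pandas as pd

# Hypothetical CSV content, used in place of a file on disk.
csv_text = """id,date,city,score
1,2021-01-01,Beijing,90
2,2021-01-02,Shanghai,85
3,2021-01-03,Shenzhen,78
"""

# Use 'id' as the index, read only three of the columns,
# parse 'date' as datetimes, and stop after two data rows.
df = pd.read_csv(
    io.StringIO(csv_text),
    index_col='id',
    usecols=['id', 'date', 'score'],
    parse_dates=['date'],
    nrows=2,
)
print(df)
print(df.dtypes)

# With header=None the first line is treated as data,
# and the columns are numbered 0, 1, 2, ...
raw = pd.read_csv(io.StringIO(csv_text), header=None)
print(raw.head(2))
```

The same keyword arguments are accepted by pd.read_table, and most of them by pd.read_excel.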

Taking the Douban book ranking I scraped earlier as an example:

pd_excel = pd.read_excel('/Users/j.levis/PycharmProjects/爬豆瓣榜单/豆瓣榜单.xlsx',usecols=['排名','书名'],nrows=5)


   排名    书名
0   1   红楼梦
1   2    活着
2   3  1984
3   4  百年孤独
4   5  三体全集

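Data structures:

Series:

A Series generally consists of four parts: the values (data), the index (index), the storage type (dtype), and the name of the sequence (name). If index is not given, pandas assigns the default integer index 0, 1, 2, ...

The object dtype represents a mixed type: the example below stores an integer, a string, and a Python dict.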

s = pd.Series(data=[1,'你好',{'你好':'hello','再见':'bye'}],dtype='object',name = 'myname',index=['one','two','three'])

one                                 1
two                                你好
three    {'你好': 'hello', '再见': 'bye'}
Name: myname, dtype: object

s['one']

1
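
As a small illustration of the four parts, the sketch below builds a Series and reads each part back through its attribute. The values and labels are arbitrary and chosen only for the example.

```python
import pandas as pd

s = pd.Series(data=[1.5, 2.5, 3.5], index=['a', 'b', 'c'], name='demo')

print(s.values)  # the underlying data as a NumPy array
print(s.index)   # the index labels
print(s.dtype)   # float64
print(s.name)    # demo

# Without an explicit index, pandas assigns the positions 0, 1, 2, ...
t = pd.Series([10, 20, 30])
print(list(t.index))  # [0, 1, 2]
```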

DataFrame:

A DataFrame adds a column index on top of Series; a frame can be constructed from two-dimensional data together with row and column indexes.

df = pd.DataFrame(
    data=[[1,2,3],[2,3,4],[4,5,6]]
)


   0  1  2
0  1  2  3
1  2  3  4
2  4  5  6
# The row index and column names both default to 0, 1, 2

Setting column names:

df = pd.DataFrame(
    data={'col1':[1,2,3],'col2':['a','b','c'],'col3':['m','n','k']}
)

   col1 col2 col3
0     1    a    m
1     2    b    n
2     3    c    k

Setting the index:

df = pd.DataFrame(
    data={'col1':[1,2,3],'col2':['a','b','c'],'col3':['m','n','k']},
    index=['row1','row2','row3']
)


      col1 col2 col3
row1     1    a    m
row2     2    b    n
row3     3    c    k
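
The row and column labels can also be inspected, and reassigned, after construction. A brief sketch, reusing a two-column version of the frame above; the new labels x, y, r1, r2, r3 are placeholders.

```python
import pandas as pd

df = pd.DataFrame(
    data={'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']},
    index=['row1', 'row2', 'row3'],
)

print(list(df.index))    # ['row1', 'row2', 'row3']
print(list(df.columns))  # ['col1', 'col2']
print(df.shape)          # (3, 2)

# Work on a copy so the original frame keeps its labels.
renamed = df.copy()
renamed.columns = ['x', 'y']
renamed.index = ['r1', 'r2', 'r3']
print(renamed)
```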

A DataFrame can be written directly to an Excel file:

df.to_excel('1.xlsx')

Selecting columns:

In a DataFrame, [col_name] selects a single column and [col_list] selects a table made up of several columns; the results are a Series and a DataFrame respectively:

df['col2']


row1    a
row2    b
row3    c
Name: col2, dtype: object

df[['col1','col2']]

      col1 col2
row1     1    a
row2     2    b
row3     3    c
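
The difference in return type is easy to check directly. A minimal sketch on the same small frame; note that a one-element list still produces a DataFrame.

```python
import pandas as pd

df = pd.DataFrame(
    data={'col1': [1, 2, 3], 'col2': ['a', 'b', 'c'], 'col3': ['m', 'n', 'k']},
    index=['row1', 'row2', 'row3'],
)

single = df['col2']            # one column name  -> Series
multi = df[['col1', 'col2']]   # list of names    -> DataFrame
one_col_frame = df[['col2']]   # one-element list -> DataFrame with one column

print(type(single).__name__)         # Series
print(type(multi).__name__)          # DataFrame
print(type(one_col_frame).__name__)  # DataFrame
```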

Selecting rows:

Rows can be selected by label with the loc indexer:

df.loc[['row1']]

      col1 col2 col3
row1     1    a    m
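
As with columns, the form of the label decides the return type. The sketch below compares a list of labels, a single label, a (row, column) pair, and position-based selection with iloc, which is not used elsewhere in this article and is shown only for contrast.

```python
import pandas as pd

df = pd.DataFrame(
    data={'col1': [1, 2, 3], 'col2': ['a', 'b', 'c'], 'col3': ['m', 'n', 'k']},
    index=['row1', 'row2', 'row3'],
)

row_frame = df.loc[['row1']]    # list of labels -> DataFrame
row_series = df.loc['row1']     # single label   -> Series named 'row1'
cell = df.loc['row2', 'col2']   # row and column label -> single value
first_row = df.iloc[0]          # select by position instead of label

print(type(row_frame).__name__)   # DataFrame
print(type(row_series).__name__)  # Series
print(cell)                       # b
```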

Transposing a DataFrame:

df.T

     row1 row2 row3
col1    1    2    3
col2    a    b    c
col3    m    n    k

Common basic functions:

Sample file: learn_pandas.csv

df = pd.read_csv('learn_pandas.csv')
df = df[df.columns[:7]]  # keep the first seven columns

    School      Grade            Name  Gender  Height  Weight Transfer
0        A   Freshman    Gaopeng Yang  Female   158.9    46.0        N
1        B   Freshman  Changqiang You    Male   166.5    70.0        N
2        A     Senior         Mei Sun    Male   188.9    89.0        N
3        C  Sophomore    Xiaojuan Sun  Female     NaN    41.0        N
4        C  Sophomore     Gaojuan You    Male   174.0    74.0        N
..     ...        ...             ...     ...     ...     ...      ...
195      C     Junior    Xiaojuan Sun  Female   153.9    46.0        N
196      D     Senior         Li Zhao  Female   160.9    50.0        N
197      A     Senior  Chengqiang Chu  Female   153.9    45.0        N
198      A     Senior   Chengmei Shen    Male   175.3    71.0        N
199      D  Sophomore     Chunpeng Lv    Male   155.7    51.0        N

Summary functions:

head(n) returns the first n rows

df.head(3)


  School     Grade            Name  Gender  Height  Weight Transfer
0      A  Freshman    Gaopeng Yang  Female   158.9    46.0        N
1      B  Freshman  Changqiang You    Male   166.5    70.0        N
2      A    Senior         Mei Sun    Male   188.9    89.0        N

tail(n) returns the last n rows

df.tail(3)


    School      Grade            Name  Gender  Height  Weight Transfer
197      A     Senior  Chengqiang Chu  Female   153.9    45.0        N
198      A     Senior   Chengmei Shen    Male   175.3    71.0        N
199      D  Sophomore     Chunpeng Lv    Male   155.7    51.0        N

info() prints a summary of the DataFrame:

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   School    200 non-null    object 
 1   Grade     200 non-null    object 
 2   Name      200 non-null    object 
 3   Gender    200 non-null    object 
 4   Height    183 non-null    float64
 5   Weight    189 non-null    float64
 6   Transfer  188 non-null    object 
dtypes: float64(2), object(5)
memory usage: 11.1+ KB

describe() returns summary statistics for the numeric columns of the DataFrame:

df.describe()

           Height      Weight
count  183.000000  189.000000
mean   163.218033   55.015873
std      8.608879   12.824294
min    145.400000   34.000000
25%    157.150000   46.000000
50%    161.900000   51.000000
75%    167.500000   65.000000
max    193.900000   89.000000

Values such as count, mean, and std can also be obtained one at a time through the corresponding methods:

df_demo = df['Height']
df_demo.mean()
df_demo.quantile(0.75)

163.21803278688526
167.5

count() returns the number of non-missing values:

df_demo.count()

183
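
This explains why count() gives 183 for a column in a 200-row table: missing values are skipped. By default mean and quantile skip them as well. A small sketch with made-up heights, one of which is missing:

```python
import pandas as pd

# Four hypothetical measurements, one of them missing.
h = pd.Series([150.0, float('nan'), 170.0, 160.0], name='Height')

print(len(h))           # 4, every row including the missing one
print(h.count())        # 3, non-missing values only
print(h.isna().sum())   # 1, number of missing values
print(h.mean())         # 160.0, computed over the 3 present values
print(h.quantile(0.5))  # 160.0, the median of the present values
```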

idxmax() returns the index label of the maximum value:

df_demo.idxmax()

193
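
The value returned is an index label, not a position, which becomes visible once the index is not the default 0, 1, 2, ... A minimal sketch with hypothetical heights and letter labels:

```python
import pandas as pd

h = pd.Series(
    [158.9, 166.5, 188.9, float('nan'), 174.0],
    index=['a', 'b', 'c', 'd', 'e'],
    name='Height',
)

print(h.idxmax())  # c, the label of the largest value
print(h.idxmin())  # a, the label of the smallest value

# The label can be passed straight back to loc to fetch the row.
print(h.loc[h.idxmax()])  # 188.9
```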

Unique-value functions:

Calling unique and nunique on a Series returns the array of its unique values and the number of unique values, respectively:

df['School'].unique()

['A' 'B' 'C' 'D']

df['School'].nunique() 

4

value_counts returns the unique values together with the number of times each one occurs

df['School'].value_counts()

D    69
A    57
C    40
B    34
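
The three functions can be compared side by side on a tiny Series. The school labels below are invented for the example; normalize=True is an optional argument of value_counts that returns proportions instead of counts.

```python
import pandas as pd

school = pd.Series(['A', 'B', 'A', 'C', 'A', 'B'], name='School')

print(school.unique())   # unique values in order of first appearance
print(school.nunique())  # 3

counts = school.value_counts()  # sorted by count, largest first
print(counts)

shares = school.value_counts(normalize=True)  # proportions instead of counts
print(shares)
```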

Sorting functions:

sort_values sorts by value:

The default ascending=True sorts in ascending order; ascending=False sorts in descending order.

df_demo = df[['Height','Weight']]
print(df_demo.sort_values('Height').head(10))

     Height  Weight
143   145.4    34.0
49    147.3    34.0
120   147.8    34.0
30    148.7    41.0
80    150.5    40.0
139   150.5    36.0
185   151.2    43.0
89    151.5    44.0
29    151.6    43.0
39    151.7    38.0

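Sorting often involves more than one column. For example, sort by weight in ascending order and, among rows with the same weight, by height in descending order: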

df_demo.sort_values(['Weight','Height'],ascending=[True,False]).head() 


     Height  Weight
120   147.8    34.0
49    147.3    34.0
143   145.4    34.0
139   150.5    36.0
108   152.4    38.0
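
The tie-breaking behaviour can be checked on a frame small enough to sort by hand. A sketch with four made-up rows, three of which share the same weight:

```python
import pandas as pd

df_small = pd.DataFrame({
    'Height': [147.8, 145.4, 150.5, 147.3],
    'Weight': [34.0, 34.0, 36.0, 34.0],
})

# Weight ascending first; ties on Weight are broken by Height descending.
out = df_small.sort_values(['Weight', 'Height'], ascending=[True, False])
print(out)

# The original row labels travel with the rows.
print(list(out.index))  # [0, 3, 1, 2]
```

sort_values returns a new DataFrame and leaves df_small unchanged.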
