Pandas DataFrame basics in Python: adding, deleting, modifying, querying, de-duplicating, and sampling

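Overview

pandas offers several main indexers:

loc: label-based indexing, using row and column names

iloc: integer (positional) indexing, counting rows and columns from 0

ix: a hybrid of loc and iloc; it was deprecated in pandas 0.20 and removed in pandas 1.0, so use loc or iloc instead

at: a fast scalar accessor, the label-based counterpart of loc

iat: a fast scalar accessor, the position-based counterpart of iloc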
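To make the difference between the four current indexers concrete, here is a minimal sketch. The frame and its string row labels ("r1" to "r3") are made up for illustration, so that labels and positions cannot be confused.

```python
import pandas as pd

# Hypothetical frame with string row labels so labels and positions differ.
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": ["x", "y", "z"]},
    index=["r1", "r2", "r3"],
)

by_label = df.loc["r2", "a"]      # label-based: row "r2", column "a"
by_position = df.iloc[1, 0]       # position-based: second row, first column
scalar_label = df.at["r2", "a"]   # scalar-only access by label
scalar_position = df.iat[1, 0]    # scalar-only access by position

print(by_label, by_position, scalar_label, scalar_position)
```

All four expressions read the same cell. at and iat only accept a single row and a single column, which is why they can be faster for scalar access.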

Build a test dataset:

									import numpy as np  # used by the Series examples further down

									import pandas as pd

									df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c'],'c': ["A","B","C"]})

									print(df)

									 a b c

									0 1 a A

									1 2 b B

									2 3 c C

Row operations

Selecting a single row

									print(df.loc[1,:])

									a 2

									b b

									c B

									Name: 1, dtype: object

Selecting multiple rows

									print(df.loc[1:2,:])  # select rows 1 through 2 with a step of 1 (loc includes the end label)

									 a b c

									1 2 b B

									2 3 c C

									print(df.loc[::-1,:])  # select all rows with a step of -1, so the order is reversed

									 a b c

									2 3 c C

									1 2 b B

									0 1 a A

									print(df.loc[0:2:2,:])  # select rows 0 through 2 with a step of 2; equivalent to df.loc[::2,:] here because there are only 3 rows

									 a b c

									0 1 a A

									2 3 c C
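One detail worth checking for yourself is that loc slices include the end label while iloc slices exclude the end position. A small sketch using the same test data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"], "c": ["A", "B", "C"]})

label_slice = df.loc[1:2, :]      # labels 1 and 2, end label included
position_slice = df.iloc[1:2, :]  # position 1 only, end position excluded

print(len(label_slice), len(position_slice))
```

With the default integer index the two look alike, so this is an easy place to end up with an off-by-one result.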

Conditional filtering

Basic conditional filtering

									print(df.loc[:,"a"]>2)  # the comparison first yields a boolean Series, which is then used to filter

									0 False

									1 False

									2  True

									Name: a, dtype: bool

									print(df.loc[df.loc[:,"a"]>2,:])

									 a b c

									2 3 c C

Conditions can also be combined with the logical operators | (or), & (and), and ~ (not); wrap each comparison in parentheses.

									In [129]: s = pd.Series(range(-3, 4))

									In [132]: s[(s < -1) | (s > 0.5)]

									Out[132]: 

									0 -3

									1 -2

									4 1

									5 2

									6 3

									dtype: int64
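The same operators work on DataFrame columns. The sketch below applies them to the test data from the top of the article; the particular conditions are only examples.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"], "c": ["A", "B", "C"]})

# Each comparison needs its own parentheses because & and | bind
# more tightly than comparison operators in Python.
both = df.loc[(df["a"] > 1) & (df["b"] != "c"), :]    # and
either = df.loc[(df["a"] < 2) | (df["c"] == "C"), :]  # or
negated = df.loc[~(df["a"] > 1), :]                   # not

print(both)
print(either)
print(negated)
```

Python's own and, or, and not keywords do not work element-wise here, which is why the symbol operators are used.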

isin

Using isin on values (a non-index column)

									In [141]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')

									In [143]: s.isin([2, 4, 6])

									Out[143]: 

									4 False

									3 False

									2  True

									1 False

									0  True

									dtype: bool

									In [144]: s[s.isin([2, 4, 6])]

									Out[144]: 

									2 2

									0 4

									dtype: int64

Using isin on the index

									In [145]: s[s.index.isin([2, 4, 6])]

									Out[145]: 

									4 0

									2 2

									dtype: int64

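									# compare it to the following; s[[2, 4, 6]] raises a KeyError in pandas 1.0 and later because label 6 is missing, so reindex is used instead

									In [146]: s.reindex([2, 4, 6])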

									Out[146]: 

									2 2.0

									4 0.0

									6 NaN

									dtype: float64

Combining isin with any()/all() to match on several columns

									In [151]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],

									 .....:     'ids2': ['a', 'n', 'c', 'n']})

									 .....: 

									In [156]: values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}

									In [157]: row_mask = df.isin(values).all(axis=1)

									In [158]: df[row_mask]

									Out[158]: 

									   vals ids ids2

									0     1   a    a
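To see what all() and any() each contribute, the following sketch reuses the same data and keeps the intermediate boolean frame in a variable (the variable names are mine):

```python
import pandas as pd

df = pd.DataFrame({"vals": [1, 2, 3, 4],
                   "ids": ["a", "b", "f", "n"],
                   "ids2": ["a", "n", "c", "n"]})
values = {"ids": ["a", "b"], "ids2": ["a", "c"], "vals": [1, 3]}

matches = df.isin(values)            # boolean frame, one flag per cell
all_match = df[matches.all(axis=1)]  # rows where every column matched
any_match = df[matches.any(axis=1)]  # rows where at least one column matched

print(matches)
print(all_match)
print(any_match)
```

Only row 0 matches in every column, while rows 0, 1, and 2 each match in at least one.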

where()


									In [1]: dates = pd.date_range('1/1/2000', periods=8)

									In [2]: df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

									In [3]: df

									Out[3]: 

									     A   B   C   D

									2000-01-01 0.469112 -0.282863 -1.509059 -1.135632

									2000-01-02 1.212112 -0.173215 0.119209 -1.044236

									2000-01-03 -0.861849 -2.104569 -0.494929 1.071804

									2000-01-04 0.721555 -0.706771 -1.039575 0.271860

									2000-01-05 -0.424972 0.567020 0.276232 -1.087401

									2000-01-06 -0.673690 0.113648 -1.478427 0.524988

									2000-01-07 0.404705 0.577046 -1.715002 -1.039268

									2000-01-08 -0.370647 -1.157892 -1.344312 0.844885

									In [162]: df.where(df < 0, -df)  # keep the negative values, negate the rest

									Out[162]: 

									                   A         B         C         D

									2000-01-01 -0.469112 -0.282863 -1.509059 -1.135632

									2000-01-02 -1.212112 -0.173215 -0.119209 -1.044236

									2000-01-03 -0.861849 -2.104569 -0.494929 -1.071804

									2000-01-04 -0.721555 -0.706771 -1.039575 -0.271860

									2000-01-05 -0.424972 -0.567020 -0.276232 -1.087401

									2000-01-06 -0.673690 -0.113648 -1.478427 -0.524988

									2000-01-07 -0.404705 -0.577046 -1.715002 -1.039268

									2000-01-08 -0.370647 -1.157892 -1.344312 -0.844885

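How DataFrame.where() differs from numpy.where()

df.where(cond, other) keeps the values where cond is True and takes other elsewhere, whereas np.where(cond, x, y) takes x where cond is True and y elsewhere. The two calls below therefore produce the same values: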

									In [172]: df.where(df < 0, -df) == np.where(df < 0, df, -df)
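The comparison is easier to follow on a small fixed frame. The values below are made up for illustration, since the random data above changes on every run.

```python
import numpy as np
import pandas as pd

# Small fixed frame (hypothetical values) so the result is reproducible.
df = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

via_pandas = df.where(df < 0, -df)     # keep where True, otherwise take -df
via_numpy = np.where(df < 0, df, -df)  # pick df where True, otherwise -df

same = bool((via_pandas.to_numpy() == via_numpy).all())
print(type(via_pandas).__name__, type(via_numpy).__name__, same)
```

Besides the argument order, the return types differ: the pandas method returns a DataFrame with its index and columns intact, while numpy returns a bare array.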

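Calling where() on a Series returns a Series with the same shape as the original, with NaN wherever the condition is False; plain boolean indexing drops those entries instead.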

									In [141]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')

									In [159]: s[s > 0]

									Out[159]: 

									3 1

									2 2

									1 3

									0 4

									dtype: int64

									In [160]: s.where(s > 0)

									Out[160]: 

									4 NaN

									3 1.0

									2 2.0

									1 3.0

									0 4.0

									dtype: float64
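If NaN is not wanted, where() also accepts a replacement value, and mask() is its inverse. A short sketch on the same Series, with -1 chosen arbitrarily as the replacement:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype="int64")

filtered = s[s > 0]          # drops the entry that fails the test
same_shape = s.where(s > 0)  # keeps every entry, NaN where the test fails
filled = s.where(s > 0, -1)  # same shape, -1 instead of NaN
inverse = s.mask(s > 0, -1)  # mask() replaces where the test is True

print(filled)
print(inverse)
```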

Sampling

									DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)

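When sampling with weights, missing weights are treated as 0, and if the weights do not sum to 1 each one is divided by their total. random_state sets the seed for the random draw. Setting axis=1 samples columns instead of rows.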

									In [105]: df2 = pd.DataFrame({'col1':[9,8,7,6], 'weight_column':[0.5, 0.4, 0.1, 0]})

									In [106]: df2.sample(n = 3, weights = 'weight_column')

									Out[106]: 

									 col1 weight_column

									1  8   0.4

									0  9   0.5

									2  7   0.1
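The other parameters can be tried out on the same df2. This is only a sketch; the seed value 0 is arbitrary.

```python
import pandas as pd

df2 = pd.DataFrame({"col1": [9, 8, 7, 6],
                    "weight_column": [0.5, 0.4, 0.1, 0]})

# random_state fixes the seed, so two identical calls give the same draw.
first = df2.sample(n=2, random_state=0)
second = df2.sample(n=2, random_state=0)

half = df2.sample(frac=0.5, random_state=0)  # half of the rows
weighted = df2.sample(n=3, weights="weight_column", random_state=0)
one_column = df2.sample(n=1, axis=1, random_state=0)  # sample a column

print(first.equals(second))
print(sorted(weighted.index))
```

Because the last row has weight 0, it can never appear in the weighted sample, so drawing three rows without replacement always returns rows 0, 1, and 2 in some order.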

Adding a row

									df.loc[3,:]=4

									  a b c

									0 1.0 a A

									1 2.0 b B

									2 3.0 c C

									3 4.0 4 4

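Inserting a row

pandas has no built-in method for inserting a row at a given position, so you have to assemble the result yourself:

									line = pd.DataFrame({df.columns[0]:"--",df.columns[1]:"--",df.columns[2]:"--"},index=[1])

									# df.loc[:0] cannot be written as df.loc[0] here, because df.loc[0] returns a Series rather than a DataFrame

									df = pd.concat([df.loc[:0],line,df.loc[1:]]).reset_index(drop=True)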

									  a b c

									0 1.0 a A

									1 -- -- --

									2 2.0 b B

									3 3.0 c C

									4 4.0 4 4
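The same idea can be wrapped in a small function. insert_row below is a hypothetical helper of my own, not a pandas API, and it splits by position with iloc rather than by label, so it does not depend on the index values.

```python
import pandas as pd


def insert_row(frame, position, row):
    """Return a new frame with `row` (a dict) inserted at integer `position`.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    new_line = pd.DataFrame([row], columns=frame.columns)
    top = frame.iloc[:position]      # rows before the insertion point
    bottom = frame.iloc[position:]   # rows from the insertion point onward
    return pd.concat([top, new_line, bottom]).reset_index(drop=True)


df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"], "c": ["A", "B", "C"]})
result = insert_row(df, 1, {"a": 0, "b": "--", "c": "--"})
print(result)
```

The function returns a new DataFrame and leaves the original untouched, which makes it easier to reason about than assigning back to df in place.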

Swapping rows

									df.loc[[1,2],:]=df.loc[[2,1],:].values

									 a b c

									0 1 a A

									1 3 c C

									2 2 b B

Deleting a row

									df.drop(0,axis=0,inplace=True)

									print(df)

									 a b c

									1 2 b B

									2 3 c C

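Note

When a DataFrame is indexed by datetimes, loc expects labels that can be converted to timestamps, such as date strings, rather than integers; an integer slice such as dfl.loc[2:3] raises a TypeError. Use iloc for positional access.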

									In [39]: dfl = pd.DataFrame(np.random.randn(5,4), columns=list('ABCD'), index=pd.date_range('20130101',periods=5))

									In [40]: dfl

									Out[40]: 

									     A   B   C   D

									2013-01-01 1.075770 -0.109050 1.643563 -1.469388

									2013-01-02 0.357021 -0.674600 -1.776904 -0.968914

									2013-01-03 -1.294524 0.413738 0.276662 -0.472035

									2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061

									2013-01-05 0.895717 0.805244 -1.206412 2.565646

									In [41]: dfl.loc['20130102':'20130104']

									Out[41]: 

									     A   B   C   D

									2013-01-02 0.357021 -0.674600 -1.776904 -0.968914

									2013-01-03 -1.294524 0.413738 0.276662 -0.472035

									2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
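A quick way to compare label-based and positional access on a datetime index is sketched below. It swaps the random values for np.arange so the result is reproducible; that substitution is mine, not the original author's.

```python
import numpy as np
import pandas as pd

dfl = pd.DataFrame(np.arange(20).reshape(5, 4),
                   columns=list("ABCD"),
                   index=pd.date_range("20130101", periods=5))

by_label = dfl.loc["20130102":"20130104"]  # date strings, both ends included
by_position = dfl.iloc[1:4]                # positions 1 to 3, end excluded

try:
    dfl.loc[2:3]                           # integers are not valid labels here
    integer_slice_failed = False
except TypeError:
    integer_slice_failed = True

print(by_label.equals(by_position), integer_slice_failed)
```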

Column operations

Selecting a single column

									print(df.loc[:,"a"])

									0 1

									1 2

									2 3

									Name: a, dtype: int64

Selecting multiple columns

									print(df.loc[:,"a":"b"])

									 a b

									0 1 a

									1 2 b

									2 3 c

Adding a column (if the column already exists, this assigns to it)

									df.loc[:,"d"]=4

									 a b c d

									0 1 a A 4

									1 2 b B 4

									2 3 c C 4

Swapping the values of two columns

									df.loc[:,['b', 'a']] = df.loc[:,['a', 'b']].values

									print(df)

									 a b c

									0 a 1 A

									1 b 2 B

									2 c 3 C

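Deleting columns

1) Use del directly: del DF['column_name']

2) Use the drop method, in any of the following forms:

DF = DF.drop('column_name', axis=1)

DF.drop('column_name', axis=1, inplace=True)

DF.drop(DF.columns[[0, 1]], axis=1, inplace=True)  # drops the first two columns by position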

									df.drop("a",axis=1,inplace=True)

									print(df)

									 b c

									0 a A

									1 b B

									2 c C
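drop also accepts a columns= keyword, which avoids having to remember the axis number. A minimal sketch on the test data, where "zzz" is a deliberately missing column name:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"], "c": ["A", "B", "C"]})

without_a = df.drop(columns="a")                          # same as drop("a", axis=1)
without_first_two = df.drop(columns=df.columns[[0, 1]])   # drop by position
tolerant = df.drop(columns=["a", "zzz"], errors="ignore") # skip missing labels

print(list(without_a.columns))
print(list(without_first_two.columns))
print(list(tolerant.columns))
```

None of these calls use inplace=True, so df itself keeps all three columns and each result is a new DataFrame.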

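Some other useful features:

Slicing: df.loc[::,::]

Random sampling: df.sample()

De-duplication: .duplicated()

Lookup: .lookup (deprecated in pandas 1.2 and removed in 2.0)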
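De-duplication is only mentioned in passing above, so here is a short sketch of duplicated() together with its companion drop_duplicates(). The data is made up for illustration.

```python
import pandas as pd

# Hypothetical data with one repeated row.
df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

flags = df.duplicated()         # True for repeats of an earlier row
deduped = df.drop_duplicates()  # keep the first occurrence of each row
by_column = df.drop_duplicates(subset="a", keep="last")  # compare column a only

print(flags.tolist())
print(deduped)
print(by_column)
```

subset restricts the comparison to the named columns, and keep decides whether the first or the last occurrence survives.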


Original article: https://blog.csdn.net/claroja/article/details/65661826
