Pandas DataFrame basics: adding, deleting, modifying, selecting, deduplicating, and sampling

2023-09-14 06:00:04 163

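Overview

pandas has three main indexers (loc, iloc, and the legacy ix), plus two scalar shortcuts:

loc: label-based indexing, using the names of rows and columns

iloc: integer-based indexing (absolute position), i.e. the n-th row and m-th column, counting from 0

ix: a hybrid of loc and iloc (deprecated in pandas 0.20 and removed in 1.0; use loc or iloc instead)

at: a shortcut for loc when accessing a single value

iat: a shortcut for iloc when accessing a single value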
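To make the difference between label-based and position-based access concrete, here is a minimal sketch. The frame `demo` and its row labels `r0`..`r2` are made up for illustration; they are not part of the original test data.

```python
import pandas as pd

# Hypothetical frame with string row labels, so labels and positions differ.
demo = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]},
                    index=["r0", "r1", "r2"])

by_label = demo.loc["r1", "b"]     # label-based lookup
by_position = demo.iloc[1, 1]      # position-based lookup, counting from 0
fast_label = demo.at["r1", "b"]    # scalar-only, label-based
fast_position = demo.iat[1, 1]     # scalar-only, position-based

print(by_label, by_position, fast_label, fast_position)
```

All four expressions address the same cell; at and iat only accept a single row and a single column, which is why they are suited to scalar access.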

Build a test dataset:

import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c'], 'c': ["A", "B", "C"]})
print(df)
   a  b  c
0  1  a  A
1  2  b  B
2  3  c  C

Row operations

Select a single row

print(df.loc[1, :])
a    2
b    b
c    B
Name: 1, dtype: object

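Select multiple rows

print(df.loc[1:2, :])  # select rows labeled 1 through 2 (end label included); step is 1
   a  b  c
1  2  b  B
2  3  c  C
print(df.loc[::-1, :])  # select all rows with step -1, i.e. in reverse order
   a  b  c
2  3  c  C
1  2  b  B
0  1  a  A
print(df.loc[0:2:2, :])  # select rows 0 through 2 with step 2; same as df.loc[::2, :] here because there are only 3 rows
   a  b  c
0  1  a  A
2  3  c  C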
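One detail of the slices above is easy to miss: loc slices include the end label, while iloc slices follow the usual Python convention and exclude the end position. A short sketch on a copy of the test data (the name `df_demo` is only used here):

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2, 3],
                        "b": ["a", "b", "c"],
                        "c": ["A", "B", "C"]})

label_slice = df_demo.loc[1:2, :]      # labels 1 and 2, end label included
position_slice = df_demo.iloc[1:2, :]  # position 1 only, end position excluded
reversed_rows = df_demo.loc[::-1, :]   # all rows, reverse order

print(len(label_slice), len(position_slice))
print(list(reversed_rows.index))
```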

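Conditional filtering

Basic conditional filtering

print(df.loc[:, "a"] > 2)  # the comparison first produces a boolean Series, which is then used to filter
0    False
1    False
2     True
Name: a, dtype: bool
print(df.loc[df.loc[:, "a"] > 2, :])
   a  b  c
2  3  c  C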

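Conditions can also be combined with the element-wise logical operators: | for or, & for and, and ~ for not. Each condition must be wrapped in parentheses.

In [129]: s = pd.Series(range(-3, 4))
In [132]: s[(s < -1) | (s > 0.5)]
Out[132]:
0   -3
1   -2
4    1
5    2
6    3
dtype: int64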
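The same operators work when filtering the rows of a DataFrame. The following sketch rebuilds the test data under the name `df_demo` (an illustrative name) and applies each operator once:

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2, 3],
                        "b": ["a", "b", "c"],
                        "c": ["A", "B", "C"]})

# and: both conditions must hold for a row to be kept
both = df_demo.loc[(df_demo["a"] > 1) & (df_demo["b"] != "c"), :]
# or: at least one condition must hold
either = df_demo.loc[(df_demo["a"] < 2) | (df_demo["c"] == "C"), :]
# not: invert a boolean mask
negated = df_demo.loc[~(df_demo["a"] > 1), :]

print(both)
print(either)
print(negated)
```

The parentheses matter because &, | and ~ bind more tightly than comparison operators in Python.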

isin

Using isin on the values (not the index)

In [141]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
In [143]: s.isin([2, 4, 6])
Out[143]:
4    False
3    False
2     True
1    False
0     True
dtype: bool
In [144]: s[s.isin([2, 4, 6])]
Out[144]:
2    2
0    4
dtype: int64

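Using isin on the index

In [145]: s[s.index.isin([2, 4, 6])]
Out[145]:
4    0
2    2
dtype: int64
# Compare it to the following. Selecting a list that contains a missing label,
# s[[2, 4, 6]], returned NaN for label 6 in old pandas but raises KeyError
# since pandas 1.0; reindex produces the NaN-filled result explicitly.
In [146]: s.reindex([2, 4, 6])
Out[146]:
2    2.0
4    0.0
6    NaN
dtype: float64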
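The two ways of handling labels that may be missing can be put side by side. This is a small sketch on the same Series; the variable names `wanted`, `present`, and `aligned` are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype="int64")
wanted = [2, 4, 6]

# Keeps only the labels that exist, in the Series' own order.
present = s[s.index.isin(wanted)]
# Keeps the requested order and fills missing labels with NaN.
aligned = s.reindex(wanted)

print(present)
print(aligned)
```

Filtering with index.isin preserves the integer dtype because no NaN is introduced, while reindex has to upcast to float to represent the missing label.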

Combining isin with any()/all() to match on several columns

In [151]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
   .....:                    'ids2': ['a', 'n', 'c', 'n']})
   .....:
In [156]: values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}
In [157]: row_mask = df.isin(values).all(axis=1)  # True only where every column matches
In [158]: df[row_mask]
Out[158]:
   vals ids ids2
0     1   a    a

where()

In [1]: dates = pd.date_range('1/1/2000', periods=8)
In [2]: df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
In [3]: df
Out[3]:
                   A         B         C         D
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
2000-01-04  0.721555 -0.706771 -1.039575  0.271860
2000-01-05 -0.424972  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427  0.524988
2000-01-07  0.404705  0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885
In [162]: df.where(df < 0, -df)  # keep negative values, negate the rest
Out[162]:
                   A         B         C         D
2000-01-01 -0.469112 -0.282863 -1.509059 -1.135632
2000-01-02 -1.212112 -0.173215 -0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929 -1.071804
2000-01-04 -0.721555 -0.706771 -1.039575 -0.271860
2000-01-05 -0.424972 -0.567020 -0.276232 -1.087401
2000-01-06 -0.673690 -0.113648 -1.478427 -0.524988
2000-01-07 -0.404705 -0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312 -0.844885

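How DataFrame.where() differs from numpy.where()

df.where(cond, other) keeps the original value where cond is True and takes the value from other elsewhere, so df1.where(m, df2) is roughly equivalent to np.where(m, df1, df2). The comparison below is therefore True for every element:

In [172]: df.where(df < 0, -df) == np.where(df < 0, df, -df)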
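The argument order is the part that tends to cause confusion, so here is a minimal runnable check. The frame `frame` and the seed are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
frame = pd.DataFrame(rng.standard_normal((4, 3)), columns=list("ABC"))

# DataFrame.where(cond, other): the frame itself is the "value if True".
kept = frame.where(frame < 0, -frame)
# numpy.where(cond, x, y): both branches are passed explicitly.
raw = np.where(frame < 0, frame, -frame)

print(type(kept).__name__, type(raw).__name__)
print(np.array_equal(kept.to_numpy(), raw))
```

Besides the argument order, the result types differ: the pandas method returns a DataFrame with the index and column labels intact, whereas numpy.where returns a plain array.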

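When where() is called on a Series, it returns a Series of the same shape as the original, with NaN where the condition is False. Boolean indexing, by contrast, returns only the matching elements:

In [141]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
In [159]: s[s > 0]
Out[159]:
3    1
2    2
1    3
0    4
dtype: int64
In [160]: s.where(s > 0)
Out[160]:
4    NaN
3    1.0
2    2.0
1    3.0
0    4.0
dtype: float64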
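Two related calls are worth knowing when the NaN fill is not what you want: passing a replacement value as the second argument of where(), and mask(), which is the inverse of where(). A brief sketch on the same Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype="int64")

filtered = s[s > 0]          # drops non-matching elements
same_shape = s.where(s > 0)  # keeps the shape, NaN where the condition is False
filled = s.where(s > 0, -1)  # keeps the shape, -1 where the condition is False
masked = s.mask(s > 0)       # inverse: NaN where the condition is True

print(len(filtered), len(same_shape))
print(filled.tolist())
print(masked.isna().tolist())
```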

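Sampling

DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)

When sampling with weights, missing weights are treated as 0, and if the weights do not sum to 1 each weight is divided by the total. random_state sets the random seed so that the sample is reproducible. axis=1 samples columns instead of rows.

In [105]: df2 = pd.DataFrame({'col1': [9, 8, 7, 6], 'weight_column': [0.5, 0.4, 0.1, 0]})
In [106]: df2.sample(n=3, weights='weight_column')  # row order varies between runs
Out[106]:
   col1  weight_column
1     8            0.4
0     9            0.5
2     7            0.1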
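The remaining parameters can be tried out in the same way. The sketch below uses the frame from the example above; the seed values are arbitrary.

```python
import pandas as pd

df2 = pd.DataFrame({"col1": [9, 8, 7, 6],
                    "weight_column": [0.5, 0.4, 0.1, 0]})

# Same seed, same sample: random_state makes the draw reproducible.
first = df2.sample(n=3, weights="weight_column", random_state=42)
second = df2.sample(n=3, weights="weight_column", random_state=42)

# frac draws a fraction of the rows instead of a fixed count.
half = df2.sample(frac=0.5, random_state=0)

# axis=1 samples columns rather than rows.
one_column = df2.sample(n=1, axis=1, random_state=0)

print(first.equals(second))
print(sorted(first["col1"]))
print(half.shape, one_column.shape)
```

Because the row with col1 equal to 6 has weight 0, it can never be drawn, so a weighted sample of three rows without replacement always consists of the other three rows.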

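Add a row

Using the test DataFrame built at the start, assigning to a label that does not exist yet appends a row. In the pandas version used here, column a is upcast to float as a side effect:

df.loc[3, :] = 4
print(df)
     a  b  c
0  1.0  a  A
1  2.0  b  B
2  3.0  c  C
3  4.0  4  4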

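Insert a row

pandas has no built-in method for inserting a row at a given position, so you have to assemble the result yourself:

line = pd.DataFrame({df.columns[0]: "--", df.columns[1]: "--", df.columns[2]: "--"}, index=[1])
# df.loc[:0] cannot be written as df.loc[0] here, because df.loc[0] returns a Series
df = pd.concat([df.loc[:0], line, df.loc[1:]]).reset_index(drop=True)
print(df)
     a   b   c
0  1.0   a   A
1   --  --  --
2  2.0   b   B
3  3.0   c   C
4  4.0   4   4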
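If rows need to be inserted in more than one place, the split-and-concatenate pattern can be wrapped in a small helper. `insert_row` below is a hypothetical function written for this illustration, not a pandas API; it splits by integer position with iloc so that it does not depend on the index labels.

```python
import pandas as pd


def insert_row(frame, position, row):
    """Return a new frame with `row` (a dict) inserted at integer `position`.

    Hypothetical helper: the input frame is left unchanged and the result
    gets a fresh 0..n-1 index.
    """
    new_line = pd.DataFrame([row], columns=frame.columns)
    parts = [frame.iloc[:position], new_line, frame.iloc[position:]]
    return pd.concat(parts, ignore_index=True)


base = pd.DataFrame({"a": [1, 2, 3],
                     "b": ["a", "b", "c"],
                     "c": ["A", "B", "C"]})
result = insert_row(base, 1, {"a": 0, "b": "-", "c": "-"})
print(result)
```

Passing ignore_index=True to concat does the same job as the reset_index(drop=True) call in the text.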

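Swap rows

Starting again from the original test DataFrame. The .values call is needed so that the assignment goes by position instead of being re-aligned on the row labels:

df.loc[[1, 2], :] = df.loc[[2, 1], :].values
print(df)
   a  b  c
0  1  a  A
1  3  c  C
2  2  b  B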

Delete a row

Again starting from the original test DataFrame:

df.drop(0, axis=0, inplace=True)
print(df)
   a  b  c
1  2  b  B
2  3  c  C

Note

In a DataFrame with a DatetimeIndex, loc still slices by label rather than by integer position: strings that can be parsed as dates are converted to timestamps, and both endpoints of the slice are included.

In [39]: dfl = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'), index=pd.date_range('20130101', periods=5))
In [40]: dfl
Out[40]:
                   A         B         C         D
2013-01-01  1.075770 -0.109050  1.643563 -1.469388
2013-01-02  0.357021 -0.674600 -1.776904 -0.968914
2013-01-03 -1.294524  0.413738  0.276662 -0.472035
2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
2013-01-05  0.895717  0.805244 -1.206412  2.565646
In [41]: dfl.loc['20130102':'20130104']
Out[41]:
                   A         B         C         D
2013-01-02  0.357021 -0.674600 -1.776904 -0.968914
2013-01-03 -1.294524  0.413738  0.276662 -0.472035
2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
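For comparison, the same three rows can be selected by position with iloc. The sketch below uses deterministic values (np.arange) instead of random ones so that the result is easy to check; that substitution is a choice made for this illustration.

```python
import numpy as np
import pandas as pd

dfl = pd.DataFrame(np.arange(20).reshape(5, 4),
                   columns=list("ABCD"),
                   index=pd.date_range("20130101", periods=5))

by_label = dfl.loc["20130102":"20130104"]  # date strings, both ends included
by_position = dfl.iloc[1:4]                # integer positions, end excluded

print(by_label.equals(by_position))
print(len(by_label))
```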

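Column operations

The examples in this section start from the original test DataFrame.

Select a single column

print(df.loc[:, "a"])
0    1
1    2
2    3
Name: a, dtype: int64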

Select multiple columns

print(df.loc[:, "a":"b"])  # label slice, end column included
   a  b
0  1  a
1  2  b
2  3  c

Add a column (if the column already exists, this assigns new values to it)

df.loc[:, "d"] = 4
print(df)
   a  b  c  d
0  1  a  A  4
1  2  b  B  4
2  3  c  C  4

Swap the values of two columns

df.loc[:, ['b', 'a']] = df.loc[:, ['a', 'b']].values  # .values prevents re-alignment on column labels
print(df)
   a  b  c
0  a  1  A
1  b  2  B
2  c  3  C

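Delete a column

1) Use del directly: del DF['column_name']

2) Use the drop method. The first two forms below are equivalent; the third selects the columns to drop by position:

DF = DF.drop('column_name', axis=1)  # axis must be passed by keyword in pandas 2.0 and later

DF.drop('column_name', axis=1, inplace=True)

DF.drop(DF.columns[[0, 1]], axis=1, inplace=True)  # drops the first two columns

df.drop("a", axis=1, inplace=True)
print(df)
   b  c
0  a  A
1  b  B
2  c  C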
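The columns= keyword of drop is an alternative to axis=1 that some readers find easier to read. The following sketch shows it on a copy of the test data, without modifying the frame in place; the variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3],
                      "b": ["a", "b", "c"],
                      "c": ["A", "B", "C"]})

without_a = frame.drop(columns="a")                            # by name
without_first_two = frame.drop(columns=frame.columns[[0, 1]])  # by position
# errors="ignore" skips names that are not present instead of raising.
tolerant = frame.drop(columns=["a", "missing"], errors="ignore")

print(list(without_a.columns))
print(list(without_first_two.columns))
print(list(frame.columns))  # the original frame is unchanged
```

Returning a new frame instead of using inplace=True keeps the original data available, which makes it easier to re-run a step while experimenting.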

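Some other useful features:

Slicing: df.loc[::, ::]

Random sampling: df.sample()

Deduplication: .duplicated()

Lookup: .lookup (deprecated in pandas 1.2 and removed in 2.0)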
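Deduplication is only named in the list above, so here is a minimal sketch of duplicated() together with its companion drop_duplicates(). The frame `dup` is made-up example data.

```python
import pandas as pd

# Hypothetical data: row 1 repeats row 0 exactly.
dup = pd.DataFrame({"a": [1, 1, 2, 2], "b": ["x", "x", "y", "z"]})

flags = dup.duplicated()             # True for rows that repeat an earlier row
unique_rows = dup.drop_duplicates()  # keeps the first occurrence of each row
# Compare on column "a" only and keep the last occurrence of each value.
unique_a = dup.drop_duplicates(subset="a", keep="last")

print(flags.tolist())
print(list(unique_rows.index))
print(list(unique_a.index))
```

duplicated() returns a boolean Series, so it can be combined with the conditional filtering shown earlier, for example dup[~dup.duplicated()].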
