Pandas Time Series: Resampling and Frequency Conversion
The examples below assume the following imports:
import pandas as pd
import numpy as np
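I. Introduction
Resampling is the process of converting a time series from one frequency to another.
Aggregating higher-frequency (shorter-interval) data to a lower frequency (longer interval) is called downsampling.
Converting lower-frequency data to a higher frequency is called upsampling.
Some resampling is neither downsampling nor upsampling, for example converting W-WED (weekly on Wednesday) to W-FRI (weekly on Friday).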
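The "neither up nor down" case can be illustrated with a small sketch. The series name `weekly_wed` and its values are made up for illustration; the point is that re-anchoring weekly data from Wednesdays to Fridays keeps the number of observations unchanged.

```python
import numpy as np
import pandas as pd

# Three weekly observations stamped on Wednesdays (2000-01-05 is a Wednesday).
weekly_wed = pd.Series(
    np.arange(3),
    index=pd.date_range("2000-01-05", periods=3, freq="W-WED"),
)

# Re-anchor the same weekly data on Fridays. Each Friday-ending week holds
# exactly one Wednesday, so nothing is aggregated and nothing goes missing.
weekly_fri = weekly_wed.resample("W-FRI").sum()

print(weekly_fri)
```

The values are unchanged; only the labels move from each Wednesday to the Friday that closes the same week.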
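II. The resample method: the workhorse for frequency conversion
rng = pd.date_range('1/1/2000', periods=100, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts.resample('M').mean()  # downsample (aggregate) the 100 daily values by month; pandas 2.2 and later spell this alias 'ME'
2000-01-31   -0.156092
2000-02-29    0.060607
2000-03-31   -0.039608
2000-04-30   -0.154838
Freq: M, dtype: float64
ts.resample('M', kind='period').mean()  # same aggregation, labeled by monthly period instead of month-end timestamp
2000-01   -0.156092
2000-02    0.060607
2000-03   -0.039608
2000-04   -0.154838
Freq: M, dtype: float64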
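If the `kind` keyword is not available in your pandas version, a period-labeled monthly mean can be sketched with `groupby` instead. This is an alternative formulation, not the call used above, and it uses deterministic values (`np.arange`) rather than random data so the result is easy to check.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=100, freq="D")
ts = pd.Series(np.arange(100, dtype=float), index=rng)

# Map every daily timestamp to the calendar month that contains it,
# then average within each month. The result is indexed by monthly periods.
monthly = ts.groupby(ts.index.to_period("M")).mean()

print(monthly)
```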
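III. Downsampling (aggregation)
1. By default, downsampling bins are closed on the left and open on the right, and each aggregated value is labeled with its bin's left edge. (Month-end, quarter-end, year-end and weekly frequencies are the exception: they default to right-closed, right-labeled bins, which is why the monthly result above is labeled with month-end dates.)
rng = pd.date_range('1/1/2000', periods=12, freq='T')  # 'T' is the one-minute alias; newer pandas prefers 'min'
ts = pd.Series(np.arange(12), index=rng)
ts
2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int32
ts.resample('5min').sum()
2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int32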
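One way to convince yourself of the left-closed, left-labeled default is to rebuild the bins by hand. The sketch below rounds each timestamp down to its 5-minute boundary with `floor` and groups on that; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12), index=rng)

# Default resample: bins [00:00, 00:05), [00:05, 00:10), [00:10, 00:15),
# each labeled with its left edge.
by_resample = ts.resample("5min").sum()

# Manual equivalent: round each timestamp DOWN to the start of its bin.
by_floor = ts.groupby(ts.index.floor("5min")).sum()

print(by_resample)
print(by_floor)
```

Both results contain the sums 10, 35 and 21 under the labels 00:00, 00:05 and 00:10.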
2. Passing closed='right' makes the bins open on the left and closed on the right
ts.resample('5min', closed='right').sum()
1999-12-31 23:55:00     0
2000-01-01 00:00:00    15
2000-01-01 00:05:00    40
2000-01-01 00:10:00    11
Freq: 5T, dtype: int32
3. Passing label='right' labels each aggregated value with its bin's right edge
ts.resample('5min', closed='right', label='right').sum()
2000-01-01 00:00:00     0
2000-01-01 00:05:00    15
2000-01-01 00:10:00    40
2000-01-01 00:15:00    11
Freq: 5T, dtype: int32
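The right-closed, right-labeled variant has a similar manual counterpart: round each timestamp up to the next 5-minute boundary with `ceil`. This is only a sketch for checking the bin edges, with illustrative variable names.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12), index=rng)

# Bins (23:55, 00:00], (00:00, 00:05], (00:05, 00:10], (00:10, 00:15],
# each labeled with its right edge.
right_right = ts.resample("5min", closed="right", label="right").sum()

# Manual equivalent: round each timestamp UP to the end of its bin.
# A timestamp that already sits on a boundary (00:00, 00:05, ...) stays put,
# which is exactly what "closed on the right" means.
by_ceil = ts.groupby(ts.index.ceil("5min")).sum()

print(right_right)
print(by_ceil)
```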
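4. The loffset parameter shifts the result labels by an exact amount. (loffset was deprecated in pandas 1.1 and removed in pandas 2.0; on those versions, add the offset to the result's index instead.)
ts.resample('5min', closed='right', loffset='-1s').sum()  # shift every label back by one second
1999-12-31 23:54:59     0
1999-12-31 23:59:59    15
2000-01-01 00:04:59    40
2000-01-01 00:09:59    11
Freq: 5T, dtype: int32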
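On pandas versions without `loffset`, the same label shift can be sketched by adjusting the index after aggregating. The name `shifted` is illustrative.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12), index=rng)

# Aggregate first, with right-closed bins and the default left labels ...
shifted = ts.resample("5min", closed="right").sum()

# ... then move every label back by one second.
shifted.index = shifted.index + pd.Timedelta(seconds=-1)

print(shifted)
```

The labels become 23:54:59, 23:59:59, 00:04:59 and 00:09:59, matching the output shown for `loffset='-1s'`.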
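IV. OHLC resampling
OHLC is an aggregation commonly used in finance. For each bin it computes the first value (open), the last value (close), the maximum (high) and the minimum (low), and returns a DataFrame.
print(ts.resample('5min').ohlc())
                     open  high  low  close
2000-01-01 00:00:00     0     4    0      4
2000-01-01 00:05:00     5     9    5      9
2000-01-01 00:10:00    10    11   10     11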
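To make the definition of the four columns concrete, the sketch below rebuilds the OHLC table from the generic aggregations `first`, `max`, `min` and `last`. The `manual` frame and its renamed columns are illustrative; `ohlc()` remains the direct way to get this result.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12), index=rng)

ohlc = ts.resample("5min").ohlc()

# The same four statistics, computed one by one and renamed.
manual = ts.resample("5min").agg(["first", "max", "min", "last"])
manual.columns = ["open", "high", "low", "close"]

print(ohlc)
print(manual)
```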
V. Resampling with groupby
rng = pd.date_range('1/1/2000', periods=100, freq='D')
ts = pd.Series(np.arange(100), index=rng)
ts.groupby(lambda x: x.month).mean()  # equivalent to ts.groupby(rng.month).mean()
1    15
2    45
3    75
4    95
dtype: int32
ts.groupby(lambda x: x.weekday()).mean()  # group by day of the week (0 = Monday), not by calendar week; weekday() is a method on Timestamp
0    47.5
1    48.5
2    49.5
3    50.5
4    51.5
5    49.0
6    50.0
dtype: float64
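One caveat with grouping on the month number is that it merges the same month from different years, whereas a true monthly resample keeps them apart. The following sketch, with made-up data spanning two years, contrasts the two groupings.

```python
import pandas as pd

# Two full years of daily data; every day carries the value 1,
# so each group's sum is simply a count of days.
rng = pd.date_range("2000-01-01", "2001-12-31", freq="D")
ts = pd.Series(1, index=rng)

# Month number only: January 2000 and January 2001 land in the same group.
by_month_number = ts.groupby(ts.index.month).sum()

# Calendar month: each (year, month) pair is its own group.
by_calendar_month = ts.groupby(ts.index.to_period("M")).sum()

print(len(by_month_number), len(by_calendar_month))
```

The first grouping yields 12 rows (January holds 62 days), the second yields 24 rows (January 2000 holds 31 days).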
VI. Upsampling and interpolation
Upsampling goes from a lower frequency to a higher one, which introduces missing values.
When upsampling, you have to decide which position in the resampled result takes the place of each original value.
Once that is decided, the positions in between are added according to the new frequency.
frame = pd.DataFrame(np.random.randn(2, 4),
                     index=pd.date_range('1/1/2000', periods=2, freq='W-WED'),
                     columns=['Colorado', 'Texas', 'New York', 'Ohio'])
print(frame)
            Colorado     Texas  New York      Ohio
2000-01-05 -0.078765  1.389417  0.732726  0.816723
2000-01-12 -0.663686  0.744384  1.395332 -0.031715
1. Upsampling with forward fill
df_daily = frame.resample('D')
print(df_daily.ffill())
            Colorado     Texas  New York      Ohio
2000-01-05 -0.078765  1.389417  0.732726  0.816723
2000-01-06 -0.078765  1.389417  0.732726  0.816723
2000-01-07 -0.078765  1.389417  0.732726  0.816723
2000-01-08 -0.078765  1.389417  0.732726  0.816723
2000-01-09 -0.078765  1.389417  0.732726  0.816723
2000-01-10 -0.078765  1.389417  0.732726  0.816723
2000-01-11 -0.078765  1.389417  0.732726  0.816723
2000-01-12 -0.663686  0.744384  1.395332 -0.031715
print(df_daily.ffill(limit=2))  # fill at most two consecutive missing rows
            Colorado     Texas  New York      Ohio
2000-01-05 -0.078765  1.389417  0.732726  0.816723
2000-01-06 -0.078765  1.389417  0.732726  0.816723
2000-01-07 -0.078765  1.389417  0.732726  0.816723
2000-01-08       NaN       NaN       NaN       NaN
2000-01-09       NaN       NaN       NaN       NaN
2000-01-10       NaN       NaN       NaN       NaN
2000-01-11       NaN       NaN       NaN       NaN
2000-01-12 -0.663686  0.744384  1.395332 -0.031715
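Forward fill is only one way to populate the new rows. As a sketch of the interpolation alternative, the example below first exposes the missing values with `asfreq()` and then fills them linearly. The two-column frame and its round numbers are invented so the result is easy to verify.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"Colorado": [0.0, 7.0], "Texas": [10.0, 3.0]},
    index=pd.date_range("2000-01-05", periods=2, freq="W-WED"),
)

# asfreq() keeps the two original rows and leaves the six new days as NaN.
daily = frame.resample("D").asfreq()

# Linear interpolation draws a straight line between the two known rows.
interpolated = daily.interpolate(method="linear")

print(daily)
print(interpolated)
```

Here Colorado climbs by 1 per day from 0 to 7, and Texas falls by 1 per day from 10 to 3.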
2. The resampled dates need not overlap with the original dates
print(frame)
            Colorado     Texas  New York      Ohio
2000-01-05 -0.078765  1.389417  0.732726  0.816723
2000-01-12 -0.663686  0.744384  1.395332 -0.031715
print(frame.resample('W-THU').ffill())  # the new Thursday dates start out with no data (all NaN); ffill carries the 2000-01-05 and 2000-01-12 rows forward onto them
            Colorado     Texas  New York      Ohio
2000-01-06 -0.078765  1.389417  0.732726  0.816723
2000-01-13 -0.663686  0.744384  1.395332 -0.031715
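The comment above can be made visible with a plain `reindex`, used here as a rough model of what the fill step does rather than as a description of pandas internals. The single-column frame and the explicit `thursdays` index are illustrative.

```python
import pandas as pd

frame = pd.DataFrame(
    {"Colorado": [1.0, 2.0]},
    index=pd.date_range("2000-01-05", periods=2, freq="W-WED"),
)

# The target dates: the Thursdays right after each original Wednesday.
thursdays = pd.date_range("2000-01-06", periods=2, freq="W-THU")

# No Thursday appears in the original index, so a plain reindex is all NaN.
unfilled = frame.reindex(thursdays)

# With forward fill, each Thursday takes the most recent earlier row.
filled = frame.reindex(thursdays, method="ffill")

# The resample call from the text produces the same dates and values.
resampled = frame.resample("W-THU").ffill()

print(unfilled)
print(filled)
print(resampled)
```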
VII. Resampling with periods
1. Downsampling
frame = pd.DataFrame(np.random.randn(24, 4),
                     index=pd.period_range('1-2000', '12-2001', freq='M'),
                     columns=['Colorado', 'Texas', 'New York', 'Ohio'])
print(frame[:5])
         Colorado     Texas  New York      Ohio
2000-01 -1.956495 -0.689508  0.057439 -0.655832
2000-02 -0.491443 -1.731887  1.336801  0.659877
2000-03 -0.139601 -1.310386 -0.299205  1.194269
2000-04  0.431474 -1.312518  1.880223  0.379421
2000-05 -0.674796  0.471018  0.132998  0.509761
annual_frame = frame.resample('A-DEC').mean()  # annual periods ending in December
print(annual_frame)
      Colorado     Texas  New York      Ohio
2000 -0.332076 -0.762599  0.046917  0.224908
2001 -0.152922  0.168667 -0.326439 -0.052034
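For calendar years, the annual means can also be sketched with a `groupby` on the year of each monthly period, which sidesteps the annual frequency alias altogether. The data below is deterministic and invented for illustration; note that the result is indexed by plain integers, not by periods.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(48, dtype=float).reshape(24, 2),
    index=pd.period_range("2000-01", "2001-12", freq="M"),
    columns=["Colorado", "Texas"],
)

# Every monthly period knows which calendar year it belongs to.
annual = frame.groupby(frame.index.year).mean()

print(annual)
```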
2. The convention parameter decides which end of each original period receives the original value after upsampling (the default is convention='start')
# Q-DEC: quarterly, with December as the last month of the last quarter, i.e. Jan-Mar is Q1, Apr-Jun is Q2, Jul-Sep is Q3 and Oct-Dec is Q4
print(annual_frame.resample('Q-DEC').ffill())
        Colorado     Texas  New York      Ohio
2000Q1 -0.332076 -0.762599  0.046917  0.224908
2000Q2 -0.332076 -0.762599  0.046917  0.224908
2000Q3 -0.332076 -0.762599  0.046917  0.224908
2000Q4 -0.332076 -0.762599  0.046917  0.224908
2001Q1 -0.152922  0.168667 -0.326439 -0.052034
2001Q2 -0.152922  0.168667 -0.326439 -0.052034
2001Q3 -0.152922  0.168667 -0.326439 -0.052034
2001Q4 -0.152922  0.168667 -0.326439 -0.052034
# With convention='end', 2000Q4 stands in for 2000 and 2001Q4 for 2001; the quarters between 2000Q4 and 2001Q4 are the rows added by upsampling
print(annual_frame.resample('Q-DEC', convention='end').ffill())
        Colorado     Texas  New York      Ohio
2000Q4 -0.332076 -0.762599  0.046917  0.224908
2001Q1 -0.332076 -0.762599  0.046917  0.224908
2001Q2 -0.332076 -0.762599  0.046917  0.224908
2001Q3 -0.332076 -0.762599  0.046917  0.224908
2001Q4 -0.152922  0.168667 -0.326439 -0.052034
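Where the 'start' and 'end' labels come from can be sketched with `Period.asfreq`. The example represents the year 2000 by its first and last month, an illustrative shortcut that avoids the annual alias, and asks which Q-DEC quarter each month falls in.

```python
import pandas as pd

# The calendar year 2000 runs from 2000-01 through 2000-12.
first_month = pd.Period("2000-01", freq="M")
last_month = pd.Period("2000-12", freq="M")

# convention='start' anchors the year's value at the quarter holding its
# first month; convention='end' at the quarter holding its last month.
start_label = str(first_month.asfreq("Q-DEC"))
end_label = str(last_month.asfreq("Q-DEC"))

print(start_label, end_label)
```

This gives 2000Q1 for the start and 2000Q4 for the end, the two anchor rows seen in the outputs above.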
3. A worked example
Q-MAR: the fiscal year ends in March, so Apr-Jun is Q1, Jul-Sep is Q2, Oct-Dec is Q3 and Jan-Mar is Q4.
2000-01 to 2000-03 is 2000Q4, 2000-04 to 2000-06 is 2001Q1, and so on.
The year 2000 maps to [2000Q4, 2001Q1, 2001Q2, 2001Q3], and 2001 maps to [2001Q4, 2002Q1, 2002Q2, 2002Q3].
With convention='end', 2001Q3 stands in for the original 2000 and 2002Q3 for 2001; the quarters in between are added automatically.
The resulting index is [2001Q3, 2001Q4, 2002Q1, 2002Q2, 2002Q3].
print(annual_frame.resample('Q-MAR', convention='end').ffill())
        Colorado     Texas  New York      Ohio
2001Q3 -0.332076 -0.762599  0.046917  0.224908
2001Q4 -0.332076 -0.762599  0.046917  0.224908
2002Q1 -0.332076 -0.762599  0.046917  0.224908
2002Q2 -0.332076 -0.762599  0.046917  0.224908
2002Q3 -0.152922  0.168667 -0.326439 -0.052034
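The index derived step by step above can be checked with a short sketch. It again stands in for each calendar year with individual months (an illustrative choice), converts them to Q-MAR quarters, and builds the range between the two 'end' anchors.

```python
import pandas as pd

# Under Q-MAR, January 2000 belongs to the last quarter of fiscal year 2000.
jan_2000 = str(pd.Period("2000-01", freq="M").asfreq("Q-MAR"))

# convention='end': each year is anchored at the quarter holding December.
anchor_2000 = pd.Period("2000-12", freq="M").asfreq("Q-MAR")
anchor_2001 = pd.Period("2001-12", freq="M").asfreq("Q-MAR")

# Upsampling fills in every quarter between the two anchors.
index = pd.period_range(start=anchor_2000, end=anchor_2001, freq="Q-MAR")
labels = [str(p) for p in index]

print(jan_2000)
print(labels)
```

The labels come out as 2001Q3, 2001Q4, 2002Q1, 2002Q2 and 2002Q3, matching the index of the printed result.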