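NumPy .npy and pandas DataFrame: a worked example
Saving data in CSV format is a good idea because most programming languages and applications can handle it, which makes exchanging data easy. It is not very space-efficient, however: CSV and other plain-text formats take up a lot of space because every number is written out as a string of characters. Compression formats such as zip, bzip, and gzip reduce file size considerably.
First, import the modules: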
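To see where the extra bytes go, the sketch below compares one floating-point value written as text with the same value stored as a raw 8-byte double. It uses the "%.18e" format, which is the default format of np.savetxt; the sample value itself is an arbitrary choice made for illustration.

```python
import struct

x = 0.4967141530112327  # arbitrary sample value, chosen only for illustration

# "%.18e" is np.savetxt's default format: 18 digits after the decimal point.
text_repr = "%.18e" % x

# The same value as a raw little-endian IEEE 754 double.
binary_repr = struct.pack("<d", x)

text_bytes = len(text_repr.encode("ascii"))
binary_bytes = len(binary_repr)

print("text  :", text_repr, "->", text_bytes, "bytes, plus one delimiter or newline")
print("binary:", binary_bytes, "bytes")
```

Each value costs roughly three times as much in text form, which is in line with the CSV and .npy file sizes measured later in this article.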
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: from tempfile import NamedTemporaryFile
In [4]: from os.path import getsize
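Here we use NamedTemporaryFile from Python's standard library to store the data; these temporary files are deleted automatically once they are closed.
Next, measure the size of the data in CSV format: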
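The automatic cleanup can be checked directly. The following is a minimal sketch, assuming a POSIX-like system and the default delete=True behaviour; the variable names are illustrative only.

```python
import os
from tempfile import NamedTemporaryFile

tmpf = NamedTemporaryFile()      # delete=True is the default
path = tmpf.name                 # the file has a real name on disk

tmpf.write(b"some bytes")
tmpf.flush()                     # push buffered data out to the file
exists_while_open = os.path.exists(path)

tmpf.close()                     # closing the file removes it
exists_after_close = os.path.exists(path)

print("exists while open :", exists_while_open)
print("exists after close:", exists_after_close)
```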
In [5]: np.random.seed(42)
In [6]: a = np.random.randn(365, 4)
In [7]: tmpf = NamedTemporaryFile()
In [8]: np.savetxt(tmpf, a, delimiter=',')
In [9]: print("Size CSV file", getsize(tmpf.name))
Size CSV file 36693
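As a variation on the session above, the sketch below writes the same array to a named path instead of an open file object, so the file is closed and fully written before its size is read. A size taken from a still-open file object may not include data that is sitting in its buffer. It also writes a gzip-compressed copy: np.savetxt compresses automatically when the file name ends in .gz, and np.loadtxt decompresses it again. The temporary directory and the file names are assumptions made for this example.

```python
import os
from tempfile import TemporaryDirectory

import numpy as np

np.random.seed(42)
a = np.random.randn(365, 4)

with TemporaryDirectory() as d:
    csv_path = os.path.join(d, "a.csv")      # hypothetical file names
    gz_path = os.path.join(d, "a.csv.gz")

    np.savetxt(csv_path, a, delimiter=",")   # plain text
    np.savetxt(gz_path, a, delimiter=",")    # .gz suffix -> gzip-compressed text

    csv_size = os.path.getsize(csv_path)
    gz_size = os.path.getsize(gz_path)

    # np.loadtxt reads the compressed file back transparently.
    restored = np.loadtxt(gz_path, delimiter=",")

print("Size CSV file   ", csv_size)
print("Size gzipped CSV", gz_size)
print("Round trip close to original:", np.allclose(a, restored))
```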
Next, save the array in NumPy's .npy format, load it back into memory, and check the shape of the array as well as the size of the .npy file:
In [10]: tmpf = NamedTemporaryFile()
In [11]: np.save(tmpf, a)
In [12]: tmpf.seek(0)
Out[12]: 0
In [13]: loaded = np.load(tmpf)
In [14]: print("Shape", loaded.shape)
Shape (365, 4)
In [15]: print("Size .npy file", getsize(tmpf.name))
Size .npy file 11760
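The .npy size is easy to account for: the array holds 365 × 4 values of 8 bytes each, or 11680 bytes, and the rest of the file is a small header describing dtype and shape. The 11760 bytes reported above correspond to an 80-byte header. The sketch below reproduces that breakdown with a named path; the header length can differ between NumPy versions, so it is computed rather than assumed, and the file name is hypothetical.

```python
import os
from tempfile import TemporaryDirectory

import numpy as np

np.random.seed(42)
a = np.random.randn(365, 4)

with TemporaryDirectory() as d:
    path = os.path.join(d, "a.npy")          # hypothetical file name
    np.save(path, a)
    loaded = np.load(path)
    npy_size = os.path.getsize(path)

# Raw data is 365 * 4 * 8 bytes; whatever remains is the .npy header.
header_overhead = npy_size - a.nbytes

print("Raw array bytes ", a.nbytes)
print("Size .npy file  ", npy_size)
print("Header overhead ", header_overhead)
print("Exact round trip", np.array_equal(a, loaded))
```

Unlike the text format, the binary round trip is exact, so np.array_equal can be used instead of an approximate comparison.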
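The .npy file is only about one third the size of the CSV file. In fact, Python can store arbitrarily complex data structures, and pandas DataFrame or Series objects can likewise be stored in a serialized format.
In Python, pickle is a format used to store Python objects on disk or another storage medium; converting an object into this format is called serialization. Later, the Python object can be reconstructed from storage, and this reverse process is called deserialization. Not every Python object can be pickled, but modules such as dill make it possible to serialize more kinds of Python objects.
First, create a DataFrame from the NumPy array generated earlier, then write it to a pickle file with the to_pickle() method, and finally retrieve the DataFrame from that pickle file with the read_pickle() function: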
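A minimal sketch of serialization and deserialization with the standard pickle module follows. The nested dictionary and the lambda are made-up examples: the dictionary survives a round trip, while the lambda illustrates an object that the standard pickle module refuses to serialize. Note that unpickling executes instructions stored in the data, so it should only be done with data from a trusted source.

```python
import pickle

# A made-up nested structure of built-in types.
record = {"name": "sample", "values": [1, 2.5, (3, 4)], "nested": {"flag": True}}

blob = pickle.dumps(record)      # serialize to a bytes object
restored = pickle.loads(blob)    # deserialize back into a Python object

print("Serialized size:", len(blob), "bytes")
print("Round trip equal:", restored == record)

# Lambdas are pickled by reference to a name, and a lambda has no importable name.
unpicklable = lambda x: x + 1
try:
    pickle.dumps(unpicklable)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError, TypeError) as exc:
    lambda_pickled = False
    print("Could not pickle the lambda:", type(exc).__name__)
```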
In [16]: tmpf.name
Out[16]: '/tmp/tmpyy06safp'
In [17]: df = pd.DataFrame(a)
In [18]: df.to_pickle(tmpf.name)  # writes the DataFrame to /tmp/tmpyy06safp
In [19]: print("Size pickled dataframes", getsize(tmpf.name))
Size pickled dataframes 12250
In [20]: tmpf.name
Out[20]: '/tmp/tmpyy06safp'
In [21]: print("DF from pickle\n", pd.read_pickle(tmpf.name))
DF from pickle
            0         1         2         3
0    0.496714 -0.138264  0.647689  1.523030
1   -0.234153 -0.234137  1.579213  0.767435
2   -0.469474  0.542560 -0.463418 -0.465730
3    0.241962 -1.913280 -1.724918 -0.562288
4   -1.012831  0.314247 -0.908024 -1.412304
5    1.465649 -0.225776  0.067528 -1.424748
6   -0.544383  0.110923 -1.150994  0.375698
7   -0.600639 -0.291694 -0.601707  1.852278
8   -0.013497 -1.057711  0.822545 -1.220844
9    0.208864 -1.959670 -1.328186  0.196861
10   0.738467  0.171368 -0.115648 -0.301104
11  -1.478522 -0.719844 -0.460639  1.057122
12   0.343618 -1.763040  0.324084 -0.385082
13  -0.676922  0.611676  1.031000  0.931280
14  -0.839218 -0.309212  0.331263  0.975545
15  -0.479174 -0.185659 -1.106335 -1.196207
16   0.812526  1.356240 -0.072010  1.003533
17   0.361636 -0.645120  0.361396  1.538037
18  -0.035826  1.564644 -2.619745  0.821903
19   0.087047 -0.299007  0.091761 -1.987569
20  -0.219672  0.357113  1.477894 -0.518270
21  -0.808494 -0.501757  0.915402  0.328751
22  -0.529760  0.513267  0.097078  0.968645
23  -0.702053 -0.327662 -0.392108 -1.463515
24   0.296120  0.261055  0.005113 -0.234587
25  -1.415371 -0.420645 -0.342715 -0.802277
26  -0.161286  0.404051  1.886186  0.174578
27   0.257550 -0.074446 -1.918771 -0.026514
28   0.060230  2.463242 -0.192361  0.301547
29  -0.034712 -1.168678  1.142823  0.751933
..        ...       ...       ...       ...
335  0.160574  0.003046  0.436938  1.190646
336  0.949554 -1.484898 -2.553921  0.934320
337 -1.366879 -0.224765 -1.170113 -1.801980
338  0.541463  0.759155 -0.576510 -2.591042
339 -0.546244  0.391804 -1.478912  0.183360
340 -0.015310  0.579291  0.119580 -0.973069
341  1.196572 -0.158530 -0.027305 -0.933268
342 -0.443282 -0.884803 -0.172946  1.711708
343 -1.371901 -1.613561  1.471170 -0.209324
344 -0.669073  1.039905 -0.605616  1.826010
345  0.677926 -0.487911  2.157308 -0.605715
346  0.742095  0.299293  1.301741  1.561511
347  0.032004 -0.753418  0.459972 -0.677715
348  2.013387  0.136535 -0.365322  0.184680
349 -1.347126 -0.971614  1.200414 -0.656894
350 -1.046911  0.536653  1.185704  0.718953
351  0.996048 -0.756795 -1.421811  1.501334
352 -0.322680 -0.250833  1.328194  0.556230
353  0.455888  2.165002 -0.643518  0.927840
354  0.057013  0.268592  1.528468  0.507836
355  0.538296  1.072507 -0.364953 -0.839210
356 -1.044809 -1.966357  2.056207 -1.103208
357 -0.221254 -0.276813  0.307407  0.815737
358  0.860473 -0.583077 -0.167122  0.282580
359 -0.248691  1.607346  0.490975  0.734878
360  0.662881  1.173474  0.181022 -1.296832
361  0.399688 -0.651357 -0.528617  0.586364
362  1.238283  0.021272  0.308833  1.702215
363  0.240753  2.601683  0.565510 -1.760763
364  0.753342  0.381158  1.289753  0.673181

[365 rows x 4 columns]
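Printing the frame shows that it came back, but a programmatic check is more convincing than reading several hundred rows. The sketch below repeats the round trip with a named path in a temporary directory and compares the result with DataFrame.equals(); the file name df.pkl is an assumption made for this example.

```python
import os
from tempfile import TemporaryDirectory

import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.randn(365, 4))

with TemporaryDirectory() as d:
    path = os.path.join(d, "df.pkl")         # hypothetical file name
    df.to_pickle(path)                       # serialize the DataFrame
    restored = pd.read_pickle(path)          # deserialize it again
    pkl_size = os.path.getsize(path)

# equals() compares shape, dtypes, and every element.
same = df.equals(restored)

print("Size pickled DataFrame", pkl_size)
print("Identical after round trip:", same)
```

The exact pickle size depends on the pandas version and pickle protocol in use, so it may not match the figure shown in the session above.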