NumPy .npy and pandas DataFrame: A Worked Example
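Saving data as CSV is a reasonable default: most programming languages and applications can read the format, so files are easy to exchange. It is not very storage-efficient, though, because CSV and other plain-text formats take up a lot of space. Compressed file formats such as zip, bzip, and gzip achieve much smaller files.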
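To make the compression claim concrete, the following is a minimal sketch that writes the same array once as plain CSV and once as gzip-compressed CSV, then compares the sizes. The file names `data.csv` and `data.csv.gz` are hypothetical; the sketch relies on `np.savetxt` compressing automatically when the file name ends in `.gz`.

```python
import os
import tempfile

import numpy as np

np.random.seed(42)
a = np.random.randn(365, 4)

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file names, chosen only for this illustration.
    csv_path = os.path.join(tmpdir, "data.csv")
    gz_path = os.path.join(tmpdir, "data.csv.gz")

    np.savetxt(csv_path, a, delimiter=",")
    # np.savetxt writes gzip-compressed output when the name ends in .gz.
    np.savetxt(gz_path, a, delimiter=",")

    csv_size = os.path.getsize(csv_path)
    gz_size = os.path.getsize(gz_path)

print("Size CSV file", csv_size)
print("Size gzipped CSV file", gz_size)
```

The exact numbers depend on the data and the NumPy version, so none are quoted here; the compressed file should simply come out smaller.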
First, import the modules:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: from tempfile import NamedTemporaryFile
In [4]: from os.path import getsize
We use NamedTemporaryFile from the Python standard library to hold the data; these temporary files are deleted automatically once they are closed.
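The lifetime of such a file can be checked directly. This is a small sketch, assuming the default `delete=True` behaviour of `NamedTemporaryFile`; the variable names are illustrative only.

```python
import os
from tempfile import NamedTemporaryFile

tmpf = NamedTemporaryFile()
path = tmpf.name

# The file is present on disk for as long as the handle is open.
exists_while_open = os.path.exists(path)

tmpf.close()

# With the default delete=True, closing the handle removes the file.
exists_after_close = os.path.exists(path)

print("Exists while open:", exists_while_open)
print("Exists after close:", exists_after_close)
```

This is why the session in this article keeps `tmpf` open while it measures file sizes: the path stops being valid as soon as the handle is closed.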
Next, measure the size of the data in CSV format:
In [5]: np.random.seed(42)
In [6]: a = np.random.randn(365, 4)
In [7]: tmpf = NamedTemporaryFile()
In [8]: np.savetxt(tmpf, a, delimiter=',')
In [9]: print("Size CSV file", getsize(tmpf.name))
Size CSV file 36693
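One caveat when measuring an open file: the handle is buffered, so `getsize` may report fewer bytes than were written if some data is still sitting in the buffer. A cautious variant, sketched here as an assumption about how one might guard against that, flushes the handle first and cross-checks the result against the write position.

```python
from os.path import getsize
from tempfile import NamedTemporaryFile

import numpy as np

np.random.seed(42)
a = np.random.randn(365, 4)

with NamedTemporaryFile() as tmpf:
    np.savetxt(tmpf, a, delimiter=",")
    # Push any buffered bytes to disk before asking the OS for the size.
    tmpf.flush()
    size_on_disk = getsize(tmpf.name)
    # tell() gives the number of bytes written through this handle.
    bytes_written = tmpf.tell()

print("Size CSV file", size_on_disk)
print("Bytes written", bytes_written)
```

After the flush, both numbers should agree.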
Now save the array in NumPy's .npy format, load it back into memory, and check the shape of the array and the size of the .npy file:
In [10]: tmpf = NamedTemporaryFile()
In [11]: np.save(tmpf, a)
In [12]: tmpf.seek(0)
Out[12]: 0
In [13]: loaded = np.load(tmpf)
In [14]: print("Shape", loaded.shape)
Shape (365, 4)
In [15]: print("Size .npy file", getsize(tmpf.name))
Size .npy file 11760
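The .npy file is only about a third the size of the CSV file. In fact, Python can store arbitrarily complex data structures, and a pandas DataFrame or Series can likewise be stored in a serialized format.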
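The size of the .npy file follows from simple arithmetic: 365 × 4 values at 8 bytes each is 11680 bytes of raw data, plus a short header describing the dtype and shape. In the run above that leaves 11760 − 11680 = 80 bytes of header; the header length may differ between NumPy versions. The sketch below reproduces the calculation with an in-memory buffer (`io.BytesIO` is used here instead of a temporary file purely for convenience) and confirms that the round trip is lossless.

```python
import io

import numpy as np

np.random.seed(42)
a = np.random.randn(365, 4)

buf = io.BytesIO()
np.save(buf, a)

npy_size = len(buf.getvalue())
# a.nbytes is the raw payload: 365 * 4 float64 values at 8 bytes each.
header_size = npy_size - a.nbytes

buf.seek(0)
loaded = np.load(buf)

print("Raw data bytes", a.nbytes)
print("Header bytes", header_size)
print("Identical after round trip", np.array_equal(a, loaded))
```

Because the values are stored in binary, nothing is lost to text formatting on the way to disk and back.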
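In Python, pickle is the format used to store Python objects on disk or another medium; the process of converting an object to this format is called serialization. The object can later be rebuilt from storage, and this reverse process is called deserialization. Not every Python object can be serialized, but modules such as dill make it possible to serialize a wider range of objects.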
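As a minimal illustration of serialization and deserialization, the sketch below round-trips a nested structure through the standard `pickle` module and then shows one kind of object that plain `pickle` refuses: a lambda. The `record` dictionary is made up for the example, and dill is not used here.

```python
import pickle

# Hypothetical nested data, invented for this illustration.
record = {"name": "sample", "values": [1, 2.5, (3, 4)], "nested": {"flag": True}}

blob = pickle.dumps(record)    # serialization: object -> bytes
restored = pickle.loads(blob)  # deserialization: bytes -> new object

print("Equal after round trip:", restored == record)

# Plain pickle stores functions by name, so an anonymous lambda fails.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_pickled = False

print("Lambda could be pickled:", lambda_pickled)
```

The restored object is an equal but separate copy of the original, not a reference to it.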
First create a DataFrame from the NumPy array generated earlier, then write it to a pickle file with the to_pickle() method, and finally retrieve the DataFrame from that file with the read_pickle() function:
In [16]: tmpf.name
Out[16]: '/tmp/tmpyy06safp'
In [17]: df = pd.DataFrame(a)
In [18]: df.to_pickle(tmpf.name)  # writes the DataFrame to /tmp/tmpyy06safp
In [19]: print("Size pickled dataframes", getsize(tmpf.name))
Size pickled dataframes 12250
In [20]: tmpf.name
Out[20]: '/tmp/tmpyy06safp'
In [21]: print("DF from pickle\n", pd.read_pickle(tmpf.name))
DF from pickle
             0         1         2         3
0     0.496714 -0.138264  0.647689  1.523030
1    -0.234153 -0.234137  1.579213  0.767435
2    -0.469474  0.542560 -0.463418 -0.465730
3     0.241962 -1.913280 -1.724918 -0.562288
4    -1.012831  0.314247 -0.908024 -1.412304
5     1.465649 -0.225776  0.067528 -1.424748
6    -0.544383  0.110923 -1.150994  0.375698
7    -0.600639 -0.291694 -0.601707  1.852278
8    -0.013497 -1.057711  0.822545 -1.220844
9     0.208864 -1.959670 -1.328186  0.196861
10    0.738467  0.171368 -0.115648 -0.301104
11   -1.478522 -0.719844 -0.460639  1.057122
12    0.343618 -1.763040  0.324084 -0.385082
13   -0.676922  0.611676  1.031000  0.931280
14   -0.839218 -0.309212  0.331263  0.975545
15   -0.479174 -0.185659 -1.106335 -1.196207
16    0.812526  1.356240 -0.072010  1.003533
17    0.361636 -0.645120  0.361396  1.538037
18   -0.035826  1.564644 -2.619745  0.821903
19    0.087047 -0.299007  0.091761 -1.987569
20   -0.219672  0.357113  1.477894 -0.518270
21   -0.808494 -0.501757  0.915402  0.328751
22   -0.529760  0.513267  0.097078  0.968645
23   -0.702053 -0.327662 -0.392108 -1.463515
24    0.296120  0.261055  0.005113 -0.234587
25   -1.415371 -0.420645 -0.342715 -0.802277
26   -0.161286  0.404051  1.886186  0.174578
27    0.257550 -0.074446 -1.918771 -0.026514
28    0.060230  2.463242 -0.192361  0.301547
29   -0.034712 -1.168678  1.142823  0.751933
..         ...       ...       ...       ...
335   0.160574  0.003046  0.436938  1.190646
336   0.949554 -1.484898 -2.553921  0.934320
337  -1.366879 -0.224765 -1.170113 -1.801980
338   0.541463  0.759155 -0.576510 -2.591042
339  -0.546244  0.391804 -1.478912  0.183360
340  -0.015310  0.579291  0.119580 -0.973069
341   1.196572 -0.158530 -0.027305 -0.933268
342  -0.443282 -0.884803 -0.172946  1.711708
343  -1.371901 -1.613561  1.471170 -0.209324
344  -0.669073  1.039905 -0.605616  1.826010
345   0.677926 -0.487911  2.157308 -0.605715
346   0.742095  0.299293  1.301741  1.561511
347   0.032004 -0.753418  0.459972 -0.677715
348   2.013387  0.136535 -0.365322  0.184680
349  -1.347126 -0.971614  1.200414 -0.656894
350  -1.046911  0.536653  1.185704  0.718953
351   0.996048 -0.756795 -1.421811  1.501334
352  -0.322680 -0.250833  1.328194  0.556230
353   0.455888  2.165002 -0.643518  0.927840
354   0.057013  0.268592  1.528468  0.507836
355   0.538296  1.072507 -0.364953 -0.839210
356  -1.044809 -1.966357  2.056207 -1.103208
357  -0.221254 -0.276813  0.307407  0.815737
358   0.860473 -0.583077 -0.167122  0.282580
359  -0.248691  1.607346  0.490975  0.734878
360   0.662881  1.173474  0.181022 -1.296832
361   0.399688 -0.651357 -0.528617  0.586364
362   1.238283  0.021272  0.308833  1.702215
363   0.240753  2.601683  0.565510 -1.760763
364   0.753342  0.381158  1.289753  0.673181
[365 rows x 4 columns]
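Rather than comparing the printed frames by eye, the round trip can be checked programmatically. The following is a sketch under the assumption that a throwaway directory is acceptable; the file name `frame.pkl` is hypothetical and `DataFrame.equals` is used for the comparison.

```python
import os
import tempfile

import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.randn(365, 4))

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical file name, used only for this illustration.
    path = os.path.join(tmpdir, "frame.pkl")
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares shape, labels, and element values.
same = restored.equals(df)

print("Shape", restored.shape)
print("Round trip preserved the frame:", same)
```

The same pattern applies to a Series, which also provides `to_pickle()`.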