Python methods for counting English word occurrences in a plain text file [tested and working]
This article walks through several Python approaches to counting how many times each English word appears in a plain text file. The details are as follows.
Version 1: inefficient
# -*- coding: utf-8 -*-
#!python3
path = 'test.txt'
with open(path, encoding='utf-8', newline='') as f:
    word = []
    words_dict = {}
    for letter in f.read():
        if letter.isalnum():
            word.append(letter)
        elif letter.isspace():  # whitespace: space, \t, \n
            if word:
                word = ''.join(word).lower()  # convert to lowercase
                if word not in words_dict:
                    words_dict[word] = 1
                else:
                    words_dict[word] += 1
                word = []
    # handle the last word
    if word:
        word = ''.join(word).lower()  # convert to lowercase
        if word not in words_dict:
            words_dict[word] = 1
        else:
            words_dict[word] += 1
        word = []
for k, v in words_dict.items():
    print(k, v)
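The if/else branch that Version 1 uses to update the tally can be collapsed with `dict.get`. A minimal sketch of the same counting logic (the sample string here is made up for illustration):

```python
# Tally words with dict.get instead of an if/else branch:
# get(word, 0) returns the current count, or 0 for a new word.
words_dict = {}
for word in "we are busy we are".split():
    word = word.lower()
    words_dict[word] = words_dict.get(word, 0) + 1

print(words_dict)
```

This is a drop-in replacement for the inner four-line if/else in the loop above.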
Output:
we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1
Version 2:
Drawback: the whole file must be read into memory at once, which performs poorly on large files.
# -*- coding: utf-8 -*-
#!python3
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
    data = f.read()
    word_reg = re.compile(r'\w+')
    # word_reg = re.compile(r'\w+\b')
    word_list = word_reg.findall(data)
    word_list = [word.lower() for word in word_list]  # convert to lowercase
    word_set = set(word_list)  # avoid counting the same word twice
    # words_dict = {}
    # for word in word_set:
    #     words_dict[word] = word_list.count(word)
    # more concise version:
    words_dict = {word: word_list.count(word) for word in word_set}
    for k, v in words_dict.items():
        print(k, v)
Output:
on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1
Version 3:
# -*- coding: utf-8 -*-
#!python3
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
    word_list = []
    word_reg = re.compile(r'\w+')
    for line in f:
        # line_words = word_reg.findall(line)
        # simpler than the regex above:
        line_words = line.split()
        word_list.extend(line_words)
    word_set = set(word_list)  # avoid counting the same word twice
    words_dict = {word: word_list.count(word) for word in word_set}
    for k, v in words_dict.items():
        print(k, v)
Output:
childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
indulge 1
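As the output above shows, `line.split()` keeps punctuation attached to tokens, so "day," and "day" count as separate words, and case is not folded ("We" vs "we"). One way to normalize tokens, not part of the original article, is to strip punctuation with `str.strip` and `string.punctuation` and lowercase the result; a minimal sketch:

```python
import string

def normalize(token):
    # Strip leading/trailing punctuation and fold case,
    # so "day," and "Day" both become "day".
    return token.strip(string.punctuation).lower()

tokens = "We are busy all day, like swarms of flies.".split()
words = [normalize(t) for t in tokens]
print(words)
```

Applying `normalize` to each token before counting would make Versions 3 and 4 agree with the regex-based Version 2.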
Version 4: counting with Counter
# -*- coding: utf-8 -*-
#!python3
import collections
import re
path = 'test.txt'
with open(path, 'r', encoding='utf-8') as f:
    word_list = []
    word_reg = re.compile(r'\w+')
    for line in f:
        line_words = line.split()
        word_list.extend(line_words)
    words_dict = dict(collections.Counter(word_list))  # count with Counter
    for k, v in words_dict.items():
        print(k, v)
Output:
We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1
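A `Counter` can also sort the tally directly via its `most_common` method, which is handy when you only want the top words rather than an unordered dump. A short sketch (the sample text here is made up for illustration):

```python
from collections import Counter

counts = Counter("the cat and the dog and the bird".split())
# most_common(n) returns (word, count) pairs, highest count first.
for word, n in counts.most_common(2):
    print(word, n)
```

With the article's `word_list`, replacing `words_dict.items()` with `collections.Counter(word_list).most_common()` would print the words in descending frequency order.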
Note: the test file test.txt used here contains the following:
We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.
Hopefully this article is helpful to readers working on Python programming.