Python Full-Text Search Engines Explained

2024-02-25 01:48:03

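Lately I have been exploring how to implement Baidu-style keyword search in Python. Keyword search naturally brings regular expressions to mind. Regular expressions are the foundation of text matching, and Python provides the re module for working with them. Regular expressions alone, however, do not make a good search feature: they scan the text from start to finish on every query and offer no index, no relevance ranking, and no word segmentation.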
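To make the limitation concrete, here is a minimal sketch contrasting a regex scan with a lookup in a small inverted index, the kind of structure full-text engines are generally built around. The sample documents and the helper names (regex_search, build_inverted_index, index_search) are made up for illustration, and the whitespace split only works for space-delimited text, which is why Chinese text needs a dedicated tokenizer.

```python
import re
from collections import defaultdict

# Hypothetical sample documents, used only for illustration.
DOCS = {
    "doc1": "python makes text search easy",
    "doc2": "regular expressions match patterns in text",
    "doc3": "whoosh is a search library written in python",
}


def regex_search(keyword, docs):
    """Scan every document with a regex; the cost grows with the total text size."""
    pattern = re.compile(re.escape(keyword))
    return sorted(name for name, text in docs.items() if pattern.search(text))


def build_inverted_index(docs):
    """Map each word to the set of documents containing it (built once)."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.split():
            index[word].add(name)
    return index


def index_search(keyword, index):
    """Look the keyword up directly instead of rescanning the documents."""
    return sorted(index.get(keyword, set()))


index = build_inverted_index(DOCS)
print(regex_search("python", DOCS))   # ['doc1', 'doc3']
print(index_search("python", index))  # ['doc1', 'doc3']
print(index_search("text", index))    # ['doc1', 'doc2']
```

Both approaches return the same documents here; the difference is that the index is built once and then answers each query with a dictionary lookup.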

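Python has a package called whoosh that is built specifically for full-text search.

whoosh is not widely used in China, and it is less mature in performance than sphinx/coreseek. Unlike those, however, it is a pure-Python library, which makes it more convenient for Python users. The code is as follows.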

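Installation

Run pip install whoosh on the command line. The Chinese tokenizer below also relies on jieba, which can be installed the same way (pip install jieba).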

The following imports are needed:

import os

import jieba
from whoosh.index import create_in, open_dir
from whoosh.fields import Schema, TEXT, ID
from whoosh.analysis import RegexAnalyzer
from whoosh.analysis import Tokenizer, Token

Chinese tokenizer

class ChineseTokenizer(Tokenizer):
    """
    Chinese tokenizer that segments text with jieba.
    """
    def __call__(self, value, positions=False, chars=False,
                 keeporiginal=True, removestops=True, start_pos=0, start_char=0,
                 mode='', **kwargs):
        assert isinstance(value, str), "%r is not a str" % value
        t = Token(positions, chars, removestops=removestops, mode=mode, **kwargs)
        # jieba.tokenize yields (word, start, end) in search mode, so each
        # occurrence of a repeated word gets its own offsets.
        for w, start, end in jieba.tokenize(value, mode='search'):
            t.original = t.text = w
            t.boost = 0.5
            if positions:
                t.pos = start_pos + start
            if chars:
                t.startchar = start_char + start
                t.endchar = start_char + end
            yield t


def chinese_analyzer():
    return ChineseTokenizer()
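To make the positions and chars options more concrete, the following library-free sketch shows the kind of per-token data a tokenizer is expected to produce. SimpleToken and tokens_with_offsets are hypothetical names, a whitespace split stands in for jieba, and the sketch assumes the words arrive in order and do not overlap.

```python
from dataclasses import dataclass


@dataclass
class SimpleToken:
    """Hypothetical stand-in for the fields set on a whoosh Token."""
    text: str
    pos: int
    startchar: int
    endchar: int


def tokens_with_offsets(value, words):
    """Yield one SimpleToken per word, searching forward from a cursor.

    Assumes `words` are in document order and do not overlap.
    """
    cursor = 0
    for pos, word in enumerate(words):
        start = value.find(word, cursor)
        if start == -1:
            continue  # word not found after the cursor; skip it
        cursor = start + len(word)
        yield SimpleToken(word, pos, start, cursor)


text = "to be or not to be"
tokens = list(tokens_with_offsets(text, text.split()))
for tok in tokens:
    print(tok)
```

Because the search starts at the cursor rather than at the beginning of the string, the second "to" and the second "be" get their own offsets, which is what a highlighter needs in order to mark the right span.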

Function for building the index

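def create_index(document_dir):
    analyzer = chinese_analyzer()
    # The title field is named "titel"; the search code reads it under the same name.
    schema = Schema(titel=TEXT(stored=True, analyzer=analyzer), path=ID(stored=True),
                    content=TEXT(stored=True, analyzer=analyzer))
    ix = create_in("./", schema)  # the index files are written to the current directory
    writer = ix.writer()
    for parents, dirnames, filenames in os.walk(document_dir):
        for filename in filenames:
            title = filename.replace(".txt", "")
            print(title)
            # Join with the directory being walked so files in subdirectories open correctly
            with open(os.path.join(parents, filename), 'r', encoding='utf-8') as f:
                content = f.read()
            path = u"/b"  # every document is stored with the same placeholder path
            writer.add_document(titel=title, path=path, content=content)
    writer.commit()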
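The directory walk inside create_index can be tried on its own, without an index. The sketch below is only an illustration: read_text_files is a hypothetical helper, the temporary sample tree is made up, and the .txt filter is an added assumption.

```python
import os
import tempfile


def read_text_files(document_dir):
    """Yield (title, content) for every .txt file under document_dir."""
    for parent, _dirnames, filenames in os.walk(document_dir):
        for filename in sorted(filenames):
            if not filename.endswith(".txt"):
                continue
            # Join with `parent`, not document_dir, so nested files resolve.
            with open(os.path.join(parent, filename), encoding="utf-8") as f:
                yield os.path.splitext(filename)[0], f.read()


# Hypothetical sample tree: one file at the top level, one in a subdirectory.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    samples = [("a.txt", "第一篇"), (os.path.join("sub", "b.txt"), "第二篇")]
    for rel_path, body in samples:
        with open(os.path.join(root, rel_path), "w", encoding="utf-8") as f:
            f.write(body)
    docs = sorted(read_text_files(root))

print(docs)
```

Each (title, content) pair corresponds to one writer.add_document call in the indexing function.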

Search function

def search(search_str):
    title_list = []
    print('here')
    ix = open_dir("./")
    # The context manager closes the searcher once the results have been read
    with ix.searcher() as searcher:
        print(search_str, type(search_str))
        results = searcher.find("content", search_str)
        for hit in results:
            print(hit['titel'])
            print(hit.score)
            print(hit.highlights("content", top=10))
            title_list.append(hit['titel'])
    print('tt', title_list)
    return title_list

Thanks for reading. I hope this helps, and thank you for supporting this site!
