Tutorial: CAPTCHA Recognition in Perl with Tesseract-OCR

2024-04-10 20:19:03

1. What is Tesseract-OCR

"An OCR Engine that was developed at HP Labs between 1985 and 1995… and now at Google."
It is an open-source OCR engine built on the Leptonica (http://leptonica.com/) image processing library.
It runs on Linux, Windows, and Mac,
and can be used from .NET, C++, Python, Java, and other languages: https://code.google.com/p/tesseract-ocr/wiki/AddOns
Project page: https://code.google.com/p/tesseract-ocr/

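2. Usage

Download and install: https://tesseract-ocr.googlecode.com/files/tesseract-ocr-setup-3.02.02.exe
During installation, pay attention to the Path, math symbols, and language options, and select what you need.
Run: "tesseract yourpic.png res"
The text in yourpic.png is recognized and saved to res.txt.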
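The same call can be wrapped in Perl. This is a minimal sketch, assuming tesseract is on the PATH; the helper names tesseract_args and ocr_image are made up for illustration and are not part of Tesseract.

```perl
use strict;
use warnings;

# Build the argument list for: tesseract <image> <outbase> [-l <lang>]
# (hypothetical helper, not part of Tesseract itself)
sub tesseract_args {
    my ($image, $outbase, $lang) = @_;
    my @args = ('tesseract', $image, $outbase);
    push @args, '-l', $lang if defined $lang && length $lang;
    return @args;
}

# Run Tesseract and return the recognized text, i.e. the contents of <outbase>.txt.
sub ocr_image {
    my ($image, $outbase, $lang) = @_;
    system(tesseract_args($image, $outbase, $lang)) == 0
        or die "tesseract failed with status $?\n";
    open(my $fh, '<', "$outbase.txt") or die "cannot read $outbase.txt: $!\n";
    local $/;              # slurp mode: read the whole file at once
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Only runs when an image path is given on the command line.
print ocr_image($ARGV[0], 'res') if @ARGV;
```

Saved as, say, ocr.pl, it would be run as "perl ocr.pl yourpic.png". Passing the arguments to system as a list avoids shell quoting problems with unusual file names.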
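For more accurate recognition, download the language data (tessdata) for the languages you need from the project page.
For example:
Simplified Chinese: https://tesseract-ocr.googlecode.com/files/tesseract-ocr-3.02.chi_sim.tar.gz
Traditional Chinese
After downloading and extracting, copy chi_sim.traineddata into Tesseract-OCR\tessdata.
Run:
"tesseract yourpic.png eng" uses the default eng language pack
"tesseract yourpic.png sim -l chi_sim" uses the chi_sim language pack
"tesseract yourpic.png tra -l chi_tra" uses the chi_tra language pack
Pick whichever output is closest to the real data, so that later correction is easier.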
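One rough way to compare the outputs of the different language packs is to score each by how much of it falls into the alphabet you expect. The sketch below assumes the target data are IPv4 addresses (digits and dots); score_text, best_language, and the sample strings are all hypothetical.

```perl
use strict;
use warnings;

# Fraction of non-whitespace characters that belong to the expected alphabet
# (digits and dots here, on the assumption that the images show IPv4 addresses).
sub score_text {
    my ($text) = @_;
    (my $t = $text // '') =~ s/\s+//g;
    return 0 unless length $t;
    my $good = () = $t =~ /[0-9.]/g;   # count the matching characters
    return $good / length $t;
}

# Given { language => recognized text }, return the best-scoring language.
# Ties are broken alphabetically so the result is deterministic.
sub best_language {
    my ($results) = @_;
    my ($best) = sort {
        score_text($results->{$b}) <=> score_text($results->{$a}) or $a cmp $b
    } keys %$results;
    return $best;
}

# Made-up OCR outputs for the same image, one per language pack.
my %sample = (
    eng     => "2l0.l2.3s.7\n",
    chi_tra => "210.12.3s.7\n",
);
print best_language(\%sample), "\n";   # prints chi_tra
```

The score only says which output needs the fewest fixes; it does not tell you that the text is correct.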

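3. Advanced use: training

One of the few Chinese-language documents on training tesseract-ocr:
http://yy-programer.blogspot.tw/2012/08/training-tesseract-ocr-301.html
It is worth studying if you need high accuracy; for everyday use, default recognition plus post-correction is enough.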
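Post-correction can be kept as a small table of known misreads followed by a sanity check on the result. The sketch below is only an illustration: the two fixes in the table ("日" read for 8, "s" read for 6) are examples of the kind of misread you find by inspecting real output, the blanket "s" rule is a simplification, and correct and is_ipv4 are hypothetical names.

```perl
use strict;
use warnings;

# Ordered list of [pattern, replacement] pairs, applied top to bottom.
my @FIXES = (
    [ qr/\s+/,      ''  ],   # drop whitespace, including the trailing newline
    [ qr/\x{65E5}/, '8' ],   # the character U+65E5 misread for 8
    [ qr/s/,        '6' ],   # simplification: every "s" is treated as a 6
);

# Apply every fix in order and return the corrected text.
sub correct {
    my ($text) = @_;
    for my $fix (@FIXES) {
        my ($pattern, $replacement) = @$fix;
        $text =~ s/$pattern/$replacement/g;
    }
    return $text;
}

# True (1) if the string is four dot-separated numbers, each at most 255.
sub is_ipv4 {
    my ($s) = @_;
    my @octets = $s =~ /\A(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\z/
        or return 0;
    return (grep { $_ > 255 } @octets) ? 0 : 1;
}

my $fixed = correct("21s.1\x{65E5}.3.4\n");
print "$fixed\n" if is_ipv4($fixed);   # prints 216.18.3.4
```

The validity check catches output that is still malformed after correction; it cannot catch a wrong digit that happens to produce a valid address.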

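4. Worked example: harvesting proxies

This example harvests proxies from several proxy list pages on http://www.proxyfire.net/

Without further ado, here is the code.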
pf.bat


pf.pl http://www.proxyfire.net/index.php?pageid=eliteproxylist        elite.txt
pf.pl http://www.proxyfire.net/index.php?pageid=anonymousproxylist    anony.txt
pf.pl http://www.proxyfire.net/index.php?pageid=transparentproxylist  trans.txt
pf.pl http://www.proxyfire.net/index.php?pageid=socks4proxylist       s4.txt
pf.pl http://www.proxyfire.net/index.php?pageid=socks5proxylist       s5.txt
type *.txt > all.tmp
del *.txt /s /q
ren all.tmp all.txt
@pause
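The merge step of the batch file can also be done in Perl, which makes it easy to drop duplicate lines along the way. This is a sketch of an alternative, not what pf.bat does; merge_unique is a hypothetical helper.

```perl
use strict;
use warnings;

# Merge lines, dropping blank lines and duplicates while keeping first-seen order.
sub merge_unique {
    my (@lines) = @_;
    my (%seen, @out);
    for my $line (@lines) {
        $line =~ s/\s+\z//;                      # strip the trailing newline
        next if $line eq '' or $seen{$line}++;   # skip blanks and repeats
        push @out, $line;
    }
    return @out;
}

# Read every file named on the command line and print the merged list.
if (@ARGV) {
    my @all;
    for my $path (@ARGV) {
        open(my $fh, '<', $path) or die "cannot read $path: $!\n";
        push @all, <$fh>;
        close $fh;
    }
    print "$_\n" for merge_unique(@all);
}
```

Saved as, say, merge.pl, it could be run as "perl merge.pl elite.txt anony.txt trans.txt s4.txt s5.txt > all.tmp".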

pf.pl


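use strict;

our $url  = $ARGV[0];   # list page to fetch
our $file = $ARGV[1];   # output file; results are appended to it

my $res = undef;
my @tmp;
my @pxy;

# Fetch the list page with wget and read it into $res.
`wget "$url" -q -O ___html`;
open(my $fh, '<', '___html') or die "cannot read ___html: $!";
@tmp = <$fh>;
close $fh;
$res = join('', @tmp);
undef(@tmp);
`del ___html /s /q`;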

 

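# Collect (image path, port) pairs. The pattern and the loop header were
# damaged in the published listing; this is a reconstruction that assumes
# each row looks like <td><img alt="" src="..." border="0"></td><td>PORT</td>.
@tmp = ($res =~ /<img alt="" src="([^"]+)" border="0"><\/td><td>(\d+)<\/td>/g);
for (my $i = 0; $i < @tmp; $i++) {
    push @pxy, {'ip' => 'http://www.proxyfire.net' . $tmp[$i], 'port' => $tmp[$i+1]};
    $i = $i + 1;   # skip the port; together with $i++ this advances one pair
}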

 

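# Download each IP image and OCR it; Tesseract writes its result to ___.txt.
for (my $i = 0; $i < @pxy; $i++) {
    if (length($pxy[$i]{ip}) > 0)
    {
        `echo off & wget "$pxy[$i]{ip}" -q -O ___png`;
        `tesseract ___png ___ -l chi_tra`;

        my $txt = undef;
        if (open(my $in, '<', '___.txt')) {
            $txt = <$in>;    # only the first line is needed
            close $in;
        }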

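        # Fix the misreads seen with this font: 日 for 8, 昍 for 88, "s" for 6.
        if (length($txt) > 11)
        {
            $txt =~ s/\s+//g;
            $txt =~ s/日/8/g;
            $txt =~ s/昍/88/g;
            $txt =~ s/s0/60/g;
            $txt =~ s/s1/61/g;
            $txt =~ s/s2/62/g;
            $txt =~ s/s3/69/g;   # the original listing maps s3 to 69, not 63
            $txt =~ s/s4/64/g;
            $txt =~ s/s5/65/g;
            $txt =~ s/s7/67/g;
            $txt =~ s/s8/68/g;
            $txt =~ s/s9/69/g;
            $txt =~ s/0s/06/g;
            $txt =~ s/1s/16/g;
            $txt =~ s/2s/26/g;
            $txt =~ s/3s/96/g;   # likewise 3s becomes 96 in the original listing
            $txt =~ s/4s/46/g;
            $txt =~ s/5s/56/g;
            $txt =~ s/6s/66/g;
            $txt =~ s/7s/76/g;
            $txt =~ s/8s/86/g;
            $txt =~ s/9s/96/g;
            $txt =~ s/ss/66/g;
            $txt =~ s/\.s/.6/g;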

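            $pxy[$i]{ip} = $txt;

            # Two alternative readings for digits the OCR tends to confuse (3 and 9).
            my $bak1 = $txt;
            my $bak2 = $txt;
            $bak1 =~ s/13/19/g;
            $bak1 =~ s/\.32\./.92./g;
            $bak1 =~ s/\.33\./.99./g;

            $bak2 =~ s/19/13/g;
            $bak2 =~ s/\.243/.249/g;
            $bak2 =~ s/203\./209./g;

            # Append the corrected reading and both alternatives to the output file.
            open(my $out, '>>', $file) or die "cannot append to $file: $!";
            print {$out} $pxy[$i]{ip} . ":" . $pxy[$i]{port} . "\n";
            print {$out} $bak1 . ":" . $pxy[$i]{port} . "\n";
            print {$out} $bak2 . ":" . $pxy[$i]{port} . "\n";
            close $out;
        }
    }
}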

`del ___* /s /q`;
undef($url);
undef($file);
undef($res);
undef(@tmp);
undef(@pxy);
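The extraction step is the most fragile part of a script like this, so it helps to test it on a saved fragment of HTML before going near the network. The sketch below returns one hash per table row instead of a flat list that has to be walked in steps of two. The row markup in $sample and the name extract_rows are assumptions for illustration; the real page may be laid out differently.

```perl
use strict;
use warnings;

# Extract { ip => image URL, port => port } records from list-page HTML.
# Assumed row shape: an <img ... src="..."> cell followed by a numeric cell.
sub extract_rows {
    my ($html, $base) = @_;
    my @rows;
    while ($html =~ m{<img[^>]*\bsrc="([^"]+)"[^>]*>\s*</td>\s*<td[^>]*>\s*(\d+)\s*</td>}gi) {
        push @rows, { ip => $base . $1, port => $2 };
    }
    return @rows;
}

# Made-up sample rows; not copied from the real site.
my $sample = '<tr><td><img alt="" src="/img/a1.png" border="0"></td><td>8080</td></tr>'
           . '<tr><td><img alt="" src="/img/b2.png" border="0"></td><td>3128</td></tr>';

for my $row (extract_rows($sample, 'http://www.proxyfire.net')) {
    print "$row->{ip} $row->{port}\n";
}
```

Allowing extra attributes and whitespace in the pattern makes it a little more tolerant of small markup changes than an exact-match regex.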
