How to crawl through a proxy with Python's Scrapy framework
1. Create a new file named "middlewares.py" in your Scrapy project:
# Importing base64 library because we'll need it ONLY in case the proxy we are going to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # set up basic authentication for the proxy
        # (encodestring appends a trailing newline, so strip it)
        encoded_user_pass = base64.encodestring(proxy_user_pass).strip()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
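As a side note, the built-in HttpProxyMiddleware honours request.meta['proxy'], so for a one-off request you can attach the proxy directly in the spider instead of writing a middleware. A minimal sketch (not from the original post; the spider name, URL and proxy address are placeholders):

from scrapy.spider import BaseSpider
from scrapy.http import Request

class MetaProxySpider(BaseSpider):
    name = "meta_proxy_demo"  # placeholder name

    def start_requests(self):
        # Attach the proxy to this single request instead of to every request globally
        yield Request("http://xujian.info",
                      meta={'proxy': "http://YOUR_PROXY_IP:PORT"},
                      callback=self.parse)

    def parse(self, response):
        self.log("fetched %s through the proxy" % response.url)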
2. Add the following to the project settings file (./project_name/settings.py):
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}
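Optionally, you can keep the proxy address and credentials out of the middleware source and read them from settings.py instead. A sketch of that variant, assuming you add PROXY_URL and PROXY_USER_PASS entries to settings.py (those setting names are invented for this example, not standard Scrapy settings):

import base64

class SettingsProxyMiddleware(object):
    def __init__(self, proxy_url, proxy_user_pass):
        self.proxy_url = proxy_url
        self.proxy_user_pass = proxy_user_pass

    @classmethod
    def from_crawler(cls, crawler):
        # Read the (assumed) PROXY_URL / PROXY_USER_PASS entries from settings.py
        return cls(crawler.settings.get('PROXY_URL'),
                   crawler.settings.get('PROXY_USER_PASS'))

    def process_request(self, request, spider):
        if self.proxy_url:
            request.meta['proxy'] = self.proxy_url
        if self.proxy_user_pass:
            request.headers['Proxy-Authorization'] = \
                'Basic ' + base64.b64encode(self.proxy_user_pass)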
Only two steps were needed: requests now go out through the proxy. Let's test it ^_^
from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request

class TestSpider(CrawlSpider):
    name = "test"
    domain_name = "whatismyip.com"
    # The following url is subject to change, you can get the last updated one from here:
    # http://www.whatismyip.com/faq/automation.asp
    start_urls = ["http://xujian.info"]

    def parse(self, response):
        open('test.html', 'wb').write(response.body)
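Assuming the spider is saved inside your project's spiders package, run it with:

scrapy crawl test

It writes the fetched page to test.html; if the proxy is working and the start URL is a "what is my IP" style page, the address shown there should be the proxy's, not your own.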
3. Use a random user-agent
By default Scrapy sends the same user-agent for every request, which makes the crawler easy for sites to block. The code below picks a user-agent at random from a predefined list, so different pages are fetched with different user-agents.
Add the following to settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'Crawler.comm.rotate_useragent.RotateUserAgentMiddleware': 400,
}
Note: "Crawler" here is your project's name, i.e. the top-level package directory; adjust the dotted path to match your own layout. The middleware code itself is below:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import random
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Pick a user-agent at random from the list below
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # the default user_agent_list composes chrome, IE, firefox, Mozilla, opera, netscape
    # for more user agent strings, you can find it in http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
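To confirm the rotation is actually happening, a small throwaway spider can log the User-Agent header that was sent with each request. A minimal sketch (not part of the original post; the spider name and URLs are placeholders):

from scrapy.spider import BaseSpider

class UACheckSpider(BaseSpider):
    name = "ua_check"  # placeholder name
    start_urls = [
        "http://xujian.info",
        "http://www.useragentstring.com/",
    ]

    def parse(self, response):
        # response.request is the Request that produced this response,
        # so its headers show which User-Agent was actually used
        self.log("User-Agent used: %s" % response.request.headers.get('User-Agent'))

With the middleware enabled, the logged User-Agent should vary between requests instead of staying fixed.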