How to Scrape Web Page Content with a Crawler in Java

2024-03-29 03:01:02

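import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.PostMethod;

// Both methods belong to a class named createhttpClient (the name used in main).

// Method 1
// This approach uses the package provided by Apache; it is simple and convenient,
// but it needs the following jars: commons-codec-1.4.jar,
// commons-httpclient-3.1.jar, commons-logging-1.0.4.jar
public static String createhttpClient(String url, String param) {
    HttpClient client = new HttpClient();
    String response = null;
    String keyword = null;
    PostMethod postMethod = new PostMethod(url);
    // try {
    //     if (param != null)
    //         keyword = new String(param.getBytes("gb2312"), "ISO-8859-1");
    // } catch (UnsupportedEncodingException e1) {
    //     // TODO Auto-generated catch block
    //     e1.printStackTrace();
    // }
    // NameValuePair[] data = { new NameValuePair("keyword", keyword) };
    // // Put the form values into postMethod
    // postMethod.setRequestBody(data);
    // The block above fetches the page with parameters. I commented it out;
    // uncomment it if you want to experiment with it.
    try {
        int statusCode = client.executeMethod(postMethod);
        response = new String(postMethod.getResponseBodyAsString()
                .getBytes("ISO-8859-1"), "gb2312");
        // Note: gb2312 here must match the encoding of the page being fetched
        String p = response.replaceAll("&[a-zA-Z]{1,10};", "") // drop entities such as &nbsp;
                .replaceAll("<[^>]*>", ""); // strip the HTML tags from the page
        System.out.println(p);
    } catch (Exception e) {
        e.printStackTrace();
    }
    return response;
}

// Method 2
// This approach uses Java's built-in URL class to fetch the page content
public String getPageContent(String strUrl, String strPostRequest,
        int maxLength) {
    // Buffer for the fetched page
    StringBuffer buffer = new StringBuffer();
    System.setProperty("sun.net.client.defaultConnectTimeout", "5000");
    System.setProperty("sun.net.client.defaultReadTimeout", "5000");
    try {
        URL newUrl = new URL(strUrl);
        HttpURLConnection hConnect = (HttpURLConnection) newUrl
                .openConnection();
        // Extra data for a POST request
        if (strPostRequest.length() > 0) {
            hConnect.setDoOutput(true);
            OutputStreamWriter out = new OutputStreamWriter(hConnect
                    .getOutputStream());
            out.write(strPostRequest);
            out.flush();
            out.close();
        }
        // Read the content, up to maxLength characters (no limit if maxLength <= 0)
        BufferedReader rd = new BufferedReader(new InputStreamReader(
                hConnect.getInputStream()));
        int ch;
        for (int length = 0; (ch = rd.read()) > -1
                && (maxLength <= 0 || length < maxLength); length++)
            buffer.append((char) ch);
        String s = buffer.toString();
        // Strings are immutable, so the stripped text has to be assigned back
        s = s.replaceAll("&[a-zA-Z]{1,10};", "").replaceAll("<[^>]*>", "");
        System.out.println(s);
        rd.close();
        hConnect.disconnect();
        return buffer.toString().trim();
    } catch (Exception e) {
        // return "Error: failed to read the page!";
        return null;
    }
}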
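
The read loop and the tag stripping in the second method can be tried without any network access. The following is only a minimal sketch: the class name PageText and the helpers readLimited and stripMarkup are hypothetical names introduced for illustration, and a StringReader stands in for the connection's input stream. Unlike the original loop, this version checks the limit before reading, so it does not consume one extra character.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class PageText {
    // Reads at most maxLength characters; maxLength <= 0 means no limit,
    // following the convention used by getPageContent.
    static String readLimited(Reader reader, int maxLength) {
        StringBuilder buffer = new StringBuilder();
        try {
            int ch;
            for (int length = 0; (maxLength <= 0 || length < maxLength)
                    && (ch = reader.read()) > -1; length++) {
                buffer.append((char) ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }

    // Removes named entities such as &nbsp; and then anything shaped like a tag.
    // replaceAll returns a new String, so the result must be used or assigned.
    static String stripMarkup(String html) {
        return html.replaceAll("&[a-zA-Z]{1,10};", "").replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<p>Hello,&nbsp;<b>world</b>!</p>";
        System.out.println(readLimited(new StringReader(html), 8)); // <p>Hello
        System.out.println(stripMarkup(html));                      // Hello,world!
    }
}
```

A regular expression like this is a rough filter: it does not handle script blocks, comments, or numeric entities, so a real HTML parser is usually the safer choice for anything beyond quick experiments.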

public static void main(String[] args) {
    String url = "https://www.nhooo.com";
    String keyword = "毛票票";
    createhttpClient p = new createhttpClient();
    String response = p.createhttpClient(url, keyword); // Method 1
    // p.getPageContent(url, "post", 100500); // Method 2
}
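
The first method converts the response with getBytes("ISO-8859-1") followed by new String(..., "gb2312"). That conversion only makes sense if the body was first decoded as ISO-8859-1, which is what the original code implies. Because ISO-8859-1 maps every byte to exactly one character, re-encoding gives back the raw bytes, which can then be decoded with the page's real charset. Here is a minimal sketch of that round trip; CharsetRoundTrip and redecode are hypothetical names, and the sample text is simulated locally rather than fetched.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    // Recovers text that was decoded as ISO-8859-1 although the bytes
    // were really written in another charset.
    static String redecode(String misdecoded, Charset actual) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, actual);
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312");
        String original = "\u7f51\u9875"; // Chinese for "web page"

        // Simulate a GB2312 page whose bytes were decoded as ISO-8859-1.
        byte[] wire = original.getBytes(gb2312);
        String garbled = new String(wire, StandardCharsets.ISO_8859_1);

        // Decoding with the matching charset restores the text.
        System.out.println(redecode(garbled, gb2312).equals(original)); // true
        // Decoding with a charset the page does not use leaves it broken.
        System.out.println(redecode(garbled, StandardCharsets.UTF_8).equals(original)); // false
    }
}
```

This is why the charset name in the first method has to match the page being fetched: with the wrong charset the bytes are recovered correctly but decoded into the wrong characters.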
