How to Fetch the Source of Any HTTP Web Page in Java

2023-09-24 02:23:05

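This article shows, with a working example, how to fetch the source of any HTTP web page in Java. The details are as follows.

The example provides two features:

1. Fetch the source of any HTTP web page
2. Fetch the text of any HTTP web page with the HTML tags removed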

The Webpage class:

/**
 * Utility class for web page operations
 */
package test;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @author winddack
 */
public class Webpage {
    private String pageUrl; // URL of the page to operate on
    private String pageEncode = "UTF-8"; // character encoding of the page

    public String getPageUrl() {
        return pageUrl;
    }

    public void setPageUrl(String pageUrl) {
        this.pageUrl = pageUrl;
    }

    public String getPageEncode() {
        return pageEncode;
    }

    public void setPageEncode(String pageEncode) {
        this.pageEncode = pageEncode;
    }

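    // Fetch the raw source of the page
    public String getPageSource()
    {
        StringBuilder sb = new StringBuilder();
        // Build a URL object, open its stream, and wrap it in a BufferedReader;
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(pageUrl).openStream(), pageEncode)))
        {
            String line;
            // Read the resource line by line, keeping the line breaks
            while ((line = in.readLine()) != null)
            {
                sb.append(line).append('\n');
            }
        }
        catch (Exception ex)
        {
            // On failure, log the error and return whatever was read so far
            System.err.println(ex);
        }
        return sb.toString();
    }
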
    // Return the page text with script blocks, style blocks, and HTML tags removed
    public String getPageSourceWithoutHtml()
    {
        final String regEx_script = "<script[^>]*?>[\\s\\S]*?<\\/script>"; // script blocks
        final String regEx_style = "<style[^>]*?>[\\s\\S]*?<\\/style>"; // style blocks
        final String regEx_html = "<[^>]+>"; // any remaining HTML tag
        final String regEx_space = "\\s+"; // spaces, tabs, carriage returns, newlines
        String htmlStr = getPageSource(); // fetch the unprocessed source
        Pattern p_script = Pattern.compile(regEx_script, Pattern.CASE_INSENSITIVE);
        Matcher m_script = p_script.matcher(htmlStr);
        htmlStr = m_script.replaceAll(""); // strip script blocks
        Pattern p_style = Pattern.compile(regEx_style, Pattern.CASE_INSENSITIVE);
        Matcher m_style = p_style.matcher(htmlStr);
        htmlStr = m_style.replaceAll(""); // strip style blocks
        Pattern p_html = Pattern.compile(regEx_html, Pattern.CASE_INSENSITIVE);
        Matcher m_html = p_html.matcher(htmlStr);
        htmlStr = m_html.replaceAll(""); // strip HTML tags
        Pattern p_space = Pattern.compile(regEx_space, Pattern.CASE_INSENSITIVE);
        Matcher m_space = p_space.matcher(htmlStr);
        htmlStr = m_space.replaceAll(""); // strip all whitespace
        htmlStr = htmlStr.replaceAll("&nbsp;", ""); // strip non-breaking-space entities
        // Keep only the text up to and including the first Chinese full stop ("。");
        // if the page contains none, return the whole text
        int end = htmlStr.indexOf("。");
        if (end >= 0)
        {
            htmlStr = htmlStr.substring(0, end + 1);
        }
        return htmlStr;
    }
}
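
The order of the filters in getPageSourceWithoutHtml matters: script and style blocks have to go before the generic tag pattern, otherwise their contents would be left behind as text. The following is a minimal offline sketch of the same pipeline applied to a fixed string, so it can be tried without a network connection. The class name HtmlTextSketch and the method strip are hypothetical names chosen for illustration, not part of the original class.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; mirrors the filter order used by
// Webpage.getPageSourceWithoutHtml, but works on an in-memory string.
final class HtmlTextSketch {
    private static final Pattern SCRIPT =
            Pattern.compile("<script[^>]*>[\\s\\S]*?</script>", Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE =
            Pattern.compile("<style[^>]*>[\\s\\S]*?</style>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern SPACE = Pattern.compile("\\s+");

    static String strip(String html) {
        String text = SCRIPT.matcher(html).replaceAll(""); // 1. script blocks
        text = STYLE.matcher(text).replaceAll("");          // 2. style blocks
        text = TAG.matcher(text).replaceAll("");            // 3. remaining tags
        text = text.replace("&nbsp;", "");                  // 4. nbsp entities
        return SPACE.matcher(text).replaceAll("");          // 5. all whitespace
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style>"
                + "<script>var x = 1 < 2;</script></head>"
                + "<body><p>Hello,&nbsp;<b>world</b></p></body></html>";
        System.out.println(strip(html)); // prints "Hello,world"
    }
}
```

Regular expressions are only a rough approximation of HTML parsing; for pages with comments, CDATA sections, or malformed markup, a real HTML parser is likely to be more reliable.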

Usage:

Webpage page = new Webpage();
page.setPageUrl("http://www.baidu.com");
String code = page.getPageSourceWithoutHtml();
System.out.println(code);
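
The Webpage class expects the caller to know the page encoding in advance. As one possible alternative, the sketch below uses the java.net.http.HttpClient API (available since Java 11) and tries to read the encoding from the Content-Type response header, falling back to UTF-8. The class name PageFetchSketch, the helper charsetOf, and the timeout values are assumptions made for illustration; this is not the approach used in the article's own code.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: fetch a page and decode it with the declared charset.
final class PageFetchSketch {
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)", Pattern.CASE_INSENSITIVE);

    // Extract the charset from a Content-Type value such as
    // "text/html; charset=GBK"; return the fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException ex) {
            // Covers both illegal and unsupported charset names
            return fallback;
        }
    }

    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10)) // arbitrary timeout
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(20)) // arbitrary timeout
                .GET()
                .build();
        HttpResponse<byte[]> response =
                client.send(request, HttpResponse.BodyHandlers.ofByteArray());
        String contentType = response.headers().firstValue("Content-Type").orElse("");
        return new String(response.body(), charsetOf(contentType, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        String url = args.length > 0 ? args[0] : "http://www.baidu.com";
        System.out.println(fetch(url));
    }
}
```

Some pages declare their encoding only in a meta tag inside the HTML rather than in the header; this sketch does not handle that case and would decode such pages as UTF-8.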

PS: Here are two handy online regular expression tools for reference:

JavaScript regular expression online tester:
http://tools.jb51.net/regex/javascript

Regular expression online generator:
http://tools.jb51.net/regex/create_reg

I hope this article helps with your Java programming.
