Extracting Plain Text from HTML in Java

2023-09-12 13:58:04

// Extract plain text from HTML
public static String StripHT(String strHtml) {
    // Strip every HTML tag
    String txtcontent = strHtml.replaceAll("<[^>]+>", "");
    // Remove all whitespace: spaces, carriage returns, line feeds, and tabs
    txtcontent = txtcontent.replaceAll("\\s+", "");
    return txtcontent;
}
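Because StripHT deletes every whitespace character, words that were separated only by a space end up joined. The following is a minimal sketch of a variant that collapses whitespace to a single space instead. The class name `StripHtDemo` and the method names `stripAll` and `stripKeepSpaces` are hypothetical and not part of the original code.

```java
public class StripHtDemo {

    // Same idea as StripHT: remove the tags, then delete ALL whitespace.
    public static String stripAll(String html) {
        return html.replaceAll("<[^>]+>", "").replaceAll("\\s+", "");
    }

    // Hypothetical variant: collapse each whitespace run to one space, then trim.
    public static String stripKeepSpaces(String html) {
        return html.replaceAll("<[^>]+>", "").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>World</b></p>\n<p>Bye</p>";
        System.out.println(stripAll(html));        // HelloWorldBye
        System.out.println(stripKeepSpaces(html)); // Hello World Bye
    }
}
```

Which behavior is preferable depends on the input: deleting all whitespace may suit text without word spacing (Chinese, for instance), while keeping one space is usually safer for space-delimited languages.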

// Extract plain text from HTML
// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
public static String Html2Text(String inputString) {
    String htmlStr = inputString; // string containing HTML tags
    String textStr = "";
    Pattern p_script;
    Matcher m_script;
    Pattern p_style;
    Matcher m_style;
    Pattern p_html;
    Matcher m_html;
    try {
        // Regex for script blocks (or simply <script[^>]*?>[\\s\\S]*?<\\/script>)
        String regEx_script = "<[\\s]*?script[^>]*?>[\\s\\S]*?<[\\s]*?\\/[\\s]*?script[\\s]*?>";
        // Regex for style blocks (or simply <style[^>]*?>[\\s\\S]*?<\\/style>)
        String regEx_style = "<[\\s]*?style[^>]*?>[\\s\\S]*?<[\\s]*?\\/[\\s]*?style[\\s]*?>";
        // Regex for any HTML tag
        String regEx_html = "<[^>]+>";

        p_script = Pattern.compile(regEx_script, Pattern.CASE_INSENSITIVE);
        m_script = p_script.matcher(htmlStr);
        htmlStr = m_script.replaceAll(""); // filter out script blocks

        p_style = Pattern.compile(regEx_style, Pattern.CASE_INSENSITIVE);
        m_style = p_style.matcher(htmlStr);
        htmlStr = m_style.replaceAll(""); // filter out style blocks

        p_html = Pattern.compile(regEx_html, Pattern.CASE_INSENSITIVE);
        m_html = p_html.matcher(htmlStr);
        htmlStr = m_html.replaceAll(""); // filter out the remaining HTML tags

        textStr = htmlStr;
    } catch (Exception e) {
        System.err.println("Html2Text:" + e.getMessage());
    }
    // Collapse runs of spaces, then remove blank lines
    textStr = textStr.replaceAll("[ ]+", " ");
    textStr = textStr.replaceAll("(?m)^\\s*$(\\n|\\r\\n)", "");
    return textStr; // return the text string
}
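To show what the Html2Text method above does to a small input, here is a compact sketch of the same three-pass idea (script blocks, style blocks, then all remaining tags). It assumes the patterns can be compiled once as constants; the class name `Html2TextDemo`, the constant names, and the slightly simplified regexes are illustrative choices, not the original author's code.

```java
import java.util.regex.Pattern;

public class Html2TextDemo {

    // Compiled once and reused; CASE_INSENSITIVE also matches <SCRIPT> and <Style>.
    private static final Pattern SCRIPT = Pattern.compile(
            "<\\s*script[^>]*>[\\s\\S]*?<\\s*/\\s*script\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE = Pattern.compile(
            "<\\s*style[^>]*>[\\s\\S]*?<\\s*/\\s*style\\s*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    public static String html2Text(String html) {
        // Remove script and style blocks first so their contents do not leak into the text.
        String text = SCRIPT.matcher(html).replaceAll("");
        text = STYLE.matcher(text).replaceAll("");
        // Then remove every remaining tag.
        text = TAG.matcher(text).replaceAll("");
        // Collapse runs of spaces and drop blank lines.
        text = text.replaceAll("[ ]+", " ");
        return text.replaceAll("(?m)^\\s*$(\\n|\\r\\n)", "");
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style>"
                + "<script>alert('x');</script></head>"
                + "<body><p>Hello   <b>World</b></p></body></html>";
        System.out.println(html2Text(html)); // Hello World
    }
}
```

The order of the passes matters: if the generic tag pattern ran first, the bodies of the script and style elements would be left behind as text.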

package com.util;

import java.io.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class Html2Text extends HTMLEditorKit.ParserCallback {
    StringBuffer s;

    public Html2Text() {}

    public void parse(Reader in) throws IOException {
        s = new StringBuffer();
        ParserDelegator delegator = new ParserDelegator();
        // the third parameter is TRUE to ignore charset directive
        delegator.parse(in, this, Boolean.TRUE);
    }

    // Called by the parser for each run of text between tags
    @Override
    public void handleText(char[] text, int pos) {
        s.append(text);
    }

    public String getText() {
        return s.toString();
    }

    public static void main(String[] args) {
        try {
            // the HTML to convert
            // Reader in = new StringReader("string");
            FileReader in = new FileReader("java-new.html");
            Html2Text parser = new Html2Text();
            parser.parse(in);
            in.close();
            System.out.println(parser.getText());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
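The commented-out StringReader line in main hints that the same parser can read HTML held in a String rather than a file. A minimal sketch of that usage follows, assuming an anonymous ParserCallback is acceptable; the class name `StringHtml2Text` and the method `extract` are hypothetical. It appends a space after each text chunk, which is an added choice so that adjacent chunks do not run together.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class StringHtml2Text {

    // Hypothetical helper: parse HTML from a String and return the collected text.
    public static String extract(String html) {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] text, int pos) {
                // Separate chunks with a space (an assumption, not in the original).
                out.append(text).append(' ');
            }
        };
        // try-with-resources closes the reader even if parsing fails.
        try (Reader in = new StringReader(html)) {
            // true = ignore any charset directive in the document
            new ParserDelegator().parse(in, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(extract("<html><body><p>Hello <b>World</b></p></body></html>"));
    }
}
```

Unlike the regex versions, this approach relies on the HTML parser bundled with Swing, so malformed markup is handled by the parser's own recovery rules rather than by pattern matching.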
