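This article presents a Java example that filters tags in order to convert HTML content into plain text. It is shared here for reference; the details are as follows: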
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.commons.lang3.StringUtils; // assumed: Commons Lang 3; replaceEach also exists in Commons Lang 2.4+

/**
 * Converts HTML content to plain text.
 * @param html the HTML text to process
 * @param filterTags names of the HTML tags to keep (for example "img" or "td"); may be null
 * @return the text with all other tags removed, trimmed
 */
public static String trimHtml2Txt(String html, String[] filterTags) {
    // The (?i) flag must lead the pattern; placed at the end it has no effect.
    html = html.replaceAll("(?i)<head>[\\s\\S]*?</head>", ""); // remove head
    html = html.replaceAll("<!--[\\s\\S]*?-->", ""); // remove comments
    html = html.replaceAll("<![\\s\\S]*?>", ""); // remove declarations such as <!DOCTYPE>
    html = html.replaceAll("(?i)<style[^>]*>[\\s\\S]*?</style>", ""); // remove styles
    html = html.replaceAll("(?i)<script[^>]*>[\\s\\S]*?</script>", ""); // remove JavaScript
    html = html.replaceAll("(?i)<w:[^>]+>[\\s\\S]*?</w:[^>]+>", ""); // remove Word tags
    html = html.replaceAll("(?i)<xml>[\\s\\S]*?</xml>", "");
    html = html.replaceAll("(?i)<html[^>]*>|<body[^>]*>|</html>|</body>", "");
    html = html.replaceAll("\r\n|\n|\r", " "); // replace line breaks with spaces
    html = html.replaceAll("(?i)<br[^>]*>", "\r\n");
    List<String> tags = new ArrayList<String>();   // real tags to protect
    List<String> s_tags = new ArrayList<String>(); // placeholder for each protected tag
    // Tags that usually carry attributes: only the "<name" prefix is matched.
    List<String> halfTag = Arrays.asList("img", "table", "thead", "th", "tr", "td");
    if (filterTags != null && filterTags.length > 0) {
        for (String tag : filterTags) {
            boolean half = halfTag.contains(tag);
            tags.add("<" + tag + (half ? "" : ">")); // opening tag
            // Use an unusual marker so it cannot collide with visible text such as "#td" or "#table".
            s_tags.add("#REPLACETAG" + tag + (half ? "" : "REPLACETAG#"));
            if (!"img".equals(tag)) {
                tags.add("</" + tag + ">"); // closing tag
                s_tags.add("#REPLACETAG/" + tag + "REPLACETAG#");
            }
        }
    }
    String[] tagArr = tags.toArray(new String[0]);
    String[] markArr = s_tags.toArray(new String[0]);
    html = StringUtils.replaceEach(html, tagArr, markArr); // protect the tags to keep
    html = html.replaceAll("(?i)</p>", "\r\n");
    html = html.replaceAll("<[^>]+>", ""); // strip all remaining tags
    html = StringUtils.replaceEach(html, markArr, tagArr); // restore the protected tags
    html = html.replace("&nbsp;", " ");
    return html.trim();
}
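The core idea of the listing, keeping a whitelist of tags while stripping everything else, can also be tried out without any third-party library. The following is only a minimal sketch and not the method used in this article: the class `HtmlToTextDemo` and the helper `stripTagsExcept` are hypothetical names introduced for illustration, and the placeholder replace-and-restore step is swapped for a single `java.util.regex` pass that decides tag by tag whether to keep it. It covers only the tag-stripping step, not the removal of head, script, or style blocks.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original article.
final class HtmlToTextDemo {

    // Matches an opening or closing tag and captures its name,
    // e.g. "<td class='x'>" captures "td" and "</b>" captures "b".
    private static final Pattern TAG =
            Pattern.compile("</?([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*>");

    /** Removes every tag whose lower-cased name is not in keepTags. */
    static String stripTagsExcept(String html, Set<String> keepTags) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1).toLowerCase(Locale.ROOT);
            // Keep the whole tag (attributes included) or drop it entirely.
            String replacement = keepTags.contains(name) ? m.group() : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> keep = new HashSet<>(Arrays.asList("b", "img"));
        String html = "<div><B>bold</B> and <i>italic</i> <img src=\"a.png\"></div>";
        // prints: <B>bold</B> and italic <img src="a.png">
        System.out.println(stripTagsExcept(html, keep));
    }
}
```

Because the tag name is captured and lower-cased before the lookup, this variant treats `<B>` and `<b>` alike and keeps attributes on any whitelisted tag, whereas the placeholder approach matches tag strings literally.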
I hope this article is helpful for your Java programming.