服务器之家

服务器之家 > 正文

c# 正则表达式对网页进行有效内容抽取

时间:2020-07-24 16:15     来源/作者:正则之家

搜索引擎中一个比较重要的环节就是从网页中抽取出有效内容。简单来说,就是吧HTML文本中的HTML标记去掉,留下我们用IE等浏览器打开HTML文档看到的部分(我们这里不考虑图片).
将HTML文本中的标记分为:注释,script ,style,以及其他标记分别去掉:
1.去注释,正则为:
output = Regex.Replace(input, @"<!--[^-]*-->", string.Empty, RegexOptions.IgnoreCase);
2.去script,正则为:
ouput = Regex.Replace(input, @"<script[^>]*?>.*?</script>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
output2 = Regex.Replace(ouput , @"<noscript[^>]*?>.*?</noscript>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
3.去style,正则为:
output = Regex.Replace(input, @"<style[^>]*?>.*?</style>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
4.去其他HTML标记
result = result.Replace("&nbsp;", " ");
result = result.Replace("&quot;", "\"");
result = result.Replace("&lt;", "<");
result = result.Replace("&gt;", ">");
result = result.Replace("&amp", "&");
result = result.Replace("<br>", "\r\n");
result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty, RegexOptions.IgnoreCase);
以上的代码中大家可以看到,我使用了RegexOptions.Singleline参数,这个参数很重要,他主要是为了让"."(小圆点)可以匹配换行符.如果没有这个参数,大多数情况下,用上面列正则表达式来消除网页HTML标记是无效的.
HTML发展至今,语法已经相当复杂,上面只列出了几种最主要的标记,更多的去HTML标记的正则我将在
Rost WebSpider 的开发过程中补充进来。
下面用c#实现了一个从HTML字符串中提取有效内容的类:
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
class HtmlExtract
{
#region private attributes
private string _strHtml;
#endregion
#region public mehtods
public HtmlExtract(string inStrHtml)
{
_strHtml = inStrHtml
}
public override string ExtractText()
{
string result = _strHtml;
result = RemoveComment(result);
result = RemoveScript(result);
result = RemoveStyle(result);
result = RemoveTags(result);
return result.Trim();
}
#endregion
#region private methods
private string RemoveComment(string input)
{
string result = input;
//remove comment
result = Regex.Replace(result, @"<!--[^-]*-->", string.Empty, RegexOptions.IgnoreCase);
return result;
}
private string RemoveStyle(string input)
{
string result = input;
//remove all styles
result = Regex.Replace(result, @"<style[^>]*?>.*?</style>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
return result;
}
private string RemoveScript(string input)
{
string result = input;
result = Regex.Replace(result, @"<script[^>]*?>.*?</script>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
result = Regex.Replace(result, @"<noscript[^>]*?>.*?</noscript>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
return result;
}
private string RemoveTags(string input)
{
string result = input;
result = result.Replace("&nbsp;", " ");
result = result.Replace("&quot;", "\"");
result = result.Replace("&lt;", "<");
result = result.Replace("&gt;", ">");
result = result.Replace("&amp", "&");
result = result.Replace("<br>", "\r\n");
result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty, RegexOptions.IgnoreCase);
return result;
}
#endregion

相关文章

热门资讯

2020微信伤感网名听哭了 让对方看到心疼的伤感网名大全
2020微信伤感网名听哭了 让对方看到心疼的伤感网名大全 2019-12-26
歪歪漫画vip账号共享2020_yy漫画免费账号密码共享
歪歪漫画vip账号共享2020_yy漫画免费账号密码共享 2020-04-07
男生常说24816是什么意思?女生说13579是什么意思?
男生常说24816是什么意思?女生说13579是什么意思? 2019-09-17
沙雕群名称大全2019精选 今年最火的微信群名沙雕有创意
沙雕群名称大全2019精选 今年最火的微信群名沙雕有创意 2019-07-07
玄元剑仙肉身有什么用 玄元剑仙肉身境界等级划分
玄元剑仙肉身有什么用 玄元剑仙肉身境界等级划分 2019-06-21
返回顶部

747
Weibo Article 1 Weibo Article 2 Weibo Article 3 Weibo Article 4 Weibo Article 5 Weibo Article 6 Weibo Article 7 Weibo Article 8 Weibo Article 9 Weibo Article 10 Weibo Article 11 Weibo Article 12 Weibo Article 13 Weibo Article 14 Weibo Article 15 Weibo Article 16 Weibo Article 17 Weibo Article 18 Weibo Article 19 Weibo Article 20 Weibo Article 21 Weibo Article 22 Weibo Article 23 Weibo Article 24 Weibo Article 25 Weibo Article 26 Weibo Article 27 Weibo Article 28 Weibo Article 29 Weibo Article 30 Weibo Article 31 Weibo Article 32 Weibo Article 33 Weibo Article 34 Weibo Article 35 Weibo Article 36 Weibo Article 37 Weibo Article 38 Weibo Article 39 Weibo Article 40