Using Regular Expressions in Python Web Scraping: A Detailed Tutorial

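Using regular expressions

re.match(pattern, string, flags=0)

re.match tries to match a pattern at the start of the string. If the pattern does not match at the start, match() returns None.

Parameters:

pattern: the regular expression

string: the target string to match against

flags: the matching mode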
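As a minimal sketch of the start-of-string behaviour described above (the sample string and variable names are illustrative, not from the original):

```python
import re

text = 'hello world'  # illustrative sample string

# 'hello' sits at position 0, so re.match succeeds.
at_start = re.match(r'hello', text)

# 'world' occurs later in the string, so re.match returns None.
not_at_start = re.match(r'world', text)

print(at_start.group())
print(not_at_start)
```

Because a failed match is None rather than a match object, calling .group() on it raises AttributeError, so it is worth checking the result first.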

Regex matching modes:

[Figure: regex matching modes]
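Three modes from the standard re module are sketched below: re.I (ignore case), re.M (multi-line anchors) and re.S (dot also matches a newline). The sample strings are made up for illustration.

```python
import re

# re.I: case-insensitive matching.
ignore_case = re.match(r'hello', 'HELLO world', re.I)

# re.M: ^ and $ also match at the start and end of each line.
line_starts = re.findall(r'^\w+', 'first line\nsecond line', re.M)

# re.S: . also matches the newline character.
across_lines = re.match(r'a.*b', 'a\nb', re.S)

# Flags can be combined with the | operator.
combined = re.match(r'a.*B', 'A\nb', re.I | re.S)

print(ignore_case.group())
print(line_starts)
print(across_lines.span())
print(combined.span())
```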

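A basic match

    import re

    content = 'hello 123456 world_this is a regex demo'
    print(len(content))

    # Raw string, so the backslashes reach the regex engine unchanged.
    result = re.match(r'^hello\s\d{6}\s\w{10}.*demo$', content)
    print(result)
    print(result.group())  # the matched text
    print(result.span())   # the (start, end) range of the match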

Output (Python 3.7 and later print re.Match instead of _sre.SRE_Match):

39
<_sre.SRE_Match object; span=(0, 39), match='hello 123456 world_this is a regex demo'>
hello 123456 world_this is a regex demo
(0, 39)

Generic matching

Use .* to match any run of characters.

    import re

    content = 'hello 123456 world_this is a regex demo'
    result = re.match(r'^hello.*demo$', content)
    print(result)
    print(result.group())

Output:

<_sre.SRE_Match object; span=(0, 39), match='hello 123456 world_this is a regex demo'>
hello 123456 world_this is a regex demo

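Capturing a target

Wrap the part of the pattern you want to extract in parentheses ().

group(1) returns the first captured group, group(2) the second, and so on, which lets you pull out exactly the content you need.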

    import re

    content = 'hello 123456 world_this is a regex demo'
    result = re.match(r'^hello\s(\d{6})\s.*demo$', content)
    print(result)
    print(result.group(1))  # the captured target

Output:

<_sre.SRE_Match object; span=(0, 39), match='hello 123456 world_this is a regex demo'>
123456
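The example above uses a single group. A small sketch with two groups, assuming a second pair of parentheses around \w+ (an addition for illustration), shows how group(1), group(2) and groups() relate:

```python
import re

content = 'hello 123456 world_this is a regex demo'

# Two pairs of parentheses give two capture groups.
result = re.match(r'^hello\s(\d{6})\s(\w+)', content)

first = result.group(1)   # text captured by the first pair
second = result.group(2)  # text captured by the second pair
both = result.groups()    # all captured groups as a tuple

print(first, second, both)
```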

Greedy matching

    import re

    content = 'hello 123456 world_this is a regex demo'
    result = re.match(r'^he.*(\d+).*demo$', content)
    print(result)
    print(result.group(1))

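Note: .* matches as many characters as possible. Here it consumes 'llo 12345' and leaves only the final 6 for (\d+), so group(1) returns 6.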

Non-greedy matching

    import re

    content = 'hello 123456 world_this is a regex demo'
    result = re.match(r'^he.*?(\d+).*demo$', content)
    print(result)
    print(result.group(1))

Note: .*? matches as few characters as possible, so (\d+) captures the whole 123456.
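The two modes can be compared side by side. The sketch below reuses the patterns above and adds two illustrative patterns (not from the original) showing a common pitfall: a non-greedy group at the very end of a pattern matches nothing unless something after it forces it to extend.

```python
import re

content = 'hello 123456 world_this is a regex demo'

# Greedy: .* takes as much as it can and gives back only one digit.
greedy = re.match(r'^he.*(\d+).*demo$', content).group(1)

# Non-greedy: .*? stops as soon as \d+ can begin.
lazy = re.match(r'^he.*?(\d+).*demo$', content).group(1)

# A non-greedy group at the very end of a pattern matches nothing ...
tail_open = re.match(r'^hello (\d+)(.*?)', content).group(2)

# ... unless something after it, such as $, forces it to extend.
tail_anchored = re.match(r'^hello (\d+)(.*?)$', content).group(2)

print(repr(greedy), repr(lazy), repr(tail_open), repr(tail_anchored))
```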

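Using matching modes

HTML source usually contains line breaks, and by default . does not match a newline. Pass re.S so that . matches newlines as well.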

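    import re

    # The backslash joins the two literals, so this string contains no newline;
    # re.S matters once the content really spans several lines.
    content = 'hello 123456 world_this ' \
              'is a regex demo'
    result = re.match(r'^he.*?(\d+).*?demo$', content, re.S)
    print(result)
    print(result.group(1))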

Output:

<_sre.SRE_Match object; span=(0, 39), match='hello 123456 world_this is a regex demo'>
123456
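To see what re.S changes, the content needs a real newline. The sketch below assumes a triple-quoted variant of the same string (a change made here for illustration) and runs the same pattern with and without the flag:

```python
import re

# Triple-quoted string, so the content really contains a newline.
content = '''hello 123456 world_this
is a regex demo'''

without_flag = re.match(r'^he.*?(\d+).*?demo$', content)
with_flag = re.match(r'^he.*?(\d+).*?demo$', content, re.S)

print(without_flag)        # . cannot cross the newline, so there is no match
print(with_flag.group(1))  # with re.S the pattern spans both lines
```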

Escaping

Special characters in the target text must be escaped in the pattern, such as the $ sign and the . below.

    import re

    content = 'price is $5.00'
    # \$ and \. match a literal dollar sign and a literal dot.
    result = re.match(r'^price.*\$5\.00', content)
    print(result.group())
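When the literal text comes from a variable, escaping by hand is error-prone. One option, sketched here with the standard re.escape function (not used in the original), is to let the library add the backslashes:

```python
import re

content = 'price is $5.00'

# Unescaped, $ is an end-of-string anchor, so this pattern cannot match.
unescaped = re.search('$5.00', content)

# re.escape backslash-escapes the regex metacharacters in a literal string.
literal = re.escape('$5.00')
escaped = re.search(literal, content)

print(unescaped)
print(literal)
print(escaped.group())
```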

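Summary: prefer generic matching, use parentheses to capture the target, prefer non-greedy mode, and pass re.S when the text contains line breaks.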

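re.search(pattern, string, flags=0)

re.search scans the whole string and returns the first successful match.

For example, to extract 123456 from the string, re.match with the pattern \d{6} fails because the digits are not at the start of the string; re.search finds them.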

    import re

    content = 'hello 123456 world_this is a regex demo'

    # re.match only tries position 0, so this prints None.
    result = re.match(r'\d{6}', content)
    print(result)

    # re.search scans forward until the pattern matches.
    result = re.search(r'\d{6}', content)
    print(result)
    print(result.group())

Output:

None
<_sre.SRE_Match object; span=(6, 12), match='123456'>
123456
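Since re.search also returns None when nothing matches, a small wrapper can guard the call to .group(). The helper name first_number below is hypothetical and only meant as a sketch:

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    result = re.search(r'\d+', text)
    # re.search returns None when nothing matches, so guard before .group().
    return result.group() if result is not None else None

found = first_number('hello 123456 world_this is a regex demo')
missing = first_number('no digits here')

print(found)
print(missing)
```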

Matching practice

Matching the parts of the markup that share the same structure returns just the content you need.
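A minimal sketch of this idea follows. The HTML is hypothetical, and the standard re.findall function is used to collect every match; each result is a tuple with one item per capture group.

```python
import re

# Hypothetical HTML; each <li> has the same structure.
html = '''
<ul>
  <li><a href="/song/1" singer="Singer A">Song One</a></li>
  <li><a href="/song/2" singer="Singer B">Song Two</a></li>
</ul>
'''

# re.S lets .*? cross line breaks; the two groups capture the link and title.
results = re.findall(r'<li>.*?href="(.*?)".*?>(.*?)</a>', html, re.S)

for href, title in results:
    print(href, title)
```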

[Code listing truncated in the source; only its opening lines (import re, html = ''', <li>) survive.]
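re.sub replaces every match of the pattern in the string and returns the resulting string. The first example removes the digits:

    import re

    content = 'hello 123456 world_this is a regex demo'
    content = re.sub(r'\d+', '', content)
    print(content)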

Output:

hello  world_this is a regex demo

    import re

    content = 'hello 123456 world_this is a regex demo'
    content = re.sub(r'\d+', 'what', content)
    print(content)

Output:

hello what world_this is a regex demo

    import re

    content = 'hello 123456 world_this is a regex demo'
    content = re.sub(r'(\d+)', r'\1 789', content)
    print(content)

Output:

hello 123456 789 world_this is a regex demo


Note: here \1 refers to the text captured by the first group, 123456.
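Backreferences also work with more than one group. The patterns below are illustrative additions: the first swaps two groups, and the second uses the \g<1> spelling, which avoids ambiguity when a digit follows the reference.

```python
import re

content = 'hello 123456 world_this is a regex demo'

# \1 and \2 refer to the first and second capture groups.
swapped = re.sub(r'(hello) (\d+)', r'\2 \1', content)

# \g<1>0 means "group 1, then a literal 0"; \10 would mean group 10.
suffixed = re.sub(r'(\d+)', r'\g<1>0', content)

print(swapped)
print(suffixed)
```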

Practice

Here we replace the li tags.
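One way to do this, sketched with a hypothetical HTML string and a pattern chosen for illustration, is to replace every opening and closing li tag with an empty string:

```python
import re

# Hypothetical HTML used only for this sketch.
html = '<ul><li>first</li><li class="active">second</li></ul>'

# </?li matches both <li and </li; \b keeps tags such as <link> untouched;
# [^>]* skips any attributes up to the closing >.
without_li = re.sub(r'</?li\b[^>]*>', '', html)

print(without_li)
```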

	
		
			
[Code listing truncated in the source; the fragment that survives is shown below.]

    <a title="网络歌曲" href="/doc/2703035-2853927.html" rel="external nofollow" target="_blank" data-log="old:2703035-2853885,new:2703035-2853927" data-cid="sense-list">网络歌曲</a>
    <a title="2009年中信出版社出版图书" href="/doc/2703035-2853985.html" rel="external nofollow" target="_blank" data-log="old:2703035-2853885,new:2703035-2853985" data-cid="sense-list">2009年中信出版社出版图书</a>

re.compile(pattern[, flags])

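This function builds a pattern object from a string containing a regular expression, which makes repeated matching more efficient.

Compiling the regular expression into a pattern object lets you reuse the same pattern.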

	
		
			
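    import re

    # The backslash joins the two literals, so this string contains no newline.
    content = 'hello 123456 ' \
              'world_this is a regex demo'
    pattern = re.compile(r'hello.*?demo', re.S)
    # The compiled pattern can be passed to re.match in place of a string.
    result = re.match(pattern, content)
    print(result.group())
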


Output:

hello 123456 world_this is a regex demo
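A compiled pattern also has its own match, search and findall methods, which is where the reuse pays off. The sample lines and variable names below are made up for illustration:

```python
import re

# Compile once, then reuse the pattern object's own methods.
number = re.compile(r'\d+')

lines = ['hello 123456 world', 'no digits', 'a 7 and 42']  # sample data

all_numbers = [number.findall(line) for line in lines]
first_hit = number.search(lines[0]).group()

print(all_numbers)
print(first_hit)
```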


	Putting it all together

	
		
			
    import re

    html = '''
    <div class="slide-page" style="width: 700px;" data-index="1">
        <a class="item" target="_blank" href="https://movie.douban.com/subject/26725678/?tag=热门&from=gaia">
          <div class="cover-wp" data-isnew="false" data-id="26725678">
            <img src="https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2525020357.jpg" alt="解除好友2：暗网" data-x="694" data-y="1000">
          </div>
          <p>
            解除好友2：暗网
              <strong>7.9</strong>
          </p>
        </a>
        <a class="item" target="_blank" href="https://movie.douban.com/subject/26916229/?tag=热门&from=gaia_video">
          <div class="cover-wp" data-isnew="false" data-id="26916229">
            <img src="https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2532008868.jpg" alt="镰仓物语" data-x="2143" data-y="2993">
          </div>
          <p>
            镰仓物语
              <strong>6.9</strong>
          </p>
        </a>
        <a class="item" target="_blank" href="https://movie.douban.com/subject/26683421/?tag=热门&from=gaia">
          <div class="cover-wp" data-isnew="false" data-id="26683421">
            <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2528281606.jpg" alt="特工" data-x="690" data-y="986">
          </div>
          <p>
            特工
              <strong>8.3</strong>
          </p>
        </a>
        <a class="item" target="_blank" href="https://movie.douban.com/subject/27072795/?tag=热门&from=gaia">
          <div class="cover-wp" data-isnew="false" data-id="27072795">
            <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2521583093.jpg" alt="幸福的拉扎罗" data-x="640" data-y="914">
          </div>
          <p>
            幸福的拉扎罗
              <strong>8.6</strong>
          </p>
        </a>
        <a class="item" target="_blank" href="https://movie.douban.com/subject/27201353/?tag=热门&from=gaia_video">
          <div class="cover-wp" data-isnew="false" data-id="27201353">
            <img src="https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2528842218.jpg" alt="大师兄" data-x="679" data-y="950">
          </div>
          <p>
            大师兄
              <strong>5.2</strong>
          </p>
        </a>
        <a class="item" target="_blank" href="https://movie.douban.com/subject/30146756/?tag=热门&from=gaia_video">
          <div class="cover-wp" data-isnew="false" data-id="30146756">
            <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2530872223.jpg" alt="风语咒" data-x="1079" data-y="1685">
          </div>
          <p>
            风语咒
              <strong>6.9</strong>
          </p>
        </a>
        <a class="item" target="_blank" href="https://movie.douban.com/subject/26630714/?tag=热门&from=gaia">
          <div class="cover-wp" data-isnew="false" data-id="26630714">
            <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2530591543.jpg" alt="精灵旅社3：疯狂假期" data-x="1063" data-y="1488">
          </div>
          <p>
            精灵旅社3：疯狂假期
              <strong>6.8</strong>
          </p>
        </a>
        <a class="item" target="_blank" href="https://movie.douban.com/subject/25882296/?tag=热门&from=gaia_video">
          <div class="cover-wp" data-isnew="false" data-id="25882296">
            <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2526405034.jpg" alt="狄仁杰之四大天王" data-x="2500" data-y="3500">
          </div>
          <p>
            狄仁杰之四大天王
              <strong>6.2</strong>
          </p>
        </a>
        <a class="item" target="_blank" href="https://movie.douban.com/subject/26804147/?tag=热门&from=gaia_video">
          <div class="cover-wp" data-isnew="false" data-id="26804147">
            <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2527484082.jpg" alt="摩天营救" data-x="1371" data-y="1920">
          </div>
          <p>
            摩天营救
              <strong>6.4</strong>
          </p>
        </a>
        <a class="item" target="_blank" href="https://movie.douban.com/subject/24773958/?tag=热门&from=gaia_video">
          <div class="cover-wp" data-isnew="false" data-id="24773958">
            <img src="https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2517753454.jpg" alt="复仇者联盟3：无限战争" data-x="1968" data-y="2756">
          </div>
          <p>
            复仇者联盟3：无限战争
              <strong>8.1</strong>
          </p>
        </a>
      </div>
    '''
								
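    # The source omits the line that builds result. The pattern below is an
    # assumption, chosen to reproduce the (title, rating) pairs in the output.
    result = re.findall(r'alt="(.*?)".*?<strong>(.*?)</strong>', html, re.S)

    for item in result:
        print(item)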
	


Output:

('解除好友2：暗网', '7.9')
('镰仓物语', '6.9')
('特工', '8.3')
('幸福的拉扎罗', '8.6')
('大师兄', '5.2')
('风语咒', '6.9')
('精灵旅社3：疯狂假期', '6.8')
('狄仁杰之四大天王', '6.2')
('摩天营救', '6.4')
('复仇者联盟3：无限战争', '8.1')
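A variation on the same extraction is sketched below, using named groups and the pattern object's finditer method. The HTML here is a hypothetical, much shorter stand-in for the page above, and the group names title and score are chosen for illustration.

```python
import re

# Hypothetical markup with the same alt="..." / <strong>...</strong> shape.
html = '''
<img src="a.jpg" alt="Movie A">
<p>Movie A <strong>7.9</strong></p>
<img src="b.jpg" alt="Movie B">
<p>Movie B <strong>6.9</strong></p>
'''

# (?P<name>...) gives each capture group a readable name.
pattern = re.compile(
    r'alt="(?P<title>.*?)".*?<strong>(?P<score>.*?)</strong>', re.S)

# finditer yields one match object per hit; convert the score to a float.
movies = [(m.group('title'), float(m.group('score')))
          for m in pattern.finditer(html)]

print(movies)
```

Named groups keep the extraction readable when a pattern grows beyond two or three captures.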


	Summary

	That concludes this introduction to using regular expressions in Python web scraping. I hope it helps; if you have any questions, please leave a comment.

	Original article: http://www.cnblogs.com/-wenli/p/9846116.html