Single thread + multi-task asynchronous coroutines
- Coroutine
When a (special) function is defined with the async keyword, calling it does not run its body immediately; instead the call returns a coroutine object.
- Task object
A task object is a further wrapper around a coroutine object (a higher-level coroutine object built from the special function).
A task object must be registered with an event loop object.
A callback can be bound to a task object; in a crawler, that callback is where the data parsing goes.
- Event loop
Think of it as a container that holds task objects.
When the event loop is started, the task objects stored in it are executed asynchronously.
- Inside a special function, do not use modules that do not support asynchronous calls (e.g. time, requests...). No error is raised, but the blocking call stalls the event loop, so no real concurrency happens (see the sketch below).
time.sleep -- asyncio.sleep
requests -- aiohttp
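A minimal sketch (not from the original post) of why the substitution matters: with time.sleep the tasks run one after another, with asyncio.sleep they overlap:

import asyncio
import time

async def blocking(name):
    time.sleep(1)           # blocking call: stalls the whole event loop
    print('blocking', name)

async def non_blocking(name):
    await asyncio.sleep(1)  # yields control so the other tasks can run
    print('non-blocking', name)

loop = asyncio.get_event_loop()

start = time.time()
loop.run_until_complete(asyncio.wait([asyncio.ensure_future(blocking(i)) for i in range(3)]))
print('time.sleep total:', time.time() - start)      # roughly 3 seconds

start = time.time()
loop.run_until_complete(asyncio.wait([asyncio.ensure_future(non_blocking(i)) for i in range(3)]))
print('asyncio.sleep total:', time.time() - start)   # roughly 1 second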
import asyncio
import time

start_time = time.time()

async def get_request(url):
    await asyncio.sleep(2)
    print(url, 'download finished!')

urls = [
    'www.1.com',
    'www.2.com',
]

task_lst = []  # list of task objects
for url in urls:
    c = get_request(url)                 # coroutine object
    task = asyncio.ensure_future(c)      # task object
    # task.add_done_callback(...)        # bind a callback here if needed
    task_lst.append(task)

loop = asyncio.get_event_loop()                   # event loop object
loop.run_until_complete(asyncio.wait(task_lst))   # register the tasks and block until they finish

print('total time:', time.time() - start_time)    # roughly 2 seconds for both "downloads"
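For reference, on Python 3.7+ the same flow is more commonly written with asyncio.run and asyncio.gather instead of managing the loop by hand; a minimal sketch (not part of the original post):

import asyncio
import time

async def get_request(url):
    await asyncio.sleep(2)
    print(url, 'download finished!')

async def main():
    urls = ['www.1.com', 'www.2.com']
    # gather schedules all the coroutines concurrently and waits for every one of them
    await asyncio.gather(*(get_request(url) for url in urls))

start_time = time.time()
asyncio.run(main())   # creates, runs and closes the event loop for us
print('total time:', time.time() - start_time)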
Thread pool + the requests module
# Thread pool
import time
from multiprocessing.dummy import Pool   # thread pool with the multiprocessing API

start_time = time.time()

url_list = [
    'www.1.com',
    'www.2.com',
    'www.3.com',
]

def get_request(url):
    print('downloading...', url)
    time.sleep(2)
    print('download finished!', url)

pool = Pool(3)                    # 3 worker threads
pool.map(get_request, url_list)   # blocks until every url has been processed

print('total time:', time.time() - start_time)
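The same pattern can also be written with the standard library's concurrent.futures; a minimal sketch (not from the original post), reusing the same fake urls:

import time
from concurrent.futures import ThreadPoolExecutor

def get_request(url):
    print('downloading...', url)
    time.sleep(2)
    print('download finished!', url)
    return url

url_list = ['www.1.com', 'www.2.com', 'www.3.com']

start_time = time.time()
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(get_request, url_list))   # same map interface as multiprocessing.dummy
print('total time:', time.time() - start_time)        # roughly 2 seconds, same as before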
Two ways to improve crawler efficiency
Start a Flask server
from flask import Flask
import time

app = Flask(__name__)

@app.route('/bobo')
def index_bobo():
    time.sleep(2)
    return 'hello bobo!'

@app.route('/jay')
def index_jay():
    time.sleep(2)
    return 'hello jay!'

@app.route('/tom')
def index_tom():
    time.sleep(2)
    return 'hello tom!'

if __name__ == '__main__':
    app.run(threaded=True)   # threaded=True lets the dev server handle requests concurrently
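Note that these routes return plain strings, so the xpath callback example further down will have nothing to extract. If you want that example to print something, a route can return a small HTML snippet containing a <ul>; a hedged variant of one route (the markup is illustrative, not from the original post):

@app.route('/bobo')
def index_bobo():
    time.sleep(2)
    # minimal HTML so that /html/body/ul//text() has something to match
    return '<html><body><ul><li>hello bobo!</li><li>hello again!</li></ul></body></html>'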
aiohttp module + single-thread multi-task asynchronous coroutines
import asyncio
import aiohttp
import requests   # kept only for the commented-out comparison below
import time

start = time.time()

async def get_page(url):
    # requests is synchronous and would block the event loop:
    # page_text = requests.get(url=url).text
    # print(page_text)
    # return page_text
    async with aiohttp.ClientSession() as s:          # create a session object
        async with await s.get(url=url) as response:
            page_text = await response.text()
            print(page_text)
            return page_text

urls = [
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/tom',
]

tasks = []
for url in urls:
    c = get_page(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

end = time.time()
print(end - start)

# executed asynchronously!
# hello tom!
# hello bobo!
# hello jay!
# 2.0311079025268555
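If you want the page texts back in the calling code instead of printing inside the coroutine, asyncio.gather returns the results in the order the urls were given; a short sketch that assumes the get_page coroutine and urls list defined above:

loop = asyncio.get_event_loop()
pages = loop.run_until_complete(asyncio.gather(*(get_page(url) for url in urls)))   # list of page texts
for page_text in pages:
    print(len(page_text))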
'''
aiohttp module + single-thread, multi-task asynchronous coroutines,
with the data parsed by xpath.
'''
import aiohttp
import asyncio
from lxml import etree
import time

start = time.time()

# special function: sends the request and captures the data
# note the async with / await keywords
async def get_request(url):
    async with aiohttp.ClientSession() as s:
        async with await s.get(url=url) as response:
            page_text = await response.text()
            return page_text   # return the page source

# callback function: parses the data
def parse(task):
    page_text = task.result()
    tree = etree.HTML(page_text)
    # the Flask routes above return plain strings; this xpath only finds
    # something if the server returns HTML containing a <ul>
    msg = tree.xpath('/html/body/ul//text()')
    print(msg)

urls = [
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/tom',
]

tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)
    task.add_done_callback(parse)   # bind the callback!
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))

end = time.time()
print(end - start)
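With a longer url list you may not want every request in flight at once. A hedged sketch (the limit and names are illustrative, not from the original post) that caps concurrency with asyncio.Semaphore:

import asyncio
import aiohttp

sem = asyncio.Semaphore(10)   # at most 10 requests in flight at any moment

async def bounded_get(url):
    async with sem:   # waits here if 10 requests are already running
        async with aiohttp.ClientSession() as s:
            async with s.get(url) as response:
                return await response.text()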
requests module + thread pool
import time
import requests
from multiprocessing.dummy import Pool

start = time.time()

urls = [
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/tom',
]

def get_request(url):
    page_text = requests.get(url=url).text
    print(page_text)
    return page_text

pool = Pool(3)
pool.map(get_request, urls)

end = time.time()
print('total time:', end - start)

# the requests are handled concurrently
# hello jay!
# hello bobo!
# hello tom!
# total time: 2.0467123985290527
Summary
- Two ways of speeding up a crawler covered so far:
aiohttp module + single-thread multi-task asynchronous coroutines
requests module + thread pool
- Three modules touched on for making requests:
requests
urllib
aiohttp
- Also briefly used Flask to spin up a test server.
Original link: https://www.cnblogs.com/straightup/p/13676391.html