This article shows how to scrape high-resolution League of Legends desktop wallpapers with Python 3, using a Scrapy spider. It is shared for your reference; the details follow.
Source code: https://github.com/snowyme/loldesk
Python 3 and Scrapy must be installed before starting the project; installation is not covered in detail here, so look it up if you are unfamiliar with it.
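If Scrapy is not installed yet, it can usually be pulled in with a single pip command (assuming Python 3 and pip are already on your PATH):

pip install scrapy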
First, create the project:
scrapy startproject loldesk
This generates the project's directory structure.
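For orientation, the layout that scrapy startproject generates typically looks like this (the exact files vary a little across Scrapy versions):

loldesk/
    scrapy.cfg            # deploy configuration
    loldesk/              # the project's Python package
        __init__.py
        items.py          # item definitions, edited below
        middlewares.py
        pipelines.py      # item pipelines, edited below
        settings.py       # project settings, edited below
        spiders/          # the spider written below lives here
            __init__.py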
First we need to define the elements to scrape. In items.py, this project uses the image name and its links; a third field, image_paths, is also declared so the pipeline below can record where each file was saved:
import scrapy


class LoldeskItem(scrapy.Item):
    name = scrapy.Field()         # image title
    imgurl = scrapy.Field()       # list of image URLs
    image_paths = scrapy.Field()  # saved file paths, filled in by the pipeline
Next, create the spider file in the spiders directory and write the main code, loldesk.py:
import scrapy
from loldesk.items import LoldeskItem


class LoldeskpiderSpider(scrapy.Spider):
    name = "loldesk"
    allowed_domains = ["www.win4000.com"]
    # starting URL for the crawl
    start_urls = ['http://www.win4000.com/zt/lol.html']

    def parse(self, response):
        pic_list = response.css(".left_bar ul li")
        for img in pic_list:
            # follow each gallery link to its detail page
            imgurl = str(img.css("a::attr(href)").extract_first())
            yield scrapy.Request(imgurl, callback=self.content)
        # follow the "next" link of the list page, if there is one
        next_url = response.css(".next::attr(href)").extract_first()
        if next_url is not None:
            yield response.follow(next_url, callback=self.parse)

    def content(self, response):
        item = LoldeskItem()
        item['name'] = response.css(".pic-large::attr(title)").extract_first()
        item['imgurl'] = response.css(".pic-large::attr(src)").extract()
        yield item
        # check the page number: keep following the next picture of this
        # gallery until the current page number reaches the total
        next_url = response.css(".pic-next-img a::attr(href)").extract_first()
        allnum = response.css(".ptitle em::text").extract_first()
        thisnum = next_url[-6:-5]
        if int(allnum) > int(thisnum):
            yield response.follow(next_url, callback=self.content)
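The thisnum slice in content() is worth unpacking. It assumes the gallery's per-picture URLs put the current page number six characters from the end, i.e. something like ..._2.html; the URL below is a made-up illustration, not one taken from the site:

# hypothetical detail-page URL, for illustration only
next_url = 'http://www.win4000.com/wallpaper_154112_2.html'
print(next_url[-6:-5])  # -> '2'

Note that a one-character slice only works while the page count stays in single digits; a regular expression such as re.search(r'_(\d+)\.html$', next_url) would be more robust.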
The image links and names are now being collected; next, the images pipeline downloads the files and saves them locally, in pipelines.py:
import re

from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request


class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # issue one download request per image URL, carrying the
        # image title along in the request meta
        for image_url in item['imgurl']:
            yield Request(image_url, meta={'item': item['name']})

    def file_path(self, request, response=None, info=None):
        name = request.meta['item']
        # strip characters that are illegal in Windows paths, plus digits
        name = re.sub(r'[?\\*|“<>:/()0123456789]', '', name)
        image_guid = request.url.split('/')[-1]
        filename = u'full/{0}/{1}'.format(name, image_guid)
        return filename

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Item contains no images')
        item['image_paths'] = image_paths
        return item
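As a quick sanity check on the filename cleanup, the re.sub call simply deletes every character that appears in the bracketed class (the sample title here is invented):

import re

title = 'LOL(12): Ahri/Wallpaper?'  # hypothetical image title
clean = re.sub(r'[?\\*|“<>:/()0123456789]', '', title)
print(clean)  # -> 'LOL AhriWallpaper'

Digits are stripped along with the characters Windows forbids in paths, so two galleries whose titles differ only by a number will end up sharing a folder.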
Finally, set the storage directory in settings.py and enable the pipeline:
# where downloaded images are stored
IMAGES_STORE = 'F:/python/loldesk'
# enable the pipeline middleware
ITEM_PIPELINES = {
    'loldesk.pipelines.MyImagesPipeline': 300,
}
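One dependency worth flagging: Scrapy's ImagesPipeline relies on Pillow for image processing, and Scrapy disables the pipeline at startup if it is missing, so make sure it is installed:

pip install pillow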
Run the spider from the project root:
scrapy crawl loldesk
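To keep a record of the scraped items alongside the downloaded files, Scrapy's built-in -o flag can export them to a feed file, for example:

scrapy crawl loldesk -o items.json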
Done! The crawl produced a total of 128 folders.
I hope this article is helpful to readers working on Python programming.
Original article: https://blog.csdn.net/ziwoods/article/details/84321188