服务器之家

服务器之家 > 正文

Python抓取电影天堂电影信息的代码

时间:2020-08-18 10:13     来源/作者:itjh

Python2.7Mac OS

抓取的是电影天堂里面最新电影的页面。链接地址: http://www.dytt8.net/html/gndy/dyzz/index.html

获取页面的中电影详情页链接

?
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
import urllib2
import os
import re
import string
 
 
# 电影URL集合
movieUrls = []
 
 
# 获取电影列表
def queryMovieList():
 
 url = 'http://www.dytt8.net/html/gndy/dyzz/index.html'
 conent = urllib2.urlopen(url)
 conent = conent.read()
 conent = conent.decode('gb2312','ignore').encode('utf-8','ignore')
 pattern = re.compile ('<div class="title_all"><h1><font color=#008800>.*?</a>></font></h1></div>'+
      '(.*?)<td height="25" align="center" bgcolor="#F4FAE2"> ',re.S)
 items = re.findall(pattern,conent)
 
 str = ''.join(items)
 pattern = re.compile ('<a href="(.*?)" class="ulink">(.*?)</a>.*?<td colspan.*?>(.*?)</td>',re.S)
 news = re.findall(pattern, str)
 
 for j in news:
  
    movieUrls.append('http://www.dytt8.net'+j[0])

抓取详情页中的电影数据

?
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
def queryMovieInfo(movieUrls):
 
 for index, item in enumerate(movieUrls):
 
 print('电影URL: ' + item)
 
 conent = urllib2.urlopen(item)
 conent = conent.read()
 conent = conent.decode('gb2312','ignore').encode('utf-8','ignore')
 
 
 movieName = re.findall(r'<div class="title_all"><h1><font color=#07519a>(.*?)</font></h1></div>', conent, re.S)
 if (len(movieName) > 0):
  movieName = movieName[0] + ""
  # 截取名称
  movieName = movieName[movieName.find("《") + 3:movieName.find("》")]
 else:
  movieName = ""
 
 print("电影名称: " + movieName.strip())
 
 movieContent = re.findall(r'<div class="co_content8">(.*?)</tbody>',conent , re.S)
 
 
 pattern = re.compile('<ul>(.*?)<tr>', re.S)
 movieDate = re.findall(pattern,movieContent[0])
 
 if (len(movieDate) > 0):
  movieDate = movieDate[0].strip() + ''
 else:
  movieDate = ""
 
 print("电影发布时间: " + movieDate[-10:])
 
 pattern = re.compile('<br /><br />(.*?)<br /><br /><img')
 movieInfo = re.findall(pattern, movieContent[0])
 
 if (len(movieInfo) > 0):
  movieInfo = movieInfo[0]+''
 
  # 删除<br />标签
  movieInfo = movieInfo.replace("<br />","")
 
  # 根据 ◎ 符号拆分
 
  movieInfo = movieInfo.split('◎')
 
 else:
  movieInfo = ""
 
 print("电影基础信息: ")
 
 for item in movieInfo:
  print(item)
 
 
 # 电影海报
 pattern = re.compile('<img.*? src="(.*?)".*? />', re.S)     
 movieImg = re.findall(pattern,movieContent[0])
 
 if (len(movieImg) > 0):
  movieImg = movieImg[0]
 else:
  movieImg = ""
 
 print("电影海报: " + movieImg)
 
 pattern = re.compile('<td style="WORD-WRAP: break-word" bgcolor="#fdfddf"><a href="(.*?)">.*?</a></td>', re.S)
 movieDownUrl = re.findall(pattern,movieContent[0])
 
 if (len(movieDownUrl) > 0):
  movieDownUrl = movieDownUrl[0]
 else:
  movieDownUrl = ""
 
 print("电影下载地址:" + movieDownUrl + "")
 
 print("------------------------------------------------\n\n\n")

执行抓取

?
1
2
3
4
5
6
7
8
9
if __name__=='__main__':
 
  print("开始抓取电影数据");
 
  queryMovieList()
  print(len(movieUrls))
 
  queryMovieInfo(movieUrls)
  print("结束抓取电影数据")

总结

学好正则表达式很重要,很重要,很重要!!!! Python的语法好有感觉, 对比Java …

标签:

相关文章

热门资讯

2020微信伤感网名听哭了 让对方看到心疼的伤感网名大全
2020微信伤感网名听哭了 让对方看到心疼的伤感网名大全 2019-12-26
歪歪漫画vip账号共享2020_yy漫画免费账号密码共享
歪歪漫画vip账号共享2020_yy漫画免费账号密码共享 2020-04-07
Intellij idea2020永久破解,亲测可用!!!
Intellij idea2020永久破解,亲测可用!!! 2020-07-29
男生常说24816是什么意思?女生说13579是什么意思?
男生常说24816是什么意思?女生说13579是什么意思? 2019-09-17
沙雕群名称大全2019精选 今年最火的微信群名沙雕有创意
沙雕群名称大全2019精选 今年最火的微信群名沙雕有创意 2019-07-07
返回顶部

609
Weibo Article 1 Weibo Article 2 Weibo Article 3 Weibo Article 4 Weibo Article 5 Weibo Article 6 Weibo Article 7 Weibo Article 8 Weibo Article 9 Weibo Article 10 Weibo Article 11 Weibo Article 12 Weibo Article 13 Weibo Article 14 Weibo Article 15 Weibo Article 16 Weibo Article 17 Weibo Article 18 Weibo Article 19 Weibo Article 20 Weibo Article 21 Weibo Article 22 Weibo Article 23 Weibo Article 24 Weibo Article 25 Weibo Article 26 Weibo Article 27 Weibo Article 28 Weibo Article 29 Weibo Article 30 Weibo Article 31 Weibo Article 32 Weibo Article 33 Weibo Article 34 Weibo Article 35 Weibo Article 36 Weibo Article 37 Weibo Article 38 Weibo Article 39 Weibo Article 40