Analysing Reviews of 《无名之辈》 (A Cool Fish) with Python and 20,000 Maoyan Comments

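1. Overview

This article collects user comments from Maoyan Movies (猫眼电影) and analyses them. The crawler can be pointed at the comments of any movie, not just this one.

Environment: Windows 10, Python 3.5.

Analysis libraries: jieba, wordcloud, pyecharts, matplotlib.

Basic flow: download content ---> extract the key fields ---> save to a local file ---> analyse the local file and draw charts

Note: all text, figures, and source code in this article are for study only. Please do not use them for other purposes, and credit the source when reposting.

Main reference: https://mp.weixin.qq.com/s/mtxxkwrzpgbikc3sv-jo3g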

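2. Collecting the Data

2.1 Analysing the data API

To get a fuller data sample, the comments are collected directly from the mobile API. The link is shown below; the number in the path (1208282 here) is the Maoyan movie id, and changing it lets you crawl other movies.

URL: http://m.maoyan.com/mmdb/comments/movie/1208282.json?v=yes&offset=15&starttime=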
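As a minimal sketch, the request URL can be assembled from its parts as shown here. The helper name `build_comment_url` is hypothetical; the path and the three query parameters are the ones visible in the link above, and their exact semantics are assumed rather than documented.

```python
from urllib.parse import urlencode

# Path pattern taken from the link above; movie_id is the only variable part.
BASE = "http://m.maoyan.com/mmdb/comments/movie/{movie_id}.json"


def build_comment_url(movie_id, offset=15, starttime=""):
    """Return the comment URL for one movie (hypothetical helper)."""
    # A list of pairs keeps the parameter order stable on every Python 3 version.
    query = urlencode([("v", "yes"), ("offset", offset), ("starttime", starttime)])
    return BASE.format(movie_id=movie_id) + "?" + query
```

Calling `build_comment_url(1208282)` reproduces the link above; passing a timestamp string as `starttime` percent-encodes it for use in the query string.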


The API returns JSON. The fields collected are nickname, city, comment text, score, and time; the user comments are in json['cmts'].

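2.2 Core of the crawler (the full source code is in the appendix)

> The script takes the following arguments (script name + Maoyan movie id + release date + output file name): .\mymoviecomment.py 1208282 2018-11-16 mycmts2.txt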
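The three positional arguments could also be validated with `argparse` instead of reading `sys.argv` directly. This is only a sketch: the function name `parse_args` and the argument names are assumptions, not part of the original script.

```python
import argparse
from datetime import datetime


def parse_args(argv):
    """Parse movie id, release date and output file name (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Collect Maoyan comments for one movie")
    parser.add_argument("movie_id", help="Maoyan movie id, e.g. 1208282")
    # Reject malformed dates early instead of failing in the middle of a crawl.
    parser.add_argument("release_date", help="release date as YYYY-MM-DD",
                        type=lambda s: datetime.strptime(s, "%Y-%m-%d"))
    parser.add_argument("filename", help="name of the output file")
    return parser.parse_args(argv)
```

In a script this would be called as `parse_args(sys.argv[1:])`; a badly formatted date then produces a usage message rather than a traceback.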

> Download the content: download(self, url) fetches the URL with the requests module and converts the response to JSON

    def download(self, url):
        """Download the content at url and return it as parsed JSON."""
        print("Downloading url: " + url)
        response = requests.get(url, headers=self.headers)
        # Only parse the body as JSON when the request succeeded
        if response.status_code == 200:
            return response.json()
        else:
            # print(response.status_code)
            print('Download returned no data!')
            return ""

> Next, parse the downloaded content and pull out the fields we need:

    def parse(self, content):
        """Extract the fields we need from one page of results."""
        comments = []
        try:
            for item in content['cmts']:
                comment = {
                    'nickname': item['nickname'],    # nickname
                    'cityname': item['cityname'],    # city
                    'content': item['content'],      # comment text
                    'score': item['score'],          # score
                    'starttime': item['starttime'],  # time
                }
                comments.append(comment)
        except Exception as e:
            print(e)
        # Return whatever was parsed before any error occurred
        return comments
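One property of the parse method above is worth noting: if a single item lacks one of the keys, the KeyError ends the loop and the rest of that page is dropped. A more tolerant variant might look like the sketch below. The function name `extract_comments` and the choice of an empty string as the default value are assumptions.

```python
FIELDS = ("nickname", "cityname", "content", "score", "starttime")


def extract_comments(payload):
    """Return one dict per comment, filling missing fields with ''."""
    comments = []
    # `payload or {}` also covers the empty string returned by a failed download.
    for item in (payload or {}).get("cmts") or []:
        comments.append({key: item.get(key, "") for key in FIELDS})
    return comments
```

With this version a comment without a city is still kept, and the empty city name can be skipped later when counting.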

> Save the parsed data locally so it can be analysed later:

    def save(self, data):
        """Write data to the output file."""
        print("Saving data, writing to file...")
        self.save_file.write(data)
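Comment text can contain line breaks, so a record written as one `###`-joined line may end up spanning several physical lines; the analysis code in the appendix works around this by re-joining short rows. An alternative, sketched here under the assumption that flattening line breaks is acceptable for this analysis, is to normalise each record before writing it. The helper names `format_record` and `parse_record` are hypothetical.

```python
FIELDS = ("nickname", "cityname", "content", "score", "starttime")
SEP = "###"


def format_record(comment):
    """Serialise one comment dict as a single physical line."""
    values = []
    for key in FIELDS:
        text = str(comment.get(key, ""))
        # Assumption: replacing line breaks with spaces loses nothing we need.
        values.append(text.replace("\r", " ").replace("\n", " "))
    return SEP.join(values) + "\n"


def parse_record(line):
    """Inverse of format_record; returns None for malformed lines."""
    parts = line.rstrip("\n").split(SEP)
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))
```

A comment that itself contains `###` would still be rejected by `parse_record`; a CSV writer with quoting would avoid that at the cost of changing the file format.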

> The crawler's control loop, which is also the program's entry point, runs the methods above in order:

    def start(self):
        """Main control loop."""
        print("Crawler started...\r\n")
        start_time = self.start_time
        end_time = self.end_time
        num = 1
        while start_time > end_time:
            print("Iteration:", num)
            # 1. Download the page
            content = self.download(self.target_url + str(start_time))
            # 2. Extract the key fields
            comments = []
            if content != "":
                comments = self.parse(content)
            if len(comments) <= 0:
                print("No data in this batch, stopping!\r\n")
                break
            # 3. Write to file
            res = ''
            for cmt in comments:
                res += "%s###%s###%s###%s###%s\n" % (cmt['nickname'], cmt['cityname'], cmt['content'], cmt['score'], cmt['starttime'])
            self.save(res)
            print("Records in this batch: %s\r\n" % len(comments))
            # Take the time of the last record and subtract one second
            start_time = datetime.strptime(comments[-1]['starttime'], "%Y-%m-%d %H:%M:%S") + timedelta(seconds=-1)
            # Sleep for 3 s between requests
            num += 1
            time.sleep(3)
        self.save_file.close()
        print("Crawler finished...")
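The paging idea in the loop above (request a page, take the timestamp of its last comment, step back one second, repeat) can be isolated from the network code. The sketch below assumes a hypothetical callable `fetch_page(cursor)` that returns a list of comment dicts, newest first; the name `crawl` is also an assumption.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def crawl(fetch_page, newest, oldest):
    """Walk backwards in time from `newest` until `oldest` is reached."""
    collected = []
    cursor = newest
    while cursor > oldest:
        page = fetch_page(cursor)
        if not page:
            break  # nothing older is available
        collected.extend(page)
        # Move the cursor to just before the oldest comment on this page.
        last = datetime.strptime(page[-1]["starttime"], TIME_FORMAT)
        cursor = last - timedelta(seconds=1)
    return collected
```

Because the cursor moves back by a whole second, comments that share a timestamp with the last one on a page but did not fit on that page may be skipped; whether that matters depends on how dense the comments are.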

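2.3 Data sample: the crawl ended up with nearly 20,000 records; the fields of each record are separated by ###.

3. Visualising the Data

3.1 Heat map of audience cities (pyecharts Geo)

The chart makes it easy to see where the users are: they are concentrated mainly in the developed coastal city clusters.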

    def createcharts(self):
        """Generate the charts (pyecharts 0.5.x style API)."""
        # Load the data as a list of tuples: [("北京", 10), ("上海", 10)]
        data = self.readcitynum()
        # 1. Heat map
        geo1 = Geo("《无名之辈》观众位置分布热点图", "数据来源：猫眼，fly采集", title_color="#fff", title_pos="center", width="100%", height=600, background_color="#404a59")
        attr1, value1 = geo1.cast(data)
        geo1.add("", attr1, value1, type="heatmap", visual_range=[0, 1000], visual_text_color="#fff", symbol_size=15, is_visualmap=True, is_piecewise=False, visual_split_number=10)
        geo1.render("files/无名之辈-观众位置热点图.html")
        # 2. Location map
        geo2 = Geo("《无名之辈》观众位置分布", "数据来源：猫眼，fly采集", title_color="#fff", title_pos="center", width="100%", height=600,
                   background_color="#404a59")
        attr2, value2 = geo2.cast(data)
        geo2.add("", attr2, value2, visual_range=[0, 1000], visual_text_color="#fff", symbol_size=15,
                 is_visualmap=True, is_piecewise=False, visual_split_number=10)
        geo2.render("files/无名之辈-观众位置图.html")
        # 3. Top 20 bar chart
        data_top20 = data[:20]
        bar = Bar("《无名之辈》观众来源排行 top20", "数据来源：猫眼，fly采集", title_pos="center", width="100%", height=600)
        attr, value = bar.cast(data_top20)
        bar.add('', attr, value, is_visualmap=True, visual_range=[0, 3500], visual_text_color="#fff", is_more_utils=True, is_label_show=True)
        bar.render("files/无名之辈-观众来源top20.html")
        print("Charts generated")
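The city counts that feed these charts are a plain frequency table, so they can also be produced with `collections.Counter`. This sketch assumes the records have already been loaded as dicts with a `cityname` key; the function name `count_cities` is hypothetical.

```python
from collections import Counter


def count_cities(records, top_n=None):
    """Return (city, count) tuples sorted by count, descending."""
    # Records with an empty or missing city name are skipped.
    counts = Counter(r["cityname"] for r in records if r.get("cityname"))
    return counts.most_common(top_n)
```

`count_cities(records, top_n=20)` gives the same shape of data as `data[:20]` in the charting code above.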

3.2 Bar chart of the top 20 cities by audience count (pyecharts Bar)

3.3 Word cloud of the comments (jieba, wordcloud)

Core code for generating the word cloud:

    def createwordcloud(self):
        """Generate the word cloud of the comments."""
        comments = self.readallcomments()  # 19185
        # Segment the text with jieba
        commens_split = jieba.cut(" ".join(comments), cut_all=False)
        # Join the tokens with spaces so that WordCloud can tell them apart
        words = ' '.join(commens_split)
        # Add extra stop words to the default set
        stopwords = STOPWORDS.copy()
        stopwords.add("电影")
        stopwords.add("一部")
        stopwords.add("无名之辈")
        stopwords.add("一个")
        stopwords.add("有点")
        stopwords.add("觉得")
        # Load the background (mask) image
        bg_image = plt.imread("files/2048_bg.png")
        # Initialise WordCloud
        wc = WordCloud(width=1200, height=600, background_color='#fff', mask=bg_image, font_path='c:/windows/fonts/stfangso.ttf', stopwords=stopwords, max_font_size=400, random_state=50)
        # Generate and display the image
        wc.generate_from_text(words)
        plt.imshow(wc)
        plt.axis('off')
        plt.show()
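To check which words dominate the cloud, the same segmented text can be counted directly. The sketch below assumes the text has already been segmented (for example by jieba) and joined with spaces; the function name `top_words` and the rule of dropping single-character tokens are assumptions.

```python
from collections import Counter


def top_words(segmented_text, stopwords, n=10):
    """Return the n most frequent tokens that are not stop words."""
    tokens = [
        t for t in segmented_text.split()
        # Assumption: single-character tokens are mostly particles and punctuation.
        if len(t) > 1 and t not in stopwords
    ]
    return Counter(tokens).most_common(n)
```

Printing the result next to the image is a quick way to decide which further stop words to add.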

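4. Patching the pyecharts Source

4.1 City abbreviations in the sample do not match the full city names in the dataset

When drawing the location heat map, many cities in the collected data are abbreviations that do not match the city names in the dataset bundled with pyecharts, so I patched the source to make abbreviations match. For example:

黔南 => 黔南布依族苗族自治州

The coordinates of China's main cities and counties that ship with the module are in: [Python install path]\lib\site-packages\pyecharts\datasets\city_coordinates.json

By default, a city name that does not match exactly raises an exception and the program stops (the failing method is Geo.add()), so I changed the source as follows; the parts marked with comments are my own additions: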

    def get_coordinate(self, name, region="中国", raise_exception=False):
        """
        Return coordinate for the city name.

        :param name: City name or any custom name string.
        :param raise_exception: Whether to raise exception if not exist.
        :return: A list like [longitude, latitude] or None
        """
        if name in self._coordinates:
            return self._coordinates[name]
        coordinate = get_coordinate(name, region=region)
        # [ added 2018-12-04
        # print(name, coordinate)
        if coordinate is None:
            # No exact key match in the dictionary: fall back to a fuzzy search
            search_res = search_coordinates_by_region_and_keyword(region, name)
            # print("###", search_res)
            if search_res:
                coordinate = sorted(search_res.values())[0]
        # added 2018-12-04 ]
        if coordinate is None and raise_exception:
            raise ValueError("No coordinate is specified for {}".format(name))
        return coordinate

The __add() method needs a corresponding change:

[Screenshot: modified __add() method]
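An alternative to patching the library is to normalise the city names before they reach the chart. The sketch below is not what the article does; it assumes the full names are available as a collection of strings (for instance the keys of city_coordinates.json), and the function name `resolve_city` and the "shortest match wins" rule are assumptions.

```python
def resolve_city(short_name, known_names):
    """Map an abbreviated city name to a full name, or None if nothing fits."""
    if short_name in known_names:
        return short_name
    # Substring match: "黔南" is contained in "黔南布依族苗族自治州".
    candidates = [name for name in known_names if short_name in name]
    if not candidates:
        return None
    # Prefer the shortest candidate; break ties alphabetically for determinism.
    return min(candidates, key=lambda name: (len(name), name))
```

Names that resolve to None can then be dropped or logged before the data is passed to Geo.add().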

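5. Appendix: Source Code

*Note: the source code is my own work and the data comes from Maoyan. Everything here is for study only; no other use is permitted. Please credit the source when reposting.

5.1 Crawler source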


    # -*- coding:utf-8 -*-
    import os
    import sys
    import time
    from datetime import datetime, timedelta

    import requests


    class maoyanfilmreviewspider:
        """Maoyan film review crawler."""

        def __init__(self, url, end_time, filename):
            # Request headers (mobile user agent)
            self.headers = {
                'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1'
            }
            # Target URL
            self.target_url = url
            # Time window: start_time is the cutoff (today at midnight), end_time is the release date
            self.start_time = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
            self.end_time = datetime.strptime(end_time, "%Y-%m-%d %H:%M:%S")
            # Create the output directory and open the output file
            self.save_path = "files/"
            if not os.path.exists(self.save_path):
                os.makedirs(self.save_path)
            self.save_file = open(self.save_path + filename, "a", encoding="utf-8")

        def download(self, url):
            """Download the content at url and return it as parsed JSON."""
            print("Downloading url: " + url)
            response = requests.get(url, headers=self.headers)
            # Only parse the body as JSON when the request succeeded
            if response.status_code == 200:
                return response.json()
            else:
                # print(response.status_code)
                print('Download returned no data!')
                return ""

        def parse(self, content):
            """Extract the fields we need from one page of results."""
            comments = []
            try:
                for item in content['cmts']:
                    comment = {
                        'nickname': item['nickname'],    # nickname
                        'cityname': item['cityname'],    # city
                        'content': item['content'],      # comment text
                        'score': item['score'],          # score
                        'starttime': item['starttime'],  # time
                    }
                    comments.append(comment)
            except Exception as e:
                print(e)
            # Return whatever was parsed before any error occurred
            return comments

        def save(self, data):
            """Write data to the output file."""
            print("Saving data, writing to file...")
            self.save_file.write(data)

        def start(self):
            """Main control loop."""
            print("Crawler started...\r\n")
            start_time = self.start_time
            end_time = self.end_time
            num = 1
            while start_time > end_time:
                print("Iteration:", num)
                # 1. Download the page
                content = self.download(self.target_url + str(start_time))
                # 2. Extract the key fields
                comments = []
                if content != "":
                    comments = self.parse(content)
                if len(comments) <= 0:
                    print("No data in this batch, stopping!\r\n")
                    break
                # 3. Write to file
                res = ''
                for cmt in comments:
                    res += "%s###%s###%s###%s###%s\n" % (cmt['nickname'], cmt['cityname'], cmt['content'], cmt['score'], cmt['starttime'])
                self.save(res)
                print("Records in this batch: %s\r\n" % len(comments))
                # Take the time of the last record and subtract one second
                start_time = datetime.strptime(comments[-1]['starttime'], "%Y-%m-%d %H:%M:%S") + timedelta(seconds=-1)
                # Sleep for 3 s between requests
                num += 1
                time.sleep(3)
            self.save_file.close()
            print("Crawler finished...")


    if __name__ == "__main__":
        # Make sure all three arguments were given
        if len(sys.argv) != 4:
            print("Usage: xxx.py [movie id] [release date] [output file name], e.g. xxx.py 42962 2018-11-09 text.txt")
            sys.exit()
        # Maoyan movie id
        mid = sys.argv[1]  # "1208282" # "42964"
        # Release date of the movie
        end_time = sys.argv[2]  # "2018-11-16" # "2018-11-09"
        # Number of comments fetched per request
        offset = 15
        # Output file name
        filename = sys.argv[3]
        spider = maoyanfilmreviewspider(url="http://m.maoyan.com/mmdb/comments/movie/%s.json?v=yes&offset=%d&starttime=" % (mid, offset), end_time="%s 00:00:00" % end_time, filename=filename)
        spider.start()

maoyanfilmreviewspider.py

5.2 Analysis and charting source


    # -*- coding:utf-8 -*-
    from pyecharts import Geo, Bar, Bar3D  # pyecharts 0.5.x style imports
    import jieba
    from wordcloud import STOPWORDS, WordCloud
    import matplotlib.pyplot as plt


    class acoolfishanalysis:
        """无名之辈 (A Cool Fish): data analysis."""

        def __init__(self):
            pass

        def readcitynum(self):
            """Count the audience per city."""
            d = {}
            with open("files/mycmts2.txt", "r", encoding="utf-8") as f:
                row = f.readline()
                while row != "":
                    arr = row.split('###')
                    # A comment may contain line breaks: keep reading until the record has 5 fields
                    while len(arr) < 5:
                        more = f.readline()
                        if more == "":  # end of file: stop instead of looping forever
                            break
                        row += more
                        arr = row.split('###')
                    # Count the audience for each city
                    if len(arr) >= 5:
                        d[arr[1]] = d.get(arr[1], 0) + 1
                    row = f.readline()
            # Convert the dict to a list of tuples, skipping empty city names
            res = [(city, count) for city, count in d.items() if city != ""]
            # Sort by audience count, descending
            res = sorted(res, key=lambda x: x[1], reverse=True)
            return res

        def readallcomments(self):
            """Read the text of every comment."""
            comments = []
            with open("files/mycmts2.txt", "r", encoding="utf-8") as f:
                row = f.readline()
                while row != "":
                    arr = row.split('###')
                    # Each record has 5 fields; re-join rows split by line breaks
                    while len(arr) < 5:
                        more = f.readline()
                        if more == "":  # end of file: stop instead of looping forever
                            break
                        row += more
                        arr = row.split('###')
                    if len(arr) == 5:
                        comments.append(arr[2])
                    row = f.readline()
            return comments

        def createcharts(self):
            """Generate the charts (pyecharts 0.5.x style API)."""
            # Load the data as a list of tuples: [("北京", 10), ("上海", 10)]
            data = self.readcitynum()
            # 1. Heat map
            geo1 = Geo("《无名之辈》观众位置分布热点图", "数据来源：猫眼，fly采集", title_color="#fff", title_pos="center", width="100%", height=600, background_color="#404a59")
            attr1, value1 = geo1.cast(data)
            geo1.add("", attr1, value1, type="heatmap", visual_range=[0, 1000], visual_text_color="#fff", symbol_size=15, is_visualmap=True, is_piecewise=False, visual_split_number=10)
            geo1.render("files/无名之辈-观众位置热点图.html")
            # 2. Location map
            geo2 = Geo("《无名之辈》观众位置分布", "数据来源：猫眼，fly采集", title_color="#fff", title_pos="center", width="100%", height=600,
                       background_color="#404a59")
            attr2, value2 = geo2.cast(data)
            geo2.add("", attr2, value2, visual_range=[0, 1000], visual_text_color="#fff", symbol_size=15,
                     is_visualmap=True, is_piecewise=False, visual_split_number=10)
            geo2.render("files/无名之辈-观众位置图.html")
            # 3. Top 20 bar chart
            data_top20 = data[:20]
            bar = Bar("《无名之辈》观众来源排行 top20", "数据来源：猫眼，fly采集", title_pos="center", width="100%", height=600)
            attr, value = bar.cast(data_top20)
            bar.add('', attr, value, is_visualmap=True, visual_range=[0, 3500], visual_text_color="#fff", is_more_utils=True, is_label_show=True)
            bar.render("files/无名之辈-观众来源top20.html")
            print("Charts generated")

        def createwordcloud(self):
            """Generate the word cloud of the comments."""
            comments = self.readallcomments()  # 19185
            # Segment the text with jieba
            commens_split = jieba.cut(" ".join(comments), cut_all=False)
            # Join the tokens with spaces so that WordCloud can tell them apart
            words = ' '.join(commens_split)
            # Add extra stop words to the default set
            stopwords = STOPWORDS.copy()
            stopwords.add("电影")
            stopwords.add("一部")
            stopwords.add("无名之辈")
            stopwords.add("一个")
            stopwords.add("有点")
            stopwords.add("觉得")
            # Load the background (mask) image
            bg_image = plt.imread("files/2048_bg.png")
            # Initialise WordCloud
            wc = WordCloud(width=1200, height=600, background_color='#fff', mask=bg_image, font_path='c:/windows/fonts/stfangso.ttf', stopwords=stopwords, max_font_size=400, random_state=50)
            # Generate and display the image
            wc.generate_from_text(words)
            plt.imshow(wc)
            plt.axis('off')
            plt.show()


    if __name__ == "__main__":
        demo = acoolfishanalysis()
        demo.createwordcloud()