Python Web Scraping: Using XPath to Grab Key Tags for Automated Reply ("Floor-Building") Prize Draws (Part 2)

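1. Analyzing the Link

When we take part in a site's floor-building (reply-to-win) prize draws, we usually join several threads at once rather than just one.

So we need to work out how the reply URL tells one thread from another. The reply URL from the previous post has the following format: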

									https://club.hihonor.com/cn/forum.php?mod=post&action=reply&fid=154&tid=21089001&extra=page%3d1&replysubmit=yes&infloat=yes&handlekey=fastpost&inajax=1

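The key that distinguishes one thread from another is tid. Look back at the reply URL in the previous post and you will find the same number there, 21089001.

In my testing, everything in this site's reply POST URL other than tid is identical from thread to thread and does not need to change. To reply to a different thread, we only need to replace the value of tid.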
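To make the "only tid changes" observation concrete, here is a minimal sketch that swaps the tid in the reply URL shown above using the standard library. The helper name `with_tid` is hypothetical and not part of the original script, which simply concatenates strings. Note that `urlencode` re-encodes `page%3d1` as `page%3D1`, which is equivalent.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# The reply URL quoted above (from the previous post)
REPLY_URL = ("https://club.hihonor.com/cn/forum.php?mod=post&action=reply"
             "&fid=154&tid=21089001&extra=page%3d1&replysubmit=yes"
             "&infloat=yes&handlekey=fastpost&inajax=1")


def with_tid(reply_url, tid):
    """Return reply_url with its tid query parameter replaced (hypothetical helper)."""
    parts = urlsplit(reply_url)
    # Keep every parameter in order; only the value of tid is changed
    query = [(k, str(tid) if k == "tid" else v) for k, v in parse_qsl(parts.query)]
    return urlunsplit(parts._replace(query=urlencode(query)))


new_url = with_tid(REPLY_URL, 26194745)
print(new_url)
```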

2. Splitting the URL to Extract tid

Open any thread on the site and you will typically get a thread URL of this form:

									https://club.hihonor.com/cn/thread-26194745-1-1.html

Here we use string splitting to pull the long number 26194745 out of the URL string. The code is as follows:

    import re

    # Build the reply (POST) URL for every thread we want to comment on
    url_start = "https://club.hihonor.com/cn/forum.php?mod=post&action=reply&fid=4515&tid="
    url_end = "&extra=page%3d1&replysubmit=yes&infloat=yes&handlekey=fastpost&inajax=1"
    url = []      # reply URLs
    txt_url = []  # thread URLs as listed in the text file (different format)

    with open("随机帖子.txt", "r", encoding='utf-8') as f:
        for line in f:
            # Keep only lines that start with an http(s) URL
            if re.match(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', line):
                txt_url.append(line.strip())

    datas = []
    headers = []
    for i in txt_url:
        # In ".../thread-26194745-1-1.html" the second "-"-separated field is the tid
        url.append(url_start + i.split("-")[1] + url_end)

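Here I put all the thread links to reply to in a text file and read it line by line, using a regular expression to check that each line is a valid link.

Then I loop over the links, split each one to get the thread's numeric identifier, and concatenate the pieces into the actual POST reply URL.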
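Splitting on "-" works for this site's thread URLs, but it would pick the wrong piece if a hyphen ever appeared earlier in the link (in the domain, for example). As a hedged alternative, the sketch below extracts the tid with a regular expression. The helper name `extract_tid` and the pattern are my own assumptions based on the thread-26194745-1-1.html format shown above; the original script does not use them.

```python
import re

# Assumed thread URL shape: .../thread-<tid>-<page>-<n>.html
THREAD_RE = re.compile(r"/thread-(\d+)-\d+-\d+\.html")


def extract_tid(thread_url):
    """Return the tid as a string, or None if the URL is not a thread link."""
    match = THREAD_RE.search(thread_url)
    return match.group(1) if match else None


tid = extract_tid("https://club.hihonor.com/cn/thread-26194745-1-1.html")
print(tid)
```

Returning None for non-thread links lets the caller skip bad lines instead of raising an IndexError.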

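3. Picking Comment Text at Random

In most floor-building events the site checks for duplicate content, and an account that posts the same comment over and over is likely to be barred from commenting for a while.

So the comments need variety. If, for example, the site asks us to praise a phone's performance to enter the draw, we should prepare a stock of comment texts for the program to pick from at random.

The texts go in a txt file, which we read with the code below: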

    # Load the candidate comment texts, one per line
    txt_contents = []
    with open("回帖文案.txt", "r", encoding='utf-8') as f:
        for line in f:
            if line.strip() != "":  # skip blank lines
                txt_contents.append(line.strip())
    print(txt_contents)
    count = len(txt_contents)

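If we were joining a comment event on a gaming forum instead, we would fill the file with game-related comments and draw from those at random. The more samples there are, the fewer repeats.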
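Purely random picks can still return the same comment twice in a row when the pool is small. One possible refinement, sketched below, is to shuffle the pool and walk through it so that no comment repeats until every comment has been used once. The generator name `comment_cycle` is hypothetical, and this differs from the script in this post, which picks an independent random index each time.

```python
import random


def comment_cycle(comments):
    """Yield comments forever, reshuffling each time the pool is used up."""
    pool = list(comments)
    if not pool:
        raise ValueError("comment pool is empty")
    while True:
        random.shuffle(pool)  # new random order for each pass
        yield from pool


picker = comment_cycle(["comment A", "comment B", "comment C"])
first_round = [next(picker) for _ in range(3)]
print(first_round)
```

Within one pass every comment appears exactly once. A repeat is still possible across the boundary between two passes.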

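4. Posting Replies for the Draw

Sites that run events like this regularly usually require a verified login. Every site uses a different CAPTCHA scheme, so logging in automatically often becomes the crux of the problem.

To recognize the CAPTCHA we could use the text-recognition APIs offered by Baidu, Tencent, or Alibaba Cloud, but in my trials none of them recognized it reliably; the best accuracy I saw was under 50%.

Writing our own machine-learning recognizer is the alternative, but anyone who has studied machine learning knows it needs a huge amount of labeled data, and even if you do build it, the site may well switch to a different verification method.

CAPTCHAs and CAPTCHA-breaking keep evolving against each other. Labeling CAPTCHA samples eats up a great deal of time, and the site may change its scheme before the work pays off.

So my advice is to type the CAPTCHA in by hand. That is the only manual step; everything else is automated. The complete code is as follows: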

    import random
    import re
    import time

    import requests
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Load the candidate comment texts, one per line
    txt_contents = []
    with open("回帖文案.txt", "r", encoding='utf-8') as f:
        for line in f:
            if line.strip() != "":  # skip blank lines
                txt_contents.append(line.strip())
    print(txt_contents)
    count = len(txt_contents)

    # Build the reply (POST) URL for every thread we want to comment on
    url_start = "https://club.hihonor.com/cn/forum.php?mod=post&action=reply&fid=4515&tid="
    url_end = "&extra=page%3d1&replysubmit=yes&infloat=yes&handlekey=fastpost&inajax=1"
    url = []      # reply URLs
    txt_url = []  # thread URLs as listed in the text file (different format)
    with open("随机帖子.txt", "r", encoding='utf-8') as f:
        for line in f:
            # Keep only lines that start with an http(s) URL
            if re.match(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', line):
                txt_url.append(line.strip())
    datas = []
    headers = []
    for i in txt_url:
        # In ".../thread-26194745-1-1.html" the second "-"-separated field is the tid
        url.append(url_start + i.split("-")[1] + url_end)

    # Load the account names, one per line
    usernames = []
    with open("账号.txt", "r", encoding='utf-8') as f:
        for line in f:
            if line.strip():  # skip blank lines
                usernames.append(line.strip())

    # Log in once per account and collect the form data and headers needed to reply
    for name in usernames:
        browser = webdriver.Chrome()
        browser.implicitly_wait(10)
        browser.get("https://club.hihonor.com/cn/")
        time.sleep(5)
        login_text = browser.find_element(By.XPATH, "//*[@id='loginandreg']/a[1]")
        login_text.click()
        username = browser.find_element(
            By.XPATH, '/html/body/div[1]/div[2]/div/div/div[1]/div[3]/span/div[1]/span/div[2]/div[2]/div/input')
        password = browser.find_element(
            By.XPATH, '/html/body/div[1]/div[2]/div/div/div[1]/div[3]/span/div[1]/span/div[3]/div/div/div/input')
        username.send_keys(name)
        # "密码" is a placeholder for the real password. Keep the password the same for
        # every account so the txt file only needs one account name per line.
        password.send_keys("密码")
        sign = browser.find_element(
            By.XPATH, '/html/body/div[1]/div[2]/div/div/div[1]/div[3]/span/div[1]/span/div[6]/div/div/span/span')
        # Wait 10 seconds so the person running the script can type the CAPTCHA
        time.sleep(10)
        sign.click()
        time.sleep(2)
        # Carry the logged-in session over to requests as a Cookie header
        cookie = [item["name"] + "=" + item["value"] for item in browser.get_cookies()]
        cookiestr = ';'.join(cookie)
        url2 = "https://club.hihonor.com/cn/thread-26183971-1-1.html"
        time.sleep(2)
        browser.get(url2)
        # Read the posttime and formhash values from the thread page's reply form
        posttime = browser.find_element(By.ID, "posttime").get_attribute("value")
        formhash = browser.find_element(By.NAME, "formhash").get_attribute("value")
        browser.quit()
        data = {
            "formhash": formhash,
            "posttime": posttime,
            "usesig": "1",
            "message": txt_contents[0],
        }
        header = {
            "accept": "application/json, text/javascript, */*; q=0.01",
            "accept-encoding": "gzip, deflate, br",
            "accept-language": "zh-CN,zh;q=0.9",
            "content-length": "146",
            "sec-ch-ua": '"Google Chrome";v="87", "\"Not;A\\Brand";v="99", "Chromium";v="87"',
            "user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Mobile Safari/537.36",
            "cookie": cookiestr,
            "content-type": "application/x-www-form-urlencoded; charset=utf-8",
            "x-requested-with": "XMLHttpRequest",
        }
        datas.append(data)
        headers.append(header)

    while True:
        z = 0  # comments posted in this round
        if int(time.strftime("%H%M%S")) <= 220000:  # only post up to 22:00:00
            url_num = random.sample(range(0, len(url)), len(url))  # visit the threads in random order
            for i in url_num:
                j = 1  # 1-based account counter for the log output
                for data, header in zip(datas, headers):
                    data['message'] = txt_contents[random.randint(0, count - 1)]
                    res = requests.post(url=url[i], data=data, headers=header)
                    # '回复发布成功' ("reply posted successfully") is the text the site returns on success
                    if '回复发布成功' in res.text:
                        print("Account {0} replied successfully".format(j))
                    else:
                        print(res.text)
                    j += 1
                    z += 1
                time.sleep(5)
                print("{0} comments posted so far".format(str(z)))
        else:
            time.sleep(60)  # past the cutoff: sleep instead of spinning the CPU
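The step that ties the browser login to the later requests calls is turning the cookies returned by Selenium's `get_cookies()` (a list of dicts with "name" and "value" keys) into a single Cookie header string. The following is a small, isolated sketch of that step; the helper name `cookies_to_header` and the sample cookie values are made up for illustration. It joins pairs with "; ", the conventional separator, whereas the script above joins with a bare ";".

```python
def cookies_to_header(cookies):
    """Join Selenium-style cookie dicts into one Cookie header value."""
    return "; ".join("{0}={1}".format(c["name"], c["value"]) for c in cookies)


# Hypothetical sample shaped like the output of browser.get_cookies()
sample_cookies = [
    {"name": "sid", "value": "abc123", "domain": "example.com"},
    {"name": "auth", "value": "xyz"},
]
cookie_header = cookies_to_header(sample_cookies)
print(cookie_header)
```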