This crawler tutorial is divided into four parts:
1. Where to crawl from (where)
2. What to crawl (what)
3. How to crawl (how)
4. How to save the information after crawling (save)
Part 1: Where to Crawl From
The source is the online full text of the novel on 诗词名句网 (shicimingju.com); the site name appears in the page titles processed by the script in Part 4.
Part 2: What to Crawl
The full text of Romance of the Three Kingdoms (《三国演义》).
Part 3: How to Crawl
Open the developer tools in Chrome with F12 and you will find that the chapter text lives in this node:
<div id="con" class="bookyuanjiao">
All we need to do is locate this node and write its contents into an HTML file:
content = soup.find("div", {"class": "bookyuanjiao", "id": "con"})
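The find call above assumes the chapter page has already been downloaded and parsed into a BeautifulSoup object. Here is a minimal sketch of that step; the chapter URL is a placeholder used only for illustration, since the full script in Part 4 builds the real links from the book's table of contents.

# -*- coding: utf-8 -*-
# Minimal sketch: download one chapter page and pull out the content node.
import urllib2
from bs4 import BeautifulSoup as BS

# Placeholder URL for illustration only; the real chapter links are collected
# from the index page in the full script below.
chapter_url = 'http://www.example.com/book/sanguoyanyi/1.html'

html = urllib2.urlopen(chapter_url).read()
soup = BS(html, 'lxml')

# The chapter text sits inside <div id="con" class="bookyuanjiao">.
content = soup.find("div", {"class": "bookyuanjiao", "id": "con"})
print content.get_text()[:100].encode('utf-8')  # peek at the first 100 characters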
Part 4: How to Save After Crawling
The idea is simply to take the extracted content, splice it into an HTML document, and save it to disk. The complete script:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import sys
import urllib2
from bs4 import BeautifulSoup as BS
from lxml import etree

# Python 2 workaround so Chinese chapter titles can be used in file names
# without UnicodeEncodeError (the script targets a GBK Windows locale).
reload(sys)
sys.setdefaultencoding('gbk')

# Index page of the book. This URL is an assumption (it is not shown above);
# it is inferred from the site named in the page titles handled below
# (诗词名句网, i.e. shicimingju.com).
domain = 'http://www.shicimingju.com/book/sanguoyanyi.html'

# Folder that will hold one HTML file per chapter.
sub_folder = os.path.join(os.getcwd(), "sanguoyanyi")
if not os.path.exists(sub_folder):
    os.mkdir(sub_folder)
path = sub_folder

# 0.html holds the customized HTML head prepended to every chapter.
head_file = open(r'0.html', 'r')
head = head_file.read()
head_file.close()

# Derive the site root and the per-chapter URL pattern from the index URL.
t = domain.find(r'.html')
new_domain = '/'.join(domain.split("/")[:-2])
first_chapter_url = domain[:t] + "/" + str(1) + '.html'
print first_chapter_url

# Get the chapter links from the <div id="mulu" class="bookyuanjiao">
# table of contents on the index page.
req = urllib2.Request(url=domain)
resp = urllib2.urlopen(req)
html = resp.read()
soup = BS(html, 'lxml')
chapter_list = soup.find("div", {"class": "bookyuanjiao", "id": "mulu"})
sel = etree.HTML(str(chapter_list))
result = sel.xpath('//li/a/@href')

# Download every chapter, extract the content node, and save it as HTML.
for each_link in result:
    each_chapter_link = new_domain + "/" + each_link
    print each_chapter_link
    req = urllib2.Request(url=each_chapter_link)
    resp = urllib2.urlopen(req)
    html = resp.read()
    soup = BS(html, 'lxml')
    content = soup.find("div", {"class": "bookyuanjiao", "id": "con"})
    # The page title looks like "<chapter>_《三国演义》_诗词名句网"; keep only the chapter part.
    title = soup.title.text
    title = title.split(u'_《三国演义》_诗词名句网')[0]
    # Splice the custom head, the chapter content, and the closing tags together.
    html = head + str(content) + "</body></html>"
    filename = os.path.join(path, title + ".html")
    print filename
    # write file
    output = open(filename, 'w')
    output.write(html)
    output.close()
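One design note: the script converts the chapter-list node back into a string and re-parses it with lxml purely to read the href attributes. If you prefer to stay within BeautifulSoup, the same links can be collected directly; a sketch, assuming soup is the parsed index page as in the middle of the script, with the behaviour otherwise unchanged:

# Sketch: collect the chapter links without the detour through lxml.
# Assumes `soup` is the BeautifulSoup object built from the index page.
chapter_list = soup.find("div", {"class": "bookyuanjiao", "id": "mulu"})
result = [a.get('href') for a in chapter_list.find_all('a') if a.get('href')]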
The content of 0.html is as follows:
<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>
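This head declares UTF-8 so the Chinese text displays correctly when the saved files are opened in a browser. If you would rather not keep a separate 0.html next to the script, the same string can be defined inline; a sketch, with everything else in the script unchanged:

# Sketch: define the HTML head inline instead of reading it from 0.html.
head = ('<html><head>'
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        '</head><body>')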
Summary