This article shares a working example of scraping data into MySQL with Python 3; the full code is below for your reference.
Straight to the code:
#!/usr/local/bin/python3.5 # -*- coding:UTF-8 -*- from urllib.request import urlopen from bs4 import BeautifulSoup import re import datetime import random import pymysql connect = pymysql.connect(host = '192.168.10.142' , unix_socket = '/tmp/mysql.sock' , user = 'root' , passwd = '1234' , db = 'scraping' , charset = 'utf8' ) cursor = connect.cursor() cursor.execute( 'USE scraping' ) random.seed(datetime.datetime.now()) def store(title, content): execute = cursor.execute( "select * from pages WHERE `title` = %s" , title) if execute < = 0 : cursor.execute( "insert into pages(`title`, `content`) VALUES(%s, %s)" , (title, content)) cursor.connection.commit() else : print ( 'This content is already exist.' ) def get_links(acticle_url): html = urlopen( 'http://en.wikipedia.org' + acticle_url) soup = BeautifulSoup(html, 'html.parser' ) title = soup.h1.get_text() content = soup.find( 'div' , { 'id' : 'mw-content-text' }).find( 'p' ).get_text() store(title, content) return soup.find( 'div' , { 'id' : 'bodyContent' }).findAll( 'a' , href = re. compile ( "^(/wiki/)(.)*$" )) links = get_links('') try : while len (links) > 0 : newActicle = links[random.randint( 0 , len (links) - 1 )].attrs[ 'href' ] links = get_links(newActicle) print (links) finally : cursor.close() connect.close() |
That's all for this article. Hopefully it helps with your studies, and please keep supporting 服务器之家.
Original article: https://blog.csdn.net/ASAS1314/article/details/52594232