Summary of 3 Data Scraping Methods in Python

Three data scraping methods

Regular expressions (the re library)
BeautifulSoup (bs4)
lxml

*Using the page-download function built earlier, we fetch the HTML of the target page. We use https://guojiadiqu.bmcx.com/afg__guojiayudiqu/ as the example.

    from get_html import download

    url = 'https://guojiadiqu.bmcx.com/afg__guojiayudiqu/'
    page_content = download(url)
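The `download` helper imported from `get_html` is not shown in this article. As a rough sketch only, assuming it takes a URL and returns the page body as text, it could look something like the following. The `user_agent` and `timeout` parameters, the UTF-8 fallback, and the choice to return `None` on failure are all assumptions made for illustration, not the author's actual implementation.

```python
import urllib.request
from urllib.error import URLError


def download(url, user_agent="Mozilla/5.0", timeout=10):
    """Fetch url and return the body as text, or None if the request fails.

    Hypothetical stand-in for the article's get_html.download helper.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Use the charset declared in the response, falling back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except URLError as error:
        # HTTPError is a subclass of URLError, so HTTP failures land here too.
        print("Download failed:", error)
        return None
```

Returning `None` instead of raising keeps the calling scripts short, but it means each caller should check the result before parsing it.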

*Suppose we need to scrape the country name and overview from this page. We implement the extraction with each of the three methods in turn.

1. Regular expressions

    import re
    from get_html import download

    url = 'https://guojiadiqu.bmcx.com/afg__guojiayudiqu/'
    page_content = download(url)

    # re.findall always returns a list, even when there is a single match
    country = re.findall('class="h2dabiaoti">(.*?)</h2>', page_content)
    # Grab the table cell that holds the overview text
    survey_data = re.findall('<tr><td bgcolor="#ffffff" id="wzneirong">(.*?)</td></tr>', page_content)
    # Each paragraph starts with two full-width spaces (U+3000), matched literally
    survey_info_list = re.findall('<p>　　(.*?)</p>', survey_data[0])
    survey_info = ''.join(survey_info_list)

    print(country[0], survey_info)
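One thing to watch with the regex approach: by default `.` does not match a newline, so a pattern such as `(.*?)</td>` finds nothing if the cell spans several lines, and indexing `[0]` into an empty list then raises `IndexError`. The sketch below runs similar patterns offline on a made-up HTML fragment. The fragment only imitates the class and id names used above, and its text is a placeholder; the real page's markup may differ.

```python
import re

# Hypothetical fragment that mimics the class and id used in the patterns above.
sample_html = """
<h2 class="h2dabiaoti">Afghanistan</h2>
<table><tr><td bgcolor="#ffffff" id="wzneirong">
<p>First paragraph.</p>
<p>Second paragraph.</p>
</td></tr></table>
"""

country = re.findall(r'class="h2dabiaoti">(.*?)</h2>', sample_html)

# re.S lets "." match newlines, so a cell spanning several lines still matches.
cell = re.findall(r'id="wzneirong">(.*?)</td>', sample_html, re.S)

# Guard against an empty result before indexing into the list.
paragraphs = re.findall(r"<p>(.*?)</p>", cell[0], re.S) if cell else []
survey_info = " ".join(p.strip() for p in paragraphs)

print(country[0] if country else None, survey_info)
```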

2. BeautifulSoup (bs4)

    from bs4 import BeautifulSoup  # the class name is case-sensitive
    from get_html import download

    url = 'https://guojiadiqu.bmcx.com/afg__guojiayudiqu/'
    html = download(url)

    # Create the BeautifulSoup object
    soup = BeautifulSoup(html, "html.parser")

    # Search by attribute; find() returns the first matching tag
    country = soup.find(attrs={'class': 'h2dabiaoti'}).text
    survey_info = soup.find(attrs={'id': 'wzneirong'}).text

    print(country, survey_info)
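Note that `find()` returns `None` when nothing matches, so chaining `.text` directly raises `AttributeError` if the page layout changes. A minimal offline sketch of a safer lookup follows. The HTML fragment and the `text_or_none` helper are hypothetical, introduced only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the class and id searched for above.
sample_html = """
<h2 class="h2dabiaoti">Afghanistan</h2>
<table><tr><td id="wzneirong">
<p>First paragraph.</p>
<p>Second paragraph.</p>
</td></tr></table>
"""

soup = BeautifulSoup(sample_html, "html.parser")


def text_or_none(tag):
    """Return the whitespace-normalized text of a tag, or None if find() matched nothing."""
    return tag.get_text(" ", strip=True) if tag is not None else None


country = text_or_none(soup.find(class_="h2dabiaoti"))
survey_info = text_or_none(soup.find(id="wzneirong"))
missing = text_or_none(soup.find(id="no-such-id"))

print(country, survey_info, missing)
```

`get_text(" ", strip=True)` strips each text node and joins the non-empty ones with a space, which avoids the stray newlines that plain `.text` keeps.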

3. lxml

    from lxml import etree  # element-tree parser
    from get_html import download

    url = 'https://guojiadiqu.bmcx.com/afg__guojiayudiqu/'
    page_content = download(url)

    # etree.HTML (uppercase) parses the string into a tree that supports XPath
    selector = etree.HTML(page_content)

    country_select = selector.xpath('//*[@id="main_content"]/h2')  # returns a list
    for country in country_select:
        print(country.text)

    survey_select = selector.xpath('//*[@id="wzneirong"]/p')
    for survey_content in survey_select:
        print(survey_content.text, end='')
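A caveat with reading `.text` on an lxml element: it only holds the text that comes before the element's first child, so a paragraph containing nested tags is cut short. The offline sketch below compares `.text` with `itertext()` on a made-up fragment. The fragment is hypothetical and only imitates the ids used in the XPath expressions above.

```python
from lxml import etree

# Hypothetical fragment that mimics the ids queried above.
sample_html = """
<div id="main_content"><h2>Afghanistan</h2></div>
<table><tr><td id="wzneirong">
<p>Plain paragraph.</p>
<p>Paragraph with <b>nested</b> markup.</p>
</td></tr></table>
"""

selector = etree.HTML(sample_html)

# Appending /text() to the XPath returns strings instead of elements.
country = selector.xpath('//*[@id="main_content"]/h2/text()')
paragraphs = selector.xpath('//*[@id="wzneirong"]/p')

# .text stops at the first child element; itertext() walks all descendants.
text_only = [p.text for p in paragraphs]
full_text = ["".join(p.itertext()) for p in paragraphs]

print(country, text_only, full_text)
```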