This article explains how to use the select method of BeautifulSoup when writing Python web crawlers. The details are as follows:
from bs4 import BeautifulSoup

# Sample document used by every example below.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Parse the document; the examples below all call methods on this soup object.
soup = BeautifulSoup(html, "html.parser")
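When writing CSS, a tag name is written with no decoration, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax can be used here to filter elements. The method to call is soup.select(), and it returns a list.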
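As a minimal sketch of what "returns a list" means in practice, the snippet below runs select() on a trimmed-down copy of the sample document (the shortened HTML and the variable names are assumptions made for illustration). It shows that the result can be treated like a list of tags, and that a selector with no match gives an empty result rather than None.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the sample document, for illustration only.
html = """
<p class="story">
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.select("a")
is_list = isinstance(links, list)    # the result is list-like
hrefs = [a["href"] for a in links]   # each item is a tag with attributes
missing = soup.select("table")       # no match: empty result, not None

print(is_list)
print(hrefs)
print(len(missing))
```

Because the result is list-like, it is usually safe to loop over it directly without checking for None first.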
(1) Find by tag name
print(soup.select('title'))
# [<title>The Dormouse's story</title>]

print(soup.select('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('b'))
# [<b>The Dormouse's story</b>]
(2) Find by class name
print(soup.select('.sister'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
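For readers who already know find_all(), it may help to see that the class selector picks out the same tags. The following is a small sketch on a trimmed-down copy of the sample document; the shortened HTML and the variable names are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the sample document, for illustration only.
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_css = soup.select(".sister")                # CSS class selector
by_find_all = soup.find_all(class_="sister")   # keyword-argument style

css_ids = [tag["id"] for tag in by_css]
find_all_ids = [tag["id"] for tag in by_find_all]

print(css_ids)
print(css_ids == find_all_ids)
```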
(3) Find by id
print(soup.select('#link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
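An id normally identifies a single element, so a one-item list can be awkward to work with. Beautiful Soup also provides select_one(), which returns the first match or None. The sketch below assumes a trimmed-down copy of the sample document; the id "no-such-id" is a made-up value that is deliberately absent.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the sample document, for illustration only.
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("#link1")         # a single tag, not a list
absent = soup.select_one("#no-such-id")   # hypothetical id: nothing matches

first_href = first["href"]
print(first_href)
print(absent is None)
```

Since select_one() can return None, check the result before indexing into it when the id is not guaranteed to exist.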
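(4) Combined lookup
Combined lookup follows the same rules as combining tag names, class names, and ids in a CSS file. For example, to find the element whose id is link1 inside a p tag, separate the two parts with a space: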
print(soup.select('p #link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
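The role of the space is easiest to see side by side. In this sketch (again on an assumed, shortened copy of the sample document) a selector without a space applies both conditions to one element, while a selector with a space looks for the second part inside the first.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the sample document, for illustration only.
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# No space: both conditions must hold for the SAME element.
story_paragraphs = soup.select("p.story")    # <p> tags with class "story"
sister_paragraphs = soup.select("p.sister")  # <p> tags with class "sister": none

# Space: the second part must be a descendant of the first.
sisters_in_p = soup.select("p .sister")
second_link = soup.select("p.story a#link2")

print(len(story_paragraphs), len(sister_paragraphs))
print([a["id"] for a in sisters_in_p])
print([a["id"] for a in second_link])
```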
Direct child lookup
print(soup.select("head > title"))
# [<title>The Dormouse's story</title>]
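The > combinator only matches direct children, whereas a space matches descendants at any depth. A small sketch of the difference, using an assumed, shortened copy of the sample document in which the links sit inside a p tag rather than directly under body:

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the sample document, for illustration only.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

descendants = soup.select("body a")        # any depth below <body>
direct_of_body = soup.select("body > a")   # <a> directly under <body>: none
direct_of_p = soup.select("p > a")         # <a> directly under <p>

print(len(descendants), len(direct_of_body), len(direct_of_p))
```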
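(5) Attribute lookup
A selector can also include an attribute, which must be wrapped in square brackets. Note that the attribute and the tag belong to the same node, so there must be no space between them. With a space, the selector will not match the intended element.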
print(soup.select("head > title"))
# [<title>The Dormouse's story</title>]

print(soup.select('a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
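To illustrate the warning about spaces, the sketch below runs the same attribute selector with and without a space on an assumed, shortened copy of the sample document. With the space, the bracketed part is read as a separate descendant of the a tag, so nothing matches.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the sample document, for illustration only.
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# No space: an <a> tag that itself has this href.
same_node = soup.select('a[href="http://example.com/elsie"]')

# Space: any element INSIDE an <a> tag that has this href (there is none).
with_space = soup.select('a [href="http://example.com/elsie"]')

print([a["id"] for a in same_node])
print(len(with_space))
```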
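Likewise, attribute selectors can be combined with the lookups above: parts that belong to different nodes are separated by a space, and parts that belong to the same node are written without one.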
print(soup.select('p a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
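Exact matching with = is not the only option. The standard CSS operators ^= (starts with), $= (ends with), and *= (contains) can be combined with other selectors in the same way. The following is a sketch only, run against an assumed, shortened copy of the sample document:

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the sample document, for illustration only.
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

starts_with = soup.select('p a[href^="http://example.com/"]')  # prefix match
ends_with = soup.select('p a[href$="tillie"]')                 # suffix match
contains = soup.select('p a[href*="lac"]')                     # substring match

print([a["id"] for a in starts_with])
print([a["id"] for a in ends_with])
print([a["id"] for a in contains])
```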
That is all for this article. I hope it helps with your studies.
Original article: http://www.cnblogs.com/yizhenfeng168/p/6979339.html