Python BeautifulSoup Library Study Notes

Recorded for future reference.

Basic usage

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<p>data</p>', 'html.parser')

Basic elements

Tag: a tag, the most basic unit for organizing information; its start and end are marked with <> and </>

>>> soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
>>> tag = soup.b
>>> type(tag)
# <class 'bs4.element.Tag'>

Name: the tag's name; for example, the name of <p>…</p> is 'p'. Accessed as <tag>.name

>>> tag.name
# 'b'
>>> tag.name = "blockquote"
>>> tag
# <blockquote class="boldest">Extremely bold</blockquote>

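Attributes: the tag's attributes, organized as a dictionary. Accessed as <tag>.attrs. Multi-valued attributes such as class are returned as lists

>>> tag['class']
# ['boldest']
>>> tag.attrs
# {'class': ['boldest']}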

A tag's attributes can be added, removed, or modified

>>> tag['class'] = 'verybold'
>>> tag['id'] = 1
>>> tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>
>>> del tag['class']
>>> del tag['id']
>>> tag
# <blockquote>Extremely bold</blockquote>
>>> tag['class']
# KeyError: 'class'
>>> print(tag.get('class'))
# None
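The attribute behavior above can be sketched in a standalone script. This is a minimal illustration, assuming a made-up `<p>` element with two classes; it shows that `class` comes back as a list, that other attributes come back as strings, and that `.get()` avoids the `KeyError` for a missing attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, chosen only to show a multi-valued class attribute.
soup = BeautifulSoup('<p class="body strikeout" id="intro">text</p>', "html.parser")
tag = soup.p

classes = tag["class"]        # multi-valued attribute: a list of class names
tag_id = tag["id"]            # single-valued attribute: a plain string
missing = tag.get("title")    # .get() returns None instead of raising KeyError

# Assigning a list writes the values back as a space-separated attribute.
tag["class"] = ["body", "highlight"]
rendered = str(tag)

print(classes, tag_id, missing)
print(rendered)
```

Using `.get()` is usually the safer choice when an attribute may be absent, for example when reading `href` from arbitrary `<a>` tags.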

NavigableString: the non-attribute string inside a tag, i.e. the text between <> and </>. Accessed as <tag>.string

>>> tag.string
# 'Extremely bold'
>>> type(tag.string)
# <class 'bs4.element.NavigableString'>

Comment: a comment inside a tag's string content; a special type of NavigableString

>>> markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> comment = soup.b.string
>>> type(comment)
# <class 'bs4.element.Comment'>
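Because Comment is a subclass of NavigableString, code that collects text from a tag may want to tell the two apart. The following is a minimal sketch, assuming an invented paragraph that mixes visible text with a comment; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical markup: one visible string followed by one comment.
markup = "<p>visible text<!-- hidden note --></p>"
soup = BeautifulSoup(markup, "html.parser")

visible = []
comments = []
for node in soup.p.contents:
    # Check Comment first: every Comment is also a NavigableString.
    if isinstance(node, Comment):
        comments.append(str(node))
    elif isinstance(node, NavigableString):
        visible.append(str(node))

print(visible)
print(comments)
```

The order of the two `isinstance` checks matters; testing for NavigableString first would also catch the comment.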

Simple traversal

Get only the first matching tag

>>> soup.head
# <head><title>The Dormouse's story</title></head>
>>> soup.title
# <title>The Dormouse's story</title>
>>> soup.body.b
# <b>The Dormouse's story</b>

Get all matching tags

>>> soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
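The snippets in this section use a `soup` object that these notes never construct. The outputs match the "three sisters" sample from the official Beautiful Soup documentation, so the sketch below assumes that document; the `html_doc` name and its exact content are taken from there rather than from these notes.

```python
from bs4 import BeautifulSoup

# Assumed sample document, adapted from the official Beautiful Soup docs.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

first_title = soup.title      # attribute access returns only the first <title>
first_bold = soup.body.b      # first <b> found inside <body>
links = soup.find_all("a")    # every <a> tag, in document order

print(first_title.string)
print(first_bold.string)
print([a["id"] for a in links])
```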

Precise searching

find_all(name, attrs, recursive, text, limit, **kwargs)

name: matches every tag whose name equals the given value

>>> soup.find_all("title")
# [<title>The Dormouse's story</title>]

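keyword arguments: any keyword argument that is not a recognized parameter name is used as a filter on the tag attribute of the same name

>>> soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]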

When the value is True, every tag that has an id attribute is matched, whatever its value

>>> soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

class_: searches for tags with the given CSS class name. Because class is a reserved keyword in Python, a trailing underscore is added to distinguish it

>>> soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

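attrs: takes a dictionary for searching attributes whose names cannot be used as keyword arguments, such as HTML5 data-* attributes

>>> data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
>>> data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]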

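text: searches the string content of the document (since Beautiful Soup 4.4.0 this parameter is named string; text is the older name)

>>> soup.find_all("a", text="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]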

limit: caps the number of results returned

>>> soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

recursive: when set to False, only the tag's direct children are searched
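The effect of recursive can be sketched with a small document. This example assumes a made-up page with no whitespace between tags, so that every direct child is a tag; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical document: <title> is a grandchild of <html>, not a direct child.
html = "<html><head><title>T</title></head><body><p>text</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Default (recursive=True): all descendants of <html> are searched.
deep = soup.html.find_all("title")

# recursive=False: only direct children of <html> are considered.
shallow = soup.html.find_all("title", recursive=False)
direct = soup.html.find_all(["head", "body"], recursive=False)

print(deep)
print(shallow)
print([t.name for t in direct])
```

Here `deep` contains the `<title>` tag while `shallow` is empty, because `<title>` sits under `<head>` rather than directly under `<html>`.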

find(name, attrs, recursive, text, **kwargs)

Differences from find_all():

find() returns a single result, while find_all() returns a list of results

>>> soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]
>>> soup.find('title')
# <title>The Dormouse's story</title>

When nothing matches, find_all() returns an empty list, whereas find() returns None
>>> print(soup.find("nosuchtag"))
# None
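Since find() can return None, chaining an attribute access directly onto its result raises AttributeError when nothing matches. A minimal sketch of a guard follows; `first_text` is a hypothetical helper written for this note, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>only paragraph</p></div>", "html.parser")


def first_text(root, name, default=""):
    """Hypothetical helper: text of the first matching tag, or a default."""
    tag = root.find(name)
    # find() returns None when there is no match, so check before using it.
    return tag.get_text() if tag is not None else default


found = first_text(soup, "p")
not_found = first_text(soup, "h1")
empty_list = soup.find_all("h1")   # find_all() returns an empty list instead

print(repr(found), repr(not_found), empty_list)
```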

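Downward traversal of the tag tree

.contents: a list of child nodes; all direct children are stored in a list
.children: an iterator over child nodes; similar to .contents, used to loop over direct children
.descendants: an iterator over descendant nodes; covers all children, grandchildren, and so on, used for looping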
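The three downward attributes can be compared side by side. This is a minimal sketch on an invented fragment; note that text strings count as nodes too, which is why the descendant count is larger than the number of tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: two <p> children, one of which contains a <b>.
soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
div = soup.div

# .contents is a list; .children iterates over the same direct children.
child_names = [child.name for child in div.contents]
children_iter_names = [child.name for child in div.children]

# .descendants walks the whole subtree, including text nodes.
descendant_tags = [d.name for d in div.descendants if isinstance(d, Tag)]
descendant_count = len(list(div.descendants))

print(child_names)
print(descendant_tags)
print(descendant_count)
```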

Upward traversal of the tag tree

.parent: the node's parent tag
.parents: an iterator over the node's ancestor tags, used to loop over ancestors
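Upward traversal can be sketched the same way. The fragment below is made up for illustration; the last item yielded by .parents is the BeautifulSoup object itself, whose name is `[document]`.

```python
from bs4 import BeautifulSoup

# Hypothetical document with a deeply nested <b> tag.
soup = BeautifulSoup(
    "<html><body><div><p><b>deep</b></p></div></body></html>", "html.parser"
)
b = soup.b

parent_name = b.parent.name                     # immediate parent only
ancestor_names = [p.name for p in b.parents]    # every ancestor, innermost first

print(parent_name)
print(ancestor_names)
```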

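Sibling traversal of the tag tree

.next_sibling: the next sibling node in HTML document order
.previous_sibling: the previous sibling node in HTML document order
.next_siblings: an iterator over all following sibling nodes in HTML document order
.previous_siblings: an iterator over all preceding sibling nodes in HTML document order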
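A short sketch of sibling traversal follows, using two invented lists. One detail worth seeing in practice: a sibling is any node, so whitespace between tags shows up as a string sibling. The `find_next_sibling()` method is one way to skip straight to the next tag.

```python
from bs4 import BeautifulSoup

# Hypothetical list with no whitespace between items: every sibling is a tag.
soup = BeautifulSoup("<ul><li>a</li><li>b</li><li>c</li></ul>", "html.parser")
first = soup.li
third = soup.find_all("li")[2]

second_text = first.next_sibling.get_text()
later = [s.get_text() for s in first.next_siblings]        # document order
earlier = [s.get_text() for s in third.previous_siblings]  # reverse order

# Same idea with a newline between items: the next sibling is now a string.
spaced = BeautifulSoup("<ul><li>a</li>\n<li>b</li></ul>", "html.parser")
whitespace_sibling = spaced.li.next_sibling
next_tag_text = spaced.li.find_next_sibling("li").get_text()

print(second_text, later, earlier)
print(repr(whitespace_sibling), next_tag_text)
```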

Pretty-printing output

>>> print(soup.prettify())
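To see what prettify() does, it helps to compare it with the plain string form of the same tree. This is a minimal sketch on an invented fragment; the exact indentation is left to the library, so no expected output is shown.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment written on a single line.
soup = BeautifulSoup("<div><p>data <b>bold</b></p></div>", "html.parser")

compact = str(soup)        # markup as parsed, no added whitespace
pretty = soup.prettify()   # one node per line, indented by nesting depth

print(compact)
print(pretty)
```

Because prettify() inserts newlines and indentation, it is best suited to inspecting a document's structure rather than to producing markup for further processing.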