Python Web Scraping with BeautifulSoup

Introduction

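Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to extract; because it is simple, a complete application takes very little code. Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it automatically; in that case you only need to state the original encoding.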
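As a minimal sketch of the encoding behaviour described above: the snippet below feeds Beautiful Soup a byte string that declares no encoding, states the original encoding through the from_encoding argument, and then encodes the result to UTF-8. The sample markup, the choice of ISO-8859-1, and the built-in html.parser backend are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical input: bytes in ISO-8859-1 with no <meta charset> declaration.
raw = "<p>Café déjà vu</p>".encode("iso-8859-1")

# State the original encoding explicitly; Beautiful Soup decodes to Unicode.
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-1")
text = soup.p.get_text()          # a Unicode str

# soup.encode() produces bytes, UTF-8 by default.
utf8_bytes = soup.encode()

print(text)
print(soup.original_encoding)     # the encoding Beautiful Soup used for decoding
print(utf8_bytes)
```

When the bytes already carry a correct charset declaration, the from_encoding argument can usually be left out.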

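Beautiful Soup has become as well regarded as lxml and html5lib for parsing HTML in Python, and it gives users the flexibility to choose between different parsing strategies or raw speed.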
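To illustrate what "different parsing strategies" means in practice, the sketch below parses the same invalid fragment with each backend that happens to be installed and prints the resulting tree. The fragment and the helper name parse_with_available are made up for this example; html.parser ships with Python, while lxml and html5lib are optional and are skipped if missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_available(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Parse markup with every installed parser; return {parser: serialized tree}."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # This optional parser is not installed; skip it.
            continue
    return results


results = parse_with_available("<a></p>")
for name, tree in results.items():
    print(f"{name:12} {tree}")
```

With html.parser the fragment comes back as just <a></a>, whereas lxml and html5lib build a full <html> document around it, which is why the choice of parser can change what a scraper sees.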

Installation

pip install beautifulsoup4

or

easy_install beautifulsoup4

Creating a BeautifulSoup object

First, import the BeautifulSoup class from the bs4 package.

Next, create the object. To make the demonstration easier, first define a sample HTML string:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

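Create the object as follows:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

Here lxml is the parser library. In my view it is the best parser currently available, and it is the one I always use. Install it with:

pip install lxml

Tag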

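A Tag corresponds to a tag in the HTML document, and BeautifulSoup lets you get at its contents directly. The general form is soup.name, where name is the name of a tag in the HTML. For example: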

print(soup.title)  # prints the title tag together with its contents: <title>The Dormouse's story</title>

print(soup.head)  # prints <head><title>The Dormouse's story</title></head>
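Beyond printing a whole tag, a Tag object also exposes its name, its text, and its attributes. The following sketch shows the most common accessors. It assumes a shortened copy of the sample document and uses the built-in html.parser instead of lxml so that it runs without extra packages.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document, so the snippet runs on its own.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
"""

soup = BeautifulSoup(html, "html.parser")

title = soup.title
print(title.name)        # the tag's own name: title
print(title.string)      # the text inside the tag: The Dormouse's story

first_p = soup.p         # soup.<name> returns the FIRST matching tag
print(first_p.attrs)     # all attributes as a dict
print(first_p["class"])  # class is multi-valued, so a list comes back
print(first_p["name"])   # the HTML attribute called "name", not the tag name

print(soup.a["href"])    # attribute lookup works like a dict
```

Note that first_p.name is the tag name ("p"), while first_p["name"] reads the HTML attribute that happens to be called name.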

Article URL: http://www.17bianji.com/lsqh/35763.html