Beautiful Soup is a library that makes it easy to scrape information
from web pages. It sits atop an HTML or XML parser, providing Pythonic
idioms for iterating, searching, and modifying the parse tree.

# Quick start

```
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<p>Some<b>bad<i>HTML")
>>> print(soup.prettify())
<html>
 <body>
  <p>
   Some
   <b>
    bad
    <i>
     HTML
    </i>
   </b>
  </p>
 </body>
</html>
>>> soup.find(string="bad")
'bad'
>>> soup.i
<i>HTML</i>
#
>>> soup = BeautifulSoup("<tag1>Some<tag2/>bad<tag3>XML", "xml")
#
>>> print(soup.prettify())
<?xml version="1.0" encoding="utf-8"?>
<tag1>
 Some
 <tag2/>
 bad
 <tag3>
  XML
 </tag3>
</tag1>
```
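
The quick start shows parsing and searching. As a minimal sketch of the other idioms mentioned in the introduction (iterating and modifying), the example below uses a made-up HTML fragment and explicitly picks Python's built-in `html.parser`; the markup, the variable names, and the choice of parser are illustrative assumptions, not a recommended setup.

```python
from bs4 import BeautifulSoup

# Hypothetical input, used only for illustration.
html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'

# Name the parser explicitly; "html.parser" ships with Python.
soup = BeautifulSoup(html, "html.parser")

# Iterating: find_all() returns every matching tag in document order.
hrefs = [a["href"] for a in soup.find_all("a")]

# Searching: find() returns the first match, or None if nothing matches.
first = soup.find("a", href="/a")
first_text = first.get_text()

# Modifying: change an attribute and replace the tag's string in place.
first["href"] = "/changed"
first.string.replace_with("Renamed")

result = str(soup)
print(hrefs)
print(first_text)
print(result)
```

Because `html.parser` does not wrap fragments in `<html>` and `<body>`, the serialized result here contains only the list itself. Other parsers may repair or wrap the same input differently.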

To go beyond the basics, [comprehensive documentation is available](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

# Links

* [Homepage](https://www.crummy.com/software/BeautifulSoup/bs4/)
* [Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [Discussion group](https://groups.google.com/group/beautifulsoup/)
* [Development](https://code.launchpad.net/beautifulsoup/)
* [Bug tracker](https://bugs.launchpad.net/beautifulsoup/)
* [Complete changelog](https://git.launchpad.net/beautifulsoup/tree/CHANGELOG)

# Note on Python 2 sunsetting

Beautiful Soup's support for Python 2 was discontinued on December 31,
2020, one year after the sunset date for Python 2 itself. From this
point onward, new Beautiful Soup development will exclusively target
Python 3. The final release of Beautiful Soup 4 to support Python 2
was 4.9.3.

# Supporting the project

If you use Beautiful Soup as part of your professional work, please consider a
[Tidelift subscription](https://tidelift.com/subscription/pkg/pypi-beautifulsoup4?utm_source=pypi-beautifulsoup4&utm_medium=referral&utm_campaign=readme).
This will support many of the free software projects your organization
depends on, not just Beautiful Soup.

If you use Beautiful Soup for personal projects, the best way to say
thank you is to read
[Tool Safety](https://www.crummy.com/software/BeautifulSoup/zine/), a zine I
wrote about what Beautiful Soup has taught me about software
development.

# Building the documentation

The bs4/doc/ directory contains full documentation in Sphinx
format. Run `make html` in that directory to create HTML
documentation.

# Running the unit tests

Beautiful Soup supports unit test discovery using Pytest:

```
$ pytest
```