mirror of
https://git.launchpad.net/beautifulsoup
synced 2025-10-06 00:12:49 +02:00
= 4.14.2 (20250929)

* Making ResultSet inherit from MutableSequence still resulted in too many
breaking changes in users of the library, so I reverted the
ResultSet code back to where it was in 4.13.5 and added tests of all known
breaking behavior. [bug=2125906]

= 4.14.1 (20250929)

* Made ResultSet inherit from MutableSequence instead of Sequence,
since lots of existing code treats ResultSet as a mutable list.
[bug=2125906,2125903]

= 4.14.0 (20250927)

* This version adds function overloading to the find_* methods to make
it easier to write type-safe Python.

In most cases you can just assign the result of a find() or
find_all() call to the type of object you're expecting to get back:
a Tag, a NavigableString, a Sequence[Tag], or a
Sequence[NavigableString]. It's very rare that you'll have to do a
cast or suppress type-checker warnings like you did in previous
versions of Beautiful Soup.

(In fact, the only time you should still have to do this is if you
pass both 'string' and one of the other arguments into one of the
find* methods, e.g. tag.find("a", string="tag contents").)

The following code has been verified to pass type checking using
mypy, pyright, and the Visual Studio Code IDE. It's available in
the source repository as scripts/type_checking_smoke_test.py.

---
from typing import Optional, Sequence
from bs4 import BeautifulSoup, Tag, NavigableString
soup = BeautifulSoup("<p>", 'html.parser')

tag:Optional[Tag]
string:Optional[NavigableString]
tags:Sequence[Tag]
strings:Sequence[NavigableString]

tag = soup.find()
tag = soup.find(id="a")
string = soup.find(string="b")

tags = soup()
tags = soup(id="a")
strings = soup(string="b")

tags = soup.find_all()
tags = soup.find_all(id="a")
strings = soup.find_all(string="b")

tag = soup.find_next()
tag = soup.find_next(id="a")
string = soup.find_next(string="b")

tags = soup.find_all_next()
tags = soup.find_all_next(id="a")
strings = soup.find_all_next(string="b")

tag = soup.find_next_sibling()
tag = soup.find_next_sibling(id="a")
string = soup.find_next_sibling(string="b")

tags = soup.find_next_siblings()
tags = soup.find_next_siblings(id="a")
strings = soup.find_next_siblings(string="b")

tag = soup.find_previous()
tag = soup.find_previous(id="a")
string = soup.find_previous(string="b")

tags = soup.find_all_previous()
tags = soup.find_all_previous(id="a")
strings = soup.find_all_previous(string="b")

tag = soup.find_previous_sibling()
tag = soup.find_previous_sibling(id="a")
string = soup.find_previous_sibling(string="bold")

tags = soup.find_previous_siblings()
tags = soup.find_previous_siblings(id="a")
strings = soup.find_previous_siblings(string="b")

tag = soup.find_parent()
tag = soup.find_parent(id="a")
tags = soup.find_parents()
tags = soup.find_parents(id="a")

# This code will work, but mypy and pyright will both flag it.
tags = soup.find_all("a", string="b")
---

* The typing for find_parent() and find_parents() was improved without
any overloading. Casts should never be necessary, since those
methods only ever return Tag and ResultSet[Tag], respectively.

* ResultSet now inherits from Sequence. This should make it easier to
incorporate ResultSet objects into your type system without needing to
handle ResultSet specially.

* Fixed an unhandled exception when creating the string representation of
a decomposed element. (The output is not *useful* and you still
shouldn't do this, but it won't raise an exception anymore.) [bug=2120300]

* The default value for the 'attrs' argument in find* methods is now
None, not the empty dictionary. This should have no visible effect
on anything.

= 4.13.5 (20250824)

* Fixed an unhandled exception when parsing invalid markup that contains the { character
when using lxml==6.0.0. [bug=2116306]
* Fixed a regression when matching a multi-valued attribute against the
empty string. [bug=2115352]
* Unit tests and test case data are no longer packaged with the wheel. [bug=2107495]
* Fixed a bug that gave the wrong result when parsing the empty bytestring. [bug=2110492]
* Brought the Spanish translation of the documentation up to date with
4.13.4. Courtesy of Carlos Romero.
* For Python 3.13 and above, disabled tests that verify Beautiful Soup's handling of html.parser's
exceptions when given very bad markup. The bug in html.parser that caused
this behavior has been fixed. Patch courtesy of Stefano Rivera.
* Used overloading to improve type hints for prettify().
* Updated the SoupStrainer documentation to clarify that during initial
parsing, attribute values are always passed into the SoupStrainer as raw strings. [bug=2111651]
* Fixed all type checking errors issued by pyright. (Previously only mypy
was used for type checking.)
* Improved the type hints for PageElement.replace_with. [bug=2114746]
* Improved the type hint for the arguments of the lambda function that can
be used to match a tag's attribute. [bug=2110401]
* Modified some of the lxml tests to accommodate behavioral changes in libxml2
2.14.3. Specifically:

1. XML declarations and processing instructions in HTML documents
are rewritten as comments. Note that this means XHTML documents will
now turn into regular HTML documents if run through the 'lxml'
parser. The 'xml' parser is unaffected.

2. Out-of-range numeric entities are replaced with REPLACEMENT
CHARACTER rather than omitted entirely. [bug=2112242]

= 4.13.4 (20250415)

* If you pass a function as the first argument to a find* method, the
function will only ever be called once per tag, with the Tag object
as the argument. Starting in 4.13.0, there were cases where the
function would be called with a Tag object and then called again
with the name of the tag. [bug=2106435]
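
As a rough illustration of the function-matching behavior described above, here is a minimal sketch. The function name has_href_but_no_title and the sample markup are made up for this example; the function assumes it is only ever handed a Tag, which is the behavior this release restores.

```python
from bs4 import BeautifulSoup


def has_href_but_no_title(tag):
    # Called with a Tag object; return True to include the tag in the results.
    return tag.has_attr("href") and not tag.has_attr("title")


soup = BeautifulSoup(
    '<a href="/one">one</a><a href="/two" title="Two">two</a><b>three</b>',
    "html.parser",
)

# Passing the function as the first argument filters tags by its return value.
matches = soup.find_all(has_href_but_no_title)
print([tag["href"] for tag in matches])
```

A function written this way would break if it were also called with a plain tag name (a str has no has_attr method), which is why the double call was a problem.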

* Added a passthrough implementation for NavigableString.__getitem__ which gives a
more helpful exception if the user tries to treat it as a Tag and
access its HTML attributes.

* Fixed a bug that caused an exception when unpickling the result of
parsing certain invalid markup with lxml as the tree builder. [bug=2103126]

* Converted the AUTHORS file to UTF-8 for PEP8 compliance. [bug=2107405]

= 4.13.3 (20250204)

* Modified the 4.13.2 change slightly to restore backwards
compatibility. Specifically, calling a find_* method with no
arguments should return the first Tag out of the iterator, not the
first PageElement. [bug=2097333]

= 4.13.2 (20250204)

* Gave ElementFilter the ability to explicitly say that it excludes
every item in the parse tree. This is used internally in situations
where the provided filters are logically inconsistent or match a
value against the null set.

Without this, it's not always possible to distinguish between
a SoupStrainer that excludes everything and one that excludes
nothing.

This fixes a bug where calls to find_* methods with no arguments
returned None, instead of the first item out of the iterator. [bug=2097333]

Things added to the API to support this:

- The ElementFilter.includes_everything property
- The MatchRule.exclude_everything member
- The _known_rules argument to ElementFilter.match. This is an optional
argument used internally to indicate that an optimization is safe.

= 4.13.1 (20250203)

* Updated pyproject.toml to require Python 3.7 or above. [bug=2097263]
* Pinned the typing-extensions dependency to a minimum version of 4.0.0.
[bug=2097262]
* Restored the English documentation to the source distribution.
[bug=2097237]
* Fixed a regression where HTMLFormatter and XMLFormatter were not
propagating the indent parameter to the superconstructor. [bug=2097272]

= 4.13.0 (20250202)

This release introduces Python type hints to all public classes and
methods in Beautiful Soup. The addition of these type hints exposed a
large number of very small inconsistencies in the code, which I've
fixed, but the result is a larger-than-usual number of deprecations
and changes that may break backwards compatibility.

Chris Papademetrious deserves a special thanks for his work on this
release through its long beta process.

NOTE: This release was yanked from PyPI on 20250203, because bug 2097263
made it difficult to install Beautiful Soup on Python 3.6. You can still
install this version by explicitly pinning beautifulsoup4==4.13.0, but
you really shouldn't need to.

# Deprecation notices

These things now give DeprecationWarnings when you try to use them,
and are scheduled to be removed in Beautiful Soup 4.15.0.

* Every deprecated method, attribute and class from the 3.0 and 2.0
major versions of Beautiful Soup. These have been deprecated for a
very long time, but they didn't issue DeprecationWarning when you
tried to use them. Now they do, and they're all going away soon.

This mainly refers to methods and attributes with camelCase names,
for example: renderContents, replaceWith, replaceWithChildren,
findAll, findAllNext, findAllPrevious, findNext, findNextSibling,
findNextSiblings, findParent, findParents, findPrevious,
findPreviousSibling, findPreviousSiblings, getText, nextSibling,
previousSibling, isSelfClosing, fetchNextSiblings,
fetchPreviousSiblings, fetchPrevious,
fetchParents, findChild, findChildren, childGenerator,
nextGenerator, nextSiblingGenerator, previousGenerator,
previousSiblingGenerator, recursiveChildGenerator, and
parentGenerator.

This also includes the BeautifulStoneSoup class.
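
To make the migration concrete, here is a small sketch that pairs a few of the deprecated camelCase names listed above with their snake_case equivalents. The markup is made up for the example, and only the current spellings are actually called.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>big</b> world</p>", "html.parser")

# Deprecated spelling          ->  current spelling
# soup.findAll("b")            ->  soup.find_all("b")
# soup.p.getText()             ->  soup.p.get_text()
# soup.b.nextSibling           ->  soup.b.next_sibling
# soup.b.replaceWith("small")  ->  soup.b.replace_with("small")
bold_tags = soup.find_all("b")
text = soup.p.get_text()
after_bold = soup.b.next_sibling

# replace_with() swaps the <b> tag for a plain string.
soup.b.replace_with("small")
print(soup)
```

To find remaining uses of the old names in a test suite, one option is to run it with Python's -W error::DeprecationWarning flag so that each deprecated call fails loudly.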

* The SAXTreeBuilder class, which was never officially supported or tested.

* The private class method BeautifulSoup._decode_markup(), which has not
been used inside Beautiful Soup for many years.

* The first argument to BeautifulSoup.decode has been changed from
pretty_print:bool to indent_level:int, to match the signature of
Tag.decode. Using a bool will still work but will give you a
DeprecationWarning.

* SoupStrainer.text and SoupStrainer.string are both deprecated, since
a single item can't capture all the possibilities of a SoupStrainer
designed to match strings.

* SoupStrainer.search_tag(). It was never a documented method, but if
you use it, you should start using SoupStrainer.allow_tag_creation()
instead.

* The soup:BeautifulSoup argument to the TreeBuilderForHtml5lib
constructor is now required, not optional. It's unclear why it was
optional in the first place, so if you discover you need this,
contact me for possible un-deprecation.

# Compatibility notices

* This version drops support for Python 3.6. The minimum supported
major Python version for Beautiful Soup is now Python 3.7.

* Deprecation warnings have been added for all deprecated methods and
attributes (see above). Going forward, deprecated names will be
removed two feature releases or one major release after the
deprecation warning is added.

* The storage for a tag's attribute values now modifies incoming values
to be consistent with the HTML or XML spec. This means that if you set an
attribute value to a number, it will be converted to a string
immediately, rather than being converted when you output the document.
[bug=2065525]

More importantly for backwards compatibility, setting an HTML
attribute value to True will set the attribute's value to the
appropriate string per the HTML spec. Setting an attribute value to
False or None will remove the attribute value from the tag
altogether, rather than (effectively, as before) setting the value
to the string "False" or the string "None".

This means that some programs that modify documents will generate
different output than they would in earlier versions of Beautiful Soup,
but the new documents are more likely to represent the intent behind the
modifications.

To give a specific example, if you have code that looks something like this:

    checkbox1['checked'] = True
    checkbox2['checked'] = False

Then a document that used to look like this (with most browsers
treating both boxes as checked):

    <input type="checkbox" checked="True"/>
    <input type="checkbox" checked="False"/>

Will now look like this (with browsers treating only the first box
as checked):

    <input type="checkbox" checked="checked"/>
    <input type="checkbox"/>

You can get the old behavior back by instantiating a TreeBuilder
with `attribute_dict_class=dict`, or you can customize how Beautiful Soup
treats attribute values by passing in a custom subclass of dict.

* If Tag.get_attribute_list() is used to access an attribute that's not set,
the return value is now an empty list rather than [None].

* If you pass an empty list as the attribute value when searching the
tree, you will now find all tags which have that attribute set to a value in
the empty list--that is, you will find nothing. This is consistent with other
situations where a list of acceptable values is provided. Previously, an
empty list was treated the same as None and False, and you would have
found the tags which did not have that attribute set at all. [bug=2045469]

* For similar reasons, if you pass in limit=0 to a find* method, you
will now get zero results. Previously, you would get all matching results.

* When using one of the find() methods or creating a SoupStrainer,
if you specify the same attribute in both ``attrs`` and the
keyword arguments, you'll end up with two different ways to match that
attribute. Previously the value in keyword arguments would override the
value in ``attrs``.

* All exceptions were moved to the bs4.exceptions module, and all
warnings to the bs4._warnings module (named so as not to shadow
Python's built-in warnings module). All warnings and exceptions are
exported from the bs4 module, which is probably the safest place to
import them from in your own code.
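
A minimal sketch of importing from the top-level bs4 module, as recommended above. The parser name "no-such-parser" is deliberately bogus and exists only to trigger FeatureNotFound in this example.

```python
import warnings

# Import exceptions and warnings from the top-level package rather than
# from bs4.exceptions or bs4._warnings.
from bs4 import BeautifulSoup, FeatureNotFound, MarkupResemblesLocatorWarning

error = None
try:
    # "no-such-parser" is a made-up parser name used to provoke the exception.
    BeautifulSoup("<p>text</p>", "no-such-parser")
except FeatureNotFound as exc:
    error = exc

# Warning classes imported the same way can be passed to the warnings module.
warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning)
print(type(error).__name__)
```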

* As a side effect of this, the string constant
BeautifulSoup.NO_PARSER_SPECIFIED_WARNING was moved to
GuessedAtParserWarning.MESSAGE.

* The 'html5' formatter is now much less aggressive about escaping
ampersands, escaping only the ampersands considered "ambiguous" by the HTML5
spec (which is almost none of them). This is the sort of change that
might break your unit test suite, but the resulting markup will be much more
readable and more HTML5-ish.

To quickly get the old behavior back, change code like this:

    tag.encode(formatter='html5')

to this:

    tag.encode(formatter='html5-4.12')

In the future, the 'html5' formatter may become the default HTML
formatter, which will change Beautiful Soup's default output. This
will break a lot of test suites so it's not going to happen for a
while. [bug=1902431]

* Tag.sourceline and Tag.sourcepos now always have a consistent data
type: Optional[int]. Previously these values were sometimes an
Optional[int], and sometimes they were Optional[Tag], the result of
searching for a child tag called <sourceline> or
<sourcepos>. [bug=2065904]

If your code does search for a tag called <sourceline> or
<sourcepos>, it may stop finding that tag when you upgrade to
Beautiful Soup 4.13. If this happens, you'll need to replace code
that treats "sourceline" or "sourcepos" as tag names:

    tag.sourceline

with code that explicitly calls the find() method:

    tag.find("sourceline").name

Making the behavior of sourceline and sourcepos consistent has the
side effect of fixing a major performance problem when a Tag is
copied.

With this change, the store_line_numbers argument to the
BeautifulSoup constructor becomes much less useful, and its use is
now discouraged, though I'm not deprecating it yet. Please contact
me if you have a performance or security rationale for setting
store_line_numbers=False.
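
Here is a short sketch of the sourceline situation described above, using a made-up document that really does contain a <sourceline> tag. The explicit find() call is the spelling that keeps working; the attribute is left to mean the line number.

```python
from bs4 import BeautifulSoup

# Made-up markup in which "sourceline" is an actual tag name.
markup = "<record><sourceline>42</sourceline></record>"
soup = BeautifulSoup(markup, "html.parser")
record = soup.find("record")

# Explicit lookup: unambiguously searches for a child tag.
child = record.find("sourceline")
print(child.name, child.string)

# Attribute access: a line number (or None), not the child tag.
line_number = record.sourceline
print(line_number)
```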

* append(), extend(), insert(), and unwrap() were moved from PageElement to
Tag. Those methods manipulate the 'contents' collection, so they would
only have ever worked on Tag objects.

* The BeautifulSoupHTMLParser constructor now requires a BeautifulSoup
object as its first argument. This almost certainly does not affect
you, since you probably use HTMLParserTreeBuilder, not
BeautifulSoupHTMLParser directly.

* The TreeBuilderForHtml5lib methods fragmentClass(), getFragment(),
and testSerializer() now raise NotImplementedError. These methods
are called only by html5lib's test suite, and Beautiful Soup isn't
integrated into that test suite, so this code was long since unused and
untested.

These methods are _not_ deprecated, since they are methods defined by
html5lib. They may one day have real implementations, as part of a future
effort to integrate Beautiful Soup into html5lib's test suite.

* AttributeValueWithCharsetSubstitution.encode() is renamed to
substitute_encoding, to avoid confusion with the much different str.encode().

* Using PageElement.replace_with() to replace an element with itself
returns the element instead of None.

* All TreeBuilder constructors now take the empty_element_tags
argument. The sets of tags found in HTMLTreeBuilder.empty_element_tags and
HTMLTreeBuilder.block_elements are now in
HTMLTreeBuilder.DEFAULT_EMPTY_ELEMENT_TAGS and
HTMLTreeBuilder.DEFAULT_BLOCK_ELEMENTS, to avoid confusing them with
instance variables.

* The unused constant LXMLTreeBuilderForXML.DEFAULT_PARSER_CLASS
has been removed.

* Some of the arguments in the methods of LXMLTreeBuilderForXML
have been renamed for consistency with the names lxml uses for those
arguments in the superclass. This won't affect you unless you were
calling methods like LXMLTreeBuilderForXML.start() directly.

* In particular, the arguments to LXMLTreeBuilderForXML.prepare_markup
have been changed to match the arguments to the superclass,
TreeBuilder.prepare_markup. Specifically, document_declared_encoding
now appears before exclude_encodings, not after. If you were calling
this method yourself, I recommend switching to using keyword
arguments instead.

# New features

* The new ElementFilter class encapsulates Beautiful Soup's rules
about matching elements and deciding which parts of a document to
parse. It's easy to override those rules with subclassing or
function composition. The SoupStrainer class, which contains all the
matching logic you're familiar with from the find_* methods, is now
a subclass of ElementFilter.

* The new PageElement.filter() method provides a fully general way of
finding elements in a Beautiful Soup parse tree. You can specify a
function to iterate over the tree and an ElementFilter to determine
what matches.

* The new_tag() method now takes a 'string' argument. This allows you to
set the string contents of a Tag when creating it. Patch by Chris
Papademetrious. [bug=2044599]

* Defined a number of new iterators which are the same as existing iterators,
but which yield the element itself before beginning to traverse the
tree. [bug=2052936] [bug=2067634]

- PageElement.self_and_parents
- PageElement.self_and_descendants
- PageElement.self_and_next_elements
- PageElement.self_and_next_siblings
- PageElement.self_and_previous_elements
- PageElement.self_and_previous_siblings

self_and_parents yields the element you call it on and then all of its
parents. self_and_next_elements yields the element you call it on and then
every element parsed afterwards; and so on.

* The NavigableString class now has a .string property which returns the
string itself. This makes it easier to iterate over a mixed list
of Tag and NavigableString objects. [bug=2044794]

* Defined a new method, Tag.copy_self(), which creates a copy of a Tag
with the same attributes but no contents. [bug=2065120]

Note that this method used to be a private method named
_clone(). The _clone() method has been removed, so if you were using
it, change your code to call copy_self() instead.

* The PageElement.append() method now returns the element that was
appended; it used to have no return value. [bug=2093025]

* The methods PageElement.insert(), PageElement.extend(),
PageElement.insert_before(), and PageElement.insert_after() now return a
list of the items inserted. These methods used to have no return
value. [bug=2093025]

* The PageElement.insert() method now takes a variable number of
arguments and returns a list of all elements inserted, to match
insert_before() and insert_after(). (Even if I hadn't made the
variable-argument change, an edge case around inserting one
Beautiful Soup object into another means that insert()'s return
value needs to be a list.) [bug=2093025]

* Defined a new warning class, UnusualUsageWarning, which is a superclass
for all of the warnings issued when Beautiful Soup notices something
unusual but not guaranteed to be wrong, like markup that looks like
a URL (MarkupResemblesLocatorWarning) or XML being run through an HTML
parser (XMLParsedAsHTMLWarning).

The text of these warnings has been revamped to explain in more
detail what is going on, how to check if you've made a mistake,
and how to make the warning go away if you are acting deliberately.

If these warnings are interfering with your workflow, or simply
annoying you, you can filter all of them by filtering
UnusualUsageWarning, without worrying about losing the warnings
Beautiful Soup issues when there *definitely* is a problem you
need to correct.

* It's now possible to modify the behavior of the list used to store the
values of multi-valued attributes such as HTML 'class', by passing in
whatever class you want instantiated (instead of a normal Python list)
to the TreeBuilder constructor as attribute_value_list_class.
[bug=2052943]

# Improvements

* decompose() was moved from Tag to its superclass PageElement, since
there's no reason it won't also work on NavigableString objects.

* Emit an UnusualUsageWarning if the user tries to search for an attribute
called _class; they probably mean "class_". [bug=2025089]

* The MarkupResemblesLocatorWarning issued when the markup resembles a
filename is now issued less often, due to improvements in detecting
markup that's unlikely to be a filename. [bug=2052988]

* Emit a warning if a document is parsed using a SoupStrainer that's
set up to filter everything. In these cases, filtering everything is
the most consistent thing to do, but there was no indication that
this was happening, so the behavior may have seemed mysterious.

* When using one of the find() methods or creating a SoupStrainer, you can
pass a list of any accepted object (strings, regular expressions, etc.) for
any of the search arguments. Previously you could only pass in a list of strings.
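
As an illustration of mixing matcher types in one list, this sketch searches by tag name using a plain string alongside a compiled regular expression. The markup and the pattern are made up for the example.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<h1>Title</h1><p>Body</p><a href='#'>link</a><h2>Sub</h2>",
    "html.parser",
)

# A tag matches if it satisfies any item in the list:
# the literal name "a", or a heading name such as h1 through h6.
matches = soup.find_all(["a", re.compile(r"^h[1-6]$")])
names = [tag.name for tag in matches]
print(names)
```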

* A SoupStrainer can now filter tag creation based on a tag's
namespaced name. Previously only the unqualified name could be used.

* Added the correct stacklevel to another instance of the
XMLParsedAsHTMLWarning. [bug=2034451]

* Improved the wording of the TypeError raised when you pass something
other than markup into the BeautifulSoup constructor. [bug=2071530]

* Optimized the case where you use Tag.insert() to "insert" a PageElement
into its current location. [bug=2077020]

* Changes to make tests work whether tests are run under soupsieve 2.6
or an earlier version. Based on a patch by Stefano Rivera.

* Removed the strip_cdata argument to lxml's HTMLParser
constructor, which never did anything and is deprecated as of lxml
5.3.0. Patch by Stefano Rivera. [bug=2076897]

# Bug fixes

* Copying a tag with a multi-valued attribute now makes a copy of the
list of values, eliminating a bug where both the old and new copy
shared the same list. [bug=2067412]

* The lxml TreeBuilder, like the other TreeBuilders, now filters a
document's initial DOCTYPE if you've set up a SoupStrainer that
eliminates it. [bug=2062000]

* A lot of things can go wrong if you modify the parse tree while
iterating over it, especially if you are removing or replacing
elements. Most of those things fall under the category of unexpected
behavior (which is why I don't recommend doing this), but there
were a few cases that caused unhandled exceptions. The
generators used by Beautiful Soup (e.g. .descendants, which
powers the find* methods) should now work correctly in those cases,
or at least not raise exceptions.

As part of this work, I changed when the generator
determines the next element. Previously it was done after the yield
statement; now it's done before the yield statement. This lets you
remove the yielded element in calling code, or modify it in a way that
would break this calculation, without causing an exception.

So if your code relies on modifying the tree in a way that 'steers' a
generator, rather than using the generator to decide
which bits of the tree to modify, it will probably stop working at
this point. [bug=2091118]
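
One way to sidestep the whole question, shown here as a sketch rather than an official recommendation, is to collect the elements into a list first and only then modify the tree. The markup is made up for the example.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<p>a<span>x</span>b<span>y</span></p>", "html.parser")

# Step 1: walk the tree to completion and remember what to remove.
doomed = [
    element
    for element in soup.descendants
    if isinstance(element, Tag) and element.name == "span"
]

# Step 2: modify the tree once the traversal is finished.
for element in doomed:
    element.decompose()

result = str(soup)
print(result)
```

Because the traversal has finished before anything is removed, the result does not depend on when the iterator computes its next element.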

* Fixed an error in the lookup table used when converting
ISO-Latin-1 to ASCII, which no one should do anyway.

* Corrected the markup that's output in the unlikely event that you
encode a document to a Python internal encoding (like "palmos")
that's not recognized by the HTML or XML standard.

* UnicodeDammit.markup is now always a bytestring representing the
*original* markup (sans BOM), and UnicodeDammit.unicode_markup is
always the converted Unicode equivalent of the original
markup. Previously, UnicodeDammit.markup was treated inconsistently
and would often end up containing Unicode. UnicodeDammit.markup was
not a documented attribute, but if you were using it, you probably
want to switch to using .unicode_markup instead.
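
A minimal sketch of reading the converted text from .unicode_markup, as suggested above. The sample bytes and the suggested encoding are made up for the example.

```python
from bs4 import UnicodeDammit

# Latin-1 bytes; the second argument lists encodings to try first.
raw = "Sacr\u00e9 bleu!".encode("latin-1")
dammit = UnicodeDammit(raw, ["latin-1"])

# .unicode_markup is the decoded str; prefer it over .markup.
text = dammit.unicode_markup
print(text)
print(dammit.original_encoding)
```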

= 4.12.3 (20240117)

* The Beautiful Soup documentation now has a Spanish translation, thanks
to Carlos Romero. Delong Wang's Chinese translation has been updated
to cover Beautiful Soup 4.12.0.

* Fixed a regression such that if you set .hidden on a tag, the tag
becomes invisible but its contents are still visible. User manipulation
of .hidden is not a documented or supported feature, so don't do this,
but it wasn't too difficult to keep the old behavior working.

* Fixed a case found by Mengyuhan where html.parser giving up on
markup would result in an AssertionError instead of a
ParserRejectedMarkup exception.

* Added the correct stacklevel to instances of the XMLParsedAsHTMLWarning.
[bug=2034451]

* Corrected the syntax of the license definition in pyproject.toml. Patch
by Louis Maddox. [bug=2032848]

* Corrected a typo in a test that was causing test failures when run against
libxml2 2.12.1. [bug=2045481]

= 4.12.2 (20230407)

* Fixed an unhandled exception in BeautifulSoup.decode_contents
and methods that call it. [bug=2015545]

= 4.12.1 (20230405)

NOTE: the following things are likely to be dropped in the next
feature release of Beautiful Soup:

Official support for Python 3.6.
Inclusion of unit tests and test data in the wheel file.
Two scripts: demonstrate_parser_differences.py and test-all-versions.

Changes:

* This version of Beautiful Soup replaces setup.py and setup.cfg
with pyproject.toml. Beautiful Soup now uses tox as its test backend
and hatch to do builds.

* The main functional improvement in this version is a nonrecursive technique
for regenerating a tree. This technique is used to avoid situations where,
in previous versions, doing something to a very deeply nested tree
would overflow the Python interpreter stack:

1. Outputting a tree as a string, e.g. with
BeautifulSoup.encode() [bug=1471755]

2. Making copies of trees (copy.copy() and
copy.deepcopy() from the Python standard library). [bug=1709837]

3. Pickling a BeautifulSoup object. (Note that pickling a Tag
object can still cause an overflow.)

* Making a copy of a BeautifulSoup object no longer parses the
document again, which should improve performance significantly.

* When a BeautifulSoup object is unpickled, Beautiful Soup now
tries to associate an appropriate TreeBuilder object with it.

* Tag.prettify() will now consistently end prettified markup with
a newline.

* Added unit tests for fuzz test cases created by third
parties. Some of these tests are skipped since they point
to problems outside of Beautiful Soup, but this change
puts them all in one convenient place.

* PageElement now implements the known_xml attribute. (This was technically
a bug, but it shouldn't be an issue in normal use.) [bug=2007895]

* The demonstrate_parser_differences.py script was still written in
Python 2. I've converted it to Python 3, but since no one has
mentioned this over the years, it's a sign that no one uses this
script and it's not serving its purpose.

= 4.12.0 (20230320)

* Introduced the .css property, which centralizes all access to
the Soup Sieve API. This allows Beautiful Soup to give direct
access to as much of Soup Sieve as makes sense, without cluttering
the BeautifulSoup and Tag classes with a lot of new methods.

This does mean one addition to the BeautifulSoup and Tag classes
(the .css property itself), so this might be a breaking change if you
happen to use Beautiful Soup to parse XML that includes a tag called
<css>. In particular, code like this will stop working in 4.12.0:

    soup.css['id']

Code like this will work just as before:

    soup.find('css')['id']

The Soup Sieve methods supported through the .css property are
select(), select_one(), iselect(), closest(), match(), filter(),
escape(), and compile(). The BeautifulSoup and Tag classes still
support the select() and select_one() methods; they have not been
deprecated, but they have been demoted to convenience methods.

[bug=2003677]

* When the html.parser parser decides it can't parse a document, Beautiful
Soup now consistently propagates this fact by raising a
ParserRejectedMarkup error. [bug=2007343]

* Removed some error checking code from diagnose(), which is redundant with
similar (but more Pythonic) code in the BeautifulSoup constructor.
[bug=2007344]

* Added intersphinx references to the documentation so that other
projects have a target to point to when they reference Beautiful
Soup classes. [bug=1453370]

= 4.11.2 (20230131)

* Fixed test failures caused by nondeterministic behavior of
UnicodeDammit's character detection, depending on the platform setup.
[bug=1973072]

* Fixed another crash when overriding multi_valued_attributes and using the
html5lib parser. [bug=1948488]

* The HTMLFormatter and XMLFormatter constructors no longer return a
value. [bug=1992693]

* Tag.interesting_string_types is now propagated when a tag is
copied. [bug=1990400]

* Warnings now do their best to provide an appropriate stacklevel,
improving the usefulness of the message. [bug=1978744]

* Passing a Tag's .contents into PageElement.extend() now works the
same way as passing the Tag itself.

* Soup Sieve tests will be skipped if the library is not installed.

= 4.11.1 (20220408)

This release was done to ensure that the unit tests are packaged along
with the released source. There are no functionality changes in this
release, but there are a few other packaging changes:

* The Japanese and Korean translations of the documentation are included.
* The changelog is now packaged as CHANGELOG, and the license file is
packaged as LICENSE. NEWS.txt and COPYING.txt are still present,
but may be removed in the future.
* TODO.txt is no longer packaged, since a TODO is not relevant for released
code.

= 4.11.0 (20220407)

* Ported unit tests to use pytest.

* Added special string classes, RubyParenthesisString and RubyTextString,
to make it possible to treat ruby text specially in get_text() calls.
[bug=1941980]

* It's now possible to customize the way output is indented by
providing a value for the 'indent' argument to the Formatter
constructor. The 'indent' argument works very similarly to the
argument of the same name in the Python standard library's
json.dump() function. [bug=1955497]

* If the charset-normalizer Python module
(https://pypi.org/project/charset-normalizer/) is installed, Beautiful
Soup will use it to detect the character sets of incoming documents.
This is also the module used by newer versions of the Requests library.
For the sake of backwards compatibility, chardet and cchardet both take
precedence if installed. [bug=1955346]

* Added a workaround for an lxml bug
(https://bugs.launchpad.net/lxml/+bug/1948551) that causes
problems when parsing a Unicode string beginning with BYTE ORDER MARK.
[bug=1947768]

* Issue a warning when an HTML parser is used to parse a document that
looks like XML but not XHTML. [bug=1939121]

* Do a better job of keeping track of namespaces as an XML document is
parsed, so that CSS selectors that use namespaces will do the right
thing more often. [bug=1946243]

* Some time ago, the misleadingly named "text" argument to find-type
methods was renamed to the more accurate "string." But this supposed
"renaming" didn't make it into important places like the method
signatures or the docstrings. That's corrected in this
version. "text" still works, but will give a DeprecationWarning.
[bug=1947038]

* Fixed a crash when pickling a BeautifulSoup object that has no
tree builder. [bug=1934003]

* Fixed a crash when overriding multi_valued_attributes and using the
html5lib parser. [bug=1948488]

* Standardized the wording of the MarkupResemblesLocatorWarning
warnings to omit untrusted input and make the warnings less
judgmental about what you ought to be doing. [bug=1955450]

* Removed support for the iconv_codec library, which doesn't seem
to exist anymore and was never put up on PyPI. (The closest
replacement on PyPI, iconv_codecs, is GPL-licensed, so we can't use
it--it's also quite old.)

= 4.10.0 (20210907)

* This is the first release of Beautiful Soup to only support Python
3. I dropped Python 2 support to maintain support for newer versions
(58 and up) of setuptools. See:
https://github.com/pypa/setuptools/issues/2769 [bug=1942919]

* The behavior of methods like .get_text() and .strings now differs
depending on the type of tag. The change is visible with HTML tags
like <script>, <style>, and <template>. Starting in 4.9.0, methods
like get_text() returned no results on such tags, because the
contents of those tags are not considered 'text' within the document
as a whole.

But a user who calls script.get_text() is working from a different
definition of 'text' than a user who calls div.get_text()--otherwise
there would be no need to call script.get_text() at all. In 4.10.0,
the contents of (e.g.) a <script> tag are considered 'text' during a
get_text() call on the tag itself, but not considered 'text' during
a get_text() call on the tag's parent.

Because of this change, calling get_text() on each child of a tag
may now return a different result than calling get_text() on the tag
itself. That's because different tags now have different
understandings of what counts as 'text'. [bug=1906226] [bug=1868861]
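
To make the difference concrete, here is a small sketch assuming Beautiful Soup 4.10.0 or later and the html.parser tree builder. The markup is made up for illustration.

```python
from bs4 import BeautifulSoup

markup = "<div><p>hello</p><script>var x = 1;</script></div>"
soup = BeautifulSoup(markup, "html.parser")

# Called on the parent, the script contents do not count as 'text'.
div_text = soup.div.get_text()

# Called on the <script> tag itself, its contents do count as 'text'.
script_text = soup.script.get_text()
```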

* NavigableString and its subclasses now implement the get_text()
method, as well as the properties .strings and
.stripped_strings. These methods will either return the string
itself, or nothing, so the only reason to use this is when iterating
over a list of mixed Tag and NavigableString objects. [bug=1904309]

* The 'html5' formatter now treats attributes whose values are the
empty string as HTML boolean attributes. Previously (and in other
formatters), an attribute value must be set as None to be treated as
a boolean attribute. In a future release, I plan to also give this
behavior to the 'html' formatter. Patch by Isaac Muse. [bug=1915424]
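
A brief sketch of the boolean-attribute handling, assuming Beautiful Soup 4.10.0 or later. The <option> markup is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<option selected=""></option>', "html.parser")

# The 'html5' formatter writes an empty-string attribute as a bare name.
as_html5 = soup.option.decode(formatter="html5")
```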

* The 'replace_with()' method now takes a variable number of arguments,
and can be used to replace a single element with a sequence of elements.
Patch by Bill Chandos. [rev=605]
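
For example, a sketch of replacing one tag with a tag followed by a string, assuming Beautiful Soup 4.10.0 or later. The markup is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>old</b></p>", "html.parser")

emphasis = soup.new_tag("i")
emphasis.string = "new"

# Replace the single <b> tag with two elements, in order.
soup.b.replace_with(emphasis, " and more")
result = str(soup.p)
```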

* Corrected output when the namespace prefix associated with a
namespaced attribute is the empty string, as opposed to
None. [bug=1915583]

* Performance improvement when processing tags that speeds up overall
tree construction by 2%. Patch by Morotti. [bug=1899358]

* Corrected the use of special string container classes in cases when a
single tag may contain strings with different containers; such as
the <template> tag, which may contain both TemplateString objects
and Comment objects. [bug=1913406]

* The html.parser tree builder can now handle named entities
found in the HTML5 spec in much the same way that the html5lib
tree builder does. Note that the lxml HTML tree builder doesn't handle
named entities this way. [bug=1924908]

* Added a second way to specify encodings to UnicodeDammit and
EncodingDetector, based on the order of precedence defined in the
HTML5 spec, starting at:
https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding

Encodings in 'known_definite_encodings' are tried first, then
byte-order-mark sniffing is run, then encodings in 'user_encodings'
are tried. The old argument, 'override_encodings', is now a
deprecated alias for 'known_definite_encodings'.

This changes the default behavior of the html.parser and lxml tree
builders, in a way that may slightly improve encoding
detection but will probably have no effect. [bug=1889014]
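
A minimal sketch of 'known_definite_encodings', assuming Beautiful Soup 4.10.0 or later. The byte string is made up for illustration.

```python
from bs4.dammit import UnicodeDammit

# "cafe" with an e-acute, encoded as Latin-1; not valid UTF-8.
data = b"caf\xe9"

# Encodings listed here are tried before any sniffing or guessing.
dammit = UnicodeDammit(data, known_definite_encodings=["latin-1"])
text = dammit.unicode_markup
```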

* Improve the warning issued when a directory name (as opposed to
the name of a regular file) is passed as markup into the BeautifulSoup
constructor. [bug=1913628]

= 4.9.3 (20201003)

This is the final release of Beautiful Soup to support Python
2. Beautiful Soup's official support for Python 2 ended on 01 January,
2021. In the Launchpad Git repository, the final revision to support
Python 2 was revision 70f546b1e689a70e2f103795efce6d261a3dadf7; it is
tagged as "python2".

* Implemented a significant performance optimization to the process of
searching the parse tree. Patch by Morotti. [bug=1898212]

= 4.9.2 (20200926)

* Fixed a bug that caused too many tags to be popped from the tag
stack during tree building, when encountering a closing tag that had
no matching opening tag. [bug=1880420]

* Fixed a bug that inconsistently moved elements over when passing
a Tag, rather than a list, into Tag.extend(). [bug=1885710]

* Specify the soupsieve dependency in a way that complies with
PEP 508. Patch by Mike Nerone. [bug=1893696]

* Change the signatures for BeautifulSoup.insert_before and insert_after
(which are not implemented) to match PageElement.insert_before and
insert_after, quieting warnings in some IDEs. [bug=1897120]

= 4.9.1 (20200517)

* Added a keyword argument 'on_duplicate_attribute' to the
BeautifulSoupHTMLParser constructor (used by the html.parser tree
builder) which lets you customize the handling of markup that
contains the same attribute more than once, as in:
<a href="url1" href="url2"> [bug=1878209]
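
A small sketch of 'on_duplicate_attribute', assuming Beautiful Soup 4.9.1 or later and the html.parser tree builder. The argument is passed through the BeautifulSoup constructor, and the URLs are placeholders.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://url1/" href="http://url2/">link</a>'

# Default behavior: the last value of a repeated attribute wins.
default_href = BeautifulSoup(markup, "html.parser").a["href"]

# 'ignore' keeps the first value and discards later duplicates.
first_href = BeautifulSoup(
    markup, "html.parser", on_duplicate_attribute="ignore"
).a["href"]
```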

* Added a distinct subclass, GuessedAtParserWarning, for the warning
issued when BeautifulSoup is instantiated without a parser being
specified. [bug=1873787]

* Added a distinct subclass, MarkupResemblesLocatorWarning, for the
warning issued when BeautifulSoup is instantiated with 'markup' that
actually seems to be a URL or the path to a file on
disk. [bug=1873787]

* The new NavigableString subclasses (Stylesheet, Script, and
TemplateString) can now be imported directly from the bs4 package.

* If you encode a document with a Python-specific encoding like
'unicode_escape', that encoding is no longer mentioned in the final
XML or HTML document. Instead, encoding information is omitted or
left blank. [bug=1874955]

* Fixed test failures when run against soupselect 2.0. Patch by Tomáš
Chvátal. [bug=1872279]

= 4.9.0 (20200405)

* Added PageElement.decomposed, a new property which lets you
check whether you've already called decompose() on a Tag or
NavigableString.

* Embedded CSS and Javascript is now stored in distinct Stylesheet and
Script tags, which are ignored by methods like get_text() since most
people don't consider this sort of content to be 'text'. This
feature is not supported by the html5lib treebuilder. [bug=1868861]

* Added a Russian translation by 'authoress' to the repository.

* Fixed an unhandled exception when formatting a Tag that had been
decomposed. [bug=1857767]

* Fixed a bug that happened when passing a Unicode filename containing
non-ASCII characters as markup into Beautiful Soup, on a system that
allows Unicode filenames. [bug=1866717]

* Added a performance optimization to PageElement.extract(). Patch by
Arthur Darcet.

= 4.8.2 (20191224)

* Added Python docstrings to all public methods of the most commonly
used classes.

* Added a Chinese translation by Deron Wang and a Brazilian Portuguese
translation by Cezar Peixeiro to the repository.

* Fixed two deprecation warnings. Patches by Colin
Watson and Nicholas Neumann. [bug=1847592] [bug=1855301]

* The html.parser tree builder now correctly handles DOCTYPEs that are
not uppercase. [bug=1848401]

* PageElement.select() now returns a ResultSet rather than a regular
list, making it consistent with methods like find_all().

= 4.8.1 (20191006)

* When the html.parser or html5lib parsers are in use, Beautiful Soup
will, by default, record the position in the original document where
each tag was encountered. This includes line number (Tag.sourceline)
and position within a line (Tag.sourcepos). Based on code by Chris
Mayo. [bug=1742921]
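
A quick sketch of reading the recorded positions, assuming Beautiful Soup 4.8.1 or later and the html.parser tree builder (html5lib may report different positions for the same tags). The markup is made up for illustration.

```python
from bs4 import BeautifulSoup

markup = "<p>one</p>\n<p>two</p>"
soup = BeautifulSoup(markup, "html.parser")

# (line number, offset within the line) for each <p> tag.
positions = [(tag.sourceline, tag.sourcepos) for tag in soup.find_all("p")]
```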

* When instantiating a BeautifulSoup object, it's now possible to
provide a dictionary ('element_classes') of the classes you'd like to be
instantiated instead of Tag, NavigableString, etc.

* Fixed the definition of the default XML namespace when using
lxml 4.4. Patch by Isaac Muse. [bug=1840141]

* Fixed a crash when pretty-printing tags that were not created
during initial parsing. [bug=1838903]

* Copying a Tag preserves information that was originally obtained from
the TreeBuilder used to build the original Tag. [bug=1838903]

* Raise an explanatory exception when the underlying parser
completely rejects the incoming markup. [bug=1838877]

* Avoid a crash when trying to detect the declared encoding of a
Unicode document. [bug=1838877]

* Avoid a crash when unpickling certain parse trees generated
using html5lib on Python 3. [bug=1843545]

= 4.8.0 (20190720, "One Small Soup")

This release focuses on making it easier to customize Beautiful Soup's
input mechanism (the TreeBuilder) and output mechanism (the Formatter).

* You can customize the TreeBuilder object by passing keyword
arguments into the BeautifulSoup constructor. Those keyword
arguments will be passed along into the TreeBuilder constructor.

The main reason to do this right now is to change which
attributes are treated as multi-valued attributes (the way 'class'
is treated by default). You can do this with the
'multi_valued_attributes' argument. [bug=1832978]
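
As an illustration, a sketch assuming Beautiful Soup 4.8.0 or later. Passing None turns off multi-valued handling altogether, and the markup is made up.

```python
from bs4 import BeautifulSoup

markup = '<p class="a b">text</p>'

# Default: 'class' is split into a list of values.
default_classes = BeautifulSoup(markup, "html.parser").p["class"]

# With multi_valued_attributes=None, the raw string is kept as-is.
raw_class = BeautifulSoup(
    markup, "html.parser", multi_valued_attributes=None
).p["class"]
```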

* The role of Formatter objects has been greatly expanded. The Formatter
class now controls the following:

- The function to call to perform entity substitution. (This was
previously Formatter's only job.)
- Which tags should be treated as containing CDATA and have their
contents exempt from entity substitution.
- The order in which a tag's attributes are output. [bug=1812422]
- Whether or not to put a '/' inside a void element, e.g. '<br/>' vs '<br>'

All preexisting code should work as before.
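
One of these jobs, attribute ordering, can be sketched as follows, assuming Beautiful Soup 4.8.0 or later. SourceOrderFormatter is a hypothetical subclass written for this example, not part of the library.

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter


class SourceOrderFormatter(HTMLFormatter):
    """Hypothetical formatter that keeps attributes in parse order."""

    def attributes(self, tag):
        # The stock formatter sorts attributes by name; yield them unsorted.
        yield from tag.attrs.items()


soup = BeautifulSoup('<p z="1" a="2"></p>', "html.parser")
default_output = soup.p.decode()
source_order_output = soup.p.decode(formatter=SourceOrderFormatter())
```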

* Added a new method to the API, Tag.smooth(), which consolidates
multiple adjacent NavigableString elements. [bug=1697296]
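
A short sketch of smooth(), assuming Beautiful Soup 4.8.0 or later. The markup is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>A one</p>", "html.parser")

# Appending a string leaves two adjacent NavigableStrings in the tag.
soup.p.append(", a two")
count_before = len(soup.p.contents)

# smooth() merges adjacent strings in place.
soup.smooth()
contents_after = [str(s) for s in soup.p.contents]
```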

* &apos; (which is valid in XML, XHTML, and HTML 5, but not HTML 4) is always
recognized as a named entity and converted to a single quote. [bug=1818721]

= 4.7.1 (20190106)

* Fixed a significant performance problem introduced in 4.7.0. [bug=1810617]

* Fixed an incorrectly raised exception when inserting a tag before or
after an identical tag. [bug=1810692]

* Beautiful Soup will no longer try to keep track of namespaces that
are not defined with a prefix; this can confuse soupselect. [bug=1810680]

* Tried even harder to avoid the deprecation warning originally fixed in
4.6.1. [bug=1778909]

= 4.7.0 (20181231)

* Beautiful Soup's CSS Selector implementation has been replaced by a
dependency on Isaac Muse's SoupSieve project (the soupsieve package
on PyPI). The good news is that SoupSieve has a much more robust and
complete implementation of CSS selectors, resolving a large number
of longstanding issues. The bad news is that from this point onward,
SoupSieve must be installed if you want to use the select() method.

You don't have to change anything if you installed Beautiful Soup
through pip (SoupSieve will be automatically installed when you
upgrade Beautiful Soup) or if you don't use CSS selectors from
within Beautiful Soup.

SoupSieve documentation: https://facelessuser.github.io/soupsieve/

* Added the PageElement.extend() method, which works like list.extend().
[bug=1514970]

* PageElement.insert_before() and insert_after() now take a variable
number of arguments. [bug=1514970]

* Fix a number of problems with the tree builder that caused
trees that were superficially okay, but which fell apart when bits
were extracted. Patch by Isaac Muse. [bug=1782928,1809910]

* Fixed a problem with the tree builder in which elements that
contained no content (such as empty comments and all-whitespace
elements) were not being treated as part of the tree. Patch by Isaac
Muse. [bug=1798699]

* Fixed a problem with multi-valued attributes where the value
contained whitespace. Thanks to Jens Svalgaard for the
fix. [bug=1787453]

* Clarified ambiguous license statements in the source code. Beautiful
Soup is released under the MIT license, and has been since 4.4.0.

* This file has been renamed from NEWS.txt to CHANGELOG.

= 4.6.3 (20180812)

* Exactly the same as 4.6.2. Re-released to make the README file
render properly on PyPI.

= 4.6.2 (20180812)

* Fix an exception when a custom formatter was asked to format a void
element. [bug=1784408]

= 4.6.1 (20180728)

* Stop data loss when encountering an empty numeric entity, and
possibly in other cases. Thanks to tos.kamiya for the fix. [bug=1698503]

* Preserve XML namespaces introduced inside an XML document, not just
the ones introduced at the top level. [bug=1718787]

* Added a new formatter, "html5", which represents void elements
as "<element>" rather than "<element/>". [bug=1716272]

* Fixed a problem where the html.parser tree builder interpreted
a string like "&foo " as the character entity "&foo;" [bug=1728706]

* Correctly handle invalid HTML numeric character entities like &#147;
which reference code points that are not Unicode code points. Note
that this is only fixed when Beautiful Soup is used with the
html.parser parser -- html5lib already worked and I couldn't fix it
with lxml. [bug=1782933]

* Improved the warning given when no parser is specified. [bug=1780571]

* When markup contains duplicate elements, a select() call that
includes multiple match clauses will match all relevant
elements. [bug=1770596]

* Fixed code that was causing deprecation warnings in recent Python 3
versions. Includes a patch from Ville Skyttä. [bug=1778909] [bug=1689496]

* Fixed a Windows crash in diagnose() when checking whether a long
markup string is a filename. [bug=1737121]

* Stopped HTMLParser from raising an exception in very rare cases of
bad markup. [bug=1708831]

* Fixed a bug where find_all() was not working when asked to find a
tag with a namespaced name in an XML document that was parsed as
HTML. [bug=1723783]

* You can get finer control over formatting by subclassing
bs4.element.Formatter and passing a Formatter instance into (e.g.)
encode(). [bug=1716272]

* You can pass a dictionary of `attrs` into
BeautifulSoup.new_tag. This makes it possible to create a tag with
an attribute like 'name' that would otherwise be masked by another
argument of new_tag. [bug=1779276]

* Clarified the deprecation warning when accessing tag.fooTag, to cover
the possibility that you might really have been looking for a tag
called 'fooTag'.

= 4.6.0 (20170507) =

* Added the `Tag.get_attribute_list` method, which acts like `Tag.get` for
getting the value of an attribute, but which always returns a list,
whether or not the attribute is a multi-value attribute. [bug=1678589]
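
A minimal sketch contrasting the two methods, assuming Beautiful Soup 4.6.0 or later. The markup is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="x" class="a b">text</p>', "html.parser")
tag = soup.p

# Tag.get returns a string for 'id' but a list for 'class'.
id_value = tag.get("id")

# Tag.get_attribute_list returns a list in both cases.
id_list = tag.get_attribute_list("id")
class_list = tag.get_attribute_list("class")
```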

* It's now possible to use a tag's namespace prefix when searching,
e.g. soup.find('namespace:tag') [bug=1655332]

* Improved the handling of empty-element tags like <br> when using the
html.parser parser. [bug=1676935]

* HTML parsers treat all HTML4 and HTML5 empty element tags (aka void
element tags) correctly. [bug=1656909]

* Namespace prefix is preserved when an XML tag is copied. Thanks
to Vikas for a patch and test. [bug=1685172]

= 4.5.3 (20170102) =

* Fixed foster parenting when html5lib is the tree builder. Thanks to
Geoffrey Sneddon for a patch and test.

* Fixed yet another problem that caused the html5lib tree builder to
create a disconnected parse tree. [bug=1629825]

= 4.5.2 (20170102) =

* Apart from the version number, this release is identical to
4.5.3. Due to user error, it could not be completely uploaded to
PyPI. Use 4.5.3 instead.

= 4.5.1 (20160802) =

* Fixed a crash when passing Unicode markup that contained a
processing instruction into the lxml HTML parser on Python
3. [bug=1608048]

= 4.5.0 (20160719) =

* Beautiful Soup is no longer compatible with Python 2.6. This
actually happened a few releases ago, but it's now official.

* Beautiful Soup will now work with versions of html5lib greater than
0.99999999. [bug=1603299]

* If a search against each individual value of a multi-valued
attribute fails, the search will be run one final time against the
complete attribute value considered as a single string. That is, if
a tag has class="foo bar" and neither "foo" nor "bar" matches, but
"foo bar" does, the tag is now considered a match.

This happened in previous versions, but only when the value being
searched for was a string. Now it also works when that value is
a regular expression, a list of strings, etc. [bug=1476868]
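
The string case can be sketched like this, assuming a reasonably recent Beautiful Soup 4. The markup is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="foo bar">text</p>', "html.parser")

# Matches one of the individual class values.
by_single_value = soup.find_all("p", class_="foo")

# No single value equals "foo bar", so the fallback compares the
# complete attribute value as one string.
by_full_value = soup.find_all("p", class_="foo bar")

# The fallback is an exact comparison, so a different order does not match.
by_reordered_value = soup.find_all("p", class_="bar foo")
```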

* Fixed a bug that deranged the tree when a whitespace element was
reparented into a tag that contained an identical whitespace
element. [bug=1505351]

* Added support for CSS selector values that contain quoted spaces,
such as tag[style="display: foo"]. [bug=1540588]

* Corrected handling of XML processing instructions. [bug=1504393]

* Corrected an encoding error that happened when a BeautifulSoup
object was copied. [bug=1554439]

* The contents of <textarea> tags will no longer be modified when the
tree is prettified. [bug=1555829]

* When a BeautifulSoup object is pickled but its tree builder cannot
be pickled, its .builder attribute is set to None instead of being
destroyed. This avoids a performance problem once the object is
unpickled. [bug=1523629]

* Specify the file and line number when warning about a
BeautifulSoup object being instantiated without a parser being
specified. [bug=1574647]

* The `limit` argument to `select()` now works correctly, though it's
not implemented very efficiently. [bug=1520530]

* Fixed a Python 3 BytesWarning when a URL was passed in as though it
were markup. Thanks to James Salter for a patch and
test. [bug=1533762]

* We don't run the check for a filename passed in as markup if the
'filename' contains a less-than character; the less-than character
indicates it's most likely a very small document. [bug=1577864]

= 4.4.1 (20150928) =

* Fixed a bug that deranged the tree when part of it was
removed. Thanks to Eric Weiser for the patch and John Wiseman for a
test. [bug=1481520]

* Fixed a parse bug with the html5lib tree-builder. Thanks to Roel
Kramer for the patch. [bug=1483781]

* Improved the implementation of CSS selector grouping. Thanks to
Orangain for the patch. [bug=1484543]

* Fixed the test_detect_utf8 test so that it works when chardet is
installed. [bug=1471359]

* Corrected the output of Declaration objects. [bug=1477847]

= 4.4.0 (20150703) =

Especially important changes:

* Added a warning when you instantiate a BeautifulSoup object without
explicitly naming a parser. [bug=1398866]

* __repr__ now returns an ASCII bytestring in Python 2, and a Unicode
string in Python 3, instead of a UTF8-encoded bytestring in both
versions. In Python 3, __str__ now returns a Unicode string instead
of a bytestring. [bug=1420131]

* The `text` argument to the find_* methods is now called `string`,
which is more accurate. `text` still works, but `string` is the
argument described in the documentation. `text` may eventually
change its meaning, but not for a very long time. [bug=1366856]

* Changed the way soup objects work under copy.copy(). Copying a
NavigableString or a Tag will give you a new object that's
equal to the old one but not connected to the parse tree. Patch by
Martijn Pieters. [bug=1307490]
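
A small sketch of the copy behavior, assuming a reasonably recent Beautiful Soup 4. The markup is made up for illustration.

```python
import copy

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>text</p></div>", "html.parser")
original = soup.p

# The copy compares equal to the original but has no parent.
duplicate = copy.copy(original)
is_equal = duplicate == original
is_detached = duplicate.parent is None
```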

* Started using a standard MIT license. [bug=1294662]

* Added a Chinese translation of the documentation by Delong .w.

New features:

* Introduced the select_one() method, which uses a CSS selector but
only returns the first match, instead of a list of
matches. [bug=1349367]

* You can now create a Tag object without specifying a
TreeBuilder. Patch by Martijn Pieters. [bug=1307471]

* You can now create a NavigableString or a subclass just by invoking
the constructor. [bug=1294315]

* Added an `exclude_encodings` argument to UnicodeDammit and to the
Beautiful Soup constructor, which lets you prohibit the detection of
an encoding that you know is wrong. [bug=1469408]

* The select() method now supports selector grouping. Patch by
Francisco Canas [bug=1191917]

Bug fixes:

* Fixed yet another problem that caused the html5lib tree builder to
create a disconnected parse tree. [bug=1237763]

* Force object_was_parsed() to keep the tree intact even when an element
from later in the document is moved into place. [bug=1430633]

* Fixed yet another bug that caused a disconnected tree when html5lib
copied an element from one part of the tree to another. [bug=1270611]

* Fixed a bug where Element.extract() could create an infinite loop in
the remaining tree.

* The select() method can now find tags whose names contain
dashes. Patch by Francisco Canas. [bug=1276211]

* The select() method can now find tags with attributes whose names
contain dashes. Patch by Marek Kapolka. [bug=1304007]

* Improved the lxml tree builder's handling of processing
instructions. [bug=1294645]

* Restored the helpful syntax error that happens when you try to
import the Python 2 edition of Beautiful Soup under Python
3. [bug=1213387]

* In Python 3.4 and above, set the new convert_charrefs argument to
the html.parser constructor to avoid a warning and future
failures. Patch by Stefano Revera. [bug=1375721]

* The warning when you pass in a filename or URL as markup will now be
displayed correctly even if the filename or URL is a Unicode
string. [bug=1268888]

* If the initial <html> tag contains a CDATA list attribute such as
'class', the html5lib tree builder will now turn its value into a
list, as it would with any other tag. [bug=1296481]

* Fixed an import error in Python 3.5 caused by the removal of the
HTMLParseError class. [bug=1420063]

* Improved docstring for encode_contents() and
decode_contents(). [bug=1441543]

* Fixed a crash in Unicode, Dammit's encoding detector when the name
of the encoding itself contained invalid bytes. [bug=1360913]

* Improved the exception raised when you call .unwrap() or
.replace_with() on an element that's not attached to a tree.

* Raise a NotImplementedError whenever an unsupported CSS pseudoclass
is used in select(). Previously some cases did not result in a
NotImplementedError.

* It's now possible to pickle a BeautifulSoup object no matter which
tree builder was used to create it. However, the only tree builder
that survives the pickling process is the HTMLParserTreeBuilder
('html.parser'). If you unpickle a BeautifulSoup object created with
some other tree builder, soup.builder will be None. [bug=1231545]

= 4.3.2 (20131002) =

* Fixed a bug in which short Unicode input was improperly encoded to
ASCII when checking whether or not it was the name of a file on
disk. [bug=1227016]

* Fixed a crash when a short input contains data not valid in
filenames. [bug=1232604]

* Fixed a bug that caused Unicode data put into UnicodeDammit to
return None instead of the original data. [bug=1214983]

* Combined two tests to stop a spurious test failure when tests are
run by nosetests. [bug=1212445]

= 4.3.1 (20130815) =

* Fixed yet another problem with the html5lib tree builder, caused by
html5lib's tendency to rearrange the tree during
parsing. [bug=1189267]

* Fixed a bug that caused the optimized version of find_all() to
return nothing. [bug=1212655]

= 4.3.0 (20130812) =

* Instead of converting incoming data to Unicode and feeding it to the
lxml tree builder in chunks, Beautiful Soup now makes successive
guesses at the encoding of the incoming data, and tells lxml to
parse the data as that encoding. Giving lxml more control over the
parsing process improves performance and avoids a number of bugs and
issues with the lxml parser which had previously required elaborate
workarounds:

- An issue in which lxml refuses to parse Unicode strings on some
systems. [bug=1180527]

- A returning bug that truncated documents longer than a (very
small) size. [bug=963880]

- A returning bug in which extra spaces were added to a document if
the document defined a charset other than UTF-8. [bug=972466]

This required a major overhaul of the tree builder architecture. If
you wrote your own tree builder and didn't tell me, you'll need to
modify your prepare_markup() method.

* The UnicodeDammit code that makes guesses at encodings has been
split into its own class, EncodingDetector. A lot of apparently
redundant code has been removed from Unicode, Dammit, and some
undocumented features have also been removed.

* Beautiful Soup will issue a warning if instead of markup you pass it
a URL or the name of a file on disk (a common beginner's mistake).

* A number of optimizations improve the performance of the lxml tree
builder by about 33%, the html.parser tree builder by about 20%, and
the html5lib tree builder by about 15%.

* All find_all calls should now return a ResultSet object. Patch by
Aaron DeVore. [bug=1194034]

= 4.2.1 (20130531) =

* The default XML formatter will now replace ampersands even if they
appear to be part of entities. That is, "&lt;" will become
"&amp;lt;". The old code was left over from Beautiful Soup 3, which
didn't always turn entities into Unicode characters.

If you really want the old behavior (maybe because you add new
strings to the tree, those strings include entities, and you want
the formatter to leave them alone on output), it can be found in
EntitySubstitution.substitute_xml_containing_entities(). [bug=1182183]
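
The two behaviors can be compared directly. This sketch calls the EntitySubstitution class methods on a made-up input string.

```python
from bs4.dammit import EntitySubstitution

text = "&lt;"

# Default behavior: every ampersand is escaped, even inside an entity.
escaped = EntitySubstitution.substitute_xml(text)

# Old behavior: ampersands that already begin an entity are left alone.
preserved = EntitySubstitution.substitute_xml_containing_entities(text)
```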
|
|
|
|
* Gave new_string() the ability to create subclasses of
|
|
NavigableString. [bug=1181986]
|
|
|
|
* Fixed another bug by which the html5lib tree builder could create a
|
|
disconnected tree. [bug=1182089]
|
|
|
|
* The .previous_element of a BeautifulSoup object is now always None,
|
|
not the last element to be parsed. [bug=1182089]
|
|
|
|
* Fixed test failures when lxml is not installed. [bug=1181589]
|
|
|
|
* html5lib now supports Python 3. Fixed some Python 2-specific
|
|
code in the html5lib test suite. [bug=1181624]
|
|
|
|
* The html.parser treebuilder can now handle numeric attributes in
|
|
text when the hexidecimal name of the attribute starts with a
|
|
capital X. Patch by Tim Shirley. [bug=1186242]
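
To make the formatter entry at the top of this release concrete, here is a small sketch comparing the two substitution helpers in bs4.dammit. It assumes a current Beautiful Soup 4 install; the sample string is invented.

```python
from bs4.dammit import EntitySubstitution

text = "AT&amp;T & <b>"

# Default behavior: every ampersand is escaped, even one that already
# looks like the start of an entity.
escaped_everything = EntitySubstitution.substitute_xml(text)

# Old behavior: ampersands that appear to begin an entity are left alone.
kept_entities = EntitySubstitution.substitute_xml_containing_entities(text)
```

The first call should escape the existing "&amp;" a second time, while the second call should leave it untouched and escape only the bare ampersand and the angle brackets.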

= 4.2.0 (20130514) =

* The Tag.select() method now supports a much wider variety of CSS
selectors.

- Added support for the adjacent sibling combinator (+) and the
general sibling combinator (~). Tests by "liquider". [bug=1082144]

- The combinators (>, +, and ~) can now combine with any supported
selector, not just one that selects based on tag name.

- Added limited support for the "nth-of-type" pseudo-class. Code
by Sven Slootweg. [bug=1109952]

* The BeautifulSoup class is now aliased to "_s" and "_soup", making
it quicker to type the import statement in an interactive session:

  from bs4 import _s
or
  from bs4 import _soup

The alias may change in the future, so don't use this in code you're
going to run more than once.

* Added the 'diagnose' submodule, which includes several useful
functions for reporting problems and doing tech support.

- diagnose(data) tries the given markup on every installed parser,
reporting exceptions and displaying successes. If a parser is not
installed, diagnose() mentions this fact.

- lxml_trace(data, html=True) runs the given markup through lxml's
XML parser or HTML parser, and prints out the parser events as
they happen. This helps you quickly determine whether a given
problem occurs in lxml code or Beautiful Soup code.

- htmlparser_trace(data) is the same thing, but for Python's
built-in HTMLParser class.

* In an HTML document, the contents of a <script> or <style> tag will
no longer undergo entity substitution by default. XML documents work
the same way they did before. [bug=1085953]

* Methods like get_text() and properties like .strings now only give
you strings that are visible in the document--no comments or
processing instructions. [bug=1050164]

* The prettify() method now leaves the contents of <pre> tags
alone. [bug=1095654]

* Fix a bug in the html5lib treebuilder which sometimes created
disconnected trees. [bug=1039527]

* Fix a bug in the lxml treebuilder which crashed when a tag included
an attribute from the predefined "xml:" namespace. [bug=1065617]

* Fix a bug by which keyword arguments to find_parent() were not
being passed on. [bug=1126734]

* Stop a crash when unwisely messing with a tag that's been
decomposed. [bug=1097699]

* Now that lxml's segfault on invalid doctype has been fixed, fixed a
corresponding problem on the Beautiful Soup end that was previously
invisible. [bug=984936]

* Fixed an exception when an overspecified CSS selector didn't match
anything. Code by Stefaan Lippens. [bug=1168167]
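
The sibling combinators and the "nth-of-type" pseudo-class from the select() entry in this release can be tried with a sketch like the one below. It assumes a recent Beautiful Soup 4, where select() is backed by the soupsieve package, and uses made-up markup and ids.

```python
from bs4 import BeautifulSoup

markup = (
    '<div><h1>Title</h1><p id="a">1</p><p id="b">2</p>'
    '<span>x</span><p id="c">3</p></div>'
)
soup = BeautifulSoup(markup, "html.parser")

# Adjacent sibling: only the <p> immediately following the <h1>.
adjacent = [tag["id"] for tag in soup.select("h1 + p")]

# General sibling: every <p> that follows the <h1> under the same parent.
general = [tag["id"] for tag in soup.select("h1 ~ p")]

# nth-of-type: the second <p> among its siblings.
second_p = [tag["id"] for tag in soup.select("p:nth-of-type(2)")]
```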

= 4.1.3 (20120820) =

* Skipped a test under Python 2.6 and Python 3.1 to avoid a spurious
test failure caused by the lousy HTMLParser in those
versions. [bug=1038503]

* Raise a more specific error (FeatureNotFound) when a requested
parser or parser feature is not installed. Raise NotImplementedError
instead of ValueError when the user calls insert_before() or
insert_after() on the BeautifulSoup object itself. Patch by Aaron
DeVore. [bug=1038301]
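
A minimal sketch of catching the more specific error, assuming a current Beautiful Soup 4 install. The parser name "no-such-parser" is deliberately bogus and only serves to trigger the failure.

```python
from bs4 import BeautifulSoup, FeatureNotFound

try:
    # No tree builder is registered under this made-up feature name.
    BeautifulSoup("<p>hi</p>", "no-such-parser")
    outcome = "parsed"
except FeatureNotFound:
    outcome = "parser not available"
```

Because FeatureNotFound is a subclass of ValueError, older code that catches ValueError should keep working.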

= 4.1.2 (20120817) =

* As per PEP-8, allow searching by CSS class using the 'class_'
keyword argument. [bug=1037624]

* Display namespace prefixes for namespaced attribute names, instead of
the fully-qualified names given by the lxml parser. [bug=1037597]

* Fixed a crash on encoding when an attribute name contained
non-ASCII characters.

* When sniffing encodings, if the cchardet library is installed,
Beautiful Soup uses it instead of chardet. cchardet is much
faster. [bug=1020748]

* Use logging.warning() instead of warnings.warn() to notify the user
that characters were replaced with REPLACEMENT
CHARACTER. [bug=1013862]
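
For the 'class_' keyword argument introduced in this release, a short sketch follows. It assumes a current Beautiful Soup 4 and the built-in html.parser; the markup is invented.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="intro lead">first</p><p class="body">second</p>',
    "html.parser",
)

# 'class' is a reserved word in Python, hence the trailing underscore.
by_keyword = [p.get_text() for p in soup.find_all("p", class_="intro")]

# The older spelling, passing a dictionary of attributes, still works.
by_attrs = [p.get_text() for p in soup.find_all("p", attrs={"class": "intro"})]
```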

= 4.1.1 (20120703) =

* Fixed an html5lib tree builder crash which happened when html5lib
moved a tag with a multivalued attribute from one part of the tree
to another. [bug=1019603]

* Correctly display closing tags with an XML namespace declared. Patch
by Andreas Kostyrka. [bug=1019635]

* Fixed a typo that made parsing significantly slower than it should
have been, and that also made the parser wait too long to close tags
with XML namespaces. [bug=1020268]

* get_text() now returns an empty Unicode string if there is no text,
rather than an empty bytestring. [bug=1020387]

= 4.1.0 (20120529) =

* Added experimental support for fixing Windows-1252 characters
embedded in UTF-8 documents. (UnicodeDammit.detwingle())

* Fixed the handling of &quot; with the built-in parser. [bug=993871]

* Comments, processing instructions, document type declarations, and
markup declarations are now treated as preformatted strings, the way
CData blocks are. [bug=1001025]

* Fixed a bug with the lxml treebuilder that prevented the user from
adding attributes to a tag that didn't originally have
attributes. [bug=1002378] Thanks to Oliver Beattie for the patch.

* Fixed some edge-case bugs having to do with inserting an element
into a tag it's already inside, and replacing one of a tag's
children with another. [bug=997529]

* Added the ability to search for attribute values specified in
UTF-8. [bug=1003974]

This caused a major refactoring of the search code. All the tests
pass, but it's possible that some searches will behave differently.
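
The detwingle() support mentioned at the top of the 4.1.0 entries can be sketched as follows, assuming a current Beautiful Soup 4. The byte string is built by hand to mix the two encodings.

```python
from bs4 import UnicodeDammit

snowmen = "\N{SNOWMAN}" * 3
quote = (
    "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!"
    "\N{RIGHT DOUBLE QUOTATION MARK}"
)

# A document that is mostly UTF-8 but has Windows-1252 smart quotes embedded.
mixed = snowmen.encode("utf8") + quote.encode("windows-1252")

# detwingle() rewrites the Windows-1252 bytes as UTF-8, so the whole
# byte string can then be decoded in one pass.
fixed = UnicodeDammit.detwingle(mixed)
text = fixed.decode("utf8")
```

Decoding `mixed` directly as UTF-8 would fail on the smart quotes; after detwingle() the text should come back intact.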

= 4.0.5 (20120427) =

* Added a new method, wrap(), which wraps an element in a tag.

* Renamed replace_with_children() to unwrap(), which is easier to
understand and also the jQuery name of the function.

* Made encoding substitution in <meta> tags completely transparent (no
more %SOUP-ENCODING%).

* Fixed a bug in decoding data that contained a byte-order mark, such
as data encoded in UTF-16LE. [bug=988980]

* Fixed a bug that made the HTMLParser treebuilder generate XML
declarations ending with two question marks instead of
one. [bug=984258]

* Upon document generation, CData objects are no longer run through
the formatter. [bug=988905]

* The test suite now passes when lxml is not installed, whether or not
html5lib is installed. [bug=987004]

* Print a warning on HTMLParseErrors to let people know they should
install a better parser library.
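
wrap() and unwrap(), introduced at the top of this release, are inverses of each other. A minimal sketch, assuming a current Beautiful Soup 4 and invented markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")

# wrap() puts the string inside a newly created <b> tag.
soup.p.string.wrap(soup.new_tag("b"))
wrapped = str(soup)

# unwrap() removes the <b> tag again but keeps its contents in place.
soup.b.unwrap()
unwrapped = str(soup)
```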

= 4.0.4 (20120416) =

* Fixed a bug that sometimes created disconnected trees.

* Fixed a bug with the string setter that moved a string around the
tree instead of copying it. [bug=983050]

* Attribute values are now run through the provided output formatter.
Previously they were always run through the 'minimal' formatter. In
the future I may make it possible to specify different formatters
for attribute values and strings, but for now, consistent behavior
is better than inconsistent behavior. [bug=980237]

* Added the missing renderContents method from Beautiful Soup 3. Also
added an encode_contents() method to go along with decode_contents().

* Give a more useful error when the user tries to run the Python 2
version of BS under Python 3.

* UnicodeDammit can now convert Microsoft smart quotes to ASCII with
UnicodeDammit(markup, smart_quotes_to="ascii").
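
A short sketch of the smart-quote conversion, assuming a current Beautiful Soup 4. The byte string and the explicit "windows-1252" hint are choices made for this example.

```python
from bs4 import UnicodeDammit

# \x93, \x94 and \x92 are Windows-1252 smart quotes.
markup = b"<p>I just \x93love\x94 Microsoft Word\x92s smart quotes</p>"

# The second argument is a list of encodings to try first.
dammit = UnicodeDammit(markup, ["windows-1252"], smart_quotes_to="ascii")
cleaned = dammit.unicode_markup
```

With smart_quotes_to="ascii", the curly quotes should come out as plain " and ' characters.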

= 4.0.3 (20120403) =

* Fixed a typo that caused some versions of Python 3 to convert the
Beautiful Soup codebase incorrectly.

* Got rid of the 4.0.2 workaround for HTML documents--it was
unnecessary and the workaround was triggering a (possibly different,
but related) bug in lxml. [bug=972466]

= 4.0.2 (20120326) =

* Worked around a possible bug in lxml that prevents non-tiny XML
documents from being parsed. [bug=963880, bug=963936]

* Fixed a bug where specifying `text` while also searching for a tag
only worked if `text` wanted an exact string match. [bug=955942]

= 4.0.1 (20120314) =

* This is the first official release of Beautiful Soup 4. There is no
4.0.0 release, to eliminate any possibility that packaging software
might treat "4.0.0" as being an earlier version than "4.0.0b10".

* Brought BS up to date with the latest release of soupselect, adding
CSS selector support for direct descendant matches and multiple CSS
class matches.

= 4.0.0b10 (20120302) =

* Added support for simple CSS selectors, taken from the soupselect project.

* Fixed a crash when using html5lib. [bug=943246]

* In HTML5-style <meta charset="foo"> tags, the value of the "charset"
attribute is now replaced with the appropriate encoding on
output. [bug=942714]

* Fixed a bug that caused calling a tag to sometimes call find_all()
with the wrong arguments. [bug=944426]

* For backwards compatibility, brought back the BeautifulStoneSoup
class as a deprecated wrapper around BeautifulSoup.

= 4.0.0b9 (20120228) =

* Fixed the string representation of DOCTYPEs that have both a public
ID and a system ID.

* Fixed the generated XML declaration.

* Renamed Tag.nsprefix to Tag.prefix, for consistency with
NamespacedAttribute.

* Fixed a test failure that occurred on Python 3.x when chardet was
installed.

* Made prettify() return Unicode by default, so it will look nice on
Python 3 when passed into print().

= 4.0.0b8 (20120224) =

* All tree builders now preserve namespace information in the
documents they parse. If you use the html5lib parser or lxml's XML
parser, you can access the namespace URL for a tag as tag.namespace.

However, there is no special support for namespace-oriented
searching or tree manipulation. When you search the tree, you need
to use namespace prefixes exactly as they're used in the original
document.

* The string representation of a DOCTYPE always ends in a newline.

* Issue a warning if the user tries to use a SoupStrainer in
conjunction with the html5lib tree builder, which doesn't support
them.

= 4.0.0b7 (20120223) =

* Upon decoding to string, any characters that can't be represented in
your chosen encoding will be converted into numeric XML entity
references.

* Issue a warning if characters were replaced with REPLACEMENT
CHARACTER during Unicode conversion.

* Restored compatibility with Python 2.6.

* The install process no longer installs docs or auxiliary text files.

* It's now possible to deepcopy a BeautifulSoup object created with
Python's built-in HTML parser.

* About 100 unit tests that "test" the behavior of various parsers on
invalid markup have been removed. Legitimate changes to those
parsers caused these tests to fail, indicating that perhaps
Beautiful Soup should not test the behavior of foreign
libraries.

The problematic unit tests have been reformulated as informational
comparisons generated by the script
scripts/demonstrate_parser_differences.py.

This makes Beautiful Soup compatible with html5lib version 0.95 and
future versions of HTMLParser.

= 4.0.0b6 (20120216) =

* Multi-valued attributes like "class" always have a list of values,
even if there's only one value in the list.

* Added a number of multi-valued attributes defined in HTML5.

* Stopped generating a space before the slash that closes an
empty-element tag. This may come back if I add a special XHTML mode
(http://www.w3.org/TR/xhtml1/#C_2), but right now it's pretty
useless.

* Passing text along with tag-specific arguments to a find* method:

  find("a", text="Click here")

will find tags that contain the given text as their
.string. Previously, the tag-specific arguments were ignored and
only strings were searched.

* Fixed a bug that caused the html5lib tree builder to build a
partially disconnected tree. Generally cleaned up the html5lib tree
builder.

* If you restrict a multi-valued attribute like "class" to a string
that contains spaces, Beautiful Soup will only consider it a match
if the values correspond to that specific string.
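
The multi-valued attribute rules and the combined tag-plus-text search from this release can be sketched as below. This assumes a current Beautiful Soup 4, where the `text` argument shown above is spelled `string`; the markup is invented.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="body strikeout" id="my id">text</p>'
    '<a class="body" href="/x">Click here</a>',
    "html.parser",
)

# "class" is multi-valued, so it is always a list, even with one value.
p_classes = soup.p["class"]
a_classes = soup.a["class"]

# "id" is not multi-valued, so it stays a single string.
p_id = soup.p["id"]

# A search string containing spaces has to match the exact attribute value.
exact = soup.find_all(class_="body strikeout")
reordered = soup.find_all(class_="strikeout body")

# Tag name and string together: the tag whose .string is the given text.
link = soup.find("a", string="Click here")
```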

= 4.0.0b5 (20120209) =

* Rationalized Beautiful Soup's treatment of CSS class. A tag
belonging to multiple CSS classes is treated as having a list of
values for the 'class' attribute. Searching for a CSS class will
match *any* of the CSS classes.

This actually affects all attributes that the HTML standard defines
as taking multiple values (class, rel, rev, archive, accept-charset,
and headers), but 'class' is by far the most common. [bug=41034]

* If you pass anything other than a dictionary as the second argument
to one of the find* methods, it'll assume you want to use that
object to search against a tag's CSS classes. Previously this only
worked if you passed in a string.

* Fixed a bug that caused a crash when you passed a dictionary as an
attribute value (possibly because you mistyped "attrs"). [bug=842419]

* Unicode, Dammit now detects the encoding in HTML 5-style <meta> tags
like <meta charset="utf-8" />. [bug=837268]

* If Unicode, Dammit can't figure out a consistent encoding for a
page, it will try each of its guesses again, with errors="replace"
instead of errors="strict". This may mean that some data gets
replaced with REPLACEMENT CHARACTER, but at least most of it will
get turned into Unicode. [bug=754903]

* Patched over a bug in html5lib (?) that was crashing Beautiful Soup
on certain kinds of markup. [bug=838800]

* Fixed a bug that wrecked the tree if you replaced an element with an
empty string. [bug=728697]

* Improved Unicode, Dammit's behavior when you give it Unicode to
begin with.

= 4.0.0b4 (20120208) =

* Added BeautifulSoup.new_string() to go along with BeautifulSoup.new_tag().

* BeautifulSoup.new_tag() will follow the rules of whatever
tree-builder was used to create the original BeautifulSoup object. A
new <p> tag will look like "<p />" if the soup object was created to
parse XML, but it will look like "<p></p>" if the soup object was
created to parse HTML.

* We pass in strict=False to html.parser on Python 3, greatly
improving html.parser's ability to handle bad HTML.

* We also monkeypatch a serious bug in html.parser that made
strict=False disastrous on Python 3.2.2.

* Replaced the "substitute_html_entities" argument with the
more general "formatter" argument.

* Bare ampersands and angle brackets are always converted to XML
entities unless the user prevents it.

* Added PageElement.insert_before() and PageElement.insert_after(),
which let you put an element into the parse tree with respect to
some other element.

* Raise an exception when the user tries to do something nonsensical
like insert a tag into itself.
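
The tree-editing additions in this release (new_tag(), new_string(), insert_before(), insert_after()) fit together roughly as in the sketch below. It assumes a current Beautiful Soup 4; the assumption that the "nonsensical" insert raises ValueError reflects current releases and is not stated in the entry above.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>stop</b>", "html.parser")

# Build a new <i> tag and place it before the existing string.
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
after_first = str(soup.b)

# Build a new string and place it right after the <i> tag.
soup.b.i.insert_after(soup.new_string(" ever "))
after_second = str(soup.b)

# Inserting a tag into itself is rejected (ValueError in current releases).
try:
    soup.b.append(soup.b)
    self_insert = "allowed"
except ValueError:
    self_insert = "rejected"
```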

= 4.0.0b3 (20120203) =

Beautiful Soup 4 is a nearly-complete rewrite that removes Beautiful
Soup's custom HTML parser in favor of a system that lets you write a
little glue code and plug in any HTML or XML parser you want.

Beautiful Soup 4.0 comes with glue code for four parsers:

* Python's standard HTMLParser (html.parser in Python 3)
* lxml's HTML and XML parsers
* html5lib's HTML parser

HTMLParser is the default, but I recommend you install lxml if you
can.

For complete documentation, see the Sphinx documentation in
bs4/doc/source/. What follows is a summary of the changes from
Beautiful Soup 3.

=== The module name has changed ===

Previously you imported the BeautifulSoup class from a module also
called BeautifulSoup. To save keystrokes and make it clear which
version of the API is in use, the module is now called 'bs4':

  >>> from bs4 import BeautifulSoup

=== It works with Python 3 ===

Beautiful Soup 3.1.0 worked with Python 3, but the parser it used was
so bad that it barely worked at all. Beautiful Soup 4 works with
Python 3, and since its parser is pluggable, you don't sacrifice
quality.

Special thanks to Thomas Kluyver and Ezio Melotti for getting Python 3
support to the finish line. Ezio Melotti is also to thank for greatly
improving the HTML parser that comes with Python 3.2.

=== CDATA sections are normal text, if they're understood at all. ===

Currently, the lxml and html5lib HTML parsers ignore CDATA sections in
markup:

  <p><![CDATA[foo]]></p> => <p></p>

A future version of html5lib will turn CDATA sections into text nodes,
but only within tags like <svg> and <math>:

  <svg><![CDATA[foo]]></svg> => <svg>foo</svg>

The default XML parser (which uses lxml behind the scenes) turns CDATA
sections into ordinary text elements:

  <p><![CDATA[foo]]></p> => <p>foo</p>

In theory it's possible to preserve the CDATA sections when using the
XML parser, but I don't see how to get it to work in practice.

=== Miscellaneous other stuff ===

If the BeautifulSoup instance has .is_xml set to True, an appropriate
XML declaration will be emitted when the tree is transformed into a
string:

  <?xml version="1.0" encoding="utf-8"?>
  <markup>
  ...
  </markup>

The ['lxml', 'xml'] tree builder sets .is_xml to True; the other tree
builders set it to False. If you want to parse XHTML with an HTML
parser, you can set it manually.

= 3.2.0 =

The 3.1 series wasn't very useful, so I renamed the 3.0 series to 3.2
to make it obvious which one you should use.

= 3.1.0 =

A hybrid version that supports Python 2.4 and can be automatically
converted to run under Python 3.0. There are three
backwards-incompatible changes you should be aware of, but no new
features or deliberate behavior changes.

1. str() may no longer do what you want. This is because the meaning
of str() inverts between Python 2 and 3; in Python 2 it gives you a
byte string, in Python 3 it gives you a Unicode string.

The effect of this is that you can't pass an encoding to .__str__
anymore. Use encode() to get a string and decode() to get Unicode, and
you'll be ready (well, readier) for Python 3.

2. Beautiful Soup is now based on HTMLParser rather than SGMLParser,
which is gone in Python 3. There's some bad HTML that SGMLParser
handled but HTMLParser doesn't, usually to do with attribute values
that aren't closed or have brackets inside them:

  <a href="foo</a>, </a><a href="bar">baz</a>
  <a b="<a>">', '<a b="<a>"></a><a>"></a>

A later version of Beautiful Soup will allow you to plug in different
parsers to make tradeoffs between speed and the ability to handle bad
HTML.

3. In Python 3 (but not Python 2), HTMLParser converts entities within
attributes to the corresponding Unicode characters. In Python 2 it's
possible to parse this string and leave the &eacute; intact.

  <a href="http://crummy.com?sacr&eacute;&bleu">

In Python 3, the &eacute; is always converted to \xe9 during
parsing.
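
The third point can be observed with Python 3's standard library alone. The sketch below does not use Beautiful Soup at all; the AttributeCollector class is a hypothetical helper written for this example.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every start tag."""

    def __init__(self):
        super().__init__()
        self.collected = {}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, already unescaped.
        self.collected.update(attrs)


parser = AttributeCollector()
parser.feed('<a href="http://crummy.com?sacr&eacute;&bleu">')
parser.close()
href = parser.collected["href"]
```

By the time handle_starttag() runs, the entity in the href value has already been replaced with the character \xe9, so code built on HTMLParser never sees the original entity.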

= 3.0.7a =

Added an import that makes BS work in Python 2.3.

= 3.0.7 =

Fixed a UnicodeDecodeError when unpickling documents that contain
non-ASCII characters.

Fixed a TypeError that occurred in some circumstances when a tag
contained no text.

Jump through hoops to avoid the use of chardet, which can be extremely
slow in some circumstances. UTF-8 documents should never trigger the
use of chardet.

Whitespace is preserved inside <pre> and <textarea> tags that contain
nothing but whitespace.

Beautiful Soup can now parse a doctype that's scoped to an XML namespace.

= 3.0.6 =

Got rid of a very old debug line that prevented chardet from working.

Added a Tag.decompose() method that completely disconnects a tree or a
subset of a tree, breaking it up into bite-sized pieces that are
easy for the garbage collector to collect.

Tag.extract() now returns the tag that was extracted.
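
The difference between extract() and decompose() can be sketched as follows. The example uses the current Beautiful Soup 4 API rather than the 3.0.6 release described here, and the markup is invented.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><b>bold</b><i>italic</i></div>", "html.parser")

# extract() removes the tag from the tree and hands it back for reuse.
removed = soup.b.extract()
removed_markup = str(removed)
after_extract = str(soup)

# decompose() removes the tag and destroys it; nothing useful is returned.
soup.i.decompose()
after_decompose = str(soup)
```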

Tag.findNext() now does something with the keyword arguments you pass
it instead of dropping them on the floor.

Fixed a Unicode conversion bug.

Fixed a bug that garbled some <meta> tags when rewriting them.

= 3.0.5 =

Soup objects can now be pickled, and copied with copy.deepcopy.

Tag.append now works properly on existing BS objects. (It wasn't
originally intended for outside use, but it can be now.) (Giles
Radford)

Passing in a nonexistent encoding will no longer crash the parser on
Python 2.4 (John Nagle).

Fixed an underlying bug in SGMLParser that thinks ASCII has 255
characters instead of 127 (John Nagle).

Entities are converted more consistently to Unicode characters.

Entity references in attribute values are now converted to Unicode
characters when appropriate. Numeric entities are always converted,
because SGMLParser always converts them outside of attribute values.

ALL_ENTITIES happens to just be the XHTML entities, so I renamed it to
XHTML_ENTITIES.

The regular expression for bare ampersands was too loose. In some
cases ampersands were not being escaped. (Sam Ruby?)

Non-breaking spaces and other special Unicode space characters are no
longer folded to ASCII spaces. (Robert Leftwich)

Information inside a TEXTAREA tag is now parsed literally, not as HTML
tags. TEXTAREA now works exactly the same way as SCRIPT. (Zephyr Fang)

= 3.0.4 =

Fixed a bug that crashed Unicode conversion in some cases.

Fixed a bug that prevented UnicodeDammit from being used as a
general-purpose data scrubber.

Fixed some unit test failures when running against Python 2.5.

When considering whether to convert smart quotes, UnicodeDammit now
looks at the original encoding in a case-insensitive way.

= 3.0.3 (20060606) =

Beautiful Soup is now usable as a way to clean up invalid XML/HTML (be
sure to pass in an appropriate value for convertEntities, or XML/HTML
entities might stick around that aren't valid in HTML/XML). The result
may not validate, but it should be good enough to not choke a
real-world XML parser. Specifically, the output of a properly
constructed soup object should always be valid as part of an XML
document, but parts may be missing if they were missing in the
original. As always, if the input is valid XML, the output will also
be valid.

= 3.0.2 (20060602) =

Previously, Beautiful Soup correctly handled attribute values that
contained embedded quotes (sometimes by escaping), but not other kinds
of XML character. Now, it correctly handles or escapes all special XML
characters in attribute values.

I aliased methods to the 2.x names (fetch, find, findText, etc.) for
backwards compatibility purposes. Those names are deprecated and if I
ever do a 4.0 I will remove them. I will, I tell you!

Fixed a bug where the findAll method wasn't passing along any keyword
arguments.

When run from the command line, Beautiful Soup now acts as an HTML
pretty-printer, not an XML pretty-printer.

= 3.0.1 (20060530) =

Reintroduced the "fetch by CSS class" shortcut. I thought keyword
arguments would replace it, but they don't. You can't call soup('a',
class='foo') because class is a Python keyword.

If Beautiful Soup encounters a meta tag that declares the encoding,
but a SoupStrainer tells it not to parse that tag, Beautiful Soup will
no longer try to rewrite the meta tag to mention the new
encoding. Basically, this makes SoupStrainers work in real-world
applications instead of crashing the parser.

= 3.0.0 "Who would not give all else for two p" (20060528) =

This release is not backward-compatible with previous releases. If
you've got code written with a previous version of the library, go
ahead and keep using it, unless one of the features mentioned here
really makes your life easier. Since the library is self-contained,
you can include an old copy of the library in your old applications,
and use the new version for everything else.

The documentation has been rewritten and greatly expanded with many
more examples.

Beautiful Soup autodetects the encoding of a document (or uses the one
you specify), and converts it from its native encoding to
Unicode. Internally, it only deals with Unicode strings. When you
print out the document, it converts to UTF-8 (or another encoding you
specify). [Doc reference]

It's now easy to make large-scale changes to the parse tree without
screwing up the navigation members. The methods are extract,
replaceWith, and insert. [Doc reference. See also Improving Memory
Usage with extract]

Passing True in as an attribute value gives you tags that have any
value for that attribute. You don't have to create a regular
expression. Passing None for an attribute value gives you tags that
don't have that attribute at all.
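
A small sketch of the True case, written against the current Beautiful Soup 4 API (find_all) rather than the 3.0.0 method names. The markup is invented.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="/x">one</a><a name="top">two</a>',
    "html.parser",
)

# True matches any tag that has the attribute, whatever its value.
with_href = [a.get_text() for a in soup.find_all("a", href=True)]
```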

Tag objects now know whether or not they're self-closing. This avoids
the problem where Beautiful Soup thought that tags like <BR /> were
self-closing even in XML documents. You can customize the self-closing
tags for a parser object by passing them in as a list of
selfClosingTags: you don't have to subclass anymore.

There's a new built-in parser, MinimalSoup, which has most of
BeautifulSoup's HTML-specific rules, but no tag nesting rules. [Doc
reference]

You can use a SoupStrainer to tell Beautiful Soup to parse only part
of a document. This saves time and memory, often making Beautiful Soup
about as fast as a custom-built SGMLParser subclass. [Doc reference,
SoupStrainer reference]
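
In current Beautiful Soup 4 releases a SoupStrainer is passed in through the parse_only argument. The sketch below assumes that API and the built-in html.parser; the markup is invented.

```python
from bs4 import BeautifulSoup, SoupStrainer

markup = (
    '<html><body><p>intro</p><a href="/a">A</a>'
    '<div><a href="/b">B</a></div></body></html>'
)

# Only <a> tags are turned into objects; the rest is skipped while parsing.
only_links = SoupStrainer("a")
soup = BeautifulSoup(markup, "html.parser", parse_only=only_links)

hrefs = [a["href"] for a in soup.find_all("a")]
paragraphs = soup.find_all("p")
```

The <p> tag never makes it into the tree, so searching for it afterwards should find nothing.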

You can (usually) use keyword arguments instead of passing a
dictionary of attributes to a search method. That is, you can replace
soup(attrs={"id" : "5"}) with soup(id="5"). You can still use attrs if
(for instance) you need to find an attribute whose name clashes with
the name of an argument to findAll. [Doc reference: **kwargs attrs]
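
The name clash is easiest to see with an attribute called "name", which collides with the argument that selects the tag name. A sketch using the current Beautiful Soup 4 API, with invented markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<input id="5" name="email"/><input id="6" name="phone"/>',
    "html.parser",
)

# Keyword argument and attribute dictionary are interchangeable here.
by_keyword = [tag["id"] for tag in soup.find_all(id="5")]
by_dict = [tag["id"] for tag in soup.find_all(attrs={"id": "5"})]

# name="email" would search for a tag called <email>, so the dictionary
# form is needed to search on the "name" attribute.
by_name_attr = [tag["id"] for tag in soup.find_all(attrs={"name": "email"})]
```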

The method names have changed to the better method names used in
Rubyful Soup. Instead of find methods and fetch methods, there are
only find methods. Instead of a scheme where you can't remember which
method finds one element and which one finds them all, we have find
and findAll. In general, if the method name mentions All or a plural
noun (e.g. findNextSiblings), then it finds many
elements. Otherwise, it only finds one element. [Doc reference]

Some of the argument names have been renamed for clarity. For instance
avoidParserProblems is now parserMassage.

Beautiful Soup no longer implements a feed method. You need to pass a
string or a filehandle into the soup constructor, not with feed after
the soup has been created. There is still a feed method, but it's the
feed method implemented by SGMLParser and calling it will bypass
Beautiful Soup and cause problems.

The NavigableText class has been renamed to NavigableString. There is
no NavigableUnicodeString anymore, because every string inside a
Beautiful Soup parse tree is a Unicode string.

findText and fetchText are gone. Just pass a text argument into find
or findAll.

Null was more trouble than it was worth, so I got rid of it. Anything
that used to return Null now returns None.

Special XML constructs like comments and CDATA now have their own
NavigableString subclasses, instead of being treated as oddly-formed
data. If you parse a document that contains CDATA and write it back
out, the CDATA will still be there.
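
The same idea is visible in current Beautiful Soup 4 releases, where a comment comes back as a Comment object that is still a kind of NavigableString. The sketch below uses a comment rather than CDATA and invented markup.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

markup = "<p><!--remember the milk-->Shopping list</p>"
soup = BeautifulSoup(markup, "html.parser")

# The <p> tag has two children: the comment and the ordinary string.
first, second = soup.p.contents

# The comment survives a round trip back to markup.
round_trip = str(soup)
```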

When you're parsing a document, you can get Beautiful Soup to convert
XML or HTML entities into the corresponding Unicode characters. [Doc
reference]

= 2.1.1 (20050918) =

Fixed a serious performance bug in BeautifulStoneSoup which was
causing parsing to be incredibly slow.

Corrected several entities that were previously being incorrectly
translated from Microsoft smart-quote-like characters.

Fixed a bug that was breaking text fetch.

Fixed a bug that crashed the parser when text chunks that look like
HTML tag names showed up within a SCRIPT tag.

THEAD, TBODY, and TFOOT tags are now nestable within TABLE
tags. Nested tables should parse more sensibly now.

BASE is now considered a self-closing tag.

= 2.1.0 "Game, or any other dish?" (20050504) =

Added a wide variety of new search methods which, given a starting
point inside the tree, follow a particular navigation member (like
nextSibling) over and over again, looking for Tag and NavigableText
objects that match certain criteria. The new methods are findNext,
fetchNext, findPrevious, fetchPrevious, findNextSibling,
fetchNextSiblings, findPreviousSibling, fetchPreviousSiblings,
findParent, and fetchParents. All of these use the same basic code
used by first and fetch, so you can pass your weird ways of matching
things into these methods.

The fetch method and its derivatives now accept a limit argument.

You can now pass keyword arguments when calling a Tag object as though
it were a method.

Fixed a bug that caused all hand-created tags to share a single set of
attributes.

= 2.0.3 (20050501) =

Fixed Python 2.2 support for iterators.

Fixed a bug that gave the wrong representation to tags within quote
tags like <script>.

Took some code from Mark Pilgrim that treats CDATA declarations as
data instead of ignoring them.

Beautiful Soup's setup.py will now do an install even if the unit
tests fail. It won't build a source distribution if the unit tests
fail, so I can't release a new version unless they pass.

= 2.0.2 (20050416) =

Added the unit tests in a separate module, and packaged it with
distutils.

Fixed a bug that sometimes caused renderContents() to return a Unicode
string even if there was no Unicode in the original string.

Added the done() method, which closes all of the parser's open
tags. It gets called automatically when you pass in some text to the
constructor of a parser class; otherwise you must call it yourself.

Reinstated some backwards compatibility with 1.x versions: referencing
the string member of a NavigableText object returns the NavigableText
object instead of throwing an error.

= 2.0.1 (20050412) =

Fixed a bug that caused bad results when you tried to reference a tag
name shorter than 3 characters as a member of a Tag, e.g. tag.table.td.

Made sure all Tags have the 'hidden' attribute so that an attempt to
access tag.hidden doesn't spawn an attempt to find a tag named
'hidden'.

Fixed a bug in the comparison operator.
|
|
|
|
= 2.0.0 "Who cares for fish?" (20050410)
|
|
|
|
Beautiful Soup version 1 was very useful but also pretty stupid. I
|
|
originally wrote it without noticing any of the problems inherent in
|
|
trying to build a parse tree out of ambiguous HTML tags. This version
|
|
solves all of those problems to my satisfaction. It also adds many new
|
|
clever things to make up for the removal of the stupid things.
|
|
|
|
== Parsing ==
|
|
|
|
The parser logic has been greatly improved, and the BeautifulSoup
|
|
class should much more reliably yield a parse tree that looks like
|
|
what the page author intended. For a particular class of odd edge
|
|
cases that now causes problems, there is a new class,
|
|
ICantBelieveItsBeautifulSoup.
|
|
|
|
By default, Beautiful Soup now performs some cleanup operations on
|
|
text before parsing it. This is to avoid common problems with bad
|
|
definitions and self-closing tags that crash SGMLParser. You can
|
|
provide your own set of cleanup operations, or turn it off
|
|
altogether. The cleanup operations include fixing self-closing tags
|
|
that don't close, and replacing Microsoft smart quotes and similar
|
|
characters with their HTML entity equivalents.
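
To give a feel for what a set of cleanup operations might look like,
here is a small sketch. It is not Beautiful Soup's actual cleanup
code: the CLEANUPS table, the clean() function, and the two patterns
are assumptions made for illustration, and the smart quotes are
assumed to have already been decoded to Unicode curly quotes.

```python
import re

# Hypothetical mapping from curly quotes to HTML entities.
SMART_QUOTES = {
    "\u2018": "&lsquo;",
    "\u2019": "&rsquo;",
    "\u201c": "&ldquo;",
    "\u201d": "&rdquo;",
}

# Hypothetical cleanup table: (compiled pattern, replacement) pairs.
CLEANUPS = [
    # "<br/>" -> "<br />": put a space before the slash of a
    # self-closing tag that lacks one.
    (re.compile(r"(<[^<>]*[^\s/<>])/>"), r"\1 />"),
    # Curly quotes -> HTML entities.
    (re.compile("[\u2018\u2019\u201c\u201d]"),
     lambda match: SMART_QUOTES[match.group(0)]),
]


def clean(markup, cleanups=CLEANUPS):
    """Apply each cleanup in order; pass cleanups=[] to turn cleanup off."""
    for pattern, replacement in cleanups:
        markup = pattern.sub(replacement, markup)
    return markup


print(clean("<br/>\u201cFoo\u201d"))  # <br />&ldquo;Foo&rdquo;
```

Keeping the operations in a plain list is what makes it easy to supply
your own set or an empty one.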

You can now get a pretty-print version of parsed HTML to get a visual
picture of how Beautiful Soup parses it, with the Tag.prettify()
method.

== Strings and Unicode ==

There are separate NavigableText subclasses for ASCII and Unicode
strings. These classes directly subclass the corresponding base data
types. This means you can treat NavigableText objects as strings
instead of having to call methods on them to get the strings.

str() on a Tag always returns a string, and unicode() always returns
Unicode. Previously it was inconsistent.

== Tree traversal ==

In a first() or fetch() call, the tag name or the desired value of an
attribute can now be any of the following:

* A string (matches that specific tag or that specific attribute value)
* A list of strings (matches any tag or attribute value in the list)
* A compiled regular expression object (matches any tag or attribute
  value that matches the regular expression)
* A callable object that takes the Tag object, or the attribute value
  as a string. It returns None, false, or an empty string if its
  argument doesn't match, and any other value if it does.
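
The four kinds of criteria listed above can be sketched as a single
dispatch function. This is an illustration only, not the library's
implementation: the matches() name is hypothetical, it works on plain
strings rather than Tag objects, and the use of search() for regular
expressions is an assumption.

```python
import re


def matches(criterion, value):
    """Return True if `value` satisfies `criterion`."""
    if callable(criterion):
        # Any true result counts as a match; None, False and "" do not.
        return bool(criterion(value))
    if hasattr(criterion, "search"):
        # Compiled regular expression (using search() is an assumption).
        return criterion.search(value) is not None
    if isinstance(criterion, (list, tuple)):
        # A list matches if the value is any one of its members.
        return value in criterion
    # Otherwise treat the criterion as a plain string.
    return value == criterion


print(matches("td", "td"))                         # True
print(matches(["td", "th"], "tr"))                 # False
print(matches(re.compile("^t[dh]$"), "th"))        # True
print(matches(lambda name: len(name) == 1, "p"))   # True
```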

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I took out
SQL-style wildcards. I'll put them back if someone complains, but
their removal simplifies the code a lot.

You can use fetch() and first() to search for text in the parse tree,
not just tags. There are new alias methods fetchText() and firstText()
designed for this purpose. As with searching for tags, you can pass in
a string, a regular expression object, or a method to match your text.

If you pass in something besides a map to the attrs argument of
fetch() or first(), Beautiful Soup will assume you want to match that
thing against the "class" attribute. When you're scraping
well-structured HTML, this makes your code a lot cleaner.

1.x and 2.x both let you call a Tag object as a shorthand for
fetch(). For instance, foo("bar") is a shorthand for
foo.fetch("bar"). In 2.x, you can also access a specially-named member
of a Tag object as a shorthand for first(). For instance, foo.barTag
is a shorthand for foo.first("bar"). By chaining these shortcuts you
can traverse a tree in very little code:

 for header in soup.bodyTag.pTag.tableTag('th'):

If an element relationship (like parent or next) doesn't apply to a
tag, it'll now show up as Null instead of None. first() will also
return Null if you ask it for a nonexistent tag. Null is an object
that's just like None, except you can do whatever you want to it and
it'll give you Null instead of throwing an error.

This lets you do tree traversals like soup.htmlTag.headTag.titleTag
without having to worry if the intermediate stages are actually
there. Previously, if there was no 'head' tag in the document, headTag
in that instance would have been None, and accessing its 'titleTag'
member would have thrown an AttributeError. Now, you can get what you
want when it exists, and get Null when it doesn't, without having to
do a lot of conditionals checking to see if every stage is None.
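
The combination of the fooTag shorthand and Null can be sketched in a
few lines. This is a simplified illustration, not the 2.0 source: the
NullType and Tag classes here are hypothetical, and this first() only
looks at direct children.

```python
class NullType:
    """An object that absorbs attribute access and calls, returning itself."""

    def __getattr__(self, name):
        return self

    def __call__(self, *args, **kwargs):
        return self

    def __bool__(self):
        return False

    def __repr__(self):
        return "Null"


Null = NullType()


class Tag:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def first(self, name):
        """Return the first direct child called `name`, or Null."""
        for child in self.children:
            if child.name == name:
                return child
        return Null

    def __getattr__(self, attr):
        # Only reached for attributes that do not otherwise exist.
        if attr.endswith("Tag"):
            return self.first(attr[:-3])
        raise AttributeError(attr)


soup = Tag("[document]", [Tag("html", [Tag("body", [Tag("p")])])])
print(soup.htmlTag.bodyTag.pTag.name)   # p
print(soup.htmlTag.headTag.titleTag)    # Null
```

Because Null is false in a boolean context, a single check at the end
of the chain is enough to find out whether the whole path existed.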

There are two new relations between page elements: previousSibling and
nextSibling. They reference the previous and next element at the same
level of the parse tree. For instance, if you have HTML like this:

 <p><ul><li>Foo<br /><li>Bar</ul>

The first 'li' tag has a previousSibling of Null and its nextSibling
is the second 'li' tag. The second 'li' tag has a nextSibling of Null
and its previousSibling is the first 'li' tag. The previousSibling of
the 'ul' tag is the first 'p' tag. The nextSibling of 'Foo' is the
'br' tag.

I took out the ability to use fetch() to find tags that have a
specific list of contents. See, I can't even explain it well. It was
really difficult to use, I never used it, and I don't think anyone
else ever used it. To the extent anyone did, they can probably use
fetchText() instead. If it turns out someone needs it I'll think of
another solution.

== Tree manipulation ==

You can add new attributes to a tag, and delete attributes from a
tag. In 1.x you could only change a tag's existing attributes.

== Porting Considerations ==

There are three changes in 2.0 that break old code:

In the post-1.2 release you could pass a function into fetch(). The
function took a string, the tag name. In 2.0, the function takes the
actual Tag object.

It's no longer possible to pass SQL-style wildcards to fetch(). Use a
regular expression instead.

The different parsing algorithm means the parse tree may not be shaped
like you expect. This will only actually affect you if your code uses
one of the affected parts. I haven't run into this problem yet while
porting my code.
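
For the second change above, one possible way to port an old wildcard
pattern is to translate it into a compiled regular expression. The
wildcard_to_regex() helper below is hypothetical, and it assumes that
"%" stands for any run of characters, as in SQL LIKE; the 1.x matching
rules may have differed in detail.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a "%"-style wildcard pattern into a compiled regex."""
    # Escape the literal pieces so characters like "." stay literal,
    # then join them with ".*" where each "%" used to be.
    pieces = [re.escape(piece) for piece in pattern.split("%")]
    return re.compile(r"\A" + ".*".join(pieces) + r"\Z", re.DOTALL)


print(bool(wildcard_to_regex("%foo").match("barfoo")))  # True
print(bool(wildcard_to_regex("foo%").match("barfoo")))  # False
```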

= Between 1.2 and 2.0 =

This is the release to get if you want Python 1.5 compatibility.

The desired value of an attribute can now be any of the following:

* A string
* A string with SQL-style wildcards
* A compiled RE object
* A callable that returns None/false/empty string if the given value
  doesn't match, and any other value otherwise.

This is much easier to use than SQL-style wildcards (see, regular
expressions are good for something). Because of this, I no longer
recommend you use SQL-style wildcards. They may go away in a future
release to clean up the code.

Made Beautiful Soup handle processing instructions as text instead of
ignoring them.

Applied patch from Richie Hindle (richie at entrian dot com) that
makes tag.string a shorthand for tag.contents[0].string when the tag
has only one string-owning child.

Added still more nestable tags. The nestable tags thing won't work in
a lot of cases and needs to be rethought.

Fixed an edge case where searching for "%foo" would match any string
shorter than "foo".

= 1.2 "Who for such dainties would not stoop?" (20040708) =

Applied patch from Ben Last (ben at benlast dot com) that made
Tag.renderContents() correctly handle Unicode.

Made BeautifulStoneSoup even dumber by making it not implicitly close
a tag when another tag of the same type is encountered; only when an
actual closing tag is encountered. This change courtesy of Fuzzy (mike
at pcblokes dot com). BeautifulSoup still works as before.

= 1.1 "Swimming in a hot tureen" =

Added more 'nestable' tags. Changed popping semantics so that when a
nestable tag is encountered, tags are popped up to the previously
encountered nestable tag (of whatever kind). I will revert this if
enough people complain, but it should make more people's lives easier
than harder. This enhancement was suggested by Anthony Baxter (anthony
at interlink dot com dot au).

= 1.0 "So rich and green" (20040420) =

Initial release.