
How To Get "subsoups" And Concatenate/join Them?

I have an HTML document I need to process. I'm using BeautifulSoup for that. Now I would like to retrieve a few 'subsoups' from that document and join them into one soup so I can

Solution 1:

SoupStrainer does exactly what you are asking for and, as a bonus, gives you a performance boost, since only the elements you select are built into the tree rather than the complete document:

from bs4 import BeautifulSoup, SoupStrainer

parse_only = SoupStrainer(id=["first", "third", "loner"])
soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)

Now, the soup object would contain only the desired elements:

<div id="first">
 <p>
  A paragraph.
 </p>
 <a href="another_doc.html">
  A link
 </a>
 <p>
  A paragraph.
 </p>
</div>
<div id="third">
 <p>
  A paragraph.
 </p>
 <a href="another_doc.html">
  A link
 </a>
 <a href="yet_another_doc.html">
  A link
 </a>
</div>
<p id="loner">
 A paragraph.
</p>
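To check which elements survive the strainer, a minimal sketch along these lines may help. The `sample_html` document and the variable names are made up for this illustration and are not part of the original question; the id list is shortened to two entries to keep the example small.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample document, invented for this illustration.
sample_html = """
<html><body>
<h1>Some Heading</h1>
<div id="first"><p>A paragraph.</p></div>
<div id="second"><p>A paragraph.</p></div>
<p id="loner">A paragraph.</p>
</body></html>
"""

# Keep only the elements whose id is in the list.
only_wanted = SoupStrainer(id=["first", "loner"])
strained = BeautifulSoup(sample_html, "html.parser", parse_only=only_wanted)

# Collect the ids of the elements that made it into the strained soup.
kept_ids = [tag["id"] for tag in strained.find_all(id=True)]
print(kept_ids)
```

The heading and the div with id "second" never enter the tree, so there is nothing to remove afterwards.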

Is it also possible to specify not only ids but also tags? For example, what if I want to filter all paragraphs with class="someclass" but not divs with the same class?

In that case, you can write a search function that combines multiple criteria and pass it to the SoupStrainer:

from bs4 import BeautifulSoup, SoupStrainer

my_document = """
<html><body><h1>Some Heading</h1><div id="first"><p>A paragraph.</p><a href="another_doc.html">A link</a><p>A paragraph.</p></div><div id="second"><p>A paragraph.</p><p>A paragraph.</p></div><div id="third"><p>A paragraph.</p><a href="another_doc.html">A link</a><a href="yet_another_doc.html">A link</a></div><p id="loner">A paragraph.</p><p class="myclass">test</p></body></html>
"""

def search(tag, attrs):
    if tag == "p" and "myclass" in attrs.get("class", []):
        return tag

    if attrs.get("id") in ["first", "third", "loner"]:
        return tag


parse_only = SoupStrainer(search)

soup = BeautifulSoup(my_document, "html.parser", parse_only=parse_only)

print(soup.prettify())
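If parsing the whole document first is acceptable, the same combined criteria can also be applied afterwards with find_all and a function that receives a Tag. This is a different technique from the SoupStrainer approach above (no parse-time filtering, so no performance gain), shown here only as a sketch. The `sample_html` document, `WANTED_IDS`, and the `wanted` helper are hypothetical names chosen for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, invented for this illustration.
sample_html = """
<div id="first"><p>A paragraph.</p></div>
<div class="myclass">A div with the class.</div>
<p class="myclass">test</p>
<p id="loner">A paragraph.</p>
"""

WANTED_IDS = {"first", "loner"}

def wanted(tag):
    # On a parsed tree, "class" is a list of class names.
    if tag.name == "p" and "myclass" in tag.get("class", []):
        return True
    # Otherwise keep the tag only if its id is one of the wanted ones.
    return tag.get("id") in WANTED_IDS

soup = BeautifulSoup(sample_html, "html.parser")
matches = soup.find_all(wanted)
print([(tag.name, tag.get("id"), tag.get("class")) for tag in matches])
```

The div with class "myclass" is skipped because the class check only applies to p tags.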

Solution 2:

You can use find_all (spelled findAll in older code), passing in the ids of the elements you want.

import bs4

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = bs4.BeautifulSoup(my_document, 'html.parser')

# EDIT: no regex is needed; you can pass in a list of ids.
sub = soup.find_all(attrs={'id': ['first', 'third', 'loner']})

# EDIT: 'html.parser' keeps BeautifulSoup from automatically adding html and body tags.
sub = bs4.BeautifulSoup('\n\n'.join(str(s) for s in sub), 'html.parser')

print(sub)

>>> <div id="first"><p>A paragraph.</p><a href="another_doc.html">A link</a><p>A paragraph.</p></div><div id="third"><p>A paragraph.</p><a href="another_doc.html">A link</a><a href="yet_another_doc.html">A link</a></div><p id="loner">A paragraph.</p>
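As a possible variation on the join step, the matched elements can be collected into an empty soup without converting them to strings and parsing again. The sketch below assumes that copies of the elements are wanted, so the source soup stays intact. The `sample_html` document and the names `source` and `joined` are made up for this illustration.

```python
import copy

from bs4 import BeautifulSoup

# Hypothetical sample document, invented for this illustration.
sample_html = """
<div id="first"><p>A paragraph.</p></div>
<div id="second"><p>A paragraph.</p></div>
<p id="loner">A paragraph.</p>
"""

source = BeautifulSoup(sample_html, "html.parser")
joined = BeautifulSoup("", "html.parser")  # empty soup to collect into

for element in source.find_all(id=["first", "loner"]):
    # Append a detached copy; appending the element itself would
    # move it out of `source`.
    joined.append(copy.copy(element))

print(joined)
```

Appending the elements directly, without copy.copy, would also work if the original soup is no longer needed, since an element can only have one parent at a time.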
