October 24, 2024
python

Only extracting text from this element, not its children


I’d like to be able to retrieve the current node’s text:

from bs4 import BeautifulSoup as Soup

markup = "<h1>that<span>but not that</span>and that</h1>"
soup = Soup(markup, "html.parser")
assert soup.find_all(string=True, recursive=False) == ["that", "and that"]
# fails: find_all returns []

Other solutions I found only work when you know the node type, for example doing:

assert soup.h1.find_all(string=True, recursive=False) == ["that", "and that"]
# returns ["that", "and that"]

But in this case, I do not know that it’s an h1; I’d like it to be type-agnostic. I looked for something like soup.self, and tried soup[soup.name] in case it was stored as a dictionary, but no luck there.

The existing solutions I found either use soup.h1, which is not node-type agnostic, or use soup.find_all(string=True, recursive=False), which returns [] in my case. I’m not sure whether BeautifulSoup changed the behavior here.
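
For reference, here is a minimal sketch of one possible type-agnostic approach. It assumes the markup has a single top-level element and that the empty result comes from `soup` being the document root, whose only direct child is the `<h1>` tag itself. The helper name `own_text` is hypothetical, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, NavigableString

markup = "<h1>that<span>but not that</span>and that</h1>"
soup = BeautifulSoup(markup, "html.parser")

# name=True matches any tag, so this picks the first element
# in the document without knowing that it is an <h1>.
root = soup.find(True)


def own_text(node):
    """Return the strings that are direct children of `node` (hypothetical helper)."""
    # Note: NavigableString subclasses such as Comment are included too.
    return [str(child) for child in node.children
            if isinstance(child, NavigableString)]


direct_text = own_text(root)
print(direct_text)
```

Because `own_text` only inspects `.children`, it should work on any tag object regardless of its name, for example one obtained from a loop over `find_all(True)`.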


