
Only extracting text from this element, not its children


I’d like to be able to retrieve the current node’s text:

from bs4 import BeautifulSoup as Soup

markup = "<h1>that<span>but not that</span>and that</h1>"
soup = Soup(markup, "html.parser")
assert soup.find_all(string=True, recursive=False) == ["that", "and that"]
# fails: the call returns []

Other solutions I found only work when you know the node type, for example doing:

assert soup.h1.find_all(string=True, recursive=False) == ["that", "and that"]
# returns ["that", "and that"]

But in this case, I do not know that it’s an h1; I’d like it to be type-agnostic. I looked for something like soup.self, and tried soup[soup.name] in case it was stored as a dictionary, but no luck there.

The solutions I found either use soup.h1, which is not node-type agnostic, or use soup.find_all(string=True, recursive=False), which returns [] in my case. I’m not sure whether BeautifulSoup has changed the behavior here.
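One possible reading of the [] result, offered as a sketch rather than a confirmed answer: the soup object is the document root, so with recursive=False it only looks at its direct children, and here the only direct child is the h1 tag itself, not any string. Assuming that is what happens, picking the first tag with soup.find(True) (True matches a tag of any name) keeps the lookup type-agnostic. The helper name own_strings is made up for this illustration.

```python
from bs4 import BeautifulSoup as Soup


def own_strings(node):
    """Return the strings that are direct children of `node` (hypothetical helper)."""
    # recursive=False limits the search to direct children, so text inside
    # nested tags such as <span> is skipped.
    return [str(s) for s in node.find_all(string=True, recursive=False)]


markup = "<h1>that<span>but not that</span>and that</h1>"
soup = Soup(markup, "html.parser")

# The root's only direct child is the <h1> tag, so it has no direct strings.
root_strings = own_strings(soup)

# find(True) returns the first tag regardless of its name.
node = soup.find(True)
node_strings = own_strings(node)
```

The same helper should work on any tag obtained some other way (for example while iterating over soup.find_all(True)), since it never refers to the tag name.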


