I am seeing strange behavior while parsing text from an HTML file with Python's re module. I would greatly appreciate suggestions on the regex I should use.
import re
string = '<a href="https://academia/course/3743">3743</a>, <a href="https://academia/course/3963">3963</a>, <a href="https://academia/course/3850">3850</a>,'
# I want to extract 3743, 3963, 3850 from the above text
pattern = r'.*?<a href=".*">([0-9]+)</a>,.*'
result = re.findall(pattern, string)
print(result)
# Output
['3850']
It prints only the last occurrence and leaves out the rest. I also tried following the question "python findall finds only the last occurrence", but it didn’t help.
Can anybody please help with the regex I should use to get all the numbers?
# expected output
[3743, 3963, 3850]
PS: I can’t use third-party Python modules like bs4. I need to stick with modules from the Python standard library.
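One possible fix, offered as a minimal sketch rather than a definitive answer: the greedy `.*` inside the href and the leading and trailing `.*` let a single match swallow the whole string, so findall sees only one match, ending at the last link. Restricting the href to `[^"]*` and dropping the surrounding `.*` keeps each match inside one tag. The variable names `html` and `course_ids` are only illustrative, and this assumes every link has exactly the form shown in the question.

```python
import re

html = ('<a href="https://academia/course/3743">3743</a>, '
        '<a href="https://academia/course/3963">3963</a>, '
        '<a href="https://academia/course/3850">3850</a>,')

# [^"]* cannot cross the closing quote of the href, so each match
# stays inside a single <a> tag instead of spanning the whole string.
pattern = r'<a href="[^"]*">([0-9]+)</a>'

course_ids = re.findall(pattern, html)
print(course_ids)  # ['3743', '3963', '3850']
```

findall returns the captured groups as strings; if integers are needed, `[int(c) for c in course_ids]` converts them.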