
Unconditionally stop scraping at specified element (or EOF)


I’m using the Python lxml.html package to scrape an HTML file. The HTML I’m trying to scrape reads in part:

<h1>Description of DAB Ensemble 1</h1><table>Stuff I don't care about</table>
    <!-- Tags I don't care about -->
    <div id="announcement_data_block">
        <h3>Announcement information</h3>
        <p>No announcement information is broadcast</p>
    </div>
    <!-- More tags I don't care about -->
<h1>Description of DAB Ensemble 2</h1><table>Stuff I don't care about</table>
    <!-- Tags I don't care about -->
    <div id="announcement_data_block">
        <h3>Announcement information</h3>
            <h4>Announcement switching (FIG0/19)</h4>
                <table>Stuff I DO care about</table>
    </div>
    <!-- More tags I don't care about -->

I’m interested in the "Announcement switching" table, which may or may not be present for a given DAB ensemble. I have an XPath expression, evaluated with lxml.html, as follows:

f'//h1[text()="Description of DAB Ensemble {ens_idx}"]/following-sibling::table/following-sibling::div[@id="announcement_data_block"]/h4[starts-with(text(), "Announcement switching")]/following-sibling::table'

Per my understanding, this XPath statement is saying, for a given ens_idx value:

Start at root and find an h1 tag with text matching "Description of DAB Ensemble {ens_idx}" (e.g. "Description of DAB Ensemble 1", "Description of DAB Ensemble 2"), then go to the first table you see after that. In the above example, it would be the table labelled "Stuff I don’t care about". Afterwards, go to the next div whose id is "announcement_data_block". Within that div, find an h4 tag whose text starts with "Announcement switching". Get the first table following that.

In the example above, DAB Ensemble 1 does not have such a table. I would want the query to return None when attempting to get the table for DAB Ensemble 1. However, XPath doesn’t know to stop when it hits the h1 tag "Description of DAB Ensemble 2", so it keeps going until it finds DAB Ensemble 2’s h4 tag. I’m looking for help in finding an XPath expression that will unconditionally stop at the next "Description of DAB Ensemble" h1 tag. Essentially I wish to modify the directive to:
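To make the problem concrete, here is a minimal sketch that runs the expression from the question unchanged against a trimmed-down page. The sample markup, the table cell text, and the helper name `original_query` are assumptions made for illustration; only the XPath itself comes from the question.

```python
import lxml.html

# Hypothetical, trimmed-down copy of the page structure from the question.
SAMPLE = """
<html><body>
<h1>Description of DAB Ensemble 1</h1><table><tr><td>Not wanted</td></tr></table>
<div id="announcement_data_block">
  <h3>Announcement information</h3>
  <p>No announcement information is broadcast</p>
</div>
<h1>Description of DAB Ensemble 2</h1><table><tr><td>Not wanted</td></tr></table>
<div id="announcement_data_block">
  <h3>Announcement information</h3>
  <h4>Announcement switching (FIG0/19)</h4>
  <table><tr><td>Ensemble 2 switching table</td></tr></table>
</div>
</body></html>
"""

def original_query(root, ens_idx):
    """Run the XPath from the question unchanged and return the list of matches."""
    return root.xpath(
        f'//h1[text()="Description of DAB Ensemble {ens_idx}"]'
        '/following-sibling::table'
        '/following-sibling::div[@id="announcement_data_block"]'
        '/h4[starts-with(text(), "Announcement switching")]'
        '/following-sibling::table'
    )

root = lxml.html.fromstring(SAMPLE)
hits = original_query(root, 1)
for table in hits:
    # Asking for Ensemble 1 still reaches Ensemble 2's table.
    print(table.text_content().strip())
```

The likely reason is that `following-sibling::` selects every later sibling, not just the next one, so the walk from Ensemble 1's h1 also reaches the div that belongs to Ensemble 2.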

Start at root and find an h1 tag with text matching "Description of DAB Ensemble {ens_idx}" (e.g. "Description of DAB Ensemble 1", "Description of DAB Ensemble 2"), then go to the first table you see after that. In the above example, it would be the table labelled "Stuff I don’t care about". Afterwards, go to the next div whose id is "announcement_data_block". Within that div, find an h4 tag whose text starts with "Announcement switching". Get the first table following that. If these criteria are not met before the h1 tag with text matching "Description of DAB Ensemble {ens_idx + 1}" or EOF, then return None.

The final sentence of that directive is what is missing from my XPath expression. Does anyone know how to construct such an expression?
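One possible approach, offered as a sketch and not a definitive answer: rather than walking forward from the h1, select the div and require that the nearest h1 before it is the requested ensemble. This assumes the h1, table, and div elements are siblings, as in the sample, and that the div ids are quoted correctly. The sample markup and the helper name `find_switching_table` are hypothetical.

```python
import lxml.html

# Hypothetical sample mirroring the structure described in the question.
SAMPLE = """
<html><body>
<h1>Description of DAB Ensemble 1</h1><table><tr><td>Stuff I don't care about</td></tr></table>
<div id="announcement_data_block">
  <h3>Announcement information</h3>
  <p>No announcement information is broadcast</p>
</div>
<h1>Description of DAB Ensemble 2</h1><table><tr><td>Stuff I don't care about</td></tr></table>
<div id="announcement_data_block">
  <h3>Announcement information</h3>
  <h4>Announcement switching (FIG0/19)</h4>
  <table><tr><td>Stuff I DO care about</td></tr></table>
</div>
</body></html>
"""

def find_switching_table(root, ens_idx):
    """Return the announcement-switching table for one ensemble, or None."""
    expr = (
        '//div[@id="announcement_data_block"]'
        # Keep the div only if the closest h1 before it is the requested ensemble.
        f'[preceding-sibling::h1[1][text()="Description of DAB Ensemble {ens_idx}"]]'
        '/h4[starts-with(text(), "Announcement switching")]'
        '/following-sibling::table[1]'
    )
    matches = root.xpath(expr)  # xpath() returns a list, possibly empty
    return matches[0] if matches else None

root = lxml.html.fromstring(SAMPLE)
print(find_switching_table(root, 1))  # None: Ensemble 1 has no such table
print(find_switching_table(root, 2).text_content().strip())
```

Because `preceding-sibling` is a reverse axis, `h1[1]` there means the closest h1 before the div, which is what bounds the search to a single ensemble. Note that `xpath()` itself returns an empty list when nothing matches, so the conversion to None happens in Python.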


