Handling sync vs async with scrapy + Playwright

Thread starter: Allen Y (Guest)

I'm using Scrapy with Playwright to load a Google Jobs search results page. Playwright is needed to render the page in a browser and then click on the different jobs to reveal each job's details.

Example URL I want to extract information from: https://www.google.com/search?q=product+designer+nyc&ibp=htl;jobs

While I can open that page in a Playwright browser and parse the fields I want from an interactive Python session, I'm not sure how to integrate Playwright into Scrapy smoothly. I have the start_requests method set up correctly, in the sense that Playwright opens a browser to the desired page, like the URL above; a simplified sketch of that setup follows.
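For reference, this follows the usual scrapy-playwright pattern of requesting the page with playwright and playwright_include_page in the request meta (a minimal sketch; the spider name is just illustrative):

Code:
import scrapy


class GoogleJobsSpider(scrapy.Spider):
    name = "google_jobs"  # illustrative name

    def start_requests(self):
        url = "https://www.google.com/search?q=product+designer+nyc&ibp=htl;jobs"
        yield scrapy.Request(
            url,
            meta={
                "playwright": True,               # render the page via Playwright
                "playwright_include_page": True,  # expose the page object in parse
            },
            callback=self.parse,
        )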

Here's what I have so far for the parse function:

Code:
async def parse(self, response):
    page = response.meta["playwright_page"]

    jobs = page.locator("//li")
    num_jobs = jobs.count()

    for idx in range(num_jobs):
        # For each job found, first need to click on it
        await jobs.nth(idx).click()

        # Then grab this large section of the page that has details about the job
        # In that large section, first click a couple of "More" buttons
        job_details = page.locator("#tl_ditsc")
        more_button1 = job_details.get_by_text("More job highlights")
        await more_button1.click()
        more_button2 = job_details.get_by_text("Show full description")
        await more_button2.click()

        # Then take that large section and pass it to another function for parsing
        soup = BeautifulSoup(job_details, 'html.parser')
        data = self.parse_single_jd(soup)

    ...
    yield {data here}
    return

When I try to run the above, it errors on the for idx in range(num_jobs) line with "TypeError: 'coroutine' object cannot be interpreted as an integer". When running in an interactive Python shell, the use of page.locator, jobs.count(), jobs.nth(#).click(), etc. all works fine. This leads me to believe I'm misunderstanding something fundamental about the async nature of parse, which I believe is needed in order to do things like click on the page (per this documentation). It's as if I need to force num_jobs = jobs.count() to 'evaluate', but it isn't doing so.

(Note that a bit further down, if I add an if more_button1.count() check before the await more_button1.click() line, I run into the same sort of error; again, it's as if I need to force the .count() to 'evaluate'.)
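My current guess, which I haven't been able to confirm, is that under Playwright's async API methods like count() and inner_html() return coroutines, so their results need to be awaited before they can be used, roughly like this sketch:

Code:
# Sketch of the suspected fix: await the coroutine-returning calls
num_jobs = await jobs.count()

for idx in range(num_jobs):
    await jobs.nth(idx).click()

    job_details = page.locator("#tl_ditsc")
    more_button1 = job_details.get_by_text("More job highlights")
    if await more_button1.count():  # awaited before the truthiness check
        await more_button1.click()

    # BeautifulSoup expects markup, not a Locator, so the HTML would
    # presumably need to be extracted from the locator first
    soup = BeautifulSoup(await job_details.inner_html(), "html.parser")

Is that the right way to think about it?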

Any advice?