OiO.lk Community platform!


Beautiful Soup says a URL is a filename, but it clearly isn't

  • Thread starter: Tom L (Guest)
I've been working through some web scraping tutorials; this one is specifically aimed at teaching how to spoof headers. Here's the relevant portion of the code:

Code:
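import requests
from bs4 import BeautifulSoup

# `head` is the dict of spoofed request headers defined earlier (not shown)
url = 'https://www.scrapethissite.com/pages/advanced/?gotcha=headers'
response = requests.get(url, headers=head)
response = response.content
soup = BeautifulSoup(response, 'html.parser')
print(soup.prettify())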

This is the output:

Code:
Accept value is missing 'text/html'

/var/folders/8p/sbpv06dx3b576tlsdyc4cs9c0000gn/T/ipykernel_25552/4089810762.py:12: MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.
  soup = BeautifulSoup(response, 'html.parser')

I've used the last four lines of this code on every other tutorial project on the same site and there was no problem. I don't think it's a problem with the user agents or the spoofed headers, either, because those errors disappeared when I added the headers. The only thing I can think of that might be an issue is the question mark in the URL, but that would be weird; the site is literally titled "Scrape This Site", so it wouldn't make sense for one of their tutorial pages to be unscrapeable due to an oversight in the URL. I've tried to re-encode the question mark (%3F, if I recall correctly), but that just returned a 404. Any ideas about what might be going on here?
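
On the question mark: it is the standard separator between a URL's path and its query string, so it is unlikely to be the culprit. Percent-encoding it as `%3F` makes it part of the path instead, which would be consistent with the 404. A small standard-library sketch of the difference (the variable names are illustrative):

```python
from urllib.parse import urlsplit

url = "https://www.scrapethissite.com/pages/advanced/?gotcha=headers"
encoded = url.replace("?", "%3F")

original_parts = urlsplit(url)
encoded_parts = urlsplit(encoded)

# With a literal '?', the query string is split off from the path.
print(original_parts.path)   # /pages/advanced/
print(original_parts.query)  # gotcha=headers

# With '%3F', the former query becomes part of the path and the query is empty.
print(encoded_parts.path)    # /pages/advanced/%3Fgotcha=headers
print(repr(encoded_parts.query))  # ''
```

The server is then asked for a path that probably does not exist, rather than for `/pages/advanced/` with a `gotcha=headers` parameter.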