I don't know why, but I am getting a LookupError (unknown encoding: 'b'utf8'') when I try to scrape and parse a Walmart product page.
I have already set the encoding to utf-8 and also tried removing the BOM, following this post: lxml LookupError occured. Arguments: ("unknown encoding: 'b'utf-8-sig''",).
Appreciate any help or pointers!
Complete code:
import httpx
from parsel import Selector
import json
# Fake browser-like headers
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}
response = httpx.get("https://www.walmart.com/product-page-url", headers=BASE_HEADERS)
if response.encoding is None:
    response.encoding = 'utf-8'
# Remove BOM if present
content = response.content
if content.startswith(b'\xef\xbb\xbf'):
    content = content[3:]  # Remove the BOM
response_text = content.decode('utf-8')
sel = Selector(text=response_text)
data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
if data:
    data = json.loads(data)
    product = data["props"]["pageProps"]["initialData"]["data"]["product"]
    print(product)
else:
    print("No product data found.")
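
One observation about the error message itself, in case it helps narrow things down: the name it reports, b'utf8', looks like the repr of a Python bytes object rather than a real encoding name. That would suggest an encoding name held as bytes is being passed through str() somewhere before the lookup, possibly inside the parsing libraries rather than in my decoding code, so comparing the installed parsel and lxml versions might be worth a try. The sketch below only reproduces that string shape with the standard library. It is an assumption about the cause, not a confirmed diagnosis; `is_known_encoding` is a hypothetical helper, and Python's codec registry stands in for whatever lookup lxml performs internally.

```python
import codecs


def is_known_encoding(name: str) -> bool:
    """Hypothetical helper: True if Python's codec registry knows this name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


raw_name = b"utf8"                   # encoding name held as bytes
mangled = str(raw_name)              # the bytes repr "b'utf8'", not the name
decoded = raw_name.decode("ascii")   # the actual name, "utf8"

# Show each candidate name and whether the codec registry accepts it.
print(repr(mangled), is_known_encoding(mangled))
print(repr(decoded), is_known_encoding(decoded))
```

If the mangled form is rejected and the decoded form is accepted, that would be consistent with the error coming from how the encoding name is passed along, not from the page content or a BOM.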