October 25, 2024

Trying to build a news scraper, but can't access the Wall Street Journal site


I'm trying to build a news scraper, but I can't access the Wall Street Journal site. I have a subscription to the site and a CSRF token, but I am still denied access. I contacted the WSJ support team but have not received a response yet. Is there another way around this?

When I run the code, it prints "Login successful", but then it raises an exception: "requests.exceptions.HTTPError: 403 Client Error: Forbidden for url:"

import requests
from bs4 import BeautifulSoup

login_url = "https://id.wsj.com/auth/login"
news_url = "https://www.wsj.com/"

with requests.Session() as session:
    login_page = session.get(login_url)
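    login_page.raise_for_status()
    # Credential values were left out of the original post; '...' is a placeholder.
    payload = {
        'username': '...',
        'password': '...',
        'csrfToken': '...',
    }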
    login_response = session.post(login_url, data=payload)
    login_response.raise_for_status()

    if login_response.ok:
        print("Login successful")
        response = session.get(news_url)
        response.raise_for_status()
        if response.ok:
            soup = BeautifulSoup(response.text, 'html.parser')
            headlines = soup.find_all('h3')
            for headline in headlines:
                print(headline.text)
    else:
        print("Login failed")
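
One thing that may help narrow this down: raise_for_status() and .ok only look at the HTTP status code, so a 2xx reply to the login POST shows that the server answered, not necessarily that the session is authenticated. Below is a minimal sketch of a helper that summarizes a response so the login reply and the 403 reply can be compared side by side. The name summarize_response, the list of headers, and the stand-in response object are assumptions made for illustration; none of them come from requests or from WSJ.

```python
from types import SimpleNamespace

# Headers that are often useful when diagnosing a refused request.
# This selection is an assumption, not an official list.
HEADERS_OF_INTEREST = ("Content-Type", "Server", "Set-Cookie", "Location")


def summarize_response(label, response, body_chars=200):
    """Hypothetical helper: collect the key facts about a response.

    Works with any object exposing status_code, url, headers and text,
    which includes requests.Response.
    """
    headers = response.headers
    return {
        "label": label,
        "status": response.status_code,
        "url": response.url,
        "headers": {
            name: headers[name] for name in HEADERS_OF_INTEREST if name in headers
        },
        # The first part of the body often shows whether the server sent
        # a login form or an error page instead of the expected content.
        "body_start": response.text[:body_chars],
    }


# Stand-in object so the sketch runs without network access.
fake_response = SimpleNamespace(
    status_code=403,
    url="https://www.wsj.com/",
    headers={"Content-Type": "text/html"},
    text="<html><body>Forbidden</body></html>",
)
print(summarize_response("news", fake_response))
```

In the script above, this could be called as summarize_response("login", login_response) and summarize_response("news", response) before each raise_for_status() call, so the details are printed even when the second request fails.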


