OiO.lk Community platform!

Python Requests: website does not recognize login after returning status <200>

  • Thread starter: Martin (Guest)

I'm trying to scrape letterboxd.com for film info, but when I fetch the HTML of the page I want to use, I get served different HTML than my browser does. So I tried to log in through a requests Session. But even after getting a 200 status from the response, any GET I perform on the page does not recognize my credentials (nor does it change the HTML I get served).

I've verified that my User-Agent header is the same as the one my browser sends (all the other headers are the requests defaults), and I've looked at the HTML to make sure I'm supplying all the required information. I must be missing something. Here is the HTML of the <form> section of the website:


Code:
<form method="post" action="#" id="signin" class="signin signin-form js-header-signin-form js-signin" data-url="/user/login.do" data-recaptcha-action="signin" novalidate='novalidate' autocorrect='off' autocapitalize='off'>
    <input type="hidden" name="__csrf" value="placeholder" />
    <input type="hidden" name="authenticationCode" value="" />
    <fieldset class="fieldset">
        <div class="fields">
            <div class="col">
                <label for="username">Username</label>
                <input type="email" name="username" id="username" class="field signin-field" tabindex="1" data-focus-control="signingIn" autocomplete='email' inputmode='email' value="" />
            </div>
            <div class="col">
                <label for="password">Password</label>
                <input type="password" name="password" id="password" class="field signin-field" tabindex="2" autocomplete='current-password' value="" />
            </div>
            <div class="signin-actions">
                <label for="remember" class="option-label -checkbox -small">
                    <input type="checkbox" name="remember" id="remember" class="checkbox" tabindex="3" value="true" /><i class="substitute"></i>
                    <span class="focus">Remember<span class="mob-hide"> me</span></span>
                </label>
                <p class="reset" tabindex="5"><a class="reset-password-link" href="/user/request-password-reset" target="_top">Forgotten<span class="elongated"> username or password</span>?</a></p>
            </div>
            <div class="col buttons">
                <div class="button-container"><input type="submit" value="Sign in" class="button -action button-green" tabindex="4" /><i></i></div>
                <div class="close js-close-signin">&times;</div>
            </div>
        </div>
    </fieldset>
    <div id="signin-message" class="errormessage"></div>
</form>

From this, I tried to find the relevant data to POST, and wrote the following code:

Code:
import requests

user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36"    # from browser

login = {'__csrf': 'a0b859246d2044858517',     # from Chrome Inspector, after login
         'authenticationCode': '',             # this was left null in browser
         'username': 'my_user',                # from name='username' in an <input> element
         'password': 'my_pass',                # from name='password'
         'remember': 'true',                   # from name='remember'
         'submit':   'Sign In'}                # from type='submit'

with requests.Session() as s:
    s.headers['User-Agent'] = user_agent

    p = s.post("https://letterboxd.com/user/login.do",   # from data-url member in <form> element
               data=login,
               auth=('my_user', 'my_pass'))

    print(p.status_code)

The output of this code is 200. Yet, when I GET the homepage, within the same session, a JS script in the response HTML indicates that I am not logged in, notably in this line:

Code:
analytic_params['user_type'] = 'Visitor';

This line stays unchanged after the requests login. In my browser it reads as follows after login:

Code:
analytic_params['user_type'] = 'Member';
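
The check described above can be scripted rather than read off the page by eye. This is only a minimal sketch: the helper name extract_user_type is hypothetical, and it assumes the script line keeps exactly the shape quoted in this post (single quotes, a plain assignment).

```python
import re

# Pattern for the line quoted in the post, e.g.
#   analytic_params['user_type'] = 'Visitor';
# Assumption: the site keeps this exact shape (single quotes, simple assignment).
USER_TYPE_RE = re.compile(r"analytic_params\['user_type'\]\s*=\s*'([^']*)'")


def extract_user_type(html):
    """Return the user_type value found in the page source, or None if absent."""
    match = USER_TYPE_RE.search(html)
    return match.group(1) if match else None
```

Inside the same session it could be called as extract_user_type(s.get("https://letterboxd.com/").text), which would make it easy to compare the value before and after the login POST.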

I must be missing some authentication, or something simple. I'm quite new at this, so insight would be helpful!
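
For reference, the step of picking the POST fields out of the form markup can also be done programmatically. The following is a rough sketch using only the standard library; the names FormFieldCollector and collect_form_fields are hypothetical, and the code only reads the static markup, so it knows nothing about values a page script may fill in later (the __csrf field, for instance, shows up as "placeholder" in the markup quoted above).

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs from the <input> elements in a piece of HTML."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input ... /> are routed here as well.
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is None:
            # Inputs without a name attribute are not part of submitted form data.
            return
        # Note: a browser only sends a checkbox when it is ticked;
        # this sketch records its value unconditionally.
        self.fields[name] = attributes.get("value") or ""


def collect_form_fields(html):
    """Return a dict of the named inputs found in the given markup."""
    collector = FormFieldCollector()
    collector.feed(html)
    return collector.fields
```

Run against the form quoted above, this would yield the keys __csrf, authenticationCode, username, password and remember. The submit button has no name attribute in that markup, so it does not appear in the result.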