Web Scraping problem in Python using Jupyter Notebook

Thread starter: Xotic Flash24 (Guest)
I have a question I need to do for an assignment, but I cannot for the life of me figure it out. The question is as follows: Write a Python application that can be used to scrape data from the CareerJunction website (at careerjunction.co.za). The application must allow the user to enter a job title to search for, then for each result (on the first result page only) extract the following information:


  1. The job title
  2. The name of the recruiter
  3. The job salary
  4. The job position
  5. The job location
  6. The date posted

(8 Marks) The question does state that we need to use careerjunction.co.za, but we got permission to use indeed.com, so I have been using that instead.

I have tried many types of code, but every time I just get errors or 403 Forbidden.
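(Side note on the 403s: `requests` sends a default `python-requests` User-Agent that many sites reject outright. A sketch of browser-like headers is below; the values are illustrative examples, not a guaranteed bypass, and a site that deliberately blocks bots, as Indeed does, may still return 403.)

```python
# Browser-like request headers; the exact values are illustrative,
# not a guaranteed bypass -- a site that blocks bots may still 403.
BROWSER_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

# Usage (assuming the requests library):
#   response = requests.get(url, headers=BROWSER_HEADERS, timeout=10)
#   response.raise_for_status()  # turns a 403 into an explicit exception
```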

The code snippets I have tried are:

1st one:

Code:
import requests
from bs4 import BeautifulSoup

def scrape_careerjunction(job_title):
    url = f"https://www.careerjunction.co.za/jobs/results/?keyword={job_title}"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    # Send HTTP request
    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        jobs = []

        # Find all job listings
        job_listings = soup.find_all('div', class_='listing-item')

        for job in job_listings:
            job_title = job.find('h2', class_='title').text.strip()
            recruiter = job.find('span', class_='company').text.strip()
            salary = job.find('span', class_='salary').text.strip()
            position = job.find('span', class_='location').text.strip()
            location = job.find('span', class_='area').text.strip()
            date_posted = job.find('span', class_='time').text.strip()

            job_info = {
                'Job Title': job_title,
                'Recruiter': recruiter,
                'Salary': salary,
                'Position': position,
                'Location': location,
                'Date Posted': date_posted
            }

            jobs.append(job_info)

        return jobs
    else:
        print(f"Failed to retrieve data, status code: {response.status_code}")
        return []
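(One likely source of the mid-run errors: `job.find(...)` returns `None` whenever a listing lacks, say, a salary span, and `.text` on `None` raises `AttributeError`. A small guard helper is sketched below; the `'N/A'` default is my own choice, not part of the assignment.)

```python
def safe_text(parent, tag, cls, default='N/A'):
    # BeautifulSoup's find() returns None when no matching element
    # exists; guard before touching the text to avoid AttributeError.
    el = parent.find(tag, class_=cls)
    return el.get_text(strip=True) if el is not None else default
```

Used as e.g. `salary = safe_text(job, 'span', 'salary')`, so one missing field no longer aborts the whole page.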

2nd one:

Code:
import requests
from bs4 import BeautifulSoup

# Function to scrape job data from Indeed
def scrape_jobs(job_title):
    # Replace spaces in job_title with '+'
    job_title = job_title.replace(' ', '+')

    # The base URL of Indeed
    base_url = 'https://www.indeed.com/jobs?q='

    # Complete URL with the job title
    search_url = f'{base_url}{job_title}'

    # Send a request to the website
    response = requests.get(search_url)

    # Check if the request was successful
    if response.status_code == 200:
        # Parse the content with BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')
    
        # Find all job entries - this will depend on the website's structure
        job_entries = soup.find_all('div', class_='jobsearch-SerpJobCard') # Replace with actual class
    
        for job in job_entries:
            # Extract the required information
            job_title = job.find('h2', class_='title').text.strip() # Replace with actual class
            company_name = job.find('span', class_='company').text.strip() # Replace with actual class
            job_location = job.find('span', class_='location').text.strip() # Replace with actual class
            job_summary = job.find('div', class_='summary').text.strip() # Replace with actual class
        
            # Print the extracted information
            print(f'Job Title: {job_title}')
            print(f'Company Name: {company_name}')
            print(f'Location: {job_location}')
            print(f'Summary: {job_summary}')
            print('-----------------------------------')
    else:
        print('Failed to retrieve the webpage')

# Example usage
job_to_search = input('Enter a job title to search for: ')
scrape_jobs(job_to_search)

3rd one:

Code:
import csv
from datetime import datetime
from bs4 import BeautifulSoup
import requests

def get_url(position, location):
    """Generate a url from position and location"""
    template = 'https://za.indeed.com/jobs?q={}&l={}'
    url = template.format(position, location)
    return url

url = get_url('senior accountant', 'charlotte nc')

response = requests.get(url)
response
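(Note that `get_url('senior accountant', 'charlotte nc')` above puts raw spaces into the query string. A sketched rework using the standard library's `quote_plus` encodes the values properly; everything else is unchanged.)

```python
from urllib.parse import quote_plus

def get_url(position, location):
    """Generate a url, percent-encoding spaces and punctuation."""
    template = 'https://za.indeed.com/jobs?q={}&l={}'
    # quote_plus turns 'senior accountant' into 'senior+accountant'
    return template.format(quote_plus(position), quote_plus(location))
```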

4th one:

Code:
# import modules
import requests
from bs4 import BeautifulSoup

# user-defined function to
# scrape the data
# and return it as a string
def getdata(url):
    r = requests.get(url)
    return r.text

Get HTML code using the parser

Code:
def html_code(url):
    # pass the url
    # into the getdata function
    htmldata = getdata(url)
    soup = BeautifulSoup(htmldata, 'html.parser')

    # return the parsed html
    return soup

Filter job data using find_all

Code:
def job_data(soup):
    # find the matching tags
    # with find_all()
    # and collect their text
    data_str = ""
    for item in soup.find_all("a", class_="jobtitle turnstileLink"):
        data_str = data_str + item.get_text()
    result_1 = data_str.split("\n")
    return result_1

Filter company data using find_all

Code:
def company_data(soup):
    # find the matching tags
    # with find_all()
    # and collect their text
    data_str = ""
    for item in soup.find_all("div", class_="sjcl"):
        data_str = data_str + item.get_text()
    result_1 = data_str.split("\n")

    # drop empty lines
    res = []
    for i in range(1, len(result_1)):
        if len(result_1[i]) > 1:
            res.append(result_1[i])
    return res

Driver / main function

Code:
if __name__ == "__main__":
    # Data for URL
    job = "data+science+internship"
    location = "Noida%2C+Uttar+Pradesh"
    url = "https://in.indeed.com/jobs?q=" + job + "&l=" + location

    # Pass this URL into the soup
    # function, which returns the
    # parsed html
    soup = html_code(url)

    # call job and company data
    # and store the results
    job_res = job_data(soup)
    com_res = company_data(soup)

    # Traverse both result lists
    temp = 0
    for i in range(1, len(job_res)):
        j = temp
        for j in range(temp, 2 + temp):
            print("Company Name and Address : " + com_res[j])

        temp = j
        print("Job : " + job_res[i])
        print("-----------------------------")
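(The `temp`/`j` index bookkeeping in that final loop is fragile: an off-by-one throws jobs and companies out of step, and `com_res[j]` raises `IndexError` when the lists differ in length. A simpler sketch pairs the two lists with `zip`, under the assumption of one company entry per job.)

```python
def pair_results(job_res, com_res):
    # zip stops at the shorter list, so mismatched lengths
    # can no longer raise an IndexError mid-loop
    pairs = []
    for job, company in zip(job_res, com_res):
        pairs.append((job, company))
    return pairs
```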

With some of these I can enter values but get nothing in return (and yes, I did modify the code to use the correct classes); the others just give me 403 Forbidden.

I would really appreciate urgent help, since the assignment is due the 21st of June (this Friday).