Scrape data from website with complex structure

  • Thread starter: Stuart Macfarlane (Guest)
I am trying to scrape data from the Transfermarkt website in Python, but the site structure is complex. I've tried the requests and Beautiful Soup modules with the code below, yet the end result is two empty dataframes for the 'in' and 'out' transfers. I've attached a screenshot showing the structure of the website (https://i.sstatic.net/lQvcBN9F.png). Any help would be greatly appreciated.

Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the Transfermarkt page
url = 'https://www.transfermarkt.com/premier-league/transfers/wettbewerb/GB1/plus/?saison_id=2023&s_w=&leihe=0&intern=0'

# Send a GET request to the URL
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = requests.get(url, headers=headers)
response.raise_for_status()  # Raise an exception if the request was unsuccessful

# Parse the page content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Function to extract transfer data
def extract_transfer_data(table):
    transfers = []
    rows = table.find_all('tr', class_=['odd', 'even'])
    for row in rows:
        cols = row.find_all('td')
        if len(cols) >= 5:  # Ensure there are enough columns
            transfers.append({
                'Player': cols[0].text.strip(),
                'Age': cols[1].text.strip(),
                'Club': cols[2].text.strip(),
                'Fee': cols[4].text.strip()
            })
    return transfers

# Locate the main transfer table container
transfer_containers = soup.find_all('div', class_='grid-view')

# Debugging: print the number of transfer containers found
print(f"Found {len(transfer_containers)} transfer containers.")

# Extract 'In' and 'Out' transfers data
in_transfers = []
out_transfers = []

for container in transfer_containers:
    headers = container.find_all('h2')
    tables = container.find_all('table')
    for header, table in zip(headers, tables):
        if 'Arrivals' in header.text:
            in_transfers.extend(extract_transfer_data(table))
        elif 'Departures' in header.text:
            out_transfers.extend(extract_transfer_data(table))

# Convert to DataFrames
in_transfers_df = pd.DataFrame(in_transfers)
out_transfers_df = pd.DataFrame(out_transfers)
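
One way to narrow down why both dataframes come back empty is to print what BeautifulSoup actually matches at each step, since an empty result usually means a selector such as 'grid-view' or the 'Arrivals'/'Departures' headline text never matches the live markup. The sketch below is only a diagnostic aid, under the assumption that the page is served as static HTML; the selectors it inspects come from the code above, not from confirmed Transfermarkt markup.

Code:
from io import StringIO

import pandas as pd

# Run after the `soup` and `response` objects above exist.

# 1. Does the container selector match anything at all?
containers = soup.find_all('div', class_='grid-view')
print(f"div.grid-view containers: {len(containers)}")

# 2. If not, list the CSS classes the page's divs actually carry.
seen = {cls for div in soup.find_all('div', class_=True) for cls in div['class']}
print(sorted(seen)[:30])

# 3. Check the headline texts the Arrivals/Departures comparison relies on.
for h2 in soup.find_all('h2')[:10]:
    print(repr(h2.get_text(strip=True)))

# 4. Baseline: let pandas parse every <table> it can find
#    (requires lxml or html5lib to be installed).
all_tables = pd.read_html(StringIO(response.text))
print(f"pandas parsed {len(all_tables)} tables")

If the container count is zero but pandas still finds tables, the class names in the script are simply wrong for this page and should be read from the browser's inspector instead. If pandas finds no tables either, the HTML returned to requests differs from what the browser renders (for example a consent page), and a browser-automation tool such as Selenium might be needed.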