October 22, 2024
HTML

How to properly scrape elements using BeautifulSoup?


I do not have a web scraping or website/HTML background and am new to this field.

I am trying to scrape elements from this link, which contains containers/cards.

I have tried the code below with some success, but I am not sure how to do it properly so that I get just the informative content, without HTML/CSS elements in the results.

from bs4 import BeautifulSoup as bs
import requests

url = "https://ihgfdelhifair.in/mis/Exhibitors"

# Download the page and parse it with Python's built-in HTML parser
page = requests.get(url)
soup = bs(page.text, 'html.parser')

As practice, I want to extract the information from the content below:

cards = soup.find_all('div', class_="row Exhibitor-Listing-box")
cards

It displays content like this:

[<div class="row Exhibitor-Listing-box">
 <div class="col-md-3">
 <div class="card">
 <div class="container">
 <h4><b>  1 ARTIFACT DECOR (INDIA)</b></h4>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Email : </span> artifactdecor01@gmail.com</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Contact Person : </span>                                                   SHEENU</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>State : </span> UTTAR PRADESH</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>City : </span> AGRA</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Hall No. : </span> 12</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Stand No. : </span> G-15/43</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Mobile No. : </span> +91-5624010111, +91-7055166000</p>
 <p style="margin-bottom: 5px!important; font-size: 11px;"><span>Website : </span> www.artifactdecor.com</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Source Retail : </span> Y</p>
 <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Vriksh Certified : </span> N</p>
 </div>

Now, when I use the code below to extract the elements:

for element in cards:
    title = element.find_all('h4')
    email = element.find_all('p')
    print(title)
    print(email)

Output: it gives me the info that I need, but with HTML/CSS content in it, which I do not want:

[<h4><b>  1 ARTIFACT DECOR (INDIA)</b></h4>, <h4><b>  10G HOUSE OF CRAFT</b></h4>, <h4><b>  2 S COLLECTION</b></h4>, <h4><b>  ........]
[<p style="margin-bottom: 5px!important; font-size: 13px;"><span>Email : </span> artifactdecor01@gmail.com</p>, <p style="margin-bottom: 5px!important; font-size: 13px;"><span>Contact Person : </span>        ..................]

So how can I extract just the title, Email, Contact Person, State, and City values from this, without HTML/CSS in the results?
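
One possible approach, as a minimal sketch rather than a definitive answer: `find_all` returns tag objects, so the markup shows up when they are printed, while calling `.get_text()` on a tag returns only its text. The sketch below runs against a trimmed copy of the card markup shown above instead of the live page, and it assumes every exhibitor sits in its own `div` with class `card` and that each `p` starts with a `span` holding the label. The names `SAMPLE_HTML`, `WANTED`, and `parse_cards` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one card from the output above (assumed representative
# of the live page; inline styles dropped for brevity).
SAMPLE_HTML = """
<div class="row Exhibitor-Listing-box">
 <div class="col-md-3">
  <div class="card">
   <div class="container">
    <h4><b>  1 ARTIFACT DECOR (INDIA)</b></h4>
    <p><span>Email : </span> artifactdecor01@gmail.com</p>
    <p><span>Contact Person : </span>            SHEENU</p>
    <p><span>State : </span> UTTAR PRADESH</p>
    <p><span>City : </span> AGRA</p>
    <p><span>Hall No. : </span> 12</p>
   </div>
  </div>
 </div>
</div>
"""

# Hypothetical list of the labels we want to keep.
WANTED = ("Email", "Contact Person", "State", "City")


def parse_cards(html):
    """Return one dict of plain-text fields per exhibitor card."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for card in soup.find_all("div", class_="card"):
        heading = card.find("h4")
        # get_text(strip=True) drops the tags and the surrounding whitespace.
        record = {"Title": heading.get_text(strip=True) if heading else ""}
        for p in card.find_all("p"):
            label_tag = p.find("span")
            if label_tag is None:
                continue
            # "Email : " -> "Email"
            label = label_tag.get_text(strip=True).rstrip(":").strip()
            # Paragraph text with the label part removed, then trimmed.
            value = p.get_text().replace(label_tag.get_text(), "", 1).strip()
            if label in WANTED:
                record[label] = value
        records.append(record)
    return records


if __name__ == "__main__":
    for record in parse_cards(SAMPLE_HTML):
        print(record)
```

To try this on the real page, `page.text` from the `requests.get(url)` call above could be passed to `parse_cards` in place of `SAMPLE_HTML`, provided the live markup matches these assumptions.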


