How to reduce CPU and Memory usage using Selenium?

Posted by Justin McDonald (Guest)
My Python script uses Selenium to scrape website links from a single URL. The script runs most URLs without issue, but it appears to get stuck or run into memory problems that I'm having a difficult time resolving. I'm not sure how to fix the script so that it continues past problem URLs and reduces memory/CPU usage.

The expected outcome is that the script locates each 'href' attribute, adds the associated link to a list, and returns that list.
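
Note: each item.get_attribute('href') call is a separate round trip to the browser process, so link-heavy pages generate a lot of back-and-forth that shows up as CPU time. As a rough sketch (assuming Selenium 4; this snippet is illustrative and not part of the original script), all of the hrefs can be pulled in a single JavaScript call instead:

Code:
# Sketch: one execute_script() call replaces a round trip per <a> element.
hrefs = driver.execute_script(
    "return Array.from(document.querySelectorAll('a[href]'), a => a.href);")

url_list = []
for href in hrefs:
    # keep only https links, skipping duplicates while preserving order
    if href.startswith('https://') and href not in url_list:
        url_list.append(href)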

It appears that sometimes the code is not making it to this section:

Code:
        print('Grab URL Count: ' + str(grab_count))
        print('Grab Non-duplicate URL count: ' + str(len(url_list)))

Code and documentation are below. The server is showing high CPU, memory, and disk I/O.

The script is currently hung on this URL:

url = https://larepublica.pe/deportes/202...erica-television-canal-4-america-tv-go-378690
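
By default driver.get() blocks until the page fires its load event, so a page that keeps loading ads or long-polling scripts can hang the call indefinitely. A hedged sketch of settings that bound the wait (the 30-second value is an assumption, not from the original post):

Code:
from selenium.common.exceptions import TimeoutException

options.page_load_strategy = 'eager'  # return once the DOM is ready rather than waiting for every resource
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.set_page_load_timeout(30)      # raise TimeoutException instead of hanging forever
try:
    driver.get(url)
except TimeoutException:
    pass  # whatever DOM loaded before the timeout is usually still scrapable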


Code:
from selenium.webdriver.common.by import By
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import sys

# Scrape single URL
def web_scraper(url):
    url_bool = True
    count_issue = 0
    while url_bool:
        try:
            url_list = []

            userAgent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.6478.114 Safari/537.36'
            options = Options()
            options.add_argument('--headless')
            options.add_argument('--no-sandbox')
            options.add_argument('--disable-dev-shm-usage')
            options.add_argument(f"--user-agent={userAgent}")
            options.add_argument('--disable-gpu') # Is this needed anymore? 
            driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

            driver.get(url)
            all_links = driver.find_elements(By.TAG_NAME, 'a')
            
            grab_count = 0 
            # Grab LINKs from scrape
            for item in all_links:
                try:
                    href = item.get_attribute('href')  # fetch once instead of three times per element
                    if href and href.startswith('https://'):
                        grab_count = grab_count + 1
                        if href not in url_list:
                            url_list.append(href)
                            #print(href)
                except Exception:
                    continue  # element went stale between find and read; skip it

            driver.quit()
            print('Grab URL Count: ' + str(grab_count))
            print('Grab Non-duplicate URL count: ' + str(len(url_list)))
            url_bool = False
            
            # Test for error
        except Exception as e:
            # driver is undefined here if webdriver.Chrome() itself raised
            try:
                driver.quit()
            except NameError:
                pass
            with open("web.log", "a") as f:
                f.write(str(e))

            count_issue = count_issue + 1
            if count_issue == 3:
                url_bool = False
            print("\nDriver (get) error: " + str(e))
            time.sleep(5)
    return url_list
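
Putting the pieces together, here is a hedged consolidation (the function name, retry count, and timeout are illustrative assumptions): it resolves the chromedriver binary once per process instead of once per retry, which is a likely source of the disk I/O, bounds each page load, and always releases the browser in a finally block.

Code:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

DRIVER_PATH = ChromeDriverManager().install()  # hit the network/disk once, not on every retry

def scrape_links(url, attempts=3, timeout=30):  # illustrative defaults
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    options.page_load_strategy = 'eager'
    for _ in range(attempts):
        driver = webdriver.Chrome(service=Service(DRIVER_PATH), options=options)
        driver.set_page_load_timeout(timeout)
        try:
            try:
                driver.get(url)
            except TimeoutException:
                pass  # keep whatever DOM loaded before the timeout
            hrefs = driver.execute_script(
                "return Array.from(document.querySelectorAll('a[href]'), a => a.href);")
            # de-duplicate while preserving order; keep https links only
            return list(dict.fromkeys(h for h in hrefs if h.startswith('https://')))
        except Exception as e:
            with open('web.log', 'a') as f:
                f.write(str(e))
        finally:
            driver.quit()  # always release the Chrome process, success or failure
    return []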

[htop screenshot: https://i.sstatic.net/9QfPJU3K.jpg]

[Server Info screenshot: https://i.sstatic.net/53jZgQcH.jpg]

Code:
chromeDriver -v
Google Chrome 126.0.6478.114