
What regex can I use to remove content-url() expressions?

Thread starter: Ken Tola (Guest)
I am attempting to remove all extraneous tags, URLs, and scripts from HTML prior to running the text through an LLM. Right now I have the following Python function.

Code:
from urllib.parse import unquote, urlparse
from bs4 import BeautifulSoup

def remove_tags(html) -> str:

    # First we decode any percent-encoded text
    html = unquote(html)

    # Next we strip out all of the HTML tags
    soup = BeautifulSoup(html, "html.parser")

    for data in soup(['style', 'script']):
        # Remove style and script blocks entirely
        data.decompose()

    # Now we get rid of the URLs
    tag_free = ' '.join(soup.stripped_strings)
    words = tag_free.split()
    for i, word in enumerate(words):
        parsed_url = urlparse(word)
        if parsed_url.scheme and parsed_url.netloc:
            words[i] = "[URL Removed]"

    final_text = ' '.join(words)

    # Finally we remove any unwanted tabs and returns
    final_text = final_text.replace("\t", " ").replace("\n", " ").replace("\r", " ")

    return final_text
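
Called on a small sample, the function does what I want for ordinary URLs that appear in the text (the HTML snippet below is just an illustration I made up):

Code:
sample = '<p>Visit https://example.com/page for details.</p><script>var x = 1;</script>'
print(remove_tags(sample))
# Prints: Visit [URL Removed] for details.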

This works for everything BUT content-urls such as the following:

Code:
content - url(https - //link.sonos.com/f/a/ZinnmUI5FVMlzaiMExZvPw~~/AAQRxQA~/RgRoUgEXP0R5aHR0cHM6Ly9icmF6ZS1pbWFnZXMuY29tL2FwcGJveS9jb21tdW5pY2F0aW9uL2Fzc2V0cy9pbWFnZV9hc3NldHMvaW1hZ2VzLzY1OWQ5MjA0MDBhOTVmMDA1OTYwN2EwMS9vcmlnaW5hbC5wbmc_MTcwNDgyNTM0N1cDc3BjQgpmbRd8b2anjldDUhNrZW50b2xhanJAZ21haWwuY29tWAQAAAPP)

These scripted URLs are everywhere, they are bloating my content, and I need to remove them.

I have tried various regex patterns, such as ^['content - url']+[)]$, but none of them work.

Here is how I am using re:

Code:
start = "content - url"
test_string = ("sonos-logo  content -  url(https - "
           "//link.sonos.com/f/a/ZinnmUI5FVMlzaiMExZvPw~~/AAQRxQA"
           "~/RgRoUgEXP0R5aHR0cHM6Ly9icmF6ZS1pbWFnZXMuY29tL2FwcGJveS9jb21tdW5pY2F0aW9uL2Fzc2V0cy9pbWFnZV9hc3NldHMvaW1hZ2VzLzY1OWQ5MjA0MDBhOTVmMDA1OTYwN2EwMS9vcmlnaW5hbC5wbmc_MTcwNDgyNTM0N1cDc3BjQgpmbRd8b2anjldDUhNrZW50b2xhanJAZ21haWwuY29tWAQAAAPP) !important;  u + .body .arrow-icon  content -  url(https - //link.sonos.com/f/a/1vHBAmM0w7VCBDGBsH-ADg~~/AAQRxQA~/RgRoUgEXP0R5aHR0cHM6Ly9icmF6ZS1pbWFnZXMuY29tL2FwcGJveS9jb21tdW5pY2F0aW9uL2Fzc2V0cy9pbWFnZV9hc3NldHMvaW1hZ2VzLzY1OWViOTlhMGIyNWY4MDA0ZGU2MzVhYS9vcmlnaW5hbC5wbmc_MTcwNDkwMTAxOFcDc3BjQgpmbRd8b2anjldDUhNrZW50b2xhanJAZ21haWwuY29tWAQAAAPP) !important;  u + .body .facebook-icon")

clean_string = re.sub('^[' + start + ']+[)]$', '', test_string)
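
From what I can tell, the square brackets in my pattern are being treated as a character class rather than the literal text "content - url", so something along these lines may be closer to what I need. The flexible spacing around the hyphens is just an assumption based on my sample string above, and "[URL Removed]" is my own placeholder:

Code:
# Match the literal text "content - url(" (allowing variable whitespace around
# the hyphen), then everything up to the next closing parenthesis, plus an
# optional trailing "!important;".
content_url_re = re.compile(r'content\s*-\s*url\([^)]*\)(\s*!important;?)?')

clean_string = content_url_re.sub('[URL Removed]', test_string)
print(clean_string)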

Can somebody please provide some help?