October 22, 2024
python

Regex pattern sanitization for wildcard replacement


I need a function to sanitize regex patterns in Python, specifically targeting strings that may contain wildcard characters (%). The goal is to replace these % wildcards with the regex equivalent .*? to allow for flexible matching in regex patterns. Additionally, the function should ensure that special regex characters are escaped appropriately without adding unnecessary escape characters.

Requirements:

  1. Convert % to .*? in the provided regex patterns.
  2. Escape special regex characters only when necessary, avoiding unnecessary escape sequences.
  3. Maintain the integrity of existing regex patterns.

Examples:

  1. Input: %core account.*?Annual Percentage Yield%

    Expected output: .*?core account.*?Annual Percentage Yield.*?

    String: Your core account offers an Annual Percentage Yield of 3.25%. This account is ideal for saving.

  2. Input: interest rate on your account ((\d+\.\d+)|[^.])*?([\d+.]+%)

    Expected output: interest rate on your account ((\d+\.\d+)|[^.])*?([\d+.]+.*?)

    String: The interest rate on your account is 1.75%, while promotional rates are as high as 2.00%.
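As a quick check of the two examples above, here is a minimal sketch that runs each expected output against its sample string with re.search. The names cases and matches are only illustrative and are not part of the original code.

```python
import re

# The two expected outputs from the examples above, paired with their sample strings.
cases = [
    (r".*?core account.*?Annual Percentage Yield.*?",
     "Your core account offers an Annual Percentage Yield of 3.25%. "
     "This account is ideal for saving."),
    (r"interest rate on your account ((\d+\.\d+)|[^.])*?([\d+.]+.*?)",
     "The interest rate on your account is 1.75%, while promotional "
     "rates are as high as 2.00%."),
]

# re.search returns a match object, or None when the pattern does not match.
matches = [re.search(pattern, text) for pattern, text in cases]
for m in matches:
    print(m.group(0) if m else "no match")
```

Both patterns match. Note that a trailing lazy .*? can match the empty string, so in the second example the last group captures 1.75 without the percent sign.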

A few more inputs:

1: Overdraft Protection Service Fee[^.]*?(\\$\\d+)

2: APY was accurate as of ([^\\s]+)

3: Cuenta de Ahorro opcional 6Obtenga hasta un ([^\\s]*) de porcentaje de rendimiento anual

4: \\bVisa\\b

5: withdrawals(.*?) per day

6: ([^a-zA-Z\\s]+) total is calculated based on all withdrawals

7: activate .*? calling ([^\\s]+)

I tried the following approach:

def _sanitize_regex(rhs: str) -> str:
    # Replace '%' wildcards with '.*?' for regex matching
    rhs = rhs.replace('%', '.*?')

    # Define special characters that need to be escaped in regex
    special_chars = ['$', '^', '[', ']', '{', '}', '|', '+', '<', '>', '\\']
    
    sanitized_rhs = []
    inside_group = False  # Track whether we're inside a group (e.g., parentheses or brackets)

    for char in rhs:
        # Handle opening and closing of groups
        if char in ['(', '[', '{']:
            inside_group = True
            sanitized_rhs.append(char)
        elif char in [')', ']', '}']:
            inside_group = False
            sanitized_rhs.append(char)
        # Escape special characters only if they are not inside a group
        elif char in special_chars:
            if not inside_group:  # Only escape if we are not inside a group
                sanitized_rhs.append(f"\\{char}")
            else:
                sanitized_rhs.append(char)  # Do not escape inside groups
        else:
            sanitized_rhs.append(char)

    # Join sanitized parts back together
    return ''.join(sanitized_rhs)

But it did not work for the following input:

  1. interest rate on your account ((\d+\.\d+)|[^.])*?([\d+.]+%)

Actual output: interest rate on your account ((\d+\.\d+)\|[^.])*?([\d+.]\+.*?)

The code above adds an extra \ before | and +, which changes the meaning of the pattern. Can you help me sanitize the strings to get valid regex, considering the above requirements?
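One possible direction, offered as a sketch and not a definitive answer: since the inputs already appear to be valid regex apart from the % wildcard, it may be enough to replace % and escape nothing else. The version below assumes that an escaped \% and a % inside a [...] character class should stay literal; both rules are assumptions, not stated requirements. The name sanitize_regex is hypothetical, and the character-class tracking is deliberately simple (it does not handle a literal ] placed first in a class).

```python
import re


def sanitize_regex(pattern: str) -> str:
    """Replace unescaped '%' wildcards with '.*?' and leave everything else as is."""
    out = []
    i = 0
    in_class = False  # True while inside a [...] character class
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Copy an escape sequence (e.g. \d, \., \%) through untouched.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if ch == "[" and not in_class:
            in_class = True
        elif ch == "]" and in_class:
            in_class = False
        if ch == "%" and not in_class:
            out.append(".*?")  # assumption: '%' outside a class is a wildcard
        else:
            out.append(ch)
        i += 1
    result = "".join(out)
    re.compile(result)  # raises re.error if the result is not a valid regex
    return result


print(sanitize_regex("%core account.*?Annual Percentage Yield%"))
print(sanitize_regex(r"interest rate on your account ((\d+\.\d+)|[^.])*?([\d+.]+%)"))
```

Because nothing is escaped, patterns without a % (such as withdrawals(.*?) per day) come back unchanged, and the final re.compile call surfaces any input that was not a valid regex to begin with.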


