Extract Fields used in SQL query

I'm having trouble extracting the fields used in SQL queries. The script below works well for most queries but fails when a query contains multiple CASE expressions or SELECT ... FROM statements. Here is the script:

import re

def remove_comments(sql):
    # Regular expression to remove comments
    sql = re.sub(r'--.*', '', sql)  # Remove single-line comments
    sql = re.sub(r'/\*.*?\*/', '', sql, flags=re.DOTALL)  # Remove multi-line comments
    return sql

def extract_fields(sql):
    # Regular expression to extract field names from the SELECT clause
    # This will match valid columns (with table alias support) and skip expressions or numeric values
    pattern = r'\s*([\w\.]+(?:\s+AS\s+[\w]+)?)\s*(?:,|$)'  # This will capture fields and aliases
    
    # Find all matches
    fields = re.findall(pattern, sql)
    
    # Clean up field names by removing aliases or extra spaces
    cleaned_fields = [field.split()[0] for field in fields]
    
    # Drop purely numeric tokens and names starting with 'cte' (note: this does not catch names like 'BaseData_CTE')
    valid_fields = [field for field in cleaned_fields if not field.isdigit() and not field.lower().startswith('cte')]
    
    # Remove duplicates by converting to a set and then back to a list
    valid_fields = list(set(valid_fields))
    
    return valid_fields

Answer

SQL is not a regular language and thus cannot be adequately parsed with regular expressions.
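
To illustrate the core difficulty: SQL allows arbitrarily deep nesting (function calls, subqueries, CASE inside CASE), and telling a top-level comma from a nested one requires tracking depth, which a single flat pattern like the one in the question does not do. The sketch below splits a select list on top-level commas with an explicit counter. The function name `split_top_level` is made up for illustration, and the code deliberately ignores string literals, quoted identifiers, and comments, so it is a demonstration of the problem rather than a parser.

```python
def split_top_level(select_list):
    """Split a select list on commas that are not inside parentheses."""
    parts = []
    current = []
    depth = 0  # current parenthesis nesting level
    for ch in select_list:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            # A comma at depth 0 separates two select items.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


items = split_top_level("a, COALESCE(b, c), (SELECT MAX(d) FROM t) AS m")
print(items)
```

Even this small step needs state that the regex does not have, and a real query adds quoting, comments, and keywords on top of it.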

Either use a proper SQL parser, or (unless your task is specifically to parse arbitrary user-provided SQL) consume not SQL but formalized inputs that produce the SQL you're trying to deal with now, including the fields.
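
As one possible sketch of the "proper parser" route using only the Python standard library, the example below lets SQLite's own parser resolve the column references and records them through `sqlite3.Connection.set_authorizer`. It rests on strong assumptions: the queries must be valid in the SQLite dialect (the queries in the question may well target another database, in which case a parser for that dialect is needed instead), and the table definitions must be known up front. The helper `columns_read` and the `orders` schema are hypothetical and exist only for this illustration.

```python
import sqlite3


def columns_read(schema_sql, query):
    """Return sorted (table, column) pairs that SQLite reads for `query`."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_sql)  # empty tables are enough for name resolution
    seen = set()

    def authorizer(action, arg1, arg2, db_name, source):
        # For SQLITE_READ, arg1 is the table name and arg2 the column name.
        if action == sqlite3.SQLITE_READ and arg2:
            seen.add((arg1, arg2))
        return sqlite3.SQLITE_OK  # allow everything; we only observe

    conn.set_authorizer(authorizer)
    try:
        conn.execute(query)  # compiling the statement triggers the callbacks
    finally:
        conn.close()
    return sorted(seen)


# Hypothetical schema and query, chosen to include a CASE and a subquery.
SCHEMA = "CREATE TABLE orders (id INTEGER, status TEXT, amount REAL, customer_id INTEGER);"
QUERY = """
SELECT o.id,
       CASE WHEN o.status = 'paid' THEN o.amount ELSE 0 END AS paid_amount
FROM orders AS o
WHERE o.customer_id IN (SELECT customer_id FROM orders WHERE amount > 10)
"""

print(columns_read(SCHEMA, QUERY))
```

Because the database engine resolves aliases, CASE expressions, and subqueries itself, the reported names are the underlying table columns rather than whatever text happens to precede a comma.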
