Extract Fields used in SQL query

I'm having trouble extracting the fields used in SQL queries. The script below works well for most queries but fails when a query contains multiple CASE expressions or more than one SELECT ... FROM statement. Here is the script:
import re


def remove_comments(sql):
    # Strip SQL comments before looking for fields
    sql = re.sub(r'--.*', '', sql)  # Remove single-line comments
    sql = re.sub(r'/\*.*?\*/', '', sql, flags=re.DOTALL)  # Remove multi-line comments
    return sql


def extract_fields(sql):
    # Match a name (optionally table-qualified, optionally followed by "AS alias")
    # that is immediately followed by a comma or the end of the string
    pattern = r'\s*([\w\.]+(?:\s+AS\s+[\w]+)?)\s*(?:,|$)'
    # Find all matches
    fields = re.findall(pattern, sql)
    # Keep only the field name, dropping any "AS alias" part
    cleaned_fields = [field.split()[0] for field in fields]
    # Drop purely numeric tokens and names that start with 'cte'
    # (note: this does not catch names such as 'BaseData_CTE')
    valid_fields = [field for field in cleaned_fields if not field.isdigit() and not field.lower().startswith('cte')]
    # Remove duplicates by converting to a set and then back to a list
    valid_fields = list(set(valid_fields))
    return valid_fields
Answer
SQL is not a regular language and thus cannot be adequately parsed with regular expressions.
Either use a proper SQL parser or, unless your task is specifically to parse arbitrary user-provided SQL, stop consuming SQL altogether: consume the formalized inputs that produce the SQL you're dealing with now, since those inputs already include the fields.
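To illustrate the second option, here is a minimal sketch of what "formalized inputs" could look like. Everything in it is an assumption made for illustration: the `QuerySpec` class, its attributes, and the `orders` table do not come from the question. The idea is only that if the SQL is generated from a structured description, the list of fields can be read from that description and no SQL text ever needs to be parsed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class QuerySpec:
    # Hypothetical structured input; all names here are illustrative assumptions.
    table: str
    select: Tuple[str, ...]                   # columns to return
    where: Tuple[Tuple[str, str], ...] = ()   # (column, operator) pairs; values are bound later

    def used_fields(self) -> List[str]:
        """Return every column the query touches, without parsing any SQL."""
        return sorted(set(self.select) | {column for column, _ in self.where})

    def to_sql(self) -> str:
        """Render the SQL text from the spec, using ? placeholders for values."""
        sql = f"SELECT {', '.join(self.select)} FROM {self.table}"
        if self.where:
            conditions = " AND ".join(f"{column} {op} ?" for column, op in self.where)
            sql += f" WHERE {conditions}"
        return sql


spec = QuerySpec(
    table="orders",
    select=("orders.id", "orders.total"),
    where=(("orders.status", "="),),
)
print(spec.to_sql())       # SELECT orders.id, orders.total FROM orders WHERE orders.status = ?
print(spec.used_fields())  # ['orders.id', 'orders.status', 'orders.total']
```

This sketch does no validation of identifiers or operators, so it is not safe for untrusted input as written. It also only covers flat queries; CASE expressions or subqueries would need their own structured representation, which is still easier to walk than SQL text.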