What is the better method to parse these data instead of regex+conditional statements?

Suppose you have a data set that looks like

name1: float

name2: float [unit]

name3: {str: float [unit], (...repeat)}

name4: datetime

name5: int

name6: alphanumeric

name7: strings (sentence/phrase)

name8: {{...can contain any of above but we ignore..}}

What is the appropriate parsing method in Python? With regex I can solve this partially, but it feels like I need additional hard-coded if statements to reach a complete solution. I am looking for pointers or direction here: something similar to pd.read_csv(...), perhaps with some pre-cleaning before throwing it into a parser?

import re

PATTERN_GENERIC = r'(?P<name>[^,:]*)(,|:)\s?(?P<content>.*)'
PATTERN_VALUE = r'(?P<name>[^:]*)[^0-9-]*(?P<content>[-+]?\d*\.\d+|\d+)'
header = {}
with open(f) as fopen:  # f is the path to the data file
    for line in fopen:
        # parse number only
        results = re.search(PATTERN_VALUE,line.strip('\n'))
        if results:
            h = results.groupdict()
            header[h['name'].replace(' ','')] = float(h['content'])
        else: #treat as ID
            results = re.search(PATTERN_GENERIC,line.strip('\n'))
            if results:
                h = results.groupdict()
                header[h['name'].replace(' ','')] = h['content']

What I wish to achieve

col1 | col2

name | numbers

name | non-numbers

I haven't written anything to handle the {...} lines yet; I can do that with a specific if statement. The current code also falls short because any line containing a number is treated as a numeric value, which is wrong for the alphanumeric case (those are IDs, not values). To solve that I probably have to write a pattern that differentiates alphanumeric strings from numbers.

Answer

Instead of relying on regex and conditionals alone, a cleaner solution is to lean on Python's standard library and pandas for most of the parsing.

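  1. Use ast.literal_eval() or json.loads() for structured data (like dictionaries or lists). Both safely parse numbers and nested structures, but only when the content is already a valid Python literal or valid JSON: keys must be quoted, and a value with a trailing unit such as 1.23 mm will raise an error.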

    import ast
    data = 'name3: {"key1": 1.23, "key2": 4.56}'
    name, content = data.split(': ', 1)
    parsed_content = ast.literal_eval(content)
    
  2. Use pandas to hold the parsed result as a table. It infers a numeric dtype for columns that are entirely numeric; a column that mixes numbers and strings, as below, is stored with the generic object dtype.

    import pandas as pd
    data = [("name1", 1.23), ("name2", "text")]
    df = pd.DataFrame(data, columns=["Name", "Value"])
    
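  3. Handle alphanumeric IDs by converting a value only when the whole string is a number; anything else stays a string:

    def process_value(value):
        # Try int first, then float; isdigit() alone would miss "-3" and "1.23"
        for cast in (int, float):
            try:
                return cast(value)
            except ValueError:
                pass
        return value  # Keep as string (e.g. an alphanumeric ID)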
    
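  4. For nested structures (e.g., name8: {{...}}), which the question says to ignore, reduce the doubled outer braces to a single pair and leave the contents unparsed:

    def parse_nested(data):
        # Peel one brace from each end while the string is still wrapped in {{ }}
        if data.startswith('{{') and data.endswith('}}'):
            return parse_nested(data[1:-1])  # Recursive call
        return data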
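
The four steps do not cover the `float [unit]` and datetime values from the question. One possible way to handle them is an ordered chain of converters, where the first one that succeeds wins. The sketch below rests on assumptions that are not in the question: the unit is a single token separated from the number by whitespace (optionally wrapped in square brackets), datetimes are in ISO 8601 form, and the names `parse_scalar`, `_to_quantity`, and `CONVERTERS` are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: "<number><whitespace><unit>", e.g. "3.2 mm" or "3.2 [mm]".
# The mandatory whitespace keeps IDs such as "12AB" from matching.
NUMBER_WITH_UNIT = re.compile(
    r"^(?P<number>[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)\s+(?P<unit>\S+)$"
)


def _to_quantity(text):
    """Convert 'number unit' into a (float, unit) tuple or raise ValueError."""
    match = NUMBER_WITH_UNIT.match(text)
    if match is None:
        raise ValueError(text)
    return float(match["number"]), match["unit"].strip("[]")


# Order matters: each converter raises ValueError when it does not apply.
CONVERTERS = (int, float, datetime.fromisoformat, _to_quantity)


def parse_scalar(text):
    """Return the first successful conversion, or the stripped string."""
    text = text.strip()
    for convert in CONVERTERS:
        try:
            return convert(text)
        except ValueError:
            continue
    return text  # alphanumeric IDs and free text end up here
```

Because each converter must accept the whole string, an ID such as `AB12CD` or `12AB` stays a string. An ID made only of digits is still read as a number, and `int` and `float` also accept spellings such as `1_000` and `nan`, so fields like that would need to be forced to strings by name.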
    

These methods do not remove conditionals entirely, but they replace most of the regex matching with small functions that can be tested on their own, which makes the parser cleaner and easier to maintain.
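
To get from lines of text to the two-column name/value layout the question asks for, each line can be split once on the first colon and the value converted only if the whole string is numeric. This is a minimal sketch, not a complete parser: the sample data, `to_number`, and `read_header` are made up for illustration, and `{...}` values are simply kept as strings.

```python
import io


def to_number(text):
    """Return an int or float when the whole string is numeric, else None."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return None


def read_header(lines):
    """Collect (name, value) rows from 'name: content' lines."""
    rows = []
    for line in lines:
        # partition splits on the first colon only, so times like 10:00:00 survive
        name, sep, content = line.strip().partition(":")
        if not sep:
            continue  # not a 'name: content' line
        content = content.strip()
        if content.startswith("{{"):
            continue  # doubled-brace blocks are ignored, as the question states
        number = to_number(content)
        rows.append((name.strip(), content if number is None else number))
    return rows


# Made-up sample standing in for the real file
sample = io.StringIO(
    "name1: 1.5\n"
    "name4: 2021-03-04 10:00:00\n"
    "name5: 7\n"
    "name6: AB12CD\n"
    "name7: a short phrase\n"
    "name8: {{ignored}}\n"
)
rows = read_header(sample)
```

Passing the result to `pd.DataFrame(rows, columns=["name", "value"])` gives the two-column table; since the value column mixes numbers and strings, pandas stores it with the object dtype.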
