What is a better method to parse this data than regex + conditional statements?

Suppose you have a data set that looks like

name1: float

name2: float [unit]

name3: {str: float [unit], (...repeat)}

name4: datetime

name5: int

name6: alphanumeric

name7: strings (sentence/phrase)

name8: {{...can contain any of above but we ignore..}}

What is the appropriate parsing method in Python? With regex I can solve this partially, but it feels like I need additional hard-coded if statements to get to a complete solution. I am looking for pointers or direction here. Is there something similar to pd.read_csv(...), perhaps with some pre-cleaning before throwing the data into a parser?

import re

PATTERN_GENERIC = r'(?P<name>[^,:]*)(,|:)\s?(?P<content>.*)'
PATTERN_VALUE = r'(?P<name>[^:]*)[^0-9-]*(?P<content>[-+]?\d*\.\d+|\d+)'
header = {}
with open(f) as fopen:  # f is the path to the input file
    for line in fopen:
        # parse number only
        results = re.search(PATTERN_VALUE,line.strip('\n'))
        if results:
            h = results.groupdict()
            header[h['name'].replace(' ','')] = float(h['content'])
        else: #treat as ID
            results = re.search(PATTERN_GENERIC,line.strip('\n'))
            if results:
                h = results.groupdict()
                header[h['name'].replace(' ','')] = h['content']

What I wish to achieve

col1 | col2

name | numbers

name | non-numbers

Currently I haven't written anything to handle the {...} lines; I can do that with a specific if statement. The code also falls short in that anything containing a number is treated as a number, which is wrong for the alphanumeric case (those are IDs, not values). To solve that I probably have to write a pattern to differentiate alphanumeric strings from numbers.

Answer

Instead of using regex and conditionals, a cleaner solution is to leverage Python's built-in libraries for parsing.

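Use ast.literal_eval() or json.loads() for structured data (like dictionaries or lists). Both turn a string into Python objects without executing arbitrary code, but only when the string is already a valid Python literal or valid JSON (quoted keys, no trailing units), so lines like name3 need some pre-cleaning first.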
```
import ast
data = 'name3: {"key1": 1.23, "key2": 4.56}'
name, content = data.split(': ', 1)
parsed_content = ast.literal_eval(content)
```
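The question's `name3: {str: float [unit], ...}` lines have unquoted keys and trailing units, so `ast.literal_eval()` would reject them as written. Below is a minimal sketch of a small hand-written parser for that shape. It assumes items are comma-separated, each looks like `key: number` with an optional bare unit after the number, and that there is no nesting inside the braces. The names `ITEM` and `parse_braced` are made up for illustration.

```python
import re

# Assumed item shape: "<key>: <number>" optionally followed by a unit,
# e.g. "width: 2.5 mm". This is a guess at the real format.
ITEM = re.compile(
    r"\s*(?P<key>[^:,{}]+?)\s*:\s*"
    r"(?P<value>[-+]?(?:\d+\.?\d*|\.\d+))\s*"
    r"(?P<unit>[^,{}]*?)\s*$"
)

def parse_braced(content):
    """Parse '{key: 1.5 unit, ...}' into {key: (value, unit_or_None)}."""
    body = content.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError(f"expected a braced block, got {content!r}")
    result = {}
    for item in body[1:-1].split(","):
        if not item.strip():
            continue  # tolerate a trailing comma
        match = ITEM.match(item)
        if match is None:
            raise ValueError(f"unrecognised item: {item!r}")
        result[match["key"]] = (float(match["value"]), match["unit"] or None)
    return result

parsed = parse_braced("{width: 2.5 mm, height: 10 cm, ratio: 0.25}")
```

If the units in the real file are wrapped in square brackets or contain commas, the pattern would need adjusting.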
Leverage pandas for tabular data. It infers one dtype per column (integers, floats, strings), so a column that mixes numbers and text, like Value below, is stored with the generic object dtype.
```
import pandas as pd
data = [("name1", 1.23), ("name2", "text")]
df = pd.DataFrame(data, columns=["Name", "Value"])
```

Handle alphanumeric IDs by checking if values are digits or strings:

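def process_value(value):
    # str.isdigit() is True only when every character is a digit, so "123"
    # becomes an int while "A123", "12.5" and "-3" are left as strings.
    if value.isdigit():
        return int(value)
    return value  # Keep as string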
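To cover the other scalar types in the question (float, float with unit, datetime), one common pattern is to try a sequence of converters and fall back to the raw string. The sketch below assumes a unit is separated from its number by a space and that datetimes are in ISO 8601 form; neither is confirmed by the question, and `convert_scalar` is a hypothetical name.

```python
from datetime import datetime

def convert_scalar(text):
    """Try progressively looser conversions; fall back to the raw string."""
    text = text.strip()
    # 1. Plain integer, then plain float.
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    # 2. "<number> <unit>" (assumed to be separated by a space).
    head, _, unit = text.partition(" ")
    if unit:
        try:
            return (float(head), unit.strip())
        except ValueError:
            pass
    # 3. ISO 8601 datetime (assumed format).
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        pass
    # 4. Anything else (alphanumeric IDs, phrases) stays a string.
    return text
```

Shape alone cannot separate every ID from a value: `"007"` becomes the integer 7, `"1e5"` parses as a float, and a phrase such as `"12 monkeys"` would be read as a number with a unit. If some fields are known to be IDs, it is probably safer to pick the converter by field name than by content.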

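For nested structures (e.g., name8: {{...}}), recursively strip the redundant outer braces before handling the content:

def parse_nested(data):
    # Each call removes one outer brace pair, so '{{...}}' (or deeper
    # wrapping) is reduced to a single '{...}' block.
    if data.startswith('{{') and data.endswith('}}'):
        return parse_nested(data[1:-1])  # Recursive call
    return data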
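If the content of a nested block is ever needed rather than ignored, a plain `split(',')` breaks on commas inside inner braces. A rough sketch of a depth-aware split follows, assuming braces are balanced and never appear inside the values themselves; `split_top_level` is a hypothetical helper.

```python
def split_top_level(text, sep=","):
    """Split text on sep, ignoring separators nested inside {...}."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced '}'")
        if ch == sep and depth == 0:
            # Top-level separator: close the current item.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    if depth != 0:
        raise ValueError("unbalanced '{'")
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts

items = split_top_level("a: 1, b: {x: 2, y: 3}, c: 4")
```

Each returned item can then be split on its first colon and handled by whichever value parser fits.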

By combining these methods, you can replace most of the regex and hard-coded conditionals with small, single-purpose functions, which makes the parser cleaner and easier to maintain.
