What is a better way to parse this data than regex plus conditional statements?

Suppose you have a data set that looks like this:
name1: float
name2: float [unit]
name3: {str: float [unit], (...repeat)}
name4: datetime
name5: int
name6: alphanumeric
name7: strings (sentence/phrase)
name8: {{...can contain any of above but we ignore..}}
What is the appropriate way to parse this in Python? With regex I can solve this partially, but it feels like I need additional hard-coded if statements to get to a complete solution. I am looking for pointers or direction here: something similar to pd.read_csv(...), perhaps with some pre-cleaning before the data is passed to a parser?
import re

PATTERN_GENERIC = r'(?P<name>[^,:]*)(,|:)\s?(?P<content>.*)'
PATTERN_VALUE = r'(?P<name>[^:]*)[^0-9-]*(?P<content>[-+]?\d*\.\d+|\d+)'
header = {}
with open(f) as fopen:  # f is the path to the input file
    for line in fopen:
        # parse number only
        results = re.search(PATTERN_VALUE, line.strip('\n'))
        if results:
            h = results.groupdict()
            header[h['name'].replace(' ', '')] = float(h['content'])
        else:  # treat as ID
            results = re.search(PATTERN_GENERIC, line.strip('\n'))
            if results:
                h = results.groupdict()
                header[h['name'].replace(' ', '')] = h['content']
What I wish to achieve
col1 | col2
name | numbers
name | non-numbers
Currently I haven't written anything to handle the {...} lines; I can do that with a specific if statement. The code also falls short because any value containing a number is treated as a number, which is wrong for the alphanumeric case (those are IDs, not values). To solve that, I probably have to write a pattern that differentiates alphanumeric strings from numbers.
Answer
Instead of relying only on regex and conditionals, a cleaner solution is to leverage Python's standard library and pandas for parsing.
Use ast.literal_eval() or json.loads() for structured data (like dictionaries or lists). These can safely parse strings containing numbers or nested structures, provided the text is valid Python literal or JSON syntax:
import ast
data = 'name3: {"key1": 1.23, "key2": 4.56}'
name, content = data.split(': ', 1)
parsed_content = ast.literal_eval(content)
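As a minimal sketch of how the two functions might be combined for one line of the file: the helper name parse_structured is hypothetical, and falling back to the raw string when neither parser accepts the content is an assumption, not something the original data format guarantees.

```python
import ast
import json


def parse_structured(line):
    """Split 'name: content' and try to parse content as a literal."""
    name, content = line.split(":", 1)
    content = content.strip()
    try:
        # json.loads accepts JSON syntax (double-quoted keys, numbers, lists)
        value = json.loads(content)
    except json.JSONDecodeError:
        try:
            # ast.literal_eval accepts Python literals (single quotes, tuples)
            value = ast.literal_eval(content)
        except (ValueError, SyntaxError):
            # Assumption: anything that is not a literal stays a plain string
            value = content
    return name.strip(), value
```

Content such as `{key: 1.2 [unit]}` is neither valid JSON nor a Python literal, so it would come back unchanged as a string and still need its own cleaning step.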
Leverage pandas for tabular data. It can infer column types (strings, integers, floats) and handle structured data efficiently:
import pandas as pd
data = [("name1", 1.23), ("name2", "text")]
df = pd.DataFrame(data, columns=["Name", "Value"])
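To get closer to the two-column table the question asks for, a parsed dictionary could be turned into a DataFrame as sketched below. The header contents and the is_number column are illustrative assumptions; a column that mixes numbers and text is stored with dtype object, so pandas does not separate the two kinds on its own.

```python
import pandas as pd

# Hypothetical parsed result: names mapped to numbers or strings
header = {"name1": 1.23, "name5": 7, "name6": "A12B"}

# One row per entry; the mixed "value" column gets dtype object
df = pd.DataFrame(list(header.items()), columns=["name", "value"])

# Flag which rows hold numbers, to separate values from IDs and text
df["is_number"] = df["value"].map(lambda v: isinstance(v, (int, float)))
```

Filtering on `df["is_number"]` then gives the "name | numbers" and "name | non-numbers" views separately.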
Handle alphanumeric IDs by checking whether a value consists only of digits; anything else is kept as a string:
def process_value(value):
    if value.isdigit():
        return int(value)
    return value  # Keep as string
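Note that str.isdigit() is only true for unsigned integers, so floats and negative numbers would stay strings. One possible extension, sketched here with a hypothetical helper classify_value, is to attempt the conversion on the whole string and keep the text when it fails:

```python
def classify_value(text):
    """Return an int or float when the whole string is numeric, else the string."""
    text = text.strip()
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            continue
    return text  # e.g. an alphanumeric ID such as "A12B"
```

Because the conversion must succeed on the entire string, an ID like "12AB" is not mistaken for the number 12. One caveat: float() also accepts forms such as "1e5" and "nan", so IDs of that shape would need a stricter check.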
For nested structures (e.g., name8: {{...}}), recursively parse the data:
def parse_nested(data):
    if data.startswith('{{') and data.endswith('}}'):
        return parse_nested(data[1:-1])  # Recursive call
    return data
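The helper above only removes outer braces. For lines shaped like name3, the remaining text still has to be split into keys, numbers, and units. The sketch below uses a hypothetical function parse_braced and assumes that entries are separated by commas, that keys and units contain no commas or colons, and that the unit (if present) follows the number after a space. None of this is confirmed by the original data description.

```python
def parse_braced(content):
    """Parse '{key: 1.2 unit, ...}' into {key: (1.2, 'unit')}."""
    inner = content.strip()
    # Remove every layer of outer braces, e.g. '{{...}}' -> '...'
    while inner.startswith("{") and inner.endswith("}"):
        inner = inner[1:-1].strip()
    result = {}
    for item in inner.split(","):
        if ":" not in item:
            continue  # skip empty or malformed entries
        key, value = item.split(":", 1)
        # Assumption: the number comes first, then an optional unit
        number, _, unit = value.strip().partition(" ")
        result[key.strip()] = (float(number), unit.strip() or None)
    return result
```

For example, `parse_braced("{width: 1.5 mm, height: 2 cm}")` yields a dictionary mapping each key to a (number, unit) pair. A value that does not start with a number raises ValueError, which the caller would need to handle.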
By using these methods, you can reduce the amount of regex and hard-coded conditionals, creating a cleaner and more maintainable parser.