I'm not able to get organized data in my scraper

I'm making a scraper to capture events from a website, with title, date, location and link, so I could put them into a dataframe.

The scraper worked, but some events that have two dates are coming up wrong. For example:

[['Concerto | Clamor Pela Paz',
'04/04/2025',
'Menu',
'https://www.theatromunicipal.org.br/evento/concerto-clamor-pela-paz/'],
['Concerto | Clamor Pela Paz',
'a 05/04/2025',
'Theatro Municipal - Sala de Espetáculos',
'https://www.theatromunicipal.org.br/evento/concerto-clamor-pela-paz/'],

Notice that it's the same event, but split across two lists, and the date is broken. The first date is in the first list, next to a "Menu" entry that shouldn't be there. The second date lands in another list, prefixed with an "a" that shouldn't be there either.

What could be causing this error?

In the website's HTML, the dates are inside the same tag and the same class, but in different lists.

I captured the dates this way:

datas = sopa.findAll('span', class_='elementor-icon-list-text elementor-post-info__item elementor-post-info__item--type-custom')

And I wrote the for loop this way:

lista_eventos = []

for titulo, data, local, link in zip(nome_evento, datas, local_evento, link_evento):
    titulo = titulo.text.strip()
    data = data.text.strip() if hasattr(data, 'text') else data
    local = local.text.strip()
    link = link.get('href')

    lista_eventos.append([titulo, data, local, link])

What am I doing wrong?

Answer
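
A likely cause, judging from the output in the question, is that zip pairs items purely by position. If one selector matches more elements than the others (two date spans for a single event, or a navigation item such as "Menu"), every later row shifts. The sketch below reproduces that effect with plain lists; the list contents are hypothetical stand-ins, not data taken from the site.

```python
# Hypothetical stand-ins for the scraped lists (the real ones hold bs4 tags).
titulos = ["Evento A", "Evento B"]
# Assumption: "Evento A" has two date spans, so this list has an extra entry.
datas = ["04/04/2025", "a 05/04/2025", "10/04/2025"]
# Assumption: the location selector also matched a navigation item ("Menu").
locais = ["Menu", "Sala A", "Sala B"]

# zip pairs items by position and stops at the shortest list,
# so the extra entries push values into the wrong rows.
linhas = [list(linha) for linha in zip(titulos, datas, locais)]

for linha in linhas:
    print(linha)
```

Here "Evento B" ends up with the second date of "Evento A" and the wrong location, which is the same pattern as in the question. Selecting elements that occur a fixed number of times per event, or searching inside each event's own container, avoids the shift.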

To get the dates of the show, this one line is enough (sopa is the BeautifulSoup object from your code):

datas = sopa.find_all("div", {"class": "jet-listing-dynamic-field__content"})

The datas list will then look like this (based on your example):

[<div class="jet-listing-dynamic-field__content">20h</div>, 
<div class="jet-listing-dynamic-field__content">sexta-feira 04/04/25</div>, 
<div class="jet-listing-dynamic-field__content">17h</div>, 
<div class="jet-listing-dynamic-field__content">sábado 05/04/25</div>]

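To get the schedule, pair each time with the day that follows it:

horario = []
# The list alternates time, day, time, day, so step through it two items at a time.
for i in range(0, len(datas), 2):
    if i + 1 < len(datas):  # skip a trailing time that has no day after it
        hora = datas[i].text.strip()
        dia = datas[i+1].text.strip()
        horario.append((dia, hora))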

for dia, hora in horario:
    print(f"Dia: {dia}, Hora: {hora}")

Output

Dia: sexta-feira 04/04/25, Hora: 20h
Dia: sábado 05/04/25, Hora: 17h
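
Since the goal is a dataframe, it may help to turn each pair into a real datetime before building the rows. This is a minimal sketch, assuming the day text always ends in a dd/mm/yy date and the time is a whole hour such as "20h"; parse_sessao is a hypothetical helper, and a time like "19h30" would need extra handling.

```python
from datetime import datetime

# The (dia, hora) pairs produced by the loop above.
horario = [("sexta-feira 04/04/25", "20h"), ("sábado 05/04/25", "17h")]

def parse_sessao(dia, hora):
    """Combine 'sexta-feira 04/04/25' and '20h' into one datetime."""
    data_texto = dia.split()[-1]          # keep only the date, e.g. "04/04/25"
    hora_int = int(hora.rstrip("h"))      # assumes a whole hour, e.g. "20h"
    return datetime.strptime(data_texto, "%d/%m/%y").replace(hour=hora_int)

sessoes = [parse_sessao(dia, hora) for dia, hora in horario]

for sessao in sessoes:
    print(sessao.isoformat())
```

A list of datetime values like this can go straight into a dataframe column.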
