User Tools

Site Tools


html_to_csv

This is an old revision of the document!


Esta es la consigna:

I want to make a csv file with the contents of a very large html dump containing certain information.

An excerpt of that html could be the following: ´´´ {“id”:“purchase_v3_purchase-ea7c91009a3649cb9f5aa6eb8b7111b71895756d”,“type”:“purchase”,“title”:“Compra de 7 productos”,“description”:“Mercado Libre”,“email”:null,“status”:“”,“statusColor”:“gray”,“statusIcon”:null,“image”:null,“initials”:null,“iconClass”:“ic_merchant_order”,“amount”:{“currency_id”:“ARS”,“symbol”:“$”,“symbol_text”:“Peso argentino”,“fraction”:“-61.046”,“decimal_separator”:“,”,“cents”:“33”,“cents_text”:“”},“actions”:{“other”:[]},“name”:“purchase”,“date”:“07 de junio”,“creationDate”:“2023-06-07T10:19:55.000Z”,“lastModified”:“2023-06-07T10:19:55.000Z”,“moneyReleaseDate”:“2023-06-07T10:21:32.000Z”,“link”:“/activities/detail/purchase_v3_purchase-ea7c91009a3649cb9f5aa6eb8b7111b71895756d”,“entity”:“payment”,“period”:“previous”} ´´´

The CSV that I want, needs to have the following columns: { Title; Counterpart; Credit column; Debit column; Date }

There is a condition to fill Credit or Debit: if the captured value of “fraction” is less than 0 (or it have a - sign), it is a debit. On the contrary, it would be a credit. And delete the thousands separation dot (or whatever is called).

The proposed capture groups would be the following: Title column: \“title\”:\“(.*?)\”, + the string resulting from \“description\”:\“(.*?)\”, Counterpart column: \“description\”:\“(.*?)\”, Credit/Debit column: \“fraction\”:\“(.*?)\”, Date column: \“moneyReleaseDate\”:\“(.*?)\”,

So, for example, with that excerpt I shared before, the desired output would be: Compra de 7 productos Mercado Libre; Mercado Libre; 0; 61046; 07/06/2023

Por otra parte, necesito guardar como html cada página de la actividad (botón derecho > guardar como > solo html) / y concatenar los html en un solo archivo

cat *.html > bigHtmlFile.html

este el resultado

mp2csvFTW.py
import re
import csv
 
with open("bigHtmlFile.html", "r") as infile:
    html_dump = infile.read()
 
# Define the regular expressions with negative lookbehind
regex_title = r"(?<!{)\"title\":\"(.*?)\","
regex_counterpart = r"(?<!{)\"description\":\"(.*?)\","
regex_fraction = r"(?<!{)\"fraction\":\"(-?\d+(?:\.\d+)?)\","
regex_date = r"(?<!{)\"moneyReleaseDate\":\"(.*?)\""
 
# Compile the regular expressions
pattern_title = re.compile(regex_title)
pattern_counterpart = re.compile(regex_counterpart)
pattern_fraction = re.compile(regex_fraction)
pattern_date = re.compile(regex_date)
 
# Find all matches for each field
titles = pattern_title.findall(html_dump)
counterparts = pattern_counterpart.findall(html_dump)
fractions = pattern_fraction.findall(html_dump)
dates = pattern_date.findall(html_dump)
 
# Process the counterparts to remove "a " or "de "
counterparts = [re.sub(r"^(a |de )", "", counterpart) for counterpart in counterparts]
 
# Combine the titles and processed counterparts
combined_columns = [title + " " + counterpart for title, counterpart in zip(titles, counterparts)]
 
# Prepare the data for CSV writing
data = zip(combined_columns, counterparts, fractions, dates)
 
# Process the data and write to the CSV file
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Title', 'Counterpart', 'Credit column', 'Debit column', 'Date'])
 
    for title, counterpart, fraction, date in zip(titles, counterparts, fractions, dates):
        # Convert fraction to float and remove thousands separator
        fraction = float(fraction.replace('.', '').replace(',', '.'))
 
        # Determine if it's a credit or debit
        credit = fraction if fraction >= 0 else 0
        debit = -fraction if fraction < 0 else 0
 
        # Format the date
        formatted_date = date[:10].replace('-', '/')
 
        # Write the row to the CSV file
        writer.writerow([title + ' ' + counterpart, counterpart, credit, debit, formatted_date])
html_to_csv.1687538451.txt.gz · Last modified: 2024/10/17 21:42 (external edit)