This is an old revision of the document!
Esta es la consigna:
I want to make a csv file with the contents of a very large html dump containing certain information.
An excerpt of that html could be the following:
{"id":"purchase_v3_purchase-ea7c91009a3649cb9f5aa6eb8b7111b71895756d","type":"purchase","title":"Compra de 7 productos","description":"Mercado Libre","email":null,"status":"","statusColor":"gray","statusIcon":null,"image":null,"initials":null,"iconClass":"ic_merchant_order","amount":{"currency_id":"ARS","symbol":"$","symbol_text":"Peso argentino","fraction":"-61.046","decimal_separator":",","cents":"33","cents_text":""},"actions":{"other":[]},"name":"purchase","date":"07 de junio","creationDate":"2023-06-07T10:19:55.000Z","lastModified":"2023-06-07T10:19:55.000Z","moneyReleaseDate":"2023-06-07T10:21:32.000Z","link":"/activities/detail/purchase_v3_purchase-ea7c91009a3649cb9f5aa6eb8b7111b71895756d","entity":"payment","period":"previous"}
The CSV that I want, needs to have the following columns: { Title; Counterpart; Credit column; Debit column; Date }
There is a condition to fill Credit or Debit: if the captured value of “fraction” is less than 0 (or it have a - sign), it is a debit. On the contrary, it would be a credit. And delete the thousands separation dot (or whatever is called).
The proposed capture groups would be the following:
- Title column: \“title\”:\“(.*?)\”, + the string resulting from \“description\”:\“(.*?)\”,
- Counterpart column: \“description\”:\“(.*?)\”,
- Credit/Debit column: \“fraction\”:\“(.*?)\”,
- Date column: \“moneyReleaseDate\”:\“(.*?)\”,
So, for example, with that excerpt I shared before, the desired output would be: Compra de 7 productos Mercado Libre; Mercado Libre; 0; 61046; 07/06/2023
Por otra parte, necesito guardar como html cada página de la actividad (botón derecho > guardar como > solo html) / y concatenar los html en un solo archivo
cat *.html > bigHtmlFile.html
este el resultado
- mp2csvFTW.py
import re import csv with open("bigHtmlFile.html", "r") as infile: html_dump = infile.read() # Define the regular expressions with negative lookbehind regex_title = r"(?<!{)\"title\":\"(.*?)\"," regex_counterpart = r"(?<!{)\"description\":\"(.*?)\"," regex_fraction = r"(?<!{)\"fraction\":\"(-?\d+(?:\.\d+)?)\"," regex_date = r"(?<!{)\"moneyReleaseDate\":\"(.*?)\"" # Compile the regular expressions pattern_title = re.compile(regex_title) pattern_counterpart = re.compile(regex_counterpart) pattern_fraction = re.compile(regex_fraction) pattern_date = re.compile(regex_date) # Find all matches for each field titles = pattern_title.findall(html_dump) counterparts = pattern_counterpart.findall(html_dump) fractions = pattern_fraction.findall(html_dump) dates = pattern_date.findall(html_dump) # Process the counterparts to remove "a " or "de " counterparts = [re.sub(r"^(a |de )", "", counterpart) for counterpart in counterparts] # Combine the titles and processed counterparts combined_columns = [title + " " + counterpart for title, counterpart in zip(titles, counterparts)] # Prepare the data for CSV writing data = zip(combined_columns, counterparts, fractions, dates) # Process the data and write to the CSV file with open('output.csv', 'w', newline='') as csvfile: writer = csv.writer(csvfile) writer.writerow(['Title', 'Counterpart', 'Credit column', 'Debit column', 'Date']) for title, counterpart, fraction, date in zip(titles, counterparts, fractions, dates): # Convert fraction to float and remove thousands separator fraction = float(fraction.replace('.', '').replace(',', '.')) # Determine if it's a credit or debit credit = fraction if fraction >= 0 else 0 debit = -fraction if fraction < 0 else 0 # Format the date formatted_date = date[:10].replace('-', '/') # Write the row to the CSV file writer.writerow([title + ' ' + counterpart, counterpart, credit, debit, formatted_date])
