This is an old revision of the document!

Esta es la consigna:

I want to make a csv file with the contents of a very large html dump containing certain information.

An excerpt of that html could be the following:

 
{"id":"purchase_v3_purchase-ea7c91009a3649cb9f5aa6eb8b7111b71895756d","type":"purchase","title":"Compra de 7 productos","description":"Mercado Libre","email":null,"status":"","statusColor":"gray","statusIcon":null,"image":null,"initials":null,"iconClass":"ic_merchant_order","amount":{"currency_id":"ARS","symbol":"$","symbol_text":"Peso argentino","fraction":"-61.046","decimal_separator":",","cents":"33","cents_text":""},"actions":{"other":[]},"name":"purchase","date":"07 de junio","creationDate":"2023-06-07T10:19:55.000Z","lastModified":"2023-06-07T10:19:55.000Z","moneyReleaseDate":"2023-06-07T10:21:32.000Z","link":"/activities/detail/purchase_v3_purchase-ea7c91009a3649cb9f5aa6eb8b7111b71895756d","entity":"payment","period":"previous"}

The CSV that I want, needs to have the following columns: { Title; Counterpart; Credit column; Debit column; Date }

There is a condition to fill Credit or Debit: if the captured value of “fraction” is less than 0 (or it have a - sign), it is a debit. On the contrary, it would be a credit. And delete the thousands separation dot (or whatever is called).

The proposed capture groups would be the following:

Title column: \“title\”:\“(.*?)\”, + the string resulting from \“description\”:\“(.*?)\”,
Counterpart column: \“description\”:\“(.*?)\”,
Credit/Debit column: \“fraction\”:\“(.*?)\”,
Date column: \“moneyReleaseDate\”:\“(.*?)\”,

So, for example, with that excerpt I shared before, the desired output would be: Compra de 7 productos Mercado Libre; Mercado Libre; 0; 61046; 07/06/2023

Por otra parte, necesito guardar como html cada página de la actividad (botón derecho > guardar como > solo html) / y concatenar los html en un solo archivo

cat *.html > bigHtmlFile.html

este el resultado

Based on your requirements, using a programming language like Python would be a good choice for processing the HTML dump and generating the CSV file.

Here's an example Python code snippet that utilizes the re module for regular expressions to extract the desired information from the HTML dump and format it into the CSV file

mp2csvFTW.py

import re
import csv
 
with open("bigHtmlFile.html", "r") as infile:
    html_dump = infile.read()
 
# Define the regular expressions with negative lookbehind
regex_title = r"(?<!{)\"title\":\"(.*?)\","
regex_counterpart = r"(?<!{)\"description\":\"(.*?)\","
regex_fraction = r"(?<!{)\"fraction\":\"(-?\d+(?:\.\d+)?)\","
regex_date = r"(?<!{)\"moneyReleaseDate\":\"(.*?)\""
 
# Compile the regular expressions
pattern_title = re.compile(regex_title)
pattern_counterpart = re.compile(regex_counterpart)
pattern_fraction = re.compile(regex_fraction)
pattern_date = re.compile(regex_date)
 
# Find all matches for each field
titles = pattern_title.findall(html_dump)
counterparts = pattern_counterpart.findall(html_dump)
fractions = pattern_fraction.findall(html_dump)
dates = pattern_date.findall(html_dump)
 
# Process the counterparts to remove "a " or "de "
counterparts = [re.sub(r"^(a |de )", "", counterpart) for counterpart in counterparts]
 
# Combine the titles and processed counterparts
combined_columns = [title + " " + counterpart for title, counterpart in zip(titles, counterparts)]
 
# Prepare the data for CSV writing
data = zip(combined_columns, counterparts, fractions, dates)
 
# Process the data and write to the CSV file
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Title', 'Counterpart', 'Credit column', 'Debit column', 'Date'])
 
    for title, counterpart, fraction, date in zip(titles, counterparts, fractions, dates):
        # Convert fraction to float and remove thousands separator
        fraction = float(fraction.replace('.', '').replace(',', '.'))
 
        # Determine if it's a credit or debit
        credit = fraction if fraction >= 0 else 0
        debit = -fraction if fraction < 0 else 0
 
        # Format the date
        formatted_date = date[:10].replace('-', '/')
 
        # Write the row to the CSV file
        writer.writerow([title + ' ' + counterpart, counterpart, credit, debit, formatted_date])

This code snippet extracts the required fields using regular expressions and performs the necessary conversions and calculations to determine the credit and debit values based on the 'fraction' field. It then writes the processed data to the CSV file 'output.csv' with the specified columns.

If you want to ignore strings that begin with a curly brace {, you can modify the regular expressions in the Python code to exclude such strings. One way to achieve this is by using a negative lookahead assertion in the regular expressions. the regular expressions regex_title, regex_counterpart, regex_fraction, and regex_date have been modified to include a negative lookahead (?!{) at the beginning. This negative lookahead ensures that the strings captured by the regular expressions do not start with a curly brace {.

With this modification, any strings that begin with a curly brace will be ignored and excluded from the captured results.

Using this approach, you can skip the step of manually editing the source HTML file to remove unwanted strings, as the regular expressions will handle it during the extraction process.