The problem
I need to import several months of transactions from my Mercadopago account into Firefly III (around 200 transactions: purchases, transfers, refunds, payments, collections, etc.), and I don't want to do it one by one.
Firefly III has a utility, the Data Importer, for processing CSV files and inserting that data into the database.
Mercadopago has a reconciliation feature; it gives you plenty of information, but not exactly what I need. It doesn't include, for example, the name of the merchant or person on the other side of the transaction. There's an ID, a CUIT or something like that, and it's no use in this case. The information shown, paginated, in the Activity section is much more useful to me.
What if I save the HTML, look up what each 'tag' is called, and extract the data with regexes?
So I need to save each page of the activity as HTML (right click > Save as > HTML only)
and then concatenate the HTML files into one big file that I'll process:
cat *.html > bigHtmlFile.html
Here the shell expands *.html into a list of file names, apparently in alphabetical order, and passes them as arguments to cat. There's a limit: past it, you get an "argument list too long" error. There's a workaround, but I didn't need it because I only processed 8 files.
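For the record, the same concatenation can be done without touching the argument limit at all. A small Python sketch that mirrors cat *.html in alphabetical order (the file names come from the one-liner above); it also explicitly skips the output file, which the cat version quietly risks including in the glob:

```python
import glob

# Collect the saved activity pages in alphabetical order, excluding
# the output file itself (it also matches *.html once created).
pages = sorted(n for n in glob.glob("*.html") if n != "bigHtmlFile.html")

# Concatenate them into one big file, like `cat *.html > bigHtmlFile.html`
# but with no "argument list too long" limit.
with open("bigHtmlFile.html", "w", encoding="utf-8") as out:
    for name in pages:
        with open(name, encoding="utf-8") as page:
            out.write(page.read())
```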
I'd like ChatGPT to do the boring part, so I hand it the assignment…
This is the prompt:
I want to make a csv file with the contents of a very large html dump containing certain information.
An excerpt of that html could be the following:
{"id":"purchase_v3_purchase-ea7c91009a3649cb9f5aa6eb8b7111b71895756d","type":"purchase","title":"Compra de 7 productos","description":"Mercado Libre","email":null,"status":"","statusColor":"gray","statusIcon":null,"image":null,"initials":null,"iconClass":"ic_merchant_order","amount":{"currency_id":"ARS","symbol":"$","symbol_text":"Peso argentino","fraction":"-61.046","decimal_separator":",","cents":"33","cents_text":""},"actions":{"other":[]},"name":"purchase","date":"07 de junio","creationDate":"2023-06-07T10:19:55.000Z","lastModified":"2023-06-07T10:19:55.000Z","moneyReleaseDate":"2023-06-07T10:21:32.000Z","link":"/activities/detail/purchase_v3_purchase-ea7c91009a3649cb9f5aa6eb8b7111b71895756d","entity":"payment","period":"previous"}
The CSV that I want needs to have the following columns: { Title; Counterpart; Credit column; Debit column; Date }
There is a condition to fill the Credit or Debit columns: if the captured value of "fraction" is less than 0 (or it has a - sign), it is a debit. Otherwise, it is a credit. Also delete the thousands separator dot (or whatever it is called).
The proposed capture groups would be the following:
- Title column: "title":"(.*?)", + the string resulting from "description":"(.*?)",
- Counterpart column: "description":"(.*?)",
- Credit/Debit column: "fraction":"(.*?)",
- Date column: "moneyReleaseDate":"(.*?)",
So, for example, processing that excerpt I shared before, the desired output would be: Compra de 7 productos Mercado Libre; Mercado Libre; 0; 61046; 07/06/2023
This is the result:
Based on your requirements, using a programming language like Python would be a good choice for processing the HTML dump and generating the CSV file.
Here's an example Python code snippet that utilizes the re module for regular expressions to extract the desired information from the HTML dump and format it into the CSV file:
- mp2csvFTW.py
import re
import csv

with open("bigHtmlFile.html", "r") as infile:
    html_dump = infile.read()

# Define the regular expressions with negative lookbehind
regex_title = r"(?<!{)\"title\":\"(.*?)\","
regex_counterpart = r"(?<!{)\"description\":\"(.*?)\","
regex_fraction = r"(?<!{)\"fraction\":\"(-?\d+(?:\.\d+)?)\","
regex_date = r"(?<!{)\"moneyReleaseDate\":\"(.*?)\""

# Compile the regular expressions
pattern_title = re.compile(regex_title)
pattern_counterpart = re.compile(regex_counterpart)
pattern_fraction = re.compile(regex_fraction)
pattern_date = re.compile(regex_date)

# Find all matches for each field
titles = pattern_title.findall(html_dump)
counterparts = pattern_counterpart.findall(html_dump)
fractions = pattern_fraction.findall(html_dump)
dates = pattern_date.findall(html_dump)

# Process the counterparts to remove "a " or "de " (this is cosmetic)
counterparts = [re.sub(r"^(a |de )", "", counterpart) for counterpart in counterparts]

# Combine the titles and processed counterparts
combined_columns = [title + " " + counterpart for title, counterpart in zip(titles, counterparts)]

# Prepare the data for CSV writing (not sure this is actually used...)
data = zip(combined_columns, counterparts, fractions, dates)

# Process the data and write to the CSV file
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Title', 'Counterpart', 'Credit column', 'Debit column', 'Date'])
    for title, counterpart, fraction, date in zip(titles, counterparts, fractions, dates):
        # Convert fraction to float and remove thousands separator
        fraction = float(fraction.replace('.', '').replace(',', '.'))
        # Determine if it's a credit or debit
        credit = fraction if fraction >= 0 else 0
        debit = -fraction if fraction < 0 else 0
        # Format the date as DD/MM/YYYY, as in the desired output
        year, month, day = date[:10].split('-')
        formatted_date = day + '/' + month + '/' + year
        # Write the row to the CSV file
        writer.writerow([title + ' ' + counterpart, counterpart, credit, debit, formatted_date])
This code snippet extracts the required fields using regular expressions and performs the necessary conversions and calculations to determine the credit and debit values based on the 'fraction' field. It then writes the processed data to the CSV file 'output.csv' with the specified columns.
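The credit/debit rule can be sanity-checked against the excerpt from the prompt. A minimal sketch, where the record string is a trimmed, hypothetical version of one activity entry:

```python
import re

# Trimmed, hypothetical version of the record from the excerpt above.
record = ('"title":"Compra de 7 productos","description":"Mercado Libre",'
          '"fraction":"-61.046","moneyReleaseDate":"2023-06-07T10:21:32.000Z",')

fraction = re.search(r'"fraction":"(.*?)"', record).group(1)
amount = float(fraction.replace('.', ''))   # drop the thousands separator
credit = amount if amount >= 0 else 0       # no "-" sign: credit
debit = -amount if amount < 0 else 0        # "-" sign: debit
assert (credit, debit) == (0, 61046)
```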
There was a title tag that belonged to the site itself and got in the way. That particular one had a { before it; I edited it out with a <redacted>, which was faster in this case since the file is small, but I asked whether it's possible to extract only the occurrences without a curly brace {:
If you want to ignore occurrences that are immediately preceded by a curly brace {, you can modify the regular expressions in the Python code to exclude them. One way to achieve this is by using a negative lookbehind assertion. The regular expressions regex_title, regex_counterpart, regex_fraction, and regex_date have been modified to include a negative lookbehind (?<!{) at the beginning. This negative lookbehind ensures that the strings captured by the regular expressions are not preceded by a curly brace { (i.e. are not the first key of a JSON object). With this modification, any occurrences preceded by a curly brace will be ignored and excluded from the captured results.
Using this approach, you can skip the step of manually editing the source HTML file to remove unwanted strings, as the regular expressions will handle it during the extraction process.
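The effect of the lookbehind can be seen on a tiny, made-up sample containing both kinds of occurrence; only the one not preceded by { should match:

```python
import re

# A made-up sample: the site's own title (preceded by "{") and a
# transaction title (preceded by ","). The lookbehind skips the first.
sample = '{"title":"site page title"} ,"title":"Compra de 7 productos",'
matches = re.findall(r'(?<!{)"title":"(.*?)"', sample)
assert matches == ["Compra de 7 productos"]
```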
This has the luck that the data source keeps the records in order and the script processes them from start to finish, which is why it can neatly append the right value to each column.
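Since that depends on the four findall() lists staying aligned, an alternative sketch (my own, not from the ChatGPT answer) captures all four fields in a single pass per record, so a record with a missing key cannot shift the zipped columns. It assumes the keys appear in this order inside each record, as they do in the dump:

```python
import re

# One regex per record: all four fields captured together, so the
# columns can never get out of sync across records.
record_re = re.compile(
    r'"title":"(.*?)",.*?'
    r'"description":"(.*?)",.*?'
    r'"fraction":"(-?[\d.]+)",.*?'
    r'"moneyReleaseDate":"(.*?)"'
)

# Trimmed, hypothetical record based on the excerpt from the prompt.
sample = ('{"id":"x","type":"purchase","title":"Compra de 7 productos",'
          '"description":"Mercado Libre",'
          '"amount":{"fraction":"-61.046","cents":"33"},'
          '"moneyReleaseDate":"2023-06-07T10:21:32.000Z"}')

title, counterpart, fraction, date = record_re.search(sample).groups()
assert fraction == "-61.046"
```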
