Differences

This shows you the differences between two versions of the page.

--- html_to_csv [2023/06/23 16:58] – oso
+++ html_to_csv [2024/10/17 21:42] (current) – external edit 127.0.0.1
@@ Line 1: / Line 1: @@
+====== The problem ======
+I need to import several months of transactions into my Mercadopago account in [[https://www.firefly-iii.org/|Firefly III]] (about 200 transactions: purchases, transfers, refunds, payments, collections, etc.), and I don't want to enter them one by one.
+Firefly III has a utility, the [[https://github.com/firefly-iii/data-importer/|Data Importer]], that processes CSV files and inserts their data into the database.
+Mercadopago has a reconciliation feature that provides a lot of information, but not quite what I need. For example, it lacks **the name** of the merchant or person I am dealing with. It has an ID, a CUIT or something similar, which is no use to me here. The information shown, paginated, in the Activity section is much more useful.
+What if I save the HTML, find out what each 'tag' is called, and extract the data with regex?
+So I need to save each page of the activity as HTML (right click > save as > HTML only)
+and then concatenate the HTML files into one big file to process:
+<code bash>cat *.html > bigHtmlFile.html</code>
+Here the shell expands *.html into a list of arguments for ''cat'', one per file name, apparently in alphabetical order. There is a limit on the length of that list; exceeding it fails with an "Argument list too long" error. There is a workaround, but I didn't need it because I only processed 8 files.
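One possible way around the argument limit is to do the concatenation in Python instead of the shell. This is only a sketch, not the workaround the author had in mind: the function name ''concatenate_html'' and the UTF-8 encoding are assumptions, and the output file is skipped so that a second run does not append the result to itself.

```python
from pathlib import Path


def concatenate_html(source_dir: str, output_name: str = "bigHtmlFile.html") -> int:
    """Concatenate every *.html file in source_dir, sorted by name.

    Hypothetical helper: the files are read one at a time, so no
    command-line argument limit applies. Returns the number of pages written.
    """
    source = Path(source_dir)
    output = source / output_name
    # Sort by file name and leave out the output file itself.
    pages = sorted(p for p in source.glob("*.html") if p.name != output_name)
    with output.open("w", encoding="utf-8") as out:
        for page in pages:
            # Assumption: the saved pages are UTF-8; undecodable bytes are replaced.
            out.write(page.read_text(encoding="utf-8", errors="replace"))
    return len(pages)
```

Calling ''concatenate_html(".")'' from the folder with the saved pages should leave the same ''bigHtmlFile.html'' that ''cat'' produces.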
+I would like ChatGPT to do the boring part, so I give it the prompt...
 ====== This is the prompt: ======
@@ Line 5: / Line 25: @@
 An excerpt of that html could be the following:
-<code>
+<code>{"id":"purchase_v3_purchase-ea7c91009a3649cb9f5aa6eb8b7111b71895756d","type":"purchase","title":"Compra de 7 productos","description":"Mercado Libre","email":null,"status":"","statusColor":"gray","statusIcon":null,"image":null,"initials":null,"iconClass":"ic_merchant_order","amount":{"currency_id":"ARS","symbol":"$","symbol_text":"Peso argentino","fraction":"-61.046","decimal_separator":",","cents":"33","cents_text":""},"actions":{"other":[]},"name":"purchase","date":"07 de junio","creationDate":"2023-06-07T10:19:55.000Z","lastModified":"2023-06-07T10:19:55.000Z","moneyReleaseDate":"2023-06-07T10:21:32.000Z","link":"/activities/detail/purchase_v3_purchase-ea7c91009a3649cb9f5aa6eb8b7111b71895756d","entity":"payment","period":"previous"}</code>
-{"id":"purchase_v3_purchase-ea7c91009a3649cb9f5aa6eb8b7111b71895756d","type":"purchase","title":"Compra de 7 productos","description":"Mercado Libre","email":null,"status":"","statusColor":"gray","statusIcon":null,"image":null,"initials":null,"iconClass":"ic_merchant_order","amount":{"currency_id":"ARS","symbol":"$","symbol_text":"Peso argentino","fraction":"-61.046","decimal_separator":",","cents":"33","cents_text":""},"actions":{"other":[]},"name":"purchase","date":"07 de junio","creationDate":"2023-06-07T10:19:55.000Z","lastModified":"2023-06-07T10:19:55.000Z","moneyReleaseDate":"2023-06-07T10:21:32.000Z","link":"/activities/detail/purchase_v3_purchase-ea7c91009a3649cb9f5aa6eb8b7111b71895756d","entity":"payment","period":"previous"}
-</code>
 The CSV that I want needs to have the following columns:
@@ Line 22: / Line 40: @@
 So, for example, processing that excerpt I shared before, the desired output would be:
 Compra de 7 productos Mercado Libre; Mercado Libre; 0; 61046; 07/06/2023
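The conversion from the excerpt to that output line can be sketched as follows. This is an illustration, not the script ChatGPT produced: it parses the record with ''json'' rather than regular expressions, the function name ''activity_to_row'' is made up, and the column order (combined description; counterpart; credit; debit; date) is inferred from the example line, since the column list is not shown in this diff.

```python
import json
from datetime import datetime

# Trimmed copy of the excerpt above, keeping only the fields that are used.
sample = json.loads(
    '{"title":"Compra de 7 productos","description":"Mercado Libre",'
    '"amount":{"fraction":"-61.046","cents":"33"},'
    '"creationDate":"2023-06-07T10:19:55.000Z"}'
)


def activity_to_row(record: dict) -> str:
    """Build one output line from an activity record (hypothetical helper)."""
    title = record["title"]
    counterpart = record["description"]
    # "fraction" uses "." as a thousands separator: "-61.046" means -61046.
    # The cents are dropped, as in the desired output.
    amount = int(record["amount"]["fraction"].replace(".", ""))
    # Assumption: third column is the credit, fourth is the debit.
    credit = amount if amount > 0 else 0
    debit = -amount if amount < 0 else 0
    created = datetime.strptime(record["creationDate"], "%Y-%m-%dT%H:%M:%S.%fZ")
    date = created.strftime("%d/%m/%Y")
    return "; ".join([f"{title} {counterpart}", counterpart, str(credit), str(debit), date])


print(activity_to_row(sample))
```

For the trimmed excerpt this prints the same line as the desired output shown above.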
-----
-On the other hand, I need to save each page of the activity as HTML (right click > save as > HTML only)
-/
-and concatenate the HTML files into one big file that I will process with the script:
-<code bash>cat *.html > bigHtmlFile.html</code>
 ====== This is the result ======
@@ Line 61: / Line 71: @@
 dates = pattern_date.findall(html_dump)
-# Process the counterparts to remove "a " or "de "
+# Process the counterparts to remove "a " or "de " (this is cosmetic)
 counterparts = [re.sub(r"^(a |de )", "", counterpart) for counterpart in counterparts]
@@ Line 67: / Line 77: @@
 combined_columns = [title + " " + counterpart for title, counterpart in zip(titles, counterparts)]
-# Prepare the data for CSV writing
+# Prepare the data for CSV writing (not sure whether this is actually used...)
 data = zip(combined_columns, counterparts, fractions, dates)
@@ Line 91: / Line 101: @@
 This code snippet extracts the required fields using regular expressions and performs the necessary conversions and calculations to determine the credit and debit values based on the 'fraction' field. It then writes the processed data to the CSV file 'output.csv' with the specified columns.
+There was a ''title'' tag belonging to the site itself that got in the way. That one in particular had a ''{''. I edited it out with a ''<redacted>'', which was quicker in this case because the file is small, but I asked whether it is possible to extract only the ones without a curly brace ''{'':
-If you want to ignore strings that begin with a curly brace {, you can modify the regular expressions in the Python code to exclude such strings. One way to achieve this is by using a negative lookahead assertion in the regular expressions.
+If you want to ignore strings that begin with a curly brace {, you can modify the regular expressions in the Python code to exclude such strings. One way to achieve this is by using a negative lookahead assertion in the regular expressions. The regular expressions regex_title, regex_counterpart, regex_fraction, and regex_date have been modified to include a negative lookahead (?!{) at the beginning. This negative lookahead ensures that the strings captured by the regular expressions do not start with a curly brace {. With this modification, any strings that begin with a curly brace will be ignored and excluded from the captured results.
-the regular expressions regex_title, regex_counterpart, regex_fraction, and regex_date have been modified to include a negative lookahead (?!{) at the beginning. This negative lookahead ensures that the strings captured by the regular expressions do not start with a curly brace {.
-With this modification, any strings that begin with a curly brace will be ignored and excluded from the captured results.
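A minimal sketch of the negative lookahead described above. The pattern below is hypothetical, since the script's actual expressions are not shown in this diff, and the sample string is made up for illustration: it contains one ''title'' value that starts with ''{'' and two that do not.

```python
import re

# Hypothetical pattern: capture the value of "title" unless it starts with "{".
pattern_title = re.compile(r'"title":"(?!\{)([^"]*)"')

# Made-up sample: the first title stands in for the site's own title tag.
html_dump = (
    '"title":"{siteTitle}",'
    '"title":"Compra de 7 productos",'
    '"title":"Transferencia recibida"'
)

titles = pattern_title.findall(html_dump)
print(titles)
```

The lookahead ''(?!\{)'' consumes no characters. It only rejects a match when the next character is ''{'', so the first value is skipped and the other two are captured.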