html_to_csv

====== The problem ======
I need to import several months of transactions from my Mercadopago account (around 200 transactions: purchases, transfers, refunds, payments, collections, etc.) into [[https://www.firefly-iii.org/|Firefly III]], and I don't want to do it one by one.

Firefly III has a utility, the [[https://github.com/firefly-iii/data-importer/|Data Importer]], to process csv files and insert that data into the database.

Mercadopago has a reconciliation feature; it gives a lot of information, but not exactly what I need. It doesn't include, for example, **the name** of the merchant or the person I'm transacting with. There is an ID, CUIT or something of the sort, which is of no use to me in this case. The information that appears paginated in the Activity section is much more useful.

What if I save the html, find out what each 'tag' is called, and extract the data with regex?

So I need to save each page of the activity as html (right click > save as > html only)


and now concatenate the html files into one big file that I'm going to process:

<code bash>cat *.html > bigHtmlFile.html</code>

Here ''*.html'' reaches ''cat'' as a list of parameters, one per file name, apparently in alphabetical order. There is a limit; going past it returns a "too many arguments" error. There is a workaround, but I didn't need it because I only processed 8 files.
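A sketch of that workaround in Python (the file names here are my own choice): the script reads the pages in sorted name order and never passes the whole list to a single command, so the shell's argument-list limit doesn't apply.

```python
from pathlib import Path

# Collect the saved activity pages in sorted (alphabetical) order,
# excluding the output file itself in case it already exists.
pages = sorted(p for p in Path(".").glob("*.html") if p.name != "bigHtmlFile.html")

# Append each page to one big file, one at a time.
with open("bigHtmlFile.html", "w", encoding="utf-8") as out:
    for page in pages:
        out.write(page.read_text(encoding="utf-8"))
```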

I'd like ChatGPT to do the boring part, so I give it the prompt...
  
====== This is the prompt: ======
  
An excerpt of that html could be the following:
<code>{"id":"purchase_v3_purchase-ea7c91009a3649cb9f5aa6eb8b7111b71895756d","type":"purchase","title":"Compra de 7 productos","description":"Mercado Libre","email":null,"status":"","statusColor":"gray","statusIcon":null,"image":null,"initials":null,"iconClass":"ic_merchant_order","amount":{"currency_id":"ARS","symbol":"$","symbol_text":"Peso argentino","fraction":"-61.046","decimal_separator":",","cents":"33","cents_text":""},"actions":{"other":[]},"name":"purchase","date":"07 de junio","creationDate":"2023-06-07T10:19:55.000Z","lastModified":"2023-06-07T10:19:55.000Z","moneyReleaseDate":"2023-06-07T10:21:32.000Z","link":"/activities/detail/purchase_v3_purchase-ea7c91009a3649cb9f5aa6eb8b7111b71895756d","entity":"payment","period":"previous"}</code>
  
The CSV that I want needs to have the following columns:
{ Title; Counterpart; Credit column; Debit column; Date }
  
There is a condition to fill the Credit or Debit columns: if the captured value of "fraction" is less than 0 (or it has a - sign), it is a debit; otherwise, it is a credit. Also delete the thousands separator dot (or whatever it is called).
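That condition can be sketched as a small helper (the function name is my own, not part of the prompt):

```python
def credit_debit(fraction: str) -> tuple[str, str]:
    """Split a captured "fraction" string into (credit, debit).

    Removes the thousands separator dot; a leading "-" means debit.
    """
    value = fraction.replace(".", "")      # "-61.046" -> "-61046"
    if value.startswith("-"):
        return "0", value.lstrip("-")      # negative amount: debit
    return value, "0"                      # positive amount: credit

# With the value from the excerpt above:
print(credit_debit("-61.046"))  # ('0', '61046')
```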
  
The proposed capture groups would be the following:
   * Date column: \"moneyReleaseDate\":\"(.*?)\",   * Date column: \"moneyReleaseDate\":\"(.*?)\",
  
So, for example, processing that excerpt I shared before, the desired output would be:
Compra de 7 productos Mercado Libre; Mercado Libre; 0; 61046; 07/06/2023
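As a minimal sketch of how those capture groups could produce that row: only the Date regex appears in the list above, so the other three patterns here are my assumption that the remaining fields follow the same ''"key":"(.*?)"'' shape, and the excerpt is trimmed to just the captured fields.

```python
import re

# Trimmed version of the excerpt above; only the captured fields are kept.
excerpt = ('{"title":"Compra de 7 productos","description":"Mercado Libre",'
           '"fraction":"-61.046","moneyReleaseDate":"2023-06-07T10:21:32.000Z"}')

title = re.search(r'"title":"(.*?)"', excerpt).group(1)
counterpart = re.search(r'"description":"(.*?)"', excerpt).group(1)
fraction = re.search(r'"fraction":"(.*?)"', excerpt).group(1)
date_iso = re.search(r'"moneyReleaseDate":"(.*?)"', excerpt).group(1)

# Reformat 2023-06-07T... into 07/06/2023
y, m, d = date_iso[:10].split("-")
date = f"{d}/{m}/{y}"

# Credit is fixed to 0 here because this fraction is negative (a debit).
debit = fraction.replace(".", "").lstrip("-")
row = f"{title} {counterpart}; {counterpart}; 0; {debit}; {date}"
print(row)  # Compra de 7 productos Mercado Libre; Mercado Libre; 0; 61046; 07/06/2023
```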
  
====== This is the result ======
dates = pattern_date.findall(html_dump)
  
# Process the counterparts to remove "a " or "de " (this is cosmetic)
counterparts = [re.sub(r"^(a |de )", "", counterpart) for counterpart in counterparts]
  
combined_columns = [title + " " + counterpart for title, counterpart in zip(titles, counterparts)]
  
# Prepare the data for CSV writing (not sure if it actually uses this...)
data = zip(combined_columns, counterparts, fractions, dates)
  
This code snippet extracts the required fields using regular expressions and performs the necessary conversions and calculations to determine the credit and debit values based on the 'fraction' field. It then writes the processed data to the CSV file 'output.csv' with the specified columns.
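The writing step itself is not shown in the fragments above, so this is a hypothetical reconstruction of what it could look like, reusing the ''data'' rows prepared with ''zip'' and the semicolon separator from the desired output:

```python
import csv

# Hypothetical stand-in for the rows coming out of zip(...) above.
data = [("Compra de 7 productos Mercado Libre", "Mercado Libre", "0", "61046", "07/06/2023")]

with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow(["Title", "Counterpart", "Credit", "Debit", "Date"])
    writer.writerows(data)
```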
  
There was a ''title'' tag that belonged to the site itself and got in the way. That particular one had a ''{''; I edited it with a ''<redacted>'', which was faster in this case since it's a small file, but I asked whether it's possible to extract only the ones that don't have a ''{'' brace:

If you want to ignore strings that begin with a curly brace {, you can modify the regular expressions in the Python code to exclude such strings. One way to achieve this is by using a negative lookahead assertion. The regular expressions regex_title, regex_counterpart, regex_fraction, and regex_date have been modified to include a negative lookahead (?!{) at the beginning. This negative lookahead ensures that the strings captured by the regular expressions do not start with a curly brace {. With this modification, any strings that begin with a curly brace will be ignored and excluded from the captured results.
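A minimal check of that lookahead (the sample string here is made up): placing ''(?!{)'' right after the opening quote of the value rejects any value that begins with a brace.

```python
import re

sample = '"title":"{not a transaction}","title":"Compra de 7 productos"'

# The negative lookahead (?!{) fails on values starting with "{",
# so only the real transaction title is captured.
pattern = re.compile(r'"title":"(?!{)(.*?)"')
print(pattern.findall(sample))  # ['Compra de 7 productos']
```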
  
  
Using this approach, you can skip the step of manually editing the source HTML file to remove unwanted strings, as the regular expressions will handle it during the extraction process.
  
<WRAP center round info 60%>
Luckily, the data source has its data in order and it is processed from start to finish, which is why the matching values can be neatly appended to each column.
</WRAP>
  