html_to_csv
Differences
This shows you the differences between two versions of the page.
| Next revision | Previous revision | ||
| html_to_csv [2023/06/23 16:40] – created oso | html_to_csv [2024/10/17 21:42] (current) – external edit 127.0.0.1 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | This is the prompt: | + | ====== The problem ====== |
| + | I need to import into my Mercadopago account in [[https:// | ||
| + | |||
| + | Firefly III has a utility, the [[https:// | ||
| + | |||
| + | Mercadopago has a reconciliation feature, | ||
| + | |||
| + | What if I save the html and look up the name of each ' | ||
| + | |||
| + | So I need to save each page of the activity as html (right click > save as > html only) | ||
| + | |||
| + | |||
| + | and now concatenate the html files into one big file that I am going to process: | ||
| + | |||
| + | <code bash>cat *.html > bigHtmlFile.html</ | ||
| + | |||
| + | Here *.html is passed as a list of arguments to '' | ||
| + | |||
| + | I'd like ChatGPT to do the boring part, so I hand it the prompt... | ||
| + | |||
| + | ====== | ||
| I want to make a csv file with the contents of a very large html dump containing certain information. | I want to make a csv file with the contents of a very large html dump containing certain information. | ||
| An excerpt of that html could be the following: | An excerpt of that html could be the following: | ||
| - | ´´´ | + | < |
| - | {" | + | |
| - | ´´´ | + | |
| The CSV that I want, needs to have the following columns: | The CSV that I want, needs to have the following columns: | ||
| { Title; Counterpart; | { Title; Counterpart; | ||
| - | There is a condition to fill Credit or Debit: if the captured value of " | + | There is a condition to fill Credit or Debit columns: if the captured value of " |
| The proposed capture groups would be the following: | The proposed capture groups would be the following: | ||
| - | Title column: \" | + | * Title column: \" |
| - | Counterpart column: \" | + | |
| - | Credit/ | + | |
| - | Date column: \" | + | |
| - | So, for example, | + | So, for example, |
| Compra de 7 productos Mercado Libre; Mercado Libre; 0; 61046; 07/06/2023 | Compra de 7 productos Mercado Libre; Mercado Libre; 0; 61046; 07/06/2023 | ||
| - | |||
| - | On the other hand, I need to save each page of the activity as html (right click > save as > html only) | ||
| - | / | ||
| - | and concatenate the html files into a single file | ||
| - | |||
| - | <code bash>cat *.html > bigHtmlFile.html</ | ||
| ====== this is the result ====== | ====== this is the result ====== | ||
| + | Based on your requirements, | ||
| + | Here's an example Python code snippet that uses the re module for regular expressions to extract the desired information from the HTML dump and format it into the CSV file: | ||
| <code bash: mp2csvFTW.py> | <code bash: mp2csvFTW.py> | ||
| Line 56: | Line 71: | ||
| dates = pattern_date.findall(html_dump) | dates = pattern_date.findall(html_dump) | ||
| - | # Process the counterparts to remove "a " or "de " | + | # Process the counterparts to remove "a " or "de " |
| counterparts = [re.sub(r" | counterparts = [re.sub(r" | ||
| Line 62: | Line 77: | ||
| combined_columns = [title + " " + counterpart for title, counterpart in zip(titles, counterparts)] | combined_columns = [title + " " + counterpart for title, counterpart in zip(titles, counterparts)] | ||
| - | # Prepare the data for CSV writing | + | # Prepare the data for CSV writing |
| data = zip(combined_columns, | data = zip(combined_columns, | ||
| Line 84: | Line 99: | ||
| writer.writerow([title + ' ' + counterpart, | writer.writerow([title + ' ' + counterpart, | ||
| </ | </ | ||
| + | This code snippet extracts the required fields using regular expressions and performs the necessary conversions and calculations to determine the credit and debit values based on the ' | ||
| + | |||
| + | There was a tag '' | ||
| + | If you want to ignore strings that begin with a curly brace {, you can modify the regular expressions in the Python code to exclude such strings. One way to achieve this is by using a negative lookahead assertion in the regular expressions. The regular expressions regex_title, | ||
| + | |||
| + | |||
| + | Using this approach, you can skip the step of manually editing the source HTML file to remove unwanted strings, as the regular expressions will handle it during the extraction process. | ||
| + | |||
| + | <WRAP center round info 60%> | ||
| + | This works by luck: the data source has its records in order and the script processes them from start to finish, which is why it can neatly append the right value to each column. | ||
| + | </ | ||
| + | |||
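The last note in the page points at a fragile spot: running one `findall` per column and zipping the lists only lines up if every record has every field, in order. A possible alternative, sketched below, is to match one whole record at a time so the fields cannot drift apart, and to use the negative lookahead mentioned earlier to skip values that start with `{`. This is only a sketch under assumptions: the key names (`title`, `counterpart`, `amount`, `date`), the sample data, the rule "negative amount goes to Debit" and the column order are all hypothetical, since the real field names and the real condition are cut off in the text above.

```python
import csv
import io
import re

# ASSUMPTION: these key names are made up for illustration; the real
# Mercadopago field names are not visible in the page above.
# [^{}]*? between fields keeps a match inside a single {...} object,
# and (?!\{) skips titles that begin with a curly brace.
RECORD = re.compile(
    r'"title":"(?!\{)(?P<title>[^"]*)"'
    r'[^{}]*?"counterpart":"(?P<counterpart>[^"]*)"'
    r'[^{}]*?"amount":"(?P<amount>-?\d+)"'
    r'[^{}]*?"date":"(?P<date>\d{2}/\d{2}/\d{4})"'
)


def records_to_rows(html_dump):
    """Return one [title, counterpart, credit, debit, date] row per record."""
    rows = []
    for m in RECORD.finditer(html_dump):
        amount = int(m["amount"])
        # ASSUMPTION: negative amounts are debits, positive ones credits.
        credit = amount if amount > 0 else 0
        debit = -amount if amount < 0 else 0
        # Drop a leading "a " or "de " from the counterpart.
        counterpart = re.sub(r"^(?:a|de) ", "", m["counterpart"])
        rows.append([m["title"], counterpart, credit, debit, m["date"]])
    return rows


def rows_to_csv(rows):
    """Serialize the rows as semicolon-separated CSV text with a header."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=";", lineterminator="\n")
    writer.writerow(["Title", "Counterpart", "Credit", "Debit", "Date"])
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample: the first object has a title starting with "{"
# and should be ignored; the second is a normal purchase.
sample = (
    '{"title":"{placeholder}","counterpart":"x",'
    '"amount":"1","date":"01/01/2023"}'
    '{"title":"Compra de 7 productos","counterpart":"a Mercado Libre",'
    '"amount":"-61046","date":"07/06/2023"}'
)

if __name__ == "__main__":
    print(rows_to_csv(records_to_rows(sample)), end="")
```

With this shape, a record that is missing a field is simply skipped instead of shifting every later value into the wrong row, which is the failure mode the zip-based version would hit on unordered or incomplete data.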
html_to_csv.1687538451.txt.gz · Last modified: 2024/10/17 21:42 (external edit)
