Specialized HTML parser that converts downloaded ledger HTML pages into CSV files for export into Google Sheets. Currently designed with a focus on extracting and compiling per-truck financial data to assist with semi operations data analysis on The Entreprenauts.
DISCLAIMER: The parser has not been tested or updated since before the semi revamp, so it is provided as-is with no guarantee of functionality.
There are currently no releases, and no plans for any. However, the program can be run as a script. First-time setup requires the following:
- Install Python. You may also need to install the Python package installer, pip
- Clone/download the repo to a directory of your choice (clone with
git clone https://github.com/Tropingenie/repo_Meta_Ledger_Parser.git
or download as per the screenshot)
- Open a terminal of your choice and navigate to the directory the program was downloaded to
- Run the command
pip install -r requirements.txt
to download the dependencies
The script is designed to work alongside dedicated spreadsheet software. Therefore, for very large entries the console will display a truncated table. All data is exported to the file processed_ledger.csv, which I recommend importing into Google Sheets for further analysis.
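If you want a quick look at the export before importing it into a spreadsheet, something like the following may help. This is only a sketch: the function name `load_ledger` is made up for illustration, and it assumes the first row of processed_ledger.csv is a header row, which has not been verified against the current output.

```python
import csv


def load_ledger(path="processed_ledger.csv"):
    """Read the exported CSV into a list of dicts, one per row.

    Assumption: the first row of the file holds the column names.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__":
    rows = load_ledger()
    print(f"{len(rows)} rows exported")
    if rows:
        # Show which columns the export actually contains
        print("columns:", ", ".join(rows[0].keys()))
```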
The parser runs in one of two modes: ID mode, or date mode.
In ID mode, the parser will parse all ledger entries above a certain ID. This is useful if you need to bring a partially filled table up to date, without parsing duplicate data.
In date mode, the parser will parse all ledger entries that fall on a certain day. This is useful for compiling daily per-truck financial reports.
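As a rough illustration of how a single argument can select between the two modes, here is a minimal sketch. It is not taken from scraper.py; the function name `classify_argument` is hypothetical, and the real script may decide differently.

```python
from datetime import date, datetime


def classify_argument(arg):
    """Return ("date", date) for YYYY-MM-DD input, otherwise ("id", int)."""
    try:
        # Date mode: the argument parses as a calendar date
        return "date", datetime.strptime(arg, "%Y-%m-%d").date()
    except ValueError:
        pass
    try:
        # ID mode: the argument is a plain integer ledger ID
        return "id", int(arg)
    except ValueError:
        raise ValueError(
            f"expected a ledger ID or a YYYY-MM-DD date, got {arg!r}"
        ) from None
```

Trying the date format first means a value like "2023-01-15" is never mistaken for an ID, while a bare number falls through to ID mode.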
To run the parser, you need to run the program with at least one command-line argument: either the ID of the oldest ledger entry you have saved, or a date. Optionally, follow it with the path(s) to one or more html files containing the ledger.
Generically, the parser can be run using:
python scraper.py "YYYY-MM-DD"
python scraper.py "OldestID"
See below for specific examples.
General steps are:
- Download all pages of the ledger you want parsed to .html files
- Determine the oldest ledger entry you want parsed, and take the ID of the ledger entry ONE BELOW the oldest entry you want parsed
- Open your terminal in the directory of the scraper.py file and run
python scraper.py OldestID+1 page_1.html ... page_n.html
(where OldestID+1 is the ledger entry's ID one below the last ledger entry you want parsed, and page_1.html ... page_n.html are the ledger pages you want parsed)
- The output will be printed to the terminal if it is short enough, and exported to a .csv file located in the same directory that scraper.py is in
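The reason for passing the ID one below the oldest entry you want is that ID mode keeps only entries strictly above the given ID. A minimal sketch of that filter, assuming each parsed entry is a dict with a hypothetical "id" key (the real column name may differ):

```python
def entries_above(entries, threshold_id):
    """Keep only entries whose ID is strictly greater than threshold_id."""
    return [e for e in entries if int(e["id"]) > threshold_id]


if __name__ == "__main__":
    # Made-up sample entries; IDs are often read from HTML as strings
    parsed = [{"id": "101"}, {"id": "102"}, {"id": "103"}]
    # Passing 101 keeps 102 and 103, so 101 itself is not duplicated
    print(entries_above(parsed, 101))
```

If the newest entry already in your spreadsheet has ID 101, passing 101 brings the table up to date without re-adding that row.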
For example, to parse the first three pages of my ledger, I would get the following:
Running the parser in date mode is identical to running it in ID mode, except you specify a date (in YYYY-MM-DD format, like in the ledger):
python scraper.py YYYY-MM-DD page_1.html ... page_n.html
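To show what a daily per-truck report could look like once entries are filtered by date, here is a small sketch. The keys "date", "truck" and "amount" are assumptions made for illustration and are not confirmed column names from the parser's output.

```python
from collections import defaultdict


def daily_truck_totals(entries, day):
    """Sum amounts per truck for entries whose date equals `day`.

    `day` is a YYYY-MM-DD string, matching the ledger's date format.
    """
    totals = defaultdict(float)
    for entry in entries:
        if entry["date"] == day:
            totals[entry["truck"]] += entry["amount"]
    return dict(totals)


if __name__ == "__main__":
    # Made-up sample data
    sample = [
        {"date": "2023-01-15", "truck": "Truck 1", "amount": 100.0},
        {"date": "2023-01-15", "truck": "Truck 1", "amount": 50.0},
        {"date": "2023-01-15", "truck": "Truck 2", "amount": -40.0},
        {"date": "2023-01-16", "truck": "Truck 1", "amount": 999.0},
    ]
    print(daily_truck_totals(sample, "2023-01-15"))
```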
For example, to parse two pages of my ledger in date mode:
Currently, ledger pages must be downloaded before they can be parsed. This can be done simply enough by going into your ledger on Entreprenauts, right-clicking anywhere, and selecting "Save As".
One note is that you need to ensure the html containing the table is saved. If you have downloaded the page and are getting "no table found" errors, then check that the table is actually in the .html file. An html table looks like the following:
To avoid this issue, it is recommended to download the entire webpage through the "Save As" dialog, instead of just the html file.
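If you would rather check a saved page without opening it in an editor, a short standard-library script can count the table tags for you. This is a sketch that is independent of scraper.py; the names `TableCounter` and `count_tables` are made up here.

```python
import sys
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts opening <table> tags in an html document."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TABLE> is matched as well
        if tag == "table":
            self.tables += 1


def count_tables(html_text):
    """Return how many <table> elements start in html_text."""
    counter = TableCounter()
    counter.feed(html_text)
    counter.close()
    return counter.tables


if __name__ == "__main__":
    # Usage: python check_table.py page_1.html (the file name is an example)
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        print(count_tables(f.read()), "table(s) found")
```

A result of 0 means the saved file does not contain the ledger table, and the page should be downloaded again as a complete webpage.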