Handle Wikipedia table faulty boolean design; and HTML headers with row/colspan > 1 (also XLS) #105

Open
JBThiel opened this issue Mar 28, 2024 · 1 comment

JBThiel commented Mar 28, 2024

TO thombashi, Tsuyoshi Hombashi
SUBJ Handle Wikipedia table faulty boolean design, and headers with row/colspan > 1

I have noticed a couple of design flaws in how Wikipedia generates tables.
Maybe you would like to incorporate some workarounds in sqlitebiter:

A) Some (all?) tables use a GREEN CHECK / RED X indicator for Boolean columns. This is a broken design: only
TD attributes are used to display the image, and the actual TD cell content is empty. This results in NULLs
coming out of conversion programs (incl. sqlitebiter).
This situation can be detected and fixed by looking for the pattern <td data-sort-value="Yes" ... (or "No") and injecting 1/0 as the cell content.
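
A minimal sketch of that search-and-inject step, as a preprocessing pass over the HTML before it is handed to a converter. It is not part of sqlitebiter; the function name inject_boolean_values is hypothetical, and the sketch assumes each cell has an explicit closing </td> and that a Boolean cell contains no text (only whitespace or markup such as an image tag). Cells that already contain text are left alone.

```python
import re

# Whole <td ...>...</td> cell whose opening tag carries
# data-sort-value="Yes" or "No" (matched case-insensitively).
_BOOL_CELL = re.compile(
    r'(<td\b[^>]*\bdata-sort-value="(yes|no)"[^>]*>)(.*?)(</td>)',
    re.IGNORECASE | re.DOTALL,
)
_TAG = re.compile(r"<[^>]+>")


def inject_boolean_values(html: str) -> str:
    """Hypothetical helper: put 1/0 into text-less Yes/No cells."""

    def fix(match: re.Match) -> str:
        open_tag, flag, body, close_tag = match.groups()
        # If the cell already has visible text, keep it untouched.
        if _TAG.sub("", body).strip():
            return match.group(0)
        value = "1" if flag.lower() == "yes" else "0"
        # Any markup inside the cell (e.g. the icon) is replaced by the value.
        return f"{open_tag}{value}{close_tag}"

    return _BOOL_CELL.sub(fix, html)
```

The patched HTML can then be saved to a file and converted as usual. A regex is enough for this narrow pattern; a real HTML parser would be safer for nested tables.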

B) Some tables have header cells with rowspan/colspan > 1, to make a fancy "grouped" header.
This breaks the column assignment for the data values that follow.
For example, if the first header row has 5 cells including one with colspan=2, and the 2nd header row and data rows have 6 cells,
then converters (incl. sqlitebiter) will see only the 5 cells, and some data will go missing or end up in wrongly named columns.
It's a little tricky to detect/automate, since the rowspan/colspan settings are spread across multiple HTML lines.
Examples of broken Wikipedia tables are at https://en.wikipedia.org/wiki/Comparison_of_text_editors
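
One possible way to handle this is to expand every spanned header cell into a rectangular grid first, then join each column top to bottom into a single name. The sketch below only illustrates that idea and is not how sqlitebiter works; expand_spans and flatten_headers are hypothetical names, and the input is assumed to be already parsed into (text, rowspan, colspan) tuples per row.

```python
def expand_spans(rows):
    """Expand rows of (text, rowspan, colspan) cells into a full grid."""
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the cell text into every slot the span covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def flatten_headers(grid):
    """Join each column's header parts, dropping repeats from rowspans."""
    names = []
    for column in zip(*grid):
        parts = []
        for part in column:
            if part and (not parts or parts[-1] != part):
                parts.append(part)
        names.append(" ".join(parts))
    return names


# The case described above: 5 cells in row 1 (one with colspan=2), 6 in row 2.
header = [
    [("A", 1, 1), ("B", 1, 1), ("C", 1, 2), ("D", 1, 1), ("E", 1, 1)],
    [("a", 1, 1), ("b", 1, 1), ("c1", 1, 1), ("c2", 1, 1), ("d", 1, 1), ("e", 1, 1)],
]
print(flatten_headers(expand_spans(header)))
# ['A a', 'B b', 'C c1', 'C c2', 'D d', 'E e']
```

With the header flattened to six names, the six-cell data rows line up with their columns again.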

NOTE OF APPRECIATION FOR Tsuyoshi Hombashi
Very nice work on SqliteBiter. And great name, clever with good rhyme.
Just what we need to grab/manage those tables floating around everywhere.
I had come up with some multi-stage dataflows through other apps, but they took too many steps, so I kept searching
for a better solution, which you have made.
Thanks for producing such a useful comprehensive CLI converter tool.

Best regards,
John
-- jbthiel

JBThiel changed the title from "Handle Wikipedia table faulty boolean design; and headers with row/colspan > 1" to "Handle Wikipedia table faulty boolean design; and HTML headers with row/colspan > 1 (also XLS)" on Apr 1, 2024

JBThiel commented Apr 1, 2024

A related issue occurs in XLS sheets converted via Gnumeric from HTML tables with header rowspan/colspan = 2.
If all the header cells are merged the same way (all rowspan=2) it's ok, but if there is a mix of merged and unmerged cells, then sqlitebiter fails on the XLS file with:

$ sqlitebiter file -f excel ../first-sheet-merged-headers.xls
[WARNING] convertible table not found in ../first-sheet-merged-headers.xls
[INFO] converted results: source=0

Example is the first sheet from the HTML tables at
https://en.wikipedia.org/wiki/Comparison_of_text_editors
converted to XLS via Gnumeric.

The WORKAROUND is to edit the header cells and make them all unmerged, or all merged the same way.
