TO thombashi, Tsuyoshi Hombashi
SUBJ Handle Wikipedia table faulty boolean design, and headers with row/colspan > 1
I have noticed a couple of design flaws in how Wikipedia generates tables.
Maybe you would like to incorporate some workarounds in sqlitebiter:
A) Some (all?) tables use a GREEN CHECK / RED X indicator for Boolean columns. This is a broken design: only TD
attributes are used to display the image, and the actual TD cell content is empty. This results in NULLs
coming out of conversion programs (incl. sqlitebiter).
This situation can be detected and fixed by looking for the pattern <td data-sort-value="Yes" ... (or "No") and injecting 1/0.
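A minimal sketch of that workaround as a preprocessing step on the raw HTML, before it is handed to a converter. It is not part of sqlitebiter; the function name `inject_booleans` is hypothetical, and it assumes the cell is written as an explicit `<td ...></td>` pair with nothing but whitespace inside:

```python
import re

# Assumption: a Boolean cell is an empty <td> carrying data-sort-value="Yes" or "No".
_EMPTY_BOOL_TD = re.compile(
    r'(<td\b[^>]*\bdata-sort-value="(Yes|No)"[^>]*>)\s*(</td>)',
    re.IGNORECASE,
)


def inject_booleans(html: str) -> str:
    """Fill empty Yes/No cells with 1/0 so converters do not emit NULL."""

    def repl(match: re.Match) -> str:
        value = "1" if match.group(2).lower() == "yes" else "0"
        # Keep the opening tag and its attributes, insert the value as cell text.
        return match.group(1) + value + match.group(3)

    return _EMPTY_BOOL_TD.sub(repl, html)
```

Cells that already contain text are left untouched, since the pattern only matches when the cell body is empty.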
B) Some tables have header cells with rowspan/colspan > 1, for making a fancy "grouped" header.
This breaks the column assignment for later data values.
For example, if the first header row has 5 cells including one with colspan=2, and the 2nd header row and data rows have 6 cells,
then converters (incl. sqlitebiter) will see only the 5 cells, and some data will go missing or end up in wrongly named columns.
It's a little tricky to detect/automate, since the rowspan/colspan settings are spread across multiple HTML lines.
Example broken Wikipedia tables at https://en.wikipedia.org/wiki/Comparison_of_text_editors
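One possible way to handle case B is to expand every spanned cell into a rectangular grid first, then join the header rows column by column. The sketch below assumes the rows have already been parsed into `(text, rowspan, colspan)` tuples; the names `expand_spans` and `flatten_headers` are hypothetical and not taken from sqlitebiter:

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, int, int]  # (text, rowspan, colspan), as parsed from <th>/<td>


def expand_spans(rows: List[List[Cell]]) -> List[List[str]]:
    """Copy each spanned cell into every grid position it covers."""
    grid: Dict[Tuple[int, int], str] = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already taken by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def flatten_headers(header_rows: List[List[str]]) -> List[str]:
    """Join the distinct header texts of each column, top to bottom."""
    names = []
    for column in zip(*header_rows):
        parts: List[str] = []
        for text in column:
            if text and text not in parts:
                parts.append(text)
        names.append(" ".join(parts))
    return names
```

With a first header row of 5 cells (one of them with colspan=2, the others with rowspan=2) and a second header row of 2 cells, this yields 6 column names, matching the 6 cells in each data row.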
NOTE OF APPRECIATION FOR Tsuyoshi Hombashi
Very nice work on SqliteBiter. And great name, clever with good rhyme.
Just what we need to grab/manage those tables floating around everywhere.
I had come up with some multi-stage dataflows through other apps, but they took too many steps, so I kept searching
for a better solution, which you have made.
Thanks for producing such a useful comprehensive CLI converter tool.
Best regards,
John
-- jbthiel
JBThiel changed the title from "Handle Wikipedia table faulty boolean design; and headers with row/colspan > 1" to "Handle Wikipedia table faulty boolean design; and HTML headers with row/colspan > 1 (also XLS)" on Apr 1, 2024
A related issue is in XLS sheets converted via Gnumeric, from HTML tables with header rowspan/colspan = 2.
If all the headers are merged (i.e. every header cell has rowspan=2) it's OK, but if there is a mix of merged and unmerged cells, then sqlitebiter fails on the XLS file with:
$ sqlitebiter file -f excel ../first-sheet-merged-headers.xls
[WARNING] convertible table not found in ../first-sheet-merged-headers.xls
[INFO] converted results: source=0