Provide format-specific metadata #71

scraperdragon · 2013-07-15T16:29:48Z

Often I find myself wanting more details about the individual cells than just their values.

e.g.

Some HTML cells contain more than a single value, which can require additional parsing to work out what the true value is. For example,

42¹

is naively converted to 421 at the moment; to do this additional processing, I need the HTML source of the cell.
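A minimal sketch of the problem, using the standard library's `xml.etree.ElementTree` as a stand-in for lxml (an assumption for illustration; this is not messytables' actual HTML code path). It shows how flattening the cell text gives 421, while having the cell's markup lets the caller separate the value from the footnote marker:

```python
import xml.etree.ElementTree as ET

# A table cell holding the value 42 followed by a superscript footnote marker.
cell = ET.fromstring("<td>42<sup>1</sup></td>")

# Naive conversion: concatenate every text node inside the cell.
naive_value = "".join(cell.itertext())

# With access to the cell's structure, the parts can be told apart.
main_value = (cell.text or "").strip()              # text before the first child
footnotes = [sup.text for sup in cell.iter("sup")]  # footnote markers only

# The raw markup of the cell, which a caller could parse however they like.
cell_html = ET.tostring(cell, encoding="unicode")

print(naive_value, main_value, footnotes, cell_html)
```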

Some formats (Excel, HTML, etc.) support additional formatting, e.g. bold, font colour, background colour. It would be good to allow future support for these.
But we don't want to write enormous amounts of code to cover all use cases, especially where features are limited to one or two formats. By making available the internals of the library parsing the file (e.g. lxml's internal representation of the cell), we can allow people to interrogate this data without hacking on messytables directly.

So: I propose adding a "properties" attribute to messytables Cells, which is a dictionary; which keys exist is entirely dependent on the helper library.

Currently, I:

expose internals via "_lxml", "_xlrd", "_pyxl"
expose raw HTML for the cell via "html"
expose whether a cell was spanned via "span" (HTML only so far)
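To make the proposal concrete, here is a rough sketch of how a caller might use such a dictionary. The `Cell` class below is a hypothetical stand-in, not the real messytables class, and `footnote_free_value` is an invented helper; only the "html" and "span" key names come from the list above.

```python
import re


class Cell:
    """Hypothetical stand-in for a messytables cell with a properties dict."""

    def __init__(self, value, properties=None):
        self.value = value
        # Format-specific extras; which keys exist depends on the source format.
        self.properties = properties if properties is not None else {}


def footnote_free_value(cell):
    """Return the cell value with <sup> footnote markers removed, if HTML is available."""
    html = cell.properties.get("html")
    if html is None:
        # Formats that do not expose HTML fall back to the plain value.
        return cell.value
    without_sup = re.sub(r"<sup>.*?</sup>", "", html)
    return re.sub(r"<[^>]+>", "", without_sup).strip()


html_cell = Cell("421", {"html": '<td colspan="2">42<sup>1</sup></td>', "span": True})
csv_cell = Cell("421")

print(footnote_free_value(html_cell))
print(footnote_free_value(csv_cell))
```

Because callers use `.get()`, code written against one format degrades gracefully when a key is missing for another.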

Does this sound like a good idea / terrible idea?

rossjones · 2013-07-19T10:30:12Z

In practice, I can imagine having dicts of data on every single cell being pretty inefficient. When iterating over rows, this is likely to stop those items (references to the underlying library's representation of cells) from being garbage-collected if the caller keeps a reference to what they thought was just some text and a type object, isn't it?
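The garbage-collection concern can be illustrated with a small sketch. `ParsedNode` and `Cell` are hypothetical stand-ins (for a parser's internal cell object and a messytables cell respectively), and the "_node" key is invented for the example:

```python
import gc
import weakref


class ParsedNode:
    """Stand-in for a heavyweight object owned by the parsing library."""

    def __init__(self, html):
        self.html = html


class Cell:
    """Hypothetical cell: a value plus an optional properties dict."""

    def __init__(self, value, properties=None):
        self.value = value
        self.properties = properties if properties is not None else {}


def make_cell(keep_node):
    node = ParsedNode("<td>42<sup>1</sup></td>")
    props = {"_node": node} if keep_node else {}
    # Return a weak reference so we can observe whether the node survives.
    return Cell("421", props), weakref.ref(node)


cell_with, ref_with = make_cell(keep_node=True)
cell_without, ref_without = make_cell(keep_node=False)
gc.collect()

# The caller only holds "a cell", yet the parser node stays alive through it.
node_kept_alive = ref_with() is not None
# Without the back-reference, the node is freed once parsing moves on.
node_released = ref_without() is None
print(node_kept_alive, node_released)
```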

rossjones · 2013-07-22T13:56:49Z

Random thought: if this is currently only useful for HTML tables, what do you think of only supporting it for HTML tables and having a no-op for other formats until a use case presents itself for them?

scraperdragon · 2013-07-22T14:22:50Z

I'd drastically underestimated the memory usage of those dicts; it's about +50% even when every cell just gets an empty {} on a CSV file. A different approach might be more appropriate.
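One possible direction, sketched under assumptions: allocate the dictionary only when something actually asks for it. The three classes below are hypothetical and are not messytables code; `footprint` is a rough per-cell estimate based on `sys.getsizeof`, so exact numbers will vary by Python version and will not necessarily match the figure quoted above.

```python
import sys


class PlainCell:
    """Cell with no properties at all (the baseline)."""
    __slots__ = ("value", "type")

    def __init__(self, value, type_):
        self.value = value
        self.type = type_


class DictCell:
    """Cell that eagerly allocates an empty dict, as in the proposal."""
    __slots__ = ("value", "type", "properties")

    def __init__(self, value, type_):
        self.value = value
        self.type = type_
        self.properties = {}


class LazyCell:
    """Cell that creates its dict only on first access."""
    __slots__ = ("value", "type", "_properties")

    def __init__(self, value, type_):
        self.value = value
        self.type = type_
        self._properties = None

    @property
    def properties(self):
        if self._properties is None:
            self._properties = {}
        return self._properties


def footprint(cell, extra=None):
    """Rough size in bytes: the cell object plus its dict, if one exists."""
    size = sys.getsizeof(cell)
    if extra is not None:
        size += sys.getsizeof(extra)
    return size


plain = PlainCell("42", "Integer")
eager = DictCell("42", "Integer")
lazy = LazyCell("42", "Integer")

plain_size = footprint(plain)
eager_size = footprint(eager, eager.properties)
lazy_size = footprint(lazy, lazy._properties)  # no dict allocated yet
print(plain_size, eager_size, lazy_size)
```

The lazy variant still pays for one extra slot per cell, but formats that never populate properties (such as CSV) avoid the per-cell dict entirely.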

scraperdragon mentioned this issue Jul 15, 2013

[WIP] Provide format-specific metadata #72

Closed
