Provide format-specific metadata #71

Open
scraperdragon opened this issue Jul 15, 2013 · 3 comments

Comments

@scraperdragon

Often I find myself wanting more details about the individual cells than just their values.

e.g.

  1. Some HTML cells contain more than just a single value, and this can require additional parsing to understand what the true value is. For example, a cell whose markup flattens to the text 421 is naively converted to 421 at the moment; to do this additional processing I require the HTML source of the cell.

  2. Some formats (Excel, HTML, etc.) support additional formatting, e.g. bold, font colour, background colour. It would be good to allow future support for these.

  3. But we don't want to write enormous amounts of code to cover all use cases, especially where features are limited to one or two formats. By making available the internals of the library parsing the file (e.g. lxml's internal representation of the cell), we can allow people to interrogate this data without hacking on messytables directly.

So: I propose adding a "properties" attribute to messytables Cells, which is a dictionary; which keys exist depends entirely on the helper library.

Currently, I:

  • expose internals via "_lxml", "_xlrd", "_pyxl"
  • expose raw HTML for the cell via "html"
  • expose whether a cell was spanned via "span" (HTML only so far)
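
To make the proposal concrete, here is a minimal sketch of what it could look like from the caller's side. The `Cell` class below is a hypothetical stand-in, not the actual messytables class, and the `<td>` markup is made up for illustration; only the key names "html" and "span" come from the list above.

```python
class Cell:
    """Hypothetical stand-in for a messytables cell (not the library's real class)."""

    def __init__(self, value, properties=None):
        self.value = value
        # Which keys exist depends entirely on the helper library that parsed the file.
        self.properties = properties if properties is not None else {}


def raw_html(cell):
    """Return the cell's raw HTML if the backend supplied it, else None."""
    return cell.properties.get("html")


# An HTML-backed cell might carry its source markup and a span flag
# (the markup here is invented for the example).
html_cell = Cell("421", {"html": "<td>421</td>", "span": False})

# A CSV-backed cell has nothing format-specific to offer.
csv_cell = Cell("421")

print(raw_html(html_cell))
print(raw_html(csv_cell))
```

Callers use `.get()` rather than indexing, since no key is guaranteed to exist for any given format.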

Does this sound like a good idea / terrible idea?

@rossjones
Contributor

In practice, I can imagine having dicts of data on every single cell being pretty inefficient. When iterating over rows, this is likely to stop those items (references to the underlying library's representation of cells) from being garbage-collected if the caller keeps a reference to what they thought was just some text and a type object, isn't it?

@rossjones
Contributor

Random thought: if this is currently only useful for HTML tables, what do you think of only supporting it for HTML tables and having a no-op for other formats until a use case presents itself for them?

@scraperdragon
Author

I'd drastically underestimated the memory usage of those dicts; it's roughly a 50% increase even with an empty {} on every cell of a CSV file. A different approach might be more appropriate.
