Meta-data probability corpus

@jcarlosroldan jcarlosroldan released this 09 Mar 10:51

This corpus adds an extra feature to the table extraction tool: for each cell text, the probability of it being labelled as meta-data across a large corpus of tables.

To compute it, we applied an unsupervised annotation method to a corpus of 145,533,822 tables:

  1. First, we iterate over the corpus, annotating every cell in the first row or column as meta-data and every other cell as data. Using this heuristic, we build a dictionary where each key is the text of a cell and its value is the likelihood of that cell being meta-data.
  2. Then, we iterate over the corpus again, but this time we use the previously computed dictionary to average the likelihood over each whole row or column. If the average is higher than 0.5, we consider every cell in that row or column to be meta-data; otherwise, we consider it data. We then rebuild the dictionary from the new meta-data/data occurrences.
  3. We repeat the previous step until no significant changes are produced.
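
The steps above can be sketched in Python as follows. This is only an illustrative sketch, not the code used to build the corpus. It assumes each table is a rectangular list of rows of cell strings, that a cell counts as meta-data when its row or its column averages above 0.5, and that "no significant changes" means the largest change in any dictionary value falls below a tolerance. The function names and the `tol` and `max_iter` parameters are hypothetical.

```python
from collections import defaultdict


def build_dictionary(tables, labels):
    """Map each cell text to the fraction of its occurrences labelled meta-data."""
    meta = defaultdict(int)
    total = defaultdict(int)
    for table, mask in zip(tables, labels):
        for row, mask_row in zip(table, mask):
            for text, is_meta in zip(row, mask_row):
                total[text] += 1
                meta[text] += is_meta  # True counts as 1, False as 0
    return {text: meta[text] / total[text] for text in total}


def first_pass_labels(table):
    """Step 1: cells in the first row or first column are meta-data."""
    return [[r == 0 or c == 0 for c in range(len(row))]
            for r, row in enumerate(table)]


def relabel(table, probs):
    """Step 2: a row or column is meta-data if its average likelihood exceeds 0.5."""
    n_rows, n_cols = len(table), len(table[0])
    row_meta = [sum(probs[text] for text in row) / n_cols > 0.5 for row in table]
    col_meta = [sum(table[r][c] and probs[table[r][c]] or probs[table[r][c]]
                    for r in range(n_rows)) / n_rows > 0.5
                for c in range(n_cols)]
    # Assumption: a cell is meta-data if either its row or its column is.
    return [[row_meta[r] or col_meta[c] for c in range(n_cols)]
            for r in range(n_rows)]


def metadata_probabilities(tables, tol=1e-3, max_iter=50):
    """Step 3: repeat the relabelling until the dictionary stops changing."""
    labels = [first_pass_labels(table) for table in tables]
    probs = build_dictionary(tables, labels)
    for _ in range(max_iter):
        labels = [relabel(table, probs) for table in tables]
        new_probs = build_dictionary(tables, labels)
        change = max((abs(new_probs[k] - probs[k]) for k in probs), default=0.0)
        probs = new_probs
        if change < tol:  # hypothetical stopping criterion
            break
    return probs


# Toy corpus of two small tables (made up for illustration).
example_tables = [
    [["name", "age"], ["ann", "31"], ["bob", "45"]],
    [["name", "age"], ["eve", "27"]],
]
example_probs = metadata_probabilities(example_tables)
```

On this toy corpus the first-column values such as `"ann"` keep a high likelihood because they only ever occur in the first column, which illustrates why the resulting value is best treated as one feature among several.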

While this simple method is not very effective on its own, its output can be used as an additional feature of the table extraction tool.