This repository has been archived by the owner on Nov 21, 2023. It is now read-only.
skoulouzis edited this page Jul 1, 2016 · 6 revisions

Overview

Train a system to measure the similarity of a text file against a set of pre-determined classes (categories). The output is a table with the categories as its header and the corresponding similarity measure for each category.
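As a rough illustration of what such an output table could look like, the sketch below scores one document vector against a set of category vectors. The page does not say which similarity measure E-CO^2 uses; cosine similarity is assumed here, and the category names, terms, and weights are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def classify(document_vector, category_vectors):
    """Return {category: similarity} for a single document."""
    return {
        name: cosine_similarity(document_vector, vector)
        for name, vector in category_vectors.items()
    }


# Hypothetical category vectors and document vector (not from the project).
category_vectors = {
    "data_analytics": {"regression": 0.8, "clustering": 0.6},
    "data_management": {"database": 0.9, "schema": 0.4},
}
document_vector = {"regression": 0.5, "database": 0.1}

scores = classify(document_vector, category_vectors)

# Print the table: categories as the header, similarities as the row.
print("\t".join(scores))
print("\t".join(f"{s:.3f}" for s in scores.values()))
```

Dictionaries are used as sparse vectors so that terms missing from one side simply contribute zero to the dot product.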

System: EDISON-COmpetencies ClassificatiOn (E-CO^2)

The system is separated into three parts:

Train

  • Manually define each category. In practice, write sentences and keywords that represent the category in a simple txt file. The quality of the classification depends on this step, so the text has to be concrete and representative and must contain specific nouns. For example, expressions like "analyze large data sets and investigate possible solutions" are not concrete enough.

  • Perform term extraction on the manually specified text file to produce a list of possible terms, with the help of a context terminology dictionary (created from a context corpus).

  • Perform word sense disambiguation on the possible terms and save the "best" definition for each term.

  • Generate TF-IDF values for the collection of exported definitions. The output is a table with the extracted terms as its header, where each line contains the TF-IDF values for one processed document.

  • The values of the table are summed per term and filtered to create a single vector that represents the category.
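The last two training steps can be sketched as follows. This is a minimal illustration, not the project's implementation: the weighting formula (term frequency times the natural log of inverse document frequency), the filter (a simple threshold on the summed weight), and the function names `tf_idf_table` and `category_vector` are all assumptions, and the token lists stand in for the definitions saved after word sense disambiguation.

```python
import math
from collections import Counter


def tf_idf_table(documents):
    """Build a TF-IDF table from a list of token lists.

    Returns (terms, rows): terms is the table header, and rows[i][j] is the
    TF-IDF value of terms[j] in document i.
    """
    terms = sorted({t for doc in documents for t in doc})
    n_docs = len(documents)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(t for doc in documents for t in set(doc))
    rows = []
    for doc in documents:
        counts = Counter(doc)
        row = []
        for term in terms:
            tf = counts[term] / len(doc) if doc else 0.0
            idf = math.log(n_docs / doc_freq[term])
            row.append(tf * idf)
        rows.append(row)
    return terms, rows


def category_vector(terms, rows, threshold=0.0):
    """Sum each column of the table and keep terms above the threshold."""
    sums = [sum(column) for column in zip(*rows)]
    return {term: s for term, s in zip(terms, sums) if s > threshold}


# Hypothetical definitions for one category, already tokenized.
definitions = [
    ["data", "model", "regression"],
    ["data", "cluster", "model"],
    ["data", "database"],
]

terms, rows = tf_idf_table(definitions)
vector = category_vector(terms, rows)
print(terms)
print(vector)
```

With this weighting, a term that occurs in every definition (here "data") gets an IDF of zero and is dropped by the filter, which is one plausible reading of "summed and filtered".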

Context Terminology Dictionary

  • Manually gather context documents containing definitions, terminology, expert analysis, etc. from sources like Wikipedia, scientific articles, etc.

  • Extract association rules (N-grams) from the corpus. The output is a table with two columns: the first contains the association rules and the second their probability of appearing as a combination.
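The N-gram extraction step might look like the sketch below. The page does not define how the probability is computed; here it is assumed to be the relative frequency of each N-gram among all N-grams of the same length in the corpus. The function name `ngram_probabilities` and the sample sentence are illustrative only.

```python
from collections import Counter


def ngram_probabilities(tokens, n=2):
    """Return {ngram: probability} for all n-grams in a token sequence.

    Probability is the n-gram's count divided by the total number of
    n-grams of that length (an assumed definition).
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    total = len(ngrams)
    if total == 0:
        return {}
    counts = Counter(ngrams)
    return {" ".join(gram): count / total for gram, count in counts.items()}


# Hypothetical, already tokenized context corpus.
corpus_tokens = "big data analytics uses big data tools".split()
table = ngram_probabilities(corpus_tokens, n=2)

# Two-column output: the n-gram and its probability, most probable first.
for gram, probability in sorted(table.items(), key=lambda kv: -kv[1]):
    print(f"{gram}\t{probability:.3f}")
```

A real corpus would be run through the text processing utilities first, so that casing and punctuation do not split counts across variants of the same N-gram.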

Text Processing Utilities

This is a collection of tools and utilities used by all parts to filter, tokenize, stem, lemmatize, etc. text.
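A minimal sketch of such a preprocessing chain is shown below. It is not the project's code: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a deliberately crude suffix stripper standing in for a real stemmer or lemmatizer.

```python
import re

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "of", "to", "in", "on", "for", "is", "are"}
)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Filter out tokens that carry little meaning on their own."""
    return [t for t in tokens if t not in stop_words]


def naive_stem(token):
    """Strip a few common English suffixes (not a real stemming algorithm)."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words stay intact.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, filter, and stem a piece of text."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("Analyzing the data sets of users"))
```

Keeping each step as a separate function lets the training and dictionary parts reuse only the stages they need.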
