This repository has been archived by the owner on Nov 21, 2023. It is now read-only.
skoulouzis edited this page Nov 23, 2016 · 6 revisions

Overview

The EDISON COmpetencies ClassificatiOn (E-CO-2) system is a distributed, automated tool designed to support gap analysis. It measures the similarity of a document against a set of predefined categories. It can therefore be used to perform a gap analysis based on the EDISON DS-taxonomy and identify mismatches between education and industry. Moreover, students, practitioners, educators and other stakeholders can use these tools to identify gaps in their skills and competences.

System: EDISON-COmpetencies ClassificatiOn (E-CO^2)

Performs the following actions:

Train

  • Manually define each category. In practice, provide for each category a set of plain-text (.txt) files that contain keywords and definitions that represent the category. The quality of the classification depends on this step, so the text has to be concrete and representative and contain specific nouns. For example, expressions like "analyze large data sets and investigate possible solutions" are not concrete.

  • Perform term extraction on the text files to produce a list of terms, that is, identify the terms that characterize the subject or content of each file.

  • Generate TF-IDF values for the collection of extracted terms. The output is a table with the extracted terms as its header, in which each row contains the TF-IDF values for one processed document.

  • Sum the values of the table column by column and filter the result to create a single vector that represents the category.
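
The training steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the E-CO-2 implementation: the names extract_terms, tfidf_table and category_vector, the tiny stop-word list, the smoothed IDF formula and the min_weight threshold are all hypothetical, and term extraction is reduced to simple tokenization.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; a real term extractor is more elaborate.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "are", "with"}


def extract_terms(text):
    """Hypothetical term extraction: lowercase words, minus stop words and very short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]


def tfidf_table(documents):
    """Return (header, rows): extracted terms as header, one row of TF-IDF values per document."""
    term_lists = [extract_terms(d) for d in documents]
    n_docs = len(term_lists)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    header = sorted(doc_freq)
    rows = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms) or 1  # guard against empty documents
        # Assumption: smoothed IDF, so terms present in every file keep a non-zero weight.
        rows.append([
            (counts[t] / total) * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t in header
        ])
    return header, rows


def category_vector(documents, min_weight=0.0):
    """Sum the table column by column and keep only terms above min_weight."""
    header, rows = tfidf_table(documents)
    sums = [sum(column) for column in zip(*rows)]
    return {term: weight for term, weight in zip(header, sums) if weight > min_weight}


# Two short files describing one (made-up) category.
docs = ["machine learning and statistics", "statistics for data mining"]
vector = category_vector(docs)
```

Raising min_weight would drop low-weight terms, which is one simple way to do the filtering step; the criterion the tool actually applies is not stated here.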

Classify

  • Provide input text for classification
  • Do text filtering
  • Find overlapping terms
  • Calculate TF-IDF of terms
  • For each category vector calculate cosine similarity
  • The output is a table with the similarity for each category
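
The classification steps can be sketched in the same way. Again this is a minimal illustration under assumptions rather than the tool's actual code: filter_text, cosine and classify are hypothetical names, the category vectors are made-up values, "overlapping terms" is read as the input terms that also occur in some category vector, and the IDF of every term defaults to 1 unless an idf mapping is supplied.

```python
import math
import re
from collections import Counter


def filter_text(text):
    """Hypothetical text filtering: keep lowercase alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as term -> weight dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(text, category_vectors, idf=None):
    """Return a table (dict) mapping each category name to its similarity with the text."""
    idf = idf or {}
    # Overlapping terms: input tokens that appear in at least one category vector.
    vocabulary = set().union(*category_vectors.values())
    tokens = [t for t in filter_text(text) if t in vocabulary]
    counts = Counter(tokens)
    total = len(tokens) or 1  # guard against texts with no overlapping terms
    # Assumption: IDF defaults to 1.0 when no value is provided for a term.
    doc_vector = {t: (c / total) * idf.get(t, 1.0) for t, c in counts.items()}
    return {name: cosine(doc_vector, vec) for name, vec in category_vectors.items()}


# Made-up category vectors, as they might come out of the training phase.
categories = {
    "analytics": {"statistics": 1.0, "learning": 1.0},
    "engineering": {"database": 1.0, "pipeline": 1.0},
}
scores = classify("Statistics and machine learning", categories)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}\t{score:.3f}")
```

Because cosine similarity normalizes by vector length, a short input text and a long one are scored on the same scale, which suits comparing documents of very different sizes against the same categories.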