This repository has been archived by the owner on Nov 21, 2023. It is now read-only.
skoulouzis edited this page May 22, 2017 · 15 revisions

Term Extraction

To extract terms from a simple text file using a Hadoop cluster:

hadoop jar E-CO-2/training/target/training-1.0-SNAPSHOT-jar-with-dependencies.jar -op x -i <input_folder> -o <output_file.csv> -p E-CO-2/etc/configure.properties

The input_folder should contain text files relevant to the category we want to train for, for example expert definitions or Wikipedia articles. The quality of the classification depends on this step, so the text has to be concrete and representative and should contain specific nouns. Expressions like "analyze large data sets and investigate possible solutions", for example, are not concrete.

The output_file.csv is the CSV file to which the extracted terms are written.

For example, in order to create the category vector for the competence DSDA01 you can run:

hadoop jar E-CO-2/training/target/training-1.0-SNAPSHOT-jar-with-dependencies.jar -op x -i E-CO-2/Competences/data_analytics/data_analytics-DSDA01_predictive_analytics/ -o DSDA01.csv -p E-CO-2/etc/configure.properties

For more options, see the E-CO-2/etc/configure.properties file.
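When several competences need a category vector, the command above has to be repeated once per folder. The following is a minimal sketch of that loop, not part of E-CO-2 itself: `print_extract_cmds` is a hypothetical helper, and naming each CSV after its input folder is an assumption (the example above uses the shorter name DSDA01.csv). It only prints the commands so they can be reviewed before anything is submitted to the cluster.

```shell
#!/bin/sh
# Paths as used in the examples above; adjust them to your checkout.
JAR="E-CO-2/training/target/training-1.0-SNAPSHOT-jar-with-dependencies.jar"
PROPS="E-CO-2/etc/configure.properties"

# print_extract_cmds ROOT  (hypothetical helper, not shipped with E-CO-2)
# For every immediate subfolder of ROOT, print the hadoop command that would
# extract its terms into <subfolder-name>.csv (the naming is an assumption).
print_extract_cmds() {
    root=$1
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue      # glob matched nothing: skip
        name=$(basename "$dir")
        printf 'hadoop jar %s -op x -i %s -o %s.csv -p %s\n' \
            "$JAR" "$dir" "$name" "$PROPS"
    done
}
```

For instance, `print_extract_cmds E-CO-2/Competences/data_analytics` lists one command per competence folder; piping that output to `sh` would run them, provided the folder names contain no spaces.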

Classification

Once we have defined as many categories as we need, we can classify text documents against them. To classify simple text files using a Hadoop cluster, run:

hadoop jar E-CO-2/classification/target/classification-1.0-SNAPSHOT-jar-with-dependencies.jar -op c -i <textdocs_folder> -o <output_folder> -c <categories_folder> -p E-CO-2/etc/classification.properties

The textdocs_folder is the folder containing the text files that we want to classify. The output_folder is the folder where the results are saved; it must already exist. The categories_folder is the folder that contains the CSV files generated from the training step. Each CSV file must be in its own separate folder.

For example:

hadoop jar E-CO-2/classification/target/classification-1.0-SNAPSHOT-jar-with-dependencies.jar -op c -i textdocs -o output -c E-CO-2/Competences/ -p E-CO-2/etc/classification.properties
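The layout requirements above (existing output folder, one CSV file per category folder) are easy to get wrong, so it can help to check them before submitting the job. The sketch below is an illustration only: `check_classify_inputs` is a hypothetical helper, and it assumes all three folders are on the local filesystem rather than on HDFS.

```shell
#!/bin/sh
# check_classify_inputs DOCS OUT CATS  (hypothetical helper, not part of E-CO-2)
# Returns 0 when the three folders match the layout described above, 1 otherwise.
# Assumes local filesystem paths.
check_classify_inputs() {
    docs=$1
    out=$2
    cats=$3
    status=0
    [ -d "$docs" ] || { echo "missing text folder: $docs" >&2; status=1; }
    [ -d "$out" ]  || { echo "output folder must exist: $out" >&2; status=1; }
    [ -d "$cats" ] || { echo "missing categories folder: $cats" >&2; return 1; }
    # List every folder that holds more than one CSV file.
    dups=$(find "$cats" -type f -name '*.csv' -exec dirname {} \; | sort | uniq -d)
    if [ -n "$dups" ]; then
        echo "folders with more than one CSV file:" >&2
        echo "$dups" >&2
        status=1
    fi
    return "$status"
}
```

A possible use is `check_classify_inputs textdocs output E-CO-2/Competences/ && hadoop jar ...`, so the job is only submitted when the check passes.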

Run Service

To start the E-CO-2 REST service, run:

java -jar rest-1.0-SNAPSHOT-jar-with-dependencies.jar E-CO-2/etc/configure.properties

To keep the service running after logging out, you can use:

screen -dmSL e-co2 java -jar E-CO-2/rest/target/rest-1.0-SNAPSHOT-jar-with-dependencies.jar E-CO-2/etc/configure.properties

Classify Job Ad

curl -H "Content-Type: application/json" -X POST -d '{"title":"title","id":"job_102437518.txt","contents":"    Data Scientist Data Science  Analytics Hadoop  Java  AWS    Dublin London    A whole new digital world The world is a rapidly changing place in    which technology now has its place in our day to day lives  If you want    to be a part in shaping this digital future we want to hear from you  A    passion for a things digital is key  you will have you finger on the    pulse of the digital space and a true passion for technology  This role    will see you work on dynamic and innovative projects that will be in    the hearts and minds of the general public  You’ll never have to explain what you do again"}' http://localhost:9999/e-co2/classification/job

Because the classification may take several minutes, the service returns an ID that you can use to retrieve the result. If the result is not ready yet, you get back HTTP 202 (Accepted). If it is ready, you get back the answer. Once the result has been calculated, retrieve it with:

curl http://localhost:9999/e-co2/classification/$ID
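Since a 202 response means "try again later", retrieving the result usually involves polling. Here is a minimal sketch of such a loop. `wait_for_result`, the 10-second interval and the retry limit are all hypothetical choices, and treating HTTP 200 as the "ready" status is an assumption, since the text above only says that the answer is returned.

```shell
#!/bin/sh
# Base URL taken from the curl example above.
BASE_URL="http://localhost:9999/e-co2/classification"

# wait_for_result ID [INTERVAL_SECONDS] [MAX_TRIES]  (hypothetical helper)
# Polls the service until it stops answering 202, then prints the response body.
# Returns 0 on HTTP 200 (assumed "ready" status), 1 on any other status or timeout.
wait_for_result() {
    id=$1
    interval=${2:-10}
    max_tries=${3:-60}
    body=$(mktemp)
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        # -o saves the body to a file; -w prints only the HTTP status code.
        code=$(curl -s -o "$body" -w '%{http_code}' "$BASE_URL/$id")
        if [ "$code" != "202" ]; then
            cat "$body"
            rm -f "$body"
            if [ "$code" = "200" ]; then return 0; else return 1; fi
        fi
        tries=$((tries + 1))
        sleep "$interval"
    done
    rm -f "$body"
    echo "no result after $max_tries tries" >&2
    return 1
}
```

With the ID returned by the POST request stored in `ID`, calling `wait_for_result "$ID"` blocks until the classification is available or the retry limit is reached.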