To extract terms from a simple text file using a Hadoop cluster:
hadoop jar E-CO-2/training/target/training-1.0-SNAPSHOT-jar-with-dependencies.jar -op x -i <input_folder> -o <output_file.csv> -p E-CO-2/etc/configure.properties
The input_folder should contain text files relevant to the category we want to train for, such as expert definitions or Wikipedia articles. The quality of the classification depends on this step, so the text has to be concrete and representative and contain specific nouns. For example, expressions like "analyze large data sets and investigate possible solutions" are not concrete.
The output_file.csv is the output file that contains the extracted terms.
For example, in order to create the category vector for the competence DSDA01 you can run:
hadoop jar E-CO-2/training/target/training-1.0-SNAPSHOT-jar-with-dependencies.jar -op x -i E-CO-2/Competences/data_analytics/data_analytics-DSDA01_predictive_analytics/ -o DSDA01.csv -p E-CO-2/etc/configure.properties
For more options, see the E-COCO/etc/configure.properties file.
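To build vectors for several categories in one go, the extraction command can be wrapped in a loop. The following is a minimal sketch, assuming every direct subfolder of a parent directory holds the text files for one category and that the folder name is an acceptable name for the CSV. The function name `train_all`, the `HADOOP_CMD` variable and this folder layout are illustrative and not part of E-CO-2.

```shell
#!/usr/bin/env bash
# Sketch: run term extraction once per category folder.
# Assumption: every direct subfolder of $1 holds the text files of one category.
# Assumption: $2 already exists and -o accepts a path inside it.

# Allow overriding the launcher (e.g. HADOOP_CMD=echo for a dry run).
HADOOP_CMD="${HADOOP_CMD:-hadoop}"
JAR="E-CO-2/training/target/training-1.0-SNAPSHOT-jar-with-dependencies.jar"
PROPS="E-CO-2/etc/configure.properties"

train_all() {
    local parent="$1" out_dir="$2" dir name
    for dir in "$parent"/*/; do
        [ -d "$dir" ] || continue          # skip if the glob matched nothing
        name="$(basename "$dir")"
        # Same flags as the single-category command shown earlier.
        "$HADOOP_CMD" jar "$JAR" -op x -i "$dir" -o "$out_dir/$name.csv" -p "$PROPS"
    done
}

# Usage (hypothetical output folder):
#   train_all E-CO-2/Competences/data_analytics vectors
```

Setting `HADOOP_CMD=echo` before calling `train_all` prints the commands instead of running them, which is a cheap way to check the paths first.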
Once we have defined as many categories as we need, we can classify text documents against them. To classify simple text files using a Hadoop cluster, execute:
hadoop jar E-CO-2/classification/target/classification-1.0-SNAPSHOT-jar-with-dependencies.jar -op c -i <textdocs_folder> -o <output_folder> -c <categories_folder> -p E-CO-2/etc/classification.properties
The textdocs_folder is the folder containing the text files that we want to classify. The output_folder is the folder where the results are saved; it should already exist. The categories_folder is the folder that contains the CSV files generated in the training step. Each CSV file should be placed in its own separate folder.
For example:
hadoop jar E-CO-2/classification/target/classification-1.0-SNAPSHOT-jar-with-dependencies.jar -op c -i textdocs -o output -c E-CO-2/Competences/ -p E-CO-2/etc/classification.properties
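Since each category CSV should sit in its own folder, a flat directory of CSV files needs rearranging before classification. A minimal sketch for that step follows. The function name `prepare_categories` and the choice of naming each folder after its CSV file are assumptions made for illustration, not something E-CO-2 prescribes.

```shell
#!/usr/bin/env bash
# Sketch: copy every CSV in a flat directory into its own subfolder.
# Assumption: naming each folder after the CSV file (without extension) is acceptable.

prepare_categories() {
    local src="$1" dest="$2" csv name
    for csv in "$src"/*.csv; do
        [ -f "$csv" ] || continue            # skip if no CSV files matched
        name="$(basename "$csv" .csv)"
        mkdir -p "$dest/$name"
        cp "$csv" "$dest/$name/$name.csv"
    done
}

# Usage (hypothetical paths):
#   prepare_categories vectors categories
#   mkdir -p output        # the output folder should exist before classification
```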
To start the E-CO-2 REST service, run:
java -jar rest-1.0-SNAPSHOT-jar-with-dependencies.jar E-CO-2/etc/configure.properties
To keep the service running after logging out, you may use screen:
screen -dmSL e-co2 java -jar E-CO-2/rest/target/rest-1.0-SNAPSHOT-jar-with-dependencies.jar E-CO-2/etc/configure.properties
To submit a document for classification, POST it to the service as JSON:
curl -H "Content-Type: application/json" -X POST -d '{"title":"title","id":"job_102437518.txt","contents":" Data Scientist Data Science Analytics Hadoop Java AWS Dublin London A whole new digital world The world is a rapidly changing place in which technology now has its place in our day to day lives If you want to be a part in shaping this digital future we want to hear from you A passion for a things digital is key you will have you finger on the pulse of the digital space and a true passion for technology This role will see you work on dynamic and innovative projects that will be in the hearts and minds of the general public You’ll never have to explain what you do again"}' http://localhost:9999/e-co2/classification/job
Because the classification may take some minutes, you'll get back an ID which you can use to retrieve the result. If the result is not ready, you'll get back 202 (Accepted). If it's ready, you'll get back the answer. Once the result has been calculated, retrieve it with:
curl http://localhost:9999/e-co2/classification/$ID
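Because the service answers 202 until the result is ready, a client usually has to poll. The following is a minimal sketch of such a loop, assuming the service returns HTTP 200 together with the answer once classification has finished. The function name `wait_for_result`, the `BASE_URL` variable, and the delay and retry defaults are illustrative choices, not part of E-CO-2.

```shell
#!/usr/bin/env bash
# Sketch: poll the classification endpoint until the result is ready.
# Assumption: the service answers 200 with the result once it is ready.

BASE_URL="${BASE_URL:-http://localhost:9999/e-co2/classification}"

wait_for_result() {
    local id="$1" delay="${2:-10}" max_tries="${3:-60}"
    local body status try=0
    body="$(mktemp)"
    while [ "$try" -lt "$max_tries" ]; do
        # -s: silent, -o: write the body to a file, -w: print only the HTTP status code
        status="$(curl -s -o "$body" -w '%{http_code}' "$BASE_URL/$id")"
        case "$status" in
            200) cat "$body"; rm -f "$body"; return 0 ;;
            202) sleep "$delay" ;;                      # accepted, not ready yet
            *)   echo "unexpected HTTP status: $status" >&2
                 rm -f "$body"; return 1 ;;
        esac
        try=$((try + 1))
    done
    rm -f "$body"
    echo "gave up after $max_tries attempts" >&2
    return 1
}

# Usage: wait_for_result "$ID"
```

The loop is bounded so that a job that never completes does not block the caller forever; the second and third arguments adjust the delay in seconds and the number of attempts.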