Scala Project ML Career Service

CSYE7200 Scala Project : ML Career Service

Introduction

MLBC is a web-based system for candidates who are seeking a job and recruiters who are looking for the right talent.
The platform scrapes and analyzes publicly available job posting data from sites such as Glassdoor and LinkedIn.
The system recommends the job title that best fits a candidate based on their resume, and provides links to suitable job postings.
The platform also provides statistical information about the latest job trends.

Team Details

Name	NUID	Email Address
Menita Koonani	001883043	[email protected]
Raghavi Kirouchenaradjou	001826638	[email protected]
Sreerag Mandakathil Sreenath	001838559	[email protected]

Use case

Actor :
- A user (in this case, a graduate IS student) initiates the interaction with the web API
Action :
- The user uploads their resume in PDF or text format to the web API
Reaction :
- The system recommends the best job titles for the user based on the skills mentioned in the resume. It also sends URLs for applying to the suggested job titles, predicts the expected salary for different profiles, and indicates which profile best suits the resume

Project Stack

Sno.	Task	Library
1.	Web Scraping	Python : Scrapy, BeautifulSoup and LXML
2.	Data Cleaning	Scala/Spark : RegexTokenizer and StopWordsRemover
3.	Classification and ML modeling	Scala/Spark : NLTK and Classification
4.	Web Server and REST API	Scala : Akka HTTP

Web Scraping

Indeed.com is scraped with Python for three job titles (Software Engineer, Data Scientist, and Technical Writer), collecting information on about 100 postings per job title.

Attributes that are scraped for every job title:

Title
Url
Description
Company
Salary
Location

These attributes are scraped using libraries such as Scrapy and Beautiful Soup and are stored as a JSON file.
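As a rough illustration of what one stored record might look like, the sketch below models the six attributes and round-trips them through JSON. The field names are an assumption: they reuse the names that appear in the web service response (such as job_posting_title) and the job_posting_desc column mentioned under Data Cleaning, and the sample values and URL are placeholders rather than scraped data.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class JobPosting:
    """One scraped posting; field names are assumed, not taken from the scraper."""
    job_posting_title: str
    job_posting_url: str
    job_posting_desc: str
    company: str
    job_posting_salary: Optional[str]  # None when the posting lists no salary
    location: str


def to_json(postings: List[JobPosting]) -> str:
    """Serialize a list of postings to a JSON array."""
    return json.dumps([asdict(p) for p in postings], indent=2)


def from_json(text: str) -> List[JobPosting]:
    """Rebuild postings from the JSON array produced by to_json."""
    return [JobPosting(**record) for record in json.loads(text)]


# Placeholder record, not real scraped data.
sample = [
    JobPosting(
        job_posting_title="Software Quality Engineer",
        job_posting_url="https://example.com/jobs/1",
        job_posting_desc="Test and validate software releases.",
        company="Example Corp",
        job_posting_salary=None,
        location="Boston, MA",
    )
]
encoded = to_json(sample)
restored = from_json(encoded)
```

Keeping the salary optional mirrors the response example, where some postings carry no salary.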

Data Cleaning

The scraped file is cleaned in Scala to remove punctuation and stopwords from the job_posting_desc column in order to make it fit for training the model.
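The project performs this step with Spark's RegexTokenizer and StopWordsRemover. The plain-Python sketch below only illustrates the idea of the two stages, splitting on non-word characters and then dropping stop words. The function names and the short stop-word list are hypothetical; Spark ships its own, much larger default list.

```python
import re
from typing import List

# Tiny illustrative stop-word list (an assumption, not Spark's default list).
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "of", "in", "to", "for", "with", "is", "are", "on"}
)


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split on runs of non-word characters.

    Splitting on \\W+ discards punctuation; empty tokens are dropped.
    """
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def remove_stop_words(tokens: List[str]) -> List[str]:
    """Keep only the tokens that are not in the stop-word list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def clean_description(text: str) -> List[str]:
    """Apply both stages to one job description."""
    return remove_stop_words(tokenize(text))


cleaned = clean_description(
    "Design, build, and test the APIs for a large-scale platform."
)
# cleaned == ["design", "build", "test", "apis", "large", "scale", "platform"]
```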

Model Training Result

The model is trained with a Naive Bayes classifier on both the raw and the cleaned data, which were scraped from indeed.com on 04-11-2019.
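To make the classification step concrete, here is a minimal multinomial Naive Bayes sketch with add-one smoothing, written in plain Python rather than Spark. It is not the project's training code: the class name, the toy training rows, and the choice to ignore words unseen during training are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, samples: List[Tuple[List[str], str]]) -> "NaiveBayes":
        # Count documents per label and word occurrences per label.
        self.label_counts = Counter(label for _, label in samples)
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        for tokens, label in samples:
            self.word_counts[label].update(tokens)
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        self.total = len(samples)
        return self

    def predict(self, tokens: List[str]) -> str:
        best_label, best_score = "", -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus the smoothed log likelihood of each known word.
            score = math.log(doc_count / self.total)
            for tok in tokens:
                if tok in self.vocabulary:  # words never seen in training are skipped
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy rows for illustration only; the real model is trained on scraped postings.
training = [
    (["java", "scala", "backend", "api"], "software engineer"),
    (["python", "statistics", "machine", "learning"], "data scientist"),
    (["documentation", "editing", "manuals"], "technical writer"),
]
model = NaiveBayes().fit(training)
predicted = model.predict(["scala", "api", "testing"])
```

Working in log space avoids numeric underflow when many word probabilities are multiplied together.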

The data was split in a 3:1 ratio for testing and training.
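A 3:1 split can be sketched as follows. The sketch assumes the larger share goes to training, which is the usual convention but is not stated explicitly above, and the function name, seed, and row count are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    rows: Sequence[T], train_parts: int = 3, test_parts: int = 1, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle the rows with a fixed seed and cut them train_parts:test_parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]


# Row indices stand in for the scraped postings.
train, test = split_train_test(range(300))
# len(train) == 225 and len(test) == 75
```

This version cuts at an exact index, so the sizes are exactly 3:1. Spark's randomSplit, if that is what the project used, assigns rows probabilistically, so its split sizes are only approximately in the requested ratio.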

The accuracy is as follows:

Sno.	Type of data	Accuracy
1.	indeed 04-11-19 (raw)	87.75%
2.	indeed 04-11-19 (cleaned)	90.48%

Web Service

Developing web APIs using Akka HTTP

Technology Stack

Akka HTTP: The Akka HTTP modules implement a full server-side and client-side HTTP stack on top of akka-actor and akka-stream.
PDFTextStripper: This class takes a PDF document and strips out all of the text, ignoring the formatting.

Library	Version
akka-actor	2.5.13
akka-http	10.1.3
akka-stream	2.5.13
pdfbox	2.0.1
fontbox	2.0.1

Final Project Execution

To execute the project, navigate to the final_project folder and open it in IntelliJ:

\Scala_Project_ML_Career_Service\final_project\

Run it via Postman or Advanced REST Client

Method : POST
Endpoint : http://localhost:9000/pdf_file
Body (form-encoded): Upload a PDF file with "filePdfUpload" as the key and the path to the PDF file as the value
Response : A JSON document listing the predicted job title and the available job links, as follows

{
    "Predicted Role": "software engineer",
    "Available Jobs": [
        {
            "job_posting_title": "Software Quality Engineer",
            "company": "DELL",
            "location": "Boston, MA",
            "job_posting_url": "https://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0DhVAxkc_TxySVbUOs6bxWYWOfhmDTNcVTjFFBAY1FXZ_f-lnuRL7vGmhrcjkjTSE3fin6ve_ms4_9mScuaceMLDlH5RM-fUHmHxZE5PndrOse_GkPZwuCVyi6uzk699vmQcNe663vhzNZYMDTKXCuX_SXq9blbeu-m_sPFggUQmSJ3v2d4J7fnz01fJbN01w-4bzs4qhEzv8nkSkWNqEwrSGidIZ4UO13FjMoRpSx5MCV6FY8nm0BD3WFo6ZkQjgkPxQ4F4_N8f-MKYhPbXYZZmB0dwoVPXx02RGboZ7Ar7vzZtO8Lza-DXlpN05wt-_4ghGoPuQe6TKMc4NBW8q4mIDORxZ-dUoVWA76HFzHSSMDBlBWlT5fVyILzmdm7FEoPlA_waWP3GgMYUwRaqm204uEfvZJ3Osdb8u9BnQp4cw==&vjs=3&p=9&fvj=1",
            "job_posting_salary": "None"
        },
        {
            "job_posting_title": "User Interface Software Engineer",
            "company": "JW Fishers Mfg., Inc.",
            "location": "Boston, MA",
            "job_posting_url": "https://www.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0DhVAxkc_TxySVbUOs6bxWYWOfhmDTNcVTjFFBAY1FXZ_f-lnuRL7vGmhrcjkjTSE3fin6ve_ms4_9mScuaceMLDlH5RM-fUHmHxZE5PndrOse_GkPZwuCVyi6uzk699vmQcNe663vhzNZYMDTKXCuX_SXq9blbeu-m_sPFggUQmSJ3v2d4J7fnz01fJbN01w-4bzs4qhEzv8nkSkWNqEwrSGidIZ4UO13FjMoRpSx5MCV6FY8nm0BD3WFo6ZkQjgkPxQ4F4_N8f-MKYhPbXYZZmB0dwoVPXx02RGboZ7Ar7vzZtO8Lza-DXlpN05wt-_4ghGoPuQe6TKMc4NBW8q4mIDORxZ-dUoVWA76HFzHSSMDBlBWlT5fVyILzmdm7FEoPlA_waWP3GgMYUwRaqm204uEfvZJ3Osdb8u9BnQp4cw==&vjs=3&p=9&fvj=1",
            "job_posting_salary": "80000 - 120000"
        }
    ]
}
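For readers who prefer a script to Postman, the sketch below builds the same POST request with Python's standard library. It assumes the server expects a multipart/form-data body, which is how Postman sends file fields, and the function name and placeholder PDF bytes are hypothetical. The endpoint and the "filePdfUpload" key come from the request description above.

```python
import uuid
import urllib.request

ENDPOINT = "http://localhost:9000/pdf_file"  # endpoint from the request description
FIELD_NAME = "filePdfUpload"  # form key from the request description


def build_upload_request(
    filename: str, pdf_bytes: bytes, url: str = ENDPOINT
) -> urllib.request.Request:
    """Build a POST request carrying one file part (multipart is an assumption)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{FIELD_NAME}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url,
        data=head + pdf_bytes + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


# Placeholder bytes; in practice read them with open(path, "rb").read().
request = build_upload_request("resume.pdf", b"%PDF-1.4 placeholder bytes")

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```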

If the uploaded file is not a PDF, the service returns the following:

{
    "errorMessage": "Error in file uploading:only PDF allowed"
}
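A client has to handle both response shapes shown above. The following sketch parses either one and turns it into readable lines. The helper names are hypothetical, the salary parsing assumes the two formats seen in the example ("None" and "low - high"), and the job URLs in the sample are shortened placeholders.

```python
import json
from typing import List, Optional, Tuple


def parse_salary(raw: str) -> Optional[Tuple[int, int]]:
    """Turn "80000 - 120000" into (80000, 120000); anything else gives None."""
    parts = [p.strip() for p in raw.split("-")]
    if len(parts) == 2 and all(p.isdigit() for p in parts):
        return int(parts[0]), int(parts[1])
    return None


def summarize(response_text: str) -> List[str]:
    """Describe a service response, whether it is a prediction or an error."""
    payload = json.loads(response_text)
    if "errorMessage" in payload:
        return [f"Upload failed: {payload['errorMessage']}"]
    lines = [f"Predicted role: {payload['Predicted Role']}"]
    for job in payload["Available Jobs"]:
        salary = parse_salary(job["job_posting_salary"])
        salary_text = f"{salary[0]}-{salary[1]}" if salary else "salary not listed"
        lines.append(
            f"{job['job_posting_title']} at {job['company']} "
            f"({job['location']}), {salary_text}"
        )
    return lines


# Same shape as the example response, with placeholder URLs.
success = json.dumps({
    "Predicted Role": "software engineer",
    "Available Jobs": [
        {"job_posting_title": "Software Quality Engineer", "company": "DELL",
         "location": "Boston, MA", "job_posting_url": "https://example.com/1",
         "job_posting_salary": "None"},
        {"job_posting_title": "User Interface Software Engineer",
         "company": "JW Fishers Mfg., Inc.", "location": "Boston, MA",
         "job_posting_url": "https://example.com/2",
         "job_posting_salary": "80000 - 120000"},
    ],
})
failure = json.dumps({"errorMessage": "Error in file uploading:only PDF allowed"})
success_lines = summarize(success)
failure_lines = summarize(failure)
```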

Notes

To learn more:

Akka HTTP : https://doc.akka.io/docs/akka-http/current/
PDFTextStripper : https://pdfbox.apache.org/docs/2.0.7/javadocs/org/apache/pdfbox/text/PDFTextStripper.html

Data Cleaning:

Scala:

StopWordsRemover: https://spark.apache.org/docs/latest/ml-features.html#stopwordsremover
RegexTokenizer: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer

Python:

scrapy: https://docs.scrapy.org/en/latest/intro/overview.html
BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
