Eggint #5

Open · wants to merge 5 commits into master
70 changes: 10 additions & 60 deletions README.md
@@ -1,70 +1,20 @@
ML Challenge Markdown
# Wave Machine Learning Engineer Challenge
Applicants for the Software Engineer (and Senior), Machine Learning (https://wave.bamboohr.co.uk/jobs/view.php?id=1) role at Wave must complete the following challenge and submit a solution prior to the onsite interview.

The purpose of this exercise is to create something that we can work on together during the onsite. We do this so that you get a chance to collaborate with Wavers during the interview in a situation where you know something better than us (it's your code, after all!)
### Instructions on how to run the application

There isn't a hard deadline for this exercise; take as long as you need to complete it. However, in terms of total time spent actively working on the challenge, we ask that you not spend more than a few hours, as we value your time and are happy to leave things open to discussion in the onsite interview.
There are three main functions: get_data() parses the files and prepares the data for the models; category_prediction() runs the models that assign each expense transaction to one of the predefined categories (two modes are provided: mode 1 uses logistic regression on Spark, while mode 2 applies a neural network); and expense_type_prediction() predicts the type (personal or business) of each transaction. A rough usage sketch is shown below.
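As an illustration only (the module name `main` and the argument names are assumptions, not taken from the submission), the three entry points described above might be wired together roughly like this:

```python
# Hypothetical driver script; the function names come from this README,
# but the module name and the arguments are assumptions.
from main import get_data, category_prediction, expense_type_prediction

# Parse the CSV files and prepare features/labels for the models.
train, valid = get_data("training_data_example.csv",
                        "validation_data_example.csv")

# Mode 1: logistic regression on Spark; mode 2: neural network.
category_prediction(train, valid, mode=1)

# Predict whether each transaction is a personal or business expense.
expense_type_prediction(train, valid)
```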

Please use whatever programming language, libraries and frameworks you feel the most comfortable with. The preference here at Wave is Python.
To further explain the models and their components, comments are included in the code, covering the assumptions used to assign the original expense types.

Feel free to email [[email protected]]([email protected]) if you have any questions.
### A paragraph or two about what algorithm was chosen for which problem, why (including pros/cons) and what you are particularly proud of in your implementation, and why

## Project Description
Continued improvements in automation and enhancements to the user experience are key to what makes Wave successful. Simplifying the lives of our customers through automation is a key initiative for the machine learning team. Your task is to solve the following questions around automation.
In this exercise, logistic regression and a neural network were applied. Logistic regression is a simple but powerful traditional algorithm that works well on large datasets; since only a small sample was given here, its predictions are not optimal. The neural network, on the other hand, has greater potential to solve complicated problems, but it requires more computing power and time, its results are not as easily interpreted as those of traditional algorithms, and it is unstable on small datasets. A sketch of the Spark-based mode is shown below.
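As a minimal sketch of what the Spark-based logistic regression mode could look like (the column names `description` and `category` and the feature pipeline below are assumptions, not details from the submission):

```python
# Rough sketch of a logistic-regression pipeline on Spark (mode 1).
# Column names and feature choices here are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, StringIndexer
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("expense-categories").getOrCreate()
train_df = spark.read.csv("training_data_example.csv", header=True, inferSchema=True)
valid_df = spark.read.csv("validation_data_example.csv", header=True, inferSchema=True)

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="category", outputCol="label", handleInvalid="keep"),
    Tokenizer(inputCol="description", outputCol="words"),   # split free-text descriptions
    HashingTF(inputCol="words", outputCol="features"),       # bag-of-words features
    LogisticRegression(maxIter=50),
])

model = pipeline.fit(train_df)
model.transform(valid_df).select("description", "prediction").show()
```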

### What your learning application must do:
In addition to the choice of algorithms, one example of error handling is included to make the code more robust (a sketch of the pattern is shown below).
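Purely to illustrate the kind of error handling meant here (this is not the submission's actual code; the helper name and messages are made up), input loading might guard against missing or malformed files like this:

```python
# Illustrative error handling around CSV loading; not the submission's code.
import csv
import sys

def load_rows(path):
    """Read a CSV file and return its rows, failing with a clear message on common errors."""
    try:
        with open(path, newline="") as f:
            return list(csv.DictReader(f))
    except FileNotFoundError:
        sys.exit(f"Input file not found: {path}")
    except csv.Error as exc:
        sys.exit(f"Could not parse {path}: {exc}")

rows = load_rows("training_data_example.csv")
```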

1. Your application must be able to read the provided comma separated files.
### Overall performance of your algorithm(s)

2. Similarly, your application must accept a separate comma separated file as validation data with the same format.
3. You can make the following assumptions:
* Columns will always be in that order.
* There will always be data in each column.
* There will always be a header line.
The limitation here is mainly the size of the given dataset (only 24 training entries and 12 validation entries).

Example input files named `training_data_example.csv`, `validation_data_example.csv` and `employee.csv` are included in this repo. Sample code, `file_parser.py`, is provided in Python to help you get started with loading all the files. You are welcome to use it if you like; a minimal loading sketch is also shown below.
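As a minimal loading sketch (this is not the repo's `file_parser.py`, and the use of pandas is an assumption), the three files could be read like this:

```python
# Minimal loading sketch; not the repo's file_parser.py.
import pandas as pd

training = pd.read_csv("training_data_example.csv")
validation = pd.read_csv("validation_data_example.csv")
employees = pd.read_csv("employee.csv")

# Each file has a header line, so column names come straight from the CSVs.
print(training.columns.tolist())
print(training.shape, validation.shape, employees.shape)
```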

1. Your application must parse the given files.
2. Your application should train only on the training data but report on its performance for both data sets.
3. You are free to define appropriate performance metrics, in addition to any predefined ones, that fit the problem and the chosen algorithm.
4. You are welcome to answer one or more of the following questions. Also, you are free to drill down further on any of these questions by providing additional insights.

Your application should be easy to run, and should run on either Linux or Mac OS X. It should not require any non open-source software.

There are many ways and algorithms to solve these questions; we ask that you approach them in a way that showcases one of your strengths. We're happy to tweak the requirements slightly if it helps you show off one of your strengths.

### Questions to answer:
1. Train a learning model that assigns each expense transaction to one of the set of predefined categories and evaluate it against the validation data provided. The set of categories is that found in the "category" column in the training data. Report on accuracy and at least one other performance metric.
2. Mixing of personal and business expenses is a common problem for small businesses. Create an algorithm that can separate any potential personal expenses in the training data. Labels of personal and business expenses were deliberately not given as this is often the case in our system. There is no right answer, so it is important that you state any assumptions you have made (one possible assumption-based heuristic is sketched after this list).
3. (Bonus) Train your learning algorithm for one of the above questions in a distributed fashion, such as using Spark. Here, you can assume either the data or the model is too large to be processed efficiently on a single computer.
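Purely as an example of the kind of assumption question 2 asks for (the column name `expense description` and the keyword list below are hypothetical, not taken from the data or the submission), a simple rule-based labelling pass could look like this:

```python
# Hypothetical rule-based labelling of personal vs. business expenses.
# The column name and the keyword list are assumptions, not part of the data spec.
import pandas as pd

PERSONAL_HINTS = ("coffee", "grocery", "movie", "gym")

def guess_expense_type(description: str) -> str:
    """Label an expense as personal if its description matches a personal keyword."""
    text = description.lower()
    return "personal" if any(hint in text for hint in PERSONAL_HINTS) else "business"

training = pd.read_csv("training_data_example.csv")
training["expense type"] = training["expense description"].map(guess_expense_type)
print(training["expense type"].value_counts())
```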

### Documentation:

Please modify `README.md` to add:

1. Instructions on how to run your application
2. A paragraph or two about what algorithm was chosen for which problem, why (including pros/cons) and what you are particularly proud of in your implementation, and why
3. Overall performance of your algorithm(s)

## Submission Instructions

1. Fork this project on GitHub. You will need to create an account if you don't already have one.
2. Complete the project as described below within your fork.
3. Push all of your changes to your fork on GitHub and submit a pull request.
4. You should also email [[email protected]]([email protected]) and your recruiter to let them know you have submitted a solution. Make sure to include your GitHub username in your email (so we can match applicants with pull requests).

## Alternate Submission Instructions (if you don't want to publicize completing the challenge)
1. Clone the repository.
2. Complete your project as described below within your local repository.
3. Email a patch file to [[email protected]]([email protected])

## Evaluation
Evaluation of your submission will be based on the following criteria.

1. Did you follow the instructions for submission?
2. Did you apply an appropriate machine learning algorithm to the problem, and did you explain why you chose it?
3. What features in the data set were used and why?
4. What design decisions did you make when designing your models? Why (i.e. were they explained)?
5. Did you separate any concerns in your application? Why or why not?
6. Does your solution use appropriate datatypes for the problem as described?
For category prediction on the validation dataset, logistic regression achieves 0.75 for precision, recall and F1 score, while the weighted scores range from 0.66 to 0.75; the neural network reaches a best accuracy of 0.5 with a lowest loss of 1.3.
For expense type prediction on the validation dataset, accuracy peaks at 0.83 with a loss of 0.45. A sketch of how such metrics can be computed is shown below.
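As a sketch of how numbers like these could be produced (scikit-learn is an assumption here; the labels below are placeholders, and the submission may compute its metrics differently):

```python
# Sketch of metric reporting with scikit-learn; the submission may differ.
from sklearn.metrics import accuracy_score, classification_report

y_true = ["Travel", "Meals", "Travel", "Office Supplies"]   # placeholder labels
y_pred = ["Travel", "Travel", "Travel", "Office Supplies"]  # placeholder predictions

print("accuracy:", accuracy_score(y_true, y_pred))
# Per-class and weighted precision / recall / F1, the kind of summary reported above.
print(classification_report(y_true, y_pred, zero_division=0))
```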