From b414257ddf947e85661d5d9d59c88510395edaff Mon Sep 17 00:00:00 2001
From: Asif Iqbal
Date: Wed, 30 May 2018 01:35:09 -0400
Subject: [PATCH 01/13] Initial commit

---
 clasification.ipynb                           |  982 ++++++++
 ...sification_attemp_with_deep_learning.ipynb | 1463 ++++++++++++
 clustering_personal_business.ipynb            | 2076 +++++++++++++++++
 luigi_solution.py                             |   28 +
 pre_processor.py                              |   94 +
 predictor.py                                  |   93 +
 run.sh                                        |    1 +
 spark_random_forest.ipynb                     |  391 ++++
 8 files changed, 5128 insertions(+)
 create mode 100644 clasification.ipynb
 create mode 100644 classification_attemp_with_deep_learning.ipynb
 create mode 100644 clustering_personal_business.ipynb
 create mode 100644 luigi_solution.py
 create mode 100644 pre_processor.py
 create mode 100644 predictor.py
 create mode 100755 run.sh
 create mode 100644 spark_random_forest.ipynb

diff --git a/clasification.ipynb b/clasification.ipynb
new file mode 100644
index 0000000..2fd59ff
--- /dev/null
+++ b/clasification.ipynb
@@ -0,0 +1,982 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Classification"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The classification solution tries to address the problem:\n",
+ "\n",
+ "*Train a learning model that assigns each expense transaction to one of the set of predefined categories and evaluate it against the validation data provided. The set of categories are those found in the \"category\" column in the training data. Report on accuracy and at least one other performance metric.*\n",
+ "\n",
+ "For this, my approach is to first load the data, do a bit of exploratory analysis, then apply several classifiers (Logistic Regression, Naive Bayes, SVM, XGBoost, Random Forest) to the data, do performance evaluation, and see which (or a combination of which) performs best. We will try ensembling several classifiers, since ensembles are known to perform better in general."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Necessary imports"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "from collections import Counter\n",
+ "import itertools\n",
+ "\n",
+ "from sklearn import preprocessing\n",
+ "from sklearn.preprocessing import LabelEncoder\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n",
+ "from sklearn.feature_extraction.text import TfidfTransformer\n",
+ "from sklearn.naive_bayes import GaussianNB, MultinomialNB\n",
+ "from sklearn.neighbors import KNeighborsClassifier\n",
+ "from sklearn import svm\n",
+ "from xgboost import XGBClassifier\n",
+ "from sklearn.ensemble import RandomForestClassifier\n",
+ "from sklearn.model_selection import GridSearchCV, LeaveOneOut\n",
+ "from sklearn.metrics import classification_report\n",
+ "from sklearn.cluster import KMeans\n",
+ "\n",
+ "import matplotlib.pyplot as plt\n",
+ "%matplotlib inline\n",
+ "from sklearn.metrics import confusion_matrix"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Load data from CSV to pandas dataframes"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df_train = pd.read_csv('training_data_example.csv')\n",
+ "df_val = pd.read_csv('validation_data_example.csv')\n",
+ "df_employee = pd.read_csv('employee.csv')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df_employee['city'] = df_employee['employee address'].apply(lambda x: x.split(',')[1])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Preprocessing and Observations for feature engineering\n",
+ "\n",
+ "We will join the Employee data with both the Training and Validation data separately, to pull possibly useful information from the Employee data set. Some useful observations:\n",
+ "\n",
+ "- **Role** and **Address** are useful pieces of information in Employee. We could extract the **city** from the address; however, the **tax name** information already contains the province/country, so we can work with that instead.\n",
+ "- **date** is useful information. However, it will possibly be better if we extract the **month** and **day of week** from the date. It's possible that people have certain spending habits in certain months or on certain days of the week (like weekends).\n",
+ "- Perhaps the most useful piece of data for tackling this problem is the **expense description** field. Intuitively, an expense's category should correlate more strongly with this field than with any other; however, that assumption is not necessarily true in practice.\n",
+ "- **expense description** is a free-form text field whereas all other fields are numerical/categorical. So it makes sense to vectorize and analyze this field separately and later combine it with the other features.\n",
+ "\n",
+ "So we will first exclude the **expense description** field from our preprocessing step, and vectorize the rest of the useful fields, which are:\n",
+ "\n",
+ "- day_of_week\n",
+ "- month\n",
+ "- employee id\n",
+ "- pre-tax amount\n",
+ "- role\n",
+ "\n",
+ "As part of this we will also **one-hot encode** the categorical variables (the ones other than **pre-tax amount**).\n",
+ "\n",
+ "Also, since pre-tax amount is a real-valued continuous variable, we will scale it to 0-1 as a means of feature normalization, to avoid any bias in classification towards this large-valued field.\n",
+ "\n",
+ "Let's define a method for preprocessing."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def pre_process(df, columns_to_drop=['date',\n",
+ "                                     'category', \n",
+ "                                     'tax amount', \n",
+ "                                     'expense description']):\n",
+ "    \n",
+ "    df['day_of_week'] = pd.to_datetime(df['date']).apply(lambda x: x.weekday()).astype(str)  # str so that it is treated as categorical\n",
+ "    df['month'] = pd.to_datetime(df['date']).apply(lambda x: x.month).astype(str)\n",
+ "    df = pd.merge(df, df_employee[['employee id', 'role']], how='inner', on=['employee id'])\n",
+ "    df['employee id'] = df['employee id'].astype(str)\n",
+ "    df = df.drop(columns_to_drop, axis=1)\n",
+ "    \n",
+ "    # one-hot encode the categorical variables\n",
+ "    df = pd.get_dummies(df)\n",
+ "    \n",
+ "    df['pre-tax amount'] = preprocessing.minmax_scale(df[['pre-tax amount']])\n",
+ "    \n",
+ "    return df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now obtain the preprocessed training and validation data, and fix any alignment issues and missing values in the validation data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "x_train = pre_process(df_train)\n",
+ "x_val = pre_process(df_val)\n",
+ "x_train, x_val = x_train.align(x_val, join='left', axis=1)\n",
+ "x_val = x_val.fillna(0)"
+ ]
+ },
" + ], + "text/plain": [ + " pre-tax amount employee id_1 employee id_3 employee id_4 employee id_5 \\\n", + "0 0.004514 0 0.0 0 0.0 \n", + "1 0.055918 1 0.0 0 0.0 \n", + "2 0.047141 1 0.0 0 0.0 \n", + "3 1.000000 0 0.0 1 0.0 \n", + "4 0.002006 0 0.0 1 0.0 \n", + "\n", + " employee id_6 employee id_7 tax name_CA Sales tax tax name_NY Sales tax \\\n", + "0 0 1 0 1 \n", + "1 0 0 1 0 \n", + "2 0 0 1 0 \n", + "3 0 0 1 0 \n", + "4 0 0 1 0 \n", + "\n", + " day_of_week_0 ... day_of_week_5 day_of_week_6 month_10 \\\n", + "0 0 ... 0 0 0 \n", + "1 0 ... 1 0 0 \n", + "2 1 ... 0 0 0 \n", + "3 0 ... 0 0 0 \n", + "4 0 ... 0 0 0 \n", + "\n", + " month_11 month_12 month_9 role_CEO role_Engineer role_IT and Admin \\\n", + "0 1 0 0 0 0 0.0 \n", + "1 1 0 0 1 0 0.0 \n", + "2 1 0 0 1 0 0.0 \n", + "3 0 0 1 0 0 0.0 \n", + "4 0 0 1 0 0 0.0 \n", + "\n", + " role_Sales \n", + "0 1 \n", + "1 0 \n", + "2 0 \n", + "3 1 \n", + "4 1 \n", + "\n", + "[5 rows x 24 columns]" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x_val.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now for the text field **expense description**, we vectorize it using **TFIdfVectorizer**. Although the vectorizer provides a sparse matrix, we densify it so that we can merge it to the non-text features based vectorized data obtained previously." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "vectorizer = TfidfVectorizer(stop_words='english')\n", + "vectorizer.fit(df_train['expense description'])\n", + "x_train_tfidf = vectorizer.transform(df_train['expense description']).toarray()\n", + "x_val_tfidf = vectorizer.transform(df_val['expense description']).toarray()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now concatenate the text and non-text features to get the final vectorized datasets." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "x_train = np.concatenate((x_train.values, x_train_tfidf), axis=1)\n", + "x_val = np.concatenate((x_val.values, x_val_tfidf), axis=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now labelize the **categories** of the training and validation data." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array(['Computer - Hardware', 'Computer - Software',\n", + " 'Meals and Entertainment', 'Office Supplies', 'Travel'], dtype=object)" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "lencoder = LabelEncoder()\n", + "lencoder.fit(df_train['category'])\n", + "names = set(df_val['category']) # label names to be used later\n", + "y_train = lencoder.transform(df_train['category'])\n", + "y_val = lencoder.transform(df_val['category'])\n", + "val_categoroes = []\n", + "for clazz in lencoder.classes_:\n", + " if clazz in names:\n", + " val_categoroes.append(clazz)\n", + "lencoder.classes_" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Observation on the category distribution and possible performatce metric\n", + "\n", + "Interestingly, it turns out that the distribution of the **category** variable is not even. It's pretty skewed as is evident from the flowwing figure. 
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Observation on the category distribution and a possible performance metric\n",
+ "\n",
+ "Interestingly, it turns out that the distribution of the **category** variable is not even. It's pretty skewed, as is evident from the following figure. The distribution is largely dominated by **Meals and Entertainment**, followed by the others, which is an indication that **accuracy** alone is not going to be a reliable performance metric. We will compute **precision**, **recall** and **f1 score** measurements on the result."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "[base64 PNG omitted: bar chart of the category value counts in the training data]",
+ "text/plain": [
+ "[matplotlib bar chart]"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "f = pd.value_counts(df_train['category']).plot.bar()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "So now that we have the vectorized features ready, it's time to train and evaluate performance."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Model fitting and evaluation\n",
+ "\n",
+ "Our plan is to try out the following classifiers and then ensemble them at the end if possible:\n",
+ "- Logistic Regression\n",
+ "- Multinomial Naive Bayes\n",
+ "- SVM\n",
+ "- Gradient Boosting (XGBoost)\n",
+ "- Random Forest\n",
+ "\n",
+ "For each of these, we will try fitting the models on the training set in two ways - with and without **cross validation**. Also, as mentioned earlier, since the **expense description** field is intuitively the most expressive field, we will be fitting the models on two feature sets - all features and only the text features. So overall, we will be fitting each model 4 times:\n",
+ "- All features and no cross validation\n",
+ "- All features with cross validation\n",
+ "- Only text features and no cross validation\n",
+ "- Only text features with cross validation\n",
+ "\n",
+ "And finally we will take whichever of these performs best on the validation data. Since the datasets are small, we will be using **leave-one-out** cross validation. For cross-validation purposes, we will be using **GridSearchCV** to try out different combinations of parameters. (These combinations are not exhaustive, of course; I will only try out a few.)\n",
+ "\n",
+ "Let's define a general purpose method that can do the fitting and scoring for the above 4 scenarios."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def evaluate(model, x_train, y_train, x_val, y_val, cross_validate=False, grid_search_params={}):\n",
+ "    if cross_validate:\n",
+ "        grid_search_model = GridSearchCV(estimator=model,\n",
+ "                                         cv=LeaveOneOut(),\n",
+ "                                         param_grid=grid_search_params, \n",
+ "                                         verbose=0)\n",
+ "        grid_search_model.fit(x_train, y_train)\n",
+ "        return (grid_search_model.best_estimator_.score(x_train, y_train),\n",
+ "                grid_search_model.best_estimator_.score(x_val, y_val),\n",
+ "                grid_search_model.best_estimator_.predict(x_val),\n",
+ "                grid_search_model.best_estimator_)\n",
+ "    else:\n",
+ "        model.fit(x_train, y_train)\n",
+ "        return (model.score(x_train, y_train), model.score(x_val, y_val), model.predict(x_val), 0) "
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now let's define the set of models along with the parameter combinations that will be used for grid search. The parameter combinations I am using for the different models are:\n",
+ "- LogisticRegression\n",
+ "  - C (Regularization factor)\n",
+ "- MultinomialNB\n",
+ "  - alpha (Regularization factor)\n",
+ "- SVM\n",
+ "  - C (Regularization factor)\n",
+ "- XGBoost\n",
+ "  - max_depth\n",
+ "  - learning_rate\n",
+ "  - subsample (subsample ratio of the training instances)\n",
+ "  - colsample_bytree (subsample ratio of columns when constructing each tree)\n",
+ "  - reg_alpha (L1 regularization term on weights)\n",
+ "- Random Forest\n",
+ "  - n_estimators (number of trees)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "models = [\n",
+ "    {\n",
+ "        'name': 'LogisticRegression',\n",
+ "        'model': LogisticRegression(),\n",
+ "        'grid_search_params': {'C':[2,1,0.5]}\n",
+ "    },\n",
+ "    {\n",
+ "        'name': 'MultinomialNB',\n",
+ "        'model': MultinomialNB(),\n",
+ "        'grid_search_params': {'alpha':[2,1,0.5]} \n",
+ "    },\n",
+ "    {\n",
+ "        'name': 'SVM',\n",
+ "        'model': svm.SVC(kernel='linear', random_state=1),\n",
+ "        'grid_search_params': {'C':[2,1,0.5]} \n",
+ "    },\n",
+ "    {\n",
+ "        'name': 'XGBoost',\n",
+ "        'model': XGBClassifier(objective='multi:softmax', \n",
+ "                               num_class=len(set(y_train)),\n",
+ "                               random_state=1),\n",
+ "        'grid_search_params': {'max_depth': [3,5,7],\n",
+ "                               'learning_rate': [0.001,0.01,0.1],\n",
+ "                               'subsample':[0.5,1],\n",
+ "                               'colsample_bytree':[0.5,0.7,1],\n",
+ "                               'reg_alpha':[0,1,2]\n",
+ "                              }\n",
+ "    },\n",
+ "    {\n",
+ "        'name': 'RandomForest',\n",
+ "        'model': RandomForestClassifier(n_estimators=50, random_state=1),\n",
+ "        'grid_search_params': {'n_estimators': [10,25,50,100,200,300]}\n",
+ "    }\n",
+ "]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's fit and evaluate all the models and see the results."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "LogisticRegression\n",
+ "\n",
+ "Training score with all features and no cross validation: 1.0\n",
+ "Validation score with all features and no cross validation: 0.75\n",
+ "Training score with all features and cross validation: 1.0\n",
+ "Validation score with all features and cross validation: 0.8333333333333334\n",
+ "Training score with only text (description) feature and no cross validation: 0.9583333333333334\n",
+ "Validation score with only text (description) feature and no cross validation: 0.8333333333333334\n",
+ "Training score with only text (description) feature and cross validation: 1.0\n",
+ "Validation score with only text (description) feature and cross validation: 0.9166666666666666\n",
+ "\n",
+ "MultinomialNB\n",
+ "\n",
+ "Training score with all features and no cross validation: 0.9166666666666666\n",
+ "Validation score with all features and no cross validation: 0.75\n",
+ "Training score with all features and cross validation: 0.9166666666666666\n",
+ "Validation score with all features and cross validation: 0.75\n",
+ "Training score with only text (description) feature and no cross validation: 0.9166666666666666\n",
+ "Validation score with only text (description) feature and no cross validation: 0.8333333333333334\n",
+ "Training score with only text (description) feature and cross validation: 0.9583333333333334\n",
+ "Validation score with only text (description) feature and cross validation: 0.8333333333333334\n",
+ "\n",
+ "SVM\n",
+ "\n",
+ "Training score with all features and no cross 
validation: 1.0\n", + "Validation score with all features and no cross validation: 0.8333333333333334\n", + "Training score with all features and cross validation: 1.0\n", + "Validation score with all features and cross validation: 0.8333333333333334\n", + "Training score with only text (description) feature and no cross validation: 1.0\n", + "Validation score with only text (description) feature and no cross validation: 0.9166666666666666\n", + "Training score with only text (description) feature and cross validation: 1.0\n", + "Validation score with only text (description) feature and cross validation: 0.9166666666666666\n", + "\n", + "XGBoost\n", + "\n", + "Training score with all features and no cross validation: 1.0\n", + "Validation score with all features and no cross validation: 0.6666666666666666\n", + "Training score with all features and cross validation: 0.875\n", + "Validation score with all features and cross validation: 0.6666666666666666\n", + "Training score with only text (description) feature and no cross validation: 0.6666666666666666\n", + "Validation score with only text (description) feature and no cross validation: 0.6666666666666666\n", + "Training score with only text (description) feature and cross validation: 0.5416666666666666\n", + "Validation score with only text (description) feature and cross validation: 0.6666666666666666\n", + "\n", + "RandomForest\n", + "\n", + "Training score with all features and no cross validation: 1.0\n", + "Validation score with all features and no cross validation: 0.75\n", + "Training score with all features and cross validation: 1.0\n", + "Validation score with all features and cross validation: 0.75\n", + "Training score with only text (description) feature and no cross validation: 1.0\n", + "Validation score with only text (description) feature and no cross validation: 0.9166666666666666\n", + "Training score with only text (description) feature and cross validation: 1.0\n", + "Validation score with only text (description) feature and cross validation: 0.9166666666666666\n", + "\n" + ] + } + ], + "source": [ + "scores = []\n", + "predictions = []\n", + "for model in models:\n", + " model['scores'] = []\n", + " model['predictions'] = []\n", + " \n", + " print('{}\\n'.format(model['name']))\n", + " score_train, score_val, predictions, estimator = evaluate(model['model'], x_train, y_train, x_val, y_val)\n", + " model['scores'].append(score_val) \n", + " model['predictions'].append(predictions) \n", + " print('Training score with all features and no cross validation: {}'.format(score_train))\n", + " print('Validation score with all features and no cross validation: {}'.format(score_val))\n", + " score_train, score_val, predictions, estimator = evaluate(model['model'], x_train, y_train, x_val, y_val, \n", + " cross_validate=True, grid_search_params=model['grid_search_params'])\n", + " model['scores'].append(score_val) \n", + " model['predictions'].append(predictions) \n", + " print('Training score with all features and cross validation: {}'.format(score_train))\n", + " print('Validation score with all features and cross validation: {}'.format(score_val))\n", + " \n", + " score_train, score_val, predictions, estimator = evaluate(model['model'], x_train_tfidf, y_train, x_val_tfidf, y_val)\n", + " model['scores'].append(score_val) \n", + " model['predictions'].append(predictions) \n", + " print('Training score with only text (description) feature and no cross validation: {}'.format(score_train))\n", + " print('Validation score with 
only text (description) feature and no cross validation: {}'.format(score_val))\n",
+ "    score_train, score_val, predictions, estimator = evaluate(model['model'], x_train_tfidf, y_train, x_val_tfidf, y_val, \n",
+ "                                                               cross_validate=True, grid_search_params=model['grid_search_params'])\n",
+ "    model['scores'].append(score_val) \n",
+ "    model['predictions'].append(predictions) \n",
+ "    print('Training score with only text (description) feature and cross validation: {}'.format(score_train))\n",
+ "    print('Validation score with only text (description) feature and cross validation: {}\\n'.format(score_val))\n",
+ "    \n",
+ "    model['best_prediction'] = model['predictions'][np.argmax(model['scores'])]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's print out the best predictions for each model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "LogisticRegression\n",
+ "Best score:0.916666666667\n",
+ "Best predictions: [4 2 0 3 2 4 2 2 2 2 2 2]\n",
+ "MultinomialNB\n",
+ "Best score:0.833333333333\n",
+ "Best predictions: [4 2 0 2 2 4 2 2 2 2 2 2]\n",
+ "SVM\n",
+ "Best score:0.916666666667\n",
+ "Best predictions: [4 2 0 3 2 4 2 2 2 2 2 2]\n",
+ "XGBoost\n",
+ "Best score:0.666666666667\n",
+ "Best predictions: [4 2 4 0 4 4 4 2 2 2 2 2]\n",
+ "RandomForest\n",
+ "Best score:0.916666666667\n",
+ "Best predictions: [4 2 0 3 2 4 2 2 2 2 2 2]\n"
+ ]
+ }
+ ],
+ "source": [
+ "for model in models:\n",
+ "    print(model['name'])\n",
+ "    print('Best score:' + str(np.max(model['scores']))) \n",
+ "    print('Best predictions: ' + str(model['best_prediction']))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Some observations here\n",
+ "\n",
+ "From the above results, it looks like, except for XGBoost, all the other classifiers perform better using only the text features rather than all features. With Random Forest, SVM and Logistic Regression, the training accuracy is 100% and the validation accuracy is 91.67%. With XGBoost (a bit surprisingly to me), while the training accuracy is 100% (using all features and no cross validation), the validation accuracy achieved is only 66.67%. (This is something that deserves further investigation and tuning, for sure.)"
+ ]
+ },
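+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "One possible starting point for that investigation is sketched below (added for illustration; not part of the original run, and only one of several things worth checking): inspecting which features the XGBoost model actually leans on."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sketch: fit the default XGBoost model on all features and list the indices\n",
+ "# and scores of the ten most important features.\n",
+ "xgb_clf = XGBClassifier(objective='multi:softmax', num_class=len(set(y_train)), random_state=1)\n",
+ "xgb_clf.fit(x_train, y_train)\n",
+ "top10 = np.argsort(xgb_clf.feature_importances_)[::-1][:10]\n",
+ "print(top10)\n",
+ "print(xgb_clf.feature_importances_[top10])"
+ ]
+ },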
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Ensemble\n",
+ "\n",
+ "Let's try an ensemble of the Random Forest, SVM and Logistic Regression predictions to get our final predictions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def ensemble(preds):\n",
+ "    def mode(vals):\n",
+ "        counts = dict(Counter(vals))\n",
+ "        return sorted(counts.items(), key=lambda x: (x[1], x[0]), reverse=True)[0][0]\n",
+ "    \n",
+ "    preds = np.vstack(preds)\n",
+ "    return np.apply_along_axis(mode, 0, preds)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Ensemble accuracy: 0.9166666666666666\n"
+ ]
+ }
+ ],
+ "source": [
+ "ensemble_result = ensemble([models[0]['best_prediction'],\n",
+ "                            models[2]['best_prediction'],\n",
+ "                            models[4]['best_prediction']])\n",
+ "\n",
+ "ensemble_accuracy = np.sum(ensemble_result == y_val) / len(y_val)\n",
+ "print('Ensemble accuracy: {}'.format(ensemble_accuracy))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "So the ensemble accuracy is no better than the individual models' accuracy. That is because each of those 3 models was unable to identify the correct category for the same single data point: the fifth data point in the validation set, where the actual category is **Office Supplies** but it has been classified as **Meals and Entertainment**."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Let's get the precision, recall and f1-score:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "                         precision    recall  f1-score   support\n",
+ "\n",
+ "    Computer - Hardware       1.00      1.00      1.00         1\n",
+ "Meals and Entertainment       0.88      1.00      0.93         7\n",
+ "        Office Supplies       1.00      0.50      0.67         2\n",
+ "                 Travel       1.00      1.00      1.00         2\n",
+ "\n",
+ "            avg / total       0.93      0.92      0.91        12\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(classification_report(y_val, ensemble_result, target_names=val_categories))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Because of the misclassification of that one data point, the recall on **Office Supplies** is only 0.5 (since there are only 2 data points in this category), and the precision of **Meals and Entertainment** is also lowered to 0.88. Overall, however, we seem to have a decent precision and recall of 93% and 92%, respectively.\n",
+ "\n",
+ "Let's also visualize the confusion matrix, which conveys the same information a bit more pictorially."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# This method is just copied from a scikit-learn documentation example\n",
+ "def plot_confusion_matrix(cm, classes,\n",
+ "                          normalize=False,\n",
+ "                          title='Confusion matrix',\n",
+ "                          cmap=plt.cm.Blues):\n",
+ "    \"\"\"\n",
+ "    This function prints and plots the confusion matrix.\n",
+ "    Normalization can be applied by setting `normalize=True`.\n",
+ "    \"\"\"\n",
+ "    if normalize:\n",
+ "        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n",
+ "        print(\"Normalized confusion matrix\")\n",
+ "    else:\n",
+ "        print('Confusion matrix, without normalization')\n",
+ "\n",
+ "    print(cm)\n",
+ "\n",
+ "    plt.imshow(cm, interpolation='nearest', cmap=cmap)\n",
+ "    plt.title(title)\n",
+ "    plt.colorbar()\n",
+ "    tick_marks = np.arange(len(classes))\n",
+ "    plt.xticks(tick_marks, classes, rotation=45)\n",
+ "    plt.yticks(tick_marks, classes)\n",
+ "\n",
+ "    fmt = '.2f' if normalize else 'd'\n",
+ "    thresh = cm.max() / 2.\n",
+ "    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n",
+ "        plt.text(j, i, format(cm[i, j], fmt),\n",
+ "                 horizontalalignment=\"center\",\n",
+ "                 color=\"white\" if cm[i, j] > thresh else \"black\")\n",
+ "\n",
+ "    plt.tight_layout()\n",
+ "    plt.ylabel('True label')\n",
+ "    plt.xlabel('Predicted label')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Confusion matrix, without normalization\n",
+ "[[1 0 0 0]\n",
+ " [0 7 0 0]\n",
+ " [0 1 1 0]\n",
+ " [0 0 0 2]]\n"
+ ]
+ },
+ {
+ "data": {
+ "image/png": "[base64 PNG omitted: confusion matrix plot; the patch is truncated at this point]
AcCN0y6dp26XxvzScDGFnt5a486Vz18iDgxzTJDjtiDqtrSZpovqge30/dX1Sb1FuBN9Ve6JY+RnXRxJ/C0uhTYDiyvqnH1g/a1srfaOvKpwHibof8C3FhVE+pLwHq6UldrhxjyA8BHwJb2c3BM3wMf0xWRva2qflOfoVtbnrALvgVYNNy7E9Mpu71FRPREliwiInoiCTkioieSkCMieiIJOSKiJ5KQIyJ6Igk5IqInkpAjInriTxQp0B8QywZWAAAAAElFTkSuQmCC\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Compute confusion matrix\n", + "cnf_matrix = confusion_matrix(y_val, ensemble_result)\n", + "np.set_printoptions(precision=2)\n", + "\n", + "# Plot non-normalized confusion matrix\n", + "plt.figure()\n", + "plot_confusion_matrix(cnf_matrix, classes=val_categoroes,\n", + " title='Confusion matrix, without normalization')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "We have tried several approches for classification with crss-validation and grid-searching, and also tried an ensemble of a few approaches. A few takeaway points for myself:\n", + "\n", + "- I had anticipated XGBoost to perform much better. I was quite surprised it didn't. Probably that will be acievable with some further parameter tuning and better feature generation, and more importantly with more data. I am relatively new to XGBoost so I need further insight into this\n", + "\n", + "- All classifiers were unable to predict one data point correctly. Even the ensemble didnt help. Probably an indication that better tuning and feature engineering is more necessary that ensembling different approaches.\n", + "\n", + "- I have tried to define a general approach of configuring models and parameters and then run them through a generic method, which I believe, could be handy. I have separation of concern in mind and hence I separated the pre-processing, vectorization and prediction steps, and the made the prediction step customizable using configurable models and parameters\n", + "\n", + "- Most of the classifiers were at least as good with only text features as with all features.\n", + "\n", + "- Random Forest, Logistic regression and SVM all performed pretty well.\n", + "\n", + "**Most importantly**\n", + "\n", + "**Since there were not enough data, with a bigger dataset, it is possible that these behaviour might change. I have learnt from my experience that models performing well on smaller data are not becessary the best models for real world data with massive size.** \n", + "\n", + "\n", + "My personal preference is to go with ensemble approaches like **Random Forest** and **XGBoost** (although XGBoost didn't perform here as well as expected as pointed out previously), since they have the power of learning from an ensemble of different classifiers (trees), rather than relying on the type of kernel (as in SVM). Also, Scikit learn's Random forest and XGBoost implementation provides a lot of parameters to tune the classifiers.\n", + "\n", + "In the Spark based bonus solution, I will try the **Random forest** approach." 
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/classification_attemp_with_deep_learning.ipynb b/classification_attemp_with_deep_learning.ipynb
new file mode 100644
index 0000000..1208bbc
--- /dev/null
+++ b/classification_attemp_with_deep_learning.ipynb
@@ -0,0 +1,1463 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Classification with deep learning\n",
+    "\n",
+    "This classification solution tries to address the problem using deep learning. I just tried it out of curiosity:\n",
+    "\n",
+    "*Train a learning model that assigns each expense transaction to one of the set of predefined categories and evaluate it against the validation data provided. The set of categories are those found in the \"category\" column in the training data. Report on accuracy and at least one other performance metric.*\n",
+    "\n",
+    "For this, I tried the following:\n",
+    "- **Using a simple 3-layer ANN with all features (non-text and text, with text vectorized using scikit-learn's TfidfVectorizer)**\n",
+    "- **Using a simple 3-layer ANN with only text features (with text vectorized using Keras's text-to-matrix binary vectorizer)**\n",
+    "- **Using an RNN with LSTM with a pre-trained embedding layer and only text features (with text vectorized using Keras's text-to-matrix binary vectorizer)**\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Necessary imports"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 65,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "from collections import Counter\n",
+    "import itertools\n",
+    "\n",
+    "from sklearn import preprocessing\n",
+    "from sklearn.preprocessing import LabelEncoder\n",
+    "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n",
+    "\n",
+    "import matplotlib.pyplot as plt\n",
+    "%matplotlib inline\n",
+    "from sklearn.metrics import confusion_matrix\n",
+    "\n",
+    "from keras.preprocessing.text import Tokenizer\n",
+    "# from nltk.corpus import stopwords\n",
+    "# from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer\n",
+    "from keras import models, layers\n",
+    "from keras.utils import np_utils\n",
+    "from keras import preprocessing as kreprocessing\n",
+    "import os"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Read data and preprocess"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 109,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_train = pd.read_csv('training_data_example.csv')\n",
+    "df_val = pd.read_csv('validation_data_example.csv')\n",
+    "df_employee = pd.read_csv('employee.csv')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 110,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def pre_process(df, columns_to_drop=['date',\n",
+    "                                     'category',\n",
+    "                                     'tax amount',\n",
+    "                                     'expense description']):\n",
+    "    \n",
+    "    df['day_of_week'] = pd.to_datetime(df['date']).apply(lambda x: x.weekday()).astype(str)  # str so that it is treated as categorical\n",
+    "    df['month'] = pd.to_datetime(df['date']).apply(lambda x: x.month).astype(str)\n",
+    "    df = pd.merge(df, df_employee[['employee id', 'role']], how='inner', on=['employee id'])\n",
+    "    df['employee id'] = df['employee id'].astype(str)\n",
+    "    df = df.drop(columns_to_drop, axis=1)\n",
+    "    \n",
+    "    # one-hot encode the categorical variables\n",
+    "    df = pd.get_dummies(df)\n",
+    "    \n",
+    "    # scale the pre-tax amount to the [0, 1] range\n",
+    "    df['pre-tax amount'] = preprocessing.minmax_scale(df[['pre-tax amount']])\n",
+    "    \n",
+    "    return df"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Ready the pre-processed data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 111,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "x_train = pre_process(df_train)\n",
+    "x_val = pre_process(df_val)\n",
+    "# align the validation columns with the training columns; dummy columns\n",
+    "# missing from the validation frame become NaN and are filled with 0\n",
+    "x_train, x_val = x_train.align(x_val, join='left', axis=1)\n",
+    "x_val = x_val.fillna(0)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Process text features separately"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 112,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "vectorizer = TfidfVectorizer(stop_words='english')\n",
+    "vectorizer.fit(df_train['expense description'])\n",
+    "x_train_tfidf = vectorizer.transform(df_train['expense description']).toarray()\n",
+    "x_val_tfidf = vectorizer.transform(df_val['expense description']).toarray()\n",
+    "# x_train_tfidf.shape"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Concatenate text and non-text features"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 113,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "x_train = np.concatenate((x_train.values, x_train_tfidf), axis=1)\n",
+    "x_val = np.concatenate((x_val.values, x_val_tfidf), axis=1)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Generate vectorized output labels for feeding to Keras."
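Before the actual cell, a tiny standalone illustration of what this label vectorization produces: `LabelEncoder` maps category strings to integer ids, and `np_utils.to_categorical` (the same import the notebook uses) turns those ids into the one-hot rows that a softmax output trained with `categorical_crossentropy` expects. The category names here are invented placeholders, not the real dataset's.

```python
# Illustrative only: made-up category names, not the real dataset's.
from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils

categories = ['Travel', 'Meals', 'Travel', 'Office Supplies']

encoder = LabelEncoder().fit(categories)
int_ids = encoder.transform(categories)  # classes sorted alphabetically: Meals=0, Office Supplies=1, Travel=2
print(int_ids)                           # [2 0 2 1]
one_hot = np_utils.to_categorical(int_ids)
print(one_hot.shape)                     # (4, 3): one row per sample, one column per class
```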
+ ] + }, + { + "cell_type": "code", + "execution_count": 114, + "metadata": {}, + "outputs": [], + "source": [ + "lencoder = LabelEncoder()\n", + "lencoder.fit(df_train['category'])\n", + "names = set(df_val['category']) # label names to be used later\n", + "y_train = lencoder.transform(df_train['category'])\n", + "y_val = lencoder.transform(df_val['category'])\n", + "dummy_y_train = np_utils.to_categorical(y_train)\n", + "dummy_y_val = np_utils.to_categorical(y_val)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Approach 1\n", + "\n", + "### Using a simple 3-layer ANN with all features (non-text and text, with text vectorized using scikit learn's TfIdfVectorizer)" + ] + }, + { + "cell_type": "code", + "execution_count": 115, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "_________________________________________________________________\n", + "Layer (type) Output Shape Param # \n", + "=================================================================\n", + "dense_41 (Dense) (None, 16) 832 \n", + "_________________________________________________________________\n", + "dense_42 (Dense) (None, 16) 272 \n", + "_________________________________________________________________\n", + "dense_43 (Dense) (None, 5) 85 \n", + "=================================================================\n", + "Total params: 1,189\n", + "Trainable params: 1,189\n", + "Non-trainable params: 0\n", + "_________________________________________________________________\n" + ] + } + ], + "source": [ + "model = models.Sequential()\n", + "model.add(layers.Dense(16, activation='relu', input_shape=(x_train.shape[1],)))\n", + "model.add(layers.Dense(16, activation='relu'))\n", + "model.add(layers.Dense(5, activation='softmax'))\n", + "\n", + "model.compile(optimizer='rmsprop',\n", + " loss='categorical_crossentropy',\n", + " metrics=['accuracy'])\n", + "\n", + "model.summary()" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Train on 24 samples, validate on 12 samples\n", + "Epoch 1/200\n", + "24/24 [==============================] - 0s 14ms/step - loss: 1.6079 - acc: 0.1250 - val_loss: 1.6118 - val_acc: 0.2500\n", + "Epoch 2/200\n", + "24/24 [==============================] - 0s 324us/step - loss: 1.5794 - acc: 0.1667 - val_loss: 1.6026 - val_acc: 0.1667\n", + "Epoch 3/200\n", + "24/24 [==============================] - 0s 628us/step - loss: 1.5615 - acc: 0.2500 - val_loss: 1.5947 - val_acc: 0.1667\n", + "Epoch 4/200\n", + "24/24 [==============================] - 0s 308us/step - loss: 1.5471 - acc: 0.2500 - val_loss: 1.5873 - val_acc: 0.0833\n", + "Epoch 5/200\n", + "24/24 [==============================] - 0s 332us/step - loss: 1.5331 - acc: 0.3333 - val_loss: 1.5800 - val_acc: 0.1667\n", + "Epoch 6/200\n", + "24/24 [==============================] - 0s 526us/step - loss: 1.5206 - acc: 0.4167 - val_loss: 1.5730 - val_acc: 0.3333\n", + "Epoch 7/200\n", + "24/24 [==============================] - 0s 418us/step - loss: 1.5087 - acc: 0.5000 - val_loss: 1.5665 - val_acc: 0.3333\n", + "Epoch 8/200\n", + "24/24 [==============================] - 0s 356us/step - loss: 1.4968 - acc: 0.5417 - val_loss: 1.5605 - val_acc: 0.3333\n", + "Epoch 9/200\n", + "24/24 [==============================] - 0s 360us/step - loss: 1.4858 - acc: 0.5417 - val_loss: 1.5542 - val_acc: 0.3333\n", + "Epoch 10/200\n", + "24/24 [==============================] - 0s 
613us/step - loss: 1.4756 - acc: 0.5417 - val_loss: 1.5484 - val_acc: 0.3333\n", + "Epoch 11/200\n", + "24/24 [==============================] - 0s 448us/step - loss: 1.4647 - acc: 0.5833 - val_loss: 1.5417 - val_acc: 0.3333\n", + "Epoch 12/200\n", + "24/24 [==============================] - 0s 355us/step - loss: 1.4539 - acc: 0.6667 - val_loss: 1.5359 - val_acc: 0.3333\n", + "Epoch 13/200\n", + "24/24 [==============================] - 0s 406us/step - loss: 1.4429 - acc: 0.6667 - val_loss: 1.5299 - val_acc: 0.3333\n", + "Epoch 14/200\n", + "24/24 [==============================] - 0s 482us/step - loss: 1.4333 - acc: 0.6667 - val_loss: 1.5245 - val_acc: 0.3333\n", + "Epoch 15/200\n", + "24/24 [==============================] - 0s 369us/step - loss: 1.4225 - acc: 0.7083 - val_loss: 1.5188 - val_acc: 0.3333\n", + "Epoch 16/200\n", + "24/24 [==============================] - 0s 405us/step - loss: 1.4113 - acc: 0.7083 - val_loss: 1.5120 - val_acc: 0.4167\n", + "Epoch 17/200\n", + "24/24 [==============================] - 0s 307us/step - loss: 1.3995 - acc: 0.7083 - val_loss: 1.5059 - val_acc: 0.4167\n", + "Epoch 18/200\n", + "24/24 [==============================] - 0s 731us/step - loss: 1.3887 - acc: 0.7083 - val_loss: 1.4990 - val_acc: 0.4167\n", + "Epoch 19/200\n", + "24/24 [==============================] - 0s 701us/step - loss: 1.3771 - acc: 0.7083 - val_loss: 1.4930 - val_acc: 0.4167\n", + "Epoch 20/200\n", + "24/24 [==============================] - 0s 350us/step - loss: 1.3654 - acc: 0.7083 - val_loss: 1.4860 - val_acc: 0.4167\n", + "Epoch 21/200\n", + "24/24 [==============================] - 0s 367us/step - loss: 1.3550 - acc: 0.6667 - val_loss: 1.4799 - val_acc: 0.4167\n", + "Epoch 22/200\n", + "24/24 [==============================] - 0s 463us/step - loss: 1.3417 - acc: 0.6667 - val_loss: 1.4741 - val_acc: 0.4167\n", + "Epoch 23/200\n", + "24/24 [==============================] - 0s 819us/step - loss: 1.3302 - acc: 0.7083 - val_loss: 1.4678 - val_acc: 0.4167\n", + "Epoch 24/200\n", + "24/24 [==============================] - 0s 380us/step - loss: 1.3184 - acc: 0.6667 - val_loss: 1.4626 - val_acc: 0.4167\n", + "Epoch 25/200\n", + "24/24 [==============================] - 0s 540us/step - loss: 1.3059 - acc: 0.6667 - val_loss: 1.4568 - val_acc: 0.4167\n", + "Epoch 26/200\n", + "24/24 [==============================] - 0s 343us/step - loss: 1.2939 - acc: 0.6667 - val_loss: 1.4512 - val_acc: 0.4167\n", + "Epoch 27/200\n", + "24/24 [==============================] - 0s 327us/step - loss: 1.2812 - acc: 0.6667 - val_loss: 1.4447 - val_acc: 0.4167\n", + "Epoch 28/200\n", + "24/24 [==============================] - 0s 702us/step - loss: 1.2698 - acc: 0.6667 - val_loss: 1.4383 - val_acc: 0.4167\n", + "Epoch 29/200\n", + "24/24 [==============================] - 0s 541us/step - loss: 1.2571 - acc: 0.6667 - val_loss: 1.4322 - val_acc: 0.4167\n", + "Epoch 30/200\n", + "24/24 [==============================] - 0s 683us/step - loss: 1.2446 - acc: 0.6667 - val_loss: 1.4262 - val_acc: 0.4167\n", + "Epoch 31/200\n", + "24/24 [==============================] - 0s 420us/step - loss: 1.2309 - acc: 0.6667 - val_loss: 1.4195 - val_acc: 0.4167\n", + "Epoch 32/200\n", + "24/24 [==============================] - 0s 498us/step - loss: 1.2181 - acc: 0.6667 - val_loss: 1.4136 - val_acc: 0.4167\n", + "Epoch 33/200\n", + "24/24 [==============================] - 0s 540us/step - loss: 1.2051 - acc: 0.7500 - val_loss: 1.4061 - val_acc: 0.4167\n", + "Epoch 34/200\n", + "24/24 [==============================] - 0s 
524us/step - loss: 1.1915 - acc: 0.7500 - val_loss: 1.3990 - val_acc: 0.5000\n", + "Epoch 35/200\n", + "24/24 [==============================] - 0s 421us/step - loss: 1.1790 - acc: 0.7500 - val_loss: 1.3922 - val_acc: 0.5000\n", + "Epoch 36/200\n", + "24/24 [==============================] - 0s 504us/step - loss: 1.1661 - acc: 0.7500 - val_loss: 1.3858 - val_acc: 0.5000\n", + "Epoch 37/200\n", + "24/24 [==============================] - 0s 332us/step - loss: 1.1545 - acc: 0.7500 - val_loss: 1.3784 - val_acc: 0.5000\n", + "Epoch 38/200\n", + "24/24 [==============================] - 0s 321us/step - loss: 1.1399 - acc: 0.7500 - val_loss: 1.3726 - val_acc: 0.5000\n", + "Epoch 39/200\n", + "24/24 [==============================] - 0s 336us/step - loss: 1.1271 - acc: 0.7500 - val_loss: 1.3643 - val_acc: 0.5000\n", + "Epoch 40/200\n", + "24/24 [==============================] - 0s 369us/step - loss: 1.1138 - acc: 0.7500 - val_loss: 1.3584 - val_acc: 0.5000\n", + "Epoch 41/200\n", + "24/24 [==============================] - 0s 319us/step - loss: 1.1013 - acc: 0.7500 - val_loss: 1.3498 - val_acc: 0.5000\n", + "Epoch 42/200\n", + "24/24 [==============================] - 0s 782us/step - loss: 1.0882 - acc: 0.7500 - val_loss: 1.3421 - val_acc: 0.5000\n", + "Epoch 43/200\n", + "24/24 [==============================] - 0s 496us/step - loss: 1.0741 - acc: 0.7500 - val_loss: 1.3357 - val_acc: 0.5000\n", + "Epoch 44/200\n", + "24/24 [==============================] - 0s 319us/step - loss: 1.0617 - acc: 0.7500 - val_loss: 1.3271 - val_acc: 0.5000\n", + "Epoch 45/200\n", + "24/24 [==============================] - 0s 469us/step - loss: 1.0495 - acc: 0.7500 - val_loss: 1.3207 - val_acc: 0.5833\n", + "Epoch 46/200\n", + "24/24 [==============================] - 0s 402us/step - loss: 1.0349 - acc: 0.7500 - val_loss: 1.3127 - val_acc: 0.5833\n", + "Epoch 47/200\n", + "24/24 [==============================] - 0s 255us/step - loss: 1.0206 - acc: 0.7500 - val_loss: 1.3062 - val_acc: 0.5833\n", + "Epoch 48/200\n", + "24/24 [==============================] - 0s 438us/step - loss: 1.0073 - acc: 0.7500 - val_loss: 1.2981 - val_acc: 0.5833\n", + "Epoch 49/200\n", + "24/24 [==============================] - 0s 638us/step - loss: 0.9942 - acc: 0.7500 - val_loss: 1.2908 - val_acc: 0.5833\n", + "Epoch 50/200\n", + "24/24 [==============================] - 0s 496us/step - loss: 0.9811 - acc: 0.7500 - val_loss: 1.2840 - val_acc: 0.5833\n", + "Epoch 51/200\n", + "24/24 [==============================] - 0s 438us/step - loss: 0.9669 - acc: 0.7500 - val_loss: 1.2786 - val_acc: 0.5833\n", + "Epoch 52/200\n", + "24/24 [==============================] - 0s 495us/step - loss: 0.9537 - acc: 0.7500 - val_loss: 1.2702 - val_acc: 0.5833\n", + "Epoch 53/200\n", + "24/24 [==============================] - 0s 263us/step - loss: 0.9392 - acc: 0.7500 - val_loss: 1.2642 - val_acc: 0.5833\n", + "Epoch 54/200\n", + "24/24 [==============================] - 0s 319us/step - loss: 0.9267 - acc: 0.7500 - val_loss: 1.2581 - val_acc: 0.6667\n", + "Epoch 55/200\n", + "24/24 [==============================] - 0s 221us/step - loss: 0.9144 - acc: 0.7500 - val_loss: 1.2505 - val_acc: 0.6667\n", + "Epoch 56/200\n", + "24/24 [==============================] - 0s 555us/step - loss: 0.9016 - acc: 0.7500 - val_loss: 1.2450 - val_acc: 0.6667\n", + "Epoch 57/200\n", + "24/24 [==============================] - 0s 762us/step - loss: 0.8887 - acc: 0.7500 - val_loss: 1.2379 - val_acc: 0.7500\n", + "Epoch 58/200\n", + "24/24 [==============================] - 0s 
415us/step - loss: 0.8761 - acc: 0.7500 - val_loss: 1.2339 - val_acc: 0.7500\n", + "Epoch 59/200\n", + "24/24 [==============================] - 0s 384us/step - loss: 0.8638 - acc: 0.7917 - val_loss: 1.2275 - val_acc: 0.7500\n", + "Epoch 60/200\n", + "24/24 [==============================] - 0s 329us/step - loss: 0.8512 - acc: 0.7917 - val_loss: 1.2225 - val_acc: 0.7500\n", + "Epoch 61/200\n", + "24/24 [==============================] - 0s 379us/step - loss: 0.8383 - acc: 0.7917 - val_loss: 1.2165 - val_acc: 0.7500\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Epoch 62/200\n", + "24/24 [==============================] - 0s 343us/step - loss: 0.8262 - acc: 0.7917 - val_loss: 1.2089 - val_acc: 0.7500\n", + "Epoch 63/200\n", + "24/24 [==============================] - 0s 370us/step - loss: 0.8135 - acc: 0.7917 - val_loss: 1.2033 - val_acc: 0.7500\n", + "Epoch 64/200\n", + "24/24 [==============================] - 0s 388us/step - loss: 0.8055 - acc: 0.7917 - val_loss: 1.1976 - val_acc: 0.7500\n", + "Epoch 65/200\n", + "24/24 [==============================] - 0s 627us/step - loss: 0.7914 - acc: 0.7917 - val_loss: 1.1922 - val_acc: 0.7500\n", + "Epoch 66/200\n", + "24/24 [==============================] - 0s 387us/step - loss: 0.7783 - acc: 0.7917 - val_loss: 1.1860 - val_acc: 0.7500\n", + "Epoch 67/200\n", + "24/24 [==============================] - 0s 574us/step - loss: 0.7668 - acc: 0.7917 - val_loss: 1.1800 - val_acc: 0.7500\n", + "Epoch 68/200\n", + "24/24 [==============================] - 0s 320us/step - loss: 0.7555 - acc: 0.7917 - val_loss: 1.1744 - val_acc: 0.7500\n", + "Epoch 69/200\n", + "24/24 [==============================] - 0s 361us/step - loss: 0.7468 - acc: 0.7917 - val_loss: 1.1689 - val_acc: 0.7500\n", + "Epoch 70/200\n", + "24/24 [==============================] - 0s 420us/step - loss: 0.7328 - acc: 0.7917 - val_loss: 1.1607 - val_acc: 0.7500\n", + "Epoch 71/200\n", + "24/24 [==============================] - 0s 296us/step - loss: 0.7213 - acc: 0.7917 - val_loss: 1.1567 - val_acc: 0.7500\n", + "Epoch 72/200\n", + "24/24 [==============================] - 0s 866us/step - loss: 0.7099 - acc: 0.7917 - val_loss: 1.1496 - val_acc: 0.7500\n", + "Epoch 73/200\n", + "24/24 [==============================] - 0s 357us/step - loss: 0.6982 - acc: 0.7917 - val_loss: 1.1448 - val_acc: 0.7500\n", + "Epoch 74/200\n", + "24/24 [==============================] - 0s 340us/step - loss: 0.6875 - acc: 0.7917 - val_loss: 1.1381 - val_acc: 0.7500\n", + "Epoch 75/200\n", + "24/24 [==============================] - 0s 413us/step - loss: 0.6797 - acc: 0.7917 - val_loss: 1.1351 - val_acc: 0.7500\n", + "Epoch 76/200\n", + "24/24 [==============================] - 0s 528us/step - loss: 0.6659 - acc: 0.8333 - val_loss: 1.1267 - val_acc: 0.7500\n", + "Epoch 77/200\n", + "24/24 [==============================] - 0s 458us/step - loss: 0.6556 - acc: 0.8333 - val_loss: 1.1212 - val_acc: 0.7500\n", + "Epoch 78/200\n", + "24/24 [==============================] - 0s 852us/step - loss: 0.6444 - acc: 0.8333 - val_loss: 1.1155 - val_acc: 0.7500\n", + "Epoch 79/200\n", + "24/24 [==============================] - 0s 516us/step - loss: 0.6357 - acc: 0.8333 - val_loss: 1.1101 - val_acc: 0.7500\n", + "Epoch 80/200\n", + "24/24 [==============================] - 0s 589us/step - loss: 0.6257 - acc: 0.8333 - val_loss: 1.1045 - val_acc: 0.7500\n", + "Epoch 81/200\n", + "24/24 [==============================] - 0s 411us/step - loss: 0.6141 - acc: 0.8333 - val_loss: 1.0982 - val_acc: 
0.7500\n", + "Epoch 82/200\n", + "24/24 [==============================] - 0s 384us/step - loss: 0.6039 - acc: 0.8333 - val_loss: 1.0911 - val_acc: 0.7500\n", + "Epoch 83/200\n", + "24/24 [==============================] - 0s 625us/step - loss: 0.5952 - acc: 0.8333 - val_loss: 1.0852 - val_acc: 0.7500\n", + "Epoch 84/200\n", + "24/24 [==============================] - 0s 380us/step - loss: 0.5864 - acc: 0.8333 - val_loss: 1.0775 - val_acc: 0.7500\n", + "Epoch 85/200\n", + "24/24 [==============================] - 0s 419us/step - loss: 0.5750 - acc: 0.8333 - val_loss: 1.0734 - val_acc: 0.7500\n", + "Epoch 86/200\n", + "24/24 [==============================] - 0s 705us/step - loss: 0.5669 - acc: 0.8333 - val_loss: 1.0660 - val_acc: 0.7500\n", + "Epoch 87/200\n", + "24/24 [==============================] - 0s 394us/step - loss: 0.5572 - acc: 0.8750 - val_loss: 1.0601 - val_acc: 0.7500\n", + "Epoch 88/200\n", + "24/24 [==============================] - 0s 337us/step - loss: 0.5493 - acc: 0.8750 - val_loss: 1.0527 - val_acc: 0.7500\n", + "Epoch 89/200\n", + "24/24 [==============================] - 0s 382us/step - loss: 0.5376 - acc: 0.8750 - val_loss: 1.0475 - val_acc: 0.7500\n", + "Epoch 90/200\n", + "24/24 [==============================] - 0s 465us/step - loss: 0.5295 - acc: 0.8750 - val_loss: 1.0411 - val_acc: 0.7500\n", + "Epoch 91/200\n", + "24/24 [==============================] - 0s 295us/step - loss: 0.5213 - acc: 0.8750 - val_loss: 1.0355 - val_acc: 0.7500\n", + "Epoch 92/200\n", + "24/24 [==============================] - 0s 397us/step - loss: 0.5130 - acc: 0.8750 - val_loss: 1.0286 - val_acc: 0.7500\n", + "Epoch 93/200\n", + "24/24 [==============================] - 0s 511us/step - loss: 0.5051 - acc: 0.8750 - val_loss: 1.0214 - val_acc: 0.7500\n", + "Epoch 94/200\n", + "24/24 [==============================] - 0s 418us/step - loss: 0.4962 - acc: 0.8750 - val_loss: 1.0162 - val_acc: 0.7500\n", + "Epoch 95/200\n", + "24/24 [==============================] - 0s 352us/step - loss: 0.4881 - acc: 0.8750 - val_loss: 1.0086 - val_acc: 0.7500\n", + "Epoch 96/200\n", + "24/24 [==============================] - 0s 425us/step - loss: 0.4792 - acc: 0.8750 - val_loss: 1.0036 - val_acc: 0.7500\n", + "Epoch 97/200\n", + "24/24 [==============================] - 0s 479us/step - loss: 0.4725 - acc: 0.8750 - val_loss: 0.9984 - val_acc: 0.8333\n", + "Epoch 98/200\n", + "24/24 [==============================] - 0s 503us/step - loss: 0.4638 - acc: 0.8750 - val_loss: 0.9900 - val_acc: 0.8333\n", + "Epoch 99/200\n", + "24/24 [==============================] - 0s 474us/step - loss: 0.4560 - acc: 0.8750 - val_loss: 0.9833 - val_acc: 0.8333\n", + "Epoch 100/200\n", + "24/24 [==============================] - 0s 1ms/step - loss: 0.4489 - acc: 0.8750 - val_loss: 0.9770 - val_acc: 0.8333\n", + "Epoch 101/200\n", + "24/24 [==============================] - 0s 639us/step - loss: 0.4408 - acc: 0.8750 - val_loss: 0.9710 - val_acc: 0.8333\n", + "Epoch 102/200\n", + "24/24 [==============================] - 0s 380us/step - loss: 0.4337 - acc: 0.8750 - val_loss: 0.9644 - val_acc: 0.8333\n", + "Epoch 103/200\n", + "24/24 [==============================] - 0s 822us/step - loss: 0.4253 - acc: 0.8750 - val_loss: 0.9604 - val_acc: 0.8333\n", + "Epoch 104/200\n", + "24/24 [==============================] - 0s 539us/step - loss: 0.4188 - acc: 0.8750 - val_loss: 0.9540 - val_acc: 0.9167\n", + "Epoch 105/200\n", + "24/24 [==============================] - 0s 367us/step - loss: 0.4109 - acc: 0.8750 - val_loss: 0.9490 - 
val_acc: 0.9167\n", + "Epoch 106/200\n", + "24/24 [==============================] - 0s 719us/step - loss: 0.4028 - acc: 0.8750 - val_loss: 0.9427 - val_acc: 0.9167\n", + "Epoch 107/200\n", + "24/24 [==============================] - 0s 474us/step - loss: 0.3953 - acc: 0.8750 - val_loss: 0.9407 - val_acc: 0.9167\n", + "Epoch 108/200\n", + "24/24 [==============================] - 0s 443us/step - loss: 0.3887 - acc: 0.9167 - val_loss: 0.9317 - val_acc: 0.9167\n", + "Epoch 109/200\n", + "24/24 [==============================] - 0s 443us/step - loss: 0.3816 - acc: 0.9167 - val_loss: 0.9285 - val_acc: 0.9167\n", + "Epoch 110/200\n", + "24/24 [==============================] - 0s 448us/step - loss: 0.3745 - acc: 0.9167 - val_loss: 0.9210 - val_acc: 0.9167\n", + "Epoch 111/200\n", + "24/24 [==============================] - 0s 811us/step - loss: 0.3676 - acc: 0.9167 - val_loss: 0.9191 - val_acc: 0.9167\n", + "Epoch 112/200\n", + "24/24 [==============================] - 0s 489us/step - loss: 0.3615 - acc: 0.9167 - val_loss: 0.9133 - val_acc: 0.9167\n", + "Epoch 113/200\n", + "24/24 [==============================] - 0s 424us/step - loss: 0.3550 - acc: 0.9167 - val_loss: 0.9074 - val_acc: 0.9167\n", + "Epoch 114/200\n", + "24/24 [==============================] - 0s 360us/step - loss: 0.3492 - acc: 0.9167 - val_loss: 0.9029 - val_acc: 0.9167\n", + "Epoch 115/200\n", + "24/24 [==============================] - 0s 399us/step - loss: 0.3417 - acc: 0.9167 - val_loss: 0.8994 - val_acc: 0.9167\n", + "Epoch 116/200\n", + "24/24 [==============================] - 0s 446us/step - loss: 0.3356 - acc: 0.9167 - val_loss: 0.8948 - val_acc: 0.9167\n", + "Epoch 117/200\n", + "24/24 [==============================] - 0s 371us/step - loss: 0.3303 - acc: 0.9167 - val_loss: 0.8882 - val_acc: 0.9167\n", + "Epoch 118/200\n", + "24/24 [==============================] - 0s 788us/step - loss: 0.3243 - acc: 0.9167 - val_loss: 0.8842 - val_acc: 0.9167\n", + "Epoch 119/200\n", + "24/24 [==============================] - 0s 700us/step - loss: 0.3182 - acc: 0.9167 - val_loss: 0.8811 - val_acc: 0.9167\n", + "Epoch 120/200\n", + "24/24 [==============================] - 0s 382us/step - loss: 0.3146 - acc: 0.9167 - val_loss: 0.8740 - val_acc: 0.9167\n", + "Epoch 121/200\n", + "24/24 [==============================] - ETA: 0s - loss: 0.4652 - acc: 0.833 - 0s 362us/step - loss: 0.3084 - acc: 0.9167 - val_loss: 0.8698 - val_acc: 0.9167\n", + "Epoch 122/200\n", + "24/24 [==============================] - 0s 490us/step - loss: 0.3013 - acc: 0.9167 - val_loss: 0.8648 - val_acc: 0.9167\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Epoch 123/200\n", + "24/24 [==============================] - 0s 421us/step - loss: 0.2979 - acc: 0.9167 - val_loss: 0.8615 - val_acc: 0.9167\n", + "Epoch 124/200\n", + "24/24 [==============================] - 0s 385us/step - loss: 0.2916 - acc: 0.9167 - val_loss: 0.8556 - val_acc: 0.9167\n", + "Epoch 125/200\n", + "24/24 [==============================] - 0s 338us/step - loss: 0.2864 - acc: 0.9167 - val_loss: 0.8518 - val_acc: 0.9167\n", + "Epoch 126/200\n", + "24/24 [==============================] - 0s 312us/step - loss: 0.2819 - acc: 0.9167 - val_loss: 0.8474 - val_acc: 0.9167\n", + "Epoch 127/200\n", + "24/24 [==============================] - 0s 409us/step - loss: 0.2782 - acc: 0.9167 - val_loss: 0.8433 - val_acc: 0.9167\n", + "Epoch 128/200\n", + "24/24 [==============================] - 0s 501us/step - loss: 0.2715 - acc: 0.9167 - val_loss: 0.8398 - val_acc: 
0.9167\n", + "Epoch 129/200\n", + "24/24 [==============================] - 0s 447us/step - loss: 0.2680 - acc: 0.9167 - val_loss: 0.8374 - val_acc: 0.9167\n", + "Epoch 130/200\n", + "24/24 [==============================] - 0s 598us/step - loss: 0.2627 - acc: 0.9167 - val_loss: 0.8339 - val_acc: 0.9167\n", + "Epoch 131/200\n", + "24/24 [==============================] - 0s 621us/step - loss: 0.2586 - acc: 0.9167 - val_loss: 0.8279 - val_acc: 0.9167\n", + "Epoch 132/200\n", + "24/24 [==============================] - 0s 472us/step - loss: 0.2541 - acc: 0.9167 - val_loss: 0.8279 - val_acc: 0.9167\n", + "Epoch 133/200\n", + "24/24 [==============================] - 0s 472us/step - loss: 0.2496 - acc: 0.9167 - val_loss: 0.8252 - val_acc: 0.9167\n", + "Epoch 134/200\n", + "24/24 [==============================] - 0s 316us/step - loss: 0.2458 - acc: 0.9167 - val_loss: 0.8224 - val_acc: 0.9167\n", + "Epoch 135/200\n", + "24/24 [==============================] - 0s 663us/step - loss: 0.2417 - acc: 0.9167 - val_loss: 0.8204 - val_acc: 0.9167\n", + "Epoch 136/200\n", + "24/24 [==============================] - 0s 591us/step - loss: 0.2382 - acc: 0.9583 - val_loss: 0.8147 - val_acc: 0.9167\n", + "Epoch 137/200\n", + "24/24 [==============================] - 0s 580us/step - loss: 0.2338 - acc: 0.9167 - val_loss: 0.8118 - val_acc: 0.9167\n", + "Epoch 138/200\n", + "24/24 [==============================] - 0s 713us/step - loss: 0.2293 - acc: 0.9583 - val_loss: 0.8101 - val_acc: 0.9167\n", + "Epoch 139/200\n", + "24/24 [==============================] - 0s 348us/step - loss: 0.2257 - acc: 0.9583 - val_loss: 0.8033 - val_acc: 0.9167\n", + "Epoch 140/200\n", + "24/24 [==============================] - 0s 427us/step - loss: 0.2217 - acc: 0.9583 - val_loss: 0.8001 - val_acc: 0.9167\n", + "Epoch 141/200\n", + "24/24 [==============================] - 0s 704us/step - loss: 0.2174 - acc: 0.9583 - val_loss: 0.7977 - val_acc: 0.9167\n", + "Epoch 142/200\n", + "24/24 [==============================] - 0s 541us/step - loss: 0.2133 - acc: 0.9583 - val_loss: 0.7951 - val_acc: 0.9167\n", + "Epoch 143/200\n", + "24/24 [==============================] - 0s 627us/step - loss: 0.2097 - acc: 0.9583 - val_loss: 0.7919 - val_acc: 0.9167\n", + "Epoch 144/200\n", + "24/24 [==============================] - 0s 636us/step - loss: 0.2060 - acc: 0.9583 - val_loss: 0.7887 - val_acc: 0.9167\n", + "Epoch 145/200\n", + "24/24 [==============================] - 0s 522us/step - loss: 0.2036 - acc: 0.9583 - val_loss: 0.7865 - val_acc: 0.9167\n", + "Epoch 146/200\n", + "24/24 [==============================] - 0s 541us/step - loss: 0.1991 - acc: 0.9583 - val_loss: 0.7828 - val_acc: 0.9167\n", + "Epoch 147/200\n", + "24/24 [==============================] - 0s 682us/step - loss: 0.1967 - acc: 0.9583 - val_loss: 0.7788 - val_acc: 0.9167\n", + "Epoch 148/200\n", + "24/24 [==============================] - 0s 497us/step - loss: 0.1919 - acc: 0.9583 - val_loss: 0.7723 - val_acc: 0.9167\n", + "Epoch 149/200\n", + "24/24 [==============================] - 0s 353us/step - loss: 0.1895 - acc: 0.9583 - val_loss: 0.7723 - val_acc: 0.9167\n", + "Epoch 150/200\n", + "24/24 [==============================] - 0s 442us/step - loss: 0.1864 - acc: 0.9583 - val_loss: 0.7683 - val_acc: 0.9167\n", + "Epoch 151/200\n", + "24/24 [==============================] - 0s 487us/step - loss: 0.1822 - acc: 0.9583 - val_loss: 0.7642 - val_acc: 0.9167\n", + "Epoch 152/200\n", + "24/24 [==============================] - 0s 486us/step - loss: 0.1792 - acc: 0.9583 - 
val_loss: 0.7616 - val_acc: 0.9167\n", + "Epoch 153/200\n", + "24/24 [==============================] - 0s 393us/step - loss: 0.1770 - acc: 0.9583 - val_loss: 0.7577 - val_acc: 0.9167\n", + "Epoch 154/200\n", + "24/24 [==============================] - 0s 373us/step - loss: 0.1732 - acc: 0.9583 - val_loss: 0.7556 - val_acc: 0.9167\n", + "Epoch 155/200\n", + "24/24 [==============================] - 0s 808us/step - loss: 0.1696 - acc: 0.9583 - val_loss: 0.7536 - val_acc: 0.9167\n", + "Epoch 156/200\n", + "24/24 [==============================] - 0s 517us/step - loss: 0.1673 - acc: 0.9583 - val_loss: 0.7518 - val_acc: 0.9167\n", + "Epoch 157/200\n", + "24/24 [==============================] - 0s 415us/step - loss: 0.1638 - acc: 0.9583 - val_loss: 0.7474 - val_acc: 0.9167\n", + "Epoch 158/200\n", + "24/24 [==============================] - 0s 354us/step - loss: 0.1612 - acc: 0.9583 - val_loss: 0.7453 - val_acc: 0.9167\n", + "Epoch 159/200\n", + "24/24 [==============================] - 0s 401us/step - loss: 0.1580 - acc: 0.9583 - val_loss: 0.7397 - val_acc: 0.9167\n", + "Epoch 160/200\n", + "24/24 [==============================] - 0s 272us/step - loss: 0.1545 - acc: 0.9583 - val_loss: 0.7384 - val_acc: 0.9167\n", + "Epoch 161/200\n", + "24/24 [==============================] - 0s 419us/step - loss: 0.1515 - acc: 0.9583 - val_loss: 0.7351 - val_acc: 0.9167\n", + "Epoch 162/200\n", + "24/24 [==============================] - 0s 383us/step - loss: 0.1495 - acc: 0.9583 - val_loss: 0.7351 - val_acc: 0.9167\n", + "Epoch 163/200\n", + "24/24 [==============================] - 0s 360us/step - loss: 0.1461 - acc: 0.9583 - val_loss: 0.7336 - val_acc: 0.9167\n", + "Epoch 164/200\n", + "24/24 [==============================] - 0s 401us/step - loss: 0.1435 - acc: 0.9583 - val_loss: 0.7314 - val_acc: 0.9167\n", + "Epoch 165/200\n", + "24/24 [==============================] - 0s 1ms/step - loss: 0.1418 - acc: 0.9583 - val_loss: 0.7272 - val_acc: 0.9167\n", + "Epoch 166/200\n", + "24/24 [==============================] - 0s 683us/step - loss: 0.1392 - acc: 0.9583 - val_loss: 0.7272 - val_acc: 0.9167\n", + "Epoch 167/200\n", + "24/24 [==============================] - 0s 812us/step - loss: 0.1358 - acc: 1.0000 - val_loss: 0.7255 - val_acc: 0.9167\n", + "Epoch 168/200\n", + "24/24 [==============================] - 0s 475us/step - loss: 0.1329 - acc: 1.0000 - val_loss: 0.7206 - val_acc: 0.9167\n", + "Epoch 169/200\n", + "24/24 [==============================] - 0s 486us/step - loss: 0.1307 - acc: 1.0000 - val_loss: 0.7208 - val_acc: 0.9167\n", + "Epoch 170/200\n", + "24/24 [==============================] - 0s 816us/step - loss: 0.1280 - acc: 1.0000 - val_loss: 0.7212 - val_acc: 0.9167\n", + "Epoch 171/200\n", + "24/24 [==============================] - 0s 725us/step - loss: 0.1272 - acc: 1.0000 - val_loss: 0.7191 - val_acc: 0.9167\n", + "Epoch 172/200\n", + "24/24 [==============================] - 0s 608us/step - loss: 0.1235 - acc: 1.0000 - val_loss: 0.7180 - val_acc: 0.9167\n", + "Epoch 173/200\n", + "24/24 [==============================] - 0s 547us/step - loss: 0.1206 - acc: 1.0000 - val_loss: 0.7171 - val_acc: 0.9167\n", + "Epoch 174/200\n", + "24/24 [==============================] - 0s 550us/step - loss: 0.1187 - acc: 1.0000 - val_loss: 0.7135 - val_acc: 0.9167\n", + "Epoch 175/200\n", + "24/24 [==============================] - 0s 679us/step - loss: 0.1164 - acc: 1.0000 - val_loss: 0.7128 - val_acc: 0.9167\n", + "Epoch 176/200\n", + "24/24 [==============================] - 0s 505us/step - loss: 
0.1134 - acc: 1.0000 - val_loss: 0.7106 - val_acc: 0.9167\n", + "Epoch 177/200\n", + "24/24 [==============================] - 0s 502us/step - loss: 0.1113 - acc: 1.0000 - val_loss: 0.7116 - val_acc: 0.9167\n", + "Epoch 178/200\n", + "24/24 [==============================] - 0s 571us/step - loss: 0.1084 - acc: 1.0000 - val_loss: 0.7094 - val_acc: 0.9167\n", + "Epoch 179/200\n", + "24/24 [==============================] - 0s 445us/step - loss: 0.1065 - acc: 1.0000 - val_loss: 0.7080 - val_acc: 0.9167\n", + "Epoch 180/200\n", + "24/24 [==============================] - 0s 519us/step - loss: 0.1039 - acc: 1.0000 - val_loss: 0.7091 - val_acc: 0.9167\n", + "Epoch 181/200\n", + "24/24 [==============================] - 0s 414us/step - loss: 0.1015 - acc: 1.0000 - val_loss: 0.7100 - val_acc: 0.9167\n", + "Epoch 182/200\n", + "24/24 [==============================] - 0s 467us/step - loss: 0.0995 - acc: 1.0000 - val_loss: 0.7064 - val_acc: 0.9167\n", + "Epoch 183/200\n", + "24/24 [==============================] - 0s 389us/step - loss: 0.0965 - acc: 1.0000 - val_loss: 0.7045 - val_acc: 0.9167\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Epoch 184/200\n", + "24/24 [==============================] - 0s 431us/step - loss: 0.0942 - acc: 1.0000 - val_loss: 0.7043 - val_acc: 0.9167\n", + "Epoch 185/200\n", + "24/24 [==============================] - 0s 563us/step - loss: 0.0917 - acc: 1.0000 - val_loss: 0.7039 - val_acc: 0.9167\n", + "Epoch 186/200\n", + "24/24 [==============================] - 0s 408us/step - loss: 0.0907 - acc: 1.0000 - val_loss: 0.7024 - val_acc: 0.9167\n", + "Epoch 187/200\n", + "24/24 [==============================] - 0s 333us/step - loss: 0.0874 - acc: 1.0000 - val_loss: 0.7034 - val_acc: 0.9167\n", + "Epoch 188/200\n", + "24/24 [==============================] - 0s 313us/step - loss: 0.0854 - acc: 1.0000 - val_loss: 0.7043 - val_acc: 0.9167\n", + "Epoch 189/200\n", + "24/24 [==============================] - 0s 290us/step - loss: 0.0831 - acc: 1.0000 - val_loss: 0.7029 - val_acc: 0.9167\n", + "Epoch 190/200\n", + "24/24 [==============================] - 0s 386us/step - loss: 0.0809 - acc: 1.0000 - val_loss: 0.7020 - val_acc: 0.9167\n", + "Epoch 191/200\n", + "24/24 [==============================] - 0s 289us/step - loss: 0.0791 - acc: 1.0000 - val_loss: 0.7026 - val_acc: 0.9167\n", + "Epoch 192/200\n", + "24/24 [==============================] - 0s 291us/step - loss: 0.0770 - acc: 1.0000 - val_loss: 0.7011 - val_acc: 0.9167\n", + "Epoch 193/200\n", + "24/24 [==============================] - 0s 319us/step - loss: 0.0748 - acc: 1.0000 - val_loss: 0.7019 - val_acc: 0.9167\n", + "Epoch 194/200\n", + "24/24 [==============================] - 0s 383us/step - loss: 0.0725 - acc: 1.0000 - val_loss: 0.7032 - val_acc: 0.9167\n", + "Epoch 195/200\n", + "24/24 [==============================] - 0s 2ms/step - loss: 0.0708 - acc: 1.0000 - val_loss: 0.7021 - val_acc: 0.9167\n", + "Epoch 196/200\n", + "24/24 [==============================] - 0s 542us/step - loss: 0.0688 - acc: 1.0000 - val_loss: 0.7019 - val_acc: 0.9167\n", + "Epoch 197/200\n", + "24/24 [==============================] - 0s 536us/step - loss: 0.0667 - acc: 1.0000 - val_loss: 0.7015 - val_acc: 0.9167\n", + "Epoch 198/200\n", + "24/24 [==============================] - 0s 497us/step - loss: 0.0659 - acc: 1.0000 - val_loss: 0.7023 - val_acc: 0.9167\n", + "Epoch 199/200\n", + "24/24 [==============================] - 0s 582us/step - loss: 0.0632 - acc: 1.0000 - val_loss: 0.7014 - val_acc: 
0.9167\n", + "Epoch 200/200\n", + "24/24 [==============================] - 0s 461us/step - loss: 0.0614 - acc: 1.0000 - val_loss: 0.7033 - val_acc: 0.9167\n" + ] + } + ], + "source": [ + "epochs = 200\n", + "history = model.fit(x_train,\n", + " dummy_y_train,\n", + " epochs=epochs,\n", + " batch_size=12,\n", + " validation_data=(x_val, dummy_y_val))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### From the logs above, the training accuracy is 100% and the validation accuracy is 91.67%" + ] + }, + { + "cell_type": "code", + "execution_count": 116, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([3, 4, 4, 0, 0, 3, 4, 3, 4, 3, 1, 3])" + ] + }, + "execution_count": 116, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.argmax(model.predict(x_val), axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 117, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([4, 2, 0, 3, 3, 4, 2, 2, 2, 2, 2, 2])" + ] + }, + "execution_count": 117, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "y_val" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Approach 2\n", + "\n", + "### Using a simple 3-layer ANN with only text features (with text vectorized using Keras's text to matrix binary vectorizer)" + ] + }, + { + "cell_type": "code", + "execution_count": 118, + "metadata": {}, + "outputs": [], + "source": [ + "tokenizer = Tokenizer(num_words=10000)\n", + "tokenizer.fit_on_texts(df_train['expense description'])" + ] + }, + { + "cell_type": "code", + "execution_count": 119, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(24, 10000)" + ] + }, + "execution_count": 119, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x_train_tfidf_keras = tokenizer.texts_to_matrix(df_train['expense description'], mode='binary')\n", + "x_val_tfidf_keras = tokenizer.texts_to_matrix(df_val['expense description'], mode='binary')\n", + "x_train_tfidf_keras.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 120, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "_________________________________________________________________\n", + "Layer (type) Output Shape Param # \n", + "=================================================================\n", + "dense_44 (Dense) (None, 16) 160016 \n", + "_________________________________________________________________\n", + "dense_45 (Dense) (None, 16) 272 \n", + "_________________________________________________________________\n", + "dense_46 (Dense) (None, 5) 85 \n", + "=================================================================\n", + "Total params: 160,373\n", + "Trainable params: 160,373\n", + "Non-trainable params: 0\n", + "_________________________________________________________________\n" + ] + } + ], + "source": [ + "model = models.Sequential()\n", + "model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))\n", + "model.add(layers.Dense(16, activation='relu'))\n", + "model.add(layers.Dense(5, activation='softmax'))\n", + "\n", + "model.compile(optimizer='rmsprop',\n", + " loss='categorical_crossentropy',\n", + " metrics=['accuracy'])\n", + "\n", + "model.summary()" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Train on 24 samples, validate on 12 samples\n", + "Epoch 
1/200\n", + "24/24 [==============================] - 0s 19ms/step - loss: 1.6093 - acc: 0.3750 - val_loss: 1.5909 - val_acc: 0.5000\n", + "Epoch 2/200\n", + "24/24 [==============================] - 0s 547us/step - loss: 1.5923 - acc: 0.4167 - val_loss: 1.5764 - val_acc: 0.6667\n", + "Epoch 3/200\n", + "24/24 [==============================] - 0s 772us/step - loss: 1.5797 - acc: 0.5833 - val_loss: 1.5647 - val_acc: 0.6667\n", + "Epoch 4/200\n", + "24/24 [==============================] - 0s 729us/step - loss: 1.5701 - acc: 0.6250 - val_loss: 1.5547 - val_acc: 0.6667\n", + "Epoch 5/200\n", + "24/24 [==============================] - 0s 842us/step - loss: 1.5604 - acc: 0.5833 - val_loss: 1.5448 - val_acc: 0.6667\n", + "Epoch 6/200\n", + "24/24 [==============================] - 0s 841us/step - loss: 1.5512 - acc: 0.6667 - val_loss: 1.5351 - val_acc: 0.6667\n", + "Epoch 7/200\n", + "24/24 [==============================] - 0s 836us/step - loss: 1.5421 - acc: 0.6667 - val_loss: 1.5246 - val_acc: 0.6667\n", + "Epoch 8/200\n", + "24/24 [==============================] - 0s 1ms/step - loss: 1.5325 - acc: 0.7083 - val_loss: 1.5138 - val_acc: 0.6667\n", + "Epoch 9/200\n", + "24/24 [==============================] - 0s 882us/step - loss: 1.5224 - acc: 0.7083 - val_loss: 1.5022 - val_acc: 0.7500\n", + "Epoch 10/200\n", + "24/24 [==============================] - 0s 704us/step - loss: 1.5122 - acc: 0.7083 - val_loss: 1.4902 - val_acc: 0.7500\n", + "Epoch 11/200\n", + "24/24 [==============================] - 0s 1ms/step - loss: 1.5012 - acc: 0.7083 - val_loss: 1.4778 - val_acc: 0.7500\n", + "Epoch 12/200\n", + "24/24 [==============================] - 0s 962us/step - loss: 1.4896 - acc: 0.7500 - val_loss: 1.4657 - val_acc: 0.7500\n", + "Epoch 13/200\n", + "24/24 [==============================] - 0s 583us/step - loss: 1.4781 - acc: 0.7917 - val_loss: 1.4535 - val_acc: 0.7500\n", + "Epoch 14/200\n", + "24/24 [==============================] - 0s 716us/step - loss: 1.4661 - acc: 0.7917 - val_loss: 1.4409 - val_acc: 0.7500\n", + "Epoch 15/200\n", + "24/24 [==============================] - 0s 600us/step - loss: 1.4540 - acc: 0.7917 - val_loss: 1.4280 - val_acc: 0.7500\n", + "Epoch 16/200\n", + "24/24 [==============================] - 0s 635us/step - loss: 1.4420 - acc: 0.7917 - val_loss: 1.4154 - val_acc: 0.7500\n", + "Epoch 17/200\n", + "24/24 [==============================] - 0s 1ms/step - loss: 1.4302 - acc: 0.7917 - val_loss: 1.4033 - val_acc: 0.7500\n", + "Epoch 18/200\n", + "24/24 [==============================] - 0s 789us/step - loss: 1.4178 - acc: 0.7917 - val_loss: 1.3904 - val_acc: 0.7500\n", + "Epoch 19/200\n", + "24/24 [==============================] - 0s 700us/step - loss: 1.4056 - acc: 0.7917 - val_loss: 1.3777 - val_acc: 0.7500\n", + "Epoch 20/200\n", + "24/24 [==============================] - 0s 848us/step - loss: 1.3924 - acc: 0.7917 - val_loss: 1.3637 - val_acc: 0.7500\n", + "Epoch 21/200\n", + "24/24 [==============================] - 0s 787us/step - loss: 1.3796 - acc: 0.7917 - val_loss: 1.3494 - val_acc: 0.7500\n", + "Epoch 22/200\n", + "24/24 [==============================] - 0s 730us/step - loss: 1.3667 - acc: 0.7917 - val_loss: 1.3367 - val_acc: 0.7500\n", + "Epoch 23/200\n", + "24/24 [==============================] - 0s 912us/step - loss: 1.3532 - acc: 0.7917 - val_loss: 1.3224 - val_acc: 0.7500\n", + "Epoch 24/200\n", + "24/24 [==============================] - 0s 806us/step - loss: 1.3394 - acc: 0.7917 - val_loss: 1.3081 - val_acc: 0.7500\n", + "Epoch 25/200\n", + 
"24/24 [==============================] - 0s 898us/step - loss: 1.3266 - acc: 0.7917 - val_loss: 1.2941 - val_acc: 0.7500\n", + "Epoch 26/200\n", + "24/24 [==============================] - 0s 713us/step - loss: 1.3118 - acc: 0.7917 - val_loss: 1.2798 - val_acc: 0.7500\n", + "Epoch 27/200\n", + "24/24 [==============================] - 0s 777us/step - loss: 1.2975 - acc: 0.7917 - val_loss: 1.2657 - val_acc: 0.7500\n", + "Epoch 28/200\n", + "24/24 [==============================] - 0s 696us/step - loss: 1.2831 - acc: 0.7917 - val_loss: 1.2504 - val_acc: 0.7500\n", + "Epoch 29/200\n", + "24/24 [==============================] - 0s 877us/step - loss: 1.2693 - acc: 0.7917 - val_loss: 1.2344 - val_acc: 0.7500\n", + "Epoch 30/200\n", + "24/24 [==============================] - 0s 868us/step - loss: 1.2535 - acc: 0.7917 - val_loss: 1.2190 - val_acc: 0.7500\n", + "Epoch 31/200\n", + "24/24 [==============================] - 0s 1ms/step - loss: 1.2384 - acc: 0.7917 - val_loss: 1.2038 - val_acc: 0.7500\n", + "Epoch 32/200\n", + "24/24 [==============================] - 0s 702us/step - loss: 1.2234 - acc: 0.8333 - val_loss: 1.1880 - val_acc: 0.7500\n", + "Epoch 33/200\n", + "24/24 [==============================] - 0s 736us/step - loss: 1.2078 - acc: 0.7917 - val_loss: 1.1720 - val_acc: 0.7500\n", + "Epoch 34/200\n", + "24/24 [==============================] - 0s 713us/step - loss: 1.1938 - acc: 0.8333 - val_loss: 1.1548 - val_acc: 0.7500\n", + "Epoch 35/200\n", + "24/24 [==============================] - 0s 577us/step - loss: 1.1755 - acc: 0.8333 - val_loss: 1.1386 - val_acc: 0.7500\n", + "Epoch 36/200\n", + "24/24 [==============================] - 0s 632us/step - loss: 1.1594 - acc: 0.8333 - val_loss: 1.1221 - val_acc: 0.8333\n", + "Epoch 37/200\n", + "24/24 [==============================] - 0s 678us/step - loss: 1.1428 - acc: 0.8750 - val_loss: 1.1061 - val_acc: 0.8333\n", + "Epoch 38/200\n", + "24/24 [==============================] - 0s 673us/step - loss: 1.1264 - acc: 0.8333 - val_loss: 1.0894 - val_acc: 0.8333\n", + "Epoch 39/200\n", + "24/24 [==============================] - 0s 865us/step - loss: 1.1109 - acc: 0.8750 - val_loss: 1.0730 - val_acc: 0.8333\n", + "Epoch 40/200\n", + "24/24 [==============================] - 0s 758us/step - loss: 1.0934 - acc: 0.8750 - val_loss: 1.0574 - val_acc: 0.8333\n", + "Epoch 41/200\n", + "24/24 [==============================] - 0s 624us/step - loss: 1.0763 - acc: 0.8750 - val_loss: 1.0424 - val_acc: 0.8333\n", + "Epoch 42/200\n", + "24/24 [==============================] - 0s 588us/step - loss: 1.0597 - acc: 0.8750 - val_loss: 1.0258 - val_acc: 0.8333\n", + "Epoch 43/200\n", + "24/24 [==============================] - 0s 982us/step - loss: 1.0425 - acc: 0.8750 - val_loss: 1.0114 - val_acc: 0.8333\n", + "Epoch 44/200\n", + "24/24 [==============================] - 0s 798us/step - loss: 1.0242 - acc: 0.8750 - val_loss: 0.9934 - val_acc: 0.8333\n", + "Epoch 45/200\n", + "24/24 [==============================] - 0s 668us/step - loss: 1.0063 - acc: 0.8750 - val_loss: 0.9766 - val_acc: 0.8333\n", + "Epoch 46/200\n", + "24/24 [==============================] - 0s 946us/step - loss: 0.9891 - acc: 0.8750 - val_loss: 0.9606 - val_acc: 0.8333\n", + "Epoch 47/200\n", + "24/24 [==============================] - 0s 780us/step - loss: 0.9704 - acc: 0.9167 - val_loss: 0.9451 - val_acc: 0.8333\n", + "Epoch 48/200\n", + "24/24 [==============================] - 0s 656us/step - loss: 0.9531 - acc: 0.9167 - val_loss: 0.9293 - val_acc: 0.8333\n", + "Epoch 49/200\n", + 
"24/24 [==============================] - 0s 893us/step - loss: 0.9366 - acc: 0.9167 - val_loss: 0.9141 - val_acc: 0.8333\n", + "Epoch 50/200\n", + "24/24 [==============================] - 0s 541us/step - loss: 0.9184 - acc: 0.9167 - val_loss: 0.8984 - val_acc: 0.8333\n", + "Epoch 51/200\n", + "24/24 [==============================] - 0s 656us/step - loss: 0.9012 - acc: 0.9583 - val_loss: 0.8836 - val_acc: 0.8333\n", + "Epoch 52/200\n", + "24/24 [==============================] - 0s 927us/step - loss: 0.8840 - acc: 0.9167 - val_loss: 0.8686 - val_acc: 0.8333\n", + "Epoch 53/200\n", + "24/24 [==============================] - 0s 787us/step - loss: 0.8662 - acc: 0.9583 - val_loss: 0.8538 - val_acc: 0.8333\n", + "Epoch 54/200\n", + "24/24 [==============================] - 0s 635us/step - loss: 0.8508 - acc: 0.9583 - val_loss: 0.8391 - val_acc: 0.8333\n", + "Epoch 55/200\n", + "24/24 [==============================] - 0s 586us/step - loss: 0.8339 - acc: 0.9583 - val_loss: 0.8239 - val_acc: 0.8333\n", + "Epoch 56/200\n", + "24/24 [==============================] - 0s 828us/step - loss: 0.8162 - acc: 0.9583 - val_loss: 0.8101 - val_acc: 0.8333\n", + "Epoch 57/200\n", + "24/24 [==============================] - 0s 596us/step - loss: 0.8004 - acc: 0.9583 - val_loss: 0.7967 - val_acc: 0.8333\n", + "Epoch 58/200\n", + "24/24 [==============================] - 0s 588us/step - loss: 0.7837 - acc: 0.9583 - val_loss: 0.7818 - val_acc: 0.8333\n", + "Epoch 59/200\n", + "24/24 [==============================] - 0s 606us/step - loss: 0.7673 - acc: 0.9583 - val_loss: 0.7684 - val_acc: 0.8333\n", + "Epoch 60/200\n", + "24/24 [==============================] - 0s 798us/step - loss: 0.7527 - acc: 0.9583 - val_loss: 0.7553 - val_acc: 0.8333\n", + "Epoch 61/200\n", + "24/24 [==============================] - 0s 759us/step - loss: 0.7364 - acc: 0.9583 - val_loss: 0.7425 - val_acc: 0.8333\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Epoch 62/200\n", + "24/24 [==============================] - 0s 3ms/step - loss: 0.7207 - acc: 0.9583 - val_loss: 0.7289 - val_acc: 0.8333\n", + "Epoch 63/200\n", + "24/24 [==============================] - 0s 746us/step - loss: 0.7050 - acc: 0.9583 - val_loss: 0.7160 - val_acc: 0.8333\n", + "Epoch 64/200\n", + "24/24 [==============================] - 0s 772us/step - loss: 0.6892 - acc: 0.9583 - val_loss: 0.7032 - val_acc: 0.8333\n", + "Epoch 65/200\n", + "24/24 [==============================] - 0s 1ms/step - loss: 0.6745 - acc: 0.9583 - val_loss: 0.6910 - val_acc: 0.8333\n", + "Epoch 66/200\n", + "24/24 [==============================] - 0s 849us/step - loss: 0.6614 - acc: 0.9583 - val_loss: 0.6800 - val_acc: 0.9167\n", + "Epoch 67/200\n", + "24/24 [==============================] - 0s 812us/step - loss: 0.6449 - acc: 0.9583 - val_loss: 0.6678 - val_acc: 0.9167\n", + "Epoch 68/200\n", + "24/24 [==============================] - 0s 812us/step - loss: 0.6304 - acc: 1.0000 - val_loss: 0.6572 - val_acc: 0.9167\n", + "Epoch 69/200\n", + "24/24 [==============================] - 0s 925us/step - loss: 0.6159 - acc: 1.0000 - val_loss: 0.6446 - val_acc: 0.9167\n", + "Epoch 70/200\n", + "24/24 [==============================] - 0s 850us/step - loss: 0.6017 - acc: 1.0000 - val_loss: 0.6338 - val_acc: 0.9167\n", + "Epoch 71/200\n", + "24/24 [==============================] - 0s 678us/step - loss: 0.5887 - acc: 1.0000 - val_loss: 0.6241 - val_acc: 0.9167\n", + "Epoch 72/200\n", + "24/24 [==============================] - 0s 904us/step - loss: 0.5741 - acc: 
1.0000 - val_loss: 0.6128 - val_acc: 0.9167\n", + "Epoch 73/200\n", + "24/24 [==============================] - 0s 844us/step - loss: 0.5600 - acc: 1.0000 - val_loss: 0.6023 - val_acc: 0.9167\n", + "Epoch 74/200\n", + "24/24 [==============================] - 0s 843us/step - loss: 0.5469 - acc: 1.0000 - val_loss: 0.5924 - val_acc: 0.9167\n", + "Epoch 75/200\n", + "24/24 [==============================] - 0s 805us/step - loss: 0.5334 - acc: 1.0000 - val_loss: 0.5823 - val_acc: 0.9167\n", + "Epoch 76/200\n", + "24/24 [==============================] - 0s 831us/step - loss: 0.5210 - acc: 1.0000 - val_loss: 0.5734 - val_acc: 0.9167\n", + "Epoch 77/200\n", + "24/24 [==============================] - 0s 989us/step - loss: 0.5075 - acc: 1.0000 - val_loss: 0.5628 - val_acc: 0.9167\n", + "Epoch 78/200\n", + "24/24 [==============================] - 0s 840us/step - loss: 0.4947 - acc: 1.0000 - val_loss: 0.5535 - val_acc: 0.9167\n", + "Epoch 79/200\n", + "24/24 [==============================] - 0s 695us/step - loss: 0.4827 - acc: 1.0000 - val_loss: 0.5441 - val_acc: 0.9167\n", + "Epoch 80/200\n", + "24/24 [==============================] - 0s 853us/step - loss: 0.4706 - acc: 1.0000 - val_loss: 0.5357 - val_acc: 0.9167\n", + "Epoch 81/200\n", + "24/24 [==============================] - 0s 768us/step - loss: 0.4585 - acc: 1.0000 - val_loss: 0.5271 - val_acc: 0.9167\n", + "Epoch 82/200\n", + "24/24 [==============================] - 0s 822us/step - loss: 0.4473 - acc: 1.0000 - val_loss: 0.5192 - val_acc: 0.9167\n", + "Epoch 83/200\n", + "24/24 [==============================] - 0s 661us/step - loss: 0.4347 - acc: 1.0000 - val_loss: 0.5104 - val_acc: 0.9167\n", + "Epoch 84/200\n", + "24/24 [==============================] - 0s 661us/step - loss: 0.4235 - acc: 1.0000 - val_loss: 0.5018 - val_acc: 0.9167\n", + "Epoch 85/200\n", + "24/24 [==============================] - 0s 696us/step - loss: 0.4141 - acc: 1.0000 - val_loss: 0.4952 - val_acc: 0.9167\n", + "Epoch 86/200\n", + "24/24 [==============================] - 0s 675us/step - loss: 0.4020 - acc: 1.0000 - val_loss: 0.4864 - val_acc: 0.9167\n", + "Epoch 87/200\n", + "24/24 [==============================] - 0s 740us/step - loss: 0.3908 - acc: 1.0000 - val_loss: 0.4800 - val_acc: 0.9167\n", + "Epoch 88/200\n", + "24/24 [==============================] - 0s 679us/step - loss: 0.3808 - acc: 1.0000 - val_loss: 0.4723 - val_acc: 0.9167\n", + "Epoch 89/200\n", + "24/24 [==============================] - 0s 639us/step - loss: 0.3705 - acc: 1.0000 - val_loss: 0.4647 - val_acc: 0.9167\n", + "Epoch 90/200\n", + "24/24 [==============================] - 0s 736us/step - loss: 0.3613 - acc: 1.0000 - val_loss: 0.4576 - val_acc: 0.9167\n", + "Epoch 91/200\n", + "24/24 [==============================] - 0s 765us/step - loss: 0.3513 - acc: 1.0000 - val_loss: 0.4507 - val_acc: 0.9167\n", + "Epoch 92/200\n", + "24/24 [==============================] - 0s 736us/step - loss: 0.3412 - acc: 1.0000 - val_loss: 0.4441 - val_acc: 0.9167\n", + "Epoch 93/200\n", + "24/24 [==============================] - 0s 722us/step - loss: 0.3318 - acc: 1.0000 - val_loss: 0.4378 - val_acc: 0.9167\n", + "Epoch 94/200\n", + "24/24 [==============================] - 0s 853us/step - loss: 0.3230 - acc: 1.0000 - val_loss: 0.4309 - val_acc: 0.9167\n", + "Epoch 95/200\n", + "24/24 [==============================] - 0s 705us/step - loss: 0.3136 - acc: 1.0000 - val_loss: 0.4254 - val_acc: 0.9167\n", + "Epoch 96/200\n", + "24/24 [==============================] - 0s 936us/step - loss: 0.3057 - acc: 
1.0000 - val_loss: 0.4188 - val_acc: 0.9167\n", + "Epoch 97/200\n", + "24/24 [==============================] - 0s 816us/step - loss: 0.2971 - acc: 1.0000 - val_loss: 0.4136 - val_acc: 0.9167\n", + "Epoch 98/200\n", + "24/24 [==============================] - 0s 624us/step - loss: 0.2880 - acc: 1.0000 - val_loss: 0.4074 - val_acc: 0.9167\n", + "Epoch 99/200\n", + "24/24 [==============================] - 0s 883us/step - loss: 0.2798 - acc: 1.0000 - val_loss: 0.4016 - val_acc: 0.9167\n", + "Epoch 100/200\n", + "24/24 [==============================] - 0s 638us/step - loss: 0.2718 - acc: 1.0000 - val_loss: 0.3955 - val_acc: 0.9167\n", + "Epoch 101/200\n", + "24/24 [==============================] - 0s 839us/step - loss: 0.2640 - acc: 1.0000 - val_loss: 0.3900 - val_acc: 0.9167\n", + "Epoch 102/200\n", + "24/24 [==============================] - 0s 657us/step - loss: 0.2574 - acc: 1.0000 - val_loss: 0.3858 - val_acc: 0.9167\n", + "Epoch 103/200\n", + "24/24 [==============================] - 0s 976us/step - loss: 0.2490 - acc: 1.0000 - val_loss: 0.3801 - val_acc: 0.9167\n", + "Epoch 104/200\n", + "24/24 [==============================] - 0s 679us/step - loss: 0.2416 - acc: 1.0000 - val_loss: 0.3745 - val_acc: 0.9167\n", + "Epoch 105/200\n", + "24/24 [==============================] - 0s 815us/step - loss: 0.2343 - acc: 1.0000 - val_loss: 0.3707 - val_acc: 0.9167\n", + "Epoch 106/200\n", + "24/24 [==============================] - 0s 805us/step - loss: 0.2279 - acc: 1.0000 - val_loss: 0.3647 - val_acc: 0.9167\n", + "Epoch 107/200\n", + "24/24 [==============================] - 0s 784us/step - loss: 0.2210 - acc: 1.0000 - val_loss: 0.3601 - val_acc: 0.9167\n", + "Epoch 108/200\n", + "24/24 [==============================] - 0s 781us/step - loss: 0.2154 - acc: 1.0000 - val_loss: 0.3567 - val_acc: 0.9167\n", + "Epoch 109/200\n", + "24/24 [==============================] - 0s 828us/step - loss: 0.2082 - acc: 1.0000 - val_loss: 0.3511 - val_acc: 0.9167\n", + "Epoch 110/200\n", + "24/24 [==============================] - 0s 1ms/step - loss: 0.2022 - acc: 1.0000 - val_loss: 0.3464 - val_acc: 0.9167\n", + "Epoch 111/200\n", + "24/24 [==============================] - 0s 1ms/step - loss: 0.1955 - acc: 1.0000 - val_loss: 0.3432 - val_acc: 0.9167\n", + "Epoch 112/200\n", + "24/24 [==============================] - 0s 735us/step - loss: 0.1905 - acc: 1.0000 - val_loss: 0.3390 - val_acc: 0.9167\n", + "Epoch 113/200\n", + "24/24 [==============================] - 0s 864us/step - loss: 0.1842 - acc: 1.0000 - val_loss: 0.3345 - val_acc: 0.9167\n", + "Epoch 114/200\n", + "24/24 [==============================] - 0s 705us/step - loss: 0.1792 - acc: 1.0000 - val_loss: 0.3316 - val_acc: 0.9167\n", + "Epoch 115/200\n", + "24/24 [==============================] - 0s 649us/step - loss: 0.1736 - acc: 1.0000 - val_loss: 0.3277 - val_acc: 0.9167\n", + "Epoch 116/200\n", + "24/24 [==============================] - 0s 812us/step - loss: 0.1689 - acc: 1.0000 - val_loss: 0.3228 - val_acc: 0.9167\n", + "Epoch 117/200\n", + "24/24 [==============================] - 0s 708us/step - loss: 0.1634 - acc: 1.0000 - val_loss: 0.3193 - val_acc: 0.9167\n", + "Epoch 118/200\n", + "24/24 [==============================] - 0s 2ms/step - loss: 0.1584 - acc: 1.0000 - val_loss: 0.3161 - val_acc: 0.9167\n", + "Epoch 119/200\n", + "24/24 [==============================] - 0s 886us/step - loss: 0.1539 - acc: 1.0000 - val_loss: 0.3120 - val_acc: 0.9167\n", + "Epoch 120/200\n", + "24/24 [==============================] - 0s 716us/step - loss: 
0.1493 - acc: 1.0000 - val_loss: 0.3082 - val_acc: 0.9167\n", + "Epoch 121/200\n", + "24/24 [==============================] - 0s 749us/step - loss: 0.1448 - acc: 1.0000 - val_loss: 0.3059 - val_acc: 0.9167\n", + "Epoch 122/200\n", + "24/24 [==============================] - 0s 745us/step - loss: 0.1404 - acc: 1.0000 - val_loss: 0.3018 - val_acc: 0.9167\n", + "Epoch 123/200\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "24/24 [==============================] - 0s 666us/step - loss: 0.1366 - acc: 1.0000 - val_loss: 0.2996 - val_acc: 0.9167\n", + "Epoch 124/200\n", + "24/24 [==============================] - 0s 721us/step - loss: 0.1322 - acc: 1.0000 - val_loss: 0.2955 - val_acc: 0.9167\n", + "Epoch 125/200\n", + "24/24 [==============================] - 0s 609us/step - loss: 0.1282 - acc: 1.0000 - val_loss: 0.2919 - val_acc: 0.9167\n", + "Epoch 126/200\n", + "24/24 [==============================] - 0s 639us/step - loss: 0.1243 - acc: 1.0000 - val_loss: 0.2895 - val_acc: 0.9167\n", + "Epoch 127/200\n", + "24/24 [==============================] - 0s 918us/step - loss: 0.1208 - acc: 1.0000 - val_loss: 0.2859 - val_acc: 0.9167\n", + "Epoch 128/200\n", + "24/24 [==============================] - 0s 1ms/step - loss: 0.1169 - acc: 1.0000 - val_loss: 0.2835 - val_acc: 0.9167\n", + "Epoch 129/200\n", + "24/24 [==============================] - 0s 1ms/step - loss: 0.1134 - acc: 1.0000 - val_loss: 0.2800 - val_acc: 0.9167\n", + "Epoch 130/200\n", + "24/24 [==============================] - 0s 990us/step - loss: 0.1106 - acc: 1.0000 - val_loss: 0.2777 - val_acc: 0.9167\n", + "Epoch 131/200\n", + "24/24 [==============================] - 0s 899us/step - loss: 0.1071 - acc: 1.0000 - val_loss: 0.2742 - val_acc: 0.9167\n", + "Epoch 132/200\n", + "24/24 [==============================] - 0s 826us/step - loss: 0.1039 - acc: 1.0000 - val_loss: 0.2716 - val_acc: 0.9167\n", + "Epoch 133/200\n", + "24/24 [==============================] - 0s 915us/step - loss: 0.1006 - acc: 1.0000 - val_loss: 0.2686 - val_acc: 0.9167\n", + "Epoch 134/200\n", + "24/24 [==============================] - 0s 904us/step - loss: 0.0976 - acc: 1.0000 - val_loss: 0.2659 - val_acc: 0.9167\n", + "Epoch 135/200\n", + "24/24 [==============================] - 0s 873us/step - loss: 0.0947 - acc: 1.0000 - val_loss: 0.2628 - val_acc: 0.9167\n", + "Epoch 136/200\n", + "24/24 [==============================] - 0s 848us/step - loss: 0.0920 - acc: 1.0000 - val_loss: 0.2606 - val_acc: 0.9167\n", + "Epoch 137/200\n", + "24/24 [==============================] - 0s 887us/step - loss: 0.0891 - acc: 1.0000 - val_loss: 0.2565 - val_acc: 0.9167\n", + "Epoch 138/200\n", + "24/24 [==============================] - 0s 811us/step - loss: 0.0865 - acc: 1.0000 - val_loss: 0.2548 - val_acc: 0.9167\n", + "Epoch 139/200\n", + "24/24 [==============================] - 0s 717us/step - loss: 0.0838 - acc: 1.0000 - val_loss: 0.2525 - val_acc: 0.9167\n", + "Epoch 140/200\n", + "24/24 [==============================] - 0s 686us/step - loss: 0.0814 - acc: 1.0000 - val_loss: 0.2491 - val_acc: 0.9167\n", + "Epoch 141/200\n", + "24/24 [==============================] - 0s 788us/step - loss: 0.0789 - acc: 1.0000 - val_loss: 0.2476 - val_acc: 0.9167\n", + "Epoch 142/200\n", + "24/24 [==============================] - 0s 643us/step - loss: 0.0770 - acc: 1.0000 - val_loss: 0.2446 - val_acc: 0.9167\n", + "Epoch 143/200\n", + "24/24 [==============================] - 0s 854us/step - loss: 0.0747 - acc: 1.0000 - val_loss: 0.2427 - val_acc: 
0.9167\n", + "Epoch 144/200\n", + "24/24 [==============================] - 0s 679us/step - loss: 0.0723 - acc: 1.0000 - val_loss: 0.2407 - val_acc: 0.9167\n", + "Epoch 145/200\n", + "24/24 [==============================] - 0s 814us/step - loss: 0.0704 - acc: 1.0000 - val_loss: 0.2377 - val_acc: 0.9167\n", + "Epoch 146/200\n", + "24/24 [==============================] - 0s 724us/step - loss: 0.0680 - acc: 1.0000 - val_loss: 0.2359 - val_acc: 0.9167\n", + "Epoch 147/200\n", + "24/24 [==============================] - 0s 779us/step - loss: 0.0661 - acc: 1.0000 - val_loss: 0.2326 - val_acc: 0.9167\n", + "Epoch 148/200\n", + "24/24 [==============================] - 0s 635us/step - loss: 0.0642 - acc: 1.0000 - val_loss: 0.2303 - val_acc: 0.9167\n", + "Epoch 149/200\n", + "24/24 [==============================] - 0s 787us/step - loss: 0.0624 - acc: 1.0000 - val_loss: 0.2288 - val_acc: 0.9167\n", + "Epoch 150/200\n", + "24/24 [==============================] - 0s 547us/step - loss: 0.0607 - acc: 1.0000 - val_loss: 0.2254 - val_acc: 0.9167\n", + "Epoch 151/200\n", + "24/24 [==============================] - 0s 857us/step - loss: 0.0587 - acc: 1.0000 - val_loss: 0.2229 - val_acc: 0.9167\n", + "Epoch 152/200\n", + "24/24 [==============================] - 0s 726us/step - loss: 0.0569 - acc: 1.0000 - val_loss: 0.2214 - val_acc: 0.9167\n", + "Epoch 153/200\n", + "24/24 [==============================] - 0s 652us/step - loss: 0.0556 - acc: 1.0000 - val_loss: 0.2186 - val_acc: 0.9167\n", + "Epoch 154/200\n", + "24/24 [==============================] - 0s 765us/step - loss: 0.0536 - acc: 1.0000 - val_loss: 0.2165 - val_acc: 0.9167\n", + "Epoch 155/200\n", + "24/24 [==============================] - 0s 925us/step - loss: 0.0520 - acc: 1.0000 - val_loss: 0.2138 - val_acc: 0.9167\n", + "Epoch 156/200\n", + "24/24 [==============================] - 0s 722us/step - loss: 0.0506 - acc: 1.0000 - val_loss: 0.2122 - val_acc: 0.9167\n", + "Epoch 157/200\n", + "24/24 [==============================] - 0s 729us/step - loss: 0.0489 - acc: 1.0000 - val_loss: 0.2083 - val_acc: 0.9167\n", + "Epoch 158/200\n", + "24/24 [==============================] - 0s 1ms/step - loss: 0.0475 - acc: 1.0000 - val_loss: 0.2066 - val_acc: 0.9167\n", + "Epoch 159/200\n", + "24/24 [==============================] - 0s 775us/step - loss: 0.0461 - acc: 1.0000 - val_loss: 0.2048 - val_acc: 0.9167\n", + "Epoch 160/200\n", + "24/24 [==============================] - 0s 767us/step - loss: 0.0448 - acc: 1.0000 - val_loss: 0.2032 - val_acc: 0.9167\n", + "Epoch 161/200\n", + "24/24 [==============================] - 0s 644us/step - loss: 0.0434 - acc: 1.0000 - val_loss: 0.2000 - val_acc: 0.9167\n", + "Epoch 162/200\n", + "24/24 [==============================] - 0s 779us/step - loss: 0.0421 - acc: 1.0000 - val_loss: 0.1973 - val_acc: 0.9167\n", + "Epoch 163/200\n", + "24/24 [==============================] - 0s 782us/step - loss: 0.0407 - acc: 1.0000 - val_loss: 0.1950 - val_acc: 0.9167\n", + "Epoch 164/200\n", + "24/24 [==============================] - 0s 806us/step - loss: 0.0395 - acc: 1.0000 - val_loss: 0.1934 - val_acc: 0.9167\n", + "Epoch 165/200\n", + "24/24 [==============================] - 0s 582us/step - loss: 0.0384 - acc: 1.0000 - val_loss: 0.1919 - val_acc: 0.9167\n", + "Epoch 166/200\n", + "24/24 [==============================] - 0s 669us/step - loss: 0.0373 - acc: 1.0000 - val_loss: 0.1888 - val_acc: 0.9167\n", + "Epoch 167/200\n", + "24/24 [==============================] - 0s 742us/step - loss: 0.0359 - acc: 1.0000 - 
val_loss: 0.1876 - val_acc: 0.9167\n", + "Epoch 168/200\n", + "24/24 [==============================] - 0s 794us/step - loss: 0.0349 - acc: 1.0000 - val_loss: 0.1840 - val_acc: 0.9167\n", + "Epoch 169/200\n", + "24/24 [==============================] - 0s 712us/step - loss: 0.0338 - acc: 1.0000 - val_loss: 0.1827 - val_acc: 0.9167\n", + "Epoch 170/200\n", + "24/24 [==============================] - 0s 844us/step - loss: 0.0327 - acc: 1.0000 - val_loss: 0.1812 - val_acc: 0.9167\n", + "Epoch 171/200\n", + "24/24 [==============================] - 0s 680us/step - loss: 0.0316 - acc: 1.0000 - val_loss: 0.1794 - val_acc: 0.9167\n", + "Epoch 172/200\n", + "24/24 [==============================] - 0s 494us/step - loss: 0.0306 - acc: 1.0000 - val_loss: 0.1763 - val_acc: 0.9167\n", + "Epoch 173/200\n", + "24/24 [==============================] - 0s 635us/step - loss: 0.0299 - acc: 1.0000 - val_loss: 0.1748 - val_acc: 0.9167\n", + "Epoch 174/200\n", + "24/24 [==============================] - 0s 583us/step - loss: 0.0288 - acc: 1.0000 - val_loss: 0.1731 - val_acc: 0.9167\n", + "Epoch 175/200\n", + "24/24 [==============================] - 0s 586us/step - loss: 0.0281 - acc: 1.0000 - val_loss: 0.1703 - val_acc: 0.9167\n", + "Epoch 176/200\n", + "24/24 [==============================] - 0s 871us/step - loss: 0.0271 - acc: 1.0000 - val_loss: 0.1688 - val_acc: 0.9167\n", + "Epoch 177/200\n", + "24/24 [==============================] - 0s 689us/step - loss: 0.0262 - acc: 1.0000 - val_loss: 0.1676 - val_acc: 0.9167\n", + "Epoch 178/200\n", + "24/24 [==============================] - 0s 582us/step - loss: 0.0254 - acc: 1.0000 - val_loss: 0.1644 - val_acc: 0.9167\n", + "Epoch 179/200\n", + "24/24 [==============================] - 0s 833us/step - loss: 0.0245 - acc: 1.0000 - val_loss: 0.1631 - val_acc: 0.9167\n", + "Epoch 180/200\n", + "24/24 [==============================] - 0s 644us/step - loss: 0.0236 - acc: 1.0000 - val_loss: 0.1611 - val_acc: 0.9167\n", + "Epoch 181/200\n", + "24/24 [==============================] - 0s 550us/step - loss: 0.0229 - acc: 1.0000 - val_loss: 0.1586 - val_acc: 0.9167\n", + "Epoch 182/200\n", + "24/24 [==============================] - 0s 706us/step - loss: 0.0222 - acc: 1.0000 - val_loss: 0.1573 - val_acc: 0.9167\n", + "Epoch 183/200\n", + "24/24 [==============================] - 0s 684us/step - loss: 0.0213 - acc: 1.0000 - val_loss: 0.1547 - val_acc: 0.9167\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Epoch 184/200\n", + "24/24 [==============================] - 0s 585us/step - loss: 0.0207 - acc: 1.0000 - val_loss: 0.1535 - val_acc: 0.9167\n", + "Epoch 185/200\n", + "24/24 [==============================] - 0s 725us/step - loss: 0.0199 - acc: 1.0000 - val_loss: 0.1521 - val_acc: 0.9167\n", + "Epoch 186/200\n", + "24/24 [==============================] - 0s 692us/step - loss: 0.0192 - acc: 1.0000 - val_loss: 0.1486 - val_acc: 0.9167\n", + "Epoch 187/200\n", + "24/24 [==============================] - 0s 529us/step - loss: 0.0187 - acc: 1.0000 - val_loss: 0.1475 - val_acc: 0.9167\n", + "Epoch 188/200\n", + "24/24 [==============================] - 0s 608us/step - loss: 0.0179 - acc: 1.0000 - val_loss: 0.1466 - val_acc: 0.9167\n", + "Epoch 189/200\n", + "24/24 [==============================] - 0s 618us/step - loss: 0.0173 - acc: 1.0000 - val_loss: 0.1443 - val_acc: 0.9167\n", + "Epoch 190/200\n", + "24/24 [==============================] - 0s 758us/step - loss: 0.0167 - acc: 1.0000 - val_loss: 0.1434 - val_acc: 0.9167\n", + "Epoch 
191/200\n", + "24/24 [==============================] - 0s 553us/step - loss: 0.0161 - acc: 1.0000 - val_loss: 0.1425 - val_acc: 0.9167\n", + "Epoch 192/200\n", + "24/24 [==============================] - 0s 909us/step - loss: 0.0157 - acc: 1.0000 - val_loss: 0.1393 - val_acc: 0.9167\n", + "Epoch 193/200\n", + "24/24 [==============================] - 0s 778us/step - loss: 0.0151 - acc: 1.0000 - val_loss: 0.1379 - val_acc: 0.9167\n", + "Epoch 194/200\n", + "24/24 [==============================] - 0s 685us/step - loss: 0.0146 - acc: 1.0000 - val_loss: 0.1371 - val_acc: 0.9167\n", + "Epoch 195/200\n", + "24/24 [==============================] - 0s 604us/step - loss: 0.0140 - acc: 1.0000 - val_loss: 0.1361 - val_acc: 0.9167\n", + "Epoch 196/200\n", + "24/24 [==============================] - 0s 897us/step - loss: 0.0136 - acc: 1.0000 - val_loss: 0.1338 - val_acc: 0.9167\n", + "Epoch 197/200\n", + "24/24 [==============================] - 0s 700us/step - loss: 0.0130 - acc: 1.0000 - val_loss: 0.1325 - val_acc: 0.9167\n", + "Epoch 198/200\n", + "24/24 [==============================] - 0s 591us/step - loss: 0.0125 - acc: 1.0000 - val_loss: 0.1305 - val_acc: 0.9167\n", + "Epoch 199/200\n", + "24/24 [==============================] - 0s 701us/step - loss: 0.0122 - acc: 1.0000 - val_loss: 0.1287 - val_acc: 0.9167\n", + "Epoch 200/200\n", + "24/24 [==============================] - 0s 592us/step - loss: 0.0117 - acc: 1.0000 - val_loss: 0.1263 - val_acc: 0.9167\n" + ] + } + ], + "source": [ + "epochs = 200\n", + "history = model.fit(x_train_tfidf_keras,\n", + " dummy_y_train,\n", + " epochs=epochs,\n", + " batch_size=12,\n", + " validation_data=(x_val_tfidf_keras, dummy_y_val))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### From the logs above, the training accuracy is 100% and the validation accuracy is 91.67%" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Approach 3\n", + "\n", + "### Using an RNN with an LSTM layer and a pre-trained embedding layer (GloVe), with only text features (descriptions converted to padded integer sequences with Keras's tokenizer)" + ] + }, + { + "cell_type": "code", + "execution_count": 121, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Found 400000 word vectors.\n" + ] + } + ], + "source": [ + "glove_dir = 'glove.6B'\n", + "embeddings_index = {}\n", + "with open('glove.6B.100d.txt') as f:\n", + " for line in f:\n", + " values = line.split()\n", + " word = values[0]\n", + " coefs = np.asarray(values[1:], dtype='float32')\n", + " embeddings_index[word] = coefs\n", + "print('Found %s word vectors.'
% len(embeddings_index))" + ] + }, + { + "cell_type": "code", + "execution_count": 122, + "metadata": {}, + "outputs": [], + "source": [ + "max_len = 500\n", + "embedding_dimension = 100\n", + "\n", + "word_index = tokenizer.word_index\n", + "train_sequences = tokenizer.texts_to_sequences(df_train['expense description'])\n", + "val_sequences = tokenizer.texts_to_sequences(df_val['expense description'])\n", + "\n", + "# the original cell referenced an undefined `kreprocessing` module;\n", + "# keras.preprocessing is assumed here (requires `import keras`)\n", + "train_data = keras.preprocessing.sequence.pad_sequences(train_sequences, maxlen=max_len)\n", + "val_data = keras.preprocessing.sequence.pad_sequences(val_sequences, maxlen=max_len)" + ] + }, + { + "cell_type": "code", + "execution_count": 99, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(24, 500)" + ] + }, + "execution_count": 99, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "train_data.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 100, + "metadata": {}, + "outputs": [], + "source": [ + "max_words = 10000\n", + "embedding_matrix = np.zeros((max_words, embedding_dimension))\n", + "for word, i in word_index.items():\n", + " if i < max_words:\n", + " embedding_vector = embeddings_index.get(word)\n", + " if embedding_vector is not None:\n", + " embedding_matrix[i] = embedding_vector" + ] + }, + { + "cell_type": "code", + "execution_count": 101, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "_________________________________________________________________\n", + "Layer (type) Output Shape Param # \n", + "=================================================================\n", + "embedding_6 (Embedding) (None, 500, 100) 1000000 \n", + "_________________________________________________________________\n", + "lstm_5 (LSTM) (None, 32) 17024 \n", + "_________________________________________________________________\n", + "dense_40 (Dense) (None, 5) 165 \n", + "=================================================================\n", + "Total params: 1,017,189\n", + "Trainable params: 1,017,189\n", + "Non-trainable params: 0\n", + "_________________________________________________________________\n" + ] + } + ], + "source": [ + "model = models.Sequential()\n", + "model.add(layers.Embedding(max_words, embedding_dimension, input_length=max_len))\n", + "model.add(layers.LSTM(32))\n", + "model.add(layers.Dense(5, activation='softmax'))\n", + "model.summary()" + ] + }, + { + "cell_type": "code", + "execution_count": 102, + "metadata": {}, + "outputs": [], + "source": [ + "# load the pre-trained GloVe weights into the embedding layer and freeze it\n", + "model.layers[0].set_weights([embedding_matrix])\n", + "model.layers[0].trainable = False" + ] + }, + { + "cell_type": "code", + "execution_count": 103, + "metadata": {}, + "outputs": [], + "source": [ + "model.compile(optimizer='rmsprop',\n", + " loss='categorical_crossentropy',\n", + " metrics=['accuracy'])" + ] + }, + { + "cell_type": "code", + "execution_count": 104, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Train on 24 samples, validate on 12 samples\n", + "Epoch 1/20\n", + "24/24 [==============================] - 3s 123ms/step - loss: 1.5062 - acc: 0.3750 - val_loss: 1.3186 - val_acc: 0.7500\n", + "Epoch 2/20\n", + "24/24 [==============================] - 2s 71ms/step - loss: 1.2632 - acc: 0.7917 - val_loss: 1.1607 - val_acc: 0.8333\n", + "Epoch 3/20\n", + "24/24 [==============================] - 2s 71ms/step - loss: 1.0976 - acc: 0.9167 - val_loss: 1.0268 - val_acc: 0.8333\n", + "Epoch 4/20\n", + "24/24 [==============================] - 2s 70ms/step - loss: 
0.9557 - acc: 0.9167 - val_loss: 0.9094 - val_acc: 0.8333\n", + "Epoch 5/20\n", + "24/24 [==============================] - 2s 70ms/step - loss: 0.8403 - acc: 0.9167 - val_loss: 0.8069 - val_acc: 0.8333\n", + "Epoch 6/20\n", + "24/24 [==============================] - 2s 72ms/step - loss: 0.7366 - acc: 0.9167 - val_loss: 0.7196 - val_acc: 0.8333\n", + "Epoch 7/20\n", + "24/24 [==============================] - 2s 73ms/step - loss: 0.6395 - acc: 0.9167 - val_loss: 0.6425 - val_acc: 0.8333\n", + "Epoch 8/20\n", + "24/24 [==============================] - 2s 71ms/step - loss: 0.5593 - acc: 0.9167 - val_loss: 0.5777 - val_acc: 0.8333\n", + "Epoch 9/20\n", + "24/24 [==============================] - 2s 72ms/step - loss: 0.4913 - acc: 0.9167 - val_loss: 0.5215 - val_acc: 0.8333\n", + "Epoch 10/20\n", + "24/24 [==============================] - 2s 75ms/step - loss: 0.4267 - acc: 0.9167 - val_loss: 0.4744 - val_acc: 0.8333\n", + "Epoch 11/20\n", + "24/24 [==============================] - 2s 75ms/step - loss: 0.3747 - acc: 0.9167 - val_loss: 0.4366 - val_acc: 0.8333\n", + "Epoch 12/20\n", + "24/24 [==============================] - 2s 68ms/step - loss: 0.3289 - acc: 0.9167 - val_loss: 0.4040 - val_acc: 0.8333\n", + "Epoch 13/20\n", + "24/24 [==============================] - 2s 70ms/step - loss: 0.2909 - acc: 0.9583 - val_loss: 0.3759 - val_acc: 0.8333\n", + "Epoch 14/20\n", + "24/24 [==============================] - 2s 70ms/step - loss: 0.2546 - acc: 0.9583 - val_loss: 0.3516 - val_acc: 0.8333\n", + "Epoch 15/20\n", + "24/24 [==============================] - 2s 73ms/step - loss: 0.2245 - acc: 0.9583 - val_loss: 0.3315 - val_acc: 0.9167\n", + "Epoch 16/20\n", + "24/24 [==============================] - 2s 73ms/step - loss: 0.1986 - acc: 0.9583 - val_loss: 0.3123 - val_acc: 0.9167\n", + "Epoch 17/20\n", + "24/24 [==============================] - 2s 71ms/step - loss: 0.1746 - acc: 1.0000 - val_loss: 0.2954 - val_acc: 0.9167\n", + "Epoch 18/20\n", + "24/24 [==============================] - 2s 71ms/step - loss: 0.1547 - acc: 1.0000 - val_loss: 0.2806 - val_acc: 0.9167\n", + "Epoch 19/20\n", + "24/24 [==============================] - 2s 71ms/step - loss: 0.1366 - acc: 1.0000 - val_loss: 0.2674 - val_acc: 0.9167\n", + "Epoch 20/20\n", + "24/24 [==============================] - 2s 73ms/step - loss: 0.1216 - acc: 1.0000 - val_loss: 0.2544 - val_acc: 0.9167\n" + ] + } + ], + "source": [ + "history = model.fit(train_data, dummy_y_train,\n", + " epochs=20,\n", + " batch_size=6,\n", + " validation_data=(val_data, dummy_y_val))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### From the logs above, the training accuracy is 100% and the validation accuracy is 91.67%" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/clustering_personal_business.ipynb b/clustering_personal_business.ipynb new file mode 100644 index 0000000..108f1ff --- /dev/null +++ b/clustering_personal_business.ipynb @@ -0,0 +1,2076 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Clustering\n", + "\n", + "The clustering solution tries to address the problem:\n", + "\n", + "*Mixing of 
personal and business expenses is a common problem for small businesses. Create an algorithm that can separate any potential personal expenses in the training data. Labels of personal and business expenses were deliberately not given as this is often the case in our system. There is no right answer so it is important you provide any assumptions you have made.*\n", + "\n", + "For this, we will use K-means clustering for simplicity.\n", + "\n", + "For distinguishing personal expenses from business expenses, we are going to make the following **assumptions** (**none of these are necessarily true in all cases**):\n", + "\n", + "- If the date of an expense falls on a weekend or a holiday, it is more likely to be a personal expense (notice, however, that a flaw of this assumption is that an expense on a Friday night might very well be personal)\n", + "- A higher expense amount has a higher chance of being a business expense\n", + "- The text fields (expense description and category), along with the previous 2 criteria, can possibly give us some clue about the type of expense. For example, the categories **Office Supplies** and **Computer - Hardware** have a higher chance of being business expenses, whereas **Meals and Entertainment** with a relatively low expense amount has a higher chance of being a personal expense, especially if it is on a weekend.\n", + "- Employee ID might be useful as well, since certain employees might have certain habits of making personal or business type expenses.\n", + "- Same goes for role.\n", + "\n", + "(Though not shown in this notebook, I later found that excluding employee id and role does not alter the resulting clusters)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Necessary imports" + ] + }, + { + "cell_type": "code", + "execution_count": 157, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "from collections import defaultdict\n", + "import itertools\n", + "\n", + "from sklearn import preprocessing\n", + "# TfidfVectorizer is used by the text-processing cells below but was missing\n", + "# from the original import list\n", + "from sklearn.feature_extraction.text import TfidfVectorizer\n", + "from sklearn.cluster import KMeans" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Read in the data" + ] + }, + { + "cell_type": "code", + "execution_count": 176, + "metadata": {}, + "outputs": [], + "source": [ + "df_train = pd.read_csv('training_data_example.csv')\n", + "df_val = pd.read_csv('validation_data_example.csv')\n", + "df_employee = pd.read_csv('employee.csv')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Ready the features to be used\n", + "\n", + "We will define a pre-processing function to ready our features. \n", + "- It has an internal function to determine whether a particular day is a holiday. It is simplified and by no means an exact determiner of a holiday: I am assuming weekends and any day after December 19 are holidays (a sketch of a calendar-based alternative follows this list).\n", + "- It drops all raw columns except **pre-tax amount**, keeping the derived **holiday** flag (employee id and role are merged in, but are also dropped in the configuration used here)\n", + "- Text features (**category** and **expense description**) will be processed separately\n", + "- Pre-tax amount is normalized to a 0-1 scale" + ] + },
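The "after December 19" rule above is only a rough proxy. Here is a minimal sketch of a calendar-based alternative, assuming pandas' built-in US federal holiday calendar is an acceptable stand-in for the relevant holidays (the `is_holiday` helper is illustrative and not part of the original notebook):

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Build the set of 2016 US federal holidays once (assumption: a US calendar applies).
holidays = USFederalHolidayCalendar().holidays(start='2016-01-01', end='2016-12-31')

def is_holiday(ts):
    # Weekends, plus actual calendar holidays instead of the "after Dec 19" rule.
    return int(ts.weekday() >= 5 or ts.normalize() in holidays)

# Usage, mirroring the pre_process function below:
# df_train['holiday'] = pd.to_datetime(df_train['date']).apply(is_holiday).astype(str)
```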
+ { + "cell_type": "code", + "execution_count": 239, + "metadata": {}, + "outputs": [], + "source": [ + "def pre_process(df, columns_to_drop=['date',\n", + " 'employee id', \n", + " 'category', \n", + "# 'pre-tax amount', \n", + " 'tax amount',\n", + " 'tax name', \n", + " 'role', \n", + " 'expense description']):\n", + " \n", + " def holiday(x):\n", + " # weekends\n", + " if x.weekday() >= 5:\n", + " return 1\n", + " # simplified year-end holiday season\n", + " if x.month == 12 and x.day >= 20:\n", + " return 1\n", + " return 0\n", + " \n", + " df['holiday'] = pd.to_datetime(df['date']).apply(lambda x: holiday(x)).astype(str) \n", + " df = pd.merge(df, df_employee[['employee id', 'role']], how='inner', on=['employee id'])\n", + " df['employee id'] = df['employee id'].astype(str)\n", + " df = df.drop(columns_to_drop, axis=1)\n", + " \n", + " # one-hot encode the categorical variables\n", + " df = pd.get_dummies(df)\n", + " \n", + " df['pre-tax amount'] = preprocessing.minmax_scale(df[['pre-tax amount']])\n", + " \n", + " return df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Get the pre-processed features for the training and validation data." + ] + }, + { + "cell_type": "code", + "execution_count": 240, + "metadata": {}, + "outputs": [], + "source": [ + "x_train = pre_process(df_train)\n", + "x_val = pre_process(df_val)\n", + "# align the validation columns to the training columns, filling unseen dummies with 0\n", + "x_train, x_val = x_train.align(x_val, join='left', axis=1)\n", + "x_val = x_val.fillna(0)" + ] + }, + { + "cell_type": "code", + "execution_count": 241, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
[garbled HTML rendering of x_train.head() removed; the text/plain rendering of the same table follows]
" + ], + "text/plain": [ + " pre-tax amount holiday_0 holiday_1 \\\n", + "0 0.018045 1 0 \n", + "1 0.098246 1 0 \n", + "2 1.000000 1 0 \n", + "3 0.115789 1 0 \n", + "4 0.005514 0 1 \n", + "\n", + " text_data_Airplane ticket to NY Travel \\\n", + "0 0 \n", + "1 0 \n", + "2 0 \n", + "3 0 \n", + "4 0 \n", + "\n", + " text_data_Client dinner Meals and Entertainment \\\n", + "0 0 \n", + "1 1 \n", + "2 0 \n", + "3 0 \n", + "4 0 \n", + "\n", + " text_data_Coffee with Steve Meals and Entertainment \\\n", + "0 0 \n", + "1 0 \n", + "2 0 \n", + "3 0 \n", + "4 0 \n", + "\n", + " text_data_Dinner Meals and Entertainment \\\n", + "0 0 \n", + "1 0 \n", + "2 0 \n", + "3 0 \n", + "4 0 \n", + "\n", + " text_data_Dinner with client Meals and Entertainment \\\n", + "0 0 \n", + "1 0 \n", + "2 0 \n", + "3 0 \n", + "4 0 \n", + "\n", + " text_data_Dinner with potential client Meals and Entertainment \\\n", + "0 0 \n", + "1 0 \n", + "2 0 \n", + "3 0 \n", + "4 0 \n", + "\n", + " text_data_Dropbox Subscription Computer - Software \\\n", + "0 0 \n", + "1 0 \n", + "2 0 \n", + "3 0 \n", + "4 0 \n", + "\n", + " text_data_Flight to Miami Travel \\\n", + "0 0 \n", + "1 0 \n", + "2 0 \n", + "3 0 \n", + "4 0 \n", + "\n", + " text_data_HP Laptop Computer Computer - Hardware \\\n", + "0 0 \n", + "1 0 \n", + "2 0 \n", + "3 0 \n", + "4 0 \n", + "\n", + " text_data_Macbook Air Computer Computer - Hardware \\\n", + "0 0 \n", + "1 0 \n", + "2 1 \n", + "3 0 \n", + "4 0 \n", + "\n", + " text_data_Microsoft Office Computer - Software \\\n", + "0 0 \n", + "1 0 \n", + "2 0 \n", + "3 0 \n", + "4 0 \n", + "\n", + " text_data_Paper Office Supplies \\\n", + "0 0 \n", + "1 0 \n", + "2 0 \n", + "3 0 \n", + "4 0 \n", + "\n", + " text_data_Starbucks coffee Meals and Entertainment \\\n", + "0 0 \n", + "1 0 \n", + "2 0 \n", + "3 0 \n", + "4 0 \n", + "\n", + " text_data_Taxi ride Travel text_data_Team lunch Meals and Entertainment \\\n", + "0 1 0 \n", + "1 0 0 \n", + "2 0 0 \n", + "3 0 1 \n", + "4 0 0 \n", + "\n", + " text_data_iCloud Subscription Computer - Software \\\n", + "0 0 \n", + "1 0 \n", + "2 0 \n", + "3 0 \n", + "4 1 \n", + "\n", + " text_data_iPhone Computer - Hardware \n", + "0 0 \n", + "1 0 \n", + "2 0 \n", + "3 0 \n", + "4 0 " + ] + }, + "execution_count": 241, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "x_train.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Processing the text fields\n", + "\n", + "For this purpose, we can make use of as much text as available.\n", + "\n", + "- We will concatenate **category** and **text description** together into a single field\n", + "- Since K-means will internally use Euclidean distance, we will vecorize the field. We will try 2 different ways if vectorizing it:\n", + " - Using Scikit learn's TfIdfVectoirizer (which is just a frequency based embedding of the words)\n", + " - Using a pre-trained prediction based embedding of the words. This approach makes more sense because this sort of embeddings caoture useful relations between words. We will use the **glove** embedding https://nlp.stanford.edu/projects/glove/ with 100 dimensional vector. We will use the technique describe here http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/. The following class was taken from this reference. 
I tweaked it slightly to work with Python 3.\n", + " - We will then compare the results of both of these approaches" + ] + }, + { + "cell_type": "code", + "execution_count": 179, + "metadata": {}, + "outputs": [], + "source": [ + "# Reference: http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/\n", + "# Expects pre-tokenized documents (lists of tokens), since the TfidfVectorizer\n", + "# below is built with an identity analyzer.\n", + "class TfidfEmbeddingVectorizer(object):\n", + " def __init__(self, word2vec):\n", + " self.word2vec = word2vec\n", + " self.word2weight = None\n", + " # dimension of a single embedding vector (len(word2vec.values())\n", + " # would be the vocabulary size, not the dimension)\n", + " self.dim = len(next(iter(word2vec.values())))\n", + "\n", + " def fit(self, X):\n", + " tfidf = TfidfVectorizer(analyzer=lambda x: x)\n", + " tfidf.fit(X)\n", + " # if a word was never seen - it must be at least as infrequent\n", + " # as any of the known words - so the default idf is the max of \n", + " # known idf's\n", + " max_idf = max(tfidf.idf_)\n", + " self.word2weight = defaultdict(\n", + " lambda: max_idf,\n", + " [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])\n", + "\n", + " return self\n", + "\n", + " def transform(self, X):\n", + " return np.array([\n", + " np.mean([self.word2vec[w] * self.word2weight[w]\n", + " for w in words if w in self.word2vec] or\n", + " [np.zeros(self.dim)], axis=0)\n", + " for words in X\n", + " ])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First, we will read in the embedding vectors from the relevant file and fit using the above TfidfEmbeddingVectorizer" + ] + }, + { + "cell_type": "code", + "execution_count": 152, + "metadata": {}, + "outputs": [], + "source": [ + "# read in text mode so the keys are str (matching the tokens looked up later),\n", + "# and materialize the values as float arrays (in Python 3, np.array over a\n", + "# lazy map object would produce a useless 0-d object array)\n", + "with open(\"glove.6B.100d.txt\") as lines:\n", + " w2v = {line.split()[0]: np.array(line.split()[1:], dtype='float64')\n", + " for line in lines}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will create a new field concatenating category and expense description." + ] + }, + { + "cell_type": "code", + "execution_count": 242, + "metadata": {}, + "outputs": [], + "source": [ + "df_train['text_data'] = df_train['expense description'] + ' ' + df_train['category']\n", + "df_val['text_data'] = df_val['expense description'] + ' ' + df_val['category']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's fit the embedding vectorizer, which applies TfIdf weighting on top of the embedding vectors." + ] + }, + { + "cell_type": "code", + "execution_count": 243, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "<__main__.TfidfEmbeddingVectorizer at 0x117adfef0>" + ] + }, + "execution_count": 243, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "embedding_vectorizer = TfidfEmbeddingVectorizer(w2v)\n", + "# lower-case and tokenize: the vectorizer expects lists of tokens, and\n", + "# GloVe's vocabulary is lower-cased\n", + "embedding_vectorizer.fit(df_train['text_data'].str.lower().str.split())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's also fit using only the TfidfVectorizer."
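Before that, here is a quick self-contained sanity check of the embedding vectorizer above. It is a sketch using made-up 3-dimensional vectors (the `toy_w2v` dict is hypothetical, standing in for real GloVe data, and it assumes the class and the notebook imports above are in scope):

```python
import numpy as np

# Made-up 3-dimensional "embeddings" standing in for GloVe vectors.
toy_w2v = {
    'dinner': np.array([1.0, 0.0, 0.0]),
    'client': np.array([0.0, 1.0, 0.0]),
    'taxi':   np.array([0.0, 0.0, 1.0]),
}
docs = [['dinner', 'client'], ['taxi']]  # pre-tokenized documents

vec = TfidfEmbeddingVectorizer(toy_w2v)  # class defined above
features = vec.fit(docs).transform(docs)
print(features.shape)  # (2, 3): one TF-IDF-weighted mean vector per document
```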
+ ] + }, + { + "cell_type": "code", + "execution_count": 244, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',\n", + " dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',\n", + " lowercase=True, max_df=1.0, max_features=None, min_df=1,\n", + " ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,\n", + " stop_words='english', strip_accents=None, sublinear_tf=False,\n", + " token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b', tokenizer=None, use_idf=True,\n", + " vocabulary=None)" + ] + }, + "execution_count": 244, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "vectorizer = TfidfVectorizer(stop_words='english')\n", + "vectorizer.fit(df_train['text_data'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's prepare the text vectors for the training and validation data, for both approaches - with and without pre-trained embeddings." + ] + }, + { + "cell_type": "code", + "execution_count": 245, + "metadata": {}, + "outputs": [], + "source": [ + "x_train_tfidf = vectorizer.transform(df_train['text_data']).toarray()\n", + "x_val_tfidf = vectorizer.transform(df_val['text_data']).toarray()\n", + "\n", + "# tokenized input, matching how the embedding vectorizer was fitted\n", + "x_train_tfidf_embedding = embedding_vectorizer.transform(df_train['text_data'].str.lower().str.split())\n", + "x_val_tfidf_embedding = embedding_vectorizer.transform(df_val['text_data'].str.lower().str.split())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Concatenate the text vectors with the non-text feature vectors to get the final feature vectors (again, for both - with and without pre-trained embeddings)." + ] + }, + { + "cell_type": "code", + "execution_count": 246, + "metadata": {}, + "outputs": [], + "source": [ + "x_train_final_no_embedding = np.concatenate((x_train.values, x_train_tfidf), axis=1)\n", + "x_val_final_no_embedding = np.concatenate((x_val.values, x_val_tfidf), axis=1)\n", + "\n", + "x_train_final_embedding = np.concatenate((x_train.values, x_train_tfidf_embedding), axis=1)\n", + "x_val_final_embedding = np.concatenate((x_val.values, x_val_tfidf_embedding), axis=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Apply K-means\n", + "\n", + "Now that we have the features ready, we will apply K-means.\n", + "First, let's try the variant that uses no pre-trained embeddings (tf-idf only)." + ] + }, + { + "cell_type": "code", + "execution_count": 261, + "metadata": {}, + "outputs": [], + "source": [ + "kmeans = KMeans(n_clusters=2, random_state=1)\n", + "kmeans.fit(x_train_final_no_embedding)\n", + "clusters_train = kmeans.predict(x_train_final_no_embedding)" + ] + }, + { + "cell_type": "code", + "execution_count": 248, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1,\n", + " 1], dtype=int32)" + ] + }, + "execution_count": 248, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "clusters_train" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's have a look at the clusters." + ] + }, + { + "cell_type": "code", + "execution_count": 249, + "metadata": {}, + "outputs": [], + "source": [ + "cluster1 = df_train[clusters_train==0]\n", + "cluster2 = df_train[clusters_train==1]" + ] + }, + { + "cell_type": "code", + "execution_count": 250, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
[garbled HTML rendering of cluster1 removed; the text/plain rendering of the same table follows]
" + ], + "text/plain": [ + " date category employee id \\\n", + "0 11/1/2016 Travel 7 \n", + "1 11/15/2016 Meals and Entertainment 1 \n", + "2 11/30/2016 Computer - Hardware 3 \n", + "3 11/14/2016 Computer - Software 3 \n", + "6 12/9/2016 Meals and Entertainment 6 \n", + "7 11/12/2016 Travel 4 \n", + "8 11/21/2016 Meals and Entertainment 7 \n", + "9 10/4/2016 Travel 6 \n", + "10 10/12/2016 Computer - Hardware 7 \n", + "14 12/30/2016 Meals and Entertainment 4 \n", + "15 11/6/2016 Computer - Hardware 6 \n", + "16 11/7/2016 Travel 1 \n", + "18 12/18/2016 Travel 6 \n", + "19 12/15/2016 Meals and Entertainment 4 \n", + "20 12/1/2016 Meals and Entertainment 4 \n", + "\n", + " expense description pre-tax amount tax name tax amount \\\n", + "0 Taxi ride 40.0 NY Sales tax 3.55 \n", + "1 Team lunch 235.0 CA Sales tax 30.55 \n", + "2 HP Laptop Computer 999.0 CA Sales tax 129.87 \n", + "3 Microsoft Office 899.0 CA Sales tax 116.87 \n", + "6 Coffee with Steve 300.0 CA Sales tax 39.00 \n", + "7 Taxi ride 230.0 CA Sales tax 29.90 \n", + "8 Client dinner 200.0 NY Sales tax 17.75 \n", + "9 Flight to Miami 200.0 CA Sales tax 26.00 \n", + "10 Macbook Air Computer 1999.0 NY Sales tax 177.41 \n", + "14 Dinner with potential client 200.0 CA Sales tax 26.00 \n", + "15 iPhone 200.0 CA Sales tax 26.00 \n", + "16 Airplane ticket to NY 200.0 CA Sales tax 26.00 \n", + "18 Airplane ticket to NY 1500.0 CA Sales tax 195.00 \n", + "19 Dinner with client 200.0 CA Sales tax 26.00 \n", + "20 Dinner 210.0 CA Sales tax 27.30 \n", + "\n", + " holiday text_data \n", + "0 0 Taxi ride Travel \n", + "1 0 Team lunch Meals and Entertainment \n", + "2 0 HP Laptop Computer Computer - Hardware \n", + "3 0 Microsoft Office Computer - Software \n", + "6 0 Coffee with Steve Meals and Entertainment \n", + "7 1 Taxi ride Travel \n", + "8 0 Client dinner Meals and Entertainment \n", + "9 0 Flight to Miami Travel \n", + "10 0 Macbook Air Computer Computer - Hardware \n", + "14 1 Dinner with potential client Meals and Enterta... \n", + "15 1 iPhone Computer - Hardware \n", + "16 0 Airplane ticket to NY Travel \n", + "18 1 Airplane ticket to NY Travel \n", + "19 0 Dinner with client Meals and Entertainment \n", + "20 0 Dinner Meals and Entertainment " + ] + }, + "execution_count": 250, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cluster1" + ] + }, + { + "cell_type": "code", + "execution_count": 251, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
[garbled HTML rendering of cluster2 removed; the text/plain rendering of the same table follows]
" + ], + "text/plain": [ + " date category employee id expense description \\\n", + "4 11/6/2016 Computer - Software 4 Dropbox Subscription \n", + "5 11/3/2016 Computer - Software 3 Dropbox Subscription \n", + "11 12/11/2016 Computer - Software 1 iCloud Subscription \n", + "12 9/18/2016 Travel 1 Taxi ride \n", + "13 9/30/2016 Office Supplies 3 Paper \n", + "17 12/3/2016 Meals and Entertainment 5 Starbucks coffee \n", + "21 12/8/2016 Meals and Entertainment 4 Dinner \n", + "22 12/31/2016 Meals and Entertainment 4 Dinner \n", + "23 12/9/2016 Meals and Entertainment 4 Dinner \n", + "\n", + " pre-tax amount tax name tax amount holiday \\\n", + "4 50.0 CA Sales tax 6.50 1 \n", + "5 50.0 CA Sales tax 6.50 0 \n", + "11 15.0 CA Sales tax 1.95 1 \n", + "12 60.0 CA Sales tax 7.80 1 \n", + "13 200.0 CA Sales tax 26.00 0 \n", + "17 4.0 CA Sales tax 0.52 1 \n", + "21 180.0 CA Sales tax 23.40 0 \n", + "22 1000.0 CA Sales tax 130.00 1 \n", + "23 30.0 CA Sales tax 3.90 0 \n", + "\n", + " text_data \n", + "4 Dropbox Subscription Computer - Software \n", + "5 Dropbox Subscription Computer - Software \n", + "11 iCloud Subscription Computer - Software \n", + "12 Taxi ride Travel \n", + "13 Paper Office Supplies \n", + "17 Starbucks coffee Meals and Entertainment \n", + "21 Dinner Meals and Entertainment \n", + "22 Dinner Meals and Entertainment \n", + "23 Dinner Meals and Entertainment " + ] + }, + "execution_count": 251, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cluster2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Observations:\n", + "\n", + "- One cluster has 15 data ponits, whereas the other has 9\n", + "- The first one has fewer holidays, all the computer hardware stuffs and meals and entertainment that re mostly related to client, team etc, indicating that these are business expense.\n", + "- The seond cluster has more holidays, meals and entertainment that look more personal, even though it has the **Office Supplies** record which clearly in the wrong cluster.\n", + "- Overall the clusters kind of make sense." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now let's try with pre-trained embedding vectors." + ] + }, + { + "cell_type": "code", + "execution_count": 268, + "metadata": {}, + "outputs": [], + "source": [ + "kmeans = KMeans(n_clusters=2, random_state=1)\n", + "kmeans.fit(x_train_final_embedding)\n", + "clusters_train = kmeans.predict(x_train_final_embedding)" + ] + }, + { + "cell_type": "code", + "execution_count": 253, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0,\n", + " 0], dtype=int32)" + ] + }, + "execution_count": 253, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "clusters_train" + ] + }, + { + "cell_type": "code", + "execution_count": 254, + "metadata": {}, + "outputs": [], + "source": [ + "cluster1 = df_train[clusters_train==0]\n", + "cluster2 = df_train[clusters_train==1]" + ] + }, + { + "cell_type": "code", + "execution_count": 255, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
[garbled HTML rendering of cluster1 (embedding features) removed; the text/plain rendering of the same table follows]
" + ], + "text/plain": [ + " date category employee id expense description \\\n", + "4 11/6/2016 Computer - Software 4 Dropbox Subscription \n", + "5 11/3/2016 Computer - Software 3 Dropbox Subscription \n", + "11 12/11/2016 Computer - Software 1 iCloud Subscription \n", + "12 9/18/2016 Travel 1 Taxi ride \n", + "13 9/30/2016 Office Supplies 3 Paper \n", + "17 12/3/2016 Meals and Entertainment 5 Starbucks coffee \n", + "21 12/8/2016 Meals and Entertainment 4 Dinner \n", + "22 12/31/2016 Meals and Entertainment 4 Dinner \n", + "23 12/9/2016 Meals and Entertainment 4 Dinner \n", + "\n", + " pre-tax amount tax name tax amount holiday \\\n", + "4 50.0 CA Sales tax 6.50 1 \n", + "5 50.0 CA Sales tax 6.50 0 \n", + "11 15.0 CA Sales tax 1.95 1 \n", + "12 60.0 CA Sales tax 7.80 1 \n", + "13 200.0 CA Sales tax 26.00 0 \n", + "17 4.0 CA Sales tax 0.52 1 \n", + "21 180.0 CA Sales tax 23.40 0 \n", + "22 1000.0 CA Sales tax 130.00 1 \n", + "23 30.0 CA Sales tax 3.90 0 \n", + "\n", + " text_data \n", + "4 Dropbox Subscription Computer - Software \n", + "5 Dropbox Subscription Computer - Software \n", + "11 iCloud Subscription Computer - Software \n", + "12 Taxi ride Travel \n", + "13 Paper Office Supplies \n", + "17 Starbucks coffee Meals and Entertainment \n", + "21 Dinner Meals and Entertainment \n", + "22 Dinner Meals and Entertainment \n", + "23 Dinner Meals and Entertainment " + ] + }, + "execution_count": 255, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cluster1" + ] + }, + { + "cell_type": "code", + "execution_count": 256, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
datecategoryemployee idexpense descriptionpre-tax amounttax nametax amountholidaytext_data
011/1/2016Travel7Taxi ride40.0NY Sales tax3.550Taxi ride Travel
111/15/2016Meals and Entertainment1Team lunch235.0CA Sales tax30.550Team lunch Meals and Entertainment
211/30/2016Computer - Hardware3HP Laptop Computer999.0CA Sales tax129.870HP Laptop Computer Computer - Hardware
311/14/2016Computer - Software3Microsoft Office899.0CA Sales tax116.870Microsoft Office Computer - Software
612/9/2016Meals and Entertainment6Coffee with Steve300.0CA Sales tax39.000Coffee with Steve Meals and Entertainment
711/12/2016Travel4Taxi ride230.0CA Sales tax29.901Taxi ride Travel
811/21/2016Meals and Entertainment7Client dinner200.0NY Sales tax17.750Client dinner Meals and Entertainment
910/4/2016Travel6Flight to Miami200.0CA Sales tax26.000Flight to Miami Travel
1010/12/2016Computer - Hardware7Macbook Air Computer1999.0NY Sales tax177.410Macbook Air Computer Computer - Hardware
1412/30/2016Meals and Entertainment4Dinner with potential client200.0CA Sales tax26.001Dinner with potential client Meals and Enterta...
1511/6/2016Computer - Hardware6iPhone200.0CA Sales tax26.001iPhone Computer - Hardware
1611/7/2016Travel1Airplane ticket to NY200.0CA Sales tax26.000Airplane ticket to NY Travel
1812/18/2016Travel6Airplane ticket to NY1500.0CA Sales tax195.001Airplane ticket to NY Travel
1912/15/2016Meals and Entertainment4Dinner with client200.0CA Sales tax26.000Dinner with client Meals and Entertainment
2012/1/2016Meals and Entertainment4Dinner210.0CA Sales tax27.300Dinner Meals and Entertainment
\n", + "
" + ], + "text/plain": [ + " date category employee id \\\n", + "0 11/1/2016 Travel 7 \n", + "1 11/15/2016 Meals and Entertainment 1 \n", + "2 11/30/2016 Computer - Hardware 3 \n", + "3 11/14/2016 Computer - Software 3 \n", + "6 12/9/2016 Meals and Entertainment 6 \n", + "7 11/12/2016 Travel 4 \n", + "8 11/21/2016 Meals and Entertainment 7 \n", + "9 10/4/2016 Travel 6 \n", + "10 10/12/2016 Computer - Hardware 7 \n", + "14 12/30/2016 Meals and Entertainment 4 \n", + "15 11/6/2016 Computer - Hardware 6 \n", + "16 11/7/2016 Travel 1 \n", + "18 12/18/2016 Travel 6 \n", + "19 12/15/2016 Meals and Entertainment 4 \n", + "20 12/1/2016 Meals and Entertainment 4 \n", + "\n", + " expense description pre-tax amount tax name tax amount \\\n", + "0 Taxi ride 40.0 NY Sales tax 3.55 \n", + "1 Team lunch 235.0 CA Sales tax 30.55 \n", + "2 HP Laptop Computer 999.0 CA Sales tax 129.87 \n", + "3 Microsoft Office 899.0 CA Sales tax 116.87 \n", + "6 Coffee with Steve 300.0 CA Sales tax 39.00 \n", + "7 Taxi ride 230.0 CA Sales tax 29.90 \n", + "8 Client dinner 200.0 NY Sales tax 17.75 \n", + "9 Flight to Miami 200.0 CA Sales tax 26.00 \n", + "10 Macbook Air Computer 1999.0 NY Sales tax 177.41 \n", + "14 Dinner with potential client 200.0 CA Sales tax 26.00 \n", + "15 iPhone 200.0 CA Sales tax 26.00 \n", + "16 Airplane ticket to NY 200.0 CA Sales tax 26.00 \n", + "18 Airplane ticket to NY 1500.0 CA Sales tax 195.00 \n", + "19 Dinner with client 200.0 CA Sales tax 26.00 \n", + "20 Dinner 210.0 CA Sales tax 27.30 \n", + "\n", + " holiday text_data \n", + "0 0 Taxi ride Travel \n", + "1 0 Team lunch Meals and Entertainment \n", + "2 0 HP Laptop Computer Computer - Hardware \n", + "3 0 Microsoft Office Computer - Software \n", + "6 0 Coffee with Steve Meals and Entertainment \n", + "7 1 Taxi ride Travel \n", + "8 0 Client dinner Meals and Entertainment \n", + "9 0 Flight to Miami Travel \n", + "10 0 Macbook Air Computer Computer - Hardware \n", + "14 1 Dinner with potential client Meals and Enterta... \n", + "15 1 iPhone Computer - Hardware \n", + "16 0 Airplane ticket to NY Travel \n", + "18 1 Airplane ticket to NY Travel \n", + "19 0 Dinner with client Meals and Entertainment \n", + "20 0 Dinner Meals and Entertainment " + ] + }, + "execution_count": 256, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cluster2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Observations:\n", + "\n", + "It turns out that we got the exact same clusters. So using pre-trained embeddings hasn't really helped too much in this case, may be because there wasn't enough data for the embeddings to take effect. Nevertheless, it was a good exercise for me to go through both approaches." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Just try the performance on the validation set\n", + "\n", + "Although not asked in the question, I am just trying out the performace on the validation set as well using the one that uses pre-trained embeddings." + ] + }, + { + "cell_type": "code", + "execution_count": 269, + "metadata": {}, + "outputs": [], + "source": [ + "clusters_val = kmeans.predict(x_val_final_embedding)" + ] + }, + { + "cell_type": "code", + "execution_count": 270, + "metadata": {}, + "outputs": [], + "source": [ + "cluster1 = df_val[clusters_val==0]\n", + "cluster2 = df_val[clusters_val==1]" + ] + }, + { + "cell_type": "code", + "execution_count": 271, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
datecategoryemployee idexpense descriptionpre-tax amounttax nametax amountholidaytext_data
111/12/2016Meals and Entertainment1Dinner with Family235.0CA Sales tax30.551Dinner with Family Meals and Entertainment
912/2/2016Meals and Entertainment4Dinner180.0CA Sales tax23.400Dinner Meals and Entertainment
102/14/2017Meals and Entertainment4Dinner500.0CA Sales tax65.000Dinner Meals and Entertainment
\n", + "
" + ], + "text/plain": [ + " date category employee id expense description \\\n", + "1 11/12/2016 Meals and Entertainment 1 Dinner with Family \n", + "9 12/2/2016 Meals and Entertainment 4 Dinner \n", + "10 2/14/2017 Meals and Entertainment 4 Dinner \n", + "\n", + " pre-tax amount tax name tax amount holiday \\\n", + "1 235.0 CA Sales tax 30.55 1 \n", + "9 180.0 CA Sales tax 23.40 0 \n", + "10 500.0 CA Sales tax 65.00 0 \n", + "\n", + " text_data \n", + "1 Dinner with Family Meals and Entertainment \n", + "9 Dinner Meals and Entertainment \n", + "10 Dinner Meals and Entertainment " + ] + }, + "execution_count": 271, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cluster1" + ] + }, + { + "cell_type": "code", + "execution_count": 272, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
datecategoryemployee idexpense descriptionpre-tax amounttax nametax amountholidaytext_data
011/10/2016Travel7Taxi ride30.0NY Sales tax2.660Taxi ride Travel
29/2/2016Computer - Hardware4Macbook Air Computer4000.0CA Sales tax520.000Macbook Air Computer Computer - Hardware
39/2/2016Office Supplies4Paper20.0CA Sales tax2.600Paper Office Supplies
49/2/2016Office Supplies4Pens20.0CA Sales tax2.600Pens Office Supplies
511/21/2016Travel1Airplane ticket to Miami200.0CA Sales tax26.000Airplane ticket to Miami Travel
612/4/2016Meals and Entertainment2Starbucks coffee12.0CA Sales tax1.561Starbucks coffee Meals and Entertainment
71/18/2010Meals and Entertainment6Dinner30.0CA Sales tax3.900Dinner Meals and Entertainment
810/5/2016Meals and Entertainment4Dinner with client220.0CA Sales tax28.600Dinner with client Meals and Entertainment
112/14/2016Meals and Entertainment4Dinner600.0CA Sales tax78.001Dinner Meals and Entertainment
\n", + "
" + ], + "text/plain": [ + " date category employee id \\\n", + "0 11/10/2016 Travel 7 \n", + "2 9/2/2016 Computer - Hardware 4 \n", + "3 9/2/2016 Office Supplies 4 \n", + "4 9/2/2016 Office Supplies 4 \n", + "5 11/21/2016 Travel 1 \n", + "6 12/4/2016 Meals and Entertainment 2 \n", + "7 1/18/2010 Meals and Entertainment 6 \n", + "8 10/5/2016 Meals and Entertainment 4 \n", + "11 2/14/2016 Meals and Entertainment 4 \n", + "\n", + " expense description pre-tax amount tax name tax amount \\\n", + "0 Taxi ride 30.0 NY Sales tax 2.66 \n", + "2 Macbook Air Computer 4000.0 CA Sales tax 520.00 \n", + "3 Paper 20.0 CA Sales tax 2.60 \n", + "4 Pens 20.0 CA Sales tax 2.60 \n", + "5 Airplane ticket to Miami 200.0 CA Sales tax 26.00 \n", + "6 Starbucks coffee 12.0 CA Sales tax 1.56 \n", + "7 Dinner 30.0 CA Sales tax 3.90 \n", + "8 Dinner with client 220.0 CA Sales tax 28.60 \n", + "11 Dinner 600.0 CA Sales tax 78.00 \n", + "\n", + " holiday text_data \n", + "0 0 Taxi ride Travel \n", + "2 0 Macbook Air Computer Computer - Hardware \n", + "3 0 Paper Office Supplies \n", + "4 0 Pens Office Supplies \n", + "5 0 Airplane ticket to Miami Travel \n", + "6 1 Starbucks coffee Meals and Entertainment \n", + "7 0 Dinner Meals and Entertainment \n", + "8 0 Dinner with client Meals and Entertainment \n", + "11 1 Dinner Meals and Entertainment " + ] + }, + "execution_count": 272, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cluster2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Observations:\n", + "\n", + "It seems that the performance on the validation set is not bad after all. Thefirst cluster hints mostly personal, whereas the second cluster hints mostly business expenses. Note that it has now put **Office Supplies** rightly in the sond cluster." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion:\n", + "\n", + "There can be a myria of other ways we can possibly try to do the clustering. I have tried as much as I can in my limited time. A bit of improvement on my approach could be that we could try mutiple iterations of the K-means fitting using mutiple initial random state, and then then assign each data point to the cluster that they belong to the most number of times." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.2"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/luigi_solution.py b/luigi_solution.py
new file mode 100644
index 0000000..a0180b4
--- /dev/null
+++ b/luigi_solution.py
@@ -0,0 +1,28 @@
+import argparse
+import luigi
+import sys
+from predictor import Predictor
+
+
+parser = argparse.ArgumentParser()
+parser.add_argument('--input-path', help='Input path containing files',
+ default='.')
+parser.add_argument('--output-path', help='Directory where output file will be generated',
+ default='.')
+
+
+def main():
+ try:
+ args = parser.parse_args()
+ except SystemExit:
+ parser.print_help()
+ sys.exit(0)
+
+ task = Predictor(input_path=args.input_path,
+ output_path=args.output_path)
+
+ if luigi.build([task], local_scheduler=True, workers=1):
+ print(task.output())
+
+if __name__ == '__main__':
+ main()
diff --git a/pre_processor.py b/pre_processor.py
new file mode 100644
index 0000000..5b2f71e
--- /dev/null
+++ b/pre_processor.py
@@ -0,0 +1,94 @@
+from sklearn import preprocessing
+from sklearn.preprocessing import LabelEncoder
+from sklearn.feature_extraction.text import TfidfVectorizer
+
+import luigi
+from luigi import LocalTarget
+import pandas as pd
+import numpy as np
+import pickle
+import os
+
+
+class PreProcess(luigi.Task):
+ """
+ This class will pre-process and vectorize the data
+ and output to CSVs and pickles
+ """
+
+ output_path = luigi.Parameter()
+ input_path = luigi.Parameter()
+
+ def output(self):
+ return {
+ 'train_x': LocalTarget(os.path.join(self.output_path, 'train_processed.csv')),
+ 'train_x_tfidf': LocalTarget(os.path.join(self.output_path, 'train_processed_tfidf.csv')),
+ 'validation_x': LocalTarget(os.path.join(self.output_path, 'validation_processed.csv')),
+ 'validation_x_tfidf': LocalTarget(os.path.join(self.output_path, 'validation_processed_tfidf.csv')),
+ 'labels': LocalTarget(os.path.join(self.output_path, 'labels.pkl'))
+ }
+
+ def pre_process(self, df, df_employee,
+ columns_to_drop=['date',
+ 'category',
+ 'tax amount',
+ 'expense description']):
+
+ df['day_of_week'] = pd.to_datetime(df['date']).apply(lambda x: x.weekday()).astype(
+ str) # str so that it is treated as categorical
+ df['month'] = pd.to_datetime(df['date']).apply(lambda x: x.month).astype(str)
+ df = pd.merge(df, df_employee[['employee id', 'role']], how='inner', on=['employee id'])
+ df['employee id'] = df['employee id'].astype(str)
+ df = df.drop(columns_to_drop, axis=1)
+
+ # one-hot encode the categorical variables
+ df = pd.get_dummies(df)
+
+ df['pre-tax amount'] = preprocessing.minmax_scale(df[['pre-tax amount']])
+
+ return df
+
+ def run(self):
+ df_train = pd.read_csv(os.path.join(self.input_path, 'training_data_example.csv'))
+ df_val = pd.read_csv(os.path.join(self.input_path, 'validation_data_example.csv'))
+ df_employee = pd.read_csv(os.path.join(self.input_path, 'employee.csv'))
+
+ x_train = self.pre_process(df_train, df_employee)
+ x_val = self.pre_process(df_val, df_employee)
+ x_train, x_val = x_train.align(x_val, join='left', axis=1)
+ x_val = x_val.fillna(0)
+
+ vectorizer = 
TfidfVectorizer(stop_words='english')
+ vectorizer.fit(df_train['expense description'])
+ x_train_tfidf = vectorizer.transform(df_train['expense description']).toarray()
+ x_val_tfidf = vectorizer.transform(df_val['expense description']).toarray()
+
+ lencoder = LabelEncoder()
+ lencoder.fit(df_train['category'])
+ names = set(df_val['category']) # label names to be used later
+ y_train = lencoder.transform(df_train['category'])
+ y_val = lencoder.transform(df_val['category'])
+
+ val_categories = []
+ for clazz in lencoder.classes_:
+ if clazz in names:
+ val_categories.append(clazz)
+
+ x_train.to_csv(self.output()['train_x'].path, sep=',')
+ x_val.to_csv(self.output()['validation_x'].path, sep=',')
+ np.savetxt(self.output()['train_x_tfidf'].path, x_train_tfidf, delimiter=",")
+ np.savetxt(self.output()['validation_x_tfidf'].path, x_val_tfidf, delimiter=",")
+
+ labels = {
+ 'y_train': y_train,
+ 'y_val': y_val,
+ 'categories': lencoder.classes_,
+ 'val_categories': val_categories
+ }
+
+ with open(self.output()['labels'].path, 'wb') as f:
+ pickle.dump(labels, f)
+
+
+
+
diff --git a/predictor.py b/predictor.py
new file mode 100644
index 0000000..a38de17
--- /dev/null
+++ b/predictor.py
@@ -0,0 +1,93 @@
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.model_selection import GridSearchCV, LeaveOneOut
+from sklearn.metrics import classification_report
+
+import luigi
+from luigi import LocalTarget
+import pandas as pd
+import pickle
+import os
+import numpy as np
+
+from pre_processor import PreProcess
+
+
+class Predictor(luigi.Task):
+ """
+ This class does the actual prediction and outputs a couple of files
+ comparison.csv: A file showing actual and predicted categories side by side
+ report.txt: Report on accuracy, precision and recall
+ """
+
+ output_path = luigi.Parameter()
+ input_path = luigi.Parameter()
+
+ def requires(self):
+ return PreProcess(input_path=self.input_path,
+ output_path=self.output_path)
+
+ def output(self):
+ return {
+ 'result_report': LocalTarget(os.path.join(self.output_path, 'report.txt')),
+ 'result_comparison': LocalTarget(os.path.join(self.output_path, 'comparison.csv'))
+ }
+
+ def evaluate(self, model, x_train, y_train, x_val, y_val, cross_validate=False, grid_search_params={}):
+ if cross_validate:
+ grid_search_model = GridSearchCV(estimator=model,
+ cv=LeaveOneOut(),
+ param_grid=grid_search_params,
+ verbose=0)
+ grid_search_model.fit(x_train, y_train)
+ return (grid_search_model.best_estimator_.score(x_train, y_train),
+ grid_search_model.best_estimator_.score(x_val, y_val),
+ grid_search_model.best_estimator_.predict(x_val),
+ grid_search_model.best_estimator_)
+ else:
+ # return the fitted model itself so both branches yield an estimator
+ model.fit(x_train, y_train)
+ return model.score(x_train, y_train), model.score(x_val, y_val), model.predict(x_val), model
+
+ def run(self):
+
+ # we won't use these two (non-text feature frames) for prediction here;
+ # we will only use the tf-idf features
+ x_train = pd.read_csv(self.input()['train_x'].path)
+ x_val = pd.read_csv(self.input()['validation_x'].path)
+
+ x_train_tfidf = np.genfromtxt(self.input()['train_x_tfidf'].path, delimiter=',')
+ x_val_tfidf = np.genfromtxt(self.input()['validation_x_tfidf'].path, delimiter=',')
+
+ with open(self.input()['labels'].path, 'rb') as f:
+ labels = pickle.load(f)
+
+ y_train = labels['y_train']
+ y_val = labels['y_val']
+ categories = labels['categories']
+ val_categories = labels['val_categories']
+
+ grid_search_params = {'n_estimators': [10, 25, 50, 100, 200, 300]}
+ model = RandomForestClassifier(n_estimators=50, random_state=1)
+ 
score_train, score_val, predictions, estimator = self.evaluate(model, x_train_tfidf,
+ y_train, x_val_tfidf, y_val,
+ cross_validate=True,
+ grid_search_params=grid_search_params)
+
+ actual_categories = [categories[i] for i in y_val]
+ predicted_categories = [categories[i] for i in predictions]
+
+ result_comp = pd.DataFrame({'actual': actual_categories,
+ 'prediction': predicted_categories})
+
+ result_comp.to_csv(self.output()['result_comparison'].path, sep=',')
+
+ with open(self.output()['result_report'].path, 'w') as f:
+ f.write('Training Score: {} %\n'.format(score_train * 100))
+ f.write('Validation Score: {} %\n\n'.format(score_val * 100))
+ f.write(classification_report(y_val, predictions, target_names=val_categories))
+
+
+
+
+
+
diff --git a/run.sh b/run.sh
new file mode 100755
index 0000000..95e87db
--- /dev/null
+++ b/run.sh
@@ -0,0 +1 @@
+python luigi_solution.py --input-path . --output-path .
\ No newline at end of file
diff --git a/spark_random_forest.ipynb b/spark_random_forest.ipynb
new file mode 100644
index 0000000..5cddedf
--- /dev/null
+++ b/spark_random_forest.ipynb
@@ -0,0 +1,391 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Random forest classification using Spark\n",
+ "\n",
+ "This notebook tries to address the following problem:\n",
+ "\n",
+ "*(Bonus) Train your learning algorithm for one of the above questions in a distributed fashion, such as using Spark. Here, you can assume either the data or the model is too large/efficient to be process in a single computer.*\n",
+ "\n",
+ "\n",
+ "Although in the original solution I tried different classifiers and an ensemble, in this Spark-based one I am only using Random Forest for simplicity. I don't have much exposure to Spark's machine learning libraries, so this is quite a bit of an exploration for me.\n",
+ "\n",
+ "Although I intended to generate all the features (text and non-text) here, due to limited time I will ultimately use only the text features for classification."
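As a rough sketch of how the non-text features could be folded back in later (not part of this notebook; it assumes a dataframe like the one the `pre_process` step below produces, carrying a numeric `pre_tax_amount` column alongside the TF-IDF `features` column):

```python
# Hypothetical extension: concatenate the TF-IDF vector with numeric columns
# so the classifier sees both text and non-text signals.
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=["features", "pre_tax_amount"],
                            outputCol="all_features")
# `vectorized_df` is assumed to be the output of the vectorize() step below,
# before its select() narrows the columns down to the text features only.
combined = assembler.transform(vectorized_df).select("all_features", "category")
```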
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Necessary imports"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import datetime\n",
+ "import calendar\n",
+ " \n",
+ "from pyspark.mllib.tree import RandomForest, RandomForestModel\n",
+ "from pyspark.mllib.util import MLUtils\n",
+ "from pyspark.mllib.regression import LabeledPoint\n",
+ "from pyspark.sql import SparkSession\n",
+ "from pyspark.sql.functions import lower\n",
+ "from pyspark.ml.feature import HashingTF, IDF, Tokenizer\n",
+ "from pyspark.ml.classification import RandomForestClassifier\n",
+ "from pyspark.mllib.linalg import Vector as MLLibVector, Vectors as MLLibVectors\n",
+ "\n",
+ "from sklearn.metrics import classification_report\n",
+ "from sklearn.preprocessing import LabelEncoder"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Initiate Spark session**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "spark = SparkSession.builder.getOrCreate()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Read in the data into dataframes and create temp views**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df_train = spark.read.format('csv').options(header='true', inferschema='true').load('training_data_example.csv')\n",
+ "\n",
+ "df_val = spark.read.format('csv').options(header='true', inferschema='true').load('validation_data_example.csv')\n",
+ "\n",
+ "df_employee = spark.read.format('csv').options(header='true', inferschema='true').load('employee.csv')\n",
+ "df_employee = df_employee.withColumnRenamed('employee id', 'employee_id')\n",
+ "df_employee.createOrReplaceTempView('employee')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "lencoder = LabelEncoder()\n",
+ "lencoder.fit(df_train.select('category').rdd.map(lambda x: x[0]).collect())\n",
+ "names = set(df_val.select('category').rdd.map(lambda x: x[0]).collect()) # label names to be used later\n",
+ "y_train = lencoder.transform(df_train.select('category').rdd.map(lambda x: x[0]).collect())\n",
+ "y_val = lencoder.transform(df_val.select('category').rdd.map(lambda x: x[0]).collect())\n",
+ "val_categories = []\n",
+ "for clazz in lencoder.classes_:\n",
+ " if clazz in names:\n",
+ " val_categories.append(clazz)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Define some UDFs to use in Spark SQL**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 37,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 37,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "def get_weekday(date):\n",
+ " month, day, year = (int(x) for x in date.split('/')) \n",
+ " weekday = datetime.date(year, month, day)\n",
+ " return calendar.day_name[weekday.weekday()]\n",
+ "\n",
+ "def get_month(date):\n",
+ " month, day, year = (int(x) for x in date.split('/')) \n",
+ " return month\n",
+ "\n",
+ "spark.udf.register('get_weekday', get_weekday)\n",
+ "spark.udf.register('get_month', get_month)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Next, define a tokenizer and a vectorizer for the TF-IDF vectorization**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 38,
+ "metadata": {},
+ "outputs": 
[],
+ "source": [
+ "tokenizer = Tokenizer(inputCol=\"expense_description\", outputCol=\"words\")\n",
+ "hashingTF = HashingTF(inputCol=\"words\", outputCol=\"rawFeatures\", numFeatures=20)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Define a couple of methods for formatting the features and for vectorization**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 39,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def pre_process(df, table):\n",
+ " df = df.withColumnRenamed('employee id', 'employee_id') \\\n",
+ " .withColumnRenamed('expense description', 'expense_description') \\\n",
+ " .withColumnRenamed('pre-tax amount', 'pre_tax_amount') \\\n",
+ " .withColumnRenamed('tax amount', 'tax_amount') \\\n",
+ " .withColumnRenamed('tax name', 'tax_name')\n",
+ " \n",
+ " df.createOrReplaceTempView(table)\n",
+ " \n",
+ " df = spark.sql(\"\"\"\n",
+ " select employee_id,\n",
+ " get_weekday(date) as weekday,\n",
+ " cast(get_month(date) as int) as month,\n",
+ " pre_tax_amount,\n",
+ " role,\n",
+ " expense_description,\n",
+ " case when category = 'Computer - Hardware' then 0\n",
+ " when category = 'Computer - Software' then 1\n",
+ " when category = 'Meals and Entertainment' then 2\n",
+ " when category = 'Office Supplies' then 3\n",
+ " else 4\n",
+ " end as category\n",
+ " from \n",
+ " {table} \n",
+ " inner join employee using(employee_id)\n",
+ " \n",
+ " \"\"\".format(table=table))\n",
+ " \n",
+ " return df\n",
+ "\n",
+ "def vectorize(df):\n",
+ " wordsData = tokenizer.transform(df)\n",
+ " \n",
+ " featurizedData = hashingTF.transform(wordsData)\n",
+ "\n",
+ " idf = IDF(inputCol=\"rawFeatures\", outputCol=\"features\")\n",
+ " idfModel = idf.fit(featurizedData)\n",
+ " rescaledData = idfModel.transform(featurizedData)\n",
+ "\n",
+ " rescaledData = rescaledData.select(\"features\", \"category\")\n",
+ " return rescaledData"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Get the features with pre-processing and vectorization done**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 40,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "df_train = pre_process(df_train, 'train')\n",
+ "data_train = vectorize(df_train)\n",
+ "df_val = pre_process(df_val, 'validation')\n",
+ "data_val = vectorize(df_val)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Transform the data in a way so that it can be fed to Spark's RandomForest classifier**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 41,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "data_train_rdd = data_train.rdd.map(lambda x: LabeledPoint(x.category, MLLibVectors.fromML(x.features)))\n",
+ "data_val_rdd = data_val.rdd.map(lambda x: LabeledPoint(x.category, MLLibVectors.fromML(x.features)))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "**Train the model and make predictions**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 42,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "model = RandomForest.trainClassifier(data_train_rdd, numClasses=5, categoricalFeaturesInfo={},\n",
+ " numTrees=50, featureSubsetStrategy=\"sqrt\",\n",
+ " impurity='gini', maxDepth=3, maxBins=32, seed=1)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 43,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "train_predictions = model.predict(data_train_rdd.map(lambda x: x.features))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 44,
+ "metadata": {},
+ "outputs": [],
+ 
"source": [ + "val_predictions = model.predict(data_val_rdd.map(lambda x: x.features))" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "metadata": {}, + "outputs": [], + "source": [ + "train_labelsAndPredictions = data_train_rdd.map(lambda lp: lp.label).zip(train_predictions)\n", + "val_labelsAndPredictions = data_val_rdd.map(lambda lp: lp.label).zip(val_predictions)" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Training accuracy = 0.875\n", + "Validation accuracy = 0.6666666666666666\n" + ] + } + ], + "source": [ + "train_accuracy = train_labelsAndPredictions.filter(\n", + " lambda lp: lp[0] == lp[1]).count() / float(df_train.count())\n", + "val_accuracy = val_labelsAndPredictions.filter(\n", + " lambda lp: lp[0] == lp[1]).count() / float(df_val.count())\n", + "print('Training accuracy = ' + str(train_accuracy))\n", + "print('Validation accuracy = ' + str(val_accuracy))\n", + "# print('Learned classification forest model:')\n", + "# print(model.toDebugString())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**So it turns out the training accuracy is 87.5% and validation accuracy is 83.33%**\n", + "\n", + "**Let's also print the classification report with precision, recall and f1-score**" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " precision recall f1-score support\n", + "\n", + " Computer - Hardware 1.00 1.00 1.00 1\n", + "Meals and Entertainment 1.00 0.78 0.88 9\n", + " Office Supplies 0.00 0.00 0.00 0\n", + " Travel 1.00 1.00 1.00 2\n", + "\n", + " avg / total 1.00 0.83 0.91 12\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/asifiqbal/workspace/envs/t3/lib/python3.6/site-packages/sklearn/metrics/classification.py:1137: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples.\n", + " 'recall', 'true', average, warn_for)\n" + ] + } + ], + "source": [ + "print(classification_report(val_predictions.collect(), df_val.select('category').rdd.map(lambda x:x[0]).collect(), target_names=val_categoroes))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} From 415a461e50f3161d66babb63f3d933aab6fb88c3 Mon Sep 17 00:00:00 2001 From: Asif Iqbal Date: Wed, 30 May 2018 01:42:03 -0400 Subject: [PATCH 02/13] Add requirements.txt --- requirements.txt | 11 +++++++++++ 1 file changed, 11 insertions(+) create mode 100644 requirements.txt diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..a6f4abb --- /dev/null +++ b/requirements.txt @@ -0,0 +1,11 @@ +numpy==1.13.3 +pandas +scikit-learn +pyspark==2.3.0 +matplotlib +xgboost +jupyter +tensorflow +keras +nltk +luigi From 66c3e8f3f668495886c386e5c52693a34eeee455 Mon Sep 17 00:00:00 2001 From: Asif Iqbal Date: Wed, 30 May 2018 01:43:09 -0400 Subject: [PATCH 03/13] Changes not complete 
yet --- README.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/README.md b/README.md index ef266d6..027fc0e 100644 --- a/README.md +++ b/README.md @@ -68,3 +68,16 @@ Evaluation of your submission will be based on the following criteria. 4. What design decisions did you make when designing your models? Why (i.e. were they explained)? 5. Did you separate any concerns in your application? Why or why not? 6. Does your solution use appropriate datatypes for the problem as described? + +## Asif's Notes + +### Instructions + +My solution is python 3 based and I have presented my solution in 2 ways: + +- Jupyter Notebooks for more interactivity +- A [luigi](https://github.com/spotify/luigi) based solution that kind of simulates (not quite) a real production type run + +For both, you need to have the packages in [requirements.txt](https://github.com/asif31iqbal/ml-challenge-expenses/blob/master/requirements.txt) installed: + +`pip install -r requirements.txt` From 9d44de0e360bb2883437c51e53040c39b0c977df Mon Sep 17 00:00:00 2001 From: Asif Iqbal Date: Wed, 30 May 2018 09:36:34 -0400 Subject: [PATCH 04/13] Refactor --- luigi_solution.py => luigi/luigi_solution.py | 0 pre_processor.py => luigi/pre_processor.py | 0 predictor.py => luigi/predictor.py | 0 run.sh => luigi/run.sh | 0 luigi/training_data_example.csv | 25 ++++++++++++++++++++ luigi/validation_data_example.csv | 13 ++++++++++ 6 files changed, 38 insertions(+) rename luigi_solution.py => luigi/luigi_solution.py (100%) rename pre_processor.py => luigi/pre_processor.py (100%) rename predictor.py => luigi/predictor.py (100%) rename run.sh => luigi/run.sh (100%) create mode 100644 luigi/training_data_example.csv create mode 100644 luigi/validation_data_example.csv diff --git a/luigi_solution.py b/luigi/luigi_solution.py similarity index 100% rename from luigi_solution.py rename to luigi/luigi_solution.py diff --git a/pre_processor.py b/luigi/pre_processor.py similarity index 100% rename from pre_processor.py rename to luigi/pre_processor.py diff --git a/predictor.py b/luigi/predictor.py similarity index 100% rename from predictor.py rename to luigi/predictor.py diff --git a/run.sh b/luigi/run.sh similarity index 100% rename from run.sh rename to luigi/run.sh diff --git a/luigi/training_data_example.csv b/luigi/training_data_example.csv new file mode 100644 index 0000000..f15e7b9 --- /dev/null +++ b/luigi/training_data_example.csv @@ -0,0 +1,25 @@ +date,category,employee id,expense description,pre-tax amount,tax name,tax amount +11/1/2016,Travel,7,Taxi ride,40.00,NY Sales tax,3.55 +11/15/2016,Meals and Entertainment,1,Team lunch,235.00,CA Sales tax,30.55 +11/30/2016,Computer - Hardware,3,HP Laptop Computer,999.00,CA Sales tax,129.87 +11/14/2016,Computer - Software,3,Microsoft Office,899.00,CA Sales tax,116.87 +11/6/2016,Computer - Software,4,Dropbox Subscription,50.00,CA Sales tax,6.5 +11/3/2016,Computer - Software,3,Dropbox Subscription,50.00,CA Sales tax,6.5 +12/9/2016,Meals and Entertainment,6,Coffee with Steve,300.00,CA Sales tax,39 +11/12/2016,Travel,4,Taxi ride,230.00,CA Sales tax,29.9 +11/21/2016,Meals and Entertainment,7,Client dinner,200.00,NY Sales tax,17.75 +10/4/2016,Travel,6,Flight to Miami,200.00,CA Sales tax,26 +10/12/2016,Computer - Hardware,7,Macbook Air Computer,1999.00 ,NY Sales tax,177.41 +12/11/2016,Computer - Software,1,iCloud Subscription,15.00,CA Sales tax,1.95 +9/18/2016,Travel,1,Taxi ride,60.00,CA Sales tax,7.8 +9/30/2016,Office Supplies,3,Paper,200.00,CA Sales tax,26 +12/30/2016,Meals and 
Entertainment,4,Dinner with potential client,200.00,CA Sales tax,26
+11/6/2016,Computer - Hardware,6,iPhone,200.00,CA Sales tax,26
+11/7/2016,Travel,1,Airplane ticket to NY,200.00,CA Sales tax,26
+12/3/2016,Meals and Entertainment,5,Starbucks coffee,4.00,CA Sales tax,0.52
+12/18/2016,Travel,6,Airplane ticket to NY,1500.00 ,CA Sales tax,195
+12/15/2016,Meals and Entertainment,4,Dinner with client,200.00,CA Sales tax,26
+12/1/2016,Meals and Entertainment,4,Dinner,210.00,CA Sales tax,27.3
+12/8/2016,Meals and Entertainment,4,Dinner,180.00,CA Sales tax,23.4
+12/31/2016,Meals and Entertainment,4,Dinner,1000.00,CA Sales tax,130
+12/9/2016,Meals and Entertainment,4,Dinner,30.00,CA Sales tax,3.9
\ No newline at end of file
diff --git a/luigi/validation_data_example.csv b/luigi/validation_data_example.csv
new file mode 100644
index 0000000..1ce73df
--- /dev/null
+++ b/luigi/validation_data_example.csv
@@ -0,0 +1,13 @@
+date,category,employee id,expense description,pre-tax amount,tax name,tax amount
+11/10/2016,Travel,7,Taxi ride,30.00,NY Sales tax,2.66
+11/12/2016,Meals and Entertainment,1,Dinner with Family,235.00,CA Sales tax,30.55
+9/2/2016,Computer - Hardware,4,Macbook Air Computer,4000.00,CA Sales tax,520
+9/2/2016,Office Supplies,4,Paper,20.00,CA Sales tax,2.6
+9/2/2016,Office Supplies,4,Pens,20.00,CA Sales tax,2.6
+11/21/2016,Travel,1,Airplane ticket to Miami,200.00,CA Sales tax,26
+12/4/2016,Meals and Entertainment,2,Starbucks coffee,12.00,CA Sales tax,1.56
+1/18/2010,Meals and Entertainment,6,Dinner,30.00 ,CA Sales tax,3.9
+10/5/2016,Meals and Entertainment,4,Dinner with client,220.00,CA Sales tax,28.6
+12/2/2016,Meals and Entertainment,4,Dinner,180.00,CA Sales tax,23.4
+2/14/2017,Meals and Entertainment,4,Dinner,500.00,CA Sales tax,65
+2/14/2016,Meals and Entertainment,4,Dinner,600.00,CA Sales tax,78
\ No newline at end of file

From 12b5749f6a39510dd7c4d5addf247755b1402b61 Mon Sep 17 00:00:00 2001
From: Asif Iqbal
Date: Wed, 30 May 2018 09:47:56 -0400
Subject: [PATCH 05/13] Not complete

---
 README.md | 27 ++++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 027fc0e..012548a 100644
--- a/README.md
+++ b/README.md
@@ -69,15 +69,32 @@ Evaluation of your submission will be based on the following criteria.
 5. Did you separate any concerns in your application? Why or why not?
 6. Does your solution use appropriate datatypes for the problem as described?
 
-## Asif's Notes
+# Asif's Notes
 
-### Instructions
+## Instructions
 
 My solution is python 3 based and I have presented my solution in 2 ways:
 
 - Jupyter Notebooks for more interactivity
 - A [luigi](https://github.com/spotify/luigi) based solution that kind of simulates (not quite) a real production type run
 
-For both, you need to have the packages in [requirements.txt](https://github.com/asif31iqbal/ml-challenge-expenses/blob/master/requirements.txt) installed:
-
-`pip install -r requirements.txt`
+Before proceeding, please install/setup the pre-requisites below. I developed this on a Mac. For Linux, the steps should be mostly the same, except the XGBoost installation might be a bit different. 
+
+### Pre-requisites
+
+- Clone/download this repository and make it the working directory
+- Install python 3 `brew install python3`
+- Install virtual environment `pip3 install virtualenv`
+- Create a virtual env `virtualenv venv` and activate it `source venv/bin/activate`
+- Do the following as a pre-requisite for XGBoost
+```
+brew install gcc
+brew install gcc@5
+```
+- Now install the packages in [requirements.txt](https://github.com/asif31iqbal/ml-challenge-expenses/blob/master/requirements.txt)
+
+At this point, you can run `jupyter notebook` and run the notebooks interactively. You can also run the luigi solution by doing
+```
+cd luigi
+./run.sh
+```

From 37f2bf01de227e880da8a9d029835744afdaa38a Mon Sep 17 00:00:00 2001
From: Asif Iqbal
Date: Wed, 30 May 2018 13:34:37 -0400
Subject: [PATCH 06/13] Not complete

---
 README.md | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 52 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 012548a..fbf2772 100644
--- a/README.md
+++ b/README.md
@@ -75,7 +75,7 @@ Evaluation of your submission will be based on the following criteria.
 
 My solution is python 3 based and I have presented my solution in 2 ways:
 
-- Jupyter Notebooks for more interactivity
+- Jupyter Notebooks for more interactivity and details
 - A [luigi](https://github.com/spotify/luigi) based solution that kind of simulates (not quite) a real production type run
 
 Before proceeding, please install/setup the pre-requisites below. I developed this on a Mac. For Linux, the steps should be mostly the same, except the XGBoost installation might be a bit different.
@@ -98,3 +98,54 @@ At this point, you can run `jupyter notebook` and run the notebooks interactively
 cd luigi
 ./run.sh
 ```
+## Solution Overview
+
+There are 4 Jupyter notebooks:
+- [classification.ipynb](https://github.com/asif31iqbal/ml-challenge-expenses/blob/master/clasification.ipynb)
+- [classification_attemp_with_deep_learning.ipynb](https://github.com/asif31iqbal/ml-challenge-expenses/blob/master/classification_attemp_with_deep_learning.ipynb)
+- [clustering_personal_business.ipynb](https://github.com/asif31iqbal/ml-challenge-expenses/blob/master/clustering_personal_business.ipynb)
+- [spark_random_forest.ipynb](https://github.com/asif31iqbal/ml-challenge-expenses/blob/master/spark_random_forest.ipynb)
+
+And a folder called [luigi](https://github.com/asif31iqbal/ml-challenge-expenses/tree/master/luigi) that has a luigi based solution. Let's go through an overview of each of these.
+
+### classification.ipynb
+
+This notebook tries to address the first question in the problem set - classifying the data points into the right category. **Please follow the detailed cell-by-cell guideline to walk through it**. Key summary points are:
+- Intuitively the **expense description** field seems to be the best determiner
+- Tried several **shallow learning** classifiers - Logistic Regression, Naive Bayes, SVM, Gradient Boosting (XGBoost) and Random Forest and attempted to take an ensemble of the ones that performed best
+- Feature engineering involved vectorizing text (**expense description** field) as TF-IDF vectors and using other fields like day_of_week, expense amount, employee ID and tax name. 
I didn't use n-grams here, but that's possibly an improvement option
+- Tried the classifiers in a two-fold approach - with all features and with only text features
+- Tried Grid Search cross validation, tuning several parameters like the regularization factor, number of estimators etc
+- Most classifiers worked better with only text features, indicating that the **expense description** text field is the key field for classification here, which agrees with common intuition
+- XGBoost didn't perform as well as I had anticipated
+- My personal pick of the classifiers in this case would be Random Forest based on the performance (although I tried a further ensemble of Random Forest with SVM and Logistic Regression)
+- Target categories are skewed in distribution (with **Meals and Entertainment** being the majority), so **accuracy** alone is not enough. Considered **precision** and **recall** as additional useful performance measures
+- Achieved an accuracy of **100%** on the training data and **91.67%** on the validation data, and an average of **93%** precision, **92%** recall and **91%** f1-score
+- I do understand that with bigger data and more intelligent feature engineering, the performance and predictions can change, and nothing here is necessarily conclusive.
+
+### classification_attemp_with_deep_learning.ipynb
+
+This notebook tries to address the same problem as the previous one, but this time using **Deep Learning** with **Keras**. This notebook is not documented in detail, but if I get a chance I can explain more in person. Key summary points are:
+- Tried a simple 3-layer ANN with all features (non-text and text, with text vectorized using scikit-learn's TfidfVectorizer)
+- Tried a simple 3-layer ANN with only text features (with text vectorized using Keras's text-to-matrix binary vectorizer)
+- Tried an RNN with LSTM with a pre-trained embedding layer and only text features (with text vectorized using Keras's text-to-matrix binary vectorizer)
+- The accuracy, precision, recall and f1-score that were achieved were exactly the same as those obtained with the previous shallow learning approach
+
+### clustering_personal_business.ipynb
+
+This notebook tries to address the second problem in the problem set - identifying expenses as either personal or business. **Please follow the detailed cell-by-cell guideline to walk through it**. Key summary points:
+
+- Decided to use **K-means**
+- Extracted a holiday flag (not in a very rigorous way), along with employee ID and role, and most importantly, again, the text features. This time concatenated **expense description** and **category**
+- 15 data points in the business cluster and 9 in the personal cluster
+- The resulting clusters kind of make sense, as described in the notebook
+- Tried the cluster model on the validation data as well, and those results mostly make sense too
+
+### spark_random_forest.ipynb
+
+This notebook tries to address the bonus problem of running one solution via Spark. For this I implemented the classification solution. This notebook is not very well documented; I can explain in person if I get the chance. 
Key summary points:
+
+- Although in the original solution I tried several approaches, I only used Random Forests here to keep things simple, as I am fairly new to Spark's ML library
+- Only used text TF-IDF features
+- Training accuracy **87.5%**, validation accuracy **83.33%**, precision **100%**, recall **83%** and f1-score **91%**
+- Tons of improvement possible

From fa79336152b94acf001d2f7cdb1a11c07c48ff08 Mon Sep 17 00:00:00 2001
From: Asif Iqbal
Date: Wed, 30 May 2018 13:40:33 -0400
Subject: [PATCH 07/13] notebook changes

---
 ...sification_attemp_with_deep_learning.ipynb | 974 +++++++++---------
 clustering_personal_business.ipynb | 233 +----
 spark_random_forest.ipynb | 54 +-
 3 files changed, 582 insertions(+), 679 deletions(-)

diff --git a/classification_attemp_with_deep_learning.ipynb b/classification_attemp_with_deep_learning.ipynb
index 1208bbc..166f19f 100644
--- a/classification_attemp_with_deep_learning.ipynb
+++ b/classification_attemp_with_deep_learning.ipynb
@@ -25,7 +25,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 65,
+ "execution_count": 28,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -34,8 +34,10 @@
 "from collections import Counter\n",
 "import itertools\n",
 "\n",
+ "from sklearn import preprocessing\n",
 "from sklearn.preprocessing import LabelEncoder\n",
 "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n",
+ "from sklearn.metrics import classification_report\n",
 "\n",
 "import matplotlib.pyplot as plt\n",
 "%matplotlib inline\n",
@@ -59,7 +61,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 109,
+ "execution_count": 6,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -70,7 +72,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 110,
+ "execution_count": 7,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -102,7 +104,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 111,
+ "execution_count": 8,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -121,7 +123,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 112,
+ "execution_count": 9,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -141,7 +143,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 113,
+ "execution_count": 10,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -158,7 +160,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 114,
+ "execution_count": 11,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -168,7 +170,12 @@
 "y_train = lencoder.transform(df_train['category'])\n",
 "y_val = lencoder.transform(df_val['category'])\n",
 "dummy_y_train = np_utils.to_categorical(y_train)\n",
- "dummy_y_val = np_utils.to_categorical(y_val)"
+ "dummy_y_val = np_utils.to_categorical(y_val)\n",
+ "\n",
+ "val_categoroes = []\n",
+ "for clazz in lencoder.classes_:\n",
+ " if clazz in names:\n",
+ " val_categoroes.append(clazz)\n"
 ]
 },
 {
@@ -182,7 +189,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 115,
+ "execution_count": 12,
 "metadata": {},
 "outputs": [
 {
@@ -192,11 +199,11 @@
 "_________________________________________________________________\n",
 "Layer (type) Output Shape Param # \n",
 "=================================================================\n",
- "dense_41 (Dense) (None, 16) 832 \n",
+ "dense_1 (Dense) (None, 16) 832 \n",
 "_________________________________________________________________\n",
- "dense_42 (Dense) (None, 16) 272 \n",
+ "dense_2 (Dense) (None, 16) 272 \n",
 "_________________________________________________________________\n",
- "dense_43 (Dense) (None, 5) 85 \n",
+ "dense_3 (Dense) (None, 5) 85 \n",
 "=================================================================\n",
 "Total 
params: 1,189\n",
 "Trainable params: 1,189\n",
@@ -220,7 +227,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 43,
+ "execution_count": 13,
  "metadata": {},
  "outputs": [
   {
@@ -229,423 +236,423 @@
    "text": [
     "Train on 24 samples, validate on 12 samples\n",
     "Epoch 1/200\n",
-    "24/24 [==============================] - 0s 14ms/step - loss: 1.6079 - acc: 0.1250 - val_loss: 1.6118 - val_acc: 0.2500\n",
+    "24/24 [==============================] - 0s 13ms/step - loss: 1.5759 - acc: 0.4167 - val_loss: 1.3881 - val_acc: 0.5833\n",
     [... per-epoch output for epochs 2-199 condensed: the previous run improves steadily and holds val_acc 0.9167 from roughly epoch 104 onward; the re-run peaks at val_acc 0.7500 around epochs 49-109, then drifts back to 0.6667 while training accuracy reaches 1.0000 and val_loss turns upward ...]
     "Epoch 200/200\n",
-    "24/24 [==============================] - 0s 461us/step - loss: 0.0614 - acc: 1.0000 - val_loss: 0.7033 - val_acc: 0.9167\n"
+    "24/24 [==============================] - 0s 469us/step - loss: 0.0240 - acc: 1.0000 - val_loss: 0.9069 - val_acc: 0.6667\n"
    ]
   }
  ],
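The re-run above finishes with training accuracy at 1.0000 but val_acc back at 0.6667, and its validation loss bottoms out well before epoch 200: with only 24 training samples the network memorizes the training set early and then overfits. A minimal sketch of one way to cap the run, assuming the standard Keras callbacks API; the variable names (model, x_train, y_train_onehot, x_val, y_val_onehot) are placeholders, not taken from the notebook:

from keras.callbacks import EarlyStopping

# Stop when validation loss stops improving rather than always running 200 epochs.
early_stop = EarlyStopping(monitor='val_loss',
                           patience=10,                # tolerate 10 stagnant epochs
                           restore_best_weights=True)  # requires Keras >= 2.2.3

history = model.fit(x_train, y_train_onehot,
                    epochs=200,
                    validation_data=(x_val, y_val_onehot),
                    callbacks=[early_stop])

On older Keras releases, a ModelCheckpoint callback with save_best_only=True keeps the weights from the best validation epoch instead.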
@@ -667,16 +674,16 @@
 },
 {
  "cell_type": "code",
- "execution_count": 116,
+ "execution_count": 14,
  "metadata": {},
  "outputs": [
   {
    "data": {
     "text/plain": [
-     "array([3, 4, 4, 0, 0, 3, 4, 3, 4, 3, 1, 3])"
+     "array([4, 2, 2, 0, 0, 4, 0, 2, 2, 2, 2, 2])"
     ]
    },
-   "execution_count": 116,
+   "execution_count": 14,
    "metadata": {},
    "output_type": "execute_result"
   }
@@ -687,7 +694,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 117,
+ "execution_count": 15,
  "metadata": {},
  "outputs": [
   {
@@ -696,7 +703,7 @@
     "array([4, 2, 0, 3, 3, 4, 2, 2, 2, 2, 2, 2])"
     ]
    },
-   "execution_count": 117,
+   "execution_count": 15,
    "metadata": {},
    "output_type": "execute_result"
   }
@@ -716,7 +723,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 118,
+ "execution_count": 16,
  "metadata": {},
  "outputs": [],
  "source": [
@@ -726,7 +733,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 119,
+ "execution_count": 17,
  "metadata": {},
  "outputs": [
   {
@@ -735,7 +742,7 @@
     "(24, 10000)"
     ]
    },
-   "execution_count": 119,
+   "execution_count": 17,
    "metadata": {},
    "output_type": "execute_result"
   }
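The (24, 10000) shape above shows the 24 training descriptions vectorized into fixed-length 10,000-dimensional rows. The cell that produces this matrix is elided by the hunks here, so the following is a plausible reconstruction under that assumption, not the notebook's actual code; train_descriptions and x_train are hypothetical names:

from keras.preprocessing.text import Tokenizer

# Cap the vocabulary at the 10,000 most frequent tokens, matching the 10000 columns above.
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(train_descriptions)
sequences = tokenizer.texts_to_sequences(train_descriptions)
# mode='binary' produces multi-hot rows: 1 where a word occurs in the description, 0 elsewhere.
x_train = tokenizer.sequences_to_matrix(sequences, mode='binary')  # shape: (24, 10000)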
"=================================================================\n", - "dense_44 (Dense) (None, 16) 160016 \n", + "dense_4 (Dense) (None, 16) 160016 \n", "_________________________________________________________________\n", - "dense_45 (Dense) (None, 16) 272 \n", + "dense_5 (Dense) (None, 16) 272 \n", "_________________________________________________________________\n", - "dense_46 (Dense) (None, 5) 85 \n", + "dense_6 (Dense) (None, 5) 85 \n", "=================================================================\n", "Total params: 160,373\n", "Trainable params: 160,373\n", @@ -786,7 +793,7 @@ }, { "cell_type": "code", - "execution_count": 59, + "execution_count": 19, "metadata": {}, "outputs": [ { @@ -795,127 +802,127 @@ "text": [ "Train on 24 samples, validate on 12 samples\n", "Epoch 1/200\n", - "24/24 [==============================] - 0s 19ms/step - loss: 1.6093 - acc: 0.3750 - val_loss: 1.5909 - val_acc: 0.5000\n", + "24/24 [==============================] - 0s 12ms/step - loss: 1.6066 - acc: 0.2500 - val_loss: 1.5969 - val_acc: 0.7500\n", "Epoch 2/200\n", - "24/24 [==============================] - 0s 547us/step - loss: 1.5923 - acc: 0.4167 - val_loss: 1.5764 - val_acc: 0.6667\n", + "24/24 [==============================] - 0s 700us/step - loss: 1.5922 - acc: 0.7083 - val_loss: 1.5869 - val_acc: 0.7500\n", "Epoch 3/200\n", - "24/24 [==============================] - 0s 772us/step - loss: 1.5797 - acc: 0.5833 - val_loss: 1.5647 - val_acc: 0.6667\n", + "24/24 [==============================] - 0s 1ms/step - loss: 1.5816 - acc: 0.8333 - val_loss: 1.5803 - val_acc: 0.7500\n", "Epoch 4/200\n", - "24/24 [==============================] - 0s 729us/step - loss: 1.5701 - acc: 0.6250 - val_loss: 1.5547 - val_acc: 0.6667\n", + "24/24 [==============================] - 0s 618us/step - loss: 1.5737 - acc: 0.8333 - val_loss: 1.5734 - val_acc: 0.7500\n", "Epoch 5/200\n", - "24/24 [==============================] - 0s 842us/step - loss: 1.5604 - acc: 0.5833 - val_loss: 1.5448 - val_acc: 0.6667\n", + "24/24 [==============================] - 0s 633us/step - loss: 1.5657 - acc: 0.8333 - val_loss: 1.5664 - val_acc: 0.7500\n", "Epoch 6/200\n", - "24/24 [==============================] - 0s 841us/step - loss: 1.5512 - acc: 0.6667 - val_loss: 1.5351 - val_acc: 0.6667\n", + "24/24 [==============================] - 0s 590us/step - loss: 1.5575 - acc: 0.8333 - val_loss: 1.5604 - val_acc: 0.7500\n", "Epoch 7/200\n", - "24/24 [==============================] - 0s 836us/step - loss: 1.5421 - acc: 0.6667 - val_loss: 1.5246 - val_acc: 0.6667\n", + "24/24 [==============================] - 0s 523us/step - loss: 1.5495 - acc: 0.8333 - val_loss: 1.5529 - val_acc: 0.7500\n", "Epoch 8/200\n", - "24/24 [==============================] - 0s 1ms/step - loss: 1.5325 - acc: 0.7083 - val_loss: 1.5138 - val_acc: 0.6667\n", + "24/24 [==============================] - 0s 515us/step - loss: 1.5417 - acc: 0.8333 - val_loss: 1.5466 - val_acc: 0.7500\n", "Epoch 9/200\n", - "24/24 [==============================] - 0s 882us/step - loss: 1.5224 - acc: 0.7083 - val_loss: 1.5022 - val_acc: 0.7500\n", + "24/24 [==============================] - 0s 550us/step - loss: 1.5335 - acc: 0.8333 - val_loss: 1.5387 - val_acc: 0.7500\n", "Epoch 10/200\n", - "24/24 [==============================] - 0s 704us/step - loss: 1.5122 - acc: 0.7083 - val_loss: 1.4902 - val_acc: 0.7500\n", + "24/24 [==============================] - 0s 574us/step - loss: 1.5252 - acc: 0.8333 - val_loss: 1.5313 - val_acc: 0.7500\n", "Epoch 11/200\n", - "24/24 
@@ -786,7 +793,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 59,
+ "execution_count": 19,
  "metadata": {},
  "outputs": [
   {
@@ -795,127 +802,127 @@
    "text": [
     "Train on 24 samples, validate on 12 samples\n",
     "Epoch 1/200\n",
-    "24/24 [==============================] - 0s 19ms/step - loss: 1.6093 - acc: 0.3750 - val_loss: 1.5909 - val_acc: 0.5000\n",
+    "24/24 [==============================] - 0s 12ms/step - loss: 1.6066 - acc: 0.2500 - val_loss: 1.5969 - val_acc: 0.7500\n",
     [... per-epoch output for epochs 2-60 condensed: the previous run reaches val_acc 0.8333 by roughly epoch 36; the re-run holds val_acc 0.7500 until roughly epoch 41 and 0.8333 thereafter, with loss falling toward 0.77 ...]
     "Epoch 61/200\n",
-    "24/24 [==============================] - 0s 759us/step - loss: 0.7364 - acc: 0.9583 - val_loss: 0.7425 - val_acc: 0.8333\n"
+    "24/24 [==============================] - 0s 689us/step - loss: 0.7698 - acc: 0.9167 - val_loss: 0.8626 - val_acc: 0.8333\n"
    ]
   },
   {
    "name": "stdout",
    "output_type": "stream",
    "text": [
     "Epoch 62/200\n",
-    "24/24 [==============================] - 0s 3ms/step - loss: 0.7207 - acc: 0.9583 - val_loss: 0.7289 - val_acc: 0.8333\n",
+    "24/24 [==============================] - 0s 624us/step - loss: 
0.7519 - acc: 0.9167 - val_loss: 0.8474 - val_acc: 0.8333\n", "Epoch 63/200\n", - "24/24 [==============================] - 0s 746us/step - loss: 0.7050 - acc: 0.9583 - val_loss: 0.7160 - val_acc: 0.8333\n", + "24/24 [==============================] - 0s 620us/step - loss: 0.7357 - acc: 0.9167 - val_loss: 0.8333 - val_acc: 0.8333\n", "Epoch 64/200\n", - "24/24 [==============================] - 0s 772us/step - loss: 0.6892 - acc: 0.9583 - val_loss: 0.7032 - val_acc: 0.8333\n", + "24/24 [==============================] - 0s 900us/step - loss: 0.7182 - acc: 0.9167 - val_loss: 0.8189 - val_acc: 0.8333\n", "Epoch 65/200\n", - "24/24 [==============================] - 0s 1ms/step - loss: 0.6745 - acc: 0.9583 - val_loss: 0.6910 - val_acc: 0.8333\n", + "24/24 [==============================] - 0s 660us/step - loss: 0.7019 - acc: 0.9167 - val_loss: 0.8040 - val_acc: 0.8333\n", "Epoch 66/200\n", - "24/24 [==============================] - 0s 849us/step - loss: 0.6614 - acc: 0.9583 - val_loss: 0.6800 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 510us/step - loss: 0.6861 - acc: 0.9167 - val_loss: 0.7905 - val_acc: 0.8333\n", "Epoch 67/200\n", - "24/24 [==============================] - 0s 812us/step - loss: 0.6449 - acc: 0.9583 - val_loss: 0.6678 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 734us/step - loss: 0.6703 - acc: 0.9167 - val_loss: 0.7765 - val_acc: 0.8333\n", "Epoch 68/200\n", - "24/24 [==============================] - 0s 812us/step - loss: 0.6304 - acc: 1.0000 - val_loss: 0.6572 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 725us/step - loss: 0.6545 - acc: 0.9167 - val_loss: 0.7646 - val_acc: 0.8333\n", "Epoch 69/200\n", - "24/24 [==============================] - 0s 925us/step - loss: 0.6159 - acc: 1.0000 - val_loss: 0.6446 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 583us/step - loss: 0.6394 - acc: 0.9167 - val_loss: 0.7506 - val_acc: 0.8333\n", "Epoch 70/200\n", - "24/24 [==============================] - 0s 850us/step - loss: 0.6017 - acc: 1.0000 - val_loss: 0.6338 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 1ms/step - loss: 0.6236 - acc: 0.9167 - val_loss: 0.7379 - val_acc: 0.8333\n", "Epoch 71/200\n", - "24/24 [==============================] - 0s 678us/step - loss: 0.5887 - acc: 1.0000 - val_loss: 0.6241 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 908us/step - loss: 0.6093 - acc: 0.9167 - val_loss: 0.7247 - val_acc: 0.8333\n", "Epoch 72/200\n", - "24/24 [==============================] - 0s 904us/step - loss: 0.5741 - acc: 1.0000 - val_loss: 0.6128 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 700us/step - loss: 0.5952 - acc: 0.9167 - val_loss: 0.7128 - val_acc: 0.8333\n", "Epoch 73/200\n", - "24/24 [==============================] - 0s 844us/step - loss: 0.5600 - acc: 1.0000 - val_loss: 0.6023 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 830us/step - loss: 0.5804 - acc: 0.9167 - val_loss: 0.7008 - val_acc: 0.8333\n", "Epoch 74/200\n", - "24/24 [==============================] - 0s 843us/step - loss: 0.5469 - acc: 1.0000 - val_loss: 0.5924 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 663us/step - loss: 0.5660 - acc: 0.9167 - val_loss: 0.6886 - val_acc: 0.8333\n", "Epoch 75/200\n", - "24/24 [==============================] - 0s 805us/step - loss: 0.5334 - acc: 1.0000 - val_loss: 0.5823 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 581us/step - 
loss: 0.5528 - acc: 0.9167 - val_loss: 0.6772 - val_acc: 0.8333\n", "Epoch 76/200\n", - "24/24 [==============================] - 0s 831us/step - loss: 0.5210 - acc: 1.0000 - val_loss: 0.5734 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 645us/step - loss: 0.5391 - acc: 0.9167 - val_loss: 0.6656 - val_acc: 0.8333\n", "Epoch 77/200\n", - "24/24 [==============================] - 0s 989us/step - loss: 0.5075 - acc: 1.0000 - val_loss: 0.5628 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 589us/step - loss: 0.5261 - acc: 0.9167 - val_loss: 0.6550 - val_acc: 0.8333\n", "Epoch 78/200\n", - "24/24 [==============================] - 0s 840us/step - loss: 0.4947 - acc: 1.0000 - val_loss: 0.5535 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 504us/step - loss: 0.5131 - acc: 0.9167 - val_loss: 0.6436 - val_acc: 0.8333\n", "Epoch 79/200\n", - "24/24 [==============================] - 0s 695us/step - loss: 0.4827 - acc: 1.0000 - val_loss: 0.5441 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 561us/step - loss: 0.5018 - acc: 0.9167 - val_loss: 0.6343 - val_acc: 0.8333\n", "Epoch 80/200\n", - "24/24 [==============================] - 0s 853us/step - loss: 0.4706 - acc: 1.0000 - val_loss: 0.5357 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 722us/step - loss: 0.4896 - acc: 0.9167 - val_loss: 0.6239 - val_acc: 0.8333\n", "Epoch 81/200\n", - "24/24 [==============================] - 0s 768us/step - loss: 0.4585 - acc: 1.0000 - val_loss: 0.5271 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 847us/step - loss: 0.4774 - acc: 0.9167 - val_loss: 0.6145 - val_acc: 0.8333\n", "Epoch 82/200\n", - "24/24 [==============================] - 0s 822us/step - loss: 0.4473 - acc: 1.0000 - val_loss: 0.5192 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 689us/step - loss: 0.4659 - acc: 0.9167 - val_loss: 0.6044 - val_acc: 0.8333\n", "Epoch 83/200\n", - "24/24 [==============================] - 0s 661us/step - loss: 0.4347 - acc: 1.0000 - val_loss: 0.5104 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 987us/step - loss: 0.4543 - acc: 0.9167 - val_loss: 0.5953 - val_acc: 0.8333\n", "Epoch 84/200\n", - "24/24 [==============================] - 0s 661us/step - loss: 0.4235 - acc: 1.0000 - val_loss: 0.5018 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 935us/step - loss: 0.4437 - acc: 0.9167 - val_loss: 0.5864 - val_acc: 0.8333\n", "Epoch 85/200\n", - "24/24 [==============================] - 0s 696us/step - loss: 0.4141 - acc: 1.0000 - val_loss: 0.4952 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 604us/step - loss: 0.4336 - acc: 0.9167 - val_loss: 0.5776 - val_acc: 0.8333\n", "Epoch 86/200\n", - "24/24 [==============================] - 0s 675us/step - loss: 0.4020 - acc: 1.0000 - val_loss: 0.4864 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 535us/step - loss: 0.4227 - acc: 0.9583 - val_loss: 0.5689 - val_acc: 0.8333\n", "Epoch 87/200\n", - "24/24 [==============================] - 0s 740us/step - loss: 0.3908 - acc: 1.0000 - val_loss: 0.4800 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 520us/step - loss: 0.4126 - acc: 0.9583 - val_loss: 0.5603 - val_acc: 0.8333\n", "Epoch 88/200\n", - "24/24 [==============================] - 0s 679us/step - loss: 0.3808 - acc: 1.0000 - val_loss: 0.4723 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 
1ms/step - loss: 0.4023 - acc: 0.9583 - val_loss: 0.5525 - val_acc: 0.8333\n", "Epoch 89/200\n", - "24/24 [==============================] - 0s 639us/step - loss: 0.3705 - acc: 1.0000 - val_loss: 0.4647 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 619us/step - loss: 0.3927 - acc: 0.9583 - val_loss: 0.5448 - val_acc: 0.8333\n", "Epoch 90/200\n", - "24/24 [==============================] - 0s 736us/step - loss: 0.3613 - acc: 1.0000 - val_loss: 0.4576 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 585us/step - loss: 0.3840 - acc: 0.9583 - val_loss: 0.5372 - val_acc: 0.8333\n", "Epoch 91/200\n", - "24/24 [==============================] - 0s 765us/step - loss: 0.3513 - acc: 1.0000 - val_loss: 0.4507 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 543us/step - loss: 0.3743 - acc: 0.9583 - val_loss: 0.5300 - val_acc: 0.8333\n", "Epoch 92/200\n", - "24/24 [==============================] - 0s 736us/step - loss: 0.3412 - acc: 1.0000 - val_loss: 0.4441 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 546us/step - loss: 0.3656 - acc: 0.9583 - val_loss: 0.5223 - val_acc: 0.8333\n", "Epoch 93/200\n", - "24/24 [==============================] - 0s 722us/step - loss: 0.3318 - acc: 1.0000 - val_loss: 0.4378 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 562us/step - loss: 0.3574 - acc: 0.9583 - val_loss: 0.5155 - val_acc: 0.8333\n", "Epoch 94/200\n", - "24/24 [==============================] - 0s 853us/step - loss: 0.3230 - acc: 1.0000 - val_loss: 0.4309 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 665us/step - loss: 0.3483 - acc: 0.9583 - val_loss: 0.5081 - val_acc: 0.8333\n", "Epoch 95/200\n", - "24/24 [==============================] - 0s 705us/step - loss: 0.3136 - acc: 1.0000 - val_loss: 0.4254 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 716us/step - loss: 0.3411 - acc: 0.9583 - val_loss: 0.5020 - val_acc: 0.8333\n", "Epoch 96/200\n", - "24/24 [==============================] - 0s 936us/step - loss: 0.3057 - acc: 1.0000 - val_loss: 0.4188 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 596us/step - loss: 0.3327 - acc: 0.9583 - val_loss: 0.4953 - val_acc: 0.8333\n", "Epoch 97/200\n", - "24/24 [==============================] - 0s 816us/step - loss: 0.2971 - acc: 1.0000 - val_loss: 0.4136 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 823us/step - loss: 0.3255 - acc: 0.9583 - val_loss: 0.4897 - val_acc: 0.8333\n", "Epoch 98/200\n", - "24/24 [==============================] - 0s 624us/step - loss: 0.2880 - acc: 1.0000 - val_loss: 0.4074 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 608us/step - loss: 0.3177 - acc: 0.9583 - val_loss: 0.4833 - val_acc: 0.8333\n", "Epoch 99/200\n", - "24/24 [==============================] - 0s 883us/step - loss: 0.2798 - acc: 1.0000 - val_loss: 0.4016 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 465us/step - loss: 0.3102 - acc: 0.9583 - val_loss: 0.4775 - val_acc: 0.8333\n", "Epoch 100/200\n", - "24/24 [==============================] - 0s 638us/step - loss: 0.2718 - acc: 1.0000 - val_loss: 0.3955 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 534us/step - loss: 0.3033 - acc: 0.9583 - val_loss: 0.4716 - val_acc: 0.8333\n", "Epoch 101/200\n", - "24/24 [==============================] - 0s 839us/step - loss: 0.2640 - acc: 1.0000 - val_loss: 0.3900 - val_acc: 0.9167\n", + "24/24 
[==============================] - 0s 552us/step - loss: 0.2964 - acc: 0.9583 - val_loss: 0.4660 - val_acc: 0.8333\n", "Epoch 102/200\n", - "24/24 [==============================] - 0s 657us/step - loss: 0.2574 - acc: 1.0000 - val_loss: 0.3858 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 808us/step - loss: 0.2892 - acc: 0.9583 - val_loss: 0.4600 - val_acc: 0.8333\n", "Epoch 103/200\n", - "24/24 [==============================] - 0s 976us/step - loss: 0.2490 - acc: 1.0000 - val_loss: 0.3801 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 504us/step - loss: 0.2823 - acc: 0.9583 - val_loss: 0.4548 - val_acc: 0.8333\n", "Epoch 104/200\n", - "24/24 [==============================] - 0s 679us/step - loss: 0.2416 - acc: 1.0000 - val_loss: 0.3745 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 504us/step - loss: 0.2766 - acc: 0.9583 - val_loss: 0.4493 - val_acc: 0.8333\n", "Epoch 105/200\n", - "24/24 [==============================] - 0s 815us/step - loss: 0.2343 - acc: 1.0000 - val_loss: 0.3707 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 526us/step - loss: 0.2694 - acc: 0.9583 - val_loss: 0.4443 - val_acc: 0.8333\n", "Epoch 106/200\n", - "24/24 [==============================] - 0s 805us/step - loss: 0.2279 - acc: 1.0000 - val_loss: 0.3647 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 660us/step - loss: 0.2630 - acc: 0.9583 - val_loss: 0.4391 - val_acc: 0.8333\n", "Epoch 107/200\n", - "24/24 [==============================] - 0s 784us/step - loss: 0.2210 - acc: 1.0000 - val_loss: 0.3601 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 660us/step - loss: 0.2569 - acc: 0.9583 - val_loss: 0.4338 - val_acc: 0.8333\n", "Epoch 108/200\n", - "24/24 [==============================] - 0s 781us/step - loss: 0.2154 - acc: 1.0000 - val_loss: 0.3567 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 786us/step - loss: 0.2509 - acc: 0.9583 - val_loss: 0.4289 - val_acc: 0.8333\n", "Epoch 109/200\n", - "24/24 [==============================] - 0s 828us/step - loss: 0.2082 - acc: 1.0000 - val_loss: 0.3511 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 623us/step - loss: 0.2450 - acc: 0.9583 - val_loss: 0.4236 - val_acc: 0.8333\n", "Epoch 110/200\n", - "24/24 [==============================] - 0s 1ms/step - loss: 0.2022 - acc: 1.0000 - val_loss: 0.3464 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 630us/step - loss: 0.2391 - acc: 0.9583 - val_loss: 0.4192 - val_acc: 0.8333\n", "Epoch 111/200\n", - "24/24 [==============================] - 0s 1ms/step - loss: 0.1955 - acc: 1.0000 - val_loss: 0.3432 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 649us/step - loss: 0.2336 - acc: 0.9583 - val_loss: 0.4150 - val_acc: 0.9167\n", "Epoch 112/200\n", - "24/24 [==============================] - 0s 735us/step - loss: 0.1905 - acc: 1.0000 - val_loss: 0.3390 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 1ms/step - loss: 0.2280 - acc: 1.0000 - val_loss: 0.4102 - val_acc: 0.9167\n", "Epoch 113/200\n", - "24/24 [==============================] - 0s 864us/step - loss: 0.1842 - acc: 1.0000 - val_loss: 0.3345 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 716us/step - loss: 0.2228 - acc: 1.0000 - val_loss: 0.4062 - val_acc: 0.9167\n", "Epoch 114/200\n", - "24/24 [==============================] - 0s 705us/step - loss: 0.1792 - acc: 1.0000 - val_loss: 0.3316 - val_acc: 
0.9167\n", + "24/24 [==============================] - ETA: 0s - loss: 0.1211 - acc: 1.000 - 0s 662us/step - loss: 0.2175 - acc: 1.0000 - val_loss: 0.4010 - val_acc: 0.9167\n", "Epoch 115/200\n", - "24/24 [==============================] - 0s 649us/step - loss: 0.1736 - acc: 1.0000 - val_loss: 0.3277 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 612us/step - loss: 0.2126 - acc: 1.0000 - val_loss: 0.3968 - val_acc: 0.9167\n", "Epoch 116/200\n", - "24/24 [==============================] - 0s 812us/step - loss: 0.1689 - acc: 1.0000 - val_loss: 0.3228 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 679us/step - loss: 0.2073 - acc: 1.0000 - val_loss: 0.3923 - val_acc: 0.9167\n", "Epoch 117/200\n", - "24/24 [==============================] - 0s 708us/step - loss: 0.1634 - acc: 1.0000 - val_loss: 0.3193 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 541us/step - loss: 0.2028 - acc: 1.0000 - val_loss: 0.3881 - val_acc: 0.9167\n", "Epoch 118/200\n", - "24/24 [==============================] - 0s 2ms/step - loss: 0.1584 - acc: 1.0000 - val_loss: 0.3161 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 919us/step - loss: 0.1977 - acc: 1.0000 - val_loss: 0.3842 - val_acc: 0.9167\n", "Epoch 119/200\n", - "24/24 [==============================] - 0s 886us/step - loss: 0.1539 - acc: 1.0000 - val_loss: 0.3120 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 712us/step - loss: 0.1934 - acc: 1.0000 - val_loss: 0.3806 - val_acc: 0.9167\n", "Epoch 120/200\n", - "24/24 [==============================] - 0s 716us/step - loss: 0.1493 - acc: 1.0000 - val_loss: 0.3082 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 784us/step - loss: 0.1888 - acc: 1.0000 - val_loss: 0.3764 - val_acc: 0.9167\n", "Epoch 121/200\n", - "24/24 [==============================] - 0s 749us/step - loss: 0.1448 - acc: 1.0000 - val_loss: 0.3059 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 542us/step - loss: 0.1845 - acc: 1.0000 - val_loss: 0.3729 - val_acc: 0.9167\n", "Epoch 122/200\n", - "24/24 [==============================] - 0s 745us/step - loss: 0.1404 - acc: 1.0000 - val_loss: 0.3018 - val_acc: 0.9167\n", - "Epoch 123/200\n" + "24/24 [==============================] - 0s 595us/step - loss: 0.1798 - acc: 1.0000 - val_loss: 0.3685 - val_acc: 0.9167\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ - "24/24 [==============================] - 0s 666us/step - loss: 0.1366 - acc: 1.0000 - val_loss: 0.2996 - val_acc: 0.9167\n", + "Epoch 123/200\n", + "24/24 [==============================] - 0s 712us/step - loss: 0.1756 - acc: 1.0000 - val_loss: 0.3650 - val_acc: 0.9167\n", "Epoch 124/200\n", - "24/24 [==============================] - 0s 721us/step - loss: 0.1322 - acc: 1.0000 - val_loss: 0.2955 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 618us/step - loss: 0.1718 - acc: 1.0000 - val_loss: 0.3616 - val_acc: 0.9167\n", "Epoch 125/200\n", - "24/24 [==============================] - 0s 609us/step - loss: 0.1282 - acc: 1.0000 - val_loss: 0.2919 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 815us/step - loss: 0.1675 - acc: 1.0000 - val_loss: 0.3578 - val_acc: 0.9167\n", "Epoch 126/200\n", - "24/24 [==============================] - 0s 639us/step - loss: 0.1243 - acc: 1.0000 - val_loss: 0.2895 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 929us/step - loss: 0.1637 - acc: 1.0000 - val_loss: 0.3549 - val_acc: 
0.9167\n", "Epoch 127/200\n", - "24/24 [==============================] - 0s 918us/step - loss: 0.1208 - acc: 1.0000 - val_loss: 0.2859 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 644us/step - loss: 0.1597 - acc: 1.0000 - val_loss: 0.3514 - val_acc: 0.9167\n", "Epoch 128/200\n", - "24/24 [==============================] - 0s 1ms/step - loss: 0.1169 - acc: 1.0000 - val_loss: 0.2835 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 663us/step - loss: 0.1561 - acc: 1.0000 - val_loss: 0.3481 - val_acc: 0.9167\n", "Epoch 129/200\n", - "24/24 [==============================] - 0s 1ms/step - loss: 0.1134 - acc: 1.0000 - val_loss: 0.2800 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 585us/step - loss: 0.1530 - acc: 1.0000 - val_loss: 0.3449 - val_acc: 0.9167\n", "Epoch 130/200\n", - "24/24 [==============================] - 0s 990us/step - loss: 0.1106 - acc: 1.0000 - val_loss: 0.2777 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 757us/step - loss: 0.1491 - acc: 1.0000 - val_loss: 0.3419 - val_acc: 0.9167\n", "Epoch 131/200\n", - "24/24 [==============================] - 0s 899us/step - loss: 0.1071 - acc: 1.0000 - val_loss: 0.2742 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 609us/step - loss: 0.1457 - acc: 1.0000 - val_loss: 0.3385 - val_acc: 0.9167\n", "Epoch 132/200\n", - "24/24 [==============================] - 0s 826us/step - loss: 0.1039 - acc: 1.0000 - val_loss: 0.2716 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 721us/step - loss: 0.1424 - acc: 1.0000 - val_loss: 0.3354 - val_acc: 0.9167\n", "Epoch 133/200\n", - "24/24 [==============================] - 0s 915us/step - loss: 0.1006 - acc: 1.0000 - val_loss: 0.2686 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 756us/step - loss: 0.1392 - acc: 1.0000 - val_loss: 0.3323 - val_acc: 0.9167\n", "Epoch 134/200\n", - "24/24 [==============================] - 0s 904us/step - loss: 0.0976 - acc: 1.0000 - val_loss: 0.2659 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 635us/step - loss: 0.1364 - acc: 1.0000 - val_loss: 0.3296 - val_acc: 0.9167\n", "Epoch 135/200\n", - "24/24 [==============================] - 0s 873us/step - loss: 0.0947 - acc: 1.0000 - val_loss: 0.2628 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 1ms/step - loss: 0.1329 - acc: 1.0000 - val_loss: 0.3266 - val_acc: 0.9167\n", "Epoch 136/200\n", - "24/24 [==============================] - 0s 848us/step - loss: 0.0920 - acc: 1.0000 - val_loss: 0.2606 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 731us/step - loss: 0.1299 - acc: 1.0000 - val_loss: 0.3241 - val_acc: 0.9167\n", "Epoch 137/200\n", - "24/24 [==============================] - 0s 887us/step - loss: 0.0891 - acc: 1.0000 - val_loss: 0.2565 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 954us/step - loss: 0.1274 - acc: 1.0000 - val_loss: 0.3217 - val_acc: 0.9167\n", "Epoch 138/200\n", - "24/24 [==============================] - 0s 811us/step - loss: 0.0865 - acc: 1.0000 - val_loss: 0.2548 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 1ms/step - loss: 0.1244 - acc: 1.0000 - val_loss: 0.3184 - val_acc: 0.9167\n", "Epoch 139/200\n", - "24/24 [==============================] - 0s 717us/step - loss: 0.0838 - acc: 1.0000 - val_loss: 0.2525 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 770us/step - loss: 0.1215 - acc: 1.0000 - val_loss: 
0.3159 - val_acc: 0.9167\n", "Epoch 140/200\n", - "24/24 [==============================] - 0s 686us/step - loss: 0.0814 - acc: 1.0000 - val_loss: 0.2491 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 601us/step - loss: 0.1188 - acc: 1.0000 - val_loss: 0.3130 - val_acc: 0.9167\n", "Epoch 141/200\n", - "24/24 [==============================] - 0s 788us/step - loss: 0.0789 - acc: 1.0000 - val_loss: 0.2476 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 734us/step - loss: 0.1163 - acc: 1.0000 - val_loss: 0.3106 - val_acc: 0.9167\n", "Epoch 142/200\n", - "24/24 [==============================] - 0s 643us/step - loss: 0.0770 - acc: 1.0000 - val_loss: 0.2446 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 582us/step - loss: 0.1136 - acc: 1.0000 - val_loss: 0.3073 - val_acc: 0.9167\n", "Epoch 143/200\n", - "24/24 [==============================] - 0s 854us/step - loss: 0.0747 - acc: 1.0000 - val_loss: 0.2427 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 726us/step - loss: 0.1111 - acc: 1.0000 - val_loss: 0.3051 - val_acc: 0.9167\n", "Epoch 144/200\n", - "24/24 [==============================] - 0s 679us/step - loss: 0.0723 - acc: 1.0000 - val_loss: 0.2407 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 590us/step - loss: 0.1086 - acc: 1.0000 - val_loss: 0.3027 - val_acc: 0.9167\n", "Epoch 145/200\n", - "24/24 [==============================] - 0s 814us/step - loss: 0.0704 - acc: 1.0000 - val_loss: 0.2377 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 582us/step - loss: 0.1062 - acc: 1.0000 - val_loss: 0.2999 - val_acc: 0.9167\n", "Epoch 146/200\n", - "24/24 [==============================] - 0s 724us/step - loss: 0.0680 - acc: 1.0000 - val_loss: 0.2359 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 699us/step - loss: 0.1040 - acc: 1.0000 - val_loss: 0.2978 - val_acc: 0.9167\n", "Epoch 147/200\n", - "24/24 [==============================] - 0s 779us/step - loss: 0.0661 - acc: 1.0000 - val_loss: 0.2326 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 919us/step - loss: 0.1016 - acc: 1.0000 - val_loss: 0.2948 - val_acc: 0.9167\n", "Epoch 148/200\n", - "24/24 [==============================] - 0s 635us/step - loss: 0.0642 - acc: 1.0000 - val_loss: 0.2303 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 831us/step - loss: 0.0994 - acc: 1.0000 - val_loss: 0.2927 - val_acc: 0.9167\n", "Epoch 149/200\n", - "24/24 [==============================] - 0s 787us/step - loss: 0.0624 - acc: 1.0000 - val_loss: 0.2288 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 617us/step - loss: 0.0971 - acc: 1.0000 - val_loss: 0.2900 - val_acc: 0.9167\n", "Epoch 150/200\n", - "24/24 [==============================] - 0s 547us/step - loss: 0.0607 - acc: 1.0000 - val_loss: 0.2254 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 1ms/step - loss: 0.0950 - acc: 1.0000 - val_loss: 0.2879 - val_acc: 0.9167\n", "Epoch 151/200\n", - "24/24 [==============================] - 0s 857us/step - loss: 0.0587 - acc: 1.0000 - val_loss: 0.2229 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 569us/step - loss: 0.0930 - acc: 1.0000 - val_loss: 0.2849 - val_acc: 0.9167\n", "Epoch 152/200\n", - "24/24 [==============================] - 0s 726us/step - loss: 0.0569 - acc: 1.0000 - val_loss: 0.2214 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 823us/step - loss: 0.0909 - acc: 
1.0000 - val_loss: 0.2829 - val_acc: 0.9167\n", "Epoch 153/200\n", - "24/24 [==============================] - 0s 652us/step - loss: 0.0556 - acc: 1.0000 - val_loss: 0.2186 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 594us/step - loss: 0.0889 - acc: 1.0000 - val_loss: 0.2807 - val_acc: 0.9167\n", "Epoch 154/200\n", - "24/24 [==============================] - 0s 765us/step - loss: 0.0536 - acc: 1.0000 - val_loss: 0.2165 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 621us/step - loss: 0.0870 - acc: 1.0000 - val_loss: 0.2787 - val_acc: 0.9167\n", "Epoch 155/200\n", - "24/24 [==============================] - 0s 925us/step - loss: 0.0520 - acc: 1.0000 - val_loss: 0.2138 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 810us/step - loss: 0.0852 - acc: 1.0000 - val_loss: 0.2758 - val_acc: 0.9167\n", "Epoch 156/200\n", - "24/24 [==============================] - 0s 722us/step - loss: 0.0506 - acc: 1.0000 - val_loss: 0.2122 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 577us/step - loss: 0.0834 - acc: 1.0000 - val_loss: 0.2737 - val_acc: 0.9167\n", "Epoch 157/200\n", - "24/24 [==============================] - 0s 729us/step - loss: 0.0489 - acc: 1.0000 - val_loss: 0.2083 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 563us/step - loss: 0.0819 - acc: 1.0000 - val_loss: 0.2716 - val_acc: 0.9167\n", "Epoch 158/200\n", - "24/24 [==============================] - 0s 1ms/step - loss: 0.0475 - acc: 1.0000 - val_loss: 0.2066 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 478us/step - loss: 0.0797 - acc: 1.0000 - val_loss: 0.2693 - val_acc: 0.9167\n", "Epoch 159/200\n", - "24/24 [==============================] - 0s 775us/step - loss: 0.0461 - acc: 1.0000 - val_loss: 0.2048 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 598us/step - loss: 0.0784 - acc: 1.0000 - val_loss: 0.2666 - val_acc: 0.9167\n", "Epoch 160/200\n", - "24/24 [==============================] - 0s 767us/step - loss: 0.0448 - acc: 1.0000 - val_loss: 0.2032 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 834us/step - loss: 0.0766 - acc: 1.0000 - val_loss: 0.2645 - val_acc: 0.9167\n", "Epoch 161/200\n", - "24/24 [==============================] - 0s 644us/step - loss: 0.0434 - acc: 1.0000 - val_loss: 0.2000 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 820us/step - loss: 0.0748 - acc: 1.0000 - val_loss: 0.2625 - val_acc: 0.9167\n", "Epoch 162/200\n", - "24/24 [==============================] - 0s 779us/step - loss: 0.0421 - acc: 1.0000 - val_loss: 0.1973 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 610us/step - loss: 0.0734 - acc: 1.0000 - val_loss: 0.2603 - val_acc: 0.9167\n", "Epoch 163/200\n", - "24/24 [==============================] - 0s 782us/step - loss: 0.0407 - acc: 1.0000 - val_loss: 0.1950 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 563us/step - loss: 0.0721 - acc: 1.0000 - val_loss: 0.2587 - val_acc: 0.9167\n", "Epoch 164/200\n", - "24/24 [==============================] - 0s 806us/step - loss: 0.0395 - acc: 1.0000 - val_loss: 0.1934 - val_acc: 0.9167\n", + "24/24 [==============================] - ETA: 0s - loss: 0.0069 - acc: 1.000 - 0s 1ms/step - loss: 0.0703 - acc: 1.0000 - val_loss: 0.2555 - val_acc: 0.9167\n", "Epoch 165/200\n", - "24/24 [==============================] - 0s 582us/step - loss: 0.0384 - acc: 1.0000 - val_loss: 0.1919 - val_acc: 0.9167\n", + "24/24 
[==============================] - 0s 721us/step - loss: 0.0688 - acc: 1.0000 - val_loss: 0.2536 - val_acc: 0.9167\n", "Epoch 166/200\n", - "24/24 [==============================] - 0s 669us/step - loss: 0.0373 - acc: 1.0000 - val_loss: 0.1888 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 729us/step - loss: 0.0675 - acc: 1.0000 - val_loss: 0.2516 - val_acc: 0.9167\n", "Epoch 167/200\n", - "24/24 [==============================] - 0s 742us/step - loss: 0.0359 - acc: 1.0000 - val_loss: 0.1876 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 626us/step - loss: 0.0661 - acc: 1.0000 - val_loss: 0.2501 - val_acc: 0.9167\n", "Epoch 168/200\n", - "24/24 [==============================] - 0s 794us/step - loss: 0.0349 - acc: 1.0000 - val_loss: 0.1840 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 627us/step - loss: 0.0646 - acc: 1.0000 - val_loss: 0.2472 - val_acc: 0.9167\n", "Epoch 169/200\n", - "24/24 [==============================] - 0s 712us/step - loss: 0.0338 - acc: 1.0000 - val_loss: 0.1827 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 526us/step - loss: 0.0634 - acc: 1.0000 - val_loss: 0.2456 - val_acc: 0.9167\n", "Epoch 170/200\n", - "24/24 [==============================] - 0s 844us/step - loss: 0.0327 - acc: 1.0000 - val_loss: 0.1812 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 637us/step - loss: 0.0620 - acc: 1.0000 - val_loss: 0.2438 - val_acc: 0.9167\n", "Epoch 171/200\n", - "24/24 [==============================] - 0s 680us/step - loss: 0.0316 - acc: 1.0000 - val_loss: 0.1794 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 543us/step - loss: 0.0607 - acc: 1.0000 - val_loss: 0.2422 - val_acc: 0.9167\n", "Epoch 172/200\n", - "24/24 [==============================] - 0s 494us/step - loss: 0.0306 - acc: 1.0000 - val_loss: 0.1763 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 570us/step - loss: 0.0596 - acc: 1.0000 - val_loss: 0.2387 - val_acc: 0.9167\n", "Epoch 173/200\n", - "24/24 [==============================] - 0s 635us/step - loss: 0.0299 - acc: 1.0000 - val_loss: 0.1748 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 636us/step - loss: 0.0581 - acc: 1.0000 - val_loss: 0.2372 - val_acc: 0.9167\n", "Epoch 174/200\n", - "24/24 [==============================] - 0s 583us/step - loss: 0.0288 - acc: 1.0000 - val_loss: 0.1731 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 542us/step - loss: 0.0571 - acc: 1.0000 - val_loss: 0.2354 - val_acc: 0.9167\n", "Epoch 175/200\n", - "24/24 [==============================] - 0s 586us/step - loss: 0.0281 - acc: 1.0000 - val_loss: 0.1703 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 714us/step - loss: 0.0557 - acc: 1.0000 - val_loss: 0.2338 - val_acc: 0.9167\n", "Epoch 176/200\n", - "24/24 [==============================] - 0s 871us/step - loss: 0.0271 - acc: 1.0000 - val_loss: 0.1688 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 578us/step - loss: 0.0544 - acc: 1.0000 - val_loss: 0.2305 - val_acc: 0.9167\n", "Epoch 177/200\n", - "24/24 [==============================] - 0s 689us/step - loss: 0.0262 - acc: 1.0000 - val_loss: 0.1676 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 686us/step - loss: 0.0534 - acc: 1.0000 - val_loss: 0.2291 - val_acc: 0.9167\n", "Epoch 178/200\n", - "24/24 [==============================] - 0s 582us/step - loss: 0.0254 - acc: 1.0000 - val_loss: 0.1644 - val_acc: 
0.9167\n", + "24/24 [==============================] - 0s 943us/step - loss: 0.0521 - acc: 1.0000 - val_loss: 0.2273 - val_acc: 0.9167\n", "Epoch 179/200\n", - "24/24 [==============================] - 0s 833us/step - loss: 0.0245 - acc: 1.0000 - val_loss: 0.1631 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 741us/step - loss: 0.0510 - acc: 1.0000 - val_loss: 0.2251 - val_acc: 0.9167\n", "Epoch 180/200\n", - "24/24 [==============================] - 0s 644us/step - loss: 0.0236 - acc: 1.0000 - val_loss: 0.1611 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 631us/step - loss: 0.0500 - acc: 1.0000 - val_loss: 0.2227 - val_acc: 0.9167\n", "Epoch 181/200\n", - "24/24 [==============================] - 0s 550us/step - loss: 0.0229 - acc: 1.0000 - val_loss: 0.1586 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 573us/step - loss: 0.0488 - acc: 1.0000 - val_loss: 0.2208 - val_acc: 0.9167\n", "Epoch 182/200\n", - "24/24 [==============================] - 0s 706us/step - loss: 0.0222 - acc: 1.0000 - val_loss: 0.1573 - val_acc: 0.9167\n", - "Epoch 183/200\n", - "24/24 [==============================] - 0s 684us/step - loss: 0.0213 - acc: 1.0000 - val_loss: 0.1547 - val_acc: 0.9167\n" + "24/24 [==============================] - 0s 540us/step - loss: 0.0476 - acc: 1.0000 - val_loss: 0.2192 - val_acc: 0.9167\n", + "Epoch 183/200\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ + "24/24 [==============================] - 0s 565us/step - loss: 0.0466 - acc: 1.0000 - val_loss: 0.2173 - val_acc: 0.9167\n", "Epoch 184/200\n", - "24/24 [==============================] - 0s 585us/step - loss: 0.0207 - acc: 1.0000 - val_loss: 0.1535 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 667us/step - loss: 0.0458 - acc: 1.0000 - val_loss: 0.2149 - val_acc: 0.9167\n", "Epoch 185/200\n", - "24/24 [==============================] - 0s 725us/step - loss: 0.0199 - acc: 1.0000 - val_loss: 0.1521 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 532us/step - loss: 0.0446 - acc: 1.0000 - val_loss: 0.2128 - val_acc: 0.9167\n", "Epoch 186/200\n", - "24/24 [==============================] - 0s 692us/step - loss: 0.0192 - acc: 1.0000 - val_loss: 0.1486 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 708us/step - loss: 0.0439 - acc: 1.0000 - val_loss: 0.2115 - val_acc: 0.9167\n", "Epoch 187/200\n", - "24/24 [==============================] - 0s 529us/step - loss: 0.0187 - acc: 1.0000 - val_loss: 0.1475 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 691us/step - loss: 0.0427 - acc: 1.0000 - val_loss: 0.2094 - val_acc: 0.9167\n", "Epoch 188/200\n", - "24/24 [==============================] - 0s 608us/step - loss: 0.0179 - acc: 1.0000 - val_loss: 0.1466 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 660us/step - loss: 0.0417 - acc: 1.0000 - val_loss: 0.2081 - val_acc: 0.9167\n", "Epoch 189/200\n", - "24/24 [==============================] - 0s 618us/step - loss: 0.0173 - acc: 1.0000 - val_loss: 0.1443 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 916us/step - loss: 0.0407 - acc: 1.0000 - val_loss: 0.2047 - val_acc: 0.9167\n", "Epoch 190/200\n", - "24/24 [==============================] - 0s 758us/step - loss: 0.0167 - acc: 1.0000 - val_loss: 0.1434 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 805us/step - loss: 0.0399 - acc: 1.0000 - val_loss: 0.2031 - val_acc: 0.9167\n", "Epoch 191/200\n", - "24/24 
[==============================] - 0s 553us/step - loss: 0.0161 - acc: 1.0000 - val_loss: 0.1425 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 697us/step - loss: 0.0390 - acc: 1.0000 - val_loss: 0.2019 - val_acc: 0.9167\n", "Epoch 192/200\n", - "24/24 [==============================] - 0s 909us/step - loss: 0.0157 - acc: 1.0000 - val_loss: 0.1393 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 988us/step - loss: 0.0381 - acc: 1.0000 - val_loss: 0.2001 - val_acc: 0.9167\n", "Epoch 193/200\n", - "24/24 [==============================] - 0s 778us/step - loss: 0.0151 - acc: 1.0000 - val_loss: 0.1379 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 635us/step - loss: 0.0370 - acc: 1.0000 - val_loss: 0.1983 - val_acc: 0.9167\n", "Epoch 194/200\n", - "24/24 [==============================] - 0s 685us/step - loss: 0.0146 - acc: 1.0000 - val_loss: 0.1371 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 700us/step - loss: 0.0363 - acc: 1.0000 - val_loss: 0.1957 - val_acc: 0.9167\n", "Epoch 195/200\n", - "24/24 [==============================] - 0s 604us/step - loss: 0.0140 - acc: 1.0000 - val_loss: 0.1361 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 658us/step - loss: 0.0354 - acc: 1.0000 - val_loss: 0.1943 - val_acc: 0.9167\n", "Epoch 196/200\n", - "24/24 [==============================] - 0s 897us/step - loss: 0.0136 - acc: 1.0000 - val_loss: 0.1338 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 801us/step - loss: 0.0344 - acc: 1.0000 - val_loss: 0.1922 - val_acc: 0.9167\n", "Epoch 197/200\n", - "24/24 [==============================] - 0s 700us/step - loss: 0.0130 - acc: 1.0000 - val_loss: 0.1325 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 697us/step - loss: 0.0336 - acc: 1.0000 - val_loss: 0.1910 - val_acc: 0.9167\n", "Epoch 198/200\n", - "24/24 [==============================] - 0s 591us/step - loss: 0.0125 - acc: 1.0000 - val_loss: 0.1305 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 713us/step - loss: 0.0330 - acc: 1.0000 - val_loss: 0.1899 - val_acc: 0.9167\n", "Epoch 199/200\n", - "24/24 [==============================] - 0s 701us/step - loss: 0.0122 - acc: 1.0000 - val_loss: 0.1287 - val_acc: 0.9167\n", + "24/24 [==============================] - 0s 579us/step - loss: 0.0320 - acc: 1.0000 - val_loss: 0.1864 - val_acc: 0.9167\n", "Epoch 200/200\n", - "24/24 [==============================] - 0s 592us/step - loss: 0.0117 - acc: 1.0000 - val_loss: 0.1263 - val_acc: 0.9167\n" + "24/24 [==============================] - 0s 619us/step - loss: 0.0313 - acc: 1.0000 - val_loss: 0.1845 - val_acc: 0.9167\n" ] } ], @@ -1242,7 +1249,7 @@ }, { "cell_type": "code", - "execution_count": 121, + "execution_count": 20, "metadata": {}, "outputs": [ { @@ -1267,7 +1274,7 @@ }, { "cell_type": "code", - "execution_count": 122, + "execution_count": 21, "metadata": {}, "outputs": [], "source": [ @@ -1284,7 +1291,7 @@ }, { "cell_type": "code", - "execution_count": 99, + "execution_count": 22, "metadata": {}, "outputs": [ { @@ -1293,7 +1300,7 @@ "(24, 500)" ] }, - "execution_count": 99, + "execution_count": 22, "metadata": {}, "output_type": "execute_result" } @@ -1304,7 +1311,7 @@ }, { "cell_type": "code", - "execution_count": 100, + "execution_count": 23, "metadata": {}, "outputs": [], "source": [ @@ -1319,7 +1326,7 @@ }, { "cell_type": "code", - "execution_count": 101, + "execution_count": 24, "metadata": {}, "outputs": [ { @@ -1329,11 
+1336,11 @@ "_________________________________________________________________\n", "Layer (type) Output Shape Param # \n", "=================================================================\n", - "embedding_6 (Embedding) (None, 500, 100) 1000000 \n", + "embedding_1 (Embedding) (None, 500, 100) 1000000 \n", "_________________________________________________________________\n", - "lstm_5 (LSTM) (None, 32) 17024 \n", + "lstm_1 (LSTM) (None, 32) 17024 \n", "_________________________________________________________________\n", - "dense_40 (Dense) (None, 5) 165 \n", + "dense_7 (Dense) (None, 5) 165 \n", "=================================================================\n", "Total params: 1,017,189\n", "Trainable params: 1,017,189\n", @@ -1352,7 +1359,7 @@ }, { "cell_type": "code", - "execution_count": 102, + "execution_count": 25, "metadata": {}, "outputs": [], "source": [ @@ -1362,7 +1369,7 @@ }, { "cell_type": "code", - "execution_count": 103, + "execution_count": 26, "metadata": {}, "outputs": [], "source": [ @@ -1373,7 +1380,7 @@ }, { "cell_type": "code", - "execution_count": 104, + "execution_count": 27, "metadata": {}, "outputs": [ { @@ -1382,45 +1389,45 @@ "text": [ "Train on 24 samples, validate on 12 samples\n", "Epoch 1/20\n", - "24/24 [==============================] - 3s 123ms/step - loss: 1.5062 - acc: 0.3750 - val_loss: 1.3186 - val_acc: 0.7500\n", + "24/24 [==============================] - 3s 118ms/step - loss: 1.4392 - acc: 0.5417 - val_loss: 1.2330 - val_acc: 0.7500\n", "Epoch 2/20\n", - "24/24 [==============================] - 2s 71ms/step - loss: 1.2632 - acc: 0.7917 - val_loss: 1.1607 - val_acc: 0.8333\n", + "24/24 [==============================] - 2s 78ms/step - loss: 1.2242 - acc: 0.7917 - val_loss: 1.0782 - val_acc: 0.7500\n", "Epoch 3/20\n", - "24/24 [==============================] - 2s 71ms/step - loss: 1.0976 - acc: 0.9167 - val_loss: 1.0268 - val_acc: 0.8333\n", + "24/24 [==============================] - 2s 81ms/step - loss: 1.0662 - acc: 0.7917 - val_loss: 0.9564 - val_acc: 0.7500\n", "Epoch 4/20\n", - "24/24 [==============================] - 2s 70ms/step - loss: 0.9557 - acc: 0.9167 - val_loss: 0.9094 - val_acc: 0.8333\n", + "24/24 [==============================] - 2s 73ms/step - loss: 0.9456 - acc: 0.7917 - val_loss: 0.8556 - val_acc: 0.7500\n", "Epoch 5/20\n", - "24/24 [==============================] - 2s 70ms/step - loss: 0.8403 - acc: 0.9167 - val_loss: 0.8069 - val_acc: 0.8333\n", + "24/24 [==============================] - 2s 79ms/step - loss: 0.8390 - acc: 0.7917 - val_loss: 0.7706 - val_acc: 0.7500\n", "Epoch 6/20\n", - "24/24 [==============================] - 2s 72ms/step - loss: 0.7366 - acc: 0.9167 - val_loss: 0.7196 - val_acc: 0.8333\n", + "24/24 [==============================] - 2s 81ms/step - loss: 0.7375 - acc: 0.8333 - val_loss: 0.6927 - val_acc: 0.8333\n", "Epoch 7/20\n", - "24/24 [==============================] - 2s 73ms/step - loss: 0.6395 - acc: 0.9167 - val_loss: 0.6425 - val_acc: 0.8333\n", + "24/24 [==============================] - 2s 81ms/step - loss: 0.6574 - acc: 0.8750 - val_loss: 0.6278 - val_acc: 0.8333\n", "Epoch 8/20\n", - "24/24 [==============================] - 2s 71ms/step - loss: 0.5593 - acc: 0.9167 - val_loss: 0.5777 - val_acc: 0.8333\n", + "24/24 [==============================] - 2s 82ms/step - loss: 0.5811 - acc: 0.8750 - val_loss: 0.5709 - val_acc: 0.8333\n", "Epoch 9/20\n", - "24/24 [==============================] - 2s 72ms/step - loss: 0.4913 - acc: 0.9167 - val_loss: 0.5215 - val_acc: 0.8333\n", + "24/24 
[==============================] - 2s 77ms/step - loss: 0.5114 - acc: 0.9167 - val_loss: 0.5208 - val_acc: 0.8333\n", "Epoch 10/20\n", - "24/24 [==============================] - 2s 75ms/step - loss: 0.4267 - acc: 0.9167 - val_loss: 0.4744 - val_acc: 0.8333\n", + "24/24 [==============================] - 2s 81ms/step - loss: 0.4524 - acc: 0.9167 - val_loss: 0.4780 - val_acc: 0.8333\n", "Epoch 11/20\n", - "24/24 [==============================] - 2s 75ms/step - loss: 0.3747 - acc: 0.9167 - val_loss: 0.4366 - val_acc: 0.8333\n", + "24/24 [==============================] - 2s 86ms/step - loss: 0.3993 - acc: 0.9167 - val_loss: 0.4393 - val_acc: 0.8333\n", "Epoch 12/20\n", - "24/24 [==============================] - 2s 68ms/step - loss: 0.3289 - acc: 0.9167 - val_loss: 0.4040 - val_acc: 0.8333\n", + "24/24 [==============================] - 2s 77ms/step - loss: 0.3541 - acc: 0.9167 - val_loss: 0.4061 - val_acc: 0.9167\n", "Epoch 13/20\n", - "24/24 [==============================] - 2s 70ms/step - loss: 0.2909 - acc: 0.9583 - val_loss: 0.3759 - val_acc: 0.8333\n", + "24/24 [==============================] - 2s 77ms/step - loss: 0.3119 - acc: 0.9167 - val_loss: 0.3782 - val_acc: 0.9167\n", "Epoch 14/20\n", - "24/24 [==============================] - 2s 70ms/step - loss: 0.2546 - acc: 0.9583 - val_loss: 0.3516 - val_acc: 0.8333\n", + "24/24 [==============================] - 2s 78ms/step - loss: 0.2754 - acc: 1.0000 - val_loss: 0.3529 - val_acc: 0.9167\n", "Epoch 15/20\n", - "24/24 [==============================] - 2s 73ms/step - loss: 0.2245 - acc: 0.9583 - val_loss: 0.3315 - val_acc: 0.9167\n", + "24/24 [==============================] - 2s 76ms/step - loss: 0.2441 - acc: 1.0000 - val_loss: 0.3308 - val_acc: 0.9167\n", "Epoch 16/20\n", - "24/24 [==============================] - 2s 73ms/step - loss: 0.1986 - acc: 0.9583 - val_loss: 0.3123 - val_acc: 0.9167\n", + "24/24 [==============================] - 2s 83ms/step - loss: 0.2169 - acc: 1.0000 - val_loss: 0.3107 - val_acc: 0.9167\n", "Epoch 17/20\n", - "24/24 [==============================] - 2s 71ms/step - loss: 0.1746 - acc: 1.0000 - val_loss: 0.2954 - val_acc: 0.9167\n", + "24/24 [==============================] - 2s 77ms/step - loss: 0.1909 - acc: 1.0000 - val_loss: 0.2925 - val_acc: 0.9167\n", "Epoch 18/20\n", - "24/24 [==============================] - 2s 71ms/step - loss: 0.1547 - acc: 1.0000 - val_loss: 0.2806 - val_acc: 0.9167\n", + "24/24 [==============================] - 2s 76ms/step - loss: 0.1692 - acc: 1.0000 - val_loss: 0.2761 - val_acc: 0.9167\n", "Epoch 19/20\n", - "24/24 [==============================] - 2s 71ms/step - loss: 0.1366 - acc: 1.0000 - val_loss: 0.2674 - val_acc: 0.9167\n", + "24/24 [==============================] - 2s 79ms/step - loss: 0.1486 - acc: 1.0000 - val_loss: 0.2606 - val_acc: 0.9167\n", "Epoch 20/20\n", - "24/24 [==============================] - 2s 73ms/step - loss: 0.1216 - acc: 1.0000 - val_loss: 0.2544 - val_acc: 0.9167\n" + "24/24 [==============================] - 2s 83ms/step - loss: 0.1303 - acc: 1.0000 - val_loss: 0.2470 - val_acc: 0.9167\n" ] } ], @@ -1437,6 +1444,47 @@ "source": [ "### From the logs above, the training accuracy is 100% and the validation accuracy is 91.67%" ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [], + "source": [ + "predictions = np.argmax(model.predict(val_data), axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + 
"text": [ + " precision recall f1-score support\n", + "\n", + " Computer - Hardware 1.00 1.00 1.00 1\n", + "Meals and Entertainment 0.88 1.00 0.93 7\n", + " Office Supplies 1.00 0.50 0.67 2\n", + " Travel 1.00 1.00 1.00 2\n", + "\n", + " avg / total 0.93 0.92 0.91 12\n", + "\n" + ] + } + ], + "source": [ + "print(classification_report(y_val, predictions, target_names=val_categoroes))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { diff --git a/clustering_personal_business.ipynb b/clustering_personal_business.ipynb index 108f1ff..b1e0620 100644 --- a/clustering_personal_business.ipynb +++ b/clustering_personal_business.ipynb @@ -32,7 +32,7 @@ }, { "cell_type": "code", - "execution_count": 157, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ @@ -54,7 +54,7 @@ }, { "cell_type": "code", - "execution_count": 176, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ @@ -78,17 +78,14 @@ }, { "cell_type": "code", - "execution_count": 239, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def pre_process(df, columns_to_drop=['date',\n", - " 'employee id', \n", " 'category', \n", - "# 'pre-tax amount', \n", " 'tax amount',\n", " 'tax name', \n", - " 'role', \n", " 'expense description']):\n", " \n", " def holiday(x):\n", @@ -120,7 +117,7 @@ }, { "cell_type": "code", - "execution_count": 240, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -132,7 +129,7 @@ }, { "cell_type": "code", - "execution_count": 241, + "execution_count": 8, "metadata": {}, "outputs": [ { @@ -157,96 +154,68 @@ " \n", " \n", " pre-tax amount\n", + " employee id_1\n", + " employee id_3\n", + " employee id_4\n", + " employee id_5\n", + " employee id_6\n", + " employee id_7\n", " holiday_0\n", " holiday_1\n", - " text_data_Airplane ticket to NY Travel\n", - " text_data_Client dinner Meals and Entertainment\n", - " text_data_Coffee with Steve Meals and Entertainment\n", - " text_data_Dinner Meals and Entertainment\n", - " text_data_Dinner with client Meals and Entertainment\n", - " text_data_Dinner with potential client Meals and Entertainment\n", - " text_data_Dropbox Subscription Computer - Software\n", - " text_data_Flight to Miami Travel\n", - " text_data_HP Laptop Computer Computer - Hardware\n", - " text_data_Macbook Air Computer Computer - Hardware\n", - " text_data_Microsoft Office Computer - Software\n", - " text_data_Paper Office Supplies\n", - " text_data_Starbucks coffee Meals and Entertainment\n", - " text_data_Taxi ride Travel\n", - " text_data_Team lunch Meals and Entertainment\n", - " text_data_iCloud Subscription Computer - Software\n", - " text_data_iPhone Computer - Hardware\n", + " role_CEO\n", + " role_Engineer\n", + " role_IT and Admin\n", + " role_Sales\n", " \n", " \n", " \n", " \n", " 0\n", " 0.018045\n", - " 1\n", - " 0\n", - " 0\n", - " 0\n", - " 0\n", - " 0\n", " 0\n", " 0\n", " 0\n", " 0\n", " 0\n", + " 1\n", + " 1\n", " 0\n", " 0\n", " 0\n", " 0\n", " 1\n", - " 0\n", - " 0\n", - " 0\n", " \n", " \n", " 1\n", " 0.098246\n", - " 1\n", - " 0\n", - " 0\n", - " 1\n", - " 0\n", - " 0\n", - " 0\n", - " 0\n", - " 0\n", - " 0\n", " 0\n", " 0\n", " 0\n", " 0\n", " 0\n", + " 1\n", + " 1\n", " 0\n", " 0\n", " 0\n", " 0\n", + " 1\n", " \n", " \n", " 2\n", " 1.000000\n", - " 1\n", - " 0\n", - " 0\n", - " 0\n", - " 0\n", - " 0\n", " 0\n", " 0\n", " 0\n", " 0\n", " 0\n", " 1\n", + " 1\n", " 0\n", " 0\n", " 0\n", " 0\n", - " 0\n", - " 0\n", - " 0\n", + " 1\n", " 
\n", " \n", " 3\n", @@ -257,24 +226,16 @@ " 0\n", " 0\n", " 0\n", - " 0\n", - " 0\n", - " 0\n", - " 0\n", - " 0\n", - " 0\n", - " 0\n", - " 0\n", - " 0\n", + " 1\n", " 0\n", " 1\n", " 0\n", " 0\n", + " 0\n", " \n", " \n", " 4\n", " 0.005514\n", - " 0\n", " 1\n", " 0\n", " 0\n", @@ -282,144 +243,40 @@ " 0\n", " 0\n", " 0\n", + " 1\n", + " 1\n", " 0\n", " 0\n", " 0\n", - " 0\n", - " 0\n", - " 0\n", - " 0\n", - " 0\n", - " 0\n", - " 1\n", - " 0\n", " \n", " \n", "\n", "" ], "text/plain": [ - " pre-tax amount holiday_0 holiday_1 \\\n", - "0 0.018045 1 0 \n", - "1 0.098246 1 0 \n", - "2 1.000000 1 0 \n", - "3 0.115789 1 0 \n", - "4 0.005514 0 1 \n", - "\n", - " text_data_Airplane ticket to NY Travel \\\n", - "0 0 \n", - "1 0 \n", - "2 0 \n", - "3 0 \n", - "4 0 \n", - "\n", - " text_data_Client dinner Meals and Entertainment \\\n", - "0 0 \n", - "1 1 \n", - "2 0 \n", - "3 0 \n", - "4 0 \n", - "\n", - " text_data_Coffee with Steve Meals and Entertainment \\\n", - "0 0 \n", - "1 0 \n", - "2 0 \n", - "3 0 \n", - "4 0 \n", - "\n", - " text_data_Dinner Meals and Entertainment \\\n", - "0 0 \n", - "1 0 \n", - "2 0 \n", - "3 0 \n", - "4 0 \n", - "\n", - " text_data_Dinner with client Meals and Entertainment \\\n", - "0 0 \n", - "1 0 \n", - "2 0 \n", - "3 0 \n", - "4 0 \n", - "\n", - " text_data_Dinner with potential client Meals and Entertainment \\\n", - "0 0 \n", - "1 0 \n", - "2 0 \n", - "3 0 \n", - "4 0 \n", - "\n", - " text_data_Dropbox Subscription Computer - Software \\\n", - "0 0 \n", - "1 0 \n", - "2 0 \n", - "3 0 \n", - "4 0 \n", - "\n", - " text_data_Flight to Miami Travel \\\n", - "0 0 \n", - "1 0 \n", - "2 0 \n", - "3 0 \n", - "4 0 \n", - "\n", - " text_data_HP Laptop Computer Computer - Hardware \\\n", - "0 0 \n", - "1 0 \n", - "2 0 \n", - "3 0 \n", - "4 0 \n", - "\n", - " text_data_Macbook Air Computer Computer - Hardware \\\n", - "0 0 \n", - "1 0 \n", - "2 1 \n", - "3 0 \n", - "4 0 \n", - "\n", - " text_data_Microsoft Office Computer - Software \\\n", - "0 0 \n", - "1 0 \n", - "2 0 \n", - "3 0 \n", - "4 0 \n", - "\n", - " text_data_Paper Office Supplies \\\n", - "0 0 \n", - "1 0 \n", - "2 0 \n", - "3 0 \n", - "4 0 \n", - "\n", - " text_data_Starbucks coffee Meals and Entertainment \\\n", - "0 0 \n", - "1 0 \n", - "2 0 \n", - "3 0 \n", - "4 0 \n", - "\n", - " text_data_Taxi ride Travel text_data_Team lunch Meals and Entertainment \\\n", - "0 1 0 \n", - "1 0 0 \n", - "2 0 0 \n", - "3 0 1 \n", - "4 0 0 \n", + " pre-tax amount employee id_1 employee id_3 employee id_4 employee id_5 \\\n", + "0 0.018045 0 0 0 0 \n", + "1 0.098246 0 0 0 0 \n", + "2 1.000000 0 0 0 0 \n", + "3 0.115789 1 0 0 0 \n", + "4 0.005514 1 0 0 0 \n", "\n", - " text_data_iCloud Subscription Computer - Software \\\n", - "0 0 \n", - "1 0 \n", - "2 0 \n", - "3 0 \n", - "4 1 \n", + " employee id_6 employee id_7 holiday_0 holiday_1 role_CEO \\\n", + "0 0 1 1 0 0 \n", + "1 0 1 1 0 0 \n", + "2 0 1 1 0 0 \n", + "3 0 0 1 0 1 \n", + "4 0 0 0 1 1 \n", "\n", - " text_data_iPhone Computer - Hardware \n", - "0 0 \n", - "1 0 \n", - "2 0 \n", - "3 0 \n", - "4 0 " + " role_Engineer role_IT and Admin role_Sales \n", + "0 0 0 1 \n", + "1 0 0 1 \n", + "2 0 0 1 \n", + "3 0 0 0 \n", + "4 0 0 0 " ] }, - "execution_count": 241, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } diff --git a/spark_random_forest.ipynb b/spark_random_forest.ipynb index 5cddedf..92bf863 100644 --- a/spark_random_forest.ipynb +++ b/spark_random_forest.ipynb @@ -25,12 +25,13 @@ }, { "cell_type": "code", - "execution_count": 22, + 
"execution_count": 65, "metadata": {}, "outputs": [], "source": [ "import datetime\n", "import calendar\n", + "import numpy as np\n", " \n", "from pyspark.mllib.tree import RandomForest, RandomForestModel\n", "from pyspark.mllib.util import MLUtils\n", @@ -54,7 +55,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 25, "metadata": {}, "outputs": [], "source": [ @@ -70,7 +71,7 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 26, "metadata": {}, "outputs": [], "source": [ @@ -85,7 +86,7 @@ }, { "cell_type": "code", - "execution_count": 36, + "execution_count": 27, "metadata": {}, "outputs": [], "source": [ @@ -109,7 +110,7 @@ }, { "cell_type": "code", - "execution_count": 37, + "execution_count": 28, "metadata": {}, "outputs": [ { @@ -118,7 +119,7 @@ "" ] }, - "execution_count": 37, + "execution_count": 28, "metadata": {}, "output_type": "execute_result" } @@ -146,7 +147,7 @@ }, { "cell_type": "code", - "execution_count": 38, + "execution_count": 29, "metadata": {}, "outputs": [], "source": [ @@ -163,7 +164,7 @@ }, { "cell_type": "code", - "execution_count": 39, + "execution_count": 30, "metadata": {}, "outputs": [], "source": [ @@ -219,7 +220,7 @@ }, { "cell_type": "code", - "execution_count": 40, + "execution_count": 31, "metadata": {}, "outputs": [], "source": [ @@ -238,7 +239,7 @@ }, { "cell_type": "code", - "execution_count": 41, + "execution_count": 32, "metadata": {}, "outputs": [], "source": [ @@ -255,7 +256,7 @@ }, { "cell_type": "code", - "execution_count": 42, + "execution_count": 33, "metadata": {}, "outputs": [], "source": [ @@ -266,7 +267,7 @@ }, { "cell_type": "code", - "execution_count": 43, + "execution_count": 34, "metadata": {}, "outputs": [], "source": [ @@ -275,7 +276,7 @@ }, { "cell_type": "code", - "execution_count": 44, + "execution_count": 35, "metadata": {}, "outputs": [], "source": [ @@ -284,7 +285,7 @@ }, { "cell_type": "code", - "execution_count": 45, + "execution_count": 36, "metadata": {}, "outputs": [], "source": [ @@ -294,7 +295,7 @@ }, { "cell_type": "code", - "execution_count": 46, + "execution_count": 64, "metadata": {}, "outputs": [ { @@ -302,15 +303,19 @@ "output_type": "stream", "text": [ "Training accuracy = 0.875\n", - "Validation accuracy = 0.6666666666666666\n" + "Validation accuracy = 0.833333333333\n" ] } ], "source": [ - "train_accuracy = train_labelsAndPredictions.filter(\n", - " lambda lp: lp[0] == lp[1]).count() / float(df_train.count())\n", - "val_accuracy = val_labelsAndPredictions.filter(\n", - " lambda lp: lp[0] == lp[1]).count() / float(df_val.count())\n", + "actual = np.array(data_train_rdd.map(lambda lp: lp.label).collect())\n", + "predictions = np.array(train_predictions.collect())\n", + "train_accuracy = sum(actual == predictions) / float(len(actual))\n", + "\n", + "actual = np.array(data_val_rdd.map(lambda lp: lp.label).collect())\n", + "predictions = np.array(val_predictions.collect())\n", + "val_accuracy = sum(actual == predictions) / float(len(actual))\n", + "\n", "print('Training accuracy = ' + str(train_accuracy))\n", "print('Validation accuracy = ' + str(val_accuracy))\n", "# print('Learned classification forest model:')\n", @@ -328,7 +333,7 @@ }, { "cell_type": "code", - "execution_count": 48, + "execution_count": 38, "metadata": {}, "outputs": [ { @@ -358,13 +363,6 @@ "source": [ "print(classification_report(val_predictions.collect(), df_val.select('category').rdd.map(lambda x:x[0]).collect(), target_names=val_categoroes))" ] - }, - { - "cell_type": "code", - 
"execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { From 1e6b04b94b8e87d58ad0007f644c3dcb5c2dbef3 Mon Sep 17 00:00:00 2001 From: Asif Iqbal Date: Wed, 30 May 2018 14:29:03 -0400 Subject: [PATCH 08/13] Update README.md --- README.md | 29 ++++++++++++++++++++++++++++- 1 file changed, 28 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index fbf2772..b39ef1e 100644 --- a/README.md +++ b/README.md @@ -141,7 +141,7 @@ This notebook tries to address the second problem in the problem set - identifyi - Resulting clusters kind of makes sense as has been described in the notebook - Tried the cluster model on the validation data as well and that sort of makes sense as well -## spark_random_forest.ipynb +### spark_random_forest.ipynb This notebook tries to address the bonus of problem of running one solutio via spark. For this I tried implementing the classification problem. This notebook is not very well documented, I can explain in person if I get the chance. Key summary points: @@ -149,3 +149,30 @@ This notebook tries to address the bonus of problem of running one solutio via s - Only used text TF-IDF features - Training Accuracy **87.5%**, validation accuracy **83.33%**, precision **100%**, **83%** and f1-score **91%** - Tons of improvement possible + +## Luigi Solution + +[Luigi](Luigi is a Python package that helps you build complex pipelines of batch jobs). I used it to manifest how a real production type pipeline can be developed. I have only included the prediction (classification) pipeline here, not the clustering. However, the clustering can be easily incorporated with some additional effort. The solution is in the `luigi` subfolder. The code is basically taken from the Classification notebook but aranged in a more modular format. + +### Classes + +- **PreProcessor** pre-processes and vectorizes the data and generates vectorized files + - train_processed.csv: vectorized training data + - validation_processed.csv: vectorized validation data + - train_processed_tfidf.csv: vectorized training data with only tf-idf features + - validation_processed_tfidf.csv: vectorized validation data with only tf-idf features + - labels.pkl: a labels files including training and validation labels +- **Predictor** does the actual classification and generates result files + - report.txt: report contaning accuracy and other measurements + - comparison.csv: file showing actual and predicted categories side by side + +Sample output files are provided with the solution. You can delete those files and rerun `run.sh` to run the entore workflow. + +The solution is nowhere close to production-ready and can be further modularized and enhanced in a million way. This was just a manifestation of how pipelines can be developed with modularized tasks. + + +## Conclusion + +I have tried different approaches of addressing the solutions. The datasets were very small and nothing I experimented here is conclusive. I have learnt from my experience and real work that models that perform well on small data are not necessarily best for tackling real big data. Also, there is a myriad of ways of improving things. + +I am not particularly proud of anything, however, I am glad that I went through the exercise and learnt a lot in the process. I hope I get a chance to present myself in person and discuss in more detail. 
+
+
+## Conclusion
+
+I have tried different approaches of addressing the solutions. The datasets were very small and nothing I experimented here is conclusive. I have learnt from my experience and real work that models that perform well on small data are not necessarily best for tackling real big data. Also, there is a myriad of ways of improving things.
+
+I am not particularly proud of anything, however, I am glad that I went through the exercise and learnt a lot in the process. I hope I get a chance to present myself in person and discuss in more detail.

From 2d2b9e9c99f40410314778d31f6d9dec646e3cf7 Mon Sep 17 00:00:00 2001
From: Asif Iqbal
Date: Wed, 30 May 2018 14:35:25 -0400
Subject: [PATCH 09/13] Update README.md

---
 README.md | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index b39ef1e..9dd55cc 100644
--- a/README.md
+++ b/README.md
@@ -86,13 +86,15 @@ Before proceeding, please install/setup the pre-requisites below. I developed th
 - Install python 3 `brew install python3`
 - Install virtual environment `pip3 install virtualenv`
 - Create a virtual env `virtualenv venv` and activate it `source venv/bin/activate`
-- Do the following as apre-requisite for XGBoost
+- Do the following as a pre-requisite for XGBoost
 ```
 brew install gcc
 brew install gcc@5
 ```
 - Now install the packages in [requirements.txt](https://github.com/asif31iqbal/ml-challenge-expenses/blob/master/requirements.txt)
+- Download the pretrained vector `glove.6B.zip` from [here](https://nlp.stanford.edu/projects/glove/) and extract off the `glove.6B.100d` file into the working directory.
+
 At this point, you can run `jupyter notebook` and run the notebooks interactively. You can also run the luigi solution by doing
 ```
 cd luigi
@@ -115,7 +117,7 @@ This notebook tries to address the first question in the problem set - classifyi
 - Tried several **shallow learning** classifiers - Logistic Regression, Naive Bayes, SVM, Gradient Boosting (XGBoost) and Random Forest and attempted to take an ensemble of the ones that performed best
 - Feature engineering involved vectorizing text (**expense description** field) as TF-IDF vectors and using other fields like day_of_week, expense amount, employee ID and tax name. I didn't use n-grams here, but that's possibly an improvement option
 - Tried the classifiers in a two-fold approach - with all features and with only text features
-- Tried with Grid Search cross validation and tuning several parameters like regularization factor, number of estimators etc
+- Tried with Grid Search cross validation (used `LeaveOneOut` since the data is pretty small) and tuning several parameters like regularization factor, number of estimators etc
 - Most classifiers worked better with only text features, indicating that the **expense description** text field is the key field for classification here, and that goes right with common intuition
 - XGBoost didn't perform as well as I had anticipated
 - My personal pick of the classifiers in this case would be Random Forest based on the performance (although I tried a further ensemble of Random Forest with SVM and Logistic Regression)
@@ -137,6 +139,7 @@ This notebook tries to address the second problem in the problem set - identifyi
 - Decided to use **K-means**
 - Extracted holiday (not in a very proper way), along with employee ID and role, and most importantly, again, the text features. This time concatenated **expense description** and **category**
+- Tried with a tf-idf vectorizer and a pre-trained embedding (see the sketch below)
 - 15 data points in business cluster and 9 in personal cluster
 - Resulting clusters kind of make sense, as has been described in the notebook
 - Tried the cluster model on the validation data as well and that sort of makes sense as well
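+
+As a rough sketch of the clustering idea (hypothetical variable names; the actual code lives in `clustering_personal_business.ipynb`):
+
+```
+from sklearn.cluster import KMeans
+from sklearn.feature_extraction.text import TfidfVectorizer
+
+# concatenate the two text fields and vectorize them
+text = df_train['expense description'] + ' ' + df_train['category']
+features = TfidfVectorizer().fit_transform(text)
+
+# two clusters, intended to separate personal from business expenses
+clusters = KMeans(n_clusters=2, random_state=0).fit_predict(features)
+```
+
+In the pre-trained embedding variant, the tf-idf matrix would be swapped for averaged word vectors (e.g. from `glove.6B.100d`) before running K-means.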
@@ -173,6 +176,6 @@ The solution is nowhere close to production-ready and can be further modularized
 ## Conclusion
 
-I have tried different approaches of addressing the solutions. The datasets were very small and nothing I experimented here is conclusive. I have learnt from my experience and real work that models that perform well on small data are not necessarily best for tackling real big data. Also, there is a myriad of ways of improving things.
+I have tried different approaches of addressing the solutions. Overall performance was fairly good, however, the datasets were very small and nothing I experimented here is conclusive. I have learnt from my experience and real work that models that perform well on small data are not necessarily best for tackling real big data. Also, there is a myriad of ways of improving things.
 
 I am not particularly proud of anything, however, I am glad that I went through the exercise and learnt a lot in the process. I hope I get a chance to present myself in person and discuss in more detail.

From e057ea01aa5f8c09ac3f7b7fb086373968cff364 Mon Sep 17 00:00:00 2001
From: Asif Iqbal
Date: Wed, 30 May 2018 14:35:53 -0400
Subject: [PATCH 10/13] more changes

---
 luigi/employee.csv                   |  8 ++++++++
 luigi/pre_processor.py               |  8 ++++++++
 luigi/report.txt                     | 11 +++++++++++
 luigi/train_processed.csv            | 25 +++++++++++++++++++++++++
 luigi/train_processed_tfidf.csv      | 24 ++++++++++++++++++++++++
 luigi/validation_processed.csv       | 25 +++++++++++++++++++++++++
 luigi/validation_processed_tfidf.csv | 12 ++++++++++++
 7 files changed, 113 insertions(+)
 create mode 100755 luigi/employee.csv
 create mode 100644 luigi/report.txt
 create mode 100644 luigi/train_processed.csv
 create mode 100644 luigi/train_processed_tfidf.csv
 create mode 100644 luigi/validation_processed.csv
 create mode 100644 luigi/validation_processed_tfidf.csv

diff --git a/luigi/employee.csv b/luigi/employee.csv
new file mode 100755
index 0000000..7bfe516
--- /dev/null
+++ b/luigi/employee.csv
@@ -0,0 +1,8 @@
+employee id,employee name,employee address,role
+1,Steve Jobs,"235 Carlaw Ave, Toronto, ON, Canada",CEO
+2,Jonathan Ive,"1 Yonge St, Toronto, ON, Canada",Engineer
+3,Tim Cook,"1 Infinite Loop, Toronto, ON, Canada",IT and Admin
+4,Larry Page,"1600 Amphitheatre Parkway, Hamilton, ON, Canada",Sales
+5,Jonathan Ive,"1 Yonge St, Toronto, ON, Canada",Engineer
+6,Eric Schmidt,"250 Carlaw Ave, Toronto, ON, Canada",Sales
+7,Don Draper,"783 Park Ave, New York, NY 10021",Sales
\ No newline at end of file
diff --git a/luigi/pre_processor.py b/luigi/pre_processor.py
index 5b2f71e..2283d3a 100644
--- a/luigi/pre_processor.py
+++ b/luigi/pre_processor.py
@@ -20,6 +20,14 @@ class PreProcess(luigi.Task):
     input_path = luigi.Parameter()
 
     def output(self):
+        """
+        :return: a dictionary of 5 files:
+                 vectorized training data
+                 vectorized validation data
+                 vectorized training data with only tf-idf features
+                 vectorized validation data with only tf-idf features
+                 a labels file including training and validation labels
+        """
         return {
             'train_x': LocalTarget(os.path.join(self.output_path, 'train_processed.csv')),
             'train_x_tfidf': LocalTarget(os.path.join(self.output_path, 'train_processed_tfidf.csv')),
diff --git a/luigi/report.txt b/luigi/report.txt
new file mode 100644
index 0000000..6c807b3
--- /dev/null
+++ b/luigi/report.txt
@@ -0,0 +1,11 @@
+Training Score: 100.0 %
+Validation Score: 91.66666666666666 %
+
+                         precision    recall  f1-score   support
+
+    Computer - Hardware       1.00      1.00      1.00         1
+Meals and Entertainment       0.88      1.00      0.93         7
+        Office Supplies       1.00      0.50      0.67         2
+                 Travel       1.00      1.00      1.00         2
+
+            avg / total       0.93      0.92      0.91        12
diff --git a/luigi/train_processed.csv b/luigi/train_processed.csv
new file mode 100644
index 0000000..99e4a8c
--- /dev/null
+++ b/luigi/train_processed.csv
@@ -0,0 +1,25 @@
+,pre-tax amount,employee id_1,employee id_3,employee id_4,employee id_5,employee id_6,employee id_7,tax name_CA Sales tax,tax name_NY Sales tax,day_of_week_0,day_of_week_1,day_of_week_2,day_of_week_3,day_of_week_4,day_of_week_5,day_of_week_6,month_10,month_11,month_12,month_9,role_CEO,role_Engineer,role_IT and Admin,role_Sales +0,0.018045112781954885,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1 +1,0.09824561403508772,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1 +2,0.9999999999999999,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1 +3,0.11578947368421053,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0 +4,0.005513784461152882,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0 +5,0.02807017543859649,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0 +6,0.09824561403508772,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0 +7,0.4987468671679197,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0 +8,0.44862155388471175,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0 +9,0.023057644110275687,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0 +10,0.09824561403508772,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0 +11,0.023057644110275687,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1 +12,0.11328320802005012,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1 +13,0.09824561403508772,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1 +14,0.09824561403508772,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1 +15,0.10325814536340852,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1 +16,0.08822055137844612,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1 +17,0.49924812030075183,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1 +18,0.013032581453634085,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1 +19,0.14837092731829574,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1 +20,0.09824561403508772,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1 +21,0.09824561403508772,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1 +22,0.7498746867167919,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1 +23,0.0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0 diff --git a/luigi/train_processed_tfidf.csv b/luigi/train_processed_tfidf.csv new file mode 100644 index 0000000..ab87e66 --- /dev/null +++ b/luigi/train_processed_tfidf.csv @@ -0,0 +1,24 @@ +0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,7.071067811865475727e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,7.071067811865475727e-01,0.000000000000000000e+00,0.000000000000000000e+00 +0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,7.071067811865475727e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,7.071067811865475727e-01,0.000000000000000000e+00 
+0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,5.304793476171201227e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,5.994129051629246696e-01,0.000000000000000000e+00,0.000000000000000000e+00,5.994129051629246696e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 +0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,7.071067811865475727e-01,0.000000000000000000e+00,7.071067811865475727e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 +0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,7.404148953098627084e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,6.721501192466493579e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 +0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,7.404148953098627084e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,6.721501192466493579e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 +0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,6.627346931058820667e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,7.488542759134465543e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 
+0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,7.071067811865475727e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,7.071067811865475727e-01,0.000000000000000000e+00,0.000000000000000000e+00 +0.000000000000000000e+00,0.000000000000000000e+00,7.979673611645498044e-01,0.000000000000000000e+00,0.000000000000000000e+00,6.027006641078843652e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 +0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,7.071067811865475727e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,7.071067811865475727e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 +5.994129051629246696e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,5.304793476171201227e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,5.994129051629246696e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 +0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,7.795729893990751558e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,6.263113875696254551e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 
+0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,7.071067811865475727e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,7.071067811865475727e-01,0.000000000000000000e+00,0.000000000000000000e+00 +0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,1.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 +0.000000000000000000e+00,0.000000000000000000e+00,5.661601849739470449e-01,0.000000000000000000e+00,0.000000000000000000e+00,4.276178902571380336e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,7.047024796907559452e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 +0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,1.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 +0.000000000000000000e+00,5.773502691896257311e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,5.773502691896257311e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,5.773502691896257311e-01 
+0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,6.627346931058820667e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,7.488542759134465543e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 +0.000000000000000000e+00,5.773502691896257311e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,5.773502691896257311e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,5.773502691896257311e-01 +0.000000000000000000e+00,0.000000000000000000e+00,7.979673611645498044e-01,0.000000000000000000e+00,0.000000000000000000e+00,6.027006641078843652e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 +0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,1.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 +0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,1.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 
+0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,1.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 +0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,1.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 diff --git a/luigi/validation_processed.csv b/luigi/validation_processed.csv new file mode 100644 index 0000000..99e4a8c --- /dev/null +++ b/luigi/validation_processed.csv @@ -0,0 +1,25 @@ +,pre-tax amount,employee id_1,employee id_3,employee id_4,employee id_5,employee id_6,employee id_7,tax name_CA Sales tax,tax name_NY Sales tax,day_of_week_0,day_of_week_1,day_of_week_2,day_of_week_3,day_of_week_4,day_of_week_5,day_of_week_6,month_10,month_11,month_12,month_9,role_CEO,role_Engineer,role_IT and Admin,role_Sales +0,0.018045112781954885,0,0,0,0,0,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1 +1,0.09824561403508772,0,0,0,0,0,1,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1 +2,0.9999999999999999,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1 +3,0.11578947368421053,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0 +4,0.005513784461152882,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0 +5,0.02807017543859649,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0 +6,0.09824561403508772,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0 +7,0.4987468671679197,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0 +8,0.44862155388471175,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0 +9,0.023057644110275687,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0 +10,0.09824561403508772,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0 +11,0.023057644110275687,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1 +12,0.11328320802005012,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,1 +13,0.09824561403508772,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1 +14,0.09824561403508772,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1 +15,0.10325814536340852,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1 +16,0.08822055137844612,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1 +17,0.49924812030075183,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1 +18,0.013032581453634085,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1 +19,0.14837092731829574,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1 +20,0.09824561403508772,0,0,0,0,1,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1 +21,0.09824561403508772,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1 +22,0.7498746867167919,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1 +23,0.0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0 diff --git 
a/luigi/validation_processed_tfidf.csv b/luigi/validation_processed_tfidf.csv new file mode 100644 index 0000000..5cc3b0c --- /dev/null +++ b/luigi/validation_processed_tfidf.csv @@ -0,0 +1,12 @@ +0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,7.071067811865475727e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,7.071067811865475727e-01,0.000000000000000000e+00,0.000000000000000000e+00 +0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,1.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 +5.994129051629246696e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,5.304793476171201227e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,5.994129051629246696e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 +0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,1.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 
+0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 +0.000000000000000000e+00,5.524290705318358752e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,6.242149021472203074e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,5.524290705318358752e-01 +0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,6.627346931058820667e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,7.488542759134465543e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 +0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,1.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 +0.000000000000000000e+00,0.000000000000000000e+00,7.979673611645498044e-01,0.000000000000000000e+00,0.000000000000000000e+00,6.027006641078843652e-01,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00 
+0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,1.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00
+0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,1.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00,0.000000000000000000e+00

From 49be8bf20c4f5119eb0afce93b80e935a41a42e2 Mon Sep 17 00:00:00 2001
From: Asif Iqbal
Date: Wed, 30 May 2018 14:37:38 -0400
Subject: [PATCH 11/13] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 9dd55cc..03eae7d 100644
--- a/README.md
+++ b/README.md
@@ -93,7 +93,7 @@ brew install gcc@5
 ```
 - Now install the packages in [requirements.txt](https://github.com/asif31iqbal/ml-challenge-expenses/blob/master/requirements.txt)
 
-- Download the pretrained vector `glove.6B.zip` from [here](https://nlp.stanford.edu/projects/glove/) and extract off the `glove.6B.100d` file into the working directory.
+- Download the pretrained vector `glove.6B.zip` from [here](https://nlp.stanford.edu/projects/glove/) and extract off the `glove.6B.100d` file into the working directory. **Note that this file is not included in the repo since it's too big. You do need to download it.**
 
 At this point, you can run `jupyter notebook` and run the notebooks interactively.
You can also run the luigi solution by doing
 ```

From 02a49a1a124a2201bd128c916f05136fa8135e32 Mon Sep 17 00:00:00 2001
From: Asif Iqbal
Date: Wed, 30 May 2018 14:53:14 -0400
Subject: [PATCH 12/13] Update README.md

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index 03eae7d..2231e93 100644
--- a/README.md
+++ b/README.md
@@ -92,6 +92,8 @@ brew install gcc
 brew install gcc@5
 ```
 - Now install the packages in [requirements.txt](https://github.com/asif31iqbal/ml-challenge-expenses/blob/master/requirements.txt)
+For macOS Sierra or higher, you might need to do this for XGBoost:
+`env CC=gcc-5 CXX=g++-5 pip install xgboost`
 - Download the pretrained vector `glove.6B.zip` from [here](https://nlp.stanford.edu/projects/glove/) and extract off the `glove.6B.100d` file into the working directory. **Note that this file is not included in the repo since it's too big. You do need to download it.**
 
From a4169841e10d7743ee60e39a5d1baea06f3218d1 Mon Sep 17 00:00:00 2001
From: Asif Iqbal
Date: Wed, 30 May 2018 16:04:51 -0400
Subject: [PATCH 13/13] Update README.md

---
 README.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 2231e93..079448f 100644
--- a/README.md
+++ b/README.md
@@ -115,8 +115,8 @@ And a folder called [luigi](https://github.com/asif31iqbal/ml-challenge-expenses
 ### classification.ipynb
 This notebook tries to address the first question in the problem set - classifying the data points into the right category. **Please follow the cell by cell detailed guideline to walk through it**. Key summary points are:
-- Intuitively the **expense description** field seems to be the best determiner
-- Tried several **shallow learning** classifiers - Logistic Regression, Naive Bayes, SVM, Gradient Boosting (XGBoost) and Random Forest and attempted to take an ensemble of the ones that performed best
+- Intuitively the **expense description** field seems to be the best determiner, hinting that this problem is possibly a good candidate for **NLP**
+- Tried several **shallow learning** classifiers - Logistic Regression, Naive Bayes, SVM (with a linear kernel), Gradient Boosting (XGBoost) and Random Forest and attempted to take an ensemble of the ones that performed best (see the sketch below)
 - Feature engineering involved vectorizing text (**expense description** field) as TF-IDF vectors and using other fields like day_of_week, expense amount, employee ID and tax name. I didn't use n-grams here, but that's possibly an improvement option
 - Tried the classifiers in a two-fold approach - with all features and with only text features
 - Tried with Grid Search cross validation (used `LeaveOneOut` since the data is pretty small) and tuning several parameters like regularization factor, number of estimators etc
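+
+A minimal sketch of the text-only ensemble idea (hypothetical variable names, not the exact notebook code; the notebook tunes each member with grid search first):
+
+```
+from sklearn.ensemble import RandomForestClassifier, VotingClassifier
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.linear_model import LogisticRegression
+from sklearn.pipeline import make_pipeline
+from sklearn.svm import SVC
+
+# soft voting averages probabilities, so every member must expose predict_proba
+ensemble = VotingClassifier([
+    ('rf', RandomForestClassifier(n_estimators=100, random_state=0)),
+    ('lr', LogisticRegression()),
+    ('svm', SVC(kernel='linear', probability=True)),
+], voting='soft')
+
+# tf-idf on the expense description is the only feature used here
+model = make_pipeline(TfidfVectorizer(), ensemble)
+model.fit(df_train['expense description'], df_train['category'])
+print(model.score(df_val['expense description'], df_val['category']))
+```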
@@ -180,4 +180,4 @@ The solution is nowhere close to production-ready and can be further modularized
 
 I have tried different approaches of addressing the solutions. Overall performance was fairly good, however, the datasets were very small and nothing I experimented here is conclusive. I have learnt from my experience and real work that models that perform well on small data are not necessarily best for tackling real big data. Also, there is a myriad of ways of improving things.
 
-I am not particularly proud of anything, however, I am glad that I went through the exercise and learnt a lot in the process. I hope I get a chance to present myself in person and discuss in more detail.
+I am not particularly proud of anything, however, I am glad that I went through the exercise and learnt a lot in the process. It was particularly interesting to find that the **text features** alone seemed to solve the problems to a great extent. However, with better feature engineering and by combining those features with the text features, better results are surely possible. I hope I get a chance to present myself in person and discuss in more detail.