ZenML

This repo contains Python libraries to aid in data science and machine learning tasks. The TensorFlow deep learning modules are currently being ported over to PyTorch.

Install

pip install git+https://github.com/bobbywlindsey/zenml

Test

To test, run pytest from the root directory.

Usage

To import the deeplearning library along with other commonly used objects and configuration, use the following:

from deeplearning.imports_and_configs import *
import deeplearning as dl
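
Later examples in this README use names like train_test_data and Model without an explicit import; presumably they are pulled in by this wildcard import.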

Preprocessing

You can preprocess your pandas dataframe in a pipeline fashion:

import pandas as pd

from zenml.preprocessing import strip_whitespace, add_suffix, replace_nan_with_string

df = pd.read_csv('some_file.csv')

df.name = strip_whitespace(df.name)
df.name = add_suffix('_suffix', df.name)
df.name = replace_nan_with_string('empty', df.name)
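
Since each of these functions takes a Series and returns one, the same steps can also be chained with pandas' pipe; a minimal sketch, assuming the signatures shown above:

from functools import partial

df.name = (df.name
           .pipe(strip_whitespace)
           .pipe(partial(add_suffix, '_suffix'))
           .pipe(partial(replace_nan_with_string, 'empty')))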

Hypothesis Testing

Only two hypothesis tests are available at the moment: the t-test (Student's, Welch's, and paired) and Fisher's exact test.

As an example of a t-test, say you're looking at the Boston Housing dataset.

import pandas as pd
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; requires an older version

boston = load_boston()
boston_df = pd.DataFrame(boston.data)
boston_df.columns = boston.feature_names
boston_df['PRICE'] = boston.target
boston_df.head()

Your hypothesis is that house prices in areas with a lower student-teacher ratio are greater than house prices in areas with a higher student-teacher ratio.

from zenml.hypothesis_tests import t_test

low_st_ratio = boston_df[boston_df.PTRATIO < 18].PRICE
low_st_ratio.reset_index(drop=True, inplace=True)
high_st_ratio = boston_df[boston_df.PTRATIO > 18].PRICE

alt_hypothesis = '>'
no_intervention_data = high_st_ratio
intervention_data = low_st_ratio

test_results = t_test(alt_hypothesis, intervention_data, no_intervention_data)
print(test_results)
{'Sample Size 1': 185, 'Sample Size 2': 316, 'Alt. Hypothesis': '>', 'Test Name': 'Welchs t-test', 'p-value': 5.19826947410151e-23, 'Test Statistic': 10.642892435040595, 'Effect Size': 1.0606056235342178, 'Power': 0.8641386288870567}
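
A p-value this small means you would reject the null hypothesis at any conventional significance level; the effect size of about 1.06 (presumably Cohen's d) would also be considered large.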

Or say you want to know if there's a significant difference in clicks between site A and site B.

from zenml.hypothesis_tests import fisher

site_a = pd.Series([0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0])
site_b = pd.Series([0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1])
no_intervention_data = site_a
intervention_data = site_b
alt_hypothesis = '!='

test_results = fisher(alt_hypothesis, intervention_data, no_intervention_data)
print(test_results)
{'Sample Size 1': 15, 'Sample Size 2': 17, 'Alt. Hypothesis': '!=', 'Test Name': 'Fishers Exact Test', 'p-value': 0.03195238826540317, 'Effect Size': 0.7951807228915662, 'Power': None}
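
Here the p-value of about 0.032 falls below the conventional 0.05 threshold, so you would conclude there is a significant difference in clicks between the two sites.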

Feature Engineering

If you'd like to do a binary comparison between two variables and use the result as a new variable:

from zenml.features import variable_match

match = variable_match(df.variable_1, df.variable_2)
df['variable_match'] = match
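
For reference, if variable_match is a simple element-wise equality check (an assumption), the equivalent in plain pandas would be:

df['variable_match'] = (df.variable_1 == df.variable_2).astype(int)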

Or you could calculate the cosine similarity between two text variables:

from zenml.features import cosine_similarity

cosine_sim_variable = cosine_similarity(df.variable_1, df.variable_2)
df['variable_cosine_sim'] = cosine_sim_variable

To create an n-gram feature for a variable:

from zenml.features import ngram_tf, ngram_idf_sum

# build a term-frequency table of 2-grams; the two floats are presumably
# document-frequency cutoffs
ngram_tf_df = ngram_tf(2, .0025, .5, [df.variable])
bigram_idf_sum_variable = ngram_idf_sum(df.variable, ngram_tf_df, 2)

If you have a text field, you can use text embeddings such as a continuous bag-of-words model:

from zenml.features import word_embedding, cosine_similarity_text_embedding

# fine-tune a continuous bag-of-words model
ngram = 3
min_word_count = 10
workers = 20
epochs = list(range(2,3))
model_type = 'cbow'
hidden_layer_size = 300
initial_learning_rate = .9

cbow_model = word_embedding([df.variable_1, df.variable_2], ngram, min_word_count, epochs,
                             initial_learning_rate, workers, model_type=model_type)

# now calculate cosine similarity between the learned vector representations of the words
variable_cos_sim_cbow = cosine_similarity_text_embedding([df.variable_1, df.variable_2], cbow_model)
df['variable_cos_sim_cbow'] = variable_cos_sim_cbow
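
For comparison, if word_embedding wraps gensim's Word2Vec (an assumption; the parameter mapping below is a guess), a roughly equivalent model in plain gensim would be:

from gensim.models import Word2Vec

# sg=0 selects CBOW, sg=1 selects skip-gram; window is assumed to map from ngram
sentences = [str(text).split() for text in pd.concat([df.variable_1, df.variable_2])]
cbow = Word2Vec(sentences, vector_size=300, window=3, min_count=10,
                workers=20, alpha=0.9, sg=0, epochs=2)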

Or a skip-gram model:

from zenml.features import word_embedding, cosine_similarity_text_embedding

# fine-tune a skip-gram model
ngram = 3
min_word_count = 10
workers = 20
epochs = list(range(2,3))
model_type = 'skipgram'
hidden_layer_size = 300
initial_learning_rate = .9

sg_model = word_embedding([df.variable_1, df.variable_2], ngram, min_word_count, epochs,
                           initial_learning_rate, workers, model_type=model_type)

# now calculate cosine similarity between the learned vector representations of the words
variable_cos_sim_skipgram = cosine_similarity_text_embedding([df.variable_1, df.variable_2], sg_model)
df['variable_cos_sim_skipgram'] = variable_cos_sim_skipgram

Deep Neural Net

Split your training data

train, test, train_labels, test_labels, classes = train_test_data(df, 'label')

Train a deep neural network with two hidden layers, with 25 neurons in the first layer and 12 in the second

train_neural_net = dl.deep_neural_net(train, train_labels, test, test_labels,
                                      [25, 12], num_epochs=1500)

Then test it

dl.test_deep_neural_net(train_neural_net, test, test_labels)

Transfer Learning on ImageNet Winning Conv Nets

Below are the ImageNet 1-crop error rates (224x224) for winning convolutional neural nets. I've implemented all but the SqueezeNet models in a transfer learning context.

Network                           Top-1 error   Top-5 error
AlexNet                           43.45         20.91
VGG-11                            30.98         11.37
VGG-13                            30.07         10.75
VGG-16                            28.41         9.62
VGG-19                            27.62         9.12
VGG-11 with batch normalization   29.62         10.19
VGG-13 with batch normalization   28.45         9.63
VGG-16 with batch normalization   26.63         8.50
VGG-19 with batch normalization   25.76         8.15
ResNet-18                         30.24         10.92
ResNet-34                         26.70         8.58
ResNet-50                         23.85         7.13
ResNet-101                        22.63         6.44
ResNet-152                        21.69         5.94
SqueezeNet 1.0                    41.90         19.58
SqueezeNet 1.1                    41.81         19.38
Densenet-121                      25.35         7.83
Densenet-169                      24.00         7.00
Densenet-201                      22.80         6.43
Densenet-161                      22.35         6.20
Inception v3                      22.55         6.44

First, structure the data as follows:

data/
  - train/
      - class_1/
          - img1.png
          - img2.png
      - class_2/
      ...
      - class_n/
  - dev/
      - class_1/
      - class_2/
      ...
      - class_n/
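
A minimal sketch for creating this layout with pathlib (the class names here are placeholders):

from pathlib import Path

classes = ['class_1', 'class_2', 'class_n']  # replace with your class names
for split in ('train', 'dev'):
    for class_name in classes:
        Path('data', split, class_name).mkdir(parents=True, exist_ok=True)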

Show transformed images

data_directory = 'image_directory'
# transform rules for 224x224 input images
data_transforms = dl.get_image_transform_rules(224)
# preview 5 transformed images
dl.show_transformed_images(data_transforms, 5)
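
The 224 here matches the 224x224 crop size used for the error rates in the table above.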

Choose the model you'd like to use from the following list:

models_to_choose_from = ['resnet18', 'resnet34', 'resnet50', 'resnet101', 'resnet152', 'densenet121', 'densenet161', 'densenet169', 'densenet201', 'vgg11', 'vgg13', 'vgg16', 'vgg19']
model_name = 'vgg19'

Specify the parameters of the model

freeze_params = {'freeze_all_layers': True, 'freeze_first_n_layrs': 0}
# decay the learning rate every step_size epochs by a factor of gamma
optimizer_params = {'step_size': 7, 'gamma': 0.1}
learning_rate_params = {'learning_rate': 0.001, 'momentum': 0.9}
num_epochs = 25
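
With 'freeze_all_layers' set to True, presumably only the newly attached classifier head is trained while the pretrained weights stay fixed, which is the standard transfer-learning setup.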

Train the model

dl.transfer_learn(model_name, data_directory, freeze_params, optimizer_params,
                  learning_rate_params, num_epochs)

More Traditional Models

Random Forest

# split your data
train, test, train_labels, test_labels, classes = train_test_data(df, 'label',
                                                                  train_size=0.8,
                                                                  random_state=52)

# some hyperparameters to try
num_estimators = list(range(5, 10))
min_samples_leaves = list(range(1, 5))
max_depths = list(range(9, 10))

# create the parameter grid for a random forest and grid-search for the best parameters
model = Model()
param_grid = model.create_random_forest_param_grid(num_estimators, max_depths,
                                                   min_samples_leaves, num_workers=42)

# autotune the model
rf = model.random_forest(train, test, train_labels, test_labels, param_grid)
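
For reference, the equivalent search in plain scikit-learn (a sketch, assuming the grid maps onto RandomForestClassifier's n_estimators, max_depth, and min_samples_leaf):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': num_estimators,
    'max_depth': max_depths,
    'min_samples_leaf': min_samples_leaves,
}
search = GridSearchCV(RandomForestClassifier(random_state=52), param_grid, n_jobs=42)
search.fit(train, train_labels)
print(search.best_params_, search.best_score_)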
