data_science_task

This repository solves a cell nuclei segmentation and cell nuclei counting task.

Two approaches were developed.

Convolutional Neural Network
Random Forest Pixel Classifier

Issues

The fine-tuned model file is about 138 MB, which exceeds GitHub's file size limit, so it is provided via a download link (see Getting Started)
GPU support has not been tested

Getting Started

git clone https://github.com/melaniedohmen/data_science_task

Unpack data/data.zip into data/

conda env create -f environment.yml
conda activate data_science_task

Download fine-tuned CNN model here:

https://www.dropbox.com/s/im995cp0gkfyi48/best_test_train_0_1_NoColorAug.zip?dl=0

and unpack into models/best_test_train_0_1_NoColorAug.pytorch

Then run

python segment.py

To reproduce the CNN training with the selected parameters (documented in train_NN.py):

 # call train routine for CNN model (see parameters)
 train(CNN_train_params)
 
 # get model name:
 CNN_test_params["model_name"] = CNN_train_params["experiment_name"]
 
 # run test routine:
 eval_dict_NN, eval_dict_pp_NN = test(CNN_test_params)
 
 # plot results with and without post-processing:
 plot_evaluation_results(eval_dict_NN, "Evaluation of newly trained CNN model")
 plot_evaluation_results(eval_dict_pp_NN, "Evaluation of newly trained CNN model after pp")

To run a prediction with the fine-tuned CNN model (parameters documented in test_NN.py):

 # call test routine for CNN model
 # evaluation metrics are returned as dictionaries
 eval_dict_ftNN, eval_dict_pp_ftNN = test(CNN_test_params)
 
 # plot evaluation metrics for unprocessed predictions:
 plot_evaluation_results(eval_dict_ftNN, "Evaluation of ft-CNN model")
 
 # plot evaluation metrics after post-processing predictions:
 plot_evaluation_results(eval_dict_pp_ftNN, "Evaluation of ft-CNN model after pp")

To train a random forest classifier (parameters documented in rf_pixel_classifier.py):

 # call train and test routine for random forest classifier 
 eval_dict_rfc, eval_dict_pp_rfc = train_and_test(rfc_params)
 
 # plot results with and without post-processing:
 plot_evaluation_results(eval_dict_rfc, "Evaluation of RFC Model")
 plot_evaluation_results(eval_dict_pp_rfc, "Evaluation of RFC Model after pp")

Data

Two folders of data are available:

data/tissue_images with 28 HE-stained images (512x512 px, TIFF format)
data/binary mask with 28 binary images (512x512 px, PNG format)

Ten human tissue types are available (number of samples in parentheses):

AdrenalGland (3)
Larynx (3)
LymphNodes (2)
Mediastinum (3)
Pancreas (3)
Pleura (3)
Skin (3)
Testis (3)
Thymus (3)
ThyroidGland (2)

To test generalization to unseen tissue, all samples of AdrenalGland (a complete tissue type) were selected as the test set and excluded from training.

By default, Larynx is selected as the validation set.

To better compare models based on the small dataset, a cross-validation approach was implemented (for the random forest classifier).
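
A tissue-wise split of this kind can be sketched as follows. The sample counts are taken from the list above; the leave-one-tissue-out fold scheme and the function name are assumptions made for illustration and are not taken from cross_validate_rfc.py.

```python
# Sample counts per tissue type, as listed above.
TISSUE_SAMPLES = {
    "AdrenalGland": 3, "Larynx": 3, "LymphNodes": 2, "Mediastinum": 3,
    "Pancreas": 3, "Pleura": 3, "Skin": 3, "Testis": 3, "Thymus": 3,
    "ThyroidGland": 2,
}


def leave_one_tissue_out(tissues, test_tissue="AdrenalGland"):
    """Yield (train_tissues, val_tissue) folds; the test tissue is never used."""
    candidates = [t for t in tissues if t != test_tissue]
    for val_tissue in candidates:
        train_tissues = [t for t in candidates if t != val_tissue]
        yield train_tissues, val_tissue


folds = list(leave_one_tissue_out(TISSUE_SAMPLES))
for train_tissues, val_tissue in folds:
    n_train = sum(TISSUE_SAMPLES[t] for t in train_tissues)
    n_val = TISSUE_SAMPLES[val_tissue]
    print(f"val={val_tissue:<13} train images={n_train} val images={n_val}")
```

Splitting by tissue type rather than by image keeps every image of the held-out tissue out of training, which matches how the AdrenalGland test set is handled.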

Evaluation

The following evaluation metrics were implemented:

Accuracy
Average Precision for IoU threshold 0.5
Mean Average Precision (mAP) over IoU thresholds from 0.5 to 1.0 in steps of 0.05
Absolute Percentage Error (on cell nuclei counts)
Intersection over Union
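
As a minimal sketch of three of these metrics (pixel accuracy, Intersection over Union, and absolute percentage error on counts), assuming binary NumPy masks: the function names and the handling of two empty masks are assumptions and may differ from evaluate.py.

```python
import numpy as np


def pixel_accuracy(pred, target):
    """Fraction of pixels where prediction and ground truth agree."""
    pred, target = np.asarray(pred, bool), np.asarray(target, bool)
    return float((pred == target).mean())


def iou(pred, target):
    """Intersection over Union of two binary masks."""
    pred, target = np.asarray(pred, bool), np.asarray(target, bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return float(np.logical_and(pred, target).sum() / union)


def absolute_percentage_error(pred_count, true_count):
    """Absolute error of a nuclei count, in percent of the true count."""
    return 100.0 * abs(pred_count - true_count) / true_count


# Toy example: a 2x2 ground-truth square and a 2x3 prediction covering it.
target = np.zeros((4, 4), dtype=bool)
target[1:3, 1:3] = True
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 1:4] = True

print(pixel_accuracy(pred, target), iou(pred, target))
print(absolute_percentage_error(pred_count=9, true_count=10))
```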

Pre-processing

HE Stain Normalization is performed according to

https://github.com/schaugf/HEnorm_python/blob/master/normalizeStaining.py

[1] A method for normalizing histology slides for quantitative analysis, M Macenko, M Niethammer, JS Marron, D Borland, JT Woosley, G Xiaojun, C Schmitt, NE Thomas, IEEE ISBI, 2009. dx.doi.org/10.1109/ISBI.2009.5193250

Data augmentation was performed for CNN training, including:

horizontal/vertical flip
rotation by 0, 90, 180 or 270 degrees
random crop
color augmentation (implemented, but did not yield better results)
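
The geometric augmentations above can be sketched with NumPy as follows. The key point is that the image and its mask receive the same transform. The flip probability of 0.5, the crop size of 256 and the function name are assumptions; the repository's own implementation may differ.

```python
import numpy as np


def augment(image, mask, crop_size, rng):
    """Apply the same random flips, 90-degree rotation and crop to image and mask."""
    if rng.random() < 0.5:  # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:  # vertical flip
        image, mask = image[::-1], mask[::-1]
    k = int(rng.integers(0, 4))  # rotate by k * 90 degrees
    image = np.rot90(image, k, axes=(0, 1))
    mask = np.rot90(mask, k, axes=(0, 1))
    height, width = mask.shape
    top = int(rng.integers(0, height - crop_size + 1))
    left = int(rng.integers(0, width - crop_size + 1))
    window = (slice(top, top + crop_size), slice(left, left + crop_size))
    return np.ascontiguousarray(image[window]), np.ascontiguousarray(mask[window])


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(512, 512, 3), dtype=np.uint8)
mask = image[..., 0] > 127  # toy mask derived from the image, so alignment can be checked
aug_image, aug_mask = augment(image, mask, crop_size=256, rng=rng)
print(aug_image.shape, aug_mask.shape)
```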

Post-processing

The following post-processing steps are applied by default, in this order (see postprocessing.py):

fill holes
filter out all segments with area < 11 px
separate touching nuclei (watershed on distance transform)
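
The three steps can be sketched with SciPy and scikit-image as shown below. This is an illustration under assumptions, not the code in postprocessing.py: the function name, the `min_distance` seed spacing and the use of `peak_local_max` to place watershed markers are choices made for this example.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed


def postprocess(mask, min_area=11, min_distance=5):
    """Fill holes, drop segments smaller than min_area, then split touching nuclei."""
    mask = ndi.binary_fill_holes(np.asarray(mask, bool))

    # Remove connected components with an area below min_area pixels.
    components, n_components = ndi.label(mask)
    areas = np.bincount(components.ravel(), minlength=n_components + 1)
    keep = areas >= min_area
    keep[0] = False  # label 0 is background
    mask = keep[components]

    # Watershed on the negated distance transform, seeded at its local maxima.
    distance = ndi.distance_transform_edt(mask)
    peaks = peak_local_max(distance, min_distance=min_distance, exclude_border=False)
    markers = np.zeros(mask.shape, dtype=np.int32)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    return watershed(-distance, markers, mask=mask)


# Toy mask: two overlapping discs (one with a hole) and a 6-pixel speck.
yy, xx = np.mgrid[:60, :80]
mask = ((yy - 30) ** 2 + (xx - 28) ** 2 <= 14 ** 2) | ((yy - 30) ** 2 + (xx - 52) ** 2 <= 14 ** 2)
mask[30, 28] = False
mask[2:4, 2:5] = True

labels = postprocess(mask)
n_nuclei = int(labels.max())
print("nuclei found:", n_nuclei)
```

The order matters in this sketch: a hole left inside a nucleus would distort the distance transform and could produce spurious watershed seeds.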

CNN model

A pretrained fully convolutional network (FCN) with a ResNet-50 backbone is used, from:

https://pytorch.org/hub/pytorch_vision_fcn_resnet101/

The output layers (classifier and aux_classifier) were modified to have only 1 output channel for foreground/background segmentation.

Random Forest Classifier

See https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
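
As a sketch of pixel classification with a random forest: each pixel becomes one sample, and the forest predicts foreground or background per pixel. Using the raw RGB values as the only features, the synthetic toy images and the helper names are assumptions for illustration; the features used in rf_pixel_classifier.py are not described here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def pixels_to_features(image):
    """Flatten an (H, W, 3) image into one RGB feature row per pixel."""
    return image.reshape(-1, image.shape[-1]).astype(np.float32)


def toy_image(rng, size=32):
    """Synthetic image: dark bluish 'nuclei' pixels on a pink background, plus noise."""
    mask = rng.random((size, size)) < 0.3
    image = np.where(mask[..., None], [90.0, 60.0, 140.0], [230.0, 180.0, 210.0])
    image = image + rng.normal(0.0, 10.0, image.shape)
    return np.clip(image, 0, 255).astype(np.uint8), mask


rng = np.random.default_rng(0)
train_image, train_mask = toy_image(rng)
test_image, test_mask = toy_image(rng)

classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(pixels_to_features(train_image), train_mask.ravel())

pred_mask = classifier.predict(pixels_to_features(test_image)).reshape(test_mask.shape)
accuracy = float((pred_mask == test_mask).mean())
print(f"pixel accuracy on the toy image: {accuracy:.3f}")
```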
