Skip to content

Object Detection using Fast R CNN

Philipp Kranen edited this page Oct 25, 2016 · 2 revisions

Table of Contents

Summary

image image

The above are examples images and object annotations for the grocery data set (left) and the Pascal VOC data set (right) used in this tutorial.

Fast R-CNN is an object detection algorithm proposed by Ross Girshick in 2015. The paper is accepted to ICCV 2015, and archived at https://arxiv.org/abs/1504.08083. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs a region of interest pooling scheme that allows to reuse the computations from the convolutional layers.

The following are the main resources for CNTK Fast R-CNN:

Recipe Brain Script config file.
pre-trained models Download pre-trained Fast R-CNN models for the grocery data or the Pascal VOC data set.
Data Example data (food items in a fridge) and Pascal VOC data.
How to run Follow the description below.

Additional material: a detailed tutorial for object detection using CNTK Fast R-CNN (including optional SVM training and publishing the trained model as a Rest API) can be found here.

Setup

Currently, CNTK only supports Python 3.4. We recommend to install anaconda python (http://continuum.io/downloads) and create a python 3.4 environment using

conda create --name cntk python=3.4.3 numpy scipy
activate cntk

To run the code in this example, you need to install a few additional packages. Under Python 3.4 (64bit version assumed), go to the FastRCNN folder and run:

pip install -r requirements.txt

Known issue: to install scikit-learn you might have to run conda install scikit-learn. You will further need Scikit-Image and OpenCV to run these examples (and possibly numpy and scipy if your Python 3.4 package does not come with them). You need to download the corresponding wheel packages and install them manually. For Windows users, visit http://www.lfd.uci.edu/~gohlke/pythonlibs/, and download:

  • scikit_image-0.12.3-cp34-cp34m-win_amd64.whl
  • opencv_python-3.1.0-cp34-cp34m-win_amd64.whl

Once you download the respective wheel binaries, install them with:

pip install your_download_folder/scikit_image-0.12.3-cp34-cp34m-win_amd64.whl
pip install your_download_folder/opencv_python-3.1.0-cp34-cp34m-win_amd64.whl

This tutorial code assumes you are using 64bit version of Python 3.4, as the Fast R-CNN DLL files under utils_win64 are prebuilt for this version. If your task requires the use of a different Python version, please recompile these DLL files yourself in the correct environment.

The tutarial further assumes that the folder where cntk.exe resides is in your PATH environment variable.

Example data and baseline model

We use a pre-trained AlexNet model as the basis for Fast-R-CNN training. Both the example dataset and the pre-trained AlexNet model can be downloaded by running the following Python command from the FastRCNN folder:

python install_fastrcnn.py

Run the toy example

In the toy example we train a CNTK Fast R-CNN model to detect grocery items in a refrigirator. All required scripts are in <cntkroot>/Examples/Image/Detection/FastRCNN.

Quick guide

To run the toy example, make sure that in PARAMETERS.py the datasetName is set to "grocery".

  • Run A1_GenerateInputROIs.py to generate the input ROIs for training and testing.
  • Run A2_RunCntk.py to train a Fast R-CNN model and compute test results.
    • Alternatively you can download the pre-trained CNTK Fast R-CNN model for the grocery example here. (Details)
  • Run A3_ParseAndEvaluateOutput.py to compute the mAP (mean average precision) of the trained model.

The ouput from script A3 should contain the following:

Evaluating detections
AP for         avocado = 1.0000
AP for          orange = 0.2500
AP for          butter = 1.0000
AP for       champagne = 1.0000
AP for          eggBox = 0.7500
AP for          gerkin = 1.0000
AP for         joghurt = 1.0000
AP for         ketchup = 1.0000
AP for     orangeJuice = 1.0000
AP for           onion = 1.0000
AP for          pepper = 0.7600
AP for          tomato = 0.6400
AP for           water = 0.5000
AP for            milk = 1.0000
AP for         tabasco = 1.0000
AP for         mustard = 1.0000
Mean AP = 0.8688
DONE.

To visualize the bounding boxes and predicted labels you can run B3_VisualizeOutputROIs.py (click on the images to enlarge):

image image image image image

Step details

A1: The script first A1_GenerateInputROIs.py generates ROI candidates for each image using selective search. It then stores them in a CNTK Text Format as input for cntk.exe. Additionally the required CNTK input files for the images and the ground truth labels are generated. The script generates the following folders and files under the FastRCNN folder:

  • proc - root folder for generated content.
    • grocery_2000 - contains all generated folders and files for the grocery example using 2000 ROIs. If you run again with a different number of ROIs the folder name will change correspondingly.
      • rois - contains the raw ROI coordinates for each image stored in text files.
      • cntkFiles - contains the formatted CNTK input files for images (train.txt and test.txt), ROI coordinates (xx.rois.txt) and ROI labels (xx.roilabels.txt) for train and test. (Format details are provided below.)

All parameters are contained in PARAMETERS.py, for example cntk_nrRois = 2000 to set the number of ROIs used for training and testing. We describe parameters in the section Parameters below.

A2: The script A2_RunCntk.py runs cntk using cntk.exe and a brain script config file (configuration details). A script that uses the new CNTK Python API for Fast R-CNN training will be added soon. The trained model is stored in the folder cntkFiles/Output of the corresponding proc sub-folder. The trained model is tested seperately on both the training set and the test set. During testing for each image and each corresponding ROI a label is predicted and stored in the files test.z and train.z in the cntkFiles folder.

A3: The evaluation step parses the CNTK output and computes the mAP comparing the predicted results with the ground truth annoations. Non maximum suppression is used to merge overlapping ROIs. You can set the threshold for non maximum suppresion in PARAMETERS.py (details).

Using a pre-trained model

Download links for pre-trained models are provided at the top of this page. Store the model in the cntkFiles/Output folder under the corresponding proc sub-folder, for example proc/grocery_2000/cntkFiles/Output for the toy example. Note: if you are using a pre-trained model you still need to run step A2 to compute the predicted labels, i.e. CNTK will skip the training and only run the testing.

Additional scripts

There are three optional scripts you can run to visualize and analyze the data:

  • B1_VisualizeInputROIs.py visualizes the candidate input ROIs.
  • B2_EvaluateInputROIs.py computes the recall of the ground truth ROIs with respect to the candidate ROIs.
  • B3_VisualizeOutputROIs.py visualize the bounding boxes and predicted labels.

Run Pascal VOC

The Pascal VOC (PASCAL Visual Object Classes) data is a well known set of standardised images for object class recognition. Training or testing CNTK Fast R-CNN on the Pascal VOC data requires a GPU with at least 4GB of RAM. Alternatively you can run using the CPU, which will however take some time. In this case we strongly recommend to download the pre-trained model (see Using a pre-trained model).

Getting the Pascal VOC data

You need the 2007 (trainval and test) and 2012 (trainval) data as well as the precomputed ROIs used in the original paper. You need to follow the folder structure described below. The scripts assume that the Pascal data resides in <cntkroot>/Examples/Image/DataSets/Pascal. If you are using a different folder please set pascalDataDir in PARAMETERS.py correspondingly.

The VOCdevkit2007 folder should look like this (similar for 2012):

VOCdevkit2007/VOC2007
VOCdevkit2007/VOC2007/Annotations
VOCdevkit2007/VOC2007/ImageSets
VOCdevkit2007/VOC2007/JPEGImages

Running CNTK on Pascal VOC

To run on the Pascal VOC data make sure that in PARAMETERS.py the datasetName is set to "pascal".

  • Run A1_GenerateInputROIs.py to generate the CTNK formatted input files for training and testing from the downloaded ROI data.
  • Run A2_RunCntk.py to train a Fast R-CNN model and compute test results.
    • If you downloaded the pre-trained model you still need to run step A2 to compute the predicted labels. To decrease the required time you can skip computing the predictions for the training data by setting command = Train:WriteTest (i.e. reomving WriteTrain) in the fastrcnn.cntk file.
  • Run A3_ParseAndEvaluateOutput.py to compute the mAP (mean average precision) of the trained model.
    • Please note that this is work in progress and the results are preliminary as we are training new baseline models.
    • Please make sure to have the latest version from CNTK master for the files fastRCNN/pascal_voc.py and fastRCNN/voc_eval.py to avoid encoding errors.

Train on your own data

To train a CNTK Fast R-CNN model on your own data set we provide two scripts to annotate rectangular regions on images and assign labels to these regions. The scripts will store the annotations in the correct format as required by the first step of running Fast R-CNN (A1_GenerateInputROIs.py). First, store your images in the following folder structure

  • <your_image_folder>/negative - images used for training that don't contain any objects
  • <your_image_folder>/positive - images used for training that do contain objects
  • <your_image_folder>/testImages - images used for testing that do contain objects

For the negative images you do not need to create any annotations. For the other two folders use the proveded scripts:

  • Run C1_DrawBboxesOnImages.py to draw bounding boxes on the images.
    • In the script set imgDir = <your_image_folder> before running.
    • Add annotations using the mouse cursor. Once all objects in an image are annotated, pressing key 'n' writes the .bboxes.txt file and then proceeds to the next image, 'u' undoes (i.e. removes) the last rectangle, and 'q' quits the annotation tool.
  • Run C2_AssignLabelsToBboxes.py to assign labels to the bounding boxes.
    • In the script set imgDir = <your_image_folder> before running...
    • ... and adapt the classes in the script to reflect your object categories, for example classes = ("dog", "cat", "octopus").
    • The script loads these manually annotated rectangles for each image, displays them one-by-one, and asks the user to provide the object class by clicking on the respective button to the left of the window. Ground truth annotations marked as either "undecided" or "exclude" are fully excluded from further processing.

Before running CNTK Fast R-CNN using scripts A1-A3 you need to add your data set to PARAMETERS.py:

  • Set the data set name to a new name datasetName = "myOwnImages"
  • For the # project-specific parameters add a new for myOwnImages. You can start by copying the section from grocery
    • Adapt the classes to reflect your object categories. Following the above example this would look like classes = ('__background__', 'dog', 'cat', 'octopus').
    • Set imgDir = <your_image_folder>.
    • Optionally you can adjust more parametes, e.g. for ROI generation and pruning (see Parameters section).

Ready to train on your own data! (Use the same steps as for the toy example.)

Technical details

Parameters

The main parameters in PARAMETERS.py are

  • datasetName - which data set to use
  • cntk_numRois - how many ROIs to use for training and testing
  • nmsThreshold - Non Maximum suppression threshold (in range [0,1]). The lower the more ROIs will be combined. It used for both evalutation and visualization.

All parameters for ROI generation, such as minimum and maximum width and heigth etc., are described in PARAMETERS.py next to the parameters themselves. They are all set to a default value which is reasonable. You can overwrite them in the # project-specific parameter section corresponding to the data set you are using.

CNTK configuration

The CNTK brain script configuration file that is used to train and test Fast R-CNN is fastrcnn.cntk. The part that is constructing the network is the BrainScriptNetworkBuilder section in the Train command:

BrainScriptNetworkBuilder = {
    network     = BS.Network.Load ("../../../../../pre-trainedModels/AlexNet.model")
    convLayers  = BS.Network.CloneFunction(network.features, network.conv5_y, parameters = "constant")
    fcLayers    = BS.Network.CloneFunction(network.pool3, network.h2_d)

    model (features, rois) = {
        featNorm = features - 114
        convOut  = convLayers (featNorm)
        roiOut   = ROIPooling (convOut, rois, (6:6))
        fcOut    = fcLayers (roiOut)
        W        = ParameterTensor{($NumLabels$:4096), init="glorotUniform"}
        b        = ParameterTensor{$NumLabels$, init = 'zero'}
        z        = W * fcOut + b
    }.z

    imageShape = $ImageH$:$ImageW$:$ImageC$         # 1000:1000:3
    labelShape = $NumLabels$:$NumTrainROIs$         # 21:64
    ROIShape   = 4:$NumTrainROIs$                   # 4:64

    features = Input {imageShape}
    roiLabels = Input {labelShape}
    rois = Input {ROIShape}

    z = model (features, rois)
    
    ce = CrossEntropyWithSoftmax(roiLabels, z, axis = 1)
    errs = ClassificationError(roiLabels, z, axis = 1)
    
    featureNodes    = (features:rois)
    labelNodes      = (roiLabels)
    criterionNodes  = (ce)
    evaluationNodes = (errs)
    outputNodes     = (z)
}

In the first line the pre-trained AlexNet is loaded as the base model. Next two parts of the network are cloned: convLayers contains the convolutional layers with constant weights, i.e. they are not trained further. fcLayers contains the fully connected layers with the pre-trained weights, which will be trained further. The node names network.features, network.conv5_y etc. can be derived from looking at the log output of the cntk.exe call (contained in the log output of the A2_RunCntk.py script).

The model definition(model (features, rois) = ...) first normalizes the features by subtracting 114 for each channel and pixel. Then the normalized features are pushed through the convLayers followed by the ROIPooling and finally the fcLayers. The output shape (width:height) of the ROI pooling layer is set to (6:6) since this is the shape nd size that the pre-trained fcLayers from the AlexNet model expect. The output of the fcLayers is fed into a dense layer that predicts one value per label (NumLabels) for each ROI.

The following six lines define the input: an image of size 1000 x 1000 x 3 ($ImageH$:$ImageW$:$ImageC$), ground truth labels for each ROI ($NumLabels$:$NumTrainROIs$) and four coordinates per ROI (4:$NumTrainROIs$) coresponding to (x, y, w, h), all relative with respect to the full width and height of the image.

z = model (features, rois) feeds the input images and rois into the defined network model and assigns the output to z. Both the criterion (CrossEntropyWithSoftmax) and the error (ClassificationError) are specified with axis = 1 to account for the prediction error per ROI.

The reader section of the CNTK configuration is listed below. It uses three deserializers:

  • ImageDeserializer to read the image data. It picks up the image file names from train.txt, scales the image to the desired width and height while preserving the aspect ratio (padding empty areas with 114) and transposes the tensor to have the correct input shape.
  • One CNTKTextFormatDeserializer to read the ROI coordinates from train.rois.txt.
  • A second CNTKTextFormatDeserializer to read the ROI labels from train.roislabels.txt.

The input file formats are described in the next section.

reader = {
    randomize = false
    verbosity = 2
    deserializers = ({
        type = "ImageDeserializer" ; module = "ImageReader"
        file = train.txt
        input = {
            features = { transforms = (
                { type = "Scale" ; width = $ImageW$ ; height = $ImageW$ ; channels = $ImageC$ ; scaleMode = "pad" ; padValue = 114 }:
                { type = "Transpose" }
            )}
            ignored = {labelDim = 1000}
        }
    }:{
        type = "CNTKTextFormatDeserializer" ; module = "CNTKTextFormatReader"
        file = train.rois.txt
        input = { rois = { dim = $TrainROIDim$ ; format = "dense" } }
    }:{
        type = "CNTKTextFormatDeserializer" ; module = "CNTKTextFormatReader"
        file = train.roilabels.txt
        input = { roiLabels = { dim = $TrainROILabelDim$ ; format = "dense" } }
    })
}

CNTK input file format

There are three input files for CNTK Fast R-CNN corresponding to the three deserializers described above:

  1. train.txt contains in each line first a sequence number, then an image filenames and finally a 0 (which is currently still needed for legacy reasons of the ImageReader).
0 image_01.jpg 0
1 image_02.jpg 0
...
  1. train.rois.txt (CNTK text format) contains in each line first a sequence number, then the |rois identifier followed by a sequence of numbers. These are groups of four numbers corresponding to (x, y, w, h) of an ROI, all relative with respect to the full width and height of the image. There is a total of 4 * number-of-rois numbers per line.
0 |rois 0.2185 0.0 0.165 0.29 ...
  1. train.roilabels.txt (CNTK text format) contains in each line first a sequence number, then the |roiLabels identifier followed by a sequence of numbers. These are groups of number-of-labels numbers (either zero or one) per ROI encoding the ground truth class in a one-hot representation. There is a total of number-of-labels * number-of-rois numbers per line.
0 |roiLabels 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0  ...

Algorithm details

Fast R-CNN

R-CNNs for Object Detection were first presented in 2014 by Ross Girshick et al., and were shown to outperform previous state-of-the-art approaches on one of the major object recognition challenges in the field: Pascal VOC. Since then, two follow-up papers were published which contain significant speed improvements: Fast R-CNN and Faster R-CNN.

The basic idea of R-CNN is to take a deep Neural Network which was originally trained for image classification using millions of annotated images and modify it for the purpose of object detection. The basic idea from the first R-CNN paper is illustrated in the Figure below (taken from the paper): (1) Given an input image, (2) in a first step, a large number region proposals are generated. (3) These region proposals, or Regions-of-Interests (ROIs), are then each independently sent through the network which outputs a vector of e.g. 4096 floating point values for each ROI. Finally, (4) a classifier is learned which takes the 4096 float ROI representation as input and outputs a label and confidence to each ROI.

image

While this approach works well in terms of accuracy, it is very costly to compute since the Neural Network has to be evaluated for each ROI. Fast R-CNN addresses this drawback by only evaluating most of the network (to be specific: the convolution layers) a single time per image. According to the authors, this leads to a 213 times speed-up during testing and a 9x speed-up during training without loss of accuracy. This is achieved by using an ROI pooling layer which projects the ROI onto the convolutional feature map and performs max pooling to generate the desired output size that the following layer is expecting. In the AlexNet example used in this tutorial the ROI pooling layer is put between the last convolutional layer and the first fully connected layer (see brain script code).

The original Caffe implementation used in the R-CNN papers can be found at github: RCNN, Fast R-CNN, and Faster R-CNN. This tutorial uses some of the code from these repositories, notably (but not exclusively) for svm training and model evaluation.

SVM vs NN training

Patrick Buehler provides instructions on how to train an SVM on the CNTK Fast R-CNN output (using the 4096 features from the last fully connected layer) as well as a discussion on pros and cons here.

Selective Search

Selective Search is a method for finding a large set of possible object locations in an image, independent of the class of the actual object. It works by clustering image pixels into segments, and then performing hierarchical clustering to combine segments from the same object into object proposals.

image image image

To complement the detected ROIs from Selective Search, we add ROIs that uniform cover the image at different scales and aspect ratios. The image on the left shows an example output of Selective Search, where each possible object location is visualized by a green rectangle. ROIs that are too small, too big, etc. are discarded (middle) and finally ROIs that uniformly cover the image are added (right). These rectangles are then used as Regions-of-Interests (ROIs) in the R-CNN pipeline.

The goal of ROI generation is to find a small set of ROIs which however tightly cover as many objects in the image as possible. This computation has to be sufficiently quick, while at the same time finding object locations at different scales and aspect ratios. Selective Search was shown to perform well for this task, with good accuracy to speed trade-offs.

NMS (Non Maximum Suppression)

Object detection methods often output multiple detections which fully or partly cover the same object in an image. These ROIs need to be merged to be able to count objects and obtain their exact locations in the image. This is traditionally done using a technique called Non Maximum Suppression (NMS). The version of NMS we use (and which was also used in the R-CNN publications) does not merge ROIs but instead tries to identify which ROIs best cover the real locations of an object and discards all other ROIs. This is implemented by iteratively selecting the ROI with highest confidence and removing all other ROIs which significantly overlap this ROI and are classified to be of the same class. The threshold for the overlap can be set in PARAMETERS.py (details).

Detection results before (left) and after (right) Non Maximum Suppression:

image image

mAP (mean Average Precision)

Once trained, the quality of the model can be measured using different criteria, such as precision, recall, accuracy, area-under-curve, etc. A common metric which is used for the Pascal VOC object recognition challenge is to measure the Average Precision (AP) for each class. The following description of Average Precision is taken from Everingham et. al. The mean Average Precision (mAP) is computed by taking the average over the APs of all classes.

For a given task and class, the precision/recall curve is computed from a method’s ranked output. Recall is defined as the proportion of all positive examples ranked above a given rank. Precision is the proportion of all examples above that rank which are from the positive class. The AP summarises the shape of the precision/recall curve, and is defined as the mean precision at a set of eleven equally spaced recall levels [0,0.1, . . . ,1]:

image

The precision at each recall level r is interpolated by taking the maximum precision measured for a method for which the corresponding recall exceeds r:

image

where p(˜r) is the measured precision at recall ˜r. The intention in interpolating the precision/recall curve in this way is to reduce the impact of the “wiggles” in the precision/recall curve, caused by small variations in the ranking of examples. It should be noted that to obtain a high score, a method must have precision at all levels of recall – this penalises methods which retrieve only a subset of examples with high precision (e.g. side views of cars).

Clone this wiki locally