
CNTKTextFormat Reader

Clemens Marschner edited this page Apr 19, 2016 · 15 revisions

CNTKTextFormatReader (hereafter simply Text Reader) is designed to consume input text data formatted according to the specification below. It supports the following main features:

  • Multiple inputs per file
  • Both sparse and dense inputs
  • Variable length sequences

Input Schema

Each line in the input file contains one sample for one or more inputs. Since (explicitly or implicitly) every line is also attached to a sequence, it defines one or more <sequence, input, sample> relations. Each input line must be formatted as follows:

[Sequence_Id](Sample)+

where

Sample=|Input_Name (Value )*

  • Each line starts with a sequence id and contains one or more samples (in other words, each line is an unordered collection of samples).
  • Sequence id is a number. It can be omitted, in which case the line number will be used as the sequence id.
  • Each sample is effectively a key-value pair consisting of an input name and the corresponding value vector (mapping to higher dimensions is done as part of the network itself).
  • Each sample begins with a pipe symbol followed by the input name (no spaces), followed by a space and then a list of values.
  • Each value is either a number or an index-prefixed number for sparse inputs.
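The schema above can be illustrated with a small Python sketch. It is not the reader's actual parser; the function names parse_line and parse_value are hypothetical, and the sketch is more lenient about whitespace than the reader may be.

```python
def parse_line(line):
    """Split one text-format line into (sequence_id, samples).

    sequence_id is an int, or None when the line has no numeric prefix.
    samples maps each input name to its list of raw value tokens.
    """
    # Everything before the first pipe is the optional sequence id.
    head, *parts = line.rstrip("\r\n").split("|")
    head = head.strip()
    sequence_id = int(head) if head else None

    samples = {}
    for part in parts:
        tokens = part.split()
        if not tokens:
            raise ValueError("an input name must follow each '|'")
        name, values = tokens[0], tokens[1:]
        if name in samples:
            raise ValueError(f"input {name!r} appears twice on one line")
        samples[name] = values
    if not samples:
        raise ValueError("a line needs at least one sample")
    return sequence_id, samples


def parse_value(token):
    """Return (index, value) for a sparse token, (None, value) for a dense one."""
    index, sep, value = token.rpartition(":")
    if sep:
        return int(index), float(value)
    return None, float(token)
```

For instance, parse_line("100 |a 1 2 3 |b 100 200") yields the id 100 and two samples named a and b, while a line without a numeric prefix yields None as the id.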

Simple Example

This example is based on a minimal set of parameters and format options.

To use the Text Reader, set the readerType to CNTKTextFormatReader in the reader section of your CNTK configuration:

...
reader = [
    readerType = "CNTKTextFormatReader" 
    file = "c:\mydata\SampleInput.txt" # See the second example for Linux path example
    
    # IMPORTANT!
    # All inputs are grouped within "input" sub-section.
    input = [
        Apples = [
            dim = 10 
            format = "dense"
        ]
            
        Oranges = [
            dim = 1000000
            format = "sparse"
        ]
        
        Bananas = [
            dim = 1
            format = "dense"
        ]
    ]
]
# the rest of the cntk config ...

(This fragment, like the other NDL examples in this document, presents only the reader section and omits the rest of the CNTK configuration; see the end of this page for pointers to a set of complete example networks and the corresponding datasets.)

The Text Reader requires the following set of parameters:

  • file - path to the file with the dataset.
  • input - sub-section defining inputs identified by input names (Apples, Oranges and Bananas in the example above). Within each input the following required parameters must be specified:
    • format - specifies the input type. Must be either dense or sparse.
    • dim - specifies the dimension of the input value vector (for dense input this directly corresponds to the number of values in each sample, for sparse this represents the upper bound on the range of possible index values).
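To make the role of dim and format concrete, here is a minimal Python sketch that checks one sample's value tokens against its declared input. The helper check_sample is hypothetical and not part of CNTK. It also assumes that sparse indices are zero-based and strictly smaller than dim, which the description above does not state explicitly.

```python
def check_sample(tokens, dim, fmt):
    """Raise ValueError if the value tokens do not fit the declared input."""
    if fmt == "dense":
        # A dense sample must carry exactly `dim` numbers.
        if len(tokens) != dim:
            raise ValueError(
                f"dense sample has {len(tokens)} values, expected {dim}")
        for token in tokens:
            float(token)  # raises ValueError on a malformed number
    elif fmt == "sparse":
        for token in tokens:
            index, _, value = token.partition(":")
            # Assumption: valid indices are 0 .. dim - 1.
            if not 0 <= int(index) < dim:
                raise ValueError(f"sparse index {index} is outside [0, {dim})")
            float(value)  # also rejects a token without a ':value' part
    else:
        raise ValueError(f"format must be 'dense' or 'sparse', got {fmt!r}")
```

With the configuration above, check_sample("0 1 2 3 4 5 6 7 8 9".split(), 10, "dense") passes for Apples, whereas a two-value Apples sample would be rejected.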

The input data corresponding to the reader configuration above should look something like this:

|Oranges 100:3 123:4 |Bananas 8 |Apples 0 1 2 3 4 5 6 7 8 9
|Apples 0 1.1 22 0.3 14 54 0.06 0.7 1.8 9.9 |Bananas 123917 |Oranges 1134:1.911 13331:0.014
|Bananas -0.001 |Apples 3.9 1.11 121.2 99.13 0.04 2.95 1.6 7.19 10.8 -9.9 |Oranges 999:0.001 918918:-9.19

Note the following about the input format:

  • |Input_Name identifies the beginning of each input sample. This element is mandatory and is followed by the corresponding value vector.
  • A dense vector is just a list of numbers; a sparse vector is a list of index:value pairs.
  • Only spaces are allowed as value delimiters (i.e., within input vectors). Both spaces and tabs are allowed as input delimiters (i.e., between inputs).
  • Each separate line constitutes a "sequence" of length 1 ("Real" variable-length sequences are explained in the extended example below).
  • Each input can have only a single sample per line.
  • The order of input samples within a line is NOT important (conceptually, each line is an unordered collection of key-value pairs).
  • Each well-formed line must end with either a "Line Feed" (\n) or a "Carriage Return, Line Feed" (\r\n) sequence (including the last line of the file).
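When generating such a file from your own data, a small writer helps to get the delimiters right. The following Python sketch is one possible way to do it; format_sample and format_line are hypothetical helper names. A list is written as a dense vector and a dict of index to value as a sparse vector.

```python
def format_sample(name, values):
    """Format one sample: a list is dense, a dict {index: value} is sparse."""
    if isinstance(values, dict):
        body = " ".join(f"{index}:{value}" for index, value in values.items())
    else:
        body = " ".join(str(value) for value in values)
    return f"|{name} {body}"


def format_line(samples, sequence_id=None):
    """Format one line from a dict of input name -> values.

    Values are separated by single spaces, inputs by a single space,
    and the line is terminated with a Line Feed.
    """
    prefix = "" if sequence_id is None else f"{sequence_id} "
    body = " ".join(format_sample(name, values)
                    for name, values in samples.items())
    return prefix + body + "\n"
```

For example, format_line({"Oranges": {100: 3, 123: 4}, "Bananas": [8]}) produces the text "|Oranges 100:3 123:4 |Bananas 8" followed by a Line Feed.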

Extended Example

This example features all possible configuration parameters and shows various input format options. Please refer to the tables below for the full description of the configuration parameters used in this example.

...
precision="double"

reader = [
    readerType = "CNTKTextFormatReader" 
    file = "/home/mydata/SampleInput.txt" # See the first example for Windows style path example
    randomize="auto"
    skipSequenceIds = "false"
    maxErrors = 100
    traceLevel = 2
    
    chunkSizeInBytes = 1024
    numChunksToCache = 10
    
    input = [
        Some_very_long_input_name = [
            alias = "a"
            dim = 3 
            format = "dense"
        ]
        Some_other_also_very_long_input_name = [
            alias = "b"
            dim = 2 
            format = "dense"
        ]
    ]
]
# the rest of the cntk config ...

The corresponding input file can then look approximately as follows:

100 |a 1 2 3 |b 100 200
100 |a 4 5 6 |b 101 201
100 |b 102983 14532 |a 7 8 9 
100 |a 7 8 9
200 |b 300 400 |a 10 20 30
333 |b 500 100 
333 |b 600 -900
400 |a 1 2 3 |b 100 200
|a 4 5 6 |b 101 201
|a 4 5 6 |b 101 201
500 |a 1 2 3 |b 100 200

All options discussed in the example above still apply here. On top of that, this example introduces two additional features:

Input name aliases

Input names can be arbitrarily long, so repeating them throughout the input file may not be space-efficient. To mitigate this, the dataset can use "aliases" instead of full input names. Each alias must then be specified within the corresponding input sub-section. In our example, the dataset uses aliases a and b, which are mapped to "Some_very_long_input_name" and "Some_other_also_very_long_input_name" respectively in the reader config section.
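If an existing dataset was written with full input names, it can be shortened before training. The Python sketch below shows one way to do that; apply_aliases is a hypothetical helper, and it assumes every input name is followed by a single space.

```python
def apply_aliases(line, aliases):
    """Replace each '|full_name' in a line with '|alias'.

    `aliases` maps full input names to the aliases declared in the
    reader configuration; names without an alias are left unchanged.
    """
    head, *parts = line.split("|")
    rewritten = [head]
    for part in parts:
        name, sep, rest = part.partition(" ")
        rewritten.append(aliases.get(name, name) + sep + rest)
    return "|".join(rewritten)


ALIASES = {
    "Some_very_long_input_name": "a",
    "Some_other_also_very_long_input_name": "b",
}
```

Applied to a line that spells out both long input names, apply_aliases(line, ALIASES) returns the same line with the names replaced by a and b.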

Sequence IDs

As already mentioned, each line in the input file terminated by \n or \r\n represents a sequence containing a single sample for each input. If a line is prefixed with a number, the number is used as the corresponding sequence id. All subsequent lines that share the same sequence id become part of the same sequence. Therefore, repeating the same numerical prefix for N lines builds up a multi-sample sequence, with each input containing between 1 and N samples. Omitting the sequence prefix on the second and following lines has the same effect. Thus, the example dataset above contains five sequences with ids 100, 200, 333, 400 and 500.
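The grouping rule can be sketched in a few lines of Python. This only illustrates the behavior described above and is not the reader's implementation; group_sequences is a hypothetical name, and the sketch expects an explicit id on the first line.

```python
def group_sequences(lines):
    """Return a list of (sequence_id, lines) pairs in file order."""
    sequences = []
    for line in lines:
        head = line.split("|", 1)[0].strip()
        if head and (not sequences or int(head) != sequences[-1][0]):
            # A new id starts a new sequence.
            sequences.append((int(head), []))
        elif not sequences:
            raise ValueError("this sketch expects an id on the first line")
        # Same id, or no id at all: the line extends the current sequence.
        sequences[-1][1].append(line)
    return sequences


EXAMPLE = [
    "100 |a 1 2 3 |b 100 200",
    "100 |a 4 5 6 |b 101 201",
    "100 |b 102983 14532 |a 7 8 9",
    "100 |a 7 8 9",
    "200 |b 300 400 |a 10 20 30",
    "333 |b 500 100",
    "333 |b 600 -900",
    "400 |a 1 2 3 |b 100 200",
    "|a 4 5 6 |b 101 201",
    "|a 4 5 6 |b 101 201",
    "500 |a 1 2 3 |b 100 200",
]

# Sequence ids with their length in lines:
# [(100, 4), (200, 1), (333, 2), (400, 3), (500, 1)]
print([(seq_id, len(seq)) for seq_id, seq in group_sequences(EXAMPLE)])
```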

Setting the skipSequenceIds parameter in the reader section to true forces the reader to ignore all explicit sequence ids in the dataset and treat separate lines as individual sequences. Omitting the sequence id on the first line in the dataset has the same effect: all subsequent sequence ids are ignored and lines are treated as individual sequences, as in this example:

|a 1 2 3 |b 100 200
100 |a 4 5 6 |b 101 201
200 |b 102983 14532 |a 7 8 9 

A few final things to consider when using sequences:

  • Sequence ids must be unique.
  • Id prefixes can only be repeated for consecutive lines.
  • Sequence length in lines (that is, the number of lines sharing the same id prefix) must not exceed the maximum input length in samples (that is, the largest number of samples that any single input has in this sequence).

For example, the following datasets are invalid:

100 |a 1 2 3 |b 100 200
200 |a 4 5 6 |b 101 201
100 |b 102983 14532 |a 7 8 9 

123 |a 1 2 3 |b 100 200
456 |a 4 5 6 
456 |b 101 201
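The rules listed above, as they are worded, can be checked mechanically before training. The following Python sketch is a hypothetical pre-flight check, not part of CNTK; for simplicity it expects an explicit id on every line. The first two rules collapse into one test: an id that reappears after a different id is both non-unique and non-consecutive.

```python
from collections import Counter


def find_sequence_errors(lines):
    """Return a list of human-readable problems; empty if none were found."""
    # Collect runs of consecutive lines that share an id:
    # [sequence id, number of lines, samples seen per input name]
    runs = []
    for line in lines:
        head, *parts = line.split("|")
        seq_id = int(head)
        if not runs or runs[-1][0] != seq_id:
            runs.append([seq_id, 0, Counter()])
        runs[-1][1] += 1
        runs[-1][2].update(part.split()[0] for part in parts)

    errors = []
    seen = set()
    for seq_id, line_count, per_input in runs:
        if seq_id in seen:
            errors.append(f"id {seq_id} is repeated on non-consecutive lines")
        seen.add(seq_id)
        longest = max(per_input.values())
        if line_count > longest:
            errors.append(
                f"sequence {seq_id} spans {line_count} lines "
                f"but its longest input has {longest} samples")
    return errors
```

Run on the two invalid datasets above, the check reports the repeated id 100 for the first one and the two-line sequence 456, whose inputs each hold only one sample, for the second.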

A Few Real-World Examples

  • Classification: Every line contains a sample, consisting of a label and features. No sequence ID needed.
|class 23:1 |features 2 3 4 5 6
|class 13:1 |features 1 2 0 2 3
...
  • DSSM: Every line contains a source-target document pair, expressed through a bag of words, encoded as sparse vectors.
|src 12:1 23:1 345:2 45001:1 |tgt 233:1 766:2 234:1
|src 123:1 56:1 10324:1 18001:3 |tgt 233:1 2344:2 8889:1 2234:1 253434:1
  • Part-of-speech tagging: Sequences mapping every element to a corresponding label. The sequences are aligned vertically (one token + tag per line).
0 |token 234:1 |tag 12:1
0 |token 123:1 |tag 10:1
0 |token 123:1 |tag 13:1
1 |token 234:1 |tag 12:1
1 |token 123:1 |tag 10:1
...
  • Sequence classification: Sequences mapped to a single label. Sequences are aligned vertically; the "class" label can occur in any line that has the same sequenceId (so it can also appear on its own).
0 |token 234:1 
0 |token 123:1
0 |token 123:1
0 |class 3:1
1 |token 11:1 
1 |token 344:1
1 |class 2:1
  • Sequence to sequence: Map a source sequence to a target sequence. The two sequences are aligned vertically and, in the easiest case, just printed one after another. They are joined by having the same overall "sequence ID" (which then becomes a "work unit ID" in this case).
0 |src 234:1 
0 |src 123:1
0 |src 123:1
0 |tgt 11:1 
0 |tgt 344:1
0 |tgt 456:1
0 |tgt 2222:1
1 |src 123:1 
...
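Data in these layouts is usually generated by a script. As an illustration, the Python sketch below writes one sequence-classification example with one-hot sparse vectors. The function name is hypothetical; the input names token and class follow the example above. The label is placed on the first line of the sequence, which is one of the positions the description allows.

```python
def sequence_classification_lines(sequence_id, token_ids, class_id):
    """Return the lines for one labeled sequence.

    Each token and the class label are written as one-hot sparse
    vectors of the form index:1.
    """
    lines = []
    for position, token_id in enumerate(token_ids):
        line = f"{sequence_id} |token {token_id}:1"
        if position == 0:
            # Attach the single class label to the first token line.
            line += f" |class {class_id}:1"
        lines.append(line)
    return lines
```

Calling sequence_classification_lines(0, [234, 123, 123], 3) returns three lines with the id 0, the first of which also carries the sample |class 3:1.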

Configuration Parameters

| Parameter | Mandatory | Accepted values | Default value | Description |
| --- | --- | --- | --- | --- |
| precision | No | double, float | float | Sets the precision of the input value |

reader section

| Parameter | Mandatory | Accepted values | Default value | Description |
| --- | --- | --- | --- | --- |
| readerType | Yes | Known CNTK Reader Type, CNTKTextFormatReader for Text Reader | N/A | CNTK Reader Type to use |
| file | Yes | Path to data set file (Windows or Linux style) | N/A | Path to the file with the dataset |
| randomize | No | auto, none | auto | Specifies whether the input should be randomized |
| skipSequenceIds | No | true, false | false | If true, the reader will ignore sequence IDs in the input file (see the Sequence IDs section above) |
| maxErrors | No | Positive integer | 0 | Number of input errors after which an exception will be generated. By default an exception is thrown after the first encountered malformed value |
| traceLevel | No | 0, 1, 2 | 0 | Verbose output level. 0 - show only errors; 1 - show errors and warnings; 2 - show all output (note 1) |
| chunkSizeInBytes | No | Positive integer | 33554432 | Smallest reading unit in bytes (default is 32MB) (note 2) |
| numChunksToCache | No | Positive integer | 32 | Number of chunks to keep in memory. The size of a single chunk is set by chunkSizeInBytes (note 2) |

1 In order to force the reader to show a warning for each input error it ignores (in the sense of not raising an exception), a non-default maxErrors value should be used in combination with traceLevel set to 1 or above.

2 Using default values of chunkSizeInBytes and numChunksToCache results in 1GB of input data stored in memory (32MB * 32).
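The memory estimate in note 2 is simply the product of the two parameters. The short Python sketch below reproduces the arithmetic; cache_size_bytes is a hypothetical helper, and the result is an upper bound on cached input data only, not on the reader's total memory use.

```python
def cache_size_bytes(chunk_size_in_bytes, num_chunks_to_cache):
    """Upper bound on the input data held in memory, in bytes."""
    return chunk_size_in_bytes * num_chunks_to_cache


# Defaults: 33554432 bytes (32MB) per chunk, 32 chunks cached.
default_cache = cache_size_bytes(33554432, 32)
print(default_cache, "bytes =", default_cache // 2**30, "GB")

# Values from the extended example: 1024-byte chunks, 10 chunks cached.
print(cache_size_bytes(1024, 10), "bytes")
```

With the defaults the product is 1073741824 bytes, that is 1GB; with the values from the extended example it is 10240 bytes.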

input sub-section

The input sub-section combines a number of individual inputs, each with an appropriately named configuration sub-section. All parameters below are specific to the sub-section associated with a particular input name.

| Parameter | Mandatory | Accepted values | Default value | Description |
| --- | --- | --- | --- | --- |
| alias | No | String | N/A | An alternative shorthand name used in the input file |
| format | Yes | dense, sparse | N/A | Specifies input type |
| dim | Yes if format is dense | Positive integer | N/A | Dimension of the input value (i.e., number of input values in a sample for dense input, upper bound on the index range for sparse input) |

You will find complete network definitions and the corresponding data set examples in the CNTK Repository. There you will also find an End-to-End Test using Text Reader.
