CNTKTextFormat Reader

Nikos Karampatziakis edited this page Aug 17, 2016 · 15 revisions

CNTKTextFormatReader (later simply Text Reader) is designed to consume input text data formatted according to the specification below. It supports the following main features:

  • Multiple input streams (inputs) per file
  • Both sparse and dense inputs
  • Variable length sequences

Input Schema

Each line in the input file contains one sample for one or more inputs. Since (explicitly or implicitly) every line is also attached to a sequence, it defines one or more <sequence, input, sample> relations. Each input line must be formatted as follows:

[Sequence_Id](Sample)+

where

Sample=|Input_Name (Value )*

  • Each line starts with a sequence id and contains one or more samples (in other words, each line is an unordered collection of samples).
  • Sequence id is a number. It can be omitted, in which case the line number will be used as the sequence id.
  • Each sample is effectively a key-value pair consisting of an input name and the corresponding value vector (mapping to higher dimensions is done as part of the network itself).
  • Each sample begins with a pipe symbol followed by the input name (no spaces), followed by a whitespace delimiter and then a list of values.
  • Each value is either a number or an index-prefixed number for sparse inputs.
  • Both tabs and spaces can be used interchangeably as delimiters.
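The schema above can be sketched as a small parser. This is only an illustration in Python, not the reader's actual implementation: the function name parse_line and its return shape are made up for this example, and a repeated input name on one line simply overwrites the earlier one here.

```python
def parse_line(line):
    """Parse one line into (sequence_id, {input_name: values}).

    sequence_id is None when the line carries no id prefix. Dense values are
    floats; sparse values are (index, value) tuples.
    """
    head, *chunks = line.rstrip("\r\n").split("|")
    head = head.strip()
    sequence_id = int(head) if head else None
    if not chunks:
        raise ValueError("expected at least one |Input_Name sample")
    samples = {}
    for chunk in chunks:
        # The input name must follow the pipe directly, with no space between.
        if not chunk or chunk[0].isspace():
            raise ValueError("missing input name after '|'")
        name, *tokens = chunk.split()  # tabs and spaces both delimit values
        values = []
        for token in tokens:
            if ":" in token:  # index-prefixed number (sparse input)
                index, value = token.split(":", 1)
                values.append((int(index), float(value)))
            else:  # plain number (dense input)
                values.append(float(token))
        samples[name] = values  # hypothetical choice: last occurrence wins
    return sequence_id, samples
```

For instance, parse_line("100 |a 1 2 3 |b 7:0.5") yields the id 100, a dense vector for a and one (index, value) pair for b.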

Simple Example

This example is based on a minimal set of parameters and format options.

To use the Text Reader, set readerType to CNTKTextFormatReader in the reader section of your CNTK configuration:

...
reader = [
    readerType = "CNTKTextFormatReader" 
    file = "c:\mydata\SampleInput.txt" # See the second example for Linux path example
    
    # IMPORTANT!
    # All inputs are grouped within "input" sub-section.
    input = [
        Apples = [
            dim = 10 
            format = "dense"
        ]
            
        Oranges = [
            dim = 1000000
            format = "sparse"
        ]
        
        Bananas = [
            dim = 1
            format = "dense"
        ]
    ]
]
# the rest of the cntk config ...

(This fragment, like the other NDL examples in this document, shows only the reader section and omits the rest of the CNTK configuration. See the end of this page for pointers to a set of complete example networks and the corresponding datasets.)

The Text Reader requires the following set of parameters:

  • file - path to the file with the dataset.
  • input - sub-section defining inputs identified by input names (Apples, Oranges and Bananas in the example above). Within each input the following required parameters must be specified:
    • format - specifies the input type. Must be either dense or sparse.
    • dim - specifies the dimension of the input value vector (for dense input this directly corresponds to the number of values in each sample, for sparse this represents the upper bound on the range of possible index values).

The input data corresponding to the reader configuration above should look something like this:

|Oranges 100:3 123:4 |Bananas 8 |Apples 0 1 2 3 4 5 6 7 8 9
|Apples 0 1.1 22 0.3 14 54 0.06 0.7 1.8 9.9 |Bananas 123917 |Oranges 1134:1.911 13331:0.014
|Bananas -0.001 |Apples 3.9 1.11 121.2 99.13 0.04 2.95 1.6 7.19 10.8 -9.9 |Oranges 999:0.001 918918:-9.19

Note the following about the input format:

  • |Input_Name identifies the beginning of each input sample. This element is mandatory and is followed by the corresponding value vector.
  • A dense vector is a list of floating point values; a sparse vector is a list of index:value tuples.
  • Both tabs and spaces are allowed as value delimiters (within input vectors) as well as input delimiters (between inputs).
  • Each separate line constitutes a "sequence" of length 1 ("Real" variable-length sequences are explained in the extended example below).
  • Each input identifier can only appear once on a single line (which translates into one sample per input per line requirement).
  • The order of input samples within a line is NOT important (conceptually, each line is an unordered collection of key-value pairs).
  • Each well-formed line, including the last line of the file, must end with either a "Line Feed" (\n) or a "Carriage Return, Line Feed" (\r\n) sequence.
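The notes above can be turned into a quick pre-flight check for a data file. The Python sketch below is hypothetical (the INPUTS dictionary and check_line are names invented here, not part of CNTK) and assumes that sparse indices are zero-based and strictly less than dim. Input names that are not configured are left alone, since this page does not say how the reader treats them.

```python
# Mirrors the "input" sub-section of the reader configuration above.
INPUTS = {
    "Apples": {"dim": 10, "format": "dense"},
    "Oranges": {"dim": 1000000, "format": "sparse"},
    "Bananas": {"dim": 1, "format": "dense"},
}

def check_line(line, inputs=INPUTS):
    """Return a list of problems found in one data line (empty if none)."""
    problems = []
    if not line.endswith("\n"):  # also covers \r\n
        problems.append("line is not terminated by \\n or \\r\\n")
    seen = set()
    for chunk in line.strip().split("|")[1:]:  # [0] is the optional sequence id
        tokens = chunk.split()
        if not tokens or tokens[0] not in inputs:
            continue  # empty or unconfigured input: not judged by this sketch
        name, values = tokens[0], tokens[1:]
        if name in seen:
            problems.append(f"input {name!r} appears more than once")
        seen.add(name)
        spec = inputs[name]
        if spec["format"] == "dense":
            if len(values) != spec["dim"]:
                problems.append(
                    f"{name}: expected {spec['dim']} values, got {len(values)}")
        else:
            for token in values:
                index_text, separator, _ = token.partition(":")
                if not separator or not index_text.isdigit():
                    problems.append(f"{name}: {token!r} is not index:value")
                elif int(index_text) >= spec["dim"]:  # assumed zero-based bound
                    problems.append(f"{name}: index {index_text} is out of range")
    return problems
```

Running check_line over every line of a file before training gives a rough idea of how many input errors the reader is likely to encounter.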

Extended Example

This example features all possible configuration parameters and shows various input format options. Please refer to the tables below for the full description of the configuration parameters used in this example.

...
precision="double"

reader = [
    readerType = "CNTKTextFormatReader" 
    file = "/home/mydata/SampleInput.txt" # See the first example for Windows style path example
    randomize = true
    randomizationWindow = 30
    skipSequenceIds = false
    maxErrors = 100
    traceLevel = 2
    
    chunkSizeInBytes = 1024

    keepDataInMemory = true
    frameMode = false
    
    input = [
        Some_very_long_input_name = [
            alias = "a"
            dim = 3 
            format = "dense"
        ]
        Some_other_also_very_long_input_name = [
            alias = "b"
            dim = 2 
            format = "dense"
        ]
    ]
]
# the rest of the cntk config ...

The corresponding input file can then look approximately as follows:

100 |a 1 2 3 |b 100 200
100 |a 4 5 6 |b 101 201
100 |b 102983 14532 |a 7 8 9 
100 |a 7 8 9
200 |b 300 400 |a 10 20 30
333 |b 500 100 
333 |b 600 -900
400 |a 1 2 3 |b 100 200
|a 4 5 6 |b 101 201
|a 4 5 6 |b 101 201
500 |a 1 2 3 |b 100 200

All options discussed in the example above still apply here. On top of that, this example introduces two additional features:

Input name aliases

Input names can be arbitrarily long, so repeating them throughout the input file may not be space-efficient. To mitigate this, the dataset can use "aliases" instead of full input names. Each alias must then be specified within the corresponding input sub-section. In our example, the dataset uses aliases a and b, which are mapped to "Some_very_long_input_name" and "Some_other_also_very_long_input_name" respectively in the reader config section.
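To make the mapping concrete, here is a small Python sketch that expands aliases in a data line back to the full input names. It is only an illustration: ALIASES and expand_aliases are hypothetical names, and the reader performs this lookup internally from the alias entries in the configuration.

```python
import re

# Alias -> full input name, mirroring the "alias" entries in the reader config.
ALIASES = {
    "a": "Some_very_long_input_name",
    "b": "Some_other_also_very_long_input_name",
}

def expand_aliases(line, aliases=ALIASES):
    """Replace every |alias in a data line with |full_input_name."""
    return re.sub(
        r"\|(\S+)",
        # Names without an alias entry pass through unchanged.
        lambda match: "|" + aliases.get(match.group(1), match.group(1)),
        line,
    )
```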

Sequence IDs

As already mentioned, each line in the input file terminated by \n or \r\n represents a sequence containing a single sample for each input. If a line is prefixed with a number, that number is used as the corresponding sequence id. All subsequent lines that share the same sequence id become part of the same sequence. Therefore, repeating the same numerical prefix for N lines builds up a multi-sample sequence, with each input containing between 1 and N samples. Omitting the sequence prefix on the second and following lines has the same effect. Thus, the example dataset above defines five sequences with ids 100, 200, 333, 400 and 500.

Setting the skipSequenceIds parameter in the reader section to true forces the reader to ignore all explicit sequence ids in the dataset and treat separate lines as individual sequences. Omitting the sequence id on the first line in the dataset has the same effect: all subsequent sequence ids are ignored and lines are treated as individual sequences, as in this example:

|a 1 2 3 |b 100 200
100 |a 4 5 6 |b 101 201
200 |b 102983 14532 |a 7 8 9 

A few final things to consider when using sequences:

  • Sequence ids must be unique.
  • Id prefixes can only be repeated for consecutive lines.
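  • The number of lines in a sequence (that is, the number of lines sharing the same id prefix) must not exceed the number of samples in the longest input of that sequence.

For example, the following two datasets are invalid. The first repeats id 100 on non-consecutive lines; the second spreads sequence 456 over two lines although each of its inputs has only one sample: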

100 |a 1 2 3 |b 100 200
200 |a 4 5 6 |b 101 201
100 |b 102983 14532 |a 7 8 9 

123 |a 1 2 3 |b 100 200
456 |a 4 5 6 
456 |b 101 201
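These constraints are easy to check mechanically before handing a file to the reader. The following Python sketch is hypothetical (find_sequence_errors is not a CNTK function) and assumes the first line carries an explicit sequence id, so that ids are not being skipped.

```python
from collections import Counter

def find_sequence_errors(lines):
    """Return a list of sequence-related problems in a list of data lines."""
    runs = []  # each run: [sequence_id, line_count, samples per input name]
    for line in lines:
        head, *chunks = line.split("|")
        head = head.strip()
        # A line without a prefix continues the current run.
        sequence_id = int(head) if head else (runs[-1][0] if runs else None)
        if not runs or sequence_id != runs[-1][0]:
            runs.append([sequence_id, 0, Counter()])
        runs[-1][1] += 1
        runs[-1][2].update(c.split()[0] for c in chunks if c.split())

    errors, seen = [], set()
    for sequence_id, line_count, per_input in runs:
        if sequence_id in seen:
            errors.append(f"id {sequence_id} is repeated on non-consecutive lines")
        seen.add(sequence_id)
        longest = max(per_input.values(), default=0)
        if line_count > longest:
            errors.append(
                f"sequence {sequence_id} spans {line_count} lines "
                f"but its longest input has {longest} samples")
    return errors
```

Each of the two invalid datasets above produces exactly one error with this check, while the dataset from the extended example produces none.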

A Few Real-World Examples

  • Classification: Every line contains a sample, consisting of a label and features. No sequence ID needed, since every line is its own "sequence" of length 1.
|class 23:1 |features 2 3 4 5 6
|class 13:1 |features 1 2 0 2 3
...
  • DSSM: Every line contains a source-target document pair, expressed through a bag of words, encoded as sparse vectors.
|src 12:1 23:1 345:2 45001:1    |tgt 233:1 766:2 234:1
|src 123:1 56:1 10324:1 18001:3 |tgt 233:1 2344:2 8889:1 2234:1 253434:1
  • Part-of-speech tagging: Sequences mapping every element to a corresponding label. The sequences are aligned vertically (one word + tag per line).
0 |word 234:1 |tag 12:1
0 |word 123:1 |tag 10:1
0 |word 123:1 |tag 13:1
1 |word 234:1 |tag 12:1
1 |word 123:1 |tag 10:1
...
  • Sequence classification: Sequences mapped to a single label. Sequences are aligned vertically; the "class" label can occur in any line that has the same sequence id.

NOTE: At the moment the number of lines must not exceed the length of the longest sequence. This means that the label cannot appear on a line on its own. This is an implementation detail that will be lifted in the future.

0 |word 234:1 |class 3:1
0 |word 123:1
0 |word 890:1
1 |word 11:1 |class 2:1
1 |word 344:1
  • Sequence to sequence: Map a source sequence to a target sequence. The two sequences are aligned vertically and, in the easiest case, just printed one after another. They are joined by having the same overall "sequence ID" (which then becomes a "work unit ID" in this case).

NOTE: At the moment the number of lines must not exceed the length of the longest sequence. This means that sequences must be aligned horizontally. This is an implementation detail that will be lifted in the future.

0 |sourceWord 234:1  |targetWord 344:1
0 |sourceWord 123:1  |targetWord 456:1
0 |sourceWord 123:1  |targetWord 2222:1
0 |sourceWord 11:1 
1 |sourceWord 123:1 
...
  • Learning to Rank: A "sequence" represents a query, every sample a document with a hand-labeled rating. In this case the "sequence" is just a multiset that (in the context of a learning-to-rank loss function) doesn't have an ordering.
0 |rating 4 |features 23 35 0 0 0 21 2345 0 0 0 0 0 
0 |rating 2 |features 0 123 0 22 44 44 290 22 22 22 33 0 
0 |rating 1 |features 0 0 0 0 0 0 1 0 0 0 0 0
1 |rating 1 |features 34 56 0 0 0 45 1312 0 0 0 0 0 
1 |rating 0 |features 45 45 0 0 0 12 335 0 0 0 0 0 
2 |rating 0 |features 0 0 0 0 0 0 22 0 0 0 0 0 
...
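Datasets like the ones above are usually generated by a script. As a rough Python sketch, the two hypothetical helpers below write lines for the classification and sequence classification layouts. The input names class, features and word are taken from the examples; one-hot labels are written as a single index:1 pair.

```python
def classification_line(label, features):
    """One sample per line: sparse one-hot label plus a dense feature vector."""
    values = " ".join(str(value) for value in features)
    return f"|class {label}:1 |features {values}\n"

def sequence_classification_lines(sequence_id, word_ids, label):
    """One word per line; the label rides on the first line of the sequence.

    Placing the label on a line that also holds a word keeps the number of
    lines equal to the length of the longest input, as the NOTE above requires.
    """
    lines = []
    for position, word in enumerate(word_ids):
        line = f"{sequence_id} |word {word}:1"
        if position == 0:
            line += f" |class {label}:1"
        lines.append(line + "\n")
    return lines
```

For example, sequence_classification_lines(0, [234, 123, 890], 3) reproduces the first sequence of the sequence classification example.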

Configuration Parameters

| Parameter | Mandatory | Accepted values | Default value | Description |
|---|---|---|---|---|
| precision | No | double, float | float | Specifies the floating point precision of the input values |

reader section

| Parameter | Mandatory | Accepted values | Default value | Description |
|---|---|---|---|---|
| readerType | Yes | One of the supported CNTK readers | | Specifies the reader flavor to load (e.g., CNTKTextFormatReader) |
| file | Yes | File path | | Path to the file containing the input dataset (Windows or Linux style) |
| randomize | No | true, false | true | Specifies whether the input should be randomized |
| randomizationWindow | No | Positive integer | | Specifies the randomization range (in number of samples) [1]. This controls how much of the dataset resides in memory. |
| skipSequenceIds | No | true, false | false | If true, the reader will ignore sequence IDs in the input file (see the Sequence IDs section above) |
| maxErrors | No | Positive integer | 0 | Number of input errors after which an exception should be raised. By default, the first malformed value will trigger an exception |
| traceLevel | No | 0, 1, 2 | 1 | Output verbosity level. 0 - show only errors; 1 - show errors and warnings; 2 - show all output [2] |
| chunkSizeInBytes | No | Positive integer | 33554432 | Number of consecutive bytes to read from disk in a single read operation (default is 32MB) |
| keepDataInMemory | No | true, false | false | If true, the whole dataset will be cached in memory |
| frameMode | No | true, false | false | true signals the reader to use a packing method optimized for frames (single sample sequences) |

[1] If no randomizationWindow is specified, the randomization range is set to be equal to the size of the dataset (i.e., the input is randomized across the whole dataset). randomizationWindow is ignored when randomize is set to false.

[2] To make the reader show a warning for each input error it ignores (that is, each error that does not raise an exception), use a non-default maxErrors value in combination with traceLevel set to 1 or above.

input sub-section

input combines a number of individual inputs, each with an appropriately labeled configuration sub-section. All parameters described below are specific to an Input name sub-section associated with a particular input.

| Parameter | Mandatory | Accepted values | Description |
|---|---|---|---|
| alias | No | String | An alternative shorthand name used in the input file |
| format | Yes | dense, sparse | Specifies input type |
| dim | Yes | Positive integer | Dimension of the input value (i.e., number of input values in a sample for dense input, upper bound on the index range for sparse input) |

You will find complete network definitions and the corresponding data set examples in the CNTK Repository. There you will also find an End-to-End Test that uses the CNTKTextFormat reader.
