CNTKTextFormat Reader
CNTKTextFormatReader (hereafter simply Text Reader) is designed to consume input text data formatted according to the specification below. It supports the following main features:
- Multiple input streams (inputs) per file
- Both sparse and dense inputs
- Variable length sequences
Each line in the input file contains one sample for one or more inputs. Since (explicitly or implicitly) every line is also attached to a sequence, it defines one or more <sequence, input, sample> relations. Each input line must be formatted as follows:
[Sequence_Id](Sample)+
where
Sample=|Input_Name (Value )*
- Each line starts with an optional sequence id and contains one or more samples (in other words, each line is an unordered collection of samples).
- Sequence id is a number. It can be omitted, in which case the line number will be used as the sequence id.
- Each sample is effectively a key-value pair consisting of an input name and the corresponding value vector (mapping to higher dimensions is done as part of the network itself).
- Each sample begins with a pipe symbol followed by the input name (no spaces), followed by a whitespace delimiter and then a list of values.
- Each value is either a number or an index-prefixed number for sparse inputs.
- Both tabs and spaces can be used interchangeably as delimiters.
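The grammar above can be illustrated with a small parser. This is only a sketch written for this page, not the reader's actual implementation: `parse_line` and `parse_value` are hypothetical helper names, and the sketch covers only the rules listed above.

```python
def parse_value(token):
    """Turn one value token into a float (dense) or an (index, value) pair (sparse)."""
    if ":" in token:
        index, value = token.split(":", 1)
        return int(index), float(value)
    return float(token)


def parse_line(line):
    """Split one input line into (sequence_id, {input_name: [values]}).

    sequence_id is None when the line has no numeric prefix.
    """
    head, *chunks = line.rstrip("\r\n").split("|")
    head = head.strip()
    sequence_id = int(head) if head else None

    samples = {}
    for chunk in chunks:
        # str.split() with no arguments treats tabs and spaces alike.
        fields = chunk.split()
        if not fields:
            raise ValueError("a pipe symbol must be followed by an input name")
        name, values = fields[0], fields[1:]
        samples[name] = [parse_value(token) for token in values]
    return sequence_id, samples


if __name__ == "__main__":
    print(parse_line("100 |a 1 2 3 |b 100 200\n"))
    print(parse_line("|Oranges 100:3 123:4\t|Bananas 8\n"))
```

Returning a dictionary keyed by input name mirrors the description of a line as an unordered collection of key-value pairs.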
This example is based on a minimal set of parameters and format options.
To use the Text Reader, set readerType to CNTKTextFormatReader in the reader section of your CNTK configuration:
...
reader = [
    readerType = "CNTKTextFormatReader"
    file = "c:\mydata\SampleInput.txt" # See the second example for Linux path example
    # IMPORTANT!
    # All inputs are grouped within "input" sub-section.
    input = [
        Apples = [
            dim = 10
            format = "dense"
        ]
        Oranges = [
            dim = 1000000
            format = "sparse"
        ]
        Bananas = [
            dim = 1
            format = "dense"
        ]
    ]
]
# the rest of the cntk config ...
(This fragment, as well as the other NDL examples in this document, presents only the reader section, omitting the rest of the CNTK configuration; see the end of this page for pointers to a set of complete example networks and the corresponding datasets.)
The Text Reader requires the following set of parameters:
- file - path to the file with the dataset.
- input - sub-section defining inputs identified by input names (Apples, Oranges and Bananas in the example above). Within each input, the following required parameters must be specified:
  - format - specifies the input type. Must be either dense or sparse.
  - dim - specifies the dimension of the input value vector (for dense input this directly corresponds to the number of values in each sample; for sparse input it represents the upper bound on the range of possible index values).
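To make the roles of format and dim concrete, here is a rough sketch that checks the value tokens of one sample against an input's configuration. `check_sample` is a hypothetical helper, not part of CNTK. The sketch assumes that sparse indices are zero-based and must be strictly smaller than dim, and that a dense sample must contain exactly dim values; the text above does not spell out either detail, and the real reader may be more lenient.

```python
def _is_number(text):
    """Return True if text parses as a floating point number."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def check_sample(tokens, dim, fmt):
    """Return a list of problems found in one sample's value tokens."""
    problems = []
    if fmt == "dense":
        # Assumption: a dense sample carries exactly `dim` values.
        if len(tokens) != dim:
            problems.append(f"expected {dim} values, found {len(tokens)}")
        for token in tokens:
            if not _is_number(token):
                problems.append(f"{token!r} is not a number")
    elif fmt == "sparse":
        for token in tokens:
            index, separator, value = token.partition(":")
            if not separator or not index.isdigit() or not _is_number(value):
                problems.append(f"{token!r} is not an index:value pair")
            elif int(index) >= dim:
                # Assumption: indices are zero-based, so dim is an exclusive bound.
                problems.append(f"index {index} is out of range for dim={dim}")
    else:
        problems.append(f"format must be 'dense' or 'sparse', not {fmt!r}")
    return problems


if __name__ == "__main__":
    print(check_sample("0 1 2 3 4 5 6 7 8 9".split(), dim=10, fmt="dense"))
    print(check_sample(["100:3", "123:4"], dim=1000000, fmt="sparse"))
    print(check_sample(["1", "2"], dim=3, fmt="dense"))
```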
The input data corresponding to the reader configuration above should look something like this:
|Oranges 100:3 123:4 |Bananas 8 |Apples 0 1 2 3 4 5 6 7 8 9
|Apples 0 1.1 22 0.3 14 54 0.06 0.7 1.8 9.9 |Bananas 123917 |Oranges 1134:1.911 13331:0.014
|Bananas -0.001 |Apples 3.9 1.11 121.2 99.13 0.04 2.95 1.6 7.19 10.8 -9.9 |Oranges 999:0.001 918918:-9.19
Note the following about the input format:
- |Input_Name identifies the beginning of each input sample. This element is mandatory and is followed by the corresponding value vector.
- A dense vector is just a list of floating point values; a sparse vector is a list of index:value tuples.
- Both tabs and spaces are allowed as value delimiters (within input vectors) as well as input delimiters (between inputs).
- Each separate line constitutes a "sequence" of length 1 ("Real" variable-length sequences are explained in the extended example below).
- Each input identifier can only appear once on a single line (which translates into one sample per input per line requirement).
- The order of input samples within a line is NOT important (conceptually, each line is an unordered collection of key-value pairs)
- Each well-formed line must end with either a "Line Feed" \n or a "Carriage Return, Line Feed" \r\n symbol (including the last line of the file).
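When generating such a file from a script, the notes above translate into a few lines of code. The following is a minimal sketch rather than an official tool: `format_line` is a hypothetical helper that takes dense inputs as lists of numbers and sparse inputs as index-to-value dictionaries. Sorting the sparse indices is only for reproducible output; the format notes above do not require it.

```python
def format_line(samples, sequence_id=None):
    """Render one line in the text format.

    samples maps an input name to a list of numbers (dense)
    or to a dict of index -> value (sparse).
    """
    parts = [] if sequence_id is None else [str(sequence_id)]
    for name, values in samples.items():
        # Input names must not contain whitespace (or the pipe delimiter).
        if not name or any(ch.isspace() or ch == "|" for ch in name):
            raise ValueError(f"invalid input name: {name!r}")
        if isinstance(values, dict):
            tokens = [f"{index}:{value}" for index, value in sorted(values.items())]
        else:
            tokens = [str(value) for value in values]
        parts.append(" ".join([f"|{name}"] + tokens))
    # Every line, including the last one in the file, ends with a line feed.
    return " ".join(parts) + "\n"


if __name__ == "__main__":
    line = format_line({
        "Oranges": {100: 3, 123: 4},
        "Bananas": [8],
        "Apples": list(range(10)),
    })
    print(line, end="")
```

Because a dictionary key can occur only once, this representation also guarantees one sample per input per line.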
This example features all possible configuration parameters and shows various input format options. Please refer to the tables below for the full description of the configuration parameters used in this example.
...
precision="double"
reader = [
    readerType = "CNTKTextFormatReader"
    file = "/home/mydata/SampleInput.txt" # See the first example for Windows style path example
    randomize = true
    randomizationWindow = 30
    skipSequenceIds = false
    maxErrors = 100
    traceLevel = 2
    chunkSizeInBytes = 1024
    keepDataInMemory = true
    frameMode = false
    input = [
        Some_very_long_input_name = [
            alias = "a"
            dim = 3
            format = "dense"
        ]
        Some_other_also_very_long_input_name = [
            alias = "b"
            dim = 2
            format = "dense"
        ]
    ]
]
# the rest of the cntk config ...
The corresponding input file can then look approximately as follows:
100 |a 1 2 3 |b 100 200
100 |a 4 5 6 |b 101 201
100 |b 102983 14532 |a 7 8 9
100 |a 7 8 9
200 |b 300 400 |a 10 20 30
333 |b 500 100
333 |b 600 -900
400 |a 1 2 3 |b 100 200
|a 4 5 6 |b 101 201
|a 4 5 6 |b 101 201
500 |a 1 2 3 |b 100 200
All options discussed in the first example still apply here. On top of that, this example introduces two additional features:
Input names can be arbitrarily long, so repeating them throughout the input file may not be space-efficient. To mitigate this, the dataset can use "aliases" instead of full input names. Aliases then need to be specified within each input sub-section. In our example, the dataset uses the aliases "a" and "b", which are mapped to "Some_very_long_input_name" and "Some_other_also_very_long_input_name" respectively in the reader config section.
As already mentioned, each line in the input file, terminated by \n or \r\n, represents a sequence containing a single sample for each input. If a line is prefixed with a number, the number is used as the corresponding sequence id. All subsequent lines that share the same sequence id become part of the same sequence. Therefore, repeating the same numerical prefix for N lines builds up a multi-sample sequence, with each input containing between 1 and N samples. Omitting the sequence prefix on the second and following lines has the same effect. Thus, the example dataset above defines five sequences with ids 100, 200, 333, 400 and 500.
Setting the skipSequenceIds parameter in the reader section to true forces the reader to ignore all explicit sequence ids in the dataset and treat separate lines as individual sequences. Omitting the sequence id on the first line in the dataset has the same effect: all subsequent sequence ids are ignored and lines are treated as individual sequences, as in this example:
|a 1 2 3 |b 100 200
100 |a 4 5 6 |b 101 201
200 |b 102983 14532 |a 7 8 9
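The grouping rules described above can be sketched in a few lines. This illustrates the rules as stated on this page and is not the reader's code: `leading_id` and `group_sequences` are hypothetical names, and using a zero-based line number as the id of an unnumbered line is an assumption.

```python
def leading_id(line):
    """Return the numeric prefix of a line, or None if the line starts with a pipe."""
    head = line.split("|", 1)[0].strip()
    return int(head) if head else None


def group_sequences(lines, skip_sequence_ids=False):
    """Group dataset lines into a list of (sequence_id, [lines]) in file order."""
    lines = [line for line in lines if line.strip()]
    # Explicit ids are ignored when requested, or when the first line has none.
    ignore_ids = skip_sequence_ids or (bool(lines) and leading_id(lines[0]) is None)

    sequences = []
    for number, line in enumerate(lines):
        if ignore_ids:
            # Every line is its own sequence (assumed zero-based line number as id).
            sequences.append((number, [line]))
            continue
        sequence_id = leading_id(line)
        # A missing prefix or a repeated prefix continues the current sequence.
        if sequences and (sequence_id is None or sequences[-1][0] == sequence_id):
            sequences[-1][1].append(line)
        else:
            sequences.append((sequence_id, [line]))
    return sequences


if __name__ == "__main__":
    data = [
        "400 |a 1 2 3 |b 100 200",
        "|a 4 5 6 |b 101 201",
        "|a 4 5 6 |b 101 201",
        "500 |a 1 2 3 |b 100 200",
    ]
    for sequence_id, members in group_sequences(data):
        print(sequence_id, len(members))
```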
A few final things to consider when using sequences:
- Sequence ids must be unique.
- Id prefixes can only be repeated for consecutive lines.
- Sequence length in lines (that is, the number of lines sharing the same id prefix) must not exceed the maximum input length in samples (the number of samples in an input) in this sequence.
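For example, the following datasets are invalid. In the first one, the id 100 is repeated on non-consecutive lines:
100 |a 1 2 3 |b 100 200
200 |a 4 5 6 |b 101 201
100 |b 102983 14532 |a 7 8 9
In the second one, the sequence 456 spans two lines, but each of its inputs contains only one sample:
123 |a 1 2 3 |b 100 200
456 |a 4 5 6
456 |b 101 201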
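The rules above lend themselves to a quick pre-flight check before training. The following is a hedged sketch, not a CNTK utility: `find_sequence_errors` is a hypothetical helper, and it assumes a dataset that uses explicit sequence ids starting from the first line.

```python
from collections import Counter


def find_sequence_errors(lines):
    """Return messages describing violated sequence rules."""
    # Step 1: group consecutive lines that share an id (or omit it).
    groups = []
    for line in lines:
        head = line.split("|", 1)[0].strip()
        sequence_id = int(head) if head else None
        if groups and sequence_id in (None, groups[-1][0]):
            groups[-1][1].append(line)
        else:
            groups.append([sequence_id, [line]])

    # Step 2: check each group against the rules.
    errors = []
    seen = set()
    for sequence_id, members in groups:
        if sequence_id in seen:
            errors.append(f"id {sequence_id} is repeated on non-consecutive lines")
        seen.add(sequence_id)

        # Each "|name" occurrence on a line contributes one sample to that input.
        samples = Counter(
            chunk.split()[0]
            for line in members
            for chunk in line.split("|")[1:]
            if chunk.split()
        )
        longest = max(samples.values(), default=0)
        if len(members) > longest:
            errors.append(
                f"sequence {sequence_id} has {len(members)} lines, "
                f"but its longest input has only {longest} samples"
            )
    return errors


if __name__ == "__main__":
    print(find_sequence_errors([
        "123 |a 1 2 3 |b 100 200",
        "456 |a 4 5 6",
        "456 |b 101 201",
    ]))
```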
- Classification: Every line contains a sample, consisting of a label and features. No sequence ID needed, since every line is its own "sequence" of length 1.
|class 23:1 |features 2 3 4 5 6
|class 13:1 |features 1 2 0 2 3
...
- DSSM: Every line contains a source-target document pair, expressed through a bag of words, encoded as sparse vectors.
|src 12:1 23:1 345:2 45001:1 |tgt 233:1 766:2 234:1
|src 123:1 56:1 10324:1 18001:3 |tgt 233:1 2344:2 8889:1 2234:1 253434:1
- Part-of-speech tagging: Sequences mapping every element to a corresponding label. The sequences are aligned vertically (one word + tag per line).
0 |word 234:1 |tag 12:1
0 |word 123:1 |tag 10:1
0 |word 123:1 |tag 13:1
1 |word 234:1 |tag 12:1
1 |word 123:1 |tag 10:1
...
- Sequence classification: Sequences mapped to a single label. Sequences are aligned vertically; the "class" label can occur in any line that has the same sequence id.
NOTE: At the moment the number of lines must not exceed the length of the longest sequence. This means that the label cannot appear on a line on its own. This is an implementation detail that will be lifted in the future.
0 |word 234:1 |class 3:1
0 |word 123:1
0 |word 890:1
1 |word 11:1 |class 2:1
1 |word 344:1
- Sequence to sequence: Map a source sequence to a target sequence. The two sequences are aligned vertically and, in the easiest case, just printed one after the other. They are joined by having the same overall "sequence ID" (which then becomes a "work unit ID" in this case).
NOTE: At the moment the number of lines must not exceed the length of the longest sequence. This means that sequences must be aligned horizontally. This is an implementation detail that will be lifted in the future.
0 |sourceWord 234:1 |targetWord 344:1
0 |sourceWord 123:1 |targetWord 456:1
0 |sourceWord 123:1 |targetWord 2222:1
0 |sourceWord 11:1
1 |sourceWord 123:1
...
- Learning to Rank: A "sequence" represents a query, every sample a document with a hand-labeled rating. In this case the "sequence" is just a multiset that (in the context of a learning-to-rank loss function) doesn't have an ordering.
0 |rating 4 |features 23 35 0 0 0 21 2345 0 0 0 0 0
0 |rating 2 |features 0 123 0 22 44 44 290 22 22 22 33 0
0 |rating 1 |features 0 0 0 0 0 0 1 0 0 0 0 0
1 |rating 1 |features 34 56 0 0 0 45 1312 0 0 0 0 0
1 |rating 0 |features 45 45 0 0 0 12 335 0 0 0 0 0
2 |rating 0 |features 0 0 0 0 0 0 22 0 0 0 0 0
...
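As an illustration of how such files might be produced, the sketch below writes the sequence classification layout shown above from in-memory data. `sequence_classification_lines` is a hypothetical helper; the input names word and class are taken from that example, and placing the label on the first line of each sequence is just one of the placements the format allows.

```python
def sequence_classification_lines(sequences):
    """Yield text-format lines for (word_ids, class_id) pairs.

    Each word becomes a one-hot sparse sample on its own line; the class label
    is attached to the first line, so it never sits on a line by itself.
    """
    for sequence_id, (word_ids, class_id) in enumerate(sequences):
        for position, word_id in enumerate(word_ids):
            line = f"{sequence_id} |word {word_id}:1"
            if position == 0:
                line += f" |class {class_id}:1"
            yield line + "\n"


if __name__ == "__main__":
    data = [([234, 123, 890], 3), ([11, 344], 2)]
    print("".join(sequence_classification_lines(data)), end="")
```

Note that a pair with an empty word list produces no lines at all in this sketch, so its label would be dropped.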
Parameter | Mandatory | Accepted values | Default value | Description |
---|---|---|---|---|
precision | No | double, float | float | Specifies the floating point precision of the input values |
Parameter | Mandatory | Accepted values | Default value | Description |
---|---|---|---|---|
readerType | Yes | One of the supported CNTK readers | | Specifies the reader flavor to load (e.g., CNTKTextFormatReader) |
file | Yes | File path | | Path to the file containing the input dataset (Windows or Linux style) |
randomize | No | true, false | true | Specifies whether the input should be randomized |
randomizationWindow | No | Positive integer | | Specifies the randomization range (in number of samples) [1]. This controls how much of the dataset resides in memory. |
skipSequenceIds | No | true, false | false | If true, the reader will ignore sequence IDs in the input file (see the description of the input format above) |
maxErrors | No | Positive integer | 0 | Number of input errors after which an exception should be raised. By default, the first malformed value will trigger an exception |
traceLevel | No | 0, 1, 2 | 1 | Output verbosity level. 0 - show only errors; 1 - show errors and warnings; 2 - show all output [2] |
chunkSizeInBytes | No | Positive integer | 33554432 | Number of consecutive bytes to read from disk in a single read operation (default is 32MB) |
keepDataInMemory | No | true, false | false | If true, the whole dataset will be cached in memory |
frameMode | No | true, false | false | true signals the reader to use a packing method optimized for frames (single sample sequences) |
[1] If no randomizationWindow is specified, the randomization range is set to be equal to the size of the dataset (i.e., the input is randomized across the whole dataset). randomizationWindow is ignored when randomize is set to false.
[2] In order to force the reader to show a warning for each input error it ignores (in the sense of not raising an exception), a non-default maxErrors value should be used in combination with traceLevel set to 1 or above.
The input sub-section combines a number of individual inputs, each with an appropriately labeled configuration sub-section. All parameters described below are specific to the input name sub-section associated with a particular input.
Parameter | Mandatory | Accepted values | Description |
---|---|---|---|
alias | No | String | An alternative shorthand name used in the input file |
format | Yes | dense, sparse | Specifies input type |
dim | Yes | Positive integer | Dimension of the input value (i.e., number of input values in a sample for dense input, upper bound on the index range for sparse input) |
You will find complete network definitions and the corresponding data set examples in the CNTK Repository. There you will also find an End-to-End Test that uses the CNTKTextFormat reader.