Add preprocess script to use words as tokens with typo and rare word reduction #132

Open · wants to merge 12 commits into master
18 changes: 17 additions & 1 deletion LanguageModel.lua
@@ -162,12 +162,28 @@ function LM:sample(kwargs)
  local verbose = utils.get_kwarg(kwargs, 'verbose', 0)
  local sample = utils.get_kwarg(kwargs, 'sample', 1)
  local temperature = utils.get_kwarg(kwargs, 'temperature', 1)
+  local start_tokens = utils.get_kwarg(kwargs,'start_tokens','')

  local sampled = torch.LongTensor(1, T)
  self:resetStates()

  local scores, first_t
-  if #start_text > 0 then
+  if #start_tokens > 0 then
+    local json_tokens = utils.read_json(start_tokens)
+
+    local num_tokens = table.getn(json_tokens.tokens)
+
+    local tokenTensor = torch.LongTensor(num_tokens)
+    for i = 1,num_tokens do
+      tokenTensor[i] = json_tokens.tokens[i]
+    end
+
+    local x = tokenTensor:view(1,-1)
+    local T0 = x:size(2)
+    sampled[{{}, {1, T0}}]:copy(x)
+    scores = self:forward(x)[{{}, {T0, T0}}]
+    first_t = T0 + 1
+  elseif #start_text > 0 then
    if verbose > 0 then
      print('Seeding with: "' .. start_text .. '"')
    end
29 changes: 27 additions & 2 deletions README.md
@@ -1,5 +1,5 @@
# torch-rnn
-torch-rnn provides high-performance, reusable RNN and LSTM modules for torch7, and uses these modules for character-level
+torch-rnn provides high-performance, reusable RNN and LSTM modules for torch7, and uses these modules for character-level and word-level
language modeling similar to [char-rnn](https://github.com/karpathy/char-rnn).

You can find documentation for the RNN and LSTM modules [here](doc/modules.md); they have no dependencies other than `torch`
@@ -92,7 +92,7 @@ Jeff Thompson has written a very detailed installation guide for OSX that you [c
To train a model and use it to generate new text, you'll need to follow three simple steps:

## Step 1: Preprocess the data
-You can use any text file for training models. Before training, you'll need to preprocess the data using the script
+You can use any text file or folder of .txt files for training models. Before training, you'll need to preprocess the data using the script
`scripts/preprocess.py`; this will generate an HDF5 file and JSON file containing a preprocessed version of the data.

If you have training data stored in `my_data.txt`, you can run the script like this:
@@ -104,10 +104,29 @@ python scripts/preprocess.py \
--output_json my_data.json
```

If you instead have multiple .txt files in the folder `my_data`, you can run the script like this:

```bash
python scripts/preprocess.py \
--input_folder my_data
```

This will produce files `my_data.h5` and `my_data.json` that will be passed to the training script.

There are a few more flags you can use to configure preprocessing; [read about them here](doc/flags.md#preprocessing)

### Preprocess Word Tokens
To preprocess the input data with words as tokens, add the flag `--use_words`.

A large text corpus will contain many rare words, usually typos or unusual names. Adding a token for each of these is impractical and can result in a very large token space. The options `--min_occurrences` and `--min_documents` let you specify how many times, or in how many documents, a word must occur before it is given a token. Words that fail to meet these criteria are replaced by wildcard tokens, which are distributed randomly to avoid overtraining.

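As a rough sketch, a word-level preprocessing run over the `my_data` folder might look like this (the thresholds shown are just the documented defaults; tune them for your corpus):

```bash
python scripts/preprocess.py \
  --input_folder my_data \
  --use_words \
  --min_occurrences 20 \
  --min_documents 1 \
  --output_h5 my_data.h5 \
  --output_json my_data.json
```

Any word that misses either threshold is replaced by a wildcard token as described above.
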
More information on additional flags is available [here](doc/flags.md#preprocessing)

### Preprocess Data With Existing Token Schema
If you have an existing token schema (.json file generated by preprocess.py), you can use the script `scripts/tokenize.py` to tokenize a file based on that schema. It accepts input as a text file or folder of text files (similar to the preprocessing script), as well as an argument `--input_json` which specifies the input token schema file. This is useful for transfer learning onto a new dataset.

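As a minimal sketch (the paths are placeholders; the flags are those listed in the tokenizing section of `doc/flags.md`), re-tokenizing a new folder of text against the schema produced above might look like:

```bash
python scripts/tokenize.py \
  --input_folder new_data \
  --input_json my_data.json \
  --output_h5 new_data.h5 \
  --output_json new_data_tokens.json
```

The resulting HDF5 and JSON files can then be used for training, as with the output of the preprocessing script.
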
To learn more about the tokenizing script [see here](doc/flags.md#tokenizing).

## Step 2: Train the model
After preprocessing the data, you'll need to train the model using the `train.lua` script. This will be the slowest step.
You can run the training script like this:
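(a sketch, assuming the training script's standard `-input_h5` and `-input_json` input flags and the files produced in Step 1)

```bash
th train.lua -input_h5 my_data.h5 -input_json my_data.json
```
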
@@ -144,6 +163,12 @@ and print the results to the console.
By default the sampling script will run in GPU mode using CUDA; to run in CPU-only mode add the flag `-gpu -1` and
to run in OpenCL mode add the flag `-gpu_backend opencl`.

To pre-seed the model with text, there are two options. If you used character-based preprocessing, use the flag `-start_text` with a quoted string.

If you used word-based preprocessing, use the Python script `scripts/tokenize.py` to generate a JSON file of tokens and provide it with the flag `-start_tokens`. Since Python was used to parse the input data into tokens, it is best to use Python for the seed text as well; Lua does not have full regex support, hence the extra step.

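For example, with a word-token schema in `my_data.json` and a trained checkpoint (both paths are placeholders), seeding a sample might look like this:

```bash
# Tokenize the seed text using the same schema that was used for preprocessing
python scripts/tokenize.py \
  --input_str "to be or not to be" \
  --input_json my_data.json \
  --output_json seed_tokens.json

# Sample from the trained model, seeded with the tokenized text
th sample.lua -checkpoint cv/checkpoint_10000.t7 -start_tokens seed_tokens.json
```
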
To learn more about the tokenizing script [see here](doc/flags.md#tokenizing).

There are more flags you can use to configure sampling; [read about them here](doc/flags.md#sampling).

# Benchmarks
25 changes: 24 additions & 1 deletion doc/flags.md
@@ -3,12 +3,22 @@ Here we'll describe in detail the full set of command line flags available for p
# Preprocessing
The preprocessing script `scripts/preprocess.py` accepts the following command-line flags:
- `--input_txt`: Path to the text file to be used for training. Default is the `tiny-shakespeare.txt` dataset.
- `--input_folder`: Path to a folder containing .txt files to use for training. Overrides the `--input_txt` option
- `--output_h5`: Path to the HDF5 file where preprocessed data should be written.
- `--output_json`: Path to the JSON file where preprocessed data should be written.
- `--val_frac`: What fraction of the data to use as a validation set; default is `0.1`.
- `--test_frac`: What fraction of the data to use as a test set; default is `0.1`.
- `--quiet`: If you pass this flag then no output will be printed to the console.
- `--use_words`: Passing this flag preprocesses the input as word tokens rather than characters. Using it activates the additional word-token options below (ignored otherwise).

## Preprocessing Word Tokens
- `--case_sensitive`: Makes word tokens case-sensitive. By default all words are converted to lowercase; character tokens are always case-sensitive.
- `--min_occurrences`: Minimum number of times a word needs to be seen to be given a token. Default is 20.
- `--min_documents`: Minimum number of documents a word needs to be seen in to be given a token. Default is 1.
- `--use_ascii`: Convert the input files to ASCII by removing all non-ASCII characters. Default is Unicode.
- `--wildcard_rate`: Number of wildcards to generate, as a fraction of the ignored words; for example, `0.01` generates wildcards equal to 1 percent of the number of ignored words. Default is `0.01` (see the example after this list).
- `--wildcard_max`: If set, the maximum number of wildcards that will be generated. Default is unlimited.
- `--wildcard_min`: Minimum number of wildcards that will be generated. Cannot be less than 1. Default is 10.

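A sketch combining these options (the values are illustrative only, not recommendations):

```bash
python scripts/preprocess.py \
  --input_txt my_data.txt \
  --use_words \
  --case_sensitive \
  --min_occurrences 5 \
  --wildcard_rate 0.01 \
  --wildcard_max 500 \
  --output_h5 my_data.h5 \
  --output_json my_data.json
```

Here the number of wildcard tokens is about 1 percent of the ignored words, capped at 500.
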
# Training
The training script `train.lua` accepts the following command-line flags:
@@ -51,9 +61,22 @@ The training script `train.lua` accepts the following command-line flags:
The sampling script `sample.lua` accepts the following command-line flags:
- `-checkpoint`: Path to a `.t7` checkpoint file from `train.lua`
- `-length`: The length of the generated text, in characters.
-- `-start_text`: You can optionally start off the generation process with a string; if this is provided the start text will be processed by the trained network before we start sampling. Without this flag, the first character is chosen randomly.
+- `-start_text`: You can optionally start off the generation process with a string; if this is provided the start text will be processed by the trained network before we start sampling. Without this flag or the `-start_tokens` flag, the first character is chosen randomly.
+- `-start_tokens`: An alternative to `-start_text` for word-based tokenizing; accepts a JSON file generated by `scripts/tokenize.py` which contains the tokens for the start text. Without this flag or the `-start_text` flag, the first character is chosen randomly.
- `-sample`: Set this to 1 to sample from the next-character distribution at each timestep; set to 0 to instead just pick the argmax at every timestep. Sampling tends to produce more interesting results.
- `-temperature`: Softmax temperature to use when sampling; default is 1. Higher temperatures give noisier samples. Not used when using argmax sampling (`sample` set to 0).
- `-gpu`: The ID of the GPU to use (zero-indexed). Default is 0. Set this to -1 to run in CPU-only mode.
- `-gpu_backend`: The GPU backend to use; either `cuda` or `opencl`. Default is `cuda`.
- `-verbose`: By default just the sampled text is printed to the console. Set this to 1 to also print some diagnostic information.

# Tokenizing
The tokenizing script `scripts/tokenizeWords.py` accepts the following command-line flags:
- `--input_str`: A string to tokenize, passed as a quoted string, e.g. `--input_str "lorem ipsum"`.
- `--input_txt`: Path to a text file to tokenize. Default is the `tiny-shakespeare.txt` dataset.
- `--input_folder`: Path to a folder containing .txt files to tokenize. Overrides the `--input_txt` option.
- `--input_json`: The JSON output from `scripts/preprocessWords.py` to use when tokenizing the input.
- `--output_json`: Optional - The output JSON file to save the tokenization to.
- `--output_h5`: Optional - The path to the HDF5 file where preprocessed data should be written.
- `--val_frac`: What fraction of the data to use as a validation set; default is `0.1`.
- `--test_frac`: What fraction of the data to use as a test set; default is `0.1`.
- `--quiet`: If you pass this flag then no output will be printed to the console except in case of error.
1 change: 1 addition & 0 deletions sample.lua
@@ -8,6 +8,7 @@ local cmd = torch.CmdLine()
cmd:option('-checkpoint', 'cv/checkpoint_4000.t7')
cmd:option('-length', 2000)
cmd:option('-start_text', '')
+cmd:option('-start_tokens','')
cmd:option('-sample', 1)
cmd:option('-temperature', 1)
cmd:option('-gpu', 0)