Add preprocess script to use words as tokens with typo and rare word reduction #132

Open · wants to merge 12 commits into master
18 changes: 17 additions & 1 deletion LanguageModel.lua
@@ -162,12 +162,28 @@ function LM:sample(kwargs)
  local verbose = utils.get_kwarg(kwargs, 'verbose', 0)
  local sample = utils.get_kwarg(kwargs, 'sample', 1)
  local temperature = utils.get_kwarg(kwargs, 'temperature', 1)
+  local start_tokens = utils.get_kwarg(kwargs,'start_tokens','')

  local sampled = torch.LongTensor(1, T)
  self:resetStates()

  local scores, first_t
-  if #start_text > 0 then
+  if #start_tokens > 0 then
+    local json_tokens = utils.read_json(start_tokens)
+
+    local num_tokens = table.getn(json_tokens.tokens)
+
+    local tokenTensor = torch.LongTensor(num_tokens)
+    for i = 1,num_tokens do
+      tokenTensor[i] = json_tokens.tokens[i]
+    end
+
+    local x = tokenTensor:view(1,-1)
+    local T0 = x:size(2)
+    sampled[{{}, {1, T0}}]:copy(x)
+    scores = self:forward(x)[{{}, {T0, T0}}]
+    first_t = T0 + 1
+  elseif #start_text > 0 then
    if verbose > 0 then
      print('Seeding with: "' .. start_text .. '"')
    end
29 changes: 27 additions & 2 deletions README.md
@@ -1,5 +1,5 @@
# torch-rnn
-torch-rnn provides high-performance, reusable RNN and LSTM modules for torch7, and uses these modules for character-level
+torch-rnn provides high-performance, reusable RNN and LSTM modules for torch7, and uses these modules for character-level and word-level
language modeling similar to [char-rnn](https://github.com/karpathy/char-rnn).

You can find documentation for the RNN and LSTM modules [here](doc/modules.md); they have no dependencies other than `torch`
@@ -92,7 +92,7 @@ Jeff Thompson has written a very detailed installation guide for OSX that you [c
To train a model and use it to generate new text, you'll need to follow three simple steps:

## Step 1: Preprocess the data
-You can use any text file for training models. Before training, you'll need to preprocess the data using the script
+You can use any text file or folder of .txt files for training models. Before training, you'll need to preprocess the data using the script
`scripts/preprocess.py`; this will generate an HDF5 file and JSON file containing a preprocessed version of the data.

If you have training data stored in `my_data.txt`, you can run the script like this:
@@ -104,10 +104,29 @@ python scripts/preprocess.py \
--output_json my_data.json
```

If you instead have multiple .txt files in the folder `my_data`, you can run the script like this:

```bash
python scripts/preprocess.py \
--input_folder my_data
```

This will produce files `my_data.h5` and `my_data.json` that will be passed to the training script.

There are a few more flags you can use to configure preprocessing; [read about them here](doc/flags.md#preprocessing)

### Preprocess Word Tokens
To preprocess the input data with words as tokens, add the flag `--use_words`.

A large text corpus will contain many rare words, usually typos or unusual names. Adding a token for each of these is impractical and can result in a very large token space. The options `--min_occurrences` and `--min_documents` let you specify how many times, or in how many documents, a word must occur before it is given a token. Words that fail to meet these criteria are replaced by wildcard tokens, which are distributed randomly to avoid overtraining.

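As a rough sketch, a word-level preprocessing run over the `my_data` folder might look like this (the thresholds shown are just the documented defaults; tune them for your corpus):

```bash
python scripts/preprocess.py \
  --input_folder my_data \
  --use_words \
  --min_occurrences 20 \
  --min_documents 1 \
  --output_h5 my_data.h5 \
  --output_json my_data.json
```

Any word that misses either threshold is replaced by a wildcard token as described above.
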
More information on additional flags is available [here](doc/flags.md#preprocessing)

### Preprocess Data With Existing Token Schema
If you have an existing token schema (.json file generated by preprocess.py), you can use the script `scripts/tokenize.py` to tokenize a file based on that schema. It accepts input as a text file or folder of text files (similar to the preprocessing script), as well as an argument `--input_json` which specifies the input token schema file. This is useful for transfer learning onto a new dataset.

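As a minimal sketch (the paths are placeholders; the flags are those listed in the tokenizing section of `doc/flags.md`), re-tokenizing a new folder of text against the schema produced above might look like:

```bash
python scripts/tokenize.py \
  --input_folder new_data \
  --input_json my_data.json \
  --output_h5 new_data.h5 \
  --output_json new_data_tokens.json
```

The resulting HDF5 and JSON files can then be used for training, as with the output of the preprocessing script.
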
To learn more about the tokenizing script [see here](doc/flags.md#tokenizing).

## Step 2: Train the model
After preprocessing the data, you'll need to train the model using the `train.lua` script. This will be the slowest step.
You can run the training script like this:
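(a sketch, assuming the training script's standard `-input_h5` and `-input_json` input flags and the files produced in Step 1)

```bash
th train.lua -input_h5 my_data.h5 -input_json my_data.json
```
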
@@ -144,6 +163,12 @@ and print the results to the console.
By default the sampling script will run in GPU mode using CUDA; to run in CPU-only mode add the flag `-gpu -1` and
to run in OpenCL mode add the flag `-gpu_backend opencl`.

To pre-seed the model with text, there are two options. If you used character-based preprocessing, use the flag `-start_text` with a quoted string.

If you used word-based preprocessing, use the Python script `scripts/tokenize.py` to generate a JSON file of tokens and provide it with the flag `-start_tokens`. Since Python was used to parse the input data into tokens, it is best to use Python for the seed text as well; Lua does not have full regex support, hence the extra step.

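For example, with a word-token schema in `my_data.json` and a trained checkpoint (both paths are placeholders), seeding a sample might look like this:

```bash
# Tokenize the seed text using the same schema that was used for preprocessing
python scripts/tokenize.py \
  --input_str "to be or not to be" \
  --input_json my_data.json \
  --output_json seed_tokens.json

# Sample from the trained model, seeded with the tokenized text
th sample.lua -checkpoint cv/checkpoint_10000.t7 -start_tokens seed_tokens.json
```
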
To learn more about the tokenizing script [see here](doc/flags.md#tokenizing).

There are more flags you can use to configure sampling; [read about them here](doc/flags.md#sampling).

# Benchmarks
25 changes: 24 additions & 1 deletion doc/flags.md
@@ -3,12 +3,22 @@ Here we'll describe in detail the full set of command line flags available for p
# Preprocessing
The preprocessing script `scripts/preprocess.py` accepts the following command-line flags:
- `--input_txt`: Path to the text file to be used for training. Default is the `tiny-shakespeare.txt` dataset.
- `--input_folder`: Path to a folder containing .txt files to use for training. Overrides the `--input_txt` option
- `--output_h5`: Path to the HDF5 file where preprocessed data should be written.
- `--output_json`: Path to the JSON file where preprocessed data should be written.
- `--val_frac`: What fraction of the data to use as a validation set; default is `0.1`.
- `--test_frac`: What fraction of the data to use as a test set; default is `0.1`.
- `--quiet`: If you pass this flag then no output will be printed to the console.
- `--use_words`: Passing this flag preprocesses the input as word tokens rather than characters. Using it activates the additional word-token options below (ignored otherwise).

## Preprocessing Word Tokens
- `--case_sensitive`: Makes word tokens case-sensitive. By default all words are converted to lowercase; character tokens are always case-sensitive.
- `--min_occurrences`: Minimum number of times a word needs to be seen to be given a token. Default is 20.
- `--min_documents`: Minimum number of documents a word needs to be seen in to be given a token. Default is 1.
- `--use_ascii`: Convert the input files to ASCII by removing all non-ASCII characters. Default is Unicode.
- `--wildcard_rate`: Number of wildcards to generate, as a fraction of the ignored words; for example, `0.01` generates wildcards equal to 1 percent of the number of ignored words. Default is `0.01` (see the example after this list).
- `--wildcard_max`: If set, the maximum number of wildcards that will be generated. Default is unlimited.
- `--wildcard_min`: Minimum number of wildcards that will be generated. Cannot be less than 1. Default is 10.

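A sketch combining these options (the values are illustrative only, not recommendations):

```bash
python scripts/preprocess.py \
  --input_txt my_data.txt \
  --use_words \
  --case_sensitive \
  --min_occurrences 5 \
  --wildcard_rate 0.01 \
  --wildcard_max 500 \
  --output_h5 my_data.h5 \
  --output_json my_data.json
```

Here the number of wildcard tokens is about 1 percent of the ignored words, capped at 500.
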
# Training
The training script `train.lua` accepts the following command-line flags:
@@ -51,9 +61,22 @@ The training script `train.lua` accepts the following command-line flags:
The sampling script `sample.lua` accepts the following command-line flags:
- `-checkpoint`: Path to a `.t7` checkpoint file from `train.lua`
- `-length`: The length of the generated text, in characters.
-- `-start_text`: You can optionally start off the generation process with a string; if this is provided the start text will be processed by the trained network before we start sampling. Without this flag, the first character is chosen randomly.
+- `-start_text`: You can optionally start off the generation process with a string; if this is provided the start text will be processed by the trained network before we start sampling. Without this flag or the `-start_tokens` flag, the first character is chosen randomly.
+- `-start_tokens`: An alternative to `-start_text` for word-based tokenizing; accepts a JSON file generated by `scripts/tokenize.py` which contains the tokens for the start text. Without this flag or the `-start_text` flag, the first character is chosen randomly.
- `-sample`: Set this to 1 to sample from the next-character distribution at each timestep; set to 0 to instead just pick the argmax at every timestep. Sampling tends to produce more interesting results.
- `-temperature`: Softmax temperature to use when sampling; default is 1. Higher temperatures give noisier samples. Not used when using argmax sampling (`sample` set to 0).
- `-gpu`: The ID of the GPU to use (zero-indexed). Default is 0. Set this to -1 to run in CPU-only mode.
- `-gpu_backend`: The GPU backend to use; either `cuda` or `opencl`. Default is `cuda`.
- `-verbose`: By default just the sampled text is printed to the console. Set this to 1 to also print some diagnostic information.

# Tokenizing
The tokenizing script `scripts/tokenizeWords.py` accepts the following command-line flags:
- `--input_str`: A string to tokenize, passed as a quoted string, e.g. `--input_str "lorem ipsum"`.
- `--input_txt`: Path to a text file to tokenize. Default is the `tiny-shakespeare.txt` dataset.
- `--input_folder`: Path to a folder containing .txt files to tokenize. Overrides the `--input_txt` option.
- `--input_json`: The JSON output from `scripts/preprocessWords.py` to use when tokenizing the input.
- `--output_json`: Optional - The output JSON file to save the tokenization to.
- `--output_h5`: Optional - The path to the HDF5 file where preprocessed data should be written.
- `--val_frac`: What fraction of the data to use as a validation set; default is `0.1`.
- `--test_frac`: What fraction of the data to use as a test set; default is `0.1`.
- `--quiet`: If you pass this flag then no output will be printed to the console except in case of error.
1 change: 1 addition & 0 deletions sample.lua
@@ -8,6 +8,7 @@ local cmd = torch.CmdLine()
cmd:option('-checkpoint', 'cv/checkpoint_4000.t7')
cmd:option('-length', 2000)
cmd:option('-start_text', '')
+cmd:option('-start_tokens','')
cmd:option('-sample', 1)
cmd:option('-temperature', 1)
cmd:option('-gpu', 0)