In Codice Ratio

Synthetic dataset generation for sequence prediction models

Requirements

numpy
cv2
networkx
matplotlib

Usage

Set the dataset, corpus and destination paths in the generate_textlines.py main, then run it to generate synthetic line images and their transcription.

Dataset folder expects structure dataset_folder/{character classes}/character_images.png.

Corpus folder expects structure corpus_folder/{text files}.txt

Files in the destination folder will be of the type destination_folder/{i.png, i.txt} for each line generated.

json file abbr_matchings.json maps text to sequences of symbols in the dataset.

Model training

Requirements

tensor2tensor

Usage

Preprocess generated data (synthetic dataset in the form of i.png, i.txt couples must be in $TMP_DIR/ocr) and put it into $DATA_DIR, using custom problem definition (in t2t_usr):

$ t2t-datagen \
    --t2t_usr_dir=t2t_usr \
    --problem=ocr_latin \
    --tmp_dir=$TMP_DIR \
    --data_dir=$DATA_DIR

Train the transformer_sketch model on the generated dataset, using custom problem definition (in t2t_usr):

$ t2t-trainer \
    --t2t_usr_dir=t2t_usr \
    --problem=ocr_latin \
    --tmp_dir=$TMP_DIR/ocr \
    --data_dir=$DATA_DIR \
    --model=transformer_sketch \
    --hparams_set=transformer_small_sketch \
    --output_dir=$OUTPUT_DIR

Author

This project was developed by Elena Nieddu during Pi School's AI programme in Fall 2017.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

In Codice Ratio

Synthetic dataset generation for sequence prediction models

Requirements

Usage

Model training

Requirements

Usage

Author

Files

README.md

Latest commit

History

README.md

File metadata and controls

In Codice Ratio

Synthetic dataset generation for sequence prediction models

Requirements

Usage

Model training

Requirements

Usage

Author