This repository contains the Project Gutenberg data used in our paper, along with the code to reproduce its results.
dataLoader.py
loads the data to be processed and makes it available for the subsequent steps in the pipeline.
python dataLoader.py
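For reference, a minimal sketch of what such a loader might look like is shown below; the UTF-8 encoding and the command-line interface are assumptions, not necessarily how dataLoader.py is implemented.

# Minimal sketch of a plain-text loader (assumes UTF-8 encoded input;
# not necessarily how dataLoader.py is implemented).
import sys
from pathlib import Path

def load_text(file_path):
    """Read a text file and return its contents as a single string."""
    return Path(file_path).read_text(encoding="utf-8")

if __name__ == "__main__":
    text = load_text(sys.argv[1])
    print(f"Loaded {len(text)} characters")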
truncateData.py
truncates the loaded data to a desired length (default: 64 characters), which is useful when you want to work with a subset of a large dataset. Only use this on preprocessed text (character-level, with underscores as word separators).
python truncateData.py file_path word_length
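As a rough illustration, truncation of character-level, underscore-separated text could look like the sketch below; details such as whether truncateData.py prints to stdout or writes a new file are assumptions.

# Sketch of truncation for character-level text (space-separated characters,
# "_" as the word separator); output handling is an assumption.
import sys

def truncate_line(line, length=64):
    """Keep only the first `length` character tokens of a line."""
    tokens = line.strip().split(" ")
    return " ".join(tokens[:length])

if __name__ == "__main__":
    file_path = sys.argv[1]
    word_length = int(sys.argv[2]) if len(sys.argv) > 2 else 64
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            print(truncate_line(line, word_length))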
preprocessData.py
truncates the loaded data to a desired length (default: 64 characters), lowercases all characters, strips punctuation, replaces whitespace with underscores, and converts the text to character level; these are the preprocessing steps required to run a character-level transformer model. Only use this on unpreprocessed text.
python preprocessData.py file_path
or with a custom word length:
python preprocessData.py file_path word_length
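The steps described above could be sketched as follows; which punctuation characters are removed and how the output is written are assumptions, not the exact behaviour of preprocessData.py.

# Sketch of the preprocessing described above: lowercase, strip punctuation,
# replace whitespace with "_", split into characters, truncate to `length`.
# The punctuation set and output handling are assumptions.
import string
import sys

def preprocess_line(line, length=64):
    text = line.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = "_".join(text.split())        # whitespace -> underscore
    chars = list(text)[:length]          # character level + truncation
    return " ".join(chars)

if __name__ == "__main__":
    file_path = sys.argv[1]
    word_length = int(sys.argv[2]) if len(sys.argv) > 2 else 64
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            print(preprocess_line(line, word_length))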
Tokenizer.py
tokenizes a given file, breaking the text into its constituent tokens. It is useful for steps in the pipeline that require tokenized input.
python Tokenizer.py file_path
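A minimal whitespace tokenizer is sketched below; the actual scheme used by Tokenizer.py (for example character-level or subword units) may differ.

# Sketch of a simple whitespace tokenizer; Tokenizer.py may use a different
# tokenization scheme (e.g. character-level or subword units).
import sys

def tokenize(text):
    return text.split()

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as f:
        for line in f:
            print(" ".join(tokenize(line)))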
Run setUp.sh to create the configuration file for training, then run modifyConfig.py to change parameters as needed.
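For example, a parameter could be changed programmatically along these lines; the config path (01-tfm-deen/conf.yml) and the trainer.steps key are illustrative assumptions about the experiment layout, not guaranteed to match what setUp.sh and modifyConfig.py actually produce.

# Sketch of editing a YAML training configuration (requires PyYAML).
# The path and the key names are illustrative assumptions.
import yaml

conf_path = "01-tfm-deen/conf.yml"   # hypothetical experiment config path

with open(conf_path, encoding="utf-8") as f:
    conf = yaml.safe_load(f)

conf["trainer"]["steps"] = 100000    # hypothetical parameter change

with open(conf_path, "w", encoding="utf-8") as f:
    yaml.safe_dump(conf, f)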
Run the following command in a terminal to start training:
rtg-pipe 01-tfm-deen --gpu-only
After the model is trained, run the following command to decode additional test sets:
rtg-decode 01-tfm-deen -if [input_file_path] -of [output_file_path]
evaluation.py
evaluates the performance of the model based on the TER (Translation Edit Rate) score.
python evaluation.py decipher_path plain_text_path
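For reference, a TER-style score can be approximated as edit distance normalized by reference length, as sketched below; the actual evaluation.py may compute TER differently (for example including shift operations or using an external library).

# Sketch of a TER-style score (edit distance normalized by reference length);
# evaluation.py may compute TER differently, e.g. with shifts or a library.
import sys

def edit_distance(hyp, ref):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (h != r)))  # substitution
        prev = curr
    return prev[-1]

def ter(hyp_line, ref_line):
    hyp, ref = hyp_line.split(), ref_line.split()
    return edit_distance(hyp, ref) / max(len(ref), 1)

if __name__ == "__main__":
    decipher_path, plain_text_path = sys.argv[1], sys.argv[2]
    with open(decipher_path, encoding="utf-8") as h, open(plain_text_path, encoding="utf-8") as r:
        scores = [ter(a, b) for a, b in zip(h, r)]
    print(f"Mean TER: {sum(scores) / len(scores):.4f}")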