Redesign

This page gives an overview of the redesign of the typology project by describing the workflow and defining APIs.

Workflow

Parse data sets
Build absolute sequences
Count absolute sequences
Build sequences that are needed for Modified Kneser-Ney smoothing
Calculate Modified Kneser-Ney smoothed values
Evaluate the resulting language models

APIs

General:

Threads communicate through PipedOutputStreams and PipedInputStreams. Classes therefore take an InputStream and/or an OutputStream as arguments for I/O.
Parameters are passed to the constructor so that run() can be called without parameters.
Classes implement Runnable.
Classes should be run as Threads.
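
The conventions above can be sketched as follows. This is a minimal illustration, not project code: LineProducer, LineCollector and PipeDemo are hypothetical names. Each stage takes its streams in the constructor, implements Runnable, and runs in its own Thread. The two stages are connected by a PipedOutputStream/PipedInputStream pair.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical upstream stage: writes the given lines to its OutputStream. */
final class LineProducer implements Runnable {
    private final List<String> lines;
    private final OutputStream out;

    LineProducer(List<String> lines, OutputStream out) {
        this.lines = lines;
        this.out = out;
    }

    @Override
    public void run() {
        // Closing the writer closes the pipe, which signals end-of-stream downstream.
        try (PrintWriter writer =
                new PrintWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            for (String line : lines) {
                writer.println(line);
            }
        }
    }
}

/** Hypothetical downstream stage: collects every line read from its InputStream. */
final class LineCollector implements Runnable {
    private final InputStream in;
    private final List<String> collected = new ArrayList<>();

    LineCollector(InputStream in) {
        this.in = in;
    }

    @Override
    public void run() {
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                collected.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    List<String> getCollected() {
        return collected;
    }
}

final class PipeDemo {
    /** Connects both stages through a pipe and returns what arrived downstream. */
    static List<String> runPipeline(List<String> lines) {
        try {
            PipedOutputStream pipeOut = new PipedOutputStream();
            PipedInputStream pipeIn = new PipedInputStream(pipeOut);
            LineCollector collector = new LineCollector(pipeIn);
            Thread producer = new Thread(new LineProducer(lines, pipeOut));
            Thread consumer = new Thread(collector);
            producer.start();
            consumer.start();
            producer.join();
            consumer.join(); // join() makes the collected lines visible to this thread
            return collector.getCollected();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Both ends of a pipe must live in different threads. Reading and writing the same pipe from a single thread can deadlock once the pipe's buffer is full.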

Parse data sets

The parser API is a work in progress.

Input:

InputStream that reads the (XML) input file
Supported characters
Optional: split punctuation marks
Optional: add start and end tags

Output:

OutputStream with:
- One sentence per line
- Exactly one whitespace character between words

Constructor:

public XYZParser(InputStream inputStream, OutputStream outputStream, ??? allowedCharacters, boolean splitPunctuationMarks);
public XYZParser(InputStream inputStream, OutputStream outputStream, ??? allowedCharacters, boolean splitPunctuationMarks, String startTag, String endTag);
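
The output format can be illustrated with a small helper. This is only a sketch: SentenceNormalizer and normalize are hypothetical names, the filtering by allowed characters is left out because its type is still open, and treating \p{Punct} as the set of punctuation marks is an assumption.

```java
final class SentenceNormalizer {
    /**
     * Hypothetical helper: brings one sentence into the output format
     * (exactly one space between words, optional start and end tags).
     * Pass null for startTag/endTag to omit the tags.
     */
    static String normalize(String sentence, boolean splitPunctuationMarks,
                            String startTag, String endTag) {
        String result = sentence;
        if (splitPunctuationMarks) {
            // Assumption: punctuation marks are the ASCII \p{Punct} class.
            result = result.replaceAll("(\\p{Punct})", " $1 ");
        }
        // Collapse all runs of whitespace into a single space.
        result = result.trim().replaceAll("\\s+", " ");
        if (startTag != null && endTag != null) {
            result = startTag + " " + result + " " + endTag;
        }
        return result;
    }
}
```

For example, normalize("Hello,  world!", true, "<s>", "</s>") yields "<s> Hello , world ! </s>". The tags are added after punctuation splitting so that the tags themselves are not split.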

Build absolute sequences

Idea: Run this class as a separate Thread for each type of sequence.

But: There is no way to feed the parsed text into all open threads at the same time (too many open writers).

So: Read from a file instead.

Also: IndexBuilder needs to be synchronized so the index is built only once
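
One way to build the index only once while several sequence threads ask for it is a synchronized lazy getter. This is a sketch under assumptions: SharedIndexBuilder is a hypothetical stand-in for the IndexBuilder, the index type is left generic, and the build step is passed in as a Supplier.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for the IndexBuilder: builds its index at most once. */
final class SharedIndexBuilder<T> {
    private final Supplier<T> buildStep;
    private T index;
    private int buildCount;

    SharedIndexBuilder(Supplier<T> buildStep) {
        this.buildStep = buildStep;
    }

    /** The first caller builds the index; later callers wait and reuse it. */
    synchronized T getIndex() {
        if (index == null) {
            index = buildStep.get();
            buildCount++;
        }
        return index;
    }

    synchronized int getBuildCount() {
        return buildCount;
    }

    /** Demo: several threads request the index; returns how often it was built. */
    static int buildCountAfterConcurrentAccess(int threadCount) {
        SharedIndexBuilder<String[]> builder =
                new SharedIndexBuilder<>(() -> new String[] {"a", "m"});
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(builder::getIndex);
            threads[i].start();
        }
        try {
            for (Thread thread : threads) {
                thread.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return builder.getBuildCount();
    }
}
```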

Input:

File to be split (not an InputStream, since an index may have to be built)
Word delimiter (\s, \t ...)
Pattern (integer vs. binary String vs. char/boolean-array)
Name of folder that will be created inside the input file directory (e.g. "absolute")

Output:

Index which is stored in the same directory as the input file
Sequences split to files according to the index (folder structure e.g.: ../absolute/pattern/files)
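
How a pattern could select words is sketched below for the boolean-array variant. The semantics are an assumption, not a decided design: the pattern slides over the sentence word by word, true keeps the word at that position, and false drops it. PatternSequences and extract are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class PatternSequences {
    /**
     * Hypothetical helper: slides the pattern over the sentence and emits
     * one sequence per window, keeping only positions marked true.
     */
    static List<String> extract(String sentence, String delimiterRegex, boolean[] pattern) {
        String[] words = sentence.split(delimiterRegex);
        List<String> sequences = new ArrayList<>();
        for (int start = 0; start + pattern.length <= words.length; start++) {
            StringBuilder sequence = new StringBuilder();
            for (int offset = 0; offset < pattern.length; offset++) {
                if (pattern[offset]) {
                    if (sequence.length() > 0) {
                        sequence.append(' ');
                    }
                    sequence.append(words[start + offset]);
                }
            }
            sequences.add(sequence.toString());
        }
        return sequences;
    }
}
```

With the sentence "a b c d", the delimiter "\\s" and the pattern {true, false, true}, this yields "a c" and "b d".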

Count absolute sequences

Input:

InputStream:
- One sequence per line
- File is smaller than 2GB

Output:

OutputStream:
- Sequence \t count
- Sorted alphabetically (needed for Kneser-Ney aggregation: the files to be aggregated cannot all be held in RAM at the same time)
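
A counter that follows this contract might look like the sketch below. Assumptions: SequenceCounter is a hypothetical name, the whole input fits in memory (presumably the reason for the 2GB limit), and "alphabetically" means the natural String ordering provided by a TreeMap.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical counter: one sequence per input line, "sequence \t count" out. */
final class SequenceCounter implements Runnable {
    private final InputStream in;
    private final OutputStream out;

    SequenceCounter(InputStream in, OutputStream out) {
        this.in = in;
        this.out = out;
    }

    @Override
    public void run() {
        // TreeMap keeps the keys sorted, so the output order comes for free.
        Map<String, Long> counts = new TreeMap<>();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
             BufferedWriter writer =
                     new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    counts.merge(line, 1L, Long::sum);
                }
            }
            for (Map.Entry<String, Long> entry : counts.entrySet()) {
                writer.write(entry.getKey() + "\t" + entry.getValue() + "\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```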

Build sequences that are needed for Modified Kneser-Ney smoothing

Input:

Directory with counted sequences
Pattern that states which columns to keep (e.g. {1,2,3} to keep columns 1, 2, and 3)

Output:

OutputStream with shortened sequences as input for the previous class (Count absolute sequences)
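
Shortening a single line could work as in the following sketch. Assumptions: input lines have the form "sequence \t count", a column is one word of the sequence, column numbers are 1-based as in the {1,2,3} example, and the count is dropped because the shortened sequences are counted again. ColumnFilter and keepColumns are hypothetical names.

```java
final class ColumnFilter {
    /**
     * Hypothetical helper: keeps the given 1-based word columns of a
     * counted line ("w1 w2 w3 ...\tcount") and drops the count.
     */
    static String keepColumns(String countedLine, int[] columns) {
        String sequence = countedLine.split("\t", 2)[0];
        String[] words = sequence.split(" ");
        StringBuilder shortened = new StringBuilder();
        for (int column : columns) {
            if (shortened.length() > 0) {
                shortened.append(' ');
            }
            shortened.append(words[column - 1]); // columns are 1-based
        }
        return shortened.toString();
    }
}
```

For example, keepColumns("a b c d\t7", new int[] {1, 2, 3}) returns "a b c".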

Calculate Modified Kneser-Ney smoothed values

Note: Doing this in parallel would not be useful since the HDD is the bottleneck

Input:

Different InputStreams to be aggregated
...

Output:

OutputStream with smoothed values
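
Because the counted files are sorted, several of them can be aggregated while holding only one line per input in memory. The sketch below shows this for two inputs. What the aggregation has to compute is not specified yet, so summing the counts of equal sequences is only a placeholder. SortedCountMerger is a hypothetical name, and both inputs are assumed to use the "sequence \t count" format in the same sort order.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

final class SortedCountMerger {
    /** Merges two sorted "sequence\tcount" inputs, summing counts of equal keys. */
    static void merge(BufferedReader left, BufferedReader right, Writer out)
            throws IOException {
        String l = left.readLine();
        String r = right.readLine();
        while (l != null || r != null) {
            int cmp;
            if (l == null) {
                cmp = 1;  // left is exhausted, drain right
            } else if (r == null) {
                cmp = -1; // right is exhausted, drain left
            } else {
                cmp = key(l).compareTo(key(r));
            }
            if (cmp < 0) {
                out.write(l + "\n");
                l = left.readLine();
            } else if (cmp > 0) {
                out.write(r + "\n");
                r = right.readLine();
            } else {
                // Placeholder aggregation: add both counts.
                out.write(key(l) + "\t" + (count(l) + count(r)) + "\n");
                l = left.readLine();
                r = right.readLine();
            }
        }
        out.flush();
    }

    private static String key(String line) {
        return line.substring(0, line.indexOf('\t'));
    }

    private static long count(String line) {
        return Long.parseLong(line.substring(line.indexOf('\t') + 1));
    }

    /** Convenience wrapper for in-memory input, e.g. for quick experiments. */
    static String mergeToString(String left, String right) {
        StringWriter out = new StringWriter();
        try {
            merge(new BufferedReader(new StringReader(left)),
                  new BufferedReader(new StringReader(right)), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```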
