Martin Körner edited this page Oct 8, 2013 · 4 revisions

This page gives an overview of the redesign of the typology project by describing the workflow and defining APIs.

Workflow

  1. Parse data sets
  2. Build absolute sequences
  3. Count absolute sequences
  4. Build sequences that are needed for Modified Kneser-Ney smoothing
  5. Calculate Modified Kneser-Ney smoothed values
  6. Evaluate the resulting language models

APIs

General:

  • To allow communication between threads, we will use PipedOutputStream and PipedInputStream. Classes will therefore take an InputStream and/or an OutputStream as arguments for I/O.
  • Parameters are passed to the constructors so that run() can be called without parameters
  • Classes implement Runnable
  • Classes should be run as threads
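
The conventions above can be sketched as follows. This is a minimal illustration, not the project's actual code: `LineStage` and `PipelineDemo` are hypothetical names, and the per-line transform stands in for whatever a real workflow step does. Each stage takes its streams in the constructor, implements Runnable, and runs in its own thread; a PipedOutputStream/PipedInputStream pair connects the two stages.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.UnaryOperator;

// Hypothetical stage: reads lines, applies a transform, writes lines.
class LineStage implements Runnable {
    private final InputStream input;
    private final OutputStream output;
    private final UnaryOperator<String> transform;

    // All parameters go into the constructor so that run() needs none.
    LineStage(InputStream input, OutputStream output, UnaryOperator<String> transform) {
        this.input = input;
        this.output = output;
        this.transform = transform;
    }

    @Override
    public void run() {
        // Closing the writer also closes a piped output, which signals
        // end-of-stream to the thread reading from the other end.
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(input, StandardCharsets.UTF_8));
             BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(output, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(transform.apply(line));
                writer.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class PipelineDemo {
    // Connects two stages with a pipe and returns the second stage's output.
    static String run(String text) {
        try {
            PipedOutputStream pipeOut = new PipedOutputStream();
            PipedInputStream pipeIn = new PipedInputStream(pipeOut);
            ByteArrayOutputStream result = new ByteArrayOutputStream();
            InputStream source =
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));

            Thread first = new Thread(new LineStage(source, pipeOut, String::trim));
            Thread second = new Thread(new LineStage(pipeIn, result, String::toUpperCase));
            first.start();
            second.start();
            first.join();
            second.join();
            return new String(result.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Piped streams are meant to be written and read by different threads, which is why each stage gets its own Thread here instead of both being called in sequence.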

Parse data sets

The parser API is work in progress.

Input:

  • InputStream that reads input (XML) file
  • Supported characters
  • Optional: split punctuation marks
  • Optional: add start and end tags

Output:

  • OutputStream with:
    • One sentence per line
    • Exactly one whitespace character between words

Constructor:

public XYZParser(InputStream inputStream, OutputStream outputStream, ??? allowedCharacters, boolean splitPunctuationMarks);
public XYZParser(InputStream inputStream, OutputStream outputStream, ??? allowedCharacters, boolean splitPunctuationMarks, String startTag, String endTag);
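
Since the parser API is still open, the following only sketches the output format described above for a single sentence. `SentenceNormalizer` is a hypothetical helper, and treating every `\p{Punct}` character as a punctuation mark to split is an assumption (it would also split apostrophes and hyphens). Filtering by supported characters is left out because the type of that parameter is not yet decided.

```java
// Hypothetical helper illustrating the parser's output format for one sentence.
class SentenceNormalizer {
    static String normalize(String sentence, boolean splitPunctuationMarks,
                            String startTag, String endTag) {
        String text = sentence;
        if (splitPunctuationMarks) {
            // Assumption: surround every punctuation character with spaces
            // so that it becomes a token of its own.
            text = text.replaceAll("(\\p{Punct})", " $1 ");
        }
        // Collapse runs of whitespace to exactly one space.
        text = text.trim().replaceAll("\\s+", " ");
        if (startTag == null || endTag == null) {
            return text;
        }
        // Tags are added last so that punctuation splitting cannot break them.
        return text.isEmpty()
            ? startTag + " " + endTag
            : startTag + " " + text + " " + endTag;
    }
}
```

For example, `normalize("Hello,   world!", true, "<s>", "</s>")` produces a line in which every token, including the tags, is separated by a single space.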

Build absolute sequences

Idea: Run this class as a thread for every type of sequence.

But: There is no way to feed the parsed text into all open threads at the same time (too many open writers).

So: Read from a file.

Also: IndexBuilder needs to be synchronized so that the index is built only once.

Input:

  • File to be split (not an InputStream, since an index may have to be built)
  • Word delimiter (\s, \t ...)
  • Pattern (integer vs. binary String vs. char/boolean-array)
  • Name of folder that will be created inside the input file directory (e.g. "absolute")

Output:

  • Index which is stored in the same directory as the input file
  • Sequences split to files according to the index (folder structure e.g.: ../absolute/pattern/files)
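
As a rough illustration of how a pattern could select the words of a sequence, here is a sketch that uses the boolean-array variant. The interpretation is an assumption: the pattern is slid over the sentence as a window, words at `true` positions are kept, and words at `false` positions are dropped. `SequenceExtractor` is a hypothetical name, and index building and the splitting into files are not covered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds sequences from one sentence for one pattern.
class SequenceExtractor {
    static List<String> extract(String sentence, String delimiterRegex, boolean[] pattern) {
        String[] words = sentence.split(delimiterRegex);
        List<String> sequences = new ArrayList<>();
        // Slide a window of pattern.length words over the sentence.
        for (int start = 0; start + pattern.length <= words.length; start++) {
            StringBuilder sequence = new StringBuilder();
            for (int offset = 0; offset < pattern.length; offset++) {
                if (!pattern[offset]) {
                    continue; // Assumption: skipped positions are simply dropped.
                }
                if (sequence.length() > 0) {
                    sequence.append(' ');
                }
                sequence.append(words[start + offset]);
            }
            sequences.add(sequence.toString());
        }
        return sequences;
    }
}
```

With the pattern `{true, false, true}`, the sentence "a b c d" would yield the sequences "a c" and "b d" under this interpretation.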

Count absolute sequences

Input:

  • InputStream:
    • One sequence per line
    • File is smaller than 2GB

Output:

  • OutputStream:
    • Sequence \t count
    • Sorted alphabetically (for the Kneser-Ney aggregation: the files that have to be aggregated cannot all be held in RAM at the same time)
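
A counting step with this input and output could look like the sketch below. `SequenceCounter` is a hypothetical class name. The sketch holds all counts in memory, which presumably is what the file size limit is meant to make feasible, and it sorts by Java's natural String order (by UTF-16 code unit), which only approximates "alphabetically".

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical counter: one sequence per input line, "sequence\tcount" per output line.
class SequenceCounter implements Runnable {
    private final InputStream input;
    private final OutputStream output;

    SequenceCounter(InputStream input, OutputStream output) {
        this.input = input;
        this.output = output;
    }

    @Override
    public void run() {
        // A TreeMap keeps its keys sorted, so the output is ordered for free.
        Map<String, Long> counts = new TreeMap<>();
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(input, StandardCharsets.UTF_8));
             BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(output, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                counts.merge(line, 1L, Long::sum);
            }
            for (Map.Entry<String, Long> entry : counts.entrySet()) {
                writer.write(entry.getKey() + "\t" + entry.getValue() + "\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```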

Build sequences that are needed for Modified Kneser-Ney smoothing

Input:

  • Directory with counted sequences
  • Pattern specifying which columns to keep (e.g. {1,2,3} to keep columns 1, 2, and 3)

Output:

  • OutputStream with shortened sequences as input for the previous class (the sequence counter)
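
The column selection could be sketched as follows. `ColumnSelector` is a hypothetical helper. It assumes the input lines have the "sequence \t count" format produced by the counting step, that words within a sequence are separated by single spaces, and that column numbers are 1-based as in the example above.

```java
// Hypothetical helper: shortens one counted line to the requested columns.
class ColumnSelector {
    static String keepColumns(String countedLine, int[] columns) {
        // Drop the count; only the sequence is passed on to be counted again.
        String sequence = countedLine.split("\t", 2)[0];
        String[] words = sequence.split(" ");
        StringBuilder shortened = new StringBuilder();
        for (int column : columns) {
            if (shortened.length() > 0) {
                shortened.append(' ');
            }
            shortened.append(words[column - 1]); // columns are 1-based
        }
        return shortened.toString();
    }
}
```

Under these assumptions, `keepColumns("a b c d\t7", new int[] {1, 2, 3})` keeps the first three words and discards the count.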

Calculate Modified Kneser-Ney smoothed values

Note: Running this step in parallel would not help, since the HDD is the bottleneck.

Input:

  • Different InputStreams to be aggregated
  • ...

Output:

  • OutputStream with smoothed values
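
The smoothing computation itself is not specified here, but the reason for requiring sorted input can be illustrated. The sketch below merges two sorted "sequence \t count" streams line by line, so that only one line per stream is in memory at a time. `SortedCountMerger` is a hypothetical name, and summing the counts of equal sequences is only a stand-in for the real aggregation.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical merger for two streams that are sorted by sequence.
class SortedCountMerger {
    static void merge(InputStream leftInput, InputStream rightInput, OutputStream output) {
        try {
            BufferedReader left = new BufferedReader(
                new InputStreamReader(leftInput, StandardCharsets.UTF_8));
            BufferedReader right = new BufferedReader(
                new InputStreamReader(rightInput, StandardCharsets.UTF_8));
            BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(output, StandardCharsets.UTF_8));
            String l = left.readLine();
            String r = right.readLine();
            while (l != null || r != null) {
                int cmp;
                if (l == null) {
                    cmp = 1;
                } else if (r == null) {
                    cmp = -1;
                } else {
                    cmp = key(l).compareTo(key(r));
                }
                if (cmp < 0) {
                    writer.write(l + "\n");
                    l = left.readLine();
                } else if (cmp > 0) {
                    writer.write(r + "\n");
                    r = right.readLine();
                } else {
                    // Same sequence in both streams: combine the counts.
                    writer.write(key(l) + "\t" + (count(l) + count(r)) + "\n");
                    l = left.readLine();
                    r = right.readLine();
                }
            }
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static String key(String line) {
        return line.substring(0, line.lastIndexOf('\t'));
    }

    private static long count(String line) {
        return Long.parseLong(line.substring(line.lastIndexOf('\t') + 1));
    }
}
```

The merge only works if both inputs were sorted with the same ordering that `compareTo` uses, which is why the sort order of the counting step matters.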

Evaluate the resulting language models