Redesign
Martin Körner edited this page Oct 8, 2013 · 4 revisions
This page gives an overview of the redesign of the typology project by describing the workflow and defining APIs.
- Parse data sets
- Build absolute sequences
- Count absolute sequences
- Build sequences that are needed for Modified Kneser-Ney smoothing
- Calculate Modified Kneser-Ney smoothed values
- Evaluate the resulting language models
General:
- To allow communication between threads, we will use PipedOutputStreams and PipedInputStreams. Classes will therefore take `InputStream` and/or `OutputStream` as arguments for I/O.
- Parameters are placed in the constructors so that `run()` can be called without parameters
- Classes implement `Runnable`
- Classes should be run as `Thread`
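The conventions above can be sketched as two stages connected by a pipe. This is only an illustration: `LineProducer`, `LineCollector`, and `PipeDemo` are hypothetical names that are not part of the project. It assumes each stage receives its stream in the constructor, implements `Runnable`, and runs in its own `Thread`.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical producer stage: writes a fixed list of lines to its OutputStream.
class LineProducer implements Runnable {
    private final List<String> lines;
    private final OutputStream out;

    LineProducer(List<String> lines, OutputStream out) {
        this.lines = lines;
        this.out = out;
    }

    @Override
    public void run() {
        // Closing the writer closes the pipe and signals end-of-stream to the reader.
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            for (String line : lines) {
                writer.write(line);
                writer.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Hypothetical consumer stage: collects every line it reads from its InputStream.
class LineCollector implements Runnable {
    private final InputStream in;
    private final List<String> collected = new ArrayList<>();

    LineCollector(InputStream in) {
        this.in = in;
    }

    @Override
    public void run() {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                collected.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    List<String> getCollected() {
        return collected;
    }
}

class PipeDemo {
    // Connects the two stages with a pipe and runs each one in its own thread.
    static List<String> runPipeline(List<String> lines) {
        try {
            PipedOutputStream pipeOut = new PipedOutputStream();
            PipedInputStream pipeIn = new PipedInputStream(pipeOut);
            LineCollector collector = new LineCollector(pipeIn);
            Thread producer = new Thread(new LineProducer(lines, pipeOut));
            Thread consumer = new Thread(collector);
            producer.start();
            consumer.start();
            producer.join();
            consumer.join();
            return collector.getCollected();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Each end of a pipe must be used by a different thread, and the writing side has to close its stream so that the reader sees the end of the input.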
The parser API is a work in progress.
Input:
- `InputStream` that reads the input (XML) file
- Supported characters
- Optional: split punctuation marks
- Optional: add start and end tags
Output:
- `OutputStream` with:
  - One sentence per line
  - Exactly one white space between each word
Constructor:
public XYZParser(InputStream inputStream, OutputStream outputStream, ??? allowedCharacters, boolean splitPunctuationMarks);
public XYZParser(InputStream inputStream, OutputStream outputStream, ??? allowedCharacters, boolean splitPunctuationMarks, String startTag, String endTag);
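The output format (exactly one white space between words, optionally split punctuation marks) could be produced by a helper along the following lines. `SentenceNormalizer` is a hypothetical name, and the reading of "split punctuation marks" as "turn each ASCII punctuation character into its own token" is an assumption. The still-undecided `allowedCharacters` parameter and the start and end tags are left out on purpose.

```java
class SentenceNormalizer {
    // Returns the sentence with exactly one space between tokens.
    // Assumption: splitting punctuation means isolating each ASCII punctuation mark.
    static String normalize(String sentence, boolean splitPunctuationMarks) {
        String result = sentence;
        if (splitPunctuationMarks) {
            // Surround every punctuation character with spaces so it becomes a separate token.
            result = result.replaceAll("(\\p{Punct})", " $1 ");
        }
        // Remove leading/trailing white space and collapse runs of white space.
        return result.trim().replaceAll("\\s+", " ");
    }
}
```

For example, `normalize("Hello,   world!", true)` yields `Hello , world !`.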
Idea: Run this class as a thread for every type of sequence
But: There is no way to feed the parsed text into all open threads at the same time (too many open writers)
So: Read from file
Also: IndexBuilder needs to be synchronized so that the index is built only once
Input:
- File to be split (not an `InputStream`, since an index may have to be built)
- Word delimiter (`\s`, `\t`, ...)
- Pattern (integer vs. binary String vs. char/boolean-array)
- Name of the folder that will be created inside the input file directory (e.g. `"absolute"`)
Output:
- Index which is stored in the same directory as the input file
- Sequences split to files according to the index (folder structure e.g.: ../absolute/pattern/files)
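How a pattern selects words from a sentence can be illustrated with a small sketch. It assumes the boolean-array representation (one of the options listed above), with `true` meaning "keep the word at this position" and `false` meaning "skip it". `SequenceSplitter` and `buildSequences` are hypothetical names. Building the index and writing the sequences to files are not shown.

```java
import java.util.ArrayList;
import java.util.List;

class SequenceSplitter {
    // Slides the pattern over the words of one line and returns one sequence per position.
    // Assumption: skipped words are dropped rather than replaced by a placeholder.
    static List<String> buildSequences(String line, String delimiterRegex, boolean[] pattern) {
        String[] words = line.split(delimiterRegex);
        List<String> sequences = new ArrayList<>();
        for (int start = 0; start + pattern.length <= words.length; start++) {
            StringBuilder sequence = new StringBuilder();
            for (int offset = 0; offset < pattern.length; offset++) {
                if (pattern[offset]) {
                    if (sequence.length() > 0) {
                        sequence.append(' ');
                    }
                    sequence.append(words[start + offset]);
                }
            }
            sequences.add(sequence.toString());
        }
        return sequences;
    }
}
```

With the line `a b c d`, the delimiter `\s`, and the pattern `{true, false, true}`, this returns the sequences `a c` and `b d`.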
Input:
- `InputStream`:
  - One sequence per line
  - File is smaller than 2GB
Output:
- `OutputStream`:
  - Sequence `\t` count
  - Sorted alphabetically (for Kneser-Ney aggregation: the files that have to be aggregated cannot all be held in RAM at the same time)
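A counting stage with this input and output contract might look like the sketch below. `SequenceCounter` is a hypothetical name. The sketch assumes that the 2GB limit allows all distinct sequences of one file to be held in memory, and it treats "sorted alphabetically" as Java's natural `String` order, which a `TreeMap` provides.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

class SequenceCounter implements Runnable {
    private final InputStream in;
    private final OutputStream out;

    SequenceCounter(InputStream in, OutputStream out) {
        this.in = in;
        this.out = out;
    }

    @Override
    public void run() {
        // TreeMap keeps the sequences sorted by their natural String order.
        Map<String, Long> counts = new TreeMap<>();
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8));
             BufferedWriter writer = new BufferedWriter(
                     new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    counts.merge(line, 1L, Long::sum);
                }
            }
            // One "sequence<TAB>count" line per distinct sequence.
            for (Map.Entry<String, Long> entry : counts.entrySet()) {
                writer.write(entry.getKey());
                writer.write('\t');
                writer.write(Long.toString(entry.getValue()));
                writer.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because every output file is sorted in the same order, a later aggregation step can merge several files line by line instead of loading them completely.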
Input:
- Directory with counted sequences
- Pattern specifying which columns to keep (e.g. {1,2,3} to keep columns 1, 2, and 3)
Output:
- `OutputStream` with shortened sequences as input for the previous class
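Shortening one counted line according to a column pattern can be sketched as follows. `ColumnSelector` and `keepColumns` are hypothetical names. The sketch assumes 1-based column numbers as in the `{1,2,3}` example, words separated by a single space, and input lines in the `sequence<TAB>count` format, where the count is dropped so that the result can be counted again.

```java
class ColumnSelector {
    // Keeps only the given (1-based) columns of the sequence part of a counted line.
    static String keepColumns(String countedLine, int[] columnsToKeep) {
        // Assumption: everything after the first tab is the count and is discarded.
        String sequence = countedLine.split("\t", 2)[0];
        String[] words = sequence.split(" ");
        StringBuilder shortened = new StringBuilder();
        for (int column : columnsToKeep) {
            if (shortened.length() > 0) {
                shortened.append(' ');
            }
            shortened.append(words[column - 1]);
        }
        return shortened.toString();
    }
}
```

For the line `a b c<TAB>5` and the pattern `{1,3}`, the result is `a c`.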
Note: Doing this in parallel would not be useful since the HDD is the bottleneck
Input:
- Different `InputStreams` to be aggregated
- ...
Output:
- `OutputStream` with smoothed values