CITlabErrorRate

A tool for computing error rates for different algorithms.

Requirements

Java >= version 8
Maven
All further dependencies are resolved via Maven

Build

git clone https://github.com/CITlabRostock/CITlabErrorRate
cd CITlabErrorRate
mvn package [-DskipTests=true]

Running:

End-2-End Character Error Rate (CER)

This tool measures the CER of an End2End system. In general, it calculates the number of manipulations (insertions, deletions, substitutions) needed to get from the hypothesis/recognition to the ground truth/reference. The CER is equal to #manipulations/#GT, where #GT is the number of ground truth characters.
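The definition above can be sketched with a plain Levenshtein edit distance between two strings. This is only an illustration under simplifying assumptions: the class name `CerSketch` and its methods are hypothetical and not part of CITlabErrorRate, the comparison works on single strings rather than lists of PAGE XML lines, and none of the command-line options is modelled.

```java
final class CerSketch {

    // Minimal number of insertions, deletions and substitutions needed to
    // turn the hypothesis into the reference (Levenshtein distance).
    // Characters are compared as Java chars (UTF-16 code units).
    public static int editDistance(String hypothesis, String reference) {
        int[] prev = new int[reference.length() + 1];
        int[] curr = new int[reference.length() + 1];
        for (int j = 0; j <= reference.length(); j++) {
            prev[j] = j; // empty hypothesis: insert j reference characters
        }
        for (int i = 1; i <= hypothesis.length(); i++) {
            curr[0] = i; // empty reference: delete i hypothesis characters
            for (int j = 1; j <= reference.length(); j++) {
                int cost = hypothesis.charAt(i - 1) == reference.charAt(j - 1) ? 0 : 1;
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(deletion, insertion), substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[reference.length()];
    }

    // CER = #manipulations / #GT, where #GT is the reference length.
    public static double cer(String hypothesis, String reference) {
        if (reference.isEmpty()) {
            throw new IllegalArgumentException("reference must not be empty");
        }
        return (double) editDistance(hypothesis, reference) / reference.length();
    }

    public static void main(String[] args) {
        // One insertion ('o') and one substitution ('m' -> 'n'):
        // 2 manipulations over 11 ground truth characters.
        System.out.println(cer("secnd lime", "second line"));
    }
}
```

Dividing by the reference length rather than the hypothesis length follows the #manipulations/#GT definition, so the value can exceed 1 when the hypothesis is much longer than the ground truth.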

The tool has several configuration options. The following usage line gives a first overview of all parameters:

java -jar target/CITlabErrorRate.jar \
<list_pageXml_groundtruth> \
<list_pageXml_hypothesis> \
[-d] [-D] [-g] [-h] [-l] [-N] [-n] [-r] [-s] [-t <arg>] [-u]

Parameters that normalize both the ground truth and the hypothesis:

-l the CER is calculated only on letters, numbers and spaces. All other characters, such as punctuation and symbols, are ignored. Examples: "this, 1 word!" leads to "this 1 word"; "31.Nov.2019" leads to "31Nov2019"; "12.000 $ budget" leads to "12000 budget".
-N the text is normalized according to the Unicode normalization form NFKC (see http://unicode.org/reports/tr15/ for details). Example: ſ leads to s.
-n the text is normalized according to the Unicode normalization form NFC (see http://unicode.org/reports/tr15/ for details). Example: a^ leads to â, where ^ stands for the combining circumflex accent (U+0302).
-u the text is converted to upper case, so the comparison is case insensitive. Example: Straße leads to STRASSE.
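The effect of these normalizations can be reproduced with the Java standard library. The sketch below is an assumption about how such options could be implemented, not the tool's actual code: the class `NormalizeSketch` and its method names are hypothetical, and the character filter for -l (Unicode letters, Unicode numbers and U+0020) is a guess at the intended rule.

```java
import java.text.Normalizer;
import java.util.Locale;

final class NormalizeSketch {

    // Like -l: keep only letters, numbers and spaces (assumed filter).
    public static String lettersOnly(String s) {
        return s.replaceAll("[^\\p{L}\\p{N} ]", "");
    }

    // Like -N: Unicode normalization form NFKC.
    public static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Like -n: Unicode normalization form NFC.
    public static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Like -u: locale-independent upper-casing.
    public static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(lettersOnly("this, 1 word!")); // this 1 word
        System.out.println(lettersOnly("31.Nov.2019"));   // 31Nov2019
        System.out.println(nfkc("\u017F"));               // long s becomes s
        System.out.println(nfc("a\u0302"));               // a + U+0302 becomes U+00E2
        System.out.println(upper("Stra\u00DFe"));         // STRASSE
    }
}
```

Non-ASCII characters are written as Unicode escapes so that the example does not depend on the source file encoding.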

Parameters that determine how the error is calculated:

-r the reading order is ignored. So ["first line", "second line"] vs. ["second line", "first line"] would be counted as correct.
-s differences in line segmentation are tolerated. That means a space (U+0020) can be interpreted either as a space or as a line break. So ["split and", "merge lines"] vs. ["split", "and merge", "lines"] would be counted as correct.
-g the geometric position of the lines is taken into account. The coverage between two lines has to be above a threshold (see parameter -t).
-t <FLOAT> the minimal coverage in [0.0,1.0) between two lines for them to be considered adjacent.
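The two examples for -r and -s can be illustrated with a deliberately simplified equality check. This is not the tool's algorithm, which computes an error rate rather than a yes/no answer; the class `LineOptionsSketch` and its methods are hypothetical and only show why the example line lists are treated as equivalent. The geometric options -g and -t are not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class LineOptionsSketch {

    // Simplified view of -r: the same lines in any order count as equal.
    public static boolean sameIgnoringOrder(List<String> groundTruth, List<String> hypothesis) {
        List<String> a = new ArrayList<>(groundTruth);
        List<String> b = new ArrayList<>(hypothesis);
        Collections.sort(a);
        Collections.sort(b);
        return a.equals(b);
    }

    // Simplified view of -s: a line break is treated like a space, so the
    // line lists are equal if joining them with spaces yields the same text.
    public static boolean sameIgnoringSegmentation(List<String> groundTruth, List<String> hypothesis) {
        return String.join(" ", groundTruth).equals(String.join(" ", hypothesis));
    }

    public static void main(String[] args) {
        System.out.println(sameIgnoringOrder(
                Arrays.asList("first line", "second line"),
                Arrays.asList("second line", "first line")));
        System.out.println(sameIgnoringSegmentation(
                Arrays.asList("split and", "merge lines"),
                Arrays.asList("split", "and merge", "lines")));
    }
}
```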

Parameters for analyzing errors (not implemented yet):

-d the algorithm will return all manipulations needed to get from the hypothesis to the ground truth (insertions, deletions, substitutions)
-D the algorithm will return all operations needed to get from the hypothesis to the ground truth (corrects, insertions, deletions, substitutions)

HTR:

The tool can calculate the Character Error Rate (CER), Word Error Rate (WER), Bag of Tokens (BOG) and some further metrics. Run

java -cp target/CITlabErrorRate.jar de.uros.citlab.errorrate.HtrError --help

for more information on evaluating an HTR result when the files are PAGE XML files. For raw UTF-8 encoded text files, use

java -cp target/CITlabErrorRate.jar de.uros.citlab.errorrate.HtrErrorTxt --help

or

java -cp target/CITlabErrorRate.jar de.uros.citlab.errorrate.HtrErrorTxtLeip

KWS:

To calculate measures for KWS, run

java -cp target/CITlabErrorRate.jar de.uros.citlab.errorrate.KwsError

Use --help to see the configuration options.

Text2Image

To calculate measures for image alignment, run

java -cp target/CITlabErrorRate.jar de.uros.citlab.errorrate.Text2ImageError

Use --help to see the configuration options.
