Merge pull request #48 from bertsky/segment-fixes
some fixes for recent segmentation update
bertsky authored Jun 17, 2020
2 parents f2b42d4 + 62a96f9 commit fa40e7e
Showing 5 changed files with 203 additions and 134 deletions.
192 changes: 121 additions & 71 deletions README.md
@@ -1,34 +1,74 @@
[![Language grade: Python](https://img.shields.io/lgtm/grade/python/g/cisocrgroup/ocrd_cis.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/cisocrgroup/ocrd_cis/context:python)
[![Total alerts](https://img.shields.io/lgtm/alerts/g/cisocrgroup/ocrd_cis.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/cisocrgroup/ocrd_cis/alerts/)

Content:
* [ocrd_cis](#ocrd_cis)
* [Introduction](#introduction)
* [Installation](#installation)
* [Profiler](#profiler)
* [Usage](#usage)
* [ocrd-cis-postcorrect](#ocrd-cis-postcorrect)
* [ocrd-cis-align](#ocrd-cis-align)
* [ocrd-cis-data](#ocrd-cis-data)
* [Trainining](#trainining)
* [ocrd-cis-ocropy-train](#ocrd-cis-ocropy-train)
* [ocrd-cis-ocropy-clip](#ocrd-cis-ocropy-clip)
* [ocrd-cis-ocropy-resegment](#ocrd-cis-ocropy-resegment)
* [ocrd-cis-ocropy-segment](#ocrd-cis-ocropy-segment)
* [ocrd-cis-ocropy-deskew](#ocrd-cis-ocropy-deskew)
* [ocrd-cis-ocropy-denoise](#ocrd-cis-ocropy-denoise)
* [ocrd-cis-ocropy-binarize](#ocrd-cis-ocropy-binarize)
* [ocrd-cis-ocropy-dewarp](#ocrd-cis-ocropy-dewarp)
* [ocrd-cis-ocropy-recognize](#ocrd-cis-ocropy-recognize)
* [Tesserocr](#tesserocr)
* [Workflow configuration](#workflow-configuration)
* [Testing](#testing)
* [Miscellaneous](#miscellaneous)
* [OCR-D workspace](#ocr-d-workspace)
* [OCR-D links](#ocr-d-links)

# ocrd_cis

[CIS](http://www.cis.lmu.de) [OCR-D](http://ocr-d.de) command line
tools for the automatic post-correction of OCR-results.

## Introduction
`ocrd_cis` contains different tools for the automatic post correction
of OCR-results. It contains tools for the training, evaluation and
execution of the post correction. Most of the tools are following the
[OCR-D cli conventions](https://ocr-d.github.io/cli).
`ocrd_cis` contains different tools for the automatic post-correction
of OCR results. It contains tools for the training, evaluation and
execution of the post-correction. Most of the tools are following the
[OCR-D CLI conventions](https://ocr-d.de/en/spec/cli).

There is a helper tool to align multiple OCR results as well as a
version of ocropy that works with python3.
Additionally, there is a helper tool to align multiple OCR results,
as well as an improved version of [Ocropy](https://github.com/tmbarchive/ocropy)
that works with Python 3 and is also wrapped for [OCR-D](https://ocr-d.de/en/spec/).

## Installation
There are multiple ways to install the `ocrd_cis` tools:
* `make install` uses `pip` to install `ocrd_cis` (see below).
* `make install-devel` uses `pip -e` to install `ocrd_cis` (see
below).
* `pip install --upgrade pip ocrd_cis_dir`
* `pip install -e --upgrade pip ocrd_cis_dir`

It is possible to install `ocrd_cis` in a custom directory using
`virtualenv`:
There are 2 ways to install the `ocrd_cis` tools:
* normal packaging:
```sh
make install # or equally: pip install -U pip .
```
(Installs `ocrd_cis` including its Python dependencies
from the current directory to the Python package directory.)
* editable mode:
```sh
make install-devel # or equally: pip install -U pip -e .
```
(Installs `ocrd_cis` including its Python dependencies
from the current directory.)

It is possible (and recommended) to install `ocrd_cis` in a custom user directory
(instead of system-wide) by using `virtualenv` (or `venv`):
```sh
python3 -m venv venv-dir
# create venv:
python3 -m venv venv-dir # where "venv-dir" could be any path name
# enter venv in current shell:
source venv-dir/bin/activate
make install # or any other command to install ocrd_cis (see above)
# use ocrd_cis
# install ocrd_cis:
make install # or any other way (see above)
# use ocrd_cis:
ocrd-cis-ocropy-binarize ...
# finally, leave venv:
deactivate
```

@@ -49,19 +89,21 @@ and the language configurations lie in `/etc/profiler/languages` in
the container image.

## Usage
Most tools follow the [OCR-D cli
conventions](https://ocr-d.github.io/cli). They accept the
`--input-file-grp`, `--output-file-grp`, `--parameter`, `--mets`,
`--log-level` command line arguments (short and long). Some of the
tools (most notably the alignment tool) expect a comma separated list
of multiple input file groups.
Most tools follow the [OCR-D specifications](https://ocr-d.de/en/spec)
(which makes them [OCR-D _processors_](https://ocr-d.de/en/spec/cli)),
i.e. they accept the command-line options `--input-file-grp`, `--output-file-grp`,
`--page-id`, `--parameter`, `--mets`, `--log-level` (each with an argument).
Invoke with `--help` to get self-documentation.

The [ocrd-tool.json](ocrd_cis/ocrd-tool.json) contains a schema
description of the parameter config file for the different tools that
accept the `--parameter` argument.
Some of the processors (most notably the alignment tool) expect a comma-separated list
of multiple input file groups, or multiple output file groups.

The [ocrd-tool.json](ocrd_cis/ocrd-tool.json) contains a formal
description of all the processors along with the parameter config file
accepted by their `--parameter` argument.
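
To see which parameters a processor accepts, the tool description can also be inspected programmatically. The following is a minimal sketch, assuming the usual OCR-D layout of `ocrd-tool.json` (a top-level `tools` object keyed by executable name, each entry holding a `parameters` object). `list_parameters` is a hypothetical helper, not part of `ocrd_cis`, and the sample data with its parameter names is invented for illustration.

```python
import json


def list_parameters(tool_json, executable):
    """Return {parameter name: default value} for one processor.

    Assumes the layout {"tools": {<executable>: {"parameters": {...}}}};
    parameters without a default map to None.
    """
    tool = tool_json.get("tools", {}).get(executable, {})
    return {name: spec.get("default")
            for name, spec in tool.get("parameters", {}).items()}


# Invented sample standing in for the parsed contents of ocrd_cis/ocrd-tool.json;
# the parameter names below are hypothetical.
SAMPLE = json.loads("""
{
  "tools": {
    "ocrd-cis-ocropy-binarize": {
      "parameters": {
        "example-threshold": {"type": "number", "default": 0.5},
        "example-level": {"type": "string"}
      }
    }
  }
}
""")

print(list_parameters(SAMPLE, "ocrd-cis-ocropy-binarize"))
```

With the real file, `SAMPLE` would be replaced by the result of `json.load` on `ocrd_cis/ocrd-tool.json`; an unknown executable name simply yields an empty dictionary.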

### ocrd-cis-postcorrect
This command runs the post correction using a pre-trained model. If
This processor runs the post correction using a pre-trained model. If
additional support OCRs should be used, models for these OCR steps are
required and must be executed and aligned beforehand (see [the test
script](tests/run_postcorrection_test.bash) for an example).
@@ -99,7 +141,7 @@ ocrd-cis-postcorrect -I ALGN -O PC ... # post correction

### ocrd-cis-align
Aligns tokens of multiple input file groups to one output file group.
This tool is used to align the master OCR with any additional support
This processor is used to align the master OCR with any additional support
OCRs. It accepts a comma-separated list of input file groups, which
it aligns in order.

@@ -150,95 +192,95 @@ java -jar $(ocrd-cis-data -jar) \
```

### ocrd-cis-ocropy-clip
The `ocropy-clip` tool can be used to remove intrusions of neighbouring segments in regions / lines of a page.
It runs a connected component analysis on every text region / line of every PAGE in the input file group, as well as its overlapping neighbours, and for each binary object of conflict, determines whether it belongs to the neighbour, and can therefore be clipped to the background. It references the resulting segment image files in the output PAGE (as AlternativeImage).
The `clip` processor can be used to remove intrusions of neighbouring segments in regions / lines of a page.
It runs a connected component analysis on every text region / line of every PAGE in the input file group, as well as its overlapping neighbours, and for each binary object of conflict, determines whether it belongs to the neighbour, and can therefore be clipped to the background. It references the resulting segment image files in the output PAGE (via `AlternativeImage`).
(Use this to suppress separators and neighbouring text.)
```sh
ocrd-cis-ocropy-clip \
--input-file-grp OCR-D-SEG-LINE \
--output-file-grp OCR-D-SEG-LINE-CLIP \
--input-file-grp OCR-D-SEG-REGION \
--output-file-grp OCR-D-SEG-REGION-CLIP \
--mets mets.xml \
--parameter file:///path/to/config.json
--parameter path/to/config.json
```

### ocrd-cis-ocropy-resegment
The `ocropy-resegment` tool can be used to remove overlap between neighbouring lines of a page.
The `resegment` processor can be used to remove overlap between neighbouring lines of a page.
It runs a line segmentation on every text region of every PAGE in the input file group, and for each line already annotated, determines the label of largest extent within the original coordinates (polygon outline) in that line, and annotates the resulting coordinates in the output PAGE.
(Use this to polygonalise text lines poorly segmented, e.g. via bounding boxes.)
(Use this to polygonalise text lines that are poorly segmented, e.g. via bounding boxes.)
```sh
ocrd-cis-ocropy-resegment \
--input-file-grp OCR-D-SEG-LINE \
--output-file-grp OCR-D-SEG-LINE-RES \
--mets mets.xml \
--parameter file:///path/to/config.json
--parameter path/to/config.json
```

### ocrd-cis-ocropy-segment
The `ocropy-segment` tool can be used to segment (pages or) regions of a page into lines.
It runs a line segmentation on every (page or) text region of every PAGE in the input file group, and adds (text regions containing) TextLine elements with the resulting polygon outlines to the annotation of the output PAGE.
(Does not detect tables or images.)
The `segment` processor can be used to segment (pages or) regions of a page into (regions and) lines.
It runs a line segmentation on every (page or) text region of every PAGE in the input file group, and adds (text regions containing) `TextLine` elements with the resulting polygon outlines to the annotation of the output PAGE.
(Does _not_ detect tables.)
```sh
ocrd-cis-ocropy-segment \
--input-file-grp OCR-D-SEG-BLOCK \
--output-file-grp OCR-D-SEG-LINE \
--mets mets.xml \
--parameter file:///path/to/config.json
--parameter path/to/config.json
```

### ocrd-cis-ocropy-deskew
The `ocropy-deskew` tool can be used to deskew pages / regions of a page.
The `deskew` processor can be used to deskew pages / regions of a page.
It runs a projection profile-based skew estimation on every segment of every PAGE in the input file group and annotates the orientation angle in the output PAGE.
(Does not include orientation detection.)
(Does _not_ include orientation detection.)
```sh
ocrd-cis-ocropy-deskew \
--input-file-grp OCR-D-SEG-LINE \
--output-file-grp OCR-D-SEG-LINE-DES \
--mets mets.xml \
--parameter file:///path/to/config.json
--parameter path/to/config.json
```

### ocrd-cis-ocropy-denoise
The `ocropy-denoise` tool can be used to despeckle pages / regions / lines of a page.
It runs a connected component analysis and removes small components (black or white) on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage).
The `denoise` processor can be used to despeckle pages / regions / lines of a page.
It runs a connected component analysis and removes small components (black or white) on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as `AlternativeImage`).
```sh
ocrd-cis-ocropy-denoise \
--input-file-grp OCR-D-SEG-LINE-DES \
--output-file-grp OCR-D-SEG-LINE-DEN \
--mets mets.xml \
--parameter file:///path/to/config.json
--parameter path/to/config.json
```

### ocrd-cis-ocropy-binarize
The `ocropy-binarize` tool can be used to binarize (and optionally denoise and deskew) pages / regions / lines of a page.
It runs the "nlbin" adaptive whitelevel thresholding on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage). (If a deskewing angle has already been annotated in a region, the tool respects that and rotates accordingly.) Images can also be produced grayscale-normalized.
The `binarize` processor can be used to binarize (and optionally denoise and deskew) pages / regions / lines of a page.
It runs the "nlbin" adaptive whitelevel thresholding on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as `AlternativeImage`). (If a deskewing angle has already been annotated in a region, the tool respects that and rotates accordingly.) Images can also be produced grayscale-normalized.
```sh
ocrd-cis-ocropy-binarize \
--input-file-grp OCR-D-SEG-LINE-DES \
--output-file-grp OCR-D-SEG-LINE-BIN \
--mets mets.xml \
--parameter file:///path/to/config.json
--parameter path/to/config.json
```

### ocrd-cis-ocropy-dewarp
The `ocropy-dewarp` tool can be used to dewarp text lines of a page.
It runs the baseline estimation and center normalizer algorithm on every line in every text region of every PAGE in the input file group and references the resulting line image files in the output PAGE (as AlternativeImage).
The `dewarp` processor can be used to vertically dewarp text lines of a page.
It runs the baseline estimation and center normalizer algorithm on every line in every text region of every PAGE in the input file group and references the resulting line image files in the output PAGE (as `AlternativeImage`).
```sh
ocrd-cis-ocropy-dewarp \
--input-file-grp OCR-D-SEG-LINE-BIN \
--output-file-grp OCR-D-SEG-LINE-DEW \
--mets mets.xml \
--parameter file:///path/to/config.json
--parameter path/to/config.json
```

### ocrd-cis-ocropy-recognize
The `ocropy-recognize` tool can be used to recognize the lines / words / glyphs of a page.
The `recognize` processor can be used to recognize the lines / words / glyphs of a page.
It runs LSTM optical character recognition on every line in every text region of every PAGE in the input file group and adds the resulting text annotation in the output PAGE.
```sh
ocrd-cis-ocropy-recognize \
--input-file-grp OCR-D-SEG-LINE-DEW \
--output-file-grp OCR-D-OCR-OCRO \
--mets mets.xml \
--parameter file:///path/to/config.json
--parameter path/to/config.json
```

### Tesserocr
@@ -263,21 +305,29 @@ own models and place them into: /usr/share/tesseract-ocr/4.00/tessdata

A decent pipeline might look like this:

0. page-level binarization
1. image normalization/optimization
1. page-level binarization
1. page-level cropping
2. (page-level binarization)
3. page-level deskewing
4. (page-level dewarping)
5. region segmentation
6. region-level clipping
7. (region-level deskewing)
8. line segmentation
9. (line-level clipping or resegmentation)
10. line-level dewarping
11. line-level recognition
12. (line-level alignment and post-correction)

If GT is used, steps 1, 5 and 8 can be omitted. Else if a segmentation is used in 5 and 8 which does not produce overlapping sections, steps 6 and 9 can be omitted.
1. (page-level binarization)
1. (page-level despeckling)
1. page-level deskewing
1. (page-level dewarping)
1. region segmentation, possibly subdivided into
1. text/non-text separation
1. text region segmentation (and classification)
1. reading order detection
1. non-text region classification
1. region-level clipping
1. (region-level deskewing)
1. line segmentation
1. (line-level clipping or resegmentation)
1. line-level dewarping
1. line-level recognition
1. (line-level alignment and post-correction)

If GT is used, then cropping/segmentation steps can be omitted.

If a segmentation is used which does not produce overlapping segments, then clipping/resegmentation can be omitted.
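
As an illustration of how part of such a pipeline chains together, the sketch below builds the command lines for a sequence of processors, feeding each step's output file group into the next step's input. It is a hedged example: `build_pipeline` is a hypothetical helper, the file group names are arbitrary, and only options named in this README are used. The commands are printed, not executed.

```python
import shlex


def build_pipeline(steps, first_input, mets="mets.xml"):
    """Chain processors so that each step reads the previous step's output group.

    `steps` is a list of (executable, output file group) pairs.
    Returns a list of argv lists, one per processor call.
    """
    commands = []
    input_grp = first_input
    for executable, output_grp in steps:
        commands.append([executable,
                         "--input-file-grp", input_grp,
                         "--output-file-grp", output_grp,
                         "--mets", mets])
        input_grp = output_grp  # next step consumes what this one produced
    return commands


# File group names are illustrative assumptions, not prescribed by ocrd_cis.
STEPS = [
    ("ocrd-cis-ocropy-binarize", "OCR-D-BIN"),
    ("ocrd-cis-ocropy-deskew", "OCR-D-DES"),
    ("ocrd-cis-ocropy-segment", "OCR-D-SEG-LINE"),
    ("ocrd-cis-ocropy-dewarp", "OCR-D-SEG-LINE-DEW"),
    ("ocrd-cis-ocropy-recognize", "OCR-D-OCR-OCRO"),
]

for argv in build_pipeline(STEPS, "OCR-D-IMG"):
    print(shlex.join(argv))
```

Optional steps from the list (clipping, resegmentation, alignment and post-correction) would be added or dropped by editing `STEPS`; per-processor `--parameter` arguments are left out here for brevity.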

## Testing
To run a few basic tests type `make test` (`ocrd_cis` has to be
@@ -289,11 +339,11 @@ installed in order to run any tests).
* Create a new (empty) workspace: `ocrd workspace init workspace-dir`
* cd into `workspace-dir`
* Add new file to workspace: `ocrd workspace add file -G group -i id
-m mimetype`
-m mimetype -g pageId`

## OCR-D links

- [OCR-D](https://ocr-d.github.io)
- [Github](https://github.com/OCR-D)
- [Project-page](http://www.ocr-d.de/)
- [Ground-truth](http://www.ocr-d.de/sites/all/GTDaten/IndexGT.html)
- [Ground-truth](https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit/search)
30 changes: 14 additions & 16 deletions ocrd_cis/ocropy/common.py
@@ -6,7 +6,7 @@
import numpy as np
from scipy.ndimage import measurements, filters, interpolation, morphology
from scipy import stats, signal
from skimage.morphology import convex_hull_image
#from skimage.morphology import convex_hull_image
from PIL import Image

from . import ocrolib
@@ -344,7 +344,7 @@ def check_region(binary, zoom=1.0):
if np.amax(binary)==np.amin(binary): return "image is blank"
if np.mean(binary)<np.median(binary): return "image may be inverted"
h,w = binary.shape
if h<60/zoom: return "image not tall enough for a region image %s"%(binary.shape,)
if h<45/zoom: return "image not tall enough for a region image %s"%(binary.shape,)
if h>5000/zoom: return "image too tall for a region image %s"%(binary.shape,)
if w<100/zoom: return "image too narrow for a region image %s"%(binary.shape,)
if w>5000/zoom: return "image too wide for a region image %s"%(binary.shape,)
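
The effect of lowering the minimum height can be seen in a simplified sketch that reproduces only the size checks from the hunk above (the blank and inverted checks, and anything else in `check_region`, are omitted). `check_region_size` and its `min_height` argument are hypothetical names introduced for illustration, and `None` is taken here to mean that the image passed.

```python
def check_region_size(shape, zoom=1.0, min_height=45):
    """Return an error message if a region image has an implausible size, else None."""
    h, w = shape
    if h < min_height / zoom:
        return "image not tall enough for a region image %s" % (shape,)
    if h > 5000 / zoom:
        return "image too tall for a region image %s" % (shape,)
    if w < 100 / zoom:
        return "image too narrow for a region image %s" % (shape,)
    if w > 5000 / zoom:
        return "image too wide for a region image %s" % (shape,)
    return None


# A 50 px tall region is rejected under the old limit (60) ...
print(check_region_size((50, 400), min_height=60))
# ... and accepted under the new one (45).
print(check_region_size((50, 400)))
```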
@@ -1189,7 +1189,7 @@ def lines2regions(binary, llabels,
region label (in the order of the call chain, which is controlled
by ``rl`` and ``bt``), covering all the line labels inside it.
Afterwards, for each region label, combine line labels by using
Afterwards, for each region label, simplify regions by using
their convex hull polygon.
Return a Numpy array of text region labels.
@@ -1563,17 +1563,15 @@ def finalize():
# apply re-assignments:
rlabels = relabel[llabels]
DSAVE('rlabels', rlabels)
LOG.debug('closing %d regions component-wise', np.amax(relabel))
# close regions (label by label)
for region in np.unique(relabel):
if not region:
continue # ignore bg
# lines = np.setdiff1d(np.nonzero(relabel==region)[0], [0])
# if len(lines) < 2:
# LOG.debug('region %d has only 1 line', region)
# continue
# faster than morphological closing:
region_hull = convex_hull_image(rlabels==region)
rlabels[region_hull] = region
DSAVE('rlabels_closed', rlabels)
# FIXME: hulls can overlap, we just need simplification
# (but cv2.approxPolyDP is faulty and morphology costly)
# LOG.debug('closing %d regions component-wise', np.amax(relabel))
# # close regions (label by label)
# for region in np.unique(relabel):
# if not region:
# continue # ignore bg
# # faster than morphological closing:
# region_hull = convex_hull_image(rlabels==region)
# rlabels[region_hull] = region
# DSAVE('rlabels_closed', rlabels)
return rlabels
6 changes: 3 additions & 3 deletions ocrd_cis/ocropy/ocrolib/morph.py
@@ -170,12 +170,12 @@ def find_contours(image):
contours, _ = cv2.findContours(image.astype(uint8),
cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# convert to y,x tuples
return list(zip((contour[:,0,::-1], cv2.contourArea(contour))
for contour in contours))
return [(contour[:,0,::-1], cv2.contourArea(contour))
for contour in contours]

@checks(SEGMENTATION)
def find_label_contours(labels):
contours = [[]]*amax(labels)+1
contours = [[]]*(amax(labels)+1)
for label in unique(labels):
if not label:
continue
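
Both fixes in this file address plain-Python pitfalls. The standalone sketch below reproduces them with ordinary lists and integers instead of OpenCV contours and NumPy labels; the sample values are made up.

```python
# 1. zip() over a single iterable wraps every item in a 1-tuple,
#    so the old find_contours returned [((points, area),), ...]
#    instead of [(points, area), ...].
pairs = [("contour-a", 4.0), ("contour-b", 9.5)]
old_result = list(zip(pair for pair in pairs))
new_result = [pair for pair in pairs]
print(old_result)
print(new_result)

# 2. `*` binds tighter than `+`, so `[[]] * n + 1` evaluates as
#    `([[]] * n) + 1`, i.e. list + int, which raises TypeError.
max_label = 3
try:
    contours = [[]] * max_label + 1
except TypeError as err:
    print("old expression fails:", err)

# With parentheses, there is one slot per label, including background 0.
contours = [[]] * (max_label + 1)
print(len(contours))
```

Note that `[[]] * n` repeats one and the same list object `n` times. That is harmless as long as the entries are replaced by assignment rather than mutated in place.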