Merge pull request #48 from bertsky/segment-fixes

some fixes for recent segmentation update
cisocrgroup · Jun 17, 2020 · fa40e7e
2 parents f2b42d4 + 62a96f9
commit fa40e7e
Showing 5 changed files with 203 additions and 134 deletions.
diff --git a/README.md b/README.md
@@ -1,34 +1,74 @@
 [![Language grade: Python](https://img.shields.io/lgtm/grade/python/g/cisocrgroup/ocrd_cis.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/cisocrgroup/ocrd_cis/context:python)
 [![Total alerts](https://img.shields.io/lgtm/alerts/g/cisocrgroup/ocrd_cis.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/cisocrgroup/ocrd_cis/alerts/)
+
+Content:
+   * [ocrd_cis](#ocrd_cis)
+      * [Introduction](#introduction)
+      * [Installation](#installation)
+      * [Profiler](#profiler)
+      * [Usage](#usage)
+         * [ocrd-cis-postcorrect](#ocrd-cis-postcorrect)
+         * [ocrd-cis-align](#ocrd-cis-align)
+         * [ocrd-cis-data](#ocrd-cis-data)
+         * [Trainining](#trainining)
+         * [ocrd-cis-ocropy-train](#ocrd-cis-ocropy-train)
+         * [ocrd-cis-ocropy-clip](#ocrd-cis-ocropy-clip)
+         * [ocrd-cis-ocropy-resegment](#ocrd-cis-ocropy-resegment)
+         * [ocrd-cis-ocropy-segment](#ocrd-cis-ocropy-segment)
+         * [ocrd-cis-ocropy-deskew](#ocrd-cis-ocropy-deskew)
+         * [ocrd-cis-ocropy-denoise](#ocrd-cis-ocropy-denoise)
+         * [ocrd-cis-ocropy-binarize](#ocrd-cis-ocropy-binarize)
+         * [ocrd-cis-ocropy-dewarp](#ocrd-cis-ocropy-dewarp)
+         * [ocrd-cis-ocropy-recognize](#ocrd-cis-ocropy-recognize)
+         * [Tesserocr](#tesserocr)
+      * [Workflow configuration](#workflow-configuration)
+      * [Testing](#testing)
+   * [Miscellaneous](#miscellaneous)
+      * [OCR-D workspace](#ocr-d-workspace)
+      * [OCR-D links](#ocr-d-links)
+
 # ocrd_cis
 
 [CIS](http://www.cis.lmu.de) [OCR-D](http://ocr-d.de) command line
 tools for the automatic post-correction of OCR-results.
 
 ## Introduction
-`ocrd_cis` contains different tools for the automatic post correction
-of OCR-results.  It contains tools for the training, evaluation and
-execution of the post correction.  Most of the tools are following the
-[OCR-D cli conventions](https://ocr-d.github.io/cli).
+`ocrd_cis` contains different tools for the automatic post-correction
+of OCR results.  It contains tools for the training, evaluation and
+execution of the post-correction.  Most of the tools are following the
+[OCR-D CLI conventions](https://ocr-d.de/en/spec/cli).
 
-There is a helper tool to align multiple OCR results as well as a
-version of ocropy that works with python3.
+Additionally, there is a helper tool to align multiple OCR results,
+as well as an improved version of [Ocropy](https://github.com/tmbarchive/ocropy)
+that works with Python 3 and is also wrapped for [OCR-D](https://ocr-d.de/en/spec/).
 
 ## Installation
-There are multiple ways to install the `ocrd_cis` tools:
- * `make install` uses `pip` to install `ocrd_cis` (see below).
- * `make install-devel` uses `pip -e` to install `ocrd_cis` (see
-   below).
- * `pip install --upgrade pip ocrd_cis_dir`
- * `pip install -e --upgrade pip ocrd_cis_dir`
-
-It is possible to install `ocrd_cis` in a custom directory using
-`virtualenv`:
+There are 2 ways to install the `ocrd_cis` tools:
+ * normal packaging:
+  ```sh
+  make install # or equally: pip install -U pip .
+  ```
+  (Installs `ocrd_cis` including its Python dependencies
+   from the current directory to the Python package directory.)
+ * editable mode:
+  ```sh
+  make install-devel # or equally: pip install -e -U pip .
+  ```
+  (Installs `ocrd_cis` including its Python dependencies
+   from the current directory.)
+
+It is possible (and recommended) to install `ocrd_cis` in a custom user directory
+(instead of system-wide) by using `virtualenv` (or `venv`):
 ```sh
- python3 -m venv venv-dir
+ # create venv:
+ python3 -m venv venv-dir # where "venv-dir" could be any path name
+ # enter venv in current shell:
  source venv-dir/bin/activate
- make install # or any other command to install ocrd_cis (see above)
- # use ocrd_cis
+ # install ocrd_cis:
+ make install # or any other way (see above)
+ # use ocrd_cis:
+ ocrd-cis-ocropy-binarize ...
+ # finally, leave venv:
  deactivate
 ```
 
@@ -49,19 +89,21 @@ and the language configurations lie in `/etc/profiler/languages` in
 the container image.
 
 ## Usage
-Most tools follow the [OCR-D cli
-conventions](https://ocr-d.github.io/cli).  They accept the
-`--input-file-grp`, `--output-file-grp`, `--parameter`, `--mets`,
-`--log-level` command line arguments (short and long).  Some of the
-tools (most notably the alignment tool) expect a comma seperated list
-of multiple input file groups.
+Most tools follow the [OCR-D specifications](https://ocr-d.de/en/spec),
+(which makes them [OCR-D _processors_](https://ocr-d.de/en/spec/cli),)
+i.e. they accept the command-line options `--input-file-grp`, `--output-file-grp`,
+`--page-id`, `--parameter`, `--mets`, `--log-level` (each with an argument).
+Invoke with `--help` to get self-documentation. 
 
-The [ocrd-tool.json](ocrd_cis/ocrd-tool.json) contains a schema
-description of the parameter config file for the different tools that
-accept the `--parameter` argument.
+Some of the processors (most notably the alignment tool) expect a comma-separated list
+of multiple input file groups, or multiple output file groups.
+
+The [ocrd-tool.json](ocrd_cis/ocrd-tool.json) contains a formal
+description of all the processors along with the parameter config file
+accepted by their `--parameter` argument.
 
 ### ocrd-cis-postcorrect
-This command runs the post correction using a pre-trained model.  If
+This processor runs the post correction using a pre-trained model.  If
 additional support OCRs should be used, models for these OCR steps are
 required and must be executed and aligned beforehand (see [the test
 script](tests/run_postcorrection_test.bash) for an example).
@@ -99,7 +141,7 @@ ocrd-cis-postcorrect -I ALGN -O PC ... # post correction
 
 ### ocrd-cis-align
 Aligns tokens of multiple input file groups to one output file group.
-This tool is used to align the master OCR with any additional support
+This processor is used to align the master OCR with any additional support
 OCRs.  It accepts a comma-separated list of input file groups, which
 it aligns in order.
 
@@ -150,95 +192,95 @@ java -jar $(ocrd-cis-data -jar) \
 ```
 
 ### ocrd-cis-ocropy-clip
-The `ocropy-clip` tool can be used to remove intrusions of neighbouring segments in regions / lines of a page.
-It runs a connected component analysis on every text region / line of every PAGE in the input file group, as well as its overlapping neighbours, and for each binary object of conflict, determines whether it belongs to the neighbour, and can therefore be clipped to the background. It references the resulting segment image files in the output PAGE (as AlternativeImage).
+The `clip` processor can be used to remove intrusions of neighbouring segments in regions / lines of a page.
+It runs a connected component analysis on every text region / line of every PAGE in the input file group, as well as its overlapping neighbours, and for each binary object of conflict, determines whether it belongs to the neighbour, and can therefore be clipped to the background. It references the resulting segment image files in the output PAGE (via `AlternativeImage`).
 (Use this to suppress separators and neighbouring text.)
 ```sh
 ocrd-cis-ocropy-clip \
-  --input-file-grp OCR-D-SEG-LINE \
-  --output-file-grp OCR-D-SEG-LINE-CLIP \
+  --input-file-grp OCR-D-SEG-REGION \
+  --output-file-grp OCR-D-SEG-REGION-CLIP \
   --mets mets.xml
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```
 
 ### ocrd-cis-ocropy-resegment
-The `ocropy-resegment` tool can be used to remove overlap between neighbouring lines of a page.
+The `resegment` processor can be used to remove overlap between neighbouring lines of a page.
 It runs a line segmentation on every text region of every PAGE in the input file group, and for each line already annotated, determines the label of largest extent within the original coordinates (polygon outline) in that line, and annotates the resulting coordinates in the output PAGE.
-(Use this to polygonalise text lines poorly segmented, e.g. via bounding boxes.)
+(Use this to polygonalise text lines that are poorly segmented, e.g. via bounding boxes.)
 ```sh
 ocrd-cis-ocropy-resegment \
   --input-file-grp OCR-D-SEG-LINE \
   --output-file-grp OCR-D-SEG-LINE-RES \
   --mets mets.xml
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```
 
 ### ocrd-cis-ocropy-segment
-The `ocropy-segment` tool can be used to segment (pages or) regions of a page into lines.
-It runs a line segmentation on every (page or) text region of every PAGE in the input file group, and adds (text regions containing) TextLine elements with the resulting polygon outlines to the annotation of the output PAGE.
-(Does not detect tables or images.)
+The `segment` processor can be used to segment (pages or) regions of a page into (regions and) lines.
+It runs a line segmentation on every (page or) text region of every PAGE in the input file group, and adds (text regions containing) `TextLine` elements with the resulting polygon outlines to the annotation of the output PAGE.
+(Does _not_ detect tables.)
 ```sh
 ocrd-cis-ocropy-segment \
   --input-file-grp OCR-D-SEG-BLOCK \
   --output-file-grp OCR-D-SEG-LINE \
   --mets mets.xml
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```
 
 ### ocrd-cis-ocropy-deskew
-The `ocropy-deskew` tool can be used to deskew pages / regions of a page.
+The `deskew` processor can be used to deskew pages / regions of a page.
 It runs a projection profile-based skew estimation on every segment of every PAGE in the input file group and annotates the orientation angle in the output PAGE.
-(Does not include orientation detection.)
+(Does _not_ include orientation detection.)
 ```sh
 ocrd-cis-ocropy-deskew \
   --input-file-grp OCR-D-SEG-LINE \
   --output-file-grp OCR-D-SEG-LINE-DES \
   --mets mets.xml
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```
 
 ### ocrd-cis-ocropy-denoise
-The `ocropy-denoise` tool can be used to despeckle pages / regions / lines of a page.
-It runs a connected component analysis and removes small components (black or white) on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage).
+The `denoise` processor can be used to despeckle pages / regions / lines of a page.
+It runs a connected component analysis and removes small components (black or white) on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as `AlternativeImage`).
 ```sh
 ocrd-cis-ocropy-denoise \
   --input-file-grp OCR-D-SEG-LINE-DES \
   --output-file-grp OCR-D-SEG-LINE-DEN \
   --mets mets.xml
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```
 
 ### ocrd-cis-ocropy-binarize
-The `ocropy-binarize` tool can be used to binarize (and optionally denoise and deskew) pages / regions / lines of a page.
-It runs the "nlbin" adaptive whitelevel thresholding on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage). (If a deskewing angle has already been annotated in a region, the tool respects that and rotates accordingly.) Images can also be produced grayscale-normalized.
+The `binarize` processor can be used to binarize (and optionally denoise and deskew) pages / regions / lines of a page.
+It runs the "nlbin" adaptive whitelevel thresholding on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as `AlternativeImage`). (If a deskewing angle has already been annotated in a region, the tool respects that and rotates accordingly.) Images can also be produced grayscale-normalized.
 ```sh
 ocrd-cis-ocropy-binarize \
   --input-file-grp OCR-D-SEG-LINE-DES \
   --output-file-grp OCR-D-SEG-LINE-BIN \
   --mets mets.xml
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```
 
 ### ocrd-cis-ocropy-dewarp
-The `ocropy-dewarp` tool can be used to dewarp text lines of a page.
-It runs the baseline estimation and center normalizer algorithm on every line in every text region of every PAGE in the input file group and references the resulting line image files in the output PAGE (as AlternativeImage).
+The `dewarp` processor can be used to vertically dewarp text lines of a page.
+It runs the baseline estimation and center normalizer algorithm on every line in every text region of every PAGE in the input file group and references the resulting line image files in the output PAGE (as `AlternativeImage`).
 ```sh
 ocrd-cis-ocropy-dewarp \
   --input-file-grp OCR-D-SEG-LINE-BIN \
   --output-file-grp OCR-D-SEG-LINE-DEW \
   --mets mets.xml
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```
 
 ### ocrd-cis-ocropy-recognize
-The `ocropy-recognize` tool can be used to recognize the lines / words / glyphs of a page.
+The `recognize` processor can be used to recognize the lines / words / glyphs of a page.
 It runs LSTM optical character recognition on every line in every text region of every PAGE in the input file group and adds the resulting text annotation in the output PAGE.
 ```sh
 ocrd-cis-ocropy-recognize \
   --input-file-grp OCR-D-SEG-LINE-DEW \
   --output-file-grp OCR-D-OCR-OCRO \
   --mets mets.xml
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```
 
 ### Tesserocr
@@ -263,21 +305,29 @@ own models and place them into: /usr/share/tesseract-ocr/4.00/tessdata
 
 A decent pipeline might look like this:
 
-0. page-level binarization
+1. image normalization/optimization
+1. page-level binarization
 1. page-level cropping
-2. (page-level binarization)
-3. page-level deskewing
-4. (page-level dewarping)
-5. region segmentation
-6. region-level clipping
-7. (region-level deskewing)
-8. line segmentation
-9. (line-level clipping or resegmentation)
-10. line-level dewarping
-11. line-level recognition
-12. (line-level alignment and post-correction)
-
-If GT is used, steps 1, 5 and 8 can be omitted. Else if a segmentation is used in 5 and 8 which does not produce overlapping sections, steps 6 and 9 can be omitted.
+1. (page-level binarization)
+1. (page-level despeckling)
+1. page-level deskewing
+1. (page-level dewarping)
+1. region segmentation, possibly subdivided into
+   1. text/non-text separation
+   1. text region segmentation (and classification)
+   1. reading order detection
+   1. non-text region classification
+1. region-level clipping
+1. (region-level deskewing)
+1. line segmentation
+1. (line-level clipping or resegmentation)
+1. line-level dewarping
+1. line-level recognition
+1. (line-level alignment and post-correction)
+
+If GT is used, then cropping/segmentation steps can be omitted.
+
+If a segmentation is used which does not produce overlapping segments, then clipping/resegmentation can be omitted.
 
 ## Testing
 To run a few basic tests type `make test` (`ocrd_cis` has to be
@@ -289,11 +339,11 @@ installed in order to run any tests).
 * Create a new (empty) workspace: `ocrd workspace init workspace-dir`
 * cd into `workspace-dir`
 * Add new file to workspace: `ocrd workspace add file -G group -i id
-  -m mimetype`
+  -m mimetype -g pageId`
 
 ## OCR-D links
 
 - [OCR-D](https://ocr-d.github.io)
 - [Github](https://github.com/OCR-D)
 - [Project-page](http://www.ocr-d.de/)
-- [Ground-truth](http://www.ocr-d.de/sites/all/GTDaten/IndexGT.html)
+- [Ground-truth](https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit/search)
diff --git a/ocrd_cis/ocropy/common.py b/ocrd_cis/ocropy/common.py
@@ -6,7 +6,7 @@
 import numpy as np
 from scipy.ndimage import measurements, filters, interpolation, morphology
 from scipy import stats, signal
-from skimage.morphology import convex_hull_image
+#from skimage.morphology import convex_hull_image
 from PIL import Image
 
 from . import ocrolib
@@ -344,7 +344,7 @@ def check_region(binary, zoom=1.0):
     if np.amax(binary)==np.amin(binary): return "image is blank"
     if np.mean(binary)<np.median(binary): return "image may be inverted"
     h,w = binary.shape
-    if h<60/zoom: return "image not tall enough for a region image %s"%(binary.shape,)
+    if h<45/zoom: return "image not tall enough for a region image %s"%(binary.shape,)
     if h>5000/zoom: return "image too tall for a region image %s"%(binary.shape,)
     if w<100/zoom: return "image too narrow for a region image %s"%(binary.shape,)
     if w>5000/zoom: return "image too wide for a region image %s"%(binary.shape,)
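
To make the effect of the lowered constant concrete, here is a minimal sketch of the height check alone. `too_short` is a hypothetical helper, not a function in `common.py`; only the comparison `h < base/zoom` is taken from the hunk above, and the 50 px example height is an arbitrary illustration.

```python
def too_short(height, zoom=1.0, base=45):
    """Mirror the `h < base/zoom` test from check_region (sketch only)."""
    return height < base / zoom

# A 50 px tall region image at zoom 1.0:
rejected_before = too_short(50, base=60)  # old threshold of 60
rejected_after = too_short(50, base=45)   # new threshold of 45

# The threshold scales inversely with zoom: at zoom 0.5 it becomes 90 px.
rejected_at_half_zoom = too_short(50, zoom=0.5)
```

Under these assumptions, a region image between 45 and 59 px tall (at zoom 1.0) that was previously rejected as "not tall enough" now passes the check.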
@@ -1189,7 +1189,7 @@ def lines2regions(binary, llabels,
     region label (in the order of the call chain, which is controlled
     by ``rl`` and ``bt``), covering all the line labels inside it.
     
-    Afterwards, for each region label, combine line labels by using
+    Afterwards, for each region label, simplify regions by using
     their convex hull polygon.
     
     Return a Numpy array of text region labels.
@@ -1563,17 +1563,15 @@ def finalize():
     # apply re-assignments:
     rlabels = relabel[llabels]
     DSAVE('rlabels', rlabels)
-    LOG.debug('closing %d regions component-wise', np.amax(relabel))
-    # close regions (label by label)
-    for region in np.unique(relabel):
-        if not region:
-            continue # ignore bg
-        # lines = np.setdiff1d(np.nonzero(relabel==region)[0], [0])
-        # if len(lines) < 2:
-        #     LOG.debug('region %d has only 1 line', region)
-        #     continue
-        # faster than morphological closing:
-        region_hull = convex_hull_image(rlabels==region)
-        rlabels[region_hull] = region
-    DSAVE('rlabels_closed', rlabels)
+    # FIXME: hulls can overlap, we just need simplification
+    #        (but cv2.approxPolyDP is faulty and morphology costly)
+    # LOG.debug('closing %d regions component-wise', np.amax(relabel))
+    # # close regions (label by label)
+    # for region in np.unique(relabel):
+    #     if not region:
+    #         continue # ignore bg
+    #     # faster than morphological closing:
+    #     region_hull = convex_hull_image(rlabels==region)
+    #     rlabels[region_hull] = region
+    # DSAVE('rlabels_closed', rlabels)
     return rlabels
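
The FIXME above notes that hulls can overlap. The sketch below illustrates one way this can happen, using a pure-Python convex hull (Andrew's monotone chain) instead of `skimage.morphology.convex_hull_image`, so it is an illustration of the geometry and not the removed code itself. The pixel coordinates and the names `region1`, `region2` and `overwritten` are made up for the example.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside(hull, p):
    """True if p lies inside or on the border of a counter-clockwise hull."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))

# Region 1 is L-shaped: one column plus one row of (x, y) pixels.
region1 = [(0, y) for y in range(5)] + [(x, 4) for x in range(1, 5)]
# Region 2 sits in the notch of the L and shares no pixel with region 1.
region2 = [(1, 2), (1, 3)]

hull1 = convex_hull(region1)
# Pixels of region 2 that a hull fill for region 1 would relabel:
overwritten = [p for p in region2 if inside(hull1, p)]
```

Here the hull of the L-shaped region is a triangle that covers both pixels of the second region, so filling hulls label by label would let one region overwrite part of another.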
diff --git a/ocrd_cis/ocropy/ocrolib/morph.py b/ocrd_cis/ocropy/ocrolib/morph.py
@@ -170,12 +170,12 @@ def find_contours(image):
     contours, _ = cv2.findContours(image.astype(uint8),
                                    cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
     # convert to y,x tuples
-    return list(zip((contour[:,0,::-1], cv2.contourArea(contour))
-                    for contour in contours))
+    return [(contour[:,0,::-1], cv2.contourArea(contour))
+            for contour in contours]
 
 @checks(SEGMENTATION)
 def find_label_contours(labels):
-    contours = [[]]*amax(labels)+1
+    contours = [[]]*(amax(labels)+1)
     for label in unique(labels):
         if not label:
             continue
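
Both fixes in this hunk can be reproduced without OpenCV or NumPy. The following sketch uses plain strings as stand-ins for contour arrays and a plain `int` for the label maximum; `contours`, `areas`, `n_labels` and `slots` are hypothetical names chosen for the illustration.

```python
# Stand-ins for cv2 contours and their areas (hypothetical values).
contours = ["c0", "c1"]
areas = {"c0": 4.0, "c1": 9.0}

# Old code: zip() over a single generator wraps every pair in a 1-tuple.
old = list(zip((c, areas[c]) for c in contours))
# New code: a list comprehension yields the (contour, area) pairs directly.
new = [(c, areas[c]) for c in contours]

# Old code: `[[]] * n + 1` parses as `([[]] * n) + 1`, i.e. list + int.
n_labels = 3
try:
    [[]] * n_labels + 1
    precedence_error = None
except TypeError as err:
    precedence_error = err

# New code: one slot per label, including the background label 0.
slots = [[]] * (n_labels + 1)
# Caveat: all slots refer to the same list object, so this is only safe
# if slots are reassigned rather than mutated in place.
slots_are_aliased = slots[0] is slots[1]
```

The aliasing caveat is a general property of list repetition in Python; whether it matters here depends on the rest of `find_label_contours`, which is not shown in this hunk.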