OCR for Cuneiform RGB Images - GSoC 2019 with CDLI

About the project

Previous work on extracting and classifying cuneiform characters relied on scanned 3D cuneiform datasets (Nils M. Kriege et al. [1], D. Fisseler et al. [2], Hubert Mara et al. [3]). Extracting characters from the 3D models required specialised software, and in some cases, after individual wedges were extracted, time-consuming manual post-processing was needed to assign the wedges to cuneiform signs/characters. The main issue is that these techniques only work on 3D datasets, and producing such datasets requires specialised scanning equipment. Since CDLI has a large collection of 2D cuneiform images (100,000+), it would be beneficial if a system could perform OCR directly on them.

I formulated my project as an object (line/sign) detection and classification task in which I detect individual cuneiform characters and assign each a class. I do not use any language model for the cuneiform script, i.e., relations between characters, Zipf's law, or character histograms, to verify or improve the detections.

  • For example, if the model had learned these properties, it would recognise that the word sequence "the the" does not appear in English text and would lower the score/confidence of the detection.
  • Another example is spelling mistakes: lugal (lu-gal) corresponds to "man"-"big" (king), but if the detection system incorrectly reads it as lu-kisal (kisal and gal look similar), a language model would recognise that lu-kisal does not exist in the cuneiform dictionary and either correct it to lu-gal or lower the confidence/score of the detection during training. A toy sketch of such rescoring follows this list.
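As an illustration only (not part of the project's code), the following minimal sketch shows how a bigram language model could downweight implausible sign sequences; the sign names, counts, and (sign, confidence) detection format are hypothetical.

```python
# Toy bigram rescoring of sign detections. The sign names, counts and
# the (sign, confidence) detection format are illustrative assumptions.

# Bigram counts over a hypothetical transliterated corpus.
BIGRAM_COUNTS = {
    ("lu", "gal"): 120,   # lu-gal ("king") occurs often
    ("lu", "kisal"): 0,   # lu-kisal does not occur
}

def rescore(detections, bigram_counts, vocab_size=1000, alpha=0.5):
    """Scale each detection's confidence by a smoothed bigram probability,
    so sequences unattested in the corpus receive lower scores."""
    rescored = detections[:1]
    for (prev, _), (sign, conf) in zip(detections, detections[1:]):
        total = sum(c for (first, _nxt), c in bigram_counts.items() if first == prev)
        p = (bigram_counts.get((prev, sign), 0) + 1) / (total + vocab_size)
        rescored.append((sign, conf * p ** alpha))
    return rescored

# A detector confusing gal/kisal would see lu-kisal downweighted:
print(rescore([("lu", 0.9), ("kisal", 0.8)], BIGRAM_COUNTS))  # kisal score drops
print(rescore([("lu", 0.9), ("gal", 0.8)], BIGRAM_COUNTS))    # gal score stays higher
```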

Firstly, ad-hoc/traditional image processing techniques were used for line and sign detection. These work relatively well for line detection on clean, properly scribed cuneiform tablets, but not for sign detection (ridge detection and connected components): it is very hard to make image processing heuristics understand which wedges belong to which cuneiform signs. A minimal sketch of the line detection approach follows.
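As a rough illustration of this traditional approach (not the project's exact code; the binarisation and thresholds are assumptions), a horizontal projection profile can separate text lines on a clean tablet image:

```python
# Minimal sketch: line detection via a horizontal projection profile.
import cv2
import numpy as np

def detect_line_bands(image_path, min_height=5):
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(image_path)
    # Binarise so wedge strokes are white (255) on black.
    _, binary = cv2.threshold(img, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Sum ink per row; valleys in the profile separate text lines.
    profile = binary.sum(axis=1)
    threshold = 0.5 * profile.mean()
    bands, start = [], None
    for y, value in enumerate(profile):
        if value > threshold and start is None:
            start = y                       # entering a text line
        elif value <= threshold and start is not None:
            if y - start >= min_height:
                bands.append((start, y))    # leaving a text line
            start = None
    if start is not None:
        bands.append((start, len(profile)))
    return bands  # list of (top_row, bottom_row) per detected line
```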

I then looked at object detection models that would help with character detection. Most high-performance, state-of-the-art object detection models are supervised and require labelled datasets. CDLI did not have annotated 2D images (class names with coordinates), and manually annotating them would take a lot of time. To work around this, I created synthetic cuneiform images with annotations that look as realistic as possible, which were then used to train the model. I wrote a script that uses the Blender API to automate this process, sketched below.
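Such a generation loop can be driven from Blender's Python API (bpy). The snippet below is a simplified sketch under assumed names: the "tablet" object, the jitter ranges, and the output paths are placeholders, not the project's actual script.

```python
# Simplified sketch: batch-rendering randomised synthetic tablet images
# from inside Blender. Run with: blender scene.blend --python this_script.py
import math
import random
import bpy

tablet = bpy.data.objects["tablet"]   # assumed mesh carrying the cuneiform signs
scene = bpy.context.scene

for i in range(100):
    # Jitter the tablet pose slightly so every render differs.
    tablet.rotation_euler = (
        math.radians(random.uniform(-5, 5)),
        math.radians(random.uniform(-5, 5)),
        math.radians(random.uniform(-3, 3)),
    )
    scene.render.filepath = f"//renders/synthetic_{i:04d}.png"  # // = .blend-relative
    bpy.ops.render.render(write_still=True)
```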

The goal of the project involves developing a system that:

  1. Takes as input cuneiform transliterated text and an image, and generates image segments equal in number to the lines of transliteration. Appropriate segment indexing then enables us to map each text line to its segment (see the sketch after this list).

  2. Detects and recognises cuneiform characters.
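To illustrate the index-based mapping in step 1, here is a hypothetical helper with deliberately simplified ATF parsing (real ATF handling would go through pyoracc):

```python
# Hypothetical helper: pair detected line segments with ATF text lines.
def map_segments_to_atf(bands, atf_text):
    """bands: (top_row, bottom_row) segments ordered top to bottom.
    Keeps only numbered text lines ("1. ..."); skips &/@/#/$ metadata."""
    text_lines = [line.strip() for line in atf_text.splitlines()
                  if line.strip() and line.strip()[0].isdigit()]
    if len(bands) != len(text_lines):
        raise ValueError(f"{len(bands)} segments vs {len(text_lines)} text lines")
    return list(zip(bands, text_lines))

atf = """&P000000 = example tablet
@obverse
1. lu-gal
2. e2-gal
"""
print(map_segments_to_atf([(10, 40), (45, 80)], atf))
# [((10, 40), '1. lu-gal'), ((45, 80), '2. e2-gal')]
```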

Objectives completed

  • Creating a cuneiform template that contains signs with variations and line indentations, along with its corresponding labelled annotation file.
  • Generating synthetic cuneiform images from the templates in Blender.
  • Detecting lines containing cuneiform characters in 2D images:
    • detection using ad-hoc/traditional image processing techniques;
    • detection using deep-learning-based techniques (Mask-RCNN); an inference sketch follows this list;
    • mapping a cuneiform image line to its corresponding transliterated text line (currently this only works when supplied with individual ATF files and hence is not pushed; it still needs to work with pyoracc on cdliatf_unblocked.atf).
  • Detecting individual cuneiform characters in 2D images using deep learning techniques.
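For reference, an inference call with the widely used Matterport Mask_RCNN implementation might look like the sketch below; the config values, class count, and weight file name are placeholders rather than the project's actual setup.

```python
# Inference sketch assuming https://github.com/matterport/Mask_RCNN.
import skimage.io
from mrcnn.config import Config
import mrcnn.model as modellib

class LineInferenceConfig(Config):
    NAME = "cuneiform_lines"
    NUM_CLASSES = 1 + 1        # background + "text line" (placeholder)
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

model = modellib.MaskRCNN(mode="inference",
                          config=LineInferenceConfig(),
                          model_dir="logs/")
model.load_weights("mask_rcnn_cuneiform_lines.h5", by_name=True)  # assumed weights

image = skimage.io.imread("tablet.jpg")
result = model.detect([image], verbose=0)[0]
# result["rois"]   -> line bounding boxes
# result["masks"]  -> per-line segmentation masks
# result["scores"] -> detection confidences
```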

Technologies Used

  • Blender
  • Python
  • TensorFlow
  • OpenCV

Possible Improvements

  • The main deciding factor for the model in recognising cuneiform characters is their shape. Currently, 78 cuneiform signs were manually cropped from cuneiform SVG line drawings available in CDLI's downloads. These signs were then augmented with the following operations: elastic distortion, scaling in x or y (stretching), and rotation (see the sketch after this list). The problem is that these operations are applied to the whole sign, not to individual wedges, but in many real cases the shape of a wedge in a given cuneiform sign can differ drastically across tablets, even for the same wedge class. A method that addresses this (augmenting in SVG format?) could help the model learn better, i.e., create more variations for individual characters.

  • Apply a cuneiform language model to improve the detection as described in the about section.
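The whole-sign augmentations mentioned above could be implemented roughly as in this OpenCV/NumPy sketch; the parameter values are illustrative, not the project's actual settings.

```python
# Sketch of the three whole-sign augmentations: elastic distortion,
# axis scaling (stretching) and rotation. Parameters are illustrative.
import cv2
import numpy as np

def elastic_distortion(img, alpha=34.0, sigma=4.0, rng=None):
    """Simard-style elastic distortion applied to the whole sign image."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    # Smooth random displacement fields for x and y.
    dx = cv2.GaussianBlur(rng.random((h, w), dtype=np.float32) * 2 - 1,
                          (0, 0), sigma) * alpha
    dy = cv2.GaussianBlur(rng.random((h, w), dtype=np.float32) * 2 - 1,
                          (0, 0), sigma) * alpha
    x, y = np.meshgrid(np.arange(w), np.arange(h))
    return cv2.remap(img, (x + dx).astype(np.float32),
                     (y + dy).astype(np.float32), cv2.INTER_LINEAR)

def scale_xy(img, fx=1.2, fy=1.0):
    """Stretch the sign along x and/or y."""
    return cv2.resize(img, None, fx=fx, fy=fy)

def rotate(img, angle=5.0):
    """Rotate the sign about its centre, keeping the canvas size."""
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, m, (w, h))
```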

Future Plans

  • Dockerize the current character and line detection system and deploy it in a web interface.
  • Write a research paper about OCR for RGB cuneiform images.