open-ocr-preprocessor: fix convert-pdf #15

Merged Sep 4, 2019 (1 commit)

sfcodes (Contributor) commented Aug 16, 2019

I started by fixing convert-pdf (tleyden/open-ocr#117), but in the process ended up upgrading this entire Dockerfile.

Here is what changed:

  • Upgraded to Ubuntu:18.04, latest LTS
  • Upgraded to Golang 1.10
  • Upgraded to libboost 1.62
  • Upgraded to libopencv 3.2
  • Upgraded to tesseract 4
  • Switched from forked tleyden/DetectText to original aperrault/DetectText; original seems more recently updated
  • Added Ghostscript for convert-pdf
  • Deleted the stroke-width-transform Dockerfile, as it seems unnecessary now given the new two-stage build process (although that's debatable)
  • Converted to a two-stage build, which shrank this image from 1.72GB to only 310MB.
  • Minimized unnecessary layers by combining instructions.
  • Switched to building DetectText with cmake; the manual g++ method didn't seem to work for me.

All in all this should fix the build issues with the badly outdated tleyden5iwx/open-ocr-preprocessor Docker image, and enable convert-pdf support.

You can pull the resulting image from here:

docker pull sfcodes/open-ocr-preprocessor:pr15

This is a major update to the open-ocr-preprocessor Dockerfile. It updates to the latest Ubuntu LTS, latest Go, and latest available libraries, which in turn enables building PDF pre-processor support.

Previously, PDF support did not work because the Go version was too old and the build failed. This is documented in tleyden/open-ocr#117.

To shrink the image size I used a two-stage build: the first stage installs the many dependencies necessary for the build, but the resulting image includes only the few dependencies required at runtime. This shrinks the image from 1.72GB to 310MB.
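For readers unfamiliar with the pattern, a two-stage build along these lines might look roughly like the sketch below. It is not the Dockerfile merged in this PR: the package selection, the `/src/DetectText` paths, and the name of the built binary are all assumptions made for illustration, and the Go build of the preprocessor itself is left out.

```dockerfile
# Illustrative sketch only. Package names, paths, and the binary name
# are assumptions, not the contents of the Dockerfile in this PR.

# Stage 1: full build environment (compilers, headers, cmake, git).
FROM ubuntu:18.04 AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
      build-essential ca-certificates cmake git \
      libboost-dev libopencv-dev \
 && rm -rf /var/lib/apt/lists/*
RUN git clone https://github.com/aperrault/DetectText.git /src/DetectText
WORKDIR /src/DetectText/build
# Build with cmake instead of a manual g++ invocation.
# Assumes a CMakeLists.txt at the repository root.
RUN cmake .. && make

# Stage 2: runtime image with only what is needed to run.
FROM ubuntu:18.04
# The OpenCV and Boost runtime libraries the binary links against
# would also have to be installed here; they are omitted for brevity.
RUN apt-get update && apt-get install -y --no-install-recommends \
      ghostscript tesseract-ocr \
 && rm -rf /var/lib/apt/lists/*
# Copy only the build artifact from the first stage (assumed path and name).
COPY --from=builder /src/DetectText/build/DetectText /usr/local/bin/DetectText
```

Because the final image starts from a fresh `FROM` and copies in only the artifact, none of the compilers or `-dev` packages from the first stage end up in it, which is where a size reduction like the one reported here would come from.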

Finally, I eliminated the underlying stroke-width-transform image as it didn't really make sense here anymore; this new image supports both stroke-width-transform and convert-pdf.
tleyden (Owner) commented Sep 4, 2019

@sfcodes so sorry, I haven't been able to get to this, but I haven't forgotten about it!

tleyden (Owner) left a comment

LGTM. Thanks! I will kick off a dockerhub rebuild.

tleyden merged commit 74409db into tleyden:master Sep 4, 2019

tleyden (Owner) commented Sep 4, 2019

Dockerhub is giving me an error:

Building in Docker Cloud's infrastructure...
Cloning into '.'...
Warning: Permanently added the RSA host key for IP address '140.82.114.3' to the list of known hosts.
Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
please ensure the correct public key is added to the list of trusted keys for this repository (128)

Will circle back to this.

sfcodes (Contributor, Author) commented Sep 16, 2019

That is strange... it works on my DockerHub just fine. Here is the log, in fact: build_9b8800a4.txt

Any chance you could run it again? Maybe just a fluke.

Meanwhile I'll think about it a little more.

tleyden (Owner) commented Nov 4, 2019

@sfcodes I may have figured it out, and I've kicked off another build

tleyden (Owner) commented Nov 5, 2019

OK, it was an issue with my docker<->github account link. It's fixed, and it builds OK now.

Build logs: https://gist.github.com/tleyden/64bd251b39c1e30a8b68106feb5beffc
