Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/contain nltk assets in docker image #3853

Merged
merged 39 commits into from
Jan 8, 2025

Conversation

christinestraub
Copy link
Collaborator

@christinestraub christinestraub commented Dec 26, 2024

This pull request adds NLTK data to the Docker image by pre-packaging the data to ensure a more reliable and efficient deployment process, as the required NLTK resources are readily available within the container.

Current updated solution:

  • Dockerfile Update: Integrated NLTK data directly into the Docker image, ensuring that the API can operate independently of external - data sources. The data is stored at /home/notebook-user/nltk_data.
  • Environment Variable Setup: Configured the NLTK_PATH environment variable, enabling Python scripts to automatically locate and use the embedded NLTK data. This eliminates the need for manual configuration in deployment environments.
  • Code Cleanup: Removed outdated code in tokenize.py and related scripts that previously downloaded NLTK data from S3. This streamlines the codebase and removes unnecessary dependencies.
  • Script Updates: Updated tokenize.py and test_tokenize.py to utilize the NLTK_PATH variable, ensuring consistent access to the embedded data across all environments.
  • Dependency Elimination: Fully eliminated reliance on the S3 bucket for NLTK data, mitigating risks from network failures or access changes.
  • Improved System Reliability: By embedding assets within the Docker image, the API now has a self-contained setup that ensures consistent behavior regardless of deployment location.
  • Updated the Dockerfile to copy the local NLTK data to the appropriate directory within the container.
  • Adjusted the application setup to verify the presence of NLTK assets during the container build process.

@christinestraub christinestraub marked this pull request as ready for review January 3, 2025 16:57
Copy link
Contributor

@ajjimeno ajjimeno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@christinestraub christinestraub added this pull request to the merge queue Jan 8, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 8, 2025
@christinestraub christinestraub added this pull request to the merge queue Jan 8, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Jan 8, 2025
@christinestraub christinestraub added this pull request to the merge queue Jan 8, 2025
Merged via the queue into main with commit 8378c26 Jan 8, 2025
41 checks passed
@christinestraub christinestraub deleted the feat/contain-nltk-assets-in-docker-image branch January 8, 2025 22:34
@NolanTrem
Copy link

Would love to see a bit of stability with how Unstructured handles NLTK install. In the past ~2 months this has broken twice without any warning for us.

@davidlmorton
Copy link

I agree @NolanTrem, it was surprising for us that a bugfix version bump would break our tests. Downloading the NLTK stuff at container build time is fine, removing automatic downloading if found missing is not backwards compatible for a lot of folks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants