Developer guide

Setting up a development environment (recommended)

This section shows a way to configure a development environment that allows you to run tests and build documentation.

virtualenv env
source env/bin/activate
pip install -U pip setuptools
pip install -e .[opencv,tf,test,torch]
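
As a quick sanity check that the editable install works (an illustrative one-liner; the version string comes from petastorm/__init__.py):

python -c "import petastorm; print(petastorm.__version__)"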

Additionally, you can use the Dockerized Linux workspace via the Makefile provided at docker/Makefile. The following will build the Docker image, start a running container with petastorm source mounted into it from the host, and open a BASH shell into it (you must have GNU Make and Docker installed beforehand):

make build run shell

Within the Dockerized workspace, you can find the Python virtual environments at /petastorm_venv2.7 and /petastorm_venv3.6, and the local petastorm/ source mounted at /petastorm. Remember to point pyspark at the right Python interpreter after loading a virtual environment, for example:

export PYSPARK_PYTHON=`which python3`
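
Putting it together inside the container, the sequence might look like this (a sketch, using the Python 3.6 virtual environment path mentioned above):

# Activate one of the pre-built virtual environments
source /petastorm_venv3.6/bin/activate

# Point pyspark at the interpreter from the activated environment
export PYSPARK_PYTHON=$(which python3)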

Also, if you are seeing "ImportError: libGL.so.1" when running "import cv2", fix it by running:

apt-get update; apt-get install ffmpeg libsm6 libxext6 -y

(reference: https://stackoverflow.com/questions/55313610/importerror-libgl-so-1-cannot-open-share)

Unit tests

To run unit tests:

pytest -v petastorm

NOTE: you need Java 1.8 installed for the tests to pass (it is a dependency of Spark).

pytest has multiple useful plugins. Consider installing the following:

pip install pytest-xdist pytest-repeat pytest-pycharm

which enable you to run tests in parallel (the -n switch) and repeat tests multiple times (the --count switch).
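
For example (illustrative invocations; adjust the worker count and test paths as needed):

# Run the test suite across 4 parallel workers (pytest-xdist)
pytest -v -n 4 petastorm

# Run each collected test 10 times, e.g. to flush out flaky tests (pytest-repeat)
pytest -v --count=10 petastorm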

Caching test datasets

Some unit tests rely on mock data. Generating these datasets is not very fast, as it spins up a local Spark instance. Use the -Y switch to cache these datasets. Be careful: dataset generation exercises Petastorm code, so in some cases you will need to invalidate the cache for tests to take all code changes into account. Use the --cache-clear switch to do so.
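
For example (a sketch; -Y is the project-specific caching switch described above, while --cache-clear is standard pytest):

# First run generates and caches the mock datasets; subsequent runs reuse them
pytest -v -Y petastorm

# Invalidate the cache when your changes affect dataset generation itself
pytest -v --cache-clear -Y petastorm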

Documentation

The petastorm project uses sphinx autodoc capabilities, along with free documentation hosting by ReadTheDocs.org (RTD), to serve auto-generated API docs at http://petastorm.rtfd.io.

The RTD site is configured via webhooks to trigger sphinx doc builds on changes in the petastorm GitHub repo. The docs are configured to build the same way locally and on RTD.

All the source files needed to generate the autodocs reside under docs/autodoc/.

To build the docs locally:

pip install -e .[docs]
cd docs/autodoc

# To nuke all generated HTMLs
make clean

# Each run incrementally updates HTML based on file changes
make html

Once the HTML build process completes successfully, navigate your browser to file:///tmp/autodocs/_build/html/index.html.

Some changes may require build and deployment to see, including:

  • Changes to readthedocs.yml
  • Changes to docs/autodoc/conf.py
  • A change that makes RTD build different from a local build

To see the above documentation changes:

  1. Create a petastorm branch and push it.
  2. Configure RTD to activate a version for that branch.
  3. A project maintainer will need to perform this version activation.
  4. The status of the built version, as well as the resulting docs, can then be viewed.

Release versions

By default, RTD defines the latest version, which can be pointed at master or another branch. Additionally, each release may have an associated RTD build version, which must be explicitly activated in the Versions settings page.

As with any source file, once a release is tagged, it is essentially immutable, so be sure that all the desired documentation changes are in place before tagging a release.

Note that conf.py defines release and version properties. For ease of maintenance, we've set them to the same version string as defined in petastorm/__init__.py.
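
A minimal sketch of how conf.py can stay in sync with the package (the actual conf.py may achieve this differently):

# docs/autodoc/conf.py (sketch)
from petastorm import __version__

# Keep the Sphinx version strings in lockstep with the package version
version = __version__
release = __version__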

Known doc-build caveats and issues

  • Due to RTD's build resource limitations, we are unable to pip install any of petastorm's extra-required library packages.
  • Since Sphinx must be able to import a Python module to read its docstrings, the doc page for any module that imports cv2, tensorflow, or torch will, unfortunately, fail to build (see the sketch after this list).
  • The alabaster Sphinx theme defaults to using travis-ci.org for the Travis CI build badge, whereas the petastorm project is served on .com, so we don't currently have a working Travis CI build status.
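
One standard Sphinx mitigation for the import problem, shown here only as a possibility rather than something the project currently configures, is autodoc_mock_imports, which stubs out heavyweight modules during the doc build:

# docs/autodoc/conf.py (hypothetical addition, not currently used)
# Stub these imports so autodoc can load modules that depend on them
autodoc_mock_imports = ["cv2", "tensorflow", "torch"]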

Future: auto-generate with sphinx-apidoc

Sphinx has the ability to auto-generate the entire API, either via the autosummary extension, or the sphinx-apidoc tool.

The following sphinx-apidoc invocation will autogenerate an api/ subdirectory of rST files for each of the petastorm modules. Those files can then be glob'd into a TOC tree.

cd docs/autodoc
sphinx-apidoc -fTo api ../.. ../../setup.py
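
The generated files could then be pulled in with a glob'd toctree, along these lines (a sketch; the actual index layout may differ):

.. toctree::
   :glob:

   api/*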

The apidoc_experiment branch and RTD output demonstrate the outcome of vanilla usage. Actually leveraging this approach to produce uncluttered auto-generated API docs will require:

  1. Code package reorganization
  2. Experimentation with sphinx settings, if available, to shorten link names
  3. Configuration change to auto-run sphinx-apidoc in RTD build, as opposed to committing the api/*.rst files

Release procedure

  1. Make sure you are on the latest master in your local workspace (git checkout master && git pull).
  2. Update __version__ in petastorm/__init__.py and commit.
  3. Update docs/release-notes.rst.
    1. Remove the (unreleased) marker from the release being published.
    2. Add any additional information if needed.
    3. Add a kudos message for any new contributors to the release.
    4. Create a future release entry and mark it with the (unreleased) string.
  4. Commit the changes.
  5. Tag as vX.X.Xrc0 (git tag vX.X.Xrc0) and push both master and the tag (git push origin master vX.X.Xrc0). This will trigger a build and a PyPI release.
  6. Give users an opportunity to test the new release (Slack channel/Twitter). Create new release candidates as needed.
  7. Tag as vX.X.X (git tag vX.X.X) and push both master and the tag (git push origin master vX.X.X). This will trigger a build and a PyPI release.
  8. Once the build finishes, a new Python wheel will be pushed to the public PyPI server.
  9. Navigate to https://readthedocs.org/ --> "My Projects" --> "Builds" --> trigger a build of the 'latest' documentation (it is not clear when RTD picks up new tags from GitHub, so you may see only outdated release versions there).
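
For convenience, the tagging commands from steps 5 and 7 in sequence (replace X.X.X with the actual version number):

# Release candidate: triggers a build and a PyPI release
git tag vX.X.Xrc0
git push origin master vX.X.Xrc0

# Final release, once the candidate has been validated
git tag vX.X.X
git push origin master vX.X.X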

Setting up pyspark for working with S3 locally

These instructions were checked with pyspark 3.0.1.

  1. Download the following files into some local directory:
    1. https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
    2. https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.4/hadoop-aws-2.7.4.jar
    3. https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar (we were not able to confirm the s3 protocol due to authentication issues)
  2. Add/set the CLASSPATH environment variable to point to the directory containing these jars, for example as shown below.
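
A sketch of that last step (the directory path is hypothetical; a trailing /* lets the JVM pick up every jar in the directory):

# Point the JVM classpath at the directory holding the three jars above
export CLASSPATH="/path/to/s3-jars/*"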