- datalad/ is the main Python module where major development is happening, with major submodules being:
  - cmdline/ - helpers for accessing interface/ functionality from the command line
  - customremotes/ - custom special remotes for annex provided by datalad
  - downloaders/ - support for accessing data from various sources (e.g. http, S3, XNAT) via a unified interface.
    - configs/ - specifications for known data providers and associated credentials
  - interface/ - high level interface functions which get exposed via command line (cmdline/) or Python (datalad.api).
  - tests/ - some unit- and regression- tests (more could be found under tests/ of corresponding submodules; see Tests)
    - utils.py provides convenience helpers used by unit-tests, such as @with_tree, @serve_path_via_http and other decorators
  - ui/ - user-level interactions, such as messages about errors, warnings, progress reports, and -- when supported by the available frontend -- interactive dialogs
  - support/ - various support modules, e.g. for git/git-annex interfaces, constraints for the interface/, etc.
- benchmarks/ - asv benchmarks suite (see Benchmarking)
- docs/ - yet to be heavily populated documentation
  - bash-completions - bash and zsh completion setup for datalad (just source it)
- fixtures/ - currently not under git; contains fixtures generated by vcr
- sandbox/ - various scripts and prototypes which are not part of the main codebase distributed with releases
- tools/ - helper utilities used during development, testing, and benchmarking of DataLad, implemented in whatever language is most appropriate (Python, bash, etc.)
Whenever a new top-level file or folder is added to the repository, it should
be listed in MANIFEST.in
so that it will be either included in or excluded
from source distributions as appropriate. See
here for information
about writing a MANIFEST.in.
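As a rough illustration of the syntax (the entries below are hypothetical examples, not DataLad's actual manifest), a MANIFEST.in is a list of include/exclude commands such as:
include CONTRIBUTING.md
graft tools
prune sandbox
global-exclude *.pyc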
The preferred way to contribute to the DataLad code base is to fork the main repository on GitHub. Here we outline the workflow used by the developers:
- Have a clone of our main project repository as the origin remote in your git:
  git clone git://github.com/datalad/datalad
- Fork the project repository: click on the 'Fork' button near the top of the page. This creates a copy of the code base under your account on the GitHub server.
- Add your forked clone as a remote to the local clone you already have on your local disk:
  git remote add gh-YourLogin git@github.com:YourLogin/datalad.git
  git fetch gh-YourLogin
  To ease addition of other github repositories as remotes, here is a little bash function/script to add to your ~/.bashrc:
  ghremote () {
      url="$1"
      proj=${url##*/}
      url_=${url%/*}
      login=${url_##*/}
      git remote add gh-$login $url
      git fetch gh-$login
  }
  thus you could simply run:
  ghremote git@github.com:YourLogin/datalad.git
  to add the above gh-YourLogin remote. Additional handy aliases such as ghpr (to fetch an existing PR from someone's remote) and ghsendpr could be found at yarikoptic's bash config file.
- Create a branch (generally off origin/master) to hold your changes:
  git checkout -b nf-my-feature
  and start making changes. Ideally, use a prefix signaling the purpose of the branch:
  - nf- for new features
  - bf- for bug fixes
  - rf- for refactoring
  - doc- for documentation contributions (including in the code docstrings)
  - bm- for changes to benchmarks
  We recommend to not work in the master branch!
- Work on this copy on your computer using Git to do the version control. When you're done editing, do:
  git add modified_files
  git commit
  to record your changes in Git. Ideally, prefix your commit messages with NF, BF, RF, DOC, or BM, similar to the branch name prefixes, but you could also use TST for commits concerned solely with tests, and BK to signal that the commit causes a breakage (e.g. of tests) at that point. Multiple entries could be listed joined with a + (e.g. rf+doc-). See git log for examples. If a commit closes an existing DataLad issue, then add (Closes #ISSUE_NUMBER) to the end of the message.
- Push to GitHub with:
  git push -u gh-YourLogin nf-my-feature
  Finally, go to the web page of your fork of the DataLad repo, and click 'Pull request' (PR) to send your changes to the maintainers for review. This will send an email to the committers. You can commit new changes to this branch and keep pushing to your remote -- GitHub automagically adds them to your previously opened PR.
(If any of the above seems like magic to you, then look up the Git documentation on the web.) Our Design Docs provide a growing collection of insights on the command API principles and the design of particular subsystems in DataLad to inform standard development practice.
We support Python 3 only (>= 3.7).
See README.md:Dependencies for basic information about installation of datalad itself. On Debian-based systems we recommend enabling NeuroDebian, since we use it to provide backports of recent fixes for external modules we depend upon:
apt-get install -y -q git git-annex-standalone
apt-get install -y -q patool python3-scrapy python3-{argcomplete,git,humanize,keyring,lxml,msgpack,progressbar,requests,setuptools}
and additionally, for development, we suggest using tox and recent versions of the dependencies from PyPI:
apt-get install -y -q python3-{dev,httpretty,pytest,pip,vcr,virtualenv} python3-tox
# Some libraries which might be needed for installing via pip
apt-get install -y -q lib{ffi,ssl,curl4-openssl,xml2,xslt1}-dev
some of which you could also install from PyPI using pip (prior installation of the libraries listed above might be necessary):
pip install -r requirements-devel.txt
and you will need to install a recent git-annex using means appropriate for your OS (for Debian/Ubuntu, once again, just use NeuroDebian).
The original repository provided a .zenodo.json file, and we generate a .tributors file from that via:
pip install tributors
tributors --version
0.0.18
It helps to have a GitHub token to increase API limits:
export GITHUB_TOKEN=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Instructions for these environment variables can be found here. Then update zenodo:
tributors update zenodo
INFO: zenodo:Updating .zenodo.json
INFO: zenodo:Updating .tributors cache from .zenodo.json
WARNING:tributors:zenodo does not support updating from names.
In the case that there is more than one ORCID found for a user, you will be given a list to check. Others will be updated in the file. You can then curate the file as you see fit. We next want to add the .all-contributorsrc file:
$ tributors init allcontrib
INFO:allcontrib:Generating .all-contributorsrc for datalad/datalad
$ tributors update allcontrib
INFO:allcontrib:Updating .all-contributorsrc
INFO:allcontrib:Updating .tributors cache from .all-contributorsrc
INFO:allcontrib:⭐️ Found new contributor glalteva in .all-contributorsrc
INFO:allcontrib:⭐️ Found new contributor adswa in .all-contributorsrc
INFO:allcontrib:⭐️ Found new contributor chrhaeusler in .all-contributorsrc
...
INFO:allcontrib:⭐️ Found new contributor bpoldrack in .all-contributorsrc
INFO:allcontrib:⭐️ Found new contributor yetanothertestuser in .all-contributorsrc
WARNING:tributors:allcontrib does not support updating from orcids.
WARNING:tributors:allcontrib does not support updating from email.
We can then populate the shared .tributors file:
$ tributors update-lookup allcontrib
And then we can rely on the GitHub action to update contributors. The action is set to run on merges to master, meaning when the contributions are finalized. This means that we add new contributors, and we look for new orcids as we did above.
For merge commits to have a more informative description, add the following section to your .git/config or ~/.gitconfig:
[merge]
log = true
and if conflicts occur, provide a short summary of how they were resolved in a "Conflicts" listing within the merge commit (see example).
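Equivalently, you can enable this once from the command line:
git config --global merge.log true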
It is recommended to check that your contribution complies with the following rules before submitting a pull request:
- All public methods should have informative docstrings with sample usage presented as doctests when appropriate.
- All other tests pass when everything is rebuilt from scratch.
- New code should be accompanied by tests.
The documentation contains a Design Document specifically on running and writing tests that we encourage you to read beforehand. Further hands-on advice is detailed below.
datalad/tests
contains tests for the core portion of the project, and
more tests are provided under corresponding submodules in tests/
subdirectories to simplify re-running the tests concerning that portion
of the codebase. To execute many tests, the codebase first needs to be
"installed" in order to generate scripts for the entry points. For
that, the recommended course of action is to use virtualenv
, e.g.
virtualenv --system-site-packages venv-tests
source venv-tests/bin/activate
pip install -r requirements.txt
python setup.py develop
and then use that virtual environment to run the tests, via
pytest datalad
To deactivate the virtualenv later, simply enter
deactivate
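To iterate on a specific area without running the whole suite, you can point pytest at a single test module or filter tests by name (the path and pattern below are just examples):
python -m pytest -s -v datalad/tests/test_utils.py
python -m pytest datalad -k "test_install"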
Alternatively, or complementary to that, you can use tox
-- there is a tox.ini
file which sets up a few virtual environments for testing locally, which you can
later reuse like any other regular virtualenv for troubleshooting.
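For example (the exact environment names depend on the tox.ini in your checkout):
pip install tox
tox -l    # list the environments defined in tox.ini
tox       # run the default set of environments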
Additionally, tools/testing/test_README_in_docker script can
be used to establish a clean docker environment (based on any NeuroDebian-supported
release of Debian or Ubuntu) with all dependencies listed in README.md pre-installed.
We are using several continuous integration services to run our tests battery for every PR and on the default branch.
Please note that a new contributor's first PR needs workflow approval from a team member to start the CI runs, but we promise to promptly review and start the CI runs on your PR.
As the full CI suite takes a while to complete, we recommend running at least the tests directly related to your contribution locally beforehand.
Logs from all CI runs are collected periodically by con/tinuous and archived at smaug:/mnt/btrfs/datasets/datalad/ci/logs/.
For developing on Windows you can use free Windows VMs.
If you would like to propose patches against git-annex itself, submit them against the datalad/git-annex repository, which builds and tests git-annex.
You can also check for common programming errors with the following tools:
- Code with good unittest coverage (at least 80%), check with:
  pip install pytest pytest-cov
  pytest --cov=datalad path/to/tests_for_package
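  To also see which lines are missing coverage directly in the terminal, pytest-cov can produce a report, e.g.:
  pytest --cov=datalad --cov-report=term-missing path/to/tests_for_package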
- We rely on https://codecov.io to provide a convenient view of code coverage. Installation of the codecov extension for Firefox/Iceweasel or Chromium is strongly advised, since it provides coverage annotation of pull requests.
  We are not (yet) fully PEP8 compliant, so please use these tools as guidelines for your contributions, but do not PEP8 the entire code base.
  Sidenote: watch Raymond Hettinger - Beyond PEP 8
- No pyflakes warnings, check with:
  pip install pyflakes
  pyflakes path/to/module.py
- No PEP8 warnings, check with:
  pip install pep8
  pep8 path/to/module.py
- AutoPEP8 can help you fix some of the easy redundant errors:
  pip install autopep8
  autopep8 path/to/pep8.py
Also, some team developers use
PyCharm community edition which
provides a built-in PEP8 checker and handy tools such as smart
splits/joins making it easier to maintain code following the PEP8
recommendations. NeuroDebian provides pycharm-community-sloppy
package to ease pycharm installation even further.
We use asv to benchmark some core DataLad functionality.
The benchmarks suite is located under benchmarks/, and
periodically we publish results of running benchmarks on a dedicated host
to http://datalad.github.io/datalad/ . Those results are collected
and available under the .asv/
submodule of this repository, so to get started
git submodule update --init .asv
pip install .[devel]   # or just: pip install asv
asv machine            # configure asv for your host if you want to run benchmarks locally
And then you could use asv in multiple ways.
- asv run -E existing - benchmark using the existing python environment and just print out results (not stored anywhere). You can add -q to run each benchmark just once (thus less reliable estimates)
- asv run -b api.supers.time_createadd_to_dataset -E existing - would run that specific benchmark using the existing python environment
Note: --python=same (-E existing) seems to have restricted applicability, e.g. it can't be used for a range of commits, so it can't be used with continuous.
Use asv compare to compare results from different runs, which should be
available under .asv/results/<machine>
. (Note that the example
below passes ref names instead of commit IDs, which requires asv v0.3
or later.)
> asv compare -m hopa maint master
All benchmarks:
before after ratio
[b619eca4] [7635f467]
- 1.87s 1.54s 0.82 api.supers.time_createadd
- 1.85s 1.56s 0.84 api.supers.time_createadd_to_dataset
- 5.57s 4.40s 0.79 api.supers.time_installr
145±6ms 145±6ms 1.00 api.supers.time_ls
- 4.59s 2.17s 0.47 api.supers.time_remove
427±1ms 434±8ms 1.02 api.testds.time_create_test_dataset1
- 4.10s 3.37s 0.82 api.testds.time_create_test_dataset2x2
1.81±0.07ms 1.73±0.04ms 0.96 core.runner.time_echo
2.30±0.2ms 2.04±0.03ms ~0.89 core.runner.time_echo_gitrunner
+ 420±10ms 535±3ms 1.27 core.startup.time_help_np
111±6ms 107±3ms 0.96 core.startup.time_import
+ 334±6ms 466±4ms 1.39 core.startup.time_import_api
asv continuous could be used to first run benchmarks for the to-be-tested commits and then provide stats:
- asv continuous maint master - would run and compare the maint and master branches
- asv continuous HEAD - would compare HEAD against HEAD^
- asv continuous master HEAD - would compare HEAD against the state of master
- TODO: continuous -E existing
Notes:
- only significant changes will be reported
- raw results from benchmarks are not stored (use --record-samples if desired)
- asv run would run all configured branches (see asv.conf.json)
Example (replace with the benchmark of interest)
asv profile -v -o profile.gprof usecases.study_forrest.time_make_studyforrest_mockup
gprof2dot -f pstats profile.gprof | dot -Tpng -o profile.png \
&& xdg-open profile.png
- -E to restrict to a specific environment, e.g. -E virtualenv:2.7
- -b could be used to specify specific benchmark(s)
- -q to run each benchmark just once for a quick assessment (results are not stored since too unreliable)
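Putting those options together, a quick local run of a subset of benchmarks (the benchmark filter below is only an example) could look like:
asv run -E existing -b api.supers -q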
A great way to start contributing to DataLad is to pick an item from the list of Easy issues in the issue tracker. Resolving these issues allows you to start contributing to the project without much prior knowledge. Your assistance in this area will be greatly appreciated by the more experienced developers as it helps free up their time to concentrate on other issues.
We distinguish particular aspects of DataLad's functionality, each corresponding
to parts of the code base in this repository, and loosely maintain teams assigned
to these aspects.
While any contributor can tackle issues on any aspect, you may want to refer to
members of such teams (via GitHub tagging or review requests) or the team itself
(via GitHub issue label team-<area>
) when creating a PR, feature request, or bug report.
Members of a team are encouraged to respond to PRs or issues within the given area,
and pro-actively improve robustness, user experience, documentation, and
performance of the code.
New and existing contributors are invited to join teams:
- core: core API/commands (@datalad/team-core)
- git: Git interface (e.g. GitRepo, protocols, helpers, compatibility) (@datalad/team-git)
- gitannex: git-annex interface (e.g. AnnexRepo, protocols, helpers, compatibility) (@datalad/team-gitannex)
- remotes: (special) remote implementations (@datalad/team-remotes)
- runner: sub-process execution and IO (@datalad/team-runner)
- services: interaction with 3rd-party services (create-sibling*, downloaders, credentials, etc.) (@datalad/team-services)
We welcome and recognize all contributions from documentation to testing to code development.
You can see a list of current contributors in our zenodo file. If you are new to the project, don't forget to add your name and affiliation there! We also have an .all-contributorsrc that is updated automatically on merges. Once it's merged, if you helped in a non standard way (e.g., a contribution other than code) you can open a pull request to add any All Contributors Emoji that match your contribution types.
You're awesome. 👋😃
- While performing IO/net heavy operations use dstat for quick logging of various health stats in a separate terminal window:
  dstat -c --top-cpu -d --top-bio --top-latency --net
- To monitor the speed of any data pipelining pv is really handy; just plug it in the middle of your pipe.
- For remote debugging epdb could be used (available from pip) by adding import epdb; epdb.serve() to the Python code and then connecting to it with python -c "import epdb; epdb.connect()".
- We are using codecov, which has extensions for the popular browsers (Firefox, Chrome) that annotate pull requests on GitHub regarding changed coverage.
Refer to datalad/config.py for information on how to add these environment variables to the config file and their naming convention.
- DATALAD_DATASETS_TOPURL: Used to point to an alternative location for the /// dataset. If running tests, it is preferably set to https://datasets-tests.datalad.org
- DATALAD_LOG_LEVEL: Used to control the verbosity of logs printed to stdout while running datalad commands/debugging
- DATALAD_LOG_NAME: Whether to include the logger name (e.g. datalad.support.sshconnector) in the log
- DATALAD_LOG_OUTPUTS: Used to control whether both stdout and stderr of external command execution are logged in detail (at DEBUG level)
- DATALAD_LOG_PID: To instruct datalad to log the PID of the process
- DATALAD_LOG_TARGET: Where to log: stderr (default), stdout, or another filename
- DATALAD_LOG_TIMESTAMP: Used to add a timestamp to datalad logs
- DATALAD_LOG_TRACEBACK: Runs the TraceBack function with collide set to True, if this flag is set to 'collide'. This replaces any common prefix between the current traceback log and a previous invocation with "..."
- DATALAD_LOG_VMEM: Reports memory utilization (resident/virtual) at every log line; needs the psutil module
- DATALAD_EXC_STR_TBLIMIT: This flag is used by datalad to cap the number of traceback steps included in exception logging and result reporting to DATALAD_EXC_STR_TBLIMIT pre-processed entries from the traceback.
- DATALAD_SEED: To seed Python's random RNG, which will also be used for generation of dataset UUIDs to make those random values reproducible. You might also want to set all the relevant git config variables like we do in one of the travis runs
- DATALAD_TESTS_TEMP_KEEP: The rmtemp function will not remove the temporary file/directory created for testing if this flag is set
- DATALAD_TESTS_TEMP_DIR: Create a temporary directory at the location specified by this flag. It is used by tests to create a temporary git directory while testing git annex archives etc
- DATALAD_TESTS_NONETWORK: Skips network tests completely if this flag is set. Examples include tests for S3, git repositories, OpenfMRI, etc
- DATALAD_TESTS_SSH: Skips SSH tests if this flag is not set. If you enable this, you need to set up a "datalad-test" and "datalad-test2" target in your SSH configuration. The second target is used by only a couple of tests, so depending on the tests you're interested in, you can get by with only "datalad-test" configured. A Docker image that is used for DataLad's tests is available at https://github.com/datalad-tester/docker-ssh-target. Note that the DataLad tests assume that target files exist in DATALAD_TESTS_TEMP_DIR, which restricts the "datalad-test" target to being either the localhost or a container that mounts DATALAD_TESTS_TEMP_DIR.
- DATALAD_TESTS_NOTEARDOWN: Does not execute teardown_package, which cleans up temp files and directories created by tests, if this flag is set
- DATALAD_TESTS_USECASSETTE: Specifies the location of the file used by the VCR module to record network transactions. Currently used when testing custom special remotes
- DATALAD_TESTS_OBSCURE_PREFIX: A string to prefix the most obscure (but supported by the filesystem) test filename
- DATALAD_TESTS_PROTOCOLREMOTE: Binary flag to specify whether to test protocol interactions of the custom remote with annex
- DATALAD_TESTS_RUNCMDLINE: Binary flag to specify whether shell testing using shunit2 is to be carried out
- DATALAD_TESTS_TEMP_FS: Specify the temporary file system to use as a loop device for testing DATALAD_TESTS_TEMP_DIR creation
- DATALAD_TESTS_TEMP_FSSIZE: Specify the size of the temporary file system to use as a loop device for testing DATALAD_TESTS_TEMP_DIR creation
- DATALAD_TESTS_NONLO: Specifies network interfaces to bring down/up for testing. Currently used by travis.
- DATALAD_TESTS_KNOWNFAILURES_PROBE: Binary flag to test whether "known failures" still actually are failures. That is, it changes the behavior of tests decorated with any of the known_failure decorators: instead of being skipped, they are executed and fail if they would pass (indicating that the decorator may be removed/reconsidered).
- DATALAD_TESTS_GITCONFIG: Additional content to add to ~/.gitconfig in the tests' HOME environment. \n is replaced with os.linesep.
- DATALAD_TESTS_CREDENTIALS: Set to system to allow credentials possibly present in the user/system wide environment to be used.
- DATALAD_CMD_PROTOCOL: Specifies the protocol used by the Runner to note shell command or python function call times and allows for dry runs: 'externals-time' for ExecutionTimeExternalsProtocol, 'time' for ExecutionTimeProtocol and 'null' for NullProtocol. Any new DATALAD_CMD_PROTOCOL has to implement datalad.support.protocol.ProtocolInterface
- DATALAD_CMD_PROTOCOL_PREFIX: Sets a prefix to add before the command call times are noted by DATALAD_CMD_PROTOCOL.
- DATALAD_USE_DEFAULT_GIT: Instructs DataLad to use the git available in the current environment, and not the one which possibly comes with git-annex (the default behavior).
- DATALAD_ASSERT_NO_OPEN_FILES: Instructs test helpers to check for open files at the end of a test. If set, remaining open files are logged at ERROR level. Alternative modes are: "assert" (raise AssertionError if any open file is found), "pdb"/"epdb" (drop into a debugger when open files are found; info on the files is provided in a "files" dictionary, mapping filenames to psutil process objects).
- DATALAD_ALLOW_FAIL: Instructs the @never_fail decorator to allow failures, e.g. to ease debugging.
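As a quick usage sketch combining some of the variables above (the particular values are only illustrative), you could enable more verbose logging for a single command like this:
DATALAD_LOG_LEVEL=debug DATALAD_LOG_TIMESTAMP=1 DATALAD_LOG_OUTPUTS=1 datalad status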
- master: changes toward the next MAJOR.MINOR.0 release. Release candidates (tagged with an rcX suffix) are cut from this branch
- maint: bug fixes for the latest released MAJOR.MINOR.PATCH
- maint-MAJOR.MINOR: generally not used, unless some bug fix release with a critical bug fix is needed.

- upon release of MAJOR.MINOR.0, the maint branch needs to be fast-forwarded to that release
- bug fixes to functionality released within the maint branch should be submitted against the maint branch
- cherry-picking fixes from master into maint is allowed where needed
- the master branch accepts PRs with new functionality
- the master branch merges maint as frequently as needed
Makefile provides a number of useful make targets:
- linkissues-changelog: converts (#ISSUE) placeholders into proper markdown within CHANGELOG.md
- update-changelog: uses the above linkissues-changelog and updates the .rst changelog
- release-pypi: ensures that no dist/ exists yet, creates a wheel and a source distribution, and uploads them to pypi.
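For example, to refresh the rendered changelog locally you would run:
make update-changelog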
New releases of DataLad are created via a GitHub Actions workflow using datalad/release-action, which was inspired by auto.
Whenever a pull request with the "release" label is merged into maint, that workflow updates the
changelog based on the pull requests since the last release, commits the
results, tags the new commit with the next version number, and creates a GitHub
release for the tag.
This in turn triggers a job for building an sdist & wheel for the project and uploading them to PyPI.
DataLad uses scriv to maintain CHANGELOG.md.
Adding the label CHANGELOG-missing to a PR triggers a workflow that adds a new scriv changelog fragment under changelog.d/ using the PR title as the content.
The produced changelog snippet can subsequently be tuned to improve the prospective CHANGELOG entry.
The section that the workflow adds to the changelog depends on the semver- label added to the PR:
- semver-minor — for changes corresponding to an increase in the minor version component
- semver-patch — for changes corresponding to an increase in the patch/micro version component; this is the default label for unlabelled PRs
- semver-internal — for changes only affecting the internal API
- semver-documentation — for changes only affecting the documentation
- semver-tests — for changes to tests
- semver-dependencies — for updates to dependency versions
- semver-performance — for performance improvements
Even though git-annex is a separate project, DataLad's and git-annex's development is often intertwined.
It is not uncommon to discover potential git-annex bugs or git-annex feature requests while working on DataLad. In those cases, it is common for developers and contributors to file an issue in git-annex's public bug tracker at git-annex.branchable.com. Here are a few hints on how to go about it:
- You can report a new bug or browse through existing bug reports at git-annex.branchable.com/bugs
- In order to associate a bug report with the DataLad project you can add the following markup to the description:
[[!tag projects/datalad]]
- You can add author metadata with the following markup: [[!meta author=yoh]]. Some authors will be automatically associated with the DataLad project by git-annex's bug tracker.
To provide downstream testing of development git-annex
against DataLad, we maintain the datalad/git-annex repository.
It provides daily builds of git-annex with CI setup to run git-annex built-in tests and tests of DataLad across all supported operating systems.
It also has a facility to test git-annex on your client systems following the instructions.
All the build logs and artifacts (installer packages etc) for daily builds and releases are collected using con/tinuous and archived on smaug:/mnt/btrfs/datasets/datalad/ci/git-annex/.
You can test your fixes for git-annex by submitting patches for it following the instructions.