Skip to content

Latest commit

 

History

History
547 lines (390 loc) · 27.2 KB

DEVELOP.adoc

File metadata and controls

547 lines (390 loc) · 27.2 KB

rdiff-backup: Development Guide

1. GETTING THE SOURCE

Simply clone the source with:

git clone https://github.com/rdiff-backup/rdiff-backup.git
Note
If you plan to provide your own code, you should first fork our repo and clone your own forked repo (probably using ssh not https). How is described at https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/working-with-forks

2. GENERAL GUIDELINES

  • Before committing to a lot of writing or coding, please file an issue on Github and discuss your plans and gather feedback. Eventually it will be much easier to merge your change request if the idea and design has been agreed upon, and there will be less work for you as a contributor if you implement your idea along the correct lines to begin with.

  • Please check out existing issues and existing merge requests and browse the git history to see if somebody already tried to address the thing you have are interested in. It might provide useful insight why the current state is as it is.

  • Changes can be submitted using the typical Github workflow: clone this repository, make your changes, test and verify, and submit a Pull Request (PR).

  • For all code changes, please remember also to include inline comments and update tests where needed.

  • Follow of course our coding and documentation guidelines.

2.1. License

Rdiff-backup is licensed with GNU General Public License v2.0 or later. By contributing to this repository you agree that your work is licensed using the chosen project license.

2.2. Commit messages

If something is of interest for the changelog, prefix the statement in the commit body with a three uppercase letters and a colon:

  • FIX: for a bug fix

  • NEW: for a new feature

  • CHG: for a change requesting consideration when upgrading

  • DOC: for documentation and website aspects

If the commit addresses a specific issue, end the statement with a , closes #NNN message, NNN being of course the number of the issue.

Multiple entries might be required for one single change, e.g. NEW if you introduce a new feature, and CHG because this new features introduces some breaking changes on which the user needs to take actions at upgrade time. You don’t need to notice that you’ve added documentation with DOC for a new feature, this is obvious and redundant; DOC should be reserved for solely documentation changes.

Tip
make sure that your commit was properly formatted by calling ./tools/get_changelog_since.sh v2.2.2 (or whichever is the last released version), and checking that the whole message is present. Beware that the interactive git commit command tends to split long lines, which is contraproductive in our use case.

In summary, any commit message could look as follows:

git commit -m "Some commit title summarizing the change

DOC: some understandable message on one single line explaining clearly the documentation change, closes #123

Further development notes, important (and recommended) but not fit for end-user changelog"
Tip
check git log for earlier examples. Note that the DEV prefix has been deprecated.

2.3. Branching model and pull requests

The master branch is always kept in a clean state. Anybody can at any time clone this repository and branch off from master and expect test suite to pass and the code and other contents to be of good quality and a reasonable foundation for them to continue development on.

Each PR focuses on some topic and resist changing anything else. Keeping the scope clear also makes it easier to review the pull request. A good pull request has only one or a few commits, with each commit having a good commit subject and if needed also a body that explains the change.

Each pull request has only one author, but anybody can give feedback. The original author should be given time to address the feedback — reviewers should not do the fixes for the author, but instead let the author keep the authorship. Things can always be iterated and extended in future commits once the PR has been merged, or even in parallel if the changes are in different files or at least on different lines and do not cause merge conflicts if worked on.

It is the responsibility of the PR author to keep it without conflict with master (e.g. if not quickly merged) and overall to support the review process.

Ideally each pull request gets some feedback within 24 hours from it having been filed, and is merged within days or a couple of weeks. Each author should facilitate quick reviews and merges by making clean and neat commits and pull requests that are quick to review and do not spiral out in long discussions.

2.3.1. Merging changes to master

Currently the rdiff-backup Github repository is configured so that merging a pull request is possible only if it:

  • passes the CI testing

  • has at least one approving review

While anybody can make forks, pull requests and comment them, only a developer with write access to the main repository can merge and land commits in the master branch. To get write access, the person mush exhibit commitment to high standards and have a track record of meaningful contributions over several months.

It is the responsibility of the merging developer to make sure that the PR is squashed and that the squash commit message helps the release process with the right description and 3-capital-letters prefix (it is still the obligation of the PR author to provide enough information in their commit messages).

2.4. Versioning

In versioning we utilize git tags as understood by setuptools_scm. Version strings follow the PEP-440 standard.

The rules are currently as follows (check the files in .github/workflows for details):

  • all commits tagged with an underscore at the end or with a tag looking like a version number (i.e. as in next two bullets) are released to GitHub.

  • all commits tagged with alpha, beta, rc or final format are released to PyPI, i.e. the ones looking like: vX.Y.ZaN (alpha), vX.Y.ZbN (beta), vX.Y.ZrcN (release candidate) or vX.Y.Z (final).

  • all commits where the "version tag" is a development one, i.e. like previously with an additional .devM at the end, are released to Test PyPI. They are meant mostly to test the deployment itself (use alpha versions to release development code).

Note
the GitHub releases are created as draft, meaning that a maintainer must review them and publish them before they become visible.

3. BUILD AND INSTALL

3.1. Pre-requisites

The same pre-requisites as for the installation of rdiff-backup also apply for building:

  • Python 3.9 or higher

  • librsync 2.0.0 or higher (to be verified)

Further python dependencies are documented in requirements.txt.

Additionally following pre-requisites are needed:

  • python3-dev (or -devel)

  • librsync-dev (or -devel)

  • a C compiler (gcc)

  • libacl-devel (for sys/acl.h)

  • rdiff (for testing)

  • asciidoctor (for documentation generation)

  • rpdb and netcat/ncat/nc (for remote debugging of server processes)

All of those should come packaged with your system or available from https://pypi.org/ but if you need them otherwise, here are some sources:

3.1.1. Changing dependencies versions

Python interpreter
  • Windows:

    • .github/workflows/test_windows.yml - check for WIN_PYTHON_VERSION

    • .github/workflows/deploy.yml - check for WIN_PYTHON_VERSION

    • tools/windows/group_vars/windows_hosts/generic.yml - check for python_version and python_version_full

  • Linux:

    • tox.ini, tox_root.ini, tox_dist.ini and tox_slow.ini - check for envlist

    • .github/workflows/test_linux.yml - check for python-version

    • .github/workflows/deploy.yml - check for /opt/python/cp3…​ (and possibly many-linux)

    • pyproject.toml - check for requires-python

    • README.adoc - check for Python references

Python libraries and binary dependencies

All Python dependencies have been concentrated into requirements.txt, generated from requs/*.txt with one file for each purpose. Only those files should be used, and maintained, throughout the build/release process.

Binaries are listed in bindep.txt (based on the bindep utility).

In all cases, a validation of the documentation is also necessary, but the above files should be considered the ultimate source of truth, and correctly maintained.

3.2. Build and install using Makefile

The project has a Makefile that defines steps like all, build, test and others. You can view the contents to see what it exactly does. Using the Makefile is the easiest way to quickly build and test the source code.

By default the Makefile runs all of it’s command in a clean Docker container, thus making sure all the build dependencies are correctly defined and also protecting the host system from having to install them.

The CI pipeline also uses the Makefile, so if all commands in the Makefile succeed locally, the CI is most likely to pass as well.

3.3. Build and install

To install, simply run:

pip install .
Tip
if pip isn’t present on your system, or too old, you can install or upgrade it with python -m ensurepip --upgrade

The build process can be also be run separately using pyproject-build.

The setup.py script expects to find librsync headers and libraries in the default location, usually /usr/include and /usr/lib. If you want the setup script to check different locations, use the LIBRSYNC_DIR environment variable. For instance to instruct the setup program to look in /usr/local/include and /usr/local/lib for the librsync files run:

LIBRSYNC_DIR=/usr/local pyproject-build

Finally, the LFLAGS and LIBS environment variables are also recognized.

To build from source on Windows, check the Windows tools to build a single executable file which contains Python, librsync, and all required modules.

3.4. Install for test and development

There are the occasions where you don’t want to make your system "dirty" with an early or even development version of rdiff-backup. This is what virtual environments (or short virtualenv, or even venv) are meant for. Here a very short summary on how to create a virtualenv in the directory …​/rdb (name and exact location aren’t important, but once created, a virtualenv can’t be moved):

python -m venv .../rdb
source .../rdb/bin/activate  #(1)
which pip                    #(2)
pip install -r requirements.txt
# install rdiff-backup       #(3)
which rdiff-backup           #(2)
# use rdiff-backup and do whatever you want actually
deactivate                   #(4)
rm -fr .../rdb               #(5)
  1. assuming a bash shell, but there are other activate-scripts for other shells, even Windows' cmd. In all cases, you should have a prompt starting with (rdb).

  2. the path to the command should be …​/rdb/bin/<command>, else call hash -r (under bash) and try again.

  3. the different options to install rdiff-backup are listed below.

  4. you’re now leaving the virtualenv, the prompt should go back to normal.

  5. you can of course keep and maintain the virtualenv instead, but why?

Tip
the script ./tools/create_venv.sh is available to execute all these steps at once.

The different ways of installing rdiff-backup in such a virtualenv depend on the version type:

pip install rdiff-backup                 #(1)
pip install rdiff-backup==2.1.3b3        #(2)
pip install -i https://test.pypi.org/simple/ rdiff-backup==2.1.3.dev1  #(3)
pip install .                            #(4)
pip install rdiff-backup[meta]==2.1.3b3  #(5)
pip install .[meta]                      #(5)
  1. this will install the last stable version released to PyPI e.g. 2.0.5.

  2. this will install a specific version, e.g. alpha, beta or release candidate.

  3. this will install a development version inofficially released (seldom).

  4. this assumes that you have cloned the Git repo and are in its root, and will install this development state.

  5. this is the same as the above commands but installs also the optional dependencies of rdiff-backup.

4. TESTING

Clone, unpack and prepare the testfiles by calling the script tools/setup-testfiles.sh from the cloned source Git repo. You will most probably be asked for your password so that sudo can extract and prepare the testfiles (else the tests will fail).

That’s it, you can now run the tests:

  • run tox to use the default tox.ini

  • or tox -c tox_slow.ini for long tests

  • or sudo tox -c tox_root.ini for the few tests needing root rights

For more details on testing, see the test sections in the Makefile and the GitHub Actions.

A naming convention has been introduced to be able to easily use pytest if we want so, e.g. with pytest --cov --cov-config=tox.ini testing/*_test.py. So, all "normal" test files are to be named *_test.py. Specific files are similarly called *_SPECIFICtest.py e.g. generic_roottest.py for (generic) root tests. All these files are placed under the testing directory.

Note
the interest of pytest would be to parallelize tests, but this is currently difficult due to the usage of global variables.
Tip
check the files commontest.py and fileset.py for reusable functions.

5. DEBUGGING

5.1. Trace back a coredump

At the time of writing these notes, there was an issue where calling the program generates a Segmentation fault (core dumped). This chapter is based on this experience debugging under Fedora 29 (then partially tested again under Fedora 39).

References:

$ CFLAGS='-Wall -O0 -g' pip install .
$ python3-debug -m rdiffbackup.run -v 10 backup /some/dir1 /some/dir2
[...]
Segmentation fault (core dumped)
Note
The CFLAGS avoids optimizations making debugging too complicated

At this stage coredumpctl list shows that coredump is the last one, so that one can call coredumpctl gdb, which itself tells (in multiple steps) that we missing some more debug information, hence the above debuginfo-install statements.

So now back into coredumpctl gdb, with some commands:

help
help stack
backtrace  #(1)
bt full  #(2)
py-bt  #(3)
frame <FrameNumber>  #(4)
p __SomeVar__  #(5)

<1>get a backtrace of all function calls leading to the coredump (also bt) <2>backtrace with local vars <3>py-bt is the Python version of backtrace <4>jump between frames as listed by bt using their #FrameNumber <5>print some variable/expression in the context of the selected frame

Jumping between frames and printing the different variables, we can recognize that:

  1. the core dump is due to a seek on a null file pointer

  2. that the file pointer comes from the job pointer handed over to the function rs_job_iter

  3. the job pointer itself comes from the self variable handed over to _librsync_patchmaker_cycle

  4. reading through the librsync documentation, it appears that the job type is opaque, i.e. I can’t directly influence and it has been created via the rs_patch_begin function within the function _librsync_new_patchmaker in rdiff_backup/_librsyncmodule.c.

At this stage, it seems that the core file has given most of its secrets and we need to debug the live program:

$ PYTHONTRACEMALLOC=1 gdb -args python3-debug -m rdiffbackup.run backup \
	/some/source/dir /some/target/dir
(gdb) break rdiff_backup/_librsyncmodule.c:_librsync_new_patchmaker
(gdb) run
Tip
if you are not sure about the correct break statement, run once the program without it. Then the module will have been loaded, and autocompletion on the break command (with twice <TAB>) can help you find the right file and place.

The debugger runs until the breakpoint is reached, after which a succession of next and print <SomeVar> allows me to analyze the code step by step, and to come to the conclusion that cfile = fdopen(python_fd, ... is somehow wrong as it creates a null file pointer whereas python_fd looks like a valid file descriptor (an integer equal to 5).

5.2. ResourceWarning unclosed file

If you get something looking like a ResourceWarning: Enable tracemalloc to get the object allocation traceback

PYTHONTRACEMALLOC=1 rdiff-backup -v 10 backup /tmp/äłtèr /var/tmp/rdiff

This tells you indeed where the file was opened: Object allocated at (most recent call last) but it still requires deeper analysis to understand the reason.

5.3. Debug client / server mode

In order to make sure the debug messages are properly sorted, you need to have the verbosity level 9 set-up, mix stdout and stderr, and then use the date/time output to properly sort the lines coming both from server and client, while making sure that lines belonging together stay together. The result command line might look as follows:

rdiff-backup -v9 localhost::/sourcedir /backupdir 2>&1 | awk \
	'/^2019-09-16/ { if (line) print line; line = $0 } ! /^2019-09-16/ { line = line " ## " $0 }' \
	| sort | sed 's/ ## /\n/g'

Since version 2.1+, you can use the server’s --debug option to debug remotely the server process. Make sure first that you’ve installed rpdb (remote pdb) and netcat (also called nc or ncat).

If you make sure that you run the latest code version, and set all the environment variables correctly, you can then connect remotely to the spawned server process:

source .../rdb/bin/activate  # (1)
pip install .
python -m pdb -m rdiffbackup.run --remote-schema \
	"ssh -C {h}
	RDIFF_BACKUP_DEBUG=0.0.0.0:4445  # (2)
	.../rdb/bin/rdiff-backup server --debug" \  # (3)
backup source_dir localhost::/target_dir
pdb is running on 0.0.0.0:4445  # (4)
  1. see above how to create a virtualenv fit for rdiff-backup

  2. this variable is optional and only required if you want another address/port

  3. note the --debug option necessary to set a breakpoint early in the process

  4. here the address:port where the debug process is listening, the default is 127.0.0.1:4444

Once you’ve done this, in another terminal, you can call netcat localhost 4445 (resp. ncat or nc, and 4444 by default) and you’ll arrive on the pdb command line. You’re one or two n(ext) steps away from the pre-check method, so you can start to debug the server process relatively early (not the argument parsing step though).

Tip
rpdb is just a wrapper around pdb so it acts very similarly.

5.4. Debug iterators

When debugging, the fact that rdiff-backup uses a lot of iterators makes it rather complex to understand what’s happening. It would sometimes make it easier to have a list to study at once of iterating painfully through each but if you simply use p list(some_iter_var), you basically run through the iterator and it’s lost for the program, which can only fail.

The solution is to use itertools.tee, create a copy of the iterator and print the copy, e.g.:

(Pdb) import itertools
(Pdb) inc_pair_iter,mycopy = itertools.tee(inc_pair_iter)
(Pdb) p list(map(lambda x: [str(x[0]),list(map(str,x[1]))], mycopy))
[... whatever output ...]

Assuming the iteration has no side effects, the initial variable inc_pair_iter is still valid for the rest of the program, whereas the mycopy is "dried out" (but you can repeat the tee operation as often as you want).

5.5. Hints where to place breakpoints

Depending on the kind of issue, there are some good places to put a breakpoint:

  • if there is a file access issue, src/rdiff_backup/rpath.py in the make_file_dict(filename) function.

  • if you need to follow the listing of files and directories, src/rdiff_backup/selection.py in the diryield(rpath) function.

5.6. Get coverage details

If you need to check the details of the coverage report after the run of tox -e pyXY, you can simply call something like the following:

COVERAGE_FILE=.tox/pyXY/log/coverage.sqlite .tox/pyXY/bin/coverage report -m

The report output will show you which code lines aren’t covered by the tests.

Tip
if a clause needs to be excluded from the report, you can use the comment # pragma: no cover. But don’t do it because you can but only because you must!

5.7. Profile rdiff-backup

5.7.1. Profiling without code changes

After having created and activated the usual virtualenv, you may call something like the following to profile the current code (adapt to your Python version):

python -m cProfile -s tottime -m rdiffbackup.run [... rdiff-backup parameters ...]

The -s tottime option sorts by total time spent in the function. More information can be found in the profile documentation.

Tip
if you’re into graphical tools and overviews, have a look e.g. at https://pythonhosted.org//ProfileEye/ ?

You may also do memory profiling using the memory-profiler, though more detailed information requires changes to the code by adding the @profile decorator to functions:

pip install memory-profiler matplotlib
mprof run rdiff-backup [... rdiff-backup parameters ...]
mprof plot --output mprofile.png
mprof clean && rm mprofile.png
Tip
there is also a line-profiler, but I didn’t try it because it requires changes to the code (again the @profile decorator).

5.7.2. More profiling with code changes

Once you have found by profiling an object that uses a lot of memory, one can use print(sys.getsizeof(x)) to print it’s memory footprint then iterating for a code solution to bring it down.

Memory can be freed manually with:

import gc
collected_objects = gc.collect()

This can also be run in Python:

import cProfile, pstats, StringIO
pr = cProfile.Profile()
pr.enable()
# ... do something ... pr.disable()
s = StringIO.StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats(‘cumulative’)
ps.print_stats()
print s.getvalue()

6. RELEASING

There is no prior release schedule — they are made when deemed fit.

We use GitHub Actions to release automatically, as setup in the GitHub Workflows.

The following rules apply:

  • each modification to master happens through a Pull Request (PR) which triggers a pipeline job, which must be succesful for the merge to have a chance to happen. Such PR jobs will not trigger a release.

  • GitHub releases are generated as draft only on Git tags looking like a release. The release manager reviews then the draft release, names and describes it before they makes it visible. An automated Pypi release is foreseen but not yet implemented.

  • If you need to trigger a job for test purposes (e.g. because you changed something to the pipeline), create a branch or a tag with an underscore at the end of their name. Just make sure that you remove such tags, and potential draft releases, after usage.

  • If you want, again for test purposes, to trigger a PyPI deployment towards test.pypi.org, tag the commit before you push it with a development release tag, like vA.B.CbD.devN, then explicitly push the tag and the branch at the same time e.g. with git push origin vA.B.CbD.devN myname-mybranch.

Given the above rules, a release cycle looks roughly as follows:

  1. Call ./tools/get_changelog_since.sh PREVIOUSTAG to get a list of changes (see above) since the last release and a sorted and unique list of authors, on which basis you can extend the CHANGELOG for the new release. IMPORTANT: make sure that the PR is squashed or you won’t be able to trigger the release pipeline via a tag on master.

  2. Make sure you have the latest master commits with git checkout master && git pull --prune.

  3. Tag the last commit with git tag vX.Y.ZbN (beta) or `git tag vX.y.Z" (stable).

  4. Push the tag to GitHub with git push --tags.

  5. You can go to Actions to verify that the pipeline has started.

  6. If everything goes well, you should see the new draft release with all assets (aka packages) attached to it after all jobs have finished.

  7. Give the release a title and description and save it to make it visible to everybody.

  8. You’ll get a notification e-mail telling you that rdiff-backup-admin has released a new version.

  9. Use this e-mail to inform the rdiff-backup users.

Important
if not everything goes well, remove the tag both locally with git tag -d TAG and remotely with git push -d origin TAG. Then fix the issue with a new PR and start from the beginning.
Tip
if the PyPI deploy pipeline is broken, you may download the impacted wheel(s) from GitHub and upload them to PyPI from the command line using twine: twine upload [--repository-url https://test.pypi.org/legacy/] dist/rdiff\_backup-*.whl

The following sub-chapters list some learnings and specifities in case you need to modify the pipeline.

6.1. Delete draft releases

Because there is one draft release created for each pipeline job, it can be quite a lot when one tests the release pipeline. The GitHub WebUI requires quite a lot of clicks to delete them. A way to simplify (a bit) the deletion is to install the command line tool hub and call the following command:

hub release --include-drafts -f '%U %S %cr%n' | \
	awk '$2 == "draft" && $4 == "days" && $3 > 2 {print $1}' | xargs firefox

the 2 compared to $3 is the number of days, so that you get one tab opened in firefox for each draft release, so that you only need 2 clicks and one Ctrl+W (close the tab) to delete those releases.

Note
deletion directly using hub isn’t possible as it only supports tags and not release IDs. Drafts do NOT have tags…​