Home // Development Guide
Note
|
Suggest improvements to this documentation! |
Simply clone the source with:
git clone https://github.com/rdiff-backup/rdiff-backup.git
Note
|
If you plan to provide your own code, you should first fork our repo and clone your own forked repo (probably using ssh not https). How is described at https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/working-with-forks |
-
Before committing to a lot of writing or coding, please file an issue on Github and discuss your plans and gather feedback. Eventually it will be much easier to merge your change request if the idea and design has been agreed upon, and there will be less work for you as a contributor if you implement your idea along the correct lines to begin with.
-
Please check out existing issues and existing merge requests and browse the git history to see if somebody already tried to address the thing you have are interested in. It might provide useful insight why the current state is as it is.
-
Changes can be submitted using the typical Github workflow: clone this repository, make your changes, test and verify, and submit a Pull Request (PR).
-
For all code changes, please remember also to include inline comments and update tests where needed.
-
Follow of course our coding and documentation guidelines.
Rdiff-backup is licensed with GNU General Public License v2.0 or later. By contributing to this repository you agree that your work is licensed using the chosen project license.
If something is of interest for the changelog, prefix the statement in the commit body with a three uppercase letters and a colon:
-
FIX: for a bug fix
-
NEW: for a new feature
-
CHG: for a change requesting consideration when upgrading
-
DOC: for documentation and website aspects
If the commit addresses a specific issue, end the statement with a , closes #NNN
message, NNN
being of course the number of the issue.
Multiple entries might be required for one single change, e.g. NEW
if you introduce a new feature, and CHG
because this new features introduces some breaking changes on which the user needs to take actions at upgrade time.
You don’t need to notice that you’ve added documentation with DOC
for a new feature, this is obvious and redundant; DOC
should be reserved for solely documentation changes.
Tip
|
make sure that your commit was properly formatted by calling ./tools/get_changelog_since.sh v2.2.2 (or whichever is the last released version), and checking that the whole message is present.
Beware that the interactive git commit command tends to split long lines, which is contraproductive in our use case.
|
In summary, any commit message could look as follows:
git commit -m "Some commit title summarizing the change DOC: some understandable message on one single line explaining clearly the documentation change, closes #123 Further development notes, important (and recommended) but not fit for end-user changelog"
Tip
|
check git log for earlier examples.
Note that the DEV prefix has been deprecated.
|
The master branch is always kept in a clean state. Anybody can at any time clone this repository and branch off from master and expect test suite to pass and the code and other contents to be of good quality and a reasonable foundation for them to continue development on.
Each PR focuses on some topic and resist changing anything else. Keeping the scope clear also makes it easier to review the pull request. A good pull request has only one or a few commits, with each commit having a good commit subject and if needed also a body that explains the change.
Each pull request has only one author, but anybody can give feedback. The original author should be given time to address the feedback — reviewers should not do the fixes for the author, but instead let the author keep the authorship. Things can always be iterated and extended in future commits once the PR has been merged, or even in parallel if the changes are in different files or at least on different lines and do not cause merge conflicts if worked on.
It is the responsibility of the PR author to keep it without conflict with master (e.g. if not quickly merged) and overall to support the review process.
Ideally each pull request gets some feedback within 24 hours from it having been filed, and is merged within days or a couple of weeks. Each author should facilitate quick reviews and merges by making clean and neat commits and pull requests that are quick to review and do not spiral out in long discussions.
Currently the rdiff-backup Github repository is configured so that merging a pull request is possible only if it:
-
passes the CI testing
-
has at least one approving review
While anybody can make forks, pull requests and comment them, only a developer with write access to the main repository can merge and land commits in the master branch. To get write access, the person mush exhibit commitment to high standards and have a track record of meaningful contributions over several months.
It is the responsibility of the merging developer to make sure that the PR is squashed and that the squash commit message helps the release process with the right description and 3-capital-letters prefix (it is still the obligation of the PR author to provide enough information in their commit messages).
In versioning we utilize git tags as understood by setuptools_scm. Version strings follow the PEP-440 standard.
The rules are currently as follows (check the files in .github/workflows
for details):
-
all commits tagged with an underscore at the end or with a tag looking like a version number (i.e. as in next two bullets) are released to GitHub.
-
all commits tagged with alpha, beta, rc or final format are released to PyPI, i.e. the ones looking like: vX.Y.ZaN (alpha), vX.Y.ZbN (beta), vX.Y.ZrcN (release candidate) or vX.Y.Z (final).
-
all commits where the "version tag" is a development one, i.e. like previously with an additional
.devM
at the end, are released to Test PyPI. They are meant mostly to test the deployment itself (use alpha versions to release development code).
Note
|
the GitHub releases are created as draft, meaning that a maintainer must review them and publish them before they become visible. |
The same pre-requisites as for the installation of rdiff-backup also apply for building:
-
Python 3.9 or higher
-
librsync 2.0.0 or higher (to be verified)
Further python dependencies are documented in requirements.txt.
Additionally following pre-requisites are needed:
-
python3-dev (or -devel)
-
librsync-dev (or -devel)
-
a C compiler (gcc)
-
libacl-devel (for sys/acl.h)
-
rdiff (for testing)
-
asciidoctor (for documentation generation)
-
rpdb and netcat/ncat/nc (for remote debugging of server processes)
All of those should come packaged with your system or available from https://pypi.org/ but if you need them otherwise, here are some sources:
-
Python - https://www.python.org/
-
Librsync - http://librsync.sourceforge.net/
-
Pywin32 - https://github.com/mhammond/pywin32
-
Pylibacl - http://pylibacl.sourceforge.net/
-
Pyxattr - http://pyxattr.sourceforge.net/
-
PyYAML - https://github.com/yaml/pyyaml
-
Windows:
-
.github/workflows/test_windows.yml - check for
WIN_PYTHON_VERSION
-
.github/workflows/deploy.yml - check for
WIN_PYTHON_VERSION
-
tools/windows/group_vars/windows_hosts/generic.yml - check for
python_version
andpython_version_full
-
-
Linux:
-
tox.ini, tox_root.ini, tox_dist.ini and tox_slow.ini - check for
envlist
-
.github/workflows/test_linux.yml - check for
python-version
-
.github/workflows/deploy.yml - check for
/opt/python/cp3…
(and possiblymany-linux
) -
pyproject.toml - check for
requires-python
-
README.adoc - check for Python references
-
All Python dependencies have been concentrated into requirements.txt
, generated from requs/*.txt
with one file for each purpose.
Only those files should be used, and maintained, throughout the build/release process.
Binaries are listed in bindep.txt
(based on the bindep utility).
In all cases, a validation of the documentation is also necessary, but the above files should be considered the ultimate source of truth, and correctly maintained.
The project has a Makefile that defines steps like all
, build
, test
and others.
You can view the contents to see what it exactly does.
Using the Makefile
is the easiest way to quickly build and test the source code.
By default the Makefile
runs all of it’s command in a clean Docker container, thus making sure all the build dependencies are correctly defined and also protecting the host system from having to install them.
The CI pipeline also uses the Makefile
, so if all commands in the Makefile
succeed locally, the CI is most likely to pass as well.
To install, simply run:
pip install .
Tip
|
if pip isn’t present on your system, or too old, you can install or upgrade it with python -m ensurepip --upgrade
|
The build process can be also be run separately using pyproject-build
.
The setup.py
script expects to find librsync headers and libraries in the default location, usually /usr/include
and /usr/lib
.
If you want the setup script to check different locations, use the LIBRSYNC_DIR
environment variable.
For instance to instruct the setup program to look in /usr/local/include
and /usr/local/lib
for the librsync files run:
LIBRSYNC_DIR=/usr/local pyproject-build
Finally, the LFLAGS
and LIBS
environment variables are also recognized.
To build from source on Windows, check the Windows tools to build a single executable file which contains Python, librsync, and all required modules.
There are the occasions where you don’t want to make your system "dirty" with an early or even development version of rdiff-backup.
This is what virtual environments (or short virtualenv, or even venv) are meant for.
Here a very short summary on how to create a virtualenv in the directory …/rdb
(name and exact location aren’t important, but once created, a virtualenv can’t be moved):
python -m venv .../rdb source .../rdb/bin/activate #(1) which pip #(2) pip install -r requirements.txt # install rdiff-backup #(3) which rdiff-backup #(2) # use rdiff-backup and do whatever you want actually deactivate #(4) rm -fr .../rdb #(5)
-
assuming a bash shell, but there are other activate-scripts for other shells, even Windows' cmd. In all cases, you should have a prompt starting with
(rdb)
. -
the path to the command should be
…/rdb/bin/<command>
, else callhash -r
(under bash) and try again. -
the different options to install rdiff-backup are listed below.
-
you’re now leaving the virtualenv, the prompt should go back to normal.
-
you can of course keep and maintain the virtualenv instead, but why?
Tip
|
the script ./tools/create_venv.sh is available to execute all these steps at once.
|
The different ways of installing rdiff-backup in such a virtualenv depend on the version type:
pip install rdiff-backup #(1) pip install rdiff-backup==2.1.3b3 #(2) pip install -i https://test.pypi.org/simple/ rdiff-backup==2.1.3.dev1 #(3) pip install . #(4) pip install rdiff-backup[meta]==2.1.3b3 #(5) pip install .[meta] #(5)
-
this will install the last stable version released to PyPI e.g. 2.0.5.
-
this will install a specific version, e.g. alpha, beta or release candidate.
-
this will install a development version inofficially released (seldom).
-
this assumes that you have cloned the Git repo and are in its root, and will install this development state.
-
this is the same as the above commands but installs also the optional dependencies of rdiff-backup.
Clone, unpack and prepare the testfiles by calling the script tools/setup-testfiles.sh
from the cloned source Git repo.
You will most probably be asked for your password so that sudo can extract and prepare the testfiles (else the tests will fail).
That’s it, you can now run the tests:
-
run
tox
to use the defaulttox.ini
-
or
tox -c tox_slow.ini
for long tests -
or
sudo tox -c tox_root.ini
for the few tests needing root rights
For more details on testing, see the test
sections in the Makefile and the GitHub Actions.
A naming convention has been introduced to be able to easily use pytest if we want so, e.g. with pytest --cov --cov-config=tox.ini testing/*_test.py
.
So, all "normal" test files are to be named *_test.py
.
Specific files are similarly called *_SPECIFICtest.py
e.g. generic_roottest.py
for (generic) root tests.
All these files are placed under the testing
directory.
Note
|
the interest of pytest would be to parallelize tests, but this is currently difficult due to the usage of global variables.
|
Tip
|
check the files commontest.py and fileset.py for reusable functions.
|
At the time of writing these notes, there was an issue where calling the program generates a Segmentation fault (core dumped)
.
This chapter is based on this experience debugging under Fedora 29 (then partially tested again under Fedora 39).
References:
-
https://ask.fedoraproject.org/en/question/98776/where-is-core-dump-located/
-
Adventures in Python core dumping: https://gist.github.com/toolness/d56c1aab317377d5d17a
-
Debugging dynamically loaded extensions: https://www.oreilly.com/library/view/python-cookbook/0596001673/ch16s08.html
-
Debugging Memory Problems: https://www.oreilly.com/library/view/python-cookbook/0596001673/ch16s09.html
-
First install:
sudo dnf install python3-debug gdb sudo dnf debuginfo-install --exclude "*.i686" \ #(1) python3-debug bzip2-libs glibc librsync libxcrypt openssl-libs \ popt sssd-client xz-libs zlib
<1>The exclude pattern was necessary to avoid installing 32 bits library
-
Create a virtualenv with
python3-debug
, and activate it -
Then run:
-
$ CFLAGS='-Wall -O0 -g' pip install .
$ python3-debug -m rdiffbackup.run -v 10 backup /some/dir1 /some/dir2
[...]
Segmentation fault (core dumped)
Note
|
The CFLAGS avoids optimizations making debugging too complicated |
At this stage coredumpctl list
shows that coredump is the last one, so that one can call coredumpctl gdb
, which itself tells (in multiple steps) that we missing some more debug information, hence the above debuginfo-install
statements.
So now back into coredumpctl gdb
, with some commands:
help
help stack
backtrace #(1)
bt full #(2)
py-bt #(3)
frame <FrameNumber> #(4)
p __SomeVar__ #(5)
<1>get a backtrace of all function calls leading to the coredump (also bt
)
<2>backtrace with local vars
<3>py-bt is the Python version of backtrace
<4>jump between frames as listed by bt using their #FrameNumber
<5>print some variable/expression in the context of the selected frame
Jumping between frames and printing the different variables, we can recognize that:
-
the core dump is due to a seek on a null file pointer
-
that the file pointer comes from the job pointer handed over to the function rs_job_iter
-
the job pointer itself comes from the self variable handed over to
_librsync_patchmaker_cycle
-
reading through the librsync documentation, it appears that the job type is opaque, i.e. I can’t directly influence and it has been created via the
rs_patch_begin
function within the function_librsync_new_patchmaker
inrdiff_backup/_librsyncmodule.c
.
At this stage, it seems that the core file has given most of its secrets and we need to debug the live program:
$ PYTHONTRACEMALLOC=1 gdb -args python3-debug -m rdiffbackup.run backup \
/some/source/dir /some/target/dir
(gdb) break rdiff_backup/_librsyncmodule.c:_librsync_new_patchmaker
(gdb) run
Tip
|
if you are not sure about the correct break statement, run once the program without it.
Then the module will have been loaded, and autocompletion on the break command (with twice <TAB>) can help you find the right file and place.
|
The debugger runs until the breakpoint is reached, after which a succession of next
and print <SomeVar>
allows me to analyze the code step by step, and to come to the conclusion that cfile = fdopen(python_fd, ...
is somehow wrong as it creates a null file pointer whereas python_fd
looks like a valid file descriptor (an integer equal to 5).
If you get something looking like a ResourceWarning: Enable tracemalloc to get the object allocation traceback
PYTHONTRACEMALLOC=1 rdiff-backup -v 10 backup /tmp/äłtèr /var/tmp/rdiff
This tells you indeed where the file was opened: Object allocated at (most recent call last)
but it still requires deeper analysis to understand the reason.
Note
|
See https://docs.python.org/3/library/tracemalloc.html for more information. |
In order to make sure the debug messages are properly sorted, you need to have the verbosity level 9 set-up, mix stdout and stderr, and then use the date/time output to properly sort the lines coming both from server and client, while making sure that lines belonging together stay together. The result command line might look as follows:
rdiff-backup -v9 localhost::/sourcedir /backupdir 2>&1 | awk \
'/^2019-09-16/ { if (line) print line; line = $0 } ! /^2019-09-16/ { line = line " ## " $0 }' \
| sort | sed 's/ ## /\n/g'
Since version 2.1+, you can use the server’s --debug
option to debug remotely the server process.
Make sure first that you’ve installed rpdb (remote pdb) and netcat (also called nc or ncat).
If you make sure that you run the latest code version, and set all the environment variables correctly, you can then connect remotely to the spawned server process:
source .../rdb/bin/activate # (1) pip install . python -m pdb -m rdiffbackup.run --remote-schema \ "ssh -C {h} RDIFF_BACKUP_DEBUG=0.0.0.0:4445 # (2) .../rdb/bin/rdiff-backup server --debug" \ # (3) backup source_dir localhost::/target_dir pdb is running on 0.0.0.0:4445 # (4)
-
see above how to create a virtualenv fit for rdiff-backup
-
this variable is optional and only required if you want another address/port
-
note the
--debug
option necessary to set a breakpoint early in the process -
here the address:port where the debug process is listening, the default is 127.0.0.1:4444
Once you’ve done this, in another terminal, you can call netcat localhost 4445
(resp. ncat
or nc
, and 4444 by default) and you’ll arrive on the pdb command line.
You’re one or two n(ext)
steps away from the pre-check method, so you can start to debug the server process relatively early (not the argument parsing step though).
Tip
|
rpdb is just a wrapper around pdb so it acts very similarly. |
When debugging, the fact that rdiff-backup uses a lot of iterators makes it rather complex to understand what’s happening.
It would sometimes make it easier to have a list to study at once of iterating painfully through each but if you simply use p list(some_iter_var)
, you basically run through the iterator and it’s lost for the program, which can only fail.
The solution is to use itertools.tee
, create a copy of the iterator and print the copy, e.g.:
(Pdb) import itertools (Pdb) inc_pair_iter,mycopy = itertools.tee(inc_pair_iter) (Pdb) p list(map(lambda x: [str(x[0]),list(map(str,x[1]))], mycopy)) [... whatever output ...]
Assuming the iteration has no side effects, the initial variable inc_pair_iter
is still valid for the rest of the program, whereas the mycopy
is "dried out" (but you can repeat the tee
operation as often as you want).
Depending on the kind of issue, there are some good places to put a breakpoint:
-
if there is a file access issue,
src/rdiff_backup/rpath.py
in themake_file_dict(filename)
function. -
if you need to follow the listing of files and directories,
src/rdiff_backup/selection.py
in thediryield(rpath)
function.
If you need to check the details of the coverage report after the run of tox -e pyXY
, you can simply call something like the following:
COVERAGE_FILE=.tox/pyXY/log/coverage.sqlite .tox/pyXY/bin/coverage report -m
The report output will show you which code lines aren’t covered by the tests.
Tip
|
if a clause needs to be excluded from the report, you can use the comment # pragma: no cover .
But don’t do it because you can but only because you must!
|
After having created and activated the usual virtualenv, you may call something like the following to profile the current code (adapt to your Python version):
python -m cProfile -s tottime -m rdiffbackup.run [... rdiff-backup parameters ...]
The -s tottime
option sorts by total time spent in the function.
More information can be found in the profile documentation.
Tip
|
if you’re into graphical tools and overviews, have a look e.g. at https://pythonhosted.org//ProfileEye/ ? |
You may also do memory profiling using the memory-profiler, though more detailed information requires changes to the code by adding the @profile
decorator to functions:
pip install memory-profiler matplotlib mprof run rdiff-backup [... rdiff-backup parameters ...] mprof plot --output mprofile.png mprof clean && rm mprofile.png
Tip
|
there is also a line-profiler, but I didn’t try it because it requires changes to the code (again the @profile decorator).
|
Once you have found by profiling an object that uses a lot of memory, one can use print(sys.getsizeof(x))
to print it’s memory footprint then iterating for a code solution to bring it down.
Memory can be freed manually with:
import gc collected_objects = gc.collect()
This can also be run in Python:
import cProfile, pstats, StringIO pr = cProfile.Profile() pr.enable() # ... do something ... pr.disable() s = StringIO.StringIO() ps = pstats.Stats(pr, stream=s).sort_stats(‘cumulative’) ps.print_stats() print s.getvalue()
There is no prior release schedule — they are made when deemed fit.
We use GitHub Actions to release automatically, as setup in the GitHub Workflows.
The following rules apply:
-
each modification to master happens through a Pull Request (PR) which triggers a pipeline job, which must be succesful for the merge to have a chance to happen. Such PR jobs will not trigger a release.
-
GitHub releases are generated as draft only on Git tags looking like a release. The release manager reviews then the draft release, names and describes it before they makes it visible. An automated Pypi release is foreseen but not yet implemented.
-
If you need to trigger a job for test purposes (e.g. because you changed something to the pipeline), create a branch or a tag with an underscore at the end of their name. Just make sure that you remove such tags, and potential draft releases, after usage.
-
If you want, again for test purposes, to trigger a PyPI deployment towards test.pypi.org, tag the commit before you push it with a development release tag, like
vA.B.CbD.devN
, then explicitly push the tag and the branch at the same time e.g. withgit push origin vA.B.CbD.devN myname-mybranch
.
Given the above rules, a release cycle looks roughly as follows:
-
Call
./tools/get_changelog_since.sh PREVIOUSTAG
to get a list of changes (see above) since the last release and a sorted and unique list of authors, on which basis you can extend the CHANGELOG for the new release. IMPORTANT: make sure that the PR is squashed or you won’t be able to trigger the release pipeline via a tag on master. -
Make sure you have the latest master commits with
git checkout master && git pull --prune
. -
Tag the last commit with
git tag vX.Y.ZbN
(beta) or `git tag vX.y.Z" (stable). -
Push the tag to GitHub with
git push --tags
. -
You can go to Actions to verify that the pipeline has started.
-
If everything goes well, you should see the new draft release with all assets (aka packages) attached to it after all jobs have finished.
-
Give the release a title and description and save it to make it visible to everybody.
-
You’ll get a notification e-mail telling you that rdiff-backup-admin has released a new version.
-
Use this e-mail to inform the rdiff-backup users.
Important
|
if not everything goes well, remove the tag both locally with git tag -d TAG and remotely with git push -d origin TAG .
Then fix the issue with a new PR and start from the beginning.
|
Tip
|
if the PyPI deploy pipeline is broken, you may download the impacted wheel(s) from GitHub and upload them to PyPI from the command line using twine: twine upload [--repository-url https://test.pypi.org/legacy/] dist/rdiff\_backup-*.whl
|
The following sub-chapters list some learnings and specifities in case you need to modify the pipeline.
Because there is one draft release created for each pipeline job, it can be quite a lot when one tests the release pipeline.
The GitHub WebUI requires quite a lot of clicks to delete them.
A way to simplify (a bit) the deletion is to install the command line tool hub
and call the following command:
hub release --include-drafts -f '%U %S %cr%n' | \ awk '$2 == "draft" && $4 == "days" && $3 > 2 {print $1}' | xargs firefox
the 2
compared to $3
is the number of days, so that you get one tab opened in firefox for each draft release, so that you only need 2 clicks and one Ctrl+W (close the tab) to delete those releases.
Note
|
deletion directly using hub isn’t possible as it only supports tags and not release IDs. Drafts do NOT have tags… |