So you have access to a cluster and you want to run ItemSubjector in batch mode?
The guide below is adapted to the specifics of the Wikimedia cluster, but it should be possible to run it on any Kubernetes cluster with a Python >= 3.8 pod.
- Set up a Toolforge account, see https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Quickstart
- Download PuTTY (e.g. with chocolatey.org)
- Generate an SSH key using PuTTY and give it a password as per the WMF instructions here: https://www.mediawiki.org/wiki/Gerrit/Tutorial#Generate_a_new_SSH_key
- Log in via PuTTY to the host
dev-buster.toolforge.org
on port 22, see https://www.mediawiki.org/wiki/Toolserver:Logging_in#Logging_in_with_PuTTY. Ask for help in Telegram or IRC if you don't succeed.
- Now log into the toolserver web interface and create a tool, e.g. "itemsubjector-YOUR_USERNAME".
After the tool is registered, log out of SSH and back in.
- Become the tool:
become TOOLNAME
The author recommends GNU screen, which lets you have multiple "windows" and easily attach and detach.
If you don't know how to use screen, your life can become pretty miserable. Read up and watch e.g. https://www.youtube.com/results?search_query=gnu+screen
The author recommends:
- increasing the scrollback buffer by invoking screen with e.g.
screen -D -RR -h 5000
- using
ctrl + a ESC
to enter copy mode, where you can scroll back and inspect matches
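If you would rather not pass -h every time, the scrollback size can also be stored in a screenrc file using screen's defscrollback setting. The following is a minimal sketch, not part of ItemSubjector; the helper name ensure_scrollback is made up for illustration, and 5000 simply mirrors the value used above.

```shell
# Sketch: persist the scrollback size in a screenrc file so that new
# screen windows get it by default. ensure_scrollback is a hypothetical name.
ensure_scrollback() (
    rcfile=$1   # path to the screenrc file, e.g. "$HOME/.screenrc"
    lines=$2    # number of scrollback lines to keep
    # Create the file if it does not exist yet.
    touch "$rcfile"
    # Only append when no defscrollback line is present, so reruns are harmless.
    if ! grep -q '^defscrollback ' "$rcfile"; then
        printf 'defscrollback %s\n' "$lines" >> "$rcfile"
    fi
)

# Usage (commented out so that running this sketch changes nothing):
# ensure_scrollback "$HOME/.screenrc" 5000
```

An existing defscrollback line is left untouched, so a value you have already chosen is never overwritten.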
Clone the repository and enter it:
git clone https://github.com/dpriskorn/ItemSubjector.git itemsubjector && cd itemsubjector
Run this in the itemsubjector folder:
chmod +x *.sh
ln -s "$PWD/setup_environment.sh" ~/setup.sh
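A note on the symlink: a relative link target is resolved relative to the directory that contains the link, not the directory you ran ln in, so a link placed outside the itemsubjector folder needs an absolute target. The sketch below assumes the goal is a working link named setup.sh in the tool's home directory; link_script is a hypothetical helper, not something shipped with ItemSubjector.

```shell
# Sketch: link a script to another location using an absolute target,
# then check that the link resolves. link_script is a hypothetical helper.
link_script() (
    src=$1        # path to the existing script
    linkpath=$2   # where the symlink should be created
    # Turn the source into an absolute path so the link works from anywhere.
    srcdir=$(cd "$(dirname "$src")" && pwd) || exit 1
    abs="$srcdir/$(basename "$src")"
    ln -sf "$abs" "$linkpath" || exit 1
    # -e follows the link, so this fails if the link is dangling.
    [ -e "$linkpath" ]
)

# Usage, run from inside the itemsubjector folder:
# link_script setup_environment.sh "$HOME/setup.sh"
```

The function returns a non-zero status when the link is dangling, which makes a broken setup easy to spot before starting a job.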
The bastion only has Python 3.7 installed, which is not enough to run the new version of WikibaseIntegrator :/ This means that the requirements file from poetry cannot be used on the bastion until the Python version is updated.
Run this command instead:
pip install wikibaseintegrator==0.12.1 console-menu pydantic rich pandas
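To find out which case applies on your machine, you can compare the installed Python version against the 3.8 minimum mentioned at the top of this guide. This is only a sketch: version_at_least is a hypothetical helper, and it expects a version string of the form major.minor or major.minor.patch.

```shell
# Sketch: compare a "major.minor[.patch]" version string against a minimum.
# version_at_least is a hypothetical helper, not part of ItemSubjector.
version_at_least() (
    version=$1
    need_major=$2
    need_minor=$3
    major=${version%%.*}   # text before the first dot
    rest=${version#*.}     # text after the first dot
    minor=${rest%%.*}      # text before the next dot, if any
    # Numeric comparison, so 3.11 counts as newer than 3.8.
    [ "$major" -gt "$need_major" ] ||
        { [ "$major" -eq "$need_major" ] && [ "$minor" -ge "$need_minor" ]; }
)

# "python3 --version" prints e.g. "Python 3.7.3"; keep only the number.
if command -v python3 >/dev/null 2>&1; then
    found=$(python3 --version 2>&1 | awk '{print $2}')
    if version_at_least "$found" 3 8; then
        echo "Python $found meets the 3.8 minimum"
    else
        echo "Python $found is older than 3.8, use the pip command above"
    fi
fi
```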
Follow the README, but leave out "poetry run", e.g. run
python itemsubjector.py -a Q108801503
instead of the "poetry run" variant.
Run
./create_kubernettes_job_and_watch_the_log.sh 1
This will start the k8s job and show you the tail of the output.
Because watch is used, the output is updated every 2 seconds, which makes it easy for you to get an idea of the progress.
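The contents of the script are not reproduced in this guide, so the following is only a sketch of the general "watch the tail of a log" pattern, roughly what watch -n 2 tail -n 20 FILE does. The helper poll_tail and the log path in the usage line are hypothetical.

```shell
# Sketch: print the last lines of a log file repeatedly, similar in spirit
# to "watch -n 2 tail -n 20 FILE". poll_tail is a hypothetical helper.
poll_tail() (
    logfile=$1    # file to inspect
    interval=$2   # seconds to sleep between refreshes
    rounds=$3     # number of refreshes before returning
    i=0
    while [ "$i" -lt "$rounds" ]; do
        echo "--- $logfile (refresh $((i + 1)) of $rounds) ---"
        tail -n 20 "$logfile"
        i=$((i + 1))
        # Skip the sleep after the final refresh.
        if [ "$i" -lt "$rounds" ]; then
            sleep "$interval"
        fi
    done
)

# Usage with a hypothetical log path, refreshing every 2 seconds, 30 times:
# poll_tail "$HOME/job1.out" 2 30
```

Unlike watch, this sketch does not clear the screen between refreshes, so the older output stays in your scrollback.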