Installation instructions (v 0.15+)
The easiest way to install KonText is to create an LXC/LXD container with Ubuntu 22.04 LTS, clone the KonText repository and run the scripts/install/install.py script, which performs all the installation and configuration steps necessary to run KonText as a standalone server application.
[1] Set up and start an LXC container:
sudo apt-get install lxc
sudo lxc-create -t download -n kontext-container -- -d ubuntu -r jammy -a amd64
sudo lxc-start -n kontext-container
[2] Note down the container's IP address:
sudo lxc-info -n kontext-container -iH
[3] Open the container's shell:
sudo lxc-attach -n kontext-container
(for more details about lxc containers, see https://linuxcontainers.org/lxc)
[4] In the container, clone the KonText git repository to a directory of your choice (e.g. /opt/kontext), set the required permissions and run the install script.
sudo apt-get update
sudo apt-get install -y ca-certificates git
git clone https://github.com/czcorpus/kontext.git /opt/kontext/
python3 /opt/kontext/scripts/install/install.py
By default, the script installs Manatee from the deb packages provided by the NLP Center at the Faculty of Informatics, Masaryk University. If you wish to build Manatee from sources and use the ucnk-specific Manatee patch, you can use the install.py --patch /path/to/patch option.
(for more details about Manatee installation, see https://nlp.fi.muni.cz/trac/noske/wiki/Downloads)
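For example, assuming a locally prepared patch file (the path is illustrative):
python3 /opt/kontext/scripts/install/install.py --patch /path/to/ucnk-manatee.patch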
[5] Once the installation is complete, start KonText by entering the following command in the installation directory you specified above (/opt/kontext):
python3 public/app.py --address 127.0.0.1 --port 8080
[6] Open [container_IP_address] in your browser on the host. You should see KonText's home page and be able to enter a query to search the sample Susanne corpus.
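You can also check the response from the host's shell (the address is the one noted in step [2]):
curl -I http://[container_IP_address]/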
To install a newer version:
[1] fetch data from the origin repository and merge it:
git fetch origin
git merge origin/master
(replace master with a different branch if you use one)
[2] install any new JS dependencies:
npm install
[3] build the project:
make production
[4] optionally, if you want to deploy to a different directory without any unnecessary files:
cp -r {conf,lib,locale,package.json,public,scripts,worker.py} destination_directory
You can also take a look at the helper script which, once you write a simple JSON configuration file, allows you to install the newest version from GitHub while keeping a backup copy of your previous installation(s).
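After deploying a new version, remember to restart the application services. A sketch, assuming your systemd units are named gunicorn and celery (the actual names depend on your setup):
sudo systemctl restart gunicorn celery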
The key factor in deciding how to configure performance-related parameters is how fast your server can handle a typical query on a typical corpus, and how many users will be querying the server.
In general, the following parameters should be in harmony (see the sketch after this list):
- HTTP proxy
  - read timeout (proxy_read_timeout in Nginx)
- Celery
  - soft timeout (/kontext/calc_backend/task_time_limit in conf/config.xml)
  - hard timeout (CELERYD_OPTS="--time-limit=1800 --concurrency=64")
  - number of workers (again CELERYD_OPTS=...)
- Gunicorn
  - timeout
  - number of workers
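As an illustration of keeping these values aligned, here is a minimal Nginx sketch, assuming KonText listens on 127.0.0.1:8080 (all values are illustrative):
# inside your server { ... } block
location / {
    proxy_pass http://127.0.0.1:8080;
    # keep this slightly above Celery's hard time limit (1800 s above)
    proxy_read_timeout 1810s;
}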
For small installations with infrequent visits, 2-4 Gunicorn workers can be enough. For thousands (to a few tens of thousands) of users a day, 16-24 Gunicorn workers should do the job. If you expect usage peaks (e.g. workshops), it is better to add more workers even if some of them may sleep most of the time.
In the case of Celery, the best number to start with is the same as the number of Gunicorn workers; then monitor how the application performs. With large corpora, our recommendation would be about 1.5 to 2 times the number of Gunicorn workers.
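A corresponding Gunicorn configuration sketch (the file name and values are assumptions, not part of the KonText distribution):
# gunicorn-conf.py
bind = '127.0.0.1:8080'
# 2-4 workers for small installations, 16-24 for thousands of users a day
workers = 16
# keep in harmony with the proxy read timeout and Celery limits
timeout = 120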
The uWSGI server can be used instead of Gunicorn. For more information, please see uWSGI.md.
Celery Beat allows cron-like task management within Celery. KonText uses it mainly to remove old cache files (concordance, frequency, collocations).
File /etc/systemd/system/celerybeat.service:
[Unit]
Description=Celery Beat Service
After=network.target celery.service
[Service]
Type=forking
User=celery
Group=www-data
PIDFile=/var/run/celerybeat/beat.pid
EnvironmentFile=/etc/conf.d/celerybeat
WorkingDirectory=/opt/kontext/production/scripts/celery
ExecStart=/bin/sh -c '${CELERY_BIN} --app=${CELERY_APP} beat ${CELERYBEAT_OPTS} --workdir=${CELERYBEAT_CHDIR} --detach --pidfile=${CELERYBEAT_PID_FILE} --logfile=${CELERYBEAT_LOGFILE} --loglevel=${CELERYBEAT_LOGLEVEL}'
ExecStop=/bin/kill -s TERM $MAINPID
PermissionsStartOnly=true
ExecStartPre=-/bin/mkdir /var/run/celerybeat
ExecStartPre=/bin/chown -R celery:root /var/run/celerybeat
[Install]
WantedBy=multi-user.target
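The unit reads its variables from /etc/conf.d/celerybeat. A minimal sketch of that environment file, assuming a typical installation (all paths and the app name are illustrative and must match your setup):
# /etc/conf.d/celerybeat
CELERY_BIN="/usr/local/bin/celery"
CELERY_APP="worker:app"
CELERYBEAT_CHDIR="/opt/kontext/production/scripts/celery"
CELERYBEAT_OPTS=""
CELERYBEAT_PID_FILE="/var/run/celerybeat/beat.pid"
CELERYBEAT_LOGFILE="/var/log/celery/beat.log"
CELERYBEAT_LOGLEVEL="INFO"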
Then define a KonText-specific configuration /opt/kontext-production/conf/beatconfig.py:
from celery.schedules import crontab
from datetime import timedelta
BROKER_URL = 'redis://127.0.0.1:6379/2'
CELERY_TASK_SERIALIZER = 'json'
CELERY_ENABLE_UTC = True
CELERY_TIMEZONE = 'Europe/Prague'
CELERYBEAT_SCHEDULE = {
    'conc-cache-cleanup': {
        'task': 'conc_cache.conc_cache_cleanup',
        'schedule': crontab(hour='0,12', minute=1),
        'kwargs': dict(ttl=120, subdir=None, dry_run=False)
    }
}
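With the unit file and configuration in place, reload systemd and enable the service:
sudo systemctl daemon-reload
sudo systemctl enable celerybeat
sudo systemctl start celerybeat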
The installation script has created two users:
- a public user, used for any unauthorized access to KonText (id = 0)
- a user kontext with a random password generated by the installation script (id = 1)
To add more users, another script packed with the default_auth plug-in can be used.
First, a JSON file containing the new users' credentials must be prepared. See lib/plugins/default_auth/scripts/users.sample.json for the required format and data.
Note 1: Passwords in the file should be in plaintext. Please note that the file is needed just to install the users; from then on, KonText won't need it. The installed users' passwords are stored in a secure form within the database. The best way to treat the user credentials file is to delete it, or to encrypt and archive it somewhere, once you have installed the users.
Note 2: the script we are going to use will overwrite any existing user whose id collides. To prevent this, start your new user IDs from 2 (to leave the two initial users unchanged).
For each user, you should also set a list of corpora they will be able to access. It is possible to configure access to corpora which are not installed yet; just make sure proper corpus identifiers/names are used (typically this means names matching the respective Manatee registry file names for your corpora). A hypothetical example entry follows below.
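For illustration only, an entry might look roughly like this (the field names here are assumptions; the authoritative format is in users.sample.json mentioned above):
[
  {
    "id": 2,
    "username": "jdoe",
    "firstname": "John",
    "lastname": "Doe",
    "email": "jdoe@example.com",
    "pwd": "a-plaintext-password",
    "corpora": ["susanne", "foo"]
  }
]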
Once the list of new users is ready, run:
python3 lib/plugins/default_auth/scripts/install.py /path/to/your/users.json
To install a new corpus, please follow the steps described below.
A registry file configures a corpus for the Manatee search engine (and also adds some configuration needed by KonText/NoSkE). The file should be located in the directory defined in conf/config.xml, element kontext/corpora/manatee_registry.
Copy your indexed corpus to a directory specified in conf/config.xml (element kontext/corpora/manatee_registry). The data must be located in the path specified within the respective registry file under the PATH key, as in the sketch below.
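A minimal registry sketch (NAME, PATH, ENCODING and ATTRIBUTE are standard Manatee registry keys; the values are illustrative):
NAME "My Corpus"
PATH /path/to/indexed/corpus/data
ENCODING "UTF-8"
ATTRIBUTE word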
The default_corparch plug-in, which is used by default as a module to list and search for corpora, uses the conf/corplist.xml file to store all the configured corpora. The susanne corpus should already be there. To add a new corpus, please follow the same pattern as in the case of susanne.
To provide access to the new corpus for an existing user, you can use the script lib/plugins/default_auth/scripts/usercorp.py.
E.g. to add a new corpus 'foo' to a user with ID = 3:
python3 lib/plugins/default_auth/scripts/usercorp.py 3 add foo
Note: The script also allows removing access rights.
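E.g., to revoke the same access (assuming the removal subcommand mirrors add; check the script's help to confirm):
python3 lib/plugins/default_auth/scripts/usercorp.py 3 remove foo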
To prompt KonText to acknowledge your new corpus without requiring a restart, invoke the /soft-reset action along with the appropriate key. This key is stored in a file located at /tmp/kontext_srt/..., with the exact filename partially disclosed in the application log. However, if only a single instance of KonText is operational, there will be just one file in the kontext_srt directory. The filename is constant, as it is derived from the hash of the absolute path of the corresponding KonText configuration XML file.
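A hypothetical invocation might look as follows; the query parameter name and the key file placeholder are assumptions, so consult the application log for the exact form:
curl "http://your_server/soft-reset?key=$(cat /tmp/kontext_srt/<key_file>)"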
After a configuration change, it is always a good idea to validate your setup:
python3 scripts/validate_setup.py conf/config.xml
The script tests not just the XML configuration validity but also the actual existence of essential files/directories and services.
In case your Gunicorn or Celery services fail to start, check systemd log:
journalctl -xe
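To narrow the output to a single unit (the unit name depends on how you named your services):
journalctl -u celery -e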
There are several logs which may be quite helpful when debugging different issues:
- KonText application log (configured in conf/config.xml)
- Celery log (configured in /etc/conf.d/celery)