Development Outlook

Overall appearance
Query construction
User corpora and subcorpora
Frequency data

This page briefly describes major features of KonText planned to be implemented in the distant future.

Important notice: the list is tentative and very likely to change.

Please do get in touch with us in case you are thinking about similar functionality.

Overall appearance

Beginner mode

motivation: people new to corpus linguistics may feel confused by the full functionality of KonText, especially considering its planned enhancements
current status: expert mode only
properties: KonText would have two basic modes of operation, beginner and expert; its current appearance will serve as the expert mode, while the beginner mode will have to be created as a simplified version of it, designed with mobile devices in mind

Query construction

Intuitive graphical query construction (GQC)

motivation: enabling users with limited knowledge of CQL to use more sophisticated queries by means of an intuitive graphical widget (full CQL will not be supported, though)
current status: six query types, with a significant gap between CQL and all the others
properties: easy switching from GQC to CQL (and if possible also vice versa); a nice feature would be updating the CQL form according to what has been selected in GQC; replacement of the other query types.
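The idea of updating the CQL form from a GQC selection can be sketched roughly as follows. This is only an illustration, not KonText code: the `TokenSpec` structure and the `to_cql` function are hypothetical names, and the sketch covers only a sequence of positions with attribute constraints, in line with the note that full CQL would not be supported.

```python
from dataclasses import dataclass, field


@dataclass
class TokenSpec:
    """One query position as a (hypothetical) widget might store it:
    attribute name -> required value (treated as a regular expression by CQL)."""
    constraints: dict = field(default_factory=dict)


def escape_value(value: str) -> str:
    # Escape backslashes and double quotes so the value stays inside the CQL string.
    return value.replace("\\", "\\\\").replace('"', '\\"')


def to_cql(tokens: list) -> str:
    """Turn a sequence of positions into a CQL string; an empty position becomes []."""
    parts = []
    for tok in tokens:
        conds = [f'{attr}="{escape_value(val)}"' for attr, val in tok.constraints.items()]
        parts.append("[" + " & ".join(conds) + "]")
    return " ".join(parts)


query = to_cql([
    TokenSpec({"lemma": "dog"}),
    TokenSpec(),                              # any token
    TokenSpec({"tag": "V.*", "word": "r.*"}),
])
print(query)
```

Going the other way (from arbitrary CQL back to the widget) would require a CQL parser and is only possible for the subset of queries the widget can represent, which is why the reverse direction is listed as "if possible".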

More general tag builder

motivation: to make the tag builder usable with a wider range of tagsets, e.g. for InterCorp or other foreign-language corpora
current status: tag builder requires a positional tagset where every combination of character & position is guaranteed to have the same meaning
properties: to be discussed, as there is a trade-off between the general usability of the tag builder and the work needed, both in terms of programming and of complex configuration during deployment; conceptual foundation for InterCorp: a mapping between language-specific tagsets and a mediating taxonomy (OLiA, Universal Dependencies) as an optimal strategy

An example of a GQC tool that includes an abstraction of morphology is implemented in Korp:

A nice thing is that it keeps the abstracted view, unlike our current tag builder, which leaves the user with the textual tag only (a string, possibly with some regex) once the tag is constructed.
A not so nice thing is that it doesn't filter possible choices dynamically like the current KonText tag builder or Dan's Interset tag builder.
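To make the positional-tagset requirement concrete, here is a minimal sketch of how selections in a positional tag builder could be turned into the textual tag (a regex) that the user ends up with. The function name, the 0-based positions, and the 16-character tag length used in the example are assumptions for illustration, not the actual KonText implementation.

```python
import re


def build_tag_pattern(selections: dict, tag_length: int) -> str:
    """selections maps a 0-based position to the set of characters allowed there.
    Unconstrained positions match any character."""
    cells = []
    for pos in range(tag_length):
        chars = sorted(selections.get(pos, set()))
        if not chars:
            cells.append(".")
        elif len(chars) == 1:
            cells.append(re.escape(chars[0]))
        else:
            cells.append("[" + "".join(re.escape(c) for c in chars) + "]")
    # Collapse the unconstrained tail into a single ".*".
    trimmed = len(cells)
    while trimmed > 0 and cells[trimmed - 1] == ".":
        trimmed -= 1
    pattern = "".join(cells[:trimmed])
    if trimmed < tag_length:
        pattern += ".*"
    return pattern


# Illustrative selection: "N" at position 0 and "2" or "4" at position 4.
pattern = build_tag_pattern({0: {"N"}, 4: {"2", "4"}}, tag_length=16)
print(pattern)
```

The sketch only works because each position has a fixed meaning; for attribute-value tagsets such as those mapped to Universal Dependencies, the builder would need a different representation, which is the trade-off mentioned above.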

Support for syntactically annotated corpora

motivation: obvious
current status: no such functionality; large syntactically annotated corpora do exist, but in experimental version only
properties: on the interface level, dependency trees only; KonText would create a complex CQL query based on a subtree the user has constructed and display the result as a dependency tree
notice: functionality already implemented in KorAP and PML-TQ
- however: PML-TQ is complex and can only handle small datasets (a few million tokens). It cannot easily output concordances. KorAP is so far impossible to test. I (Pavel S.) asked the authors; they say it is running, but I cannot get access.
- We have started with a different, simplistic approach, but it may just be good enough: http://ufal.mff.cuni.cz/lindat-kontext. This may be sufficient for simple queries. Ideally we would add visualisation of the syntactic tree. Again, the simplest solution would be just (SVG) pictures of the trees prepared ahead, e.g. by TrEd, and stored on the server.

User corpora and subcorpora

General mechanism for creation and management of (sub)corpora

motivation: user requests as well as easy administration of available (sub)corpora
current status: no such mechanism; users can only create their own subcorpora, which cannot be shared; storing the within condition is already implemented, but not used so far
properties: this feature would make use of the within condition that created the particular subcorpus; the within condition would be editable and also shareable among users

Creation of user corpora

description: to enable users to create their own (possibly lemmatized and tagged) corpora
current status: no such functionality
dependencies: corpus sharing required for maximum usability

Subcorpus blending module ✔️

description: selection of documents not only according to the given constraints, but also user-selected ratios (e.g. newspaper subcorpus that would contain 30 % title_A, 30 % title_B, 30 % title_C and 10 % other newspapers)
current status: no such functionality
properties: given the set of constraints and ratios, the module would select a suitable subset of documents (this is a computationally demanding task, but a sufficiently good solution can presumably be found in real time)
dependencies: corpus sharing required for maximum usability
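As a rough illustration of the blending task, the following sketch picks documents so that category sizes approach the requested ratios. It is a simple greedy heuristic chosen for brevity, not the algorithm planned for the module; the function name `blend`, the `(doc_id, category, size)` tuples, and the sample data are all hypothetical.

```python
def blend(docs, ratios, eps=1e-9):
    """docs: list of (doc_id, category, size_in_tokens).
    ratios: category -> requested share of the blend (shares sum to 1).
    Returns the selected document ids and the size reached per category."""
    available = {}
    for _, cat, size in docs:
        available[cat] = available.get(cat, 0) + size

    # The scarcest category (relative to its share) limits the total size.
    total = min(available.get(cat, 0) / share
                for cat, share in ratios.items() if share > 0)

    selected = []
    filled = {cat: 0 for cat in ratios}
    # Try larger documents first; skip any document that would overshoot its quota.
    for doc_id, cat, size in sorted(docs, key=lambda d: -d[2]):
        if cat in ratios and filled[cat] + size <= ratios[cat] * total + eps:
            selected.append(doc_id)
            filled[cat] += size
    return selected, filled


docs = [
    ("a1", "A", 100), ("a2", "A", 100),
    ("b1", "B", 150), ("b2", "B", 150), ("b3", "B", 100),
    ("c1", "C", 60), ("c2", "C", 40),
]
selected, filled = blend(docs, {"A": 0.5, "B": 0.25, "C": 0.25})
print(selected, filled)
```

A greedy pass like this can leave a quota underfilled when document sizes do not add up neatly, which hints at why the exact task is computationally demanding and why an approximate solution is considered acceptable.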

Frequency data

Statistical module

motivation: helping users with limited statistical background to make valid judgements
current status: being implemented in the CNC
properties: comparison of two frequencies in the same corpus/between two corpora; lexical richness; statistical confidence based on random samples
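The comparison of two frequencies could, for instance, be based on relative frequencies (instances per million) together with a log-likelihood (G2) statistic. The choice of G2 here is an assumption for illustration, since the text above does not name a specific test, and the function names are hypothetical.

```python
import math


def ipm(freq: int, corpus_size: int) -> float:
    """Relative frequency in instances per million tokens."""
    return freq / corpus_size * 1_000_000


def log_likelihood(freq1: int, size1: int, freq2: int, size2: int) -> float:
    """G2 statistic for one item observed in two corpora (or subcorpora).
    Expected counts assume the item is equally frequent in both."""
    combined = (freq1 + freq2) / (size1 + size2)
    expected1 = size1 * combined
    expected2 = size2 * combined
    g2 = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


same = log_likelihood(100, 1_000_000, 200, 2_000_000)
different = log_likelihood(150, 1_000_000, 100, 1_000_000)
print(ipm(150, 1_000_000), same, different)
```

With one degree of freedom, G2 values above roughly 3.84 correspond to p < 0.05; presenting such thresholds in plain language is the kind of help the module is meant to give users with a limited statistical background.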

Multidimensional frequency distribution with visualisation 🚧

motivation: enabling two-(or more-)dimensional frequency distributions, e.g. for a combination of txtype and publication year OR education and genre
current status: one-dimensional frequency distributions only
properties: structural attributes only; attractive visualisations; related to the statistical module (contingency tables, correlations); possibly n dimensions; inspiration e.g. here
notice: Manatee API seems to provide basic support, but this is definitely worth checking
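A two- (or n-) dimensional frequency distribution is essentially a contingency table over combinations of structural attribute values. The sketch below shows the idea on plain Python data; it does not use the Manatee API, and the attribute names and sample hits are made up for illustration.

```python
from collections import Counter


def cross_tabulate(hits, dims):
    """Count concordance hits per combination of structural attribute values.
    hits: iterable of dicts with the structural attributes of each hit.
    dims: attribute names spanning the table (two or more)."""
    table = Counter()
    for hit in hits:
        table[tuple(hit[d] for d in dims)] += 1
    return table


# Hypothetical hits, each carrying the attributes of its containing document.
hits = [
    {"doc.txtype": "fiction", "doc.pubyear": "2001"},
    {"doc.txtype": "fiction", "doc.pubyear": "2001"},
    {"doc.txtype": "news", "doc.pubyear": "2001"},
    {"doc.txtype": "news", "doc.pubyear": "2005"},
]
table = cross_tabulate(hits, ["doc.txtype", "doc.pubyear"])
for cell, count in sorted(table.items()):
    print(cell, count)
```

Because the cell key is a tuple, the same function covers more than two dimensions; the resulting table is also the natural input for the contingency-table statistics mentioned in connection with the statistical module.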

Advanced collocational module(s)

motivation: providing an alternative to the Word Sketches
current status: only regular collocation lists available
properties: based on cooccurrence profiles (Belica) and/or p-collocations (Cvrček); another option would be to make use of syntactic relations (if available)
We can start small, like in Korp. Their Word Picture is not quite the full Sketch, but still useful.
