Google Summer of Code 2019 Ideas List
On this page you will find project ideas for applications to the Google Summer of Code 2019. We encourage creativity; applicants are welcome to propose their own ideas within the CDLI realm.
The Cuneiform Digital Library Initiative (CDLI) is driven by the mission to enable collection, preservation and accessibility of information (image files, textual annotation, and metadata) concerning all ancient Near Eastern artifacts inscribed with cuneiform. With over 334,000 artifacts in our catalogue, we house information about approximately two-thirds of all sources from cuneiform collections around the world. Our data are publicly available at https://cdli.ucla.edu, and our audience comprises primarily scholars, students, museum staff, and informal learners.
Through its long history, CDLI has become integral to the fabric of the Assyriological discipline. It is used as a tertiary source, a data hub and a research data repository. Based on Google Analytics reports, CDLI's website receives on average 3,000 users per month, across 10,000 sessions and 100,000 pageviews. 78% of these users are returning visitors. The majority of users access CDLI collections and associated tools seeking information about a specific text or group of texts; insofar as these are available to us, CDLI has authoritative records about where the physical document is currently located, when and where it was originally created and deposited in ancient times, what is inscribed on the artifact and where it has been published. Search results display artifact images and associated linguistic annotations when available.
CDLI is a collaboration of developers, language scientists, machine learning engineers and cuneiform specialists who are creating a software infrastructure to process and analyze curated data. To this end, we are actively developing two projects: Framework Update, and Machine Translation and Automated Analysis of Cuneiform Languages (MTAAC). As part of these endeavors, we are building a natural language processing platform to empower specialists of ancient languages to undertake translation of Sumerian language texts, thus enabling data-driven study of the languages, culture, history, economy and politics of ancient Near Eastern civilizations. In this platform we are focusing on data normalization using Linked Open Data to foster best practices in data exchange, standardization and integration with other projects in digital humanities and computational philology.
To follow our research, see an overview of the tools we use to track our work.
The CDLI offers catalogue and text data that are downloadable from our GitHub data repository, image files that can be harvested online or obtained on demand (including higher resolution images, for research purposes only), and textual annotations that are currently being prepared by the MTAAC research team.
- Computer vision challenge for the cuneiform script
- Multiple layer annotations querying
- RDF Textual annotations visualization using dot and GraphViz
- Temporal and geographic viz
- Neural Machine Translation for Sumerian and English
- Your own project
The current display system used at CDLI requires a reader to absorb visual and textual information simultaneously and to interpret the mapping between them, since image and transliteration are shown side by side (example: https://cdli.ucla.edu/P423472). Experts in cuneiform studies are usually able to discern this mapping only for their areas of expertise; non-experts and informal learners, on the other hand, have no direct means of associating image and annotation content. This poses a core challenge for the CDLI project: to make a fundamental contribution to the question of cuneiform paleography and, more broadly, to define new approaches to automatically hyperlinking existing text annotation with the corresponding regions in images. With the advent of image processing methodologies, this text-image linking problem can now be addressed with reasonable performance. This would involve building machine learning models trained on a large set of image files to capture the underlying structure of the tablet images well enough to perform image segmentation.
Image processing would not only alleviate the need for manual segmentation but would also give the system robust tagging mechanisms for further additions to the library. Previous research in this domain has focused on accurately detecting and localizing boundaries in natural scenes using local image measurements of brightness, color, and texture. With ancient cuneiform artifacts, however, the models must learn from three-dimensional and, in the majority of cases, damaged tablets, which adds noise to the training process.
The goal of this proof-of-concept research project is to develop machine learning models that ingest cuneiform text and image and generate as many image segments as there are lines of transliteration. Appropriate segment indexing should then enable us to map the text to the segments. This will require the student to formulate, test and evaluate strategies for line-by-line and section-by-section encoding of cuneiform artifact image coordinates.
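To make the target output concrete, here is a minimal, non-learned baseline sketch. It uses a classical horizontal projection profile rather than the machine learning models this project calls for, and it assumes an already binarized image given as nested lists where 1 marks ink. The function names, the toy image and the placeholder transliteration lines are all hypothetical.

```python
from typing import Dict, List, Tuple


def row_profile(image: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in each pixel row."""
    return [sum(row) for row in image]


def segment_lines(image: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges, bottom exclusive, for each run of inked rows."""
    segments = []
    start = None
    for y, ink in enumerate(row_profile(image)):
        if ink >= min_ink and start is None:
            start = y  # a new text line begins
        elif ink < min_ink and start is not None:
            segments.append((start, y))  # the current text line ends
            start = None
    if start is not None:
        segments.append((start, len(image)))
    return segments


def align(segments: List[Tuple[int, int]], lines: List[str]) -> List[Dict]:
    """Pair each segment with a transliteration line; refuse if the counts differ."""
    if len(segments) != len(lines):
        raise ValueError("segment count does not match transliteration line count")
    return [
        {"line": text, "top": top, "bottom": bottom}
        for (top, bottom), text in zip(segments, lines)
    ]


# Toy 6x4 binarized image (assumption: 1 = ink, 0 = background).
toy_image = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
]
# Placeholder strings standing in for real transliteration lines.
toy_lines = ["1. (first line)", "2. (second line)"]

mapping = align(segment_lines(toy_image), toy_lines)
```

On real tablet photographs such a profile would break down quickly because of curvature, damage and lighting, which is exactly why learned segmentation is of interest; the sketch only illustrates the line-to-coordinates mapping that any approach would need to produce.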
- Interest in computer vision and machine learning (some prior background preferred).
- Proficiency in Python.
- Research experience is a plus.
- Passion to thrive in ambiguity.
- Openness to new ideas and an instinct for experimentation.
- Saurabh Trikande
There is currently no tool available to seamlessly integrate into a website the capacity to query through multiple layers of linguistic annotations (morphology, syntax and semantics). The best stand-alone web tool we have found is ANNIS, a complete and robust corpus analysis tool. ANNIS is an excellent example of desired functionalities; however, it has some limitations when tasked to provide an accessible interface with a seamless experience. In addition, we want to query our data in RDF format for flexibility and further integration into a linked open data toolbox for computational linguistics.
Getting started:
- Krause, Thomas & Zeldes, Amir (2016): ANNIS3: A new architecture for generic corpus query and visualization. In: Digital Scholarship in the Humanities 2016
- SPARQL query language; Virtuoso SPARQL endpoint
Tasks:
- Set up a SPARQL endpoint.
- Define SPARQL query chunks to provide search for all layers of annotations.
- Assemble and test the search system.
- Prepare basic textual results display for humans.
- Prepare basic RDF output (XML, Turtle, JSON) for machines.
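One way to read "query chunks" in the tasks above is as reusable graph patterns, one per annotation layer, that are assembled into a full query on demand. The following is a minimal sketch under that reading. The `ex:` namespace, the property names and the variable names are hypothetical placeholders, not the vocabulary actually used for the CDLI data.

```python
from typing import Dict, List, Optional

# Hypothetical namespace; the real data would use its own vocabulary.
PREFIX = "PREFIX ex: <http://example.org/annotation#>\n"

# One reusable graph pattern ("chunk") per annotation layer (property names assumed).
LAYER_CHUNKS = {
    "morphology": "?word ex:morph ?morph .",
    "syntax": "?word ex:head ?head ; ex:deprel ?deprel .",
    "semantics": "?word ex:semanticRole ?role .",
}


def sparql_literal(value: str) -> str:
    """Quote a Python string as a SPARQL string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_query(
    layers: List[str],
    filters: Optional[Dict[str, str]] = None,
    limit: int = 50,
) -> str:
    """Assemble a SELECT query from the chunks of the requested layers."""
    unknown = [layer for layer in layers if layer not in LAYER_CHUNKS]
    if unknown:
        raise ValueError(f"unknown annotation layers: {unknown}")
    patterns = ["?word a ex:Word ; ex:form ?form ."]
    patterns += [LAYER_CHUNKS[layer] for layer in layers]
    for var, value in sorted((filters or {}).items()):
        if not var.isidentifier():
            raise ValueError(f"invalid variable name: {var!r}")
        patterns.append(f"FILTER(STR(?{var}) = {sparql_literal(value)})")
    body = "\n  ".join(patterns)
    return f"{PREFIX}SELECT * WHERE {{\n  {body}\n}}\nLIMIT {int(limit)}"


# Example: words with morphology and syntax, restricted to one relation label.
query = build_query(["morphology", "syntax"], {"deprel": "ERG"})
```

The resulting string could then be sent to whichever SPARQL endpoint is set up, using any HTTP client. Keeping the chunks in a plain dictionary makes it easy to add a layer without touching the assembly logic.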
The expected result is a feature to search through combined linguistic annotations, with basic integration into the current results display.
- Familiarity with natural language annotation.
- Interest in Linked Open Data.
- Familiarity with SPARQL.
- Émilie Pagé-Perron
- Niko Schenk
As part of the MTAAC project, we are producing rich textual annotations, both manually and automatically. Soon, 68,000 texts will be enriched with morphological, syntactic and semantic annotations. These annotations are stored in CDLI-CoNLL format and can be exploited in RDF format, where they can be manipulated as a graph. GraphViz enables the visualization of graph data by generating SVG images, and its dot language is an ideal intermediary because it simplifies the notation of a graph's visual representation. This technology must be integrated into the CDLI platform to enable users to visualize linguistic information about the texts at hand for research and teaching purposes.
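As a rough illustration of the dot intermediary step, the sketch below turns a few dependency-annotated tokens into a dot graph description. The four-field token layout is a simplified, hypothetical subset and not the actual CDLI-CoNLL column set, and the sample tokens and relation labels are illustrative only, not taken from the corpus.

```python
from typing import List, Tuple

# (id, form, head, relation); head 0 marks the root. Simplified, assumed layout.
Token = Tuple[int, str, int, str]


def dot_escape(text: str) -> str:
    """Escape backslashes and double quotes for use inside a dot string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def tokens_to_dot(tokens: List[Token], name: str = "annotation") -> str:
    """Render tokens as nodes and head -> dependent relations as labeled edges."""
    lines = [f"digraph {name} {{", "  node [shape=box];"]
    for tid, form, _head, _rel in tokens:
        lines.append(f'  t{tid} [label="{dot_escape(form)}"];')
    for tid, _form, head, rel in tokens:
        if head != 0:  # the root token has no incoming edge
            lines.append(f'  t{head} -> t{tid} [label="{dot_escape(rel)}"];')
    lines.append("}")
    return "\n".join(lines)


# Illustrative tokens and placeholder relation labels.
sample = [
    (1, "lugal-e", 3, "ERG"),
    (2, "e2", 3, "ABS"),
    (3, "mu-du3", 0, "root"),
]

dot_source = tokens_to_dot(sample)
```

Saved to a file, the output can be rendered with the GraphViz command line, for example `dot -Tsvg input.dot -o output.svg`. In the actual project the tokens would more likely come from a query over the RDF graph than from hand-written tuples.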
Getting started:
The resulting deliverable is a novel visualization system for linguistic annotations (specifically, the syntax and semantics attached to a text), rendered as clear and helpful SVG images.
- Familiarity with linked open data.
- Knowledge of PHP.
- Émilie Pagé-Perron
- Christian Chiarcos
CDLI has rich geographical and temporal data at its disposal. Currently, this information is not fully utilized. Although we are working to improve our data schema, there are significant challenges in exploiting the new relationships available.
Temporal and geographical data should be presented to users in an interactive manner, giving them a new way to browse and discover information. The temporal and geographical data can be coupled with other information such as text genre, language, and word frequency comparison, and displayed through a novel visualization technique.
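Before any interactive display can be built, the catalogue data has to be reduced to a shape that a visualization library can consume. The sketch below counts artifacts per combination of period and provenience and serializes the result as JSON. The field names `period` and `provenience`, the record identifiers and the sample records are assumptions made up for illustration.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def aggregate(
    records: Iterable[Dict[str, str]],
    keys: Tuple[str, ...] = ("period", "provenience"),
) -> List[Dict]:
    """Count records per combination of the given keys; empty values become 'unknown'."""
    counts = Counter(
        tuple(record.get(key) or "unknown" for key in keys) for record in records
    )
    return [dict(zip(keys, combo), count=n) for combo, n in sorted(counts.items())]


# Made-up sample records; real catalogue rows carry many more fields.
sample_records = [
    {"id": "A", "period": "Ur III", "provenience": "Umma"},
    {"id": "B", "period": "Ur III", "provenience": "Umma"},
    {"id": "C", "period": "Old Babylonian", "provenience": "Nippur"},
    {"id": "D", "period": "Ur III", "provenience": ""},
]

summary = aggregate(sample_records)
payload = json.dumps(summary, ensure_ascii=False, indent=2)
```

The same function can group by other fields, such as genre or language, by passing a different `keys` tuple, which matches the idea of coupling temporal and geographical data with other information.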
Getting started:
Tasks:
- Identify potential use cases.
- Choose the most accessible visualization plugins for each selected display.
- Integrate the chosen technology with data outputs.
- Fine tune the displays, interlinking data further and increasing interactivity.
- One or more displays to discover and browse data in new and interactive ways.
- Familiarity with JS.
- Familiarity with JSON structured data.
- Familiarity with HTML and CSS.
- Familiarity with accessibility principles.
- Émilie Pagé-Perron
- A converter to C-ATF format from multiple other formats such as ORACC-ATF, BDTNS format, etc.
- Implementation in Bootstrap of our new UX design.
- Create a scripted pipeline for processing of raw scans of artifacts and scans of line art to the final version images for archival storage and web display.
As part of the MTAAC project, we host a small collection of aligned Sumerian/English phrase pairs. Your task is to train a neural network-based encoder-decoder architecture for English-Sumerian and Sumerian-English machine translation in order to support experts in cuneiform studies with automated translations.
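A first practical step with a small parallel collection is a reproducible split into training, development and test portions. The sketch below shows one way to do this and to produce line-aligned source and target text, the plain-text layout that toolkits such as OpenNMT generally expect. The function names, the split fractions, the seed and the placeholder phrase pairs are assumptions, not part of the MTAAC data or tooling.

```python
import random
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (source phrase, target phrase)


def split_pairs(
    pairs: Sequence[Pair],
    dev_fraction: float = 0.1,
    test_fraction: float = 0.1,
    seed: int = 13,
) -> Dict[str, List[Pair]]:
    """Shuffle with a fixed seed and cut into dev, test and train portions."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_dev = max(1, round(len(shuffled) * dev_fraction))
    n_test = max(1, round(len(shuffled) * test_fraction))
    if n_dev + n_test >= len(shuffled):
        raise ValueError("too few pairs for the requested split")
    return {
        "dev": shuffled[:n_dev],
        "test": shuffled[n_dev:n_dev + n_test],
        "train": shuffled[n_dev + n_test:],
    }


def to_parallel_text(pairs: Sequence[Pair]) -> Tuple[str, str]:
    """Return source and target text with one phrase per line, aligned by line number."""
    source = "".join(src + "\n" for src, _ in pairs)
    target = "".join(tgt + "\n" for _, tgt in pairs)
    return source, target


# Placeholder pairs standing in for real Sumerian/English phrases.
toy_pairs = [(f"source phrase {i}", f"target phrase {i}") for i in range(20)]

splits = split_pairs(toy_pairs)
train_src, train_tgt = to_parallel_text(splits["train"])
```

Swapping the tuple order in each pair yields the data for the opposite translation direction, so the same split can serve both Sumerian-English and English-Sumerian models.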
Getting started:
- OpenNMT (http://opennmt.net/)
- Link to sentence-aligned data: TODO
- Implement a neural network-based encoder-decoder framework for Sumerian/English bidirectional machine translation.
- Train and evaluate different models and architectures on standard train/development/test splits.
- Experiment with a range of hyperparameter settings to obtain the best performance.
- Experiment with different embedding representations.
- Visualize learning behaviour of the models.
- Perform a quantitative and qualitative evaluation of the translations.
- Visualize the attention activity of the model.
- Develop techniques to augment the sparse training data through semi-supervised data acquisition in order to boost the overall performance of the trained models.
- Familiarity with Python.
- Familiarity with deep learning libraries such as Keras and TensorFlow.
- Familiarity with (deep) neural networks, RNNs, LSTMs, CNNs, hyperparameters, stochastic optimization methods.
- Familiarity with training and evaluating statistical language models.
- Knowledge of visualization techniques for model evaluation.
- Émilie Pagé-Perron
- Niko Schenk
We are interested in expanding our technological capabilities in processing, analyzing and distributing (including visualization and accessibility) our catalogue and text-derived data. If you have an idea that could be reused either to reproduce your research or to enhance further developments in the disciplines of Assyriology, Computational Linguistics or Computer Science, reach out to us and we can work together on preparing a project suitable for you, CDLI and GSoC.