Google Summer of Code 2018 Ideas List

englundrk edited this page Feb 19, 2018 · 56 revisions

On this page you will find project ideas for applications to the Google Summer of Code 2018. We encourage creativity; applicants are welcome to propose their own ideas within the CDLI realm.

About the Cuneiform Digital Library Initiative

The Cuneiform Digital Library Initiative (CDLI) is driven by the mission to enable the collection, preservation and accessibility of information (image files, textual annotations, and metadata) concerning all ancient Near Eastern artifacts inscribed with cuneiform. With over 334,000 artifacts in our catalogue, we house information about approximately two-thirds of all sources from cuneiform collections around the world. Our data are publicly available at https://cdli.ucla.edu, and our audience comprises primarily scholars, students, museum staff, and informal learners.

Over its long history, CDLI has become integral to the fabric of the Assyriological discipline. It is used as a tertiary source, a data hub and a research data repository. Based on Google Analytics reports, CDLI's website is visited by an average of 3,000 users per month, in 10,000 sessions and 100,000 pageviews. 78% of these users are recurring visitors. The majority of users access CDLI collections and associated tools seeking information about a specific text or group of texts; insofar as these are available to us, CDLI has authoritative records about where the physical document is currently located, when and where it was originally created and deposited in ancient times, what is inscribed on the artifact and where it has been published. Search results display artifact images and associated linguistic annotations when available.

CDLI is a collaboration of developers, language scientists, machine learning engineers and cuneiform specialists who are creating a software infrastructure to process and analyze curated data. To this end, we are actively developing two projects: Framework Update, and Machine Translation and Automated Analysis of Cuneiform Languages (MTAAC). As part of these endeavors, we are building a natural language processing platform to empower specialists of ancient languages to undertake translation of Sumerian language texts, thus enabling data-driven study of the languages, culture, history, economy and politics of ancient Near Eastern civilizations. In this platform we are focusing on data normalization using Linked Open Data to foster best practices in data exchange, standardization and integration with other projects in digital humanities and computational philology.

To follow our research, see an overview of the tools we use to track our work.

CDLI offers catalogue and text data that can be downloaded from our GitHub data repository, image files that can be harvested online or obtained on demand (including higher resolution images, for research purposes only), and textual annotations that are currently being prepared by the MTAAC research team.

Potential project ideas

API for data retrieval on CDLI (easy)

Since its inception, CDLI has endeavored to be a community-driven initiative with global contributors. Over time, our database has grown and been used extensively in other projects in various fields of research, including computer science and computational linguistics. Our data are generally accessible to humans, but only partially to machines. One of our short-term goals is to implement data sharing in our field. This would enable self-service access to CDLI data for academic research projects, particularly linguistic research and digital humanities projects. To this end, our data must be available to clients in multiple machine-actionable formats (including RDF/XML, Turtle, JSON and CSV) through a well-documented API.

This project will be a launch pad for providing all of our data as Linked Open Data in RDF. Although Linked Open Data has been applied in the humanities and language sciences, its use in the field of Assyriology is almost non-existent. This project would benefit and encourage initiatives like the Modref project (providing linked data from CDLI along with two other digital libraries) and the British Museum ResearchSpace service (including cuneiform text artifacts). These services offer a SPARQL endpoint to query their catalogue metadata, which are formalized using the CIDOC-CRM ontology (an ontology designed to handle the classification and description of entities). We have already set up a successful proof of concept (paper "in print", May 2018) that will serve as a blueprint for integrating these technologies into CDLI's infrastructure.

The scope of the project is to prepare catalogue data (for example, entities such as material, provenience, text genre and museum collection) and dictionary data as output suitable for linking with Linked Open Data.
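As a minimal sketch of what multi-format output could look like, the Python below serializes one catalogue record to JSON and CSV using only the standard library. The field names, the sample values and the `render` dispatcher are illustrative assumptions, not CDLI's actual schema or API design; RDF serializations would need a dedicated library and are left out here.

```python
import csv
import io
import json

# Hypothetical catalogue record; field names and values are illustrative only.
RECORD = {
    "id": "P000001",
    "material": "clay",
    "provenience": "Uruk",
    "genre": "Administrative",
    "collection": "Example Museum",
}


def to_json(records):
    """Serialize records as a JSON array, keeping non-ASCII characters readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_csv(records):
    """Serialize records as CSV with a header row taken from the first record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# One serializer per supported format; more formats can be registered here.
SERIALIZERS = {"json": to_json, "csv": to_csv}


def render(records, fmt):
    """Dispatch on the requested format, as an API endpoint might do."""
    try:
        serializer = SERIALIZERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt}") from None
    return serializer(records)


if __name__ == "__main__":
    print(render([RECORD], "json"))
    print(render([RECORD], "csv"))
```

Keeping the serializers in a lookup table means a new output format can be added without touching the dispatch logic.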

Outcomes:

Minimal viable product:

  • Understand the catalogue and vocabulary data to be made available through API.
  • Restructure the databases where needed.
  • Design and implement API and retrieval services.
  • Test throughput and latency requirements.
  • Document the API on GitHub.

If time permits:

Skills required/preferred:

  • Firm understanding of data structures.
  • Familiarity with MVC design and PHP.
  • Desire and eagerness to learn service oriented architecture.
  • Interest in linked open data.

Possible mentors:

  • Émilie Pagé-Perron
  • Saurabh Trikande

Integrating CDLI corpora to CLTK/NLTK (easy)

This project seeks to develop new tools and resources for research, with emphasis on re-usability and use of recognized standards. Currently, there is no off-the-shelf tool for basic natural language processing tasks on cuneiform transliterations, such as calculating average text length, line length, or token (word, sign) frequencies. These are the most basic operations one needs in order to start evaluating a corpus for further processing and automated analysis.
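The basic statistics mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample texts are invented, words are assumed to be separated by spaces and signs within a word by hyphens, and the function names are hypothetical rather than part of CLTK or NLTK.

```python
from collections import Counter

# Invented sample transliterations: each text is a list of lines.
# Assumption: words are separated by spaces, signs within a word by hyphens.
TEXTS = {
    "text_a": ["1(disz) udu", "ki lugal-ta"],
    "text_b": ["2(disz) udu", "ki lugal-ta", "ba-zi"],
}


def words(line):
    """Split a transliterated line into words."""
    return line.split()


def signs(word):
    """Split a transliterated word into its signs."""
    return word.split("-")


def corpus_stats(texts):
    """Compute average lengths and word/sign frequencies for a corpus."""
    line_counts = [len(lines) for lines in texts.values()]
    all_lines = [line for lines in texts.values() for line in lines]
    all_words = [w for line in all_lines for w in words(line)]
    all_signs = [s for w in all_words for s in signs(w)]
    return {
        "avg_lines_per_text": sum(line_counts) / len(line_counts),
        "avg_words_per_line": len(all_words) / len(all_lines),
        "word_freq": Counter(all_words),
        "sign_freq": Counter(all_signs),
    }


if __name__ == "__main__":
    stats = corpus_stats(TEXTS)
    print(stats["avg_lines_per_text"], stats["avg_words_per_line"])
    print(stats["word_freq"].most_common(3))
```

Real transliterations contain further notation (damage markers, determinatives, numerals) that a production tokenizer would have to handle; the split rules here are deliberately naive.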

One of the challenges at CDLI is that the cuneiform corpus is always evolving: we know less about the Sumerian and Akkadian languages than we do about Hindi or Ancient Greek, so each new research endeavor brings improvements and adds value to the existing corpus. Currently, the Classical Language Toolkit (CLTK) deals with fixed versions of corpora. As part of this project, CLTK needs to be extended to address this particularity of our corpus.

This work can be done for a specific language (Sumerian or Akkadian) OR all languages OR for a specific temporal corpus. The design should be modular to enable expansion to the whole corpus in the realm of cuneiform texts.

Getting started:

Tasks:

  • Choose which NLP methods should be implemented (with justification).
  • Design and develop a system for corpus versioning and its integration into the CLTK.
  • Implement and test chosen objectives.

Outcomes:

The deliverable of this project is a chosen corpus made available in CLTK, together with NLP functionalities. The modular design would enable future expansion to the rest of the corpus. This will give the cuneiform research community access to essential tools for linguistic analysis.

Skills required/preferred:

  • Familiarity with Python.
  • Familiarity with the NLTK.
  • Interest in Natural Language Processing.
  • Interest in Sumerian and/or Akkadian.

Possible mentors:

  • Émilie Pagé-Perron
  • Saurabh Trikande

Computer vision challenge for the cuneiform script (hard) very popular

The current display system used at CDLI requires a user reading a text to absorb visual and textual information simultaneously, and to interpret the mapping between them, since image and transliteration are shown side by side (example: https://cdli.ucla.edu/P423472). Experts in cuneiform studies are usually able to discern this mapping only for their areas of expertise; non-experts and informal learners, on the other hand, have no direct means of associating image and annotation content. This poses a core challenge for the CDLI project: to make a fundamental contribution to the question of cuneiform paleography, and more broadly to define new approaches to the problem of automatically hyperlinking existing text annotations with the corresponding delineations in images. With advances in image processing methods, this text-image linking problem can now be addressed with reasonable performance. This would involve building machine learning models trained on a large set of files to learn the underlying structure of the tablet images and so perform image segmentation.

Image processing would not only alleviate the need for manual segmentation, but would also give the system robust tagging mechanisms for further additions to the library. Previous research in this domain has focused on accurately detecting and localizing boundaries in natural scenes using local image measurements of the brightness, color, and texture associated with natural boundaries. For ancient cuneiform artifacts, however, the problem involves learning from three-dimensional and, in the majority of cases, damaged tablets, which adds noise to the training process.

Outcomes:

The goal of this proof-of-concept research project is to develop machine learning models that ingest cuneiform text and images and generate as many image segments as there are lines of transliteration. Appropriate segment indexing should then enable us to map the text to the segments. This would require the student to formulate, test and evaluate strategies for line-by-line and section-by-section encoding of cuneiform artifact image coordinates.
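To make the expected output concrete, here is a deliberately simple sketch of the line-to-segment mapping. It does not use machine learning: it swaps in a classical horizontal projection profile on a tiny binary image as a stand-in baseline, and it only pairs segments with transliteration lines when the counts agree. The toy image, the function names and the output structure are all assumptions for illustration.

```python
# Toy binary image (1 = dark pixel). A real tablet photo would first need
# preprocessing and binarization, which this sketch does not cover.
IMAGE = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]


def row_profile(image):
    """Count the dark pixels in each row of a binary image."""
    return [sum(row) for row in image]


def find_bands(profile, threshold=0):
    """Return (top, bottom) row ranges, bottom exclusive, above the threshold."""
    bands, start = [], None
    for y, value in enumerate(profile):
        if value > threshold and start is None:
            start = y  # a band of text rows begins
        elif value <= threshold and start is not None:
            bands.append((start, y))  # the band ended on the previous row
            start = None
    if start is not None:
        bands.append((start, len(profile)))
    return bands


def align(bands, lines):
    """Pair bands with transliteration lines only when the counts agree."""
    if len(bands) != len(lines):
        return None
    return [
        {"line": text, "top": top, "bottom": bottom}
        for (top, bottom), text in zip(bands, lines)
    ]


if __name__ == "__main__":
    bands = find_bands(row_profile(IMAGE))
    print(align(bands, ["first line", "second line"]))
```

A learned segmentation model would replace `find_bands`; the indexed segment records returned by `align` show the kind of coordinates that could later be linked to each transliteration line.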

Skills required/preferred:

  • Interest in computer vision and machine learning (some prior background preferred).
  • Proficiency in Python.
  • Research experience is a plus.
  • Passion to thrive in ambiguity.
  • Openness to ideas and experimentation instincts.

Possible mentors:

  • Saurabh Trikande

Multiple layer annotations querying (hard)

There is currently no tool available to seamlessly integrate into a website the capacity to query through multiple layers of linguistic annotations (morphology, syntax and semantics). The best stand-alone web tool we have found is ANNIS, a complete and robust corpus analysis tool. ANNIS is an excellent example of desired functionalities; however, it has some limitations when tasked to provide an accessible interface with a seamless experience. In addition, we want to query our data in RDF format for flexibility and further integration into a linked open data toolbox for computational linguistics.

Getting started:

Tasks:

  • Set up a SPARQL endpoint.
  • Define SPARQL query chunks to provide search for all layers of annotations.
  • Assemble and test the search system.
  • Prepare basic textual results display for humans.
  • Prepare basic RDF output (XML, Turtle, JSON) for machines.
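One possible reading of "query chunks" is a set of reusable graph patterns, one per annotation layer, that are combined into a single SPARQL query. The sketch below assembles such a query as a string. The `ex:` namespace and the property names are placeholders invented for illustration, not the vocabulary actually used for CDLI annotations.

```python
# Placeholder namespace and property names: assumptions for illustration only.
PREFIXES = "PREFIX ex: <http://example.org/annotation#>"

# One reusable graph-pattern chunk per annotation layer.
CHUNKS = {
    "morphology": '?word ex:morph "{value}" .',
    "syntax": '?word ex:deprel "{value}" .',
    "semantics": '?word ex:sense "{value}" .',
}


def escape(value):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_query(**constraints):
    """Combine the chunks for the requested layers into one SELECT query."""
    patterns = []
    for layer, value in constraints.items():
        if layer not in CHUNKS:
            raise ValueError(f"unknown annotation layer: {layer}")
        patterns.append(CHUNKS[layer].format(value=escape(value)))
    body = "\n  ".join(patterns)
    return f"{PREFIXES}\nSELECT ?word WHERE {{\n  {body}\n}}"


if __name__ == "__main__":
    print(build_query(morphology="N", syntax="nsubj"))
```

Because every chunk constrains the same `?word` variable, adding a layer to the call narrows the result set to words that match on all requested layers.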

Outcomes:

A feature to search through combined linguistic annotations and basic display integration to the current results display.

Skills required/preferred:

  • Familiarity with natural language annotation.
  • Interest in Linked Open Data.
  • Familiarity with SPARQL.

Possible mentors:

  • Émilie Pagé-Perron
  • Christian Chiarcos

Textual annotations viz (hard)

As part of the MTAAC project, we are producing rich textual annotations, both manually and automatically. Soon, 68,000 texts will be enriched with morphological, syntactic and semantic annotations. These annotations are stored in CDLI-CoNLL format and can be converted to RDF so that they can be manipulated as a graph. Graphviz enables the visualisation of graph data by generating SVG images. As an intermediary, the DOT language is ideal for simplifying the notation of a graph's visual representation. This technology must be integrated into the CDLI platform to enable users to visualize linguistic information about the texts at hand for research and teaching purposes.
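As a rough sketch of the DOT intermediary step, the Python below turns a small list of annotated tokens into DOT text. The token tuples, their labels and the `to_dot` helper are invented for illustration and do not reflect the actual CDLI-CoNLL columns.

```python
# Invented tokens: (id, form, head_id, relation); head_id 0 marks the root.
TOKENS = [
    (1, "lugal-e", 2, "ERG"),
    (2, "mu-na-du3", 0, "root"),
]


def quote(text):
    """Escape backslashes and double quotes for a DOT quoted string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_dot(tokens):
    """Render tokens as a DOT digraph with one node per token."""
    lines = ["digraph annotation {", "  node [shape=box];"]
    for tid, form, _, _ in tokens:
        lines.append(f'  n{tid} [label="{quote(form)}"];')
    for tid, _, head, rel in tokens:
        if head != 0:  # the root has no incoming edge
            lines.append(f'  n{head} -> n{tid} [label="{quote(rel)}"];')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_dot(TOKENS))
```

The resulting text can be saved to a file and rendered with Graphviz, for example with `dot -Tsvg annotation.dot -o annotation.svg`.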

Getting started:

Outcomes:

The resulting deliverable is a novel visualization system for linguistic annotations (specifically, syntax and semantics attached to a text) in a clear and helpful SVG representation.

Skills required/preferred:

  • Familiarity with linked open data.
  • Knowledge of PHP.

Possible mentors:

  • Émilie Pagé-Perron
  • Christian Chiarcos

Granular temporal data management (easy)

CDLI is currently refining its data model, structuring the data so that relationships can be fully leveraged. One salient classification aspect of cuneiform sources is dating information. Historical periods can be subdivided based on the identification of rulers and corresponding dates provided in the texts. Depending on the period, texts can bear year name(s), month name(s), and day(s). Currently, the date is encoded in a text field of the CDLI catalogue as follows: RN.Y.M.D (royal name, year [numbers 1ff.], month [numbers 1-13], day [numbers 1-30]). The royal name is spelled in full using conventional English designations, with “--” for lost information and “00” when information was not given by the scribe. Scribes designated month intercalations with "min" ("the second") or "diri" ("extra"). A question mark following a space after the full date notation records doubt about any one, or all, of the preceding RN.Y.M.D slots. We are considering expanding date information to include dynasty/era.
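A first step towards a more granular model could be to split the RN.Y.M.D string into typed slots. The sketch below is one possible reading of the notation described above, under several assumptions: the “--” and “00” markers may appear in any slot, values are kept as strings, intercalary months are not interpreted, and the example dates and function names are invented for illustration.

```python
LOST = "--"       # information lost
NOT_GIVEN = "00"  # information not given by the scribe


def parse_slot(raw):
    """Classify one slot as known, lost or not given."""
    raw = raw.strip()
    if raw == LOST:
        return {"value": None, "status": "lost"}
    if raw == NOT_GIVEN:
        return {"value": None, "status": "not_given"}
    return {"value": raw, "status": "known"}


def parse_date(text):
    """Split an RN.Y.M.D string into slots plus an uncertainty flag."""
    text = text.strip()
    # A question mark after a space marks doubt about one or more slots.
    uncertain = text.endswith(" ?")
    if uncertain:
        text = text[:-2].rstrip()
    # Split from the right so a royal name containing a period stays intact.
    parts = text.rsplit(".", 3)
    if len(parts) != 4:
        raise ValueError(f"expected RN.Y.M.D, got {text!r}")
    ruler, year, month, day = (parse_slot(p) for p in parts)
    return {
        "ruler": ruler,
        "year": year,
        "month": month,
        "day": day,
        "uncertain": uncertain,
    }


if __name__ == "__main__":
    print(parse_date("Šulgi.42.05.00 ?"))
    print(parse_date("--.--.03.12"))
```

Keeping an explicit status per slot preserves the distinction between lost and never-recorded information, which a plain nullable column would erase.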

The candidate's design should accommodate the requirements for an annotation pipeline currently being developed. Annotations providing information about the date should be compatible with processing of preexisting dating information extracted manually and available in catalogue data.

Getting started:

Tasks:

  • Convert existing data to a new data model.
  • Extend search engine to handle dating information.
  • Prepare views to navigate useful dating information.

Outcomes:

  • Search and display capabilities integrating granular temporal data.

Skills required/preferred:

  • Familiarity with relational data or structured data, and PHP.
  • Familiarity with HTML and CSS.
  • Familiarity with the Sumerian and Akkadian languages.

Possible mentors:

  • Émilie Pagé-Perron

Temporal and geographic viz (easy)

CDLI has rich geographical and temporal data at its disposal. Currently, this information is not fully utilized. Although we are working to improve our data schema, there are significant challenges in exploiting the new relationships available.

Temporal and geographical data should be presented to users in an interactive manner, giving them a new way to browse and discover information. The temporal and geographical data can be coupled with other information such as text genre, language, and word frequency comparison, and displayed through a novel visualization technique.
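As a small sketch of the data preparation such displays might need, the Python below counts catalogue records per combination of period and provenience and emits JSON that a JavaScript visualization plugin could consume. The sample records, field names and `aggregate` helper are illustrative assumptions, not actual CDLI data or schema.

```python
import json
from collections import Counter

# Invented sample records; field names and values are illustrative only.
RECORDS = [
    {"period": "Ur III", "provenience": "Umma", "genre": "Administrative"},
    {"period": "Ur III", "provenience": "Umma", "genre": "Administrative"},
    {"period": "Ur III", "provenience": "Girsu", "genre": "Legal"},
    {"period": "Old Babylonian", "provenience": "Nippur", "genre": "Literary"},
]


def aggregate(records, keys=("period", "provenience")):
    """Count records per combination of the given keys, sorted by key values."""
    counts = Counter(tuple(record[k] for k in keys) for record in records)
    return [
        dict(zip(keys, combo), count=n)
        for combo, n in sorted(counts.items())
    ]


if __name__ == "__main__":
    print(json.dumps(aggregate(RECORDS), indent=2))
```

Passing a different `keys` tuple, such as `("period", "genre")`, yields the data for another view without changing the function.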

Getting started:

Tasks:

  • Identify potential use cases.
  • Choose the most accessible visualization plugins for each selected display.
  • Integrate the chosen technology with data outputs.
  • Fine tune the displays, interlinking data further and increasing interactivity.

Outcomes:

  • One or more displays to discover and browse data in new and interactive ways.

Skills required/preferred:

  • Familiarity with JS.
  • Familiarity with JSON structured data.
  • Familiarity with HTML and CSS.
  • Familiarity with accessibility principles.

Possible mentors:

  • Émilie Pagé-Perron

Other ideas

  • A converter to C-ATF format from multiple other formats such as ORACC-ATF, BDTNS format, etc.
  • Implementation in Bootstrap of our new UX design.
  • Create a scripted pipeline for processing of raw scans of artifacts and scans of line art to the final version images for archival storage and web display.

Your own project (bring 'em on!)

We are interested in expanding our technological capabilities in processing, analyzing and distributing (including visualization and accessibility) our catalogue and textual derived data. If you have an idea that could be reused either to reproduce your research or enhance further developments in the disciplines of Assyriology, Computational Linguistics or Computer Science, reach out to us and we can work together on preparing a project suitable for you, CDLI and GSoC.

[email protected]