Google Summer of Code 2018 Ideas List
On this page you will find project ideas for applications to the Google Summer of Code 2018. We encourage creativity: applicants are welcome to propose their own ideas within the CDLI realm.
The Cuneiform Digital Library Initiative (CDLI) is driven by the mission to enable collection, preservation and accessibility of information concerning all ancient Near Eastern artifacts inscribed with cuneiform through images, textual information, and metadata. With over 334,000 artifacts in our catalogue, we house information about approximately two-thirds of all sources from cuneiform collections around the world as part of this mission. Our data is publicly available at https://cdli.ucla.edu and our audience primarily comprises scholars, students, museum staff, and informal learners.
Over its long history, CDLI has become integral to the fabric of the Assyriological discipline itself. It is used as a tertiary source, a data hub and a research data repository. Based on Google Analytics reports, the CDLI website is visited on average by 3,000 monthly users through 10,000 sessions and 100,000 pageviews; 78% of these users are recurring visitors. The majority of users access CDLI collections and associated tools seeking information about a specific text or group of texts; CDLI holds the authoritative record of where the physical document is currently located, when and where it was originally created and deposited in ancient times, what is inscribed on the artifact and where it has been published. Search results display artifact images and associated linguistic annotations when available.
At CDLI, we are a group of developers, language scientists, machine learning engineers and cuneiform specialists who develop software infrastructure to process and analyze curated data. To this end, we are actively developing two projects: Framework Update, and Machine Translation and Automated Analysis of Cuneiform Languages (MTAAC). As part of these endeavors we are building a natural language processing platform to empower specialists of ancient languages to undertake the translation of Sumerian texts, thus enabling data-driven study of the languages, culture, history, economy and politics of the ancient Near Eastern civilizations. In this platform we focus on data normalization using Linked Open Data to foster best practices in data exchange, standardization and integration with other projects in digital humanities and computational philology.
To follow our research, see an overview of the tools we use to track our work.
The CDLI data comprises catalogue and text data that can be downloaded from our GitHub data repository, images that can be harvested or obtained on demand (including higher-resolution images, for research purposes only), and textual annotations currently being prepared by the MTAAC research team.
- API for data retrieval
- Integrating CDLI corpora to CLTK/NLTK
- Computer vision challenge for the cuneiform script
- Multiple layer annotations querying
- RDF Textual annotations visualization using dot and GraphViz
- Granular temporal data management
- Temporal and geographic viz
- Your own project
Since its inception, CDLI has endeavored to be a community-driven initiative with contributors all around the world. Over time our database has grown and has been used extensively in other projects across various fields of research, including computer science and computational linguistics. Our data is generally accessible to humans but only partially to machines. One of our short-term goals is to break new ground in data sharing in our field, making CDLI data available as a self-service resource for academic research, linguistic research and digital humanities projects. Our data must be available to clients in multiple machine-actionable formats (including RDF/XML, Turtle, JSON and CSV) through a well-documented API.
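As a rough sketch of how such a retrieval service might be consumed by a client, the snippet below uses Python's requests library; the endpoint path and the format parameter are hypothetical placeholders for the API this project would design, not an existing CDLI service:

```python
# Minimal client sketch; the endpoint and the "format" parameter are
# hypothetical placeholders for the API to be designed in this project.
import requests

BASE = "https://cdli.ucla.edu/api"  # hypothetical base URL

def get_artifact(p_number, fmt="json"):
    """Fetch catalogue metadata for one artifact in the requested format."""
    resp = requests.get(f"{BASE}/artifacts/{p_number}",
                        params={"format": fmt}, timeout=30)
    resp.raise_for_status()
    return resp.json() if fmt == "json" else resp.text

# get_artifact("P123456")            -> dict of catalogue fields (hypothetical)
# get_artifact("P123456", "turtle")  -> RDF serialization as plain text
```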
This project will be a stepping stone towards providing all of our data as Linked Open Data using RDF. Although Linked Open Data has been applied in the humanities and language sciences, its use in Assyriology is almost absent. This project would benefit and encourage initiatives like the Modref project (which provides linked data from the CDLI along with two other digital libraries) and the British Museum Research Space service (which includes cuneiform objects). These services offer a SPARQL endpoint to query their catalogue metadata, which is formalized using the CIDOC-CRM ontology (an ontology handles the classification and description of entities). We have already set up a successful proof of concept (paper in print, May 2018) which will serve as a blueprint for integrating these technologies into the CDLI infrastructure.
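For a flavor of the kind of output such work targets, here is a minimal sketch that serializes one catalogue entry as Turtle with rdflib; the artifact URI base, the label and the choice of CIDOC-CRM class are illustrative only, not the actual CDLI data model:

```python
# Minimal sketch using rdflib; the artifact URI base, label and the chosen
# CIDOC-CRM class are illustrative, not the actual CDLI data model.
from rdflib import Graph, Namespace, Literal, RDF, RDFS

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
CDLI = Namespace("https://cdli.ucla.edu/")  # hypothetical URI base

g = Graph()
g.bind("crm", CRM)

artifact = CDLI["P123456"]  # hypothetical artifact identifier
g.add((artifact, RDF.type, CRM["E22_Man-Made_Object"]))
g.add((artifact, RDFS.label, Literal("administrative tablet, Ur III")))

print(g.serialize(format="turtle"))
```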
The scope of the project is to prepare catalogue data (e.g. entities such as material, provenience, textual genre and museum collection) and dictionary data output suitable for linking as Linked Open Data.
Minimal viable product:
- Understand the catalogue and vocabulary data to be made available through the API.
- Restructure the databases if needed.
- Design and implement the API and retrieval services.
- Test throughput and latency requirements.
- Document the API on GitHub.
If time permits:
- Design and integrate with the electronic Pennsylvania Sumerian Dictionary 1 or 2 and/or the British Museum Research Space service and/or the Modref project
- Solid understanding of data structures.
- Familiarity with MVC design and PHP.
- Desire and eagerness to learn service-oriented architecture.
- Interest in Linked Open Data.
- Émilie Pagé-Perron
- Saurabh Trikande
This project aims at developing new tools and resources for research, with an emphasis on reusability and the use of recognized standards. Currently there is no off-the-shelf tool to perform basic natural language processing tasks on cuneiform transliterations, such as calculating average text length, average line length, token (word, sign) frequencies, etc. These are the most basic operations one needs in order to start evaluating a corpus for further processing and automated analysis.
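As an illustration of the kind of basic statistics meant here, the sketch below computes them over toy transliteration lines; splitting on whitespace is a simplification of real ATF tokenization:

```python
# Sketch of the basic corpus statistics mentioned above, computed over toy
# transliteration lines; whitespace tokenization is a simplification of ATF.
from collections import Counter
from statistics import mean

texts = {
    "P100001": ["1(disz) udu", "ki ab-ba-sa6-ga-ta"],   # toy examples
    "P100002": ["2(disz) sila3 kasz", "u4 1(u)-kam"],
}

line_lengths = [len(line.split()) for lines in texts.values() for line in lines]
text_lengths = [sum(len(l.split()) for l in lines) for lines in texts.values()]
token_freq = Counter(tok for lines in texts.values() for l in lines for tok in l.split())

print("average text length (tokens):", mean(text_lengths))
print("average line length (tokens):", mean(line_lengths))
print("most common tokens:", token_freq.most_common(5))
```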
One of the challenges at CDLI is that the cuneiform corpus is always evolving: we know less about the Sumerian and Akkadian languages than about Hindi or Ancient Greek, so each new research endeavor brings improvements and adds value to the existing corpus. Currently, the Classical Language Toolkit (CLTK) deals with fixed versions of corpora. As part of this project, the CLTK needs to be extended to address this particularity of our corpus.
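One possible way to handle an evolving corpus is to pin it to a specific revision of the data repository; the sketch below assumes the corpus is distributed as a git repository (as the CDLI data on GitHub is), and the class name and interface are illustrative rather than part of the CLTK today:

```python
# Minimal sketch of pinning a corpus to a version; the class name and
# interface are illustrative, not an existing CLTK component.
import subprocess
from pathlib import Path

class VersionedCorpus:
    def __init__(self, repo_url, workdir="cdli-data"):
        self.workdir = Path(workdir)
        if not self.workdir.exists():
            subprocess.run(["git", "clone", repo_url, str(self.workdir)], check=True)

    def checkout(self, version):
        """Pin the working copy to a tag or commit so results are reproducible."""
        subprocess.run(["git", "-C", str(self.workdir), "checkout", version], check=True)

    def files(self, pattern="*.atf"):
        """List transliteration files in the pinned revision."""
        return sorted(self.workdir.rglob(pattern))

# corpus = VersionedCorpus("https://github.com/cdli-gh/data")  # repository URL as an example
# corpus.checkout("some-tag-or-commit")  # hypothetical version identifier
```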
This work can be done for a specific language (Sumerian or Akkadian) OR all languages OR for a specific temporal corpus. The design should be modular to enable expansion to the whole corpus in the realm of cuneiform texts.
Getting started:
Tasks:
- Choose which NLP methods should be implemented (with justification).
- Design and develop a system for corpus versioning and its integration into the CLTK.
- Implement and test chosen objectives.
The deliverable of this project is to make a chosen corpus available in the CLTK and to provide NLP functionalities for it. The modular design would enable future expansion to the rest of the corpus. This will empower the entire cuneiform research community with access to essential tools for linguistic analysis.
- Familiarity with Python
- Familiarity with the NLTK
- Interest in Natural Language Processing
- Interest in Sumerian and/or Akkadian
- Émilie Pagé-Perron
- Saurabh Trikande
Unlike some other digital libraries of ancient-language texts that combine text and artifact images, the current system used by CDLI requires the user to absorb visual and textual information simultaneously in order to interpret the mapping between them. Experts in cuneiform studies are usually able to discern this mapping only within their areas of expertise; non-experts and informal learners, on the other hand, have no direct means of relating image and annotation content. This poses a core challenge for the CDLI project: to make a fundamental contribution to the question of cuneiform paleography and, more broadly, to define new approaches to automatically hyperlinking existing text annotations with the corresponding regions of the image. With current image processing methodologies, this text-image linking problem can now be addressed with reasonable performance. This would involve building machine learning models trained over a large training set to learn the underlying structure of the tablet images and perform image segmentation.
Automated image processing would not only remove manual segmentation labor but also equip the system with robust tagging mechanisms for further additions to the library. Previous research in this domain has focused on accurately detecting and localizing boundaries in natural scenes using local image measurements of brightness, color and texture. With ancient cuneiform artifacts, however, the problem involves learning from three-dimensional and, in the majority of cases, damaged tablets, which increases the noise in the training data.
The goal of this proof-of-concept research project is to develop machine learning models that ingest a cuneiform text and its image and generate image segments corresponding to the lines of the transliteration. Appropriate segment indexing should then enable us to map text and segments to one another. This will require the student to formulate, test and evaluate strategies for line-by-line and section-by-section encoding of cuneiform artifact image coordinates.
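As a very rough baseline for proposing line regions, one could start from a horizontal projection profile of a binarized tablet photograph, as sketched below; real tablets are curved and damaged, so this is only a starting point, not the intended learned model:

```python
# Rough baseline sketch: propose horizontal line bands from an ink-mass
# projection profile. This is illustrative, not the project's final model.
import cv2

def propose_line_bands(image_path, min_gap=5):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Emphasize incised wedges: invert, then binarize adaptively.
    binary = cv2.adaptiveThreshold(cv2.bitwise_not(gray), 255,
                                   cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY, 31, 10)
    profile = binary.sum(axis=1)          # ink mass per image row
    active = profile > profile.mean()     # rows likely belonging to a text line
    bands, start = [], None
    for y, on in enumerate(active):
        if on and start is None:
            start = y
        elif not on and start is not None:
            if y - start >= min_gap:
                bands.append((start, y))
            start = None
    return bands                          # list of (top, bottom) row ranges
```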
- Interest in Computer Vision and Machine Learning (some prior background is preferred).
- Proficiency in Python.
- Research experience is a plus.
- Ability to thrive in ambiguity.
- Openness to ideas and an instinct for experimentation.
- Saurabh Trikande
- Jayanth Jaiswal
There is currently no accessible tool that can be seamlessly integrated into a website for querying across multiple layers of linguistic annotation (morphology, syntax and semantics). The best standalone tool we have found is ANNIS, a complete and robust corpus analysis tool. ANNIS is an excellent example of the desired functionality, but it has some limitations when it comes to providing an accessible interface with a seamless experience. In addition, we want to query our data in RDF format for flexibility and for further integration into the Linked Open Data toolbox for computational linguistics.
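For a sense of how such queries could be issued programmatically, here is a minimal sketch using SPARQLWrapper; the endpoint URL and the annotation property names are placeholders, not an existing CDLI vocabulary:

```python
# Sketch of querying combined annotation layers over SPARQL; the endpoint
# URL, prefix and property names are illustrative placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://example.org/cdli/sparql")  # hypothetical endpoint
endpoint.setReturnFormat(JSON)
endpoint.setQuery("""
PREFIX conll: <http://example.org/conll#>
SELECT ?word ?lemma ?pos ?deprel WHERE {
  ?word conll:LEMMA ?lemma ;
        conll:UPOS  ?pos ;
        conll:EDGE  ?deprel .
  FILTER(?lemma = "lugal")
}
LIMIT 20
""")

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["word"]["value"], row["pos"]["value"], row["deprel"]["value"])
```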
Getting started:
- Krause, Thomas & Zeldes, Amir (2016): ANNIS3: A new architecture for generic corpus query and visualization. Digital Scholarship in the Humanities, 2016.
- SPARQL query language; Virtuoso SPARQL endpoint
Tasks:
- Set up a SPARQL endpoint
- Define SPARQL query chunks to provide search across all annotation layers
- Assemble and test the search system
- Prepare basic textual results display for humans
- Prepare basic RDF output (XML, Turtle, JSON) for machines
A feature to search through combined linguistic annotations, with basic integration into the current results display.
- Familiarity with natural language annotation
- Interest in Linked Open Data
- Familiarity with SPARQL
- Émilie Pagé-Perron
- Christian Chiarcos
As part of the MTAAC project, we are producing rich textual annotations, both manually and automatically. Soon, 68,000 texts will be enriched with morphological, syntactic and semantic annotations. These annotations are stored in the CDLI-CoNLL format and are exploitable in RDF format so they can be manipulated as graphs. GraphViz enables the visualization of graph data by generating SVG images, and the dot language is an ideal intermediary for describing the visual aspects of a graph. This technology must be integrated into the CDLI platform to enable users to visualize linguistic information about the texts at hand for research and teaching purposes.
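As a small sketch of the dot intermediary, the snippet below turns a few toy dependency rows into dot source that GraphViz can render to SVG; the rows and the column layout are simplified from CDLI-CoNLL, not the actual format:

```python
# Sketch: convert toy dependency rows (id, form, head, relation) into dot
# source for GraphViz; the rows and column layout are illustrative only.
rows = [
    (1, "lugal-e",   2, "ERG"),
    (2, "mu-na-du3", 0, "root"),
    (3, "e2",        2, "ABS"),
]

def to_dot(rows):
    lines = ["digraph sentence {", "  node [shape=box];"]
    for tok_id, form, _, _ in rows:
        lines.append(f'  n{tok_id} [label="{form}"];')
    for tok_id, _, head, rel in rows:
        if head:  # skip the root's dummy head 0
            lines.append(f'  n{head} -> n{tok_id} [label="{rel}"];')
    lines.append("}")
    return "\n".join(lines)

print(to_dot(rows))   # pipe into `dot -Tsvg` to produce the SVG
```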
Getting started:
The resulting deliverable is a novel visualization of linguistic annotations (specifically the syntax and semantics attached to a text) in a clear and helpful SVG representation.
- Familiarity with linked open data
- Knowledge of PHP
- Émilie Pagé-Perron
- Christian Chiarcos
CDLI is currently enriching its data model, structuring the data to enable leveraging relationships. One salient classification aspect of cuneiform sources is dating information. Historical periods can be subdivided using the rulers that reigned and the corresponding dates provided on the texts. Depending on the period, texts can bear year names, month names and days. Currently the date is encoded in a text field of the CDLI catalogue as follows: RN.Y.M.D (royal name, year, month, day). The royal name is spelled in full with conventional English designations, with "--" for lost information and "00" when the information was not given by the scribe. Month intercalations were designated by scribes with "min," "the second," or "diri," "extra." A question mark following a space after the full date notation records doubts about any one, or all, of the preceding RN.Y.M.D slots. We are considering expanding the date information to include dynasty/era.
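To make the notation concrete, here is a minimal parsing sketch for the RN.Y.M.D field as described above; the field names, the example values and the handling of the trailing question mark are illustrative, not a fixed specification:

```python
# Sketch of parsing the RN.Y.M.D notation described above; field names and
# the treatment of "--", "00" and the trailing " ?" are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CdliDate:
    ruler: Optional[str]   # None when "--" (lost)
    year: Optional[str]    # kept as strings: "00" = not given by the scribe
    month: Optional[str]   # may carry "min"/"diri" intercalation marks
    day: Optional[str]
    uncertain: bool        # trailing " ?" after the full notation

def parse_date(raw):
    uncertain = raw.rstrip().endswith(" ?")
    core = raw.rstrip().removesuffix("?").strip()
    parts = (core.split(".") + [None] * 4)[:4]
    norm = [None if p in (None, "--") else p for p in parts]
    return CdliDate(*norm, uncertain=uncertain)

# e.g. parse_date("Amar-Suen.05.03.14")  -> ruler="Amar-Suen", year="05", ...
# e.g. parse_date("--.--.12.-- ?")       -> only the month survives, uncertain=True
```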
The candidate's design should accommodate the requirements of an annotation pipeline currently being developed. Annotations providing date information should be compatible with the processing of preexisting dating information that was extracted manually and is available in the catalogue data.
Getting started:
Tasks:
- Convert existing data to a new data model
- Extend search engine to handle dating information
- Prepare views to navigate useful dating information
- Search and display capabilities integrating granular temporal data
- Familiarity with relational or structured data, and with PHP
- Familiarity with HTML and CSS
- Familiarity with the Sumerian and Akkadian languages
- Émilie Pagé-Perron
CDLI has rich geographical and temporal data at its disposal, but at this time this information is not fully utilized. Although we are working on our data schema and model, there remain significant challenges in exploiting new relationships.
This data should be presented to users in an interactive manner, giving them a new way to browse and discover information. The temporal and geographical data can be coupled with other information such as text genre, language and word frequency comparisons, and displayed through a novel visualization technique.
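As one possible shape for the data outputs such a visualization would consume, the sketch below emits a GeoJSON FeatureCollection from catalogue-like rows; the field names and coordinates are illustrative, not the actual CDLI schema:

```python
# Sketch: shape catalogue-like rows into a GeoJSON FeatureCollection that a
# front-end map or timeline plugin could consume; all values are illustrative.
import json

rows = [
    {"id": "P100001", "provenience": "Girsu", "lat": 31.56, "lon": 46.18,
     "period": "Ur III", "genre": "administrative"},
]

features = [{
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [r["lon"], r["lat"]]},
    "properties": {k: r[k] for k in ("id", "provenience", "period", "genre")},
} for r in rows]

print(json.dumps({"type": "FeatureCollection", "features": features}, indent=2))
```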
Getting started:
Tasks:
- Identify potential use cases
- Choose the most accessible visualization plugins for each chosen display
- Integrate the chosen technology with data outputs
- Fine tune the displays, interlinking data further and increasing interactivity
- One or more displays usable to discover and browse data in new and interactive ways
- Familiarity with JS
- Familiarity with JSON structured data
- Familiarity with HTML and CSS
- Familiarity with accessibility principles
- Émilie Pagé-Perron
We are interested in expanding our technological reach in processing, analyzing and distributing (including visualizing and making accessible) our catalogue and text-derived data. If you have an idea that could be reused either to reproduce your research or to enhance further developments in the disciplines of Assyriology, computational linguistics or computer science, reach out to us and we can work together on preparing a project suitable for you, CDLI and GSoC.