Google Summer of Code 2018 Ideas List
On this page you will find project ideas for applications to the Google Summer of Code 2018. We encourage creativity: applicants are welcome to propose their own ideas within the CDLI realm.
The Cuneiform Digital Library Initiative (CDLI) is driven by its mission to enable the collection, preservation and accessibility of the images, text and metadata of all ancient Near Eastern artifacts inscribed with cuneiform. With over 334,000 artifacts in our exhaustive catalogue, we exclusively house approximately two-thirds of all sources in cuneiform collections around the world as part of this mission. Our data is publicly available here and our audience consists primarily of scholars, students, informal learners and a growing number of cuneiform enthusiasts.
Through its long history, CDLI has become an integral part of the fabric of the Assyriological discipline itself. Based on Google Analytics reports, the CDLI website is visited on average by 3,000 users per month, through 10,000 sessions and 100,000 pageviews; 78% of these users are returning visitors. The majority of users access the CDLI collection and associated tools seeking information about a specific text or group of texts: CDLI holds the authoritative record of where the physical document is currently located, when and where it was originally created and deposited in ancient times, what is inscribed on the artifact and where it has been published. Search results display artifact images and associated annotations when available.
At CDLI we are a group of developers, language scientists, machine learning engineers and cuneiform specialists who develop software infrastructure to process and analyze curated data. To this end, we are actively developing two projects: the Framework Update and the Machine Translation and Automated Analysis of Cuneiform Languages (MTAAC) project. As part of these projects we are building a natural language processing platform that empowers specialists of ancient languages to undertake the translation of Sumerian texts, thus enabling data-driven study of the languages, culture, history, economy and politics of ancient Near Eastern civilization. In this platform we focus on data normalization using Linked Open Data to foster best practices in data exchange, standardization and integration with other projects in digital humanities and computational philology.
To follow our research, see an overview of the tools we use to track our work.
The CDLI data comprises catalogue and text data, which can be downloaded from our GitHub data repository; images, which can be harvested or obtained on demand (including higher-resolution images, for research purposes only); and textual annotations, which are currently being prepared by the MTAAC research team.
- API for data retrieval
- Integrating CDLI corpora to CLTK/NLTK
- Computer vision challenge for the cuneiform script
- Multiple layer annotations querying
- RDF Textual annotations visualization using dot and GraphViz
- Granular temporal data management
- Temporal and geographic viz
- Your own project
Since its inception, CDLI has endeavored to be a community-driven initiative with contributors all around the world. Over time our database has grown and has been used extensively in other projects in the discipline. Currently, sharing the relevant data with such projects requires manual intervention. With this project we would be breaking new ground in data sharing in the field of Assyriology: CDLI data would become accessible as a self-service for academic research projects, linguistic research and digital humanities projects. We plan to offer the data to clients in multiple formats (including XML, RDF and JSON) through well-documented APIs.
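As a rough illustration of serving one record in several formats, here is a minimal Python sketch. The record, its field names and the `render` helper are hypothetical and do not reflect the actual CDLI schema or API. RDF output is left out because it would depend on a vocabulary that this page does not specify.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical catalogue record; field names and values are illustrative
# only and do not reflect the actual CDLI schema.
RECORD = {
    "id": "P000001",
    "period": "Ur III",
    "provenience": "Umma",
    "language": "Sumerian",
}


def to_json(record):
    """Serialize a record as a JSON string with a stable key order."""
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


def to_xml(record, root_tag="artifact"):
    """Serialize a flat record as a small XML document."""
    root = ET.Element(root_tag)
    for key, value in record.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Map a format name to its serializer, as an API endpoint might do
# when it reads a "format" request parameter.
SERIALIZERS = {"json": to_json, "xml": to_xml}


def render(record, fmt):
    """Return the record in the requested format or raise ValueError."""
    try:
        serializer = SERIALIZERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt}") from None
    return serializer(record)


print(render(RECORD, "json"))
print(render(RECORD, "xml"))
```

Keeping the serializers in a lookup table means a further format can be added without touching the retrieval code.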
This project will be a stepping stone towards Linked Open Data, as all our data would be linkable using the RDF format. Although Linked Open Data has been applied in the humanities and language sciences, its use in the field of Assyriology is almost absent. The same is true of Linguistic Linked Open Data. This project would benefit and encourage initiatives like the Modref project (which provides linked data from the CDLI along with two other digital libraries) and the British Museum Research Space service (which includes cuneiform objects). These services offer a SPARQL endpoint to query their catalogue metadata, which is in turn formalized using the CIDOC-CRM ontology (an ontology designed to handle the classification and description of material culture artifacts). Linking CDLI with these services will permit the user to query artifacts of a diverse nature across multiple collections.
Outcomes: Minimum viable product:
- Understand the catalog and vocabulary data to be made available through the API.
- Restructure the databases if needed.
- Design and implement the API and retrieval services.
- Test throughput and latency requirements.
- Document the APIs on GitHub.
If time permits:
- Design and implement integration with ePSD2 and/or the British Museum Research Space service and/or the Modref project
Skills required/preferred:
- Solid understanding of data structures.
- Familiarity with MVC design and PHP.
- Desire and eagerness to learn service-oriented architecture.
- Interest in Linked Open Data
Possible mentors:
- Émilie Pagé-Perron
- Saurabh Trikande
Developing new tools and resources for research is rewarding, and even more so when they are reusable and follow recognized standards. There is currently no off-the-shelf tool for performing basic natural language processing tasks on cuneiform transliterations, such as calculating average text length, line length or token (word, sign) frequencies. These are the most basic operations one needs in order to start evaluating a corpus for further processing and analysis.
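The basic measures listed above can be sketched in a few lines of Python. This is a minimal sketch under simplifying assumptions: each text is a list of already-cleaned transliteration lines, words are separated by whitespace and signs within a word by hyphens. The `corpus_stats` function and the sample lines are invented for illustration and are not part of the CLTK or of CDLI tooling.

```python
from collections import Counter
from statistics import mean


def corpus_stats(texts):
    """Compute basic statistics for a corpus.

    texts: a list of texts, each a list of transliterated lines (strings).
    Assumes whitespace-separated words and hyphen-separated signs.
    """
    word_counts = Counter()
    sign_counts = Counter()
    lines_per_text = []
    words_per_line = []
    for lines in texts:
        lines_per_text.append(len(lines))
        for line in lines:
            words = line.split()
            words_per_line.append(len(words))
            word_counts.update(words)
            for word in words:
                # Skip empty strings produced by stray hyphens.
                sign_counts.update(s for s in word.split("-") if s)
    return {
        "avg_lines_per_text": mean(lines_per_text) if lines_per_text else 0,
        "avg_words_per_line": mean(words_per_line) if words_per_line else 0,
        "word_freq": word_counts,
        "sign_freq": sign_counts,
    }


# Invented sample lines, for illustration only.
SAMPLE = [
    ["lugal-e e2 mu-du3", "e2 gal"],
    ["lugal kalam-ma"],
]

stats = corpus_stats(SAMPLE)
print(stats["avg_lines_per_text"])
print(stats["word_freq"].most_common(3))
print(stats["sign_freq"].most_common(3))
```

Real transliterations carry line numbers, damage markers and other conventions, so a preprocessing step would be needed before counts like these become meaningful.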
One of the challenges here is that the cuneiform corpus is always evolving: we know less about the Sumerian and Akkadian languages than about Hindi or Ancient Greek, so each new piece of research can bring improvements to the corpus. The Classical Language Toolkit (CLTK) deals with fixed versions of corpora, so a process must be set up to address this particularity of our corpus.
This work can be done for a specific language (Sumerian or Akkadian) or for all languages, or for a specific corpus, as long as development is planned with the expansion to the whole corpus of cuneiform texts in mind.
Getting started:
Tasks:
- Choose which methods should be implemented (with justification)
- Develop a system for corpus versioning and integration to the CLTK
- Implement, test and fine-tune the chosen methods.
Outcomes:
The deliverables of the project are a chosen corpus accessible to the CLTK, chosen basic NLP functions available to apply to the corpus, and a whole field of research finally getting access to essential tools to process its primary sources.
Skills required/preferred:
- Familiarity with Python
- Familiarity with the NLTK and the CLTK
- Interest in Natural Language Processing
- Interest in Sumerian and/or Akkadian
Possible mentors:
- Émilie Pagé-Perron
-Add description here-
Outcomes:
Skills required/preferred:
Possible mentors:
- Saurabh Trikande
There is currently no accessible tool that can be seamlessly integrated into a website for querying multiple layers of linguistic annotations (morphology, syntax and semantics). The best standalone tool we have found is ANNIS, a complete and robust corpus analysis tool. ANNIS is an excellent example of the functionalities we are looking for, but it has limitations when it comes to providing both an accessible interface and a seamless experience. In addition, we want to query our data in the RDF format for added flexibility and to expand the linked open data toolbox for computational linguistics.
Getting started:
- Krause, Thomas & Zeldes, Amir (2016): ANNIS3: A new architecture for generic corpus query and visualization. in: Digital Scholarship in the Humanities 2016 (31)
- SPARQL query language
- Virtuoso SPARQL endpoint
Tasks:
- Set up a SPARQL endpoint
- Define SPARQL query chunks to provide search for all layers of annotations
- Assemble and test the search system
- Prepare basic textual results display for humans
- Prepare basic RDF output (XML, Turtle, JSON) for machines
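One possible reading of "query chunks" is a set of reusable graph patterns, one per annotation layer, assembled into a single SPARQL query on demand. The sketch below illustrates that idea in Python. The `ex:` namespace, the property names and the `build_query` helper are hypothetical placeholders and not the vocabulary actually used for CDLI annotations.

```python
# Hypothetical namespace and properties, used only to illustrate the idea.
PREFIX = "PREFIX ex: <http://example.org/annotation#>"

# One reusable graph-pattern chunk per annotation layer.
CHUNKS = {
    "morphology": '?word ex:morph "{value}" .',
    "syntax": '?word ex:deprel "{value}" .',
    "semantics": '?word ex:semclass "{value}" .',
}


def escape_literal(value):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_query(criteria, limit=50):
    """Assemble a SELECT query from the chunks matching the criteria.

    criteria: dict mapping a layer name to the value searched for.
    """
    patterns = ["?word ex:inText ?text ."]
    for layer, value in criteria.items():
        if layer not in CHUNKS:
            raise ValueError(f"unknown annotation layer: {layer}")
        patterns.append(CHUNKS[layer].format(value=escape_literal(value)))
    body = "\n  ".join(patterns)
    return (
        f"{PREFIX}\n"
        f"SELECT DISTINCT ?text ?word WHERE {{\n  {body}\n}}\n"
        f"LIMIT {int(limit)}"
    )


print(build_query({"morphology": "N", "syntax": "nsubj"}))
```

The resulting string could then be sent to whichever SPARQL endpoint is set up for the project. All patterns share the `?word` variable, so combining layers narrows the search to words that satisfy every criterion.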
Outcomes:
- A feature to search through combined linguistic annotations, with basic display integration into the current results display
Skills required/preferred:
- Familiarity with natural language annotation
- Interest in Linked Open Data
- Familiarity with SPARQL
Possible mentors:
- Émilie Pagé-Perron
- Christian Chiarcos
As part of the MTAAC project, we are producing rich textual annotations, both manually and automatically. Soon, 68,000 texts will be enriched with morphological, syntactic and semantic annotations. Those annotations are stored in the CDLI-CoNLL format and are exploitable in the RDF format, so they can be manipulated as a graph. GraphViz enables the visualization of graph data by generating SVG images. As an intermediary, the DOT language is ideal for representing the visual aspects of a graph. This technology must be integrated into the CDLI platform so users can visualize linguistic information about the texts at hand, for research and teaching purposes.
Getting started:
- DOT graph language
- GraphViz library
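To make the role of DOT as an intermediary concrete, the following Python sketch turns a handful of dependency rows into DOT source. The row layout (id, form, head, relation), the relation labels and the sample tokens are simplifying assumptions made for illustration. They are not the actual CDLI-CoNLL columns, and the project itself would probably do this in PHP.

```python
def quote(text):
    """Wrap a string in double quotes for use as a DOT label."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_dot(rows):
    """Build DOT source for a dependency graph.

    rows: iterable of (token_id, form, head_id, relation) tuples, where
    head_id 0 marks the root token. This layout is an assumption.
    """
    lines = ["digraph sentence {", "  node [shape=box];"]
    # Declare one node per token, labelled with its written form.
    for tok_id, form, _head, _rel in rows:
        lines.append(f"  n{tok_id} [label={quote(form)}];")
    # Draw one edge from each head to its dependent.
    for tok_id, _form, head, rel in rows:
        if head != 0:  # the root has no incoming edge
            lines.append(f"  n{head} -> n{tok_id} [label={quote(rel)}];")
    lines.append("}")
    return "\n".join(lines)


# Invented example rows, for illustration only.
ROWS = [
    (1, "lugal-e", 3, "nsubj"),
    (2, "e2", 3, "obj"),
    (3, "mu-du3", 0, "root"),
]

print(to_dot(ROWS))
```

Saved to a file, the output can be rendered with the GraphViz command line, for example `dot -Tsvg sentence.dot -o sentence.svg`.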
Outcomes:
The resulting deliverable is the visualization of linguistic annotations, specifically syntax and semantics, attached to a text, in a clear and helpful SVG representation.
Skills required/preferred:
- Familiarity with linked open data
- Knowledge of PHP
Possible mentors:
- Émilie Pagé-Perron
- Christian Chiarcos
CDLI is currently refining its data model, structuring the data further to better leverage relationships. One very important classification aspect of cuneiform sources is dating information. Historical periods can be subdivided using the rulers who reigned and the dates provided on the texts. Depending on the period, texts can bear a year name, a month name and a day. At this time, the date is encoded in a text field of the CDLI catalogue as follows: RN.Y.M.D (royal name, year, month, day). The royal name is spelled in full with conventional English designations; “--” stands for lost information and “00” for information not given by the scribe. Month intercalations were designated by scribes with "min," “the second,” or "diri," “extra.” A question mark following a space after the full date notation records doubts about any one, or all, of the preceding RN.Y.M.D slots. We are considering expanding the dating information to include dynasty/era.
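Converting the existing text field into structured data starts with a parser for the RN.Y.M.D notation described above. Here is a minimal Python sketch, offered as an assumption-laden illustration and not as the project's data model: the class names, status labels and example strings are invented. It applies “--” and “00” to the royal name slot as well, and it keeps any non-numeric slot (such as an intercalary month marker, whose encoding is not specified here) as raw text.

```python
from dataclasses import dataclass
from typing import Optional

LOST = "--"       # information lost
NOT_GIVEN = "00"  # information not given by the scribe


@dataclass
class Slot:
    raw: str
    status: str                   # "known", "lost" or "not_given"
    number: Optional[int] = None  # set only for plain numeric slots


@dataclass
class CatalogueDate:
    ruler: Slot
    year: Slot
    month: Slot
    day: Slot
    uncertain: bool  # True when the notation ends with " ?"


def parse_slot(raw):
    """Classify one slot of the RN.Y.M.D notation."""
    if raw == LOST:
        return Slot(raw, "lost")
    if raw == NOT_GIVEN:
        return Slot(raw, "not_given")
    # Non-numeric values (names, other markers) are kept as raw text.
    return Slot(raw, "known", int(raw) if raw.isdecimal() else None)


def parse_date(text):
    """Parse a catalogue date string such as 'Amar-Suen.05.08.00'."""
    text = text.strip()
    uncertain = text.endswith(" ?")
    if uncertain:
        text = text[:-2].rstrip()
    # Split from the right so a royal name containing "." stays intact.
    parts = text.rsplit(".", 3)
    if len(parts) != 4:
        raise ValueError(f"expected RN.Y.M.D, got: {text!r}")
    ruler, year, month, day = (parse_slot(p) for p in parts)
    return CatalogueDate(ruler, year, month, day, uncertain)


# Invented example strings, for illustration only.
print(parse_date("Amar-Suen.05.08.00"))
print(parse_date("Šulgi.--.--.-- ?"))
```

Keeping the raw string next to the parsed number means nothing from the original field is lost during conversion, which helps when manually extracted dates and pipeline annotations need to be compared.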
The candidate should keep in mind that, since we are developing an annotation pipeline, annotations providing information about the date should be compatible with the processing of the dating information already extracted manually and available in the catalogue data.
Getting started:
Tasks:
- Convert existing data to the new data model
- Extend the search engine to handle dating information
- Prepare views for navigating the dating information in a useful way
Outcomes:
- Search and display capabilities integrating granular temporal data
Skills required/preferred:
- Familiarity with Relational data or structured data, and PHP
- Familiarity with HTML and CSS
- Familiarity with the Sumerian and Akkadian languages
Possible mentors:
- Émilie Pagé-Perron
CDLI has rich geographical and temporal data at its disposal. At this time this information is not fully exploited. Although we are working on our data schema and model, a challenge lies in the full exploitation of the new relationships and depth we are converting our data to.
This data should be presented to users in a fun and interactive manner, giving them a new way to browse and discover information and to visualize texts from new angles. The temporal and geographical data can and should be coupled with other information such as text genre, language or word frequency comparison. Imagination is the limit, as long as the combination can be useful for research.
Getting started:
Tasks:
- Identify potential use cases
- Choose the most accessible visualization plugins for each chosen display
- Integrate the chosen technology with our data outputs
- Fine-tune the displays, interlinking data further and increasing interactivity
Outcomes:
- One or more displays usable to discover and browse data in new and interactive ways
Skills required/preferred:
- Familiarity with JS
- Familiarity with JSON structured data
- Familiarity with HTML and CSS
- Familiarity with accessibility principles
Possible mentors:
- Émilie Pagé-Perron
We are interested in expanding our capacity to process, analyze and distribute (including visualization and accessibility) our catalog and text-derived data. If you have an idea whose deliverable could be reused, either to reproduce your research or for further developments in the disciplines of Assyriology, Computational Linguistics or Computer Science, reach out to us and we can work on preparing a project suitable for GSoC.