Maninpasta minutes: BuildingKG#1 #6

Open
enridaga opened this issue Apr 6, 2021 · 28 comments

@enridaga
Member

enridaga commented Apr 6, 2021

To collect notes on the discussion in this group

@enridaga
Member Author

enridaga commented Apr 6, 2021

It would be useful to start collecting a list of resources that the project members would like to include in the KG

@85jesse
Member

85jesse commented Apr 6, 2021

The Dutch Network for Digital Heritage has developed a registry for datasets: https://github.com/netwerk-digitaal-erfgoed/register. This could provide an infrastructural component for Polifonia datasets as well. Registering them in this way makes the dataset metadata queryable and the various endpoints discoverable.

@enridaga
Member Author

enridaga commented Apr 6, 2021

A catalogue of musical resources on the Web was developed a few years ago: https://musow.kmi.open.ac.uk/

@85jesse
Member

85jesse commented Apr 6, 2021

For the project it will be important to create an overview of relevant datasets and their status (the degree to which they are published as semantic or linked data), as well as their legal status.
This could be anything from data that needs to be scraped from HTML pages to full RDF datasets.
Then, within Polifonia, pipelines can be created (or existing pipelines reused) to transform the data so that it can be used within the project.
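
As a rough illustration of such an overview, here is a minimal Python sketch of an inventory that records publication level and legal status per dataset. The class names, the three publication levels, and the sample entries are assumptions made for illustration; the thread only says that sources range from scraped HTML to full RDF.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class PublicationLevel(IntEnum):
    # Assumed scale; only the two end points are mentioned in the discussion.
    HTML_ONLY = 1    # data has to be scraped from web pages
    STRUCTURED = 2   # e.g. a relational database or tabular export
    RDF = 3          # published as RDF / linked data


@dataclass
class DatasetEntry:
    name: str
    level: PublicationLevel
    license: Optional[str]  # None when the legal status is still unknown


def needs_pipeline(entry: DatasetEntry) -> bool:
    """A dataset below the RDF level needs a transformation pipeline."""
    return entry.level < PublicationLevel.RDF


# Hypothetical entries, not real Polifonia datasets.
entries = [
    DatasetEntry("Source A (hypothetical)", PublicationLevel.HTML_ONLY, None),
    DatasetEntry("Source B (hypothetical)", PublicationLevel.RDF, "CC BY 4.0"),
]
to_transform = [e.name for e in entries if needs_pipeline(e)]
unknown_legal_status = [e.name for e in entries if e.license is None]
```

Keeping the publication level ordered makes it easy to list which sources still need a pipeline and which still need their legal status clarified.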

@paolobonora

Examples of sources for the Carolina#1 CS about Perti, Giacomo Antonio:

  1. Catalogue of Museo della Musica of Bologna: http://www.bibliotecamusica.it/cmbm/scripts/gaspari/libri.asp?ms=%27E%27&ms=%27M%27&ID=3589
  2. Corago LOD: http://coragolod2.ing.unibo.it:8080/corago/resource/RESPONS/APCN00004400
  3. Data from REPIM: 13 works from a relational DB (which we own)
  4. Grove Music Online: https://doi.org/10.1093/gmo/9781561592630.article.21394

@enridaga
Member Author

enridaga commented Apr 6, 2021

Interested in REPIM, the "Repertorio Poesia in Musica", which includes secular music from the 15th-17th centuries

@enridaga
Member Author

enridaga commented Apr 6, 2021

Obviously, there is a page on Wikipedia: https://en.wikipedia.org/wiki/Giacomo_Antonio_Perti

@paolobonora

And we should also take into account the VIAF entry: http://viaf.org/viaf/19946155

@enridaga
Member Author

enridaga commented Apr 6, 2021

Take-away message 1: we need a registry!

@enridaga
Member Author

enridaga commented Apr 6, 2021

Take-away message 2: there is a huge diversity of formats, availability statuses, and quality

@enridaga
Member Author

enridaga commented Apr 6, 2021

On the diversity of formats, the OU is working on a new tool to query non-RDF resources with SPARQL: https://github.com/SPARQL-Anything/sparql.anything

@paolobonora

RISM is also "available" in RDF: https://opac.rism.info/id/rismid/454006820?format=rdf

@85jesse
Member

85jesse commented Apr 6, 2021

The registry needs to be filled, and anyone can contribute to it. We would also like to bring datasets a step further towards full Linked Open Data publication (by setting up pipelines, settling on a data format, linking with various sources, etc.), but there we need to focus our efforts. So we need to decide which datasets to prioritise: are there any use cases within Polifonia that need specific datasets?

@enridaga
Member Author

enridaga commented Apr 6, 2021

We cannot ask all data providers to commit to one ontological representation, which creates an interesting challenge for developing an exploratory system for the KG

@85jesse
Member

85jesse commented Apr 6, 2021

We also have to formulate performance requirements for datasets (what kind of queries do they need to be able to handle? How long can it take before results are returned?). We also need to decide on the degree to which the data can be cached/aggregated to boost performance.

enridaga added this to the Maninpasta (6/04/2021) milestone Apr 6, 2021

@enridaga
Member Author

enridaga commented Apr 6, 2021

Maybe we need to pursue a mixed strategy, by leaving data to the provider but caching as soon as access is requested. A search facility will necessarily need to index all the data, though.
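
The "cache as soon as access is requested" part of this mixed strategy could look roughly like the Python sketch below. It is only an illustration: `CachingSource` and `fake_fetch` are hypothetical names, and the fetch function is a stand-in for a real request to the data provider.

```python
from typing import Callable, Dict


class CachingSource:
    """Leaves data with the provider and keeps a local copy after first access."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch               # function that contacts the provider
        self._cache: Dict[str, str] = {}  # local copies, keyed by URL
        self.remote_calls = 0             # how often the provider was contacted

    def get(self, url: str) -> str:
        # Only go to the provider the first time a URL is requested.
        if url not in self._cache:
            self.remote_calls += 1
            self._cache[url] = self._fetch(url)
        return self._cache[url]


def fake_fetch(url: str) -> str:
    # Stand-in for an HTTP request; no network access happens here.
    return f"<data from {url}>"


source = CachingSource(fake_fetch)
first = source.get("http://viaf.org/viaf/19946155")
second = source.get("http://viaf.org/viaf/19946155")  # served from the cache
```

A search index cannot be filled lazily in this way, which is why the search facility would still need to harvest all the data up front.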

@85jesse
Member

85jesse commented Apr 6, 2021

Datasets to be registered also include thesauri/vocabularies that are used within the music domain. It would also be interesting to know which vocabularies have been linked/aligned with other (public) data sources (e.g. Wikidata, Discogs, etc.). These vocabularies can act as linking layers in the Knowledge Graph.

@enridaga
Member Author

enridaga commented Apr 6, 2021

Maybe we can use GitHub as a repository for the registry, and include a JSON(-LD) file for each of the resources. The musoW web application can then simply expose the data from there.
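
To make the one-file-per-resource idea concrete, here is a minimal Python sketch that builds a JSON-LD description for a single resource. The choice of schema.org `Dataset`, the selected properties, the helper name `registry_record`, and all example values are assumptions for illustration, not an agreed registry format.

```python
import json


def registry_record(identifier: str, name: str, url: str, license_url: str) -> dict:
    """Build a schema.org-style JSON-LD description of one registry resource."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "identifier": identifier,
        "name": name,
        "url": url,
        "license": license_url,
    }


# Hypothetical resource; none of these values come from the musoW catalogue.
record = registry_record(
    identifier="example-resource",
    name="Example musical resource",
    url="https://example.org/resource",
    license_url="https://creativecommons.org/licenses/by/4.0/",
)

# One such string would be committed as one file per resource in the repository.
serialized = json.dumps(record, indent=2, sort_keys=True)
```

Sorting the keys keeps the files stable across regenerations, so diffs in the repository only show real changes.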

@enridaga
Member Author

enridaga commented Apr 6, 2021

We should also consider using GitHub to host the actual datasets / linked data

@paolobonora

We should define a basic process for requesting a new source within the KG.
Something like:

  1. proposed by a user
  2. under analysis/triage
  3. accepted and being aligned
  4. added (with a note of what has been included)
  5. rejected
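
The states above can be sketched as a small state machine in Python. The state names follow the list, but the allowed transitions are an assumption, since the thread lists the states without saying which moves are permitted.

```python
from enum import Enum


class Status(Enum):
    PROPOSED = "proposed by a user"
    TRIAGE = "under analysis/triage"
    ALIGNING = "accepted and being aligned"
    ADDED = "added"
    REJECTED = "rejected"


# Assumed transitions: triage either accepts or rejects a proposal,
# and ADDED and REJECTED are treated as final states.
TRANSITIONS = {
    Status.PROPOSED: {Status.TRIAGE},
    Status.TRIAGE: {Status.ALIGNING, Status.REJECTED},
    Status.ALIGNING: {Status.ADDED},
    Status.ADDED: set(),
    Status.REJECTED: set(),
}


def advance(current: Status, target: Status) -> Status:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target


state = Status.PROPOSED
state = advance(state, Status.TRIAGE)
state = advance(state, Status.ALIGNING)
```

If GitHub hosts the registry, the same states could be tracked with issue labels instead of code.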

@enridaga
Member Author

enridaga commented Apr 6, 2021

There is a set of key questions about integrating resources into the knowledge graph, starting with: what do we integrate?

  • Metadata about the resources (easy, mandatory)
  • Schema elements / vocabularies
  • Entity linking

@enridaga
Member Author

enridaga commented Apr 6, 2021

Are we expecting to copy the original resource and transform it with our vocabulary? Or are we asking data providers to commit to our representation? Or something in between the two extremes?

@85jesse
Member

85jesse commented Apr 6, 2021

Another possible piece of the puzzle for vocabularies is the 'Network of Terms', an application that allows you to search multiple vocabularies via a single API: https://github.com/netwerk-digitaal-erfgoed/network-of-terms-api
This gives a good overview of the existing vocabularies that can be used. It's up to the collection holders to decide which vocabulary or vocabularies they want to use.

@enridaga
Member Author

enridaga commented Apr 6, 2021

Work-in-progress query to generate schema.org descriptions from the current musoW catalogue:

PREFIX dct: <http://purl.org/dc/terms/>
PREFIX schema: <http://schema.org/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Output: a schema.org description of the selected catalogue item.
CONSTRUCT {
  ?item schema:identifier ?identifier ;
        schema:license ?license ;
        schema:featureList ?feature
}
FROM <http://data.open.ac.uk/context/musow>
WHERE {
  # Restrict the query to a single catalogue entry for now.
  VALUES (?item) { (<http://data.open.ac.uk/musow/79e8954f4b0bc004e3ed8e5ea91bf7b4>) } .
  # ?charged, ?task, ?category and ?accessRights are bound here
  # but not yet mapped in the CONSTRUCT template.
  ?item dct:identifier ?identifier ;
        <http://data.open.ac.uk/musow/ontology/access/type> ?charged ;
        <http://data.open.ac.uk/musow/ontology/situation/task> ?task ;
        <http://dbpedia.org/ontology/category> ?category ;
        <http://purl.org/dc/terms/accessRights> ?accessRights ;
        <http://purl.org/dc/terms/license> ?license ;
        <http://schema.org/featureList> ?feature
  .
}

@enridaga
Member Author

Shall we rename this issue to RegistryActivity?

@enridaga
Member Author

Shall we move this issue to the Registry repository?

@albertmeronyo
Member

Meeting 14-05-2021 on Sethus: @enridaga suggested adding MEI document support in SPARQL Anything to enable KG access/ingestion/creation

AP: presenting the idea (and maybe a prototype if time allows?) at the MEI WG meeting on 28-05-2021 would be great
