Details about RDF Mapping
In this section we'll show excerpts from the DBPedia example, which maps some sample RDF downloaded from DBPedia and uses it to populate Neo4j.
rdf2neo allows you to define multiple configuration sets (named ConfigItem, see here). Each has a list of SPARQL mapping queries and possibly other configuration elements about a logical subset of your RDF data. For instance, in the DBPedia example, we have one ConfigItem for mapping data about places and another for mapping data about people. While simple projects might have just one configuration item, we allow for many because this helps keep data subsets separate.
RDF data can be mapped to Cypher nodes by means of the following query types.
The node URIs query is a SPARQL query that lists the URIs of all the RDF resources that represent a node. An example:
# The node list query must always project a ?iri variable
# (further returned variables are safely ignored; performance is usually better if you don't mention them at all).
SELECT DISTINCT ?iri
WHERE
{
# This picks up nodes of interest based on their rdf:type, which should be pretty common.
# Any instance of Person or Employee is considered
{ ?iri a schema:Person }
UNION { ?iri a schema:Employee }
# Another option is to consider anyone in the domain or range of a property, i.e., you know
# that anyone involved in a foaf:knows relation must be a person.
UNION { ?someone foaf:knows|^foaf:knows ?iri }
}
Typically this query will be listing instances of target classes, although you might also catch resources of interest by targeting subjects or objects of given relations.
The node labels query is invoked once for each URI found by the node URIs query and is parameterised over a single node URI (i.e., the ?iri variable in this query is bound to one IRI returned by the listing query, and the query is run once per such IRI). The query should return all the labels that you want to assign to that node on the Cypher side. For instance:
# The node label query must always project a ?label variable and must use the ?iri variable in the WHERE clause.
# ?iri will be bound to one of IRIs found in the node IRI query. The label query will be invoked once per node IRI,
# its purpose is to list all the Cypher labels that have to be assigned to the node.
#
# A label can be an IRI, a literal or a string. If it's an IRI, it will be translated into a Cypher
# identifier by means of the configured IRI-to-ID converter. At the moment we're using the default DefaultIri2IdConverter
# (see the Java sources), which takes the last part of an IRI.
#
SELECT DISTINCT ?label
WHERE
{
# As said above, ?iri is a constant during the actual execution of this query.
# When DefaultIri2IdConverter is used, schema:Person will become the label 'Person'.
{ ?iri a ?label }
# We always want this label
UNION { BIND ( schema:Person AS ?label ) }
}
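The comments above say that the default converter "takes the last part of an IRI". As a rough illustration of that idea only, here is a minimal Python sketch. The function name iri_to_id is hypothetical, and the exact rules of DefaultIri2IdConverter (see the Java sources) may differ, for instance in how trailing separators or special characters are handled.

```python
def iri_to_id(iri: str) -> str:
    """Return the fragment or last path segment of an IRI.

    Hypothetical helper: it only approximates the "last part of an IRI"
    behaviour described for DefaultIri2IdConverter.
    """
    # Ignore trailing separators, so that '.../Person/' still yields 'Person'.
    trimmed = iri.rstrip("/#")
    # Cut at whichever separator occurs last: '#' or '/'.
    cut = max(trimmed.rfind("#"), trimmed.rfind("/"))
    # When no separator is found, cut is -1 and the whole string is returned.
    return trimmed[cut + 1:]


print(iri_to_id("http://schema.org/Person"))                    # Person
print(iri_to_id("http://www.w3.org/2000/01/rdf-schema#label"))  # label
```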
The node properties query works with the same mechanism (one query per node URI, with the ?iri variable bound to a specific IRI) and lists all the property name/value pairs that you want to assign to the node:
# You need to return these two variables. ?iri is bound to a constant, as above.
#
# - ?name is typically an IRI and is converted into a shorter ID by means of a configured IRI->ID converter
# (no conversion if it's a literal).
# - ?value is a literal and, for the moment, is converted to simple value types (e.g., string, number), using
# its lexical value. We'll offer more customisation soon (e.g., mapping XSD types to Cypher/Java types).
#
SELECT DISTINCT ?name ?value
{
?iri ?name ?value.
FILTER ( isNumeric (?value) || LANG ( ?value ) = 'en' ). # Let's consider only these values
# We're interested in these properties only
# Again, these are passed to DefaultIri2IdConverter by default, and so things like
# rdfs:label, dbo:areaTotal become 'label', 'areaTotal'
VALUES ( ?name ) {
( rdfs:label )
( rdfs:comment )
( foaf:givenName )
( foaf:familyName )
}
}
So, if this RDF exists in the input:
...
@prefix ex: <http://www.example.com/resources/> .
ex:john a schema:Person, schema:Employee;
foaf:givenName "John"@en;
foaf:familyName "Smith"@en.
The queries above will yield the following Cypher node:
{
iri:"http://www.example.com/resources/john",
givenName: 'John',
familyName: 'Smith'
}: [ `Person`, `Employee`, `Resource` ]
As you can see, some values are created implicitly:
- Every node always has an iri property. We need this to correctly process the RDF-defined relations (see below), and we think it can be useful to track the provenance URI of a node. This property is always indexed and has distinct values.
- Every node always has a default label. The predefined value for this is Resource, but it can be changed by configuring a String bean with defaultNodeLabel as its ID. Again, we need this in order to find nodes by their IRI (the Cypher construct MATCH ( n:Resource { iri: $const } ) is very fast, which is not the case when you try to match the label with WHERE $myLabel IN LABELS (n)).
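To make the flow concrete (one listing query, then one labels query and one properties query per IRI, plus the implicit iri property and default label), here is a small in-memory Python sketch. It does not use rdf2neo or any SPARQL engine: the triples list and all function names are hypothetical stand-ins for the three queries, and local_name only approximates the default IRI-to-ID conversion.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

EX = "http://www.example.com/resources/"
SCHEMA = "http://schema.org/"
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# The sample RDF about ex:john, written as plain (subject, predicate, object) tuples.
TRIPLES: List[Triple] = [
    (EX + "john", RDF_TYPE, SCHEMA + "Person"),
    (EX + "john", RDF_TYPE, SCHEMA + "Employee"),
    (EX + "john", FOAF + "givenName", "John"),
    (EX + "john", FOAF + "familyName", "Smith"),
]


def local_name(iri: str) -> str:
    """Approximate the default IRI-to-ID conversion: keep the last part."""
    return iri[max(iri.rfind("#"), iri.rfind("/")) + 1:]


def node_iris(triples: List[Triple]) -> List[str]:
    """Stand-in for the node IRI query: every typed subject is a node."""
    return sorted({s for s, p, _ in triples if p == RDF_TYPE})


def node_labels(triples: List[Triple], iri: str, default_label: str = "Resource") -> List[str]:
    """Stand-in for the labels query, run with ?iri bound to one IRI."""
    labels = [local_name(o) for s, p, o in triples if s == iri and p == RDF_TYPE]
    # Drop repeated labels (keeping their order), then add the implicit default label.
    return list(dict.fromkeys(labels + [default_label]))


def node_properties(triples: List[Triple], iri: str) -> Dict[str, str]:
    """Stand-in for the properties query, run with ?iri bound to one IRI."""
    props = {"iri": iri}  # the implicit iri property
    for s, p, o in triples:
        if s == iri and p != RDF_TYPE:
            props[local_name(p)] = o
    return props


def build_nodes(triples: List[Triple]) -> List[dict]:
    """Invoke the per-node stand-ins once per listed IRI."""
    return [
        {"labels": node_labels(triples, iri), "properties": node_properties(triples, iri)}
        for iri in node_iris(triples)
    ]


nodes = build_nodes(TRIPLES)
print(nodes[0]["labels"])  # ['Person', 'Employee', 'Resource']
```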
Notes
- If values are literals, you should expect reasonable conversions (e.g., RDF numbers => Cypher numbers). TODO: we plan to add a configuration option to define custom literal converters.
Cypher relations between nodes are mapped from RDF in a similar way.
Similarly to nodes, rdf2pg first needs a list of the relations to be created. These must refer to the nodes they link by means of the node URIs (mapped earlier via the iri property). This is an example for the DBPedia people resources:
# You must always return a relation IRI, a relation type (IRI or string), the IRIs of the relation source and target.
SELECT DISTINCT ?iri ?type ?fromIri ?toIri
{
# Plain relations, non-reified
?fromIri ?type ?toIri.
# We're interested in these predicates only
VALUES ( ?type ) {
( dct:subject )
( dbo:team )
( dbo:birthPlace )
}
FILTER ( isIRI ( ?toIri ) ). # Just in case of problems
# Fictitious IRI for plain relations. We always need a relation IRI on the Cypher end,
# so we typically do this for straight triples.
#
BIND (
IRI ( CONCAT (
STR ( ex: ),
MD5 ( CONCAT ( STR ( ?type ), STR ( ?fromIri ), STR ( ?toIri ) ) )
))
AS ?iri
)
}
As you can see, certain variables must always be projected after the SELECT keyword. Among these, we always need the relation URI, which has to be computed for straight (non-reified) triples too.
Similarly to nodes, relation URIs (i.e., ?iri) are needed by rdf2pg to fetch the relation properties with the relation property query. Moreover, they are a good way to keep track of multiple statements about the same subject/predicate/object.
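The fictitious relation IRI computed by the BIND clause in the example query can be reproduced outside SPARQL, which may help when checking what ends up in Neo4j. The sketch below mirrors that expression in Python. The function name relation_iri and the value chosen for the ex: namespace are assumptions made for illustration. SPARQL's MD5 is computed on the UTF-8 form of the string and returned as hex digits, so the result should match as long as the engine emits lowercase hex.

```python
import hashlib

# Assumed expansion of the ex: prefix; use whatever your queries declare.
EX = "http://www.example.com/resources/"


def relation_iri(rel_type: str, from_iri: str, to_iri: str, namespace: str = EX) -> str:
    """Mirror IRI(CONCAT(STR(ex:), MD5(CONCAT(STR(?type), STR(?fromIri), STR(?toIri)))))."""
    key = rel_type + from_iri + to_iri
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return namespace + digest


iri = relation_iri(
    "http://dbpedia.org/ontology/birthPlace",
    "http://dbpedia.org/resource/Alan_Turing",
    "http://dbpedia.org/resource/London",
)
print(iri)
```

Since the IRI is a pure function of type, source and target, the same triple always gets the same relation IRI, while changing any of the three components yields a different one.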
As said above, this is similar to the nodes case. If there are relations with attached properties on the RDF side, these will be defined through some RDF graph structure, which puts together multiple triples per relation.
For example, if such relations are reified via the rdf: vocabulary:
SELECT DISTINCT ?iri ?type ?fromIri ?toIri
WHERE {
?iri a rdf:Statement;
rdf:subject ?fromIri;
rdf:predicate ?type;
rdf:object ?toIri.
}
Once rdf2pg receives the reified relation tuples, each is used with a query like this to select its properties:
# You must always return these and bind ?iri below
SELECT DISTINCT ?name ?value
WHERE {
?iri ?name ?value.
FILTER ( isNumeric (?value) || LANG ( ?value ) = 'en' ). # again, safeguarding code
# Again, we're interested in these datatype properties only
VALUES ( ?name ) {
( rdfs:label )
( rdfs:comment )
( dbo:areaTotal )
( dbo:populationTotal )
}
}
As above, ?name is the property name that will be used for Cypher. If it is a URI, it will be converted by a URI-to-identifier converter. ?value is converted to Cypher following the same rules described above.
As explained elsewhere, all mapped nodes are created on the PG side before any PG relation. Thanks to that, relation queries can reliably refer to the same URIs used for nodes, and they will always match. This also applies to nodes/relations spread across different ConfigItems.
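The ordering guarantee can be pictured as a two-phase load. The following Python sketch is only a conceptual model, not rdf2pg code: the dictionary layout of a config item and the load function are hypothetical, and skipping relations with unknown endpoints is an assumption of the sketch rather than documented behaviour.

```python
from typing import Dict, List, Set, Tuple

Relation = Tuple[str, str, str]  # (fromIri, type, toIri)

EX = "http://www.example.com/resources/"


def load(config_items: List[Dict[str, list]]) -> Tuple[Set[str], List[Relation]]:
    """Create the nodes of ALL config items first, then the relations of all items."""
    node_iris: Set[str] = set()
    # Phase 1: nodes from every config item.
    for item in config_items:
        node_iris.update(item["nodes"])
    # Phase 2: relations, resolved against the complete node set.
    relations: List[Relation] = []
    for item in config_items:
        for from_iri, rel_type, to_iri in item["relations"]:
            if from_iri in node_iris and to_iri in node_iris:
                relations.append((from_iri, rel_type, to_iri))
    return node_iris, relations


# The people item refers to a place node that is defined by a later item.
people = {"nodes": [EX + "john"], "relations": [(EX + "john", "birthPlace", EX + "london")]}
places = {"nodes": [EX + "london"], "relations": []}

loaded_nodes, loaded_relations = load([people, places])
print(len(loaded_nodes), len(loaded_relations))  # 2 1
```

Because phase 2 only starts once every node exists, the order in which the config items are listed does not affect which relations can be resolved.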
The identifier SPARQL queries must return distinct results. rdf2pg makes no effort to check or enforce such uniqueness. As is common, when you are sure that a given set of patterns yields unique tuples, you can omit DISTINCT, which might make the query faster. Details about URI uniqueness are discussed in issue #3.
Another simple (though not very efficient) way to deal with this constraint is to first allow duplicated nodes or relations, and then remove the duplicates on the PG target (e.g., by using Cypher DELETE). URI-based duplicates generated by rdf2pg can be removed without problems, since nodes/relations created from the same URI are usually identical: their properties/labels/types are created by the same SPARQL query, instantiated with the same URI.
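The reasoning behind "duplicates can be removed without problems" can be shown with a small in-memory analogue in Python. This is not the Cypher-side clean-up mentioned above, and it is not part of rdf2pg. The record layout and the dedupe_by_iri function are hypothetical, and raising an error on conflicting duplicates is a design choice of this sketch.

```python
from typing import Dict, Iterable, List


def dedupe_by_iri(records: Iterable[Dict[str, object]]) -> List[Dict[str, object]]:
    """Keep one record per iri, checking that the duplicates really are identical."""
    unique: Dict[str, Dict[str, object]] = {}
    for record in records:
        iri = str(record["iri"])
        if iri in unique and unique[iri] != record:
            # Same IRI but different content: dropping one copy would lose data.
            raise ValueError(f"Conflicting records for {iri}")
        unique.setdefault(iri, record)
    return list(unique.values())


john = {"iri": "http://www.example.com/resources/john", "givenName": "John"}
mary = {"iri": "http://www.example.com/resources/mary", "givenName": "Mary"}

deduped = dedupe_by_iri([john, dict(john), mary])
print(len(deduped))  # 2
```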