
Triple, Node and URI expansion for TriplesGraph #19

Merged (8 commits, Dec 21, 2014)

Conversation

cordawyn
Collaborator

This addresses #12. However, don't be too willing to accept this pull request yet, @robstewart57 :-)
It introduces some inconsistency that I would like to highlight below and discuss.

First and foremost, the select and query functions of TriplesGraph did not work as announced by the original author: they called the expandTriples function, which was supposed to expand the triples' namespaces and filter out duplicates. However, it was not implemented and simply returned the triples intact. I implemented the body of that function and also added a uniqTriplesOf function (better name pending) as a companion to triplesOf; unlike the latter, it returns unique, expanded triples. Now select, query, isIsomorphic and the other query functions should work properly.
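The expansion-plus-deduplication idea can be sketched in a self-contained way, roughly like this (PrefixMap, expandQName and uniqExpand are hypothetical names for illustration, not the actual rdf4h internals):

```haskell
-- A minimal sketch of namespace expansion and deduplication.
-- PrefixMap, expandQName and uniqExpand are hypothetical; rdf4h's
-- real PrefixMappings and Node types are more involved.
import qualified Data.Map as M
import qualified Data.Text as T
import Data.List (nub)

type PrefixMap = M.Map T.Text T.Text

-- Expand a "prefix:local" QName against the prefix map;
-- leave URIs with no known prefix intact.
expandQName :: PrefixMap -> T.Text -> T.Text
expandQName pms uri =
  case T.breakOn ":" uri of
    (prefix, rest)
      | not (T.null rest)
      , Just ns <- M.lookup prefix pms -> ns <> T.drop 1 rest
    _ -> uri

-- The uniqTriplesOf analogue: expand everything, then drop duplicates.
uniqExpand :: PrefixMap -> [T.Text] -> [T.Text]
uniqExpand pms = nub . map (expandQName pms)
```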

But we are now left with the following inconsistency: triplesOf returns non-expanded triples, while select and query return expanded triples. This could confuse users who suddenly get triples that they "never added to the graph", at least not in the format in which they were added. All the other functions that use triplesOf behind the curtain will produce slightly different output from that of select and query.

And now I'm looking at MGraph. It appears that MGraph needs that expansion routine implemented too, but due to the nature of the graph, the expansion must be performed when triples are added. Since MGraph is a map tree, it effectively "avoids" duplicates, but node expansion is a must, because expanded and non-expanded nodes will form different branches.

All this raises the question of whether TriplesGraph really needs that expansion and duplicate filtering of triples. We have two different types of RDF graphs, each with a distinctive set of pros and cons, so perhaps having duplicates and naïve "handling" of namespaces is a "feature" of TriplesGraph? On the other hand, not handling namespaced URIs is just too cruel. I could bear having duplicates in TriplesGraph at the cost of... graph building speed? MGraph, meanwhile, takes longer to build but is (arguably) better for querying and has no duplicates. Perhaps there is a third or fourth option too: we could explore other kinds of graphs, or even limit RDF4H to just one?

So my point is: the introduction of namespace expansion and duplicate handling is actually a much bigger change than issue #12 suggests. It is perhaps quite a dramatic change for the project, and it must be applied wholly and systematically. And we must definitely bring namespace expansion to MGraph.

In light of the above, this pull request is not "complete". You can accept it as work in progress, but it's certainly not ready for a release.

@robstewart57
Owner

> While MGraph takes longer to build, but is (arguably) better for querying and has no duplicates. Perhaps there is a third or a fourth option too, and we could explore other kind of graphs, or even limit RDF4H to just one?

I've thought about this for a while. I've no intuition as to whether the performance of MGraph or TriplesGraph is any good for each use case of the rdf4h API. We should probably adopt criterion; I've created a ticket for this: #20.

I've also thought about the possibility of switching RDF to a single data structure, if we could come up with one implementation that vastly outperforms existing instances in all API use cases. If criterion tells us there's a trade-off in the performance of each instance, then that's a good argument for keeping RDF as a type class -- it gives the user the choice for their use case.

I'm quite sure there are more appropriate graph representations we could use for a third RDF implementation. See "Structuring Depth-First Search Algorithms in Haskell" by David King and John Launchbury, here. It has a corresponding Haskell implementation here. There's an implementation of Martin Erwig's Functional Graph Library here, too. Maybe there is a Haskell parallel graph library, more efficient than map trees, that could be adapted to represent RDF graph structures. We also don't currently use any of Haskell's parallelism libraries for graph querying; we certainly could. Again, we need criterion to tell us where performance is bad.

@robstewart57
Owner

Further to the above comment, the following document could serve as a good influence for efficient RDF graph representations in Haskell.

"Storing and Indexing Massive RDF Data Sets". Yongming Luo, François Picalausa, George H.L. Fletcher, Jan Hidders, and Stijn Vansummeren. 2011.
http://www.win.tue.nl/~yluo/seeqr/files/11survey.pdf

It provides a thorough literature survey of indexing and query techniques in real-world RDF stores. It covers vertical and horizontal indexing, implementations of entity perspectives, graph-based indexing methods, and structural indexes.

I wonder whether rdf4h could incorporate some of these demonstrably successful RDF indexing techniques. Combined with GHC's fusion on composition over Text URI representations (adopted in rdf4h 1.0.0), and GHC's concurrency support for parallel RDF search, we might have a chance of getting close to, or even beating, the performance of the currently widely used RDF stores and library APIs.

I'm convinced that, with so many RDF indexing strategies, using the flexibility of the RDF type class to provide multiple instances is probably the right choice. It'd be good to have meaningful, domain-specific names; TriplesGraph and MGraph are not very descriptive. E.g. ClusteredBTree would be more descriptive.

I recommend reading the document above if you get a chance.

@robstewart57
Owner

> First and foremost, select and query functions of TriplesGraph did not work as announced by the original author - they called expandTriples function, which was supposed to expand triples' namespaces and filter out duplicates. However it was not implemented and it just returned triples intact. I implemented the body of that function and also added uniqTriplesOf (better name pending) function as a companion of triplesOf, which returns unique and expanded triples unlike the latter. Now select, query, isIsomorphic and other query functions should work properly.

What do you think about just deprecating triplesOf to remove any confusion over the inconsistency with select and query? Either that, or replace the current implementation of triplesOf with your implementation of uniqTriplesOf? If you think that's a good idea and want to add this as a commit in cordawyn/master (i.e. to append to this pull request), then this'd be a good time to merge, bump to 1.2.8 and update Hackage. Thoughts?

@cordawyn
Collaborator Author

I support the idea of keeping the RDF type class as well as having different graph implementations. There are at least two distinct ones that make sense: list-based (TriplesGraph) and graph-based (MGraph). We certainly should explore other graph-based options. But for now, I think we should concentrate on sorting out the basic implementation requirements (in #20) and leave this pull request hanging for a while.

@robstewart57
Owner

I've thought about this some more. See section 4 of the standard:
http://www.isi.edu/in-notes/rfc2396.txt

How about we adopt more of network-uri, i.e.

```haskell
data Node =
    UNode !URI
  | BNode !T.Text
  | BNodeGen !Int
  | LNode !LValue
    deriving Generic
```

Note that in the RFC 2396 standard the URI syntax is

```
URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]
```

Moreover, network-uri supports both relative and absolute URI values in its URI type, e.g.

```haskell
uriIsAbsolute :: URI -> Bool
uriIsRelative :: URI -> Bool
```
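For example (parseURIReference, uriIsAbsolute and uriIsRelative are real functions from Network.URI in the network-uri package; the URIs are made up):

```haskell
import Network.URI (parseURIReference, uriIsAbsolute, uriIsRelative)

main :: IO ()
main =
  case ( parseURIReference "http://example.org/ns#name"
       , parseURIReference "#name" ) of
    (Just absURI, Just relURI) -> do
      print (uriIsAbsolute absURI) -- True: the reference carries a scheme
      print (uriIsRelative relURI) -- True: no scheme, just a fragment
    _ -> putStrLn "parse failure"
```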

With respect to select, query and mkRdf, rather than guessing what the user wants, we could just expose:

```haskell
mkRdf :: Triples -> Maybe BaseUrl -> PrefixMappings -> rdf
mkRdfExpand :: Triples -> Maybe BaseUrl -> PrefixMappings -> rdf
select :: rdf -> NodeSelector -> NodeSelector -> NodeSelector -> Triples
selectExpand :: rdf -> NodeSelector -> NodeSelector -> NodeSelector -> Triples
triplesOf :: rdf -> Triples
triplesOfExpand :: rdf -> Triples
query :: rdf -> Maybe Node -> Maybe Node -> Maybe Node -> Triples
queryExpand :: rdf -> Maybe Node -> Maybe Node -> Maybe Node -> Triples
```

This gives control to the user. The main drawback is potentially erroneous programs where the user has not expanded nodes when they thought they had. That is, there is nothing in the type [Triple] that tells you whether the URI nodes in each triple are expanded or not. Perhaps this is an acceptable limitation?

Alternatively, what about using type indexing? This is used in the repa array library to provide type-level information about the representation of arrays, i.e. whether they are delayed or manifest. See section 3 of:

Guiding Parallel Array Fusion with Indexed Types. Ben Lippmeier et al., 2012.
http://www.cse.unsw.edu.au/~keller/Papers/repa3.pdf

So for example, our types would become:

```haskell
data Abs
data Rel
data family Node e
data family Triple e
data instance Node Abs = ...
data instance Node Rel = ...
data instance Triple Abs = ...
data instance Triple Rel = ...
```

Here the e type argument denotes whether the URIs in the value are absolute or relative. At that stage, we could specialise on type-level information about the URI values inside. E.g.

```haskell
expand :: UNode Rel -> UNode Abs
select :: rdf -> NodeSelector -> NodeSelector -> NodeSelector -> Triples Abs
triplesOf :: rdf -> Triples Abs
triplesOf' :: rdf -> Triples Rel
```

(I don't expect the above to be correctly typed; I just hope it conveys the idea.)

@cordawyn
Collaborator Author

It looks like that "Indexed Types" paper needs a thorough read on my part, so I'm not going to discuss that suggestion of yours yet. But I think we could do with better integration with network-uri, as you suggested. In fact, this was my approach in my Ruby library for RDF (and it appears I've been gently pushing RDF4H towards that idea as well 😁). If we're dealing with URIs, we must deal with URI objects, not Strings; that's the general point.

It should be noted, however, that for performance reasons we may want to let end users create nodes and/or triples from Strings directly, skipping the intermediate URI creation (if possible). I came across a few use cases where users just had a list of absolute URIs (as strings) and all they needed was to make an RDF graph from them. Mapping them to URIs, then to nodes and then to triples turned out to be slow. This may not apply to RDF4H, though, as we have no means of creating triples from a list of 3 strings (as the Ruby bindings for Redland do); we'll have to do all that remapping anyway, I guess.

Anyway, slow as it may turn out to be, UNode !URI is still "the right way" (IMHO) and we can implement it first, then think about optimizations.

As far as those *expand functions are concerned, I'm more inclined to drop the idea of handling URI expansion within the RDF graph implementations altogether. Instead, we should provide functions for prefix expansion (and URI absolutizing) and let users "massage" their input. We could also add a note (in red glowing letters) saying "we accept your URIs as they are, without expansion and absolutizing". If a user's project ensures absolute URIs, they won't suffer performance hits from us reprocessing input that doesn't need to be reprocessed. Others will need to expand their URIs selectively, as needed, without an overall processing of the whole graph. That said, perhaps we should roll back my pull request (if it's already merged in), if you agree.

@robstewart57
Owner

OK, I agree that we defer URI expansion to new functions:

```haskell
expandNode :: Node -> Node
expandNodes :: Triple -> Triple
expandTriples :: [Triple] -> [Triple]
```

That would mean we'd need to make pervasive changes to the property-based and unit test cases. Might the separation of URI expansion into the three functions above help us satisfy more of those W3C tests? Passing 100% of our QuickCheck, HUnit and W3C tests is a good starting point for introducing type indexes for relative and absolute URIs, so let's aim for that first.

Given the decision to defer expansion to expandNode, expandNodes and expandTriples, how many commits in this pull request survive?

@robstewart57
Owner

Yikes, I think that should be:

```haskell
expandNode    :: (RDF rdf) => rdf -> Node -> Node
expandNodes   :: (RDF rdf) => rdf -> Triple -> Triple
expandTriples :: (RDF rdf) => rdf -> [Triple] -> [Triple]
```

@robstewart57
Owner

Scratch that: I've looked through the commits in this pull request. I think we can accept all commits except 9dd4729, i.e. "added namespace expansion to other query functions".

I agree with your functions:

```haskell
expandTriple :: PrefixMappings -> Triple -> Triple
expandTriples :: (RDF rdf) => rdf -> Triples
expandNode :: PrefixMappings -> Node -> Node
expandURI :: PrefixMappings -> T.Text -> T.Text
absolutizeNode :: Maybe BaseUrl -> Node -> Node
absolutizeTriple :: Maybe BaseUrl -> Triple -> Triple
```
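As a rough, self-contained illustration of what the absolutize* helpers might do at the URI level (naive string resolution under hypothetical types; real resolution should follow RFC 3986):

```haskell
import qualified Data.Text as T

-- Hypothetical stand-in for rdf4h's BaseUrl.
newtype BaseUrl = BaseUrl T.Text

-- Naively resolve a possibly-relative URI against an optional base.
-- This only illustrates the API shape: a URI whose first path-free
-- segment contains ':' is assumed to already carry a scheme.
absolutizeURI :: Maybe BaseUrl -> T.Text -> T.Text
absolutizeURI Nothing uri = uri
absolutizeURI (Just (BaseUrl base)) uri
  | ":" `T.isInfixOf` T.takeWhile (/= '/') uri = uri
  | otherwise = base <> uri
```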

robstewart57 added a commit that referenced this pull request Dec 21, 2014
Triple, Node and URI expansion for TriplesGraph
robstewart57 merged commit 05ccc56 into robstewart57:master on Dec 21, 2014
@robstewart57
Owner

This pull request is merged; I've commented on 9dd4729.
