Triple, Node and URI expansion for TriplesGraph #19
unable to look-up the prefix
Conflicts: src/Text/RDF/RDF4H/TurtleParser.hs
I've thought about this for a while. I've no intuition as to whether the performance of I've also thought about the possibility of switching I'm quite sure that there are more appropriate graph representations we could use for a 3rd
Further to the above comment, the following document could serve as a good influence for efficient RDF graph representations in Haskell: "Storing and Indexing Massive RDF Data Sets". Yongming Luo, François Picalausa, George H.L. Fletcher, Jan Hidders, and Stijn Vansummeren. 2011. It provides a thorough literature survey on indexing and query techniques in real-world RDF stores. It includes vertical and horizontal indexing, implementations of entity perspectives, graph-based indexing methods, and structural indexes. I wonder whether rdf4h could incorporate some of these demonstrably successful RDF indexing techniques. Combined with GHC's fusion on composition over I'm convinced that with so many RDF indexing strategies, using the I recommend reading the document above if you get a chance.
What do you think about just deprecating
I support the idea of keeping
I've thought about this some more. See section 4 of the standard. How about we adopt more of
Note in the RFC2396 standard the URI syntax is
Moreover,
With respect to
This gives control to the user. The main drawback is potentially erroneous programs where the user has not expanded nodes when they thought they had. That is, there is nothing in the type

Alternatively, what about using type indexing? This is used in the repa array library to provide type-level information about the representation of arrays, i.e. whether they are delayed or manifested. See section 3 of "Guiding Parallel Array Fusion with Indexed Types", Ben Lippmeier et al., 2012. So for example, our types would become:
Where the
(I don't expect the above to be correctly typed, I just hope it conveys the idea).
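To make the type-indexing idea concrete, here is a minimal sketch using a phantom type parameter. It is not the snippet from the comment above and not rdf4h's API: `Node`, `Unexpanded`, `Expanded`, `mkNode`, `expand`, `sameResource` and the `String`-based `PrefixMap` are all hypothetical names chosen for illustration.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical type-level tags recording whether a node has been expanded.
data Unexpanded
data Expanded

-- The parameter 'rep' is phantom: it appears only in the type, never in the value.
newtype Node rep = Node String
  deriving (Eq, Ord, Show)

-- Assumption: prefixes map a short name such as "ex" to a base URI.
type PrefixMap = Map.Map String String

-- Nodes built from raw text start out tagged as unexpanded.
mkNode :: String -> Node Unexpanded
mkNode = Node

-- The only way to obtain a 'Node Expanded' is to go through 'expand'.
-- A "prefix:local" name with a known prefix is rewritten; anything else is kept.
expand :: PrefixMap -> Node Unexpanded -> Node Expanded
expand pms (Node raw) =
  case break (== ':') raw of
    (prefix, ':' : local)
      | Just base <- Map.lookup prefix pms -> Node (base ++ local)
    _ -> Node raw

-- Comparison is only offered for expanded nodes, so forgetting to expand
-- becomes a compile-time type error rather than a silent mismatch.
sameResource :: Node Expanded -> Node Expanded -> Bool
sameResource = (==)

-- Accessor that works for either representation.
nodeText :: Node rep -> String
nodeText (Node s) = s
```

With this shape, `sameResource (mkNode "ex:a") (mkNode "ex:a")` is rejected by the type checker, which addresses the drawback mentioned above that nothing in the type says whether a node has been expanded.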
It looks like that "Indexed Types" paper needs a thorough read on my part, so I'm not going to discuss that suggestion of yours, yet. But I think we could do with a better integration with

It should be noted, however, that for performance reasons, we may want to enable end users to create nodes and/or triples by using Strings directly, skipping that intermediate URI creation (if possible). I came across a few use cases where users just had a list of absolute URIs (as strings) and all they needed was to make an RDF graph from them. Mapping them to URIs, then to nodes and then to triples appeared to be a slow process. Although this may not apply to RDF4H, as we do not have means of creating triples from a list of 3 strings (as the Ruby bindings for Redland do) - we'll have to do all that remapping anyway, I guess.

Anyway, slow as it may turn out to be, As far as those
OK, I agree that we defer URI expansion to new functions:
That would mean we'd need to make pervasive changes to the property-based and unit test cases. Might the separation of URI expansion into the three functions above help us satisfy more of those W3C tests? Passing 100% of our QuickCheck, HUnit and W3C tests is a good starting point for introducing type indexes for relative and absolute URIs, so let's aim for that first. Given the decision to defer expansion to
Yikes, I think that should be:
Quit that, I've looked through the commits in this pull request. I think we can accept all commits with the exception of 9dd4729, i.e. "added namespace expansion to other query functions". I agree with your functions
This pull request is merged; I've commented on 9dd4729.
This addresses #12. However, don't be so willing to accept this pull request yet, @robstewart57 :-)
It introduces some inconsistency that I would like to highlight below and discuss.
First and foremost, `select` and `query` functions of `TriplesGraph` did not work as announced by the original author - they called `expandTriples` function, which was supposed to expand triples' namespaces and filter out duplicates. However it was not implemented and it just returned triples intact. I implemented the body of that function and also added `uniqTriplesOf` (better name pending) function as a companion of `triplesOf`, which returns unique and expanded triples unlike the latter. Now `select`, `query`, `isIsomorphic` and other query functions should work properly.

But we now deal with the following inconsistency: `triplesOf` returns non-expanded triples, `select` and `query` return expanded triples. Perhaps this could lead to confusion when you suddenly get triples that you "never added to the graph". At least, not in the format that they were added. All other functions that use `triplesOf` behind the curtain are going to produce a bit different output than that of `select` and `query`.

And now I'm looking at `MGraph`. It appears that `MGraph` needs that expansion routine implemented too, but due to the graph nature, it must have that expansion performed when triples are added to the graph. Since `MGraph` is a map tree, it effectively "avoids" duplicates, but node expansion is a must, because expanded and non-expanded nodes will compose different branches.

All that brings up a question whether `TriplesGraph` really needs that expansion and duplicate filtering of triples. We have 2 different types of RDF graphs, each having a distinctive set of pros and cons, so perhaps having duplicates and naïve "handling" of namespaces is a "feature" of `TriplesGraph`? On the other hand, not handling namespaced URIs is just too cruel. I could bear having duplicates in `TriplesGraph` at the cost of... graph building speed? While `MGraph` takes longer to build, but is (arguably) better for querying and has no duplicates. Perhaps there is a third or a fourth option too, and we could explore other kind of graphs, or even limit RDF4H to just one?

So my point is: introduction of namespace expansion and handling of duplicates is actually much bigger than issue #12. It is perhaps quite a dramatic change for the project and it must permeate it wholly and systematically. And we must definitely bring namespace expansion to `MGraph`.

In the light of the abovementioned things, this pull request is not "complete". You can accept it as work-in-progress, but it's certainly not ready for a release.
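For readers who want to see the two behaviours side by side, below is a small self-contained sketch of "expand, then filter duplicates" over a plain list of triples, plus a nested-map graph showing why unexpanded nodes end up on separate branches. It does not use rdf4h's types or functions: `Triple`, `PrefixMap`, `expandAndDedup`, `MapGraph` and `fromTriples` are hypothetical stand-ins, with nodes simplified to plain strings.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- Simplified stand-ins (assumptions): nodes are strings, prefixes map to base URIs.
type PrefixMap = Map.Map String String
type Triple = (String, String, String)

-- Rewrite "prefix:local" when the prefix is known; leave everything else intact.
expandNode :: PrefixMap -> String -> String
expandNode pms node =
  case break (== ':') node of
    (prefix, ':' : local)
      | Just base <- Map.lookup prefix pms -> base ++ local
    _ -> node

expandTriple :: PrefixMap -> Triple -> Triple
expandTriple pms (s, p, o) = (expandNode pms s, expandNode pms p, expandNode pms o)

-- Expand every triple, then drop duplicates while keeping first-seen order.
expandAndDedup :: PrefixMap -> [Triple] -> [Triple]
expandAndDedup pms = go Set.empty . map (expandTriple pms)
  where
    go _ [] = []
    go seen (t : ts)
      | t `Set.member` seen = go seen ts
      | otherwise = t : go (Set.insert t seen) ts

-- A map-tree graph: subject -> predicate -> set of objects.
type MapGraph = Map.Map String (Map.Map String (Set.Set String))

insertTriple :: Triple -> MapGraph -> MapGraph
insertTriple (s, p, o) =
  Map.insertWith (Map.unionWith Set.union) s (Map.singleton p (Set.singleton o))

fromTriples :: [Triple] -> MapGraph
fromTriples = foldr insertTriple Map.empty

-- The same statement written once with a prefix and once with absolute URIs.
examplePrefixes :: PrefixMap
examplePrefixes = Map.fromList [("ex", "http://example.org/")]

exampleTriples :: [Triple]
exampleTriples =
  [ ("ex:a", "ex:knows", "ex:b")
  , ("http://example.org/a", "http://example.org/knows", "http://example.org/b")
  ]
```

In this sketch, `fromTriples exampleTriples` has two subject branches because the keys differ textually, while `fromTriples (map (expandTriple examplePrefixes) exampleTriples)` has one. That is the reason expansion would need to happen at insertion time for a map-based graph, whereas a list-based graph can defer it to query time.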