Incorrect pagination in SPARQL CONSTRUCT dataset #413

Open
erikap opened this issue Sep 21, 2016 · 5 comments

Comments

@erikap
Contributor

erikap commented Sep 21, 2016

If limit/offset are not specified in the SPARQL query of the dataset definition, the SPARQLController calculates the pagination itself. For a CONSTRUCT query this calculation appears to be wrong.

If the query has a structure like:

CONSTRUCT { <construct_definition> } WHERE { <where_clause> }

the number of results is calculated as follows:

SELECT COUNT(*) as ?number WHERE { <where_clause> }

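This doesn't yield a correct result for a CONSTRUCT query: it counts the solutions of the WHERE clause, not the triples that the CONSTRUCT template produces from them.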
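To illustrate the mismatch, here is a minimal Python sketch. It is not the SPARQLController code; it only mimics CONSTRUCT evaluation over hand-written WHERE-clause solutions. The bindings, the two-pattern template and the `construct` helper are all made up for the example.

```python
# Hypothetical WHERE-clause solutions: each dict is one row of variable bindings.
solutions = [
    {"s": "ex:alice", "name": "Alice", "type": "ex:Person"},
    {"s": "ex:bob", "name": "Bob", "type": "ex:Person"},
    {"s": "ex:bob", "name": "Bob", "type": "ex:Person"},  # duplicate row
]

# Hypothetical CONSTRUCT template with two triple patterns.
template = [
    ("?s", "foaf:name", "?name"),
    ("?s", "rdf:type", "?type"),
]


def construct(template, solutions):
    """Instantiate every template pattern for every solution.

    The result is a set, because an RDF graph cannot hold the same triple twice.
    """
    graph = set()
    for row in solutions:
        for pattern in template:
            # Replace "?var" terms by the bound value, keep constants as they are.
            triple = tuple(row[t[1:]] if t.startswith("?") else t for t in pattern)
            graph.add(triple)
    return graph


graph = construct(template, solutions)
select_count = len(solutions)  # what a COUNT over the WHERE clause reports
triple_count = len(graph)      # what the CONSTRUCT actually returns
print(select_count, triple_count)
```

The two numbers differ in both directions: a template with several patterns multiplies the count, while duplicate solutions collapse into the same triples.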

Thanks to @bertvannuffelen for the catch.

@coreation
Member

Indeed. Isn't the problem here how to implement counting for a constructed result, i.e. what exactly to count?

@bertvannuffelen

Well, the only way to count is to execute the construct and then implement paging based on that result.
The supplying SPARQL endpoint should/could provide pagination in this case, but not all do this (for instance Virtuoso).

So far, the only solution is to rely on the finiteness of the response of the construct (i.e. that users do not request stupid things). Pagination can only be implemented by collecting the complete response in a temporary structure.
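As a rough sketch of what paging over a fully collected response could look like (in Python, not the project's own code), the hypothetical `paginate` helper below sorts the triples first so that page boundaries do not depend on the order in which the endpoint happened to return them. The triples are plain tuples invented for the example.

```python
import math


def paginate(triples, limit, offset=0):
    """Return one page of a fully collected CONSTRUCT result plus paging metadata."""
    if limit <= 0 or offset < 0:
        raise ValueError("limit must be positive and offset non-negative")
    ordered = sorted(triples)  # fixed order, so pages are stable between requests
    total = len(ordered)
    return {
        "triples": ordered[offset:offset + limit],
        "total": total,
        "pages": math.ceil(total / limit),
    }


# Made-up result of a CONSTRUCT query, deliberately in reverse order.
triples = [("ex:s%d" % i, "ex:p", "ex:o") for i in reversed(range(5))]
result = paginate(triples, limit=2, offset=4)
print(result)
```

Here the total is the number of constructed triples, which is the figure a pager actually needs.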

@coreation
Member

@bertvannuffelen how would you handle paging in a temporary structure? Am I correct in assuming that we cannot rely on a SPARQL endpoint returning the triples of a construct query in the same order in consecutive calls? If so, are you suggesting that we cache the result of a query, page it in memory and return that page, and then, when a different page of the same query is requested, get the cached object, page it in memory and return it to the client instead of performing the SPARQL query again?

@bertvannuffelen

SPARQL construct queries always return the complete answer. It is up to the SPARQL endpoint implementation to handle any need for pagination, and that is where the problem lies: most do not support pagination for construct queries.

So CONSTRUCT { ... } WHERE { ... } will return all information at once.

This can be the whole database, e.g. with this query:
CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }

For small volumes there is no problem. For larger volumes, clients might stumble on it. For very large volumes, the supplying SPARQL endpoint will apply a strategy to reduce the chance that it dies. Virtuoso does that by implementing a cut-off in the response (the magic 10000 number, part of the Virtuoso configuration). If you get 10K triples/response rows, you do not know whether there were exactly 10K triples/response rows or more.
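The ambiguity at the cut-off can at least be flagged on the client side. The Python sketch below assumes the cap is the 10000 mentioned above; the real value depends on how the endpoint is configured, and both the `may_be_truncated` helper and the constant name are hypothetical.

```python
# Assumption: the endpoint cuts responses off at this many rows. The value is
# the "magic number" from the discussion and is configurable on the server.
ASSUMED_ROW_CAP = 10000


def may_be_truncated(row_count, cap=ASSUMED_ROW_CAP):
    """Return True when a response is as large as the cap.

    A response of exactly `cap` rows may be complete or cut off; the client
    cannot tell which, so it is reported as possibly truncated.
    """
    return row_count >= cap


print(may_be_truncated(9999), may_be_truncated(10000))
```

A positive answer only says that the result cannot be trusted to be complete, not that rows are definitely missing.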

I am indeed suggesting that, for construct queries (for selects the current approach works fine), the approach is "caching the result for a query, perform in memory paging, returning the result and when a different page of the query is requested, get the cached object, page it in memory and return it to the client instead of performing the SPARQL query".

I see no other alternative for the moment (unless selecting a SPARQL endpoint that implements pagination on all requests).
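A minimal Python sketch of that caching approach follows, under several assumptions: the `ConstructPageCache` class and `fake_fetch` are hypothetical names, a plain dict stands in for whatever cache backend is used, and expiry of cached results is left out entirely.

```python
import hashlib


class ConstructPageCache:
    """Run a CONSTRUCT query once, then serve every page from the cached result."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable: query string -> iterable of triples
        self._store = {}     # stand-in for a real cache backend (no expiry here)

    def page(self, query, limit, offset=0):
        key = hashlib.sha256(query.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Sort once so that every later page sees the same order.
            self._store[key] = sorted(self._fetch(query))
        triples = self._store[key]
        return triples[offset:offset + limit], len(triples)


calls = []


def fake_fetch(query):
    """Pretend endpoint: records each call and returns five made-up triples."""
    calls.append(query)
    return {("ex:s%d" % i, "ex:p", "ex:o") for i in range(5)}


cache = ConstructPageCache(fake_fetch)
query = "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }"
first, total = cache.page(query, limit=2)
second, _ = cache.page(query, limit=2, offset=2)
print(len(calls), total, first, second)
```

The endpoint is queried only for the first page; the second page comes from the stored, already ordered result. A real implementation would also need to decide how long a cached result stays valid.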

Constructs are actually used in the TDT setting for 2 cases:
a) subject pages (provide all info about a subject)
b) complete datasets
For a) the caching-object will be mostly empty if the limit is not set too low.
For b) it depends on the volume of the dataset, which varies widely. For b) with large volumes one could imagine writing the output to a temporary (compressed) file on disk and returning that.
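For the large-volume case, writing the output to a compressed temporary file could look roughly like the Python sketch below. It assumes the triples are already serialized one per line (N-Triples style strings made up for the example); `dump_triples_compressed` is a hypothetical helper, and cleaning up the file afterwards is left to the caller.

```python
import gzip
import tempfile


def dump_triples_compressed(lines):
    """Write serialized triples, one per line, to a gzip temp file; return its path."""
    tmp = tempfile.NamedTemporaryFile(suffix=".nt.gz", delete=False)
    tmp.close()  # only the reserved path is needed; gzip reopens the file below
    with gzip.open(tmp.name, "wt", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")
    return tmp.name


# Made-up serialized triples.
lines = [
    "<http://example.org/s1> <http://example.org/p> <http://example.org/o> .",
    "<http://example.org/s2> <http://example.org/p> <http://example.org/o> .",
]
path = dump_triples_compressed(lines)
print(path)
```

Because the lines are written as they are iterated, the full result never has to sit in memory at once if the caller passes a generator.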

@coreation
Member

L4.2 also has a swappable caching mechanism; file, memcached and a few others are supported out of the box, so I don't think storing the object in a file would be necessary.
