Efficient node selection based on node IDs #1394
-
Is there a way to efficiently select nodes by their IDs and use them in a query in place of the whole node set? I have a node table (inproceeding) with over 2.5 million instances, and each node has a relation to its proceedings and to the conference it took place at. A MATCH query that follows these relations for a single record id works well, but I want that information not just for one id but for up to 80'000 record ids. I could loop over the 80'000 ids and run each query individually, but that takes very long. I also tried UNWIND (UNWIND [list of record ids] AS rec_ids), but that does not work: I get a buffer manager exception because it runs out of memory.

From a logical perspective the query seems very simple: I know exactly which inproceeding nodes I am interested in and just need to follow their relations, so it should be possible to get the results fast, but I haven't found a way that works. Any help would be highly appreciated.
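For reference, a minimal sketch of the kind of single-id query I mean (the rec_id property and the relationship names are placeholders for my actual schema):

```cypher
// Look up one inproceeding by its record id and follow its relations
// to the proceedings it is part of and the conference it took place at.
MATCH (i:inproceeding)-[:isPartOf]->(p:proceeding)-[:heldAt]->(c:conference)
WHERE i.rec_id = 12345
RETURN i, p, c;
```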
-
Hi,
The most common way to select nodes from a large set of IDs is to use the list_contains function, e.g.:
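A minimal sketch, assuming the node table is inproceeding with a rec_id property and placeholder relationship names (adjust to your schema):

```cypher
// Keep only the inproceeding nodes whose rec_id appears in the literal list,
// then follow the relations to the proceedings and the conference.
MATCH (i:inproceeding)-[:isPartOf]->(p:proceeding)-[:heldAt]->(c:conference)
WHERE list_contains([12345, 12346, 12347], i.rec_id)
RETURN i, p, c;
```

With your workload, the literal list would hold one batch of record ids per query (see the size limitation below).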
In our release v0.0.2, we have a limitation that a list literal cannot exceed 4KB, so you might need to chunk your 80,000 ids into batches of 200 ids and run multiple queries. Apologies for this constraint; we will fix it very soon.
Best,
Xiyang
P.S. UNWIND is also an alternative. It runs out of memory because we pick a bad plan; I'll fix that too. But the most performant way should be list_contains, because running a filter is preferred (from a performance perspective) over running a join.
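For reference, the UNWIND variant would look roughly like this (same placeholder schema as the sketch above):

```cypher
// UNWIND turns the id list into rows, which are then joined against
// the inproceeding table; today this can pick a memory-hungry plan.
UNWIND [12345, 12346, 12347] AS rec_id
MATCH (i:inproceeding)-[:isPartOf]->(p:proceeding)-[:heldAt]->(c:conference)
WHERE i.rec_id = rec_id
RETURN i, p, c;
```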