Jaccard Similarity doesn't work with concurrency #894

JorenVdV · 2019-04-16T13:27:21Z

Problem
When running the Jaccard similarity algorithm over a list of node and category entries, all similarities are 0 unless the concurrency limit is set to 1.

Environment
Docker image running Neo4j 3.5.3 and graph algorithms 3.5.3.3.
Memory is limited to 16G, CPUs are unbound (192 CPUs in the machine, shared with other processes).

Setup

MERGE (french:Cuisine {name:'French'})
MERGE (italian:Cuisine {name:'Italian'})
MERGE (indian:Cuisine {name:'Indian'})
MERGE (lebanese:Cuisine {name:'Lebanese'})
MERGE (portuguese:Cuisine {name:'Portuguese'})

MERGE (zhen:Person {name: "Zhen"})
MERGE (praveena:Person {name: "Praveena"})
MERGE (michael:Person {name: "Michael"})
MERGE (arya:Person {name: "Arya"})
MERGE (karin:Person {name: "Karin"})

MERGE (praveena)-[:LIKES]->(indian)
MERGE (praveena)-[:LIKES]->(portuguese)

MERGE (zhen)-[:LIKES]->(french)
MERGE (zhen)-[:LIKES]->(indian)

MERGE (michael)-[:LIKES]->(french)
MERGE (michael)-[:LIKES]->(italian)
MERGE (michael)-[:LIKES]->(indian)

MERGE (arya)-[:LIKES]->(lebanese)
MERGE (arya)-[:LIKES]->(italian)
MERGE (arya)-[:LIKES]->(portuguese)

MERGE (karin)-[:LIKES]->(lebanese)
MERGE (karin)-[:LIKES]->(italian)

Queries

MATCH (b:Person)-[v:LIKES]->(c:Cuisine)
WITH {item:id(b), categories: collect(id(c))} as vacatureData limit 50000
WITH collect(vacatureData) as data

CALL algo.similarity.jaccard(data, {concurrency:1, similarityCutoff:0.1})
YIELD nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95
RETURN nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95

results in

╒═══════╤═════════════════╤═══════════════════╤══════════════════╤═══════════════════╤══════════════════╤══════════════════╤═══════════════════╤══════════════════╤══════════════════╕
│"nodes"│"similarityPairs"│"min"              │"max"             │"mean"             │"p25"             │"p50"             │"p75"              │"p90"             │"p95"             │
╞═══════╪═════════════════╪═══════════════════╪══════════════════╪═══════════════════╪══════════════════╪══════════════════╪═══════════════════╪══════════════════╪══════════════════╡
│5      │7                │0.19999980926513672│0.6666669845581055│0.37380967821393696│0.2500009536743164│0.2500009536743164│0.33333301544189453│0.6666669845581055│0.6666669845581055│
└───────┴─────────────────┴───────────────────┴──────────────────┴───────────────────┴──────────────────┴──────────────────┴───────────────────┴──────────────────┴──────────────────┘

Removing the concurrency limit:

MATCH (b:Person)-[v:LIKES]->(c:Cuisine)
WITH {item:id(b), categories: collect(id(c))} as vacatureData limit 50000
WITH collect(vacatureData) as data

CALL algo.similarity.jaccard(data,  {similarityCutoff:0.1})
YIELD nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95
RETURN nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95

results in

╒═══════╤═════════════════╤═════╤═════╤══════╤═════╤═════╤═════╤═════╤═════╕
│"nodes"│"similarityPairs"│"min"│"max"│"mean"│"p25"│"p50"│"p75"│"p90"│"p95"│
╞═══════╪═════════════════╪═════╪═════╪══════╪═════╪═════╪═════╪═════╪═════╡
│5      │0                │0.0  │0.0  │0.0   │0.0  │0.0  │0.0  │0.0  │0.0  │
└───────┴─────────────────┴─────┴─────┴──────┴─────┴─────┴─────┴─────┴─────┘

Setting the concurrency to any value other than 1 produces the latter result.
We observe the same behaviour in our Jaccard computation over 300k nodes.
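
As a sanity check outside Neo4j, the expected similarities for the sample data can be recomputed by hand. The sketch below is plain Python, not the library's implementation: the `likes` dictionary mirrors the Setup section, while `jaccard` and `similar_pairs` are hypothetical helper names, and treating the cutoff as inclusive (`>=`) is an assumption (no pair in this data set sits exactly on 0.1, so it does not affect the result).

```python
from itertools import combinations

# Same LIKES relationships as in the Setup section.
likes = {
    "Praveena": {"Indian", "Portuguese"},
    "Zhen": {"French", "Indian"},
    "Michael": {"French", "Italian", "Indian"},
    "Arya": {"Lebanese", "Italian", "Portuguese"},
    "Karin": {"Lebanese", "Italian"},
}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_pairs(data, cutoff):
    """Return {(name1, name2): score} for each unordered pair with score >= cutoff."""
    result = {}
    for (n1, s1), (n2, s2) in combinations(data.items(), 2):
        score = jaccard(s1, s2)
        if score >= cutoff:  # assumption: inclusive cutoff
            result[(n1, n2)] = score
    return result


pairs = similar_pairs(likes, cutoff=0.1)
scores = sorted(pairs.values())
mean = sum(scores) / len(scores)
print(len(pairs), min(scores), max(scores), mean)
```

This gives 7 pairs with a minimum of 0.2, a maximum of 2/3 and a mean of roughly 0.3738, which agrees (up to floating-point precision) with the concurrency:1 output above, so the all-zero result without the limit is the incorrect one.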


mneedham · 2019-04-17T14:41:48Z

Hey,

I'll take a look at it. I've seen this happen sporadically, but annoyingly I haven't been able to figure out exactly why, as it doesn't happen every time.

For example, I just tested this on a Docker image and it gave the same results with concurrency 1 and concurrency > 1.

Cheers, Mark

d-kilc · 2020-03-09T22:16:58Z

Any resolution on this? Still not able to use > 1 core with algo.similarity.jaccard. I'm running 3.5.8 EE.

tomasonjo · 2020-03-10T08:18:30Z

Please check https://github.com/neo4j/graph-data-science. It is the successor to the graph algorithms library and has improved graph algorithms.
