This repository has been archived by the owner on Apr 22, 2020. It is now read-only.

Jaccard Similarity doesn't work with concurrency #894

Open
JorenVdV opened this issue Apr 16, 2019 · 3 comments

JorenVdV commented Apr 16, 2019

Problem
When running the Jaccard similarity algorithm over a list of node and category entries, all similarities are 0 unless concurrency is explicitly set to 1.

Environment
Docker image running Neo4j 3.5.3 and Graph Algorithms 3.5.3.3.
Memory is limited to 16 GB; CPUs are not limited (the machine has 192 CPUs, shared with other processes).

Setup

MERGE (french:Cuisine {name:'French'})
MERGE (italian:Cuisine {name:'Italian'})
MERGE (indian:Cuisine {name:'Indian'})
MERGE (lebanese:Cuisine {name:'Lebanese'})
MERGE (portuguese:Cuisine {name:'Portuguese'})

MERGE (zhen:Person {name: "Zhen"})
MERGE (praveena:Person {name: "Praveena"})
MERGE (michael:Person {name: "Michael"})
MERGE (arya:Person {name: "Arya"})
MERGE (karin:Person {name: "Karin"})

MERGE (praveena)-[:LIKES]->(indian)
MERGE (praveena)-[:LIKES]->(portuguese)

MERGE (zhen)-[:LIKES]->(french)
MERGE (zhen)-[:LIKES]->(indian)

MERGE (michael)-[:LIKES]->(french)
MERGE (michael)-[:LIKES]->(italian)
MERGE (michael)-[:LIKES]->(indian)

MERGE (arya)-[:LIKES]->(lebanese)
MERGE (arya)-[:LIKES]->(italian)
MERGE (arya)-[:LIKES]->(portuguese)

MERGE (karin)-[:LIKES]->(lebanese)
MERGE (karin)-[:LIKES]->(italian)

Queries

MATCH (b:Person)-[v:LIKES]->(c:Cuisine)
WITH {item:id(b), categories: collect(id(c))} as vacatureData limit 50000
WITH collect(vacatureData) as data

CALL algo.similarity.jaccard(data, {concurrency:1, similarityCutoff:0.1})
YIELD nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95
RETURN nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95

results in

╒═══════╤═════════════════╤═══════════════════╤══════════════════╤═══════════════════╤══════════════════╤══════════════════╤═══════════════════╤══════════════════╤══════════════════╕
│"nodes"│"similarityPairs"│"min"              │"max"             │"mean"             │"p25"             │"p50"             │"p75"              │"p90"             │"p95"             │
╞═══════╪═════════════════╪═══════════════════╪══════════════════╪═══════════════════╪══════════════════╪══════════════════╪═══════════════════╪══════════════════╪══════════════════╡
│5      │7                │0.19999980926513672│0.6666669845581055│0.37380967821393696│0.2500009536743164│0.2500009536743164│0.33333301544189453│0.6666669845581055│0.6666669845581055│
└───────┴─────────────────┴───────────────────┴──────────────────┴───────────────────┴──────────────────┴──────────────────┴───────────────────┴──────────────────┴──────────────────┘
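As a cross-check that does not depend on the library, the expected values for the sample graph can be computed directly from the LIKES sets in the Setup section. The following is a minimal Python sketch; the names `likes` and `jaccard` are illustrative and not part of the Graph Algorithms library, and the `>=` comparison against the cutoff is an assumption (no pair in this data scores exactly 0.1, so the choice does not affect the result).

```python
from itertools import combinations

# LIKES relationships from the Setup section (person -> set of cuisines).
likes = {
    "Praveena": {"Indian", "Portuguese"},
    "Zhen": {"French", "Indian"},
    "Michael": {"French", "Italian", "Indian"},
    "Arya": {"Lebanese", "Italian", "Portuguese"},
    "Karin": {"Lebanese", "Italian"},
}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|, or 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


cutoff = 0.1  # same value as similarityCutoff in the queries

# Score every unordered pair of people once.
scores = {
    (p, q): jaccard(likes[p], likes[q])
    for p, q in combinations(likes, 2)
}

# Keep only pairs at or above the cutoff (assumed comparison, see lead-in).
kept = {pair: s for pair, s in scores.items() if s >= cutoff}

pair_count = len(kept)
lowest = min(kept.values())
highest = max(kept.values())
mean = sum(kept.values()) / pair_count

print(pair_count, lowest, highest, mean)
```

This yields 7 pairs with a minimum of 0.2, a maximum of 2/3 and a mean of about 0.3738, which agrees with the `concurrency:1` output above up to floating-point precision.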

Removing the concurrency setting:

MATCH (b:Person)-[v:LIKES]->(c:Cuisine)
WITH {item:id(b), categories: collect(id(c))} as vacatureData limit 50000
WITH collect(vacatureData) as data

CALL algo.similarity.jaccard(data,  {similarityCutoff:0.1})
YIELD nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95
RETURN nodes, similarityPairs, min, max, mean, p25, p50, p75, p90, p95

results in

╒═══════╤═════════════════╤═════╤═════╤══════╤═════╤═════╤═════╤═════╤═════╕
│"nodes"│"similarityPairs"│"min"│"max"│"mean"│"p25"│"p50"│"p75"│"p90"│"p95"│
╞═══════╪═════════════════╪═════╪═════╪══════╪═════╪═════╪═════╪═════╪═════╡
│5      │0                │0.0  │0.0  │0.0   │0.0  │0.0  │0.0  │0.0  │0.0  │
└───────┴─────────────────┴─────┴─────┴──────┴─────┴─────┴─────┴─────┴─────┘

Setting the concurrency to any number other than 1 produces the latter (all-zero) result.
The same behaviour is observed when running our 300k-node Jaccard computation.
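The expectation behind this report is that the result must not depend on the concurrency setting. A minimal Python sketch of that invariant follows, using a thread pool purely as a stand-in for parallel execution. It is not the library's implementation and does not diagnose the bug; `likes`, `score` and `similarities` are hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

# Same sample data as in the Setup section (person -> set of cuisines).
likes = {
    "Praveena": {"Indian", "Portuguese"},
    "Zhen": {"French", "Indian"},
    "Michael": {"French", "Italian", "Indian"},
    "Arya": {"Lebanese", "Italian", "Portuguese"},
    "Karin": {"Lebanese", "Italian"},
}


def score(pair):
    """Return (pair, Jaccard similarity) for one pair of people."""
    a, b = likes[pair[0]], likes[pair[1]]
    return pair, len(a & b) / len(a | b)


def similarities(workers, cutoff=0.1):
    """Score all pairs using `workers` threads and apply the cutoff."""
    pairs = list(combinations(likes, 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(score, pairs))
    return {pair: s for pair, s in results if s >= cutoff}


baseline = similarities(1)

# The worker count only changes how the work is split, never the answer.
for workers in (2, 4, 8):
    assert similarities(workers) == baseline
```

A check of this shape, run against the real procedure with different `concurrency` values, is what fails in the environment described above.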

@mneedham
Contributor

Hey,

I'll take a look at it. I've seen this happen sporadically, but I haven't been able to figure out exactly why, since, annoyingly, it doesn't happen every time.

For example, I just tested this on a Docker image and it gives the same results with concurrency 1 and concurrency > 1.

Cheers, Mark


d-kilc commented Mar 9, 2020

Any resolution on this? Still not able to use > 1 core with algo.similarity.jaccard. I'm running 3.5.8 EE.

@tomasonjo
Collaborator

Please check https://github.com/neo4j/graph-data-science. It has improved graph algorithms and is the successor to the Graph Algorithms library.
