Improved files parsing
KrainskiL committed Jun 16, 2021
1 parent f23d13a commit 420f12d
Showing 12 changed files with 1,103 additions and 86 deletions.
2 changes: 1 addition & 1 deletion Project.toml
@@ -1,7 +1,7 @@
name = "CGE"
uuid = "f7ff1d1e-e254-4b26-babe-fc3421add060"
authors = ["KrainskiL <[email protected]>"]
version = "1.2.1"
version = "1.2.2"

[deps]
DelimitedFiles = "8bb1440f-4735-579b-a4ab-409b98df4dab"
77 changes: 56 additions & 21 deletions README.md
@@ -18,7 +18,9 @@ Julia package to compare graph embeddings.

## Details of the framework

To be presented at WAW2020: https://math.ryerson.ca/waw2020 with publication in Springer LNCS.
Presented at [WAW2020](https://math.ryerson.ca/waw2020/) with publication in [Springer LNCS](https://www.springer.com/gp/book/9783030484774).

Detailed information can be found in the [paper](https://math.ryerson.ca/~pralat/papers/2020_WAW-Scalable_Embeddings.pdf).

Framework version without landmarks (written in C) is available under: https://github.com/ftheberge/Comparing_Graph_Embeddings

@@ -37,7 +39,7 @@ using CGE; cd(pwd, joinpath(dirname(pathof(CGE)), "..", "example"))
```
Make sure to copy the CLI file from this location (as it is read only).

Alternatively you can just download CGE_CLI.jl from GitHub repository. It is located in example/ folder
Alternatively you can just download CGE_CLI.jl from GitHub repository. It is located in `example/` folder.

Finally you might also download the whole repository and extract the CGE_CLI.jl file from it.
```shell
@@ -56,46 +58,46 @@ When comparing embeddings, lower divergence is better.
Format:

```
julia CGE_CLI.jl -g edgelist_file -e embedding_file [-c clusters_file] [-a -v] [-l landmarks -f forced -m method]
julia CGE_CLI.jl -g edgelist_file -e embedding_file [-c clusters_file] [-v] [-l landmarks -f forced -m method]
## required flags:
-g: the edgelist (1 per line, whitespace separated, optionally with weights)
-e: the embedding (two formats accepted, see details below)
## optional flags:
-c: the communities (in vertices order, 1 per line), if not given calculated using Louvain algorithm
-a: 'asis' flag, use if embedding is provided unordered with vertices in first column
-v: verbose, printing additional information
-l: number of landmarks to create
-f: number of forced landmarks to be created
-m: chosen landmark creation method: `rss`, `rss2`, `size`, `diameter`
```

For instance, while in `example` folder run
For instance, while in `example` folder run:

```shell
julia CGE_CLI.jl -g 100k.edgelist -c 100k.ecg -e 100k.embedding -l 200 -f 0 -m diameter
```
Result consists of 4 element:
Result consists of 4 elements:
1. Best alpha
2. Best divergence score
2. **Best divergence score**
3. Best divergence external score
4. Best divergence internal score
```
[0.25, 0.01483964683262605, 0.026810577776668364, 0.002868715888583737]
```
# File Formats

For a graph with n nodes, the nodes can be represented with numbers 1 to n or 0 to n-1.
For a graph with `n` nodes, the nodes can be represented with numbers 1 to n or 0 to n-1.

Two input files are required to run the algorithm:
1. the undirected graph, represented by a sequence of edges, 1 per line and with optional weights in third column
2. the node embedding in one of the supported formats (see below)

Three input files are required to run the algorithm:
1. the undirected graph, represented by a sequence of edges, 1 per line
2. a file with the node's cluster number, 1 per line, in numerical order of the nodes
3. the node embedding in the node2vec format (see below)
An additional file with each node's cluster number (1 per line) may be provided. If it is missing, communities are calculated automatically with the Louvain algorithm.

## Example of graph (edgelist) file

Nodes can be 0-based or 1-based
One edge per line with whitespace between nodes
Nodes can be 0-based or 1-based.
One edge per line with whitespace between nodes.

```
1 32
@@ -111,7 +113,7 @@ One edge per line with whitespace between nodes
...
```

Additional weights may be provided in third column
Additional weights may be provided in third column (both integers and floats are supported).

```
1 32 1.23
@@ -129,9 +131,10 @@ Additional weights may be provided in third column

## Example of clustering file

Clusters can be 0-based or 1-based
Clusters: one value per line in the numerical order of the nodes
If not provided, clusters will be automatically calculated with Louvain algorithm
Clusters can be 0-based or 1-based.
If not provided, clusters will be automatically calculated with Louvain algorithm.

First variant with cluster IDs only; rows must be ordered by node ID.

```
1
@@ -146,14 +149,30 @@ If not provided, clusters will be automatically calculated with Louvain algorith
1
...
```

Second variant with node IDs in the first column and cluster IDs in the second column; rows may appear in any order.

```
1 1
2 1
3 1
4 1
5 0
6 0
8 0
9 1
7 3
11 1
...
```
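
The two clustering variants can be reduced to a single ordered vector of cluster IDs. The snippet below is a minimal sketch of that idea, not the package's code: `normalize_communities` is a hypothetical helper, and it assumes the file has already been loaded into a numeric matrix with node IDs in the first column when two columns are present.

```julia
# Hypothetical helper (not part of CGE): turn a communities matrix into a
# 1-based vector of cluster ids ordered by node id.
function normalize_communities(comm::AbstractMatrix{<:Real})
    ncols = size(comm, 2)
    ncols in (1, 2) || throw(ArgumentError("expected 1 or 2 columns, got $ncols"))
    # Two columns: first is node id, second is cluster id; reorder rows by node id.
    ids = ncols == 2 ? comm[sortperm(comm[:, 1]), 2] : comm[:, 1]
    ids = Int.(ids)
    # Shift 0-based cluster ids so that the smallest id becomes 1.
    minimum(ids) == 0 && (ids = ids .+ 1)
    return ids
end

# Rows are deliberately out of node order (node 3 is listed before node 2).
two_col = [1 1; 3 0; 2 1; 4 0]
clusters = normalize_communities(two_col)
println(clusters)
```

Sorting by the first column makes the second variant equivalent to the first, so downstream code only needs to handle one layout.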
## Example of embedding file

Nodes are 0-based or 1-based in any order
Two formats are supported
Nodes are 0-based or 1-based in any order.
Three formats are supported

**First format - unordered embedding**

First column indicates node number, the rest of the line is d-dimensional embedding
First column indicates node number, the rest of the line is d-dimensional embedding.

```
21 0.960689 -2.28209 3.65194 0.272646 -3.01281 1.0245 -0.329389 -2.95956
@@ -169,6 +188,7 @@ First column indicates node number, the rest of the line is d-dimensional embedd
**Second format - ordered embedding**

Only d-dimensional embedding in order of nodes is stored in file.

```
0.854487 -2.30527 4.10575 0.370613 -3.04878 1.46481 -0.120326 -4.02328
0.960689 -2.28209 3.65194 0.272646 -3.01281 1.0245 -0.329389 -2.95956
@@ -179,3 +199,18 @@ Only d-dimensional embedding in order of nodes is stored in file.
0.750248 -2.26306 4.04495 0.143616 -3.02735 1.49937 -0.400896 -4.04177
...
```

**Third format - node2vec format**

First line contains number of nodes and dimension of the embedding. It's stripped during parsing and the rest of the file is handled as either first or second format.
```
500 8
21 0.960689 -2.28209 3.65194 0.272646 -3.01281 1.0245 -0.329389 -2.95956
33 0.702187 -2.14331 4.25541 0.372346 -3.16427 1.41296 -0.390471 -4.49782
3 0.854487 -2.30527 4.10575 0.370613 -3.04878 1.46481 -0.120326 -4.02328
29 0.673825 -2.19518 4.00447 0.650003 -2.74663 0.757385 -0.505723 -3.2947
32 0.750248 -2.26306 4.04495 0.143616 -3.02735 1.49937 -0.400896 -4.04177
25 0.831608 -2.191 4.04712 0.786012 -2.85804 1.11308 -0.391722 -3.4645
28 1.14632 -2.20708 4.11004 0.338067 -2.86409 1.01202 -0.485711 -3.50161
...
```
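
All three embedding layouts can be handled by one routine. The following is a rough sketch under stated assumptions rather than the parser used by the package: `parse_embedding` is a hypothetical name, a node2vec header is assumed to be a two-field first line that is shorter than the data rows, and a first column consisting only of whole numbers is assumed to hold node IDs.

```julia
# Hypothetical helper (not part of CGE): parse embedding text in any of the
# three layouts described above and return a matrix with one row per node.
function parse_embedding(text::AbstractString, n_nodes::Integer)
    rows = [split(line) for line in split(text, '\n') if !isempty(strip(line))]
    # Assumption: a node2vec header is a two-field first line ("n d") that is
    # shorter than the data row following it.
    if length(rows) > 1 && length(rows[1]) == 2 && length(rows[2]) > 2
        rows = rows[2:end]
    end
    length(rows) == n_nodes ||
        throw(ArgumentError("expected $n_nodes rows, got $(length(rows))"))
    data = [parse(Float64, rows[i][j]) for i in eachindex(rows), j in eachindex(rows[1])]
    # Assumption: a first column made only of whole numbers holds node ids,
    # so sort the rows by it and drop the column.
    if all(isinteger, data[:, 1])
        data = data[sortperm(data[:, 1]), 2:end]
    end
    return data
end

node2vec_text = """
3 2
2 0.5 -1.0
3 0.25 4.0
1 1.5 2.0
"""
emb = parse_embedding(node2vec_text, 3)
println(size(emb))
```

The whole-number check is a heuristic: an ordered embedding whose first coordinate happens to be integral for every node would be misread as carrying node IDs, so an explicit flag may be safer for such inputs.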
2 changes: 1 addition & 1 deletion example/CGE_CLI.jl
@@ -1,6 +1,6 @@
using CGE

edges, weights, vweights, comm, clusters, embed, asis, verbose, land, forced, method = parseargs()
edges, weights, vweights, comm, clusters, embed, verbose, land, forced, method = parseargs()
distances = zeros(length(vweights))
if land != -1
distances, embed, comm, edges, weights = landmarks(edges, weights, vweights, clusters, comm,
120 changes: 66 additions & 54 deletions src/auxilary.jl
@@ -64,61 +64,54 @@ function parseargs()
"size" => split_cluster_size,
"diameter" => split_cluster_diameter)
try
# Optional arguments
##Flags
asis = !isnothing(findfirst(==("-a"),ARGS)) ? true : false
# Check if calculations should be verbose
verbose = !isnothing(findfirst(==("-v"),ARGS)) ? true : false
verbose = !isnothing(findfirst(==("-v"),ARGS))

# Check for required arguments: -g graph_edgelist -e embedding
@assert length(ARGS) >= 4 "Graph edgelist and embedding files are required"

# Load obligatory files
################
## Graph edges #
################
#############
## Edgelist #
#############

idx = findfirst(==("-g"),ARGS)
@assert !isnothing(idx) "Edges list file is required"
@assert !isnothing(idx) "Edgelist file is required"
fn_edges = ARGS[idx+1]

# read edges
# Read edges
edges = readdlm(fn_edges, Float64)
rows, no_cols = size(edges)
verbose && println("$no_cols columns in graph edgelist file")
verbose && println("$no_cols columns and $rows rows in edgelist file.")

# Validate file structure
@assert no_cols==2 || no_cols==3 "Expected 2 or 3 columns"
@assert no_cols == 2 || no_cols == 3 "Expected 2 or 3 columns in edgelist file"
v_min = minimum(edges[:,1:2])
@assert v_min==0 || v_min==1 "Vertices should be either 0-based or 1-based"
@assert v_min == 0 || v_min == 1 "Vertices should be either 0-based or 1-based"

# make vertices 1-based
# Make vertices 1-based
if v_min == 0.0
edges[:,1:2] .+= 1.0
end
no_vertices = Int(maximum(edges[:,1:2]))
verbose && println("Vertices from $v_min to $no_vertices")
verbose && println("Graph contains $no_vertices vertices")

# if unweighted, add unit weights
# compute vertex weights
# If graph is unweighted, add unit weights
# Compute vertices weights
if no_cols == 2
edges = convert.(Int,edges)
eweights = ones(rows)
vweight = zeros(no_vertices)
for i in 1:rows
vweight[edges[i,1]]+=1.0
vweight[edges[i,2]]+=1.0
vweight[edges[i,1]] += 1.0
vweight[edges[i,2]] += 1.0
end
else
eweights = edges[:,3]
edges = convert.(Int,edges[:,1:2])
vweight = zeros(no_vertices)
for i in 1:rows
vweight[edges[i,1]]+=eweights[i]
vweight[edges[i,2]]+=eweights[i]
vweight[edges[i,1]] += eweights[i]
vweight[edges[i,2]] += eweights[i]
end
end
verbose && println("Done preparing edgelist and vertex weight")
verbose && println("Done preparing edgelist and vertices weights")

################
## Communities #
@@ -136,16 +129,21 @@ function parseargs()
end
comm_rows, no_cols = size(comm)
# Validate file structure
@assert no_cols==1 "Expected 1 column"
v_min = minimum(comm[:,1])
@assert v_min==0 || v_min==1 "Communities should be either 0-based or 1-based"
c_min=minimum(comm)
@assert comm_rows == no_vertices "No. communities ($comm_rows) differ from no. nodes ($no_vertices)"
@assert no_cols == 1 || no_cols == 2 "Expected 1 or 2 columns in communities file, but encountered $no_cols."
# 2 columns file - sort by first column and extract only second column
if no_cols == 2
comm = comm[sortperm(comm[:,1]),2]
comm = reshape(comm,size(comm)[1],1)
end
c_min = minimum(comm)
@assert c_min == 0 || c_min == 1 "Communities should be either 0-based or 1-based, but are $c_min based."

# make communities 1-based
# Make communities 1-based
if c_min == 0
comm[:,1] .+=1
comm .+=1
end
verbose && println("Done preparing communities")
verbose && println("Done preparing communities.")

##############
## Embedding #
@@ -156,52 +154,66 @@ function parseargs()
fn_embed = ARGS[idx+1]

# Read embedding
embedding = readdlm(fn_embed,Float64)

embedding = []
try
embedding = readdlm(fn_embed,Float64)
catch
verbose && println("Embedding in node2vec format. Loading without first line.")
embedding = readdlm(fn_embed,Float64,skipstart = 1)
end
# Validate file
@assert comm_rows == size(embedding, 1) "No. rows in embedding and communities files differ"

# if embedding contains index in first column, sort by it and remove column
if asis == 0
embedding = embedding[sortperm(embedding[:,1]),2:end]
@assert no_vertices == size(embedding, 1) "No. rows in embedding and no. vertices in a graph differ."

# If embedding contains indices in first column, sort by it and remove the column
try
order = convert.(Int,embedding[:,1])
verbose && println("Sorting embedding by first column")
embedding = embedding[sortperm(order),2:end]
catch

end
verbose && println("done preparing embedding")
verbose && println("Done preparing embedding.")

#############
##Landmarks #
#############

# Transform communities
clusters = Dict{Any, Vector{Int}}()
for (i, c) in enumerate(comm[:,1])
if haskey(clusters, c)
push!(clusters[c], i)
else
clusters[c] = [i]
end
end
clusters = collect(values(clusters))

idx = findfirst(==("-l"),ARGS)
landmarks = !isnothing(idx) ? parse(Int, ARGS[idx+1]) : -1
# Provide clusters membership only if generating landmarks
if landmarks != -1
for (i, c) in enumerate(comm[:,1])
if haskey(clusters, c)
push!(clusters[c], i)
else
clusters[c] = [i]
end
end
clusters = collect(values(clusters))
end

idx = findfirst(==("-f"),ARGS)
forced = !isnothing(idx) ? parse(Int, ARGS[idx+1]) : -1

idx = findfirst(==("-m"),ARGS)
method_str = !isnothing(idx) ? lowercase(strip(ARGS[idx+1])) : "rss"

method = methods[method_str]
return edges, eweights, vweight, comm, clusters, embedding, asis, verbose, landmarks, forced, method
return edges, eweights, vweight, comm, clusters, embedding, verbose, landmarks, forced, method
catch e
showerror(stderr, e)
println("\n\nUsage:")
println("\tjulia CGE.jl -g graph_edgelist -e embedding [-c communities] [-v] [-l landmarks -f forced -m method]")
println("\nParameters:")
println("graph_edgelist: rows should contain two vertices ids (edge) and optional weights")
println("communities: rows should contain cluster identifiers of consecutive vertices")
println("graph_edgelist: rows should contain two vertices ids (edge) and optional weights in third column")
println("communities: rows should contain cluster identifiers of consecutive vertices with optional node ids in first column")
println("if no file is given communities are calculated with Louvain algorithm")
println("embedding: rows should contain whitespace separated locations of vertices in embedding")
println("-a: flag for sorting embedding")
println("embedding: rows should contain whitespace separated embedding values with optional node ids in first column")
println("-v: flag for debugging messages")
println("landmarks: required number of landmarks")
println("landmarks: number of landmarks")
println("forced: required maximum number of forced splits of a cluster")
println("method: one of:")
println("\t* rss: minimize maximum residual sum of squares when doing a cluster split")