Improved files parsing

KrainskiL · Jun 16, 2021 · 420f12d
1 parent f23d13a
commit 420f12d

Showing 12 changed files with 1,103 additions and 86 deletions.
diff --git a/Project.toml b/Project.toml
@@ -1,7 +1,7 @@
 name = "CGE"
 uuid = "f7ff1d1e-e254-4b26-babe-fc3421add060"
 authors = ["KrainskiL <[email protected]>"]
-version = "1.2.1"
+version = "1.2.2"
 
 [deps]
 DelimitedFiles = "8bb1440f-4735-579b-a4ab-409b98df4dab"

diff --git a/README.md b/README.md
@@ -18,7 +18,9 @@ Julia package to compare graph embeddings.
 
 ## Details of the framework
 
-To be presented at WAW2020: https://math.ryerson.ca/waw2020 with publication in Springer LNCS.
+Presented at [WAW2020](https://math.ryerson.ca/waw2020/) with publication in [Springer LNCS](https://www.springer.com/gp/book/9783030484774).
+
+Detailed information can be found in the [paper](https://math.ryerson.ca/~pralat/papers/2020_WAW-Scalable_Embeddings.pdf).
 
 Framework version without landmarks (written in C) is available under: https://github.com/ftheberge/Comparing_Graph_Embeddings
 
@@ -37,7 +39,7 @@ using CGE; cd(pwd, joinpath(dirname(pathof(CGE)), "..", "example"))
 ```
 Make sure to copy the CLI file from this location (as it is read only).
 
-Alternatively you can just download CGE_CLI.jl from GitHub repository. It is located in example/ folder
+Alternatively you can just download CGE_CLI.jl from GitHub repository. It is located in `example/` folder.
 
 Finally you might also download the whole repository and extract the CGE_CLI.jl file from it.
 ```shell
@@ -56,46 +58,46 @@ When comparing embeddings, lower divergence is better.
 Format:
 
 ```
-julia CGE_CLI.jl -g edgelist_file -e embedding_file [-c clusters_file] [-a -v] [-l landmarks -f forced -m method]
+julia CGE_CLI.jl -g edgelist_file -e embedding_file [-c clusters_file] [-v] [-l landmarks -f forced -m method]
 
 ## required flags:
 -g: the edgelist (1 per line, whitespace separated, optionally with weights)
 -e: the embedding (two formats accepted, see details below)
 ## optional flags:
 -c: the communities (in vertices order, 1 per line), if not given calculated using Louvain algorithm
--a: 'asis' flag, use if embedding is provided unordered with vertices in first column
 -v: verbose, printing additional information
 -l: number of landmarks to create
 -f: number of forced landmarks to be created
 -m: chosen landmark creation method: `rss`, `rss2`, `size`, `diameter`
 ```
 
-For instance, while in `example` folder run
+For instance, while in `example` folder run:
 
 ```julia
 julia CGE_CLI.jl -g 100k.edgelist -c 100k.ecg -e 100k.embedding -l 200 -f 0 -m diameter
 ```
-Result consists of 4 element:
+Result consists of 4 elements:
 1. Best alpha
-2. Best divergence score
+2. **Best divergence score**
 3. Best divergence external score
 4. Best divergence internal score
 ```
 [0.25, 0.01483964683262605, 0.026810577776668364, 0.002868715888583737]
 ```
 # File Formats
 
-For a graph with n nodes, the nodes can be represented with numbers 1 to n or 0 to n-1.
+For a graph with `n` nodes, the nodes can be represented with numbers 1 to n or 0 to n-1.
+
+Two input files are required to run the algorithm:
+1. the undirected graph, represented by a sequence of edges, 1 per line and with optional weights in third column
+2. the node embedding in one of the supported formats (see below)
 
-Three input files are required to run the algorithm:
-1. the undirected graph, represented by a sequence of edges, 1 per line
-2. a file with the node's cluster number, 1 per line, in numerical order of the nodes
-3. the node embedding in the node2vec format (see below)
+An additional file with each node's cluster number (1 per line) may be provided. If it is missing, communities are calculated automatically with the Louvain algorithm.
 
 ## Example of graph (edgelist) file
 
-Nodes can be 0-based or 1-based
-One edge per line with whitespace between nodes
+Nodes can be 0-based or 1-based.
+One edge per line with whitespace between nodes.
 
 ```
 1 32
@@ -111,7 +113,7 @@ One edge per line with whitespace between nodes
 ...
 ```
 
-Additional weights may be provided in third column
+Additional weights may be provided in third column (both integers and floats are supported).
 
 ```
 1 32 1.23
@@ -129,9 +131,10 @@ Additional weights may be provided in third column
 
 ## Example of clustering file
 
-Clusters can be 0-based or 1-based
-Clusters: one value per line in the numerical order of the nodes
-If not provided, clusters will be automatically calculated with Louvain algorithm
+Clusters can be 0-based or 1-based.
+If not provided, clusters will be automatically calculated with Louvain algorithm.
+
+First variant: cluster IDs only, one per line, ordered by node ID.
 
 ```
 1
@@ -146,14 +149,30 @@ If not provided, clusters will be automatically calculated with Louvain algorith
 1
 ...
 ```
+
+Second variant: node IDs in the first column and cluster IDs in the second column, in any order.
+
+```
+1 1
+2 1
+3 1
+4 1
+5 0
+6 0
+8 0 
+9 1
+7 3
+11 1
+...
+```
 ## Example of embedding file
 
-Nodes are 0-based or 1-based in any order
-Two formats are supported
+Nodes are 0-based or 1-based in any order.
+Three formats are supported.
 
 **First format - unordered embedding**
 
-First column indicates node number, the rest of the line is d-dimensional embedding
+First column indicates node number, the rest of the line is d-dimensional embedding.
 
 ```
 21 0.960689 -2.28209 3.65194 0.272646 -3.01281 1.0245 -0.329389 -2.95956
@@ -169,6 +188,7 @@ First column indicates node number, the rest of the line is d-dimensional embedd
 **Second format - ordered embedding**
 
 Only d-dimensional embedding in order of nodes is stored in file.
+
 ```
 0.854487 -2.30527 4.10575 0.370613 -3.04878 1.46481 -0.120326 -4.02328
 0.960689 -2.28209 3.65194 0.272646 -3.01281 1.0245 -0.329389 -2.95956
@@ -179,3 +199,18 @@ Only d-dimensional embedding in order of nodes is stored in file.
 0.750248 -2.26306 4.04495 0.143616 -3.02735 1.49937 -0.400896 -4.04177
 ...
 ```
+
+**Third format - node2vec format**
+
+First line contains number of nodes and dimension of the embedding. It's stripped during parsing and the rest of the file is handled as either first or second format.
+```
+500 8
+21 0.960689 -2.28209 3.65194 0.272646 -3.01281 1.0245 -0.329389 -2.95956
+33 0.702187 -2.14331 4.25541 0.372346 -3.16427 1.41296 -0.390471 -4.49782
+3 0.854487 -2.30527 4.10575 0.370613 -3.04878 1.46481 -0.120326 -4.02328
+29 0.673825 -2.19518 4.00447 0.650003 -2.74663 0.757385 -0.505723 -3.2947
+32 0.750248 -2.26306 4.04495 0.143616 -3.02735 1.49937 -0.400896 -4.04177
+25 0.831608 -2.191 4.04712 0.786012 -2.85804 1.11308 -0.391722 -3.4645
+28 1.14632 -2.20708 4.11004 0.338067 -2.86409 1.01202 -0.485711 -3.50161
+...
+```
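The three embedding layouts described in the README changes above can be reduced to one ordered matrix. The following is a minimal sketch of that idea, not the code from this commit: `normalize_embedding` is a hypothetical helper, it parses text by hand instead of calling `readdlm`, and it assumes the first column holds node IDs only when that column is a permutation of `0:n-1` or `1:n`. A one-dimensional unordered embedding with a node2vec header would be ambiguous under this heuristic.

```julia
# Hypothetical helper, not part of CGE: turn any of the three layouts into a
# matrix whose i-th row is the embedding of the i-th node.
function normalize_embedding(text::AbstractString, no_vertices::Int)
    # One vector of floats per non-empty line
    rows = [parse.(Float64, split(line)) for line in split(text, '\n') if !isempty(strip(line))]

    # node2vec header: a two-field first line followed by wider rows
    if length(rows) > 1 && length(rows[1]) == 2 && length(rows[2]) > 2
        popfirst!(rows)
    end
    @assert length(rows) == no_vertices "No. rows in embedding and no. vertices differ"

    data = permutedims(reduce(hcat, rows))   # one row per node
    ids = data[:, 1]

    # Assumption: the first column is an index only if it enumerates all nodes
    if all(isinteger, ids)
        sorted = sort(Int.(ids))
        if sorted == collect(0:no_vertices-1) || sorted == collect(1:no_vertices)
            return data[sortperm(ids), 2:end]
        end
    end
    return data
end

# node2vec-style input: header line, then unordered rows with node IDs
emb = normalize_embedding("3 2\n2 0.5 0.6\n1 0.1 0.2\n3 0.9 1.0\n", 3)
```

Checking that the first column is a full permutation of the node IDs is stricter than testing whether it converts to integers, which avoids dropping a genuine embedding dimension that happens to contain whole numbers.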
diff --git a/example/CGE_CLI.jl b/example/CGE_CLI.jl
@@ -1,6 +1,6 @@
 using CGE
 
-edges, weights, vweights, comm, clusters, embed, asis, verbose, land, forced, method = parseargs()
+edges, weights, vweights, comm, clusters, embed, verbose, land, forced, method = parseargs()
 distances = zeros(length(vweights))
 if land != -1
     distances, embed, comm, edges, weights  = landmarks(edges, weights, vweights, clusters, comm,

diff --git a/src/auxilary.jl b/src/auxilary.jl
@@ -64,61 +64,54 @@ function parseargs()
                    "size" => split_cluster_size,
                    "diameter" => split_cluster_diameter)
     try
-        # Optional arguments
-        ##Flags
-        asis = !isnothing(findfirst(==("-a"),ARGS)) ? true : false
         # Check if calculations should be verbose
-        verbose = !isnothing(findfirst(==("-v"),ARGS)) ? true : false
+        verbose = !isnothing(findfirst(==("-v"),ARGS))
 
-        # Check for required arguments: -g graph_edgelist -e embedding
-        @assert length(ARGS) >= 4 "Graph edgelist and embedding files are required"
-
-        # Load obligatory files
-        ################
-        ## Graph edges #
-        ################
+        #############
+        ## Edgelist #
+        #############
 
         idx = findfirst(==("-g"),ARGS)
-        @assert !isnothing(idx) "Edges list file is required"
+        @assert !isnothing(idx) "Edgelist file is required"
         fn_edges = ARGS[idx+1]
 
-        # read edges
+        # Read edges
         edges = readdlm(fn_edges, Float64)
         rows, no_cols = size(edges)
-        verbose && println("$no_cols columns in graph edgelist file")
+        verbose && println("$no_cols columns and $rows rows in edgelist file.")
 
         # Validate file structure
-        @assert no_cols==2 || no_cols==3 "Expected 2 or 3 columns"
+        @assert no_cols == 2 || no_cols == 3 "Expected 2 or 3 columns in edgelist file"
         v_min = minimum(edges[:,1:2])
-        @assert v_min==0 || v_min==1 "Vertices should be either 0-based or 1-based"
+        @assert v_min == 0 || v_min == 1 "Vertices should be either 0-based or 1-based"
 
-        # make vertices 1-based
+        # Make vertices 1-based
         if v_min == 0.0
             edges[:,1:2] .+= 1.0
         end
         no_vertices = Int(maximum(edges[:,1:2]))
-        verbose && println("Vertices from $v_min to $no_vertices")
+        verbose && println("Graph contains $no_vertices vertices")
 
-        # if unweighted, add unit weights
-        # compute vertex weights
+        # If graph is unweighted, add unit weights
+        # Compute vertices weights
         if no_cols == 2
             edges = convert.(Int,edges)
             eweights = ones(rows)
             vweight = zeros(no_vertices)
             for i in 1:rows
-                vweight[edges[i,1]]+=1.0
-                vweight[edges[i,2]]+=1.0
+                vweight[edges[i,1]] += 1.0
+                vweight[edges[i,2]] += 1.0
             end
         else
             eweights = edges[:,3]
             edges = convert.(Int,edges[:,1:2])
             vweight = zeros(no_vertices)
             for i in 1:rows
-                vweight[edges[i,1]]+=eweights[i]
-                vweight[edges[i,2]]+=eweights[i]
+                vweight[edges[i,1]] += eweights[i]
+                vweight[edges[i,2]] += eweights[i]
             end
         end
-        verbose && println("Done preparing edgelist and vertex weight")
+        verbose && println("Done preparing edgelist and vertices weights")
 
         ################
         ## Communities #
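The vertex-weight computation in the hunk above is a weighted degree: every edge adds its weight to both endpoints, and an unweighted graph uses unit weights. A minimal sketch of that computation follows; `vertex_weights` is a hypothetical name, not a function from CGE, and it assumes vertices are already 1-based.

```julia
# Hypothetical helper, not part of CGE: weighted degree of every vertex.
# Assumes `edges` is an n×2 matrix of 1-based vertex IDs.
function vertex_weights(edges::Matrix{Int}, eweights::Vector{Float64})
    no_vertices = maximum(edges)
    vweight = zeros(no_vertices)
    for (i, w) in enumerate(eweights)
        vweight[edges[i, 1]] += w   # edge i contributes to its first endpoint
        vweight[edges[i, 2]] += w   # and to its second endpoint
    end
    return vweight
end

edges = [1 2; 2 3; 1 3]
unit = vertex_weights(edges, ones(3))               # unweighted graph: plain degrees
weighted = vertex_weights(edges, [1.0, 2.5, 0.5])   # weights from a third column
```

Passing `ones(rows)` for the unweighted case lets a single loop serve both branches.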
@@ -136,16 +129,21 @@ function parseargs()
         end
         comm_rows, no_cols = size(comm)
         # Validate file structure
-        @assert no_cols==1 "Expected 1 column"
-        v_min = minimum(comm[:,1])
-        @assert v_min==0 || v_min==1 "Communities should be either 0-based or 1-based"
-        c_min=minimum(comm)
+        @assert comm_rows == no_vertices "No. communities ($comm_rows) differ from no. nodes ($no_vertices)"
+        @assert no_cols == 1 || no_cols == 2 "Expected 1 or 2 columns in communities file, but encountered $no_cols."
+        # 2 columns file - sort by first column and extract only second column
+        if no_cols == 2
+            comm = comm[sortperm(comm[:,1]),2]
+            comm = reshape(comm,size(comm)[1],1)
+        end
+        c_min = minimum(comm)
+        @assert c_min == 0 || c_min == 1 "Communities should be either 0-based or 1-based, but are $c_min based."
 
-        # make communities 1-based
+        # Make communities 1-based
         if c_min == 0
-            comm[:,1] .+=1
+            comm .+=1
         end
-        verbose && println("Done preparing communities")
+        verbose && println("Done preparing communities.")
 
         ##############
         ## Embedding #
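The community handling in the hunk above accepts either a single column of cluster IDs or a node ID column followed by a cluster ID column, and shifts 0-based labels to 1-based. A minimal sketch of that normalization, under the assumption that the input is already a numeric matrix; `normalize_communities` is a hypothetical helper and not part of CGE.

```julia
# Hypothetical helper, not part of CGE: return 1-based cluster labels in node order.
function normalize_communities(comm::AbstractMatrix{<:Real})
    no_cols = size(comm, 2)
    @assert no_cols == 1 || no_cols == 2 "Expected 1 or 2 columns in communities file"

    # Two columns: sort rows by node ID (first column), keep the cluster ID (second column)
    labels = no_cols == 2 ? comm[sortperm(comm[:, 1]), 2] : comm[:, 1]
    labels = Int.(labels)

    c_min = minimum(labels)
    @assert c_min == 0 || c_min == 1 "Communities should be either 0-based or 1-based"
    return c_min == 0 ? labels .+ 1 : labels
end

two_col = normalize_communities([1 1; 3 0; 2 1])            # unordered, 0-based labels
one_col = normalize_communities(reshape([1, 2, 1], 3, 1))   # ordered, 1-based labels
```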
@@ -156,52 +154,66 @@ function parseargs()
         fn_embed = ARGS[idx+1]
 
         # Read embedding
-        embedding = readdlm(fn_embed,Float64)
-
+        embedding = []
+        try
+            embedding = readdlm(fn_embed,Float64)
+        catch 
+            verbose && println("Embedding in node2vec format. Loading without first line.")
+            embedding = readdlm(fn_embed,Float64,skipstart = 1)
+        end
         # Validate file
-        @assert comm_rows == size(embedding, 1) "No. rows in embedding and communities files differ"
-
-        # if embedding contains index in first column, sort by it and remove column
-        if asis == 0
-            embedding = embedding[sortperm(embedding[:,1]),2:end]
+        @assert no_vertices == size(embedding, 1) "No. rows in embedding and no. vertices in a graph differ."
+
+        # If embedding contains indices in first column, sort by it and remove the column
+        try
+            order = convert.(Int,embedding[:,1])
+            verbose && println("Sorting embedding by first column")
+            embedding = embedding[sortperm(order),2:end]
+        catch
+            # First column is not integer-valued, so the embedding is assumed to be already in node order
         end
-        verbose && println("done preparing embedding")
+        verbose && println("Done preparing embedding.")
 
         #############
         ##Landmarks #
         #############
 
         # Transform communities
         clusters = Dict{Any, Vector{Int}}()
-        for (i, c) in enumerate(comm[:,1])
-            if haskey(clusters, c)
-                push!(clusters[c], i)
-            else
-                clusters[c] = [i]
-            end
-        end
-        clusters = collect(values(clusters))
 
         idx = findfirst(==("-l"),ARGS)
         landmarks = !isnothing(idx) ? parse(Int, ARGS[idx+1]) : -1
+        # Provide clusters membership only if generating landmarks
+        if landmarks != -1
+            for (i, c) in enumerate(comm[:,1])
+                if haskey(clusters, c)
+                    push!(clusters[c], i)
+                else
+                    clusters[c] = [i]
+                end
+            end
+            clusters = collect(values(clusters))
+        end
+
         idx = findfirst(==("-f"),ARGS)
         forced = !isnothing(idx) ? parse(Int, ARGS[idx+1]) : -1
+
         idx = findfirst(==("-m"),ARGS)
         method_str = !isnothing(idx) ? lowercase(strip(ARGS[idx+1])) : "rss"
+
         method = methods[method_str]
-        return edges, eweights, vweight, comm, clusters, embedding, asis, verbose, landmarks, forced, method
+        return edges, eweights, vweight, comm, clusters, embedding, verbose, landmarks, forced, method
     catch e
         showerror(stderr, e)
         println("\n\nUsage:")
         println("\tjulia CGE.jl -g graph_edgelist -e embedding [-c communities] [-v] [-l landmarks -f forced -m method]")
         println("\nParameters:")
-        println("graph_edgelist: rows should contain two vertices ids (edge) and optional weights")
-        println("communities: rows should contain cluster identifiers of consecutive vertices")
+        println("graph_edgelist: rows should contain two vertices ids (edge) and optional weights in third column")
+        println("communities: rows should contain cluster identifiers of consecutive vertices with optional node ids in first column")
         println("if no file is given communities are calculated with Louvain algorithm")
-        println("embedding: rows should contain whitespace separated locations of vertices in embedding")
-        println("-a: flag for sorting embedding")
+        println("embedding: rows should contain whitespace separated embedding values with optional node ids in first column")
         println("-v: flag for debugging messages")
-        println("landmarks: required number of landmarks")
+        println("landmarks: number of landmarks")
         println("forced: required maximum number of forced splits of a cluster")
         println("method: one of:")
         println("\t* rss:      minimize maximum residual sum of squares when doing a cluster split")