
Add option for pretrained embeddings #108

Open · wants to merge 5 commits into master

Conversation


@mmcenta mmcenta commented Dec 11, 2019

As per #107, this PR adds an extra parameter, --pretrained, for specifying a file of pre-trained embeddings to use as initializers for the node embeddings. The specified file must be in the traditional C word2vec format (the same format the program outputs).
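
A minimal sketch of the kind of initialization this enables, assuming gensim 3.x (the helper name and loop below are illustrative, not the PR's exact code):

from gensim.models import KeyedVectors

def initialize_from_pretrained(model, pretrained_path):
    # Load the pre-trained vectors from a word2vec C text file.
    pretrained = KeyedVectors.load_word2vec_format(pretrained_path, binary=False)
    # Copy each available pre-trained vector into the freshly built model,
    # keyed by node id; nodes without a pre-trained vector keep their
    # random initialization.
    for word, entry in model.wv.vocab.items():
        if word in pretrained.vocab:
            model.wv.vectors[entry.index] = pretrained[word]
    return model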

Help wanted: how should we check that the dimensions match? Either we assert that the loaded vectors have the same dimension as the value passed to --representation_size, or we infer the representation size from the pre-trained embeddings (and make the two parameters mutually exclusive).

Also, I read through the README to check whether it needed updating, but the new parameter didn't seem to fit anywhere, so I can either leave it unchanged or add a new subsection (up to you).

@@ -72,7 +72,14 @@ def process(args):
     walks = graph.build_deepwalk_corpus(G, num_paths=args.number_walks,
                                         path_length=args.walk_length, alpha=0, rand=random.Random(args.seed))
     print("Training...")
-    model = Word2Vec(walks, size=args.representation_size, window=args.window_size, min_count=0, sg=1, hs=1, workers=args.workers)
+    model = Word2Vec(size=args.representation_size, window=args.window_size, min_count=0, sg=1, hs=1, workers=args.workers)
+    model.build_vocab(walks)
Collaborator

Why is this necessary?

Author

I think that if you initialize gensim's Word2Vec class with the training data, it builds the vocabulary and trains the model as it initializes, which is not what we want here (if there are pre-trained embeddings, we should load them before training). If the --pretrained parameter is not set, this basically does the same thing as the default initializer.
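
A rough illustration of the deferred flow described above, assuming the gensim 3.x API (the final training call is an assumption added for completeness, not necessarily the PR's exact code):

model = Word2Vec(size=args.representation_size, window=args.window_size,
                 min_count=0, sg=1, hs=1, workers=args.workers)
model.build_vocab(walks)   # build the vocabulary only; no training happens yet
# ... load the pre-trained vectors here, before any training ...
model.train(walks, total_examples=model.corpus_count, epochs=model.epochs)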

@@ -150,6 +157,9 @@ def main():
     parser.add_argument('--workers', default=1, type=int,
                         help='Number of parallel processes.')

+    parser.add_argument('--pretrained', nargs='?',
+                        help='Pre-trained embeddings file')
@GTmac (Collaborator) commented Dec 13, 2019

What should be the format of the embedding file? The help string could be more detailed imo

Author

Yes, I agree. The format is the same as the output of the model, which gensim's docs refer to as C's Word2Vec format. I don't know exactly how to refer to that format in the help message.
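
For reference, the text variant of that format starts with a header line giving the vocabulary size and the embedding dimension, followed by one node per line; the ids and values below are made up:

3 2
17 0.051 -0.233
4 0.127 0.914
9 -0.362 0.048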

@GTmac (Collaborator) commented Dec 13, 2019

Thanks for working on this! It would be really helpful if you could add usage of the new argument to the README. Also, can we test it by: 1) running DeepWalk without the --pretrained flag, to make sure we do not break anything; 2) running with an invalid embedding file, to make sure it is handled properly; and 3) running with a valid embedding file?

@GTmac (Collaborator) commented Dec 13, 2019

For dimension match: yeah, I think we should assert that the loaded vectors have the same dimension as --representation_size, otherwise just abort the program.
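
One possible shape for that check, assuming gensim's KeyedVectors loader (the flow and error message are illustrative):

import sys
from gensim.models import KeyedVectors

if args.pretrained:
    pretrained = KeyedVectors.load_word2vec_format(args.pretrained, binary=False)
    if pretrained.vector_size != args.representation_size:
        # Abort if the pre-trained dimension disagrees with --representation-size.
        sys.exit("Pre-trained embeddings have dimension {} but --representation-size is {}."
                 .format(pretrained.vector_size, args.representation_size))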

window=args.window_size, min_count=0, trim_rule=None, workers=args.workers)
model.build_vocab(vocab_walks_corpus)
Collaborator

Why is this necessary?

@GTmac (Collaborator) left a comment

Also, how do we handle the case where the vocabulary in the pre-trained embeddings does not match the list of graph nodes?

@mmcenta (Author) commented Dec 17, 2019

> Thanks for working on this! It would be really helpful if you could add usage of the new argument to the README. Also, can we test it by: 1) running DeepWalk without the --pretrained flag, to make sure we do not break anything; 2) running with an invalid embedding file, to make sure it is handled properly; and 3) running with a valid embedding file?

I have been using the version on this branch for my link prediction project, and I've already run the program for cases 1 and 3. As soon as we figure out how to handle invalid pre-trained files, I can run all the tests.

@mmcenta (Author) commented Dec 17, 2019

> For dimension match: yeah, I think we should assert that the loaded vectors have the same dimension as --representation_size, otherwise just abort the program.

I was thinking about disabling the --representation-size flag if we are using the --pretrained one (i.e. setting them to be mutually exclusive). That's another way to solve the problem and both are pretty easy to implement, I imagine. What do you think?
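
A sketch of the mutually exclusive variant using the standard argparse API (the default value and help strings here are illustrative):

group = parser.add_mutually_exclusive_group()
group.add_argument('--representation-size', default=64, type=int,
                   help='Number of latent dimensions to learn for each node.')
group.add_argument('--pretrained',
                   help='File with pre-trained embeddings in the word2vec C text format.')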

@mmcenta (Author) commented Dec 17, 2019

> Also, how do we handle the case where the vocabulary in the pre-trained embeddings does not match the list of graph nodes?

I'm not sure. I thought it would just initialize the embeddings randomly for nodes that are not in the pre-trained embeddings, but after reading through the code that isn't clear. I will dig into the gensim docs, or maybe just delete a line from my embedding file and run it to see what happens.

@mmcenta (Author) commented Dec 17, 2019

Thanks for taking the time to help me out! I'm taking some time to study for my finals right now, but I will be back soon to implement the changes you proposed.

@GTmac (Collaborator) commented Dec 18, 2019

> For dimension match: yeah, I think we should assert that the loaded vectors have the same dimension as --representation_size, otherwise just abort the program.

> I was thinking about disabling the --representation-size flag if we are using the --pretrained one (i.e. setting them to be mutually exclusive). That's another way to solve the problem and both are pretty easy to implement, I imagine. What do you think?

Good idea on making them mutually exclusive!

@GTmac (Collaborator) commented Dec 18, 2019

> Also, how do we handle the case where the vocabulary in the pre-trained embeddings does not match the list of graph nodes?

> I'm not sure. I thought it would just initialize the embeddings randomly for nodes that are not in the pre-trained embeddings, but after reading through the code that isn't clear. I will dig into the gensim docs, or maybe just delete a line from my embedding file and run it to see what happens.

Or maybe you could be more aggressive here: assert that the vocab in the pre-trained embeddings is the same as the graph nodes, and otherwise abort the program.
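
One way that strict check could look, assuming G is the deepwalk graph and pretrained is a loaded KeyedVectors (node ids are stringified here to match the vocabulary keys; names are illustrative):

import sys

graph_nodes = set(str(node) for node in G.nodes())
if set(pretrained.vocab) != graph_nodes:
    # The vocabularies differ, so refuse to continue rather than train on a partial match.
    sys.exit("Vocabulary of the pre-trained embeddings does not match the graph nodes.")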

@GTmac (Collaborator) commented Dec 18, 2019

> Thanks for taking the time to help me out! I'm taking some time to study for my finals right now, but I will be back soon to implement the changes you proposed.

No rush, good luck with your finals :-)

@mmcenta (Author) commented Jan 10, 2020

I'm back from finals and vacations!

I just implemented two of the improvements we talked about, and I wanted your opinion on the next one: as the code stands, the vocabulary is the union of the pre-trained vocabulary and the one built from the walks. We can either 1) use only the intersection of the pre-trained embeddings with the vocabulary from the walks, or 2) assert that they are equal and terminate otherwise. From what I've learned, 1) is really easy and quick to implement and 2) takes linear time in the size of the vocabularies. What do you think is the best option?
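
A set-level view of the two options, with illustrative variable names (assumes pretrained is a loaded KeyedVectors and model has already had build_vocab called):

import sys

walk_vocab = set(model.wv.vocab)          # vocabulary built from the random walks
pretrained_vocab = set(pretrained.vocab)  # vocabulary of the pre-trained embeddings

# Option 1: initialize only the nodes present in both vocabularies.
shared = walk_vocab & pretrained_vocab

# Option 2: require an exact match and abort otherwise.
if walk_vocab != pretrained_vocab:
    sys.exit("Pre-trained vocabulary does not match the vocabulary built from the walks.")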
