
Add option for pretrained embeddings #108

Open · wants to merge 5 commits into master

Conversation


@mmcenta mmcenta commented Dec 11, 2019

As per #107, this PR adds an extra parameter, --pretrained, for specifying a file of pre-trained embeddings to use as initializers for the node embeddings. The specified file must be in the traditional C word2vec format (the same format the program outputs).
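
A minimal sketch of the kind of initialization this enables, assuming gensim 3.x (the helper name and loop below are illustrative, not the PR's exact code):

from gensim.models import KeyedVectors

def initialize_from_pretrained(model, pretrained_path):
    # Load the pre-trained vectors from a word2vec C text file.
    pretrained = KeyedVectors.load_word2vec_format(pretrained_path, binary=False)
    # Copy each available pre-trained vector into the freshly built model,
    # keyed by node id; nodes without a pre-trained vector keep their
    # random initialization.
    for word, entry in model.wv.vocab.items():
        if word in pretrained.vocab:
            model.wv.vectors[entry.index] = pretrained[word]
    return model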

Help wanted: how should we check that the dimensions match? Either we assert that the loaded vectors have the same dimension as the value passed to --representation_size, or we infer the representation size from the pre-trained embeddings (and make the two parameters mutually exclusive).

Also, I read through the README to check whether it needed updating, but the new parameter didn't seem to fit anywhere, so I can either leave it unchanged or add a new subsection (up to you).

@@ -72,7 +72,14 @@ def process(args):
     walks = graph.build_deepwalk_corpus(G, num_paths=args.number_walks,
                                         path_length=args.walk_length, alpha=0, rand=random.Random(args.seed))
     print("Training...")
-    model = Word2Vec(walks, size=args.representation_size, window=args.window_size, min_count=0, sg=1, hs=1, workers=args.workers)
+    model = Word2Vec(size=args.representation_size, window=args.window_size, min_count=0, sg=1, hs=1, workers=args.workers)
+    model.build_vocab(walks)
Collaborator

Why is this necessary?

Author

I think that if you initialize gensim's Word2Vec class with the training data, it builds the vocabulary and trains the model as it initializes, which is not what we want here (if there are pre-trained embeddings, we should load them before training). If the --pretrained parameter is not set, this basically does the same thing as the default initializer.
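
A rough illustration of the deferred flow described above, assuming the gensim 3.x API (the final training call is an assumption added for completeness, not necessarily the PR's exact code):

model = Word2Vec(size=args.representation_size, window=args.window_size,
                 min_count=0, sg=1, hs=1, workers=args.workers)
model.build_vocab(walks)   # build the vocabulary only; no training happens yet
# ... load the pre-trained vectors here, before any training ...
model.train(walks, total_examples=model.corpus_count, epochs=model.epochs)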

@@ -150,6 +157,9 @@ def main():
     parser.add_argument('--workers', default=1, type=int,
                         help='Number of parallel processes.')

+    parser.add_argument('--pretrained', nargs='?',
+                        help='Pre-trained embeddings file')
@GTmac (Collaborator) commented Dec 13, 2019

What should be the format of the embedding file? The help string could be more detailed imo

Author

Yes, I agree. The format is the same as the output of the model, which gensim's docs refer to as C's Word2Vec format. I don't know exactly how to refer to that format in the help message.
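
For reference, the text variant of that format starts with a header line giving the vocabulary size and the embedding dimension, followed by one node per line; the ids and values below are made up:

3 2
17 0.051 -0.233
4 0.127 0.914
9 -0.362 0.048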

@GTmac (Collaborator) commented Dec 13, 2019

Thanks for working on this! It would be really helpful if you could add usage of the new argument to the README. Also, can we test it by: 1) running DeepWalk without the --pretrained flag, to make sure we do not break anything; 2) running with an invalid embedding file, to make sure it is handled properly; and 3) running with a valid embedding file?

@GTmac (Collaborator) commented Dec 13, 2019

For dimension match: yeah, I think we should assert that the loaded vectors have the same dimension as --representation_size, otherwise just abort the program.
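
One possible shape for that check, assuming gensim's KeyedVectors loader (the flow and error message are illustrative):

import sys
from gensim.models import KeyedVectors

if args.pretrained:
    pretrained = KeyedVectors.load_word2vec_format(args.pretrained, binary=False)
    if pretrained.vector_size != args.representation_size:
        # Abort if the pre-trained dimension disagrees with --representation-size.
        sys.exit("Pre-trained embeddings have dimension {} but --representation-size is {}."
                 .format(pretrained.vector_size, args.representation_size))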

window=args.window_size, min_count=0, trim_rule=None, workers=args.workers)
model.build_vocab(vocab_walks_corpus)
Collaborator

Why is this necessary?

@GTmac (Collaborator) left a comment

Also, how do we handle the case where the vocabulary in the pre-trained embeddings does not match the list of graph nodes?

@mmcenta (Author) commented Dec 17, 2019

> Thanks for working on this! It would be really helpful if you could add usage of the new argument to the README. Also, can we test it by: 1) running DeepWalk without the --pretrained flag, to make sure we do not break anything; 2) running with an invalid embedding file, to make sure it is handled properly; and 3) running with a valid embedding file?

I have been using the version on this branch for my link prediction project, and I've already run the program for cases 1 and 3. As soon as we figure out how to handle invalid pre-trained files, I can run all the tests.

@mmcenta (Author) commented Dec 17, 2019

> For dimension match: yeah, I think we should assert that the loaded vectors have the same dimension as --representation_size, otherwise just abort the program.

I was thinking about disabling the --representation-size flag if we are using the --pretrained one (i.e. setting them to be mutually exclusive). That's another way to solve the problem and both are pretty easy to implement, I imagine. What do you think?
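
A sketch of the mutually exclusive variant using the standard argparse API (the default value and help strings here are illustrative):

group = parser.add_mutually_exclusive_group()
group.add_argument('--representation-size', default=64, type=int,
                   help='Number of latent dimensions to learn for each node.')
group.add_argument('--pretrained',
                   help='File with pre-trained embeddings in the word2vec C text format.')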

@mmcenta (Author) commented Dec 17, 2019

> Also, how do we handle the case where the vocabulary in the pre-trained embeddings does not match the list of graph nodes?

I'm not sure. I thought it would just initialize the embeddings randomly for nodes that are not in the pre-trained embeddings, but after reading through the code that isn't clear. I will dig into the gensim docs, or maybe just delete a line from my embedding file and run it to see what happens.

@mmcenta (Author) commented Dec 17, 2019

Thanks for taking the time to help me out! I'm taking some time to study for my finals right now, but I will be back soon to implement the changes you proposed.

@GTmac (Collaborator) commented Dec 18, 2019

> For dimension match: yeah, I think we should assert that the loaded vectors have the same dimension as --representation_size, otherwise just abort the program.

> I was thinking about disabling the --representation-size flag if we are using the --pretrained one (i.e. setting them to be mutually exclusive). That's another way to solve the problem and both are pretty easy to implement, I imagine. What do you think?

Good idea on making them mutually exclusive!

@GTmac (Collaborator) commented Dec 18, 2019

> Also, how do we handle the case where the vocabulary in the pre-trained embeddings does not match the list of graph nodes?

> I'm not sure. I thought it would just initialize the embeddings randomly for nodes that are not in the pre-trained embeddings, but after reading through the code that isn't clear. I will dig into the gensim docs, or maybe just delete a line from my embedding file and run it to see what happens.

Or maybe you could be more aggressive here: assert that the vocab in the pre-trained embeddings is the same as the graph nodes, and otherwise abort the program.
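
One way that strict check could look, assuming G is the deepwalk graph and pretrained is a loaded KeyedVectors (node ids are stringified here to match the vocabulary keys; names are illustrative):

import sys

graph_nodes = set(str(node) for node in G.nodes())
if set(pretrained.vocab) != graph_nodes:
    # The vocabularies differ, so refuse to continue rather than train on a partial match.
    sys.exit("Vocabulary of the pre-trained embeddings does not match the graph nodes.")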

@GTmac (Collaborator) commented Dec 18, 2019

> Thanks for taking the time to help me out! I'm taking some time to study for my finals right now, but I will be back soon to implement the changes you proposed.

No rush, good luck with your finals :-)

@mmcenta (Author) commented Jan 10, 2020

I'm back from finals and vacations!

I just implemented two of the improvements we talked about, and I wanted your opinion on the next one: as the code stands, the vocabulary is the union of the pre-trained vocabulary and the one built from the walks. We can either 1) use only the intersection of the pre-trained embeddings with the vocabulary from the walks, or 2) assert that they are equal and terminate otherwise. From what I've learned, 1) is really easy and quick to implement and 2) takes linear time in the size of the vocabularies. What do you think is the best option?
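
A set-level view of the two options, with illustrative variable names (assumes pretrained is a loaded KeyedVectors and model has already had build_vocab called):

import sys

walk_vocab = set(model.wv.vocab)          # vocabulary built from the random walks
pretrained_vocab = set(pretrained.vocab)  # vocabulary of the pre-trained embeddings

# Option 1: initialize only the nodes present in both vocabularies.
shared = walk_vocab & pretrained_vocab

# Option 2: require an exact match and abort otherwise.
if walk_vocab != pretrained_vocab:
    sys.exit("Pre-trained vocabulary does not match the vocabulary built from the walks.")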
