Add option for pretrained embeddings #108
base: master
Conversation
@@ -72,7 +72,14 @@ def process(args):
      walks = graph.build_deepwalk_corpus(G, num_paths=args.number_walks,
                                          path_length=args.walk_length, alpha=0, rand=random.Random(args.seed))
      print("Training...")
-     model = Word2Vec(walks, size=args.representation_size, window=args.window_size, min_count=0, sg=1, hs=1, workers=args.workers)
+     model = Word2Vec(size=args.representation_size, window=args.window_size, min_count=0, sg=1, hs=1, workers=args.workers)
+     model.build_vocab(walks)
Why is this necessary?
I think that if you initialize gensim's Word2Vec class with the training data, it builds the vocabulary and trains the model as part of initialization, which is not what we want here (if there are pre-trained embeddings, we should load them before training). If the --pretrained parameter is not set, this basically does the same thing as the default initializer.
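To make the ordering concrete, here is a minimal sketch of that split workflow, assuming gensim's pre-4.0 API (where Word2Vec takes size=), a hypothetical pretrained.txt file, and intersect_word2vec_format as one possible way to seed the vectors; it is not necessarily what this PR ends up using:

```python
from gensim.models import Word2Vec

walks = [["a", "b", "c"], ["b", "c", "a"]]  # toy random-walk corpus
pretrained_path = "pretrained.txt"          # hypothetical file in C word2vec text format

# Constructing without a corpus skips training, so vectors can be seeded first.
model = Word2Vec(size=64, window=5, min_count=0, sg=1, hs=1, workers=1)
model.build_vocab(walks)  # step 1: build the vocabulary from the walks only
# Step 2 (optional): overwrite vectors for in-vocab words with pre-trained ones;
# words missing from the file keep their random initialization.
model.intersect_word2vec_format(pretrained_path, binary=False, lockf=1.0)
# Step 3: train as usual (epochs chosen arbitrarily here).
model.train(walks, total_examples=model.corpus_count, epochs=5)
```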
deepwalk/__main__.py
@@ -150,6 +157,9 @@ def main():
      parser.add_argument('--workers', default=1, type=int,
                          help='Number of parallel processes.')

+     parser.add_argument('--pretrained', nargs='?',
+                         help='Pre-trained embeddings file')
What should be the format of the embedding file? The help string could be more detailed, IMO.
Yes, I agree. The format is the same as the output of the model, which gensim's docs refer to as the C word2vec format. I don't know exactly how to refer to that format in the help message.
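One possible wording, sketched here as a suggestion rather than agreed-upon phrasing (the exact text was left open in the discussion):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--pretrained', nargs='?',
                    help='Path to pre-trained embeddings in the word2vec C text '
                         'format (the same format this program writes its output in).')
```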
Thanks for working on this! It would be really helpful if you could add the usage of the new argument to the README. Also, can we test it by: 1) running DeepWalk without the --pretrained flag, to make sure we do not break anything; 2) running with an invalid embedding file, to make sure it is handled properly; 3) running with a valid embedding file.
For dimension match: yeah, I think we should assert that the loaded vectors have the same dimension as the value passed to --representation_size.
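A minimal sketch of that assertion, assuming gensim's KeyedVectors loader and the argument names used in this PR (`args` being the parsed argparse namespace):

```python
from gensim.models import KeyedVectors

# `args` is assumed to be the namespace parsed in __main__.py.
pretrained = KeyedVectors.load_word2vec_format(args.pretrained, binary=False)
assert pretrained.vector_size == args.representation_size, (
    "pre-trained embeddings have dimension %d, but --representation_size is %d"
    % (pretrained.vector_size, args.representation_size))
```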
                      window=args.window_size, min_count=0, trim_rule=None, workers=args.workers)
      model.build_vocab(vocab_walks_corpus)
Why is this necessary?
Also, how do we handle the case where the vocabulary in the pre-trained embeddings does not match the list of graph nodes?
So I have been using the version on this branch for my link prediction project and I've already run the program in cases 1 and 3. As soon as we figure out how to deal with invalid pre-trained files, I can run all the tests.
I was thinking about disabling the --representation_size parameter when --pretrained is set, i.e. making the two mutually exclusive and inferring the size from the file.
I'm not sure. I thought that it would just initialize the embeddings randomly for nodes that are not in the pre-trained embeddings, but reading through the code that doesn't seem very clear. I will dig into the gensim docs or maybe just delete a line from my embedding file and run it to see what happens.
Thanks for taking the time to help me out! I am taking some time to study for my finals right now, but I will be back soon to implement the changes you proposed.
Good idea on making them mutually exclusive!
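For reference, argparse supports this directly; a hypothetical sketch (defaults and help strings assumed, not taken from the PR):

```python
import argparse

parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group()
group.add_argument('--representation-size', default=64, type=int,
                   help='Number of latent dimensions to learn for each node.')
group.add_argument('--pretrained',
                   help='Pre-trained embeddings file; the representation size '
                        'is inferred from it.')
```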
Or maybe you could be more aggressive here: assert that the vocab in the pre-trained embeddings is the same as the graph nodes, and otherwise abort the program.
No rush, good luck with your finals :-)
I'm back from finals and vacation! I just implemented two of the improvements we talked about, and I wanted your opinion on this next one: the way the code is implemented right now, the vocabulary is a union of the pre-trained one and the one built from the walks. We can either 1) only intersect the pre-trained embeddings with the vocabulary from the walks, or 2) assert that they are equal and terminate otherwise. From what I learned, 1) is really easy and quick to implement and 2) takes time linear in the size of the vocabularies. What do you think is the best option?
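A rough sketch of option 2, assuming the gensim pre-4.0 .vocab dicts and the `pretrained` / `model` objects from the earlier sketches in this thread:

```python
import sys

pretrained_vocab = set(pretrained.vocab)  # gensim < 4.0 exposes the vocabulary as a dict
walk_vocab = set(model.wv.vocab)          # vocabulary built from the random walks
if pretrained_vocab != walk_vocab:        # set comparison is linear in vocabulary size
    sys.exit("Pre-trained vocabulary does not match the graph nodes; aborting.")
```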
As per #107, this PR implements an extra parameter, --pretrained, that can be used to specify pre-trained embeddings to initialize the node embeddings with. The specified file must be in the traditional C word2vec format (the same format the program outputs).

Help Wanted: how do we check that the dimensions match? Either we assert that the loaded vectors have the same dimension as the value passed to the --representation_size parameter, or we infer the representation size from the pre-trained embeddings (and make the two parameters mutually exclusive).

Also, I read through the README to check whether I needed to update anything, but I decided that this new parameter didn't fit anywhere, so I should either not alter it or add a new subsection (up to you).