
Learning the initial weights of the heads leads to NaN #2

Open
tristandeleu opened this issue Sep 17, 2015 · 6 comments

Comments

@tristandeleu
Collaborator

When setting the learn_init=True parameter on the heads, the error and parameters of the heads become NaN after a few iterations (not necessarily on the first one, but it can happen after 100+ iterations).

How to reproduce it:

heads = [
    WriteHead([controller, memory], shifts=(-1, 1), name='write', learn_init=True),
    ReadHead([controller, memory], shifts=(-1, 1), name='read', learn_init=True)
]

This is a non-blocking issue since learning these weights may not actually make sense (we can just keep the equiprobable initialization as-is for the first step).

@tristandeleu
Collaborator Author

This may be due to the norm constraint on the weights (and the initial weights) being violated during training. The weights are required to sum to one, but the (vanilla) training procedure does not enforce it. The sum-to-one (and non-negativity) constraint is critical: without it, gradient updates can push entries of w_tilde negative, which explains the NaNs in w \propto w_tilde ** gamma, since a negative base raised to a non-integer power gamma is undefined over the reals.
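For illustration, here is a minimal NumPy sketch (not the library code) of how a single negative entry in w_tilde turns the whole sharpened weighting into NaN; the values and gamma are made up:

import numpy as np

# Hypothetical values: one entry of w_tilde has drifted below zero because
# nothing enforces the sum-to-one / non-negativity constraint during training.
w_tilde = np.array([0.7, 0.4, -0.1])
gamma = 1.5  # sharpening exponent, generally non-integer

w = w_tilde ** gamma  # (-0.1) ** 1.5 is undefined over the reals -> nan
print(w)              # [0.5857  0.2530     nan]
w /= w.sum()          # the nan then contaminates every entry of w
print(w)              # [nan  nan  nan]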

@tristandeleu
Collaborator Author

Learning the initial weights might be something we'll eventually need. Initializing them to a uniform probability over all the addresses almost necessarily forces the first step to write in a distributed way (over multiple addresses, instead of hard addressing).

Instead of learning the raw weight_init, which may have some issues as explained in #2 (comment), we could learn some kind of initialization that needs to go through a normalization step to get w_0. The process would be to learn weight_init (keep this shared variable as a parameter) and then get the first weight as

w_0 = normalize(rectify(weight_init))

The additional rectify() nonlinearity is there to favor sparse initializations.
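A minimal sketch of that parametrization, assuming Theano (the helper name normalized_rectify and the eps term are mine, not from the repo):

import theano.tensor as T

def normalized_rectify(weight_init, eps=1e-6):
    # Rectify: clip negative entries of the learned shared variable to zero.
    w = T.maximum(weight_init, 0.)
    # Renormalize so that w_0 is non-negative and sums to one; eps avoids a
    # 0/0 division when every entry has been rectified away.
    return (w + eps) / T.sum(w + eps)

# w_0 = normalized_rectify(weight_init)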

@EderSantana

Hi @tristandeleu, when learning the initial weights, are you making sure they are behind a softmax? In other words, are you learning the initial logits instead? If so, there is no problem if they take negative values. I had this problem in my NTM implementation as well.
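Roughly, the suggestion amounts to something like the following Theano sketch (variable names are illustrative, not taken from either codebase): keep the unconstrained logits as the learned shared variable, and only expose softmax(logits) as w_0.

import numpy as np
import theano
import theano.tensor as T

memory_size = 128  # number of memory addresses (illustrative)

# Learn the logits: updates can push them to any real value.
weight_logits = theano.shared(
    np.zeros(memory_size, dtype=theano.config.floatX), name='weight_logits')

# Only the softmax is used as the initial weighting, so w_0 is always
# non-negative and sums to one regardless of what training does to the logits.
w_0 = T.nnet.softmax(weight_logits.dimshuffle('x', 0))[0]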

@tristandeleu
Collaborator Author

When I originally opened this issue I didn't, which was a mistake on my end. I haven't tried to learn the logits, but you're right, I think this is the right solution (I only sketched the idea in this issue).
All in all I ended up leaving learn_init=False for the weights in my experiments and initializing them as one-hot vectors. But I haven't found a good way to allow both fixing the initialization (e.g. with OneHot) and learning the logits.

@EderSantana

I'm new to your codebase; could you point me to where you get the initial weights? I could try to check that out.

@tristandeleu
Collaborator Author

The initial weights are defined here: https://github.com/snipsco/ntm-lasagne/blob/master/ntm/heads.py#L102
But for now, there's no correct way to learn these weights unfortunately.
