
Inquiries regarding target model update #2

originholic opened this issue Dec 23, 2015 · 4 comments

@originholic

Hello,
Thanks for sharing this great Chainer-based DQN code.

I recently started using Chainer. The code works great for me, and I would like to implement an actor-critic architecture based on your DQN code. I don't know whether it is strange to ask a question like this here, since I couldn't find an appropriate forum to ask about Chainer, but any help would be really appreciated.

I can see in the code that the target model is updated as follows, by copying the model directly:

self.model_target = copy.deepcopy(self.model)

If I want the target to be updated more slowly, based on this paper:

θ' ← τθ + (1 − τ)θ'

Since θ and θ' are the weights of the model and the target model, I was thinking of doing:

self.model_target.W.data = tau * self.model.W.data + (1 - tau) * self.model_target.W.data

Is this the right way of doing it, or is there a better way to do it in Chainer?
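
More generally, I guess the same soft update would have to be applied to every parameter of the network, not just W. A minimal sketch of what I have in mind (assuming both networks are chainer.Chain instances, so that params() yields their parameters in matching order; soft_update is just an illustrative name):

def soft_update(model, model_target, tau):
    # Blend each target parameter toward the online parameter, in place.
    for p, p_target in zip(model.params(), model_target.params()):
        p_target.data[:] = tau * p.data + (1.0 - tau) * p_target.data

This would be called once per training step, e.g. soft_update(self.model, self.model_target, tau).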
Many thanks

@ugo-nama-kun
Owner

Hello,

Thank you for your interest.

Your question is quite interesting and important to me. That kind of "Theano-like" update rule is very important for the future development of deep reinforcement learning with Chainer.

Unfortunately, I've never tried that kind of update rule in Chainer code, so I'm currently not sure whether your suggestion works well.

However, I think that kind of "Theano-like" update rule will be necessary for me in the near future.
I'll look for an appropriate way and post it here.
And I would really appreciate it if you let me know when your method actually works well.

thanks

@originholic
Author

Happy new year, and many thanks for responding.

These past few days, I have been running tests of the update method I mentioned previously, using the actor-critic architecture on a continuous cart-pole balancing domain.

However, the result is strange: I can see it trying to learn to balance at the beginning, as the step reward gradually rises, but once it reaches the goal number of balancing steps, the reward starts to decrease again. I am not able to identify whether the problem is the update, the architecture, or the parameters.

But yes, I will keep working on this to see whether this type of update works with Chainer.

@ugo-nama-kun
Owner

Hello originholic,

Sorry for the late response.

Today I uploaded test code for the "moving copy" in Chainer:
https://github.com/ugo-nama-kun/moving_copy_in_chainer.git
To run my code, you need the latest Chainer package.

I constructed a very simple classification task and tested both the CPU-based and GPU-based implementations, and they appear to work correctly.
Actually, my code is just the same as your suggested code ;-)
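
Roughly, the check is along these lines (an illustrative sketch only, not the actual repository code; see the link above for that):

import copy
import numpy as np
import chainer.links as L

tau = 0.01
model = L.Linear(4, 2)
model_target = copy.deepcopy(model)

# Apply the soft update to W and confirm the blended values numerically.
w_before = model_target.W.data.copy()
model_target.W.data[:] = tau * model.W.data + (1.0 - tau) * model_target.W.data
assert np.allclose(model_target.W.data, tau * model.W.data + (1.0 - tau) * w_before)

On GPU the same assignment works after moving the links with to_gpu(), since the arrays become cupy arrays.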

Finally, I think the cart-pole balancing task is a bit too complex for checking code for continuous RL algorithms. If you are not used to implementing RL algorithms, I recommend using the mountain-car task instead of cart-pole: because the mountain-car task has only a 2-dimensional state space, visualizing the value function, the policy, and the agent's behavior is very straightforward.

@originholic
Author

Hello @ugo-nama-kun ,
Sorry for the delayed reply.
And many thanks for testing out the update method in Chainer; now I can be more confident about the update when I run the actor-critic experiments.

I think you are right that the cart-pole balancing task is quite complex to test with. My work would likely have been less painful if I had started with a simpler task, but since this balancing scenario is very close to the task I want to work on, I am sticking with cart-pole.

Eventually, I got "moderately good" CartPole results after a long manual hyperparameter search, although there is still a lot of work to do on tuning the deep neural network, as it always overfits and I have to use early stopping to prevent that. I really appreciate your help, and I will share the actor-critic code I have been working on in my repo at some point.
