generated gibberish within the block #48

Open
yangyangyyy123 opened this issue Oct 10, 2024 · 1 comment
yangyangyyy123 commented Oct 10, 2024

The generate() function currently takes only the last position of the output logits to produce the next token, then shifts the entire input window one position forward and again takes only the last position for the following token. https://github.com/karpathy/ng-video-lecture/blob/master/gpt.py#L189
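For reference, the loop in question looks roughly like this. This is a paraphrased sketch rather than a verbatim copy of gpt.py; it assumes a model whose forward() returns (logits, loss), a global block_size, and torch.nn.functional imported as F:

```python
# Paraphrased sketch of the generate() method being discussed (not verbatim gpt.py).
def generate(self, idx, max_new_tokens):
    # idx is a (B, T) tensor of token indices for the current context
    for _ in range(max_new_tokens):
        # crop the context to the last block_size tokens (the sliding window)
        idx_cond = idx[:, -block_size:]
        # forward pass: logits has shape (B, T, vocab_size)
        logits, loss = self(idx_cond)
        # keep ONLY the last time step -- the behavior this issue asks about
        logits = logits[:, -1, :]                            # (B, vocab_size)
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)   # (B, 1)
        # append the sampled token; next iteration the window shifts forward
        idx = torch.cat((idx, idx_next), dim=1)              # (B, T+1)
    return idx
```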

I was curious and looked at the entire output contents. For the input, I fed in the output of a previous run with the current generate() function, so the input token sequence would be completely "based on the behavior of the model itself", so to speak. Then I generated the full list of T tokens from the output, one per position. To my surprise, the output is very much gibberish, and quite different from the input (though I could still see a few matches).

I can't figure out why the current method of taking only the last output position produces seemingly fluent sequences, while the output from the middle of the block doesn't make sense. In the current scheme, the input grows from torch.zeros((1, 1)) up to block_size, so during that period it should be no different from what an output position in the middle of the block sees: that position has masked out all input after it, so it effectively becomes the end of the output window too.
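A minimal sketch of the experiment described above, assuming the model's forward() returns (logits, loss) and a decode() function as in the lecture code; predict_all_positions is a hypothetical helper, not part of the repo:

```python
import torch

@torch.no_grad()
def predict_all_positions(model, context, decode):
    # context: (1, T) tensor of previously generated tokens, with T <= block_size
    logits, _ = model(context)        # (1, T, vocab_size)
    preds = logits.argmax(dim=-1)     # greedy one-step prediction at every position
    # position i predicts the token at position i+1, so compare preds[:, :-1]
    # against context[:, 1:] to see how many in-block predictions match the input
    matches = (preds[:, :-1] == context[:, 1:]).float().mean().item()
    print(f"fraction of in-block predictions matching the input: {matches:.3f}")
    # decoding all T predictions at once is what produced the output inspected above
    print(decode(preds[0].tolist()))
```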

@HangjianQian

On the question of why we take only the last position of the logits: I think all positions of the output logits have meaning. The logits at position i mean: given the input [0:i], what should the following output be? So during training, the model can also learn from the shorter contexts.
A word-level training example, with input "I like shopping online":

input                   | output      | output logits position
I like                  | shopping    | 1
I like shopping         | online      | 2
I like shopping online  | (next word) | 3

During training, the losses at all of these positions are combined in the cross-entropy. During inference, since we only care about the next token, we take only the last element of the logits.
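In code terms, that is roughly the following (a sketch along the lines of gpt.py's forward pass, not a verbatim copy):

```python
import torch
import torch.nn.functional as F

# Training: every position contributes a loss term -- predict token i+1 from tokens [0:i].
def loss_over_all_positions(logits, targets):
    # logits: (B, T, vocab_size); targets: (B, T), i.e. the inputs shifted by one
    B, T, C = logits.shape
    return F.cross_entropy(logits.view(B * T, C), targets.view(B * T))

# Inference: only the prediction for the next token is needed,
# so only the last time step of the logits is kept:
# next_token_logits = logits[:, -1, :]   # (B, vocab_size)
```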
