batch_encode_plus doesn't work correctly #1704

Open
tempdeltavalue opened this issue Dec 18, 2024 · 2 comments

tempdeltavalue commented Dec 18, 2024

code here:
https://github.com/tempdeltavalue/temp_l/blob/main/finetune_seq2seq.ipynb

Screenshot 2024-12-18 150037
Screenshot 2024-12-18 150203

https://discuss.huggingface.co/t/repetitive-words-in-model-output/132085/2

@tempdeltavalue tempdeltavalue changed the title batch_encode doesn't work correctly batch_encode_plus doesn't work correctly Dec 18, 2024
tempdeltavalue (Author) commented

The same happens with tokenizer() batch encoding:
Screenshot 2024-12-18 151941
Screenshot 2024-12-18 152002

jonvet commented Jan 11, 2025

In your notebook you initialise the tokenizer as follows:

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium", padding_side='left')
tokenizer.pad_token = tokenizer.eos_token

So you're padding on the left, using the EOS token as the pad token.
The reason you get more pad tokens for the same input sequence when you encode X[0:99] than when you encode X[0:3] is that some sequence in X[3:99] is longer than the longest sequence in X[0:3]. Padding makes every encoded sequence in a batch the same length, so each one is padded up to the longest sequence in that batch. Where is this going wrong, in your opinion?
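
To illustrate the point about batch-dependent padding, here is a minimal pure-Python sketch. It does not call the real tokenizer: `left_pad_batch`, the `PAD` id, and the token ids in `X` are all hypothetical stand-ins, chosen only to show why the same sequence gets more pad tokens in a batch that contains a longer sequence.

```python
def left_pad_batch(sequences, pad_id):
    """Left-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [[pad_id] * (max_len - len(seq)) + list(seq) for seq in sequences]


# Hypothetical token ids; real ids would come from the tokenizer.
PAD = 0
X = [
    [11, 12],
    [21, 22, 23],
    [31],
    [41, 42, 43, 44, 45, 46],  # longer than anything in X[0:3]
]

small = left_pad_batch(X[0:3], PAD)  # longest sequence has 3 tokens
large = left_pad_batch(X[0:4], PAD)  # longest sequence has 6 tokens

print(small[0])  # [0, 11, 12]
print(large[0])  # [0, 0, 0, 0, 11, 12]
```

The first sequence is identical in both calls; only the batch it is padded against changes. If you need the same padded length regardless of batch contents, one option with the transformers tokenizer is to call it with padding="max_length" and an explicit max_length instead of padding to the longest sequence in the batch.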
