
Strange output of Flan-T5-Large when using float16 converted model #1074

Open

Matthieu-Tinycoaching opened this issue Feb 9, 2023 · 25 comments
Labels: enhancement (New feature or request)

@Matthieu-Tinycoaching commented Feb 9, 2023

Hi,

I've installed the "python-wheels" artifact from #1066.

Then I use the following code to answer a question based on a context:

import ctranslate2
import transformers

model_path = "./flan-t5-large/float16"
model_id = "google/flan-t5-large"  # original Hugging Face checkpoint (assumed; not named in the original snippet)

context = "Le Père Noël (né Thomas Patrick O'Connor ; mai 1947)[1] est un homme politique américain, un moine et un militant de la protection de l'enfance[2]. Il est membre du conseil municipal de North Pole, en Alaska, depuis 2015[3]. Claus a été réélu au conseil en 2019[4], avec 100 voix[5]. Il est actuellement le maire pro tem[6]. Il a été candidat à la Chambre des représentants des États-Unis lors de l'élection spéciale de 2022. Claus est un moine chrétien de l'ordre anglican celtique Anam Cara[8]. Il a changé son nom pour correspondre au personnage légendaire en 2005, afin d'aider à faire connaître son activisme pour la santé et le bien-être des enfants[9]. Claus est un partisan du socialisme démocratique, y compris le soutien à l'assurance-maladie pour tous, l'aide aux coronavirus, un impôt sur la fortune et l'annulation des prêts étudiants[10]."

question = "Pourquoi Claus a-t-il changé de nom ?"

input_text = f"Given the context please answer the question. Use only French words. Context: {context}; Question: {question}; Answer:"

translator = ctranslate2.Translator(model_path, device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

input_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input_text))

results = translator.translate_batch(source=[input_tokens], beam_size=1)

output_tokens = results[0].hypotheses[0]
output_text = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens))

print(output_text)

which generates the following text, consisting only of padding tokens: <pad><pad><pad><pad><pad><pad>… (repeated for the entire output)

Any suggestion regarding this result?

@guillaumekln (Collaborator) commented Feb 9, 2023

Thanks again for reporting the issue.

I found that the model produces NaN values after the 5th layer.

Apparently this is a known issue when running T5 models in FP16: huggingface/transformers#10956. As far as I understand, the issue is that the model is trained with bfloat16 and there are numerical issues when the model is then run in float16 (these are 2 different types).

We can try the change that is proposed in this PR but it does not seem efficient. Another solution would be to support bfloat16 execution.

Maybe you can go one step further and use the int8 quantization for this model?
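
If you want to try that, here is a minimal sketch (untested; the output directory name is just an example, the rest reuses the paths and input_tokens from your snippet):

ct2-transformers-converter --model google/flan-t5-large --output_dir flan-t5-large-int8 --quantization int8

translator = ctranslate2.Translator("flan-t5-large-int8", device="cuda", compute_type="int8")
results = translator.translate_batch(source=[input_tokens], beam_size=1)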

@Matthieu-Tinycoaching (Author)

Ok right 👍

@guillaumekln guillaumekln added the enhancement New feature or request label Feb 16, 2023
@baptistejamin

Hey there!

I discovered some similar quirks while evaluating ctranslate2 with flan-t5.

I compared the outputs of flan-t5-xl and flan-t5-xxl on GPU using float32, int8_float16, float16, and int8.

The results for "Who is Barrack Obama?":

  • With Flan-T5-XL float32: president of united states

  • With Flan-T5-XL int8_float16: president of united states

  • With Flan-T5-XL float16: <pad><pad><pad><pad><pad><pad>

  • With Flan-T5-XL int8: politician

  • With Flan-T5-XXL float32: president of united states

  • With Flan-T5-XXL int8_float16: sanders defeated barkan as president

  • With Flan-T5-XXL float16: <pad><pad><pad><pad><pad><pad>

  • With Flan-T5-XXL int8: president of makulin country south africa name or africa name authority of president of south logan city or south log stage south africa name south africa name south africa citizenship south africa name south africa nationality or south africa author south africa name south africa name south leader south name south south africa

Conclusion:

The model loses a lot of precision when running in int8_float16 or int8. The XXL model seems to be more affected, as it generates nonsensical text in int8.

Here is the code being used:

import ctranslate2
import transformers
import time

#ct2-transformers-converter --model google/flan-t5-xl --output_dir flan-xl-c2
#ct2-transformers-converter --model google/flan-t5-xxl --output_dir flan-xxl-c2

translator = ctranslate2.Translator("flan-xl-c2", device="cuda", compute_type="int8")
tokenizer = transformers.AutoTokenizer.from_pretrained("google/flan-t5-xxl", truncation=True, truncation_side="left")

prompt = """
Who is Barrack Obama?

Answer: """

while True:
	start_time = time.time()

	input_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

	results = translator.translate_batch([input_tokens], beam_size=1, sampling_topk=40, sampling_temperature=0.4, max_input_length=0, max_decoding_length=200, return_scores=False)

	output_tokens = results[0].hypotheses[0]
	output_text = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens))

	print(output_text)
	print("--- %s seconds ---" % (time.time() - start_time))

@baptistejamin

Update: After investigating, it seems part of the answer is here: https://github.com/huggingface/transformers/pull/22095/commits

@guillaumekln (Collaborator)

Seems also related to huggingface/transformers#20287

@redthing1

Yeah, I get it returning its own input many times. No real answers.

@redthing1

So from reading those issues, it doesn't quantize properly to either fp16 or int8, but supposedly works fine as bf16 or fp32.

Also, it looks like this is the actual underlying issue: triton-inference-server/fastertransformer_backend#95 (comment)

@redthing1

Here are more details: huggingface/transformers#20287 (comment)

@redthing1

@guillaumekln This appears to be the necessary patch: larsmennen/transformers@f90b269

@redthing1

https://github.com/huggingface/transformers/blame/main/src/transformers/models/t5/modeling_t5.py#L315

However, it looks like the patch has already been upstreamed for a while. Any ideas?

@guillaumekln (Collaborator)

#1239 keeps the FFN output layer in FP32 as suggested in the different issues mentioned above.

Can you help test this development?

  1. Go to this build page
  2. Download the artifact "python-wheels"
  3. Extract the archive
  4. Install the wheel matching your system and Python version with pip install --force-reinstall

The model must be converted again with this version.
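
For example (the wheel filename and output directory below are placeholders; pick the wheel matching your platform from the extracted archive):

pip install --force-reinstall ctranslate2-*.whl
ct2-transformers-converter --model google/flan-t5-large --output_dir flan-t5-large-ct2 --quantization float16

translator = ctranslate2.Translator("flan-t5-large-ct2", device="cuda", compute_type="float16")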

@redthing1

@guillaumekln Thank you for the quick fix! This appears to solve the problem for me.

@baptistejamin

We are running the tests as we speak on a Flan-T5 XXL.

It seems there is the same problem in fp8. I will try with fp16. Maybe related to this? https://github.com/OpenNMT/CTranslate2/pull/1239/files#diff-c2632813f7cc9ff44bda20479812ebefd15be7f7fe1bd099e553111bcb6750acR171

@guillaumekln (Collaborator)

"It seems there is the same problem in fp8."

You meant int8, right? More specifically did you try int8 or int8_float16?

@baptistejamin

Sorry about the typo. Int8 indeed.

So here are the tests with Flan T5 XXL:

pip install ctranslate2-3.13.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

With ct2-transformers-converter --model google/flan-t5-xxl --output_dir flan-xxl-c2

Who is Barrack Obama?:

int8: us president/barackhopatsumibi/podium/barhamambua00barackoblamesurprize/summonant/218974312
int8_float16: Error: ValueError: expected storage to be of type float32, but is of type float16
float16: Error: ValueError: expected storage to be of type float32, but is of type float16

With ct2-transformers-converter --model google/flan-t5-xxl --output_dir flan-xxl-c2 --quantization float16

int8: president united states pubic servant elect
int8_float16: Error: ValueError: expected storage to be of type float32, but is of type float16
float16: Error: ValueError: expected storage to be of type float32, but is of type float16

I may be missing something.

@guillaumekln (Collaborator)

Thanks for the test.

The error should be fixed in this build: https://github.com/OpenNMT/CTranslate2/actions/runs/5068556228

You can check whether it improves the output for "float16", but "int8_float16" will probably behave like "int8". I'm not sure what else is needed to improve the output with 8-bit quantization.
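
For reference, the compute type can be changed at load time without reconverting, so both can be compared from the same directory (a sketch, reusing the flan-xxl-c2 directory from your commands):

translator_fp16 = ctranslate2.Translator("flan-xxl-c2", device="cuda", compute_type="float16")
translator_int8 = ctranslate2.Translator("flan-xxl-c2", device="cuda", compute_type="int8_float16")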

@bradfox2 commented May 28, 2023

Results are looking good. I also experienced garbled output previously with CTranslate2 and the Flan-T5 series.

Running a domain-specific t5-xl based on flan-xl and trying @baptistejamin's prompt with the latest build (https://github.com/OpenNMT/CTranslate2/actions/runs/5068556228):

Who is barrack obama? -> Barack Obama is the 44th President of the United States. He was elected in 2008 and has been in office since January 20, 2009.

With:

translate_args = {
    "repetition_penalty": 1.3,
    "sampling_topk": 1,
    "beam_size": 1,
    "sampling_temperature": 0.3,
    "max_decoding_length": 512,
}

ctranslate2.Translator(
    device="cpu",
    inter_threads=8,
    intra_threads=8,
    compute_type="int8",
)

The response is equally good with:

device="cuda",
inter_threads=8,
intra_threads=8,
compute_type="int8_float16",
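
Put together, the call looks roughly like this (a sketch; the model directory is a placeholder, and input_tokens and tokenizer come from the earlier snippets in this thread):

translator = ctranslate2.Translator(
    "flan-xl-c2",
    device="cuda",
    inter_threads=8,
    intra_threads=8,
    compute_type="int8_float16",
)
results = translator.translate_batch([input_tokens], **translate_args)
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(results[0].hypotheses[0])))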

@redthing1

Fixed for me as well, thank you

@baptistejamin commented May 29, 2023

I can confirm it works fine with Flan XL int8 with the latest build.

However, Flan XXL int8 still returns nonsensical text, and for the same prompt the output is worse than the XL model's.

Finally, float16 mode still returns <pad><pad><pad><pad>... for both the official Flan XL and Flan XXL models. It behaves exactly as it did before the initial patch was made.

@Matthieu-Tinycoaching (Author)

Hi,
Is this corrected build published in the latest release?

@guillaumekln (Collaborator)

No, the change is not merged yet. It does not seem complete, since generation for Flan XXL in int8 is still broken.

@bradfox2

I've been working with XXL in int8 with 3.15.1 and it appears to be emitting valid responses, on both CUDA and CPU. @baptistejamin FYI

@baptistejamin commented Jun 15, 2023 via email

@bradfox2 commented Jul 8, 2023

@baptistejamin Any prompt works just fine. However, I recently had a converted model fail in the same way. In that case it came from how the tokenizer was converted into the config.json that CT2 creates: the bos and decoder start tokens were </s>, which is wrong for T5.

I haven't spent much time troubleshooting, but it comes from where the Seq2Seq model config inits in model_spec.py:386

class SequenceToSequenceModelConfig(ModelConfig):
    """Configuration for sequence-to-sequence models."""

    def __init__(
        self,
        unk_token: str = "<unk>",
        bos_token: str = "<s>",
        eos_token: str = "</s>",
        decoder_start_token: Optional[str] = "<s>",
        add_source_bos: bool = False,
        add_source_eos: bool = False,
    ):

Maybe this is also your issue?
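
A quick way to check a converted model is to look at the token fields written to its config.json (a sketch; it reuses the converted directory name from earlier in this thread and assumes these fields are stored there as the class above suggests):

import json

with open("flan-xxl-c2/config.json") as f:
    config = json.load(f)

# For T5 the decoder start token should be the pad token, not </s>
print(config.get("bos_token"), config.get("eos_token"), config.get("decoder_start_token"))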

@guillaumekln (Collaborator)

FYI, CTranslate2 3.17 supports the "bfloat16" compute type which should be used instead of "float16" for these models.
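
For example, reusing the flan-xl-c2 directory from earlier in this thread:

translator = ctranslate2.Translator("flan-xl-c2", device="cuda", compute_type="bfloat16")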
