Generation fails with exceeding context length #276

Open

TomasTomecek opened this issue Sep 12, 2024 · 4 comments

@TomasTomecek

Data generation fails because the model's context length is exceeded. I assume there is something wrong with my input data, but it's hard to tell because the error message doesn't give me any pointers.

$ ilab -v data generate --gpus 1 --enable-serving-output
...
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/sdg/pipeline.py", line 174, in generate
    ds = future.result()
         ^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib64/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/sdg/pipeline.py", line 203, in _generate_single
    raise PipelineBlockError(
instructlab.sdg.pipeline.PipelineBlockError: PipelineBlockError(<class 'instructlab.sdg.llmblock.LLMBlock'>/gen_questions): Error code: 400 - {'object': 'error', 'message': "This model's maximum context length is 4096 tokens. However, you requested 4472 tokens (376 in the messages, 4096 in the completion). Please reduce the length of the messages or completion.", 'type': 'BadRequestError', 'param': None, 'code': 400}

I am using this knowledge taxonomy: https://github.com/fedora-copr/logdetective-taxonomy/pull/1/files#diff-72023c16fbdf204c9d0cf1d3710f1cd4d3626e02786a23162ce649e20a8adf51

Expected output

Print the offending entry from my input, which would help me update it.

ilab 0.18.4, RHEL AI 1.1, demo.redhat.com environment

@bbrowning
Contributor

Thanks for the report @TomasTomecek! If this is triggered specifically in a RHEL AI environment, then I'd encourage you to pursue the specific Red Hat channels for getting the support you need. However, there are situations where you could hit and reproduce this directly with InstructLab, so let's walk through how to diagnose that.

First, what model is in use here? What pipeline? Are you sure it's pointing to the correct taxonomy? You're looking for the model, pipeline, and taxonomy_path keys under the generate section of your ilab config, which can be viewed via ilab config show. It looks like you're running against a model with a 4096 context length, so I'm going to assume this is the simple pipeline along with the merlinite-7b-lab-Q4_K_M.gguf model?
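
For reference, the relevant part of the ilab config show output looks roughly like this (the paths and values below are illustrative placeholders, not taken from your setup):

generate:
    model: /path/to/models/<teacher-model>
    pipeline: simple
    taxonomy_path: /path/to/taxonomy

If pipeline points at simple and model points at a small-context GGUF teacher, that would line up with the 4096-token limit in the error.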

Secondly, since you passed the -v and --enable-serving-output flags, were you able to tell from the output which specific inference request triggered this error? The name of the pipeline step that errored out is gen_questions, but that's odd because we only trigger that pipeline step when generating freeform skills data. The qna.yaml you link to is for knowledge data, so I think this is perhaps a matter of not pointing at the right taxonomy or not having the expected folder setup in that taxonomy?
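
As a rough sketch of what I mean by folder setup (the path segments are just examples, assuming the usual taxonomy layout), a knowledge contribution is expected to live under the knowledge/ subtree of the taxonomy, next to an attribution file:

taxonomy/
└── knowledge/
    └── <domain>/
        └── <topic>/
            ├── qna.yaml
            └── attribution.txt

Freeform skills, which are what trigger the gen_questions step, live under compositional_skills/ instead.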

@TomasTomecek
Author

@bbrowning thank you for guiding me, Ben!

Parameters:
                   gpus: 1              [type: int, src: commandline]
  enable_serving_output: True           [type: bool, src: commandline]
             model_path: '/var/home/instruct/.cache/instructlab/models/granite-7b-redhat-lab'   [type: str, src: default_map]
               num_cpus: 10             [type: int, src: default_map]
       chunk_word_count: 1000           [type: int, src: default_map]
       num_instructions: -1             [type: int, src: default]
       sdg_scale_factor: 30             [type: int, src: default_map]
          taxonomy_path: '/var/home/instruct/.local/share/instructlab/taxonomy'         [type: str, src: default_map]
          taxonomy_base: 'empty'        [type: str, src: default_map]
             output_dir: '/var/home/instruct/.local/share/instructlab/datasets'         [type: str, src: default_map]
        rouge_threshold: 0.9            [type: float, src: default]
                  quiet: False          [type: bool, src: default]
           endpoint_url: None           [type: None, src: default]
                api_key: 'no_api_key'   [type: str, src: default]
             yaml_rules: None           [type: None, src: default]
        server_ctx_size: 4096           [type: int, src: default]
           tls_insecure: False          [type: bool, src: default]
        tls_client_cert: ''             [type: str, src: default]
         tls_client_key: ''             [type: str, src: default]
      tls_client_passwd: ''             [type: str, src: default]
           model_family: None           [type: None, src: default]
               pipeline: '/usr/share/instructlab/sdg/pipelines/agentic'         [type: str, src: default_map]
             batch_size: None           [type: None, src: default]
INFO 2024-09-13 09:09:52,501 numexpr.utils:161: NumExpr defaulting to 4 threads.
INFO 2024-09-13 09:09:54,266 datasets:59: PyTorch version 2.3.1 available.
DEBUG 2024-09-13 09:09:57,328 instructlab.model.backends.backends:254: Auto-detecting backend for model /var/home/instruct/.cache/instructlab/models/granite-7b-redhat-lab
DEBUG 2024-09-13 09:09:57,364 instructlab.model.backends.backends:212: Model is huggingface safetensors and system is Linux, using vllm backend.
DEBUG 2024-09-13 09:09:57,364 instructlab.model.backends.backends:262: Auto-detected backend: vllm

DEBUG 2024-09-13 09:10:00,167 instructlab.model.backends.vllm:205: vLLM serving command is: ['/opt/app-root/bin/python3.11', '-m', 'vllm.entrypoints.openai.api_server', '--host', '127.0.0.1', '--port', '57795', '--model', '/var/home/instruct/.cache/instructlab/models/granite-7b-redhat-lab', '--distributed-executor-backend', 'mp', '--enable-lora', '--max-lora-rank', '64', '--dtype', 'bfloat16', '--lora-dtype', 'bfloat16', '--fully-sharded-loras', '--lora-modules', 'skill-classifier-v3-clm=/var/home/instruct/.cache/instructlab/models/skills-adapter-v3', 'text-classifier-knowledge-v3-clm=/var/home/instruct/.cache/instructlab/models/knowledge-adapter-v3', '--tensor-parallel-size', '1']

In the config, I only switched from mixtral to granite, but I will try the merlinite model you suggest as well.

generate:
    chunk_word_count: 1000
    model: /var/home/instruct/.cache/instructlab/models/granite-7b-redhat-lab
    num_cpus: 10
    output_dir: /var/home/instruct/.local/share/instructlab/datasets
    pipeline: /usr/share/instructlab/sdg/pipelines/agentic
    prompt_file: /var/home/instruct/.local/share/instructlab/internal/prompt.txt
    sdg_scale_factor: 30
    seed_file: /var/home/instruct/.local/share/instructlab/internal/seed_tasks.json
    taxonomy_base: empty
    taxonomy_path: /var/home/instruct/.local/share/instructlab/taxonomy
    teacher:
        backend: vllm
        chat_template: tokenizer
        host_port: 127.0.0.1:8000
        llama_cpp:
            gpu_layers: -1
            llm_family: ''
            max_ctx_size: 4096
        model_path: /var/home/instruct/.cache/instructlab/models/granite-7b-redhat-lab
        vllm:
            gpus: 1
            llm_family: granite
            max_startup_attempts: 50
            vllm_args:
            - --enable-lora
            - --max-lora-rank
            - '64'
            - --dtype
            - bfloat16
            - --lora-dtype
            - bfloat16
            - --fully-sharded-loras
            - --lora-modules
            - skill-classifier-v3-clm=/var/home/instruct/.cache/instructlab/models/skills-adapter-v3
            - text-classifier-knowledge-v3-clm=/var/home/instruct/.cache/instructlab/models/knowledge-adapter-v3

Since this is the RHEL AI workshop environment, there were some preloaded taxonomies; I disabled them all and left only mine. The logs confirm this:

<mine qna.yaml>
`account-serrver.conf-sample` contains a typo. Therefore\nit cannot be present at the expected location. Causing build to fail.'}]
DEBUG 2024-09-13 09:11:21,801 instructlab.sdg:400: Dataset: Dataset({
    features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3'],
    num_rows: 209
})

It seems the model may actually be the problem, because there is no tokenizer for it:

WARNING 09-13 09:11:22 tokenizer.py:145] No tokenizer found in /var/home/instruct/.cache/instructlab/models/knowledge-adapter-v3, using base model tokenizer instead. (Exception: Incorrect path_or_model_id: '/var/home/instruct/.cache/instructlab/models/knowledge-adapter-v3'. Please provide either the path to a local folder or the repo_id of a model on the Hub.)
INFO:     127.0.0.1:36980 - "POST /v1/completions HTTP/1.1" 400 Bad Request

I can see a ton of 400s in the logs.

The end of the log is pretty confusing:

DEBUG 2024-09-13 09:11:22,173 instructlab.sdg.llmblock:184: Generating outputs for 1 samples
INFO:     127.0.0.1:36980 - "POST /v1/completions HTTP/1.1" 400 Bad Request
INFO:     127.0.0.1:36972 - "POST /v1/completions HTTP/1.1" 400 Bad Request
INFO:     127.0.0.1:36980 - "POST /v1/completions HTTP/1.1" 400 Bad Request
DEBUG 2024-09-13 09:11:22,181 instructlab.model.backends.backends:317: Sending SIGINT to vLLM server PID 34
DEBUG 2024-09-13 09:11:22,181 instructlab.model.backends.backends:321: Waiting for vLLM server to shut down gracefully
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [34]
INFO 09-13 09:11:22 async_llm_engine.py:51] Engine is gracefully shutting down.
DEBUG 2024-09-13 09:11:25,200 instructlab.model.backends.backends:336: Nothing left to clean up with the vLLM process group
INFO 2024-09-13 09:11:25,200 instructlab.model.backends.backends:351: Waiting for GPU VRAM reclamation...
DEBUG 2024-09-13 09:11:28,773 instructlab.model.backends.backends:402: GPU free vram stable (stable count 1, free 23368695808, last free 23368695808)
DEBUG 2024-09-13 09:11:29,773 instructlab.model.backends.backends:402: GPU free vram stable (stable count 2, free 23368695808, last free 23368695808)
DEBUG 2024-09-13 09:11:30,774 instructlab.model.backends.backends:402: GPU free vram stable (stable count 3, free 23368695808, last free 23368695808)
DEBUG 2024-09-13 09:11:31,774 instructlab.model.backends.backends:402: GPU free vram stable (stable count 4, free 23368695808, last free 23368695808)
DEBUG 2024-09-13 09:11:32,774 instructlab.model.backends.backends:402: GPU free vram stable (stable count 5, free 23368695808, last free 23368695808)
DEBUG 2024-09-13 09:11:33,775 instructlab.model.backends.backends:402: GPU free vram stable (stable count 6, free 23368695808, last free 23368695808)
DEBUG 2024-09-13 09:11:33,775 instructlab.model.backends.backends:409: Successful sample recorded, (stable count 6, free 23368695808, last free 23368695808)
Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/sdg/pipeline.py", line 201, in _generate_single
    dataset = block.generate(dataset)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/sdg/llmblock.py", line 208, in generate
    outputs = self._generate(samples)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/sdg/llmblock.py", line 162, in _generate
    response = self.ctx.client.completions.create(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/openai/_utils/_utils.py", line 274, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/openai/resources/completions.py", line 528, in create
    return self._post(
           ^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/openai/_base_client.py", line 1260, in post
    return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/openai/_base_client.py", line 937, in request
    return self._request(
           ^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/openai/_base_client.py", line 1041, in _request
    raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': 'allowed_token_ids contains out-of-vocab token id', 'type': 'BadRequestError', 'param': None, 'code': 400}

followed by the exception from my original post:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/app-root/bin/ilab", line 8, in <module>
    sys.exit(ilab())
             ^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/clickext.py", line 306, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/data/generate.py", line 305, in generate
    generate_data(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/sdg/generate_data.py", line 401, in generate_data
    new_generated_data = pipe.generate(ds, leaf_node_path)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/sdg/pipeline.py", line 174, in generate
    ds = future.result()
         ^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/lib64/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/sdg/pipeline.py", line 203, in _generate_single
    raise PipelineBlockError(
instructlab.sdg.pipeline.PipelineBlockError: PipelineBlockError(<class 'instructlab.sdg.llmblock.LLMBlock'>/router): Error code: 400 - {'object': 'error', 'message': 'allowed_token_ids contains out-of-vocab token id', 'type': 'BadRequestError', 'param': None, 'code': 400}

@TomasTomecek
Author

With merlinite, I'm in a similar spot:

[rank0]: Traceback (most recent call last):
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 282, in <module>
[rank0]:     run_server(args)
[rank0]:   File "/opt/app-root/lib64/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 224, in run_server
[rank0]:     if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]:                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 444, in from_engine_args
[rank0]:     engine = cls(
[rank0]:              ^^^^
[rank0]:   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 373, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 525, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 263, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/opt/app-root/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 375, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 62, in initialize_cache
[rank0]:     self._run_workers("initialize_cache",
[rank0]:   File "/opt/app-root/lib64/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 135, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 212, in initialize_cache
[rank0]:     raise_if_cache_size_invalid(num_gpu_blocks,
[rank0]:   File "/opt/app-root/lib64/python3.11/site-packages/vllm/worker/worker.py", line 372, in raise_if_cache_size_invalid
[rank0]:     raise ValueError(
[rank0]: ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (27536). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

I tried increasing the GPU memory utilization as the error suggests, but it didn't help. Neither did lowering the context length.
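
For reference, a sketch of how these knobs can be passed to the vLLM server via vllm_args in the teacher section of the config (the values below are only illustrative, and I'm assuming the flags are forwarded verbatim to vLLM):

teacher:
    vllm:
        vllm_args:
        - --gpu-memory-utilization
        - '0.95'
        - --max-model-len
        - '4096'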

I'd appreciate any help.

@bbrowning
Contributor

OK, at this point I think we should redirect this to the specific Red Hat support channels, as it looks like you're hitting issues specific to how you're using RHEL AI that won't be present in InstructLab directly. I'll reach out to you via those channels.
