To use a guardrails configuration in streaming mode, the following conditions must be met:
- The main LLM must support streaming.
- The configuration must not include any output rails.
To activate streaming on a guardrails configuration, add the following to your `config.yml`:

```yaml
streaming: True
```
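For context, a minimal `config.yml` with streaming enabled might look like the sketch below. The model engine and name are placeholders, not part of the original configuration:

```yaml
models:
  - type: main
    engine: openai        # placeholder engine; any streaming-capable LLM provider works
    model: gpt-3.5-turbo  # placeholder model name

streaming: True
```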
You can enable streaming when launching the NeMo Guardrails chat CLI by using the `--streaming` option:

```bash
nemoguardrails chat --config=examples/configs/streaming --streaming
```
You can use streaming directly from the Python API in two ways:
- Simple: receive only the chunks (tokens).
- Full: receive both the chunks as they are generated and the full response at the end.
For the simple usage, call the `stream_async` method on the `LLMRails` instance:

```python
from nemoguardrails import LLMRails

# `config` is a RailsConfig instance loaded beforehand.
app = LLMRails(config)

history = [{"role": "user", "content": "What is the capital of France?"}]

async for chunk in app.stream_async(messages=history):
    print(f"CHUNK: {chunk}")
    # Or do something else with the token
```
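If you are running this from a standalone script, the async generator needs an event loop. Below is a minimal sketch, assuming a guardrails configuration stored at the placeholder path `./config`:

```python
import asyncio

from nemoguardrails import LLMRails, RailsConfig


async def main():
    # Load the guardrails configuration (the path is a placeholder).
    config = RailsConfig.from_path("./config")
    app = LLMRails(config)

    history = [{"role": "user", "content": "What is the capital of France?"}]

    # Print each token as soon as it is generated.
    async for chunk in app.stream_async(messages=history):
        print(chunk, end="", flush=True)


if __name__ == "__main__":
    asyncio.run(main())
```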
For the full usage, provide a `StreamingHandler` instance to the `generate_async` method on the `LLMRails` instance:

```python
import asyncio

from nemoguardrails import LLMRails
from nemoguardrails.streaming import StreamingHandler

# `config` is a RailsConfig instance loaded beforehand.
app = LLMRails(config)

history = [{"role": "user", "content": "What is the capital of France?"}]

streaming_handler = StreamingHandler()


async def process_tokens():
    async for chunk in streaming_handler:
        print(f"CHUNK: {chunk}")
        # Or do something else with the token


asyncio.create_task(process_tokens())

result = await app.generate_async(
    messages=history, streaming_handler=streaming_handler
)
print(result)
```
For a complete working example, check out this demo script.
To make a call to the NeMo Guardrails Server in streaming mode, you have to set the `stream` parameter to `true` inside the JSON body. For example, to get the completion for a chat session using the `/v1/chat/completions` endpoint:

```
POST /v1/chat/completions
{
    "config_id": "some_config_id",
    "messages": [{
      "role": "user",
      "content": "Hello! What can you do for me?"
    }],
    "stream": true
}
```
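The server then streams the completion back as it is generated. As a rough client-side sketch (the server URL and the exact chunk framing are assumptions, not taken from the original text), you can consume the streamed response body with the `requests` library:

```python
import requests

# Assumed local server address; adjust to wherever the NeMo Guardrails server runs.
url = "http://localhost:8000/v1/chat/completions"

payload = {
    "config_id": "some_config_id",
    "messages": [{"role": "user", "content": "Hello! What can you do for me?"}],
    "stream": True,
}

# stream=True tells requests not to buffer the whole response body.
with requests.post(url, json=payload, stream=True) as response:
    response.raise_for_status()
    # Assuming the server streams plain-text chunks of the completion.
    for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
        if chunk:
            print(chunk, end="", flush=True)
```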
We also support streaming for LLMs deployed using `HuggingFacePipeline`. One example is provided in the HF Pipeline Dolly configuration.

To use streaming for HF Pipeline LLMs, you first need to set the streaming flag in your `config.yml`:

```yaml
streaming: True
```
Then you need to create an `AsyncTextIteratorStreamer` streamer object (from `nemoguardrails.llm.providers.huggingface`), add it to the `kwargs` of the pipeline, and add it to the `model_kwargs` of the `HuggingFacePipelineCompatible` object:
```python
from transformers import pipeline

from nemoguardrails.llm.providers import HuggingFacePipelineCompatible
from nemoguardrails.llm.providers.huggingface import AsyncTextIteratorStreamer

# ... instantiate the tokenizer and model required by the LLM ...

# The streamer yields the generated tokens asynchronously as they are produced.
streamer = AsyncTextIteratorStreamer(tokenizer, skip_prompt=True)
params = {"temperature": 0.01, "max_new_tokens": 100, "streamer": streamer}

pipe = pipeline(
    # ... all other pipeline parameters ...
    **params,
)

llm = HuggingFacePipelineCompatible(pipeline=pipe, model_kwargs=params)
```
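Finally, as a rough sketch of how this LLM can be wired into a guardrails app (the configuration path is a placeholder, and passing the LLM directly to the constructor is one possible approach), you can hand the custom `llm` to `LLMRails` and stream from it as before:

```python
from nemoguardrails import LLMRails, RailsConfig

# Placeholder path to the guardrails configuration with `streaming: True` set.
config = RailsConfig.from_path("./config")

# Pass the HF pipeline LLM created above directly to LLMRails.
app = LLMRails(config, llm=llm)

history = [{"role": "user", "content": "What is the capital of France?"}]

async for chunk in app.stream_async(messages=history):
    print(chunk, end="", flush=True)
```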