
Streaming

Streaming lets you receive LLM output as it is generated, chunk by chunk, instead of waiting for the entire response to complete. This is essential for chat interfaces where perceived latency matters — the user sees tokens appear in real time.

Prompty wraps the raw SDK stream in a PromptyStream (or AsyncPromptyStream) that accumulates chunks for tracing, then hands them to the Processor, which extracts usable content deltas and yields them to your application.

flowchart LR
    A["SDK Stream"] --> B["PromptyStream\naccumulates chunks\ntraces on exhaust"]
    B --> C["Processor\nextracts delta.content\nyields text chunks"]
    C --> D["Application"]

    style A fill:#e5e7eb,stroke:#6b7280,color:#374151
    style B fill:#1d4ed8,stroke:#3b82f6,color:#fff
    style C fill:#10b981,stroke:#059669,color:#fff
    style D fill:#f59e0b,stroke:#d97706,color:#fff

Set stream: true in the model’s additionalProperties inside the options block. You can do this in the .prompty file or override it at runtime.

---
name: streaming-chat
model:
  id: gpt-4o-mini
  provider: openai
  apiType: chat
  connection:
    kind: key
    endpoint: ${env:OPENAI_API_BASE:https://api.openai.com/v1}
    apiKey: ${env:OPENAI_API_KEY}
  options:
    temperature: 0.7
    additionalProperties:
      stream: true
---
system:
You are a helpful assistant.
user:
{{question}}

from prompty import load
agent = load("chat.prompty")
# Enable streaming by mutating options before execution
agent.model.options.additionalProperties["stream"] = True

from prompty import load, run, process

agent = load("chat.prompty")
agent.model.options.additionalProperties["stream"] = True

# run with raw=True returns the PromptyStream
stream = run(agent, inputs={"question": "Tell me a joke"}, raw=True)

# process yields text chunks
for chunk in process(agent, stream):
    print(chunk, end="", flush=True)
print()  # newline after the stream completes

The streaming processor does more than just forward chunks. It handles several edge cases from the OpenAI streaming protocol:

| Scenario | Behavior |
| --- | --- |
| Content deltas | Each delta.content string is yielded directly to the caller. |
| Tool-call deltas | Argument fragments are accumulated across chunks. A complete ToolCall object is yielded when the stream ends. |
| Refusal | If delta.refusal is present, the processor raises a ValueError with the refusal text. |
| Empty / heartbeat chunks | Chunks with no content, tool-call, or refusal data are silently skipped. |
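The edge-case handling above can be sketched as a small generator. This is an illustrative approximation, not the real Prompty processor: the chunk/delta shapes and the ToolCall class here are simplified stand-ins.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    # Simplified stand-in for Prompty's ToolCall; the real class may differ.
    name: str = ""
    arguments: str = ""


def process_stream(chunks):
    """Sketch of the table above: forward content, accumulate tool-call
    fragments, raise on refusal, skip empty heartbeat chunks."""
    pending_tool = None
    for chunk in chunks:
        delta = chunk["delta"]
        if delta.get("refusal"):
            # Refusal: surface the refusal text as an error.
            raise ValueError(delta["refusal"])
        if delta.get("content"):
            # Content delta: yield the text directly to the caller.
            yield delta["content"]
        elif delta.get("tool_call"):
            # Tool-call delta: accumulate argument fragments across chunks.
            frag = delta["tool_call"]
            if pending_tool is None:
                pending_tool = ToolCall(name=frag.get("name", ""))
            pending_tool.arguments += frag.get("arguments", "")
        # Otherwise: empty/heartbeat chunk, silently skipped.
    if pending_tool is not None:
        # Stream ended: yield the completed ToolCall.
        yield pending_tool
```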

A common concern with streaming is losing observability — if chunks are consumed lazily, when does the trace fire?

Prompty solves this with the PromptyStream wrapper:

  1. The executor wraps the raw SDK iterator in a PromptyStream.
  2. As your application (or the processor) iterates, each chunk is forwarded and appended to an internal accumulator.
  3. When the iterator is exhausted, PromptyStream flushes the complete accumulated response to the active tracer.
iterate chunk 1 → yield + accumulate
iterate chunk 2 → yield + accumulate
iterate chunk 3 → yield + accumulate
...
StopIteration → flush accumulated data to tracer ✓
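The accumulate-then-flush behavior traced above can be sketched as a minimal iterator wrapper. This is illustrative only; the real PromptyStream has more responsibilities, and the on_exhaust callback here stands in for the active tracer.

```python
class AccumulatingStream:
    """Sketch of the PromptyStream idea: forward each chunk, keep a copy,
    and flush the full accumulation once the stream is exhausted."""

    def __init__(self, name, iterator, on_exhaust):
        self.name = name
        self._iterator = iter(iterator)
        self._accumulated = []
        self._on_exhaust = on_exhaust  # stands in for the active tracer

    def __iter__(self):
        return self

    def __next__(self):
        try:
            chunk = next(self._iterator)
        except StopIteration:
            # Stream exhausted: flush the complete response to the "tracer".
            self._on_exhaust(self.name, self._accumulated)
            raise
        self._accumulated.append(chunk)  # accumulate for tracing
        return chunk                     # forward to the consumer
```

Because the flush happens inside `__next__` on StopIteration, the trace fires no matter who drains the stream: your application, the processor, or the agent loop.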

The same applies to AsyncPromptyStream for async iteration.


When using execute_agent(), the runtime runs a tool-calling loop: call the LLM, execute any requested tools, append results, repeat.

Streaming still works inside the agent loop. The executor does not disable streaming — instead it consumes the stream through the processor internally to detect tool calls:

  1. The LLM streams a response with tool_calls deltas.
  2. The processor accumulates fragments and yields a ToolCall.
  3. The agent loop executes the tool function and appends the result.
  4. The loop re-calls the LLM (still streaming).
  5. When the model finally returns a content-only response, those chunks are forwarded to your application as normal text deltas.
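The five steps above can be sketched as a loop. Everything here is an assumed interface for illustration: call_llm is a stand-in that yields text chunks and completed tool-call objects, and this is not the real execute_agent signature.

```python
def agent_loop(call_llm, tools, messages):
    """Sketch of the streaming tool-call loop: stream a response, run any
    requested tools, append results, and re-call until content-only."""
    while True:
        tool_requested = False
        for item in call_llm(messages):
            if isinstance(item, str):
                # Content delta: forward to the application as-is.
                yield item
            else:
                # Completed tool call: execute it and append the result.
                result = tools[item.name](item.arguments)
                messages.append({"role": "tool", "name": item.name,
                                 "content": str(result)})
                tool_requested = True
        if not tool_requested:
            # Content-only response: the loop ends.
            return
```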