Calling the Large Model Agent API: Streaming with the stream Option

When calling an agent built with BISHENG毕昇 through its API, I found that the request accepts an option called stream (Dify should be similar).

import requests
import json

url = "https://xxx:xxx/api/v2/workflow/invoke"

payload = json.dumps({
   "workflow_id": "xxx",
   "stream": False, # If empty or not passed, the workflow events will be returned as a stream. In this example, to intuitively show the return result, it is set to non-streaming. In real scenarios, streaming is recommended for better user experience.
})

headers = {
   'Content-Type': 'application/json'
}

response = requests.request("POST", 
                            url, 
                            headers=headers, 
                            data=payload, 
                            verify=False)

print(response.text) # Output the workflow response

If stream is set to false, the response arrives as a single complete JSON object, for example:

{"status_code":200,"status_message":"SUCCESS","data":{"session_id":"xxx","events":[{"event":"guide_word","message_id":null,"status":"end","node_id":"start_31155","node_execution_id":"xxx","output_schema":{"message":"xxx","reasoning_content":null,"output_key":null,"files":null,"source_url":null,"extra":null},"input_schema":null},{"event":"guide_question","message_id":null,"status":"end","node_id":"start_31155","node_execution_id":"xxx","output_schema":{"message":[""],"reasoning_content":null,"output_key":null,"files":null,"source_url":null,"extra":null},"input_schema":null},{"event":"input","message_id":"131572","status":"end","node_id":"input_75aa8","node_execution_id":null,"output_schema":null,"input_schema":{"input_type":"dialog_input","value":[{"key":"user_input","type":"text","value":"","label":null,"multiple":false,"required":true,"options":null,"file_type":null},{"key":"dialog_files_content","type":"dialog_file","value":[],"label":"Upload file content","multiple":false,"required":false,"options":null,"file_type":null},{"key":"dialog_file_accept","type":"dialog_file_accept","value":"all","label":"Upload file types","multiple":false,"required":false,"options":null,"file_type":null}]}}]}}

If stream is set to true, the response is returned in chunks, one event per data: line, and session_id is no longer nested inside data but sits at the top level of each chunk:

data: {"session_id":"xxx","data":{"event":"guide_word","message_id":null,"status":"end","node_id":"start_31155","node_execution_id":"xxx","output_schema":{"message":"Please upload your medical check-up report","reasoning_content":null,"output_key":null,"files":null,"source_url":null,"extra":null},"input_schema":null}}

data: {"session_id":"xxx","data":{"event":"guide_question","message_id":null,"status":"end","node_id":"start_31155","node_execution_id":"xxx","output_schema":{"message":[""],"reasoning_content":null,"output_key":null,"files":null,"source_url":null,"extra":null},"input_schema":null}}

data: {"session_id":"xxx","data":{"event":"input","message_id":"131650","status":"end","node_id":"input_75aa8","node_execution_id":null,"output_schema":null,"input_schema":{"input_type":"dialog_input","value":[{"key":"user_input","type":"text","value":"","label":null,"multiple":false,"required":true,"options":null,"file_type":null},{"key":"dialog_files_content","type":"dialog_file","value":[],"label":"Upload file content","multiple":false,"required":false,"options":null,"file_type":null},{"key":"dialog_file_accept","type":"dialog_file_accept","value":"all","label":"Upload file types","multiple":false,"required":false,"options":null,"file_type":null}]}}}

Each chunk then needs to be parsed separately.
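As a minimal sketch of that processing: each streamed line starts with data: followed by a JSON object, so stripping the prefix and decoding the rest yields one event per line. The helper name parse_workflow_stream is hypothetical, and the sample lines are abbreviated from the response above. With requests, the same function could presumably be fed from response.iter_lines(decode_unicode=True) after passing stream=True to the request call.

```python
import json
from typing import Iterable, Iterator


def parse_workflow_stream(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded chunk per `data: ...` line, skipping blank separators."""
    prefix = "data:"
    for line in lines:
        line = line.strip()
        if not line.startswith(prefix):
            continue  # blank line between chunks, or anything unexpected
        yield json.loads(line[len(prefix):].strip())


# Abbreviated sample of the streamed response shown above.
sample = [
    'data: {"session_id":"xxx","data":{"event":"guide_word","status":"end","node_id":"start_31155"}}',
    "",
    'data: {"session_id":"xxx","data":{"event":"input","status":"end","node_id":"input_75aa8"}}',
    "",
]

events = list(parse_workflow_stream(sample))
for chunk in events:
    # session_id sits at the top level; the event itself is under "data".
    print(chunk["session_id"], chunk["data"]["event"])
```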

This article is transcoded by SimpRead, original at platform.moonshot.cn

After the Kimi large model receives a user’s question, it first performs inference and then generates the answer token by token. In our examples in the first two chapters, we chose to wait until the Kimi large model has generated all tokens before printing the reply content, which usually takes several seconds. If your question is complex enough and the Kimi model’s reply is long enough, the total waiting time for the model to generate the result might extend to 10 or even 20 seconds, which greatly reduces the user experience. To improve this and provide timely feedback to users, we offer streaming output capability, i.e., Streaming. We will explain the principle of Streaming and illustrate it with practical code:

  • How to use streaming output;
  • Common issues when using streaming output;
  • How to handle streaming output without using the Python SDK;

How to Use Streaming Output

Streaming output, in short, means that whenever the Kimi large model generates a certain number of tokens (usually 1 token), it immediately sends these tokens to the client instead of waiting for all tokens to be generated before transmitting them. When you chat with the Kimi Smart Assistant, its replies appear character by character, "jumping" out as they arrive, which is one manifestation of streaming output. Streaming output lets the user see the first token as soon as the Kimi model produces it, reducing waiting time.

You can use streaming output and receive streaming responses like this (stream=True):


from openai import OpenAI
 
client = OpenAI(
    api_key = "MOONSHOT_API_KEY", # Replace MOONSHOT_API_KEY here with the API Key you applied for from the Kimi Open Platform
    base_url = "https://api.moonshot.cn/v1",
)
 
stream = client.chat.completions.create(
    model = "kimi-k2-turbo-preview",
    messages = [
        {"role": "system", "content": "You are Kimi, an AI assistant provided by Moonshot AI, more proficient in Chinese and English conversations. You provide safe, helpful, and accurate answers to users. Meanwhile, you refuse to answer any questions involving terrorism, racial discrimination, adult content, violence, etc. Moonshot AI is a proper noun and should not be translated into other languages."},
        {"role": "user", "content": "Hello, my name is Li Lei, what is 1+1?"}
    ],
    temperature = 0.6,
    stream=True, # <--- Note here, we enable streaming output mode by setting stream=True
)
 
# When streaming output mode (stream=True) is enabled, the SDK's returned content changes — we no longer access choice directly from the return value
# Instead, we iterate over each individual chunk in the return value
 
for chunk in stream:
    # Here, each chunk's structure is similar to the previous completion, but the message field is replaced by delta
    delta = chunk.choices[0].delta # <--- message field replaced by delta
 
    if delta.content:
        # When printing the content, since this is streaming output, to keep sentence coherence, we do not add line breaks manually
        # Therefore, we set end="" to cancel the print's default newline.
        print(delta.content, end="")

Common Issues When Using Streaming Output

After successfully running the above code and understanding streaming output basics, now let’s explain some streaming details and common issues to help you implement your business logic better.

Interface Details

When streaming output mode (stream=True) is enabled, the Kimi large model no longer returns a JSON format response (Content-Type: application/json) but uses Content-Type: text/event-stream (SSE for short). This response format supports the server continuously transmitting data to the client. In the context of Kimi large model, it means the server continuously transmits tokens to the client.

When you look at the SSE HTTP response body, it looks like this:

data: {"id":"cmpl-1305b94c570f447fbde3180560736287","object":"chat.completion.chunk","created":1698999575,"model":"kimi-k2-turbo-preview","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
 
data: {"id":"cmpl-1305b94c570f447fbde3180560736287","object":"chat.completion.chunk","created":1698999575,"model":"kimi-k2-turbo-preview","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}
 
...
 
data: {"id":"cmpl-1305b94c570f447fbde3180560736287","object":"chat.completion.chunk","created":1698999575,"model":"kimi-k2-turbo-preview","choices":[{"index":0,"delta":{"content":"."},"finish_reason":null}]}
 
data: {"id":"cmpl-1305b94c570f447fbde3180560736287","object":"chat.completion.chunk","created":1698999575,"model":"kimi-k2-turbo-preview","choices":[{"index":0,"delta":{},"finish_reason":"stop","usage":{"prompt_tokens":19,"completion_tokens":13,"total_tokens":32}}]}
 
data: [DONE]

In the SSE response body, by convention each data chunk is prefixed with data:, followed by a valid JSON object, and ends with two newlines \n\n to mark the end of the current data chunk. Finally, once all data chunks have been transmitted, data: [DONE] is sent to mark the completion of data transmission, at which point the network connection can be closed.
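To make the framing concrete, here is a small sketch that splits an already-received body on the blank-line separator and stops at [DONE]. The function name split_sse_body is hypothetical, and the sample body is a shortened version of the response above (only the choices field is kept). It works on a complete string for clarity; a real client would normally process the lines as they arrive.

```python
import json


def split_sse_body(body: str) -> list:
    """Split a complete SSE body into decoded chunks, stopping at [DONE]."""
    chunks = []
    for block in body.split("\n\n"):
        block = block.strip()
        if not block.startswith("data:"):
            continue  # empty trailing block or anything that is not a data chunk
        payload = block[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker, not JSON
        chunks.append(json.loads(payload))
    return chunks


body = (
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}\n\n'
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}\n\n'
    'data: {"choices":[{"index":0,"delta":{"content":"."},"finish_reason":null}]}\n\n'
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}\n\n'
    "data: [DONE]\n\n"
)

chunks = split_sse_body(body)
text = "".join(c["choices"][0]["delta"].get("content", "") for c in chunks)
print(text)  # Hello.
```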

Tokens Calculation

When using streaming output mode, there are two ways to calculate tokens. The most direct and accurate way is to wait until all data chunks have been transmitted and then access the last data chunk’s usage field to see the total prompt_tokens/completion_tokens/total_tokens generated during the entire streaming output process.

...
 
data: {"id":"cmpl-1305b94c570f447fbde3180560736287","object":"chat.completion.chunk","created":1698999575,"model":"kimi-k2-turbo-preview","choices":[{"index":0,"delta":{},"finish_reason":"stop","usage":{"prompt_tokens":19,"completion_tokens":13,"total_tokens":32}}]}
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                               Accessing the usage field in the last data chunk to see the tokens count generated by the current request
data: [DONE]

However, in practice a stream can be interrupted by factors outside your control (e.g., network disconnection, client errors), in which case the last data chunk often never arrives in full and the total tokens consumed by the request cannot be read from it. To keep token accounting working in such scenarios, we recommend saving the content of each received data chunk and, after the request ends (whether successfully or not), using the token calculation API to compute the tokens consumed. Example code:


import os
import httpx
from openai import OpenAI
 
client = OpenAI(
    api_key = "MOONSHOT_API_KEY", # Replace MOONSHOT_API_KEY here with the API Key you applied for from the Kimi Open Platform
    base_url = "https://api.moonshot.cn/v1",
)
 
stream = client.chat.completions.create(
    model = "kimi-k2-turbo-preview",
    messages = [
        {"role": "system", "content": "You are Kimi, an AI assistant provided by Moonshot AI, more proficient in Chinese and English conversations. You provide safe, helpful, and accurate answers to users. Meanwhile, you refuse to answer any questions involving terrorism, racial discrimination, adult content, violence, etc. Moonshot AI is a proper noun and should not be translated into other languages."},
        {"role": "user", "content": "Hello, my name is Li Lei, what is 1+1?"}
    ],
    temperature = 0.6,
    stream=True, # <--- Note here, we enable streaming output mode by setting stream=True
)
 
 
def estimate_token_count(input: str) -> int:
    """
    Implement your Tokens calculation logic here, or directly use our Tokens calculation API to calculate Tokens

    https://api.moonshot.cn/v1/tokenizers/estimate-token-count
    """
    header = {
        "Authorization": f"Bearer {os.environ['MOONSHOT_API_KEY']}",
    }
    data = {
        "model": "kimi-k2-turbo-preview",
        "messages": [
            {"role": "user", "content": input},
        ]
    }
    r = httpx.post("https://api.moonshot.cn/v1/tokenizers/estimate-token-count", headers=header, json=data)
    r.raise_for_status()
    return r.json()["data"]["total_tokens"]
 
 
completion = []
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        completion.append(delta.content)
 
 
print("completion_tokens:", estimate_token_count("".join(completion)))

How to Terminate Output

If you want to terminate the streaming output, you can directly close the HTTP network connection or discard the subsequent data chunks. For example:

for chunk in stream:
    if condition:
        break
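As one concrete instance of condition, the sketch below stops consuming once a character budget is reached. Here fake_stream is a hypothetical stand-in for the text fragments taken from a real stream (the delta.content of each chunk), and max_chars is an illustrative limit, not an API parameter.

```python
from typing import Iterable, Iterator


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for the delta.content values of a real stream."""
    yield from ["1", " + ", "1", " = ", "2", ", Li Lei."]


def read_until(fragments: Iterable[str], max_chars: int) -> str:
    """Collect fragments, abandoning the stream once max_chars is reached."""
    received = []
    total = 0
    for fragment in fragments:
        received.append(fragment)
        total += len(fragment)
        if total >= max_chars:
            break  # remaining fragments are never requested
    return "".join(received)


result = read_until(fake_stream(), max_chars=8)
print(result)
```

Because the loop simply stops iterating, the fragments after the break are never pulled from the generator; with a real HTTP response, the connection should still be closed afterwards (for example by leaving the context manager that opened it).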

How to Handle Streaming Output Without Using SDK

If you do not want to use the Python SDK to handle streaming output and prefer to call the HTTP interface of the Kimi large model directly (e.g., in a language for which no SDK is available, or when you have business logic the SDK cannot satisfy), the example below shows how to handle the HTTP SSE response body correctly (we still use Python here; the detailed explanation is given in the comments).


import json
import os

import httpx  # We use the httpx library to perform HTTP requests


payload = {
    "model": "kimi-k2-turbo-preview",
    "messages": [
        # Specific messages
    ],
    "temperature": 0.6,
    "stream": True,
}
headers = {"Authorization": f"Bearer {os.environ['MOONSHOT_API_KEY']}"}

# Send the chat request to the Kimi large model with httpx.stream, so that the
# response r can be read while it is still arriving (httpx.post would buffer the whole body first)
with httpx.stream("POST", "https://api.moonshot.cn/v1/chat/completions", headers=headers, json=payload) as r:
    if r.status_code != 200:
        r.read()  # In streaming mode the body must be read before r.text is available
        raise Exception(r.text)

    data = ""  # Buffer for the data chunk currently being received

    # Here, we use the iter_lines method to read the response body line by line
    for line in r.iter_lines():
        # Strip leading and trailing whitespace from each line for easier data chunk handling
        line = line.strip()

        # Next, we handle three different cases:
        #   1. If the current line is empty, the previous data chunk is complete (as mentioned, data chunks end with two newlines), so we can deserialize the data chunk and print the corresponding content;
        #   2. If the current line is non-empty and starts with data:, it marks the start of a data chunk. We remove the data: prefix and first check whether it is the end signal [DONE]. If not, we save the content to the data variable;
        #   3. If the current line is non-empty but does not start with data:, it still belongs to the data chunk currently being transmitted, so we append the line to the data variable;

        if len(line) == 0:
            if not data:
                continue  # Nothing buffered yet, e.g. consecutive blank lines
            chunk = json.loads(data)

            # Here you can replace this logic with your own business logic. Printing is just to show the process.
            choice = chunk["choices"][0]
            usage = choice.get("usage")
            if usage:
                print("total_tokens:", usage["total_tokens"])
            delta = choice["delta"]
            role = delta.get("role")
            if role:
                print("role:", role)
            content = delta.get("content")
            if content:
                print(content, end="")

            data = ""  # Reset data
        elif line.startswith("data: "):
            # Slice off the prefix; str.lstrip("data: ") would strip a set of characters, not the prefix
            data = line[len("data: "):]

            # When the data chunk content is [DONE], all data chunks have been sent and the connection can be closed
            if data == "[DONE]":
                break
        else:
            data = data + "\n" + line  # We still add a newline when appending because the data chunk might intentionally split content into multiple lines
The above describes how to process streaming output, using Python as an example. The same approach works in other languages. The basic steps are as follows:

1. Initiate an HTTP request and set the `stream` parameter to `true` in the request body;
2. Receive the server's response; if the response headers contain `Content-Type: text/event-stream`, the response content is streaming output;
3. Read the response content line by line and parse the data chunks (each data chunk is a JSON object). A data chunk starts with the `data:` prefix and ends with a blank line, i.e. two consecutive newline characters `\n\n`;
4. Check whether the current data chunk content is `[DONE]` to know whether the transmission is complete.

_Note: Always use `data: [DONE]` to determine whether the data transmission is complete, rather than relying on `finish_reason` or other methods. If the `data: [DONE]` message block has not been received, the transmission should not be regarded as complete, even if `finish_reason=stop` has already been seen. In other words, until a data chunk with `data: [DONE]` is received, the message should be considered **incomplete**._
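The rule in the note can be sketched as follows. The helper collect_reply is hypothetical and takes the payloads of the data chunks with the `data: ` prefix already removed; it reports a reply as complete only when `[DONE]` was actually received.

```python
import json
from typing import Iterable, Tuple


def collect_reply(payloads: Iterable[str]) -> Tuple[str, bool]:
    """Return (text, complete); complete is True only if [DONE] was seen."""
    parts = []
    for payload in payloads:
        if payload == "[DONE]":
            return "".join(parts), True
        choice = json.loads(payload)["choices"][0]
        parts.append(choice["delta"].get("content") or "")
    # The stream ended without [DONE]: treat the reply as incomplete,
    # even if a chunk with finish_reason == "stop" was received.
    return "".join(parts), False


full = [
    '{"choices":[{"index":0,"delta":{"content":"Hi"},"finish_reason":null}]}',
    '{"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "[DONE]",
]
truncated = full[:2]  # simulates a connection dropped before [DONE]

print(collect_reply(full))
print(collect_reply(truncated))
```

A caller could use the second element of the result to decide whether to retry or to fall back to estimating token usage from the text received so far.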

During streaming output, only the `content` field is streamed, meaning each data chunk includes partial Tokens of the `content`. For fields that do not need streaming output, such as `role` and `usage`, we usually present them once in either the first or the last data chunk and do not include `role` and `usage` in every data chunk (specifically, the `role` field only appears in the first data chunk and will not be present in subsequent chunks; the `usage` field only appears in the last data chunk and will not be present in the previous chunks).

### How to handle when n > 1

In some cases, we want to output multiple results for selection. The correct approach is to set the `n` parameter in the request to a value greater than 1. Streaming output also supports using `n > 1`. In this situation, extra code is needed to check the `index` value of the current data chunk to determine which reply the transmitted data chunk belongs to. The following example code illustrates this:


import json
import os

import httpx  # We use the httpx library to perform our HTTP request

payload = {
    "model": "kimi-k2-turbo-preview",
    "messages": [
        # Specific messages
    ],
    "temperature": 0.6,
    "stream": True,
    "n": 2,  # <--- Note here, we request the Kimi large model to output 2 replies
}
headers = {"Authorization": f"Bearer {os.environ['MOONSHOT_API_KEY']}"}

# Here, we pre-build a list to store the different reply messages. Since we set n=2, we initialize the list with 2 elements
messages = [{}, {}]

# Use httpx.stream to send a chat request to the Kimi large model and read the response r as it arrives
with httpx.stream("POST", "https://api.moonshot.cn/v1/chat/completions", headers=headers, json=payload) as r:
    if r.status_code != 200:
        r.read()  # In streaming mode the body must be read before r.text is available
        raise Exception(r.text)

    data = ""  # Buffer for the data chunk currently being received

    # Here, we use the iter_lines method to read the response body line by line
    for line in r.iter_lines():
        # Strip leading and trailing whitespace from each line to better process data chunks
        line = line.strip()

        # Next, we handle three different cases:
        #   1. If the current line is empty, the previous data chunk has been fully received (data chunks end with two line breaks, as mentioned earlier). We can deserialize the data chunk and process its content;
        #   2. If the current line is not empty and starts with data:, this is the start of a data chunk. We remove the data: prefix and first check whether it is the termination symbol [DONE]. If not, we save the content to the data variable;
        #   3. If the current line is not empty but does not start with data:, it still belongs to the data chunk being transmitted, so we append its content to the end of the data variable;

        if len(line) == 0:
            if not data:
                continue  # Nothing buffered yet, e.g. consecutive blank lines
            chunk = json.loads(data)

            # Loop through all choices in each data chunk and get the message object corresponding to the index
            for choice in chunk["choices"]:
                index = choice["index"]
                message = messages[index]
                usage = choice.get("usage")
                if usage:
                    message["usage"] = usage
                delta = choice["delta"]
                role = delta.get("role")
                if role:
                    message["role"] = role
                content = delta.get("content")
                if content:
                    # Use .get so the first fragment does not raise KeyError
                    message["content"] = message.get("content", "") + content

            data = ""  # Reset data
        elif line.startswith("data: "):
            # Slice off the prefix; str.lstrip("data: ") would strip a set of characters, not the prefix
            data = line[len("data: "):]

            # When the data chunk content is [DONE], all data chunks have been sent and the network connection can be closed
            if data == "[DONE]":
                break
        else:
            data = data + "\n" + line  # We add a newline while appending since the data chunk might deliberately span multiple lines

# After assembling all messages, we print their content respectively
for index, message in enumerate(messages):
    print("index:", index)
    print("message:", json.dumps(message, ensure_ascii=False))


When `n > 1`, the key point in handling streaming output is that you first need to determine which reply message a data chunk belongs to based on the `index` value of the data chunk, and then proceed with subsequent logic processing.

Last updated on October 28, 2025