After the Kimi large model receives a user's question, it first performs inference and then generates the answer token by token. In the examples from the first two chapters, we waited until the model had generated all tokens before printing the reply, which usually takes several seconds. If the question is complex and the reply is long, the total wait can stretch to 10 or even 20 seconds, which noticeably degrades the user experience. To give users timely feedback, we offer streaming output (Streaming). This chapter explains how streaming works and illustrates it with practical code:
- How to use streaming output;
- Common issues when using streaming output;
- How to handle streaming output without using the Python SDK;
## How to Use Streaming Output
Streaming output, in short, means that whenever the Kimi large model has generated a certain number of tokens (usually 1 token), it immediately sends them to the client instead of waiting until all tokens have been generated. When you chat with the Kimi Smart Assistant, its replies appear character by character; that is one manifestation of streaming output. Streaming lets the user see the first token as soon as the model produces it, which reduces the waiting time.
You can use streaming output and receive streaming responses like this (stream=True):
```python
from openai import OpenAI

client = OpenAI(
    api_key="MOONSHOT_API_KEY",  # Replace MOONSHOT_API_KEY here with the API Key you applied for from the Kimi Open Platform
    base_url="https://api.moonshot.cn/v1",
)

stream = client.chat.completions.create(
    model="kimi-k2-turbo-preview",
    messages=[
        {"role": "system", "content": "You are Kimi, an AI assistant provided by Moonshot AI, more proficient in Chinese and English conversations. You provide safe, helpful, and accurate answers to users. Meanwhile, you refuse to answer any questions involving terrorism, racial discrimination, adult content, violence, etc. Moonshot AI is a proper noun and should not be translated into other languages."},
        {"role": "user", "content": "Hello, my name is Li Lei, what is 1+1?"},
    ],
    temperature=0.6,
    stream=True,  # <--- Note here, we enable streaming output mode by setting stream=True
)

# When streaming output mode (stream=True) is enabled, the SDK's return value changes:
# we no longer access choice directly from the return value.
# Instead, we iterate over each individual chunk in the return value.
for chunk in stream:
    # Each chunk's structure is similar to the previous completion, but the message field is replaced by delta
    delta = chunk.choices[0].delta  # <--- message field replaced by delta
    if delta.content:
        # Since this is streaming output, we do not add line breaks manually, to keep sentences coherent.
        # Therefore, we set end="" to suppress print's default newline.
        print(delta.content, end="")
```
## Common Issues When Using Streaming Output
Now that you have run the code above and understand the basics of streaming output, let's look at some details and common issues so that you can implement your business logic more reliably.
### Interface Details
When streaming output mode (stream=True) is enabled, the Kimi large model no longer returns a JSON response (Content-Type: application/json). Instead, it responds with Content-Type: text/event-stream, i.e. Server-Sent Events (SSE for short). This response format lets the server keep transmitting data to the client; for the Kimi large model, that means the server continuously transmits tokens to the client.
The HTTP response body of an SSE response looks like this:
```
data: {"id":"cmpl-1305b94c570f447fbde3180560736287","object":"chat.completion.chunk","created":1698999575,"model":"kimi-k2-turbo-preview","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"cmpl-1305b94c570f447fbde3180560736287","object":"chat.completion.chunk","created":1698999575,"model":"kimi-k2-turbo-preview","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

...

data: {"id":"cmpl-1305b94c570f447fbde3180560736287","object":"chat.completion.chunk","created":1698999575,"model":"kimi-k2-turbo-preview","choices":[{"index":0,"delta":{},"finish_reason":"stop","usage":{"prompt_tokens":19,"completion_tokens":13,"total_tokens":32}}]}

data: [DONE]
```
In the SSE response body, the convention is that each data chunk is prefixed with data:, followed by a valid JSON object, and ends with two newlines \n\n that mark the end of the current data chunk. Once all data chunks have been transmitted, data: [DONE] is sent to mark the completion of data transmission, at which point the network connection can be closed.
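The framing rules above can be sketched in a few lines. This is a minimal illustration, not the platform's own parser: `parse_sse_body` is a hypothetical helper name, and it assumes the whole body is already available as one string (a real client would read it incrementally).

```python
import json


def parse_sse_body(body: str) -> tuple[list[dict], bool]:
    """Split an SSE body into JSON data chunks and report whether [DONE] was seen."""
    chunks = []
    done = False
    # Data chunks are separated by an empty line, i.e. two newlines.
    for block in body.split("\n\n"):
        block = block.strip()
        if not block.startswith("data:"):
            continue  # skip empty blocks and anything that is not a data chunk
        payload = block[len("data:"):].strip()
        if payload == "[DONE]":
            done = True
            break
        chunks.append(json.loads(payload))
    return chunks, done


# A shortened body in the same shape as the example above (ids and timestamps omitted)
body = (
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}\n\n'
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}\n\n'
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}\n\n'
    "data: [DONE]\n\n"
)
chunks, done = parse_sse_body(body)
text = "".join(c["choices"][0]["delta"].get("content", "") for c in chunks)
print(text, done)
```

Returning the `done` flag separately from the chunks keeps "the JSON parsed fine" distinct from "the transmission actually finished".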
### Token Calculation
When using streaming output mode, there are two ways to count tokens. The most direct and accurate way is to wait until all data chunks have been transmitted and then read the usage field of the last data chunk, which reports the prompt_tokens, completion_tokens, and total_tokens for the entire streaming request.
```
...

data: {"id":"cmpl-1305b94c570f447fbde3180560736287","object":"chat.completion.chunk","created":1698999575,"model":"kimi-k2-turbo-preview","choices":[{"index":0,"delta":{},"finish_reason":"stop","usage":{"prompt_tokens":19,"completion_tokens":13,"total_tokens":32}}]}

data: [DONE]
```
Access the usage field in the last data chunk (the one just before data: [DONE]) to see the token counts generated by the current request.
However, in practice a streaming response may be interrupted by factors outside your control (for example, a network disconnection or a client error). In that case the last data chunk often does not arrive in full, and the total tokens consumed by the request cannot be read from it. To keep token counting working in such scenarios, we recommend saving the content of each data chunk you receive and, after the request ends (whether successfully or not), using the token calculation API to compute the total tokens consumed. Example code:
```python
import os

import httpx
from openai import OpenAI

client = OpenAI(
    api_key="MOONSHOT_API_KEY",  # Replace MOONSHOT_API_KEY here with the API Key you applied for from the Kimi Open Platform
    base_url="https://api.moonshot.cn/v1",
)

stream = client.chat.completions.create(
    model="kimi-k2-turbo-preview",
    messages=[
        {"role": "system", "content": "You are Kimi, an AI assistant provided by Moonshot AI, more proficient in Chinese and English conversations. You provide safe, helpful, and accurate answers to users. Meanwhile, you refuse to answer any questions involving terrorism, racial discrimination, adult content, violence, etc. Moonshot AI is a proper noun and should not be translated into other languages."},
        {"role": "user", "content": "Hello, my name is Li Lei, what is 1+1?"},
    ],
    temperature=0.6,
    stream=True,  # <--- Note here, we enable streaming output mode by setting stream=True
)


def estimate_token_count(input: str) -> int:
    """
    Implement your Tokens calculation logic here, or directly use our Tokens calculation API to calculate Tokens
    https://api.moonshot.cn/v1/tokenizers/estimate-token-count
    """
    header = {
        "Authorization": f"Bearer {os.environ['MOONSHOT_API_KEY']}",
    }
    data = {
        "model": "kimi-k2-turbo-preview",
        "messages": [
            {"role": "user", "content": input},
        ],
    }
    r = httpx.post("https://api.moonshot.cn/v1/tokenizers/estimate-token-count", headers=header, json=data)
    r.raise_for_status()
    return r.json()["data"]["total_tokens"]


# Save the content of every data chunk as it arrives
completion = []
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        completion.append(delta.content)

print("completion_tokens:", estimate_token_count("".join(completion)))
```
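The "whether successful or not" part can be made explicit by catching errors raised while reading the stream. The following is a minimal sketch under stated assumptions: `consume` and `flaky_stream` are hypothetical names, and chunks are modeled as plain dictionaries shaped like the JSON data chunks shown earlier rather than SDK objects.

```python
from typing import Iterable, Optional


def consume(chunks: Iterable[dict]) -> tuple[str, Optional[dict], bool]:
    """Collect streamed content and return (text, usage, interrupted)."""
    parts: list[str] = []
    usage = None
    interrupted = False
    try:
        for chunk in chunks:
            choice = chunk["choices"][0]
            content = choice["delta"].get("content")
            if content:
                parts.append(content)
            # usage, when present, arrives with the last data chunk
            usage = choice.get("usage") or usage
    except Exception:
        # Network drop, client error, ...: keep whatever was received so far
        interrupted = True
    return "".join(parts), usage, interrupted


def flaky_stream():
    """Hypothetical stream that fails before the final data chunk arrives."""
    yield {"choices": [{"index": 0, "delta": {"content": "1+1"}}]}
    yield {"choices": [{"index": 0, "delta": {"content": " equals 2"}}]}
    raise ConnectionError("connection lost")


text, usage, interrupted = consume(flaky_stream())
if usage is None:
    # No usage field was received: fall back to an estimate,
    # e.g. by passing text to estimate_token_count from the example above
    print("usage missing, estimate tokens for:", repr(text))
```

When the stream completes normally, `usage` holds the exact counts and no estimate is needed; the fallback only runs when the last data chunk never arrived.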
### How to Terminate Output
To terminate streaming output, you can simply close the HTTP network connection or discard the subsequent data chunks. For example:
```python
for chunk in stream:
    if condition:  # condition stands for your own stop criterion
        break
```
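As a concrete stop criterion, the sketch below stops once a maximum number of characters has been received. `take_until` and `fake_pieces` are hypothetical names, and the generator merely stands in for the content of successive data chunks; no request is made.

```python
from typing import Iterable, Iterator


def take_until(pieces: Iterable[str], max_chars: int) -> str:
    """Concatenate streamed text pieces, stopping once max_chars is reached."""
    received = ""
    for piece in pieces:
        received += piece
        if len(received) >= max_chars:
            break  # stop reading; the remaining data chunks are discarded
    return received


def fake_pieces() -> Iterator[str]:
    """Hypothetical stand-in for the content of successive data chunks."""
    for piece in ["Hello", ", ", "Li Lei", "! ", "1+1", " = 2"]:
        yield piece


source = fake_pieces()
result = take_until(source, max_chars=10)
print(result)
source.close()  # release the source; analogous to closing the network connection
```

With a real stream, breaking out of the loop only stops reading. Closing the underlying response or connection as well (if the object you hold exposes a way to do so) avoids keeping it open longer than needed.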
## How to Handle Streaming Output Without Using SDK
If you prefer not to use the Python SDK and want to call the HTTP interface of the Kimi large model directly (for example, because no SDK is available for your language, or because your business logic has requirements the SDK cannot satisfy), the following example shows how to handle the HTTP SSE response body correctly. We still use Python here; the detailed explanation is given in the comments.
```python
import json
import os

import httpx  # We use the httpx library to perform HTTP requests

payload = {
    "model": "kimi-k2-turbo-preview",
    "messages": [
        # Specific messages
    ],
    "temperature": 0.6,
    "stream": True,
}
headers = {"Authorization": f"Bearer {os.environ['MOONSHOT_API_KEY']}"}

# Send the chat request with httpx.stream so the response body is read incrementally instead of being buffered
with httpx.stream("POST", "https://api.moonshot.cn/v1/chat/completions", headers=headers, json=payload) as r:
    if r.status_code != 200:
        r.read()  # A streamed body must be read before r.text is available
        raise Exception(r.text)

    data = ""  # Buffer for the data chunk currently being received

    # Here, we use the iter_lines method to read the response body line by line
    for line in r.iter_lines():
        # Strip leading and trailing whitespace from each line for easier data chunk handling
        line = line.strip()

        # Next, we handle three different cases:
        # 1. If the current line is empty, the previous data chunk is complete (as mentioned, data chunks end with two newlines); we deserialize the data chunk and print the corresponding content;
        # 2. If the current line is non-empty and starts with data:, it marks the start of a data chunk. We remove the data: prefix and first check whether it is the end signal [DONE]. If not, we save the content to the data variable;
        # 3. If the current line is non-empty but does not start with data:, it still belongs to the data chunk currently being transmitted. We append the line to the data variable;
        if len(line) == 0:
            if not data:
                continue  # Nothing buffered yet (e.g. consecutive empty lines)
            chunk = json.loads(data)

            # Here you can replace this logic with your own business logic. Printing is just to show the process.
            choice = chunk["choices"][0]
            usage = choice.get("usage")
            if usage:
                print("total_tokens:", usage["total_tokens"])
            delta = choice["delta"]
            role = delta.get("role")
            if role:
                print("role:", role)
            content = delta.get("content")
            if content:
                print(content, end="")

            data = ""  # Reset data
        elif line.startswith("data: "):
            # Slice off the prefix; str.lstrip("data: ") would strip a set of characters, not the prefix
            data = line[len("data: "):]

            # When the data chunk content is [DONE], all data chunks have been sent and you can disconnect
            if data == "[DONE]":
                break
        else:
            data = data + "\n" + line  # We still add a newline when appending because the data chunk might intentionally split content into multiple lines
```
The above describes the streaming output processing flow using Python as an example. If you are using another language, you can handle streaming output in the same way. The basic steps are as follows:
1. Initiate an HTTP request and set the `stream` parameter to `true` in the request body;
2. Receive the server's response. If the response headers contain `Content-Type: text/event-stream`, the response content is streaming output;
3. Read the response content line by line and parse the data chunks (each data chunk is in JSON format). The start of a data chunk is marked by the `data:` prefix, and its end by an empty line, i.e. two newline characters `\n\n`;
4. Check whether the current data chunk content is `[DONE]` to know whether the transmission is complete.
_Note: Always use `data: [DONE]` to determine whether the data transmission is complete, rather than relying on `finish_reason` or other signals. If the `data: [DONE]` message block has not been received, the transmission must not be regarded as complete, even if `finish_reason=stop` has already been received. In other words, until a data chunk with `data: [DONE]` arrives, the message should be considered **incomplete**._
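Steps 2 and 4, together with the note, can be sketched as two small checks. Both helper names (`is_event_stream`, `is_complete`) are hypothetical; `payloads` is assumed to be the list of strings that remain after the `data:` prefix has been removed from each data chunk.

```python
import json


def is_event_stream(content_type: str) -> bool:
    """Return True if a Content-Type header value denotes an SSE response."""
    # The header value may carry parameters, e.g. "text/event-stream; charset=utf-8"
    return content_type.split(";")[0].strip().lower() == "text/event-stream"


def is_complete(payloads: list[str]) -> bool:
    """Return True only if the [DONE] marker was received."""
    return any(p.strip() == "[DONE]" for p in payloads)


# A chunk carrying finish_reason=stop, but not followed by [DONE]
stopped = json.dumps({"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]})

print(is_event_stream("text/event-stream; charset=utf-8"))
print(is_complete([stopped]))
print(is_complete([stopped, "[DONE]"]))
```

The second call shows the case the note warns about: `finish_reason=stop` has arrived, yet the message still counts as incomplete because `[DONE]` has not.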
During streaming output, only the `content` field is streamed: each data chunk carries part of the tokens of `content`. Fields that do not need to be streamed, such as `role` and `usage`, are sent once rather than in every data chunk: the `role` field appears only in the first data chunk, and the `usage` field appears only in the last data chunk.
### How to handle when n > 1
In some cases, we want to output multiple results for selection. The correct approach is to set the `n` parameter in the request to a value greater than 1. Streaming output also supports using `n > 1`. In this situation, extra code is needed to check the `index` value of the current data chunk to determine which reply the transmitted data chunk belongs to. The following example code illustrates this:
```python
import json
import os

import httpx  # We use the httpx library to perform our HTTP request

payload = {
    "model": "kimi-k2-turbo-preview",
    "messages": [
        # Specific messages
    ],
    "temperature": 0.6,
    "stream": True,
    "n": 2,  # <--- Note here, we request the Kimi large model to output 2 replies
}
headers = {"Authorization": f"Bearer {os.environ['MOONSHOT_API_KEY']}"}

# Here, we pre-build a list to store the different reply messages. Since we set n=2, we initialize the list with 2 elements
messages = [{}, {}]

# Use httpx to send a chat request to the Kimi large model and read the response r incrementally
with httpx.stream("POST", "https://api.moonshot.cn/v1/chat/completions", headers=headers, json=payload) as r:
    if r.status_code != 200:
        r.read()  # A streamed body must be read before r.text is available
        raise Exception(r.text)

    data = ""  # Buffer for the data chunk currently being received

    # Here, we use the iter_lines method to read the response body line by line
    for line in r.iter_lines():
        # Strip leading and trailing whitespace from each line to better process data chunks
        line = line.strip()

        # Next, we handle three different cases:
        # 1. If the current line is empty, the previous data chunk has been fully received (data chunks end with two line breaks). We deserialize the data chunk and process its content;
        # 2. If the current line is not empty and starts with data:, this is the start of a data chunk. We remove the data: prefix and first check whether it is the termination symbol [DONE]. If not, we save the content to the data variable;
        # 3. If the current line is not empty but does not start with data:, it still belongs to the data chunk being transmitted, so we append it to the data variable;
        if len(line) == 0:
            if not data:
                continue  # Nothing buffered yet (e.g. consecutive empty lines)
            chunk = json.loads(data)

            # Loop through all choices in each data chunk and get the message object corresponding to the index
            for choice in chunk["choices"]:
                index = choice["index"]
                message = messages[index]
                usage = choice.get("usage")
                if usage:
                    message["usage"] = usage
                delta = choice["delta"]
                role = delta.get("role")
                if role:
                    message["role"] = role
                content = delta.get("content")
                if content:
                    # The message starts out empty, so default to "" on the first piece of content
                    message["content"] = message.get("content", "") + content

            data = ""  # Reset data
        elif line.startswith("data: "):
            # Slice off the prefix; str.lstrip("data: ") would strip a set of characters, not the prefix
            data = line[len("data: "):]

            # When the data chunk content is [DONE], all data chunks have been sent and the connection can be closed
            if data == "[DONE]":
                break
        else:
            data = data + "\n" + line  # We add a newline while appending since this might be a deliberate multiline data chunk

# After assembling all messages, we print their content respectively
for index, message in enumerate(messages):
    print("index:", index)
    print("message:", json.dumps(message, ensure_ascii=False))
```
When `n > 1`, the key point in handling streaming output is that you first need to determine which reply message a data chunk belongs to based on the `index` value of the data chunk, and then proceed with subsequent logic processing.
Last updated on October 28, 2025