Technical Analysis of Reading and Parsing Streaming Responses from Large Language Models

Original source: www.yigegongjiang.com

Primarily explains streaming HTTP responses via the ReadableStream capability of the Web Fetch API.

Streaming hinges on parsing the response body at the HTTP protocol layer—similar to TCP packet sticking and fragmentation handling. All major programming languages support intercepting and parsing raw binary body data streams.

Fetch API Quick Overview

  • await fetch(url) → returns a Response object: At this point, the browser has typically already received and parsed the HTTP status line and headers from the underlying connection (TCP / QUIC), but the response body has not yet been consumed or parsed (it is exposed separately as a streaming interface for incremental reading by application code).
  • await res.json() / await res.text() → reads and parses the entire body at once, suitable for non-streaming scenarios.
  • res.body → ReadableStream<Uint8Array>, a byte stream interface, supporting incremental consumption.
  • res.body.getReader().read() → manual Pull mode; each call returns { value: Uint8Array, done: boolean }.

You can think of it like this: fetch() first retrieves and wraps the status and headers into a Response; how and when the body bytes are read and parsed is left entirely up to the caller:

  • Incrementally read via ReadableStream (ideal for AI token streaming, NDJSON, SSE, etc.).
  • Or invoke convenience methods like json() / text(), which internally read the full body before parsing it all at once.

Core distinction: Convenience methods like res.json() wait until the entire response body finishes downloading before parsing it; reader.read() supports processing while receiving, forming the foundation of streaming reads.

// Non-streaming: waits for the entire body to download before parsing
const full = await (await fetch(url)).json();


// Streaming: process bytes as they arrive
const res = await fetch(url);
const reader = res.body!.getReader();
const decoder = new TextDecoder("utf-8");
for (;;) {
  const { value, done } = await reader.read();
  if (done) break;
  const chunkText = decoder.decode(value, { stream: true });
  // consume chunkText incrementally (render, buffer, parse, ...)
}

SSE and Fetch

Many “chat streaming outputs” appear similar to Server-Sent Events (SSE), but their underlying implementations generally fall into two categories:

  • SSE (protocol + API): text/event-stream MIME type + EventSource (the browser handles protocol framing and provides built-in semantics such as auto-reconnect). See the SSE Guide.
  • Streaming via Fetch response body (transport capability + custom framing): fetch() + Response.body (which yields a ReadableStream<Uint8Array>; you must define your own message boundaries and semantics—e.g., NDJSON, length-prefixed frames, etc.).
  • Side note: Many implementations are migrating from SSE/EventSource to fetch streams because SSE has several limitations (e.g., only supports GET requests, inflexible headers/authentication).

Commonality: Both rely on long-lived connections + incremental writes + timely flushes, resulting in an apparent HTTP response body byte stream on the client side.

One-sentence distinction: SSE standardizes both message framing and event semantics; Fetch streaming exposes only raw bytes to the application layer—the rest is defined by your application protocol.

  • Framing: SSE uses fixed data: lines followed by blank lines; Fetch streams allow fully customizable framing (NDJSON, length-prefix, delimiters, etc.).
  • Semantics: SSE includes built-in browser features like automatic reconnection and Last-Event-ID; with Fetch streams, reconnection, resumption, and error semantics must be designed and implemented at the application layer.
  • Use cases: SSE fits better for generic “standard event streams”; whereas chat often requires POST requests, flexible authentication headers, and custom protocols—making Fetch streams more convenient.
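The framing difference is easy to see in code. Below is a minimal sketch of what a hand-rolled SSE-style parser over a fetch stream has to do (parseSSEBuffer is a hypothetical helper for illustration, not a standard API) — work that EventSource performs for you:

```typescript
// Extract complete `data:` payloads from an accumulated text buffer.
// SSE frames end with a blank line; the last piece of the buffer may be
// an incomplete frame, so it is returned as the unconsumed remainder.
function parseSSEBuffer(buffer: string): { events: string[]; rest: string } {
  const events: string[] = [];
  const frames = buffer.split("\n\n");
  const rest = frames.pop() ?? ""; // possibly incomplete trailing frame
  for (const frame of frames) {
    const data = frame
      .split("\n")
      .filter((line) => line.startsWith("data:"))
      .map((line) => line.slice(5).trimStart())
      .join("\n"); // multiple data: lines in one frame are joined
    if (data) events.push(data);
  }
  return { events, rest };
}
```

With a fetch stream you would append each decoded chunk to the buffer, call this function, and carry `rest` over to the next chunk; with EventSource this framing, plus reconnection, is built in.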

End-to-end Transmission (from server write to client receiving chunk)

To truly achieve “push-as-you-go” display, the crux usually lies not in HTTP syntax itself, but rather in where along the transmission path bytes get buffered.

  1. Server-side application write: The application calls write()/send() to write bytes into the socket. If it only writes to userspace buffers without calling flush (or if buffering occurs in frameworks/middleware), the client won’t receive incremental data.
  2. TCP socket buffers (send/receive): send() typically copies data only into the TCP send buffer; actual transmission depends on the TCP stack’s segmentation logic and flow/congestion control policies. As a result: application-level write() granularity rarely matches what the peer read()s (further affected by Nagle’s algorithm, delayed ACKs, cwnd, rwnd, etc.).
  3. HTTP transport mechanism:
    • HTTP/1.1: Usually relies on Chunked Transfer Encoding or continuous writing when Content-Length is unknown.
    • HTTP/2 / HTTP/3: Transmits continuously using DATA frames or QUIC streams, subject to multiplexing and flow-control mechanisms.
  4. Browser network stack → ReadableStream: The browser pushes “arrived and ready” bytes into the internal queue of the ReadableStream, and JavaScript consumes them via reader.read() or pipeThrough() in Pull mode.
  • Chunk boundary ≠ message boundary: A single read() yields a Uint8Array representing only currently available bytes—it may be arbitrarily truncated (e.g., mid-UTF-8 character, mid-JSON structure, or mid-custom frame header).
  • End-to-end buffering “fakes non-streaming”: Application-layer flushing, reverse-proxy buffering, compression buffering, CDN policies, or browser internal queues—if any stage batches data, tokens will appear to arrive “in batches”.

Basic Model of Fetch Streaming Reads

The Response object returned by fetch() contains a body property of type ReadableStream<Uint8Array>. There are two main ways to consume it:

  1. Manual reading: Get a reader and loop over read().
  2. Pipeline processing: Use pipeThrough() to construct decoding, framing, and parsing pipelines (recommended).

Overall, this operates in Pull mode: Each read() returns “whatever bytes have arrived so far”. If processing lags behind network delivery speed, the internal queue builds up, triggering backpressure, which ultimately slows down further transmission.

Example: Parsing a Message Stream

Byte Decoding: Handling Incremental UTF-8

Network transmission delivers Uint8Arrays, but chat ultimately needs textual tokens. Note that UTF-8 is a variable-length encoding—a single character may span two chunks. Directly calling decoder.decode(chunk) (non-streaming mode by default) risks garbled or missing characters at chunk boundaries. Instead, use incremental decoding:

  • Use TextDecoder with the { stream: true } option.
  • Or use TextDecoderStream in a pipeline: response.body.pipeThrough(new TextDecoderStream()), yielding a ReadableStream<string> directly.
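To see why { stream: true } matters, here is a small sketch that splits the two-byte UTF-8 encoding of "é" across two chunks:

```typescript
// "é" is two bytes in UTF-8 (0xC3 0xA9); split them across two chunks.
const bytes = new TextEncoder().encode("é");
const chunk1 = bytes.slice(0, 1);
const chunk2 = bytes.slice(1);

// Non-streaming decode: each call treats its input as a complete sequence,
// so the dangling lead byte is replaced with U+FFFD.
const naive = new TextDecoder().decode(chunk1) + new TextDecoder().decode(chunk2);

// Streaming decode: the decoder buffers the incomplete sequence internally
// and emits the character once the continuation byte arrives.
const streaming = new TextDecoder("utf-8");
const correct =
  streaming.decode(chunk1, { stream: true }) + streaming.decode(chunk2);
```

The final decode call without { stream: true } also flushes any bytes still buffered, so it doubles as the end-of-stream signal.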

Framing Strategy: Defining Message Boundaries

To achieve “render-as-you-receive”, you need a framing rule enabling incremental parsing: How do we extract individual messages?

  1. NDJSON / JSON Lines (Recommended)

    • Format: One message per line, e.g., {"type":"delta","text":"..."}\n.
    • Pros: Simple parsing, debugging-friendly, compatible with JSON.parse.
    • Note: Ensure payloads contain no unescaped newlines (standard JSON escapes newlines as \n, usually safe).
  2. Delimiter-based Protocol

    • Format: Custom delimiter (e.g., \n\n or a unique boundary string) separates messages.
    • Risk: If payloads contain the delimiter, escaping or complex boundary schemes are needed.
  3. Length-Prefixed Framing (Binary Framing)

    • Format: [Length][Payload]....
    • Pros: Safe for arbitrary binary or text content; unaffected by payload characters.
    • Cons: Higher implementation complexity; requires maintaining a byte-level state machine.

Prefer NDJSON or length-prefixed framing; never assume chunk boundaries align with business message boundaries.
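As an illustration of length-prefixed framing, here is a minimal extractor assuming a 4-byte big-endian length prefix (the prefix width and byte order are application choices, not a standard):

```typescript
// Extract complete [Length][Payload] frames from a byte buffer.
// Incomplete trailing data is returned as the remainder, to be
// prepended to the next chunk.
function extractFrames(buf: Uint8Array): { frames: Uint8Array[]; rest: Uint8Array } {
  const frames: Uint8Array[] = [];
  let offset = 0;
  while (buf.length - offset >= 4) {
    const view = new DataView(buf.buffer, buf.byteOffset + offset, 4);
    const len = view.getUint32(0); // DataView reads big-endian by default
    if (buf.length - offset - 4 < len) break; // payload not fully arrived
    frames.push(buf.slice(offset + 4, offset + 4 + len));
    offset += 4 + len;
  }
  return { frames, rest: buf.slice(offset) };
}
```

The caller maintains the byte-level state (the remainder) across read() calls, which is exactly the state machine the "Cons" bullet refers to.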

Fetch + NDJSON

Complete workflow: Byte reading → text decoding → line-based framing → JSON parsing → UI update

type ChatChunk = { type: "delta"; text: string } | { type: "done" } | { type: "error"; message: string };

export async function streamChat(
  input: { prompt: string },
  onChunk: (c: ChatChunk) => void,
  signal?: AbortSignal,
) {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(input),
    signal,
  });

  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  if (!res.body) throw new Error("ReadableStream not supported");

  const reader = res.body.getReader();

  const decoder = new TextDecoder("utf-8");

  let lineBuffer = "";

  try {
    while (true) {
      const { value, done } = await reader.read();

      if (done) {
        if (lineBuffer.trim()) {
          const msg = JSON.parse(lineBuffer.trim()) as ChatChunk;
          onChunk(msg);
        }
        break;
      }

      const text = decoder.decode(value, { stream: true });
      lineBuffer += text;

      const lines = lineBuffer.split("\n");
      lineBuffer = lines.pop() ?? "";

      for (const line of lines) {
        const trimmed = line.trim();
        if (!trimmed) continue; // skip empty lines between messages

        const msg = JSON.parse(trimmed) as ChatChunk;
        onChunk(msg);

        if (msg.type === "done") return;
      }
    }
  } finally {
    reader.releaseLock();
  }
}

Server-side sending:

Wrap incremental tokens into parseable, separable message units (lines), ensuring the client always parses complete JSON objects.

  • Incremental token: {"type":"delta","text":"..."}\n + Flush
  • Completion signal: {"type":"done"}\n + End Response
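A minimal server-side sketch of this contract (sendChunk and the write callback are illustrative names, not a fixed API); with Node's http.ServerResponse, passing res.write.bind(res) as the sink would emit each line as a chunk:

```typescript
// Serialize chat chunks as NDJSON lines: one complete JSON object per
// line, written immediately so the client can parse as it receives.
type ChatChunk = { type: "delta"; text: string } | { type: "done" };

function sendChunk(write: (line: string) => void, chunk: ChatChunk): void {
  // JSON.stringify escapes embedded newlines as \n, so "\n" is a safe
  // message delimiter.
  write(JSON.stringify(chunk) + "\n");
}

// Usage with an in-memory sink standing in for the response stream:
const lines: string[] = [];
sendChunk((l) => lines.push(l), { type: "delta", text: "Hel\nlo" });
sendChunk((l) => lines.push(l), { type: "done" });
```

Whether each write actually reaches the client promptly still depends on the buffering stages described earlier (framework, proxy, compression, CDN).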