Differences between GBK, GB2312, and GB18030

Preface

Recently, users reported the following error when using rimetool:

2025-10-15 11:09:09,607 - ERROR - [new_app.py:273] - Failed to call rimetool_main: 'gb2312' codec can't decode byte 0x90 in position 698: illegal multibyte sequence

This was puzzling, because we already use chardet for encoding detection: if a file is detected as gbk, we open it with gbk; if it is detected as utf8, we open it with utf8.

But why does it fail to open when identified as gb2312?

After discussing with AI, we learned that automatic detection often mistakenly identifies GBK/GB18030 as GB2312. In such cases, opening with GB2312 naturally causes errors.

Solution

Whenever the detected encoding is GBK, GB2312, or GB18030, open the file with GB18030.
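A minimal sketch of that rule, assuming the detector returns an encoding name as a string (for example, the "encoding" field of chardet's result). The names GB_FAMILY, pick_codec, and decode_text are hypothetical and are not taken from rimetool; the UTF-8 fallback for an unknown result is also an assumption.

```python
# Hypothetical helper names; not rimetool's actual code.
# Every member of the GB family is decoded with gb18030.
GB_FAMILY = {"gb2312", "gbk", "gb18030", "cp936"}


def pick_codec(detected):
    """Map a detected encoding name to the codec used for decoding."""
    if detected is None:
        # Assumption: fall back to UTF-8 when detection gives no answer.
        return "utf-8"
    name = detected.lower()
    if name in GB_FAMILY:
        return "gb18030"
    return name


def decode_text(raw, detected):
    """Decode raw bytes using the codec chosen by pick_codec."""
    return raw.decode(pick_codec(detected))


if __name__ == "__main__":
    # GB2312 text followed by a GBK-only byte pair (lead byte 0x90).
    raw = "中文".encode("gb2312") + b"\x90\x41"
    # A detector may report "GB2312" here; decoding still succeeds.
    print(decode_text(raw, "GB2312"))
```

The detector's answer is treated only as a hint about the encoding family, so a file that really is GBK or GB18030 no longer fails just because it was labelled GB2312.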

Differences between GBK, GB2312, and GB18030

Brief answer:

  • Standards and Publication Years
    • GB2312: Released in 1980, includes about 6,763 simplified Chinese characters + 682 symbols.
    • GBK: Released in 1995, greatly expands on GB2312, including about 21,003 characters (including many traditional and rare characters), fully backward compatible with GB2312. Windows Chinese codepage CP936 is essentially GBK.
  • Character Coverage
    • GB2312 has a smaller coverage; many commonly used characters in GBK/GB18030 (such as some traditional characters, daily-use rare characters, radicals, etc.) are not included.
    • GBK covers a wider range, encoding more Chinese characters and symbols; GB18030 further expands on GBK and can represent every Unicode code point.
  • Byte Values and Validity
    • GB2312’s double-byte range is narrow: lead byte A1–F7, subsequent byte A1–FE.
    • GBK’s range is broader: lead byte 81–FE, subsequent byte 40–FE (excluding 7F).
    • Thus, a byte like 0x90 can never start (or end) a character in GB2312, but it is a valid lead byte in GBK/GB18030. This is a common cause of the “'gb2312' codec can't decode byte” error.
  • Relationship to Unicode
    • GBK roughly corresponds to the CJK ideograph repertoire of Unicode 1.1.
    • GB18030 (versions 2000/2005/2022) encodes each character in 1, 2, or 4 bytes and can represent every Unicode code point; it is the recommended choice for modern use.
  • Practical Advice
    • Automatic detection often misclassifies GBK/GB18030 as GB2312. To avoid decoding errors, it is most reliable to open “Simplified Chinese” encoded files uniformly with gb18030 (gb18030 is backward compatible with GBK/GB2312).
    • UTF-8 files with BOM are better opened with utf-8-sig.
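The byte-range and BOM points above can be checked with Python's built-in codecs alone. This is an illustrative sketch: the byte pair 0x90 0x41 is an arbitrary example chosen because its lead byte is outside GB2312's A1–F7 range, and try_decode is a hypothetical helper.

```python
def try_decode(data, codec):
    """Return the decoded text, or a short error string if decoding fails."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError as exc:
        return f"error: {exc.reason}"


if __name__ == "__main__":
    # Lead byte 0x90 is invalid in GB2312 but valid in GBK and GB18030.
    sample = b"\x90\x41"
    for codec in ("gb2312", "gbk", "gb18030"):
        print(codec, "->", try_decode(sample, codec))

    # GB18030 uses four bytes for characters outside the BMP,
    # which GBK cannot encode at all.
    emoji = "\U0001F600"
    print(len(emoji.encode("gb18030")))  # 4

    # utf-8-sig strips a leading BOM; plain utf-8 keeps it as U+FEFF.
    with_bom = b"\xef\xbb\xbfabc"
    print(repr(with_bom.decode("utf-8")))      # '\ufeffabc'
    print(repr(with_bom.decode("utf-8-sig")))  # 'abc'
```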

In summary: GB2312 is an early and smaller subset; GBK greatly expands on GB2312 and is backward compatible with it; for modern scenarios, decoding Chinese text with gb18030 is recommended, which avoids the “GB2312 fails to decode” problem.

Further Reading

Similar issues encountered previously: