Preface
Recently, users reported the following error when using rimetool:
2025-10-15 11:09:09,607 - ERROR - [new_app.py:273] - Failed to call rimetool_main: 'gb2312' codec can't decode byte 0x90 in position 698: illegal multibyte sequence
This was surprising, because we already use chardet for encoding detection: if a file is detected as gbk we open it with gbk, and if it is detected as utf8 we open it with utf8.
But why does it fail to open when identified as gb2312?
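After discussing the problem with an AI assistant, we learned that automatic detection often misidentifies GBK or GB18030 text as GB2312. GB2312 is the smallest of the three encodings, so as soon as the file contains a byte sequence outside its range, decoding with gb2312 fails.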
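The failure can be reproduced without any file at all. This is a minimal sketch that assumes only Python's built-in codecs. The two-byte sample is made up for illustration; it was chosen because its lead byte is 0x90, the byte named in the log above.

```python
# Two bytes that form a valid GBK character but fall outside GB2312.
# The sample is illustrative; it is not taken from the user's file.
raw = b"\x90\x40"


def try_decode(data, encoding):
    """Return the decoded text, or None if strict decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        print(f"{encoding}: {exc}")
        return None


as_gb2312 = try_decode(raw, "gb2312")    # fails: 0x90 is not a GB2312 lead byte
as_gbk = try_decode(raw, "gbk")          # succeeds
as_gb18030 = try_decode(raw, "gb18030")  # succeeds, same character as GBK
```

If the detector labels such a file as GB2312, the decode step raises exactly the kind of `UnicodeDecodeError` shown in the log.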
Solution
Whenever the detected encoding is GBK, GB2312, or GB18030, open the file with GB18030.
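One way this rule could look in code is sketched below. The names `GB_FAMILY`, `normalize_encoding`, and `read_text` are hypothetical and are not taken from rimetool. The detected name is passed in as a plain string (for example, the value reported by chardet), and falling back to utf-8 when nothing was detected is an assumption.

```python
# Encodings that should all be read with the gb18030 superset.
GB_FAMILY = {"gb2312", "gbk", "gb18030"}


def normalize_encoding(detected, default="utf-8"):
    """Map a detected encoding name to the codec actually used for reading.

    `detected` is whatever name the detector reported (it may be None).
    The utf-8 default is an assumption, not rimetool's documented behaviour.
    """
    if not detected:
        return default
    name = detected.strip().lower()
    return "gb18030" if name in GB_FAMILY else name


def read_text(path, detected):
    """Read a text file using the normalized encoding."""
    with open(path, encoding=normalize_encoding(detected)) as f:
        return f.read()
```

Keeping the mapping in one small function means the detector's answer is never passed to `open()` directly, so a GB2312 misdetection can no longer reach the decoder.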
Differences between GBK, GB2312, and GB18030
Brief answer:
- Standards and Publication Years
- GB2312: Released in 1980; contains 6,763 simplified Chinese characters plus 682 symbols.
- GBK: Released in 1995; greatly extends GB2312 to about 21,003 Chinese characters (including many traditional and rare characters) and is fully backward compatible with GB2312. The Windows Simplified Chinese code page CP936 is essentially GBK.
- GB18030: First released in 2000 and revised in 2005 and 2022; extends GBK with four-byte sequences.
- Character Coverage
- GB2312 has a smaller coverage; many commonly used characters in GBK/GB18030 (such as some traditional characters, daily-use rare characters, radicals, etc.) are not included.
- GBK covers a wider range, encoding more Chinese characters and symbols; GB18030 further expands on GBK and almost completely covers Unicode.
- Byte Values and Validity
- GB2312’s double-byte range is narrow: lead byte A1–F7, trail byte A1–FE.
- GBK’s range is broader: lead byte 81–FE, trail byte 40–FE (excluding 7F).
- Thus, a byte such as 0x90 can never appear in valid GB2312 text, but it is a legal lead or trail byte in GBK/GB18030. This is a common cause of the “gb2312 decode error.”
- Relationship to Unicode
- GBK covers roughly the Chinese character repertoire of Unicode 1.1.
- GB18030 (versions 2000/2005/2022) uses 1-, 2-, or 4-byte sequences, covers almost all of Unicode, and is recommended for modern use.
- Practical Advice
- Automatic detection often misclassifies GBK/GB18030 as GB2312. To avoid decoding errors, it is most reliable to open “Simplified Chinese” encoded files uniformly with gb18030 (gb18030 is backward compatible with GBK/GB2312).
- UTF-8 files that start with a BOM are best opened with utf-8-sig, which strips the BOM when reading.
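The BOM advice can be checked with a short experiment. This is a sketch using only the standard library; the sample text is arbitrary.

```python
import codecs

text = "中文"
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
without_bom = text.encode("utf-8")

plain = with_bom.decode("utf-8")         # BOM survives as a leading U+FEFF
clean = with_bom.decode("utf-8-sig")     # BOM is stripped
same = without_bom.decode("utf-8-sig")   # no BOM present: decodes normally
```

Because utf-8-sig also reads BOM-less input correctly, it is a reasonable choice for reading any file detected as UTF-8.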
In summary: GB2312 is an early, smaller subset; GBK greatly extends GB2312 and is backward compatible with it; for modern scenarios, decoding Chinese text with gb18030 is recommended to avoid “GB2312 failing to decode” issues.
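The byte ranges listed earlier make the subset relationship concrete. The sketch below encodes only those ranges; the function names are made up for illustration. Being inside a range does not guarantee that a character is actually assigned to that position.

```python
def in_gb2312_range(lead, trail):
    """True if (lead, trail) lies in GB2312's double-byte range."""
    return 0xA1 <= lead <= 0xF7 and 0xA1 <= trail <= 0xFE


def in_gbk_range(lead, trail):
    """True if (lead, trail) lies in GBK's double-byte range."""
    return 0x81 <= lead <= 0xFE and 0x40 <= trail <= 0xFE and trail != 0x7F


# Illustrative pair whose lead byte is 0x90, as in the reported error.
pair = (0x90, 0x40)
print(in_gb2312_range(*pair))  # False
print(in_gbk_range(*pair))     # True
```

Every pair accepted by `in_gb2312_range` is also accepted by `in_gbk_range`, but not the other way round, which is why a GBK file mislabeled as GB2312 can fail to decode.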
Further Reading
Similar issues encountered previously: