Difference between utf-8 and utf-8-sig

doggie · September 9, 2025, 7:00am

Recommendation

When processing data with pandas and saving it as CSV, plain UTF-8 output can still show up as garbled characters. Copilot recommended using utf-8-sig instead, and that solved the problem.

A small encoding bug every day

Main Text

This article was transcoded by 简悦 SimpRead; the original is at www.cnblogs.com

Preface: There was a problem of garbled characters when writing to a CSV file.

Solution: Change utf-8 to utf-8-sig

The difference is as follows:

“utf-8” uses single bytes as its code unit. The byte order is the same on every system, so there is no byte order issue and no BOM is needed. As a result, when a file that starts with a BOM is read with the “utf-8” codec, the BOM is treated as part of the file content, which causes errors like the one above.
In “utf-8-sig”, sig stands for signature, i.e. “utf-8 with signature”. When reading, “utf-8-sig” strips a leading BOM and keeps it out of the text content, which is the expected behavior; when writing, it puts the BOM at the start of the file.
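The difference between the two codecs can be seen directly on a byte string. This is only a minimal sketch: the sample CSV text is made up for illustration, and `codecs.BOM_UTF8` is the standard-library constant for the bytes EF BB BF.

```python
import codecs

# Made-up CSV content, prefixed with the UTF-8 BOM (EF BB BF).
raw = codecs.BOM_UTF8 + "id,name\n1,奎\n".encode("utf-8")

as_utf8 = raw.decode("utf-8")      # BOM survives as the character U+FEFF
as_sig = raw.decode("utf-8-sig")   # BOM is stripped before the text

print(as_utf8.startswith("\ufeff"))  # the first "column name" is polluted
print(as_sig.startswith("id"))       # the text starts where it should

# "utf-8-sig" also reads BOM-less input without complaint.
no_bom = "id,name\n".encode("utf-8").decode("utf-8-sig")
```

Because “utf-8-sig” accepts input with or without a BOM, it is a reasonable default for reading files of unknown origin.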

with open('data.csv', 'w', encoding='utf_8_sig', newline='') as fp:
    fp.write('name,city\n')  # sample row; the codec writes the BOM (EF BB BF) first
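A slightly fuller sketch of the same idea, using the standard csv module and then checking the first bytes of the result. The rows and the temporary output path are invented for the example and are not from the original post.

```python
import csv
import os
import tempfile

rows = [["name", "city"], ["张三", "北京"]]  # made-up sample data

# Hypothetical output location; a temp directory keeps the sketch side-effect free.
path = os.path.join(tempfile.mkdtemp(), "data.csv")

# newline="" is what the csv module documentation asks for when writing.
with open(path, "w", encoding="utf-8-sig", newline="") as fp:
    csv.writer(fp).writerows(rows)

# Inspect the raw bytes: the file should begin with the UTF-8 BOM.
with open(path, "rb") as fp:
    head = fp.read(3)
print(head.hex(" "))  # ef bb bf

# Reading back with the same codec gives the rows without any BOM residue.
with open(path, encoding="utf-8-sig", newline="") as fp:
    read_back = list(csv.reader(fp))
```

With pandas, the equivalent is passing encoding="utf-8-sig" to DataFrame.to_csv.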

To let Excel open a UTF-8 CSV file correctly, add a BOM (Byte Order Mark) at the very beginning of the file. A receiver that sees a byte stream starting with EF BB BF knows the content is UTF-8 encoded.

So before writing the file content data, write the BOM first. As in the code below:

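import java.io.File;
import java.io.FileOutputStream;

// try-with-resources closes the stream even if a write fails
try (FileOutputStream fos = new FileOutputStream(new File(this.csvFileAbsolutePath))) {
    byte[] bom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };  // UTF-8 BOM
    fos.write(bom);   // write the BOM before anything else
    fos.write(...);   // then the CSV content, encoded as UTF-8
}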

Adding BOM this way to a CSV file allows Excel to open it directly without causing garbled characters.

The problem I encountered was this: after downloading the CSV file and opening it in Excel, the Chinese characters were garbled, but Atom, Notepad++, and Notepad all displayed the file correctly. On investigation, it turned out that Excel does not recognize a Unicode file that has no BOM; in other words, Excel opens CSV files as ANSI (the system's legacy code page) by default. So a BOM header is needed.
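The garbling can be reproduced without Excel by decoding UTF-8 bytes with a legacy code page. In this sketch GBK stands in for the ANSI code page of a Chinese-language Windows, which is an assumption about the reader's system, and the sample text is made up.

```python
import codecs

text = "中文,数据"              # made-up sample cell contents
raw = text.encode("utf-8")     # what a BOM-less UTF-8 CSV contains

# Assumption: the ANSI code page is GBK. errors="replace" keeps the demo
# from raising on byte pairs that GBK cannot decode.
garbled = raw.decode("gbk", errors="replace")
print(garbled != text)         # the "ANSI" reading no longer matches

# With a BOM in front, a BOM-aware reader recovers the original text.
recovered = (codecs.BOM_UTF8 + raw).decode("utf-8-sig")
print(recovered == text)
```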

Meaning of BOM

BOM stands for Byte Order Mark. The BOM was designed for UTF-16 and UTF-32, to mark their byte order. Take UTF-16 as an example: its code unit is two bytes, so before interpreting UTF-16 text you first need to know the byte order within each code unit. For example, the Unicode code point of “奎” is 594E and that of “乙” is 4E59. If we receive the UTF-16 byte stream “594E”, is it “奎” or “乙”?
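The ambiguity is easy to reproduce: the same two bytes decode to different characters depending on the byte order chosen. A small sketch using Python's explicit-endianness codecs:

```python
# The two bytes 59 4E from the example above.
stream = bytes.fromhex("594E")

big = stream.decode("utf-16-be")     # read as 0x594E
little = stream.decode("utf-16-le")  # read as 0x4E59

print(big, little)  # 奎 乙
```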

The Unicode specification recommends marking byte order with a BOM. UCS contains a character called “ZERO WIDTH NO-BREAK SPACE”, encoded as FEFF. Its byte-swapped form, FFFE, is not a character in UCS, so it should never appear in actual transmission. UCS recommends sending “ZERO WIDTH NO-BREAK SPACE” before the byte stream itself. If the receiver gets FE FF, the byte stream is Big-Endian; if it gets FF FE, it is Little-Endian. This is why “ZERO WIDTH NO-BREAK SPACE” is also called the BOM.
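The receiver-side check described above might look like the following. This is a sketch only: `detect_utf16_order` is a hypothetical helper name, and it deliberately ignores UTF-32, whose little-endian BOM also begins with FF FE.

```python
import codecs

def detect_utf16_order(data: bytes) -> str:
    """Guess UTF-16 byte order from a leading BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little-endian"
    return "unknown"

be_stream = bytes.fromhex("FEFF594E")  # BOM + "奎", big-endian
le_stream = bytes.fromhex("FFFE4E59")  # BOM + "奎", little-endian

print(detect_utf16_order(be_stream), detect_utf16_order(le_stream))

# Python's generic "utf-16" codec does the same check and drops the BOM.
same_char = be_stream.decode("utf-16") == le_stream.decode("utf-16")
```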

UTF-8 uses bytes as encoding units and has no byte order issue.

To extend a bit:

UTF-8 is processed one byte at a time, so it is not affected by CPU endianness; to get the next byte, just use address + 1.

UTF-16 and UTF-32 are processed in units of two and four bytes respectively, so each read accesses 2 or 4 bytes. When such data is stored or sent over a network, the byte order within each unit must therefore be taken into account.

UTF-8 BOM

The UTF-8 BOM is also called the UTF-8 signature. UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. When a text program reads a byte stream starting with EF BB BF, it knows the content is UTF-8 encoded. Windows uses the BOM in this way to mark the encoding of text files.

Supplement:

The UCS encoding of “ZERO WIDTH NO-BREAK SPACE” is FEFF (assumed Big-Endian), and the corresponding UTF-8 encoding is EF BB BF.
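The correspondence between FEFF and EF BB BF can be checked by encoding the character itself:

```python
import codecs

bom_char = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE

utf8_bytes = bom_char.encode("utf-8")
utf16_be_bytes = bom_char.encode("utf-16-be")

print(utf8_bytes.hex(" "))      # ef bb bf
print(utf16_be_bytes.hex(" "))  # fe ff
print(utf8_bytes == codecs.BOM_UTF8)
```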

That is, a byte stream starting with EF BB BF signals that the stream is UTF-8 encoded. But if the file is already known to be UTF-8, these three bytes carry no extra information. So the BOM has no actual effect on UTF-8 itself.

Disadvantages of UTF-8 files containing BOM

1. Effect on PHP

PHP was not designed with the BOM in mind: it does not skip the three bytes EF BB BF at the start of a UTF-8 file but treats them as ordinary text, which leads to errors.

2. Errors executing SQL scripts on Linux

Recently, during development, SQL files created on Windows failed when run on Linux.

Whether the file began with a Chinese comment, an English comment, or no comment at all, it always produced an error like SP2-0734: unknown command beginning “?declare …” - rest of line ignored.

The start of the file looks like this:

--create tablespace
declare
v_tbs_name varchar2(200):='hytpdtsmsshistorydb';
begin

Error output:

SP2-0734: unknown command beginning "?--create ..." - rest of line ignored.
PL/SQL procedure successfully completed.

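I found no similar case online, and the file's encoding was confirmed to be utf-8, so this issue troubled me for a long time. Finally I looked into the difference between files with and without a BOM, and saving the file without a BOM fixed the problem. The “?” in the error message was the BOM, read as part of the first command.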

After modification, whether using Chinese, English, or removing comments, it worked normally.

Heartfelt suggestion: best to avoid BOM in UTF-8

The difference between “UTF-8” and “UTF-8 with BOM” is whether BOM exists, i.e., whether the file starts with U+FEFF.

To check for a BOM on Linux, open the file with the less command; other commands may not display it.

If a BOM is present, you will see an extra <U+FEFF> in front of the text.
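The same check can be scripted by looking at the first three bytes of the file. A minimal sketch: `has_utf8_bom` is a hypothetical helper, and the two temporary .sql files exist only for the demonstration.

```python
import codecs
import os
import tempfile

def has_utf8_bom(path: str) -> bool:
    """Return True if the file starts with EF BB BF (hypothetical helper)."""
    with open(path, "rb") as fp:
        return fp.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8

# Demo files in a temp directory: one written with a BOM, one without.
tmp_dir = tempfile.mkdtemp()
with_bom = os.path.join(tmp_dir, "with_bom.sql")
without_bom = os.path.join(tmp_dir, "without_bom.sql")

with open(with_bom, "w", encoding="utf-8-sig") as fp:
    fp.write("--create tablespace\n")
with open(without_bom, "w", encoding="utf-8") as fp:
    fp.write("--create tablespace\n")

print(has_utf8_bom(with_bom), has_utf8_bom(without_bom))  # True False
```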

To remove BOM in UTF-8:

On Linux:

(1)

Open the file with vim
Run :set nobomb
Save and quit with :wq

(2)

dos2unix filename

This converts a Windows-format file to Unix/Linux format: it not only changes Windows line endings \r\n to Unix/Linux \n, but also converts UTF-8 Unicode (with BOM) to plain UTF-8 Unicode.

PS:

A tricky situation: if a UTF-8 Unicode (with BOM) file contains two <U+FEFF>, method (1) or (2) has to be run twice to remove every <U+FEFF>!

On Windows, open the file in Notepad++, select “Encoding” → “Encode in UTF-8 without BOM”, then save the file again.
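For batch jobs, the BOM can also be removed with a few lines of Python. This is a sketch rather than a replacement for the tools above: both function names are hypothetical, the loop is there to cover the double-BOM case mentioned in the PS, and unlike dos2unix it leaves line endings untouched.

```python
import codecs

def strip_utf8_boms(data: bytes) -> bytes:
    """Drop every leading UTF-8 BOM, including repeated ones."""
    while data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data

def rewrite_without_bom(path: str) -> None:
    """Rewrite the file in place without its leading BOM(s)."""
    with open(path, "rb") as fp:
        data = fp.read()
    with open(path, "wb") as fp:
        fp.write(strip_utf8_boms(data))

# A byte string that starts with two BOMs, as in the tricky case.
double_bom = codecs.BOM_UTF8 * 2 + b"--create tablespace\n"
print(strip_utf8_boms(double_bom))  # b'--create tablespace\n'
```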

Reference: https://www.cnblogs.com/Allen-rg/p/10536081.html
