Character encoding bugs are the chupacabra of software: everyone has heard of them, few have seen one in the wild, but when they appear, they leave a trail of incomprehensible text. I've spent more hours than I care to admit staring at strings like 'é' or '’' and wondering whether to reach for a hex editor or a stiff drink.
The worst part? These bugs look random but they aren't. Every garbled character is a clue. In this post I'll show you how to systematically decode those clues, with real examples from production incidents.
The Anatomy of Mojibake
Mojibake is the Japanese term for garbled text, and it follows predictable patterns. When UTF-8 bytes are decoded as Latin-1 (ISO 8859-1), multi-byte sequences get split into individual bytes that happen to be printable Latin characters. For example, the UTF-8 encoding of 'é' is bytes 0xC3 0xA9. Decoded as Latin-1, those become 'é'.
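Similarly, the right single quotation mark (’, U+2019) is encoded in UTF-8 as the three bytes 0xE2 0x80 0x99. Windows-1252 maps those bytes to 'â', '€', and '™', so if you see '’', you're looking at UTF-8 bytes being interpreted as Windows-1252. (Windows-1252 itself stores ’ as the single byte 0x92, which is not valid on its own in UTF-8.)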
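Keep a mojibake cheat sheet: 'é' = UTF-8 'é' read as Latin-1 or Windows-1252, '’' = UTF-8 '’' read as Windows-1252, 'æ–‡å—' = UTF-8 CJK text (here 文字) read as Windows-1252. UTF-8 read as Shift-JIS looks different: dense runs of unrelated kanji such as 縺 and 繧.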
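The cheat sheet entries can be regenerated rather than memorized. The following is a minimal sketch, assuming Python 3.8 or later; `mojibake` is a hypothetical helper name, not a library function. It encodes a string as UTF-8 and then decodes the same bytes with the wrong codec, which is exactly how the garbled forms arise.

```python
def mojibake(text: str, wrong_encoding: str) -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


# (original text, codec that is wrongly used to read the UTF-8 bytes)
samples = [("é", "latin-1"), ("’", "cp1252"), ("文字", "cp1252")]

for original, wrong in samples:
    raw = original.encode("utf-8")
    # repr() makes invisible characters (such as the soft hyphen) visible.
    print(f"{original!r} -> {raw.hex(' ')} -> {mojibake(original, wrong)!r} ({wrong})")
```

Printing the `repr` matters here: the byte 0xAD decodes to a soft hyphen under Windows-1252, which most terminals do not display.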
# Example: reproduce mojibake by reading UTF-8 bytes as Latin-1
# Create a file with one UTF-8 encoded character
$ echo 'é' > test.txt
$ xxd test.txt
00000000: c3a9 0a ...
$ file test.txt
test.txt: UTF-8 Unicode text
# Converting UTF-8 to Latin-1 is a correct conversion, not mojibake:
# 'é' becomes the single Latin-1 byte 0xE9
$ iconv -f utf-8 -t latin1 test.txt | xxd
00000000: e90a ..
# To reproduce mojibake, tell iconv the UTF-8 bytes are Latin-1
# and convert them to UTF-8 a second time
$ iconv -f latin1 -t utf-8 test.txt > broken.txt
$ xxd broken.txt
00000000: c383 c2a9 0a .....
$ cat broken.txt
é
The BOM That Broke the Build
- 09:15 CI pipeline fails with a cryptic JSON parse error on a config file.
- 09:20 Developer checks the file locally; it parses fine. A comparison with production shows the same file.
- 09:30 Hex dump reveals 0xEFBBBF at the beginning of the file: a UTF-8 BOM.
- 09:35 Local editor (VS Code) stripped the BOM on save; CI server's editor (nano) added it.
- 09:40 Strip BOM with sed and re-commit. Build passes.
Lesson
The UTF-8 BOM is optional and often invisible. Many parsers (including Python's json.load on text decoded as plain UTF-8, and shell script interpreters) treat the BOM as a literal character, causing syntax errors. Always configure your editor to save without a BOM for cross-platform files.
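On the reading side, a tolerant loader is a reasonable safety net. The sketch below uses only the standard library; `load_json_bytes` and `strip_utf8_bom` are hypothetical helper names. It relies on the `utf-8-sig` codec, which decodes UTF-8 and drops a leading BOM if one is present.

```python
import codecs
import json


def load_json_bytes(raw: bytes):
    """Parse JSON from raw bytes, tolerating an optional UTF-8 BOM."""
    # 'utf-8-sig' behaves like 'utf-8' but skips a leading EF BB BF.
    return json.loads(raw.decode("utf-8-sig"))


def strip_utf8_bom(raw: bytes) -> bytes:
    """Return raw without a leading UTF-8 BOM (unchanged if there is none)."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


with_bom = codecs.BOM_UTF8 + b'{"name": "build"}'

# Decoding as plain UTF-8 keeps U+FEFF in the string, and json rejects it.
try:
    json.loads(with_bom.decode("utf-8"))
except json.JSONDecodeError as exc:
    print("plain utf-8 failed:", exc)

print(load_json_bytes(with_bom))
print(strip_utf8_bom(with_bom))
```

Decoding with `utf-8-sig` is safe for files without a BOM as well, so it can be used unconditionally when reading.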
Detecting Encoding Blind Spots
You can't fix what you can't see. When you open a file and see garbled text, the first step is to identify the actual encoding. Two tools dominate: `file -i` and Python's `chardet`.
The `file` command uses magic bytes and heuristics. It's fast but can be wrong for short files. `chardet` is slower and works statistically, so it tends to do better when it has more text to sample. Neither is authoritative, but together they cover most cases.
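A cheap first pass, before reaching for either tool, is to check for a BOM and test whether the bytes are valid UTF-8 at all. The sketch below is a deliberately crude heuristic using only the standard library; `sniff_encoding` and its return labels are hypothetical names, and it is not a substitute for `file` or `chardet`.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
# starts with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(raw: bytes) -> str:
    """Return a rough label for raw: a BOM-based codec, 'ascii', 'utf-8',
    or 'unknown-8bit' when the bytes are not valid UTF-8."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8: some legacy 8-bit or multi-byte encoding.
        return "unknown-8bit"
    return "ascii" if all(b < 0x80 for b in raw) else "utf-8"


print(sniff_encoding("é".encode("utf-8")))
print(sniff_encoding("é".encode("latin-1")))
```

Valid UTF-8 is a strong signal because random legacy-encoded text rarely happens to form well-formed multi-byte sequences. An "unknown-8bit" result is where statistical detection earns its keep.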
# Use file to guess encoding
$ file -i mystery.txt
mystery.txt: text/plain; charset=utf-8
# Use chardet for a second opinion
$ pip install chardet
$ chardetect mystery.txt
mystery.txt: utf-8 with confidence 0.99
# If both say utf-8 but text is garbled, check for double encoding
# Try decoding as utf-8 then re-encoding as latin-1
$ python3 -c "
import sys
with open(sys.argv[1], 'rb') as f:
    raw = f.read()
# Decode as utf-8, then encode as latin-1 to recover the original bytes.
# Strict error handling matters: errors='replace' would hide a failed round trip.
try:
    decoded = raw.decode('utf-8')
    re_encoded = decoded.encode('latin-1')
    print(re_encoded.decode('utf-8'))
except UnicodeError:
    print('Not double-encoded')
" mystery.txt
Database Encoding Hell
Databases are a common source of encoding bugs. MySQL's `utf8` charset is actually a subset of UTF-8 (max 3 bytes per character), so 4-byte characters such as the emoji 😀 are rejected or truncated, depending on the server's SQL mode. PostgreSQL handles full UTF-8, but the connection's client encoding can still be misconfigured.
I once spent three days debugging why Japanese characters stored in MySQL showed as '???' on the frontend. The table was utf8mb4, the column was utf8mb4, but the PHP PDO connection used `SET NAMES utf8` — which told the server to send data as the 3-byte utf8 variant. The fix: `SET NAMES utf8mb4`.
In MySQL, always use utf8mb4, not utf8. The latter is a MySQL-specific 3-byte subset that cannot store emoji or some CJK characters.
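The 3-byte limit maps directly onto Unicode code points: anything above U+FFFF needs four bytes in UTF-8. The sketch below shows the byte lengths and a pre-flight check; `needs_utf8mb4` is a hypothetical helper for illustration, not part of MySQL or any connector.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if text contains a character that takes 4 bytes in UTF-8,
    that is, a code point above U+FFFF (outside the Basic Multilingual Plane)."""
    return any(ord(ch) > 0xFFFF for ch in text)


for ch in ["a", "é", "文", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {len(encoded)} byte(s) needs_utf8mb4={needs_utf8mb4(ch)}")
```

A check like this can be useful in tests or input validation while a legacy `utf8` column is still being migrated.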
# Check the actual encoding of a database connection
import mysql.connector

conn = mysql.connector.connect(
    host='localhost',
    user='root',
    password='password',
    database='test',
    charset='utf8mb4',  # Explicitly set
)
cursor = conn.cursor()
cursor.execute("SHOW VARIABLES LIKE 'character_set_%'")
for row in cursor:
    print(row)
cursor.close()
conn.close()
# Output shows connection, client, and server encodings
# If 'character_set_connection' is not utf8mb4, your data may be truncated
A Systematic Debugging Workflow
1. Identify the symptom: what specific garbled characters do you see? Write them down.
2. Dump raw bytes: use `xxd` or `od -c` to see the actual bytes before any interpretation.
3. Match pattern to known mojibake: consult a reference like the one above.
4. If pattern doesn't match, use chardet on the raw bytes to guess the original encoding.
5. Convert the raw bytes to the correct encoding using iconv or Python's decode/encode.
6. Fix the root cause: change the source to output correct UTF-8, or adjust the parser to handle the actual encoding.
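Steps 3 to 5 can be partly automated for the common case of UTF-8 misread as a single-byte encoding. The following is a minimal sketch, not a general repair tool; `repair_mojibake` and its default candidate list are assumptions chosen for illustration. It reverses the bad decode by encoding the garbled text with each candidate codec and checking whether the result is valid UTF-8.

```python
from typing import Optional, Tuple


def repair_mojibake(
    garbled: str, candidates: Tuple[str, ...] = ("cp1252", "latin-1")
) -> Optional[Tuple[str, str]]:
    """Try to undo 'UTF-8 decoded with the wrong codec'.

    Returns (repaired_text, wrong_codec) for the first candidate that
    round-trips to different, valid UTF-8 text, or None if none does.
    """
    for codec in candidates:
        try:
            # Recover the original bytes, then decode them correctly.
            fixed = garbled.encode(codec).decode("utf-8")
        except UnicodeError:
            continue  # this codec cannot explain the garbling
        if fixed != garbled:
            return fixed, codec
    return None


print(repair_mojibake("Ã©"))
print(repair_mojibake("â€™"))
print(repair_mojibake("plain ASCII"))
```

Text that is already correct either fails the round trip or comes back unchanged, so the function returns `None` instead of damaging it. This only treats the symptom; the root cause in step 6 still needs fixing.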
of encoding bugs I've seen are caused by mixing UTF-8 and Latin-1
That stat is from personal experience, but it aligns with internet lore. The fix is almost always to enforce UTF-8 everywhere: in editors, databases, HTTP headers, and file formats. If you can't, be explicit about the encoding at every boundary.
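In Python, being explicit at a file boundary mostly means never relying on the platform's default encoding. The sketch below assumes nothing beyond the standard library; `save_text` and `load_text` are hypothetical wrapper names.

```python
import tempfile
from pathlib import Path


def save_text(path: Path, text: str) -> None:
    """Write text as UTF-8 regardless of the platform default."""
    path.write_text(text, encoding="utf-8")


def load_text(path: Path) -> str:
    """Read UTF-8 text; invalid bytes raise UnicodeDecodeError instead of
    being silently misread."""
    return path.read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    save_text(sample, "café ’ 文字")
    roundtrip = load_text(sample)
    raw = sample.read_bytes()

print(roundtrip)
print(raw.hex(" "))
```

Failing loudly on invalid bytes is the point of the strict default: a decode error at the boundary is far easier to debug than mojibake discovered three systems downstream.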
Tools of the Trade
- `xxd` – hex dump any file or pipe.
- `file -i` – quick encoding guess with magic bytes.
- `chardet` – Python library for statistical encoding detection.
- `iconv` – convert between encodings from the command line.
- `sed` – strip BOM or replace byte sequences.
- `vim` with `:set fileencoding=utf-8` – fix encoding on save.
- `notepad++` (Windows) – Encoding menu shows current encoding and can convert.
The most dangerous encoding bug is the one you can't see: data that looks correct but is actually stored in the wrong encoding, silently corrupting every downstream consumer.
Character encoding problems are a rite of passage. The next time you see '’' in your logs, remember: it's not magic, it's just bytes. And bytes can be fixed.
Frequently asked questions
What is mojibake and how do I recognize it?
Mojibake is garbled text caused by decoding bytes with the wrong encoding. For example, UTF-8 bytes interpreted as Latin-1 produce strings like 'é' instead of 'é'. The specific garbled pattern often reveals the original encoding.
How do I find the encoding of a text file without metadata?
Use `file -i` on Linux or `chardetect` (from Python's chardet package) to guess. To inspect the bytes directly, dump hex with `xxd file.txt | head` and look for BOM bytes (EF BB BF for UTF-8) or telltale patterns such as a null byte next to every ASCII character (UTF-16).
What is a BOM and why is it problematic?
A BOM (Byte Order Mark) is the Unicode character U+FEFF, used to signal endianness. In UTF-8 it's optional and often causes issues: some consumers (e.g., JSON parsers, shell interpreters) fail when the first bytes are EF BB BF. Strip it with GNU sed: `sed -i '1s/^\xEF\xBB\xBF//' file`.
How do I prevent encoding issues in a web application?
Set Content-Type header to charset=utf-8, use UTF-8 in HTML meta tags, and validate user input. At the database level, ensure tables use utf8mb4 (MySQL) or UTF-8 (PostgreSQL). Never trust the client's encoding declaration.