Software Salariman: Fix Incorrectly Encoded Unicode Files with Python

2008-05-24

Fix Incorrectly Encoded Unicode Files with Python

The Problem

We had a lot of text files committed into our CVS repository as Unicode format. When these files were checked out later, we found that they weren't really text files nor Unicode files because CVS had only prepended two bytes to the start of these files, FF FE, but left only one byte for encoding each character. Some text editors such as Vim could open these files but other applications such as Notepad and Excel showed only gibberish.

Unicode Encoded Text in Files

Unicode is an encoding standard … for processing, storage and interchange of text data in any language. For the purpose of fixing this problem, we just have to know how to identify and write valid Unicode files.

We use two tools to experiment and visualize the effect of different encoding methods:

Microsoft Notepad editor, because it can save text files using different encoding methods.
GnuWin32 od utility to output the data in a file as byte values.

Open Notepad and enter this text: Hello World. Select the File / Save As menu item. In the Save As dialog, there are four encoding methods in the Encoding drop down list: ANSI, Unicode, Unicode big endian and UTF-8. Save the same text using each of the encoding methods into four files, say TestANSI.txt, TestUnicode.txt, TestUnicodeBigEndian.txt and TestUTF8.txt, respectively.

Examine the contents of each file using od:

>od -A x -t x1 HelloANSI.txt
000000 48 65 6c 6c 6f 20 57 6f 72 6c 64
00000b

>od -A x -t x1 HelloUnicode.txt
000000 ff fe 48 00 65 00 6c 00 6c 00 6f 00 20 00 57 00
000010 6f 00 72 00 6c 00 64 00
000018

>od -A x -t x1 HelloUnicodeBigEndian.txt
000000 fe ff 00 48 00 65 00 6c 00 6c 00 6f 00 20 00 57
000010 00 6f 00 72 00 6c 00 64
000018

>od -A x -t x1 HelloUTF8.txt
000000 ef bb bf 48 65 6c 6c 6f 20 57 6f 72 6c 64
00000e

The ANSI encoded file contains 11 bytes representing the characters you typed. The Unicode encoded files contain 24 bytes, starting with a two-byte BOM and using two bytes to represent each character. If the first two bytes are FF FE, then the two bytes are stored in low-byte, high-byte order. Conversely, if the first two bytes are FE FF, then the two bytes are stored in high-byte, low-byte order. Finally, when a file starts with byte EF BB BF, only one byte is used to encode each ANSI character and two or more bytes are used to encode non-ANSI characters (not demonstrated).

Fixing Incorrectly Encoded Files in Python

Now we know the format of a Unicode encoded file: it starts with FF FE and stores each character in low-byte, high-byte order. Our text files in CVS just have ANSI characters, so we just have to insert a 0 byte between each character, starting from the third byte. Julian W. wrote a short Python script that to do this. I don't have his code right now, so here's my version for correcting the Unicode encoding for a file:

import codecs
raw = map(ord, file(r'HelloBadUnicode.txt').read())
if raw[0] == 255 and raw[1] == 254 and raw[3] != 0:
  output = codecs.open(r'HelloFixedUnicode.txt', 'w', 'UTF-16')
  for i in raw[2:]:
    output.write(chr(i))
  output.close()

References

Unicode Consortium's FAQ on UTF-8, UTF-16, UTF-32 & BOM.
Wikipedia's Byte-order mark.

Postscript

I started with a more complicated piece of Python code using lists and generators:

from itertools import repeat
from operator import concat

raw = map(ord, file(r'HelloBadUnicode.txt').read())
if raw[0] == 255 and raw[1] == 254 and raw[3] != 0:
  output = file(r'HelloFixedUnicode.txt','w')
  output.write(chr(255))
  output.write(chr(254))
  for i in reduce(concat, zip(raw[2:], repeat(0, len(raw)-2))):
    output.write(chr(i))
  output.close()

But then I realised I just had to write a 0 byte after each ANSI character, so here's a simpler version:

raw = map(ord, file(r'HelloBadUnicode.txt').read())
if raw[0] == 255 and raw[1] == 254 and raw[3] != 0:
  output = file(r'HelloFixedUnicode.txt','w')
  output.write(chr(255))
  output.write(chr(254))
  for i in raw[2:]:
    output.write(chr(i))
    output.write(chr(0))
  output.close()

2008-05-25. I remembered that Python had no problems with writing Unicode files, resulting in the even simpler code in the body of this article.

Software Salariman