The Problem
We had a lot of text files committed into our CVS repository in Unicode format. When these files were checked out later, we found that they were neither plain text files nor valid Unicode files: CVS had merely prepended two bytes, FF FE, to the start of each file, but still used only one byte to encode each character. Some text editors, such as Vim, could open these files, but other applications, such as Notepad and Excel, showed only gibberish.
Unicode Encoded Text in Files
Unicode is an encoding standard … for processing, storage and interchange of text data in any language. For the purpose of fixing this problem, we just have to know how to identify and write valid Unicode files.
We use two tools to experiment with and visualize the effect of the different encoding methods:
- The Microsoft Notepad editor, because it can save text files using different encoding methods.
- The GnuWin32 od utility, because it can display the bytes in a file as hexadecimal values.
Open Notepad and enter this text: Hello World. Select the File / Save As menu item. In the Save As dialog, there are four encoding methods in the Encoding drop-down list: ANSI, Unicode, Unicode big endian and UTF-8. Save the same text using each of the encoding methods into four files, say HelloANSI.txt, HelloUnicode.txt, HelloUnicodeBigEndian.txt and HelloUTF8.txt, respectively.
Examine the contents of each file using od:
>od -A x -t x1 HelloANSI.txt
000000 48 65 6c 6c 6f 20 57 6f 72 6c 64
00000b
>od -A x -t x1 HelloUnicode.txt
000000 ff fe 48 00 65 00 6c 00 6c 00 6f 00 20 00 57 00
000010 6f 00 72 00 6c 00 64 00
000018
>od -A x -t x1 HelloUnicodeBigEndian.txt
000000 fe ff 00 48 00 65 00 6c 00 6c 00 6f 00 20 00 57
000010 00 6f 00 72 00 6c 00 64
000018
>od -A x -t x1 HelloUTF8.txt
000000 ef bb bf 48 65 6c 6c 6f 20 57 6f 72 6c 64
00000e
The ANSI encoded file contains 11 bytes, one for each character you typed. The Unicode encoded files contain 24 bytes each: a two-byte byte order mark (BOM) followed by two bytes for each character. If the first two bytes are FF FE, then each character is stored in low-byte, high-byte order. Conversely, if the first two bytes are FE FF, then each character is stored in high-byte, low-byte order. Finally, when a file starts with the three bytes EF BB BF, it is UTF-8 encoded: one byte encodes each ASCII character and two or more bytes encode each remaining character (not demonstrated here).
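These BOM signatures make it easy to tell the formats apart programmatically. Here is a minimal sketch in the same Python 2 style as the rest of this article, assuming one of the three signatures above (the file name is just a placeholder):

# Classify a text file by its byte order mark; the file name
# is a placeholder for whatever file you want to inspect.
head = file(r'HelloUnicode.txt', 'rb').read(3)
if head[:2] == '\xff\xfe':
    print 'Unicode (UTF-16, low-byte first)'
elif head[:2] == '\xfe\xff':
    print 'Unicode big endian (UTF-16, high-byte first)'
elif head[:3] == '\xef\xbb\xbf':
    print 'UTF-8 with BOM'
else:
    print 'ANSI or unknown'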
Fixing Incorrectly Encoded Files in Python
Now we know the format of a Unicode encoded file: it starts with FF FE and stores each character in low-byte, high-byte order. Our text files in CVS contain only ANSI characters, so we just have to insert a 0 byte after each character byte, starting from the third byte. Julian W. wrote a short Python script to do this. I don't have his code right now, so here's my version for correcting the Unicode encoding of a file:
import codecs

# Read the raw bytes; binary mode avoids newline translation on Windows.
raw = map(ord, file(r'HelloBadUnicode.txt', 'rb').read())
# Broken files start with the FF FE mark but the fourth byte is not 0,
# so they cannot be valid two-byte-per-character Unicode.
if raw[0] == 255 and raw[1] == 254 and raw[3] != 0:
    # The codecs module writes the BOM and the byte pairs for us.
    output = codecs.open(r'HelloFixedUnicode.txt', 'w', 'UTF-16')
    for i in raw[2:]:
        output.write(unichr(i))  # unichr, so bytes above 127 survive
    output.close()
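To try the script without a broken file from CVS, you can fabricate one with the faulty layout described above; this is just a sketch that reuses the same placeholder file name:

# Reproduce the malformed layout: the FF FE prefix followed by
# only one byte per character.
bad = file(r'HelloBadUnicode.txt', 'wb')
bad.write('\xff\xfe' + 'Hello World')
bad.close()

Running the fix on this file and dumping HelloFixedUnicode.txt with od should produce the same bytes as the HelloUnicode.txt output shown earlier.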
References
- Unicode Consortium's FAQ on UTF-8, UTF-16, UTF-32 & BOM.
- Wikipedia's Byte-order mark.
Postscript
I started with a more complicated piece of Python code using lists and generators:
from itertools import repeat
from operator import concat

raw = map(ord, file(r'HelloBadUnicode.txt', 'rb').read())
if raw[0] == 255 and raw[1] == 254 and raw[3] != 0:
    output = file(r'HelloFixedUnicode.txt', 'wb')
    output.write(chr(255))  # write the FF FE byte order mark
    output.write(chr(254))
    # Interleave a 0 byte after every character byte.
    for i in reduce(concat, zip(raw[2:], repeat(0, len(raw) - 2))):
        output.write(chr(i))
    output.close()
But then I realised I just had to write a 0 byte after each ANSI character, so here's a simpler version:
raw = map(ord, file(r'HelloBadUnicode.txt', 'rb').read())
if raw[0] == 255 and raw[1] == 254 and raw[3] != 0:
    output = file(r'HelloFixedUnicode.txt', 'wb')
    output.write(chr(255))  # write the FF FE byte order mark
    output.write(chr(254))
    for i in raw[2:]:
        output.write(chr(i))
        output.write(chr(0))  # pad each character to two bytes
    output.close()
2008-05-25. I remembered that Python has no problem writing Unicode files directly, which led to the even simpler codecs version in the body of this article.