Archive for August 31, 2006

Soft hyphen broke my XML – a warning about encoding

Suppose you type “Pesto - a recipe” in Microsoft Word. It will automatically replace your hyphen with a longer dash (an en dash, which is easy to mistake for a soft hyphen). You can store that string, dash included, in a database field in MS SQL. Now suppose you write a program that extracts the string from the database, returns it as XML, and tries to display it in the browser. The browser will complain that the XML contains an invalid character. If you use an XML parser, you’ll get an exception.
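The failure is easy to reproduce without a database. Here is a minimal sketch in Python, assuming the extracted string reaches the parser as raw bytes with no encoding declaration. The `<title>` element and the `parses` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The dash as it comes out of the database: a single 0x96 byte.
raw_xml = b"<title>Pesto \x96 a recipe</title>"
plain_xml = b"<title>Pesto - a recipe</title>"  # ordinary ASCII hyphen


def parses(data: bytes) -> bool:
    """Return True if the bytes are well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


print(parses(raw_xml))    # the version with the 0x96 byte
print(parses(plain_xml))  # the version with a plain hyphen
```

Only the version with the plain hyphen should parse, since the parser assumes UTF-8 when nothing else is declared.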

Dumping the XML file with hexdump (an open source program) reveals the problem. A soft hyphen would be 0xAD, but the dash shows up as 0x96, which is the en dash in Windows-1252, the “ANSI” code page that is the default on Western-locale Windows and in many MS SQL installations. While 0x96 is a displayable character in ANSI, it is not one in Unicode: U+0096 is a control character, and a lone 0x96 byte is not even a valid UTF-8 sequence. Since XML is read as UTF-8 unless it declares another encoding, the 0x96 blows up the viewers and parsers.
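The byte-level story can be checked directly. This is a small sketch using Python's built-in codecs, where `cp1252` is Python's name for the Windows-1252 code page; the variable names are only for illustration.

```python
dash_byte = b"\x96"

# Windows-1252 ("ANSI") maps byte 0x96 to the en dash, U+2013.
as_cp1252 = dash_byte.decode("cp1252")
print(hex(ord(as_cp1252)))

# A soft hyphen is a different character: U+00AD, byte 0xAD in Windows-1252.
soft_hyphen_byte = "\u00ad".encode("cp1252")
print(soft_hyphen_byte)

# A lone 0x96 is a continuation byte with no lead byte, so UTF-8 rejects it.
try:
    dash_byte.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)

# Encoded properly as UTF-8, the same en dash takes three bytes.
utf8_bytes = as_cp1252.encode("utf-8")
print(utf8_bytes.hex())
```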

Be really careful about copying and pasting from Microsoft Word into the MS SQL database. Make sure what you paste is a plain hyphen (0x2D) and not the character Word substituted for it.
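If the pasted data is already in the database, another option is to transcode it when generating the XML. The sketch below is one possible approach, not a definitive fix. It assumes the column holds Windows-1252 bytes, and `build_title_xml` and the `<title>` element are hypothetical names.

```python
import xml.etree.ElementTree as ET


def build_title_xml(db_bytes: bytes) -> bytes:
    """Wrap a database value in a <title> element (hypothetical helper)."""
    # Assumption: the column stores Windows-1252 ("ANSI") bytes.
    text = db_bytes.decode("cp1252")
    element = ET.Element("title")
    element.text = text
    # Serialize as UTF-8 so the en dash becomes a valid multi-byte sequence.
    return ET.tostring(element, encoding="utf-8")


xml_bytes = build_title_xml(b"Pesto \x96 a recipe")

# The output now parses, and the dash survives as U+2013.
round_trip = ET.fromstring(xml_bytes).text
print(round_trip == "Pesto \u2013 a recipe")
```

Decoding with the encoding the data was actually stored in, and only then writing UTF-8, keeps the dash instead of forcing everyone to retype it as a hyphen.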

August 31, 2006 at 3:36 am 3 comments

