Soft hyphen broke my XML – a warning about encoding

August 31, 2006

Suppose you type “Pesto – a recipe” in Microsoft Word. It will automatically replace your dash with a soft hyphen. You can store that string, soft hyphen included, in a database field in MS SQL. Now suppose you write a program that extracts the string from the database, returns it as XML, and displays it in the browser. The browser will complain that the XML contains an invalid character, and if you use an XML parser, you’ll get an exception.
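The parser failure can be reproduced in a few lines. This is only a sketch in Python, assuming the stored dash reaches the output as one raw byte while the document declares UTF-8. The element name `title` and the payload are made up for illustration, not taken from the original program.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: declared as UTF-8, but the dash is the single raw byte 0x96.
payload = b'<?xml version="1.0" encoding="UTF-8"?><title>Pesto \x96 a recipe</title>'

try:
    ET.fromstring(payload)
    parsed = True
except ET.ParseError as err:
    # The parser rejects the byte because it is not part of a valid UTF-8 sequence.
    parsed = False
    print("parser rejected the document:", err)
```

Other parsers will word the error differently, but any conforming XML parser should refuse this input.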

Dumping the XML file using hexdump (an open source program) reveals the problem. One expects a soft hyphen to be 0xAD; instead the character shows up as 0x96. While 0x96 is a displayable character in ANSI (code page 1252 on Western-language Windows, which is the default encoding in Windows and MS SQL), it is not a displayable character in Unicode, where U+0096 is a control character. A lone 0x96 byte is also not valid UTF-8. Since most XML is in UTF-8, the 0x96 blows up the viewers and parsers.
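The byte values can be checked directly. The sketch below assumes that Python's `cp1252` codec is a fair stand-in for what the post calls ANSI; the byte string is a made-up sample. It shows that 0x96 decodes to U+2013 (the en dash, which agrees with the first comment below), that the soft hyphen really is 0xAD, and that decoding before re-encoding produces valid UTF-8.

```python
raw = b"Pesto \x96 a recipe"  # sample bytes as they might come out of a code page 1252 column

# Code page 1252 maps 0x96 to U+2013 (en dash) and 0xAD to U+00AD (soft hyphen).
as_cp1252 = raw.decode("cp1252")
dash = as_cp1252[6]
soft_hyphen = b"\xad".decode("cp1252")

# The same bytes are not valid UTF-8: 0x96 is a continuation byte with no lead byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the right code page and re-encoding gives a proper UTF-8 sequence.
fixed = as_cp1252.encode("utf-8")
```

So the practical fix is probably to transcode the text from the database's code page to UTF-8 before writing it into the XML, rather than passing the raw bytes through.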

Be very careful when copying and pasting from Microsoft Word into the MS SQL database. Make sure that what you are copying and pasting is a dash and not a soft hyphen.
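One way to see what was actually pasted is to list every non-ASCII character with its Unicode name before the text goes into the database. This is a rough sketch; the helper `describe_non_ascii` and the sample strings are hypothetical and only illustrate the check.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (index, code point, Unicode name) for every non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

# Sample input, assuming the word processor swapped the typed hyphen for U+2013.
pasted = "Pesto \u2013 a recipe"
report = describe_non_ascii(pasted)
print(report)
```

A plain ASCII hyphen produces an empty report, so anything listed here deserves a second look before it is stored.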

Entry filed under: Character Encoding (Unicode), MS SQL Server.

2 Comments

  • 1. m  |  June 27, 2007 at 4:11 am

    What Microsoft creates typographically for “Pesto — a recipe” is not a soft hyphen but an en-dash or an em-dash (I don’t know what MS produces here). A soft hyphen is an optional hyphen that doesn’t show up unless the (long) word is broken over into the next line.

  • 2. murki  |  May 19, 2009 at 2:11 pm

    cool, this helped me out of misery! ;)
