Soft hyphen broke my XML – a warning about encoding
August 31, 2006
Suppose you type “Pesto – a recipe” in Microsoft Word. It will automatically replace your dash with a soft hyphen. You can store that string, soft hyphen included, in a database field in MS SQL. Now suppose you write a program that extracts the string from the database, returns it as XML, and attempts to display it in the browser. The browser will complain that the XML contains an invalid character. If you use an XML parser, you’ll get an exception.
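The failure can be reproduced without Word or a database. The following is a minimal sketch in Python, assuming the string left the database as single-byte ANSI text and was wrapped in XML that claims to be UTF-8. The `payload` bytes, the `title` element, and the `try_parse` helper are made up for illustration and are not the original program.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: the title as single-byte ANSI text, wrapped in XML
# that declares UTF-8. The byte 0x96 is where Word's dash ended up.
payload = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<title>Pesto \x96 a recipe</title>"
)


def try_parse(data: bytes) -> str:
    """Return the element text, or the parser's error message on failure."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError as exc:
        return f"ParseError: {exc}"


# The stray byte makes the parser reject the whole document.
print(try_parse(payload))

# With a plain ASCII hyphen the same document parses fine.
print(try_parse(b"<title>Pesto - a recipe</title>"))
```

The parser rejects the whole document because of one byte, which matches the behavior described above.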
Dumping the XML file using hexdump (an open source program) reveals the problem. One expects the soft hyphen to be 0xAD; instead it shows up as 0x96. While 0x96 is a displayable character in ANSI (which is the default encoding in Windows and MS SQL), it is not a displayable character in Unicode, and a lone 0x96 byte is not valid UTF-8. Since most XML is in UTF-8, the 0x96 blows up the viewers and parsers.
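If hexdump is not at hand, the same inspection can be done in a few lines of Python. This is only a sketch, and it assumes that "ANSI" here means the Windows-1252 code page (`cp1252` in Python). The `raw` bytes are a hand-made sample, not the original dump.

```python
import unicodedata

raw = b"Pesto \x96 a recipe"  # sample bytes, as they would appear in a hex dump

# Interpreted as Windows-1252, 0x96 is a printable character.
text = raw.decode("cp1252")
dash = text[6]  # the character right after "Pesto "
print(hex(ord(dash)), unicodedata.name(dash))

# A real soft hyphen (U+00AD) would be the single byte 0xAD in that code page.
print("\u00ad".encode("cp1252"))

# Interpreted as UTF-8, a lone 0x96 cannot be decoded at all.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)

# The same character, correctly encoded as UTF-8, takes three bytes.
print(dash.encode("utf-8"))
```

Under that assumption, the byte decodes to U+2013, which Unicode names EN DASH, so the character Word inserted appears to be an en dash, not a soft hyphen.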
Be very careful when copying and pasting from Microsoft Word into the MS SQL database. Be sure that what you are copying and pasting is a dash and not a soft hyphen.
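Besides watching what gets pasted, the program that builds the XML can decode the database bytes explicitly and let the XML library do the encoding. The sketch below assumes the field arrives as Windows-1252 bytes; many database drivers already hand back decoded strings, in which case only the second half applies. The function name `title_to_xml` and the `title` element are hypothetical.

```python
import xml.etree.ElementTree as ET


def title_to_xml(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode the stored bytes, then serialize them as UTF-8 XML."""
    text = raw.decode(source_encoding)  # bytes -> Unicode string
    elem = ET.Element("title")          # hypothetical element name
    elem.text = text
    return ET.tostring(elem, encoding="utf-8")  # Unicode -> UTF-8 bytes


xml_bytes = title_to_xml(b"Pesto \x96 a recipe")

# The output parses cleanly and the dash survives the round trip.
roundtrip = ET.fromstring(xml_bytes).text
print(roundtrip)
```

The idea is to never copy raw bytes from one encoding into a document that declares another. Decode once on the way in, encode once on the way out.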
Entry filed under: Character Encoding (Unicode), MS SQL Server.
2 Comments
1.
m | June 27, 2007 at 4:11 am
What Microsoft creates typographically for “Pesto — a recipe” is not a soft hyphen but an en-dash or an em-dash (I don’t know what MS produces here). A soft hyphen is an optional hyphen that doesn’t show up unless the (long) word is broken over into the next line.
2.
murki | May 19, 2009 at 2:11 pm
cool, this helped me out of misery!