Soft hyphen broke my XML – a warning about encoding

August 31, 2006 at 3:36 am 3 comments

Suppose you type “Pesto – a recipe” in Microsoft Word. Word automatically replaces the hyphen you typed with a typographic dash. I first took it for a soft hyphen, but it is really an en dash. You can store that string, dash included, in a database field in MS SQL. Now suppose you write a program that extracts the string from the database and returns it as XML, and you try to display it in the browser: the browser will complain that the XML contains an invalid character. If you use an XML parser, you’ll get an exception.
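The failure is easy to reproduce. Here is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` as a stand-in for whatever parser you use. The payload and the `try_parse` helper are made up for illustration: the dash is stored as a single Windows-1252 byte inside a document that declares itself UTF-8.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: the dash is the single byte 0x96 (Windows-1252),
# but the XML declaration claims the document is UTF-8.
bad_payload = b'<?xml version="1.0" encoding="UTF-8"?><title>Pesto \x96 a recipe</title>'

# The same title with the dash properly encoded as UTF-8.
good_payload = '<?xml version="1.0" encoding="UTF-8"?><title>Pesto \u2013 a recipe</title>'.encode("utf-8")


def try_parse(data: bytes) -> str:
    """Return the element text, or a short message if the parser rejects the input."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError as exc:
        return f"parse error: {exc}"


bad_result = try_parse(bad_payload)
good_result = try_parse(good_payload)
print(bad_result)
print(good_result)
```

The first call should report a parse error, while the second returns the title with its dash intact.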

Dumping the XML file with hexdump (an open source program) reveals the problem. One expects a soft hyphen to be 0xAD; instead, the character shows up as 0x96, which is the en dash in Windows-1252. While 0x96 is a displayable character in ANSI (Windows-1252 in this case, the usual default encoding in Windows and MS SQL), a lone 0x96 byte is not valid UTF-8, and the Unicode code point U+0096 is a control character rather than a dash. Since most XML is in UTF-8, the 0x96 blows up the viewers and parsers.
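If you don't have a hex dump tool handy, the same check can be done in a few lines of Python. This is only a sketch, and the `raw` bytes are an assumed example of what the dump might contain, not the actual data from my database.

```python
# Assumed example bytes: the title as it might appear in the hex dump.
raw = b"Pesto \x96 a recipe"

# Under Windows-1252, byte 0x96 decodes to the en dash, U+2013.
as_cp1252 = raw.decode("cp1252")
dash = as_cp1252[6]
print(hex(ord(dash)))

# Under UTF-8, the same bytes are rejected, because 0x96 cannot start a character.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)

# For comparison, a real soft hyphen (U+00AD) is byte 0xAD in Windows-1252.
soft_hyphen_byte = "\u00ad".encode("cp1252")
print(soft_hyphen_byte)
```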

Be really careful about copying and pasting from Microsoft Word into the MS SQL database. Make sure what you are pasting is a plain hyphen and not the dash Word substituted for it.
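If the dash is already in the database, another option is to convert the text on the way out. The sketch below assumes the stored bytes are Windows-1252 and uses a hypothetical helper, `to_utf8_xml`, which is not what my original program did. It decodes with the encoding the data was actually written in, then lets the XML library encode the output as UTF-8.

```python
import xml.etree.ElementTree as ET


def to_utf8_xml(raw_title: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes using their real encoding, then serialize them as UTF-8 XML.

    Hypothetical helper; the element name "title" is an assumption.
    """
    title = raw_title.decode(source_encoding)
    root = ET.Element("title")
    root.text = title
    return ET.tostring(root, encoding="utf-8")


xml_bytes = to_utf8_xml(b"Pesto \x96 a recipe")
print(xml_bytes)

# The result parses cleanly and still contains the en dash.
round_trip = ET.fromstring(xml_bytes).text
print(round_trip == "Pesto \u2013 a recipe")
```

This only works if you know the source encoding. Guessing wrong just swaps one wrong character for another.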


Entry filed under: Character Encoding (Unicode), MS SQL Server.


3 Comments

  • 1. m  |  June 27, 2007 at 4:11 am

    What Microsoft creates typographically for “Pesto — a recipe” is not a soft hyphen but an en-dash or an em-dash (I don’t know what MS produces here). A soft hyphen is an optional hyphen that doesn’t show up unless the (long) word is broken over into the next line.

  • 2. murki  |  May 19, 2009 at 2:11 pm

    cool, this helped me out of misery! 😉

  • 3. Editor  |  January 14, 2010 at 11:09 am

    Great Article and glad to have come across it. I never knew about this issue until I hit it when creating a feed for my site. All the news articles displayed perfectly across all browsers with HTML but everything went wrong when viewing the XML feeds.
    Thanks for this mystery-solving article.

