Soft hyphen broke my XML – a warning about encoding

Suppose you type “Pesto – a recipe” in Microsoft Word. It will automatically replace your hyphen with a longer dash (strictly an en dash, though it is easy to take for a soft hyphen). You can store that string, dash included, in a database field in MS SQL. Now suppose you write a program that extracts the string from the database, returns it as XML, and displays it in the browser: the browser will complain that the XML contains an invalid character. If you use an XML parser, you’ll get an exception.

Dumping the XML file with hexdump (an open source program) reveals the problem. One expects a soft hyphen to be 0xAD; instead the character shows up as 0x96, which is the en dash in Windows-1252. While 0x96 is a displayable character in ANSI (the default encoding in Windows and MS SQL), it is not a displayable character in Unicode, and a lone 0x96 byte is not even valid UTF-8. Since most XML is in UTF-8, the 0x96 blows up the viewers and parsers.
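
The byte values are easy to check. Here is a minimal Python sketch, assuming that ANSI means Windows-1252 (the Western European default); the sample XML string is made up for illustration and is not the original data:

```python
import xml.etree.ElementTree as ET

# "Pesto <0x96> a recipe" as it might come out of an ANSI (Windows-1252) column.
raw = b"<title>Pesto \x96 a recipe</title>"

# In Windows-1252, 0x96 is the en dash (U+2013); the soft hyphen is 0xAD.
dash = b"\x96".decode("cp1252")
soft_hyphen_byte = "\u00ad".encode("cp1252")

# With no encoding declaration the parser assumes UTF-8, where a lone
# 0x96 byte is invalid, so parsing fails.
try:
    ET.fromstring(raw)
    parse_failed = False
except ET.ParseError:
    parse_failed = True

# Decoding as Windows-1252 and re-encoding as UTF-8 yields well-formed XML.
fixed = raw.decode("cp1252").encode("utf-8")
title = ET.fromstring(fixed).text
```

Re-encoding at the point where the program builds the XML is one way out; declaring the real encoding in the XML declaration is another, but that depends on every consumer supporting it.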

Be really careful about copying and pasting from Microsoft Word into the MS SQL database. Make sure what you paste is a plain hyphen and not the character Word substituted for it.
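
Since these characters are hard to tell apart by eye, a small check before inserting can help. This is only a sketch; `find_suspect_chars` is a hypothetical helper name, and flagging everything outside ASCII is an assumption that may be too strict for text that legitimately contains accented letters:

```python
import unicodedata

def find_suspect_chars(text):
    """Return (position, code point, Unicode name) for every non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 0x7F
    ]

# The dash Word substitutes is reported by name, so it cannot hide.
report = find_suspect_chars("Pesto \u2013 a recipe")
clean = find_suspect_chars("Pesto - a recipe")
```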

August 31, 2006

osql converted characters during execution

Suppose you have a SQL script that inserts values into an MS SQL database. You had previously tested the script in Query Analyzer, but when you ran it with osql you noticed that some of your characters had been converted. For example, that soft hyphen has mysteriously turned into a “û”.

Windows uses several different code pages. There is ANSI, which is used by Windows and appears to be the default on SQL Server 2000. DOS uses OEM. There is also Unicode.

The problem is that when you run osql from the command line or from a cmd file, it automatically assumes that the input file is in OEM format. If your file is actually in ANSI, osql converts the characters as if it were going from OEM to ANSI. This can turn some characters, such as that soft hyphen, into something else.
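
The conversion can be reproduced outside osql. A minimal Python sketch, assuming the OEM code page is 437 (the US default; yours may differ) and ANSI is Windows-1252:

```python
# The byte as saved in the ANSI script file.
ansi_byte = b"\x96"

# Step 1: the byte is misread as OEM. In code page 437, 0x96 is "û".
as_oem = ansi_byte.decode("cp437")

# Step 2: the misread character is converted to ANSI and stored as 0xFB.
stored = as_oem.encode("cp1252")
```

So the byte that was a dash in the ANSI file ends up in the database as “û”.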

The solution is to save the file in Unicode.
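
If the script is generated by a program rather than saved from an editor, the same thing can be done in code. The sketch below assumes that “Unicode” means what Windows tools usually mean by it, UTF-16 little-endian with a byte order mark, and that osql detects it from that mark; the table name, file name, and `save_as_unicode` helper are hypothetical:

```python
import codecs
import os
import tempfile

def save_as_unicode(path, sql_text):
    """Write sql_text as UTF-16 LE with a leading byte order mark."""
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF16_LE)
        f.write(sql_text.encode("utf-16-le"))

# Hypothetical script containing the dash that Word inserted.
script = "INSERT INTO recipes (title) VALUES ('Pesto \u2013 a recipe');\r\n"
path = os.path.join(tempfile.gettempdir(), "insert_recipe.sql")
save_as_unicode(path, script)

# Read the raw bytes back to confirm the byte order mark is in place.
with open(path, "rb") as f:
    data = f.read()
```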

August 30, 2006


