Tuesday, May 20, 2008

Characters are not always what they seem

We recently ported our application from Linux to Solaris, and suddenly the pound signs on the web front end turned into ? marks. The cause was that Solaris used a different default character set, one that cannot represent the pound sign. Since we were not allowed to change the character set of the operating system, we had to make the changes in our code.

While tracking down the issue, a developer logged the pound sign at various points in the code. Every one of them appeared as a ? mark, so it seemed that the pound sign was being converted the moment it was created. This turned out to be a wild goose chase. The byte value for the pound sign was correct. The conversion happened only when the character was written to the screen or log file, using the operating system's default character set. It seems obvious now that this would happen, but it was not obvious at the time.
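One way to avoid this kind of wild goose chase is to log the numeric code points rather than the characters themselves, since plain ASCII digits survive any console or log file encoding. The following is only a minimal sketch; the class and method names (CodePointLog, describe) are made up for illustration and are not from our application.

```java
final class CodePointLog {
    // Render each character as U+XXXX so the log line contains only ASCII
    // and cannot be altered by the platform's default character set.
    static String describe(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // "\u00A3" is the pound sign, written as an escape so the source
        // file's own encoding does not matter.
        System.out.println(describe("\u00A35")); // U+00A3 U+0035
    }
}
```

If the log shows U+00A3, the string in memory is correct and the ? is being introduced later, at output time.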

It is very important to always specify the character set when converting byte values to characters, and when converting characters back to bytes. If you are not sure which character set to use, UTF-8 should be adequate for most applications.
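To make this concrete, here is a small sketch of what the pound sign looks like as bytes in three character sets. The class name PoundSignBytes and its helpers are hypothetical. US-ASCII stands in for a default character set that cannot represent the pound sign; I am not claiming it is exactly what our Solaris machine used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class PoundSignBytes {
    // Encode text with an explicit charset and show the bytes as hex.
    static String hex(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            out.append(String.format("%02X", b & 0xFF));
        }
        return out.toString();
    }

    // Decode bytes with an explicit charset instead of the platform default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String pound = "\u00A3";
        System.out.println(hex(pound, StandardCharsets.UTF_8));      // C2A3
        System.out.println(hex(pound, StandardCharsets.ISO_8859_1)); // A3
        // US-ASCII has no pound sign, so getBytes substitutes '?' (0x3F).
        System.out.println(hex(pound, StandardCharsets.US_ASCII));   // 3F
    }
}
```

The calls new String(bytes) and getBytes() without a charset argument fall back to the platform default, which is how the same code can behave differently on two operating systems.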

In Java web applications you also need to ensure that the HttpServletRequest's character set is set; otherwise some characters sent to the server in GET or POST requests will not be converted correctly. This needs to be done before reading any parameter values from the request.
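In a servlet or filter the call is request.setCharacterEncoding("UTF-8"), made before the first getParameter call. As far as I understand the Servlet API, that setting applies to the request body (POST data), while the decoding of the query string in a GET request is usually a container configuration setting. The sketch below does not use the servlet API at all. It uses java.net.URLDecoder from the standard library to show what goes wrong when a percent-encoded parameter is decoded with the wrong character set. The class name ParamDecoding is made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class ParamDecoding {
    // Decode a percent-encoded parameter value with an explicit charset.
    static String decode(String rawValue, String charsetName) {
        try {
            return URLDecoder.decode(rawValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // "£5" as a browser would send it from a UTF-8 page.
        String raw = "%C2%A35";
        // Decoded as UTF-8 the value is correct.
        System.out.println(decode(raw, "UTF-8").equals("\u00A35"));            // true
        // Decoded as ISO-8859-1 an extra character (U+00C2) appears.
        System.out.println(decode(raw, "ISO-8859-1").equals("\u00C2\u00A35")); // true
    }
}
```

The results are compared rather than printed, so the demonstration itself does not depend on the console's character set.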

The other common places to set the character set are when reading and writing files, and when reading and writing data over a network connection.
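For files, a minimal sketch might look like the following. It writes text to a temporary file as UTF-8 and reads it back with a charset chosen by the caller, so you can see the effect of a mismatch. The class and method names are hypothetical, and the temporary file is only there to keep the example self-contained.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class CharsetFileRoundTrip {
    // Write text as UTF-8, then read the same bytes back using readAs.
    static String writeUtf8ThenRead(String text, Charset readAs) {
        try {
            Path file = Files.createTempFile("charset-demo", ".txt");
            try {
                // Explicit charset on the writer, never the platform default.
                try (BufferedWriter writer =
                         Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                    writer.write(text);
                }
                // Explicit charset when turning the bytes back into characters.
                return new String(Files.readAllBytes(file), readAs);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String price = "\u00A35";
        // Same charset on both sides: the text survives.
        System.out.println(
            writeUtf8ThenRead(price, StandardCharsets.UTF_8).equals(price));      // true
        // Mismatched charset: the text comes back changed.
        System.out.println(
            writeUtf8ThenRead(price, StandardCharsets.ISO_8859_1).equals(price)); // false
    }
}
```

The same idea applies to a network connection: wrap the stream with new InputStreamReader(stream, StandardCharsets.UTF_8) or new OutputStreamWriter(stream, StandardCharsets.UTF_8) rather than using the constructors that take no charset.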
