Your Mouth Says Windows-1255, But Your Eyes Say ISO-8859-1

I recently wrote an engine that fetches XML files from our clients’ servers over HTTP. One of our clients decided to encode the XML file with one encoding and serve it declaring another. This posed a problem for XDocument.

The client encoded their XML using Windows-1255 (Hebrew) and noted that encoding correctly in the XML’s declaration, but served the file with HTTP headers stating ISO-8859-1 (Latin-1). This meant that I couldn’t just use XDocument’s normal Load method to load directly from the stream, because XDocument looks at the HTTP headers and takes the document’s encoding from them.

Here’s a snippet of the code I used to work around that:

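using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    // Start from the charset stated in the HTTP response (ISO-8859-1 if none).
    Encoding encoding = Encoding.GetEncoding("ISO-8859-1");
    if (!string.IsNullOrEmpty(response.CharacterSet))
        encoding = Encoding.GetEncoding(response.CharacterSet);

    // ReadStream is a helper (not shown here) that reads the whole stream into a byte array.
    byte[] bytes = ReadStream(response.GetResponseStream());

    // Decode the XML with the response's charset.
    string xml = encoding.GetString(bytes);
    int endOfDeclaration = xml.IndexOf("?>", StringComparison.Ordinal);
    if (endOfDeclaration != -1)
    {
        // Parse just the declaration (plus a dummy root) to find the declared encoding.
        string decl = xml.Substring(0, endOfDeclaration + 2) + "<duperoot />";
        XDocument declDoc = XDocument.Parse(decl);
        string declaredName = declDoc.Declaration != null ? declDoc.Declaration.Encoding : null;

        // No encoding in the declaration: keep the text decoded with the response's charset.
        if (string.IsNullOrEmpty(declaredName))
            return xml;

        // Compare with Equals; == would only compare object references.
        Encoding docEncoding = Encoding.GetEncoding(declaredName);
        return docEncoding.Equals(encoding) ? xml : docEncoding.GetString(bytes);
    }
    else
    {
        // Not XML or something... Send up.
    }
}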

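What I did here was create a new document from the original XML’s declaration, add a dummy root element (the "duperoot" above) and parse that to get the name of the encoding the document declares. This works because the declaration consists only of ASCII characters, and those are encoded as the same bytes in both ISO-8859-1 and Windows-1255, so the declaration is readable even when the rest of the text was decoded with the wrong charset. I then use the declared encoding to decode the original bytes correctly.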
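The declaration-sniffing step can also be pulled out into a small standalone helper. This is only a sketch: the class and method names (DeclarationSniffer, GetDeclaredEncodingName) are made up for illustration, and it assumes the document starts directly with the XML declaration (no byte order mark) and uses an ASCII-compatible encoding.

```csharp
using System;
using System.Text;
using System.Xml.Linq;

public static class DeclarationSniffer
{
    // Hypothetical helper: returns the encoding name written in the XML
    // declaration, or null if there is no declaration or no encoding attribute.
    public static string GetDeclaredEncodingName(byte[] bytes)
    {
        // ISO-8859-1 maps every byte to exactly one character, so this decode
        // cannot fail and leaves the ASCII declaration intact.
        string text = Encoding.GetEncoding("ISO-8859-1").GetString(bytes);

        int end = text.IndexOf("?>", StringComparison.Ordinal);
        if (end == -1 || !text.StartsWith("<?xml", StringComparison.Ordinal))
            return null;

        // Parse only the declaration plus a dummy root element.
        XDocument declDoc = XDocument.Parse(text.Substring(0, end + 2) + "<duperoot />");
        return declDoc.Declaration != null ? declDoc.Declaration.Encoding : null;
    }
}
```

Passing the bytes of a document that begins with <?xml version="1.0" encoding="windows-1255"?> should give back the string "windows-1255", which can then be handed to Encoding.GetEncoding. On .NET Core and later, code page encodings such as Windows-1255 are only available to Encoding.GetEncoding after registering CodePagesEncodingProvider.Instance with Encoding.RegisterProvider.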

Note that I’m using ISO-8859-1 as the default encoding for the response, since that is the default HTTP/1.1 (RFC 2616) specifies for text media types when the response does not state a charset.
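The fallback logic can be made a little more defensive. The sketch below is not part of the original engine; TransportCharset.Resolve is a hypothetical helper that falls back to ISO-8859-1 both when the charset is missing and when the server sends a name the runtime does not recognize (Encoding.GetEncoding throws ArgumentException in that case).

```csharp
using System;
using System.Text;

public static class TransportCharset
{
    // Hypothetical helper: maps the charset from the HTTP response to an
    // Encoding, falling back to ISO-8859-1 when it is missing or unknown.
    public static Encoding Resolve(string characterSet)
    {
        Encoding fallback = Encoding.GetEncoding("ISO-8859-1");
        if (string.IsNullOrWhiteSpace(characterSet))
            return fallback;

        try
        {
            // Some servers wrap the charset value in quotes; strip them.
            return Encoding.GetEncoding(characterSet.Trim().Trim('"'));
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name.
            return fallback;
        }
    }
}
```

In the snippet above this would replace the first three statements inside the using block with a single call, TransportCharset.Resolve(response.CharacterSet).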
