[Expat-discuss] RE: Parser query
Dirk Dierckx
brc@fourlittlemice.com
Thu, 16 Aug 2001 08:40:10 +0200
As Fred L. Drake already pointed out, you get it UTF-8 encoded by default;
the alternative is to recompile Expat to work with UTF-16, since it has no
Latin-1 output mode. If you're like me and just want to retrieve your ß in
a char with value (dec 223), etc. without recompiling Expat or using 16-bit
wide chars (wchar_t), you can use the following function to convert from
UTF-8 to ANSI (0-255), that is, ISO-8859-1 (Latin-1).
#include <string.h> /* strlen */

int
ishUtilUTF8toANSI(char *pchString, int iStringLen)
{
    /* A NULL string is treated as empty; a negative iStringLen means
       "NUL-terminated, measure it with strlen". */
    const size_t cszStringLen = !pchString
                                    ? (size_t)0U
                                    : (iStringLen >= 0
                                           ? (size_t)iStringLen
                                           : strlen(pchString));
    int bConverted = 1;
    size_t szInputIdx, szOutputIdx = (size_t)0U;

    for(szInputIdx = (size_t)0U;
        bConverted
        && szInputIdx < cszStringLen;
        ++szInputIdx)
    {
        /* If input_bin(0xxxxxxx)          ~ ASCII_bin(0xxxxxxx)
           If input_bin(1100001y 10xxxxxx) ~ ANSI_bin(1yxxxxxx)
           All other UTF-8 encodings don't map to ANSI so we don't
           convert them and fail if we encounter them. That includes
           the lead bytes 0xC0 and 0xC1, which can only start overlong
           (illegal) encodings of ASCII characters.
           See: http://www.unicode.org for more information about
           UTF-8 encoding.
        */
        if(0x00 == (pchString[szInputIdx] & 0x80)) /* Plain ascii char */
            pchString[szOutputIdx++] = pchString[szInputIdx];
        else if(szInputIdx + (size_t)1U < cszStringLen
                && 0xC2 == (pchString[szInputIdx] & 0xFE)
                && 0x80 == (pchString[szInputIdx + (size_t)1U] & 0xC0))
        {
            /* UTF-8 encoded char that maps to ANSI. */
            pchString[szOutputIdx++] =
                (char)(((pchString[szInputIdx] & 0x03) << 6)
                       + (pchString[szInputIdx + (size_t)1U] & 0x3F));
            ++szInputIdx; /* We must skip the second input char. */
        }
        else /* UTF-8 encoding that doesn't map to ANSI or illegal input. */
            bConverted = 0;
    }
    return bConverted ? (int)szOutputIdx : -1;
}
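Note: As you can see from the code, this function converts the string
in place (modifying the data pointed to by pchString directly). This is
possible because the resulting string length will be <= the input length:
every input character shrinks to one byte, so the write index never
overtakes the read index. The function does not write a terminating NUL,
so use the return value as the new length (or add the terminator
yourself). If it returns -1, the start of the buffer may already have been
rewritten.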
---
Regards,
Dirk