[Expat-discuss] RE: Parser query

Dirk Dierckx brc@fourlittlemice.com
Thu, 16 Aug 2001 08:40:10 +0200


As Fred L. Drake already pointed out, Expat hands you the text UTF-8
encoded by default.  Expat has no Latin-1 output; getting away from UTF-8
means recompiling Expat to work with UTF-16.  If you're like me and just
want to retrieve your ß in a char with value 223 (decimal), etc., without
having to recompile Expat or use 16-bit wide chars (wchar_t), you can use
the following function to convert from UTF-8 to ANSI (0-255, i.e. Latin-1).

#include <string.h> /* strlen */

int
ishUtilUTF8toANSI(char *pchString, int iStringLen)
{
  /* Work on unsigned bytes so the bit tests do not depend on whether
     plain char is signed. */
  unsigned char *pbString = (unsigned char *)pchString;
  /* A negative iStringLen means "use strlen()"; NULL counts as empty. */
  const size_t cszStringLen = !pchString
    ? (size_t)0U
    : (iStringLen >= 0
       ? (size_t)iStringLen
       : strlen(pchString));
  int bConverted = 1;
  size_t szInputIdx, szOutputIdx = (size_t)0U;

  for(szInputIdx = (size_t)0U;
      bConverted
        && szInputIdx < cszStringLen;
      ++szInputIdx)
    {
      /* If input_bin(0xxxxxxx) ~ ASCII_bin(0xxxxxxx)
         If input_bin(1100001y 10xxxxxx) ~ ANSI_bin(1yxxxxxx)
         Lead bytes 0xC0 and 0xC1 are overlong (illegal) encodings of
         ASCII chars, so we reject them.  All other UTF-8 encodings
         don't map to ANSI so we don't convert them and fail if we
         encounter them.
         See: http://www.unicode.org for more information about
         UTF-8 encoding.
      */
      if(0x00 == (pbString[szInputIdx] & 0x80)) /* Plain ascii char */
        pbString[szOutputIdx++] = pbString[szInputIdx];
      else if(szInputIdx + (size_t)1U < cszStringLen
              && 0xC2 == (pbString[szInputIdx] & 0xFE)
              && 0x80 == (pbString[szInputIdx + (size_t)1U] & 0xC0))
        {
          /* UTF-8 encoded char that maps to ANSI. */
          pbString[szOutputIdx++] = (unsigned char)
            (((pbString[szInputIdx] & 0x03) << 6)
             | (pbString[szInputIdx + (size_t)1U] & 0x3F));
          ++szInputIdx; /* We must skip the second input char. */
        }
      else /* UTF-8 encoding that doesn't map to ANSI or illegal input. */
        bConverted = 0;
    }
  return bConverted ? (int)szOutputIdx : -1;
}

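Note: As you can see from the code, this function converts the string
in place (modifying the data pointed to by pchString directly).  This is
possible because every input sequence shrinks or keeps its size, so the
result is never longer than the input.  The return value is the new
length, or -1 on failure; no terminating '\0' is written, and after a
failure the start of the buffer may already have been overwritten.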
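
For anyone who wants to check the arithmetic, here is a small sketch of
what happens to "Straße".  Everything named demo_* is hypothetical and only
there for illustration; demo_convert is a condensed stand-in for the
function above so the block compiles on its own, and I am assuming it
should reject the overlong lead bytes 0xC0 and 0xC1.  The ß arrives from
Expat as the two bytes 0xC3 0x9F and should come out as the single byte
0xDF (223).

```c
#include <string.h>

/* Hypothetical helper: fold a two-byte UTF-8 sequence (lead byte 0xC2 or
   0xC3, trail byte 10xxxxxx) into one byte in the range 0x80..0xFF. */
static unsigned char demo_decode_pair(unsigned char lead, unsigned char trail)
{
  return (unsigned char)(((lead & 0x03) << 6) | (trail & 0x3F));
}

/* Condensed stand-in for the converter, assumed to follow the same
   contract: converts in place, returns the new length or -1, and does
   not write a terminating '\0'. */
static int demo_convert(char *s, int len)
{
  unsigned char *p = (unsigned char *)s;
  int in, out = 0;

  for (in = 0; in < len; ++in)
    {
      if (p[in] < 0x80)                       /* plain ASCII */
        p[out++] = p[in];
      else if (in + 1 < len
               && 0xC2 == (p[in] & 0xFE)      /* lead byte 0xC2 or 0xC3 */
               && 0x80 == (p[in + 1] & 0xC0)) /* trail byte 10xxxxxx */
        {
          p[out++] = demo_decode_pair(p[in], p[in + 1]);
          ++in;                               /* skip the trail byte */
        }
      else
        return -1;                            /* not representable */
    }
  return out;
}

/* Hypothetical demo: converts "Straße" and reports the byte that
   replaced the two-byte ß.  Returns the new length, or -1 on failure. */
static int demo_strasse(unsigned char *pbEszett)
{
  /* The literal is split so that 'e' is not read as part of \x9F. */
  char buf[] = "Stra\xC3\x9F" "e";            /* 7 bytes of UTF-8 */
  int n = demo_convert(buf, (int)strlen(buf));

  if (n < 0)
    return -1;
  buf[n] = '\0';   /* n <= original length, so this stays inside buf */
  *pbEszett = (unsigned char)buf[4];
  return n;        /* 6: the two-byte ß became one byte */
}
```

In a real character data handler you would first copy Expat's text into a
buffer of your own with room for the '\0', since the handler's data is
neither writable nor NUL-terminated, and then terminate at the returned
length as demo_strasse does.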

---
Regards,
Dirk