[Expat-discuss] Hello and question about unknown encoding handler

Alfonso alforan at tin.it
Tue Nov 20 09:49:26 CET 2007


Hi expat gurus.

I am Alfonso from Italy and I ported expat to MorphOS/Amiga (see
alfie.altervista.org) while writing an RSS client.

My problem is this:

Let's say I am downloading from http://bash.org.ru/rss.

- it sends me an XML file encoded in windows-1251

- expat calls my unknownEncodingHandler

- in the above handler I know how to deal with that encoding (I am
using lib codesets for that purpose), so I set the map this way:
---
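/* codeset = findcodeset() */
for (i = 0; i < 256; i++)
{
  /* l: length in bytes of the UTF-8 form of byte i */
  int l = codeset->table[i].utf8[0];

  if (l == 1) info->map[i] = i;  /* byte maps to itself */
  else info->map[i] = -l;        /* byte marked as the start of an l-byte sequence */
}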

info->convert = convert;
info->data    = codeset;
---

- now for any char to be translated to UTF-8, convert is called:
---
int convert(void *data, const char *s)
{
    struct codeset *codeset = (struct codeset *)data;

    return codeset->table[*s].ucs4 ;
}
---
I thought that convert should return the Unicode value of the char as
an unsigned int. But it doesn't work, for a simple reason: every
windows-1251 char is a single byte and never starts a sequence. As a
result, expat treats every "strange" char as the start of a sequence of
2 or 3 chars, while it is always a single one.

Is there a solution for the above problem? What should I do in the
unknownEncoding handler or in convert()? I am sure there is something I
don't get :P
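
For a single-byte encoding such as windows-1251, the Expat documentation
for XML_Encoding describes a non-negative map[i] as the Unicode value of
byte i, with -1 meaning a malformed byte and -2, -3, -4 meaning the first
byte of a multi-byte sequence in the source encoding. Under that reading,
a minimal sketch of the map setup might look like the following. The
Encoding struct is only a stand-in mirroring the fields of XML_Encoding so
the sketch compiles without expat.h; struct codeset, UNDEFINED_CP,
fill_single_byte_map and init_partial_cp1251 are hypothetical names and
are not part of Expat or of lib codesets.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in with the same fields as Expat's XML_Encoding (assumption:
   used here only so the sketch builds without expat.h). */
typedef struct {
    int map[256];
    void *data;
    int (*convert)(void *data, const char *s);
    void (*release)(void *data);
} Encoding;

/* Hypothetical marker for byte values the encoding does not define. */
#define UNDEFINED_CP 0xFFFFFFFFUL

/* Hypothetical codeset table: one Unicode code point per byte value. */
struct codeset_entry { unsigned long ucs4; };
struct codeset { struct codeset_entry table[256]; };

/* Fill the map for a single-byte encoding: each byte maps directly to
   its Unicode code point, and undefined bytes are flagged with -1. */
void fill_single_byte_map(Encoding *info, const struct codeset *cs)
{
    int i;

    assert(info != NULL && cs != NULL);

    for (i = 0; i < 256; i++) {
        unsigned long cp = cs->table[i].ucs4;
        info->map[i] = (cp == UNDEFINED_CP) ? -1 : (int)cp;
    }

    /* No byte starts a multi-byte sequence, so no convert is needed. */
    info->data = NULL;
    info->convert = NULL;
    info->release = NULL;
}

/* Partial windows-1251 table for illustration only: ASCII plus the
   Cyrillic block 0xC0..0xFF (U+0410..U+044F). The range 0x80..0xBF is
   left undefined here, although the real encoding assigns most of it. */
void init_partial_cp1251(struct codeset *cs)
{
    int i;

    for (i = 0; i < 0x80; i++)
        cs->table[i].ucs4 = (unsigned long)i;
    for (i = 0x80; i < 0xC0; i++)
        cs->table[i].ucs4 = UNDEFINED_CP;
    for (i = 0xC0; i < 0x100; i++)
        cs->table[i].ucs4 = 0x0410UL + (unsigned long)(i - 0xC0);
}
```

With a map filled this way no entry is negative except for undefined
bytes, so no byte is treated as the start of a sequence and convert is
never called.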

Thanks for your help.

Ciao. Alfonso




