[XML-SIG] Encoding detection in the html parser from libxml2

Wed Feb 8 12:55:31 CET 2006

On Wed, Feb 08, 2006 at 11:46:01AM +0100, Cesar Ortiz wrote:
> Hi,
> 
> I am parsing html documents using the html parser from libxml2, and if
> the encoding is included in the document it works perfectly but if it
> is not, I think it does not work well (probably because I am doing
> something wrong).

  Well, the first thing wrong is that this is not the libxml2 help mailing list; see
    http://xmlsoft.org/bugs.html

> As it is said in http://xmlsoft.org/encoding.html the parser should
> detect the encoding.

  Autodetection is done for XML, based on the XMLDecl and the default
values specified by the XML specification. For HTML all bets are off
if you don't have a meta tag or if you didn't indicate the encoding to the
parser.

> So I tested it putting an utf-8 word in a file and
> it does not detect it (it generates a wrong string). Example:
> reducción --> reducciÃ³n.

  encoding is an entity property (i.e. per file) not per word. So either
I don't understand your test or this just can't work.
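
  One plausible reading of the reported output, sketched below, is that the
undeclared UTF-8 input was treated as ISO-8859-1, so each byte of a multi-byte
sequence became its own character and was then re-encoded as UTF-8 on output.
That fallback is an assumption here, not something confirmed in this thread,
and latin1_to_utf8 is a hypothetical helper written only to reproduce the
effect; it is not part of libxml2.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: re-encode `in` as UTF-8 while (wrongly) treating
 * every input byte as one ISO-8859-1 character. This mimics what happens
 * when UTF-8 input is read under a Latin-1 assumption.
 * `out` must hold at least 2 * strlen(in) + 1 bytes.
 * Returns the number of bytes written, not counting the trailing NUL. */
size_t latin1_to_utf8(const char *in, char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (const unsigned char *p = (const unsigned char *)in; *p; p++) {
        if (*p < 0x80) {
            /* ASCII bytes are identical in both encodings. */
            out[n++] = (char)*p;
        } else {
            /* U+0080..U+00FF take two bytes in UTF-8. */
            out[n++] = (char)(0xC0 | (*p >> 6));
            out[n++] = (char)(0x80 | (*p & 0x3F));
        }
    }
    out[n] = '\0';
    return n;
}
```

  Feeding it the 10 UTF-8 bytes "reducci\xC3\xB3n" yields the 12 bytes
"reducci\xC3\x83\xC2\xB3n", which displays as the wrong string quoted above.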

  http://xmlsoft.org/html/libxml-HTMLparser.html#htmlCreatePushParserCtxt
  use the encoding argument (enc) when creating your parser.
For further information or help, subscribe to and use the libxml2 mailing list.

  thanks,

Daniel

-- 
Daniel Veillard      | Red Hat http://redhat.com/
veillard at redhat.com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/