help with (x)html / xml encoding...

lt at
Fri Mar 21 00:03:13 CET 2003


I'm looking for a way to extract the encoding from a file retrieved by urllib.
I'm planning to create a "restricted" parser which will only examine <?
and <meta tags, to check for:

<meta http-equiv="content-type" content="text/html; charset=xxxencodingxxx">
<?xml version="1.0" encoding="'xxxencodingxxx'"?>

Do you think that is enough? How would you do it?
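
A minimal sketch of the "restricted" parser idea, assuming Python 3, where html.parser.HTMLParser can play the role of a parser that reacts only to <? and <meta. The class name EncodingSniffer and its candidates list are hypothetical and only serve the illustration; the sketch collects the raw strings that may carry an encoding and leaves pulling the value out of them to a separate step.

```python
from html.parser import HTMLParser


class EncodingSniffer(HTMLParser):
    """Hypothetical restricted parser: looks only at <?...> and <meta> tags."""

    def __init__(self):
        super().__init__()
        # Raw strings that may contain an encoding, in document order.
        self.candidates = []

    def handle_pi(self, data):
        # Called with the text between '<?' and '>', so only keep
        # processing instructions that start with 'xml'.
        if data.startswith("xml"):
            self.candidates.append(data)

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() == "content-type":
            self.candidates.append(attributes.get("content") or "")


sniffer = EncodingSniffer()
sniffer.feed('<?xml version="1.0" encoding="utf-8"?>\n'
             '<html><head><meta http-equiv="Content-Type" '
             'content="text/html; charset=iso-8859-1"></head></html>')
print(sniffer.candidates)
```

Every other tag falls through to the default no-op handlers, so the parser stays cheap even on large pages.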

My solution is below. Please feel free to comment on this code, I *really* need
to improve my Python! (inspired by
s is the string to check, more precisely the string handed over during parsing
by an SGMLParser object that only implements handle_pi and start_meta.

import re

_encoding = re.compile(
_charset = re.compile(

def get_encoding(s):
    encoding = None
    search =
    if not search:
        search =
    if not search:
        return encoding
    encoding =
    # strip matching single or double quotes around the value
    while encoding[:1] == '\'' == encoding[-1:] or \
            encoding[:1] == '"' == encoding[-1:]:
        encoding = encoding[1:-1]
    return encoding
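
For comparison, here is a self-contained sketch of the same two-step lookup (XML declaration first, then the meta charset), followed by the quote stripping. The function name sniff_encoding and both regular expressions are assumptions made for this illustration; they are not the patterns from the code above.

```python
import re

# Hypothetical patterns, written for this sketch only.
# XML declaration: grab everything after 'encoding=' up to whitespace, '?' or '>'.
_xml_encoding = re.compile(r"encoding\s*=\s*([^\s?>]+)", re.IGNORECASE)
# Meta content: grab everything after 'charset=' up to whitespace, ';', '>' or a quote.
_meta_charset = re.compile(r"charset\s*=\s*([^\s;>\"']+)", re.IGNORECASE)


def sniff_encoding(s):
    """Return the encoding named in s, or None if none is found."""
    match = _xml_encoding.search(s) or _meta_charset.search(s)
    if match is None:
        return None
    value = match.group(1)
    # Peel off matching pairs of quotes, possibly nested ("'utf-8'").
    while len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        value = value[1:-1]
    return value or None


print(sniff_encoding('xml version="1.0" encoding="utf-8"?'))
print(sniff_encoding("text/html; charset=iso-8859-1"))
print(sniff_encoding("text/html"))
```

Using `a.search(s) or b.search(s)` replaces the two chained `if not search` tests, since a failed search returns None.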


