a simple unicode question

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Thu Oct 22 22:57:55 CEST 2009


On Thu, 22 Oct 2009 17:08:21 -0300, <rurpy at yahoo.com> wrote:

> On 10/22/2009 03:23 AM, Gabriel Genellina wrote:
>> On Wed, 21 Oct 2009 15:14:32 -0300, <rurpy at yahoo.com> wrote:
>>
>>> On Oct 21, 4:59 am, Bruno Desthuilliers <bruno.
>>> 42.desthuilli... at websiteburo.invalid> wrote:
>>>> beSTEfar wrote:
>>>> (snip)
>>>>  > When parsing strings, use Regular Expressions.
>>>>
>>>> And now you have _two_ problems <g>
>>>>
>>>> For some simple parsing problems, Python's string methods are powerful
>>>> enough to make REs overkill. And for any complex enough parsing (any
>>>> recursive construct for example - think XML, HTML, any programming
>>>> language etc), REs are just NOT enough by themselves - you need a full
>>>> blown parser.
>>>
>>> But keep in mind that many XML, HTML, etc parsing problems
>>> are restricted to a subset where you know the nesting depth
>>> is limited (often to 0 or 1), and for that large set of
>>> problems, RE's *are* enough.
>>
>> I don't think so. Nesting isn't the only problem. RE's cannot handle
>> comments, for example. And you must support unquoted attributes, single
>> and double quotes, any attribute ordering, empty tags, arbitrary
>> whitespace...
>> If you don't, you are not reading XML (or HTML), only a specific file
>> format that resembles XML but actually isn't.
>
> OK, then let me rephrase my point as: in the real world it is often
> not necessary to parse XML in its full generality; parsing, as you
> put it, "a specific file format that resembles XML" is all that is
> really needed.

Given that using a real XML parser like ElementTree is as easy as (or even
easier than) building a regular expression, and more robust, and more
likely to survive small changes in the input format, why use the worse
solution?
RE's are good at solving some problems, but parsing XML isn't one of those.
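
To illustrate the point, here is a minimal sketch, not a definitive
implementation. The two documents, the tag and attribute names, the
read_items helper and the naive pattern are all made up for the example.
Both inputs carry the same data; the second only differs in quoting,
attribute order, whitespace, an added comment and the empty-tag spelling.

```python
import re
import xml.etree.ElementTree as ET

# Two hypothetical inputs carrying the same data, written differently.
doc_a = '<items><item id="1" name="spam"/></items>'
doc_b = """<items>
  <!-- a comment the reader should ignore -->
  <item   name = 'spam'
          id='1' ></item>
</items>"""


def read_items(text):
    """Return (id, name) pairs for every <item> element in the document."""
    root = ET.fromstring(text)
    return [(item.get("id"), item.get("name")) for item in root.iter("item")]


# A naive pattern (an assumption, written to fit doc_a only).
naive = re.compile(r'<item id="(\d+)" name="(\w+)"/>')

print(read_items(doc_a))      # parser result for the compact form
print(read_items(doc_b))      # parser result for the reformatted form
print(naive.findall(doc_a))   # the pattern matches the form it was written for
print(naive.findall(doc_b))   # and finds nothing in the reformatted form
```

The parser version needs no change when the producer of the file reorders
attributes or switches quote style; the pattern would have to grow a new
alternative for each such variation.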

-- 
Gabriel Genellina



