[Tutor] Trying to parse a HUGE(1gb) xml file in python

Tue Dec 21 11:00:57 CET 2010

On Tue, Dec 21, 2010 at 4:46 AM, Alan Gauld <alan.gauld at btinternet.com> wrote:
> "David Hutto" <smokefloat at gmail.com> wrote
>
>>> And from what I recall XML is intended for data transfer in respect to
>>> HTML(from a recent brushup, nothing more),
>>
>> Apologies that is browser based transfer,
>
> I'm not sure what that last bit means.
> XML is a self-describing data format. It is usually used for files
> but can be used in data streams or in-memory strings.

I know it's self-tagged, meaning you create the tags yourself, and that
it's used elsewhere as a form of data transfer. My previous use of the
format was browser based, but I know it's used in many other places,
which is why I didn't follow the part of the discussion saying it was
horrible to use. I just asked for alternative suggestions for files,
since everyone 'seemed' to have a bad view of it, even though it seems
to be the standard for user-defined tags in data transfer.

>
> Its natural competitors are TLV (Tag, Length, Value) and
> CSV (Comma Separated Value) files but neither is as rich
> in structure.

That was kind of my point: I've seen all but TLV in use, but XML seems
to be the web standard.
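
For anyone else who hasn't seen TLV in use, here is a rough sketch of the
idea in Python. The layout (a 1-byte tag and a 4-byte big-endian length)
and the tag numbers are assumptions made up for illustration; real TLV
formats pick their own field sizes.

```python
import struct

# Assumed header layout: 1-byte tag, 4-byte big-endian length (no padding).
HEADER = struct.Struct(">BI")

def tlv_encode(tag, value):
    """Pack one record as tag, length, then the raw value bytes."""
    return HEADER.pack(tag, len(value)) + value

def tlv_decode(data):
    """Yield (tag, value) pairs from concatenated TLV records."""
    offset = 0
    while offset < len(data):
        tag, length = HEADER.unpack_from(data, offset)
        offset += HEADER.size
        yield tag, data[offset:offset + length]
        offset += length

# Hypothetical tags: 1 and 2 mean whatever sender and receiver agree on.
blob = tlv_encode(1, b"spam") + tlv_encode(2, b"eggs")
records = list(tlv_decode(blob))
```

Nothing in the bytes says what tag 1 means, which is the "not
self-describing" part: the reader has to know the tag numbers in advance.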

> Alternative options include ASN.1, Edifact and
> IDL but these are not self-describing(*) (although they are all
> more compact and faster to parse, but only IDL is free

Haven't heard of these, but the formula for a file, it seems to me,
is encoding + extension + text, so how much can these really differ?
On average it seems that the self-defined tags of XML would have a
bigger impact on size (someone has longer tag names, and more tags)
than a predefined format with tags of average size.

>
>>> sure has been displayed as a data transfer mechanism,
>
> You don't have to use it for data transfer - eg MS's use
> as a document storage format in Office - but frankly if
> you use XML to store large volumes of data you are mad,
> a database is a much more sensible option being far more
> space efficient and faster to work with.

If truly optimizing, I would time both, and maybe move to a different
language or pattern if it really mattered.
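
A minimal sketch of what "time both" could look like, using only the
standard library. The record count, the element and table names, and the
query are all made up for illustration, and the numbers from a toy like
this say little about a 1 GB file. Note that the XML side has to reparse
the text on every call while the database is already loaded, which is
part of the difference being argued about.

```python
import sqlite3
import timeit
import xml.etree.ElementTree as ET

N = 1000  # hypothetical record count, tiny compared with a 1 GB file

# The same data in both forms: N rows, each holding one integer.
xml_text = "<rows>" + "".join('<row value="%d"/>' % i for i in range(N)) + "</rows>"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rows (value INTEGER)")
conn.executemany("INSERT INTO rows VALUES (?)", ((i,) for i in range(N)))
conn.commit()

def total_from_xml():
    # Parse the whole document, then sum the attribute values.
    root = ET.fromstring(xml_text)
    return sum(int(row.get("value")) for row in root.iter("row"))

def total_from_db():
    # Let the database do the summing.
    return conn.execute("SELECT SUM(value) FROM rows").fetchone()[0]

xml_seconds = timeit.timeit(total_from_xml, number=20)
db_seconds = timeit.timeit(total_from_db, number=20)
print("xml: %.4fs  sqlite: %.4fs" % (xml_seconds, db_seconds))
```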

>
> (*)ASN.1, IDL etc all rely on a shared definition, and
> often shared code library, at both sender and receiver.
> The library is a compiled version of the data definition
> which enables complex data structures to be read from
> the file in a single chunk very efficiently.

This I might have to work on, but I rely on experience to quasi-trust
experience.
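
As I understand the footnote, the "shared definition" idea can be
sketched with Python's struct module. This is not ASN.1 or IDL, just an
illustration under my own assumptions: the record layout and field names
below are invented, and both sender and receiver would have to agree on
them in advance.

```python
import struct

# Hypothetical shared definition, known to both ends:
# little-endian unsigned 32-bit id, 64-bit float reading, 8-byte label.
RECORD = struct.Struct("<Id8s")

def write_record(rec_id, reading, label):
    """Pack one record; short labels are padded with NUL bytes."""
    return RECORD.pack(rec_id, reading, label.encode("ascii"))

def read_record(chunk):
    """Unpack a whole record from one fixed-size chunk."""
    rec_id, reading, label = RECORD.unpack(chunk)
    return rec_id, reading, label.rstrip(b"\x00").decode("ascii")

chunk = write_record(7, 2.5, "probe")
decoded = read_record(chunk)
```

The file carries no tags at all here, so it is compact and read in one
step, but it is meaningless to anyone who does not have the definition.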