[XML-SIG] Is anyone implementing EXI in Python?

Fri Jul 17 17:01:12 CEST 2009

I think the issue here is the nature of the data exchange.  EXI
essentially provides a compression algorithm that retains information
between instances of a message or file and can be seeded with what is
known in advance about the instances (for example, a schema).  The gzip
algorithm learns the characteristics of each instance from that instance
alone and does not retain information between instances.

If you are occasionally sending a large file, gzip makes sense.  There is
little gain from retaining information.  However, if you have frequent
small messages or separate small files based on a schema, the namespace
definitions are repeated for each instance and can take up an appreciable
fraction of what is sent over the wire for each instance.  A small
instance doesn't give gzip much to learn from, and it has to start all
over for the next instance.  Similarly, the tags recur across instances,
but gzip will only learn them as it encounters them in a particular
instance.  Again, gzip forgets between instances.
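
To make the "seeded with what is known in advance" point concrete, here is
a rough sketch using zlib's preset-dictionary feature.  This is only an
analogy, not EXI: the message template, the example.org namespaces and the
helper names deflate/inflate are all made up for illustration.

```python
import zlib

# Hypothetical small messages that share namespace declarations and tags.
TEMPLATE = (
    '<m:reading xmlns:m="http://example.org/meter" '
    'xmlns:u="http://example.org/units">'
    '<m:sensor>{sensor}</m:sensor><m:value u:unit="kW">{value}</m:value>'
    '</m:reading>'
)
MESSAGES = [
    TEMPLATE.format(sensor=f"s{i}", value=i * 1.5).encode()
    for i in range(5)
]

# "Prior knowledge" agreed on by both ends before any message is sent
# (an assumption of this sketch; EXI would use a schema instead).
SHARED = TEMPLATE.format(sensor="", value="").encode()


def deflate(data, zdict=None):
    """Compress one message on its own, optionally seeded with zdict."""
    kwargs = {"zdict": zdict} if zdict is not None else {}
    c = zlib.compressobj(level=9, **kwargs)
    return c.compress(data) + c.flush()


def inflate(blob, zdict=None):
    """Inverse of deflate(); needs the same zdict that was used to compress."""
    kwargs = {"zdict": zdict} if zdict is not None else {}
    d = zlib.decompressobj(**kwargs)
    return d.decompress(blob) + d.flush()


if __name__ == "__main__":
    for msg in MESSAGES:
        plain = deflate(msg)
        seeded = deflate(msg, SHARED)
        print(len(msg), len(plain), len(seeded))
```

Each message is still compressed independently here; the only thing that
changes is that the compressor starts out already knowing the namespace
declarations and tags.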

I think in the absence of prior information and when used only
occasionally (without information retention between instances), EXI
provides something close to gzip compression.  What EXI provides is a
variant of compression technology that has information retention between
instances and the ability to use prior information across instances.  In
applications with frequent repetitive data exchanges, the information
retention and ability to use prior information can provide significant
benefits.
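
As a loose illustration of what retention between instances buys, the
sketch below keeps one deflate stream open across messages instead of
starting over each time.  Again this is zlib, not EXI, and the
RetainingChannel class and the sample messages are hypothetical.

```python
import zlib

# Hypothetical stream of small, similar messages.
MESSAGES = [
    ('<m:reading xmlns:m="http://example.org/meter">'
     f'<m:sensor>s{i}</m:sensor><m:value>{i * 1.5}</m:value>'
     '</m:reading>').encode()
    for i in range(5)
]


class RetainingChannel:
    """One deflate stream shared by every message on a connection."""

    def __init__(self):
        self._c = zlib.compressobj(9)
        self._d = zlib.decompressobj()

    def send(self, message):
        # Z_SYNC_FLUSH emits everything needed to decode this message
        # while keeping the compressor's history for the next one.
        return self._c.compress(message) + self._c.flush(zlib.Z_SYNC_FLUSH)

    def receive(self, frame):
        # Frames must be fed in the order they were produced.
        return self._d.decompress(frame)


if __name__ == "__main__":
    tx, rx = RetainingChannel(), RetainingChannel()
    for msg in MESSAGES:
        frame = tx.send(msg)
        assert rx.receive(frame) == msg
        # Size with retained history vs. size when starting from scratch.
        print(len(msg), len(frame), len(zlib.compress(msg, 9)))
```

The cost is that sender and receiver now share state, so frames cannot be
dropped or reordered; that is the same kind of coupling you accept when
both ends rely on retained or prior information.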

Stan Klein

On Fri, July 17, 2009 4:06 am, Stefan Behnel wrote:
> Hi,
>
> Stanley A. Klein wrote:
>> On Wed, 2009-07-15 at 22:26 +0200, Stefan Behnel wrote:
>>> A well chosen compression method is a lot better suited to such
>>> applications and is already supported by most available XML parsers (or
>>> rather outside of the parsers themselves, which is a huge advantage).
>>
>> It depends on the nature of the XML application.  One feature of EXI is
>> to support representation of numeric data as bits rather than
>> characters.  That is very useful in appropriate applications.
>
> One drawback is that this requires a schema to make sure the number of
> bits is sufficient. Otherwise, you'd need to add the information how
> many bits you use for their representation, which would add to the data
> volume.
>
>
>> There is a measurements document that shows the compression that was
>> achieved on a wide variety of test cases.  Straight use of a common
>> compression algorithm does not necessarily achieve the best results.
>
> Repetitive data like an XML byte stream compresses extremely well, though,
> and the 'best' compression isn't always required anyway. I worked on a
> Python SOAP application where we sent some 3MB of XML as a web service
> response. That took a couple of seconds to transmit. Injecting the
> standard gzip algorithm into the WSGI stack got it down to some 48KB.
> Nothing more to do here.
>
> If you need 'the best' compression, there's no way around benchmarking a
> couple of different algorithms that are suitable for your application, and
> choosing the one that works best for your data. That may or may not
> include EXI.
>
>
>> Besides, EXI incorporates elements of common compression algorithm(s)
>> as both a fallback for its schema-less mode and an additional
>> capability in its schema-informed mode.
>
> Makes sense, as compression also applies to text content, for example.
>
>
>> EXI is intended for use outboard of the parser, and that would apply
>> equally well to a Python version.  For example, EXI gets rid of the need
>> to constantly resend over-the-wire all the namespace definitions with
>> each message.  The relevant strings would just go into the string table
>> and get restored from there when the message is converted back.
>
> That's how any run-length based compression algorithm works anyway. Plus,
> namespace definitions usually only happen once in a document, so they are
> pretty much negligible in a larger XML document.
>
>
>> However, for something like SOAP in certain applications, it may be
>> eventually desirable to integrate the EXI implementation within the
>> communications system.  The message sender could reasonably create a
>> schema-informed EXI version without actually starting from and
>> converting an XML object.  The recipient would have to convert the EXI
>> back to XML, parse it, and use the data.
>
> Ok, that's where I see it, too. At the level where you'd normally apply a
> compression algorithm anyway.
>
>
>> Numeric data is most efficiently sent as bits
>
> Depends on how you select the bits. When I write into my schema that I use
> a 32 bit integer value in my XML, and all I really send happens to be
> within [0-9] in, say, 95% of the cases with a few exceptions that really
> require 32 bits, a general run-length compression algorithm will easily
> beat anything that sends the value as a 4-byte sequence. That's the
> advantage of general compression: it sees the real data, not only its
> schema.
>
> I do not question EXI in general, I'm fine with it having its niche
> (wherever that turns out to be). I'm just saying that common compression
> algorithms are a lot more broadly available and achieve similar results.
> So EXI is just another way of compressing XML, with the disadvantage of
> not being as widely implemented. Compare it to the ubiquity of the gzip
> compression algorithm, for example. It's just the usual trade-off that
> you make between efficiency and cross-platform compatibility.
>
> Stefan
>
