Checking for binary data in a string

Dave Angel davea at ieee.org
Fri Jun 19 23:35:16 CEST 2009


Mitko Haralanov wrote:
> I have a question about finding out whether a string contains binary
> data?
>
> In my application, I am reading from a file which could contain
> binary data. After I have read the data, I transfer it using xmlrpclib.
>
> However, xmlrpclib has trouble unpacking XML which contains binary data
> and my application throws an exception. The solution to this problem is
> to use base64 encoding of the data but I don't know how to check
> whether the encoding will be needed?
>
> If I read in a string containing some binary data from the file, the
> type of that string is <type 'str'> which is not different from any
> other string, so I can't use that as a check.
>
> The only other check that I can think of is to check every character in
> the read-in string against string.printable but that will take a long
> time.
>
> Can anyone suggest a better way to handle the check? Thank you in
> advance.
>
>   
All the data is binary.  But perhaps by "binary" you mean anything
outside ASCII (7 bits), or anything outside the range 0x20-0x7f, or
something similar.

The way I'd tackle it is to build a translation table for your 
definition of "binary."  Then simply do something like:
       import base64
       if data != data.translate(table):
           data = base64.b64encode(data)   # or whatever encoding you choose

The translation table would be defined such that table[ch] == ch for
all ch that are "nonbinary" and table[ch] != ch for all ch that are
"binary."  And naturally you only build the table once, and reuse it
on each buffer.
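
A minimal sketch of such a table, written for Python 3 bytes objects
(on Python 2 the same idea works on str with string.maketrans).  The
definition of "nonbinary" here (tab, newline, carriage return, and
0x20-0x7e) is only an assumption, and the names TABLE, needs_base64 and
prepare are made up for illustration:

```python
import base64

# Assumption: "nonbinary" means tab, LF, CR, or printable ASCII 0x20-0x7e.
_TEXT_BYTES = frozenset([0x09, 0x0A, 0x0D]) | frozenset(range(0x20, 0x7F))

# Built once and reused: text bytes map to themselves, every other byte
# maps to "?" (0x3f), so table[ch] != ch exactly for the "binary" bytes.
TABLE = bytes(b if b in _TEXT_BYTES else 0x3F for b in range(256))


def needs_base64(data):
    """Return True if data contains at least one "binary" byte."""
    return data != data.translate(TABLE)


def prepare(data):
    """Hypothetical helper: return (payload, was_encoded)."""
    if needs_base64(data):
        return base64.b64encode(data), True
    return data, False
```

Adjust _TEXT_BYTES to whatever your own definition of "binary" is; the
rest stays the same.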

This should be quicker than any for loop you could write, though there
may be other builtin functions that are even quicker.  It's a start,
though.
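
One other builtin variant, as a sketch only: translate() also accepts a
set of bytes to delete, so you can delete everything you consider text
and see whether anything is left.  This is Python 3 bytes syntax; the
name has_binary and the choice of text bytes are assumptions, and I
have not timed it against the comparison above:

```python
# Assumption: same definition of "text" as before (tab, LF, CR, 0x20-0x7e).
TEXT_BYTES = bytes([0x09, 0x0A, 0x0D]) + bytes(range(0x20, 0x7F))


def has_binary(data):
    """Return True if any byte of data is outside TEXT_BYTES."""
    # With table=None no bytes are remapped; the second argument lists
    # bytes to delete.  Whatever survives the deletion is "binary".
    return bool(data.translate(None, TEXT_BYTES))
```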

Note that you will probably also be escaping the XML special
characters, such as &, <, and >.  So you might get clever and let
a single translate pass tell you whether the data can be stored
unmodified, then do a second translate to decide which way to modify
it (escaping or base64).  Whether this is worthwhile depends in part
on how often a buffer falls into each category.
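
The two-pass idea could look roughly like this (Python 3 bytes again).
The tables, the "plain"/"escaped"/"base64" labels and the function name
encode_for_xml are all invented for the sketch, and the text range is
the same assumption as above:

```python
import base64
from xml.sax.saxutils import escape

# Assumption: text is tab, LF, CR, or printable ASCII 0x20-0x7e.
_TEXT = frozenset([0x09, 0x0A, 0x0D]) | frozenset(range(0x20, 0x7F))
_XML_SPECIAL = frozenset(b"&<>")

# Pass 1: only text bytes that are not XML-special map to themselves.
PLAIN_TABLE = bytes(
    b if (b in _TEXT and b not in _XML_SPECIAL) else 0x3F for b in range(256)
)
# Pass 2: every text byte, including &, < and >, maps to itself.
TEXT_TABLE = bytes(b if b in _TEXT else 0x3F for b in range(256))


def encode_for_xml(data):
    """Hypothetical helper: return (payload, kind)."""
    if data == data.translate(PLAIN_TABLE):
        return data, "plain"        # can be stored unmodified
    if data == data.translate(TEXT_TABLE):
        # Only XML specials are in the way, so escaping is enough.
        return escape(data.decode("ascii")).encode("ascii"), "escaped"
    return base64.b64encode(data), "base64"
```

If most buffers are plain, the second translate is rarely reached; if
most are binary, a single pass with the second table alone may be the
better trade.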




