piping input to an external script

Tim Arnold tim.arnold at sas.com
Tue May 12 18:46:47 CEST 2009

"Dave Angel" <davea at ieee.org> wrote in message 
news:mailman.25.1242113076.8015.python-list at python.org...
> Tim Arnold wrote:
>> Hi, I have some html files that I want to validate by using an external 
>> script 'validate'. The html files need a doctype header attached before 
>> validation. The files are in utf8 encoding. My code:
>> ---------------
>> import os,sys
>> import codecs,subprocess
>> HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'
>> filename  = 'mytest.html'
>> fd = codecs.open(filename,'rb',encoding='utf8')
>> s = HEADER + fd.read()
>> fd.close()
>> p = subprocess.Popen(['validate'],
>>                     stdin=subprocess.PIPE,
>>                     stdout=subprocess.PIPE,
>>                     stderr=subprocess.STDOUT)
>> validate = p.communicate(unicode(s,encoding='utf8'))
>> print validate
>> ---------------
>> I get lots of lines like this:
>> Error at line 1, character 66:\tillegal character number 0
>> etc etc.
>> But I can give the command in a terminal 'cat mytest.html | validate' and 
>> get reasonable output. My subprocess code must be wrong, but I could use 
>> some help to see what the problem is.
>> python2.5.1, freebsd6
>> thanks,
>> --Tim
> The usual rule in debugging:  split the problem into two parts, and test 
> each one separately, starting with the one you think most likely to be the 
> culprit.
> In this case the obvious place to split is with the data you're passing to 
> the  communicate call.  I expect it's already wrong, long before you hand 
> it to the subprocess.  So write it to a file instead, and inspect it with 
> a binary file viewer.  And of course test it manually with your validate 
> program.  Is validate really expecting a Unicode stream on stdin?
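
For anyone following the same inspection step without a separate binary
viewer, here is a minimal sketch in Python 3 syntax (not the 2.5 used in this
thread). The helper name describe_payload and the sample inputs are made up
for illustration. It reports the type of the object about to be piped and
the hex of its first bytes, which is usually enough to tell text from
encoded bytes.

```python
import binascii

def describe_payload(payload, limit=16):
    """Return (type name, hex of the first `limit` bytes) for a pipe payload."""
    if isinstance(payload, bytes):
        raw = payload
    else:
        # Text has no byte form until it is encoded; show what utf8 would give.
        raw = payload.encode("utf8")
    return type(payload).__name__, binascii.hexlify(raw[:limit]).decode("ascii")

print(describe_payload(b"<!DOCTYPE"))   # already bytes
print(describe_payload("caf\u00e9"))    # text, shown as its utf8 encoding
```

If the reported type is a text type rather than bytes, the data still needs
an explicit encode before it goes to the subprocess.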

Good advice from everyone. The example was simpler than my actual situation, 
but it did show the problem. Dave's final question was the right one: I 
needed to pass the html content as a utf8-encoded byte string (str), not as 
a unicode object:

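import codecs,subprocess

HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n'

filename  = 'mytest.html'
fd = codecs.open(filename,'rb',encoding='utf8')
# fd.read() returns a unicode object; encode it back to a utf8 byte string
s = HEADER + fd.read().encode('utf8') # <- made the difference
fd.close()

p = subprocess.Popen(['validate',],
                     stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT)
validate = p.communicate(s)
print validate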
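
To show the difference the encode step makes, here is a rough sketch in
Python 3 syntax (the thread itself uses 2.5). It makes several assumptions.
The child process is a hypothetical stand-in for 'validate' that only counts
the bytes and NUL bytes it receives. The sample html is made up. The
utf-32-le payload is only my guess at how a wide representation of the text
could reach the pipe and trigger "illegal character number 0"; I have not
confirmed that this is what happened here.

```python
import subprocess
import sys

HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n'

# Hypothetical stand-in for 'validate': reads stdin as raw bytes and prints
# the total byte count and the number of NUL bytes.
CHILD = "import sys; d = sys.stdin.buffer.read(); print(len(d), d.count(b'\\x00'))"

def pipe_to_child(payload):
    """Send payload (bytes) to the child's stdin; return (total, nul_count)."""
    p = subprocess.Popen([sys.executable, "-c", CHILD],
                         stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT)
    out, _ = p.communicate(payload)
    total, nuls = out.split()
    return int(total), int(nuls)

text = HEADER + "<p>caf\u00e9</p>\n"   # made-up sample content

good = pipe_to_child(text.encode("utf8"))       # what the fix sends
bad = pipe_to_child(text.encode("utf-32-le"))   # assumed wide form, full of NULs

print("utf8 payload:", good)
print("wide payload:", bad)
```

The utf8 payload arrives with no NUL bytes, while the wide payload is mostly
NULs, which is the kind of input an SGML validator would reject character by
character.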
