piping input to an external script
norseman at hughes.net
Tue May 12 12:37:17 EDT 2009
Steve Howell wrote:
> On May 11, 11:31 pm, norseman <norse... at hughes.net> wrote:
>> Steve Howell wrote:
>>> On May 11, 10:16 pm, norseman <norse... at hughes.net> wrote:
>>>> Tim Arnold wrote:
>>>>> Hi, I have some html files that I want to validate by using an external
>>>>> script 'validate'. The html files need a doctype header attached before
>>>>> validation. The files are in utf8 encoding. My code:
>>>>> import os,sys
>>>>> import codecs,subprocess
>>>>> HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'
>>>>> filename = 'mytest.html'
>>>>> fd = codecs.open(filename,'rb',encoding='utf8')
>>>>> s = HEADER + fd.read()
>>>>> p = subprocess.Popen(['validate'],
>>>>> validate = p.communicate(unicode(s,encoding='utf8'))
>>>>> print validate
>>>>> I get lots of lines like this:
>>>>> Error at line 1, character 66:\tillegal character number 0
>>>>> etc etc.
>>>>> But I can give the command in a terminal 'cat mytest.html | validate' and
>>>>> get reasonable output. My subprocess code must be wrong, but I could use
>>>>> some help to see what the problem is.
>>>>> python2.5.1, freebsd6
>>>> If you search through the recent Python-List for UTF-8 things you might
>>>> get the same understanding I have come to.
>>>> the problem is the use of python's 'print' subcommand or what ever it
>>>> is. It 'cooks' things and someone decided that it would only handle 1/2
>>>> of a byte (in the x'00 to x'7f' range) and ignore or send error messages
>>>> against anything else. I guess the person doing the deciding read the
>>>> part that says ASCII printables are in the 7 bit range and chose to
>>>> ignore the part about the rest of the byte being undefined. That is
>>>> undefined, not disallowed. Means the high bit half can be used as
>>>> wanted since it isn't already taken. Nor did whoever it was take a look
>>>> around the computer world and realize the conflict that was going to be
>>>> generated by using only 1/2 of a byte in a 1byte+ world.
>>>> If you can modify your code to use read and write you can bypass print
>>>> and be OK. Or just have python do the 'cat mytest.html | validate' for
>>>> you. (Apply a var for html and let python accomplish the equivalent
>>>> of Unix's:
>>>> for f in *.html; do cat $f | validate; done
>>>> for f in *.html; do validate $f; done #file name available this way
>>>> If you still have problems, take a look at os.popen2 (and os.popen3).
>>>> Also take a look at os.spawn* et al.
>>> Wow. Unicode and subprocessing and printing can have dark corners,
>>> but common sense does apply in MOST situations.
>>> If you send the header, add the newline.
>>> But you do not need the header if you can cat the input file sans
>>> header and get sensible input.
>> Yep! The problem is with 'print'
> Huh? Print is printing exactly what you expect it to print.
Tim: Going by what you posted:
Is the third character of the first line read from the file a TAB?
Just curious. len(HEADER) is 63, yet the error is reported at character
66 (character number 0), which doesn't seem quite consistent math-wise.
63 + CR + LF gives 65. But, as another poster noted, you don't have those.
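
The arithmetic can be checked with a few lines of Python. This is only
a sketch: the real mytest.html was not posted, so `body` is a made-up
stand-in, and it assumes the validator counts characters from 1.

```python
HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'

# Hypothetical stand-in for the file contents; the real file was not posted.
body = '<html><head><title>t</title></head></html>'

s = HEADER + body                  # no newline in between, as in the posted code

header_len = len(HEADER)           # length of the doctype line alone
offset_in_body = 66 - header_len   # assumes the validator's columns are 1-based
char_at_66 = s[66 - 1]             # the character the error message points at

print(header_len, offset_in_body, repr(char_at_66))
```

Under those assumptions, character 66 is the third character of whatever
was read from the file, which is why that byte is the one worth looking at.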
"...66:\tillegal..." is '\t' a tab on screen or byte 1 or 3 of file?
If you have mc available, highlight the file in it and press Shift-F3,
then F4, to see the bytes in hex. 09 is TAB.
</title> is a closing tag and should not appear as an opener.
<html> can be an opener; did the h somehow become a '\'?
(Still, that would put x'09' at byte 2 of the file.)
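
Without mc, the same look at the leading bytes can be had from Python.
A minimal sketch: `first_bytes_hex` is a name made up here, and the demo
file is a throwaway with a TAB deliberately placed at byte 2, not the
real mytest.html.

```python
import os
import tempfile

def first_bytes_hex(path, count=16):
    """Return the first `count` bytes of a file as space-separated hex pairs."""
    with open(path, 'rb') as f:   # binary mode: no decoding, no newline translation
        data = f.read(count)
    return ' '.join('%02x' % b for b in bytearray(data))

# Demo on a throwaway file (hypothetical content) whose second byte is a TAB.
fd, sample_path = tempfile.mkstemp(suffix='.html')
with os.fdopen(fd, 'wb') as f:
    f.write(b'<\ttml>\n')

dump = first_bytes_hex(sample_path, 4)
os.remove(sample_path)
print(dump)   # 3c 09 74 6d
```

For the real file, first_bytes_hex('mytest.html') would show whether a 09
(TAB) or a 00 (NUL) sits in the first few bytes.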
Most validate programs I have used will tell me when the header is
missing and give me a choice of how to process the file (XML,
XHTML, HTML 1.1, ...) or to quit.
Is HEADER ('<!DOC...>') itself already in UTF-8?
Or are you mixing things?
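
One way to avoid mixing is to keep header and body as text, join them,
and encode to UTF-8 exactly once before the bytes go down the pipe. The
sketch below is written for Python 3, not the 2.5.1 from the original
post. `run_filter`, `STAND_IN` and `body` are names made up for
illustration, and a child Python process stands in for 'validate' so the
example runs without that script installed.

```python
import subprocess
import sys

HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'

def run_filter(cmd, text):
    """Encode `text` as UTF-8 once, pipe the bytes to `cmd`, return (stdout, stderr)."""
    p = subprocess.Popen(cmd,
                         stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    return p.communicate(text.encode('utf-8'))

# Stand-in for 'validate' (an assumption): a child Python that reports how
# many bytes arrived on stdin and how many of them were NUL.
STAND_IN = [sys.executable, '-c',
            'import sys; d = sys.stdin.buffer.read(); '
            'print(len(d), d.count(b"\\0"))']

body = '<html><body>caf\u00e9</body></html>'   # hypothetical file contents
text = HEADER + '\n' + body                    # newline after the header, as suggested earlier
out, err = run_filter(STAND_IN, text)
print(out.decode('ascii').strip())
```

With the real script, run_filter(['validate'], text) would take the place
of the stand-in call. A NUL count of zero on the receiving end would at
least rule out the encoding step as the source of "character number 0".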
Last but not least: if you have the source of the validate program,
check that over carefully. The numbers don't work for me.
Just thinking on paper. No need to respond.