piping input to an external script

Tue May 12 01:50:02 EDT 2009

On May 11, 10:16 pm, norseman <norse... at hughes.net> wrote:
> Tim Arnold wrote:
> > Hi, I have some html files that I want to validate by using an external
> > script 'validate'. The html files need a doctype header attached before
> > validation. The files are in utf8 encoding. My code:
> > ---------------
> > import os,sys
> > import codecs,subprocess
> > HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'
>
> > filename  = 'mytest.html'
> > fd = codecs.open(filename,'rb',encoding='utf8')
> > s = HEADER + fd.read()
> > fd.close()
>
> > p = subprocess.Popen(['validate'],
> >                     stdin=subprocess.PIPE,
> >                     stdout=subprocess.PIPE,
> >                     stderr=subprocess.STDOUT)
> > validate = p.communicate(unicode(s,encoding='utf8'))
> > print validate
> > ---------------
>
> > I get lots of lines like this:
> > Error at line 1, character 66:\tillegal character number 0
> > etc etc.
>
> > But I can give the command in a terminal 'cat mytest.html | validate' and
> > get reasonable output. My subprocess code must be wrong, but I could use
> > some help to see what the problem is.
>
> > python2.5.1, freebsd6
> > thanks,
> > --Tim
>
> ============================
> If you search through the recent Python-List for UTF-8 things you might
> get the same understanding I have come to.
>
> The problem is the use of Python's 'print' subcommand or whatever it
> is. It 'cooks' things and someone decided that it would only handle 1/2
> of a byte (in the x'00' to x'7f' range) and ignore or send error messages
> against anything else. I guess the person doing the deciding read the
> part that says ASCII printables are in the 7 bit range and chose to
> ignore the part about the rest of the byte being undefined. That is
> undefined, not disallowed.  Means the high bit half can be used as
> wanted since it isn't already taken. Nor did whoever it was take a look
> around the computer world and realize the conflict that was going to be
> generated by using only 1/2 of a byte in a 1byte+ world.
>
> If you can modify your code to use read and write you can bypass print
> and be OK.  Or just have python do the 'cat mytest.html | validate' for
> you. Apply a var for html and let Python accomplish the equivalent
> of Unix's:
>     for f in *.html; do cat $f | validate; done
>                          or
>      for f in *.html; do validate $f; done  #file name available this way
>
> If you still have problems, take a look at os.popen2 (and its sibling
> os.popen3). Also take a look at the os.spawn* family.
>

Wow.  Unicode and subprocessing and printing can have dark corners,
but common sense does apply in MOST situations.

If you send the header, add the newline.
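
A rough sketch of what I mean, not a drop-in fix: put the doctype on
its own line, then hand the pipe explicitly encoded UTF-8 bytes. The
helper names with_doctype and pipe_text are made up for illustration,
and I am assuming the text has already been decoded (a unicode object
on Python 2, a str on Python 3).

```python
import subprocess

HEADER = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'


def with_doctype(body, header=HEADER):
    """Prepend the doctype, terminated by a newline (hypothetical helper)."""
    return header + "\n" + body


def pipe_text(cmd, text, encoding="utf-8"):
    """Feed text to cmd's stdin as encoded bytes and return its decoded output.

    Hypothetical helper: cmd is an argument list such as ['validate'].
    """
    p = subprocess.Popen(cmd,
                         stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT)
    # communicate() writes the bytes, closes stdin, and waits for the child.
    out, _ = p.communicate(text.encode(encoding))
    return out.decode(encoding, "replace")
```

With the original script that would be something like
pipe_text(['validate'], with_doctype(fd.read())), where fd is the
codecs file object from the posted code.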

But you do not need the header if you can cat the input file sans
header and get sensible output.
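
If the header really is unnecessary, the nearest thing to
'cat mytest.html | validate' is to give the child the open file as its
stdin and never decode the bytes at all. Again only a sketch:
validate_file is a name I made up, and the 'with' statement assumes
Python 2.6 or later.

```python
import subprocess


def validate_file(cmd, path):
    """Run cmd with the file's raw bytes on stdin, like 'cat path | cmd'.

    Hypothetical helper: returns (exit status, combined output as bytes).
    """
    with open(path, "rb") as f:
        # The child reads straight from the file; Python never touches the data.
        p = subprocess.Popen(cmd,
                             stdin=f,
                             stdout=subprocess.PIPE,
                             stderr=subprocess.STDOUT)
        out, _ = p.communicate()
    return p.returncode, out
```

For the case in this thread that would be
validate_file(['validate'], 'mytest.html').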

Finally, if you are concerned about adding the header, then it belongs
in the original input file; otherwise, you are validating a document
that differs from the one on disk, which creates a false positive.