[python-win32] File I/O problem

Schollnick, Benjamin Benjamin.Schollnick at xerox.com
Wed Jan 12 16:56:23 CET 2005


When in doubt, turn the problem around 90 degrees.
 
file(filename[, mode[, bufsize]])
Return a new file object (described in section 2.3.8, "File Objects"). The first two arguments are the same as for stdio's fopen(): filename is the file name to be opened, mode indicates how the file is to be opened: 'r' for reading, 'w' for writing (truncating an existing file), and 'a' opens it for appending (which on some Unix systems means that all writes append to the end of the file, regardless of the current seek position).
 
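The problem is that your file contains BINARY data, but you are opening it in text mode. On Windows, text mode treats the 0x1a byte (Ctrl-Z) as end-of-file, which is why the read stops early.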
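
Opening the file in binary mode ('rb') is what gets past the 0x1a byte, because in that mode it comes back like any other byte. A minimal sketch of that point, written for a current Python rather than 2.3; the helper names and the sample contents are made up for illustration and are not from the original report:

```python
import os
import tempfile

def read_all_bytes(path):
    # 'rb' = binary mode: no newline translation, no special Ctrl-Z handling
    f = open(path, 'rb')
    try:
        return f.read()
    finally:
        f.close()

def make_sample():
    # Hypothetical sample file: two lines with a 0x1a byte in the second one
    fd, path = tempfile.mkstemp()
    os.write(fd, b"first line\nsecond \x1a line\n")
    os.close(fd)
    return path

path = make_sample()
data = read_all_bytes(path)
os.remove(path)

# The 0x1a byte and everything after it are still present in the data
print(repr(data))
```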
 
So, let's remove the binary data:
 

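import sys
import string
 
# Bytes to keep: everything in string.printable (ASCII letters, digits,
# punctuation and whitespace). 0x1a is not in this set.
PRINTABLE = string.printable.encode('ascii')
 
def strip_binary(filename, newname):
    # Binary mode on both sides, so 0x1a is not treated as end-of-file
    test = open(filename, 'rb')
    stripped = open(newname, 'wb')
 
    # Read one byte at a time; an empty read means end-of-file
    data = test.read(1)
    while data:
        if data in PRINTABLE:
            stripped.write(data)
        data = test.read(1)
 
    stripped.close()
    test.close()
 
strip_binary(sys.argv[1], sys.argv[2])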

 
This removes every character that is not in the string module's printable constant (string.printable).
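
Keep in mind that string.printable is ASCII only, so the bytes above 0x7f (the converted Arabic text) are dropped along with the 0x1a. If only the 0x1a byte has to go, a narrower filter might look like the sketch below. The names remove_ctrl_z and strip_ctrl_z are hypothetical, and it assumes the file is small enough to read into memory in one go:

```python
CTRL_Z = b"\x1a"

def remove_ctrl_z(data):
    """Return data (bytes) with every 0x1a byte removed; other bytes untouched."""
    return data.replace(CTRL_Z, b"")

def strip_ctrl_z(filename, newname):
    # Read the whole source file in binary mode
    src = open(filename, 'rb')
    try:
        data = src.read()
    finally:
        src.close()

    # Write the filtered bytes, again in binary mode
    dst = open(newname, 'wb')
    try:
        dst.write(remove_ctrl_z(data))
    finally:
        dst.close()
```

It would be called the same way as strip_binary, with the source and destination file names.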
 
Then you should be able to open the NEW file as an ASCII file without any issues.
 
Instead of creating a temporary file, you could collect the data in memory, join it into one string, call split("\n") on that, and process the result. That would be the rough equivalent of readlines().
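
A rough sketch of that in-memory variant, assuming Python 3 (where iterating over bytes yields integers) and a file that fits in memory. The function name printable_lines is made up for illustration:

```python
import string

# Byte values to keep, taken from string.printable
PRINTABLE = string.printable.encode('ascii')

def printable_lines(raw):
    """Drop non-printable bytes from raw, then split the rest on newlines."""
    kept = bytes(b for b in raw if b in PRINTABLE)
    # Unlike readlines(), split() drops the "\n" itself, and a trailing
    # newline produces a final empty element. "\r" is printable, so lines
    # from a Windows file keep a trailing "\r".
    return kept.split(b"\n")

def printable_lines_from_file(path):
    f = open(path, 'rb')
    try:
        return printable_lines(f.read())
    finally:
        f.close()
```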
 
        - Ben
 

	-----Original Message-----
	From: python-win32-bounces at python.org [mailto:python-win32-bounces at python.org] On Behalf Of AddisonN at iti-ab.com
	Sent: Wednesday, January 12, 2005 8:44 AM
	To: python-win32 at python.org
	Subject: [python-win32] File I/O problem
	
	

	I am trying to process a file that was originally created on an AS/400 as a spooled report. The file was converted to ASCII before being sent to me by e-mail. The original report is in Arabic script and so any Arabic script has been mapped to

	I can't read the whole file in unless I chop out all the (formerly) Arabic characters, as read(), readline() and readlines() all seem to think they are done too early. The problem appears to be that the conversion has produced a byte with hex value 1a, and Python is treating this as an end-of-file marker. I worked this out by using a hex editor and looking at the character just after the point where the read operation stops. The offending character is the square (unprintable) character in the file snippet below.

	Start file snippet >>

	MK    2005/01/10 ÇáÈäß ÇáÚÑÈí(Ô .ã.Ú)        ÇáãíÒÇäíÉ ÇáãæÍÏÉ - ÊÞÑíÜÑ ÇáãíÒÇäíÉ ÇáÔåÜÜÑíÉ                              ßãÇ åí Ýí

	              01 : ÝÑæÚ ÏæáÉ ÇãÇÑÇÊ                =========================================                              ÇáÕÜÝÍÉ  

	<< End file snippet

	Is there a way I can pre-process this file with Python and chop out the characters (the 1a) I don't want?

	 

	If I do this:

	import string

	report = open('d:\\Software\\PythonScripts\\ear11050110.txt').readlines()                

	report is:

	>>> report

	['MK    2005/01/10 \xc7\xe1\xc8\xe4\xdf \xc7\xe1\xda\xd1\xc8\xed(\xd4 .\xe3.\xda)        \xc7\xe1\xe3\xed\xd2\xc7\xe4\xed\xc9 \xc7\xe1\xe3\xe6\xcd\xcf\xc9 - \xca\xde\xd1\xed\xdc\xd1 \xc7\xe1\xe3\xed\xd2\xc7\xe4\xed\xc9 \xc7\xe1\xd4\xe5\xdc\xdc\xd1\xed\xc9                              \xdf\xe3\xc7 \xe5\xed \xdd\xed\n', '              01 : \xdd\xd1\xe6\xda \xcf\xe6\xe1\xc9 \xc7']

	 

	Which is everything up to the hex 1a.

	 

	Thanks for any prompting whatsoever.

	 

	Nick.

	 

	
	
	
