Turnign greek-iso filenames => utf-8 iso
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Wed Jun 12 05:12:13 EDT 2013
On Wed, 12 Jun 2013 08:02:24 +0000, Νικόλαος Κούρας wrote:
> # Collect directory and its filenames as bytes
> path = b'/home/nikos/public_html/data/apps/'
> files = os.listdir( path )
[snip code]
I realised that the version I gave you earlier, or rather the modified
version you came up with, was subject to a race condition. If somebody
uploaded a file while the script was running, and that file name was not
UTF-8 clean, the script would fail.
This version may be more robust and should be resistant to race
conditions when files are uploaded. (However, do not *delete* files while
this script is running.) As before, I have not tested this. I recommend
that you test it thoroughly before deploying it live.
def guess_encoding(bytestring):
for encoding in ('utf-8', 'iso-8859-7', 'latin-1'):
try:
bytestring.decode(encoding)
except UnicodeDecodeError:
# Decoding failed. Try the next one.
pass
else:
# Decoding succeeded. This is our guess.
return encoding
# If we get here, none of the encodings worked. We cannot guess.
return None
path = b'/home/nikos/public_html/data/apps/'
files = os.listdir( path )
clean_files = []
for filename in files:
# Compute 'path/to/filename'
filepath_bytes = path + filename
encoding = guess_encoding(filepath_bytes)
if encoding == 'utf-8':
# File name is valid UTF-8, so we can skip to the next file.
clean_files.append(filepath_bytes)
continue
if encoding is None:
# No idea what the encoding is. Hit it with a hammer until it
# stops moving.
filename = filepath_bytes.decode('utf-8', 'xmlcharrefreplace')
else:
filename = filepath_bytes.decode(encoding)
# Rename the file to something which ought to be UTF-8 clean.
newname_bytes = filename.encode('utf-8')
os.rename(filepath_bytes, newname_bytes)
clean_files.append(newname_bytes)
# Once we get here, the file ought to be UTF-8 clean,
# and the Unicode name ought to exist:
assert os.path.exists(newname_bytes.decode('utf-8'))
# Dump the old list of file names, it is no longer valid.
del files
# DO NOT CALL listdir again. Somebody might have uploaded a
# new file, with a broken file name. That will be fixed next
# time this script runs, but for now, we ignore the dirty file
# name and just use the list of clean file names we built above.
clean_files = set(clean_files)
for name_as_bytes in sorted(clean_files):
filename = name_as_bytes.decode('utf-8')
# Check the presence of a file against the database
# and insert if it doesn't exist
cur.execute('SELECT url FROM files WHERE url = %s', filename)
data = cur.fetchone()
--
Steven
More information about the Python-list
mailing list