Reading \n unescaped from a file

Friedrich Rentsch anthra.norell at bluewin.ch
Thu Sep 3 10:40:47 CEST 2015



On 09/02/2015 04:03 AM, Rob Hills wrote:
> Hi,
>
> I am developing code (Python 3.4) that transforms text data from one
> format to another.
>
> As part of the process, I had a set of hard-coded str.replace(...)
> functions that I used to clean up the incoming text into the desired
> output format, something like this:
>
>      dataIn = dataIn.replace('\r', '\\n') # Tidy up linefeeds
>      dataIn = dataIn.replace('&lt;','<') # Tidy up < character
>      dataIn = dataIn.replace('&gt;','>') # Tidy up > character
>      dataIn = dataIn.replace('o','o') # No idea why but lots of these: convert to 'o' character
>      dataIn = dataIn.replace('f','f') # .. and these: convert to 'f' character
>      dataIn = dataIn.replace('e','e') # ..  'e'
>      dataIn = dataIn.replace('O','O') # ..  'O'
>
> These statements transform my data correctly, but the list of statements
> grows as I test the data so I thought it made sense to store the
> replacement mappings in a file, read them into a dict and loop through
> that to do the cleaning up, like this:
>
>          with open(fileName, 'r+t', encoding='utf-8') as mapFile:
>              for line in mapFile:
>                  line = line.strip()
>                  try:
>                      if (line) and not line.startswith('#'):
>                          line = line.split('#')[:1][0].strip() # trim any trailing comments
>                          name, value = line.split('=')
>                          name = name.strip()
>                          self.filterMap[name]=value.strip()
>                  except:
>                      self.logger.error('exception occurred parsing line [{0}] in file [{1}]'.format(line, fileName))
>                      raise
>
> Elsewhere, I use the following code to do the actual cleaning up:
>
>      def filter(self, dataIn):
>          if dataIn:
>              for token, replacement in self.filterMap.items():
>                  dataIn = dataIn.replace(token, replacement)
>          return dataIn
>
>
> My mapping file contents look like this:
>
> \r = \\n
> “ = "
> &lt; = <
> &gt; = >
> ' = '
> F = F
> o = o
> f = f
> e = e
> O = O
>
> This all works "as advertised" */except/* for the '\r' => '\\n'
> replacement. Debugging the code, I see that my '\r' character is
> "escaped" to '\\r' and the '\\n' to '\\\\n' when they are read in from
> the file.
>
> I've been googling hard and reading the Python docs, trying to get my
> head around character encoding, but I just can't figure out how to get
> these bits of code to do what I want.
>
> It seems to me that I need to either:
>
>    * change the way I represent '\r' and '\\n' in my mapping file; or
>    * transform them somehow when I read them in
>
> However, I haven't figured out how to do either of these.
>
> TIA,
>
>

I have had this problem too and can propose a solution ready to run out 
of my toolbox:


import re

class editor:

    def compile (self, replacements):
        # replacements: a sequence of (target, substitute) pairs
        targets, substitutes = zip (*replacements)
        re_targets = [re.escape (item) for item in targets]
        # Reverse sort puts a longer target ahead of any target that is its prefix
        re_targets.sort (reverse = True)
        self.targets_set = set (targets)
        self.table = dict (replacements)
        regex_string = '|'.join (re_targets)
        self.regex = re.compile (regex_string, re.DOTALL)

    def edit (self, text, eat = False):
        # eat = True drops the text between hits and returns only the substitutes
        hits = self.regex.findall (text)
        nohits = self.regex.split (text)
        valid_hits = set (hits) & self.targets_set  # Ignore targets with illegal re modifiers.
        if valid_hits:
            substitutes = [self.table [item] for item in hits if item in valid_hits]
            if eat:
                output = ''.join (substitutes)
            else:
                # split returns one piece more than there are hits, hence the final nohits [-1]
                output = ''.join (a + b for a, b in zip (nohits, substitutes)) + nohits [-1]
        else:
            if eat:
                output = ''
            else:
                output = text
        return output

 >>> substitutions = (
     ('\r', '\n'),
     ('&lt;', '<'),
     ('&gt;', '>'),
     ('o', 'o'),
     ('f', 'f'),
     ('e', 'e'),
     ('O', 'O'),
     )

Order doesn't matter. Add new ones at the end.
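
Order doesn't matter because compile sorts the escaped targets in reverse before joining them with '|'. An alternation takes the first alternative that matches, so a target that is a prefix of a longer one has to come after it. A minimal sketch of the effect, using two made-up targets 'a' and 'ab' that are not part of the list above:

```python
import re

targets = ['a', 'ab']  # hypothetical targets, one a prefix of the other

# Joined in the given order, the short target wins and 'ab' is never found
unsorted_regex = re.compile('|'.join(re.escape(t) for t in targets))
print(unsorted_regex.findall('ab'))  # ['a']

# Reverse sorting, as compile does, puts the longer target first
sorted_targets = sorted((re.escape(t) for t in targets), reverse=True)
sorted_regex = re.compile('|'.join(sorted_targets))
print(sorted_regex.findall('ab'))    # ['ab']
```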

 >>> e = editor ()
 >>> e.compile (substitutions)

A simple way of testing is running the substitutions through the editor:

 >>> print (e.edit (repr (substitutions)))
(('\r', '\n'), ('<', '<'), ('>', '>'), ('o', 'o'), ('f', 'f'), ('e', 'e'), ('O', 'O'))

The escapes need to be tested separately:

 >>> print (e.edit ('abc\rdef'))
abc
def
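
When the pairs come from a mapping file, as in the original question, rather than from Python literals, a backslash sequence arrives as plain text: \r in the file is a backslash followed by the letter r, not a carriage return. One possible way to transform the entries while reading them in is sketched below. The names ESCAPES, unescape and read_substitutions are my own for illustration, and the set of recognised sequences is an assumption to be extended as needed. The sketch keeps the convention of the mapping file above, where \\n stands for a backslash followed by n.

```python
import re

# Backslash sequences understood in the mapping file (an assumed, minimal set)
ESCAPES = {'n': '\n', 'r': '\r', 't': '\t', '\\': '\\'}

def unescape(text):
    """Replace backslash sequences in text with the characters they name."""
    # Unknown sequences such as \q are left exactly as written
    return re.sub(r'\\(.)', lambda m: ESCAPES.get(m.group(1), m.group(0)), text)

def read_substitutions(lines):
    """Build (target, substitute) pairs from 'target = substitute' lines."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comment lines
        target, _, substitute = line.partition('=')
        pairs.append((unescape(target.strip()), unescape(substitute.strip())))
    return pairs

pairs = read_substitutions([r'\r = \\n', '# a comment', '“ = "'])
print(pairs == [('\r', '\\n'), ('“', '"')])  # True
```

The resulting list of pairs has the same shape as the substitutions tuple and can be passed to compile unchanged. Only lines that start with '#' are treated as comments here, so a target that itself contains '#' survives.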

Note: compile turns the substitution list into a single regular
expression, which edit uses to find all matches in the text passed to
it. I don't know what limit, if any, there is on the size of a text a
regular expression can handle. In addition, edit holds a list of all
hits and a list of the text between them in memory at the same time. To
be on the safe side, edit a large text line by line or at least in
sensible chunks.
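
Editing line by line could look roughly like the sketch below. The helper name edit_by_line is hypothetical, and a plain str.replace stands in for e.edit so that the example runs on its own.

```python
def edit_by_line(edit, text):
    """Apply edit to each line of text separately and rejoin the results."""
    # splitlines(True) keeps the line endings, so '\r' stays visible to edit
    return ''.join(edit(line) for line in text.splitlines(True))

# Stand-in for e.edit, so the sketch runs without the editor class
stand_in_edit = lambda line: line.replace('\r', '\n')

print(repr(edit_by_line(stand_in_edit, 'abc\rdef\rghi')))  # 'abc\ndef\nghi'
```

Two caveats: a target that spans a line break cannot match when the text is edited one line at a time, and in Python 3 a file opened in text mode with the default newline handling has '\r' translated to '\n' before the editor ever sees it, so open it with newline='' if the '\r' replacement matters.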

Frederic


