Reading \n unescaped from a file

Thu Sep 3 05:24:30 EDT 2015

Friedrich Rentsch wrote:

> 
> 
> On 09/02/2015 04:03 AM, Rob Hills wrote:
>> Hi,
>>
>> I am developing code (Python 3.4) that transforms text data from one
>> format to another.
>>
>> As part of the process, I had a set of hard-coded str.replace(...)
>> functions that I used to clean up the incoming text into the desired
>> output format, something like this:
>>
>>      dataIn = dataIn.replace('\r', '\\n') # Tidy up linefeeds
>>      dataIn = dataIn.replace('<','<') # Tidy up < character
>>      dataIn = dataIn.replace('>','>') # Tidy up < character
>>      dataIn = dataIn.replace('o','o') # No idea why but lots of these: convert to 'o' character
>>      dataIn = dataIn.replace('f','f') # .. and these: convert to 'f' character
>>      dataIn = dataIn.replace('e','e') # ..  'e'
>>      dataIn = dataIn.replace('O','O') # ..  'O'
>>
>> These statements transform my data correctly, but the list of statements
>> grows as I test the data so I thought it made sense to store the
>> replacement mappings in a file, read them into a dict and loop through
>> that to do the cleaning up, like this:
>>
>>          with open(fileName, 'r+t', encoding='utf-8') as mapFile:
>>              for line in mapFile:
>>                  line = line.strip()
>>                  try:
>>                      if (line) and not line.startswith('#'):
>>                          line = line.split('#')[:1][0].strip() # trim any trailing comments
>>                          name, value = line.split('=')
>>                          name = name.strip()
>>                          self.filterMap[name]=value.strip()
>>                  except:
>>                      self.logger.error('exception occurred parsing line [{0}] in file [{1}]'.format(line, fileName))
>>                      raise
>>
>> Elsewhere, I use the following code to do the actual cleaning up:
>>
>>      def filter(self, dataIn):
>>          if dataIn:
>>              for token, replacement in self.filterMap.items():
>>                  dataIn = dataIn.replace(token, replacement)
>>          return dataIn
>>
>>
>> My mapping file contents look like this:
>>
>> \r = \\n
>> â€œ = "
>> < = <
>> > = >
>> ' = '
>> F = F
>> o = o
>> f = f
>> e = e
>> O = O
>>
>> This all works "as advertised" *except* for the '\r' => '\\n'
>> replacement. Debugging the code, I see that my '\r' character is
>> "escaped" to '\\r' and the '\\n' to '\\\\n' when they are read in from
>> the file.
>>
>> I've been googling hard and reading the Python docs, trying to get my
>> head around character encoding, but I just can't figure out how to get
>> these bits of code to do what I want.
>>
>> It seems to me that I need to either:
>>
>>    * change the way I represent '\r' and '\\n' in my mapping file; or
>>    * transform them somehow when I read them in
>>
>> However, I haven't figured out how to do either of these.
>>
>> TIA,
>>
>>
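
Nothing is escaped when the file is read: the file really contains the two characters backslash and "r", and the debugger shows that string's repr as '\\r'. So the second option (transform the values after reading) is probably the practical one. A minimal sketch, assuming the mapping file uses Python-style backslash escapes; unescape() and parse_mapping() are names made up for this example, not part of the code quoted above:

```python
def unescape(text):
    """Turn backslash escapes such as \\r into the characters they name."""
    # 'unicode_escape' decodes bytes as Latin-1, so first turn every
    # character outside Latin-1 into a \uXXXX escape; it is decoded back
    # to the same character in the second step.
    return text.encode('latin-1', 'backslashreplace').decode('unicode_escape')


def parse_mapping(lines):
    """Build a {token: replacement} dict from 'name = value' lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and full-line comments
        line = line.split('#', 1)[0]  # drop a trailing comment
        name, value = line.split('=', 1)
        mapping[unescape(name.strip())] = unescape(value.strip())
    return mapping
```

With this, the line "\r = \\n" in the file maps a real carriage return to a backslash followed by "n", which is what the hard-coded replace('\r', '\\n') did. A value that ends in a lone backslash makes the decode step raise, so that case would need handling separately.
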
> 
> I have had this problem too and can propose a solution ready to run out
> of my toolbox:
> 
> 
> class editor:
> 
>      def compile (self, replacements):
>          targets, substitutes = zip (*replacements)
>          re_targets = [re.escape (item) for item in targets]
>          re_targets.sort (reverse = True)
>          self.targets_set = set (targets)
>          self.table = dict (replacements)
>          regex_string = '|'.join (re_targets)
>          self.regex = re.compile (regex_string, re.DOTALL)
> 
>      def edit (self, text, eat = False):
>          hits = self.regex.findall (text)
>          nohits = self.regex.split (text)
>          valid_hits = set (hits) & self.targets_set  # Ignore targets with illegal re modifiers.

Can you give an example of an ignored target? I don't see the light...

>          if valid_hits:
>              substitutes = [self.table [item] for item in hits if item in valid_hits] + []  # Make lengths equal for zip to work right

That looks wrong: appending an empty list is a no-op, so it cannot make the lengths equal.

>              if eat:
>                  output = ''.join (substitutes)
>              else:
>                  zipped = zip (nohits, substitutes)
>                  output = ''.join (list (reduce (lambda a, b: a + b, [zipped][0]))) + nohits [-1]
>          else:
>              if eat:
>                  output = ''
>              else:
>                  output = input

...and so does this: input here is the built-in function, not the text argument; presumably text was meant.

>          return output
> 
>  >>> substitutions = (
>      ('\r', '\n'),
>      ('<', '<'),
>      ('>', '>'),
>      ('o', 'o'),
>      ('f', 'f'),
>      ('e', 'e'),
>      ('O', 'O'),
>      )
> 
> Order doesn't matter. Add new ones at the end.
> 
>  >>> e = editor ()
>  >>> e.compile (substitutions)
> 
> A simple way of testing is running the substitutions through the editor
> 
>  >>> print e.edit (repr (substitutions))
> (('\r', '\n'), ('<', '<'), ('>', '>'), ('o', 'o'), ('f', 'f'), ('e',
> 'e'), ('O', 'O'))
> 
> The escapes need to be tested separately
> 
>  >>> print e.edit ('abc\rdef')
> abc
> def
> 
> Note: This editor's compiler compiles the substitution list to a regular
> expression which the editor uses to find all matches in the text passed
> to edit. There has got to be a limit to the size of a text which a
> regular expression can handle. I don't know what this limit is. To be on
> the safe side, edit a large text line by line or at least in sensible
> chunks.
> 
> Frederic
>
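
For what it's worth, the single-pass idea can be written more compactly in Python 3 with re.sub() and a callback. This is a sketch only, not a drop-in replacement for the class above; make_replacer() and the sample targets are hypothetical:

```python
import re


def make_replacer(replacements):
    """Return a function that applies all (target, substitute) pairs in one pass."""
    table = dict(replacements)
    if not table or '' in table:
        raise ValueError('need at least one non-empty target')
    # Try longer targets first so that 'ab' wins over its own prefix 'a'.
    targets = sorted(table, key=len, reverse=True)
    regex = re.compile('|'.join(re.escape(t) for t in targets))

    def replace(text):
        # Each match is looked up in the table; unmatched text is kept as is.
        return regex.sub(lambda match: table[match.group(0)], text)

    return replace


replace = make_replacer([('\r', '\n'), ('a', '1'), ('ab', '2')])
print(repr(replace('abc\rdef')))  # '2c\ndef'
```

Because the text is scanned once, the output of one substitution is never fed into another, so the order of the pairs does not matter. With chained str.replace() calls it can.
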