Parsing a serial stream too slowly

M.Pekala mcdpekala at gmail.com
Mon Jan 23 20:07:06 EST 2012


On Jan 23, 6:49 pm, Cameron Simpson <c... at zip.com.au> wrote:
> On 23Jan2012 13:48, M.Pekala <mcdpek... at gmail.com> wrote:
> | Hello, I am having some trouble with a serial stream on a project I am
> | working on. I have an external board that is attached to a set of
> | sensors. The board polls the sensors, filters them, formats the
> | values, and sends the formatted values over a serial bus. The serial
> | stream comes out like $A1234$$B-10$$C987$,  where "$A.*$" is a sensor
> | value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect...
> |
> | When one sensor is running my python script grabs the data just fine,
> | removes the formatting, and throws it into a text control box. However
> | when 3 or more sensors are running, I get output like the following:
> |
> | Sensor 1: 373
> | Sensor 2: 112$$M-160$G373
> | Sensor 3: 763$$A892$
> |
> | I am fairly certain this means that my code is running too slow to
> | catch all the '$' markers. Below is the snippet of code I believe is
> | the cause of this problem...
>
> Your code _is_ slow, but as you can see above you're not missing data,
> you're gathering too much data.
>
> Some point by point remarks below. The actual _bug_ is your use of ".*"
> in your regexps. Some change suggestions below the code.
>
> | def OnSerialRead(self, event):
> |       text = event.data
> |       self.sensorabuffer = self.sensorabuffer + text
> |       self.sensorbbuffer = self.sensorbbuffer + text
> |       self.sensorcbuffer = self.sensorcbuffer + text
>
> Slow and memory wasteful. Supposing a sensor never reports? You will
> accumulate an ever growing buffer string. And extending a string gets
> expensive as it grows.
>
> |       if sensoraenable:
> |               sensorresult = re.search(r'\$A.*\$.*', self.sensorabuffer )
>
> Slow and buggy.
>
> The slow: You're compiling the regular expression _every_ time you come
> here (unless the re module caches things, which I seem to recall it may.
> But that efficiency is only luck.
>
> The bug: supposing you get multiple sensor reports, like this:
>
>   $A1$$B2$$C3$
>
> Your regexp matches the whole thing! Because ".*" is greedy.
> You want "[^$]*" - characters that are not a "$".
>
> |                       if sensorresult:
> |                               s = sensorresult.group(0)
> |                               s = s[2:-1]
> |                               if self.sensor_enable_chkbox.GetValue():
> |                                       self.SensorAValue = s
> |                               self.sensorabuffer = ''
>
> What if there are multiple values in the buffer? After fixing your
> regexp you will now be throwing them away. Better to go:
>
>   self.sensorabuffer = self.sensorabuffer[sensorresult.end():]
>
> [...]
> | I think that regex is too slow for this operation, but I'm uncertain
> | of another method in python that could be faster. A little help would
> | be appreciated.
>
> Regex _is_ slow. It is good for flexible lexing, but generally Not
> Fast. It can be faster than in-Python lexing because the inner
> interpreation of the regex is C code, but is often overkill when speed
> matters. (Which you may find it does not for your app - fix the bugs
> first and see how it behaves).
>
> I would be making the following changes if it were me:
>
>   - keep only one buffer, and parse it into sensor "tokens"
>     pass each token to the right sensor as needed
>
>   - don't use regexps
>     this is a speed thing; if you code is more readable with regexps and
>     still faster enough you may not do this
>
> To these ends, untested attempt 1 (one buffer, lex into tokens, still
> using regexps):
>
>     re_token = re.compile( r'\$([A-Z])([^$]*)\$' )
>
>     def OnSerialRead(self, event):
>         # accessing a local var is quicker and more readable
>         buffer = self.buffer
>
>         text = event.data
>         buffer += text
>
>         m = re_token.search(buffer)
>         while m:
>             sensor, value = m.group(1), m.group(2)
>             buffer = buffer[m.end():]
>             if sensor == 'A':
>                 # ...
>             elif sensor == 'B':
>                 # ...
>             else:
>                 warning("unsupported sensor: %s", sensor)
>
>         # stash the updated buffer for later
>         self.buffer = buffer
>
> I'm assuming here that you can get noise in the serial stream. If you
> are certain to get only clean "$Ax$" sequences and nothing else you can
> make the code much simpler. And faster again.
>
> Pretending clean data and no regexps:
>
>     def OnSerialRead(self, event):
>         # accessing a local var is quicker and more readable
>         buffer = self.buffer
>
>         text = event.data
>         buffer += text
>
>         while buffer:
>             if not buffer.startswith('$'):
>                 raise ValueError("bad data in buffer! code rewrite needed!")
>             mark2 = buffer.find('$', 1)
>             if mark2 < 0:
>                 # end of token not present
>                 # look again later
>                 break
>             token = buffer[1:mark2]
>             buffer = buffer[mark2+1:]
>
>             if not token:
>                 raise ValueError("no data in packet!")
>             sensorm value = token[1], token[1:]
>
>             ... hand off to sensors as above ...
>
> Cheers,
> --
> Cameron Simpson <c... at zip.com.au> DoD#743http://www.cskk.ezoshosting.com/cs/
>
> If your new theorem can be stated with great simplicity, then there
> will exist a pathological exception.    - Adrian Mathesis

Okay I altered my code a bit based on all of your suggestions:

	self.sensor_re = re.compile(r'\$([A-E])([^$]*)\$')

	def OnSerialRead(self, event):
		text = event.data
		buffer = self.sensorbuffer + text

		if self.SensorLogging:
			result = self.sensor_re.search(buffer)
			if result:
				sensor,value = result.groups()
				buffer = buffer[result.end():]
				if sensor == 'A' and self.sensora_chkbox.GetValue():
					self.SensorAValue = value
				elif sensor == 'B' and self.sensorb_chkbox.GetValue():
					self.SensorBValue = value
				elif sensor == 'C' and self.sensorc_chkbox.GetValue():
					self.SensorCValue = value
				elif sensor == D' and self.sensord_chkbox.GetValue():
					self.SensorDValue = value
				elif sensor == 'E' and self.sensore_chkbox.GetValue():
					self.SensorEValue = value
				self.LogSensor()
			self.sensorbuffer = buffer

...and it works far better, in so much as the problems I was having
before seem to be gone. I'm probably going to change this to something
without re, though for the time being this will work for me. The
reason there was a whole lot of junk, such as the three buffers,
around this code was because I am somewhat of a novice python
programmer and I was just doing random stuff to see if it would stick.
The reason I was using regex in the first place was that I thought it
would be faster than the python code due its c origins, but apparently
that may not be the case. Regardless thank you all for your help.

Cheers,
Miles Pekala



More information about the Python-list mailing list