NumPy Newcomer Questions: Reading Data Stream & Building Array
After decades of Fortran and C, I'm quite new to Python and all the great tools it provides. So, I'd like mentoring in how to most elegantly and effectively parse data read over a serial port from a scanner, and how to build a 2D array from it. After this, I'll probably need help in other NumPy array (matrix) manipulations. But, first things first:

Data are entered on an optical mark readable (OMR) form, which is scanned and the data sent over a serial line. Each form read transmits 69 bytes; there are 2 bytes for each "column" (where there is a timing mark on the edge of the form) and their position within that column determines their value. The form has two blank lines (with timing marks) and three lines with labels; these transmit the equivalent of '0' and are not recorded. The final byte is a carriage return, '\r', which is translated into a newline ('\n') by the method.

For each form read I want to record the form number (programmatically sequentially assigned, starting with '1') and the data from the form as a row. The total forms read then fill a two-dimensional array.

The two issues with which I would like help are:

1) How to slice the list of data from the scanner stream into two-byte chunks (leaving the end-of-record byte to be dealt with separately),

2) How to build the array so that each form read creates a new row in the array. While the 31 columns are fixed, the number of rows is indeterminate until all submitted forms have been read.

I've attached the 95-line 'OnScan.py' to this message; it's a single function within the application.

TIA,

Rich

--
Richard B. Shepard, Ph.D. | The Environmental Permitting
Applied Ecosystem Services, Inc.(TM) | Accelerator
<http://www.appl-ecosys.com> Voice: 503-667-4517 Fax: 503-667-8863
Hi, I must admit I did not understand the data format correctly, however maybe I can help.
How to slice the list of data from the scanner stream into two-byte chunks (leaving the end-of-record byte to be dealt with separately),
Since you want it as an array in the end, you could convert it straight away using fromstring:

#code
s = '\x00\x01\x00\x02\x00\x03'   # some binary data
a = fromstring(s, dtype='>u2')   # >: Big Endian, u: uint, 2: bytes/val
#endcode

You will have to strip the EOR byte. Choose the appropriate byte order.
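For readers on a current NumPy, the same idea can be sketched with `np.frombuffer`, which is the call recommended today for binary data in place of `fromstring`. The sample record below, including its trailing carriage return, is made up for illustration and is not the scanner's real output.

```python
import numpy as np

# Hypothetical record: three big-endian 16-bit values followed by an
# end-of-record byte (assumed here to be a carriage return).
record = b"\x00\x01\x00\x02\x00\x03\r"

payload = record[:-1]                          # strip the end-of-record byte
values = np.frombuffer(payload, dtype=">u2")   # '>' big-endian, 'u2' 2-byte unsigned
print(values.tolist())                         # [1, 2, 3]
```

Note that this only applies if the two bytes really are a binary integer; if they are codes that must be looked up in a table, a dictionary lookup is the better fit.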
How to build the array so that each form read creates a new row in the array. While the 31 columns are fixed, the number of rows is indeterminate until all submitted forms have been read.
Since the row count is undetermined, the best way is to build a list (the builtin Python type) containing all the rows and convert it to an array once reading is finished:

# code
rows = []   # create empty list
while still_rows_left:
    rows.append(row_that_was_just_read)
# now rows is [array(1,2,3,..), array(4,5,6,...), ...]
mydata = array(rows)   # make array out of list
#endcode

HTH,
Johannes
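A self-contained sketch of the pattern above, with the form number prepended to each row as the original question asks. The `parsed_forms` list is a hypothetical stand-in for whatever the scanner loop produces.

```python
import numpy as np

# Hypothetical stand-in for the scanner: each item is one parsed form.
parsed_forms = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]

rows = []
for form_number, values in enumerate(parsed_forms, start=1):
    rows.append([form_number] + values)   # form number first, then the data

data = np.array(rows)   # one conversion, after all forms have been read
print(data.shape)       # (3, 4)
```

Appending to a list and converting once avoids reallocating the array for every new form.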
On Fri, 8 Sep 2006, Johannes Loehnert wrote:
I must admit I did not understand the data format correctly, however maybe I can help.
Johannes, The data format is a stream of ASCII bytes. Each two-byte combination represents either a string (for two columns) or an integer (for the other 28 data columns). The conversion from position-specific bit values to the ASCII characters is accomplished using the data mapping dictionaries. I accept that I wasn't as clear as I could have been.
Since you want it as an array in the end, you could convert it straight away using fromstring:
#code
s = '\x00\x01\x00\x02\x00\x03'   # some binary data
a = fromstring(s, dtype='>u2')   # >: Big Endian, u: uint, 2: bytes/val
#endcode
After sending the message it occurred to me that the string.Split() function is probably what I want. I'm not receiving Hex values from the scanner. I'll play with this over the weekend. You are correct, however, that I do want to take that incoming string and store it into an array in the fewest possible steps. I'll have to look at 'fromstring()' to see just what that does. I assumed that each row in the array was a list, is that incorrect?
You will have to strip the EOR byte. Choose the appropriate byte order.
If the incoming data is stored as a list, I can slice off the last byte. On AMD processors (and Intel, too) I believe that the byte order has always been little-endian.
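One caveat worth checking: the byte order that matters is the one the scanner uses when it writes the stream, which need not match the CPU that reads it. A small sketch, using two made-up bytes, of how the dtype string changes the result:

```python
import numpy as np

raw = b"\x00\x01"   # made-up two-byte value

big = int(np.frombuffer(raw, dtype=">u2")[0])      # read as 0x0001
little = int(np.frombuffer(raw, dtype="<u2")[0])   # read as 0x0100
print(big, little)                                 # 1 256
```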
Since the row count is undetermined, the best way is to build a list (the builtin Python type) containing all the rows and convert it to an array once reading is finished:
# code
rows = []   # create empty list
while still_rows_left:
    rows.append(row_that_was_just_read)
# now rows is [array(1,2,3,..), array(4,5,6,...), ...]
mydata = array(rows)   # make array out of list
#endcode
Ah! I think that's just what I need. I've started reading Travis' book, but haven't hit this part yet.

Many thanks,

Rich
Hi,
The data format is a stream of ASCII bytes. Each two-byte combination represents either a string (for two columns) or an integer (for the other 28 data columns). The conversion from position-specific bit values to the ASCII characters is accomplished using the data mapping dictionaries. I accept that I wasn't as clear as I could have been.
Understood now. But you cannot store strings and floats in an array simultaneously.
After sending the message it occurred to me that the string.Split() function is probably what I want. I'm not receiving Hex values from the scanner. I'll play with this over the weekend.
Not sure about string.split(), I don't think this is useful (would split into 1-byte chars iirc). You can use a list comprehension for the data part, like this:

line = read_data()
mapped_vals = [DATA_MAP_7[i:i+2] for i in range(2*7, 2*35)]

although I admit it is not a very pretty solution. Maybe somebody has a better idea.
You are correct, however, that I do want to take that incoming string and store it into an array in the fewest possible steps. I'll have to look at 'fromstring()' to see just what that does. I assumed that each row in the array was a list, is that incorrect?
Well, since the data has to be mapped in a certain manner first, I'm not sure if fromstring() really helps here. I did not realize this at first glance.

BTW, while having a closer look at your code I noticed that at line 72 ("line = ser.readline") there is a pair of brackets missing. Here is one free of charge: () ;-)

Johannes
On Sat, 9 Sep 2006, Johannes Loehnert wrote:
Understood now. But you cannot store strings and floats in an array simultaneously.
Johannes, I understand. I was thinking of lists, I guess. This is not a problem because I can store a numeric code (1, 2, 3) for each of the two strings and translate them later in the program. They're there for selecting rows. Originally I was going to put the incoming data into a SQLite3 database table, then extract the values using SQL 'select from ... where ...' statements, but I thought that direct application of python datatypes would be more efficient.
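A minimal sketch of the numeric-code approach described above, with rows selected by a boolean mask. The code table, column layout, and values are all illustrative assumptions, not taken from the application.

```python
import numpy as np

# Hypothetical codes standing in for one of the string-valued columns.
CODES = {"low": 1, "medium": 2, "high": 3}
NAMES = {code: name for name, code in CODES.items()}

# Columns: form number, category code, one data value (all made up).
data = np.array([
    [1, CODES["low"], 7],
    [2, CODES["high"], 4],
    [3, CODES["low"], 9],
])

low_rows = data[data[:, 1] == CODES["low"]]   # keep rows whose code is "low"
print(low_rows[:, 2].mean())                  # 8.0
print(NAMES[int(data[1, 1])])                 # high
```

The reverse dictionary `NAMES` handles the "translate them later" step.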
Not sure about string.split(), I don't think this is useful (would split into 1-byte chars iirc). You can use a list comprehension for the data part, like this:
One-byte chars are exactly what I want. That's what is coming from the scanner, in bit-mapped format. What the scanner folks call a column we see as a row on the form. Each 'row' (as we look at the form in portrait orientation) has 12 positions, each with a bit value. There are two bytes encoded there. I'm using the DATA_MAP_ dictionaries to translate from the transmitted bytes to values meaningful to the application.
line = read_data()
mapped_vals = [DATA_MAP_7[i:i+2] for i in range(2*7, 2*35)]
DATA_MAP is a dictionary; can I find keys by slicing? If so, it would be [i:i+1] from bytes 12-67.
although I admit it is not a very pretty solution. Maybe somebody has a better idea.
You are helping me, but what I'm trying to do is beyond what all my introductory references cover, and I'm not finding clearly presented solutions using Google.
BTW, while having a closer look at your code I noticed that at line 72 ("line = ser.readline") there is a pair of brackets missing. Here is one free of charge: () ;-)
Thank you. I caught that earlier today and fixed it.

Rich
Hi again,
line = read_data()
mapped_vals = [DATA_MAP_7[i:i+2] for i in range(2*7, 2*35)]
DATA_MAP is a dictionary; can I find keys by slicing? If so, it would be [i:i+1] from bytes 12-67.
I am sorry, it was late in the evening. :-) What I meant was:

mapped_vals = [DATA_MAP_7[line[i:i+2]] for i in range(2*7, 2*35)]

i.e. take a 2-byte slice of the string "line" (slicing strings is no problem) and then look up the resulting 2-byte chunk in a dictionary. What I had in mind was something like this:

for line in scanner.readline():
    header = [DATA_MAP_2[line[2:4]], DATA_MAP_5[line[5:7]]]
    mapped_vals = [DATA_MAP_7[line[i:i+2]] for i in range(2*7, 2*35)]
    col_array = array(header + mapped_vals)

("+" concatenates the lists). Thinking about it, you do not need to convert it to an array right here. You can build a list of lists and convert it into an array as the very last step.

Johannes
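A runnable sketch of the slice-then-look-up idea. The mapping, the sample record, and the offsets are invented for illustration; the real DATA_MAP_* dictionaries and byte positions live in the poster's application. The sketch assumes non-overlapping chunks are wanted, hence the explicit step of 2 in `range`.

```python
# Hypothetical mapping from two-character chunks to integer values.
DATA_MAP = {"AA": 1, "AB": 2, "BA": 3}

def parse_record(line, start, stop, mapping):
    """Look up each non-overlapping two-character chunk of line[start:stop]."""
    return [mapping[line[i:i + 2]] for i in range(start, stop, 2)]

line = "ABBAAAAB\n"   # made-up record: four chunks plus an end-of-record byte
values = parse_record(line, 0, len(line) - 1, DATA_MAP)
print(values)         # [2, 3, 1, 2]
```

Passing `len(line) - 1` as the stop offset leaves the end-of-record byte out of the lookup. When looping over many records, iterate over a sequence of lines; iterating directly over a single string yields its individual characters.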
Rich Shepard wrote:
On Sat, 9 Sep 2006, Johannes Loehnert wrote:
Understood now. But you cannot store strings and floats in an array simultaneously.
Johannes,
I understand. I was thinking of lists, I guess. This is not a problem because I can store a numeric code (1, 2, 3) for each of the two strings and translate them later in the program. They're there for selecting rows.
Actually, with the sophisticated dtypes (datatypes) now in numpy, you can store strings and floats in the same array. However, your solution sounds sensible and simpler than creating such an array, so you may prefer to carry on.
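For reference, a small sketch of what such a mixed array can look like using a NumPy structured dtype. The field names, sizes, and values are illustrative only.

```python
import numpy as np

# Field names and sizes are illustrative, not taken from the application.
form_dtype = np.dtype([("form", "i4"), ("group", "U8"), ("score", "f8")])

records = np.array(
    [(1, "north", 2.5), (2, "south", 4.0), (3, "north", 3.5)],
    dtype=form_dtype,
)

north = records[records["group"] == "north"]   # select rows by string field
print(north["score"].mean())                   # 3.0
```

Each field is accessed by name, so string columns can be used for selection while numeric columns stay available for arithmetic.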
On Mon, 11 Sep 2006, Andrew Straw wrote:
Actually, with the sophisticated dtypes (datatypes) now in numpy, you can store strings and floats in the same array. However, your solution sounds sensible and simpler than creating such an array, so you may prefer to carry on.
Thank you, Andrew. Thinking more about the process, I decided that writing the parsed and processed stream into a sqlite3 table is the better approach. This gives us permanent storage of the digital images of each scanned paper form so they can be compared if a data audit is required.

Also, it is easier -- for me, at least -- to write the SQL select statements to calculate column averages grouped by the two string values. It is these average values that are to be inserted into a 2D array.

Rich
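A minimal sketch of that workflow using the standard-library `sqlite3` module and an in-memory database. The table name, column names, and values are hypothetical.

```python
import sqlite3
import numpy as np

conn = sqlite3.connect(":memory:")
# Hypothetical schema: two string columns used for grouping, two data columns.
conn.execute("CREATE TABLE forms (grp TEXT, site TEXT, q1 REAL, q2 REAL)")
conn.executemany(
    "INSERT INTO forms VALUES (?, ?, ?, ?)",
    [("a", "x", 1.0, 2.0), ("a", "x", 3.0, 4.0), ("b", "y", 5.0, 6.0)],
)

cur = conn.execute(
    "SELECT AVG(q1), AVG(q2) FROM forms GROUP BY grp, site ORDER BY grp, site"
)
averages = np.array(cur.fetchall())   # one row per (grp, site) pair
conn.close()
print(averages.shape)                 # (2, 2)
```

The `ORDER BY` clause keeps the row order of the resulting array predictable.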
participants (3)
- Andrew Straw
- Johannes Loehnert
- Rich Shepard