decoding keyboard input when using curses
Arnaud Delobelle
arnodel at googlemail.com
Sun May 31 04:05:20 EDT 2009
Chris Jones <cjns1989 at gmail.com> writes:
Hi Chris, thanks for your detailed reply.
> On Sat, May 30, 2009 at 04:55:19PM EDT, Arnaud Delobelle wrote:
>
>> Hi all,
>
> Disclaimer: I am not familiar with the curses python implementation and
> I'm neither an ncurses nor a "unicode" expert by a long shot.
>
> :-)
>
>> I am looking for advice on how to use unicode with curses. First I will
>> explain my understanding of how curses deals with keyboard input and how
>> it differs with what I would like.
>>
>> The curses module has a window.getch() function to capture keyboard
>> input. This function returns an integer which is more or less:
>>
>> * a byte if the key which was pressed is a printable character (e.g. a,
>> F, &);
>>
>> * an integer > 255 if it is a special key, e.g. if you press KEY_UP it
>> returns 259.
>
> The getch(3NCURSES) function returns an integer. Provided it's large
> enough to accommodate the highest possible value, the actual size in
> bytes of the integer should be irrelevant.
Sorry, I was somehow mixing up what happens in general and what happens
with utf-8 (probably because I have only done tests with utf-8), where
the number of bytes used to encode a character varies.
>> As far as I know, curses is totally unicode unaware,
>
> My impression is that rather than "unicode unaware", it is "unicode
> transparent" - or (nitpicking) "UTF8 transparent" - since I'm not sure
> other flavors of unicode are supported.
>> so if the key pressed is printable but not ASCII,
>
> .. nitpicking again, but ASCII is a 7-bit encoding: 0-127.
>
>> the getch() function will return one or more bytes depending on the
>> encoding in the terminal.
>
> I don't know about the python implementation, but my guess is that it
> should closely follow the underlying ncurses API - so the above is
> basically correct, although it's not a question of the number of bytes
> but rather the returned range of integers - if your locale is en_US then
> that should be 0-255.. if it is en_US.utf8 the range is considerably
> larger.
In my tests, my locale is en_GB.utf8 and the python getch() function
does return a number of bytes - see below.
>> E.g. given utf-8 encoding, if I press the key 'é' on my keyboard (which
>> is encoded as '\xc3\xa9' in utf-8), I will need two calls to getch() to get
>> this: the first one will return 0xC3 and the second one 0xA9.
>
> No. A single call to getch() will grab your "é" and return 0xc3a9,
> decimal 50089.
It is the case though that on my machine, if I press 'é' then call
getch() it will return 0xC3. A further call to getch() will return
0xA9. This is why I was talking about getch() returning bytes: to me it
behaves as if it returns the encoded character byte by byte.
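
To make the difference between the two readings concrete, here is a small
sketch of the arithmetic only (it does not call curses). The names first,
second, char and packed are mine, chosen for illustration.

```python
# The two values I get from successive getch() calls after pressing 'é'.
first, second = 0xC3, 0xA9

# Read as a UTF-8 byte sequence, they decode to a single character.
char = bytearray([first, second]).decode("utf-8")

# The value 0xc3a9 (decimal 50089) is those same two bytes packed into
# one integer; it is not the Unicode code point of 'é', which is 0xE9.
packed = (first << 8) | second
codepoint = ord(char)
```

So even if getch() did return 50089, it would still need decoding before it
could be treated as the character 'é'.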
>> Instead of getting a stream of bytes and special keycodes (with value >
>> 255) from getch(), what I want is a stream of *unicode characters* and
>> special keycodes.
>
> This is what getch(3NCURSES) does: it returns the integer value of one
> "unicode character".
It is not what happens in my tests. I have made a simple testing
script, see below.
> Likewise, I would assume that looping over the python equivalent of
> getch() will not return a stream of bytes but rather a "stream" of
> integers that map one to one to the "unicode characters" that were
> entered at the terminal.
> Note: I am only familiar with languages such as English, Spanish,
> French, etc. where only one terminal cell is used for each glyph. My
> understanding is that things get somewhat more complicated with
> languages that require so-called "wide characters" - two terminal cells
> per character, but that's a different issue.
>
>> So, still assuming utf-8 encoding in the terminal, if I type:
>>
>> Té[KEY_UP]ça
>>
>> iterating calls to the getch() function will give me this sequence of
>> integers:
>>
>> 84, 195, 169, 259,    195, 167, 97
>> T-  é-------  KEY_UP  ç-------  a-
>>
>> But what I want is to get this stream instead:
>>
>> u'T', u'é', 259, u'ç', u'a'
>
> No, for the above, getch() will return:
>
> 84, 50089, 259, 50087, 97
>
> .. which is "functionally" equivalent to:
>
> u'T', u'é', 259, u'ç', u'a'
>
> [..]
>
> So shouldn't this issue boil down to just a matter of casting the
> integers to the "u" data type?
>
> This short snippet may help clarify the above:
>
> -----------------------------------------------------------------------
> #include <locale.h>
> #include <ncurses.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <string.h>
>
> int unichar;
>
> int main(int argc, char *argv[])
> {
>     setlocale(LC_ALL, "en_US.UTF-8");   /* make sure UTF8 */
>     initscr();                          /* start curses mode */
>     raw();
>     keypad(stdscr, TRUE);               /* pass special keys */
>     unichar = getch();                  /* read terminal */
>
>     mvprintw(24, 0, "Key pressed is = %4x ", unichar);
>
>     refresh();
>     getch();                            /* wait */
>     endwin();                           /* leave curses mode */
>     return 0;
> }
> -----------------------------------------------------------------------
>
> Hopefully you have access to a C compiler:
>
> $ gcc -lncurses uni00.c -o uni00
Thanks for this. When I test it on my machine (BTW it is MacOS 10.5.7),
if I type an ASCII character (e.g. 'A'), I get its ASCII code (0x41),
but if I type a non-ASCII character (e.g. '§') I get back to the prompt
immediately. It must be because two values are queued for getch(), so
the second call, which is only meant to wait for a keypress, returns at
once with the second value. I should try it on a Linux machine, but I
don't have one handy at the moment.
I have made a little test script in Python which is similar but will
only stop when 'Esc' is pressed.
--------------------------------------------------
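import curses

def getcodes(win):
    """Collect the values returned by getch() until Esc (27) is pressed."""
    codes = []
    while True:
        c = win.getch()
        if c == 27:  # Esc ends the test
            return codes
        codes.append(c)

# wrapper() sets up curses, calls getcodes(stdscr) and restores the terminal
print(curses.wrapper(getcodes))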
--------------------------------------------------
If I try this in a Terminal and type 'souçi[ESC]', I get this:
[115, 111, 117, 195, 167, 105]
 s--, o--, u--, ç-------, i--
As you see, two calls to getch() were necessary after typing 'ç'.
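
Given that behaviour, one way to get the stream I described (unicode
characters mixed with special keycodes) might be to feed the byte values
through an incremental decoder. This is only a sketch: decode_keys is a name
I made up, it assumes that every value above 255 is a special keycode and
that everything else is one byte in the terminal's encoding, and it is
written for Python 3.

```python
import codecs

def decode_keys(codes, encoding="utf-8"):
    """Turn getch()-style integers into characters and special keycodes.

    Assumption: values > 255 are curses keycodes and are passed through;
    values <= 255 are bytes of text in the given encoding.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for code in codes:
        if code > 255:
            # A special key interrupts any half-received character.
            decoder.reset()
            yield code
        else:
            # The decoder buffers incomplete sequences and returns ''
            # until a full character has arrived.
            text = decoder.decode(bytes(bytearray([code])))
            for char in text:
                yield char
```

With the sequence from my earlier example, list(decode_keys([84, 195, 169,
259, 195, 167, 97])) gives 'T', 'é', 259, 'ç', 'a'. In a real program the
encoding would have to come from the locale rather than being hard-coded.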
BTW on the same terminal:
marigold:junk arno$ locale
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=
I will have to do tests with other encodings.
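
For reference, the curses module in Python 3.3 and later also has
window.get_wch(), which returns a str for an ordinary character and an int
for a special key, so the decoding is done by the library. The following is
a sketch only: getkeys and main are illustrative names of mine, and I am
assuming that Esc arrives as the one-character string '\x1b' and that the
locale has been set from the environment before curses starts.

```python
import curses
import locale

def getkeys(win):
    """Collect keys from win.get_wch() until Esc is pressed."""
    keys = []
    while True:
        key = win.get_wch()   # str for characters, int for special keys
        if key == "\x1b":     # assumed representation of Esc
            return keys
        keys.append(key)

def main():
    # Let curses pick up the terminal's encoding from the environment.
    locale.setlocale(locale.LC_ALL, "")
    print(curses.wrapper(getkeys))
```

Calling main() in a terminal would run the same experiment as my script
above, but collecting whole characters instead of bytes.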
--
Arnaud