decoding keyboard input when using curses

Sun May 31 04:05:20 EDT 2009

Chris Jones <cjns1989 at gmail.com> writes:

Hi Chris, thanks for your detailed reply.

> On Sat, May 30, 2009 at 04:55:19PM EDT, Arnaud Delobelle wrote:
>
>> Hi all,
>
> Disclaimer: I am not familiar with the curses python implementation and
> I'm neither an ncurses nor a "unicode" expert by a long shot.
>
> :-)
>
>> I am looking for advice on how to use unicode with curses.  First I will
>> explain my understanding of how curses deals with keyboard input and how
>> it differs with what I would like.
>> 
>> The curses module has a window.getch() function to capture keyboard
>> input.  This function returns an integer which is more or less:
>> 
>> * a byte if the key which was pressed is a printable character (e.g. a,
>>   F, &);
>> 
>> * an integer > 255 if it is a special key, e.g. if you press KEY_UP it
>>   returns 259.
>
> The getch(3NCURSES) function returns an integer. Provided it's large
> enough to accommodate the highest possible value, the actual size in
> bytes of the integer should be irrelevant.

Sorry, I was somehow mixing up what happens in general and what happens
with utf-8 (probably because I have only done tests with utf-8), where
the number of bytes used to encode a character varies.

>> As far as I know, curses is totally unicode unaware, 
>
> My impression is that rather than "unicode unaware", it is "unicode
> transparent" - or (nitpicking) "UTF8 transparent" - since I'm not sure
> other flavors of unicode are supported.

>> so if the key pressed is printable but not ASCII, 
>
> .. nitpicking again, but ASCII is a 7-bit encoding: 0-127.
>
>> the getch() function will return one or more bytes depending on the
>> encoding in the terminal.
>
> I don't know about the python implementation, but my guess is that it
> should closely follow the underlying ncurses API - so the above is
> basically correct, although it's not a question of the number of bytes
> but rather the returned range of integers - if your locale is en_US then
> that should be 0-255.. if it is en_US.utf8 the range is considerably
> larger.

In my tests, my locale is en_GB.utf8 and the python getch() function
does return a number of bytes - see below.

>> E.g. given utf-8 encoding, if I press the key 'é' on my keyboard (which
>> is encoded as '\xc3\xa9' in utf-8), I will need two calls to getch() to
>> get this: the first one will return 0xC3 and the second one 0xA9.
>
> No. A single call to getch() will grab your "é" and return 0xc3a9,
> decimal 50089.

It is the case though that on my machine, if I press 'é' then call
getch() it will return 0xC3.  A further call to getch() will return
0xA9.  This is why I was talking about getch() returning bytes: to me it
behaves as if it returns the encoded characters byte by byte.
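
To make the byte-by-byte behaviour concrete, here is a small sketch in
plain Python (no curses involved) of the values in question for 'é'.
The helper name utf8_bytes is made up for this illustration, and the
snippet is written so that it should run under both Python 2 and 3.

```python
def utf8_bytes(char):
    """Return the byte values that utf-8 uses to encode `char`."""
    return list(bytearray(char.encode("utf-8")))

e_acute = u"\u00e9"                      # e with acute accent
codes = utf8_bytes(e_acute)

# The two values I see from successive getch() calls
print([hex(b) for b in codes])           # ['0xc3', '0xa9']

# The same two bytes glued into one integer, as in Chris's reply
print(hex(codes[0] << 8 | codes[1]))     # 0xc3a9

# The Unicode code point of the character is a third, different value
print(hex(ord(e_acute)))                 # 0xe9
```

So 0xc3a9 is just the two utf-8 bytes side by side; it is neither what
I get from a single getch() call here nor the code point of 'é'.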

>> Instead of getting a stream of bytes and special keycodes (with value >
>> 255) from getch(), what I want is a stream of *unicode characters* and
>> special keycodes.
>
> This is what getch(3NCURSES) does: it returns the integer value of one
> "unicode character".

It is not what happens in my tests.  I have made a simple testing
script, see below.

> Likewise, I would assume that looping over the python equivalent of
> getch() will not return a stream of bytes but rather a "stream" of
> integers that map one to one to the "unicode characters" that were
> entered at the terminal.

> Note: I am only familiar with languages such as English, Spanish,
> French, etc. where only one terminal cell is used for each glyph. My
> understanding is that things get somewhat more complicated with
> languages that require so-called "wide characters" - two terminal cells
> per character, but that's a different issue.
>
>> So, still assuming utf-8 encoding in the terminal, if I type:
>> 
>>     Té[KEY_UP]ça
>> 
>> iterating calls to the getch() function will give me this sequence of
>> integers:
>> 
>>     84, 195, 169, 259,   195, 167, 97
>>     T-  é-------  KEY_UP ç-------  a-
>> 
>> But what I want is to get this stream instead:
>> 
>>     u'T', u'é', 259, u'ç', u'a'
>
> No, for the above, getch() will return:
>
>      84, 50089, 259, 50087, 97
>
> .. which is "functionally" equivalent to:
>
>      u'T', u'é', 259, u'ç', u'a'
>
> [..]
>
> So shouldn't this issue boil down to just a matter of casting the
> integers to the "u" data type?
>
> This short snippet may help clarify the above:
>
> -----------------------------------------------------------------------
> #include <locale.h>
> #include <ncurses.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <string.h>
>
> int unichar;
>
> int main(int argc, char *argv[])
> {
>   setlocale(LC_ALL, "en_US.UTF-8");        /* make sure UTF8       */
>   initscr();                               /* start curses mode    */
>   raw();
>   keypad(stdscr, TRUE);                    /* pass special keys    */
>   unichar = getch();                       /* read terminal        */
>
>   mvprintw(24, 0, "Key pressed is = %4x ", unichar);
>
>   refresh();
>   getch();                                 /* wait                 */
>   endwin();                                /* leave curses mode    */
>   return 0;
> }
> -----------------------------------------------------------------------
>
> Hopefully you have access to a C compiler:
>
> $ gcc uni00.c -o uni00 -lncurses

Thanks for this.  When I test it on my machine (BTW it is Mac OS X
10.5.7), if I type an ASCII character (e.g. 'A'), I get its ASCII code
(0x41), but if I type a non-ASCII character (e.g. '§') the program exits
and I get back to the prompt immediately.  It must be because two values
are queued for getch(): the first call reads one and the second call,
which is meant to wait, reads the other.  I should try it on a Linux
machine, but I don't have one handy at the moment.

I have made a little test script in Python which is similar but will
only stop when 'Esc' is pressed.

--------------------------------------------------
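import curses

def getcodes(win):
    # collect every value returned by getch() until Esc is pressed
    codes = []
    while True:
        c = win.getch()
        if c == 27:  # Esc
            return codes
        codes.append(c)

# wrapper() sets up curses, calls getcodes(stdscr), then restores the terminal
print(curses.wrapper(getcodes))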
--------------------------------------------------

If I try this in a Terminal and type 'souçi[ESC]', I get this:

[115, 111, 117, 195, 167, 105]
 s--, o--, u--, ç-------, i--

As you see, two calls to getch() were necessary after typing 'ç'.
BTW on the same terminal:

marigold:junk arno$ locale
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=

I will have to do tests with other encodings.
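
In the meantime, here is a minimal sketch of the conversion I am after,
assuming getch() really does hand back one byte at a time as in my
tests.  It uses the incremental decoders from the standard codecs
module; the function name decode_keys and its encoding parameter are
made up for this example, and it takes a plain list of integers so it
can be tried without a terminal.

```python
import codecs

def decode_keys(codes, encoding="utf-8"):
    """Turn getch()-style integers into unicode characters, passing
    special keycodes (integers > 255) through unchanged."""
    decoder = codecs.getincrementaldecoder(encoding)()
    out = []
    for c in codes:
        if c > 255:
            # special key such as KEY_UP: not part of any character
            decoder.reset()          # drop any incomplete byte sequence
            out.append(c)
        else:
            # feed one byte; the result is empty until a character is complete
            char = decoder.decode(bytes(bytearray([c])))
            if char:
                out.append(char)
    return out

stream = decode_keys([84, 195, 169, 259, 195, 167, 97])
# show characters as code points so that the output stays ASCII-only
print([x if isinstance(x, int) else hex(ord(x)) for x in stream])
```

With the sequence from my earlier example this gives the five items I
wanted (T, é, 259, ç, a).  Passing another encoding name, e.g.
"latin-1", should cover terminals that are not using utf-8, and a byte
that is invalid in the chosen encoding raises UnicodeDecodeError.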

-- 
Arnaud