C API PyObject_CallFunctionObjArgs returns incorrect result

Mon Mar 7 14:03:24 EST 2022

On 2022-03-07 17:05, Jen Kris wrote:
> Thank you MRAB for your reply.
>
> Regarding your first question, pSentence is a list.  In the nltk 
> library, nltk.word_tokenize takes a string, so we convert sentence to 
> string before we call nltk.word_tokenize:
>
> >>> sentence = " ".join(sentence)
> >>> pt = nltk.word_tokenize(sentence)
> >>> print(sentence)
> [ Emma by Jane Austen 1816 ]
>
> But with the C API it looks like this:
>
> PyObject *pSentence = PySequence_GetItem(pSents, sent_count);
> PyObject* str_sentence = PyObject_Str(pSentence); // Convert to string
>
> // See what str_sentence looks like:
> PyObject* repr_str = PyObject_Repr(str_sentence);
> PyObject* str_str = PyUnicode_AsEncodedString(repr_str, "utf-8", "~E~");
> const char *bytes_str = PyBytes_AS_STRING(str_str);
> printf("REPR_String: %s\n", bytes_str);
>
> REPR_String: "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"
>
> So the two string representations are not the same – or at least the   
> PyUnicode_AsEncodedString is not the same, as each item is surrounded 
> by single quotes.
>
> Assuming that the conversion to bytes object for the REPR is an 
> accurate representation of str_sentence, it looks like I need to strip 
> the quotes from str_sentence before “PyObject* pWTok = 
> PyObject_CallFunctionObjArgs(pNltk_WTok, str_sentence, 0).”
>
> So my questions now are (1) is there a C API function that will 
> convert a list to a string exactly the same way as ‘’.join, and if not 
> then (2) how can I strip characters from a string object in the C API?
>
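Your Python code is joining the list with a space as the separator.
PyObject_Str(pSentence) is the equivalent of str(sentence) instead, so it
gives you the printed form of the list, with the quotes and commas.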

The equivalent using the C API is:

     PyObject* separator;
     PyObject* joined;

     separator = PyUnicode_FromString(" ");
     joined = PyUnicode_Join(separator, pSentence);
     Py_DECREF(separator);
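
As a quick check at the Python level (a sketch only, with the token list
copied from your output rather than read from the corpus), str()
corresponds to PyObject_Str and " ".join() to PyUnicode_Join:

```python
# Token list copied from the thread (assumed here, not read from nltk).
sentence = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

# str() on a list returns its printed form, quotes and commas included.
# This corresponds to PyObject_Str(pSentence) in the C API.
as_str = str(sentence)

# " ".join() concatenates the items with a single space between them.
# This corresponds to PyUnicode_Join(separator, pSentence).
joined = " ".join(sentence)

print(as_str)
print(joined)
```

Only the joined form splits back into the original seven tokens, so that
is the string to pass to word_tokenize.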

>
> Mar 6, 2022, 17:42 by python at mrabarnett.plus.com:
>
>     On 2022-03-07 00:32, Jen Kris via Python-list wrote:
>
>         I am using the C API in Python 3.8 with the nltk library, and
>         I have a problem with the return from a library call
>         implemented with PyObject_CallFunctionObjArgs.
>
>         This is the relevant Python code:
>
>         import nltk
>         from nltk.corpus import gutenberg
>         fileids = gutenberg.fileids()
>         sentences = gutenberg.sents(fileids[0])
>         sentence = sentences[0]
>         sentence = " ".join(sentence)
>         pt = nltk.word_tokenize(sentence)
>
>         I run this at the Python command prompt to show how it works:
>
>         >>> sentence = " ".join(sentence)
>         >>> pt = nltk.word_tokenize(sentence)
>         >>> print(pt)
>         ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>         >>> type(pt)
>         <class 'list'>
>
>         This is the relevant part of the C API code:
>
>         PyObject* str_sentence = PyObject_Str(pSentence);
>         // nltk.word_tokenize(sentence)
>         PyObject* pNltk_WTok = PyObject_GetAttrString(pModule_mstr,
>         "word_tokenize");
>         PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok,
>         str_sentence, 0);
>
>         (where pModule_mstr is the nltk library).
>
>         That should produce a list with a length of 7 that looks like
>         it does on the command line version shown above:
>
>         ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>
>         But instead the C API produces a list with a length of 24, and
>         the REPR looks like this:
>
>         '[\'[\', "\'", \'[\', "\'", \',\', "\'Emma", "\'", \',\',
>         "\'by", "\'", \',\', "\'Jane", "\'", \',\', "\'Austen", "\'",
>         \',\', "\'1816", "\'", \',\', "\'", \']\', "\'", \']\']'
>
>         I also tried this with PyObject_CallMethodObjArgs and
>         PyObject_Call without success.
>
>         Thanks for any help on this.
>
>     What is pSentence? Is it what you think it is?
>     To me it looks like it's either the list:
>
>     ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>
>     or that list as a string:
>
>     "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"
>
>     and that's what you're tokenising.