C API PyObject_CallFunctionObjArgs returns incorrect result
Jen Kris
jenkris at tutanota.com
Mon Mar 7 16:08:26 EST 2022
Thanks to MRAB and Chris Angelico for your help. Here is how I implemented the string conversion, and it works correctly now for a library call that needs a list converted to a string (error handling not shown):
PyObject* str_sentence = PyObject_Str(pSentence);
PyObject* separator = PyUnicode_FromString(" ");
PyObject* str_join = PyUnicode_Join(separator, pSentence);
Py_DECREF(separator);
PyObject* pNltk_WTok = PyObject_GetAttrString(pModule_mstr, "word_tokenize");
PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok, str_join, NULL);
That produces what I need (this is the REPR of pWTok):
"['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"
Thanks again to both of you.
Jen
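For reference, the difference between the two conversions can be reproduced in plain Python. This is only a minimal sketch, not part of the code above: the list literal stands in for the first gutenberg sentence quoted in this thread, and the variable names are illustrative. PyObject_Str is the C equivalent of str(), which for a list gives its printable form with the quotes and commas included, while PyUnicode_Join is the C equivalent of str.join, which gives the plain sentence.

```python
# Stand-in for sentences[0] from the gutenberg corpus, as printed in this thread.
sentence = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

# Equivalent of PyObject_Str(pSentence): the list's printable form,
# with brackets, commas and quotes around every item.
as_str = str(sentence)

# Equivalent of PyUnicode_Join(separator, pSentence) with a " " separator.
joined = " ".join(sentence)

print(as_str)
print(joined)
```

The first line printed is the string that was being passed to word_tokenize in the original attempt; the second is the one the tokenizer expects.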
Mar 7, 2022, 11:03 by python at mrabarnett.plus.com:
> On 2022-03-07 17:05, Jen Kris wrote:
>
>> Thank you MRAB for your reply.
>>
>> Regarding your first question, pSentence is a list. In the nltk library, nltk.word_tokenize takes a string, so we convert sentence to string before we call nltk.word_tokenize:
>>
>> >>> sentence = " ".join(sentence)
>> >>> pt = nltk.word_tokenize(sentence)
>> >>> print(sentence)
>> [ Emma by Jane Austen 1816 ]
>>
>> But with the C API it looks like this:
>>
>> PyObject *pSentence = PySequence_GetItem(pSents, sent_count);
>> PyObject* str_sentence = PyObject_Str(pSentence); // Convert to string
>>
>> // See what str_sentence looks like:
>> PyObject* repr_str = PyObject_Repr(str_sentence);
>> PyObject* str_str = PyUnicode_AsEncodedString(repr_str, "utf-8", "~E~");
>> const char *bytes_str = PyBytes_AS_STRING(str_str);
>> printf("REPR_String: %s\n", bytes_str);
>>
>> REPR_String: "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"
>>
>> So the two string representations are not the same, or at least the output of PyUnicode_AsEncodedString is not the same, since each item is surrounded by single quotes.
>>
>> Assuming that the conversion to bytes object for the REPR is an accurate representation of str_sentence, it looks like I need to strip the quotes from str_sentence before “PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok, str_sentence, 0).”
>>
>> So my questions now are: (1) is there a C API function that converts a list to a string exactly the same way as " ".join does, and if not, (2) how can I strip characters from a string object in the C API?
>>
> Your Python code is joining the list with a space as the separator.
>
> The equivalent using the C API is:
>
> PyObject* separator;
> PyObject* joined;
>
> separator = PyUnicode_FromString(" ");
> joined = PyUnicode_Join(separator, pSentence);
> Py_DECREF(separator);
>
>>
>> Mar 6, 2022, 17:42 by python at mrabarnett.plus.com:
>>
>> On 2022-03-07 00:32, Jen Kris via Python-list wrote:
>>
>> I am using the C API in Python 3.8 with the nltk library, and
>> I have a problem with the return from a library call
>> implemented with PyObject_CallFunctionObjArgs.
>>
>> This is the relevant Python code:
>>
>> import nltk
>> from nltk.corpus import gutenberg
>> fileids = gutenberg.fileids()
>> sentences = gutenberg.sents(fileids[0])
>> sentence = sentences[0]
>> sentence = " ".join(sentence)
>> pt = nltk.word_tokenize(sentence)
>>
>> I run this at the Python command prompt to show how it works:
>>
>> sentence = " ".join(sentence)
>> pt = nltk.word_tokenize(sentence)
>> print(pt)
>>
>> ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>>
>> type(pt)
>>
>> <class 'list'>
>>
>> This is the relevant part of the C API code:
>>
>> PyObject* str_sentence = PyObject_Str(pSentence);
>> // nltk.word_tokenize(sentence)
>> PyObject* pNltk_WTok = PyObject_GetAttrString(pModule_mstr,
>> "word_tokenize");
>> PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok,
>> str_sentence, 0);
>>
>> (where pModule_mstr is the nltk library).
>>
>> That should produce a list with a length of 7 that looks like
>> it does on the command line version shown above:
>>
>> ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>>
>> But instead the C API produces a list with a length of 24, and
>> the REPR looks like this:
>>
>> '[\'[\', "\'", \'[\', "\'", \',\', "\'Emma", "\'", \',\',
>> "\'by", "\'", \',\', "\'Jane", "\'", \',\', "\'Austen", "\'",
>> \',\', "\'1816", "\'", \',\', "\'", \']\', "\'", \']\']'
>>
>> I also tried this with PyObject_CallMethodObjArgs and
>> PyObject_Call without success.
>>
>> Thanks for any help on this.
>>
>> What is pSentence? Is it what you think it is?
>> To me it looks like it's either the list:
>>
>> ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>>
>> or that list as a string:
>>
>> "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"
>>
>> and that's what you're tokenising.
>> -- https://mail.python.org/mailman/listinfo/python-list
>>