Strings, bytes and Unicode conversions
######################################

.. note::

    This section discusses string handling in terms of Python 3 strings. For
    Python 2.7, replace all occurrences of ``str`` with ``unicode`` and
    ``bytes`` with ``str``.  Python 2.7 users may find it best to use ``from
    __future__ import unicode_literals`` to avoid unintentionally using ``str``
    instead of ``unicode``.

Passing Python strings to C++
=============================

When a Python ``str`` is passed from Python to a C++ function that accepts
``std::string`` or ``char *`` as arguments, pybind11 will encode the Python
string to UTF-8. Every ``str`` made up of valid Unicode text can be encoded in
UTF-8, so this operation does not fail in practice; the one exception is a
``str`` containing lone surrogate code points, which UTF-8 cannot represent.

The C++ language is encoding agnostic. It is the responsibility of the
programmer to track encodings. It's often easiest to simply `use UTF-8
everywhere <http://utf8everywhere.org/>`_.

.. code-block:: c++

    m.def("utf8_test",
        [](const std::string &s) {
            std::cout << "utf-8 is icing on the cake.\n";
            std::cout << s;
        }
    );
    m.def("utf8_charptr",
        [](const char *s) {
            std::cout << "My favorite food is\n";
            std::cout << s;
        }
    );

.. code-block:: python

    >>> utf8_test('🎂')
    utf-8 is icing on the cake.
    🎂

    >>> utf8_charptr('🍕')
    My favorite food is
    🍕

.. note::

    Some terminal emulators do not support UTF-8 or emoji fonts and may not
    display the example above correctly.

The results are the same whether the C++ function accepts arguments by value or
reference, and whether or not ``const`` is used.
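
As a rough illustration of the encoding step, the following pure-Python sketch
shows the UTF-8 bytes that a ``std::string`` parameter would hold for the cake
example. It uses only the built-in ``str.encode`` and does not call pybind11;
the variable names are invented for this example.

.. code-block:: python

    cake = "🎂"  # U+1F382, a single Unicode character

    # The same encoding that is applied before the C++ function is called.
    utf8_bytes = cake.encode("utf-8")

    char_count = len(cake)        # number of Unicode characters
    byte_count = len(utf8_bytes)  # number of bytes the C++ side would see

    print(char_count, byte_count, utf8_bytes)

The character count and the byte count differ, so on the C++ side ``s.size()``
reports bytes rather than characters.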

Passing bytes to C++
--------------------

A Python ``bytes`` object will be passed to C++ functions that accept
``std::string`` or ``char*`` *without* conversion.  On Python 3, in order to
make a function *only* accept ``bytes`` (and not ``str``), declare it as taking
a ``py::bytes`` argument.

Returning C++ strings to Python
===============================

When a C++ function returns a ``std::string`` or ``char*`` to a Python caller,
**pybind11 will assume that the string is valid UTF-8** and will decode it to a
native Python ``str``, using the same API as Python uses to perform
``bytes.decode('utf-8')``. If this implicit conversion fails, pybind11 will
raise a ``UnicodeDecodeError``.

.. code-block:: c++

    m.def("std_string_return",
        []() {
            return std::string("This string needs to be UTF-8 encoded");
        }
    );

.. code-block:: python

    >>> isinstance(example.std_string_return(), str)
    True

Because UTF-8 is a superset of ASCII, there is never any issue with
returning a pure ASCII string to Python. If there is any possibility that the
string is not pure ASCII, it is necessary to ensure the encoding is valid
UTF-8.
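
The decoding rule can be tried out from Python without any C++ code. The
sketch below is only an illustration: ``is_valid_utf8`` is a hypothetical
helper, not part of pybind11, and it relies on ``bytes.decode('utf-8')``
behaving like the implicit conversion described above.

.. code-block:: python

    def is_valid_utf8(data: bytes) -> bool:
        """Return True if ``data`` decodes cleanly as UTF-8 (hypothetical helper)."""
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            return False
        return True

    # Pure ASCII is always valid UTF-8.
    ascii_ok = is_valid_utf8(b"plain ASCII")

    # Non-ASCII text is fine as long as it was encoded as UTF-8 ...
    utf8_ok = is_valid_utf8("résumé".encode("utf-8"))

    # ... but the same text encoded as Latin-1 is rejected.
    latin1_ok = is_valid_utf8("résumé".encode("latin-1"))

    print(ascii_ok, utf8_ok, latin1_ok)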

.. warning::

    Implicit conversion assumes that a returned ``char *`` is null-terminated.
    If there is no null terminator, a buffer overrun will occur.

Explicit conversions
--------------------

If some C++ code constructs a ``std::string`` that is not a UTF-8 string, one
can perform an explicit conversion and return a ``py::str`` object. Explicit
conversion has the same overhead as implicit conversion.

.. code-block:: c++

    // This uses the Python C API to convert Latin-1 to Unicode
    m.def("str_output",
        []() {
            std::string s = "Send your r\xe9sum\xe9 to Alice in HR"; // Latin-1
            // The third argument selects the error handler; nullptr means "strict"
            py::handle py_s = PyUnicode_DecodeLatin1(s.data(), s.length(), nullptr);
            if (!py_s) {
                throw py::error_already_set();
            }
            return py::reinterpret_steal<py::str>(py_s);
        }
    );

.. code-block:: python

    >>> str_output()
    'Send your résumé to Alice in HR'

The `Python C API
<https://docs.python.org/3/c-api/unicode.html#built-in-codecs>`_ provides
several built-in codecs.

One could also use a third party encoding library such as libiconv to transcode
to UTF-8.
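
The transcoding idea can be sketched in Python, with the standard library
codecs standing in for libiconv or the C API. This is an illustration under
that assumption rather than pybind11 code; the variable names are invented
here, and the byte string is the Latin-1 data from the example above.

.. code-block:: python

    # Latin-1 bytes, as the C++ code in the example above would produce them.
    latin1_data = b"Send your r\xe9sum\xe9 to Alice in HR"

    # Decode with the encoding the data is actually in ...
    text = latin1_data.decode("latin-1")

    # ... then re-encode as UTF-8, which is safe to return as a std::string.
    utf8_data = text.encode("utf-8")

    print(text)
    print(utf8_data)

Each ``é`` takes one byte in Latin-1 and two bytes in UTF-8, so the transcoded
data is two bytes longer than the input.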

Return C++ strings without conversion
-------------------------------------

If the data in a C++ ``std::string`` does not represent text and should be
returned to Python as ``bytes``, then one can return the data as a
``py::bytes`` object.

.. code-block:: c++

    m.def("return_bytes",
        []() {
            std::string s("\xba\xd0\xba\xd0");  // Not valid UTF-8
            return py::bytes(s);  // Return the data without transcoding
        }
    );

.. code-block:: python

    >>> example.return_bytes()
    b'\xba\xd0\xba\xd0'

Note the asymmetry: pybind11 will convert ``bytes`` to ``std::string`` without
encoding, but cannot convert ``std::string`` back to ``bytes`` implicitly.

.. code-block:: c++

    m.def("asymmetry",
        [](std::string s) {  // Accepts str or bytes from Python
            return s;  // Looks harmless, but implicitly converts to str
        }
    );

.. code-block:: python

    >>> isinstance(example.asymmetry(b"have some bytes"), str)
    True

    >>> example.asymmetry(b"\xba\xd0\xba\xd0")  # invalid utf-8 as bytes
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte

Wide character strings
======================

When a Python ``str`` is passed to a C++ function expecting ``std::wstring``,
``wchar_t*``, ``std::u16string`` or ``std::u32string``, the ``str`` will be
encoded to UTF-16 or UTF-32 depending on how the C++ compiler implements each
type, in the platform's native endianness. When strings of these types are
returned, they are assumed to contain valid UTF-16 or UTF-32, and will be
decoded to Python ``str``.

.. code-block:: c++

    #define UNICODE
    #include <windows.h>

    m.def("set_window_text",
        [](HWND hwnd, std::wstring s) {
            // Call SetWindowText with null-terminated UTF-16 string
            ::SetWindowText(hwnd, s.c_str());
        }
    );
    m.def("get_window_text",
        [](HWND hwnd) {
            const int buffer_size = ::GetWindowTextLength(hwnd) + 1;
            auto buffer = std::make_unique< wchar_t[] >(buffer_size);

            // unique_ptr exposes the raw pointer through get(), not data()
            ::GetWindowText(hwnd, buffer.get(), buffer_size);

            std::wstring text(buffer.get());

            // wstring will be converted to Python str
            return text;
        }
    );

.. warning::

    Wide character strings may not work as described on Python 2.7 or Python
    3.3 compiled with ``--enable-unicode=ucs2``.
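
To make the UTF-16 versus UTF-32 distinction concrete, here is a small
pure-Python sketch. It does not call pybind11; it only encodes a ``str`` in the
native byte order (chosen via ``sys.byteorder``) and counts code units, which
is roughly what a 16-bit or 32-bit wide string type would contain. The variable
names are invented for this example.

.. code-block:: python

    import sys

    text = "a🎂"  # one ASCII character plus one character outside the BMP

    # Pick the codec variants that match the platform's native endianness.
    suffix = "le" if sys.byteorder == "little" else "be"
    utf16 = text.encode("utf-16-" + suffix)
    utf32 = text.encode("utf-32-" + suffix)

    # UTF-16 needs a surrogate pair for the cake; UTF-32 needs one unit per character.
    utf16_units = len(utf16) // 2
    utf32_units = len(utf32) // 4

    print(utf16_units, utf32_units)

A string type with 16-bit elements therefore holds three code units for this
two-character string, while one with 32-bit elements holds two.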

Strings in multibyte encodings such as Shift-JIS must be transcoded to
UTF-8/16/32 before being returned to Python.

Character literals
==================

C++ functions that accept character literals as input will receive the first
character of a Python ``str`` as their input. If the string is longer than one
Unicode character, trailing characters will be ignored.

When a character literal is returned from C++ (such as a ``char`` or a
``wchar_t``), it will be converted to a ``str`` that represents the single
character.

.. code-block:: c++

    m.def("pass_char", [](char c) { return c; });
    m.def("pass_wchar", [](wchar_t w) { return w; });

.. code-block:: python

    >>> example.pass_char('A')
    'A'

While C++ will cast integers to character types (``char c = 0x65;``), pybind11
does not convert Python integers to characters implicitly. The Python function
``chr()`` can be used to convert integers to characters.

.. code-block:: python

    >>> example.pass_char(0x65)
    TypeError

    >>> example.pass_char(chr(0x65))
    'e'

If the desire is to work with an 8-bit integer, use ``int8_t`` or ``uint8_t``
as the argument type.

Grapheme clusters
-----------------

A single grapheme may be represented by two or more Unicode characters. For
example 'é' is usually represented as U+00E9 but can also be expressed as the
combining character sequence U+0065 U+0301 (that is, the letter 'e' followed by
a combining acute accent). The combining character will be lost if the
two-character sequence is passed as an argument, even though it renders as a
single grapheme.

.. code-block:: python

    >>> example.pass_wchar('é')
    'é'

    >>> combining_e_acute = 'e' + '\u0301'

    >>> combining_e_acute
    'é'

    >>> combining_e_acute == 'é'
    False

    >>> example.pass_wchar(combining_e_acute)
    'e'

Normalizing combining characters before passing the character literal to C++
may resolve *some* of these issues:

.. code-block:: python

    >>> example.pass_wchar(unicodedata.normalize('NFC', combining_e_acute))
    'é'

In some languages (Thai for example), there are `graphemes that cannot be
expressed as a single Unicode code point
<http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>`_, so there is
no way to capture them in a C++ character type.
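
The limits of normalization can be checked with the standard ``unicodedata``
module alone. The following sketch does not involve pybind11; the Thai
sequence (U+0E01 followed by the vowel sign U+0E34) is just one example chosen
here of a grapheme that has no single-code-point form.

.. code-block:: python

    import unicodedata

    # 'e' + combining acute accent: NFC composes this into the single code point U+00E9.
    combining_e_acute = "e\u0301"
    composed = unicodedata.normalize("NFC", combining_e_acute)

    # Thai consonant KO KAI + vowel sign SARA I: there is no precomposed form,
    # so NFC leaves the two code points as they are.
    thai_cluster = "\u0e01\u0e34"
    thai_nfc = unicodedata.normalize("NFC", thai_cluster)

    print(len(combining_e_acute), len(composed))
    print(len(thai_cluster), len(thai_nfc))

Only the first string shrinks to one code point, so only it would fit in a
single ``wchar_t`` after normalization.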

C++17 string views
==================

C++17 string views are automatically supported when compiling in C++17 mode.
They follow the same rules for encoding and decoding as the corresponding STL
string type (for example, a ``std::u16string_view`` argument will be passed
UTF-16-encoded data, and a returned ``std::string_view`` will be decoded as
UTF-8).

References
==========

* `The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) <https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/>`_
* `C++ - Using STL Strings at Win32 API Boundaries <https://msdn.microsoft.com/en-ca/magazine/mt238407.aspx>`_