Module — Unicode Text¶

Unicode support is deterministic and does not depend on libc locale facilities.

Version pin¶

Unicode 15.1.0 (ZR_UNICODE_VERSION_MAJOR=15, MINOR=1, PATCH=0).

This pin ensures identical grapheme segmentation and width calculation across all platforms and toolchains.

Components¶

UTF-8 decoding (`src/unicode/zr_utf8.*`)¶

RFC-conformant UTF-8 decoder
Pinned invalid-sequence policy: malformed bytes produce U+FFFD
No locale dependencies

Grapheme iteration (`src/unicode/zr_grapheme.*`)¶

UAX #29 grapheme cluster boundaries (minimal subset)
Iterator interface for walking grapheme boundaries in UTF-8 text
Deterministic: same input bytes produce same grapheme breaks

Width policy (`src/unicode/zr_width.*`)¶

Terminal column width for codepoints and graphemes
Output is always 0, 1, or 2 columns
Emoji width is policy-dependent:
ZR_WIDTH_EMOJI_NARROW — emoji renders as 1 column
ZR_WIDTH_EMOJI_WIDE — emoji renders as 2 columns (default)

Wrapping (`src/unicode/zr_wrap.*`)¶

Line wrapping at grapheme boundaries
Width-aware: respects wide characters

Determinism guarantees¶

Property	Guarantee
Grapheme breaks	Identical for same input bytes
Width calculation	Identical for same codepoint + policy
Invalid sequences	Always produce `U+FFFD`
No locale	No `setlocale()`, no `LC_*` environment

Usage pattern¶

// Iterate graphemes in UTF-8 text
zr_grapheme_iter_t iter;
zr_grapheme_iter_init(&iter, utf8_bytes, utf8_len);

zr_grapheme_t g;
while (zr_grapheme_next(&iter, &g)) {
    // g.offset, g.size identify grapheme bytes within the buffer
    const uint8_t* ptr = iter.bytes + g.offset;
    uint8_t w = zr_width_grapheme_utf8(ptr, g.size, policy);
    // w is 0, 1, or 2
}

See:

src/unicode/zr_utf8.h
src/unicode/zr_grapheme.h
src/unicode/zr_width.h
src/unicode/zr_wrap.h

Module — Unicode Text¶

Version pin¶

Components¶

UTF-8 decoding (src/unicode/zr_utf8.*)¶

Grapheme iteration (src/unicode/zr_grapheme.*)¶

Width policy (src/unicode/zr_width.*)¶

Wrapping (src/unicode/zr_wrap.*)¶