C++ friends, is there a standard way to iterate over Unicode code points (not code units) in a string (or I guess a u8string)?
edit: yes, I know how to decode UTF-8 manually; my query is about the STL specifically
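For anyone landing here from search: as far as I know the standard library, through C++23, has nothing codepoint-level; you either pull in a library like ICU or utfcpp, or hand-roll the loop. A minimal decoding sketch, assuming already-valid UTF-8 input and a UTF-8 literal encoding; the function name is my own:

```cpp
#include <cstdint>
#include <iostream>
#include <string_view>

// Decode the codepoint starting at index i in (assumed valid) UTF-8 text,
// advancing i past it. No validation of continuation bytes or overlong forms.
char32_t next_codepoint(std::string_view s, std::size_t& i) {
    unsigned char b = static_cast<unsigned char>(s[i++]);
    if (b < 0x80) return b;                              // 0xxxxxxx: ASCII
    int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;   // continuation-byte count
    char32_t cp = b & (0x3F >> extra);                   // payload bits of the lead byte
    while (extra-- > 0)                                  // 10xxxxxx continuations
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

int main() {
    std::string_view text = "aé€";                       // 1-, 2-, and 3-byte sequences
    for (std::size_t i = 0; i < text.size();)
        std::cout << "U+" << std::hex << std::uppercase
                  << static_cast<std::uint32_t>(next_codepoint(text, i)) << '\n';
}
```

Prints U+61, U+E9, U+20AC. Real code would also reject malformed sequences rather than trusting the input.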
@luna@pony.so basically, iterating over bytes in UTF-8 or 16-bit words in UTF-16/UCS-2?
@mikebabcock Those are code units
@krans oh okay, my mistake, I'm sorry. As a Python programmer we just call those characters, because Python innately differentiates between characters and encodings. My C++ knowledge is 10 years out of date, alas, so I'm not helpful, but good luck!
@mikebabcock Quick guide to Unicode terminology:
- code units: the in-memory elements of the text encoding, e.g. bytes for UTF-8, 16-bit integers for UTF-16, 32-bit integers for UTF-32
- codepoints: the numbers in the range 0–0x10FFFF that are mapped to abstract characters
- graphemes: the smallest functional units of a script, formed from one or more codepoints
- grapheme clusters: the things people would usually describe as “a character” for the purposes of cursor motion, “the number of characters,” etc.
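To make those counts concrete, a small sketch (mine, assuming a UTF-8 literal encoding, the default for GCC and Clang), using U+0065 plus U+0301, i.e. “e” followed by a combining acute:

```cpp
#include <iostream>
#include <string>

int main() {
    // U+0065 'e' + U+0301 COMBINING ACUTE ACCENT ("é"): one grapheme cluster,
    // two codepoints, three UTF-8 code units, two UTF-16 code units.
    std::string    utf8  = "e\u0301";   // assumes UTF-8 literal encoding
    std::u16string utf16 = u"e\u0301";
    std::u32string utf32 = U"e\u0301";

    std::cout << "UTF-8 code units:  " << utf8.size()  << '\n'  // 3
              << "UTF-16 code units: " << utf16.size() << '\n'  // 2
              << "UTF-32 code units: " << utf32.size() << '\n'; // 2 == codepoint count
    // The grapheme cluster count (1 here) needs Unicode segmentation rules,
    // e.g. ICU's BreakIterator; the standard library has nothing for it.
}
```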
@krans @mikebabcock Are D800-DFFF "codepoints"? I don't think so, but I usually use the unambiguous term "Unicode scalar values" where they're clearly excluded.
@dalias Yes, if it's not mapped to a character it's not a codepoint. Sorry, my wording was ambiguous.
@krans @mikebabcock Are unassigned values "mapped to a character"? Or things like FFFF?
Sorry, not picking on you, just pointing out that the definitions here are subtle & sometimes painful. Not gratuitously, but intrinsically.
@dalias I thought surrogates were USVs but not codepoints? @mikebabcock
@krans @mikebabcock Nope, a UTF is defined as a bijection between the Unicode Scalar Values and some subset of the possible sequences of code units. Thus UTFs can't/don't represent numbers in the surrogate range but do represent & round-trip noncharacter things like 0xFFFF.
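That definition is easy to demonstrate with a sketch of a conforming encoder (illustrative only): every scalar value encodes, including noncharacters like U+FFFF, while surrogate values must be rejected:

```cpp
#include <cstdio>
#include <string>

// Sketch of a conforming UTF-8 encoder: it accepts exactly the Unicode
// scalar values, i.e. 0..0x10FFFF minus the surrogate range D800..DFFF.
bool encode_utf8(char32_t cp, std::string& out) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;                                   // not a scalar value
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}

int main() {
    std::string s;
    std::printf("U+FFFF: %s\n", encode_utf8(0xFFFF, s) ? "encoded EF BF BF" : "rejected");
    std::printf("U+D800: %s\n", encode_utf8(0xD800, s) ? "encoded" : "rejected");
}
```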
@mikebabcock @krans What does that mean precisely though? (IOW what do you mean by "encoding system of UCS"?)
UTF-8 was originally conceived without a lot of rigor as an encoding of 31-bit numbers with non-unique encodings, but that was quickly realized to be a mistake and fixed. The other UTFs, and the unified definition of a UTF (which also includes GB18030!), were developed more rigorously, and involve the concept of USVs.
@dalias @krans so you have a system that assigns numbers over a huge range. You can store those numbers as very large words or dwords, or you can encode them into smaller, serially decoded parcels.
UTF-8 does the latter.
Each UTF-8 byte is either a standalone ASCII value (high bit clear), the lead byte of a multi-byte sequence (top bits 11), or a continuation byte (top bits 10). This means the first 128 characters of ASCII and UTF-8 match, by the way.
Small numbers? Fewer bytes to encode. Large numbers? More bytes. UTF-8 = variable width. UCS = fixed width.
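That byte structure is easy to see by dumping bytes and classifying them by their top bits; a quick sketch (mine, assuming the compiler's literal encoding is UTF-8, as it is by default with GCC and Clang):

```cpp
#include <cstdio>
#include <string_view>

int main() {
    // Classify each UTF-8 byte by its top bits:
    //   0xxxxxxx  ASCII, a complete character on its own
    //   11xxxxxx  lead byte opening a multi-byte sequence
    //   10xxxxxx  continuation byte
    std::string_view text = "A¢€";   // 1-, 2-, and 3-byte characters
    for (unsigned char b : text) {
        const char* kind = b < 0x80 ? "ASCII" : b >= 0xC0 ? "lead" : "continuation";
        std::printf("%02X %s\n", b, kind);
    }
}
```

Prints 41 ASCII, then C2 A2 (lead + continuation), then E2 82 AC (lead + two continuations).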
@mikebabcock The idea that Roman languages “benefit greatly in byte count from using UTF-8” has been debunked:
Re: #Unicode & space in programming & l10n
https://www.unicode.org/mail-arch/unicode-ml/y2006-m09/0063.html
@mikebabcock Yes. However, the overall point is valid: the fact that a Roman character fits into 1 byte while a Japanese character requires 3 doesn't let you conclude that Unicode advantages Roman languages. @dalias
@mikebabcock Your claim was that English gets an advantage, so wouldn't the appropriate comparison be with the same article written in English?
@mikebabcock What I've found is that, for translations of (substantially) the same text, Korean requires fewer bytes than English in both UTF-8 and UTF-16.
@krans (apologies for the delayed reply) -- no, my one claim was specifically that UTF-8 requires *no additional storage* over plain ASCII for storing languages that ASCII covers, and this was a design consideration in its creation.
This is simply fact.
My other claim was that UTF-8 requires more bytes for certain languages than 16-bit encodings do: those characters each fit into a single 16-bit word, whereas they take 3 bytes instead of 2 in UTF-8, i.e. 50% more storage.
Also proven; you can test it yourself with code, e.g. the sketch below.
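One quick version of that test, as a sketch (assumes a UTF-8 literal encoding; the five kana all sit in the 3-byte UTF-8 range U+0800–U+FFFF):

```cpp
#include <iostream>
#include <string>

int main() {
    // "こんにちは": five codepoints in U+0800..U+FFFF, so 3 bytes each in
    // UTF-8 versus a single 16-bit unit (2 bytes) each in UTF-16.
    std::string    utf8  = "こんにちは";
    std::u16string utf16 = u"こんにちは";

    std::cout << "UTF-8 bytes:  " << utf8.size() << '\n'                       // 15
              << "UTF-16 bytes: " << utf16.size() * sizeof(char16_t) << '\n';  // 10
}
```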
@mikebabcock @krans Oh please, this has been debunked so many times. Ultimately because compression makes it irrelevant in most contexts where size matters, but also, ideographic languages have a much higher *base* information density: 3 UTF-8 bytes of kanji typically contain as much information as 3–8 bytes of Latin script.