
C++ friends, is there a standard way to iterate over unicode code points (not code units) in a string (or i guess a u8string)?

edit: yes i know how to decode utf8 manually, my query is about the stl specifically

@luna@pony.so basically, iterating bytes in UTF-8 or words in UTF/UCS-16?

@krans oh okay, my reversal I'm sorry. As a Python programmer we just call those characters because Python innately differentiates between characters and encodings. My C++ knowledge is 10 years out of date alas so I'm not helpful but good luck!
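The Python behaviour mentioned here — strings natively distinguishing characters (codepoints) from encodings (code units) — can be sketched like this (a minimal illustration; the sample string is arbitrary):

```python
# A Python str iterates by codepoint; its encoded bytes iterate by
# code unit (individual UTF-8 bytes).
s = "naïve"

codepoints = [ord(ch) for ch in s]    # one entry per codepoint
code_units = list(s.encode("utf-8"))  # one entry per UTF-8 byte

print(len(codepoints))  # 5 codepoints
print(len(code_units))  # 6 bytes: "ï" takes two bytes in UTF-8
```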

@mikebabcock Quick guide to Unicode terminology:

- code units: the in-memory elements of the text encoding, e.g. bytes for UTF-8, 16-bit integers for UTF-16, 32-bit integers for UTF-32
- codepoints: the numbers in the range 0–0x10FFFF that are mapped to abstract characters
- graphemes: the smallest functional units of a script, formed from one or more codepoints
- grapheme clusters: the things people would usually describe as "a character" for the purposes of cursor motion, "the number of characters," etc.
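All four levels show up in one small example (a sketch using only the standard library; note the stdlib has no grapheme-cluster segmentation, so the cluster count here is by inspection):

```python
import unicodedata

# "é" written as base letter + combining acute accent: one grapheme
# cluster, two codepoints, three UTF-8 code units (bytes).
s = "e\u0301"

print(len(s))                  # 2 codepoints
print(len(s.encode("utf-8")))  # 3 UTF-8 code units

# NFC normalization composes the pair into the single codepoint U+00E9.
composed = unicodedata.normalize("NFC", s)
print(len(composed))           # 1 codepoint, same grapheme cluster
```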

@krans @mikebabcock Are D800-DFFF "codepoints"? I don't think so, but I usually use the unambiguous term "Unicode scalar values" where they're clearly excluded.

@dalias Yes, if it's not mapped to a character it's not a codepoint. Sorry, my wording was ambiguous.

@mikebabcock

@krans @mikebabcock Are unassigned values "mapped to a character"? Or things like FFFF? 🤪

Sorry, not picking on you, just pointing out that the definitions here are subtle & sometimes painful. Not gratuitously, but intrinsically.

@dalias I thought surrogates were USVs but not codepoints? @mikebabcock


@krans @mikebabcock Nope, a UTF is defined as a bijection between the Unicode Scalar Values and some subset of the possible sequences of code units. Thus UTFs can't/don't represent numbers in the surrogate range but do represent & round-trip noncharacter things like 0xFFFF.
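This distinction is easy to check in Python, whose strict codecs encode exactly the Unicode scalar values (a minimal sketch):

```python
# Noncharacters like U+FFFF are scalar values and round-trip through a
# UTF; the surrogate range D800-DFFF is excluded and cannot be encoded.
ok = "\uffff".encode("utf-8")
print(ok)  # b'\xef\xbf\xbf'

try:
    "\ud800".encode("utf-8")  # a lone surrogate is not a scalar value
except UnicodeEncodeError as e:
    print("surrogate rejected:", e.reason)
```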

@dalias @krans I prefer to think of UTF as an encoding system of UCS, as that's how it was designed even though sometimes it has other side-effects.

@mikebabcock @krans What does that mean precisely though? (IOW what do you mean by "encoding system of UCS"?)

UTF-8 was originally conceived without a lot of rigor as an encoding of 31-bit numbers with non-unique encodings, but that was quickly realized to be a mistake and fixed. The other UTFs, and the unified definition of a UTF (which also includes GB18030!), were developed more rigorously, and involve the concept of USVs.

@dalias @krans so you have a system that uses arbitrarily large numbers. You can store those numbers as very large words or dwords or you can encode them into smaller serially-decoded parcels.
UTF does that.
Each UTF-8 byte is either the start of a value or a continuation byte (distinguished by its high bits). This means the first 128 characters of ASCII and UTF-8 match, by the way.
Small numbers? fewer bytes to encode. Large numbers? more bytes. UTF=variable. UCS=fixed.
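The variable-length scheme described above is easy to verify (a minimal sketch; the sample characters are arbitrary picks from each size class):

```python
# UTF-8 byte counts grow with the codepoint's magnitude.
samples = {
    "A": 1,   # U+0041, ASCII range: 1 byte
    "é": 2,   # U+00E9: 2 bytes
    "€": 3,   # U+20AC: 3 bytes
    "😀": 4,  # U+1F600: 4 bytes
}
for ch, expected in samples.items():
    assert len(ch.encode("utf-8")) == expected

# The high bits mark a byte's role, as described above.
b = "é".encode("utf-8")
assert b[0] >> 5 == 0b110  # lead byte of a 2-byte sequence
assert b[1] >> 6 == 0b10   # continuation byte
print("all byte counts check out")
```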

@dalias @krans the result of which has been that languages that primarily use ASCII benefit greatly in byte count from using UTF-8 as an encoding, whereas languages like Japanese (iirc) end up using 3 bytes per character in UTF-8 but only two in 16-bit encodings.
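The 3-vs-2 byte claim checks out for BMP characters such as kanji (a quick sketch; "日" U+65E5 is just one example):

```python
# A BMP CJK character: 3 bytes in UTF-8, 2 in UTF-16.
ch = "日"  # U+65E5
print(len(ch.encode("utf-8")))      # 3
print(len(ch.encode("utf-16-le")))  # 2 (the LE variant omits the BOM)
```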

@krans @dalias to quote your own link "Now, this is just one sample -- the language style for the text is more formal than is typical, and thus these figures may be different than for more customary text. (So don't draw too many conclusions from this!)"

@mikebabcock Yes. However, the overall point is valid: just because a Roman character fits into 1 byte and a Japanese character requires 3, doesn't allow you to conclude that Unicode advantages Roman languages. @dalias

@krans @dalias ... go to bbc and grab random article in Chinese. UTF-8 encoded: 11K. UTF-16 encoded: 7.5K.

# read the UTF-8 source, then rewrite it as UTF-16
with open("test.cn", "r", encoding="utf-8") as f:
    ctext = f.read()

with open("outfile16.txt", "w", encoding="utf-16") as f:
    f.write(ctext)

I'm not sure why you're trying to debunk something that's easy to prove?

@mikebabcock Your claim was that English gets an advantage, so wouldn't the appropriate comparison be with the same article written in English?

@mikebabcock What I've found is that, for translations of (substantially) the same text, Korean requires fewer bytes than English in both UTF-8 and UTF-16.

@krans <delay apology> -- no, my one claim was specifically that UTF-8 requires *no additional storage* over using simple ASCII for storing languages that are covered by ASCII and this was a design consideration in its creation.
This is simply fact.
My other claim was that UTF-8 requires more bytes for certain languages than 16 bit encodings do, because those characters fit into single 16-bit words, where they'd require 50% more storage in UTF-8.
Also proven, you can test yourself with code.

@mikebabcock @krans Oh please this has been debunked so many times. Ultimately because compression makes it irrelevant in most contexts where size matters, but also, ideographic languages have a much higher *base* information density. 3 UTF-8 bytes of kanji typically contain as much information as 3-8 bytes of Latin script.