C++ friends, is there a standard way to iterate over Unicode code points (not code units) in a string (or I guess a u8string)?
edit: yes, I know how to decode UTF-8 manually; my query is about the STL specifically
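For anyone landing here from search: as far as I know the standard library, through C++23, has nothing codepoint-level; you either pull in a library like ICU or utfcpp, or hand-roll the loop. A minimal decoding sketch, assuming already-valid UTF-8 input and a UTF-8 literal encoding; the function name is my own:

```cpp
#include <cstdint>
#include <iostream>
#include <string_view>

// Decode the codepoint starting at index i in (assumed valid) UTF-8 text,
// advancing i past it. No validation of continuation bytes or overlong forms.
char32_t next_codepoint(std::string_view s, std::size_t& i) {
    unsigned char b = static_cast<unsigned char>(s[i++]);
    if (b < 0x80) return b;                              // 0xxxxxxx: ASCII
    int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;   // continuation-byte count
    char32_t cp = b & (0x3F >> extra);                   // payload bits of the lead byte
    while (extra-- > 0)                                  // 10xxxxxx continuations
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

int main() {
    std::string_view text = "aé€";                       // 1-, 2-, and 3-byte sequences
    for (std::size_t i = 0; i < text.size();)
        std::cout << "U+" << std::hex << std::uppercase
                  << static_cast<std::uint32_t>(next_codepoint(text, i)) << '\n';
}
```

Prints U+61, U+E9, U+20AC. Real code would also reject malformed sequences rather than trusting the input.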
@luna@pony.so basically, iterating over bytes in UTF-8 or 16-bit words in UTF-16/UCS-2?
@mikebabcock Those are code units
@krans oh okay, my mistake, I'm sorry. As a Python programmer we just call those characters, because Python innately differentiates between characters and encodings. My C++ knowledge is 10 years out of date, alas, so I'm not helpful, but good luck!
@mikebabcock Quick guide to Unicode terminology:
- code units: the in-memory elements of the text encoding, e.g. bytes for UTF-8, 16-bit integers for UTF-16, 32-bit integers for UTF-32
- codepoints: the numbers in the range 0–0x10FFFF that are mapped to abstract characters
- graphemes: the smallest functional units of a script, formed from one or more codepoints
- grapheme clusters: the things people would usually describe as “a character” for the purposes of cursor motion, “the number of characters,” etc.
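To make those counts concrete, a small sketch (mine, assuming a UTF-8 literal encoding, the default for GCC and Clang), using U+0065 plus U+0301, i.e. “e” followed by a combining acute:

```cpp
#include <iostream>
#include <string>

int main() {
    // U+0065 'e' + U+0301 COMBINING ACUTE ACCENT ("é"): one grapheme cluster,
    // two codepoints, three UTF-8 code units, two UTF-16 code units.
    std::string    utf8  = "e\u0301";   // assumes UTF-8 literal encoding
    std::u16string utf16 = u"e\u0301";
    std::u32string utf32 = U"e\u0301";

    std::cout << "UTF-8 code units:  " << utf8.size()  << '\n'  // 3
              << "UTF-16 code units: " << utf16.size() << '\n'  // 2
              << "UTF-32 code units: " << utf32.size() << '\n'; // 2 == codepoint count
    // The grapheme cluster count (1 here) needs Unicode segmentation rules,
    // e.g. ICU's BreakIterator; the standard library has nothing for it.
}
```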
@krans @mikebabcock Are D800-DFFF "codepoints"? I don't think so, but I usually use the unambiguous term "Unicode scalar values" where they're clearly excluded.
@dalias Yes, if it's not mapped to a character it's not a codepoint. Sorry, my wording was ambiguous.
@krans @mikebabcock Are unassigned values "mapped to a character"? Or things like FFFF?
Sorry, not picking on you, just pointing out that the definitions here are subtle & sometimes painful. Not gratuitously, but intrinsically.
@dalias I thought surrogates were USVs but not codepoints? @mikebabcock
@krans @mikebabcock Nope, a UTF is defined as a bijection between the Unicode Scalar Values and some subset of the possible sequences of code units. Thus UTFs can't/don't represent numbers in the surrogate range but do represent & round-trip noncharacter things like 0xFFFF.
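That definition is easy to demonstrate with a sketch of a conforming encoder (illustrative only): every scalar value encodes, including noncharacters like U+FFFF, while surrogate values must be rejected:

```cpp
#include <cstdio>
#include <string>

// Sketch of a conforming UTF-8 encoder: it accepts exactly the Unicode
// scalar values, i.e. 0..0x10FFFF minus the surrogate range D800..DFFF.
bool encode_utf8(char32_t cp, std::string& out) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return false;                                   // not a scalar value
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}

int main() {
    std::string s;
    std::printf("U+FFFF: %s\n", encode_utf8(0xFFFF, s) ? "encoded EF BF BF" : "rejected");
    std::printf("U+D800: %s\n", encode_utf8(0xD800, s) ? "encoded" : "rejected");
}
```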
@mikebabcock @krans What does that mean precisely though? (IOW what do you mean by "encoding system of UCS"?)
UTF-8 was originally conceived without a lot of rigor as an encoding of 31-bit numbers with non-unique encodings, but that was quickly realized to be a mistake and fixed. The other UTFs, and the unified definition of a UTF (which also includes GB18030!), were developed more rigorously, and involve the concept of USVs.
@dalias @krans so you have a system that assigns numbers over a huge range. You can store those numbers as very large words or dwords, or you can encode them into smaller, serially decoded parcels.
UTF-8 does the latter.
Each UTF-8 byte is either a standalone ASCII value (high bit clear), the lead byte of a multi-byte sequence (top bits 11), or a continuation byte (top bits 10). This means the first 128 characters of ASCII and UTF-8 match, by the way.
Small numbers? Fewer bytes to encode. Large numbers? More bytes. UTF-8 = variable width. UCS = fixed width.
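That byte structure is easy to see by dumping bytes and classifying them by their top bits; a quick sketch (mine, assuming the compiler's literal encoding is UTF-8, as it is by default with GCC and Clang):

```cpp
#include <cstdio>
#include <string_view>

int main() {
    // Classify each UTF-8 byte by its top bits:
    //   0xxxxxxx  ASCII, a complete character on its own
    //   11xxxxxx  lead byte opening a multi-byte sequence
    //   10xxxxxx  continuation byte
    std::string_view text = "A¢€";   // 1-, 2-, and 3-byte characters
    for (unsigned char b : text) {
        const char* kind = b < 0x80 ? "ASCII" : b >= 0xC0 ? "lead" : "continuation";
        std::printf("%02X %s\n", b, kind);
    }
}
```

Prints 41 ASCII, then C2 A2 (lead + continuation), then E2 82 AC (lead + two continuations).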
@mikebabcock The idea that Roman languages “benefit greatly in byte count from using UTF-8” has been debunked:
Re: #Unicode & space in programming & l10n
https://www.unicode.org/mail-arch/unicode-ml/y2006-m09/0063.html
@mikebabcock Yes. However, the overall point is valid: the fact that a Roman character fits into 1 byte while a Japanese character requires 3 doesn't let you conclude that Unicode advantages Roman languages. @dalias
@mikebabcock Your claim was that English gets an advantage, so wouldn't the appropriate comparison be with the same article written in English?
@mikebabcock What I've found is that, for translations of (substantially) the same text, Korean requires fewer bytes than English in both UTF-8 and UTF-16.
@krans (apologies for the delayed reply) -- no, my one claim was specifically that UTF-8 requires *no additional storage* over plain ASCII for storing languages that ASCII covers, and this was a design consideration in its creation.
This is simply fact.
My other claim was that UTF-8 requires more bytes for certain languages than 16-bit encodings do: those characters each fit into a single 16-bit word, whereas they take 3 bytes instead of 2 in UTF-8, i.e. 50% more storage.
Also proven; you can test it yourself with code, e.g. the sketch below.
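One quick version of that test, as a sketch (assumes a UTF-8 literal encoding; the five kana all sit in the 3-byte UTF-8 range U+0800–U+FFFF):

```cpp
#include <iostream>
#include <string>

int main() {
    // "こんにちは": five codepoints in U+0800..U+FFFF, so 3 bytes each in
    // UTF-8 versus a single 16-bit unit (2 bytes) each in UTF-16.
    std::string    utf8  = "こんにちは";
    std::u16string utf16 = u"こんにちは";

    std::cout << "UTF-8 bytes:  " << utf8.size() << '\n'                       // 15
              << "UTF-16 bytes: " << utf16.size() * sizeof(char16_t) << '\n';  // 10
}
```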
@mikebabcock @krans Oh please, this has been debunked so many times. Ultimately because compression makes it irrelevant in most contexts where size matters, but also, ideographic languages have a much higher *base* information density: 3 UTF-8 bytes of kanji typically contain as much information as 3–8 bytes of Latin script.