After a long period of quiet, I have released an update to the `unicode-age` #Python package
https://pypi.org/project/unicode-age/
The package now supports #Unicode 16.0
When I wrote `unicode-age` I just sorta felt like writing it in Cython as a fun exercise, but upon reflection (and naturally, immediately after updating it), I'm wondering if it can be converted into a pure Python module.
The main waste in parsing the ages into a per-codepoint `list[int | None]` is that it ignores the span-oriented nature of DerivedAge.txt
A quick sketch suggests that representing the span information in memory as a `list[tuple[int, int, int, int]]` comes to ~300 KiB. That's ~10x the Cython approach (mostly because CPython's integers are >=24 bytes each), but still pretty small.
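Roughly what I have in mind, as a sketch (not the package's actual parser; it just assumes the usual `start..end ; major.minor` layout of DerivedAge.txt lines):

```python
# Sketch only: parse DerivedAge.txt into sorted (start, end, major, minor) spans.
def parse_derived_age(path: str) -> list[tuple[int, int, int, int]]:
    spans = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()   # drop comments and blank lines
            if not line:
                continue
            cp_range, version = (field.strip() for field in line.split(";"))
            if ".." in cp_range:
                start, end = (int(cp, 16) for cp in cp_range.split(".."))
            else:
                start = end = int(cp_range, 16)
            major, minor = (int(part) for part in version.split("."))
            spans.append((start, end, major, minor))
    spans.sort()
    return spans
```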
We'll see.
I'll file an issue about it for the next update and forget about it until Fall (or whenever Unicode 16.1 arrives, if there is one)
I may actually write up the pure Python implementation (which will be my first serious use of `struct.Struct`!) and open a PR
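For the curious, the shape of what I'm imagining (a sketch with a made-up field layout, not necessarily what the PR ends up doing):

```python
import struct

# Hypothetical layout: start and end codepoints as uint32, version as two
# uint8s, little-endian -> 10 bytes per span instead of four Python objects.
SPAN = struct.Struct("<IIBB")

def pack_spans(spans: list[tuple[int, int, int, int]]) -> bytes:
    return b"".join(SPAN.pack(*span) for span in spans)

def span_at(data: bytes, index: int) -> tuple[int, int, int, int]:
    return SPAN.unpack_from(data, index * SPAN.size)
```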
I did the thing, NOW I can forget about it until the next Unicode release:
I'm impressed that I was able to keep the overhead still pretty low here.
The whole thing is an exercise in silliness, really. I don't think people who aren't-me care very much about looking up the age of codepoints, and if they do, they probably wouldn't care much about the size of the backing data.
But the 10 MB approach was just so shockingly wasteful that I swung the other way and went for something close-to-optimal.
I believe the primary "waste" of the program now is in the linear scan and the integer objects representing the start/stop of each span that get churned up by the search.
O(2000) objects, 24 bytes and change apiece, so ballpark 47 KiB of memory overhead in the maximally-pessimistic view, which is excessively pessimistic (only two of them are 'live' at a time, and the runtime is smart about integers). I can live with that.
Total memory burden at rest is ~128 KiB.
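The lookup itself is something like this, modulo details (a sketch reusing the hypothetical 10-byte span layout from earlier, not the exact code):

```python
import struct

SPAN = struct.Struct("<IIBB")   # hypothetical (start, end, major, minor) layout

def age(codepoint: int, data: bytes) -> tuple[int, int] | None:
    # Linear scan of the packed, sorted spans; the start/end/major/minor ints
    # created at each step are the churn mentioned above.
    for start, end, major, minor in SPAN.iter_unpack(data):
        if codepoint < start:      # spans are sorted, so we can bail early
            return None
        if codepoint <= end:
            return (major, minor)
    return None
```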
not me thinking "what if I did a binary search on this" when a linear scan is perfectly acceptable
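(if I did give in, it would probably just be `bisect` over the span starts; a sketch of the idea, not real code:)

```python
import bisect

def age_bisect(codepoint: int,
               starts: list[int],                       # span start codepoints, sorted
               spans: list[tuple[int, int, int, int]],  # parallel (start, end, major, minor)
               ) -> tuple[int, int] | None:
    # index of the last span whose start is <= codepoint
    i = bisect.bisect_right(starts, codepoint) - 1
    if i < 0:
        return None
    start, end, major, minor = spans[i]
    return (major, minor) if codepoint <= end else None
```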
working on silly little personal projects has its perks, even if the red here is mostly generated code
@SnoopJ `struct.Struct`! is the name of my new jrpg
@SnoopJ could you write it in the pure python cython subset?
@graingert maybe, but then I'd still have Cython in my life which isn't really worth it for this project
@SnoopJ what about mypyc?
@graingert might be slightly less of a bother but I'm trying to move *away* from an extension module here, not tweak how I get it
@SnoopJ do you know about interpolation search?
@demofox I don't, but based on the name I'm guessing you select your next offset based on monotonicity and a linear fit?
@demofox the problem here is that the offset is to an entire span of codepoints, and the spans vary in size (as small as exactly one codepoint, but possibly spanning thousands), so it's not intuitive to me how I'd do much better than binary search if I went all tryhard on it
@demofox this made me curious what the distribution of span sizes is like, though. It's not too shocking
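(roughly how one could eyeball that distribution; a sketch, not necessarily what I ran:)

```python
from collections import Counter

def span_size_histogram(spans: list[tuple[int, int, int, int]]) -> Counter[int]:
    # bucket span lengths by bit_length(), i.e. rough powers of two
    return Counter((end - start + 1).bit_length()
                   for start, end, _major, _minor in spans)
```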
@SnoopJ i found good results interleaving interpolation search with binary search: https://blog.demofox.org/2019/03/22/linear-fit-search/
@demofox the thing I worry about in this problem is that my needle isn't really the interpolant; the offset of the span I want is. And +1 to that offset might mean "a few codepoints over" or "10,000 codepoints over", so it's really unevenly distributed.
Maybe there's some clever thing to be done even in this case, but I can sleep at night with a 2.5 msec linear search :)
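For completeness, the shape an interpolation-style probe over the span starts could take (just a sketch of the idea from the thread, not something the package does):

```python
def find_span(codepoint: int, starts: list[int]) -> int:
    """Index of the last span whose start codepoint is <= codepoint.

    Assumes `starts` is sorted and starts[0] <= codepoint.
    """
    lo, hi = 0, len(starts) - 1
    while lo < hi:
        if starts[hi] <= codepoint:   # everything up to hi qualifies
            return hi
        # probe from a linear fit of the start codepoints; the unevenly sized
        # spans are exactly why this guess can land far from the target
        probe = lo + (codepoint - starts[lo]) * (hi - lo) // (starts[hi] - starts[lo])
        probe = min(max(probe, lo + 1), hi)
        if starts[probe] <= codepoint:
            lo = probe
        else:
            hi = probe - 1
    return lo
```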