okay I've figured out there's a shared format they're using here. it splits the file into chunks, each of which has a 16-bit ID (unique per file, but not globally), an offset, and a 16-bit length
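A minimal sketch of reading such a chunk directory in Python. The thread only says each chunk has a 16-bit ID, an offset, and a 16-bit length; everything else here is an assumption made for illustration: the 16-bit entry count at the start of the file, the 32-bit offset, the field order, and the little-endian byte order. `ChunkEntry`, `read_directory`, and `read_chunk` are hypothetical names.

```python
import struct
from typing import NamedTuple


class ChunkEntry(NamedTuple):
    chunk_id: int  # 16-bit ID, unique within one file
    offset: int    # where the chunk data starts (width is an ASSUMPTION)
    length: int    # 16-bit length of the chunk data


# ASSUMED entry layout: u16 id, u32 offset, u16 length, little-endian.
ENTRY = struct.Struct("<HIH")


def read_directory(data: bytes) -> list[ChunkEntry]:
    # ASSUMPTION: the file begins with a 16-bit count of directory entries.
    (count,) = struct.unpack_from("<H", data, 0)
    return [
        ChunkEntry(*ENTRY.unpack_from(data, 2 + i * ENTRY.size))
        for i in range(count)
    ]


def read_chunk(data: bytes, entry: ChunkEntry) -> bytes:
    # Slice the raw (possibly compressed) chunk payload out of the file.
    return data[entry.offset:entry.offset + entry.length]
```

If the real header differs, only `ENTRY` and the count handling should need to change.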

so like, midisnd.dat will have 12 entries, and the first 11 are 200-500 bytes each, and then the last is 3k.
presumably it's each song and then some config info?

cities.dat is very interesting. There's 30 cities in total, but 491 entries in it!

So they must be doing something odd there, that doesn't divide equally. Maybe one city-chunk gives IDs of the others?

idea for a test: it's easy to spot which chunk in a city is the image, because it's the biggest. Here's a way to determine if it's looking up by IDs or offsets/indices: swap the IDs of two images

darn. turns out you can't just renumber the chunks, because they have to be in increasing order.

so maybe I just need to leave the chunk indexes as is, and instead of moving the entries around, I move where they're pointing?
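A rough sketch of that idea, assuming the directory has already been parsed into `(chunk_id, offset, length)` tuples. The tuple shape and the name `swap_targets` are made up for illustration. The IDs stay where they are, so the increasing-order requirement still holds; only the data each entry points at is exchanged.

```python
def swap_targets(entries, id_a, id_b):
    """Return a copy of the directory in which the entries for id_a and
    id_b keep their IDs and positions but exchange (offset, length)."""
    position = {chunk_id: i for i, (chunk_id, _, _) in enumerate(entries)}
    i, j = position[id_a], position[id_b]
    swapped = list(entries)
    # Same ID, the other entry's offset and length.
    swapped[i] = (entries[i][0], entries[j][1], entries[j][2])
    swapped[j] = (entries[j][0], entries[i][1], entries[i][2])
    return swapped


# Hypothetical directory: IDs 1 and 2 trade the data they point at.
print(swap_targets([(1, 100, 50), (2, 150, 60)], 1, 2))
```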

Bingo! I'm in Athens, but I'm seeing the image for Baghdad, and apparently with the Baghdad palette?

So one of these other chunks must be the palette for a city. Or it selects from a selection of palettes? Maybe they've just got a couple defined.

okay I figured out the cities.dat IDs:

They're all 1XXYY (in decimal):
XX is the city number (0-29), YY is the sub-chunk-id.

So like:
YY=0: City name
YY=2: City image.

They go between 00 and 22, and not all numbers need to be present.
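The 1XXYY scheme is just decimal digit-splitting, which can be sketched like this. The function names are hypothetical, and the range checks reflect only what is stated above (cities 0-29, sub-chunks 00-22).

```python
def split_chunk_id(chunk_id: int) -> tuple[int, int]:
    """Split a cities.dat ID of the form 1XXYY into (city, sub_chunk)."""
    prefix, rest = divmod(chunk_id, 10000)
    if prefix != 1:
        raise ValueError(f"{chunk_id} does not look like 1XXYY")
    city, sub_chunk = divmod(rest, 100)
    return city, sub_chunk


def make_chunk_id(city: int, sub_chunk: int) -> int:
    """Build the 1XXYY ID for a city number and sub-chunk ID."""
    if not (0 <= city <= 29 and 0 <= sub_chunk <= 22):
        raise ValueError("city must be 0-29 and sub_chunk 0-22")
    return 10000 + city * 100 + sub_chunk


print(split_chunk_id(12803))  # (28, 3)
```

The largest possible ID, 12922, still fits comfortably in the 16-bit ID field.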

hmm, reading a buffer and then summing all the values of the bytes in it.

suspicious behavior.

okay I think it has a very simple 1-byte CRC check on the chunks, which is optionally not run.
I can't make the math work but I'm reasonably sure that's what it is
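For reference, the simplest thing a "sum all the bytes" check could be is an 8-bit additive checksum. This is only a guess at the general shape: the math did not work out above, so the real routine presumably differs in some detail (what range is summed, where the expected value is stored, or how it is compared). `byte_sum` is a hypothetical name.

```python
def byte_sum(buf: bytes) -> int:
    """Sum every byte in the buffer and keep only the low 8 bits."""
    return sum(buf) & 0xFF


print(byte_sum(bytes([200, 100])))  # 44, because 300 wraps around at 256
```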

okay they're using a blit that's UI-aware, so it starts the coordinate system at (1,13). Fun!

looking into the blitting code I managed to steal the world map out of RAM

ugh. TODO for my eventual Good DOS Debugger:
Instant Video display.
I don't know exactly how DOSBox-X is doing it, but while single-stepping the debugger, the display never updates. I can dump the ram at A000:0000 and see what updated, but not on the screen in DOSBox

found a suspicious array, which goes:
[
(-1,0),
(-1,1),
(0,1),
(1,1),
(1,0),
(1,-1),
(0, -1),
(-1,-1),
(0,0)
]

POP QUIZ: why does the font renderer need this array? how are they being "lazy" with this array?
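One plausible reading, not confirmed anywhere in the thread: the eight neighbor offsets draw the text in an outline color, and the final (0,0) entry draws the text itself on top, so a single loop handles both outline and fill. The sketch below assumes that reading; the canvas, the glyph format, and `draw_outlined` are all made up for illustration.

```python
OFFSETS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0),
           (1, -1), (0, -1), (-1, -1), (0, 0)]


def draw_outlined(canvas, glyph, x, y, outline, fill):
    """Stamp the glyph once per offset. ASSUMPTION: the first eight
    passes use the outline color and the last (0,0) pass uses the fill."""
    for n, (dx, dy) in enumerate(OFFSETS):
        color = fill if n == len(OFFSETS) - 1 else outline
        for gy, row in enumerate(glyph):
            for gx, bit in enumerate(row):
                if bit:
                    canvas[y + dy + gy][x + dx + gx] = color


canvas = [[0] * 3 for _ in range(3)]
draw_outlined(canvas, [[1]], 1, 1, outline=1, fill=2)
print(canvas)  # [[1, 1, 1], [1, 2, 1], [1, 1, 1]]
```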

The Answer to the DRM questions for Where in the world is Carmen Sandiego? Enhanced (DOS, 1990) are, in no particular order:

23
Kent
dragon
calcium
1796
Warren
revenue
1792
Willard
1937
Crater
Tanzania
Hartford
Duluth
London
Gem
Silent
squeaker

if ((0x80 >> ((byte)local_4 & 7) &
(int)(char)*(byte *)((int)((int *)param_1 + 1) + (local_4 >> 3))) != 0) {

COULD YOU USE SOME MORE CASTS MAYBE?

oh it's because ghidra's near/far pointer support is shit.

I had param2 defined as a byte*32 and it was casting it to a byte* before using it

if I define it as byte* and let the calling convention implicitly define it as 32bit, it doesn't do the cast
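With the casts stripped away, that condition looks like an MSB-first bit test into a bitmap that starts one `int` past the pointer. A sketch in Python, assuming `int` is 16 bits on this target (so `(int *)param_1 + 1` skips 2 bytes); `bit_is_set` and `header_size` are hypothetical names.

```python
def bit_is_set(buf: bytes, index: int, header_size: int = 2) -> bool:
    """Test bit `index` of the bitmap that follows a small header.
    Bit 0 is the most significant bit of the first bitmap byte."""
    byte = buf[header_size + (index >> 3)]  # which byte holds the bit
    mask = 0x80 >> (index & 7)              # which bit inside that byte
    return (byte & mask) != 0


data = bytes([0, 0, 0b10000001, 0b01000000])
print([bit_is_set(data, i) for i in (0, 1, 7, 8, 9)])
```

The sign-extending `(char)` cast in the decompiled line does not change the result, since the mask never reaches above bit 7.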

well I found the decompression method.

as always, I hate it. decompression routines are probably my least favorite thing to reverse engineer

I think this compression is specifically designed for ASCII text, which is annoying because they've also got compressed images... which probably use a DIFFERENT COMPRESSION!

it looks like this chunk has length 256, which means 253 usable bytes, and it expands to 374 bytes.

Not the greatest compression. a little better than just doing 6-bit ASCII.
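The arithmetic behind that comparison, using the two numbers quoted above (253 usable bytes in, 374 bytes out):

```python
compressed_bytes = 253   # usable bytes in the chunk
expanded_chars = 374     # bytes after decompression

# Average cost per output character in the compressed stream.
bits_per_char = compressed_bytes * 8 / expanded_chars

# What a flat 6-bit-per-character packing of the same text would take.
six_bit_bytes = expanded_chars * 6 / 8

print(f"{bits_per_char:.2f} bits/char")       # 5.41 bits/char
print(f"{six_bit_bytes} bytes at 6 bits")     # 280.5 bytes at 6 bits
```

So this chunk comes out roughly 10% smaller than plain 6-bit packing would.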

it's some kind of shifting bit mask but it starts at encoding values in 4 bits, then it can increase (or decrease, I guess) based on the input stream.

then it has an output filter, where if the number specified wasn't 8 bits, it's actually an index into a predefined text table

the predefined table starts with NUL, space, then:
aetonisrdlhugfcwypbmk,vSA.T'PMxBCIRGDWHqE-zNFKL0j:51YJ8\U?73Q;2!469
\r\nOVXZ()*+"#$%&<=>/@[]^_`

given that the most common symbols are near the beginning, this is presumably a sort of lazy huffman coding
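A sketch of just the output-filter step with the table written as a Python string. It assumes the `\r\n` in the listing stands for a carriage return and a line feed, and that the backslash before `U` is a literal backslash. The logic that changes the bit width is not described in enough detail to reproduce, so `output_filter` (a hypothetical name) takes the already-extracted value and its width.

```python
TABLE = (
    "\x00 "
    "aetonisrdlhugfcwypbmk,vSA.T'PMxBCIRGDWHqE-zNFKL0j:51YJ8\\U?73Q;2!469"
    '\r\nOVXZ()*+"#$%&<=>/@[]^_`'
)


def output_filter(value: int, width: int) -> str:
    """8-bit values are literal characters; anything narrower is an
    index into the predefined table."""
    if width == 8:
        return chr(value)
    return TABLE[value]


# The 16 entries reachable by a 4-bit index under this reading.
print(repr(TABLE[:16]))  # '\x00 aetonisrdlhugf'
```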

but I've got the predefined table, an input file, an output file, and now I need to write some python code to replicate this, hopefully without crying

"vs ses oa is isgit's tc eital and largest t u anhtA ttggh os nnotosnhrdsmarosogdn ss drte tishoth's isdhsceohtsnthminder of isgit's t nuorhdhtpast\x00 geru is slightltsn oaller than ndhd na and is o nnsgtgstbtst oa dotlalssaaolootbiaoht Sal gh, sonuhvia and sl ghh\x00isgit, ontvdn ss nhsiaalgarsnadlfnaatawlarst oadrlhrs i is a rugged land dooousr'casrbhe nrdsgs fountainsnht iah"

I mean, it's not 100% wrong, but it's not right either

that's supposed to read:
"\x03Lima is Peru's capital and largest city. A well-known landmark is the Archbishop's Palace, a reminder of Peru's colonial past\x00Peru is slightly smaller than Alaska and is bordered by Ecuador, Colombia, Brazil, Bolivia and Chile\x00Peru, once the center of the mighty Incan Empire, is a rugged land dominated by the Andes Mountains. Forests and jungles cover half its land area\x00"

I somehow confused the dosbox-x debugger into not accepting letters anymore

it was a trivial off-by-one error.
I was doing saved_byte=input[3]

but while I needed the 3rd byte, that's at input[2]

yess!

C:\DOSBox-X\drive_c\carmen\py>python datfile.py cities.dat --dump=12803 --decompress
"\x03Sydney, with a population of more than 3.3 million people, is Australia's largest city. A well-known sight is Sydney's distinctively designed Opera House\x00An island continent, Australia is nearly as large as the United States but has only one-fifteenth the population\x00The capital of Australia is Canberra, located in the southeast corner of the country between Sydney and Melbourne\x00"

It starts with \x03 to indicate there are three strings, then it describes the city three times. At runtime it uses a select_string function with a random input to select one of the three strings
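A small sketch of parsing that layout: one count byte followed by that many NUL-terminated strings. It works on bytes rather than text because the encoding is not established here. `parse_string_list` is a hypothetical name, and this `select_string` is only a stand-in for the game's routine, not a reconstruction of it.

```python
import random


def parse_string_list(blob: bytes) -> list[bytes]:
    """Split a decompressed chunk into its NUL-terminated strings,
    using the leading count byte to know how many to expect."""
    count = blob[0]
    strings = blob[1:].split(b"\x00")[:count]
    if len(strings) != count:
        raise ValueError("chunk holds fewer strings than its count byte")
    return strings


def select_string(blob: bytes, rng=random) -> bytes:
    """Pick one of the strings at random."""
    return rng.choice(parse_string_list(blob))


print(parse_string_list(b"\x03one\x00two\x00three\x00"))
```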

okay now that I can decode the chunks (well, most of them) I can identify a lot more of them:

00 Name and (some other info)
01 ???
02 Image
03 City descriptions
04 Items to steal
10 ???
11&up: Hints leading here

So like, the 12 chunk for Tokyo says:

b'\x05asked about the exchange rate for yen\x00was practicing Japanese characters\x00said\x81planned to take photographs of Mount Fuji\x00asked about tours of the Imperial Palace\x00was interested in visiting Shinto shrines\x00'

So it picks from one of those 5 options

and then 13 will be:
b'\x02asked questions about Shinto rituals\x00said\x81was researching an archipelago\x00'

so when it sets up a city that has hints to lead to Tokyo, it picks 3 of these sets of questions, then picks a question in each set.
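That two-level pick can be sketched as follows, assuming each hint chunk has already been parsed into a list of strings. Whether the game avoids choosing the same set twice is an assumption here (this version does), and `pick_hints` is a hypothetical name.

```python
import random


def pick_hints(hint_sets, rng, sets_to_use=3):
    """Choose `sets_to_use` distinct hint sets, then one hint from each."""
    chosen_sets = rng.sample(hint_sets, sets_to_use)
    return [rng.choice(options) for options in chosen_sets]


# Made-up stand-ins for the parsed 11-and-up chunks of one city.
hint_sets = [
    [b"hint 11a", b"hint 11b"],
    [b"hint 12a", b"hint 12b", b"hint 12c"],
    [b"hint 13a"],
    [b"hint 14a", b"hint 14b"],
]
print(pick_hints(hint_sets, random.Random()))
```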

tool that'd really be handy right now:
a "live" version of binxelview, so I can step through the DOSBox-x debugger and see how memory is changing in real time, as an image.

that might not be TOO hard to hack in, hmm.

I'm stepping through a high-level loading routine I don't understand yet, trying to figure out when it decompresses an image by watching the RAM it uses for file loading and decompression and spotting when the image appears

Cassandrich

@foone Visualization of memory contents is a vastly unexplored area. I suspect you could do automated statistical analysis of patterns with some way to display the output and seriously accelerate reverse engineering.

@dalias @foone This is something I'd love to see more exploration of. Especially with changes over time. Scanlime had that "temporal hex dump" tool but that's only scratching the surface of what could be done UI wise.

I would love a tool integrating with ngscopeclient that lets you look at e.g. QSPI bus activity in a graphical manner to somehow visualize access patterns and understand how a boot image is constructed by dynamic analysis.