
January 2024 Update

2024-01-15

Life, it seems, likes to throw a lot of curveballs right when I'm working hardest towards my goals. This is going to be an uncharacteristically personal post. Truth be told, I'm forcing myself to write because a) I haven't been keeping up this Gemlog and b) I need something to do right now to avoid spiraling.


I haven't written much code recently either, at least not as much as I normally output. I recently finished knocking the Rust implementation of my Haggis archiver into shape, so that it's not just a library but also has a full command line program built on it. There is a C implementation as well, but I've only just started working on anything past the library.


The Haggis archive spec

The Rust library

The Rust cli program using the library

The C library implementation (seahag)


Haggis is my attempt at creating a more modern archive format for Unix: one that is simpler than Tar and doesn't require workarounds such as additional header fields for long filenames. I wrote it mostly to use as the archive format for a package manager in HitchHiker, my little toy Linux distro. As such it has integrated checksumming and other features that make it especially suited for that use. Uncompressed archives are slightly smaller than uncompressed Tar archives, and the command line program is only slightly slower than GNU tar when using its checksumming feature. It actually beats bsdtar by a wide margin when creating or extracting archives. I also gave it a long listing format, much like using the ls command on local files, which displays important metadata such as filetype, permissions and size - in color.


The only compression format that the Rust haggis implementation supports is zstd. In my (unscientific, of course) testing of gzip, bzip2, xz and zstd, it was glaringly obvious that zstd gave the best ratio of compression level to compute time of the four, and Haggis is intended to be modern. Of course one could still create an uncompressed archive and run it through gzip manually, and decompress it the same way before extraction.
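
To illustrate how cheap it is to layer zstd over any byte sink, here's a minimal sketch using the Rust zstd crate. This shows the general technique, not the actual haggis API:

```rust
// Minimal sketch: any `Write` sink can be wrapped in a zstd encoder,
// so the archive writer never needs to care whether the stream is
// compressed. Uses the `zstd` crate; this is not the haggis API.
use std::fs::File;
use std::io::{self, Write};

fn write_compressed(path: &str, payload: &[u8]) -> io::Result<()> {
    let file = File::create(path)?;
    // Level 3 is zstd's default, a good balance of ratio vs. speed.
    let mut encoder = zstd::stream::Encoder::new(file, 3)?;
    encoder.write_all(payload)?;
    // finish() writes the frame epilogue; skipping it truncates the file.
    encoder.finish()?;
    Ok(())
}
```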


The vision for the package manager

The reason I wanted to write my own package manager, rather than re-using an existing one, is that the existing package managers all share a few common design flaws.

The package manager downloads all package archives before doing any other work. A lot could be done with the existing data while waiting for the downloads to finish.

A -lot- of unchanged files get downloaded to perform an update. Why re-download the GPL license every time any GPL licensed software is updated? Why re-download the manual page on a patch level release?

Integrity is checked per archive, not per file. If the archive is off by a single byte the entire thing is discarded.

There is almost always an on-disk cache for downloads. This can grow quite large.

I have the following in mind instead (a code sketch of the receiving side follows the list).

The archiver (haggis) is a library, and can read directly from the stream of bytes coming in over the network.

Each node representing a file, directory, or link is passed off to a background thread as soon as it has finished downloading.

Since the checksum data is part of the archive header, the background thread calculates the checksum of the received data and compares it to what it should have received before writing it to disk.

Only those files which have changed are downloaded during updates.

An update comes in as a single, continuous zstd compressed stream of haggis nodes rather than individual archives.
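
Here's the promised rough sketch of that receiving pipeline. Node and read_node are hypothetical stand-ins for whatever the haggis library actually exposes; the point is the shape of the pipeline, not the names:

```rust
// Sketch of the streaming pipeline, assuming the `zstd` crate plus
// hypothetical stand-ins (`Node`, `read_node`) for the haggis API.
use std::io::{self, Read};
use std::net::TcpStream;
use std::sync::mpsc;
use std::thread;

// Stand-in for a haggis node: one file, directory, or link.
struct Node {
    checksum: [u8; 32], // carried in the archive header
    data: Vec<u8>,
}

// Stand-in parser; the real library would read one node from the
// stream, returning Ok(None) at end of archive.
fn read_node(_reader: &mut impl Read) -> io::Result<Option<Node>> {
    Ok(None)
}

fn receive(stream: TcpStream) -> io::Result<()> {
    // Decompress straight off the socket; nothing hits the disk yet.
    let mut reader = zstd::stream::Decoder::new(stream)?;
    let (tx, rx) = mpsc::channel::<Node>();

    // Background thread: verify each node's checksum against the
    // header value, and only then write it out.
    let writer = thread::spawn(move || {
        for node in rx {
            let _ = (&node.checksum, &node.data);
            // compute the checksum of node.data, compare it to
            // node.checksum, then write to disk (or flag the node
            // for re-request)
        }
    });

    // Network side: hand each node off and go straight back to
    // reading the socket.
    while let Some(node) = read_node(&mut reader)? {
        tx.send(node).expect("writer thread exited");
    }
    drop(tx); // close the channel so the writer loop ends
    writer.join().expect("writer thread panicked");
    Ok(())
}
```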

A lot of this is hypothetical for now, but should be fully compatible with the archive library as written. In order to pull this off, the client is going to have to maintain a list of all of the files on the machine, with stored checksums for all regular files, organized by package name and version info. That info will have to be compared to a list of files available on the server. My plan is to store both in plain text.
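
Purely as illustration (the real layout is not settled, and these lines are made up), a record for one package might look like:

```
[zlib 1.3.1]
f 0644 2fd1...9ab3 usr/lib/libz.so.1.3.1
l 0777 -           usr/lib/libz.so.1 -> libz.so.1.3.1
f 0644 77c2...04e1 usr/share/licenses/zlib/LICENSE
```

One line per node: type, mode, checksum (for regular files), and path. Diffing two such files line by line is exactly the kind of work Git is good at, which leads into the next point.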


When updating the database of available packages/files from the server, we're going to use an existing protocol that is just about perfect for the task. What protocol is well suited to updating a bunch of plain text files when only certain lines change between releases? It turns out that Git is perfect for this. So, in order to check for updates, the client will perform a `git pull` on the repository containing the database. It will then see which packages have updates and accumulate a list of only the changed files. This list gets sent off to the server as a request.
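
The pull itself needs nothing exotic; shelling out to the git binary would do. A sketch, assuming the database repo is already cloned locally:

```rust
// Sketch of the update check: fast-forward the local database repo.
// Shells out to the git binary rather than binding libgit2.
use std::io;
use std::process::Command;

fn refresh_database(db_dir: &str) -> io::Result<bool> {
    let status = Command::new("git")
        .args(["-C", db_dir, "pull", "--ff-only"])
        .status()?;
    Ok(status.success())
}
```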


The server needs to be running dedicated software for this to work. That software doesn't exist yet, but it has been started. Rather than storing packages as compressed archives, the server will keep them uncompressed in individual directories. It will read the list of requested files and write each one directly into the network connection as a zstd compressed stream of haggis nodes, again handing each node off to a background thread as soon as the read is finished.
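
Roughly, the response path might look like this. write_node is again a hypothetical stand-in for the haggis serializer, and the background-thread handoff is left out to keep the sketch short:

```rust
// Rough sketch of the server's response path: every requested file
// goes out as one continuous zstd stream of nodes. `write_node` is a
// hypothetical stand-in for the haggis serializer.
use std::fs::File;
use std::io::{self, Read, Write};
use std::net::TcpStream;

fn write_node(sink: &mut impl Write, _path: &str, data: &[u8]) -> io::Result<()> {
    // The real code would emit a node header (metadata + checksum)
    // before the payload.
    sink.write_all(data)
}

fn serve(requested: &[String], stream: TcpStream) -> io::Result<()> {
    // One encoder for the whole response: the client sees a single
    // zstd frame of nodes, never per-package archives.
    let mut encoder = zstd::stream::Encoder::new(stream, 3)?;
    for path in requested {
        let mut data = Vec::new();
        File::open(path)?.read_to_end(&mut data)?;
        write_node(&mut encoder, path, &data)?;
    }
    encoder.finish()?;
    Ok(())
}
```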


This scheme does have tradeoffs, of course. The server will need more storage than it would if packages were stored as compressed archives. That is mitigated somewhat because HitchHiker is going to be a semi-rolling distro, and I have no intention of keeping old versions around. The other drawback is that the client and server are going to be doing a bit more work in a short period of time than a traditional package manager, at least when gathering updates, and more so the server. Time will tell how much this impacts server performance. And of course this is not a situation where you can just mirror the archive using any webserver. The update server will be a purpose built program.


The benefits should be pretty obvious, however. By the time the last archive node comes in over the network, every other node should either be in the process of being extracted to disk or already finished. That means this thing is going to be fast. Likely orders of magnitude faster than any existing package manager. The other immediate benefit is that update downloads have the potential to be significantly smaller, which will have a huge impact if your network connection is not the greatest.


Compare this with Snap and Flatpak for a sec...

Just, for a moment, consider this scheme as a foil to Snap and Flatpak. I'm going in completely the opposite direction from what they're doing. Both formats take a sledgehammer approach, downloading what is basically an entire OS image to overlay onto your base filesystem. Sometimes you get an entire runtime for a single program. Even if you ignore the runtimes for a minute, a Snap is literally a filesystem image file. It can't be split into individual files; you get the entire thing or nothing at all. Flatpak is in a similar boat. Both technologies are massively more demanding of storage and network traffic than a traditional package manager. My scheme, on the other hand, will transfer significantly less data over the network than a traditional package manager, let alone Snap or Flatpak. It trades a bit more compute for less network usage and the potential for huge speed gains. Personally I think this is a better way to go. I feel like modern Linux is boldly moving in exactly the wrong direction on this front. Just because we can send 10x more data than we need to perform a transaction does not mean that we should. Resources are finite, and should be conserved.


Ok then, stepping down from my soap box.


What's the holdup, then?

Well, life has been trying to kill me. Seriously.


My fiancée's car had to have a new engine put in last month. We were paying for a rental for almost six weeks. I'm back in school. I'm slowly restoring a 34-year-old Dodge pickup while daily driving it, because I got tired of newer vehicles constantly letting me down (this is an entire subject - I could write a book. Several.)


And then my dad went into the hospital. Congestive heart failure. Fluid buildup in his lungs and his legs. Unable to walk more than twenty steps at a time. Then they had to take his arthritis medication away after putting him on blood thinners, so he had another setback in mobility. After a week in the hospital he went to a rehabilitation center to work on mobility, while my brother and I do repairs and cleaning at his house. Yesterday they had to transfer him back to the hospital. The nurse found him unresponsive, catatonic, and even after eventually rousing him he had lost three hours of memory, including an entire visit from my brother. Low oxygen levels due to congestive heart failure, according to the doctor. He's back at the rehab center, but we suspect there's more going on. Of course the hospital is a two hour drive for me, and the rehab center is only a little bit better. I want him closer, but he wants to be closer to his home.


I was already getting burned out due to working too hard on side projects before any of these events happened. Right now, two weeks in a mental hospital is starting to sound like maybe a nice vacation opportunity.


My dad was like a superhero to me growing up. Literally the strongest man I've ever known. He was also wickedly smart and had such a wide ranging skillset that I always thought he could do literally anything. It hurts seeing him wither away more than I can describe.


None of this is going to stop me from working on projects or working towards my goals. The pace has definitely slowed, however. I couldn't have sustained the pace I was keeping before, even without starting school. Even without all of the other problems. But after going nearly a month at one point without writing any code or updating this gemlog, I have to say I'm back now in spite of everything else going on. I'm going to take a more measured approach and watch myself for signs of burnout, but I can't just not do the things I enjoy, and some of the projects I've started exist because these things need to exist. Nobody else is doing them, so it's on me.




All content for this site is licensed as CC BY-SA.

© 2024 by JeanG3nie

