Making Things Sortable

2024-01-20

Several times recently, I've had reason to make my own data types sortable in Rust projects. In my archiver, Haggis, I wanted to be able to not only list the files contained in an archive but also to provide a long listing format, such as what you get when you type `ls -l` in a terminal on Unix. Now, printing a listing is just a matter of formatting the various metadata fields for each archive node, but we probably also want to be able to alphabetize them by filename, and we might also want to be able to show directories first. You could create some complex sorting logic of your own, but Rust gives a nice easy interface that we can use in the form of the Traits in `std::cmp`.

First, a simplified data structure

For the sake of illustration I'm going to dumb down the example a little bit and make a simpler data structure than what I'm actually using.

enum Kind {
    Normal,
    Directory,
    Symlink,
}

struct Node {
    kind: Kind,
    name: String,
    mode: u32,
}

What is desired is a way to sort a collection of `Node` structures alphabetically, placing directories at the top. If we manually implement the `std::cmp` trait `Ord` for our `Node` type, then we can compare Node's using the `<` and `>` operators. As a side effect, we can also sort a collection of Node's. But before we manually implement `Ord` we're going to want `Eq`, which the compiler can automatically derive for us using a macro.

#[derive(PartialEq, Eq)]
enum Kind {
    Normal,
    Directory,
    Symlink,
}

#[derive(PartialEq, Eq)]
struct Node {
    kind: Kind,
    name: String,
    mode: u32,
}

Note that in order to derive `Eq` for `Node`, we also have to derive it for `Kind`, since the latter is a field in the former. Next, we write out implementation block.

use std::cmp::Ordering;

impl Ord for Node {
    fn cmp(&self, other: &Self) -> Ordering {
        match (self.kind, other.kind) {
            (Kind::Directory, Kind::Directory) => self.name.cmp(other.name),
            (Kind::Directory, _) => Ordering::Less,
            (_, Kind::Directory) => Ordering::Greater,
            _ => self.name.cmp(other.name),
        }
    }
}

impl PartialOrd for Node {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

That's not too bad. So how do we use this? Suppose we have an iterator called a NodeStream that returns Node's until we reach the archive end. If we didn't want to sort it, then we could just iterate and print. But since we want to sort the entries we'll first collect them into a `Vec`, or a dynamically sized array for those who aren't up on Rust terminology.

> NOTE: We not only have to implement Ord but PartialOrd as well. But PartialOrd is crazy simple when we already have Ord, as you can see.

// assume that `stream` is our hypothetical `NodeStream` type
stream.collect::<Vec>().sort().iter().for_each(|x| {
    println!("{x}");
})

Now, in order to use the `println!()` macro like this we also have to have implemented the `Display` trait for `Node`. And in the implementation of the Display trait, we can provide a nice pretty printed representation that breaks everything down into columns so the final result comes out as a nice looking table.

What else can we do with `Ord`?

Suppose you are writing software that needs to compare version numbers. Now, it would be awesome if everyone used SemVer, and used it correctly. But you'll often find that a project does interesting things like releasing a version as "1.0" followed up by version "1.0.1". Where was the patch number in the first version number? We can't let the compiler derive `Ord` in this case because there is no logical way to compare dissimilar data types, and in this case the version numbers would logically parse to two different types (one has two fields, and one has three). It get's more complex if we consider pre-release versions. But with some creative programming, we can actually make comparing the two data types simple. Here's a starting point for our data types.

enum PreRelease {
    Alpha(u8),
    Beta(u8),
    Rc(u8),
    None,
}

struct SemVer {
    major: u8,
    minor: u8,
    patch: u8,
    pre: PreRelease,
}

struct Rapid {
    major: u8,
    minor: u8,
    pre: PreRelease,
}

Let's start with how we're going to handle comparing a PreRelease. If we have two alpha releases we can just compare the associated `u8`. But we need a simple way to say that a beta version is always greater than an alpha, an Rc greater than either, and PreRelease::None trumps all. We could write a long and very verbose match statement, but that would kinda suck and be error prone. Instead, lets encode it into an integer value by shifting some bits so we can just compare two numbers. For that, we're going to turn to another Trait - `std::convert::From`.

impl From<PreRelease> for u16 {
    fn from(value: PreRelease) -> Self {
        match value {
            Prerelease::Alpha(n) => n as u16 | 0o400,
            Prerelease::Beta(n) => n as u16 | 0o1000,
            PreRelease::Rc(n) => n as u16 | 0o2000,
            PreRelease::None => 0o4000,
        }
    }
}

Basically we turn our enum's tag (Rust enum's are really tagged unions under the covers) and turn it into a set of bitflags, in such a way that if we do the conversion to u16 we can get the correct result just by using the comparison operators `<`, `>` and `==`. So now we can go ahead and manually implement Ord for PreRelease.

use std::cmp::Ordering;

impl PartialEq for PreRelease {
    fn eq(&self, other: &Self) -> bool {
        u16::from(*self) == u16::from(*other)
    }
}

impl Eq for PreRelease {}

impl Ord for PreRelease {
    fn cmp(&self, other: &Self) -> Ordering {
        u16::from(*self).cmp(u16::from(*other))
    }
}

impl PartialOrd for PreRelease {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

Note that in this case we also manually implemented PartialEq and Eq. Our impl block for Eq is empty, because Eq gives us a default implementation so long as we have PartialEq. I know, traits can be weird.

That makes our comparison pretty trivial, and we don't have to rely on a long and verbose match statement. To finish up we create a larger integer by taking our patch number and shifting the bits until they're just left of the bitflag fields, shift the minor bits until they're left of the patch bits, and the major bits until they're left of minor. In the case of our `Rapid` type, which omits the patch field entirely, we shift the major and minor numbers by the same amount we would if we did have a patch number. This means that we can not only easily compare different kinds of prerelease or release versions but we can compare `Rapid` with `SemVer`. So version "1.0" and version "1.0.0" would come up as equal, which they logically are. I'm not going to show this because it's just a logical extension of what we already saw.

Once again this is somewhat dumbed down compared with my Version library crate. In the name of robustness the library uses an `Option<NonZerou16>` in favor of just a `u8` for prerelease numbers. A `NonZerou16` can never be zero, just as the name would imply. Complicating things a bit further was want "alpha" to come up as equal to "alpha1", so the production quality implementations of `PartialEq` and Ord are a bit more complex. The library also provides parsing version numbers from strings and writing them out as strings, and it also takes architecture into account. None of that is really needed to show the concepts I wanted to highlight in this post, however.

Conclusion

I hope you got something out of these little examples. Rust's trait system was one of those things that I didn't really get on with right away when I started out learning the language. I've found, however, that over time traits have shown themselves to be a hugely valuable language construct that goes a long way towards making your code more concise and readable, often making things which would be incredibly verbose into short little snippets. The bit twiddling is something that you might realistically use in just about any language, but it's also something that you're much less likely to encounter outside of C, and that's a shame. Sometimes those abstract little low level concepts can be real time savers when applied creatively.

Tags for this page

All content for this site is licensed as CC BY-SA.

Finger

Contact

-- Response ended

-- Page fetched on Mon May 20 08:59:41 2024