Linux NILFS file system: automatic continuous snapshots

Author: Solène

Date: 05 October 2022

Tags: linux filesystem nilfs

Introduction

Today, I'll talk about a special Linux file system that I really enjoy. It's called NILFS and was merged into Linux in 2009, so it's not really a new player. Despite being stable and used in production, it never became popular.

In this file system, there is a unique system of continuous checkpoint creation. A checkpoint is a snapshot of your system at a given point in time, but it can be deleted automatically if some disk space must be reclaimed. A checkpoint can be transformed into a snapshot that will never be removed.

This mechanism works very well for workstations or file servers on which redundancy is nonexistent, and on which backups are only done every day or every week, which leaves room for unrecoverable mistakes.

NILFS project official website

Wikipedia page about NILFS

NILFS concepts

NILFS is a Copy-On-Write (CoW) file system, which means that when you change a file, the original chunk on the disk isn't modified; a new chunk is written with the new content instead. This plays well with keeping a history of the files.

In my experience, it performs very well on SSD devices on a desktop system, even during heavy I/O operations.

The continuous checkpoint creation system may be very confusing, so I'll explain how to learn about this mechanism and how to tame it.

Garbage collection

The concept of a garbage collector may seem obvious to most people, but if it doesn't ring a bell, let me give a quick explanation. In computer science, a garbage collector is a task that looks for unused memory and makes it available again.

On NILFS, as a checkpoint is created every few seconds, used data is never freed and one would run out of disk space pretty quickly. This is where the `nilfs_cleanerd` program, the garbage collector, comes in: it looks at the oldest checkpoints and deletes them to reclaim disk space under certain conditions. Its default strategy is to keep checkpoints as long as possible, until it needs to make some room to avoid issues. This may not suit a workload creating a lot of files, which is why it can be tuned very precisely. For most desktop users, the defaults should work fine.

The garbage collector is automatically started on a volume upon mount. You can use the command `nilfs-clean` to control that daemon: reload its configuration, stop it, etc.
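As a rough sketch, the daemon controls could be wrapped in a small helper like the one below. The function name `cleaner_ctl` and the device path in the usage comment are made up for illustration, and the long option names are assumed to match the nilfs-clean(8) man page, so check `nilfs-clean --help` on your system before relying on them.

```shell
# Hypothetical helper around nilfs-clean; must be run as root.
# Usage: cleaner_ctl <status|reload|suspend|resume|quit> <device>
cleaner_ctl() {
    action=$1
    device=$2
    case "$action" in
        status)  nilfs-clean --status  "$device" ;;  # show the cleaner state
        reload)  nilfs-clean --reload  "$device" ;;  # re-read the configuration
        suspend) nilfs-clean --suspend "$device" ;;  # pause garbage collection
        resume)  nilfs-clean --resume  "$device" ;;  # continue garbage collection
        quit)    nilfs-clean --quit    "$device" ;;  # shut the daemon down
        *)       echo "unknown action: $action" >&2; return 1 ;;
    esac
}

# Example (the device path is an assumption):
#   cleaner_ctl status /dev/sdb1
```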

When you delete a file on a NILFS file system, it doesn't free up any disk space because the file is still available in a previous checkpoint. You need to wait for the corresponding checkpoints to be removed before the space is freed.

How to find the current size of your data set

As the output of `df` for a NILFS file system reports the disk space used by your data AND by the snapshots/checkpoints, it can't be used to know how much space your current data actually takes.

To figure out the current disk usage (without accounting for older checkpoints/snapshots), we will use the command `lscp` to look at the number of blocks contained in the most recent checkpoint. With the default NILFS block size of 4096 bytes, multiply the block count by 4096 to get bytes, then divide three times by 1024 to get gigabytes (bytes -> kilobytes -> megabytes -> gigabytes).

This number is the current size of what you have on the partition.
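The calculation can be sketched as a small filter over the `lscp` output. This assumes that the block count (BLKCNT) is the second-to-last column of the default `lscp` output, that the most recent checkpoint is listed last, and that the file system uses 4096-byte blocks; the function name and the device path are only illustrative.

```shell
# Read `lscp` output on stdin and print the size, in gigabytes, of the most
# recent checkpoint. An optional first argument overrides the block size.
latest_checkpoint_size() {
    awk -v bs="${1:-4096}" '
        $1 ~ /^[0-9]+$/ { blocks = $(NF - 1) }   # remember BLKCNT of the last data row
        END { printf "%.2f GB\n", blocks * bs / 1024 / 1024 / 1024 }
    '
}

# Example (the device path is an assumption):
#   lscp /dev/sdb1 | latest_checkpoint_size
```

The header line is skipped because its first field is not a number, so only real checkpoint rows are considered.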

Create a checkpoint / snapshot

It's possible to create a snapshot of your current system state using the command `mkcp -s` (without `-s`, `mkcp` creates a regular checkpoint).

Or you can turn an existing checkpoint into a snapshot using the command `chcp ss`.

The opposite operation (snapshot to checkpoint) can be done using `chcp cp`.
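Put together, these operations could look like the wrappers below. This is only a sketch: the function names are hypothetical, the device is whatever partition holds your NILFS file system, and the checkpoint number comes from the CNO column of `lscp`.

```shell
# Hypothetical wrappers; run as root.
# <device> is the NILFS partition, <cno> is a checkpoint number from `lscp`.

# Create a new snapshot of the current state.
make_snapshot() { mkcp -s "$1"; }

# Promote an existing checkpoint to a snapshot (never removed by the cleaner).
protect_checkpoint() { chcp ss "$1" "$2"; }

# Demote a snapshot back to a plain checkpoint.
unprotect_snapshot() { chcp cp "$1" "$2"; }

# List snapshots only.
list_snapshots() { lscp -s "$1"; }

# Example (device path and checkpoint number are assumptions):
#   protect_checkpoint /dev/sdb1 12345
```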

How to recover files after a big mistake

Let's say you deleted some important work in progress, and you don't have any backup or any other way to retrieve it. Fortunately, you are using NILFS and a checkpoint was created every few seconds, so the files are still there and within reach!

The first step is to pause the garbage collector to avoid losing the files: `nilfs-clean --suspend`. After this, we can think slowly about the next steps without having to worry.

The next step is to list the checkpoints using the command `lscp` and find a date/time at which the files still existed, preferably in their latest version, so the best choice is the checkpoint created just before the deletion.
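Picking that checkpoint by eye works, but the search can also be sketched as a filter. It assumes the default `lscp` columns (CNO, then DATE as YYYY-MM-DD, then TIME as HH:MM:SS), which sort correctly as plain strings; the function name and the example values are hypothetical.

```shell
# Read `lscp` output on stdin and print the number of the last checkpoint
# created at or before the given "YYYY-MM-DD HH:MM:SS" timestamp.
# Returns a non-zero status when no checkpoint is old enough.
checkpoint_before() {
    awk -v limit="$1" '
        $1 ~ /^[0-9]+$/ && ($2 " " $3) <= limit { cno = $1 }
        END { if (cno != "") print cno; else exit 1 }
    '
}

# Example (device path and timestamp are assumptions):
#   lscp /dev/sdb1 | checkpoint_before "2022-10-05 14:30:00"
```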

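Then, we can mount the checkpoint (let's say number 12345 for the example) read-only on a different directory. NILFS only mounts checkpoints that are marked as snapshots, so convert it first with `chcp ss`, then mount it with `mount -t nilfs2 -r -o cp=12345 <device> /mnt`, where `<device>` is your NILFS partition.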

If it went fine, you should be able to browse the data in `/mnt` to recover your files.

Once you have finished recovering your files, unmount `/mnt` and resume the garbage collector with `nilfs-clean --resume`.
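The whole recovery procedure can be sketched as two helpers, one to start and one to finish. The function names, device path and checkpoint number are hypothetical. The sketch also marks the checkpoint as a snapshot before mounting it, because NILFS only mounts snapshots, and turns it back into a checkpoint afterwards.

```shell
# Hypothetical recovery helpers; run as root.
# Arguments: <device> <checkpoint number> <mount point>

# Pause the cleaner, protect the checkpoint, then mount it read-only.
recover_begin() {
    device=$1 cno=$2 mnt=$3
    nilfs-clean --suspend "$device" &&
    chcp ss "$device" "$cno" &&
    mount -t nilfs2 -r -o "cp=$cno" "$device" "$mnt"
}

# Unmount, turn the snapshot back into a checkpoint, resume the cleaner.
recover_end() {
    device=$1 cno=$2 mnt=$3
    umount "$mnt" &&
    chcp cp "$device" "$cno" &&
    nilfs-clean --resume "$device"
}

# Example (all values are assumptions):
#   recover_begin /dev/sdb1 12345 /mnt
#   cp -a /mnt/home/user/project ~/project.recovered
#   recover_end /dev/sdb1 12345 /mnt
```

If a step fails midway, run `nilfs-clean --resume` by hand so the cleaner does not stay paused.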

Going further

Here is a list of extra pieces you may want to read to learn more about nilfs2:

nilfs_cleanerd and nilfs_cleanerd.conf man pages to tune the garbage collector

man pages for lscp / mkcp / rmcp / chcp to manage snapshots and checkpoints manually