ZFS is one of the technologies I have gotten to know a bit better recently (the other one being Nix, but more on that in another article).
I would like to share what I learned, what got me interested and why I think ZFS is worth learning about.
What is ZFS for?
The glorious Arch Wiki has a great description of the benefits and technical parameters of ZFS. Here are a few key quotes: it is “an advanced filesystem” and “described as ‘The last word in filesystems’”.
It supports very large files, can encompass huge amounts of storage space and has no practical limit on the number of files. If you have worked with other filesystems, you know that these kinds of limits are something you bump into from time to time. You won’t run into them with ZFS, at least for reasonable usage.
For me, the main use case of ZFS is to spread data across multiple physical disks in a convenient, maintainable and, when needed, encrypted fashion. I think it does a great job at that, plus there are some awesome advanced features we will dive into later.
First, let’s make sure you have an overview of important words and concepts, so it’s easier to discuss details.
Words and Concepts
Let’s start from the bottom and keep things simple. There are advanced concepts, but you can look into them on your own time. You don’t need them to start using ZFS and get a lot of value out of it.
Some of this is simplified, and leaves out edge cases. I will link a few resources at the bottom of this article where you can get a thorough picture if you like.
First, you have your physical disks. They can be part of a vdev, or be a vdev on their own. vdev stands for virtual device. This is where redundancy happens in ZFS land, so you probably don’t want to have a single disk be a vdev. Disks fail, and you probably don’t want your data to go away.
Good news! If you have multiple disks, you can combine them into a vdev in such a way that one of them could fail without losing your data. One of these ways is a “mirror”. You can connect two or more devices to mirror the same data. This is very similar to a RAID1 setup, but not constrained to 2 devices and with other cool features (more on that later).
Apart from mirroring, you can connect disks in different raidz configurations. Those are a more advanced topic, close to other performance-enhancing setups which we also won’t discuss in detail here. Mirror is good enough to get started.
The next important term is zpool. A zpool consists of one or more vdevs. You can create multiple zpools if you like (and have enough disks, I guess). This is the thing you create first, and where you specify how disks are configured within it.
Here is an example of my current backup setup:
A zpool called "main"
  a mirror-mode vdev with 3 devices
    disk sda
    disk sdb
    disk sdc
You can create a zpool, export it (stop using it on the current machine) and import it (start using it on the current machine). When data is written, it is written to one of the vdevs attached to the zpool. So if you added three single-disk vdevs to a zpool and one of those disks failed, your data would probably be gone.
# this is a pretty brittle setup!
# this looks similar to above, but the
# disks are on their own and no
# data is stored redundantly
A zpool called "main"
  disk sda
  disk sdb # if this one breaks, you have a problem
  disk sdc
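On the command line, the difference between the two layouts comes down to a single keyword. The following is a dry-run sketch, not a recipe: the `run` helper and the `plan_*` function names are made up for illustration, and nothing is executed against real disks because `run` only prints each command.

```shell
# Dry-run helper: prints the command instead of running it, so no disk is touched.
run() { printf '+ %s\n' "$*"; }

# Redundant layout: one mirror vdev made of three disks.
plan_mirror_pool() {
    run zpool create main mirror /dev/sda /dev/sdb /dev/sdc
}

# Brittle layout: three single-disk vdevs, nothing is stored redundantly.
plan_striped_pool() {
    run zpool create main /dev/sda /dev/sdb /dev/sdc
}

plan_mirror_pool
plan_striped_pool
```

Leaving out `mirror` is all it takes to end up with the brittle variant, which is worth double-checking before creating a pool for real.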
Okay, now that you know about disks, vdevs and zpools, the last thing to know about are datasets.
After you create a zpool, you can specify any number of datasets. If the pool is called “main”, you could create a dataset with the name “main/projects”, for example. You can create encrypted datasets and don’t have to specify a size. They grow as needed within their zpool.
You can have a tree of datasets, which is useful for encryption scenarios (“main/enc/secret_projects” and “main/enc/finances” for example), but I digress.
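A dataset tree like the one mentioned above could be planned roughly as follows. This is a dry-run sketch: the `run` helper only prints the commands, and the passphrase key format is an assumption chosen for illustration. Child datasets inherit encryption from their parent by default, so only “main/enc” needs the encryption options.

```shell
# Dry-run helper: prints the command instead of running it.
run() { printf '+ %s\n' "$*"; }

plan_dataset_tree() {
    # Plain, unencrypted dataset.
    run zfs create main/projects
    # Encrypted parent; keyformat=passphrase is an assumption for this sketch.
    run zfs create -o encryption=on -o keyformat=passphrase main/enc
    # Children inherit the encryption settings of main/enc.
    run zfs create main/enc/secret_projects
    run zfs create main/enc/finances
}

plan_dataset_tree
```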
Alright! Now that you know the most important terms, let’s look at what makes ZFS exciting.
What Got Me Interested
Why should you make the effort to learn about ZFS? For me, that was because there are a few benefits and things you can do with it which are not possible with other approaches. Here is a list of nifty features which got me to try it:
- No restrictive file size limits. I know, this is more of a problem of old filesystems, but it’s good to know that there won’t be this kind of problem here.
- Encryption as needed. You don’t have to encrypt everything, but can choose per dataset (or per group of datasets).
- It’s straightforward to create a setup where a single disk failure won’t take out all data, and where the failed disk can be replaced as needed.
- You can sync data between disks without decrypting anything. (!)
- You can sync with a remote drive, without decrypting it. (!!)
- You can add storage space to a zpool if needed later on. Just add more disks connected to a mirrored vdev.
- Built-in data quality maintenance. Hardware degrades, but you can check the integrity of your data and fix creeping decay by running “scrubbing” jobs every week or so.
- Syncing data between disks (called resilvering) and scrubbing both happen without downtime. Only the data which changed gets resilvered, not everything.
- You can create a snapshot of the current state of your data, and use it for backups. Once again, without needing to stop using the data. (I haven’t tried this yet, but this is a really neat feature!)
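Several of the features in this list map to one-line commands. The following is a dry-run sketch rather than something to paste into a terminal: the `run` helper only prints each command, and the extra disks (`/dev/sdd`, `/dev/sde`), the snapshot name `weekly`, the host `backup-host` and the target dataset `backup/finances` are all made up for illustration.

```shell
# Dry-run helper: prints the command instead of running it.
run() { printf '+ %s\n' "$*"; }

# Grow the pool by adding a second mirrored vdev (hypothetical disks).
plan_grow_pool() {
    run zpool add main mirror /dev/sdd /dev/sde
}

# Start a scrub of the whole pool; the pool stays usable while it runs.
plan_scrub() {
    run zpool scrub main
}

# Take a snapshot, then send it in raw form so the data stays encrypted
# in transit and on the receiving side (hypothetical host and target).
plan_encrypted_backup() {
    run zfs snapshot main/enc/finances@weekly
    run "zfs send --raw main/enc/finances@weekly | ssh backup-host zfs receive backup/finances"
}

plan_grow_pool
plan_scrub
plan_encrypted_backup
```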
The last thing which convinced me was the actual feel of using ZFS. Well, OpenZFS actually. But before we get to commands, there is a technical detail which makes ZFS slightly harder to start using.
A Slight Inconvenience
Due to licensing issues, OpenZFS is not included in the Linux kernel (unlike, for example, btrfs, which offers a lot of the same benefits; I have not gotten around to trying it yet, though).
This means that you have to either install a pre-built kernel module (and maybe lag behind the newest kernel versions until there is a build of that module for them) or build your own using dkms (Dynamic Kernel Module Support).
In the case of Arch, I switched to another kernel package without much trouble.
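To see whether the module is currently loaded, one option is to look for its directory under `/sys/module`. A minimal sketch, with the function name and the optional root argument made up for illustration (the argument only exists so the check can be tried against a scratch directory):

```shell
# Reports whether a "zfs" module directory exists under the given sysfs root.
# The root defaults to /sys/module, where Linux lists loaded kernel modules.
zfs_module_status() {
    sysfs_root="${1:-/sys/module}"
    if [ -d "$sysfs_root/zfs" ]; then
        echo "zfs module loaded"
    else
        echo "zfs module not loaded"
    fi
}

zfs_module_status
```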
What Do the Commands Look Like?
What does it look like when you use ZFS? This is the part that convinced me to use ZFS and not dive deeper into btrfs. I guess it’s a matter of taste.
Note: the following commands are just there to give you a first impression of what using ZFS looks like. DON’T FOLLOW THEM.
Again, the following commands are only here as an example. Don’t type them into your terminal: you might destroy your current data. Instead, follow a thorough tutorial once you are sure that you understand what you are doing.
With that out of the way, let’s assume we have two fresh disks connected to a computer (/dev/sdb and /dev/sdc) and want to create a new zpool. We want the disks to mirror the same data, so the data is not gone when one of them gets damaged.
Here is how we would go about creating a zpool (let’s also assume we are root to skip the sudo):
# this might overwrite existing data on both sdb and sdc disks
zpool create -o ashift=12 new-pool-name mirror /dev/sdb /dev/sdc
This will create a zpool called “new-pool-name”. The ashift part gets recommended in all tutorials I have seen, for performance reasons. It’s a slight inconvenience, because forgetting about it at this stage can be an annoying gotcha.
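As background: `ashift` is the base-2 exponent of the sector size the pool assumes for its disks, so `ashift=12` corresponds to 4096-byte sectors. A small sketch of that arithmetic, with a function name made up for illustration:

```shell
# Converts an ashift value into the sector size in bytes (2 to the power of ashift).
ashift_to_bytes() {
    echo $((1 << $1))
}

ashift_to_bytes 12   # prints 4096
ashift_to_bytes 9    # prints 512
```

The value is fixed for a vdev when it is created, which is why leaving it out is hard to correct later.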
We can then check what the created pool looks like.
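Two commands commonly used for inspecting a pool are `zpool status` and `zpool list`. A dry-run sketch, assuming the pool name from above (the `run` helper and the function name are made up for illustration, and `run` only prints the commands):

```shell
# Dry-run helper: prints the command instead of running it.
run() { printf '+ %s\n' "$*"; }

plan_inspect_pool() {
    pool="$1"
    run zpool status "$pool"   # health, vdev layout and per-disk error counters
    run zpool list "$pool"     # size, allocated and free space
}

plan_inspect_pool new-pool-name
```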
Now we can create a dataset:
zfs create new-pool-name/data
Note: if we wanted to have it encrypted, we’d need to add -o encryption=on (along with a keyformat option) to this command.
Now we can mount the new dataset with:
zfs mount new-pool-name/data
Note: if we had an encrypted dataset, we’d need to provide the key beforehand with something like zfs load-key new-pool-name/data to unlock it.
The dataset got mounted to /new-pool-name/data by default. We can add files or interact with it in any other usual way now.
Once we are done, we’ll unmount it:
zfs umount /new-pool-name/data
If we also want to disconnect the zpool from this machine to use it on another, we’ll need to export it (so it can be imported on the other one):
zpool export new-pool-name
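On the other machine, the counterpart would look roughly like this. A dry-run sketch (the `run` helper and the function name are made up for illustration, and `run` only prints the commands); running `zpool import` without a pool name lists the pools that are available for import:

```shell
# Dry-run helper: prints the command instead of running it.
run() { printf '+ %s\n' "$*"; }

plan_move_pool() {
    pool="$1"
    run zpool export "$pool"   # on the old machine
    run zpool import           # on the new machine: list importable pools
    run zpool import "$pool"   # on the new machine: import the pool by name
}

plan_move_pool new-pool-name
```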
Finally, to gracefully disconnect the external disks, here’s a command to power them off, just in case you haven’t seen it yet (you might need to install an extra package for that):
udisksctl power-off -b /dev/sdb
udisksctl power-off -b /dev/sdc
I hope this writeup has helped you get an overview of ZFS: both the concepts and the reasons to be curious about it in the first place.
It’s great if you want to connect a few physical disks to make sure your data is safe.
As mentioned above, btrfs is an alternative! I didn’t get into it because the concepts of ZFS have been easier to grasp for me, and the command line tools have felt more intuitive. But I’m sure that can vary from person to person.
Happy data hoarding if you decide to give it a try!