This is a general overview of the ZFS file system for people who are new to it.
If you have some experience and are actually looking for specific information
about how to configure ZFS for Ansible-NAS, check out the [ZFS example
configuration](zfs_configuration.md).
## What is ZFS and why would I want it?
[ZFS](https://en.wikipedia.org/wiki/ZFS) is an advanced filesystem and volume
manager originally created by Sun Microsystems starting in 2001. It was first
released in 2005 for OpenSolaris. Oracle later bought Sun and switched to
developing ZFS as closed source software. An open source fork took the name
[OpenZFS](http://www.open-zfs.org/wiki/Main_Page), but is still called "ZFS" for
short. It runs on Linux, FreeBSD, illumos and other platforms.

ZFS aims to be the ["last word in
filesystems"](https://blogs.oracle.com/bonwick/zfs:-the-last-word-in-filesystems),
a technology so future-proof that Michael W. Lucas and Allan Jude famously
stated that the _Enterprise's_ computer on _Star Trek_ probably runs it. The
design was based on [four
principles](https://www.youtube.com/watch?v=MsY-BafQgj4):
1. "Pooled" storage to eliminate the notion of volumes. You can add more storage
   the same way you just add a RAM stick to memory.
1. Make sure data is always consistent on the disks. There is no `fsck` command
   for ZFS and none is needed.
1. Detect and correct data corruption ("bitrot"). ZFS is one of the few storage
   systems that checksums everything, including the data itself, and is
   "self-healing".
1. Make it easy to use. Try to "end the suffering" for the admins involved in
   managing storage.

ZFS includes a host of other features such as snapshots, transparent compression
and encryption. During the early years of ZFS, this all came with hardware
requirements only enterprise users could afford. By now, however, computers have
become so powerful that ZFS can run (with some effort) on a [Raspberry
Pi](https://gist.github.com/mohakshah/b203d33a235307c40065bdc43e287547).

FreeBSD and FreeNAS make extensive use of ZFS. What holds ZFS back on Linux is
a set of [licensing issues](https://en.wikipedia.org/wiki/OpenZFS#History) beyond
the scope of this document.

Ansible-NAS doesn't actually specify a filesystem - you can use EXT4, XFS or
Btrfs as well. However, ZFS not only provides the benefits listed above, but
also lets you use your hard drives with different operating systems. Some people
now using Ansible-NAS came from FreeNAS, and were able to `export` their ZFS
storage drives there and `import` them to Ubuntu. Conversely, if you ever
decide to switch back to FreeNAS or maybe want to use FreeBSD instead of Linux,
you should be able to use the same ZFS pools.
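
Moving a pool between operating systems, as described above, comes down to an export on the old host and an import on the new one. The following is only a sketch: the pool name `tank` is a placeholder, `pool_move_plan` is a made-up helper, and the function just prints the commands instead of running them.

```shell
# Print the commands for moving a pool to another host. Nothing is executed.
# "pool_move_plan" is a hypothetical helper, not part of ZFS.
pool_move_plan() {
    pool=$1
    echo "old host: sudo zpool export $pool"
    echo "new host: sudo zpool import"
    echo "new host: sudo zpool import $pool"
}

pool_move_plan tank
```

Running `zpool import` without a pool name lists the pools that are available for import, which is a useful check before the actual import.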
## An overview and some actual commands
Storage in ZFS is organized in **pools**. Inside these pools, you create
**filesystems** (also known as "datasets") which are like partitions on
steroids. For instance, you can keep each user's `/home` directory in a separate
filesystem. ZFS systems tend to use lots and lots of specialized filesystems
with tailored parameters such as record size and compression. All filesystems
share the available storage in their pool.

Pools do not directly consist of hard disks or SSDs. Instead, drives are
organized as **virtual devices** (VDEVs). This is where the physical redundancy
in ZFS is located. Drives in a VDEV can be "mirrored" or combined as "RaidZ",
roughly the equivalent of RAID5. These VDEVs are then combined into a pool by the
administrator. The command might look something like this:
```
sudo zpool create tank mirror /dev/sda /dev/sdb
```
This combines `/dev/sda` and `/dev/sdb` into a mirrored VDEV, and then defines a
new pool named `tank` consisting of this single VDEV. (In practice, you'd refer
to the drives by a persistent identifier, such as the names under
`/dev/disk/by-id/`, rather than `/dev/sdX`, but you get the idea.) You can now
create a filesystem in this pool for, say, all of your _Mass Effect_ fan fiction:
```
sudo zfs create tank/mefanfic
```
You can then enable automatic compression on this filesystem with `sudo zfs set
compression=lz4 tank/mefanfic`. To take a **snapshot**, use
```
sudo zfs snapshot tank/mefanfic@21540411
```
Now, if evil people somehow manage to encrypt your precious fan fiction files
with ransomware, you can simply laugh maniacally and revert to the old version:
```
sudo zfs rollback tank/mefanfic@21540411
```
Of course, you would lose any texts you might have added to the filesystem
between that snapshot and now. Usually, you'll have some form of **automatic
snapshot administration** configured.
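
Automatic snapshot administration usually means a scheduled job that creates snapshots with predictable names and prunes old ones. As a minimal sketch of the naming part only, assuming a date-based scheme like the one in the example above (the helper `snapshot_name` is hypothetical, and the command is printed rather than executed):

```shell
# Build a snapshot name of the form <dataset>@<YYYYMMDD>.
# "snapshot_name" is a made-up helper for illustration.
snapshot_name() {
    printf '%s@%s\n' "$1" "$(date +%Y%m%d)"
}

# Print the command a daily job might run; drop the echo to actually run it.
echo sudo zfs snapshot "$(snapshot_name tank/mefanfic)"
```

To see which snapshots exist, use `sudo zfs list -t snapshot`.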

To detect bitrot and other data defects, ZFS periodically runs **scrubs**: The
system compares the available copies of each data record with their checksums.
If there is a mismatch, the data is repaired.
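
The idea behind a scrub can be mimicked with ordinary files. The following toy model is not how ZFS is implemented: it assumes a two-way mirror represented by two files, uses `cksum` as a stand-in for the ZFS checksum, and all file names are made up. On a real pool, a scrub is started with `sudo zpool scrub tank` and its progress is shown by `sudo zpool status tank`.

```shell
# Toy model of a scrub on a two-way mirror, using plain files.
workdir=$(mktemp -d)
printf 'precious fan fiction\n' > "$workdir/copy_a"
cp "$workdir/copy_a" "$workdir/copy_b"

# Checksum recorded when the data was written.
expected=$(cksum < "$workdir/copy_a")

# Simulate bitrot on one of the two copies.
printf 'precious fan fictiom\n' > "$workdir/copy_b"

# "Scrub": verify every copy and repair a bad one from a good one.
for copy in copy_a copy_b; do
    if [ "$(cksum < "$workdir/$copy")" != "$expected" ]; then
        for other in copy_a copy_b; do
            if [ "$(cksum < "$workdir/$other")" = "$expected" ]; then
                cp "$workdir/$other" "$workdir/$copy"
                echo "repaired $copy from $other"
                break
            fi
        done
    fi
done
```

If no copy matches the recorded checksum, nothing can be repaired, which is why the redundancy inside the VDEV matters.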
## Known issues
> At time of writing (April 2019), ZFS on Linux does not offer native
> encryption, TRIM support or device removal, which are all scheduled to be
> included in the upcoming [0.8
> release](https://www.phoronix.com/scan.php?page=news_item&px=ZFS-On-Linux-0.8-RC1-Released)
> any day now.

ZFS' original design for enterprise systems and redundancy requirements can make
some things difficult. You can't just add individual drives to a pool and tell
the system to reconfigure automatically. Instead, you have to either add a new
VDEV, or replace each of the existing drives with one of higher capacity. In an
enterprise environment, of course, you would just _buy_ a bunch of new drives
and move the data from the old pool to the new pool. Shrinking a pool is even
harder - put simply, ZFS is not built for this, though it is [being worked
on](https://www.delphix.com/blog/delphix-engineering/openzfs-device-removal).

If you absolutely must be able to add or remove single drives, ZFS might not be
the filesystem for you.
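
The two ways of growing a pool mentioned above might look roughly like this. It is a sketch under assumptions: the pool is called `tank`, all device names are placeholders, and the made-up `run` wrapper only prints each command unless `DRY_RUN=0` is set.

```shell
# Print each command instead of running it, unless DRY_RUN=0 is set.
# "run" is a hypothetical helper; the device names below are placeholders.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Option 1: grow the pool by adding a second mirrored VDEV.
run sudo zpool add tank mirror /dev/disk/by-id/EXAMPLE-DISK-C /dev/disk/by-id/EXAMPLE-DISK-D

# Option 2: replace the existing drives with larger ones, one at a time.
# With autoexpand on, the extra space becomes usable once every drive
# in the VDEV has been replaced.
run sudo zpool set autoexpand=on tank
run sudo zpool replace tank /dev/disk/by-id/EXAMPLE-OLD-DISK /dev/disk/by-id/EXAMPLE-NEW-DISK
```

With the second option, wait until `zpool status` reports that the resilver has finished before replacing the next drive.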
## Myths and misunderstandings
Information on the internet about ZFS can be outdated, conflicting or flat-out
wrong. Partially this is because it has been in use for almost 15 years now and
things change, partially it is the result of being used on different operating
systems which have minor differences under the hood. Also, Google searches tend
to first return the Oracle documentation for their closed source ZFS variant,
which is increasingly diverging from the open source OpenZFS standard.

To clear up some of the most common misunderstandings:
### No, ZFS does not need at least 8 GB of RAM
This myth is especially common [in FreeNAS
circles](https://www.ixsystems.com/community/threads/does-freenas-really-need-8gb-of-ram.38685/).
Curiously, FreeBSD, the basis of FreeNAS, will run with [1
GB](https://wiki.freebsd.org/ZFSTuningGuide). The [ZFS on Linux
FAQ](https://github.com/zfsonlinux/zfs/wiki/FAQ#hardware-requirements), which is
more relevant for Ansible-NAS, states under "suggested hardware":

> 8GB+ of memory for the best performance. It's perfectly possible to run with
> 2GB or less (and people do), but you'll need more if using deduplication.

(Deduplication is only useful in [special
cases](http://open-zfs.org/wiki/Performance_tuning#Deduplication). If you are
reading this, you probably don't need it.)

Experience shows that 8 GB of RAM is in fact a sensible minimum for
continuous use. But it's not a requirement. What everybody agrees on is that ZFS
_loves_ RAM and works better the more it has, so you should have as much of it
as you possibly can. When in doubt, add more RAM, and even more, and then some,
until your motherboard's capacity is reached.
### No, ECC RAM is not required for ZFS
This is another case where a recommendation has been taken as a requirement. To
quote the [ZFS on Linux
FAQ](https://github.com/zfsonlinux/zfs/wiki/FAQ#do-i-have-to-use-ecc-memory-for-zfs)
again:

> Using ECC memory for OpenZFS is strongly recommended for enterprise
> environments where the strongest data integrity guarantees are required.
> Without ECC memory rare random bit flips caused by cosmic rays or by faulty
> memory can go undetected. If this were to occur OpenZFS (or any other
> filesystem) will write the damaged data to disk and be unable to automatically
> detect the corruption.

ECC corrects [single bit errors](https://en.wikipedia.org/wiki/ECC_memory) in
memory. It is _always_ better to have it on _any_ computer if you can afford it,
and ZFS is no exception. However, there is absolutely no requirement for ZFS to
have ECC RAM. If you just don't care about the danger of random bit flips
because, hey, you can always just download [Night of the Living
Dead](https://archive.org/details/night_of_the_living_dead) all over again,
you're perfectly free to use normal RAM. If you do use ECC RAM, make sure your
processor and motherboard support it.
### No, the SLOG is not really a write cache
You'll read the suggestion to add a fast SSD or NVMe drive as a "SLOG drive"
(mistakenly also called "ZIL") for write caching. This isn't what happens,
because ZFS already includes [a write
cache](https://linuxhint.com/configuring-zfs-cache/) in RAM. Since RAM is always
faster, adding a disk as a write cache doesn't even make sense.

What the **ZFS Intent Log (ZIL)** does, with or without a dedicated drive, is handle
synchronous writes. These occur when the system refuses to signal a successful
write until the data is actually stored on a physical disk somewhere. This keeps
the data safe, but is slower.

By default, the ZIL initially shoves a copy of the data on a normal VDEV
somewhere and then gives the thumbs up. The actual write to the pool is
performed later from the write cache in RAM, _not_ the temporary copy. The data
there is only ever read if the power fails before the last step. The ZIL is all
about protecting data, not making transfers faster.

A **Separate Intent Log (SLOG)** is an additional fast drive for these temporary
synchronous writes. It simply allows the ZIL to give the thumbs up quicker. Like
the default temporary copy, a SLOG is never read unless the power has failed
before the final write to the pool.

Asynchronous writes just go through the normal write cache, by the way. If the
power fails, the data is gone.

In summary, the ZIL prevents data loss during synchronous writes, or at least
ensures that the data in storage is consistent. You always have a ZIL. A SLOG
will make the ZIL faster. You'll probably need to [do some
research](https://www.ixsystems.com/blog/o-slog-not-slog-best-configure-zfs-intent-log/)
and some testing to figure out if your system would benefit from a SLOG. NFS for
instance uses synchronous writes, SMB usually doesn't. When in doubt, add more
RAM instead.
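
Whether a write takes the ZIL path also depends on the `sync` property of the dataset. As a rough sketch, the helper below (the name `describe_sync` is made up for illustration) spells out what each of the three values accepted by `zfs set sync=...` means:

```shell
# Describe what a dataset's "sync" property value means for the ZIL.
# "describe_sync" is a hypothetical helper, not part of ZFS.
describe_sync() {
    case "$1" in
        standard) echo "only writes requested as synchronous go through the ZIL" ;;
        always)   echo "every write is treated as synchronous and goes through the ZIL" ;;
        disabled) echo "synchronous requests are treated as asynchronous; the ZIL is bypassed" ;;
        *)        echo "unknown sync value: $1" >&2; return 1 ;;
    esac
}

# On a real system the value would come from:
#   zfs get -H -o value sync tank/mefanfic
describe_sync standard
```

If testing shows that a SLOG would help, a log device is attached with `sudo zpool add tank log <device>`, where `<device>` is a placeholder for the drive in question.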
## Further reading and viewing
- In 2012, Aaron Toponce wrote a now slightly dated, but still very good
  [introduction](https://pthree.org/2012/04/17/install-zfs-on-debian-gnulinux/)
  to ZFS on Linux. If you only read one part, make it the [explanation of the
  ARC](https://pthree.org/2012/12/07/zfs-administration-part-iv-the-adjustable-replacement-cache/),
  ZFS' read cache.
- One of the best books on ZFS around is _FreeBSD Mastery: ZFS_ by Michael W.
  Lucas and Allan Jude. Though it is written for FreeBSD, the general guidelines
  apply for all variants. There is a second volume for advanced use.
- Jeff Bonwick, one of the original creators of ZFS, tells the story of how ZFS
  came to be [on YouTube](https://www.youtube.com/watch?v=dcV2PaMTAJ4).