SPOF #1 – Storage Node

To start with we’re going to need a minimum of two storage servers, although the system should scale to any (reasonable) number of storage nodes. I’m using ZFS with RAID-Z2 as a storage platform, which introduces an additional layer of complexity, but at the same time a layer of flexibility and resilience you’re not going to get from any other option. If you hear anyone mention “BTRFS” as an alternative to ZFS, run away (screaming).

Why ZFS?

  • Software RAID without the traditional RAID-5 failings
  • RAID-5 resilience with striped reads, which is MUCH faster than traditional RAID
  • Checksums and low-level data integrity checks
  • Works well in ‘hot swap’ mode
  • Snapshots, incremental backups, etc. etc.

You can obviously run on any hardware, but ideally steer clear of “onboard” SATA controllers if you can and opt for something like an LSI MegaRAID controller. This might set you back £150, but it will be well worth it in terms of performance. You will find it difficult to push more than 500MB/sec through a stock motherboard, even with 8 SATA III disks, whereas the same disks on a good controller will give you double. (Onboard controllers typically have a very limited bus width, so despite, for example, 180MB/sec per channel, the controller itself can’t really handle more than 3 channels flat out.)
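To put rough numbers on that parenthetical (the 180MB/sec-per-channel and 3-channel figures are the assumptions from the text above, not measurements):

```shell
# Rough arithmetic behind the onboard-vs-dedicated controller claim.
# Assumed: ~180MB/sec per SATA channel, and an onboard controller
# that can only sustain ~3 channels flat out.
per_channel=180
disks=8
echo "disks could supply: $((per_channel * disks)) MB/sec"   # → 1440 MB/sec
echo "onboard bus limit:  $((per_channel * 3)) MB/sec"       # → 540 MB/sec
```

So the disks can source almost three times what a typical onboard controller can move, which is why the dedicated card roughly doubles real-world throughput.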

Also worth a mention: *never* use the RAID software supplied on the card; you will always get better performance by making your controller present each disk as a JBOD device (or a single-disk RAID0/1 device) and then using software RAID (ZFS in this case) to do all the RAID work. Sounds mad (!) but if you think about it, your average RAID controller uses a 750MHz PowerPC chip, whereas your average server is running a 6 or 8 core 64-bit chip at 3GHz. Which one is going to give the best throughput?!

So, starting with a stock Ubuntu 12.04, first thing to do is add ZFS as follows;

add-apt-repository ppa:zfs-native/stable
apt-get update
apt-get install ubuntu-zfs zfs-dkms zfs-utils zfs-auto-snapshot mountall

And we should be ready to roll, so next we need to see what sort of hardware we have. In order to be truly portable and to work with hot-swapping, i.e. so we can survive either the system or the user changing the order in which the component disks are presented to the system, we’ll work with the disks’ IDs as identifiers, rather than relying on device names. Take a look in /dev/disk/by-id to see what’s available on your system;

# ls -l /dev/disk/by-id/scsi* | grep -v part

As you can see I have 8 drives (1TB each) and a 64GB SSD that I’m using as a root filesystem. Now we’re going to create a RAID-Z2 pool using all 8 disks, with 2 parity disks, which should allow the array to survive up to 2 simultaneous drive failures.
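As a quick sanity check on what we should end up with: raidz2 gives you (disks − parity) worth of usable space. A rough sketch (ignoring ZFS metadata overhead and the TB/TiB difference):

```shell
# Usable space in a raidz2 vdev: total disks minus the 2 parity disks.
disks=8
parity=2
size_tb=1
echo "usable: $(( (disks - parity) * size_tb ))TB"   # → usable: 6TB
```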

zpool create -f srv raidz2 /dev/disk/by-id/scsi-350024e9204f28dbe \
 scsi-350024e9204f28dd8 \
 scsi-350024e9204f28db7 \
 scsi-350024e9204f28d7f \
 scsi-350024e9204f28da7 \
 scsi-350024e9204f28d67 \
 scsi-350024e9204f28d73

Check this out with;

# zpool list
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
srv   7.25T   754G  6.51T    10%  1.00x  ONLINE        -

If you do a “df” you should also find that it’s created a mount-point and automatically mounted the filesystem for you. This is done automatically at boot time by the “mountall” package you installed earlier as part of the ZFS PPA. However, in order to activate this, you need to edit /etc/default/zfs and set ZFS_MOUNT='yes'.

You should also create a file called /etc/modprobe.d/zfs.conf containing the line "options zfs zfs_arc_max=2684354560 zfs_arc_min=0", which will limit the amount of cache space that ZFS can use. Unfortunately ZFS uses its own page cache (the ARC) which is not integrated into Linux’s page cache (yet?), so if you don’t add this limit there is the potential for ZFS to consume all free memory, and this ‘can’ cause deadlock problems when the system page cache can’t get enough memory to operate. There is an argument for having this set by default, however … [!]
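For reference, here’s how that zfs_arc_max figure decodes – 2.5GB is a figure that suits this particular box (an assumption on my part; size it to your own machine’s RAM):

```shell
# Decode zfs_arc_max: 2684354560 bytes = 2560MB = 2.5GB.
arc_max=2684354560
echo "$(( arc_max / 1024 / 1024 ))MB"   # → 2560MB
# And going the other way, 2.5GB expressed in bytes:
echo $(( 2560 * 1024 * 1024 ))          # → 2684354560
```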

Now we can get ready to implement our clustered network filesystem, so we’ll make some filesystems in readiness;

zfs create srv/bricks
zfs create srv/isos
zfs create srv/images

We’re going to store filesystem data in the bricks folder, installation ISO images in the isos folder, and Virtual Machine images in the images folder.


Obviously the speed at which our network filesystems work will depend on the speed of our network connections, so I’m opting for a 3-NIC approach, although you could use 5, or indeed use more expensive 10G NICs .. which in a few years’ time I’m sure everyone will be using. So, assuming you have the right hardware (!), edit your /etc/network/interfaces file to look something like this;

auto lo
iface lo inet loopback
auto eth0
iface eth0 inet manual
auto eth3
iface eth3 inet manual
auto eth2
iface eth2 inet manual
auto public
iface public inet static
 bridge_ports eth0
 bridge_fd 0
 bridge_stp off
 metric 1
auto data1
iface data1 inet static
 bridge_ports eth3
 bridge_fd 0
 bridge_stp off
 metric 1
auto data2
iface data2 inet static
 bridge_ports eth2
 bridge_fd 0
 bridge_stp off
 metric 1
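Note that the static stanzas above omit the actual addressing; each bridge also wants address/netmask lines. One of the data bridges might look like this (10.0.1.0/24 is purely an example range, not from my setup – substitute your own):

```
auto data1
iface data1 inet static
 address 10.0.1.1
 netmask 255.255.255.0
 bridge_ports eth3
 bridge_fd 0
 bridge_stp off
 metric 1
```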

Obviously your device names and address ranges will need to suit your hardware and network, and you need to make sure you have the bridge-utils package installed. We’re pretty much ready for the next stage now; just bear in mind you need to duplicate all this on a second server. (Incidentally, I’ve called my servers data1 and data2 in this instance.)

-> Part II – SPOF # 2 Clustered Filesystem

