SPOF #1 – Storage Node
To start with we’re going to need a minimum of two storage servers, although the system should scale to any (?) number of storage nodes. I’m using ZFS with RAID-Z2 as a storage platform, which introduces an additional layer of complexity but at the same time a layer of flexibility and resilience you’re not going to get from any other option. If you hear anyone mention “BTRFS” as an alternative to ZFS, run away (screaming). What ZFS gives us;
- Software RAID without the traditional RAID-5 failings
- RAID-5 resilience with striped reads, which is MUCH faster than traditional RAID
- Checksums and low-level data integrity checks
- Works well in ‘hot swap’ mode
- Snapshots, incremental backups, etc.
You can obviously run on any hardware, but ideally steer clear of “onboard” SATA controllers if you can and opt for something like an LSI MegaRAID controller. This might set you back £150 but it will be well worth it in terms of performance. You will find it difficult to push more than 500MB/s through a stock motherboard, even with 8 SATA III disks, whereas the same disks on a good controller will give you double. (Onboard controllers typically have a very limited bus width, so even if each channel can manage, for example, 180MB/s, the controller itself can’t really handle more than 3 channels flat out.)
Also worth a mention: *never* use the RAID software supplied on the card. You will always get better performance by making your controller present each disk as a JBOD device (or RAID0/1 device) and then using software RAID (ZFS in this case) to do all the RAIDing. Sounds mad (!) but if you think about it, your average RAID controller uses a 750MHz PowerPC chip, whereas your average server is running a 6 or 8 core 64-bit chip at 3GHz. Which one is going to give the best throughput?!
So, starting with a stock Ubuntu 12.04, first thing to do is add ZFS as follows;
add-apt-repository ppa:zfs-native/stable
apt-get update
apt-get install ubuntu-zfs zfs-dkms zfs-utils zfs-auto-snapshot mountall
And we should be ready to roll, so next we need to see what sort of hardware we have. In order to be truly portable and to work with hot-swapping, i.e. so we can survive either the system or the user changing the order in which the component disks are presented to the system, we’ll use the disk IDs as identifiers rather than relying on device names. Take a look in /dev/disk/by-id to see what’s available on your system;
# ls /dev/disk/by-id/scsi*|grep -v part
/dev/disk/by-id/scsi-350024e9204f28d67
/dev/disk/by-id/scsi-350024e9204f28d73
/dev/disk/by-id/scsi-350024e9204f28d7f
/dev/disk/by-id/scsi-350024e9204f28d92
/dev/disk/by-id/scsi-350024e9204f28da7
/dev/disk/by-id/scsi-350024e9204f28db7
/dev/disk/by-id/scsi-350024e9204f28dbe
/dev/disk/by-id/scsi-350024e9204f28dd8
/dev/disk/by-id/scsi-SATA_C300-CTFDDAC064000000001048030041DC
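If you'd rather not rely on grep matching the word "part" anywhere in a name, the same listing can be sketched as a small shell function. This is only an illustration: list_whole_disks is a made-up name, and it assumes partition links end in -partN, which is how udev names them on my systems.

```shell
#!/bin/sh
# list_whole_disks [DIR]
# Prints the scsi-* entries under DIR (default /dev/disk/by-id),
# skipping partition links, whose names end in -partN.
list_whole_disks() {
    dir="${1:-/dev/disk/by-id}"
    for path in "$dir"/scsi-*; do
        # If the glob matched nothing it is left unexpanded, so skip it
        [ -e "$path" ] || continue
        case "$path" in
            *-part[0-9]*) continue ;;
        esac
        printf '%s\n' "$path"
    done
}

# Safe to run anywhere: prints nothing if the directory has no scsi-* entries
list_whole_disks
```

The output is one path per line, which makes it easy to feed into other commands later on.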
As you can see I have 8 drives (which are 1TB each) and a 64GB SSD that I’m using as a root filesystem. Now we’re going to create a RAID-Z2 pool using all 8 disks, with two disks’ worth of parity, which should allow the array to survive up to 2 simultaneous drive failures.
zpool create -f srv raidz2 /dev/disk/by-id/scsi-350024e9204f28dbe \
    scsi-350024e9204f28dd8 \
    scsi-350024e9204f28db7 \
    scsi-350024e9204f28d7f \
    scsi-350024e9204f28da7 \
    scsi-350024e9204f28d67 \
    scsi-350024e9204f28d73 \
    scsi-350024e9204f28d92
Check this out with;
# zpool list
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
srv   7.25T   754G  6.51T    10%  1.00x  ONLINE  -
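If the SIZE figure looks bigger than you expected, that's because zpool list reports the raw capacity of a RAID-Z pool, parity included. Here's a rough sketch of the arithmetic, assuming a "1TB" disk means 10^12 bytes and ignoring metadata and padding overhead (raidz_capacity is just a name I've made up for illustration);

```shell
#!/bin/sh
# raidz_capacity DISKS PARITY DISK_BYTES
# Prints the raw and nominal usable capacity, in TiB, of a single RAID-Z vdev.
# Nominal usable space is simply (DISKS - PARITY) disks' worth of bytes.
raidz_capacity() {
    awk -v n="$1" -v p="$2" -v b="$3" 'BEGIN {
        tib = 2^40
        printf "raw=%.2fTiB usable=%.2fTiB\n", n * b / tib, (n - p) * b / tib
    }'
}

# Eight 1TB (10^12 byte) disks with two disks' worth of parity
raidz_capacity 8 2 1000000000000
# prints: raw=7.28TiB usable=5.46TiB
```

The raw figure lands close to the 7.25T that zpool list reports. Treat the usable figure as a nominal upper bound rather than what "df" will actually show.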
If you do a “df” you should also find that it’s created a mount-point and automatically mounted the filesystem for you. This is done automatically at boot time by the “mountall” package you installed earlier as part of the ZFS PPA. However, in order to activate this, you need to edit /etc/default/zfs and set ZFS_MOUNT='yes'.
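If you're setting up more than one server you may prefer to script that edit. A minimal sketch, assuming your /etc/default/zfs already contains an uncommented ZFS_MOUNT= line (check yours first; enable_zfs_mount is a made-up helper name);

```shell
#!/bin/sh
# enable_zfs_mount FILE
# Rewrites any line starting with ZFS_MOUNT= in FILE to ZFS_MOUNT='yes'.
# All other lines are left untouched. Assumes such a line already exists.
enable_zfs_mount() {
    file="$1"
    tmp=$(mktemp) || return 1
    if sed "s/^ZFS_MOUNT=.*/ZFS_MOUNT='yes'/" "$file" > "$tmp"; then
        # Write back with cat so the file keeps its owner and permissions
        cat "$tmp" > "$file"
        rc=$?
    else
        rc=1
    fi
    rm -f "$tmp"
    return "$rc"
}
```

Run it as root with enable_zfs_mount /etc/default/zfs, and have a look at the file afterwards to make sure the line really is there.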
You should also create a file called /etc/modprobe.d/zfs.conf containing the following line, which will limit the amount of cache space that ZFS can use (2.5GiB in this case);
options zfs zfs_arc_max=2684354560 zfs_arc_min=0
Unfortunately ZFS uses its own cache (the ARC), which is not integrated into Linux’s page cache (yet?), so if you don’t add this limit there is the potential for ZFS to consume all free memory, and this ‘can’ cause deadlock problems when the system page cache can’t get enough memory to operate. There is an argument for having this set by default, however … [!]
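The zfs_arc_max value is in bytes, which isn't the friendliest unit to type by hand. Here's a small sketch that builds the options line from a size in MiB (zfs_arc_options is a made-up helper name, and what size is right for your box is your call, not something I can tell you);

```shell
#!/bin/sh
# zfs_arc_options MAX_MIB
# Prints a modprobe options line capping the ZFS ARC at MAX_MIB mebibytes.
zfs_arc_options() {
    max_bytes=$(( $1 * 1024 * 1024 ))
    printf 'options zfs zfs_arc_max=%s zfs_arc_min=0\n' "$max_bytes"
}

# 2560 MiB (2.5 GiB) reproduces the figure used in this article
zfs_arc_options 2560
# prints: options zfs zfs_arc_max=2684354560 zfs_arc_min=0
```

Redirect the output into /etc/modprobe.d/zfs.conf once you're happy with the number.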
Now we can get ready to implement our clustered network filesystem, so we’ll make some filesystems in readiness;
zfs create srv/bricks
zfs create srv/isos
zfs create srv/images
We’re going to store filesystem data in the bricks folder, installation ISO images in the isos folder, and Virtual Machine images in the images folder.
Obviously the speed at which our network filesystems will work will be dependent on the speed of our network connections, so I’m opting for a 3-NIC approach, although you could use 5 or indeed use more expensive 10G NICs, which in a few years’ time I’m sure everyone will be using. So, assuming you have the right hardware (!), edit your /etc/network/interfaces file to look something like this;
auto lo
iface lo inet loopback
#
auto eth0
iface eth0 inet manual

auto eth3
iface eth3 inet manual

auto eth2
iface eth2 inet manual

auto public
iface public inet static
    address 22.214.171.124
    netmask 255.255.255.0
    gateway 126.96.36.199
    bridge_ports eth0
    bridge_fd 0
    bridge_stp off
    metric 1

auto data1
iface data1 inet static
    address 10.1.0.254
    netmask 255.255.255.0
    bridge_ports eth3
    bridge_fd 0
    bridge_stp off
    metric 1

auto data2
iface data2 inet static
    address 10.2.0.254
    netmask 255.255.255.0
    bridge_ports eth2
    bridge_fd 0
    bridge_stp off
    metric 1
Obviously your device names and address ranges will need to suit your hardware and network, and you need to make sure you have the bridge-utils package installed. We’re pretty much ready for the next stage now, just bear in mind you need to duplicate all this on a second server (incidentally, I’ve called my servers data1 and data2 in this instance).
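Since the data bridge stanzas only differ in name, address and port, you could generate them rather than copy and paste when you build the second server. A minimal sketch, assuming a /24 netmask throughout as in my config (bridge_stanza is a made-up helper, and the addresses for your second server are for you to choose);

```shell
#!/bin/sh
# bridge_stanza NAME ADDRESS PORT
# Prints an /etc/network/interfaces stanza for a static bridge called NAME
# with the given ADDRESS, bridging the physical interface PORT.
bridge_stanza() {
    cat <<EOF
auto $1
iface $1 inet static
    address $2
    netmask 255.255.255.0
    bridge_ports $3
    bridge_fd 0
    bridge_stp off
    metric 1
EOF
}

# Reproduces the data1 stanza from the config above
bridge_stanza data1 10.1.0.254 eth3
```

It only prints to stdout, so you can eyeball the result before pasting it into /etc/network/interfaces.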