SPOF #2 – Clustered Filesystem

For true resilience, we need at the very least to be able to re-locate a virtual machine to different physical hardware without restarting it, which gives us the ability to upgrade and maintain the physical platform without interruption to service. To facilitate this, the data we use needs to exist on some sort of shared storage platform, and as we’re playing with commodity hardware and have ‘real’ pockets – this means a network filesystem.

Here are the options I’ve tried in the past, and specifically why I’m not using them this time around;

  • DRBD/LVM – poor write performance, does not scale (two nodes only)
  • DRBD/OCFS2 – poor performance, unreliable, two nodes only, does not scale well (or indeed at all)
  • GFS2 – limited performance, complex to set up and maintain, and on a small scale horribly unreliable
  • NFS or CIFS – no ability to cluster / replicate

We’re using Gluster 3.3

What is Gluster and why is the “3.3” significant? Well, Gluster has been around for quite some time, typically as a network or distributed filesystem (it’s very flexible and can be configured / tuned in many ways). However, this year the company behind Gluster was bought up by Redhat, who decided that if all the issues that had been raised over the years by users / developers were actually addressed, Gluster would be ideal for storing Virtual Machine images on.

So, 3.2 is missing all the features you want to make Gluster a good / solid VM storage platform, whereas 3.3 has pretty much all of the features you want .. so when you hear Redhat touting “Redhat Storage 2.0”, you’ll know what it is they actually have “under the hood” .. :)

VM Partitioning

In an ideal world, I’d be running a minimum of 4 data nodes, however in reality I only have two, yet I’d like to run “striped replicated” storage that needs a minimum of 4. Why striped-replicated? Well, striping means reading alternate blocks from alternate servers, so with two servers that gives me a network bandwidth of 200Mbytes/sec (2Gbit/sec) rather than 100, and replication means that if one data node goes down, my network storage is still available, albeit at half speed (running at half speed during a fault is infinitely preferable to not running at all!).
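As a rough back-of-the-envelope check on those figures (assuming the two storage links are gigabit, which is what the 2Gbit/sec figure implies);

1 Gbit/s per link ÷ 8   ≈ 125 Mbytes/sec raw, ~100 Mbytes/sec in practice
2 links, striped        ≈ 200 Mbytes/sec aggregate reads (2 Gbit/sec)
1 link (node down)      ≈ 100 Mbytes/sec, i.e. half speed but still running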

I’m afraid I’m going to let a little creativity creep in on your part here as setting up virtual machines is a relatively trivial process, yet time-consuming to document. In principle you need to install the qemu-kvm and virt-manager packages on both data nodes, then install the virt-manager package on your workstation. Then create volumes data1:/srv/bricks/storage1.img, data1:/srv/bricks/storage2.img, data2:/srv/bricks/storage3.img and data2:/srv/bricks/storage4.img; each should be created as a sparse QCOW2 image, and I would recommend something in the 500G-1TB range, depending on your requirements. This can be done using the virt-manager GUI, or alternatively on the command line with something like;

qemu-img create -f qcow2 /srv/bricks/storage1.img 1000G
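If you want to reassure yourself the image really is sparse, qemu-img will report the virtual size alongside the actual disk usage, and du shows how little space it occupies so far;

qemu-img info /srv/bricks/storage1.img
du -sh /srv/bricks/storage1.img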

Then install 4 virtual machines, gluster1 using storage1 and gluster2 using storage2 on data1, then gluster3 using storage3 and gluster4 using storage4 on data2.
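If you’d rather script the installs than click through virt-manager, something along the lines of the sketch below should work for each instance; the RAM / CPU sizes, ISO path and network names here are purely illustrative assumptions, and you’ll want to add two more --network options (e.g. --network bridge=...) for the private storage links described below;

virt-install \
    --name gluster1 \
    --ram 2048 --vcpus 2 \
    --disk path=/srv/bricks/storage1.img,format=qcow2,bus=virtio \
    --cdrom /srv/iso/ubuntu-server.iso \
    --network network=default \
    --os-type linux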

Each gluster instance should have three network connections, the first should be a NAT connection so your instance can connect to the outside world, then the other two should straddle the two interfaces (data1, data2) we created on each storage node for connecting privately to the other machines. Your /etc/network/interfaces file in each gluster node should look something like this;

auto lo
iface lo inet loopback
auto eth0
iface eth0 inet dhcp
auto eth1
iface eth1 inet static
    address 10.1.0.1
    netmask 255.255.255.0
auto eth2
iface eth2 inet static
    address 10.2.0.1
    netmask 255.255.255.0
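Once the file is saved you can bring the two private interfaces up without rebooting (assuming the stock ifupdown tools that go with /etc/network/interfaces);

ifup eth1
ifup eth2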

Give each node an incrementally higher address; my /etc/hosts file looks like this;

10.1.0.1 gluster1
10.2.0.1 gluster1
10.1.0.2 gluster2
10.2.0.2 gluster2
10.1.0.3 gluster3
10.2.0.3 gluster3
10.1.0.4 gluster4
10.2.0.4 gluster4

Once this is all set up and assuming all the nodes can “ping” each other, then we’re ready to roll.
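A quick loop run from any node saves testing all eight addresses by hand (purely a convenience, adjust the list to match your addressing);

for ip in 10.1.0.1 10.1.0.2 10.1.0.3 10.1.0.4 10.2.0.1 10.2.0.2 10.2.0.3 10.2.0.4; do
    ping -c 1 -W 1 $ip > /dev/null && echo "$ip OK" || echo "$ip unreachable"
done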

Clustering

The next step is to install Gluster on all four nodes. Note that there is a server component and a client component; however, my preference is to make every node a member of the Gluster cluster, which helps when it comes to resilience / failover, hence the installation process is identical for each machine.

apt-get install gnome-keyring
apt-get install python-software-properties
add-apt-repository ppa:semiosis/glusterfs-3.3
apt-get update
apt-get install glusterfs-client glusterd
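Once the packages are installed, it’s worth a quick check that you really did get a 3.3 build from the PPA;

glusterfs --version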

Once this is done, set up the cluster; log in to gluster1 and do the following;

gluster peer probe 10.1.0.2
gluster peer probe 10.1.0.3
gluster peer probe 10.1.0.4
gluster peer status

Assuming it was happy to connect, you should see the following output;

Number of Peers: 3

Hostname: 10.2.0.4
Uuid: 9e6ee031-c4ce-44a1-a31c-92d04298ee24
State: Peer in Cluster (Connected)

Hostname: 10.2.0.3
Uuid: 9e6ee031-c4ce-44a1-a31c-92d04298ee22
State: Peer in Cluster (Connected)

Hostname: 10.1.0.2
Uuid: 9e6ee031-c4ce-44a1-a31c-92d04298ee21
State: Peer in Cluster (Connected)

The next thing we need to do is create a clustered filesystem, which is where it becomes apparent WHY we need 4 nodes and why 2 just won’t cut it. What we’re going to do is set up a striped filesystem between gluster1 and gluster3, so alternate blocks are read from alternate machines, thus giving us the benefit of 2 x network links. We then need to replicate each block so we’ve a copy in case anything bad happens, so we need to replicate gluster1 onto gluster4 and gluster3 onto gluster2.

Note that the order here is critical. If you replicate gluster1 onto gluster2, for example, and the physical server data1 crashes, both the primary and replica volumes go off-line together, and your filesystem will crash. If set up as described and data1 crashes, the filesystem will be expecting blocks from gluster1 and gluster3; as gluster1 is unavailable (it sits on data1) it will use its replica on gluster4, which is available because it sits on data2. Conversely, if data2 goes down, it will be expecting blocks from gluster1 and gluster3; as gluster3 is on data2 it will be unavailable and requests will revert to the replica on gluster2, which is available as it sits on data1 .. headache yet? Try the sketch below;
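This is just the layout described above drawn out (it isn’t output from any tool);

             data1 (physical)                  data2 (physical)
             ----------------                  ----------------
stripe  -->  gluster1 (blocks 1,3,5...)        gluster3 (blocks 2,4,6...)
replica -->  gluster2 (copy of gluster3)       gluster4 (copy of gluster1)

data1 down:  gluster1's half of the stripe is served from gluster4 (on data2)
data2 down:  gluster3's half of the stripe is served from gluster2 (on data1)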

With any luck it’s now becoming apparent that although we can have striping or replicas on two machines, which would for example mean that we wouldn’t need to create virtual machines to run them in, to get both we really do need 4 instances. Aah-ha, you’re going to say, but why can’t we just use two different storage blobs from data1 and data2, rather than having to encapsulate them in VM’s?! Answer: there is a ‘design feature’ in gluster that associates a gluster instance / uuid with a specific IP address, which does not play well with trying to load-balance requests across different network cards, and although things may change, at this point in time you do want to use VM’s.

Incidentally, when building your VM’s don’t forget to edit /etc/default/grub and add “elevator=noop” after the “quiet” kernel parameter (and then run update-grub and reboot). This little tweak will almost double the IO throughput on your VM!
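In practice that means making the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub on each gluster VM look something like this, then rebuilding the grub config (the near-doubling mentioned above will obviously vary with your hardware);

GRUB_CMDLINE_LINUX_DEFAULT="quiet elevator=noop"

update-grub
reboot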

You’re now ready to ‘do’ something, so you’ll need another machine (!) which will be one of the machines you’re going to use as a front-end processor / machine to run Virtual Machines on. I typically use a relatively high powered processor, lots of RAM, and a single 120G SSD. Partition 20G of the SSD as the root filesystem, then partition the rest as LVM space, i.e. three raw partitions: 2G swap, 20G root, then the remainder as type “8e”. Once installed, you can set up LVM with;

pvcreate /dev/sda3
vgcreate cache /dev/sda3
vgs

This should return a list showing a single volume group called ‘cache’ with 90G+ of free space. Now you can add your new machine to the Gluster cluster; let’s say you called your machine node1 with address 10.1.0.252, then from gluster1 just do;

gluster peer probe 10.1.0.252

To make full use of striping, you really want a box with three NIC’s, one for public access (i.e. to take Internet requests for hosted VM’s) and two for access to the shared filesystem. This is my /etc/network/interfaces file from node1;

auto lo
iface lo inet loopback
#
auto eth0
iface eth0 inet manual
    metric 0
auto public
iface public inet static
    address 193.111.184.4
    netmask 255.255.255.0
    gateway 193.111.184.1
    bridge_ports eth0
    bridge_fd 0
    bridge_stp off
    metric 1
#
auto eth1
iface eth1 inet static
    address 10.1.0.252
    netmask 255.255.255.0
auto eth2
iface eth2 inet static
    address 10.2.0.252
    netmask 255.255.255.0

So, let’s create our network filesystem; from gluster1 do;

gluster volume create storage \
    stripe 2 replica 2 transport tcp \
        10.1.0.1:/srv/storage \
        10.2.0.3:/srv/storage \
        10.2.0.4:/srv/storage \
        10.1.0.2:/srv/storage
gluster volume set storage nfs.disable ON
gluster volume start storage
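Before mounting anything it’s worth confirming the volume came up as Striped-Replicate, and that the bricks are listed in the order you intended;

gluster volume info storage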

Then on node1 we can do;

mkdir -p /srv/storage
mount -t glusterfs localhost:/storage /srv/storage

And we should be good to go!

I should probably explain the “localhost” component of this mount command. If we had back-end storage nodes running glusterd and front-end nodes just using the gluster client, we would need to use one of the gluster[1-4] addresses rather than localhost. However, as we’ve joined node1 to the Gluster cluster, we can connect to localhost and gluster will work out where the physical data is actually stored. This is good from a resilience point of view: if, for example, we try to mount a volume using the address of gluster1, and gluster1 is down, the volume won’t mount. If we use localhost, then so long as gluster is up on node1 and there are sufficient back-end nodes running to support the filesystem, gluster will work out where they are and use them automatically.
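If you want the mount to come back by itself after a reboot, something along these lines in /etc/fstab on node1 should do it; the _netdev option is just there to make the mount wait for the network, so treat this as a sketch and adjust to taste;

localhost:/storage  /srv/storage  glusterfs  defaults,_netdev  0  0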

Part III – Coming Soon!
