No single point of failure …

Sep 24th, 2012

Over the last 20 years, practically since on-line services came about, I’ve been chasing the holy grail of computing in the Internet age .. the system with no single point of failure.

It started with resilience against the most common fault computers used to experience: hard drive failure. This kind of fault is relatively easy to protect against with RAID-type disk arrays, but things move on quickly. Not only did we find we needed protection against component failure, but also against downtime caused by upgrades, network outages, application crashes and so on. Thus far nobody has really presented a ‘completely’ fault tolerant system.

I know the “big guys” apparently have such systems, but these seem to be application specific (Google, for example) or don’t actually provide 100% uptime (AWS, for example)! Not to mention that many of these options are way beyond the pockets of us mere mortals.

I’ve been working on the problem for a number of years and we now have a system that I think approaches the target that’s been so elusive: a generic platform, capable of running on-line applications, that has no single point of failure and is infinitely scalable. (Almost forgot, “and that performs!”)

I’m going to document the entire system here to the extent it can be reproduced by anyone with sufficient equipment, time and desire, albeit I’m thinking the article is going to span many blog posts and will take a little time to get it all ‘down’.

To kick it off, these are the specific issues the system attempts to address:

  • Local data storage, performance, resilience and integrity
  • Storage clustering, how to make common storage available to multiple nodes simultaneously
  • Portability, allowing instances to migrate between physical nodes to avoid downtime due to hardware issues / upgrades
  • Space efficiency, allocating chunks of disks to different VMs generally wastes huge amounts of storage capacity
  • Performance, almost always bottlenecks on IO, whether it be local disk or network FS
  • Geographic dispersal, putting all your eggs in a Data Centre proves to be a bad move when the DC burns down

I’ll be including some code and methods I’ve developed along the way that I’m pretty sure are particular to our implementation, so if clouds and virtualisation are ‘your thing’, check back for updates and you can at least tell me which bits you think are a bad idea … :)

Here are the instalments as they appear:

SPOF #1 – Storage Node


