High Availability in Cloud Computing – Are the clouds really 99.99% available?
I know a director in Oracle’s High Availability group who told me that when crashes happen, they tend to have a domino effect. The redundancy and the interactions in these systems are so complicated that once more than a certain number of components start failing, the system goes into an unusable state, and this has a chain effect. So 99.99% availability is not a practical number. Oracle has already backed off the four 9’s; maybe Microsoft and Google should also revise the number – just my 2 cents.
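One way to see why the four 9’s claim is shaky: when every component in a chain must be up for the service to work, the availabilities multiply, so a dependency chain of “four 9s” components is collectively worse than four 9s. A minimal sketch (the numbers here are illustrative, not Oracle’s or anyone’s actual figures):

```python
# Sketch: availability of components in series multiplies.
# A chain of ten components, each individually 99.99% available,
# delivers noticeably less than 99.99% overall.

def serial_availability(component_availabilities):
    """Overall availability when every component must be up (series)."""
    result = 1.0
    for a in component_availabilities:
        result *= a
    return result

# Ten serial components at four 9s each:
overall = serial_availability([0.9999] * 10)  # ~0.9990, i.e. three 9s

# Translate that into expected downtime per year:
downtime_minutes_per_year = (1 - overall) * 365 * 24 * 60  # ~525 min/yr
```

And that is the optimistic case of independent failures; the domino effect described above means failures are correlated, which makes the real number worse still.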
Scalability testing is definitely a tough job. I remember working with a 100-node cluster of Atom machines, and at one point the front-tier web servers started failing due to overloading; the dynamic compute-partitioning mechanism took a hit but did not fail, thanks to loose coupling. Loose coupling is an important aspect of large-scale distributed systems. When development happens over a long time across different teams, slip-ups in the architecture can creep in; under overload, message queues and similar components may start failing and cascade into a choke point. Ideas, suggestions???
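One standard way loose coupling contains an overload instead of letting it cascade is a bounded queue between tiers that sheds work when full, rather than growing without limit and choking everything behind it. A minimal sketch of that idea (the class and numbers are hypothetical, not from the cluster described above):

```python
from collections import deque

# Sketch: a bounded queue between a fast producer tier and a slow
# consumer tier. When the consumer falls behind, excess requests are
# shed (rejected) instead of piling up, so the overload stays local
# to this hop rather than cascading into a choke point.

class BoundedQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = deque()
        self.shed = 0  # count of requests rejected under overload

    def offer(self, item):
        """Try to enqueue; shed the request if the queue is full."""
        if len(self._items) >= self.capacity:
            self.shed += 1
            return False
        self._items.append(item)
        return True

    def poll(self):
        """Consumer side: take the oldest item, or None if empty."""
        return self._items.popleft() if self._items else None

# Simulate a consumer that drains one item for every three offered:
q = BoundedQueue(capacity=5)
for i in range(30):
    q.offer(i)
    if i % 3 == 0:
        q.poll()
# The queue depth never exceeds capacity; the excess shows up in
# q.shed, where it can be monitored and the callers can back off.
```

The tight-coupling alternative (an unbounded queue, or blocking the caller) is exactly what turns one slow tier into a system-wide failure.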