02 January, 2011

Skype's recent outage


Skype suffered a massive outage just before Christmas, crippling the service for 24 hours. Fortunately for me, I don't use Skype as my primary telephone, and when I do use Skype, my usage is 'bursty'.

The detailed explanation given by Skype made me wonder about Skype's competitive advantage:
  • VoIP (shipping voice packets over the internet) is as old as the (internet) hills.
  • Skype's incredible growth / user base is down to two aspects:
    • The product launch happened to come at the moment in internet history when (a) people were purchasing machines for their own use at home and (b) broadband penetration was just taking off.
    • The service does a slightly better job of routing VoIP packets over the internet, using Skype supernodes.
Allow me to explain the last point (technicians might want to refer to Skype's explanation of its technology):
  • Skype uses a P2P concept, meaning that the VoIP packets aren't sucked into a central server (from the 'sender' of the voice packet) and then squirted out to the client (i.e. the recipient / listener of the call).
  • Instead, the speaker's machine sends the voice traffic directly to the listener's machine.
  • Previous VoIP services just lumped the VoIP packets out onto the internet and allowed the internet to determine the best method of getting the packets from speaker to listener. (Note that the internet is magnificently self-tuning to route packets in the most efficient manner from sender to receiver.)
  • Skype's differentiating factor is that, within its network, it defined some of its users' machines as 'supernodes'. (Skype's CEO indicates that there are tens of thousands of these supernodes.)
  • These supernodes take on additional responsibilities - see the detailed explanation. (I must admit that I thought that these machines took on a disproportionate amount of traffic, but no mention of this is made in the explanation.)
The outage occurred when a bug in a particular version of Skype caused some of these supernodes to crash. (Technically, there was a precipitating factor before the crash, but I'll gloss over that.) The crashes overburdened the remaining supernodes, which caused a cascading effect that brought down the whole Skype network.
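To make the cascading effect concrete, here is a minimal toy model in Python. It is my own sketch, not Skype's actual algorithm: I assume a fixed total load shared equally among the surviving supernodes, and that any supernode whose share exceeds its capacity also crashes. The function name, the equal-sharing rule and all the numbers are hypothetical.

```python
def simulate_cascade(total_load, capacities, initially_failed):
    """Return the set of node indices that end up failed.

    Toy assumptions (not Skype's real behaviour):
      total_load       -- work shared equally by all surviving supernodes
      capacities       -- per-node maximum load, in hypothetical units
      initially_failed -- indices knocked out by the original bug
    """
    failed = set(initially_failed)
    while True:
        alive = [i for i in range(len(capacities)) if i not in failed]
        if not alive:
            return failed  # total collapse: nothing left to carry the load
        share = total_load / len(alive)
        # Any surviving node asked to carry more than its capacity crashes too.
        overloaded = {i for i in alive if capacities[i] < share}
        if not overloaded:
            return failed  # the survivors can cope; the network stabilises
        failed |= overloaded


# Ten identical supernodes, each able to carry 12 units, sharing 100 units.
capacities = [12] * 10
print(len(simulate_cascade(100, capacities, {0})))     # one crash is absorbed
print(len(simulate_cascade(100, capacities, {0, 1})))  # two crashes take down all ten
```

In this toy model, losing one supernode is survivable but losing two pushes every survivor over its limit at once, which is the tipping-point behaviour that makes this kind of failure so abrupt.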

Conclusions:
  • In a network-based or distributed-communication system, defining supernodes will create a vulnerability. (In a true P2P network, each node is genuinely a peer of every other, so this isn't a concern.)
  • If you have a vulnerability, then for God's sake make sure you have control of the weak points. (If you read the detailed explanation of the failure, then you notice that Skype is reviewing the auto-update policy of its software, so that Skype can update these supernodes itself without users restarting their machine / Skype. You'll notice that Skype had already deployed a fix for the original bug, but not every machine had updated to the latest version. I assume that the update occurs when Skype restarts or when the machine restarts.)
  • It would appear that Skype didn't have an automatic way to promote ordinary nodes to supernodes when a systemic failure started to occur, to compensate for the diminishing network. (The detailed explanation of the failure indicates that Skype engineers had to manually inject several thousand "mega-supernodes" to compensate until the network of supernodes could recover.)
So, quite a mess, indicating that the disaster recovery plan needs some additional scenarios added!
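As an illustration of the automatic promotion I have in mind in the third conclusion, here is a rough Python sketch. Everything in it is an assumption on my part: the function name, the idea of a minimum supernode count, and the per-node "capacity score" are hypothetical and are not taken from Skype's explanation.

```python
def pick_promotions(supernodes, candidates, min_supernodes):
    """Choose ordinary nodes to promote so the pool reaches min_supernodes.

    supernodes     -- set of node ids currently acting as supernodes
    candidates     -- dict of ordinary node id -> capacity score (hypothetical metric)
    min_supernodes -- hypothetical floor below which promotion kicks in
    """
    shortfall = min_supernodes - len(supernodes)
    if shortfall <= 0:
        return set()  # pool is healthy, nothing to do
    # Prefer the strongest machines, skipping any that are already supernodes.
    eligible = [n for n in candidates if n not in supernodes]
    ranked = sorted(eligible, key=lambda n: candidates[n], reverse=True)
    return set(ranked[:shortfall])


# One supernode left, a floor of three: promote the two strongest candidates.
print(pick_promotions({"a"}, {"x": 5, "y": 9, "z": 1}, 3))
```

A check like this would need to run continuously and react faster than the supernodes are failing, otherwise it only delays the collapse.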

Full credit to Skype for the level of information that they provided their users - see the Skype blog for December. Also, they offered paying users some credit to compensate for the lack of service - my voucher arrived today.
