Wed, 18 Jul 2007
Single Point of Failure
Pavlína discovered that our home computer could not connect to the network. I tried to ping it from work: no response. A call to our ISP's hotline: the number was not available. Traceroute ended somewhere in CESNET (our academic network). Even their web site was unreachable.
Netbox link to NIX.CZ.
A look at the graph of their link to NIX.CZ revealed that there had indeed been some problem with their network for the last two hours. Apparently even their call center is connected over their own network via VoIP. Talk about a single point of failure.
6 replies for this story:
Vasek Stodulka wrote:
It is too hot for computers to operate. Karneval (my ISP) had two or three one-hour outages yesterday... :/ What is even worse, I have no non-alcoholic drinks left in the refrigerator and I am completely out of ice...
Miroslav Suchý wrote: Sitel power failure
It seems that Netbox goes to NIX through Sitel, and their data center had a big power failure. Reportedly a bad circuit breaker. They have three sources of power, a big UPS, and a diesel generator set. But one faulty circuit breaker and everything goes to the ...
I don't think the Sitel power failure was the problem. Other telcos in Sitel carried on without any trouble. Netbox's NIX BGP peer was last reinitialized 6 weeks ago:

    r2>sh ip bgp neighbors 22.214.171.124
    BGP neighbor is 126.96.36.199, remote AS 31246, external link
     Description: === NIX SMART COMP ===
     Member of peer-group EBGP-NIX-SMALL for session parameters
      BGP version 4, remote router ID 188.8.131.52
      BGP state = Established, up for 6w0d

It seems that Netbox has non-redundant lines in their backbone (maybe Prague-Brno). They announce a redundant topology on their web [http://www.sc.cz/cz/index.php?pageid=43], but the reality is probably different.
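The reasoning here is that a session which has been up for six weeks cannot have been reset by an outage that happened a day ago. A minimal sketch of that check in Python, assuming the uptime is given in the `NwNd` form seen in the output above (other uptime formats are not handled, and the function names `parse_uptime` and `session_survived` are made up for illustration):

```python
import re
from datetime import timedelta


def parse_uptime(text):
    """Parse an uptime string of the form '6w0d' into a timedelta.

    Assumption: only the weeks/days form shown in the router output
    above is supported; anything else raises ValueError.
    """
    match = re.fullmatch(r"(\d+)w(\d+)d", text)
    if match is None:
        raise ValueError(f"unsupported uptime format: {text!r}")
    weeks, days = (int(group) for group in match.groups())
    return timedelta(weeks=weeks, days=days)


def session_survived(uptime, outage_age):
    """Return True if the session was established before the outage began.

    Both arguments are timedeltas measured back from 'now'.
    """
    return uptime > outage_age


uptime = parse_uptime("6w0d")
print(uptime.days)                                   # 42
print(session_survived(uptime, timedelta(days=1)))   # True
```

This only shows that the peering session itself stayed up. It says nothing about the links behind that router.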
(Bezda: sorry for the removal of formatting - I need to fix the comment system). According to traceroutes, yesterday afternoon the Brno Netbox network was routed through Jihlava (with 30-40% packet loss). So there definitely is some redundancy, but the router in Jihlava (or the Brno-Jihlava line) probably could not handle the load. BTW, the page you mention also shows that the Netbox peering to SIX has been down since yesterday as well. Interesting.
It seems they have only two 100Mbit lines between Brno and Jihlava, so yesterday's packet loss is no surprise: http://www.sc.cz/images/linestats/jih_brn_1.png and http://www.sc.cz/images/linestats/jih_brn_2.png. It would be nice if those statistics were available in a better form than guessing the image URL from the names of other images.
These MRTG graphs (jih_brn_1.png and jih_brn_2.png) are identical; I guess they just draw two graphs from one source :) And this is not the only example of non-existent redundant lines: (BRNO) brn_data_1.png == brn_data_2.png, brn_cbix1.png == brn_cbix2.png [http://www.cbix.cz/cs/pripojene-site - 1 connected port to CBIX], (JIHLAVA) jih_1.png == jih_2.png, and maybe others... So where there are no redundant lines, I guess there is no redundant HW in the topology either, contrary to what the images on their web announce.
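A comparison like this can be automated rather than done by eye. The following is only a sketch in Python: it assumes the graph files have already been downloaded and are passed in as raw bytes, and the helper name `group_identical` is made up for illustration. It tests for byte-identical files, so two graphs drawn from the same source but rendered at different moments might still come out different.

```python
import hashlib
from collections import defaultdict


def group_identical(files):
    """Group file names whose contents are byte-for-byte identical.

    `files` maps a file name to its raw bytes. Returns a sorted list of
    sorted name lists, one per group with at least two members.
    """
    by_digest = defaultdict(list)
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        by_digest[digest].append(name)
    return sorted(sorted(names) for names in by_digest.values() if len(names) > 1)


# Placeholder bytes stand in for real image data here.
sample = {
    "jih_brn_1.png": b"graph-A",
    "jih_brn_2.png": b"graph-A",
    "other.png": b"graph-B",
}
print(group_identical(sample))  # [['jih_brn_1.png', 'jih_brn_2.png']]
```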