The Gathering Technical blog

The instability in the network at TG13

03 Apr 2014, by Marius Hole from Tech:Net

There have been a lot of questions about the instability in the network at TG13. We are very sorry for not explaining this during TG13 or any time sooner than this. We are very, very sorry. But behold, the explanation is here... and I can say it in one short sentence:

We had a defective backplane on one of the two supervisor slots of the 6500 that terminated the internet connection.

First a short intro to Cisco, Catalyst 6500, sturdiness and reliability: the Cisco Catalyst 6500 is a modular switch that has been in production since 1999. It's packed with functionality and is one of the most reliable modular switches we know of.

You may ask why we ended up with a defective one if they are so reliable... and you are right to ask. The answer is split in two... One: they have produced and sold many, many thousands of units - and even with very strict test routines there will always be some troublesome units. Two: this type of equipment is not meant to be shipped and moved back and forth as much as the Cisco Demo Depot equipment is. In the end the equipment will show problems caused by all the moving back and forth, not to mention the line cards constantly being swapped.

If we were a regular customer, we would have run the equipment through extensive testing. And as a regular customer with a service agreement, we would have contacted the Cisco TAC (Technical Assistance Center) and registered a service request. They would have either solved the problem remotely, if it were a software bug or configuration error, or issued an RMA (return material authorization) for the faulty hardware. This takes time, and that is something we usually plan for when building a new, big network.

At TG we test all the equipment before we ship it to Vikingskipet. On the 6500 and 4500 we run the command "diagnostic bootup level complete" and reboot the switches. When they are up again we have a complete diagnostic of all the line cards in the switch, and we can easily spot if something is wrong. Both of our 6500s passed the tests last year, so we shipped them up to Vikingskipet and configured them, and everything was seemingly ok.

But not everything was ok. Not at all. We noticed something weird on Wednesday after the participants arrived. We had some strange latency and loss on the link between us and Blix Solutions. The interfaces did not have drops or errors; we just saw that more packets went out than came back. We first involved Blix Solutions in the investigation. They checked everything at their end and everything was ok there, so we involved Eidsiva bredbånd. They checked everything between Blix Solutions and us, and everything was fine there as well.

This started to get stranger and stranger... Traffic was going out, but didn't come back. No interface drops. No interface errors. Peculiar.

The setup was quite simple... 4x10Gig interfaces in a port-channel bundle towards Blix Solutions, with 2x10Gig interfaces (X2) in each supervisor (Sup720).

So after some testing and a few different hypotheses during Wednesday, we waited until the day was over. Towards Thursday morning the traffic dropped below 8Gig. Once it was stable below 8Gig, we forced all the traffic onto one 10Gig interface at a time - first on the topmost supervisor. The traffic was going smooth as silk, and I'm very sorry to say - but we did actually fail it over, one interface at a time, to the 10Gig interfaces on the second supervisor for five minutes - and guess what? The internet traffic had drops and latency... We forced it back to the first supervisor and everything went smoothly again.
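The reasoning behind this test can be sketched in a few lines of Python. This is only an illustration of the elimination logic, not a description of any tooling used at TG13. The supervisor and interface labels and the observation values below are made up for the example; only the 10Gig link capacity and the 8Gig traffic level come from the story above.

```python
from collections import defaultdict
from dataclasses import dataclass

LINK_CAPACITY_GBPS = 10.0  # each bundle member is a 10Gig interface (from the post)


@dataclass(frozen=True)
class Observation:
    """Result of carrying all traffic on one bundle member for a while."""
    supervisor: str  # hypothetical label, e.g. "sup-top"
    interface: str   # hypothetical interface label
    loss_seen: bool  # True if drops or latency were observed


def fits_on_one_link(traffic_gbps, capacity_gbps=LINK_CAPACITY_GBPS):
    """True if the current load can be carried by a single member link."""
    return traffic_gbps < capacity_gbps


def suspect_supervisors(observations):
    """Return supervisors where every tested interface showed loss.

    A supervisor is only reported if at least one other supervisor
    tested completely clean, since without a clean reference the
    fault could just as well sit somewhere else in the path.
    """
    by_sup = defaultdict(list)
    for obs in observations:
        by_sup[obs.supervisor].append(obs.loss_seen)
    all_bad = sorted(sup for sup, results in by_sup.items() if all(results))
    any_clean = any(not any(results) for results in by_sup.values())
    return all_bad if any_clean else []


# Illustrative values only, shaped like the test described above.
observations = [
    Observation("sup-top", "top-1", loss_seen=False),
    Observation("sup-top", "top-2", loss_seen=False),
    Observation("sup-bottom", "bottom-1", loss_seen=True),
    Observation("sup-bottom", "bottom-2", loss_seen=True),
]

if fits_on_one_link(8.0):
    print(suspect_supervisors(observations))  # ['sup-bottom']
```

The capacity check is why the test had to wait for the quiet hours: all traffic has to fit on a single 10Gig member before it is safe to pin it there.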

We now knew why we had the problems, but then we had to find a solution on the fly, in production, before 6000 nerds woke up to life and started using the internet again! OMG! WHAT A PRESSURE! oO

As you may know, the transport between Vikingskipet and Blix Solutions is "colored", and we only had colored optical transceivers in the SFP+ format, with SFP+ converters in X2 format. The rest of the 10Gig interfaces on the 6500 were XENPAK.

Luckily for us, Blix Solutions had put a TransPacket box at each end, and this box can be configured with the right wavelengths (colors) on its interfaces. That gave us the possibility to translate the wavelength in the TransPacket box and receive on a 10Gig XENPAK interface on the regular 1310 nm single-mode wavelength.

But this was not in place until a little way into Thursday. So all the instability you may have experienced on Wednesday and "early" Thursday was caused by this. After this, there should not have been any trouble or instability in the network as far as we know.

At the same time as we explain this, we also have to applaud both Eidsiva bredbånd and Blix Solutions for their service. They were working with us every step of the way to find a solution. And we can really say we now know why Blix has the name they have... Blix Solutions - because they are very open and forthcoming about finding a solution, however far-fetched or unrealistic it is.

Thank you both, Eidsiva bredbånd and Blix Solutions, for the cooperation at TG13!

We really look forward to working with both of you for TG14 :)

Marius Hole

My name is Marius Hole and I'm a network consultant working for Atea AS in Oslo, Norway. My focus is networking (core/dc, distribution, access/wireless). I currently hold CCNP and CCNP Wireless certifications, and I'm working towards the CCIE DC. In my spare time I play games, socialize, and watch movies and TV series.

Crew: Tech:Net

About

TG - Technical Blog is the unofficial rambling place of the Info:Systems, Tech:Net and Tech:Server crews from The Gathering.
