A report from the NOC
The party is well underway, and I was dumb enough to say the phrase “We should probably blog something?” aloud. Everyone agreed, and promptly told me to do it. Damn it.
Anyway, things are going disturbingly well. We were done with our setup 24 hours ahead of time, more or less. I’ve had the special honor of being the first to get a valid DHCP lease in the NOC and the first to get a proper DHCP lease “on the floor”. And I’ve zeroized the entire west side of the ship (i.e. reset the switches so that they request proper configuration, which involved physically walking to each switch with a console cable and a laptop).
But we have had some minor issues.
First, as you might have picked up, we have to tickle the edge switches a bit to get them to request configuration. This cost us a couple of hours of delay during the setup. It also means that whenever we get a power failure, our edge switches boot up in a useless state and we have to poke them with a console cable. We’ve been trying to improve this situation, but it’s not really a disaster.
We’ve also had some CPU issues on our distribution switches, mainly whenever we power on all the edge switches. To reduce the load, we disabled LACP, the protocol that controls how the three uplinks to each edge switch are combined into a single link. This worked great, until we ran into the next problem.
The next problem was a crash on one of the EX3300 switches that make up a distribution switch (each distribution switch has 3 EX3300 switches in a virtual chassis). We’re working with Juniper on the root cause of these crashes (we’ve had at least 4 so far as I am aware). A single member in a VC crashing shouldn’t be a big deal. At worst, we could get about a minute or two of down-time on that single distribution switch before the two remaining members take over the functionality.
However, since we had disabled LACP earlier, the crash caused some trouble: the link between the core router and the distribution switch didn’t come back up again, because that’s a job for LACP. This happened to distro7 on Wednesday. We were able to bring distro7 up again quite fast regardless, even with a member missing.
After that, we re-enabled LACP on all distribution switches, which was the cause of the (very short) network outage across the entire site on Wednesday.
Other than that, there is little to report. On my side, being in charge of monitoring and tooling (e.g. Gondul), the biggest challenge is that the ring is now a single virtual chassis, which makes it trickier to measure the individual members. That, and the fact that graphite-api has completely broken down.
Oh, and we’ve had to move our SRX firewall, because it was getting far too hot… more on that later?
This time, there were no funny pictures though!