Welcome to today’s tech challenge. In this article, I’d like to share an eye-opening moment from way back in my early days as an IT consultant. I was called in to help a business that was struggling with outages. The business was in the gift card processing industry: when gift cards were used at a store, the transactions went through its systems.
The architecture was pretty simple and had lots of redundancy built in. They had a co-location facility that acted as their main datacenter. In there were redundant Internet connections, redundant firewalls, a public DNS server and a 2-node database cluster.
They also had a completely redundant single-node architecture in their office, with a point-to-point connection between the two sites.
The simplified sketch below outlines this:

Looks like plenty of redundancy, right? Well, their complaint was that when their point-to-point link went down (which happened a lot for some reason), no one could get to anything in the datacenter.
My job was to fix this and find the root cause.
A survey of their main office didn’t find anything unusual. All servers and network devices were up and functioning.
However, a survey of the co-lo found something quite different. The secondary database server had the same IP address as the primary firewall, and the public DNS server had crashed.

At this point, the cause was obvious. With the public DNS server down in the co-lo, all traffic had to go through the main office to work. So when the point-to-point line went down, there was no DNS in the co-lo, and because everything had already failed over to the main office, nothing in the datacenter could be reached.
Clearly, they had a lot of issues. When I discussed this with the IT team, they said, “Yeah, redundancy has never really worked right here.” My response was, “Actually it has. That’s the only reason you’re up at all.”
Now, back to my question. What’s the one thing that every redundant network needs?
The answer, in my mind, is monitoring. After repairing the damage, I helped them put in a monitoring environment so they would know when they were in a failed state.
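To make the idea concrete, here is a minimal Python sketch of a redundancy-aware check. It is an illustration only, not the tooling we installed back then. The `Member` type, the function names, and the three state labels are all assumptions made for this example, and a plain TCP connect is only a rough stand-in for a real health probe. The point it shows is that a redundant group should be reported as degraded the moment one member stops answering, even though the service itself is still up.

```python
import socket
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """One device in a redundancy group (all names here are hypothetical)."""
    name: str
    host: str  # IP address as recorded in the inventory
    port: int  # TCP port used for the liveness probe


def tcp_alive(member, timeout=2.0):
    """Return True if a TCP connection to the member succeeds."""
    try:
        with socket.create_connection((member.host, member.port), timeout=timeout):
            return True
    except OSError:
        return False


def group_state(members, probe=tcp_alive):
    """Classify one redundancy group by how many of its members respond."""
    up = sum(1 for m in members if probe(m))
    if up == len(members):
        return "HEALTHY"
    # Some members answering means the service is up but running on its spare.
    return "DEGRADED" if up else "DOWN"


def duplicate_addresses(groups):
    """Map each address recorded for more than one device to those device names."""
    seen = defaultdict(set)
    for members in groups.values():
        for m in members:
            seen[m.host].add(m.name)
    return {ip: sorted(names) for ip, names in seen.items() if len(names) > 1}


def report(groups, probe=tcp_alive):
    """Return {group name: state} for every redundancy group."""
    return {name: group_state(members, probe) for name, members in groups.items()}
```

You would feed `report` a dictionary that maps each redundant group (DNS, firewalls, database cluster) to its members and raise an alert on anything that is not `HEALTHY`. `DEGRADED` is the state this customer had been living in without knowing it. Note that `duplicate_addresses` only catches conflicts that are visible in the inventory you give it; spotting a conflict on the live network takes a different kind of check.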
This was a unique case, in that every point of redundancy had failed over. That was 25 years ago. Networks are much more complex now, with hybrid-cloud environments and many pieces of architecture that could be failed over without anyone having visibility into it. Monitoring and situational awareness are key.
What are your thoughts?

