Unfreakin’ Believable
by balleman on May.16, 2006, under Networking
There are good networking problems. These are the kind where something happens that you don’t understand. You mull it over, eventually diagram out the different layers of the various protocol stacks, and discover that things are working exactly as designed, just not as you expected. Ian and I uncovered an example of this kind of problem a few years back at CTI. There was a periodic surge in broadcast traffic on our LAN destined for a specific box. After ruling out all of the various NAT rules and such going on, we eventually found the simple cause. The box was not sending any packets of its own, so the switch had no forwarding entry for it. So it did what switches are supposed to do in that case: flood the packet out every other port, just like a broadcast. Coming to those kinds of conclusions is fun.
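For anyone who wants to see that behavior in miniature, here is a rough sketch of a learning switch. It is a toy model, not any vendor's implementation: the `LearningSwitch` class, the port numbers, the MAC names, and the 300 second aging time are all assumptions picked for illustration.

```python
class LearningSwitch:
    """Toy model of a MAC-learning switch (illustrative only)."""

    def __init__(self, ports, max_age=300.0):
        self.ports = set(ports)
        self.max_age = max_age      # assumed aging time, in seconds
        self.table = {}             # MAC address -> (port, last time seen)

    def receive(self, src, dst, in_port, now):
        """Return the list of ports a frame is sent out of."""
        # Learn (or refresh) where the sender lives.
        self.table[src] = (in_port, now)

        # Forget the destination if it has been silent for too long.
        entry = self.table.get(dst)
        if entry is not None and now - entry[1] > self.max_age:
            del self.table[dst]
            entry = None

        # Unknown destination: flood out every port except the ingress one.
        if entry is None:
            return sorted(self.ports - {in_port})

        # Known destination: forward out exactly one port.
        out_port = entry[0]
        return [] if out_port == in_port else [out_port]


sw = LearningSwitch(ports=[1, 2, 3, 4])
first = sw.receive("quiet-box", "router", in_port=3, now=0)     # box speaks once
fresh = sw.receive("router", "quiet-box", in_port=1, now=10)    # entry is fresh
stale = sw.receive("router", "quiet-box", in_port=1, now=400)   # entry aged out
```

Once the quiet box stops talking, its entry ages out and everything addressed to it goes out every port until it speaks again, which is the periodic surge we were seeing.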
There are at least three kinds of bad networking problems. The first is when something is evil by design. An infamous example of this would be the error measurements in the DS1-MIB. Absolutely useless when it comes to making time-series graphs, or setting event thresholds. The design seems to have been one of convenience, since the quantities represented date back to the first of the DS1 channel banks. Anachronistic design sucks.
Another type of evil networking problem is one that is seemingly random. Everyone knows that one of the first steps in troubleshooting is knowing what steps are needed to reproduce the problem. With random events, you’re essentially screwed. Turn up the debugging until it gets painful, wait for the event to recur, and pray you captured something useful if and when it does. DDoS attacks and the like can fall into this category, especially for those of us without the ability to do meaningful flow logging. Tracking down random problems is evil.
The third category of networking problems I feel like discussing this evening is the unreasonable problem. This is the non sequitur, the problem that makes you yell futilely at your terminal or coworker because of its complete and utter ridiculousness. If you manage to solve one of these, you might end up with a good networking problem, as described above. Or you might want to take a sledgehammer to a piece of hardware. Let’s explore some examples:
Near the end of my CTI experience, a certain Astrocom CSU/DSU was observed having a most unlikely problem. If I remember correctly, it somehow would drop packets over a certain size. A most improbable feat considering that your average CSU/DSU should not really have any concept of what a packet is, let alone be able to drop one. Patrick offered a bounty on the problem, but as far as I know, it was never solved.
The last example is the reason I am writing tonight. Last summer at Doug’s LAN party, I had difficulties copying large files to my desktop machine. I eventually blamed it on my patch cord, but by that time we were packing things up, so this was never really tested. Even before this, I was having trouble sending large print jobs to my printer. I quickly blamed this on the network card in the printer, or some PostScript oddities, but never came to a solution.
Later, when trying to use my desktop as a file server for a CentOS install, I realized that the network issue with my desktop was ongoing. Traffic analysis indicated that during a high-speed transfer from my desktop to another machine, the connection would stop passing packets. Packets on other TCP connections between the same machines were unaffected, but retransmissions of packets belonging to the dead connection were getting dropped somewhere. Since this seemed to be connection related, I checked the firewall settings and found them to be fine. I let the problem fester, as it was not causing any day-to-day difficulties, since my desktop isn’t ordinarily sending large amounts of data, just receiving.
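The pattern described above (one connection stuck resending the same data while others carry on) is easy enough to pick out of a capture by script. Below is a minimal sketch, assuming the capture has already been reduced to `(flow, seq, payload_length)` tuples; that record format, the `find_stalled_flows` name, and the retry threshold are all made up for illustration, and sequence number wraparound is ignored.

```python
from collections import defaultdict


def find_stalled_flows(packets, min_retries=3):
    """Return flows that end the capture resending the same segment.

    packets: iterable of (flow, seq, payload_length), in capture order,
             where flow is a (src_ip, src_port, dst_ip, dst_port) tuple.
    """
    highest = {}                 # flow -> highest sequence number seen
    repeats = defaultdict(int)   # flow -> resends of that highest segment

    for flow, seq, length in packets:
        if length == 0:
            continue             # pure ACKs carry no data, skip them
        if flow not in highest or seq > highest[flow]:
            highest[flow] = seq  # forward progress, so reset the counter
            repeats[flow] = 0
        elif seq == highest[flow]:
            repeats[flow] += 1   # same segment again: a retransmission

    return sorted(f for f, n in repeats.items() if n >= min_retries)


bulk = ("10.0.0.2", 40000, "10.0.0.3", 2049)   # hypothetical big transfer
ssh = ("10.0.0.2", 40001, "10.0.0.3", 22)      # hypothetical healthy flow
capture = [
    (bulk, 1000, 1460), (ssh, 500, 100),
    (bulk, 2460, 1460), (bulk, 2460, 1460), (ssh, 600, 100),
    (bulk, 2460, 1460), (bulk, 2460, 1460), (ssh, 700, 100),
]
stalled = find_stalled_flows(capture)
```

Here only the bulk transfer is flagged, while the second flow between the same two hosts keeps making progress, which matches what the traffic analysis showed.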
So, this evening I was talking with Doug about printer stuff, and he made a connection that I had been missing. Was my printer problem related to my network problem? And what about all of those problems with NFS in the recent past? Yes, they all sound like candidates. With that late realization, I delved into the annoying network problem. I replaced the network card. The problem persisted. The cabling was ruled out. And the problem was narrowed down to a switch. A specific switch port, to be precise. Now, I’m not exactly sure how one of my NetGear GS506 gigabit switches is managing to drop packets belonging to a specific connection when that connection begins spouting lots of traffic in a certain direction, but that’s exactly what it appears to be doing. And it’s reproducible. So yes, a problem as insidious as this one should be solved with the sledge, but some tape over port 1 of the switch should do. Thanks for the insight, Doug! And if you see a problem like this, check the switch, even though it doesn’t make any sense.