Musings on the Cisco TAC

Submitted by reuben on Mon, 29/05/2006 - 00:39

I've had a number of dealings with the cisco TAC over the last few years - mostly as an employee of a cisco Premier or Gold Partner, where I've occasionally needed the TAC for troubleshooting or help resolving problems.  In the past we have, for one reason or another, tried to keep the number of TAC cases opened to a minimum, as this affects the discount we get from cisco in the coming year when we buy gear from them.  That is why I would be lucky if I've opened 3 cases a year in the last 3 years.

However now that I have my own maintenance contract with no restrictions, I've started to be a bit more forward about opening cases where there are software problems.  In doing so it has started to become more obvious what drives the TAC and what incentives and directions people in there actually seem to be working to.

What I've noticed are a bunch of things.  Firstly, cisco (and when I refer to "cisco" I am referring to the TAC engineers collectively) seem very reluctant to spend too much time on a problem no matter how complex it is.  In my experience they rarely seem to admit that there are bugs in IOS, and the process and time spent testing and reproducing a bug make you feel in the end like it would have just been easier to live with the problem and not bother reporting it.  It seems to be the fear of all TAC engineers to have an end customer find a bug that no-one else has found before.

Let me give you some examples.

One was for an HWIC-1ADSL card which I bought recently in a bundle with an 1841 router.  In the past, through work, I had spent a lot of time working with the WIC-ADSL cards, which were the older version of the same thing.  When I plugged my shiny new HWIC-1ADSL card in and configured everything up I found that I could not do any SNMP polling for anything on the card, which, given I used to poll the physical line speed with my old card, meant changing to a new card was a regression.  It took about 5 emails and 2 phone calls before I could convince the engineer that it didn't work as documented, that I wasn't using the wrong community string, and that the SNMP value really didn't come back when I queried it.  I sent him a full snmpwalk of the router when I opened the case.  He kept asking me to find this value which wasn't there, kept insisting it must be there somewhere and that I must have sent him an incomplete walk.  I was asked to walk the SNMP tree at least 6 times over the course of the case, each time looking for something that wasn't there.  Eventually the engineer admitted that yes, there was a bug, and it was submitted to development and subsequently resolved.  3 months after the bug was confirmed, there still wasn't any publicly released code which fixed this obvious oversight in the design and testing of the card, but a few months later the fix did make its way into a subsequent release.
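For what it's worth, the "is the value in the walk or not" argument can be settled mechanically rather than by eyeballing the output six times.  Here is a minimal sketch in Python, assuming the walk was saved in numeric form (for example with net-snmp's `snmpwalk -On`) so that each line looks like `.1.3.6.1... = TYPE: value`.  The function name `oids_under`, the sample walk text and the choice of subtree are all my own illustrations, not anything from the actual case; the ADSL subtree shown is the ADSL-LINE-MIB root as I understand it from RFC 2662, so treat it as an assumption and check it against the MIB you actually care about.

```python
def oids_under(walk_text, subtree):
    """Return the OIDs in numeric snmpwalk output that sit under `subtree`."""
    prefix = subtree.strip().lstrip(".")
    found = []
    for line in walk_text.splitlines():
        # Numeric walk lines look like ".1.3.6.1.2.1.1.5.0 = STRING: name".
        oid, sep, _value = line.partition(" = ")
        if not sep:
            continue  # blank lines and wrapped continuation lines
        oid = oid.strip().lstrip(".")
        # Compare whole components so that ...10.94 does not match ...10.940.
        if oid == prefix or oid.startswith(prefix + "."):
            found.append(oid)
    return found


# Hypothetical sample data, made up for illustration only.
SAMPLE_WALK = """\
.1.3.6.1.2.1.1.5.0 = STRING: router1
.1.3.6.1.2.1.2.2.1.2.1 = STRING: FastEthernet0/0
.1.3.6.1.2.1.10.940.1 = INTEGER: 1
"""

# Assumption: ADSL-LINE-MIB root (transmission 94) per RFC 2662.
ADSL_SUBTREE = ".1.3.6.1.2.1.10.94"

if __name__ == "__main__":
    hits = oids_under(SAMPLE_WALK, ADSL_SUBTREE)
    if hits:
        print("ADSL objects present:", len(hits))
    else:
        print("No objects under", ADSL_SUBTREE, "in this walk")
```

An empty result for a complete walk is exactly the evidence that the agent is not exposing that part of the tree, and it is a lot quicker to send than yet another full walk.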

The second incident was even more frustrating.  This was a case to do with WCCP (Web Cache Communication Protocol) not working on the router.  This was only very simple WCCP, broken in a series of releases of IOS.  I opened this case in about March 2005 and was screwed about by two engineers before I was put onto a third who had actually heard of WCCP before.  I can say honestly that I knew more about WCCP than the first two engineers who were assigned to the case (I had made it clear when opening the case that I wanted someone who knew about WCCP).  The first one asked me some silly questions and then disappeared and stopped answering his emails despite repeated requests by me for a response.  The second asked some more questions and then went on holiday for 3 weeks.  However the third engineer was much better...

Not long after that, I provided a config which was absolutely bare bones and had nothing in it except what was needed to enable WCCP, which did not work.  The reason for this was so that cisco could easily reproduce the problem and see for themselves that there were bugs.  I had no problem whatsoever proving that it did not work.

I spent tens of hours testing all manner of permutations of variables on this problem.  I had it really narrowed down: I had identified specific software releases which did and didn't work, verified that certain other features weren't causing it, and so on.  It took 8 months or so before the engineer was able to concede that yes, WCCP was broken in a bunch of releases that I had identified, so he proceeded to open a bug report about it.

A while later that bug report was closed, marked as UNREPRODUCEABLE by someone.  Ouch, that hurt.

Next, 12.4(6) came out and, just as mysteriously as it broke, WCCP was fixed.  Partially.  Basic WCCP worked now so long as you didn't turn on ip firewall inspection, otherwise it still didn't work.  I eventually just told the engineer to close the case as we'd gotten some functionality back at least, and I was sick of having a case open for so long.

I have another few cases like that.  A more recent one relates to a new feature (MSN Protocol Inspection) in the code that only supports some very old versions of client software that have long been obsolete.  Cisco couldn't give me a date or in fact any clue as to when it might be fixed/updated, and just wanted to offer me workarounds and to know why I was trying to use the documented feature at all.  I was disappointed, as I had thought a new feature would actually support the most current versions of client software - but it didn't.

Lastly another example, from work, where a router had a spectacular crash and gave a really good stack trace and dump file, and according to the CCO output interpreter was 95% likely to be a software bug.  I opened a TAC case, but the engineer assigned didn't want to take it any further despite my requests "unless the router crashes again".  I mean, come on... it was a telco environment, and the telco support engineers weren't very happy to be told to wait for the router to crash again before any investigation could be done.  An organisation that was very keen on stamping out bugs in their software would say "oooer, that's an interesting looking trace, let's see if we can get some useful information out of it to determine if it is a real software bug or not".

Common to all of these experiences, and in fact many others I have had, is the approach the TAC take to cases.  The perception that I have is that engineers want to do all they can to stop the call getting any higher up the support chain even if they are unable to resolve the problem.  Bug reports take time, tie up resources, and involve dealing with (clued up and generally good) Development Engineers who it seems are people to be afraid of.  The TAC guys also seem to have a very limited ability to actually test anything out in their own space, which is bizarre - you'd expect a network company could at least provide a good lab for their engineers to test customer-reported bugs out.  Perhaps the view is that people only really call up the TAC when they can't figure out what to do.  Unfortunately in the examples above that wasn't the case, as I had spent quite some time debugging before I opened the cases.

TAC support *could* be a wonderful thing if the engineers had a different attitude.  Customers are not there to be gotten rid of because they're too dumb to figure things out; every now and then a customer like me calls up because he or she wants a bug in cisco's code resolved.  It is so frustrating to call them with reproducible, simple test cases which prove problems, only for them to offer workarounds like "well don't use the feature then if it doesn't work" or, "turn off CEF" - in other words, give my router a 50% performance decrease in order to work around some software problem in the code.

Certainly the hardware support that you get by paying for a maintenance contract is excellent, and for that reason alone I don't mind paying the money.  Response time for CCO cases is good and the engineers are almost always courteous and pleasant to deal with.  But for the techie who has done most of the work before calling and wants to log tricky faults or bugs, CCO seems to drop the ball far too often.