"A few stories on the theme of Check the Plug"
from Nick Coghlan.
Some debugging stories which, taken together, probably consumed a few months
of development time (it could actually have been a LOT worse!)
1. Check those hardware configuration registers
The system I am working on is normally sold as a complete, vendor-provided
solution - they provide the hardware, and package it with third-party signal
processing software. The vendor discovered they also had a market for companies
like us that really liked the hardware design, but wanted to develop custom
signal processing software on top of it.
This was fine, but since the vendor had only recently started doing this,
their capacity to support the custom software development wasn't that great
(a lot of the necessary knowledge was held by the third-party software vendor,
instead of the hardware vendor). So, we allocated plenty of time for experimentation
and prototyping of the signal processing software.
However, for the first few months, we were consistently getting strange behaviour
from the threading in the RTOS we were using. Working on the assumption that
there was an errant pointer in the code, or something similar, I simply kept
an eye on the problem for a while, making progress in the prototyping and
generally operating reasonably well. Eventually, however, we'd got to the
point where we'd established that there was nothing obviously wrong in the
prototype code, but we were still getting threading problems - a software
interrupt was getting pre-empted by a standard thread.
So, we stripped out as much of the prototype as we could, and the problem
was still there. Not perfectly consistent, but extremely frequent. So, we
went back to the RTOS vendor, describing the problem we were seeing. They
were basically stumped too, so they passed me on to their actual software
development team, who were intensely curious as to what was going on - as
far as they were concerned, this behaviour was impossible.
We managed to trap an occurrence of the error, and the support engineer was
able to get me to look at some of the internal RTOS data values - which had
reached values that, according to the engineer, should never be reached.
Finally, after a few days of trans-Pacific phone calls, the RTOS vendor's
engineer and I were stepping through a section of the RTOS assembler code,
monitoring register values. We saw the processor perform a calculation along
the lines of "1 & 1 -> 0". Needless to say, this was confusing the
RTOS more than a little. At this point, the RTOS engineer asked me to check
the configuration register for the processor's PLL multiplier. When I'd set
the value in the RTOS configuration file, I'd simply set it to the maximum
value without checking what the correct value for the vendor's hardware was.
Once we tried changing it back down to 1, the strange threading behaviour
disappeared, as the processor rediscovered its ability to do basic math.
When I went back to our hardware vendor, I discovered I'd been clocking the
processor on the development hardware at 150% of its rated speed. Since the
deployed hardware uses a faster crystal, if I'd tried running the incorrectly
configured code on that, it would have been running at 750% (actually, at
that speed, the magic smoke probably would have escaped from the processor).
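The arithmetic behind those percentages can be sketched in C as a sanity check
on the PLL setting. Every number and name below is hypothetical: the story does
not give the crystal frequencies, the rated core clock, or the maximum
multiplier, so the values are picked only so that the results reproduce the
150% and 750% figures. This is an illustration of the kind of check that would
have caught the problem, not the vendor's or the RTOS's actual configuration code.

```c
#include <stdint.h>

/* Hypothetical values, chosen only to reproduce the story's percentages. */
#define RATED_CORE_HZ       100000000u  /* assumed rated core clock      */
#define DEV_CRYSTAL_HZ       10000000u  /* assumed development crystal   */
#define DEPLOYED_CRYSTAL_HZ  50000000u  /* assumed faster deployed crystal */
#define PLL_MULT_MAX               15u  /* assumed maximum PLL multiplier */

/* Core clock (crystal * multiplier) as a percentage of the rated speed. */
unsigned clock_percent_of_rated(uint32_t crystal_hz, uint32_t pll_mult)
{
    uint64_t core_hz = (uint64_t)crystal_hz * pll_mult;
    return (unsigned)((core_hz * 100u) / RATED_CORE_HZ);
}

/* Returns 1 if the resulting core clock is within the rated speed, else 0. */
int pll_config_ok(uint32_t crystal_hz, uint32_t pll_mult)
{
    return (uint64_t)crystal_hz * pll_mult <= RATED_CORE_HZ;
}
```

With these assumed values, clock_percent_of_rated(DEV_CRYSTAL_HZ, PLL_MULT_MAX)
gives 150 and the same multiplier on DEPLOYED_CRYSTAL_HZ gives 750, while
pll_config_ok() rejects both. Calling a check like this at startup, or turning
it into a compile-time assertion, makes the "set it to the maximum" shortcut
fail loudly instead of silently corrupting arithmetic.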
2. Sometimes, it IS the operating system
On another occasion, a serial I/O driver seemed to be suffering strange timing
problems, with a software interrupt appearing to miss its deadline. Instrumenting
the code, stripping out all of the code contents _except_ the instrumentation,
shutting down the rest of the system - none of it appeared to make any difference.
Initially, we chalked it up to interference from the JTAG-based emulation,
but then we discovered that the actual I/O paths were totally corrupted,
even when the JTAG emulator was not connected.
After much testing, and head-scratching, as well as reviews of the driver
code by other developers, I checked the vendor's bug listing page for the
first time in a few months. Included was one along the lines of 'Priority
1 software interrupts will sometimes fail to be posted'. We only had one
software interrupt - I changed it from priority 1 to priority 2 and the strange
driver behaviour disappeared.
Assuming your own code is at fault is generally the right way to go - but
keep that list of known OS bugs handy, too.
3. Linkers are all the same, right?
During the course of the project, we upgraded the version of our toolchain.
Overall, this was a Very Good Thing, but there was an interesting teething
problem. Our application, which worked fine in the original version of the
toolchain, wasn't working at all with the newer version of the debugger.
In fact, it was causing the entire processor to lock up. Again, we went back
to the RTOS vendor, even sending them a stripped down version of the application
that exhibited the problem.
This time, it DID appear to be related to the JTAG emulator - if the emulator
wasn't attached, we didn't seem to have a problem. But, all we'd done was
change the tools - what could be so different as to cause the application
to crash completely?
It turned out that the new linker was arranging things differently in memory
from the previous version. The data buffers used by the serial I/O and the
data buffers used to transfer debugging data to the IDE were now in the same
block of memory, and the memory port access controls were causing scheduling
and/or pipeline issues that were locking up the processor. The simple expedient
of moving the serial I/O buffers to dual-access memory eliminated the conflict,
and allowed the debugger to work correctly with the updated toolchain. This
fix was actually discovered by lucky accident, before we managed to work
backward to figure out _why_ moving the serial I/O buffers solved the problem.
This problem actually meant that our first attempt at upgrading the toolchain
was aborted - we had agreed up front that, if the new version wasn't
working as well as the previous version within four weeks, we would
postpone the upgrade until the next iteration. The next time around, we managed
to track the problem down, and were able to shift to the new toolchain.
-Nick Coghlan
© 2002 by David J. Agans