"Look at the simple stuff first" from Quentin
Lewis.
I can't tell you how many times I have seen people burned because they instantly
dig into problems up to their eyeballs, only to find that it was a simple
BOM error all along. This is the story of EXACTLY such a case.
While working at a still-famous three-letter computer company (there are a
couple of them, you know), I was debugging a CPU board I had designed.
Things seemed to be going pretty well, except for a random crash. The crash
was clearly the result of a memory ECC error. We began collecting data on all
crashes, looking for data or address patterns in the failures. But the problem
didn't occur all that often, so it was going to be hard to find.
We did some temperature, voltage and frequency margining, and found that the
problem responded as though it was a clear timing violation....yet the more we
looked at it with the oscilloscope and logic analyzer, the less we saw any issues.
We then noticed that error rates were memory-vendor related, with one particular
vendor's devices having a much larger failure rate than the others (and one
vendor's devices not failing at all).
We called in engineers from the failing vendor, and we looked at the problem
together for a week. We saw nothing. All the timing looked good, and even
though the problem did not occur often, we had been able to hook up the
logic analyzer and create a trigger that allowed us to capture any signals
we wanted at the time of the error. (Unfortunately, the error was created
during the write, so it wasn't always in the trace.)
Well, at one point, we were scoping around and we looked at a signal going
between two memory controller PALs. (This will probably date the problem
for you.) The signal "looked funny", but it was only a point-to-point signal,
so that made little sense. We went in and checked the PAL equations, to make
sure we had not messed them up, and we just kept looking at this strange
signal, knowing that we were onto something, but not knowing what was going
on.
Then a quick look at the schematic showed that there really was only one
other SMALL item in the circuit, and this was a 20 ohm series resistor. We
looked at it, and it was stuffed....but it looked like it might have gone
on upside down, as we could not read the value.......BUT WAIT....thinking
quickly, and remembering the shape of the signal we were looking at, we realized
that the resistor was acting like a capacitor!
Sure enough, we took it off and it was a capacitor. We then checked the other
boards in the proto run, and they were incorrectly stuffed too. A peek at the
BOM and it was clear: someone had somehow fat-fingered the BOM entry of an
ECO. The capacitor was called out in the BOM. We had been working for three
weeks on a problem that was a simple BOM error, one that we should have picked
up in the initial board inspection (had we done a close check).
Lesson learned.....no matter how much it seems to be a waste of time, go
through the "start-up checklist" every time. It will save you in the long
run. (Check for power-to-ground shorts, do a visual inspection, check board
mechanical dimensions, check the BOM against the board and schematic, etc...)
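
The "BOM check against schematic" item is the one that would have caught this
bug, and it is easy to automate. What follows is only a minimal sketch in
Python, not the process we actually used. The data layout (reference
designator mapped to part type and value), the function name bom_mismatches,
and the designators R12, C7 and U3 are all made up for illustration; real
exports from a schematic tool would need their own parsing.

```python
# Hypothetical data layout: reference designator -> (part type, value).
# None of these designators or values come from the original board.

def bom_mismatches(schematic, bom):
    """Return a list of (refdes, schematic_entry, bom_entry) that disagree.

    An entry of None means the part is missing from that source.
    """
    mismatches = []
    for refdes in sorted(set(schematic) | set(bom)):
        expected = schematic.get(refdes)  # None if not in the schematic
        actual = bom.get(refdes)          # None if not in the BOM
        if expected != actual:
            mismatches.append((refdes, expected, actual))
    return mismatches


schematic = {
    "R12": ("resistor", "20 ohm"),
    "C7": ("capacitor", "0.1 uF"),
    "U3": ("PAL", "memory controller"),
}
bom = {
    "R12": ("capacitor", "0.1 uF"),  # the kind of slip described in the story
    "C7": ("capacitor", "0.1 uF"),
    "U3": ("PAL", "memory controller"),
}

for refdes, expected, actual in bom_mismatches(schematic, bom):
    print(f"{refdes}: schematic says {expected}, BOM says {actual}")
```

With the made-up data above, only R12 is reported. A check like this compares
the BOM against the schematic only; it does not replace the visual inspection
of what was actually stuffed on the board.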