Around the time Windows 3.0 was launched, a colleague and I were developing
a Windows video driver for an advanced graphics card. During development,
we noticed that occasionally, Windows would hang while loading the graphics
driver. When this occurred, we'd simply reset the PC and it would work fine
the second time. Since we had plenty of other more pressing issues to work
on, we decided to keep an eye open for possible causes, but otherwise ignore
it for now.
After a couple of months, the video driver was in pretty good shape and we
were almost ready to ship. And of course, we hadn't yet figured out why the
driver would sometimes cause Windows to crash. I _had_ noticed, however,
that it always failed on a cold boot, and this was repeatable. It never happened
after pressing CTRL-ALT-DEL or the hardware reset button. (We rarely turned our
PCs off, which was why it wasn't happening often enough to annoy us.)
The graphics card used an onboard TI graphics processor (TMS34020) which
ran independently of the main CPU. The host CPU communicated with the TI
chip through a 16K shared memory window. This window was mapped into the
PC's memory space at a configurable address in the high memory area (the
region between 640K and 1 MB).
During startup, the PC downloaded graphics code to the TI chip and then brought
it out of reset to execute it.
Now that I could reproduce the crash, the next step was to use the debugger
to watch the driver initialisation code. Everything looked normal: the window
was mapped in and the downloaded code could be read back correctly. The only
problem was that the graphics CPU refused to execute it.
So much for the PC's debugger. Fortunately, we also had a debug port on the
graphics card itself. Unfortunately, this port required some stub code on
the TI chip to operate. However, by adding some debug output messages at
the start of the TI code, I was able to confirm that none of the downloaded
code was executing on those occasions when the crash occurred.
So, I resorted to comparing traces of the working and non-working driver
initialisation runs on the PC. Eventually, after much painstaking logging,
I noticed that the initial pattern of data in the shared memory window
differed between the failing case (after a cold start) and the working case.
After the first warm start, the data was a mixture of FFs, FEs, and the
downloaded program code. After a cold start, the data was more like something
you'd see in MS-DOS program memory...
... which was impossible, of course, because this memory resided on the graphics
card and was inaccessible to MS-DOS until the graphics driver had performed
a number of initialisation steps to map it in.
Eventually, it became clear what was happening. Windows 3.0 tried to use
as much "high memory" as possible for its own needs, to leave normal MS-DOS
program space available for applications. To determine how much high memory
could be used, it did a simple non-destructive read/write test on all pages
in the high memory area. Any pages that appeared to contain valid RAM were
assumed to belong to expansion cards and left alone. All other pages were
commandeered for Windows' own use, and the memory management unit of the
386 processor was used to remap them to extended memory.
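The scan described above can be sketched as a small simulation. This is only an illustration of the idea, not Windows' actual code: the `Bus` class, the 16K page size, the probe pattern and the addresses are all assumptions made for the example.

```python
PAGE = 0x4000  # 16K pages; the page size is an assumption for this sketch


class Bus:
    """Toy address space: only populated pages store data."""

    def __init__(self, populated_pages):
        self.mem = {base: bytearray(PAGE) for base in populated_pages}

    def read(self, addr):
        page = self.mem.get(addr - addr % PAGE)
        # An address with nothing behind it is modelled as reading FF.
        return page[addr % PAGE] if page is not None else 0xFF

    def write(self, addr, value):
        page = self.mem.get(addr - addr % PAGE)
        if page is not None:
            page[addr % PAGE] = value


def page_has_ram(bus, base):
    """Non-destructive probe: write a changed value, read it back, restore."""
    saved = bus.read(base)
    probe = saved ^ 0x55            # guaranteed to differ from the old byte
    bus.write(base, probe)
    found = bus.read(base) == probe
    bus.write(base, saved)          # put the original byte back
    return found


def claimable_pages(bus, start, end):
    """Pages where no RAM answered, which the scan would take for system use."""
    return [b for b in range(start, end, PAGE) if not page_has_ram(bus, b)]
```

With RAM present only at 0xC8000, `claimable_pages(Bus([0xC8000]), 0xC8000, 0xD4000)` returns the pages at 0xCC000 and 0xD0000. A card whose window is not yet enabled looks exactly like one of those empty pages.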
By now, you may have figured out the problem: on a cold boot, the shared
memory window used by the graphics card hadn't yet been initialised, and
so it wasn't mapped into memory. This led Windows to remap normal RAM at
its memory address. When the driver later enabled the shared memory window,
the processor had no way to reach it, since the MMU was redirecting every
access at that address to the remapped RAM. That is also why the downloaded
code read back correctly: it had been written to ordinary RAM, not to the card.
So, that explained why it failed the first time -- but why did it work on
subsequent occasions? Because of a slight design flaw on the graphics card:
the hardware designer had neglected to connect a reset line to the latch
used to control where in PC memory the shared window would get mapped. On
power-on, this could be set to any random value (though typically FF) and
it was assumed that the driver would set it to something sensible before
using it.
Since the latch wasn't reset by a warm start, it retained its previous value
-- which had been set by the graphics driver as part of its initialisation
during the failed cold-boot attempt. Thus, when Windows booted up the second time,
its memory test showed that page to be valid memory on an expansion card,
which meant Windows didn't try to remap it for system use. This in turn meant
the graphics driver (which was loaded after Windows did its own initialisation)
could happily read and write to the graphics card memory.
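The whole cold-start versus warm-start sequence fits in a few lines of simulation. This is a sketch under stated assumptions: `Board`, `start_windows`, `power_cycle` and the `WINDOW` address are hypothetical names and values, and the arbitrary power-on latch value is modelled simply as "not pointing at the window".

```python
WINDOW = 0xD0000   # hypothetical address the driver writes to the latch
UNSET = None       # stands in for the arbitrary power-on latch value


class Board:
    """Toy graphics card whose mapping latch has no reset line."""

    def __init__(self):
        self.latch = UNSET

    def responds_at(self, addr):
        # The shared window only answers once the latch points at addr.
        return self.latch == addr


def power_cycle(board):
    """A cold start is the only event that disturbs the latch."""
    board.latch = UNSET


def start_windows(board):
    """Return True if the driver can reach the card on this start."""
    # 1. Windows scans first: an address with no RAM gets remapped.
    remapped = not board.responds_at(WINDOW)
    # 2. The driver loads afterwards and programs the latch.
    board.latch = WINDOW
    # 3. If the MMU has taken over the address, the card is unreachable.
    return not remapped


board = Board()
results = [start_windows(board), start_windows(board)]  # cold start, then warm
```

Here `results` ends up as `[False, True]`: the first start fails, but it leaves the latch programmed, so the next start without a power cycle succeeds.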
Now that the sequence of events was clear, it turned out to be very easy
to fix: we simply modified our installation program to add a line to the
Windows SYSTEM.INI file telling it to always exclude the memory range where
the graphics board was located from system use; no change to the graphics
driver itself was required. This was a lot easier than modifying the reset
logic on the board, especially since we had a large number of units already
out in the field. We did make sure that future board designs initialised
the latch register on reset, to give consistent behaviour.
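For illustration, the installer change might have looked something like the sketch below. I haven't quoted the exact line we added, so treat the details as assumptions: the sketch uses the `EMMExclude` entry in the `[386Enh]` section, which is the Windows 3.x setting for keeping an address range out of the memory scan, and the range D000-D3FF (16K, given in paragraphs) is a made-up example. The function name is hypothetical too.

```python
def add_emm_exclude(ini_text, first_para, last_para):
    """Return SYSTEM.INI text with an EMMExclude entry in [386Enh].

    The range is given in paragraphs (address / 16), e.g. 0xD000-0xD3FF.
    The text is edited line by line because SYSTEM.INI repeats keys such
    as device=, which a strict INI parser would reject.
    """
    entry = f"EMMExclude={first_para:04X}-{last_para:04X}"
    lines = ini_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().lower() == "[386enh]":
            # Leave the file alone if this exact entry is already there.
            j = i + 1
            while j < len(lines) and not lines[j].lstrip().startswith("["):
                if lines[j].strip().lower() == entry.lower():
                    return ini_text
                j += 1
            lines.insert(i + 1, entry)
            break
    else:
        # No [386Enh] section found: append one.
        lines += ["[386Enh]", entry]
    return "\n".join(lines) + "\n"
```

Calling `add_emm_exclude(text, 0xD000, 0xD3FF)` on the contents of SYSTEM.INI adds the line `EMMExclude=D000-D3FF` directly under the `[386Enh]` header, and calling it again leaves the text unchanged.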
And our driver shipped two weeks late, as a result.
So, what lessons can be learnt from this?
- Understand the system: although Windows 3.0 was very new at the time, and
Google wasn't around to make it easy to find obscure information, it was
a chance comment about Windows memory usage in an early Windows programming
book that alerted me to the possibility that Windows could steal high memory
from I/O cards.
- It's never a good idea to let things default to a random value, even in
hardware. If the graphics board had always been set to a consistent state
after a reset, the failure would have been identified and fixed at the start
rather than the end of the development period.
- It's not always a good idea to fix bugs immediately, especially if they
are hard to reproduce. In this case, I was able to narrow the cause of the
bug to the cold start situation over a period of several weeks, in parallel
with normal development. By then, I also had enough confidence in the graphics
hardware to be sure that the problem wasn't likely to be related to flakiness
in the graphics chip itself. (In all other respects, the board was proving
very reliable.)
- The reason I was seeing any recognisable patterns in memory at all after
a cold start was that I was too impatient to wait 10 seconds when I power-cycled
my development PC. As a result, the DRAM didn't have a chance to fully discharge,
and a ghost image of the previous contents remained. (This was a 25 MHz 386,
and the memory could hold its charge for several seconds, despite the official
rating.) While I don't suggest making a habit of quick power-cycling, it
does act as a reminder that help can sometimes come from unexpected quarters!