What was it like working with John Carmack on Quake? Like being strapped onto a rocket during takeoff—in the middle of a hurricane. It seemed like the whole world was watching, waiting to see if id Software could top Doom; every casual e-mail tidbit or conversation with a visitor ended up posted on the Internet within hours. And meanwhile, we were pouring everything we had into Quake’s technology; I’d often come in in the morning to find John still there, working on a new idea so intriguing that he couldn’t bear to sleep until he had tried it out. Toward the end, when I spent most of my time speeding things up, I would spend the day in a trance writing optimized assembly code, stagger out of the Town East Tower into the blazing Texas heat, and somehow drive home on LBJ Freeway without smacking into any of the speeding pickups whizzing past me on both sides. At home, I’d fall into a fitful sleep, then come back the next day in a daze and do it again. Everything happened so fast, and under so much pressure, that sometimes I wonder how any of us made it through that without completely burning out.
At the same time, of course, it was tremendously exciting. John’s ideas were endless and brilliant, and Quake ended up establishing a new standard for Internet and first-person 3-D game technology. Happily, id has an enlightened attitude about sharing information, and was willing to let me write about the Quake technology—both how it worked and how it evolved. Over the two years I worked at id, I wrote a number of columns about Quake in Dr. Dobb’s Sourcebook, as well as a detailed overview for the 1997 Computer Game Developers Conference. You can find these in the latter part of this book; they represent a rare look into the development and inner workings of leading-edge software development, and I hope you enjoy reading them as much as I enjoyed developing the technology and writing about it.
The rest of this book is pretty much everything I’ve written over the past decade about graphics and performance programming that’s still relevant to programming today, and that covers a lot of ground. Most of Zen of Graphics Programming, 2nd Edition is in there (and the rest is on the CD); all of Zen of Code Optimization is there too, and even my 1989 book Zen of Assembly Language, with its long-dated 8088 cycle counts but a lot of useful perspectives, is on the CD. Add to that the most recent 20,000 words of Quake material, and you have most of what I’ve learned over the past decade in one neat package.
I’m delighted to have all this material in print in a single place, because over the past ten years I’ve run into a lot of people who have found my writings useful—and a lot more who would like to read them, but couldn’t find them. It’s hard to keep programming material (especially stuff that started out as columns) in print for very long, and I would like to thank The Coriolis Group, and particularly my good friend Jeff Duntemann (without whom not only this volume but pretty much my entire writing career wouldn’t exist), for helping me keep this material available.
I’d also like to thank Jon Erickson, editor of Dr. Dobb’s, both for encouragement and general good cheer and for giving me a place to write whatever I wanted about realtime 3-D. It still amazes me that I was able to find time to write a column every two months during Quake’s development, and if Jon hadn’t made it so easy and enjoyable, it could never have happened.
I’d also like to thank Chris Hecker and Jennifer Pahlka of the Computer Game Developers Conference, without whose encouragement, nudging, and occasional well-deserved nagging there is no chance I would ever have written a paper for the CGDC—a paper that ended up being the most comprehensive overview of the Quake technology that’s ever likely to be written, and which appears in these pages.
I don’t have much else to say that hasn’t already been said elsewhere in this book, in one of the introductions to the previous volumes or in one of the astonishingly large number of chapters. As you’ll see as you read, it’s been quite a decade for microcomputer programmers, and I have been extremely fortunate to not only be a part of it, but to be able to chronicle part of it as well.
And the next decade is shaping up to be just as exciting!
—Michael Abrash
Bellevue, Washington
May 1997
I got my start programming on Apple II computers at school, and almost all of my early work was on the Apple platform. After graduating, it quickly became obvious that I was going to have trouble paying my rent working in the Apple II market in the late eighties, so I was forced to make a very rapid move into the Intel PC environment.
What I was able to pick up over several years on the Apple, I needed to learn in the space of a few months on the PC.
The biggest benefit to me of actually making money as a programmer was the ability to buy all the books and magazines I wanted. I bought a lot. I was in territory that I knew almost nothing about, so I read everything that I could get my hands on. Feature articles, editorials, even advertisements held information for me to assimilate.
John Romero clued me in early to the articles by Michael Abrash. The good stuff. Graphics hardware. Code optimization. Knowledge and wisdom for the aspiring developer. They were even fun to read. For a long time, my personal quest was to find a copy of Michael’s first book, Zen of Assembly Language. I looked in every bookstore I visited, but I never did find it. I made do with the articles I could dig up.
I learned the dark secrets of the EGA video controller there, and developed a few neat tricks of my own. Some of those tricks became the basis for the Commander Keen series of games, which launched id Software.
A year or two later, after Wolfenstein-3D, I bumped into Michael (in a virtual sense) for the first time. I was looking around on M&T Online, a BBS run by the Dr. Dobb’s publishers before the Internet explosion, when I saw some posts from the man himself. We traded email, and for a couple months we played tag-team gurus on the graphics forum before Doom’s development took over my life.
A friend of Michael’s at his new job put us back in touch with each other after Doom began to make its impact, and I finally got a chance to meet up with him in person.
I talked myself hoarse that day, explaining all the ins and outs of Doom to Michael and an interested group of his coworkers. Every few days afterwards, I would get an email from Michael asking for an elaboration on one of my points, or discussing an aspect of the future of graphics.
Eventually, I popped the question—I offered him a job at id. “Just think: no reporting to anyone, an opportunity to code all day, starting with a clean sheet of paper. A chance to do the right thing as a programmer.” It didn’t work. I kept at it though, and about a year later I finally convinced him to come down and take a look at id. I was working on Quake.
Going from Doom to Quake was a tremendous step. I knew where I wanted to end up, but I wasn’t at all clear what the steps were to get there. I was trying a huge number of approaches, and even the failures were teaching me a lot. My enthusiasm must have been contagious, because he took the job.
Much heroic programming ensued. Several hundred thousand lines of code were written. And rewritten. And rewritten. And rewritten.
In hindsight, I have plenty of regrets about various aspects of Quake, but it is a rare person that doesn’t freely acknowledge the technical triumph of it. We nailed it. Sure, a year from now I will have probably found a new perspective that will make me cringe at the clunkiness of some part of Quake, but at the moment it still looks pretty damn good to me.
I was very happy to have Michael describe much of the Quake technology in his ongoing magazine articles. We learned a lot, and I hope we managed to teach a bit.
When a non-programmer hears about Michael’s articles or the source code I have released, I usually get a stunned “WTF would you do that for???” look.
They don’t get it.
Programming is not a zero-sum game. Teaching something to a fellow programmer doesn’t take it away from you. I’m happy to share what I can, because I’m in it for the love of programming. The Ferraris are just gravy, honest!
This book contains many of the original articles that helped launch my programming career. I hope my contribution to the contents of the later articles can provide similar stepping stones for others.
—John Carmack
id Software
There are many people to thank—because this book was written over many years, in many different settings, an unusually large number of people have played a part in making this book possible. Thanks to Dan Illowsky for not only contributing ideas and encouragement, but also getting me started writing articles long ago, when I lacked the confidence to do it on my own—and for teaching me how to handle the business end of things. Thanks to Will Fastie for giving me my first crack at writing for a large audience in the long-gone but still-missed PC Tech Journal, and for showing me how much fun it could be in his even longer-vanished but genuinely terrific column in Creative Computing (the most enjoyable single column I have ever read in a computer magazine; I used to haunt the mailbox around the beginning of the month just to see what Will had to say). Thanks to Robert Keller, Erin O’Connor, Liz Oakley, Steve Baker, and the rest of the cast of thousands that made Programmer’s Journal a uniquely fun magazine—especially Erin, who did more than anyone to teach me the proper use of the English language. (To this day, Erin will still patiently explain to me when one should use “that” and when one should use “which,” even though eight years of instruction on this and related topics have left no discernible imprint on my brain.) Thanks to Tami Zemel, Monica Berg, and the rest of the Dr. Dobb’s Journal crew for excellent, professional editing, and for just being great people. Thanks to the Coriolis gang for their tireless hard work: Jeff Duntemann, Kim Eoff, Jody Kent, Robert Clarfield, and Anthony Stock. Thanks to Jack Tseng for teaching me a lot about graphics hardware, and even more about how much difference hard work can make. Thanks to John Cockerham, David Stafford, Terje Mathisen, the BitMan, Chris Hecker, Jim Mackraz, Melvin Lafitte, John Navas, Phil Coleman, Anton Truenfels, John Carmack, John Miles, John Bridges, Jim Kent, Hal Hardenbergh, Dave Miller, Steve Levy, Jack Davis, Duane Strong, Daev Rohr, Bill Weber, Dan Gochnauer, Patrick Milligan, Tom Wilson, Peter Klerings, Dave Methvin, Mick Brown, the people in the ibm.pc/fast.code topic on Bix, and all the rest of you who have been so generous with your ideas and suggestions. I’ve done my best to acknowledge contributors by name in this book, but if your name is omitted, my apologies, and consider yourself thanked; this book could not have happened without you. And, of course, thanks to Shay and Emily for their generous patience with my passion for writing and computers.
This book is devoted to a topic near and dear to my heart: writing software that pushes PCs to the limit. Given run-of-the-mill software, PCs run like the 97-pound-weakling minicomputers they are. Give them the proper care, however, and those ugly boxes are capable of miracles. The key is this: Only on microcomputers do you have the run of the whole machine, without layers of operating systems, drivers, and the like getting in the way. You can do anything you want, and you can understand everything that’s going on, if you so wish.
As we’ll see shortly, you should indeed so wish.
Is performance still an issue in this era of cheap 486 computers and super-fast Pentium computers? You bet. How many programs that you use really run so fast that you wouldn’t be happier if they ran faster? We’re so used to slow software that when a compile-and-link sequence that took two minutes on a PC takes just ten seconds on a 486 computer, we’re ecstatic—when in truth we should be settling for nothing less than instantaneous response.
Impossible, you say? Not with the proper design, including incremental compilation and linking, use of extended and/or expanded memory, and well-crafted code. PCs can do just about anything you can imagine (with a few obvious exceptions, such as applications involving super-computer-class number-crunching) if you believe that it can be done, if you understand the computer inside and out, and if you’re willing to think past the obvious solution to unconventional but potentially more fruitful approaches.
My point is simply this: PCs can work wonders. It’s not easy coaxing them into doing that, but it’s rewarding—and it’s sure as heck fun. In this book, we’re going to work some of those wonders, starting…
…now.
Before we can create high-performance code, we must understand what high performance is. The objective (not always attained) in creating high-performance software is to make the software able to carry out its appointed tasks so rapidly that it responds instantaneously, as far as the user is concerned. In other words, high-performance code should ideally run so fast that any further improvement in the code would be pointless.
Notice that the above definition most emphatically does not say anything about making the software as fast as possible. It also does not say anything about using assembly language, or an optimizing compiler, or, for that matter, a compiler at all. It also doesn’t say anything about how the code was designed and written. What it does say is that high-performance code shouldn’t get in the user’s way—and that’s all.
That’s an important distinction, because all too many programmers think that assembly language, or the right compiler, or a particular high-level language, or a certain design approach is the answer to creating high-performance code. They’re not, any more than choosing a certain set of tools is the key to building a house. You do indeed need tools to build a house, but any of many sets of tools will do. You also need a blueprint, an understanding of everything that goes into a house, and the ability to use the tools.
Likewise, high-performance programming requires a clear understanding of the purpose of the software being built, an overall program design, algorithms for implementing particular tasks, an understanding of what the computer can do and of what all relevant software is doing—and solid programming skills, preferably using an optimizing compiler or assembly language. The optimization at the end is just the finishing touch, however.
Without good design, good algorithms, and complete understanding of the program’s operation, your carefully optimized code will amount to one of mankind’s least fruitful creations—a fast slow program.
“What’s a fast slow program?” you ask. That’s a good question, and a brief (true) story is perhaps the best answer.
In the early 1970s, as the first hand-held calculators were hitting the market, I knew a fellow named Irwin. He was a good student, and was planning to be an engineer. Being an engineer back then meant knowing how to use a slide rule, and Irwin could jockey a slipstick with the best of them. In fact, he was so good that he challenged a fellow with a calculator to a duel—and won, becoming a local legend in the process.
When you get right down to it, though, Irwin was spitting into the wind. In a few short years his hard-earned slipstick skills would be worthless, and the entire discipline would be essentially wiped from the face of the earth. What’s more, anyone with half a brain could see that changeover coming. Irwin had basically wasted the considerable effort and time he had spent optimizing his soon-to-be-obsolete skills.
What does all this have to do with programming? Plenty. When you spend time optimizing poorly-designed assembly code, or when you count on an optimizing compiler to make your code fast, you’re wasting the optimization, much as Irwin did. Particularly in assembly, you’ll find that without proper up-front design and everything else that goes into high-performance design, you’ll waste considerable effort and time on making an inherently slow program as fast as possible—which is still slow—when you could easily have improved performance a great deal more with just a little thought. As we’ll see, handcrafted assembly language and optimizing compilers matter, but less than you might think, in the grand scheme of things—and they scarcely matter at all unless they’re used in the context of a good design and a thorough understanding of both the task at hand and the PC.
We’ve got a handful of rules for creating high-performance software: know where you’re going, map out your design before you implement it, know the territory, know when it matters, and always consider the alternatives.
Making rules is easy; the hard part is figuring out how to apply them in the real world. For my money, examining some actual working code is always a good way to get a handle on programming concepts, so let’s look at some of the performance rules in action.
If we’re going to create high-performance code, first we have to know what that code is going to do. As an example, let’s write a program that generates a 16-bit checksum of the bytes in a file. In other words, the program will add each byte in a specified file in turn into a 16-bit value. This checksum value might be used to make sure that a file hasn’t been corrupted, as might occur during transmission over a modem or if a Trojan horse virus rears its ugly head. We’re not going to do anything with the checksum value other than print it out, however; right now we’re only interested in generating that checksum value as rapidly as possible.
How are we going to generate a checksum value for a specified file? The logical approach is to get the file name, open the file, read the bytes out of the file, add them together, and print the result. Most of those actions are straightforward; the only tricky part lies in reading the bytes and adding them together.
Actually, we’re only going to make one little map, because we only have one program section that requires much thought—the section that reads the bytes and adds them up. What’s the best way to do this?
It would be convenient to load the entire file into memory and then sum the bytes in one loop. Unfortunately, there’s no guarantee that any particular file will fit in the available memory; in fact, it’s a sure thing that many files won’t fit into memory, so that approach is out.
Well, if the whole file won’t fit into memory, one byte surely will. If we read the file one byte at a time, adding each byte to the checksum value before reading the next byte, we’ll minimize memory requirements and be able to handle any size file at all.
Sounds good, eh? Listing 1.1 shows an implementation of this approach. Listing 1.1 uses C’s read() function to read a single byte, adds the byte into the checksum value, and loops back to handle the next byte until the end of the file is reached. The code is compact, easy to write, and functions perfectly—with one slight hitch:
It’s slow.
LISTING 1.1 L1-1.C
```c
/*
 * Program to calculate the 16-bit checksum of all bytes in the
 * specified file. Obtains the bytes one at a time via read(),
 * letting DOS perform all data buffering.
 */
#include <stdio.h>
#include <fcntl.h>

main(int argc, char *argv[]) {
   int Handle;
   unsigned char Byte;
   unsigned int Checksum;
   int ReadLength;

   if ( argc != 2 ) {
      printf("usage: checksum filename\n");
      exit(1);
   }
   if ( (Handle = open(argv[1], O_RDONLY | O_BINARY)) == -1 ) {
      printf("Can't open file: %s\n", argv[1]);
      exit(1);
   }

   /* Initialize the checksum accumulator */
   Checksum = 0;

   /* Add each byte in turn into the checksum accumulator */
   while ( (ReadLength = read(Handle, &Byte, sizeof(Byte))) > 0 ) {
      Checksum += (unsigned int) Byte;
   }
   if ( ReadLength == -1 ) {
      printf("Error reading file %s\n", argv[1]);
      exit(1);
   }

   /* Report the result */
   printf("The checksum is: %u\n", Checksum);
   exit(0);
}
```
Table 1.1 shows the time taken for Listing 1.1 to generate a checksum of the WordPerfect version 4.2 thesaurus file, TH.WP (362,293 bytes in size), on a 10 MHz AT machine of no special parentage. Execution times are given for Listing 1.1 compiled with Borland and Microsoft compilers, with optimization both on and off; all four times are pretty much the same, however, and all are much too slow to be acceptable. Listing 1.1 requires over two and one-half minutes to checksum one file!
Listings 1.2 and 1.3 form the C/assembly equivalent to Listing 1.1, and Listings 1.6 and 1.7 form the C/assembly equivalent to Listing 1.5.
These results make it clear that it’s folly to rely on your compiler’s optimization to make your programs fast. Listing 1.1 is simply poorly designed, and no amount of compiler optimization will compensate for that failing. To drive home the point, consider Listings 1.2 and 1.3, which together are equivalent to Listing 1.1 except that the entire checksum loop is written in tight assembly code. The assembly language implementation is indeed faster than any of the C versions, as shown in Table 1.1, but it’s less than 10 percent faster, and it’s still unacceptably slow.
| Listing | Borland (no opt) | Microsoft (no opt) | Borland (opt) | Microsoft (opt) | Assembly | Optimization Ratio |
|---|---|---|---|---|---|---|
| 1.1 | 166.9 | 166.8 | 167.0 | 165.8 | 155.1 | 1.08 |
| 1.4 | 13.5 | 13.6 | 13.5 | 13.5 | … | 1.01 |
| 1.5 | 4.7 | 5.5 | 3.8 | 3.4 | 2.7 | 2.04 |
| Ratio best designed to worst designed | 35.51 | 30.33 | 43.95 | 48.76 | 57.44 | |
Note: The execution times (in seconds) for this chapter’s listings were timed when the compiled listings were run on the WordPerfect 4.2 thesaurus file TH.WP (362,293 bytes in size), as compiled in the small model with Borland and Microsoft compilers with optimization on (opt) and off (no opt). All times were measured with Paradigm Systems’ TIMER program on a 10 MHz 1-wait-state AT clone with a 28-ms hard disk, with disk caching turned off.
LISTING 1.2 L1-2.C
```c
/*
 * Program to calculate the 16-bit checksum of the stream of bytes
 * from the specified file. Obtains the bytes one at a time in
 * assembler, via direct calls to DOS.
 */
#include <stdio.h>
#include <fcntl.h>

main(int argc, char *argv[]) {
   int Handle;
   unsigned char Byte;
   unsigned int Checksum;
   int ReadLength;

   if ( argc != 2 ) {
      printf("usage: checksum filename\n");
      exit(1);
   }
   if ( (Handle = open(argv[1], O_RDONLY | O_BINARY)) == -1 ) {
      printf("Can't open file: %s\n", argv[1]);
      exit(1);
   }
   if ( !ChecksumFile(Handle, &Checksum) ) {
      printf("Error reading file %s\n", argv[1]);
      exit(1);
   }

   /* Report the result */
   printf("The checksum is: %u\n", Checksum);
   exit(0);
}
```
LISTING 1.3 L1-3.ASM
```asm
; Assembler subroutine to perform a 16-bit checksum on the file
; opened on the passed-in handle. Stores the result in the
; passed-in checksum variable. Returns 1 for success, 0 for error.
;
; Call as:
;     int ChecksumFile(unsigned int Handle, unsigned int *Checksum);
;
; where:
;     Handle = handle # under which file to checksum is open
;     Checksum = pointer to unsigned int variable checksum is
;     to be stored in
;
; Parameter structure:
;
Parms   struc
                dw      ?               ;pushed BP
                dw      ?               ;return address
Handle          dw      ?
Checksum        dw      ?
Parms   ends
;
        .model  small
        .data
TempWord label  word
TempByte        db      ?               ;each byte read by DOS will be stored here
                db      0               ;high byte of TempWord is always 0
                                        ;for 16-bit adds
;
        .code
        public  _ChecksumFile
_ChecksumFile   proc    near
        push    bp
        mov     bp,sp
        push    si                      ;save C's register variable
;
        mov     bx,[bp+Handle]          ;get file handle
        sub     si,si                   ;zero the checksum accumulator
        mov     cx,1                    ;request one byte on each read
        mov     dx,offset TempByte      ;point DX to the byte in
                                        ;which DOS should store
                                        ;each byte read
ChecksumLoop:
        mov     ah,3fh                  ;DOS read file function #
        int     21h                     ;read the byte
        jc      ErrorEnd                ;an error occurred
        and     ax,ax                   ;any bytes read?
        jz      Success                 ;no-end of file reached-we're done
        add     si,[TempWord]           ;add the byte into the checksum total
        jmp     ChecksumLoop
ErrorEnd:
        sub     ax,ax                   ;error
        jmp     short Done
Success:
        mov     bx,[bp+Checksum]        ;point to the checksum variable
        mov     [bx],si                 ;save the new checksum
        mov     ax,1                    ;success
;
Done:
        pop     si                      ;restore C's register variable
        pop     bp
        ret
_ChecksumFile   endp
        end
```
The lesson is clear: Optimization makes code faster, but without proper design, optimization just creates fast slow code.
Well, then, how are we going to improve our design? Before we can do that, we have to understand what’s wrong with the current design.
Just why is Listing 1.1 so slow? In a word: overhead. The C library implements the read() function by calling DOS to read the desired number of bytes. (I figured this out by watching the code execute with a debugger, but you can buy library source code from both Microsoft and Borland.) That means that Listing 1.1 (and Listing 1.3 as well) executes one DOS function per byte processed—and DOS functions, especially this one, come with a lot of overhead.

For starters, DOS functions are invoked with interrupts, and interrupts are among the slowest instructions of the x86 family CPUs. Then, DOS has to set up internally and branch to the desired function, expending more cycles in the process. Finally, DOS has to search its own buffers to see if the desired byte has already been read, read it from the disk if not, store the byte in the specified location, and return. All of that takes a long time—far, far longer than the rest of the main loop in Listing 1.1. In short, Listing 1.1 spends virtually all of its time executing read(), and most of that time is spent somewhere down in DOS.
You can verify this for yourself by watching the code with a debugger or using a code profiler, but take my word for it: There’s a great deal of overhead to DOS calls, and that’s what’s draining the life out of Listing 1.1.
How can we speed up Listing 1.1? It should be clear that we must somehow avoid invoking DOS for every byte in the file, and that means reading more than one byte at a time, then buffering the data and parceling it out for examination one byte at a time. By gosh, that’s a description of C’s stream I/O feature, whereby C reads files in chunks and buffers the bytes internally, doling them out to the application as needed by reading them from memory rather than calling DOS. Let’s try using stream I/O and see what happens.
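To see what that buffering buys us, here is a rough sketch of the idea. It is my own illustration, not the actual C library source, and the names (my_getc, MyBuf) are invented; the library’s getc() behaves in much this way, handing out bytes from an in-memory buffer and calling DOS only when the buffer runs dry.

```c
/* Sketch only (invented names): the idea behind buffered stream input.
   Most calls just pull the next byte from an in-memory buffer; only when
   the buffer is empty does one DOS call (via read()) refill it. */
#include <io.h>      /* read(); io.h under Borland/Microsoft DOS compilers */

#define MYBUF_SIZE 512

static unsigned char MyBuf[MYBUF_SIZE];
static unsigned char *MyBufPtr;
static int MyBufCount = 0;            /* bytes remaining in the buffer */

int my_getc(int Handle)
{
   if ( MyBufCount == 0 ) {           /* buffer empty: refill it with one DOS call */
      MyBufCount = read(Handle, MyBuf, MYBUF_SIZE);
      if ( MyBufCount <= 0 )
         return -1;                   /* end of file or error */
      MyBufPtr = MyBuf;
   }
   MyBufCount--;
   return *MyBufPtr++;                /* hand out the next buffered byte */
}
```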
Listing 1.4 is similar to Listing 1.1, but uses fopen() and getc() (rather than open() and read()) to access the file being checksummed. The results confirm our theories splendidly, and validate our new design. As shown in Table 1.1, Listing 1.4 runs more than an order of magnitude faster than even the assembly version of Listing 1.1, even though Listing 1.1 and Listing 1.4 look almost the same. To the casual observer, read() and getc() would seem slightly different but pretty much interchangeable, and yet in this application the performance difference between the two is about the same as that between a 4.77 MHz PC and a 16 MHz 386.
Make sure you understand what really goes on when you insert a seemingly-innocuous function call into the time-critical portions of your code.
In this case that means knowing how DOS and the C/C++ file-access libraries do their work. In other words, know the territory!
LISTING 1.4 L1-4.C
```c
/*
 * Program to calculate the 16-bit checksum of the stream of bytes
 * from the specified file. Obtains the bytes one at a time via
 * getc(), allowing C to perform data buffering.
 */
#include <stdio.h>

main(int argc, char *argv[]) {
   FILE *CheckFile;
   int Byte;
   unsigned int Checksum;

   if ( argc != 2 ) {
      printf("usage: checksum filename\n");
      exit(1);
   }
   if ( (CheckFile = fopen(argv[1], "rb")) == NULL ) {
      printf("Can't open file: %s\n", argv[1]);
      exit(1);
   }

   /* Initialize the checksum accumulator */
   Checksum = 0;

   /* Add each byte in turn into the checksum accumulator */
   while ( (Byte = getc(CheckFile)) != EOF ) {
      Checksum += (unsigned int) Byte;
   }

   /* Report the result */
   printf("The checksum is: %u\n", Checksum);
   exit(0);
}
```
The last section contained a particularly interesting phrase: the time-critical portions of your code. Time-critical portions of your code are those portions in which the speed of the code makes a significant difference in the overall performance of your program—and by “significant,” I don’t mean that it makes the code 100 percent faster, or 200 percent, or any particular amount at all, but rather that it makes the program more responsive and/or usable from the user’s perspective.
Don’t waste time optimizing non-time-critical code: set-up code, initialization code, and the like. Spend your time improving the performance of the code inside heavily-used loops and in the portions of your programs that directly affect response time. Notice, for example, that I haven’t bothered to implement a version of the checksum program entirely in assembly; Listings 1.2 and 1.6 call assembly subroutines that handle the time-critical operations, but C is still used for checking command-line parameters, opening files, printing, and the like.
If you were to implement any of the listings in this chapter entirely in hand-optimized assembly, I suppose you might get a performance improvement of a few percent—but I rather doubt you’d get even that much, and you’d sure as heck spend an awful lot of time for whatever meager improvement does result. Let C do what it does well, and use assembly only when it makes a perceptible difference.
Besides, we don’t want to optimize until the design is refined to our satisfaction, and that won’t be the case until we’ve thought about other approaches.
Listing 1.4 is good, but let’s see if there are other—perhaps less obvious—ways to get the same results faster. Let’s start by considering why Listing 1.4 is so much better than Listing 1.1. Like read(), getc() calls DOS to read from the file; the speed improvement of Listing 1.4 over Listing 1.1 occurs because getc() reads many bytes at once via DOS, then manages those bytes for us. That’s faster than reading them one at a time using read()—but there’s no reason to think that it’s faster than having our program read and manage blocks itself. Easier, yes, but not faster.

Consider this: Every invocation of getc() involves pushing a parameter, executing a call to the C library function, getting the parameter (in the C library code), looking up information about the desired stream, unbuffering the next byte from the stream, and returning to the calling code. That takes a considerable amount of time, especially by contrast with simply maintaining a pointer to a buffer and whizzing through the data in the buffer inside a single loop.
There are four reasons that many programmers would give for not trying to improve on Listing 1.4:
1. The code is already fast enough.
2. The code works, and some people are content with code that works, even when it’s slow enough to be annoying.
3. The C library is written in optimized assembly, and it’s likely to be faster than any code that the average programmer could write to perform essentially the same function.
4. The C library conveniently handles the buffering of file data, and it would be a nuisance to have to implement that capability.
I’ll ignore the first reason, both because performance is no longer an issue if the code is fast enough and because the current application does not run fast enough—13 seconds is a long time. (Stop and wait for 13 seconds while you’re doing something intense, and you’ll see just how long it is.)
The second reason is the hallmark of the mediocre programmer. Know when optimization matters—and then optimize when it does!
The third reason is often fallacious. C library functions are not always written in assembly, nor are they always particularly well-optimized. (In fact, they’re often written for portability, which has nothing to do with optimization.) What’s more, they’re general-purpose functions, and often can be outperformed by well-but-not-brilliantly-written code that is well-matched to a specific task. As an example, consider Listing 1.5, which uses internal buffering to handle blocks of bytes at a time. Table 1.1 shows that Listing 1.5 is 2.5 to 4 times faster than Listing 1.4 (and as much as 49 times faster than Listing 1.1!), even though it uses no assembly at all.
Clearly, you can do well by using special-purpose C code in place of a C library function—if you have a thorough understanding of how the C library function operates and exactly what your application needs done. Otherwise, you’ll end up rewriting C library functions in C, which makes no sense at all.
LISTING 1.5 L1-5.C
```c
/*
 * Program to calculate the 16-bit checksum of the stream of bytes
 * from the specified file. Buffers the bytes internally, rather
 * than letting C or DOS do the work.
 */
#include <stdio.h>
#include <fcntl.h>
#include <alloc.h>   /* alloc.h for Borland, malloc.h for Microsoft */

#define BUFFER_SIZE  0x8000   /* 32Kb data buffer */

main(int argc, char *argv[]) {
   int Handle;
   unsigned int Checksum;
   unsigned char *WorkingBuffer, *WorkingPtr;
   int WorkingLength, LengthCount;

   if ( argc != 2 ) {
      printf("usage: checksum filename\n");
      exit(1);
   }
   if ( (Handle = open(argv[1], O_RDONLY | O_BINARY)) == -1 ) {
      printf("Can't open file: %s\n", argv[1]);
      exit(1);
   }

   /* Get memory in which to buffer the data */
   if ( (WorkingBuffer = malloc(BUFFER_SIZE)) == NULL ) {
      printf("Can't get enough memory\n");
      exit(1);
   }

   /* Initialize the checksum accumulator */
   Checksum = 0;

   /* Process the file in BUFFER_SIZE chunks */
   do {
      if ( (WorkingLength = read(Handle, WorkingBuffer,
            BUFFER_SIZE)) == -1 ) {
         printf("Error reading file %s\n", argv[1]);
         exit(1);
      }
      /* Checksum this chunk */
      WorkingPtr = WorkingBuffer;
      LengthCount = WorkingLength;
      while ( LengthCount-- ) {
         /* Add each byte in turn into the checksum accumulator */
         Checksum += (unsigned int) *WorkingPtr++;
      }
   } while ( WorkingLength );

   /* Report the result */
   printf("The checksum is: %u\n", Checksum);
   exit(0);
}
```
That brings us to the fourth reason: avoiding an internal-buffered implementation like Listing 1.5 because of the difficulty of coding such an approach. True, it is easier to let a C library function do the work, but it’s not all that hard to do the buffering internally. The key is the concept of handling data in restartable blocks; that is, reading a chunk of data, operating on the data until it runs out, suspending the operation while more data is read in, and then continuing as though nothing had happened.
In Listing 1.5 the restartable block implementation is pretty simple because checksumming works with one byte at a time, forgetting about each byte immediately after adding it into the total. Listing 1.5 reads in a block of bytes from the file, checksums the bytes in the block, and gets another block, repeating the process until the entire file has been processed. In Chapter 5, we’ll see a more complex restartable block implementation, involving searching for text strings.
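As a preview of that idea, here is a minimal sketch of a block-oriented string search. It is my own illustration with invented names (FileContains, BLOCK_SIZE), not the Chapter 5 code; the point is simply that a little carried-over state, in this case the tail of the previous block, lets the search resume cleanly so matches that straddle a block boundary aren’t missed.

```c
/*
 * Sketch only: search a file for a text string using restartable blocks.
 * Names are illustrative, not from the book's listings. Assumes the
 * pattern is shorter than BLOCK_SIZE.
 */
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 0x8000

int FileContains(FILE *fp, const char *Pattern)
{
   static unsigned char Buffer[BLOCK_SIZE];
   size_t PatternLength = strlen(Pattern);
   size_t Carry = 0;            /* bytes carried over from the previous block */
   size_t BytesRead, Total, i;

   if ( PatternLength == 0 )
      return 1;

   while ( (BytesRead = fread(Buffer + Carry, 1, BLOCK_SIZE - Carry, fp)) > 0 ) {
      Total = Carry + BytesRead;

      /* Scan every position at which a full pattern could start */
      for ( i = 0; i + PatternLength <= Total; i++ ) {
         if ( memcmp(Buffer + i, Pattern, PatternLength) == 0 )
            return 1;            /* found it */
      }

      /* Keep the last PatternLength-1 bytes so a match that straddles
         the block boundary is still seen on the next pass */
      Carry = (Total >= PatternLength - 1) ? PatternLength - 1 : Total;
      memmove(Buffer, Buffer + Total - Carry, Carry);
   }
   return 0;                     /* not found */
}
```

The only state that has to survive from one block to the next is that small tail, which is the essence of the restartable-block approach.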
At any rate, Listing 1.5 isn’t much more complicated than Listing 1.4—and it’s a lot faster. Always consider the alternatives; a bit of clever thinking and program redesign can go a long way.
I have said time and again that optimization is pointless until the design is settled. When that time comes, however, optimization can indeed make a significant difference. Table 1.1 indicates that the optimized version of Listing 1.5 produced by Microsoft C outperforms an unoptimized version of the same code by more than 60 percent. What’s more, a mostly-assembly version of Listing 1.5, shown in Listings 1.6 and 1.7, outperforms even the best-optimized C version of Listing 1.5 by 26 percent. These are considerable improvements, well worth pursuing—once the design has been maxed out.
LISTING 1.6 L1-6.C
```c
/*
 * Program to calculate the 16-bit checksum of the stream of bytes
 * from the specified file. Buffers the bytes internally, rather
 * than letting C or DOS do the work, with the time-critical
 * portion of the code written in optimized assembler.
 */
#include <stdio.h>
#include <fcntl.h>
#include <alloc.h>   /* alloc.h for Borland, malloc.h for Microsoft */

#define BUFFER_SIZE  0x8000   /* 32K data buffer */

main(int argc, char *argv[]) {
   int Handle;
   unsigned int Checksum;
   unsigned char *WorkingBuffer;
   int WorkingLength;

   if ( argc != 2 ) {
      printf("usage: checksum filename\n");
      exit(1);
   }
   if ( (Handle = open(argv[1], O_RDONLY | O_BINARY)) == -1 ) {
      printf("Can't open file: %s\n", argv[1]);
      exit(1);
   }

   /* Get memory in which to buffer the data */
   if ( (WorkingBuffer = malloc(BUFFER_SIZE)) == NULL ) {
      printf("Can't get enough memory\n");
      exit(1);
   }

   /* Initialize the checksum accumulator */
   Checksum = 0;

   /* Process the file in 32K chunks */
   do {
      if ( (WorkingLength = read(Handle, WorkingBuffer,
            BUFFER_SIZE)) == -1 ) {
         printf("Error reading file %s\n", argv[1]);
         exit(1);
      }
      /* Checksum this chunk if there's anything in it */
      if ( WorkingLength )
         ChecksumChunk(WorkingBuffer, WorkingLength, &Checksum);
   } while ( WorkingLength );

   /* Report the result */
   printf("The checksum is: %u\n", Checksum);
   exit(0);
}
```
LISTING 1.7 L1-7.ASM
```asm
; Assembler subroutine to perform a 16-bit checksum on a block of
; bytes 1 to 64K in size. Adds checksum for block into passed-in
; checksum.
;
; Call as:
;     void ChecksumChunk(unsigned char *Buffer,
;     unsigned int BufferLength, unsigned int *Checksum);
;
; where:
;     Buffer = pointer to start of block of bytes to checksum
;     BufferLength = # of bytes to checksum (0 means 64K, not 0)
;     Checksum = pointer to unsigned int variable checksum is
;     stored in
;
; Parameter structure:
;
Parms   struc
                dw      ?               ;pushed BP
                dw      ?               ;return address
Buffer          dw      ?
BufferLength    dw      ?
Checksum        dw      ?
Parms   ends
;
        .model  small
        .code
        public  _ChecksumChunk
_ChecksumChunk  proc    near
        push    bp
        mov     bp,sp
        push    si                      ;save C's register variable
;
        cld                             ;make LODSB increment SI
        mov     si,[bp+Buffer]          ;point to buffer
        mov     cx,[bp+BufferLength]    ;get buffer length
        mov     bx,[bp+Checksum]        ;point to checksum variable
        mov     dx,[bx]                 ;get the current checksum
        sub     ah,ah                   ;so AX will be a 16-bit value after LODSB
ChecksumLoop:
        lodsb                           ;get the next byte
        add     dx,ax                   ;add it into the checksum total
        loop    ChecksumLoop            ;continue for all bytes in block
        mov     [bx],dx                 ;save the new checksum
;
        pop     si                      ;restore C's register variable
        pop     bp
        ret
_ChecksumChunk  endp
        end
```
Note that in Table 1.1, optimization makes little difference except in the case of Listing 1.5, where the design has been refined considerably. Execution time in the other cases is dominated by time spent in DOS and/or the C library, so optimization of the code you write is pretty much irrelevant. What’s more, while the approximately two-times improvement we got by optimizing is not to be sneezed at, it pales against the up-to-50-times improvement we got by redesigning.
By the way, the execution times even of Listings 1.6 and 1.7 are dominated by DOS disk access times. If a disk cache is enabled and the file to be checksummed is already in the cache, the assembly version is three times as fast as the C version. In other words, the inherent nature of this application limits the performance improvement that can be obtained via assembly. In applications that are more CPU-intensive and less disk-bound, particularly those applications in which string instructions and/or unrolled loops can be used effectively, assembly tends to be considerably faster relative to C than it is in this very specific case.
Don’t get hung up on optimizing compilers or assembly language—the best optimizer is between your ears.
All this is basically a way of saying: Know where you’re going, know the territory, and know when it matters.
What have we learned? Don’t let other people’s code—even DOS—do the work for you when speed matters, at least not without knowing what that code does and how well it performs.
Optimization only matters after you’ve done your part on the program design end. Consider the ratios on the vertical axis of Table 1.1, which show that optimization is almost totally wasted in the checksumming application without an efficient design. Optimization is no panacea. Table 1.1 shows a two-times improvement from optimization—and a 50-times-plus improvement from redesign. The longstanding debate about which C compiler optimizes code best doesn’t matter quite so much in light of Table 1.1, does it? Your organic optimizer matters much more than your compiler’s optimizer, and there’s always assembly for those usually small sections of code where performance really matters.
This chapter has presented a quick step-by-step overview of the design process. I’m not claiming that this is the only way to create high-performance code; it’s just an approach that works for me. Create code however you want, but never forget that design matters more than detailed optimization. Never stop looking for inventive ways to boost performance—and never waste time speeding up code that doesn’t need to be sped up.
I’m going to focus on specific ways to create high-performance code from now on. In Chapter 5, we’ll continue to look at restartable blocks and internal buffering, in the form of a program that searches files for text strings.
As I showed in the previous chapter, optimization is by no means always a matter of “dropping into assembly.” In fact, in performance tuning high-level language code, assembly should be used rarely, and then only after you’ve made sure a badly chosen or clumsily implemented algorithm isn’t eating you alive. Certainly if you use assembly at all, make absolutely sure you use it right. The potential of assembly code to run slowly is poorly understood by a lot of people, but that potential is great, especially in the hands of the ignorant.
Truly great optimization, however, happens only at the assembly level, and it happens in response to a set of dynamics that is totally different from that governing C/C++ or Pascal optimization. I’ll be speaking of assembly-level optimization time and again in this book, but when I do, I think it will be helpful if you have a grasp of those assembly-specific dynamics.
As usual, the best way to wade in is to present a real-world example.
Some time ago, I was asked to work over a critical assembly subroutine in order to make it run as fast as possible. The task of the subroutine was to construct a nibble out of four bits read from different bytes, rotating and combining the bits so that they ultimately ended up neatly aligned in bits 3-0 of a single byte. (In case you’re curious, the object was to construct a 16-color pixel from bits scattered over 4 bytes.) I examined the subroutine line by line, saving a cycle here and a cycle there, until the code truly seemed to be optimized. When I was done, the key part of the code looked something like this:
```asm
LoopTop:
        lodsb                   ;get the next byte to extract a bit from
        and     al,ah           ;isolate the bit we want
        rol     al,cl           ;rotate the bit into the desired position
        or      bl,al           ;insert the bit into the final nibble
        dec     cx              ;the next bit goes 1 place to the right
        dec     dx              ;count down the number of bits
        jnz     LoopTop         ;process the next bit, if any
```
Now, it’s hard to write code that’s much faster than seven instructions, only one of which accesses memory, and most programmers would have called it a day at this point. Still, something bothered me, so I spent a bit of time going over the code again. Suddenly, the answer struck me—the code was rotating each bit into place separately, so that a multibit rotation was being performed every time through the loop, for a total of four separate time-consuming multibit rotations!
While the instructions themselves were individually optimized, the overall approach did not make the best possible use of the instructions.
I changed the code to the following:
```asm
LoopTop:
        lodsb                   ;get the next byte to extract a bit from
        and     al,ah           ;isolate the bit we want
        or      bl,al           ;insert the bit into the final nibble
        rol     bl,1            ;make room for the next bit
        dec     dx              ;count down the number of bits
        jnz     LoopTop         ;process the next bit, if any
        rol     bl,cl           ;rotate all four bits into their final
                                ; positions at the same time
```
This moved the costly multibit rotation out of the loop so that it was performed just once, rather than four times. While the code may not look much different from the original, and in fact still contains exactly the same number of instructions, the performance of the entire subroutine improved by about 10 percent from just this one change. (Incidentally, that wasn’t the end of the optimization; I eliminated the DEC and JNZ instructions by expanding the four iterations of the loop—but that’s a tale for another chapter.)
The point is this: To write truly superior assembly programs, you need to know what the various instructions do and which instructions execute fastest…and more. You must also learn to look at your programming problems from a variety of perspectives so that you can put those fast instructions to work in the most effective ways.
Is it really so hard as all that to write good assembly code for the PC? Yes! Thanks to the decidedly quirky nature of the x86 family CPUs, assembly language differs fundamentally from other languages, and is undeniably harder to work with. On the other hand, the potential of assembly code is much greater than that of other languages, as well.
To understand why this is so, consider how a program gets written. A programmer examines the requirements of an application, designs a solution at some level of abstraction, and then makes that design come alive in a code implementation. If not handled properly, the transformation that takes place between conception and implementation can reduce performance tremendously; for example, a programmer who implements a routine to search a list of 100,000 sorted items with a linear rather than binary search will end up with a disappointingly slow program.
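To put a number on that example: a linear scan of 100,000 items examines up to 100,000 elements, while a binary search of the same sorted list needs at most about 17 comparisons. Here is a quick sketch of the two approaches (my own illustration, with invented names):

```c
/* Sketch: finding a value in a sorted array of N items. A linear scan
   examines up to N elements; a binary search examines at most about
   log2(N), roughly 17 comparisons instead of up to 100,000 for the
   list mentioned in the text. */
int LinearSearch(const int *List, int N, int Value)
{
   int i;
   for ( i = 0; i < N; i++ )
      if ( List[i] == Value )
         return i;
   return -1;                   /* not found */
}

int BinarySearch(const int *List, int N, int Value)
{
   int Low = 0, High = N - 1;
   while ( Low <= High ) {
      int Mid = Low + (High - Low) / 2;
      if ( List[Mid] == Value )
         return Mid;
      if ( List[Mid] < Value )
         Low = Mid + 1;          /* target is in the upper half */
      else
         High = Mid - 1;         /* target is in the lower half */
   }
   return -1;                    /* not found */
}
```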
No matter how well an implementation is derived from the corresponding design, however, high-level languages like C/C++ and Pascal inevitably introduce additional transformation inefficiencies, as shown in Figure 2.1.
The process of turning a design into executable code by way of a high-level language involves two transformations: one performed by the programmer to generate source code, and another performed by the compiler to turn source code into machine language instructions. Consequently, the machine language code generated by compilers is usually less than optimal given the requirements of the original design.
High-level languages provide artificial environments that lend themselves relatively well to human programming skills, in order to ease the transition from design to implementation. The price for this ease of implementation is a considerable loss of efficiency in transforming source code into machine language. This is particularly true given that the x86 family in real and 16-bit protected mode, with its specialized memory-addressing instructions and segmented memory architecture, does not lend itself particularly well to compiler design. Even the 32-bit mode of the 386 and its successors, with their more powerful addressing modes, offer fewer registers than compilers would like.
Assembly, on the other hand, is simply a human-oriented representation of machine language. As a result, assembly provides a difficult programming environment—the bare hardware and systems software of the computer—but properly constructed assembly programs suffer no transformation loss, as shown in Figure 2.2.
Only one transformation is required when creating an assembler program, and that single transformation is completely under the programmer’s control. Assemblers perform no transformation from source code to machine language; instead, they merely map assembler instructions to machine language instructions on a one-to-one basis. As a result, the programmer is able to produce machine language code that’s precisely tailored to the needs of each task a given application requires.
The key, of course, is the programmer, since in assembly the programmer must essentially perform the transformation from the application specification to machine language entirely on his or her own. (The assembler merely handles the direct translation from assembly to machine language.)
The first part of assembly language optimization, then, is self. An assembler is nothing more than a tool to let you design machine-language programs without having to think in hexadecimal codes. So assembly language programmers—unlike all other programmers—must take full responsibility for the quality of their code. Since assemblers provide little help at any level higher than the generation of machine language, the assembly programmer must be capable both of coding any programming construct directly and of controlling the PC at the lowest practical level—the operating system, the BIOS, even the hardware where necessary. High-level languages handle most of this transparently to the programmer, but in assembly everything is fair—and necessary—game, which brings us to another aspect of assembly optimization: knowledge.
In the PC world, you can never have enough knowledge, and every item you add to your store will make your programs better. Thorough familiarity with both the operating system APIs and BIOS interfaces is important; since those interfaces are well-documented and reasonably straightforward, my advice is to get a good book or two and bring yourself up to speed. Similarly, familiarity with the PC hardware is required. While that topic covers a lot of ground—display adapters, keyboards, serial ports, printer ports, timer and DMA channels, memory organization, and more—most of the hardware is well-documented, and articles about programming major hardware components appear frequently in the literature, so this sort of knowledge can be acquired readily enough.
The single most critical aspect of the hardware, and the one about which it is hardest to learn, is the CPU. The x86 family CPUs have a complex, irregular instruction set, and, unlike most processors, they are neither straightforward nor well-documented where true code performance is concerned. What’s more, assembly is so difficult to learn that most articles and books that present assembly code settle for code that just works, rather than code that pushes the CPU to its limits. In fact, since most articles and books are written for inexperienced assembly programmers, there is very little information of any sort available about how to generate high-quality assembly code for the x86 family CPUs. As a result, knowledge about programming them effectively is by far the hardest knowledge to gather. A good portion of this book is devoted to seeking out such knowledge.
Be forewarned, though: No matter how much you learn about programming the PC in assembly, there’s always more to discover.
Is the never-ending collection of information all there is to assembly optimization, then? Hardly. Knowledge is simply a necessary base on which to build. Let’s take a moment to examine the objectives of good assembly programming, and the remainder of the forces that act on assembly optimization will fall into place.
Basically, there are only two possible objectives to high-performance assembly programming: Given the requirements of the application, keep to a minimum either the number of processor cycles the program takes to run, or the number of bytes in the program, or some combination of both. We’ll look at ways to achieve both objectives, but we’ll more often be concerned with saving cycles than saving bytes, for the PC generally offers relatively more memory than it does processing horsepower. In fact, we’ll find that two-to-three times performance improvements over already tight assembly code are often possible if we’re willing to spend additional bytes in order to save cycles. It’s not always desirable to use such techniques to speed up code, due to the heavy memory requirements—but it is almost always possible.
You will notice that my short list of objectives for high-performance assembly programming does not include traditional objectives such as easy maintenance and speed of development. Those are indeed important considerations—to persons and companies that develop and distribute software. People who actually buy software, on the other hand, care only about how well that software performs, not how it was developed nor how it is maintained. These days, developers spend so much time focusing on such admittedly important issues as code maintainability and reusability, source code control, choice of development environment, and the like that they often forget rule #1: From the user’s perspective, performance is fundamental.
Comment your code, design it carefully, and write non-time-critical portions in a high-level language, if you wish—but when you write the portions that interact with the user and/or affect response time, performance must be your paramount objective, and assembly is the path to that goal.
Knowledge of the sort described earlier is absolutely essential to fulfilling either of the objectives of assembly programming. What that knowledge doesn’t do by itself is meet the need to write code that both performs to the requirements of the application at hand and also operates as efficiently as possible in the PC environment. Knowledge makes that possible, but your programming instincts make it happen. And it is that intuitive, on-the-fly integration of a program specification and a sea of facts about the PC that is the heart of Zen-class assembly optimization.
As with Zen of any sort, mastering that Zen of assembly language is more a matter of learning than of being taught. You will have to find your own path of learning, although I will start you on your way with this book. The subtle facts and examples I provide will help you gain the necessary experience, but you must continue the journey on your own. Each program you create will expand your programming horizons and increase the options available to you in meeting the next challenge. The ability of your mind to find surprising new and better ways to craft superior code from a concept—the flexible mind, if you will—is the linchpin of good assembler code, and you will develop this skill only by doing.
Never underestimate the importance of the flexible mind. Good assembly code is better than good compiled code. Many people would have you believe otherwise, but they’re wrong. That doesn’t mean that high-level languages are useless; far from it. High-level languages are the best choice for the majority of programmers, and for the bulk of the code of most applications. When the best code—the fastest or smallest code possible—is needed, though, assembly is the only way to go.
Simple logic dictates that no compiler can know as much about what a piece of code needs to do or adapt as well to those needs as the person who wrote the code. Given that superior information and adaptability, an assembly language programmer can generate better code than a compiler, all the more so given that compilers are constrained by the limitations of high-level languages and by the process of transformation from high-level to machine language. Consequently, carefully optimized assembly is not just the language of choice but the only choice for the 1 percent to 10 percent of code—usually consisting of small, well-defined subroutines—that determines overall program performance, and it is the only choice for code that must be as compact as possible, as well. In the run-of-the-mill, non-time-critical portions of your programs, it makes no sense to waste time and effort on writing optimized assembly code—concentrate your efforts on loops and the like instead; but in those areas where you need the finest code quality, accept no substitutes.
Note that I said that an assembly programmer can generate better code than a compiler, not will generate better code. While it is true that good assembly code is better than good compiled code, it is also true that bad assembly code is often much worse than bad compiled code; since the assembly programmer has so much control over the program, he or she has virtually unlimited opportunities to waste cycles and bytes. The sword cuts both ways, and good assembly code requires more, not less, forethought and planning than good code written in a high-level language.
The gist of all this is simply that good assembly programming is done in the context of a solid overall framework unique to each program, and the flexible mind is the key to creating that framework and holding it together.
To summarize, the skill of assembly language optimization is a combination of knowledge, perspective, and a way of thought that makes possible the genesis of absolutely the fastest or the smallest code. With that in mind, what should the first step be? Development of the flexible mind is an obvious step. Still, the flexible mind is no better than the knowledge at its disposal. The first step in the journey toward mastering optimization at that exalted level, then, would seem to be learning how to learn.
When you’re pushing the envelope in writing optimized PC code, you’re likely to become more than a little compulsive about finding approaches that let you wring more speed from your computer. In the process, you’re bound to make mistakes, which is fine—as long as you watch for those mistakes and learn from them.
A case in point: A few years back, I came across an article about 8088 assembly language called “Optimizing for Speed.” Now, “optimize” is not a word to be used lightly; Webster’s Ninth New Collegiate Dictionary defines optimize as “to make as perfect, effective, or functional as possible,” which certainly leaves little room for error. The author had, however, chosen a small, well-defined 8088 assembly language routine to refine, consisting of about 30 instructions that did nothing more than expand 8 bits to 16 bits by duplicating each bit.
The author of “Optimizing” had clearly fine-tuned the code with care, examining alternative instruction sequences and adding up cycles until he arrived at an implementation he calculated to be nearly 50 percent faster than the original routine. In short, he had used all the information at his disposal to improve his code, and had, as a result, saved cycles by the bushel. There was, in fact, only one slight problem with the optimized version of the routine….
It ran slower than the original version!
As diligent as the author had been, he had nonetheless committed a cardinal sin of x86 assembly language programming: He had assumed that the information available to him was both correct and complete. While the execution times provided by Intel for its processors are indeed correct, they are incomplete; the other—and often more important—part of code performance is instruction fetch time, a topic to which I will return in later chapters.
Had the author taken the time to measure the true performance of his code, he wouldn’t have put his reputation on the line with relatively low-performance code. What’s more, had he actually measured the performance of his code and found it to be unexpectedly slow, curiosity might well have led him to experiment further and thereby add to his store of reliable information about the CPU.
There you have an important tenet of assembly language optimization: After crafting the best code possible, check it in action to see if it’s really doing what you think it is. If it’s not behaving as expected, that’s all to the good, since solving mysteries is the path to knowledge. You’ll learn more in this way, I assure you, than from any manual or book on assembly language.
Assume nothing. I cannot emphasize this strongly enough—when you care about performance, do your best to improve the code and then measure the improvement. If you don’t measure performance, you’re just guessing, and if you’re guessing, you’re not very likely to write top-notch code.
Ignorance about true performance can be costly. When I wrote video games for a living, I spent days at a time trying to wring more performance from my graphics drivers. I rewrote whole sections of code just to save a few cycles, juggled registers, and relied heavily on blurry-fast register-to-register shifts and adds. As I was writing my last game, I discovered that the program ran perceptibly faster if I used look-up tables instead of shifts and adds for my calculations. It shouldn’t have run faster, according to my cycle counting, but it did. In truth, instruction fetching was rearing its head again, as it often does, and the fetching of the shifts and adds was taking as much as four times the nominal execution time of those instructions.
Ignorance can also be responsible for considerable wasted effort. I recall a debate in the letters column of one computer magazine about exactly how quickly text can be drawn on a Color/Graphics Adapter (CGA) screen without causing snow. The letter-writers counted every cycle in their timing loops, just as the author in the story that started this chapter had. Like that author, the letter-writers had failed to take the prefetch queue into account. In fact, they had neglected the effects of video wait states as well, so the code they discussed was actually much slower than their estimates. The proper test would, of course, have been to run the code to see if snow resulted, since the only true measure of code performance is observing it in action.
Clearly, one key to mastering Zen-class optimization is a tool with which to measure code performance. The most accurate way to measure performance is with expensive hardware, but reasonable measurements at no cost can be made with the PC’s 8253 timer chip, which counts at a rate of slightly over 1,000,000 times per second. The 8253 can be started at the beginning of a block of code of interest and stopped at the end of that code, with the resulting count indicating how long the code took to execute with an accuracy of about 1 microsecond. (A microsecond is one millionth of a second, and is abbreviated µs.) To be precise, the 8253 counts once every 838.1 nanoseconds. (A nanosecond is one billionth of a second, and is abbreviated ns.)
Listing 3.1 shows 8253-based timer software, consisting of three
subroutines: ZTimerOn
, ZTimerOff
, and
ZTimerReport
. For the remainder of this book, I’ll refer to
these routines collectively as the “Zen timer.” C-callable versions of
the two Zen timers are presented in Chapter K on the companion
CD-ROM.
LISTING 3.1 PZTIMER.ASM
; The precision Zen timer (PZTIMER.ASM)
;
; Uses the 8253 timer to time the performance of code that takes
; less than about 54 milliseconds to execute, with a resolution
; of better than 10 microseconds.
;
; By Michael Abrash
;
; Externally callable routines:
;
; ZTimerOn: Starts the Zen timer, with interrupts disabled.
;
; ZTimerOff: Stops the Zen timer, saves the timer count,
; times the overhead code, and restores interrupts to the
; state they were in when ZTimerOn was called.
;
; ZTimerReport: Prints the net time that passed between starting
; and stopping the timer.
;
; Note: If longer than about 54 ms passes between ZTimerOn and
; ZTimerOff calls, the timer turns over and the count is
; inaccurate. When this happens, an error message is displayed
; instead of a count. The long-period Zen timer should be used
; in such cases.
;
; Note: Interrupts *MUST* be left off between calls to ZTimerOn
; and ZTimerOff for accurate timing and for detection of
; timer overflow.
;
; Note: These routines can introduce slight inaccuracies into the
; system clock count for each code section timed even if
; timer 0 doesn't overflow. If timer 0 does overflow, the
; system clock can become slow by virtually any amount of
; time, since the system clock can't advance while the
; precision timer is timing. Consequently, it's a good idea
; to reboot at the end of each timing session. (The
; battery-backed clock, if any, is not affected by the Zen
; timer.)
;
; All registers, and all flags except the interrupt flag, are
; preserved by all routines. Interrupts are enabled and then disabled
; by ZTimerOn, and are restored by ZTimerOff to the state they were
; in when ZTimerOn was called.
;
Code segment word public 'CODE'
assume cs:Code, ds:nothing
public ZTimerOn, ZTimerOff, ZTimerReport
;
; Base address of the 8253 timer chip.
;
BASE_8253 equ 40h
;
; The address of the timer 0 count registers in the 8253.
;
TIMER_0_8253 equ BASE_8253 + 0
;
; The address of the mode register in the 8253.
;
MODE_8253 equ BASE_8253 + 3
;
; The address of Operation Command Word 3 in the 8259 Programmable
; Interrupt Controller (PIC) (write only, and writable only when
; bit 4 of the byte written to this address is 0 and bit 3 is 1).
;
OCW3 equ 20h
;
; The address of the Interrupt Request register in the 8259 PIC
; (read only, and readable only when bit 1 of OCW3 = 1 and bit 0
; of OCW3 = 0).
;
IRR equ 20h
;
; Macro to emulate a POPF instruction in order to fix the bug in some
; 80286 chips which allows interrupts to occur during a POPF even when
; interrupts remain disabled.
;
MPOPF macro
local p1, p2
jmp short p2
p1: iret ; jump to pushed address & pop flags
p2: push cs ; construct far return address to
call p1 ; the next instruction
endm
;
; Macro to delay briefly to ensure that enough time has elapsed
; between successive I/O accesses so that the device being accessed
; can respond to both accesses even on a very fast PC.
;
DELAY macro
jmp $+2
jmp $+2
jmp $+2
endm
OriginalFlags db ? ; storage for upper byte of
; FLAGS register when
; ZTimerOn called
TimedCount dw ? ; timer 0 count when the timer
; is stopped
ReferenceCount dw ? ; number of counts required to
; execute timer overhead code
OverflowFlag db ? ; used to indicate whether the
; timer overflowed during the
; timing interval
;
; String printed to report results.
;
OutputStr label byte
db 0dh, 0ah, 'Timed count: ', 5 dup (?)
ASCIICountEnd label byte
db ' microseconds', 0dh, 0ah
db '$'
;
; String printed to report timer overflow.
;
OverflowStr label byte
db 0dh, 0ah
db '****************************************************'
db 0dh, 0ah
db '* The timer overflowed, so the interval timed was *'
db 0dh, 0ah
db '* too long for the precision timer to measure. *'
db 0dh, 0ah
db '* Please perform the timing test again with the *'
db 0dh, 0ah
db '* long-period timer. *'
db 0dh, 0ah
db '****************************************************'
db 0dh, 0ah
db '$'
; ********************************************************************
; * Routine called to start timing. *
; ********************************************************************
ZTimerOn proc near
;
; Save the context of the program being timed.
;
push ax
pushf
pop ax ; get flags so we can keep
; interrupts off when leaving
; this routine
mov cs:[OriginalFlags],ah ; remember the state of the
; Interrupt flag
and ah,0fdh ; set pushed interrupt flag
; to 0
push ax
;
; Turn on interrupts, so the timer interrupt can occur if it's
; pending.
;
sti
;
; Set timer 0 of the 8253 to mode 2 (divide-by-N), to cause
; linear counting rather than count-by-two counting. Also
; leaves the 8253 waiting for the initial timer 0 count to
; be loaded.
;
mov al,00110100b ;mode 2
out MODE_8253,al
;
; Set the timer count to 0, so we know we won't get another
; timer interrupt right away.
; Note: this introduces an inaccuracy of up to 54 ms in the system
; clock count each time it is executed.
;
DELAY
sub al,al
out TIMER_0_8253,al ;lsb
DELAY
out TIMER_0_8253,al ;msb
;
; Wait before clearing interrupts to allow the interrupt generated
; when switching from mode 3 to mode 2 to be recognized. The delay
; must be at least 210 ns long to allow time for that interrupt to
; occur. Here, 10 jumps are used for the delay to ensure that the
; delay time will be more than long enough even on a very fast PC.
;
rept 10
jmp $+2
endm
;
; Disable interrupts to get an accurate count.
;
cli
;
; Set the timer count to 0 again to start the timing interval.
;
mov al,00110100b ; set up to load initial
out MODE_8253,al ; timer count
DELAY
sub al,al
out TIMER_0_8253,al ; load count lsb
DELAY
out TIMER_0_8253,al ; load count msb
;
; Restore the context and return.
;
MPOPF ; keeps interrupts off
pop ax
ret
ZTimerOn endp
;********************************************************************
;* Routine called to stop timing and get count. *
;********************************************************************
ZTimerOff proc near
;
; Save the context of the program being timed.
;
push ax
push cx
pushf
;
; Latch the count.
;
mov al,00000000b ; latch timer 0
out MODE_8253,al
;
; See if the timer has overflowed by checking the 8259 for a pending
; timer interrupt.
;
mov al,00001010b ; OCW3, set up to read
out OCW3,al ; Interrupt Request register
DELAY
in al,IRR ; read Interrupt Request
; register
and al,1 ; set AL to 1 if IRQ0 (the
; timer interrupt) is pending
mov cs:[OverflowFlag],al; store the timer overflow
; status
;
; Allow interrupts to happen again.
;
sti
;
; Read out the count we latched earlier.
;
in al,TIMER_0_8253 ; least significant byte
DELAY
mov ah,al
in al,TIMER_0_8253 ; most significant byte
xchg ah,al
neg ax ; convert from countdown
; remaining to elapsed
; count
mov cs:[TimedCount],ax
; Time a zero-length code fragment, to get a reference for how
; much overhead this routine has. Time it 16 times and average it,
; for accuracy, rounding the result.
;
mov cs:[ReferenceCount],0
mov cx,16
cli ; interrupts off to allow a
; precise reference count
RefLoop:
call ReferenceZTimerOn
call ReferenceZTimerOff
loop RefLoop
sti
add cs:[ReferenceCount],8; total + (0.5 * 16)
mov cl,4
shr cs:[ReferenceCount],cl; (total) / 16 + 0.5
;
; Restore original interrupt state.
;
pop ax ; retrieve flags when called
mov ch,cs:[OriginalFlags] ; get back the original upper
; byte of the FLAGS register
and ch,not 0fdh ; only care about original
; interrupt flag...
and ah,0fdh ; ...keep all other flags in
; their current condition
or ah,ch ; make flags word with original
; interrupt flag
push ax ; prepare flags to be popped
;
; Restore the context of the program being timed and return to it.
;
MPOPF ; restore the flags with the
; original interrupt state
pop cx
pop ax
ret
ZTimerOff endp
;
; Called by ZTimerOff to start timer for overhead measurements.
;
ReferenceZTimerOn proc near
;
; Save the context of the program being timed.
;
push ax
pushf ; interrupts are already off
;
; Set timer 0 of the 8253 to mode 2 (divide-by-N), to cause
; linear counting rather than count-by-two counting.
;
mov al,00110100b ; set up to load
out MODE_8253,al ; initial timer count
DELAY
;
; Set the timer count to 0.
;
sub al,al
out TIMER_0_8253,al; load count lsb
DELAY
out TIMER_0_8253,al ; load count msb
;
; Restore the context of the program being timed and return to it.
;
MPOPF
pop ax
ret
ReferenceZTimerOn endp
;
; Called by ZTimerOff to stop timer and add result to ReferenceCount
; for overhead measurements.
;
ReferenceZTimerOff proc near
;
; Save the context of the program being timed.
;
push ax
push cx
pushf
;
; Latch the count and read it.
;
mov al,00000000b ; latch timer 0
out MODE_8253,al
DELAY
in al,TIMER_0_8253 ; lsb
DELAY
mov ah,al
in al,TIMER_0_8253 ; msb
xchg ah,al
neg ax ; convert from countdown
; remaining to amount
; counted down
add cs:[ReferenceCount],ax
;
; Restore the context of the program being timed and return to it.
;
MPOPF
pop cx
pop ax
ret
ReferenceZTimerOff endp
; ********************************************************************
; * Routine called to report timing results. *
; ********************************************************************
ZTimerReport proc near
pushf
push ax
push bx
push cx
push dx
push si
push ds
;
push cs ; DOS functions require that DS point
pop ds ; to text to be displayed on the screen
assume ds:Code
;
; Check for timer 0 overflow.
;
cmp [OverflowFlag],0
jz PrintGoodCount
mov dx,offset OverflowStr
mov ah,9
int 21h
jmp short EndZTimerReport
;
; Convert net count to decimal ASCII in microseconds.
;
PrintGoodCount:
mov ax,[TimedCount]
sub ax,[ReferenceCount]
mov si,offset ASCIICountEnd - 1
;
; Convert count to microseconds by multiplying by .8381.
;
mov dx, 8381
mul dx
mov bx, 10000
div bx ;* .8381 = * 8381 / 10000
;
; Convert time in microseconds to 5 decimal ASCII digits.
;
mov bx, 10
mov cx, 5
CTSLoop:
sub dx, dx
div bx
add dl,'0'
mov [si],dl
dec si
loop CTSLoop
;
; Print the results.
;
mov ah, 9
mov dx, offset OutputStr
int 21h
;
EndZTimerReport:
pop ds
pop si
pop dx
pop cx
pop bx
pop ax
MPOPF
ret
ZTimerReport endp
Code ends
end
We’re going to spend the rest of this chapter seeing what the Zen timer can do, examining how it works, and learning how to use it. I’ll be using the Zen timer again and again over the course of this book, so it’s essential that you learn what the Zen timer can do and how to use it. On the other hand, it is by no means essential that you understand exactly how the Zen timer works. (Interesting, yes; essential, no.)
In other words, the Zen timer isn’t really part of the knowledge we seek; rather, it’s one tool with which we’ll acquire that knowledge. Consequently, you shouldn’t worry if you don’t fully grasp the inner workings of the Zen timer. Instead, focus on learning how to use it, and you’ll be on the right road.
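In practical terms, using the timer amounts to nothing more than bracketing the code of interest with calls to the timer routines and then asking for a report. Here's a minimal sketch of the calling sequence (assuming the Zen timer routines are linked in and near-callable from the same code segment, as they are in the test-bed program presented later in this chapter):

call ZTimerOn ;start timing
;
; ...the code being measured goes here...
;
call ZTimerOff ;stop timing and record the elapsed count
call ZTimerReport ;display the elapsed time in microseconds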
ZTimerOn
is called at the start of a segment of code to
be timed. ZTimerOn
saves the context of the calling code,
disables interrupts, sets timer 0 of the 8253 to mode 2 (divide-by-N
mode), sets the initial timer count to 0, restores the context of the
calling code, and returns. (I’d like to note that while Intel’s
documentation for the 8253 seems to indicate that a timer won’t reset to
0 until it finishes counting down, in actual practice, timers seem to
reset to 0 as soon as they’re loaded.)
Two aspects of ZTimerOn
are worth discussing further.
One point of interest is that ZTimerOn
disables interrupts.
(ZTimerOff
later restores interrupts to the state they were
in when ZTimerOn
was called.) Were interrupts not disabled
by ZTimerOn
, keyboard, mouse, timer, and other interrupts
could occur during the timing interval, and the time required to service
those interrupts would incorrectly and erratically appear to be part of
the execution time of the code being measured. As a result, code timed
with the Zen timer should not expect any hardware interrupts to occur
during the interval between any call to ZTimerOn
and the
corresponding call to ZTimerOff
, and should not enable
interrupts during that time.
A second interesting point about ZTimerOn
is that it may
introduce some small inaccuracy into the system clock time whenever it
is called. To understand why this is so, we need to examine the way in
which both the 8253 and the PC’s system clock (which keeps the current
time) work.
The 8253 actually contains three timers, as shown in Figure 3.1. All three timers are driven by the system board’s 14.31818 MHz crystal, divided by 12 to yield a 1.19318 MHz clock to the timers, so the timers count once every 838.1 ns. Each of the three timers counts down in a programmable way, generating a signal on its output pin when it counts down to 0. Each timer is capable of being halted at any time via a 0 level on its gate input; when a timer’s gate input is 1, that timer counts constantly. All in all, the 8253’s timers are inherently very flexible timing devices; unfortunately, much of that flexibility depends on how the timers are connected to external circuitry, and in the PC the timers are connected with specific purposes in mind.
Timer 2 drives the speaker, although it can be used for other timing purposes when the speaker is not in use. As shown in Figure 3.1, timer 2 is the only timer with a programmable gate input in the PC; that is, timer 2 is the only timer that can be started and stopped under program control in the manner specified by Intel. On the other hand, the output of timer 2 is connected to nothing other than the speaker. In particular, timer 2 cannot generate an interrupt to get the 8088’s attention.
Timer 1 is dedicated to providing dynamic RAM refresh, and should not be tampered with lest system crashes result.
Finally, timer 0 is used to drive the system clock. As programmed by the BIOS at power-up, every 65,536 (64K) counts, or 54.925 milliseconds, timer 0 generates a rising edge on its output line. (A millisecond is one-thousandth of a second, and is abbreviated ms.) This line is connected to the hardware interrupt 0 (IRQ0) line on the system board, so every 54.925 ms, timer 0 causes hardware interrupt 0 to occur.
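It's worth seeing where the 54.925 ms figure comes from. Timer 0 counts at 14.31818 MHz divided by 12, or about 1,193,182 counts per second, so a full trip through its 16-bit count takes

65,536 counts / 1,193,182 counts per second = 0.054925 seconds, or 54.925 ms

which is exactly the interval between timer interrupts.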
The interrupt vector for IRQ0 is set by the BIOS at power-up time to
point to a BIOS routine, TIMER_INT
, that maintains a
time-of-day count. TIMER_INT
keeps a 16-bit count of IRQ0
interrupts in the BIOS data area at address 0000:046C (all addresses in
this book are given in segment:offset hexadecimal pairs); this count
turns over once an hour (less a few microseconds), and when it does,
TIMER_INT
updates a 16-bit hour count at address 0000:046E
in the BIOS data area. This count is the basis for the current time and
date that DOS supports via functions 2AH (2A hexadecimal) through 2DH
and by way of the DATE and TIME commands.
Each timer channel of the 8253 can operate in any of six modes. Timer 0 normally operates in mode 3: square wave mode. In square wave mode, the initial count is counted down two at a time; when the count reaches zero, the output state is changed. The initial count is again counted down two at a time, and the output state is toggled back when the count reaches zero. The result is a square wave that changes state more slowly than the input clock by a factor of the initial count. In its normal mode of operation, timer 0 generates an output pulse that is low for about 27.5 ms and high for about 27.5 ms; this pulse is sent to the 8259 interrupt controller, and its rising edge generates a timer interrupt once every 54.925 ms.
Square wave mode is not very useful for precision timing because it counts down by two twice per timer interrupt, thereby rendering exact timings impossible. Fortunately, the 8253 offers another timer mode, mode 2 (divide-by-N mode), which is both a good substitute for square wave mode and a perfect mode for precision timing.
Divide-by-N mode counts down by one from the initial count. When the count reaches zero, the timer turns over and starts counting down again without stopping, and a pulse is generated for a single clock period. While the pulse is not held for nearly as long as in square wave mode, it doesn’t matter, since the 8259 interrupt controller is configured in the PC to be edge-triggered and hence cares only about the existence of a pulse from timer 0, not the duration of the pulse. As a result, timer 0 continues to generate timer interrupts in divide-by-N mode, and the system clock continues to maintain good time.
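Incidentally, the value 00110100b written to the mode register in Listing 3.1 is what selects divide-by-N mode for timer 0. Broken down by the 8253's control-word fields, that byte means: bits 7-6 = 00, select timer 0; bits 5-4 = 11, read/load the least significant byte of the count followed by the most significant byte; bits 3-1 = 010, mode 2 (divide-by-N); and bit 0 = 0, binary rather than BCD counting.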
Why not use timer 2 instead of timer 0 for precision timing? After all, timer 2 has a programmable gate input and isn’t used for anything but sound generation. The problem with timer 2 is that its output can’t generate an interrupt; in fact, timer 2 can’t do anything but drive the speaker. We need the interrupt generated by the output of timer 0 to tell us when the count has overflowed, and we will see shortly that the timer interrupt also makes it possible to time much longer periods than the Zen timer shown in Listing 3.1 supports.
In fact, the Zen timer shown in Listing 3.1 can only time intervals of up to about 54 ms in length, since that is the period of time that can be measured by timer 0 before its count turns over and repeats. Fifty-four ms may not seem like a very long time, but even a CPU as slow as the 8088 can perform more than 1,000 divides in 54 ms, and division is the single instruction that the 8088 performs most slowly. If a measured period turns out to be longer than 54 ms (that is, if timer 0 has counted down and turned over), the Zen timer will display a message to that effect. A long-period Zen timer for use in such cases will be presented later in this chapter.
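To put 54 ms in perspective, take Intel's documented execution time for a 16-bit DIV on the 8088, which is roughly 140 to 170 cycles depending on the operands. Even at that worst case,

54,000 µs / (about 165 cycles / 4.77 cycles per µs, or roughly 35 µs per divide) ≈ 1,500 divides

fit comfortably within a single timing interval, so 54 ms is ample for timing short code sequences.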
The Zen timer determines whether timer 0 has turned over by checking
to see whether an IRQ0 interrupt is pending. (Remember, interrupts are
off while the Zen timer runs, so the timer interrupt cannot be
recognized until the Zen timer stops and enables interrupts.) If an IRQ0
interrupt is pending, then timer 0 has turned over and generated a timer
interrupt. Recall that ZTimerOn
initially sets timer 0 to
0, in order to allow for the longest possible period—about 54 ms—before
timer 0 reaches 0 and generates the timer interrupt.
Now we’re ready to look at the ways in which the Zen timer can
introduce inaccuracy into the system clock. Since timer 0 is initially
set to 0 by the Zen timer, and since the system clock ticks only when
timer 0 counts off 54.925 ms and reaches 0 again, an average inaccuracy
of one-half of 54.925 ms, or about 27.5 ms, is incurred each time the
Zen timer is started. In addition, a timer interrupt is generated when
timer 0 is switched from mode 3 to mode 2, advancing the system clock by
up to 54.925 ms, although this only happens the first time the Zen timer
is run after a warm or cold boot. Finally, up to 54.925 ms can again be
lost when ZTimerOff
is called, since that routine again
sets the timer count to zero. Net result: The system clock will run up
to 110 ms (about a ninth of a second) slow each time the Zen timer is
used.
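The arithmetic behind that 110 ms worst case is simply the sum of the two full periods that can be thrown away on any given run:

up to 54.925 ms lost when ZTimerOn zeroes the count + up to 54.925 ms lost when ZTimerOff zeroes it again ≈ 110 ms

(The additional period lost to the mode-3-to-mode-2 switch applies only to the first run after a boot.)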
Potentially far greater inaccuracy can be incurred by timing code that takes longer than about 110 ms to execute. Recall that all interrupts, including the timer interrupt, are disabled while timing code with the Zen timer. The 8259 interrupt controller is capable of remembering at most one pending timer interrupt, so all timer interrupts after the first one during any given Zen timing interval are ignored. Consequently, if a timing interval exceeds 54.9 ms, the system clock effectively stops 54.9 ms after the timing interval starts and doesn’t restart until the timing interval ends, losing time all the while.
The effects on the system time of the Zen timer aren’t a matter for great concern, as they are temporary, lasting only until the next warm or cold boot. Systems that have battery-backed clocks (AT-style machines; that is, virtually all machines in common use) automatically set the correct time whenever the computer is booted, and systems without battery-backed clocks prompt for the correct date and time when booted. Also, repeated use of the Zen timer usually makes the system clock slow by at most a total of a few seconds, unless code that takes much longer than 54 ms to run is timed (in which case the Zen timer will notify you that the code is too long to time).
Nonetheless, it’s a good idea to reboot your computer at the end of each session with the Zen timer in order to make sure that the system clock is correct.
At some point after ZTimerOn
is called,
ZTimerOff
must always be called to mark the end of the
timing interval. ZTimerOff
saves the context of the calling
program, latches and reads the timer 0 count, converts that count from
the countdown value that the timer maintains to the number of counts
elapsed since ZTimerOn
was called, and stores the result.
Immediately after latching the timer 0 count—and before enabling
interrupts—ZTimerOff
checks the 8259 interrupt controller
to see if there is a pending timer interrupt, setting a flag to mark
that the timer overflowed if there is indeed a pending timer
interrupt.
After that, ZTimerOff
executes just the overhead code of
ZTimerOn
and ZTimerOff
16 times, and averages
and saves the results in order to determine how many of the counts in
the timing result just obtained were incurred by the overhead of the Zen
timer rather than by the code being timed.
Finally, ZTimerOff
restores the context of the calling
program, including the state of the interrupt flag that was in effect
when ZTimerOn
was called to start timing, and returns.
One interesting aspect of ZTimerOff
is the manner in
which timer 0 is stopped in order to read the timer count. We don’t
actually have to stop timer 0 to read the count; the 8253 provides a
special latched read feature for the specific purpose of reading the
count while a timer is running. (That’s a good thing, too; we’ve no
documented way to stop timer 0 if we wanted to, since its gate input
isn’t connected. Later in this chapter, though, we’ll see that timer 0
can be stopped after all.) We simply tell the 8253 to latch the current
count, and the 8253 does so without breaking stride.
ZTimerReport
may be called to display timing results at
any time after both ZTimerOn
and ZTimerOff
have been called. ZTimerReport
first checks to see whether
the timer overflowed (counted down to 0 and turned over) before
ZTimerOff
was called; if overflow did occur,
ZTimerReport
prints a message to that effect and returns.
Otherwise, ZTimerReport
subtracts the reference count
(representing the overhead of the Zen timer) from the count measured
between the calls to ZTimerOn
and ZTimerOff
,
converts the result from timer counts to microseconds, and prints the
resulting time in microseconds to the standard output.
Note that ZTimerReport
need not be called immediately
after ZTimerOff
. In fact, after a given call to
ZTimerOff
, ZTimerReport
can be called at any
time right up until the next call to ZTimerOn
.
You may want to use the Zen timer to measure several portions of a
program while it executes normally, in which case it may not be
desirable to have the text printed by ZTimerReport
interfere with the program’s normal display. There are many ways to deal
with this. One approach is removal of the invocations of the DOS print
string function (INT 21H with AH equal to 9) from
ZTimerReport
, instead running the program under a debugger
that supports screen flipping (such as Turbo Debugger or CodeView),
placing a breakpoint at the start of ZTimerReport
, and
directly observing the count in microseconds as
ZTimerReport
calculates it.
A second approach is modification of ZTimerReport
to
place the result at some safe location in memory, such as an unused
portion of the BIOS data area.
A third approach is alteration of ZTimerReport
to print
the result over a serial port to a terminal or to another PC acting as a
terminal. Similarly, many debuggers can be run from a remote terminal
via a serial link.
Yet another approach is modification of ZTimerReport
to
send the result to the printer via either DOS function 5 or BIOS
interrupt 17H.
A final approach is to modify ZTimerReport
to print the
result to the auxiliary output via DOS function 4, and to then write and
load a special device driver named AUX
, to which DOS
function 4 output would automatically be directed. This device driver
could send the result anywhere you might desire. The result might go to
the secondary display adapter, over a serial port, or to the printer, or
could simply be stored in a buffer within the driver, to be dumped at a
later time. (Credit for this final approach goes to Michael Geary, and
thanks go to David Miller for passing the idea on to me.)
You may well want to devise still other approaches better suited to your needs than those I’ve presented. Go to it! I’ve just thrown out a few possibilities to get you started.
The Zen timer subroutines are designed to be near-called from
assembly language code running in the public segment Code
.
The Zen timer subroutines can, however, be called from any assembly or
high-level language code that generates OBJ files that are compatible
with the Microsoft linker, simply by modifying the segment that the
timer code runs in to match the segment used by the code being timed, or
by changing the Zen timer routines to far procedures and making far
calls to the Zen timer code from the code being timed, as discussed at
the end of this chapter. All three subroutines preserve all registers
and all flags except the interrupt flag, so calls to these routines are
transparent to the calling code.
If you do change the Zen timer routines to far procedures in order to
call them from code running in another segment, be sure to make
all the Zen timer routines far, including
ReferenceZTimerOn
and ReferenceZTimerOff
.
(You’ll have to put FAR PTR
overrides on the calls from
ZTimerOff
to the latter two routines if you do make them
far.) If the reference routines aren’t the same type—near or far—as the
other routines, they won’t reflect the true overhead incurred by
starting and stopping the Zen timer.
Please be aware that the inaccuracy that the Zen timer can introduce into the system clock time does not affect the accuracy of the performance measurements reported by the Zen timer itself. The 8253 counts once every 838 ns, giving us a count resolution of about 1 µs, although factors such as the prefetch queue (as discussed below), dynamic RAM refresh, and internal timing variations in the 8253 make it perhaps more accurate to describe the Zen timer as measuring code performance with an accuracy of better than 10 µs. In fact, the Zen timer is actually most accurate in assessing code performance when timing intervals longer than about 100 µs. At any rate, we’re most interested in using the Zen timer to assess the relative performance of various code sequences—that is, using it to compare and tweak code—and the timer is more than accurate enough for that purpose.
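The conversion from counts to microseconds is nothing more than the 838.1 ns counting period expressed in microseconds; ZTimerReport performs it with integer arithmetic as

microseconds = (count * 8381) / 10000

which is the same as multiplying by 0.8381. For example, a net count of 1,000 would be reported as 838 µs.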
The Zen timer works on all PC-compatible computers I’ve tested it on, including XTs, ATs, PS/2 computers, and 386, 486, and Pentium-based machines. Of course, I haven’t been able to test it on all PC-compatibles, but I don’t expect any problems; computers on which the Zen timer doesn’t run can’t truly be called “PC-compatible.”
On the other hand, there is certainly no guarantee that code performance as measured by the Zen timer will be the same on compatible computers as on genuine IBM machines, or that either absolute or relative code performance will be similar even on different IBM models; in fact, quite the opposite is true. For example, every PS/2 computer, even the relatively slow Model 30, executes code much faster than does a PC or XT. As another example, I set out to do the timings for my earlier book Zen of Assembly Language on an XT-compatible computer, only to find that the computer wasn’t quite IBM-compatible regarding code performance. The differences were minor, mind you, but my experience illustrates the risk of assuming that a specific make of computer will perform in a certain way without actually checking.
Not that this variation between models makes the Zen timer one whit less useful—quite the contrary. The Zen timer is an excellent tool for evaluating code performance over the entire spectrum of PC-compatible computers.
Listing 3.2 shows a test-bed program for measuring code performance
with the Zen timer. This program sets DS equal to CS (for reasons we’ll
discuss shortly), includes the code to be measured from the file
TESTCODE, and calls ZTimerReport
to display the timing
results. Consequently, the code being measured should be in the file
TESTCODE, and should contain calls to ZTimerOn
and
ZTimerOff
.
LISTING 3.2 PZTEST.ASM
; Program to measure performance of code that takes less than
; 54 ms to execute. (PZTEST.ASM)
;
; Link with PZTIMER.ASM (Listing 3.1). PZTEST.BAT (Listing 3.4)
; can be used to assemble and link both files. Code to be
; measured must be in the file TESTCODE; Listing 3.3 shows
; a sample TESTCODE file.
;
; By Michael Abrash
;
mystack segment para stack 'STACK'
db 512 dup(?)
mystack ends
;
Code segment para public 'CODE'
assume cs:Code, ds:Code
extrn ZTimerOn:near, ZTimerOff:near, ZTimerReport:near
Start proc near
push cs
pop ds ; set DS to point to the code segment,
; so data as well as code can easily
; be included in TESTCODE
;
include TESTCODE ;code to be measured, including
; calls to ZTimerOn and ZTimerOff
;
; Display the results.
;
call ZTimerReport
;
; Terminate the program.
;
mov ah,4ch
int 21h
Start endp
Code ends
end Start
Listing 3.3 shows some sample code to be timed. This listing measures
the time required to execute 1,000 loads of AL from the memory variable
MemVar
. Note that Listing 3.3 calls ZTimerOn
to start timing, performs 1,000 MOV
instructions in a row,
and calls ZTimerOff
to end timing. When Listing 3.3 is
named TESTCODE and included by Listing 3.2, Listing 3.2 calls
ZTimerReport
to display the execution time after the code
in Listing 3.3 has been run.
LISTING 3.3 LST3-3.ASM
; Test file;
; Measures the performance of 1,000 loads of AL from
; memory. (Use by renaming to TESTCODE, which is
; included by PZTEST.ASM (Listing 3.2). PZTIME.BAT
; (Listing 3.4) does this, along with all assembly
; and linking.)
;
jmp Skip ;jump around defined data
;
MemVar db ?
;
Skip:
;
; Start timing.
;
call ZTimerOn
;
rept 1000
mov al,[MemVar]
endm
;
; Stop timing.
;
call ZTimerOff
It’s worth noting that Listing 3.3 begins by jumping around the
memory variable MemVar
. This approach lets us avoid
reproducing Listing 3.2 in its entirety for each code fragment we want
to measure; by defining any needed data right in the code segment and
jumping around that data, each listing becomes self-contained and can be
plugged directly into Listing 3.2 as TESTCODE. Listing 3.2 sets DS equal
to CS before doing anything else precisely so that data can be embedded
in code fragments being timed. Note that only after the initial jump is
performed in Listing 3.3 is the Zen timer started, since we don’t want
to include the execution time of start-up code in the timing interval.
That’s why the calls to ZTimerOn
and ZTimerOff
are in TESTCODE, not in PZTEST.ASM; this way, we have full control over
which portion of TESTCODE is timed, and we can keep set-up code and the
like out of the timing interval.
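To make the pattern concrete, here is a second, purely hypothetical TESTCODE fragment (not one of this book's numbered listings) that would measure 1,000 register-to-register additions in the same framework. Note that it has exactly the same shape as Listing 3.3: jump around any defined data (none is needed here), then bracket only the instructions of interest with the timer calls.

jmp Skip ;jump around defined data (none needed here)
;
Skip:
;
; Start timing.
;
call ZTimerOn
;
rept 1000
add ax,ax ;register-to-register add being timed
endm
;
; Stop timing.
;
call ZTimerOff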
Listing 3.3 is used by naming it TESTCODE, assembling both Listing 3.2 (which includes TESTCODE) and Listing 3.1 with TASM or MASM, and linking the two resulting OBJ files together by way of the Borland or Microsoft linker. Listing 3.4 shows a batch file, PZTIME.BAT, which does all that; when run, this batch file generates and runs the executable file PZTEST.EXE. PZTIME.BAT (Listing 3.4) assumes that the file PZTIMER.ASM contains Listing 3.1, and the file PZTEST.ASM contains Listing 3.2. The command-line parameter to PZTIME.BAT is the name of the file to be copied to TESTCODE and included into PZTEST.ASM. (Note that Turbo Assembler can be substituted for MASM by replacing “masm” with “tasm” and “link” with “tlink” in Listing 3.4. The same is true of Listing 3.7.)
LISTING 3.4 PZTIME.BAT
echo off
rem
rem *** Listing 3.4 ***
rem
rem ***************************************************************
rem * Batch file PZTIME.BAT, which builds and runs the precision *
rem * Zen timer program PZTEST.EXE to time the code named as the *
rem * command-line parameter. Listing 3.1 must be named *
rem * PZTIMER.ASM, and Listing 3.2 must be named PZTEST.ASM. To *
rem * time the code in LST3-3, you'd type the DOS command: *
rem * *
rem * pztime lst3-3 *
rem * *
rem * Note that MASM and LINK must be in the current directory or *
rem * on the current path in order for this batch file to work. *
rem * *
rem * This batch file can be speeded up by assembling PZTIMER.ASM *
rem * once, then removing the lines: *
rem * *
rem * masm pztimer; *
rem * if errorlevel 1 goto errorend *
rem * *
rem * from this file. *
rem * *
rem * By Michael Abrash *
rem ***************************************************************
rem
rem Make sure a file to test was specified.
rem
if not x%1==x goto ckexist
echo ***************************************************************
echo * Please specify a file to test. *
echo ***************************************************************
goto end
rem
rem Make sure the file exists.
rem
:ckexist
if exist %1 goto docopy
echo ***************************************************************
echo * The specified file, "%1," doesn't exist. *
echo ***************************************************************
goto end
rem
rem copy the file to measure to TESTCODE.
rem
:docopy
copy %1 testcode
masm pztest;
if errorlevel 1 goto errorend
masm pztimer;
if errorlevel 1 goto errorend
link pztest+pztimer;
if errorlevel 1 goto errorend
pztest
goto end
:errorend
echo ***************************************************************
echo * An error occurred while building the precision Zen timer. *
echo ***************************************************************
:end
Assuming that Listing 3.3 is named LST3-3.ASM and Listing 3.4 is named PZTIME.BAT, the code in Listing 3.3 would be timed with the command:
pztime LST3-3.ASM
which performs all assembly and linking, and reports the execution time of the code in Listing 3.3.
When the above command is executed on an original 4.77 MHz IBM PC,
the time reported by the Zen timer is 3619 µs, or about 3.62 µs per load
of AL from memory. (While the exact number is 3.619 µs per load of AL,
I’m going to round off that last digit from now on. No matter how many
repetitions of a given instruction are timed, there’s just too much
noise in the timing process—between dynamic RAM refresh, the prefetch
queue, and the internal state of the processor at the start of
timing—for that last digit to have any significance.) Given the test
PC’s 4.77 MHz clock, this works out to about 17 cycles per
MOV
, which is actually a good bit longer than Intel’s
specified 10-cycle execution time for this instruction. (See the MASM or
TASM documentation, or Intel’s processor reference manuals, for official
execution times.) Fear not, the Zen timer is right—MOV
AL,[MEMVAR] really does take 17 cycles as used in Listing 3.3.
Exactly why that is so is just what this book is all about.
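The cycle count is easy to back out of the reported time:

3,619 µs / 1,000 loads ≈ 3.62 µs per load, and 3.62 µs x 4.77 cycles per µs ≈ 17 cycles per MOV

which is how the 17-cycle figure above was arrived at.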
In order to perform any of the timing tests in this book, enter Listing 3.1 and name it PZTIMER.ASM, enter Listing 3.2 and name it PZTEST.ASM, and enter Listing 3.4 and name it PZTIME.BAT. Then simply enter the listing you wish to run into the file filename and enter the command:
pztime <filename>
In fact, that’s exactly how I timed each of the listings in this
book. Code fragments you write yourself can be timed in just the same
way. If you wish to time code directly in place in your programs, rather
than in the test-bed program of Listing 3.2, simply insert calls to
ZTimerOn
, ZTimerOff
, and
ZTimerReport
in the appropriate places and link PZTIMER to
your program.
With a few exceptions, the Zen timer presented above will serve us well for the remainder of this book since we’ll be focusing on relatively short code sequences that generally take much less than 54 ms to execute. Occasionally, however, we will need to time longer intervals. What’s more, it is very likely that you will want to time code sequences longer than 54 ms at some point in your programming career. Accordingly, I’ve also developed a Zen timer for periods longer than 54 ms. The long-period Zen timer (so named by contrast with the precision Zen timer just presented) shown in Listing 3.5 can measure periods up to one hour in length.
The key difference between the long-period Zen timer and the precision Zen timer is that the long-period timer leaves interrupts enabled during the timing period. As a result, timer interrupts are recognized by the PC, allowing the BIOS to maintain an accurate system clock time over the timing period. Theoretically, this enables measurement of arbitrarily long periods. Practically speaking, however, there is no need for a timer that can measure more than a few minutes, since the DOS time of day and date functions (or, indeed, the DATE and TIME commands in a batch file) serve perfectly well for longer intervals. Since very long timing intervals aren’t needed, the long-period Zen timer uses a simplified means of calculating elapsed time that is limited to measuring intervals of an hour or less. If a period longer than an hour is timed, the long-period Zen timer prints a message to the effect that it is unable to time an interval of that length.
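The one-hour limit falls directly out of the BIOS time-of-day count: that count advances once per timer 0 turnover, and the long-period timer flags an error whenever the high word of the count changes, which happens once every

65,536 ticks x 54.925 ms per tick ≈ 3,600 seconds

or just under an hour (and also at midnight, when the BIOS count is reset).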
For implementation reasons, the long-period Zen timer is also incapable of timing code that starts before midnight and ends after midnight; if that eventuality occurs, the long-period Zen timer reports that it was unable to time the code because midnight was crossed. If this happens to you, just time the code again, secure in the knowledge that at least you won’t run into the problem again for 23-odd hours.
You should not use the long-period Zen timer to time code that requires interrupts to be disabled for more than 54 ms at a stretch during the timing interval, since when interrupts are disabled the long-period Zen timer is subject to the same 54 ms maximum measurement time as the precision Zen timer.
While permitting the timer interrupt to occur allows long intervals to be timed, that same interrupt makes the long-period Zen timer less accurate than the precision Zen timer, since the time the BIOS spends handling timer interrupts during the timing interval is included in the time measured by the long-period timer. Likewise, any other interrupts that occur during the timing interval, most notably keyboard and mouse interrupts, will increase the measured time.
The long-period Zen timer has some of the same effects on the system time as does the precision Zen timer, so it’s a good idea to reboot the system after a session with the long-period Zen timer. The long-period Zen timer does not, however, have the same potential for introducing major inaccuracy into the system clock time during a single timing run since it leaves interrupts enabled and therefore allows the system clock to update normally.
There’s a potential problem with the long-period Zen timer. The problem is this: In order to measure times longer than 54 ms, we must maintain not one but two timing components, the timer 0 count and the BIOS time-of-day count. The time-of-day count measures the passage of 54.9 ms intervals, while the timer 0 count measures time within those 54.9 ms intervals. We need to read the two time components simultaneously in order to get a clean reading. Otherwise, we may read the timer count just before it turns over and generates an interrupt, then read the BIOS time-of-day count just after the interrupt has occurred and caused the time-of-day count to turn over, with a resulting 54 ms measurement inaccuracy. (The opposite sequence—reading the time-of-day count and then the timer count—can result in a 54 ms inaccuracy in the other direction.)
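In outline, then, the elapsed time the long-period timer wants to report is just the sum of the two components, less the timer's own overhead:

elapsed time ≈ (ending BIOS count - starting BIOS count) x 54.925 ms + (timer 0 count) x 0.8381 µs

and the synchronization problem just described is exactly what happens when the two terms of that sum are read on opposite sides of a turnover.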
The only way to avoid this problem is to stop timer 0, read both the timer and time-of-day counts while the timer is stopped, and then restart the timer. Alas, the gate input to timer 0 isn’t program-controllable in the PC, so there’s no documented way to stop the timer. (The latched read feature we used in Listing 3.1 doesn’t stop the timer; it latches a count, but the timer keeps running.) What should we do?
As it turns out, an undocumented feature of the 8253 makes it possible to stop the timer dead in its tracks. Setting the timer to a new mode and waiting for an initial count to be loaded causes the timer to stop until the count is loaded. Surprisingly, the timer count remains readable and correct while the timer is waiting for the initial load.
In my experience, this approach works beautifully with fully 8253-compatible chips. However, there’s no guarantee that it will always work, since it programs the 8253 in an undocumented way. What’s more, IBM chose not to implement compatibility with this particular 8253 feature in the custom chips used in PS/2 computers. On PS/2 computers, we have no choice but to latch the timer 0 count and then stop the BIOS count (by disabling interrupts) as quickly as possible. We’ll just have to accept the fact that on PS/2 computers we may occasionally get a reading that’s off by 54 ms, and leave it at that.
I’ve set up Listing 3.5 so that it can assemble to either use or not
use the undocumented timer-stopping feature, as you please. The
PS2
equate selects between the two modes of operation. If
PS2
is 1 (as it is in Listing 3.5), then the latch-and-read
method is used; if PS2
is 0, then the undocumented
timer-stop approach is used. The latch-and-read method will work on all
PC-compatible computers, but may occasionally produce results that are
incorrect by 54 ms. The timer-stop approach avoids synchronization
problems, but doesn’t work on all computers.
LISTING 3.5 LZTIMER.ASM
;
; The long-period Zen timer. (LZTIMER.ASM)
; Uses the 8253 timer and the BIOS time-of-day count to time the
; performance of code that takes less than an hour to execute.
; Because interrupts are left on (in order to allow the timer
; interrupt to be recognized), this is less accurate than the
; precision Zen timer, so it is best used only to time code that takes
; more than about 54 milliseconds to execute (code that the precision
; Zen timer reports overflow on). Resolution is limited by the
; occurrence of timer interrupts.
;
; By Michael Abrash
;
; Externally callable routines:
;
; ZTimerOn: Saves the BIOS time of day count and starts the
; long-period Zen timer.
;
; ZTimerOff: Stops the long-period Zen timer and saves the timer
; count and the BIOS time-of-day count.
;
; ZTimerReport: Prints the time that passed between starting and
; stopping the timer.
;
; Note: If either more than an hour passes or midnight falls between
; calls to ZTimerOn and ZTimerOff, an error is reported. For
; timing code that takes more than a few minutes to execute,
; either the DOS TIME command in a batch file before and after
; execution of the code to time or the use of the DOS
; time-of-day function in place of the long-period Zen timer is
; more than adequate.
;
; Note: The PS/2 version is assembled by setting the symbol PS2 to 1.
; PS2 must be set to 1 on PS/2 computers because the PS/2's
; timers are not compatible with an undocumented timer-stopping
; feature of the 8253; the alternative timing approach that
; must be used on PS/2 computers leaves a short window
; during which the timer 0 count and the BIOS timer count may
; not be synchronized. You should also set the PS2 symbol to
; 1 if you're getting erratic or obviously incorrect results.
;
; Note: When PS2 is 0, the code relies on an undocumented 8253
; feature to get more reliable readings. It is possible that
; the 8253 (or whatever chip is emulating the 8253) may be put
; into an undefined or incorrect state when this feature is
; used.
;
; ******************************************************************
; * If your computer displays any hint of erratic behavior *
; * after the long-period Zen timer is used, such as the floppy*
; * drive failing to operate properly, reboot the system, set *
; * PS2 to 1 and leave it that way! *
; ******************************************************************
;
; Note: Each block of code being timed should ideally be run several
; times, with at least two similar readings required to
; establish a true measurement, in order to eliminate any
; variability caused by interrupts.
;
; Note: Interrupts must not be disabled for more than 54 ms at a
; stretch during the timing interval. Because interrupts
; are enabled, keys, mice, and other devices that generate
; interrupts should not be used during the timing interval.
;
; Note: Any extra code running off the timer interrupt (such as
; some memory-resident utilities) will increase the time
; measured by the Zen timer.
;
; Note: These routines can introduce inaccuracies of up to a few
; tenths of a second into the system clock count for each
; code section timed. Consequently, it's a good idea to
; reboot at the conclusion of timing sessions. (The
; battery-backed clock, if any, is not affected by the Zen
; timer.)
;
; All registers and all flags are preserved by all routines.
;
Code segment word public 'CODE'
assume cs:Code, ds:nothing
public ZTimerOn, ZTimerOff, ZTimerReport
;
; Set PS2 to 0 to assemble for use on a fully 8253-compatible
; system; when PS2 is 0, the readings are more reliable if the
; computer supports the undocumented timer-stopping feature,
; but may be badly off if that feature is not supported. In
; fact, timer-stopping may interfere with your computer's
; overall operation by putting the 8253 into an undefined or
; incorrect state. Use with caution!!!
;
; Set PS2 to 1 to assemble for use on non-8253-compatible
; systems, including PS/2 computers; when PS2 is 1, readings
; may occasionally be off by 54 ms, but the code will work
; properly on all systems.
;
; A setting of 1 is safer and will work on more systems,
; while a setting of 0 produces more reliable results in systems
; which support the undocumented timer-stopping feature of the
; 8253. The choice is yours.
;
PS2 equ 1
;
; Base address of the 8253 timer chip.
;
BASE_8253 equ 40h
;
; The address of the timer 0 count registers in the 8253.
;
TIMER_0_8253 equ BASE_8253 + 0
;
; The address of the mode register in the 8253.
;
MODE_8253 equ BASE_8253 + 3
;
; The address of the BIOS timer count variable in the BIOS
; data segment.
;
TIMER_COUNT equ 46ch
;
; Macro to emulate a POPF instruction in order to fix the bug in some
; 80286 chips which allows interrupts to occur during a POPF even when
; interrupts remain disabled.
;
MPOPF macro
local p1, p2
jmp short p2
p1: iret ;jump to pushed address & pop flags
p2: push cs ;construct far return address to
call p1 ; the next instruction
endm
;
; Macro to delay briefly to ensure that enough time has elapsed
; between successive I/O accesses so that the device being accessed
; can respond to both accesses even on a very fast PC.
;
DELAY macro
jmp $+2
jmp $+2
jmp $+2
endm
StartBIOSCountLow dw ? ;BIOS count low word at the
; start of the timing period
StartBIOSCountHigh dw ? ;BIOS count high word at the
; start of the timing period
EndBIOSCountLow dw ? ;BIOS count low word at the
; end of the timing period
EndBIOSCountHigh dw ? ;BIOS count high word at the
; end of the timing period
EndTimedCount dw ? ;timer 0 count at the end of
; the timing period
ReferenceCount dw ? ;number of counts required to
; execute timer overhead code
;
; String printed to report results.
;
OutputStr label byte
db 0dh, 0ah, 'Timed count: '
TimedCountStr db 10 dup (?)
db ' microseconds', 0dh, 0ah
db '$'
;
; Temporary storage for timed count as it's divided down by powers
; of ten when converting from doubleword binary to ASCII.
;
CurrentCountLow dw ?
CurrentCountHigh dw ?
;
; Powers of ten table used to perform division by 10 when doing
; doubleword conversion from binary to ASCII.
;
PowersOfTen label word
dd 1
dd 10
dd 100
dd 1000
dd 10000
dd 100000
dd 1000000
dd 10000000
dd 100000000
dd 1000000000
PowersOfTenEnd label word
;
; String printed to report that the high word of the BIOS count
; changed while timing (an hour elapsed or midnight was crossed),
; and so the count is invalid and the test needs to be rerun.
;
TurnOverStr label byte
db 0dh, 0ah
db '****************************************************'
db 0dh, 0ah
db '* Either midnight passed or an hour or more passed *'
db 0dh, 0ah
db '* while timing was in progress. If the former was *'
db 0dh, 0ah
db '* the case, please rerun the test; if the latter *'
db 0dh, 0ah
db '* was the case, the test code takes too long to *'
db 0dh, 0ah
db '* run to be timed by the long-period Zen timer. *'
db 0dh, 0ah
db '* Suggestions: use the DOS TIME command, the DOS *'
db 0dh, 0ah
db '* time function, or a watch. *'
db 0dh, 0ah
db '****************************************************'
db 0dh, 0ah
db '$'
;********************************************************************
;* Routine called to start timing. *
;********************************************************************
ZTimerOn proc near
;
; Save the context of the program being timed.
;
push ax
pushf
;
; Set timer 0 of the 8253 to mode 2 (divide-by-N), to cause
; linear counting rather than count-by-two counting. Also stops
; timer 0 until the timer count is loaded, except on PS/2
; computers.
;
mov al,00110100b ;mode 2
out MODE_8253,al
;
; Set the timer count to 0, so we know we won't get another
; timer interrupt right away.
; Note: this introduces an inaccuracy of up to 54 ms in the system
; clock count each time it is executed.
;
DELAY
sub al,al
out TIMER_0_8253,al ;lsb
DELAY
out TIMER_0_8253,al ;msb
;
; In case interrupts are disabled, enable interrupts briefly to allow
; the interrupt generated when switching from mode 3 to mode 2 to be
; recognized. Interrupts must be enabled for at least 210 ns to allow
; time for that interrupt to occur. Here, 10 jumps are used for the
; delay to ensure that the delay time will be more than long enough
; even on a very fast PC.
;
pushf
sti
rept 10
jmp $+2
endm
MPOPF
;
; Store the timing start BIOS count.
; (Since the timer count was just set to 0, the BIOS count will
; stay the same for the next 54 ms, so we don't need to disable
; interrupts in order to avoid getting a half-changed count.)
;
push ds
sub ax, ax
mov ds, ax
mov ax, ds:[TIMER_COUNT+2]
mov cs:[StartBIOSCountHigh],ax
mov ax, ds:[TIMER_COUNT]
mov cs:[StartBIOSCountLow],ax
pop ds
;
; Set the timer count to 0 again to start the timing interval.
;
mov al,00110100b ;set up to load initial
out MODE_8253,al ; timer count
DELAY
sub al,al
out TIMER_0_8253,al ; load count lsb
DELAY
out TIMER_0_8253,al ; load count msb
;
; Restore the context of the program being timed and return to it.
;
MPOPF
pop ax
ret
ZTimerOn endp
;********************************************************************
;* Routine called to stop timing and get count. *
;********************************************************************
ZTimerOff proc near
;
; Save the context of the program being timed.
;
pushf
push ax
push cx
;
; In case interrupts are disabled, enable interrupts briefly to allow
; any pending timer interrupt to be handled. Interrupts must be
; enabled for at least 210 ns to allow time for that interrupt to
; occur. Here, 10 jumps are used for the delay to ensure that the
; delay time will be more than long enough even on a very fast PC.
;
sti
rept 10
jmp $+2
endm
;
; Latch the timer count.
;
if PS2
mov al,00000000b
out MODE_8253,al ;latch timer 0 count
;
; This is where a one-instruction-long window exists on the PS/2.
; The timer count and the BIOS count can lose synchronization;
; since the timer keeps counting after it's latched, it can turn
; over right after it's latched and cause the BIOS count to turn
; over before interrupts are disabled, leaving us with the timer
; count from before the timer turned over coupled with the BIOS
; count from after the timer turned over. The result is a count
; that's 54 ms too long.
;
else
;
; Set timer 0 to mode 2 (divide-by-N), waiting for a 2-byte count
; load, which stops timer 0 until the count is loaded. (Only works
; on fully 8253-compatible chips.)
;
mov al,00110100b ;mode 2
out MODE_8253,al
DELAY
mov al,00000000b ;latch timer 0 count
out MODE_8253,al
endif
cli ;stop the BIOS count
;
; Read the BIOS count. (Since interrupts are disabled, the BIOS
; count won't change.)
;
push ds
sub ax,ax
mov ds,ax
mov ax,ds:[TIMER_COUNT+2]
mov cs:[EndBIOSCountHigh],ax
mov ax,ds:[TIMER_COUNT]
mov cs:[EndBIOSCountLow],ax
pop ds
;
; Read the timer count and save it.
;
in al,TIMER_0_8253 ;lsb
DELAY
mov ah,al
in al,TIMER_0_8253 ;msb
xchg ah,al
neg ax ;convert from countdown
; remaining to elapsed
; count
mov cs:[EndTimedCount],ax
;
; Restart timer 0, which is still waiting for an initial count
; to be loaded.
;
ife PS2
DELAY
mov al,00110100b ;mode 2, waiting to load a
; 2-byte count
out MODE_8253,al
DELAY
sub al,al
out TIMER_0_8253,al ;lsb
DELAY
mov al,ah
out TIMER_0_8253,al ;msb
DELAY
endif
sti ;let the BIOS count continue
;
; Time a zero-length code fragment, to get a reference for how
; much overhead this routine has. Time it 16 times and average it,
; for accuracy, rounding the result.
;
mov cs:[ReferenceCount],0
mov cx,16
cli ;interrupts off to allow a
; precise reference count
RefLoop:
call ReferenceZTimerOn
call ReferenceZTimerOff
loop RefLoop
sti
add cs:[ReferenceCount],8 ;total + (0.5 * 16)
mov cl,4
shr cs:[ReferenceCount],cl ;(total) / 16 + 0.5
;
; Restore the context of the program being timed and return to it.
;
pop cx
pop ax
MPOPF
ret
ZTimerOff endp
;
; Called by ZTimerOff to start the timer for overhead measurements.
;
ReferenceZTimerOn proc near
;
; Save the context of the program being timed.
;
push ax
pushf
;
; Set timer 0 of the 8253 to mode 2 (divide-by-N), to cause
; linear counting rather than count-by-two counting.
;
mov al,00110100b ;mode 2
out MODE_8253,al
;
; Set the timer count to 0.
;
DELAY
sub al,al
out TIMER_0_8253,al ;lsb
DELAY
out TIMER_0_8253,al ;msb
;
; Restore the context of the program being timed and return to it.
;
MPOPF
pop ax
ret
ReferenceZTimerOn endp
;
; Called by ZTimerOff to stop the timer and add the result to
; ReferenceCount for overhead measurements. Doesn't need to look
; at the BIOS count because timing a zero-length code fragment
; isn't going to take anywhere near 54 ms.
;
ReferenceZTimerOff proc near
;
; Save the context of the program being timed.
;
pushf
push ax
push cx
;
; Match the interrupt-window delay in ZTimerOff.
;
sti
rept 10
jmp $+2
endm
mov al,00000000b
out MODE_8253,al ;latch timer
;
; Read the count and save it.
;
DELAY
in al,TIMER_0_8253 ;lsb
DELAY
mov ah,al
in al,TIMER_0_8253 ;msb
xchg ah,al
neg ax ;convert from countdown
; remaining to elapsed
; count
add cs:[ReferenceCount],ax
;
; Restore the context and return.
;
pop cx
pop ax
MPOPF
ret
ReferenceZTimerOff endp
;********************************************************************
;* Routine called to report timing results. *
;********************************************************************
ZTimerReport proc near
pushf
push ax
push bx
push cx
push dx
push si
push di
push ds
;
push cs ;DOS functions require that DS point
pop ds ; to text to be displayed on the screen
assume ds:Code
;
; See if midnight or more than an hour passed during timing. If so,
; notify the user.
;
mov ax,[StartBIOSCountHigh]
cmp ax,[EndBIOSCountHigh]
jz CalcBIOSTime ;hour count didn't change,
; so everything's fine
inc ax
cmp ax,[EndBIOSCountHigh]
jnz TestTooLong ;midnight or two hour
; boundaries passed, so the
; results are no good
mov ax,[EndBIOSCountLow]
cmp ax,[StartBIOSCountLow]
jb CalcBIOSTime ;a single hour boundary
; passed--that's OK, so long as
; the total time wasn't more
; than an hour
;
; Over an hour elapsed or midnight passed during timing, which
; renders the results invalid. Notify the user. This misses the
; case where a multiple of 24 hours has passed, but we'll rely
; on the perspicacity of the user to detect that case.
;
TestTooLong:
mov ah,9
mov dx,offset TurnOverStr
int 21h
jmp short ZTimerReportDone
;
; Convert the BIOS time to microseconds.
;
CalcBIOSTime:
mov ax,[EndBIOSCountLow]
sub ax,[StartBIOSCountLow]
mov dx,54925 ;number of microseconds each
; BIOS count represents
mul dx
mov bx,ax ;set aside BIOS count in
mov cx,dx ; microseconds
;
; Convert timer count to microseconds.
;
mov ax,[EndTimedCount]
mov si,8381
mul si
mov si,10000
div si ;* .8381 = * 8381 / 10000
;
; Add timer and BIOS counts together to get an overall time in
; microseconds.
;
add bx,ax
adc cx,0
;
; Subtract the timer overhead and save the result.
;
mov ax,[ReferenceCount]
mov si,8381 ;convert the reference count
mul si ; to microseconds
mov si,10000
div si ;* .8381 = * 8381 / 10000
sub bx,ax
sbb cx,0
mov [CurrentCountLow],bx
mov [CurrentCountHigh],cx
;
; Convert the result to an ASCII string by trial subtractions of
; powers of 10.
;
mov di,offset PowersOfTenEnd - offset PowersOfTen - 4
mov si,offset TimedCountStr
CTSNextDigit:
mov bl,'0'
CTSLoop:
mov ax,[CurrentCountLow]
mov dx,[CurrentCountHigh]
sub ax,PowersOfTen[di]
sbb dx,PowersOfTen[di+2]
jc CTSNextPowerDown
inc bl
mov [CurrentCountLow],ax
mov [CurrentCountHigh],dx
jmp CTSLoop
CTSNextPowerDown:
mov [si],bl
inc si
sub di,4
jns CTSNextDigit
;
;
; Print the results.
;
mov ah,9
mov dx,offset OutputStr
int 21h
;
ZTimerReportDone:
pop ds
pop di
pop si
pop dx
pop cx
pop bx
pop ax
MPOPF
ret
ZTimerReport endp
Code ends
end
Moreover, because it uses an undocumented feature, the timer-stop
approach could conceivably cause erratic 8253 operation, which could in
turn seriously affect your computer’s operation until the next reboot.
In non-8253-compatible systems, I’ve observed not only wildly incorrect timing results, but also failure of a diskette drive to operate properly after the long-period Zen timer with PS2 set to 0 has run, so be alert for signs of trouble if you do set PS2 to 0.
Rebooting should clear up any timer-related problems of the sort described above. (This gives us another reason to reboot at the end of each code-timing session.) You should immediately reboot and set the PS2 equate to 1 if you get erratic or obviously incorrect results with the long-period Zen timer when PS2 is set to 0. If you want to set PS2 to 0, it would be a good idea to time a few of the listings in this book with PS2 set first to 1 and then to 0, to make sure that the results match. If they’re consistently different, you should set PS2 to 1.
While the non-PS/2 version is more dangerous than the PS/2 version, it also produces more accurate results when it does work. If you have a non-PS/2 PC-compatible computer, the choice between the two timing approaches is yours.
If you do leave the PS2
equate at 1 in Listing 3.5, you
should repeat each code-timing run several times before relying on the
results to be accurate to more than 54 ms, since variations may result
from the possible lack of synchronization between the timer 0 count and
the BIOS time-of-day count. In fact, it’s a good idea to time code more
than once no matter which version of the long-period Zen timer you’re
using, since interrupts, which must be enabled in order for the
long-period timer to work properly, may occur at any time and can alter
execution time substantially.
Finally, please note that the precision Zen timer works perfectly well on both PS/2 and non-PS/2 computers. The PS/2 and 8253 considerations we’ve just discussed apply only to the long-period Zen timer.
The long-period Zen timer has exactly the same calling interface as the precision Zen timer, and can be used in place of the precision Zen timer simply by linking it to the code to be timed in place of linking the precision timer code. Whenever the precision Zen timer informs you that the code being timed takes too long for the precision timer to handle, all you have to do is link in the long-period timer instead.
Listing 3.6 shows a test-bed program for the long-period Zen timer.
While this program is similar to Listing 3.2, it’s worth noting that
Listing 3.6 waits for a few seconds before calling
ZTimerOn
, thereby allowing any pending keyboard interrupts
to be processed. Since interrupts must be left on in order to time
periods longer than 54 ms, the interrupts generated by keystrokes
(including the upstroke of the Enter key press that starts the
program)—or any other interrupts, for that matter—could incorrectly
inflate the time recorded by the long-period Zen timer. In light of
this, resist the temptation to type ahead, move the mouse, or the like
while the long-period Zen timer is timing.
LISTING 3.6 LZTEST.ASM
; Program to measure performance of code that takes longer than
; 54 ms to execute. (LZTEST.ASM)
;
; Link with LZTIMER.ASM (Listing 3.5). LZTIME.BAT (Listing 3.7)
; can be used to assemble and link both files. Code to be
; measured must be in the file TESTCODE; Listing 3.8 shows
; a sample file (LST3-8.ASM) which should be named TESTCODE.
;
; By Michael Abrash
;
mystack segment para stack 'STACK'
db 512 dup(?)
mystack ends
;
Code segment para public 'CODE'
assume cs:Code, ds:Code
extrn ZTimerOn:near, ZTimerOff:near, ZTimerReport:near
Start proc near
push cs
pop ds ;point DS to the code segment,
; so data as well as code can easily
; be included in TESTCODE
;
; Delay for 6-7 seconds, to let the Enter keystroke that started the
; program come back up.
;
mov ah,2ch
int 21h ;get the current time
mov bh,dh ;set the current time aside
DelayLoop:
mov ah,2ch
push bx ;preserve start time
int 21h ;get time
pop bx ;retrieve start time
cmp dh,bh ;is the new seconds count less than
; the start seconds count?
jnb CheckDelayTime ;no
add dh,60 ;yes, a minute must have turned over,
; so add one minute
CheckDelayTime:
sub dh,bh ;get time that's passed
cmp dh,7 ;has it been more than 6 seconds yet?
jb DelayLoop ;not yet
;
include TESTCODE ;code to be measured, including calls
; to ZTimerOn and ZTimerOff
;
; Display the results.
;
call ZTimerReport
;
; Terminate the program.
;
mov ah,4ch
int 21h
Start endp
Code ends
end Start
As with the precision Zen timer, the program in Listing 3.6 is used by naming the file containing the code to be timed TESTCODE, then assembling both Listing 3.6 and Listing 3.5 with MASM or TASM and linking the two files together by way of the Microsoft or Borland linker. Listing 3.7 shows a batch file, named LZTIME.BAT, which does all of the above, generating and running the executable file LZTEST.EXE. LZTIME.BAT assumes that the file LZTIMER.ASM contains Listing 3.5 and the file LZTEST.ASM contains Listing 3.6.
LISTING 3.7 LZTIME.BAT
echo off
rem
rem *** Listing 3.7 ***
rem
rem ***************************************************************
rem * Batch file LZTIME.BAT, which builds and runs the *
rem * long-period Zen timer program LZTEST.EXE to time the code *
rem * named as the command-line parameter. Listing 3.5 must be *
rem * named LZTIMER.ASM, and Listing 3.6 must be named *
rem * LZTEST.ASM. To time the code in LST3-8, you'd type the *
rem * DOS command: *
rem * *
rem * lztime lst3-8 *
rem * *
rem * Note that MASM and LINK must be in the current directory or *
rem * on the current path in order for this batch file to work. *
rem * *
rem * This batch file can be speeded up by assembling LZTIMER.ASM *
rem * once, then removing the lines: *
rem * *
rem * masm lztimer; *
rem * if errorlevel 1 goto errorend *
rem * *
rem * from this file. *
rem * *
rem * By Michael Abrash *
rem ***************************************************************
rem
rem Make sure a file to test was specified.
rem
if not x%1==x goto ckexist
echo ***************************************************************
echo * Please specify a file to test. *
echo ***************************************************************
goto end
rem
rem Make sure the file exists.
rem
:ckexist
if exist %1 goto docopy
echo ***************************************************************
echo * The specified file, "%1," doesn't exist. *
echo ***************************************************************
goto end
rem
rem copy the file to measure to TESTCODE.
:docopy
copy %1 testcode
masm lztest;
if errorlevel 1 goto errorend
masm lztimer;
if errorlevel 1 goto errorend
link lztest+lztimer;
if errorlevel 1 goto errorend
lztest
goto end
:errorend
echo ***************************************************************
echo * An error occurred while building the long-period Zen timer. *
echo ***************************************************************
:end
Listing 3.8 shows sample code that can be timed with the test-bed program of Listing 3.6. Listing 3.8 measures the time required to execute 20,000 loads of AL from memory, a length of time too long for the precision Zen timer to handle on the 8088.
LISTING 3.8 LST3-8.ASM
;
; Measures the performance of 20,000 loads of AL from
; memory. (Use by renaming to TESTCODE, which is
; included by LZTEST.ASM (Listing 3.6). LZTIME.BAT
; (Listing 3.7) does this, along with all assembly
; and linking.)
;
; Note: takes about ten minutes to assemble on a slow PC if
;you are using MASM
;
jmp Skip ;jump around defined data
;
MemVar db ?
;
Skip:
;
; Start timing.
;
call ZTimerOn
;
rept 20000
mov al,[MemVar]
endm
;
; Stop timing.
;
call ZTimerOff
When LZTIME.BAT is run on a PC with the following command line (assuming the code in Listing 3.8 is the file LST3-8.ASM)
lztime lst3-8.asm
the result is 72,544 µs, or about 3.63 µs per load of AL from memory.
This is just slightly longer than the time per load of AL measured by
the precision Zen timer, as we would expect given that interrupts are
left enabled by the long-period Zen timer. The extra fraction of a
microsecond measured per MOV
reflects the time required to
execute the BIOS code that handles the 18.2 timer interrupts that occur
each second.
Note that the command can take as much as 10 minutes to finish on a
slow PC if you are using MASM, with most of that time spent assembling
Listing 3.8. Why? Because MASM is notoriously slow at assembling
REPT
blocks, and the block in Listing 3.8 is repeated
20,000 times.
The Zen timer can be used to measure code performance when
programming in C—but not right out of the box. As presented earlier, the
timer is designed to be called from assembly language; some relatively
minor modifications are required before the ZTimerOn
(start
timer), ZTimerOff
(stop timer), and
ZTimerReport
(display timing results) routines can be
called from C. There are two separate cases to be dealt with here: small
code model and large; I’ll tackle the simpler one, the small code model,
first.
Altering the Zen timer for linking to a small code model C program
involves the following steps: Change ZTimerOn
to
_ZTimerOn
, change ZTimerOff
to
_ZTimerOff
, change ZTimerReport
to
_ZTimerReport
, and change Code
to
_TEXT
. Figure 3.2 shows the line numbers and new states of
all lines from Listing 3.1 that must be changed. These changes convert
the code to use C-style external label names and the small model C code
segment. (In C++, use the “C” specifier, as in
extern "C" ZTimerOn(void);
when declaring the timer routines extern
, so that
name-mangling doesn’t occur, and the linker can find the routines’
C-style names.)
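To make that concrete, here’s a rough sketch of what the renamed elements end up looking like (a sketch only, not Figure 3.2 itself; the segment attributes shown are assumptions, and the actual changes are to specific lines of Listing 3.1):
_TEXT segment word public 'CODE' ;small model C code segment (attributes assumed)
assume cs:_TEXT
public _ZTimerOn, _ZTimerOff, _ZTimerReport
_ZTimerOn proc near ;was ZTimerOn
; ...body unchanged from Listing 3.1...
_ZTimerOn endp
_TEXT ends
end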
That’s all it takes; after doing this, you’ll be able to use the Zen timer from C, as, for example, in:
ZTimerOn();
for (i=0, x=0; i<100; i++)
    x += i;
ZTimerOff(); ZTimerReport();
(I’m talking about the precision timer here. The long-period timer—Listing 3.5—requires the same modifications, but to different lines.)
Altering the Zen timer for use in C’s large code model is a tad more complex, because in addition to the above changes, all functions, including the internal reference timing routines that are used to calculate overhead so it can be subtracted out, must be converted to far. Figure 3.3 shows the line numbers and new states of all lines from Listing 3.1 that must be changed in order to call the Zen timer from large code model C. Again, the line numbers are specific to the precision timer, but the long-period timer is very similar.
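Sketched the same way (and again, this is an illustration of the kind of change involved, not Figure 3.3 itself), the large code model version makes every routine a far procedure, the internal reference timing routines included:
_ZTimerOn proc far ;near becomes far in the large code model
; ...body unchanged...
_ZTimerOn endp
ReferenceZTimerOn proc far ;the internal reference routines change too
; ...body unchanged...
ReferenceZTimerOn endp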
The full listings for the C-callable Zen timers are presented in Chapter K on the companion CD-ROM.
One important safety tip when modifying the Zen timer for use with large code model C code: Watch out for optimizing assemblers! TASM actually replaces
call far ptr ReferenceZTimerOn
with
push cs
call near ptr ReferenceZTimerOn
(and likewise for ReferenceZTimerOff
), which works
because ReferenceZTimerOn
is in the same segment as the
calling code. This is normally a great optimization, being both smaller
and faster than a far call.
However, it’s not so great for the Zen timer, because our purpose in
calling the reference timing code is to determine exactly how much time
is taken by overhead code—including the far calls to
ZTimerOn
and ZTimerOff
! By converting the far
calls to push/near call pairs within the Zen timer module, TASM makes it
impossible to emulate exactly the overhead of the Zen timer, and makes
timings slightly (about 16 cycles on a 386) less accurate.
What’s the solution? Put the NOSMART
directive at the
start of the Zen timer code. This directive instructs TASM to turn off
all optimizations, including converting far calls to push/near call
pairs. By the way, there is, to the best of my knowledge, no such
problem with MASM up through version 5.10A.
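In practice that means adding a single line at the top of the timer source file, something like this (a sketch; everything after the directive is unchanged):
NOSMART ;tell TASM not to convert far calls into push/near call pairs
; ...the rest of the Zen timer code follows unchanged...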
In my mind, the whole business of optimizing assemblers is a mixed blessing. In general, it’s nice to have the assembler shortening jumps and selecting sign-extended forms of instructions for you. On the other hand, the benefits of tricks like substituting push/near call pairs for far calls are relatively small, and those tricks can get in the way when complete control is needed. Sure, complete control is needed very rarely, but when it is, optimizing assemblers can cause subtle problems; I discovered TASM’s alteration of far calls only because I happened to view the code in the debugger, and you might want to do the same if you’re using a recent version of MASM.
I’ve tested the changes shown in Figures 3.2 and 3.3 with TASM and Borland C++ 4.0, and also with the latest MASM and Microsoft C/C++ compiler.
For those of you who wish to pursue the mechanics of code measurement further, one good article about measuring code performance with the 8253 timer is “Programming Insight: High-Performance Software Analysis on the IBM PC,” by Byron Sheppard, which appeared in the January, 1987 issue of Byte. For complete if somewhat cryptic information on the 8253 timer itself, I refer you to Intel’s Microsystem Components Handbook, which is also a useful reference for a number of other PC components, including the 8259 Programmable Interrupt Controller and the 8237 DMA Controller. For details about the way the 8253 is used in the PC, as well as a great deal of additional information about the PC’s hardware and BIOS resources, I suggest you consult IBM’s series of technical reference manuals for the PC, XT, AT, Model 30, and microchannel computers, such as the Models 50, 60, and 80.
For our purposes, however, it’s not critical that you understand exactly how the Zen timer works. All you really need to know is what the Zen timer can do and how to use it, and we’ve accomplished that in this chapter.
The Zen timer is not perfect. For one thing, the finest resolution to which it can measure an interval is at best about 1µs, a period of time in which a 66 MHz Pentium computer can execute as many as 132 instructions (although an 8088-based PC would be hard-pressed to manage two instructions in a microsecond). Another problem is that the timing code itself interferes with the state of the prefetch queue and processor cache at the start of the code being timed, because the timing code is not necessarily fetched and does not necessarily access memory in exactly the same time sequence as the code immediately preceding the code under measurement normally does. This prefetch effect can introduce as much as 3 to 4 µs of inaccuracy. Similarly, the state of the prefetch queue at the end of the code being timed affects how long the code that stops the timer takes to execute. Consequently, the Zen timer tends to be more accurate for longer code sequences, since the relative magnitude of the inaccuracy introduced by the Zen timer becomes less over longer periods.
Imperfections notwithstanding, the Zen timer is a good tool for exploring C code and x86 family assembly language, and it’s a tool we’ll use frequently for the remainder of this book.
This chapter, adapted from my earlier book, Zen of Assembly Language located on the companion CD-ROM, goes right to the heart of my philosophy of optimization: Understand where the time really goes when your code runs. That may sound ridiculously simple, but, as this chapter makes clear, it turns out to be a challenging task indeed, one that at times verges on black magic. This chapter is a long-time favorite of mine because it was the first—and to a large extent only—work that I know of that discussed this material, thereby introducing a generation of PC programmers to pedal-to-the-metal optimization.
This chapter focuses almost entirely on the first popular x86-family processor, the 8088. Some of the specific features and results that I cite in this chapter are no longer applicable to modern x86-family processors such as the 486 and Pentium, as I’ll point out later on when we discuss those processors. Nonetheless, the overall theme of this chapter—that understanding dimly-seen and poorly-documented code gremlins called cycle-eaters that lurk in your system is essential to performance programming—is every bit as valid today. Also, later chapters often refer back to the basic cycle-eaters described in this chapter, so this chapter is the foundation for the discussions of x86-family optimization to come. What’s more, the Zen timer remains an excellent tool with which to flush out and examine cycle-eaters, as we’ll see in later chapters, and this chapter is as good an illustration of how to use the Zen timer as you’re likely to find.
So, don’t take either the absolute or the relative execution times presented in this chapter as gospel for newer processors, and read on to later chapters to see how the cycle-eaters and optimization rules have changed over time, but do take the time to at least skim through this chapter to give yourself a good start on the material in the rest of this book.
Programming has many levels, ranging from the familiar (high-level languages, DOS calls, and the like) down to the esoteric things that lie on the shadowy edge of hardware-land. I call these cycle-eaters because, like the monsters in a bad 50s horror movie, they lurk in those shadows, taking their share of your program’s performance without regard to the forces of goodness or the U.S. Army. In this chapter, we’re going to jump right in at the lowest level by examining the cycle-eaters that live beneath the programming interface; that is, beneath your application, DOS, and BIOS—in fact, beneath the instruction set itself.
Why start at the lowest level? Simply because cycle-eaters affect the performance of all assembler code, and yet are almost unknown to most programmers. A full understanding of code optimization requires an understanding of cycle-eaters and their implications. That’s no simple task, and in fact it is in precisely that area that most books and articles about assembly programming fall short.
Nearly all literature on assembly programming discusses only the programming interface: the instruction set, the registers, the flags, and the BIOS and DOS calls. Those topics cover the functionality of assembly programs most thoroughly—but it’s performance above all else that we’re after. No one ever tells you about the raw stuff of performance, which lies beneath the programming interface, in the dimly-seen realm—populated by instruction prefetching, dynamic RAM refresh, and wait states—where software meets hardware. This area is the domain of hardware engineers, and is almost never discussed as it relates to code performance. And yet it is only by understanding the mechanisms operating at this level that we can fully understand and properly improve the performance of our code.
Which brings us to cycle-eaters.
Cycle-eaters are gremlins that live on the bus or in peripherals (and sometimes within the CPU itself), slowing the performance of PC code so that it doesn’t execute at full speed. Most cycle-eaters (and all of those haunting the older Intel processors) live outside the CPU’s Execution Unit, where they can only affect the CPU when the CPU performs a bus access (a memory or I/O read or write). Once your code and data are already inside the CPU, those cycle-eaters can no longer be a problem. Only on the 486 and Pentium CPUs will you find cycle-eaters inside the chip, as we’ll see in later chapters.
The nature and severity of the cycle-eaters vary enormously from processor to processor, and (especially) from memory architecture to memory architecture. In order to understand them all, we need first to understand the simplest among them, those that haunted the original 8088-based IBM PC. Later on in this book, I’ll be better able to explain the newer generation of cycle-eaters in terms of those ancestral cycle-eaters—but we have to get the groundwork down first.
Internally, the 8088 is a 16-bit processor, capable of running at full speed at all times—unless external data is required. External data must traverse the 8088’s external data bus and the PC’s data bus one byte at a time to and from peripherals, with cycle-eaters lurking along every step of the way. What’s more, external data includes not only memory operands but also instruction bytes, so even instructions with no memory operands can suffer from cycle-eaters. Since some of the 8088’s fastest instructions are register-only instructions, that’s important indeed.
The major cycle-eaters are:
The locations of these cycle-eaters in the primordial 8088-based PC are shown in Figure 4.1. We’ll cover each of the cycle-eaters in turn in this chapter. The material won’t be easy since cycle-eaters are among the most subtle aspects of assembly programming. By the same token, however, this will be one of the most important and rewarding chapters in this book. Don’t worry if you don’t catch everything in this chapter, but do read it all even if the going gets a bit tough. Cycle-eaters play a key role in later chapters, so some familiarity with them is highly desirable.
Look! Down on the motherboard! It’s a 16-bit processor! It’s an 8-bit processor! It’s…
…an 8088!
Fans of the 8088 call it a 16-bit processor. Fans of other 16-bit processors call the 8088 an 8-bit processor. The truth of the matter is that the 8088 is a 16-bit processor that often performs like an 8-bit processor.
The 8088 is internally a full 16-bit processor, equivalent to an 8086. (In fact, the 8086 is identical to the 8088, except that it has a full 16-bit bus. The 8088 is basically the poor man’s 8086, because it allows a cheaper—albeit slower—system to be built, thanks to the half-sized bus.) In terms of the instruction set, the 8088 is clearly a 16-bit processor, capable of performing any given 16-bit operation—addition, subtraction, even multiplication or division—with a single instruction. Externally, however, the 8088 is unequivocally an 8-bit processor, since the external data bus is only 8 bits wide. In other words, the programming interface is 16 bits wide, but the hardware interface is only 8 bits wide, as shown in Figure 4.2. The result of this mismatch is simple: Word-sized data can be transferred between the 8088 and memory or peripherals at only one-half the maximum rate of the 8086, which is to say one-half the maximum rate for which the Execution Unit of the 8088 was designed.
As shown in Figure 4.1, the 8-bit bus cycle-eater lies squarely on the 8088’s external data bus. Technically, it might be more accurate to place this cycle-eater in the Bus Interface Unit, which breaks 16-bit memory accesses into paired 8-bit accesses, but it is really the limited width of the external data bus that constricts data flow into and out of the 8088. True, the original PC’s bus is also only 8 bits wide, but that’s just to match the 8088’s 8-bit bus; even if the PC’s bus were 16 bits wide, data could still pass into and out of the 8088 chip itself only 1 byte at a time.
Each bus access by the 8088 takes 4 clock cycles, or 0.838 µs in the 4.77 MHz PC, and transfers 1 byte. That means that the maximum rate at which data can be transferred into and out of the 8088 is 1 byte every 0.838 µs. While 8086 bus accesses also take 4 clock cycles, each 8086 bus access can transfer either 1 byte or 1 word, for a maximum transfer rate of 1 word every 0.838 µs. Consequently, for word-sized memory accesses, the 8086 has an effective transfer rate of 1 byte every 0.419 µs. By contrast, every word-sized access on the 8088 requires two 4-cycle-long bus accesses, one for the high byte of the word and one for the low byte of the word. As a result, the 8088 has an effective transfer rate for word-sized memory accesses of just 1 word every 1.676 µs—and that, in a nutshell, is the 8-bit bus cycle-eater.
A related cycle-eater lurks beneath the 386SX chip, which is a 32-bit processor internally with only a 16-bit path to system memory. The numbers are different, but the way the cycle-eater operates is exactly the same. AT-compatible systems have 16-bit data buses, which can access a full 16-bit word at a time. The 386SX can process 32 bits (a doubleword) at a time, however, and loses a lot of time fetching that doubleword from memory in two halves.
One obvious effect of the 8-bit bus cycle-eater is that word-sized accesses to memory operands on the 8088 take 4 cycles longer than byte-sized accesses. That’s why the official instruction timings indicate that for code running on an 8088 an additional 4 cycles are required for every word-sized access to a memory operand. For instance,
mov ax,word ptr [MemVar]
takes 4 cycles longer to read the word at address MemVar
than
mov al,byte ptr [MemVar]
takes to read the byte at address MemVar
. (Actually, the
difference between the two isn’t very likely to be exactly 4 cycles, for
reasons that will become clear once we discuss the prefetch queue and
dynamic RAM refresh cycle-eaters later in this chapter.)
What’s more, in some cases one instruction can perform multiple word-sized accesses, incurring that 4-cycle penalty on each access. For example, adding a value to a word-sized memory variable requires two word-sized accesses—one to read the destination operand from memory prior to adding to it, and one to write the result of the addition back to the destination operand—and thus incurs not one but two 4-cycle penalties. As a result
add word ptr [MemVar],ax
takes about 8 cycles longer to execute than:
add byte ptr [MemVar],al
String instructions can suffer from the 8-bit bus cycle-eater to a
greater extent than other instructions. Believe it or not, a single
REP MOVSW
instruction can lose as much as 131,070
word-sized memory accesses x 4 cycles, or 524,280 cycles to the
8-bit bus cycle-eater! In other words, one 8088 instruction (admittedly,
an instruction that does a great deal) can take over one-tenth of a
second longer on an 8088 than on an 8086, simply because of the 8-bit
bus. One-tenth of a second! That’s a phenomenally long time in
computer terms; in one-tenth of a second, the 8088 can perform more than
50,000 additions and subtractions.
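Here’s where that number comes from, sketched as the worst case (65,535 is simply the largest count CX can hold; the rest of the arithmetic follows from the 8-bit bus penalty described above):
mov cx,65535 ;maximum repeat count
rep movsw ;65,535 words moved x 2 word-sized accesses per word (one read,
; one write) = 131,070 word-sized accesses, each costing an extra
; 4 cycles on the 8-bit bus, for 524,280 cycles--more than 0.1
; second at 4.77 MHz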
The upshot of all this is simply that the 8088 can transfer word-sized data to and from memory at only half the speed of the 8086, which inevitably causes performance problems when coupled with an Execution Unit that can process word-sized data every bit as quickly as an 8086. These problems show up with any code that uses word-sized memory operands. More ominously, as we will see shortly, the 8-bit bus cycle-eater can cause performance problems with other sorts of code as well.
The obvious implication of the 8-bit bus cycle-eater is that
byte-sized memory variables should be used whenever possible. After all,
the 8088 performs byte-sized memory accesses just as quickly as
the 8086. For instance, Listing 4.1, which uses a byte-sized memory
variable as a loop counter, runs in 10.03 µs per loop. That's 20 percent
faster than the 12.05 µs per loop execution time of Listing 4.2, which
uses a word-sized counter. Why the difference in execution times? Simply
because each word-sized DEC
performs 4 byte-sized memory
accesses (two to read the word-sized operand and two to write the result
back to memory), while each byte-sized DEC
performs only 2
byte-sized memory accesses in all.
LISTING 4.1 LST4-1.ASM
; Measures the performance of a loop which uses a
; byte-sized memory variable as the loop counter.
;
jmp Skip
;
Counter db 100
;
Skip:
call ZTimerOn
LoopTop:
dec [Counter]
jnz LoopTop
call ZTimerOff
LISTING 4.2 LST4-2.ASM
; Measures the performance of a loop which uses a
; word-sized memory variable as the loop counter.
;
jmp Skip
;
Counter dw 100
;
Skip:
call ZTimerOn
LoopTop:
dec [Counter]
jnz LoopTop
call ZTimerOff
I’d like to make a brief aside concerning code optimization in the
listings in this book. Throughout this book I’ve modeled the sample code
after working code so that the timing results are applicable to
real-world programming. In Listings 4.1 and 4.2, for example, I could
have shown a still greater advantage for byte-sized operands simply by
performing 1,000 DEC
instructions in a row, with no
branching at all. However, DEC
instructions don’t exist in
a vacuum, so in the listings I used code that both decremented the
counter and tested the result. The difference is that between
decrementing a memory location (simply an instruction) and using a loop
counter (a functional instruction sequence). If you come across code in
this book that seems less than optimal, it’s simply due to my desire to
provide code that’s relevant to real programming problems. On the other
hand, optimal code is an elusive thing indeed; by no means should you
assume that the code in this book is ideal! Examine it, question it, and
improve upon it, for an inquisitive, skeptical mind is an important part
of the Zen of assembly optimization.
Back to the 8-bit bus cycle-eater. As I’ve said, in 8088 work you should strive to use byte-sized memory variables whenever possible. That does not mean that you should use 2 byte-sized memory accesses to manipulate a word-sized memory variable in preference to 1 word-sized memory access, as, for instance,
mov dl,byte ptr [MemVar]
mov dh,byte ptr [MemVar+1]
versus:
mov dx,word ptr [MemVar]
Recall that every access to a memory byte takes at least 4 cycles; that limitation is built right into the 8088. The 8088 is also built so that the second byte-sized memory access to a 16-bit memory variable takes just those 4 cycles and no more. There’s no way you can manipulate the second byte of a word-sized memory variable with a second, separate byte-sized instruction in less than 4 cycles. As a matter of fact, you’re bound to access that second byte much more slowly with a separate instruction, thanks to the overhead of instruction fetching and execution, address calculation, and the like.
For example, consider Listing 4.3, which performs 1,000 word-sized
reads from memory. This code runs in 3.77 µs per word read on a 4.77 MHz
8088. That’s 45 percent faster than the 5.49 µs per word read of Listing
4.4, which reads the same 1,000 words as Listing 4.3 but does so with
2,000 byte-sized reads. Both listings perform exactly the same number of
memory accesses—2,000 accesses, each byte-sized, as all 8088 memory
accesses must be. (Remember that the Bus Interface Unit must perform two
byte-sized memory accesses in order to handle a word-sized memory
operand.) However, Listing 4.3 is considerably faster because it expends
only 4 additional cycles to read the second byte of each word, while
Listing 4.4 performs a second LODSB
, requiring 13 cycles,
to read the second byte of each word.
LISTING 4.3 LST4-3.ASM
; Measures the performance of reading 1,000 words
; from memory with 1,000 word-sized accesses.
;
sub si,si
mov cx,1000
call ZTimerOn
rep lodsw
call ZTimerOff
LISTING 4.4 LST4-4.ASM
; Measures the performance of reading 1000 words
; from memory with 2,000 byte-sized accesses.
;
sub si,si
mov cx,2000
call ZTimerOn
rep lodsb
call ZTimerOff
In short, if you must perform a 16-bit memory access, let the 8088 break the access into two byte-sized accesses for you. The 8088 is more efficient at that task than your code can possibly be.
Word-sized variables should be stored in registers to the greatest feasible extent, since registers are inside the 8088, where 16-bit operations are just as fast as 8-bit operations because the 8-bit cycle-eater can’t get at them. In fact, it’s a good idea to keep as many variables of all sorts in registers as you can. Instructions with register-only operands execute very rapidly, partially because they avoid both the time-consuming memory accesses and the lengthy address calculations associated with memory operands.
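As a quick illustration (a variation I’ve sketched here, not one of the numbered listings), the loop of Listings 4.1 and 4.2 could keep its counter in a register, eliminating the memory-operand accesses from the loop entirely:
mov cx,100 ;loop counter kept in a register
call ZTimerOn
LoopTop:
dec cx ;register-only DEC--no memory operand to access
jnz LoopTop
call ZTimerOff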
There is yet another reason why register operands are preferable to memory operands, and it’s an unexpected effect of the 8-bit bus cycle-eater. Instructions with only register operands tend to be shorter (in terms of bytes) than instructions with memory operands, and when it comes to performance, shorter is usually better. In order to explain why that is true and how it relates to the 8-bit bus cycle-eater, I must diverge for a moment.
For the last few pages, you may well have been thinking that the 8-bit bus cycle-eater, while a nuisance, doesn’t seem particularly subtle or difficult to quantify. After all, any instruction reference tells us exactly how many cycles each instruction loses to the 8-bit bus cycle-eater, doesn’t it?
Yes and no. It’s true that in general we know approximately how much longer a given instruction will take to execute with a word-sized memory operand than with a byte-sized operand, although the dynamic RAM refresh and wait state cycle-eaters (which I’ll cover a little later) can raise the cost of the 8-bit bus cycle-eater considerably. However, all word-sized memory accesses lose 4 cycles to the 8-bit bus cycle-eater, and there’s one sort of word-sized memory access we haven’t discussed yet: instruction fetching. The ugliest manifestation of the 8-bit bus cycle-eater is in fact the prefetch queue cycle-eater.
In an 8088 context, here’s the prefetch queue cycle-eater in a nutshell: The 8088’s 8-bit external data bus keeps the Bus Interface Unit from fetching instruction bytes as fast as the 16-bit Execution Unit can execute them, so the Execution Unit often lies idle while waiting for the next instruction byte to be fetched.
Exactly why does this happen? Recall that the 8088 is an 8086 internally, but accesses word-sized memory data at only one-half the maximum rate of the 8086 due to the 8088’s 8-bit external data bus. Unfortunately, instructions are among the word-sized data the 8086 fetches, meaning that the 8088 can fetch instructions at only one-half the speed of the 8086. On the other hand, the 8086-equivalent Execution Unit of the 8088 can execute instructions every bit as fast as the 8086. The net result is that the Execution Unit burns up instruction bytes much faster than the Bus Interface Unit can fetch them, and ends up idling while waiting for instructions bytes to arrive.
The BIU can fetch instruction bytes at a maximum rate of one byte every 4 cycles—and that 4-cycle per instruction byte rate is the ultimate limit on overall instruction execution time, regardless of EU speed. While the EU may execute a given instruction that’s already in the prefetch queue in less than 4 cycles per byte, over time the EU can’t execute instructions any faster than they can arrive—and they can’t arrive faster than 1 byte every 4 cycles.
Clearly, then, the prefetch queue cycle-eater is nothing more than one aspect of the 8-bit bus cycle-eater. 8088 code often runs at less than the Execution Unit’s maximum speed because the 8-bit data bus can’t keep up with the demand for instruction bytes. That’s straightforward enough—so why all the fuss about the prefetch queue cycle-eater?
What makes the prefetch queue cycle-eater tricky is that it’s undocumented and unpredictable. That is, with a word-sized memory access, such as
mov [bx],ax
it’s well-documented that an extra 4 cycles will always be required to write the upper byte of AX to memory. Not so with the prefetch queue cycle-eater lurking nearby. For instance, the instructions
shr ax,1
shr ax,1
shr ax,1
shr ax,1
shr ax,1
should execute in 10 cycles, since each SHR
takes 2
cycles to execute, according to Intel’s specifications. Those
specifications contain Intel’s official instruction execution times, but
in this case—and in many others—the specifications are drastically
wrong. Why? Because they describe execution time once an instruction
reaches the prefetch queue. They say nothing about whether a given
instruction will be in the prefetch queue when it’s time for that
instruction to run, or how long it will take that instruction to reach
the prefetch queue if it’s not there already. Thanks to the low
performance of the 8088’s external data bus, that’s a glaring
omission—but, alas, an unavoidable one. Let’s look at why the official
execution times are wrong, and why that can’t be helped.
The sequence of 5 SHR
instructions in the last example
is 10 bytes long. That means that it can never execute in less than 24
cycles even if the 4-byte prefetch queue is full when it starts, since 6
instruction bytes would still remain to be fetched, at 4 cycles per
fetch. If the prefetch queue is empty at the start, the sequence
could take 40 cycles. In short, thanks to instruction fetching,
the code won’t run at its documented speed, and could take up to four
times longer than it is supposed to.
Why does Intel document Execution Unit execution time rather than overall instruction execution time, which includes both instruction fetch time and Execution Unit (EU) execution time? Well, instruction fetching isn’t performed as part of instruction execution by the Execution Unit, but instead is carried on in parallel by the Bus Interface Unit (BIU) whenever the external data bus isn’t in use or whenever the EU runs out of instruction bytes to execute. Sometimes the BIU is able to use spare bus cycles to prefetch instruction bytes before the EU needs them, so in those cases instruction fetching takes no time at all, practically speaking. At other times the EU executes instructions faster than the BIU can fetch them, and instruction fetching then becomes a significant part of overall execution time. As a result, the effective fetch time for a given instruction varies greatly depending on the code mix preceding that instruction. Similarly, the state in which a given instruction leaves the prefetch queue affects the overall execution time of the following instructions.
In other words, while the execution time for a given instruction is constant, the fetch time for that instruction depends heavily on the context in which the instruction is executing—the amount of prefetching the preceding instructions allowed—and can vary from a full 4 cycles per instruction byte to no time at all.
As we’ll see later, other cycle-eaters, such as DRAM refresh and display memory wait states, can cause prefetching variations even during different executions of the same code sequence. Given that, it’s meaningless to talk about the prefetch time of a given instruction except in the context of a specific code sequence.
So now you know why the official instruction execution times are often wrong, and why Intel can’t provide better specifications. You also know now why it is that you must time your code if you want to know how fast it really is.
The effect of the code preceding an instruction on the execution time of that instruction makes the Zen timer trickier to use than you might expect, and complicates the interpretation of the results reported by the Zen timer. For one thing, the Zen timer is best used to time code sequences that are more than a few instructions long; below 10µs or so, prefetch queue effects and the limited resolution of the clock driving the timer can cause problems.
Some slight prefetch queue-induced inaccuracy usually exists even when the Zen timer is used to time longer code sequences, since the calls to the Zen timer usually alter the code’s prefetch queue from its normal state. (Branches—jumps, calls, returns and the like—empty the prefetch queue.) Ideally, the Zen timer is used to measure the performance of an entire subroutine, so the prefetch queue effects of the branches at the start and end of the subroutine are similar to the effects of the calls to the Zen timer when you’re measuring the subroutine’s performance.
Another way in which the prefetch queue cycle-eater complicates the use of the Zen timer involves the practice of timing the performance of a few instructions over and over. I’ll often repeat one or two instructions 100 or 1,000 times in a row in listings in this book in order to get timing intervals that are long enough to provide reliable measurements. However, as we just learned, the actual performance of any 8088 instruction depends on the code mix preceding any given use of that instruction, which in turn affects the state of the prefetch queue when the instruction starts executing. Alas, the execution time of an instruction preceded by dozens of identical instructions reflects just one of many possible prefetch states (and not a very likely state at that), and some of the other prefetch states may well produce distinctly different results.
For example, consider the code in Listings 4.5 and 4.6. Listing 4.5
shows our familiar SHR
case. Here, because the prefetch
queue is always empty, execution time should work out to about 4 cycles
per byte, or 8 cycles per SHR
, as shown in Figure 4.3.
(Figure 4.3 illustrates the relationship between instruction fetching
and execution in a simplified way, and is not intended to show the exact
timings of 8088 operations.) That’s quite a contrast to the official
2-cycle execution time of SHR
. In fact, the Zen timer
reports that Listing 4.5 executes in 1.81µs per byte, or slightly
more than 4 cycles per byte. (The extra time is the result of
the dynamic RAM refresh cycle-eater, which we’ll discuss shortly.) Going
by Listing 4.5, we would conclude that the “true” execution time of
SHR
is 8.64 cycles.
LISTING 4.5 LST4-5.ASM
; Measures the performance of 1,000 SHR instructions
; in a row. Since SHR executes in 2 cycles but is
; 2 bytes long, the prefetch queue is always empty,
; and prefetching time determines the overall
; performance of the code.
;
call ZTimerOn
rept 1000
shr ax,1
endm
call ZTimerOff
LISTING 4.6 LST4-6.ASM
; Measures the performance of 1,000 MUL/SHR instruction
; pairs in a row. The lengthy execution time of MUL
; should keep the prefetch queue from ever emptying.
;
mov cx,1000
sub ax,ax
call ZTimerOn
rept 1000
mul ax
shr ax,1
endm
call ZTimerOff
Now let’s examine Listing 4.6. Here each SHR
follows a
MUL
instruction. Since MUL
instructions take
so long to execute that the prefetch queue is always full when they
finish, each SHR
should be ready and waiting in the
prefetch queue when the preceding MUL
ends. As a result,
we’d expect that each SHR
would execute in 2 cycles;
together with the 118-cycle execution time of multiplying 0 times 0, the
total execution time should come to 120 cycles per SHR/MUL
pair, as shown in Figure 4.4. And, by God, when we run Listing 4.6 we
get an execution time of 25.14 µs per SHR/MUL
pair, or
exactly 120 cycles! According to these results, the “true”
execution time of SHR
would seem to be 2 cycles, quite a
change from the conclusion we drew from Listing 4.5.
The key point is this: We’ve seen one code sequence in which
SHR
took 8-plus cycles to execute, and another in which it
took only 2 cycles. Are we talking about two different forms of
SHR
here? Of course not—the difference is purely a
reflection of the differing states in which the preceding code left the
prefetch queue. In Listing 4.5, each SHR
after the first
few follows a slew of other SHR
instructions which have
sucked the prefetch queue dry, so overall performance reflects
instruction fetch time. By contrast, each SHR
in Listing
4.6 follows a MUL
instruction which leaves the prefetch
queue full, so overall performance reflects Execution Unit execution
time.
Clearly, either instruction fetch time or Execution Unit execution time—or even a mix of the two, if an instruction is partially prefetched—can determine code performance. Some people operate under a rule of thumb by which they assume that the execution time of each instruction is 4 cycles times the number of bytes in the instruction. While that’s often true for register-only code, it frequently doesn’t hold for code that accesses memory. For one thing, the rule should be 4 cycles times the number of memory accesses, not instruction bytes, since all accesses take 4 cycles on the 8088-based PC. For another, memory-accessing instructions often have slower Execution Unit execution times than the 4 cycles per memory access rule would dictate, because the 8088 isn’t very fast at calculating memory addresses. Also, the 4 cycles per instruction byte rule isn’t true for register-only instructions that are already in the prefetch queue when the preceding instruction ends.
The truth is that it never hurts performance to reduce either the
cycle count or the byte count of a given bit of code, but there’s no
guarantee that one or the other will improve performance either. For
example, consider Listing 4.7, which consists of a series of 4-cycle,
2-byte MOV AL,0
instructions, and which executes at the
rate of 1.81 µs per instruction. Now consider Listing 4.8, which
replaces the 4-cycle MOV AL,0
with the 3-cycle (but still
2-byte) SUB AL,AL.
Despite its 1-cycle-per-instruction
advantage, Listing 4.8 runs at exactly the same speed as Listing 4.7.
The reason: Both instructions are 2 bytes long, and in both cases it is
the 8-cycle instruction fetch time, not the 3 or 4-cycle Execution Unit
execution time, that limits performance.
LISTING 4.7 LST4-7.ASM
; Measures the performance of repeated MOV AL,0 instructions,
; which take 4 cycles each according to Intel's official
; specifications.
;
sub ax,ax
call ZTimerOn
rept 1000
mov al,0
endm
call ZTimerOff
LISTING 4.8 LST4-8.ASM
; Measures the performance of repeated SUB AL,AL instructions,
; which take 3 cycles each according to Intel's official
; specifications.
;
sub ax,ax
call ZTimerOn
rept 1000
sub al,al
endm
call ZTimerOff
As you can see, it’s easy to be drawn into thinking you’re saving cycles when you’re not. You can only improve the performance of a specific bit of code by reducing the factor—either instruction fetch time or execution time, or sometimes a mix of the two—that’s limiting the performance of that code.
In case you missed it in all the excitement, the variability of prefetching means that our method of testing performance by executing 1,000 instructions in a row by no means produces “true” instruction execution times, any more than the official execution times in the Intel manuals are “true” times. The fact of the matter is that a given instruction takes at least as long to execute as the time given for it in the Intel manuals, but may take as much as 4 cycles per byte longer, depending on the state of the prefetch queue when the preceding instruction ends.
The only true execution time for an instruction is a time measured in a certain context, and that time is meaningful only in that context.
What we really want is to know how long useful working code takes to run, not how long a single instruction takes, and the Zen timer gives us the tool we need to gather that information. Granted, it would be easier if we could just add up neatly documented instruction execution times—but that’s not going to happen. Without actually measuring the performance of a given code sequence, you simply don’t know how fast it is. For crying out loud, even the people who designed the 8088 at Intel couldn’t tell you exactly how quickly a given 8088 code sequence executes on the PC just by looking at it! Get used to the idea that execution times are only meaningful in context, learn the rules of thumb in this book, and use the Zen timer to measure your code.
Don’t think that because overall instruction execution time is
determined by both instruction fetch time and Execution Unit execution
time, the two times should be added together when estimating
performance. For example, practically speaking, each SHR
in
Listing 4.5 does not take 8 cycles of instruction fetch time plus 2
cycles of Execution Unit execution time to execute. Figure 4.3 shows
that while a given SHR
is executing, the fetch of the next
SHR
is starting, and since the two operations are
overlapped for 2 cycles, there’s no sense in charging the time to both
instructions. You could think of the extra instruction fetch time for
SHR
in Listing 4.5 as being 6 cycles, which yields an
overall execution time of 8 cycles when added to the 2 cycles of
Execution Unit execution time.
Alternatively, you could think of each SHR
in Listing
4.5 as taking 8 cycles to fetch, and then executing in effectively 0
cycles while the next SHR
is being fetched. Whichever
perspective you prefer is fine. The important point is that the time
during which the execution of one instruction and the fetching of the
next instruction overlap should only be counted toward the overall
execution time of one of the instructions. For all intents and purposes,
one of the two instructions runs at no performance cost whatsoever while
the overlap exists.
As a working definition, we’ll consider the execution time of a given instruction in a particular context to start when the first byte of the instruction is sent to the Execution Unit and end when the first byte of the next instruction is sent to the EU.
Reducing the impact of the prefetch queue cycle-eater is one of the overriding principles of high-performance assembly code. How can you do this? One effective technique is to minimize access to memory operands, since such accesses compete with instruction fetching for precious memory accesses. You can also greatly reduce instruction fetch time simply by your choice of instructions: Keep your instructions short. Less time is required to fetch instructions that are 1 or 2 bytes long than instructions that are 5 or 6 bytes long. Reduced instruction fetching lowers minimum execution time (minimum execution time is 4 cycles times the number of instruction bytes) and often leads to faster overall execution.
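For instance (a made-up pair, not one of the numbered listings), both of the following instructions zero DX, but the first is 3 bytes long and the second only 2, so their minimum overall execution times on the 8088 are 12 and 8 cycles, respectively:
mov dx,0 ;3 bytes: 3 x 4 cycles = 12-cycle minimum
sub dx,dx ;2 bytes: 2 x 4 cycles = 8-cycle minimum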
While short instructions minimize overall prefetch time, ironically
they actually often suffer more from the prefetch queue bottleneck than
do long instructions. Short instructions generally have such fast
execution times that they drain the prefetch queue despite their small
size. For example, consider the SHR
of Listing 4.5, which
runs at only 25 percent of its Execution Unit execution time even though
it’s only 2 bytes long, thanks to the prefetch queue bottleneck. Short
instructions are nonetheless generally faster than long instructions,
thanks to the combination of fewer instruction bytes and faster
Execution Unit execution times, and should be used as much as
possible—just don’t expect them to run at their “official” documented
speeds.
More than anything, the above rules mean using the registers as heavily as possible, both because register-only instructions are short and because they don’t perform memory accesses to read or write operands. However, using the registers is a rule of thumb, not a commandment. In some circumstances, it may actually be faster to access memory. (The look-up table technique is one such case.) What’s more, the performance of the prefetch queue (and hence the performance of each instruction) differs from one code sequence to the next, and can even differ during different executions of the same code sequence.
All in all, writing good assembler code is as much an art as a science. As a result, you should follow the rules of thumb described here—and then time your code to see how fast it really is. You should experiment freely, but always remember that actual, measured performance is the bottom line.
In this chapter I’ve taken you further and further into the depths of the PC, telling you again and again that you must understand the computer at the lowest possible level in order to write good code. At this point, you may well wonder, “Have we gotten low enough?”
Not quite yet. The 8-bit bus and prefetch queue cycle-eaters are low-level indeed, but we’ve one level yet to go. Dynamic RAM refresh and wait states—our next topics—together form the lowest level at which the hardware of the PC affects code performance. Below this level, the PC is of interest only to hardware engineers.
Before we begin our discussion of dynamic RAM refresh, let’s step back for a moment to take an overall look at this lowest level of cycle-eaters. In truth, the distinctions between wait states and dynamic RAM refresh don’t much matter to a programmer. What is important is that you understand this: Under certain circumstances, devices on the PC bus can stop the CPU for 1 or more cycles, making your code run more slowly than it seemingly should.
Unlike all the cycle-eaters we’ve encountered so far, wait states and dynamic RAM refresh are strictly external to the CPU, as was shown in Figure 4.1. Adapters on the PC’s bus, such as video and memory cards, can insert wait states on any bus access, the idea being that they won’t be able to complete the access properly unless the access is stretched out. Likewise, the channel of the DMA controller dedicated to dynamic RAM refresh can request control of the bus at any time, although the CPU must relinquish the bus before the DMA controller can take over. This means that your code can’t directly control wait states or dynamic RAM refresh. However, code can sometimes be designed to minimize the effects of these cycle-eaters, and even when the cycle-eaters slow your code without there being a thing in the world you can do about it, you’re still better off understanding that you’re losing performance and knowing why your code doesn’t run as fast as it’s supposed to than you were programming in ignorance.
Let’s start with DRAM refresh, which affects the performance of every program that runs on the PC.
Dynamic RAM (DRAM) refresh is sort of an act of God. By that I mean that DRAM refresh invisibly and inexorably steals a certain fraction of all available memory access time from your programs, when they are accessing memory for code and data. (When they are accessing cache on more recent processors, theoretically the DRAM refresh cycle-eater doesn’t come into play, but there are other cycle-eaters waiting to prey on cache-bound programs.) While you could stop DRAM refresh, you wouldn’t want to since that would be a sure prescription for crashing your computer. In the end, thanks to DRAM refresh, almost all code runs a bit slower on the PC than it otherwise would, and that’s that.
A bit of background: A static RAM (SRAM) chip is a memory chip that retains its contents indefinitely so long as power is maintained. By contrast, each of several blocks of bits in a dynamic RAM (DRAM) chip retains its contents for only a short time after it’s accessed for a read or write. In order to get a DRAM chip to store data for an extended period, each of the blocks of bits in that chip must be accessed regularly, so that the chip’s stored data is kept refreshed and valid. So long as this is done often enough, a DRAM chip will retain its contents indefinitely.
All of the PC’s system memory consists of DRAM chips. Each DRAM chip in the PC must be completely refreshed about once every four milliseconds in order to ensure the integrity of the data it stores. Obviously, it’s highly desirable that the memory in the PC retain the correct data indefinitely, so each DRAM chip in the PC must always be refreshed within 4 ms of the last refresh. Since there’s no guarantee that a given program will access each and every DRAM block once every 4 ms, the PC contains special circuitry and programming for providing DRAM refresh.
On the original 8088-based IBM PC, timer 1 of the 8253 timer chip is programmed at power-up to generate a signal once every 72 cycles, or once every 15.08µs. That signal goes to channel 0 of the 8237 DMA controller, which requests the bus from the 8088 upon receiving the signal. (DMA stands for direct memory access, the ability of a device other than the 8088 to control the bus and access memory directly, without any help from the 8088.) As soon as the 8088 is between memory accesses, it gives control of the bus to the 8237, which in conjunction with special circuitry on the PC’s motherboard then performs a single 4-cycle read access to 1 of 256 possible addresses, advancing to the next address on each successive access. (The read access is only for the purpose of refreshing the DRAM; the data that is read isn’t used.)
The 256 addresses accessed by the refresh DMA accesses are arranged so that taken together they properly refresh all the memory in the PC. By accessing one of the 256 addresses every 15.08 µs, all of the PC’s DRAM is refreshed in 256 x 15.08 µs, or 3.86 ms, which is just about the desired 4 ms time I mentioned earlier. (Only the first 640K of memory is refreshed in the PC; video adapters and other adapters above 640K containing memory that requires refreshing must provide their own DRAM refresh in pre-AT systems.)
Don’t sweat the details here. The important point is this: For at least 4 out of every 72 cycles, the original PC’s bus is given over to DRAM refresh and is not available to the 8088, as shown in Figure 4.5. That means that as much as 5.56 percent of the PC’s already inadequate bus capacity is lost. However, DRAM refresh doesn’t necessarily stop the 8088 in its tracks for 4 cycles. The Execution Unit of the 8088 can keep processing while DRAM refresh is occurring, unless the EU needs to access memory. Consequently, DRAM refresh can slow code performance anywhere from 0 percent to 5.56 percent (and actually a bit more, as we’ll see shortly), depending on the extent to which DRAM refresh occupies cycles during which the 8088 would otherwise be accessing memory.
Let’s look at examples from opposite ends of the spectrum in terms of the impact of DRAM refresh on code performance. First, consider the series of MUL instructions in Listing 4.9. Since a 16-bit MUL on the 8088 executes in between 118 and 133 cycles and is only 2 bytes long, there should be plenty of time for the prefetch queue to fill after each instruction, even after DRAM refresh has taken its slice of memory access time. Consequently, the prefetch queue should be able to keep the Execution Unit well-supplied with instruction bytes at all times. Since Listing 4.9 uses no memory operands, the Execution Unit should never have to wait for data from memory, and DRAM refresh should have no impact on performance. (Remember that the Execution Unit can operate normally during DRAM refreshes so long as it doesn’t need to request a memory access from the Bus Interface Unit.)
LISTING 4.9 LST4-9.ASM
; Measures the performance of repeated MUL instructions,
; which allow the prefetch queue to be full at all times,
; to demonstrate a case in which DRAM refresh has no impact
; on code performance.
;
sub ax,ax
call ZTimerOn
rept 1000
mul ax
endm
call ZTimerOff
Running Listing 4.9, we find that each MUL executes in 24.72 µs, or exactly 118 cycles. Since that’s the shortest time in which MUL can execute, we can see that no performance is lost to DRAM refresh. Listing 4.9 clearly illustrates that DRAM refresh only affects code performance when a DRAM refresh forces the Execution Unit of the 8088 to wait for a memory access.
Now let’s look at the series of SHR instructions shown in Listing 4.10. Since SHR executes in 2 cycles but is 2 bytes long, the prefetch queue should be empty while Listing 4.10 executes, with the 8088 prefetching instruction bytes non-stop. As a result, the time per instruction of Listing 4.10 should precisely reflect the time required to fetch the instruction bytes.
LISTING 4.10 LST4-10.ASM
; Measures the performance of repeated SHR instructions,
; which empty the prefetch queue, to demonstrate the
; worst-case impact of DRAM refresh on code performance.
;
call ZTimerOn
rept 1000
shr ax,1
endm
call ZTimerOff
Since 4 cycles are required to read each instruction byte, we’d expect each SHR to execute in 8 cycles, or 1.676 µs, if there were no DRAM refresh. In fact, each SHR in Listing 4.10 executes in 1.81 µs, indicating that DRAM refresh is taking 7.4 percent of the program’s execution time. That’s nearly 2 percent more than our worst-case estimate of the loss to DRAM refresh overhead! In fact, the result indicates that DRAM refresh is stealing not 4, but 5.33 cycles out of every 72 cycles. How can this be?
The answer is that a given DRAM refresh can actually hold up CPU memory accesses for as many as 6 cycles, depending on the timing of the DRAM refresh’s DMA request relative to the 8088’s internal instruction execution state. When the code in Listing 4.10 runs, each DRAM refresh holds up the CPU for either 5 or 6 cycles, depending on where the 8088 is in executing the current SHR instruction when the refresh request occurs. Now we see that things can get even worse than we thought: DRAM refresh can steal as much as 8.33 percent of available memory access time—6 out of every 72 cycles—from the 8088.
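If you’d like to see where those percentages come from, the arithmetic is simple enough to spell out. The fragment below is a sketch of my own, using the figures quoted above for the 4.77 MHz PC; it just prints the refresh interval and the best-case, measured, and worst-case shares of bus time.

/* DRAM refresh arithmetic for the original PC, as described above: one
   refresh DMA access every 72 cycles, stealing 4 cycles in the best
   case, 5.33 on average in Listing 4.10, and as many as 6. */
#include <stdio.h>

int main(void)
{
    double CycleUs = 1.0 / 4.77;                  /* microseconds per cycle */

    printf("refresh interval:  %.2f us\n", 72 * CycleUs);              /* ~15.08 us */
    printf("full refresh pass: %.2f ms\n", 256 * 72 * CycleUs / 1000.0); /* ~3.86 ms */
    printf("best case:    %.2f%%\n", 100.0 * 4.0 / 72.0);              /* 5.56 percent */
    printf("Listing 4.10: %.2f%%\n", 100.0 * 5.33 / 72.0);             /* about 7.4 percent */
    printf("worst case:   %.2f%%\n", 100.0 * 6.0 / 72.0);              /* 8.33 percent */
    return 0;
}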
Which of the two cases we’ve examined reflects reality? While either case can happen, the latter case—significant performance reduction, ranging as high as 8.33 percent—is far more likely to occur. This is especially true for high-performance assembly code, which uses fast instructions that tend to cause non-stop instruction fetching.
Hmmm. When we discovered the prefetch queue cycle-eater, we learned to use short instructions. When we discovered the 8-bit bus cycle-eater, we learned to use byte-sized memory operands whenever possible, and to keep word-sized variables in registers. What can we do to work around the DRAM refresh cycle-eater?
Nothing.
As I’ve said before, DRAM refresh is an act of God. DRAM refresh is a fundamental, unchanging part of the PC’s operation, and there’s nothing you or I can do about it. If refresh were any less frequent, the reliability of the PC would be compromised, so tinkering with either timer 1 or DMA channel 0 to reduce DRAM refresh overhead is out. Nor is there any way to structure code to minimize the impact of DRAM refresh. Sure, some instructions are affected less by DRAM refresh than others, but how many multiplies and divides in a row can you really use? I suppose that code could conceivably be structured to leave a free memory access every 72 cycles, so DRAM refresh wouldn’t have any effect. In the old days when code size was measured in bytes, not K bytes, and processors were less powerful—and complex—programmers did in fact use similar tricks to eke every last bit of performance from their code. When programming the PC, however, the prefetch queue cycle-eater would make such careful code synchronization a difficult task indeed, and any modest performance improvement that did result could never justify the increase in programming complexity and the limits on creative programming that such an approach would entail. Besides, all that effort goes to waste on faster 8088s, 286s, and other computers with different execution speeds and refresh characteristics. There’s no way around it: Useful code accesses memory frequently and at irregular intervals, and over the long haul DRAM refresh always exacts its price.
If you’re still harboring thoughts of reducing the overhead of DRAM refresh, consider this. Instructions that tend not to suffer very much from DRAM refresh are those that have a high ratio of execution time to instruction fetch time, and those aren’t the fastest instructions of the PC. It certainly wouldn’t make sense to use slower instructions just to reduce DRAM refresh overhead, for it’s total execution time—DRAM refresh, instruction fetching, and all—that matters.
The important thing to understand about DRAM refresh is that it generally slows your code down, and that the extent of that performance reduction can vary considerably and unpredictably, depending on how the DRAM refreshes interact with your code’s pattern of memory accesses. When you use the Zen timer and get a fractional cycle count for the execution time of an instruction, that’s often the DRAM refresh cycle-eater at work. (The display adapter cycle-eater is another possible culprit, and, on 386s and later processors, cache misses and pipeline execution hazards produce this sort of effect as well.) Whenever you get two timing results that differ less or more than they seemingly should, that’s usually DRAM refresh too. Thanks to DRAM refresh, variations of up to 8.33 percent in PC code performance are par for the course.
Wait states are cycles during which a bus access by the CPU to a device on the PC’s bus is temporarily halted by that device while the device gets ready to complete the read or write. Wait states are well and truly the lowest level of code performance. Everything we have discussed (and will discuss)—even DMA accesses—can be affected by wait states.
Wait states exist because the CPU must be able to coexist with any adapter, no matter how slow (within reason). The 8088 expects to be able to complete each bus access—a memory or I/O read or write—in 4 cycles, but adapters can’t always respond that quickly for a number of reasons. For example, display adapters must split access to display memory between the CPU and the circuitry that generates the video signal based on the contents of display memory, so they often can’t immediately fulfill a request by the CPU for a display memory read or write. To resolve this conflict, display adapters can tell the CPU to wait during bus accesses by inserting one or more wait states, as shown in Figure 4.6. The CPU simply sits and idles as long as wait states are inserted, then completes the access as soon as the display adapter indicates its readiness by no longer inserting wait states. The same would be true of any adapter that couldn’t keep up with the CPU.
Mind you, this is all transparent to executing code. An instruction that encounters wait states runs exactly as if there were no wait states, only slower. Wait states are nothing more or less than wasted time as far as the CPU and your program are concerned.
By understanding the circumstances in which wait states can occur, you can avoid them when possible. Even when it’s not possible to work around wait states, it’s still to your advantage to understand how they can cause your code to run more slowly.
First, let’s learn a bit more about wait states by contrast with DRAM refresh. Unlike DRAM refresh, wait states do not occur on any regularly scheduled basis, and are of no particular duration. Wait states can only occur when an instruction performs a memory or I/O read or write. Both the presence of wait states and the number of wait states inserted on any given bus access are entirely controlled by the device being accessed. When it comes to wait states, the CPU is passive, merely accepting whatever wait states the accessed device chooses to insert during the course of the access. All of this makes perfect sense given that the whole point of the wait state mechanism is to allow a device to stretch out any access to itself for however much time it needs to perform the access.
As with DRAM refresh, wait states don’t stop the 8088 completely. The Execution Unit can continue processing while wait states are inserted, so long as the EU doesn’t need to perform a bus access. However, in the PC, wait states most often occur when an instruction accesses a memory operand, so in fact the Execution Unit usually is stopped by wait states. (Instruction fetches rarely wait in an 8088-based PC because system memory is zero-wait-state. AT-class memory systems routinely insert 1 or more wait states, however.)
As it turns out, wait states pose a serious problem in just one area in the PC. While any adapter can insert wait states, in the PC only display adapters do so to the extent that performance is seriously affected.
Display adapters must serve two masters, and that creates a fundamental performance problem. Master #1 is the circuitry that drives the display screen. This circuitry must constantly read display memory in order to obtain the information used to draw the characters or dots displayed on the screen. Since the screen must be redrawn between 50 and 70 times per second, and since each redraw of the screen can require as many as 36,000 reads of display memory (more in Super VGA modes), master #1 is a demanding master indeed. No matter how demanding master #1 gets, however, its needs must always be met—otherwise the quality of the picture on the screen would suffer.
Master #2 is the CPU, which reads from and writes to display memory in order to manipulate the bytes that the video circuitry reads to form the picture on the screen. Master #2 is less important than master #1, since the CPU affects display quality only indirectly. In other words, if the video circuitry has to wait for display memory accesses, the picture will develop holes, snow, and the like, but if the CPU has to wait for display memory accesses, the program will just run a bit slower—no big deal.
It matters a great deal which master is more important, for while both the CPU and the video circuitry must gain access to display memory, only one of the two masters can read or write display memory at any one time. Potential conflicts are resolved by flat-out guaranteeing the video circuitry however many accesses to display memory it needs, with the CPU waiting for whatever display memory accesses are left over.
It turns out that the 8088 CPU has to do a lot of waiting, for three reasons. First, the video circuitry can take as much as about 90 percent of the available display memory access time, as shown in Figure 4.7, leaving as little as about 10 percent of all display memory accesses for the 8088. (These percentages vary considerably among the many EGA and VGA clones.)
Second, because the displayed dots (or pixels, short for “picture elements”) must be drawn on the screen at a constant speed, many display adapters provide memory accesses only at fixed intervals. As a result, time can be lost while the 8088 synchronizes with the start of the next display adapter memory access, even if the video circuitry isn’t accessing display memory at that time, as shown in Figure 4.8.
Finally, the time it takes a display adapter to complete a memory access is related to the speed of the clock which generates pixels on the screen rather than to the memory access speed of the 8088. Consequently, the time taken for display memory to complete an 8088 read or write access is often longer than the time taken for system memory to complete an access, even if the 8088 lucks into hitting a free display memory access just as it becomes available, again as shown in Figure 4.8. Any or all of the three factors I’ve described can result in wait states, slowing the 8088 and creating the display adapter cycle-eater.
If some of this is Greek to you, don’t worry. The important point is that display memory is not very fast compared to normal system memory. How slow is it? Incredibly slow. Remember how slow IBM’s ill-fated PCjr was? In case you’ve forgotten, I’ll refresh your memory: The PCjr was at best only half as fast as the PC. The PCjr had an 8088 running at 4.77 MHz, just like the PC—why do you suppose it was so much slower? I’ll tell you why: All the memory in the PCjr was display memory.
Enough said. All the memory in the PC is not display memory, however, and unless you’re thickheaded enough to put code in display memory, the PC isn’t going to run as slowly as a PCjr. (Putting code or other non-video data in unused areas of display memory sounds like a neat idea—until you consider the effect on instruction prefetching of cutting the 8088’s already-poor memory access performance in half. Running your code from display memory is sort of like running on a hypothetical 8084—an 8086 with a 4-bit bus. Not recommended!) Given that your code and data reside in normal system memory below the 640K mark, how great an impact does the display adapter cycle-eater have on performance?
The answer varies considerably depending on what display adapter and what display mode we’re talking about. The display adapter cycle-eater is worst with the Enhanced Graphics Adapter (EGA) and the original Video Graphics Array (VGA). (Many VGAs, especially newer ones, insert many fewer wait states than IBM’s original VGA. On the other hand, Super VGAs have more bytes of display memory to be accessed in high-resolution mode.) While the Color/Graphics Adapter (CGA), Monochrome Display Adapter (MDA), and Hercules Graphics Card (HGC) all suffer from the display adapter cycle-eater as well, they suffer to a lesser degree. Since the VGA represents the base standard for PC graphics now and for the foreseeable future, and since it is the hardest graphics adapter to wring performance from, we’ll restrict our discussion to the VGA (and its close relative, the EGA) for the remainder of this chapter.
Even on the EGA and VGA, the effect of the display adapter cycle-eater depends on the display mode selected. In text mode, the display adapter cycle-eater is rarely a major factor. It’s not that the cycle-eater isn’t present; it’s just that a mere 4,000 bytes control the entire text mode display, and even with the display adapter cycle-eater it doesn’t take that long to manipulate 4,000 bytes. Even if the display adapter cycle-eater were to cause the 8088 to take as much as 5 µs per display memory access—more than five times normal—it would still take only 4,000 x 2 x 5 µs, or 40 ms, to read and write every byte of display memory. That’s a lot of time as measured in 8088 cycles, but it’s less than the blink of an eye in human time, and video performance only matters in human time. After all, the whole point of drawing graphics is to convey visual information, and if that information can be presented faster than the eye can see, that is by definition fast enough.
That’s not to say that the display adapter cycle-eater can’t matter in text mode. In Chapter 3, I recounted the story of a debate among letter-writers to a magazine about exactly how quickly characters could be written to display memory without causing snow. The writers carefully added up Intel’s instruction cycle times to see how many writes to display memory they could squeeze into a single horizontal retrace interval. (On a CGA, it’s only during the short horizontal retrace interval and the longer vertical retrace interval that display memory can be accessed in 80-column text mode without causing snow.) Of course, now we know that their cardinal sin was to ignore the prefetch queue; even if there were no wait states, their calculations would have been overly optimistic. There are display memory wait states as well, however, so the calculations were not just optimistic but wildly optimistic.
Text mode situations such as the above notwithstanding, where the display adapter cycle-eater really kicks in is in graphics mode, and most especially in the high-resolution graphics modes of the EGA and VGA. The problem here is not that there are necessarily more wait states per access in high-resolution graphics modes (that varies from adapter to adapter and mode to mode). Rather, the problem is simply that there are many more bytes of display memory per screen in these modes than in lower-resolution graphics modes and in text modes, so many more display memory accesses—each incurring its share of display memory wait states—are required in order to draw an image of a given size. When accessing the many thousands of bytes used in the high-resolution graphics modes, the cumulative effects of display memory wait states can seriously impact code performance, even as measured in human time.
For example, if we assume the same 5 µs per display memory access for the EGA’s high-resolution graphics mode that we assumed for text mode, it would take 26,000 x 2 x 5 µs, or 260 ms, to scroll the screen once in the EGA’s high-resolution graphics mode, mode 10H. That’s more than one-quarter of a second—noticeable by human standards, an eternity by computer standards.
That sounds pretty serious, but we did make an unfounded assumption about memory access speed. Let’s get some hard numbers. Listing 4.11 accesses display memory at the 8088’s maximum speed, by way of a REP MOVSW with display memory as both source and destination. The code in Listing 4.11 executes in 3.18 µs per access to display memory—not as long as we had assumed, but a long time nonetheless.
LISTING 4.11 LST4-11.ASM
; Times speed of memory access to Enhanced Graphics
; Adapter graphics mode display memory at A000:0000.
;
mov ax,0010h
int 10h; select hi-res EGA graphics
; mode 10 hex (AH=0 selects
; BIOS set mode function,
; with AL=mode to select)
;
mov ax,0a000h
mov ds,ax
mov es,ax ;move to & from same segment
sub si,si ;move to & from same offset
mov di,si
mov cx,800h ;move 2K words
cld
call ZTimerOn
rep movsw ;simply read each of the first
; 2K words of the destination segment,
; writing each byte immediately back
; to the same address. No memory
; locations are actually altered; this
; is just to measure memory access
; times
call ZTimerOff
;
mov ax,0003h
int 10h ;return to text mode
For comparison, let’s see how long the same code takes when accessing normal system RAM instead of display memory. The code in Listing 4.12, which performs a REP MOVSW from the code segment to the code segment, executes in 1.39 µs per memory access. That means that on average, 1.79 µs (more than 8 cycles!) are lost to the display adapter cycle-eater on each access. In other words, the display adapter cycle-eater can more than double the execution time of 8088 code!
LISTING 4.12 LST4-12.ASM
; Times speed of memory access to normal system
; memory.
;
mov ax,ds
mov es,ax ;move to & from same segment
sub si,si ;move to & from same offset
mov di,si
mov cx,800h ;move 2K words
cld
call ZTimerOn
rep movsw ;simply read each of the first
; 2K words of the destination segment,
; writing each byte immediately back
; to the same address. No memory
; locations are actually altered; this
; is just to measure memory access
; times
call ZTimerOff
Bear in mind that we’re talking about a worst case here; the impact of the display adapter cycle-eater is proportional to the percent of time a given code sequence spends accessing display memory.
A line-drawing subroutine, which executes perhaps a dozen instructions for each display memory access, generally loses less performance to the display adapter cycle-eater than does a block-copy or scrolling subroutine that uses REP MOVS instructions. Scaled and three-dimensional graphics, which spend a great deal of time performing calculations (often using very slow floating-point arithmetic), tend to suffer less.
In addition, code that accesses display memory infrequently tends to suffer only about half of the maximum display memory wait states, because on average such code will access display memory halfway between one available display memory access slot and the next. As a result, code that accesses display memory less intensively than the code in Listing 4.11 will on average lose 4 or 5 rather than 8-plus cycles to the display adapter cycle-eater on each memory access.
Nonetheless, the display adapter cycle-eater always takes its toll on graphics code. Interestingly, that toll becomes much higher on ATs and 80386 machines because while those computers can execute many more instructions per microsecond than can the 8088-based PC, it takes just as long to access display memory on those computers as on the 8088-based PC. Remember, the limited speed of access to a graphics adapter is an inherent characteristic of the adapter, so the fastest computer around can’t access display memory one iota faster than the adapter will allow.
What can we do about the display adapter cycle-eater? Well, we can minimize display memory accesses whenever possible. In particular, we can try to avoid read/modify/write display memory operations of the sort used to mask individual pixels and clip images. Why? Because read/modify/write operations require two display memory accesses (one read and one write) each time display memory is manipulated. Instead, we should try to use writes of the sort that set all the pixels in a given byte of display memory at once, since such writes don’t require accompanying read accesses. The key here is that only half as many display memory accesses are required to write a byte to display memory as are required to read a byte from display memory, mask part of it off and alter the rest, and write the byte back to display memory. Half as many display memory accesses means half as many display memory wait states.
Moreover, 486s and Pentiums, as well as recent Super VGAs, employ write-caching schemes that make display memory writes considerably faster than display memory reads.
Along the same line, the display adapter cycle-eater makes the popular exclusive-OR animation technique, which requires paired reads and writes of display memory, less-than-ideal for the PC. Exclusive-OR animation should be avoided in favor of simply writing images to display memory whenever possible.
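As a concrete illustration of the access-count argument, here is a sketch of my own (not code from this chapter) that treats display memory as a simple packed-pixel byte array, ignoring the planar EGA/VGA hardware. It compares a masked read/modify/write update of a display byte with a write that sets the whole byte at once.

/* Contrast in display memory accesses per byte updated. DisplayByte is
   assumed to already point into display memory (for example, a far
   pointer into the CGA's B800h segment in a graphics mode); "far" is as
   accepted by 16-bit DOS compilers. */

/* Masked update: one display memory read plus one display memory
   write, each of which can suffer wait states */
void SetSomePixels(volatile unsigned char far *DisplayByte,
                   unsigned char PixelMask)
{
    *DisplayByte |= PixelMask;     /* read, modify, write */
}

/* Whole-byte update: a single display memory write */
void SetAllPixels(volatile unsigned char far *DisplayByte,
                  unsigned char AllPixels)
{
    *DisplayByte = AllPixels;      /* write only */
}

The exclusive-OR technique just mentioned has the same problem as the masked update: each XOR of a display byte is a read followed by a write, so it pays the display adapter cycle-eater twice per byte.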
Another principle for display adapter programming on the 8088 is to perform multiple accesses to display memory very rapidly, in order to make use of as many of the scarce accesses to display memory as possible. This is especially important when many large images need to be drawn quickly, since only by using virtually every available display memory access can many bytes be written to display memory in a short period of time. Repeated string instructions are ideal for making maximum use of display memory accesses; of course, repeated string instructions can only be used on whole bytes, so this is another point in favor of modifying display memory a byte at a time. (On faster processors, however, display memory is so slow that it often pays to do several instructions worth of work between display memory accesses, to take advantage of cycles that would otherwise be wasted on the wait states.)
It would be handy to explore the display adapter cycle-eater issue in depth, with lots of example code and execution timings, but alas, I don’t have the space for that right now. For the time being, all you really need to know about the display adapter cycle-eater is that on the 8088 you can lose more than 8 cycles of execution time on each access to display memory. For intensive access to display memory, the loss really can be as high as 8 cycles (and up to 50, 100, or even more on 486s and Pentiums paired with slow VGAs), while for average graphics code the loss is closer to 4 cycles; in either case, the impact on performance is significant. There is only one way to discover just how significant the impact of the display adapter cycle-eater is for any particular graphics code, and that is of course to measure the performance of that code.
We’ve covered a great deal of sophisticated material in this chapter, so don’t feel bad if you haven’t understood everything you’ve read; it will all become clear from further reading, especially once you study, time, and tune code that you have written yourself. What’s really important is that you come away from this chapter understanding that on the 8088 the 8-bit bus, the prefetch queue, DRAM refresh, and the display adapter all steal cycles from your code, and that official execution times tell you little about how fast code actually runs until you account for them.
This basic knowledge about cycle-eaters puts you in a good position to understand the results reported by the Zen timer, and that means that you’re well on your way to writing high-performance assembler code.
There you have it: life under the programming interface. It’s not a particularly pretty picture; the inhabitants of that strange realm where hardware and software meet are little-known cycle-eaters that sap the speed from your unsuspecting code. Still, some of those cycle-eaters can be minimized by keeping instructions short, using the registers, using byte-sized memory operands, and accessing display memory as little as possible. None of the cycle-eaters can be eliminated, and dynamic RAM refresh can scarcely be addressed at all; still, aren’t you better off knowing how fast your code really runs—and why—than you were reading the official execution times and guessing? And while specific cycle-eaters vary in importance on later x86-family processors, with some cycle-eaters vanishing altogether and new ones appearing, the concept that understanding these obscure gremlins is a key to performance remains unchanged, as we’ll see again and again in later chapters.
We just moved. Those three little words should strike terror into the heart of anyone who owns more than a sleeping bag and a toothbrush. Our last move was the usual zoo—and then some. Because the distance from the old house to the new was only five miles, we used cars to move everything smaller than a washing machine. We have a sizable household—cats, dogs, kids, computers, you name it—so the moving process took a number of car trips. A large number—33, to be exact. I personally spent about 15 hours just driving back and forth between the two houses. The move took days to complete.
Never again.
You’re probably wondering two things: What does this have to do with high-performance programming, and why on earth didn’t I rent a truck and get the move over in one or two trips, saving hours of driving? As it happens, the second question answers the first. I didn’t rent a truck because it seemed easier and cheaper to use cars—no big truck to drive, no rentals, spread the work out more manageably, and so on.
It wasn’t easier, and wasn’t even much cheaper. (It costs quite a bit to drive a car 330 miles, to say nothing of the value of 15 hours of my time.) But, at the time, it seemed as though my approach would be easier and cheaper. In fact, I didn’t realize just how much time I had wasted driving back and forth until I sat down to write this chapter.
In Chapter 1, I briefly discussed using restartable blocks. This, you might remember, is the process of handling in chunks data sets too large to fit in memory so that they can be processed just about as fast as if they did fit in memory. The restartable block approach is very fast but is relatively difficult to program.
At the opposite end of the spectrum lies byte-by-byte processing, whereby DOS (or, in less extreme cases, a group of library functions) is allowed to do all the hard work, so that you only have to deal with one byte at a time. Byte-by-byte processing is easy to program but can be extremely slow, due to the vast overhead that results from invoking DOS each time a byte must be processed.
Sound familiar? It should. I moved via the byte-by-byte approach, and the overhead of driving back and forth made for miserable performance. Renting a truck (the restartable block approach) would have required more effort and forethought, but would have paid off handsomely.
The easy, familiar approach often has nothing in its favor except that it requires less thinking; not a great virtue when writing high-performance code—or when moving.
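Before we get to the real program, here is a deliberately simple sketch of my own (not Listing 5.1) showing the two ends of the spectrum for a trivial task, counting occurrences of a byte in a file: one version calls into the C library for every single byte, the other reads the file in 16K blocks and does all the per-byte work in memory. The stream version is already buffered by the C library, so it isn’t the worst possible case, but the structural difference is the point.

/* Two ways to count occurrences of a byte in a file: one library call
   per byte versus one call per 16K block. */
#include <stdio.h>

#define BUFFER_SIZE 0x4000            /* 16K blocks, as in Listing 5.1 */

/* Byte-by-byte: one getc() call per byte processed */
long CountByteByByte(FILE *File, int Value)
{
    long Count = 0;
    int c;

    while ((c = getc(File)) != EOF) {
        if (c == Value)
            Count++;
    }
    return Count;
}

/* Block at a time: one fread() call per BUFFER_SIZE bytes processed */
long CountByBlocks(FILE *File, int Value)
{
    static unsigned char Buffer[BUFFER_SIZE];
    long Count = 0;
    size_t Length, i;

    while ((Length = fread(Buffer, 1, BUFFER_SIZE, File)) > 0) {
        for (i = 0; i < Length; i++) {
            if (Buffer[i] == Value)
                Count++;
        }
    }
    return Count;
}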
And with that, let’s look at a fairly complex application of restartable blocks.
The application we’re going to examine searches a file for a specified string. We’ll develop a program that will search the file specified on the command line for a string (also specified on the command line), then report whether the string was found or not. (Because the searched-for string is obtained via argv, it can’t contain any whitespace characters.)
This is a very limited subset of what search utilities such as grep can do, and isn’t really intended to be a generally useful application; the purpose is to provide insight into restartable blocks in particular and optimization in general in the course of developing a search engine. That search engine will, however, be easy to plug into any program, and there’s nothing preventing you from using it in a more fruitful context, like searching through a user-selectable file set.
The first point to address in designing our program involves the appropriate text-search approach to use. Literally dozens of workable ways exist to search a file. We can immediately discard all approaches that involve reading any byte of the file more than once, because disk access time is orders of magnitude slower than any data handling performed by our own code. Based on our experience in Chapter 1, we can also discard all approaches that get bytes either one at a time or in small sets from DOS. We want to read big “buffers-full” of bytes at a pop from the searched file, and the bigger the buffer the better—in order to minimize DOS’s overhead. A good rough cut is a buffer that will be between 16K and 64K, depending on the exact search approach, 64K being the maximum size because near pointers make for superior performance.
So we know we want to work with a large buffer, filling it as infrequently as possible. Now we have to figure out how to search through a file by loading it into that large buffer in chunks. To accomplish this, we have to know how we want to do our searching, and that’s not immediately obvious. Where do we begin?
Well, it might be instructive to consider how we would search if our search involved only one buffer, already resident in memory. In other words, suppose we don’t have to bother with file handling at all, and further suppose that we don’t have to deal with searching through multiple blocks. After all, that’s a good description of the all-important inner loop of our searching program, where the program will spend virtually all of its time (aside from the unavoidable disk access overhead).
The easiest approach would be to use a C/C++ library function. The closest match to what we need is strstr(), which searches one string for the first occurrence of a second string. However, while strstr() would work, it isn’t ideal for our purposes. The problem is this: Where we want to search a fixed-length buffer for the first occurrence of a string, strstr() searches a string for the first occurrence of another string.
We could put a zero byte at the end of our buffer to allow strstr() to work, but why bother? The strstr() function must spend time either checking for the end of the string being searched or determining the length of that string—wasted effort given that we already know exactly how long our search buffer is. Even if a given strstr() implementation is well-written, its performance will suffer, at least for our application, from unnecessary overhead.
This illustrates why you shouldn’t think of C/C++ library functions as black boxes; understand what they do and try to figure out how they do it, and relate that to their performance in the context you’re interested in.
Given that no C/C++ library function meets our needs precisely, an obvious alternative approach is the brute-force technique that uses memcmp() to compare every potential matching location in the buffer to the string we’re searching for, as illustrated in Figure 5.1.
By the way, we could, of course, use our own code, working with pointers in a loop, to perform the comparison in place of memcmp(). But memcmp() will almost certainly use the very fast REPZ CMPS instruction. However, never assume! It wouldn’t hurt to use a debugger to check out the actual machine-code implementation of memcmp() from your compiler. If necessary, you could always write your own assembly language implementation of memcmp().
Invoking memcmp() for each potential match location works, but entails considerable overhead. Each comparison requires that parameters be pushed and that a call to and return from memcmp() be performed, along with a pass through the comparison loop. Surely there’s a better way!
Indeed there is. We can eliminate most calls to memcmp()
by performing a simple test on each potential match location that will
reject most such locations right off the bat. We’ll just check whether
the first character of the potentially matching buffer location matches
the first character of the string we’re searching for. We could make
this check by using a pointer in a loop to scan the buffer for the next
match for the first character, stopping to check for a match with the
rest of the string only when the first character matches, as
shown in Figure 5.2.
There’s yet a better way to implement this approach, however. Use the memchr() function, which does nothing more or less than find the next occurrence of a specified character in a fixed-length buffer (presumably by using the extremely efficient REPNZ SCASB instruction, although again it wouldn’t hurt to check). By using memchr() to scan for potential matches that can then be fully tested with memcmp(), we can build a highly efficient search engine that takes good advantage of the information we have about the buffer being searched and the string we’re searching for. Our engine also relies heavily on repeated string instructions, assuming that the memchr() and memcmp() library functions are properly coded.
We’re going to go with this approach in our file-searching program; the only trick lies in deciding how to integrate this approach with restartable blocks in order to search through files larger than our buffer. This certainly isn’t the fastest-possible searching algorithm; as one example, the Boyer-Moore algorithm, which cleverly eliminates many buffer locations as potential matches in the process of checking preceding locations, can be considerably faster. However, the Boyer-Moore algorithm is quite complex to understand and implement, and would distract us from our main focus, restartable blocks, so we’ll save it for a later chapter (Chapter 14, to be precise). Besides, I suspect you’ll find the approach we’ll use to be fast enough for most purposes.
Now that we’ve selected a searching approach, let’s integrate it with file handling and searching through multiple blocks. In other words, let’s make it restartable.
As it happens, there’s no great trick to putting the pieces of this search program together. Basically, we’ll read in a buffer of data (we’ll work with 16K at a time to avoid signed overflow problems with integers), search it for a match with the memchr()/memcmp() engine described, and exit with a “string found” response if the desired string is found.
Otherwise, we’ll load in another buffer full of data from the file, search it, and so on. The only trick lies in handling potentially matching sequences in the file that start in one buffer and end in the next—that is, sequences that span buffers. We’ll handle this by copying the unchecked bytes at the end of one buffer to the start of the next and reading that many fewer bytes the next time we fill the buffer.
The exact number of bytes to be copied from the end of one buffer to the start of the next is the length of the searched-for string minus 1, since that’s how many bytes at the end of the buffer can’t be checked as possible matches (because the check would run off the end of the buffer).
That’s really all there is to it. Listing 5.1 shows the file-searching program. As you can see, it’s not particularly complex, although a few fairly opaque lines of code are required to handle merging the end of one block with the start of the next. The code that searches a single block—the function SearchForString()—is simple and compact (as it should be, given that it’s by far the most heavily-executed code in the listing).
Listing 5.1 nicely illustrates the core concept of restartable blocks: Organize your program so that you can do your processing within each block as fast as you could if there were only one block—which is to say at top speed—and make your blocks as large as possible in order to minimize the overhead associated with going from one block to the next.
LISTING 5.1 SEARCH.C
/* Program to search the file specified by the first command-line
 * argument for the string specified by the second command-line
 * argument. Performs the search by reading and searching blocks
 * of size BLOCK_SIZE. */

#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <alloc.h>   /* alloc.h for Borland compilers,
                        malloc.h for Microsoft compilers */

#define BLOCK_SIZE 0x4000   /* we'll process the file in 16K blocks */

/* Searches the specified number of sequences in the specified
   buffer for matches to SearchString of SearchStringLength. Note
   that the calling code should already have shortened SearchLength
   if necessary to compensate for the distance from the end of the
   buffer to the last possible start of a matching sequence in the
   buffer.
*/
int SearchForString(unsigned char *Buffer, int SearchLength,
      unsigned char *SearchString, int SearchStringLength)
{
   unsigned char *PotentialMatch;

   /* Search so long as there are potential-match locations
      remaining */
   while ( SearchLength ) {
      /* See if the first character of SearchString can be found */
      if ( (PotentialMatch =
            memchr(Buffer, *SearchString, SearchLength)) == NULL ) {
         break;      /* No matches in this buffer */
      }
      /* The first character matches; see if the rest of the string
         also matches */
      if ( SearchStringLength == 1 ) {
         return(1);  /* That one matching character was the whole
                        search string, so we've got a match */
      }
      else {
         /* Check whether the remaining characters match */
         if ( !memcmp(PotentialMatch + 1, SearchString + 1,
               SearchStringLength - 1) ) {
            return(1);  /* We've got a match */
         }
      }
      /* The string doesn't match; keep going by pointing past the
         potential match location we just rejected */
      SearchLength -= PotentialMatch - Buffer + 1;
      Buffer = PotentialMatch + 1;
   }
   return(0);  /* No match found */
}

main(int argc, char *argv[]) {
   int Done;               /* Indicates whether search is done */
   int Handle;             /* Handle of file being searched */
   int WorkingLength;      /* Length of current block */
   int SearchStringLength; /* Length of string to search for */
   int BlockSearchLength;  /* Length to search in current block */
   int Found;              /* Indicates final search completion
                              status */
   int NextLoadCount;      /* # of bytes to read into next block,
                              accounting for bytes copied from the
                              last block */
   unsigned char *WorkingBlock; /* Block storage buffer */
   unsigned char *SearchString; /* Pointer to the string to search for */
   unsigned char *NextLoadPtr;  /* Offset at which to start loading
                                   the next block, accounting for
                                   bytes copied from the last block */

   /* Check for the proper number of arguments */
   if ( argc != 3 ) {
      printf("usage: search filename search-string\n");
      exit(1);
   }
   /* Try to open the file to be searched */
   if ( (Handle = open(argv[1], O_RDONLY | O_BINARY)) == -1 ) {
      printf("Can't open file: %s\n", argv[1]);
      exit(1);
   }
   /* Calculate the length of text to search for */
   SearchString = argv[2];
   SearchStringLength = strlen(SearchString);
   /* Try to get memory in which to buffer the data */
   if ( (WorkingBlock = malloc(BLOCK_SIZE)) == NULL ) {
      printf("Can't get enough memory\n");
      exit(1);
   }
   /* Load the first block at the start of the buffer, and try to
      fill the entire buffer */
   NextLoadPtr = WorkingBlock;
   NextLoadCount = BLOCK_SIZE;
   Done = 0;      /* Not done with search yet */
   Found = 0;     /* Assume we won't find a match */
   /* Search the file in BLOCK_SIZE chunks */
   do {
      /* Read in however many bytes are needed to fill out the block
         (accounting for bytes copied over from the last block), or
         the rest of the bytes in the file, whichever is less */
      if ( (WorkingLength = read(Handle, NextLoadPtr,
            NextLoadCount)) == -1 ) {
         printf("Error reading file %s\n", argv[1]);
         exit(1);
      }
      /* If we didn't read all the bytes we requested, we're done
         after this block, whether we find a match or not */
      if ( WorkingLength != NextLoadCount ) {
         Done = 1;
      }
      /* Account for any bytes we copied from the end of the last
         block in the total length of this block */
      WorkingLength += NextLoadPtr - WorkingBlock;
      /* Calculate the number of bytes in this block that could
         possibly be the start of a matching sequence that lies
         entirely in this block (sequences that run off the end of
         the block will be transferred to the next block and found
         when that block is searched)
      */
      if ( (BlockSearchLength =
            WorkingLength - SearchStringLength + 1) <= 0 ) {
         Done = 1;   /* Too few characters in this block for
                        there to be any possible matches, so this
                        is the final block and we're done without
                        finding a match
                     */
      }
      else {
         /* Search this block */
         if ( SearchForString(WorkingBlock, BlockSearchLength,
               SearchString, SearchStringLength) ) {
            Found = 1;     /* We've found a match */
            Done = 1;
         }
         else {
            /* Copy any bytes from the end of the block that start
               potentially-matching sequences that would run off
               the end of the block over to the next block */
            if ( SearchStringLength > 1 ) {
               memcpy(WorkingBlock,
                     WorkingBlock + BLOCK_SIZE - SearchStringLength + 1,
                     SearchStringLength - 1);
            }
            /* Set up to load the next bytes from the file after the
               bytes copied from the end of the current block */
            NextLoadPtr = WorkingBlock + SearchStringLength - 1;
            NextLoadCount = BLOCK_SIZE - SearchStringLength + 1;
         }
      }
   } while ( !Done );
   /* Report the results */
   if ( Found ) {
      printf("String found\n");
   } else {
      printf("String not found\n");
   }
   exit(Found);   /* Return the found/not found status as the
                     DOS errorlevel */
}
To boost the overall performance of Listing 5.1, I would normally convert SearchForString() to assembly language at this point. However, I’m not going to do that, and the reason is as important a lesson as any discussion of optimized assembly code is likely to be. Take a moment to examine some interesting performance aspects of the C implementation, and all should become much clearer.
As you’ll recall from Chapter 1, one of the important rules for optimization involves knowing when optimization is worth bothering with at all. Another rule involves understanding where most of a program’s execution time is going. That’s more true for Listing 5.1 than you might think.
When Listing 5.1 is run on a 1 MB assembly source file, it takes about three seconds to find the string “xxxend” (which is at the end of the file) on a 20 MHz 386 machine, with the entire file in a disk cache. If BLOCK_SIZE is trimmed from 16K to 4K, execution time does not increase perceptibly! At 2K, the program slows slightly; it’s not until the block size shrinks to 64 bytes that execution time becomes approximately double that of the 16K buffer.
So the first thing we’ve discovered is that, while bigger blocks do make for the best performance, the increment in performance may not be very large, and might not justify the extra memory required for those larger blocks. Our next discovery is that, even though we read the file in large chunks, most of the execution time of Listing 5.1 is nonetheless spent in executing the read() function.
When I replaced the read() function call in Listing 5.1 with code that simply fools the program into thinking that a 1 MB file is being read, the program ran almost instantaneously—in less than 1/2 second, even when the searched-for string wasn’t anywhere to be found. By contrast, Listing 5.1 requires three seconds to run even when searching for a single character that isn’t found anywhere in the file, the case in which a single call to memchr() (and thus a single REPNZ SCASB) can eliminate an entire block at a time.
All in all, the time required for DOS disk access calls is taking up at least 80 percent of execution time, and search time is less than 20 percent of overall execution time. In fact, search time is probably a good deal less than 20 percent of the total, given that the overhead of loading the program, running through the C startup code, opening the file, executing printf(), and exiting the program and returning to the DOS shell are also included in my timings. Given which, it should be apparent why converting to assembly language isn’t worth the trouble—the best we could do by speeding up the search is a 10 percent or so improvement, and that would require more than doubling the performance of code that already uses repeated string instructions to do most of the work.
Not likely.
So that’s why we’re not going to go to assembly language in this example—which is not to say it would never be worth converting the search engine in Listing 5.1 to assembly.
If, for example, your application will typically search buffers in which the first character of the search string occurs frequently, as might be the case when searching a text buffer for a string starting with the space character, an assembly implementation might be several times faster. Why? Because assembly code can switch from REPNZ SCASB to match the first character to REPZ CMPS to check the remaining characters in just a few instructions.
In contrast, Listing 5.1 must return from memchr(), set up parameters, and call memcmp() in order to do the same thing. Likewise, assembly can switch back to REPNZ SCASB after a non-match much more quickly than Listing 5.1. The switching overhead is high; when searching a file completely filled with the character z for the string “zy,” Listing 5.1 takes almost 1/2 minute, or nearly an order of magnitude longer than when searching a file filled with normal text.
It might also be worth converting the search engine to assembly for searches performed entirely in memory; with the overhead of file access eliminated, improvements in search-engine performance would translate directly into significantly faster overall performance. One such application that would have much the same structure as Listing 5.1 would be searching through expanded memory buffers, and another would be searching through huge (segment-spanning) buffers.
And so we find, as we so often will, that optimization is definitely not a cut-and-dried matter, and that there is no such thing as a single “best” approach.
You must know what your application will typically do, and you must know whether you’re more concerned with average or worst-case performance before you can decide how best to speed up your program—and, indeed, whether speeding it up is worth doing at all.
By the way, don’t think that just because very large block sizes don’t much improve performance, it wasn’t worth using restartable blocks in Listing 5.1. Listing 5.1 runs more than three times more slowly with a block size of 32 bytes than with a block size of 4K, and any byte-by-byte approach would surely be slower still, due to the overhead of repeated calls to DOS and/or the C stream I/O library.
Restartable blocks do minimize the overhead of DOS file-access calls in Listing 5.1; it’s just that there’s no way to reduce that overhead to the point where it becomes worth attempting to further improve the performance of our relatively efficient search engine. Although the search engine is by no means fully optimized, it’s nonetheless as fast as there’s any reason for it to be, given the balance of performance among the components of this program.
I’ve explained two important lessons: Know when it’s worth optimizing further, and use restartable blocks to process large data sets as a series of blocks, with each block handled at high speed. The first lesson is less obvious than it seems.
When I set out to write this chapter, I fully intended to write an assembly language version of Listing 5.1, and I expected the assembly version to be much faster. When I actually looked at where execution time was going (which I did by modifying the program to remove the calls to the read() function, but a code profiler could be used to do the same thing much more easily), I found that the best code in the world wouldn’t make much difference.
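In case you’d like to try the same experiment, here’s one way to do it. This is a sketch of my own; FakeRead is a hypothetical stand-in, not part of Listing 5.1. Substituting it for read() means the timing covers only the search engine and the program’s own overhead.

/* Stand-in for read() that pretends to deliver a 1 MB file full of 'z'
   characters, so that disk and DOS overhead drop out of the timing. */
#include <string.h>

#define FAKE_FILE_SIZE 0x100000L      /* pretend the file is 1 MB */

static long FakeBytesLeft = FAKE_FILE_SIZE;

int FakeRead(int Handle, void *Buffer, unsigned int Count)
{
    unsigned int Length;

    (void) Handle;                    /* no real file involved */
    Length = (FakeBytesLeft > (long) Count) ?
          Count : (unsigned int) FakeBytesLeft;
    memset(Buffer, 'z', Length);      /* hand back dummy data */
    FakeBytesLeft -= Length;
    return (int) Length;              /* same convention as read() */
}

Calling FakeRead() in place of read() in Listing 5.1 reproduces the experiment described above.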
When you try to speed up code, take a moment to identify the hot spots in your program so that you know where optimization is needed and whether it will make a significant difference before you invest your time.
As for restartable blocks: Here we tackled a considerably more complex application of restartable blocks than we did in Chapter 1—which turned out not to be so difficult after all. Don’t let irregularities in the programming tasks you tackle, such as strings that span blocks, fluster you into settling for easy, general—and slow—solutions. Focus on making the inner loop—the code that handles each block—as efficient as possible, then structure the rest of your code to support the inner loop.
Programming with restartable blocks isn’t easy, but when speed is an issue, using restartable blocks in the right places more than pays for itself with greatly improved performance. And when speed is not an issue, of course, or in code that’s not time-critical, you wouldn’t dream of wasting your time on optimization.
Would you?
I first met Jeff Duntemann at an authors’ dinner hosted by PC Tech Journal at Fall Comdex, back in 1985. Jeff was already reasonably well-known as a computer editor and writer, although not as famous as Complete Turbo Pascal, editions 1 through 672 (or thereabouts), TURBO TECHNIX, and PC TECHNIQUES would soon make him. I was fortunate enough to be seated next to Jeff at the dinner table, and, not surprisingly, our often animated conversation revolved around computers, computer writing, and more computers (not necessarily in that order).
Although I was making a living at computer work and enjoying it at the time, I nonetheless harbored vague ambitions of being a science-fiction writer when I grew up. (I have since realized that this hardly puts me in elite company, especially in the computer world, where it seems that every other person has told me they plan to write science fiction “someday.” Given that probably fewer than 500—I’m guessing here—original science fiction and fantasy short stories, and perhaps a few more novels than that, are published each year in this country, I see a few mid-life crises coming.)
At any rate, I had accumulated a small collection of rejection slips, and fancied myself something of an old hand in the field. At the end of the dinner, as the other writers complained half-seriously about how little they were paid for writing for Tech Journal, I leaned over to Jeff and whispered, “You know, the pay isn’t so bad here. You should see what they pay for science fiction—even to the guys who win awards!”
To which Jeff replied, “I know. I’ve been nominated for two Hugos.”
Oh.
Had I known I was seated next to a real, live science-fiction writer—an award-nominated writer, by God!—I would have pumped him for all I was worth, but the possibility had never occurred to me. I was at a dinner put on by a computer magazine, seated next to an editor who had just finished a book about Turbo Pascal, and, gosh, it was obvious that the appropriate topic was computers.
For once, the moral is not “don’t judge a book by its cover.” Jeff is in fact what he appeared to be at face value: a computer writer and editor. However, he is more, too; face value wasn’t full value. You’ll similarly find that face value isn’t always full value in computer programming, and especially so when working in assembly language, where many instructions have talents above and beyond their obvious abilities.
On the other hand, there are also a number of instructions, such as
LOOP
, that are designed to perform specific functions but
aren’t always the best instructions for those functions. So don’t judge
a book by its cover, either.
Assembly language for the x86 family isn’t like any other language (for which we should, without hesitation, offer our profuse thanks). Assembly language reflects the design of the processor rather than the way we think, so it’s full of multiple instructions that perform similar functions, instructions with odd and often confusing side effects, and endless ways to string together different instructions to do much the same things, often with seemingly minuscule differences that can turn out to be surprisingly important.
To produce the best code, you must decide precisely what you need to accomplish, then put together the sequence of instructions that accomplishes that end most efficiently, regardless of what the instructions are usually used for. That’s why optimization for the PC is an art, and it’s why the best assembly language for the x86 family will almost always handily outperform compiled code. With that in mind, let’s look past face value—and while we’re at it, I’ll toss in a few examples of not judging a book by its cover.
The point to all this: You must come to regard the x86 family
instructions for what they do, not what you’re used to thinking they do.
Yes, SHL
shifts a pattern left—but a look-up table can do
the same thing, and can often do it faster. ADD
can indeed
add two operands, but it can’t put the result in a third register;
LEA
can. The instruction set is your raw material for
writing high-performance code. By limiting yourself to thinking only in
certain well-established ways about the various instructions, you’re
putting yourself at a substantial disadvantage every time you sit down
to program.
In short, the x86 family can do much more than you think—if you’ll use everything it has to offer. Give it a shot!
Years ago, I saw a clip on the David Letterman show in which Letterman walked into a store by the name of “Just Lamps” and asked, “So what do you sell here?”
“Lamps,” he was told. “Just lamps. Can’t you read?”
“Lamps,” he said. “I see. And what else?”
From that bit of sublime idiocy we can learn much about divining the full value of an instruction. To wit:
Quick, what do the x86’s memory addressing modes do?
“Calculate memory addresses,” you no doubt replied. And you’re right, of course. But what else do they do?
They perform arithmetic, that’s what they do, and that’s a distinctly different and often useful perspective on memory address calculations.
For example, suppose you have an array base address in BX and an index into the array in SI. You could add the two registers together to address memory, like this:
add bx,si
mov al,[bx]
Or you could let the processor do the arithmetic for you in a single instruction:
mov al,[bx+si]
The two approaches are functionally interchangeable but not
equivalent from a performance standpoint, and which is better depends on
the particular context. If it’s a one-shot memory access, it’s best to
let the processor perform the addition; it’s generally faster at doing
this than a separate ADD
instruction would be. If it’s a
memory access within a loop, however, it’s advantageous on the 8088 CPU
to perform the addition outside the loop, if possible, reducing
effective address calculation time inside the loop, as in the
following:
add bx,si
LoopTop:
mov al,[bx]
inc bx
loop LoopTop
Here, MOV AL,[BX]
is two cycles faster than
MOV AL,[BX+SI]
.
On a 286 or 386, however, the balance shifts.
MOV AL,[BX+SI]
takes no longer than
MOV AL,[BX]
on these processors because effective address
calculations generally take no extra time at all. (According to the MASM
manual, one extra clock is required if three memory addressing
components, as in MOV AL,[BX+SI+1]
, are used. I have not
been able to confirm this from Intel publications, but then I haven’t
looked all that hard.) If you’re optimizing for the 286 or 386, then,
you can take advantage of the processor’s ability to perform arithmetic
as part of memory address calculations without taking a performance
hit.
The 486 is an odd case, in which the use of an index register or the use of a base register that’s the destination of the previous instruction may slow things down, so it is generally but not always better to perform the addition outside the loop on the 486. All memory addressing calculations are free on the Pentium, however. I’ll discuss 486 performance issues in Chapters 12 and 13, and the Pentium in Chapters 19 through 21.
You’re probably not particularly wowed to hear that you can use addressing modes to perform memory addressing arithmetic that would otherwise have to be performed with separate arithmetic instructions. You may, however, be a tad more interested to hear that you can also use addressing modes to perform arithmetic that has nothing to do with memory addressing, and with a couple of advantages over arithmetic instructions, at that.
How?
With LEA
, the only instruction that performs memory
addressing calculations but doesn’t actually address memory.
LEA
accepts a standard memory addressing operand, but does
nothing more than store the calculated memory offset in the specified
register, which may be any general-purpose register. The operation of
LEA
is illustrated in Figure 6.1, which also shows the
operation of register-to-register ADD
, for comparison.
What does that give us? Two things that ADD
doesn’t
provide: the ability to perform addition with either two or three
operands, and the ability to store the result in any register,
not just in one of the source operands.
Imagine that we want to add BX to DI, add two to the result, and store the result in AX. The obvious solution is this:
mov ax,bx
add ax,di
add ax,2
(It would be more compact to increment AX twice than to add two to it, and would probably be faster on an 8088, but that’s not what we’re after at the moment.) An elegant alternative solution is simply:
lea ax,[bx+di+2]
Likewise, either of the following would copy SI plus two to DI
mov di,si
add di,2
or:
lea di,[si+2]
Mind you, the only components LEA
can add are BX or BP,
SI or DI, and a constant displacement, so it’s not going to replace
ADD
most of the time. Also, LEA
is
considerably slower than ADD
on an 8088, although it is
just as fast as ADD
on a 286 or 386 when fewer than three
memory addressing components are used. LEA
is 1 cycle
slower than ADD
on a 486 if the sum of two registers is
used to point to memory, but no slower than ADD
on a
Pentium. On both a 486 and Pentium, LEA
can also be slowed
down by addressing interlocks.
LEA
really comes into its own as a “super-ADD”
instruction on the 386, 486, and Pentium, where it can take advantage of
the enhanced memory addressing modes of those processors. (The 486 and
Pentium offer the same modes as the 386, so I’ll refer only to the 386
from now on.) The 386 can do two very interesting things: It can use
any 32-bit register (EAX, EBX, and so on) as the memory
addressing base register and/or the memory addressing index register,
and it can multiply any 32-bit register used as an index by two, four,
or eight in the process of calculating a memory address, as shown in
Figure 6.2. Let’s see what that’s good for.
Well, the obvious advantage is that any two 32-bit registers, or any
32-bit register and any constant, or any two 32-bit registers and any
constant, can be added together, with the result stored in any register.
This makes the 32-bit LEA
much more generally useful than
the standard 16-bit LEA
in the role of an ADD
with an independent destination.
But what else can LEA
do on a 386, besides add?
It can multiply any register used as an index. LEA
can
multiply only by the power-of-two values 2, 4, or 8, but that’s useful
more often than you might imagine, especially when dealing with pointers
into tables. Besides, multiplying by 2, 4, or 8 amounts to a left shift
of 1, 2, or 3 bits, so we can now add up to two 32-bit registers and a
constant, and shift (or multiply) one of the registers to some
extent—all with a single instruction. For example,
lea edi,TableBase[ecx+edx*4]
replaces all this
mov edi,edx
shl edi,2
add edi,ecx
add edi,offset TableBase
when pointing to an entry in a doubly indexed table.
Are you impressed yet with all that LEA
can do on the
386? Believe it or not, one more feature still awaits us.
LEA
can actually perform a fast multiply of a 32-bit
register by some values other than powers of two. You see, the
same 32-bit register can be both base and index on the 386, and can be
scaled as the index while being used unchanged as the base. That means
that you can, for example, multiply EBX by 5 with:
lea ebx,[ebx+ebx*4]
Without LEA
and scaling, multiplication of EBX by 5
would require either a relatively slow MUL
, along with a
set-up instruction or two, or three separate instructions along the
lines of the following
mov edx,ebx
shl ebx,2
add ebx,edx
and would in either case require the destruction of the contents of another register.
Multiplying a 32-bit value by a non-power-of-two multiplier in just 2 cycles is a pretty neat trick, even though it works only on a 386 or 486.
The full list of values that
LEA
can multiply a register by on a 386 or 486 is: 2, 3, 4, 5, 8, and 9. That list doesn’t include every multiplier you might want, but it covers some commonly used ones, and the performance is hard to beat.
I’d like to extend my thanks to Duane Strong of Metagraphics for his
help in brainstorming uses for the 386 version of LEA
and
for pointing out the complications of 486 instruction timings.
You might not think it, but there’s much to learn about performance programming from the Great Buffalo Sauna Fiasco. To wit:
The scene is Buffalo, New York, in the dead of winter, with the snow piled several feet deep. Four college students, living in typical student housing, are frozen to the bone. The third floor of their house, uninsulated and so cold that it’s uninhabitable, has an ancient bathroom. One fabulously cold day, inspiration strikes:
“Hey—we could make that bathroom into a sauna!”
Pandemonium ensues. Someone rushes out and buys a gas heater, and at considerable risk to life and limb hooks it up to an abandoned but still live gas pipe that once fed a stove on the third floor. Someone else gets sheets of plastic and lines the walls of the bathroom to keep the moisture in, and yet another student gets a bucket full of rocks. The remaining chap brings up some old wooden chairs and sets them up to make benches along the sides of the bathroom. Voila—instant sauna!
They crank up the gas heater, put the bucket of rocks in front of it, close the door, take off their clothes, and sit down to steam themselves. Mind you, it’s not yet 50 degrees Fahrenheit in this room, but the gas heater is roaring. Surely warmer times await.
Indeed they do. The temperature climbs to 55 degrees, then 60, then 63, then 65, and finally creeps up to 68 degrees.
And there it stops.
68 degrees is warm for an uninsulated third floor in Buffalo in the dead of winter. Damn warm. It is not, however, particularly warm for a sauna. Eventually someone acknowledges the obvious and allows that it might have been a stupid idea after all, and everyone agrees, and they shut off the heater and leave, each no doubt offering silent thanks that they had gotten out of this without any incidents requiring major surgery.
And so we see that the best idea in the world can fail for lack of either proper design or adequate horsepower. The primary cause of the Great Buffalo Sauna Fiasco was a lack of horsepower; the gas heater was flat-out undersized. This is analogous to trying to write programs that incorporate features like bitmapped text and searching of multisegment buffers without using high-performance assembly language. Any PC language can perform just about any function you can think of—eventually. That heater would eventually have heated the room to 110 degrees, too—along about the first of June or so.
The Great Buffalo Sauna Fiasco also suffered from fundamental design flaws. A more powerful heater would indeed have made the room hotter—and might well have burned the house down in the process. Likewise, proper algorithm selection and good design are fundamental to performance. The extra horsepower a superb assembly language implementation gives a program is worth bothering with only in the context of a good design.
Assembly language optimization is a small but crucial corner of the PC programming world. Use it sparingly and only within the framework of a good design—but ignore it and you may find various portions of your anatomy out in the cold.
So, drawing fortitude from the knowledge that our quest is a pure and worthy one, let’s resume our exploration of assembly language instructions with hidden talents and instructions with well-known talents that are less than they appear to be. In the process, we’ll come to see that there is another, very important optimization level between the algorithm/design level and the cycle-counting/individual instruction level. I’ll call this middle level local optimization; it involves focusing on optimizing sequences of instructions rather than individual instructions, all with an eye to implementing designs as efficiently as possible given the capabilities of the x86 family instruction set.
And yes, in case you’re wondering, the above story is indeed true. Was I there? Let me put it this way: If I were, I’d never admit it!
Let’s examine first an instruction that is less than it appears to
be: LOOP
. There’s no mystery about what LOOP
does; it decrements CX and branches if CX doesn’t decrement to zero.
It’s so beautifully suited to the task of counting down loops that any
experienced x86 programmer instinctively stuffs the loop count in CX and
reaches for LOOP
when setting up a loop. That’s
fine—LOOP
does, of course, work as advertised—but there is
one problem:
On half of the processors in the x86 family, LOOP is slower than DEC CX followed by JNZ. (Granted, DEC CX/JNZ isn’t precisely equivalent to LOOP, because DEC alters the flags and LOOP doesn’t, but in most situations they’re comparable.)
How can this be? Don’t ask me, ask Intel. On the 8088 and 80286,
LOOP
is indeed faster than DEC CX/JNZ
by a
cycle, and LOOP
is generally a little faster still because
it’s a byte shorter and so can be fetched faster. On the 386, however,
things change; LOOP
is two cycles slower than
DEC/JNZ
and the fetch time for one extra byte on even an
uncached 386 generally isn’t significant. (Remember that the 386 fetches
four instruction bytes at a pop.) LOOP
is three cycles
slower than DEC/JNZ
on the 486, and the 486 executes
instructions in so few cycles that those three cycles mean that
DEC/JNZ
is nearly twice as fast as
LOOP
. Then, too, unlike LOOP, DEC
doesn’t
require that CX
be used, so the DEC/JNZ
solution is both faster and more flexible on the 386 and 486, and on the
Pentium as well. (By the way, all this is not just theory; I’ve timed
the relative performances of LOOP
and
DEC CX/JNZ
on a cached 386, and LOOP really is slower.)
Things are stranger still for LOOP’s relative JCXZ, which branches if and only if CX is zero. JCXZ is faster than AND CX,CX/JZ on the 8088 and 80286, and equivalent on the 80386—but is about twice as slow on the 486!
By the way, don’t fall victim to the lures of JCXZ
and
do something like this:
and cx,0fh ;Isolate the desired field
jcxz SkipLoop ;If field is 0, don't bother
The AND
instruction has already set the Zero flag, so
this
and cx,0fh ;Isolate the desired field
jz SkipLoop ;If field is 0, don't bother
will do just fine and is faster on all processors. Use
JCXZ
only when the Zero flag isn’t already set to reflect
the status of CX.
What can we learn from LOOP
and JCXZ
?
First, that a single instruction that is intended to do a complex task
is not necessarily faster than several instructions that together do the
same thing. Second, that the relative merits of instructions and
optimization rules vary to a surprisingly large degree across the x86
family.
In particular, if you’re going to write 386 protected mode code,
which will run only on the 386, 486, and Pentium, you’d be well advised
to rethink your use of the more esoteric members of the x86 instruction
set. LOOP, JCXZ
, the various accumulator-specific
instructions, and even the string instructions in many circumstances no
longer offer the advantages they did on the 8088. Sometimes they’re just
not any faster than more general instructions, so they’re not worth
going out of your way to use; sometimes, as with LOOP
,
they’re actually slower, and you’d do well to avoid them altogether in
the 386/486 world. Reviewing the instruction cycle times in the MASM or
TASM manuals, or looking over the cycle times in Intel’s literature, is
a good place to start; published cycle times are closer to actual
execution times on the 386 and 486 than on the 8088, and are reasonably
reliable indicators of the relative performance levels of x86
instructions.
Cycle counting and directly substituting instructions
(DEC CX/JNZ
for LOOP
, for example) are
techniques that belong at the lowest level of optimization. It’s an
important level, but it’s fairly mechanical; once you’ve learned the
capabilities and relative performance levels of the various
instructions, you should be able to select the best instructions fairly
easily. What’s more, this is a task at which compilers excel. What I’m
saying is that you shouldn’t get too caught up in counting cycles
because that’s a small (albeit important) part of the optimization
picture, and not the area in which your greatest advantage lies.
One level at which assembly language programming pays off handsomely is that of local optimization; that is, selecting the best sequence of instructions for a task. The key to local optimization is viewing the 80x86 instruction set as a set of building blocks, each with unique characteristics. Your job is to sequence those blocks so that they perform well. It doesn’t matter what the instructions are intended to do or what their names are; all that matters is what they do.
Our discussion of LOOP
versus DEC/JNZ
is an
excellent example of optimization by cycle counting. It’s worth knowing,
but once you’ve learned it, you just routinely use DEC/JNZ
at the bottom of loops in 386/486-specific code, and that’s that.
Besides, you’ll save at most a few cycles each time, and while that
helps a little, it’s not going to make all that much
difference.
Now let’s step back for a moment, and with no preconceptions consider
what the x86 instruction set can do for us. The bulk of the time with
both LOOP
and DEC/JNZ
is taken up by
branching, which just happens to be one of the slowest aspects of every
processor in the x86 family, and the rest is taken up by decrementing
the count register and checking whether it’s zero. There may be ways to
perform those tasks a little faster by selecting different instructions,
but they can get only so fast, and branching can’t even get all that
fast.
The trick, then, is not to find the fastest way to decrement a count and branch conditionally, but rather to figure out how to accomplish the same result without decrementing or branching as often. Remember the Kobayashi Maru problem in Star Trek? The same principle applies here: Redefine the problem to one that offers better solutions.
Consider Listing 7.1, which searches a buffer until either the
specified byte is found, a zero byte is found, or the specified number
of characters have been checked. Such a function would be useful for
scanning up to a maximum number of characters in a zero-terminated
buffer. Listing 7.1, which uses LOOP
in the main loop,
performs a search of the sample string for a period (‘.’) in 170 µs on a
20 MHz cached 386.
When the LOOP
in Listing 7.1 is replaced with
DEC CX/JNZ
, performance improves to 168 µs, less than 2
percent faster than Listing 7.1. Actually, instruction fetching,
instruction alignment, cache characteristics, or something similar is
affecting these results; I’d expect a slightly larger improvement—around
7 percent—but that’s the most that counting cycles could buy us in this
case. (All right, already; LOOPNZ
could be used at the
bottom of the loop, and other optimizations are surely possible, but all
that won’t add up to anywhere near the benefits we’re about to see from
local optimization, and that’s the whole point.)
LISTING 7.1 L7-1.ASM
; Program to illustrate searching through a buffer of a specified
; length until either a specified byte or a zero byte is
; encountered.
; A standard loop terminated with LOOP is used.
.model small
.stack 100h
.data
; Sample string to search through.
SampleString label byte
db 'This is a sample string of a long enough length '
db 'so that raw searching speed can outweigh any '
db 'extra set-up time that may be required.',0
SAMPLE_STRING_LENGTH equ $-SampleString
; User prompt.
Prompt db 'Enter character to search for:$'
; Result status messages.
ByteFoundMsg db 0dh,0ah
db 'Specified byte found.',0dh,0ah,'$'
ZeroByteFoundMsg db 0dh,0ah
db 'Zero byte encountered.',0dh,0ah,'$'
NoByteFoundMsg db 0dh,0ah
db 'Buffer exhausted with no match.',0dh,0ah,'$'
.code
Start proc near
mov ax,@data ;point to standard data segment
mov ds,ax
mov dx,offset Prompt
mov ah,9 ;DOS print string function
int 21h ;prompt the user
mov ah,1 ;DOS get key function
int 21h ;get the key to search for
mov ah,al ;put character to search for in AH
mov cx,SAMPLE_STRING_LENGTH ;# of bytes to search
mov si,offset SampleString ;point to buffer to search
call SearchMaxLength ;search the buffer
mov dx,offset ByteFoundMsg ;assume we found the byte
jc PrintStatus ;we did find the byte
;we didn't find the byte, figure out
;whether we found a zero byte or
;ran out of buffer
mov dx,offset NoByteFoundMsg
;assume we didn't find a zero byte
jcxz PrintStatus ;we didn't find a zero byte
mov dx,offset ZeroByteFoundMsg ;we found a zero byte
PrintStatus:
mov ah,9 ;DOS print string function
int 21h ;report status
mov ah,4ch ;return to DOS
int 21h
Start endp
; Function to search a buffer of a specified length until either a
; specified byte or a zero byte is encountered.
; Input:
; AH = character to search for
; CX = maximum length to be searched (must be > 0)
; DS:SI = pointer to buffer to be searched
; Output:
; CX = 0 if and only if we ran out of bytes without finding
; either the desired byte or a zero byte
; DS:SI = pointer to searched-for byte if found, otherwise byte
; after zero byte if found, otherwise byte after last
; byte checked if neither searched-for byte nor zero
; byte is found
; Carry Flag = set if searched-for byte found, reset otherwise
SearchMaxLength proc near
cld
SearchMaxLengthLoop:
lodsb ;get the next byte
cmp al,ah ;is this the byte we want?
jz ByteFound ;yes, we're done with success
and al,al ;is this the terminating 0 byte?
jz ByteNotFound ;yes, we're done with failure
loop SearchMaxLengthLoop ;it's neither, so check the next
;byte, if any
ByteNotFound:
clc ;return "not found" status
ret
ByteFound:
dec si ;point back to the location at which
;we found the searched-for byte
stc ;return "found" status
ret
SearchMaxLength endp
end Start
Listing 7.2 takes a different tack, unrolling the loop so that four
bytes are checked for each LOOP
performed. The same
instructions are used inside the loop in each listing, but Listing 7.2
is arranged so that three-quarters of the LOOP
s are
eliminated. Listings 7.1 and 7.2 perform exactly the same task, and they
use the same instructions in the loop—the searching algorithm hasn’t
changed in any way—but we have sequenced the instructions differently in
Listing 7.2, and that makes all the difference.
LISTING 7.2 L7-2.ASM
; Program to illustrate searching through a buffer of a specified
; length until either a specified byte or a zero byte is encountered.
; A loop unrolled four times and terminated with LOOP is used.
.model small
.stack 100h
.data
; Sample string to search through.
SampleString label byte
db 'This is a sample string of a long enough length '
db 'so that raw searching speed can outweigh any '
db 'extra set-up time that may be required.',0
SAMPLE_STRING_LENGTH equ $-SampleString
; User prompt.
Prompt db 'Enter character to search for:$'
; Result status messages.
ByteFoundMsg db 0dh,0ah
db 'Specified byte found.',0dh,0ah,'$'
ZeroByteFoundMsg db 0dh,0ah
db 'Zero byte encountered.',0dh,0ah,'$'
NoByteFoundMsg db 0dh,0ah
db 'Buffer exhausted with no match.',0dh,0ah,'$'
; Table of initial, possibly partial loop entry points for
; SearchMaxLength.
SearchMaxLengthEntryTable label word
dw SearchMaxLengthEntry4
dw SearchMaxLengthEntry1
dw SearchMaxLengthEntry2
dw SearchMaxLengthEntry3
.code
Start proc near
mov ax,@data ;point to standard data segment
mov ds,ax
mov dx,offset Prompt
mov ah,9 ;DOS print string function
int 21h ;prompt the user
mov ah,1 ;DOS get key function
int 21h ;get the key to search for
mov ah,al ;put character to search for in AH
mov cx,SAMPLE_STRING_LENGTH ;# of bytes to search
mov si,offset SampleString ;point to buffer to search
call SearchMaxLength ;search the buffer
mov dx,offset ByteFoundMsg ;assume we found the byte
jc PrintStatus ;we did find the byte
;we didn't find the byte, figure out
;whether we found a zero byte or
;ran out of buffer
mov dx,offset NoByteFoundMsg
;assume we didn't find a zero byte
jcxz PrintStatus ;we didn't find a zero byte
mov dx,offset ZeroByteFoundMsg ;we found a zero byte
PrintStatus:
mov ah,9 ;DOS print string function
int 21h ;report status
mov ah,4ch ;return to DOS
int 21h
Start endp
; Function to search a buffer of a specified length until either a
; specified byte or a zero byte is encountered.
; Input:
; AH = character to search for
; CX = maximum length to be searched (must be > 0)
; DS:SI = pointer to buffer to be searched
; Output:
; CX = 0 if and only if we ran out of bytes without finding
; either the desired byte or a zero byte
; DS:SI = pointer to searched-for byte if found, otherwise byte
; after zero byte if found, otherwise byte after last
; byte checked if neither searched-for byte nor zero
; byte is found
; Carry Flag = set if searched-for byte found, reset otherwise
SearchMaxLength proc near
cld
mov bx,cx
add cx,3 ;calculate the maximum # of passes
shr cx,1 ;through the loop, which is
shr cx,1 ;unrolled 4 times
and bx,3 ;calculate the index into the entry
;point table for the first,
;possibly partial loop
shl bx,1 ;prepare for a word-sized look-up
jmp SearchMaxLengthEntryTable[bx]
;branch into the unrolled loop to do
;the first, possibly partial loop
SearchMaxLengthLoop:
SearchMaxLengthEntry4:
lodsb ;get the next byte
cmp al,ah ;is this the byte we want?
jz ByteFound ;yes, we're done with success
and al,al ;is this the terminating 0 byte?
jz ByteNotFound ;yes, we're done with failure
SearchMaxLengthEntry3:
lodsb ;get the next byte
cmp al,ah ;is this the byte we want?
jz ByteFound ;yes, we're done with success
and al,al ;is this the terminating 0 byte?
jz ByteNotFound ;yes, we're done with failure
SearchMaxLengthEntry2:
lodsb ;get the next byte
cmp al,ah ;is this the byte we want?
jz ByteFound ;yes, we're done with success
and al,al ;is this the terminating 0 byte?
jz ByteNotFound ;yes, we're done with failure
SearchMaxLengthEntry1:
lodsb ;get the next byte
cmp al,ah ;is this the byte we want?
jz ByteFound ;yes, we're done with success
and al,al ;is this the terminating 0 byte?
jz ByteNotFound ;yes, we're done with failure
loop SearchMaxLengthLoop ;it's neither, so check the next
; four bytes, if any
ByteNotFound:
clc ;return "not found" status
ret
ByteFound:
dec si ;point back to the location at which
; we found the searched-for byte
stc ;return "found" status
ret
SearchMaxLength endp
end Start
How much difference? Listing 7.2 runs in 121 µs—40 percent faster
than Listing 7.1, even though Listing 7.2 still uses LOOP
rather than DEC CX/JNZ
. (The loop in Listing 7.2 could be
unrolled further, too; it’s just a question of how much more memory you
want to trade for ever-decreasing performance benefits.) That’s typical
of local optimization; it won’t often yield the order-of-magnitude
improvements that algorithmic improvements can produce, but it can get
you a critical 50 percent or 100 percent improvement when you’ve
exhausted all other avenues.
The point is simply this: You can gain far more by stepping back a bit and thinking of the fastest overall way for the CPU to perform a task than you can by saving a cycle here or there using different instructions. Try to think at the level of sequences of instructions rather than individual instructions, and learn to treat x86 instructions as building blocks with unique characteristics rather than as instructions dedicated to specific tasks.
As another example of local optimization, consider the matter of rotating or shifting a mask into position. First, let’s look at the simple task of setting bit N of AX to 1.
The obvious way to do this is to place N in CL, rotate the bit into position, and OR it with AX, as follows:
MOV BX,1
SHL BX,CL
OR AX,BX
This solution is obvious because it takes good advantage of the special ability of the x86 family to shift or rotate by the variable number of bits specified by CL. However, it takes an average of about 45 cycles on an 8088. It’s actually far faster to precalculate the results, pass the bit number in BX, and look the shifted bit up, as shown in Listing 7.3.
LISTING 7.3 L7-3.ASM
SHL BX,1 ;prepare for word sized look up
OR AX,ShiftTable[BX] ;look up the bit and OR it in
:
ShiftTable LABEL WORD
BIT_PATTERN=0001H
REPT 16
DW BIT_PATTERN
BIT_PATTERN=BIT_PATTERN SHL 1
ENDM
Even though it accesses memory, this approach takes only 20 cycles—more than twice as fast as the variable shift. Once again, we were able to improve performance considerably—not by knowing the fastest instructions, but by selecting the fastest sequence of instructions.
In the particular example above, we once again run into the difficulty of optimizing across the x86 family. The table lookup is faster on the 8088 and 286, but it’s slightly slower on the 386 and no faster on the 486. However, 386/486-specific code could use enhanced addressing to accomplish the whole job in just one instruction, along the lines of the code snippet in Listing 7.4.
LISTING 7.4 L7-4.ASM
OR EAX,ShiftTable[EBX*4] ;look up the bit and OR it in
:
ShiftTable LABEL DWORD
BIT_PATTERN=0001H
REPT 32
DD BIT_PATTERN
BIT_PATTERN=BIT_PATTERN SHL 1
ENDM
Besides illustrating the advantages of local optimization, this example also shows that it generally pays to precalculate results; this is often done at or before assembly time, but precalculated tables can also be built at run time. This is merely one aspect of a fundamental optimization rule: Move as much work as possible out of your critical code by whatever means necessary.
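As a sketch of the run-time flavor of precalculation (in C, and not one of the numbered listings), you might build the bit-mask table once at start-up and then do nothing but look-ups in the time-critical code:
/* Run-time precalculation sketch: build the bit-mask table once, then just look masks up. */
#include <stdio.h>
static unsigned int ShiftTable[16];
void InitShiftTable(void) {
   unsigned int Bit, BitPattern = 1;
   for (Bit = 0; Bit < 16; Bit++) {
      ShiftTable[Bit] = BitPattern; /* table entry N is 1 shifted left N times */
      BitPattern <<= 1;
   }
}
unsigned int SetBit(unsigned int Value, unsigned int BitNumber) {
   return Value | ShiftTable[BitNumber & 15]; /* critical code: one look-up, one OR */
}
int main(void) {
   InitShiftTable();
   printf("%04X\n", SetBit(0x0100, 3)); /* prints 0108 */
   return 0;
}
The set-up cost is paid once, outside the critical code, which is exactly where you want it.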
The NOT
instruction flips all the bits in the operand,
from 0 to 1 or from 1 to 0. That’s as simple as could be, but
NOT
nonetheless has a minor but interesting talent: It
doesn’t affect the flags. That can be irritating; I once spent a good
hour tracking down a bug caused by my unconscious assumption that
NOT
does set the flags. After all, every other arithmetic
and logical instruction sets the flags; why not NOT
?
Probably because NOT
isn’t considered to be an arithmetic
or logical instruction at all; rather, it’s a data manipulation
instruction, like MOV
and the various rotates. (These are
RCR
, RCL
, ROR
, and
ROL
, which affect only the Carry and Overflow flags.) NOT
is often used for tasks, such as flipping masks, where there’s no reason
to test the state of the result, and in that context it can be handy to
keep the flags unmodified for later testing.
Besides, if you want to NOT an operand and set the flags in the process, you can just XOR it with -1. Put another way, the only functional difference between NOT AX and XOR AX,0FFFFH is that XOR modifies the flags and NOT doesn’t.
The x86 instruction set offers many ways to accomplish almost any task. Understanding the subtle distinctions between the instructions—whether and which flags are set, for example—can be critical when you’re trying to optimize a code sequence and you’re running out of registers, or when you’re trying to minimize branching.
Another case in which there are two slightly different ways to
perform a task involves adding 1 to an operand. You can do this with
INC
, as in INC AX
, or you can do it with
ADD
, as in ADD AX,1
. What’s the difference?
The obvious difference is that INC
is usually a byte or two
shorter (the exception being ADD AL,1
, which at two bytes
is the same length as INC AL
), and is faster on some
processors. Less obvious, but no less important, is that
ADD
sets the Carry flag while INC
leaves the
Carry flag untouched.
Why is that important? Because it allows INC
to function
as a data pointer manipulation instruction for multi-word arithmetic.
You can use INC
to advance the pointers in code like that
shown in Listing 7.5 without having to do any work to preserve the Carry
status from one addition to the next.
LISTING 7.5 L7-5.ASM
CLC ;clear the Carry for the initial addition
LOOP_TOP:
MOV AX,[SI];get next source operand word
ADC [DI],AX;add with Carry to dest operand word
INC SI ;point to next source operand word
INC SI
INC DI ;point to next dest operand word
INC DI
LOOP LOOP_TOP
If ADD
were used, the Carry flag would have to be saved
between additions, with code along the lines shown in Listing 7.6.
LISTING 7.6 L7-6.ASM
CLC ;clear the carry for the initial addition
LOOP_TOP:
MOV AX,[SI] ;get next source operand word
ADC [DI],AX ;add with carry to dest operand word
LAHF ;set aside the carry flag
ADD SI,2 ;point to next source operand word
ADD DI,2 ;point to next dest operand word
SAHF ;restore the carry flag
LOOP LOOP_TOP
It’s not that the Listing 7.6 approach is necessarily better or
worse; that depends on the processor and the situation. The Listing 7.6
approach is different, and if you understand the differences,
you’ll be able to choose the best approach for whatever code you happen
to write. (DEC
has the same property of preserving the
Carry flag, by the way.)
There are a couple of interesting aspects to the last example. First,
note that LOOP
doesn’t affect any flags at all; this allows
the Carry flag to remain unchanged from one addition to the next. Not
altering the arithmetic flags is a common characteristic of program
control instructions (as opposed to arithmetic and logical instructions
like SUB
and AND
, which do alter the
flags).
The rule is not that the arithmetic flags change whenever the CPU performs a calculation; rather, the flags change whenever you execute an arithmetic, logical, or flag control (such as
CLC
to clear the Carry flag) instruction.
Not only do LOOP
and JCXZ
not alter the
flags, but REP MOVS
, which counts down CX to 0, doesn’t
affect the flags either.
The other interesting point about the last example is the use of
LAHF
and SAHF
, which transfer the low byte of
the FLAGS register to and from AH, respectively. These instructions were
created to help provide compatibility with the 8080’s (that’s
8080, not 8088) PUSH
PSW
and
POP PSW
instructions, but turn out to be compact (one byte)
instructions for saving and restoring the arithmetic flags. A word of
caution, however: SAHF
restores the Carry, Zero, Sign,
Auxiliary Carry, and Parity flags—but not the Overflow flag,
which resides in the high byte of the FLAGS register. Also, be aware
that LAHF
and SAHF
provide a fast way to
preserve the flags on an 8088 but are relatively slow instructions on
the 486 and Pentium.
There are times when it’s a clear liability that INC
doesn’t set the Carry flag. For instance
INC AX
ADC DX,0
does not increment the 32-bit value in DX:AX. To do that, you’d need the following:
ADD AX,1
ADC DX,0
As always, pay attention!
When I was a senior in high school, a pop song called “Seasons in the Sun,” sung by one Terry Jacks, soared up the pop charts and spent, as best I can recall, two straight weeks atop Casey Kasem’s American Top 40. “Seasons in the Sun” wasn’t a particularly good song, primarily because the lyrics were silly. I’ve never understood why the song was a hit, but, as so often happens with undistinguished but popular music by forgotten one- or two-shot groups (“Don’t Pull Your Love Out on Me Baby,” “Billy Don’t Be a Hero,” et al.), I heard it everywhere for a month or so, then gave it not another thought for 15 years.
Recently, though, I came across a review of a Rhino Records collection of obscure 1970s pop hits. Knowing that Jeff Duntemann is an aficionado of such esoterica (who do you know who owns an album by The Peppermint Trolley Company?), I sent the review to him. He was amused by it and, as we kicked the names of old songs around, “Seasons in the Sun” came up. I expressed my wonderment that a song that really wasn’t very good was such a big hit.
“Well,” said Jeff, “I think it suffered in the translation from the French.”
Ah-ha! Mystery solved. Apparently everyone but me knew that it was translated from French, and that novelty undoubtedly made the song a big hit. The translation was also surely responsible for the sappy lyrics; dollars to donuts that the original French lyrics were stronger.
Which brings us without missing a beat to this chapter’s theme, speeding up C with assembly language. When you seek to speed up a C program by converting selected parts of it (generally no more than a few functions) to assembly language, make sure you end up with high-performance assembly language code, not fine-tuned C code. Compilers like Microsoft C/C++ and Watcom C are by now pretty good at fine-tuning C code, and you’re not likely to do much better by taking the compiler’s assembly language output and tweaking it.
To make the process of translating C code to assembly language worth the trouble, you must ignore what the compiler does and design your assembly language code from a pure assembly language perspective. With a merely adequate translation, you risk laboring mightily for little or no reward.
Apropos of which, when was the last time you heard of Terry Jacks?
The key to optimizing C programs with assembly language is, as always, writing good assembly language code, but with an added twist. Rule 1 when converting C code to assembly is this: Don’t think like a compiler. That’s more easily said than done, especially when the C code you’re converting is readily available as a model and the assembly code that the compiler generates is available as well. Nevertheless, the principle of not thinking like a compiler is essential, and is, in one form or another, the basis for all that I’ll discuss below.
Before I discuss Rule 1 further, let me mention rule number 0: Only optimize where it matters. The bulk of execution time in any program is spent in a very small portion of the code, and most code beyond that small portion doesn’t have any perceptible impact on performance. Unless you’re supremely concerned with code size (an area in which assembly-only programs can excel), I’d suggest that you write most of your code in C and reserve assembly for the truly critical sections of your code; that’s the formula that I find gives the most bang for the buck.
This is not to say that complete programs shouldn’t be designed with optimized assembly language in mind. As you’ll see shortly, orienting your data structures towards assembly language can be a salubrious endeavor indeed, even if most of your code is in C. When it comes to actually optimizing code and/or converting it to assembly, though, do it only where it matters. Get a profiler—and use it!
Also make it a point to concentrate on refining your program design and algorithmic approach at the conceptual and/or C levels before doing any assembly language optimization.
Assembly language optimization is the final and far from the only step in the optimization chain, and as such should be performed last; converting to assembly too soon can lock in your code before the design is optimal. At the very least, conversion to assembly tends to make future changes and debugging more difficult, slowing you down and limiting your options.
In order to think differently from a compiler, you must understand both what compilers and C programmers tend to do and how that differs from what assembly language does well. In this pursuit, it can be useful to examine the code your compiler generates, either by viewing the code in a debugger or by having the compiler generate an assembly language output file. (The latter is done with /Fa or /Fc in Microsoft C/C++ and -S in Borland C++.)
C programmers tend to modularize their code with lots of function calls. That’s good for readable, reliable, reusable code, and it allows the compiler to optimize better because it can deal with fewer variables and statements in each optimization arena—but it’s not so good when viewed from the assembly language level. Calls and returns are slow, especially in the large code model, and the pushes required to put parameters on the stack are expensive as well.
What this means is that when you want to speed up a portion of a C program, you should identify the entire critical portion and move all of that critical portion into an assembly language function. You don’t want to move a part of the inner loop into assembly language and then call it from C every time through the loop; the function call and return overhead would be unacceptable. Carve out the critical code en masse and move it into assembly, and try to avoid calls and returns even in your assembly code. True, in assembly you can pass parameters in registers, but the calls and returns themselves are still slow; if the extra cycles they take don’t affect performance, then the code they’re in probably isn’t critical, and perhaps you’ve chosen to convert too much code to assembly, eh?
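Here's a sketch of the interface issue in C (the function names are hypothetical, and the "assembly" function is plain C here so the sketch stands on its own): the per-element helper pays call-and-return overhead on every element, while the bulk function pays it once and is the natural unit to recode in assembly.
/* Interface sketch: one call per element versus one call for the whole job. */
#include <stdio.h>
int IsMatch(int Value, int SearchedFor) { /* poor interface: called once per element */
   return Value == SearchedFor;
}
long CountMatches(const int *Buffer, long Length, int SearchedFor) {
   /* better interface: the entire critical loop lives behind one call;
      this whole function is what you would move into assembly */
   long i, Count = 0;
   for (i = 0; i < Length; i++)
      if (Buffer[i] == SearchedFor)
         Count++;
   return Count;
}
int main(void) {
   static const int Data[] = {1, 5, 3, 5, 5, 2};
   long i, Count = 0;
   for (i = 0; i < 6; i++)
      Count += IsMatch(Data[i], 5); /* call/return overhead six times over */
   printf("Per-element calls: %ld matches\n", Count);
   printf("One bulk call: %ld matches\n", CountMatches(Data, 6, 5));
   return 0;
}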
C compilers work within the stack frame model, whereby variables reside in a block of stack memory and are accessed via offsets from BP. Compilers may store a couple of variables in registers and may briefly keep other variables in registers when they’re used repeatedly, but the stack frame is the underlying architecture. It’s a nice architecture; it’s flexible, convenient, easy to program, and makes for fairly compact code. However, stack frames have a few drawbacks. They must be constructed and destroyed, which takes both time and code. They are so easy to use that they tend to bias the assembly language programmer in favor of accessing memory variables more often than might be necessary. Finally, you cannot use BP as a general-purpose register if you intend to access a stack frame, and having that seventh register available is sometimes useful indeed.
That doesn’t mean you shouldn’t use stack frames, which are useful and often necessary. Just don’t fall victim to their undeniable charms.
C compilers are not terrific at handling segments. Some compilers can efficiently handle a single far pointer used in a loop by leaving ES set for the duration of the loop. But two far pointers used in the same loop confuse every compiler I’ve seen, causing the full segment:offset address to be reloaded each time either pointer is used.
This particularly affects performance in 286 protected mode (under OS/2 1.X or the Rational DOS Extender, for example) because segment loads in protected mode take a minimum of 17 cycles, versus a mere 2 cycles in real mode.
In assembly language you have full control over segments. Use it, and, if necessary, reorganize your code to minimize segment loading.
You might think that the most obvious advantage assembly language has
over C is that it allows the use of all forms of instructions and all
registers in all ways, whereas C compilers tend to use a subset of
registers and instructions in a limited number of ways. Yes and no. It’s
true that C compilers typically don’t generate instructions such as
XLAT
, rotates, or the string instructions. On the other
hand, XLAT
and rotates are useful in a limited set of
circumstances, and string instructions are used in the C
library functions. In fact, C library code is likely to be carefully
optimized by experts, and may be much better than equivalent code you’d
produce yourself.
Am I saying that C compilers produce better code than you do? No, I’m saying that they can, unless you use assembly language properly. Writing code in assembly language rather than C guarantees nothing.
You can write good assembly, bad assembly, or assembly that is virtually indistinguishable from compiled code; you are more likely than not to write the latter if you think that optimization consists of tweaking compiled C code.
Sure, you can probably use the registers more efficiently and take advantage of an instruction or two that the compiler missed, but the code isn’t going to get a whole lot faster that way.
True optimization requires rethinking your code to take advantage of assembly language. A C loop that searches through an integer array for matches might compile
to something like Figure 8.1A. You might look at that and tweak it to the code shown in Figure 8.1B.
Congratulations! You’ve successfully eliminated all stack frame
access, you’ve used LOOP
(although DEC SI/JNZ
is actually faster on 386 and later machines, as I explained in the last
chapter), and you’ve used a string instruction. Unfortunately, the new
code isn’t going to run very much faster. Maybe 25 percent faster, maybe
a little more. Big deal. You’ve eliminated the trappings of the
compiler—the stack frame and the restricted register usage—but you’re
still thinking like the compiler. Try this:
repnz scasw
jz Match
It’s a simple example—but, I hope, a convincing one. Stretch your brain when you optimize.
The ultimate in assembly language optimization comes when you change
the rules; that is, when you reorganize the entire program to allow the
use of better assembly language code in the small section of code that
most affects overall performance. For example, consider that the data
searched in the last example is stored in an array of structures, with
each structure in the array containing other information as well. In
this situation, REP SCASW
couldn’t be used because the data
searched through wouldn’t be contiguous.
However, if the need for performance in searching the array is urgent
enough, there’s no reason why you can’t reorganize the data. This might
mean removing the array elements from the structures and storing them in
their own array so that REP SCASW
could be
used.
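As an illustration of the idea only (this is not Listing 8.6, and the arrays are hypothetical), the reorganization amounts to trading an array of structures for parallel arrays, so that the field being scanned is contiguous in memory:
/* Data-reorganization sketch: parallel arrays keep the IDs contiguous,
   which is what makes a REP SCASW-style scan possible in assembly. */
#include <stdio.h>
#define NUM_ENTRIES 6
struct DataElement { unsigned int ID; unsigned int Value; }; /* before: ID buried in a struct */
static unsigned int IDs[NUM_ENTRIES]    = {3, 7, 7, 1, 7, 4}; /* after: IDs in their own array */
static unsigned int Values[NUM_ENTRIES] = {10, 20, 30, 40, 50, 60};
unsigned long SumValuesForID(unsigned int SearchedForID) {
   unsigned long Sum = 0;
   int i;
   for (i = 0; i < NUM_ENTRIES; i++) /* scan the contiguous ID array */
      if (IDs[i] == SearchedForID)
         Sum += Values[i];
   return Sum;
}
int main(void) {
   printf("Sum of values with ID 7: %lu\n", SumValuesForID(7));
   return 0;
}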
Organizing a program’s data so that the performance of the critical sections can be optimized is a key part of design, and one that’s easily shortchanged unless, during the design stage, you thoroughly understand and work to bring together your data needs, the critical sections of your program, and potential assembly language optimizations.
More on this shortly.
To recap, here are some things to look for when striving to convert C code into optimized assembly language:

* Move the entire critical portion of the code into a single assembly language function, so that call and return overhead doesn't eat into your gains.
* Don't lean on the stack frame out of habit; keep the key variables in registers, and press BP into service as a general-purpose register when you can.
* Change segments as infrequently as possible, especially inside loops that use far pointers.
* Use the full register set and the full instruction set, judging instructions by what they actually do on your target processors rather than by what they're named.
* Be willing to change the rules: reorganize your data so that the critical code can use the fastest approach the instruction set offers.
That said, let me show some of these precepts in action.
Listing 8.1 is the sample C application I’m going to use to examine
optimization in action. Listing 8.1 isn’t really complete—it doesn’t
handle the “no-matches” case well, and it assumes that the sum of all
matches will fit into an int, but it will do just fine as an
optimization example.
LISTING 8.1 L8-1.C
/* Program to search an array spanning a linked list of variable-
sized blocks, for all entries with a specified ID number,
and return the average of the values of all such entries. Each of
the variable-sized blocks may contain any number of data entries,
stored as an array of structures within the block. */
#include <stdio.h>
#ifdef __TURBOC__
#include <alloc.h>
#else
#include <malloc.h>
#endif
void main(void);
void exit(int);
unsigned int FindIDAverage(unsigned int, struct BlockHeader *);
/* Structure that starts each variable-sized block */
struct BlockHeader {
struct BlockHeader *NextBlock; /* Pointer to next block, or NULL
if this is the last block in the
linked list */
unsigned int BlockCount; /* The number of DataElement entries
in this variable-sized block */
};
/* Structure that contains one element of the array we'll search */
struct DataElement {
unsigned int ID; /* ID # for array entry */
unsigned int Value; /* Value of array entry */
};
void main(void) {
int i,j;
unsigned int IDToFind;
struct BlockHeader *BaseArrayBlockPointer,*WorkingBlockPointer;
struct DataElement *WorkingDataPointer;
struct BlockHeader **LastBlockPointer;
("ID # for which to find average: ");
printf("%d",&IDToFind);
scanf/* Build an array across 5 blocks, for testing */
/* Anchor the linked list to BaseArrayBlockPointer */
= &BaseArrayBlockPointer;
LastBlockPointer /* Create 5 blocks of varying sizes */
for (i = 1; i < 6; i++) {
/* Try to get memory for the next block */
if ((WorkingBlockPointer =
(struct BlockHeader *) malloc(sizeof(struct BlockHeader) +
sizeof(struct DataElement) * i * 10)) == NULL) {
exit(1);
}
/* Set the # of data elements in this block */
WorkingBlockPointer->BlockCount = i * 10;
/* Link the new block into the chain */
*LastBlockPointer = WorkingBlockPointer;
/* Point to the first data field */
WorkingDataPointer =
(struct DataElement *) ((char *)WorkingBlockPointer +
sizeof(struct BlockHeader));
/* Fill the data fields with ID numbers and values */
for (j = 0; j < (i * 10); j++, WorkingDataPointer++) {
WorkingDataPointer->ID = j;
WorkingDataPointer->Value = i * 1000 + j;
}
/* Remember where to set link from this block to the next */
LastBlockPointer = &WorkingBlockPointer->NextBlock;
}
/* Set the last block's "next block" pointer to NULL to indicate
that there are no more blocks */
WorkingBlockPointer->NextBlock = NULL;
printf("Average of all elements with ID %d: %u\n",
IDToFind, FindIDAverage(IDToFind, BaseArrayBlockPointer));
exit(0);
}
/* Searches through the array of DataElement entries spanning the
linked list of variable-sized blocks, starting with the block
pointed to by BlockPointer, for all entries with IDs matching
SearchedForID, and returns the average value of those entries. If
no matches are found, zero is returned */
unsigned int FindIDAverage(unsigned int SearchedForID,
struct BlockHeader *BlockPointer)
{
struct DataElement *DataPointer;
unsigned int IDMatchSum;
unsigned int IDMatchCount;
unsigned int WorkingBlockCount;
IDMatchCount = IDMatchSum = 0;
/* Search through all the linked blocks until the last block
(marked with a NULL pointer to the next block) has been
searched */
do {
/* Point to the first DataElement entry within this block */
DataPointer =
(struct DataElement *) ((char *)BlockPointer +
sizeof(struct BlockHeader));
/* Search all the DataElement entries within this block
and accumulate data from all that match the desired ID */
for (WorkingBlockCount=0;
WorkingBlockCount<BlockPointer->BlockCount;
WorkingBlockCount++, DataPointer++) {
/* If the ID matches, add in the value and increment the
match counter */
if (DataPointer->ID == SearchedForID) {
IDMatchCount++;
IDMatchSum += DataPointer->Value;
}
}
/* Point to the next block, and continue as long as that pointer
isn't NULL */
} while ((BlockPointer = BlockPointer->NextBlock) != NULL);
/* Calculate the average of all matches */
if (IDMatchCount == 0)
return(0); /* Avoid division by 0 */
else
return(IDMatchSum / IDMatchCount);
}
The main body of Listing 8.1 constructs a linked list of memory
blocks of various sizes and stores an array of structures across those
blocks, as shown in Figure 8.2. The function FindIDAverage
in Listing 8.1 searches through that array for all matches to a
specified ID number and returns the average value of all such matches.
FindIDAverage
contains two nested loops, the outer one
repeating once for each linked block and the inner one repeating once
for each array element in each block. The inner loop—the critical one—is
compact, containing only four statements, and should lend itself rather
well to compiler optimization.
As it happens, Microsoft C/C++ does optimize the inner loop of
FindIDAverage
nicely. Listing 8.2 shows the code Microsoft
C/C++ generates for the inner loop, consisting of a mere seven assembly
language instructions inside the loop. The compiler is smart enough to
convert the loop index variable, which counts up but is used for nothing
but counting loops, into a count-down variable so that the
LOOP
instruction can be used.
LISTING 8.2 L8-2.COD
; Code generated by Microsoft C for inner loop of FindIDAverage.
;|*** for (WorkingBlockCount=0;
;|*** WorkingBlockCount<BlockPointer->BlockCount;
;|*** WorkingBlockCount++, DataPointer++) {
mov WORD PTR [bp-6],0 ;WorkingBlockCount
mov bx,WORD PTR [bp+6] ;BlockPointer
cmp WORD PTR [bx+2],0
je $FB264
mov cx,WORD PTR [bx+2]
add WORD PTR [bp-6],cx ;WorkingBlockCount
mov di,WORD PTR [bp-2] ;IDMatchSum
mov dx,WORD PTR [bp-4] ;IDMatchCount
$L20004:
;|*** if (DataPointer->ID == SearchedForID) {
mov ax,WORD PTR [si]
cmp WORD PTR [bp+4],ax ;SearchedForID
jne $I265
;|*** IDMatchCount++;
inc dx
;|*** IDMatchSum += DataPointer->Value;
add di,WORD PTR [si+2]
;|*** }
;|*** }
$I265:
add si,4
loop $L20004
mov WORD PTR [bp-2],di ;IDMatchSum
mov WORD PTR [bp-4],dx ;IDMatchCount
$FB264:
It’s hard to squeeze much more performance from this code by tweaking
it, as exemplified by Listing 8.3, a fine-tuned assembly version of
FindIDAverage
that was produced by looking at the assembly
output of MS C/C++ and tightening it. Listing 8.3 eliminates all stack
frame access in the inner loop, but that’s about all the tightening
there is to do. The result, as shown in Table 8.1, is that Listing 8.3
runs a modest 11 percent faster than Listing 8.1 on a 386. The results
could vary considerably, depending on the nature of the data set
searched through (average block size and frequency of matches). But,
then, understanding the typical and worst case conditions is part of
optimization, isn’t it?
Table 8.1 Execution times of FindIDAverage and its assembly language equivalents.

| | On 20 MHz 386 | On 10 MHz 286 |
|---|---|---|
| Listing 8.1 (MSC with maximum optimization) | 294 microseconds | 768 microseconds |
| Listing 8.3 (Assembly) | 265 | 644 |
| Listing 8.4 (Optimized assembly) | 212 | 486 |
| Listing 8.6 (Optimized assembly with reorganized data) | 100 | 207 |
LISTING 8.3 L8-3.ASM
; Typically optimized assembly language version of FindIDAverage.
SearchedForID equ 4 ;Passed parameter offsets in the
BlockPointer equ 6 ; stack frame (skip over pushed BP
; and the return address)
NextBlock equ 0 ;Field offsets in struct BlockHeader
BlockCount equ 2
BLOCK_HEADER_SIZE equ 4 ;Number of bytes in struct BlockHeader
ID equ 0 ;struct DataElement field offsets
Value equ 2
DATA_ELEMENT_SIZE equ 4 ;Number of bytes in struct DataElement
.model small
.code public _FindIDAverage
| On 20 MHz 386 | On 10 MHz 286
---|---|---
Listing 8.1 (MSC with maximum optimization) | 294 microseconds | 768 microseconds
Listing 8.3 (Assembly) | 265 | 644
Listing 8.4 (Optimized assembly) | 212 | 486
Listing 8.6 (Optimized assembly with reorganized data) | 100 | 207
_FindIDAverage proc near
push bp ;Save caller's stack frame
mov bp,sp ;Point to our stack frame
push di ;Preserve C register variables
push si
sub dx,dx ;IDMatchSum = 0
mov bx,dx ;IDMatchCount = 0
mov si,[bp+BlockPointer] ;Pointer to first block
mov ax,[bp+SearchedForID] ;ID we're looking for
; Search through all the linked blocks until the last block
; (marked with a NULL pointer to the next block) has been searched.
BlockLoop:
; Point to the first DataElement entry within this block.
lea di,[si+BLOCK_HEADER_SIZE]
; Search through all the DataElement entries within this block
; and accumulate data from all that match the desired ID.
mov cx,[si+BlockCount]
jcxz DoNextBlock ;No data in this block
IntraBlockLoop:
cmp [di+ID],ax ;Do we have an ID match?
jnz NoMatch ;No match
inc bx ;We have a match; IDMatchCount++;
add dx,[di+Value] ;IDMatchSum += DataPointer->Value;
NoMatch:
add di,DATA_ELEMENT_SIZE ;point to the next element
loop IntraBlockLoop
; Point to the next block and continue if that pointer isn't NULL.
DoNextBlock:
mov si,[si+NextBlock] ;Get pointer to the next block
and si,si ;Is it a NULL pointer?
jnz BlockLoop ;No, continue
; Calculate the average of all matches.
sub ax,ax ;Assume we found no matches
and bx,bx
jz Done ;We didn't find any matches, return 0
xchg ax,dx ;Prepare for division
div bx ;Return IDMatchSum / IDMatchCount
Done: pop si ;Restore C register variables
pop di
pop bp ;Restore caller's stack frame
ret
_FindIDAverage ENDP
end
Listing 8.4 tosses some sophisticated optimization techniques into
the mix. The loop is unrolled eight times, eliminating a good deal of
branching, and SCASW
is used instead of
CMP [DI],AX
. (Note, however, that SCASW
is in
fact slower than CMP [DI],AX
on the 386 and 486, and is
sometimes faster on the 286 and 8088 only because it’s shorter and
therefore may prefetch faster.) This advanced tweaking produces a 39
percent improvement over the original C code—substantial, but not a
tremendous return for the optimization effort invested.
LISTING 8.4 L8-4.ASM
; Heavily optimized assembly language version of FindIDAverage.
; Features an unrolled loop and more efficient pointer use.
SearchedForID equ 4 ;Passed parameter offsets in the
BlockPointer equ 6 ; stack frame (skip over pushed BP
                   ; and the return address)
NextBlock equ 0 ;Field offsets in struct BlockHeader
BlockCount equ 2
BLOCK_HEADER_SIZE equ 4 ;Number of bytes in struct BlockHeader
ID equ 0 ;struct DataElement field offsets
Value equ 2
DATA_ELEMENT_SIZE equ 4 ;Number of bytes in struct DataElement
.model small
.code
public _FindIDAverage
_FindIDAverage proc near
push bp ;Save caller's stack frame
mov bp,sp ;Point to our stack frame
push di ;Preserve C register variables
push si
mov di,ds ;Prepare for SCASW
mov es,di
cld
sub dx,dx ;IDMatchSum = 0
mov bx,dx ;IDMatchCount = 0
mov si,[bp+BlockPointer] ;Pointer to first block
mov ax,[bp+SearchedForID] ;ID we're looking for
; Search through all of the linked blocks until the last block
; (marked with a NULL pointer to the next block) has been searched.
BlockLoop:
; Point to the first DataElement entry within this block.
lea di,[si+BLOCK_HEADER_SIZE]
; Search through all the DataElement entries within this block
; and accumulate data from all that match the desired ID.
mov cx,[si+BlockCount] ;Number of elements in this block
jcxz DoNextBlock ;Skip this block if it's empty
mov bp,cx ;***stack frame no longer available***
add cx,7
shr cx,1 ;Number of repetitions of the unrolled
shr cx,1 ; loop = (BlockCount + 7) / 8
shr cx,1
and bp,7 ;Generate the entry point for the
shl bp,1 ; first, possibly partial pass through
jmp cs:[LoopEntryTable+bp] ; the unrolled loop and
; vector to that entry point
align 2
LoopEntryTable label word
dw LoopEntry8,LoopEntry1,LoopEntry2,LoopEntry3
dw LoopEntry4,LoopEntry5,LoopEntry6,LoopEntry7
M_IBL macro P1
local NoMatch
LoopEntry&P1&:
scasw ;Do we have an ID match?
jnz NoMatch ;No match
;We have a match
inc bx ;IDMatchCount++;
add dx,[di] ;IDMatchSum += DataPointer->Value;
NoMatch:
add di,DATA_ELEMENT_SIZE-2 ;point to the next element
; (SCASW advanced 2 bytes already)
endm
align 2
IntraBlockLoop:
M_IBL 8
M_IBL 7
M_IBL 6
M_IBL 5
M_IBL 4
M_IBL 3
M_IBL 2
M_IBL 1
loop IntraBlockLoop
; Point to the next block and continue if that pointer isn't NULL.
DoNextBlock:
mov si,[si+NextBlock] ;Get pointer to the next block
and si,si ;Is it a NULL pointer?
jnz BlockLoop ;No, continue
; Calculate the average of all matches.
sub ax,ax ;Assume we found no matches
and bx,bx
jz Done ;We didn't find any matches, return 0
xchg ax,dx ;Prepare for division
div bx ;Return IDMatchSum / IDMatchCount
Done: pop si ;Restore C register variables
pop di
pop bp ;Restore caller's stack frame
ret
_FindIDAverage ENDP
end
Listings 8.5 and 8.6 together go the final step and change the rules in favor of assembly language. Listing 8.5 creates the same list of linked blocks as Listing 8.1. However, instead of storing an array of structures within each block, it stores two arrays in each block, one consisting of ID numbers and the other consisting of the corresponding values, as shown in Figure 8.3. No information is lost; the data is merely rearranged.
LISTING 8.5 L8-5.C
/* Program to search an array spanning a linked list of variable-
sized blocks, for all entries with a specified ID number,
and return the average of the values of all such entries. Each of
the variable-sized blocks may contain any number of data entries,
stored in the form of two separate arrays, one for ID numbers and
one for values. */
#include <stdio.h>
#ifdef __TURBOC__
#include <alloc.h>
#else
#include <malloc.h>
#endif
void main(void);
void exit(int);
extern unsigned int FindIDAverage2(unsigned int,
struct BlockHeader *);
/* Structure that starts each variable-sized block */
struct BlockHeader {
struct BlockHeader *NextBlock; /* Pointer to next block, or NULL
if this is the last block in the
linked list */
unsigned int BlockCount; /* The number of DataElement entries
in this variable-sized block */
};
void main(void) {
int i,j;
unsigned int IDToFind;
struct BlockHeader *BaseArrayBlockPointer,*WorkingBlockPointer;
int *WorkingDataPointer;
struct BlockHeader **LastBlockPointer;
("ID # for which to find average: ");
printf("%d",&IDToFind);
scanf
/* Build an array across 5 blocks, for testing */
/* Anchor the linked list to BaseArrayBlockPointer */
LastBlockPointer = &BaseArrayBlockPointer;
/* Create 5 blocks of varying sizes */
for (i = 1; i < 6; i++) {
/* Try to get memory for the next block */
if ((WorkingBlockPointer =
(struct BlockHeader *) malloc(sizeof(struct BlockHeader) +
sizeof(int) * 2 * i * 10)) == NULL) {
exit(1);
}
/* Set the number of data elements in this block */
WorkingBlockPointer->BlockCount = i * 10;
/* Link the new block into the chain */
*LastBlockPointer = WorkingBlockPointer;
/* Point to the first data field */
WorkingDataPointer = (int *) ((char *)WorkingBlockPointer +
      sizeof(struct BlockHeader));
/* Fill the data fields with ID numbers and values */
for (j = 0; j < (i * 10); j++, WorkingDataPointer++) {
*WorkingDataPointer = j;
*(WorkingDataPointer + i * 10) = i * 1000 + j;
}
/* Remember where to set link from this block to the next */
LastBlockPointer = &WorkingBlockPointer->NextBlock;
}
/* Set the last block's "next block" pointer to NULL to indicate
that there are no more blocks */
WorkingBlockPointer->NextBlock = NULL;
printf("Average of all elements with ID %d: %u\n",
      IDToFind, FindIDAverage2(IDToFind, BaseArrayBlockPointer));
exit(0);
}
LISTING 8.6 L8-6.ASM
; Alternative optimized assembly language version of FindIDAverage
; requires data organized as two arrays within each block rather
; than as an array of two-value element structures. This allows the
; use of REP SCASW for ID searching.
SearchedForID equ 4 ;Passed parameter offsets in the
BlockPointer equ 6 ; stack frame (skip over pushed BP
                   ; and the return address)
NextBlock equ 0 ;Field offsets in struct BlockHeader
BlockCount equ 2
BLOCK_HEADER_SIZE equ 4 ;Number of bytes in struct BlockHeader
.model small
.code
public _FindIDAverage2
_FindIDAverage2 proc near
push bp ;Save caller's stack frame
mov bp,sp ;Point to our stack frame
push di ;Preserve C register variables
push si
mov di,ds ;Prepare for SCASW
mov es,di
cld
mov si,[bp+BlockPointer] ;Pointer to first block
mov ax,[bp+SearchedForID] ;ID we're looking for
sub dx,dx ;IDMatchSum = 0
mov bp,dx ;IDMatchCount = 0
;***stack frame no longer available***
; Search through all the linked blocks until the last block
; (marked with a NULL pointer to the next block) has been searched.
BlockLoop:
; Search through all the DataElement entries within this block
; and accumulate data from all that match the desired ID.
mov cx,[si+BlockCount]
jcxz DoNextBlock;Skip this block if there's no data
; to search through
mov bx,cx ;We'll use BX to point to the
shl bx,1 ; corresponding value entry in the
; case of an ID match (BX is the
; length in bytes of the ID array)
; Point to the first DataElement entry within this block.
lea di,[si+BLOCK_HEADER_SIZE]
IntraBlockLoop:
repnz scasw ;Search for the ID
jnz DoNextBlock ;No match, the block is done
inc bp ;We have a match; IDMatchCount++;
add dx,[di+bx-2];IDMatchSum += DataPointer->Value;
; (SCASW has advanced DI 2 bytes)
and cx,cx ;Is there more data to search through?
jnz IntraBlockLoop ;yes
; Point to the next block and continue if that pointer isn't NULL.
DoNextBlock:
mov si,[si+NextBlock] ;Get pointer to the next block
and si,si ;Is it a NULL pointer?
jnz BlockLoop ;No, continue
; Calculate the average of all matches.
sub ax,ax ;Assume we found no matches
and bp,bp
jz Done ;We didn't find any matches, return 0
xchg ax,dx ;Prepare for division
div bp ;Return IDMatchSum / IDMatchCount
Done: pop si ;Restore C register variables
pop di
pop bp ;Restore caller's stack frame
ret
_FindIDAverage2 ENDP
end
The whole point of this rearrangement is to allow us to use
REP SCASW
to search through each block, and that’s exactly
what FindIDAverage2
in Listing 8.6 does. The result:
Listing 8.6 calculates the average about three times as fast as
the original C implementation and more than twice as fast as Listing
8.4, heavily optimized as the latter code is.
I trust you get the picture. The sort of instruction-by-instruction optimization that so many of us love to do as a kind of puzzle is fun, but compilers can do it nearly as well as you can, and in the future will surely do it better. What a compiler can’t do is tie together the needs of the program specification on the high end and the processor on the low end, resulting in critical code that runs just about as fast as the hardware permits. The only software that can do that is located north of your sternum and slightly aft of your nose. Dust it off and put it to work—and your code will never again be confused with anything by Hamilton, Joe Frank and Reynolds or Bo Donaldson and the Heywoods.
Back in high school, I took a pre-calculus class from Mr. Bourgeis, whose most notable characteristics were incessant pacing and truly enormous feet. My friend Barry, who sat in the back row, right behind me, claimed that it was because of his large feet that Mr. Bourgeis was so restless. Those feet were so heavy, Barry hypothesized, that if Mr. Bourgeis remained in any one place for too long, the floor would give way under the strain, plunging the unfortunate teacher deep into the mantle of the Earth and possibly all the way through to China. Many amusing cartoons were drawn to this effect.
Unfortunately, Barry was too busy drawing cartoons, or, alternatively, sleeping, to actually learn any math. In the long run, that didn’t turn out to be a handicap for Barry, who went on to become vice-president of sales for a ham-packing company, where presumably he was rarely called upon to derive the quadratic equation. Barry’s lack of scholarship caused some problems back then, though. On one memorable occasion, Barry was half-asleep, with his eyes open but unfocused and his chin balanced on his hand in the classic “if I fall asleep my head will fall off my hand and I’ll wake up” posture, when Mr. Bourgeis popped a killer problem:
“Barry, solve this for X, please.” On the blackboard lay the equation:
X - 1 = 0
“Minus 1,” Barry said promptly.
Mr. Bourgeis shook his head mournfully. “Try again.” Barry thought hard. He knew the fundamental rule that the answer to most mathematical questions is either 0, 1, infinity, -1, or minus infinity (do not apply this rule to balancing your checkbook, however); unfortunately, that gave him only a 25 percent chance of guessing right.
“One,” I whispered surreptitiously.
“Zero,” Barry announced. Mr. Bourgeis shook his head even more sadly.
“One,” I whispered louder. Barry looked still more thoughtful—a bad sign—so I whispered “one” again, even louder. Barry looked so thoughtful that his eyes nearly rolled up into his head, and I realized that he was just doing his best to convince Mr. Bourgeis that Barry had solved this one by himself.
As Barry neared the climax of his stirring performance and opened his mouth to speak, Mr. Bourgeis looked at him with great concern. “Barry, can you hear me all right?”
“Yes, sir,” Barry replied. “Why?”
“Well, I could hear the answer all the way up here. Surely you could hear it just one row away?”
The class went wild. They might as well have sent us home early for all we accomplished the rest of the day.
I like to think I know more about performance programming than Barry knew about math. Nonetheless, I always welcome good ideas and comments, and many readers have sent me a slew of those over the years. So I think I’ll return the favor by devoting this chapter to reader feedback.
Several people have pointed out that while LEA
is great
for performing certain additions (see Chapter 6), it isn’t a perfect
replacement for ADD
. What’s the difference?
LEA
, an addressing instruction by trade, doesn’t affect the
flags, while the arithmetic ADD
instruction most certainly
does. This is no problem when performing additions that involve only
quantities that fit in one machine word (32 bits in 386 protected mode,
16 bits otherwise), but it renders LEA
useless for
multiword operations, which use the Carry flag to tie together partial
results. For example, these instructions
ADD EAX,EBX
ADC EDX,ECX
could not be replaced with
LEA EAX,[EAX+EBX]
ADC EDX,ECX
because LEA
doesn’t affect the Carry flag.
The no-carry characteristic of LEA
becomes a distinct
advantage when performing pointer arithmetic, however. For instance, the
following code uses LEA
to advance the pointers while
adding one 128-bit memory variable to another such variable:
MOV ECX,4 ;# of 32-bit words to add
CLC
;no carry into the initial ADC
ADDLOOP:
MOV EAX,[ESI] ;get the next element of one array
ADC [EDI],EAX ;add it to the other array, with carry
LEA ESI,[ESI+4] ;advance one array's pointer
LEA EDI,[EDI+4] ;advance the other array's pointer
LOOP ADDLOOP
(Yes, I could use LODSD
instead of MOV/LEA
;
I’m just illustrating a point here. Besides, LODS
is only 1
cycle faster than MOV/LEA
on the 386, and is actually more
than twice as slow on the 486.) If we used ADD
rather than
LEA
to advance the pointers, the carry from one
ADC
to the next would have to be preserved with either
PUSHF/POPF
or LAHF/SAHF
. (Alternatively, we
could use multiple INC
s, since INC
doesn’t
affect the Carry flag.)
In short, LEA
is indeed different from ADD
.
Sometimes it’s better. Sometimes not; that’s the nature of the various
instruction substitutions and optimizations that will occur to you over
time. There’s no such thing as “best” instructions on the x86; it all
depends on what you’re trying to do.
But there sure are a lot of interesting options, aren’t there?
Reader John Kennedy regularly passes along intriguing assembly programming tricks, many of which I’ve never seen mentioned anywhere else. John likes to optimize for size, whereas I lean more toward speed, but many of his optimizations are good for both purposes. Here are a few of my favorites:
John’s code for setting AX to its absolute value is:
CWD
XOR AX,DX
SUB AX,DX
This does nothing when bit 15 of AX is 0 (that is, if AX is positive). When AX is negative, the code “nots” it and adds 1, which is exactly how you perform a two’s complement negate. For the case where AX is not negative, this trick usually beats the stuffing out of the standard absolute value code:
AND AX,AX ;negative?
JNS IsPositive ;no
NEG AX ;yes,negate it
IsPositive:
However, John’s code is slower on a 486; as you’re no doubt coming to realize (and as I’ll explain in Chapters 12 and 13), the 486 is an optimization world unto itself.
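For those who prefer to see the idea in C, here’s a minimal sketch of the same branchless absolute value, assuming 16-bit ints and an arithmetic (sign-propagating) right shift, which is what the compilers discussed here produce but which the C standard doesn’t guarantee; the function name is mine, not John’s.

int abs16(int x)
{
   int mask = x >> 15;       /* 0 if x >= 0, -1 (all 1-bits) if x < 0;  */
                             /* plays the role CWD plays in filling DX  */
   return (x ^ mask) - mask; /* XOR AX,DX followed by SUB AX,DX         */
}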
Here’s how John copies a block of bytes from DS:SI to ES:DI, moving as much data as possible a word at a time:
SHR CX,1 ;word count
REP MOVSW ;copy as many words as possible
ADC CX,CX ;CX=1 if copy length was odd,
;0 else
REP MOVSB ;copy any odd byte
(ADC CX,CX
can be replaced with RCL CX,1
;
which is faster depends on the processor type.) It might be hard to
believe that the above is faster than this:
SHR CX,1 ;word count
REP MOVSW ;copy as many words as
;possible
JNC CopyDone ;done if even copy length
MOVSB ;copy the odd byte
CopyDone:
However, it generally is. Sure, if the length is odd, John’s approach
incurs a penalty approximately equal to the REP
startup
time for MOVSB
. However, if the length is even, John’s
approach doesn’t branch, saving cycles and not emptying the prefetch
queue. If copy lengths are evenly distributed between even and odd,
John’s approach is faster in most x86 systems. (Not on the 486,
though.)
John also points out that on the 386, multiple LEA
s can
be combined to perform multiplications that can’t be handled by a single
LEA
, much as multiple shifts and adds can be used for
multiplication, only faster. LEA
can be used to multiply in
a single instruction on the 386, but only by the values 2, 3, 4, 5, 8,
and 9; several LEA
s strung together can handle a much wider
range of values. For example, video programmers are undoubtedly familiar
with the following code to multiply AX times 80 (the width in bytes of
the bitmap in most PC display modes):
SHL AX,1 ;*2
SHL AX,1 ;*4
SHL AX,1 ;*8
SHL AX,1 ;*16
MOV BX,AX
SHL AX,1 ;*32
SHL AX,1 ;*64
ADD AX,BX ;*80
Using LEA
on the 386, the above could be reduced to
LEA EAX,[EAX*2] ;*2
LEA EAX,[EAX*8] ;*16
LEA EAX,[EAX+EAX*4] ;*80
which still isn’t as fast as using a lookup table like
MOV EAX,MultiplesOf80Table[EAX*4]
but is close and takes a great deal less space.
Of course, on the 386, the shift and add version could also be reduced to this considerably more efficient code:
SHL AX,4 ;*16
MOV BX,AX
SHL AX,2 ;*64
ADD AX,BX ;*80
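In case the table-lookup alternative mentioned above isn’t familiar, here’s a rough C sketch of the idea; the table size and mode assumptions (200 scan lines, 80 bytes per line) are mine, chosen purely for illustration.

#define SCAN_LINES     200
#define BYTES_PER_LINE  80

unsigned int MultiplesOf80Table[SCAN_LINES];

void InitMultiplesOf80Table(void)
{
   int y;
   /* Precompute y*80 once, so each later use is a single lookup */
   for (y = 0; y < SCAN_LINES; y++)
      MultiplesOf80Table[y] = y * BYTES_PER_LINE;
}

/* Byte offset of pixel (x,y) in the bitmap: one lookup plus an add */
unsigned int PixelOffset(unsigned int x, unsigned int y)
{
   return MultiplesOf80Table[y] + x;
}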
That brings us to multiplication, one of the slowest of x86
operations and one that allows for considerable optimization. One way to
speed up multiplication is to use shift and add, LEA
, or a
lookup table to hard-code a multiplication operation for a fixed
multiplier, as shown above. Another is to take advantage of the
early-out feature of the 386 (and the 486, but in the interests of
brevity I’ll just say “386” from now on) by arranging your operands so
that the multiplier (always the rightmost operand following
MUL
or IMUL
) is no larger than the other
operand.
Why? Because the 386 processes one multiplier bit per cycle and immediately ends a multiplication when all significant bits of the multiplier have been processed, so fewer cycles are required to multiply a large multiplicand times a small multiplier than a small multiplicand times a large multiplier, by a factor of about 1 cycle for each significant multiplier bit eliminated.
(There’s a minimum execution time on this trick; below 3 significant multiplier bits, no additional cycles are saved.) For example, multiplication of 32,767 times 1 is 12 cycles faster than multiplication of 1 times 32,767.
Choosing the right operand as the multiplier can work wonders. According to published specs, the 386 takes 38 cycles to multiply by a multiplier with 32 significant bits but only 9 cycles to multiply by a multiplier of 2, a performance improvement of more than four times! (My tests regularly indicate that multiplication takes 3 to 4 cycles longer than the specs indicate, but the cycle-per-bit advantage of smaller multipliers holds true nonetheless.)
This highlights another interesting point: MUL
and
IMUL
on the 386 are so fast that alternative multiplication
approaches, while generally still faster, are worthwhile only in truly
time-critical code.
On 386SXs and uncached 386s, where code size can significantly affect performance due to instruction prefetching, the compact
MUL
and IMUL
instructions can approach and in some cases even outperform the “optimized” alternatives.
All in all, MUL
and IMUL
are reasonable
performers on the 386, no longer to be avoided in most cases—and you can
help that along by arranging your code to make the smaller operand the
multiplier whenever you know which operand is smaller.
That doesn’t mean that your code should test and swap operands to make sure the smaller one is the multiplier; that rarely pays off. I’m speaking more of the case where you’re scaling an array up by a value that’s always in the range of, say, 2 to 10; because the scale value will always be small and the array elements may have any value, the scale value is the logical choice for the multiplier.
Rob Williams writes with a wonderful optimization to the
REPNZ SCASB
-based optimized searching routine I discussed
in Chapter 5. As a quick refresher, I described searching a buffer for a
text string as follows: Scan for the first byte of the text string with
REPNZ SCASB
, then use REPZ CMPS
to check for a
full match whenever REPNZ SCASB
finds a match for the first
character, as shown in Figure 9.1. The principle is that most buffer
characters won’t match the first character of any given string, so
REPNZ SCASB
, by far the fastest way to search on the PC,
can be used to eliminate most potential matches; each remaining
potential match can then be checked in its entirety with
REPZ CMPS
.
Rob’s revelation, which he credits without explanation to Edgar Allan
Poe (search nevermore?), was that by far the slowest part of the whole
deal is handling REPNZ SCASB
matches, which require
checking the remainder of the string with REPZ CMPS
and
restarting REPNZ SCASB
if no match is found.
Rob points out that the number of
REPNZ SCASB
matches can easily be reduced simply by scanning for the character in the searched-for string that appears least often in the buffer being searched.
Imagine, if you will, that you’re searching for the string “EQUAL.”
By my approach, you’d use REPNZ SCASB
to scan for each
occurrence of “E,” which crops up quite often in normal text. Rob points
out that it would make more sense to scan for “Q,” then back up one
character and check the whole string when a “Q” is found, as shown in
Figure 9.2. “Q” is likely to occur much less often, resulting in many
fewer whole-string checks and much faster processing.
Listing 9.1 implements the scan-on-first-character approach. Listing
9.2 scans for whatever character the caller specifies. Listing 9.3 is a
test program used to compare the two approaches. How much difference
does Rob’s revelation make? Plenty. Even when the entire C function call
to FindString
is timed—strlen
calls, parameter
pushing, calling, setup, and all—the version of FindString
in Listing 9.2, which is directed by Listing 9.3 to scan for the
infrequently-occurring “Q,” is about 40 percent faster on a 20 MHz
cached 386 for the test search of Listing 9.3 than is the version of
FindString
in Listing 9.1, which always scans for the first
character, in this case “E.” However, when only the search loops (the
code that actually does the searching) in the two versions of
FindString
are compared, Listing 9.2 is more than
twice as fast as Listing 9.1—a remarkable improvement over code
that already uses REPNZ SCASB
and
REPZ CMPS
.
What I like so much about Rob’s approach is that it demonstrates that
optimization involves much more than instruction selection and cycle
counting. Listings 9.1 and 9.2 use pretty much the same instructions,
and even use the same approach of scanning with REPNZ SCASB
and using REPZ CMPS
to check scanning matches.
The difference between Listings 9.1 and 9.2 (which gives you more than a doubling of performance) is due entirely to understanding the nature of the data being handled, and biasing the code to reflect that knowledge.
LISTING 9.1 L9-1.ASM
; Searches a text buffer for a text string. Uses REPNZ SCASB to scan
; the buffer for locations that match the first character of the
; searched-for string, then uses REPZ CMPS to check fully only those
; locations that REPNZ SCASB has identified as potential matches.
;
; Adapted from Zen of Assembly Language, by Michael Abrash
;
; C small model-callable as:
; unsigned char * FindString(unsigned char * Buffer,
; unsigned int BufferLength, unsigned char * SearchString,
; unsigned int SearchStringLength);
;
; Returns a pointer to the first match for SearchString in Buffer,or
; a NULL pointer if no match is found. Buffer should not start at
; offset 0 in the data segment to avoid confusing a match at 0 with
; no match found.
Parms struc
                   dw 2 dup(?) ;pushed BP/return address
Buffer             dw ? ;pointer to buffer to search
BufferLength       dw ? ;length of buffer to search
SearchString       dw ? ;pointer to string for which to search
SearchStringLength dw ? ;length of string for which to search
Parms ends
.model small
.code
public _FindString
_FindString proc near
push bp ;preserve caller's stack frame
mov bp,sp ;point to our stack frame
push si ;preserve caller's register variables
push di
cld ;make string instructions increment pointers
mov si,[bp+SearchString] ;pointer to string to search for
mov bx,[bp+SearchStringLength] ;length of string
and bx,bx
jz FindStringNotFound ;no match if string is 0 length
mov dx,[bp+BufferLength] ;length of buffer
sub dx,bx ;difference between buffer and string lengths
jc FindStringNotFound ;no match if search string is
; longer than buffer
inc dx ;difference between buffer and search string
; lengths, plus 1 (# of possible string start
; locations to check in the buffer)
mov di,ds
mov es,di
mov di,[bp+Buffer] ;point ES:DI to buffer to search thru
lodsb ;put the first byte of the search string in AL
mov bp,si ;set aside pointer to the second search byte
dec bx ;don't need to compare the first byte of the
; string with CMPS; we'll do it with SCAS
FindStringLoop:
mov cx,dx ;put remaining buffer search length in CX
repnz scasb ;scan for the first byte of the string
jnz FindStringNotFound ;not found, so there's no match
;found, so we have a potential match-check the
; rest of this candidate location
push di ;remember the address of the next byte to scan
mov dx,cx ;set aside the remaining length to search in
; the buffer
mov si,bp ;point to the rest of the search string
mov cx,bx ;string length (minus first byte)
shr cx,1 ;convert to word for faster search
jnc FindStringWord ;do word search if no odd byte
cmpsb ;compare the odd byte
jnz FindStringNoMatch ;odd byte doesn't match, so we
; haven't found the search string here
FindStringWord:
jcxz FindStringFound ;test whether we've already checked
; the whole string; if so, this is a match
repz cmpsw ;check the rest of the string a word at a time
jz FindStringFound ;it's a match
FindStringNoMatch:
pop di ;get back pointer to the next byte to scan
and dx,dx ;is there anything left to check?
jnz FindStringLoop ;yes-check next byte
FindStringNotFound:
sub ax,ax ;return a NULL pointer indicating that the
jmp FindStringDone ; string was not found
FindStringFound:
pop ax ;point to the buffer location at which the
dec ax ; string was found (earlier we pushed the
; address of the byte after the start of the
; potential match)
FindStringDone:
pop di ;restore caller's register variables
pop si
pop bp ;restore caller's stack frame
ret
_FindString endp
end
LISTING 9.2 L9-2.ASM
; Searches a text buffer for a text string. Uses REPNZ SCASB to scan
; the buffer for locations that match a specified character of the
; searched-for string, then uses REPZ CMPS to check fully only those
; locations that REPNZ SCASB has identified as potential matches.
;
; C small model-callable as:
; unsigned char * FindString(unsigned char * Buffer,
; unsigned int BufferLength, unsigned char * SearchString,
; unsigned int SearchStringLength,
; unsigned int ScanCharOffset);
;
; Returns a pointer to the first match for SearchString in Buffer,or
; a NULL pointer if no match is found. Buffer should not start at
; offset 0 in the data segment to avoid confusing a match at 0 with
; no match found.
Parms struc
                   dw 2 dup(?) ;pushed BP/return address
Buffer             dw ? ;pointer to buffer to search
BufferLength       dw ? ;length of buffer to search
SearchString       dw ? ;pointer to string for which to search
SearchStringLength dw ? ;length of string for which to search
ScanCharOffset     dw ? ;offset in string of character for
                        ; which to scan
Parms ends
.model small
.code
public _FindString
_FindString proc near
push bp ;preserve caller's stack frame
mov bp,sp ;point to our stack frame
push si ;preserve caller's register variables
push di
cld ;make string instructions increment pointers
mov si,[bp+SearchString] ;pointer to string to search for
mov cx,[bp+SearchStringLength] ;length of string
jcxz FindStringNotFound ;no match if string is 0 length
mov dx,[bp+BufferLength] ;length of buffer
sub dx,cx ;difference between buffer and search
; lengths
jc FindStringNotFound ;no match if search string is
; longer than buffer
inc dx ; difference between buffer and search string
; lengths, plus 1 (# of possible string start
; locations to check in the buffer)
mov di,ds
mov es,di
mov di,[bp+Buffer] ;point ES:DI to buffer to search thru
mov bx,[bp+ScanCharOffset] ;offset in string of character
; on which to scan
add di,bx ;point ES:DI to first buffer byte to scan
mov al,[si+bx] ;put the scan character in AL
inc bx ;set BX to the offset back to the start of the
; potential full match after a scan match,
; accounting for the 1-byte overrun of
; REPNZ SCASB
FindStringLoop:
mov cx,dx ;put remaining buffer search length in CX
repnz scasb ;scan for the scan byte
jnz FindStringNotFound ;not found, so there's no match
;found, so we have a potential match-check the
; rest of this candidate location
push di ;remember the address of the next byte to scan
mov dx,cx ;set aside the remaining length to search in
; the buffer
sub di,bx ;point back to the potential start of the
; match in the buffer
mov si,[bp+SearchString] ;point to the start of the string
mov cx,[bp+SearchStringLength] ;string length
shr cx,1 ;convert to word for faster search
jnc FindStringWord ;do word search if no odd byte
cmpsb ;compare the odd byte
jnz FindStringNoMatch ;odd byte doesn't match, so we
; haven't found the search string here
FindStringWord:
jcxz FindStringFound ;if the string is only 1 byte long,
; we've found a match
repz cmpsw ;check the rest of the string a word at a time
jz FindStringFound ;it's a match
FindStringNoMatch:
pop di ;get back pointer to the next byte to scan
and dx,dx ;is there anything left to check?
jnz FindStringLoop ;yes-check next byte
FindStringNotFound:
sub ax,ax ;return a NULL pointer indicating that the
jmp FindStringDone ; string was not found
FindStringFound:
pop ax ;point to the buffer location at which the
sub ax,bx ; string was found (earlier we pushed the
; address of the byte after the scan match)
FindStringDone:
pop di ;restore caller's register variables
pop si
pop bp ;restore caller's stack frame
ret
_FindString endp
end
LISTING 9.3 L9-3.C
/* Program to exercise buffer-search routines in Listings 9.1 & 9.2 */
#include <stdio.h>
#include <string.h>
#define DISPLAY_LENGTH 40
extern unsigned char * FindString(unsigned char *, unsigned int,
unsigned char *, unsigned int, unsigned int);
void main(void);
static unsigned char TestBuffer[] = "When, in the course of human \
events, it becomes necessary for one people to dissolve the \
political bands which have connected them with another, and to \
assume among the powers of the earth the separate and equal station \
to which the laws of nature and of nature's God entitle them...";
void main() {
static unsigned char TestString[] = "equal";
unsigned char TempBuffer[DISPLAY_LENGTH+1];
unsigned char *MatchPtr;
/* Search for TestString and report the results */
if ((MatchPtr = FindString(TestBuffer,
(unsigned int) strlen(TestBuffer), TestString,
(unsigned int) strlen(TestString), 1)) == NULL) {
/* TestString wasn't found */
("\"%s\" not found\n", TestString);
printf} else {
/* TestString was found. Zero-terminate TempBuffer; strncpy
won't do it if DISPLAY_LENGTH characters are copied */
[DISPLAY_LENGTH] = 0;
TempBuffer("\"%s\" found. Next %d characters at match:\n\"%s\"\n",
printf, DISPLAY_LENGTH,
TestString(TempBuffer, MatchPtr, DISPLAY_LENGTH));
strncpy}
}
You’ll notice that in Listing 9.2 I didn’t use a table of character frequencies in English text to determine the character for which to scan, but rather let the caller make that choice. Each buffer of bytes has unique characteristics, and English-letter frequency could well be inappropriate. What if the buffer is filled with French text? Cyrillic? What if it isn’t text that’s being searched? It might be worthwhile for an application to build a dynamic frequency table for each buffer so that the best scan character could be chosen for each search. Or perhaps not, if the search isn’t time-critical or the buffer is small.
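If you did want to build such a table, it needn’t be fancy. Here’s a rough sketch (my code, not part of the listings) that counts byte frequencies in the buffer and returns the offset of the rarest search-string character, suitable for passing to Listing 9.2 as ScanCharOffset.

#include <string.h>

unsigned int PickScanCharOffset(unsigned char *Buffer,
      unsigned int BufferLength, unsigned char *SearchString,
      unsigned int SearchStringLength)
{
   unsigned long Freq[256];
   unsigned int i, Best = 0;

   memset(Freq, 0, sizeof(Freq));
   /* Count how often each byte value occurs in the buffer */
   for (i = 0; i < BufferLength; i++)
      Freq[Buffer[i]]++;
   /* Pick the search-string character that occurs least often */
   for (i = 1; i < SearchStringLength; i++)
      if (Freq[SearchString[i]] < Freq[SearchString[Best]])
         Best = i;
   return Best;   /* pass this as ScanCharOffset to FindString */
}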
The point is that you can improve performance dramatically by understanding the nature of the data with which you work. (This is equally true for high-level language programming, by the way.) Listing 9.2 is very similar to and only slightly more complex than Listing 9.1; the difference lies not in elbow grease or cycle counting but in the organic integrating optimizer technology we all carry around in our heads.
David Stafford (recently of Borland and Borland Japan) who happens to be one of the best assembly language programmers I’ve ever met, has written a C-callable routine that sorts an array of integers in ascending order. That wouldn’t be particularly noteworthy, except that David’s routine, shown in Listing 9.4, is exactly 25 bytes long. Look at the code; you’ll keep saying to yourself, “But this doesn’t work…oh, yes, I guess it does.” As they say in the Prego spaghetti sauce ads, it’s in there—and what a job of packing. Anyway, David says that a 24-byte sort routine eludes him, and he’d like to know if anyone can come up with one.
LISTING 9.4 L9-4.ASM
;--------------------------------------------------------------------------
; Sorts an array of ints. C callable (small model). 25 bytes.
; void sort( int num, int a[] );
;
; Courtesy of David Stafford.
;--------------------------------------------------------------------------
.model small
.code
public _sort
top: mov dx,[bx] ;swap two adjacent integers
xchg dx,[bx+2]
xchg dx,[bx]
cmp dx,[bx] ;did we put them in the right order?
jl top ;no, swap them back
inc bx ;go to next integer
inc bx
loop top
_sort: pop dx ;get return address (entry point)
pop cx ;get count
pop bx ;get pointer
push bx ;restore pointer
dec cx ;decrement count
push cx ;save count
push dx ;restore return address
jg top ;if cx > 0
ret
end
One of the most annoying limitations of the x86 is that while the
dividend operand to the DIV
instruction can be 32 bits in
size, both the divisor and the result must be 16 bits. That’s
particularly annoying in regards to the result because sometimes you
just don’t know whether the ratio of the dividend to the divisor is
greater than 64K-1 or not—and if you guess wrong, you get that godawful
Divide By Zero interrupt. So, what is one to do when the result might
not fit in 16 bits, or when the dividend is larger than 32 bits? Fall
back to a software division approach? That will work—but oh so
slowly.
There’s another technique that’s much faster than a pure software approach, albeit not so flexible. This technique allows arbitrarily large dividends and results, but the divisor is still limited to 16 bits. That’s not perfect, but it does solve a number of problems, in particular eliminating the possibility of a Divide By Zero interrupt from a too-large result.
This technique involves nothing more complicated than breaking up the division into word-sized chunks, starting with the most significant word of the dividend. The most significant word is divided by the divisor (with no chance of overflow because there are only 16 bits in each); then the remainder is prepended to the next 16 bits of dividend, and the process is repeated, as shown in Figure 9.3. This process is equivalent to dividing by hand, except that here we stop to carry the remainder manually only after each word of the dividend; the hardware divide takes care of the rest. Listing 9.5 shows a function to divide an arbitrarily large dividend by a 16-bit divisor, and Listing 9.6 shows a sample division of a large dividend. Note that the same principle can be applied to handling arbitrarily large dividends in 386 native mode code, but in that case the operation can proceed a dword, rather than a word, at a time.
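Before looking at the assembly implementation, here’s the same word-at-a-time idea in plain C, assuming 16-bit unsigned ints and 32-bit unsigned longs; note that, unlike Listing 9.5, this sketch takes the dividend length in words rather than bytes, and the function name is mine.

unsigned int LongDiv(unsigned int *Dividend, int DividendLengthInWords,
      unsigned int Divisor, unsigned int *Quotient)
{
   unsigned long Partial = 0;   /* remainder carried from word to word */
   int i;

   /* Work from the most significant word down to the least significant */
   for (i = DividendLengthInWords - 1; i >= 0; i--) {
      Partial = (Partial << 16) | Dividend[i]; /* prepend the remainder */
      Quotient[i] = (unsigned int)(Partial / Divisor);
      Partial %= Divisor;                      /* carry remainder along */
   }
   return (unsigned int)Partial;               /* overall remainder */
}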
As for handling signed division with arbitrarily large dividends,
that can be done easily enough by remembering the signs of the dividend
and divisor, dividing the absolute value of the dividend by the absolute
value of the divisor, and applying the stored signs to set the proper
signs for the quotient and remainder. There may be more clever ways to
produce the same result, by using IDIV
, for example; if you
know of one, drop me a line c/o Coriolis Group Books.
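To make the sign handling concrete, here’s a simplified sketch for a 32-bit dividend and a 16-bit divisor (my simplification; the multiword case works the same way: remember the signs, divide the magnitudes, then apply the signs afterward, with the remainder taking the dividend’s sign). The edge case of the most negative value is ignored for clarity.

long SignedDiv(long Dividend, int Divisor, int *Remainder)
{
   int QuotientNegative  = (Dividend < 0) != (Divisor < 0);
   int RemainderNegative = (Dividend < 0);
   unsigned long AbsDividend = (Dividend < 0) ?
         0UL - (unsigned long)Dividend : (unsigned long)Dividend;
   unsigned int AbsDivisor = (Divisor < 0) ? -Divisor : Divisor;
   unsigned long Quot = AbsDividend / AbsDivisor;   /* unsigned divide */
   unsigned int  Rem  = (unsigned int)(AbsDividend % AbsDivisor);

   *Remainder = RemainderNegative ? -(int)Rem : (int)Rem;
   return QuotientNegative ? -(long)Quot : (long)Quot;
}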
LISTING 9.5 L9-5.ASM
; Divides an arbitrarily long unsigned dividend by a 16-bit unsigned
; divisor. C near-callable as:
; unsigned int Div(unsigned int * Dividend,
; int DividendLength, unsigned int Divisor,
; unsigned int * Quotient);
;
; Returns the remainder of the division.
;
; Tested with TASM 2.
parms struc
               dw 2 dup (?) ;pushed BP & return address
Dividend       dw ? ;pointer to value to divide, stored in Intel
                    ; order, with lsb at lowest address, msb at
                    ; highest. Must be composed of an integral
                    ; number of words
DividendLength dw ? ;# of bytes in Dividend. Must be a multiple
                    ; of 2
Divisor        dw ? ;value by which to divide. Must not be zero,
                    ; or a Divide By Zero interrupt will occur
Quotient       dw ? ;pointer to buffer in which to store the
                    ; result of the division, in Intel order.
                    ; The quotient returned is of the same
                    ; length as the dividend
parms ends
.model small
.code
public _Div
_Div proc near
push bp ;preserve caller's stack frame
mov bp,sp ;point to our stack frame
push si ;preserve caller's register variables
push di
std ;we're working from msb to lsb
mov ax,ds
mov es,ax ;for STOS
mov cx,[bp+DividendLength]
sub cx,2
mov si,[bp+Dividend]
add si,cx ;point to the last word of the dividend
; (the most significant word)
mov di,[bp+Quotient]
add di,cx ;point to the last word of the quotient
; buffer (the most significant word)
mov bx,[bp+Divisor]
shr cx,1
inc cx ;# of words to process
sub dx,dx ;convert initial dividend word to a 32-bit
;value for DIV
DivLoop:
lodsw ;get next most significant word of dividend
div bx
stosw ;save this word of the quotient
;DX contains the remainder at this point,
; ready to prepend to the next dividend word
loop DivLoop
mov ax,dx ;return the remainder
cld ;restore default Direction flag setting
pop di ;restore caller's register variables
pop si
pop bp ;restore caller's stack frame
ret
_Div endp
end
LISTING 9.6 L9-6.C
/* Sample use of Div function to perform division when the result
doesn't fit in 16 bits */
#include <stdio.h>
extern unsigned int Div(unsigned int * Dividend,
int DividendLength, unsigned int Divisor,
unsigned int * Quotient);
main() {
   unsigned long m, i = 0x20000001;
   unsigned int k, j = 0x10;

   k = Div((unsigned int *)&i, sizeof(i), j, (unsigned int *)&m);
   printf("%lu / %u = %lu r %u\n", i, j, m, k);
}
Way back in Volume 1, Number 1 of PC TECHNIQUES (April/May 1990), I wrote the very first of that magazine’s HAX (#1), which extolled the virtues of placing your most commonly-used automatic (stack-based) variables within the stack’s “sweet spot,” the area between +127 and -128 bytes away from BP, the stack frame pointer. The reason was that the 8088 can store addressing displacements that fall within that range in a single byte; larger displacements require a full word of storage, increasing code size by a byte per instruction, and thereby slowing down performance due to increased instruction fetching time.
This takes on new prominence in 386 native mode, where straying from the sweet spot costs not one, but two or three bytes. Where the 8088 had two possible displacement sizes, either byte or word, on the 386 there are three possible sizes: byte, word, or dword. In native mode (32-bit protected mode), however, a prefix byte is needed in order to use a word-sized displacement, so a variable located outside the sweet spot requires either two extra bytes (an extra displacement byte plus a prefix byte) or three extra bytes (a dword displacement rather than a byte displacement). Either way, instructions grow alarmingly.
Performance may or may not suffer from missing the sweet spot, depending on the processor, the memory architecture, and the code mix. On a 486, prefix bytes often cost a cycle; on a 386SX, increased code size often slows performance because instructions must be fetched through the half-pint 16-bit bus; on a 386, the effect depends on the instruction mix and whether there’s a cache.
On balance, though, it’s as important to keep your most-used variables in the stack’s sweet spot in 386 native mode as it was on the 8088.
In assembly, it’s easy to control the organization of your stack frame. In C, however, you’ll have to figure out the allocation scheme your compiler uses to allocate automatic variables, and declare automatics appropriately to produce the desired effect. It can be done: I did it in Turbo C some years back, and trimmed the size of a program (admittedly, a large one) by several K—not bad, when you consider that the “sweet spot” optimization is essentially free, with no code reorganization, change in logic, or heavy thinking involved.
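If you want to see what your own compiler does, a quick (and admittedly crude) probe like the following will show the order in which automatics are laid out in the frame; the variable names are purely illustrative.

#include <stdio.h>

void ProbeFrameLayout(void)
{
   int  hot1, hot2;      /* candidates for the sweet spot */
   char big[300];        /* deliberately larger than the sweet spot */
   int  hot3;

   /* Compare the printed addresses to see which variables the compiler
      places nearest the frame pointer */
   printf("hot1 %p  hot2 %p  big %p  hot3 %p\n",
         (void *)&hot1, (void *)&hot2, (void *)big, (void *)&hot3);
}

int main(void)
{
   ProbeFrameLayout();
   return 0;
}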
Next, we come to an item that cycle counters will love, especially
since it involves apparently incorrect documentation on Intel’s part.
According to Intel’s documents, all RCR
and
RCL
instructions, which perform rotations through the Carry
flag, as shown in Figure 9.4, take 9 cycles on the 386 when working with
a register operand. My measurements indicate that the 9-cycle execution
time almost holds true for multibit rotate-through-carries,
which I’ve timed at 8 cycles apiece; for example, RCR AX,CL
takes 8 cycles on my 386, as does RCL DX,2
.
Contrast that with ROR
and ROL
, which can
rotate the contents of a register any number of bits in just 3
cycles.
However, rotating by one bit through the Carry flag does not
take 9 cycles, contrary to Intel’s 80386 Programmer’s Reference
Manual, or even 8 cycles. In fact, RCR
reg,1
and RCL
reg,1 take 3 cycles, just like
ROR
, ROL
, SHR
, and
SHL
. At least, that’s how fast they run on my 386, and I
very much doubt that you’ll find different execution times on other
386s. (Please let me know if you do, though!)
Interestingly, according to Intel’s i486 Microprocessor
Programmer’s Reference Manual, the 486 can RCR
or
RCL
a register by one bit in 3 cycles, but takes between 8
and 30 cycles to perform a multibit register RCR
or
RCL
!
No great lesson here, just a caution to be leery of multibit
RCR
and RCL
when performance matters—and to
take cycle-time documentation with a grain of salt.
Did you ever wonder how to code a far jump to an absolute address in
assembly language? Probably not, but if you ever do, you’re going to be
glad for this next item, because the obvious solution doesn’t work. You
might think all it would take to jump to, say, 1000:5 would be
JMP FAR PTR 1000:5
, but you’d be wrong. That won’t even
assemble. You might then think to construct in memory a far pointer
containing 1000:5, as in the following:
Ptr dd ?
:
mov word ptr [Ptr],5
mov word ptr [Ptr+2],1000h
jmp [Ptr]
That will work, but at a price in performance. On an 8088,
JMP DWORD PTR [*mem*]
(an indirect far jump) takes at least
37 cycles; JMP DWORD PTR *label*
(a direct far jump) takes
only 15 cycles (plus, almost certainly, some cycles for instruction
fetching). On a 386, an indirect far jump is documented to take at least
43 cycles in real mode (31 in protected mode); a direct far jump is
documented to take at least 12 cycles, about three times faster. In
truth, the difference between those two is nowhere near that big; the
fastest I’ve measured for a direct far jump is 21 cycles, and I’ve
measured indirect far jumps as fast as 30 cycles, so direct is still
faster, but not by so much. (Oh, those cycle-time documentation blues!)
Also, a direct far jump is documented to take at least 27 cycles in
protected mode; why the big difference in protected mode, I have no
idea.
At any rate, to return to our original problem of jumping to 1000:5: Although an indirect far jump will work, a direct far jump is still preferable.
Listing 9.7 shows a short program that performs a direct far jump to
1000:5. (Don’t run it, unless you want to crash your system!) It does
this by creating a dummy segment at 1000H, so that the label
FarLabel
can be created with the desired far attribute at
the proper location. (Segments created with “AT” don’t cause the
generation of any actual bytes or the allocation of any memory; they’re
just templates.) It’s a little kludgey, but at least it does work. There
may be a better solution; if you have one, pass it along.
LISTING 9.7 L9-7.ASM
; Program to perform a direct far jump to address 1000:5.
; *** Do not run this program! It's just an example of how ***
; *** to build a direct far jump to an absolute address ***
;
; Tested with TASM 2 and MASM 5.
FarSeg segment at 01000h
org 5
FarLabel label far
FarSeg ends
.model small
.code
start:
jmp FarLabel
end start
By the way, if you’re wondering how I figured this out, I merely applied my good friend Dan Illowsky’s long-standing rule for dealing with MASM:
If the obvious doesn’t work (and it usually doesn’t), just try everything you can think of, no matter how ridiculous, until you find something that does—a rule with plenty of history on its side.
To finish up this chapter, consider these two items. First, in 32-bit protected mode,
sub eax,eax
inc eax
takes 4 cycles to execute, but is only 3 bytes long, while
mov eax,1
takes only 2 cycles to execute, but is 5 bytes long (because native
mode constants are dwords and the MOV
instruction doesn’t
sign-extend). Both code fragments are ways to set EAX
to 1
(although the first affects the flags and the second doesn’t); this is a
classic trade-off of speed for space. Second,
or ebx,-1
takes 2 cycles to execute and is 3 bytes long, while
mov ebx,-1
takes 2 cycles to execute and is 5 bytes long. Both instructions set
EBX
to -1; this is a classic trade-off of—gee, it’s not a
trade-off at all, is it? OR
is a better way to set a 32-bit
register to all 1-bits, just as SUB
or XOR
is
a better way to set a register to all 0-bits. Who woulda thunk it? Just
goes to show how the 32-bit displacements and constants of 386 native
mode change the familiar landscape of 80x86 optimization.
Be warned, though, that I’ve found OR
, AND
,
ADD
, and the like to be a cycle slower than
MOV
when working with immediate operands on the 386 under
some circumstances, for reasons that thus far escape me. This just
reinforces the first rule of optimization: Measure your code in action,
and place not your trust in documented cycle times.
My grandfather does The New York Times crossword puzzle every Sunday. In ink. With nary a blemish.
The relevance of which will become apparent in a trice.
What my grandfather is, is a pattern matcher par excellence. You’re a pattern matcher, too. So am I. We can’t help it; it comes with the territory. Try focusing on text and not reading it. Can’t do it. Can you hear the voice of someone you know and not recognize it? I can’t. And how in the Nine Billion Names of God is it that we’re capable of instantly recognizing one face out of the thousands we’ve seen in our lifetimes—even years later, from a different angle and in different light? Although we take them for granted, our pattern-matching capabilities are surely a miracle on the order of loaves and fishes.
By “pattern matching,” I mean more than just recognition, though. I mean that we are generally able to take complex and often seemingly woefully inadequate data, instantaneously match it in an incredibly flexible way to our past experience, extrapolate, and reach amazing conclusions, something that computers can scarcely do at all. Crossword puzzles are an excellent example; given a couple of letters and a cryptic clue, we’re somehow able to come up with one out of several hundred thousand words that we know. Try writing a program to do that! What’s more, we don’t process data in the serial brute-force way that computers do. Solutions tend to be virtually instantaneous or not at all; none of those “N log N” or “N2” execution times for us.
It goes without saying that pattern matching is good; more than that, it’s a large part of what we are, and, generally, the faster we are at it, the better. Not always, though. Sometimes insufficient information really is insufficient, and, in our haste to get the heady rush of coming up with a solution, incorrect or less-than-optimal conclusions are reached, as anyone who has ever done the Times Sunday crossword will attest. Still, my grandfather does that puzzle every Sunday in ink. What’s his secret? Patience and discipline. He never fills a word in until he’s confirmed it in his head via intersecting words, no matter how strong the urge may be to put something down where he can see it and feel like he’s getting somewhere.
There’s a surprisingly close parallel to programming here. Programming is certainly a sort of pattern matching in the sense I’ve described above, and, as with crossword puzzles, following your programming instincts too quickly can be a liability. For many programmers, myself included, there’s a strong urge to find a workable approach to a particular problem and start coding it right now, what some people call “hacking” a program. Going with the first thing your programming pattern matcher comes up with can be a lot of fun; there’s instant gratification and a feeling of unbounded creativity. Personally, I’ve always hungered to get results from my work as soon as possible; I gravitated toward graphics for its instant and very visible gratification. Over time, however, I’ve learned patience.
I’ve come to spend an increasingly large portion of my time choosing algorithms, designing, and simply giving my mind quiet time in which to work on problems and come up with non-obvious approaches before coding; and I’ve found that the extra time up front more than pays for itself in both decreased coding time and superior programs.
In this chapter, I’m going to walk you through a simple but illustrative case history that nicely points up the wisdom of delaying gratification when faced with programming problems, so that your mind has time to chew on the problems from other angles. The alternative solutions you find by doing this may seem obvious, once you’ve come up with them. They may not even differ greatly from your initial solutions. Often, however, they will be much better—and you’ll never even have the chance to decide whether they’re better or not if you take the first thing that comes into your head and run with it.
Once upon a time, I set out to read Algorithms, by Robert Sedgewick (Addison-Wesley), which turned out to be a wonderful, stimulating, and most useful book, one that I recommend highly. My story, however, involves only what happened in the first 12 pages, for it was in those pages that Sedgewick discussed Euclid’s algorithm.
Euclid’s algorithm (discovered by Euclid, of Euclidean geometry fame, a very long time ago, way back when computers still used core memory) is a straightforward algorithm that solves one of the simplest problems imaginable: finding the greatest common integer divisor (GCD) of two positive integers. Sedgewick points out that this is useful for reducing a fraction to its lowest terms. I’m sure it’s useful for other things, as well, although none spring to mind. (A long time ago, I wrote an article about optimizing a bit of code that wasn’t even vaguely time-critical, and got swamped with letters telling me so. I knew it wasn’t time-critical; it was just a good example. So for now, close your eyes and imagine that finding the GCD is not only necessary but must also be done as quickly as possible, because it’s perfect for the point I want to make here and now. Okay?)
The problem at hand, then, is simply this: Find the largest integer value that evenly divides two arbitrary positive integers. That’s all there is to it. So warm up your pattern matchers…and go!
I have a funny feeling that you’d already figured out how to find the GCD before I even said “go.” That’s what I did when reading Algorithms; before I read another word, I had to figure it out for myself. Programmers are like that; give them a problem and their eyes immediately glaze over as they try to solve it before you’ve even shut your mouth. That sort of instant response can certainly be impressive, but it can backfire, too, as it did in my case.
You see, I fell victim to a common programming pitfall, the “brute-force” syndrome. The basis of this syndrome is that there are many problems that have obvious, brute-force solutions—with one small drawback. The drawback is that if you were to try to apply a brute-force solution by hand—that is, work a single problem out with pencil and paper or a calculator—it would generally require that you have the patience and discipline to work on the problem for approximately seven hundred years, not counting eating and sleeping, in order to get an answer. Finding all the prime numbers less than 1,000,000 is a good example; just divide each number up to 1,000,000 by every lesser number, and see what’s left standing. For most of the history of humankind, people were forced to think of cleverer solutions, such as the Sieve of Eratosthenes (we’d have been in big trouble if the ancient Greeks had had computers), mainly because after about five minutes of brute force-type work, people’s attention gets diverted to other important matters, such as how far a paper airplane will fly from a second-story window.
Not so nowadays, though. Computers love boring work; they’re very patient and disciplined, and, besides, one human year = seven dog years = two zillion computer years. So when we’re faced with a problem that has an obvious but exceedingly lengthy solution, we’re apt to say, “Ah, let the computer do that, it’s fast,” and go back to making paper airplanes. Unfortunately, brute-force solutions tend to be slow even when performed by modern-day microcomputers, which are capable of several MIPS (except when I’m late for an appointment and want to finish a compile and run just one more test before I leave, in which case the crystal in my computer is apparently designed to automatically revert to 1 Hz.)
The solution that I instantly came up with to finding the GCD is about as brute-force as you can get: Divide both the larger integer (iL) and the smaller integer (iS) by every integer equal to or less than the smaller integer, until a number is found that divides both evenly, as shown in Figure 10.1. This works, but it’s a lousy solution, requiring as many as iS*2 divisions; very expensive, especially for large values of iS. For example, finding the GCD of 30,001 and 30,002 would require 60,002 divisions, which alone, disregarding tests and branches, would take about 2 seconds on an 8088, and more than 50 milliseconds even on a 25 MHz 486—a very long time in computer years, and not insignificant in human years either.
Listing 10.1 is an implementation of the brute-force approach to GCD calculation. Table 10.1 shows how long it takes this approach to find the GCD for several integer pairs. As expected, performance is extremely poor when iS is large.
Integer pairs for which to find GCD

| | 90 & 27 | 42 & 998 | 453 & 121 | 27432 & 165 | 27432 & 17550 |
|---|---|---|---|---|---|
| Listing 10.1 (Brute force) | 60 (100%) | 110 (100%) | 311 (100%) | 426 (100%) | 43580 (100%) |
| Listing 10.2 (Subtraction) | 25 (42%) | 72 (65%) | 67 (22%) | 280 (66%) | 72 (0.16%) |
| Listing 10.3 (Division: code recursive Euclid’s algorithm) | 20 (33%) | 33 (30%) | 48 (15%) | 32 (8%) | 53 (0.12%) |
| Listing 10.4 (C version of data recursive Euclid’s algorithm; normal optimization) | 12 (20%) | 17 (15%) | 25 (8%) | 16 (4%) | 26 (0.06%) |
| Listing 10.4 (/Ox = maximum optimization) | 12 (20%) | 16 (15%) | 20 (6%) | 15 (4%) | 23 (0.05%) |
| Listing 10.5 (Assembly version of data recursive Euclid’s algorithm) | 10 (17%) | 10 (9%) | 15 (5%) | 10 (2%) | 17 (0.04%) |

Note: Performance of Listings 10.1 through 10.5 in finding the greatest common divisors of various pairs of integers. Times are in microseconds. Percentages represent execution time as a percentage of the execution time of Listing 10.1 for the same integer pair. Listings 10.1-10.4 were compiled with Microsoft C/C++; except as noted, the default optimization was used. All times were measured with the Zen timer (from Chapter 3) on a 20 MHz cached 386.
LISTING 10.1 L10-1.C
/* Finds and returns the greatest common divisor of two positive
integers. Works by trying every integral divisor between the
smaller of the two integers and 1, until a divisor that divides
both integers evenly is found. All C code tested with Microsoft
and Borland compilers.*/
unsigned int gcd(unsigned int int1, unsigned int int2) {
   unsigned int temp, trial_divisor;

   /* Swap if necessary to make sure that int1 >= int2 */
   if (int1 < int2) {
      temp = int1;
      int1 = int2;
      int2 = temp;
   }
   /* Now just try every divisor from int2 on down, until a common
      divisor is found. This can never be an infinite loop because
      1 divides everything evenly */
   for (trial_divisor = int2; ((int1 % trial_divisor) != 0) ||
         ((int2 % trial_divisor) != 0); trial_divisor--)
      ;
   return(trial_divisor);
}
Sedgewick’s first solution to the GCD problem was pretty much the one I came up with. He then pointed out that the GCD of iL and iS is the same as the GCD of iL-iS and iS. This was obvious (once Sedgewick pointed it out); by the very nature of division, any number that divides iL evenly nL times and iS evenly nS times must divide iL-iS evenly nL-nS times. Given that insight, I immediately designed a new, faster approach, shown in Listing 10.2.
LISTING 10.2 L10-2.C
/* Finds and returns the greatest common divisor of two positive
integers. Works by subtracting the smaller integer from the
larger integer until either the values match (in which case
that's the gcd), or the larger integer becomes the smaller of
the two, in which case the two integers swap roles and the
subtraction process continues. */
unsigned int gcd(unsigned int int1, unsigned int int2) {
   unsigned int temp;

   /* If the two integers are the same, that's the gcd and we're
      done */
   if (int1 == int2) {
      return(int1);
   }
   /* Swap if necessary to make sure that int1 >= int2 */
   if (int1 < int2) {
      temp = int1;
      int1 = int2;
      int2 = temp;
   }
   /* Subtract int2 from int1 until int1 is no longer the larger of
      the two */
   do {
      int1 -= int2;
   } while (int1 > int2);
   /* Now recursively call this function to continue the process */
   return(gcd(int1, int2));
}
Listing 10.2 repeatedly subtracts iS from iL until iL becomes less than or equal to iS. If iL becomes equal to iS, then that’s the GCD; alternatively, if iL becomes less than iS, iL and iS switch values, and the process is repeated, as shown in Figure 10.2. The number of iterations this approach requires relative to Listing 10.1 depends heavily on the values of iL and iS, so it’s not always faster, but, as Table 10.1 indicates, Listing 10.2 is generally much better code.
Listing 10.2 is a far graver misstep than Listing 10.1, for all that it’s faster. Listing 10.1 is obviously a hacked-up, brute-force approach; no one could mistake it for anything else. It could be speeded up in any of a number of ways with a little thought. (Simply skipping testing all the divisors between iS and iS/2, not inclusive, would cut the worst-case time in half, for example; that’s not a particularly good optimization, but it illustrates how easily Listing 10.1 can be improved.) Listing 10.1 is a hack job, crying out for inspiration.
Listing 10.2, on the other hand, has gotten the inspiration—and largely wasted it through haste. Had Sedgewick not told me otherwise, I might well have assumed that Listing 10.2 was optimized, a mistake I would never have made with Listing 10.1. I experienced a conceptual breakthrough when I understood Sedgewick’s point: A smaller number can be subtracted from a larger number without affecting their GCD, thereby inexpensively reducing the scale of the problem. And, in my hurry to make this breakthrough reality, I missed its full scope. As Sedgewick says on the very next page, the number that one gets by subtracting iS from iL until iL is less than iS is precisely the same as the remainder that one gets by dividing iL by iS—again, this is inherent in the nature of division—and that is the basis for Euclid’s algorithm, shown in Figure 10.3. Listing 10.3 is an implementation of Euclid’s algorithm.
LISTING 10.3 L10-3.C
/* Finds and returns the greatest common divisor of two integers.
Uses Euclid's algorithm: divides the larger integer by the
smaller; if the remainder is 0, the smaller integer is the GCD,
otherwise the smaller integer becomes the larger integer, the
remainder becomes the smaller integer, and the process is
repeated. */
static unsigned int gcd_recurs(unsigned int, unsigned int);
unsigned int gcd(unsigned int int1, unsigned int int2) {
   unsigned int temp;

   /* If the two integers are the same, that's the GCD and we're
      done */
   if (int1 == int2) {
      return(int1);
   }
   /* Swap if necessary to make sure that int1 >= int2 */
   if (int1 < int2) {
      temp = int1;
      int1 = int2;
      int2 = temp;
   }
   /* Now call the recursive form of the function, which assumes
      that the first parameter is the larger of the two */
   return(gcd_recurs(int1, int2));
}

static unsigned int gcd_recurs(unsigned int larger_int,
      unsigned int smaller_int)
{
   int temp;

   /* If the remainder of larger_int divided by smaller_int is 0,
      then smaller_int is the gcd */
   if ((temp = larger_int % smaller_int) == 0) {
      return(smaller_int);
   }
   /* Make smaller_int the larger integer and the remainder the
      smaller integer, and call this function recursively to
      continue the process */
   return(gcd_recurs(smaller_int, temp));
}
As you can see from Table 10.1, Euclid’s algorithm is superior, especially for large numbers (and imagine if we were working with large longs!).
Had I been implementing GCD determination without Sedgewick’s help, I would surely not have settled for Listing 10.1—but I might well have ended up with Listing 10.2 in my enthusiasm over the “brilliant” discovery of subtracting the lesser number from the greater. In a commercial product, my lack of patience and discipline could have been costly indeed.
Give your mind time and space to wander around the edges of important programming problems before you settle on any one approach. I titled this book’s first chapter “The Best Optimizer Is between Your Ears,” and that’s still true; what’s even more true is that the optimizer between your ears does its best work not at the implementation stage, but at the very beginning, when you try to imagine how what you want to do and what a computer is capable of doing can best be brought together.
Euclid’s algorithm lends itself to recursion beautifully, so much so that an implementation like Listing 10.3 comes almost without thought. Again, though, take a moment to stop and consider what’s really going on, at the assembly language level, in Listing 10.3. There’s recursion and then there’s recursion; code recursion and data recursion, to be exact. Listing 10.3 is code recursion—recursion through calls—the sort most often used because it is conceptually simplest. However, code recursion tends to be slow because it pushes parameters and calls a subroutine for every iteration. Listing 10.4, which uses data recursion, is much faster and no more complicated than Listing 10.3. Actually, you could just say that Listing 10.4 uses a loop and ignore any mention of recursion; conceptually, though, Listing 10.4 performs the same recursive operations that Listing 10.3 does.
LISTING 10.4 L10-4.C
/* Finds and returns the greatest common divisor of two integers.
Uses Euclid's algorithm: divides the larger integer by the
smaller; if the remainder is 0, the smaller integer is the GCD,
otherwise the smaller integer becomes the larger integer, the
remainder becomes the smaller integer, and the process is
repeated. Avoids code recursion. */
unsigned int gcd(unsigned int int1, unsigned int int2) {
   unsigned int temp;

   /* Swap if necessary to make sure that int1 >= int2 */
   if (int1 < int2) {
      temp = int1;
      int1 = int2;
      int2 = temp;
   }
   /* Now loop, dividing int1 by int2 and checking the remainder,
      until the remainder is 0. At each step, if the remainder isn't
      0, assign int2 to int1, and the remainder to int2, then
      repeat */
   for (;;) {
      /* If the remainder of int1 divided by int2 is 0, then int2 is
         the gcd */
      if ((temp = int1 % int2) == 0) {
         return(int2);
      }
      /* Make int2 the larger integer and the remainder the
         smaller integer, and repeat the process */
      int1 = int2;
      int2 = temp;
   }
}
At long last, we’re ready to optimize GCD determination in the classic sense. Table 10.1 shows the performance of Listing 10.4 with and without Microsoft C/C++’s maximum optimization, and also shows the performance of Listing 10.5, an assembly language version of Listing 10.4. Sure, the optimized versions are faster than the unoptimized version of Listing 10.4—but the gains are small compared to those realized from the higher-level optimizations in Listings 10.2 through 10.4.
LISTING 10.5 L10-5.ASM
; Finds and returns the greatest common divisor of two integers.
; Uses Euclid's algorithm: divides the larger integer by the
; smaller; if the remainder is 0, the smaller integer is the GCD,
; otherwise the smaller integer becomes the larger integer, the
; remainder becomes the smaller integer, and the process is
; repeated. Avoids code recursion.
;
;
;
; C near-callable as:
; unsigned int gcd(unsigned int int1, unsigned int int2);
; Parameter structure:
parms struc
dw ? ;pushed BP
dw ? ;pushed return address
int1 dw ? ;integers for which to find
int2 dw ? ; the GCD
parms ends
.model small
.code
public _gcd
align 2
_gcd proc near
push bp ;preserve caller's stack frame
mov bp,sp ;set up our stack frame
push si ;preserve caller's register variables
push di
;Swap if necessary to make sure that int1 >= int2
mov ax,int1[bp]
mov bx,int2[bp]
cmp ax,bx ;is int1 >= int2?
jnb IntsSet ;yes, so we're all set
xchg ax,bx ;no, so swap int1 and int2
IntsSet:
; Now loop, dividing int1 by int2 and checking the remainder, until
; the remainder is 0. At each step, if the remainder isn't 0, assign
; int2 to int1, and the remainder to int2, then repeat.
GCDLoop:
;if the remainder of int1 divided by
; int2 is 0, then int2 is the gcd
sub dx,dx ;prepare int1 in DX:AX for division
div bx ;int1/int2; remainder is in DX
and dx,dx ;is the remainder zero?
jz Done ;yes, so int2 (BX) is the gcd
;no, so move int2 to int1 and the
; remainder to int2, and repeat the
; process
mov ax,bx ;int1 = int2;
mov bx,dx ;int2 = remainder from DIV
;—start of loop unrolling; the above is repeated three times—
sub dx,dx ;prepare int1 in DX:AX for division
div bx ;int1/int2; remainder is in DX
and dx,dx ;is the remainder zero?
jz Done ;yes, so int2 (BX) is the gcd
mov ax,bx ;int1 = int2;
mov bx,dx ;int2 = remainder from DIV
;—
sub dx,dx ;prepare int1 in DX:AX for division
div bx ;int1/int2; remainder is in DX
and dx,dx ;is the remainder zero?
jz Done ;yes, so int2 (BX) is the gcd
mov ax,bx ;int1 = int2;
mov bx,dx ;int2 = remainder from DIV
;—
sub dx,dx ;prepare int1 in DX:AX for division
div bx ;int1/int2; remainder is in DX
and dx,dx ;is the remainder zero?
jz Done ;yes, so int2 (BX) is the gcd
mov ax,bx ;int1 = int2;
mov bx,dx ;int2 = remainder from DIV
;—end of loop unrolling—
jmp GCDLoop
align 2
Done:
mov ax,bx ;return the GCD
pop di ;restore caller's register variables
pop si
pop bp ;restore caller's stack frame
ret
_gcd endp
end
Assembly language optimization is pattern matching on a local scale. Frankly, it’s also the sort of boring, brute-force work that people are lousy at; compilers could out-optimize you at this level with one pass tied behind their back if they knew as much about the code you’re writing as you do, which they don’t.
Design optimization—conceptual breakthroughs in understanding the relationships between the needs of an application, the nature of the data the application works with, and what the computer can do—is global pattern matching.
Computers are much worse at that sort of pattern matching than humans; computers have no way to integrate vast amounts of disparate information, much of it only vaguely defined or subject to change. People, oddly enough, are better at global optimization than at local optimization. For one thing, it’s more interesting. For another, it’s complex and imprecise enough to allow intuition and inspiration, two vastly underrated programming tools, to come to the fore. And, as I pointed out earlier, people tend to perform instantaneous solutions to even the most complex problems, while computers bog down in geometrically or exponentially increasing execution times. Oh, it may take days or weeks for a person to absorb enough information to be able to reach a solution, and the solution may only be near-optimal—but the solution itself (or, at least, each of the pieces of the solution) arrives in a flash.
Those flashes are your programming pattern matcher doing its job. Your job is to give your pattern matcher the opportunity to get to know each problem and run through it two or three times, from different angles, to see what unexpected solutions it can come up with.
Pull back the reins a little. Don’t measure progress by lines of code written today; measure it instead by overall progress and by quality. Relax and listen to that quiet inner voice that provides the real breakthroughs. Stop, look, listen—and think. Not only will you find that it’s a more productive and creative way to program—but you’ll also find that it’s more fun.
And think what you could do with all those extra computer years!
This chapter, adapted from my earlier book Zen of Assembly Language (1989; now out of print), provides an overview of the 286 and 386, often contrasting those processors with the 8088. At the time I originally wrote this, the 8088 was the king of processors, and the 286 and 386 were the new kids on the block. Today, of course, all three processors are past their primes, but many millions of each are still in use, and the 386 in particular is still well worth considering when optimizing software.
This chapter provides an interesting look at the evolution of the x86 architecture, to a greater degree than you might expect, for the x86 family came into full maturity with the 386; the 486 and the Pentium are really nothing more than faster 386s, with very little in the way of new functionality. In contrast, the 286 added a number of instructions, respectable performance, and protected mode to the 8088’s capabilities, and the 386 added more instructions and a whole new set of addressing modes, and brought the x86 family into the 32-bit world that represents the future (and, increasingly, the present) of personal computing. This chapter also provides insight into the effects on optimization of the variations in processors and memory architectures that are common in the PC world. So, although the 286 and 386 no longer represent the mainstream of computing, this chapter is a useful mix of history lesson, x86 overview, and details on two workhorse processors that are still in wide use.
While the x86 family is a large one, only a few members of the family—which includes the 8088, 8086, 80188, 80186, 286, 386SX, 386DX, numerous permutations of the 486, and now the Pentium—really matter.
The 8088 is now all but extinct in the PC arena. The 8086 was used fairly widely for a while, but has now all but disappeared. The 80186 and 80188 never really caught on for use in PCs and don’t require further discussion.
That leaves us with the high-end chips: the 286, the 386SX, the 386, the 486, and the Pentium. At this writing, the 386SX is fast going the way of the 8088; people are realizing that its relatively small cost advantage over the 386 isn’t enough to offset its relatively large performance disadvantage. After all, the 386SX suffers from the same debilitating problem that looms over the 8088—a too-small bus. Internally, the 386SX is a 32-bit processor, but externally, it’s a 16-bit processor, a non-optimal architecture, especially for 32-bit code.
I’m not going to discuss the 386SX in detail. If you do find yourself programming for the 386SX, follow the same general rules you should follow for the 8088: use short instructions, use the registers as heavily as possible, and don’t branch. In other words, avoid memory, since the 386SX is by definition better at processing data internally than it is at accessing memory.
The 486 is a world unto itself for the purposes of optimization, and the Pentium is a universe unto itself. We’ll treat them separately in later chapters.
This leaves us with just two processors: the 286 and the 386. Each was the PC standard in its day. The 286 is no longer used in new systems, but there are millions of 286-based systems still in daily use. The 386 is still being used in new systems, although it’s on the downhill leg of its lifespan, and it is in even wider use than the 286. The future clearly belongs to the 486 and Pentium, but the 286 and 386 are still very much a part of the present-day landscape.
Apart from vastly improved performance, the biggest difference between the 8088 and the 286 and 386 (as well as the later Intel CPUs) is that the 286 introduced protected mode, and the 386 greatly expanded the capabilities of protected mode. We’re only going to talk about real-mode operation of the 286 and 386 in this book, however. Protected mode offers a whole new memory management scheme, one that isn’t supported by the 8088. Only code specifically written for protected mode can run in that mode; it’s an alien and hostile environment for MS-DOS programs.
In particular, segments are different creatures in protected mode. They’re selectors—indexes into a table of segment descriptors—rather than plain old registers, and can’t be set to arbitrary values. That means that segments can’t be used for temporary storage or as part of a fast indivisible 32-bit load from memory, as in
les ax,dword ptr [LongVar]
mov dx,es
which loads LongVar into DX:AX faster than this:
mov ax,word ptr [LongVar]
mov dx,word ptr [LongVar+2]
Protected mode uses those altered segment registers to offer access to a great deal more memory than real mode: The 286 supports 16 megabytes of memory, while the 386 supports 4 gigabytes (4K megabytes) of physical memory and 64 terabytes (64K gigabytes!) of virtual memory.
In protected mode, your programs generally run under an operating system (OS/2, Unix, Windows NT or the like) that exerts much more control over the computer than does MS-DOS. Protected mode operating systems can generally run multiple programs simultaneously, and the performance of any one program may depend far less on code quality than on how efficiently the program uses operating system services and how often and under what circumstances the operating system preempts the program. Protected mode programs are often mostly collections of operating system calls, and the performance of whatever code isn’t operating-system oriented may depend primarily on how large a time slice the operating system gives that code to run in.
In short, taken as a whole, protected mode programming is a different kettle of fish altogether from what I’ve been describing in this book. There’s certainly a knack to optimizing specifically for protected mode under a given operating system…but it’s not what we’ve been learning, and now is not the time to pursue it further. In general, though, the optimization strategies discussed in this book still hold true in protected mode; it’s just issues specific to protected mode or a particular operating system that we won’t discuss.
Under the programming interface, the 286 and 386 differ considerably from the 8088. Nonetheless, with one exception and one addition, the cycle-eaters remain much the same on computers built around the 286 and 386. Next, we’ll review each of the familiar cycle-eaters I covered in Chapter 4 as they apply to the 286 and 386, and we’ll look at the new member of the gang, the data alignment cycle-eater.
The one cycle-eater that vanishes on the 286 and 386 is the 8-bit bus cycle-eater. The 286 is a 16-bit processor both internally and externally, and the 386 is a 32-bit processor both internally and externally, so the Execution Unit/Bus Interface Unit size mismatch that plagues the 8088 is eliminated. Consequently, there’s no longer any need to use byte-sized memory variables in preference to word-sized variables, at least so long as word-sized variables start at even addresses, as we’ll see shortly. On the other hand, access to byte-sized variables still isn’t any slower than access to word-sized variables, so you can use whichever size suits a given task best.
You might think that the elimination of the 8-bit bus cycle-eater would mean that the prefetch queue cycle-eater would also vanish, since on the 8088 the prefetch queue cycle-eater is a side effect of the 8-bit bus. That would seem all the more likely given that both the 286 and the 386 have larger prefetch queues than the 8088 (6 bytes for the 286, 16 bytes for the 386) and can perform memory accesses, including instruction fetches, in far fewer cycles than the 8088.
However, the prefetch queue cycle-eater doesn’t vanish on either the 286 or the 386, for several reasons. For one thing, branching instructions still empty the prefetch queue, so instruction fetching still slows things down after most branches; when the prefetch queue is empty, it doesn’t much matter how big it is. (Even apart from emptying the prefetch queue, branches aren’t particularly fast on the 286 or the 386, at a minimum of seven-plus cycles apiece. Avoid branching whenever possible.)
After a branch it does matter how fast the queue can refill, and there we come to the second reason the prefetch queue cycle-eater lives on: The 286 and 386 are so fast that sometimes the Execution Unit can execute instructions faster than they can be fetched, even though instruction fetching is much faster on the 286 and 386 than on the 8088.
(All other things being equal, too-slow instruction fetching is more of a problem on the 286 than on the 386, since the 386 fetches 4 instruction bytes at a time versus the 2 instruction bytes fetched per memory access by the 286. However, the 386 also typically runs at least twice as fast as the 286, meaning that the 386 can easily execute instructions faster than they can be fetched unless very high-speed memory is used.)
The most significant reason that the prefetch queue cycle-eater not only survives but prospers on the 286 and 386, however, lies in the various memory architectures used in computers built around the 286 and 386. Due to the memory architectures, the 8-bit bus cycle-eater is replaced by a new form of the wait state cycle-eater: wait states on accesses to normal system memory.
The 286 and 386 were designed to lose relatively little performance to the prefetch queue cycle-eater…when used with zero-wait-state memory: memory that can complete memory accesses so rapidly that no wait states are needed. However, true zero-wait-state memory is almost never used with those processors. Why? Because memory that can keep up with a 286 is fairly expensive, and memory that can keep up with a 386 is very expensive. Instead, computer designers use alternative memory architectures that offer more performance for the dollar—but less performance overall—than zero-wait-state memory. (It is possible to build zero-wait-state systems for the 286 and 386; it’s just so expensive that it’s rarely done.)
The IBM AT and true compatibles use one-wait-state memory (some AT clones use zero-wait-state memory, but such clones are less common than one-wait-state AT clones). The 386 systems use a wide variety of memory systems—including high-speed caches, interleaved memory, and static-column RAM—that insert anywhere from 0 to about 5 wait states (and many more if 8 or 16-bit memory expansion cards are used); the exact number of wait states inserted at any given time depends on the interaction between the code being executed and the memory system it’s running on.
The performance of most 386 memory systems can vary greatly from one memory access to another, depending on factors such as what data happens to be in the cache and which interleaved bank and/or RAM column was accessed last.
The many memory systems in use make it impossible for us to optimize for 286/386 computers with the precision that’s possible on the 8088. Instead, we must write code that runs reasonably well under the varying conditions found in the 286/386 arena.
The wait states that occur on most accesses to system memory in 286 and 386 computers mean that nearly every access to system memory—memory in the DOS’s normal 640K memory area—is slowed down. (Accesses in computers with high-speed caches may be wait-state-free if the desired data is already in the cache, but will certainly encounter wait states if the data isn’t cached; this phenomenon produces highly variable instruction execution times.) While this is our first encounter with system memory wait states, we have run into a wait-state cycle-eater before: the display adapter cycle-eater, which we discussed along with the other 8088 cycle-eaters way back in Chapter 4. System memory generally has fewer wait states per access than display memory. However, system memory is also accessed far more often than display memory, so system memory wait states hurt plenty—and the place they hurt most is instruction fetching.
Consider this: The 286 can store an immediate value to memory, as in MOV [WordVar],0, in just 3 cycles. However, that instruction is 6 bytes long. The 286 is capable of fetching 1 word every 2 cycles; however, the one-wait-state architecture of the AT stretches that to 3 cycles. Consequently, nine cycles are needed to fetch the six instruction bytes. On top of that, 3 cycles are needed to write to memory, bringing the total memory access time to 12 cycles. On balance, memory access time—especially instruction prefetching—greatly exceeds execution time, to the extent that this particular instruction can take up to four times as long to run as it does to execute in the Execution Unit.
And that, my friend, is unmistakably the prefetch queue cycle-eater. I might add that the prefetch queue cycle-eater is in rare good form in the above example: A 4-to-1 ratio of instruction fetch time to execution time is in a class with the best (or worst!) that’s found on the 8088.
Let’s check out the prefetch queue cycle-eater in action. Listing 11.1 times MOV [WordVar],0. The Zen timer reports that on a one-wait-state 10 MHz 286-based AT clone (the computer used for all tests in this chapter), Listing 11.1 runs in 1.27 µs per instruction. That’s 12.7 cycles per instruction, just as we calculated. (That extra seven-tenths of a cycle comes from DRAM refresh, which we’ll get to shortly.)
LISTING 11.1 L11-1.ASM
;
; *** Listing 11.1 ***
;
; Measures the performance of an immediate move to
; memory, in order to demonstrate that the prefetch
; queue cycle-eater is alive and well on the AT.
;
jmp Skip
;
even ;always make sure word-sized memory
; variables are word-aligned!
WordVar dw 0
;
Skip:
call ZTimerOn
rept 1000
mov [WordVar],0
endm
call ZTimerOff
What does this mean? It means that, practically speaking, the 286 as used in the AT doesn’t have a 16-bit bus. From a performance perspective, the 286 in an AT has two-thirds of a 16-bit bus (a 10.7-bit bus?), since every bus access on an AT takes 50 percent longer than it should. A 286 running at 10 MHz should be able to access memory at a maximum rate of 1 word every 200 ns; in a 10 MHz AT, however, that rate is reduced to 1 word every 300 ns by the one-wait-state memory.
In short, a close relative of our old friend the 8-bit bus cycle-eater—the system memory wait state cycle-eater—haunts us still on all but zero-wait-state 286 and 386 computers, and that means that the prefetch queue cycle-eater is alive and well. (The system memory wait state cycle-eater isn’t really a new cycle-eater, but rather a variant of the general wait state cycle-eater, of which the display adapter cycle-eater is yet another variant.) While the 286 in the AT can fetch instructions much faster than can the 8088 in the PC, it can execute those instructions faster still.
The picture is less clear in the 386 world since there are so many different memory architectures, but similar problems can occur in any computer built around a 286 or 386. The prefetch queue cycle-eater is even a factor—albeit a lesser one—on zero-wait-state machines, both because branching empties the queue and because some instructions can outrun even zero-wait-state instruction fetching. (Even with zero wait states, the 6-byte instruction of Listing 11.1 needs at least 8 cycles for instruction fetching and the memory write, 5 cycles longer than its official execution time.)
To summarize: On 286 and 386 computers, system memory wait states and an easily emptied prefetch queue mean that instructions rarely run at their officially rated speeds; branches are doubly expensive because they both execute slowly and flush the prefetch queue; and the penalties vary from one memory architecture to the next, so the cycle-by-cycle precision we could achieve on the 8088 simply isn’t possible here.
What’s to be learned from all this? Several things: Keep your instructions short, keep your data in the registers and go to memory as seldom as possible, and don’t branch unless you absolutely have to.
Of course, those are exactly the rules that apply to 8088 optimization as well. Isn’t it convenient that the same general rules apply across the board?
Thanks to its 16-bit bus, the 286 can access word-sized memory variables just as fast as byte-sized variables. There’s a catch, however: That’s only true for word-sized variables that start at even addresses. When the 286 is asked to perform a word-sized access starting at an odd address, it actually performs two separate accesses, each of which fetches 1 byte, just as the 8088 does for all word-sized accesses.
Figure 11.1 illustrates this phenomenon. The conversion of word-sized accesses to odd addresses into double byte-sized accesses is transparent to memory-accessing instructions; all any instruction knows is that the requested word has been accessed, no matter whether 1 word-sized access or 2 byte-sized accesses were required to accomplish it.
The penalty for performing a word-sized access starting at an odd address is easy to calculate: Two accesses take twice as long as one access.
In other words, the effective capacity of the 286’s external data bus is halved when a word-sized access to an odd address is performed.
That, in a nutshell, is the data alignment cycle-eater, the one new cycle-eater of the 286 and 386. (The data alignment cycle-eater is a close relative of the 8088’s 8-bit bus cycle-eater, but since it behaves differently—occurring only at odd addresses—and is avoided with a different workaround, we’ll consider it to be a new cycle-eater.)
The way to deal with the data alignment cycle-eater is straightforward: Don’t perform word-sized accesses to odd addresses on the 286 if you can help it. The easiest way to avoid the data alignment cycle-eater is to place the directive EVEN before each of your word-sized variables. EVEN forces the offset of the next byte assembled to be even by inserting a NOP if the current offset is odd; consequently, you can ensure that any word-sized variable can be accessed efficiently by the 286 simply by preceding it with EVEN.
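For example, here’s a minimal hypothetical fragment (not from the original listings) showing EVEN padding out the odd offset left by a byte-sized variable:
ByteVar db 0 ;this byte leaves the next offset odd...
even ; ...so EVEN pads here, forcing the next offset to be even
WordVar dw 0 ;now word-aligned, so the 286 can access it
; in a single word-sized memory access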
Listing 11.2, which accesses memory a word at a time with each word starting at an odd address, runs on a 10 MHz AT clone in 1.27 µs per repetition of MOVSW, or 0.64 µs per word-sized memory access. That’s 6-plus cycles per word-sized access, which breaks down to two separate memory accesses—3 cycles to access the high byte of each word and 3 cycles to access the low byte of each word, the inevitable result of non-word-aligned word-sized memory accesses—plus a bit extra for DRAM refresh.
LISTING 11.2 L11-2.ASM
;
; *** Listing 11.2 ***
;
; Measures the performance of accesses to word-sized
; variables that start at odd addresses (are not
; word-aligned).
;
Skip:
push ds
pop es
mov si,1 ;source and destination are the same
mov di,si ; and both are not word-aligned
mov cx,1000 ;move 1000 words
cld
call ZTimerOn
rep movsw
call ZTimerOff
On the other hand, Listing 11.3, which is exactly the same as Listing 11.2 save that the memory accesses are word-aligned (start at even addresses), runs in 0.64 µs per repetition of MOVSW, or 0.32 µs per word-sized memory access. That’s 3 cycles per word-sized access—exactly twice as fast as the non-word-aligned accesses of Listing 11.2, just as we predicted.
LISTING 11.3 L11-3.ASM
;
; *** Listing 11.3 ***
;
; Measures the performance of accesses to word-sized
; variables that start at even addresses (are word-aligned).
;
Skip:
push ds
pop es
sub si,si ;source and destination are the same
mov di,si ; and both are word-aligned
mov cx,1000 ;move 1000 words
cld
call ZTimerOn
rep movsw
call ZTimerOff
The data alignment cycle-eater has intriguing implications for speeding up 286/386 code. The expenditure of a little care and a few bytes to make sure that word-sized variables and memory blocks are word-aligned can literally double the performance of certain code running on the 286. Even if it doesn’t double performance, word alignment usually helps and never hurts.
Lack of word alignment can also interfere with instruction fetching on the 286, although not to the extent that it interferes with access to word-sized memory variables. The 286 prefetches instructions a word at a time; even if a given instruction doesn’t begin at an even address, the 286 simply fetches the first byte of that instruction at the same time that it fetches the last byte of the previous instruction, as shown in Figure 11.2, then separates the bytes internally. That means that in most cases, instructions run just as fast whether they’re word-aligned or not.
There is, however, a non-word-alignment penalty on branches to odd addresses. On a branch to an odd address, the 286 is only able to fetch 1 useful byte with the first instruction fetch following the branch, as shown in Figure 11.3. In other words, lack of word alignment of the target instruction for any branch effectively cuts the instruction-fetching power of the 286 in half for the first instruction fetch after that branch. While that may not sound like much, you’d be surprised at what it can do to tight loops; in fact, a brief story is in order.
When I was developing the Zen timer, I used my trusty 10 MHz 286-based AT clone to verify the basic functionality of the timer by measuring the performance of simple instruction sequences. I was cruising along with no problems until I timed the following code:
mov cx,1000
call ZTimerOn
LoopTop:
loop LoopTop
call ZTimerOff
Now, this code should run in, say, about 12 cycles per loop at most. Instead, it took over 14 cycles per loop, an execution time that I could not explain in any way. After rolling it around in my head for a while, I took a look at the code under a debugger…and the answer leaped out at me. The loop began at an odd address! That meant that two instruction fetches were required each time through the loop; one to get the opcode byte of the LOOP instruction, which resided at the end of one word-aligned word, and another to get the displacement byte, which resided at the start of the next word-aligned word.
One simple change brought the execution time down to a reasonable 12.5 cycles per loop:
mov cx,1000
call ZTimerOn
even
LoopTop:
loop LoopTop
call ZTimerOff
While word-aligning branch destinations can improve branching performance, it’s a nuisance and can increase code size a good deal, so it’s not worth doing in most code. Besides, EVEN inserts a NOP instruction if necessary, and the time required to execute a NOP can sometimes cancel the performance advantage of having a word-aligned branch destination.
Consequently, it’s best to word-align only those branch destinations that can be reached solely by branching.
I recommend that you only go out of your way to word-align the start offsets of your subroutines, as in:
even
FindChar proc near
:
In my experience, this simple practice is the one form of code alignment that consistently provides a reasonable return for bytes and effort expended, although sometimes it also pays to word-align tight time-critical loops.
So far we’ve only discussed alignment as it pertains to the 286. What, you may well ask, of the 386?
The 386 adds the issue of doubleword alignment (that is, alignment to addresses that are multiples of four). The rule for the 386 is: Word-sized memory accesses should be word-aligned (it’s impossible for word-aligned word-sized accesses to cross doubleword boundaries), and doubleword-sized memory accesses should be doubleword-aligned. However, in real (as opposed to 32-bit protected) mode, doubleword-sized memory accesses are rare, so the simple word-alignment rule we’ve developed for the 286 serves for the 386 in real mode as well.
As for code alignment…the subroutine-start word-alignment rule of the 286 serves reasonably well there too since it avoids the worst case, where just 1 byte is fetched on entry to a subroutine. While optimum performance would dictate doubleword alignment of subroutines, that takes 3 bytes, a high price to pay for an optimization that improves performance only on the post-286 processors.
One side-effect of the data alignment cycle-eater of the 286 and 386 is that you should never allow the stack pointer to become odd. (You can make the stack pointer odd by adding an odd value to it or subtracting an odd value from it, or by loading it with an odd value.) An odd stack pointer on the 286 or 386 (or a non-doubleword-aligned stack in 32-bit protected mode on the 386, 486, or Pentium) will significantly reduce the performance of PUSH, POP, CALL, and RET, as well as INT and IRET, which are executed to invoke DOS and BIOS functions, handle keystrokes and incoming serial characters, and manage the mouse. I know of a Forth programmer who vastly improved the performance of a complex application on the AT simply by forcing the Forth interpreter to maintain an even stack pointer at all times.
An interesting corollary to this rule is that you shouldn’t INC SP twice to add 2, even though that takes fewer bytes than ADD SP,2. The stack pointer is odd between the first and second INC, so any interrupt occurring between the two instructions will be serviced more slowly than it normally would. The same goes for decrementing twice; use SUB SP,2 instead.
Keep the stack pointer aligned at all times.
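Here is a minimal sketch of what that rule means in practice (a hypothetical fragment, not from the original listings):
;Discarding a word from the stack: the two-INC form is a byte shorter,
; but SP is odd between the two increments, so an interrupt arriving at
; that instant is serviced with a misaligned stack.
inc sp ;avoid: SP is odd here...
inc sp ; ...until this instruction completes
add sp,2 ;preferred: SP never becomes odd
;The same applies when making room for a word on the stack:
dec sp ;avoid
dec sp ;avoid
sub sp,2 ;preferred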
The DRAM refresh cycle-eater is the cycle-eater that’s least changed from its 8088 form on the 286 and 386. In the AT, DRAM refresh uses a little over five percent of all available memory accesses, slightly less than it uses in the PC, but in the same ballpark. While the DRAM refresh penalty varies somewhat on various AT clones and 386 computers (in fact, a few computers are built around static RAM, which requires no refresh at all; likewise, caches are made of static RAM so cached systems generally suffer less from DRAM refresh), the 5 percent figure is a good rule of thumb.
Basically, the effect of the DRAM refresh cycle-eater is pretty much the same throughout the PC-compatible world: fairly small, so it doesn’t greatly affect performance; unavoidable, so there’s no point in worrying about it anyway; and a nuisance since it results in fractional cycle counts when using the Zen timer. Just as with the PC, a given code sequence on the AT can execute at varying speeds at different times as a result of the interaction between the code and DRAM refresh.
There’s nothing much new with DRAM refresh on 286/386 computers, then. Be aware of it, but don’t overly concern yourself—DRAM refresh is still an act of God, and there’s not a blessed thing you can do about it. Happily, the internal caches of the 486 and Pentium make DRAM refresh largely a performance non-issue on those processors.
Finally we come to the last of the cycle-eaters, the display adapter cycle-eater. There are two ways of looking at this cycle-eater on 286/386 computers: (1) It’s much worse than it was on the PC, or (2) it’s just about the same as it was on the PC.
Either way, the display adapter cycle-eater is extremely bad news on 286/386 computers and on 486s and Pentiums as well. In fact, this cycle-eater on those systems is largely responsible for the popularity of VESA local bus (VLB).
The two ways of looking at the display adapter cycle-eater on 286/386 computers are actually the same. As you’ll recall from my earlier discussion of the matter in Chapter 4, display adapters offer only a limited number of accesses to display memory during any given period of time. The 8088 is capable of making use of most but not all of those slots with REP MOVSW, so the number of memory accesses allowed by a display adapter such as a standard VGA is reasonably well-matched to an 8088’s memory access speed. Granted, access to a VGA slows the 8088 down considerably—but, as we’re about to find out, “considerably” is a relative term. What a VGA does to PC performance is nothing compared to what it does to faster computers.
Under ideal conditions, a 286 can access memory much, much faster than an 8088. A 10 MHz 286 is capable of accessing a word of system memory every 0.20 µs with REP MOVSW, dwarfing the 1 byte every 1.31 µs that the 8088 in a PC can manage. However, access to display memory is anything but ideal for a 286. For one thing, most display adapters are 8-bit devices, although newer adapters are 16-bit in nature. One consequence of that is that only 1 byte can be read or written per access to display memory; word-sized accesses to 8-bit devices are automatically split into 2 separate byte-sized accesses by the AT’s bus. Another consequence is that accesses are simply slower; the AT’s bus inserts additional wait states on accesses to 8-bit devices since it must assume that such devices were designed for PCs and may not run reliably at AT speeds.
However, the 8-bit size of most display adapters is but one of the two factors that reduce the speed with which the 286 can access display memory. Far more cycles are eaten by the inherent memory-access limitations of display adapters—that is, the limited number of display memory accesses that display adapters make available to the 286. Look at it this way: If REP MOVSW on a PC can use more than half of all available accesses to display memory, then how much faster can code running on a 286 or 386 possibly run when accessing display memory? That’s right—less than twice as fast.
In other words, instructions that access display memory won’t run a whole lot faster on ATs and faster computers than they do on PCs. That explains one of the two viewpoints expressed at the beginning of this section: The display adapter cycle-eater is just about the same on high-end computers as it is on the PC, in the sense that it allows instructions that access display memory to run at just about the same speed on all computers.
Of course, the picture is quite a bit different when you compare the performance of instructions that access display memory to the maximum performance of those instructions. Instructions that access display memory receive many more wait states when running on a 286 than they do on an 8088. Why? While the 286 is capable of accessing memory much more often than the 8088, we’ve seen that the frequency of access to display memory is determined not by processor speed but by the display adapter itself. As a result, both processors are actually allowed just about the same maximum number of accesses to display memory in any given time. By definition, then, the 286 must spend many more cycles waiting than does the 8088.
And that explains the second viewpoint expressed above regarding the display adapter cycle-eater vis-a-vis the 286 and 386. The display adapter cycle-eater, as measured in cycles lost to wait states, is indeed much worse on AT-class computers than it is on the PC, and it’s worse still on more powerful computers.
How bad is the display adapter cycle-eater on an AT? It’s this bad: Based on my (not inconsiderable) experience in timing display adapter access, I’ve found that the display adapter cycle-eater can slow an AT—or even a 386 computer—to near-PC speeds when display memory is accessed.
I know that’s hard to believe, but the display adapter cycle-eater gives out just so many display memory accesses in a given time, and no more, no matter how fast the processor is. In fact, the faster the processor, the more the display adapter cycle-eater hurts the performance of instructions that access display memory. The display adapter cycle-eater is not only still present in 286/386 computers, it’s worse than ever.
What can we do about this new, more virulent form of the display adapter cycle-eater? The workaround is the same as it was on the PC: Access display memory as little as you possibly can.
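One common way to heed that advice, sketched here as a hypothetical fragment (SystemBuffer and SCREEN_BYTES are assumed names, not from the original listings), is to compose each frame in a system-memory buffer and then copy the finished buffer to the adapter in a single burst, so each display-memory location is written only once:
;Build the image in fast system memory, then touch slow display memory
; exactly once per frame with a single block copy.
mov si,offset SystemBuffer ;source: back buffer in system memory
mov ax,0a000h ;destination: VGA display memory segment
mov es,ax
sub di,di
mov cx,SCREEN_BYTES/2 ;copy the whole frame as words
cld
rep movsw ;one pass over display memory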
The 286 and 386 offer a number of new instructions. The 286 has a relatively small number of instructions that the 8088 lacks, while the 386 has those instructions and quite a few more, along with new addressing modes and data sizes. We’ll discuss the 286 and the 386 separately in this regard.
The 286 has a number of instructions designed for protected-mode operations. As I’ve said, we’re not going to discuss protected mode in this book; in any case, protected-mode instructions are generally used only by operating systems. (I should mention that the 286’s protected mode brings with it the ability to address 16 MB of memory, a considerable improvement over the 8088’s 1 MB. In real mode, however, programs are still limited to 1 MB of addressable memory on the 286. In either mode, each segment is still limited to 64K.)
There are also a handful of 286-specific real-mode instructions, and they can be quite useful. BOUND checks array bounds. ENTER and LEAVE support compact and speedy stack frame construction and removal, ideal for interfacing to high-level languages such as C and Pascal (although these instructions are actually relatively slow on the 386 and its successors, and should be used with caution when performance matters). INS and OUTS are new string instructions that support efficient data transfer between memory and I/O ports. Finally, PUSHA and POPA push and pop all eight general-purpose registers.
A couple of old instructions gain new features on the 286. For one, the 286 version of PUSH is capable of pushing a constant on the stack. For another, the 286 allows all shifts and rotates to be performed for not just 1 bit or the number of bits specified by CL, but for any constant number of bits.
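A quick hypothetical fragment (not from the original listings) shows what those two additions buy you:
;On the 8088, a constant cannot be pushed directly, and a multi-bit
; shift count must be loaded into CL first:
mov ax,100h
push ax ;8088: push a constant by way of a register
mov cl,4
shl dx,cl ;8088: shift by more than 1 bit via CL
;On the 286, the same work takes fewer instructions and no scratch
; registers:
push 100h ;286: push an immediate value directly
shl dx,4 ;286: shift by any constant number of bits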
The 386 is somewhat more complex than the 286 regarding new features. Once again, we won’t discuss protected mode, which on the 386 comes with the ability to address up to 4 gigabytes per segment and 64 terabytes in all. In real mode (and in virtual-86 mode, which allows the 386 to multitask MS-DOS applications, and which is identical to real mode so far as MS-DOS programs are concerned), programs running on the 386 are still limited to 1 MB of addressable memory and 64K per segment.
The 386 has many new instructions, as well as new registers, addressing modes and data sizes that have trickled down from protected mode. Let’s take a quick look at these new real-mode features.
Even in real mode, it’s possible to access many of the 386’s new and extended registers. Most of these registers are simply 32-bit extensions of the 16-bit registers of the 8088. For example, EAX is a 32-bit register containing AX as its lower 16 bits, EBX is a 32-bit register containing BX as its lower 16 bits, and so on. There are also two new segment registers: FS and GS.
The 386 also comes with a slew of new real-mode instructions beyond those supported by the 8088 and 286. These instructions can scan data on a bit-by-bit basis, set the Carry flag to the value of a specified bit, sign-extend or zero-extend data as it’s moved, set a register or memory variable to 1 or 0 on the basis of any of the conditions that can be tested with conditional jumps, and more. (Again, beware: Many of these complex 386-specific instructions are slower than equivalent sequences of simple instructions on the 486 and especially on the Pentium.) What’s more, both old and new instructions support 32-bit operations on the 386. For example, it’s relatively simple to copy data in chunks of 4 bytes on a 386, even in real mode, by using the MOVSD (“move string double”) instruction, or to negate a 32-bit value with NEG EAX.
Finally, it’s possible in real mode to use the 386’s new addressing modes, in which any 32-bit general-purpose register or pair of registers can be used to address memory. What’s more, multiplication of memory-addressing registers by 2, 4, or 8 for look-ups in word, doubleword, or quadword tables can be built right into the memory addressing mode. (The 32-bit addressing modes are discussed further in later chapters.) In protected mode, these new addressing modes allow you to address a full 4 gigabytes per segment, but in real mode you’re still limited to 64K, even with 32-bit registers and the new addressing modes, unless you play some unorthodox tricks with the segment registers.
Note well: Those tricks don’t necessarily work with system software such as Windows, so I’d recommend against using them. If you want 4-gigabyte segments, use a 32-bit environment such as Win32.
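To make the new addressing modes concrete, here is a minimal hypothetical real-mode fragment (WordTable is an assumed name, and the .386 directive is assumed to be in effect; this is not from the original listings):
;386 real-mode addressing: a 32-bit register addresses memory, and the
; index is scaled by 2 right in the addressing mode for a word-table
; look-up. The effective address must still fit within 64K in real mode.
movzx ebx,bx ;zero-extend the 16-bit index into EBX
mov ax,WordTable[ebx*2] ;fetch the BXth word-sized table entry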
Let’s see what we’ve learned about 286/386 optimization. Mostly what we’ve learned is that our familiar PC cycle-eaters still apply, although in somewhat different forms, and that the major optimization rules for the PC hold true on ATs and 386-based computers. You won’t go wrong on any of these computers if you keep your instructions short, use the registers heavily and avoid memory, don’t branch, and avoid accessing display memory like the plague.
Although we haven’t touched on them, repeated string instructions are still desirable on the 286 and 386 since they provide a great deal of functionality per instruction byte and eliminate both the prefetch queue cycle-eater and branching. However, string instructions are not quite so spectacularly superior on the 286 and 386 as they are on the 8088 since non-string memory-accessing instructions have been speeded up considerably on the newer processors.
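To see why, compare a conventional copy loop with its string-instruction equivalent (a hypothetical fragment, not from the original listings):
;Copying CX words from DS:SI to ES:DI the conventional way: every word
; copied costs several instruction-fetch bytes plus a branch.
CopyLoop:
mov ax,[si]
mov es:[di],ax
add si,2
add di,2
dec cx
jnz CopyLoop
;The same copy as a repeated string instruction: REP MOVSW is only two
; bytes, never branches, and never drains the prefetch queue.
cld
rep movsw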
There’s one cycle-eater with new implications on the 286 and 386, and that’s the data alignment cycle-eater. From the data alignment cycle-eater we get a new rule: Word-align your word-sized variables, and start your subroutines at even addresses.
While the major 8088 optimization rules hold true on computers built around the 286 and 386, many of the instruction-specific optimizations no longer hold, for the execution times of most instructions are quite different on the 286 and 386 than on the 8088. We have already seen one such example of the sometimes vast difference between 8088 and 286/386 instruction execution times: MOV [WordVar],0, which has an Execution Unit execution time of 20 cycles on the 8088, has an EU execution time of just 3 cycles on the 286 and 2 cycles on the 386.
In fact, the performance of virtually all memory-accessing instructions has been improved enormously on the 286 and 386. The key to this improvement is the near elimination of effective address (EA) calculation time. Where an 8088 takes from 5 to 12 cycles to calculate an EA, a 286 or 386 usually takes no time whatsoever to perform the calculation. If a base+index+displacement addressing mode, such as MOV AX,[WordArray+bx+si], is used on a 286 or 386, 1 cycle is taken to perform the EA calculation, but that’s both the worst case and the only case in which there’s any EA overhead at all.
The elimination of EA calculation time means that the EU execution time of memory-addressing instructions is much closer to the EU execution time of register-only instructions. For instance, on the 8088 ADD [WordVar],100H is a 31-cycle instruction, while ADD DX,100H is a 4-cycle instruction—a ratio of nearly 8 to 1. By contrast, on the 286 ADD [WordVar],100H is a 7-cycle instruction, while ADD DX,100H is a 3-cycle instruction—a ratio of just 2.3 to 1.
It would seem, then, that it’s less necessary to use the registers on the 286 than it was on the 8088, but that’s simply not the case, for reasons we’ve already seen. The key is this: The 286 can execute memory-addressing instructions so fast that there’s no spare instruction prefetching time during those instructions, so the prefetch queue runs dry, especially on the AT, with its one-wait-state memory. On the AT, the 6-byte instruction ADD [WordVar],100H is effectively at least a 15-cycle instruction, because 3 cycles are needed to fetch each of the three instruction words and 6 more cycles are needed to read WordVar and write the result back to memory.
Granted, the register-only instruction ADD DX,100H also slows down—to 6 cycles—because of instruction prefetching, leaving a ratio of 2.5 to 1. Now, however, let’s look at the performance of the same code on an 8088. The register-only code would run in 16 cycles (4 instruction bytes at 4 cycles per byte), while the memory-accessing code would run in 40 cycles (6 instruction bytes at 4 cycles per byte, plus 2 word-sized memory accesses at 8 cycles per word). That’s a ratio of 2.5 to 1, exactly the same as on the 286.
This is all theoretical. We put our trust not in theory but in actual performance, so let’s run this code through the Zen timer. On a PC, Listing 11.4, which performs register-only addition, runs in 3.62 ms, while Listing 11.5, which performs addition to a memory variable, runs in 10.05 ms. On a 10 MHz AT clone, Listing 11.4 runs in 0.64 ms, while Listing 11.5 runs in 1.80 ms. Obviously, the AT is much faster…but the ratio of Listing 11.5 to Listing 11.4 is virtually identical on both computers, at 2.78 for the PC and 2.81 for the AT. If anything, the register-only form of ADD has a slightly larger advantage on the AT than it does on the PC in this case.
Theory confirmed.
LISTING 11.4 L11-4.ASM
;
; *** Listing 11.4 ***
;
; Measures the performance of adding an immediate value
; to a register, for comparison with Listing 11.5, which
; adds an immediate value to a memory variable.
;
call ZTimerOn
rept 1000
add dx,100h
endm
call ZTimerOff
LISTING 11.5 L11-5.ASM
;
; *** Listing 11.5 ***
;
; Measures the performance of adding an immediate value
; to a memory variable, for comparison with Listing 11.4,
; which adds an immediate value to a register.
;
jmp Skip
;
even ;always make sure word-sized memory
; variables are word-aligned!
WordVar dw 0
;
Skip:
call ZTimerOn
rept 1000
add [WordVar],100h
endm
call ZTimerOff
What’s going on? Simply this: Instruction fetching is controlling overall execution time on both processors. Both the 8088 in a PC and the 286 in an AT can execute the bytes of the instructions in Listings 11.4 and 11.5 faster than they can be fetched. Since the instructions are exactly the same lengths on both processors, it stands to reason that the ratio of the overall execution times of the instructions should be the same on both processors as well. Instruction length controls execution time, and the instruction lengths are the same—therefore the ratios of the execution times are the same. The 286 can both fetch and execute instruction bytes faster than the 8088 can, so code executes much faster on the 286; nonetheless, because the 286 can also execute those instruction bytes much faster than it can fetch them, overall performance is still largely determined by the size of the instructions.
Is this always the case? No. When the prefetch queue is full, memory-accessing instructions on the 286 and 386 are much faster (relative to register-only instructions) than they are on the 8088. Given the system wait states prevalent on 286 and 386 computers, however, the prefetch queue is likely to be empty quite a bit, especially when code consisting of instructions with short EU execution times is executed. Of course, that’s just the sort of code we’re likely to write when we’re optimizing, so the performance of high-speed code is more likely to be controlled by instruction size than by EU execution time on most 286 and 386 computers, just as it is on the PC.
All of which is just a way of saying that faster memory access and EA calculation notwithstanding, it’s just as desirable to keep instructions short and memory accesses to a minimum on the 286 and 386 as it is on the 8088. And the way to do that is to use the registers as heavily as possible, use string instructions, use short forms of instructions, and the like.
The more things change, the more they remain the same….
We’ve one final 286-related item to discuss: the hardware malfunction
of POPF
under certain circumstances on the 286.
The problem is this: Sometimes POPF
permits interrupts
to occur when interrupts are initially off and the setting popped into
the Interrupt flag from the stack keeps interrupts off. In other words,
an interrupt can happen even though the Interrupt flag is never set to
1. Now, I don’t want to blow this particular bug out of proportion. It
only causes problems in code that cannot tolerate interrupts under any
circumstances, and that’s a rare sort of code, especially in user
programs. However, some code really does need to have interrupts
absolutely disabled, with no chance of an interrupt sneaking through.
For example, a critical portion of a disk BIOS might need to retrieve
data from the disk controller the instant it becomes available; even a
few hundred microseconds of delay could result in a sector’s worth of
data misread. In this case, one misplaced interrupt during a
POPF
could result in a trashed hard disk if that interrupt
occurs while the disk BIOS is reading a sector of the File Allocation
Table.
There is a workaround for the POPF
bug. While the
workaround is easy to use, it’s considerably slower than
POPF
, and costs a few bytes as well, so you won’t want to
use it in code that can tolerate interrupts. On the other hand, in code
that truly cannot be interrupted, you should view those extra cycles and
bytes as cheap insurance against mysterious and erratic program
crashes.
One obvious reason to discuss the POPF
workaround is
that it’s useful. Another reason is that the workaround is an excellent
example of Zen-level assembly coding, in that there’s a well-defined
goal to be achieved but no obvious way to do so. The goal is to
reproduce the functionality of the POPF
instruction without
using POPF
, and the place to start is by asking exactly
what POPF
does.
All POPF
does is pop the word on top of the stack into
the FLAGS register, as shown in Figure 11.4. How can we do that without
POPF
? Of course, the 286’s designers intended us to use
POPF
for this purpose, and didn’t intentionally provide any
alternative approach, so we’ll have to devise an alternative approach of
our own. To do that, we’ll have to search for instructions that contain
some of the same functionality as POPF
, in the hope that
one of those instructions can be used in some way to replace
POPF
.
Well, there’s only one instruction other than POPF
that
loads the FLAGS register directly from the stack, and that’s
IRET
, which loads the FLAGS register from the stack as it
branches, as shown in Figure 11.5. IRET has no known bugs of the sort
that plague POPF
, so it’s certainly a candidate to replace
POPF in non-interruptible applications. Unfortunately, IRET
loads the FLAGS register with the third word down on the stack,
not the word on top of the stack, as is the case with POPF
;
the far return address that IRET
pops into CS:IP lies
between the top of the stack and the word popped into the FLAGS
register.
Obviously, the segment:offset that IRET
expects to find
on the stack above the pushed flags isn’t present when the stack is set
up for POPF
, so we’ll have to adjust the stack a bit before
we can substitute IRET
for POPF
. What we’ll
have to do is push the segment:offset of the instruction after our
workaround code onto the stack right above the pushed flags.
IRET
will then branch to that address and pop the flags,
ending up at the instruction after the workaround code with the flags
popped. That’s just the result that would have occurred had we executed
POPF
—with the bonus that no interrupts can accidentally
occur when the Interrupt flag is 0 both before and after the pop.
How can we push the segment:offset of the next instruction? Well,
finding the offset of the next instruction by performing a near call to
that instruction is a tried-and-true trick. We can do something similar
here, but in this case we need a far call, since IRET
requires both a segment and an offset. We’ll also branch backward so
that the address pushed on the stack will point to the instruction we
want to continue with. The code works out like this:
jmp short popfskip
popfiret:
iret ;branches to the instruction after the
; call, popping the word below the address
; pushed by CALL into the FLAGS register
popfskip:
call far ptr popfiret
;pushes the segment:offset of the next
; instruction on the stack just above
; the flags word, setting things up so
; that IRET will branch to the next
; instruction and pop the flags
; When execution reaches the instruction following this comment,
; the word that was on top of the stack when JMP SHORT POPFSKIP
; was reached has been popped into the FLAGS register, just as
; if a POPF instruction had been executed.
The operation of this code is illustrated in Figure 11.6.
The POPF
workaround can best be implemented as a macro;
we can also emulate a far call by pushing CS and performing a near call,
thereby shrinking the workaround code by 1 byte:
EMULATE_POPF macro
local popfskip,popfiret
jmp short popfskip
popfiret:
iret
popfskip:
push cs
call popfiret
endm
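For what it's worth, here's a minimal usage sketch (the critical-section body is hypothetical) showing the macro dropped into the familiar PUSHF/CLI/POPF pattern:
pushf ;preserve the caller's Interrupt flag setting
cli ;make sure interrupts are off
;...code that must not be interrupted goes here...
EMULATE_POPF ;restore the flags; unlike POPF, this can't
             ; let an interrupt slip through while the
             ; Interrupt flag is supposed to be 0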
By the way, the flags can be popped much more quickly if you’re
willing to alter a register in the process. For example, the following
macro emulates POPF
with just one branch, but wipes out
AX:
EMULATE_POPF_TRASH_AX macro
push cs
mov ax,offset $+5
push ax
iret
endm
It’s not a perfect substitute for POPF
, since
POPF
doesn’t alter any registers, but it’s faster and
shorter than EMULATE_POPF
when you can spare the register.
If you’re using 286-specific instructions, you can use the following version, which is shorter still, alters no registers, and branches just once. (Of course, this version of EMULATE_POPF won’t work on an 8088.)
.286
EMULATE_POPF macro
push cs
push offset $+4
iret
endm
The standard version of EMULATE_POPF
is 6 bytes longer
than POPF
and much slower, as you’d expect given that it
involves three branches. Anyone in his/her right mind would prefer
POPF
to a larger, slower, three-branch macro—given a
choice. In code that truly cannot be interrupted, however, there’s no choice; the safer—if
slower—approach is the best. (Having people associate your programs with
crashed computers is not a desirable situation, no matter how
unfair the circumstances under which it occurs.)
And now you know the nature of and the workaround for the
POPF
bug. Whether you ever need the workaround or not, it’s
a neatly packaged example of the tremendous flexibility of the x86
instruction set.
So this traveling salesman is walking down a road, and he sees a group of men digging a ditch with their bare hands. “Whoa, there!” he says. “What you guys need is a Model 8088 ditch digger!” And he whips out a trowel and sells it to them.
A few days later, he stops back around. They’re happy with the trowel, but he sells them the latest ditch-digging technology, the Model 80286 spade. That keeps them content until he stops by again with a Model 80386 shovel (a full 32 inches wide, with a narrow point to emulate the trowel), and that holds them until he comes back around with what they really need: a Model 80486 bulldozer.
Having reached the top of the line, the salesman doesn’t pay them a call for a while. When he does, not only are they none too friendly, but they’re digging with the 80386 shovel; the bulldozer is sitting off to one side. “Why on earth are you using that shovel?” the salesman asks. “Why aren’t you digging with the bulldozer?”
“Well, Lord knows we tried,” says the foreman, “but it was all we could do just to lift the damn thing!”
Substitute “processor” for the various digging implements, and you
get an idea of just how different the optimization rules for the 486 are
from what you’re used to. Okay, it’s not quite that bad—but
upon encountering a processor where string instructions are often to be
avoided and memory-to-register MOV
s are frequently as fast
as register-to-register MOV
s, Dorothy was heard to exclaim
(before she sank out of sight in a swirl of hopelessly mixed metaphors),
“I don’t think we’re in Kansas anymore, Toto.”
No chip that is a direct, fully compatible descendant of the 8088,
286, and 386 could ever be called a RISC chip, but the 486 certainly
contains RISC elements, and it’s those elements that are most
responsible for making 486 optimization unique. Simple, common
instructions are executed in a single cycle by a RISC-like core
processor, but other instructions are executed pretty much as they were
on the 386, where every instruction takes at least 2 cycles. For
example, MOV AL, [TestChar]
takes only 1 cycle on the 486,
assuming both instruction and data are in the cache—3 cycles faster than
the 386—but STOSB
takes 5 cycles, 1 cycle slower
than on the 386. The floating-point execution unit inside the 486 is
also much faster than the 387 math coprocessor, largely because, being
in the same silicon as the CPU (the 486 has a math coprocessor built
in), it is more tightly coupled. The results are sometimes startling:
FMUL
(floating point multiply) is usually faster on the 486
than IMUL
(integer multiply)!
An encyclopedic approach to 486 optimization would take a book all by itself, so in this chapter I’m only going to hit the highlights of 486 optimization, touching on several optimization rules, some documented, some not. You might also want to check out the following sources of 486 information: i486 Microprocessor Programmer’s Reference Manual, from Intel; “8086 Optimization: Aim Down the Middle and Pray,” in the March, 1991 Dr. Dobb’s Journal; and “Peak Performance: On to the 486,” in the November, 1990 Programmer’s Journal.
In Appendix G of the i486 Microprocessor Programmer’s Reference Manual, Intel lists a number of optimization techniques for the 486. While neither exhaustive (we’ll look at two undocumented optimizations shortly) nor entirely accurate (we’ll correct two of the rules here), Intel’s list is certainly a good starting point. In particular, the list conveys the extent to which 486 optimization differs from optimization for earlier x86 processors. Generally, I’ll be discussing optimization for real mode (it being the most widely used mode at the moment), although many of the rules should apply to protected mode as well.
486 optimization is generally more precise and less frustrating than optimization for other x86 processors because every 486 has an identical internal cache. Whenever both the instructions being executed and the data the instructions access are in the cache, those instructions will run in a consistent and calculable number of cycles on all 486s, with little chance of interference from the prefetch queue and without regard to the speed of external memory.
In other words, for cached code (which time-critical code almost always is), performance is predictable and can be calculated with good precision, and those calculations will apply on any 486. However, “predictable” doesn’t mean “trivial”; the cycle times printed for the various instructions are not the whole story. You must be aware of all the rules, documented and undocumented, that go into calculating actual execution times—and uncovering some of those rules is exactly what this chapter is about.
Rule #1: Avoid indexed addressing (that is, try not to use either two registers or scaled addressing to point to memory).
Intel cautions against using indexing to address memory because there’s a one-cycle penalty for indexed addressing. True enough—but “indexed addressing” might not mean what you expect.
Traditionally, SI and DI are considered the index registers of the
x86 CPUs. That is not the sense in which “indexed addressing” is meant
here, however. In real mode, indexed addressing means that two
registers, rather than one or none, are used to point to memory. (In
this context, the use of one register to address memory is “base
addressing,” no matter what register is used.)
MOV AX, [BX+DI]
and MOV CL, [BP+SI+10]
perform
indexed addressing; MOV AX,[BX]
and
MOV DL, [SI+1]
do not.
Therefore, in real mode, the rule is to avoid using two registers to point to memory whenever possible. Often, this simply means adding the two registers together outside a loop before memory is actually addressed.
As an example, you might adhere to this rule by replacing the code
LoopTop:
add ax,[bx+si]
add si,2
dec cx
jnz LoopTop
with this
add si,bx
LoopTop:
add ax,[si]
add si,2
dec cx
jnz LoopTop
sub si,bx
which calculates the same sum and leaves the registers in the same state as the first example, but avoids indexed addressing.
In protected mode, the definition of indexed addressing is a tad more
complex. The use of two registers to address memory, as in
MOV EAX, [EDX+EDI]
, still qualifies for the one-cycle
penalty. In addition, the use of 386/486 scaled addressing, as in
MOV [ECX*2],EAX
, also constitutes indexed addressing, even
if only one register is used to point to memory.
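As a hedged illustration (Table, ScanLoop1, and ScanLoop2 are hypothetical, and the loops simply sum a dword array), the scaled form inside a loop can usually be traded for a pointer stepped by the element size, so that each memory reference drops back to plain base addressing:
ScanLoop1:
add eax,[Table+ecx*4] ;scaled addressing: 1-cycle penalty per pass
inc ecx
dec edx
jnz ScanLoop1
versus
mov ebx,offset Table ;hoist the pointer outside the loop
ScanLoop2:
add eax,[ebx] ;base-only addressing: no penalty
add ebx,4 ;step by the element size instead
dec edx
jnz ScanLoop2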
All this fuss over one cycle! You might well wonder how much
difference one cycle could make. After all, on the 8088, effective
address calculations take a minimum of 5 cycles. On the 486,
however, 1 cycle is a big deal because many instructions, including most
register-only instructions (MOV
, ADD
,
CMP
, and so on) execute in just 1 cycle. In particular,
MOV
s to and from memory execute in 1 cycle—if they’re not
hampered by something like indexed addressing, in which case they slow
to half speed (or worse, as we will see shortly).
For example, consider the summing example shown earlier. The version that uses base+index ([BX+SI]) addressing executes in eight cycles per loop. As expected, the version that uses base ([SI]) addressing runs one cycle faster, at seven cycles per loop. However, the loop code executes so fast on the 486 that the single cycle saved by using base addressing makes the whole loop more than 14 percent faster.
In a key loop on the 486, 1 cycle can indeed matter.
Rule #2: Don’t use a register as a memory pointer during the next two cycles after loading it.
Intel states that if the destination of one instruction is used as the base addressing component of the next instruction, then a one-cycle penalty is imposed. This rule, unlike anything ever before seen in the x86 family, reflects the heavily pipelined nature of the 486. Apparently, the 486 starts each effective address calculation before the start of the instruction that will need it, as shown in Figure 12.1; this effectively makes the address calculation time vanish, because it happens while the preceding instruction executes.
Of course, the 486 can’t perform an effective address calculation for a target instruction ahead of time if one of the address components isn’t known until the instruction starts, and that’s exactly the case when the preceding instruction modifies one of the target instruction’s addressing registers. For example, in the code
MOV BX,OFFSET MemVar
MOV AX,[BX]
there’s no way that the 486 can calculate the address referenced by
MOV AX,[BX]
until MOV BX,OFFSET MemVar
finishes, so pipelining that calculation ahead of time is not possible.
A good workaround is rearranging your code so that at least one
instruction lies between the loading of the memory pointer and its use.
For example, postincrementing, as in the following
LoopTop:
add ax,[si]
add si,2
dec cx
jnz LoopTop
is faster than preincrementing, as in:
LoopTop:
add si,2
add ax,[SI]
dec cx
jnz LoopTop
Now that we understand what Intel means by this rule, let me make a very important comment: My observations indicate that for real-mode code, the documentation understates the extent of the penalty for interrupting the address calculation pipeline by loading a memory pointer just before it’s used.
The truth of the matter appears to be that if a register is the destination of one instruction and is then used by the next instruction to address memory in real mode, not one but two cycles are lost!
In 32-bit protected mode, however, the penalty is, in fact, the 1 cycle that Intel documents.
Considering that MOV
normally takes only one cycle
total, that’s quite a loss. For example, the postdecrement loop shown
above is 2 full cycles faster than the preincrement loop, resulting in a
29 percent improvement in the performance of the entire loop. But wait,
there’s more. If a register is loaded 2 cycles (which generally means 2
instructions, but, because some 486 instructions take more than 1
cycle,
the 2 are not always equivalent) before it’s used to point to memory, 1 cycle is lost. Therefore, whereas this code
mov bx,offset MemVar
mov ax,[bx]
inc dx
dec cx
jnz LoopTop
loses two cycles from interrupting the address calculation pipeline, this code
mov bx,offset MemVar
inc dx
mov ax,[bx]
dec cx
jnz LoopTop
loses only one cycle, and this code
mov bx,offset MemVar
inc dx
dec cx
mov ax,[bx]
jnz LoopTop
loses no cycles at all. Apparently, the 486’s addressing calculation pipeline actually starts 2 cycles ahead, as shown in Figure 12.2. (In truth, my best guess at the moment is that the addressing pipeline really does start only 1 cycle ahead; the additional cycle crops up when the addressing pipeline has to wait for a register to be written into the register file before it can read it out for use in addressing calculations. However, I’m guessing here, and the 2-cycle-ahead model in Figure 12.2 will do just fine for optimization purposes.)
Clearly, there’s considerable optimization potential in careful rearrangement of 486 code.
A caution: I’m quite certain that the 2-cycle-ahead addressing pipeline interruption penalty I’ve described exists in the two 486s I’ve tested. However, there’s no guarantee that Intel won’t change this aspect of the 486 in the future, especially given that the documentation indicates otherwise. Perhaps the 2-cycle penalty is the result of a bug in the initial steppings of the 486, and will revert to the documented 1-cycle penalty someday; likewise for the undocumented optimizations I’ll describe below. Nonetheless, none of the optimizations I suggest would hurt performance even if the undocumented performance characteristics of the 486 were to vanish, and they certainly will help performance on at least some 486s right now, so I feel they’re well worth using.
There is, of course, no guarantee that I’m entirely correct about the optimizations discussed in this chapter. Without knowing the internals of the 486, all I can do is time code and make inferences from the results; I invite you to deduce your own rules and cross-check them against mine. Also, most likely there are other optimizations that I’m unaware of. If you have further information on these or any other undocumented optimizations, please write and let me know. And, of course, if anyone from Intel is reading this and wants to give us the gospel truth, please do!
Rule #2A: Rule #2 sometimes, but not always, applies to the stack pointer when it is implicitly used to point to memory.
Intel states that the stack pointer is an implied destination
register for CALL
, ENTER
, LEAVE
,
RET
, PUSH
, and POP
(which alter
(E)SP), and that it is the implied base addressing register for
PUSH
, POP
, and RET
(which use
(E)SP to address memory). Intel then implies that the aforementioned
addressing pipeline penalty is incurred whenever the stack pointer is
used as a destination by one of the first set of instructions and is
then immediately used to address memory by one of the second set. This
raises the specter of unpleasant programming contortions such as
intermixing PUSH
es and POP
s with other
instructions to avoid interrupting the addressing pipeline. Fortunately,
matters are actually not so grim as Intel’s documentation would
indicate; my tests indicate that the addressing pipeline penalty pops up
only spottily when the stack pointer is involved.
For example, you’d certainly expect a sequence such as
:
pop ax
ret
pop ax
ret
:
to exhibit the addressing pipeline interruption phenomenon (SP is
both destination and addressing register for both instructions,
according to Intel), but this code runs in six cycles per
POP/RET
pair, matching the official execution times
exactly. Likewise, a sequence like
pop dx
pop cx
pop bx
pop ax
runs in one cycle per instruction, just as it should.
On the other hand, performing arithmetic directly on SP as an
explicit destination—for example, to deallocate local
variables—and then using PUSH
, POP
, or
RET
, definitely can interrupt the addressing pipeline. For
example
add sp,10h
ret
loses two cycles because SP is the explicit destination of one instruction and then the implied addressing register for the next, and the sequence
add sp,10h
pop ax
loses two cycles for the same reason.
I certainly haven’t tried all possible combinations, but the results
so far indicate that the stack pointer incurs the addressing pipeline
penalty only if (E)SP is the explicit destination of one
instruction and is then used by one of the two following instructions to
address memory. So, for instance, SP isn’t the explicit operand of
POP AX
-AX is—and no cycles are lost if POP AX
is followed by POP
or RET
. Happily, then, we
need not worry about the sequence in which we use PUSH
and
POP
. However, adding to, moving to, or subtracting from the
stack pointer should ideally be done at least two cycles before
PUSH
, POP
, RET
, or any other
instruction that uses the stack pointer to address memory.
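Here's a hedged sketch of what that scheduling looks like in an epilog; the sequence ADD SP,10H/RET shown above loses two cycles, but with two single-cycle instructions moved in between (MOV AX,DX and MOV BX,CX stand in for whatever useful register-only work the routine has available), the penalty should vanish:
add sp,10h ;deallocate local variables
mov ax,dx ;useful single-cycle, register-only work...
mov bx,cx ; ...fills the two-cycle gap
ret ;SP was written 3 cycles back: no penalty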
There are two ways to lose cycles by using byte registers, and neither of them is documented by Intel, so far as I know. Let’s start with the lesser and simpler of the two.
Rule #3: Do not load a byte portion of a register during one instruction, then use that register in its entirety as a source register during the next instruction.
So, for example, it would be a bad idea to do this
mov ah,0
:
mov cx,[MemVar1]
mov al,[MemVar2]
add cx,ax
because AL is loaded by one instruction, then AX is used as the source register for the next instruction. A cycle can be saved simply by rearranging the instructions so that the byte register load isn’t immediately followed by the word register usage, like so:
mov ah,0
:
mov al,[MemVar2]
mov cx,[MemVar1]
add cx,ax
Strange as it may seem, this rule is neither arbitrary nor nonsensical. Basically, when a byte destination register is part of a word source register for the next instruction, the 486 is unable to directly use the result from the first instruction as the source for the second instruction, because only part of the register required by the second instruction is contained in the first instruction’s result. The full, updated register value must be read from the register file, and that value can’t be read out until the result from the first instruction has been written into the register file, a process that takes an extra cycle. I’m not going to explain this in great detail because it’s not important that you understand why this rule exists (only that it does in fact exist), but it is an interesting window on the way the 486 works.
In case you’re curious, there’s no such penalty for the typical
XLAT
sequence like
mov bx,offset MemTable
:
mov al,[si]
xlat
even though AL must be converted to a word by XLAT
before it can be added to BX and used to address memory. In fact, none
of the penalties mentioned in this chapter apply to XLAT
,
apparently because XLAT
is so slow—4 cycles—that it gives
the 486 time to perform addressing calculations during the course of the
instruction.
While it’s nice that XLAT doesn’t suffer from the various 486 addressing penalties, the reason for that is basically that XLAT is slow, so there’s still no compelling reason to use XLAT on the 486.
In general, penalties for interrupting the 486’s pipeline apply
primarily to the fast core instructions of the 486, most notably
register-only instructions and MOV
, although arithmetic and
logical operations that access memory are also often affected. I don’t
know all the performance dependencies, and I don’t plan to; figuring all
of them out would be a big, boring job of little value. Basically, on
the 486 you should concentrate on using those fast core instructions
when performance matters, and all the rules I’ll discuss do indeed apply
to those instructions.
You don’t need to understand every corner of the 486 universe unless you’re a diehard ASMhead who does this stuff for fun. Just learn enough to be able to speed up the key portions of your programs, and spend the rest of your time on a fast design and overall implementation.
Rule #4: Don’t load any byte register exactly 2 cycles before using any register to address memory.
This, the last of this chapter’s rules, is the strangest of the lot. If any byte register is loaded, and then two cycles later any register is used to point to memory, one cycle is lost. So, for example, this code
mov al,bl
mov cx,dx
mov si,[di]
takes four rather than the expected three cycles to execute. Note that it is not required that the byte register be part of the register used to address memory; any byte register will do the trick.
Worse still, loading byte registers both one and two cycles before a register is used to address memory costs two cycles, as in
mov bl,al
mov cl,3
mov bx,[si]
which takes five rather than three cycles to run. However, there is no penalty if a byte register is loaded one cycle but not two cycles before a register is used to address memory. Therefore,
mov cx,3
mov dl,al
mov si,[bx]
runs in the expected three cycles.
In truth, I do not know why this happens. Clearly, it has something to do with interrupting the start of the addressing pipeline, and I have my theories about how this works, but at this point they’re pure speculation. Whatever the reason for this rule, ignorance of it—and of its interaction with the other rules—could lead to considerable performance loss in seemingly air-tight code. For instance, a casual observer would expect the following code to run in 3 cycles:
mov bx,offset MemVar
mov cl,al
mov ax,[bx]
A more sophisticated programmer would expect to lose one cycle,
because BX is loaded two cycles before being used to address memory. In
fact, though, this code takes 5 cycles—2 cycles, or 67 percent, longer
than normal. Why? Well, under normal conditions, loading a byte
register—CL in this case—one cycle before using a register to address
memory produces no penalty; loading 2 cycles ahead is the only case that
normally incurs a penalty. However, think of Rule #4 as meaning that
loading a byte register disrupts the memory addressing pipeline as it
starts up. Viewed that way, we can see that
MOV BX,OFFSET MemVar
interrupts the addressing pipeline,
forcing it to start again, and then, presumably, MOV CL,AL
interrupts the pipeline again because the pipeline is now on its first
cycle: the one that loading a byte register can affect.
I know—it seems awfully complicated. It isn’t, really. Generally, try not to use byte destinations exactly two cycles before using a register to address memory, and try not to load a register either one or two cycles before using it to address memory, and you’ll be fine.
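To see how the rules combine, here's a hedged reworking of the 5-cycle sequence above (MOV DX,SI stands in for any useful single-cycle instruction you can schedule there); by the rules in this chapter it should run in the expected one cycle per instruction:
mov bx,offset MemVar ;load the memory pointer early
mov dx,si ;unrelated single-cycle work
mov cl,al ;byte load only 1 cycle ahead: no penalty
mov ax,[bx] ;BX was loaded 3 cycles back: no penalty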
In case you want to do some 486 performance analysis of your own, let me show you how I arrived at one of the above conclusions; at the same time, I can warn you of the timing hazards of the cache. Listings 12.1 and 12.2 show the code I ran through the Zen timer in order to establish the effects of loading a byte register before using a register to address memory. Listing 12.1 ran in 120 µs on a 33 MHz 486, or 4 cycles per repetition (120 µs/1000 repetitions = 120 ns per repetition; 120 ns per repetition/30 ns per cycle = 4 cycles per repetition); Listing 12.2 ran in 90 µs, or 3 cycles, establishing that loading a byte register costs a cycle only when it’s performed exactly 2 cycles before addressing memory.
LISTING 12.1 LST12-1.ASM
; Measures the effect of loading a byte register 2 cycles before
; using a register to address memory.
mov bp,2 ;run the test code twice to make sure
; it's cached
sub bx,bx
CacheFillLoop:
call ZTimerOn ;start timing
rept 1000
mov dl,cl
nop
mov ax,[bx]
endm
call ZTimerOff ;stop timing
dec bp
jz Done
jmp CacheFillLoop
Done:
LISTING 12.2 LST12-2.ASM
; Measures the effect of loading a byte register 1 cycle before
; using a register to address memory.
mov bp,2 ;run the test code twice to make sure
; it's cached
sub bx,bx
CacheFillLoop:
call ZTimerOn ;start timing
rept 1000
nop
mov dl,cl
mov ax,[bx]
endm
call ZTimerOff ;stop timing
dec bp
jz Done
jmp CacheFillLoop
Done:
Note that Listings 12.1 and 12.2 each repeat the timing of the code
under test a second time, to make sure that the instructions are in the
cache on the second pass, the one for which results are displayed. Also
note that the code is less than 8K in size, so that it can all fit in
the 486’s 8K internal cache. If I double the REPT
value in
Listing 12.2 to 2,000, making the test code larger than 8K, the
execution time more than doubles to 224 µs, or 3.7 cycles per
repetition; the extra seven-tenths of a cycle comes from fetching
non-cached instruction bytes.
Whenever you see non-integral timing results of this sort, it’s a good bet that the test code or data isn’t cached.
There’s certainly plenty more 486 lore to explore, including the 486’s unique prefetch queue, more optimization rules, branching optimizations, performance implications of the cache, the cost of cache misses for reads, and the implications of cache write-through for writes. Nonetheless, we’ve covered quite a bit of ground in this chapter, and I trust you’ve gotten a feel for the considerable extent to which 486 optimization differs from what you’re used to. Odd as 486 optimization is, though, it’s well worth mastering, for the 486 is, at its best, so staggeringly fast that carefully crafted 486 code can do more than twice as much per cycle as the best 386 code—which makes it perhaps 50 times as fast as optimized code for the original PC.
Sometimes it is hard to believe we’re still in Kansas!
It’s a sad but true fact that 84 percent of American schoolchildren are ignorant of 92 percent of American history. Not my daughter, though. We recently visited historical Revolutionary-War-vintage Fort Ticonderoga, and she’s now 97 percent aware of a key element of our national heritage: that the basic uniform for soldiers in those days was what appears to be underwear, plus a hat so that no one could complain that they were undermining family values. Ha! Just kidding! Actually, what she learned was that in those days, it was pure coincidence if a cannonball actually hit anything it was aimed at, which isn’t surprising considering the lack of rifling, precision parts, and ballistics. The guides at the fort shot off three cannons; the closest they came to the target was about 50 feet, and that was only because the wind helped. I think the idea in early wars was just to put so much lead in the air that some of it was bound to hit something; preferably, but not necessarily, the enemy.
Nowadays, of course, we have automatic weapons that allow a teenager to singlehandedly defeat the entire U.S. Army, not to mention so-called “smart” bombs, which are smart in the sense that they can seek out and empty a taxpayer’s wallet without being detected by radar. There’s an obvious lesson here about progress, which I leave you to deduce for yourselves.
Here’s the same lesson, in another form. Ten years ago, we had a slow processor, the 8088, for which it was devilishly hard to optimize, and for which there was no good optimization documentation available. Now we have a processor, the 486, that’s 50 to 100 times faster than the 8088—and for which there is no good optimization documentation available. Sure, Intel provides a few tidbits on optimization in the back of the i486 Microprocessor Programmer’s Reference Manual, but, as I discussed in Chapter 12, that information is both incomplete and not entirely correct. Besides, most assembly language programmers don’t bother to read Intel’s manuals (which are extremely informative and well done, but only slightly more fun to read than the phone book), and go right on programming the 486 using outdated 8088 optimization techniques, blissfully unaware of a new and heavily mutated generation of cycle-eaters that interact with their code in ways undreamt of even on the 386.
For example, consider how Terje Mathisen doubled the speed of his word-counting program on a 486 simply by shuffling a couple of instructions.
I’ve mentioned Terje Mathisen in my writings before. Terje is an assembly language programmer extraordinaire, and author of the incredibly fast public-domain word-counting program WC (which comes complete with source code; well worth a look, if you want to see what really fast code looks like). Terje’s a regular participant in the ibm.pc/fast.code topic on Bix. In a thread titled “486 Pipeline Optimization, or TANSTATFC (There Ain’t No Such Thing As The Fastest Code),” he detailed the following optimization to WC, perhaps the best example of 486 pipeline optimization I’ve yet seen.
Terje’s inner loop originally looked something like the code in Listing 13.1. (I’ve taken a few liberties for illustrative purposes.) Of course, Terje unrolls this loop a few times (128 times, to be exact). By the way, in Listing 13.1 you’ll notice that Terje counts not only words but also lines, at a rate of three instructions for every two characters!
LISTING 13.1 L13-1.ASM
mov di,[bp+OFFS] ;get the next pair of characters
mov bl,[di] ;get the state value for the pair
add dx,[bx+8000h] ;increment word and line count
; appropriately for the pair
Listing 13.1 looks as tight as it could be, with just two one-cycle instructions, one two-cycle instruction, and no branches. It is tight, but those three instructions actually take a minimum of 8 cycles to execute, as shown in Figure 13.1. The problem is that DI is loaded just before being used to address memory, and that costs 2 cycles because it interrupts the 486’s internal instruction pipeline. Likewise, BX is loaded just before being used to address memory, costing another two cycles. Thus, this loop takes twice as long as cycle counts would seem to indicate, simply because two registers are loaded immediately before being used, disrupting the 486’s pipeline.
Listing 13.2 shows Terje’s immediate response to these pipelining problems; he simply swapped the instructions that load DI and BL. This one change cut execution time per character pair from eight cycles to five cycles! The load of BL is now separated by one instruction from the use of BX to address memory, so the pipeline penalty is reduced from two cycles to one cycle. The load of DI is also separated by one instruction from the use of DI to address memory (remember, the loop is unrolled, so the last instruction is followed by the first instruction), but because the intervening instruction takes two cycles, there’s no penalty at all.
Remember, pipeline penalties diminish with increasing number of cycles, not instructions, between the pipeline disrupter and the potentially affected instruction.
LISTING 13.2 L13-2.ASM
mov bl,[di] ;get the state value for the pair
mov di,[bp+OFFS] ;get the next pair of characters
add dx,[bx+8000h] ;increment word and line count
; appropriately for the pair
At this point, Terje had nearly doubled the performance of this code simply by moving one instruction. (Note that swapping the instructions also made it necessary to preload DI at the start of the loop; Listing 13.2 is not exactly equivalent to Listing 13.1.) I’ll let Terje describe his next optimization in his own words:
“When I looked closely at this, I realized that the two cycles for
the final ADD
is just the sum of 1 cycle to load the data
from memory, and 1 cycle to add it to DX, so the code could just as well
have been written as shown in Listing 13.3. The final breakthrough came
when I realized that by initializing AX to zero outside the loop, I
could rearrange it as shown in Listing 13.4 and do the final
ADD DX,AX
after the loop. This way there are two
single-cycle instructions between the first and the fourth line,
avoiding all pipeline stalls, for a total throughput of two
cycles/char.”
LISTING 13.3 L13-3.ASM
mov bl,[di] ;get the state value for the pair
mov di,[bp+OFFS] ;get the next pair of characters
mov ax,[bx+8000h] ;increment word and line count
add dx,ax ; appropriately for the pair
LISTING 13.4 L13-4.ASM
mov bl,[di] ;get the state value for the pair
mov di,[bp+OFFS] ;get the next pair of characters
add dx,ax ;increment word and line count
; appropriately for the pair
mov ax,[bx+8000h]