Michael Abrash’s Graphics Programming Black Book, Special Edition

Michael Abrash

Introduction

What was it like working with John Carmack on Quake? Like being strapped onto a rocket during takeoff—in the middle of a hurricane. It seemed like the whole world was watching, waiting to see if id Software could top Doom; every casual e-mail tidbit or conversation with a visitor ended up posted on the Internet within hours. And meanwhile, we were pouring everything we had into Quake’s technology; I’d often come in in the morning to find John still there, working on a new idea so intriguing that he couldn’t bear to sleep until he had tried it out. Toward the end, when I spent most of my time speeding things up, I would spend the day in a trance writing optimized assembly code, stagger out of the Town East Tower into the blazing Texas heat, and somehow drive home on LBJ Freeway without smacking into any of the speeding pickups whizzing past me on both sides. At home, I’d fall into a fitful sleep, then come back the next day in a daze and do it again. Everything happened so fast, and under so much pressure, that sometimes I wonder how any of us made it through that without completely burning out.

At the same time, of course, it was tremendously exciting. John’s ideas were endless and brilliant, and Quake ended up establishing a new standard for Internet and first-person 3-D game technology. Happily, id has an enlightened attitude about sharing information, and was willing to let me write about the Quake technology—both how it worked and how it evolved. Over the two years I worked at id, I wrote a number of columns about Quake in Dr. Dobb’s Sourcebook, as well as a detailed overview for the 1997 Computer Game Developers Conference. You can find these in the latter part of this book; they represent a rare look into the development and inner workings of leading-edge software development, and I hope you enjoy reading them as much as I enjoyed developing the technology and writing about it.

The rest of this book is pretty much everything I’ve written over the past decade about graphics and performance programming that’s still relevant to programming today, and that covers a lot of ground. Most of Zen of Graphics Programming, 2nd Edition is in there (and the rest is on the CD); all of Zen of Code Optimization is there too, and even my 1989 book Zen of Assembly Language, with its long-dated 8088 cycle counts but a lot of useful perspectives, is on the CD. Add to that the most recent 20,000 words of Quake material, and you have most of what I’ve learned over the past decade in one neat package.

I’m delighted to have all this material in print in a single place, because over the past ten years I’ve run into a lot of people who have found my writings useful—and a lot more who would like to read them, but couldn’t find them. It’s hard to keep programming material (especially stuff that started out as columns) in print for very long, and I would like to thank The Coriolis Group, and particularly my good friend Jeff Duntemann (without whom not only this volume but pretty much my entire writing career wouldn’t exist), for helping me keep this material available.

I’d also like to thank Jon Erickson, editor of Dr. Dobb’s, both for encouragement and general good cheer and for giving me a place to write whatever I wanted about realtime 3-D. It still amazes me that I was able to find time to write a column every two months during Quake’s development, and if Jon hadn’t made it so easy and enjoyable, it could never have happened.

I’d also like to thank Chris Hecker and Jennifer Pahlka of the Computer Game Developers Conference, without whose encouragement, nudging, and occasional well-deserved nagging there is no chance I would ever have written a paper for the CGDC—a paper that ended up being the most comprehensive overview of the Quake technology that’s ever likely to be written, and which appears in these pages.

I don’t have much else to say that hasn’t already been said elsewhere in this book, in one of the introductions to the previous volumes or in one of the astonishingly large number of chapters. As you’ll see as you read, it’s been quite a decade for microcomputer programmers, and I have been extremely fortunate to not only be a part of it, but to be able to chronicle part of it as well.

And the next decade is shaping up to be just as exciting!

Michael Abrash
Bellevue, Washington
May 1997

Foreword

I got my start programming on Apple II computers at school, and almost all of my early work was on the Apple platform. After graduating, it quickly became obvious that I was going to have trouble paying my rent working in the Apple II market in the late eighties, so I was forced to make a very rapid move into the Intel PC environment.

What I was able to pick up over several years on the Apple, I needed to learn in the space of a few months on the PC.

The biggest benefit to me of actually making money as a programmer was the ability to buy all the books and magazines I wanted. I bought a lot. I was in territory that I knew almost nothing about, so I read everything that I could get my hands on. Feature articles, editorials, even advertisements held information for me to assimilate.

John Romero clued me in early to the articles by Michael Abrash. The good stuff. Graphics hardware. Code optimization. Knowledge and wisdom for the aspiring developer. They were even fun to read. For a long time, my personal quest was to find a copy of Michael’s first book, Zen of Assembly Language. I looked in every bookstore I visited, but I never did find it. I made do with the articles I could dig up.

I learned the dark secrets of the EGA video controller there, and developed a few neat tricks of my own. Some of those tricks became the basis for the Commander Keen series of games, which launched id Software.

A year or two later, after Wolfenstein-3D, I bumped into Michael (in a virtual sense) for the first time. I was looking around on M&T Online, a BBS run by the Dr. Dobb’s publishers before the Internet explosion, when I saw some posts from the man himself. We traded email, and for a couple months we played tag-team gurus on the graphics forum before Doom’s development took over my life.

A friend of Michael’s at his new job put us back in touch with each other after Doom began to make its impact, and I finally got a chance to meet up with him in person.

I talked myself hoarse that day, explaining all the ins and outs of Doom to Michael and an interested group of his coworkers. Every few days afterwards, I would get an email from Michael asking for an elaboration on one of my points, or discussing an aspect of the future of graphics.

Eventually, I popped the question—I offered him a job at id. “Just think: no reporting to anyone, an opportunity to code all day, starting with a clean sheet of paper. A chance to do the right thing as a programmer.” It didn’t work. I kept at it though, and about a year later I finally convinced him to come down and take a look at id. I was working on Quake.

Going from Doom to Quake was a tremendous step. I knew where I wanted to end up, but I wasn’t at all clear what the steps were to get there. I was trying a huge number of approaches, and even the failures were teaching me a lot. My enthusiasm must have been contagious, because he took the job.

Much heroic programming ensued. Several hundred thousand lines of code were written. And rewritten. And rewritten. And rewritten.

In hindsight, I have plenty of regrets about various aspects of Quake, but it is a rare person that doesn’t freely acknowledge the technical triumph of it. We nailed it. Sure, a year from now I will have probably found a new perspective that will make me cringe at the clunkiness of some part of Quake, but at the moment it still looks pretty damn good to me.

I was very happy to have Michael describe much of the Quake technology in his ongoing magazine articles. We learned a lot, and I hope we managed to teach a bit.

When a non-programmer hears about Michael’s articles or the source code I have released, I usually get a stunned “WTF would you do that for???” look.

They don’t get it.

Programming is not a zero-sum game. Teaching something to a fellow programmer doesn’t take it away from you. I’m happy to share what I can, because I’m in it for the love of programming. The Ferraris are just gravy, honest!

This book contains many of the original articles that helped launch my programming career. I hope my contribution to the contents of the later articles can provide similar stepping stones for others.

John Carmack
id Software

Acknowledgments

There are many people to thank—because this book was written over many years, in many different settings, an unusually large number of people have played a part in making this book possible. Thanks to Dan Illowsky for not only contributing ideas and encouragement, but also getting me started writing articles long ago, when I lacked the confidence to do it on my own—and for teaching me how to handle the business end of things. Thanks to Will Fastie for giving me my first crack at writing for a large audience in the long-gone but still-missed PC Tech Journal, and for showing me how much fun it could be in his even longer-vanished but genuinely terrific column in Creative Computing (the most enjoyable single column I have ever read in a computer magazine; I used to haunt the mailbox around the beginning of the month just to see what Will had to say). Thanks to Robert Keller, Erin O’Connor, Liz Oakley, Steve Baker, and the rest of the cast of thousands that made Programmer’s Journal a uniquely fun magazine—especially Erin, who did more than anyone to teach me the proper use of the English language. (To this day, Erin will still patiently explain to me when one should use “that” and when one should use “which,” even though eight years of instruction on this and related topics have left no discernible imprint on my brain.) Thanks to Tami Zemel, Monica Berg, and the rest of the Dr. Dobb’s Journal crew for excellent, professional editing, and for just being great people. Thanks to the Coriolis gang for their tireless hard work: Jeff Duntemann, Kim Eoff, Jody Kent, Robert Clarfield, and Anthony Stock. Thanks to Jack Tseng for teaching me a lot about graphics hardware, and even more about how much difference hard work can make. Thanks to John Cockerham, David Stafford, Terje Mathisen, the BitMan, Chris Hecker, Jim Mackraz, Melvin Lafitte, John Navas, Phil Coleman, Anton Truenfels, John Carmack, John Miles, John Bridges, Jim Kent, Hal Hardenbergh, Dave Miller, Steve Levy, Jack Davis, Duane Strong, Daev Rohr, Bill Weber, Dan Gochnauer, Patrick Milligan, Tom Wilson, Peter Klerings, Dave Methvin, Mick Brown, the people in the ibm.pc/fast.code topic on Bix, and all the rest of you who have been so generous with your ideas and suggestions. I’ve done my best to acknowledge contributors by name in this book, but if your name is omitted, my apologies, and consider yourself thanked; this book could not have happened without you. And, of course, thanks to Shay and Emily for their generous patience with my passion for writing and computers.

Part I

Chapter 1 – The Best Optimizer Is between Your Ears

The Human Element of Code Optimization

This book is devoted to a topic near and dear to my heart: writing software that pushes PCs to the limit. Given run-of-the-mill software, PCs run like the 97-pound-weakling minicomputers they are. Give them the proper care, however, and those ugly boxes are capable of miracles. The key is this: Only on microcomputers do you have the run of the whole machine, without layers of operating systems, drivers, and the like getting in the way. You can do anything you want, and you can understand everything that’s going on, if you so wish.

As we’ll see shortly, you should indeed so wish.

Is performance still an issue in this era of cheap 486 computers and super-fast Pentium computers? You bet. How many programs that you use really run so fast that you wouldn’t be happier if they ran faster? We’re so used to slow software that when a compile-and-link sequence that took two minutes on a PC takes just ten seconds on a 486 computer, we’re ecstatic—when in truth we should be settling for nothing less than instantaneous response.

Impossible, you say? Not with the proper design, including incremental compilation and linking, use of extended and/or expanded memory, and well-crafted code. PCs can do just about anything you can imagine (with a few obvious exceptions, such as applications involving super-computer-class number-crunching) if you believe that it can be done, if you understand the computer inside and out, and if you’re willing to think past the obvious solution to unconventional but potentially more fruitful approaches.

My point is simply this: PCs can work wonders. It’s not easy coaxing them into doing that, but it’s rewarding—and it’s sure as heck fun. In this book, we’re going to work some of those wonders, starting…

…now.

Understanding High Performance

Before we can create high-performance code, we must understand what high performance is. The objective (not always attained) in creating high-performance software is to make the software able to carry out its appointed tasks so rapidly that it responds instantaneously, as far as the user is concerned. In other words, high-performance code should ideally run so fast that any further improvement in the code would be pointless.

Notice that the above definition most emphatically does not say anything about making the software as fast as possible. It also does not say anything about using assembly language, or an optimizing compiler, or, for that matter, a compiler at all. It also doesn’t say anything about how the code was designed and written. What it does say is that high-performance code shouldn’t get in the user’s way—and that’s all.

That’s an important distinction, because all too many programmers think that assembly language, or the right compiler, or a particular high-level language, or a certain design approach is the answer to creating high-performance code. They’re not, any more than choosing a certain set of tools is the key to building a house. You do indeed need tools to build a house, but any of many sets of tools will do. You also need a blueprint, an understanding of everything that goes into a house, and the ability to use the tools.

Likewise, high-performance programming requires a clear understanding of the purpose of the software being built, an overall program design, algorithms for implementing particular tasks, an understanding of what the computer can do and of what all relevant software is doing—and solid programming skills, preferably using an optimizing compiler or assembly language. The optimization at the end is just the finishing touch, however.

Without good design, good algorithms, and complete understanding of the program’s operation, your carefully optimized code will amount to one of mankind’s least fruitful creations—a fast slow program.

“What’s a fast slow program?” you ask. That’s a good question, and a brief (true) story is perhaps the best answer.

When Fast Isn’t Fast

In the early 1970s, as the first hand-held calculators were hitting the market, I knew a fellow named Irwin. He was a good student, and was planning to be an engineer. Being an engineer back then meant knowing how to use a slide rule, and Irwin could jockey a slipstick with the best of them. In fact, he was so good that he challenged a fellow with a calculator to a duel—and won, becoming a local legend in the process.

When you get right down to it, though, Irwin was spitting into the wind. In a few short years his hard-earned slipstick skills would be worthless, and the entire discipline would be essentially wiped from the face of the earth. What’s more, anyone with half a brain could see that changeover coming. Irwin had basically wasted the considerable effort and time he had spent optimizing his soon-to-be-obsolete skills.

What does all this have to do with programming? Plenty. When you spend time optimizing poorly-designed assembly code, or when you count on an optimizing compiler to make your code fast, you’re wasting the optimization, much as Irwin did. Particularly in assembly, you’ll find that without proper up-front design and everything else that goes into high-performance design, you’ll waste considerable effort and time on making an inherently slow program as fast as possible—which is still slow—when you could easily have improved performance a great deal more with just a little thought. As we’ll see, handcrafted assembly language and optimizing compilers matter, but less than you might think, in the grand scheme of things—and they scarcely matter at all unless they’re used in the context of a good design and a thorough understanding of both the task at hand and the PC.

Rules for Building High-Performance Code

We’ve got the following rules for creating high-performance software:

  • Know where you’re going (understand the objective of the software).
  • Make a big map (have an overall program design firmly in mind, so the various parts of the program and the data structures work well together).
  • Make lots of little maps (design an algorithm for each separate part of the overall design).
  • Know the territory (understand exactly how the computer carries out each task).
  • Know when it matters (identify the portions of your programs where performance matters, and don’t waste your time optimizing the rest).
  • Always consider the alternatives (don’t get stuck on a single approach; odds are there’s a better way, if you’re clever and inventive enough).
  • Know how to turn on the juice (optimize the code as best you know how when it does matter).

Making rules is easy; the hard part is figuring out how to apply them in the real world. For my money, examining some actual working code is always a good way to get a handle on programming concepts, so let’s look at some of the performance rules in action.

Know Where You’re Going

If we’re going to create high-performance code, first we have to know what that code is going to do. As an example, let’s write a program that generates a 16-bit checksum of the bytes in a file. In other words, the program will add each byte in a specified file in turn into a 16-bit value. This checksum value might be used to make sure that a file hasn’t been corrupted, as might occur during transmission over a modem or if a Trojan horse virus rears its ugly head. We’re not going to do anything with the checksum value other than print it out, however; right now we’re only interested in generating that checksum value as rapidly as possible.
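(A quick illustrative fragment, not one of this chapter's numbered listings: the checksum arithmetic itself is nothing more than repeated addition into a 16-bit accumulator, with any carry out of bit 15 simply discarded. The helper name below is made up for illustration, and on the 16-bit compilers used in this chapter the mask is redundant, since unsigned int arithmetic wraps at 65,536 anyway.)

/* Sketch of the core checksum arithmetic: add one byte into a
   16-bit accumulator, discarding any carry out of bit 15. */
unsigned int AddToChecksum(unsigned int Checksum, unsigned char Byte)
{
      return (Checksum + Byte) & 0xFFFF;
}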

Make a Big Map

How are we going to generate a checksum value for a specified file? The logical approach is to get the file name, open the file, read the bytes out of the file, add them together, and print the result. Most of those actions are straightforward; the only tricky part lies in reading the bytes and adding them together.

Make Lots of Little Maps

Actually, we’re only going to make one little map, because we only have one program section that requires much thought—the section that reads the bytes and adds them up. What’s the best way to do this?

It would be convenient to load the entire file into memory and then sum the bytes in one loop. Unfortunately, there’s no guarantee that any particular file will fit in the available memory; in fact, it’s a sure thing that many files won’t fit into memory, so that approach is out.

Well, if the whole file won’t fit into memory, one byte surely will. If we read the file one byte at a time, adding each byte to the checksum value before reading the next byte, we’ll minimize memory requirements and be able to handle any size file at all.

Sounds good, eh? Listing 1.1 shows an implementation of this approach. Listing 1.1 uses C’s read() function to read a single byte, adds the byte into the checksum value, and loops back to handle the next byte until the end of the file is reached. The code is compact, easy to write, and functions perfectly—with one slight hitch:

It’s slow.

LISTING 1.1 L1-1.C

/*
* Program to calculate the 16-bit checksum of all bytes in the
* specified file. Obtains the bytes one at a time via read(),
* letting DOS perform all data buffering.
*/
#include <stdio.h>
#include <fcntl.h>

main(int argc, char *argv[]) {
     int Handle;
     unsigned char Byte;
     unsigned int Checksum;
     int ReadLength;

     if ( argc != 2 ) {
          printf("usage: checksum filename\n");
          exit(1);
     }
     if ( (Handle = open(argv[1], O_RDONLY | O_BINARY)) == -1 ) {
          printf("Can't open file: %s\n", argv[1]);
          exit(1);
     }

     /* Initialize the checksum accumulator */
     Checksum = 0;

     /* Add each byte in turn into the checksum accumulator */
     while ( (ReadLength = read(Handle, &Byte, sizeof(Byte))) > 0 ) {
          Checksum += (unsigned int) Byte;
     }
     if ( ReadLength == -1 ) {
          printf("Error reading file %s\n", argv[1]);
          exit(1);
     }


     /* Report the result */
     printf("The checksum is: %u\n", Checksum);
     exit(0);
}

Table 1.1 shows the time taken for Listing 1.1 to generate a checksum of the WordPerfect version 4.2 thesaurus file, TH.WP (362,293 bytes in size), on a 10 MHz AT machine of no special parentage. Execution times are given for Listing 1.1 compiled with Borland and Microsoft compilers, with optimization both on and off; all four times are pretty much the same, however, and all are much too slow to be acceptable. Listing 1.1 requires over two and one-half minutes to checksum one file!

Listings 1.2 and 1.3 form the C/assembly equivalent to Listing 1.1, and Listings 1.6 and 1.7 form the C/assembly equivalent to Listing 1.5.

These results make it clear that it’s folly to rely on your compiler’s optimization to make your programs fast. Listing 1.1 is simply poorly designed, and no amount of compiler optimization will compensate for that failing. To drive home the point, consider Listings 1.2 and 1.3, which together are equivalent to Listing 1.1 except that the entire checksum loop is written in tight assembly code. The assembly language implementation is indeed faster than any of the C versions, as shown in Table 1.1, but it’s less than 10 percent faster, and it’s still unacceptably slow.

Table 1.1 Execution Times for WordPerfect Checksum.
Listing               Borland    Microsoft   Borland    Microsoft   Assembly   Optimization
                      (no opt)   (no opt)    (opt)      (opt)                  Ratio
1                     166.9      166.8       167.0      165.8       155.1      1.08
4                     13.5       13.6        13.5       13.5                   1.01
5                     4.7        5.5         3.8        3.4         2.7        2.04
Ratio best designed
to worst designed     35.51      30.33       43.95      48.76       57.44

Note: The execution times (in seconds) for this chapter’s listings were timed when the compiled listings were run on the WordPerfect 4.2 thesaurus file TH.WP (362,293 bytes in size), as compiled in the small model with Borland and Microsoft compilers with optimization on (opt) and off (no opt). All times were measured with Paradigm Systems’ TIMER program on a 10 MHz 1-wait-state AT clone with a 28-ms hard disk, with disk caching turned off.

LISTING 1.2 L1-2.C

/*
* Program to calculate the 16-bit checksum of the stream of bytes
* from the specified file. Obtains the bytes one at a time in
* assembler, via direct calls to DOS.
*/

#include <stdio.h>
#include <fcntl.h>

main(int argc, char *argv[]) {
      int Handle;
      unsigned char Byte;
      unsigned int Checksum;
      int ReadLength;

      if ( argc != 2 ) {
            printf("usage: checksum filename\n");
            exit(1);
      }
      if ( (Handle = open(argv[1], O_RDONLY | O_BINARY)) == -1 ) {
            printf("Can't open file: %s\n", argv[1]);
            exit(1);
      }
      if ( !ChecksumFile(Handle, &Checksum) ) {
            printf("Error reading file %s\n", argv[1]);
            exit(1);
      }

      /* Report the result */
      printf("The checksum is: %u\n", Checksum);
      exit(0);
}

LISTING 1.3 L1-3.ASM

; Assembler subroutine to perform a 16-bit checksum on the file
; opened on the passed-in handle. Stores the result in the
; passed-in checksum variable. Returns 1 for success, 0 for error.
;
; Call as:
;           int ChecksumFile(unsigned int Handle, unsigned int *Checksum);
;
; where:
;           Handle = handle # under which file to checksum is open
;           Checksum = pointer to unsigned int variable checksum is
;           to be stored in
;
; Parameter structure:
;
Parms      struc
                 dw        ?       ;pushed BP
                 dw        ?       ;return address
Handle           dw        ?
Checksum         dw        ?
Parms      ends
;
                 .model small
                 .data
TempWord label   word
TempByte         db        ?       ;each byte read by DOS will be stored here
                 db        0       ;high byte of TempWord is always 0
                                   ;for 16-bit adds
;
                 .code
                 public _ChecksumFile
_ChecksumFile    proc near
                 push      bp
                 mov       bp,sp
                 push      si                  ;save C's register variable
;
                 mov       bx,[bp+Handle]       ;get file handle
                 sub       si,si                ;zero the checksum accumulator
                 mov       cx,1                 ;request one byte on each read
                 mov       dx,offset TempByte   ;point DX to the byte in
                                                ;which DOS should store
                                                ;each byte read
ChecksumLoop:
                 mov       ah,3fh               ;DOS read file function #
                 int       21h                  ;read the byte
                 jc        ErrorEnd             ;an error occurred
                 and       ax,ax                ;any bytes read?
                 jz        Success              ;no-end of file reached-we're done
                 add       si,[TempWord]        ;add the byte into the
                                                ;checksum total
                 jmp       ChecksumLoop
ErrorEnd:
                 sub       ax,ax                ;error
                 jmp       short Done
Success:
                 mov       bx,[bp+Checksum] ;point to the checksum variable
                 mov       [bx],si              ;save the new checksum
                 mov       ax,1                 ;success
;
Done:
                 pop       si                   ;restore C's register variable
                 pop       bp
                 ret
_ChecksumFile    endp
                 end

The lesson is clear: Optimization makes code faster, but without proper design, optimization just creates fast slow code.

Well, then, how are we going to improve our design? Before we can do that, we have to understand what’s wrong with the current design.

Know the Territory

Just why is Listing 1.1 so slow? In a word: overhead. The C library implements the read() function by calling DOS to read the desired number of bytes. (I figured this out by watching the code execute with a debugger, but you can buy library source code from both Microsoft and Borland.) That means that Listing 1.1 (and Listing 1.3 as well) executes one DOS function per byte processed—and DOS functions, especially this one, come with a lot of overhead.

For starters, DOS functions are invoked with interrupts, and interrupts are among the slowest instructions of the x86 family CPUs. Then, DOS has to set up internally and branch to the desired function, expending more cycles in the process. Finally, DOS has to search its own buffers to see if the desired byte has already been read, read it from the disk if not, store the byte in the specified location, and return. All of that takes a long time—far, far longer than the rest of the main loop in Listing 1.1. In short, Listing 1.1 spends virtually all of its time executing read(), and most of that time is spent somewhere down in DOS.

You can verify this for yourself by watching the code with a debugger or using a code profiler, but take my word for it: There’s a great deal of overhead to DOS calls, and that’s what’s draining the life out of Listing 1.1.

How can we speed up Listing 1.1? It should be clear that we must somehow avoid invoking DOS for every byte in the file, and that means reading more than one byte at a time, then buffering the data and parceling it out for examination one byte at a time. By gosh, that’s a description of C’s stream I/O feature, whereby C reads files in chunks and buffers the bytes internally, doling them out to the application as needed by reading them from memory rather than calling DOS. Let’s try using stream I/O and see what happens.

Listing 1.4 is similar to Listing 1.1, but uses fopen() and getc() (rather than open() and read()) to access the file being checksummed. The results confirm our theories splendidly, and validate our new design. As shown in Table 1.1, Listing 1.4 runs more than an order of magnitude faster than even the assembly version of Listing 1.1, even though Listing 1.1 and Listing 1.4 look almost the same. To the casual observer, read() and getc() would seem slightly different but pretty much interchangeable, and yet in this application the performance difference between the two is about the same as that between a 4.77 MHz PC and a 16 MHz 386.

Make sure you understand what really goes on when you insert a seemingly-innocuous function call into the time-critical portions of your code.

In this case that means knowing how DOS and the C/C++ file-access libraries do their work. In other words, know the territory!

LISTING 1.4 L1-4.C

/*
* Program to calculate the 16-bit checksum of the stream of bytes
* from the specified file. Obtains the bytes one at a time via
* getc(), allowing C to perform data buffering.
*/
#include <stdio.h>

main(int argc, char *argv[]) {
      FILE *CheckFile;
      int Byte;
      unsigned int Checksum;

      if ( argc != 2 ) {
            printf("usage: checksum filename\n");
            exit(1);
      }
      if ( (CheckFile = fopen(argv[1], "rb")) == NULL ) {
            printf("Can't open file: %s\n", argv[1]);
            exit(1);
      }

      /* Initialize the checksum accumulator */
      Checksum = 0;

      /* Add each byte in turn into the checksum accumulator */
      while ( (Byte = getc(CheckFile)) != EOF ) {
            Checksum += (unsigned int) Byte;
      }

      /* Report the result */
      printf("The checksum is: %u\n", Checksum);
      exit(0);
}

Know When It Matters

The last section contained a particularly interesting phrase: the time-critical portions of your code. Time-critical portions of your code are those portions in which the speed of the code makes a significant difference in the overall performance of your program—and by “significant,” I don’t mean that it makes the code 100 percent faster, or 200 percent, or any particular amount at all, but rather that it makes the program more responsive and/or usable from the user’s perspective.

Don’t waste time optimizing non-time-critical code: set-up code, initialization code, and the like. Spend your time improving the performance of the code inside heavily-used loops and in the portions of your programs that directly affect response time. Notice, for example, that I haven’t bothered to implement a version of the checksum program entirely in assembly; Listings 1.2 and 1.6 call assembly subroutines that handle the time-critical operations, but C is still used for checking command-line parameters, opening files, printing, and the like.

If you were to implement any of the listings in this chapter entirely in hand-optimized assembly, I suppose you might get a performance improvement of a few percent—but I rather doubt you’d get even that much, and you’d sure as heck spend an awful lot of time for whatever meager improvement does result. Let C do what it does well, and use assembly only when it makes a perceptible difference.

Besides, we don’t want to optimize until the design is refined to our satisfaction, and that won’t be the case until we’ve thought about other approaches.

Always Consider the Alternatives

Listing 1.4 is good, but let’s see if there are other—perhaps less obvious—ways to get the same results faster. Let’s start by considering why Listing 1.4 is so much better than Listing 1.1. Like read(), getc() calls DOS to read from the file; the speed improvement of Listing 1.4 over Listing 1.1 occurs because getc() reads many bytes at once via DOS, then manages those bytes for us. That’s faster than reading them one at a time using read()—but there’s no reason to think that it’s faster than having our program read and manage blocks itself. Easier, yes, but not faster.

Consider this: Every invocation of getc() involves pushing a parameter, executing a call to the C library function, getting the parameter (in the C library code), looking up information about the desired stream, unbuffering the next byte from the stream, and returning to the calling code. That takes a considerable amount of time, especially by contrast with simply maintaining a pointer to a buffer and whizzing through the data in the buffer inside a single loop.

There are four reasons that many programmers would give for not trying to improve on Listing 1.4:

  1. The code is already fast enough.

  2. The code works, and some people are content with code that works, even when it’s slow enough to be annoying.

  3. The C library is written in optimized assembly, and it’s likely to be faster than any code that the average programmer could write to perform essentially the same function.

  4. The C library conveniently handles the buffering of file data, and it would be a nuisance to have to implement that capability.

I’ll ignore the first reason, both because performance is no longer an issue if the code is fast enough and because the current application does not run fast enough—13 seconds is a long time. (Stop and wait for 13 seconds while you’re doing something intense, and you’ll see just how long it is.)

The second reason is the hallmark of the mediocre programmer. Know when optimization matters—and then optimize when it does!

The third reason is often fallacious. C library functions are not always written in assembly, nor are they always particularly well-optimized. (In fact, they’re often written for portability, which has nothing to do with optimization.) What’s more, they’re general-purpose functions, and often can be outperformed by well-but-not-brilliantly-written code that is well-matched to a specific task. As an example, consider Listing 1.5, which uses internal buffering to handle blocks of bytes at a time. Table 1.1 shows that Listing 1.5 is 2.5 to 4 times faster than Listing 1.4 (and as much as 49 times faster than Listing 1.1!), even though it uses no assembly at all.

Clearly, you can do well by using special-purpose C code in place of a C library function—if you have a thorough understanding of how the C library function operates and exactly what your application needs done. Otherwise, you’ll end up rewriting C library functions in C, which makes no sense at all.

LISTING 1.5 L1-5.C

/*
* Program to calculate the 16-bit checksum of the stream of bytes
* from the specified file. Buffers the bytes internally, rather
* than letting C or DOS do the work.
*/
#include <stdio.h>
#include <fcntl.h>
#include <alloc.h>   /* alloc.h for Borland,
                         malloc.h for Microsoft  */

#define BUFFER_SIZE  0x8000   /* 32K data buffer */

main(int argc, char *argv[]) {
      int Handle;
      unsigned int Checksum;
      unsigned char *WorkingBuffer, *WorkingPtr;
      int WorkingLength, LengthCount;

      if ( argc != 2 ) {
            printf("usage: checksum filename\n");
            exit(1);
      }
      if ( (Handle = open(argv[1], O_RDONLY | O_BINARY)) == -1 ) {
            printf("Can't open file: %s\n", argv[1]);
            exit(1);
      }

      /* Get memory in which to buffer the data */
      if ( (WorkingBuffer = malloc(BUFFER_SIZE)) == NULL ) {
            printf("Can't get enough memory\n");
            exit(1);
      }

      /* Initialize the checksum accumulator */
      Checksum = 0;

      /* Process the file in BUFFER_SIZE chunks */
      do {
            if ( (WorkingLength = read(Handle, WorkingBuffer,
                  BUFFER_SIZE)) == -1 ) {
                  printf("Error reading file %s\n", argv[1]);
                  exit(1);
            }
            /* Checksum this chunk */
            WorkingPtr = WorkingBuffer;
            LengthCount = WorkingLength;
            while ( LengthCount-- ) {
            /* Add each byte in turn into the checksum accumulator */
                  Checksum += (unsigned int) *WorkingPtr++;
            }
      } while ( WorkingLength );

      /* Report the result */
      printf("The checksum is: %u\n", Checksum);
      exit(0);
}

That brings us to the fourth reason: avoiding an internal-buffered implementation like Listing 1.5 because of the difficulty of coding such an approach. True, it is easier to let a C library function do the work, but it’s not all that hard to do the buffering internally. The key is the concept of handling data in restartable blocks; that is, reading a chunk of data, operating on the data until it runs out, suspending the operation while more data is read in, and then continuing as though nothing had happened.
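(Stripped to its skeleton, the restartable-block pattern looks like the fragment below. This is just a sketch, not one of the numbered listings; ProcessBlock() and State are hypothetical stand-ins for whatever per-chunk work and cross-chunk state a given application needs. The point is simply that any state that must survive from one chunk to the next, such as the checksum accumulator, lives outside the loop.)

/* Skeleton of the restartable-block pattern. ProcessBlock() is a
   hypothetical stand-in for the per-chunk work; State holds whatever
   must persist across chunks. */
do {
      if ( (WorkingLength = read(Handle, WorkingBuffer,
            BUFFER_SIZE)) == -1 ) {
            /* handle the read error */
      }
      if ( WorkingLength )
            ProcessBlock(WorkingBuffer, WorkingLength, &State);
} while ( WorkingLength );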

In Listing 1.5 the restartable block implementation is pretty simple because checksumming works with one byte at a time, forgetting about each byte immediately after adding it into the total. Listing 1.5 reads in a block of bytes from the file, checksums the bytes in the block, and gets another block, repeating the process until the entire file has been processed. In Chapter 5, we’ll see a more complex restartable block implementation, involving searching for text strings.

At any rate, Listing 1.5 isn’t much more complicated than Listing 1.4—and it’s a lot faster. Always consider the alternatives; a bit of clever thinking and program redesign can go a long way.

Know How to Turn On the Juice

I have said time and again that optimization is pointless until the design is settled. When that time comes, however, optimization can indeed make a significant difference. Table 1.1 indicates that the optimized version of Listing 1.5 produced by Microsoft C outperforms an unoptimized version of the same code by more than 60 percent. What’s more, a mostly-assembly version of Listing 1.5, shown in Listings 1.6 and 1.7, outperforms even the best-optimized C version of Listing 1.5 by 26 percent. These are considerable improvements, well worth pursuing—once the design has been maxed out.

LISTING 1.6 L1-6.C

/*
* Program to calculate the 16-bit checksum of the stream of bytes
* from the specified file. Buffers the bytes internally, rather
* than letting C or DOS do the work, with the time-critical
* portion of the code written in optimized assembler.
*/
#include <stdio.h>
#include <fcntl.h>
#include <alloc.h>   /* alloc.h for Borland,
                         malloc.h for Microsoft  */

#define BUFFER_SIZE  0x8000   /* 32K data buffer */

main(int argc, char *argv[]) {
      int Handle;
      unsigned int Checksum;
      unsigned char *WorkingBuffer;
      int WorkingLength;

      if ( argc != 2 ) {
            printf("usage: checksum filename\n");
            exit(1);
      }
      if ( (Handle = open(argv[1], O_RDONLY | O_BINARY)) == -1 ) {
            printf("Can't open file: %s\n", argv[1]);
            exit(1);
      }

      /* Get memory in which to buffer the data */
      if ( (WorkingBuffer = malloc(BUFFER_SIZE)) == NULL ) {
            printf("Can't get enough memory\n");
            exit(1);
      }

      /* Initialize the checksum accumulator */
      Checksum = 0;

      /* Process the file in 32K chunks */
      do {
            if ( (WorkingLength = read(Handle, WorkingBuffer,
            BUFFER_SIZE)) == -1 ) {
                  printf("Error reading file %s\n", argv[1]);
                  exit(1);
            }
            /* Checksum this chunk if there's anything in it */
            if ( WorkingLength )
                  ChecksumChunk(WorkingBuffer, WorkingLength, &Checksum);
      } while ( WorkingLength );

      /* Report the result */
      printf("The checksum is: %u\n", Checksum);
      exit(0);
}

LISTING 1.7 L1-7.ASM

; Assembler subroutine to perform a 16-bit checksum on a block of
; bytes 1 to 64K in size. Adds checksum for block into passed-in
; checksum.
;
; Call as:
;     void ChecksumChunk(unsigned char *Buffer,
;     unsigned int BufferLength, unsigned int *Checksum);
;
; where:
;     Buffer = pointer to start of block of bytes to checksum
;     BufferLength = # of bytes to checksum (0 means 64K, not 0)
;     Checksum = pointer to unsigned int variable checksum is
;     stored in
;
; Parameter structure:
;
Parms struc
                    dw    ?    ;pushed BP
                    dw    ?    ;return address
Buffer              dw    ?
BufferLength        dw    ?
Checksum            dw    ?
Parms ends
;
     .model small
     .code
     public _ChecksumChunk
_ChecksumChunk proc near
     push  bp
     mov   bp,sp
     push  si                        ;save C's register variable
;
     cld                             ;make LODSB increment SI
      mov  si,[bp+Buffer]            ;point to buffer
      mov  cx,[bp+BufferLength]      ;get buffer length
      mov  bx,[bp+Checksum]          ;point to checksum variable
      mov  dx,[bx]                   ;get the current checksum
      sub  ah,ah                     ;so AX will be a 16-bit value after LODSB
ChecksumLoop:
      lodsb                  ;get the next byte
      add  dx,ax             ;add it into the checksum total
      loop ChecksumLoop      ;continue for all bytes in block
      mov  [bx],dx           ;save the new checksum
;
      pop  si                ;restore C's register variable
      pop  bp
      ret
_ChecksumChunk endp
      end

Note that in Table 1.1, optimization makes little difference except in the case of Listing 1.5, where the design has been refined considerably. Execution time in the other cases is dominated by time spent in DOS and/or the C library, so optimization of the code you write is pretty much irrelevant. What’s more, while the approximately two-times improvement we got by optimizing is not to be sneezed at, it pales against the up-to-50-times improvement we got by redesigning.

By the way, the execution times even of Listings 1.6 and 1.7 are dominated by DOS disk access times. If a disk cache is enabled and the file to be checksummed is already in the cache, the assembly version is three times as fast as the C version. In other words, the inherent nature of this application limits the performance improvement that can be obtained via assembly. In applications that are more CPU-intensive and less disk-bound, particularly those applications in which string instructions and/or unrolled loops can be used effectively, assembly tends to be considerably faster relative to C than it is in this very specific case.

Don’t get hung up on optimizing compilers or assembly language—the best optimizer is between your ears.

All this is basically a way of saying: Know where you’re going, know the territory, and know when it matters.

Where We’ve Been, What We’ve Seen

What have we learned? Don’t let other people’s code—even DOS—do the work for you when speed matters, at least not without knowing what that code does and how well it performs.

Optimization only matters after you’ve done your part on the program design end. Consider the ratios on the vertical axis of Table 1.1, which show that optimization is almost totally wasted in the checksumming application without an efficient design. Optimization is no panacea. Table 1.1 shows a two-times improvement from optimization—and a 50-times-plus improvement from redesign. The longstanding debate about which C compiler optimizes code best doesn’t matter quite so much in light of Table 1.1, does it? Your organic optimizer matters much more than your compiler’s optimizer, and there’s always assembly for those usually small sections of code where performance really matters.

Where We’re Going

This chapter has presented a quick step-by-step overview of the design process. I’m not claiming that this is the only way to create high-performance code; it’s just an approach that works for me. Create code however you want, but never forget that design matters more than detailed optimization. Never stop looking for inventive ways to boost performance—and never waste time speeding up code that doesn’t need to be sped up.

I’m going to focus on specific ways to create high-performance code from now on. In Chapter 5, we’ll continue to look at restartable blocks and internal buffering, in the form of a program that searches files for text strings.

Chapter 2 – A World Apart

The Unique Nature of Assembly Language Optimization

As I showed in the previous chapter, optimization is by no means always a matter of “dropping into assembly.” In fact, in performance tuning high-level language code, assembly should be used rarely, and then only after you’ve made sure a badly chosen or clumsily implemented algorithm isn’t eating you alive. Certainly if you use assembly at all, make absolutely sure you use it right. The potential of assembly code to run slowly is poorly understood by a lot of people, but that potential is great, especially in the hands of the ignorant.

Truly great optimization, however, happens only at the assembly level, and it happens in response to a set of dynamics that is totally different from that governing C/C++ or Pascal optimization. I’ll be speaking of assembly-level optimization time and again in this book, but when I do, I think it will be helpful if you have a grasp of those assembly-specific dynamics.

As usual, the best way to wade in is to present a real-world example.

Instructions: The Individual versus the Collective

Some time ago, I was asked to work over a critical assembly subroutine in order to make it run as fast as possible. The task of the subroutine was to construct a nibble out of four bits read from different bytes, rotating and combining the bits so that they ultimately ended up neatly aligned in bits 3-0 of a single byte. (In case you’re curious, the object was to construct a 16-color pixel from bits scattered over 4 bytes.) I examined the subroutine line by line, saving a cycle here and a cycle there, until the code truly seemed to be optimized. When I was done, the key part of the code looked something like this:

LoopTop:
      lodsb            ;get the next byte to extract a bit from
      and   al,ah      ;isolate the bit we want
      rol   al,cl      ;rotate the bit into the desired position
      or    bl,al      ;insert the bit into the final nibble
      dec   cx         ;the next bit goes 1 place to the right
      dec   dx         ;count down the number of bits
      jnz   LoopTop    ;process the next bit, if any

Now, it’s hard to write code that’s much faster than seven instructions, only one of which accesses memory, and most programmers would have called it a day at this point. Still, something bothered me, so I spent a bit of time going over the code again. Suddenly, the answer struck me—the code was rotating each bit into place separately, so that a multibit rotation was being performed every time through the loop, for a total of four separate time-consuming multibit rotations!

While the instructions themselves were individually optimized, the overall approach did not make the best possible use of the instructions.

I changed the code to the following:

LoopTop:
      lodsb            ;get the next byte to extract a bit from
      and   al,ah      ;isolate the bit we want
      or    bl,al      ;insert the bit into the final nibble
      rol   bl,1       ;make room for the next bit
      dec   dx         ;count down the number of bits
      jnz   LoopTop    ;process the next bit, if any
      rol   bl,cl      ;rotate all four bits into their final
                       ; positions at the same time

This moved the costly multibit rotation out of the loop so that it was performed just once, rather than four times. While the code may not look much different from the original, and in fact still contains exactly the same number of instructions, the performance of the entire subroutine improved by about 10 percent from just this one change. (Incidentally, that wasn’t the end of the optimization; I eliminated the DEC and JNZ instructions by expanding the four iterations of the loop—but that’s a tale for another chapter.)

The point is this: To write truly superior assembly programs, you need to know what the various instructions do and which instructions execute fastest…and more. You must also learn to look at your programming problems from a variety of perspectives so that you can put those fast instructions to work in the most effective ways.

Assembly Is Fundamentally Different

Is it really so hard as all that to write good assembly code for the PC? Yes! Thanks to the decidedly quirky nature of the x86 family CPUs, assembly language differs fundamentally from other languages, and is undeniably harder to work with. On the other hand, the potential of assembly code is much greater than that of other languages, as well.

To understand why this is so, consider how a program gets written. A programmer examines the requirements of an application, designs a solution at some level of abstraction, and then makes that design come alive in a code implementation. If not handled properly, the transformation that takes place between conception and implementation can reduce performance tremendously; for example, a programmer who implements a routine to search a list of 100,000 sorted items with a linear rather than binary search will end up with a disappointingly slow program.

Transformation Inefficiencies

No matter how well an implementation is derived from the corresponding design, however, high-level languages like C/C++ and Pascal inevitably introduce additional transformation inefficiencies, as shown in Figure 2.1.

The process of turning a design into executable code by way of a high-level language involves two transformations: one performed by the programmer to generate source code, and another performed by the compiler to turn source code into machine language instructions. Consequently, the machine language code generated by compilers is usually less than optimal given the requirements of the original design.

High-level languages provide artificial environments that lend themselves relatively well to human programming skills, in order to ease the transition from design to implementation. The price for this ease of implementation is a considerable loss of efficiency in transforming source code into machine language. This is particularly true given that the x86 family in real and 16-bit protected mode, with its specialized memory-addressing instructions and segmented memory architecture, does not lend itself particularly well to compiler design. Even the 32-bit mode of the 386 and its successors, with its more powerful addressing modes, offers fewer registers than compilers would like.

Figure 2.1 The high-level language transformation inefficiencies.

Assembly, on the other hand, is simply a human-oriented representation of machine language. As a result, assembly provides a difficult programming environment—the bare hardware and systems software of the computer—but properly constructed assembly programs suffer no transformation loss, as shown in Figure 2.2.

Only one transformation is required when creating an assembler program, and that single transformation is completely under the programmer’s control. Assemblers perform no transformation from source code to machine language; instead, they merely map assembler instructions to machine language instructions on a one-to-one basis. As a result, the programmer is able to produce machine language code that’s precisely tailored to the needs of each task a given application requires.

Figure 2.2  Properly constructed assembly programs suffer no transformation loss.

The key, of course, is the programmer, since in assembly the programmer must essentially perform the transformation from the application specification to machine language entirely on his or her own. (The assembler merely handles the direct translation from assembly to machine language.)

Self-Reliance

The first part of assembly language optimization, then, is self. An assembler is nothing more than a tool to let you design machine-language programs without having to think in hexadecimal codes. So assembly language programmers—unlike all other programmers—must take full responsibility for the quality of their code. Since assemblers provide little help at any level higher than the generation of machine language, the assembly programmer must be capable both of coding any programming construct directly and of controlling the PC at the lowest practical level—the operating system, the BIOS, even the hardware where necessary. High-level languages handle most of this transparently to the programmer, but in assembly everything is fair—and necessary—game, which brings us to another aspect of assembly optimization: knowledge.

Knowledge

In the PC world, you can never have enough knowledge, and every item you add to your store will make your programs better. Thorough familiarity with both the operating system APIs and BIOS interfaces is important; since those interfaces are well-documented and reasonably straightforward, my advice is to get a good book or two and bring yourself up to speed. Similarly, familiarity with the PC hardware is required. While that topic covers a lot of ground—display adapters, keyboards, serial ports, printer ports, timer and DMA channels, memory organization, and more—most of the hardware is well-documented, and articles about programming major hardware components appear frequently in the literature, so this sort of knowledge can be acquired readily enough.

The single most critical aspect of the hardware, and the one about which it is hardest to learn, is the CPU. The x86 family CPUs have a complex, irregular instruction set, and, unlike most processors, they are neither straightforward nor well-documented as regards true code performance. What’s more, assembly is so difficult to learn that most articles and books that present assembly code settle for code that just works, rather than code that pushes the CPU to its limits. In fact, since most articles and books are written for inexperienced assembly programmers, there is very little information of any sort available about how to generate high-quality assembly code for the x86 family CPUs. As a result, knowledge about programming them effectively is by far the hardest knowledge to gather. A good portion of this book is devoted to seeking out such knowledge.

Be forewarned, though: No matter how much you learn about programming the PC in assembly, there’s always more to discover.

The Flexible Mind

Is the never-ending collection of information all there is to the assembly optimization, then? Hardly. Knowledge is simply a necessary base on which to build. Let’s take a moment to examine the objectives of good assembly programming, and the remainder of the forces that act on assembly optimization will fall into place.

Basically, there are only two possible objectives to high-performance assembly programming: Given the requirements of the application, keep to a minimum either the number of processor cycles the program takes to run, or the number of bytes in the program, or some combination of both. We’ll look at ways to achieve both objectives, but we’ll more often be concerned with saving cycles than saving bytes, for the PC generally offers relatively more memory than it does processing horsepower. In fact, we’ll find that two-to-three times performance improvements over already tight assembly code are often possible if we’re willing to spend additional bytes in order to save cycles. It’s not always desirable to use such techniques to speed up code, due to the heavy memory requirements—but it is almost always possible.
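(As a sketch of the simplest such trade, consider unrolling the checksum inner loop from Listing 1.7. This fragment is not from any of the numbered listings; it assumes, as Listing 1.7 does, that AH is already zero, DX holds the running total, and SI points to the buffer, and it further assumes that the byte count is a multiple of four so that CX can count groups of four bytes.)

; Rolled: smallest, but pays a LOOP on every byte.
SumLoop:
      lodsb                  ;get the next byte
      add   dx,ax            ;add it into the running total
      loop  SumLoop
;
; Unrolled four times: costs extra bytes, but three out of every four
; LOOP instructions are gone. CX now counts groups of four bytes.
SumLoop4:
      lodsb
      add   dx,ax
      lodsb
      add   dx,ax
      lodsb
      add   dx,ax
      lodsb
      add   dx,ax
      loop  SumLoop4

The unrolled version is several bytes larger, which is exactly the trade described above: memory spent to buy cycles.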

You will notice that my short list of objectives for high-performance assembly programming does not include traditional objectives such as easy maintenance and speed of development. Those are indeed important considerations—to persons and companies that develop and distribute software. People who actually buy software, on the other hand, care only about how well that software performs, not how it was developed nor how it is maintained. These days, developers spend so much time focusing on such admittedly important issues as code maintainability and reusability, source code control, choice of development environment, and the like that they often forget rule #1: From the user’s perspective, performance is fundamental.

Comment your code, design it carefully, and write non-time-critical portions in a high-level language, if you wish—but when you write the portions that interact with the user and/or affect response time, performance must be your paramount objective, and assembly is the path to that goal.

Knowledge of the sort described earlier is absolutely essential to fulfilling either of the objectives of assembly programming. What that knowledge doesn’t do by itself is meet the need to write code that both performs to the requirements of the application at hand and also operates as efficiently as possible in the PC environment. Knowledge makes that possible, but your programming instincts make it happen. And it is that intuitive, on-the-fly integration of a program specification and a sea of facts about the PC that is the heart of Zen-class assembly optimization.

As with Zen of any sort, mastering that Zen of assembly language is more a matter of learning than of being taught. You will have to find your own path of learning, although I will start you on your way with this book. The subtle facts and examples I provide will help you gain the necessary experience, but you must continue the journey on your own. Each program you create will expand your programming horizons and increase the options available to you in meeting the next challenge. The ability of your mind to find surprising new and better ways to craft superior code from a concept—the flexible mind, if you will—is the linchpin of good assembler code, and you will develop this skill only by doing.

Never underestimate the importance of the flexible mind. Good assembly code is better than good compiled code. Many people would have you believe otherwise, but they’re wrong. That doesn’t mean that high-level languages are useless; far from it. High-level languages are the best choice for the majority of programmers, and for the bulk of the code of most applications. When the best code—the fastest or smallest code possible—is needed, though, assembly is the only way to go.

Simple logic dictates that no compiler can know as much about what a piece of code needs to do or adapt as well to those needs as the person who wrote the code. Given that superior information and adaptability, an assembly language programmer can generate better code than a compiler, all the more so given that compilers are constrained by the limitations of high-level languages and by the process of transformation from high-level to machine language. Consequently, carefully optimized assembly is not just the language of choice but the only choice for the 1 percent to 10 percent of code—usually consisting of small, well-defined subroutines—that determines overall program performance, and it is the only choice for code that must be as compact as possible, as well. In the run-of-the-mill, non-time-critical portions of your programs, it makes no sense to waste time and effort on writing optimized assembly code—concentrate your efforts on loops and the like instead; but in those areas where you need the finest code quality, accept no substitutes.

Note that I said that an assembly programmer can generate better code than a compiler, not will generate better code. While it is true that good assembly code is better than good compiled code, it is also true that bad assembly code is often much worse than bad compiled code; since the assembly programmer has so much control over the program, he or she has virtually unlimited opportunities to waste cycles and bytes. The sword cuts both ways, and good assembly code requires more, not less, forethought and planning than good code written in a high-level language.

The gist of all this is simply that good assembly programming is done in the context of a solid overall framework unique to each program, and the flexible mind is the key to creating that framework and holding it together.

Where to Begin?

To summarize, the skill of assembly language optimization is a combination of knowledge, perspective, and a way of thought that makes possible the genesis of absolutely the fastest or the smallest code. With that in mind, what should the first step be? Development of the flexible mind is an obvious step. Still, the flexible mind is no better than the knowledge at its disposal. The first step in the journey toward mastering optimization at that exalted level, then, would seem to be learning how to learn.

Chapter 3 – Assume Nothing

Understanding and Using the Zen Timer

When you’re pushing the envelope in writing optimized PC code, you’re likely to become more than a little compulsive about finding approaches that let you wring more speed from your computer. In the process, you’re bound to make mistakes, which is fine—as long as you watch for those mistakes and learn from them.

A case in point: A few years back, I came across an article about 8088 assembly language called “Optimizing for Speed.” Now, “optimize” is not a word to be used lightly; Webster’s Ninth New Collegiate Dictionary defines optimize as “to make as perfect, effective, or functional as possible,” which certainly leaves little room for error. The author had, however, chosen a small, well-defined 8088 assembly language routine to refine, consisting of about 30 instructions that did nothing more than expand 8 bits to 16 bits by duplicating each bit.

The author of “Optimizing” had clearly fine-tuned the code with care, examining alternative instruction sequences and adding up cycles until he arrived at an implementation he calculated to be nearly 50 percent faster than the original routine. In short, he had used all the information at his disposal to improve his code, and had, as a result, saved cycles by the bushel. There was, in fact, only one slight problem with the optimized version of the routine….

It ran slower than the original version!

The Costs of Ignorance

As diligent as the author had been, he had nonetheless committed a cardinal sin of x86 assembly language programming: He had assumed that the information available to him was both correct and complete. While the execution times provided by Intel for its processors are indeed correct, they are incomplete; the other—and often more important—part of code performance is instruction fetch time, a topic to which I will return in later chapters.

Had the author taken the time to measure the true performance of his code, he wouldn’t have put his reputation on the line with relatively low-performance code. What’s more, had he actually measured the performance of his code and found it to be unexpectedly slow, curiosity might well have led him to experiment further and thereby add to his store of reliable information about the CPU.

There you have an important tenet of assembly language optimization: After crafting the best code possible, check it in action to see if it’s really doing what you think it is. If it’s not behaving as expected, that’s all to the good, since solving mysteries is the path to knowledge. You’ll learn more in this way, I assure you, than from any manual or book on assembly language.

Assume nothing. I cannot emphasize this strongly enough—when you care about performance, do your best to improve the code and then measure the improvement. If you don’t measure performance, you’re just guessing, and if you’re guessing, you’re not very likely to write top-notch code.

Ignorance about true performance can be costly. When I wrote video games for a living, I spent days at a time trying to wring more performance from my graphics drivers. I rewrote whole sections of code just to save a few cycles, juggled registers, and relied heavily on blurry-fast register-to-register shifts and adds. As I was writing my last game, I discovered that the program ran perceptibly faster if I used look-up tables instead of shifts and adds for my calculations. It shouldn’t have run faster, according to my cycle counting, but it did. In truth, instruction fetching was rearing its head again, as it often does, and the fetching of the shifts and adds was taking as much as four times the nominal execution time of those instructions.
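
To make the trade-off concrete, here is a hypothetical fragment of the sort of calculation involved, computing a screen row offset as row times 80, done both ways. (This is an illustration of the technique, not code from the game in question; the table name is mine, and the table itself would be built elsewhere, at a cost of two bytes per row.)

; Shift-and-add approach: AX = row * 80, with the row number in BX.
     mov   ax,bx
     shl   ax,1
     shl   ax,1            ; AX = row * 4
     add   ax,bx           ; AX = row * 5
     mov   cl,4
     shl   ax,cl           ; AX = row * 80
;
; Table look-up approach: one memory access replaces the whole sequence.
     shl   bx,1            ; convert row number to a word index (destroys BX)
     mov   ax,[RowOffsetTable+bx]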

Ignorance can also be responsible for considerable wasted effort. I recall a debate in the letters column of one computer magazine about exactly how quickly text can be drawn on a Color/Graphics Adapter (CGA) screen without causing snow. The letter-writers counted every cycle in their timing loops, just as the author in the story that started this chapter had. Like that author, the letter-writers had failed to take the prefetch queue into account. In fact, they had neglected the effects of video wait states as well, so the code they discussed was actually much slower than their estimates. The proper test would, of course, have been to run the code to see if snow resulted, since the only true measure of code performance is observing it in action.

The Zen Timer

Clearly, one key to mastering Zen-class optimization is a tool with which to measure code performance. The most accurate way to measure performance is with expensive hardware, but reasonable measurements at no cost can be made with the PC’s 8253 timer chip, which counts at a rate of slightly over 1,000,000 times per second. The 8253 can be started at the beginning of a block of code of interest and stopped at the end of that code, with the resulting count indicating how long the code took to execute with an accuracy of about 1 microsecond. (A microsecond is one millionth of a second, and is abbreviated µs). To be precise, the 8253 counts once every 838.1 nanoseconds. (A nanosecond is one billionth of a second, and is abbreviated ns.)
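
Put another way, converting a timer count to elapsed time is simply a multiplication by 0.8381, which is exactly what the timer software shown shortly does when it multiplies the net count by 8,381 and divides by 10,000. For example (the count here is arbitrary):

     elapsed time = count x 838.1 ns = count x 0.8381 µs
     4,318 counts x 0.8381 µs/count = about 3,619 µs (roughly 3.6 ms)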

Listing 3.1 shows 8253-based timer software, consisting of three subroutines: ZTimerOn, ZTimerOff, and ZTimerReport. For the remainder of this book, I’ll refer to these routines collectively as the “Zen timer.” C-callable versions of the two precision Zen timers are presented in Chapter K on the companion CD-ROM.

LISTING 3.1 PZTIMER.ASM

; The precision Zen timer (PZTIMER.ASM)
;
; Uses the 8253 timer to time the performance of code that takes
; less than about 54 milliseconds to execute, with a resolution
; of better than 10 microseconds.
;
; By Michael Abrash
;
; Externally callable routines:
;
;  ZTimerOn: Starts the Zen timer, with interrupts disabled.
;
;  ZTimerOff: Stops the Zen timer, saves the timer count,
;    times the overhead code, and restores interrupts to the
;    state they were in when ZTimerOn was called.
;
;  ZTimerReport: Prints the net time that passed between starting
;    and stopping the timer.
;
; Note: If longer than about 54 ms passes between ZTimerOn and
;    ZTimerOff calls, the timer turns over and the count is
;    inaccurate. When this happens, an error message is displayed
;    instead of a count. The long-period Zen timer should be used
;    in such cases.
;
; Note: Interrupts *MUST* be left off between calls to ZTimerOn
;    and ZTimerOff for accurate timing and for detection of
;    timer overflow.
;
; Note: These routines can introduce slight inaccuracies into the
;    system clock count for each code section timed even if
;    timer 0 doesn't overflow. If timer 0 does overflow, the
;    system clock can become slow by virtually any amount of
;    time, since the system clock can't advance while the
;    precision timer is timing. Consequently, it's a good idea
;    to reboot at the end of each timing session. (The
;    battery-backed clock, if any, is not affected by the Zen
;    timer.)
;
; All registers, and all flags except the interrupt flag, are
; preserved by all routines. Interrupts are enabled and then disabled
; by ZTimerOn, and are restored by ZTimerOff to the state they were
; in when ZTimerOn was called.
;

Code segment word public 'CODE'
     assume       cs:Code, ds:nothing
     public       ZTimerOn, ZTimerOff, ZTimerReport

;
; Base address of the 8253 timer chip.
;
BASE_8253        equ 40h
;
; The address of the timer 0 count registers in the 8253.
;
TIMER_0_8253     equ BASE_8253 + 0
;
; The address of the mode register in the 8253.
;
MODE_8253        equ BASE_8253 + 3
;
; The address of Operation Command Word 3 in the 8259 Programmable
; Interrupt Controller (PIC) (write only, and writable only when
; bit 4 of the byte written to this address is 0 and bit 3 is 1).
;
OCW3              equ 20h
;
; The address of the Interrupt Request register in the 8259 PIC
; (read only, and readable only when bit 1 of OCW3 = 1 and bit 0
; of OCW3 = 0).
;
IRR               equ 20h
;
; Macro to emulate a POPF instruction in order to fix the bug in some
; 80286 chips which allows interrupts to occur during a POPF even when
; interrupts remain disabled.
;
MPOPF macro
      local p1, p2
      jmp short p2
p1:   iret             ; jump to pushed address & pop flags
p2:   push cs          ; construct far return address to
      call p1          ; the next instruction
      endm

;
; Macro to delay briefly to ensure that enough time has elapsed
; between successive I/O accesses so that the device being accessed
; can respond to both accesses even on a very fast PC.
;
DELAY macro
      jmp     $+2
      jmp     $+2
      jmp     $+2
      endm

OriginalFlags   db    ?    ; storage for upper byte of
                           ; FLAGS register when
                           ; ZTimerOn called
TimedCount      dw    ?    ; timer 0 count when the timer
                           ; is stopped
ReferenceCount  dw    ?    ; number of counts required to
                           ; execute timer overhead code
OverflowFlag    db    ?    ; used to indicate whether the
                           ; timer overflowed during the
                           ; timing interval
;
; String printed to report results.
;
OutputStr  label byte
           db    0dh, 0ah, 'Timed count: ', 5 dup (?)
ASCIICountEnd    label byte
           db    ' microseconds', 0dh, 0ah
           db    '$'
;
; String printed to report timer overflow.
;
OverflowStr label byte
      db    0dh, 0ah
      db    '****************************************************'
      db    0dh, 0ah
      db    '* The timer overflowed, so the interval timed was  *'
      db    0dh, 0ah
      db    '* too long for the precision timer to measure.     *'
      db    0dh, 0ah
      db    '* Please perform the timing test again with the    *'
      db    0dh, 0ah
      db    '* long-period timer.                               *'
      db    0dh, 0ah
      db    '****************************************************'
      db    0dh, 0ah
      db    '$'

; ********************************************************************
; * Routine called to start timing.                                  *
; ********************************************************************

ZTimerOn    proc   near

;
; Save the context of the program being timed.
;
   push   ax
   pushf
   pop    ax                       ; get flags so we can keep
                                   ; interrupts off when leaving
                                   ; this routine
   mov    cs:[OriginalFlags],ah    ; remember the state of the
                                   ; Interrupt flag
   and    ah,0fdh                  ; set pushed interrupt flag
                                   ; to 0
   push   ax
;
; Turn on interrupts, so the timer interrupt can occur if it's
; pending.
;
     sti
;
; Set timer 0 of the 8253 to mode 2 (divide-by-N), to cause
; linear counting rather than count-by-two counting. Also
; leaves the 8253 waiting for the initial timer 0 count to
; be loaded.
;
     mov  al,00110100b               ;mode 2
     out  MODE_8253,al
;
; Set the timer count to 0, so we know we won't get another
; timer interrupt right away.
; Note: this introduces an inaccuracy of up to 54 ms in the system
; clock count each time it is executed.
;
     DELAY
     sub     al,al
     out     TIMER_0_8253,al     ;lsb
     DELAY
     out     TIMER_0_8253,al     ;msb
;
; Wait before clearing interrupts to allow the interrupt generated
; when switching from mode 3 to mode 2 to be recognized. The delay
; must be at least 210 ns long to allow time for that interrupt to
; occur. Here, 10 jumps are used for the delay to ensure that the
; delay time will be more than long enough even on a very fast PC.
;
    rept 10
    jmp   $+2
    endm
;
; Disable interrupts to get an accurate count.
;
     cli
;
; Set the timer count to 0 again to start the timing interval.
;
      mov  al,00110100b        ; set up to load initial
      out  MODE_8253,al        ; timer count
      DELAY
      sub  al,al
      out  TIMER_0_8253,al     ; load count lsb
      DELAY
      out  TIMER_0_8253,al     ; load count msb
;
; Restore the context and return.
;
     MPOPF                   ; keeps interrupts off
     pop   ax
     ret

ZTimerOn     endp

;********************************************************************
;* Routine called to stop timing and get count.                     *
;********************************************************************

ZTimerOff proc     near

;
; Save the context of the program being timed.
;
     push    ax
     push    cx
     pushf
;
; Latch the count.
;
     mov  al,00000000b     ; latch timer 0
     out  MODE_8253,al
;
; See if the timer has overflowed by checking the 8259 for a pending
; timer interrupt.
;
     mov   al,00001010b        ; OCW3, set up to read
     out   OCW3,al             ; Interrupt Request register
     DELAY
     in    al,IRR              ; read Interrupt Request
                               ; register
     and   al,1                ; set AL to 1 if IRQ0 (the
                               ; timer interrupt) is pending
     mov   cs:[OverflowFlag],al; store the timer overflow
                               ; status
;
; Allow interrupts to happen again.
;
      sti
;
; Read out the count we latched earlier.
;
     in     al,TIMER_0_8253   ; least significant byte
     DELAY
     mov    ah,al
     in     al,TIMER_0_8253   ; most significant byte
     xchg   ah,al
     neg    ax                ; convert from countdown
                              ; remaining to elapsed
                              ; count
     mov    cs:[TimedCount],ax
; Time a zero-length code fragment, to get a reference for how
; much overhead this routine has. Time it 16 times and average it,
; for accuracy, rounding the result.
;
     mov   cs:[ReferenceCount],0
     mov   cx,16
     cli                ; interrupts off to allow a
                        ; precise reference count
 RefLoop:
     call   ReferenceZTimerOn
     call   ReferenceZTimerOff
     loop   RefLoop
     sti
     add    cs:[ReferenceCount],8; total + (0.5 * 16)
     mov    cl,4
     shr    cs:[ReferenceCount],cl; (total) / 16 + 0.5
;
; Restore original interrupt state.
;
     pop    ax                    ; retrieve flags when called
     mov    ch,cs:[OriginalFlags] ; get back the original upper
                                  ; byte of the FLAGS register
     and    ch,not 0fdh           ; only care about original
                                  ; interrupt flag...
     and    ah,0fdh               ; ...keep all other flags in
                                  ; their current condition
     or     ah,ch                 ; make flags word with original
                                  ; interrupt flag
     push   ax                    ; prepare flags to be popped
;
; Restore the context of the program being timed and return to it.
;
    MPOPF                      ; restore the flags with the
                               ; original interrupt state
    pop    cx
    pop    ax
    ret

ZTimerOff  endp

;
; Called by ZTimerOff to start timer for overhead measurements.
;

ReferenceZTimerOn proc  near
;
; Save the context of the program being timed.
;
      push  ax
      pushf     ; interrupts are already off
;
; Set timer 0 of the 8253 to mode 2 (divide-by-N), to cause
; linear counting rather than count-by-two counting.
;
   mov    al,00110100b    ; set up to load
   out    MODE_8253,al    ; initial timer count
   DELAY
;
; Set the timer count to 0.
;
     sub    al,al
     out    TIMER_0_8253,al; load count lsb
     DELAY
     out    TIMER_0_8253,al; load count msb
;
; Restore the context of the program being timed and return to it.
;
     MPOPF
     pop    ax
     ret

ReferenceZTimerOn endp

;
; Called by ZTimerOff to stop timer and add result to ReferenceCount
; for overhead measurements.
;

ReferenceZTimerOff proc     near
;
; Save the context of the program being timed.
;
      push   ax
      push   cx
      pushf
;
; Latch the count and read it.
;
     mov   al,00000000b        ; latch timer 0
     out   MODE_8253,al
     DELAY
     in    al,TIMER_0_8253     ; lsb
     DELAY
     mov   ah,al
     in    al,TIMER_0_8253     ; msb
     xchg  ah,al
     neg   ax                  ; convert from countdown
                               ; remaining to amount
                               ; counted down
     add   cs:[ReferenceCount],ax
;
; Restore the context of the program being timed and return to it.
;
    MPOPF
    pop    cx
    pop    ax
    ret

ReferenceZTimerOff endp

; ********************************************************************
; * Routine called to report timing results.                         *
; ********************************************************************

ZTimerReport proc    near

       pushf
       push  ax
       push  bx
       push  cx
       push  dx
       push  si
       push  ds
;
       push       cs     ; DOS functions require that DS point
       pop        ds     ; to text to be displayed on the screen
       assume     ds:Code
;
; Check for timer 0 overflow.
;
     cmp  [OverflowFlag],0
     jz   PrintGoodCount
     mov  dx,offset OverflowStr
     mov  ah,9
     int  21h
     jmp  short EndZTimerReport
;
; Convert net count to decimal ASCII in microseconds.
;
PrintGoodCount:
     mov   ax,[TimedCount]
     sub   ax,[ReferenceCount]
     mov   si,offset ASCIICountEnd - 1
;
; Convert count to microseconds by multiplying by .8381.
;
     mov   dx, 8381
     mul   dx
     mov   bx, 10000
     div   bx                ;* .8381 = * 8381 / 10000
;
; Convert time in microseconds to 5 decimal ASCII digits.
;
     mov   bx, 10
     mov   cx, 5
CTSLoop:
     sub   dx, dx
     div   bx
      add  dl,'0'
     mov   [si],dl
     dec   si
     loop  CTSLoop
;
; Print the results.
;
     mov   ah, 9
     mov   dx, offset OutputStr
     int   21h
;
EndZTimerReport:
     pop   ds
     pop   si
     pop   dx
     pop   cx
     pop   bx
     pop   ax
     MPOPF
     ret

ZTimerReport  endp

Code   ends
       end

The Zen Timer Is a Means, Not an End

We’re going to spend the rest of this chapter seeing what the Zen timer can do, examining how it works, and learning how to use it. I’ll be using the Zen timer again and again over the course of this book, so it’s essential that you learn what the Zen timer can do and how to use it. On the other hand, it is by no means essential that you understand exactly how the Zen timer works. (Interesting, yes; essential, no.)

In other words, the Zen timer isn’t really part of the knowledge we seek; rather, it’s one tool with which we’ll acquire that knowledge. Consequently, you shouldn’t worry if you don’t fully grasp the inner workings of the Zen timer. Instead, focus on learning how to use it, and you’ll be on the right road.

Starting the Zen Timer

ZTimerOn is called at the start of a segment of code to be timed. ZTimerOn saves the context of the calling code, disables interrupts, sets timer 0 of the 8253 to mode 2 (divide-by-N mode), sets the initial timer count to 0, restores the context of the calling code, and returns. (I’d like to note that while Intel’s documentation for the 8253 seems to indicate that a timer won’t reset to 0 until it finishes counting down, in actual practice, timers seem to reset to 0 as soon as they’re loaded.)

Two aspects of ZTimerOn are worth discussing further. One point of interest is that ZTimerOn disables interrupts. (ZTimerOff later restores interrupts to the state they were in when ZTimerOn was called.) Were interrupts not disabled by ZTimerOn, keyboard, mouse, timer, and other interrupts could occur during the timing interval, and the time required to service those interrupts would incorrectly and erratically appear to be part of the execution time of the code being measured. As a result, code timed with the Zen timer should not expect any hardware interrupts to occur during the interval between any call to ZTimerOn and the corresponding call to ZTimerOff, and should not enable interrupts during that time.

Time and the PC

A second interesting point about ZTimerOn is that it may introduce some small inaccuracy into the system clock time whenever it is called. To understand why this is so, we need to examine the way in which both the 8253 and the PC’s system clock (which keeps the current time) work.

The 8253 actually contains three timers, as shown in Figure 3.1. All three timers are driven by the system board’s 14.31818 MHz crystal, divided by 12 to yield a 1.19318 MHz clock to the timers, so the timers count once every 838.1 ns. Each of the three timers counts down in a programmable way, generating a signal on its output pin when it counts down to 0. Each timer is capable of being halted at any time via a 0 level on its gate input; when a timer’s gate input is 1, that timer counts constantly. All in all, the 8253’s timers are inherently very flexible timing devices; unfortunately, much of that flexibility depends on how the timers are connected to external circuitry, and in the PC the timers are connected with specific purposes in mind.

Timer 2 drives the speaker, although it can be used for other timing purposes when the speaker is not in use. As shown in Figure 3.1, timer 2 is the only timer with a programmable gate input in the PC; that is, timer 2 is the only timer that can be started and stopped under program control in the manner specified by Intel. On the other hand, the output of timer 2 is connected to nothing other than the speaker. In particular, timer 2 cannot generate an interrupt to get the 8088’s attention.

Timer 1 is dedicated to providing dynamic RAM refresh, and should not be tampered with lest system crashes result.

Figure 3.1  The configuration of the 8253 timer chip in the PC.

Finally, timer 0 is used to drive the system clock. As programmed by the BIOS at power-up, every 65,536 (64K) counts, or 54.925 milliseconds, timer 0 generates a rising edge on its output line. (A millisecond is one-thousandth of a second, and is abbreviated ms.) This line is connected to the hardware interrupt 0 (IRQ0) line on the system board, so every 54.925 ms, timer 0 causes hardware interrupt 0 to occur.

The interrupt vector for IRQ0 is set by the BIOS at power-up time to point to a BIOS routine, TIMER_INT, that maintains a time-of-day count. TIMER_INT keeps a 16-bit count of IRQ0 interrupts in the BIOS data area at address 0000:046C (all addresses in this book are given in segment:offset hexadecimal pairs); this count turns over once an hour (less a few microseconds), and when it does, TIMER_INT updates a 16-bit hour count at address 0000:046E in the BIOS data area. This count is the basis for the current time and date that DOS supports via functions 2AH (2A hexadecimal) through 2DH and by way of the DATE and TIME commands.
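
Should you want to peek at those counts yourself, they can be read directly from the BIOS data area; the fragment below is a sketch for casual inspection only (no such code appears in the Zen timer, and the two reads are not protected against an interrupt occurring between them):

     sub   ax,ax
     mov   es,ax                     ; point ES at segment 0
     mov   ax,word ptr es:[46ch]     ; 16-bit count of IRQ0 interrupts
     mov   dx,word ptr es:[46eh]     ; 16-bit hour count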

Each timer channel of the 8253 can operate in any of six modes. Timer 0 normally operates in mode 3: square wave mode. In square wave mode, the initial count is counted down two at a time; when the count reaches zero, the output state is changed. The initial count is again counted down two at a time, and the output state is toggled back when the count reaches zero. The result is a square wave that changes state more slowly than the input clock by a factor of the initial count. In its normal mode of operation, timer 0 generates an output pulse that is low for about 27.5 ms and high for about 27.5 ms; this pulse is sent to the 8259 interrupt controller, and its rising edge generates a timer interrupt once every 54.925 ms.

Square wave mode is not very useful for precision timing because it counts down by two twice per timer interrupt, thereby rendering exact timings impossible. Fortunately, the 8253 offers another timer mode, mode 2 (divide-by-N mode), which is both a good substitute for square wave mode and a perfect mode for precision timing.

Divide-by-N mode counts down by one from the initial count. When the count reaches zero, the timer turns over and starts counting down again without stopping, and a pulse is generated for a single clock period. While the pulse is not held for nearly as long as in square wave mode, it doesn’t matter, since the 8259 interrupt controller is configured in the PC to be edge-triggered and hence cares only about the existence of a pulse from timer 0, not the duration of the pulse. As a result, timer 0 continues to generate timer interrupts in divide-by-N mode, and the system clock continues to maintain good time.
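
For reference, the mode 2 control word that the Zen timer writes to the 8253’s mode register (the value 00110100b that appears throughout Listing 3.1) breaks down as follows; this is standard 8253 programming, shown here only to tie the listing to the discussion above:

     bits 7-6 = 00     select timer 0
     bits 5-4 = 11     read/write least-significant byte, then most-significant byte
     bits 3-1 = 010    mode 2 (divide-by-N)
     bit  0   = 0      binary counting (not BCD)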

Why not use timer 2 instead of timer 0 for precision timing? After all, timer 2 has a programmable gate input and isn’t used for anything but sound generation. The problem with timer 2 is that its output can’t generate an interrupt; in fact, timer 2 can’t do anything but drive the speaker. We need the interrupt generated by the output of timer 0 to tell us when the count has overflowed, and we will see shortly that the timer interrupt also makes it possible to time much longer periods than the Zen timer shown in Listing 3.1 supports.

In fact, the Zen timer shown in Listing 3.1 can only time intervals of up to about 54 ms in length, since that is the period of time that can be measured by timer 0 before its count turns over and repeats. Fifty-four ms may not seem like a very long time, but even a CPU as slow as the 8088 can perform more than 1,000 divides in 54 ms, and division is the single instruction that the 8088 performs most slowly. If a measured period turns out to be longer than 54 ms (that is, if timer 0 has counted down and turned over), the Zen timer will display a message to that effect. A long-period Zen timer for use in such cases will be presented later in this chapter.
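
For perspective, here is the arithmetic behind that claim, using the 4.77 MHz PC that appears later in this chapter:

     54 ms = 54,000 µs
     54,000 µs x 4.77 cycles/µs = roughly 258,000 cycles
     258,000 cycles / 1,000 divides = 258 cycles available per divide

which is comfortably more than even the 8088’s slowest divide requires.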

The Zen timer determines whether timer 0 has turned over by checking to see whether an IRQ0 interrupt is pending. (Remember, interrupts are off while the Zen timer runs, so the timer interrupt cannot be recognized until the Zen timer stops and enables interrupts.) If an IRQ0 interrupt is pending, then timer 0 has turned over and generated a timer interrupt. Recall that ZTimerOn initially sets timer 0 to 0, in order to allow for the longest possible period—about 54 ms—before timer 0 reaches 0 and generates the timer interrupt.

Now we’re ready to look at the ways in which the Zen timer can introduce inaccuracy into the system clock. Since timer 0 is initially set to 0 by the Zen timer, and since the system clock ticks only when timer 0 counts off 54.925 ms and reaches 0 again, an average inaccuracy of one-half of 54.925 ms, or about 27.5 ms, is incurred each time the Zen timer is started. In addition, a timer interrupt is generated when timer 0 is switched from mode 3 to mode 2, advancing the system clock by up to 54.925 ms, although this only happens the first time the Zen timer is run after a warm or cold boot. Finally, up to 54.925 ms can again be lost when ZTimerOff is called, since that routine again sets the timer count to zero. Net result: The system clock will run up to 110 ms (about a ninth of a second) slow each time the Zen timer is used.

Potentially far greater inaccuracy can be incurred by timing code that takes longer than about 110 ms to execute. Recall that all interrupts, including the timer interrupt, are disabled while timing code with the Zen timer. The 8259 interrupt controller is capable of remembering at most one pending timer interrupt, so all timer interrupts after the first one during any given Zen timing interval are ignored. Consequently, if a timing interval exceeds 54.9 ms, the system clock effectively stops 54.9 ms after the timing interval starts and doesn’t restart until the timing interval ends, losing time all the while.

The effects on the system time of the Zen timer aren’t a matter for great concern, as they are temporary, lasting only until the next warm or cold boot. Systems that have battery-backed clocks (AT-style machines; that is, virtually all machines in common use) automatically set the correct time whenever the computer is booted, and systems without battery-backed clocks prompt for the correct date and time when booted. Also, repeated use of the Zen timer usually makes the system clock slow by at most a total of a few seconds, unless code that takes much longer than 54 ms to run is timed (in which case the Zen timer will notify you that the code is too long to time).

Nonetheless, it’s a good idea to reboot your computer at the end of each session with the Zen timer in order to make sure that the system clock is correct.

Stopping the Zen Timer

At some point after ZTimerOn is called, ZTimerOff must always be called to mark the end of the timing interval. ZTimerOff saves the context of the calling program, latches and reads the timer 0 count, converts that count from the countdown value that the timer maintains to the number of counts elapsed since ZTimerOn was called, and stores the result. Immediately after latching the timer 0 count—and before enabling interrupts—ZTimerOff checks the 8259 interrupt controller to see if there is a pending timer interrupt, setting a flag to mark that the timer overflowed if there is indeed a pending timer interrupt.

After that, ZTimerOff executes just the overhead code of ZTimerOn and ZTimerOff 16 times, and averages and saves the results in order to determine how many of the counts in the timing result just obtained were incurred by the overhead of the Zen timer rather than by the code being timed.

Finally, ZTimerOff restores the context of the calling program, including the state of the interrupt flag that was in effect when ZTimerOn was called to start timing, and returns.

One interesting aspect of ZTimerOff is the manner in which timer 0 is stopped in order to read the timer count. We don’t actually have to stop timer 0 to read the count; the 8253 provides a special latched read feature for the specific purpose of reading the count while a timer is running. (That’s a good thing, too; we have no documented way to stop timer 0 even if we wanted to, since its gate input isn’t connected. Later in this chapter, though, we’ll see that timer 0 can be stopped after all.) We simply tell the 8253 to latch the current count, and the 8253 does so without breaking stride.

Reporting Timing Results

ZTimerReport may be called to display timing results at any time after both ZTimerOn and ZTimerOff have been called. ZTimerReport first checks to see whether the timer overflowed (counted down to 0 and turned over) before ZTimerOff was called; if overflow did occur, ZTimerReport prints a message to that effect and returns. Otherwise, ZTimerReport subtracts the reference count (representing the overhead of the Zen timer) from the count measured between the calls to ZTimerOn and ZTimerOff, converts the result from timer counts to microseconds, and prints the resulting time in microseconds to the standard output.

Note that ZTimerReport need not be called immediately after ZTimerOff. In fact, after a given call to ZTimerOff, ZTimerReport can be called at any time right up until the next call to ZTimerOn.

You may want to use the Zen timer to measure several portions of a program while it executes normally, in which case it may not be desirable to have the text printed by ZTimerReport interfere with the program’s normal display. There are many ways to deal with this. One approach is removal of the invocations of the DOS print string function (INT 21H with AH equal to 9) from ZTimerReport, instead running the program under a debugger that supports screen flipping (such as Turbo Debugger or CodeView), placing a breakpoint at the start of ZTimerReport, and directly observing the count in microseconds as ZTimerReport calculates it.

A second approach is modification of ZTimerReport to place the result at some safe location in memory, such as an unused portion of the BIOS data area.

A third approach is alteration of ZTimerReport to print the result over a serial port to a terminal or to another PC acting as a terminal. Similarly, many debuggers can be run from a remote terminal via a serial link.

Yet another approach is modification of ZTimerReport to send the result to the printer via either DOS function 5 or BIOS interrupt 17H.

A final approach is to modify ZTimerReport to print the result to the auxiliary output via DOS function 4, and to then write and load a special device driver named AUX, to which DOS function 4 output would automatically be directed. This device driver could send the result anywhere you might desire. The result might go to the secondary display adapter, over a serial port, or to the printer, or could simply be stored in a buffer within the driver, to be dumped at a later time. (Credit for this final approach goes to Michael Geary, and thanks go to David Miller for passing the idea on to me.)

You may well want to devise still other approaches better suited to your needs than those I’ve presented. Go to it! I’ve just thrown out a few possibilities to get you started.

Notes on the Zen Timer

The Zen timer subroutines are designed to be near-called from assembly language code running in the public segment Code. The Zen timer subroutines can, however, be called from any assembly or high-level language code that generates OBJ files that are compatible with the Microsoft linker, simply by modifying the segment that the timer code runs in to match the segment used by the code being timed, or by changing the Zen timer routines to far procedures and making far calls to the Zen timer code from the code being timed, as discussed at the end of this chapter. All three subroutines preserve all registers and all flags except the interrupt flag, so calls to these routines are transparent to the calling code.

If you do change the Zen timer routines to far procedures in order to call them from code running in another segment, be sure to make all the Zen timer routines far, including ReferenceZTimerOn and ReferenceZTimerOff. (You’ll have to put FAR PTR overrides on the calls from ZTimerOff to the latter two routines if you do make them far.) If the reference routines aren’t the same type—near or far—as the other routines, they won’t reflect the true overhead incurred by starting and stopping the Zen timer.
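
In outline, the conversion looks like this (a sketch only; the corresponding edits are scattered throughout Listing 3.1, and ZTimerReport and the two reference routines must be converted in the same way):

ZTimerOn     proc   far                        ; was "near"
     ...
ZTimerOn     endp

ZTimerOff    proc   far                        ; was "near"
     ...
     call   far ptr ReferenceZTimerOn          ; FAR PTR overrides, since the
     call   far ptr ReferenceZTimerOff         ; reference routines are now far
     ...
ZTimerOff    endp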

Please be aware that the inaccuracy that the Zen timer can introduce into the system clock time does not affect the accuracy of the performance measurements reported by the Zen timer itself. The 8253 counts once every 838 ns, giving us a count resolution of about 1 µs, although factors such as the prefetch queue (as discussed below), dynamic RAM refresh, and internal timing variations in the 8253 make it perhaps more accurate to describe the Zen timer as measuring code performance with an accuracy of better than 10 µs. In fact, the Zen timer is actually most accurate in assessing code performance when timing intervals longer than about 100 µs. At any rate, we’re most interested in using the Zen timer to assess the relative performance of various code sequences—that is, using it to compare and tweak code—and the timer is more than accurate enough for that purpose.

The Zen timer works on all PC-compatible computers I’ve tested it on, including XTs, ATs, PS/2 computers, and 386, 486, and Pentium-based machines. Of course, I haven’t been able to test it on all PC-compatibles, but I don’t expect any problems; computers on which the Zen timer doesn’t run can’t truly be called “PC-compatible.”

On the other hand, there is certainly no guarantee that code performance as measured by the Zen timer will be the same on compatible computers as on genuine IBM machines, or that either absolute or relative code performance will be similar even on different IBM models; in fact, quite the opposite is true. For example, every PS/2 computer, even the relatively slow Model 30, executes code much faster than does a PC or XT. As another example, I set out to do the timings for my earlier book Zen of Assembly Language on an XT-compatible computer, only to find that the computer wasn’t quite IBM-compatible regarding code performance. The differences were minor, mind you, but my experience illustrates the risk of assuming that a specific make of computer will perform in a certain way without actually checking.

Not that this variation between models makes the Zen timer one whit less useful—quite the contrary. The Zen timer is an excellent tool for evaluating code performance over the entire spectrum of PC-compatible computers.

A Sample Use of the Zen Timer

Listing 3.2 shows a test-bed program for measuring code performance with the Zen timer. This program sets DS equal to CS (for reasons we’ll discuss shortly), includes the code to be measured from the file TESTCODE, and calls ZTimerReport to display the timing results. Consequently, the code being measured should be in the file TESTCODE, and should contain calls to ZTimerOn and ZTimerOff.

LISTING 3.2 PZTEST.ASM

; Program to measure performance of code that takes less than
; 54 ms to execute. (PZTEST.ASM)
;
; Link with PZTIMER.ASM (Listing 3.1). PZTEST.BAT (Listing 3.4)
; can be used to assemble and link both files. Code to be
; measured must be in the file TESTCODE; Listing 3.3 shows
; a sample TESTCODE file.
;
; By Michael Abrash
;
mystack   segment  para stack 'STACK'
      db  512 dup(?)
mystack   ends
;
Code  segment   para public 'CODE'
      assume    cs:Code, ds:Code
      extrn     ZTimerOn:near, ZTimerOff:near, ZTimerReport:near
Start proc near
      push cs
      pop  ds    ; set DS to point to the code segment,
                 ; so data as well as code can easily
                 ; be included in TESTCODE
;
      include    TESTCODE ;code to be measured, including
                 ; calls to ZTimerOn and ZTimerOff
;
; Display the results.
;
    call   ZTimerReport
;
; Terminate the program.
;
       mov   ah,4ch
       int   21h
Start endp
Code  ends
      end  Start

Listing 3.3 shows some sample code to be timed. This listing measures the time required to execute 1,000 loads of AL from the memory variable MemVar. Note that Listing 3.3 calls ZTimerOn to start timing, performs 1,000 MOV instructions in a row, and calls ZTimerOff to end timing. When Listing 3.3 is named TESTCODE and included by Listing 3.2, Listing 3.2 calls ZTimerReport to display the execution time after the code in Listing 3.3 has been run.

LISTING 3.3 LST3-3.ASM

; Test file;
; Measures the performance of 1,000 loads of AL from
; memory. (Use by renaming to TESTCODE, which is
; included by PZTEST.ASM (Listing 3.2). PZTIME.BAT
; (Listing 3.4) does this, along with all assembly
; and linking.)
;
jmp   Skip     ;jump around defined data
;
MemVar db      ?
;
Skip:
;
; Start timing.
;
      call  ZTimerOn
;
      rept  1000
      mov al,[MemVar]
      endm
;
; Stop timing.
;
    call  ZTimerOff

It’s worth noting that Listing 3.3 begins by jumping around the memory variable MemVar. This approach lets us avoid reproducing Listing 3.2 in its entirety for each code fragment we want to measure; by defining any needed data right in the code segment and jumping around that data, each listing becomes self-contained and can be plugged directly into Listing 3.2 as TESTCODE. Listing 3.2 sets DS equal to CS before doing anything else precisely so that data can be embedded in code fragments being timed. Note that only after the initial jump is performed in Listing 3.3 is the Zen timer started, since we don’t want to include the execution time of start-up code in the timing interval. That’s why the calls to ZTimerOn and ZTimerOff are in TESTCODE, not in PZTEST.ASM; this way, we have full control over which portion of TESTCODE is timed, and we can keep set-up code and the like out of the timing interval.

Listing 3.3 is used by naming it TESTCODE, assembling both Listing 3.2 (which includes TESTCODE) and Listing 3.1 with TASM or MASM, and linking the two resulting OBJ files together by way of the Borland or Microsoft linker. Listing 3.4 shows a batch file, PZTIME.BAT, which does all that; when run, this batch file generates and runs the executable file PZTEST.EXE. PZTIME.BAT (Listing 3.4) assumes that the file PZTIMER.ASM contains Listing 3.1, and the file PZTEST.ASM contains Listing 3.2. The command-line parameter to PZTIME.BAT is the name of the file to be copied to TESTCODE and included into PZTEST.ASM. (Note that Turbo Assembler can be substituted for MASM by replacing “masm” with “tasm” and “link” with “tlink” in Listing 3.4. The same is true of Listing 3.7.)

LISTING 3.4 PZTIME.BAT

echo off
rem
rem *** Listing 3.4 ***
rem
rem ***************************************************************
rem * Batch file PZTIME.BAT, which builds and runs the precision  *
rem * Zen timer program PZTEST.EXE to time the code named as the  *
rem * command-line parameter. Listing 3.1 must be named           *
rem * PZTIMER.ASM, and Listing 3.2 must be named PZTEST.ASM. To   *
rem * time the code in LST3-3, you'd type the DOS command:        *
rem *                                                             *
rem * pztime lst3-3                                               *
rem *                                                             *
rem * Note that MASM and LINK must be in the current directory or *
rem * on the current path in order for this batch file to work.   *
rem *                                                             *
rem * This batch file can be speeded up by assembling PZTIMER.ASM *
rem * once, then removing the lines:                              *
rem *                                                             *
rem * masm pztimer;                                               *
rem * if errorlevel 1 goto errorend                               *
rem *                                                             *
rem * from this file.                                             *
rem *                                                             *
rem * By Michael Abrash                                           *
rem ***************************************************************
rem
rem Make sure a file to test was specified.
rem
if not x%1==x goto ckexist
echo ***************************************************************
echo * Please specify a file to test.                              *
echo ***************************************************************
goto end
rem
rem Make sure the file exists.
rem
:ckexist
if exist %1 goto docopy
echo ***************************************************************
echo * The specified file, "%1," doesn't exist.                    *
echo ***************************************************************
goto end
rem
rem copy the file to measure to TESTCODE.
rem
:docopy
copy %1 testcode
masm pztest;
if errorlevel 1 goto errorend
masm pztimer;
if errorlevel 1 goto errorend
link pztest+pztimer;
if errorlevel 1 goto errorend
pztest
goto end
:errorend
echo ***************************************************************
echo * An error occurred while building the precision Zen timer.   *
echo ***************************************************************
:end

Assuming that Listing 3.3 is named LST3-3.ASM and Listing 3.4 is named PZTIME.BAT, the code in Listing 3.3 would be timed with the command:

pztime LST3-3.ASM

which performs all assembly and linking, and reports the execution time of the code in Listing 3.3.

When the above command is executed on an original 4.77 MHz IBM PC, the time reported by the Zen timer is 3619 µs, or about 3.62 µs per load of AL from memory. (While the exact number is 3.619 µs per load of AL, I’m going to round off that last digit from now on. No matter how many repetitions of a given instruction are timed, there’s just too much noise in the timing process—between dynamic RAM refresh, the prefetch queue, and the internal state of the processor at the start of timing—for that last digit to have any significance.) Given the test PC’s 4.77 MHz clock, this works out to about 17 cycles per MOV, which is actually a good bit longer than Intel’s specified 10-cycle execution time for this instruction. (See the MASM or TASM documentation, or Intel’s processor reference manuals, for official execution times.) Fear not, the Zen timer is right—MOV AL,[MEMVAR] really does take 17 cycles as used in Listing 3.3. Exactly why that is so is just what this book is all about.
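
The conversion from the reported time to a cycle count is straightforward, by the way:

     3,619 µs / 1,000 repetitions = 3.619 µs per MOV
     3.619 µs x 4.77 cycles/µs = about 17.3 cycles per MOV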

In order to perform any of the timing tests in this book, enter Listing 3.1 and name it PZTIMER.ASM, enter Listing 3.2 and name it PZTEST.ASM, and enter Listing 3.4 and name it PZTIME.BAT. Then simply enter the listing you wish to time into a file of your choice and enter the command:

pztime <filename>

In fact, that’s exactly how I timed each of the listings in this book. Code fragments you write yourself can be timed in just the same way. If you wish to time code directly in place in your programs, rather than in the test-bed program of Listing 3.2, simply insert calls to ZTimerOn, ZTimerOff, and ZTimerReport in the appropriate places and link PZTIMER to your program.
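
In skeleton form, timing code in place looks something like the following (the surrounding program is hypothetical; only the EXTRN declaration and the three calls matter, and the code being timed must obey the usual rules: take less than about 54 ms, leave interrupts alone during the timed interval, and either run in the public segment Code or call far versions of the timer routines, as discussed above):

     extrn  ZTimerOn:near, ZTimerOff:near, ZTimerReport:near
     ...
     call   ZTimerOn
     ; ...the code you want to measure goes here...
     call   ZTimerOff
     ...
     call   ZTimerReport        ; any time before the next call to ZTimerOn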

The Long-Period Zen Timer

With a few exceptions, the Zen timer presented above will serve us well for the remainder of this book since we’ll be focusing on relatively short code sequences that generally take much less than 54 ms to execute. Occasionally, however, we will need to time longer intervals. What’s more, it is very likely that you will want to time code sequences longer than 54 ms at some point in your programming career. Accordingly, I’ve also developed a Zen timer for periods longer than 54 ms. The long-period Zen timer (so named by contrast with the precision Zen timer just presented) shown in Listing 3.5 can measure periods up to one hour in length.

The key difference between the long-period Zen timer and the precision Zen timer is that the long-period timer leaves interrupts enabled during the timing period. As a result, timer interrupts are recognized by the PC, allowing the BIOS to maintain an accurate system clock time over the timing period. Theoretically, this enables measurement of arbitrarily long periods. Practically speaking, however, there is no need for a timer that can measure more than a few minutes, since the DOS time of day and date functions (or, indeed, the DATE and TIME commands in a batch file) serve perfectly well for longer intervals. Since very long timing intervals aren’t needed, the long-period Zen timer uses a simplified means of calculating elapsed time that is limited to measuring intervals of an hour or less. If a period longer than an hour is timed, the long-period Zen timer prints a message to the effect that it is unable to time an interval of that length.

For implementation reasons, the long-period Zen timer is also incapable of timing code that starts before midnight and ends after midnight; if that eventuality occurs, the long-period Zen timer reports that it was unable to time the code because midnight was crossed. If this happens to you, just time the code again, secure in the knowledge that at least you won’t run into the problem again for 23-odd hours.

You should not use the long-period Zen timer to time code that requires interrupts to be disabled for more than 54 ms at a stretch during the timing interval, since when interrupts are disabled the long-period Zen timer is subject to the same 54 ms maximum measurement time as the precision Zen timer.

While permitting the timer interrupt to occur allows long intervals to be timed, that same interrupt makes the long-period Zen timer less accurate than the precision Zen timer, since the time the BIOS spends handling timer interrupts during the timing interval is included in the time measured by the long-period timer. Likewise, any other interrupts that occur during the timing interval, most notably keyboard and mouse interrupts, will increase the measured time.

The long-period Zen timer has some of the same effects on the system time as does the precision Zen timer, so it’s a good idea to reboot the system after a session with the long-period Zen timer. The long-period Zen timer does not, however, have the same potential for introducing major inaccuracy into the system clock time during a single timing run since it leaves interrupts enabled and therefore allows the system clock to update normally.

Stopping the Clock

There’s a potential problem with the long-period Zen timer. The problem is this: In order to measure times longer than 54 ms, we must maintain not one but two timing components, the timer 0 count and the BIOS time-of-day count. The time-of-day count measures the passage of 54.9 ms intervals, while the timer 0 count measures time within those 54.9 ms intervals. We need to read the two time components simultaneously in order to get a clean reading. Otherwise, we may read the timer count just before it turns over and generates an interrupt, then read the BIOS time-of-day count just after the interrupt has occurred and caused the time-of-day count to turn over, with a resulting 54 ms measurement inaccuracy. (The opposite sequence—reading the time-of-day count and then the timer count—can result in a 54 ms inaccuracy in the other direction.)

The only way to avoid this problem is to stop timer 0, read both the timer and time-of-day counts while the timer is stopped, and then restart the timer. Alas, the gate input to timer 0 isn’t program-controllable in the PC, so there’s no documented way to stop the timer. (The latched read feature we used in Listing 3.1 doesn’t stop the timer; it latches a count, but the timer keeps running.) What should we do?

As it turns out, an undocumented feature of the 8253 makes it possible to stop the timer dead in its tracks. Setting the timer to a new mode and waiting for an initial count to be loaded causes the timer to stop until the count is loaded. Surprisingly, the timer count remains readable and correct while the timer is waiting for the initial load.
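
In rough outline, the timer-stopping approach works as shown below. This sketch conveys only the idea, not Listing 3.5’s exact code, and it relies on the undocumented behavior just described, so treat it with appropriate caution. (MODE_8253 and TIMER_0_8253 are the same equates used in the timer listings; I/O delays are omitted for brevity.)

     mov   al,00110100b       ; set timer 0 to mode 2 again; the timer
     out   MODE_8253,al       ; now stops until a new count is loaded
     ; ...read the timer 0 count and the BIOS time-of-day count here,
     ; with no danger of the two getting out of sync...
     sub   al,al
     out   TIMER_0_8253,al    ; load the new count LSB
     out   TIMER_0_8253,al    ; load the new count MSB; timer 0 starts
                              ; counting down again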

In my experience, this approach works beautifully with fully 8253-compatible chips. However, there’s no guarantee that it will always work, since it programs the 8253 in an undocumented way. What’s more, IBM chose not to implement compatibility with this particular 8253 feature in the custom chips used in PS/2 computers. On PS/2 computers, we have no choice but to latch the timer 0 count and then stop the BIOS count (by disabling interrupts) as quickly as possible. We’ll just have to accept the fact that on PS/2 computers we may occasionally get a reading that’s off by 54 ms, and leave it at that.

I’ve set up Listing 3.5 so that it can assemble to either use or not use the undocumented timer-stopping feature, as you please. The PS2 equate selects between the two modes of operation. If PS2 is 1 (as it is in Listing 3.5), then the latch-and-read method is used; if PS2 is 0, then the undocumented timer-stop approach is used. The latch-and-read method will work on all PC-compatible computers, but may occasionally produce results that are incorrect by 54 ms. The timer-stop approach avoids synchronization problems, but doesn’t work on all computers.

LISTING 3.5 LZTIMER.ASM

;
; The long-period Zen timer. (LZTIMER.ASM)
; Uses the 8253 timer and the BIOS time-of-day count to time the
; performance of code that takes less than an hour to execute.
; Because interrupts are left on (in order to allow the timer
; interrupt to be recognized), this is less accurate than the
; precision Zen timer, so it is best used only to time code that takes
; more than about 54 milliseconds to execute (code that the precision
; Zen timer reports overflow on). Resolution is limited by the
; occurrence of timer interrupts.
;
; By Michael Abrash
;
; Externally callable routines:
;
;  ZTimerOn: Saves the BIOS time of day count and starts the
;    long-period Zen timer.
;
;  ZTimerOff: Stops the long-period Zen timer and saves the timer
;    count and the BIOS time-of-day count.
;
;  ZTimerReport: Prints the time that passed between starting and
;    stopping the timer.
;
; Note: If either more than an hour passes or midnight falls between
;     calls to ZTimerOn and ZTimerOff, an error is reported. For
;     timing code that takes more than a few minutes to execute,
;     either the DOS TIME command in a batch file before and after
;     execution of the code to time or the use of the DOS
;     time-of-day function in place of the long-period Zen timer is
;     more than adequate.
;
; Note: The PS/2 version is assembled by setting the symbol PS2 to 1.
;     PS2 must be set to 1 on PS/2 computers because the PS/2's
;     timers are not compatible with an undocumented timer-stopping
;     feature of the 8253; the alternative timing approach that
;     must be used on PS/2 computers leaves a short window
;     during which the timer 0 count and the BIOS timer count may
;     not be synchronized. You should also set the PS2 symbol to
;     1 if you're getting erratic or obviously incorrect results.
;
; Note: When PS2 is 0, the code relies on an undocumented 8253
;     feature to get more reliable readings. It is possible that
;     the 8253 (or whatever chip is emulating the 8253) may be put
;     into an undefined or incorrect state when this feature is
;     used.
;
;     ******************************************************************
;     * If your computer displays any hint of erratic behavior         *
;     * after the long-period Zen timer is used, such as the floppy    *
;     * drive failing to operate properly, reboot the system, set      *
;     * PS2 to 1 and leave it that way!                                 *
;     ******************************************************************
;
; Note: Each block of code being timed should ideally be run several
;     times, with at least two similar readings required to
;     establish a true measurement, in order to eliminate any
;     variability caused by interrupts.
;
; Note: Interrupts must not be disabled for more than 54 ms at a
;     stretch during the timing interval. Because interrupts
;     are enabled, keys, mice, and other devices that generate
;     interrupts should not be used during the timing interval.
;
; Note: Any extra code running off the timer interrupt (such as
;     some memory-resident utilities) will increase the time
;     measured by the Zen timer.
;
; Note: These routines can introduce inaccuracies of up to a few
;     tenths of a second into the system clock count for each
;     code section timed. Consequently, it's a good idea to
;     reboot at the conclusion of timing sessions. (The
;     battery-backed clock, if any, is not affected by the Zen
;     timer.)
;
; All registers and all flags are preserved by all routines.
;

Code segment word public 'CODE'
     assume      cs:Code, ds:nothing
     public      ZTimerOn, ZTimerOff, ZTimerReport

;
; Set PS2 to 0 to assemble for use on a fully 8253-compatible
; system; when PS2 is 0, the readings are more reliable if the
; computer supports the undocumented timer-stopping feature,
; but may be badly off if that feature is not supported. In
; fact, timer-stopping may interfere with your computer's
; overall operation by putting the 8253 into an undefined or
; incorrect state. Use with caution!!!
;
; Set PS2 to 1 to assemble for use on non-8253-compatible
; systems, including PS/2 computers; when PS2 is 1, readings
; may occasionally be off by 54 ms, but the code will work
; properly on all systems.
;
; A setting of 1 is safer and will work on more systems,
; while a setting of 0 produces more reliable results in systems
; which support the undocumented timer-stopping feature of the
; 8253. The choice is yours.
;
PS2                  equ    1
;
; Base address of the 8253 timer chip.
;
BASE_8253            equ    40h
;
; The address of the timer 0 count registers in the 8253.
;
TIMER_0_8253         equ    BASE_8253 + 0
;
; The address of the mode register in the 8253.
;
MODE_8253            equ    BASE_8253 + 3
;
; The address of the BIOS timer count variable in the BIOS
; data segment.
;
TIMER_COUNT           equ   46ch
;
; Macro to emulate a POPF instruction in order to fix the bug in some
; 80286 chips which allows interrupts to occur during a POPF even when
; interrupts remain disabled.
;
MPOPF macro
      local p1, p2
      jmp short p2
p1:   iret        ;jump to pushed address & pop flags
p2:   push cs      ;construct far return address to
      call p1     ; the next instruction
      endm

;
; Macro to delay briefly to ensure that enough time has elapsed
; between successive I/O accesses so that the device being accessed
; can respond to both accesses even on a very fast PC.
;
DELAY macro
     jmp $+2
     jmp $+2
     jmp $+2
     endm

StartBIOSCountLow     dw   ?       ;BIOS count low word at the
                                   ; start of the timing period
StartBIOSCountHigh    dw   ?       ;BIOS count high word at the
                                   ; start of the timing period
EndBIOSCountLow       dw   ?       ;BIOS count low word at the
                                   ; end of the timing period
EndBIOSCountHigh      dw   ?       ;BIOS count high word at the
                                   ; end of the timing period
EndTimedCount         dw   ?       ;timer 0 count at the end of
                                   ; the timing period
ReferenceCount        dw   ?       ;number of counts required to
                                   ; execute timer overhead code
;
; String printed to report results.
;
OutputStr label byte
          db     0dh, 0ah, 'Timed count: '
TimedCountStr    db    10 dup (?)
          db     '    microseconds', 0dh, 0ah
          db     '$'
;
; Temporary storage for timed count as it's divided down by powers
; of ten when converting from doubleword binary to ASCII.
;
CurrentCountLow       dw    ?
CurrentCountHigh  dw  ?
;
; Powers of ten table used to perform division by 10 when doing
; doubleword conversion from binary to ASCII.
;
PowersOfTen label word
     dd   1
     dd   10
     dd   100
     dd   1000
     dd   10000
     dd   100000
     dd   1000000
     dd   10000000
     dd   100000000
     dd   1000000000
PowersOfTenEnd   label word
;
; String printed to report that the high word of the BIOS count
; changed while timing (an hour elapsed or midnight was crossed),
; and so the count is invalid and the test needs to be rerun.
;
TurnOverStr label byte
     db 0dh, 0ah
     db '****************************************************'
     db 0dh, 0ah
     db '* Either midnight passed or an hour or more passed *'
     db 0dh, 0ah
     db '* while timing was in progress. If the former was  *'
     db 0dh, 0ah
     db '* the case, please rerun the test; if the latter   *'
     db 0dh, 0ah
     db '* was the case, the test code takes too long to    *'
     db 0dh, 0ah
     db '* run to be timed by the long-period Zen timer.    *'
     db 0dh, 0ah
     db '* Suggestions: use the DOS TIME command, the DOS   *'
     db 0dh, 0ah
     db '* time function, or a watch.                        *'
     db 0dh, 0ah
     db '$'

;********************************************************************
;* Routine called to start timing.         *
;********************************************************************

ZTimerOn  proc near

;
; Save the context of the program being timed.
;
     push ax
     pushf
;
; Set timer 0 of the 8253 to mode 2 (divide-by-N), to cause
; linear counting rather than count-by-two counting. Also stops
; timer 0 until the timer count is loaded, except on PS/2
; computers.
;
     mov  al,00110100b      ;mode 2
     out  MODE_8253,al
;
; Set the timer count to 0, so we know we won't get another
; timer interrupt right away.
; Note: this introduces an inaccuracy of up to 54 ms in the system
; clock count each time it is executed.
;
     DELAY
     sub al,al
     out TIMER_0_8253,al       ;lsb
     DELAY
     out TIMER_0_8253,al       ;msb
;
; In case interrupts are disabled, enable interrupts briefly to allow
; the interrupt generated when switching from mode 3 to mode 2 to be
; recognized. Interrupts must be enabled for at least 210 ns to allow
; time for that interrupt to occur. Here, 10 jumps are used for the
; delay to ensure that the delay time will be more than long enough
; even on a very fast PC.
;
     pushf
     sti
     rept 10
     jmp  $+2
     endm
     MPOPF
;
; Store the timing start BIOS count.
; (Since the timer count was just set to 0, the BIOS count will
; stay the same for the next 54 ms, so we don't need to disable
; interrupts in order to avoid getting a half-changed count.)
;
     push   ds
     sub    ax, ax
     mov    ds, ax
     mov    ax, ds:[TIMER_COUNT+2]
     mov    cs:[StartBIOSCountHigh],ax
     mov    ax, ds:[TIMER_COUNT]
     mov    cs:[StartBIOSCountLow],ax
     pop    ds
;
; Set the timer count to 0 again to start the timing interval.
;
     mov    al,00110100b        ;set up to load initial
     out    MODE_8253,al        ; timer count
     DELAY
     sub    al, al
     out    TIMER_0_8253,al     ;load count lsb
     DELAY
     out    TIMER_0_8253,al     ;load count msb
;
; Restore the context of the program being timed and return to it.
;
     MPOPF
     pop   ax
     ret

ZTimerOn  endp

;********************************************************************
;* Routine called to stop timing and get count.                     *
;********************************************************************

ZTimerOff proc near

;
; Save the context of the program being timed.
;
     pushf
     push ax
     push cx
;
; In case interrupts are disabled, enable interrupts briefly to allow
; any pending timer interrupt to be handled. Interrupts must be
; enabled for at least 210 ns to allow time for that interrupt to
; occur. Here, 10 jumps are used for the delay to ensure that the
; delay time will be more than long enough even on a very fast PC.
;
     sti
     rept 10
     jmp  $+2
     endm

;
; Latch the timer count.
;

if PS2

     mov  al,00000000b
     out  MODE_8253,al     ;latch timer 0 count
;
; This is where a one-instruction-long window exists on the PS/2.
; The timer count and the BIOS count can lose synchronization;
; since the timer keeps counting after it's latched, it can turn
; over right after it's latched and cause the BIOS count to turn
; over before interrupts are disabled, leaving us with the timer
; count from before the timer turned over coupled with the BIOS
; count from after the timer turned over. The result is a count
; that's 54 ms too long.
;

else

;
; Set timer 0 to mode 2 (divide-by-N), waiting for a 2-byte count
; load, which stops timer 0 until the count is loaded. (Only works
; on fully 8253-compatible chips.)
;
     mov   al,00110100b      ;mode 2
     out   MODE_8253,al
     DELAY
     mov   al,00000000b      ;latch timer 0 count
     out   MODE_8253,al

endif

     cli                     ;stop the BIOS count
;
; Read the BIOS count. (Since interrupts are disabled, the BIOS
; count won't change.)
;
     push ds
     sub  ax,ax
     mov  ds,ax
     mov  ax,ds:[TIMER_COUNT+2]
     mov  cs:[EndBIOSCountHigh],ax
     mov  ax,ds:[TIMER_COUNT]
     mov  cs:[EndBIOSCountLow],ax
     pop  ds
;
; Read the timer count and save it.
;
     in   al,TIMER_0_8253        ;lsb
     DELAY
     mov  ah,al
     in   al,TIMER_0_8253        ;msb
     xchg ah,al
     neg  ax                     ;convert from countdown
                                 ; remaining to elapsed
                                 ; count
     mov  cs:[EndTimedCount],ax
;
; Restart timer 0, which is still waiting for an initial count
; to be loaded.
;

ife PS2

     DELAY
     mov  al,00110100b        ;mode 2, waiting to load a
                              ; 2-byte count
     out  MODE_8253,al
     DELAY
     sub  al,al
     out  TIMER_0_8253,al     ;lsb
     DELAY
     mov  al,ah
     out  TIMER_0_8253,al     ;msb
     DELAY

endif

    sti                       ;let the BIOS count continue
;
; Time a zero-length code fragment, to get a reference for how
; much overhead this routine has. Time it 16 times and average it,
; for accuracy, rounding the result.
;
     mov   cs:[ReferenceCount],0
     mov   cx,16
     cli                         ;interrupts off to allow a
                                 ; precise reference count
RefLoop:
     call  ReferenceZTimerOn
     call  ReferenceZTimerOff
     loop  RefLoop
     sti
     add   cs:[ReferenceCount],8    ;total + (0.5 * 16)
     mov   cl,4
     shr   cs:[ReferenceCount],cl   ;(total) / 16 + 0.5
;
; Restore the context of the program being timed and return to it.
;
     pop cx
     pop ax
     MPOPF
     ret

ZTimerOff endp

;
; Called by ZTimerOff to start the timer for overhead measurements.
;

ReferenceZTimerOn proc near
;
; Save the context of the program being timed.
;
     push ax
     pushf
;
; Set timer 0 of the 8253 to mode 2 (divide-by-N), to cause
; linear counting rather than count-by-two counting.
;
     mov    al,00110100b     ;mode 2
     out    MODE_8253,al
;
; Set the timer count to 0.
;
     DELAY
     sub     al,al
     out     TIMER_0_8253,al     ;lsb
     DELAY
     out     TIMER_0_8253,al     ;msb
;
; Restore the context of the program being timed and return to it.
;
     MPOPF
     pop   ax
     ret

ReferenceZTimerOn endp

;
; Called by ZTimerOff to stop the timer and add the result to
; ReferenceCount for overhead measurements. Doesn't need to look
; at the BIOS count because timing a zero-length code fragment
; isn't going to take anywhere near 54 ms.
;

ReferenceZTimerOff proc near
;
; Save the context of the program being timed.
;
     pushf
     push ax
     push cx

;
; Match the interrupt-window delay in ZTimerOff.
;
     sti
     rept 10
     jmp $+2
     endm

     mov    al,00000000b
     out    MODE_8253,al     ;latch timer
;
; Read the count and save it.
;
     DELAY
     in    al,TIMER_0_8253     ;lsb
     DELAY
     mov   ah,al
     in    al,TIMER_0_8253     ;msb
     xchg  ah,al
     neg   ax                  ;convert from countdown
                               ; remaining to elapsed
                               ; count
     add   cs:[ReferenceCount],ax
;
; Restore the context and return.
;
     pop cx
     pop ax
     MPOPF
     ret

ReferenceZTimerOff endp

;********************************************************************
;* Routine called to report timing results.                           *
;********************************************************************

ZTimerReport proc near

     pushf
     push    ax
     push    bx
     push    cx
     push    dx
     push    si
     push    di
     push    ds
;
     push    cs     ;DOS functions require that DS point
     pop     ds     ; to text to be displayed on the screen
     assume  ds:Code
;
; See if midnight or more than an hour passed during timing. If so,
; notify the user.
;
     mov    ax,[StartBIOSCountHigh]
     cmp    ax,[EndBIOSCountHigh]
     jz     CalcBIOSTime     ;hour count didn't change,
                             ; so everything's fine
     inc    ax
     cmp    ax,[EndBIOSCountHigh]
     jnz    TestTooLong      ;midnight or two hour
                             ; boundaries passed, so the
                             ; results are no good
     mov    ax,[EndBIOSCountLow]
     cmp    ax,[StartBIOSCountLow]
     jb     CalcBIOSTime     ;a single hour boundary
                             ; passed--that's OK, so long as
                             ; the total time wasn't more
                             ; than an hour

;
; Over an hour elapsed or midnight passed during timing, which
; renders the results invalid. Notify the user. This misses the
; case where a multiple of 24 hours has passed, but we'll rely
; on the perspicacity of the user to detect that case.
;
TestTooLong:
     mov    ah,9
     mov    dx,offset TurnOverStr
     int    21h
     jmp    short ZTimerReportDone
;
; Convert the BIOS time to microseconds.
;
CalcBIOSTime:
     mov    ax,[EndBIOSCountLow]
     sub    ax,[StartBIOSCountLow]
     mov    dx,54925          ;number of microseconds each
                              ; BIOS count represents
     mul    dx
     mov    bx,ax             ;set aside BIOS count in
     mov    cx,dx             ; microseconds
;
; Convert timer count to microseconds.
;
     mov    ax,[EndTimedCount]
     mov    si,8381
     mul    si
     mov    si,10000
     div    si               ;* .8381 = * 8381 / 10000
;
; Add timer and BIOS counts together to get an overall time in
; microseconds.
;
     add    bx,ax
     adc    cx,0
;
; Subtract the timer overhead and save the result.
;
     mov    ax,[ReferenceCount]
     mov    si,8381          ;convert the reference count
     mul    si               ; to microseconds
     mov    si,10000
     div    si               ;* .8381 = * 8381 / 10000
     sub    bx,ax
     sbb    cx,0
     mov    [CurrentCountLow],bx
     mov    [CurrentCountHigh],cx
;
; Convert the result to an ASCII string by trial subtractions of
; powers of 10.
;
     mov    di,offset PowersOfTenEnd - offset PowersOfTen - 4
     mov    si,offset TimedCountStr
CTSNextDigit:
     mov    bl,'0'
CTSLoop:
     mov    ax,[CurrentCountLow]
     mov    dx,[CurrentCountHigh]
     sub    ax,PowersOfTen[di]
     sbb    dx,PowersOfTen[di+2]
     jc     CTSNextPowerDown
     inc    bl
     mov    [CurrentCountLow],ax
     mov    [CurrentCountHigh],dx
     jmp    CTSLoop
CTSNextPowerDown:
     mov    [si],bl
     inc    si
     sub    di,4
     jns    CTSNextDigit
;
;
; Print the results.
;
     mov    ah,9
     mov    dx,offset OutputStr
     int    21h
;
ZTimerReportDone:
     pop    ds
     pop    di
     pop    si
     pop    dx
     pop    cx
     pop    bx
     pop    ax
     MPOPF
     ret

ZTimerReport    endp

Code   ends
       end

Moreover, because it uses an undocumented feature, the timer-stop approach could conceivably cause erratic 8253 operation, which could in turn seriously affect your computer’s operation until the next reboot. In non-8253-compatible systems, I’ve observed not only wildly incorrect timing results, but also failure of a diskette drive to operate properly after the long-period Zen timer with PS2 set to 0 has run, so be alert for signs of trouble if you do set PS2 to 0.

Rebooting should clear up any timer-related problems of the sort described above. (This gives us another reason to reboot at the end of each code-timing session.) You should immediately reboot and set the PS2 equate to 1 if you get erratic or obviously incorrect results with the long-period Zen timer when PS2 is set to 0. If you want to set PS2 to 0, it would be a good idea to time a few of the listings in this book with PS2 set first to 1 and then to 0, to make sure that the results match. If they’re consistently different, you should set PS2 to 1.

While the non-PS/2 version is more dangerous than the PS/2 version, it also produces more accurate results when it does work. If you have a non-PS/2 PC-compatible computer, the choice between the two timing approaches is yours.

If you do leave the PS2 equate at 1 in Listing 3.5, you should repeat each code-timing run several times before relying on the results to be accurate to more than 54 ms, since variations may result from the possible lack of synchronization between the timer 0 count and the BIOS time-of-day count. In fact, it’s a good idea to time code more than once no matter which version of the long-period Zen timer you’re using, since interrupts, which must be enabled in order for the long-period timer to work properly, may occur at any time and can alter execution time substantially.

Finally, please note that the precision Zen timer works perfectly well on both PS/2 and non-PS/2 computers. The PS/2 and 8253 considerations we’ve just discussed apply only to the long-period Zen timer.

Example Use of the Long-Period Zen Timer

The long-period Zen timer has exactly the same calling interface as the precision Zen timer, and can be used in place of the precision Zen timer simply by linking it to the code to be timed in place of linking the precision timer code. Whenever the precision Zen timer informs you that the code being timed takes too long for the precision timer to handle, all you have to do is link in the long-period timer instead.

Listing 3.6 shows a test-bed program for the long-period Zen timer. While this program is similar to Listing 3.2, it’s worth noting that Listing 3.6 waits for a few seconds before calling ZTimerOn, thereby allowing any pending keyboard interrupts to be processed. Since interrupts must be left on in order to time periods longer than 54 ms, the interrupts generated by keystrokes (including the upstroke of the Enter key press that starts the program)—or any other interrupts, for that matter—could incorrectly inflate the time recorded by the long-period Zen timer. In light of this, resist the temptation to type ahead, move the mouse, or the like while the long-period Zen timer is timing.

LISTING 3.6 LZTEST.ASM

; Program to measure performance of code that takes longer than
; 54 ms to execute. (LZTEST.ASM)
;
; Link with LZTIMER.ASM (Listing 3.5). LZTIME.BAT (Listing 3.7)
; can be used to assemble and link both files. Code to be
; measured must be in the file TESTCODE; Listing 3.8 shows
; a sample file (LST3-8.ASM) which should be named TESTCODE.
;
; By Michael Abrash
;
mystack   segment    para stack 'STACK'
     db         512 dup(?)
mystack   ends
;
Code  segment   para public 'CODE'
      assume    cs:Code, ds:Code
      extrn     ZTimerOn:near, ZTimerOff:near, ZTimerReport:near
Start proc near
     push  cs
     pop   ds      ;point DS to the code segment,
                   ; so data as well as code can easily
                   ; be included in TESTCODE
;
; Delay for 6-7 seconds, to let the Enter keystroke that started the
; program come back up.
;
     mov   ah,2ch
     int   21h                ;get the current time
     mov   bh,dh              ;set the current time aside
DelayLoop:
     mov   ah,2ch
     push  bx                 ;preserve start time
     int   21h                ;get time
     pop   bx                 ;retrieve start time
     cmp   dh,bh              ;is the new seconds count less than
                              ; the start seconds count?
     jnb   CheckDelayTime     ;no
     add   dh,60              ;yes, a minute must have turned over,
                              ; so add one minute
CheckDelayTime:
     sub   dh,bh              ;get time that's passed
     cmp   dh,7               ;has it been more than 6 seconds yet?
     jb    DelayLoop          ;not yet
;
     include   TESTCODE       ;code to be measured, including calls
                              ; to ZTimerOn and ZTimerOff
;
; Display the results.
;
     call  ZTimerReport
;
; Terminate the program.
;
     mov   ah,4ch
     int   21h
Start endp
Code  ends
      end     Start

As with the precision Zen timer, the program in Listing 3.6 is used by naming the file containing the code to be timed TESTCODE, then assembling both Listing 3.6 and Listing 3.5 with MASM or TASM and linking the two files together by way of the Microsoft or Borland linker. Listing 3.7 shows a batch file, named LZTIME.BAT, which does all of the above, generating and running the executable file LZTEST.EXE. LZTIME.BAT assumes that the file LZTIMER.ASM contains Listing 3.5 and the file LZTEST.ASM contains Listing 3.6.

LISTING 3.7 LZTIME.BAT

echo off
rem
rem *** Listing 3.7 ***
rem
rem ***************************************************************
rem * Batch file LZTIME.BAT, which builds and runs the            *
rem * long-period Zen timer program LZTEST.EXE to time the code   *
rem * named as the command-line parameter. Listing 3.5 must be    *
rem * named LZTIMER.ASM, and Listing 3.6 must be named            *
rem * LZTEST.ASM. To time the code in LST3-8, you'd type the      *
rem * DOS command:                                                *
rem *                                                             *
rem * lztime lst3-8                                               *
rem *                                                             *
rem * Note that MASM and LINK must be in the current directory or *
rem * on the current path in order for this batch file to work.   *
rem *                                                             *
rem * This batch file can be speeded up by assembling LZTIMER.ASM *
rem * once, then removing the lines:                              *
rem *                                                             *
rem * masm lztimer;                                               *
rem * if errorlevel 1 goto errorend                               *
rem *                                                             *
rem * from this file.                                             *
rem *                                                             *
rem * By Michael Abrash                                           *
rem ***************************************************************
rem
rem Make sure a file to test was specified.
rem
if not x%1==x goto ckexist
echo ***************************************************************
echo * Please specify a file to test.                              *
echo ***************************************************************
goto end
rem
rem Make sure the file exists.
rem
:ckexist
if exist %1 goto docopy
echo ***************************************************************
echo * The specified file, "%1," doesn't exist.                    *
echo ***************************************************************
goto end
rem
rem copy the file to measure to TESTCODE.
:docopy
copy %1 testcode
masm lztest;
if errorlevel 1 goto errorend
masm lztimer;
if errorlevel 1 goto errorend
link lztest+lztimer;
if errorlevel 1 goto errorend
lztest
goto end
:errorend
echo ***************************************************************
echo * An error occurred while building the long-period Zen timer. *
echo ***************************************************************
:end

Listing 3.8 shows sample code that can be timed with the test-bed program of Listing 3.6. Listing 3.8 measures the time required to execute 20,000 loads of AL from memory, a length of time too long for the precision Zen timer to handle on the 8088.

LISTING 3.8 LST3-8.ASM

;
; Measures the performance of 20,000 loads of AL from
; memory. (Use by renaming to TESTCODE, which is
; included by LZTEST.ASM (Listing 3.6). LZTIME.BAT
; (Listing 3.7) does this, along with all assembly
; and linking.)
;
; Note: takes about ten minutes to assemble on a slow PC if
; you are using MASM
;
    jmp     Skip ;jump around defined data
;
MemVar  db  ?
;
Skip:
;
; Start timing.
;
    call   ZTimerOn
;
    rept   20000
    mov    al,[MemVar]
    endm
;
; Stop timing.
;
    call   ZTimerOff

When LZTIME.BAT is run on a PC with the following command line (assuming the code in Listing 3.8 is the file LST3-8.ASM)

lztime lst3-8.asm

the result is 72,544 µs, or about 3.63 µs per load of AL from memory. This is just slightly longer than the time per load of AL measured by the precision Zen timer, as we would expect given that interrupts are left enabled by the long-period Zen timer. The extra fraction of a microsecond measured per MOV reflects the time required to execute the BIOS code that handles the 18.2 timer interrupts that occur each second.

Note that the command can take as much as 10 minutes to finish on a slow PC if you are using MASM, with most of that time spent assembling Listing 3.8. Why? Because MASM is notoriously slow at assembling REPT blocks, and the block in Listing 3.8 is repeated 20,000 times.

Using the Zen Timer from C

The Zen timer can be used to measure code performance when programming in C—but not right out of the box. As presented earlier, the timer is designed to be called from assembly language; some relatively minor modifications are required before the ZTimerOn (start timer), ZTimerOff (stop timer), and ZTimerReport (display timing results) routines can be called from C. There are two separate cases to be dealt with here: small code model and large; I’ll tackle the simpler one, the small code model, first.

Altering the Zen timer for linking to a small code model C program involves the following steps: Change ZTimerOn to _ZTimerOn, change ZTimerOff to _ZTimerOff, change ZTimerReport to _ZTimerReport, and change Code to _TEXT. Figure 3.2 shows the line numbers and new states of all lines from Listing 3.1 that must be changed. These changes convert the code to use C-style external label names and the small model C code segment. (In C++, use the “C” specifier, as in

extern "C" void ZTimerOn(void);

when declaring the timer routines extern, so that name-mangling doesn’t occur, and the linker can find the routines’ C-style names.)
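
For reference, a header along the following lines makes the renamed routines visible to both C and C++ callers. The file name and the use of void return types are my own choices for this sketch, not from the original listings; the assumption is simply that the renaming described above has been done in the assembly module.

/* zentimer.h -- hypothetical header for the C-callable Zen timer.
   Assumes the renaming described above has been done in the assembly
   module. The routines take no parameters and return no value. */
#ifndef ZENTIMER_H
#define ZENTIMER_H

#ifdef __cplusplus
extern "C" {
#endif

void ZTimerOn(void);     /* start timing */
void ZTimerOff(void);    /* stop timing */
void ZTimerReport(void); /* display the timed interval */

#ifdef __cplusplus
}
#endif

#endif /* ZENTIMER_H */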

That’s all it takes; after doing this, you’ll be able to use the Zen timer from C, as, for example, in:

ZTimerOn();
for (i=0, x=0; i<100; i++)
     x += i;
ZTimerOff();
ZTimerReport();

(I’m talking about the precision timer here. The long-period timer—Listing 3.5—requires the same modifications, but to different lines.)

Figure 3.2  Changes for use with small code model C.

Altering the Zen timer for use in C’s large code model is a tad more complex, because in addition to the above changes, all functions, including the internal reference timing routines that are used to calculate overhead so it can be subtracted out, must be converted to far. Figure 3.3 shows the line numbers and new states of all lines from Listing 3.1 that must be changed in order to call the Zen timer from large code model C. Again, the line numbers are specific to the precision timer, but the long-period timer is very similar.

The full listings for the C-callable Zen timers are presented in Chapter K on the companion CD-ROM.

Watch Out for Optimizing Assemblers!

One important safety tip when modifying the Zen timer for use with large code model C code: Watch out for optimizing assemblers! TASM actually replaces

call     far ptr ReferenceZTimerOn

with

push     cs
call     near ptr ReferenceZTimerOn

(and likewise for ReferenceZTimerOff), which works because ReferenceZTimerOn is in the same segment as the calling code. This is normally a great optimization, being both smaller and faster than a far call.

Figure 3.3  Changes for use with large code model C.

However, it’s not so great for the Zen timer, because our purpose in calling the reference timing code is to determine exactly how much time is taken by overhead code—including the far calls to ZTimerOn and ZTimerOff! By converting the far calls to push/near call pairs within the Zen timer module, TASM makes it impossible to emulate exactly the overhead of the Zen timer, and makes timings slightly (about 16 cycles on a 386) less accurate.

What’s the solution? Put the NOSMART directive at the start of the Zen timer code. This directive instructs TASM to turn off all optimizations, including converting far calls to push/near call pairs. By the way, there is, to the best of my knowledge, no such problem with MASM up through version 5.10A.

In my mind, the whole business of optimizing assemblers is a mixed blessing. In general, it’s nice to have the assembler shortening jumps and selecting sign-extended forms of instructions for you. On the other hand, the benefits of tricks like substituting push/near call pairs for far calls are relatively small, and those tricks can get in the way when complete control is needed. Sure, complete control is needed very rarely, but when it is, optimizing assemblers can cause subtle problems; I discovered TASM’s alteration of far calls only because I happened to view the code in the debugger, and you might want to do the same if you’re using a recent version of MASM.

I’ve tested the changes shown in Figures 3.2 and 3.3 with TASM and Borland C++ 4.0, and also with the latest MASM and Microsoft C/C++ compiler.

Further Reading

For those of you who wish to pursue the mechanics of code measurement further, one good article about measuring code performance with the 8253 timer is “Programming Insight: High-Performance Software Analysis on the IBM PC,” by Byron Sheppard, which appeared in the January, 1987 issue of Byte. For complete if somewhat cryptic information on the 8253 timer itself, I refer you to Intel’s Microsystem Components Handbook, which is also a useful reference for a number of other PC components, including the 8259 Programmable Interrupt Controller and the 8237 DMA Controller. For details about the way the 8253 is used in the PC, as well as a great deal of additional information about the PC’s hardware and BIOS resources, I suggest you consult IBM’s series of technical reference manuals for the PC, XT, AT, Model 30, and microchannel computers, such as the Models 50, 60, and 80.

For our purposes, however, it’s not critical that you understand exactly how the Zen timer works. All you really need to know is what the Zen timer can do and how to use it, and we’ve accomplished that in this chapter.

Armed with the Zen Timer, Onward and Upward

The Zen timer is not perfect. For one thing, the finest resolution to which it can measure an interval is at best about 1 µs, a period of time in which a 66 MHz Pentium computer can execute as many as 132 instructions (although an 8088-based PC would be hard-pressed to manage two instructions in a microsecond). Another problem is that the timing code itself interferes with the state of the prefetch queue and processor cache at the start of the code being timed, because the timing code is not necessarily fetched and does not necessarily access memory in exactly the same time sequence as the code immediately preceding the code under measurement normally does. This prefetch effect can introduce as much as 3 to 4 µs of inaccuracy. Similarly, the state of the prefetch queue at the end of the code being timed affects how long the code that stops the timer takes to execute. Consequently, the Zen timer tends to be more accurate for longer code sequences, since the relative magnitude of the inaccuracy introduced by the Zen timer becomes less over longer periods.

Imperfections notwithstanding, the Zen timer is a good tool for exploring C code and x86 family assembly language, and it’s a tool we’ll use frequently for the remainder of this book.

Chapter 4 – In the Lair of the Cycle-Eaters

How the PC Hardware Devours Code Performance

This chapter, adapted from my earlier book Zen of Assembly Language (located on the companion CD-ROM), goes right to the heart of my philosophy of optimization: Understand where the time really goes when your code runs. That may sound ridiculously simple, but, as this chapter makes clear, it turns out to be a challenging task indeed, one that at times verges on black magic. This chapter is a long-time favorite of mine because it was the first—and to a large extent only—work that I know of that discussed this material, thereby introducing a generation of PC programmers to pedal-to-the-metal optimization.

This chapter focuses almost entirely on the first popular x86-family processor, the 8088. Some of the specific features and results that I cite in this chapter are no longer applicable to modern x86-family processors such as the 486 and Pentium, as I’ll point out later on when we discuss those processors. Nonetheless, the overall theme of this chapter—that understanding dimly-seen and poorly-documented code gremlins called cycle-eaters that lurk in your system is essential to performance programming—is every bit as valid today. Also, later chapters often refer back to the basic cycle-eaters described in this chapter, so this chapter is the foundation for the discussions of x86-family optimization to come. What’s more, the Zen timer remains an excellent tool with which to flush out and examine cycle-eaters, as we’ll see in later chapters, and this chapter is as good an illustration of how to use the Zen timer as you’re likely to find.

So, don’t take either the absolute or the relative execution times presented in this chapter as gospel for newer processors, and read on to later chapters to see how the cycle-eaters and optimization rules have changed over time, but do take the time to at least skim through this chapter to give yourself a good start on the material in the rest of this book.

Cycle-Eaters

Programming has many levels, ranging from the familiar (high-level languages, DOS calls, and the like) down to the esoteric things that lie on the shadowy edge of hardware-land. I call these cycle-eaters because, like the monsters in a bad 50s horror movie, they lurk in those shadows, taking their share of your program’s performance without regard to the forces of goodness or the U.S. Army. In this chapter, we’re going to jump right in at the lowest level by examining the cycle-eaters that live beneath the programming interface; that is, beneath your application, DOS, and BIOS—in fact, beneath the instruction set itself.

Why start at the lowest level? Simply because cycle-eaters affect the performance of all assembler code, and yet are almost unknown to most programmers. A full understanding of code optimization requires an understanding of cycle-eaters and their implications. That’s no simple task, and in fact it is in precisely that area that most books and articles about assembly programming fall short.

Nearly all literature on assembly programming discusses only the programming interface: the instruction set, the registers, the flags, and the BIOS and DOS calls. Those topics cover the functionality of assembly programs most thoroughly—but it’s performance above all else that we’re after. No one ever tells you about the raw stuff of performance, which lies beneath the programming interface, in the dimly-seen realm—populated by instruction prefetching, dynamic RAM refresh, and wait states—where software meets hardware. This area is the domain of hardware engineers, and is almost never discussed as it relates to code performance. And yet it is only by understanding the mechanisms operating at this level that we can fully understand and properly improve the performance of our code.

Which brings us to cycle-eaters.

The Nature of Cycle-Eaters

Cycle-eaters are gremlins that live on the bus or in peripherals (and sometimes within the CPU itself), slowing the performance of PC code so that it doesn’t execute at full speed. Most cycle-eaters (and all of those haunting the older Intel processors) live outside the CPU’s Execution Unit, where they can only affect the CPU when the CPU performs a bus access (a memory or I/O read or write). Once your code and data are already inside the CPU, those cycle-eaters can no longer be a problem. Only on the 486 and Pentium CPUs will you find cycle-eaters inside the chip, as we’ll see in later chapters.

The nature and severity of the cycle-eaters vary enormously from processor to processor, and (especially) from memory architecture to memory architecture. In order to understand them all, we need first to understand the simplest among them, those that haunted the original 8088-based IBM PC. Later on in this book, I’ll be better able to explain the newer generation of cycle-eaters in terms of those ancestral cycle-eaters—but we have to get the groundwork down first.

The 8088’s Ancestral Cycle-Eaters

Internally, the 8088 is a 16-bit processor, capable of running at full speed at all times—unless external data is required. External data must traverse the 8088’s external data bus and the PC’s data bus one byte at a time to and from peripherals, with cycle-eaters lurking along every step of the way. What’s more, external data includes not only memory operands but also instruction bytes, so even instructions with no memory operands can suffer from cycle-eaters. Since some of the 8088’s fastest instructions are register-only instructions, that’s important indeed.

The major cycle-eaters are:

  • The 8088’s 8-bit external data bus.
  • The prefetch queue.
  • Dynamic RAM refresh.
  • Wait states, notably display memory wait states and, in the AT and 80386 computers, system memory wait states.

The locations of these cycle-eaters in the primordial 8088-based PC are shown in Figure 4.1. We’ll cover each of the cycle-eaters in turn in this chapter. The material won’t be easy since cycle-eaters are among the most subtle aspects of assembly programming. By the same token, however, this will be one of the most important and rewarding chapters in this book. Don’t worry if you don’t catch everything in this chapter, but do read it all even if the going gets a bit tough. Cycle-eaters play a key role in later chapters, so some familiarity with them is highly desirable.

The 8-Bit Bus Cycle-Eater

Look! Down on the motherboard! It’s a 16-bit processor! It’s an 8-bit processor! It’s…

…an 8088!

Fans of the 8088 call it a 16-bit processor. Fans of other 16-bit processors call the 8088 an 8-bit processor. The truth of the matter is that the 8088 is a 16-bit processor that often performs like an 8-bit processor.

The 8088 is internally a full 16-bit processor, equivalent to an 8086. (In fact, the 8086 is identical to the 8088, except that it has a full 16-bit bus. The 8088 is basically the poor man’s 8086, because it allows a cheaper—albeit slower—system to be built, thanks to the half-sized bus.) In terms of the instruction set, the 8088 is clearly a 16-bit processor, capable of performing any given 16-bit operation—addition, subtraction, even multiplication or division—with a single instruction. Externally, however, the 8088 is unequivocally an 8-bit processor, since the external data bus is only 8 bits wide. In other words, the programming interface is 16 bits wide, but the hardware interface is only 8 bits wide, as shown in Figure 4.2. The result of this mismatch is simple: Word-sized data can be transferred between the 8088 and memory or peripherals at only one-half the maximum rate of the 8086, which is to say one-half the maximum rate for which the Execution Unit of the 8088 was designed.

Figure 4.1  The location of the major cycle-eaters in the IBM PC.
Figure 4.2  Internal data bus widths of the 8088.

As shown in Figure 4.1, the 8-bit bus cycle-eater lies squarely on the 8088’s external data bus. Technically, it might be more accurate to place this cycle-eater in the Bus Interface Unit, which breaks 16-bit memory accesses into paired 8-bit accesses, but it is really the limited width of the external data bus that constricts data flow into and out of the 8088. True, the original PC’s bus is also only 8 bits wide, but that’s just to match the 8088’s 8-bit bus; even if the PC’s bus were 16 bits wide, data could still pass into and out of the 8088 chip itself only 1 byte at a time.

Each bus access by the 8088 takes 4 clock cycles, or 0.838 µs in the 4.77 MHz PC, and transfers 1 byte. That means that the maximum rate at which data can be transferred into and out of the 8088 is 1 byte every 0.838 µs. While 8086 bus accesses also take 4 clock cycles, each 8086 bus access can transfer either 1 byte or 1 word, for a maximum transfer rate of 1 word every 0.838 µs. Consequently, for word-sized memory accesses, the 8086 has an effective transfer rate of 1 byte every 0.419 µs. By contrast, every word-sized access on the 8088 requires two 4-cycle-long bus accesses, one for the high byte of the word and one for the low byte of the word. As a result, the 8088 has an effective transfer rate for word-sized memory accesses of just 1 word every 1.676 µs—and that, in a nutshell, is the 8-bit bus cycle-eater.
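
The transfer rates above follow directly from the 4-cycles-per-bus-access figure; this small C sketch (illustrative only, and obviously not something you would run on an 8088 to find out) reproduces the arithmetic:

/* Illustrative only: effective word transfer times for the 8086 and
   the 8088 at 4.77 MHz, given 4 cycles per bus access. */
#include <stdio.h>

int main(void)
{
    double cycle_us  = 1.0 / 4.77;      /* one cycle, about 0.21 us */
    double access_us = 4.0 * cycle_us;  /* one bus access, about 0.838 us */

    printf("8086: one word every %.3f us\n", access_us);       /* one access per word */
    printf("8088: one word every %.3f us\n", 2.0 * access_us); /* two accesses per word */
    return 0;
}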

A related cycle-eater lurks beneath the 386SX chip, which is a 32-bit processor internally with only a 16-bit path to system memory. The numbers are different, but the way the cycle-eater operates is exactly the same. AT-compatible systems have 16-bit data buses, which can access a full 16-bit word at a time. The 386SX can process 32 bits (a doubleword) at a time, however, and loses a lot of time fetching that doubleword from memory in two halves.

The Impact of the 8-Bit Bus Cycle-Eater

One obvious effect of the 8-bit bus cycle-eater is that word-sized accesses to memory operands on the 8088 take 4 cycles longer than byte-sized accesses. That’s why the official instruction timings indicate that for code running on an 8088 an additional 4 cycles are required for every word-sized access to a memory operand. For instance,

mov  ax,word ptr [MemVar]

takes 4 cycles longer to read the word at address MemVar than

mov  al,byte ptr [MemVar]

takes to read the byte at address MemVar. (Actually, the difference between the two isn’t very likely to be exactly 4 cycles, for reasons that will become clear once we discuss the prefetch queue and dynamic RAM refresh cycle-eaters later in this chapter.)

What’s more, in some cases one instruction can perform multiple word-sized accesses, incurring that 4-cycle penalty on each access. For example, adding a value to a word-sized memory variable requires two word-sized accesses—one to read the destination operand from memory prior to adding to it, and one to write the result of the addition back to the destination operand—and thus incurs not one but two 4-cycle penalties. As a result

add  word ptr [MemVar],ax

takes about 8 cycles longer to execute than:

add  byte ptr [MemVar],al

String instructions can suffer from the 8-bit bus cycle-eater to a greater extent than other instructions. Believe it or not, a single REP MOVSW instruction can lose as much as 131,070 word-sized memory accesses x 4 cycles, or 524,280 cycles to the 8-bit bus cycle-eater! In other words, one 8088 instruction (admittedly, an instruction that does a great deal) can take over one-tenth of a second longer on an 8088 than on an 8086, simply because of the 8-bit bus. One-tenth of a second! That’s a phenomenally long time in computer terms; in one-tenth of a second, the 8088 can perform more than 50,000 additions and subtractions.
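
If you’d like to check that figure, the following C sketch (illustrative only) reproduces the worst-case arithmetic for a maximum-length REP MOVSW:

/* Illustrative only: worst-case cycles lost to the 8-bit bus by a
   maximum-length REP MOVSW: 65,535 words moved, each requiring a
   word-sized read and a word-sized write, each of which costs an
   extra 4 cycles on the 8088. */
#include <stdio.h>

int main(void)
{
    unsigned long words    = 65535UL;
    unsigned long accesses = words * 2UL;        /* one read plus one write per word */
    unsigned long cycles   = accesses * 4UL;     /* 4 extra cycles per word-sized access */
    double        seconds  = cycles / 4770000.0; /* at 4.77 MHz */

    printf("%lu accesses, %lu cycles, %.2f seconds lost\n",
           accesses, cycles, seconds);
    return 0;
}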

The upshot of all this is simply that the 8088 can transfer word-sized data to and from memory at only half the speed of the 8086, which inevitably causes performance problems when coupled with an Execution Unit that can process word-sized data every bit as quickly as an 8086. These problems show up with any code that uses word-sized memory operands. More ominously, as we will see shortly, the 8-bit bus cycle-eater can cause performance problems with other sorts of code as well.

What to Do about the 8-Bit Bus Cycle-Eater?

The obvious implication of the 8-bit bus cycle-eater is that byte-sized memory variables should be used whenever possible. After all, the 8088 performs byte-sized memory accesses just as quickly as the 8086. For instance, Listing 4.1, which uses a byte-sized memory variable as a loop counter, runs in 10.03 µs per loop. That’s 20 percent faster than the 12.05 µs per loop execution time of Listing 4.2, which uses a word-sized counter. Why the difference in execution times? Simply because each word-sized DEC performs 4 byte-sized memory accesses (two to read the word-sized operand and two to write the result back to memory), while each byte-sized DEC performs only 2 byte-sized memory accesses in all.

LISTING 4.1 LST4-1.ASM

; Measures the performance of a loop which uses a
; byte-sized memory variable as the loop counter.
;
      jmp  Skip
;
Counter    db    100
;
Skip:
      call ZTimerOn
LoopTop:
      dec  [Counter]
      jnz  LoopTop
      call ZTimerOff

LISTING 4.2 LST4-2.ASM

; Measures the performance of a loop which uses a
; word-sized memory variable as the loop counter.
;
      jmp  Skip
;
Counter    dw    100
;
Skip:
      call  ZTimerOn
LoopTop:
      dec   [Counter]
      jnz   LoopTop
      call  ZTimerOff
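
As a rough cross-check on those two measurements (my own back-of-the-envelope arithmetic, not from the listings), the two extra byte-sized accesses per word-sized DEC account for most, though not all, of the difference:

/* Back-of-the-envelope check: the word-sized DEC in Listing 4.2 makes
   two more byte-sized memory accesses per loop than the byte-sized DEC
   in Listing 4.1, at a minimum of 4 cycles (about 0.838 us) apiece. */
#include <stdio.h>

int main(void)
{
    double access_us   = 4.0 / 4.77;       /* one bus access at 4.77 MHz */
    double expected_us = 2.0 * access_us;  /* two extra accesses per loop */
    double measured_us = 12.05 - 10.03;    /* Listing 4.2 minus Listing 4.1 */

    printf("expected extra: %.2f us  measured extra: %.2f us\n",
           expected_us, measured_us);
    /* The remainder is largely due to the prefetch queue and DRAM
       refresh cycle-eaters discussed later in this chapter. */
    return 0;
}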

I’d like to make a brief aside concerning code optimization in the listings in this book. Throughout this book I’ve modeled the sample code after working code so that the timing results are applicable to real-world programming. In Listings 4.1 and 4.2, for example, I could have shown a still greater advantage for byte-sized operands simply by performing 1,000 DEC instructions in a row, with no branching at all. However, DEC instructions don’t exist in a vacuum, so in the listings I used code that both decremented the counter and tested the result. The difference is that between decrementing a memory location (simply an instruction) and using a loop counter (a functional instruction sequence). If you come across code in this book that seems less than optimal, it’s simply due to my desire to provide code that’s relevant to real programming problems. On the other hand, optimal code is an elusive thing indeed; by no means should you assume that the code in this book is ideal! Examine it, question it, and improve upon it, for an inquisitive, skeptical mind is an important part of the Zen of assembly optimization.

Back to the 8-bit bus cycle-eater. As I’ve said, in 8088 work you should strive to use byte-sized memory variables whenever possible. That does not mean that you should use 2 byte-sized memory accesses to manipulate a word-sized memory variable in preference to 1 word-sized memory access, as, for instance,

mov  dl,byte ptr [MemVar]
mov  dh,byte ptr [MemVar+1]

versus:

mov  dx,word ptr [MemVar]

Recall that every access to a memory byte takes at least 4 cycles; that limitation is built right into the 8088. The 8088 is also built so that the second byte-sized memory access to a 16-bit memory variable takes just those 4 cycles and no more. There’s no way a second, separate byte-sized instruction can get at the second byte of a word-sized memory variable in less than 4 cycles. As a matter of fact, you’re bound to access that second byte much more slowly with a separate instruction, thanks to the overhead of instruction fetching and execution, address calculation, and the like.

For example, consider Listing 4.3, which performs 1,000 word-sized reads from memory. This code runs in 3.77 µs per word read on a 4.77 MHz 8088. That’s 45 percent faster than the 5.49 µs per word read of Listing 4.4, which reads the same 1,000 words as Listing 4.3 but does so with 2,000 byte-sized reads. Both listings perform exactly the same number of memory accesses—2,000 accesses, each byte-sized, as all 8088 memory accesses must be. (Remember that the Bus Interface Unit must perform two byte-sized memory accesses in order to handle a word-sized memory operand.) However, Listing 4.3 is considerably faster because it expends only 4 additional cycles to read the second byte of each word, while Listing 4.4 performs a second LODSB, requiring 13 cycles, to read the second byte of each word.

LISTING 4.3 LST4-3.ASM

; Measures the performance of reading 1,000 words
; from memory with 1,000 word-sized accesses.
;
     sub  si,si
     mov  cx,1000
     call ZTimerOn
     rep  lodsw
     call ZTimerOff

LISTING 4.4 LST4-4.ASM

; Measures the performance of reading 1000 words
; from memory with 2,000 byte-sized accesses.
;
     sub  si,si
     mov  cx,2000
     call ZTimerOn
     rep  lodsb
     call ZTimerOff

In short, if you must perform a 16-bit memory access, let the 8088 break the access into two byte-sized accesses for you. The 8088 is more efficient at that task than your code can possibly be.

Word-sized variables should be stored in registers to the greatest feasible extent, since registers are inside the 8088, where 16-bit operations are just as fast as 8-bit operations because the 8-bit cycle-eater can’t get at them. In fact, it’s a good idea to keep as many variables of all sorts in registers as you can. Instructions with register-only operands execute very rapidly, partially because they avoid both the time-consuming memory accesses and the lengthy address calculations associated with memory operands.

There is yet another reason why register operands are preferable to memory operands, and it’s an unexpected effect of the 8-bit bus cycle-eater. Instructions with only register operands tend to be shorter (in terms of bytes) than instructions with memory operands, and when it comes to performance, shorter is usually better. In order to explain why that is true and how it relates to the 8-bit bus cycle-eater, I must diverge for a moment.

For the last few pages, you may well have been thinking that the 8-bit bus cycle-eater, while a nuisance, doesn’t seem particularly subtle or difficult to quantify. After all, any instruction reference tells us exactly how many cycles each instruction loses to the 8-bit bus cycle-eater, doesn’t it?

Yes and no. It’s true that in general we know approximately how much longer a given instruction will take to execute with a word-sized memory operand than with a byte-sized operand, although the dynamic RAM refresh and wait state cycle-eaters (which I’ll cover a little later) can raise the cost of the 8-bit bus cycle-eater considerably. However, all word-sized memory accesses lose 4 cycles to the 8-bit bus cycle-eater, and there’s one sort of word-sized memory access we haven’t discussed yet: instruction fetching. The ugliest manifestation of the 8-bit bus cycle-eater is in fact the prefetch queue cycle-eater.

The Prefetch Queue Cycle-Eater

In an 8088 context, here’s the prefetch queue cycle-eater in a nutshell: The 8088’s 8-bit external data bus keeps the Bus Interface Unit from fetching instruction bytes as fast as the 16-bit Execution Unit can execute them, so the Execution Unit often lies idle while waiting for the next instruction byte to be fetched.

Exactly why does this happen? Recall that the 8088 is an 8086 internally, but accesses word-sized memory data at only one-half the maximum rate of the 8086 due to the 8088’s 8-bit external data bus. Unfortunately, instructions are among the word-sized data the 8086 fetches, meaning that the 8088 can fetch instructions at only one-half the speed of the 8086. On the other hand, the 8086-equivalent Execution Unit of the 8088 can execute instructions every bit as fast as the 8086. The net result is that the Execution Unit burns up instruction bytes much faster than the Bus Interface Unit can fetch them, and ends up idling while waiting for instruction bytes to arrive.

The BIU can fetch instruction bytes at a maximum rate of one byte every 4 cycles—and that 4-cycle per instruction byte rate is the ultimate limit on overall instruction execution time, regardless of EU speed. While the EU may execute a given instruction that’s already in the prefetch queue in less than 4 cycles per byte, over time the EU can’t execute instructions any faster than they can arrive—and they can’t arrive faster than 1 byte every 4 cycles.

Clearly, then, the prefetch queue cycle-eater is nothing more than one aspect of the 8-bit bus cycle-eater. 8088 code often runs at less than the Execution Unit’s maximum speed because the 8-bit data bus can’t keep up with the demand for instruction bytes. That’s straightforward enough—so why all the fuss about the prefetch queue cycle-eater?

What makes the prefetch queue cycle-eater tricky is that it’s undocumented and unpredictable. That is, with a word-sized memory access, such as

mov  [bx],ax

it’s well-documented that an extra 4 cycles will always be required to write the upper byte of AX to memory. Not so with the prefetch queue cycle-eater lurking nearby. For instance, the instructions

shr  ax,1
shr  ax,1
shr  ax,1
shr  ax,1
shr  ax,1

should execute in 10 cycles, since each SHR takes 2 cycles to execute, according to Intel’s specifications. Those specifications contain Intel’s official instruction execution times, but in this case—and in many others—the specifications are drastically wrong. Why? Because they describe execution time once an instruction reaches the prefetch queue. They say nothing about whether a given instruction will be in the prefetch queue when it’s time for that instruction to run, or how long it will take that instruction to reach the prefetch queue if it’s not there already. Thanks to the low performance of the 8088’s external data bus, that’s a glaring omission—but, alas, an unavoidable one. Let’s look at why the official execution times are wrong, and why that can’t be helped.

Official Execution Times Are Only Part of the Story

The sequence of 5 SHR instructions in the last example is 10 bytes long. That means that it can never execute in less than 24 cycles even if the 4-byte prefetch queue is full when it starts, since 6 instruction bytes would still remain to be fetched, at 4 cycles per fetch. If the prefetch queue is empty at the start, the sequence could take 40 cycles. In short, thanks to instruction fetching, the code won’t run at its documented speed, and could take up to four times longer than it is supposed to.
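
The bounds in that paragraph fall straight out of the fetch arithmetic; here is a tiny C sketch (illustrative only) that computes them:

/* Illustrative only: fetch-limited execution times for the 5-SHR
   sequence (10 bytes), given the 8088's 4-byte prefetch queue and a
   fetch rate of 4 cycles per instruction byte. */
#include <stdio.h>

int main(void)
{
    int bytes       = 10;  /* 5 SHRs, 2 bytes each */
    int queue_bytes = 4;   /* 8088 prefetch queue size */

    int best  = (bytes - queue_bytes) * 4; /* queue full at the start: 24 cycles */
    int worst = bytes * 4;                 /* queue empty at the start: 40 cycles */

    printf("fetch-limited time: %d to %d cycles (documented EU time: 10)\n",
           best, worst);
    return 0;
}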

Why does Intel document Execution Unit execution time rather than overall instruction execution time, which includes both instruction fetch time and Execution Unit (EU) execution time? Well, instruction fetching isn’t performed as part of instruction execution by the Execution Unit, but instead is carried on in parallel by the Bus Interface Unit (BIU) whenever the external data bus isn’t in use or whenever the EU runs out of instruction bytes to execute. Sometimes the BIU is able to use spare bus cycles to prefetch instruction bytes before the EU needs them, so in those cases instruction fetching takes no time at all, practically speaking. At other times the EU executes instructions faster than the BIU can fetch them, and instruction fetching then becomes a significant part of overall execution time. As a result, the effective fetch time for a given instruction varies greatly depending on the code mix preceding that instruction. Similarly, the state in which a given instruction leaves the prefetch queue affects the overall execution time of the following instructions.

In other words, while the execution time for a given instruction is constant, the fetch time for that instruction depends heavily on the context in which the instruction is executing—the amount of prefetching the preceding instructions allowed—and can vary from a full 4 cycles per instruction byte to no time at all.

As we’ll see later, other cycle-eaters, such as DRAM refresh and display memory wait states, can cause prefetching variations even during different executions of the same code sequence. Given that, it’s meaningless to talk about the prefetch time of a given instruction except in the context of a specific code sequence.

So now you know why the official instruction execution times are often wrong, and why Intel can’t provide better specifications. You also know now why it is that you must time your code if you want to know how fast it really is.

There Is No Such Beast as a True Instruction Execution Time

The effect of the code preceding an instruction on the execution time of that instruction makes the Zen timer trickier to use than you might expect, and complicates the interpretation of the results reported by the Zen timer. For one thing, the Zen timer is best used to time code sequences that are more than a few instructions long; below 10µs or so, prefetch queue effects and the limited resolution of the clock driving the timer can cause problems.

Some slight prefetch queue-induced inaccuracy usually exists even when the Zen timer is used to time longer code sequences, since the calls to the Zen timer usually alter the code’s prefetch queue from its normal state. (Branches—jumps, calls, returns and the like—empty the prefetch queue.) Ideally, the Zen timer is used to measure the performance of an entire subroutine, so the prefetch queue effects of the branches at the start and end of the subroutine are similar to the effects of the calls to the Zen timer when you’re measuring the subroutine’s performance.

Another way in which the prefetch queue cycle-eater complicates the use of the Zen timer involves the practice of timing the performance of a few instructions over and over. I’ll often repeat one or two instructions 100 or 1,000 times in a row in listings in this book in order to get timing intervals that are long enough to provide reliable measurements. However, as we just learned, the actual performance of any 8088 instruction depends on the code mix preceding any given use of that instruction, which in turn affects the state of the prefetch queue when the instruction starts executing. Alas, the execution time of an instruction preceded by dozens of identical instructions reflects just one of many possible prefetch states (and not a very likely state at that), and some of the other prefetch states may well produce distinctly different results.

For example, consider the code in Listings 4.5 and 4.6. Listing 4.5 shows our familiar SHR case. Here, because the prefetch queue is always empty, execution time should work out to about 4 cycles per byte, or 8 cycles per SHR, as shown in Figure 4.3. (Figure 4.3 illustrates the relationship between instruction fetching and execution in a simplified way, and is not intended to show the exact timings of 8088 operations.) That’s quite a contrast to the official 2-cycle execution time of SHR. In fact, the Zen timer reports that Listing 4.5 executes in 1.81 µs per instruction, or slightly more than 4 cycles per byte. (The extra time is the result of the dynamic RAM refresh cycle-eater, which we’ll discuss shortly.) Going by Listing 4.5, we would conclude that the “true” execution time of SHR is 8.64 cycles.

LISTING 4.5 LST4-5.ASM

; Measures the performance of 1,000 SHR instructions
; in a row. Since SHR executes in 2 cycles but is
; 2 bytes long, the prefetch queue is always empty,
; and prefetching time determines the overall
; performance of the code.
;
      call  ZTimerOn
      rept  1000
      shr   ax,1
      endm
      call  ZTimerOff

LISTING 4.6 LST4-6.ASM

; Measures the performance of 1,000 MUL/SHR instruction
; pairs in a row. The lengthy execution time of MUL
; should keep the prefetch queue from ever emptying.
;
      mov   cx,1000
      sub   ax,ax
      call  ZTimerOn
      rept  1000
      mul   ax
      shr   ax,1
      endm
      call  ZTimerOff

Figure 4.3  Execution and instruction prefetching sequence for Listing 4.5.

Now let’s examine Listing 4.6. Here each SHR follows a MUL instruction. Since MUL instructions take so long to execute that the prefetch queue is always full when they finish, each SHR should be ready and waiting in the prefetch queue when the preceding MUL ends. As a result, we’d expect that each SHR would execute in 2 cycles; together with the 118-cycle execution time of multiplying 0 times 0, the total execution time should come to 120 cycles per SHR/MUL pair, as shown in Figure 4.4. And, by God, when we run Listing 4.6 we get an execution time of 25.14 µs per SHR/MUL pair, or exactly 120 cycles! According to these results, the “true” execution time of SHR would seem to be 2 cycles, quite a change from the conclusion we drew from Listing 4.5.

The key point is this: We’ve seen one code sequence in which SHR took 8-plus cycles to execute, and another in which it took only 2 cycles. Are we talking about two different forms of SHR here? Of course not—the difference is purely a reflection of the differing states in which the preceding code left the prefetch queue. In Listing 4.5, each SHR after the first few follows a slew of other SHR instructions which have sucked the prefetch queue dry, so overall performance reflects instruction fetch time. By contrast, each SHR in Listing 4.6 follows a MUL instruction which leaves the prefetch queue full, so overall performance reflects Execution Unit execution time.

Clearly, either instruction fetch time or Execution Unit execution time—or even a mix of the two, if an instruction is partially prefetched—can determine code performance. Some people operate under a rule of thumb by which they assume that the execution time of each instruction is 4 cycles times the number of bytes in the instruction. While that’s often true for register-only code, it frequently doesn’t hold for code that accesses memory. For one thing, the rule should be 4 cycles times the number of memory accesses, not instruction bytes, since all accesses take 4 cycles on the 8088-based PC. For another, memory-accessing instructions often have slower Execution Unit execution times than the 4 cycles per memory access rule would dictate, because the 8088 isn’t very fast at calculating memory addresses. Also, the 4 cycles per instruction byte rule isn’t true for register-only instructions that are already in the prefetch queue when the preceding instruction ends.

The truth is that it never hurts performance to reduce either the cycle count or the byte count of a given bit of code, but there’s no guarantee that one or the other will improve performance either. For example, consider Listing 4.7, which consists of a series of 4-cycle, 2-byte MOV AL,0 instructions, and which executes at the rate of 1.81 µs per instruction. Now consider Listing 4.8, which replaces the 4-cycle MOV AL,0 with the 3-cycle (but still 2-byte) SUB AL,AL. Despite its 1-cycle-per-instruction advantage, Listing 4.8 runs at exactly the same speed as Listing 4.7. The reason: Both instructions are 2 bytes long, and in both cases it is the 8-cycle instruction fetch time, not the 3- or 4-cycle Execution Unit execution time, that limits performance.

Figure 4.4  Execution and instruction prefetching sequence for Listing 4.6.

LISTING 4.7 LST4-7.ASM

; Measures the performance of repeated MOV AL,0 instructions,
; which take 4 cycles each according to Intel's official
; specifications.
;
     sub  ax,ax
     call ZTimerOn
     rept 1000
     mov  al,0
     endm
     call ZTimerOff

LISTING 4.8 LST4-8.ASM

; Measures the performance of repeated SUB AL,AL instructions,
; which take 3 cycles each according to Intel's official
; specifications.
;
     sub  ax,ax
     call ZTimerOn
     rept 1000
     sub  al,al
     endm
     call ZTimerOff

As you can see, it’s easy to be drawn into thinking you’re saving cycles when you’re not. You can only improve the performance of a specific bit of code by reducing the factor—either instruction fetch time or execution time, or sometimes a mix of the two—that’s limiting the performance of that code.

In case you missed it in all the excitement, the variability of prefetching means that our method of testing performance by executing 1,000 instructions in a row by no means produces “true” instruction execution times, any more than the official execution times in the Intel manuals are “true” times. The fact of the matter is that a given instruction takes at least as long to execute as the time given for it in the Intel manuals, but may take as much as 4 cycles per byte longer, depending on the state of the prefetch queue when the preceding instruction ends.

The only true execution time for an instruction is a time measured in a certain context, and that time is meaningful only in that context.

What we really want is to know how long useful working code takes to run, not how long a single instruction takes, and the Zen timer gives us the tool we need to gather that information. Granted, it would be easier if we could just add up neatly documented instruction execution times—but that’s not going to happen. Without actually measuring the performance of a given code sequence, you simply don’t know how fast it is. For crying out loud, even the people who designed the 8088 at Intel couldn’t tell you exactly how quickly a given 8088 code sequence executes on the PC just by looking at it! Get used to the idea that execution times are only meaningful in context, learn the rules of thumb in this book, and use the Zen timer to measure your code.

Approximating Overall Execution Times

Don’t think that because overall instruction execution time is determined by both instruction fetch time and Execution Unit execution time, the two times should be added together when estimating performance. For example, practically speaking, each SHR in Listing 4.5 does not take 8 cycles of instruction fetch time plus 2 cycles of Execution Unit execution time to execute. Figure 4.3 shows that while a given SHR is executing, the fetch of the next SHR is starting, and since the two operations are overlapped for 2 cycles, there’s no sense in charging the time to both instructions. You could think of the extra instruction fetch time for SHR in Listing 4.5 as being 6 cycles, which yields an overall execution time of 8 cycles when added to the 2 cycles of Execution Unit execution time.

Alternatively, you could think of each SHR in Listing 4.5 as taking 8 cycles to fetch, and then executing in effectively 0 cycles while the next SHR is being fetched. Whichever perspective you prefer is fine. The important point is that the time during which the execution of one instruction and the fetching of the next instruction overlap should only be counted toward the overall execution time of one of the instructions. For all intents and purposes, one of the two instructions runs at no performance cost whatsoever while the overlap exists.

As a working definition, we’ll consider the execution time of a given instruction in a particular context to start when the first byte of the instruction is sent to the Execution Unit and end when the first byte of the next instruction is sent to the EU.

What to Do about the Prefetch Queue Cycle-Eater?

Reducing the impact of the prefetch queue cycle-eater is one of the overriding principles of high-performance assembly code. How can you do this? One effective technique is to minimize access to memory operands, since such accesses compete with instruction fetching for precious memory accesses. You can also greatly reduce instruction fetch time simply by your choice of instructions: Keep your instructions short. Less time is required to fetch instructions that are 1 or 2 bytes long than instructions that are 5 or 6 bytes long. Reduced instruction fetching lowers minimum execution time (minimum execution time is 4 cycles times the number of instruction bytes) and often leads to faster overall execution.

While short instructions minimize overall prefetch time, ironically they actually often suffer more from the prefetch queue bottleneck than do long instructions. Short instructions generally have such fast execution times that they drain the prefetch queue despite their small size. For example, consider the SHR of Listing 4.5: thanks to the prefetch queue bottleneck, it runs at only about one-quarter of the speed its Execution Unit execution time would suggest, even though it’s only 2 bytes long. Short instructions are nonetheless generally faster than long instructions, thanks to the combination of fewer instruction bytes and faster Execution Unit execution times, and should be used as much as possible—just don’t expect them to run at their “official” documented speeds.

More than anything, the above rules mean using the registers as heavily as possible, both because register-only instructions are short and because they don’t perform memory accesses to read or write operands. However, using the registers is a rule of thumb, not a commandment. In some circumstances, it may actually be faster to access memory. (The look-up table technique is one such case.) What’s more, the performance of the prefetch queue (and hence the performance of each instruction) differs from one code sequence to the next, and can even differ during different executions of the same code sequence.

All in all, writing good assembler code is as much an art as a science. As a result, you should follow the rules of thumb described here—and then time your code to see how fast it really is. You should experiment freely, but always remember that actual, measured performance is the bottom line.

Holding Up the 8088

In this chapter I’ve taken you further and further into the depths of the PC, telling you again and again that you must understand the computer at the lowest possible level in order to write good code. At this point, you may well wonder, “Have we gotten low enough?”

Not quite yet. The 8-bit bus and prefetch queue cycle-eaters are low-level indeed, but we’ve one level yet to go. Dynamic RAM refresh and wait states—our next topics—together form the lowest level at which the hardware of the PC affects code performance. Below this level, the PC is of interest only to hardware engineers.

Before we begin our discussion of dynamic RAM refresh, let’s step back for a moment to take an overall look at this lowest level of cycle-eaters. In truth, the distinctions between wait states and dynamic RAM refresh don’t much matter to a programmer. What is important is that you understand this: Under certain circumstances, devices on the PC bus can stop the CPU for 1 or more cycles, making your code run more slowly than it seemingly should.

Unlike all the cycle-eaters we’ve encountered so far, wait states and dynamic RAM refresh are strictly external to the CPU, as was shown in Figure 4.1. Adapters on the PC’s bus, such as video and memory cards, can insert wait states on any bus access, the idea being that they won’t be able to complete the access properly unless the access is stretched out. Likewise, the channel of the DMA controller dedicated to dynamic RAM refresh can request control of the bus at any time, although the CPU must relinquish the bus before the DMA controller can take over. This means that your code can’t directly control wait states or dynamic RAM refresh. However, code can sometimes be designed to minimize the effects of these cycle-eaters, and even when the cycle-eaters slow your code without there being a thing in the world you can do about it, you’re still better off understanding that you’re losing performance and knowing why your code doesn’t run as fast as it’s supposed to than you were programming in ignorance.

Let’s start with DRAM refresh, which affects the performance of every program that runs on the PC.

Dynamic RAM Refresh: The Invisible Hand

Dynamic RAM (DRAM) refresh is sort of an act of God. By that I mean that DRAM refresh invisibly and inexorably steals a certain fraction of all available memory access time from your programs, when they are accessing memory for code and data. (When they are accessing cache on more recent processors, theoretically the DRAM refresh cycle-eater doesn’t come into play, but there are other cycle-eaters waiting to prey on cache-bound programs.) While you could stop DRAM refresh, you wouldn’t want to since that would be a sure prescription for crashing your computer. In the end, thanks to DRAM refresh, almost all code runs a bit slower on the PC than it otherwise would, and that’s that.

A bit of background: A static RAM (SRAM) chip is a memory chip that retains its contents indefinitely so long as power is maintained. By contrast, each of several blocks of bits in a dynamic RAM (DRAM) chip retains its contents for only a short time after it’s accessed for a read or write. In order to get a DRAM chip to store data for an extended period, each of the blocks of bits in that chip must be accessed regularly, so that the chip’s stored data is kept refreshed and valid. So long as this is done often enough, a DRAM chip will retain its contents indefinitely.

All of the PC’s system memory consists of DRAM chips. Each DRAM chip in the PC must be completely refreshed about once every four milliseconds in order to ensure the integrity of the data it stores. Obviously, it’s highly desirable that the memory in the PC retain the correct data indefinitely, so each DRAM chip in the PC must always be refreshed within 4 ms of the last refresh. Since there’s no guarantee that a given program will access each and every DRAM block once every 4 ms, the PC contains special circuitry and programming for providing DRAM refresh.

How DRAM Refresh Works in the PC

On the original 8088-based IBM PC, timer 1 of the 8253 timer chip is programmed at power-up to generate a signal once every 72 cycles, or once every 15.08µs. That signal goes to channel 0 of the 8237 DMA controller, which requests the bus from the 8088 upon receiving the signal. (DMA stands for direct memory access, the ability of a device other than the 8088 to control the bus and access memory directly, without any help from the 8088.) As soon as the 8088 is between memory accesses, it gives control of the bus to the 8237, which in conjunction with special circuitry on the PC’s motherboard then performs a single 4-cycle read access to 1 of 256 possible addresses, advancing to the next address on each successive access. (The read access is only for the purpose of refreshing the DRAM; the data that is read isn’t used.)

The 256 addresses accessed by the refresh DMA accesses are arranged so that taken together they properly refresh all the memory in the PC. By accessing one of the 256 addresses every 15.08 µs, all of the PC’s DRAM is refreshed in 256 x 15.08 µs, or 3.86 ms, which is just about the desired 4 ms time I mentioned earlier. (Only the first 640K of memory is refreshed in the PC; video adapters and other adapters above 640K containing memory that requires refreshing must provide their own DRAM refresh in pre-AT systems.)
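
If you’re wondering where 15.08 µs and 3.86 ms come from, they fall straight out of the PC’s clock rate. The fragment below (again mine, not one of the book’s listings) does the arithmetic, assuming the standard 14.31818 MHz/3 (about 4.77 MHz) CPU clock:

/* Where the refresh timing figures come from, assuming the PC's standard
 * 14.31818 MHz / 3 (about 4.77 MHz) CPU clock. Not one of the book's
 * listings; just arithmetic. */
#include <stdio.h>

int main(void)
{
    double clock_hz   = 14318180.0 / 3.0;            /* ~4.77 MHz                 */
    double refresh_us = 72.0 / clock_hz * 1.0e6;     /* one refresh per 72 cycles */
    double full_ms    = 256.0 * refresh_us / 1000.0; /* all 256 refresh addresses */

    /* about 15.09 us and 3.86 ms -- essentially the figures quoted above */
    printf("one refresh every %.2f us; full refresh every %.2f ms\n",
           refresh_us, full_ms);
    return 0;
}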

Don’t sweat the details here. The important point is this: For at least 4 out of every 72 cycles, the original PC’s bus is given over to DRAM refresh and is not available to the 8088, as shown in Figure 4.5. That means that as much as 5.56 percent of the PC’s already inadequate bus capacity is lost. However, DRAM refresh doesn’t necessarily stop the 8088 in its tracks for 4 cycles. The Execution Unit of the 8088 can keep processing while DRAM refresh is occurring, unless the EU needs to access memory. Consequently, DRAM refresh can slow code performance anywhere from 0 percent to 5.56 percent (and actually a bit more, as we’ll see shortly), depending on the extent to which DRAM refresh occupies cycles during which the 8088 would otherwise be accessing memory.

Figure 4.5  The PC bus dynamic RAM (DRAM) refresh.

The Impact of DRAM Refresh

Let’s look at examples from opposite ends of the spectrum in terms of the impact of DRAM refresh on code performance. First, consider the series of MUL instructions in Listing 4.9. Since a 16-bit MUL on the 8088 executes in between 118 and 133 cycles and is only 2 bytes long, there should be plenty of time for the prefetch queue to fill after each instruction, even after DRAM refresh has taken its slice of memory access time. Consequently, the prefetch queue should be able to keep the Execution Unit well-supplied with instruction bytes at all times. Since Listing 4.9 uses no memory operands, the Execution Unit should never have to wait for data from memory, and DRAM refresh should have no impact on performance. (Remember that the Execution Unit can operate normally during DRAM refreshes so long as it doesn’t need to request a memory access from the Bus Interface Unit.)

LISTING 4.9 LST4-9.ASM

; Measures the performance of repeated MUL instructions,
; which allow the prefetch queue to be full at all times,
; to demonstrate a case in which DRAM refresh has no impact
; on code performance.
;
     sub  ax,ax
     call ZTimerOn
     rept 1000
     mul  ax
     endm
     call ZTimerOff

Running Listing 4.9, we find that each MUL executes in 24.72 µs, or exactly 118 cycles. Since that’s the shortest time in which MUL can execute, we can see that no performance is lost to DRAM refresh. Listing 4.9 clearly illustrates that DRAM refresh only affects code performance when a DRAM refresh forces the Execution Unit of the 8088 to wait for a memory access.

Now let’s look at the series of SHR instructions shown in Listing 4.10. Since SHR executes in 2 cycles but is 2 bytes long, the prefetch queue should be empty while Listing 4.10 executes, with the 8088 prefetching instruction bytes non-stop. As a result, the time per instruction of Listing 4.10 should precisely reflect the time required to fetch the instruction bytes.

LISTING 4.10 LST4-10.ASM

; Measures the performance of repeated SHR instructions,
; which empty the prefetch queue, to demonstrate the
; worst-case impact of DRAM refresh on code performance.
;
     call ZTimerOn
     rept 1000
     shr  ax,1
     endm
     call ZTimerOff

Since 4 cycles are required to read each instruction byte, we’d expect each SHR to execute in 8 cycles, or 1.676 µs, if there were no DRAM refresh. In fact, each SHR in Listing 4.10 executes in 1.81 µs, indicating that DRAM refresh is taking 7.4 percent of the program’s execution time. That’s nearly 2 percent more than our worst-case estimate of the loss to DRAM refresh overhead! In fact, the result indicates that DRAM refresh is stealing not 4, but 5.33 cycles out of every 72 cycles. How can this be?

The answer is that a given DRAM refresh can actually hold up CPU memory accesses for as many as 6 cycles, depending on the timing of the DRAM refresh’s DMA request relative to the 8088’s internal instruction execution state. When the code in Listing 4.10 runs, each DRAM refresh holds up the CPU for either 5 or 6 cycles, depending on where the 8088 is in executing the current SHR instruction when the refresh request occurs. Now we see that things can get even worse than we thought: DRAM refresh can steal as much as 8.33 percent of available memory access time—6 out of every 72 cycles—from the 8088.

Which of the two cases we’ve examined reflects reality? While either case can happen, the latter case—significant performance reduction, ranging as high as 8.33 percent—is far more likely to occur. This is especially true for high-performance assembly code, which uses fast instructions that tend to cause non-stop instruction fetching.

What to Do About the DRAM Refresh Cycle-Eater?

Hmmm. When we discovered the prefetch queue cycle-eater, we learned to use short instructions. When we discovered the 8-bit bus cycle-eater, we learned to use byte-sized memory operands whenever possible, and to keep word-sized variables in registers. What can we do to work around the DRAM refresh cycle-eater?

Nothing.

As I’ve said before, DRAM refresh is an act of God. DRAM refresh is a fundamental, unchanging part of the PC’s operation, and there’s nothing you or I can do about it. If refresh were any less frequent, the reliability of the PC would be compromised, so tinkering with either timer 1 or DMA channel 0 to reduce DRAM refresh overhead is out. Nor is there any way to structure code to minimize the impact of DRAM refresh. Sure, some instructions are affected less by DRAM refresh than others, but how many multiplies and divides in a row can you really use? I suppose that code could conceivably be structured to leave a free memory access every 72 cycles, so DRAM refresh wouldn’t have any effect. In the old days when code size was measured in bytes, not K bytes, and processors were less powerful—and complex—programmers did in fact use similar tricks to eke every last bit of performance from their code. When programming the PC, however, the prefetch queue cycle-eater would make such careful code synchronization a difficult task indeed, and any modest performance improvement that did result could never justify the increase in programming complexity and the limits on creative programming that such an approach would entail. Besides, all that effort goes to waste on faster 8088s, 286s, and other computers with different execution speeds and refresh characteristics. There’s no way around it: Useful code accesses memory frequently and at irregular intervals, and over the long haul DRAM refresh always exacts its price.

If you’re still harboring thoughts of reducing the overhead of DRAM refresh, consider this. Instructions that tend not to suffer very much from DRAM refresh are those that have a high ratio of execution time to instruction fetch time, and those aren’t the fastest instructions of the PC. It certainly wouldn’t make sense to use slower instructions just to reduce DRAM refresh overhead, for it’s total execution time—DRAM refresh, instruction fetching, and all—that matters.

The important thing to understand about DRAM refresh is that it generally slows your code down, and that the extent of that performance reduction can vary considerably and unpredictably, depending on how the DRAM refreshes interact with your code’s pattern of memory accesses. When you use the Zen timer and get a fractional cycle count for the execution time of an instruction, that’s often the DRAM refresh cycle-eater at work. (The display adapter cycle-eater is another possible culprit, and, on 386s and later processors, cache misses and pipeline execution hazards produce this sort of effect as well.) Whenever you get two timing results that differ by more or less than they seemingly should, that’s usually DRAM refresh too. Thanks to DRAM refresh, variations of up to 8.33 percent in PC code performance are par for the course.

Wait States

Wait states are cycles during which a bus access by the CPU to a device on the PC’s bus is temporarily halted by that device while the device gets ready to complete the read or write. Wait states are well and truly the lowest level of code performance. Everything we have discussed (and will discuss)—even DMA accesses—can be affected by wait states.

Wait states exist because the CPU must be able to coexist with any adapter, no matter how slow (within reason). The 8088 expects to be able to complete each bus access—a memory or I/O read or write—in 4 cycles, but adapters can’t always respond that quickly for a number of reasons. For example, display adapters must split access to display memory between the CPU and the circuitry that generates the video signal based on the contents of display memory, so they often can’t immediately fulfill a request by the CPU for a display memory read or write. To resolve this conflict, display adapters can tell the CPU to wait during bus accesses by inserting one or more wait states, as shown in Figure 4.6. The CPU simply sits and idles as long as wait states are inserted, then completes the access as soon as the display adapter indicates its readiness by no longer inserting wait states. The same would be true of any adapter that couldn’t keep up with the CPU.

Mind you, this is all transparent to executing code. An instruction that encounters wait states runs exactly as if there were no wait states, only slower. Wait states are nothing more or less than wasted time as far as the CPU and your program are concerned.

By understanding the circumstances in which wait states can occur, you can avoid them when possible. Even when it’s not possible to work around wait states, it’s still to your advantage to understand how they can cause your code to run more slowly.

First, let’s learn a bit more about wait states by contrast with DRAM refresh. Unlike DRAM refresh, wait states do not occur on any regularly scheduled basis, and are of no particular duration. Wait states can only occur when an instruction performs a memory or I/O read or write. Both the presence of wait states and the number of wait states inserted on any given bus access are entirely controlled by the device being accessed. When it comes to wait states, the CPU is passive, merely accepting whatever wait states the accessed device chooses to insert during the course of the access. All of this makes perfect sense given that the whole point of the wait state mechanism is to allow a device to stretch out any access to itself for however much time it needs to perform the access.

Figure 4.6  Video wait states inserted by the display adapter.

As with DRAM refresh, wait states don’t stop the 8088 completely. The Execution Unit can continue processing while wait states are inserted, so long as the EU doesn’t need to perform a bus access. However, in the PC, wait states most often occur when an instruction accesses a memory operand, so in fact the Execution Unit usually is stopped by wait states. (Instruction fetches rarely wait in an 8088-based PC because system memory is zero-wait-state. AT-class memory systems routinely insert 1 or more wait states, however.)

As it turns out, wait states pose a serious problem in just one area in the PC. While any adapter can insert wait states, in the PC only display adapters do so to the extent that performance is seriously affected.

The Display Adapter Cycle-Eater

Display adapters must serve two masters, and that creates a fundamental performance problem. Master #1 is the circuitry that drives the display screen. This circuitry must constantly read display memory in order to obtain the information used to draw the characters or dots displayed on the screen. Since the screen must be redrawn between 50 and 70 times per second, and since each redraw of the screen can require as many as 36,000 reads of display memory (more in Super VGA modes), master #1 is a demanding master indeed. No matter how demanding master #1 gets, however, its needs must always be met—otherwise the quality of the picture on the screen would suffer.

Master #2 is the CPU, which reads from and writes to display memory in order to manipulate the bytes that the video circuitry reads to form the picture on the screen. Master #2 is less important than master #1, since the CPU affects display quality only indirectly. In other words, if the video circuitry has to wait for display memory accesses, the picture will develop holes, snow, and the like, but if the CPU has to wait for display memory accesses, the program will just run a bit slower—no big deal.

It matters a great deal which master is more important, for while both the CPU and the video circuitry must gain access to display memory, only one of the two masters can read or write display memory at any one time. Potential conflicts are resolved by flat-out guaranteeing the video circuitry however many accesses to display memory it needs, with the CPU waiting for whatever display memory accesses are left over.

It turns out that the 8088 CPU has to do a lot of waiting, for three reasons. First, the video circuitry can take as much as about 90 percent of the available display memory access time, as shown in Figure 4.7, leaving as little as about 10 percent of all display memory accesses for the 8088. (These percentages vary considerably among the many EGA and VGA clones.)

Figure 4.7  Allocation of display memory access.

Second, because the displayed dots (or pixels, short for “picture elements”) must be drawn on the screen at a constant speed, many display adapters provide memory accesses only at fixed intervals. As a result, time can be lost while the 8088 synchronizes with the start of the next display adapter memory access, even if the video circuitry isn’t accessing display memory at that time, as shown in Figure 4.8.

Finally, the time it takes a display adapter to complete a memory access is related to the speed of the clock which generates pixels on the screen rather than to the memory access speed of the 8088. Consequently, the time taken for display memory to complete an 8088 read or write access is often longer than the time taken for system memory to complete an access, even if the 8088 lucks into hitting a free display memory access just as it becomes available, again as shown in Figure 4.8. Any or all of the three factors I’ve described can result in wait states, slowing the 8088 and creating the display adapter cycle-eater.

Figure 4.8  Display memory access slots.

If some of this is Greek to you, don’t worry. The important point is that display memory is not very fast compared to normal system memory. How slow is it? Incredibly slow. Remember how slow IBM’s ill-fated PCjr was? In case you’ve forgotten, I’ll refresh your memory: The PCjr was at best only half as fast as the PC. The PCjr had an 8088 running at 4.77 MHz, just like the PC—why do you suppose it was so much slower? I’ll tell you why: All the memory in the PCjr was display memory.

Enough said. All the memory in the PC is not display memory, however, and unless you’re thickheaded enough to put code in display memory, the PC isn’t going to run as slowly as a PCjr. (Putting code or other non-video data in unused areas of display memory sounds like a neat idea—until you consider the effect on instruction prefetching of cutting the 8088’s already-poor memory access performance in half. Running your code from display memory is sort of like running on a hypothetical 8084—an 8086 with a 4-bit bus. Not recommended!) Given that your code and data reside in normal system memory below the 640K mark, how great an impact does the display adapter cycle-eater have on performance?

The answer varies considerably depending on what display adapter and what display mode we’re talking about. The display adapter cycle-eater is worst with the Enhanced Graphics Adapter (EGA) and the original Video Graphics Array (VGA). (Many VGAs, especially newer ones, insert many fewer wait states than IBM’s original VGA. On the other hand, Super VGAs have more bytes of display memory to be accessed in high-resolution mode.) While the Color/Graphics Adapter (CGA), Monochrome Display Adapter (MDA), and Hercules Graphics Card (HGC) all suffer from the display adapter cycle-eater as well, they suffer to a lesser degree. Since the VGA represents the base standard for PC graphics now and for the foreseeable future, and since it is the hardest graphics adapter to wring performance from, we’ll restrict our discussion to the VGA (and its close relative, the EGA) for the remainder of this chapter.

The Impact of the Display Adapter Cycle-Eater

Even on the EGA and VGA, the effect of the display adapter cycle-eater depends on the display mode selected. In text mode, the display adapter cycle-eater is rarely a major factor. It’s not that the cycle-eater isn’t present; however, a mere 4,000 bytes control the entire text mode display, and even with the display adapter cycle-eater it just doesn’t take that long to manipulate 4,000 bytes. Even if the display adapter cycle-eater were to cause the 8088 to take as much as 5 µs per display memory access—more than five times normal—it would still take only 4,000 x 2 x 5 µs, or 40 ms, to read and write every byte of display memory. That’s a lot of time as measured in 8088 cycles, but it’s less than the blink of an eye in human time, and video performance only matters in human time. After all, the whole point of drawing graphics is to convey visual information, and if that information can be presented faster than the eye can see, that is by definition fast enough.

That’s not to say that the display adapter cycle-eater can’t matter in text mode. In Chapter 3, I recounted the story of a debate among letter-writers to a magazine about exactly how quickly characters could be written to display memory without causing snow. The writers carefully added up Intel’s instruction cycle times to see how many writes to display memory they could squeeze into a single horizontal retrace interval. (On a CGA, it’s only during the short horizontal retrace interval and the longer vertical retrace interval that display memory can be accessed in 80-column text mode without causing snow.) Of course, now we know that their cardinal sin was to ignore the prefetch queue; even if there were no wait states, their calculations would have been overly optimistic. There are display memory wait states as well, however, so the calculations were not just optimistic but wildly optimistic.

Text mode situations such as the above notwithstanding, where the display adapter cycle-eater really kicks in is in graphics mode, and most especially in the high-resolution graphics modes of the EGA and VGA. The problem here is not that there are necessarily more wait states per access in high-resolution graphics modes (that varies from adapter to adapter and mode to mode). Rather, the problem is simply that there are many more bytes of display memory per screen in these modes than in lower-resolution graphics modes and in text modes, so many more display memory accesses—each incurring its share of display memory wait states—are required in order to draw an image of a given size. When accessing the many thousands of bytes used in the high-resolution graphics modes, the cumulative effects of display memory wait states can seriously impact code performance, even as measured in human time.

For example, if we assume the same 5 µs per display memory access for the EGA’s high-resolution graphics mode that we assumed for text mode, it would take 26,000 x 2 x 5 µs, or 260 ms, to scroll the screen once in the EGA’s high-resolution graphics mode, mode 10H. That’s more than one-quarter of a second—noticeable by human standards, an eternity by computer standards.

That sounds pretty serious, but we did make an unfounded assumption about memory access speed. Let’s get some hard numbers. Listing 4.11 accesses display memory at the 8088’s maximum speed, by way of a REP MOVSW with display memory as both source and destination. The code in Listing 4.11 executes in 3.18 µs per access to display memory—not as long as we had assumed, but a long time nonetheless.

LISTING 4.11 LST4-11.ASM

; Times speed of memory access to Enhanced Graphics
; Adapter graphics mode display memory at A000:0000.
;
     mov  ax,0010h
     int  10h         ;select hi-res EGA graphics
                      ; mode 10 hex (AH=0 selects
                      ; BIOS set mode function,
                      ; with AL=mode to select)
;
     mov  ax,0a000h
     mov  ds,ax
     mov  es,ax       ;move to & from same segment
     sub  si,si       ;move to & from same offset
     mov  di,si
     mov  cx,800h     ;move 2K words
     cld
     call ZTimerOn
     rep  movsw       ;simply read each of the first
                      ; 2K words of the destination segment,
                      ; writing each byte immediately back
                      ; to the same address. No memory
                      ; locations are actually altered; this
                      ; is just to measure memory access
                      ; times
     call ZTimerOff
;
     mov  ax,0003h
     int  10h         ;return to text mode

For comparison, let’s see how long the same code takes when accessing normal system RAM instead of display memory. The code in Listing 4.12, which performs a REP MOVSW from the code segment to the code segment, executes in 1.39 µs per memory access. That means that on average, 1.79 µs (more than 8 cycles!) are lost to the display adapter cycle-eater on each access. In other words, the display adapter cycle-eater can more than double the execution time of 8088 code!

LISTING 4.12 LST4-12.ASM

; Times speed of memory access to normal system
; memory.
;
     mov  ax,ds
     mov  es,ax       ;move to & from same segment
     sub  si,si       ;move to & from same offset
     mov  di,si
     mov  cx,800h     ;move 2K words
     cld
     call ZTimerOn
     rep  movsw       ;simply read each of the first
                      ; 2K words of the destination segment,
                      ; writing each byte immediately back
                      ; to the same address. No memory
                      ; locations are actually altered; this
                      ; is just to measure memory access
                      ; times
     call ZTimerOff

Bear in mind that we’re talking about a worst case here; the impact of the display adapter cycle-eater is proportional to the percent of time a given code sequence spends accessing display memory.

A line-drawing subroutine, which executes perhaps a dozen instructions for each display memory access, generally loses less performance to the display adapter cycle-eater than does a block-copy or scrolling subroutine that uses REP MOVS instructions. Scaled and three-dimensional graphics, which spend a great deal of time performing calculations (often using very slow floating-point arithmetic), tend to suffer less.

In addition, code that accesses display memory infrequently tends to suffer only about half of the maximum display memory wait states, because on average such code will access display memory halfway between one available display memory access slot and the next. As a result, code that accesses display memory less intensively than the code in Listing 4.11 will on average lose 4 or 5 rather than 8-plus cycles to the display adapter cycle-eater on each memory access.

Nonetheless, the display adapter cycle-eater always takes its toll on graphics code. Interestingly, that toll becomes much higher on ATs and 80386 machines because while those computers can execute many more instructions per microsecond than can the 8088-based PC, it takes just as long to access display memory on those computers as on the 8088-based PC. Remember, the limited speed of access to a graphics adapter is an inherent characteristic of the adapter, so the fastest computer around can’t access display memory one iota faster than the adapter will allow.

What to Do about the Display Adapter Cycle-Eater?

What can we do about the display adapter cycle-eater? Well, we can minimize display memory accesses whenever possible. In particular, we can try to avoid read/modify/write display memory operations of the sort used to mask individual pixels and clip images. Why? Because read/modify/write operations require two display memory accesses (one read and one write) each time display memory is manipulated. Instead, we should try to use writes of the sort that set all the pixels in a given byte of display memory at once, since such writes don’t require accompanying read accesses. The key here is that only half as many display memory accesses are required to write a byte to display memory as are required to read a byte from display memory, mask part of it off and alter the rest, and write the byte back to display memory. Half as many display memory accesses means half as many display memory wait states.
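
To make the contrast concrete, here’s a sketch of the two kinds of write in C. This is illustrative code of my own, not one of the book’s listings; it assumes a Borland-style 16-bit DOS compiler that provides far pointers and MK_FP(), and a byte-per-pixel graphics mode with display memory at segment A000H:

/* Contrast between a read/modify/write of a display memory byte (two
 * display memory accesses) and a plain write (one access). Assumes a
 * Borland-style 16-bit DOS compiler providing far pointers and MK_FP(),
 * and a byte-per-pixel mode with display memory at segment A000H;
 * illustrative only. */
#include <dos.h>

#define SCREEN_SEGMENT 0xA000

/* read/modify/write: one display memory read plus one write */
void MaskedWrite(unsigned int offset, unsigned char mask, unsigned char color)
{
    unsigned char far *p = (unsigned char far *)MK_FP(SCREEN_SEGMENT, offset);
    *p = (unsigned char)((*p & ~mask) | (color & mask));
}

/* straight write: a single display memory access */
void PlainWrite(unsigned int offset, unsigned char color)
{
    unsigned char far *p = (unsigned char far *)MK_FP(SCREEN_SEGMENT, offset);
    *p = color;
}

The masked write touches display memory twice, once to read and once to write, and so pays the display adapter’s wait states twice; the plain write pays them only once.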

Moreover, 486s and Pentiums, as well as recent Super VGAs, employ write-caching schemes that make display memory writes considerably faster than display memory reads.

Along the same line, the display adapter cycle-eater makes the popular exclusive-OR animation technique, which requires paired reads and writes of display memory, less-than-ideal for the PC. Exclusive-OR animation should be avoided in favor of simply writing images to display memory whenever possible.

Another principle for display adapter programming on the 8088 is to perform multiple accesses to display memory very rapidly, in order to make use of as many of the scarce accesses to display memory as possible. This is especially important when many large images need to be drawn quickly, since only by using virtually every available display memory access can many bytes be written to display memory in a short period of time. Repeated string instructions are ideal for making maximum use of display memory accesses; of course, repeated string instructions can only be used on whole bytes, so this is another point in favor of modifying display memory a byte at a time. (On faster processors, however, display memory is so slow that it often pays to do several instructions worth of work between display memory accesses, to take advantage of cycles that would otherwise be wasted on the wait states.)

It would be handy to explore the display adapter cycle-eater issue in depth, with lots of example code and execution timings, but alas, I don’t have the space for that right now. For the time being, all you really need to know about the display adapter cycle-eater is that on the 8088 you can lose more than 8 cycles of execution time on each access to display memory. For intensive access to display memory, the loss really can be as high as 8 cycles (and up to 50, 100, or even more on 486s and Pentiums paired with slow VGAs), while for average graphics code the loss is closer to 4 cycles; in either case, the impact on performance is significant. There is only one way to discover just how significant the impact of the display adapter cycle-eater is for any particular graphics code, and that is of course to measure the performance of that code.

Cycle-Eaters: A Summary

We’ve covered a great deal of sophisticated material in this chapter, so don’t feel bad if you haven’t understood everything you’ve read; it will all become clear from further reading, especially once you study, time, and tune code that you have written yourself. What’s really important is that you come away from this chapter understanding that on the 8088:

  • The 8-bit bus cycle-eater causes each access to a word-sized operand to be 4 cycles longer than an equivalent access to a byte-sized operand.
  • The prefetch queue cycle-eater can cause instruction execution times to be as much as four times longer than the officially documented cycle times.
  • The DRAM refresh cycle-eater slows most PC code, with performance reductions ranging as high as 8.33 percent.
  • The display adapter cycle-eater typically doubles and can more than triple the length of the standard 4-cycle access to display memory, with intensive display memory access suffering most.

This basic knowledge about cycle-eaters puts you in a good position to understand the results reported by the Zen timer, and that means that you’re well on your way to writing high-performance assembler code.

What Does It All Mean?

There you have it: life under the programming interface. It’s not a particularly pretty picture; the inhabitants of that strange realm where hardware and software meet are little-known cycle-eaters that sap the speed from your unsuspecting code. Still, some of those cycle-eaters can be minimized by keeping instructions short, using the registers, using byte-sized memory operands, and accessing display memory as little as possible. None of the cycle-eaters can be eliminated, and dynamic RAM refresh can scarcely be addressed at all; still, aren’t you better off knowing how fast your code really runs—and why—than you were reading the official execution times and guessing? And while specific cycle-eaters vary in importance on later x86-family processors, with some cycle-eaters vanishing altogether and new ones appearing, the concept that understanding these obscure gremlins is a key to performance remains unchanged, as we’ll see again and again in later chapters.

Chapter 5 – Crossing the Border

Searching Files with Restartable Blocks

We just moved. Those three little words should strike terror into the heart of anyone who owns more than a sleeping bag and a toothbrush. Our last move was the usual zoo—and then some. Because the distance from the old house to the new was only five miles, we used cars to move everything smaller than a washing machine. We have a sizable household—cats, dogs, kids, computers, you name it—so the moving process took a number of car trips. A large number—33, to be exact. I personally spent about 15 hours just driving back and forth between the two houses. The move took days to complete.

Never again.

You’re probably wondering two things: What does this have to do with high-performance programming, and why on earth didn’t I rent a truck and get the move over in one or two trips, saving hours of driving? As it happens, the second question answers the first. I didn’t rent a truck because it seemed easier and cheaper to use cars—no big truck to drive, no rentals, spread the work out more manageably, and so on.

It wasn’t easier, and wasn’t even much cheaper. (It costs quite a bit to drive a car 330 miles, to say nothing of the value of 15 hours of my time.) But, at the time, it seemed as though my approach would be easier and cheaper. In fact, I didn’t realize just how much time I had wasted driving back and forth until I sat down to write this chapter.

In Chapter 1, I briefly discussed using restartable blocks. This, you might remember, is the process of handling in chunks data sets too large to fit in memory so that they can be processed just about as fast as if they did fit in memory. The restartable block approach is very fast but is relatively difficult to program.

At the opposite end of the spectrum lies byte-by-byte processing, whereby DOS (or, in less extreme cases, a group of library functions) is allowed to do all the hard work, so that you only have to deal with one byte at a time. Byte-by-byte processing is easy to program but can be extremely slow, due to the vast overhead that results from invoking DOS each time a byte must be processed.

Sound familiar? It should. I moved via the byte-by-byte approach, and the overhead of driving back and forth made for miserable performance. Renting a truck (the restartable block approach) would have required more effort and forethought, but would have paid off handsomely.

The easy, familiar approach often has nothing in its favor except that it requires less thinking; not a great virtue when writing high-performance code—or when moving.

And with that, let’s look at a fairly complex application of restartable blocks.

Searching for Text

The application we’re going to examine searches a file for a specified string. We’ll develop a program that will search the file specified on the command line for a string (also specified on the command line), then report whether the string was found or not. (Because the searched-for string is obtained via argv, it can’t contain any whitespace characters.)

This is a very limited subset of what search utilities such as grep can do, and isn’t really intended to be a generally useful application; the purpose is to provide insight into restartable blocks in particular and optimization in general in the course of developing a search engine. That search engine will, however, be easy to plug into any program, and there’s nothing preventing you from using it in a more fruitful context, like searching through a user-selectable file set.

The first point to address in designing our program involves the appropriate text-search approach to use. Literally dozens of workable ways exist to search a file. We can immediately discard all approaches that involve reading any byte of the file more than once, because disk access time is orders of magnitude slower than any data handling performed by our own code. Based on our experience in Chapter 1, we can also discard all approaches that get bytes either one at a time or in small sets from DOS. We want to read big “buffers-full” of bytes at a pop from the searched file, and the bigger the buffer the better—in order to minimize DOS’s overhead. A good rough cut is a buffer that will be between 16K and 64K, depending on the exact search approach, 64K being the maximum size because near pointers make for superior performance.

So we know we want to work with a large buffer, filling it as infrequently as possible. Now we have to figure out how to search through a file by loading it into that large buffer in chunks. To accomplish this, we have to know how we want to do our searching, and that’s not immediately obvious. Where do we begin?

Well, it might be instructive to consider how we would search if our search involved only one buffer, already resident in memory. In other words, suppose we don’t have to bother with file handling at all, and further suppose that we don’t have to deal with searching through multiple blocks. After all, that’s a good description of the all-important inner loop of our searching program, where the program will spend virtually all of its time (aside from the unavoidable disk access overhead).

Avoiding the String Trap

The easiest approach would be to use a C/C++ library function. The closest match to what we need is strstr(), which searches one string for the first occurrence of a second string. However, while strstr() would work, it isn’t ideal for our purposes. The problem is this: Where we want to search a fixed-length buffer for the first occurrence of a string, strstr() searches a string for the first occurrence of another string.

We could put a zero byte at the end of our buffer to allow strstr() to work, but why bother? The strstr() function must spend time either checking for the end of the string being searched or determining the length of that string—wasted effort given that we already know exactly how long our search buffer is. Even if a given strstr() implementation is well-written, its performance will suffer, at least for our application, from unnecessary overhead.

This illustrates why you shouldn’t think of C/C++ library functions as black boxes; understand what they do and try to figure out how they do it, and relate that to their performance in the context you’re interested in.

Brute-Force Techniques

Given that no C/C++ library function meets our needs precisely, an obvious alternative approach is the brute-force technique that uses memcmp() to compare every potential matching location in the buffer to the string we’re searching for, as illustrated in Figure 5.1.

By the way, we could, of course, use our own code, working with pointers in a loop, to perform the comparison in place of memcmp(). But memcmp() will almost certainly use the very fast REPZ CMPS instruction. However, never assume! It wouldn’t hurt to use a debugger to check out the actual machine-code implementation of memcmp() from your compiler. If necessary, you could always write your own assembly language implementation of memcmp().

Figure 5.1  The brute-force searching technique.
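
In C, the brute-force technique boils down to something like the following sketch (an illustration of the idea, not Listing 5.1):

/* Brute-force search: call memcmp() at every position in the buffer at
 * which a full-length match could start. A sketch of the idea, not the
 * code we'll actually use. */
#include <string.h>

int BruteForceSearch(const unsigned char *buffer, unsigned int buffer_length,
                     const unsigned char *search_string,
                     unsigned int search_length)
{
    unsigned int i;

    if (search_length == 0 || search_length > buffer_length)
        return 0;
    for (i = 0; i <= buffer_length - search_length; i++) {
        if (memcmp(buffer + i, search_string, search_length) == 0)
            return 1;               /* match found */
    }
    return 0;                       /* no match in this buffer */
}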

Invoking memcmp() for each potential match location works, but entails considerable overhead. Each comparison requires that parameters be pushed and that a call to and return from memcmp() be performed, along with a pass through the comparison loop. Surely there’s a better way!

Indeed there is. We can eliminate most calls to memcmp() by performing a simple test on each potential match location that will reject most such locations right off the bat. We’ll just check whether the first character of the potentially matching buffer location matches the first character of the string we’re searching for. We could make this check by using a pointer in a loop to scan the buffer for the next match for the first character, stopping to check for a match with the rest of the string only when the first character matches, as shown in Figure 5.2.

Using memchr()

There’s yet a better way to implement this approach, however. Use the memchr() function, which does nothing more or less than find the next occurrence of a specified character in a fixed-length buffer (presumably by using the extremely efficient REPNZ SCASB instruction, although again it wouldn’t hurt to check). By using memchr() to scan for potential matches that can then be fully tested with memcmp(), we can build a highly efficient search engine that takes good advantage of the information we have about the buffer being searched and the string we’re searching for. Our engine also relies heavily on repeated string instructions, assuming that the memchr() and memcmp() library functions are properly coded.

Figure 5.2  The faster string-searching technique.

We’re going to go with this approach in our file-searching program; the only trick lies in deciding how to integrate this approach with restartable blocks in order to search through files larger than our buffer. This certainly isn’t the fastest-possible searching algorithm; as one example, the Boyer-Moore algorithm, which cleverly eliminates many buffer locations as potential matches in the process of checking preceding locations, can be considerably faster. However, the Boyer-Moore algorithm is quite complex to understand and implement, and would distract us from our main focus, restartable blocks, so we’ll save it for a later chapter (Chapter 14, to be precise). Besides, I suspect you’ll find the approach we’ll use to be fast enough for most purposes.

Now that we’ve selected a searching approach, let’s integrate it with file handling and searching through multiple blocks. In other words, let’s make it restartable.

Making a Search Restartable

As it happens, there’s no great trick to putting the pieces of this search program together. Basically, we’ll read in a buffer of data (we’ll work with 16K at a time to avoid signed overflow problems with integers), search it for a match with the memchr()/memcmp() engine described, and exit with a “string found” response if the desired string is found.

Otherwise, we’ll load in another buffer full of data from the file, search it, and so on. The only trick lies in handling potentially matching sequences in the file that start in one buffer and end in the next—that is, sequences that span buffers. We’ll handle this by copying the unchecked bytes at the end of one buffer to the start of the next and reading that many fewer bytes the next time we fill the buffer.

The exact number of bytes to be copied from the end of one buffer to the start of the next is the length of the searched-for string minus 1, since that’s how many bytes at the end of the buffer can’t be checked as possible matches (because the check would run off the end of the buffer).

That’s really all there is to it. Listing 5.1 shows the file-searching program. As you can see, it’s not particularly complex, although a few fairly opaque lines of code are required to handle merging the end of one block with the start of the next. The code that searches a single block—the function SearchForString()—is simple and compact (as it should be, given that it’s by far the most heavily-executed code in the listing).

Listing 5.1 nicely illustrates the core concept of restartable blocks: Organize your program so that you can do your processing within each block as fast as you could if there were only one block—which is to say at top speed—and make your blocks as large as possible in order to minimize the overhead associated with going from one block to the next.

LISTING 5.1 SEARCH.C

/* Program to search the file specified by the first command-line
 * argument for the string specified by the second command-line
 * argument. Performs the search by reading and searching blocks
 * of size BLOCK_SIZE. */

#include <stdio.h>
#include <stdlib.h>  /* exit() prototype */
#include <io.h>      /* open() and read() prototypes */
#include <fcntl.h>
#include <string.h>
#include <alloc.h>   /* alloc.h for Borland compilers,
                        malloc.h for Microsoft compilers */

#define BLOCK_SIZE  0x4000   /* we'll process the file in 16K blocks */

/* Searches the specified number of sequences in the specified
   buffer for matches to SearchString of SearchStringLength. Note
   that the calling code should already have shortened SearchLength
   if necessary to compensate for the distance from the end of the
   buffer to the last possible start of a matching sequence in the
   buffer.
*/

int SearchForString(unsigned char *Buffer, int SearchLength,
   unsigned char *SearchString, int SearchStringLength)
{
   unsigned char *PotentialMatch;

   /* Search so long as there are potential-match locations
      remaining */
   while ( SearchLength ) {
      /* See if the first character of SearchString can be found */
      if ( (PotentialMatch =
            memchr(Buffer, *SearchString, SearchLength)) == NULL ) {
         break;   /* No matches in this buffer */
      }
      /* The first character matches; see if the rest of the string
         also matches */
      if ( SearchStringLength == 1 ) {
         return(1);  /* That one matching character was the whole
                        search string, so we've got a match */
      }
      else {
         /* Check whether the remaining characters match */
         if ( !memcmp(PotentialMatch + 1, SearchString + 1,
               SearchStringLength - 1) ) {
            return(1);  /* We've got a match */
         }
      }
      /* The string doesn't match; keep going by pointing past the
         potential match location we just rejected */
      SearchLength -= PotentialMatch - Buffer + 1;
      Buffer = PotentialMatch + 1;
   }

   return(0);  /* No match found */
}

main(int argc, char *argv[]) {
   int Done;               /* Indicates whether search is done */
   int Handle;             /* Handle of file being searched */
   int WorkingLength;      /* Length of current block */
   int SearchStringLength; /* Length of string to search for */
   int BlockSearchLength;  /* Length to search in current block */
   int Found;              /* Indicates final search completion
                              status */
   int NextLoadCount;      /* # of bytes to read into next block,
                              accounting for bytes copied from the
                              last block */
   unsigned char *WorkingBlock; /* Block storage buffer */
   unsigned char *SearchString; /* Pointer to the string to search for */
   unsigned char *NextLoadPtr;  /* Offset at which to start loading
                                   the next block, accounting for
                                   bytes copied from the last block */

   /* Check for the proper number of arguments */
   if ( argc != 3 ) {
      printf("usage: search filename search-string\n");
      exit(1);
   }

   /* Try to open the file to be searched */
   if ( (Handle = open(argv[1], O_RDONLY | O_BINARY)) == -1 ) {
      printf("Can't open file: %s\n", argv[1]);
      exit(1);
   }
   /* Calculate the length of text to search for */
   SearchString = argv[2];
   SearchStringLength = strlen(SearchString);
   /* Try to get memory in which to buffer the data */
   if ( (WorkingBlock = malloc(BLOCK_SIZE)) == NULL ) {
      printf("Can't get enough memory\n");
      exit(1);
   }

   /* Load the first block at the start of the buffer, and try to
      fill the entire buffer */
   NextLoadPtr = WorkingBlock;
   NextLoadCount = BLOCK_SIZE;
   Done = 0;      /* Not done with search yet */
   Found = 0;     /* Assume we won't find a match */
   /* Search the file in BLOCK_SIZE chunks */
   do {
      /* Read in however many bytes are needed to fill out the block
         (accounting for bytes copied over from the last block), or
         the rest of the bytes in the file, whichever is less */
      if ( (WorkingLength = read(Handle, NextLoadPtr,
            NextLoadCount)) == -1 ) {
         printf("Error reading file %s\n", argv[1]);
         exit(1);
      }
      /* If we didn't read all the bytes we requested, we're done
         after this block, whether we find a match or not */
      if ( WorkingLength != NextLoadCount ) {
         Done = 1;
      }

      /* Account for any bytes we copied from the end of the last
         block in the total length of this block */
      WorkingLength += NextLoadPtr - WorkingBlock;
      /* Calculate the number of bytes in this block that could
         possibly be the start of a matching sequence that lies
         entirely in this block (sequences that run off the end of
         the block will be transferred to the next block and found
         when that block is searched)
      */
      if ( (BlockSearchLength =
               WorkingLength - SearchStringLength + 1) <= 0 ) {
            Done = 1;  /* Too few characters in this block for
                          there to be any possible matches, so this
                          is the final block and we're done without
                          finding a match
                       */
      }
      else {
         /* Search this block */
         if ( SearchForString(WorkingBlock, BlockSearchLength,
               SearchString, SearchStringLength) ) {
            Found = 1;     /* We've found a match */
            Done = 1;
         }
         else {
            /* Copy any bytes from the end of the block that start
               potentially-matching sequences that would run off
               the end of the block over to the next block */
            if ( SearchStringLength > 1 ) {
               memcpy(WorkingBlock,
                  WorkingBlock+BLOCK_SIZE - SearchStringLength + 1,
                  SearchStringLength - 1);
            }
            /* Set up to load the next bytes from the file after the
               bytes copied from the end of the current block */
            NextLoadPtr = WorkingBlock + SearchStringLength - 1;
            NextLoadCount = BLOCK_SIZE - SearchStringLength + 1;
         }
      }
   } while ( !Done );

   /* Report the results */
   if ( Found ) {
      printf("String found\n");
   } else {
      printf("String not found\n");
   }
   exit(Found);   /* Return the found/not found status as the
                     DOS errorlevel */
}

Interpreting Where the Cycles Go

To boost the overall performance of Listing 5.1, I would normally convert SearchForString() to assembly language at this point. However, I’m not going to do that, and the reason is as important a lesson as any discussion of optimized assembly code is likely to be. Take a moment to examine some interesting performance aspects of the C implementation, and all should become much clearer.

As you’ll recall from Chapter 1, one of the important rules for optimization involves knowing when optimization is worth bothering with at all. Another rule involves understanding where most of a program’s execution time is going. That’s more true for Listing 5.1 than you might think.

When Listing 5.1 is run on a 1 MB assembly source file, it takes about three seconds to find the string “xxxend” (which is at the end of the file) on a 20 MHz 386 machine, with the entire file in a disk cache. If BLOCK_SIZE is trimmed from 16K to 4K, execution time does not increase perceptibly! At 2K, the program slows slightly; it’s not until the block size shrinks to 64 bytes that execution time becomes approximately double that of the 16K buffer.

So the first thing we’ve discovered is that, while bigger blocks do make for the best performance, the increment in performance may not be very large, and might not justify the extra memory required for those larger blocks. Our next discovery is that, even though we read the file in large chunks, most of the execution time of Listing 5.1 is nonetheless spent in executing the read() function.

When I replaced the read() function call in Listing 5.1 with code that simply fools the program into thinking that a 1 MB file is being read, the program ran almost instantaneously—in less than 1/2 second, even when the searched-for string wasn’t anywhere to be found. By contrast, Listing 5.1 requires three seconds to run even when searching for a single character that isn’t found anywhere in the file, the case in which a single call to memchr() (and thus a single REPNZ SCASB) can eliminate an entire block at a time.
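
If you want to try the same experiment, here's roughly what I mean by fooling the program. This sketch is mine (the names FakeRead, FakeFileSize, and FakeFileOffset are not part of Listing 5.1); substituted for the read() call, it lets the search run against whatever happens to be in the buffer.

/* Stand-in for read(), for timing purposes only: claims to have read
   the requested bytes without touching the disk, up to a pretended
   1 MB file. The buffer contents are left untouched, since all we
   care about is how long the searching itself takes. */
static long FakeFileSize   = 0x100000L;   /* pretend the file is 1 MB */
static long FakeFileOffset = 0;

int FakeRead(int Handle, void *Buffer, unsigned int Count)
{
   long Remaining = FakeFileSize - FakeFileOffset;
   int  BytesRead;

   BytesRead = (Remaining < (long)Count) ? (int)Remaining : (int)Count;
   FakeFileOffset += BytesRead;
   return(BytesRead);
}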

All in all, the time required for DOS disk access calls is taking up at least 80 percent of execution time, and search time is less than 20 percent of overall execution time. In fact, search time is probably a good deal less than 20 percent of the total, given that the overhead of loading the program, running through the C startup code, opening the file, executing printf(), and exiting the program and returning to the DOS shell is also included in my timings. Given all that, it should be apparent why converting to assembly language isn’t worth the trouble—the best we could do by speeding up the search is a 10 percent or so improvement, and that would require more than doubling the performance of code that already uses repeated string instructions to do most of the work.

Not likely.

Knowing When Assembly Is Pointless

So that’s why we’re not going to go to assembly language in this example—which is not to say it would never be worth converting the search engine in Listing 5.1 to assembly.

If, for example, your application will typically search buffers in which the first character of the search string occurs frequently, as might be the case when searching a text buffer for a string starting with the space character, an assembly implementation might be several times faster. Why? Because assembly code can switch from REPNZ SCASB (to match the first character) to REPZ CMPS (to check the remaining characters) in just a few instructions.

In contrast, Listing 5.1 must return from memchr(), set up parameters, and call memcmp() in order to do the same thing. Likewise, assembly can switch back to REPNZ SCASB after a non-match much more quickly than Listing 5.1. The switching overhead is high; when searching a file completely filled with the character z for the string “zy,” Listing 5.1 takes almost 1/2 minute, or nearly an order of magnitude longer than when searching a file filled with normal text.

It might also be worth converting the search engine to assembly for searches performed entirely in memory; with the overhead of file access eliminated, improvements in search-engine performance would translate directly into significantly faster overall performance. One such application that would have much the same structure as Listing 5.1 would be searching through expanded memory buffers, and another would be searching through huge (segment-spanning) buffers.

And so we find, as we so often will, that optimization is definitely not a cut-and-dried matter, and that there is no such thing as a single “best” approach.

You must know what your application will typically do, and you must know whether you’re more concerned with average or worst-case performance before you can decide how best to speed up your program—and, indeed, whether speeding it up is worth doing at all.

By the way, don’t think that just because very large block sizes don’t much improve performance, it wasn’t worth using restartable blocks in Listing 5.1. Listing 5.1 runs more than three times more slowly with a block size of 32 bytes than with a block size of 4K, and any byte-by-byte approach would surely be slower still, due to the overhead of repeated calls to DOS and/or the C stream I/O library.

Restartable blocks do minimize the overhead of DOS file-access calls in Listing 5.1; it’s just that there’s no way to reduce that overhead to the point where it becomes worth attempting to further improve the performance of our relatively efficient search engine. Although the search engine is by no means fully optimized, it’s nonetheless as fast as there’s any reason for it to be, given the balance of performance among the components of this program.

Always Look Where Execution Is Going

I’ve explained two important lessons: Know when it’s worth optimizing further, and use restartable blocks to process large data sets as a series of blocks, with each block handled at high speed. The first lesson is less obvious than it seems.

When I set out to write this chapter, I fully intended to write an assembly language version of Listing 5.1, and I expected the assembly version to be much faster. When I actually looked at where execution time was going (which I did by modifying the program to remove the calls to the read() function, but a code profiler could be used to do the same thing much more easily), I found that the best code in the world wouldn’t make much difference.

When you try to speed up code, take a moment to identify the hot spots in your program so that you know where optimization is needed and whether it will make a significant difference before you invest your time.
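
If you don't have a profiler handy, even crude timing will do. Here's a minimal sketch of the idea (my own code, not part of Listing 5.1), using the standard C clock() function to bracket the code of interest:

#include <stdio.h>
#include <time.h>

/* TimedCode() is a placeholder for whatever you want to measure. */
void TimedCode(void)
{
   /* ...the code being timed, such as the block-search loop, goes here... */
}

int main(void)
{
   clock_t StartTime, StopTime;

   StartTime = clock();
   TimedCode();
   StopTime = clock();
   printf("Elapsed: %.2f seconds\n",
      (double)(StopTime - StartTime) / CLOCKS_PER_SEC);
   return(0);
}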

As for restartable blocks: Here we tackled a considerably more complex application of restartable blocks than we did in Chapter 1—which turned out not to be so difficult after all. Don’t let irregularities in the programming tasks you tackle, such as strings that span blocks, fluster you into settling for easy, general—and slow—solutions. Focus on making the inner loop—the code that handles each block—as efficient as possible, then structure the rest of your code to support the inner loop.

Programming with restartable blocks isn’t easy, but when speed is an issue, using restartable blocks in the right places more than pays for itself with greatly improved performance. And when speed is not an issue, of course, or in code that’s not time-critical, you wouldn’t dream of wasting your time on optimization.

Would you?

Chapter 6 – Looking Past Face Value

How Machine Instructions May Do More Than You Think

I first met Jeff Duntemann at an authors’ dinner hosted by PC Tech Journal at Fall Comdex, back in 1985. Jeff was already reasonably well-known as a computer editor and writer, although not as famous as Complete Turbo Pascal, editions 1 through 672 (or thereabouts), TURBO TECHNIX, and PC TECHNIQUES would soon make him. I was fortunate enough to be seated next to Jeff at the dinner table, and, not surprisingly, our often animated conversation revolved around computers, computer writing, and more computers (not necessarily in that order).

Although I was making a living at computer work and enjoying it at the time, I nonetheless harbored vague ambitions of being a science-fiction writer when I grew up. (I have since realized that this hardly puts me in elite company, especially in the computer world, where it seems that every other person has told me they plan to write science fiction “someday.” Given that probably fewer than 500—I’m guessing here—original science fiction and fantasy short stories, and perhaps a few more novels than that, are published each year in this country, I see a few mid-life crises coming.)

At any rate, I had accumulated a small collection of rejection slips, and fancied myself something of an old hand in the field. At the end of the dinner, as the other writers complained half-seriously about how little they were paid for writing for Tech Journal, I leaned over to Jeff and whispered, “You know, the pay isn’t so bad here. You should see what they pay for science fiction—even to the guys who win awards!”

To which Jeff replied, “I know. I’ve been nominated for two Hugos.”

Oh.

Had I known I was seated next to a real, live science-fiction writer—an award-nominated writer, by God!—I would have pumped him for all I was worth, but the possibility had never occurred to me. I was at a dinner put on by a computer magazine, seated next to an editor who had just finished a book about Turbo Pascal, and, gosh, it was obvious that the appropriate topic was computers.

For once, the moral is not “don’t judge a book by its cover.” Jeff is in fact what he appeared to be at face value: a computer writer and editor. However, he is more, too; face value wasn’t full value. You’ll similarly find that face value isn’t always full value in computer programming, and especially so when working in assembly language, where many instructions have talents above and beyond their obvious abilities.

On the other hand, there are also a number of instructions, such as LOOP, that are designed to perform specific functions but aren’t always the best instructions for those functions. So don’t judge a book by its cover, either.

Assembly language for the x86 family isn’t like any other language (for which we should, without hesitation, offer our profuse thanks). Assembly language reflects the design of the processor rather than the way we think, so it’s full of multiple instructions that perform similar functions, instructions with odd and often confusing side effects, and endless ways to string together different instructions to do much the same things, often with seemingly minuscule differences that can turn out to be surprisingly important.

To produce the best code, you must decide precisely what you need to accomplish, then put together the sequence of instructions that accomplishes that end most efficiently, regardless of what the instructions are usually used for. That’s why optimization for the PC is an art, and it’s why the best assembly language for the x86 family will almost always handily outperform compiled code. With that in mind, let’s look past face value—and while we’re at it, I’ll toss in a few examples of not judging a book by its cover.

The point to all this: You must come to regard the x86 family instructions for what they do, not what you’re used to thinking they do. Yes, SHL shifts a pattern left—but a look-up table can do the same thing, and can often do it faster. ADD can indeed add two operands, but it can’t put the result in a third register; LEA can. The instruction set is your raw material for writing high-performance code. By limiting yourself to thinking only in certain well-established ways about the various instructions, you’re putting yourself at a substantial disadvantage every time you sit down to program.

In short, the x86 family can do much more than you think—if you’ll use everything it has to offer. Give it a shot!

Memory Addressing and Arithmetic

Years ago, I saw a clip on the David Letterman show in which Letterman walked into a store by the name of “Just Lamps” and asked, “So what do you sell here?”

“Lamps,” he was told. “Just lamps. Can’t you read?”

“Lamps,” he said. “I see. And what else?”

From that bit of sublime idiocy we can learn much about divining the full value of an instruction. To wit:

Quick, what do the x86’s memory addressing modes do?

“Calculate memory addresses,” you no doubt replied. And you’re right, of course. But what else do they do?

They perform arithmetic, that’s what they do, and that’s a distinctly different and often useful perspective on memory address calculations.

For example, suppose you have an array base address in BX and an index into the array in SI. You could add the two registers together to address memory, like this:

add  bx,si
mov  al,[bx]

Or you could let the processor do the arithmetic for you in a single instruction:

mov  al,[bx+si]

The two approaches are functionally interchangeable but not equivalent from a performance standpoint, and which is better depends on the particular context. If it’s a one-shot memory access, it’s best to let the processor perform the addition; it’s generally faster at doing this than a separate ADD instruction would be. If it’s a memory access within a loop, however, it’s advantageous on the 8088 CPU to perform the addition outside the loop, if possible, reducing effective address calculation time inside the loop, as in the following:

      add   bx,si
LoopTop:
      mov   al,[bx]
      inc   bx
      loop  LoopTop

Here, MOV AL,[BX] is two cycles faster than MOV AL,[BX+SI].

On a 286 or 386, however, the balance shifts. MOV AL,[BX+SI] takes no longer than MOV AL,[BX] on these processors because effective address calculations generally take no extra time at all. (According to the MASM manual, one extra clock is required if three memory addressing components, as in MOV AL,[BX+SI+1], are used. I have not been able to confirm this from Intel publications, but then I haven’t looked all that hard.) If you’re optimizing for the 286 or 386, then, you can take advantage of the processor’s ability to perform arithmetic as part of memory address calculations without taking a performance hit.

The 486 is an odd case, in which the use of an index register or the use of a base register that’s the destination of the previous instruction may slow things down, so it is generally but not always better to perform the addition outside the loop on the 486. All memory addressing calculations are free on the Pentium, however. I’ll discuss 486 performance issues in Chapters 12 and 13, and the Pentium in Chapters 19 through 21.

Math via Memory Addressing

You’re probably not particularly wowed to hear that you can use addressing modes to perform memory addressing arithmetic that would otherwise have to be performed with separate arithmetic instructions. You may, however, be a tad more interested to hear that you can also use addressing modes to perform arithmetic that has nothing to do with memory addressing, and with a couple of advantages over arithmetic instructions, at that.

How?

With LEA, the only instruction that performs memory addressing calculations but doesn’t actually address memory. LEA accepts a standard memory addressing operand, but does nothing more than store the calculated memory offset in the specified register, which may be any general-purpose register. The operation of LEA is illustrated in Figure 6.1, which also shows the operation of register-to-register ADD, for comparison.

What does that give us? Two things that ADD doesn’t provide: the ability to perform addition with either two or three operands, and the ability to store the result in any register, not just in one of the source operands.

Imagine that we want to add BX to DI, add two to the result, and store the result in AX. The obvious solution is this:

mov  ax,bx
add  ax,di
add  ax,2

(It would be more compact to increment AX twice than to add two to it, and would probably be faster on an 8088, but that’s not what we’re after at the moment.) An elegant alternative solution is simply:

lea  ax,[bx+di+2]

Likewise, either of the following would copy SI plus two to DI:

mov  di,si
add  di,2

or:

lea  di,[si+2]

Mind you, the only components LEA can add are BX or BP, SI or DI, and a constant displacement, so it’s not going to replace ADD most of the time. Also, LEA is considerably slower than ADD on an 8088, although it is just as fast as ADD on a 286 or 386 when fewer than three memory addressing components are used. LEA is 1 cycle slower than ADD on a 486 if the sum of two registers is used to point to memory, but no slower than ADD on a Pentium. On both a 486 and Pentium, LEA can also be slowed down by addressing interlocks.

Figure 6.1  Operation of ADD Reg,Reg vs. LEA Reg,[Addr].

The Wonders of LEA on the 386

LEA really comes into its own as a “super-ADD” instruction on the 386, 486, and Pentium, where it can take advantage of the enhanced memory addressing modes of those processors. (The 486 and Pentium offer the same modes as the 386, so I’ll refer only to the 386 from now on.) The 386 can do two very interesting things: It can use any 32-bit register (EAX, EBX, and so on) as the memory addressing base register and/or the memory addressing index register, and it can multiply any 32-bit register used as an index by two, four, or eight in the process of calculating a memory address, as shown in Figure 6.2. Let’s see what that’s good for.

Well, the obvious advantage is that any two 32-bit registers, or any 32-bit register and any constant, or any two 32-bit registers and any constant, can be added together, with the result stored in any register. This makes the 32-bit LEA much more generally useful than the standard 16-bit LEA in the role of an ADD with an independent destination.

Figure 6.2  Operation of the 32-bit LEA reg,[Addr].

But what else can LEA do on a 386, besides add?

It can multiply any register used as an index. LEA can multiply only by the power-of-two values 2, 4, or 8, but that’s useful more often than you might imagine, especially when dealing with pointers into tables. Besides, multiplying by 2, 4, or 8 amounts to a left shift of 1, 2, or 3 bits, so we can now add up to two 32-bit registers and a constant, and shift (or multiply) one of the registers to some extent—all with a single instruction. For example,

lea  edi,TableBase[ecx+edx*4]

replaces all this

mov  edi,edx
shl  edi,2
add  edi,ecx
add  edi,offset TableBase

when pointing to an entry in a doubly indexed table.

Multiplication with LEA Using Non-Powers of Two

Are you impressed yet with all that LEA can do on the 386? Believe it or not, one more feature still awaits us. LEA can actually perform a fast multiply of a 32-bit register by some values other than powers of two. You see, the same 32-bit register can be both base and index on the 386, and can be scaled as the index while being used unchanged as the base. That means that you can, for example, multiply EBX by 5 with:

lea ebx,[ebx+ebx*4]

Without LEA and scaling, multiplication of EBX by 5 would require either a relatively slow MUL, along with a set-up instruction or two, or three separate instructions along the lines of the following

mov  edx,ebx
shl  ebx,2
add  ebx,edx

and would in either case require the destruction of the contents of another register.

Multiplying a 32-bit value by a non-power-of-two multiplier in just 2 cycles is a pretty neat trick, even though it works only on a 386 or 486.

The full list of values that LEA can multiply a register by on a 386 or 486 is: 2, 3, 4, 5, 8, and 9. That list doesn’t include every multiplier you might want, but it covers some commonly used ones, and the performance is hard to beat.

I’d like to extend my thanks to Duane Strong of Metagraphics for his help in brainstorming uses for the 386 version of LEA and for pointing out the complications of 486 instruction timings.

Chapter 7 – Local Optimization

Optimizing Halfway between Algorithms and Cycle Counting

You might not think it, but there’s much to learn about performance programming from the Great Buffalo Sauna Fiasco. To wit:

The scene is Buffalo, New York, in the dead of winter, with the snow piled several feet deep. Four college students, living in typical student housing, are frozen to the bone. The third floor of their house, uninsulated and so cold that it’s uninhabitable, has an ancient bathroom. One fabulously cold day, inspiration strikes:

“Hey—we could make that bathroom into a sauna!”

Pandemonium ensues. Someone rushes out and buys a gas heater, and at considerable risk to life and limb hooks it up to an abandoned but still live gas pipe that once fed a stove on the third floor. Someone else gets sheets of plastic and lines the walls of the bathroom to keep the moisture in, and yet another student gets a bucket full of rocks. The remaining chap brings up some old wooden chairs and sets them up to make benches along the sides of the bathroom. Voila—instant sauna!

They crank up the gas heater, put the bucket of rocks in front of it, close the door, take off their clothes, and sit down to steam themselves. Mind you, it’s not yet 50 degrees Fahrenheit in this room, but the gas heater is roaring. Surely warmer times await.

Indeed they do. The temperature climbs to 55 degrees, then 60, then 63, then 65, and finally creeps up to 68 degrees.

And there it stops.

68 degrees is warm for an uninsulated third floor in Buffalo in the dead of winter. Damn warm. It is not, however, particularly warm for a sauna. Eventually someone acknowledges the obvious and allows that it might have been a stupid idea after all, and everyone agrees, and they shut off the heater and leave, each no doubt offering silent thanks that they had gotten out of this without any incidents requiring major surgery.

And so we see that the best idea in the world can fail for lack of either proper design or adequate horsepower. The primary cause of the Great Buffalo Sauna Fiasco was a lack of horsepower; the gas heater was flat-out undersized. This is analogous to trying to write programs that incorporate features like bitmapped text and searching of multisegment buffers without using high-performance assembly language. Any PC language can perform just about any function you can think of—eventually. That heater would eventually have heated the room to 110 degrees, too—along about the first of June or so.

The Great Buffalo Sauna Fiasco also suffered from fundamental design flaws. A more powerful heater would indeed have made the room hotter—and might well have burned the house down in the process. Likewise, proper algorithm selection and good design are fundamental to performance. The extra horsepower a superb assembly language implementation gives a program is worth bothering with only in the context of a good design.

Assembly language optimization is a small but crucial corner of the PC programming world. Use it sparingly and only within the framework of a good design—but ignore it and you may find various portions of your anatomy out in the cold.

So, drawing fortitude from the knowledge that our quest is a pure and worthy one, let’s resume our exploration of assembly language instructions with hidden talents and instructions with well-known talents that are less than they appear to be. In the process, we’ll come to see that there is another, very important optimization level between the algorithm/design level and the cycle-counting/individual instruction level. I’ll call this middle level local optimization; it involves focusing on optimizing sequences of instructions rather than individual instructions, all with an eye to implementing designs as efficiently as possible given the capabilities of the x86 family instruction set.

And yes, in case you’re wondering, the above story is indeed true. Was I there? Let me put it this way: If I were, I’d never admit it!

When LOOP Is a Bad Idea

Let’s examine first an instruction that is less than it appears to be: LOOP. There’s no mystery about what LOOP does; it decrements CX and branches if CX doesn’t decrement to zero. It’s so beautifully suited to the task of counting down loops that any experienced x86 programmer instinctively stuffs the loop count in CX and reaches for LOOP when setting up a loop. That’s fine—LOOP does, of course, work as advertised—but there is one problem:

On half of the processors in the x86 family, LOOP is slower than DEC CX followed by JNZ. (Granted, DEC CX/JNZ isn’t precisely equivalent to LOOP, because DEC alters the flags and LOOP doesn’t, but in most situations they’re comparable.)

How can this be? Don’t ask me, ask Intel. On the 8088 and 80286, LOOP is indeed faster than DEC CX/JNZ by a cycle, and LOOP is generally a little faster still because it’s a byte shorter and so can be fetched faster. On the 386, however, things change; LOOP is two cycles slower than DEC/JNZ and the fetch time for one extra byte on even an uncached 386 generally isn’t significant. (Remember that the 386 fetches four instruction bytes at a pop.) LOOP is three cycles slower than DEC/JNZ on the 486, and the 486 executes instructions in so few cycles that those three cycles mean that DEC/JNZ is nearly twice as fast as LOOP. Then, too, unlike LOOP, DEC doesn’t require that CX be used, so the DEC/JNZ solution is both faster and more flexible on the 386 and 486, and on the Pentium as well. (By the way, all this is not just theory; I’ve timed the relative performances of LOOP and DEC CX/JNZ on a cached 386, and LOOP really is slower.)

Things are stranger still for LOOP’s relative JCXZ, which branches if and only if CX is zero. JCXZ is faster than AND CX,CX/JZ on the 8088 and 80286, and equivalent on the 80386—but is about twice as slow on the 486!

By the way, don’t fall victim to the lures of JCXZ and do something like this:

and     cx,0fh          ;Isolate the desired field
jcxz    SkipLoop        ;If field is 0, don't bother

The AND instruction has already set the Zero flag, so this

and     cx,0fh           ;Isolate the desired field
jz      SkipLoop         ;If field is 0, don't bother

will do just fine and is faster on all processors. Use JCXZ only when the Zero flag isn’t already set to reflect the status of CX.

The Lessons of LOOP and JCXZ

What can we learn from LOOP and JCXZ? First, that a single instruction that is intended to do a complex task is not necessarily faster than several instructions that together do the same thing. Second, that the relative merits of instructions and optimization rules vary to a surprisingly large degree across the x86 family.

In particular, if you’re going to write 386 protected mode code, which will run only on the 386, 486, and Pentium, you’d be well advised to rethink your use of the more esoteric members of the x86 instruction set. LOOP, JCXZ, the various accumulator-specific instructions, and even the string instructions in many circumstances no longer offer the advantages they did on the 8088. Sometimes they’re just not any faster than more general instructions, so they’re not worth going out of your way to use; sometimes, as with LOOP, they’re actually slower, and you’d do well to avoid them altogether in the 386/486 world. Reviewing the instruction cycle times in the MASM or TASM manuals, or looking over the cycle times in Intel’s literature, is a good place to start; published cycle times are closer to actual execution times on the 386 and 486 than on the 8088, and are reasonably reliable indicators of the relative performance levels of x86 instructions.

Avoiding LOOPS of Any Stripe

Cycle counting and directly substituting instructions (DEC CX/JNZ for LOOP, for example) are techniques that belong at the lowest level of optimization. It’s an important level, but it’s fairly mechanical; once you’ve learned the capabilities and relative performance levels of the various instructions, you should be able to select the best instructions fairly easily. What’s more, this is a task at which compilers excel. What I’m saying is that you shouldn’t get too caught up in counting cycles because that’s a small (albeit important) part of the optimization picture, and not the area in which your greatest advantage lies.

Local Optimization

One level at which assembly language programming pays off handsomely is that of local optimization; that is, selecting the best sequence of instructions for a task. The key to local optimization is viewing the 80x86 instruction set as a set of building blocks, each with unique characteristics. Your job is to sequence those blocks so that they perform well. It doesn’t matter what the instructions are intended to do or what their names are; all that matters is what they do.

Our discussion of LOOP versus DEC/JNZ is an excellent example of optimization by cycle counting. It’s worth knowing, but once you’ve learned it, you just routinely use DEC/JNZ at the bottom of loops in 386/486-specific code, and that’s that. Besides, you’ll save at most a few cycles each time, and while that helps a little, it’s not going to make all that much difference.

Now let’s step back for a moment, and with no preconceptions consider what the x86 instruction set can do for us. The bulk of the time with both LOOP and DEC/JNZ is taken up by branching, which just happens to be one of the slowest aspects of every processor in the x86 family, and the rest is taken up by decrementing the count register and checking whether it’s zero. There may be ways to perform those tasks a little faster by selecting different instructions, but they can get only so fast, and branching can’t even get all that fast.

The trick, then, is not to find the fastest way to decrement a count and branch conditionally, but rather to figure out how to accomplish the same result without decrementing or branching as often. Remember the Kobayashi Maru problem in Star Trek? The same principle applies here: Redefine the problem to one that offers better solutions.

Consider Listing 7.1, which searches a buffer until either the specified byte is found, a zero byte is found, or the specified number of characters have been checked. Such a function would be useful for scanning up to a maximum number of characters in a zero-terminated buffer. Listing 7.1, which uses LOOP in the main loop, performs a search of the sample string for a period (‘.’) in 170 µs on a 20 MHz cached 386.

When the LOOP in Listing 7.1 is replaced with DEC CX/JNZ, performance improves to 168 µs, less than 2 percent faster than Listing 7.1. Actually, instruction fetching, instruction alignment, cache characteristics, or something similar is affecting these results; I’d expect a slightly larger improvement—around 7 percent—but that’s the most that counting cycles could buy us in this case. (All right, already; LOOPNZ could be used at the bottom of the loop, and other optimizations are surely possible, but all that won’t add up to anywhere near the benefits we’re about to see from local optimization, and that’s the whole point.)

LISTING 7.1 L7-1.ASM

; Program to illustrate searching through a buffer of a specified
; length until either a specified byte or a zero byte is
; encountered.
; A standard loop terminated with LOOP is used.

        .model      small
        .stack      100h
        .data
; Sample string to search through.
SampleString        label byte
        db   'This is a sample string of a long enough length '
        db   'so that raw searching speed can outweigh any '
        db   'extra set-up time that may be required.',0
SAMPLE_STRING_LENGTH  equ  $-SampleString

; User prompt.
Prompt     db      'Enter character to search for:$'

; Result status messages.
ByteFoundMsg         db    0dh,0ah
                     db    'Specified byte found.',0dh,0ah,'$'
ZeroByteFoundMsg db  0dh, 0ah
                     db    'Zero byte encountered.',0dh,0ah,'$'
NoByteFoundMsg       db    0dh,0ah
                     db    'Buffer exhausted with no match.', 0dh, 0ah, '$'

    .code
Start proc near
    mov  ax,@data    ;point to standard data segment
    mov  ds,ax
    mov  dx,offset Prompt
    mov  ah,9               ;DOS print string function
    int  21h                ;prompt the user
    mov  ah,1               ;DOS get key function
    int  21h                ;get the key to search for
    mov  ah,al              ;put character to search for in AH
    mov  cx,SAMPLE_STRING_LENGTH        ;# of bytes to search
    mov  si,offset SampleString         ;point to buffer to search
    call SearchMaxLength                ;search the buffer
    mov  dx,offset ByteFoundMsg         ;assume we found the byte
    jc   PrintStatus                    ;we did find the byte
                                        ;we didn't find the byte, figure out
                                        ;whether we found a zero byte or
                                        ;ran out of buffer
    mov dx,offset NoByteFoundMsg
                                        ;assume we didn't find a zero byte
    jcxz PrintStatus                    ;we didn't find a zero byte
    mov  dx,offset ZeroByteFoundMsg     ;we found a zero byte
PrintStatus:
    mov  ah,9             ;DOS print string function
    int  21h              ;report status
    mov  ah,4ch           ;return to DOS
    int  21h
Start endp

; Function to search a buffer of a specified length until either a
; specified byte or a zero byte is encountered.
; Input:
;    AH = character to search for
;    CX = maximum length to be searched (must be > 0)
;    DS:SI = pointer to buffer to be searched
; Output:
;    CX = 0 if and only if we ran out of bytes without finding
;         either the desired byte or a zero byte
;    DS:SI = pointer to searched-for byte if found, otherwise byte
;         after zero byte if found, otherwise byte after last
;         byte checked if neither searched-for byte nor zero
;         byte is found
;    Carry Flag = set if searched-for byte found, reset otherwise

SearchMaxLength proc near
      cld
SearchMaxLengthLoop:
      lodsb                        ;get the next byte
      cmp   al,ah                  ;is this the byte we want?
      jz    ByteFound              ;yes, we're done with success
      and   al,al                  ;is this the terminating 0 byte?
      jz    ByteNotFound           ;yes, we're done with failure
      loop  SearchMaxLengthLoop    ;it's neither, so check the next
                                   ;byte, if any
ByteNotFound:
      clc                          ;return "not found" status
      ret
ByteFound:
      dec   si                     ;point back to the location at which
                                   ;we found the searched-for byte
      stc                          ;return "found" status
      ret
SearchMaxLength endp
      end   Start

Unrolling Loops

Listing 7.2 takes a different tack, unrolling the loop so that four bytes are checked for each LOOP performed. The same instructions are used inside the loop in each listing, but Listing 7.2 is arranged so that three-quarters of the LOOPs are eliminated. Listings 7.1 and 7.2 perform exactly the same task, and they use the same instructions in the loop—the searching algorithm hasn’t changed in any way—but we have sequenced the instructions differently in Listing 7.2, and that makes all the difference.

LISTING 7.2 L7-2.ASM

; Program to illustrate searching through a buffer of a specified
; length until either a specified byte or a zero byte is
; encountered.
; A loop unrolled four times and terminated with LOOP is used.

       .model      small
       .stack      100h
       .data
; Sample string to search through.
SampleString label byte
        db     'This is a sample string of a long enough length '
        db     'so that raw searching speed can outweigh any '
        db     'extra set-up time that may be required.',0
SAMPLE_STRING_LENGTH  equ  $-SampleString

; User prompt.
Prompt  db        'Enter character to search for:$'

; Result status messages.
ByteFoundMsg          db      0dh,0ah
                      db      'Specified byte found.',0dh,0ah,'$'
ZeroByteFoundMsg db   0dh,0ah
                      db      'Zero byte encountered.', 0dh, 0ah, '$'
NoByteFoundMsg        db      0dh,0ah
                      db      'Buffer exhausted with no match.', 0dh, 0ah, '$'

; Table of initial, possibly partial loop entry points for
; SearchMaxLength.
SearchMaxLengthEntryTable    label word
     dw     SearchMaxLengthEntry4
     dw     SearchMaxLengthEntry1
     dw     SearchMaxLengthEntry2
     dw     SearchMaxLengthEntry3

     .code
Start proc  near
      mov   ax,@data     ;point to standard data segment
      mov   ds,ax
      mov   dx,offset Prompt
      mov   ah,9             ;DOS print string function
      int   21h              ;prompt the user
      mov   ah,1             ;DOS get key function
      int   21h              ;get the key to search for
      mov   ah,al            ;put character to search for in AH
      mov   cx,SAMPLE_STRING_LENGTH       ;# of bytes to search
      mov   si,offset SampleString        ;point to buffer to search
      call  SearchMaxLength               ;search the buffer
      mov   dx,offset ByteFoundMsg        ;assume we found the byte
      jc    PrintStatus      ;we did find the byte
                             ;we didn't find the byte, figure out
                             ;whether we found a zero byte or
                             ;ran out of buffer
      mov   dx,offset NoByteFoundMsg
                             ;assume we didn't find a zero byte
      jcxz  PrintStatus      ;we didn't find a zero byte
      mov   dx,offset ZeroByteFoundMsg  ;we found a zero byte
PrintStatus:
      mov   ah,9             ;DOS print string function
      int   21h              ;report status

      mov   ah,4ch           ;return to DOS
      int   21h
Start endp

; Function to search a buffer of a specified length until either a
; specified byte or a zero byte is encountered.
; Input:
;    AH = character to search for
;    CX = maximum length to be searched (must be > 0)
;    DS:SI = pointer to buffer to be searched
; Output:
;    CX = 0 if and only if we ran out of bytes without finding
;          either the desired byte or a zero byte
;    DS:SI = pointer to searched-for byte if found, otherwise byte
;          after zero byte if found, otherwise byte after last
;          byte checked if neither searched-for byte nor zero
;          byte is found
;    Carry Flag = set if searched-for byte found, reset otherwise

SearchMaxLength proc near
     cld
     mov   bx,cx
     add   cx,3               ;calculate the maximum # of passes
     shr   cx,1               ;through the loop, which is
     shr   cx,1               ;unrolled 4 times
     and   bx,3               ;calculate the index into the entry
                              ;point table for the first,
                              ;possibly partial loop
     shl   bx,1               ;prepare for a word-sized look-up
     jmp   SearchMaxLengthEntryTable[bx]
                                  ;branch into the unrolled loop to do
                                  ;the first, possibly partial loop
SearchMaxLengthLoop:
SearchMaxLengthEntry4:
     lodsb                    ;get the next byte
     cmp   al,ah              ;is this the byte we want?
     jz    ByteFound          ;yes, we're done with success
     and   al,al              ;is this the terminating 0 byte?
     jz    ByteNotFound       ;yes, we're done with failure
SearchMaxLengthEntry3:
     lodsb                    ;get the next byte
     cmp   al,ah              ;is this the byte we want?
     jz    ByteFound          ;yes, we're done with success
     and   al,al              ;is this the terminating 0 byte?
     jz    ByteNotFound       ;yes, we're done with failure
SearchMaxLengthEntry2:
     lodsb                    ;get the next byte
     cmp   al,ah              ;is this the byte we want?
     jz    ByteFound          ;yes, we're done with success
     and   al,al              ;is this the terminating 0 byte?
     jz    ByteNotFound       ;yes, we're done with failure
SearchMaxLengthEntry1:
     lodsb                          ;get the next byte
     cmp    al,ah                   ;is this the byte we want?
     jz     ByteFound               ;yes, we're done with success
     and    al,al                   ;is this the terminating 0 byte?
     jz     ByteNotFound            ;yes, we're done with failure
     loop   SearchMaxLengthLoop     ;it's neither, so check the next
                                    ; four bytes, if any
ByteNotFound:
     clc                      ;return "not found" status
     ret
ByteFound:
     dec    si                ;point back to the location at which
                              ; we found the searched-for byte
     stc                      ;return "found" status
     ret
SearchMaxLength endp
     end    Start

How much difference? Listing 7.2 runs in 121 µs—40 percent faster than Listing 7.1, even though Listing 7.2 still uses LOOP rather than DEC CX/JNZ. (The loop in Listing 7.2 could be unrolled further, too; it’s just a question of how much more memory you want to trade for ever-decreasing performance benefits.) That’s typical of local optimization; it won’t often yield the order-of-magnitude improvements that algorithmic improvements can produce, but it can get you a critical 50 percent or 100 percent improvement when you’ve exhausted all other avenues.

The point is simply this: You can gain far more by stepping back a bit and thinking of the fastest overall way for the CPU to perform a task than you can by saving a cycle here or there using different instructions. Try to think at the level of sequences of instructions rather than individual instructions, and learn to treat x86 instructions as building blocks with unique characteristics rather than as instructions dedicated to specific tasks.

Rotating and Shifting with Tables

As another example of local optimization, consider the matter of rotating or shifting a mask into position. First, let’s look at the simple task of setting bit N of AX to 1.

The obvious way to do this is to place N in CL, shift the bit into position, and OR it with AX, as follows:

MOV  BX,1
SHL  BX,CL
OR   AX,BX

This solution is obvious because it takes good advantage of the special ability of the x86 family to shift or rotate by the variable number of bits specified by CL. However, it takes an average of about 45 cycles on an 8088. It’s actually far faster to precalculate the results, pass the bit number in BX, and look the shifted bit up, as shown in Listing 7.3.

LISTING 7.3 L7-3.ASM

     SHL  BX,1                ;prepare for word sized look up
     OR   AX,ShiftTable[BX]   ;look up the bit and OR it in
          :
ShiftTable     LABEL     WORD
BIT_PATTERN=0001H
     REPT 16
     DW   BIT_PATTERN
BIT_PATTERN=BIT_PATTERN SHL 1
     ENDM

Even though it accesses memory, this approach takes only 20 cycles—more than twice as fast as the variable shift. Once again, we were able to improve performance considerably—not by knowing the fastest instructions, but by selecting the fastest sequence of instructions.

In the particular example above, we once again run into the difficulty of optimizing across the x86 family. The table lookup is faster on the 8088 and 286, but it’s slightly slower on the 386 and no faster on the 486. However, 386/486-specific code could use enhanced addressing to accomplish the whole job in just one instruction, along the lines of the code snippet in Listing 7.4.

LISTING 7.4 L7-4.ASM

     OR   EAX,ShiftTable[EBX*4]    ;look up the bit and OR it in
          :
ShiftTable     LABEL     DWORD
BIT_PATTERN=0001H
     REPT 32
     DD   BIT_PATTERN
BIT_PATTERN=BIT_PATTERN SHL 1
     ENDM

Besides illustrating the advantages of local optimization, this example also shows that it generally pays to precalculate results; this is often done at or before assembly time, but precalculated tables can also be built at run time. This is merely one aspect of a fundamental optimization rule: Move as much work as possible out of your critical code by whatever means necessary.
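
For instance, here's a minimal C sketch (mine, not from the listings) of building the same sort of table at run time rather than at assembly time:

/* Build a 16-entry table in which entry N is a word with only bit N
   set; equivalent to the table built by the REPT macro in Listing 7.3. */
unsigned int ShiftTable[16];

void BuildShiftTable(void)
{
   int i;

   for ( i = 0; i < 16; i++ ) {
      ShiftTable[i] = 1U << i;
   }
}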

NOT Flips Bits—Not Flags

The NOT instruction flips all the bits in the operand, from 0 to 1 or from 1 to 0. That’s as simple as could be, but NOT nonetheless has a minor but interesting talent: It doesn’t affect the flags. That can be irritating; I once spent a good hour tracking down a bug caused by my unconscious assumption that NOT does set the flags. After all, every other arithmetic and logical instruction sets the flags; why not NOT? Probably because NOT isn’t considered to be an arithmetic or logical instruction at all; rather, it’s a data manipulation instruction, like MOV and the various rotates. (These are RCR, RCL, ROR, and ROL, which affect only the Carry and Overflow flags.) NOT is often used for tasks, such as flipping masks, where there’s no reason to test the state of the result, and in that context it can be handy to keep the flags unmodified for later testing.

Besides, if you want to NOT an operand and set the flags in the process, you can just XOR it with -1. Put another way, the only functional difference between NOT AX and XOR AX,0FFFFH is that XOR modifies the flags and NOT doesn’t.

The x86 instruction set offers many ways to accomplish almost any task. Understanding the subtle distinctions between the instructions—whether and which flags are set, for example—can be critical when you’re trying to optimize a code sequence and you’re running out of registers, or when you’re trying to minimize branching.

Incrementing with and without Carry

Another case in which there are two slightly different ways to perform a task involves adding 1 to an operand. You can do this with INC, as in INC AX, or you can do it with ADD, as in ADD AX,1. What’s the difference? The obvious difference is that INC is usually a byte or two shorter (the exception being ADD AL,1, which at two bytes is the same length as INC AL), and is faster on some processors. Less obvious, but no less important, is that ADD sets the Carry flag while INC leaves the Carry flag untouched.

Why is that important? Because it allows INC to function as a data pointer manipulation instruction for multi-word arithmetic. You can use INC to advance the pointers in code like that shown in Listing 7.5 without having to do any work to preserve the Carry status from one addition to the next.

LISTING 7.5 L7-5.ASM

        CLC                  ;clear the Carry for the initial addition
LOOP_TOP:
        MOV    AX,[SI]       ;get next source operand word
        ADC    [DI],AX       ;add with Carry to dest operand word
        INC    SI            ;point to next source operand word
        INC    SI
        INC    DI            ;point to next dest operand word
        INC    DI
        LOOP   LOOP_TOP

If ADD were used, the Carry flag would have to be saved between additions, with code along the lines shown in Listing 7.6.

LISTING 7.6 L7-6.ASM

     CLC            ;clear the carry for the initial addition
LOOP_TOP:
     MOV  AX,[SI]   ;get next source operand word
     ADC  [DI],AX   ;add with carry to dest operand word
     LAHF           ;set aside the carry flag
     ADD  SI,2      ;point to next source operand word
     ADD  DI,2      ;point to next dest operand word
     SAHF           ;restore the carry flag
     LOOP LOOP_TOP

It’s not that the Listing 7.6 approach is necessarily better or worse than that of Listing 7.5; that depends on the processor and the situation. The two approaches are simply different, and if you understand the differences, you’ll be able to choose the better one for whatever code you happen to write. (DEC has the same property of preserving the Carry flag, by the way.)

There are a couple of interesting aspects to the last example. First, note that LOOP doesn’t affect any flags at all; this allows the Carry flag to remain unchanged from one addition to the next. Not altering the arithmetic flags is a common characteristic of program control instructions (as opposed to arithmetic and logical instructions like SUB and AND, which do alter the flags).

The rule is not that the arithmetic flags change whenever the CPU performs a calculation; rather, the flags change whenever you execute an arithmetic, logical, or flag control (such as CLC to clear the Carry flag) instruction.

Not only do LOOP and JCXZ not alter the flags, but REP MOVS, which counts down CX to 0, doesn’t affect the flags either.

The other interesting point about the last example is the use of LAHF and SAHF, which transfer the low byte of the FLAGS register to and from AH, respectively. These instructions were created to help provide compatibility with the 8080’s (that’s 8080, not 8088) PUSH PSW and POP PSW instructions, but turn out to be compact (one byte) instructions for saving and restoring the arithmetic flags. A word of caution, however: SAHF restores the Carry, Zero, Sign, Auxiliary Carry, and Parity flags—but not the Overflow flag, which resides in the high byte of the FLAGS register. Also, be aware that LAHF and SAHF provide a fast way to preserve the flags on an 8088 but are relatively slow instructions on the 486 and Pentium.
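
If the Overflow flag did need to survive, PUSHF and POPF could be used instead of LAHF and SAHF; here's Listing 7.6 reworked along those lines, purely as a sketch (PUSHF and POPF cost a stack operation apiece on every pass):

     CLC            ;clear the carry for the initial addition
LOOP_TOP:
     MOV  AX,[SI]   ;get next source operand word
     ADC  [DI],AX   ;add with carry to dest operand word
     PUSHF          ;set aside all the flags, Overflow included
     ADD  SI,2      ;point to next source operand word
     ADD  DI,2      ;point to next dest operand word
     POPF           ;restore all the flags
     LOOP LOOP_TOP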

There are times when it’s a clear liability that INC doesn’t set the Carry flag. For instance

INC   AX
ADC   DX,0

does not increment the 32-bit value in DX:AX when AX wraps around from 0FFFFH to 0, because INC never sets the Carry flag for the ADC to pick up. To do that reliably, you'd need the following:

ADD   AX,1
ADC   DX,0

As always, pay attention!

Chapter 8 – Speeding Up C with Assembly Language

Jumping Languages When You Know It’ll Help

When I was a senior in high school, a pop song called “Seasons in the Sun,” sung by one Terry Jacks, soared up the pop charts and spent, as best I can recall, two straight weeks atop Kasey Kasem’s American Top 40. “Seasons in the Sun” wasn’t a particularly good song, primarily because the lyrics were silly. I’ve never understood why the song was a hit, but, as so often happens with undistinguished but popular music by forgotten one- or two-shot groups (“Don’t Pull Your Love Out on Me Baby,” “Billy Don’t Be a Hero,” et al.), I heard it everywhere for a month or so, then gave it not another thought for 15 years.

Recently, though, I came across a review of a Rhino Records collection of obscure 1970s pop hits. Knowing that Jeff Duntemann is an aficionado of such esoterica (who do you know who owns an album by The Peppermint Trolley Company?), I sent the review to him. He was amused by it and, as we kicked the names of old songs around, “Seasons in the Sun” came up. I expressed my wonderment that a song that really wasn’t very good was such a big hit.

“Well,” said Jeff, “I think it suffered in the translation from the French.”

Ah-ha! Mystery solved. Apparently everyone but me knew that it was translated from French, and that novelty undoubtedly made the song a big hit. The translation was also surely responsible for the sappy lyrics; dollars to donuts that the original French lyrics were stronger.

Which brings us without missing a beat to this chapter’s theme, speeding up C with assembly language. When you seek to speed up a C program by converting selected parts of it (generally no more than a few functions) to assembly language, make sure you end up with high-performance assembly language code, not fine-tuned C code. Compilers like Microsoft C/C++ and Watcom C are by now pretty good at fine-tuning C code, and you’re not likely to do much better by taking the compiler’s assembly language output and tweaking it.

To make the process of translating C code to assembly language worth the trouble, you must ignore what the compiler does and design your assembly language code from a pure assembly language perspective. With a merely adequate translation, you risk laboring mightily for little or no reward.

Apropos of which, when was the last time you heard of Terry Jacks?

Billy, Don’t Be a Compiler

The key to optimizing C programs with assembly language is, as always, writing good assembly language code, but with an added twist. Rule 1 when converting C code to assembly is this: Don’t think like a compiler. That’s more easily said than done, especially when the C code you’re converting is readily available as a model and the assembly code that the compiler generates is available as well. Nevertheless, the principle of not thinking like a compiler is essential, and is, in one form or another, the basis for all that I’ll discuss below.

Before I discuss Rule 1 further, let me mention rule number 0: Only optimize where it matters. The bulk of execution time in any program is spent in a very small portion of the code, and most code beyond that small portion doesn’t have any perceptible impact on performance. Unless you’re supremely concerned with code size (an area in which assembly-only programs can excel), I’d suggest that you write most of your code in C and reserve assembly for the truly critical sections of your code; that’s the formula that I find gives the most bang for the buck.

This is not to say that complete programs shouldn’t be designed with optimized assembly language in mind. As you’ll see shortly, orienting your data structures towards assembly language can be a salubrious endeavor indeed, even if most of your code is in C. When it comes to actually optimizing code and/or converting it to assembly, though, do it only where it matters. Get a profiler—and use it!

Also make it a point to concentrate on refining your program design and algorithmic approach at the conceptual and/or C levels before doing any assembly language optimization.

Assembly language optimization is the final and far from the only step in the optimization chain, and as such should be performed last; converting to assembly too soon can lock in your code before the design is optimal. At the very least, conversion to assembly tends to make future changes and debugging more difficult, slowing you down and limiting your options.

Don’t Call Your Functions on Me, Baby

In order to think differently from a compiler, you must understand both what compilers and C programmers tend to do and how that differs from what assembly language does well. In this pursuit, it can be useful to examine the code your compiler generates, either by viewing the code in a debugger or by having the compiler generate an assembly language output file. (The latter is done with /Fa or /Fc in Microsoft C/C++ and -S in Borland C++.)
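
For example, command lines along these lines produce assembly listings you can study (FINDAVG.C is just a placeholder file name):

CL /c /Fa FINDAVG.C      (Microsoft C/C++)
BCC -c -S FINDAVG.C      (Borland C++)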

C programmers tend to modularize their code with lots of function calls. That’s good for readable, reliable, reusable code, and it allows the compiler to optimize better because it can deal with fewer variables and statements in each optimization arena—but it’s not so good when viewed from the assembly language level. Calls and returns are slow, especially in the large code model, and the pushes required to put parameters on the stack are expensive as well.

What this means is that when you want to speed up a portion of a C program, you should identify the entire critical portion and move all of that critical portion into an assembly language function. You don’t want to move a part of the inner loop into assembly language and then call it from C every time through the loop; the function call and return overhead would be unacceptable. Carve out the critical code en masse and move it into assembly, and try to avoid calls and returns even in your assembly code. True, in assembly you can pass parameters in registers, but the calls and returns themselves are still slow; if the extra cycles they take don’t affect performance, then the code they’re in probably isn’t critical, and perhaps you’ve chosen to convert too much code to assembly, eh?

Stack Frames Slow So Much

C compilers work within the stack frame model, whereby variables reside in a block of stack memory and are accessed via offsets from BP. Compilers may store a couple of variables in registers and may briefly keep other variables in registers when they’re used repeatedly, but the stack frame is the underlying architecture. It’s a nice architecture; it’s flexible, convenient, easy to program, and makes for fairly compact code. However, stack frames have a few drawbacks. They must be constructed and destroyed, which takes both time and code. They are so easy to use that they tend to bias the assembly language programmer in favor of accessing memory variables more often than might be necessary. Finally, you cannot use BP as a general-purpose register if you intend to access a stack frame, and having that seventh register available is sometimes useful indeed.

That doesn’t mean you shouldn’t use stack frames, which are useful and often necessary. Just don’t fall victim to their undeniable charms.

Torn Between Two Segments

C compilers are not terrific at handling segments. Some compilers can efficiently handle a single far pointer used in a loop by leaving ES set for the duration of the loop. But two far pointers used in the same loop confuse every compiler I’ve seen, causing the full segment:offset address to be reloaded each time either pointer is used.

This particularly affects performance in 286 protected mode (under OS/2 1.X or the Rational DOS Extender, for example) because segment loads in protected mode take a minimum of 17 cycles, versus a mere 2 cycles in real mode.

In assembly language you have full control over segments. Use it, and, if necessary, reorganize your code to minimize segment loading.
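
As a simple sketch of what that control buys you, here's how a loop that copies through two far pointers might load each segment register just once, outside the loop; SourceFarPtr, DestFarPtr, and BUFFER_LENGTH are hypothetical names:

        LDS   SI,[SourceFarPtr]  ;DS:SI -> source, loaded once
        LES   DI,[DestFarPtr]    ;ES:DI -> destination, loaded once
        MOV   CX,BUFFER_LENGTH   ;number of bytes to copy
        CLD
CopyLoop:
        LODSB                    ;read a byte via DS:SI
        STOSB                    ;write it via ES:DI
        LOOP  CopyLoop

(In this particular case REP MOVSB would do the whole job, of course; the point is that the segment loads happen once per loop, not once per pointer use.)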

Why Speeding Up Is Hard to Do

You might think that the most obvious advantage assembly language has over C is that it allows the use of all forms of instructions and all registers in all ways, whereas C compilers tend to use a subset of registers and instructions in a limited number of ways. Yes and no. It’s true that C compilers typically don’t generate instructions such as XLAT, rotates, or the string instructions. On the other hand, XLAT and rotates are useful in a limited set of circumstances, and string instructions are used in the C library functions. In fact, C library code is likely to be carefully optimized by experts, and may be much better than equivalent code you’d produce yourself.

Am I saying that C compilers produce better code than you do? No, I’m saying that they can, unless you use assembly language properly. Writing code in assembly language rather than C guarantees nothing.

You can write good assembly, bad assembly, or assembly that is virtually indistinguishable from compiled code; you are more likely than not to write the latter if you think that optimization consists of tweaking compiled C code.

Sure, you can probably use the registers more efficiently and take advantage of an instruction or two that the compiler missed, but the code isn’t going to get a whole lot faster that way.

True optimization requires rethinking your code to take advantage of assembly language. A C loop that searches through an integer array for matches might compile to something like Figure 8.1A. You might look at that and tweak it to the code shown in Figure 8.1B.

Figure 8.1  Tweaked compiler output for a loop.

Congratulations! You’ve successfully eliminated all stack frame access, you’ve used LOOP (although DEC SI/JNZ is actually faster on 386 and later machines, as I explained in the last chapter), and you’ve used a string instruction. Unfortunately, the new code isn’t going to run very much faster. Maybe 25 percent faster, maybe a little more. Big deal. You’ve eliminated the trappings of the compiler—the stack frame and the restricted register usage—but you’re still thinking like the compiler. Try this:

repnz scasw
jz    Match

It’s a simple example—but, I hope, a convincing one. Stretch your brain when you optimize.

Taking It to the Limit

The ultimate in assembly language optimization comes when you change the rules; that is, when you reorganize the entire program to allow the use of better assembly language code in the small section of code that most affects overall performance. For example, consider that the data searched in the last example is stored in an array of structures, with each structure in the array containing other information as well. In this situation, REP SCASW couldn’t be used because the data searched through wouldn’t be contiguous.

However, if the need for performance in searching the array is urgent enough, there’s no reason why you can’t reorganize the data. This might mean removing the array elements from the structures and storing them in their own array so that REP SCASW could be used.

Organizing a program’s data so that the performance of the critical sections can be optimized is a key part of design, and one that’s easily shortchanged unless, during the design stage, you thoroughly understand and work to bring together your data needs, the critical sections of your program, and potential assembly language optimizations.

More on this shortly.

To recap, here are some things to look for when striving to convert C code into optimized assembly language:

  • Move the entire performance-critical section into a single assembly language function.
  • Don’t use calls or stack frame accesses inside the critical code, if possible, and avoid unnecessary memory accesses of any kind.
  • Change segments as infrequently as possible.
  • Optimize in terms of what assembly does well, not in terms of fine-tuning compiled C code.
  • Change the rules to the benefit of assembly, if necessary; for example, reorganize data structures to allow efficient assembly language processing.

That said, let me show some of these precepts in action.

A C-to-Assembly Case Study

Listing 8.1 is the sample C application I’m going to use to examine optimization in action. Listing 8.1 isn’t really complete—it doesn’t handle the “no-matches” case well, and it assumes that the sum of all matches will fit into an int—but it will do just fine as an optimization example.

LISTING 8.1 L8-1.C

/* Program to search an array spanning a linked list of variable-
   sized blocks, for all entries with a specified ID number,
   and return the average of the values of all such entries. Each of
   the variable-sized blocks may contain any number of data entries,
   stored as an array of structures within the block. */

#include <stdio.h>
#ifdef __TURBOC__
#include <alloc.h>
#else
#include <malloc.h>
#endif

void main(void);
void exit(int);
unsigned int FindIDAverage(unsigned int, struct BlockHeader *);
/* Structure that starts each variable-sized block */
struct BlockHeader {
   struct BlockHeader *NextBlock; /* Pointer to next block, or NULL
                                     if this is the last block in the
                                     linked list */
   unsigned int BlockCount;       /* The number of DataElement entries
                                     in this variable-sized block */
};

/* Structure that contains one element of the array we'll search */
struct DataElement {
   unsigned int ID;     /* ID # for array entry */
   unsigned int Value;  /* Value of array entry */
};

void main(void) {
   int i,j;
   unsigned int IDToFind;
   struct BlockHeader *BaseArrayBlockPointer,*WorkingBlockPointer;
   struct DataElement *WorkingDataPointer;
   struct BlockHeader **LastBlockPointer;

   printf("ID # for which to find average: ");
   scanf("%d",&IDToFind);
   /* Build an array across 5 blocks, for testing */
   /* Anchor the linked list to BaseArrayBlockPointer */
   LastBlockPointer = &BaseArrayBlockPointer;
   /* Create 5 blocks of varying sizes */
   for (i = 1; i < 6; i++) {
      /* Try to get memory for the next block */
      if ((WorkingBlockPointer =
          (struct BlockHeader *) malloc(sizeof(struct BlockHeader) +
           sizeof(struct DataElement) * i * 10)) == NULL) {
         exit(1);
      }
      /* Set the # of data elements in this block */
      WorkingBlockPointer->BlockCount = i * 10;
      /* Link the new block into the chain */
      *LastBlockPointer = WorkingBlockPointer;
      /* Point to the first data field */
      WorkingDataPointer =
            (struct DataElement *) ((char *)WorkingBlockPointer +
            sizeof(struct BlockHeader));
      /* Fill the data fields with ID numbers and values */
      for (j = 0; j < (i * 10); j++, WorkingDataPointer++) {
         WorkingDataPointer->ID = j;
         WorkingDataPointer->Value = i * 1000 + j;
      }
      /* Remember where to set link from this block to the next */
      LastBlockPointer = &WorkingBlockPointer->NextBlock;
   }
   /* Set the last block's "next block" pointer to NULL to indicate
      that there are no more blocks */
   WorkingBlockPointer->NextBlock = NULL;
   printf("Average of all elements with ID %d: %u\n",
         IDToFind, FindIDAverage(IDToFind, BaseArrayBlockPointer));
   exit(0);
}

/* Searches through the array of DataElement entries spanning the
   linked list of variable-sized blocks, starting with the block
   pointed to by BlockPointer, for all entries with IDs matching
   SearchedForID, and returns the average value of those entries. If
   no matches are found, zero is returned */

unsigned int FindIDAverage(unsigned int SearchedForID,
      struct BlockHeader *BlockPointer)
{
   struct DataElement *DataPointer;
   unsigned int IDMatchSum;
   unsigned int IDMatchCount;
   unsigned int WorkingBlockCount;

   IDMatchCount = IDMatchSum = 0;
   /* Search through all the linked blocks until the last block
      (marked with a NULL pointer to the next block) has been
      searched */
   do {
      /* Point to the first DataElement entry within this block */
      DataPointer =
            (struct DataElement *) ((char *)BlockPointer +
            sizeof(struct BlockHeader));
      /* Search all the DataElement entries within this block
         and accumulate data from all that match the desired ID */
      for (WorkingBlockCount=0;
            WorkingBlockCount<BlockPointer->BlockCount;
            WorkingBlockCount++, DataPointer++) {
         /* If the ID matches, add in the value and increment the
            match counter */
         if (DataPointer->ID == SearchedForID) {
            IDMatchCount++;
            IDMatchSum += DataPointer->Value;
         }
      }
   /* Point to the next block, and continue as long as that pointer
       isn't NULL */
   }  while ((BlockPointer = BlockPointer->NextBlock) != NULL);
   /* Calculate the average of all matches */
   if (IDMatchCount == 0)
      return(0);            /* Avoid division by 0 */
   else
      return(IDMatchSum / IDMatchCount);
}

The main body of Listing 8.1 constructs a linked list of memory blocks of various sizes and stores an array of structures across those blocks, as shown in Figure 8.2. The function FindIDAverage in Listing 8.1 searches through that array for all matches to a specified ID number and returns the average value of all such matches. FindIDAverage contains two nested loops, the outer one repeating once for each linked block and the inner one repeating once for each array element in each block. The inner loop—the critical one—is compact, containing only four statements, and should lend itself rather well to compiler optimization.

Figure 8.2  Linked array storage format (version 1).

As it happens, Microsoft C/C++ does optimize the inner loop of FindIDAverage nicely. Listing 8.2 shows the code Microsoft C/C++ generates for the inner loop, consisting of a mere seven assembly language instructions inside the loop. The compiler is smart enough to convert the loop index variable, which counts up but is used for nothing but counting loops, into a count-down variable so that the LOOP instruction can be used.

LISTING 8.2 L8-2.COD

; Code generated by Microsoft C for inner loop of FindIDAverage.
;|*** for (WorkingBlockCount=0;
;|***       WorkingBlockCount<BlockPointer->BlockCount;
;|***       WorkingBlockCount++, DataPointer++) {
          mov     WORD PTR [bp-6],0         ;WorkingBlockCount
          mov     bx,WORD PTR [bp+6]        ;BlockPointer
          cmp     WORD PTR [bx+2],0
          je      $FB264
          mov     cx,WORD PTR [bx+2]
          add     WORD PTR [bp-6],cx        ;WorkingBlockCount
          mov     di,WORD PTR [bp-2]        ;IDMatchSum
          mov     dx,WORD PTR [bp-4]        ;IDMatchCount
$L20004:
;|*** if (DataPointer->ID == SearchedForID) {
          mov     ax,WORD PTR [si]
          cmp     WORD PTR [bp+4],ax        ;SearchedForID
          jne     $I265
;|***             IDMatchCount++;
          inc     dx
;|***            IDMatchSum += DataPointer->Value;
          add     di,WORD PTR [si+2]
;|***          }
;|***       }
$I265:
          add     si,4
          loop    $L20004
          mov     WORD PTR [bp-2],di        ;IDMatchSum
          mov     WORD PTR [bp-4],dx        ;IDMatchCount
$FB264:

It’s hard to squeeze much more performance from this code by tweaking it, as exemplified by Listing 8.3, a fine-tuned assembly version of FindIDAverage that was produced by looking at the assembly output of MS C/C++ and tightening it. Listing 8.3 eliminates all stack frame access in the inner loop, but that’s about all the tightening there is to do. The result, as shown in Table 8.1, is that Listing 8.3 runs a modest 11 percent faster than Listing 8.1 on a 386. The results could vary considerably, depending on the nature of the data set searched through (average block size and frequency of matches). But, then, understanding the typical and worst case conditions is part of optimization, isn’t it?

Table 8.1 Execution Times of FindIDAverage

                                                        On 20 MHz 386     On 10 MHz 286
Listing 8.1 (MSC with maximum optimization)             294 microseconds  768 microseconds
Listing 8.3 (Assembly)                                  265               644
Listing 8.4 (Optimized assembly)                        212               486
Listing 8.6 (Optimized assembly with reorganized data)  100               207

LISTING 8.3 L8-3.ASM

; Typically optimized assembly language version of FindIDAverage.
SearchedForID   equ     4      ;Passed parameter offsets in the
BlockPointer    equ     6      ; stack frame (skip over pushed BP
                               ; and the return address)
NextBlock       equ     0      ;Field offsets in struct BlockHeader
BlockCount      equ     2
BLOCK_HEADER_SIZE equ   4      ;Number of bytes in struct BlockHeader
ID              equ     0      ;struct DataElement field offsets
Value           equ     2
DATA_ELEMENT_SIZE equ   4      ;Number of bytes in struct DataElement
        .model  small
        .code
        public  _FindIDAverage
_FindIDAverage  proc    near
        push    bp              ;Save caller's stack frame
        mov     bp,sp           ;Point to our stack frame
        push    di              ;Preserve C register variables
        push    si
        sub     dx,dx           ;IDMatchSum = 0
        mov     bx,dx           ;IDMatchCount = 0
        mov     si,[bp+BlockPointer]    ;Pointer to first block
        mov     ax,[bp+SearchedForID]   ;ID we're looking for
; Search through all the linked blocks until the last block
; (marked with a NULL pointer to the next block) has been searched.
BlockLoop:
; Point to the first DataElement entry within this block.
        lea     di,[si+BLOCK_HEADER_SIZE]
; Search through all the DataElement entries within this block
; and accumulate data from all that match the desired ID.
        mov     cx,[si+BlockCount]
        jcxz    DoNextBlock     ;No data in this block
IntraBlockLoop:
        cmp     [di+ID],ax      ;Do we have an ID match?
        jnz     NoMatch         ;No match
        inc     bx              ;We have a match; IDMatchCount++;
        add     dx,[di+Value]   ;IDMatchSum += DataPointer->Value;
NoMatch:
        add     di,DATA_ELEMENT_SIZE ;point to the next element
        loop    IntraBlockLoop
; Point to the next block and continue if that pointer isn't NULL.
DoNextBlock:
        mov     si,[si+NextBlock] ;Get pointer to the next block
        and     si,si           ;Is it a NULL pointer?
        jnz     BlockLoop       ;No, continue
; Calculate the average of all matches.
        sub     ax,ax           ;Assume we found no matches
        and     bx,bx
        jz      Done            ;We didn't find any matches, return 0
        xchg    ax,dx           ;Prepare for division
        div     bx              ;Return IDMatchSum / IDMatchCount
Done:   pop     si              ;Restore C register variables
        pop     di
        pop     bp              ;Restore caller's stack frame
        ret
_FindIDAverage  ENDP
        end

Listing 8.4 tosses some sophisticated optimization techniques into the mix. The loop is unrolled eight times, eliminating a good deal of branching, and SCASW is used instead of CMP [DI],AX. (Note, however, that SCASW is in fact slower than CMP [DI],AX on the 386 and 486, and is sometimes faster on the 286 and 8088 only because it’s shorter and therefore may prefetch faster.) This advanced tweaking produces a 39 percent improvement over the original C code—substantial, but not a tremendous return for the optimization effort invested.

LISTING 8.4 L8-4.ASM

; Heavily optimized assembly language version of FindIDAverage.
; Features an unrolled loop and more efficient pointer use.
SearchedForID   equ     4       ;Passed parameter offsets in the
BlockPointer    equ     6       ; stack frame (skip over pushed BP
                                ; and the return address)
NextBlock       equ     0       ;Field offsets in struct BlockHeader
BlockCount      equ     2
BLOCK_HEADER_SIZE equ   4       ;Number of bytes in struct BlockHeader
ID              equ     0       ;struct DataElement field offsets
Value           equ     2
DATA_ELEMENT_SIZE equ   4       ;Number of bytes in struct DataElement
        .model  small
        .code
        public  _FindIDAverage
_FindIDAverage  proc    near
        push    bp              ;Save caller's stack frame
        mov     bp,sp           ;Point to our stack frame
        push    di              ;Preserve C register variables
        push    si
        mov     di,ds           ;Prepare for SCASW
        mov     es,di
        cld
        sub     dx,dx           ;IDMatchSum = 0
        mov     bx,dx           ;IDMatchCount = 0
        mov     si,[bp+BlockPointer]    ;Pointer to first block
        mov     ax,[bp+SearchedForID]   ;ID we're looking for
; Search through all of the linked blocks until the last block
; (marked with a NULL pointer to the next block) has been searched.
BlockLoop:
; Point to the first DataElement entry within this block.
        lea     di,[si+BLOCK_HEADER_SIZE]
; Search through all the DataElement entries within this block
; and accumulate data from all that match the desired ID.
        mov     cx,[si+BlockCount] ;Number of elements in this block
        jcxz    DoNextBlock     ;Skip this block if it's empty
        mov     bp,cx           ;***stack frame no longer available***
        add     cx,7
        shr     cx,1            ;Number of repetitions of the unrolled
        shr     cx,1            ; loop = (BlockCount + 7) / 8
        shr     cx,1
        and     bp,7            ;Generate the entry point for the
        shl     bp,1            ; first, possibly partial pass through
        jmp     cs:[LoopEntryTable+bp] ; the unrolled loop and
                                ; vector to that entry point
        align   2
LoopEntryTable  label   word
        dw      LoopEntry8,LoopEntry1,LoopEntry2,LoopEntry3
        dw      LoopEntry4,LoopEntry5,LoopEntry6,LoopEntry7
M_IBL   macro   P1
        local   NoMatch
LoopEntry&P1&:
        scasw                   ;Do we have an ID match?
        jnz     NoMatch         ;No match
                                ;We have a match
        inc     bx              ;IDMatchCount++;
        add     dx,[di]         ;IDMatchSum += DataPointer->Value;
NoMatch:
        add     di,DATA_ELEMENT_SIZE-2 ;point to the next element
                                ; (SCASW advanced 2 bytes already)
        endm
        align   2
IntraBlockLoop:
        M_IBL   8
        M_IBL   7
        M_IBL   6
        M_IBL   5
        M_IBL   4
        M_IBL   3
        M_IBL   2
        M_IBL   1
        loop    IntraBlockLoop
; Point to the next block and continue if that pointer isn't NULL.
DoNextBlock:
        mov     si,[si+NextBlock] ;Get pointer to the next block
        and     si,si           ;Is it a NULL pointer?
        jnz     BlockLoop       ;No, continue
; Calculate the average of all matches.
        sub     ax,ax           ;Assume we found no matches
        and     bx,bx
        jz      Done            ;We didn't find any matches, return 0
        xchg    ax,dx           ;Prepare for division
        div     bx              ;Return IDMatchSum / IDMatchCount
Done:   pop     si              ;Restore C register variables
        pop     di
        pop     bp              ;Restore caller's stack frame
        ret
_FindIDAverage  ENDP
        end

Listings 8.5 and 8.6 together go the final step and change the rules in favor of assembly language. Listing 8.5 creates the same list of linked blocks as Listing 8.1. However, instead of storing an array of structures within each block, it stores two arrays in each block, one consisting of ID numbers and the other consisting of the corresponding values, as shown in Figure 8.3. No information is lost; the data is merely rearranged.

Figure 8.3  Linked array storage format (version 2).

LISTING 8.5 L8-5.C

/* Program to search an array spanning a linked list of variable-
   sized blocks, for all entries with a specified ID number,
   and return the average of the values of all such entries. Each of
   the variable-sized blocks may contain any number of data entries,
   stored in the form of two separate arrays, one for ID numbers and
   one for values. */

#include <stdio.h>
#ifdef __TURBOC__
#include <alloc.h>
#else
#include <malloc.h>
#endif

void main(void);
void exit(int);
extern unsigned int FindIDAverage2(unsigned int,
                                   struct BlockHeader *);
/* Structure that starts each variable-sized block */
struct BlockHeader {
   struct BlockHeader *NextBlock; /* Pointer to next block, or NULL
                                     if this is the last block in the
                                     linked list */
   unsigned int BlockCount;       /* The number of DataElement entries
                                     in this variable-sized block */
};

void main(void) {
   int i,j;
   unsigned int IDToFind;
   struct BlockHeader *BaseArrayBlockPointer,*WorkingBlockPointer;
   int *WorkingDataPointer;
   struct BlockHeader **LastBlockPointer;

   printf("ID # for which to find average: ");
   scanf("%d",&IDToFind);

   /* Build an array across 5 blocks, for testing */
   /* Anchor the linked list to BaseArrayBlockPointer */
   LastBlockPointer = &BaseArrayBlockPointer;
   /* Create 5 blocks of varying sizes */
   for (i = 1; i < 6; i++) {
      /* Try to get memory for the next block */
      if ((WorkingBlockPointer =
          (struct BlockHeader *) malloc(sizeof(struct BlockHeader) +
           sizeof(int) * 2 * i * 10)) == NULL) {
         exit(1);
      }
      /* Set the number of data elements in this block */
      WorkingBlockPointer->BlockCount = i * 10;
      /* Link the new block into the chain */
      *LastBlockPointer = WorkingBlockPointer;
      /* Point to the first data field */
      WorkingDataPointer = (int *) ((char *)WorkingBlockPointer +
            sizeof(struct BlockHeader));
      /* Fill the data fields with ID numbers and values */
      for (j = 0; j < (i * 10); j++, WorkingDataPointer++) {
         *WorkingDataPointer = j;
         *(WorkingDataPointer + i * 10) = i * 1000 + j;
      }
      /* Remember where to set link from this block to the next */
      LastBlockPointer = &WorkingBlockPointer->NextBlock;
   }
   /* Set the last block's "next block" pointer to NULL to indicate
      that there are no more blocks */
   WorkingBlockPointer->NextBlock = NULL;
   printf("Average of all elements with ID %d: %u\n",
         IDToFind, FindIDAverage2(IDToFind, BaseArrayBlockPointer));
   exit(0);
}

LISTING 8.6 L8-6.ASM

; Alternative optimized assembly language version of FindIDAverage
; requires data organized as two arrays within each block rather
; than as an array of two-value element structures. This allows the
; use of REP SCASW for ID searching.

SearchedForID   equ     4       ;Passed parameter offsets in the
BlockPointer    equ     6       ; stack frame (skip over pushed BP
                                ; and the return address)
NextBlock       equ     0       ;Field offsets in struct BlockHeader
BlockCount      equ     2
BLOCK_HEADER_SIZE equ   4       ;Number of bytes in struct BlockHeader

        .model  small
        .code
        public  _FindIDAverage2
_FindIDAverage2 proc    near
        push    bp              ;Save caller's stack frame
        mov     bp,sp           ;Point to our stack frame
        push    di              ;Preserve C register variables
        push    si
        mov     di,ds           ;Prepare for SCASW
        mov     es,di
        cld
        mov     si,[bp+BlockPointer]    ;Pointer to first block
        mov     ax,[bp+SearchedForID]   ;ID we're looking for
        sub     dx,dx           ;IDMatchSum = 0
        mov     bp,dx           ;IDMatchCount = 0
                                ;***stack frame no longer available***
; Search through all the linked blocks until the last block
; (marked with a NULL pointer to the next block) has been searched.
BlockLoop:
; Search through all the DataElement entries within this block
; and accumulate data from all that match the desired ID.
        mov     cx,[si+BlockCount]
        jcxz    DoNextBlock        ;Skip this block if there's no data
                                   ; to search through
        mov     bx,cx              ;We'll use BX to point to the
        shl     bx,1               ; corresponding value entry in the
                                   ; case of an ID match (BX is the
                                   ; length in bytes of the ID array)
; Point to the first DataElement entry within this block.
        lea     di,[si+BLOCK_HEADER_SIZE]
IntraBlockLoop:
        repnz   scasw              ;Search for the ID
        jnz     DoNextBlock        ;No match, the block is done
        inc     bp                 ;We have a match; IDMatchCount++;
        add     dx,[di+bx-2]       ;IDMatchSum += DataPointer->Value;
                                   ; (SCASW has advanced DI 2 bytes)
        and     cx,cx              ;Is there more data to search through?
        jnz     IntraBlockLoop     ;yes
; Point to the next block and continue if that pointer isn't NULL.
DoNextBlock:
        mov     si,[si+NextBlock] ;Get pointer to the next block
        and     si,si           ;Is it a NULL pointer?
        jnz     BlockLoop       ;No, continue
; Calculate the average of all matches.
        sub     ax,ax           ;Assume we found no matches
        and     bp,bp
        jz      Done            ;We didn't find any matches, return 0
        xchg    ax,dx           ;Prepare for division
        div     bp              ;Return IDMatchSum / IDMatchCount
Done:   pop     si              ;Restore C register variables
        pop     di
        pop     bp              ;Restore caller's stack frame
        ret
_FindIDAverage2 ENDP
        end

The whole point of this rearrangement is to allow us to use REP SCASW to search through each block, and that’s exactly what FindIDAverage2 in Listing 8.6 does. The result: Listing 8.6 calculates the average about three times as fast as the original C implementation and more than twice as fast as Listing 8.4, heavily optimized as the latter code is.

I trust you get the picture. The sort of instruction-by-instruction optimization that so many of us love to do as a kind of puzzle is fun, but compilers can do it nearly as well as you can, and in the future will surely do it better. What a compiler can’t do is tie together the needs of the program specification on the high end and the processor on the low end, resulting in critical code that runs just about as fast as the hardware permits. The only software that can do that is located north of your sternum and slightly aft of your nose. Dust it off and put it to work—and your code will never again be confused with anything by Hamilton, Joe, Frank, Reynolds, or Bo Donaldson and the Heywoods.

Chapter 9 – Hints My Readers Gave Me

Optimization Odds and Ends from the Field

Back in high school, I took a pre-calculus class from Mr. Bourgeis, whose most notable characteristics were incessant pacing and truly enormous feet. My friend Barry, who sat in the back row, right behind me, claimed that it was because of his large feet that Mr. Bourgeis was so restless. Those feet were so heavy, Barry hypothesized, that if Mr. Bourgeis remained in any one place for too long, the floor would give way under the strain, plunging the unfortunate teacher deep into the mantle of the Earth and possibly all the way through to China. Many amusing cartoons were drawn to this effect.

Unfortunately, Barry was too busy drawing cartoons, or, alternatively, sleeping, to actually learn any math. In the long run, that didn’t turn out to be a handicap for Barry, who went on to become vice-president of sales for a ham-packing company, where presumably he was rarely called upon to derive the quadratic equation. Barry’s lack of scholarship caused some problems back then, though. On one memorable occasion, Barry was half-asleep, with his eyes open but unfocused and his chin balanced on his hand in the classic “if I fall asleep my head will fall off my hand and I’ll wake up” posture, when Mr. Bourgeis popped a killer problem:

“Barry, solve this for X, please.” On the blackboard lay the equation:

X - 1 = 0

“Minus 1,” Barry said promptly.

Mr. Bourgeis shook his head mournfully. “Try again.” Barry thought hard. He knew the fundamental rule that the answer to most mathematical questions is either 0, 1, infinity, -1, or minus infinity (do not apply this rule to balancing your checkbook, however); unfortunately, that gave him only a 25 percent chance of guessing right.

“One,” I whispered surreptitiously.

“Zero,” Barry announced. Mr. Bourgeis shook his head even more sadly.

“One,” I whispered louder. Barry looked still more thoughtful—a bad sign—so I whispered “one” again, even louder. Barry looked so thoughtful that his eyes nearly rolled up into his head, and I realized that he was just doing his best to convince Mr. Bourgeis that Barry had solved this one by himself.

As Barry neared the climax of his stirring performance and opened his mouth to speak, Mr. Bourgeis looked at him with great concern. “Barry, can you hear me all right?”

“Yes, sir,” Barry replied. “Why?”

“Well, I could hear the answer all the way up here. Surely you could hear it just one row away?”

The class went wild. They might as well have sent us home early for all we accomplished the rest of the day.

I like to think I know more about performance programming than Barry knew about math. Nonetheless, I always welcome good ideas and comments, and many readers have sent me a slew of those over the years. So I think I’ll return the favor by devoting this chapter to reader feedback.

Another Look at LEA

Several people have pointed out that while LEA is great for performing certain additions (see Chapter 6), it isn’t a perfect replacement for ADD. What’s the difference? LEA, an addressing instruction by trade, doesn’t affect the flags, while the arithmetic ADD instruction most certainly does. This is no problem when performing additions that involve only quantities that fit in one machine word (32 bits in 386 protected mode, 16 bits otherwise), but it renders LEA useless for multiword operations, which use the Carry flag to tie together partial results. For example, these instructions

ADD  EAX,EBX
ADC  EDX,ECX

could not be replaced with

LEA  EAX,[EAX+EBX]
ADC  EDX,ECX

because LEA doesn’t affect the Carry flag.

The no-carry characteristic of LEA becomes a distinct advantage when performing pointer arithmetic, however. For instance, the following code uses LEA to advance the pointers while adding one 128-bit memory variable to another such variable:

   MOV   ECX,4        ;# of 32-bit words to add
   CLC                ;no carry into the initial ADC
ADDLOOP:
   MOV   EAX,[ESI]    ;get the next element of one array
   ADC   [EDI],EAX    ;add it to the other array, with carry
   LEA   ESI,[ESI+4]  ;advance one array's pointer
   LEA   EDI,[EDI+4]  ;advance the other array's pointer
   LOOP  ADDLOOP

(Yes, I could use LODSD instead of MOV/LEA; I’m just illustrating a point here. Besides, LODS is only 1 cycle faster than MOV/LEA on the 386, and is actually more than twice as slow on the 486.) If we used ADD rather than LEA to advance the pointers, the carry from one ADC to the next would have to be preserved with either PUSHF/POPF or LAHF/SAHF. (Alternatively, we could use multiple INCs, since INC doesn’t affect the Carry flag.)

In short, LEA is indeed different from ADD. Sometimes it’s better. Sometimes not; that’s the nature of the various instruction substitutions and optimizations that will occur to you over time. There’s no such thing as “best” instructions on the x86; it all depends on what you’re trying to do.

But there sure are a lot of interesting options, aren’t there?

The Kennedy Portfolio

Reader John Kennedy regularly passes along intriguing assembly programming tricks, many of which I’ve never seen mentioned anywhere else. John likes to optimize for size, whereas I lean more toward speed, but many of his optimizations are good for both purposes. Here are a few of my favorites:

John’s code for setting AX to its absolute value is:

CWD
XOR   AX,DX
SUB   AX,DX

This does nothing when bit 15 of AX is 0 (that is, if AX is positive). When AX is negative, the code “nots” it and adds 1, which is exactly how you perform a two’s complement negate. For the case where AX is not negative, this trick usually beats the stuffing out of the standard absolute value code:

   AND   AX,AX        ;negative?
   JNS   IsPositive   ;no
   NEG   AX           ;yes,negate it
IsPositive:

However, John’s code is slower on a 486; as you’re no doubt coming to realize (and as I’ll explain in Chapters 12 and 13), the 486 is an optimization world unto itself.
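
To make the mechanics concrete, here are the same three instructions traced with a sample value; the starting value is just an illustration:

CWD             ;suppose AX = 0FFFBH (-5); DX becomes 0FFFFH
XOR   AX,DX     ;AX = 0004H (every bit of AX flipped)
SUB   AX,DX     ;AX = 0004H - 0FFFFH = 0005H (subtracting -1 adds 1)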

Here’s how John copies a block of bytes from DS:SI to ES:DI, moving as much data as possible a word at a time:

SHR   CX,1      ;word count
REP   MOVSW     ;copy as many words as possible
ADC   CX,CX     ;CX=1 if copy length was odd,
                ;0 else
REP   MOVSB     ;copy any odd byte

(ADC CX,CX can be replaced with RCL CX,1; which is faster depends on the processor type.) It might be hard to believe that the above is faster than this:

   SHR   CX,1      ;word count
   REP   MOVSW     ;copy as many words as
                   ;possible
   JNC   CopyDone  ;done if even copy length
   MOVSB           ;copy the odd byte
CopyDone:

However, it generally is. Sure, if the length is odd, John’s approach incurs a penalty approximately equal to the REP startup time for MOVSB. However, if the length is even, John’s approach doesn’t branch, saving cycles and not emptying the prefetch queue. If copy lengths are evenly distributed between even and odd, John’s approach is faster in most x86 systems. (Not on the 486, though.)

John also points out that on the 386, multiple LEAs can be combined to perform multiplications that can’t be handled by a single LEA, much as multiple shifts and adds can be used for multiplication, only faster. LEA can be used to multiply in a single instruction on the 386, but only by the values 2, 3, 4, 5, 8, and 9; several LEAs strung together can handle a much wider range of values. For example, video programmers are undoubtedly familiar with the following code to multiply AX times 80 (the width in bytes of the bitmap in most PC display modes):

SHL   AX,1        ;*2
SHL   AX,1        ;*4
SHL   AX,1        ;*8
SHL   AX,1        ;*16
MOV   BX,AX
SHL   AX,1        ;*32
SHL   AX,1        ;*64
ADD   AX,BX       ;*80

Using LEA on the 386, the above could be reduced to

LEA   EAX,[EAX*2]     ;*2
LEA   EAX,[EAX*8]     ;*16
LEA   EAX,[EAX+EAX*4] ;*80

which still isn’t as fast as using a lookup table like

MOV   EAX,MultiplesOf80Table[EAX*4]

but is close and takes a great deal less space.

Of course, on the 386, the shift and add version could also be reduced to this considerably more efficient code:

SHL   AX,4       ;*16
MOV   BX,AX
SHL   AX,2       ;*64
ADD   AX,BX      ;*80
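
To illustrate how LEAs can be strung together for a multiplier that a single LEA can't reach, here's one possible sequence for multiplying EAX by 100 (just an illustration; other factorings work too):

LEA   EAX,[EAX+EAX*4] ;*5
LEA   EAX,[EAX+EAX*4] ;*25
LEA   EAX,[EAX*4]     ;*100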

Speeding Up Multiplication

That brings us to multiplication, one of the slowest of x86 operations and one that allows for considerable optimization. One way to speed up multiplication is to use shift and add, LEA, or a lookup table to hard-code a multiplication operation for a fixed multiplier, as shown above. Another is to take advantage of the early-out feature of the 386 (and the 486, but in the interests of brevity I’ll just say “386” from now on) by arranging your operands so that the multiplier (always the rightmost operand following MUL or IMUL) is no larger than the other operand.

Why? Because the 386 processes one multiplier bit per cycle and immediately ends a multiplication when all significant bits of the multiplier have been processed, so fewer cycles are required to multiply a large multiplicand times a small multiplier than a small multiplicand times a large multiplier, by a factor of about 1 cycle for each significant multiplier bit eliminated.

(There’s a minimum execution time on this trick; below 3 significant multiplier bits, no additional cycles are saved.) For example, multiplication of 32,767 times 1 is 12 cycles faster than multiplication of 1 times 32,767.

Choosing the right operand as the multiplier can work wonders. According to published specs, the 386 takes 38 cycles to multiply by a multiplier with 32 significant bits but only 9 cycles to multiply by a multiplier of 2, a performance improvement of more than four times! (My tests regularly indicate that multiplication takes 3 to 4 cycles longer than the specs indicate, but the cycle-per-bit advantage of smaller multipliers holds true nonetheless.)

This highlights another interesting point: MUL and IMUL on the 386 are so fast that alternative multiplication approaches, while generally still faster, are worthwhile only in truly time-critical code.

On 386SXs and uncached 386s, where code size can significantly affect performance due to instruction prefetching, the compact MUL and IMUL instructions can approach and in some cases even outperform the “optimized” alternatives.

All in all, MUL and IMUL are reasonable performers on the 386, no longer to be avoided in most cases—and you can help that along by arranging your code to make the smaller operand the multiplier whenever you know which operand is smaller.

That doesn’t mean that your code should test and swap operands to make sure the smaller one is the multiplier; that rarely pays off. I’m speaking more of the case where you’re scaling an array up by a value that’s always in the range of, say, 2 to 10; because the scale value will always be small and the array elements may have any value, the scale value is the logical choice for the multiplier.
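
As a sketch of that array-scaling case (ScaleFactor is a hypothetical memory variable), the small value goes in the operand position that serves as the multiplier:

MOV   AX,[SI]           ;array element: could be any 16-bit value
MOV   CX,[ScaleFactor]  ;scale factor: always in the range 2 to 10
MUL   CX                ;CX is the multiplier, so the 386 early-outs
                        ; after its few significant bits are processed
MOV   [DI],AX           ;store the low word of the scaled result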

Optimizing Optimized Searching

Rob Williams writes with a wonderful optimization to the REPNZ SCASB-based optimized searching routine I discussed in Chapter 5. As a quick refresher, I described searching a buffer for a text string as follows: Scan for the first byte of the text string with REPNZ SCASB, then use REPZ CMPS to check for a full match whenever REPNZ SCASB finds a match for the first character, as shown in Figure 9.1. The principle is that most buffer characters won’t match the first character of any given string, so REPNZ SCASB, by far the fastest way to search on the PC, can be used to eliminate most potential matches; each remaining potential match can then be checked in its entirety with REPZ CMPS.

Figure 9.1  Simple searching method for locating a text string.

Rob’s revelation, which he credits without explanation to Edgar Allen Poe (search nevermore?), was that by far the slowest part of the whole deal is handling REPNZ SCASB matches, which require checking the remainder of the string with REPZ CMPS and restarting REPNZ SCASB if no match is found.

Rob points out that the number of REPNZ SCASB matches can easily be reduced simply by scanning for the character in the searched-for string that appears least often in the buffer being searched.

Imagine, if you will, that you’re searching for the string “EQUAL.” By my approach, you’d use REPNZ SCASB to scan for each occurrence of “E,” which crops up quite often in normal text. Rob points out that it would make more sense to scan for “Q,” then back up one character and check the whole string when a “Q” is found, as shown in Figure 9.2. “Q” is likely to occur much less often, resulting in many fewer whole-string checks and much faster processing.

Listing 9.1 implements the scan-on-first-character approach. Listing 9.2 scans for whatever character the caller specifies. Listing 9.3 is a test program used to compare the two approaches. How much difference does Rob’s revelation make? Plenty. Even when the entire C function call to FindString is timed—strlen calls, parameter pushing, calling, setup, and all—the version of FindString in Listing 9.2, which is directed by Listing 9.3 to scan for the infrequently-occurring “Q,” is about 40 percent faster on a 20 MHz cached 386 for the test search of Listing 9.3 than is the version of FindString in Listing 9.1, which always scans for the first character, in this case “E.” However, when only the search loops (the code that actually does the searching) in the two versions of FindString are compared, Listing 9.2 is more than twice as fast as Listing 9.1—a remarkable improvement over code that already uses REPNZ SCASB and REPZ CMPS.

What I like so much about Rob’s approach is that it demonstrates that optimization involves much more than instruction selection and cycle counting. Listings 9.1 and 9.2 use pretty much the same instructions, and even use the same approach of scanning with REPNZ SCASB and using REPZ CMPS to check scanning matches.

The difference between Listings 9.1 and 9.2 (which gives you more than a doubling of performance) is due entirely to understanding the nature of the data being handled, and biasing the code to reflect that knowledge.

Figure 9.2  Faster searching method for locating a text string.

LISTING 9.1 L9-1.ASM

; Searches a text buffer for a text string. Uses REPNZ SCASB to scan
; the buffer for locations that match the first character of the
; searched-for string, then uses REPZ CMPS to check fully only those
; locations that REPNZ SCASB has identified as potential matches.
;
; Adapted from Zen of Assembly Language, by Michael Abrash
;
; C small model-callable as:
;    unsigned char * FindString(unsigned char * Buffer,
;     unsigned int BufferLength, unsigned char * SearchString,
;     unsigned int SearchStringLength);
;
; Returns a pointer to the first match for SearchString in Buffer,or
; a NULL pointer if no match is found. Buffer should not start at
; offset 0 in the data segment to avoid confusing a match at 0 with
; no match found.
Parms struc
                        dw    2 dup(?) ;pushed BP/return address
Buffer                  dw      ?      ;pointer to buffer to search
BufferLength            dw      ?      ;length of buffer to search
SearchString            dw      ?      ;pointer to string for which to search
SearchStringLength      dw      ?      ;length of string for which to search
Parms ends
      .model      small
      .code
      public _FindString
_FindString proc near
      push     bp      ;preserve caller's stack frame
      mov      bp,sp   ;point to our stack frame
      push     si      ;preserve caller's register variables
      push     di
      cld              ;make string instructions increment pointers
      mov      si,[bp+SearchString]       ;pointer to string to search for
      mov      bx,[bp+SearchStringLength] ;length of string
      and      bx,bx
      jz       FindStringNotFound         ;no match if string is 0 length
      mov      dx,[bp+BufferLength]       ;length of buffer
      sub      dx,bx                      ;difference between buffer and string lengths
      jc      FindStringNotFound          ;no match if search string is
                                          ; longer than buffer
      inc      dx      ;difference between buffer and search string
                       ; lengths, plus 1 (# of possible string start
                       ; locations to check in the buffer)
      mov      di,ds
      mov      es,di
      mov      di,[bp+Buffer]       ;point ES:DI to buffer to search thru
      lodsb                         ;put the first byte of the search string in AL
      mov      bp,si                ;set aside pointer to the second search byte
      dec      bx                   ;don't need to compare the first byte of the
                                    ; string with CMPS; we'll do it with SCAS
FindStringLoop:
      mov    cx,dx ;put remaining buffer search length in CX
      repnz  scasb ;scan for the first byte of the string
      jnz    FindStringNotFound ;not found, so there's no match
                                ;found, so we have a potential match-check the
                                ; rest of this candidate location
      push   di                 ;remember the address of the next byte to scan
      mov    dx,cx              ;set aside the remaining length to search in
                                ; the buffer
      mov    si,bp              ;point to the rest of the search string
      mov    cx,bx              ;string length (minus first byte)
      shr    cx,1               ;convert to word for faster search
      jnc    FindStringWord     ;do word search if no odd byte
      cmpsb                     ;compare the odd byte
      jnz    FindStringNoMatch  ;odd byte doesn't match, so we
                                ; haven't found the search string here
FindStringWord:
      jcxz   FindStringFound    ;test whether we've already checked
                                ; the whole string; if so, this is a match
      repz   cmpsw              ;check the rest of the string a word at a time
      jz     FindStringFound    ;it's a match
FindStringNoMatch:
      pop    di                 ;get back pointer to the next byte to scan
      and    dx,dx              ;is there anything left to check?
      jnz    FindStringLoop     ;yes-check next byte
FindStringNotFound:
      sub    ax,ax              ;return a NULL pointer indicating that the
      jmp    FindStringDone     ; string was not found
FindStringFound:
      pop    ax                 ;point to the buffer location at which the
      dec    ax                 ; string was found (earlier we pushed the
                                ; address of the byte after the start of the
                                ; potential match)
FindStringDone:
      pop    di                 ;restore caller's register variables
      pop    si
      pop    bp                 ;restore caller's stack frame
ret
_FindString endp
      end

LISTING 9.2 L9-2.ASM

; Searches a text buffer for a text string. Uses REPNZ SCASB to scan
; the buffer for locations that match a specified character of the
; searched-for string, then uses REPZ CMPS to check fully only those
; locations that REPNZ SCASB has identified as potential matches.
;
; C small model-callable as:
;    unsigned char * FindString(unsigned char * Buffer,
;     unsigned int BufferLength, unsigned char * SearchString,
;     unsigned int SearchStringLength,
;     unsigned int ScanCharOffset);
;
; Returns a pointer to the first match for SearchString in Buffer,or
; a NULL pointer if no match is found. Buffer should not start at
; offset 0 in the data segment to avoid confusing a match at 0 with
; no match found.
Parms  struc
                        dw      2 dup(?)      ;pushed BP/return address
Buffer                  dw      ?             ;pointer to buffer to search
BufferLength            dw      ?             ;length of buffer to search
SearchString            dw      ?             ;pointer to string for which to search
SearchStringLength      dw      ?             ;length of string for which to search
ScanCharOffset          dw      ?             ;offset in string of character for
                                              ; which to scan
Parms ends
      .model      small
      .code
      public _FindString
_FindString proc near
      push     bp      ;preserve caller's stack frame
      mov      bp,sp   ;point to our stack frame
      push     si      ;preserve caller's register variables
      push     di
      cld              ;make string instructions increment pointers
      mov      si,[bp+SearchString]       ;pointer to string to search for
      mov      cx,[bp+SearchStringLength] ;length of string
      jcxz     FindStringNotFound         ;no match if string is 0 length
      mov      dx,[bp+BufferLength]       ;length of buffer
      sub      dx,cx                      ;difference between buffer and search
                                          ; lengths
      jc        FindStringNotFound ;no match if search string is
                    ; longer than buffer
      inc       dx  ; difference between buffer and search string
                    ; lengths, plus 1 (# of possible string start
                    ; locations to check in the buffer)
      mov       di,ds
      mov       es,di
      mov       di,[bp+Buffer]         ;point ES:DI to buffer to search thru
      mov       bx,[bp+ScanCharOffset] ;offset in string of character
                                       ; on which to scan
      add       di,bx         ;point ES:DI to first buffer byte to scan
      mov       al,[si+bx]    ;put the scan character in AL
      inc       bx            ;set BX to the offset back to the start of the
                              ; potential full match after a scan match,
                              ; accounting for the 1-byte overrun of
                              ; REPNZ SCASB
FindStringLoop:
      mov       cx,dx              ;put remaining buffer search length in CX
      repnz     scasb              ;scan for the scan byte
      jnz       FindStringNotFound ;not found, so there's no match
                                   ;found, so we have a potential match-check the
                                   ; rest of this candidate location
      push       di                ;remember the address of the next byte to scan
      mov        dx,cx             ;set aside the remaining length to search in
                                   ; the buffer
      sub        di,bx             ;point back to the potential start of the
                                   ; match in the buffer
      mov        si,[bp+SearchString]       ;point to the start of the string
      mov        cx,[bp+SearchStringLength] ;string length
      shr        cx,1                       ;convert to word for faster search
      jnc        FindStringWord             ;do word search if no odd byte
      cmpsb                                 ;compare the odd byte
      jnz        FindStringNoMatch          ;odd byte doesn't match, so we
                                            ; haven't found the search string here
FindStringWord:
      jcxz       FindStringFound        ;if the string is only 1 byte long,
                                        ; we've found a match
      repz       cmpsw                  ;check the rest of the string a word at a time
      jz         FindStringFound        ;it's a match
FindStringNoMatch:
      pop        di                     ;get back pointer to the next byte to scan
      and        dx,dx                  ;is there anything left to check?
      jnz        FindStringLoop         ;yes-check next byte
FindStringNotFound:
      sub        ax,ax                  ;return a NULL pointer indicating that the
      jmp        FindStringDone         ; string was not found
FindStringFound:
      pop         ax         ;point to the buffer location at which the
      sub         ax,bx      ; string was found (earlier we pushed the
                             ; address of the byte after the scan match)
FindStringDone:
      pop         di         ;restore caller's register variables
      pop         si
      pop         bp         ;restore caller's stack frame
      ret
_FindString endp
      end

LISTING 9.3 L9-3.C

/* Program to exercise buffer-search routines in Listings 9.1 & 9.2 */
#include <stdio.h>
#include <string.h>

#define DISPLAY_LENGTH  40
extern unsigned char * FindString(unsigned char *, unsigned int,
      unsigned char *, unsigned int, unsigned int);
void main(void);
static unsigned char TestBuffer[] = "When, in the course of human \
events, it becomes necessary for one people to dissolve the \
political bands which have connected them with another, and to \
assume among the powers of the earth the separate and equal station \
to which the laws of nature and of nature's God entitle them...";
void main() {
   static unsigned char TestString[] = "equal";
   unsigned char TempBuffer[DISPLAY_LENGTH+1];
   unsigned char *MatchPtr;
   /* Search for TestString and report the results */
   if ((MatchPtr = FindString(TestBuffer,
         (unsigned int) strlen(TestBuffer), TestString,
         (unsigned int) strlen(TestString), 1)) == NULL) {
      /* TestString wasn't found */
      printf("\"%s\" not found\n", TestString);
   } else {
      /* TestString was found. Zero-terminate TempBuffer; strncpy
         won't do it if DISPLAY_LENGTH characters are copied */
      TempBuffer[DISPLAY_LENGTH] = 0;
      printf("\"%s\" found. Next %d characters at match:\n\"%s\"\n",
            TestString, DISPLAY_LENGTH,
            strncpy(TempBuffer, MatchPtr, DISPLAY_LENGTH));
   }
}

You’ll notice that in Listing 9.2 I didn’t use a table of character frequencies in English text to determine the character for which to scan, but rather let the caller make that choice. Each buffer of bytes has unique characteristics, and English-letter frequency could well be inappropriate. What if the buffer is filled with French text? Cyrillic? What if it isn’t text that’s being searched? It might be worthwhile for an application to build a dynamic frequency table for each buffer so that the best scan character could be chosen for each search. Or perhaps not, if the search isn’t time-critical or the buffer is small.
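
If you do want to pick the scan character dynamically, one possible approach (just a sketch, not one of the numbered listings, with the function name and interface invented for illustration) is to count how often each byte value occurs in the buffer, then scan on whichever search-string character is rarest in that particular buffer:

/* Sketch: choose the scan character for Listing 9.2 by counting byte
   frequencies in the buffer and picking the search-string character
   that occurs least often there. Returns the value to pass as
   ScanCharOffset. Illustrative only; whether the extra pass over the
   buffer pays off depends on how often the buffer is searched. */
unsigned int PickScanChar(unsigned char *Buffer, unsigned int BufferLength,
      unsigned char *SearchString, unsigned int SearchStringLength)
{
   unsigned long Freq[256] = {0};
   unsigned int i, BestOffset = 0;

   for (i = 0; i < BufferLength; i++)
      Freq[Buffer[i]]++;                 /* tally each byte value */
   for (i = 1; i < SearchStringLength; i++)
      if (Freq[SearchString[i]] < Freq[SearchString[BestOffset]])
         BestOffset = i;                 /* rarer character found */
   return BestOffset;
}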

The point is that you can improve performance dramatically by understanding the nature of the data with which you work. (This is equally true for high-level language programming, by the way.) Listing 9.2 is very similar to and only slightly more complex than Listing 9.1; the difference lies not in elbow grease or cycle counting but in the organic integrating optimizer technology we all carry around in our heads.

Short Sorts

David Stafford (recently of Borland and Borland Japan) who happens to be one of the best assembly language programmers I’ve ever met, has written a C-callable routine that sorts an array of integers in ascending order. That wouldn’t be particularly noteworthy, except that David’s routine, shown in Listing 9.4, is exactly 25 bytes long. Look at the code; you’ll keep saying to yourself, “But this doesn’t work…oh, yes, I guess it does.” As they say in the Prego spaghetti sauce ads, it’s in there—and what a job of packing. Anyway, David says that a 24-byte sort routine eludes him, and he’d like to know if anyone can come up with one.

LISTING 9.4 L9-4.ASM

;--------------------------------------------------------------------------
; Sorts an array of ints.  C callable (small model).  25 bytes.
; void sort( int num, int a[] );
;
; Courtesy of David Stafford.
;--------------------------------------------------------------------------

      .model small
      .code
        public _sort

top:    mov     dx,[bx]         ;swap two adjacent integers
        xchg    dx,[bx+2]
        xchg    dx,[bx]
        cmp     dx,[bx]         ;did we put them in the right order?
        jl      top             ;no, swap them back
        inc     bx              ;go to next integer
        inc     bx
        loop    top
_sort:  pop     dx              ;get return address (entry point)
        pop     cx              ;get count
        pop     bx              ;get pointer
        push    bx              ;restore pointer
        dec     cx              ;decrement count
        push    cx              ;save count
        push    dx              ;restore return address
        jg      top             ;if cx > 0

        ret

      end
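
If you want to see the routine in action, a minimal driver in the spirit of Listing 9.3 will do; this is a sketch that assumes the same small-model C setup used for the other listings in this chapter:

/* Exercises the sort routine in Listing 9.4 (small model). */
#include <stdio.h>

extern void sort(int num, int a[]);

void main() {
   static int values[] = { 5, -3, 12, 0, 7, 7, -9, 2 };
   int count = sizeof(values) / sizeof(values[0]);
   int i;

   sort(count, values);
   for (i = 0; i < count; i++)
      printf("%d ", values[i]);
   printf("\n");
}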

Full 32-Bit Division

One of the most annoying limitations of the x86 is that while the dividend operand to the DIV instruction can be 32 bits in size, both the divisor and the result must be 16 bits. That’s particularly annoying in regards to the result because sometimes you just don’t know whether the ratio of the dividend to the divisor is greater than 64K-1 or not—and if you guess wrong, you get that godawful Divide By Zero interrupt. So, what is one to do when the result might not fit in 16 bits, or when the dividend is larger than 32 bits? Fall back to a software division approach? That will work—but oh so slowly.

There’s another technique that’s much faster than a pure software approach, albeit not so flexible. This technique allows arbitrarily large dividends and results, but the divisor is still limited to 16 bits. That’s not perfect, but it does solve a number of problems, in particular eliminating the possibility of a Divide By Zero interrupt from a too-large result.

This technique involves nothing more complicated than breaking up the division into word-sized chunks, starting with the most significant word of the dividend. The most significant word is divided by the divisor (with no chance of overflow because there are only 16 bits in each); then the remainder is prepended to the next 16 bits of dividend, and the process is repeated, as shown in Figure 9.3. This process is equivalent to dividing by hand, except that here we stop to carry the remainder manually only after each word of the dividend; the hardware divide takes care of the rest. Listing 9.5 shows a function to divide an arbitrarily large dividend by a 16-bit divisor, and Listing 9.6 shows a sample division of a large dividend. Note that the same principle can be applied to handling arbitrarily large dividends in 386 native mode code, but in that case the operation can proceed a dword, rather than a word, at a time.
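
If the word-by-word process is easier to follow in C, here is a rough equivalent of what Listing 9.5 does. It is a sketch only; unlike Listing 9.5 it takes the dividend length in words rather than bytes, and it uses a 32-bit temporary where the assembly code uses the DX:AX register pair:

/* Sketch: divide a multiword dividend (least significant word first)
   by a 16-bit divisor, working from the most significant word down.
   Returns the remainder; the quotient is the same length as the
   dividend. Assumes each array element holds one 16-bit word. */
unsigned int DivSketch(unsigned int *Dividend, int DividendWords,
      unsigned int Divisor, unsigned int *Quotient)
{
   unsigned long Temp;
   unsigned int Remainder = 0;
   int i;

   for (i = DividendWords - 1; i >= 0; i--) {
      /* prepend the remainder so far to the next dividend word */
      Temp = ((unsigned long)Remainder << 16) | Dividend[i];
      Quotient[i] = (unsigned int)(Temp / Divisor);
      Remainder   = (unsigned int)(Temp % Divisor);
   }
   return Remainder;
}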

Figure 9.3  Fast multiword division on the 386.

As for handling signed division with arbitrarily large dividends, that can be done easily enough by remembering the signs of the dividend and divisor, dividing the absolute value of the dividend by the absolute value of the divisor, and applying the stored signs to set the proper signs for the quotient and remainder. There may be more clever ways to produce the same result, by using IDIV, for example; if you know of one, drop me a line c/o Coriolis Group Books.
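
For what it’s worth, here is one way that sign bookkeeping might look, wrapped around the Div routine of Listing 9.5. This is only a sketch, and it assumes the usual truncating-division convention (the remainder takes the sign of the dividend, and the quotient is negative when the operand signs differ); the choice of convention is up to you.

/* Sketch: signed 32-bit by 16-bit division built on the unsigned Div
   routine of Listing 9.5. Ignores overflow edge cases such as
   LONG_MIN. */
extern unsigned int Div(unsigned int * Dividend, int DividendLength,
      unsigned int Divisor, unsigned int * Quotient);

long SignedDiv(long Dividend, int Divisor, int *Remainder)
{
   unsigned long AbsDividend = (Dividend < 0) ? -Dividend : Dividend;
   unsigned int  AbsDivisor  = (Divisor  < 0) ? -Divisor  : Divisor;
   unsigned long Quotient;
   unsigned int  Rem;

   /* divide the absolute values */
   Rem = Div((unsigned int *)&AbsDividend, sizeof(AbsDividend),
         AbsDivisor, (unsigned int *)&Quotient);
   /* apply the stored signs */
   *Remainder = (Dividend < 0) ? -(int)Rem : (int)Rem;
   return ((Dividend < 0) != (Divisor < 0)) ? -(long)Quotient
                                            : (long)Quotient;
}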

LISTING 9.5 L9-5.ASM

; Divides an arbitrarily long unsigned dividend by a 16-bit unsigned
; divisor. C near-callable as:
;     unsigned int Div(unsigned int * Dividend,
;     int DividendLength, unsigned int Divisor,
;     unsigned int * Quotient);
;
; Returns the remainder of the division.
;
; Tested with TASM 2.

parms struc
          dw     2 dup (?)     ;pushed BP & return address
Dividend  dw     ?             ;pointer to value to divide, stored in Intel
                               ; order, with lsb at lowest address, msb at
                               ; highest. Must be composed of an integral
                               ; number of words
DividendLength   dw  ?         ;# of bytes in Dividend. Must be a multiple
                               ; of 2
Divisor          dw ?          ;value by which to divide. Must not be zero,
                               ; or a Divide By Zero interrupt will occur
Quotient         dw ?          ;pointer to buffer in which to store the
                               ; result of the division, in Intel order.
                               ; The quotient returned is of the same
                               ; length as the dividend
parms ends

               .model     small
               .code
               public     _Div
_Div proc near
               push    bp      ;preserve caller's stack frame
               mov     bp,sp   ;point to our stack frame
               push    si      ;preserve caller's register variables
               push    di

               std             ;we're working from msb to lsb
               mov  ax,ds
               mov  es,ax      ;for STOS
               mov  cx,[bp+DividendLength]
               sub  cx,2
               mov  si,[bp+Dividend]
               add  si,cx      ;point to the last word of the dividend
                               ; (the most significant word)
               mov  di,[bp+Quotient]
               add  di,cx      ;point to the last word of the quotient
                               ; buffer (the most significant word)
               mov  bx,[bp+Divisor]
               shr  cx,1
               inc  cx         ;# of words to process
               sub  dx,dx      ;convert the initial dividend word to a 32-bit
                               ;value for DIV
DivLoop:
               lodsw           ;get the next most significant word of the dividend
               div  bx
               stosw           ;save this word of the quotient
                               ;DX contains the remainder at this point,
                               ; ready to prepend to the next divisor word
               loop  DivLoop
               mov   ax,dx     ;return the remainder
               cld             ;restore default Direction flag setting
               pop   di        ;restore caller's register variables
               pop   si
               pop   bp        ;restore caller's stack frame
               ret
_Div endp
               end

LISTING 9.6 L9-6.C

/* Sample use of Div function to perform division when the result
   doesn't fit in 16 bits */

#include <stdio.h>

extern unsigned int Div(unsigned int * Dividend,
          int DividendLength, unsigned int Divisor,
          unsigned int * Quotient);

main() {
   unsigned long m, i = 0x20000001;
   unsigned int k, j = 0x10;

   k = Div((unsigned int *)&i, sizeof(i), j, (unsigned int *)&m);
   printf("%lu / %u = %lu r %u\n", i, j, m, k);
}

Sweet Spot Revisited

Way back in Volume 1, Number 1 of PC TECHNIQUES, (April/May 1990) I wrote the very first of that magazine’s HAX (#1), which extolled the virtues of placing your most commonly-used automatic (stack-based) variables within the stack’s “sweet spot,” the area between +127 to -128 bytes away from BP, the stack frame pointer. The reason was that the 8088 can store addressing displacements that fall within that range in a single byte; larger displacements require a full word of storage, increasing code size by a byte per instruction, and thereby slowing down performance due to increased instruction fetching time.

This takes on new prominence in 386 native mode, where straying from the sweet spot costs not one, but two or three bytes. Where the 8088 had two possible displacement sizes, either byte or word, on the 386 there are three possible sizes: byte, word, or dword. In native mode (32-bit protected mode), however, a prefix byte is needed in order to use a word-sized displacement, so a variable located outside the sweet spot requires either two extra bytes (an extra displacement byte plus a prefix byte) or three extra bytes (a dword displacement rather than a byte displacement). Either way, instructions grow alarmingly.

Performance may or may not suffer from missing the sweet spot, depending on the processor, the memory architecture, and the code mix. On a 486, prefix bytes often cost a cycle; on a 386SX, increased code size often slows performance because instructions must be fetched through the half-pint 16-bit bus; on a 386, the effect depends on the instruction mix and whether there’s a cache.

On balance, though, it’s as important to keep your most-used variables in the stack’s sweet spot in 386 native mode as it was on the 8088.

In assembly, it’s easy to control the organization of your stack frame. In C, however, you’ll have to figure out the allocation scheme your compiler uses to allocate automatic variables, and declare automatics appropriately to produce the desired effect. It can be done: I did it in Turbo C some years back, and trimmed the size of a program (admittedly, a large one) by several K—not bad, when you consider that the “sweet spot” optimization is essentially free, with no code reorganization, change in logic, or heavy thinking involved.
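
What that looks like in practice varies from compiler to compiler, so take the following purely as an illustration of the idea (the function and variables are hypothetical): group the heavily used scalars together, away from the bulky arrays, and then check the generated code to see whether your compiler actually keeps them within a byte-sized displacement of the frame pointer.

/* Hypothetical example of sweet-spot-friendly declarations; whether
   declaration order controls stack layout depends on the compiler. */
int sum_and_copy(char *src, int n)
{
   /* heavily used scalars, declared together so they are likely to
      land within a one-byte displacement of BP/EBP */
   int i;
   int sum = 0;

   /* bulky, rarely addressed buffer; interleaving it with the scalars
      could push them out of the sweet spot on some compilers */
   char work_buffer[512];

   for (i = 0; i < n && i < (int)sizeof(work_buffer); i++) {
      work_buffer[i] = src[i];
      sum += src[i];
   }
   return sum;
}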

Hard-Core Cycle Counting

Next, we come to an item that cycle counters will love, especially since it involves apparently incorrect documentation on Intel’s part. According to Intel’s documents, all RCR and RCL instructions, which perform rotations through the Carry flag, as shown in Figure 9.4, take 9 cycles on the 386 when working with a register operand. My measurements indicate that the 9-cycle execution time almost holds true for multibit rotate-through-carries, which I’ve timed at 8 cycles apiece; for example, RCR AX,CL takes 8 cycles on my 386, as does RCL DX,2. Contrast that with ROR and ROL, which can rotate the contents of a register any number of bits in just 3 cycles.

However, rotating by one bit through the Carry flag does not take 9 cycles, contrary to Intel’s 80386 Programmer’s Reference Manual, or even 8 cycles. In fact, RCR reg,1 and RCL reg,1 take 3 cycles, just like ROR, ROL, SHR, and SHL. At least, that’s how fast they run on my 386, and I very much doubt that you’ll find different execution times on other 386s. (Please let me know if you do, though!)

Figure 9.4  Performing rotate instructions using the Carry flag.

Interestingly, according to Intel’s i486 Microprocessor Programmer’s Reference Manual, the 486 can RCR or RCL a register by one bit in 3 cycles, but takes between 8 and 30 cycles to perform a multibit register RCR or RCL!

No great lesson here, just a caution to be leery of multibit RCR and RCL when performance matters—and to take cycle-time documentation with a grain of salt.

Hardwired Far Jumps

Did you ever wonder how to code a far jump to an absolute address in assembly language? Probably not, but if you ever do, you’re going to be glad for this next item, because the obvious solution doesn’t work. You might think all it would take to jump to, say, 1000:5 would be JMP FAR PTR 1000:5, but you’d be wrong. That won’t even assemble. You might then think to construct in memory a far pointer containing 1000:5, as in the following:

Ptr  dd   ?
     :
     mov  word ptr [Ptr],5
     mov  word ptr [Ptr+2],1000h
     jmp  [Ptr]

That will work, but at a price in performance. On an 8088, JMP DWORD PTR [mem] (an indirect far jump) takes at least 37 cycles; JMP DWORD PTR label (a direct far jump) takes only 15 cycles (plus, almost certainly, some cycles for instruction fetching). On a 386, an indirect far jump is documented to take at least 43 cycles in real mode (31 in protected mode); a direct far jump is documented to take at least 12 cycles, about three times faster. In truth, the difference between those two is nowhere near that big; the fastest I’ve measured for a direct far jump is 21 cycles, and I’ve measured indirect far jumps as fast as 30 cycles, so direct is still faster, but not by so much. (Oh, those cycle-time documentation blues!) Also, a direct far jump is documented to take at least 27 cycles in protected mode; why the big difference in protected mode, I have no idea.

At any rate, to return to our original problem of jumping to 1000:5: Although an indirect far jump will work, a direct far jump is still preferable.

Listing 9.7 shows a short program that performs a direct far jump to 1000:5. (Don’t run it, unless you want to crash your system!) It does this by creating a dummy segment at 1000H, so that the label FarLabel can be created with the desired far attribute at the proper location. (Segments created with “AT” don’t cause the generation of any actual bytes or the allocation of any memory; they’re just templates.) It’s a little kludgey, but at least it does work. There may be a better solution; if you have one, pass it along.

LISTING 9.7 L9-7.ASM

; Program to perform a direct far jump to address 1000:5.
; *** Do not run this program! It's just an example of how ***
; *** to build a direct far jump to an absolute address    ***
;
; Tested with TASM 2 and MASM 5.

FarSeg     segment  at 01000h
      org  5
FarLabel label  far
FarSeg      ends

      .model     small
      .code
start:
      jmp     FarLabel
      end     start

By the way, if you’re wondering how I figured this out, I merely applied my good friend Dan Illowsky’s long-standing rule for dealing with MASM:

If the obvious doesn’t work (and it usually doesn’t), just try everything you can think of, no matter how ridiculous, until you find something that does—a rule with plenty of history on its side.

Setting 32-Bit Registers: Time versus Space

To finish up this chapter, consider these two items. First, in 32-bit protected mode,

sub  eax,eax
inc  eax

takes 4 cycles to execute, but is only 3 bytes long, while

mov  eax,1

takes only 2 cycles to execute, but is 5 bytes long (because native mode constants are dwords and the MOV instruction doesn’t sign-extend). Both code fragments are ways to set EAX to 1 (although the first affects the flags and the second doesn’t); this is a classic trade-off of speed for space. Second,

or    ebx,-1

takes 2 cycles to execute and is 3 bytes long, while

mov   ebx,-1

takes 2 cycles to execute and is 5 bytes long. Both instructions set EBX to -1; this is a classic trade-off of—gee, it’s not a trade-off at all, is it? OR is a better way to set a 32-bit register to all 1-bits, just as SUB or XOR is a better way to set a register to all 0-bits. Who woulda thunk it? Just goes to show how the 32-bit displacements and constants of 386 native mode change the familiar landscape of 80x86 optimization.

Be warned, though, that I’ve found OR, AND, ADD, and the like to be a cycle slower than MOV when working with immediate operands on the 386 under some circumstances, for reasons that thus far escape me. This just reinforces the first rule of optimization: Measure your code in action, and place not your trust in documented cycle times.

Chapter 10 – Patient Coding, Faster Code

How Working Quickly Can Bring Execution to a Crawl

My grandfather does The New York Times crossword puzzle every Sunday. In ink. With nary a blemish.

The relevance of which will become apparent in a trice.

What my grandfather is, is a pattern matcher par excellence. You’re a pattern matcher, too. So am I. We can’t help it; it comes with the territory. Try focusing on text and not reading it. Can’t do it. Can you hear the voice of someone you know and not recognize it? I can’t. And how in the Nine Billion Names of God is it that we’re capable of instantly recognizing one face out of the thousands we’ve seen in our lifetimes—even years later, from a different angle and in different light? Although we take them for granted, our pattern-matching capabilities are surely a miracle on the order of loaves and fishes.

By “pattern matching,” I mean more than just recognition, though. I mean that we are generally able to take complex and often seemingly woefully inadequate data, instantaneously match it in an incredibly flexible way to our past experience, extrapolate, and reach amazing conclusions, something that computers can scarcely do at all. Crossword puzzles are an excellent example; given a couple of letters and a cryptic clue, we’re somehow able to come up with one out of several hundred thousand words that we know. Try writing a program to do that! What’s more, we don’t process data in the serial brute-force way that computers do. Solutions tend to be virtually instantaneous or not at all; none of those “N log N” or “N²” execution times for us.

It goes without saying that pattern matching is good; more than that, it’s a large part of what we are, and, generally, the faster we are at it, the better. Not always, though. Sometimes insufficient information really is insufficient, and, in our haste to get the heady rush of coming up with a solution, incorrect or less-than-optimal conclusions are reached, as anyone who has ever done the Times Sunday crossword will attest. Still, my grandfather does that puzzle every Sunday in ink. What’s his secret? Patience and discipline. He never fills a word in until he’s confirmed it in his head via intersecting words, no matter how strong the urge may be to put something down where he can see it and feel like he’s getting somewhere.

There’s a surprisingly close parallel to programming here. Programming is certainly a sort of pattern matching in the sense I’ve described above, and, as with crossword puzzles, following your programming instincts too quickly can be a liability. For many programmers, myself included, there’s a strong urge to find a workable approach to a particular problem and start coding it right now, what some people call “hacking” a program. Going with the first thing your programming pattern matcher comes up with can be a lot of fun; there’s instant gratification and a feeling of unbounded creativity. Personally, I’ve always hungered to get results from my work as soon as possible; I gravitated toward graphics for its instant and very visible gratification. Over time, however, I’ve learned patience.

I’ve come to spend an increasingly large portion of my time choosing algorithms, designing, and simply giving my mind quiet time in which to work on problems and come up with non-obvious approaches before coding; and I’ve found that the extra time up front more than pays for itself in both decreased coding time and superior programs.

In this chapter, I’m going to walk you through a simple but illustrative case history that nicely points up the wisdom of delaying gratification when faced with programming problems, so that your mind has time to chew on the problems from other angles. The alternative solutions you find by doing this may seem obvious, once you’ve come up with them. They may not even differ greatly from your initial solutions. Often, however, they will be much better—and you’ll never even have the chance to decide whether they’re better or not if you take the first thing that comes into your head and run with it.

The Case for Delayed Gratification

Once upon a time, I set out to read Algorithms, by Robert Sedgewick (Addison-Wesley), which turned out to be a wonderful, stimulating, and most useful book, one that I recommend highly. My story, however, involves only what happened in the first 12 pages, for it was in those pages that Sedgewick discussed Euclid’s algorithm.

Euclid’s algorithm (discovered by Euclid, of Euclidean geometry fame, a very long time ago, way back when computers still used core memory) is a straightforward algorithm that solves one of the simplest problems imaginable: finding the greatest common integer divisor (GCD) of two positive integers. Sedgewick points out that this is useful for reducing a fraction to its lowest terms. I’m sure it’s useful for other things, as well, although none spring to mind. (A long time ago, I wrote an article about optimizing a bit of code that wasn’t even vaguely time-critical, and got swamped with letters telling me so. I knew it wasn’t time-critical; it was just a good example. So for now, close your eyes and imagine that finding the GCD is not only necessary but must also be done as quickly as possible, because it’s perfect for the point I want to make here and now. Okay?)

The problem at hand, then, is simply this: Find the largest integer value that evenly divides two arbitrary positive integers. That’s all there is to it. So warm up your pattern matchers…and go!

The Brute-Force Syndrome

I have a funny feeling that you’d already figured out how to find the GCD before I even said “go.” That’s what I did when reading Algorithms; before I read another word, I had to figure it out for myself. Programmers are like that; give them a problem and their eyes immediately glaze over as they try to solve it before you’ve even shut your mouth. That sort of instant response can certainly be impressive, but it can backfire, too, as it did in my case.

You see, I fell victim to a common programming pitfall, the “brute-force” syndrome. The basis of this syndrome is that there are many problems that have obvious, brute-force solutions—with one small drawback. The drawback is that if you were to try to apply a brute-force solution by hand—that is, work a single problem out with pencil and paper or a calculator—it would generally require that you have the patience and discipline to work on the problem for approximately seven hundred years, not counting eating and sleeping, in order to get an answer. Finding all the prime numbers less than 1,000,000 is a good example; just divide each number up to 1,000,000 by every lesser number, and see what’s left standing. For most of the history of humankind, people were forced to think of cleverer solutions, such as the Sieve of Eratosthenes (we’d have been in big trouble if the ancient Greeks had had computers), mainly because after about five minutes of brute force-type work, people’s attention gets diverted to other important matters, such as how far a paper airplane will fly from a second-story window.

Not so nowadays, though. Computers love boring work; they’re very patient and disciplined, and, besides, one human year = seven dog years = two zillion computer years. So when we’re faced with a problem that has an obvious but exceedingly lengthy solution, we’re apt to say, “Ah, let the computer do that, it’s fast,” and go back to making paper airplanes. Unfortunately, brute-force solutions tend to be slow even when performed by modern-day microcomputers, which are capable of several MIPS (except when I’m late for an appointment and want to finish a compile and run just one more test before I leave, in which case the crystal in my computer is apparently designed to automatically revert to 1 Hz.)

The solution that I instantly came up with to finding the GCD is about as brute-force as you can get: Divide both the larger integer (iL) and the smaller integer (iS) by every integer equal to or less than the smaller integer, until a number is found that divides both evenly, as shown in Figure 10.1. This works, but it’s a lousy solution, requiring as many as iS*2 divisions; very expensive, especially for large values of iS. For example, finding the GCD of 30,001 and 30,002 would require 60,002 divisions, which alone, disregarding tests and branches, would take about 2 seconds on an 8088, and more than 50 milliseconds even on a 25 MHz 486—a very long time in computer years, and not insignificant in human years either.

Listing 10.1 is an implementation of the brute-force approach to GCD calculation. Table 10.1 shows how long it takes this approach to find the GCD for several integer pairs. As expected, performance is extremely poor when iS is large.

Figure 10.1  Using a brute-force algorithm to find a GCD.

Table 10.1 Performance of GCD algorithm implementations.

                                            Integer pairs for which to find GCD
                                            90 & 27       42 & 998      453 & 121     27432 & 165    27432 & 17550
Listing 10.1 (Brute force)                  60µs (100%)   110µs (100%)  311µs (100%)  426µs (100%)   43580µs (100%)
Listing 10.2 (Subtraction)                  25 (42%)      72 (65%)      67 (22%)      280 (66%)      72 (0.16%)
Listing 10.3 (Division: code recursive
  Euclid’s algorithm)                       20 (33%)      33 (30%)      48 (15%)      32 (8%)        53 (0.12%)
Listing 10.4 (C version of data recursive
  Euclid’s algorithm; normal optimization)  12 (20%)      17 (15%)      25 (8%)       16 (4%)        26 (0.06%)
Listing 10.4 (/Ox = maximum optimization)   12 (20%)      16 (15%)      20 (6%)       15 (4%)        23 (0.05%)
Listing 10.5 (Assembly version of data
  recursive Euclid’s algorithm)             10 (17%)      10 (9%)       15 (5%)       10 (2%)        17 (0.04%)

Note: Performance of Listings 10.1 through 10.5 in finding the greatest common divisors of various pairs of integers. Times are in microseconds. Percentages represent execution time as a percentage of the execution time of Listing 10.1 for the same integer pair. Listings 10.1 through 10.4 were compiled with Microsoft C/C++; except as noted, the default optimization was used. All times measured with the Zen timer (from Chapter 3) on a 20 MHz cached 386.

LISTING 10.1 L10-1.C

/* Finds and returns the greatest common divisor of two positive
   integers. Works by trying every integral divisor between the
   smaller of the two integers and 1, until a divisor that divides
   both integers evenly is found. All C code tested with Microsoft
   and Borland compilers.*/

unsigned int gcd(unsigned int int1, unsigned int int2) {
   unsigned int temp, trial_divisor;
   /* Swap if necessary to make sure that int1 >= int2 */
   if (int1 < int2) {
      temp = int1;
      int1 = int2;
      int2 = temp;
   }
   /* Now just try every divisor from int2 on down, until a common
      divisor is found. This can never be an infinite loop because
      1 divides everything evenly */
   for (trial_divisor = int2; ((int1 % trial_divisor) != 0) ||
         ((int2 % trial_divisor) != 0); trial_divisor--)
      ;
   return(trial_divisor);
}

Wasted Breakthroughs

Sedgewick’s first solution to the GCD problem was pretty much the one I came up with. He then pointed out that the GCD of iL and iS is the same as the GCD of iL-iS and iS. This was obvious (once Sedgewick pointed it out); by the very nature of division, any number that divides iL evenly nL times and iS evenly nS times must divide iL-iS evenly nL-nS times. Given that insight, I immediately designed a new, faster approach, shown in Listing 10.2.

LISTING 10.2 L10-2.C

/* Finds and returns the greatest common divisor of two positive
   integers. Works by subtracting the smaller integer from the
   larger integer until either the values match (in which case
   that's the gcd), or the larger integer becomes the smaller of
   the two, in which case the two integers swap roles and the
   subtraction process continues. */

unsigned int gcd(unsigned int int1, unsigned int int2) {
   unsigned int temp;
   /* If the two integers are the same, that's the gcd and we're
      done */
   if (int1 == int2) {
      return(int1);
   }
   /* Swap if necessary to make sure that int1 >= int2 */
   if (int1 < int2) {
      temp = int1;
      int1 = int2;
      int2 = temp;
   }

   /* Subtract int2 from int1 until int1 is no longer the larger of
      the two */
   do {
      int1 -= int2;
   } while (int1 > int2);
   /* Now recursively call this function to continue the process */
   return(gcd(int1, int2));
}

Listing 10.2 repeatedly subtracts iS from iL until iL becomes less than or equal to iS. If iL becomes equal to iS, then that’s the GCD; alternatively, if iL becomes less than iS, iL and iS switch values, and the process is repeated, as shown in Figure 10.2. The number of iterations this approach requires relative to Listing 10.1 depends heavily on the values of iL and iS, so it’s not always faster, but, as Table 10.1 indicates, Listing 10.2 is generally much better code.

Figure 10.2  Using repeated subtraction algorithm to find a GCD.

Listing 10.2 is a far graver misstep than Listing 10.1, for all that it’s faster. Listing 10.1 is obviously a hacked-up, brute-force approach; no one could mistake it for anything else. It could be speeded up in any of a number of ways with a little thought. (Simply skipping testing all the divisors between iS and iS/2, not inclusive, would cut the worst-case time in half, for example; that’s not a particularly good optimization, but it illustrates how easily Listing 10.1 can be improved.) Listing 10.1 is a hack job, crying out for inspiration.

Listing 10.2, on the other hand, has gotten the inspiration—and largely wasted it through haste. Had Sedgewick not told me otherwise, I might well have assumed that Listing 10.2 was optimized, a mistake I would never have made with Listing 10.1. I experienced a conceptual breakthrough when I understood Sedgewick’s point: A smaller number can be subtracted from a larger number without affecting their GCD, thereby inexpensively reducing the scale of the problem. And, in my hurry to make this breakthrough reality, I missed its full scope. As Sedgewick says on the very next page, the number that one gets by subtracting iS from iL until iL is less than iS is precisely the same as the remainder that one gets by dividing iL by iS—again, this is inherent in the nature of division—and that is the basis for Euclid’s algorithm, shown in Figure 10.3. Listing 10.3 is an implementation of Euclid’s algorithm.

LISTING 10.3 L10-3.C

/* Finds and returns the greatest common divisor of two integers.
   Uses Euclid's algorithm: divides the larger integer by the
   smaller; if the remainder is 0, the smaller integer is the GCD,
   otherwise the smaller integer becomes the larger integer, the
   remainder becomes the smaller integer, and the process is
   repeated. */

static unsigned int gcd_recurs(unsigned int, unsigned int);

unsigned int gcd(unsigned int int1, unsigned int int2) {
   unsigned int temp;
   /* If the two integers are the same, that's the GCD and we're
      done */
   if (int1 == int2) {
      return(int1);
   }
   /* Swap if necessary to make sure that int1 >= int2 */
   if (int1 < int2) {
      temp = int1;
      int1 = int2;
      int2 = temp;
   }

   /* Now call the recursive form of the function, which assumes
      that the first parameter is the larger of the two */
   return(gcd_recurs(int1, int2));
}

static unsigned int gcd_recurs(unsigned int larger_int,
      unsigned int smaller_int)
{
   int temp;

   /* If the remainder of larger_int divided by smaller_int is 0,
      then smaller_int is the gcd */
   if ((temp = larger_int % smaller_int) == 0) {
      return(smaller_int);
   }
   /* Make smaller_int the larger integer and the remainder the
      smaller integer, and call this function recursively to
      continue the process */
   return(gcd_recurs(smaller_int, temp));
}

As you can see from Table 10.1, Euclid’s algorithm is superior, especially for large numbers (and imagine if we were working with large longs!).

Had I been implementing GCD determination without Sedgewick’s help, I would surely not have settled for Listing 10.1—but I might well have ended up with Listing 10.2 in my enthusiasm over the “brilliant” discovery of subtracting the lesser number from the greater. In a commercial product, my lack of patience and discipline could have been costly indeed.

Figure 10.3  Using Euclid’s algorithm to find a GCD.

Give your mind time and space to wander around the edges of important programming problems before you settle on any one approach. I titled this book’s first chapter “The Best Optimizer Is between Your Ears,” and that’s still true; what’s even more true is that the optimizer between your ears does its best work not at the implementation stage, but at the very beginning, when you try to imagine how what you want to do and what a computer is capable of doing can best be brought together.

Recursion

Euclid’s algorithm lends itself to recursion beautifully, so much so that an implementation like Listing 10.3 comes almost without thought. Again, though, take a moment to stop and consider what’s really going on, at the assembly language level, in Listing 10.3. There’s recursion and then there’s recursion; code recursion and data recursion, to be exact. Listing 10.3 is code recursion—recursion through calls—the sort most often used because it is conceptually simplest. However, code recursion tends to be slow because it pushes parameters and calls a subroutine for every iteration. Listing 10.4, which uses data recursion, is much faster and no more complicated than Listing 10.3. Actually, you could just say that Listing 10.4 uses a loop and ignore any mention of recursion; conceptually, though, Listing 10.4 performs the same recursive operations that Listing 10.3 does.

LISTING 10.4 L10-4.C

/* Finds and returns the greatest common divisor of two integers.
   Uses Euclid's algorithm: divides the larger integer by the
   smaller; if the remainder is 0, the smaller integer is the GCD,
   otherwise the smaller integer becomes the larger integer, the
   remainder becomes the smaller integer, and the process is
   repeated. Avoids code recursion. */

unsigned int gcd(unsigned int int1, unsigned int int2) {
   unsigned int temp;

   /* Swap if necessary to make sure that int1 >= int2 */
   if (int1 < int2) {
      temp = int1;
      int1 = int2;
      int2 = temp;
   }
   /* Now loop, dividing int1 by int2 and checking the remainder,
      until the remainder is 0. At each step, if the remainder isn't
      0, assign int2 to int1, and the remainder to int2, then
      repeat */
   for (;;) {
      /* If the remainder of int1 divided by int2 is 0, then int2 is
         the gcd */
      if ((temp = int1 % int2) == 0) {
         return(int2);
      }
      /* Make int2 the larger integer and the remainder the
         smaller integer, and repeat the process */
      int1 = int2;
      int2 = temp;
   }
}

Patient Optimization

At long last, we’re ready to optimize GCD determination in the classic sense. Table 10.1 shows the performance of Listing 10.4 with and without Microsoft C/C++’s maximum optimization, and also shows the performance of Listing 10.5, an assembly language version of Listing 10.4. Sure, the optimized versions are faster than the unoptimized version of Listing 10.4—but the gains are small compared to those realized from the higher-level optimizations in Listings 10.2 through 10.4.

LISTING 10.5 L10-5.ASM

; Finds and returns the greatest common divisor of two integers.
; Uses Euclid's algorithm: divides the larger integer by the
; smaller; if the remainder is 0, the smaller integer is the GCD,
; otherwise the smaller integer becomes the larger integer, the
; remainder becomes the smaller integer, and the process is
; repeated. Avoids code recursion.
;
;
;
; C near-callable as:
; unsigned int gcd(unsigned int int1, unsigned int int2);

; Parameter structure:
parms struc
      dw    ?              ;pushed BP
      dw    ?              ;pushed return address
int1  dw    ?              ;integers for which to find
int2  dw    ?              ; the GCD
parms ends

      .model         small
      .code
      public         _gcd
      align 2
_gcd  proc  near
      push  bp             ;preserve caller's stack frame
      mov   bp,sp          ;set up our stack frame
      push  si             ;preserve caller's register variables
      push  di

;Swap if necessary to make sure that int1 >= int2
      mov   ax,int1[bp]
      mov   bx,int2[bp]
      cmp   ax,bx          ;is int1 >= int2?
      jnb   IntsSet        ;yes, so we're all set
      xchg  ax,bx          ;no, so swap int1 and int2
IntsSet:

; Now loop, dividing int1 by int2 and checking the remainder, until
; the remainder is 0. At each step, if the remainder isn't 0, assign
; int2 to int1, and the remainder to int2, then repeat.
GCDLoop:
                           ;if the remainder of int1 divided by
                           ; int2 is 0, then int2 is the gcd
      sub   dx,dx          ;prepare int1 in DX:AX for division
      div   bx             ;int1/int2; remainder is in DX
      and   dx,dx          ;is the remainder zero?
      jz    Done           ;yes, so int2 (BX) is the gcd
                           ;no, so move int2 to int1 and the
                           ; remainder to int2, and repeat the
                           ; process
      mov   ax,bx          ;int1 = int2;
      mov   bx,dx          ;int2 = remainder from DIV

;—start of loop unrolling; the above is repeated three times—
      sub   dx,dx          ;prepare int1 in DX:AX for division
      div   bx             ;int1/int2; remainder is in DX
      and   dx,dx          ;is the remainder zero?
      jz    Done           ;yes, so int2 (BX) is the gcd
      mov   ax,bx          ;int1 = int2;
      mov   bx,dx          ;int2 = remainder from DIV
;—
      sub   dx,dx          ;prepare int1 in DX:AX for division
      div   bx             ;int1/int2; remainder is in DX
      and   dx,dx          ;is the remainder zero?
      jz    Done           ;yes, so int2 (BX) is the gcd
      mov   ax,bx          ;int1 = int2;
      mov   bx,dx          ;int2 = remainder from DIV
;—
      sub   dx,dx          ;prepare int1 in DX:AX for division
      div   bx             ;int1/int2; remainder is in DX
      and   dx,dx          ;is the remainder zero?
      jz    Done           ;yes, so int2 (BX) is the gcd
      mov   ax,bx          ;int1 = int2;
      mov   bx,dx          ;int2 = remainder from DIV
;—end of loop unrolling—
      jmp   GCDLoop

      align 2
Done:
      mov   ax,bx          ;return the GCD
      pop   di             ;restore caller's register variables
      pop   si
      pop   bp             ;restore caller's stack frame
      ret
_gcd  endp
      end

Assembly language optimization is pattern matching on a local scale. Frankly, it’s also the sort of boring, brute-force work that people are lousy at; compilers could out-optimize you at this level with one pass tied behind their back if they knew as much about the code you’re writing as you do, which they don’t.

Design optimization—conceptual breakthroughs in understanding the relationships between the needs of an application, the nature of the data the application works with, and what the computer can do—is global pattern matching.

Computers are much worse at that sort of pattern matching than humans; computers have no way to integrate vast amounts of disparate information, much of it only vaguely defined or subject to change. People, oddly enough, are better at global optimization than at local optimization. For one thing, it’s more interesting. For another, it’s complex and imprecise enough to allow intuition and inspiration, two vastly underrated programming tools, to come to the fore. And, as I pointed out earlier, people tend to perform instantaneous solutions to even the most complex problems, while computers bog down in geometrically or exponentially increasing execution times. Oh, it may take days or weeks for a person to absorb enough information to be able to reach a solution, and the solution may only be near-optimal—but the solution itself (or, at least, each of the pieces of the solution) arrives in a flash.

Those flashes are your programming pattern matcher doing its job. Your job is to give your pattern matcher the opportunity to get to know each problem and run through it two or three times, from different angles, to see what unexpected solutions it can come up with.

Pull back the reins a little. Don’t measure progress by lines of code written today; measure it instead by overall progress and by quality. Relax and listen to that quiet inner voice that provides the real breakthroughs. Stop, look, listen—and think. Not only will you find that it’s a more productive and creative way to program—but you’ll also find that it’s more fun.

And think what you could do with all those extra computer years!

Chapter 11 – Pushing the 286 and 386

New Registers, New Instructions, New Timings, New Complications

This chapter, adapted from my earlier book Zen of Assembly Language (1989; now out of print), provides an overview of the 286 and 386, often contrasting those processors with the 8088. At the time I originally wrote this, the 8088 was the king of processors, and the 286 and 386 were the new kids on the block. Today, of course, all three processors are past their primes, but many millions of each are still in use, and the 386 in particular is still well worth considering when optimizing software.

This chapter provides an interesting look at the evolution of the x86 architecture, to a greater degree than you might expect, for the x86 family came into full maturity with the 386; the 486 and the Pentium are really nothing more than faster 386s, with very little in the way of new functionality. In contrast, the 286 added a number of instructions, respectable performance, and protected mode to the 8088’s capabilities, and the 386 added more instructions and a whole new set of addressing modes, and brought the x86 family into the 32-bit world that represents the future (and, increasingly, the present) of personal computing. This chapter also provides insight into the effects on optimization of the variations in processors and memory architectures that are common in the PC world. So, although the 286 and 386 no longer represent the mainstream of computing, this chapter is a useful mix of history lesson, x86 overview, and details on two workhorse processors that are still in wide use.

Family Matters

While the x86 family is a large one, only a few members of the family—which includes the 8088, 8086, 80188, 80186, 286, 386SX, 386DX, numerous permutations of the 486, and now the Pentium—really matter.

The 8088 is now all but extinct in the PC arena. The 8086 was used fairly widely for a while, but has now all but disappeared. The 80186 and 80188 never really caught on for use in PCs and don’t require further discussion.

That leaves us with the high-end chips: the 286, the 386SX, the 386, the 486, and the Pentium. At this writing, the 386SX is fast going the way of the 8088; people are realizing that its relatively small cost advantage over the 386 isn’t enough to offset its relatively large performance disadvantage. After all, the 386SX suffers from the same debilitating problem that looms over the 8088—a too-small bus. Internally, the 386SX is a 32-bit processor, but externally, it’s a 16-bit processor, a non-optimal architecture, especially for 32-bit code.

I’m not going to discuss the 386SX in detail. If you do find yourself programming for the 386SX, follow the same general rules you should follow for the 8088: use short instructions, use the registers as heavily as possible, and don’t branch. In other words, avoid memory, since the 386SX is by definition better at processing data internally than it is at accessing memory.

The 486 is a world unto itself for the purposes of optimization, and the Pentium is a universe unto itself. We’ll treat them separately in later chapters.

This leaves us with just two processors: the 286 and the 386. Each was the PC standard in its day. The 286 is no longer used in new systems, but there are millions of 286-based systems still in daily use. The 386 is still being used in new systems, although it’s on the downhill leg of its lifespan, and it is in even wider use than the 286. The future clearly belongs to the 486 and Pentium, but the 286 and 386 are still very much a part of the present-day landscape.

Crossing the Gulf to the 286 and the 386

Apart from vastly improved performance, the biggest difference between the 8088 and the 286 and 386 (as well as the later Intel CPUs) is that the 286 introduced protected mode, and the 386 greatly expanded the capabilities of protected mode. We’re only going to talk about real-mode operation of the 286 and 386 in this book, however. Protected mode offers a whole new memory management scheme, one that isn’t supported by the 8088. Only code specifically written for protected mode can run in that mode; it’s an alien and hostile environment for MS-DOS programs.

In particular, segments are different creatures in protected mode. They’re selectors—indexes into a table of segment descriptors—rather than plain old registers, and can’t be set to arbitrary values. That means that segments can’t be used for temporary storage or as part of a fast indivisible 32-bit load from memory, as in

les  ax,dword ptr [LongVar]
mov  dx,es

which loads LongVar into DX:AX faster than this:

mov  ax,word ptr [LongVar]
mov  dx,word ptr [LongVar+2]

Protected mode uses those altered segment registers to offer access to a great deal more memory than real mode: The 286 supports 16 megabytes of memory, while the 386 supports 4 gigabytes (4K megabytes) of physical memory and 64 terabytes (64K gigabytes!) of virtual memory.

In protected mode, your programs generally run under an operating system (OS/2, Unix, Windows NT or the like) that exerts much more control over the computer than does MS-DOS. Protected mode operating systems can generally run multiple programs simultaneously, and the performance of any one program may depend far less on code quality than on how efficiently the program uses operating system services and how often and under what circumstances the operating system preempts the program. Protected mode programs are often mostly collections of operating system calls, and the performance of whatever code isn’t operating-system oriented may depend primarily on how large a time slice the operating system gives that code to run in.

In short, taken as a whole, protected mode programming is a different kettle of fish altogether from what I’ve been describing in this book. There’s certainly a knack to optimizing specifically for protected mode under a given operating system…but it’s not what we’ve been learning, and now is not the time to pursue it further. In general, though, the optimization strategies discussed in this book still hold true in protected mode; it’s just issues specific to protected mode or a particular operating system that we won’t discuss.

In the Lair of the Cycle-Eaters, Part II

Under the programming interface, the 286 and 386 differ considerably from the 8088. Nonetheless, with one exception and one addition, the cycle-eaters remain much the same on computers built around the 286 and 386. Next, we’ll review each of the familiar cycle-eaters I covered in Chapter 4 as they apply to the 286 and 386, and we’ll look at the new member of the gang, the data alignment cycle-eater.

The one cycle-eater that vanishes on the 286 and 386 is the 8-bit bus cycle-eater. The 286 is a 16-bit processor both internally and externally, and the 386 is a 32-bit processor both internally and externally, so the Execution Unit/Bus Interface Unit size mismatch that plagues the 8088 is eliminated. Consequently, there’s no longer any need to use byte-sized memory variables in preference to word-sized variables, at least so long as word-sized variables start at even addresses, as we’ll see shortly. On the other hand, access to byte-sized variables still isn’t any slower than access to word-sized variables, so you can use whichever size suits a given task best.

You might think that the elimination of the 8-bit bus cycle-eater would mean that the prefetch queue cycle-eater would also vanish, since on the 8088 the prefetch queue cycle-eater is a side effect of the 8-bit bus. That would seem all the more likely given that both the 286 and the 386 have larger prefetch queues than the 8088 (6 bytes for the 286, 16 bytes for the 386) and can perform memory accesses, including instruction fetches, in far fewer cycles than the 8088.

However, the prefetch queue cycle-eater doesn’t vanish on either the 286 or the 386, for several reasons. For one thing, branching instructions still empty the prefetch queue, so instruction fetching still slows things down after most branches; when the prefetch queue is empty, it doesn’t much matter how big it is. (Even apart from emptying the prefetch queue, branches aren’t particularly fast on the 286 or the 386, at a minimum of seven-plus cycles apiece. Avoid branching whenever possible.)

After a branch it does matter how fast the queue can refill, and there we come to the second reason the prefetch queue cycle-eater lives on: The 286 and 386 are so fast that sometimes the Execution Unit can execute instructions faster than they can be fetched, even though instruction fetching is much faster on the 286 and 386 than on the 8088.

(All other things being equal, too-slow instruction fetching is more of a problem on the 286 than on the 386, since the 386 fetches 4 instruction bytes at a time versus the 2 instruction bytes fetched per memory access by the 286. However, the 386 also typically runs at least twice as fast as the 286, meaning that the 386 can easily execute instructions faster than they can be fetched unless very high-speed memory is used.)

The most significant reason that the prefetch queue cycle-eater not only survives but prospers on the 286 and 386, however, lies in the various memory architectures used in computers built around the 286 and 386. Due to the memory architectures, the 8-bit bus cycle-eater is replaced by a new form of the wait state cycle-eater: wait states on accesses to normal system memory.

System Wait States

The 286 and 386 were designed to lose relatively little performance to the prefetch queue cycle-eater…when used with zero-wait-state memory: memory that can complete memory accesses so rapidly that no wait states are needed. However, true zero-wait-state memory is almost never used with those processors. Why? Because memory that can keep up with a 286 is fairly expensive, and memory that can keep up with a 386 is very expensive. Instead, computer designers use alternative memory architectures that offer more performance for the dollar—but less performance overall—than zero-wait-state memory. (It is possible to build zero-wait-state systems for the 286 and 386; it’s just so expensive that it’s rarely done.)

The IBM AT and true compatibles use one-wait-state memory (some AT clones use zero-wait-state memory, but such clones are less common than one-wait-state AT clones). 386 systems use a wide variety of memory systems—including high-speed caches, interleaved memory, and static-column RAM—that insert anywhere from 0 to about 5 wait states (and many more if 8- or 16-bit memory expansion cards are used); the exact number of wait states inserted at any given time depends on the interaction between the code being executed and the memory system it’s running on.

The performance of most 386 memory systems can vary greatly from one memory access to another, depending on factors such as what data happens to be in the cache and which interleaved bank and/or RAM column was accessed last.

The many memory systems in use make it impossible for us to optimize for 286/386 computers with the precision that’s possible on the 8088. Instead, we must write code that runs reasonably well under the varying conditions found in the 286/386 arena.

The wait states that occur on most accesses to system memory in 286 and 386 computers mean that nearly every access to system memory—memory in DOS’s normal 640K memory area—is slowed down. (Accesses in computers with high-speed caches may be wait-state-free if the desired data is already in the cache, but will certainly encounter wait states if the data isn’t cached; this phenomenon produces highly variable instruction execution times.) While this is our first encounter with system memory wait states, we have run into a wait-state cycle-eater before: the display adapter cycle-eater, which we discussed along with the other 8088 cycle-eaters way back in Chapter 4. System memory generally has fewer wait states per access than display memory. However, system memory is also accessed far more often than display memory, so system memory wait states hurt plenty—and the place they hurt most is instruction fetching.

Consider this: The 286 can store an immediate value to memory, as in MOV [WordVar],0, in just 3 cycles. However, that instruction is 6 bytes long. The 286 is capable of fetching 1 word every 2 cycles; however, the one-wait-state architecture of the AT stretches that to 3 cycles. Consequently, nine cycles are needed to fetch the six instruction bytes. On top of that, 3 cycles are needed to write to memory, bringing the total memory access time to 12 cycles. On balance, memory access time—especially instruction prefetching—greatly exceeds execution time, to the extent that this particular instruction can take up to four times as long to run as it does to execute in the Execution Unit.

And that, my friend, is unmistakably the prefetch queue cycle-eater. I might add that the prefetch queue cycle-eater is in rare good form in the above example: A 4-to-1 ratio of instruction fetch time to execution time is in a class with the best (or worst!) that’s found on the 8088.

Let’s check out the prefetch queue cycle-eater in action. Listing 11.1 times MOV [WordVar],0. The Zen timer reports that on a one-wait-state 10 MHz 286-based AT clone (the computer used for all tests in this chapter), Listing 11.1 runs in 1.27 µs per instruction. That’s 12.7 cycles per instruction, just as we calculated. (That extra seven-tenths of a cycle comes from DRAM refresh, which we’ll get to shortly.)

LISTING 11.1 L11-1.ASM

;
; *** Listing 11.1 ***
;
; Measures the performance of an immediate move to
; memory, in order to demonstrate that the prefetch
; queue cycle-eater is alive and well on the AT.
;
        jmp     Skip
;
        even            ;always make sure word-sized memory
                        ; variables are word-aligned!
WordVar dw      0
;
Skip:
        call    ZTimerOn
        rept    1000
        mov     [WordVar],0
        endm
        call    ZTimerOff

What does this mean? It means that, practically speaking, the 286 as used in the AT doesn’t have a 16-bit bus. From a performance perspective, the 286 in an AT has two-thirds of a 16-bit bus (a 10.7-bit bus?), since every bus access on an AT takes 50 percent longer than it should. A 286 running at 10 MHz should be able to access memory at a maximum rate of 1 word every 200 ns; in a 10 MHz AT, however, that rate is reduced to 1 word every 300 ns by the one-wait-state memory.

In short, a close relative of our old friend the 8-bit bus cycle-eater—the system memory wait state cycle-eater—haunts us still on all but zero-wait-state 286 and 386 computers, and that means that the prefetch queue cycle-eater is alive and well. (The system memory wait state cycle-eater isn’t really a new cycle-eater, but rather a variant of the general wait state cycle-eater, of which the display adapter cycle-eater is yet another variant.) While the 286 in the AT can fetch instructions much faster than can the 8088 in the PC, it can execute those instructions faster still.

The picture is less clear in the 386 world since there are so many different memory architectures, but similar problems can occur in any computer built around a 286 or 386. The prefetch queue cycle-eater is even a factor—albeit a lesser one—on zero-wait-state machines, both because branching empties the queue and because some instructions can outrun even zero-wait-state instruction fetching. (Given the 3-cycle official execution time of the instruction in Listing 11.1, that instruction would still take about 8 cycles per execution on a zero-wait-state AT—5 cycles longer than the official execution time.)

To summarize:

  • Memory-accessing instructions don’t run at their official speeds on non-zero-wait-state 286/386 computers.
  • The prefetch queue cycle-eater reduces performance on 286/386 computers, particularly when non-zero-wait-state memory is used.
  • Branches often execute at less than their rated speeds on the 286 and 386 since the prefetch queue is emptied.
  • The extent to which the prefetch queue and wait states affect performance varies from one 286/386 computer to another, making precise optimization impossible.

What’s to be learned from all this? Several things:

  • Keep your instructions short.
  • Keep it in the registers; avoid memory, since memory generally can’t keep up with the processor.
  • Don’t jump.

Of course, those are exactly the rules that apply to 8088 optimization as well. Isn’t it convenient that the same general rules apply across the board?

Data Alignment

Thanks to its 16-bit bus, the 286 can access word-sized memory variables just as fast as byte-sized variables. There’s a catch, however: That’s only true for word-sized variables that start at even addresses. When the 286 is asked to perform a word-sized access starting at an odd address, it actually performs two separate accesses, each of which fetches 1 byte, just as the 8088 does for all word-sized accesses.

Figure 11.1 illustrates this phenomenon. The conversion of word-sized accesses to odd addresses into double byte-sized accesses is transparent to memory-accessing instructions; all any instruction knows is that the requested word has been accessed, no matter whether 1 word-sized access or 2 byte-sized accesses were required to accomplish it.

The penalty for performing a word-sized access starting at an odd address is easy to calculate: Two accesses take twice as long as one access.

In other words, the effective capacity of the 286’s external data bus is halved when a word-sized access to an odd address is performed.

That, in a nutshell, is the data alignment cycle-eater, the one new cycle-eater of the 286 and 386. (The data alignment cycle-eater is a close relative of the 8088’s 8-bit bus cycle-eater, but since it behaves differently—occurring only at odd addresses—and is avoided with a different workaround, we’ll consider it to be a new cycle-eater.)

Figure 11.1  The data alignment cycle-eater.

The way to deal with the data alignment cycle-eater is straightforward: Don’t perform word-sized accesses to odd addresses on the 286 if you can help it. The easiest way to avoid the data alignment cycle-eater is to place the directive EVEN before each of your word-sized variables. EVEN forces the offset of the next byte assembled to be even by inserting a NOP if the current offset is odd; consequently, you can ensure that any word-sized variable can be accessed efficiently by the 286 simply by preceding it with EVEN.
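
For instance (using a couple of made-up variable names), a byte-sized variable can leave the next offset odd, and a single EVEN directive puts things right for whatever word-sized data follows:

ByteFlag  db      0       ;1 byte--the next offset is now odd
          even            ;inserts a NOP so the next variable starts
                          ; at an even offset
WordCount dw      0       ;word-aligned, so the 286 can read or write
                          ; it in a single word-sized access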

Listing 11.2, which accesses memory a word at a time with each word starting at an odd address, runs on a 10 MHz AT clone in 1.27 µs per repetition of MOVSW, or 0.64 µs per word-sized memory access. That’s 6-plus cycles per word-sized access, which breaks down to two separate memory accesses—3 cycles to access the high byte of each word and 3 cycles to access the low byte of each word, the inevitable result of non-word-aligned word-sized memory accesses—plus a bit extra for DRAM refresh.

LISTING 11.2 L11-2.ASM

;
; *** Listing 11.2 ***
;
; Measures the performance of accesses to word-sized
; variables that start at odd addresses (are not
; word-aligned).
;
Skip:
        push    ds
        pop     es
        mov     si,1    ;source and destination are the same
        mov     di,si   ; and both are not word-aligned
        mov     cx,1000 ;move 1000 words
        cld
        call    ZTimerOn
        rep     movsw
        call    ZTimerOff

On the other hand, Listing 11.3, which is exactly the same as Listing 11.2 save that the memory accesses are word-aligned (start at even addresses), runs in 0.64 µs per repetition of MOVSW, or 0.32 µs per word-sized memory access. That’s 3 cycles per word-sized access—exactly twice as fast as the non-word-aligned accesses of Listing 11.2, just as we predicted.

LISTING 11.3 L11-3.ASM

;
; *** Listing 11.3 ***
;
; Measures the performance of accesses to word-sized
; variables that start at even addresses (are word-aligned).
;
Skip:
        push    ds
        pop     es
        sub     si,si   ;source and destination are the same
        mov     di,si   ; and both are word-aligned
        mov     cx,1000 ;move 1000 words
        cld
        call    ZTimerOn
        rep     movsw
        call    ZTimerOff

The data alignment cycle-eater has intriguing implications for speeding up 286/386 code. The expenditure of a little care and a few bytes to make sure that word-sized variables and memory blocks are word-aligned can literally double the performance of certain code running on the 286. Even if it doesn’t double performance, word alignment usually helps and never hurts.
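
Here’s a minimal sketch of what that care might look like when copying a block of bytes. (The labels are made up; the code assumes DS:SI and ES:DI already point to the source and destination, that the direction flag is clear, that CX holds a nonzero byte count, and that source and destination share the same alignment.)

        test    si,1            ;is the source word-aligned?
        jz      CopyAligned     ;yes--start moving words immediately
        movsb                   ;no--move 1 byte to reach an even address
        dec     cx
CopyAligned:
        shr     cx,1            ;convert the byte count to a word count;
                                ; the odd-byte bit lands in the Carry flag
        rep     movsw           ;move the bulk of the data a word at a time
        jnc     CopyDone        ;was there an odd byte left over?
        movsb                   ;yes--move the final byte
CopyDone:

Aligning first costs a few bytes and a branch, but it lets nearly all of the copy proceed at the full word-aligned rate.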

Code Alignment

Lack of word alignment can also interfere with instruction fetching on the 286, although not to the extent that it interferes with access to word-sized memory variables. The 286 prefetches instructions a word at a time; even if a given instruction doesn’t begin at an even address, the 286 simply fetches the first byte of that instruction at the same time that it fetches the last byte of the previous instruction, as shown in Figure 11.2, then separates the bytes internally. That means that in most cases, instructions run just as fast whether they’re word-aligned or not.

There is, however, a non-word-alignment penalty on branches to odd addresses. On a branch to an odd address, the 286 is only able to fetch 1 useful byte with the first instruction fetch following the branch, as shown in Figure 11.3. In other words, lack of word alignment of the target instruction for any branch effectively cuts the instruction-fetching power of the 286 in half for the first instruction fetch after that branch. While that may not sound like much, you’d be surprised at what it can do to tight loops; in fact, a brief story is in order.

When I was developing the Zen timer, I used my trusty 10 MHz 286-based AT clone to verify the basic functionality of the timer by measuring the performance of simple instruction sequences. I was cruising along with no problems until I timed the following code:

    mov    cx,1000
    call   ZTimerOn
LoopTop:
    loop   LoopTop
    call   ZTimerOff
Figure 11.2  Word-aligned prefetching on the 286.
Figure 11.3  How instruction bytes are fetched after a branch.

Now, this code should run in, say, about 12 cycles per loop at most. Instead, it took over 14 cycles per loop, an execution time that I could not explain in any way. After rolling it around in my head for a while, I took a look at the code under a debugger…and the answer leaped out at me. The loop began at an odd address! That meant that two instruction fetches were required each time through the loop; one to get the opcode byte of the LOOP instruction, which resided at the end of one word-aligned word, and another to get the displacement byte, which resided at the start of the next word-aligned word.

One simple change brought the execution time down to a reasonable 12.5 cycles per loop:

  mov   cx,1000
  call  ZTimerOn
  even
LoopTop:
  loop  LoopTop
  call  ZTimerOff

While word-aligning branch destinations can improve branching performance, it’s a nuisance and can increase code size a good deal, so it’s not worth doing in most code. Besides, EVEN inserts a NOP instruction if necessary, and the time required to execute a NOP can sometimes cancel the performance advantage of having a word-aligned branch destination.

Consequently, it’s best to word-align only those branch destinations that can be reached solely by branching.

I recommend that you only go out of your way to word-align the start offsets of your subroutines, as in:

          even
FindChar  proc near
          :

In my experience, this simple practice is the one form of code alignment that consistently provides a reasonable return for bytes and effort expended, although sometimes it also pays to word-align tight time-critical loops.

Alignment and the 386

So far we’ve only discussed alignment as it pertains to the 286. What, you may well ask, of the 386?

The 386 adds the issue of doubleword alignment (that is, alignment to addresses that are multiples of four). The rule for the 386 is: Word-sized memory accesses should be word-aligned (it’s impossible for word-aligned word-sized accesses to cross doubleword boundaries), and doubleword-sized memory accesses should be doubleword-aligned. However, in real (as opposed to 32-bit protected) mode, doubleword-sized memory accesses are rare, so the simple word-alignment rule we’ve developed for the 286 serves for the 386 in real mode as well.

As for code alignment…the subroutine-start word-alignment rule of the 286 serves reasonably well there too since it avoids the worst case, where just 1 byte is fetched on entry to a subroutine. While optimum performance would dictate doubleword alignment of subroutines, that can take up to 3 bytes of padding, a high price to pay for an optimization that improves performance only on post-286 processors.

Alignment and the Stack

One side-effect of the data alignment cycle-eater of the 286 and 386 is that you should never allow the stack pointer to become odd. (You can make the stack pointer odd by adding an odd value to it or subtracting an odd value from it, or by loading it with an odd value.) An odd stack pointer on the 286 or 386 (or a non-doubleword-aligned stack in 32-bit protected mode on the 386, 486, or Pentium) will significantly reduce the performance of PUSH, POP, CALL, and RET, as well as INT and IRET, which are executed to invoke DOS and BIOS functions, handle keystrokes and incoming serial characters, and manage the mouse. I know of a Forth programmer who vastly improved the performance of a complex application on the AT simply by forcing the Forth interpreter to maintain an even stack pointer at all times.

An interesting corollary to this rule is that you shouldn’t INC SP twice to add 2, even though that takes fewer bytes than ADD SP,2. The stack pointer is odd between the first and second INC, so any interrupt occurring between the two instructions will be serviced more slowly than it normally would. The same goes for decrementing twice; use SUB SP,2 instead.
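
In other words, avoid sequences like this hypothetical stack cleanup:

        inc     sp      ;SP is odd after this instruction; an interrupt
        inc     sp      ; arriving here pays the misaligned-stack penalty

and use this instead, even though it costs an extra byte:

        add     sp,2    ;SP goes from even back to even in one step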

Keep the stack pointer aligned at all times.

The DRAM Refresh Cycle-Eater: Still an Act of God

The DRAM refresh cycle-eater is the cycle-eater that’s least changed from its 8088 form on the 286 and 386. In the AT, DRAM refresh uses a little over five percent of all available memory accesses, slightly less than it uses in the PC, but in the same ballpark. While the DRAM refresh penalty varies somewhat on various AT clones and 386 computers (in fact, a few computers are built around static RAM, which requires no refresh at all; likewise, caches are made of static RAM so cached systems generally suffer less from DRAM refresh), the 5 percent figure is a good rule of thumb.

Basically, the effect of the DRAM refresh cycle-eater is pretty much the same throughout the PC-compatible world: fairly small, so it doesn’t greatly affect performance; unavoidable, so there’s no point in worrying about it anyway; and a nuisance since it results in fractional cycle counts when using the Zen timer. Just as with the PC, a given code sequence on the AT can execute at varying speeds at different times as a result of the interaction between the code and DRAM refresh.

There’s nothing much new with DRAM refresh on 286/386 computers, then. Be aware of it, but don’t overly concern yourself—DRAM refresh is still an act of God, and there’s not a blessed thing you can do about it. Happily, the internal caches of the 486 and Pentium make DRAM refresh largely a performance non-issue on those processors.

The Display Adapter Cycle-Eater

Finally we come to the last of the cycle-eaters, the display adapter cycle-eater. There are two ways of looking at this cycle-eater on 286/386 computers: (1) It’s much worse than it was on the PC, or (2) it’s just about the same as it was on the PC.

Either way, the display adapter cycle-eater is extremely bad news on 286/386 computers and on 486s and Pentiums as well. In fact, this cycle-eater on those systems is largely responsible for the popularity of VESA local bus (VLB).

The two ways of looking at the display adapter cycle-eater on 286/386 computers are actually the same. As you’ll recall from my earlier discussion of the matter in Chapter 4, display adapters offer only a limited number of accesses to display memory during any given period of time. The 8088 is capable of making use of most but not all of those slots with REP MOVSW, so the number of memory accesses allowed by a display adapter such as a standard VGA is reasonably well-matched to an 8088’s memory access speed. Granted, access to a VGA slows the 8088 down considerably—but, as we’re about to find out, “considerably” is a relative term. What a VGA does to PC performance is nothing compared to what it does to faster computers.

Under ideal conditions, a 286 can access memory much, much faster than an 8088. A 10 MHz 286 is capable of accessing a word of system memory every 0.20 µs with REP MOVSW, dwarfing the 1 byte every 1.31 µs that the 8088 in a PC can manage. However, access to display memory is anything but ideal for a 286. For one thing, most display adapters are 8-bit devices, although newer adapters are 16-bit in nature. One consequence of that is that only 1 byte can be read or written per access to display memory; word-sized accesses to 8-bit devices are automatically split into 2 separate byte-sized accesses by the AT’s bus. Another consequence is that accesses are simply slower; the AT’s bus inserts additional wait states on accesses to 8-bit devices since it must assume that such devices were designed for PCs and may not run reliably at AT speeds.

However, the 8-bit size of most display adapters is but one of the two factors that reduce the speed with which the 286 can access display memory. Far more cycles are eaten by the inherent memory-access limitations of display adapters—that is, the limited number of display memory accesses that display adapters make available to the 286. Look at it this way: If REP MOVSW on a PC can use more than half of all available accesses to display memory, then how much faster can code running on a 286 or 386 possibly run when accessing display memory?

That’s right—less than twice as fast.

In other words, instructions that access display memory won’t run a whole lot faster on ATs and faster computers than they do on PCs. That explains one of the two viewpoints expressed at the beginning of this section: The display adapter cycle-eater is just about the same on high-end computers as it is on the PC, in the sense that it allows instructions that access display memory to run at just about the same speed on all computers.

Of course, the picture is quite a bit different when you compare the performance of instructions that access display memory to the maximum performance of those instructions. Instructions that access display memory receive many more wait states when running on a 286 than they do on an 8088. Why? While the 286 is capable of accessing memory much more often than the 8088, we’ve seen that the frequency of access to display memory is determined not by processor speed but by the display adapter itself. As a result, both processors are actually allowed just about the same maximum number of accesses to display memory in any given time. By definition, then, the 286 must spend many more cycles waiting than does the 8088.

And that explains the second viewpoint expressed above regarding the display adapter cycle-eater vis-a-vis the 286 and 386. The display adapter cycle-eater, as measured in cycles lost to wait states, is indeed much worse on AT-class computers than it is on the PC, and it’s worse still on more powerful computers.

How bad is the display adapter cycle-eater on an AT? It’s this bad: Based on my (not inconsiderable) experience in timing display adapter access, I’ve found that the display adapter cycle-eater can slow an AT—or even a 386 computer—to near-PC speeds when display memory is accessed.

I know that’s hard to believe, but the display adapter cycle-eater gives out just so many display memory accesses in a given time, and no more, no matter how fast the processor is. In fact, the faster the processor, the more the display adapter cycle-eater hurts the performance of instructions that access display memory. The display adapter cycle-eater is not only still present in 286/386 computers, it’s worse than ever.

What can we do about this new, more virulent form of the display adapter cycle-eater? The workaround is the same as it was on the PC: Access display memory as little as you possibly can.

New Instructions and Features: The 286

The 286 and 386 offer a number of new instructions. The 286 has a relatively small number of instructions that the 8088 lacks, while the 386 has those instructions and quite a few more, along with new addressing modes and data sizes. We’ll discuss the 286 and the 386 separately in this regard.

The 286 has a number of instructions designed for protected-mode operations. As I’ve said, we’re not going to discuss protected mode in this book; in any case, protected-mode instructions are generally used only by operating systems. (I should mention that the 286’s protected mode brings with it the ability to address 16 MB of memory, a considerable improvement over the 8088’s 1 MB. In real mode, however, programs are still limited to 1 MB of addressable memory on the 286. In either mode, each segment is still limited to 64K.)

There are also a handful of 286-specific real-mode instructions, and they can be quite useful. BOUND checks array bounds. ENTER and LEAVE support compact and speedy stack frame construction and removal, ideal for interfacing to high-level languages such as C and Pascal (although these instructions are actually relatively slow on the 386 and its successors, and should be used with caution when performance matters). INS and OUTS are new string instructions that support efficient data transfer between memory and I/O ports. Finally, PUSHA and POPA push and pop all eight general-purpose registers.

A couple of old instructions gain new features on the 286. For one, the 286 version of PUSH is capable of pushing a constant on the stack. For another, the 286 allows all shifts and rotates to be performed for not just 1 bit or the number of bits specified by CL, but for any constant number of bits.
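
For example, where the 8088 forces you to write

        mov     cl,4
        shl     ax,cl           ;8088: multi-bit shifts must go through CL
        mov     dx,1234h
        push    dx              ;8088: no PUSH immediate

the 286 (with the assembler’s .286 directive in effect) lets you write simply

        shl     ax,4            ;286: shift by any constant number of bits
        push    1234h           ;286: push a constant directly on the stack

which is shorter, faster, and leaves CL and DX untouched.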

New Instructions and Features: The 386

The 386 is somewhat more complex than the 286 regarding new features. Once again, we won’t discuss protected mode, which on the 386 comes with the ability to address up to 4 gigabytes per segment and 64 terabytes in all. In real mode (and in virtual-86 mode, which allows the 386 to multitask MS-DOS applications, and which is identical to real mode so far as MS-DOS programs are concerned), programs running on the 386 are still limited to 1 MB of addressable memory and 64K per segment.

The 386 has many new instructions, as well as new registers, addressing modes and data sizes that have trickled down from protected mode. Let’s take a quick look at these new real-mode features.

Even in real mode, it’s possible to access many of the 386’s new and extended registers. Most of these registers are simply 32-bit extensions of the 16-bit registers of the 8088. For example, EAX is a 32-bit register containing AX as its lower 16 bits, EBX is a 32-bit register containing BX as its lower 16 bits, and so on. There are also two new segment registers: FS and GS.

The 386 also comes with a slew of new real-mode instructions beyond those supported by the 8088 and 286. These instructions can scan data on a bit-by-bit basis, set the Carry flag to the value of a specified bit, sign-extend or zero-extend data as it’s moved, set a register or memory variable to 1 or 0 on the basis of any of the conditions that can be tested with conditional jumps, and more. (Again, beware: Many of these complex 386-specific instructions are slower than equivalent sequences of simple instructions on the 486 and especially on the Pentium.) What’s more, both old and new instructions support 32-bit operations on the 386. For example, it’s relatively simple to copy data in chunks of 4 bytes on a 386, even in real mode, by using the MOVSD (“move string double”) instruction, or to negate a 32-bit value with NEG eax.
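
As a quick sketch of what that looks like in practice, here is a real-mode copy loop that moves data 4 bytes at a time. (This assumes the assembler’s .386 directive is in effect, that DS:SI and ES:DI already point to the source and destination, that the direction flag is clear, and that CX holds a byte count that happens to be a multiple of 4; all of that setup is hypothetical.)

        shr     cx,2            ;convert the byte count to a doubleword count
        rep     movsd           ;copy 4 bytes per repetition, even in real mode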

Finally, it’s possible in real mode to use the 386’s new addressing modes, in which any 32-bit general-purpose register or pair of registers can be used to address memory. What’s more, multiplication of memory-addressing registers by 2, 4, or 8 for look-ups in word, doubleword, or quadword tables can be built right into the memory addressing mode. (The 32-bit addressing modes are discussed further in later chapters.) In protected mode, these new addressing modes allow you to address a full 4 gigabytes per segment, but in real mode you’re still limited to 64K, even with 32-bit registers and the new addressing modes, unless you play some unorthodox tricks with the segment registers.

Note well: Those tricks don’t necessarily work with system software such as Windows, so I’d recommend against using them. If you want 4-gigabyte segments, use a 32-bit environment such as Win32.
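
Here’s a small real-mode example of the scaled addressing just described, looking up an entry in a hypothetical table of words. (Again, .386 must be in effect, and the table must lie within the usual 64K real-mode limit.)

        movzx   ebx,bx                  ;make sure the upper half of EBX is 0
        mov     ax,[WordTable+ebx*2]    ;the *2 scaling is built right into
                                        ; the addressing mode--no separate
                                        ; shift or add needed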

Optimization Rules: The More Things Change…

Let’s see what we’ve learned about 286/386 optimization. Mostly what we’ve learned is that our familiar PC cycle-eaters still apply, although in somewhat different forms, and that the major optimization rules for the PC hold true on ATs and 386-based computers. You won’t go wrong on any of these computers if you keep your instructions short, use the registers heavily and avoid memory, don’t branch, and avoid accessing display memory like the plague.

Although we haven’t touched on them, repeated string instructions are still desirable on the 286 and 386 since they provide a great deal of functionality per instruction byte and eliminate both the prefetch queue cycle-eater and branching. However, string instructions are not quite so spectacularly superior on the 286 and 386 as they are on the 8088 since non-string memory-accessing instructions have been speeded up considerably on the newer processors.

There’s one cycle-eater with new implications on the 286 and 386, and that’s the data alignment cycle-eater. From the data alignment cycle-eater we get a new rule: Word-align your word-sized variables, and start your subroutines at even addresses.

Detailed Optimization

While the major 8088 optimization rules hold true on computers built around the 286 and 386, many of the instruction-specific optimizations no longer hold, for the execution times of most instructions are quite different on the 286 and 386 than on the 8088. We have already seen one such example of the sometimes vast difference between 8088 and 286/386 instruction execution times: MOV [WordVar],0, which has an Execution Unit execution time of 20 cycles on the 8088, has an EU execution time of just 3 cycles on the 286 and 2 cycles on the 386.

In fact, the performance of virtually all memory-accessing instructions has been improved enormously on the 286 and 386. The key to this improvement is the near elimination of effective address (EA) calculation time. Where an 8088 takes from 5 to 12 cycles to calculate an EA, a 286 or 386 usually takes no time whatsoever to perform the calculation. If a base+index+displacement addressing mode, such as MOV AX,[WordArray+bx+si], is used on a 286 or 386, 1 cycle is taken to perform the EA calculation, but that’s both the worst case and the only case in which there’s any EA overhead at all.

The elimination of EA calculation time means that the EU execution time of memory-addressing instructions is much closer to the EU execution time of register-only instructions. For instance, on the 8088 ADD [WordVar],100H is a 31-cycle instruction, while ADD DX,100H is a 4-cycle instruction—a ratio of nearly 8 to 1. By contrast, on the 286 ADD [WordVar],100H is a 7-cycle instruction, while ADD DX,100H is a 3-cycle instruction—a ratio of just 2.3 to 1.

It would seem, then, that it’s less necessary to use the registers on the 286 than it was on the 8088, but that’s simply not the case, for reasons we’ve already seen. The key is this: The 286 can execute memory-addressing instructions so fast that there’s no spare instruction prefetching time during those instructions, so the prefetch queue runs dry, especially on the AT, with its one-wait-state memory. On the AT, the 6-byte instruction ADD [WordVar],100H is effectively at least a 15-cycle instruction, because 3 cycles are needed to fetch each of the three instruction words and 6 more cycles are needed to read WordVar and write the result back to memory.

Granted, the register-only instruction ADD DX,100H also slows down—to 6 cycles—because of instruction prefetching, leaving a ratio of 2.5 to 1. Now, however, let’s look at the performance of the same code on an 8088. The register-only code would run in 16 cycles (4 instruction bytes at 4 cycles per byte), while the memory-accessing code would run in 40 cycles (6 instruction bytes at 4 cycles per byte, plus 2 word-sized memory accesses at 8 cycles per word). That’s a ratio of 2.5 to 1, exactly the same as on the 286.

This is all theoretical. We put our trust not in theory but in actual performance, so let’s run this code through the Zen timer. On a PC, Listing 11.4, which performs register-only addition, runs in 3.62 ms, while Listing 11.5, which performs addition to a memory variable, runs in 10.05 ms. On a 10 MHz AT clone, Listing 11.4 runs in 0.64 ms, while Listing 11.5 runs in 1.80 ms. Obviously, the AT is much faster…but the ratio of Listing 11.5 to Listing 11.4 is virtually identical on both computers, at 2.78 for the PC and 2.81 for the AT. If anything, the register-only form of ADD has a slightly larger advantage on the AT than it does on the PC in this case.

Theory confirmed.

LISTING 11.4 L11-4.ASM

;
; *** Listing 11.4 ***
;
; Measures the performance of adding an immediate value
; to a register, for comparison with Listing 11.5, which
; adds an immediate value to a memory variable.
;
        call    ZTimerOn
        rept    1000
        add     dx,100h
        endm
        call    ZTimerOff

LISTING 11.5 L11-5.ASM

;
; *** Listing 11.5 ***
;
; Measures the performance of adding an immediate value
; to a memory variable, for comparison with Listing 11.4,
; which adds an immediate value to a register.
;
        jmp     Skip
;
        even            ;always make sure word-sized memory
                        ; variables are word-aligned!
WordVar dw      0
;
Skip:
        call    ZTimerOn
        rept    1000
        add     [WordVar],100h
        endm
        call    ZTimerOff

What’s going on? Simply this: Instruction fetching is controlling overall execution time on both processors. Both the 8088 in a PC and the 286 in an AT can execute the bytes of the instructions in Listings 11.4 and 11.5 faster than they can be fetched. Since the instructions are exactly the same lengths on both processors, it stands to reason that the ratio of the overall execution times of the instructions should be the same on both processors as well. Instruction length controls execution time, and the instruction lengths are the same—therefore the ratios of the execution times are the same. The 286 can both fetch and execute instruction bytes faster than the 8088 can, so code executes much faster on the 286; nonetheless, because the 286 can also execute those instruction bytes much faster than it can fetch them, overall performance is still largely determined by the size of the instructions.

Is this always the case? No. When the prefetch queue is full, memory-accessing instructions on the 286 and 386 are much faster (relative to register-only instructions) than they are on the 8088. Given the system wait states prevalent on 286 and 386 computers, however, the prefetch queue is likely to be empty quite a bit, especially when code consisting of instructions with short EU execution times is executed. Of course, that’s just the sort of code we’re likely to write when we’re optimizing, so the performance of high-speed code is more likely to be controlled by instruction size than by EU execution time on most 286 and 386 computers, just as it is on the PC.

All of which is just a way of saying that faster memory access and EA calculation notwithstanding, it’s just as desirable to keep instructions short and memory accesses to a minimum on the 286 and 386 as it is on the 8088. And the way to do that is to use the registers as heavily as possible, use string instructions, use short forms of instructions, and the like.
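
As a reminder of what “short forms” means here, compare a couple of typical cases:

        inc     dx              ;1 byte
        add     dx,1            ;3 bytes--same result, but more to fetch

        xor     ax,ax           ;2 bytes
        mov     ax,0            ;3 bytes--same result, but more to fetch
                                ; (note that XOR alters the flags and MOV
                                ; does not, so they're not always
                                ; interchangeable)

On a processor that’s usually waiting on instruction fetching, those saved bytes translate directly into saved cycles.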

The more things change, the more they remain the same….

POPF and the 286

We’ve one final 286-related item to discuss: the hardware malfunction of POPF under certain circumstances on the 286.

The problem is this: Sometimes POPF permits interrupts to occur when interrupts are initially off and the setting popped into the Interrupt flag from the stack keeps interrupts off. In other words, an interrupt can happen even though the Interrupt flag is never set to 1. Now, I don’t want to blow this particular bug out of proportion. It only causes problems in code that cannot tolerate interrupts under any circumstances, and that’s a rare sort of code, especially in user programs. However, some code really does need to have interrupts absolutely disabled, with no chance of an interrupt sneaking through. For example, a critical portion of a disk BIOS might need to retrieve data from the disk controller the instant it becomes available; even a few hundred microseconds of delay could result in a sector’s worth of data misread. In this case, one misplaced interrupt during a POPF could result in a trashed hard disk if that interrupt occurs while the disk BIOS is reading a sector of the File Allocation Table.

There is a workaround for the POPF bug. While the workaround is easy to use, it’s considerably slower than POPF, and costs a few bytes as well, so you won’t want to use it in code that can tolerate interrupts. On the other hand, in code that truly cannot be interrupted, you should view those extra cycles and bytes as cheap insurance against mysterious and erratic program crashes.

One obvious reason to discuss the POPF workaround is that it’s useful. Another reason is that the workaround is an excellent example of Zen-level assembly coding, in that there’s a well-defined goal to be achieved but no obvious way to do so. The goal is to reproduce the functionality of the POPF instruction without using POPF, and the place to start is by asking exactly what POPF does.

All POPF does is pop the word on top of the stack into the FLAGS register, as shown in Figure 11.4. How can we do that without POPF? Of course, the 286’s designers intended us to use POPF for this purpose, and didn’t intentionally provide any alternative approach, so we’ll have to devise an alternative approach of our own. To do that, we’ll have to search for instructions that contain some of the same functionality as POPF, in the hope that one of those instructions can be used in some way to replace POPF.

Well, there’s only one instruction other than POPF that loads the FLAGS register directly from the stack, and that’s IRET, which loads the FLAGS register from the stack as it branches, as shown in Figure 11.5. IRET has no known bugs of the sort that plague POPF, so it’s certainly a candidate to replace POPF in non-interruptible applications. Unfortunately, IRET loads the FLAGS register with the third word down on the stack, not the word on top of the stack, as is the case with POPF; the far return address that IRET pops into CS:IP lies between the top of the stack and the word popped into the FLAGS register.

Obviously, the segment:offset that IRET expects to find on the stack above the pushed flags isn’t present when the stack is set up for POPF, so we’ll have to adjust the stack a bit before we can substitute IRET for POPF. What we’ll have to do is push the segment:offset of the instruction after our workaround code onto the stack right above the pushed flags. IRET will then branch to that address and pop the flags, ending up at the instruction after the workaround code with the flags popped. That’s just the result that would have occurred had we executed POPF—with the bonus that no interrupts can accidentally occur when the Interrupt flag is 0 both before and after the pop.

Figure 11.4  The operation of POPF.

How can we push the segment:offset of the next instruction? Well, finding the offset of the next instruction by performing a near call to that instruction is a tried-and-true trick. We can do something similar here, but in this case we need a far call, since IRET requires both a segment and an offset. We’ll also branch backward so that the address pushed on the stack will point to the instruction we want to continue with. The code works out like this:

      jmp short popfskip
popfiret:
      iret       ;branches to the instruction after the
                 ; call, popping the word below the address
                 ; pushed by CALL into the FLAGS register
popfskip:
      call  far ptr popfiret
                 ;pushes the segment:offset of the next
                 ; instruction on the stack just above
                 ; the flags word, setting things up so
                 ; that IRET will branch to the next
                 ; instruction and pop the flags
; When execution reaches the instruction following this comment,
; the word that was on top of the stack when JMP SHORT POPFSKIP
; was reached has been popped into the FLAGS register, just as
; if a POPF instruction had been executed.
Figure 11.5  The operation of IRET.

The operation of this code is illustrated in Figure 11.6.

The POPF workaround can best be implemented as a macro; we can also emulate a far call by pushing CS and performing a near call, thereby shrinking the workaround code by 1 byte:

EMULATE_POPF             macro
     local popfskip, popfiret
     jmp   short popfskip
popfiret:
     iret
popfskip:
     push  cs
     call  popfiret
     endm
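
A hypothetical usage sketch, guarding a critical section in the usual PUSHF/CLI/POPF style but with the macro standing in for POPF so the 286 bug can’t bite:

        pushf                   ;save the caller's Interrupt flag setting
        cli                     ;interrupts off for the critical section
        ;...code that must not be interrupted goes here...
        EMULATE_POPF            ;restore the caller's flags without
                                ; risking the 286 POPF bug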

By the way, the flags can be popped much more quickly if you’re willing to alter a register in the process. For example, the following macro emulates POPF with just one branch, but wipes out AX:

EMULATE_POPF_TRASH_AX   macro
   push  cs
   mov   ax,offset $+5
   push  ax
   iret
   endm

It’s not a perfect substitute for POPF, since POPF doesn’t alter any registers, but it’s faster and shorter than EMULATE_POPF when you can spare the register. If you’re using 286-specific instructions, you can use the following version of EMULATE_POPF, which is shorter still, alters no registers, and branches just once. (Of course, this version of EMULATE_POPF won’t work on an 8088.)

      .286
                 :
EMULATE_POPF macro
      push cs
      push offset $+4
      iret
      endm
Figure 11.6  Workaround code for the POPF bug.

The standard version of EMULATE_POPF is 6 bytes longer than POPF and much slower, as you’d expect given that it involves three branches. Anyone in his/her right mind would prefer POPF to a larger, slower, three-branch macro—given a choice. In non-interruptible code, however, there’s no choice here; the safer—if slower—approach is the best. (Having people associate your programs with crashed computers is not a desirable situation, no matter how unfair the circumstances under which it occurs.)

And now you know the nature of and the workaround for the POPF bug. Whether you ever need the workaround or not, it’s a neatly packaged example of the tremendous flexibility of the x86 instruction set.

Chapter 12 – Pushing the 486

It’s Not Just a Bigger 386

So this traveling salesman is walking down a road, and he sees a group of men digging a ditch with their bare hands. “Whoa, there!” he says. “What you guys need is a Model 8088 ditch digger!” And he whips out a trowel and sells it to them.

A few days later, he stops back around. They’re happy with the trowel, but he sells them the latest ditch-digging technology, the Model 80286 spade. That keeps them content until he stops by again with a Model 80386 shovel (a full 32 inches wide, with a narrow point to emulate the trowel), and that holds them until he comes back around with what they really need: a Model 80486 bulldozer.

Having reached the top of the line, the salesman doesn’t pay them a call for a while. When he does, not only are they none too friendly, but they’re digging with the 80386 shovel; the bulldozer is sitting off to one side. “Why on earth are you using that shovel?” the salesman asks. “Why aren’t you digging with the bulldozer?”

“Well, Lord knows we tried,” says the foreman, “but it was all we could do just to lift the damn thing!”

Substitute “processor” for the various digging implements, and you get an idea of just how different the optimization rules for the 486 are from what you’re used to. Okay, it’s not quite that bad—but upon encountering a processor where string instructions are often to be avoided and memory-to-register MOVs are frequently as fast as register-to-register MOVs, Dorothy was heard to exclaim (before she sank out of sight in a swirl of hopelessly mixed metaphors), “I don’t think we’re in Kansas anymore, Toto.”

Enter the 486

No chip that is a direct, fully compatible descendant of the 8088, 286, and 386 could ever be called a RISC chip, but the 486 certainly contains RISC elements, and it’s those elements that are most responsible for making 486 optimization unique. Simple, common instructions are executed in a single cycle by a RISC-like core processor, but other instructions are executed pretty much as they were on the 386, where every instruction takes at least 2 cycles. For example, MOV AL, [TestChar] takes only 1 cycle on the 486, assuming both instruction and data are in the cache—3 cycles faster than the 386—but STOSB takes 5 cycles, 1 cycle slower than on the 386. The floating-point execution unit inside the 486 is also much faster than the 387 math coprocessor, largely because, being in the same silicon as the CPU (the 486 has a math coprocessor built in), it is more tightly coupled. The results are sometimes startling: FMUL (floating point multiply) is usually faster on the 486 than IMUL (integer multiply)!

An encyclopedic approach to 486 optimization would take a book all by itself, so in this chapter I’m only going to hit the highlights of 486 optimization, touching on several optimization rules, some documented, some not. You might also want to check out the following sources of 486 information: i486 Microprocessor Programmer’s Reference Manual, from Intel; “8086 Optimization: Aim Down the Middle and Pray,” in the March, 1991 Dr. Dobb’s Journal; and “Peak Performance: On to the 486,” in the November, 1990 Programmer’s Journal.

Rules to Optimize By

In Appendix G of the i486 Microprocessor Programmer’s Reference Manual, Intel lists a number of optimization techniques for the 486. While neither exhaustive (we’ll look at two undocumented optimizations shortly) nor entirely accurate (we’ll correct two of the rules here), Intel’s list is certainly a good starting point. In particular, the list conveys the extent to which 486 optimization differs from optimization for earlier x86 processors. Generally, I’ll be discussing optimization for real mode (it being the most widely used mode at the moment), although many of the rules should apply to protected mode as well.

486 optimization is generally more precise and less frustrating than optimization for other x86 processors because every 486 has an identical internal cache. Whenever both the instructions being executed and the data the instructions access are in the cache, those instructions will run in a consistent and calculable number of cycles on all 486s, with little chance of interference from the prefetch queue and without regard to the speed of external memory.

In other words, for cached code (which time-critical code almost always is), performance is predictable and can be calculated with good precision, and those calculations will apply on any 486. However, “predictable” doesn’t mean “trivial”; the cycle times printed for the various instructions are not the whole story. You must be aware of all the rules, documented and undocumented, that go into calculating actual execution times—and uncovering some of those rules is exactly what this chapter is about.

The Hazards of Indexed Addressing

Rule #1: Avoid indexed addressing (that is, try not to use either two registers or scaled addressing to point to memory).

Intel cautions against using indexing to address memory because there’s a one-cycle penalty for indexed addressing. True enough—but “indexed addressing” might not mean what you expect.

Traditionally, SI and DI are considered the index registers of the x86 CPUs. That is not the sense in which “indexed addressing” is meant here, however. In real mode, indexed addressing means that two registers, rather than one or none, are used to point to memory. (In this context, the use of one register to address memory is “base addressing,” no matter what register is used.) MOV AX, [BX+DI] and MOV CL, [BP+SI+10] perform indexed addressing; MOV AX,[BX] and MOV DL, [SI+1] do not.

Therefore, in real mode, the rule is to avoid using two registers to point to memory whenever possible. Often, this simply means adding the two registers together outside a loop before memory is actually addressed.

As an example, you might adhere to this rule by replacing the code

LoopTop:
    add  ax,[bx+si]
    add  si,2
    dec  cx
    jnz  LoopTop

with this

    add  si,bx
LoopTop:
    add  ax,[si]
    add  si,2
    dec  cx
    jnz  LoopTop
    sub  si,bx

which calculates the same sum and leaves the registers in the same state as the first example, but avoids indexed addressing.

In protected mode, the definition of indexed addressing is a tad more complex. The use of two registers to address memory, as in MOV EAX, [EDX+EDI], still qualifies for the one-cycle penalty. In addition, the use of 386/486 scaled addressing, as in MOV [ECX*2],EAX, also constitutes indexed addressing, even if only one register is used to point to memory.

All this fuss over one cycle! You might well wonder how much difference one cycle could make. After all, on the 8088, effective address calculations take a minimum of 5 cycles. On the 486, however, 1 cycle is a big deal because many instructions, including most register-only instructions (MOV, ADD, CMP, and so on) execute in just 1 cycle. In particular, MOVs to and from memory execute in 1 cycle—if they’re not hampered by something like indexed addressing, in which case they slow to half speed (or worse, as we will see shortly).

For example, consider the summing example shown earlier. The version that uses base+index ([BX+SI]) addressing executes in eight cycles per loop. As expected, the version that uses base ([SI]) addressing runs one cycle faster, at seven cycles per loop. However, the loop code executes so fast on the 486 that the single cycle saved by using base addressing makes the whole loop more than 14 percent faster.

In a key loop on the 486, 1 cycle can indeed matter.

Calculate Memory Pointers Ahead of Time

Rule #2: Don’t use a register as a memory pointer during the next two cycles after loading it.

Intel states that if the destination of one instruction is used as the base addressing component of the next instruction, then a one-cycle penalty is imposed. This rule, unlike anything ever before seen in the x86 family, reflects the heavily pipelined nature of the 486. Apparently, the 486 starts each effective address calculation before the start of the instruction that will need it, as shown in Figure 12.1; this effectively makes the address calculation time vanish, because it happens while the preceding instruction executes.

Of course, the 486 can’t perform an effective address calculation for a target instruction ahead of time if one of the address components isn’t known until the instruction starts, and that’s exactly the case when the preceding instruction modifies one of the target instruction’s addressing registers. For example, in the code

MOV  BX,OFFSET MemVar
MOV  AX,[BX]

there’s no way that the 486 can calculate the address referenced by MOV AX,[BX] until MOV BX,OFFSET MemVar finishes, so pipelining that calculation ahead of time is not possible. A good workaround is rearranging your code so that at least one instruction lies between the loading of the memory pointer and its use. For example, postincrementing, as in the following

LoopTop:
    add    ax,[si]
    add    si,2
    dec    cx
    jnz    LoopTop

is faster than preincrementing, as in:

LoopTop:
    add    si,2
    add    ax,[SI]
    dec    cx
    jnz    LoopTop

Now that we understand what Intel means by this rule, let me make a very important comment: My observations indicate that for real-mode code, the documentation understates the extent of the penalty for interrupting the address calculation pipeline by loading a memory pointer just before it’s used.

The truth of the matter appears to be that if a register is the destination of one instruction and is then used by the next instruction to address memory in real mode, not one but two cycles are lost!

In 32-bit protected mode, however, the penalty is, in fact, the 1 cycle that Intel documents.

Considering that MOV normally takes only one cycle total, that’s quite a loss. For example, the postincrement loop shown above is 2 full cycles faster than the preincrement loop, resulting in a 29 percent improvement in the performance of the entire loop. But wait, there’s more. If a register is loaded 2 cycles (which generally means 2 instructions, but, because some 486 instructions take more than 1 cycle,

Figure 12.1  One-cycle-ahead address pipelining.

the 2 are not always equivalent) before it’s used to point to memory, 1 cycle is lost. Therefore, whereas this code

mov    bx,offset MemVar
mov    ax,[bx]
inc    dx
dec    cx
jnz    LoopTop

loses two cycles from interrupting the address calculation pipeline, this code

mov    bx,offset MemVar
inc    dx
mov    ax,[bx]
dec    cx
jnz    LoopTop

loses only one cycle, and this code

mov    bx,offset MemVar
inc    dx
dec    cx
mov    ax,[bx]
jnz    LoopTop

loses no cycles at all. Apparently, the 486’s addressing calculation pipeline actually starts 2 cycles ahead, as shown in Figure 12.2. (In truth, my best guess at the moment is that the addressing pipeline really does start only 1 cycle ahead; the additional cycle crops up when the addressing pipeline has to wait for a register to be written into the register file before it can read it out for use in addressing calculations. However, I’m guessing here, and the 2-cycle-ahead model in Figure 12.2 will do just fine for optimization purposes.)

Clearly, there’s considerable optimization potential in careful rearrangement of 486 code.

Figure 12.2  Two-cycle-ahead address pipelining.

Caveat Programmor

A caution: I’m quite certain that the 2-cycle-ahead addressing pipeline interruption penalty I’ve described exists in the two 486s I’ve tested. However, there’s no guarantee that Intel won’t change this aspect of the 486 in the future, especially given that the documentation indicates otherwise. Perhaps the 2-cycle penalty is the result of a bug in the initial steppings of the 486, and will revert to the documented 1-cycle penalty someday; likewise for the undocumented optimizations I’ll describe below. Nonetheless, none of the optimizations I suggest would hurt performance even if the undocumented performance characteristics of the 486 were to vanish, and they certainly will help performance on at least some 486s right now, so I feel they’re well worth using.

There is, of course, no guarantee that I’m entirely correct about the optimizations discussed in this chapter. Without knowing the internals of the 486, all I can do is time code and make inferences from the results; I invite you to deduce your own rules and cross-check them against mine. Also, most likely there are other optimizations that I’m unaware of. If you have further information on these or any other undocumented optimizations, please write and let me know. And, of course, if anyone from Intel is reading this and wants to give us the gospel truth, please do!

Stack Addressing and Address Pipelining

Rule #2A: Rule #2 sometimes, but not always, applies to the stack pointer when it is implicitly used to point to memory.

Intel states that the stack pointer is an implied destination register for CALL, ENTER, LEAVE, RET, PUSH, and POP (which alter (E)SP), and that it is the implied base addressing register for PUSH, POP, and RET (which use (E)SP to address memory). Intel then implies that the aforementioned addressing pipeline penalty is incurred whenever the stack pointer is used as a destination by one of the first set of instructions and is then immediately used to address memory by one of the second set. This raises the specter of unpleasant programming contortions such as intermixing PUSHes and POPs with other instructions to avoid interrupting the addressing pipeline. Fortunately, matters are actually not so grim as Intel’s documentation would indicate; my tests indicate that the addressing pipeline penalty pops up only spottily when the stack pointer is involved.

For example, you’d certainly expect a sequence such as

:
pop    ax
ret
pop    ax
ret
:

to exhibit the addressing pipeline interruption phenomenon (SP is both destination and addressing register for both instructions, according to Intel), but this code runs in six cycles per POP/RET pair, matching the official execution times exactly. Likewise, a sequence like

pop    dx
pop    cx
pop    bx
pop    ax

runs in one cycle per instruction, just as it should.

On the other hand, performing arithmetic directly on SP as an explicit destination—for example, to deallocate local variables—and then using PUSH, POP, or RET, definitely can interrupt the addressing pipeline. For example

add    sp,10h
ret

loses two cycles because SP is the explicit destination of one instruction and then the implied addressing register for the next, and the sequence

add    sp,10h
pop    ax

loses two cycles for the same reason.

I certainly haven’t tried all possible combinations, but the results so far indicate that the stack pointer incurs the addressing pipeline penalty only if (E)SP is the explicit destination of one instruction and is then used by one of the two following instructions to address memory. So, for instance, SP isn’t the explicit operand of POP AX (AX is), and no cycles are lost if POP AX is followed by POP or RET. Happily, then, we need not worry about the sequence in which we use PUSH and POP. However, adding to, moving to, or subtracting from the stack pointer should ideally be done at least two cycles before PUSH, POP, RET, or any other instruction that uses the stack pointer to address memory.
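
For instance, a made-up function epilog might be rearranged like this, so that the stack adjustment sits a couple of instructions clear of the RET that needs SP:

        add     sp,10h          ;discard the local variables early
        mov     ax,si           ;a couple of cycles' worth of other useful
        mov     dx,di           ; work before anything addresses memory
                                ; through SP
        ret                     ;SP's new value is ready in time--no
                                ; addressing pipeline penalty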

Problems with Byte Registers

There are two ways to lose cycles by using byte registers, and neither of them is documented by Intel, so far as I know. Let’s start with the lesser and simpler of the two.

Rule #3: Do not load a byte portion of a register during one instruction, then use that register in its entirety as a source register during the next instruction.

So, for example, it would be a bad idea to do this

mov    ah,0
            :
mov    cx,[MemVar1]
mov    al,[MemVar2]
add    cx,ax

because AL is loaded by one instruction, then AX is used as the source register for the next instruction. A cycle can be saved simply by rearranging the instructions so that the byte register load isn’t immediately followed by the word register usage, like so:

mov    ah,0
            :
mov    al,[MemVar2]
mov    cx,[MemVar1]
add    cx,ax

Strange as it may seem, this rule is neither arbitrary nor nonsensical. Basically, when a byte destination register is part of a word source register for the next instruction, the 486 is unable to directly use the result from the first instruction as the source for the second instruction, because only part of the register required by the second instruction is contained in the first instruction’s result. The full, updated register value must be read from the register file, and that value can’t be read out until the result from the first instruction has been written into the register file, a process that takes an extra cycle. I’m not going to explain this in great detail because it’s not important that you understand why this rule exists (only that it does in fact exist), but it is an interesting window on the way the 486 works.
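
To put the mechanism in concrete terms, here’s a rough sketch of what happens in the slower sequence above; the 486’s internals are undocumented, so treat the annotations as illustrative rather than gospel:

mov    al,[MemVar2]   ;writes the byte register AL; the rest of AX is
                      ; untouched by this instruction
add    cx,ax          ;needs all 16 bits of AX, so it must wait for AL
                      ; to be written back into the register file and
                      ; then read the full AX out again: 1 extra cycle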

In case you’re curious, there’s no such penalty for the typical XLAT sequence like

mov    bx,offset MemTable
       :
mov    al,[si]
xlat

even though AL must be converted to a word by XLAT before it can be added to BX and used to address memory. In fact, none of the penalties mentioned in this chapter apply to XLAT, apparently because XLAT is so slow—4 cycles—that it gives the 486 time to perform addressing calculations during the course of the instruction.

While it’s nice that XLAT doesn’t suffer from the various 486 addressing penalties, the reason for that is basically that XLAT is slow, so there’s still no compelling reason to use XLAT on the 486.

In general, penalties for interrupting the 486’s pipeline apply primarily to the fast core instructions of the 486, most notably register-only instructions and MOV, although arithmetic and logical operations that access memory are also often affected. I don’t know all the performance dependencies, and I don’t plan to; figuring all of them out would be a big, boring job of little value. Basically, on the 486 you should concentrate on using those fast core instructions when performance matters, and all the rules I’ll discuss do indeed apply to those instructions.

You don’t need to understand every corner of the 486 universe unless you’re a diehard ASMhead who does this stuff for fun. Just learn enough to be able to speed up the key portions of your programs, and spend the rest of your time on a fast design and overall implementation.

More Fun with Byte Registers

Rule #4: Don’t load any byte register exactly 2 cycles before using any register to address memory.

This, the last of this chapter’s rules, is the strangest of the lot. If any byte register is loaded, and then two cycles later any register is used to point to memory, one cycle is lost. So, for example, this code

mov    al,bl
mov    cx,dx
mov    si,[di]

takes four rather than the expected three cycles to execute. Note that it is not required that the byte register be part of the register used to address memory; any byte register will do the trick.

Worse still, loading byte registers both one and two cycles before a register is used to address memory costs two cycles, as in

mov    bl,al
mov    cl,3
mov    bx,[si]

which takes five rather than three cycles to run. However, there is no penalty if a byte register is loaded one cycle but not two cycles before a register is used to address memory. Therefore,

mov    cx,3
mov    dl,al
mov    si,[bx]

runs in the expected three cycles.

In truth, I do not know why this happens. Clearly, it has something to do with interrupting the start of the addressing pipeline, and I have my theories about how this works, but at this point they’re pure speculation. Whatever the reason for this rule, ignorance of it—and of its interaction with the other rules—could lead to considerable performance loss in seemingly air-tight code. For instance, a casual observer would expect the following code to run in 3 cycles:

mov    bx,offset MemVar
mov    cl,al
mov    ax,[bx]

A more sophisticated programmer would expect to lose one cycle, because BX is loaded two cycles before being used to address memory. In fact, though, this code takes 5 cycles—2 cycles, or 67 percent, longer than normal. Why? Well, under normal conditions, loading a byte register—CL in this case—one cycle before using a register to address memory produces no penalty; loading 2 cycles ahead is the only case that normally incurs a penalty. However, think of Rule #4 as meaning that loading a byte register disrupts the memory addressing pipeline as it starts up. Viewed that way, we can see that MOV BX,OFFSET MemVar interrupts the addressing pipeline, forcing it to start again, and then, presumably, MOV CL,AL interrupts the pipeline again because the pipeline is now on its first cycle: the one that loading a byte register can affect.

I know—it seems awfully complicated. It isn’t, really. Generally, try not to use byte destinations exactly two cycles before using a register to address memory, and try not to load a register either one or two cycles before using it to address memory, and you’ll be fine.
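
For instance, the five-cycle sequence above can be defused by loading BX well ahead of its use and keeping the byte load out of the two-cycle-ahead slot. There’s no penalty-free ordering of just those three instructions, so the trick is to interleave other work; in this sketch, MOV DX,SI stands in for whatever useful single-cycle instruction your code happens to have handy:

mov    bx,offset MemVar    ;load the addressing register early
mov    dx,si               ;placeholder work; a word register load 2
                           ; cycles ahead of the memory access is harmless
mov    cl,al               ;byte load only 1 cycle ahead: no penalty
mov    ax,[bx]             ;BX was loaded 3 cycles back: no penalty

By the rules described in this chapter, this should run in the expected four cycles, getting an extra instruction’s worth of work done in less time than the original three instructions took; as with everything 486-related, though, it’s worth confirming with the Zen timer on your own machine.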

Timing Your Own 486 Code

In case you want to do some 486 performance analysis of your own, let me show you how I arrived at one of the above conclusions; at the same time, I can warn you of the timing hazards of the cache. Listings 12.1 and 12.2 show the code I ran through the Zen timer in order to establish the effects of loading a byte register before using a register to address memory. Listing 12.1 ran in 120 µs on a 33 MHz 486, or 4 cycles per repetition (120 µs/1000 repetitions = 120 ns per repetition; 120 ns per repetition/30 ns per cycle = 4 cycles per repetition); Listing 12.2 ran in 90 µs, or 3 cycles, establishing that loading a byte register costs a cycle only when it’s performed exactly 2 cycles before addressing memory.

LISTING 12.1 LST12-1.ASM

; Measures the effect of loading a byte register 2 cycles before
; using a register to address memory.
    mov    bp,2    ;run the test code twice to make sure
                   ; it's cached
    sub    bx,bx
CacheFillLoop:
    call    ZTimerOn ;start timing
    rept    1000
    mov     dl,cl
    nop
    mov     ax,[bx]
    endm
    call    ZTimerOff ;stop timing
    dec     bp
    jz      Done
    jmp     CacheFillLoop
Done:

LISTING 12.2 LST12-2.ASM

; Measures the effect of loading a byte register 1 cycle before
; using a register to address memory.
    mov    bp,2   ;run the test code twice to make sure
                  ; it's cached
    sub    bx,bx
CacheFillLoop:
    call   ZTimerOn ;start timing
    rept   1000
    nop
    mov    dl,cl
    mov    ax,[bx]
    endm
    call   ZTimerOff ;stop timing
    dec    bp
    jz     Done
    jmp    CacheFillLoop
Done:

Note that Listings 12.1 and 12.2 each repeat the timing of the code under test a second time, to make sure that the instructions are in the cache on the second pass, the one for which results are displayed. Also note that the code is less than 8K in size, so that it can all fit in the 486’s 8K internal cache. If I double the REPT value in Listing 12.2 to 2,000, making the test code larger than 8K, the execution time more than doubles to 224 µs, or 3.7 cycles per repetition (224 µs/2,000 repetitions = 112 ns per repetition; 112 ns per repetition/30 ns per cycle ≈ 3.7 cycles); the extra seven-tenths of a cycle comes from fetching non-cached instruction bytes.

Whenever you see non-integral timing results of this sort, it’s a good bet that the test code or data isn’t cached.

The Story Continues

There’s certainly plenty more 486 lore to explore, including the 486’s unique prefetch queue, more optimization rules, branching optimizations, performance implications of the cache, the cost of cache misses for reads, and the implications of cache write-through for writes. Nonetheless, we’ve covered quite a bit of ground in this chapter, and I trust you’ve gotten a feel for the considerable extent to which 486 optimization differs from what you’re used to. Odd as 486 optimization is, though, it’s well worth mastering, for the 486 is, at its best, so staggeringly fast that carefully crafted 486 code can do more than twice as much per cycle as the best 386 code—which makes it perhaps 50 times as fast as optimized code for the original PC.

Sometimes it is hard to believe we’re still in Kansas!

Chapter 13 – Aiming the 486

Pipelines and Other Hazards of the High End

It’s a sad but true fact that 84 percent of American schoolchildren are ignorant of 92 percent of American history. Not my daughter, though. We recently visited historical Revolutionary-War-vintage Fort Ticonderoga, and she’s now 97 percent aware of a key element of our national heritage: that the basic uniform for soldiers in those days was what appears to be underwear, plus a hat so that no one could complain that they were undermining family values. Ha! Just kidding! Actually, what she learned was that in those days, it was pure coincidence if a cannonball actually hit anything it was aimed at, which isn’t surprising considering the lack of rifling, precision parts, and ballistics. The guides at the fort shot off three cannons; the closest they came to the target was about 50 feet, and that was only because the wind helped. I think the idea in early wars was just to put so much lead in the air that some of it was bound to hit something; preferably, but not necessarily, the enemy.

Nowadays, of course, we have automatic weapons that allow a teenager to singlehandedly defeat the entire U.S. Army, not to mention so-called “smart” bombs, which are smart in the sense that they can seek out and empty a taxpayer’s wallet without being detected by radar. There’s an obvious lesson here about progress, which I leave you to deduce for yourselves.

Here’s the same lesson, in another form. Ten years ago, we had a slow processor, the 8088, for which it was devilishly hard to optimize, and for which there was no good optimization documentation available. Now we have a processor, the 486, that’s 50 to 100 times faster than the 8088—and for which there is no good optimization documentation available. Sure, Intel provides a few tidbits on optimization in the back of the i486 Microprocessor Programmer’s Reference Manual, but, as I discussed in Chapter 12, that information is both incomplete and not entirely correct. Besides, most assembly language programmers don’t bother to read Intel’s manuals (which are extremely informative and well done, but only slightly more fun to read than the phone book), and go right on programming the 486 using outdated 8088 optimization techniques, blissfully unaware of a new and heavily mutated generation of cycle-eaters that interact with their code in ways undreamt of even on the 386.

For example, consider how Terje Mathisen doubled the speed of his word-counting program on a 486 simply by shuffling a couple of instructions.

486 Pipeline Optimization

I’ve mentioned Terje Mathisen in my writings before. Terje is an assembly language programmer extraordinaire, and author of the incredibly fast public-domain word-counting program WC (which comes complete with source code; well worth a look, if you want to see what really fast code looks like). Terje’s a regular participant in the ibm.pc/fast.code topic on Bix. In a thread titled “486 Pipeline Optimization, or TANSTATFC (There Ain’t No Such Thing As The Fastest Code),” he detailed the following optimization to WC, perhaps the best example of 486 pipeline optimization I’ve yet seen.

Terje’s inner loop originally looked something like the code in Listing 13.1. (I’ve taken a few liberties for illustrative purposes.) Of course, Terje unrolls this loop a few times (128 times, to be exact). By the way, in Listing 13.1 you’ll notice that Terje counts not only words but also lines, at a rate of three instructions for every two characters!

LISTING 13.1 L13-1.ASM

mov di,[bp+OFFS]    ;get the next pair of characters
mov bl,[di]         ;get the state value for the pair
add dx,[bx+8000h]   ;increment word and line count
                    ; appropriately for the pair

Listing 13.1 looks as tight as it could be, with just two one-cycle instructions, one two-cycle instruction, and no branches. It is tight, but those three instructions actually take a minimum of 8 cycles to execute, as shown in Figure 13.1. The problem is that DI is loaded just before being used to address memory, and that costs 2 cycles because it interrupts the 486’s internal instruction pipeline. Likewise, BX is loaded just before being used to address memory, costing another two cycles. Thus, this loop takes twice as long as cycle counts would seem to indicate, simply because two registers are loaded immediately before being used, disrupting the 486’s pipeline.
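
Here, roughly, is where those eight cycles go, according to the penalties just described; this is, in essence, the breakdown Figure 13.1 illustrates:

mov di,[bp+OFFS]    ;1 cycle
mov bl,[di]         ;1 cycle, plus a 2-cycle penalty because DI was
                    ; loaded by the immediately preceding instruction
add dx,[bx+8000h]   ;2 cycles, plus a 2-cycle penalty because BX (via
                    ; BL) was loaded by the immediately preceding
                    ; instruction, for a total of 8 cycles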

Listing 13.2 shows Terje’s immediate response to these pipelining problems; he simply swapped the instructions that load DI and BL. This one change cut execution time per character pair from eight cycles to five cycles! The load of BL is now separated by one instruction from the use of BX to address memory, so the pipeline penalty is reduced from two cycles to one cycle. The load of DI is also separated by one instruction from the use of DI to address memory (remember, the loop is unrolled, so the last instruction is followed by the first instruction), but because the intervening instruction takes two cycles, there’s no penalty at all.

Figure 13.1  Cycle-eaters in the original WC.

Remember, pipeline penalties diminish with increasing number of cycles, not instructions, between the pipeline disrupter and the potentially affected instruction.

LISTING 13.2 L13-2.ASM

mov bl,[di]         ;get the state value for the pair
mov di,[bp+OFFS]    ;get the next pair of characters
add dx,[bx+8000h]   ;increment word and line count
                    ; appropriately for the pair

At this point, Terje had nearly doubled the performance of this code simply by moving one instruction. (Note that swapping the instructions also made it necessary to preload DI at the start of the loop; Listing 13.2 is not exactly equivalent to Listing 13.1.) I’ll let Terje describe his next optimization in his own words:

“When I looked closely as this, I realized that the two cycles for the final ADD is just the sum of 1 cycle to load the data from memory, and 1 cycle to add it to DX, so the code could just as well have been written as shown in Listing 13.3. The final breakthrough came when I realized that by initializing AX to zero outside the loop, I could rearrange it as shown in Listing 13.4 and do the final ADD DX,AX after the loop. This way there are two single-cycle instructions between the first and the fourth line, avoiding all pipeline stalls, for a total throughput of two cycles/char.”

LISTING 13.3 L13-3.ASM

mov bl,[di]         ;get the state value for the pair
mov di,[bp+OFFS]    ;get the next pair of characters
mov ax,[bx+8000h]   ;increment word and line count
add dx,ax           ; appropriately for the pair

LISTING 13.4 L13-4.ASM

mov bl,[di]         ;get the state value for the pair
mov di,[bp+OFFS]    ;get the next pair of characters
add dx,ax           ;increment word and line count
                    ; appropriately for the pair
mov ax,[bx+8000h]   ;get the count value for this pair