100% found this document useful (1 vote)

228 views32 pages

Assembly Language Lab

This page is for people who already know the some assembly and c to see why it is often very beneficial to use a direct assembly implementation over a pure c implementation. There are, in my opinion, too many programmers out there who just don't know what a difference hand coded assembly can make.

Uploaded by

Eri Matraku

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOC, PDF, TXT or read online on Scribd

100% found this document useful (1 vote)

228 views32 pages

Assembly Language Lab

Uploaded by

Eri Matraku

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOC, PDF, TXT or read online on Scribd

You are on page 1/ 32

Assembly Language Lab

Copyright 1998-2004 by Paul Hsieh The main purpose of this page is for people who already know the some assembly and C to see why it is often very beneficial to use a direct assembly implementation over a pure C implementation. There are, in my opinion, too many programmers out there who just don't know what a difference hand coded assembly can make. I hope to help remedy this situation. If you have other examples, or input (or challenges) regarding this page don't hesitate to contact me. 1. GCD On comp.lang.asm.x86, Jon Kirwan asked for compiler output for the "greatest common divisor" function from the following C implementation:
unsigned int gcd (unsigned int a, unsigned int b) { if (a == 0 &&b == 0) b = 1; else if (b == 0) b = a; else if (a != 0) while (a != b) if (a <b) b -= a; else a -= b; return b; }

Here's my favorite C compiler with /os/s options (optimize for size, use no stack checking; other optimization options did not help much) versus a human (i.e., me) implementation:
; WATCOM C/C++ v10.0a output gcd: mov mov test jne test jne ebx,eax eax,edx ebx,ebx L1 edx,edx L1 ; Author: Paul Hsieh gcd: L1: L2: neg je neg xchg sub jg eax L3 eax eax,edx eax,edx L2

L1:

L2: L3;

L4: L5:

mov ret test jne mov ret test je cmp je jae sub jmp sub jmp ret

eax,1 L3: eax,eax L2 eax,ebx ebx,ebx L5 ebx,eax L5 L4 eax,ebx L3 ebx,eax L3

L4:

jne add jne inc ret

L1 eax,edx L4 eax

A particular weakness of this compiler is that it is not sure what registers to start with. But regardless it is missing a huge number of optimization opportunities when compared with the human implementation. There is no comparison is speed or size. Of course, as a human, I have an immense advantage since I can see mathematical relationships that are missed by all C compilers. Although xchg is not a particularly fast Intel instruction, it does help make it very tight, and probably not more than a cycle off of optimal for performance. The main loop of the routine exists entirely within an instruction pre-fetch buffer (16 bytes.) Much as I would like, it is pointless to try and describe the exact performance of the algorithms given above. They are performance limited almost entirely by branch mis-prediction penalties and branch taken clocks. Update: With the advent of modern CPUs such as the Alpha, Pentium-II or Athlon which have deep pipelines and correspondingly bad branch misprediction penalties, it makes sense to remove branches through "predication". In this case we can use min() and max() functions to remove the inner most if() statement in the original loop:
x = min(a,b); y = max(a,b); while( y!= 0 ) { x = min(x,y-x); y -= x; }

// min(x,y-x) // (y-x)+(x) - min(x,y-x)

But on a 32 bit machine:

min(x,y) = y+((x-y)>>31)&(x-y) So putting in the appropriate substitutions and simplifying as appropriate we arrive at:
int gcd(int x, int y) { int t; y += x; if(y==0) y=1; else do { x += x; t = y-x; x >>= 1; x += (t>>31)&t; y -= x; } while(x!=0); return y; }

// (y-x) - (x) // min(y-x,x) // (y-x)+(x)-min(x,y-x) == max(x,y-x)

Once again, we take the Pepsi challenge and we get:

; WATCOM C/C++ v10.0a output gcd: add jne mov jmp L1: mov edx add sub eax mov ebx sar ecx sar shifters and ecx add ebx sub eax test jne mov L2: edx,eax L1 eax,1 L2 ebx,edx ; 1 dep: eax,eax ; 1 ebx,eax ; 2 dep: ecx,ebx ; 3 dep: ecx,1fH ; 4 dep: eax,1 ; 4/5 ; Author: Paul Hsieh gcd: add jne mov jmp L1: mov edx add sub eax shr mov ebx sar and ecx add ebx sub eax test jne mov L2: edx,eax L1 eax,1 L2 ebx,edx ; 1 dep: eax,eax ; 1 ebx,eax ; 2 dep: eax,1 ; 2 ecx,ebx ; 3 dep: ebx,1fH ; 3 ebx,ecx ; 4 dep: eax,ebx ; 5 dep: edx,eax ; 6 dep: eax,eax ; 6 L1 ; 6 eax,edx

ebx,ecx ; 5 dep: eax,ebx ; 6 dep: edx,eax ; 7 dep: eax,eax ; 7 L1 ; 7 eax,edx

So I am still ahead, but in theory a good enough compiler could have equaled my results. Unlike the first loop, the performance of this loop is characterizable. Update: Michael Abrash's famous book "The Zen of Code Optimization" actually says that the pure subtraction method is too slow (this contradicts actual testing that I performed.) I have also gotten feedback from people insisting that divide is better for certain special cases or larger numbers. And the final icing on the cake is that in "Handbook of Applied Cryptography" (ISBN 0-8493-8523-7) they give a "divide out powers of two" algorithm that appears to be a good compromise (but still suffers from the branch mis-prediction penalties.) I will need to investigate this a little deeper, and may need to revise these results. Outside analysis is welcome. Update: Just as a question of completeness, I am presenting what I have discovered so far. Using what apparently is a classical approach, rather than either dividing or subtracting, we subtract powers of 2 times the smaller value from the larger. Furthermore, rather than pursuing the gcd algorithm all the way, once we have small enough values for a and b, we just perform a table lookup (table filling not shown here.)
#define SLIM (64) static unsigned smallGcds[SLIM][SLIM]; static unsigned uintGcd (unsigned a, unsigned b) { unsigned c, d; /* Divide out low powers of 2 if we can (to decrease a, b) */ d = 1; while (((a|b) & 1) == 0) { a >>= 1; b >>= 1; d <<= 1; } goto start; do { /* Find largest 2^t*b, less than a */ c = b; do { c += c; } while (c <= a); /* a -= 2^t*b */ a -= (c >> 1); start:;

/* Make sure a > b, and b != 0 */ if (a < b) { if (a == 0) return d * b; c = a; a = b; b = c; } } while (a >= SLIM); /* Return with precalculated small gcds */ return d * smallGcds[a][b]; }

Obviously some cases of (a,b) input need to be checked (a=0, b=0 sends the program into an infinite loop.) I believe that it might be possible to improve upon this by performing table look ups on several bits of each operand at once which could give prime linear factors which could be used to reduce the operands by several steps in one shot. So I am refraining from performing assembly language analysis until I resolve this. Update: There have been a number of submissions that have come to my attention. The first is from Pat D:

The second submission is from Ville Miettinen who works (or worked) for hybrid graphics using the cmovCC instructions:

2. Scan array of packed structure for non-zero key Suppose we want to scan an array of structures just to determine if one has a non-zero value as a particular entry. That is to say, suppose we wish to implement the following function:
unsigned int existnzkey (struct foobar * table, unsigned int tablelength) { int i; for(i=0;i<tablelength;i++) if( table[i].key !=0 ) return table[i].key;

return 0; }

As it stands, this code suffers from too many problems to analyze at the assembly level immediately. Applying some ingenuity and some standard tricks as described on my optimization page I arrive at:
static unsigned int existnzkey (unsigned int tablelength, struct foobar * table) { int i,c,d; d=i=0; c=tablelength; goto Start; do { d = table[i].key; if( d !=0 ) return d; i++; c--; Start:; } while( c!=0 ); return d; }

While it may look a lot longer and seem to add unnecessary complexity, it in fact helps the compiler by practically giving it a variable to register map. Now many may comment that using goto's or labels is bad and causes compilers to turn off optimizations or whatever. Well, my compiler doesn't seem to have a problem with it, which is why I've used it. Assuming we have:
struct foobar { unsigned int key; void * content; };

here's the assembly output of the compiler versus my hand coding:

; compiler xor test eax,eax ecx,ecx ; by Pisen Chiang and Paul Hsieh xor test eax,eax ecx,ecx

je L1: mov test jne add dec jne +1brt L2:

L2 eax,[esi] eax,eax L2 esi,8 ecx L1

; ; ; ; ; ;

1 2 2 3 3 5

U U V U V - V

je sub L1: add sub sbb or jz +1brt mov L2:

L2 esi,8 esi,8 ecx,1 eax,eax eax,[esi] L1 eax,[esi]

; ; ; ; ;

1 U 1 V 2 U 3 U 4 V

The compiler wins the size contest, but mine is a heck of a lot faster because I avoid an extra test and branch in the inner loop. As the comments show, my loop saves a clock (on the Pentium.) Impressed? Well, don't be too impressed. In a sense all of this should be academic. The real problem here is that we are accessing spaced out elements out of an array of packed structures. If the array is substantial in size then we end up losing for all the extra data loaded (the cache can only load contiguous lines of data.) Splitting the structure up into parallel arrays of data which have locality of reference (i.e., all elements are read/written in the relevant inner loops) will tend to do more for performance than these sorts of optimizations. But even in that situation I don't think most compilers will use a repz scasd. Nevertheless, in situations where you are adhering to a bad array structure design (like the vertex buffer array used in Microsoft's Direct 3D API) the above code does apply. 3. 63 bit LFSR The following is a typical cyclic redundancy checkcomputing algorithm. It is basically a deterministic iterating step of a function that takes 63 bits of input scrambles the bits in some uniformly pseudo-random way, then returns those 63 bits. Such algorithms can be used for high performance (but low security) encryption, pseudo-random numbers, testing for data integrity or a hash function.
typedef unsigned long UINT; void TestC(UINT &b0, UINT &b1) { UINT L1, L2, L3; UINT H1, H2; // x = (x>>31)^(x>>30)^(x<<32) (mod 2^63)

L1 L2 H1 H2 b1 b0 }

= = = = = =

(b1<<1)|(b0>>31); (b1<<2)|(b0>>30); (b1>>31); (b1>>30); H1^H2^b0 &0x7FFFFFFF; L1^L2;

I have implemented this by using 63 of 64 bits in a pair of 32 bit call by reference variables. (My compiler does not support a native 64 bit integer type.)
; compiler lea shr or mov lea shr or +agi mov shr shr xor xor mov and xor esi,[edx*2] ebx,31 esi,ebx ebx,eax ecx,[edx*4] ebx,30 ebx,ecx ecx,edx ecx,31 edx,30 edx,ecx edx,eax eax,esi edx,07FFFFFFFh eax,ebx ; ; ; ; ; ; ; 1 2 3 4 5 5 7 U U U U U ; by Paul Hsieh and Pisen Chiang mov esi,edx xor cl,cl shld edx,eax,1 NP +4c* shld esi,eax,2 NP +4c adc cl,0 xor edx,esi xchg eax,edx NP +2c xor dl,cl ; 1 U ; 1 V ; 5 ; 9 ;10 U ;10 V ;12 ;13 U

V U U U U U U U U V

; 8 ; 9 ;10 ;11 ;12 ;13 ;13 ;14

Here the cycle count is a close one on the Pentium MMX (*in fact preMMX CPUs brings the two to a tie since the first shld takes an extra clock to decode.) This is mostly because of the surprisingly slow shld instructions. On Pentium Pro/II, K6 and 6x86MX processors, shld is significantly improved in performance making this a more compelling solution based on size and speed. The following is optimized purely for performance on Pentium based CPUs.
; by hand for Pentiums mov xor shl mov esi,edx cl,cl esi,2 ebx,eax ; ; ; ; 1 U 1 V 2 U 2 V

adc lea shr xor xor xor shr xchg xor

cl,0 edx,[edx*2] ebx,30 edx,esi edx,ebx al,cl ebx,1 eax,edx eax,ebx

; ; ; ; ; ; ; ; ;

3 3 4 4 5 5 6 8 9

U V U V U V U NP +2c U

4. Quick and dirty bubble sort Andrew Howe, from Core Designs (makers of Tomb Raider), sent me an extremely short sort loop. Originally written for WATCOM C/C++, I have stripped off the cruft so that you can see it here in its most optimal form.
; by Andrew Howe outerloop: cmploop: lea mov sub cmp jle xchg cmp jnz stosd loop ebx,[edi+ecx*4] eax,[edi] ebx,4 eax,[ebx] notyet eax,[ebx] ebx,edi cmploop outerloop

notyet:

Notice the use of xchg, stosd, and loop, something you just wont see from C compilers. I did not bother working out the timings of this routine. Remember that for some applications, bubble sort will out perform some of the more sophisticated sorting algorithms. For example, 3D applications that need Z-sorting, but can rely on temporal locality to always maintain at least a mostly sorted list. (For another interesting departure along these lines see <<Sorting with less than all the facts>>) 5. A fast implementation of strlen() Recently, someone wrote to me with the comment that strlen() is a very commonly called function, and as such was interested in possible performance improvements for it. At first, without thinking too hard about it, I didn't see how there was any opportunity to fundamentally improve the algorithm. I was right, but as far as low level algorithmic scrutiny is concerned, there is plenty of opportunity. Basically, the algorithm is byte scan based, and as such the typical thing that the C version will do wrong is miss the opportunity to reduce load redundancy.

; compiler mov cmp je l1: mov inc V test jne V +1brt l2: sub edx,ebx byte ptr [ebx],0 l2 ah,[ebx+1] ; U ebx ; ah,ah l1 ebx,edx ; U ;

; by Paul Hsieh lea l1: inc test jz cmp jne jmp l2: mov add V test jz V test jz V test jz V test jnz V +1brt inc l3: inc l4: inc l5: sub l6: sub ecx,[ebx-1] ecx ecx,3 l2 byte ptr [ecx],0 l1 l6 eax,[ecx] ; U ecx,4 ; al,al l5 ah,ah l4 eax,0ff0000h l3 eax,0ff000000h l2 ecx ecx ecx ecx,4 ecx,ebx ; U ; ; U ; ; U ; ; U ;

Here, I've sacrificed size for performance, by essentially unrolling the loop 4 times. If the input strings are fairly long (which is when performance will matter) on a Pentium, the asm code will execute at a rate of 1.5 clocks per byte, while the C compiler takes 3 clocks per byte. If the strings are not long enough, branch mispredictions may make this solution worse than the straight forward one. Update! While discussing sprite data copying (see next example) I realized that there is a significant improvement for 32-bit x86's that have slow branching (P-IIs and Athlon.)
; by Paul Hsieh lea l1: inc test jnz l2: mov ecx,[ebx-1] ecx ecx,3 l3 edx,[ecx]

; U

mov and add add or and cmp je sub l3: cmp jne sub

eax,07F7F7F7Fh eax,edx ecx,4 eax,07F7F7F7Fh eax,edx eax,080808080h eax,080808080h l2 ecx,4 byte ptr [ecx],0 l1 ecx,ebx

; ; ; ; ; ; ; ;

V U V U U U U V +1brt

I think this code will perform better in general for all 32 bit x86s due to less branching. 16 bit x86's obviously can use a similar idea, but it should be pretty clear that it would be at least twice as slow. (I'm really starting to like this bit mask trick! See next example) Thanks to Zi Bin Yang for pointing out an error in my code. Here's the C version of the same thing:
int fstrlen(char *s) { char *p; int d; #define M1 (0x7f7f7f7f) #define M2 (0x80808080) #define SW (sizeof(int)/sizeof(char)) p=s-1; do { p++; if( (((int)p)&(SW-1))==0 ) { do { d = *((int *)p); p += SW; d = (((d&M1)+M1)|d)&M2; } while( d==M2 ); p -= SW; } } while( *p!=0 ); return p-s; }

It compiles to substantially the same thing (on a P6 it should have the same performance) but it has a few problems with it. Its not really portable to different sizes of int (the numbers 0x7f7f7f7f and 0x80808080 would have to be truncated or expanded, and casting p from a pointer to a scalar is not guaranteed to work) and its not any easier to understand than the assembly.

Note that although alignment on reads is not necessary on modern x86's it is performed here, as a means of avoiding a memory read failure. This can happen as a result of reading past the end of the string, but under the assumption that your memory allocator aligns to at least DWORD boundaries. Update! Ryan Mack has sent in an MMX version of this loop:
; MMX version by Ryan Mack ; Roughly 13 + 3n + BRANCH clocks on a P-II const unsigned __int64 STRINGTBL[8] = {0, 0xff, 0xffff, 0xffffff, 0xffffffff, 0xffffffffff, 0xffffffffffff, 0xffffffffffffff} /* ... */ pxor mov mov and and movq por MAIN: add pcmpeqb packsswb movd movq test jz bsf shr lea sub mm1, ecx, edx, ecx, eax, mm0, mm0, ecx, mm0, mm0, eax, mm0, eax, MAIN eax, eax, ecx, ecx, mm1 eax eax -8 7 [ecx] [STRINGTBL+eax*8] 8 mm1 mm0 mm0 [ecx] eax

eax 2 [ecx+eax-8] edx

The individual loop iterations are not dependent on each other, so on a modern out of order processor this is just instruction issue limited. On a PII we have 6 ALU microops + 1 load microop, which should translate to a 3 clock per 8 byte throughput. On AMD processors there are 7 riscops which is 3 clocks on an Athlon and 4 clocks on a K6 per 8 byte throughput. Performance measurements on an Athlon show that (ignoring branch misprediction) for long strings this 64 bit method is upwards of 55% faster than the previous 32 bit method (which in turn is 22% faster than the WATCOM clib implementation.) However, for very short strings, this

method is as much as 3 times slower then the previous method (which is about the same same speed as as the WATCOM clib implementation.) But keep in mind that if your strings are shorter, the branch mispredicts (which were not modeled in my tests) will take a comparatively larger percentage of the time. In any event, if you intend to use one of these alternatives, you should performance measure them to see which is better for your application. Update! Norbert Juffa has written in to inform me of an old trick which is a significant improvement over the device I discovered:
After a few minutes of research it turns out that the original null byte detection by Alan Mycroft is superior to the formula given on your website as it has shorter dependency chains. On a PIII, strcpy() performance jumped almost 20% from substituting one for the other. Mycroft's macro is #define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080) I rendered this as: mov not sub and and edx, eax edx, eax, eax, eax ; cycle 1

0x01010101 ; cycle 2 0x80808080 edx ; cycle 3

BTW, one of Mycroft's original posts about this macro can be found at Deja. The post is dated 1987-04-27!

I did a little digging and indeed you can find an old thread discussing this on google groups. So here's an implementation in C:

Compared to other solutions on an Athlon, this is indeed the fastest solution for strings up to about 32 characters (the MMX solution is faster for longer strings). This function greatly outperforms the strlen of most libraries. 6. Sprite data copying A common question that comes up for programmers that are interested in rendering sprite graphics (i.e., game programmers) is how to write a high performance sprite blitter. In a recent discussion on rec.games.programmer, the questioner had a requirement that an array of

pixels from one array be copied to a destination array unless the pixel value was zero (representing a transparency.) The starting code looked something like:
char * scrptr; char * destptr; /* ... */ /* copy scanline except where transparent. */ for(x=x0;x<xl;x++) { if( srcptr[x]!=0 ) destptr[x]=srcptr[x]; }

Of course this code has several problems with it: (1) It has a branch prediction in the inner loop that is likely to fail very often and (2) The memory is being read one byte at a time which wastes memory bandwidth and increases loop overhead. That said, this solution is actually good if the memory unit has "write combine" and "write mask" capabilities and that memory write access to the destination is more of a bottleneck than branch prediction, loop overhead or source reads. (This is true if the destination is VRAM, or the CPU is a highly clocked P-II.) However, if reading the destination is not a problem (if the destination is a system memory buffer), then reading 4 bytes at a time and building a mask is not a bad solution, provided you can build one without using conditional branches (otherwise there would be nothing to gain.) Russ Williams came up with a straight forward approach using the carry flag (it is optimized for the Pentium):
; by Russ Williams pushad mov esi,[pbSrc] mov ecx,[w] shr ecx,2 mov edi,[pbDest] sub edi,esi looptop: xor ebx,ebx mov eax,[esi] add esi,4 mov edx,eax

ror add sbb add sbb mov shl add sbb add sbb mov ror and or dec mov jnz

edx,16 al,255 bl,0 ah,255 bh,0 eax,edx ebx,16 al,255 bl,0 ah,255 bh,0 eax,[edi] edx,16 eax,ebx eax,edx ecx [edi+esi],eax looptop

; cf = 1 if al!=0 or 0 if al==0 ; bl = -1 if al!=0 or 0 if al==0

; dest &bg.mask (P6: PRS) ; merge with src

popad

The above loop works great for Pentiums and is out of reach for C compilers because of the use of the carry flag. However the P-II does not like the partial register stall (a non-overlappable 7 clock penalty.) After some discussion motivated by a comment by Christer Ericson, the following trick was observed: (((D &0x7F7F7F7F) + 0x7F7F7F7F) | D) &0x80808080 The above expression will generate a bitmask in the high bits of each byte; 00h indicating a zero byte, and 80h indicating a non-zero byte. With some straight forward trickery you can turn this into 00h for zero and FFh for non-zero, giving us the mask we desire. Here's the code in assembly language:
; by Paul Hsieh sub esi,edi looptop: mov eax,[esi+edi] mov ebx,7f7f7f7fh and ebx,eax add ebx,7f7f7f7fh or ebx,eax bytes. and ebx,80808080h shr ebx,7 add ebx,7f7f7f7fh and ebx,7f7f7f7fh lea eax,[ebx+ebx]

; read sprite source bits

; 80808080 bits have non-zero status of

; ; ; ;

move to 01010101 bits. 80==on or 7f==off values in each byte. 00==on or 7f==off masks. eax has 00==on or fe==off masks. (P5: AGI

stall) or eax,ebx mov ebx,eax not ebx and eax,[edi] or eax,[esi+edi] mov [edi],eax add edi,4 dec ecx jnz looptop

; eax has 00==on or ff==off masks. ; ; ; ; ebx has 00==off or ff==on masks. background bits merge with sprite bits for final result draw!

Of course the above loop is not especially suited for Pentiums since practically the whole loop has data dependencies, and there is an AGI stall. These issues can be dealt with by unrolling the loop once and using ordinary addition instead of lea (The edx and ebp registers are available for this). However, P-II's will attempt to "naturally" unroll this loop, and doesn't suffer from AGI stalls and therefore would stand to benefit much less from hand scheduling of this loop (though it probably would not hurt.) However, since the loop is 22 micro-ops (which is 2 more than the P-II reorder buffer limit) it cannot totally unroll it. A K6 can also only partially "unroll" this loop because it contains more than 6 decode clocks. Actually, with the right source, C compilers should have little difficulty coming up with the source to the above (or something very similar.) But I'm not sure how you are supposed to convince your C compiler that it needs to be unrolled once. Update:After some thought, its obvious that the above algorithm can and should be performed using MMX and some of the new SSE instructions (movntq in particular). I may implement this at a later date. Another idea, which apparently comes from Terje Mathisen, is to use the P-II's write combining features in conjunction with minimizing branch misprediction penalties. The idea is to use a mask to select between two destination addresses for the write bytes:
; Idea from Terje Mathisen, implemented by Sean T. Barrett sub esi,edi mov edx,temp sub edx,edi ... loop:

; edx = (temp-edi)

mov cmp sbb and mov inc dec jnz

al,[esi+edi] al,1 ebx,ebx ebx,edx [ebx+edi],al edi ecx loop

; ; ; ; ;

al=src cf = 1 if src==0, 0 if src!=0 ebx = -1 if src==0, 0 if src!=0 ebx = (temp-edi) if src==0, 0 if src!=0 store to dest (edi) or garbage (temp)

; well predicted branch

Of course, temp is a reused system array meant just as a trash bin for transparency writes. The outer loop gets complicated, but in ideal situations, this will make better use of the P-II chipset features without suffering from branch mispredict penalties. The carry flag trick still eludes modern C compilers. (Of course this loop should be unrolled a bit to improve performance.) However, like the key scan problem listed above, all this may be academic. For most sprites that are rendered once and used over and over again, transparencies and solid source pixels are usually in long runs. So using an alternate encoding which give transparency versus nontransparency runs will usually be more efficient. Blitting transparency would come down to adding a constant to the pointers, and the blitting solid colors would be immediate without required examination of the source. Update: Paolo Bonzini has written in with a more efficient inner loop. The idea is apply the mask to the xor difference of the source and destination.

Paolo also submitted a trivial MMX loop:

7. Now its time for MMX On sandpile DS asked how to perform the following:
An MMX 64bit reg has useful values only at byte7 and byte3 (AXXXBXXX, where A/B-useful, X-useless). I want to copy byte7 to byte6, byte5, byte4; and copy byte3 to byte2, byte1, byte0.

I thought of two ways:

psrld packssdw punpckhwd pmullw

mm0, mm0, mm0, mm0,

24 mm0 mm0 Const_0101010101010101

; ; ; ;

0 0 0 A

0 A A A

0 0 0 A

A B A A

0 0 0 B

0 A B B

0 0 0 B

B B B B

or
psrld packssdw packuswb punpcklbw punpcklbw mm0, mm0, mm0, mm0, mm0, 24 mm0 mm0 mm0 mm0 ; ; ; ; ; 0 0 A A A 0 A B A A 0 0 A B A A B B B A 0 0 A A B 0 A B A B 0 0 A B B B B B B B

The first has fewer instructions but higher latency which is ideal for a CPU like the Athlon. The second has more instructions but shorter latency, which is beneficial to CPUs like the K6, Cyrix or P55C. I am not sure which is the better loop for P6 architectures. Of course once we enter the realm of MMX, the poor C compiler really has nothing to offer. 8. Generalized shift Norbert Juffa was studying the Compaq Visual Fortran compiler and thought that generalized shift could be improved. In the Fortran language a general shift is always a left shift, but it may take a negative shift value (which is a right shift) and shift values beyond the word size (in either direction) in bits results in 0. Here's the algorithm I came up with:

The ANSI C standard, requires the we use unsigned int when using the right shift operation due to an unspecified "implementation defined" status of right shifting on signed numbers. If you are wondering why I mention it -- its because this is not well known, and can lead to incorrect resuts. This time I have included Microsoft Visual C++:

The comments indicate the clock on which an ideal 3-way out-of-order x86 machine (i.e., an Athlon) could execute these instructions. Notice the issue stalls for each of these code sequences. This means that there may be a case for x86 CPUs to go with 4-way instruction decoding and issue in

the future. In any event, MSVC++ seems to be doing quite well in terms of the final performance, but it is unable to simplify the back to back neg reg0 instructions. The classic software register renaming technique for avoiding a dependency stall (after mov esi, eax the role of the two registers can and should be swapped -- the two registers will be identical after this instruction except that one of them is available one clock later than the other) is also not being used. WATCOM C/C++ is having difficulty with register size mismatching, and is unaware of the standard sbb reb, reg trick. 9. 64bit BCD add Norbert Juffa has given me some devilishly clever code for performing a 64 bit BCD addition.
; by Norbert Juffa mov mov mov mov eax, ebx, edx, ecx, dword dword dword dword ptr ptr ptr ptr [x] [y] [x+4] [y+4] ; ; ; ; x y x y (lo) (lo) (hi) (hi)

; here: EDX:EAX = augend, ECX:EBX = addend mov lea xor add shr add sbb rcr xor and add shr sub sub mov lea xor add shr add ;;sbb rcr xor and add shr sub esi, edi, esi, eax, esi, edi, ebx, edi, esi, esi, eax, esi, eax, edx, esi, edi, esi, edx, esi, edi, esi, edi, esi, esi, edx, esi, edx, eax [eax+66666666h] ebx ebx 1 ebx ebx 1 edi 88888888h esi 2 esi ebx edx [edx+66666666h] ecx ecx 1 ecx esi 1 edi 88888888h esi 2 esi ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; x x + 0x66666666 x ^ y x + y t1 = (x ^ y) >> 1 x + y + 0x66666666 capture carry t2 = (x + y + 0x66666666) >> 1 t1 ^ t2 t3 = (t1 ^ t2) & 0x88888888 x + y + t3 t3 >> 2 x + y + t3 - (t3 >> 2) propagate carry x x + 0x66666666 x ^ y x + y t1 = (x ^ y) >> 1 x + y + 0x66666666 capture carry t2 = (x + y + 0x66666666) >> 1 t1 ^ t2 t3 = (t1 ^ t2) & 0x88888888 x + y + t3 t3 >> 2 x + y + t3 - (t3 >> 2)

; here EDX:EAX = sum mov mov mov edi, z [edi], eax [edi+4], edx

The use of carries and rcr instructions opposite their C equivalent (sort of) shows the massive simplifications possible in assembly that are not offered in C. You can find more on BCD arithmetic here at Douglas W. Jones' website 10. Pixel bashing "Gordy" has been working on a software based graphics library and asked on USENET for optimized paths for many common pixel bashing routines. Here are some of the solutions that I presented. The first is an MMX routine for compressing 4 bit per pixel to 1 bit per pixel in a specific ordering. The pixel translation simply takes the bottom bit from each nibble. The order of the original nibbles in a given qword is: 21436587 (don't ask me -- its Gordy's library.) We can assume that the nibble bits are premasked (i.e., the other bits are 0.)
MagicConst dd 48124812h,48124812h ;.... movq mm7, MagicConst pcmpeqb mm6, mm6 ; 1111111111111111 psllw mm6, 12 ; 1111000000000000 @quads: movq mm0, [esi] ; 0006000500080007 movq mm1, mm6 ; XXXX000000000000 pmullw mm0, mm7 ; 8765705800870070 pand mm1, mm0 ; 8765000000000000 psrld mm0, 28 ; 0000000000004321 psrlw mm1, 8 ; 0000000087650000 por mm0, mm1 ; 0000000087654321 packuswb mm0, mm0 ; [?B?A?B?A] packuswb mm0, mm0 ; [BABABABA] movd eax, mm0 mov [edi],ax add esi, 8 add edi, 2 dec ecx jnz @quads

Since the bits are isolated, a sequence of shift and ors can be replaced by a sequence of shifts and adds. But a sequence of shifts and adds can be

made into a single multiply. From a performance point of view this is a good trade off since pmullw is a fully pipelined instruction and in a modern x86, many internal instances of this loop should be able to execute in parallel. This solution consumes 64 source bytes in just a handful of clocks -- its not likely that any C based formulation could even come close. The second was for averaging streams of bytes together (and rounding down.) Gordy was looking for a solution for pre-K6-2 MMX/3DNow! (which did not have pavgusb or pavb instructions) as well as a straight integer solution. Here are the solutions I proposed:
; Integer Solution by Paul Hsieh LTop: mov mov mov and xor shr and add mov dec jnz

eax, [esi] ebx, [edx] ebp, eax eax, ebx ebx, ebp ebx, 1 ebx, 7F7F7F7Fh eax, ebx [edi], eax ecx LTop

; MMX Solution by Paul Hsieh LTop: movq movq movq pand pxor psrlb paddb movq dec jnz

mm0, [esi] mm1, [edx] mm2, mm0 mm0, mm1 mm1, mm2 mm1, 1 mm0, mm1 [edi], mm0 ecx LTop

These solutions use the follow observation:

A + B = (A and B) * 2 + (A xor B) And therefore: (A + B)/2 = (A and B) + (A xor B)/2

While the original (A + B) addition may overflow, the addition (A and B) + (A xor B)/2 will not. The third routine was one for performing a saturating add. Using MMX its just a trivial application of the paddusb instruction (as "Gordy" pointed out), but what of the integer fallback routine? I came up with the following integer routine:
; Integer solution by Paul Hsieh mov mov mov mov and xor and and add sar and add mov and and shr shl and or xor sub and or shl or eax, ebx, ecx, edx, ecx, edx, eax, ebx, eax, eax, ecx, eax, ecx, edx, eax, ecx, eax, ecx, eax, ecx, ecx, ecx, eax, ecx, eax, [esi] [edi] eax ebx ebx eax 0xFEFEFEFE 0xFEFEFEFE ebx 1 0x01010101 ecx eax 0x01010101 0x7F7F7F7F 7 1 0x01010101 edx 0x81818181 0x01010101 0x7F7F7F7F ecx 1 ecx ; src0 ; src1

; first bit carry ; first bit add (mod 2)

; ; ; ;

Add top 7 bits (A&0xFE)+(B&0xFE) >>= 1 to capture carry bits Is there a carry to the second bit? (...)>>1

; first bit

; ; ; ; ; ;

(...)&0xFE overflows ((...)&0xFE) + (((A&0x01)+(B&0x01))&1) blockers 0->80, 1->7F 0->00, 1->7F

This time we are seperating the bottom bit to make sure we can perform parallel adds with only 8 bits of work area at a time.
A = (A&0xFE) + (A&0x01) B = (B&0xFE) + (B&0x01) So A + B = ((A&0xFE)+(B&0xFE)) + ((A&0x01)+(B&0x01)) = ((A&0xFE)+(B&0xFE)) + ((A&B&0x01)*2 + ((A xor B)&0x01))

The carry over into the 9th bit is translated into a 0x00 or 0x7F mask. By or-ing in the mask into the destination then shifted left by one, this achieves the same effect as if the mask contained 0x00 or 0xFF values. This time, since the carry flag is captured and used on the first add, the C compiler cannot duplicate this routine. Update: Oliver Weichhold posted an interesting question in comp.lang.asm.x86:
I need to convert a color value (actually a texel fetched from a A4R4G4B4 texture) to a A15R15G15B15 color value held in an MMX register.

While one might instictively look at clever ways of performing the multiple shifts and masking that might seem to be required, its really not neccesary. Consider that what you really want is a mapping of 16 input bits that you just want placed in some output positions, where each input bit is independent from the other bits. This just screams "Look up tables" and indeed it leads to some pretty simple code:
movzx movzx shr movd movd punpckldq eax, ebx, eax, mm0, mm1, mm0, word ptr input_A4R4G4B4 al 8 dword ptr Table_G4B4_to_G15B15 [ebx*4] dword ptr Table_A4R4_to_A15R15 [eax*4] mm1

The tables can be set up with some kind of simplistic C code at initialization time. 11. Converting Uppercase Conventional wisdom say that to quickly convert an ASCII string to uppercase, that one should use a table to convert each character. The problem is that the address mode of x86 processors require 32 (or 16) bit registers, so some kind of byte -> dword (or word) conversion also happens. This leads to partial register conversion which is penalized on AMD CPUs and is devastating to Intel CPUs. So what is our alternative? Well, if we can amortize our performance using SIMD, then what we are really trying to do is translate the range [61h, 7Ah] to [41h, 5Ah]. The approach I came up with was to exploit the fact that ASCII is only well defined in the range [00h, 7Fh] by rotating the

range then masking, then rotating and capturing the middle range into the 5th bit and subtracting that from the 4 input bytes:
; Integer SIMD uppercase(string) by Paul Hsieh mov mov mov and add and add shr and sub eax, ebx, edx, ebx, ebx, ebx, ebx, ebx, ebx, eax, src[esi] 7f7f7f7fh 7f7f7f7fh eax 05050505h edx 1a1a1a1ah 2 20202020h ebx

; ; ; ;

61-7a 66-7f 66-7f 80-99

| | | |

00-60,7b-7f 05-65,80-84 00-65 1a-7f

; 20 | 00 ; 41-5a | 00-60,7b-7f

This routine is within reach of most modern C compilers. (Stay tuned for the C version.) A cursory glance, however, reveals that this routine is easily converted to MMX. Update: The need to be able to support non-ASCII (i.e., include the range [80h, FFh]) has come up. One of the bonuses of being able to do the whole range is that UTF-8 encoded unicode can be directly translated this way. Anyhow, its just a minor change, that unfortunately costs a cycle on the critical path:
; Integer SIMD uppercase(string) on UTF-8 data by Paul Hsieh mov mov mov and mov add not and add and shr and sub eax, ebx, edx, ebx, ecx, ebx, ecx ebx, ebx, ebx, ebx, ebx, eax, src[esi] 0x7f7f7f7f 0x7f7f7f7f eax eax 0x05050505 edx 0x1a1a1a1a ecx 2 0x20202020 ebx

; 61-7a | 00-60,7b-7f ; ; ; ; ; 66-7f | 05-65,80-84 flip top bits 66-7f | 00-65 80-99 | 1a-7f turn off top bit for non-ASCII

; 20 | 00

Update: Norbert Juffa wrote in to inform me that as part of Sun's opening up of the Solaris source code they have made some of his sources used for some C string functions, including caseless comparisons and tolower available to the public. You can view the source here.

Update: I have designed an alternative function that deals with the entire UTF-8 range, but is also more parallel and thus should execute somewhat faster:
; Integer SIMD uppercase(string) on UTF-8 data v2 by Paul Hsieh mov mov mov or mov sub sub not not and and and shr sub eax, ebx, edx, ebx, ecx, ebx, ecx, ebx edx ebx, edx, ebx, ebx, eax, src[esi] 0x80808080 eax eax ebx 0x7b7b7b7b 0x61616161

; src ; (0x80808080 | src) ; ; ; ; ; ; ; ; ; (0x80808080 | src) - 0x7b7b7b7b c = (0x80808080 | src) - 0x61616161 d = ~((0x80808080 | src) - 0x7b7b7b7b) ~src c & d ~src & 0x80808080 e = (c & d) & (~src & 0x80808080) e >> 2 src - (e >> 2);

ecx 0x80808080 edx 2 ebx

The C code is fairly comparable:

; Integer SIMD uppercase(string) on UTF-8 data v2 by Paul Hsieh uint32_t uint32_t uint32_t uint32_t uint32_t } upperCaseSIMD (uint32_t x) { b = 0x80808080ul | x; c = b - 0x61616161ul; d = ~(b - 0x7b7b7b7bul); e = (c & d) & (~x & 0x80808080ul); return x - (e >> 2);

12. Move bits Norbert Juffa continues to study the Compaq Visual Fortran compiler (soon to become the Intel Fortran compiler) this time looking at the MVBITS command. A description of the function can be found here.
MVBITS (FROM, FROMPOS, LEN, TO, TOPOS) Description: Copies a sequence of bits (a bit field) from one location to another. Class: Elemental subroutine Arguments: There are five arguments: FROM Can be of any integer type. It represents the location from which a bit field is transferred. FROMPOS Can be of any integer type; it must not be negative. It identifies the first bit position in the field transferred from FROM. FROMPOS + LEN must be less than or equal to BIT_SIZE (FROM).

LEN TO

TOPOS

Can be of any integer type; it must not be negative. It identifies the length of the field transferred from FROM. Can be of any integer type, but must have the same kind parameter as FROM. It represents the location to which a bit field is transferred. TO is set by copying the sequence of bits of length LEN, starting at position FROMPOS of FROM to position TOPOS of TO. No other bits of TO are altered. On return, the LEN bits of TO (starting at TOPOS) are equal to the value that LEN bits of FROM (starting at FROMPOS) had on entry. Can be of any integer type; it must not be negative. It identifies the starting position (within TO) for the bits being transferred. TOPOS + LEN must be less than or equal to BIT_SIZE (TO).

; Solution by Norbert Juffa mov mov cmp sbb shl mov not shr mov and shl shl not and or ecx, eax, ecx, edx, edx, ecx, edx eax, ecx, eax, edx, eax, edx edx, eax, [LEN] ; [FROM] ; 32 ; edx ; cl ; [FROMPOS]; ; cl ; [TOPOS] ; edx ; cl ; cl ; ; [TO] ; edx ; len from (len < 32) ? CY : NC (len < 32) ? ~0 : 0 (len < 32) ? ((~0) << len) : 0 frompos mask = (len < 32) ? ~((~0) << len) : ~0 from >> frompos topos (from >> frompos) & mask mask << topos bits << topos ~(mask << topos) to & ~(mask << topos) to=(to&~(mask<<topos)|((from>>frompos)&mask)

13. Register slide, register shift Norbert Juffa, who was looking at some 64bit shift code (for 32bit x86 registers) reduced the problem to finding a good solution for:
if (ECX == 0) { EAX = EDX EDX = 0 } else { /* ECX == 0xFFFFFFFF */ EAX = EAX EDX = EDX }

under the assumption that ecx is either 0 or FFFFFFFF. After we both fumbled around quite a bit going down wrong tracks, Norbert hit upon the right idea:
; Solution by Norbert Juffa

sub and add and

eax, eax, eax, edx,

edx ecx edx ecx

; ; ; ;

a c c c

? ? ?

d a - d : 0 a : d d : 0

However Norbert instead went for a simple branch based solution, combined with a cmovcc solution for newer processors that have this instruction:
; Solution by Norbert Juffa __declspec (naked) __int64 __stdcall MYLSHIFT64 (const __int64 *i, const int *sh) { __asm { mov ecx, [esp+8] ; &sh mov eax, [esp+4] ; &i mov ecx, [ecx] ; sh mov edx, [eax+4] ; i_hi mov eax, [eax] ; i_lo shld edx, eax, cl shl eax, cl #if (CPU == i386) test ecx, 32 jz $lshift_done mov edx, eax xor eax, eax $lshift_done: cmp ecx, 64 sbb ecx, ecx and eax, ecx and edx, ecx #else /* Athlon, P6+ */ test ecx, -64 ror ecx, 6 mov ecx, 0 cmovc edx, eax cmovc eax, ecx cmovnz edx, ecx cmovnz eax, ecx #endif ret 8 } } ; sll (i,sh & 0x1f) ; ; sh >= 32 ? ; nope, done ; sll (i,32) ; ; (sh>=64)||(sh<0) ; (sh>=64)||(sh<0) ? 0 : 0xffffffff ; (sh>=64)||(sh<0) ? 0 : sll (i,sh) ; ; ; ; ; ; ; ; (sh>=64)||(sh<0) ? NZ : ZR (64>sh>=32) ? CY : NC(ZF safe) 0 i=(64>sh>=32) ? sll(i,32) : i (sh>=64)||(sh<0) ? 0 : sll (i,sh)

; pop two DWORD parms and ret

shld, cmovcc, and ror are not likely to show up in the output of any known compiler. 14. 64 bit integer question

Phil Carmody (a well known prime number searching hacker) posted to alt.lang.asm with the following question:
I've got 2 64-bit (never above 2^40, for reference) integers, and in C, I'm doing: /* multiply tpow (in 1..p-1) by 2, modulo p */ tpow <<= 1; if (tpow > p) { tpow -= p; } Now the compiler turns the shift into a shl and a shld, which I don't have an issue with. However, the if statement is rendered using cmps and jumps, which, as the value is utterly unpredictable, causes me to take a bit of a performance hit. What is the suggested 64-bit branchless bit-twiddle to do the above? I know the 32-bit equivalent is some funky combination of subs, sbbs, ands, and other stuff, does it convert to 64-bits? Is there a cmovcc version even?

I responded with the following:

Well, it sounds like you want something like: tpow <<= 1; tpow -= p & -(tpow > p); Now, so how do we do -(tpow > p) in a totally predicate manner? Lets see this is the same as -(p - tpow < 0), so we want to perform a 64 bit subtract of tpow from p and just capture the carry, then use it as the mask and it against p then subtract from tpow. mov mov add adc mov mov sub sbb sbb mov mov and and esi, edi, esi, edi, eax, edx, eax, edx, ebx, eax, edx, eax, edx, tpow[0] tpow[4] esi edi ; tpow << 1; -- Don't use SHLD. p[0] p[4] esi edi ; p - tpow ebx ; -(p - tpow < 0) negp[0] negp[4] ebx ebx ; p & -(p - tpow < 0)

sub esi, eax sbb edi, edx ; tpow - (p & -(p - tpow < 0)) mov tpow[0], esi mov tpow[4], edi ; tpow -= p & -(p - tpow < 0) Voila! Ok, I think cmovCC should be even easier. Your code is equivalent to: tpow <<= 1; tpow = (tpow - p >= 0) ? tpow - p : tpow; So we will just compute tpow - p, capture the CF flag, then conditionally overwrite

tpow: mov mov add adc esi, edi, esi, edi, tpow[0] tpow[4] esi edi ; tpow << 1; -- Don't use SHLD.

mov eax, esi mov edx, edi ; Annoying dependency. sub eax, p[0] sbb edx, p[4] ; tpow - p cmovnb esi, eax cmovnb edi, edx ; (tpow - p >= 0) ? tpow - p : tpow mov tpow[0], esi mov tpow[4], edi ; tpow = (tpow - p >= 0) ? tpow - p : tpow

15. 63 bit atomic counter In comp.lang.asm, Matt Taylor posted that he was in search of a "lockless 63-bit counter" using 32 bit code (certainly using AMD's new AMD64 instruction set would make this trivial, but Matt is limited to 32 bit code.) I.e., just a memory location and code sequence that could perform an atomic increment and read even in multiprocessor scenarios. One issue is the fact that most memory operations are broken up over the bus as individual read and writes so ALU + memory operations will not be atomic -- however this can be dealt with with the lock prefix. The real issue is that its impossible to make two individual 32 bit memory accesses truly atomic. First let us take a second to show a solution that doesn't work:

The problem is that between [[1]] and [[2]] your thread could be preempted by another thread trying to execute the same code. This means that any given cycle the 64 bit value in [counter] can easily be innaccurate, and even grossly so. This is what is known as a race condition. There are endless incorrect ways of trying to patch this sort of code up, which can take some analysis to discover the flaws. I will skip all those as you can/should learn about them in college level OS class. So lets start by presenting two trivial solutions. The first is using AMD64 code:

So this describes the semantics that we are looking for -- we want to increment and return with the incremented value with respect to the local context. So not only do we require that we do this for 64 bits, but we require that we also get a read back of the counter in a manner that reflects that value before any other thread increments the counter. Of course, given that AMD64 is such a new instruction set, its going to be a little while before we can deploy such a solution widely. But starting with the Pentium Pro instruction set, even 32 bit x86s have had access to the 64 bit test and set instruction: CMPXCHG8B:

So the idea is to speculatively hope that the counter will remain untouched by other threads from [[1]] through to [[3]]. The CMPXCHG8B instruction allows us to verify this before we commit to the increment operation. If there is any inconsistency (including a mismatch between [[1]] and [[2]] due to an intervening increment operation) then we just put ourselves at the back of the line of pending increments and retry the operation. We can already see with this solution we've entered a realm where the theory is starting to frown on our assumptions. There is no guarantee that any given thread will eventually complete as it might be stuck in this retry loop for an arbitrary amount of time. In fact, when Matt timed this in a benchmark (presumably meant to really hammer on it with many threads and multiple CPUs) he found the performance of this code to be very unsatisfactory for precisely this reason -- so it has practical consequences too. The "63 bit counter" comes from the idea of using one of the bits as a kind of flag to indicate that a wrap around has happened. We can assume that most of the time, the increment will only affect as few as 30 bits and that we only need to specifically handle the specific wrap around cases. With some help from Terje Mathisen and Matt Taylor, I came up with the following code (based on the XADD instruction which has been available since the 80486 processor):

The general idea is that the bottom 31 bits are just the bottom 31 bits of the lock, the 31st bit indicates a wrap has occurred, and the 30th bit can

also indicate to us whether any noticed discrepancy in the high 32 bits was due to a wrap or not. With the assumption that no more than 30 bits of (about a billion) increments will happen during a contention we're no longer in anything resembling sound theoretical territory. For example, there is an obvious race condition between [[1]] and [[2]] if, in fact, there were 4.3 billion increments from other threads in between the two operations due to a preemption. Nevertheless, having a billion threads take action between a few instructions is not a realistic scenario, so in practice this should work fairly well. Atomic primitives like this are always tricky, so it pays to try to see why it works. (Said in another way, you should not trust this code, or anyone else's, unless you understand how it works.) Ok, so by using XADD on the lower 32 bits, there isn't really a possibility that the lower 31 bits will go wrong. And when the 31 bit wrap occurrs, the exact thread where the wrap occurrs will be the sole thread that tries to correct the high word. When the high word is incorrect, its exactly one less than the correct value and the "lock bit" (bit 31 of the low word) will be set. So it can be seen that the incrementing action keeps the bits in a well defined and deterministic state. The specific case of the wrapping locker's read is taken care of precisely, so let us assume that we are one of the other threads. When reading, if the high half is consistent with the low half (i.e., by sandwiching the low half read between two high half reads) and there is no lock bit set, then we are set. If there is an inconsistency its because some other thread has wrapped. We can then examine our own bit 30 to see whether we are supposed to be post or pre wrap which tells us whether we should take the later or earlier high half read respectively. Finally if the lock bit is set, we know that we are post-increment, and we just wait until either the lock has been cleared or the high half changes then just reread the high half. The commented out CLI and STI instructions show where these instruction can and should go to try to reduce the exposed region of uncertainty in the lock value (leading other threads to spin). On a single processor system they are sufficient to prevent any contention, and in multiprocessor systems it would give a kind of guarantee about the amount of time the counter would not be in a trivial state. But of course you require ring 0 access to execute these instructions, and they are not required for functional correctness.

And in C? Not a player I'm afraid. Even with OS support, the language cannot compete with this kind of low level access. Stay tuned for more ...

optimized constant multiplies optimized constant divides Tiny Little Assembly Gems The Aggregate Magic Algorithms John Eckerdal's Assembly Gems Tennie Remmel's Programming Tips and Tricks

Answers 2 Reviews and Exercises
No ratings yet
Answers 2 Reviews and Exercises
26 pages
Solutions: CS152 Computer Architecture and Engineering
No ratings yet
Solutions: CS152 Computer Architecture and Engineering
17 pages
Embedded C Programming
100% (1)
Embedded C Programming
57 pages
Programming Assessment 1
No ratings yet
Programming Assessment 1
4 pages
Arithmetic Instructions CEK
No ratings yet
Arithmetic Instructions CEK
23 pages
Assignment # 5 - 5
No ratings yet
Assignment # 5 - 5
6 pages
algo lab
No ratings yet
algo lab
21 pages
Chapter 2 Exercise Problems
No ratings yet
Chapter 2 Exercise Problems
10 pages
Assembly Paper Key
No ratings yet
Assembly Paper Key
7 pages
CH02 Solution
No ratings yet
CH02 Solution
12 pages
Bit Wise Challenges
No ratings yet
Bit Wise Challenges
55 pages
Standard Cheat Sheet
No ratings yet
Standard Cheat Sheet
6 pages
Modern Computer Architecture and Programming in Assembly Language - TCM - 183 - 1309076
No ratings yet
Modern Computer Architecture and Programming in Assembly Language - TCM - 183 - 1309076
131 pages
Integers Floating Point: N N S E
No ratings yet
Integers Floating Point: N N S E
4 pages
Amp Experiments
No ratings yet
Amp Experiments
32 pages
Section A: COA Comprehensive Makeup Examination 2014. Answer Key
No ratings yet
Section A: COA Comprehensive Makeup Examination 2014. Answer Key
5 pages
Lab 4
No ratings yet
Lab 4
10 pages
Exp No.:5 Date: Arithmetic Operations
No ratings yet
Exp No.:5 Date: Arithmetic Operations
12 pages
Department of Electrical Engineering
No ratings yet
Department of Electrical Engineering
8 pages
ECE3073 PA Validation and Testing Answers PDF
No ratings yet
ECE3073 PA Validation and Testing Answers PDF
6 pages
Eee 3132 Test 1_2024_solutions
No ratings yet
Eee 3132 Test 1_2024_solutions
8 pages
Document 1
No ratings yet
Document 1
21 pages
Assignment 1 Model Answers
No ratings yet
Assignment 1 Model Answers
5 pages
LAB03 Report
No ratings yet
LAB03 Report
8 pages
Embedded System Ipcc Laboratory
No ratings yet
Embedded System Ipcc Laboratory
11 pages
0233 BSCS 21 COAL Assignment 2
No ratings yet
0233 BSCS 21 COAL Assignment 2
16 pages
ESC 101: Fundamentals of Computing Name Roll No Dept. Section
No ratings yet
ESC 101: Fundamentals of Computing Name Roll No Dept. Section
8 pages
CS33 Midterm 2013 Soln All
No ratings yet
CS33 Midterm 2013 Soln All
10 pages
Microprcessor 2
100% (1)
Microprcessor 2
16 pages
Assembly Questions Complete Document (1)
No ratings yet
Assembly Questions Complete Document (1)
11 pages
QP 1
No ratings yet
QP 1
3 pages
Documentation For This Question. Assumptions Can Be Made Wherever Necessary
No ratings yet
Documentation For This Question. Assumptions Can Be Made Wherever Necessary
9 pages
Exam1 f12
No ratings yet
Exam1 f12
15 pages
8086
No ratings yet
8086
15 pages
Algoritmi Pentru Operatile Pe Biti
No ratings yet
Algoritmi Pentru Operatile Pe Biti
41 pages
Computer Organization & Assembly Language: MUL & DIV Instructions
No ratings yet
Computer Organization & Assembly Language: MUL & DIV Instructions
19 pages
MPMC Assignment_Madhumitha
No ratings yet
MPMC Assignment_Madhumitha
14 pages
2007 Fall Exam1 Solpp
No ratings yet
2007 Fall Exam1 Solpp
6 pages
Computer Organization & Assembly Language: CS/COE0447
No ratings yet
Computer Organization & Assembly Language: CS/COE0447
30 pages
assignment1_solution
No ratings yet
assignment1_solution
2 pages
ECE 252 - Quiz - 1 - Solutions
No ratings yet
ECE 252 - Quiz - 1 - Solutions
5 pages
8086 ALP'S
No ratings yet
8086 ALP'S
41 pages
Embedded Systems: Prolific Systems and Technologies PVT LTD
No ratings yet
Embedded Systems: Prolific Systems and Technologies PVT LTD
17 pages
Assembly Language: Arithmetic and Logic Instructions
No ratings yet
Assembly Language: Arithmetic and Logic Instructions
26 pages
Instruction Format
No ratings yet
Instruction Format
29 pages
PFC Workshop04 2015
No ratings yet
PFC Workshop04 2015
11 pages
C-Full Programs 001
No ratings yet
C-Full Programs 001
25 pages
Do Hoang Tu - Operating System From 0 To 1 (2022) - Removed - Removed - Removed
No ratings yet
Do Hoang Tu - Operating System From 0 To 1 (2022) - Removed - Removed - Removed
21 pages
PriyanshJain_21116076_HW2
No ratings yet
PriyanshJain_21116076_HW2
6 pages
MIAE 215 Entire Course Package
No ratings yet
MIAE 215 Entire Course Package
88 pages
CT1sol
No ratings yet
CT1sol
4 pages
23F-0043-BAI-4A-COAL-Assignment-1.docx
No ratings yet
23F-0043-BAI-4A-COAL-Assignment-1.docx
14 pages
EE323 Assignment 2 Solution
No ratings yet
EE323 Assignment 2 Solution
5 pages
ESC 101: Fundamentals of Computing Name Roll No Dept. Section
No ratings yet
ESC 101: Fundamentals of Computing Name Roll No Dept. Section
8 pages
Problem Set-2
No ratings yet
Problem Set-2
16 pages
Week-7 solutions
No ratings yet
Week-7 solutions
13 pages
Fuzzy
No ratings yet
Fuzzy
259 pages
Amazing Java: Learn Java Quickly
From Everand
Amazing Java: Learn Java Quickly
Andrei Besedin
No ratings yet
Learn Programming Using C#
From Everand
Learn Programming Using C#
Taurius Litvinavicius
No ratings yet
Projects With Microcontrollers And PICC
From Everand
Projects With Microcontrollers And PICC
Guillermo Perez Guillen
5/5 (1)
7 Logic Families
No ratings yet
7 Logic Families
5 pages
CCDI 22 23 T1 G12Adv Checkpoint 1 Sample MCQs
No ratings yet
CCDI 22 23 T1 G12Adv Checkpoint 1 Sample MCQs
3 pages
Lecture 1 Slides
100% (1)
Lecture 1 Slides
11 pages
Bro ibaDAQ-x en
No ratings yet
Bro ibaDAQ-x en
12 pages
Assignment Solution Week10
No ratings yet
Assignment Solution Week10
4 pages
Microprocessors and Microcontrollers Notes - Programs For 16 Bit Arithmetic Operations For 8086 (Using Various Addressing Modes) - Studentboxoffice
100% (1)
Microprocessors and Microcontrollers Notes - Programs For 16 Bit Arithmetic Operations For 8086 (Using Various Addressing Modes) - Studentboxoffice
16 pages
Dmidecode
No ratings yet
Dmidecode
33 pages
Microprocessor and Interfacing Devices/Peripherals: 8086 Instructions Set
No ratings yet
Microprocessor and Interfacing Devices/Peripherals: 8086 Instructions Set
24 pages
Product PDF
No ratings yet
Product PDF
4 pages
Nov - Dec - 2023-3 EP
No ratings yet
Nov - Dec - 2023-3 EP
2 pages
WINSEM2023-24 CBS1004 ETH VL2023240503603 2024-03-18 Reference-Material-II
No ratings yet
WINSEM2023-24 CBS1004 ETH VL2023240503603 2024-03-18 Reference-Material-II
12 pages
Authentic Task 2 - Rubric
No ratings yet
Authentic Task 2 - Rubric
2 pages
Assignment Cover Sheet Qualification Module Number and Title
100% (1)
Assignment Cover Sheet Qualification Module Number and Title
16 pages
Parts of The Computer Presentation
No ratings yet
Parts of The Computer Presentation
13 pages
A Level Computer Science 9608 - Syllabus
No ratings yet
A Level Computer Science 9608 - Syllabus
48 pages
Computer Organization
No ratings yet
Computer Organization
5 pages
Oracle Database Performance Tuning Advanced Features and Best Practices For Dbas
No ratings yet
Oracle Database Performance Tuning Advanced Features and Best Practices For Dbas
238 pages
Coa
No ratings yet
Coa
11 pages
RX3i Guide PDF
No ratings yet
RX3i Guide PDF
53 pages
MIS BBA 6th Sem UNIT 1-2-3
No ratings yet
MIS BBA 6th Sem UNIT 1-2-3
167 pages
Os Book
No ratings yet
Os Book
180 pages
Operating System: Digital Notes by
No ratings yet
Operating System: Digital Notes by
86 pages
Unit_5_Advance Microcontroller
No ratings yet
Unit_5_Advance Microcontroller
23 pages
Unit-3 Memory-Management
No ratings yet
Unit-3 Memory-Management
81 pages
Ict Notes
100% (3)
Ict Notes
63 pages
Overview of Microprocessors: Lecturer: Sri Parameswaran Notes By: Annie Guo
No ratings yet
Overview of Microprocessors: Lecturer: Sri Parameswaran Notes By: Annie Guo
53 pages
8th Gen Core Family Datasheet Vol 1
No ratings yet
8th Gen Core Family Datasheet Vol 1
134 pages
Karachi University Semester Course Outline
No ratings yet
Karachi University Semester Course Outline
2 pages
Grade 11 Chapter 3 Computer Science
No ratings yet
Grade 11 Chapter 3 Computer Science
39 pages
BATCH_06_COA_M1_T3 _IEEE
No ratings yet
BATCH_06_COA_M1_T3 _IEEE
5 pages

Assembly Language Lab

Uploaded by

Assembly Language Lab

Uploaded by

Assembly Language Lab

eax,1 L3: eax,eax L2 eax,ebx ebx,ebx L5 ebx,eax L5 L4 eax,ebx L3 ebx,eax L3

jne add jne inc ret

// min(x,y-x) // (y-x)+(x) - min(x,y-x)

But on a 32 bit machine:

// (y-x) - (x) // min(y-x,x) // (y-x)+(x)-min(x,y-x) == max(x,y-x)

Once again, we take the Pepsi challenge and we get:

ebx,ecx ; 5 dep: eax,ebx ; 6 dep: edx,eax ; 7 dep: eax,eax ; 7 L1 ; 7 eax,edx

here's the assembly output of the compiler versus my hand coding:

je L1: mov test jne add dec jne +1brt L2:

L2 eax,[esi] eax,eax L2 esi,8 ecx L1

je sub L1: add sub sbb or jz +1brt mov L2:

L2 esi,8 esi,8 ecx,1 eax,eax eax,[esi] L1 eax,[esi]

(b1<<1)|(b0>>31); (b1<<2)|(b0>>30); (b1>>31); (b1>>30); H1^H2^b0 &0x7FFFFFFF; L1^L2;

; 8 ; 9 ;10 ;11 ;12 ;13 ;13 ;14

adc lea shr xor xor xor shr xchg xor

cl,0 edx,[edx*2] ebx,30 edx,esi edx,ebx al,cl ebx,1 eax,edx eax,ebx

eax 2 [ecx+eax-8] edx

0x01010101 ; cycle 2 0x80808080 edx ; cycle 3

; cf = 1 if al!=0 or 0 if al==0 ; bl = -1 if al!=0 or 0 if al==0

; dest &bg.mask (P6: PRS) ; merge with src

; read sprite source bits

; 80808080 bits have non-zero status of

mov cmp sbb and mov inc dec jnz

al,[esi+edi] al,1 ebx,ebx ebx,edx [ebx+edi],al edi ecx loop

; well predicted branch

Paolo also submitted a trivial MMX loop:

I thought of two ways:

psrld packssdw punpckhwd pmullw

mm0, mm0, mm0, mm0,

24 mm0 mm0 Const_0101010101010101

These solutions use the follow observation:

; first bit carry ; first bit add (mod 2)

(...)&0xFE overflows ((...)&0xFE) + (((A&0x01)+(B&0x01))&1) blockers 0->80, 1->7F 0->00, 1->7F

61-7a 66-7f 66-7f 80-99

00-60,7b-7f 05-65,80-84 00-65 1a-7f

ecx 0x80808080 edx 2 ebx

The C code is fairly comparable:

sub and add and

eax, eax, eax, edx,

edx ecx edx ecx

; pop two DWORD parms and ret

I responded with the following:

You might also like