Here is an implementation in ARM of the Eratosthenes sieve, especially optimized for the Risc PC with a StrongARM processor (202 MHz). With this configuration, the bandwidth is very low compared to the processor speed. After testing the version 7 of the Eratosthenes sieve, we could notice that a program that writes a constant into memory (with the same kind of memory access, i.e. using the cache by reading non initialised data) is only about 10% faster than version 7. Thus we seek to:

- decrease the number of writes to the memory (cache or RAM) by avoiding redundant writes (as in version 7),
- use the cache as much as possible by splitting the array of booleans into blocks (that hold in the cache) and by processing these blocks the one after the other (as in version 7),
- avoid loading non initialised data into the cache (contrary to
version 7): the cache will continuously contain a fixed block where
the intermediate operations will be performed; at the end of each iteration,
this block will be copied to the array into RAM by using
`STM`s with 4 registers (faster than`STR`s).

Contrary to version 7, we will not seek to use the *write
buffer* as much as possible... maybe in a future version?

You will find an article (in French) with more details in issue 0C of ARMada News (review published by the ARMada, a French association of users of Acorn machines).

The ARM routine is below. You can assemble it
with !ObjAsm, possibly after replacing the file name
`h:RegNames`. To use it, you can take the small C program below
the ARM routine. Otherwise you can call it as follows to
search for the primes from 0 to `N`²-1: register `R0`
must contain the integer `N` (higher or equal to 5), register
`R1` must contain the address (multiple of 32) of an array of
32.⌈`N`²/32⌉ bytes, register `R2` must contain the
address (multiple of 32) of another array (for temporary storages) and
register `R3` must contain the block size (multiple of 480),
which must be less or equal to the size of array `(R2)` minus
8.π(`N`), where π(`N`) is the number of primes less or
equal to `N`. At the return, the registers are not modified and
the first `N`² bytes of array `(R1)` tell whether the
corresponding number is a prime (1) or not (0).

Note: the output format (array of bytes) was chosen and fixed in advance.
One could have had a faster

routine if another format, using less
memory (array of bits and/or non-stored even numbers), had been chosen.
However, this changes the problem, which consists in finding an optimization
for a fixed output format. Anyway the same techniques would be used.

; Eratosthenes Sieve, version 8 (1997-02-11) ; Search for the prime numbers up to N^2-1. ; In: ; R0: integer N >= 5. ; R1: array of 32*CEIL(N^2/32) bytes. Address: multiple of 32. ; R2: array (temporary data). Address: multiple of 32. ; R3: block size (multiple of 480), <= size(R2) - 8*pi(N). ; Out: ; The registers are not modified. ; Results in R1[0..N^2-1]: 0 (not a prime) or 1 (prime). GET h:RegNames ;Define R0, ..., R14, SP, LR. AREA area_erat, READONLY, CODE, PIC EXPORT erat erat STMFD SP!, {R0,R4-R12,LR} ADR R12, offsets MOV R11, #0 MOV R10, R2 MLA R5, R0, R0, R1 ADD R4, R2, R3 MOV R0, #0 STR R0, [R4] ;Empty list. ; In the following: ; R0: 0. ; R1: array of booleans (will contain the results). ; R2: block where the storages are performed (should be in the cache). ; R3: size of block (R2), multiple of 480. ; R4: list of couples (prime, address). ; R5: end of array (R1). ; R6: prime number p, in general. ; R9: pointer into block (R2). ; R10: address of the array read to find the next prime number ; (R2 at the first iteration, then R1). ; R11: offset of the image of array (R1) in block (R2). ; R12: array of offsets. ; Load block (R2) into the cache. MOV R9, R2 load_block LDR R14, [R9], #32 ;32: cache line size. CMP R9, R4 BCC load_block ; Block initialisation: suppression of the multiples of 2, 3 and 5 only. ; One stores: 0100 0001 0001 0100 0101 0001 0000 0101 0000 0100 0101 0001 ; 0100 0100 0001... (period) next_block MOV R9, R2 MOV R8, #&01000000 ;0001. MOV R6, #&00000100 ;0100. ORR R7, R8, R6 ;0101. init_block STMIA R9!, {R6,R8} STR R8, [R9], #4 STMIA R9!, {R6-R8} STMIA R9!, {R0,R7} STMIA R9!, {R0,R6-R8} STR R6, [R9], #4 STMIA R9!, {R6,R8} CMP R9, R4 BCC init_block MOV R8, R4 ;R8: beginning of the list B old_prime ;of the (p,addr)'s. ; Suppress the multiples of p in the block. old_loop STRB R0, [R9], R7, LSL #1 ;Suppress p(30k+1). CMP R9, R4 BCS old_store1 STRB R0, [R9], R6, LSL #2 ;Suppress p(30k+7). CMP R9, R4 BCS old_store2 STRB R0, [R9], R6, LSL #1 ;Suppress p(30k+11). CMP R9, R4 BCS old_store3 STRB R0, [R9], R6, LSL #2 ;Suppress p(30k+13). CMP R9, R4 BCS old_store4 STRB R0, [R9], R6, LSL #1 ;Suppress p(30k+17). CMP R9, R4 BCS old_store5 STRB R0, [R9], R6, LSL #2 ;Suppress p(30k+19). CMP R9, R4 BCS old_store6 STRB R0, [R9], R7, LSL #1 ;Suppress p(30k+23). CMP R9, R4 BCS old_store7 STRB R0, [R9], R6, LSL #1 ;Suppress p(30k+29). CMP R9, R4 BCC old_loop old_store0 STMDB R8, {R6,R9} old_prime LDMIA R8!, {R6,R9} ;Next (p,addr) or R6 = 0. ADD R14, PC, R6, LSR #25 BICS R6, R6, #&F8000000 ;C = 1. SUB R9, R9, R3 CMPNE R9, R4 ;Note: R9 (odd) != R4 (even). ADDCC R7, R6, R6, LSL #1 ;R7 = 3p. SUBCC PC, R14, #4*28 STRNE R9, [R8, #-4] BNE old_prime ; New prime numbers... SUB R8, R8, #8 TEQ R8, R4 LDRNE R6, [R8, #-8] ;R6: last prime in the list BNE last_prime ;if this list isn't empty. MOV R6, #7 ;First prime: 7. B new_prime new_loop STRB R0, [R9], R7, LSL #1 ;Suppress p(30k+1). CMP R9, R4 BCS new_store1 STRB R0, [R9], R6, LSL #2 ;Suppress p(30k+7). CMP R9, R4 BCS new_store2 STRB R0, [R9], R6, LSL #1 ;Suppress p(30k+11). CMP R9, R4 BCS new_store3 STRB R0, [R9], R6, LSL #2 ;Suppress p(30k+13). CMP R9, R4 BCS new_store4 STRB R0, [R9], R6, LSL #1 ;Suppress p(30k+17). CMP R9, R4 BCS new_store5 STRB R0, [R9], R6, LSL #2 ;Suppress p(30k+19). CMP R9, R4 BCS new_store6 STRB R0, [R9], R7, LSL #1 ;Suppress p(30k+23). CMP R9, R4 BCS new_store7 STRB R0, [R9], R6, LSL #1 ;Suppress p(30k+29). CMP R9, R4 BCC new_loop new_store0 STMIA R8!, {R6,R9} last_prime BIC R6, R6, #&F8000000 next_odd ADD R6, R6, #2 ;Next odd number. LDRB R14, [R10, R6] TEQ R14, #0 ;Loop while it isn't BEQ next_odd ;a prime number p. new_prime MLA R9, R6, R6, R2 LDR R7, [R12, #data-offsets] SUB R9, R9, R11 MUL R14, R7, R6 CMP R9, R4 LDRCCB R14, [R12, R14, LSR #28] ADDCC R7, R6, R6, LSL #1 ;R7 = 3p. SUBCC PC, PC, R14, LSL #2 STR R0, [R8] ;End of the list of the (p,addr)'s. ; Copy the block into the array. MOV R14, R2 ADD R11, R1, R11 copy_block LDMIA R14!, {R6-R9} STMIA R11!, {R6-R9} CMP R14, R4 BCC copy_block CMP R11, R5 SUBCC R11, R11, R1 MOVCC R10, R1 BCC next_block ; End. LDMDB R12, {R8,R9} ;Correction for the integers STMIA R1, {R8,R9} ;from 0 to 7. LDMFD SP!, {R0,R4-R12,PC} old_store1 ORR R6, R6, #&18000000 B old_store0 old_store2 ORR R6, R6, #&30000000 B old_store0 old_store3 ORR R6, R6, #&48000000 B old_store0 old_store4 ORR R6, R6, #&60000000 B old_store0 old_store5 ORR R6, R6, #&78000000 B old_store0 old_store6 ORR R6, R6, #&90000000 B old_store0 old_store7 ORR R6, R6, #&A8000000 B old_store0 new_store1 ORR R6, R6, #&18000000 B new_store0 new_store2 ORR R6, R6, #&30000000 B new_store0 new_store3 ORR R6, R6, #&48000000 B new_store0 new_store4 ORR R6, R6, #&60000000 B new_store0 new_store5 ORR R6, R6, #&78000000 B new_store0 new_store6 ORR R6, R6, #&90000000 B new_store0 new_store7 ORR R6, R6, #&A8000000 B new_store0 data DCD &11111111 DCB 0,0,1,1,0,1,0,1 offsets DCB 0,39,27,0,24,0,0,36,21,0,0,33,0,30,18 END

The following program takes 3 arguments:

- the number
`N`(to search for the primes from 0 to`N`²-1), - the block size (multiple of 480),
- a 0 to display the prime numbers, or a positive integer to have the timing (it is the number of iterations, the higher it is, the more it is accurate).

/* Eratosthenes Sieve - test and timing * * Usage: erat <N> <block_size> <p> * --> Search for the prime numbers up to N^2-1 * p = 0: display * p > 0: benchmark (number of iterations) */ #include <stdio.h> #include <stdlib.h> #include <time.h> void erat(int, char *, char *, int); int ceil32(int x) { return (x + 31) & ~31; } int main(int argc, char **argv) { int i, n, s, p; char *t, *u; clock_t start, stop; if (argc != 4) exit(1); if ((n = atoi(argv[1])) < 5) exit(2); if ((s = atoi(argv[2])) <= 0 || s % 480) exit(3); if ((p = atoi(argv[3])) < 0) exit(4); if ((t = malloc(n*(n+8)+s+64)) == NULL) exit(5); t = (char *) ceil32((int) t); u = t + ceil32(n*n); if (p) { start = clock(); for (i = 0; i < p; i++) erat(n, t, u, s); stop = clock(); printf("Time = %.3f ms\n", (double) (stop-start) * (1000.0/CLK_TCK) / p); } else { erat(n, t, u, s); for (i = 0; i < n*n; i++) if (t[i]) printf("%d\n", i); } return 0; }

For `N` = 1000 (search for primes up to 1,000,000),
with a block size of 9600 bytes, it takes
**30.8 ms**. Note
that it is very fast, since writing one million bytes to memory with
`STM`s takes 24.0 ms,
and writing with `STR`s takes
39.9 ms.

Archive containing the source and the binary to run from the command line.

webmaster@vinc17.org