(Document written in August 1995.)

Armpi 3 is a program that calculates π
in decimal for the ARM processors (in particular for the
Acorn Risc PC). The following formula will be used:
π/4 = arctan(1/2) + arctan(1/3) with
arctan x = x - x^{3}/3
+ x^{5}/5 - x^{7}/7 + ...
because it is simple and converges quite fast, although other faster
formulas are often preferred. I already used this formula on the
Apple II (6502) and the Atari ST (68000); you can see the
comparisons.

The results will be given at the end.

The program given here is a standard routine whose return address
must be in register `LR` (Link Register). At the call, register
`R0` contains the wanted number of digits up to round errors, and
register `R1` contains the address of the memory block reserved
for the calculations and the storage of the result. This memory block must
hold at least 2`N` + `B` + 1544 bytes
(see below), where `B` is the least multiple of 4 greater or equal
to 5`N`/12. `R0` and `R1` must be multiples of
4.

At the return, registers `R0`, `R1` and
`R13` are not modified, and the result is stored at the
address given by `R1`, least significant digit first,
with one decimal digit in each byte.

Note: the ARM must be in *little-endian*
configuration.

The calculations will be performed in binary. Then the final result
will be converted into decimal. For `N` decimal digits, one needs
slightly fewer than 5`N`/12 bytes (1 byte = 8 binary digits).
Thus the calculations will be performed with `B`-byte numbers,
where `B` is the least multiple of 4 greater or equal to
5`N`/12 (it's better to choose a too high number of digits
than a too small one).

Divisions by a same number will be grouped, i.e.
instead of calculating and adding (1/2^{n})/`n`
and (1/3^{n})/`n`, one will calculate
(1/2^{n}+1/3^{n})/`n`.
The calculation of 1/2^{n} consists of a shift; in
fact, a 1 will be added to 1/3^{n}, which will take
a minor constant time. The calculation of 1/3^{n}
consists of a division by 9 at each iteration; a special, very fast
algorithm will be chosen for this division: the idea is to multiply by
the inverse of 9 in **Z**/2^{32}**Z**. The division by
`n` will be performed by calculating tables and reading in these
tables. The additions and subtractions will be performed in a conventional
way. Finally there is the conversion into decimal, which relatively takes a
long time; the general algorithm is conventional, but the calculations will
be performed on 4 decimal digits in parallel (there are 4 digits in a 32-bit
word), with a redundant representation (a digit may be greater than 9).

The following sections contain the explanations of each part of the program. The assembly source is given in a separate file. Here are the different parts:

- the division by 9,
- the division by
`n`, - the calculation of the table of the partial quotients
for the division by
`n`, - the calculation of the table of the products
for the division by
`n`, - the conversion into decimal,
- the main part.

The routines given before the main part will be used as macros, but
here the syntax is different: for more readability, the $

character
for the parameters will be omitted. Perl scripts
enable to extract the assembly source from the
given file in accordance with the standard syntax or
the source for extASM.
Another Perl script creates
a BASIC program (for the Risc PC) from the
assembly source returned by the first script.

We want to divide a large number by 9; the result must be stored in
place of the number. We'll perform a series of divisions by 9 of 36-bit
numbers `N` = (`NA`,`NB`) (where
the most significant 4 bits `NA` form a number less than 9),
with a 32-bit quotient and a 4-bit remainder.

First we get rid of the most significant 4 bits by subtracting
`NA` × &FFFFFFFC, then by adding 36 if
the result is negative. So we obtain a partial quotient equal
to `NA`.`a` (possibly + 4) with
a = &FFFFFFFC / 9 = &1C71C71C,
the remainder being the new value of `NB`. Now we have to
divide a 32-bit number by 9.

The inverse of 9 in **Z**/2^{32}**Z** is 2`a`+1.
Let `R` be the 32-bit result of `N`.(2`a`+1).
The following table gives the value of `N` modulo 9 as a
function of the interval containing `R`:

[0,a]......... N % 9 = 0 [a+1,2a]...... N % 9 = 5 [2a+1,3a+1]... N % 9 = 1 [3a+2,4a+1]... N % 9 = 6 [4a+2,5a+2]... N % 9 = 2 [5a+3,6a+2]... N % 9 = 7 [6a+3,7a+3]... N % 9 = 3 [7a+4,8a+3]... N % 9 = 8 [8a+4,9a+3]... N % 9 = 4

`NB` is multiplied by 2`a`+1, which is written in base 2:
00111000111000111000111000111001
and which can also be written, using signed digits:
0100m00100m00100m00100m00100m001
where `m` = -1. So the multiplication can be performed
with 1 `SUB` and 3 `ADD`s. Then the
remainder is determined to have a result in [0,`a`].

Here we want to divide a large number by a 32-bit number. The quotient will be stored in another block of memory (contrary to the division by 9). The algorithm given here is valid only with a divisor whose MSB (most significant bit) is zero; thus the divisor is in fact a 31-bit number (this also corresponds to the positive divisors in signed arithmetic), but this is widely sufficient for the calculations that will be performed.

The long division is performed in a conventional way in base
2^{32}.
At each iteration, the remainder of the division performed at the previous
iteration (32-bit word less than the divisor) is known; it is zero at the
first iteration. One has to divide the 64-bit number, called *partial
dividend*, composed of the remainder and the next digit of the dividend
(32-bit number). One obtains a digit of the quotient (32-bit number) and the
remainder, which will be used at the next iteration. The partial dividend
and the divisor will be *normalized*, i.e. left-shifted
by a same number of bits so that the MSB of the divisor
is 1; since the dividend and the divisor are multiplied by a same number,
the quotient is unchanged and the remainder is automatically normalized
during the calculation.

Now let's explain the division of the 64-bit partial dividend
`X` = (`NA`,`NB`) (2 words) by the
32-bit divisor `DV` (whose MSB is 1). The
division will be performed in 4 identical steps, with tables calculated
before the global division (these tables only depend on the divisor).
By hypothesis, one has `NA` < `DV`.

Considering the most significant 9 bits of the normalized partial
dividend, a table with (at least) 512 entries gives
8 bits of the quotient, which form a number `NQ` called *partial
quotient*. In fact, there may generally be 2 possible partial quotients:
`NQ` and `NQ`+1; to have a unique quotient, more bits
(up to 38) should be considered, which are too many for a table. Thus one
assumes that the partial quotient is `NQ` (as if the bits following
the first 9 bits were all zeros), and the quotient will be fixed later if
it is `NQ`+1. Then (`NQ`<<25).`DV` must
be subtracted from `X`, by reading `NQ`.`DV`
in a 256-entry table (each entry being a possible
partial quotient), and the result must be shifted by 8 bits to the left.
`NQ`.`DV` is a 39-bit number (`NQ` has 8 bits
and `DV` has 31 bits). But if `NQ` is the real quotient,
we know that the most significant 8 bits of
`X`-`NQ`.`DV` are zeros, and in this case,
the most significant 8 bits of `NQ`.`DV` need not be
represented; only the least significant bit of these 8 bits is useful
to determine the quotient. Therefore only 32 bits of the product
`NQ`.`DV` will be stored, which exactly form a word P
(thus this algorithm is not valid for a 32-bit divisor). So we calculate
`X'` = ((`X`<<7)-`P`)<<1.
If the bit lost in the last shift is 1 or if
`X'` ≥ `DV`, then the real quotient is
`NQ`+1, and not `NQ`; the possible correction of the
quotient and `X'` is done now. These calculations are performed
3 more times. After 4 steps, the 32 bits of the quotient are obtained
(and stored) and the remainder is normalized.

This table is calculated in the following way, where `DV`
is the normalized divisor:

- (
`NA`,`NB`) = 0 and`NQ`= 0. - Repeat
- Store
`NQ`into the table. - Add 2.
`DV`to (`NA`,`NB`). - Store
`NQ`into the table if a carry is generated by the addition of the least significant words. - Increase
`NQ`.

- Store
- While
`NQ`<256.

`NA` contains the most significant 9 bits of the products
`NQ`.`DV`. 256 to 512 values are stored.

This table is calculated in the following way, where `DV`
is the normalized divisor:

`T1`= 0.- Repeat 256 times:
- Store
`T1`into the table. - Add
`DV`/2 to`T1`.

- Store

Note: since the divisor is a 31-bit number, the
LSB of `DV`
is 0. In the additions, the most significant bits (up to 7 bits) are
lost (see Long Division by `n`).

We want to convert a binary number, defined up to a power of 2, into
base 10: π (in fact, π/4 was calculated, but 4 is a power of 2).
3 variables (long numbers) will be used during the conversion: the
binary number, a decimal number `T` equal to a power of 2 (the
exponent is decreased at each iteration, and becomes negative), and the
temporary result `R` in base 10 (more and more accurate). The
two decimal numbers have the same number of digits (including the non
significant zeros) and the value of the least significant digit must be
(slightly) greater than the value of the least significant digit of the
binary number.

Since the most significant bit of the binary number π has the value 2,
the first considered power of 2 is 2. The initial value of `R` is
obviously 0. The bits of the binary number are successively read. At each
iteration, `T` is added to `R` if the bit is 1; then
`T` is divided by 2. One iterates until `T` is 0 (due
to lack of precision).

The decimal numbers have one digit per byte. Like the binary numbers, the
decimal numbers are stored with their least significant digit first: this is
due to the fact that the ARM is *little-endian* by
default.

In order to optimise, the numbers `T` and `R` are
interlaced in a 32-bit way; the words of `T` are stored before
the corresponding words of `R`. This optimisation is specific to
the ARM, which can read (or write) several consecutive
words with only one instruction. Also fewer registers are needed.

We could perform the division by 2 and the addition as we could do them manually (i.e. one digit after another, in making sure that each digit is between 0 and 9). But the calculation can be much strongly accelerated, first by performing elementary 32-bit operations instead of 8-bit operations, i.e. by considering 4 digits at once, then by using a redundant representation, i.e. digits greater than 9 (but never greater than 255), which avoids calculating the carries at each addition.

The division by 2 is performed word by word starting with the most
significant word. At each iteration, a right-rotation with carry is
performed; a word `M` and a new carry are obtained. Bits 7, 15, 23
and 31 contain the carries that will affect the 4 digits. Thus one writes
`M` = `N`+`C`, where `C`
contains the carries and `N` contains the other bits.
`C`>>5 + `C`>>7 = 5(`C`>>7)
is added to `N` in order to obtain the correct result. Note: these
calculations in parallel are possible due to the fact that the base (10) is
divisible by the divisor (2).

The addition is performed word by word starting with the least
significant words, without carry, so that it is very fast. But the value
of the digits of the result must regularly be decreased to avoid overflows.
This cleaning

is performed every 8 iterations, i.e. each
time a byte of the binary number has been entirely read, after the possible
addition. The digits of the result are read one at a time, starting with the
least significant digit; each time the digit + the carry is greater or equal
to 128, 80 is subtracted and a carry equal to 8 is generated. So the digits
are always between 0 and 199, and overflows never occur.

At the end of the conversion, the digits are between 0 and 190. Then
the result is normalized

: a number whose digits are between 0 and
9 is obtained. The normalized number is copied at the same time where it
must be (its words are not interlaced any longer).

In the first part (before the conversion), 3 memory blocks are used:
one containing the numbers of the form 4/3^{2n+1}, one
containing the quotient in the long division by 2`n`+1, and one
containing the temporary result.

First `B` is calculated in the following way:
B = ((N + N/4)/3 + 3) BIC 3.
The addresses of the different blocks are calculated, the divisor
and the shift count for the normalization are initialized. We store
4 (1/2 + 1/3) = 1101010101...
in the block that will contain the binary result (the most significant
bit has the value 2), and
4/3 = 01010101...
in the block that contains the negative powers of 3. Then the
calculations start.

At each iteration, a division by 9 is performed, the table of the
partial quotients and the table of the products for the division by
2`n`+1 are calculated; a bit, corresponding to
4/2^{2n+1}, is set, and we go to the conversion if this
bit is out of the block, i.e. if the wanted precision has been
reached. Then the division by 2`n`+1 is performed, and the
quotient is added or subtracted to/from the result according to the
parity of `n`. The bit that has been set is cleared. 2 is
added to the divisor, and the shift count is decreased if the divisor
has exceeded a new power of 2. Then we start a new iteration.

For the conversion, new memory blocks are considered. The decimal numbers are initialized and the conversion is performed as described above.

200,000 decimals have been calculated in 7 hours 40 minutes 7 seconds: the binary calculation has lasted 3 hours 47 minutes 15 seconds and the conversion into decimal has lasted 3 hours 52 minutes 51 seconds.

The 3 calculation times are just about proportional to the square of the number of calculated decimals. With 20,000 decimals, the times are: 4 minutes 36 seconds in total, 2 minutes 22 seconds for the binary calculation and 2 minutes 14 seconds for the conversion. The times have respectively been divided by 100, 96, 104.

The binary calculation and the conversion into decimal take just about
the same time. Therefore it may be interesting to optimise the one or the
other, but if both are not optimised, the acceleration will be limited.
Concerning the binary calculation, another formula can be chosen (another
sum of atan's). Concerning the conversion, it's harder: the cleaning

may be performed slightly less often (every 8 additions instead of every
8 read bits) or a division by 4 may be performed instead of 2 divisions
by 2 when a bit is 0 and there is no addition to perform; see
Armpi 4.
It may be better to perform all calculations in a base of the form
10^{n}, where `n` is well chosen.

The comparisons with other computers and other π calculation programs I've written are given in a separate file; all the programs up to Armpi 4 use the same formula, but the algorithms are quite different.

webmaster@vinc17.org