Efficient Squaring Algorithm for Embedded RISC Processors

Feng-Fu Su, Ren-Junn Hwang and Loang-Shing Huang Department of Computer Science and Information Engineering TamKang University Tamsui, Taipei County, Taiwan 251, R.O.C. E-mail:[email protected]


Abstract - Squaring, $X^2$, is a special case of multiplication that plays an important role in several public-key cryptosystems, such as the RSA and ECC cryptosystems. This paper proposes an efficient squaring algorithm for embedded RISC processors. To improve performance, we exploit the multiply/accumulate unit of embedded RISC processors and minimize the number of external memory accesses. On the Texas Instruments TMS320C55x DSP, our squaring algorithm takes 59-72% fewer clock cycles than Yang et al.'s for bit lengths from 1024 to 8192.

Keywords: Squaring algorithm, Exponentiation algorithm, Embedded system.

1. Introduction

Squaring, i.e. multiplying a number by itself, is the main operation in exponentiation. Many cryptographic methods, including the RSA cryptosystem [5] and elliptic curve cryptography [3], are based on exponentiation. The standard procedure for exponentiation requires many multiplications and squarings, and the squarings account for most of its computation cost. The number of squarings in the computation of $X^e$ depends on the bit length of $e$, and for security reasons the exponent in an RSA computation should be a large integer. Squaring large integers is therefore a key factor in the performance of many public-key cryptosystems.

In squaring a large integer, i.e. $X^2 = (x_{n-1}, x_{n-2}, \ldots, x_1, x_0)_b^2$, many cross-product terms of the form $x_i \cdot x_j$ and $x_j \cdot x_i$ are equivalent. They need to be computed only once and then left-shifted in order to be doubled. An n-digit squaring operation therefore needs only $(n^2 + n)/2$ single-precision multiplications. Consequently, squaring is more efficient than general multiplication.

In 2004, Yang et al. proposed a new efficient squaring algorithm that fixes the error-indexing bug of the Guajardo-Paar squaring algorithm [2, 8]. However, an implementation of their algorithm needs many external memory accesses, which makes it poorly suited to embedded RISC processors. On an embedded RISC processor such as a digital signal processor (DSP), a multiply instruction takes as long as a load/store instruction [1]. A squaring algorithm that needs fewer multiply instructions or fewer load/store instructions therefore runs faster. We improve the squaring algorithm by minimizing the number of external memory accesses. Our implementation results show that our improved squaring algorithm is nearly 2.5 times as fast as Yang et al.'s squaring algorithm on embedded RISC processors.

The rest of this paper is organized as follows. Section 2 reviews Yang et al.'s squaring algorithm. Section 3 presents our efficient squaring algorithm. Section 4 describes our implementation results. Finally, Section 5 concludes the paper.
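The operation count $(n^2 + n)/2$ quoted in the introduction follows from splitting the square into diagonal terms and cross products. A short worked derivation, using the same digit notation and letting $i < j$ index the cross products, might read:

```latex
\[
X^2 = \Bigl(\sum_{i=0}^{n-1} x_i\, b^i\Bigr)^2
    = \sum_{i=0}^{n-1} x_i^2\, b^{2i}
    + 2 \sum_{0 \le i < j \le n-1} x_i x_j\, b^{i+j}
\]
\[
\underbrace{n}_{\text{squares } x_i^2}
+ \underbrace{\tfrac{n(n-1)}{2}}_{\text{cross products } x_i x_j,\ i<j}
= \frac{n^2 + n}{2}
\]
```

For n = 3, for instance, this gives 6 single-precision multiplications instead of the 9 needed by a general 3-digit multiplication.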

2. Yang et al.'s squaring algorithm

In 2004, Yang et al. proposed an efficient squaring algorithm, Algorithm 1, to avoid both the improper carry handling bug of the standard squaring algorithm and the error-indexing bug of the Guajardo-Paar squaring algorithm [8].

Algorithm 1: Yang et al.'s squaring algorithm
Input: $X = (x_{n-1}, x_{n-2}, \ldots, x_1, x_0)_b$
Output: $Z = (z_{2n-1}, z_{2n-2}, \ldots, z_1, z_0)_b$
1. $(z_{2n-1}, z_{2n-2}, \ldots, z_1, z_0)_b \leftarrow (0, 0, \ldots, 0, 0)_b$
2. for $i = 0$ to $n-1$
   2.1 $c \leftarrow 0$
   2.2 for $j = i+1$ to $n-1$
       2.2.1 $(c, s) \leftarrow z_{i+j} + x_j x_i + c$
       2.2.2 $z_{i+j} \leftarrow s$
   2.3 $z_{i+n} \leftarrow c$
3. $Z \leftarrow 2Z$
4. $c \leftarrow 0$
5. for $i = 0$ to $n-1$
   5.1 $(c, s) \leftarrow z_{2i} + x_i x_i + c$, $z_{2i} \leftarrow s$
   5.2 $(c, s) \leftarrow z_{2i+1} + c$, $z_{2i+1} \leftarrow s$
6. Return $Z = (z_{2n-1}, z_{2n-2}, \ldots, z_1, z_0)_b$

The algorithm computes $Z = X^2$. Capital letters, such as X and Z, represent multiple-precision integers. For example, X is a multiple-precision integer that can be written as an array $(x_{n-1}, x_{n-2}, \ldots, x_1, x_0)_b$ of n digits, where b is the digit base and $0 \le x_i < b$. Lowercase letters, such as x, z, c and s, denote single-precision integers.

Yang et al. claimed that their algorithm is accurate and efficient [8]. However, we find that Steps 2.2, 2.3, 3 and 5 of Algorithm 1 require many memory accesses. Memory access is time-consuming because each access must switch highly capacitive address and data buses, row and column decode logic, and data lines with a high fan-out [4]. Yang et al.'s squaring algorithm is therefore not well suited to implementation on embedded RISC processors.
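As an illustration only, the steps of Algorithm 1 can be sketched in Python. The sketch assumes digits are stored least-significant first in a list; the names `square_alg1`, `to_digits` and `from_digits`, and the store counter, are hypothetical helpers introduced here and do not come from Yang et al.'s paper. The counter tallies executions of Step 2.2.2, the inner-loop store to Z.

```python
def to_digits(value, b, n):
    """Split a non-negative integer into n base-b digits, least significant first."""
    digits = []
    for _ in range(n):
        value, d = divmod(value, b)
        digits.append(d)
    return digits


def from_digits(digits, b):
    """Rebuild an integer from base-b digits stored least significant first."""
    value = 0
    for d in reversed(digits):
        value = value * b + d
    return value


def square_alg1(x, b):
    """Sketch of Algorithm 1; returns (digits of X^2, number of Step 2.2.2 stores)."""
    n = len(x)
    z = [0] * (2 * n)                      # Step 1
    inner_stores = 0
    for i in range(n):                     # Step 2: accumulate cross products once
        c = 0
        for j in range(i + 1, n):
            c, s = divmod(z[i + j] + x[j] * x[i] + c, b)
            z[i + j] = s                   # Step 2.2.2: one store per product
            inner_stores += 1
        z[i + n] = c                       # Step 2.3
    carry = 0
    for k in range(2 * n):                 # Step 3: Z <- 2Z, digit by digit
        carry, z[k] = divmod(2 * z[k] + carry, b)
    c = 0                                  # Step 4
    for i in range(n):                     # Step 5: add the squares x_i^2
        c, z[2 * i] = divmod(z[2 * i] + x[i] * x[i] + c, b)
        c, z[2 * i + 1] = divmod(z[2 * i + 1] + c, b)
    return z, inner_stores                 # Step 6


if __name__ == "__main__":
    digits, stores = square_alg1(to_digits(99, 10, 2), 10)
    print(from_digits(digits, 10), stores)
```

For n digits the counter comes out at n(n-1)/2, for example 28 for n = 8, before the further reads and writes of Steps 3 and 5 are counted.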

3. Our efficient squaring algorithm

In this section, we propose a new efficient squaring algorithm for embedded RISC processors. On such a processor, a multiply instruction takes as long as a load/store instruction, and load and store instructions are more expensive than instructions that access only registers. Our improved squaring algorithm, Algorithm 2, minimizes the number of external memory accesses to make squaring more efficient on embedded RISC processors.

Algorithm 2: Our improved squaring algorithm
Input: $X = (x_{n-1}, x_{n-2}, \ldots, x_1, x_0)_b$
Output: $Z = (z_{2n-1}, z_{2n-2}, \ldots, z_1, z_0)_b$
1. $(z_{2n-1}, z_{2n-2}, \ldots, z_1, z_0)_b \leftarrow (0, 0, \ldots, 0, 0)_b$, $(p, q, r) \leftarrow (0, 0, 0)$
2. for $i = 1$ to $n-1$
   2.1 for $j = 0$ to $\lfloor (i-1)/2 \rfloor$
       2.1.1 $(p, q, r) \leftarrow (p, q, r) + 2 x_j x_{i-j}$
   2.2 $z_i \leftarrow r$
   2.3 $r \leftarrow q$, $q \leftarrow p$, $p \leftarrow 0$
3. for $i = n$ to $2n-3$
   3.1 for $j = i-n+1$ to $\lfloor (i-1)/2 \rfloor$
       3.1.1 $(p, q, r) \leftarrow (p, q, r) + 2 x_j x_{i-j}$
   3.2 $z_i \leftarrow r$
   3.3 $r \leftarrow q$, $q \leftarrow p$, $p \leftarrow 0$
4. $z_{2n-2} \leftarrow r$, $z_{2n-1} \leftarrow q$, $q \leftarrow 0$
5. for $i = 0$ to $n-1$
   5.1 $(q, r) \leftarrow z_{2i} + x_i x_i + q$, $z_{2i} \leftarrow r$
   5.2 $(q, r) \leftarrow z_{2i+1} + q$, $z_{2i+1} \leftarrow r$
6. Return $Z = (z_{2n-1}, z_{2n-2}, \ldots, z_1, z_0)_b$

Our improved algorithm implements long-integer squaring on processors with a multiply/accumulate (MAC) unit [6]. Most embedded RISC processors (DSPs) feature a MAC unit with a "wide" accumulator, so that a certain number of products can be accumulated without loss of precision. The triple (p, q, r) of Algorithm 2 denotes registers; three digits are needed because the accumulated sum of the doubled products $2 x_j x_{i-j}$ is a triple-precision integer. Steps 2.3 and 3.3 are simply a one-digit right shift of (p, q, r). Step 4 writes out the two digits left in the accumulator, and Step 5 then adds the squares $x_i^2$, propagating the carry q from one digit pair to the next.

The most costly part of Algorithm 2 is the execution of Steps 2.1 and 3.1. After finishing Step 2.1 or 3.1, one result digit (i.e. r) is complete and is written out in Step 2.2 or 3.2. Because each result digit costs only one memory store (Steps 2.2 and 3.2 together perform about 2n stores), our proposed algorithm is faster than Algorithm 1, in which Step 2.2.2 alone performs $n(n-1)/2$ stores.
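A minimal Python sketch of Algorithm 2 follows, under stated assumptions: digits are stored least-significant first, the register triple (p, q, r) is modelled as a single Python integer `acc`, and Steps 2 and 3 are merged into one column loop whose lower bound is max(0, i-n+1). The names `square_alg2`, `to_digits` and `from_digits`, and the store counter, are hypothetical and introduced only for illustration. The sketch writes both remaining accumulator digits to $z_{2n-2}$ and $z_{2n-1}$ and carries q across the iterations of Step 5, which the final result requires.

```python
def to_digits(value, b, n):
    """Split a non-negative integer into n base-b digits, least significant first."""
    digits = []
    for _ in range(n):
        value, d = divmod(value, b)
        digits.append(d)
    return digits


def from_digits(digits, b):
    """Rebuild an integer from base-b digits stored least significant first."""
    value = 0
    for d in reversed(digits):
        value = value * b + d
    return value


def square_alg2(x, b):
    """Sketch of Algorithm 2; returns (digits of X^2, stores made in Steps 2.2/3.2)."""
    n = len(x)
    z = [0] * (2 * n)                      # Step 1
    acc = 0                                # (p, q, r) held as one integer
    column_stores = 0
    for i in range(1, 2 * n - 2):          # Steps 2 and 3: columns 1 .. 2n-3
        first_j = max(0, i - n + 1)        # 0 in Step 2, i-n+1 in Step 3
        for j in range(first_j, (i - 1) // 2 + 1):
            acc += 2 * x[j] * x[i - j]     # Steps 2.1.1 / 3.1.1: register-only MAC
        z[i] = acc % b                     # Steps 2.2 / 3.2: one store per column
        column_stores += 1
        acc //= b                          # Steps 2.3 / 3.3: one-digit right shift
    z[2 * n - 2] = acc % b                 # Step 4: flush the accumulator
    z[2 * n - 1] = acc // b
    q = 0
    for i in range(n):                     # Step 5: add the squares x_i^2
        q, z[2 * i] = divmod(z[2 * i] + x[i] * x[i] + q, b)
        q, z[2 * i + 1] = divmod(z[2 * i + 1] + q, b)
    return z, column_stores                # Step 6


if __name__ == "__main__":
    digits, stores = square_alg2(to_digits(99, 10, 2), 10)
    print(from_digits(digits, 10), stores)
```

With n = 8 digits the column loop performs 13 stores (2n-3), against the n(n-1)/2 = 28 inner-loop stores of Step 2.2.2 in Algorithm 1; the gap widens quadratically as n grows.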

4. Implementation results

To compare the performance of our improved squaring algorithm with that of Yang et al.'s, we implemented both algorithms on the Texas Instruments TMS320C55x DSP [7]. The DSP includes two MACs, four independent 40-bit accumulators, a 40-bit ALU, a 16-bit ALU, a 40-bit shifter, and so on. The programs are written in assembly language. In the experiments, the operand X is from 1024 to 8192 bits long, since the bit length in an RSA computation should be 1024 or more for security reasons.

The numbers of CPU clock cycles needed by the two squaring algorithms are given in Table 1. The fourth column of Table 1 shows that our improved squaring algorithm takes 59-72% fewer clock cycles than Yang et al.'s for bit lengths from 1024 to 8192. In other words, our improved algorithm needs only 41% of the computational cost of Yang et al.'s for a 1024-bit squaring; that is, for 1024-bit operands it is almost 2.5 times as fast. Our algorithm can thus significantly improve squaring performance on embedded RISC processors.
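The percentages and the speed ratio quoted above can be recomputed from the cycle counts in Table 1. The short Python sketch below does this; the cycle counts are copied from Table 1, while the names `TABLE_1`, `cycle_reduction_percent` and `speed_ratio` are hypothetical and chosen only for this illustration.

```python
# Clock-cycle counts copied from Table 1: bit length -> (Yang et al.'s, ours)
TABLE_1 = {
    1024: (5392, 2221),
    2048: (18981, 6509),
    4096: (70725, 21197),
    8192: (272516, 75149),
}


def cycle_reduction_percent(baseline, improved):
    """Percentage of the baseline's clock cycles that the improved version saves."""
    return 100.0 * (baseline - improved) / baseline


def speed_ratio(baseline, improved):
    """How many times as fast the improved version is, as a ratio of cycle counts."""
    return baseline / improved


if __name__ == "__main__":
    for bits, (yang, ours) in TABLE_1.items():
        print(f"{bits:5d} bits: "
              f"{cycle_reduction_percent(yang, ours):5.1f}% fewer cycles, "
              f"{speed_ratio(yang, ours):.2f}x as fast")
```

Rounded to whole percentages, the reductions are 59, 66, 70 and 72, matching the fourth column of Table 1. The corresponding speed ratio grows from about 2.4 at 1024 bits to about 3.6 at 8192 bits.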

5. Conclusion

This paper proposes an efficient squaring algorithm that is suitable for embedded RISC processors. Our squaring algorithm is based on minimizing the number of external memory accesses. Our performance analysis shows that it takes 59-72% fewer clock cycles than Yang et al.'s for bit lengths from 1024 to 8192 on the Texas Instruments TMS320C55x family of digital signal processors. In short, our squaring algorithm is almost 2.5 times as fast as Yang et al.'s, and it can significantly improve squaring performance on embedded RISC processors.

ACKNOWLEDGEMENTS

This work was partially supported by the iCAST project sponsored by the National Science Council, Taiwan, under grant no. 97-2221-E-032-019.

References

[1] Johann Großschädl, Roberto M. Avanzi, Erkay Savas, and Stefan Tillich, 'Energy-efficient software implementation of long integer modular arithmetic', CHES 2005, LNCS 3659, 2005, pp.75-90
[2] J. Guajardo and C. Paar, 'Modified squaring algorithm', Available from URL: http://citeseer.ist.psu.edu/672729.htm
[3] Koyama K, Maurer U, Okamoto, and Vanstone SA, 'New public-key schemes based on elliptic curves over the ring Zn', Proc. CRYPTO'91, Santa Barbara, 1991, pp.252-266
[4] K. Roy and M.C. Johnson, 'Software design for low power', Low power design in deep submicron electronics, vol. 337 of NATO Advanced science institutes series, chapter 6.3, 1997, pp.433-460
[5] R. Rivest, A. Shamir, and L. Adleman, 'A method for obtaining digital signatures and public-key cryptosystems', Commun. of ACM, 1978, vol.21, no.2, pp.120-126
[6] S.R. Dussé and B. S. Kaliski, 'A cryptographic library for Motorola DSP 56000', Proc. EUROCRYPT '90, 1991, pp.203-213
[7] Texas Instruments, Inc., 'TMS320C5510', Available from URL: http://www.compactpci-systems.com/products/search/fm/id/?6812
[8] Wu-Chuan Yang, Peng-Yueh Hseih, and Chi-Sung Laih, 'Efficient squaring of large integers', IEICE Trans. Fundamentals, 2004, vol.E87-A, no.5, pp.1189-1192

Table 1. Numbers of CPU clock cycles for realizing the two squaring algorithms

Length    Yang et al.'s     Ours              Speedup
(bits)    (clock cycles)    (clock cycles)    (% fewer cycles)
1024      5392              2221              59
2048      18981             6509              66
4096      70725             21197             70
8192      272516            75149             72
