Trivia-ck hardware implementation

Muhammad Hassan1, Anupam Chattopadhyay2

1 MPSoC Architectures Research Group, RWTH Aachen University, Aachen 52074, Germany [email protected]
2 School of Computer Engineering, NTU, Singapore [email protected]

Abstract. Security plays a vital role in modern communication systems. Encryption alone is not sufficient to ensure data integrity; authentication must be combined with the encryption algorithm to provide integrity assurances. Authenticated encryption ensures both data integrity and security. In this paper we discuss the hardware implementation of Trivia-ck [1], an authenticated encryption algorithm based on a variant of the stream cipher Trivium. The design achieves a throughput of 91.2 Gbps with an area of 24.4 KGE when synthesized with a 65 nm library under typical operating conditions, which corresponds to a throughput/area of 3.73 Gbps/KGE. This design is the first of its kind and may serve as a base implementation for future designs.

Keywords: Authenticated Encryption, ASIC, Hardware accelerator, high-level synthesis

1 Introduction and Motivation

The Internet of Things (IoT) has made the security of connected devices critically important. Every embedded device is online, and this online presence gives a third-party intruder the chance to alter the communication between two devices; hence, the transfer of critical information requires a secure channel. Symmetric-key cryptography provides privacy by securing the channel, but it does not provide data integrity or source authenticity. Message authentication codes (MACs), on the other hand, provide integrity and authenticity assurances. Combining symmetric-key encryption with a MAC resolves this problem [3]. Authenticated Encryption (AE) is an efficient approach that provides privacy and authenticity of the data simultaneously; it extends symmetric-key encryption with authenticity. There are three ways to provide AE: (1) Encrypt-then-MAC, (2) Encrypt-and-MAC, and (3) MAC-then-Encrypt. Trivia-ck is an AE scheme that is currently a candidate in the CAESAR competition [2]. Among all the entries, 20% are based on AES; Trivia-ck stands out because it is based on Trivium [4] and offers high throughput, low cycles per byte, and low area, making it a very good candidate.

The rest of the paper is organized as follows. Section 2 briefly discusses Trivia-ck and analyzes its cycle count. The hardware implementation, along with different design points, is discussed in Section 3. Finally, Section 4 evaluates the performance of our hardware architecture and compares it with other CAESAR candidates.

2 Cycles Per Byte (cpb) Analysis

The Trivia-ck design targets high-speed implementation and requires 47 clock cycles to authenticate and encrypt one 64-bit message block. The cycle count for the design without any pipeline stage is shown in Figure 1. The initialization phase requires 18 cycles, during which the state register and Z are updated in every cycle. The associated data (AD) is then loaded and processed in 1 cycle. During the checksum phase, the checksum computed in the earlier stage is used as input instead of a newly loaded block. Processing the checksum requires 4 cycles, and 1 final cycle updates the tag and the state register, which completes the processing of AD. Message (msg) encryption follows the same sequence, with the AD replaced by the msg. The rest of the process is the same, with one minor difference: the checksum is processed only 3 times instead of 4 before the tag update.

Fig. 1. Cycle count of Trivia-ck. Associated data cycle count (top), message encryption (bottom)

Mathematically, cycle count can be represented in the following manner

\[
\text{cycle\_count} = (\text{init\_count} \times 2) + \frac{\text{adlen}}{8} + \frac{\text{msglen}}{8} + 4 + 1 + 3 + 1 \tag{1}
\]

where init_count is 18 in Trivia-ck, adlen and msglen are in bytes instead of bits. The corresponding cpb can be calculated using the following formula

\[
\text{cpb} = \frac{\text{cycle\_count}}{\text{msglen}} \tag{2}
\]
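As a quick check of Eqs. 1 and 2, the following minimal Python sketch evaluates both formulas for one 64-bit AD block and one 64-bit message block. The function names `cycle_count` and `cpb` are illustrative and not taken from the design, and the sketch assumes lengths that are multiples of 8 bytes.

```python
INIT_COUNT = 18  # initialization cycles, as stated for Trivia-ck


def cycle_count(adlen: int, msglen: int, init_count: int = INIT_COUNT) -> int:
    """Cycle count of the non-pipelined design, following Eq. 1.

    adlen and msglen are in bytes. This sketch assumes whole 8-byte blocks.
    """
    if adlen % 8 or msglen % 8:
        raise ValueError("lengths must be multiples of 8 bytes in this sketch")
    return (init_count * 2) + adlen // 8 + msglen // 8 + 4 + 1 + 3 + 1


def cpb(adlen: int, msglen: int) -> float:
    """Cycles per byte, following Eq. 2."""
    return cycle_count(adlen, msglen) / msglen


if __name__ == "__main__":
    # One 8-byte AD block and one 8-byte message block.
    print(cycle_count(8, 8))
    print(cpb(8, 8))
```

For adlen = msglen = 8 bytes, the sketch yields the 47 cycles quoted in the text.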

2.1 Pipelining

When the design uses a single-stage pipeline, on the other hand, the cycle count increases to 49 to authenticate and encrypt one 64-bit message block. Two additional clock cycles are required to flush the pipeline registers. The rest of the data processing flow remains the same. The pipelined cycle count is shown in Figure 2.

Fig. 2. Cycle count of Trivia-ck after single stage pipeline. Associated data cycle count (top), message encryption (bottom)

Similarly, the cycle count for the pipelined design can be represented in the following manner

\[
\text{cycle\_count} = (\text{init\_count} \times 2) + \frac{\text{adlen}}{8} + \frac{\text{msglen}}{8} + 9 + (\text{pipe\_stages} \times 2) \tag{3}
\]

where pipe_stages is 1 in our case. As the number of pipeline stages increases, the cycle count increases accordingly. The cpb can be calculated using Eq. 2.
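The effect of the pipeline stages on the cycle count can be illustrated with a short Python sketch of Eq. 3. The name `pipelined_cycle_count` and its keyword arguments are hypothetical; with `pipe_stages=0` the expression reduces to Eq. 1, since 4 + 1 + 3 + 1 = 9.

```python
def pipelined_cycle_count(adlen: int, msglen: int,
                          pipe_stages: int = 1, init_count: int = 18) -> int:
    """Cycle count of the pipelined design, following Eq. 3.

    adlen and msglen are in bytes. This sketch assumes whole 8-byte blocks.
    """
    if adlen % 8 or msglen % 8:
        raise ValueError("lengths must be multiples of 8 bytes in this sketch")
    blocks = adlen // 8 + msglen // 8
    # Each pipeline stage adds two flush cycles.
    return (init_count * 2) + blocks + 9 + (pipe_stages * 2)


def pipelined_cpb(adlen: int, msglen: int, pipe_stages: int = 1) -> float:
    """Cycles per byte of the pipelined design, following Eq. 2."""
    return pipelined_cycle_count(adlen, msglen, pipe_stages) / msglen


if __name__ == "__main__":
    print(pipelined_cycle_count(8, 8))                 # single pipeline stage
    print(pipelined_cycle_count(8, 8, pipe_stages=0))  # no pipeline stage
```

For one 8-byte AD block and one 8-byte message block, a single pipeline stage gives the 49 cycles quoted in the text.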

3 Hardware Architectures

The top-level diagram of the Trivia-ck accelerator is shown in Figure 3. We present two architectures of Trivia-ck in this paper: a base implementation without any pipelining, and a single-stage pipelined implementation. Both implementations follow the same top-level model. The architectures are modular and provide a high level of scalability. Because the operations for processing AD and msg are similar, the same hardware modules are used to process both kinds of data. A single-bit switch distinguishes between the two types of input data.

Fig. 3. Trivia-ck top level model

The algorithm consists of multiple operations; hence, the Trivia-ck hardware consists of the following modules:

1. State Registers: Registers are the base component of the design. The state registers store the intermediate states after each iteration. They comprise the 384-bit state register, the 256-bit Z register, the 64-bit block register, the 160-bit tag register, and the 256-bit checksum register.
2. State Update: The State Update module updates the current state of the algorithm. It is used in each iteration during initialization, encryption, and finalization. It takes the 128-bit key, 64-bit Npub, 64-bit param, and the 384-bit state register value as input, performs its operations on these values, and updates the state register.
3. Field Multiplication: The Field Multiplication module takes two 32-bit inputs, calculates the pseudo dot product of the inputs, and produces a 32-bit output.
4. VHorner32: This module performs Horner's multiplication for the 32-bit Vandermonde matrix multiplication. It takes two inputs, a 32-bit value from the field multiplication and the 160-bit tag value, and generates a new 160-bit tag. During the processing of AD it operates on all 160 bits of the tag, whereas for msg processing only 128 bits are used.
5. VHorner64: This module performs Horner's multiplication for the 64-bit Vandermonde matrix multiplication. It takes a 64-bit input block and the current 256-bit checksum value as input, and generates a 256-bit checksum value as output. The module operates on all 256 bits of the checksum when working on AD; otherwise it uses only 192 bits.

3.1 Base Implementation

The first implementation of Trivia-ck sets the base of our architecture. It is the most basic implementation and is optimized for neither area nor throughput. The design exploits the parallelism inherent in the algorithm and processes 64 bits in each cycle, as shown in Figure 4, where the critical path is marked in red. On reset, prior to initialization, the state register is loaded with Key, param, Npub, and ones in the remaining bits. Once the state register is loaded, the initialization process starts, in which the state register is updated in each cycle by the state update operation. After initialization, 8 bytes of AD are fetched into the block register. The 64 bits of the block are XORed with 64 bits from the state register, and the result feeds the field multiplication module. In parallel, the 64 bits of the block are XORed with 64 bits of Z to obtain the ciphertext. The field multiplication module is followed by the VHorner32 module, which generates the new tag. The field multiplication module has 64 2×1 32-bit muxes in series. The checksum is updated in the VHorner64 module, which also executes in parallel. When the AD has been processed, the checksum is used as input; hence, a 2×1 64-bit mux is required, which fetches 64 bits of the checksum at a time to update the tag. Because all operations execute in parallel, one round can be executed in a single cycle. At the end, after the checksum has been processed, the tag and the state register are updated in a single cycle. A 3×1 mux is required at the input of the state register and a 2×1 mux at the input of the tag register. The state register takes its input from three sources: the initialization values on reset, the state update after each cycle, and the state register XORed with the tag. Similarly, the tag takes its value from two sources: VHorner32 and Z.

Fig. 4. Trivia-ck basic hardware implementation

The control of the complete design is implemented as a finite state machine (FSM), not shown in the schematic. The FSM consists of 6 states. It starts in an idle state, followed by initialization. After initialization, the FSM enters the processing state and stays there until all the data has been processed. It then moves to checksum processing, followed by tag update and pipeline flush. Only a 3-bit register is required to store the present state of the FSM.

For low-area design requirements, the combinational logic can be reduced by sharing resources; consequently, the number of cycles increases and the throughput decreases. Hence, there is a trade-off between area and speed.

3.2 Pipelined Implementation

The base implementation in Section 3.1 can be used as an initial point for optimization, either for area or for throughput. Analysing the base implementation, we identify the critical path, which can be broken into shorter paths to increase the operating frequency. All operations are single operations except for tag generation. Tag generation requires two operations in series, which use multiple muxes in series. This long chain of muxes limits the clock speed of the whole design, so other modules that could operate at higher frequencies are also limited by it. Therefore, we break the critical path by inserting a pipeline register after the Field Multiplication module. To balance the design, we also insert a pipeline register for the Z register. With this pipelined architecture, shown in Figure 5, we are able to achieve higher throughput for Trivia-ck.

Fig. 5. Trivia-ck pipelined hardware implementation

3.3 Enc/Dec Implementation

Due to the similar structure of encryption and decryption algorithms, a combined hardware can also be designed with a small increase in area, while getting the same throughput. The encryption or decryption mode is selected using a mode select signal. When the mode select is set to 0, the hardware operates in encryption mode, whereas when mode select is set to 1, the hardware operates in decryption mode.
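The role of the mode select signal can be illustrated with a small behavioural Python sketch. It is not the authors' RTL. It relies on the statement in Section 3.1 that a block is XORed with 64 bits of Z to obtain the ciphertext, and it assumes (this is not stated in the text) that in decryption mode the recovered plaintext, rather than the incoming ciphertext, is forwarded to the authentication path. The names `process_block` and `auth_input` are hypothetical.

```python
from typing import Tuple

MASK64 = (1 << 64) - 1  # blocks are 64 bits wide


def process_block(block: int, z: int, mode_select: int) -> Tuple[int, int]:
    """Behavioural model of one 64-bit block in a combined enc/dec datapath.

    mode_select = 0: encryption, `block` is plaintext.
    mode_select = 1: decryption, `block` is ciphertext.
    Returns (output_block, auth_input), where auth_input is the plaintext
    assumed to feed the tag and checksum logic (an assumption of this sketch).
    """
    if mode_select not in (0, 1):
        raise ValueError("mode_select must be 0 or 1")
    output_block = (block ^ z) & MASK64  # same XOR in both modes
    auth_input = block if mode_select == 0 else output_block
    return output_block, auth_input


if __name__ == "__main__":
    plaintext = 0x0123456789ABCDEF
    z = 0xFEDCBA9876543210  # arbitrary keystream word for illustration
    ciphertext, _ = process_block(plaintext, z, mode_select=0)
    recovered, _ = process_block(ciphertext, z, mode_select=1)
    print(recovered == plaintext)
```

In this model the XOR datapath is shared between both modes and only a 2×1 selection differs, which is in line with the small area increase mentioned above.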

4 Performance Results and Comparison

The architectures of Trivia-ck are described in Verilog HDL, and synthesis is done with Synopsys Design Compiler J-2014.09 in topographical mode using Faraday standard cell libraries. We used the UMC 65 nm logic SP/RVT Low-K process technology node for synthesis. The area of the base implementation is 23.6 KGE at a frequency of 1150 MHz, with 7.2 KGE required for sequential logic and 16.4 KGE for combinational logic. The corresponding throughput is 73.9 Gbps, and the throughput/area is 3.13 Mbps/GE. The area of each module is shown in Table 1. The registers for the tag, block, and checksum are instantiated in the top module; hence, they are not listed in the table.

Table 1. Area utilization of Trivia-ck modules

Module                 Base implementation     Pipelined implementation
                       Area (GE)    %          Area (GE)    %
Field Multiplication   6275         26         6890         28
Update State           7214         30         7208         29.5
FSM                    1260         5.3        1296         5.3
VHorner32              675          2.8        387          1.5
VHorner64              573          2.4        576          2.4

Similarly, the pipelined implementation was synthesized under the same operating conditions, tools, and libraries. The design was successfully synthesized at 1425 MHz with an area of 24.4 KGE, of which 7.7 KGE is sequential and 16.7 KGE is combinational logic. It achieves a throughput of 91.2 Gbps with a throughput/area of 3.7 Mbps/GE. The module-wise breakdown of the area is shown in Table 1. The registers for the tag, block, and checksum, as well as the pipeline registers, are instantiated in the top module; hence, they are not listed in the table.

To measure performance, different message lengths were considered to calculate the cycles per byte. When the message length is 8 bytes, the initialization overhead is significant for both the base and the pipelined implementation, giving a high cycles-per-byte count. As the message length increases, the overhead becomes less significant, giving 0.25 cycles per byte for a message length of 8192 bytes, as shown in Table 2 and graphically in Figure 6. Hence, as the message length increases, the cycles per byte of both designs converge.

Table 2. ASIC implementation numbers for Trivia-ck (in clocks per byte)

Message length (Bytes)   8      16     32     64     128    256    512    1024   2048   4096   8192
Trivia-base              5.87   3.06   1.65   0.95   0.60   0.42   0.33   0.29   0.27   0.26   0.25
Trivia-pipelined         6.12   3.18   1.71   0.98   0.61   0.43   0.34   0.29   0.27   0.26   0.25
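The entries of Table 2 can be reproduced from Eqs. 1 to 3 with a short Python sketch under two assumptions that the text does not state explicitly: the associated data has the same length as the message (adlen = msglen), and the cpb values are truncated, not rounded, to two decimal places. The helper names are illustrative.

```python
INIT_COUNT = 18
MESSAGE_LENGTHS = [8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]


def cycle_count(adlen: int, msglen: int, pipe_stages: int = 0) -> int:
    """Eq. 3; with pipe_stages = 0 it reduces to Eq. 1 (lengths in bytes)."""
    return (INIT_COUNT * 2) + adlen // 8 + msglen // 8 + 9 + (pipe_stages * 2)


def cpb_truncated(msglen: int, pipe_stages: int = 0) -> float:
    """Cycles per byte truncated to two decimals, assuming adlen == msglen."""
    cycles = cycle_count(msglen, msglen, pipe_stages)
    # Integer arithmetic avoids floating-point surprises before truncation.
    return ((100 * cycles) // msglen) / 100


if __name__ == "__main__":
    for name, stages in (("Trivia-base", 0), ("Trivia-pipelined", 1)):
        row = [cpb_truncated(n, stages) for n in MESSAGE_LENGTHS]
        print(name, row)
```

Under these assumptions the constant overhead of 45 cycles (47 with one pipeline stage) is amortized over the message, and the cpb approaches 2/8 = 0.25 for long messages, since each additional 8 bytes of message comes with 8 bytes of AD and costs two cycles.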

Fig. 6. Hardware performance of pipelined design of Trivia-ck for different message lengths

4.1 Comparison

Not all CAESAR candidates have hardware implementations so far, and there is no other hardware implementation of Trivia-ck. The comparison is therefore made against the known results listed in Table 3, which shows that Trivia-ck achieves the best throughput/area among the listed designs, together with high throughput and low cycles per byte.

Table 3. Performance results, Trivia-ck

Algorithm              Area (KGE)   Throughput (Gbps)   Cycles per Byte (cpb)   Efficiency (Gbps/KGE)
Trivia-ck base         23.6         73.9                0.25                    3.13
Trivia-ck pipelined    24.4         91.2                0.25                    3.73
ICEPOLE v1 [5]         -            42                  -                       -
SCREAM, iSCREAM [6]    17.29        5.19                -                       0.30
AES-GCM [7]            20.5         2.62                -                       0.13
AEGIS [8] AO1          20.55        1.35                6.67                    0.07
AEGIS [8] AO2          60.88        37.44               0.33                    0.61
AEGIS [8] TO1          88.91        53.55               0.20                    0.60
AEGIS [8] TO2          172.72       121.07              0.07                    0.70
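The efficiency column of Table 3 is the ratio of throughput to area. The following Python sketch recomputes it for the rows that list both quantities and compares the result with the tabulated value. The dictionary layout and the 0.01 tolerance are choices made for this illustration only.

```python
# (area in KGE, throughput in Gbps, listed efficiency in Gbps/KGE), from Table 3
TABLE3 = {
    "Trivia-ck base":      (23.6, 73.9, 3.13),
    "Trivia-ck pipelined": (24.4, 91.2, 3.73),
    "SCREAM, iSCREAM":     (17.29, 5.19, 0.30),
    "AES-GCM":             (20.5, 2.62, 0.13),
    "AEGIS AO1":           (20.55, 1.35, 0.07),
    "AEGIS AO2":           (60.88, 37.44, 0.61),
    "AEGIS TO1":           (88.91, 53.55, 0.60),
    "AEGIS TO2":           (172.72, 121.07, 0.70),
}


def efficiency(throughput_gbps: float, area_kge: float) -> float:
    """Throughput per unit area in Gbps/KGE (numerically equal to Mbps/GE)."""
    return throughput_gbps / area_kge


if __name__ == "__main__":
    for name, (area, throughput, listed) in TABLE3.items():
        computed = efficiency(throughput, area)
        print(f"{name:22s} computed {computed:.3f}  listed {listed:.2f}")
```

Ranking the rows by the computed ratio places both Trivia-ck designs ahead of the other listed implementations.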

5 Conclusion

In this paper we presented an efficient hardware implementation of the Trivia-ck authenticated encryption algorithm, which achieves very high throughput and throughput/area. We also analyzed a combined encryption-decryption implementation, which comes at the cost of a small area increase. We compared the performance of our architecture with other CAESAR candidates and showed that our implementation provides better throughput/area than the listed designs.

References

1. Avik Chakraborti and Mridul Nandi, TriviA-ck-v1. Available at http://competitions.cr.yp.to/round1/triviackv1.pdf.
2. CAESAR Competition. Available at http://competitions.cr.yp.to/caesar.html.
3. Bellare, Mihir, and Chanathip Namprempre. "Authenticated encryption: Relations among notions and analysis of the generic composition paradigm." Advances in Cryptology - ASIACRYPT 2000. Springer Berlin Heidelberg, 2000. 531-545.
4. De Canniere, Christophe, and Bart Preneel. "Trivium." New Stream Cipher Designs. Springer Berlin Heidelberg, 2008. 244-266.
5. P. Morawiecki, K. Gaj, E. Homsirikamol, K. Matusiewicz, J. Pieprzyk, M. Rogawski, M. Srebrny and M. Wojcik, ICEPOLE v1. Available at http://competitions.cr.yp.to/round1/icepolev1.pdf, last accessed Feb 4, 2015.
6. V. Grosso, G. Leurent, F. Standaert, K. Varici, F. Durvaux, L. Gaspar and S. Kerckhof. SCREAM and iSCREAM: Side-Channel Resistant Authenticated Encryption with Masking. Available at http://competitions.cr.yp.to/round1/screamv1.pdf, last accessed Feb 4, 2015.
7. M. Mozaffari-Kermani and A. Reyhani-Masoleh, (2012). Efficient and high-performance parallel hardware architectures for the AES-GCM. IEEE Transactions on Computers, 61(8), pp. 1165-1178.
8. Debjyoti Bhattacharjee and Anupam Chattopadhyay, (2014). Efficient Hardware Accelerator for AEGIS-128 Authenticated Encryption. Inscrypt 2014.
