Test Instruction Set (TIS) for High Level Self-Testing of CPU Cores

Saeed Shamshiri, Hadi Esmaeilzadeh, and Zainalabedin Navabi
Electrical and Computer Engineering Department, University of Tehran, Tehran, Iran
{shamshiri, hadi}@cad.ece.ut.ac.ir, [email protected]

Abstract

TIS (Test Instruction Set) is an instruction level technique for CPU core self-testing. The method is based on enhancing a CPU instruction set with test instructions. TIS replaces the NOP instruction, which is available in most processors, with test instructions so that online testing can be done with no performance penalty. The method can be applied to both offline and online (concurrent) testing of all types of processors (single-cycle, multi-cycle and pipelined). TIS is particularly appropriate for pipelined architectures, in which one or many NOP instructions (or stalls) are inserted between instructions that are data or control dependent. We have implemented this test method on a pipelined CPU core, and several test programs for this CPU are used to illustrate the method. Fault coverage results are also presented to demonstrate the effectiveness of the TIS test technique.

1. Introduction

Embedded processor cores are widely used in many SoCs because they offer several advantages over ASICs, including design reuse and portability. Core based design allows processors to be used in a variety of applications in a cost effective manner. On the other hand, design based on processor cores presents new challenges for testing, since access to these embedded processors becomes further removed from the pins of the chip [1]. Self-testing for high-speed circuits has clear advantages over testing through external testers. The tester's OTA (Overall Timing Accuracy) does not improve as fast as the on-chip clock speed increases, and this implies more yield loss [2].

One approach for realizing self-testing is running a test program on the processor, which tests it by its own instructions. This pure software self-testing method has some disadvantages, including low fault coverage, a large program size that may not fit in an on-chip memory, and long test time [3]. Several approaches have been proposed for self-testing of a microprocessor for either stuck-at or delay faults by test program generation [4][5][6][7][8][9][10]. Another proposed method is an instruction level DFT that adds instructions for improving the controllability and observability of processor cores for software based self-testing [3].

In our proposed method (which we refer to as TIS), test instructions are added and employed to test a processor core. This instruction level testing method can be used for both online and offline testing. In the offline testing phase, the only instructions that run in the CPU are test instructions. Therefore all combinational and sequential parts of the processor can be tested with high fault coverage. In the online testing phase, test instructions are inserted in the machine code by the assembler or compiler instead of NOP instructions. This way, combinational parts of the processor are tested while the processor performs its normal operation, without any performance penalty.

Our proposed method follows a unique approach for both online and offline testing of processor cores. For testing the processor core, our method utilizes all the time that is otherwise wasted due to processor stalls after data, control and structural hazards or cache misses. The TIS method is appropriate for online testing of pipelined architectures. In a pipelined architecture, one or many NOP instructions are inserted as stalls between instructions which are data or control dependent. The focus of this paper is the implementation of TIS using BIST. TIS employs a BIST architecture to facilitate test vector generation and response analysis of different parts of a processor.
As a different implementation, it is possible to implement TIS completely in software, in such a way that each test instruction fetches a data word that is regarded as a test vector and applies it to the processor, while a signature generator collects the test result. However, this approach is not efficient because of its large memory usage and frequent memory accesses.

Section 2 illustrates the concept of the proposed instruction level testing. Section 3 discusses replacing NOP instructions by test instructions. Section 4 explains the implementation of TIS, its framework and its challenges. Finally, experimental results are presented in Section 5 and the paper is concluded in Section 6.

Proceedings of the 13th Asian Test Symposium (ATS 2004) 0-7695-2235-1/04 $20.00 © 2004 IEEE Authorized licensed use limited to: Univ of Calif Santa Barbara. Downloaded on May 12, 2009 at 20:16 from IEEE Xplore. Restrictions apply.

2. The Proposed Instruction Level Testing Technique

TIS is an instruction level test technique for CPU core self-testing, which utilizes NOP instructions for online testing. As a common practice, NOP instructions are inserted to provide stalls for resolving data and control dependencies in pipelined CPUs. These instructions degrade the performance of the processor. Our proposed technique employs these instructions to test the processor's various units while the CPU performs its normal tasks. In this method, NOP instructions are replaced with test instructions. The functionality of these instructions is the same as that of the NOP instruction, so this replacement has no effect on the running program's operation and performance. The TIS method can also be used for at-speed offline testing. In this situation all the running instructions are test instructions.

TIS supports both deterministic and random test approaches. When TIS is used as a deterministic test method, test instructions load test vectors from a specific internal test memory and apply them to different combinational parts of the processor under test. In random test mode, on the other hand, test vectors are generated using an LFSR (Linear Feedback Shift Register). For both of these approaches, the results of testing are collected in MISRs (Multiple Input Signature Registers).

3. Test Instruction Instead of Stall

In ordinary multi-cycle architectures the NOP instruction is used for inserting a delay in program execution. In pipelined architectures the NOP instruction is inserted for hazard elimination in addition to delay generation. There are three types of hazards: structural, data, and control.

A structural hazard may occur when there are not enough hardware resources for the execution of consecutive instructions. In processors with simple architectures, this hazard is usually eliminated in the design phase; but in architectures that use more than one functional unit for instruction level parallelism (ILP) this kind of hazard can occur [11].

A data hazard occurs when the CPU executes data dependent instructions and there is not enough latency between these instructions. A common solution for preventing data hazards is special hardware called a forwarding unit. This hardware detects dependencies and forwards the required result from the running instruction to the dependent instructions that follow it. In some cases, it is impossible to forward the results because they are not ready. In this situation using NOP is inevitable. In the most common pipelined architectures this situation happens after memory loads that are followed by instructions that depend on the load results.

The last type of hazard is the control hazard, which occurs when a branch prediction is mistaken or, in general, when the system has no mechanism for branch prediction. In most CPU architectures, conditional jumps need more NOP instructions than unconditional jumps, because validation of the branch prediction needs more latency.

Another situation that usually forces the CPU to push stalls between instructions is a cache miss. Some CPU architectures attempt to solve the problem of cache misses by out of order execution. This approach decreases the cache miss penalty and partially solves the problem but does not completely eliminate it. When the next instructions need the result of the instruction that encounters a cache miss, the processor must suspend them.

When the processor freezes an instruction execution because of structural, data or control hazards or cache misses, one or many CPU cycles are wasted by stalls or NOPs. Our TIS method utilizes these wasted cycles by running test instructions. All logical units can be tested using one or more test instructions in their inactive periods that are due to hazards or cache misses.

4. Implementation

4.1. PAYEH as an Implementation Framework

We have implemented our method on the Pipelined SAYEH (PAYEH) processor. SAYEH [12] is a multi-cycle RISC CPU with a 16-bit data bus and a 16-bit address bus. PAYEH is the pipelined version of SAYEH and has a similar instruction set. Table 4 shows the instruction set of PAYEH. The PAYEH processor has five pipe stages, illustrated in Fig. 3.


The PAYEH architecture does not generate any structural hazards. The instructions that need stalls are BRZ (branch if zero), BRC (branch if carry), JPA (jump addressed), JPR (jump relative) and LDA (load addressed). BRZ, BRC and JPA need two stalls, while JPR and LDA need one stall. Having introduced the PAYEH processor as a framework for TIS implementation, the next section discusses the trade offs and challenges of the TIS implementation.
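The stall counts listed above can be sketched as a small assembler post-pass that fills each stall slot with a test instruction instead of a NOP. This is only an illustration: the function name `fill_stalls`, the list-of-mnemonics program representation and the sample loop are assumptions, not part of the PAYEH tool flow. The sketch also gives every `lda` one slot, whereas the text only requires it when the following instruction depends on the loaded value.

```python
# Stall slots per instruction on PAYEH, taken from the text above.
STALLS = {"brz": 2, "brc": 2, "jpa": 2, "jpr": 1, "lda": 1}


def fill_stalls(program, filler="tst"):
    """Return a copy of `program` with every stall slot filled by `filler`.

    `program` is a list of lowercase mnemonics (operands omitted).
    Simplification (assumption): every `lda` gets one slot, even if the
    next instruction does not depend on the loaded value.
    """
    out = []
    for mnemonic in program:
        out.append(mnemonic)
        out.extend([filler] * STALLS.get(mnemonic, 0))
    return out


# Hypothetical loop body, not one of the programs in Fig. 2.
loop = ["brz", "mul", "dec", "jpr"]
print(fill_stalls(loop))
```

Passing `filler="nop"` reproduces the conventional stall insertion, which makes it easy to see that the program length, and therefore the execution time, is the same in both cases.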

4.2. TIS Trade Offs and Challenges

Our method is based on introducing a test instruction which can test combinational parts of a processor. When the CPU loads a test instruction instead of NOP, the test instruction activates all combinational parts of the design in its path by injecting a test vector for each part and collecting the results. The test instruction does not write to the register file or to the internal pipe registers, because registers hold the state of the processor and this state must be kept unchanged while test instructions run. Note that test instructions are functionally equivalent to the NOP instruction.

We select a BIST (Built In Self Test) strategy for test vector generation and collection of the results. This means that an LFSR and a MISR are inserted before and after each combinational component, for injecting random test vectors and holding the output results of the component respectively. The MISR captures the results of the component and compresses them into a short signature; this signature is then compared with the expected signature to validate the component.

To reduce the hardware overhead, all components can work with the same LFSR. For example, in our case study in PAYEH, a 32-bit LFSR is sufficient for feeding all combinational components including the adder, ALU, control unit and the branch unit. Furthermore, all of these components can be tested using the same random input data. However, for better fault coverage it is best to use a dedicated LFSR for each component, with a polynomial and starting seed that are tuned to best cover part of the predetermined test vectors of that component.

The number of MISRs must also be chosen for a CUT with hardware overhead in mind. Components in the same pipe stage should each have a dedicated MISR, since they all work simultaneously during the same clock period. For example, in PAYEH two MISRs are required in the ID stage: one for the adder and one for the control unit.

There is a trade off between the number of MISRs in one pipe stage and the number of test instructions. For example, by introducing two test instructions, the adder and the control unit can put their results into the same MISR. When the first test instruction executes and passes through the ID stage, the adder enters the test mode and the MISR collects its results. When the second test instruction arrives at the ID stage, the control unit enters the test mode and the MISR collects the results of the control unit. Using two test instructions resolves the resource conflict by distributing it in the time domain. In other words, there is a time (efficiency) and space (area or cost) trade off. Increasing the number of test instructions decreases the required hardware resources, and hence the total cost, but increases the test time, and hence decreases the test performance. Therefore, to reduce the hardware cost to one MISR per pipe stage, the number of required test instructions must be the same as the maximum number of combinational units in a pipe stage. When there is more than one test instruction, the processor can choose among them randomly or in a specific order when issuing test instructions instead of stalls.

Different pipe stages need separate MISRs, since all pipe stages may run a test instruction at the same time. Our proposed method needs to collect the result of the unit-under-test within one pipe stage. This means that the MISR must perform its task in one clock cycle. Therefore a parallel implementation of a MISR is employed.
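The LFSR, unit-under-test and parallel MISR chain described in this section can be sketched in software as follows. This is a behavioral illustration under stated assumptions, not the PAYEH hardware: the paper does not give its polynomials, seeds or MISR width, so the 16-bit LFSR taps (16, 14, 13, 11), the MISR polynomial `0x1021`, the seed `0xACE1` and the modelled faults (one adder output bit stuck at 0 or 1) are all chosen here for illustration only.

```python
MASK16 = 0xFFFF


def lfsr_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one clock."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def misr_step(state, data, poly=0x1021):
    """Absorb one 16-bit response word in a single clock (parallel MISR)."""
    feedback = poly if state & 0x8000 else 0
    return ((state << 1) & MASK16) ^ feedback ^ (data & MASK16)


def run_bist(unit, n_vectors, seed=0xACE1):
    """Drive `unit` with LFSR-generated operands and return the signature."""
    lfsr, misr = seed, 0
    for _ in range(n_vectors):
        a = lfsr
        lfsr = lfsr_step(lfsr)
        b = lfsr
        lfsr = lfsr_step(lfsr)
        misr = misr_step(misr, unit(a, b))
    return misr


def adder(a, b):
    """Fault-free 16-bit adder (carry out ignored)."""
    return (a + b) & MASK16


def adder_bit3_stuck_at_0(a, b):
    """Hypothetical faulty adder: sum bit 3 stuck at 0."""
    return adder(a, b) & ~0x0008


def adder_bit3_stuck_at_1(a, b):
    """Hypothetical faulty adder: sum bit 3 stuck at 1."""
    return adder(a, b) | 0x0008


golden = run_bist(adder, 8192)  # expected signature of the good unit
for faulty in (adder_bit3_stuck_at_0, adder_bit3_stuck_at_1):
    observed = run_bist(faulty, 8192)
    verdict = "detected" if observed != golden else "aliased"
    print(faulty.__name__, verdict)
```

Each call to `misr_step` consumes a full response word at once, which mirrors the requirement that the MISR finish its work within the single clock cycle a test instruction spends in a pipe stage.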

4.3. TIS Implementation

Fig. 1. TIS implementation with separate LFSRs and parallel MISRs for each component. When a test instruction passes through a pipe stage, the BIST controller puts all the combinational units in that stage in test mode.


The combinational parts of PAYEH are its control unit and a 16-bit adder in the ID (instruction decode) stage, and a 16-bit ALU and a branch unit for branch evaluation in the EXE (execution) stage. In this work, we used separate LFSRs and parallel MISRs for each component (see Fig. 1). By employing four MISRs, we need only one test instruction. This test instruction, called TST, has the same functionality as the NOP instruction but tests all combinational components of PAYEH.
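The MISR versus test-instruction trade off can be made concrete with the PAYEH unit list given above. The sketch below is an illustration: the helper names are hypothetical, and the ceiling formula for intermediate MISR budgets is an assumed generalization of the two cases the paper states (one MISR per stage, or one MISR per unit).

```python
import math

# Combinational units per pipe stage in PAYEH, as listed in the text.
PAYEH_UNITS = {
    "ID": ["control unit", "adder"],
    "EXE": ["ALU", "branch unit"],
}


def required_test_instructions(units_per_stage, misrs_per_stage):
    """Test instructions needed when each stage has at most `misrs_per_stage` MISRs.

    Assumption: one test instruction can exercise as many units in a stage
    as that stage has MISRs, so the busiest stage sets the count.
    """
    return max(
        math.ceil(len(units) / misrs_per_stage)
        for units in units_per_stage.values()
    )


def total_misrs(units_per_stage, misrs_per_stage):
    """Total MISRs instantiated; a stage never needs more MISRs than units."""
    return sum(
        min(len(units), misrs_per_stage) for units in units_per_stage.values()
    )


for budget in (1, 2):
    print(
        f"{budget} MISR(s) per stage:",
        total_misrs(PAYEH_UNITS, budget), "MISRs,",
        required_test_instructions(PAYEH_UNITS, budget), "test instruction(s)",
    )
```

With a budget of two MISRs per stage the sketch reports four MISRs and a single test instruction, which is the configuration chosen for PAYEH; with one MISR per stage it reports two MISRs and two test instructions.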

5. Experimental Results

To demonstrate the results of TIS, several experiments have been done. The first objective is to illustrate the role of the test instruction in online testing of the processor, and the second is to measure the fault coverage of the method. For the first objective, we used the following programs.

Power: This program calculates a^b for natural numbers a and b (see Fig. 2.a). Two stalls after the BRZ instruction and one stall after the JPR instruction are filled with TST instructions. In the loop body, 3 instructions out of 7 are TST instructions, so the test rate is 43%.

Factorial: This program calculates a! (see Fig. 2.b). Two stalls after the BRZ instruction and one stall after the JPR instruction are filled with TST instructions. In the loop body, 3 instructions out of 7 are TST instructions, so the test rate is 43%.

Fibonacci: This program calculates the nth term of the Fibonacci series (see Fig. 2.c). Two stalls after each BRZ instruction and one stall after each JPR instruction are filled with TST instructions. In the loop body, 5 instructions out of 12 are TST instructions, so the test rate is 42%.

Vector addition: This program adds two vectors from the data memory and stores the results into the data memory (see Fig. 2.d). Two stalls after the BRZ instruction, one stall after the JPR instruction and one stall after the dependent LDA instruction are filled with TST instructions. In the loop body, 4 instructions out of 14 are TST instructions, so the test rate is 29%.

Table 1 summarizes these results. Since jump and branch instructions occur frequently, utilizing their stalls for online testing of the processor core achieves a high online test rate without any performance penalty.

Fig. 2. Running benchmark programs on the PAYEH processor after replacing NOPs with TSTs. a. Calculating R2 = R0^R1. b. Calculating R1 = R0!. c. Calculating the R0-th term of the Fibonacci series. d. Adding two vectors from the data memory together and saving the results.

Table 1. Result summary of the benchmark programs run on the PAYEH processor.

Program          Test Rate
Power            43%
Factorial        43%
Fibonacci        42%
Vector Addition  29%
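The test rates in Table 1 follow directly from the loop-body counts quoted in the text: the number of TST instructions divided by the loop length, rounded to the nearest percent. A minimal check, with the dictionary and function names chosen here for illustration:

```python
# (TST instructions, total instructions) in each loop body, from the text.
LOOP_BODIES = {
    "Power": (3, 7),
    "Factorial": (3, 7),
    "Fibonacci": (5, 12),
    "Vector Addition": (4, 14),
}


def tst_rate_percent(tst_count, loop_length):
    """Share of loop-body slots occupied by TST, rounded to a whole percent."""
    return round(100 * tst_count / loop_length)


for name, (tst_count, loop_length) in LOOP_BODIES.items():
    print(f"{name}: {tst_rate_percent(tst_count, loop_length)}%")
```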

Table 2. Fault coverage of each combinational component after testing with 8192 randomly generated test vectors.

Component     Fault Coverage
Control Unit  97.3%
Branch Unit   100%
Adder         96.3%
ALU           91%


Table 3. The area overhead of the TIS method in the PAYEH processor.

Design          Number of Gates
PAYEH           55876
PAYEH with TIS  63216

Area Overhead: 13.1%
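As a consistency check, the overhead figure can be recomputed from the two gate counts in Table 3, taking the overhead as the added gates relative to the original design:

```latex
\[
\text{Area overhead}
  = \frac{63216 - 55876}{55876}
  = \frac{7340}{55876}
  \approx 0.131
  = 13.1\%
\]
```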

In the fault coverage measurement process, the fault coverage of each combinational component is measured separately. The method used for fault coverage measurement is based on synthesizing the design into a faulty library. In the faulty library, each gate reports its detected stuck-at faults during the test procedure [13]. Table 2 shows the fault coverage achieved for each component after testing it with 8192 test vectors generated by the LFSRs.

To achieve high fault coverage for the complete processor, sequential parts of the processor must also be tested. Testing of the sequential parts is feasible only in the offline mode, when the processor is in the test mode and nothing needs to be kept in the registers as the state of the system. To test the sequential parts of the processor using TIS, the BIST approach should be combined with full scan. Internal memory modules (data cache, instruction cache and register file) can also be tested using one of the memory testing methods.

The implementation of the TIS method on the PAYEH processor with four LFSRs and four parallel MISRs has been synthesized with a 0.5 micron ASIC technology. Table 3 shows the post-synthesis hardware overhead of this implementation.
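To make the fault coverage figures concrete, the sketch below computes stuck-at fault coverage for a tiny gate-level circuit by serial fault simulation: each fault is injected in turn and counted as detected if any vector changes an output. This is not the faulty-library method of [13]; the one-bit full adder netlist, the net names and the restriction to stem faults (no fanout-branch faults) are assumptions made to keep the example short.

```python
from itertools import product

# Nets of a hypothetical gate-level one-bit full adder.
NETS = ["a", "b", "cin", "n1", "n2", "n3", "s", "cout"]


def full_adder(a, b, cin, fault=None):
    """Evaluate the netlist; `fault` is (net, stuck_value) or None."""
    v = {}

    def drive(net, value):
        # A faulty net ignores its driver and holds the stuck value.
        v[net] = fault[1] if fault and fault[0] == net else value

    drive("a", a)
    drive("b", b)
    drive("cin", cin)
    drive("n1", v["a"] ^ v["b"])
    drive("n2", v["a"] & v["b"])
    drive("n3", v["n1"] & v["cin"])
    drive("s", v["n1"] ^ v["cin"])
    drive("cout", v["n2"] | v["n3"])
    return v["s"], v["cout"]


def fault_coverage(vectors):
    """Fraction of single stuck-at faults detected by `vectors`."""
    faults = [(net, stuck) for net in NETS for stuck in (0, 1)]
    detected = {
        f
        for f in faults
        for vec in vectors
        if full_adder(*vec, fault=f) != full_adder(*vec)
    }
    return len(detected) / len(faults)


exhaustive = list(product((0, 1), repeat=3))
print("one vector :", fault_coverage([(0, 0, 0)]))
print("exhaustive :", fault_coverage(exhaustive))
```

In the paper's setting the vectors come from the LFSRs rather than from an exhaustive list, which is why coverage depends on the number of vectors applied and on how well the pseudo-random sequence exercises each component.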

6. Conclusion and Future Work

An instruction level test methodology for embedded CPU core self-testing was presented. The proposed method, TIS, adds test instructions that enable the processor to test its different parts. The implementation challenges of TIS were explained, and a real implementation of the method on the PAYEH processor was presented. Some sample programs were used to demonstrate the method's appropriateness for at-speed online testing of pipelined processors. Furthermore, the fault coverage of each component was measured. These measurements show that the method can achieve a desirable level of fault coverage for at-speed online and offline self-testing. Applying this method to other processors with more complicated architectures, such as VLIW and superscalar processors, is our future work.

7. References

[1] Murray and J Hayes. Testing ICs, getting to the core of the problem. IEEE Design and Test of Computers. Vol. 29, No. 11.
[2] The National Technology Roadmap for Semiconductors, Semiconductor Industry Association, 1997.
[3] Wei-Cheng Lai, Kwang-Ting (Tim) Cheng. Instruction-Level DFT for Testing Processor and IP cores in System-on-a-Chip. Design Automation Conference, June 2001.
[4] W.-C. Lai, A. Krstic, and K.-T. Cheng. Test Program Synthesis for Path Delay Faults in Microprocessor. Proceedings of International Test Conference, pages 1080-1089, 2000.
[5] L. Chen and S. Dey. DEFUSE: A Deterministic Functional Self-Test Methodology for Processors. VLSI Test Symp., pp. 255-262, May 2000.
[6] D. Brahme and J.A. Abraham. Functional Testing of Microprocessors. IEEE Transactions on Computers, vol. C-33, pages 475-485, 1984.
[7] F. Distante and V. Piuri. Optimum Behavioral Test Procedure for VLSI Devices: A Simulated Annealing Approach. Proceedings of the IEEE International Conference on Computer Design, pages 31-35, 1986.
[8] J. Shen and J.A. Abraham. Native Mode Functional Test Generation for Processors with Applications to Self Test and Design Validation. Proceedings of International Test Conference, pages 990-999, 1998.
[9] K. Batcher and C.A. Papachristou. Instruction Randomization Self Test For Processor Cores. VLSI Test Symposium, pages 34-40, 1999.
[10] J. Lee and J.H. Patel. Architectural Level Test Generation for Microprocessors. IEEE Transactions on Computer-aided Design of Integrated Circuits and Systems, 13(10):1288-1300, October 1994.
[11] D. A. Patterson and J. L. Hennessy. Computer Architecture: A Quantitative Approach, 3rd Edition, Morgan Kaufmann, San Francisco, 2003.
[12] Zainalabedin Navabi. Digital Design and Implementation with Field Programmable Devices, Kluwer Academic Publisher, May 2004.
[13] M. Zolfy, S. Mirkhani, Z. Navabi. SPC-FC: A New Method for Fault Simulation Implemented in VHDL, Proc. of NATW'01, pp. 17-21, June 2001.


Fig. 3. The data path and controller of PAYEH with its five pipe stages.

Table 4. PAYEH instruction set.

Opcode (15:10)  Mnemonic  Definition          RTL Notation: Comments or Condition
0000            mvi       Move Immediate      Rd <= sign_extend(I)
0001            mvr       Move Register       Rd <= Rs
0010            lda       Load Addressed      Rd <= (Rs)
0011            sta       Store Addressed     (Rd) <= Rs
0100            and       AND Registers       Rd <= Rd & Rs
0101            orr       OR Registers        Rd <= Rd | Rs
0110            not       NOT Registers       Rd <= ^Rs
0111            add       Add Registers       Rd <= Rd + Rs + C
1000            sub       Subtract Registers  Rd <= Rd - Rs - C
1001            mul       Multiply Registers  Rd <= Rd * Rs
1010-00         spc       Save PC             Rs <= PC + I + 1
1011-00         jpa       Jump Addressed      PC <= Rs + I
1011-01         jpr       Jump Relative       PC <= PC + I + 1
1011-10         brz       Branch if Zero      PC <= PC + I + 1 :if Z=1
1011-11         brc       Branch if Carry     PC <= PC + I + 1 :if C=1
1100            mwp       Move to WP          WP <= D&S
1101-00         srf       Set Reset Flags     Z <= S(1), C <= S(0)
1101-01         inc       Increment           Rs <= Rs + I + C
1101-10         dec       Decrement           Rs <= Rs - I - C
1110            mov       Move                WPd:Rd <= WPs:Rs
1111-00         ror       Rotate Right        Rs <= Rs >> I[3:0]
1111-01         rol       Rotate Left         Rs <= Rs << I[3:0]
1111-10         nop       No operation        No operation
1111-11         hlt       Halt                Halt, fetching stops
