# EE 8217 *Reconfigurable Computing Systems Engineering* Paqe 

 Sample of Final ExaminationStudent name:
Date: June 26, 2008

## General requirements for the exam:

1. This is CLOSED BOOK examination;
2. No questions allowed within the examination period;
3. If something is not clear in question please, put your assumptions;
4. No extra papers cell-phones or programmable calculators are allowed;
5. For calculations or assumptions you can use reserved space in the exam paper or opposite side of each page;
6. It is allowed for use: Pens and pencils, erasers, simple calculators and rulers.
7. All explanations in boxes and tables has to be printed

## Problem 1: Selection the best way for embedded system implementation

The embedded computing system has to work in 11 operational modes: 1 base mode (initial) mode and 10 additional modes initiated by conditional cases.
The simulation results of the compiled HDL-code for each mode has shown the following requirement for FPGA resources:
a) Base-mode required 38,120 logic cells;
b) Each other of the rest of 10 additional modes required extra 12,380 logic cells.

Note1: Extra logic associated with additional mode has to be added to the base-mode logic to perform any the additional modes of operation. In other words, the only logic (hardware) overlap between additional modes is base-mode logic.
Note 2: System can perform only one of the above 11 modes of operation
There are three possible solutions for system-on-chip (SoC) implementation of the above embedded computing system:

1. Implementation of the SoC in one large statically configured FPGA device, which accommodates all logic for all 11 modes of operation;
2. Implementation of the SoC in ASIC;
3. Implementation of the SoC in one FPGA device to be reconfigured to requested mode when necessary (simple multi-mode RSC implementation)

Suggested family of FPGA devices is Xilinx Virtex-4 LX. Available logic resources and approximate cost of each FPGA device from the family is given in Table 1 (see below).

Table 1: Xilinx Virtex-4 FPGA devices

| FPGA Device | Number of logic cells | Configuration bits | Approximate cost |
| :--- | :--- | :--- | :--- |
| XC4VLX25 | 24,192 | $7,819,904$ | $\$ 350$ |
| XC4VLX40 | 41,472 | $12,259,712$ | $\$ 800$ |
| XC4VLX60 | 59,904 | $17,717,632$ | $\$ 1,400$ |
| XC4VLX80 | 80,640 | $23,291,008$ | $\$ 2,400$ |
| XC4VLX100 | 110,592 | $30,711,680$ | $\$ 4,000$ |
| XC4VLX160 | 152,064 | $40,347,008$ | $\$ 8,000$ |
| XC4VLX200 | 200,448 | $51,367,808$ | $\$ 10,000$ |

## Sample of Final Examination

Estimated performance and approximate development cost of the SoC in different forms of implementation is given in Table 2 (see below).

Table 2: Performance and development cost estimation

| SoC implementation | Performance | SoC development cost |
| :--- | :--- | :---: |
| Large FPGA | 40 data-frame/ sec | $\$ 5,000$ |
| ASIC | 60 data-frame/ sec | $\$ 1,000,000$ |
| Multi-mode RCS | 30 data-frame/ sec | $\$ 10,000$ |

To figure out which way of embedded computing system implementation is the best for different amounts of production perform the following analysis:

## Question 1.1

Determine: a) the FPGA device for the SoC in large FPGA with static configuration and b) the volume of EEPROM to store the configuration file for this FPGA.

## Q 1.1.1

Amount of total logic resources needed to accommodate all 11 modes in FPGA - Rtotal
Rtotal $=$ $\qquad$ logic cells

Show calculations

The FPGA selected (from Table 1) - $\qquad$

## Q 1.1.2

[2 marks]
The volume of EEPROM for the configuration file - Veeprom $=$ $\qquad$ M Byte

Note: Configuration EEPROM devices are available in $2^{n} \mathrm{MB}$ (e.g. $2 \mathrm{MB}, 4 \mathrm{MB}, 8 \mathrm{MB}$, etc.). $1 \mathrm{MB}=1024$ Bytes

Show calculations

# EE 8217 *Reconfigurable Computing Systems Engineering* Paqe Sample of Final Examination 

## Question 1.2

Determine: a) the FPGA device for the SoC implementation in form of simple Multimode RCS with one FPGA device to be reconfigured to requested mode when necessary and b) the volume of EEPROM to store all configuration bits streams for all 11 modes.

Note: Each mode has its own configuration bit-stream
Q 1.2.1
Amount of total logic resources needed to accommodate any of 11 modes - Rtotal( $R C S$ ) $]$
$\boldsymbol{R}$ total $(R C S)=$ $\qquad$ logic cells

Show calculations

The FPGA selected (from Table 1) - $\qquad$
Q 1.2.2
[2 marks]
The volume of EEPROM for the configuration file $-\operatorname{Veeprom}(R C S)=$ $\qquad$ M Byte

Note: Configuration EEPROM devices are available in $2^{n} \mathrm{MB}$ (e.g. 2MB, 4MB, 8 MB , etc.). $1 \mathrm{MB}=1024$ Bytes

Show calculations

## Question 1.3

Compare the cost-effectiveness of possible implementation of the above embedded computing system for different volumes of production.

Cost-Performance Ratio $(\mathrm{CPR})=$ System performance $/$ System cost
System cost $=$ Cost of FPGA or ASIC + Cost of the board and supporting components + SoC Development cost / number of systems to be produced

## EE 8217 *Reconfigurable Computing Systems Engineering* Paqe

The following data is provided regarding the cost of other components:
a) Cost of the board and supporting components is approximately the same for each variant of embedded system implementation and equal to $\$ 300$ / system
b) Cost of ASIC produced in quantity of 100 units is equal to $\$ 2050$ / device,
c) Cost of ASIC produced in quantity of 1,000 units is equal to $\$ 250 /$ device,
d) Cost of ASIC produced in quantity of 10,000 units is equal to $\$ 70 /$ device

Calculate CPR for: i) Large FPGA-based system - CPR(fpga), ii) RCS based system $C P R(r c s)$, iii) ASIC-based system - CPR(asic) for the following cases:

Case 1: Volume of production - 100 embedded computing systems
Case 2: Volume of production - 1,000 embedded computing systems
Case 3: Volume of production - 10,000 embedded computing systems
Fill the Table 3 and select the best way of system implementation for each Case (according to highest value of respective CPR)

| Case \# | CPR(fpga) | CPR(rcs) | CPR(asic) | Best solution |
| :--- | :--- | :--- | :--- | :--- |
| Case 1: |  |  |  |  |
| 100 systems |  |  |  |  |
| Case 2: |  |  |  |  |
| 1000 systems |  |  |  |  |
| Case 3: |  |  |  |  |
| 10,000 systems |  |  |  |  |

Show all 9 CPR calculations below (if space is not enough, use the opposite side of the page)


# EE 8217 *Reconfigurable Computing Systems Engineering* Paqe 

## Problem 2: Performance and Speedup estimation and VHC creation

The following formula is given for data computation: $Y=a^{2}+a^{*} b+b^{2}$
The data segment consists of two vectors, where each vector contains 1000 elements:
$\boldsymbol{A}=\left[\boldsymbol{a}_{1}, \boldsymbol{a}_{2}, \boldsymbol{a}_{3}, \ldots \boldsymbol{a}_{n}\right]$ and $\boldsymbol{B}=\left[\boldsymbol{b}_{1}, \boldsymbol{b}_{2}, \boldsymbol{b}_{3}, \ldots \boldsymbol{b}_{n}\right]$, where $n=1000$. Each pair of elements of the above vectors: $\boldsymbol{a}_{\boldsymbol{i}}$ and $\boldsymbol{b}_{\boldsymbol{i}}$ for $i=1,2,3, \ldots n$ has to be processed by the given formula.

The above formula was programmed in program loop using the following instructions:

| Address | Operation | Operand 1 | Operand 2 | Result location |
| :--- | :--- | :--- | :--- | :--- |
| $0 \times 10000$ | Clear | Operand in $\boldsymbol{E}$ |  | Store result in Reg. $\boldsymbol{E}$ |
| $0 \times 10002$ | Multiply | Operand $\boldsymbol{a}_{\boldsymbol{i}}$ | Operand $\boldsymbol{a}_{\boldsymbol{i}}$ | Store result in Reg. $\boldsymbol{A}$ |
| $0 \times 10004$ | Multiply | Operand $\boldsymbol{b}_{\boldsymbol{i}}$ | Operand $\boldsymbol{b}_{\boldsymbol{i}}$ | Store result in Reg. $\boldsymbol{B}$ |
| $0 \times 10006$ | Multiply | Operand $\boldsymbol{a}_{\boldsymbol{i}}$ | Operand $\boldsymbol{b}_{\boldsymbol{i}}$ | Store result in Reg. $\boldsymbol{C}$ |
| $0 \times 10008$ | Add | Operand in $\boldsymbol{A}$ | Operand in $\boldsymbol{B}$ | Store result in Reg. $\boldsymbol{D}$ |
| $0 \times 1000 \mathrm{~A}$ | Add | Operand in $\boldsymbol{C}$ | Operand in $\boldsymbol{D}$ | Store result in Reg. $\boldsymbol{Y}$ |
| $0 \times 1000 \mathrm{~B}$ | Move | Operand in $\boldsymbol{Y}$ |  | Store result in memory |
| $0 \times 1000 \mathrm{D}$ | Increment | Operand in $\boldsymbol{E}$ |  | Store result in Reg. $\boldsymbol{E}$ |
| $0 \times 1000 \mathrm{~F}$ | GOTO | Address $\mathbf{1 0 0 1}$ | IF $\boldsymbol{E}<\mathbf{1 0 0 0}$ | ELSE GOTO Next |

Each instruction consists of the following stages:

1. IF - Instruction fetch (when instruction word is retrieved from the memory)

This stage requires - 1 clock cycle
2. ID - Instruction decode (when operation and data location are decoded)

This stage requires - 1 clock cycle.
3. $\mathbf{D F}$ - Data fetch (when data is delivered to the ALU)

This stage requires - 1 clock cycle.
4. EXE - Data execution stage requires: For simple arithmetic and logic operations (e.g. Add, Clear, Increment, Move, GOTO, etc.) - 1 clock cycle For multiplication and division - 2 clock cycles
5. $\mathbf{S R}$ - Storing the result -1 clock cycle.

## Question 2.1

Calculate the processing time of the above program loop if it is executed on CISC nonpipelined processor (e.g. micro-processor) with clock frequency $=100 \mathrm{MHz}$.

Q2.1.1:
[2 marks]
Calculate the total number of clock cycles spent for computation of one Y - element calculation (including return to address 0x10002 but without "Clear" operation)
Number of clock cycles = c.c.
Show calculations

# EE 8217 *Reconfigurable Computing Systems Engineering* Paqe 

## Sample of Final Examination

Q2.1.2:
[2 marks]
Calculate the total number of clock cycles spent for completion of data vectors computation (including start up "Clear" operation)

Number of clock cycles $=$
Show calculations

Q2.1.3:
[1 mark]
Calculate the total processing time for complete computation of data vectors (including start up "Clear" operation) for the operating frequency 100 MHz .

Total processing time $=$ $\qquad$ mS [milliseconds]

Show calculations

Question 2.2
Calculate the processing time of the above program loop if it is executed on RISC processor with Harward architecture: 5-stage instruction processing pipeline and separate instruction and data memory units and associated buses (clock frequency $=100 \mathrm{MHz}$ ).

## Q 2.2.1:

## [5 marks]

Calculate how many clock cycles will take execution of this segment of program (loop cycle time) on the simple pipeline without forwarding or bypassing when result of branch instruction (GOTO IF) is available after SR stage.
Show execution process on the timing diagram below of one loop cycle

| Instruction | Clock cycle number |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | 1 |  | 2 | 3 | 4 | 5 | d | 7 | 8 |  | 1 | 0 | 11 | 12 | 13 |  | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| Mult: $a_{i}, a_{i}$, Reg.A | IF |  | ID | DF | EX | E 2 | SR |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Mult: $b_{i}, b_{i}$, Reg.B |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Mult: $a i, b_{i}$, Reg.C |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Add:[A],[B], Reg.D |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Add:[C],[D], Reg.Y |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Move:[Y], Addr [E] |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Increment: [E] |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| GOTO: If[E] <1000 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

Total number of clock cycles for one Y element calculation $=$ $\qquad$ c.c.

## Q2.2.2:

[2 marks]
Calculate the total number of clock cycles needed for complete computation of data vectors (including start up "Clear" operation)

Number of clock cycles $=$ c.c.

Show calculations

## Q2.2.3:

[1 mark]
Calculate the total processing time for complete computation of data vectors (including start up "Clear" operation) for the operating frequency 100 MHz .

Total processing time $=$ $\qquad$ uS [micro-seconds]

Show calculations here

## Question 2.3

For the above formula: $Y=a^{2}+a^{*} b+b^{2}$, where $\boldsymbol{a}_{\boldsymbol{i}}$ and $\boldsymbol{b}_{\boldsymbol{i}}$ are 16 -bit elements of the data-vectors $\boldsymbol{A}=\left[\boldsymbol{a}_{1}, \boldsymbol{a}_{2}, \boldsymbol{a}_{3}, \ldots \boldsymbol{a}_{n}\right]$ and $\boldsymbol{B}=\left[\boldsymbol{b}_{1}, \boldsymbol{b}_{2}, \boldsymbol{b}_{3}, \ldots \boldsymbol{b}_{n}\right], n=1000$ and each pair of elements of the above vectors $\boldsymbol{a}_{\boldsymbol{i}}$ and $\boldsymbol{b}_{\boldsymbol{i}}$ for $\boldsymbol{i = 1 , 2 , 3 , \ldots n \text { has to be processed by the given }}$ formula, design the VHC -Virtual Hardware Component using Adders and Multipliers as functional units.
Resource Specification: Adder performs addition operation in 1 c.c. Multiplier performs multiplication in 2 c.c. System clock frequency - Fclk $=100 \mathrm{MHz}$
Q 2.3.1:
[3 marks]
In the box below draw the Symbol for the VHC to be designed. The Symbol should reflect all input / output buses, control and synchronization signals, clock \& reset. Note: For all buses please indicate number of lines and direction (e.g. input $\rightarrow$ )

For all signals please indicate name and direction (e.g. "Busy" $\leftarrow$ output)

# EE 8217 *Reconfigurable Computing Systems Engineering* Paqe 

## Q 2.3.2:

[2 marks]
In the box below draw the sequencing graph for the above task segment

## Q 2.3.3:

In the box below draw the scheduled sequencing graph for the above task segment.
Please provide binding with the following available resources:
a) 3 Multipliers: M1, M2 and M3 and b) 1 Added: A1.


## Q 2.3.3:

[4 marks]
Calculate latency and cycle time in case of fully pipelined data-path when:
a) Each multiplier performs operation in 2 c.c. and
b) Adder performs operation in 1 c.c.

Please provide the timing diagram in the box below


Latency = $\qquad$ c.c. Cycle time = $\qquad$ c.c.

## Q 2.3.4:

[1 marks]
Calculate total processing time for this segment of task to be executed on one above VHC with clock frequency -100 MHz .

Total processing time $=$ $\qquad$ uS (micro-seconds)

Show calculations here

## Q 2.3.5:

Calculate speedup of task segment processing on the above VHC vs. non-pipelined CISC CPU (refer to Q2.1.3 on page 6) and vs. pipeline RISC CPU (refer to Q2.2.3 on page 7). Note: Speedup is the ratio between processing time on one platform vs. another

VHC vs. CISC Speedup = $\qquad$
Show calculations here

VHC vs. RISC Speedup = $\qquad$
Show calculations here

## EE 8217 *Reconfigurable Computing Systems Engineering* Paqe

## Q 2.3.6:

[6 marks]
Draw the block diagram of the pipelined VHC data-path in the box below. Please use the following symbols for the operational blocks:


Block diagram of data-processing pipeline of VHC " $Y=a^{2}+a * b+b^{2}$ " according to sequencing graph presented in Q 2.3.3


