Browse papers
A

Section A: Long Answer Questions

Attempt all / any as specified.

4 questions
1long14 marks

(a) Differentiate between RISC and CISC instruction set architectures with respect to instruction format, addressing modes, and hardware complexity. (7)

(b) A processor supports the following addressing modes: immediate, direct, indirect, register, register-indirect, and indexed. For an instruction LOAD R1, X where the effective address is to be computed, explain with diagrams how the operand is fetched in register-indirect and indexed addressing modes, and state one practical use of each. (7)

(a) RISC vs CISC

AspectRISCCISC
Instruction formatFixed length (e.g. 32-bit), uniform encoding, few formatsVariable length, many formats, complex encoding
Addressing modesFew, simple (register, immediate, register-indirect); memory accessed only via LOAD/STOREMany, complex (direct, indirect, indexed, base-displacement, etc.); operands can be in memory
Hardware complexitySimple, hardwired control unit; large register file; relies on compiler optimization and pipeliningComplex, microprogrammed control unit; instructions may take many cycles
Instructions/clock cyclesMostly single-cycle, register-to-registerMulti-cycle, memory-to-memory operations
ExampleARM, MIPS, RISC-Vx86, VAX

Summary: RISC trades a larger number of simple instructions and compiler effort for fast, pipelinable, hardwired execution, whereas CISC packs more work per instruction at the cost of slow, microprogrammed, irregular hardware.

(b) Operand fetch in Register-Indirect and Indexed modes

Let R1 be the destination and X the operand field.

Register-Indirect Addressing

The address field names a register that holds the effective address (EA) of the operand; the register is not the operand itself.

EA=(Register)EA = (\text{Register})

Fetch sequence (described as a diagram):

Instruction: LOAD R1, (Rp)
   Rp ---------> [ contents = EA ] ---------> Memory[EA] ---------> R1
  1. Read pointer register Rp from the register file -> gives EA.
  2. Place EA on the memory address bus.
  3. Memory returns Memory[EA], which is loaded into R1.

Practical use: Pointer-based access and traversal of linked lists / arrays where the address is computed at run time (e.g., dereferencing a C pointer).

Indexed Addressing

The effective address is formed by adding the contents of an index register to a constant base/offset X carried in the instruction.

EA=X+(Index Register)EA = X + (\text{Index Register})

Fetch sequence (described as a diagram):

Instruction: LOAD R1, X(Ri)
   X (offset) ----+
                  (+) ----> EA ----> Memory[EA] ----> R1
   Ri (index) ----+
  1. Read index register Ri.
  2. Add the displacement X (in the ALU/address adder) to form EA.
  3. Fetch Memory[EA] and load it into R1.

Practical use: Sequential access to array/table elements by incrementing the index register inside a loop, where X is the array base and Ri is the element index.

instruction-set-architectureaddressing-modes
2long14 marks

(a) Distinguish between hardwired control and microprogrammed control units, listing two advantages and two disadvantages of each. (6)

(b) Design a microprogrammed control unit and explain the role of the control memory, control address register (CAR), control data register, and sequencing logic. Show how the next microinstruction address is selected during conditional branching. (8)

(a) Hardwired vs Microprogrammed Control

Hardwired control generates control signals directly from a fixed combinational/sequential logic circuit (gates, decoders, flip-flops, counters) driven by the instruction opcode and timing signals.

  • Advantages: (1) Very fast, since signals come from dedicated logic. (2) Efficient for small instruction sets (RISC).
  • Disadvantages: (1) Difficult to design, test, and modify; any change in the instruction set means redesigning the circuit. (2) Becomes very complex for large instruction sets.

Microprogrammed control stores a sequence of microinstructions (each a set of control bits) in a control memory; control signals are produced by reading these microinstructions.

  • Advantages: (1) Flexible and easy to modify/extend — just rewrite the microcode. (2) Systematic, simpler to design and debug for complex (CISC) instruction sets.
  • Disadvantages: (1) Slower, because each step requires a control-memory read. (2) Requires extra control memory and is generally more expensive.

(b) Microprogrammed Control Unit Design

Block diagram (in words): The opcode of the instruction maps (via a mapping logic) to a starting address in control memory. The Control Address Register (CAR) holds the address of the next microinstruction to be read. Its output addresses the control memory; the word read out is latched in the Control Data Register (CDR). The CDR is split into (i) a control field whose bits become the micro-operation control signals sent to the datapath, and (ii) an address/branch field used by the sequencing logic to decide the next CAR value.

Roles:

  • Control memory (ROM/RAM): stores all microinstructions (the microprogram) for every machine instruction.
  • Control Address Register (CAR): holds the address of the current/next microinstruction; analogous to a program counter for microcode.
  • Control Data Register (CDR / microinstruction buffer): holds the microinstruction just fetched so its control bits drive the datapath and its branch bits drive the sequencer.
  • Sequencing logic: computes the next CAR value — increment (CAR+1), branch to an address, map from opcode, or return.

Next-address selection during conditional branching: The microinstruction carries a branch-address field and a condition-select field. The selected status flag (e.g., Zero, Carry, Sign) is tested:

CAR{Branch Addressif selected condition is TRUECAR+1if condition is FALSECAR \leftarrow \begin{cases} \text{Branch Address} & \text{if selected condition is TRUE} \\ CAR + 1 & \text{if condition is FALSE} \end{cases}

A multiplexer chooses between the incremented address and the branch address; its select line is driven by the AND of the condition-select bits and the corresponding status flags. Thus the sequencing logic implements conditional microprogram branching.

control-unit-designcpu-organization
3long12 marks

(a) Explain the principle of locality of reference and how it justifies the use of a memory hierarchy. (4)

(b) Consider a system with a cache access time of 5 ns and a main memory access time of 70 ns. If the hit ratio is 0.92, calculate the average memory access time. Then determine the new hit ratio required to reduce the average memory access time below 8 ns. (8)

(a) Locality of Reference and Memory Hierarchy

Locality of reference is the observed tendency of programs to access a small, predictable portion of their address space over short time intervals. Two forms:

  • Temporal locality: a memory location accessed now is likely to be accessed again soon (e.g., loop variables, repeated instructions).
  • Spatial locality: locations near a recently accessed address are likely to be accessed soon (e.g., sequential instruction fetch, array traversal).

Justification of hierarchy: Because at any time only a small working set is active, that set can be kept in small, fast, expensive memory (registers, cache) while the bulk stays in large, slow, cheap memory (main memory, disk). This gives an average speed close to the fast level and an average cost/capacity close to the slow level — the best of both worlds.

(b) Average Memory Access Time (AMAT)

AMAT=Htcache+(1H)(tcache+tmem)AMAT = H \cdot t_{cache} + (1-H)\,(t_{cache} + t_{mem})

Using the (common) model where a miss costs cache time plus memory time:

Given tcache=5t_{cache}=5 ns, tmem=70t_{mem}=70 ns, H=0.92H=0.92:

AMAT=5+(10.92)(70)=5+0.08×70=5+5.6=10.6 nsAMAT = 5 + (1-0.92)(70) = 5 + 0.08\times 70 = 5 + 5.6 = 10.6\ \text{ns}

Required hit ratio for AMAT < 8 ns:

8>5+(1H)(70)8 > 5 + (1-H)(70) 3>(1H)701H<370=0.042863 > (1-H)\,70 \Rightarrow 1-H < \frac{3}{70} = 0.04286 H>0.9571H > 0.9571

Result: A hit ratio greater than approximately 0.957 (95.7%) is required to bring AMAT below 8 ns.

Note: If instead the miss penalty is taken as just the memory access (AMAT = H·t_cache + (1−H)·t_mem), then AMAT = 0.92·5 + 0.08·70 = 4.6 + 5.6 = 10.2 ns, and the required hit ratio becomes H > (70−8)/(70−5) = 62/65 ≈ 0.9538. Either consistent model is acceptable.

memory-hierarchycache
4long12 marks

(a) Explain the different types of pipeline hazards (structural, data, and control) with one example of each. (6)

(b) A non-pipelined processor takes 12 ns to execute an instruction. The same datapath is divided into a 5-stage pipeline with stage delays of 3 ns, 2 ns, 4 ns, 2 ns, and 3 ns, plus a 0.5 ns latch delay per stage. Compute the clock period, the speedup for executing 1000 instructions, and the maximum theoretical speedup. (6)

(a) Pipeline Hazards

1. Structural hazard — two instructions need the same hardware resource in the same cycle. Example: With a single unified memory port, an instruction fetch (IF) and a memory access (MEM) of two different instructions clash in the same cycle. (Solved by separate instruction/data caches.)

2. Data hazard — an instruction depends on the result of a previous instruction still in the pipeline. Example (RAW):

ADD R1, R2, R3   ; produces R1
SUB R4, R1, R5   ; needs R1 before ADD has written it back

(Solved by forwarding/bypassing or stalling.)

3. Control hazard — caused by branch/jump instructions, since the next instruction address is unknown until the branch resolves. Example:

BEQ R1, R2, LABEL  ; pipeline does not know whether to fetch the next
                   ; sequential instruction or the one at LABEL

(Solved by branch prediction, delay slots, or flushing.)

(b) Pipeline Performance Calculation

Clock period: limited by the slowest stage plus latch delay.

Slowest stage = 4 ns; latch delay = 0.5 ns.

Tclk=4+0.5=4.5 nsT_{clk} = 4 + 0.5 = 4.5\ \text{ns}

Non-pipelined time per instruction: Tnp=12T_{np} = 12 ns.

Pipelined execution time for n=1000n=1000 instructions (5 stages, k=5k=5):

Tpipe=(k+n1)Tclk=(5+10001)×4.5=1004×4.5=4518 nsT_{pipe} = (k + n - 1)\,T_{clk} = (5 + 1000 - 1)\times 4.5 = 1004 \times 4.5 = 4518\ \text{ns}

Non-pipelined time for 1000 instructions:

Tseq=1000×12=12000 nsT_{seq} = 1000 \times 12 = 12000\ \text{ns}

Speedup for 1000 instructions:

S=1200045182.656S = \frac{12000}{4518} \approx 2.656

Maximum theoretical speedup (as nn \to \infty):

Smax=TnpTclk=124.52.667S_{max} = \frac{T_{np}}{T_{clk}} = \frac{12}{4.5} \approx 2.667

Results: Clock period = 4.5 ns, speedup for 1000 instructions ≈ 2.66, maximum theoretical speedup ≈ 2.67.

pipeliningperformance-metrics
B

Section B: Short Answer Questions

Attempt all / any as specified.

8 questions
5short7 marks

Describe the instruction cycle of a CPU and explain the sequence of micro-operations performed during the fetch phase using register transfer language (RTL).

Instruction Cycle

The instruction cycle is the basic operational sequence the CPU repeats for each instruction. Its main phases are:

  1. Fetch — read the instruction from memory at the address in the Program Counter (PC).
  2. Decode — interpret the opcode and identify operands/addressing mode.
  3. Operand fetch / Effective-address calculation — compute the operand address (if indirect) and read operands.
  4. Execute — perform the operation in the ALU.
  5. Store / Write-back — write the result to a register or memory; service interrupts if pending.

Fetch Phase in RTL

Let PC = program counter, AR = address register, DR = data register, IR = instruction register, M = memory. The fetch micro-operations across three clock pulses (T0,T1,T2T_0, T_1, T_2):

T0:  AR <- PC                  ; transfer PC to address register
T1:  DR <- M[AR],  PC <- PC + 1 ; read instruction word; increment PC
T2:  IR <- DR                  ; load instruction into instruction register

A combined form sometimes written as:

T0: AR <- PC
T1: IR <- M[AR], PC <- PC + 1

After T2T_2 the opcode in IR is decoded and the CPU proceeds to the execute phase. The PC has already been incremented to point at the next instruction.

cpu-organizationinstruction-cycle
6short7 marks

A direct-mapped cache has 64 blocks with a block size of 16 bytes. The main memory is 4 KB. Determine the number of bits in the TAG, LINE (block), and WORD fields of the memory address. Show how the address 0x2A7 is split across these fields.

Direct-Mapped Cache Address Breakdown

Given: 64 blocks (lines), block size = 16 bytes, main memory = 4 KB.

Total address bits: 4 KB=4096=2124\text{ KB} = 4096 = 2^{12} bytes \Rightarrow 12-bit physical address.

WORD (byte offset) field: block size =16=24= 16 = 2^4 bytes \Rightarrow 4 bits.

LINE (block/index) field: number of cache lines =64=26= 64 = 2^6 \Rightarrow 6 bits.

TAG field: remaining bits =1264== 12 - 6 - 4 = 2 bits.

FieldTAGLINEWORD
Bits264

Splitting address 0x2A7

0x2A7=0010101001112(12 bits)0x2A7 = 0010\,1010\,0111_2 \quad (12\text{ bits})

Grouping from MSB as TAG(2) | LINE(6) | WORD(4):

  TAG    LINE      WORD
  00 | 101010 |  0111
  • TAG = 00 -> 00
  • LINE = 101010 -> 4242 (i.e. 0x2A0x2A)
  • WORD = 0111 -> 77

So address 0x2A7 maps to cache line 42, byte offset 7, with tag 0.

cachecache-mapping
7short7 marks

Compare programmed I/O, interrupt-driven I/O, and Direct Memory Access (DMA) as techniques for data transfer between the CPU and I/O devices, highlighting their relative advantages in terms of CPU involvement and throughput.

Comparison of I/O Data-Transfer Techniques

FeatureProgrammed I/OInterrupt-Driven I/ODMA
CPU involvementHighest — CPU continuously polls the device status (busy-wait) and transfers each wordModerate — CPU is free until the device raises an interrupt, then handles the transferLowest — DMA controller transfers blocks directly; CPU only sets up and is notified at the end
CPU efficiencyVery poor; CPU time wasted pollingBetter; CPU does other work between interruptsBest; CPU does useful work during the whole transfer
Data pathThrough CPU registersThrough CPU registersDirectly between device and memory (CPU bus released)
ThroughputLowMediumHigh (block transfers)
OverheadNone per device but heavy CPU costInterrupt overhead per word/transferSetup overhead, but minimal per word
Best suited forSimple, slow devicesDevices with moderate, unpredictable data ratesHigh-speed bulk transfers (disk, network)

Summary: Moving from programmed I/O to interrupt-driven I/O to DMA progressively reduces CPU involvement and increases throughput. Programmed I/O wastes CPU cycles polling; interrupt-driven I/O frees the CPU between events but still routes each datum through the CPU; DMA offloads the entire block transfer to a dedicated controller, achieving the highest throughput with the least CPU burden — ideal for high-speed devices like disks.

io-organizationinterrupts
8short7 marks

State Flynn's classification of computer architectures and briefly describe each category (SISD, SIMD, MISD, MIMD) with a representative example of each.

Flynn's Classification

Flynn (1966) classifies computer architectures by the multiplicity of instruction streams and data streams:

ClassInstruction streamsData streamsDescriptionExample
SISDSingleSingleOne instruction operates on one data item at a time; the classic sequential (von Neumann) uniprocessorTraditional single-core CPU (e.g., classic Intel 8086)
SIMDSingleMultipleOne instruction is applied simultaneously to many data elements; data-level parallelismVector/array processors, GPUs, SSE/AVX units
MISDMultipleSingleSeveral instructions operate on the same data stream; rare in practiceFault-tolerant pipelined/systolic systems (e.g., redundant flight-control computers)
MIMDMultipleMultipleMultiple processors execute different instructions on different data independentlyMulticore CPUs, multiprocessors, clusters

Notes: SISD is the conventional serial machine; SIMD exploits data parallelism efficiently for regular computations; MISD is largely theoretical with only niche uses; MIMD is the most general and is the basis of modern parallel and distributed systems.

parallel-processingflynn-taxonomy
9short6 marks

A program spends 40% of its execution time on a section that is enhanced to run 5 times faster. Using Amdahl's law, calculate the overall speedup of the program. What is the maximum achievable speedup if the enhancement were made infinitely fast?

Amdahl's Law

Amdahl's law gives the overall speedup when a fraction ff of execution time is enhanced by a factor ss:

Soverall=1(1f)+fsS_{overall} = \frac{1}{(1-f) + \dfrac{f}{s}}

Given: enhanced fraction f=0.40f = 0.40, speedup of that part s=5s = 5.

S=1(10.40)+0.405=10.60+0.08=10.681.47S = \frac{1}{(1-0.40) + \dfrac{0.40}{5}} = \frac{1}{0.60 + 0.08} = \frac{1}{0.68} \approx 1.47

Overall speedup ≈ 1.47×.

Maximum achievable speedup (enhancement infinitely fast, ss \to \infty):

Smax=11f=10.601.67S_{max} = \frac{1}{1-f} = \frac{1}{0.60} \approx 1.67

Result: The realistic speedup is about 1.47×, and even with an infinitely fast enhancement the speedup is bounded by 1.67×, because 60% of the execution time (the non-enhanced part) remains unchanged.

performance-metricsamdahls-law
10short6 marks

Explain the daisy-chaining method of bus arbitration with a suitable diagram. State one advantage and one limitation of this scheme compared to a centralized parallel arbitration scheme.

Daisy-Chaining Bus Arbitration

In daisy-chained arbitration, all bus masters share a single bus-grant (BG) line that is wired serially through the devices, ordered by physical priority.

Diagram (in words):

            +----------+      +----------+      +----------+
  BG  ----->| Device 1 |--BG->| Device 2 |--BG->| Device 3 |
            +----------+      +----------+      +----------+
     ^            |                |                |
     |            +-------+--------+----------------+
  Bus Arbiter <-------- Bus Request (BR, common/wired-OR) ----
  (Bus Busy line shared by all)

Operation:

  1. Any device needing the bus asserts the common Bus Request (BR) line.
  2. The arbiter responds by asserting Bus Grant (BG).
  3. BG propagates serially from device 1 onward. The first requesting device in the chain that receives BG takes the bus and blocks BG from passing further; non-requesting devices simply pass BG along.
  4. The device asserts Bus Busy, uses the bus, and releases it when done.

Thus priority is fixed by position in the chain (devices nearer the arbiter have higher priority).

Advantage (vs centralized parallel arbitration): Very simple — needs only a few control lines (BR, BG, Busy) regardless of the number of devices, so it is cheap and easy to expand.

Limitation: (i) Fixed, position-based priority can starve low-priority (far) devices, and (ii) the serial BG propagation is slow and a failure in one device breaks the chain for those downstream. Centralized parallel arbitration avoids this by giving each device its own request/grant lines and a programmable priority encoder, at the cost of more hardware.

io-organizationbus-arbitration
11short6 marks

Explain the concept of virtual memory and the role of paging. Describe how a Translation Lookaside Buffer (TLB) improves the performance of address translation.

Virtual Memory and Paging

Virtual memory is a memory-management technique that gives each process the illusion of a large, contiguous address space (the virtual address space) that can exceed the physical main memory. Only the actively used portions reside in RAM; the rest is kept on secondary storage (disk) and brought in on demand. It provides large address spaces, isolation/protection between processes, and relocation.

Role of paging: The virtual and physical address spaces are divided into equal-sized blocks — pages (virtual) and frames (physical). A page table maps each virtual page number (VPN) to a physical frame number (PFN). A virtual address is split into (VPN, offset); translation replaces the VPN with its PFN while the offset is unchanged:

Physical Address=PageTable[VPN]offset\text{Physical Address} = \text{PageTable}[VPN] \,\|\, \text{offset}

If the page is not present in RAM, a page fault occurs and the OS loads it from disk. Paging eliminates external fragmentation and allows non-contiguous allocation.

Role of the TLB

Without help, each memory reference would need an extra memory access just to read the page table — doubling effective access time (or more, with multi-level tables).

The Translation Lookaside Buffer (TLB) is a small, fast, fully/set-associative cache that stores the most recently used VPN -> PFN translations.

  • On a TLB hit, the PFN is obtained in ~1 cycle, so translation adds almost no overhead.
  • On a TLB miss, the page table is walked (in memory or by hardware), and the entry is loaded into the TLB.

Because of locality of reference, TLB hit rates are typically very high (often >95–99%), so most translations are fast. This keeps the effective address-translation time close to a single fast cache access, dramatically improving overall performance.

memory-hierarchyvirtual-memory
12short6 marks

Differentiate between a multicore processor and a multiprocessor system. Briefly explain cache coherence and why it becomes a problem in shared-memory multiprocessors.

Multicore Processor vs Multiprocessor System

AspectMulticore ProcessorMultiprocessor System
DefinitionTwo or more independent processing cores integrated on a single chipTwo or more separate physical CPUs (chips) in one system
IntegrationOn-die; cores share on-chip resources (e.g., L2/L3 cache, memory controller)Separate packages connected via a system bus/interconnect
CommunicationFast, low-latency on-chip interconnectSlower, through external bus/interconnect
Cost/PowerLower cost and power for given parallelismHigher cost, board complexity, and power
ExampleQuad-core CPU in a laptopA server board with multiple CPU sockets (SMP)

A multicore chip can itself be a building block of a multiprocessor system (multiple multicore chips).

Cache Coherence

Cache coherence requires that all caches in a shared-memory system present a consistent view of any shared memory location — i.e., a read must return the most recently written value to that location, regardless of which processor wrote it.

Why it becomes a problem in shared-memory multiprocessors: Each processor/core keeps its own private cache. When several caches hold copies of the same memory block and one processor writes to its copy, the other cached copies become stale (out of date). A subsequent read by another processor from its own cache would return the old value, violating correctness.

This is resolved by cache-coherence protocols (e.g., snooping protocols like MESI, or directory-based protocols) that invalidate or update other copies whenever a shared block is modified, ensuring all cores eventually see a single, consistent value.

parallel-processingmulticore

Frequently asked questions

Where can I find the BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) question paper 2079?
The full BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) 2079 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
Does the Computer Architecture (PU, CMP 262) 2079 paper come with solutions?
Yes. Every question on this Computer Architecture (PU, CMP 262) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
How many marks is the BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) 2079 paper?
The BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) 2079 paper carries 100 full marks and is meant to be completed in 180 minutes, across 12 questions.
Is practising this Computer Architecture (PU, CMP 262) past paper free?
Yes — reading and attempting this Computer Architecture (PU, CMP 262) past paper on Kekkei is completely free.