Browse papers
A

Section A: Long Answer Questions

Attempt all / any as specified.

4 questions
1long14 marks

(a) Differentiate between RISC and CISC instruction set architectures with respect to instruction format, addressing modes, and hardware complexity. (7)

(b) A computer has a 16-bit instruction word. The processor supports 64 distinct operations and uses a register file of 16 general-purpose registers. For a two-address register-to-register instruction, design a suitable instruction format showing the bit allocation for the opcode and operand fields. Then, using the same machine, explain with examples how the following addressing modes are interpreted: immediate, register indirect, indexed, and PC-relative. (7)

(a) RISC vs CISC

FeatureRISCCISC
Instruction formatFixed length (e.g. 32-bit), few formats, easy to decodeVariable length (1–15 bytes), many formats, complex decoding
Addressing modesFew simple modes (register, immediate, base+offset). Only load/store access memoryMany complex modes (indexed, indirect, auto-increment, memory-to-memory)
Hardware complexitySimple hardwired control unit, large register file, pipelining is easyMicroprogrammed control, smaller register set, control logic is complex
InstructionsMany simple, single-cycle instructionsFewer but powerful, multi-cycle instructions
ExamplesARM, MIPS, RISC-Vx86, VAX, Motorola 68k

RISC pushes complexity to the compiler/software; CISC pushes it to the hardware/microcode.

(b) Instruction format design

Given: 16-bit word, 64 operations, 16 registers.

  • Opcode: log264=6\log_2 64 = 6 bits
  • Each register field: log216=4\log_2 16 = 4 bits; two operands need 2×4=82\times4 = 8 bits
  • Total used =6+4+4=14= 6 + 4 + 4 = 14 bits, leaving 2 bits (mode/reserved).
| Opcode (6) | Rdest (4) | Rsrc (4) | Mode (2) |
   15..10       9..6       5..2       1..0

Example: ADD R3, R5 → opcode=ADD, Rdest=0011, Rsrc=0101.

Addressing modes (effective address EA):

  • Immediate: operand value is part of the instruction itself, MOV R1, #7 → operand = 7. No memory access for the operand.
  • Register indirect: the register holds the address of the operand. LOAD R1, (R2) → EA = contents of R2; operand = M[R2].
  • Indexed: EA = base address (in instruction) + contents of an index register. LOAD R1, A(R2) → EA = A + R2. Used for arrays.
  • PC-relative: EA = PC + signed offset in the instruction. BEQ +offset → target = PC + offset. Gives position-independent branches.

(On a 16-bit word, an immediate/offset must fit the spare field, or a second word is used.)

instruction-set-architectureaddressing-modes
2long14 marks

(a) Draw and explain the organization of a basic CPU showing the ALU, register set, control unit, and the system bus. Describe the role of the program counter (PC), instruction register (IR), and memory address register (MAR) during the instruction fetch cycle. (7)

(b) Compare hardwired control units with microprogrammed control units. For a microprogrammed control unit, explain the difference between horizontal and vertical microinstruction formats, and discuss the trade-off between control memory size and execution speed. (7)

(a) Basic CPU organization

A basic CPU consists of four cooperating blocks connected by the system bus (address, data, control lines):

  • ALU – performs arithmetic and logic operations; takes operands from registers and writes the result back, setting status flags.
  • Register set – fast internal storage: general-purpose registers plus special registers (PC, IR, MAR, MBR/MDR, accumulator).
  • Control Unit (CU) – generates the timing and control signals that sequence all micro-operations (fetch, decode, execute).
  • System bus – carries addresses, data, and control signals between CPU, memory, and I/O.
        +-------------------- System Bus --------------------+
        |          |             |              |            |
     +--------+ +-------+    +--------+      +--------+   Memory / I/O
     |  MAR   | |  IR   |    |  ALU   |      | Reg.   |
     +--------+ +-------+    +--------+      | File   |
     |  PC    | |  MDR  |        | Control Unit |       |
     +--------+ +-------+        +------------+ +--------+

Role during the fetch cycle:

  • PC (Program Counter): holds the address of the next instruction to be fetched. After the address is used, PC is incremented.
  • MAR (Memory Address Register): receives the address from PC and drives it onto the address bus to select the memory location.
  • IR (Instruction Register): receives the fetched instruction word from memory (via MDR) and holds it so the CU can decode the opcode.

Fetch: MARPCMAR \leftarrow PC; MDRM[MAR]MDR \leftarrow M[MAR], PCPC+1PC \leftarrow PC+1; IRMDRIR \leftarrow MDR.

(b) Hardwired vs Microprogrammed control

AspectHardwiredMicroprogrammed
ImplementationFixed logic gates, flip-flops, decodersControl signals stored as microinstructions in control memory
SpeedFasterSlower (control-memory access per step)
FlexibilityHard to modifyEasy to modify/extend (rewrite microcode)
Suited toRISCCISC, complex instruction sets

Horizontal vs Vertical microinstructions:

  • Horizontal: one bit (or small field) per control signal; many signals can be activated in parallel. → Wide microinstructions, large control memory, but high speed and parallelism.
  • Vertical: control signals are encoded into compact fields and decoded at run time; fewer simultaneous signals. → Narrow microinstructions, small control memory, but slower (extra decode step, more steps).

Trade-off: Horizontal gives faster execution and more parallelism at the cost of a larger control store; vertical saves control-memory size at the cost of decoding overhead and execution speed. Real machines often use a hybrid (nano/two-level) encoding to balance both.

cpu-organizationcontrol-unit-design
3long12 marks

(a) Explain the concept of instruction pipelining. Describe the different types of pipeline hazards (structural, data, and control) and outline one technique to resolve each. (6)

(b) A non-pipelined processor takes 5 ns to execute each instruction. The same processor is redesigned into a 5-stage pipeline where each stage takes 1 ns and the pipeline register delay between stages is 0.2 ns. Calculate the speedup achieved by the pipeline for executing 1000 instructions, assuming no stalls. Also state the theoretical maximum speedup. (6)

(a) Instruction pipelining and hazards

Pipelining overlaps the execution of multiple instructions by dividing instruction processing into stages (e.g. IF–ID–EX–MEM–WB). While one instruction is decoded, the next is fetched, so a new instruction can ideally complete every clock cycle, increasing throughput (not the latency of a single instruction).

Pipeline hazards and one resolution each:

  1. Structural hazard – two instructions need the same hardware resource in the same cycle (e.g. one memory port). Resolution: add resources / use separate instruction and data caches (Harvard memory) or duplicate the unit.
  2. Data hazard – an instruction needs a result not yet written by a prior instruction (RAW dependency). Resolution: operand forwarding (bypassing); if unavoidable, insert stalls or reorder instructions.
  3. Control hazard – a branch changes the instruction flow, making prefetched instructions invalid. Resolution: branch prediction (or delayed branch / flushing the pipeline).

(b) Speedup calculation

Given: non-pipelined time Tseq=5T_{seq}=5 ns/instr; 5 stages, stage time 1 ns, register delay 0.2 ns.

Pipeline clock period =1+0.2=1.2= 1 + 0.2 = 1.2 ns.

Time for n=1000n=1000 instructions in a k=5k=5 stage pipeline:

Tpipe=(k+n1)×Tclk=(5+10001)×1.2=1004×1.2=1204.8 nsT_{pipe} = (k + n - 1)\times T_{clk} = (5 + 1000 - 1)\times 1.2 = 1004 \times 1.2 = 1204.8\text{ ns}

Non-pipelined time:

Tnon=n×Tseq=1000×5=5000 nsT_{non} = n \times T_{seq} = 1000 \times 5 = 5000\text{ ns}

Speedup:

S=50001204.84.15S = \frac{5000}{1204.8} \approx 4.15

Theoretical maximum speedup (large nn, ideal, no register overhead) =k=5= k = 5. With the 0.2 ns register overhead the asymptotic limit is 5/1.24.175/1.2 \approx 4.17, which matches our result.

pipeliningperformance-metrics
4long12 marks

(a) Explain the memory hierarchy of a computer system and justify why it improves average memory access time, referring to the principle of locality of reference. (5)

(b) A system has a cache with a hit ratio of 0.95, a cache access time of 5 ns, and a main memory access time of 80 ns. Compute the average memory access time for both a simultaneous (look-aside) access scheme and a hierarchical (look-through) access scheme. (4)

(c) Compare direct-mapped, fully associative, and set-associative cache mapping techniques in terms of hardware cost, hit ratio, and search complexity. (3)

(a) Memory hierarchy and locality

The memory hierarchy arranges storage in levels from fast/small/expensive to slow/large/cheap:

Registers → L1/L2 Cache → Main Memory (RAM) → Secondary Storage (SSD/HDD)
  fastest                                              slowest

It improves average memory access time by exploiting the principle of locality of reference:

  • Temporal locality: recently accessed items are likely to be accessed again soon → keep them in cache.
  • Spatial locality: items near a referenced address are likely to be used soon → fetch a whole block/line.

Because most accesses hit in the fast upper levels, the effective access time approaches that of the fast memory while the capacity approaches that of the cheap memory — giving near-cache speed at near-disk capacity and low cost.

(b) Average Memory Access Time (AMAT)

Given: hit ratio h=0.95h = 0.95, cache time Tc=5T_c = 5 ns, memory time Tm=80T_m = 80 ns.

Simultaneous / look-aside (cache and memory started together; on a miss only the extra memory time counts, cache and memory accessed in parallel):

T=hTc+(1h)Tm=0.95(5)+0.05(80)=4.75+4.0=8.75 nsT = h\,T_c + (1-h)\,T_m = 0.95(5) + 0.05(80) = 4.75 + 4.0 = 8.75\text{ ns}

Hierarchical / look-through (memory is accessed only after a cache miss, so a miss pays Tc+TmT_c + T_m):

T=hTc+(1h)(Tc+Tm)=Tc+(1h)Tm=5+0.05(80)=5+4=9.0 nsT = h\,T_c + (1-h)(T_c + T_m) = T_c + (1-h)T_m = 5 + 0.05(80) = 5 + 4 = 9.0\text{ ns}

(c) Cache mapping comparison

TechniqueHardware costHit ratioSearch complexity
Direct-mappedLowest (1 comparator)Lowest (conflict misses)Simplest – index a single line
Fully associativeHighest (one comparator per line, CAM)Highest (no conflict misses)Most complex – search all lines
Set-associativeModerate (n comparators)Intermediate (good balance)Moderate – search within a set

Set-associative is the practical compromise used in real CPUs.

memory-hierarchycache
B

Section B: Short Answer Questions

Attempt all / any as specified.

8 questions
5short7 marks

Compare programmed I/O, interrupt-driven I/O, and direct memory access (DMA) as techniques for data transfer between the CPU and I/O devices. Under what circumstances is DMA most beneficial?

I/O data-transfer techniques

TechniqueHow it worksCPU involvementSpeed
Programmed I/OCPU repeatedly polls the device status flag and transfers each word itselfCPU fully busy (busy-wait), wastes cyclesSlow
Interrupt-driven I/ODevice interrupts the CPU when ready; CPU services it and transfers a word, then resumes other workCPU free between transfers, but still moves every wordModerate
DMA (Direct Memory Access)A DMA controller transfers blocks directly between device and memory, stealing bus cycles; CPU is interrupted only once at completionCPU free during the whole block transferFast

When DMA is most beneficial: for high-speed, bulk/block data transfers (disk, network, audio/video) where moving data word-by-word through the CPU would create excessive interrupt/polling overhead. DMA frees the CPU to do useful computation while large blocks move, maximizing throughput.

i-o-organizationinterrupts
6short7 marks

State Flynn's classification of computer architectures. Briefly describe each of the four categories (SISD, SIMD, MISD, MIMD) and give one practical example of a system belonging to each category.

Flynn's Classification

Flynn classifies architectures by the number of concurrent instruction streams and data streams.

ClassInstruction × DataDescriptionExample
SISDSingle instr, Single dataOne processor executes one instruction on one data item at a time (classic von Neumann)Traditional single-core CPU (Intel 8085, basic uniprocessor)
SIMDSingle instr, Multiple dataOne instruction operates on many data elements simultaneously (data parallelism)GPUs, vector processors, x86 SSE/AVX, array processors
MISDMultiple instr, Single dataSeveral instructions operate on the same data stream; rareFault-tolerant/redundant systems (e.g. space shuttle flight control)
MIMDMultiple instr, Multiple dataMultiple processors execute different instructions on different data independentlyMulticore processors, clusters, multiprocessor servers

MIMD is the most common in general-purpose parallel computers; SIMD dominates graphics and numerical workloads.

parallel-processingflynn-taxonomy
7short7 marks

State Amdahl's law. A program spends 70% of its execution time in a section that can be parallelized across 8 processors, while the remaining 30% is strictly sequential. Calculate the overall speedup and explain what the result implies about the limits of parallel processing.

Amdahl's Law

Statement: The overall speedup from improving a fraction ff of a program by a factor NN is limited by the part that is not improved:

S=1(1f)+fNS = \frac{1}{(1-f) + \dfrac{f}{N}}

where ff = parallelizable fraction, NN = number of processors.

Calculation

Given f=0.70f = 0.70 (parallel), 1f=0.301-f = 0.30 (sequential), N=8N = 8:

S=10.30+0.708=10.30+0.0875=10.38752.58S = \frac{1}{0.30 + \dfrac{0.70}{8}} = \frac{1}{0.30 + 0.0875} = \frac{1}{0.3875} \approx 2.58

Overall speedup ≈ 2.58×.

Implication

Even though 70% of the work runs 8× faster, the 30% sequential portion caps the gain at only ~2.58×, far below 8×. As NN \to \infty, the maximum possible speedup is 11f=10.303.33\frac{1}{1-f} = \frac{1}{0.30} \approx 3.33. This shows that the sequential (non-parallelizable) fraction is the fundamental bottleneck: beyond a point, adding processors yields diminishing returns, so reducing the serial fraction matters more than adding cores.

performance-metricsamdahls-law
8short6 marks

Explain the difference between the write-through and write-back cache write policies. Discuss the role of the dirty bit and compare the two policies in terms of memory traffic and data consistency.

Write-through vs Write-back

Write-through: every write updates both the cache and main memory immediately. Memory is always consistent with the cache.

Write-back (copy-back): a write updates only the cache line; main memory is updated later, when that line is evicted/replaced.

Role of the dirty bit

Each cache line in a write-back cache has a dirty bit. It is set to 1 when the line is modified in cache but not yet in memory. On replacement:

  • dirty bit = 1 → the line must be written back to memory first,
  • dirty bit = 0 → the line can simply be discarded (memory already current). This avoids needless memory writes for unmodified lines.

Comparison

AspectWrite-throughWrite-back
Memory trafficHigh – every write goes to memoryLow – only dirty lines written, multiple writes coalesced
Data consistencyAlways consistent; simple for multiprocessors/DMAMemory can be stale; needs coherence handling
ComplexitySimple (no dirty bit)Needs dirty bit and write-back logic
SpeedSlower writesFaster writes

Write-through favors consistency and simplicity; write-back favors performance by reducing memory traffic.

cachecache-write-policy
9short6 marks

Represent the decimal number -85.375 in IEEE 754 single-precision (32-bit) floating-point format. Show the sign, exponent (with bias), and mantissa fields clearly.

IEEE 754 single precision of −85.375

Step 1 — Sign: number is negative → sign bit = 1.

Step 2 — Convert magnitude to binary:

  • 8510=1010101285_{10} = 1010101_2
  • 0.37510=0.01120.375_{10} = 0.011_2 (since 0.25+0.1250.25 + 0.125)
  • So 85.37510=1010101.011285.375_{10} = 1010101.011_2

Step 3 — Normalize (1.M×2E1.M \times 2^E):

1010101.0112=1.0101010112×261010101.011_2 = 1.010101011_2 \times 2^{6}

Unbiased exponent E=6E = 6.

Step 4 — Biased exponent: bias = 127.

Ebiased=6+127=133=100001012E_{biased} = 6 + 127 = 133 = 10000101_2

Step 5 — Mantissa (23 bits, drop leading 1):

010101011000000000000002010101011\,00000000000000_2

Final 32-bit representation

FieldBits
Sign (1)1
Exponent (8)10000101
Mantissa (23)01010101100000000000000
1  10000101  01010101100000000000000\boxed{1\;10000101\;01010101100000000000000}

In hexadecimal: 0xC2AAC000.

instruction-set-architecturenumber-representation
10short6 marks

Write the sequence of register-transfer micro-operations required to fetch an instruction from memory and increment the program counter. Explain the purpose of each micro-operation in the fetch cycle.

Instruction fetch micro-operations

The fetch cycle moves the next instruction from memory into the IR and advances the PC. Using register-transfer notation:

T0:  MAR ← PC
T1:  MDR ← M[MAR],   PC ← PC + 1
T2:  IR  ← MDR

Purpose of each micro-operation:

  • MAR ← PC – the address of the next instruction (held in the PC) is copied into the Memory Address Register, which drives the address bus to select the memory location.
  • MDR ← M[MAR] – memory is read; the instruction word at that address is loaded into the Memory Data Register (MDR/MBR). PC ← PC + 1 is done in the same step (in parallel) so the PC now points to the following instruction, ready for the next fetch.
  • IR ← MDR – the fetched instruction is transferred from the MDR into the Instruction Register, where the control unit will decode its opcode to begin the execute cycle.

After T2 the instruction is in the IR and the PC is updated, completing the fetch.

control-unit-designmicro-operations
11short5 marks

Explain bus arbitration in a system with multiple bus masters. Differentiate between centralized daisy-chaining and independent request/grant schemes.

Bus arbitration

When several bus masters (CPU, DMA controllers, I/O processors) can drive the shared bus, bus arbitration decides which master gains control at a given time, preventing conflicts. The unit that decides is the bus arbiter, using request and grant control lines.

Centralized daisy-chaining

  • A single bus grant (BG) line is chained serially through all masters; one shared bus request (BR) line goes to the arbiter.
  • When BR is asserted, the arbiter sends BG to the first device; if it didn't request, it passes BG to the next, and so on.
  • Pros: very few control lines, easy to add devices. Cons: fixed priority by physical position (nearest device dominates → starvation of far devices); failure of one device can break the chain; slower propagation.

Independent request / grant scheme

  • Each master has its own request line (BRi_i) and its own grant line (BGi_i) to a central arbiter.
  • The arbiter sees all requests at once and grants the bus to the highest-priority requester directly.
  • Pros: fast, flexible/programmable priority, no chain dependency, fair. Cons: many control lines and more arbiter hardware (cost grows with number of masters).

Summary: daisy-chaining = simple, few lines, fixed positional priority; independent request/grant = more lines/hardware but faster and flexible priority.

i-o-organizationbus-arbitration
12short5 marks

Write short notes on any TWO of the following:

(a) Cache coherence problem in multiprocessor systems

(b) Crossbar switch interconnection network

(c) Superscalar processor architecture

(Answer any TWO; all three are given.)

(a) Cache coherence problem

In a multiprocessor system each CPU has its own cache. When several caches hold copies of the same memory block and one processor writes to it, the other caches hold stale copies — this inconsistency is the cache coherence problem. It is solved by coherence protocols that enforce a single consistent view: snooping protocols (e.g. MESI — Modified, Exclusive, Shared, Invalid — where writes invalidate or update other copies via a shared bus) and directory-based protocols for larger systems.

(b) Crossbar switch interconnection network

A crossbar connects NN processors to MM memory modules through a grid of N×MN\times M switch points (crosspoints). Any processor can be connected to any free memory module simultaneously and non-blockingly, giving the highest bandwidth and lowest latency. Its drawback is cost/complexity: hardware grows as O(N×M)O(N\times M), so it scales poorly to large systems.

(c) Superscalar processor architecture

A superscalar processor issues and executes more than one instruction per clock cycle by having multiple parallel execution units (e.g. several ALUs, load/store, FP units). It uses hardware to detect independent instructions, supports out-of-order execution and dynamic scheduling, and exploits instruction-level parallelism (ILP), achieving an ideal CPI < 1 (IPC > 1). Examples: Intel Pentium and later x86 cores, modern ARM cores.

parallel-processinginterconnection-networks

Frequently asked questions

Where can I find the BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) question paper 2078?
The full BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) 2078 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
Does the Computer Architecture (PU, CMP 262) 2078 paper come with solutions?
Yes. Every question on this Computer Architecture (PU, CMP 262) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
How many marks is the BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) 2078 paper?
The BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) 2078 paper carries 100 full marks and is meant to be completed in 180 minutes, across 12 questions.
Is practising this Computer Architecture (PU, CMP 262) past paper free?
Yes — reading and attempting this Computer Architecture (PU, CMP 262) past paper on Kekkei is completely free.