BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) Question Paper 2078 Nepal

Q: Where can I find the BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) question paper 2078?

The full BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) 2078 (Regular (annual)) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.

Q: Does the Computer Architecture (PU, CMP 262) 2078 paper come with solutions?

Yes. Every question on this Computer Architecture (PU, CMP 262) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.

Q: How many marks is the BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) 2078 paper?

The BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) 2078 paper carries 100 full marks and is meant to be completed in 180 minutes, across 12 questions.

Q: Is practising this Computer Architecture (PU, CMP 262) past paper free?

Yes — reading and attempting this Computer Architecture (PU, CMP 262) past paper on Kekkei is completely free.

Question

1Long answer14 marks

(a) Differentiate between RISC and CISC instruction set architectures with respect to instruction format, addressing modes, and hardware complexity. (7)

(b) A computer has a 16-bit instruction word. The processor supports 64 distinct operations and uses a register file of 16 general-purpose registers. For a two-address register-to-register instruction, design a suitable instruction format showing the bit allocation for the opcode and operand fields. Then, using the same machine, explain with examples how the following addressing modes are interpreted: immediate, register indirect, indexed, and PC-relative. (7)

instruction-set-architectureaddressing-modes

Answer 1

(a) RISC vs CISC

Feature	RISC	CISC
Instruction format	Fixed length (e.g. 32-bit), few formats, easy to decode	Variable length (1–15 bytes), many formats, complex decoding
Addressing modes	Few simple modes (register, immediate, base+offset). Only load/store access memory	Many complex modes (indexed, indirect, auto-increment, memory-to-memory)
Hardware complexity	Simple hardwired control unit, large register file, pipelining is easy	Microprogrammed control, smaller register set, control logic is complex
Instructions	Many simple, single-cycle instructions	Fewer but powerful, multi-cycle instructions
Examples	ARM, MIPS, RISC-V	x86, VAX, Motorola 68k

RISC pushes complexity to the compiler/software; CISC pushes it to the hardware/microcode.

(b) Instruction format design

Given: 16-bit word, 64 operations, 16 registers.

Opcode: $\log_2 64 = 6$ bits
Each register field: $\log_2 16 = 4$ bits; two operands need $2\times4 = 8$ bits
Total used $= 6 + 4 + 4 = 14$ bits, leaving 2 bits (mode/reserved).

| Opcode (6) | Rdest (4) | Rsrc (4) | Mode (2) |
   15..10       9..6       5..2       1..0

Example: ADD R3, R5 → opcode=ADD, Rdest=0011, Rsrc=0101.

Addressing modes (effective address EA):

Immediate: operand value is part of the instruction itself, MOV R1, #7 → operand = 7. No memory access for the operand.
Register indirect: the register holds the address of the operand. LOAD R1, (R2) → EA = contents of R2; operand = M[R2].
Indexed: EA = base address (in instruction) + contents of an index register. LOAD R1, A(R2) → EA = A + R2. Used for arrays.
PC-relative: EA = PC + signed offset in the instruction. BEQ +offset → target = PC + offset. Gives position-independent branches.

(On a 16-bit word, an immediate/offset must fit the spare field, or a second word is used.)

Answer 2

(a) Basic CPU organization

A basic CPU consists of four cooperating blocks connected by the system bus (address, data, control lines):

ALU – performs arithmetic and logic operations; takes operands from registers and writes the result back, setting status flags.
Register set – fast internal storage: general-purpose registers plus special registers (PC, IR, MAR, MBR/MDR, accumulator).
Control Unit (CU) – generates the timing and control signals that sequence all micro-operations (fetch, decode, execute).
System bus – carries addresses, data, and control signals between CPU, memory, and I/O.

        +-------------------- System Bus --------------------+
        |          |             |              |            |
     +--------+ +-------+    +--------+      +--------+   Memory / I/O
     |  MAR   | |  IR   |    |  ALU   |      | Reg.   |
     +--------+ +-------+    +--------+      | File   |
     |  PC    | |  MDR  |        | Control Unit |       |
     +--------+ +-------+        +------------+ +--------+

Role during the fetch cycle:

PC (Program Counter): holds the address of the next instruction to be fetched. After the address is used, PC is incremented.
MAR (Memory Address Register): receives the address from PC and drives it onto the address bus to select the memory location.
IR (Instruction Register): receives the fetched instruction word from memory (via MDR) and holds it so the CU can decode the opcode.

Fetch: $MAR \leftarrow PC$ ; $MDR \leftarrow M[MAR]$ , $PC \leftarrow PC+1$ ; $IR \leftarrow MDR$ .

(b) Hardwired vs Microprogrammed control

Aspect	Hardwired	Microprogrammed
Implementation	Fixed logic gates, flip-flops, decoders	Control signals stored as microinstructions in control memory
Speed	Faster	Slower (control-memory access per step)
Flexibility	Hard to modify	Easy to modify/extend (rewrite microcode)
Suited to	RISC	CISC, complex instruction sets

Horizontal vs Vertical microinstructions:

Horizontal: one bit (or small field) per control signal; many signals can be activated in parallel. → Wide microinstructions, large control memory, but high speed and parallelism.
Vertical: control signals are encoded into compact fields and decoded at run time; fewer simultaneous signals. → Narrow microinstructions, small control memory, but slower (extra decode step, more steps).

Trade-off: Horizontal gives faster execution and more parallelism at the cost of a larger control store; vertical saves control-memory size at the cost of decoding overhead and execution speed. Real machines often use a hybrid (nano/two-level) encoding to balance both.

Answer 3

(a) Instruction pipelining and hazards

Pipelining overlaps the execution of multiple instructions by dividing instruction processing into stages (e.g. IF–ID–EX–MEM–WB). While one instruction is decoded, the next is fetched, so a new instruction can ideally complete every clock cycle, increasing throughput (not the latency of a single instruction).

Pipeline hazards and one resolution each:

Structural hazard – two instructions need the same hardware resource in the same cycle (e.g. one memory port). Resolution: add resources / use separate instruction and data caches (Harvard memory) or duplicate the unit.
Data hazard – an instruction needs a result not yet written by a prior instruction (RAW dependency). Resolution: operand forwarding (bypassing); if unavoidable, insert stalls or reorder instructions.
Control hazard – a branch changes the instruction flow, making prefetched instructions invalid. Resolution: branch prediction (or delayed branch / flushing the pipeline).

(b) Speedup calculation

Given: non-pipelined time $T_{seq}=5$ ns/instr; 5 stages, stage time 1 ns, register delay 0.2 ns.

Pipeline clock period $= 1 + 0.2 = 1.2$ ns.

Time for $n=1000$ instructions in a $k=5$ stage pipeline:

T_{pipe} = (k + n - 1)\times T_{clk} = (5 + 1000 - 1)\times 1.2 = 1004 \times 1.2 = 1204.8\text{ ns}

Non-pipelined time:

T_{non} = n \times T_{seq} = 1000 \times 5 = 5000\text{ ns}

Speedup:

S = \frac{5000}{1204.8} \approx 4.15

Theoretical maximum speedup (large $n$ , ideal, no register overhead) $= k = 5$ . With the 0.2 ns register overhead the asymptotic limit is $5/1.2 \approx 4.17$ , which matches our result.

Answer 4

(a) Memory hierarchy and locality

The memory hierarchy arranges storage in levels from fast/small/expensive to slow/large/cheap:

Registers → L1/L2 Cache → Main Memory (RAM) → Secondary Storage (SSD/HDD)
  fastest                                              slowest

It improves average memory access time by exploiting the principle of locality of reference:

Temporal locality: recently accessed items are likely to be accessed again soon → keep them in cache.
Spatial locality: items near a referenced address are likely to be used soon → fetch a whole block/line.

Because most accesses hit in the fast upper levels, the effective access time approaches that of the fast memory while the capacity approaches that of the cheap memory — giving near-cache speed at near-disk capacity and low cost.

(b) Average Memory Access Time (AMAT)

Given: hit ratio $h = 0.95$ , cache time $T_c = 5$ ns, memory time $T_m = 80$ ns.

Simultaneous / look-aside (cache and memory started together; on a miss only the extra memory time counts, cache and memory accessed in parallel):

T = h\,T_c + (1-h)\,T_m = 0.95(5) + 0.05(80) = 4.75 + 4.0 = 8.75\text{ ns}

Hierarchical / look-through (memory is accessed only after a cache miss, so a miss pays $T_c + T_m$ ):

T = h\,T_c + (1-h)(T_c + T_m) = T_c + (1-h)T_m = 5 + 0.05(80) = 5 + 4 = 9.0\text{ ns}

(c) Cache mapping comparison

Technique	Hardware cost	Hit ratio	Search complexity
Direct-mapped	Lowest (1 comparator)	Lowest (conflict misses)	Simplest – index a single line
Fully associative	Highest (one comparator per line, CAM)	Highest (no conflict misses)	Most complex – search all lines
Set-associative	Moderate (n comparators)	Intermediate (good balance)	Moderate – search within a set

Set-associative is the practical compromise used in real CPUs.

Answer 5

I/O data-transfer techniques

Technique	How it works	CPU involvement	Speed
Programmed I/O	CPU repeatedly polls the device status flag and transfers each word itself	CPU fully busy (busy-wait), wastes cycles	Slow
Interrupt-driven I/O	Device interrupts the CPU when ready; CPU services it and transfers a word, then resumes other work	CPU free between transfers, but still moves every word	Moderate
DMA (Direct Memory Access)	A DMA controller transfers blocks directly between device and memory, stealing bus cycles; CPU is interrupted only once at completion	CPU free during the whole block transfer	Fast

When DMA is most beneficial: for high-speed, bulk/block data transfers (disk, network, audio/video) where moving data word-by-word through the CPU would create excessive interrupt/polling overhead. DMA frees the CPU to do useful computation while large blocks move, maximizing throughput.

Answer 6

Flynn's Classification

Flynn classifies architectures by the number of concurrent instruction streams and data streams.

Class	Instruction × Data	Description	Example
SISD	Single instr, Single data	One processor executes one instruction on one data item at a time (classic von Neumann)	Traditional single-core CPU (Intel 8085, basic uniprocessor)
SIMD	Single instr, Multiple data	One instruction operates on many data elements simultaneously (data parallelism)	GPUs, vector processors, x86 SSE/AVX, array processors
MISD	Multiple instr, Single data	Several instructions operate on the same data stream; rare	Fault-tolerant/redundant systems (e.g. space shuttle flight control)
MIMD	Multiple instr, Multiple data	Multiple processors execute different instructions on different data independently	Multicore processors, clusters, multiprocessor servers

MIMD is the most common in general-purpose parallel computers; SIMD dominates graphics and numerical workloads.

Answer 7

Amdahl's Law

Statement: The overall speedup from improving a fraction $f$ of a program by a factor $N$ is limited by the part that is not improved:

S = \frac{1}{(1-f) + \dfrac{f}{N}}

where $f$ = parallelizable fraction, $N$ = number of processors.

Calculation

Given $f = 0.70$ (parallel), $1-f = 0.30$ (sequential), $N = 8$ :

S = \frac{1}{0.30 + \dfrac{0.70}{8}} = \frac{1}{0.30 + 0.0875} = \frac{1}{0.3875} \approx 2.58

Overall speedup ≈ 2.58×.

Implication

Even though 70% of the work runs 8× faster, the 30% sequential portion caps the gain at only ~2.58×, far below 8×. As $N \to \infty$ , the maximum possible speedup is $\frac{1}{1-f} = \frac{1}{0.30} \approx 3.33$ . This shows that the sequential (non-parallelizable) fraction is the fundamental bottleneck: beyond a point, adding processors yields diminishing returns, so reducing the serial fraction matters more than adding cores.

Answer 8

Write-through vs Write-back

Write-through: every write updates both the cache and main memory immediately. Memory is always consistent with the cache.

Write-back (copy-back): a write updates only the cache line; main memory is updated later, when that line is evicted/replaced.

Role of the dirty bit

Each cache line in a write-back cache has a dirty bit. It is set to 1 when the line is modified in cache but not yet in memory. On replacement:

dirty bit = 1 → the line must be written back to memory first,
dirty bit = 0 → the line can simply be discarded (memory already current). This avoids needless memory writes for unmodified lines.

Comparison

Aspect	Write-through	Write-back
Memory traffic	High – every write goes to memory	Low – only dirty lines written, multiple writes coalesced
Data consistency	Always consistent; simple for multiprocessors/DMA	Memory can be stale; needs coherence handling
Complexity	Simple (no dirty bit)	Needs dirty bit and write-back logic
Speed	Slower writes	Faster writes

Write-through favors consistency and simplicity; write-back favors performance by reducing memory traffic.

Answer 9

IEEE 754 single precision of −85.375

Step 1 — Sign: number is negative → sign bit = 1.

Step 2 — Convert magnitude to binary:

$85_{10} = 1010101_2$
$0.375_{10} = 0.011_2$ (since $0.25 + 0.125$ )
So $85.375_{10} = 1010101.011_2$

Step 3 — Normalize ( $1.M \times 2^E$ ):

1010101.011_2 = 1.010101011_2 \times 2^{6}

Unbiased exponent $E = 6$ .

Step 4 — Biased exponent: bias = 127.

E_{biased} = 6 + 127 = 133 = 10000101_2

Step 5 — Mantissa (23 bits, drop leading 1):

010101011\,00000000000000_2

Final 32-bit representation

Field	Bits
Sign (1)	`1`
Exponent (8)	`10000101`
Mantissa (23)	`01010101100000000000000`

\boxed{1\;10000101\;01010101100000000000000}

In hexadecimal: 0xC2AAC000.

Answer 10

Instruction fetch micro-operations

The fetch cycle moves the next instruction from memory into the IR and advances the PC. Using register-transfer notation:

T0:  MAR ← PC
T1:  MDR ← M[MAR],   PC ← PC + 1
T2:  IR  ← MDR

Purpose of each micro-operation:

MAR ← PC – the address of the next instruction (held in the PC) is copied into the Memory Address Register, which drives the address bus to select the memory location.
MDR ← M[MAR] – memory is read; the instruction word at that address is loaded into the Memory Data Register (MDR/MBR). PC ← PC + 1 is done in the same step (in parallel) so the PC now points to the following instruction, ready for the next fetch.
IR ← MDR – the fetched instruction is transferred from the MDR into the Instruction Register, where the control unit will decode its opcode to begin the execute cycle.

After T2 the instruction is in the IR and the PC is updated, completing the fetch.

Answer 11

Bus arbitration

When several bus masters (CPU, DMA controllers, I/O processors) can drive the shared bus, bus arbitration decides which master gains control at a given time, preventing conflicts. The unit that decides is the bus arbiter, using request and grant control lines.

Centralized daisy-chaining

A single bus grant (BG) line is chained serially through all masters; one shared bus request (BR) line goes to the arbiter.
When BR is asserted, the arbiter sends BG to the first device; if it didn't request, it passes BG to the next, and so on.
Pros: very few control lines, easy to add devices. Cons: fixed priority by physical position (nearest device dominates → starvation of far devices); failure of one device can break the chain; slower propagation.

Independent request / grant scheme

Each master has its own request line (BR $_i$ ) and its own grant line (BG $_i$ ) to a central arbiter.
The arbiter sees all requests at once and grants the bus to the highest-priority requester directly.
Pros: fast, flexible/programmable priority, no chain dependency, fair. Cons: many control lines and more arbiter hardware (cost grows with number of masters).

Summary: daisy-chaining = simple, few lines, fixed positional priority; independent request/grant = more lines/hardware but faster and flexible priority.

Answer 12

(Answer any TWO; all three are given.)

(a) Cache coherence problem

In a multiprocessor system each CPU has its own cache. When several caches hold copies of the same memory block and one processor writes to it, the other caches hold stale copies — this inconsistency is the cache coherence problem. It is solved by coherence protocols that enforce a single consistent view: snooping protocols (e.g. MESI — Modified, Exclusive, Shared, Invalid — where writes invalidate or update other copies via a shared bus) and directory-based protocols for larger systems.

(b) Crossbar switch interconnection network

A crossbar connects $N$ processors to $M$ memory modules through a grid of $N\times M$ switch points (crosspoints). Any processor can be connected to any free memory module simultaneously and non-blockingly, giving the highest bandwidth and lowest latency. Its drawback is cost/complexity: hardware grows as $O(N\times M)$ , so it scales poorly to large systems.

(c) Superscalar processor architecture

A superscalar processor issues and executes more than one instruction per clock cycle by having multiple parallel execution units (e.g. several ALUs, load/store, FP units). It uses hardware to detect independent instructions, supports out-of-order execution and dynamic scheduling, and exploits instruction-level parallelism (ILP), achieving an ideal CPI < 1 (IPC > 1). Examples: Intel Pentium and later x86 cores, modern ARM cores.

Level	BE Computer Engineering (Pokhara University)
Subject	Computer Architecture (PU, CMP 262)
Year	2078 BS
Exam session	Regular (annual)
Full marks	100
Time allowed	180 minutes
Questions	12, all with step-by-step solutions