BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) Question Paper 2079 Nepal

Q: Where can I find the BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) question paper 2079?

The full BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) 2079 (Regular (annual)) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.

Q: Does the Computer Architecture (PU, CMP 262) 2079 paper come with solutions?

Yes. Every question on this Computer Architecture (PU, CMP 262) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.

Q: How many marks is the BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) 2079 paper?

The BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) 2079 paper carries 100 full marks and is meant to be completed in 180 minutes, across 12 questions.

Q: Is practising this Computer Architecture (PU, CMP 262) past paper free?

Yes — reading and attempting this Computer Architecture (PU, CMP 262) past paper on Kekkei is completely free.

Question

1Long answer14 marks

(a) Differentiate between RISC and CISC instruction set architectures with respect to instruction format, addressing modes, and hardware complexity. (7)

(b) A processor supports the following addressing modes: immediate, direct, indirect, register, register-indirect, and indexed. For an instruction LOAD R1, X where the effective address is to be computed, explain with diagrams how the operand is fetched in register-indirect and indexed addressing modes, and state one practical use of each. (7)

instruction-set-architectureaddressing-modes

Answer 1

(a) RISC vs CISC

Aspect	RISC	CISC
Instruction format	Fixed length (e.g. 32-bit), uniform encoding, few formats	Variable length, many formats, complex encoding
Addressing modes	Few, simple (register, immediate, register-indirect); memory accessed only via LOAD/STORE	Many, complex (direct, indirect, indexed, base-displacement, etc.); operands can be in memory
Hardware complexity	Simple, hardwired control unit; large register file; relies on compiler optimization and pipelining	Complex, microprogrammed control unit; instructions may take many cycles
Instructions/clock cycles	Mostly single-cycle, register-to-register	Multi-cycle, memory-to-memory operations
Example	ARM, MIPS, RISC-V	x86, VAX

Summary: RISC trades a larger number of simple instructions and compiler effort for fast, pipelinable, hardwired execution, whereas CISC packs more work per instruction at the cost of slow, microprogrammed, irregular hardware.

(b) Operand fetch in Register-Indirect and Indexed modes

Let R1 be the destination and X the operand field.

Register-Indirect Addressing

The address field names a register that holds the effective address (EA) of the operand; the register is not the operand itself.

EA = (\text{Register})

Fetch sequence (described as a diagram):

Instruction: LOAD R1, (Rp)
   Rp ---------> [ contents = EA ] ---------> Memory[EA] ---------> R1

Read pointer register Rp from the register file -> gives EA.
Place EA on the memory address bus.
Memory returns Memory[EA], which is loaded into R1.

Practical use: Pointer-based access and traversal of linked lists / arrays where the address is computed at run time (e.g., dereferencing a C pointer).

Indexed Addressing

The effective address is formed by adding the contents of an index register to a constant base/offset X carried in the instruction.

EA = X + (\text{Index Register})

Fetch sequence (described as a diagram):

Instruction: LOAD R1, X(Ri)
   X (offset) ----+
                  (+) ----> EA ----> Memory[EA] ----> R1
   Ri (index) ----+

Read index register Ri.
Add the displacement X (in the ALU/address adder) to form EA.
Fetch Memory[EA] and load it into R1.

Practical use: Sequential access to array/table elements by incrementing the index register inside a loop, where X is the array base and Ri is the element index.

Answer 2

(a) Hardwired vs Microprogrammed Control

Hardwired control generates control signals directly from a fixed combinational/sequential logic circuit (gates, decoders, flip-flops, counters) driven by the instruction opcode and timing signals.

Advantages: (1) Very fast, since signals come from dedicated logic. (2) Efficient for small instruction sets (RISC).
Disadvantages: (1) Difficult to design, test, and modify; any change in the instruction set means redesigning the circuit. (2) Becomes very complex for large instruction sets.

Microprogrammed control stores a sequence of microinstructions (each a set of control bits) in a control memory; control signals are produced by reading these microinstructions.

Advantages: (1) Flexible and easy to modify/extend — just rewrite the microcode. (2) Systematic, simpler to design and debug for complex (CISC) instruction sets.
Disadvantages: (1) Slower, because each step requires a control-memory read. (2) Requires extra control memory and is generally more expensive.

(b) Microprogrammed Control Unit Design

Block diagram (in words): The opcode of the instruction maps (via a mapping logic) to a starting address in control memory. The Control Address Register (CAR) holds the address of the next microinstruction to be read. Its output addresses the control memory; the word read out is latched in the Control Data Register (CDR). The CDR is split into (i) a control field whose bits become the micro-operation control signals sent to the datapath, and (ii) an address/branch field used by the sequencing logic to decide the next CAR value.

Roles:

Control memory (ROM/RAM): stores all microinstructions (the microprogram) for every machine instruction.
Control Address Register (CAR): holds the address of the current/next microinstruction; analogous to a program counter for microcode.
Control Data Register (CDR / microinstruction buffer): holds the microinstruction just fetched so its control bits drive the datapath and its branch bits drive the sequencer.
Sequencing logic: computes the next CAR value — increment (CAR+1), branch to an address, map from opcode, or return.

Next-address selection during conditional branching: The microinstruction carries a branch-address field and a condition-select field. The selected status flag (e.g., Zero, Carry, Sign) is tested:

CAR \leftarrow \begin{cases} \text{Branch Address} & \text{if selected condition is TRUE} \\ CAR + 1 & \text{if condition is FALSE} \end{cases}

A multiplexer chooses between the incremented address and the branch address; its select line is driven by the AND of the condition-select bits and the corresponding status flags. Thus the sequencing logic implements conditional microprogram branching.

Answer 3

(a) Locality of Reference and Memory Hierarchy

Locality of reference is the observed tendency of programs to access a small, predictable portion of their address space over short time intervals. Two forms:

Temporal locality: a memory location accessed now is likely to be accessed again soon (e.g., loop variables, repeated instructions).
Spatial locality: locations near a recently accessed address are likely to be accessed soon (e.g., sequential instruction fetch, array traversal).

Justification of hierarchy: Because at any time only a small working set is active, that set can be kept in small, fast, expensive memory (registers, cache) while the bulk stays in large, slow, cheap memory (main memory, disk). This gives an average speed close to the fast level and an average cost/capacity close to the slow level — the best of both worlds.

(b) Average Memory Access Time (AMAT)

AMAT = H \cdot t_{cache} + (1-H)\,(t_{cache} + t_{mem})

Using the (common) model where a miss costs cache time plus memory time:

Given $t_{cache}=5$ ns, $t_{mem}=70$ ns, $H=0.92$ :

AMAT = 5 + (1-0.92)(70) = 5 + 0.08\times 70 = 5 + 5.6 = 10.6\ \text{ns}

Required hit ratio for AMAT < 8 ns:

8 > 5 + (1-H)(70)

3 > (1-H)\,70 \Rightarrow 1-H < \frac{3}{70} = 0.04286

H > 0.9571

Result: A hit ratio greater than approximately 0.957 (95.7%) is required to bring AMAT below 8 ns.

Note: If instead the miss penalty is taken as just the memory access (AMAT = H·t_cache + (1−H)·t_mem), then AMAT = 0.92·5 + 0.08·70 = 4.6 + 5.6 = 10.2 ns, and the required hit ratio becomes H > (70−8)/(70−5) = 62/65 ≈ 0.9538. Either consistent model is acceptable.

Answer 4

(a) Pipeline Hazards

1. Structural hazard — two instructions need the same hardware resource in the same cycle. Example: With a single unified memory port, an instruction fetch (IF) and a memory access (MEM) of two different instructions clash in the same cycle. (Solved by separate instruction/data caches.)

2. Data hazard — an instruction depends on the result of a previous instruction still in the pipeline. Example (RAW):

ADD R1, R2, R3   ; produces R1
SUB R4, R1, R5   ; needs R1 before ADD has written it back

(Solved by forwarding/bypassing or stalling.)

3. Control hazard — caused by branch/jump instructions, since the next instruction address is unknown until the branch resolves. Example:

BEQ R1, R2, LABEL  ; pipeline does not know whether to fetch the next
                   ; sequential instruction or the one at LABEL

(Solved by branch prediction, delay slots, or flushing.)

(b) Pipeline Performance Calculation

Clock period: limited by the slowest stage plus latch delay.

Slowest stage = 4 ns; latch delay = 0.5 ns.

T_{clk} = 4 + 0.5 = 4.5\ \text{ns}

Non-pipelined time per instruction: $T_{np} = 12$ ns.

Pipelined execution time for $n=1000$ instructions (5 stages, $k=5$ ):

T_{pipe} = (k + n - 1)\,T_{clk} = (5 + 1000 - 1)\times 4.5 = 1004 \times 4.5 = 4518\ \text{ns}

Non-pipelined time for 1000 instructions:

T_{seq} = 1000 \times 12 = 12000\ \text{ns}

Speedup for 1000 instructions:

S = \frac{12000}{4518} \approx 2.656

Maximum theoretical speedup (as $n \to \infty$ ):

S_{max} = \frac{T_{np}}{T_{clk}} = \frac{12}{4.5} \approx 2.667

Results: Clock period = 4.5 ns, speedup for 1000 instructions ≈ 2.66, maximum theoretical speedup ≈ 2.67.

Answer 5

Instruction Cycle

The instruction cycle is the basic operational sequence the CPU repeats for each instruction. Its main phases are:

Fetch — read the instruction from memory at the address in the Program Counter (PC).
Decode — interpret the opcode and identify operands/addressing mode.
Operand fetch / Effective-address calculation — compute the operand address (if indirect) and read operands.
Execute — perform the operation in the ALU.
Store / Write-back — write the result to a register or memory; service interrupts if pending.

Fetch Phase in RTL

Let PC = program counter, AR = address register, DR = data register, IR = instruction register, M = memory. The fetch micro-operations across three clock pulses ( $T_0, T_1, T_2$ ):

T0:  AR <- PC                  ; transfer PC to address register
T1:  DR <- M[AR],  PC <- PC + 1 ; read instruction word; increment PC
T2:  IR <- DR                  ; load instruction into instruction register

A combined form sometimes written as:

T0: AR <- PC
T1: IR <- M[AR], PC <- PC + 1

After $T_2$ the opcode in IR is decoded and the CPU proceeds to the execute phase. The PC has already been incremented to point at the next instruction.

Answer 6

Direct-Mapped Cache Address Breakdown

Given: 64 blocks (lines), block size = 16 bytes, main memory = 4 KB.

Total address bits: $4\text{ KB} = 4096 = 2^{12}$ bytes $\Rightarrow$ 12-bit physical address.

WORD (byte offset) field: block size $= 16 = 2^4$ bytes $\Rightarrow$ 4 bits.

LINE (block/index) field: number of cache lines $= 64 = 2^6 \Rightarrow$ 6 bits.

TAG field: remaining bits $= 12 - 6 - 4 =$ 2 bits.

Field	TAG	LINE	WORD
Bits	2	6	4

Splitting address `0x2A7`

0x2A7 = 0010\,1010\,0111_2 \quad (12\text{ bits})

Grouping from MSB as TAG(2) | LINE(6) | WORD(4):

  TAG    LINE      WORD
  00 | 101010 |  0111

TAG = 00 -> $0$
LINE = 101010 -> $42$ (i.e. $0x2A$ )
WORD = 0111 -> $7$

So address 0x2A7 maps to cache line 42, byte offset 7, with tag 0.

Answer 7

Comparison of I/O Data-Transfer Techniques

Feature	Programmed I/O	Interrupt-Driven I/O	DMA
CPU involvement	Highest — CPU continuously polls the device status (busy-wait) and transfers each word	Moderate — CPU is free until the device raises an interrupt, then handles the transfer	Lowest — DMA controller transfers blocks directly; CPU only sets up and is notified at the end
CPU efficiency	Very poor; CPU time wasted polling	Better; CPU does other work between interrupts	Best; CPU does useful work during the whole transfer
Data path	Through CPU registers	Through CPU registers	Directly between device and memory (CPU bus released)
Throughput	Low	Medium	High (block transfers)
Overhead	None per device but heavy CPU cost	Interrupt overhead per word/transfer	Setup overhead, but minimal per word
Best suited for	Simple, slow devices	Devices with moderate, unpredictable data rates	High-speed bulk transfers (disk, network)

Summary: Moving from programmed I/O to interrupt-driven I/O to DMA progressively reduces CPU involvement and increases throughput. Programmed I/O wastes CPU cycles polling; interrupt-driven I/O frees the CPU between events but still routes each datum through the CPU; DMA offloads the entire block transfer to a dedicated controller, achieving the highest throughput with the least CPU burden — ideal for high-speed devices like disks.

Answer 8

Flynn's Classification

Flynn (1966) classifies computer architectures by the multiplicity of instruction streams and data streams:

Class	Instruction streams	Data streams	Description	Example
SISD	Single	Single	One instruction operates on one data item at a time; the classic sequential (von Neumann) uniprocessor	Traditional single-core CPU (e.g., classic Intel 8086)
SIMD	Single	Multiple	One instruction is applied simultaneously to many data elements; data-level parallelism	Vector/array processors, GPUs, SSE/AVX units
MISD	Multiple	Single	Several instructions operate on the same data stream; rare in practice	Fault-tolerant pipelined/systolic systems (e.g., redundant flight-control computers)
MIMD	Multiple	Multiple	Multiple processors execute different instructions on different data independently	Multicore CPUs, multiprocessors, clusters

Notes: SISD is the conventional serial machine; SIMD exploits data parallelism efficiently for regular computations; MISD is largely theoretical with only niche uses; MIMD is the most general and is the basis of modern parallel and distributed systems.

Answer 9

Amdahl's Law

Amdahl's law gives the overall speedup when a fraction $f$ of execution time is enhanced by a factor $s$ :

S_{overall} = \frac{1}{(1-f) + \dfrac{f}{s}}

Given: enhanced fraction $f = 0.40$ , speedup of that part $s = 5$ .

S = \frac{1}{(1-0.40) + \dfrac{0.40}{5}} = \frac{1}{0.60 + 0.08} = \frac{1}{0.68} \approx 1.47

Overall speedup ≈ 1.47×.

Maximum achievable speedup (enhancement infinitely fast, $s \to \infty$ ):

S_{max} = \frac{1}{1-f} = \frac{1}{0.60} \approx 1.67

Result: The realistic speedup is about 1.47×, and even with an infinitely fast enhancement the speedup is bounded by 1.67×, because 60% of the execution time (the non-enhanced part) remains unchanged.

Answer 10

Daisy-Chaining Bus Arbitration

In daisy-chained arbitration, all bus masters share a single bus-grant (BG) line that is wired serially through the devices, ordered by physical priority.

Diagram (in words):

            +----------+      +----------+      +----------+
  BG  ----->| Device 1 |--BG->| Device 2 |--BG->| Device 3 |
            +----------+      +----------+      +----------+
     ^            |                |                |
     |            +-------+--------+----------------+
  Bus Arbiter <-------- Bus Request (BR, common/wired-OR) ----
  (Bus Busy line shared by all)

Operation:

Any device needing the bus asserts the common Bus Request (BR) line.
The arbiter responds by asserting Bus Grant (BG).
BG propagates serially from device 1 onward. The first requesting device in the chain that receives BG takes the bus and blocks BG from passing further; non-requesting devices simply pass BG along.
The device asserts Bus Busy, uses the bus, and releases it when done.

Thus priority is fixed by position in the chain (devices nearer the arbiter have higher priority).

Advantage (vs centralized parallel arbitration): Very simple — needs only a few control lines (BR, BG, Busy) regardless of the number of devices, so it is cheap and easy to expand.

Limitation: (i) Fixed, position-based priority can starve low-priority (far) devices, and (ii) the serial BG propagation is slow and a failure in one device breaks the chain for those downstream. Centralized parallel arbitration avoids this by giving each device its own request/grant lines and a programmable priority encoder, at the cost of more hardware.

Answer 11

Virtual Memory and Paging

Virtual memory is a memory-management technique that gives each process the illusion of a large, contiguous address space (the virtual address space) that can exceed the physical main memory. Only the actively used portions reside in RAM; the rest is kept on secondary storage (disk) and brought in on demand. It provides large address spaces, isolation/protection between processes, and relocation.

Role of paging: The virtual and physical address spaces are divided into equal-sized blocks — pages (virtual) and frames (physical). A page table maps each virtual page number (VPN) to a physical frame number (PFN). A virtual address is split into (VPN, offset); translation replaces the VPN with its PFN while the offset is unchanged:

\text{Physical Address} = \text{PageTable}[VPN] \,\|\, \text{offset}

If the page is not present in RAM, a page fault occurs and the OS loads it from disk. Paging eliminates external fragmentation and allows non-contiguous allocation.

Role of the TLB

Without help, each memory reference would need an extra memory access just to read the page table — doubling effective access time (or more, with multi-level tables).

The Translation Lookaside Buffer (TLB) is a small, fast, fully/set-associative cache that stores the most recently used VPN -> PFN translations.

On a TLB hit, the PFN is obtained in ~1 cycle, so translation adds almost no overhead.
On a TLB miss, the page table is walked (in memory or by hardware), and the entry is loaded into the TLB.

Because of locality of reference, TLB hit rates are typically very high (often >95–99%), so most translations are fast. This keeps the effective address-translation time close to a single fast cache access, dramatically improving overall performance.

Answer 12

Multicore Processor vs Multiprocessor System

Aspect	Multicore Processor	Multiprocessor System
Definition	Two or more independent processing cores integrated on a single chip	Two or more separate physical CPUs (chips) in one system
Integration	On-die; cores share on-chip resources (e.g., L2/L3 cache, memory controller)	Separate packages connected via a system bus/interconnect
Communication	Fast, low-latency on-chip interconnect	Slower, through external bus/interconnect
Cost/Power	Lower cost and power for given parallelism	Higher cost, board complexity, and power
Example	Quad-core CPU in a laptop	A server board with multiple CPU sockets (SMP)

A multicore chip can itself be a building block of a multiprocessor system (multiple multicore chips).

Cache Coherence

Cache coherence requires that all caches in a shared-memory system present a consistent view of any shared memory location — i.e., a read must return the most recently written value to that location, regardless of which processor wrote it.

Why it becomes a problem in shared-memory multiprocessors: Each processor/core keeps its own private cache. When several caches hold copies of the same memory block and one processor writes to its copy, the other cached copies become stale (out of date). A subsequent read by another processor from its own cache would return the old value, violating correctness.

This is resolved by cache-coherence protocols (e.g., snooping protocols like MESI, or directory-based protocols) that invalidate or update other copies whenever a shared block is modified, ensuring all cores eventually see a single, consistent value.

Level	BE Computer Engineering (Pokhara University)
Subject	Computer Architecture (PU, CMP 262)
Year	2079 BS
Exam session	Regular (annual)
Full marks	100
Time allowed	180 minutes
Questions	12, all with step-by-step solutions

BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) Question Paper 2079 Nepal

Section A: Long Answer Questions

(a) RISC vs CISC

(b) Operand fetch in Register-Indirect and Indexed modes

Register-Indirect Addressing

Indexed Addressing

(a) Hardwired vs Microprogrammed Control

(b) Microprogrammed Control Unit Design

(a) Locality of Reference and Memory Hierarchy

(b) Average Memory Access Time (AMAT)

(a) Pipeline Hazards

(b) Pipeline Performance Calculation

Section B: Short Answer Questions

Instruction Cycle

Fetch Phase in RTL

Direct-Mapped Cache Address Breakdown

Splitting address `0x2A7`

Comparison of I/O Data-Transfer Techniques

Flynn's Classification

Amdahl's Law

Daisy-Chaining Bus Arbitration

Virtual Memory and Paging

Role of the TLB

Multicore Processor vs Multiprocessor System

Cache Coherence

Frequently asked questions

Section A: Long Answer Questions

(a) RISC vs CISC

(b) Operand fetch in Register-Indirect and Indexed modes

Register-Indirect Addressing

Indexed Addressing

(a) Hardwired vs Microprogrammed Control

(b) Microprogrammed Control Unit Design

(a) Locality of Reference and Memory Hierarchy

(b) Average Memory Access Time (AMAT)

(a) Pipeline Hazards

(b) Pipeline Performance Calculation

Section B: Short Answer Questions

Instruction Cycle

Fetch Phase in RTL

Direct-Mapped Cache Address Breakdown

Splitting address 0x2A7

Comparison of I/O Data-Transfer Techniques

Flynn's Classification

Amdahl's Law

Daisy-Chaining Bus Arbitration

Virtual Memory and Paging

Role of the TLB

Multicore Processor vs Multiprocessor System

Cache Coherence

Frequently asked questions

Splitting address `0x2A7`