BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) Question Paper 2079 Nepal
This is the official BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) question paper for 2079, as set in the regular annual examination. It carries 100 full marks and a time allowance of 180 minutes, across 12 questions. On Kekkei you can attempt this Computer Architecture (PU, CMP 262) past paper online with a timer, get instant AI feedback and step-by-step solutions, and track the topics where you lose marks — completely free. Whether you are revising for your BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) exam or solving previous years' question papers, this 2079 paper is a great way to practise under real exam conditions.
Section A: Long Answer Questions
Attempt all / any as specified.
(a) Differentiate between RISC and CISC instruction set architectures with respect to instruction format, addressing modes, and hardware complexity. (7)
(b) A processor supports the following addressing modes: immediate, direct, indirect, register, register-indirect, and indexed. For an instruction LOAD R1, X where the effective address is to be computed, explain with diagrams how the operand is fetched in register-indirect and indexed addressing modes, and state one practical use of each. (7)
(a) RISC vs CISC
| Aspect | RISC | CISC |
|---|---|---|
| Instruction format | Fixed length (e.g. 32-bit), uniform encoding, few formats | Variable length, many formats, complex encoding |
| Addressing modes | Few, simple (register, immediate, register-indirect); memory accessed only via LOAD/STORE | Many, complex (direct, indirect, indexed, base-displacement, etc.); operands can be in memory |
| Hardware complexity | Simple, hardwired control unit; large register file; relies on compiler optimization and pipelining | Complex, microprogrammed control unit; instructions may take many cycles |
| Instructions/clock cycles | Mostly single-cycle, register-to-register | Multi-cycle, memory-to-memory operations |
| Example | ARM, MIPS, RISC-V | x86, VAX |
Summary: RISC trades a larger number of simple instructions and compiler effort for fast, pipelinable, hardwired execution, whereas CISC packs more work per instruction at the cost of slow, microprogrammed, irregular hardware.
(b) Operand fetch in Register-Indirect and Indexed modes
Let R1 be the destination and X the operand field.
Register-Indirect Addressing
The address field names a register that holds the effective address (EA) of the operand; the register is not the operand itself.
Fetch sequence (described as a diagram):
Instruction: LOAD R1, (Rp)
Rp ---------> [ contents = EA ] ---------> Memory[EA] ---------> R1
- Read pointer register
Rpfrom the register file -> gives EA. - Place EA on the memory address bus.
- Memory returns
Memory[EA], which is loaded intoR1.
Practical use: Pointer-based access and traversal of linked lists / arrays where the address is computed at run time (e.g., dereferencing a C pointer).
Indexed Addressing
The effective address is formed by adding the contents of an index register to a constant base/offset X carried in the instruction.
Fetch sequence (described as a diagram):
Instruction: LOAD R1, X(Ri)
X (offset) ----+
(+) ----> EA ----> Memory[EA] ----> R1
Ri (index) ----+
- Read index register
Ri. - Add the displacement
X(in the ALU/address adder) to form EA. - Fetch
Memory[EA]and load it intoR1.
Practical use: Sequential access to array/table elements by incrementing the index register inside a loop, where X is the array base and Ri is the element index.
(a) Distinguish between hardwired control and microprogrammed control units, listing two advantages and two disadvantages of each. (6)
(b) Design a microprogrammed control unit and explain the role of the control memory, control address register (CAR), control data register, and sequencing logic. Show how the next microinstruction address is selected during conditional branching. (8)
(a) Hardwired vs Microprogrammed Control
Hardwired control generates control signals directly from a fixed combinational/sequential logic circuit (gates, decoders, flip-flops, counters) driven by the instruction opcode and timing signals.
- Advantages: (1) Very fast, since signals come from dedicated logic. (2) Efficient for small instruction sets (RISC).
- Disadvantages: (1) Difficult to design, test, and modify; any change in the instruction set means redesigning the circuit. (2) Becomes very complex for large instruction sets.
Microprogrammed control stores a sequence of microinstructions (each a set of control bits) in a control memory; control signals are produced by reading these microinstructions.
- Advantages: (1) Flexible and easy to modify/extend — just rewrite the microcode. (2) Systematic, simpler to design and debug for complex (CISC) instruction sets.
- Disadvantages: (1) Slower, because each step requires a control-memory read. (2) Requires extra control memory and is generally more expensive.
(b) Microprogrammed Control Unit Design
Block diagram (in words): The opcode of the instruction maps (via a mapping logic) to a starting address in control memory. The Control Address Register (CAR) holds the address of the next microinstruction to be read. Its output addresses the control memory; the word read out is latched in the Control Data Register (CDR). The CDR is split into (i) a control field whose bits become the micro-operation control signals sent to the datapath, and (ii) an address/branch field used by the sequencing logic to decide the next CAR value.
Roles:
- Control memory (ROM/RAM): stores all microinstructions (the microprogram) for every machine instruction.
- Control Address Register (CAR): holds the address of the current/next microinstruction; analogous to a program counter for microcode.
- Control Data Register (CDR / microinstruction buffer): holds the microinstruction just fetched so its control bits drive the datapath and its branch bits drive the sequencer.
- Sequencing logic: computes the next CAR value — increment (CAR+1), branch to an address, map from opcode, or return.
Next-address selection during conditional branching: The microinstruction carries a branch-address field and a condition-select field. The selected status flag (e.g., Zero, Carry, Sign) is tested:
A multiplexer chooses between the incremented address and the branch address; its select line is driven by the AND of the condition-select bits and the corresponding status flags. Thus the sequencing logic implements conditional microprogram branching.
(a) Explain the principle of locality of reference and how it justifies the use of a memory hierarchy. (4)
(b) Consider a system with a cache access time of 5 ns and a main memory access time of 70 ns. If the hit ratio is 0.92, calculate the average memory access time. Then determine the new hit ratio required to reduce the average memory access time below 8 ns. (8)
(a) Locality of Reference and Memory Hierarchy
Locality of reference is the observed tendency of programs to access a small, predictable portion of their address space over short time intervals. Two forms:
- Temporal locality: a memory location accessed now is likely to be accessed again soon (e.g., loop variables, repeated instructions).
- Spatial locality: locations near a recently accessed address are likely to be accessed soon (e.g., sequential instruction fetch, array traversal).
Justification of hierarchy: Because at any time only a small working set is active, that set can be kept in small, fast, expensive memory (registers, cache) while the bulk stays in large, slow, cheap memory (main memory, disk). This gives an average speed close to the fast level and an average cost/capacity close to the slow level — the best of both worlds.
(b) Average Memory Access Time (AMAT)
Using the (common) model where a miss costs cache time plus memory time:
Given ns, ns, :
Required hit ratio for AMAT < 8 ns:
Result: A hit ratio greater than approximately 0.957 (95.7%) is required to bring AMAT below 8 ns.
Note: If instead the miss penalty is taken as just the memory access (AMAT = H·t_cache + (1−H)·t_mem), then AMAT = 0.92·5 + 0.08·70 = 4.6 + 5.6 = 10.2 ns, and the required hit ratio becomes H > (70−8)/(70−5) = 62/65 ≈ 0.9538. Either consistent model is acceptable.
(a) Explain the different types of pipeline hazards (structural, data, and control) with one example of each. (6)
(b) A non-pipelined processor takes 12 ns to execute an instruction. The same datapath is divided into a 5-stage pipeline with stage delays of 3 ns, 2 ns, 4 ns, 2 ns, and 3 ns, plus a 0.5 ns latch delay per stage. Compute the clock period, the speedup for executing 1000 instructions, and the maximum theoretical speedup. (6)
(a) Pipeline Hazards
1. Structural hazard — two instructions need the same hardware resource in the same cycle. Example: With a single unified memory port, an instruction fetch (IF) and a memory access (MEM) of two different instructions clash in the same cycle. (Solved by separate instruction/data caches.)
2. Data hazard — an instruction depends on the result of a previous instruction still in the pipeline. Example (RAW):
ADD R1, R2, R3 ; produces R1
SUB R4, R1, R5 ; needs R1 before ADD has written it back
(Solved by forwarding/bypassing or stalling.)
3. Control hazard — caused by branch/jump instructions, since the next instruction address is unknown until the branch resolves. Example:
BEQ R1, R2, LABEL ; pipeline does not know whether to fetch the next
; sequential instruction or the one at LABEL
(Solved by branch prediction, delay slots, or flushing.)
(b) Pipeline Performance Calculation
Clock period: limited by the slowest stage plus latch delay.
Slowest stage = 4 ns; latch delay = 0.5 ns.
Non-pipelined time per instruction: ns.
Pipelined execution time for instructions (5 stages, ):
Non-pipelined time for 1000 instructions:
Speedup for 1000 instructions:
Maximum theoretical speedup (as ):
Results: Clock period = 4.5 ns, speedup for 1000 instructions ≈ 2.66, maximum theoretical speedup ≈ 2.67.
Section B: Short Answer Questions
Attempt all / any as specified.
Describe the instruction cycle of a CPU and explain the sequence of micro-operations performed during the fetch phase using register transfer language (RTL).
Instruction Cycle
The instruction cycle is the basic operational sequence the CPU repeats for each instruction. Its main phases are:
- Fetch — read the instruction from memory at the address in the Program Counter (PC).
- Decode — interpret the opcode and identify operands/addressing mode.
- Operand fetch / Effective-address calculation — compute the operand address (if indirect) and read operands.
- Execute — perform the operation in the ALU.
- Store / Write-back — write the result to a register or memory; service interrupts if pending.
Fetch Phase in RTL
Let PC = program counter, AR = address register, DR = data register, IR = instruction register, M = memory. The fetch micro-operations across three clock pulses ():
T0: AR <- PC ; transfer PC to address register
T1: DR <- M[AR], PC <- PC + 1 ; read instruction word; increment PC
T2: IR <- DR ; load instruction into instruction register
A combined form sometimes written as:
T0: AR <- PC
T1: IR <- M[AR], PC <- PC + 1
After the opcode in IR is decoded and the CPU proceeds to the execute phase. The PC has already been incremented to point at the next instruction.
A direct-mapped cache has 64 blocks with a block size of 16 bytes. The main memory is 4 KB. Determine the number of bits in the TAG, LINE (block), and WORD fields of the memory address. Show how the address 0x2A7 is split across these fields.
Direct-Mapped Cache Address Breakdown
Given: 64 blocks (lines), block size = 16 bytes, main memory = 4 KB.
Total address bits: bytes 12-bit physical address.
WORD (byte offset) field: block size bytes 4 bits.
LINE (block/index) field: number of cache lines 6 bits.
TAG field: remaining bits 2 bits.
| Field | TAG | LINE | WORD |
|---|---|---|---|
| Bits | 2 | 6 | 4 |
Splitting address 0x2A7
Grouping from MSB as TAG(2) | LINE(6) | WORD(4):
TAG LINE WORD
00 | 101010 | 0111
- TAG = 00 ->
- LINE = 101010 -> (i.e. )
- WORD = 0111 ->
So address 0x2A7 maps to cache line 42, byte offset 7, with tag 0.
Compare programmed I/O, interrupt-driven I/O, and Direct Memory Access (DMA) as techniques for data transfer between the CPU and I/O devices, highlighting their relative advantages in terms of CPU involvement and throughput.
Comparison of I/O Data-Transfer Techniques
| Feature | Programmed I/O | Interrupt-Driven I/O | DMA |
|---|---|---|---|
| CPU involvement | Highest — CPU continuously polls the device status (busy-wait) and transfers each word | Moderate — CPU is free until the device raises an interrupt, then handles the transfer | Lowest — DMA controller transfers blocks directly; CPU only sets up and is notified at the end |
| CPU efficiency | Very poor; CPU time wasted polling | Better; CPU does other work between interrupts | Best; CPU does useful work during the whole transfer |
| Data path | Through CPU registers | Through CPU registers | Directly between device and memory (CPU bus released) |
| Throughput | Low | Medium | High (block transfers) |
| Overhead | None per device but heavy CPU cost | Interrupt overhead per word/transfer | Setup overhead, but minimal per word |
| Best suited for | Simple, slow devices | Devices with moderate, unpredictable data rates | High-speed bulk transfers (disk, network) |
Summary: Moving from programmed I/O to interrupt-driven I/O to DMA progressively reduces CPU involvement and increases throughput. Programmed I/O wastes CPU cycles polling; interrupt-driven I/O frees the CPU between events but still routes each datum through the CPU; DMA offloads the entire block transfer to a dedicated controller, achieving the highest throughput with the least CPU burden — ideal for high-speed devices like disks.
State Flynn's classification of computer architectures and briefly describe each category (SISD, SIMD, MISD, MIMD) with a representative example of each.
Flynn's Classification
Flynn (1966) classifies computer architectures by the multiplicity of instruction streams and data streams:
| Class | Instruction streams | Data streams | Description | Example |
|---|---|---|---|---|
| SISD | Single | Single | One instruction operates on one data item at a time; the classic sequential (von Neumann) uniprocessor | Traditional single-core CPU (e.g., classic Intel 8086) |
| SIMD | Single | Multiple | One instruction is applied simultaneously to many data elements; data-level parallelism | Vector/array processors, GPUs, SSE/AVX units |
| MISD | Multiple | Single | Several instructions operate on the same data stream; rare in practice | Fault-tolerant pipelined/systolic systems (e.g., redundant flight-control computers) |
| MIMD | Multiple | Multiple | Multiple processors execute different instructions on different data independently | Multicore CPUs, multiprocessors, clusters |
Notes: SISD is the conventional serial machine; SIMD exploits data parallelism efficiently for regular computations; MISD is largely theoretical with only niche uses; MIMD is the most general and is the basis of modern parallel and distributed systems.
A program spends 40% of its execution time on a section that is enhanced to run 5 times faster. Using Amdahl's law, calculate the overall speedup of the program. What is the maximum achievable speedup if the enhancement were made infinitely fast?
Amdahl's Law
Amdahl's law gives the overall speedup when a fraction of execution time is enhanced by a factor :
Given: enhanced fraction , speedup of that part .
Overall speedup ≈ 1.47×.
Maximum achievable speedup (enhancement infinitely fast, ):
Result: The realistic speedup is about 1.47×, and even with an infinitely fast enhancement the speedup is bounded by 1.67×, because 60% of the execution time (the non-enhanced part) remains unchanged.
Explain the daisy-chaining method of bus arbitration with a suitable diagram. State one advantage and one limitation of this scheme compared to a centralized parallel arbitration scheme.
Daisy-Chaining Bus Arbitration
In daisy-chained arbitration, all bus masters share a single bus-grant (BG) line that is wired serially through the devices, ordered by physical priority.
Diagram (in words):
+----------+ +----------+ +----------+
BG ----->| Device 1 |--BG->| Device 2 |--BG->| Device 3 |
+----------+ +----------+ +----------+
^ | | |
| +-------+--------+----------------+
Bus Arbiter <-------- Bus Request (BR, common/wired-OR) ----
(Bus Busy line shared by all)
Operation:
- Any device needing the bus asserts the common Bus Request (BR) line.
- The arbiter responds by asserting Bus Grant (BG).
- BG propagates serially from device 1 onward. The first requesting device in the chain that receives BG takes the bus and blocks BG from passing further; non-requesting devices simply pass BG along.
- The device asserts Bus Busy, uses the bus, and releases it when done.
Thus priority is fixed by position in the chain (devices nearer the arbiter have higher priority).
Advantage (vs centralized parallel arbitration): Very simple — needs only a few control lines (BR, BG, Busy) regardless of the number of devices, so it is cheap and easy to expand.
Limitation: (i) Fixed, position-based priority can starve low-priority (far) devices, and (ii) the serial BG propagation is slow and a failure in one device breaks the chain for those downstream. Centralized parallel arbitration avoids this by giving each device its own request/grant lines and a programmable priority encoder, at the cost of more hardware.
Explain the concept of virtual memory and the role of paging. Describe how a Translation Lookaside Buffer (TLB) improves the performance of address translation.
Virtual Memory and Paging
Virtual memory is a memory-management technique that gives each process the illusion of a large, contiguous address space (the virtual address space) that can exceed the physical main memory. Only the actively used portions reside in RAM; the rest is kept on secondary storage (disk) and brought in on demand. It provides large address spaces, isolation/protection between processes, and relocation.
Role of paging: The virtual and physical address spaces are divided into equal-sized blocks — pages (virtual) and frames (physical). A page table maps each virtual page number (VPN) to a physical frame number (PFN). A virtual address is split into (VPN, offset); translation replaces the VPN with its PFN while the offset is unchanged:
If the page is not present in RAM, a page fault occurs and the OS loads it from disk. Paging eliminates external fragmentation and allows non-contiguous allocation.
Role of the TLB
Without help, each memory reference would need an extra memory access just to read the page table — doubling effective access time (or more, with multi-level tables).
The Translation Lookaside Buffer (TLB) is a small, fast, fully/set-associative cache that stores the most recently used VPN -> PFN translations.
- On a TLB hit, the PFN is obtained in ~1 cycle, so translation adds almost no overhead.
- On a TLB miss, the page table is walked (in memory or by hardware), and the entry is loaded into the TLB.
Because of locality of reference, TLB hit rates are typically very high (often >95–99%), so most translations are fast. This keeps the effective address-translation time close to a single fast cache access, dramatically improving overall performance.
Differentiate between a multicore processor and a multiprocessor system. Briefly explain cache coherence and why it becomes a problem in shared-memory multiprocessors.
Multicore Processor vs Multiprocessor System
| Aspect | Multicore Processor | Multiprocessor System |
|---|---|---|
| Definition | Two or more independent processing cores integrated on a single chip | Two or more separate physical CPUs (chips) in one system |
| Integration | On-die; cores share on-chip resources (e.g., L2/L3 cache, memory controller) | Separate packages connected via a system bus/interconnect |
| Communication | Fast, low-latency on-chip interconnect | Slower, through external bus/interconnect |
| Cost/Power | Lower cost and power for given parallelism | Higher cost, board complexity, and power |
| Example | Quad-core CPU in a laptop | A server board with multiple CPU sockets (SMP) |
A multicore chip can itself be a building block of a multiprocessor system (multiple multicore chips).
Cache Coherence
Cache coherence requires that all caches in a shared-memory system present a consistent view of any shared memory location — i.e., a read must return the most recently written value to that location, regardless of which processor wrote it.
Why it becomes a problem in shared-memory multiprocessors: Each processor/core keeps its own private cache. When several caches hold copies of the same memory block and one processor writes to its copy, the other cached copies become stale (out of date). A subsequent read by another processor from its own cache would return the old value, violating correctness.
This is resolved by cache-coherence protocols (e.g., snooping protocols like MESI, or directory-based protocols) that invalidate or update other copies whenever a shared block is modified, ensuring all cores eventually see a single, consistent value.
Frequently asked questions
- Where can I find the BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) question paper 2079?
- The full BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) 2079 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
- Does the Computer Architecture (PU, CMP 262) 2079 paper come with solutions?
- Yes. Every question on this Computer Architecture (PU, CMP 262) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
- How many marks is the BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) 2079 paper?
- The BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) 2079 paper carries 100 full marks and is meant to be completed in 180 minutes, across 12 questions.
- Is practising this Computer Architecture (PU, CMP 262) past paper free?
- Yes — reading and attempting this Computer Architecture (PU, CMP 262) past paper on Kekkei is completely free.