BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) Question Paper 2078 Nepal
This is the official BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) question paper for 2078, as set in the regular annual examination. It carries 100 full marks and a time allowance of 180 minutes, across 12 questions. On Kekkei you can attempt this Computer Architecture (PU, CMP 262) past paper online with a timer, get instant AI feedback and step-by-step solutions, and track the topics where you lose marks — completely free. Whether you are revising for your BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) exam or solving previous years' question papers, this 2078 paper is a great way to practise under real exam conditions.
Section A: Long Answer Questions
Attempt all / any as specified.
(a) Differentiate between RISC and CISC instruction set architectures with respect to instruction format, addressing modes, and hardware complexity. (7)
(b) A computer has a 16-bit instruction word. The processor supports 64 distinct operations and uses a register file of 16 general-purpose registers. For a two-address register-to-register instruction, design a suitable instruction format showing the bit allocation for the opcode and operand fields. Then, using the same machine, explain with examples how the following addressing modes are interpreted: immediate, register indirect, indexed, and PC-relative. (7)
(a) RISC vs CISC
| Feature | RISC | CISC |
|---|---|---|
| Instruction format | Fixed length (e.g. 32-bit), few formats, easy to decode | Variable length (1–15 bytes), many formats, complex decoding |
| Addressing modes | Few simple modes (register, immediate, base+offset). Only load/store access memory | Many complex modes (indexed, indirect, auto-increment, memory-to-memory) |
| Hardware complexity | Simple hardwired control unit, large register file, pipelining is easy | Microprogrammed control, smaller register set, control logic is complex |
| Instructions | Many simple, single-cycle instructions | Fewer but powerful, multi-cycle instructions |
| Examples | ARM, MIPS, RISC-V | x86, VAX, Motorola 68k |
RISC pushes complexity to the compiler/software; CISC pushes it to the hardware/microcode.
(b) Instruction format design
Given: 16-bit word, 64 operations, 16 registers.
- Opcode: bits
- Each register field: bits; two operands need bits
- Total used bits, leaving 2 bits (mode/reserved).
| Opcode (6) | Rdest (4) | Rsrc (4) | Mode (2) |
15..10 9..6 5..2 1..0
Example: ADD R3, R5 → opcode=ADD, Rdest=0011, Rsrc=0101.
Addressing modes (effective address EA):
- Immediate: operand value is part of the instruction itself,
MOV R1, #7→ operand = 7. No memory access for the operand. - Register indirect: the register holds the address of the operand.
LOAD R1, (R2)→ EA = contents of R2; operand = M[R2]. - Indexed: EA = base address (in instruction) + contents of an index register.
LOAD R1, A(R2)→ EA = A + R2. Used for arrays. - PC-relative: EA = PC + signed offset in the instruction.
BEQ +offset→ target = PC + offset. Gives position-independent branches.
(On a 16-bit word, an immediate/offset must fit the spare field, or a second word is used.)
(a) Draw and explain the organization of a basic CPU showing the ALU, register set, control unit, and the system bus. Describe the role of the program counter (PC), instruction register (IR), and memory address register (MAR) during the instruction fetch cycle. (7)
(b) Compare hardwired control units with microprogrammed control units. For a microprogrammed control unit, explain the difference between horizontal and vertical microinstruction formats, and discuss the trade-off between control memory size and execution speed. (7)
(a) Basic CPU organization
A basic CPU consists of four cooperating blocks connected by the system bus (address, data, control lines):
- ALU – performs arithmetic and logic operations; takes operands from registers and writes the result back, setting status flags.
- Register set – fast internal storage: general-purpose registers plus special registers (PC, IR, MAR, MBR/MDR, accumulator).
- Control Unit (CU) – generates the timing and control signals that sequence all micro-operations (fetch, decode, execute).
- System bus – carries addresses, data, and control signals between CPU, memory, and I/O.
+-------------------- System Bus --------------------+
| | | | |
+--------+ +-------+ +--------+ +--------+ Memory / I/O
| MAR | | IR | | ALU | | Reg. |
+--------+ +-------+ +--------+ | File |
| PC | | MDR | | Control Unit | |
+--------+ +-------+ +------------+ +--------+
Role during the fetch cycle:
- PC (Program Counter): holds the address of the next instruction to be fetched. After the address is used, PC is incremented.
- MAR (Memory Address Register): receives the address from PC and drives it onto the address bus to select the memory location.
- IR (Instruction Register): receives the fetched instruction word from memory (via MDR) and holds it so the CU can decode the opcode.
Fetch: ; , ; .
(b) Hardwired vs Microprogrammed control
| Aspect | Hardwired | Microprogrammed |
|---|---|---|
| Implementation | Fixed logic gates, flip-flops, decoders | Control signals stored as microinstructions in control memory |
| Speed | Faster | Slower (control-memory access per step) |
| Flexibility | Hard to modify | Easy to modify/extend (rewrite microcode) |
| Suited to | RISC | CISC, complex instruction sets |
Horizontal vs Vertical microinstructions:
- Horizontal: one bit (or small field) per control signal; many signals can be activated in parallel. → Wide microinstructions, large control memory, but high speed and parallelism.
- Vertical: control signals are encoded into compact fields and decoded at run time; fewer simultaneous signals. → Narrow microinstructions, small control memory, but slower (extra decode step, more steps).
Trade-off: Horizontal gives faster execution and more parallelism at the cost of a larger control store; vertical saves control-memory size at the cost of decoding overhead and execution speed. Real machines often use a hybrid (nano/two-level) encoding to balance both.
(a) Explain the concept of instruction pipelining. Describe the different types of pipeline hazards (structural, data, and control) and outline one technique to resolve each. (6)
(b) A non-pipelined processor takes 5 ns to execute each instruction. The same processor is redesigned into a 5-stage pipeline where each stage takes 1 ns and the pipeline register delay between stages is 0.2 ns. Calculate the speedup achieved by the pipeline for executing 1000 instructions, assuming no stalls. Also state the theoretical maximum speedup. (6)
(a) Instruction pipelining and hazards
Pipelining overlaps the execution of multiple instructions by dividing instruction processing into stages (e.g. IF–ID–EX–MEM–WB). While one instruction is decoded, the next is fetched, so a new instruction can ideally complete every clock cycle, increasing throughput (not the latency of a single instruction).
Pipeline hazards and one resolution each:
- Structural hazard – two instructions need the same hardware resource in the same cycle (e.g. one memory port). Resolution: add resources / use separate instruction and data caches (Harvard memory) or duplicate the unit.
- Data hazard – an instruction needs a result not yet written by a prior instruction (RAW dependency). Resolution: operand forwarding (bypassing); if unavoidable, insert stalls or reorder instructions.
- Control hazard – a branch changes the instruction flow, making prefetched instructions invalid. Resolution: branch prediction (or delayed branch / flushing the pipeline).
(b) Speedup calculation
Given: non-pipelined time ns/instr; 5 stages, stage time 1 ns, register delay 0.2 ns.
Pipeline clock period ns.
Time for instructions in a stage pipeline:
Non-pipelined time:
Speedup:
Theoretical maximum speedup (large , ideal, no register overhead) . With the 0.2 ns register overhead the asymptotic limit is , which matches our result.
(a) Explain the memory hierarchy of a computer system and justify why it improves average memory access time, referring to the principle of locality of reference. (5)
(b) A system has a cache with a hit ratio of 0.95, a cache access time of 5 ns, and a main memory access time of 80 ns. Compute the average memory access time for both a simultaneous (look-aside) access scheme and a hierarchical (look-through) access scheme. (4)
(c) Compare direct-mapped, fully associative, and set-associative cache mapping techniques in terms of hardware cost, hit ratio, and search complexity. (3)
(a) Memory hierarchy and locality
The memory hierarchy arranges storage in levels from fast/small/expensive to slow/large/cheap:
Registers → L1/L2 Cache → Main Memory (RAM) → Secondary Storage (SSD/HDD)
fastest slowest
It improves average memory access time by exploiting the principle of locality of reference:
- Temporal locality: recently accessed items are likely to be accessed again soon → keep them in cache.
- Spatial locality: items near a referenced address are likely to be used soon → fetch a whole block/line.
Because most accesses hit in the fast upper levels, the effective access time approaches that of the fast memory while the capacity approaches that of the cheap memory — giving near-cache speed at near-disk capacity and low cost.
(b) Average Memory Access Time (AMAT)
Given: hit ratio , cache time ns, memory time ns.
Simultaneous / look-aside (cache and memory started together; on a miss only the extra memory time counts, cache and memory accessed in parallel):
Hierarchical / look-through (memory is accessed only after a cache miss, so a miss pays ):
(c) Cache mapping comparison
| Technique | Hardware cost | Hit ratio | Search complexity |
|---|---|---|---|
| Direct-mapped | Lowest (1 comparator) | Lowest (conflict misses) | Simplest – index a single line |
| Fully associative | Highest (one comparator per line, CAM) | Highest (no conflict misses) | Most complex – search all lines |
| Set-associative | Moderate (n comparators) | Intermediate (good balance) | Moderate – search within a set |
Set-associative is the practical compromise used in real CPUs.
Section B: Short Answer Questions
Attempt all / any as specified.
Compare programmed I/O, interrupt-driven I/O, and direct memory access (DMA) as techniques for data transfer between the CPU and I/O devices. Under what circumstances is DMA most beneficial?
I/O data-transfer techniques
| Technique | How it works | CPU involvement | Speed |
|---|---|---|---|
| Programmed I/O | CPU repeatedly polls the device status flag and transfers each word itself | CPU fully busy (busy-wait), wastes cycles | Slow |
| Interrupt-driven I/O | Device interrupts the CPU when ready; CPU services it and transfers a word, then resumes other work | CPU free between transfers, but still moves every word | Moderate |
| DMA (Direct Memory Access) | A DMA controller transfers blocks directly between device and memory, stealing bus cycles; CPU is interrupted only once at completion | CPU free during the whole block transfer | Fast |
When DMA is most beneficial: for high-speed, bulk/block data transfers (disk, network, audio/video) where moving data word-by-word through the CPU would create excessive interrupt/polling overhead. DMA frees the CPU to do useful computation while large blocks move, maximizing throughput.
State Flynn's classification of computer architectures. Briefly describe each of the four categories (SISD, SIMD, MISD, MIMD) and give one practical example of a system belonging to each category.
Flynn's Classification
Flynn classifies architectures by the number of concurrent instruction streams and data streams.
| Class | Instruction × Data | Description | Example |
|---|---|---|---|
| SISD | Single instr, Single data | One processor executes one instruction on one data item at a time (classic von Neumann) | Traditional single-core CPU (Intel 8085, basic uniprocessor) |
| SIMD | Single instr, Multiple data | One instruction operates on many data elements simultaneously (data parallelism) | GPUs, vector processors, x86 SSE/AVX, array processors |
| MISD | Multiple instr, Single data | Several instructions operate on the same data stream; rare | Fault-tolerant/redundant systems (e.g. space shuttle flight control) |
| MIMD | Multiple instr, Multiple data | Multiple processors execute different instructions on different data independently | Multicore processors, clusters, multiprocessor servers |
MIMD is the most common in general-purpose parallel computers; SIMD dominates graphics and numerical workloads.
State Amdahl's law. A program spends 70% of its execution time in a section that can be parallelized across 8 processors, while the remaining 30% is strictly sequential. Calculate the overall speedup and explain what the result implies about the limits of parallel processing.
Amdahl's Law
Statement: The overall speedup from improving a fraction of a program by a factor is limited by the part that is not improved:
where = parallelizable fraction, = number of processors.
Calculation
Given (parallel), (sequential), :
Overall speedup ≈ 2.58×.
Implication
Even though 70% of the work runs 8× faster, the 30% sequential portion caps the gain at only ~2.58×, far below 8×. As , the maximum possible speedup is . This shows that the sequential (non-parallelizable) fraction is the fundamental bottleneck: beyond a point, adding processors yields diminishing returns, so reducing the serial fraction matters more than adding cores.
Explain the difference between the write-through and write-back cache write policies. Discuss the role of the dirty bit and compare the two policies in terms of memory traffic and data consistency.
Write-through vs Write-back
Write-through: every write updates both the cache and main memory immediately. Memory is always consistent with the cache.
Write-back (copy-back): a write updates only the cache line; main memory is updated later, when that line is evicted/replaced.
Role of the dirty bit
Each cache line in a write-back cache has a dirty bit. It is set to 1 when the line is modified in cache but not yet in memory. On replacement:
- dirty bit = 1 → the line must be written back to memory first,
- dirty bit = 0 → the line can simply be discarded (memory already current). This avoids needless memory writes for unmodified lines.
Comparison
| Aspect | Write-through | Write-back |
|---|---|---|
| Memory traffic | High – every write goes to memory | Low – only dirty lines written, multiple writes coalesced |
| Data consistency | Always consistent; simple for multiprocessors/DMA | Memory can be stale; needs coherence handling |
| Complexity | Simple (no dirty bit) | Needs dirty bit and write-back logic |
| Speed | Slower writes | Faster writes |
Write-through favors consistency and simplicity; write-back favors performance by reducing memory traffic.
Represent the decimal number -85.375 in IEEE 754 single-precision (32-bit) floating-point format. Show the sign, exponent (with bias), and mantissa fields clearly.
IEEE 754 single precision of −85.375
Step 1 — Sign: number is negative → sign bit = 1.
Step 2 — Convert magnitude to binary:
- (since )
- So
Step 3 — Normalize ():
Unbiased exponent .
Step 4 — Biased exponent: bias = 127.
Step 5 — Mantissa (23 bits, drop leading 1):
Final 32-bit representation
| Field | Bits |
|---|---|
| Sign (1) | 1 |
| Exponent (8) | 10000101 |
| Mantissa (23) | 01010101100000000000000 |
In hexadecimal: 0xC2AAC000.
Write the sequence of register-transfer micro-operations required to fetch an instruction from memory and increment the program counter. Explain the purpose of each micro-operation in the fetch cycle.
Instruction fetch micro-operations
The fetch cycle moves the next instruction from memory into the IR and advances the PC. Using register-transfer notation:
T0: MAR ← PC
T1: MDR ← M[MAR], PC ← PC + 1
T2: IR ← MDR
Purpose of each micro-operation:
MAR ← PC– the address of the next instruction (held in the PC) is copied into the Memory Address Register, which drives the address bus to select the memory location.MDR ← M[MAR]– memory is read; the instruction word at that address is loaded into the Memory Data Register (MDR/MBR).PC ← PC + 1is done in the same step (in parallel) so the PC now points to the following instruction, ready for the next fetch.IR ← MDR– the fetched instruction is transferred from the MDR into the Instruction Register, where the control unit will decode its opcode to begin the execute cycle.
After T2 the instruction is in the IR and the PC is updated, completing the fetch.
Explain bus arbitration in a system with multiple bus masters. Differentiate between centralized daisy-chaining and independent request/grant schemes.
Bus arbitration
When several bus masters (CPU, DMA controllers, I/O processors) can drive the shared bus, bus arbitration decides which master gains control at a given time, preventing conflicts. The unit that decides is the bus arbiter, using request and grant control lines.
Centralized daisy-chaining
- A single bus grant (BG) line is chained serially through all masters; one shared bus request (BR) line goes to the arbiter.
- When BR is asserted, the arbiter sends BG to the first device; if it didn't request, it passes BG to the next, and so on.
- Pros: very few control lines, easy to add devices. Cons: fixed priority by physical position (nearest device dominates → starvation of far devices); failure of one device can break the chain; slower propagation.
Independent request / grant scheme
- Each master has its own request line (BR) and its own grant line (BG) to a central arbiter.
- The arbiter sees all requests at once and grants the bus to the highest-priority requester directly.
- Pros: fast, flexible/programmable priority, no chain dependency, fair. Cons: many control lines and more arbiter hardware (cost grows with number of masters).
Summary: daisy-chaining = simple, few lines, fixed positional priority; independent request/grant = more lines/hardware but faster and flexible priority.
Write short notes on any TWO of the following:
(a) Cache coherence problem in multiprocessor systems
(b) Crossbar switch interconnection network
(c) Superscalar processor architecture
(Answer any TWO; all three are given.)
(a) Cache coherence problem
In a multiprocessor system each CPU has its own cache. When several caches hold copies of the same memory block and one processor writes to it, the other caches hold stale copies — this inconsistency is the cache coherence problem. It is solved by coherence protocols that enforce a single consistent view: snooping protocols (e.g. MESI — Modified, Exclusive, Shared, Invalid — where writes invalidate or update other copies via a shared bus) and directory-based protocols for larger systems.
(b) Crossbar switch interconnection network
A crossbar connects processors to memory modules through a grid of switch points (crosspoints). Any processor can be connected to any free memory module simultaneously and non-blockingly, giving the highest bandwidth and lowest latency. Its drawback is cost/complexity: hardware grows as , so it scales poorly to large systems.
(c) Superscalar processor architecture
A superscalar processor issues and executes more than one instruction per clock cycle by having multiple parallel execution units (e.g. several ALUs, load/store, FP units). It uses hardware to detect independent instructions, supports out-of-order execution and dynamic scheduling, and exploits instruction-level parallelism (ILP), achieving an ideal CPI < 1 (IPC > 1). Examples: Intel Pentium and later x86 cores, modern ARM cores.
Frequently asked questions
- Where can I find the BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) question paper 2078?
- The full BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) 2078 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
- Does the Computer Architecture (PU, CMP 262) 2078 paper come with solutions?
- Yes. Every question on this Computer Architecture (PU, CMP 262) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
- How many marks is the BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) 2078 paper?
- The BE Computer Engineering (Pokhara University) Computer Architecture (PU, CMP 262) 2078 paper carries 100 full marks and is meant to be completed in 180 minutes, across 12 questions.
- Is practising this Computer Architecture (PU, CMP 262) past paper free?
- Yes — reading and attempting this Computer Architecture (PU, CMP 262) past paper on Kekkei is completely free.