BE Computer Engineering (IOE, TU) Computer Organization and Architecture (IOE, CT 603 / ENCT 303) Question Paper 2078 Nepal

Q: Where can I find the BE Computer Engineering (IOE, TU) Computer Organization and Architecture (IOE, CT 603 / ENCT 303) question paper 2078?

The full BE Computer Engineering (IOE, TU) Computer Organization and Architecture (IOE, CT 603 / ENCT 303) 2078 (Regular (annual)) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.

Q: Does the Computer Organization and Architecture (IOE, CT 603 / ENCT 303) 2078 paper come with solutions?

Yes. Every question on this Computer Organization and Architecture (IOE, CT 603 / ENCT 303) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.

Q: How many marks is the BE Computer Engineering (IOE, TU) Computer Organization and Architecture (IOE, CT 603 / ENCT 303) 2078 paper?

The BE Computer Engineering (IOE, TU) Computer Organization and Architecture (IOE, CT 603 / ENCT 303) 2078 paper carries 80 full marks and is meant to be completed in 180 minutes, across 12 questions.

Q: Is practising this Computer Organization and Architecture (IOE, CT 603 / ENCT 303) past paper free?

Yes — reading and attempting this Computer Organization and Architecture (IOE, CT 603 / ENCT 303) past paper on Kekkei is completely free.

Question

1Long answer12 marks

(a) Differentiate between hardwired control unit and microprogrammed control unit on the basis of speed, flexibility, implementation cost and ease of modification. [6]

(b) Design a hardwired control unit for a simple processor. Draw the block diagram showing the instruction register, sequence counter, instruction decoder and control logic gates, and explain how the control signals are generated for a fetch cycle. [6]

control-unitmicroprogrammed-control

Answer 1

(a) Hardwired vs Microprogrammed Control Unit

Basis	Hardwired Control Unit	Microprogrammed Control Unit
Speed	Faster — control signals are produced directly by combinational logic, so no extra memory fetch is needed.	Slower — each control word (microinstruction) must be fetched from the control memory before signals are issued.
Flexibility	Inflexible — implemented as fixed gates, flip-flops and decoders.	Highly flexible — the instruction set/behaviour can be changed simply by rewriting the microprogram.
Implementation cost	Cheaper for a small instruction set, but design/wiring cost rises sharply as complexity grows.	More economical for complex/large instruction sets because the control logic is replaced by a regular control ROM.
Ease of modification	Difficult — any change requires redesigning the hardware circuit.	Easy — modify or add microroutines in control memory; no hardware rewiring.

In short, hardwired control suits simple, high-speed RISC machines, whereas microprogrammed control suits complex CISC machines where ease of design and modification matter more than raw speed.

(b) Design of a Hardwired Control Unit (Fetch Cycle)

Block diagram (described):

        +------------------+
        | Instruction Reg  |  (IR) -- holds current opcode
        +--------+---------+
                 | opcode bits
                 v
        +------------------+
        | Instruction      |
        | Decoder (n->2^n) |  --> D0, D1, ... Dk decoded outputs
        +------------------+
                 |
  +-----------+   |   +------------------+
  | Sequence  |---+-->|  Control Logic   |--> control signals
  | Counter   |       |  Gates (AND/OR)  |    (Read, Write, LD-IR,
  | (SC) T0..Tn       +------------------+     INR-PC, bus select...)
  +-----------+              ^
        ^                    |
        | timing             | status flags / clock
      Clock

Components

Instruction Register (IR): stores the opcode of the fetched instruction.
Instruction Decoder: decodes the opcode into individual operation lines $D_0,D_1,\dots$ .
Sequence Counter (SC): a counter decoded into timing signals $T_0,T_1,T_2,\dots$ that mark successive clock steps; it is incremented each clock and cleared at the end of a cycle.
Control Logic Gates: combinational AND/OR gates that combine the decoded operation lines, the timing signals and status flags to produce the actual control signals.

How control signals are generated for the fetch cycle

The fetch cycle is implemented over timing steps $T_0, T_1, T_2$ :

T_0: \; MAR \leftarrow PC

T_1: \; MBR \leftarrow M[MAR], \quad PC \leftarrow PC + 1

T_2: \; IR \leftarrow MBR

Each control signal is the logical product of the relevant timing line. For example, the signal that loads the IR is asserted by the gate term $T_2$ , the memory-read signal by $T_1$ , and the PC-increment signal by $T_1$ . Because these depend only on the timing counter (and not yet on the opcode), the fetch logic is fixed combinational hardware — exactly the hardwired approach. After $T_2$ , control passes to the decode/execute steps where the decoded opcode lines $D_i$ are ANDed with later timing signals to generate execution control signals.

Answer 2

(a) Locality of Reference and the Memory Hierarchy

Locality of reference is the observed tendency of programs to access a relatively small portion of their address space at any given time. It has two forms:

Temporal locality: a memory location that has been referenced recently is likely to be referenced again soon (e.g. loop variables, instructions inside a loop).
Spatial locality: if a location is referenced, nearby locations are likely to be referenced soon (e.g. sequential instruction fetch, array traversal).

How the hierarchy exploits it: Memory is arranged in levels — registers → cache (SRAM) → main memory (DRAM) → secondary storage — with speed and cost decreasing down the hierarchy. By keeping recently used (temporal) data and whole blocks of neighbouring (spatial) data in the small, fast upper levels, most accesses are served quickly from cache, so the CPU rarely waits for slow main memory. This makes the effective access time close to that of the fast level while the capacity is close to that of the slow level, bridging the CPU–memory speed gap.

(b) Address Format

Given: Main memory = 256 KB $= 2^{18}$ bytes → physical address = 18 bits. Cache = 2 KB $= 2^{11}$ bytes. Block size = 16 bytes $= 2^{4}$ → word/offset = 4 bits.

Number of cache blocks (lines) $= \dfrac{2\text{KB}}{16\text{B}} = \dfrac{2^{11}}{2^{4}} = 2^{7} = 128$ lines.

(i) Direct Mapping

Line bits $= \log_2 128 = 7$ bits
Word bits $= 4$ bits
Tag bits $= 18 - 7 - 4 = 7$ bits

Tag	Line	Word
7	7	4

(ii) 4-way Set-Associative Mapping

Number of sets $= \dfrac{\text{lines}}{\text{ways}} = \dfrac{128}{4} = 32 = 2^{5}$ .

Set bits $= \log_2 32 = 5$ bits
Word bits $= 4$ bits
Tag bits $= 18 - 5 - 4 = 9$ bits

Tag	Set	Word
9	5	4

Answer 3

(a) Instruction Pipelining and the Four-Stage Pipeline

Instruction pipelining is a technique that overlaps the execution of successive instructions by dividing the instruction cycle into independent stages, each handled by a dedicated hardware segment. While one instruction is in one stage, the next instruction occupies the previous stage, so several instructions are processed simultaneously, increasing throughput.

Four segments:

FI (Fetch Instruction): fetch the instruction from memory into the instruction register.
DA (Decode & Calculate Address): decode the opcode and compute the effective operand address.
FO (Fetch Operand): read the operand from memory/registers.
EX (Execute): perform the operation in the ALU and store the result.

Space–time diagram (rows = pipeline stages, columns = clock cycles); $I_1,I_2,I_3,I_4$ are instructions:

Stage \ Clock	1	2	3	4	5	6	7
FI	I1	I2	I3	I4
DA		I1	I2	I3	I4
FO			I1	I2	I3	I4
EX				I1	I2	I3	I4

Without a pipeline, 4 instructions $\times$ 4 stages $= 16$ cycles; with the pipeline they finish in $4 + (4-1) = 7$ cycles. In general $n$ instructions through a $k$ -stage pipeline take $k + (n-1)$ cycles, giving a speedup approaching $k$ for large $n$ .

(b) Pipeline Hazards

A pipeline hazard is a situation that prevents the next instruction from executing in its designated clock cycle, stalling the pipeline.

Data hazard: arises when an instruction depends on the result of a previous instruction that has not yet completed (e.g. ADD R1,R2,R3 followed by SUB R4,R1,R5). The operand of the second instruction is not ready, causing a read-after-write dependency.
Control (branch) hazard: arises from branch/jump instructions. Until the branch outcome is known, the pipeline does not know which instructions to fetch next, so wrongly-fetched instructions may have to be discarded.

Two techniques to handle control hazards:

Branch prediction: the hardware predicts whether the branch will be taken (static, e.g. always-not-taken, or dynamic using a branch-history/prediction table) and prefetches accordingly. If the prediction is correct, no penalty occurs; if wrong, the speculatively fetched instructions are flushed.
Delayed branch: the instruction(s) immediately following the branch (the branch delay slot) are always executed regardless of the branch outcome. The compiler fills the slot with a useful, independent instruction, so the pipeline does no wasted work. (Other valid techniques: branch-target buffer/prefetching both paths, or pipeline stall/flush.)

Answer 4

(a) Booth's Multiplication — Hardware and Operation

Hardware block diagram (described): A multiplicand register M (n bits), a register pair A (accumulator, initially 0) and Q (holds the multiplier), an extra 1-bit register Q₋₁ appended to the right of Q, an n-bit ALU/adder–subtractor feeding A, a control/sequence counter holding the bit count, and a combined right-shift path over the A-Q-Q₋₁ register set.

        +----+        +-------------------+
        | M  |------->|  Adder/Subtractor |---> A
        +----+        +---------+---------+
                                ^
                                |
   [ A | Q | Q-1 ]  <-- arithmetic right shift
        |   |   |
        +---+---+--> control examines Q0,Q-1

Operation: Booth's algorithm multiplies two signed (2's-complement) numbers by examining the pair of bits $(Q_0, Q_{-1})$ at each step:

$00$ or $11$ → no arithmetic, just shift.
$01$ → $A \leftarrow A + M$ (end of a string of 1s), then shift.
$10$ → $A \leftarrow A - M$ (start of a string of 1s), then shift.

After each step, an arithmetic right shift is applied to the combined register $A\,Q\,Q_{-1}$ (sign bit preserved). The process repeats $n$ times (n = number of multiplier bits). The final $2n$ -bit product appears in $A:Q$ . It handles signed numbers directly and skips runs of identical bits, making it efficient.

(b) Multiply (-5) × (+3) using Booth's Algorithm

Use 4-bit 2's-complement ( $n=4$ ).

Multiplicand $M = +3 = 0011$ , $-M = 1101$ .
Multiplier $Q = -5 = 1011$ .
$A = 0000$ , $Q_{-1}=0$ , count $=4$ .

Step	Operation	A	Q	Q₋₁
Init		0000	1011	0
1	$Q_0Q_{-1}=10 \Rightarrow A=A-M$	1101	1011	0
	ASR	1110	1101	1
2	$Q_0Q_{-1}=11 \Rightarrow$ shift	1110	1101	1
	ASR	1111	0110	1
3	$Q_0Q_{-1}=01 \Rightarrow A=A+M$	0010	0110	1
	ASR	0001	0011	0
4	$Q_0Q_{-1}=10 \Rightarrow A=A-M$	1110	0011	0
	ASR	1111	0001	1

Result: $A:Q = 1111\,0001$ (8-bit 2's-complement).

This equals $-(0000\,1111) = -15$ .

Check: $(-5)\times(+3) = -15$ . ✓

Answer 5

Addressing Modes

(a) Immediate addressing: the operand itself is contained in the instruction (no memory reference for the operand).

Example: MOV R1, #5 → loads the constant 5 into R1.
Use: initialising registers with constants and defining counters/loop limits.

(b) Register indirect addressing: the instruction names a register that holds the address of the operand; the operand is in memory at that address.

Example: MOV R1, (R2) → R2 holds an address, the data at that address is loaded into R1.
Use: accessing data through pointers and traversing data structures (linked lists, arrays).

(c) Indexed addressing: the effective address $=$ base address in the instruction $+$ contents of an index register: $EA = A + (Index)$ .

Example: LDA A(X) → loads $M[A + X]$ .
Use: sequential access to array/table elements by incrementing the index register.

(d) Relative addressing: the effective address $=$ contents of the Program Counter $+$ the (signed) address field: $EA = (PC) + A$ .

Example: BRANCH +6 → jump to a location 6 words ahead of the current PC.
Use: writing position-independent (relocatable) code, especially branch/jump targets.

Answer 6

RISC vs CISC

Feature	RISC	CISC
Instruction format	Fixed-length, uniform, simple instructions.	Variable-length, complex instructions of differing sizes.
Addressing modes	Few, simple addressing modes (mostly load/store).	Many, complex addressing modes.
Execution	Mostly single-cycle execution; register-to-register operations; memory accessed only by LOAD/STORE; heavily pipelined; control is usually hardwired.	Multi-cycle instructions; instructions can operate directly on memory operands; control is usually microprogrammed.
Use of registers	Large set of general-purpose registers; operands kept in registers to reduce memory traffic.	Fewer general-purpose registers; relies more on memory.
Code size / compiler	Larger code, more instructions, more work for the compiler.	More compact code, more work done in hardware.

Example processors: RISC — ARM / MIPS; CISC — Intel x86 (8086/Pentium).

Answer 7

Interrupt-Driven I/O

Definition: In interrupt-driven I/O, the CPU does not continuously poll the device. Instead, when an I/O device becomes ready, it raises an interrupt request to the CPU. The CPU finishes its current instruction, suspends the running program, and executes an Interrupt Service Routine (ISR) to transfer the data, then resumes the interrupted program.

Flowchart of CPU handling an interrupt (described):

   Execute current instruction
              |
              v
   Check interrupt line at end of instruction
              |
        Interrupt? --No--> continue normal execution
              | Yes
              v
   Save PC and processor status (push onto stack)
              |
              v
   Acknowledge interrupt / identify device (vector)
              |
              v
   Load PC with ISR start address; execute ISR (transfer data)
              |
              v
   Restore PC and status (pop from stack)
              |
              v
   Resume interrupted program

Comparison with programmed I/O (CPU utilisation): In programmed I/O the CPU repeatedly tests the device status flag in a busy-wait loop, so it is fully occupied waiting and cannot do useful work while the slow device prepares data — wasting CPU time. In interrupt-driven I/O the CPU is interrupted only when the device is actually ready, so between requests it is free to run other programs. This gives far better CPU utilisation and overlap of computation with I/O, at the small cost of interrupt-handling overhead.

Answer 8

Direct Memory Access (DMA)

Working: DMA allows an I/O device to transfer a block of data directly to/from main memory without continuous CPU involvement. A DMA controller (DMAC) takes over the system bus from the CPU to perform the transfer. The CPU only initialises the controller by loading: the memory address register (starting address), the word/byte count register (block size), and the control register (read/write direction, start). The DMAC then:

Requests the bus by asserting HOLD (bus request) to the CPU.
The CPU completes its current bus cycle and replies with HLDA (bus grant), releasing the bus.
The DMAC transfers data directly between the device and memory, incrementing the address and decrementing the count after each word.
When the count reaches zero, the DMAC releases the bus and raises an interrupt to inform the CPU that the transfer is complete.

Block diagram (described):

  CPU <--HOLD/HLDA--> DMA Controller <--DMA req/ack--> I/O Device
   |                       |                              
   +------ Address / Data / Control Bus ------+----- Main Memory

The DMA controller contains an address register, a count register, a control/status register and data lines, and connects to the bus alongside the CPU and memory.

Transfer modes:

Cycle stealing: the DMAC transfers one word at a time, "stealing" a single bus cycle from the CPU for each word and then returning bus control to the CPU. The CPU is slowed slightly but can still execute between transfers. Suitable for slower devices.
Burst (block) transfer: the DMAC holds the bus and transfers the entire block of words in one continuous burst before releasing the bus. This is fast and efficient for high-speed devices (e.g. disk), but the CPU is completely halted from using the bus for the duration of the burst.

Answer 9

Flynn's Classification

Flynn classifies computer architectures by the multiplicity of instruction streams and data streams:

1. SISD (Single Instruction, Single Data): one processing unit executes a single instruction stream on a single data stream. This is the classical sequential (von Neumann) uniprocessor; instruction-level pipelining may be used. Example: a traditional single-core PC / classic CPU.

2. SIMD (Single Instruction, Multiple Data): one control unit issues the same instruction to many processing elements, each operating on its own data. Ideal for data-parallel work. Example: vector/array processors, GPUs; applications like image processing and matrix operations.

3. MISD (Multiple Instruction, Single Data): many instruction streams act on the same data stream. It is mostly theoretical and rarely built. Example/application: fault-tolerant systems where several units process the same data and results are compared (e.g. systolic arrays are sometimes cited).

4. MIMD (Multiple Instruction, Multiple Data): multiple processors each execute their own instruction stream on their own data stream independently. Most modern parallel systems fall here. Example: multicore processors, multiprocessors and distributed clusters; general-purpose parallel/distributed computing.

Answer 10

(a) Major CPU Registers

Major registers of a basic computer CPU: PC (Program Counter), MAR/AR (Memory Address Register), MBR/DR (Memory Buffer/Data Register), IR (Instruction Register), AC (Accumulator), TR (Temporary Register), and the general-purpose registers.

Program Counter (PC): holds the memory address of the next instruction to be fetched; it is incremented after each fetch.
Memory Address Register (MAR): holds the address of the memory location to be read from or written to; it is connected to the address bus.
Instruction Register (IR): holds the current instruction (opcode) being decoded and executed.

(b) Fetch-Phase Microoperations

The register-transfer microoperations for the fetch phase, over timing steps $T_0, T_1, T_2$ :

T_0: \quad MAR \leftarrow PC

T_1: \quad MBR \leftarrow M[MAR], \quad PC \leftarrow PC + 1

T_2: \quad IR \leftarrow MBR

That is, the PC supplies the address, the addressed memory word is read into the buffer while the PC is incremented to point to the next instruction, and finally the instruction is copied into the IR for decoding.

Answer 11

(a) Average Memory Access Time

Given hit ratio $h = 0.92$ , cache time $t_c = 10$ ns, memory time $t_m = 100$ ns.

Using the miss-penalty model where a miss costs cache time plus memory time:

T_{avg} = h \cdot t_c + (1-h)\,(t_c + t_m)

T_{avg} = 0.92(10) + 0.08(10 + 100) = 9.2 + 0.08(110) = 9.2 + 8.8 = \mathbf{18\ ns}

(If the simpler model $T_{avg} = h\,t_c + (1-h)\,t_m$ is assumed: $0.92(10) + 0.08(100) = 9.2 + 8 = 17.2$ ns.)

(b) Write Policies

Write-through: every write updates both the cache and main memory at the same time. Advantage: main memory always holds the current (consistent) data, making it simple and reliable, which is important in multiprocessor/DMA environments.
Write-back (copy-back): a write updates only the cache; the modified ("dirty") block is written to main memory only when it is evicted. Advantage: far less memory traffic, since multiple writes to the same block cause only one memory update — giving higher performance.

Answer 12

Short Notes (any two)

(a) Vector Processing and Array Processors

A vector processor has instructions that operate on entire vectors (one-dimensional arrays) of data with a single instruction, using deeply pipelined arithmetic units so successive element operations overlap. It exploits data parallelism and reduces instruction-fetch/loop overhead (one vector instruction replaces an inner loop). An array processor is an SIMD machine consisting of many identical processing elements (PEs), each with its own local memory, all executing the same instruction broadcast by a single control unit but on different data. Vector processors pipeline operations in time; array processors replicate hardware in space. Both are used for scientific/matrix computations, signal/image processing and simulations.

(b) Hypercube Interconnection Network

A hypercube of dimension $n$ connects $N = 2^{n}$ nodes, where each node has a unique $n$ -bit address. Two nodes are directly connected iff their binary addresses differ in exactly one bit. Hence each node has exactly $n$ neighbours (degree $= n = \log_2 N$ ), and the maximum distance (diameter) between any two nodes is $n = \log_2 N$ , since routing simply flips the differing address bits one at a time. It offers good scalability, low diameter and multiple redundant paths (fault tolerance), and is widely used in massively parallel computers (e.g. nCUBE).

(c) Amdahl's Law

Amdahl's law gives the maximum speedup achievable by parallelising a program with $p$ processors when a fraction $f$ of the work is inherently sequential:

\text{Speedup} = \frac{1}{f + \dfrac{1-f}{p}}

As $p \to \infty$ , the speedup is bounded by $\dfrac{1}{f}$ . Significance: the serial fraction limits the benefit of adding processors — even a small non-parallelisable portion caps the attainable speedup (e.g. if $f = 0.1$ , maximum speedup is only $10\times$ no matter how many processors are used). It guides designers to minimise the sequential part rather than just adding hardware.

Level	BE Computer Engineering (IOE, TU)
Subject	Computer Organization and Architecture (IOE, CT 603 / ENCT 303)
Year	2078 BS
Exam session	Regular (annual)
Full marks	80
Time allowed	180 minutes
Questions	12, all with step-by-step solutions