Browse papers
A

Section A: Long Answer Questions

Attempt all / any as specified.

4 questions
1long14 marks

(a) Explain the role of a lexical analyzer in a compiler. Describe the terms token, lexeme, and pattern with suitable examples. [6]

(b) Consider the regular expression (a|b)*abb. Construct an NFA using Thompson's construction, then convert it into an equivalent DFA using the subset construction algorithm. Show the transition table and the minimized DFA. [8]

(a) Role of the Lexical Analyzer [6]

The lexical analyzer (scanner) is the first phase of a compiler. It reads the source program as a stream of characters and groups them into meaningful sequences called lexemes, producing a stream of tokens that is passed to the parser.

Main functions:

  • Read input characters and produce tokens.
  • Strip out whitespace and comments.
  • Correlate error messages with line numbers.
  • Expand macros / interact with the symbol table (entering identifiers).

Key terms:

TermMeaningExample (int count = 10;)
TokenA category/pair <name, attribute> representing a class of lexemes<keyword>, <id>, <assign>, <num>
LexemeThe actual sequence of characters in the source matching a patternint, count, =, 10
PatternA rule (usually a regular expression) describing the set of lexemes for a tokenidentifier = `letter(letter

(b) (a|b)*abb — NFA → DFA → Minimized DFA [8]

Thompson NFA (described): The construction yields a start state with an ε\varepsilon-loop subautomaton for (a|b)* followed by a linear path consuming a, b, b. Conceptually a 11-state NFA with ε\varepsilon-transitions.

Subset construction. Using start = ε\varepsilon-closure of the initial state, the relevant DFA states are:

  • AA = start set (loop)
  • BB = states reachable after a (still in loop, plus first a of abb)
  • CC = after seeing prefix ab
  • DD = after seeing abb (accepting)

DFA transition table:

Stateon aon b
→ ABA
BBC
CBD
*DBA

Start state = AA; accepting state = DD.

Minimization. Initial partition: non-accepting {A,B,C}\{A,B,C\} and accepting {D}\{D\}. Checking transitions, AA, BB, CC all reach distinct behaviour with respect to reaching DD (only CC goes to DD on b), so they remain distinguishable. The DFA above is already minimal with 4 states {A,B,C,D}\{A,B,C,D\} — this is the classic minimal DFA recognizing strings over {a,b}\{a,b\} ending in abb.

lexical-analysisfinite-automata
2long16 marks

Consider the following augmented grammar:

E -> E + T | T
T -> T * F | F
F -> ( E ) | id

(a) Construct the canonical collection of LR(0) items (the set of item sets) for this grammar. [8]

(b) Build the SLR(1) parsing table (ACTION and GOTO) and show, using a stack, how the input string id * id + id is parsed. [8]

Augmented grammar adds E' -> E. Productions: E->E+T | T, T->T*F | F, F->(E) | id.

(a) Canonical Collection of LR(0) Items [8]

  • I0: E'->.E, E->.E+T, E->.T, T->.T*F, T->.F, F->.(E), F->.id
  • I1 = goto(I0,E): E'->E., E->E.+T
  • I2 = goto(I0,T): E->T., T->T.*F
  • I3 = goto(I0,F): T->F.
  • I4 = goto(I0,(): F->(.E), E->.E+T, E->.T, T->.T*F, T->.F, F->.(E), F->.id
  • I5 = goto(I0,id): F->id.
  • I6 = goto(I1,+): E->E+.T, T->.T*F, T->.F, F->.(E), F->.id
  • I7 = goto(I2,*): T->T*.F, F->.(E), F->.id
  • I8 = goto(I4,E): F->(E.), E->E.+T
  • I9 = goto(I6,T): E->E+T., T->T.*F
  • I10 = goto(I7,F): T->T*F.
  • I11 = goto(I8,)): F->(E).

(b) SLR(1) Parsing Table & Parse of id * id + id [8]

FOLLOW sets: FOLLOW(E)={+,),$}, FOLLOW(T)={+,*,),$}, FOLLOW(F)={+,*,),$}.

ACTION / GOTO (s=shift, r=reduce by numbered production: 1 E->E+T, 2 E->T, 3 T->T*F, 4 T->F, 5 F->(E), 6 F->id):

Stateid+*()$ETF
0s5s4123
1s6acc
2r2s7r2r2
3r4r4r4r4
4s5s4823
5r6r6r6r6
6s5s493
7s5s410
8s6s11
9r1s7r1r1
10r3r3r3r3
11r5r5r5r5

Parse of id * id + id $:

StackInputAction
0id*id+id$s5
0 id 5*id+id$r6 (F->id), goto3
0 F 3*id+id$r4 (T->F), goto2
0 T 2*id+id$s7
0 T 2 * 7id+id$s5
0 T 2 * 7 id 5+id$r6, goto10
0 T 2 * 7 F 10+id$r3 (T->T*F), goto2
0 T 2+id$r2 (E->T), goto1
0 E 1+id$s6
0 E 1 + 6id$s5
0 E 1 + 6 id 5$r6, goto3
0 E 1 + 6 F 3$r4, goto9
0 E 1 + 6 T 9$r1 (E->E+T), goto1
0 E 1$accept

The string is accepted.

lr-parsersslr-parsing
3long12 marks

(a) Differentiate between machine-dependent and machine-independent code optimization. [4]

(b) For the following three-address code, construct the basic blocks, draw the control flow graph, and apply the optimizations common sub-expression elimination, dead code elimination, and constant folding. Show the optimized code at each step. [8]

1.  t1 = 4 * i
2.  t2 = a[t1]
3.  t3 = 4 * i
4.  t4 = b[t3]
5.  t5 = t2 + t4
6.  c   = t5
7.  t6 = 4 * i
8.  d   = t6

(a) Machine-dependent vs Machine-independent Optimization [4]

AspectMachine-independentMachine-dependent
LevelPerformed on intermediate code (3AC)Performed on/near target code
Knowledge of targetDoes not assume specific CPUExploits registers, addressing modes, instruction set
ExamplesCommon sub-expression elimination, constant folding, dead code elimination, loop-invariant motionRegister allocation, instruction selection, peephole optimization, instruction scheduling, use of special addressing modes
PortabilityPortable across machinesTied to a particular architecture

(b) Basic Blocks, CFG and Optimization [8]

Basic block: The code is straight-line with no jumps, so it forms a single basic block B1 (statements 1–8). The CFG is one node B1 with entry → B1 → exit.

Step 1 – Common Sub-expression Elimination. 4 * i is computed in lines 1, 3, 7. Compute once into t1; reuse it:

t1 = 4 * i
t2 = a[t1]
t4 = b[t1]      ; was b[t3], t3 = t1
t5 = t2 + t4
c  = t5
d  = t1         ; was t6 = 4*i = t1

Step 2 – Copy/Dead Code Elimination. t3, t6 are dead (no longer used); c = t5 and d = t1 are kept since c,d are live results. t5 used only once — could be folded but is a result:

t1 = 4 * i
t2 = a[t1]
t4 = b[t1]
c  = t2 + t4
d  = t1

Step 3 – Constant Folding. No constant-only expressions remain to fold here (4 * i has variable i). If i were a known constant, 4*i would be folded; otherwise no change.

Final optimized code:

t1 = 4 * i
t2 = a[t1]
t4 = b[t1]
c  = t2 + t4
d  = t1

Three redundant 4 * i computations are reduced to one; dead temporaries t3, t6 removed.

code-optimizationdata-flow-analysis
4long10 marks

(a) Distinguish between synthesized and inherited attributes. Explain S-attributed and L-attributed definitions with examples. [5]

(b) Write a syntax-directed definition that generates three-address code for the assignment statement a = b * -c + b * -c. Show the annotated parse tree and the generated three-address instructions. [5]

(a) Synthesized vs Inherited Attributes [5]

  • Synthesized attribute: value computed from the attributes of the node's children (and itself). Flows bottom-up. Example: in E -> E1 + T { E.val = E1.val + T.val }, E.val is synthesized.
  • Inherited attribute: value computed from attributes of the node's parent and/or siblings. Flows top-down / left-to-right. Example: in a declaration D -> T L, the type flows down via L.in = T.type.

S-attributed definition: an SDD that uses only synthesized attributes. It can always be evaluated bottom-up during LR parsing.

L-attributed definition: an SDD where each inherited attribute of a symbol on the RHS depends only on (i) inherited attributes of the head, and (ii) attributes of symbols to its left. Every S-attributed definition is also L-attributed. L-attributed SDDs can be evaluated in a single left-to-right (e.g., LL / predictive) pass.

(b) SDD generating 3AC for a = b * -c + b * -c [5]

Using attributes E.addr (place holding the value) and E.code, with newtemp() and gen():

t1 = minus c     ; unary minus on c
t2 = b * t1
t3 = minus c
t4 = b * t3
t5 = t2 + t4
a  = t5

Annotated parse tree (described): The tree for a = E has E -> E1 + E2, each Ei -> b * (-c). Bottom-up, each -c synthesizes t1/t3; each b * (-c) synthesizes t2/t4; the + node synthesizes t5; finally the assignment node emits a = t5. Each node's addr attribute holds the temporary name shown above, and code accumulates the generated instructions.

syntax-directed-translationintermediate-code
B

Section B: Short Answer Questions

Attempt all / any as specified.

9 questions
5short6 marks

Explain the different phases of a compiler with a neat block diagram. Show the intermediate output produced by each phase for the source statement position = initial + rate * 60.

Phases of a Compiler [6]

Block diagram (described): Source program → Lexical AnalyzerSyntax AnalyzerSemantic AnalyzerIntermediate Code GeneratorCode OptimizerCode Generator → Target program. The Symbol Table and Error Handler interact with all phases.

Phases:

  1. Lexical Analysis – converts characters to tokens.
  2. Syntax Analysis – builds a parse/syntax tree, checks grammar.
  3. Semantic Analysis – type checking, scope checking; annotates the tree.
  4. Intermediate Code Generation – produces machine-independent code (e.g., 3AC).
  5. Code Optimization – improves the intermediate code.
  6. Code Generation – emits target machine code.

Output for position = initial + rate * 60

1. Lexical Analyzer (tokens): <id,1> <=> <id,2> <+> <id,3> <*> <num,60> (symbol table: 1=position, 2=initial, 3=rate)

2. Syntax tree:

      =
     / \
 position  +
          /   \
      initial   *
               / \
            rate   60

3. Semantic Analyzer: type check; 60 (int) is converted to float → inttofloat(60).

4. Intermediate code (3AC):

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3

5. Optimized code:

t1 = id3 * 60.0
id1 = id2 + t1

6. Target code (assembly):

LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1
compiler-phases
6short6 marks

For the grammar below, compute the FIRST and FOLLOW sets of every non-terminal and construct the LL(1) predictive parsing table. State whether the grammar is LL(1).

S -> a B D h
B -> c C
C -> b C | epsilon
D -> E F
E -> g | epsilon
F -> f | epsilon

FIRST, FOLLOW and LL(1) Table [6]

Grammar: S->aBDh, B->cC, C->bC | ε, D->EF, E->g | ε, F->f | ε.

FIRST sets:

  • FIRST(C) = { b, ε }
  • FIRST(B) = { c }
  • FIRST(E) = { g, ε }
  • FIRST(F) = { f, ε }
  • FIRST(D) = FIRST(E) ∪ (FIRST(F) since E→ε) ∪ {ε if both nullable} = { g, f, ε }
  • FIRST(S) = { a }

FOLLOW sets:

  • FOLLOW(S) = { $ }
  • FOLLOW(B): in S->aBDh, follows B = FIRST(Dh) = FIRST(D) (minus ε) ∪ {h} = { g, f, h }
  • FOLLOW(C) = FOLLOW(B) = { g, f, h }
  • FOLLOW(D): in S->aBDh, FOLLOW(D) = { h }
  • FOLLOW(E): in D->EF, FOLLOW(E) = FIRST(F) (minus ε) ∪ FOLLOW(D) = { f, h }
  • FOLLOW(F) = FOLLOW(D) = { h }

LL(1) Parsing Table (rows = non-terminals, columns = terminals):

abcfgh$
SS→aBDh
BB→cC
CC→bCC→εC→εC→ε
DD→EFD→EFD→EF
EE→εE→gE→ε
FF→fF→ε

Conclusion: No cell contains more than one production, so the grammar is LL(1).

ll-parsersfirst-follow
7short5 marks

What is left recursion and why is it a problem for top-down parsers? Eliminate left recursion and perform left factoring on the following grammar:

A -> A c | A a d | b d | epsilon
S -> i E t S | i E t S e S | a

Left Recursion [5]

Left recursion occurs when a non-terminal A derives a string beginning with itself, e.g. A -> A α. Problem for top-down parsers: a recursive-descent / predictive parser would call A to expand A again without consuming any input, causing infinite recursion / non-termination.

Eliminating left recursion in A

A -> A c | A a d | b d | ε

Here α-alternatives (left-recursive) = {c, ad}, β-alternatives (non-left-recursive) = {bd, ε}. Using the rule A -> βA' and A' -> αA' | ε:

A  -> b d A' | A'
A' -> c A' | a d A' | ε

(A -> A' arises because β includes ε.)

Left factoring S

S -> i E t S | i E t S e S | a

The two alternatives share the common prefix i E t S. Factor it:

S  -> i E t S S' | a
S' -> e S | ε

This removes the common prefix so a predictive parser can decide between alternatives by one token of lookahead (the classic dangling-else resolution: S' -> ε for no else).

left-recursiongrammar-transformation
8short5 marks

Describe the data structures used for implementing a symbol table. Compare the use of a linear list versus a hash table for symbol table organization, commenting on insertion and lookup time complexity. How is scope information handled in a block-structured language?

Symbol Table Data Structures [5]

A symbol table stores information (name, type, scope, offset, etc.) about identifiers. Common implementations:

  • Linear/unordered list: array or linked list of entries.
  • Ordered list / sorted array: allows binary search.
  • Binary search tree (BST): keeps entries ordered.
  • Hash table: key (name) hashed to a bucket; most widely used.

Linear List vs Hash Table

OperationLinear list (unordered)Hash table
InsertionO(1)O(1) (append) but must check duplicates → O(n)O(n)O(1)O(1) average
LookupO(n)O(n)O(1)O(1) average, O(n)O(n) worst case (collisions)
SpaceLowHigher (table + chains)
SimplicityVery simpleMore complex (hash fn, collision handling)

Linear lists are simple and space-efficient but slow for large programs; hash tables give near-constant lookup/insertion and are the standard choice.

Scope in Block-Structured Languages

A block-structured language uses nested scopes. The symbol table is organized as a stack of scopes (chained/nested tables): entering a block pushes a new table; on lookup, the most recent (innermost) scope is searched first, then enclosing scopes (the most-closely-nested rule); leaving a block pops/closes its table so its local names become inaccessible. This naturally implements name hiding by inner declarations.

symbol-table
9short6 marks

Explain the three representations of three-address code: quadruples, triples, and indirect triples. Represent the expression a = b * -c + b * -c in all three forms and comment on the relative advantages of each.

Representations of Three-Address Code [6]

For a = b * -c + b * -c, the 3AC is:

(0) t1 = minus c
(1) t2 = b * t1
(2) t3 = minus c
(3) t4 = b * t3
(4) t5 = t2 + t4
(5) a  = t5

Quadruples — (op, arg1, arg2, result)

#oparg1arg2result
0minusct1
1*bt1t2
2minusct3
3*bt3t4
4+t2t4t5
5=t5a

Triples — (op, arg1, arg2); results referenced by position (i)

#oparg1arg2
0minusc
1*b(0)
2minusc
3*b(2)
4+(1)(3)
5=a(4)

Indirect Triples — a list of pointers into a triple table

Statement listTriple table (as above)
(0) → 1414: minus c
(1) → 1515: * b (14)
(2) → 1616: minus c
(3) → 1717: * b (16)
(4) → 1818: + (15)(17)
(5) → 1919: = a (18)

Comparison

  • Quadruples: explicit result field → temporaries can be moved/renamed easily; best for optimization and code reorganization, but uses more space.
  • Triples: save space (no result field, refer by index), but reordering is hard because positions are referenced directly.
  • Indirect triples: keep triples' space saving while allowing reordering (just permute the pointer list) — good compromise for optimizers.
intermediate-codethree-address-code
10short6 marks

(a) What are the major issues in the design of a code generator? [3]

(b) Generate target code for the three-address statements t = a - b, u = a - c, v = t + u, d = v + u using a simple code generator and the register-descriptor / address-descriptor based getReg() function. Assume two registers are available. [3]

(a) Major Issues in Code Generator Design [3]

  1. Input to the code generator – form of the intermediate representation (3AC, trees) and assumption that semantic/type checking is done.
  2. Target program / instruction set – choice of target (absolute, relocatable, assembly) and its addressing modes.
  3. Instruction selection – choosing efficient machine instructions for IR operations.
  4. Register allocation and assignment – deciding which values reside in registers (NP-hard in general).
  5. Evaluation order – ordering computations to minimize register usage/spills.

(b) Target Code Generation [3]

For t = a-b, u = a-c, v = t+u, d = v+u with 2 registers R0, R1, using getReg() with register/address descriptors:

MOV  a, R0        ; R0 = a
SUB  b, R0        ; R0 = a - b  (t)
MOV  a, R1        ; R1 = a
SUB  c, R1        ; R1 = a - c  (u)
ADD  R1, R0       ; R0 = t + u  (v)
ADD  R1, R0       ; R0 = v + u  (d)
MOV  R0, d        ; store d

Register/address descriptors: after the first two lines R0 holds t, R1 holds u. v = t+u reuses R0 (t is dead after this) and R0 now holds v; d = v+u adds R1(u) into R0 → d in R0; finally d is stored to memory. Only the live final value d is spilled, minimizing memory traffic.

code-generation
11short5 marks

Show that the grammar E -> E + E | E * E | ( E ) | id is ambiguous by giving two distinct parse trees (or leftmost derivations) for the string id + id * id. How can operator precedence and associativity rules be used to remove this ambiguity?

Ambiguity of E -> E + E | E * E | ( E ) | id [5]

The grammar is ambiguous because the string id + id * id has two distinct parse trees / leftmost derivations:

Derivation 1 (treats + as topmost — correct precedence): E ⇒ E + E ⇒ id + E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id Tree: root +, right child is id * id. Means id + (id * id).

Derivation 2 (treats * as topmost — wrong precedence): E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id Tree: root *, left child is id + id. Means (id + id) * id.

Since two different leftmost derivations / parse trees exist for the same string, the grammar is ambiguous.

Removing Ambiguity

Use precedence (give * higher precedence than +) and associativity (make both left-associative) by introducing levels:

E -> E + T | T      ; + lowest precedence, left associative
T -> T * F | F      ; * higher precedence, left associative
F -> ( E ) | id     ; highest (parentheses / atoms)

This unambiguous grammar forces * to bind tighter than +, so id + id * id parses uniquely as id + (id * id). (Alternatively, a parser generator like YACC can be given explicit %left '+' / %left '*' precedence declarations.)

parsingambiguous-grammar
12short5 marks

Explain the following loop optimization techniques with an example for each: (a) loop-invariant code motion, (b) strength reduction, (c) induction variable elimination.

Loop Optimizations [5]

(a) Loop-Invariant Code Motion

Move computations whose operands do not change inside the loop to outside (before) the loop.

// before
for (i=0; i<n; i++)  x = y + z;  a[i] = x * i;
// after  (y+z is invariant)
x = y + z;
for (i=0; i<n; i++)  a[i] = x * i;

(b) Strength Reduction

Replace an expensive operation with an equivalent cheaper one (e.g., multiplication by repeated addition).

// before: t = 4 * i  computed each iteration
for (i=0; i<n; i++)  t = 4 * i; ...
// after: replace multiply with add
t = 0;                 // = 4*i for i=0
for (i=0; i<n; i++) { ...; t = t + 4; }

(c) Induction Variable Elimination

When several variables move in lockstep (induction variables), keep one and eliminate the others, rewriting tests in terms of the retained variable.

// i and t = 4*i are both induction variables
// before
for (i=0; i<n; i++)  a[4*i] = 0;
// after eliminate i, drive loop by t
for (t=0; t<4*n; t=t+4)  a[t] = 0;

Here the original counter i is eliminated and the loop runs directly on the derived induction variable t.

code-optimizationloop-optimization
13short5 marks

Compare LL(1), SLR(1), LALR(1), and canonical LR(1) parsers in terms of grammar class power, parsing table size, and ability to detect errors. Why is LALR(1) the most commonly used method in practical parser generators such as YACC?

Comparison of Parsers [5]

FeatureLL(1)SLR(1)LALR(1)Canonical LR(1)
DirectionTop-down, leftmostBottom-up, rightmost (reverse)Bottom-upBottom-up
Grammar class powerSmallest (LL(1) grammars; no left recursion, no ambiguity)Larger than LL(1)Between SLR and canonical LRLargest (most grammars)
Table sizeSmallSmall (same #states as LALR)Small (same states as SLR)Very large (many states)
Reduce decisionsPredictiveUses FOLLOW sets (can give conflicts)Uses precise lookahead per state (merged LR(1) states)Full LR(1) lookahead
Error detectionGood, earlyMay make a few extra reductions before errorMay make extra reductions but detects on first illegal shiftEarliest/most precise, never an erroneous shift

Hierarchy of power: LL(1) ⊂ SLR(1) ⊂ LALR(1) ⊂ LR(1). Every SLR(1) grammar is LALR(1), and every LALR(1) is LR(1).

Why LALR(1) is most used (YACC/Bison)

  • It accepts almost all practical programming-language grammars (much more powerful than SLR/LL(1)).
  • Its table has the same small number of states as SLR, far smaller than canonical LR(1), so it is memory-efficient.
  • It is obtained by merging LR(1) item sets with identical cores, giving precise lookahead at low cost.
  • The only drawback (occasional reduce-reduce conflicts from merging, and possibly one extra reduction before reporting an error) is rare in practice.

Thus LALR(1) offers the best trade-off between grammar power and table size, which is why YACC/Bison use it.

lr-parsersparser-comparison

Frequently asked questions

Where can I find the BE Computer Engineering (Pokhara University) Compiler Design (PU, CMP 422) question paper 2078?
The full BE Computer Engineering (Pokhara University) Compiler Design (PU, CMP 422) 2078 (regular) question paper is available free on Kekkei. You can read every question online and attempt the paper under timed exam conditions.
Does the Compiler Design (PU, CMP 422) 2078 paper come with solutions?
Yes. Every question on this Compiler Design (PU, CMP 422) past paper includes a step-by-step solution, plus instant AI feedback when you attempt it on Kekkei.
How many marks is the BE Computer Engineering (Pokhara University) Compiler Design (PU, CMP 422) 2078 paper?
The BE Computer Engineering (Pokhara University) Compiler Design (PU, CMP 422) 2078 paper carries 100 full marks and is meant to be completed in 180 minutes, across 13 questions.
Is practising this Compiler Design (PU, CMP 422) past paper free?
Yes — reading and attempting this Compiler Design (PU, CMP 422) past paper on Kekkei is completely free.