Part II. Simplification of Context Free Grammars
3 Simplification of CFGs
A context-free grammar (CFG) can often be transformed into a simpler but equivalent grammar that generates the same language. The usual simplification steps are:
- Eliminating useless symbols (non-generating and non-reachable symbols).
- Eliminating ε-productions (productions that produce the empty string).
- Eliminating unit productions (productions of the form A → B where both A and B are non-terminals).
3.1 Eliminating Useless Symbols
A symbol (terminal or non-terminal) of a grammar is called useful if it occurs in the derivation of some string of terminals from the start symbol. A non-terminal is useful precisely when it is both generating and reachable.
Generating. A non-terminal Y is called generating if it can derive a string consisting only of terminals. Formally, Y is generating if there exists a terminal string \\(w \in T^*\\) such that \\(Y \Rightarrow^* w\\).
Reachable. A non-terminal Y is called reachable if there is a derivation from the start symbol S that produces a sentential form containing Y. Formally, S \\(\\Rightarrow^*\\) αYβ for some strings \\(α,β\\).
If a non-terminal is not generating or not reachable, it is useless and all productions containing it may be removed. The usual procedure is two-phase:
- Find all generating non-terminals by repeatedly marking any non-terminal that has a production whose RHS consists only of terminals and already marked non-terminals.
- Find all reachable symbols by starting at the start symbol and marking every non-terminal that appears in the RHS of a production whose LHS has been marked reachable; continue until no new symbols are found.
- Remove all productions that involve a non-terminal that is not both generating and reachable; also remove any terminals that appear only with removed productions.
Worked examples
Example 1
Grammar:
S → AB | a
A → b
Start symbol: S
Analysis and simplification:
- Generating: A derives the terminal string b, and S derives the terminal string a. Symbol B has no production that leads to terminals, so B is non-generating.
- Remove any production containing the non-generating symbol B: remove S → AB.
- After removal the grammar is S → a, A → b. Now A is not reachable from S (no derivation from S uses A), so remove A → b.
- Final reduced grammar: S → a. No useless symbols remain.
Example 2
Grammar:
S → aB | bX
A → BAd | bSX | a
B → aSB | bBX
X → SBD | aBx | ad
Start symbol: S
Analysis and simplification:
- Step 1 - generating symbols: scan productions to find non-terminals that can derive terminal strings.
- A has production A → a, so A is generating.
- X has production X → ad, so X is generating.
- Since S → bX and X is generating, S is generating as well.
- B appears only in productions with non-generating patterns (its productions include B → aSB and B → bBX), and no combination gives a terminal string; thus B is non-generating.
- Remove productions containing non-generating B. Grammar becomes:
- Step 2 - reachability: start at S. X is reachable via S → bX. A is not reachable from S (no RHS from reachable non-terminals contains A), so remove A productions.
- Final reduced grammar:
Example 3
Grammar:
A → xyz | Xyzz
X → Xz | xY z
Y → yY y | Xz
Z → Zy | z
Start symbol: A
Analysis and simplification:
- A has production A → xyz and Z → z, so A and Z are generating.
- X and Y cannot derive a string of terminals because their productions always keep non-terminals that do not terminate; so X and Y are non-generating.
- Remove productions of X and Y, giving:
- Now check reachability from A; Z does not appear in any RHS reachable from A, so Z is not reachable. Remove Z productions.
- Final reduced grammar: A → xyz.
Example 4
Grammar:
S → aC | SB
A → bSCa
B → aSB | bBC
C → aBC | ad
Start symbol: S
Analysis and simplification:
- C → ad shows C is generating.
- S → aC with generating C makes S generating.
- A → bSCa contains S and C (both useful), so A is generating.
- B appears on the RHS of its own productions and there is no termination without itself; so B is non-generating. Remove productions that contain B.
- After removal the grammar becomes:
- Check reachability from S: A is not reachable. Remove A.
- Final reduced grammar:
3.2 Removal of Unit Productions
Unit production is a production of the form A → B where A and B are non-terminals. The aim is to eliminate chains of unit productions while preserving the language (except possibly the empty string if special care is required).
Standard procedure to remove unit productions:
- Compute the unit-closure relation: for each non-terminal A, find all non-terminals B such that A ⇒* B using only unit productions.
- For every pair (A, B) in this relation and for every non-unit production B → α (where α is not a single non-terminal), add the production A → α to the grammar.
- After adding those productions, remove all original unit productions.
Worked examples
Example 1
Grammar:
S → AB
A → a
B → C | b
C → D
D → E
E → a
Start symbol: S
Analysis and simplification:
- Unit chain found: B → C → D → E.
- Since E → a, propagate terminal production back along the chain: add D → a, then C → a, then B → a.
- Now remove unit productions B → C, C → D, D → E.
- After removing unit productions, remove unreachable symbols (here C, D, E are not reachable from S), leaving:
Example 2
Grammar (expression grammar fragment):
S → a | b | Sa | Sb | S0 | S1
F → S | (E)
T → F | T * F
E → T | E + T
Start symbol: E
Analysis and simplification:
- Unit productions: F → S, T → F, E → T.
- Remove F → S by adding all non-unit productions of S to F, giving F → a | b | Sa | Sb | S0 | S1 | (E).
- Remove T → F by adding non-unit productions of F to T.
- Remove E → T by adding non-unit productions of T to E.
- The resulting grammar contains no unit productions; the original structure of operator-precedence productions such as T * F and E + T is preserved.
Example 3
Grammar:
S → A | bb
A → B | b
B → S | a
Start symbol: S
Analysis and simplification:
- Unit productions: S → A, A → B, B → S. These form a unit cycle between S, A and B.
- Compute reachable terminals via unit chains:
- S → A → B → a gives S → a.
- S → A → B → S → bb gives S → bb (already present).
- Similarly propagate to obtain:
- A → a | bb | b
- B → a | bb | b
- Remove unit productions. Final grammar:
- S → a | b | bb
- A → a | b | bb
- B → a | b | bb
3.3 Removal of ε-Productions
A production of the form A → ε (where ε denotes the empty string) is an ε-production. To eliminate ε-productions while preserving the language (except possibly the empty string), follow this standard method:
- Find the set of nullable non-terminals: a non-terminal A is nullable if A ⇒* ε.
- For every production B → α, where α is a sequence of terminals and non-terminals, generate new productions by taking all possibilities of deleting some (but not all) nullable non-terminals from α. Add these new productions (excluding the case that yields an empty RHS unless the start symbol must generate ε).
- Remove all explicit ε-productions (except possibly S → ε if the language originally contained ε and one wishes to retain it; special care is needed for the start symbol).
The algorithm is usually implemented as: compute nullable set, then for each production iterate over each occurrence of nullable non-terminals and add new productions with those occurrences omitted (for all combinations). Finally, delete the original ε-productions.
Worked examples
Example 1
Grammar:
S → aA
A → b | ε
Start symbol: S
Analysis and simplification:
- Nullable set: A is nullable because A → ε.
- For production S → aA, generate an additional production by deleting nullable A: add S → a.
- Remove the ε-production A → ε. Final grammar:
Example 2
Grammar:
S → ABAC
A → aA | ε
B → bB | ε
C → c
Start symbol: S
Analysis and simplification:
- Nullable set: A and B are nullable.
- For S → ABAC consider all combinations of deleting occurrences of nullable A and B (but preserve at least one symbol on RHS unless original grammar allowed S → ε and it must be retained). The generated productions include:
- S → ABAC (original)
- S → ABC (delete the first A)
- S → BAC (delete the second A)
- S → BC (delete both A occurrences)
- S → AAC, S → AC, S → C, etc., according to combinations that arise when handling multiple nullable occurrences; include all that make sense and are not duplicates.
- Also replace A → aA by adding A → a because A is nullable.
- Similarly for B → bB add B → b.
- Remove explicit ε-productions A → ε and B → ε.
- One consistent resulting grammar is:
- S → ABAC | ABC | BAC | BC | AAC | AC | C
- A → aA | a
- B → bB | b
- C → c
which contains no ε-productions.
Example 3
Grammar:
S → aSa | bSb | ε
Start symbol: S
Analysis and simplification:
- S is nullable because S → ε.
- For each recursive production, remove nullable occurrences of S within the RHS to produce shorter alternatives:
- From S → aSa we obtain S → aa by deleting the inner S.
- From S → bSb we obtain S → bb.
- Remove the ε-production if we wish to produce a grammar without ε-productions; include the shorter alternatives instead. Final grammar:
- Note: if the original language included ε and it must be preserved, keep a distinguished start symbol S' → S | ε trick to preserve ε while removing other ε-productions.
Example 4
Grammar:
S → a | Xb | aYa
X → Y | ε
Y → b | X
Start symbol: S
Analysis and simplification:
- Nullable set: X is nullable because X → ε. Initially Y is not nullable (it produces b or X), but after substitution it may get new forms.
- Replace occurrences of X by ε where appropriate:
- From S → Xb we obtain S → b (deleting X), and keep S → Xb as well if X remains a non-terminal option before final removal.
- Propagate transformations through the grammar and compute nullable again if needed.
- Eventually eliminate explicit ε productions and tidy up repeated or redundant productions. One suitable result after full elimination is:
- S → a | Xb | aYa | b | aa
- X → Y
- Y → b | X
- After further removal of unit productions and unreachable symbols the grammar can be simplified even more (follow the unit-removal and useless symbol algorithms to finish).
Practical notes and common pitfalls
- When eliminating ε-productions special care is needed if the start symbol can derive ε; typical solution is to introduce a fresh start symbol S' with S' → S | ε and then eliminate other ε-productions.
- Order of simplification matters for convenience but not correctness: a common workflow is (1) remove useless symbols, (2) remove ε-productions, (3) remove unit productions, and finally again remove any newly created useless symbols.
- Always check for reachability after each major transformation because removal of productions can make previously reachable non-terminals unreachable.
- When adding productions during ε or unit elimination avoid creating duplicate productions; duplicates do not change the language but clutter the grammar.
Exercises
Simplify the following context-free grammars (remove useless symbols, ε-productions and unit productions as appropriate; show intermediate steps):
S → AB
A → a
B → b
B → C
E → c | ε
Start symbol: S
S → aAa
A → Sb | bCC | DaA
C → abb | DD
E → aC
D → aDA
Start symbol: S
S → bS | bA | ε
A → ε
Start symbol: S
S → bS | AB
A → ε
B → ε
D → a
Start symbol: S
S → A
A → B
B → C
C → a
Start symbol: S
S → AB
A → a
B → C | b
C → D
D → E
E → a
Start symbol: S
S → ABCD
A → a
B → C | b
C → D
D → c
Start symbol: S
S → aS | AC | AB
A → ε
C → ε
B → bB | bb
D → c
Start symbol: S
S → ABC | A0A
A → 0A | B0C | 000 | B
B → 1B1 | 0 | D
C → CA | AC
D → ε
Start symbol: S
S → AAA | B
A → 0A | B
B → ε
Start symbol: S
End of chapter: Simplification of Context Free Grammars.