Introduction
Symbol table is a central data structure created and maintained by a compiler to record and retrieve semantic information about names (identifiers) that appear in the source program. It stores information such as scope and binding of identifiers, types, storage locations, and other attributes for variables, functions, classes, labels and compiler-generated temporaries. The symbol table is built during the front-end phases (primarily lexical, syntax and semantic analysis) and is consulted by later phases (intermediate-code generation, optimisation and target-code generation).
Role of the symbol table in compiler phases
- Lexical analysis: Records new identifiers and literal tokens as they are recognised; creates initial entries when names first appear.
- Syntax analysis: Adds structural and declaration information to entries (for example, that an identifier is a function or an array, and its declaration context).
- Semantic analysis: Uses stored information to perform type checking, consistency checks, and to resolve declarations and uses; updates entries with inferred attributes.
- Intermediate-code generation: Consults the table for storage size, type and addressing information; records temporaries and their attributes.
- Code optimisation: Uses attributes such as constness, lifetime, and aliasing information to enable optimisations (constant folding, dead-code elimination, register allocation, etc.).
- Target-code generation: Uses address, offset and storage-class information in the symbol table to generate correct load/store instructions and call sequences.
What is stored in a symbol table
Each symbol table entry contains attributes that the compiler requires for correct translation and optimisation. Typical items and attributes recorded are:
- Names of identifiers: variable names, constant names, procedure and function names, type names.
- Literal constants and string literals.
- Compiler-generated temporaries and labels.
- Kind of entity: variable, function, type, constant, label, parameter, record/structure field.
- Data type and type descriptor (primitive type, array dimensions, record layout, pointer target, function signature).
- Declaring scope or procedure; lexical level or scope nesting level.
- Storage information: offset within activation record, absolute address, storage class (static, automatic, external), size and alignment.
- Parameter passing method (call-by-value, call-by-reference), number and types of parameters.
- Pointer to structure/layout table when the identifier is a record/structure.
- Source information: line(s) of declaration and first use, occurrence count for warnings.
- Attributes and flags: const, volatile, static, extern, inline, visibility (public/private), linkage name (mangled name when needed).
Basic operations on a symbol table
The symbol table must efficiently support a small set of operations that are repeatedly invoked during compilation:
- Insert(name, attributes): Create a new entry for an identifier with its attributes. The compiler must detect re-declaration errors if the name already exists in the relevant scope.
- Lookup(name): Find and return the symbol entry for a name, following scope rules (current scope first, then enclosing scopes when applicable).
- Delete(name): Remove or mark an entry when leaving a scope (or when explicitly removing temporary symbols).
- EnterScope(): Prepare the symbol table mechanism to begin a new (nested) scope-often implemented by pushing a new table or scope marker.
- ExitScope(): Remove all entries belonging to the current scope-often implemented by popping the top scope table and restoring the previous one.
- Update(entry, attribute): Modify attributes of an existing entry (for example, to add type information discovered later).
Common implementation techniques
Several data structures are used in practice to implement symbol tables. Choice depends on expected size, required operation times and memory constraints.
1. List (array-based)
- An array stores entries sequentially. A pointer (for example, available) indicates the end of occupied slots; new entries are appended.
- Insertion is fast: O(1) on average when appending.
- Lookup requires scanning the array from the start (or from the most recent entries) and is O(n) in the number of entries.
- Space overhead is low; this method is simple and sometimes used for small scopes or for educational implementations.
- Compiler must check for duplicate declarations in the current scope during insertion.
2. Linked list
- Each record has a link field and entries are chained; a pointer First points to the first record.
- Insertion at the front is O(1); lookup is O(n), requiring traversal of the links.
- Linked lists can easily support scope entry/exit by recording the list head for each scope or by linking new scope entries at the front.
3. Hash table
- Two tables are used: a hash table (array of buckets or indices) and the symbol-entry storage. The hash table maps a name to an index or bucket where the corresponding symbol entries are stored.
- Average-time complexity for insertion and lookup is O(1).
- Collision resolution methods include chaining (each bucket stores a list of entries) and open addressing (linear probing, quadratic probing, double hashing).
- Hashing is the most commonly used approach in real compilers because of its speed; it requires a good hash function and handling of collisions and resizing (rehashing) when load factor grows.
- Disadvantages: more complex implementation, space overhead for hash buckets, and degradation if hash function is poor.
4. Binary search tree (BST)
- Entries are organised in a BST keyed by name; each node has left and right pointers.
- Average insertion and lookup times are O(log₂ n). Worst-case degenerates to O(n) unless a balanced tree is used.
- Balanced BSTs (AVL trees, red-black trees) guarantee O(log n) worst-case time and are an alternative when ordered traversal of names is required.
Scope management and organisation of symbol tables
Languages with nested or block scopes require mechanisms to ensure correct visibility and lifetime of names. Common organisational techniques:
- Single table with scope markers: A single table stores all entries; scope entry pushes a marker, and scope exit pops entries until marker is reached.
- Separate table per scope (stack of scopes): A new table is created on EnterScope() and pushed on a stack; lookup searches the top table first then proceeds down the stack.
- Scope trees / symbol-table trees: Each scope has a node containing a table and a pointer to its parent scope; lookup follows parent links for enclosing scopes.
- Displays and activation-record links: For fast access to non-local variables in nested procedures, techniques such as displays or static/dynamic chain links are used together with the symbol table information.
- Handling overloading and namespaces: Entries record distinguishing attributes (parameter types, namespace) and lookup uses these to resolve the correct entity.
- Dynamic vs static scoping: Symbol-table organisation differs when a language uses dynamic scoping (lookup follows runtime call chain) versus static (lexical) scoping (lookup follows lexical nesting).
Attributes and representation of a symbol-table entry
A typical symbol-table entry contains structured fields. Representative fields are:
- Identifier name
- Kind (variable, function, parameter, type, constant, label)
- Type descriptor (data type, array dimensions, record layout, function signature)
- Scope information (scope id, lexical level, pointer to the scope record)
- Storage information (offset within activation record or static address, size, alignment)
- Parameter list (for functions: number, types and passing modes of parameters)
- Flags (const, static, extern, volatile, inline, visibility)
- Auxiliary pointers (pointer to type/structure table, pointer to symbol table for fields of a record)
- Source locations (line number of declaration, lines of first use)
Symbol table usage in optimisation and code generation
- The table provides lifetime and usage information required by register allocation and for determining whether a variable can be placed in a register or must be stored in memory.
- Alias and const information in the table enable optimisations such as constant propagation and dead-code elimination.
- Function signatures and calling-convention attributes are used to generate correct call/return sequences and to layout activation records.
- During link-time and separate compilation, symbol-table entries contribute to symbol resolution and generation of relocation records or linkage information.
Example: simple C-like fragment and its symbol entries
Consider this C-like snippet:
int x; int f(int y) { int z; z = x + y; return z; }
Representative symbol-table entries (one per line; fields abbreviated):
- x - kind: variable; type: int; scope: global; storage: static/absolute; offset: <address or 0 relative to data segment>.
- f - kind: function; return type: int; scope: global; parameters: (y: int); entry point: <code address>.
- y - kind: parameter; type: int; declaring procedure: f; scope: f's activation record; passing mode: by value; offset: <offset in AR>.
- z - kind: local variable; type: int; declaring procedure: f; scope: f's activation record; offset: <offset in AR>.
Design considerations and trade-offs
- Time versus space: Hash tables give very fast average-time operations but require additional space and careful collision handling; lists are space-efficient but slow for large scopes.
- Complexity guarantees: If worst-case bounds are required, balanced BSTs or specialised hashing with guaranteed bounds are chosen.
- Scope frequency: Many languages have many short-lived scopes; implementations that quickly enter/exit scopes (stacked tables or scope markers) are preferred.
- Separate compilation and linking: Symbol tables must interoperate with linker symbol tables and emit appropriate debug/relocation information.
- Concurrency and incremental compilation: Modern build systems may require thread-safe or persistent symbol-table representations for fast incremental rebuilds.
Practical notes and common optimisations
- Use chaining in hash tables to simplify insertion and deletion when scopes are frequently created and destroyed.
- Store common attributes by reference (pointers to type descriptors or structure templates) to reduce duplication.
- Maintain a small, fast cache for recent lookups to exploit locality of reference when parsing and analysing declarations and uses.
- Record occurrence counts or use/use-def chains for enabling later optimisation passes and for producing useful diagnostics and warnings.
Summary
The symbol table is a fundamental component of a compiler front end that supports name binding, type checking, storage management and later optimisation and code generation. Implementations balance lookup/insertion speed and memory usage using data structures such as arrays (lists), linked lists, hash tables and binary search trees. Proper scope management and well-chosen attributes enable the compiler to perform correct translation and to apply effective optimisations.