Lexical Analysis Notes | EduRev

: Lexical Analysis Notes | EduRev

 Page 1


9/26/2008
1
Lexical
David Notkin
Autumn Quarter 2008
Analysis
Survey (partial results)
Excited [selected responses]
• My dad says it's the epitome of Computer 
Science.
• I want to understand how a compiler works. 
See behind the scenes and learn something 
about software engineering.
• I really liked 341, and this seems like a way 
to understand more of what goes on behind 
programming languages.
• I am excited about the project and actually 
implementing some of the theoretical 
knowledge.
• Haven't heard much, but I'm thinking it will 
have to do a lot with the stuff we learned in 
322 such as grammars and regular 
expressions.
• Sadly, I haven't heard much, but I enjoyed 
the first lecture. 
Concerned [selected responses]
• Very difficult projects
• High work load and difficult project
• Some think it's a bunch of pointless theory. 
• I am worried about getting too buried in 
theory that isn't shown in an applicable way. 
Quite a bit of work and complicated. 
• The projects are a lot of work, especially 
when you are taking 3 other courses (2 of 
which are CSE
• I've heard the projects can be time-
consuming
• I've heard that most of the work is figuring 
what the preexisting project code does than 
applying the theory we learn. While that is a 
useful skill, I hope to get more out of the 
projects. 
CSE401 Au08 2
Survey (partial results): interests…
• I was interested in how compilers are able to take the Fibonacci function 
recursively written and optimize it into a for loop without the user knowing it. I 
see now optimizations aren't part of the class, but I would like to learn maybe 
about just what optimizations are used in compilers today. 
• No specific ones yet. I suppose I'm a little curious about how ML (and similar 
statically typed languages) do type inference. 
• I am especially interested in JIT compiling. 
• Effectively parsing text is something I've been curious about, so i'm looking 
forward to that. Also getting to the computer to recognize the meaning of text 
looks interesting. 
• What is being done in the field of compilers that try to make code more secure? 
• I'm interested in seeing how theory connects to practice. 
• I'm not sure if this relates to this course, but I'm interested in learning how 
scripting languages are compiled since we don't go through a build process with 
them as we do with Java or C, if it's a different process. 
CSE401 Au08 3
Question
•IF I were to offer the following option, might you be 
interested in it?
– A reduced project (that is, not all of the extensions to 
MiniJava but still enough to learn about key issues in 
compiling) and
– more substantial “homework”
• If you are interested in this option, send email either directly to 
me or, if you prefer, to the mailing list before lecture on Monday
• I will decide if this is feasible by Wednesday?s lecture
– It?s not a vote, it?s my decision: articulate arguments, in 
addition to degree of interest, will help inform my decision
CSE401 Au08 4
Page 2


9/26/2008
1
Lexical
David Notkin
Autumn Quarter 2008
Analysis
Survey (partial results)
Excited [selected responses]
• My dad says it's the epitome of Computer 
Science.
• I want to understand how a compiler works. 
See behind the scenes and learn something 
about software engineering.
• I really liked 341, and this seems like a way 
to understand more of what goes on behind 
programming languages.
• I am excited about the project and actually 
implementing some of the theoretical 
knowledge.
• Haven't heard much, but I'm thinking it will 
have to do a lot with the stuff we learned in 
322 such as grammars and regular 
expressions.
• Sadly, I haven't heard much, but I enjoyed 
the first lecture. 
Concerned [selected responses]
• Very difficult projects
• High work load and difficult project
• Some think it's a bunch of pointless theory. 
• I am worried about getting too buried in 
theory that isn't shown in an applicable way. 
Quite a bit of work and complicated. 
• The projects are a lot of work, especially 
when you are taking 3 other courses (2 of 
which are CSE
• I've heard the projects can be time-
consuming
• I've heard that most of the work is figuring 
what the preexisting project code does than 
applying the theory we learn. While that is a 
useful skill, I hope to get more out of the 
projects. 
CSE401 Au08 2
Survey (partial results): interests…
• I was interested in how compilers are able to take the Fibonacci function 
recursively written and optimize it into a for loop without the user knowing it. I 
see now optimizations aren't part of the class, but I would like to learn maybe 
about just what optimizations are used in compilers today. 
• No specific ones yet. I suppose I'm a little curious about how ML (and similar 
statically typed languages) do type inference. 
• I am especially interested in JIT compiling. 
• Effectively parsing text is something I've been curious about, so i'm looking 
forward to that. Also getting to the computer to recognize the meaning of text 
looks interesting. 
• What is being done in the field of compilers that try to make code more secure? 
• I'm interested in seeing how theory connects to practice. 
• I'm not sure if this relates to this course, but I'm interested in learning how 
scripting languages are compiled since we don't go through a build process with 
them as we do with Java or C, if it's a different process. 
CSE401 Au08 3
Question
•IF I were to offer the following option, might you be 
interested in it?
– A reduced project (that is, not all of the extensions to 
MiniJava but still enough to learn about key issues in 
compiling) and
– more substantial “homework”
• If you are interested in this option, send email either directly to 
me or, if you prefer, to the mailing list before lecture on Monday
• I will decide if this is feasible by Wednesday?s lecture
– It?s not a vote, it?s my decision: articulate arguments, in 
addition to degree of interest, will help inform my decision
CSE401 Au08 4
9/26/2008
2
Scanning a.k.a. lexing: purpose
• Turn the character stream that represents the source 
program into a token stream
– In general, it should be an efficient phase of 
compilation
• A token is a group of characters forming an atomic 
unit of syntax, such as a identifier, number, etc.
• White space comprises
– characters between tokens that are ignored 
– they contribute to the human communication 
aspects of the program, but do not change the 
semantics of its execution
CSE401 Au08 5
Separating lexing from parsing
• Lexing can be represented and implemented as part 
of syntactic analysis
– Regular expressions are a proper subset of 
context free grammars
• But this is rarely if ever done: separating concerns 
tends to be a better design 
• Simplifies scanner and parser
– Scanner handles I/O and machine dependencies, 
needn?t know language syntax
– Recognizing regular expressions is much faster 
than parsing context free grammars
– Parser can focus on syntactic structure
CSE401 Au08 6
Lexical design
• Most languages 
are free form
– Layout doesn?t 
matter (to the 
computer – see 
obfuscated code 
example on right)
– White space 
separates tokens
• But some 
languages are 
more constrained 
lexically
#include <stdio.h>
main(t,_,a)char *a;{return!0<t?t<3?main(-79,-
13,a+main(-87,1-_,
main(-86,0,a+1)+a)):1,t<_?main(t+1,_,a):3,main(-
94,-27+t,a)&&t==2?_<13?
main(2,_+1,"%s %d %d\n"):9:16:t<0?t<-
72?main(_,t,
"@n'+,#'/*{}w+/w#cdnr/+,{}r/*de}+,/*{*+,/w{%+,/w
#q#n+,/#{l,+,/n{n+,/+#n+,/#\
;#q#n+,/+k#;*+,/'r :'d*'3,}{w+K w'K:'+}e#';dq#'l
\
q#'+d'K#!/+k#;q#'r}eKK#}w'r}eKK{nl]'/#;#q#n'){)#
}w'){){nl]'/+#n';d}rw' i;# \
){nl]!/n{n#'; r{#w'r nc{nl]'/#{l,+'K {rw' 
iK{;[{nl]'/w#q#n'wk nw' \
iwk{KK{nl]!/w{%'l##w#' i; 
:{nl]'/*{q#'ld;r'}{nlwb!/*de}'c \
;;{nl'-
{}rw]'/+,}##'*}#nc,',#nw]'/+kd'+e}+;#'rdq#w! 
nr'/ ') }+}{rl#'{n' ')# \
}'+}##(!!/")
:t<-50?_==*a?putchar(31[a]):main(-
65,_,a+1):main((*a=='/')+t,_,a+1)
:0<t?main(2,2,"%s"):*a=='/'||main(0,main(-
61,*a,
"!ek;dc i@bK'(q)-[w]*%n+r3#l,{}:\nuwloca-O;m
.vpbks,fxntdCeghiry"),a+1);}
CSE401 Au08 7
Ex: Fortran
• Data cards
• Comment cards: first character is „C?
• Statement cards
– First five characters are an optional statement 
number
– Sixth character is a continuation character – any 
character other than „0? indicates that this 
continues the statement from the previous card
– Characters 7 through 72 are source code
– Characters 73 through 80 are optional and have 
no meaning with respect to the program per se
CSE401 Au08 8
Page 3


9/26/2008
1
Lexical
David Notkin
Autumn Quarter 2008
Analysis
Survey (partial results)
Excited [selected responses]
• My dad says it's the epitome of Computer 
Science.
• I want to understand how a compiler works. 
See behind the scenes and learn something 
about software engineering.
• I really liked 341, and this seems like a way 
to understand more of what goes on behind 
programming languages.
• I am excited about the project and actually 
implementing some of the theoretical 
knowledge.
• Haven't heard much, but I'm thinking it will 
have to do a lot with the stuff we learned in 
322 such as grammars and regular 
expressions.
• Sadly, I haven't heard much, but I enjoyed 
the first lecture. 
Concerned [selected responses]
• Very difficult projects
• High work load and difficult project
• Some think it's a bunch of pointless theory. 
• I am worried about getting too buried in 
theory that isn't shown in an applicable way. 
Quite a bit of work and complicated. 
• The projects are a lot of work, especially 
when you are taking 3 other courses (2 of 
which are CSE
• I've heard the projects can be time-
consuming
• I've heard that most of the work is figuring 
what the preexisting project code does than 
applying the theory we learn. While that is a 
useful skill, I hope to get more out of the 
projects. 
CSE401 Au08 2
Survey (partial results): interests…
• I was interested in how compilers are able to take the Fibonacci function 
recursively written and optimize it into a for loop without the user knowing it. I 
see now optimizations aren't part of the class, but I would like to learn maybe 
about just what optimizations are used in compilers today. 
• No specific ones yet. I suppose I'm a little curious about how ML (and similar 
statically typed languages) do type inference. 
• I am especially interested in JIT compiling. 
• Effectively parsing text is something I've been curious about, so i'm looking 
forward to that. Also getting to the computer to recognize the meaning of text 
looks interesting. 
• What is being done in the field of compilers that try to make code more secure? 
• I'm interested in seeing how theory connects to practice. 
• I'm not sure if this relates to this course, but I'm interested in learning how 
scripting languages are compiled since we don't go through a build process with 
them as we do with Java or C, if it's a different process. 
CSE401 Au08 3
Question
•IF I were to offer the following option, might you be 
interested in it?
– A reduced project (that is, not all of the extensions to 
MiniJava but still enough to learn about key issues in 
compiling) and
– more substantial “homework”
• If you are interested in this option, send email either directly to 
me or, if you prefer, to the mailing list before lecture on Monday
• I will decide if this is feasible by Wednesday?s lecture
– It?s not a vote, it?s my decision: articulate arguments, in 
addition to degree of interest, will help inform my decision
CSE401 Au08 4
9/26/2008
2
Scanning a.k.a. lexing: purpose
• Turn the character stream that represents the source 
program into a token stream
– In general, it should be an efficient phase of 
compilation
• A token is a group of characters forming an atomic 
unit of syntax, such as a identifier, number, etc.
• White space comprises
– characters between tokens that are ignored 
– they contribute to the human communication 
aspects of the program, but do not change the 
semantics of its execution
CSE401 Au08 5
Separating lexing from parsing
• Lexing can be represented and implemented as part 
of syntactic analysis
– Regular expressions are a proper subset of 
context free grammars
• But this is rarely if ever done: separating concerns 
tends to be a better design 
• Simplifies scanner and parser
– Scanner handles I/O and machine dependencies, 
needn?t know language syntax
– Recognizing regular expressions is much faster 
than parsing context free grammars
– Parser can focus on syntactic structure
CSE401 Au08 6
Lexical design
• Most languages 
are free form
– Layout doesn?t 
matter (to the 
computer – see 
obfuscated code 
example on right)
– White space 
separates tokens
• But some 
languages are 
more constrained 
lexically
#include <stdio.h>
main(t,_,a)char *a;{return!0<t?t<3?main(-79,-
13,a+main(-87,1-_,
main(-86,0,a+1)+a)):1,t<_?main(t+1,_,a):3,main(-
94,-27+t,a)&&t==2?_<13?
main(2,_+1,"%s %d %d\n"):9:16:t<0?t<-
72?main(_,t,
"@n'+,#'/*{}w+/w#cdnr/+,{}r/*de}+,/*{*+,/w{%+,/w
#q#n+,/#{l,+,/n{n+,/+#n+,/#\
;#q#n+,/+k#;*+,/'r :'d*'3,}{w+K w'K:'+}e#';dq#'l
\
q#'+d'K#!/+k#;q#'r}eKK#}w'r}eKK{nl]'/#;#q#n'){)#
}w'){){nl]'/+#n';d}rw' i;# \
){nl]!/n{n#'; r{#w'r nc{nl]'/#{l,+'K {rw' 
iK{;[{nl]'/w#q#n'wk nw' \
iwk{KK{nl]!/w{%'l##w#' i; 
:{nl]'/*{q#'ld;r'}{nlwb!/*de}'c \
;;{nl'-
{}rw]'/+,}##'*}#nc,',#nw]'/+kd'+e}+;#'rdq#w! 
nr'/ ') }+}{rl#'{n' ')# \
}'+}##(!!/")
:t<-50?_==*a?putchar(31[a]):main(-
65,_,a+1):main((*a=='/')+t,_,a+1)
:0<t?main(2,2,"%s"):*a=='/'||main(0,main(-
61,*a,
"!ek;dc i@bK'(q)-[w]*%n+r3#l,{}:\nuwloca-O;m
.vpbks,fxntdCeghiry"),a+1);}
CSE401 Au08 7
Ex: Fortran
• Data cards
• Comment cards: first character is „C?
• Statement cards
– First five characters are an optional statement 
number
– Sixth character is a continuation character – any 
character other than „0? indicates that this 
continues the statement from the previous card
– Characters 7 through 72 are source code
– Characters 73 through 80 are optional and have 
no meaning with respect to the program per se
CSE401 Au08 8
9/26/2008
3
Ex: Haskell
• “[I]ndentation … is important. Haskell uses a system 
called „layout? to structure its code (… Python uses a 
similar system). The layout system allows you to 
write code without the explicit semicolons and braces 
that other languages like C and Java require.”
-- Hal Daumé III
• Tabs and spaces can cause confusion
main = let dolly = breedSheep
in do args <- getArgs
print $ traceFamily dolly (map getFunctionByName args)
CSE401 Au08 9
Definitions
• Pattern: a definition of a related set of lexical entities
– Ex: all sequences of numeric characters, all 
sequences of alphanumeric characters starting 
with an alphabetic character
– Regular expressions are used in practice to define 
patterns
• Lexeme: group of characters that matches a pattern
– Ex: „1234?, „43204222?, „snork?, „f0rk?
• Token: class of lexemes matching a pattern, 
distinguished by an attribute
– Ex: „snork? and „f0rk? are both identifier lexemes 
with the actual names kept as an attribute
CSE401 Au08 10
Languages: quick reminder
• Alphabet: finite set of characters and symbols
• String: a finite (possibly empty) sequence of 
characters from an alphabet
• Language: a (possibly empty or infinite) set of strings 
• Grammar: a finite specification for a set of strings
• Language automaton: an abstract machine that 
accepts all strings in a given language and only those
• A language can be specified by many different 
grammars and automata
• A grammar or automaton specifies a single language
CSE401 Au08 11
Language (Chomsky) hierarchy:
quick reminder
• Regular (Type-3) languages are 
specified by regular 
expressions/grammars and 
finite automata (FSAs)
• Context-free (Type-2) 
languages are specified by 
context-free grammars and 
pushdown automata (PDAs)
• Context-sensitive (Type-1) 
languages … aren?t too 
important
• Recursively-enumerable (Type-
0) languages are specified by 
general grammars and Turing 
machines
Turing
CSL
CFL
Regular
One distinction among the levels is what 
is allowed on the right hand and on the 
left hand sides of grammar rules
CSE401 Au08 12
Page 4


9/26/2008
1
Lexical
David Notkin
Autumn Quarter 2008
Analysis
Survey (partial results)
Excited [selected responses]
• My dad says it's the epitome of Computer 
Science.
• I want to understand how a compiler works. 
See behind the scenes and learn something 
about software engineering.
• I really liked 341, and this seems like a way 
to understand more of what goes on behind 
programming languages.
• I am excited about the project and actually 
implementing some of the theoretical 
knowledge.
• Haven't heard much, but I'm thinking it will 
have to do a lot with the stuff we learned in 
322 such as grammars and regular 
expressions.
• Sadly, I haven't heard much, but I enjoyed 
the first lecture. 
Concerned [selected responses]
• Very difficult projects
• High work load and difficult project
• Some think it's a bunch of pointless theory. 
• I am worried about getting too buried in 
theory that isn't shown in an applicable way. 
Quite a bit of work and complicated. 
• The projects are a lot of work, especially 
when you are taking 3 other courses (2 of 
which are CSE
• I've heard the projects can be time-
consuming
• I've heard that most of the work is figuring 
what the preexisting project code does than 
applying the theory we learn. While that is a 
useful skill, I hope to get more out of the 
projects. 
CSE401 Au08 2
Survey (partial results): interests…
• I was interested in how compilers are able to take the Fibonacci function 
recursively written and optimize it into a for loop without the user knowing it. I 
see now optimizations aren't part of the class, but I would like to learn maybe 
about just what optimizations are used in compilers today. 
• No specific ones yet. I suppose I'm a little curious about how ML (and similar 
statically typed languages) do type inference. 
• I am especially interested in JIT compiling. 
• Effectively parsing text is something I've been curious about, so i'm looking 
forward to that. Also getting to the computer to recognize the meaning of text 
looks interesting. 
• What is being done in the field of compilers that try to make code more secure? 
• I'm interested in seeing how theory connects to practice. 
• I'm not sure if this relates to this course, but I'm interested in learning how 
scripting languages are compiled since we don't go through a build process with 
them as we do with Java or C, if it's a different process. 
CSE401 Au08 3
Question
•IF I were to offer the following option, might you be 
interested in it?
– A reduced project (that is, not all of the extensions to 
MiniJava but still enough to learn about key issues in 
compiling) and
– more substantial “homework”
• If you are interested in this option, send email either directly to 
me or, if you prefer, to the mailing list before lecture on Monday
• I will decide if this is feasible by Wednesday?s lecture
– It?s not a vote, it?s my decision: articulate arguments, in 
addition to degree of interest, will help inform my decision
CSE401 Au08 4
9/26/2008
2
Scanning a.k.a. lexing: purpose
• Turn the character stream that represents the source 
program into a token stream
– In general, it should be an efficient phase of 
compilation
• A token is a group of characters forming an atomic 
unit of syntax, such as a identifier, number, etc.
• White space comprises
– characters between tokens that are ignored 
– they contribute to the human communication 
aspects of the program, but do not change the 
semantics of its execution
CSE401 Au08 5
Separating lexing from parsing
• Lexing can be represented and implemented as part 
of syntactic analysis
– Regular expressions are a proper subset of 
context free grammars
• But this is rarely if ever done: separating concerns 
tends to be a better design 
• Simplifies scanner and parser
– Scanner handles I/O and machine dependencies, 
needn?t know language syntax
– Recognizing regular expressions is much faster 
than parsing context free grammars
– Parser can focus on syntactic structure
CSE401 Au08 6
Lexical design
• Most languages 
are free form
– Layout doesn?t 
matter (to the 
computer – see 
obfuscated code 
example on right)
– White space 
separates tokens
• But some 
languages are 
more constrained 
lexically
#include <stdio.h>
main(t,_,a)char *a;{return!0<t?t<3?main(-79,-
13,a+main(-87,1-_,
main(-86,0,a+1)+a)):1,t<_?main(t+1,_,a):3,main(-
94,-27+t,a)&&t==2?_<13?
main(2,_+1,"%s %d %d\n"):9:16:t<0?t<-
72?main(_,t,
"@n'+,#'/*{}w+/w#cdnr/+,{}r/*de}+,/*{*+,/w{%+,/w
#q#n+,/#{l,+,/n{n+,/+#n+,/#\
;#q#n+,/+k#;*+,/'r :'d*'3,}{w+K w'K:'+}e#';dq#'l
\
q#'+d'K#!/+k#;q#'r}eKK#}w'r}eKK{nl]'/#;#q#n'){)#
}w'){){nl]'/+#n';d}rw' i;# \
){nl]!/n{n#'; r{#w'r nc{nl]'/#{l,+'K {rw' 
iK{;[{nl]'/w#q#n'wk nw' \
iwk{KK{nl]!/w{%'l##w#' i; 
:{nl]'/*{q#'ld;r'}{nlwb!/*de}'c \
;;{nl'-
{}rw]'/+,}##'*}#nc,',#nw]'/+kd'+e}+;#'rdq#w! 
nr'/ ') }+}{rl#'{n' ')# \
}'+}##(!!/")
:t<-50?_==*a?putchar(31[a]):main(-
65,_,a+1):main((*a=='/')+t,_,a+1)
:0<t?main(2,2,"%s"):*a=='/'||main(0,main(-
61,*a,
"!ek;dc i@bK'(q)-[w]*%n+r3#l,{}:\nuwloca-O;m
.vpbks,fxntdCeghiry"),a+1);}
CSE401 Au08 7
Ex: Fortran
• Data cards
• Comment cards: first character is „C?
• Statement cards
– First five characters are an optional statement 
number
– Sixth character is a continuation character – any 
character other than „0? indicates that this 
continues the statement from the previous card
– Characters 7 through 72 are source code
– Characters 73 through 80 are optional and have 
no meaning with respect to the program per se
CSE401 Au08 8
9/26/2008
3
Ex: Haskell
• “[I]ndentation … is important. Haskell uses a system 
called „layout? to structure its code (… Python uses a 
similar system). The layout system allows you to 
write code without the explicit semicolons and braces 
that other languages like C and Java require.”
-- Hal Daumé III
• Tabs and spaces can cause confusion
main = let dolly = breedSheep
in do args <- getArgs
print $ traceFamily dolly (map getFunctionByName args)
CSE401 Au08 9
Definitions
• Pattern: a definition of a related set of lexical entities
– Ex: all sequences of numeric characters, all 
sequences of alphanumeric characters starting 
with an alphabetic character
– Regular expressions are used in practice to define 
patterns
• Lexeme: group of characters that matches a pattern
– Ex: „1234?, „43204222?, „snork?, „f0rk?
• Token: class of lexemes matching a pattern, 
distinguished by an attribute
– Ex: „snork? and „f0rk? are both identifier lexemes 
with the actual names kept as an attribute
CSE401 Au08 10
Languages: quick reminder
• Alphabet: finite set of characters and symbols
• String: a finite (possibly empty) sequence of 
characters from an alphabet
• Language: a (possibly empty or infinite) set of strings 
• Grammar: a finite specification for a set of strings
• Language automaton: an abstract machine that 
accepts all strings in a given language and only those
• A language can be specified by many different 
grammars and automata
• A grammar or automaton specifies a single language
CSE401 Au08 11
Language (Chomsky) hierarchy:
quick reminder
• Regular (Type-3) languages are 
specified by regular 
expressions/grammars and 
finite automata (FSAs)
• Context-free (Type-2) 
languages are specified by 
context-free grammars and 
pushdown automata (PDAs)
• Context-sensitive (Type-1) 
languages … aren?t too 
important
• Recursively-enumerable (Type-
0) languages are specified by 
general grammars and Turing 
machines
Turing
CSL
CFL
Regular
One distinction among the levels is what 
is allowed on the right hand and on the 
left hand sides of grammar rules
CSE401 Au08 12
9/26/2008
4
Regular Expressions:
defined inductively
• Base cases
– Empty string ( )
– Symbol from the 
alphabet
• Inductive cases
– Concatenation: E
1
E
2
– Alternation E
1
| E
2
– Kleene closure: E*
• Parentheses for 
grouping
• Precedence: * is 
highest, then 
concatenate, | is lowest
• White space not 
significant
CSE401 Au08 13
Notational Conveniences:
no additional expressive power
• E+ 1 or more occurrences of E
• Ek exactly k occurrences of E
• [E] 0 or 1 occurrences of E
• {E} E*
• not(x) any character in 
alphabet except x
• not(E) any strings from 
alphabet except those in E
• E1-E2 any string matching E1 
that?s not in E2
• May name regular expressions 
• Ex:
– letter ::= a | b | ... | z
– digit  ::= 0 | 1 | ... | 9
– alphanum ::= letter | num
• Recursive definitions prohibited
• Produce simple regular 
expression from named regular 
expressions by “macro 
expansion”
CSE401 Au08 14
Examples
• Identifiers
– ident ::= letter ( digit | letter)*
• Integer constants
– integer ::= digit+
– sign ::= + | -
– signed_int ::= [sign] integer 
• Real numbers
– real ::= signed_int
[fraction] [exponent]
– fraction ::= . digit+
– exponent ::= (E | e) signed_int
CSE401 Au08 15
More Examples
• String and character constants
– string ::= " char* "
– character ::= ' char '
– char ::= not(" | ' | \) | escape
– escape  ::= 
\(" | ' | \ | n | r | t  | v | b | a )
• White space
– whitespace ::=
<space> | <tab> | <newline> | 
comment
– comment ::= /*  not(*/)  */
CSE401 Au08 16
Page 5


9/26/2008
1
Lexical
David Notkin
Autumn Quarter 2008
Analysis
Survey (partial results)
Excited [selected responses]
• My dad says it's the epitome of Computer 
Science.
• I want to understand how a compiler works. 
See behind the scenes and learn something 
about software engineering.
• I really liked 341, and this seems like a way 
to understand more of what goes on behind 
programming languages.
• I am excited about the project and actually 
implementing some of the theoretical 
knowledge.
• Haven't heard much, but I'm thinking it will 
have to do a lot with the stuff we learned in 
322 such as grammars and regular 
expressions.
• Sadly, I haven't heard much, but I enjoyed 
the first lecture. 
Concerned [selected responses]
• Very difficult projects
• High work load and difficult project
• Some think it's a bunch of pointless theory. 
• I am worried about getting too buried in 
theory that isn't shown in an applicable way. 
Quite a bit of work and complicated. 
• The projects are a lot of work, especially 
when you are taking 3 other courses (2 of 
which are CSE
• I've heard the projects can be time-
consuming
• I've heard that most of the work is figuring 
what the preexisting project code does than 
applying the theory we learn. While that is a 
useful skill, I hope to get more out of the 
projects. 
CSE401 Au08 2
Survey (partial results): interests…
• I was interested in how compilers are able to take the Fibonacci function 
recursively written and optimize it into a for loop without the user knowing it. I 
see now optimizations aren't part of the class, but I would like to learn maybe 
about just what optimizations are used in compilers today. 
• No specific ones yet. I suppose I'm a little curious about how ML (and similar 
statically typed languages) do type inference. 
• I am especially interested in JIT compiling. 
• Effectively parsing text is something I've been curious about, so i'm looking 
forward to that. Also getting to the computer to recognize the meaning of text 
looks interesting. 
• What is being done in the field of compilers that try to make code more secure? 
• I'm interested in seeing how theory connects to practice. 
• I'm not sure if this relates to this course, but I'm interested in learning how 
scripting languages are compiled since we don't go through a build process with 
them as we do with Java or C, if it's a different process. 
CSE401 Au08 3
Question
•IF I were to offer the following option, might you be 
interested in it?
– A reduced project (that is, not all of the extensions to 
MiniJava but still enough to learn about key issues in 
compiling) and
– more substantial “homework”
• If you are interested in this option, send email either directly to 
me or, if you prefer, to the mailing list before lecture on Monday
• I will decide if this is feasible by Wednesday?s lecture
– It?s not a vote, it?s my decision: articulate arguments, in 
addition to degree of interest, will help inform my decision
CSE401 Au08 4
9/26/2008
2
Scanning a.k.a. lexing: purpose
• Turn the character stream that represents the source 
program into a token stream
– In general, it should be an efficient phase of 
compilation
• A token is a group of characters forming an atomic 
unit of syntax, such as a identifier, number, etc.
• White space comprises
– characters between tokens that are ignored 
– they contribute to the human communication 
aspects of the program, but do not change the 
semantics of its execution
CSE401 Au08 5
Separating lexing from parsing
• Lexing can be represented and implemented as part 
of syntactic analysis
– Regular expressions are a proper subset of 
context free grammars
• But this is rarely if ever done: separating concerns 
tends to be a better design 
• Simplifies scanner and parser
– Scanner handles I/O and machine dependencies, 
needn?t know language syntax
– Recognizing regular expressions is much faster 
than parsing context free grammars
– Parser can focus on syntactic structure
CSE401 Au08 6
Lexical design
• Most languages 
are free form
– Layout doesn?t 
matter (to the 
computer – see 
obfuscated code 
example on right)
– White space 
separates tokens
• But some 
languages are 
more constrained 
lexically
#include <stdio.h>
main(t,_,a)char *a;{return!0<t?t<3?main(-79,-
13,a+main(-87,1-_,
main(-86,0,a+1)+a)):1,t<_?main(t+1,_,a):3,main(-
94,-27+t,a)&&t==2?_<13?
main(2,_+1,"%s %d %d\n"):9:16:t<0?t<-
72?main(_,t,
"@n'+,#'/*{}w+/w#cdnr/+,{}r/*de}+,/*{*+,/w{%+,/w
#q#n+,/#{l,+,/n{n+,/+#n+,/#\
;#q#n+,/+k#;*+,/'r :'d*'3,}{w+K w'K:'+}e#';dq#'l
\
q#'+d'K#!/+k#;q#'r}eKK#}w'r}eKK{nl]'/#;#q#n'){)#
}w'){){nl]'/+#n';d}rw' i;# \
){nl]!/n{n#'; r{#w'r nc{nl]'/#{l,+'K {rw' 
iK{;[{nl]'/w#q#n'wk nw' \
iwk{KK{nl]!/w{%'l##w#' i; 
:{nl]'/*{q#'ld;r'}{nlwb!/*de}'c \
;;{nl'-
{}rw]'/+,}##'*}#nc,',#nw]'/+kd'+e}+;#'rdq#w! 
nr'/ ') }+}{rl#'{n' ')# \
}'+}##(!!/")
:t<-50?_==*a?putchar(31[a]):main(-
65,_,a+1):main((*a=='/')+t,_,a+1)
:0<t?main(2,2,"%s"):*a=='/'||main(0,main(-
61,*a,
"!ek;dc i@bK'(q)-[w]*%n+r3#l,{}:\nuwloca-O;m
.vpbks,fxntdCeghiry"),a+1);}
CSE401 Au08 7
Ex: Fortran
• Data cards
• Comment cards: first character is „C?
• Statement cards
– First five characters are an optional statement 
number
– Sixth character is a continuation character – any 
character other than „0? indicates that this 
continues the statement from the previous card
– Characters 7 through 72 are source code
– Characters 73 through 80 are optional and have 
no meaning with respect to the program per se
CSE401 Au08 8
9/26/2008
3
Ex: Haskell
• “[I]ndentation … is important. Haskell uses a system 
called „layout? to structure its code (… Python uses a 
similar system). The layout system allows you to 
write code without the explicit semicolons and braces 
that other languages like C and Java require.”
-- Hal Daumé III
• Tabs and spaces can cause confusion
main = let dolly = breedSheep
in do args <- getArgs
print $ traceFamily dolly (map getFunctionByName args)
CSE401 Au08 9
Definitions
• Pattern: a definition of a related set of lexical entities
– Ex: all sequences of numeric characters, all 
sequences of alphanumeric characters starting 
with an alphabetic character
– Regular expressions are used in practice to define 
patterns
• Lexeme: group of characters that matches a pattern
– Ex: „1234?, „43204222?, „snork?, „f0rk?
• Token: class of lexemes matching a pattern, 
distinguished by an attribute
– Ex: „snork? and „f0rk? are both identifier lexemes 
with the actual names kept as an attribute
CSE401 Au08 10
Languages: quick reminder
• Alphabet: finite set of characters and symbols
• String: a finite (possibly empty) sequence of 
characters from an alphabet
• Language: a (possibly empty or infinite) set of strings 
• Grammar: a finite specification for a set of strings
• Language automaton: an abstract machine that 
accepts all strings in a given language and only those
• A language can be specified by many different 
grammars and automata
• A grammar or automaton specifies a single language
CSE401 Au08 11
Language (Chomsky) hierarchy:
quick reminder
• Regular (Type-3) languages are 
specified by regular 
expressions/grammars and 
finite automata (FSAs)
• Context-free (Type-2) 
languages are specified by 
context-free grammars and 
pushdown automata (PDAs)
• Context-sensitive (Type-1) 
languages … aren?t too 
important
• Recursively-enumerable (Type-
0) languages are specified by 
general grammars and Turing 
machines
Turing
CSL
CFL
Regular
One distinction among the levels is what 
is allowed on the right hand and on the 
left hand sides of grammar rules
CSE401 Au08 12
9/26/2008
4
Regular Expressions:
defined inductively
• Base cases
– Empty string ( )
– Symbol from the 
alphabet
• Inductive cases
– Concatenation: E
1
E
2
– Alternation E
1
| E
2
– Kleene closure: E*
• Parentheses for 
grouping
• Precedence: * is 
highest, then 
concatenate, | is lowest
• White space not 
significant
CSE401 Au08 13
Notational Conveniences:
no additional expressive power
• E+ 1 or more occurrences of E
• Ek exactly k occurrences of E
• [E] 0 or 1 occurrences of E
• {E} E*
• not(x) any character in 
alphabet except x
• not(E) any strings from 
alphabet except those in E
• E1-E2 any string matching E1 
that?s not in E2
• May name regular expressions 
• Ex:
– letter ::= a | b | ... | z
– digit  ::= 0 | 1 | ... | 9
– alphanum ::= letter | num
• Recursive definitions prohibited
• Produce simple regular 
expression from named regular 
expressions by “macro 
expansion”
CSE401 Au08 14
Examples
• Identifiers
– ident ::= letter ( digit | letter)*
• Integer constants
– integer ::= digit+
– sign ::= + | -
– signed_int ::= [sign] integer 
• Real numbers
– real ::= signed_int
[fraction] [exponent]
– fraction ::= . digit+
– exponent ::= (E | e) signed_int
CSE401 Au08 15
More Examples
• String and character constants
– string ::= " char* "
– character ::= ' char '
– char ::= not(" | ' | \) | escape
– escape  ::= 
\(" | ' | \ | n | r | t  | v | b | a )
• White space
– whitespace ::=
<space> | <tab> | <newline> | 
comment
– comment ::= /*  not(*/)  */
CSE401 Au08 16
9/26/2008
5
Meta-Rules
• Consider a program defined as:
– program ::= (token | whitespace)*
– token ::= ident | integer | real | …
• Then consider how to tokenize „hi2bob?
– <ident: „hi2bob?> ?
– <ident: „hi?, integer: „2?, ident: „bob?>
– Or six separate tokens?
• All are legal according to the grammar, but the choice 
does matter – the ambiguity isn?t desirable here
• Usually apply an extra rule such as “longest 
sequence wins”
CSE401 Au08 17
Initial MiniJava lexical specification
Program ::= (Token | Whitespace)* 
Token ::= ID | Integer | ReservedWord | Operator |
Delimiter 
ID ::= Letter (Letter | Digit)* 
Letter ::= a | ... | z | A | ... | Z 
Digit ::= 0 | ... | 9 
Integer ::= Digit+ 
ReservedWord::= class | public | static | extends |
void | int | boolean | if | else | 
while|return|true|false| this | new | String 
| main | System.out.println
Operator ::= + | - | * | / | < | <= | >= | > | == | 
!= | && | ! 
Delimiter ::= ; | . | , | = | ( | ) | { | } | [ | ]
CSE401 Au08 18
Building Scanners with REs
• Convert regular expressions into finite state automata 
(FSA)
• Convert FSA into a scanner implementation
– By hand into a collection of procedures
– Mechanically using a table-driven scanner
CSE401 Au08 19
Finite State Automata
• On whiteboard; see book
CSE401 Au08 20
Read More
Offer running on EduRev: Apply code STAYHOME200 to get INR 200 off on our premium plan EduRev Infinity!