Page 1
9/26/2008
1
Lexical
David Notkin
Autumn Quarter 2008
Analysis
Survey (partial results)
Excited [selected responses]
• My dad says it's the epitome of Computer
Science.
• I want to understand how a compiler works.
See behind the scenes and learn something
about software engineering.
• I really liked 341, and this seems like a way
to understand more of what goes on behind
programming languages.
• I am excited about the project and actually
implementing some of the theoretical
knowledge.
• Haven't heard much, but I'm thinking it will
have to do a lot with the stuff we learned in
322 such as grammars and regular
expressions.
• Sadly, I haven't heard much, but I enjoyed
the first lecture.
Concerned [selected responses]
• Very difficult projects
• High work load and difficult project
• Some think it's a bunch of pointless theory.
• I am worried about getting too buried in
theory that isn't shown in an applicable way.
Quite a bit of work and complicated.
• The projects are a lot of work, especially
when you are taking 3 other courses (2 of
which are CSE
• I've heard the projects can be time-
consuming
• I've heard that most of the work is figuring
what the preexisting project code does than
applying the theory we learn. While that is a
useful skill, I hope to get more out of the
projects.
CSE401 Au08 2
Survey (partial results): interests…
• I was interested in how compilers are able to take the Fibonacci function
recursively written and optimize it into a for loop without the user knowing it. I
see now optimizations aren't part of the class, but I would like to learn maybe
about just what optimizations are used in compilers today.
• No specific ones yet. I suppose I'm a little curious about how ML (and similar
statically typed languages) do type inference.
• I am especially interested in JIT compiling.
• Effectively parsing text is something I've been curious about, so i'm looking
forward to that. Also getting to the computer to recognize the meaning of text
looks interesting.
• What is being done in the field of compilers that try to make code more secure?
• I'm interested in seeing how theory connects to practice.
• I'm not sure if this relates to this course, but I'm interested in learning how
scripting languages are compiled since we don't go through a build process with
them as we do with Java or C, if it's a different process.
CSE401 Au08 3
Question
•IF I were to offer the following option, might you be
interested in it?
– A reduced project (that is, not all of the extensions to
MiniJava but still enough to learn about key issues in
compiling) and
– more substantial “homework”
• If you are interested in this option, send email either directly to
me or, if you prefer, to the mailing list before lecture on Monday
• I will decide if this is feasible by Wednesday?s lecture
– It?s not a vote, it?s my decision: articulate arguments, in
addition to degree of interest, will help inform my decision
CSE401 Au08 4
Page 2
9/26/2008
1
Lexical
David Notkin
Autumn Quarter 2008
Analysis
Survey (partial results)
Excited [selected responses]
• My dad says it's the epitome of Computer
Science.
• I want to understand how a compiler works.
See behind the scenes and learn something
about software engineering.
• I really liked 341, and this seems like a way
to understand more of what goes on behind
programming languages.
• I am excited about the project and actually
implementing some of the theoretical
knowledge.
• Haven't heard much, but I'm thinking it will
have to do a lot with the stuff we learned in
322 such as grammars and regular
expressions.
• Sadly, I haven't heard much, but I enjoyed
the first lecture.
Concerned [selected responses]
• Very difficult projects
• High work load and difficult project
• Some think it's a bunch of pointless theory.
• I am worried about getting too buried in
theory that isn't shown in an applicable way.
Quite a bit of work and complicated.
• The projects are a lot of work, especially
when you are taking 3 other courses (2 of
which are CSE
• I've heard the projects can be time-
consuming
• I've heard that most of the work is figuring
what the preexisting project code does than
applying the theory we learn. While that is a
useful skill, I hope to get more out of the
projects.
CSE401 Au08 2
Survey (partial results): interests…
• I was interested in how compilers are able to take the Fibonacci function
recursively written and optimize it into a for loop without the user knowing it. I
see now optimizations aren't part of the class, but I would like to learn maybe
about just what optimizations are used in compilers today.
• No specific ones yet. I suppose I'm a little curious about how ML (and similar
statically typed languages) do type inference.
• I am especially interested in JIT compiling.
• Effectively parsing text is something I've been curious about, so i'm looking
forward to that. Also getting to the computer to recognize the meaning of text
looks interesting.
• What is being done in the field of compilers that try to make code more secure?
• I'm interested in seeing how theory connects to practice.
• I'm not sure if this relates to this course, but I'm interested in learning how
scripting languages are compiled since we don't go through a build process with
them as we do with Java or C, if it's a different process.
CSE401 Au08 3
Question
•IF I were to offer the following option, might you be
interested in it?
– A reduced project (that is, not all of the extensions to
MiniJava but still enough to learn about key issues in
compiling) and
– more substantial “homework”
• If you are interested in this option, send email either directly to
me or, if you prefer, to the mailing list before lecture on Monday
• I will decide if this is feasible by Wednesday?s lecture
– It?s not a vote, it?s my decision: articulate arguments, in
addition to degree of interest, will help inform my decision
CSE401 Au08 4
9/26/2008
2
Scanning a.k.a. lexing: purpose
• Turn the character stream that represents the source
program into a token stream
– In general, it should be an efficient phase of
compilation
• A token is a group of characters forming an atomic
unit of syntax, such as a identifier, number, etc.
• White space comprises
– characters between tokens that are ignored
– they contribute to the human communication
aspects of the program, but do not change the
semantics of its execution
CSE401 Au08 5
Separating lexing from parsing
• Lexing can be represented and implemented as part
of syntactic analysis
– Regular expressions are a proper subset of
context free grammars
• But this is rarely if ever done: separating concerns
tends to be a better design
• Simplifies scanner and parser
– Scanner handles I/O and machine dependencies,
needn?t know language syntax
– Recognizing regular expressions is much faster
than parsing context free grammars
– Parser can focus on syntactic structure
CSE401 Au08 6
Lexical design
• Most languages
are free form
– Layout doesn?t
matter (to the
computer – see
obfuscated code
example on right)
– White space
separates tokens
• But some
languages are
more constrained
lexically
#include <stdio.h>
main(t,_,a)char *a;{return!0<t?t<3?main(-79,-
13,a+main(-87,1-_,
main(-86,0,a+1)+a)):1,t<_?main(t+1,_,a):3,main(-
94,-27+t,a)&&t==2?_<13?
main(2,_+1,"%s %d %d\n"):9:16:t<0?t<-
72?main(_,t,
"@n'+,#'/*{}w+/w#cdnr/+,{}r/*de}+,/*{*+,/w{%+,/w
#q#n+,/#{l,+,/n{n+,/+#n+,/#\
;#q#n+,/+k#;*+,/'r :'d*'3,}{w+K w'K:'+}e#';dq#'l
\
q#'+d'K#!/+k#;q#'r}eKK#}w'r}eKK{nl]'/#;#q#n'){)#
}w'){){nl]'/+#n';d}rw' i;# \
){nl]!/n{n#'; r{#w'r nc{nl]'/#{l,+'K {rw'
iK{;[{nl]'/w#q#n'wk nw' \
iwk{KK{nl]!/w{%'l##w#' i;
:{nl]'/*{q#'ld;r'}{nlwb!/*de}'c \
;;{nl'-
{}rw]'/+,}##'*}#nc,',#nw]'/+kd'+e}+;#'rdq#w!
nr'/ ') }+}{rl#'{n' ')# \
}'+}##(!!/")
:t<-50?_==*a?putchar(31[a]):main(-
65,_,a+1):main((*a=='/')+t,_,a+1)
:0<t?main(2,2,"%s"):*a=='/'||main(0,main(-
61,*a,
"!ek;dc i@bK'(q)-[w]*%n+r3#l,{}:\nuwloca-O;m
.vpbks,fxntdCeghiry"),a+1);}
CSE401 Au08 7
Ex: Fortran
• Data cards
• Comment cards: first character is „C?
• Statement cards
– First five characters are an optional statement
number
– Sixth character is a continuation character – any
character other than „0? indicates that this
continues the statement from the previous card
– Characters 7 through 72 are source code
– Characters 73 through 80 are optional and have
no meaning with respect to the program per se
CSE401 Au08 8
Page 3
9/26/2008
1
Lexical
David Notkin
Autumn Quarter 2008
Analysis
Survey (partial results)
Excited [selected responses]
• My dad says it's the epitome of Computer
Science.
• I want to understand how a compiler works.
See behind the scenes and learn something
about software engineering.
• I really liked 341, and this seems like a way
to understand more of what goes on behind
programming languages.
• I am excited about the project and actually
implementing some of the theoretical
knowledge.
• Haven't heard much, but I'm thinking it will
have to do a lot with the stuff we learned in
322 such as grammars and regular
expressions.
• Sadly, I haven't heard much, but I enjoyed
the first lecture.
Concerned [selected responses]
• Very difficult projects
• High work load and difficult project
• Some think it's a bunch of pointless theory.
• I am worried about getting too buried in
theory that isn't shown in an applicable way.
Quite a bit of work and complicated.
• The projects are a lot of work, especially
when you are taking 3 other courses (2 of
which are CSE
• I've heard the projects can be time-
consuming
• I've heard that most of the work is figuring
what the preexisting project code does than
applying the theory we learn. While that is a
useful skill, I hope to get more out of the
projects.
CSE401 Au08 2
Survey (partial results): interests…
• I was interested in how compilers are able to take the Fibonacci function
recursively written and optimize it into a for loop without the user knowing it. I
see now optimizations aren't part of the class, but I would like to learn maybe
about just what optimizations are used in compilers today.
• No specific ones yet. I suppose I'm a little curious about how ML (and similar
statically typed languages) do type inference.
• I am especially interested in JIT compiling.
• Effectively parsing text is something I've been curious about, so i'm looking
forward to that. Also getting to the computer to recognize the meaning of text
looks interesting.
• What is being done in the field of compilers that try to make code more secure?
• I'm interested in seeing how theory connects to practice.
• I'm not sure if this relates to this course, but I'm interested in learning how
scripting languages are compiled since we don't go through a build process with
them as we do with Java or C, if it's a different process.
CSE401 Au08 3
Question
•IF I were to offer the following option, might you be
interested in it?
– A reduced project (that is, not all of the extensions to
MiniJava but still enough to learn about key issues in
compiling) and
– more substantial “homework”
• If you are interested in this option, send email either directly to
me or, if you prefer, to the mailing list before lecture on Monday
• I will decide if this is feasible by Wednesday?s lecture
– It?s not a vote, it?s my decision: articulate arguments, in
addition to degree of interest, will help inform my decision
CSE401 Au08 4
9/26/2008
2
Scanning a.k.a. lexing: purpose
• Turn the character stream that represents the source
program into a token stream
– In general, it should be an efficient phase of
compilation
• A token is a group of characters forming an atomic
unit of syntax, such as a identifier, number, etc.
• White space comprises
– characters between tokens that are ignored
– they contribute to the human communication
aspects of the program, but do not change the
semantics of its execution
CSE401 Au08 5
Separating lexing from parsing
• Lexing can be represented and implemented as part
of syntactic analysis
– Regular expressions are a proper subset of
context free grammars
• But this is rarely if ever done: separating concerns
tends to be a better design
• Simplifies scanner and parser
– Scanner handles I/O and machine dependencies,
needn?t know language syntax
– Recognizing regular expressions is much faster
than parsing context free grammars
– Parser can focus on syntactic structure
CSE401 Au08 6
Lexical design
• Most languages
are free form
– Layout doesn?t
matter (to the
computer – see
obfuscated code
example on right)
– White space
separates tokens
• But some
languages are
more constrained
lexically
#include <stdio.h>
main(t,_,a)char *a;{return!0<t?t<3?main(-79,-
13,a+main(-87,1-_,
main(-86,0,a+1)+a)):1,t<_?main(t+1,_,a):3,main(-
94,-27+t,a)&&t==2?_<13?
main(2,_+1,"%s %d %d\n"):9:16:t<0?t<-
72?main(_,t,
"@n'+,#'/*{}w+/w#cdnr/+,{}r/*de}+,/*{*+,/w{%+,/w
#q#n+,/#{l,+,/n{n+,/+#n+,/#\
;#q#n+,/+k#;*+,/'r :'d*'3,}{w+K w'K:'+}e#';dq#'l
\
q#'+d'K#!/+k#;q#'r}eKK#}w'r}eKK{nl]'/#;#q#n'){)#
}w'){){nl]'/+#n';d}rw' i;# \
){nl]!/n{n#'; r{#w'r nc{nl]'/#{l,+'K {rw'
iK{;[{nl]'/w#q#n'wk nw' \
iwk{KK{nl]!/w{%'l##w#' i;
:{nl]'/*{q#'ld;r'}{nlwb!/*de}'c \
;;{nl'-
{}rw]'/+,}##'*}#nc,',#nw]'/+kd'+e}+;#'rdq#w!
nr'/ ') }+}{rl#'{n' ')# \
}'+}##(!!/")
:t<-50?_==*a?putchar(31[a]):main(-
65,_,a+1):main((*a=='/')+t,_,a+1)
:0<t?main(2,2,"%s"):*a=='/'||main(0,main(-
61,*a,
"!ek;dc i@bK'(q)-[w]*%n+r3#l,{}:\nuwloca-O;m
.vpbks,fxntdCeghiry"),a+1);}
CSE401 Au08 7
Ex: Fortran
• Data cards
• Comment cards: first character is „C?
• Statement cards
– First five characters are an optional statement
number
– Sixth character is a continuation character – any
character other than „0? indicates that this
continues the statement from the previous card
– Characters 7 through 72 are source code
– Characters 73 through 80 are optional and have
no meaning with respect to the program per se
CSE401 Au08 8
9/26/2008
3
Ex: Haskell
• “[I]ndentation … is important. Haskell uses a system
called „layout? to structure its code (… Python uses a
similar system). The layout system allows you to
write code without the explicit semicolons and braces
that other languages like C and Java require.”
-- Hal Daumé III
• Tabs and spaces can cause confusion
main = let dolly = breedSheep
in do args <- getArgs
print $ traceFamily dolly (map getFunctionByName args)
CSE401 Au08 9
Definitions
• Pattern: a definition of a related set of lexical entities
– Ex: all sequences of numeric characters, all
sequences of alphanumeric characters starting
with an alphabetic character
– Regular expressions are used in practice to define
patterns
• Lexeme: group of characters that matches a pattern
– Ex: „1234?, „43204222?, „snork?, „f0rk?
• Token: class of lexemes matching a pattern,
distinguished by an attribute
– Ex: „snork? and „f0rk? are both identifier lexemes
with the actual names kept as an attribute
CSE401 Au08 10
Languages: quick reminder
• Alphabet: finite set of characters and symbols
• String: a finite (possibly empty) sequence of
characters from an alphabet
• Language: a (possibly empty or infinite) set of strings
• Grammar: a finite specification for a set of strings
• Language automaton: an abstract machine that
accepts all strings in a given language and only those
• A language can be specified by many different
grammars and automata
• A grammar or automaton specifies a single language
CSE401 Au08 11
Language (Chomsky) hierarchy:
quick reminder
• Regular (Type-3) languages are
specified by regular
expressions/grammars and
finite automata (FSAs)
• Context-free (Type-2)
languages are specified by
context-free grammars and
pushdown automata (PDAs)
• Context-sensitive (Type-1)
languages … aren?t too
important
• Recursively-enumerable (Type-
0) languages are specified by
general grammars and Turing
machines
Turing
CSL
CFL
Regular
One distinction among the levels is what
is allowed on the right hand and on the
left hand sides of grammar rules
CSE401 Au08 12
Page 4
9/26/2008
1
Lexical
David Notkin
Autumn Quarter 2008
Analysis
Survey (partial results)
Excited [selected responses]
• My dad says it's the epitome of Computer
Science.
• I want to understand how a compiler works.
See behind the scenes and learn something
about software engineering.
• I really liked 341, and this seems like a way
to understand more of what goes on behind
programming languages.
• I am excited about the project and actually
implementing some of the theoretical
knowledge.
• Haven't heard much, but I'm thinking it will
have to do a lot with the stuff we learned in
322 such as grammars and regular
expressions.
• Sadly, I haven't heard much, but I enjoyed
the first lecture.
Concerned [selected responses]
• Very difficult projects
• High work load and difficult project
• Some think it's a bunch of pointless theory.
• I am worried about getting too buried in
theory that isn't shown in an applicable way.
Quite a bit of work and complicated.
• The projects are a lot of work, especially
when you are taking 3 other courses (2 of
which are CSE
• I've heard the projects can be time-
consuming
• I've heard that most of the work is figuring
what the preexisting project code does than
applying the theory we learn. While that is a
useful skill, I hope to get more out of the
projects.
CSE401 Au08 2
Survey (partial results): interests…
• I was interested in how compilers are able to take the Fibonacci function
recursively written and optimize it into a for loop without the user knowing it. I
see now optimizations aren't part of the class, but I would like to learn maybe
about just what optimizations are used in compilers today.
• No specific ones yet. I suppose I'm a little curious about how ML (and similar
statically typed languages) do type inference.
• I am especially interested in JIT compiling.
• Effectively parsing text is something I've been curious about, so i'm looking
forward to that. Also getting to the computer to recognize the meaning of text
looks interesting.
• What is being done in the field of compilers that try to make code more secure?
• I'm interested in seeing how theory connects to practice.
• I'm not sure if this relates to this course, but I'm interested in learning how
scripting languages are compiled since we don't go through a build process with
them as we do with Java or C, if it's a different process.
CSE401 Au08 3
Question
•IF I were to offer the following option, might you be
interested in it?
– A reduced project (that is, not all of the extensions to
MiniJava but still enough to learn about key issues in
compiling) and
– more substantial “homework”
• If you are interested in this option, send email either directly to
me or, if you prefer, to the mailing list before lecture on Monday
• I will decide if this is feasible by Wednesday?s lecture
– It?s not a vote, it?s my decision: articulate arguments, in
addition to degree of interest, will help inform my decision
CSE401 Au08 4
9/26/2008
2
Scanning a.k.a. lexing: purpose
• Turn the character stream that represents the source
program into a token stream
– In general, it should be an efficient phase of
compilation
• A token is a group of characters forming an atomic
unit of syntax, such as a identifier, number, etc.
• White space comprises
– characters between tokens that are ignored
– they contribute to the human communication
aspects of the program, but do not change the
semantics of its execution
CSE401 Au08 5
Separating lexing from parsing
• Lexing can be represented and implemented as part
of syntactic analysis
– Regular expressions are a proper subset of
context free grammars
• But this is rarely if ever done: separating concerns
tends to be a better design
• Simplifies scanner and parser
– Scanner handles I/O and machine dependencies,
needn?t know language syntax
– Recognizing regular expressions is much faster
than parsing context free grammars
– Parser can focus on syntactic structure
CSE401 Au08 6
Lexical design
• Most languages
are free form
– Layout doesn?t
matter (to the
computer – see
obfuscated code
example on right)
– White space
separates tokens
• But some
languages are
more constrained
lexically
#include <stdio.h>
main(t,_,a)char *a;{return!0<t?t<3?main(-79,-
13,a+main(-87,1-_,
main(-86,0,a+1)+a)):1,t<_?main(t+1,_,a):3,main(-
94,-27+t,a)&&t==2?_<13?
main(2,_+1,"%s %d %d\n"):9:16:t<0?t<-
72?main(_,t,
"@n'+,#'/*{}w+/w#cdnr/+,{}r/*de}+,/*{*+,/w{%+,/w
#q#n+,/#{l,+,/n{n+,/+#n+,/#\
;#q#n+,/+k#;*+,/'r :'d*'3,}{w+K w'K:'+}e#';dq#'l
\
q#'+d'K#!/+k#;q#'r}eKK#}w'r}eKK{nl]'/#;#q#n'){)#
}w'){){nl]'/+#n';d}rw' i;# \
){nl]!/n{n#'; r{#w'r nc{nl]'/#{l,+'K {rw'
iK{;[{nl]'/w#q#n'wk nw' \
iwk{KK{nl]!/w{%'l##w#' i;
:{nl]'/*{q#'ld;r'}{nlwb!/*de}'c \
;;{nl'-
{}rw]'/+,}##'*}#nc,',#nw]'/+kd'+e}+;#'rdq#w!
nr'/ ') }+}{rl#'{n' ')# \
}'+}##(!!/")
:t<-50?_==*a?putchar(31[a]):main(-
65,_,a+1):main((*a=='/')+t,_,a+1)
:0<t?main(2,2,"%s"):*a=='/'||main(0,main(-
61,*a,
"!ek;dc i@bK'(q)-[w]*%n+r3#l,{}:\nuwloca-O;m
.vpbks,fxntdCeghiry"),a+1);}
CSE401 Au08 7
Ex: Fortran
• Data cards
• Comment cards: first character is „C?
• Statement cards
– First five characters are an optional statement
number
– Sixth character is a continuation character – any
character other than „0? indicates that this
continues the statement from the previous card
– Characters 7 through 72 are source code
– Characters 73 through 80 are optional and have
no meaning with respect to the program per se
CSE401 Au08 8
9/26/2008
3
Ex: Haskell
• “[I]ndentation … is important. Haskell uses a system
called „layout? to structure its code (… Python uses a
similar system). The layout system allows you to
write code without the explicit semicolons and braces
that other languages like C and Java require.”
-- Hal Daumé III
• Tabs and spaces can cause confusion
main = let dolly = breedSheep
in do args <- getArgs
print $ traceFamily dolly (map getFunctionByName args)
CSE401 Au08 9
Definitions
• Pattern: a definition of a related set of lexical entities
– Ex: all sequences of numeric characters, all
sequences of alphanumeric characters starting
with an alphabetic character
– Regular expressions are used in practice to define
patterns
• Lexeme: group of characters that matches a pattern
– Ex: „1234?, „43204222?, „snork?, „f0rk?
• Token: class of lexemes matching a pattern,
distinguished by an attribute
– Ex: „snork? and „f0rk? are both identifier lexemes
with the actual names kept as an attribute
CSE401 Au08 10
Languages: quick reminder
• Alphabet: finite set of characters and symbols
• String: a finite (possibly empty) sequence of
characters from an alphabet
• Language: a (possibly empty or infinite) set of strings
• Grammar: a finite specification for a set of strings
• Language automaton: an abstract machine that
accepts all strings in a given language and only those
• A language can be specified by many different
grammars and automata
• A grammar or automaton specifies a single language
CSE401 Au08 11
Language (Chomsky) hierarchy:
quick reminder
• Regular (Type-3) languages are
specified by regular
expressions/grammars and
finite automata (FSAs)
• Context-free (Type-2)
languages are specified by
context-free grammars and
pushdown automata (PDAs)
• Context-sensitive (Type-1)
languages … aren?t too
important
• Recursively-enumerable (Type-
0) languages are specified by
general grammars and Turing
machines
Turing
CSL
CFL
Regular
One distinction among the levels is what
is allowed on the right hand and on the
left hand sides of grammar rules
CSE401 Au08 12
9/26/2008
4
Regular Expressions:
defined inductively
• Base cases
– Empty string ( )
– Symbol from the
alphabet
• Inductive cases
– Concatenation: E
1
E
2
– Alternation E
1
| E
2
– Kleene closure: E*
• Parentheses for
grouping
• Precedence: * is
highest, then
concatenate, | is lowest
• White space not
significant
CSE401 Au08 13
Notational Conveniences:
no additional expressive power
• E+ 1 or more occurrences of E
• Ek exactly k occurrences of E
• [E] 0 or 1 occurrences of E
• {E} E*
• not(x) any character in
alphabet except x
• not(E) any strings from
alphabet except those in E
• E1-E2 any string matching E1
that?s not in E2
• May name regular expressions
• Ex:
– letter ::= a | b | ... | z
– digit ::= 0 | 1 | ... | 9
– alphanum ::= letter | num
• Recursive definitions prohibited
• Produce simple regular
expression from named regular
expressions by “macro
expansion”
CSE401 Au08 14
Examples
• Identifiers
– ident ::= letter ( digit | letter)*
• Integer constants
– integer ::= digit+
– sign ::= + | -
– signed_int ::= [sign] integer
• Real numbers
– real ::= signed_int
[fraction] [exponent]
– fraction ::= . digit+
– exponent ::= (E | e) signed_int
CSE401 Au08 15
More Examples
• String and character constants
– string ::= " char* "
– character ::= ' char '
– char ::= not(" | ' | \) | escape
– escape ::=
\(" | ' | \ | n | r | t | v | b | a )
• White space
– whitespace ::=
<space> | <tab> | <newline> |
comment
– comment ::= /* not(*/) */
CSE401 Au08 16
Page 5
9/26/2008
1
Lexical
David Notkin
Autumn Quarter 2008
Analysis
Survey (partial results)
Excited [selected responses]
• My dad says it's the epitome of Computer
Science.
• I want to understand how a compiler works.
See behind the scenes and learn something
about software engineering.
• I really liked 341, and this seems like a way
to understand more of what goes on behind
programming languages.
• I am excited about the project and actually
implementing some of the theoretical
knowledge.
• Haven't heard much, but I'm thinking it will
have to do a lot with the stuff we learned in
322 such as grammars and regular
expressions.
• Sadly, I haven't heard much, but I enjoyed
the first lecture.
Concerned [selected responses]
• Very difficult projects
• High work load and difficult project
• Some think it's a bunch of pointless theory.
• I am worried about getting too buried in
theory that isn't shown in an applicable way.
Quite a bit of work and complicated.
• The projects are a lot of work, especially
when you are taking 3 other courses (2 of
which are CSE
• I've heard the projects can be time-
consuming
• I've heard that most of the work is figuring
what the preexisting project code does than
applying the theory we learn. While that is a
useful skill, I hope to get more out of the
projects.
CSE401 Au08 2
Survey (partial results): interests…
• I was interested in how compilers are able to take the Fibonacci function
recursively written and optimize it into a for loop without the user knowing it. I
see now optimizations aren't part of the class, but I would like to learn maybe
about just what optimizations are used in compilers today.
• No specific ones yet. I suppose I'm a little curious about how ML (and similar
statically typed languages) do type inference.
• I am especially interested in JIT compiling.
• Effectively parsing text is something I've been curious about, so i'm looking
forward to that. Also getting to the computer to recognize the meaning of text
looks interesting.
• What is being done in the field of compilers that try to make code more secure?
• I'm interested in seeing how theory connects to practice.
• I'm not sure if this relates to this course, but I'm interested in learning how
scripting languages are compiled since we don't go through a build process with
them as we do with Java or C, if it's a different process.
CSE401 Au08 3
Question
•IF I were to offer the following option, might you be
interested in it?
– A reduced project (that is, not all of the extensions to
MiniJava but still enough to learn about key issues in
compiling) and
– more substantial “homework”
• If you are interested in this option, send email either directly to
me or, if you prefer, to the mailing list before lecture on Monday
• I will decide if this is feasible by Wednesday?s lecture
– It?s not a vote, it?s my decision: articulate arguments, in
addition to degree of interest, will help inform my decision
CSE401 Au08 4
9/26/2008
2
Scanning a.k.a. lexing: purpose
• Turn the character stream that represents the source
program into a token stream
– In general, it should be an efficient phase of
compilation
• A token is a group of characters forming an atomic
unit of syntax, such as a identifier, number, etc.
• White space comprises
– characters between tokens that are ignored
– they contribute to the human communication
aspects of the program, but do not change the
semantics of its execution
CSE401 Au08 5
Separating lexing from parsing
• Lexing can be represented and implemented as part
of syntactic analysis
– Regular expressions are a proper subset of
context free grammars
• But this is rarely if ever done: separating concerns
tends to be a better design
• Simplifies scanner and parser
– Scanner handles I/O and machine dependencies,
needn?t know language syntax
– Recognizing regular expressions is much faster
than parsing context free grammars
– Parser can focus on syntactic structure
CSE401 Au08 6
Lexical design
• Most languages
are free form
– Layout doesn?t
matter (to the
computer – see
obfuscated code
example on right)
– White space
separates tokens
• But some
languages are
more constrained
lexically
#include <stdio.h>
main(t,_,a)char *a;{return!0<t?t<3?main(-79,-
13,a+main(-87,1-_,
main(-86,0,a+1)+a)):1,t<_?main(t+1,_,a):3,main(-
94,-27+t,a)&&t==2?_<13?
main(2,_+1,"%s %d %d\n"):9:16:t<0?t<-
72?main(_,t,
"@n'+,#'/*{}w+/w#cdnr/+,{}r/*de}+,/*{*+,/w{%+,/w
#q#n+,/#{l,+,/n{n+,/+#n+,/#\
;#q#n+,/+k#;*+,/'r :'d*'3,}{w+K w'K:'+}e#';dq#'l
\
q#'+d'K#!/+k#;q#'r}eKK#}w'r}eKK{nl]'/#;#q#n'){)#
}w'){){nl]'/+#n';d}rw' i;# \
){nl]!/n{n#'; r{#w'r nc{nl]'/#{l,+'K {rw'
iK{;[{nl]'/w#q#n'wk nw' \
iwk{KK{nl]!/w{%'l##w#' i;
:{nl]'/*{q#'ld;r'}{nlwb!/*de}'c \
;;{nl'-
{}rw]'/+,}##'*}#nc,',#nw]'/+kd'+e}+;#'rdq#w!
nr'/ ') }+}{rl#'{n' ')# \
}'+}##(!!/")
:t<-50?_==*a?putchar(31[a]):main(-
65,_,a+1):main((*a=='/')+t,_,a+1)
:0<t?main(2,2,"%s"):*a=='/'||main(0,main(-
61,*a,
"!ek;dc i@bK'(q)-[w]*%n+r3#l,{}:\nuwloca-O;m
.vpbks,fxntdCeghiry"),a+1);}
CSE401 Au08 7
Ex: Fortran
• Data cards
• Comment cards: first character is „C?
• Statement cards
– First five characters are an optional statement
number
– Sixth character is a continuation character – any
character other than „0? indicates that this
continues the statement from the previous card
– Characters 7 through 72 are source code
– Characters 73 through 80 are optional and have
no meaning with respect to the program per se
CSE401 Au08 8
9/26/2008
3
Ex: Haskell
• “[I]ndentation … is important. Haskell uses a system
called „layout? to structure its code (… Python uses a
similar system). The layout system allows you to
write code without the explicit semicolons and braces
that other languages like C and Java require.”
-- Hal Daumé III
• Tabs and spaces can cause confusion
main = let dolly = breedSheep
in do args <- getArgs
print $ traceFamily dolly (map getFunctionByName args)
CSE401 Au08 9
Definitions
• Pattern: a definition of a related set of lexical entities
– Ex: all sequences of numeric characters, all
sequences of alphanumeric characters starting
with an alphabetic character
– Regular expressions are used in practice to define
patterns
• Lexeme: group of characters that matches a pattern
– Ex: „1234?, „43204222?, „snork?, „f0rk?
• Token: class of lexemes matching a pattern,
distinguished by an attribute
– Ex: „snork? and „f0rk? are both identifier lexemes
with the actual names kept as an attribute
CSE401 Au08 10
Languages: quick reminder
• Alphabet: finite set of characters and symbols
• String: a finite (possibly empty) sequence of
characters from an alphabet
• Language: a (possibly empty or infinite) set of strings
• Grammar: a finite specification for a set of strings
• Language automaton: an abstract machine that
accepts all strings in a given language and only those
• A language can be specified by many different
grammars and automata
• A grammar or automaton specifies a single language
CSE401 Au08 11
Language (Chomsky) hierarchy:
quick reminder
• Regular (Type-3) languages are
specified by regular
expressions/grammars and
finite automata (FSAs)
• Context-free (Type-2)
languages are specified by
context-free grammars and
pushdown automata (PDAs)
• Context-sensitive (Type-1)
languages … aren?t too
important
• Recursively-enumerable (Type-
0) languages are specified by
general grammars and Turing
machines
Turing
CSL
CFL
Regular
One distinction among the levels is what
is allowed on the right hand and on the
left hand sides of grammar rules
CSE401 Au08 12
9/26/2008
4
Regular Expressions:
defined inductively
• Base cases
– Empty string ( )
– Symbol from the
alphabet
• Inductive cases
– Concatenation: E
1
E
2
– Alternation E
1
| E
2
– Kleene closure: E*
• Parentheses for
grouping
• Precedence: * is
highest, then
concatenate, | is lowest
• White space not
significant
CSE401 Au08 13
Notational Conveniences:
no additional expressive power
• E+ 1 or more occurrences of E
• Ek exactly k occurrences of E
• [E] 0 or 1 occurrences of E
• {E} E*
• not(x) any character in
alphabet except x
• not(E) any strings from
alphabet except those in E
• E1-E2 any string matching E1
that?s not in E2
• May name regular expressions
• Ex:
– letter ::= a | b | ... | z
– digit ::= 0 | 1 | ... | 9
– alphanum ::= letter | num
• Recursive definitions prohibited
• Produce simple regular
expression from named regular
expressions by “macro
expansion”
CSE401 Au08 14
Examples
• Identifiers
– ident ::= letter ( digit | letter)*
• Integer constants
– integer ::= digit+
– sign ::= + | -
– signed_int ::= [sign] integer
• Real numbers
– real ::= signed_int
[fraction] [exponent]
– fraction ::= . digit+
– exponent ::= (E | e) signed_int
CSE401 Au08 15
More Examples
• String and character constants
– string ::= " char* "
– character ::= ' char '
– char ::= not(" | ' | \) | escape
– escape ::=
\(" | ' | \ | n | r | t | v | b | a )
• White space
– whitespace ::=
<space> | <tab> | <newline> |
comment
– comment ::= /* not(*/) */
CSE401 Au08 16
9/26/2008
5
Meta-Rules
• Consider a program defined as:
– program ::= (token | whitespace)*
– token ::= ident | integer | real | …
• Then consider how to tokenize „hi2bob?
– <ident: „hi2bob?> ?
– <ident: „hi?, integer: „2?, ident: „bob?>
– Or six separate tokens?
• All are legal according to the grammar, but the choice
does matter – the ambiguity isn?t desirable here
• Usually apply an extra rule such as “longest
sequence wins”
CSE401 Au08 17
Initial MiniJava lexical specification
Program ::= (Token | Whitespace)*
Token ::= ID | Integer | ReservedWord | Operator |
Delimiter
ID ::= Letter (Letter | Digit)*
Letter ::= a | ... | z | A | ... | Z
Digit ::= 0 | ... | 9
Integer ::= Digit+
ReservedWord::= class | public | static | extends |
void | int | boolean | if | else |
while|return|true|false| this | new | String
| main | System.out.println
Operator ::= + | - | * | / | < | <= | >= | > | == |
!= | && | !
Delimiter ::= ; | . | , | = | ( | ) | { | } | [ | ]
CSE401 Au08 18
Building Scanners with REs
• Convert regular expressions into finite state automata
(FSA)
• Convert FSA into a scanner implementation
– By hand into a collection of procedures
– Mechanically using a table-driven scanner
CSE401 Au08 19
Finite State Automata
• On whiteboard; see book
CSE401 Au08 20
Read More