The document Regular Expressions Computer Science Engineering (CSE) Notes | EduRev is a part of the Computer Science Engineering (CSE) Course Compiler Design.

All you need of Computer Science Engineering (CSE) at this link: Computer Science Engineering (CSE)

**4. REGULAR EXPRESSIONS:****4.1: SPECIFICATION OF TOKENS**

There are 3 specifications of tokens:

1) Strings

2) Language

3) Regular expression**Strings and Languages **

An **alphabet** or character class is a finite set of symbols.

A** string **over an alphabet is a finite sequence of symbols drawn from that alphabet. A l**anguage** is any countable set of strings over some fixed alphabet.

In language theory, the terms "sentence" and "word" are often used as synonyms for "string." The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example, banana is a string of length six. The empty string, denoted ε, is the string of length zero.**Operations on strings** The following string-related terms are commonly used:

1. A **prefix** of string s is any string obtained by removing zero or more symbols from the end of strings. For example, ban is a prefix of banana.

2. A** suffix** of string s is any string obtained by removing zero or more symbols from the beginning of s. For example, nana is a suffix of banana.

3. A **substring** of s is obtained by deleting any prefix and any suffix from s. For example, nan is a substring of banana.

4. The **proper prefixes, suffixes, and substrings of a string** s are those prefixes, suffixes, and substrings, respectively of s that are not ε or not equal to s itself.

5. A subsequence of s is any string formed by deleting zero or more not necessarily consecutive

For example, baan is a subsequence of banana.

**Operations on languages:**

The following are the operations that can be applied to languages:

1. Union

2. Concatenation

3. Kleene closure

4.Positive closure The following example shows the operations on strings:

Let L={0,1} and S={a,b,c}

Union 1. : L U S={0,1,a,b,c

Concatenation 2. : L.S={0a,1a,0b,1b,0c,1c}

Kleene closure 3. : L* ={ ε,0,1,00….}

Positive closure 4. : L+ ={0,1,00….}**4.2Regular Expressions:**

Each regular expression r denotes a language L(r).

Here are the rules that define the regular expressions over some alphabet Σ and the languages that those expressions denote:

1. ε is a regular expression, and L(ε) is { ε }, that is, the language whose sole member is the empty string.

2. If‘a’is a symbol in Σ, then ‘a’is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with ‘a’in its one position.

3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,

o (r)|(s) is a regular expression denoting the language

L(r) U L(s). o (r)(s) is a regular expression denoting the language

L(r)L(s). o (r)* is a regular expression denoting

(L(r))*. o (r) is a regular expression denoting L(r).

4. The unary operator * has highest precedence and is left associative.

5. Concatenation has second highest precedence and is left associative. has lowest precedence and is left associative

**4.2.1 REGULAR DEFINITIONS:**

For notational convenience, we may wish to give names to regular expressions and to define regular expressions using these names as if they were symbols. Identifiers are the set or string of letters and digits beginning with a letter. The following regular definition provides a precise specification for this class of string.**Example-1**,

Ab*|cd? Is equivalent to (a(b*)) | (c(d?)) Pascal identifier

Letter - A | B | ……| Z | a | b |……| z| Digits - 0 | 1 | 2 | …. | 9

Id - letter (letter / digit)*** Shorthand’s**

Certain constructs occur so frequently in regular expressions that it is convenient to introduce notational shorthands for them.

**1. One or more instances (+): **

o The unary postfix operator + means “ one or more instances of” .

oIf r is a regular expression that denotes the language L(r), then ( r )+ is a regular expression that denotes the language (L (r ))+

o Thus the regular expression a+ denotes the set of all strings of one or more a’s.

o The operator + has the same precedence and associativity as the operator * .**2. Zero or one instance ( ?): -**

The unary postfix operator ? means “zero or one instance of”. -

The notation r? is a shorthand for r | ε. -

If ‘r’ is a regular expression, then ( r )? s a regular expression that denotes the language L( r ) U { ε }.**3. Character Classes: -** The notation [abc] where a, b and c are alphabet symbols denotes the regular expression a | b | c. -

Character class such as [a – z] denotes the regular expression

a | b | c | d | ….|z. -

We can describe identifiers as being strings generated by the regular expression,

[A–Za–z][A–Za–z0–9]*

**Non-regular Se**t

A language which cannot be described by any regular expression is a non-regular set. Example: The set of all strings of balanced parentheses and repeating strings cannot be described by a regular expression. This set can be specified by a context-free grammar.**4.2.2 RECOGNITION OF TOKENS: **

Consider the following grammar fragment: stmt → if expr then stmt |if expr then stmt else stmt |

ε expr → term relop term |term term → id |num

where the terminals if , then, else, relop, id and num generate sets of strings given by the following regular definitions:

If → if

then → then

else → else

relop → |>|>=

id → letter(letter|digit)* num → digit+

(.digit+ )?(E(+|-)?digit+ )?

For this language fragment the lexical analyzer will recognize the keywords if, then, else, as well as the lexemes denoted by relop, id, and num. To simplify matters, we assume keywords are reserved; that is, they cannot be used as identifiers

Offer running on EduRev: __Apply code STAYHOME200__ to get INR 200 off on our premium plan EduRev Infinity!

16 videos|44 docs|29 tests