2 Regular Languages
But the Lord came down to see the city and the tower the people were building. The Lord
said, “If as one people speaking the same language they have begun to do this, then nothing
they plan to do will be impossible for them. Come, let us go down and confuse their language
so they will not understand each other.”
— Genesis 11:6–7 (New International Version)
Some people, when confronted with a problem, think "I know, I’ll use regular expressions."
Now they have two problems.
— Jamie Zawinski, alt.religion.emacs (August 12, 1997)
2.1 Languages
A formal language (or just a language) is a set of strings over some finite alphabet Σ, or
equivalently, an arbitrary subset of Σ∗ . For example, each of the following sets is a language:
As a notational convention, I will always use italic upper-case letters (usually L, but also A, B, C,
and so on) to represent languages.
Formal languages are not “languages” in the same sense that English, Klingon, and Python
are “languages”. Strings in a formal language do not necessarily carry any “meaning”, nor
are they necessarily assembled into larger units (“sentences” or “paragraphs” or “packages”)
according to some “grammar”.
¹ The empty set symbol ∅ was introduced in 1939 by André Weil, as a member of the pseudonymous mathematical
collective Nicolas Bourbaki. The symbol derives from the Norwegian letter Ø, which is pronounced like a German ö or
a sound of disgust, and not from the Greek letter φ. Calling the empty set “fie” or “fee” makes the baby Jesus cry.
Models of Computation Lecture 2: Regular Languages [Sp’18]
It is very important to distinguish between three “empty” objects. Many beginning students
have trouble keeping these straight.
• ∅ is the empty language, which is a set containing zero strings. ∅ is not a string.
• {ε} is a language containing exactly one string, which has length zero. {ε} is not empty,
and it is not a string.
• ε is the empty string, which is a sequence of length zero. ε is not a language.
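The same three-way distinction shows up in code. Here is a minimal sketch, assuming we model strings as Python `str` values and finite languages as Python sets of strings; the variable names are only illustrative.

```python
# Three different "empty" objects, modeled with Python sets and strings.
empty_language = set()      # ∅: a language containing zero strings
epsilon_language = {""}     # {ε}: a language containing exactly one string
epsilon = ""                # ε: a string of length zero, not a language

print(len(empty_language))       # number of strings in ∅
print(len(epsilon_language))     # number of strings in {ε}
print(len(epsilon))              # number of symbols in ε
print(epsilon in epsilon_language, epsilon in empty_language)
```

Note that `set()` is needed for the empty language; the literal `{}` would create an empty dictionary instead.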
A • B := {xy | x ∈ A and y ∈ B}.
The Kleene closure or Kleene star² of a language L, denoted L∗, is the set of all strings obtained
by concatenating a sequence of zero or more strings from L. For example, {0, 11}∗ = {ε, 0, 00, 11,
000, 011, 110, 0000, 0011, 0110, 1100, 1111, 00000, 00011, 011110011011, . . .}. More
formally, L∗ is defined recursively as the set of all strings w such that either
• w = ε, or
• w = xy, for some strings x ∈ L and y ∈ L∗.
In particular, ∅∗ = {ε}∗ = {ε}.
For any other language L, the Kleene closure L∗ is infinite and contains arbitrarily long (but
finite!) strings. Equivalently, L∗ can also be defined as the smallest superset of L that contains the
empty string ε and is closed under concatenation (hence “closure”). The set of all strings Σ∗ is,
just as the notation suggests, the Kleene closure of the alphabet Σ (where each symbol is viewed
as a string of length 1).
A useful variant of the Kleene closure operator is the Kleene plus, defined as L⁺ := L • L∗.
Thus, L⁺ is the set of all strings obtained by concatenating a sequence of one or more strings
from L.
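Concatenation, Kleene star, and Kleene plus are easy to experiment with on small finite languages. The sketch below assumes languages are Python sets of strings. Because the Kleene closure is usually infinite, the hypothetical helpers `star_upto` and `plus_upto` only return the strings of length at most n.

```python
def concat(A, B):
    """A • B = {xy | x in A and y in B}, for finite languages given as sets."""
    return {x + y for x in A for y in B}

def star_upto(L, n):
    """All strings of length at most n in the Kleene closure of L."""
    result = {""}        # the empty string is always in the closure
    frontier = {""}      # strings discovered in the previous round
    while frontier:
        # Append one more string from L; keep only new, short-enough strings.
        frontier = {w for w in concat(frontier, L) if len(w) <= n} - result
        result |= frontier
    return result

def plus_upto(L, n):
    """All strings of length at most n in the Kleene plus of L, that is, L • L*."""
    return {w for w in concat(L, star_upto(L, n)) if len(w) <= n}

print(sorted(star_upto({"0", "11"}, 3), key=lambda w: (len(w), w)))
print(star_upto(set(), 5), star_upto({""}, 5))
print(sorted(plus_upto({"0", "11"}, 2)))
```

The first call reproduces the beginning of the example list for {0, 11}∗, and the second confirms that the closures of ∅ and {ε} both contain only the empty string.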
The following identities, which we state here without (easy) proofs, are useful for designing,
simplifying, and understanding languages.
² Named after logician Stephen Cole Kleene, who actually pronounced his last name “clay-knee”, not “clean” or
“cleanie” or “claynuh” or “dimaggio”.
Lemma 2.1. The following identities hold for all languages A, B, and C:
(a) A ∪ B = B ∪ A.
(b) (A ∪ B) ∪ C = A ∪ (B ∪ C).
(c) ∅ • A = A • ∅ = ∅.
(d) {ε} • A = A • {ε} = A.
(e) (A • B) • C = A • (B • C).
(f) A • (B ∪ C) = (A • B) ∪ (A • C).
(g) (A ∪ B) • C = (A • C) ∪ (B • C).
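A quick experiment can build confidence in these identities before proving them. The following sketch, which assumes finite languages are Python sets of strings, checks identities (a) through (g) on every triple drawn from a handful of sample languages. The samples are arbitrary choices, and passing the check is of course not a proof.

```python
from itertools import product

def concat(A, B):
    """A • B for finite languages given as sets of strings."""
    return {x + y for x in A for y in B}

def identities_hold(samples):
    """Check identities (a)-(g) on every triple (A, B, C) of sample languages."""
    for A, B, C in product(samples, repeat=3):
        checks = [
            A | B == B | A,                                       # (a)
            (A | B) | C == A | (B | C),                           # (b)
            concat(set(), A) == concat(A, set()) == set(),        # (c)
            concat({""}, A) == concat(A, {""}) == A,              # (d)
            concat(concat(A, B), C) == concat(A, concat(B, C)),   # (e)
            concat(A, B | C) == concat(A, B) | concat(A, C),      # (f)
            concat(A | B, C) == concat(A, C) | concat(B, C),      # (g)
        ]
        if not all(checks):
            return False
    return True

samples = [set(), {""}, {"0"}, {"0", "11"}, {"", "1", "10"}]
print(identities_hold(samples))
```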
Lemma 2.3 (Arden’s Rule). For any languages A, B, and L such that L = A • L ∪ B, we have
A∗ • B ⊆ L. Moreover, if A does not contain the empty string, then L = A • L ∪ B if and only if
L = A∗ • B.
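The role of the condition that A does not contain the empty string can be seen on a small example. The sketch below assumes finite truncations: each language is represented by the set of its strings of length at most N, and the equation L = A • L ∪ B is checked only on strings of length at most N. This is sound because a string of length at most N in A • L can only use a string of L of length at most N. The helper names are hypothetical.

```python
def concat(A, B):
    return {x + y for x in A for y in B}

def trunc(L, n):
    """Keep only the strings of length at most n."""
    return {w for w in L if len(w) <= n}

def star_upto(L, n):
    """All strings of length at most n in the Kleene closure of L."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = trunc(concat(frontier, L), n) - result
        result |= frontier
    return result

def satisfies_equation(L, A, B, n):
    """Check L = A • L ∪ B on strings of length at most n (L is already truncated)."""
    return trunc(concat(A, L) | B, n) == L

N = 6
B = {"1"}

A = {"0"}                                        # A does not contain ε
arden = trunc(concat(star_upto(A, N), B), N)     # A* • B, that is, 0*1
print(satisfies_equation(arden, A, B, N))

A_eps = {"", "0"}                                # A contains ε
arden_eps = trunc(concat(star_upto(A_eps, N), B), N)
everything = star_upto({"0", "1"}, N)            # all binary strings up to length N
print(satisfies_equation(arden_eps, A_eps, B, N))
print(satisfies_equation(everything, A_eps, B, N))
print(arden_eps < everything)                    # proper subset
```

When A contains ε, both A∗ • B and the set of all binary strings satisfy the equation, so the solution is no longer unique; A∗ • B is merely contained in every solution, as the lemma states.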
• L is empty;
• L contains exactly one string (which could be the empty string ε);
• L is the union of two regular languages;
• L is the concatenation of two regular languages; or
• L is the Kleene closure of a regular language.
Regular languages are normally described using a compact notation called regular expressions,
which omit braces around one-string sets, use + to represent union instead of ∪, and
juxtapose subexpressions to represent concatenation instead of using an explicit operator •. By
convention, in the absence of parentheses, the ∗ operator has highest precedence, followed by
the (implicit) concatenation operator, followed by +.
For example, the regular expression 10∗ is shorthand for the language {1} • {0}∗ (containing
all strings consisting of a 1 followed by zero or more 0s), and not the language {10}∗ (containing
all strings of even length that start with 1 and alternate between 1 and 0). As a larger example,
the regular expression
0∗0 + 0∗1(10∗1 + 01∗0)∗10∗
represents the language
({0}∗ • {0}) ∪ ({0}∗ • {1} • (({1} • {0}∗ • {1}) ∪ ({0} • {1}∗ • {0}))∗ • {1} • {0}∗).
Most of the time we do not distinguish between regular expressions and the languages they
represent, for the same reason that we do not normally distinguish between the arithmetic
expression “2+2” and the integer 4, or the symbol π and the area of the unit circle. However, we
sometimes need to refer to regular expressions themselves as strings. In those circumstances, we
write L(R) to denote the language represented by the regular expression R. String w matches
regular expression R if and only if w ∈ L(R).
Here are several more examples of regular expressions and the languages they represent.
• (ε + 1)(01)∗(ε + 0) — the set of all strings of alternating 0s and 1s, or equivalently, the
set of all binary strings that do not contain the substrings 00 or 11.
• (0 + 1)∗0000(0 + 1)∗ — the set of all binary strings that contain the substring 0000.
• ((ε + 0 + 00 + 000)1)∗(ε + 0 + 00 + 000) — the set of all binary strings that do not contain
the substring 0000.
• ((0 + 1)(0 + 1))∗ — the set of all binary strings whose length is even.
• 1∗(01∗01∗)∗ — the set of all binary strings with an even number of 0s.
• 0 + 1(0 + 1)∗00 — the set of all non-negative binary numerals divisible by 4 and with no
redundant leading 0s.
• 0∗0 + 0∗1(10∗1 + 01∗0)∗10∗ — the set of all non-negative binary numerals divisible by 3,
possibly with redundant leading 0s.
The last example should not be obvious. It is straightforward, but really tedious, to prove
by induction that every string in 0∗ 0 + 0∗ 1(10∗ 1 + 01∗ 0)∗ 10∗ is the binary representation of a
non-negative multiple of 3. It is similarly straightforward, but even more tedious, to prove that
the binary representation of every non-negative multiple of 3 matches this regular expression. In
a later note, we will see a systematic method for deriving regular expressions for some languages
that avoids (or more accurately, automates) this tedium.
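A brute-force check is no substitute for these proofs, but it is a cheap way to catch a wrong regular expression. The sketch below assumes Python's `re` module as the matcher: the hypothetical helper `to_python` rewrites the notation of these notes (+ for union, ε for the empty string) into Python syntax, and each example is compared against a direct Python predicate on every binary string up to a length bound.

```python
import re
from itertools import product

def to_python(regex):
    """Translate the notes' notation into Python `re` syntax:
    union + becomes |, and ε becomes the empty pattern."""
    return regex.replace(" ", "").replace("+", "|").replace("ε", "")

def binary_strings(max_len):
    """Yield every binary string of length 0, 1, ..., max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

# (regular expression, predicate describing the intended language)
EXAMPLES = [
    ("(ε+1)(01)*(ε+0)", lambda w: "00" not in w and "11" not in w),
    ("(0+1)*0000(0+1)*", lambda w: "0000" in w),
    ("((ε+0+00+000)1)*(ε+0+00+000)", lambda w: "0000" not in w),
    ("((0+1)(0+1))*", lambda w: len(w) % 2 == 0),
    ("1*(01*01*)*", lambda w: w.count("0") % 2 == 0),
    ("0+1(0+1)*00", lambda w: w == "0" or (w[:1] == "1" and int(w, 2) % 4 == 0)),
    ("0*0+0*1(10*1+01*0)*10*", lambda w: w != "" and int(w, 2) % 3 == 0),
]

def first_disagreement(regex, predicate, max_len=10):
    """Return a shortest string on which regex and predicate disagree, or None."""
    pattern = re.compile(to_python(regex))
    for w in binary_strings(max_len):
        if (pattern.fullmatch(w) is not None) != predicate(w):
            return w
    return None

for regex, predicate in EXAMPLES:
    print(regex, first_disagreement(regex, predicate))
```

`fullmatch` is used rather than `search` or `match` because a string matches a regular expression in the sense of these notes only when the whole string belongs to the language.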
Two regular expressions R and R′ are equivalent if they describe the same language. For
example, the regular expressions (0 + 1)∗ and (1 + 0)∗ are equivalent, because the union
operator is commutative. More subtly, the regular expressions (0 + 1)∗ and (0∗1∗)∗ and
(00 + 01 + 10 + 11)∗(0 + 1 + ε) are all equivalent; intuitively, these three expressions represent
different ways of thinking about the language {0, 1}∗. In fact, almost every regular language can
be represented by infinitely many distinct but equivalent regular expressions, even if we ignore
ultimately trivial equivalences like L = (L∅)∗Lε + ∅.
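Equivalence can be tested (though not proved) by comparing expressions on all short strings. The sketch below again assumes Python's `re` module, so union is written | and ε is written as an empty alternative; `agree_up_to` is a hypothetical helper that returns a shortest string on which the given patterns disagree, or `None` if there is none up to the length bound.

```python
import re
from itertools import product

def agree_up_to(patterns, max_len, alphabet="01"):
    """Return a shortest string accepted by some but not all patterns, else None."""
    compiled = [re.compile(p) for p in patterns]
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            w = "".join(symbols)
            verdicts = {p.fullmatch(w) is not None for p in compiled}
            if len(verdicts) > 1:     # both True and False occurred
                return w
    return None

# Four ways of writing {0, 1}*: no disagreement is found.
print(agree_up_to(["(0|1)*", "(1|0)*", "(0*1*)*", "(00|01|10|11)*(0|1|)"], 8))

# 10* and (10)* are not equivalent; repr shows the witness string.
print(repr(agree_up_to(["10*", "(10)*"], 8)))
```

The second call returns the empty string, which belongs to the language of (10)∗ but not to the language of 10∗.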
These cases mirror the definition of regular language exactly. A leaf labeled ∅ represents the
empty language; a leaf labeled with a string represents the language containing only that string;
a node labeled + represents the union of the languages represented by its two children; a node
labeled • represents the concatenation of the languages represented by its two children; and a
node labeled ∗ represents the Kleene closure of the language represented by its child.
[Figure: A regular expression tree for 0∗0 + 0∗1(10∗1 + 01∗0)∗10∗]
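Regular expression trees translate directly into a recursive data structure. The sketch below is one possible encoding, not something prescribed by these notes: each node is a nested Python tuple tagged "empty", "lit", "union", "cat", or "star", and the hypothetical function `language_upto` evaluates a tree into the set of strings of length at most n in its language, with one branch per kind of node.

```python
EMPTY = ("empty",)                       # leaf labeled ∅

def lit(w):      return ("lit", w)       # leaf labeled with the string w
def union(a, b): return ("union", a, b)  # node labeled +
def cat(a, b):   return ("cat", a, b)    # node labeled •
def star(a):     return ("star", a)      # node labeled *

def language_upto(tree, n):
    """Strings of length at most n in the language represented by the tree."""
    kind = tree[0]
    if kind == "empty":
        return set()
    if kind == "lit":
        return {tree[1]} if len(tree[1]) <= n else set()
    if kind == "union":
        return language_upto(tree[1], n) | language_upto(tree[2], n)
    if kind == "cat":
        left, right = language_upto(tree[1], n), language_upto(tree[2], n)
        return {x + y for x in left for y in right if len(x + y) <= n}
    if kind == "star":
        base = language_upto(tree[1], n)
        result, frontier = {""}, {""}
        while frontier:                  # keep appending strings from the base
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= n} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node kind: {kind}")

print(sorted(language_upto(cat(lit("1"), star(lit("0"))), 3)))   # tree for 10*
print(sorted(language_upto(star(lit("10")), 4)))                 # tree for (10)*
```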
The size of a regular expression is the number of nodes in its regular expression tree. The size
of a regular expression could be either larger or smaller than its length as a raw string. On the
one hand, concatenation nodes in the tree are not represented by symbols in the string; on the
other hand, parentheses in the string are not represented by nodes in the tree. For example, the
regular expression 0∗ 0 + 0∗ 1(10∗ 1 + 01∗ 0)∗ 10∗ has size 29, but the corresponding raw string
0*0+0*1(10*1+01*0)*10* has length 22.
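Both numbers are easy to confirm mechanically. The sketch below assumes a nested-tuple encoding of regular expression trees (my choice, not the notes'), builds the tree for the example by hand, and counts its nodes. The helper `cat` nests binary concatenation nodes to the right; the nesting order does not affect the node count.

```python
def lit(w):      return ("lit", w)       # leaf labeled with a string
def union(a, b): return ("union", a, b)  # node labeled +
def star(a):     return ("star", a)      # node labeled *

def cat(first, *rest):
    """Concatenate one or more subtrees using binary • nodes nested to the right."""
    return first if not rest else ("cat", first, cat(*rest))

def size(tree):
    """Number of nodes in the tree: this node plus the nodes of its child subtrees."""
    return 1 + sum(size(child) for child in tree[1:] if isinstance(child, tuple))

zero, one = lit("0"), lit("1")
inner = union(cat(one, star(zero), one), cat(zero, star(one), zero))
tree = union(cat(star(zero), zero),
             cat(star(zero), one, star(inner), one, star(zero)))

print(size(tree))                        # 29
print(len("0*0+0*1(10*1+01*0)*10*"))     # 22
```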
A subexpression of a regular expression R is another regular expression S whose regular
expression tree is a subtree of some regular expression tree for R. A proper subexpression of R
is any subexpression except R itself. Every subexpression of R is also a substring of R, but not
every substring is a subexpression. For example, the substring 10∗ 1 is a proper subexpression
of 0∗ 0 + 0∗ 1(10∗ 1 + 01∗ 0)∗ 10∗ . However, the substrings 0∗ 0 + 0∗ 1 and 0∗ 1 + 01∗ are not
subexpressions of 0∗ 0 + 0∗ 1(10∗ 1 + 01∗ 0)∗ 10∗ , even though they are well-formed regular
expressions.
• Suppose R = ∅.
Students uncomfortable with structural induction can instead induct on the size of the regular
expression (defined as the number of nodes in the corresponding regular expression tree). This
variant changes only the statement of the inductive hypothesis, not the structure of the proof itself;
the rest of the boilerplate is utterly identical.
Here is an example of the structural induction boilerplate in action. Again, this proof is longer
than a typical induction proof about strings or integers, but each individual case is still just a
short exercise in definition-chasing.
Lemma 2.4. Every regular expression that does not use the symbol ∅ represents a non-empty
language.
Proof: Let R be an arbitrary regular expression that does not use the symbol ∅. Assume that
every proper subexpression of R that does not use the symbol ∅ represents a non-empty language.
There are five cases to consider, mirroring the definition of R.
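The structural induction behind Lemma 2.4 can also be read as a recursive procedure that produces a string in the language. The sketch below is my own reading of the case analysis, under the assumption that regular expression trees are nested tuples tagged "empty", "lit", "union", "cat", and "star"; the function name `witness` is hypothetical.

```python
def witness(tree):
    """Return one string in the language of a regular expression tree without ∅."""
    kind = tree[0]
    if kind == "empty":
        # This case cannot occur under the hypothesis of the lemma.
        raise ValueError("the lemma only covers expressions that do not use ∅")
    if kind == "lit":
        return tree[1]                               # the single string itself
    if kind == "union":
        return witness(tree[1])                      # either child would do
    if kind == "cat":
        return witness(tree[1]) + witness(tree[2])   # one string from each side
    if kind == "star":
        return ""                                    # ε is in every Kleene closure
    raise ValueError(f"unknown node kind: {kind}")

# The tree for 1(0 + 11)* has the witness "1".
example = ("cat", ("lit", "1"), ("star", ("union", ("lit", "0"), ("lit", "11"))))
print(repr(witness(example)))
```

Each recursive call is made on a proper subexpression, which is exactly where the inductive hypothesis is used in the proof.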
Similarly, most algorithms that accept regular expressions as input actually require regular
expression trees, rather than regular expressions as raw strings. Fortunately, it is possible to parse
any regular expression of length n into an equivalent regular expression tree in O(n) time. (The
details of the parsing algorithm are beyond the scope of this chapter.) Thus, when we see an
algorithmic problem that starts “Given a regular expression. . . ”, we can assume without loss of
generality that we are actually given a regular expression tree.
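For readers curious what such a parser might look like, here is a minimal recursive-descent sketch. It is not the algorithm these notes have in mind, only one standard approach, under several assumptions of mine: trees are nested tuples, every leaf holds a single symbol, ε and ∅ (or Ø) are written explicitly, and concatenation nests to the left. Each token is consumed once, so the running time is linear in the length of the input.

```python
def parse(text):
    """Parse a regular expression string into a nested-tuple tree (a sketch)."""
    tokens = [c for c in text if not c.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def expr():                          # lowest precedence: union
        node = term()
        while peek() == "+":
            take()
            node = ("union", node, term())
        return node

    def term():                          # implicit concatenation
        node = factor()
        while peek() not in (None, "+", ")"):
            node = ("cat", node, factor())
        return node

    def factor():                        # highest precedence: Kleene star
        node = atom()
        while peek() == "*":
            take()
            node = ("star", node)
        return node

    def atom():
        c = peek()
        if c is None or c in "+)*":
            raise ValueError(f"unexpected {c!r} at position {pos}")
        take()
        if c == "(":
            node = expr()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return node
        if c == "ε":
            return ("lit", "")
        if c in "∅Ø":
            return ("empty",)
        return ("lit", c)                # any other character is an alphabet symbol

    tree = expr()
    if pos != len(tokens):
        raise ValueError(f"unexpected {peek()!r} at position {pos}")
    return tree

print(parse("10*"))
print(parse("(10)*"))
```

The three mutually recursive functions `expr`, `term`, and `factor` encode the precedence convention stated earlier: star binds tightest, then concatenation, then +.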
Lemma 2.5. Every non-empty regular language is represented by a regular expression that does
not use the symbol ∅.
Proof: Let R be an arbitrary regular expression; we need to prove that either L(R) = ∅ or
L(R) = L(R′) for some ∅-free regular expression R′. For every proper subexpression S of R,
assume that either L(S) = ∅ or L(S) = L(S′) for some ∅-free regular expression S′. There are
five cases to consider, mirroring the definition of R.
• If R = ∅, then L(R) = ∅.
• Suppose R = S + T for some regular expressions S and T . There are four subcases to
consider:
• Suppose R = S • T for some regular expressions S and T . There are two subcases to
consider.
• Suppose R = S ∗ for some regular expression S. There are two subcases to consider.
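The case analysis above has the shape of a recursive algorithm. The sketch below is my reading of it, assuming regular expression trees are nested tuples tagged "empty", "lit", "union", "cat", and "star": the hypothetical function `remove_empty` returns `None` when the language is empty, and otherwise returns an equivalent tree that does not use ∅.

```python
def remove_empty(tree):
    """Return None if the tree's language is ∅; otherwise return an
    equivalent tree that contains no ∅ leaves."""
    kind = tree[0]
    if kind == "empty":
        return None
    if kind == "lit":
        return tree
    if kind == "union":
        s, t = remove_empty(tree[1]), remove_empty(tree[2])
        if s is None:
            return t                 # covers "both empty" and "only S empty"
        if t is None:
            return s
        return ("union", s, t)
    if kind == "cat":
        s, t = remove_empty(tree[1]), remove_empty(tree[2])
        if s is None or t is None:
            return None              # ∅ • A = A • ∅ = ∅
        return ("cat", s, t)
    if kind == "star":
        s = remove_empty(tree[1])
        # The Kleene closure of ∅ is {ε}, represented by the leaf ε.
        return ("lit", "") if s is None else ("star", s)
    raise ValueError(f"unknown node kind: {kind}")

# (0∅)*0ε + ∅ simplifies to a tree for ε0ε, which represents {0}.
zero, eps, empty = ("lit", "0"), ("lit", ""), ("empty",)
messy = ("union",
         ("cat", ("cat", ("star", ("cat", zero, empty)), zero), eps),
         empty)
print(remove_empty(messy))
```

The union branch handles four combinations of empty and non-empty children, while the concatenation and star branches each handle two.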
• Every regular expression over the one-symbol alphabet {⋄} is itself a string over the
seven-symbol alphabet {⋄, +, (, ), *, ε, Ø}. By interpreting these symbols as the digits 1
through 7, we can interpret any string over this larger alphabet as the base-8 representation
of some unique integer. Thus, the set of all regular expressions over {⋄} is at most as large
as the set of integers, and is therefore countably infinite. It follows that the set of all regular
languages over {⋄} is also countably infinite.
• On the other hand, for any real number 0 ≤ α < 1, we can define a corresponding language
Lα = {⋄ⁿ | α·2ⁿ mod 1 ≥ 1/2}.
In other words, Lα contains the string ⋄ⁿ if and only if the (n + 1)th bit in the binary
representation of α is equal to 1. For any distinct real numbers α ≠ β, the binary
representations of α and β must differ in some bit, so Lα ≠ Lβ. We conclude that the set
of all languages over {⋄} is at least as large as the set of real numbers between 0 and 1,
and is therefore uncountably infinite.
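Both halves of this counting argument can be made concrete. In the sketch below, ⋄ is a placeholder that I chose for the single alphabet symbol, and the assignment of digits 1 through 7 to the seven symbols in the order shown is likewise an assumption. Exact rational arithmetic from the standard `fractions` module avoids floating-point error in the test for membership in Lα.

```python
from fractions import Fraction

# Assumed digit order: ⋄ is 1, + is 2, ( is 3, ) is 4, * is 5, ε is 6, Ø is 7.
SYMBOLS = "⋄+()*εØ"

def encode(regex):
    """Read a regular expression string as a base-8 numeral with digits 1..7."""
    value = 0
    for symbol in regex:
        value = 8 * value + SYMBOLS.index(symbol) + 1
    return value

def in_L_alpha(alpha, n):
    """Is the string of n copies of ⋄ in L_alpha, i.e. is (alpha * 2^n mod 1) >= 1/2?"""
    return (alpha * 2**n) % 1 >= Fraction(1, 2)

print(encode("⋄*"))                                             # 13, since 15 in base 8 is 13
print([n for n in range(8) if in_L_alpha(Fraction(1, 3), n)])   # [1, 3, 5, 7]
```

Because the digit 0 is never used, distinct strings always receive distinct integers. The second line of output matches the binary expansion 1/3 = 0.010101..., whose 2nd, 4th, 6th, and 8th bits are 1.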
We will see several explicit examples of non-regular languages in later lectures. In particular, the
set of all regular expressions over the alphabet {0, 1} is itself a non-regular language over the
alphabet {0, 1, +, (, ), *, ε, Ø}!
Exercises
1. (a) Prove that ∅ • L = L • ∅ = ∅, for every language L.
(b) Prove that {ε} • L = L • {ε} = L, for every language L.
(c) Prove that (A • B) • C = A • (B • C), for all languages A, B, and C.
(d) Prove that |A • B| ≤ |A| · |B|, for all languages A and B. (The second · is multiplication!)
i. Describe two languages A and B such that |A • B| < |A| · |B|.
ii. Describe two languages A and B such that |A • B| = |A| · |B|.
(e) Prove that L∗ is finite if and only if L = ∅ or L = {ε}.
(f) Prove that if A • B = B • C, then A∗ • B = B • C∗ = A∗ • B • C∗, for all languages A, B,
and C.
(g) Prove that (A ∪ B)∗ = (A∗ • B∗)∗, for all languages A and B.
4. For each of the following languages in {0, 1}∗ , describe an equivalent regular expression.
There are infinitely many correct answers for each language. (This problem will become
significantly simpler after we’ve seen finite-state machines.)
[Hint: Yes, all three proofs use induction, but induction on what? And yes, all three
proofs.]
For example, 1((0∗10)∗1)∗0 is plus-free and (therefore) top-plus; 01∗0 + 10∗1 + ε is
top-plus but not plus-free, and 0(0 + 1)∗(1 + ε) is neither top-plus nor plus-free.
Recall that two regular expressions R and S are equivalent if they describe exactly the
same language: L(R) = L(S).
(a) Prove that for any top-plus regular expressions R and S, there is a top-plus regular
expression that is equivalent to RS.
(b) Prove that for any top-plus regular expression R, there is a plus-free regular
expression S such that R∗ and S∗ are equivalent.
(c) Prove that for any regular expression, there is an equivalent top-plus regular
expression.
You may assume the following facts without proof, for all regular expressions A, B, and C:
8. (a) Describe and analyze an efficient algorithm to determine, given a regular expression R,
whether L(R) = ∅.
(b) Describe and analyze an efficient algorithm to determine, given a regular expression R,
whether L(R) = {ε}. [Hint: Use part (a).]
(c) Describe and analyze an efficient algorithm to determine, given a regular expression R,
whether L(R) is finite. [Hint: Use parts (a) and (b).]
In each problem, assume you are given R as a regular expression tree, not just a raw string.