Lecture Notes Formal Languages Nouwen
Rick Nouwen
Version number: 2.022, March 22, 2021
Contents
1 Formal languages 1
1.1 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The Kleene star . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Formal languages and decision problems . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Computability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 How many languages are there? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.6 Formal versus natural language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Regular languages 16
3.1 Regular languages and finite state automata . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Closure properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3 Non-regularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4 The pumping lemma for regular languages . . . . . . . . . . . . . . . . . . . . . . . . 21
4 Formal grammars 24
4.1 Formal definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2 Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Parse trees and ambiguity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.4 Grammar equivalence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.5 Regular grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5 Context-free languages 31
5.1 Push-down automata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2 Context-free grammar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.3 Chomsky Normal Form . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.4 Pumping lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.5 Closure properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.6 Mirroring versus copying, and natural language . . . . . . . . . . . . . . . . . . . . . 40
Changes made to these lecture notes since the start of the block:
1 Formal languages
1.1 Strings
A set is the most basic mathematical method of describing collections. The only thing that matters to a set is the elements
that it contains. There is no notion of order in a set, nor is there a possibility of repeated membership. Something is either in a set or it is not; nothing else is relevant. To illustrate, {1, 1} is a very odd way of writing down the set {1}. There is no difference between {1, 2} and {2, 1}. The set {1, {1}} contains the number 1 only once: it has two elements, the number 1 and the set containing that number.
Order and multiplicity do matter in the pairs we form when we form a Cartesian product. For instance, {1, 2} × {1, 2} contains both (1, 2) and (2, 1) as elements. It also contains the pair (1, 1). So, while two sets are equal if and only if they have the same elements, two ordered pairs are equal if and only if they have the same elements in the same positions.
A string is like an ordered pair, except that there are no restrictions on how many elements it contains. We usually write
strings without any extra notation. So, we write 1 for the string just containing a single 1 and we write 111 for the string
that has three positions, each of which contains 1. As with ordered pairs, ab ≠ ba and aa ≠ a.
Nota bene:
As always, we want mathematics to be grounded in set theory and so we would like ordered pairs and strings to
correspond to sets. But how can we do this? How can we represent order in something that is fundamentally
unordered? The Polish mathematician Kazimierz Kuratowski proposed a way to do this. He identified the ordered
pair (x, y) as the set {{x}, {x, y}}. That is, the first element in an ordered pair is the element that occurs in all
the sets, while the second element is the element that occurs in just one. (Exercise: check that this still makes
sense when you have an ordered pair like (1, 1).)
Like ordered pairs, strings can also be grounded in sets. Set-theoretically, a string is a function from natural numbers (the positions in the string) to the elements that make up the string. As we know, a function is a set of ordered
pairs and is thus itself also grounded in set theory. (See above). Example: the string 3512JK corresponds to
{(1, 3), (2, 5), (3, 1), (4, 2), (5, J), (6, K)}.
Strings can be extended indefinitely. Say we have some string containing just the number 3, repeated many times. We
can form a different string by just appending another 3 to this string. This new string can be extended in the same way,
etc. This means that there are infinitely many strings, even if we build strings just from a single element.
Strings can also have no elements. The empty string is written as ε. (Alternative notations for the empty string include Λ, λ and e.)
Strings can be concatenated. If α and β are strings, then α ⌢ β is the unique string such that the elements in β follow the elements in α, preserving the order of the elements in both strings. Concatenation is not commutative: for instance, 12 ⌢ 21 ≠ 21 ⌢ 12. But concatenation is associative: 12 ⌢ (21 ⌢ 12) = (12 ⌢ 21) ⌢ 12 = 122112. The empty string acts as a so-called identity element for concatenation, which means that for any string ϕ, it holds that ε ⌢ ϕ = ϕ ⌢ ε = ϕ. (Compare: 0 is the identity element for addition. For example, a + 0 = 0 + a = a.)
As will become evident below, it is often handy to have a special notation for repetitions of symbols in a string. For instance, we will sometimes write 1³ for the string 111, and so 1 2² 3³ 4⁴ is short for 1223334444. The set {1ⁿ | n > 0} is the set of all strings that contain 1 or more 1s and nothing else.
Strings have huge importance for artificial intelligence. This is because many kinds of knowledge can be stored and
represented as a string. For instance, any text, whether it is the content of a book, a web page or a governmental law, etc.
is a string of letters, digits, spaces and punctuation. Similarly, any computer program can be represented as a string of
letters, digits, spaces and punctuation. (Alternatively, a computer program can be seen as a string of binary digits.) Also,
any image can be seen as a string of pixel values and a sound recording is a string of values that represent subsequent
properties of an audio signal. In general, when computers perform tasks, they perform tasks on strings.
Definition 1
Kleene closure: For any set A, the following holds:
• Base case: ε ∈ A∗
• Recursive step: If s ∈ A∗ and t ∈ A, then s ⌢ t ∈ A∗
• Any element of A∗ is either ε or the result of a finite number of applications of the recursive step
Definition 2
Kleene closure: For any set A, A∗ = ⋃{Aⁱ | i ≥ 0 and i ∈ N}
Where:
• A⁰ = {ε}
• Aⁱ = {σ ⌢ a | σ ∈ Aⁱ⁻¹ and a ∈ A} for any i ∈ N such that i > 0
For example, {1}∗ is the set {ε, 1, 11, 111, 1111, . . .}. The set {1, 2}∗ corresponds to {ε, 1, 2, 11, 12, 21, 22, 111, 112, . . .}.
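To make this concrete, here is a small illustration added to these notes (not the notes' own code): a Python sketch that builds A⁰, A¹, . . . , Aᵏ for a finite alphabet, following Definition 2. The name kleene_up_to is just a label chosen here.

# Sketch: collect A^0 ∪ A^1 ∪ ... ∪ A^k for a finite alphabet A,
# following Definition 2 (A^0 = {ε}, A^i = {σ ⌢ a | σ ∈ A^(i-1), a ∈ A}).
def kleene_up_to(alphabet, k):
    level = {""}                  # A^0: only the empty string ε
    result = set(level)
    for _ in range(k):
        level = {s + a for s in level for a in alphabet}
        result |= level           # add the next level to the union
    return result

print(sorted(kleene_up_to({"1", "2"}, 2), key=lambda s: (len(s), s)))
# ['', '1', '2', '11', '12', '21', '22']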
A special case is ∅∗. Let’s see which set this is by applying the above definition. First of all, ε is in ∅∗, since the definition has it that ε is in any set that results from Kleene closure. Note then that ∅∗ ≠ ∅, since we have found a string that is included in the Kleene closure of the empty set. According to the definition, further strings in ∅∗ are now to be the result of concatenating a given string in that set with some element in the original set. Since ∅ has no elements, we end up with no further strings, and so: ∅∗ = {ε}.
What is special about ∅∗ is that it is the only Kleene closure that is finite. In particular:
Theorem 1
For any non-empty finite or countably infinite set X, X ∗ is countably infinite.
Proof
According to the second definition I gave for Kleene closure, X∗ is the infinite union of a family {Xⁱ | i ≥ 0}.
First case: X is finite. Take any set Xⁱ, the set of strings of length i that can be built from the elements of X. For a set of cardinality c, the number of strings of length n that you can build from this set equals cⁿ. So, |Xⁱ| = |X|ⁱ. This means that each Xⁱ is finite. We can thus enumerate all the strings, by just first enumerating all the strings of length 0, then the finite number of strings of length 1, then the finite number of strings of length 2, etc. This yields a countably infinite number of strings.
Second case: X is countably infinite. We know that X∗ is the union of a countable infinity of sets Xⁱ. Each of these sets contains a countable number of strings. To see this, first look at X⁰, which is obviously countable, since |X⁰| = 1. Next, X¹ is countably infinite, since it contains all and only the strings of length 1 made up of the countably infinite elements of X. All further sets Xⁱ are the result of concatenating one of the elements in X to one of the elements in Xⁱ⁻¹. We can show that if Xⁱ⁻¹ is countably infinite, then so is Xⁱ by creating a table where the (countably infinite) columns represent elements of X (so, X = {x1, x2, x3, . . .}) and the (countably infinite) rows represent strings in Xⁱ⁻¹ (which we take to be {s1, s2, s3, . . .}). We can enumerate the elements in this table using the enumeration strategy depicted in the table below. (This is similar to how we normally show that there is a countably infinite number of rational numbers.) This shows that Xⁱ is countably infinite whenever Xⁱ⁻¹ is. Since X¹ (and, in fact, X⁰) is countable, so are all sets Xⁱ.
        x1   x2   x3   x4   x5   . . .
s1       1    2    4    7   11
s2       3    5    8   12
s3       6    9   13
s4      10   14
s5      15
s6
. . .
Now we need to prove that a countably infinite union of countably infinite sets is countably infinite. For ease of reference, let’s name the elements of the individual subsets Xⁱ that make up X∗ as follows: Xⁱ = {si1, si2, si3, . . .}. As I’ve shown, all these sets Xⁱ are countable and there are countably many of them. X∗ is the union of all these sets, so the task is to show that we can enumerate all the elements of all these sets Xⁱ. We can refer to the individual elements in this enumeration as sij and enumerate as follows.
        j = 1   j = 2   j = 3   j = 4   j = 5   . . .
i = 1      1       2       4       7      11
i = 2      3       5       8      12
i = 3      6       9      13
i = 4     10      14
i = 5     15
i = 6
. . .
This means that we start with s11, then s12, then s21, s13, etc. This way, we will reach each element in each set Xⁱ. Note that we left out X⁰, but since X⁰ contains a finite number of elements (namely just ε) we can just add its contents to the enumeration.
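As an aside (a sketch added here, not part of the original notes), the enumeration strategy used in this proof can be written out as a small generator over index pairs; the name diagonal_order is just a label.

import itertools

# Sketch: produce index pairs (i, j) in the diagonal order used in the proof;
# the k-th pair corresponds to the cell numbered k in the tables above.
def diagonal_order():
    d = 2
    while True:
        for i in range(1, d):     # walk along the anti-diagonal i + j = d
            yield (i, d - i)
        d += 1

print(list(itertools.islice(diagonal_order(), 7)))
# [(1, 1), (1, 2), (2, 1), (1, 3), (2, 2), (3, 1), (1, 4)]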
1.3 Formal languages and decision problems
Consider the following alphabet A, which consists of the ten digits and the twenty-six capital letters:
A = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} ∪ {A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z}
The set A∗ is the set of all possible finite combinations of elements in this set. A formal language is a particular subset
of that set. For instance, Dutch postal codes consist of 4 numerical digits followed by 2 capital letters. So, the set of all
Dutch postal codes is a formal language over alphabet A. Similarly, a company may make many kinds of models of a
certain product and label them with alphanumeric combinations. These labels, too, would form a formal language over
alphabet A.
These examples are rather trivial illustrations of what a formal language is. To understand the value of formal languages,
it is important to understand the relation to computer science and in particular to one kind of task that computers can
perform. So-called decision problems are problems that, given an input, ask for a binary decision to be made. Here are
some examples of decision problems:
Decision problems divide inputs up into two classes: one kind of input will result in “yes”, the rest will result in “no”.
Because of this, decision problems correspond to formal languages. A language L over Σ is a subset of Σ∗ and, as such,
this language represents a choice between those strings in Σ∗ that are in L (the inputs resulting in “yes”) and those strings
in Σ∗ that are not in L (the inputs resulting in “no”). Consider, for example, the final example of a decision problem, where
the task is to decide whether the input is to be permitted as a password for the user. This task amounts to saying yes
to the admissible combinations and no to the non-admissible ones. The set of admissible combinations can be seen as a
formal language and, so, the decision problem reduces to computing this language. That is, performing this task is the same as deciding whether or not a string is in the language.
Formally, decision problems are related to characteristic functions:
Definition 3
Say we are interested in some set of elements U . Let A be a subset of U . The characteristic function of A,
fA : U → {0, 1} is defined as follows:
fA(x) = 1 whenever x ∈ A, and fA(x) = 0 otherwise.
Conversely, the language Lf corresponding to some characteristic function f is defined as:
Lf = {x | f(x) = 1 and x ∈ U}
We can think of a decision problem as corresponding to both a language and a characteristic function. Computing the function f can be equated with deciding on membership in Lf. While there are many tasks that do not correspond to
solving a decision problem, very often such tasks can be reduced to a decision problem. This is why decision problems
are central to the scientific study of computation. To illustrate, consider the task of simple arithmetic addition. This task
is normally not represented as a decision problem, but rather as a problem that requires finding the right answer: for
instance, what is the number that equals 1+4? However, being able to solve such problems entails that you can solve a
related decision problem, namely the problem of how to distinguish correct answers from incorrect ones. So, the task of
addition as a decision problem amounts to the language that contains strings like “1+4=5”, but not strings like “1+4=6”.
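As an illustration (a sketch added here, not the notes' own code), the characteristic function of this "addition language" can be written directly on strings:

# Sketch: characteristic function of the language of correct additions,
# containing strings like "1+4=5" but not "1+4=6".
def f_addition(x: str) -> int:
    try:
        lhs, result = x.split("=")
        a, b = lhs.split("+")
        return 1 if int(a) + int(b) == int(result) else 0
    except ValueError:
        return 0                  # malformed strings are not in the language

print(f_addition("1+4=5"), f_addition("1+4=6"))    # 1 0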
1.4 Computability
Say we want to make computers perform all sorts of tasks for which we humans need intelligence. The naive way to
go about this would just be to take one task at a time and work on that task until we have satisfactory performance by a
computer. Without a theory of what it means for a computer to compute, however, we have no way of knowing whether
the things we are attempting are possible, or in what way the individual tasks are related, or what the complexity is of
the task we are looking at, etc.
Historically, formal languages are at the basis of theories of computability. This is the notion used to reason about which
tasks computers can perform. It turns out that some functions are not computable. This doesn’t mean that we simply
haven’t found a way of getting a computer to perform the task corresponding to that function, or that we didn’t manage
to build a computer powerful enough to do so. Rather, it means that we can prove that it is theoretically (and therefore
also practically) impossible to compute these functions.
We know this because we do have a theory of computation. Alan Turing (1912-1954) developed an abstract model of
computation, the Turing machine, and used it to reason about decision problems. The most famous example of something
Turing proved to be impossible is the halting problem. The task in the halting problem is the following. We would like an algorithm that takes as input some computer program code, together with an input to feed to that program, and that decides for us whether the program will halt or run indefinitely. Take for instance the following two mini pseudo code programs:
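The two programs themselves are not reproduced in this version of the notes. Based on the description that follows, they might have looked roughly like this (rendered here in Python rather than the original pseudo code):

# Hedged reconstruction of the two mini programs described below.
def test(n):
    return 1                      # halts immediately for any input

def oei(n):
    while n > 0:                  # for any n > 0 this loop never terminates
        n = n + 1                 # the pseudo code writes this as n++
    return n                      # only reached when n <= 0, e.g. oei(0)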
It is easy to see that test(2) will halt. (It returns 1). Also, it is easy to see that oei(2) will not halt. Since 2>0, it will
keep on adding 1 to n indefinitely. (n++ is short for assigning to n a value that is 1 higher than the current value.) The
question is now whether we can think of a function halt that looks roughly as follows:
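The sketch of halt is likewise not reproduced here; its intended shape is only a specification (and, as the argument below shows, no actual body can exist for it):

# Hedged sketch: halt is only specified, never implemented.
def halt(program, argument):
    # should return "YES" if program(argument) eventually halts,
    # and "NO" if it would run forever; the point of the argument
    # below is that no such implementation can exist
    ...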
and which would output YES for halt(test,2), YES for halt(oei,0) and NO for halt(oei,2).
Let us assume that halt is a computable function and then show that this runs into a contradiction. If halt is computable,
then we should also be able to compute the following function:
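The function in question is not shown in this version of the notes; following the description below, it would look roughly like this (assuming the halt and oei sketches above):

# Hedged reconstruction of the function described below.
def barber(f):
    if halt(f, f) == "YES":       # would f, applied to itself, halt?
        oei(2)                    # ... then loop forever
    else:
        return "NO"               # ... otherwise halt immediately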
This is a program that takes a function f as input and then does the following. It uses halt to test whether f applied to
itself halts. (That is, halt(f,f) gives YES if f(f) halts and NO if f(f) does not halt.) If f(f) does halt, then barber runs
oei(2). As a consequence barber(f) will fail to halt whenever f(f) does halt. If f(f) does not halt, then barber(f)
will return NO and, as such, halt.
Now we imagine running the following: barber(barber). What would be the result of running this? Well, barber
plays the role of f here. So to see what happens, we need to know what the outcome of halt(barber, barber)
is. This outcome will be YES if barber(barber) halts and NO if it does not. But now we get a contradiction. Let’s
say barber(barber) halts. Then halt(barber,barber) returns YES and barber will run oei(2) indefinitely and, so,
barber(barber) will not halt. Let’s then instead say that barber(barber) does not halt. In that case halt(barber,barber)
does not return YES and as a result, barber will return NO. So, if barber(barber) does not halt, we are forced to conclude that it does halt, namely by outputting NO. Whatever we do, we run into a contradiction. Note that there is nothing
odd about the barber function. It contains a normal if-then-else condition which tests the output of a function, runs
another function in one case and returns something in the other case. The only reason we could have to doubt whether
we could define barber is the fact that it makes use of halt. The contradiction we run into, therefore, will have to do with our assumption that halt exists and, as such, we can conclude that halt is a non-computable function.
1.5 How many languages are there?
A language over an alphabet Σ is a subset of Σ∗, so the set of all languages over Σ, which we will write as L(Σ), is the power set ℘(Σ∗). The following theorem tells us how big this set is.
Theorem 2
If X is a countably infinite set, then ℘(X) is an uncountably infinite set.
Proof
Let’s say that ℘(X) is countable. In that case we should be able to enumerate all its sets. The theorem
will be proven by showing that this assumption is untenable.
Consider the following table. The columns of the table are the enumerated elements of X =
{x1 , x2 , x3 , . . .}. The rows are intended to be the enumerated elements of ℘(X). That is, each row repre-
sents a set as a vector, where a 1 indicates that the element corresponding to the column is a member of
that set and 0 indicates that it is not.
     x1   x2   x3   x4   . . .
      0    0    0    0
      1    0    0    0
      0    1    0    0
      1    1    0    0
      0    0    1    0
      1    0    1    0
     . . .
So, the first row in this table is the set that contains none of the elements in X. That is, it’s ∅. The second
row is the set containing just x1 . The sixth represents {x1 , x3 }. Etc. Because we are systematically going
through all the elements of X (the columns of the table), the rows should enumerate all the subsets of X,
if there are countably many.
But now take the diagonal of the table: the first value of the first row, the second value of the second row, and so on. This yields a vector D = (0, 0, 0, 0, . . .).
Say we take this vector and change all its values to the opposite value, so we write a 0 where it said 1 and
a 1 where it said 0. We then get V = (1, 1, 1, 1, . . .). Since this is just a vector of 0s and 1s, it represents
a subset of X. If ℘(X) is countable, V should correspond to some row in the table. But notice that it
couldn’t possibly be a row. If it is a row, then the diagonal will cross that row at some column c. But the value of V at c has the opposite value from the diagonal at c and, so, V cannot
be in the table. It follows that ℘(X) is not countable.
So, even if the alphabet is extremely simple, such as Σ = {1}, L(Σ) contains an uncountably infinite set of languages.
In what follows we will try and understand certain interesting subclasses of this uncountable infinity, some of which are
countable. Before we do so, we turn from formal languages to natural ones. In particular, we will have a brief look at the
role that infinity plays in natural languages.
1.6 Formal versus natural language
Our knowledge of language is not completely visible in our everyday use of language, but we may access it through introspection. For instance, any native speaker of English will be able to decide that (1) is not a sentence of that language. (Linguists mark ungrammatical sentences with a “*”. Note that this is fully unrelated to the Kleene star.)
This is just word soup, a seemingly random sequence of English words. It is very easy to come to the judgment that
this is not a sentence. You know how to make that judgment, because you have knowledge of language. Nobody taught
you that this sentence is ungrammatical, it is just something that you have the ability to decide on, via your language
competence. What’s more, that same competence allows you to decide that the following sentence is grammatical.
The sentence in (2), a famous example due to Chomsky, clearly makes no sense whatsoever. But that does not seem to
matter for your ability to judge it as grammatical: Every speaker of English will judge (2) differently from (1). While
neither makes sense, (2) is a sentence, but (1) is not. This illustrates the robustness of our knowledge of language. At the
heart of our everyday linguistic functioning are abilities that we only become aware of through introspection.
Crucially, there is an in principle infinite number of combinations of words for which we are (in principle) able to decide
whether they are grammatical or not. For instance, we know that (2) remains grammatical when we add a prepositional
phrase:
(5) The cat stood on a table in a room in a castle on a hill in a country where people wear hats with feathers from
birds from a forest near a lake …
Any speaker of English can and will decide that (5) is grammatical. There are obvious practical reasons why natural lan-
guage sentences are never infinite, but our introspection tells us that they seem to allow for infinitely repeating patterns.
As such, our linguistic competence is infinite. Crucially, we achieve this infinite potential through finite means, given
that our brains are finite.
This link with infinity highlights the commonality between natural and formal languages. Our knowledge of a natural
language involves the decision problem that asks to distinguish grammatical from ungrammatical sentences and, so, there
is a formal language that corresponds to that decision problem. What is more, we are interested in understanding what
finite computational means allow us humans to capture this infinite formal language.
In the examples above, I associate knowledge of language with the ability of deciding on the grammaticality of a sentence.
But there are many different kinds of linguistic knowledge. Parallel to the examples above, there is a similar case to be
made that you have knowledge of how individual words are built. If you are a speaker of English, you will know that
you can add the prefix “re” to some verbs, to get another verb. For instance, “discover” becomes “rediscover”, “invent”
becomes “reinvent” etc. Adding “-y” or “-ion” to these turns the verb into a noun: “discovery”, “rediscovery”, “invention”,
“reinvention”. If you are a native speaker of English, you were never instructed that this is how things work. This ability
has simply emerged as part of your knowledge of language.
In summary, our linguistic abilities are a good example of a set of human abilities that involve deciding on membership
in an infinite formal language. So, from the context of artificial intelligence, it would make sense to understand what
kind of formal languages are part of our human linguistic competence. What is their complexity? How do they relate
to formal languages we know more generally from our theory of computation? These are the kind of questions we will
approach in what follows.
2 Finite state automata
How do we study decision problems and computability? Turing proposed to use abstract machines. In particular, he
proposed a mathematical model of computation that we now call the Turing machine. A Turing Machine is a formal
model that mimics an imaginary machine that consists of a tape that is segmented into an infinite number of cells, a head
that can move along the tape and read the contents of a cell as well as write content to a cell, and a mechanism that
controls the actions of the machine based only on what the head reads and the state the machine is in. We won’t really
discuss Turing Machines in these lecture notes. However, we will discuss simpler abstract models of computation. We
start with the finite state automaton (FSA). As we will see, the difference in complexity between a Turing Machine and
an FSA has consequences for the kinds of languages that can be computed. In fact, we will see that by studying models of
computation of differing complexity, we get insights about different classes of decision problems.
The task of an FSA is to receive an input string and to decide on whether to accept it or not. In other words, FSAs solve
a particular decision problem for the input they receive. Acceptance means that the string is a member of the language
corresponding to the decision problem the FSA is meant to solve. Not accepting a string means that the string does not
belong to the language. An FSA is an automaton that can be in one of a finite number of possible states. At the moment
the FSA receives input, it is in a dedicated start state. Every FSA has one or more acceptance states. A string is accepted
whenever the FSA is in one such state when the input has been read completely.
A finite state automaton reads the input bit by bit. For each state the automaton can be in, there is a specification of
what to do when a certain symbol (or sometimes string of symbols) is read. The possible actions are extremely limited,
however. The only thing an automaton can do is either remain in the same state or transition to a different state before
proceeding to read the next symbol.
FSAs are abstract models, but there are examples of actual machines that resemble finite state automata. It is illustrative
to start by looking at such a machine. Take, for instance, a digital lock on the door of a safe. The lock has two states:
locked or unlocked. When it receives input, it is always in the locked state. It has a key pad that can record an input,
namely a string of key presses. There is a key for each numerical digit, as well as for the symbols “!” and “#”. The user can
type in a code and send the code by pressing “#”. If the user makes a mistake, he or she can type “!” to start completely
from scratch. Let’s say the code is “0000”. We can now say that the unlocked state is an acceptance state and the locked
state is not. The task of the digital lock is to accept strings like 0000#, 9!0000#, 0000!9!0000#, etc. and to not accept strings
like 01234#, 0000, 0000!#, 0000000#, etc. That is, the safe unlocks only when a string of key presses occurs that corresponds
to the sending of the “0000” code.
Working locks like this actually exist. However, irrespective of how such real locks work, we can represent the task they
perform as a formal language over alphabet Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, #, !}. The set of strings that unlock the safe is a
subset of Σ∗ . We can define a finite state automaton that computes this language in the way just described. Below, we will
give a precise formal definition of FSAs, which will make it possible to define a particular automaton as a fully specified
mathematical object. But before we turn to these formal definitions, we will look at an intuitive way of representing
automata. It is often handy to represent models of computation graphically. For FSAs, we do this in the following way:
• states are circles, with the name of the state written in the circle
• acceptance states are indicated by a double line
• arrows indicate transitions between states triggered by the reading of a symbol, where we label the arrow with
the responsible symbol
• the start state is indicated with an incoming arrow that is not connected to any other state
As you can see, the automaton has 7 states. One of these is the start state (s0 ) and one of them is an acceptance state (a).
The only way this FSA will get into the acceptance state is by reading the symbol 0 four consecutive times, immediately
followed by a “#”. Such an input has the automaton transition from s0 to s1 , s2 , s3 , s4 and finally a. Anything else will
either get the automaton into the state x or, if a ! is typed, back to the start state. From x, the only way to get further is
to go back to s0 by reading a “!”.
Definition 4
A finite state automaton is a 5-tuple hΣ, S, s, A, Ri, such that:
• Σ is a finite set
• S is a finite set
• s∈S
• A⊆S
• R ⊆ (S × Σ∗ ) × S
Let’s unpack this. Σ is the alphabet of the language that the automaton is to compute. So, each input is a string in Σ∗ .
S is the set of states that make up the automaton and s is its unique element that acts as the start state. A is that subset
of S of accepting states. Finally, R is where all the work happens. R is a relation between pairs of states and strings
(elements of S × Σ∗ ) and states. It is often handy to represent R as a table, the transition table of the automaton, where
the rows represent states, columns represent symbols and cells represent to which state the machine should transition
when a certain symbol is read in a certain state.
To illustrate, here is a formal specification of the lock automaton that is graphically depicted above.
h{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, !, #}, {a, x, s0 , s1 , s2 , s3 , s4 }, s0 , {a}, Ri
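The transition relation R of the lock is not spelled out in this version of the notes. Based on the description of the lock above, it would contain transitions along the following lines (a partial, hypothetical reconstruction, written here as a Python dictionary for compactness):

# Hypothetical sketch of the lock's transition relation R.
R_lock = {
    ("s0", "0"): "s1", ("s1", "0"): "s2",     # reading the code 0000 ...
    ("s2", "0"): "s3", ("s3", "0"): "s4",
    ("s4", "#"): "a",                         # ... and sending it with #
    ("s0", "!"): "s0", ("s1", "!"): "s0",     # ! starts over from scratch
    ("s2", "!"): "s0", ("s3", "!"): "s0",
    ("s4", "!"): "s0", ("x", "!"): "s0",
    # every remaining (state, symbol) pair leads to the error state x
}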
As a second, simpler example, consider the following automaton:
h{0, 1}, {s0 , s1 }, s0 , {s0 }, {((s0 , 0), s0 ), ((s0 , 1), s1 ), ((s1 , 0), s1 ), ((s1 , 1), s0 )}i
This automaton reads strings in {0, 1}∗ . It accepts only those that contain an even number of 1s (irrespective of the
number of 0s it contains). (It also accepts strings that contain no 1s, so I am assuming that 0 is even.) It does this by
accepting any string it reads until it encounters a 1, which triggers a transition to the non-acceptance state s1 . From
there, the only way to get back into an acceptance state is by once more reading a 1. So, every odd 1 that is read moves
the machine into non-acceptance and only the next 1 moves it back to acceptance.
2.2 Non-determinism
In an FSA hΣ, S, s, A, Ri it is the transition table R that regulates what happens when a particular symbol is encountered
in a particular state. In the examples we have seen so far R was such that it unambiguously determines what happens in
each possible situation. Such an automaton is called deterministic: given a string, there is a unique sequence of transitions
that the automaton will go through. A nondeterministic finite state automaton (NFSA) is a finite state automaton that does not provide a unique sequence of transitions for each input. Here is a simple example of an NFSA.
[Diagram: an NFSA over {0, 1} with states a (the start state) and b (an acceptance state); reading a 1 in state a allows the automaton either to stay in a or to move to b. Its formal specification is given below.]
This is an automaton over alphabet {0, 1}. It is nondeterministic because whenever a “1” is read in state a, there are
two candidate transitions: either the automaton remains in state a, or it transitions into b. To distinguish between
deterministic and nondeterministic automata, it is a good idea to have a closer look at the transition table component of
FSAs.
As I wrote above, the transition table R is a relation, a subset of (S × Σ∗) × S. Recall first of all that any subset Z of a Cartesian product X × Y is a relation. Next, recall that such a relation Z is a function if and only if whenever both (x, y) ∈ Z and (x, y′) ∈ Z, then y = y′. In other words, a relation Z ⊆ X × Y is a function whenever each element in X is paired with at most 1 element in Y. The definition of a finite state automaton states that the transition specification R
is a relation. An FSA is deterministic, whenever this transition relation is a function that maps pairs of states and symbols
to states.
Definition 5
A deterministic finite state automaton is a 5-tuple hΣ, S, s, A, Ri where
• Σ is a finite set
• S is a finite set
• s∈S
• A⊆S
• R : (S × Σ∗ ) → S
Nondeterministic automata can also be viewed as having a functional transition relation. However, in a non-deterministic
machine, each state-symbol pair does not yield a unique state to transition to, but rather a set of states. This means we
can define NFSAs as follows:
Definition 6
A nondeterministic finite state automaton is a 5-tuple hΣ, S, s, A, Ri where
• Σ is a finite set
• S is a finite set
• s∈S
• A⊆S
• R : (S × Σ∗ ) → ℘(S)
For example, the nondeterministic automaton depicted above can be specified as:
h{0, 1}, {a, b}, a, {b}, {((a, 0), {a}), ((a, 1), {a, b}), ((b, 1), {b}), ((b, 0), {a})}i
If we present a deterministic FSA with an input, it is easy to check whether the input is accepted or not, since the
transition function fully specifies what to do with the input. (The next section formalises this.) Things are different for
non-deterministic FSAs. There will be inputs that present us with a choice at certain points in reading the string. As
such, we would need a strategy for navigating through such choices in order to be able to decide whether a string is
recognised by an NFSA. So, figuring out whether an NFSA accepts a string or not is much harder than figuring out
whether a deterministic automaton accepts a string. Given this, you may wonder why we bother discussing NFSAs
in the first place. Well, first of all, an NFSA tends to be much smaller in size than a deterministic FSA that does the
same thing. Second, there exist quite a few efficient algorithms that help us determine whether a string is accepted by a
non-deterministic automaton or not. So, practically speaking, it is sometimes simply preferable to work with NFSAs.
In terms of expressive power, it turns out that the choice between deterministic and non-deterministic finite state au-
tomata is immaterial. For any language recognized by some deterministic FSA, there exists a non-deterministic FSA
recognizing that same language, and vice versa.
Theorem 3
(i) Any non-deterministic FSA N is such that there exists a deterministic FSA D such that L(N ) = L(D). Con-
versely, (ii) for any deterministic FSA D, there exists a non-deterministic FSA N , such that L(N ) = L(D).
I will not provide the proof here, but hope to give you some intuition. Note first of all that (ii) is somewhat trivial. In
fact, given definition 6, any deterministic automaton can be rewritten as a non-deterministic one simply by changing the
transition function R : (S × Σ∗) → S into a function R′ : (S × Σ∗) → ℘(S). We can do this as follows: R′ is the function such that for any pair (x, a) ∈ S × Σ∗: R′((x, a)) = {R((x, a))}. So, the nondeterministic version of the deterministic
automaton is the automaton that maps each combination of a state and a string to the singleton set containing the state
the deterministic version maps that combination to.
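This rewriting can be phrased as a one-line program (a sketch added here, encoding a transition function as a dictionary from (state, symbol) pairs to states):

# Sketch: wrap a deterministic transition function into its
# nondeterministic counterpart, which maps every pair to a singleton set.
def as_nondeterministic(R):
    return {pair: {state} for pair, state in R.items()}

# the "even number of 1s" automaton from earlier, now in NFSA form
R = {("s0", "0"): "s0", ("s0", "1"): "s1",
     ("s1", "0"): "s1", ("s1", "1"): "s0"}
print(as_nondeterministic(R)[("s0", "1")])    # {'s1'}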
Part (i) of the theorem is trickier and was first proved by Rabin and Scott in 1959. One way to see that (i) is the case is the existence of reliable strategies to transform any nondeterministic automaton into a (usually considerably larger) deterministic one. This is outside the scope of the current course.
2.3 Acceptance
So far, we only have had an informal understanding of what happens when an automaton receives an input. We know
that input strings are read symbol by symbol and that the transition function determines for each read symbol what to
do, based on the state the FSA is in. The input is accepted if and only if the automaton is in an acceptance state after
reading the final symbol. Here is a formal definition of acceptance.
Definition 7
A finite state automaton M = hΣ, S, s, A, Ri accepts an input string w1 . . . wn (notation: M ▷ w1 . . . wn ) if and only if there exists a computation s0 , s1 , . . . , sn such that:
• for all 0 ≤ i ≤ n: si ∈ S.
• s0 = s
• sn ∈ A
• for all 1 ≤ i ≤ n: either si ∈ R((si−1 , wi )) or R((si−1 , wi )) = si
This says that a string gets accepted by an automaton whenever we can go through the string, symbol by symbol (or if
appropriate substring by substring), and for each symbol we can find a transition in such a way that, if we start in the
start state the machine will transition to an acceptance state after reading the final symbol. The last line in this definition
is a bit complex. This is because I am assuming that M can be of two types. Either it is deterministic, in which case
R((si−1 , wi )) will always point to at most one value, or it is nondeterministic in which case it will map to a set of states.
Here’s an example. Consider the automaton
X = h{1}, {a, b, c, d}, a, {c}, RX = {((a, 1), b), ((b, 1), c), ((c, 1), d), ((d, 1), d)}i
This automaton accepts only a single string, namely “11”. It rejects anything shorter or longer in {1}∗ . The string “11” is
accepted because there is a computation abc such that RX ((a, 1)) = b and RX ((b, 1)) = c. There is no other string that
has a computation with the required properties. For instance, the computation that goes with “111” is abcd, but d ∉ A and, so, the string is not accepted. Similarly, the computation that goes with “1” is ab. Here, too, the final state, b, is not an acceptance state, so the string is not accepted.
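For deterministic automata, the definition of acceptance translates directly into a procedure. The following sketch (added here, not the notes' own code) runs the automaton X on an input string, again encoding the transition function as a dictionary:

# Sketch: acceptance for a deterministic FSA; undefined transitions
# mean that no computation exists, so the string is rejected.
def accepts(start, accepting, R, string):
    state = start
    for symbol in string:
        if (state, symbol) not in R:
            return False
        state = R[(state, symbol)]
    return state in accepting      # accepted iff the computation ends in A

R_X = {("a", "1"): "b", ("b", "1"): "c",
       ("c", "1"): "d", ("d", "1"): "d"}
print(accepts("a", {"c"}, R_X, "11"))     # True
print(accepts("a", {"c"}, R_X, "111"))    # False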
We are now ready to finally connect finite state automata to formal languages. Given the definition of acceptance that I
defined above, this is very straightforward. The language computed by an automaton is the set of strings that it accepts:
Definition 8
Let M be a finite state automaton. Its corresponding formal language is written as L(M) and is defined as:
L(M) = {x | M ▷ x}
2.4 Determinism
Recall that a relation f ⊆ A × B is a function whenever for each a ∈ A it is the case that (a, x) ∈ f & (a, y) ∈ f ⇒ x = y.
There exist relations that are functional in this sense, but which do not provide an output for each input in the domain.
These functions are called partial functions: a function f : A → B is partial whenever there exists x ∈ A such that there
exists no y ∈ B such that (x, y) ∈ f . That is, for some x, f (x) will be undefined. If f is a partial function, we often write
f(x) ↓ for f(x) being defined and f(x) ↑ for f(x) being undefined.
The definition of a deterministic FSA states that the transition table needs to be a function R : (S × Σ∗) → S. It does not state, however, whether R needs to be a total function or whether it may be partial. That is, we could define an automaton that fails to specify transitions for some configurations the automaton can encounter. Take, for example, the following
variation on X :
X ′ = h{1}, {a, b, c, d}, a, {c}, RX ′ = {((a, 1), b), ((b, 1), c), ((c, 1), d)}i
X ′ differs from X only in that its transition function is not defined for (d, 1). It turns out that this does not really matter so much for our definition of acceptance. Take for instance the string “1111”. We have both X ⋫ 1111 and X ′ ⋫ 1111. The
reason for not accepting the string, however, is different in both cases. For a string to be accepted by an FSA, two things
have to be the case: (A) there has to be a computation for the string; (B) this computation needs to have certain properties
(start in a starting state, end in an acceptance state). In X , “1111” is not accepted because of (B): d is not an acceptance
state. In X ′, “1111” is not accepted because of (A), the lack of a computation.
We can thus distinguish two kinds of determinism:
Definition 9
A finite state automaton hΣ, S, s, A, Ri is p-deterministic whenever:
• Σ is a finite set
• S is a finite set
• s∈S
• A⊆S
• R : (S × Σ∗ ) → S and R is a partial function
There is a simple procedure that can turn a p-deterministic automaton into a deterministic FSA with a total transition
function.
Let A = hΣ, S, s, A, Ri be p-deterministic. Let σ be a state such that σ ∉ S. The t-deterministic version of A is defined below. (Recall that ↓ indicates that the function applied to its arguments is defined and ↑ indicates that it is not defined.)
hΣ, S ∪ {σ}, s, A, R′i
where for any (x, y) ∈ (S ∪ {σ}) × Σ∗:
R′((x, y)) = R((x, y)) if R((x, y)) ↓
R′((x, y)) = σ if R((x, y)) ↑
Take for example the following p-deterministic automaton. (It’s p-deterministic since there is no defined transition for
reading 0 in state b).
[Diagram: an automaton with states a (the start state) and b (an acceptance state); a loops on 0, reading 1 moves from a to b, and b loops on 1. There is no transition for reading 0 in state b.]
Assuming that the alphabet is {0, 1}, this is the automaton given by
h{0, 1}, {a, b}, a, {b}, {((a, 0), a), ((a, 1), b), ((b, 1), b)}i
To make this automaton t-deterministic, we add a state x and make sure that cases that are undefined trigger transition
to x. So, the altered automaton is
h{0, 1}, {a, b, x}, a, {b}, {((a, 0), a), ((a, 1), b), ((b, 1), b), ((b, 0), x), ((x, 0), x), ((x, 1), x)}i
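The same procedure can be written as a small program (a sketch added here, using the dictionary encoding of transition functions from the earlier examples):

# Sketch: complete a partial (p-deterministic) transition function by
# sending every undefined (state, symbol) pair to a fresh sink state.
def totalize(R, states, alphabet, sink="x"):
    total = dict(R)
    for q in list(states) + [sink]:
        for a in alphabet:
            total.setdefault((q, a), sink)   # only fills in missing pairs
    return total

R_partial = {("a", "0"): "a", ("a", "1"): "b", ("b", "1"): "b"}
print(totalize(R_partial, ["a", "b"], ["0", "1"])[("b", "0")])    # x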
[Diagram: the t-deterministic automaton just defined, with the added sink state x; every previously undefined transition now leads to x, and x loops on both 0 and 1.]
[Diagram: a finite state transducer with a single state s, which is both the start state and an acceptance state; it has two transitions looping on s, labelled 1:0 and 0:1.]
In graphical depictions of FSTs like this one, we use the colon as a special symbol that distinguishes the two tapes. We
can for instance think of symbols to the left of “:” as part of the input and symbols to the right as part of the output. As
such, this FST reads a string of arbitrary combinations of 1s and 0s and replaces it with a string in which every occurrence of
a 1 is replaced by a 0 and vice versa. So, this FST recognises input-output pairs like 010001:101110 and 000:111 and does
not accept pairs like 00:01 or 111:0000.
Formally, a finite state transducer is more complex than a finite state automaton. This is because the FST distinguishes
two (possibly distinct) alphabets, one for each side of the colon. Furthermore, there are two transition functions: one determining the next state, and one determining the output that is produced as the input is read.
Definition 10
A finite state transducer is a 7-tuple hΣ1 , Σ2 , S, s, A, R1 , R2 i where
• Σ1 is a finite set
• Σ2 is a finite set
• S is a finite set
• s∈S
• A⊆S
• R1 : (S × Σ∗1 ) → S
• R2 : (S × Σ∗1 ) → Σ∗2
For example, the bit-flipping transducer depicted above can be specified as:
h{0, 1}, {0, 1}, {s}, s, {s}, {((s, 1), s), ((s, 0), s)}, {((s, 1), 0), ((s, 0), 1)}i
The input part of an FST accepts strings just like an FSA would. A string w1 . . . wn is accepted if there exists a computation s0 s1 . . . sn such that: (i) s0 is the start state, (ii) sn is an acceptance state and (iii) for each 1 ≤ i ≤ n: R1((si−1 , wi )) = si .
The following definition extends acceptance to include the output string. (Here, we once more use “:” as a special symbol
to distinguish the input from the output).
Definition 11
A finite state transducer T = hΣ1 , Σ2 , S, s, A, R1 , R2 i accepts an input-output pair w1 . . . wn : o1 . . . on if and
only if there exists a computation s0 , s1 , . . . , sn such that:
• for all 0 ≤ i ≤ n : si ∈ S
• s0 = s
• sn ∈ A
• for all 1 ≤ i ≤ n: R1 ((si−1 , wi )) = si
• for all 1 ≤ i ≤ n: R2 ((si−1 , wi )) = oi
Practically, FSTs are used to translate or manipulate input strings. That is, they are employed when we need to produce an
output string that is based on the input string. Note, however, that the definition given here defines acceptance of both
the input and the output, rather than just defining acceptance of the input and calculating the corresponding output.
So, even though transducers can be seen as mapping an output to an input, they can be defined in terms of a decision
problem. The problem of knowing which output goes with what input can be reduced to the problem of recognizing a
set of input output pairs. Put differently, while FSAs correspond to formal languages, i.e. sets of strings, FSTs correspond
to relations, sets of pairs of strings.
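As an illustration (a sketch added here, not the notes' own code), the bit-flipping transducer can be run on an input string as follows, using R1 for the state transitions and R2 for the emitted output symbols:

# Sketch: run a finite state transducer and return the output string
# it pairs with the input, or None if the input is not accepted.
def transduce(start, accepting, R1, R2, string):
    state, output = start, ""
    for symbol in string:
        output += R2[(state, symbol)]     # emit the output symbol
        state = R1[(state, symbol)]       # move to the next state
    return output if state in accepting else None

R1 = {("s", "0"): "s", ("s", "1"): "s"}
R2 = {("s", "0"): "1", ("s", "1"): "0"}
print(transduce("s", {"s"}, R1, R2, "010001"))    # 101110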
3 Regular languages
As we saw above, there are uncountably infinitely many languages, whenever the alphabet is non-empty. We can divide the set of languages up into interesting subclasses, however. We start with so-called regular languages: the class of
languages recognized by finite state automata. To define such languages, we need the following operation:
Definition 12
Set concatenation: If L1 and L2 are two sets of strings, then L1 · L2 = {l1 ⌢ l2 | l1 ∈ L1 and l2 ∈ L2 }.
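In programming terms (a sketch added here), set concatenation is a single comprehension:

# Sketch: concatenation of two languages given as sets of strings.
def concat(L1, L2):
    return {x + y for x in L1 for y in L2}

print(concat({"1", "2"}, {"1"}))    # {'11', '21'} (in some order)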
For example {1, 2} · {1} = {11, 21}. Given this operation, we can provide a recursive definition of regular languages.
Definition 13
The set of regular language over Σ, notation RΣ , is recursively defined as follows:
• Base: ∅ ∈ RΣ , {ε} ∈ RΣ and for each σ ∈ Σ: {σ} ∈ RΣ .
• Recursive step: If L1 ∈ RΣ and L2 ∈ RΣ , then
– L1 · L2 ∈ RΣ ,
– L1 ∪ L2 ∈ RΣ
– L∗1 ∈ RΣ and L∗2 ∈ RΣ
• Nothing else is in RΣ .
Take for example Σ = {0, 1}. The definition above now states that sets like {0} and {1} are regular languages and so
are their concatenations {01}, {10}, {00}, {11}. Also, unions of these sets are regular, such as for instance {1, 01, 11}
or {0, 00}. Concatenations of such sets are also regular again: {10, 100, 010, 0100, 110, 1100}. Given the complexity of
languages that one can build using just concatenation and union, you may wonder what the Kleene closure recursive
step adds to the definition of regular languages. It matters in a small but important way. If we had used an alternative
definition, one that did not include Kleene closure as a recursive step, then the definition would in fact describe an
importantly different class of languages. Without Kleene, in any regular language strings of length exceeding 1 would
result from a finite number of applications of the concatenation step. But that means that for any regular language, there
is an upper bound to the length of the strings it contains. As such, all regular languages would be finite. The addition of
the Kleene recursive step allows for infinite regular languages.
Note that Σ∗ = {ε} ∪ Σ ∪ Σ · Σ ∪ Σ · Σ · Σ ∪ . . .. This in turn may make you wonder why the definition includes
concatenation as a recursive step. This is to include finite languages as regular. Without this step, the only finite regular languages would be languages whose strings are at most one symbol long.
Theorem 4
A language over Σ is regular if and only if there is a finite state automaton that recognizes it, i.e. L ∈ RΣ if and only if there is an FSA M such that L(M) = L.
I will not prove this theorem here, but I will sketch the proof for one part of the theorem, which will hopefully make things more intuitive.
Theorem 5
For each regular language L, there is a finite state automaton M such that L(M) = L.
Proof
Recall that there are three basic cases of regular languages and three possible recursive steps. What we
need to show is that all these cases can be handled by FSAs.
Base case 1, the regular language ∅: Just take an FSA without an accepting state: hΣ, {s}, s, ∅, ∅i.
Base case 2, the regular language {ε}: This is a minimal variation on base case 1: hΣ, {s}, s, {s}, ∅i.
Base case 3, regular languages {σ} for σ ∈ Σ: hΣ, {s, t}, s, {t}, {((s, σ), t)}i
Recursive steps: Say we have L(M1 ) = L1 and L(M2 ) = L2 . According to the definition of regular
languages, whenever L1 and L2 are regular, then so are L1 · L2 , L1 ∪ L2 , L∗1 and L∗2 . So, for the current proof
we need to show that there is an FSA M1·2 such that L(M1·2 ) = L1 · L2 , there is an FSA M1∪2 such
that L(M1∪2 ) = L1 ∪ L2 and there is an FSA M1∗ such that L(M1∗ ) = L∗1 . The following procedures
will deliver these FSA. Let M1 = hΣ, S1 , s1 , A1 , R1 i and M2 = hΣ, S2 , s2 , A2 , R2 i.
Recursive step 1, concatenation: M1·2 = hΣ, S1 ∪ S2 , s1 , A2 , R1 ∪ R2 ∪ {((a, ε), s2 ) | a ∈ A1 }i
Recursive step 2, union: M1∪2 = hΣ, S1 ∪ S2 ∪ {x}, x, A1 ∪ A2 , R1 ∪ R2 ∪ {((x, ε), s1 ), ((x, ε), s2 )}i.
Recursive step 3, Kleene closure: M1∗ = hΣ, S1 ∪ {x}, x, A1 ∪ {x}, R1 ∪ {((a, ε), s1 ) | a ∈ A1 } ∪ {((x, ε), s1 )}i
I illustrate recursive step 2 with an example. First consider X, the language over {0, 1} that contains strings of arbitrary
combinations of 1s and 0s as long as the number of 1s is zero or a multiple of 3, and the language Y over {0, 1} which
is similar except that all strings are such that the number of 1s is zero or a multiple of 2. Here are two automata that recognise these languages.
[Diagrams: an automaton for X with states s0 (the start and acceptance state), s1 and s2, which cycles s0 → s1 → s2 → s0 on reading 1s and loops on 0s, and an automaton for Y with states t0 (the start and acceptance state) and t1, which alternates between t0 and t1 on reading 1s and loops on 0s.]
The union of the two languages, Z = X ∪ Y , is the language that contains arbitrary combinations of 0s and 1s as long as
the number of 1s is either zero, a multiple of 3 or a multiple of 2. So, Z contains strings like 00001010001 and 000100001
but not 0001010001011. To build an automaton for Z we can use the recipe for recursive step 2 from the proof idea above:
[Diagram: the union automaton for Z, built as in recursive step 2: a new start state x with ε-transitions to s0 and to t0, alongside the two automata above.]
3.2 Closure properties
We say that a set is closed under a certain operation whenever performing the operation on members of the set results
in another member of the set. For instance, the natural numbers are closed under squaring. This is because whenever
x ∈ N, then x² is also in N. The natural numbers are not closed under the square root operation. For instance, √2 ∉ N.
Given the definition of regular languages, we already know that they are closed under union, concatenation and Kleene
star. It can be important to know of such closure properties. This is because it may help us understand relations between
decision problems. For instance, if we know that a certain problem can be seen as the concatenation of two problems
solvable by finite state automata, then we know we can solve this problem using an FSA, too.
Beyond the three cases that follow directly from the definition of regular languages, this class has further closure prop-
erties. Here, I only discuss one of them, namely the case of complementation. Say that LΣ is some language over Σ. The complement of LΣ is the set of strings in Σ∗ that are not in LΣ . So: L̄Σ = Σ∗ \ LΣ . There is an easy procedure that allows you
to construct an FSA for the complement of a language on the basis of an automaton for the original language.
Definition 14
Let M = hΣ, S, s, A, Ri be t-deterministic. The FSA M̄ is defined as:
M̄ = hΣ, S, s, S \ A, Ri
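In a concrete representation of automata (a sketch added here, with the automaton as a tuple in the order hΣ, S, s, A, Ri), definition 14 amounts to nothing more than flipping the set of acceptance states:

# Sketch: the complement of a t-deterministic FSA keeps everything
# except that exactly the non-accepting states become accepting.
def complement(fsa):
    sigma, states, start, accepting, R = fsa
    return (sigma, states, start, states - accepting, R)

parity = ({"0", "1"}, {"s0", "s1"}, "s0", {"s0"},
          {("s0", "0"): "s0", ("s0", "1"): "s1",
           ("s1", "0"): "s1", ("s1", "1"): "s0"})
print(complement(parity)[3])    # {'s1'}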
Take, for example, the following FSA, which recognizes {0ⁿ1ᵐ | n is even and m is odd}. (Note that 0 is an even number, so the FSA also accepts strings like 1 or 111).
[Diagram: a t-deterministic FSA for this language with states s0 (the start state), s1, s2, s3, s4 and a sink state z; s3 is the only acceptance state.]
As per definition 14, we can take the complement of this FSA to recognize the language that contains strings that are not
a sequence of an even number of 0s followed by an odd number of 1s. This complement language contains strings like 01
and 0011, but also strings like 10100. The complement automaton is simply the following:
[Diagram: the complement automaton: the same FSA, but now every state except s3 is an acceptance state.]
Note that this FSA just accepts anything as long as it doesn’t finish reading the input in s3 , because that would result in
a string that belongs to the original language.
It is important that complementation takes place with a t-deterministic automaton. Consider the following p-deterministic
FSA.
[Diagram: a p-deterministic FSA for the same language, with states s0 (the start state), s1, s2, s3 and s4, but without a sink state; s3 is the only acceptance state.]
This FSA also corresponds to {0ⁿ1ᵐ | n is even and m is odd}, but it handles strings outside the language differently: rather than sending them to a sink state, it simply provides no computation for them. For instance, it fails to accept 000001 because there is no computation for that string that results in an acceptance state. But if we were to take the complement of this FSA, we’d get the following:
[Diagram: the result of flipping the acceptance states of the p-deterministic FSA: every state except s3 is now an acceptance state.]
Crucially, this is not the complement of {0ⁿ1ᵐ | n is even and m is odd}. For instance, it fails to accept strings like 01
or 1000.
Theorem 6
Whenever M is t-deterministic, then L(M̄) = Σ∗ \ L(M).
Proof
Say that M = hΣ, S, s0 , A, Ri. The assumption is that M is t-deterministic. That means that for any
string w1 . . . wn in Σ∗ there is a computation s0 . . . sn , such that for any wi , ((si−1 , wi ), si ) ∈ R. There
are two kinds of computations of this kind: if sn ∈ A, then this is the computation of an accepted string
and if sn 6∈ A, then it is the computation of a string that is not accepted. But this immediately means that
we can swap accepted and non-accepted strings by just swapping acceptance states with states that are
not acceptance states. It follows from the definition of acceptance and of L(·) that L(M̄) = Σ∗ \ L(M).
Now everything is in place to prove that regular languages are closed under complementation.
Theorem 7
If LΣ is a regular language over Σ, then so is L̄Σ .
Proof
Assume that L is regular. Since L is regular, there exists a t-deterministic FSA M such that L(M) = L. Given theorem 6, we know that L(M̄) = L̄. And so there exists an FSA corresponding to L̄, namely M̄. Given theorem 4, this means that L̄ must be regular.
3.3 Non-regularity
We’ve defined regular languages and we have seen that these are exactly the languages for which we can build a finite
state automaton. The original abstract model of computing put forward by Alan Turing was a model that is quite a bit
more complex than the FSAs we’ve discussed so far. One crucial difference (but by no means the only difference) is that
a Turing Machine can not only read the input string, it can also write symbols and revisit what it has written down at a
later stage in the computation. That is, a Turing Machine has a memory mechanism, while a finite state automaton does
not. This difference matters. Turing Machines can compute a proper superset of the formal languages that FSAs can.
Here is a classical example of a language that is not regular, a language for which there is no finite state automaton.
{0ⁿ1ⁿ | n ≥ 1}
This language is the infinite set of sequences of 0s and 1s, where all the 1s follow all the 0s and there are exactly as many 0s as 1s. In the next section, I discuss a proof of why this is not a regular language, but before we turn to this proof, it
is important to understand the intuition behind why there can’t be an FSA that corresponds to this language. To do this
let’s just try and find a corresponding FSA and see the trouble such an attempt meets on the way. So, we are looking for
finite state automaton A such that L(A) = {0ⁿ1ⁿ | n ≥ 1}.
Take an arbitrary string in the language, say 00001111. Say A accepts this string, and say that part of the acceptance is a
computation s0 s1 s1 s1 s1 corresponding to 0000. What would the rest of the computation look like? Well, after reading
the four 0s, A will read the first 1 and it would then go to a new state s2 . For the language {0ⁿ1ⁿ | n ≥ 1}, it is crucial however that the automaton will now somehow “remember” that it has read four 0s. But there is no way it can remember this. As soon as the machine transitions from s1 to s2 all information about what it has encountered so far is gone. For instance, the following FSA is not the A we are looking for. It corresponds to {0ⁿ1ᵐ | n ≥ 1 and m ≥ 1}, which is a proper superset of {0ⁿ1ⁿ | n ≥ 1}. That is, it does not just include strings like 0011 and 00001111, but also strings like 001111 and 000011.
[Diagram: a three-state FSA with states s0 (the start state), s1 and s2 (an acceptance state): reading 0 moves from s0 to s1, s1 loops on 0, reading 1 moves from s1 to s2, and s2 loops on 1.]
The only way an FSA can have some sort of memory is to have a different state for each number of 0s that has been read.
For example:
[Diagram: an automaton with states t0 (the start state), t1, t2 and t3 connected by 0-transitions, and, from each ti, a chain of i further states reached by reading 1s, the last of which (f1, f2, f3) is an acceptance state.]
This is an automaton that accepts 01, 0011, and 000111, but not strings like 011 or 001. Every time it is in a state where
n 0s have been read, there will be n consecutive states that can be transitioned to by reading a 1. Only the last of these
is an acceptance state. This trick works for {01, 0011, 000111} and we could naturally extend the automaton to handle
similar strings. Unfortunately, we cannot use it for {0ⁿ1ⁿ | n ≥ 1}. This is because if we want to extend this automaton to accept the countably infinite number of strings in that language we will need an infinite number of states. For starters,
we will need a countably infinite number of states like t0 , t1 , t2 , t3 , etc. Of course if we included such an infinite number
of states, we would no longer have a finite state automaton.
The fact that not all languages are regular shows that the finite state automaton is a limited model of computation. It
also shows that there is a distinguished class of decision problems, namely those corresponding to regular languages,
for which this limited model suffices. That is, it can be important to know whether a certain problem is regular or not,
because this will tell us the complexity of the computational mechanism we need to use.
As we will see, there are multiple classes of formal languages, which form a hierarchy of increasing complexity. For now,
however, we stick with the distinction between those languages that are and those that are not regular.
Proving that a language is regular is relatively easy. All you need to do is provide a finite state automaton for the language.
For instance, say I want to prove that {0ⁿ1ᵐ | n ≥ 1 and m ≥ 1} is regular. I could present you with the (p-deterministic) automaton
R = h{0, 1}, {s0 , s1 , s2 }, s0 , {s2 }, {((s0 , 0), s1 ), ((s1 , 0), s1 ), ((s1 , 1), s2 ), ((s2 , 1), s2 )}i
If I can now show that L(R) = {0ⁿ1ᵐ | n ≥ 1 and m ≥ 1}, then this is proof that this language is regular. But this is
easy to prove. All the computations that lead to accepted strings are of the form s0 followed by at least one s1 and at
least one s2 . This is only possible if the string has at least one 0, followed by at least one 1.
We now turn to the formal proof of non-regularity.
3.4 The pumping lemma for regular languages
Theorem 8
Any finite language is regular.
Proof
Let L be some finite set of strings over alphabet Σ. To prove that L is regular, we need to provide a finite
state automaton for it. Let L = {x1 , . . . , xn }. So, there are n strings in L. Let’s say that each string xi ∈ L can be written as xi,1 . . . xi,k , where k = |xi |. For the whole language to be accepted, for every string xi , we’ll need a computation s0 . . . sk , with s0 the start state, sk an accepting state and a transition ((st−1 , xi,t ), st ) for every 1 ≤ t ≤ k. This means we can provide the FSA by just taking the set of all these transitions and the set of all corresponding states: M = hΣ, S, s0 , A, Ri such that (writing sv,0 for the start state s0 ):
• S = {s0 } ∪ {s1,w | 1 ≤ w ≤ |x1 |} ∪ . . . ∪ {sn,w | 1 ≤ w ≤ |xn |}
• A = {sv,w | xv ∈ L and w = |xv |}
• R = {((sv,w−1 , xv,w ), sv,w ) | 1 ≤ v ≤ n and 1 ≤ w ≤ |xv |}
Since L(M) = L it follows that L is regular. Since we took an arbitrary finite language, it follows that
every finite language is regular.
For example, take the {01, 0011, 000111} language. Here, the n in the proof is 3 and we have x1 = 01, x2 = 0011 and
x3 = 000111. (Any other order would do equally well.) These strings are 2, 4 and 6 symbols long. This means we need
2+4+6+1=13 states, one for each symbol plus a start state. If we follow the procedure in the proof, we end up with the
following FSA.
[FSA diagram: from the start state s0 , a chain s1,1 s1,2 reads 01, a chain s2,1 . . . s2,4 reads 0011, and a chain s3,1 . . . s3,6 reads 000111; the last state of each chain is accepting.]
Note that there are of course simpler finite state automata for this language. (I gave one earlier.) But that is beside the point. By following this procedure we are guaranteed to provide an FSA for each finite language, thus proving that finite languages are regular. Providing a simpler FSA would only reiterate that point.
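The construction in the proof is mechanical enough to write down as a procedure. The following sketch builds the states, accepting states and transitions for a given finite set of strings in the way the proof describes; the state names and the representation of transitions as triples are my own choices for illustration.

# Sketch: the FSA construction from the proof, for a finite language L.
# Each string x_v gets its own chain of states s{v,1}, ..., s{v,|x_v|};
# the start state is shared and the last state of each chain accepts.

def fsa_for_finite_language(strings):
    start = "s0"
    states = {start}
    accepting = set()
    transitions = set()
    for v, x in enumerate(strings, start=1):
        previous = start
        for w, symbol in enumerate(x, start=1):
            state = f"s{v},{w}"
            states.add(state)
            transitions.add((previous, symbol, state))
            previous = state
        accepting.add(previous)
    return states, start, accepting, transitions

states, start, accepting, transitions = fsa_for_finite_language(["01", "0011", "000111"])
print(len(states))        # 13 states, as computed above
print(sorted(accepting))  # the last state of each chain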
The above proof exploits the fact that finite state automata can include any finite number of states. At the same time, an FSA only ever has finitely many states, which means that an FSA for an infinite language must accept strings that are longer than the number of states in the FSA. This in turn means that the computation of some accepted string in an infinite language must involve the same state more than once. (This is sometimes referred to as the pigeon hole principle: if you have m pigeons in n < m pigeon holes, then at least one pigeon hole will have to contain multiple pigeons.) In other words, FSAs for infinite languages must contain a loop.
It turns out that loops in finite state automata have a very particular feature. Say that we have some FSA and there is a
computation of string x which starts in the start state and ends in some state si . Let’s say that when the FSA subsequently
reads string y, it enters a loop. That is, the first symbol of y triggers a transition to sj and the sequence of transitions
brought about by the rest of the symbols in y ends once more in sj . Finally, let’s assume that from there the FSA can
read string z, transitioning from sj to sk and subsequently to some states ending in some accepting state sf . So, the FSA
accepts the string xyz and the computation that goes with this string is s0 . . . si sj . . . sj sk . . . sf . Because this accepting computation has a loop along its path, there will be similar accepting computations in which the loop is traversed zero times, or more than once. In other words, this FSA will also accept xz, xyyz, xyyyz, etc. In fact the set {xy n z | n ≥ 0} will be a subset of the language corresponding to this automaton. This is the guiding intuition behind the pumping lemma, which can be stated as follows: for every regular language L, there is a number p (the pumping length) such that every string σ ∈ L with |σ| ≥ p can be written as σ = xyz where |xy| ≤ p, |y| > 0 and xy n z ∈ L for every n ≥ 0.
Proof
We know that L is regular, so there exists an automaton M such that L(M) = L. Let us assume that
M has p states. Now take σ ∈ L such that σ = x1 . . . xn with n ≥ p. Since σ is accepted, we have a
computation c1 c2 . . . cp cp+1 . . . cn cn+1 such that ((ci , xi ), ci+1 ) is a transition in M, c1 is the start state
and cn+1 is an acceptance state. The first p + 1 states c1 . . . cp+1 of this computation already outnumber the
states of M, so by the pigeon hole principle there have to be some v < w ≤ p + 1 such that cv = cw . Let’s call this state
s. This means that the computation will look like c1 c2 . . . cv−1 s . . . s cw+1 . . . cn cn+1 . Let’s call the string
read along c1 c2 . . . cv−1 s x, the string read along the loop s . . . s y, and the string read along s cw+1 . . . cn cn+1 z.
By assumption, xyz ∈ L. Because s . . . s is a loop, it follows that xy n z ∈ L for any n ≥ 0.
Given that v < w, the loop reads at least one symbol, so |y| > 0. And since x and y together are read in
reaching cw , we have |xy| = w − 1 ≤ p.
The pumping lemma for regular languages is an incredibly useful lemma, for it provides us with a method of proof for non-regularity. This may seem unintuitive, but notice that the pumping lemma gives us a property that all regular languages have. This is not to say that all non-regular languages lack this property. In other words, if you find a language with the pumping property described by the lemma, this does not mean that this language is regular. However, if you
find a language that does not have the property, then you know it can’t be regular, since all regular languages have the
property.
Let’s say I have the task of finding cows. My problem is that I am very bad at identifying animals. I encounter three animals: a cat, a cow and a frog. All I know is the “tail lemma”: every cow has a tail. This lemma won’t help me with the cat or the cow. Both these animals could be a cow, but they could also both be other animals that have a tail (as the cat indeed is). The lemma does help me with the frog. The frog does not have a tail and therefore cannot be a cow.
Here is an example of the pumping lemma in action. Consider the standard example of a non-regular language L =
{0n 1n | n ≥ 1}. Assume that L is regular. Then the pumping lemma should hold for L. So, there should be some p such
that any string that is at least p symbols long can be “pumped” in the way described by the lemma. Take a string that is long
enough: 0p 1p (length 2 × p). We should now be able to divide this string into x, y and z such that |xy| ≤ p and |y| > 0. Since
|xy| ≤ p, it follows that both x and y only contain 0s. In particular, y = 0k for some 1 ≤ k ≤ p. Given the pumping lemma,
it now follows that xy i z ∈ L for every i ≥ 0. This runs into a contradiction. Take, for instance, xy 2 z. We know that xyz contains p 0s
and p 1s. We also know that y contains k 0s and k > 0. It follows that xy 2 z contains p + k 0s and p 1s. Since p + k > p,
it follows that xy 2 z ∉ L. This contradicts the assumption that L was a language for which the pumping lemma holds. It
follows that L is not regular.
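The contradiction in this argument is easy to make concrete: whatever the decomposition of 0p 1p into x, y and z with |xy| ≤ p and |y| > 0, pumping y once more always destroys membership. A small sketch, with a hypothetical pumping length chosen purely for illustration:

# Sketch: for 0^p 1^p, every legal decomposition x y z (|xy| <= p, |y| > 0)
# has y consisting of 0s only, so x y y z falls outside {0^n 1^n | n >= 1}.

def in_language(s):
    n = len(s) // 2
    return len(s) == 2 * n and n >= 1 and s == "0" * n + "1" * n

p = 7                     # a hypothetical pumping length, for illustration
sigma = "0" * p + "1" * p

for xy_len in range(1, p + 1):          # |xy| <= p
    for y_len in range(1, xy_len + 1):  # |y| > 0
        x = sigma[: xy_len - y_len]
        y = sigma[xy_len - y_len : xy_len]
        z = sigma[xy_len:]
        assert not in_language(x + y * 2 + z)

print("no decomposition survives pumping; the string", sigma, "is a counterexample")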
4 Formal grammars
The models of computation we looked at so far were automata, abstract machines that involve state transitions on the
basis of an input that is being read. In this section, we move to another model of computation, namely a formal grammar.
Later we will see how grammars are related to automata. In particular, we will discuss different kinds of formal grammar
and compare them to various kinds of automata beyond the class of finite state automata that we discussed so far.
4.1 Formal definition
A formal grammar is a 4-tuple hΣ, N, S, P i, where Σ is a finite alphabet of terminal symbols, N is a finite set of non-terminal symbols, S ∈ N is the start symbol and P is a finite set of productions, pairs of strings over Σ ∪ N . It is common to use capital letters for non-terminals and lower-case letters for terminals. Here is an example of a formal grammar:
X = h{1}, {S}, S, {(S, 1), (S, 1S)}i
This grammar has one non-terminal symbol S and two productions. We often write production as rules, using →. For
instance, the productions of the above grammar look like:
S → 1
S → 1S
When linguists are not concerned with matters of formal language theory, they often represent a formal grammar by just
giving the production rules. For instance, you may encounter a “grammar” like the following:
Y =
NP → N (PP)
DP → D NP
PP → P DP
N → box | table | diamond | room | house | sister | thief
P → in | on | of
D → the | a
Note first of all that from these production rules alone, you will be unable to identify what grammar is meant. This
is because there is no indication of what the start node is. The language corresponding to the intended grammar will
be quite different when “PP” is the start node compared to when “DP” is the start symbol. In that sense, an informal
grammatical representation like this corresponds to a set of grammars and a set of languages, one for each choice of the start symbol (DP, NP, PP). For instance, with DP as the start symbol the grammar derives DPs like those in (6); with PP as the start symbol it derives PPs like those in (7).
(6) a. a diamond
b. the table of the thief
c. a diamond in the box on a table in a room in the house of the sister of the thief of the diamond in the box
(7) a. in a diamond
b. of the table of the thief
c. of a diamond in the box on a table in a room in the house of the sister of the thief of the diamond in the box
The production rules above illustrate some common notational conventions that are handy to know. First of all, one rule
indicates the optionality of certain symbols by putting them in brackets. That is,
NP → N (PP)
is short for two rules, namely:
NP → N
NP → N PP
A different form of disjunction for production rules is the pipe symbol “|”. This indicates alternative right-hand sides for
the same left-hand side. So:
D → the | a
is short for:
D → the
D→a
4.2 Derivation
Derivation is the formal notion that concerns how a certain string is recognized/produced by a grammar. This notion is
comparable to the notion of computation that I introduced for finite state automata. It is defined as follows:
Definition 16
Let G = hΣ, N, S, P i and let α, β, γ, δ ∈ (N ∪ Σ)∗ . If (α, β) ∈ P then we say that δαγ directly derives δβγ,
which we write as: δαγ ⇒G δβγ. We say that α derives β, which we write α ⇒∗ β, whenever:
• α ⇒G β (α directly derives β), or
• there exists a non-empty sequence x1 . . . xn such that α ⇒G x1 ⇒G x2 ⇒G . . . ⇒G xn ⇒G β
I will drop the subscript indicating the grammar if it is clear for which grammar derivation is being discussed.
For example, the grammar X = h{1}, {S}, S, {(S, 1), (S, 1S)}i we gave above yields S ⇒∗ 1, since S ⇒ 1, since
(S, 1) is in the productions of this grammar. (That is, here we can take δ = γ = ε, the empty string.) We also have S ⇒∗ 111, since
S ⇒ 1S ⇒ 11S ⇒ 111. There are three direct derivations in this indirect derivation. First, S ⇒ 1S because S → 1S is a
production rule. (I.e. δ = γ = ε.) Second, 1S ⇒ 11S because (S, 1S) is a production rule. (Here, α = S, β = 1S, δ = 1,
γ = ε.) Finally, 11S ⇒ 111 because (S, 1) is a production rule. (So we take α = S, β = 1, δ = 11 and γ is empty.)
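To see direct derivation as a purely mechanical operation, here is a small sketch that applies a production (α, β) at a chosen position in a string, for the grammar X above. The representation of productions as pairs of Python strings is my own choice.

# Sketch: direct derivation for grammar X = <{1}, {S}, S, {(S,1), (S,1S)}>.
# A production (alpha, beta) applied at position i rewrites alpha to beta.

def directly_derives(string, production, position):
    """Return delta beta gamma, where string = delta alpha gamma."""
    alpha, beta = production
    assert string[position:position + len(alpha)] == alpha, "alpha not found here"
    return string[:position] + beta + string[position + len(alpha):]

# S => 1S => 11S => 111, the indirect derivation discussed above.
step1 = directly_derives("S", ("S", "1S"), 0)     # '1S'
step2 = directly_derives(step1, ("S", "1S"), 1)   # '11S'
step3 = directly_derives(step2, ("S", "1"), 2)    # '111'
print(step1, step2, step3)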
Another example: if we assume that the start symbol of grammar Y is DP, then it has a derivation for the string “the box
on the table”, which goes as follows.
DP ⇒ D NP ⇒ the NP ⇒ the N PP ⇒ the box PP ⇒ the box P DP ⇒ the box on DP ⇒ the box on D NP
⇒ the box on the NP ⇒ the box on the N ⇒ the box on the table
Note that in this derivation, I consistently replace the left-most non-terminal symbol in accordance with some production
rule. Such a derivation is called the left-most derivation. The right-most derivation consistently replaces the right-most
non-terminal and looks like this:
DP ⇒ D NP ⇒ D N PP ⇒ D N P DP ⇒ D N P D NP ⇒ D N P D N ⇒ D N P D table ⇒ D N P the table
⇒ D N on the table ⇒ D box on the table ⇒ the box on the table
There are also derivations that are neither left-most nor right-most. For example:
DP ⇒ D NP ⇒ D N PP ⇒ the N PP ⇒ the N P DP ⇒ the N P D NP ⇒ the N on D NP ⇒ the box on D NP
⇒ the box on D N ⇒ the box on the N ⇒ the box on the table
It is important to note that for all these derivations, there is a sense in which they came about as if by magic. We just
happened to consistently pick the right production rule to get to the string we were after. For instance, say that we
attempt a right-most derivation starting, as before, with the step DP ⇒ D NP. Given that our focus is on the right-most
non-terminal, we need to find a production rule for NP. Let’s say we pick NP → N. Then our derivation becomes DP ⇒ D
NP ⇒ D N. Now, we have no way to continue the derivation to derive the string “the box on the table”, since both D and
N are only rewritable as terminal symbols.
What matters for now is simply whether or not there exists a derivation for a certain string. If one exists, then the string
is part of the language corresponding to the grammar. Actually finding such a derivation (that is, proving that the string
is in the language) is a different and practical matter to which we turn later.
Definition 17
The language corresponding to a grammar G = hΣ, N, S, P i, written L(G) is defined as follows:
L(G) = {σ | S ⇒∗G σ}
4.3 Parse trees and ambiguity
A derivation like S ⇒ 1S ⇒ 11S ⇒ 111 can be represented graphically:
[Tree: S branches into 1 and S; that S branches into 1 and S; the lowest S branches into a single 1.]
These structures are called parse trees. They are not just the graphic representation of the derivation of a string. They are formal objects in their own right:
Definition 18
A parse tree is a pair (P, C), where
• P is a symbol, the parent node, and
• C is an ordered sequence of symbols and parse trees, the children
Nodes that are not trees are called leaves.
The following procedure produces parse trees from derivations. Whenever we have a direct derivation N ⇒ x1 . . . xn , we
build the parse tree (N, (x1 , . . . , xn )). The next step in the derivation could now rewrite xi . For instance x1 . . . xi . . . xn ⇒
x1 . . . y1 . . . yk . . . xn . We adjust the parse tree accordingly by replacing xi by the tree (xi , (y1 , . . . , yk )), yielding the tree
(N, (x1 , . . . , (xi , (y1 , . . . , yk )), . . . , xn )). Once we have done this for every part of the derivation we have the parse tree
corresponding to that derivation.
For instance, the tree just above definition 18 corresponds to the derivation S ⇒ 1S ⇒ 11S ⇒ 111 for grammar
X = h{1}, {S}, S, {(S, 1), (S, 1S)}i. I illustrate this using the following table, where the left-hand side shows the
unfolding of the tree, step by step in the derivation, and the right-hand side the production rule that was used in the
corresponding derivation step.
(S, (1, S))                           S → 1S
(S, (1, (S, (1, S))))                 S → 1S
(S, (1, (S, (1, (S, (1))))))          S → 1
Note that a string will typically have multiple derivations in a grammar. For starters, for each left-most derivation there will be a corresponding right-most derivation. However, this difference does not have any impact on the parse tree. I illustrate this with Y, repeated here, and the DP “the box on the table”:
Y =
NP → N (PP)
DP → D NP
PP → P DP
N → box | table | diamond | room | house | sister | thief
P → in | on | of
D → the | a
Definition 19
The trace of a derivation is the sequence of production rules (elements of P ) used by the derivation. (This is the
right column in the tables above.)
Two derivations are similar if their derivation traces are of equal length and contain the same productions. (That
is, they are simply a reordering of one another.)
The left- and right-most derivation of “the box on the table” above are derivation-similar. They involve the same steps, yet
in a different order. Try and convince yourself that each left-most derivation will have a derivation-similar right-most
derivation. Note as well that whenever two derivations are similar, they will have the same parse tree.
Given all this, we can turn things around. A parse tree is a representation of a class of derivations that are similar.
Technically, a parse tree corresponds to an equivalence class of derivations. This is because derivation-similarity is an
equivalence relation (it is transitive, reflexive and symmetric). As such, given a grammar and given some string we can
construct the set of derivations of the string in that grammar that are similar to one-another. This is an equivalence class.
For that class of derivations, there is exactly one parse tree.
Importantly, two derivations of the same string (and given the same grammar) are not always similar. Sometimes, the
same string may have multiple parse trees. In that case, the grammar is called ambiguous.
Here is an example of a grammar that is ambiguous:
h{0, 1}, {S, A}, S, {(S, 0), (S, 0S), (S, A), (S, 0A), (A, 1)}i
This grammar derives strings like 0, 00, 001, 00001, etc. Ambiguity arises when the string contains a 1. For instance, 001
can be derived in two ways: S ⇒ 0S ⇒ 00S ⇒ 00A ⇒ 001 or S ⇒ 0S ⇒ 00A ⇒ 001. The corresponding parse trees
are:
[Two parse trees for 001. First tree: S branches into 0 and S; that S branches into 0 and S; that S has the single child A; A has the child 1. Second tree: S branches into 0 and S; that S branches into 0 and A; A has the child 1.]
These trees are constructed from the derivations in the way described above: at each step in the derivation, the rewritten non-terminal in the tree receives the right-hand side of the production that was used as its children.
From the formal perspective we developed here, ambiguity is the notion that a grammar has two distinct parse trees for
the same string. Given the link between parse trees and derivation similarity (a parse tree corresponds to the equivalence
class of derivation-similar derivations), we can also see ambiguity as the phenomenon within a grammar where the same
string has two distinct left-most derivations. Or, similarly, a grammar is ambiguous whenever there’s a string that has
two distinct right-most derivations.
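For short strings, ambiguity in this sense can be checked by brute force: enumerate all left-most derivations of the string and count them. A sketch along those lines for the ambiguous grammar above; the length-based pruning relies on the fact that none of this grammar’s productions shrinks the string.

# Sketch: counting left-most derivations of a string, for the grammar
# <{0,1}, {S,A}, S, {(S,0), (S,0S), (S,A), (S,0A), (A,1)}>.

PRODUCTIONS = [("S", "0"), ("S", "0S"), ("S", "A"), ("S", "0A"), ("A", "1")]
NONTERMINALS = {"S", "A"}

def leftmost_derivations(current, target):
    """Yield every left-most derivation (as a list of sentential forms)."""
    if current == target:
        yield [current]
        return
    if len(current) > len(target):   # this grammar never shrinks a string
        return
    # find the left-most non-terminal
    position = next((i for i, c in enumerate(current) if c in NONTERMINALS), None)
    if position is None:
        return
    for lhs, rhs in PRODUCTIONS:
        if current[position] == lhs:
            rewritten = current[:position] + rhs + current[position + 1:]
            for rest in leftmost_derivations(rewritten, target):
                yield [current] + rest

derivations = list(leftmost_derivations("S", "001"))
print(len(derivations))   # 2: the grammar is ambiguous for 001
for d in derivations:
    print(" => ".join(d))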
4.4 Grammar equivalence
Different grammars can derive the same language while assigning different parse trees to its strings. Compare the tree that grammar Y assigns to “the box on the table” with the tree assigned by an alternative grammar for the same language in which the determiner and the PP attach differently:
[Two parse trees for “the box on the table”: on the left the tree assigned by Y (DP branching into D and NP, NP into N and PP, PP into P and DP); on the right a tree assigned by the alternative grammar (DP branching into DP and PP, the inner DP into D and N, the PP into P and DP).]
This shows that even though the two grammars are equivalent in the sense that they correspond to the same language,
they differ in the structures they assign to these strings.
Definition 20
Let G1 and G2 be two formal grammars. G1 and G2 are weakly equivalent whenever L(G1 ) = L(G2 ). G1 and G2
are strongly equivalent when they are weakly equivalent and they generate the same parse trees, given some
renaming (if needed) of non-terminal symbols.
The above two grammars are weakly but not strongly equivalent.
4.5 Regular grammars
As we will see below, formal grammars are much more expressive than finite state automata. However, there is a certain
class of formal grammars that is equivalent to FSAs in the sense that the languages that these grammars can compute are
exactly the regular languages. This is the class of regular grammars.
Definition 21
Let G = hΣ, N, S, P i be a formal grammar. G is a right linear grammar whenever for every (α, β) ∈ P it is the
case that α ∈ N and either β ∈ Σ∗ or β = xX with x ∈ Σ∗ and X ∈ N . G is a left linear grammar whenever
for every (α, β) ∈ P it is the case that α ∈ N and either β ∈ Σ∗ or β = Xx with x ∈ Σ∗ and X ∈ N . A
grammar is regular when it is either left or right linear.
This says that right linear grammars are grammars where all production rules take one of two forms: either a non-terminal
mapping to a string of terminal symbols, or a non-terminal mapping to a string of terminal symbols followed by a non-terminal. Left
linear grammars are similar, but with the order of terminals and non-terminal swapped in the second type of rule. The
simple grammar we saw above, h{1}, {S}, S, {(S, 1), (S, 1S)}i, is right-linear. There is a left linear grammar that derives
exactly the same language, namely h{1}, {S}, S, {(S, 1), (S, S1)}i.
These grammars are weakly equivalent to each other. In fact, for each left-linear grammar there is a weakly equivalent
right-linear grammar, and vice versa. Given this, the class of languages that can be generated from left-linear grammars
is the same as the class of languages that can be generated from right-linear grammar. These are the regular languages.
Importantly, these are exactly the languages that finite state automata can recognize.
Theorem 9
If M is a finite state automaton, then there exists a regular grammar G such that L(G) = L(M). If G is a regular
grammar, then there exists a finite state automaton M such that L(M) = L(G). As a result, the set of languages
computable with regular grammars is the set of regular languages.
I will sketch the proof by going through a procedure to turn a regular grammar into a finite state automaton and vice versa.
Say we have a (right-linear) regular grammar G = hΣ, N, S, P i. The corresponding FSA is M = hΣ, N ∪ {F }, S, {F }, Ri,
with R = {((X, x), Y ) | (X, xY ) ∈ P, x ∈ Σ∗ and Y ∈ N } ∪ {((X, x), F ) | (X, x) ∈ P and x ∈ Σ∗ }.
For example, applying this procedure to h{1}, {S}, S, {(S, 1), (S, 1S)}i, we get h{1}, {S, F }, S, {F }, {((S, 1), S), ((S, 1), F )}i:
[FSA diagram: a state S with a loop reading 1 and a transition reading 1 to the accepting state F.]
To “translate” an FSA into a corresponding regular grammar, we can do the following. Say M is an FSA of the form
hΣ, S, s, A, Ri. The corresponding right-linear regular grammar is G = hΣ, S, s, P i with P = {(t, xu) | ((t, x), u) ∈ R} ∪ {(f, ε) | f ∈ A}.
If we apply this to the automaton we gave for the language {1n | n > 0}, we get h{1}, {S, F }, S, {(F, ε), (S, 1F ), (S, 1S)}i, or written as rules:
S → 1S
S → 1F
F → ε
Although this grammar looks different from the original grammar we gave for this language, it is easy to verify that it
corresponds to exactly the same language.
The real proof that regular grammars correspond to regular languages involves showing that these two procedures (trans-
lating regular grammars into FSAs and vice versa) work across the board.
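The grammar-to-automaton direction of this sketch of a proof is simple enough to implement directly. The following is a minimal version, restricted to productions whose right-hand sides consist of a single terminal, possibly followed by one non-terminal (as in the example above); the representation is my own.

# Sketch: turning a right-linear grammar into an FSA, following the procedure
# above, restricted to productions of the forms X -> aY and X -> a
# (a a single terminal). A fresh accepting state F is added.

def grammar_to_fsa(nonterminals, start, productions):
    """productions: set of pairs (X, rhs) with rhs = 'aY' or 'a'."""
    final = "F"
    states = set(nonterminals) | {final}
    transitions = set()
    for lhs, rhs in productions:
        symbol = rhs[0]
        target = rhs[1] if len(rhs) == 2 else final
        transitions.add((lhs, symbol, target))
    return states, start, {final}, transitions

# The grammar <{1}, {S}, S, {(S, 1), (S, 1S)}> for {1^n | n > 0}:
print(grammar_to_fsa({"S"}, "S", {("S", "1"), ("S", "1S")}))
# transitions: (S, 1, S) and (S, 1, F); F is the only accepting state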
5 Context-free languages
In chapter 3 we saw an example of a non-regular language: {0n 1n | n ∈ N}. Being non-regular, this is a language for
which no finite state automaton and no regular grammar exists. So what mechanisms do we have to compute languages
like these?
5.1 Push-down automata
One answer is the push-down automaton (PDA): an automaton that works like a finite state automaton, except that it also has access to a stack, a memory to which symbols can be pushed and from which the top-most symbol can be popped. Here is a PDA for {0n 1n | n ∈ N}:
[PDA diagram: s0 goes to s1 on ε, ε → $; s1 loops on 0, ε → 0; s1 goes to s2 on 1, 0 → ε; s2 loops on 1, 0 → ε; s2 goes to the accepting state s3 on ε, $ → ε.]
This automaton works as follows. Each transition is labeled with three things, notated: a, b → c. Here, a is what is read
from the input, b is what is popped from the stack and c is what is pushed to the stack. So, at the start this automaton
reads nothing and pops nothing. It simply pushes “$” to the stack and transitions to s1 . The “$” symbol is simply to have
a marker in the stack of where the reading of input started. Once in s1 , if it reads a 0, then it pops nothing, and pushes a
0 to the stack. This way, for each 0 that is read, there is a 0 on the stack. Once a 1 is read, it pops the top 0 from the stack
and pushes nothing, transitioning to s2 . There, it keeps on popping 0s for each 1 that is read. The automaton transitions
to the acceptance state s3 if there is nothing left to read and “$” can be popped from the stack. That is, the automaton
only accepts if after reading all the 1s, there are no more 0s on the stack. As you can verify yourself, this is only possible
if the number of 0s and 1s is exactly the same.
The computation of a string can be captured in a table that keeps track of the current state and the current stack for each
symbol read from the input. For instance, for the input 000111:
read state stack
s0
s1 $
0 s1 $0
0 s1 $00
0 s1 $000
1 s2 $00
1 s2 $0
1 s2 $
s3
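The table above can be reproduced mechanically. The sketch below walks through an input string, using a Python list as the stack and following the transitions of the PDA just described; the control flow is hard-coded for this particular automaton, purely for illustration.

# Sketch: tracing the PDA for {0^n 1^n} on an input string.
# The list `stack` plays the role of the PDA's stack; '$' marks the bottom.

def trace(string):
    stack, state = [], "s0"
    stack.append("$"); state = "s1"            # (s0, eps, eps) -> (s1, $)
    print(f"{'':>4} {state:>4} {''.join(stack)}")
    for symbol in string:
        if state == "s1" and symbol == "0":
            stack.append("0")                  # (s1, 0, eps) -> (s1, 0)
        elif state in ("s1", "s2") and symbol == "1" and stack[-1] == "0":
            stack.pop(); state = "s2"          # (s1/s2, 1, 0) -> (s2, eps)
        else:
            return False                       # no applicable transition
        print(f"{symbol:>4} {state:>4} {''.join(stack)}")
    if state == "s2" and stack == ["$"]:
        stack.pop(); state = "s3"              # (s2, eps, $) -> (s3, eps)
        return True
    return False

print(trace("000111"))   # prints the trace and True
print(trace("0011"))     # True
print(trace("001"))      # False: a 0 is left on the stack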
Formally, a pushdown automaton looks a lot like a finite state automaton, except with the addition of the stack and the
added complexity of the transitions:
Definition 22
A push-down automaton is a 6-tuple hΣ, Γ, S, s, A, Ri, such that:
• Σ is a finite set (the language alphabet)
• Γ is a finite set (the stack alphabet)
• S is a finite set of states
• s ∈ S, the start state
• A ⊆ S, the acceptance states
• R ⊆ (S × Σ∗ × Γ∗ ) × (S × Γ∗ ), the transition relation
Let’s unpack the transition relation. It relates triples consisting of a state, a string of symbols of the alphabet
of the language and a string of symbols of the stack alphabet on the one hand to pairs consisting of a state and a string
of symbols of the stack alphabet on the other. So, given a state, a symbol (or string) that is being read from the input
string and a symbol (or string) to be popped from the top of the stack, there is a transition to a new state and a string of
stack symbols is pushed to the stack.
The PDA I gave for {0n 1n | n ∈ N} has the following transitions (writing ε for the empty string):
(s0 , ε, ε) → (s1 , $)
(s1 , 0, ε) → (s1 , 0)
(s1 , 1, 0) → (s2 , ε)
(s2 , 1, 0) → (s2 , ε)
(s2 , ε, $) → (s3 , ε)
That is, the transition relation is:
{((s0 , (ε, ε)), (s1 , $)), ((s1 , (0, ε)), (s1 , 0)), ((s1 , (1, 0)), (s2 , ε)), ((s2 , (1, 0)), (s2 , ε)), ((s2 , (ε, $)), (s3 , ε))}.
Definition 23
Let P = hΣ, Γ, S, s, A, Ri be a push-down automaton. Let x ∈ Σ∗ and σ ∈ Σ∗ . (So, xσ is also a member
of Σ∗ .) Let t, t′ ∈ S and u, v, w ∈ Γ∗ . A computation step for P is the relation ⊢P , defined as:
(t, xσ, vu) ⊢P (t′, σ, wu) if and only if ((t, (x, v)), (t′, w)) ∈ R.
The relation ⊢∗P , the “computes” relation for push-down automaton P , is the reflexive and transitive closure of ⊢P . That is, it is the smallest relation such that ⊢P ⊆ ⊢∗P and such that it is reflexive and transitive.
The triples in these computation steps represent situations the push-down automaton can be in. For instance (s1 , 0111, 100)
is a situation of an automaton in state s1 , which still needs to read 0111 (so 0 is the next symbol it reads), where the stack
has 100 in it (so 1 is the symbol that could potentially be popped).
Here is an example of an application of these definitions for the PDA A that I gave for {0n 1n | n ∈ N}.
(s0 , 000111, ε) ⊢A (s1 , 000111, $) ⊢A (s1 , 00111, 0$) ⊢A (s1 , 0111, 00$)
⊢A (s1 , 111, 000$) ⊢A (s2 , 11, 00$) ⊢A (s2 , 1, 0$) ⊢A (s2 , ε, $) ⊢A (s3 , ε, ε)
As a consequence: (s0 , 000111, ε) ⊢∗A (s3 , ε, ε).
Given the definition of the ‘computes’ relation, we can now define acceptance for push-down automata:
Definition 24
Let P = hΣ, Γ, S, s, A, Ri be a push-down automaton. P accepts σ ∈ Σ∗ if and only if for some f ∈ A:
(s, σ, ε) ⊢∗P (f, ε, ε)
That is, a string is accepted if we can find a computation that starts in the start state with an empty stack and
ends in an acceptance state with an empty stack (and the whole string read).
As we did with acceptance for finite state automata, the language corresponding to the automaton is simply the set of
accepted strings.
Definition 25
If P = hΣ, Γ, S, s, A, Ri is a push-down automaton, then:
L(P) = {σ ∈ Σ∗ | P accepts σ}
5.2 Context-free grammar
Definition 26
Let G = hΣ, N, S, P i be a formal grammar. G is a context-free grammar (CFG) if and only if P ⊆ N ×(Σ∪N )∗ .
That is, a context-free grammar is a formal grammar where all production rules have a single non-terminal symbol on the left-hand side and a string made up of terminal and non-terminal symbols on the right-hand side.
Note, first of all, that every regular grammar is context-free. Productions like S → S1 are left-linear, but they also
fall within what is allowed to qualify as context-free. CFGs allow for much more than linear productions, though. For
instance, S → 1S1 is a typical production rule that qualifies as context-free but not linear/regular.
Here is a context-free grammar for {0n 1n | n ∈ N}:
h{0, 1}, {S}, S, {(S, ε), (S, 0S1)}i
That is, the grammar with the two production rules S → ε and S → 0S1.
CFGs and PDAs correspond to exactly the same class of languages, the context-free languages. I will not give the proof
here, but to get the intuition, here’s a procedure to construct a PDA that is equivalent to some CFG.
Let C = hΣ, N, S, P i be a CFG. The corresponding PDA A will be the sextuple hΣ, Σ ∪ N, {s0 , s1 }, s0 , {s1 }, Ri.
(So, the stack alphabet is the set of terminal and non-terminal symbols of the CFG and there are just two states, one
a start state and the other an acceptance state.) The transitions R are constructed as follows. First of all, R contains
the transition (s0 , ε, ε) → (s1 , S). This is a transition to the acceptance state by reading nothing and pushing the start
non-terminal symbol to the stack. Furthermore, every rule in P is converted into a transition in R: if X → σ is a production, then the
corresponding transition in R is (s1 , ε, X) → (s1 , σ). This transition reads nothing, but pops the non-terminal X from
the stack and pushes the right-hand side of the production to it. Finally, for each terminal symbol x, we add a transition
(s1 , x, x) → (s1 , ε).
To see how this works, let us take the above CFG for {0n 1n | n ∈ N}. Given this CFG, the corresponding PDA will look
like:
(s0 , ε, ε) → (s1 , S)
(s1 , ε, S) → (s1 , ε)
(s1 , ε, S) → (s1 , 0S1)
(s1 , 0, 0) → (s1 , ε)
(s1 , 1, 1) → (s1 , ε)
[PDA diagram: s0 goes to s1 on ε, ε → S; s1 has loops for ε, S → ε; ε, S → 0S1; 0, 0 → ε and 1, 1 → ε; s1 is the accepting state.]
If we take this PDA and try to find a computation that proves acceptance of 000111, we find:
(s0 , 000111, ε) ⊢ (s1 , 000111, S) ⊢ (s1 , 000111, 0S1) ⊢ (s1 , 00111, S1)
⊢ (s1 , 00111, 0S11) ⊢ (s1 , 0111, S11) ⊢ (s1 , 0111, 0S111) ⊢ (s1 , 111, S111)
⊢ (s1 , 111, 111) ⊢ (s1 , 11, 11) ⊢ (s1 , 1, 1) ⊢ (s1 , ε, ε)
This PDA is quite different from the one I gave earlier, but – as you may verify – it accepts exactly the same language.
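The construction just described is short enough to state as a function. A sketch, with the representation of transitions and productions chosen purely for illustration:

# Sketch: building the PDA transitions for a given CFG, following the
# procedure above. EPS stands for the empty string.

EPS = ""

def cfg_to_pda_transitions(terminals, productions, start_symbol="S"):
    """productions: iterable of pairs (X, sigma). Returns a set of transitions
    ((state, read, pop), (state, push))."""
    transitions = {(("s0", EPS, EPS), ("s1", start_symbol))}
    for lhs, rhs in productions:
        transitions.add((("s1", EPS, lhs), ("s1", rhs)))
    for x in terminals:
        transitions.add((("s1", x, x), ("s1", EPS)))
    return transitions

# The CFG <{0,1}, {S}, S, {(S, ""), (S, "0S1")}> for {0^n 1^n}:
for t in sorted(cfg_to_pda_transitions({"0", "1"}, [("S", ""), ("S", "0S1")])):
    print(t)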
5.3 Chomsky Normal Form
Context-free grammars can be brought into a particularly simple shape called Chomsky normal form (CNF). A CFG is in Chomsky normal form whenever its productions obey the following restrictions:
• At most one production contains the empty string, namely a production which maps the start symbol to ε.
• All other productions take one of two forms:
A → BC a non-terminal mapped to two non-terminals
A→a a non-terminal mapped to an element of the alphabet
• Finally, the start symbol is not allowed to occur on the right-hand side.
These restrictions only make sense if they are harmless with respect to the class of languages that can be expressed.
Indeed, CNF context-free grammars correspond to exactly the same class of languages as non-CNF CFGs (and PDAs)
correspond to. So, we can obtain simpler rules and parse trees without compromising on expressivity.
Part of how we know this is that there is an algorithm that can transform each
non-CNF CFG into a (weakly) equivalent grammar in Chomsky normal form. It is important to know this algorithm, so
that you can always produce a CNF grammar on the basis of some context-free grammar.
Say G = hΣ, N, S, Ri is a context-free grammar. If we want to turn it into a Chomsky normal form grammar, that grammar
will take the following form: G′ = hΣ, Nc , Sc , Rc i. Here, Sc is an entirely new non-terminal symbol. That is, Sc ∉ N .
Obviously, we do require that Sc ∈ Nc . We now consider R and change the productions in several ways until we have a
CNF.
The first kind of change is adding a rule Sc → S. This simply connects the old start symbol to the new one, in order to
make sure the start symbol doesn’t occur on the right-hand side of any rule. Note that this is not a production that is in
accordance with the CNF rules, but ignore this for now; we will deal with that problem in a later step.
The second change we need to consider is to get rid of any rules containing the empty string. So, if R contains a rule
X → ε, we remove this rule and make sure that this omission doesn’t have consequences elsewhere. For instance, if
some other rule looks like Y → AXb, then we need to double this rule to account for the option that X is rewritten as the empty string. So, the
new production set will not just have Y → AXb, but also Y → Ab. This way the new grammar does all that the old one
did, without making use of ε. Note that these rules are not yet CNF, but that will be fixed later.
The third manipulation that will bring us closer to CNF is to remove rules that have a single non-terminal on the right-
hand side. A rule like X → Y can be removed by looking at rules with Y on the left-hand side and replacing Y in X → Y
with that right-hand side. For instance, if Y → y, then we can replace X → Y with X → y.
The next change is to deal with rules that have more than two symbols on the right-hand side. Say we have X →
α1 α2 α3 . . . αn , where the αi can be terminal or non-terminal symbols. We can replace this with binary branching
rules as follows.
X → α1 X1
X1 → α2 X2
X2 → α3 X3
...
Xn−1 → αn
Note that we need to make sure that X1 , . . . , Xn−1 ∉ N . These have to be fresh non-terminal symbols that have not been
used elsewhere in the grammar. For all but the final rule it is the case that these are only CNF productions if the αi is a
non-terminal. If it is not, then the next step deals with this. The final rule is only CNF if αn is a terminal. If it is not, our
earlier strategy for dealing with rules with just a single non-terminal on the right-hand side will take care of it.
The final kind of non-CNF production we need to be able to deal with are rules where there are two elements on the
right-hand side, but one of them is a terminal. This is easy. A rule like X → zY can be replaced by the combination
X → ZY and Z → z. (Similarly, for when the terminal is right of the non-terminal.)
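Two of these steps, breaking up long right-hand sides and removing terminals from binary right-hand sides, are purely mechanical. A sketch of just those two steps; the fresh non-terminal names are generated on the fly and the terminal/non-terminal test by letter case is only a convention chosen for this illustration.

# Sketch: two CNF steps: (1) binarise right-hand sides longer than 2,
# (2) replace terminals inside binary right-hand sides by fresh non-terminals.
# Rules are pairs (lhs, rhs) with rhs a list of symbols.

from itertools import count

fresh = count(1)

def is_terminal(symbol):
    return not symbol.isupper()

def binarise(rules):
    out = []
    for lhs, rhs in rules:
        while len(rhs) > 2:
            new = f"X{next(fresh)}"
            out.append((lhs, [rhs[0], new]))
            lhs, rhs = new, rhs[1:]
        out.append((lhs, rhs))
    return out

def remove_terminals(rules):
    out, names = [], {}
    for lhs, rhs in rules:
        if len(rhs) == 2:
            new_rhs = []
            for symbol in rhs:
                if is_terminal(symbol):
                    names.setdefault(symbol, f"T{next(fresh)}")
                    new_rhs.append(names[symbol])
                else:
                    new_rhs.append(symbol)
            rhs = new_rhs
        out.append((lhs, rhs))
    return out + [(name, [t]) for t, name in names.items()]

rules = [("S", list("0S1")), ("S", [])]       # S -> 0S1, S -> empty string
print(remove_terminals(binarise(rules)))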
This is all we need to transform any non-CNF CFG into a CFG that does adhere to the strict constraints of Chomsky
normal form. Here’s an example, the CFG for {0n 1n | n ≥ 0} that I gave earlier:
1. The original grammar:
S → ε
S → 0S1
2. Adding the new start symbol:
S0 → S
S → ε
S → 0S1
3. Reducing the right-hand sides > 2:
S0 → S
S → ε
S → 0X
X → S1
4. Removing terminals from the right-hand sides:
S0 → S
S → ε
S → YX
Y → 0
X → SZ
Z → 1
5. Removing productions with one non-terminal on the right-hand side:
S0 → ε
S0 → YX
S → ε
S → YX
Y → 0
X → SZ
Z → 1
6. Removing productions (other than for the start symbol) that involve the empty string:
S0 → ε
S0 → YX
S → YX
Y → 0
X → SZ
X → 1
Z → 1
5.4 Pumping lemma
Here is an example parse tree for a grammar in CNF: take the grammar with productions S → AB, A → AB, B → BA, A → 0 and B → 1, and consider a derivation of the string 01101001.
[Parse tree: S branches into A and B; these branch into A B and B A; those in turn branch into A B, B A, B A and A B; the eight leaves are 0 1 1 0 1 0 0 1. The corresponding left-most derivation simply applies the production at each node, always rewriting the left-most non-terminal first.]
Now note the following. Since the grammar is in CNF, each parse tree will have at most binary branching at each node.
So, the width of a tree at each level n of branching is at most 2^n. (The root node is level 0, so the tree has width 2^0 = 1
there. The next level down is the first branching, level 1, and has width 2^1 = 2. Etc.) Given the relation between the width
and height of the tree, we can deduce from the size of the string what the height of the tree will minimally be. Since the
leaves of the tree will be arrived at via unary branching (as per Chomsky normal form), a tree of n branching levels will
have at most 2^n symbols in the string it derives. The tree above branches on three levels and does so maximally and, so,
its number of leaves is 2^3 = 8. Conversely, if we didn’t have the tree for this string, we could still conclude from its length that the height of
the tree (measured in branchings) will be at least 3.
The grammar above only has 3 non-terminals. Since the height of the tree is 3, there will be a path from the root node to
a leaf that contains 4 non-terminals. By the pigeon hole principle, this entails that there must be a path from the root node (S) to a leaf that contains the
same non-terminal more than once, as indeed there is. Given that this repetition was possible once, it must be the case
that it is possible more times, and it must be the case that the repetition could have been omitted. That is, such repetitions
show that stuff can be pumped, just like we discovered for regular languages. Let’s illustrate this for the tree above.
Consider the second node at the second level of branching, labeled B. The subtree rooted there illustrates a derivation B ⇒∗ 10. But this means
we should be able to replace any other node labeled B, for instance the fifth node at the third level (which just derives 1), with a copy of this subtree. So, the following should also be a parse tree
for this grammar.
[Parse tree: as before, except that the fifth node at the third level, B, now branches into B and A with leaves 1 and 0; the yield of the tree is now 011010001.]
The idea behind the pumping lemma for context-free languages generalises this idea of repeatability. Consider the follow-
ing abstract parse tree for a string uvwxy, from an unknown CNF context-free grammar. (The triangular shapes represent
unknown sub-trees that derive the (sub)strings u, v, w, x and y.)
[Abstract parse tree: the root derives u, a node T and y; that T derives v, a lower node T and x; the lower T derives w.]
Given this tree, we know that uvwxy belongs to the language. But we also know that T derives vwx. As a result, we
know that we can replace the lowest node T with the parse tree for T ⇒∗ vwx, resulting in a parse tree for uv^2 wx^2 y.
[Abstract parse tree: as before, but the lowest T is now replaced by a copy of the subtree in which T derives v, T and x, with the new lowest T deriving w.]
Clearly, we can repeat this as many times as we want. Also, we could replace the sub-tree rooted in the top-most T with
the small tree for T ⇒∗ w, which would result in the string uwy. That is, given the fact that we saw a parse tree for uvwxy
with two occurrences of T, we know that for all i ≥ 0, the string uv^i wx^i y is in the language.
All this is the intuition behind the pumping lemma for context-free languages. Here’s the formal statement of the lemma:
Theorem 10
Every infinite context-free language L is such that there exists a number m, such that for every string σ ∈ L
where |σ| ≥ m, σ can be divided up as uvwxy such that the following hold:
1. |vwx| ≤ m
2. |vx| ≥ 1
3. For every i ≥ 0: uv^i wx^i y ∈ L
As we saw above, as soon as we find a path in a parse tree that contains a repetition of a non-terminal, we have evidence
of other strings in the language. As soon as we have a subtree rooted in non-terminal T that contains a node T deeper
in that subtree, we can replace that node in the subtree with the subtree itself. This holds of paths of arbitrary length, as
long as they exceed the pumping length. However, to make the lemma easy to use, we can focus on the repetition that
is as close to the bottom of the tree as possible. That is, if there is a path longer than m, then for some non-terminal T ,
there will be a minimal subtree rooted in T with exactly one other occurrence of T deeper in that subtree. The yield of this subtree
cannot be longer than m: if it were longer, then the subtree would have to contain some other repetition in it, and it wouldn’t be minimal.
This is why the lemma can state that |vwx| ≤ m.
The second condition in the lemma |vx| ≥ 1 holds because the closest the repeated non-terminals are in the subtree that
derives vwx is one node apart. As the following trees illustrate, that would result in either v or x being empty.
[Two trees: in the first, the upper T branches into the lower T and a node X, with the lower T deriving w and X deriving x, so v is empty; in the second, the upper T branches into X and the lower T, with X deriving v and the lower T deriving w, so x is empty.]
As soon as the repeated nodes are further apart neither v nor x can be empty:
[Tree: the lower T sits deeper inside the subtree rooted in the upper T, with v derived to its left and x to its right, and w derived by the lower T.]
Because we know there is a repetition, the two trees where the T nodes are one node apart are the shortest the string
vwx can be. In other words, at most one of v or x can be empty, but not both.
I won’t prove the lemma here, but I hope the intuition sketched above suffices to understand how it works. As with the pumping lemma for regular languages, the typical application of the pumping lemma is to
prove that a language does not belong to the class. That is, we apply the above lemma if we want to prove that a language
isn’t context-free. If we encounter a language L and prove using the pumping lemma for regular languages that it is not
regular, the next step could be to prove either that it is context-free (by providing a PDA or CFG for it) or that it is not
context-free by applying the above lemma. As we saw above, all regular languages are context-free (any linear grammar
is context-free), so once we know that a language is not context-free, then we also know that it is not regular.
Consider the language L = {an bn cn | n ≥ 0}. Notice first (informally) that this language is not regular. We won’t be
able to construct an FSA for this language, for exactly the same reasons as we won’t be able to construct an FSA for the
language {an bn | n ≥ 0}. But while this latter language is context-free, L is neither regular nor context-free. We can
show this by applying the pumping lemma. Let’s assume that L is context-free and that the pumping length is n. Then
consider σ = an bn cn . Since |σ| > n the properties mentioned in the lemma should now hold for some decomposition
σ = uvwxy. The first property is that |vwx| ≤ n. Given this, vwx either contains no as or it contains no cs. (If vwx
contains both, it would also have to contain all n bs, which would make it longer than n.) It follows from this that when we pump up to uv^i wx^i y for some i ≥ 2,
we end up with a string that either contains more cs or bs than as, or more as or bs than cs. The resulting strings are not
part of the language, which contradicts our assumption. This proves that L is not context-free.
5.5 Closure properties
As before with regular languages, it is interesting to consider closure properties of context-free languages. This may in
some cases help us deduce the complexity of a language from the complexity of other languages.
Context-free languages are closed under union, concatenation and Kleene star. It is relatively easy to see why, using
context-free grammars. Say we have two CFGs, G1 = hΣ1 , N1 , S1 , P1 i and G2 = hΣ2 , N2 , S2 , P2 i. Assume that N1 ∩
N2 = ∅. (Note, that we can always rename the non-terminals of a grammar and get a strongly equivalent grammar in
return. In other words, this assumption is not essential to the proof.) It is now very easy to build context-free grammars
that correspond to L(G1 ) ∪ L(G2 ), L(G1 ) · L(G2 ) and L(G1 )∗ . Since we know that L(G1 ) and L(G2 ) are context-free
(we have CFGs for them), it will follow that context-free languages are closed under union, concatenation and Kleene
star.
To build CFGs for the language obtained by union or concatenation, all we need to do is construct a new grammar
hΣ1 ∪ Σ2 , N1 ∪ N2 ∪ {S}, S, P i, where S 6∈ N1 ∪ N2 and P is as follows:
P = P1 ∪ P2 ∪ {S → S1 , S → S2 } union
P = P1 ∪ P2 ∪ {S → S1 S2 } concatenation
To obtain (L(G1 ))∗ we can take the grammar hΣ1 , N1 ∪ {S}, S, P1 ∪ {S → SS1 , S → ε}i.
Here’s an example. Consider the following two CFGs with start symbol S1 and S2 , respectively.
S1 → 1 S1 0 S2 → 1 S2 0
S1 → 10 S2 → 110
These correspond to {1n 0n | n ≥ 1} and {1n+1 0n | n ≥ 1}, respectively. To obtain the union of these languages, we can
simply construct the following grammar.
S → S1
S → S2
S1 → 1 S1 0
S1 → 10
S2 → 1 S2 0
S2 → 110
For the concatenation of the two languages, the following grammar will do:
S → S1 S2
S1 → 1 S1 0
S1 → 10
S2 → 1 S2 0
S2 → 110
Finally, if we want to construct a grammar for {1n 0n | n ≥ 1}∗ , which is the language {ε, 1n 0n , 1n 0n 1m 0m , 1n 0n 1m 0m 1k 0k , . . . | n ≥ 1, m ≥ 1, k ≥ 1, . . .}, the following grammar will do:
S → SS1
S → ε
S1 → 1 S1 0
S1 → 10
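These three constructions can be written down directly. A sketch, with grammars represented as (non-terminals, start symbol, productions) triples, the alphabets left implicit, and the assumption that the two grammars have disjoint non-terminals and that "S" is not yet in use; the representation is my own choice.

# Sketch: union, concatenation and Kleene star for CFGs.

def union(g1, g2):
    (n1, s1, p1), (n2, s2, p2) = g1, g2
    return (n1 | n2 | {"S"}, "S", p1 | p2 | {("S", s1), ("S", s2)})

def concatenation(g1, g2):
    (n1, s1, p1), (n2, s2, p2) = g1, g2
    return (n1 | n2 | {"S"}, "S", p1 | p2 | {("S", s1 + s2)})

def star(g1):
    n1, s1, p1 = g1
    return (n1 | {"S"}, "S", p1 | {("S", "S" + s1), ("S", "")})

g1 = ({"A"}, "A", {("A", "1A0"), ("A", "10")})   # {1^n 0^n | n >= 1}, S1 renamed A
g2 = ({"B"}, "B", {("B", "1B0"), ("B", "110")})  # {1^(n+1) 0^n | n >= 1}, S2 renamed B
print(union(g1, g2))
print(concatenation(g1, g2))
print(star(g1))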
For regular languages, we showed that the complement of any such language is regular as well. This is not the case for
context-free languages.
Theorem 11
Context-free languages are not closed under complementation.
Proof
Take L1 = {1^i 2^i 3^j | i ≥ 0 and j ≥ 0} and L2 = {1^i 2^j 3^j | i ≥ 0 and j ≥ 0}. Both are context-free.
(Check this by providing a CFG for the languages.) Let us assume that context-free languages are closed
under complementation. Then the complements of L1 and L2 are context-free. Given the fact that context-free languages
are closed under union, the union of these two complements is also context-free. Call this language L. The complement of L1 contains, among other things, the strings that have a
sequence of 1s followed by a sequence of 2s and then a sequence of 3s in such a way that the number of 1s and
2s is not equal. The complement of L2 is similar, except that there the number of 2s and 3s is not equal. So, L contains the
strings of this shape in which either (or both) the number of 1s and 2s or the number of 2s and 3s is not
equal. Now take the complement of L. By De Morgan, this is L1 ∩ L2 : the language of strings with 1s followed by 2s and 3s, where the
number of 1s and 2s are equal and the number of 2s and 3s are equal. That is, the complement of L is {1^i 2^i 3^i | i ≥ 0}. Since L
is context-free and the assumption is that complements of context-free languages are context-free, we
have to conclude that this language is context-free. But it clearly is not: it was our prime example of a
language that is not context-free. This shows that our assumption that context-free languages are closed
under complementation must be false.
Theorem 12
Context-free languages are not closed under intersection.
Proof
Once more, take L1 = {1^i 2^i 3^j | i ≥ 0 and j ≥ 0} and L2 = {1^i 2^j 3^j | i ≥ 0 and j ≥ 0}. Both are
context-free, but their intersection L1 ∩ L2 = {1^i 2^i 3^i | i ≥ 0} is not. So, context-free languages are
not closed under intersection.
5.6 Mirroring versus copying, and natural language
Two patterns are worth contrasting: mirroring and copying. The mirror language over {0, 1} consists of the strings that are followed by their own reversal (equivalently, the even-length palindromes); the copy language C consists of the strings that are followed by an exact copy of themselves, C = {σσ | σ ∈ {0, 1}∗ }. The mirror language is context-free. For instance, the following grammar with start symbol S derives it:
S → 0S0
S → 1S1
S → 11
S → 00
S →
The copy language, on the other hand, is not context-free. To prove this, we walk through an application of the pumping
lemma. If C were context-free, then there must be a pumping length n. Let’s then consider the string 0n 1n 0n 1n ∈ C.
We decompose this string into uvwxy such that |vwx| ≤ n. It follows that vwx is one of two types: (i) it contains only
1s or only 0s; (ii) it is made up of a series of 1s followed by a series of 0s, or of a series of 0s followed by
a series of 1s. In case (i) pumping would increase the number of 1s or 0s in one part of the string without adjusting the
1s and 0s elsewhere. The result is not an element of C. Note that in case (ii) v and x will contain fewer than n 1s or 0s.
Also, either u will contain n 0s or y will contain n 1s (or both). So, the situation looks like one of the following, where the vertical
line marks where the copying takes place:
[Diagram: the string 0n 1n 0n 1n with a vertical copy line in the middle; in the left-most situation vwx straddles the copy line, in the middle situation vwx lies entirely to the left of the line, and in the right-most situation it lies entirely to the right.]
In the right-most situation, we cannot pump. The number of 0s and 1s left of the copy line is n, but if we pump there will
be more 0s and/or 1s right of the copy line. The same holds for the situation in the middle. Here, there are n 0s and 1s
on the right-hand side of the copy line and so if we pump we’ll also lose the copying property. So, the only interesting
case is the left-most one. It is instructive to go through all the possible ways vwx could be carved up. Let’s first consider
the case that v is empty. Then there are two possibilities. First, x contains only 0s. If this is the case, then pumping will
result in a string with more 0s right of the copy line than on the left. Second, x contains not only 0s but also some 1s. If we then
remove x, the resulting string should still be in the language, but it cannot be, because the number of 0s left of the copy
line will now exceed those right of the line (and vice versa for the number of 1s). We can go through similar options for
the case where x is empty, with the same result. Finally, we consider the case where v and x are both not empty. Now v
will contain at least one 1 and x will contain at least one 0. So, if we pump, we will increase the 1s on the left-hand side
of the copy line and we will increase the 0s on the right-hand side of the copy line. The resulting string will not be in the
language. This exhausts our options and proves that C is not context-free.
A crucial difference between mirroring and copying is that the former but not the latter can be achieved with center-
embedding. Center-embedding is recursion that is nested, in the sense that the recursive step takes place within a string,
rather than at the edge of that string. This is why {0n 1n | n ≥ 0} is context-free: once we have a string in that language,
we can build a new string by splitting it in half and nesting 01 between the two halves. The way these strings are built
causes a very typical dependency pattern, illustrated here:
0 0 0 1 1 1   (nested dependencies: the first 0 is linked to the last 1, the second 0 to the second-to-last 1, and the third 0 to the closest 1)
The language {0n 1n 2n | n ≥ 0} is not context-free (as we saw earlier). This is because the strings simply cannot be built
by having nested dependencies. There will always be crossed dependencies. For instance,
0 0 0 1 1 1 2 2 2   (crossed dependencies: each 0 is linked to the 1 and the 2 in the same position of their respective blocks)
Similarly, the copy language also cannot be constructed using center embedding. Here it is very clear what the depen-
dencies are, since the symbols in the copied string have to be repeated in the second half of the string, symbol by symbol.
For instance, the following string is in C:
0 1 0 0 1 0   (crossed dependencies: each symbol in the first half is linked to the symbol in the same position of the second half)
There’s been considerable discussion in the linguistic literature on whether natural language is ever not context-free.
That is, can we find natural languages with phenomena that involve recursion creating crossed, not just
nested, dependencies? Notice, first, that English has clear examples of center embedding. Consider the sentence “The fish smelled”,
consisting of a determiner (the), a noun (fish) and a verb (smelled). Schematically, you could thus see this sentence as a
sequence DNV. Such a sequence can be center-embedded inside another one, and so on, which yields nested dependencies between the nouns and the verbs:
(10) The fish the cat my father loves ate smelled    (D N D N D N V V V)
For a while, Dutch was seen as a candidate language that shows crossing dependencies (Huybregts, Utrecht working
papers in Linguistics, 1979). Consider the following embedded clause, for instance:
(11) dat Jan Piet Marie zag laten zwemmen
     ‘that Jan saw Piet make Marie swim’
This clause contains the pattern Name Name Name Verb Verb Verb. Crucially, however, the first name is the subject of
the first verb, the second of the second verb, etc. So, this is clearly a case of crossing dependencies.
Note that the corresponding clause in German is nested and not crossed:
(12) dass Jan Piet Marie schwimmen lassen sah
The problem with Huybregts’ argument was that from a purely syntactic point of view, Dutch still looks context-free (Gazdar & Pullum, Linguistics & Philosophy, 1982). It simply contains the pattern Name^n Verb^n , which can be generated
using center-embedding. A counter-argument may be that even if Dutch syntax is context-free, it seems that Dutch
semantics cannot be that. There exists, however, a language very close to Dutch which displays crossing dependencies
clearly in syntax. In Schwiizerdütsch (Swiss German), sentences close to the above Dutch one exist, with the same
word order, but with the addition of case marking, which allowed Shieber (Linguistics & Philosophy, 1985) to strengthen
Huybregts’ argument to a purely syntactic one.
6 Beyond context-free grammars
6.1 The Chomsky hierarchy
The formal grammars and automata that I discussed so far are all examples of models of computation. We saw that the
simplest of these models, regular grammars and finite state automata, describe a class of languages called the regular
languages. Some languages are not regular and for some of these languages we could employ a context-free grammar or,
equivalently, a push-down automaton. As we saw in the previous chapter, some languages are not even context-free and
so will need to be computed by something more expressive than the models discussed up to now.
The ensuing picture is one of a hierarchy of language classes, each with their own models of computation. This hierarchy
is known as the Chomsky hierarchy and it is given below.
languages                  grammars                      automata
regular                    regular grammars              finite state automata
context-free               context-free grammars         push-down automata
context-sensitive          context-sensitive grammars    linear bounded automata
recursively enumerable     unrestricted grammars         Turing machines
You should read this table as providing inclusion relations. The regular languages are a subset of context-free languages,
which are a subset of the so-called context-sensitive languages, which are a subset of the recursively enumerable lan-
guages. That last set is the set of all computable decision problems, so exactly the languages for which there exists a
Turing machine. The definition of formal grammar that we gave in chapter 4, in its fully unrestricted form is equivalent
to a Turing machine.
Another way to read the table is as specifying what model is suitable for which class. Unrestricted grammars can compute
recursively enumerable languages, but since every regular language is recursively enumerable, an unrestricted grammar
is (for instance) also suitable for regular languages. Similarly, every regular grammar is a context-free grammar, and
every context-free grammar counts as an unrestricted grammar.
In these lecture notes, we won’t discuss Turing machines in detail. The thing you should know is that Turing machines
constitute the model for computability. What the table above shows us is that there is yet another level between context-
free and the level constituting all computable languages (the recursively enumerable ones). In the remainder of this
chapter we zoom in on this class of context-sensitive languages.
Definition 27
Let G = hΣ, N, S, P i be a formal grammar. G is a context-sensitive grammar (CSG) if and only if P ⊆
{(α, β) | α, β ∈ (Σ ∪ N )∗ and |α| ≤ |β|}.
Context-sensitive grammars can handle unlimited dependencies. Consider, for instance, the following CSG with start
symbol S.
S → 123 | 1X23
1Y → 11 | 11X
X2 → 2X
X3 → Y 233
2Y → Y 2
This grammar corresponds to the language {1n 2n 3n | n ≥ 1}. Here is a derivation using this grammar for the string
111222333:
S ⇒ 1X23 ⇒ 12X3 ⇒ 12Y233 ⇒ 1Y2233 ⇒ 11X2233 ⇒ 112X233 ⇒ 1122X33 ⇒ 1122Y2333 ⇒ 112Y22333 ⇒ 11Y222333 ⇒ 111222333
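Each step of a derivation like this can be checked mechanically: a step is a direct derivation if some production’s left-hand side occurs somewhere in the string and replacing it by the right-hand side yields the next string. A small sketch, where the rule list and the derivation are the ones given above:

# Sketch: checking that each step of a derivation applies one of the
# grammar's productions somewhere in the string.

RULES = [("S", "123"), ("S", "1X23"), ("1Y", "11"), ("1Y", "11X"),
         ("X2", "2X"), ("X3", "Y233"), ("2Y", "Y2")]

def is_direct_derivation(before, after):
    for alpha, beta in RULES:
        for i in range(len(before) - len(alpha) + 1):
            if before[i:i + len(alpha)] == alpha and \
               before[:i] + beta + before[i + len(alpha):] == after:
                return True
    return False

derivation = ["S", "1X23", "12X3", "12Y233", "1Y2233", "11X2233", "112X233",
              "1122X33", "1122Y2333", "112Y22333", "11Y222333", "111222333"]
print(all(is_direct_derivation(a, b) for a, b in zip(derivation, derivation[1:])))  # True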
There is a class of automata that correspond to context-sensitive grammars, called linear bounded automata (LBAs). Ba-
sically, an LBA is a Turing machine that can only use a part of the tape that has a length which is a linear function of the
input length. Together, LBAs and CSGs describe the class of context-sensitive languages. As summarised in the Chomsky
hierarchy, this class is a proper subset of the recursively enumerable languages and a proper superset of the context-free
languages.
It is commonly assumed that natural languages are at least context-free and at most context-sensitive. Generalising to the
worst case, this may imply that we should use context-sensitive grammars to describe natural languages. This, however,
is overkill, since most of natural language seems context-free and the expressivity of CSGs is much higher than needed
for natural languages. An example of this is the Bach (or MIX) language:
Bach = {σ ∈ {0, 1, 2}∗ | the number of 0s = the number of 1s = the number of 2s}
This is the language that consists of strings built from the symbols 0, 1 and 2 such that each string in the language has an
equal number of each of these three symbols. So, it includes 001222110 and 212120010, but not 021102220. This language
is context-sensitive. The kind of dependencies displayed in this language, however, is something entirely alien to natural languages. A kind of free word order corresponding to a language like Bach does not exist in the natural world. In part for this reason, CSGs are not used in modern computational linguistics. Rather, natural language is placed in a level that
is not included in the traditional Chomsky hierarchy, namely a level inbetween context-free and context-sensitive: the
class of mildly context-sensitive languages.
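Membership in the Bach language is of course trivial to decide; deciding membership is not the issue, the issue is where the language sits in the hierarchy. For completeness, a one-function sketch:

# Sketch: membership in the Bach (MIX) language over {0, 1, 2}.

def in_bach(s):
    return set(s) <= {"0", "1", "2"} and s.count("0") == s.count("1") == s.count("2")

print(in_bach("001222110"))   # True
print(in_bach("212120010"))   # True
print(in_bach("021102220"))   # False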
This means we could present an altered Chomsky hierarchy as follows, with the class of mildly context-sensitive languages placed in between the context-free and the context-sensitive languages:
regular ⊂ context-free ⊂ mildly context-sensitive ⊂ context-sensitive ⊂ recursively enumerable
One prominent formalism at this level is tree adjoining grammar (TAG), which manipulates trees rather than strings. Before introducing it, we need a formal notion of a labeled tree.
Definition 28
A labeled tree is a triple hV, E, ri paired with a labeling l such that:
• V is a finite set (the so-called vertices or nodes of the tree)
• E ⊆ V × V (the so-called edges)
• r ∈ V (the root node)
• {v ∈ V | (v, r) ∈ E} = ∅ (no incoming edges for the root node)
• For all w ∈ V \ {r} : |{v ∈ V | (v, w) ∈ E}| = 1 (exactly one incoming edge for all other nodes)
• For all w ∈ V : (r, w) ∈ E ∗ , where E ∗ is the reflexive transitive closure of E
• l:V →N with N some set of labels
So, for example, h{1, 2, 3, 4}, {(4, 2), (4, 1), (2, 3)}, 4i is the following tree:
[Tree: the root 4 has children 2 and 1; node 2 has the child 3.]
Here is an example labeling l for this tree: {(1, man), (2, Det), (3, the), (4, NP)}, which would make this the following tree:
[Tree: the root NP has children Det and man; Det has the child the.]
Note that you can’t define a syntactic parse tree by taking the union of the sets of terminals and non-terminals as the set
of vertices. The problem would be that many syntactic trees contain the same non-terminal multiple times. For instance,
h{NP, A, N, large, blue, ball}, {(NP, NP), (NP, A), (A, large), (NP, A), (A, blue), (NP, N), (N, ball)}, NPi
does not correspond to the intended parse tree for “large blue ball” (an NP branching into A and NP, with that NP branching into A and N), but rather to a graph that is not a tree at all: the NP node has an edge to itself, and the single A node has both “large” and “blue” as children. To get the intended tree, we use numbered vertices together with a labeling:
h{1, 2, 3, 4, 5, 6, 7, 8}, {(1, 2), (1, 3), (3, 4), (3, 5), (2, 6), (4, 7), (5, 8)}, 1i
l = {(1, NP), (2, A), (3, NP), (4, A), (5, N), (6, large), (7, blue), (8, ball)}
For ease of reference, I define two handy ways of talking about the root and leaves of a tree:
Definition 29
Let T be a tree hV, E, ri with labeling l. We write √T for the tree root, r. Also, we write T̃ for the yield of the
tree: the string made up from the leaves of the tree, read from left to right.
Given this conception of trees, we can define two operations on trees that resemble things that typically happen in a
context-free grammar. The first of these is the operation of substitution:
Definition 30
Let T1 = hV, E, ri be a tree with labeling l1 . Let T2 = hV ′, E ′, r′i be a tree with labeling l2 . Let V ∩ V ′ = ∅,
or if V ∩ V ′ ≠ ∅, then rename the vertices in T2 so that V and V ′ are disjoint. Let v ∈ V be a leaf of T1 and let
l1 (v) = l2 (r′). The substitution of T2 in T1 at node v, notated T1 [v, T2 ], is defined as follows.
T1 [v, T2 ] = hV ∪ V ′ \ {v}, E ′′, ri
where E ′′ = E \ {(n, m) | m = v} ∪ E ′ ∪ {(n, r′) | (n, v) ∈ E}. The labeling of T1 [v, T2 ] is l1 ∪ l2 .
Consider the trees X1 and X2 . (I am showing the labels, not the vertices.)
[Tree X1: a clause whose root branches into NP and VP; the NP branches into D (dominating the) and a leaf N; the VP branches into V (dominating sat) and PP; the PP branches into P (dominating on) and an NP with leaf nodes D and N. Tree X2: a tree rooted in N.]
As the definition says, substitution can only take place at a node that is a leaf and with a tree that is rooted in the same
label as that leaf node. This means that we can substitute X2 in X1 at the left most node that is labeled N .
[Tree X3: the result of substituting X2 into X1 at its left-most leaf labeled N; the tree is exactly like X1, except that this N now dominates the subtree X2.]
The other operation that will be relevant below is adjunction. Adjunction differs from substitution in that it constitutes a
recursive step.
Definition 31
Let T1 = hV, E, ri be a tree with labeling l1 and v some node in V . Let T2 = hV ′, E ′, r′i be a tree with labeling
l2 , such that there exists a leaf node f , the foot node, such that l1 (v) = l2 (f ) = l2 (r′). Let V ∩ V ′ = ∅, or if
V ∩ V ′ ≠ ∅, then rename the vertices in T2 so that V and V ′ are disjoint. Adjoining T2 into T1 at v, notated
T1 [v, T2 ], is defined as follows.
T1 [v, T2 ] = hV ∪ V ′ \ {v}, E ′′, ri
where E ′′ = E \ {(n, m) | n = v or m = v} ∪ E ′ ∪ {(n, r′) | (n, v) ∈ E} ∪ {(f, m) | (v, m) ∈ E}.
The labeling of T1 [v, T2 ] is once again l1 ∪ l2 .
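Definitions 30 and 31 translate fairly directly into code if trees are represented as nested pairs in the sense of Definition 18: a leaf is just its label, a non-leaf is a pair (label, children). A sketch; this recursive representation, and the toy trees at the end, are my own choices made only for illustration.

# Sketch: substitution and adjunction on trees represented as
# (label, children) pairs; a leaf is just its label (a string).

def label(tree):
    return tree if isinstance(tree, str) else tree[0]

def substitute(tree, target, replacement):
    """Replace the first leaf labeled `target` by `replacement`."""
    if isinstance(tree, str):
        return (replacement, True) if tree == target else (tree, False)
    new_children, done = [], False
    for child in tree[1]:
        if not done:
            child, done = substitute(child, target, replacement)
        new_children.append(child)
    return (tree[0], new_children), done

def adjoin(tree, target, auxiliary, foot):
    """Replace the first node labeled `target` by the auxiliary tree,
    re-attaching the old subtree at the auxiliary tree's foot leaf."""
    if isinstance(tree, str):
        return tree, False
    if label(tree) == target:
        return substitute(auxiliary, foot, tree)
    new_children, done = [], False
    for child in tree[1]:
        if not done:
            child, done = adjoin(child, target, auxiliary, foot)
        new_children.append(child)
    return (tree[0], new_children), done

# a toy auxiliary tree VP -> ADV VP* adjoined into an S over NP and VP
sentence = ("S", [("NP", ["D", "N"]), ("VP", [("V", ["sat"])])])
aux = ("VP", [("ADV", ["often"]), "VP*"])
print(adjoin(sentence, "VP", aux, "VP*")[0])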
Consider X3 above and tree X4 below. X4 has a foot node marked with a ∗ .
[Tree X4: a VP node branching into ADV (dominating often) and a foot node VP∗ .]
Adjoining X4 into X3 at the VP node yields a tree in which that VP now branches into ADV (often) and the original VP subtree.
A tree adjoining grammar is a grammar that takes trees as its primitives and uses adjunction and substitution to build a
language of trees and a corresponding string language.
Definition 32
A tree adjoining grammar G is a 6-tuple hΣ, N, S, I, A, Ci:
• Σ is a set of terminal symbols,
• N is a set of non-terminal symbols,
• S ∈ N is the start symbol,
• I is a set of initial trees,
• A is a set of auxiliary trees
• C a set of constraints (to be discussed later)
A tree is an initial tree if the labels of its leaves are all in Σ ∪ {ε} ∪ N and the labels of its internal nodes are all
in N .
A tree is an auxiliary tree if its top node is labeled X ∈ N , the labels of its leaves are all in Σ ∪ {ε} ∪ N and the labels
of its internal nodes are all in N . Furthermore, one leaf is labeled X; it is called the foot node and is marked
with ∗ .
Derivation consists of taking the elementary (i.e. initial and auxiliary) trees and applying substitution and adjunction to
them.
Definition 33
Let G = hΣ, N, S, I, A, Ci be a tree adjoining grammar. The set of derived trees of G, D(G) is the smallest set
such that:
• I ⊆ D(G)
• A ⊆ D(G)
• T [n, T ′] ∈ D(G) for any T, T ′ ∈ D(G) such that T [n, T ′] is defined
The tree language of G is T (G) = {T | T ∈ D(G) and T̃ ∈ Σ∗ and l(√T ) = S}.
Consider, for example, a TAG with the following elementary trees:
[Three elementary trees: an initial tree in which S dominates the empty string, and two auxiliary trees; in the first, S branches into a and S, with that S branching into the foot node S∗ and a; in the second, S branches into b and S, with that S branching into the foot node S∗ and b.]
This TAG approximates the copy language. For instance, we can derive aabaab as follows:
[Derivation figure: starting from the auxiliary tree for a, an adjunction of the auxiliary tree for a and then an adjunction of the auxiliary tree for b, followed by a substitution of the initial tree at the remaining S leaf, produce a derived tree whose yield is aabaab.]
As you can see, two adjunctions and a substitution result in the string aabaab. Note, however, that the grammar does not correctly yield the copy language.
[Derivation figure: the same elementary trees, but with the adjunctions targeting different S nodes, produce a derived tree whose yield, after the final substitution, is abaaab.]
Following substitution of the leaf node S with the empty string, this will yield the string abaaab, which is not in the copy
language. In order to make TAGs more expressive, constraints on adjunction are needed. This is the component C of a
TAG that I have so far not discussed.
Definition 34
In a TAG G = hΣ, N, S, I, A, Ci the adjunction constraint component C is a pair (fOA , fSA ) such that:
• fOA : {v | v is a vertex in some tree γ ∈ I ∪ A} → {0, 1} (obligatory adjunction)
This is a characteristic function describing the set of nodes that have obligatory adjunction.
• fSA : {v | v is a vertex in some tree γ ∈ I ∪ A} → ℘(A) (selective adjunction)
This maps nodes to the set of auxiliary trees that can be adjoined there.
If a node v is such that fOA (v) = 1, then we have to adjoin at this point in the tree. Such nodes are marked with OA .
If fOA (v) = 0 and fSA (v) = ∅, then we cannot adjoin at this node. Such nodes are marked with NA . It is assumed that
at least all leaves of trees are NA.
With these tools, the copy language can be given in TAG.
[The same three elementary trees, but now the root nodes and the foot nodes of the two auxiliary trees are marked NA, so that adjunction is only possible at the inner S node of each auxiliary tree.]
This TAG blocks derivations like those of abaaab above, while allowing derivations like those of aabaab above.