

Math matters

The Profession of Mathematics

Cheryl E. Praeger

You who glance at this column may care about Mathematics as passionately as I do. You may feel that the importance of Mathematics education is self-evident, for both individuals and society. On the other hand you may also have experienced conversations such as one I had with a friend this week. On telling her that I intended to write about the Profession of Mathematics, she responded that she didn't think that Mathematics was a profession. To be a profession, she said, there had to be a range of careers available, and there were none listed for Mathematics in careers material that her high school-aged son had brought home to show his parents. As a student, she continued, she knew that she had strong problem-solving skills, and good logical and analytical thinking. However, based on her (negative) experience in studying calculus, she had decided to build on her strengths and not proceed with Mathematics.

The skills my friend chose in describing her strengths are close to those I would nominate as generic skills to be obtained from a good Mathematics education. However my friend had a very different view from me on the purpose and outcome of a Mathematics education. My view, already on record, is that "the most important outcome from a mathematics education [is] an automatic expectation by students that mathematical thinking will play a key role in their understanding, and problem-solving in every part of their lives" [1].

Why did my perceptive and well-educated friend have such a different understanding from me of the role of Mathematics? Is there a Mathematics Profession? If so, what is it like? If not, and if we want there to be one, what must we do to achieve this?

Let us assume for the moment that there is a Mathematics Profession and that we are members of it. What are our perceptions of the Profession? Who are the members and where do they work? What do they need professionally, and in particular, what do they need from a professional association?

[1] In The Essential Elements of Mathematics, a paper I wrote in March 2004 in response to an invitation from the Victorian Curriculum and Assessment Authority with respect to its work on developing the Framework of Essential Learning.
[2] E.E. David, President of Exxon Research and Engineering; see http://www.maths.uwa.edu.au/students/prospective/first_year_general.php

Purpose of the Profession

The two hallmarks of Mathematics are its power and its beauty. "The high technology that is so celebrated today is essentially mathematical technology" [2]. Moreover mathematical literacy is critical for an individual to function effectively in modern society. Politicians, industrial leaders, and educators all say they recognise this. To summarise, let's say that the role of the Mathematics Profession is:


• To strengthen Mathematics education in schools and tertiary institutions, in order to fit young people to function effectively in society;

• To enhance the impact of Mathematics research for the health of our own and other disciplines, and ultimately for the public good; and

• To promote effective applications of Mathematical methods and analysis in commerce and industry, for the economic benefit of our community and nation.

Inevitably most of us, as members of the profession, will focus on some aspects more than others. Indeed it is the major challenge for the Mathematics Profession to harness the energy and commitment of all its members to work together towards fulfilling this role.

Many mathematicians are striving towards this. One celebrated successful initiative is the long-running Mathematics in Industry Study Group that seeks to provide annually a forum where mathematicians and other professionals meet to bring mathematical thinking and expertise to bear on a range of problems arising in industry.

Moreover, the mission of the recently established Australian Mathematical Sciences Institute [3] is aligned precisely with this role, as discussed by Garth Gaudry in the third of these columns. In his role as Director of AMSI, Garth found "the level of appreciation of our discipline and its extraordinary impact [to be] extremely high". He called on us in the profession to "broaden our horizons" and "demonstrate our willingness to cooperate, not only among ourselves but with people from the many other endeavours in which the mathematical sciences play a significant role" [4].

Similarly, in the second of these columns, Tony Dooley [5] argued cogently that, in the area of mathematics research, the Mathematics Profession must take "greater control of the mysterious process between theory and applications" and develop "better structures for sharing ideas and projects across the whole spectrum from the purest to the most highly applied research". Taking control of the connection between theory and application is critical, and must be taken seriously. The reason why the process may look "mysterious" is that many of us have not done it – it is challenging and sometimes very difficult, as a real application is rarely as clear-cut as theory. However, the mathematical mind is a good one for solving these problems because of the ability to think clearly, recognise what is a proof (or more commonly, what is not) and to simplify complex systems.

Membership of the Profession

The Mathematics Profession deserves a depth of membership that embraces undergraduate "trainees", mathematics teachers at all levels, mathematics researchers, and commercial and industrial mathematicians. Universities certify as graduate mathematicians those who have a major in a mathematical science. Usually this means graduates with three year degrees. At the least, all of these graduates are part of our profession.

However the profession is broader than this. For example, the PhD program in any of the mathematical sciences offers a rigorous training in research, and this is one of the possible routes into a mathematical career. Some graduates in disciplines other than the mathematical sciences become members of the Mathematics Profession through such a program.

[3] See http://www.amsi.org.au/about.html. The AMSI mission is to become a nationally and internationally recognised centre for the mathematical sciences, providing service to its member institutions, improving the international competitiveness of Australian industry and organisations and enhancing the national level of school mathematics, by the provision and support of mathematical and statistical expertise.
[4] AustMS Gazette 31 (2004), 145–146.
[5] AustMS Gazette 31 (2004), 76–78.


Alternative routes into the Mathematics Profession for those without a complete mathematics or statistics major are of equal validity: for example, the "in-house" and "on-the-job" training and experience that produce effective commercial or industrial mathematicians and statisticians; or the professional development and further study that enable those without a mathematics major to become competent mathematics teachers in schools. Mathematical ability and commitment should determine membership of the Mathematics Profession, not formal qualifications – certainly not the holding of a PhD degree. And let us not forget our undergraduate student members.

When it comes to defining the profession, it is important to use the broadest possible umbrella. It is especially relevant to embrace those diverse users of Mathematics and Statistics (in fields such as Computer Science and Bioinformatics) who may not describe their work as Mathematics. Much of what they do is Mathematics by any reasonable definition. The Mathematics Profession should be taking credit for it and welcoming those who practise it into our fold.

Professional associations

There are many mathematical associations in Australia, of which we may like to think of the Australian Mathematical Society as one of the major ones. The membership of each covers only a "slice" of the profession's membership. Most members of the Australian Mathematical Society (including ANZIAM) are mathematicians or statisticians in universities. A minority are research or commercial mathematicians and statisticians from government or private enterprise, and some are mathematics teachers in schools.

Several other professional mathematical associations include teachers of mathematics in primary and secondary schools, statisticians from all sectors, and mathematical scientists from special sub-disciplines such as Operations Research. There is no single organisation to which all professional mathematicians can logically belong. Moreover, none caters very well for undergraduate mathematics students as members.

By contrast Engineers Australia [6] offers free student membership to all undergraduate engineering students, entailing a monthly student newsletter, and access to careers services, discussion forums and professional advice. From this early stage undergraduate engineering students are welcomed into the Engineering Profession. In addition Engineers Australia has active programs run by its branches, and offers structured professional development programs for individual members and teams.

Efforts to provide opportunities for undergraduate mathematics and statistics students led to the inaugural AMSI Summer School in Melbourne in February 2003, whose success was praised by the Federal Minister, Dr Brendan Nelson [7]. In addition, the Statistical Society of Australia runs a Young Statisticians Section [8] for statistics students and new graduates. Both the Statistical Society of Australia and the Australian Association of Mathematics Teachers have strong state branches that run their own programs independently of the central organisation.

The Accredited Mathematical Scientist

Several mathematical associations have tried to raise public awareness of Mathematics and Statistics, and the quality of members of the Mathematics Profession, by introducing accreditation of their members or of university courses. While accreditation may benefit an individual by providing recognition of their qualifications and experience, the most valuable purpose of accreditation is to assure those outside the profession that the accredited person can help them mathematically.

[6] http://www.ieaust.org.au
[7] Garth Gaudry, Math Matters, AustMS Gazette 31 (2004), 145–146.
[8] http://www.statsoc.org.au/Sections/YoungStatisticians.htm


When the Australian Mathematical Society introduced its accreditation scheme in 1994 during my term as President, it was a controversial decision. The scheme was conservative, measuring worthiness for accreditation against performance levels of university academic staff. Three levels of accreditation were offered, and the Society website currently lists 108 persons who have been accredited as Fellows (the highest level). However, no lists of accredited members, or accredited graduate members, are given. We have, it seems, failed to attract young mathematicians to accredited graduate membership, which is available to those who are graduates with a major in a mathematical science.

The Statistical Society of Australia (SSAI) introduced its accreditation scheme in 1996, and more recently decided to offer graduate accreditation status to those with a three year degree with a major in statistics. Unlike the Society, the SSAI provides a list [9] of all Accredited Statisticians and Graduate Statisticians, together with their contact details and professional areas of interest, thus helping to achieve the major purpose of accreditation. I question whether the accreditation scheme of the Society is sufficiently outwardly focused.

How is the Profession perceived?

As well as internal strength, the Mathematics Profession needs external recognition to ensure that:

• Young people see the relevance of a mathematical training for developing strong problem-solving skills and critical thinking, and the possibility of a variety of satisfying mathematical careers;

• Companies expect and obtain maximum, and cost-effective, benefits from incorporating mathematicians on their staff, or as consultants, to enable them to achieve their competitive edge; and

• Government comprehends the value of investing in mathematics education for its citizens.

How are we as a profession faring in terms of recognition? Some data indicate that Mathematics is facing a crisis, with decreased resources, splintering of the discipline, and dissipation of mathematics content in courses at all levels. In the first of these columns Peter Hall [10] analysed the negative impact on Mathematics and Statistics in Australian universities of government policies on research and higher education, principally the "penalising of highly performing Australian mathematical scientists" and lack of provision of "adequate career paths for younger Australian mathematical scientists". Documenting the decline in resources is important, but does not necessarily shed light on the causes.

If we, as members of the Mathematics Profession, assume that the only relevant issues are government decisions on support for Mathematics, then the cause of the decline is the government. But this might divert the Mathematics Profession from facing the possibility that much of the problem may lie within itself.

The most recent university mathematics enrolment data I have seen indicate that, in the Mathematical Sciences during the period 1995–99 [11], the ratio of the number of graduates with majors in a mathematical science from Australian universities to the number who graduate with honours is more than 5:1. In past years the mathematical associations have largely ignored the former, and they account for 80% of the Profession. Unless the government sees a vigorous Mathematics Profession that acknowledges and engages all its members, we cannot expect it to regard Mathematics as important politically.

[9] http://www.statsoc.org.au/Accreditations/AccreditedMembers.htm
[10] AustMS Gazette 31 (2004), 6–11. Peter called for increased government funding for university mathematics teaching, structured research/teaching fellowships, and fundamental changes to government policy on the measurement of research performance.
[11] Information collected from Heads of Mathematics Departments in 2001.


A way forward

All undergraduate students studying a mathematics subject should be of interest to the Mathematics Profession, not only because they are its clients, but also because they are students with a good mathematics education and we want to ensure that they understand the "unreasonable effectiveness" [12] of mathematics in solving real-world problems. Two simple initiatives the Society might take are:

• To extend the offer of free student membership of the Australian Mathematical Society for the duration of a student's undergraduate career [13]; and

• To establish a Young Mathematicians section of the Society.

However, from a student's perspective (and indeed the perspective of most members of the community), the "angles of separation" between academic mathematicians and statisticians, industrial and commercial mathematicians, and school mathematics teachers, are very small indeed. Students should be welcomed into the Mathematics Profession early in their undergraduate careers – before most of them distinguish (let alone choose) between the various mathematical career paths, not to mention alternative career paths such as the computing or physical sciences, or engineering.

An appropriate initiative to achieve this would involve:

• A joint initiative by several mathematical bodies [14] to welcome and engage undergraduate mathematics students in the Mathematics Profession.

Fortunately for this country, the Australian Mathematics Trust engages many thousands of Australian students in mathematics challenges and enrichment activities while they are in primary and secondary school. Equally fortunately, the Australian Mathematical Sciences Institute has a focus on school and undergraduate training in the mathematical sciences, and the professional development of mathematics teachers. The Trust's activities could become the first part of a seamless program of mathematics enrichment, promotion and information offered to young people by the Mathematics Profession.

Australia is a talented country. It is also a small country, too small to waste its mathematical resources. Success in building the Mathematics Profession demands the goodwill and commitment of all Australian Mathematicians (in the broadest sense of the word).

Acknowledgements

Helpful comments from Robyn Ellis, Stephen Fienberg, John Henstridge, Will Morony and Peter Taylor are gratefully acknowledged.

School of Mathematics and Statistics, University of Western Australia, 35 Stirling Highway, Nedlands WA 6009
E-mail: [email protected]

[12] E. Wigner, The Unreasonable Effectiveness of Mathematics in the Natural Sciences, Communications on Pure and Applied Mathematics 1 (1960).
[13] Currently one year's free membership is offered.
[14] Involving at least the Australian Mathematical Society, the Statistical Society of Australia, the Australian Association of Mathematics Teachers, the Australian Mathematics Trust, and the Australian Mathematical Sciences Institute.


The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

Video Generation from Text

Yitong Li,†∗ Martin Renqiang Min,‡ Dinghan Shen,† David Carlson,† Lawrence Carin†
† Duke University, Durham, NC, United States, 27708

‡NEC Laboratories America, Princeton, NJ, United States, 08540


{yitong.li, dinghan.shen, david.carlson, lcarin}@duke.edu, [email protected]

∗ Most of this work was done when the first and third authors were summer interns at NEC Laboratories America.
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Generating videos from text has proven to be a significant challenge for existing generative models. We tackle this problem by training a conditional generative model to extract both static and dynamic information from text. This is manifested in a hybrid framework, employing a Variational Autoencoder (VAE) and a Generative Adversarial Network (GAN). The static features, called "gist," are used to sketch text-conditioned background color and object layout structure. Dynamic features are considered by transforming input text into an image filter. To obtain a large amount of data for training the deep-learning model, we develop a method to automatically create a matched text-video corpus from publicly available online videos. Experimental results show that the proposed framework generates plausible and diverse short-duration smooth videos, while accurately reflecting the input text information. It significantly outperforms baseline models that directly adapt text-to-image generation procedures to produce videos. Performance is evaluated both visually and by adapting the inception score used to evaluate image generation in GANs.

1 Introduction

Generating images from text is a well-studied topic, but generating video clips based on text has yet to be explored as extensively. Previous work on the generative relationship between text and a short video clip has focused on producing text captioning from video (Venugopalan et al. 2015; Donahue et al. 2015; Pan et al. 2016; Pu et al. 2017). However, the inverse problem of producing videos from text has more degrees of freedom, and is a challenging problem for existing methods. A key consideration in video generation is that both the broad picture and object motion must be determined by the text input. Directly adapting text-to-image generation methods empirically results in videos in which the motion is not influenced by the text.

In this work, we consider motion and background synthesis from text, which is related to video prediction. In video prediction, the goal is to learn a nonlinear transformation function between given frames to predict subsequent frames (Vondrick and Torralba 2017) – this step is also required in video generation. However, simply predicting future frames is not enough to generate a complete video clip. Recent work on video generation has decomposed video into a static background, a mask and moving objects (Vondrick, Pirsiavash, and Torralba 2016; Tulyakov et al. 2017). Both of the cited works use a Generative Adversarial Network (GAN) (Goodfellow et al. 2014), which has shown encouraging results on sample fidelity and diversity.

However, in contrast with these previous works on video generation, here we conditionally synthesize the motion and background features based on side information, specifically text captions. In the following, we call this procedure text-to-video generation. Text-to-video generation requires both a good conditional scheme and a good video generator. There are a number of existing models for text-to-image generation (Reed et al. 2016; Mansimov et al. 2016); unfortunately, simply replacing the image generator by a video generator provides poor performance (e.g. severe mode collapse), which we detail in our experiments. These challenges reveal that even with a well-designed neural network model, directly generating video from text is difficult.

In order to solve this problem, we break down the generation task into two components. First, a conditional VAE model is used to generate the "gist" of the video from the input text, where the gist is an image that gives the background color and object layout of the desired video. The content and motion of the video is then generated by conditioning on both the gist and text input. This generation procedure is designed to mimic how humans create art. Specifically, artists often draw a broad draft and then fill in the detailed information. In other words, the gist-generation step extracts static "universal" features from the text, while the video generator extracts the dynamic "detailed" information from the text.

One approach to combining the text and gist information is to simply concatenate the feature vectors from the encoded text and the gist, as was previously used in image generation (Yan et al. 2016). This method unfortunately struggles to balance the relative strength of each feature set, due to their vastly different dimensionality. Instead, our work computes a set of image filter kernels based on the input text and applies the generated filter on the gist picture to get an encoded text-gist feature vector. This combined vector better models the interaction between the text and the gist than simple concatenation. It is similar to the method used in (De Brabandere et al. 2016) for video prediction and image-style transformation, and (Shen et al. 2017) for question answering. As we demonstrate in the experiments, the text filter better captures the motion information and adds detailed content to the gist.
Figure 1: Samples of video generation from text (text inputs: "Play golf on grass", "Play golf on snow", "Play golf on water"; columns: text input, generated gist, generated video). Universal background information (the gist) is produced based on the text. The text-to-filter step generates the action (e.g., "play golf"). The red circle shows the center of motion in the generated video.
Our contributions are summarized as follows: (i) By viewing the gist as an intermediate step, we propose an effective text-to-video generation framework. (ii) We demonstrate that using input text to generate a filter better models dynamic features. (iii) We propose a method to construct a training dataset based on YouTube (www.youtube.com) videos where the video titles and descriptions are used as the accompanying text. This allows abundant on-line video data to be used to construct robust and powerful video representations.

2 Related Work

2.1 Video Prediction and Generation

Video generation is intimately related to video prediction. Video prediction focuses on making object motion realistic in a stable background. Recurrent Neural Networks (RNNs) and the widely used sequence-to-sequence model (Sutskever, Vinyals, and Le 2014) have shown significant promise in these applications (Villegas et al. 2017; De Brabandere et al. 2016; van Amersfoort et al. 2017; Kalchbrenner et al. 2017). A common thread among these works is that a convolutional neural network (CNN) encodes/decodes each frame and connects to a sequence-to-sequence model to predict the pixels of future frames. In addition, Liu et al. (2017) proposed deep voxel-flow networks for video-frame interpolation. Human-pose features have also been used to reduce the complexity of the generation (Villegas et al. 2017; Chao et al. 2017).

There is also significant work on video generation conditioned on a given image. Specifically, Vukotić et al.; Chao et al.; Walker et al.; Chen et al.; Xue et al. (2017; 2017; 2016; 2017; 2016) propose methods to generate videos based on static images. In these works, it is important to distinguish potential moving objects from the given image. In contrast to video prediction, these methods are useful for generating a variety of potential futures, based upon the current image. Xue et al. (2016) inspired our work by using a cross-convolutional layer. The input image is convolved with its image-dependent kernels to give predicted future frames. A similar approach has previously been used to generate future frames (De Brabandere et al. 2016). For our work, however, we do not have a matching frame for most possible text inputs. Thus it is not feasible to feed in a first frame.

GAN frameworks have been proposed for video generation without the need for a priming image. A first attempt in this direction was made by separating scene and dynamic content (Vondrick, Pirsiavash, and Torralba 2016). Using the GAN framework, a video could be generated purely from randomly sampled noise. Recently, Tulyakov et al. (2017) incorporated an RNN model for video generation into a GAN-based framework. This model can construct a video simply by pushing random noise into a RNN model.

2.2 Conditional Generative Networks

Two of the most popular deep generative models are the Variational Autoencoder (VAE) (Kingma and Welling 2013) and the Generative Adversarial Network (GAN) (Goodfellow et al. 2014). A VAE is learned by maximizing the variational lower bound of the observation while encouraging the approximate (variational) posterior distribution of the hidden latent variables to be close to the prior distribution. The GAN framework relies on a minimax game between a "generator" and a "discriminator." The generator synthesizes data whereas the discriminator seeks to distinguish between real and generated data. In multi-modal situations, GAN empirically shows advantages over the VAE framework (Goodfellow et al. 2014).

In order to build relationships between text and videos, it is necessary to build conditionally generative models, which have received significant recent attention. In particular, (Mirza and Osindero 2014) proposed a conditional GAN model for text-to-image generation. The conditional information was given to both the generator and the discriminator by concatenating a feature vector to the input and the generated image. Conditional generative models have been extended in several directions. Mansimov et al. (2016) generated images from captions with an RNN model using "attention" on the text. Liu and Tuzel; Zhu et al. (2016; 2017) proposed conditional GAN models for either style or domain transfer learning. However, these methods focused on transfer from image to image. Converting these methods for application to text and image/video pairs is non-trivial.
Figure 2: Framework of the proposed text-to-video generation method. The gist generator is within the green box. The encoded text is concatenated with the encoded frame to form the joint hidden representation z_d, which is further transformed into z_g. The video generator is within the yellow box. The text description is transformed into a filter kernel (Text2Filter) and applied to the gist. The generation uses the feature z_g. Following this point, the flow chart forms a standard GAN framework with a final discriminator to judge whether a video and text pair is real or synthetic. After training, the CNN image encoder is ignored.
The most similar work to ours is from Reed et al. (2016), which is the first successful attempt to generate natural images from text using a GAN model. In this work, pairs of data are constructed from the text features and a real or synthetic image. The discriminator tries to detect synthetic images or the mismatch between the text and the image. A direct adaptation unfortunately struggles to produce reasonable videos, as detailed in our experiments. Text-to-video generation requires a stronger conditional generator than what is necessary for text-to-image generation, due to the increased dimensionality. Video is a 4D tensor, where each frame is a 2D image with color information and spatiotemporal dependency. The increased dimensionality challenges the generator to extract both static and motion information from input text.

3 Model Description

We first introduce the components of our model, and then expand on each module in subsequent sections. The overall structure of the proposed model is given in Figure 2. There are three model components: the conditional gist generator (green box), the video generator (yellow box), and the video discriminator. The intermediate step of gist generation is developed using a conditional VAE (CVAE). Its structure is detailed in Section 3.1. The video generation is based on the scene dynamic decomposition with a GAN framework (Vondrick, Pirsiavash, and Torralba 2016). The generation structure is detailed in Section 3.2. Because the proposed video generator is dependent on both the text and the gist, it is hard to incorporate all the information by a simple concatenation, as proposed by Reed et al. (2016). Instead, this generation is dependent on a "Text2Filter" step described in Section 3.3. Finally, the video discriminator is used to train the model in an end-to-end fashion.

The data are a collection of N videos and associated text descriptions, $\{V_i, t_i\}$ for $i = 1, \ldots, N$. Each video $V_i \in \mathbb{R}^{T \times C \times H \times W}$ with frames $V_i = \{v_{1i}, \ldots, v_{Ti}\}$, where C reflects the number of color bands (typically C = 1 or C = 3), and H and W are the number of pixels in the height and width dimensions, respectively, for each video frame. Note that all videos are cut to the same number of frames; this limitation can be avoided by using an RNN generator, but this is left for future work. The text description t is given as a sequence of words (natural language). The index i is only included when necessary for clarity.

The text input was processed with a standard text encoder, which can be jointly trained with the model. Empirically, the chosen encoder is a minor contributor to model performance. Thus for simplicity, we directly adopt the skip-thought vector encoding model (Kiros et al. 2015).

3.1 Gist Generator

In a short video clip, the background is usually static with only small motion changes. The gist generator uses a CVAE to produce the static background from the text (see example gists in Figure 1). Training the CVAE requires pairs of text and images; in practice, we have found that simply using the first frame of the video, $v_1$, works well.

The CVAE is trained by maximizing the variational lower bound

$$\mathcal{L}_{CVAE}(\theta_g, \phi_g; v, t) = \mathbb{E}_{q_{\phi_g}(z_g|v,t)}\left[\log p_{\theta_g}(v \mid z_g, t)\right] - KL\left(q_{\phi_g}(z_g \mid v, t) \,\|\, p(z_g)\right). \quad (1)$$

Following the original VAE construction (Kingma and Welling 2013), the prior $p(z_g)$ is set as an isotropic multivariate Gaussian distribution; $\theta_g$ and $\phi_g$ are parameters related to the decoder and encoder network, respectively. The subscript g denotes gist. The encoder network $q_{\phi_g}(z_g \mid v, t)$ has two sub-encoder networks $\eta(\cdot)$ and $\psi(\cdot)$. $\eta(\cdot)$ is applied to the video frame v and $\psi(\cdot)$ is applied to the text input t. A linear-combination layer is used on top of the encoder to combine the encoded video frame and text. Thus $z_g \sim \mathcal{N}\left(\mu_{\phi_g}([\eta(v); \psi(t)]),\ \mathrm{diag}\,\sigma_{\phi_g}([\eta(v); \psi(t)])\right)$. The decoding network takes $z_g$ as an input. The output of this CVAE network is called the "gist", which is then one of the inputs to the video generator.

At test time, the encoding network on the video frame is ignored, and only the encoding network $\psi(\cdot)$ on the text is applied. This step ensures the model sketches for the text-conditioned video. In our experiments, we demonstrate that directly creating a plausible video with diversity from text is critically dependent on this intermediate generation step.
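Read as code, the gist generator of Section 3.1 is a pair of sub-encoders feeding a Gaussian latent and a text-conditioned decoder, trained with the lower bound in (1). The following is a minimal PyTorch-style sketch, not the authors' released implementation: the sub-encoders, the linear layers producing the Gaussian parameters, the decoder, and the L1 stand-in for the log-likelihood term are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def gist_cvae_loss(frame, text, frame_enc, text_enc, to_mu, to_logvar, decoder):
    """Negative variational lower bound of Eq. (1), to be minimized.

    frame:     first frame v1 of the clip, shape (B, 3, H, W)
    text:      raw caption batch; text_enc plays the role of psi(.)
    frame_enc: hypothetical frame sub-encoder eta(.)
    to_mu, to_logvar: linear layers giving the Gaussian parameters of q(z_g | v, t)
    decoder:   maps (z_g, psi(t)) to a reconstructed gist image
    """
    psi_t = text_enc(text)                                     # encoded text psi(t)
    h = torch.cat([frame_enc(frame), psi_t], dim=1)            # joint hidden representation
    mu, logvar = to_mu(h), to_logvar(h)
    z_g = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterized sample
    gist = decoder(z_g, psi_t)                                 # text-conditioned gist

    # L1 reconstruction as a stand-in for -log p(v | z_g, t); the exact likelihood is an
    # assumption, since the paper does not spell it out at this point.
    recon = F.l1_loss(gist, frame)
    # Closed-form KL(q(z_g | v, t) || N(0, I)) for a diagonal Gaussian posterior.
    kl = -0.5 * torch.mean(1.0 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl, gist
```

At test time only `text_enc` would be used and z_g would be drawn from the standard normal prior, matching the description above.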
3.2 Video Generator

The video is generated by three entangled neural networks, in a GAN framework, adopting the ideas of Vondrick, Pirsiavash, and Torralba (2016). The GAN framework is trained by having a generator and a discriminator compete in a minimax game (Goodfellow et al. 2014). The generator synthesizes fake samples to confuse the discriminator, while the discriminator aims to accurately distinguish synthetic and real samples. This work utilizes the recently developed Wasserstein GAN formulation (Arjovsky, Chintala, and Bottou 2017), given by

$$\min_{\theta_G \in \Theta_G} \max_{\theta_D} \; \mathbb{E}_{V \sim p(V)}\left[D(V; \theta_D)\right] - \mathbb{E}_{z_v \sim p(z_v)}\left[D(G(z_v; \theta_G); \theta_D)\right]. \quad (2)$$

The function D discriminates between real and synthetic video-text pairs, and the parameters $\theta_D$ are limited to maintain a maximum Lipschitz constant of the function. The generator G generates synthetic samples from random noise that attempt to confuse the discriminator.

As mentioned, conditional GANs have been previously used to construct images from text (Reed et al. 2016). Because this work needs to condition on both the gist and text, it is unfortunately complicated to construct gist-text-video triplets in a similar manner. Instead, first a motion filter is computed based on the text t and applied to the gist, further described in Section 3.3. This step forces the model to use the text information to generate plausible motion; simply concatenating the feature sets allows the text information to be given minimal importance on motion generation. These feature maps are further used as input into a CNN encoder (the green cube in Figure 2), as proposed by Isola et al. (2016). The output of the encoder is denoted by the text-gist vector $g_t$, which jointly considers the gist and text information.

To this point, there is no diversity induced for the motion in the text-gist vector, although some variation is introduced in the sampling of the gist based on the text information. The diversity of the motion and the detailed information is primarily introduced by concatenating isometric Gaussian noise $n_v$ with the text-gist vector, to form $z_v = [g_t; n_v]$. The subscript v is short for video. The random-noise vector $n_v$ gives motion diversity to the video and synthesizes detailed information.

We use the scene dynamic decomposition (Vondrick, Pirsiavash, and Torralba 2016). Given the vector $z_v$, the output video from the generator is given by

$$G(z_v) = \alpha(z_v) \odot m(z_v) + (1 - \alpha(z_v)) \odot s(z_v). \quad (3)$$

The output of $\alpha(z_v)$ is a 4D tensor with all elements constrained in [0, 1] and $\odot$ is element-wise multiplication. $\alpha(\cdot)$ and $m(\cdot)$ are both neural networks using 3D fully convolutional layers (Long, Shelhamer, and Darrell 2015). $\alpha(\cdot)$ is a mask matrix to separate the static scene from the motion. The output of $s(z_v)$ is a static background picture repeated through time to match the video dimensionality, where the values in $s(\cdot)$ are from an independent neural network with 2D convolutional layers. Therefore, the text-gist vector $g_t$ and the random noise combine to create further details on the gist (the scene) and dynamic parts of the video.

The discriminator function $D(\cdot)$ in (2) is parameterized as a deep neural network with 3D convolutional layers; it has a total of five convolution and batch normalization layers. The encoded text is concatenated with the video feature on the top fully connected layer to form the conditional GAN framework.

3.3 Text2Filter

Simply concatenating the gist and text encoding empirically resulted in an overly reliant usage of either gist or text information. Tuning the length and relative strength of the features is challenging in a complex framework. Instead, a more robust and effective way to utilize the text information is to construct the motion-generating filter weights based on the text information, which is denoted by Text2Filter. This is shown as the orange cube in Figure 2.

The Text2Filter operation consists of only convolutional layers, following existing literature (Long, Shelhamer, and Darrell 2015). We extend the 2D fully convolutional architecture to a 3D fully convolutional architecture for generating filters from text. The filter is generated from the encoded text vector by a 3D convolutional layer of size $F_c \times F_t \times k_x \times k_y \times k_z$, where $F_t$ is the length of the encoded text vector $\psi(t)$, $F_c$ is the number of output channels, and $k_x \times k_y \times k_z$ is the filter kernel size. The 3D convolution is applied to the text vector. In our experiments, $F_c = 64$ and $k_x = 3$ in accordance with the RGB channels. $k_y$ and $k_z$ are set by the user, since they will become the kernel size applied to the gist after the 3D convolution. After this operation, the encoded text vector $\psi(t)$ of length $F_t$ becomes a filter of size $F_c \times 3 \times k_y \times k_z$, which is applied on the RGB gist image g. A deep network could also be adopted here if desired. Mathematically, the text filter is represented as

$$f_g(t) = \mathrm{3Dconv}(\psi(t)). \quad (4)$$

Note that "3Dconv" represents the 3D full convolution operation and $\psi(\cdot)$ is the text encoder. The filter $f_g(t)$ is directly applied on the gist to give the text-gist vector

$$g_t = \mathrm{Encoder}\left(\mathrm{2Dconv}(g, f_g(t))\right). \quad (5)$$
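The video generator of Sections 3.2 and 3.3 combines two mechanisms: text-dependent filters applied to the gist (Eqs. 4-5) and the masked scene-dynamic composition (Eq. 3). The sketch below illustrates both in PyTorch; it is an illustrative reading rather than the paper's code. In particular, the paper produces the kernels with a 3D convolution over ψ(t), whereas this sketch uses a linear map as a simpler stand-in, and `mask_net`, `motion_net` and `background_net` stand for the 3D/2D convolutional networks α(·), m(·) and s(·), whose exact architectures are assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Text2Filter(nn.Module):
    """Turn an encoded text vector psi(t) into Fc kernels convolved with the RGB gist.
    The paper generates the kernels with a 3D convolution (Eq. 4); a linear map is a
    simplifying assumption here. Fc = 64 and kx = 3 follow the values quoted above."""
    def __init__(self, Ft, Fc=64, ky=3, kz=3):
        super().__init__()
        self.Fc, self.ky, self.kz = Fc, ky, kz
        self.to_kernels = nn.Linear(Ft, Fc * 3 * ky * kz)

    def forward(self, psi_t, gist):                  # psi_t: (B, Ft), gist: (B, 3, H, W)
        B, _, H, W = gist.shape
        k = self.to_kernels(psi_t).view(B * self.Fc, 3, self.ky, self.kz)
        g = gist.reshape(1, B * 3, H, W)             # grouped conv: one filter set per example
        feat = F.conv2d(g, k, groups=B, padding=(self.ky // 2, self.kz // 2))
        return feat.view(B, self.Fc, H, W)           # 2Dconv(g, f_g(t)) of Eq. (5)

def compose_video(z_v, mask_net, motion_net, background_net, T=32):
    """Scene-dynamic composition of Eq. (3): G(z_v) = a * m + (1 - a) * s (element-wise)."""
    a = torch.sigmoid(mask_net(z_v))                 # (B, 3, T, H, W), values in [0, 1]
    m = torch.tanh(motion_net(z_v))                  # moving foreground
    s = torch.tanh(background_net(z_v))              # (B, 3, 1, H, W) static background
    s = s.expand(-1, -1, T, -1, -1)                  # repeat the background through time
    return a * m + (1.0 - a) * s
```

The feature map returned by `Text2Filter` would then pass through the gist encoder to give g_t, and z_v = [g_t; n_v] with Gaussian noise n_v, as described above.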
3.4 Objective Function, Training, and Testing

The overall objective function is manifested by the combination of $\mathcal{L}_{CVAE}$ and $\mathcal{L}_{GAN}$. Including an additional reconstruction loss $\mathcal{L}_{RECONS} = \|G - \hat{V}\|_1$ empirically improves performance, where $\hat{V}$ is the output of the video generator and G is T repeats of g in the time dimension. The final objective function is given by

$$\mathcal{L} = \gamma_1 \mathcal{L}_{CVAE} + \gamma_2 \mathcal{L}_{GAN} + \gamma_3 \mathcal{L}_{RECONS}, \quad (6)$$

where $\gamma_1$, $\gamma_2$ and $\gamma_3$ are scalar weights for each loss term. In the experiments, $\gamma_1 = \gamma_2 = 1$ and $\gamma_3 = 10$, making the values of the three terms comparable empirically. The generator and discriminator are both updated once in each iteration. Adam (Kingma and Ba 2014) is used as an optimizer.

When generating new videos, the video encoder before $z_g$ in Figure 2 is discarded, and the additive noise is drawn $z_g \sim \mathcal{N}(0, I)$. The text description and random noise are then used to generate a synthetic video.
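Putting the pieces together, training alternates one discriminator (critic) update of the Wasserstein objective (2) with one generator update of the weighted objective (6). The loop below is a compressed sketch under several assumptions: `generator`, `discriminator` and `gist_model` are the hypothetical modules sketched earlier (the discriminator standing in for the five-layer 3D-convolutional critic with text concatenated at its top layer), weight clamping is used as a crude approximation of the Lipschitz constraint, and the weights follow the quoted values γ1 = γ2 = 1, γ3 = 10 with Adam as the optimizer.

```python
import torch

def train_step(batch, generator, discriminator, gist_model, opt_g, opt_d,
               gamma=(1.0, 1.0, 10.0), clip=0.01):
    video, text_emb = batch                        # video: (B, 3, T, H, W); text_emb: psi(t)

    # --- Critic update: Wasserstein objective of Eq. (2) ---
    fake = generator(text_emb).detach()
    d_loss = -(discriminator(video, text_emb).mean()
               - discriminator(fake, text_emb).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    for p in discriminator.parameters():           # crude Lipschitz control (assumption)
        p.data.clamp_(-clip, clip)

    # --- Generator update: Eq. (6) = g1*L_CVAE + g2*L_GAN + g3*L_RECONS ---
    l_cvae, gist = gist_model(video[:, :, 0], text_emb)   # CVAE loss on the first frame
    fake = generator(text_emb)
    l_gan = -discriminator(fake, text_emb).mean()
    gist_video = gist.unsqueeze(2).expand_as(fake)         # T repeats of the gist in time
    l_recons = (fake - gist_video).abs().mean()            # L1 reconstruction term
    g_loss = gamma[0] * l_cvae + gamma[1] * l_gan + gamma[2] * l_recons
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

Here `opt_g` and `opt_d` would be `torch.optim.Adam` instances, one per network, matching the optimizer named above.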
4 Dataset Creation

Because there is no standard publicly available text-to-video generation dataset, we propose a way to download videos with matching text descriptions. This method is similar in concept to the method in Ye et al. (2015) that was used to create a large-scale video-classification dataset.

Retrieving massive numbers of videos from YouTube is easy; however, automatic curation of this dataset is not as straightforward. The data-collection process we have considered proceeds as follows. For each keyword, we first collected a set of videos together with their title, description, duration and tags from YouTube. The dataset was then cleaned by outlier-removal techniques. Specifically, the methods of (Berg, Berg, and Shih 2010) were used to get the 10 most frequent tags for the set of videos. The quality of the selected tags is further guaranteed by matching them to the words in existing categories in ImageNet (Deng et al. 2009) and ActionBank (Sadanand and Corso 2012). These two datasets help ensure that the selected tags have visually detectable objects and actions. Only videos with at least three of the selected tags were included. Other requirements include (i) the duration of the video should be within the range of 10 to 400 seconds, (ii) the title and description should be in English, and (iii) the title should have more than four meaningful words after removing numbers and stop words.

Clean videos from the Kinetics Human Action Video Dataset (Kinetics) (Kay et al. 2017) are additionally used with the steps described above to further expand the dataset. The Kinetics dataset contains up to one thousand videos in each category, but the combined visual and text quality and consistency is mixed. For instance, some videos have non-English titles and others have bad video quality. In our experiments, we choose ten keywords as our selected categories: 'biking in snow', 'playing hockey', 'jogging', 'playing soccer ball', 'playing football', 'kite surfing', 'playing golf', 'swimming', 'sailing' and 'water skiing'. Note that the selected keywords are related to some categories in the Kinetics dataset. Most of the videos in the Kinetics dataset and the downloaded videos unfortunately have meaningless titles, such as a date indicating when the video was shot. After screening these videos, we end up with about 400 videos for each category. Using the YouTube8M (Abu-El-Haija et al. 2016) dataset for this process is also feasible, but the Kinetics dataset has cleaner videos than YouTube8M.

5 Experiments

5.1 Video Preprocessing

Current video-generation techniques only deal with smooth dynamic changes. A sudden change of shot or fast-changing background introduces complex non-linearities between frames, causing existing models to fail. Therefore, each video is cut and only qualified clips are used for the training (Vondrick, Pirsiavash, and Torralba 2016). The clips were qualified as follows. Each video uses a sampling rate of 25 frames per second. SIFT key points are extracted for each frame, and the RANSAC algorithm determines whether continuous frames have enough key-point overlap (Lowe 1999). This step ensures smooth motions in the background and objects in the used videos. Each video clip is limited to 32 frames, with 64 × 64 resolution. Pixel values are normalized to the range of [−1, 1], matching the use of the tanh function in the network output layer.
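Concretely, the corpus construction of Section 4 and the clip qualification of Section 5.1 amount to a metadata filter (tag overlap, duration, an English title with enough content words) followed by a per-clip smoothness test based on SIFT key points and a RANSAC homography, plus resizing and normalizing the surviving clips. The helpers below sketch those checks with OpenCV; the inlier threshold, the Lowe-ratio test, and the simple word filter are illustrative assumptions, not the exact pipeline used to build the paper's dataset.

```python
import cv2
import numpy as np

def metadata_ok(title, duration_s, tags, selected_tags):
    """Video-level filters from Section 4 (thresholds as quoted above)."""
    content_words = [w for w in title.split() if w.isalpha()]   # crude number/stop-word removal
    return (10 <= duration_s <= 400
            and len(content_words) > 4
            and len(set(tags) & set(selected_tags)) >= 3)

def frames_overlap(frame_a, frame_b, min_inliers=20):
    """SIFT key points + RANSAC homography as the clip-smoothness test of Section 5.1.
    min_inliers and the 0.75 ratio test are assumed values."""
    sift = cv2.SIFT_create()
    ka, da = sift.detectAndCompute(frame_a, None)
    kb, db = sift.detectAndCompute(frame_b, None)
    if da is None or db is None:
        return False
    good = []
    for pair in cv2.BFMatcher().knnMatch(da, db, k=2):
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            good.append(pair[0])
    if len(good) < 4:
        return False
    src = np.float32([ka[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kb[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    _, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return inliers is not None and int(inliers.sum()) >= min_inliers

def normalize_clip(frames):
    """Keep 32 frames, resize to 64x64, scale pixels to [-1, 1] for the tanh output layer."""
    frames = [cv2.resize(f, (64, 64)) for f in frames[:32]]
    return [f.astype(np.float32) / 127.5 - 1.0 for f in frames]
```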
Figure 3: Two baselines adapted from previous work. (a) Baseline with only a text encoder: the conditional framework proposed by Vondrick, Pirsiavash, and Torralba (2016), originally used for video prediction conditioned on a starting frame; here the starting frame is replaced with the text description. (b) Baseline with pairing information: the discriminator operates on the concatenation of encoded video and text vectors, inspired by Reed et al. (2016).

5.2 Models for Comparison

To demonstrate the effectiveness of our gist generation and conditional text filter, we compare the proposed method to several baseline models. The scene dynamic decomposition framework (Vondrick, Pirsiavash, and Torralba 2016) is used in all the following baselines, which could be replaced with alternative frameworks. These baseline models are as follows:

• Direct text-to-video generation (DT2V): Concatenated encoded text ψ(t) and randomly sampled noise are fed into a video generator without the intermediate gist-generation step. This also includes a reconstruction loss L_RECONS as in (6). This is the method shown in Figure 3(a).

• Text-to-video generation with pair information (PT2V): DT2V is extended using the framework of Reed et al. (2016). The discriminator judges whether the video and text pair is real, synthetic, or a mismatched pair. This is the method in Figure 3(b). We use a linear concatenation for the video and text features in the discriminator.
• Text-to-video generation with gist (GT2V): The proposed model, including only the conditional VAE for gist generation but not the conditional text filter (Text2Filter).

• Video generation from text with gist and Text2Filter (T2V): This is the complete proposed model in Section 3 with both gist generation and Text2Filter components.

Figure 4: Comparison of generated videos with different methods (text inputs "Swimming in swimming pool" and "Playing golf"; one row per method: DT2V, PT2V, GT2V, T2V). The generated movie clips are given as supplemental files (http://www.cs.toronto.edu/pub/cuty/Text2VideoSupp).

Figure 4 presents samples generated by these four models, given text inputs "swimming in the swimming pool" and "playing golf". The DT2V method fails to generate plausible videos, implying that the model in Figure 3(a) does not have the ability to simultaneously represent both the static and motion features of the input. Using the "pair trick" (Reed et al. 2016; Isola et al. 2016) does not drastically alter these results. We hypothesize that because the video is a 4D tensor while the text is a 1D vector, balancing the strength of each domain in the discriminator is rendered difficult. By using gist generation, GT2V gives a correct background and object layout but is deficient in motion generation. By concatenating the encoded gist vector, the encoded text vector, and the noise vector, the video generator of (3) is hard to control. Specifically, this method may completely ignore the encoded text feature when generating motion. This is further explained in Section 5.5.

In comparison, the T2V model provides both background and motion features. The intermediate gist-generation step fixes the background style and structure, and the following Text2Filter step forces the synthesized motion to use the text information. These results demonstrate the necessity of both the gist generator and the Text2Filter components in our model. In the following subsections, we intentionally generate videos that do not usually happen in the real world. This is to address concerns of simply replicating videos in the training set.

5.3 Static Features

This section shows qualitative results of the gist generation, demonstrating that the gist reflects the static and background information from the input text.

Figures 5(a) and 5(b) show sample gists of kite surfing at two different places. When generating videos with a grass field, the gist shows a green color. In contrast, when kite surfing on the sea, the background changes to a light blue. A black blurred shape appears in the gist in both cases, which is filled in with detail in the video generation. In Figure 5(c), the lanes of a swimming pool are clearly visible. In contrast, the gist for swimming in snow gives a white background. Note that for two different motions at the same location, the gists are similar (results not shown due to space).

Figure 5: Input text with the same motion and different background information. Panels: (a) kitesurfing on the sea; (b) kitesurfing on grass; (c) swimming in swimming pool; (d) swimming in snow.

One of the limitations of our model is the capacity of motion generation. In Figure 6, although the background color is correct, the kite-surfing motion on the grass is not consistent with reality. Additional samples can be found in Figure 1.

Figure 6: Left is from text input "kitesurfing on the sea". Right is from text input "kitesurfing on grass".

5.4 Motion Features

We further investigate motion-generation performance, which is shown by giving a similar background and sampling the generated motion. The samples are given in Figure 7.

Figure 7: Same textual motion for different locations. (a) Left: "swimming at swimming pool"; right: "playing golf at swimming pool". (b) Left: "sailing on the sea"; right: "running on the sea". These text inputs show generalization, as the text in the right column does not exist in the training data.

This figure shows that a different motion can be successfully generated with similar backgrounds. However, the greatest limitation of the current CNN video generator is its difficulty in keeping the object shape while generating a reasonable motion. Moving to specific features such as human pose or skeleton generation could provide improvements to this issue (Chao et al. 2017; Walker et al. 2017).

5.5 Quantitative Results

Following the idea of the inception score (Salimans et al. 2016), we first train a classifier on six categories: 'kite surfing', 'playing golf', 'biking in snow', 'sailing', 'swimming' and 'water skiing'. Additional categories were excluded due to the low in-set accuracy of the classifier on those categories.

A relatively simple video classifier is used, which is a five-layer neural network with 3D full convolutions (Long, Shelhamer, and Darrell 2015) and ReLU nonlinearities. The output of the network is converted to classification scores through a fully connected layer followed by a soft-max layer. In the training process, the whole video dataset is split with ratios 7 : 1 : 2 to create training, validation and test sets. The trained classifier was used on the 20% left-out test data as well as the generated samples from the proposed and baseline models. The classification accuracy is given in Table 1.

Table 1: Accuracy on different test sets. 'In-set' means the test set of real videos. DT2V, PT2V, GT2V, and T2V (the full proposed model) are described in Section 5.2.

          | In-set | DT2V  | PT2V  | GT2V  | T2V
 Accuracy | 0.781  | 0.101 | 0.134 | 0.192 | 0.426
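The numbers in Table 1 come from a classifier-based protocol in the spirit of the inception score: train a small 3D-convolutional video classifier on real clips from the six categories, then report how often it assigns held-out real clips ("In-set") and generated clips to the intended category. Below is a compact sketch of such a classifier and the accuracy computation; the channel widths and the global-average-pooling readout are assumptions for illustration, not the exact five-layer network used above.

```python
import torch
import torch.nn as nn

class VideoClassifier(nn.Module):
    """Five 3D-conv blocks with ReLU, then a linear readout (widths assumed)."""
    def __init__(self, num_classes=6, widths=(16, 32, 64, 128, 256)):
        super().__init__()
        layers, c_in = [], 3
        for c_out in widths:
            layers += [nn.Conv3d(c_in, c_out, kernel_size=3, stride=2, padding=1), nn.ReLU()]
            c_in = c_out
        self.features = nn.Sequential(*layers)
        self.head = nn.Linear(widths[-1], num_classes)

    def forward(self, clips):                          # clips: (B, 3, T, H, W)
        h = self.features(clips).mean(dim=(2, 3, 4))   # global average pool
        return self.head(h)                            # logits; softmax applied in the metric

@torch.no_grad()
def accuracy(model, clips, labels):
    """Fraction of clips assigned to their source category (the metric reported in Table 1)."""
    pred = model(clips).softmax(dim=1).argmax(dim=1)
    return (pred == labels).float().mean().item()
```

Applied to a batch of clips generated from each baseline, the same `accuracy` call yields the per-model columns of Table 1.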
We observe clear mode collapse when using DT2V and PT2V, explaining their poor performance. Further, it appears that directly generating video from a GAN framework fails because the video generator is not powerful enough to account for both the static and motion features from text. Using the gist generation in GT2V provides an improvement over the other baseline models. This demonstrates the usefulness of the gist, which alleviates the burden of the video generator. Notably, the full proposed model (including Text2Filter) performs best on this metric by a significant margin, showing the necessity of both the gist generation and Text2Filter.

Figure 8 shows the confusion matrix when the classifier is applied to the generated videos of our full model. Generated videos of swimming and playing golf are easier to classify than other categories. In contrast, both 'sailing' and 'kite surfing' take place on the sea, so it is difficult to distinguish between them. This demonstrates that the gist-generation step distinguishes different background styles successfully.

Figure 8: Classification confusion matrix on T2V generated samples.

6 Conclusion

This paper proposes a framework for generating video from text using a hybrid VAE-GAN framework. The intermediate gist-generation step greatly helps enforce the static background of video from input text. The proposed Text2Filter helps capture dynamic motion information from text. In the future, we plan to build a more powerful video generator by generating human pose or skeleton features, which will further improve the visual quality of generated human activity videos.
References

Abu-El-Haija, S.; Kothari, N.; Lee, J.; Natsev, P.; Toderici, G.; Varadarajan, B.; and Vijayanarasimhan, S. 2016. Youtube-8m: A large-scale video classification benchmark. arXiv:1609.08675.
Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein gan. ICML.
Berg, T. L.; Berg, A. C.; and Shih, J. 2010. Automatic attribute discovery and characterization from noisy web data. In ECCV.
Chao, Y.-W.; Yang, J.; Price, B.; Cohen, S.; and Deng, J. 2017. Forecasting human dynamics from static images. In IEEE CVPR.
Chen, B.; Wang, W.; Wang, J.; Chen, X.; and Li, W. 2017. Video imagination from a single image with transformation generation. arXiv:1706.04124.
De Brabandere, B.; Jia, X.; Tuytelaars, T.; and Van Gool, L. 2016. Dynamic filter networks. In NIPS.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In IEEE CVPR.
Donahue, J.; Anne Hendricks, L.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; and Darrell, T. 2015. Long-term recurrent convolutional networks for visual recognition and description. In IEEE CVPR.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS.
Isola, P.; Zhu, J.-Y.; Zhou, T.; and Efros, A. A. 2016. Image-to-image translation with conditional adversarial networks. arXiv:1611.07004.
Kalchbrenner, N.; Oord, A. v. d.; Simonyan, K.; Danihelka, I.; Vinyals, O.; Graves, A.; and Kavukcuoglu, K. 2017. Video pixel networks. ICML.
Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. 2017. The kinetics human action video dataset. arXiv:1705.06950.
Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980.
Kingma, D. P., and Welling, M. 2013. Auto-encoding variational bayes. arXiv:1312.6114.
Kiros, R.; Zhu, Y.; Salakhutdinov, R. R.; Zemel, R.; Urtasun, R.; Torralba, A.; and Fidler, S. 2015. Skip-thought vectors. In NIPS.
Liu, M.-Y., and Tuzel, O. 2016. Coupled generative adversarial networks. In NIPS.
Liu, Z.; Yeh, R.; Tang, X.; Liu, Y.; and Agarwala, A. 2017. Video frame synthesis using deep voxel flow. ICCV.
Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In CVPR.
Lowe, D. G. 1999. Object recognition from local scale-invariant features. In IEEE ICCV, volume 2.
Mansimov, E.; Parisotto, E.; Ba, J.; and Salakhutdinov, R. 2016. Generating images from captions with attention. In ICLR.
Mirza, M., and Osindero, S. 2014. Conditional generative adversarial nets. arXiv:1411.1784.
Pan, Y.; Mei, T.; Yao, T.; Li, H.; and Rui, Y. 2016. Jointly modeling embedding and translation to bridge video and language. In IEEE CVPR.
Pu, Y.; Min, M. R.; Gan, Z.; and Carin, L. 2017. Adaptive feature abstraction for translating video to language. ICLR workshop.
Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; and Lee, H. 2016. Generative adversarial text-to-image synthesis. In ICML.
Sadanand, S., and Corso, J. J. 2012. Action bank: A high-level representation of activity in video. In IEEE CVPR.
Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; and Chen, X. 2016. Improved techniques for training gans. In NIPS.
Shen, D.; Min, M. R.; Li, Y.; and Carin, L. 2017. Adaptive convolutional filter generation for natural language understanding. arXiv preprint arXiv:1709.08294.
Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In NIPS.
Tulyakov, S.; Liu, M.-Y.; Yang, X.; and Kautz, J. 2017. Mocogan: Decomposing motion and content for video generation. arXiv:1707.04993.
van Amersfoort, J.; Kannan, A.; Ranzato, M.; Szlam, A.; Tran, D.; and Chintala, S. 2017. Transformation-based models of video sequences. arXiv:1701.08435.
Venugopalan, S.; Rohrbach, M.; Donahue, J.; Mooney, R.; Darrell, T.; and Saenko, K. 2015. Sequence to sequence-video to text. In IEEE ICCV.
Villegas, R.; Yang, J.; Hong, S.; Lin, X.; and Lee, H. 2017. Decomposing motion and content for natural video sequence prediction. ICLR.
Vondrick, C., and Torralba, A. 2017. Generating the future with adversarial transformers. In CVPR.
Vondrick, C.; Pirsiavash, H.; and Torralba, A. 2016. Generating videos with scene dynamics. In NIPS.
Vukotić, V.; Pintea, S.-L.; Raymond, C.; Gravier, G.; and Van Gemert, J. 2017. One-step time-dependent future video frame prediction with a convolutional encoder-decoder neural network. arXiv:1702.04125.
Walker, J.; Doersch, C.; Gupta, A.; and Hebert, M. 2016. An uncertain future: Forecasting from static images using variational autoencoders. In ECCV.
Walker, J.; Marino, K.; Gupta, A.; and Hebert, M. 2017. The pose knows: Video forecasting by generating pose futures.
Xue, T.; Wu, J.; Bouman, K.; and Freeman, B. 2016. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NIPS.
Yan, X.; Yang, J.; Sohn, K.; and Lee, H. 2016. Attribute2image: Conditional image generation from visual attributes. In ECCV.
Ye, G.; Li, Y.; Xu, H.; Liu, D.; and Chang, S.-F. 2015. Eventnet: A large scale structured concept library for complex event detection in video. In ACM Int. Conf. on Multimedia.
Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV.




CONCEPTS OF
PROGRAMMING LANGUAGES

TENTH EDITION

ROBERT W. SEBESTA
University of Colorado at Colorado Springs

Boston Columbus Indianapolis New York San Francisco Upper Saddle River

Amsterdam Cape Town Dubai London Madrid Milan Munich Paris Montreal Toronto
Delhi Mexico City Sao Paulo Sydney Hong Kong Seoul Singapore Taipei Tokyo

Vice President and Editorial Director, ECS: Marcia Horton
Editor in Chief: Michael Hirsch
Executive Editor: Matt Goldstein
Editorial Assistant: Chelsea Kharakozova
Vice President Marketing: Patrice Jones
Marketing Manager: Yez Alayan
Marketing Coordinator: Kathryn Ferranti
Marketing Assistant: Emma Snider
Vice President and Director of Production: Vince O’Brien
Managing Editor: Jeff Holcomb
Senior Production Project Manager: Marilyn Lloyd
Manufacturing Manager: Nick Sklitsis
Operations Specialist: Lisa McDowell
Cover Designer: Anthony Gemmellaro
Text Designer: Gillian Hall
Cover Image: Mountain near Pisac, Peru; Photo by author
Media Editor: Dan Sandin
Full-Service Vendor: Laserwords
Project Management: Gillian Hall
Printer/Binder: Courier Westford
Cover Printer: Lehigh-Phoenix Color

This book was composed in InDesign. Basal font is Janson Text. Display font is ITC
Franklin Gothic.

Copyright © 2012, 2010, 2008, 2006, 2004 by Pearson Education, Inc., publishing as Addison-Wesley. All rights reserved. Manufactured in the United States of America. This publication is protected by Copyright, and permission should be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. To obtain permission(s) to use material from this work, please submit a written request to Pearson Education, Inc., Permissions Department, One Lake Street, Upper Saddle River, New Jersey 07458, or you may fax your request to 201-236-3290.

Many of the designations by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Library of Congress Cataloging-in-Publication Data


Sebesta, Robert W.
Concepts of programming languages / Robert W. Sebesta.—10th ed.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-13-139531-2 (alk. paper)
1. Programming languages (Electronic computers) I. Title.
QA76.7.S43 2009
005.13—dc22 2008055702

10 9 8 7 6 5 4 3 2 1

ISBN 10: 0-13-139531-9


ISBN 13: 978-0-13-139531-2

New to the Tenth Edition


Chapter 5: a new section on the let construct in functional programming languages was added

Chapter 6: the section on COBOL's record operations was removed; new sections on lists, tuples, and unions in F# were added

Chapter 8: discussions of Fortran's Do statement and Ada's case statement were removed; descriptions of the control statements in functional programming languages were moved to this chapter from Chapter 15

Chapter 9: a new section on closures, a new section on calling subprograms indirectly, and a new section on generic functions in F# were added; the description of Ada's generic subprograms was removed

Chapter 11: a new section on Objective-C was added; the chapter was substantially revised

Chapter 12: a new section on Objective-C was added; five new figures were added

Chapter 13: a section on concurrency in functional programming languages was added; the discussion of Ada's asynchronous message passing was removed

Chapter 14: a section on C# event handling was added

Chapter 15: a new section on F# and a new section on support for functional programming in primarily imperative languages were added; discussions of several different constructs in functional programming languages were moved from Chapter 15 to earlier chapters

Preface

Changes for the Tenth Edition

The goals, overall structure, and approach of this tenth edition of Concepts of Programming Languages remain the same as those of the nine earlier editions. The principal goals are to introduce the main constructs of contemporary programming languages and to provide the reader with the tools necessary for the critical evaluation of existing and future programming languages. A secondary goal is to prepare the reader for the study of compiler design by providing an in-depth discussion of programming language structures, presenting a formal method of describing syntax, and introducing approaches to lexical and syntactic analysis.

The tenth edition evolved from the ninth through several different kinds
of changes. To maintain the currency of the material, some of the discussion
of older programming languages has been removed. For example, the descrip-
tion of COBOL’s record operations was removed from Chapter 6 and that of
Fortran’s Do statement was removed from Chapter 8. Likewise, the description
of Ada’s generic subprograms was removed from Chapter 9 and the discussion
of Ada’s asynchronous message passing was removed from Chapter 13.

On the other hand, a section on closures, a section on calling subprograms indirectly, and a section on generic functions in F# were added to Chapter 9; sections on Objective-C were added to Chapters 11 and 12; a section on concurrency in functional programming languages was added to Chapter 13; a section on C# event handling was added to Chapter 14; a section on F# and a section on support for functional programming in primarily imperative languages were added to Chapter 15.

In some cases, material has been moved. For example, several different
discussions of constructs in functional programming languages were moved
from Chapter 15 to earlier chapters. Among these were the descriptions of the
control statements in functional programming languages to Chapter 8 and the
lists and list operations of Scheme and ML to Chapter 6. These moves indicate
a significant shift in the philosophy of the book—in a sense, the mainstreaming
of some of the constructs of functional programming languages. In previous
editions, all discussions of functional programming language constructs were
segregated in Chapter 15.

Chapters 11, 12, and 15 were substantially revised, with five figures being
added to Chapter 12.

Finally, numerous minor changes were made to a large number of sections of the book, primarily to improve clarity.



The Vision
This book describes the fundamental concepts of programming languages by
discussing the design issues of the various language constructs, examining the
design choices for these constructs in some of the most common languages,
and critically comparing design alternatives.

Any serious study of programming languages requires an examination of some related topics, among which are formal methods of describing the syntax and semantics of programming languages, which are covered in Chapter 3. Also, implementation techniques for various language constructs must be considered: Lexical and syntax analysis are discussed in Chapter 4, and implementation of subprogram linkage is covered in Chapter 10. Implementation of some other language constructs is discussed in various other parts of the book.

The following paragraphs outline the contents of the tenth edition.


Chapter Outlines
Chapter 1 begins with a rationale for studying programming languages. It then
discusses the criteria used for evaluating programming languages and language
constructs. The primary in
