Retention Modeling at Scholastic Travel Company
On a sunny Monday afternoon in early spring 2013, David Powell entered his new office and
took a deep breath. He pondered his first few days as the new data analyst for Scholastic Travel Company
(STC), an educational tourism firm. Powell had filled his first week of employment meeting the firm’s
departmental leadership and attending a company-wide new-employee-orientation program, and he was
eager to get started on his first project.
Just a few hours earlier, at the weekly marketing strategy meeting, Powell’s new supervisor,
Stephen Blackford, stressed the urgency of a new data initiative centered on customer retention. As
Blackford outlined, in less than two weeks, contract renewal opportunities would begin for customers
who had gone on an STC trip in 2012. During the meeting, he presented a dataset with all of the known
information about the previous year’s client base (see Exhibits 1 and 2). From his past experience,
Blackford was confident that models could be constructed to predict whether or not a customer would
book again in 2013. With such a model, he hoped to design a more nuanced marketing strategy that
would target certain subsets of the client population to save cost and improve yield. With multiple
plausible methodologies in mind, Powell knew he needed to get to work immediately so he could give
Blackford an accurate prediction model before the end of the week.
Company Background
STC was not a particularly young company. Founded in the 1960s, it grew from a single-person
operation to a multi-million-dollar business, was bought and sold twice (first to the management when the
founder retired, then to a private equity firm), and narrowly avoided declaring bankruptcy after
9/11. Yet as of 2013, it was one of the premium providers of cultural and educational trips: history and
science trips for middle- and high-school students, exchange trips for university students, cultural
immersion, artistic destination trips, and other tours worldwide.
Customers chose STC because of its superb ability to coordinate the numerous details associated
with taking a large group of primarily young people on a far-away journey. These included the procedures
related to obtaining proper documents and permits (e.g., visas); logistical details (bus, plane, train, and
other tickets); meet-and-greets at the transfer points and destinations; hotel, meal, and entertainment
bookings; taking care of safety concerns (chaperones and accompanying security guards to ensure
physical safety and eliminate the possibility of sexual, emotional, and substance abuse); insurance; and
accident “resolution” (e.g., searching for missing travelers or replacing a lost passport). All were critical
elements of a successful trip, which the school teachers, university administrators, parents, and students
themselves were glad to outsource to trusted professionals, and STC was a prime example.
The majority of the trips STC managed were of the “teacher organized, parent paid” type. This
meant that the teacher (or university administrator) determined the itinerary, desired duration, and activity
schedule, but the parents (or students) paid for the trip. For that purpose, it was not uncommon for the
teachers/administrators to hold meetings with parents/students prior to the trip. STC typically kept track
of those meetings, as they often revealed important information about the upcoming trips. STC
representatives would often attend such meetings, either in person or virtually.
STC also collected and carefully tracked multiple types of data about the travel group and the
organizing teacher/administrator, and it sought feedback after the trip. This was all recorded in the STC
cloud-based database and was easily accessible to Powell.
Prediction Task and Available Data
Powell’s ultimate task was to predict which customers would book with STC in the 2013–14
school year (fall 2013 to spring 2014). He decided to build a model that took the data available as of
spring 2013 to make this prediction. To build such a model, however, Powell would need to replicate the
2013–14 prediction task on the available data. This meant that for training his model, he would use the
data from the 2012–13 school year—which showed whether a certain group had been retained or not—
and try to predict based on the client-profile information as of the end of the 2011–12 school year. He
was lucky that STC took snapshots of customer-profile data once per year, so this historical data was
available (see Exhibit 1).
Powell knew from the marketing meeting that the company used post-trip surveys to track
performance and get feedback from the teachers. These responses, coupled with trip data such as trip
revenue, trip length, and school size were what he would use to construct his retention model. A
comprehensive list of data fields was included with the spreadsheet that Blackford sent to Powell (see
Exhibit 2). With a sample size of nearly 2,400 groups, Powell was hopeful he could have a model that
would make reasonably accurate predictions for Blackford before the end of the week, so that the new
marketing strategy could be deployed before the sales season started later in the spring.
As David Powell progressed with building various models, he shared the results with one of the
analysts at Scholastic Travel Company (STC), Emily Glenn. In particular, his models thus far were
approximately 75% to 80% accurate—meaning they were correctly predicting which customers would
return on their own and which would not in 75% to 80% of the cases. In the remaining cases, the models
were making mistakes: either incorrectly predicting that customers would be retained when they had not
been, or incorrectly predicting that customers would not be retained when they actually had been.
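The accuracy figure and the two kinds of mistakes described above can be made concrete with a small sketch. The labels below are hypothetical (they are not STC data), and the function names are illustrative only; 1 means the group was retained and 0 means it was not.

```python
def confusion_counts(actual, predicted):
    """Count the four possible outcomes for 1/0 retention labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1  # predicted retained, actually retained
        elif a == 0 and p == 0:
            counts["tn"] += 1  # predicted not retained, actually not retained
        elif a == 0 and p == 1:
            counts["fp"] += 1  # predicted retained, actually not retained
        else:
            counts["fn"] += 1  # predicted not retained, actually retained
    return counts


def accuracy(counts):
    """Share of all cases in which the prediction matched the outcome."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


# Hypothetical outcomes and predictions for ten groups.
actual = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
predicted = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]

counts = confusion_counts(actual, predicted)
print(counts, accuracy(counts))
```

Keeping the two error counts separate matters for a retention campaign, because wrongly assuming a customer will return may carry a different cost than marketing to one who would have returned anyway.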
Glenn and Powell grabbed a coffee and had a quick chat. She was generally impressed with the
initial results. “Quite a bit better than just guessing!” Glenn said. She also pointed out, however, two
potentially valuable kinds of data that Powell had not been using thus far.
First was the net promoter score (NPS) data. NPS was a common metric for measuring customer
satisfaction. The customers were asked, “How likely are you to recommend STC to a colleague?” The
possible responses ranged from “extremely likely,” which was awarded the score of 10, to “not at all
likely,” which was awarded the score of 1. To calculate the number of net promoters, one took the
number of those who responded with 9 or 10 (referred to as “promoters”) and subtracted the number of
those who responded with 5 or less (referred to as “detractors”). The “score” was then the number of
net promoters divided by the number of respondents. STC started collecting customer answers to the NPS
question in 2008, but initially it did little to ensure that customers responded. This changed over time, and so
more data were available for recent years.
Second was the data on group formation. As Glenn explained,
under the “teacher organized, parent paid” model that most groups followed, STC monitored the
registration of the parents/students on its website, and therefore it knew when each group reached a
certain size. A size of 20 to 35 was considered healthy, and the belief was that the schools that were able
to put together groups of large size were more likely to be retained.
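The NPS calculation Glenn described can be sketched as follows. The thresholds (9 or 10 for promoters, 5 or less for detractors) are taken from the description above; the function name, the use of None for a customer who did not answer, and the sample responses are assumptions made for illustration.

```python
def net_promoter_score(responses, promoter_min=9, detractor_max=5):
    """Return (promoters - detractors) / respondents, ignoring missing answers.

    `responses` holds scores from 1 to 10, with None for customers who did
    not answer. Returns None when nobody answered.
    """
    answered = [r for r in responses if r is not None]
    if not answered:
        return None
    promoters = sum(1 for r in answered if r >= promoter_min)
    detractors = sum(1 for r in answered if r <= detractor_max)
    return (promoters - detractors) / len(answered)


# Hypothetical survey answers; two customers did not respond.
responses = [10, 9, 10, None, 7, 4, 10, None, 8, 3]
score = net_promoter_score(responses)
print(score)
```

Excluding non-respondents from the denominator follows the "among the number of respondents" wording; treating a missing answer as a low score would be a different, and arguably harsher, modeling choice.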
“Do you think these would be helpful?” wondered Glenn.
“Of course!” Powell sounded excited. “I would need to think about how to deal with the missing
values, but beyond that, these data should only improve the accuracy. Do you have access? Can you send
them to me?”
“We were actually looking at these just last week,” Glenn replied, “so I have all the data in one
place. I can send them to you almost right away.”
“Perfect! Thank you so much—glad I spoke to you. Looking forward to working them into my models.”
“Done deal; it was nice chatting with you too, David,” said Glenn, as she turned to walk up to her floor. A
few minutes later, Powell received a file from Glenn.
Questions.
(1) Use a chi-square test for independence to find at least three categorical variables that are significantly
associated with customer retention. Paste the results below and explain your findings.
(2) Use regression to find a group of variables that predict travel expenses. Paste the results below
and explain your findings.
(3) Split the data into a training set and a test set. Use the training set to train a logistic regression model
that has at least 70% overall accuracy on the test set. Paste the results below and explain your findings.
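As a starting point for question (1), the sketch below computes a chi-square test for independence on a single 2x2 table by hand. The counts are hypothetical (they are not taken from the STC data), the rows are imagined as Parent.Meeting.Flag = 1/0 and the columns as retained/not retained, and no continuity correction is applied. The p-value shortcut used here is valid only for one degree of freedom, that is, for a 2x2 table.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)

    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: rows = parent meeting held (yes, no),
# columns = group retained (yes, no).
table = [[120, 80], [90, 110]]
stat, p_value = chi_square_2x2(table)
print(round(stat, 4), round(p_value, 4))
```

A small p-value suggests the two variables are not independent; it does not by itself show that one causes the other. For variables with more than two categories, a statistics library would be the more practical route.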
Data Dictionary
Data Field Name Example Description
ID 1 Self-explanatory.
Program.Code HD This is a very granular code that describes where the trip went and what it did. HN, for
instance, is a history program that runs in New York.
From.Grade 8 This is the lowest grade in school of a participant on that program.
To.Grade 8 This is the highest grade in school of a participant on that program.
Group.State IN This is the two-letter designator for the state in which the originating school is located.
OTHER stands for rare geographies that appear in the data only once.
Is.Non.Annual. 1 1/0 indicating if the group from this school typically skips a year in between programs.
These will rarely repeat the very next year.
Days 3 The number of days the group was on the program and with one of the instructors.
Travel.Type A Mode of travel from the originating school location to the starting location of the program
(A = Air, B = Bus, T = Train).
Departure.Date 19/02/2011 The date that the group left its originating school.
Return.Date 21/02/2011 The date the group returned to its originating school.
Deposit.Date 20/10/2010 The date by which registrants are supposed to have at least an initial deposit in prior to
departure. The time in the school year when certain events occur can be important; for
instance, there are no deposit dates in the summer since no one would be around to act on
them.
Special.Pay NA The most important of these are school accounts (SA). That means that, contrary to the usual
practice, the teacher collects all of the money and then remits it in bulk to STC. The normal
arrangement is STC handling all of the cash collection from parents/students.
Tuition 1174 This is the price it costs each full-paying participant (FPP) to go on the program. West-coast
air trips are more expensive per person than midwestern bus groups.
FRP.Active 72 FRP is the full refund program. This is the number of FPPs on the trip who bought trip-
cancellation insurance.
FRP.Cancelled 13 This is the number of FPPs on the trip who bought trip-cancellation insurance, but then
cancelled it.
FRP.Take.up.percent. 0.6857 This is the percentage of the FPPs who bought the FRP and ended up paying for it.
Early.RPL 02/03/2010 This is the date that the first communication went out to the group. Often this can be 12 to
18 months before the trip actually departed.
Latest.RPL 10/08/2010 This is the date that the last communication inviting people to join the group went out. Often
this can be 6 to 9 months before the trip actually departed.
Cancelled.Pax 15 This is the number of passengers who signed up with a $100 deposit but then cancelled
before the group departed.
Total.Discount.Pax 7 This is the total number of extra passengers who went along without paying full price (or
typically anything). These would be the chaperones and the teachers.
Initial.System.Date 02/03/2010 This is the date when the teacher first agreed to get this trip organized. It is typically the
earliest of the dates relative to group activities.
Poverty.Code A Poverty code for the area in which the originating school (and by extension, most of the
parents who will be paying for the trip) resides based on estimated percentage below the
poverty line. A is 0 to 5.9, B is 6 to 15.9, C is 16 to 30.9, D is 31 or more, E is unclassified,
Space if DISTCLASS = U (Supervisory Union).
Region Other This is a larger aggregation of state areas. Some large states, like California, are their own
region. Others are combined.
CRM.Segment 1 This is a type of school code used in the customer-relationship-management (CRM) system
to describe the school. The codes are numbered 1–11 but are in no particular order; proprietary,
but it is a designation of a customer type that may be helpful.
School.Type PUBLIC Public or private.
Parent.Meeting.Flag 1 1/0 indicating whether a parent meeting was held. These are typically strong indicators of
parent engagement and of a teacher who understands that these can be important to
successfully organizing one of these out-of-school programs.
MDR.Low.Grade 7 This is the lowest grade in the originating school.
MDR.High.Grade 8 This is the highest grade in the originating school.
Total.School.Enrollment 955 This is the total enrollment of the school (to differentiate big schools from little ones).
Income.Level P Like poverty code, an indication of ability of parents to pay for these programs. A is lowest,
Q is highest, Z is unclassified.
EZ.Pay.Take.Up.Rate 0.2286 This is a % of the FPPs that sign up for an automatic bank draft installment plan.
School.Sponsor 0 This is an indication (1/0) of whether or not the school is officially sponsoring the trip.
Mostly, though these programs draw from the same school, they are typically run
independently.
SPR.Product.Type East Coast A high level of aggregation of the very granular tour types.
SPR.New.Existing EXISTING EXISTING means that the group has traveled with STC before—most often the year before.
NEW, with few exceptions, means that the school has never traveled before with STC.
FPP 105 This is the actual number of FPPs who went on the trip.
Total.Pax 112 This is the actual number of total passengers (including chaperones and teachers) who went
on the trip.
SPR.Group.Revenue 125735.4 This is the total amount paid for all of the participants to go on the program from that group.
NumberOfMeetingswithParents 0 Number of meetings with parents prior to the trip.
FirstMeeting 18/11/2010 The date of the first meeting with parents (NA if none held).
LastMeeting 28/11/2010 The date of the last meeting with parents (NA if none held, may be same as the first meeting
if only one meeting was held).
DifferenceTraveltoFirstMeeting 93 The number of days from the first parent meeting to travel date.
DifferenceTraveltoLastMeeting 103 The number of days from the last parent meeting to travel date.
SchoolGradeTypeLow Elementary The lowest grade type in the school.
SchoolGradeTypeHigh Elementary The highest grade type in the school.
SchoolGradeType Elementary->Elementary Combination of the above denoting the type of school.
DepartureMonth January Month of departure.
GroupGradeTypeLow K The lowest grade type in the group that travels.
GroupGradeTypeHigh Elementary The highest grade type in the group that travels.
GroupGradeType K->Elementary Combination of the above denoting the type of the group that travels.
MajorProgramCode H Aggregation of the granular program code; the first letter of the program code.
SingleGradeTripFlag 1 Indicator for the trip taken by a group comprising students from the same grade.
FPP.to.School.enrollment 0.06364617 The ratio of FPP to school enrollment.
FPP.to.PAX 0.93650794 The ratio of FPP to total PAX on the trip.
Num.of.Non_FPP.PAX 4 The number of PAX who are not FPP.
SchoolSizeIndicator L A label for the size of the school (S, M, L, S-M, M-L), by quintiles of sizes.
Retained.in.2012. 1 THIS IS THE 1/0 SUCCESS METRIC WE ARE TRYING TO PREDICT—DID THE
GROUP ACTUALLY RETURN THE NEXT YEAR?
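The date fields in the dictionary above are written day-first (dd/mm/yyyy), and several other fields can be derived from them. The sketch below, using only the example values shown in the dictionary, checks two such derivations. The helper names are hypothetical, and counting Days inclusively (both the departure and return day) is an assumption that happens to match the example values.

```python
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y"  # day-first, as in the dictionary examples


def parse_date(text):
    """Parse a dd/mm/yyyy string; 'NA' or an empty string becomes None."""
    if text in ("", "NA"):
        return None
    return datetime.strptime(text, DATE_FORMAT).date()


def days_between(start, end):
    """Whole days from start to end, or None if either date is missing."""
    if start is None or end is None:
        return None
    return (end - start).days


departure = parse_date("19/02/2011")      # Departure.Date example
return_date = parse_date("21/02/2011")    # Return.Date example
first_meeting = parse_date("18/11/2010")  # FirstMeeting example

# Assumption: Days counts both the departure day and the return day.
trip_days = days_between(departure, return_date) + 1
meeting_lead = days_between(first_meeting, departure)

print(trip_days, meeting_lead)
```

With these example values the results agree with the Days (3) and DifferenceTraveltoFirstMeeting (93) entries. Parsing the dates with an explicit format avoids the month/day ambiguity that automatic date detection can introduce.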
Sample of NPS and Group-Formation Data, and Corresponding Data Dictionary
ID  NPS 2011  NPS 2010  NPS 2009  NPS 2008  >= 3 FPP Date  >= 10 FPP Date  >= 20 FPP Date  >= 35 FPP Date
1 10 10 10 10 6/6/2010 6/18/2010 8/17/2010 8/30/2010
2 9 10 10 12/15/2009 1/20/2010 5/31/2010
3 10 10 6/9/2010 6/9/2010 10/26/2010
4 10 10 1/6/2011 1/6/2011
5 10 10 5/13/2010 5/24/2010 5/27/2010 6/1/2010
Data Field Name Example Description
NPS Score - 2011 10 This is the answer given to the traditional net promoter score (NPS) question: “How likely
are you to recommend STC to a colleague?” 10 being “extremely likely” and 1 being
“not at all likely.”
NPS Score - 2010 10 This is the answer given to the traditional NPS question in 2010, if any.
NPS Score - 2009 10 This is the answer given to the traditional NPS question in 2009, if any.
NPS Score - 2008 8 This is the answer given to the traditional NPS question in 2008, if any.
>= 3 FPP Date 22/03/2010 This is the date on which at least three full-paying participants (FPPs) had been registered
for the group. It can be subtracted from the departure date to see how early the group
started to form, whether it was in the spring or the fall before the trip, etc.
>= 10 FPP Date 02/06/2010 This is the date when the number of registrants in the group exceeded 10. This is often
viewed as the minimum amount (“critical mass”) for a group to successfully travel.
Groups this small will be combined with other smaller groups to make a more
efficient program in the field.
>= 20 FPP Date 18/05/2010 This is the date when the number of registrants in the group exceeded 20. This is often
viewed as a healthy group size. Groups this small will still be combined with other smaller
groups to make a more efficient program in the field.
>= 35 FPP Date 26/05/2010 This is the date when the number of registrants in the group exceeded 35. This is often
viewed as a large group. Some groups can get up to 200 FPPs, but the general gist is that
groups of 35+ retain at a much higher rate than smaller groups.
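One possible way to turn the group-formation dates into model inputs, while handling milestones a group never reached, is sketched below. The feature names are hypothetical, the dates are assumed to be month-first (m/d/yyyy) as in the sample rows rather than day-first as in the dictionary examples, and an empty string is assumed to mark a milestone that was not reached.

```python
from datetime import datetime

MILESTONES = (3, 10, 20, 35)  # FPP thresholds tracked by the dictionary


def parse_us_date(text):
    """Parse an m/d/yyyy string; an empty string becomes None."""
    return datetime.strptime(text, "%m/%d/%Y").date() if text else None


def formation_features(milestone_dates):
    """Build reached-flags and days-since-first-milestone features.

    `milestone_dates` maps an FPP threshold to a date string ('' if missing).
    """
    parsed = {m: parse_us_date(milestone_dates.get(m, "")) for m in MILESTONES}
    start = parsed[3]
    features = {}
    for m in MILESTONES:
        reached = parsed[m] is not None
        features[f"reached_{m}"] = int(reached)
        # Days elapsed since the >= 3 FPP date, if both dates are known.
        features[f"days_to_{m}"] = (
            (parsed[m] - start).days if reached and start is not None else None
        )
    return features


# Dates from the first sample row (ID 1).
group_1 = formation_features(
    {3: "6/6/2010", 10: "6/18/2010", 20: "8/17/2010", 35: "8/30/2010"}
)
# Hypothetical group that never reached 20 or 35 FPPs.
small_group = formation_features({3: "1/6/2011", 10: "1/6/2011"})

print(group_1["days_to_35"], small_group["reached_35"])
```

Encoding a missing milestone as an explicit 0 flag, rather than dropping the row, keeps small groups in the training data, which matters if group size is itself related to retention.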