Data Preparation for Analytics Using SAS®
ISBN 978-1-59994-047-2
For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission
of the publisher, SAS Institute Inc.
For a Web download or e-book: Your use of this publication shall be governed by the terms established by the
vendor at the time you acquire this publication.
U.S. Government Restricted Rights Notice: Use, duplication, or disclosure of this software and related
documentation by the U.S. government is subject to the Agreement with SAS Institute and the restrictions set forth in
FAR 52.227-19, Commercial Computer Software-Restricted Rights (June 1987).
SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.
Other brand and product names are registered trademarks or trademarks of their respective companies.
To my family
Preface
View analytic data preparation in light of its business environment and consider the
underlying business questions in data preparation
Learn about data models and relevant analytic data structures such as the one-row-per-
subject data mart, the multiple-rows-per-subject data mart, and the longitudinal data mart
Use powerful SAS macros to change between the various data mart structures
Consider the specifics of predictive modeling for data mart creation
Learn how to create meaningful derived variables for all data mart types
Illustrate how to create a one-row-per-subject data mart from various data sources and
data structures
Learn the concepts and considerations for data preparation for time series analysis
Illustrate how scoring can be done with various SAS procedures and with SAS Enterprise
Miner
Learn about the power of SAS for analytic data preparation and the data mart
requirements of various SAS procedures
Benefit from many do’s and don’ts and from practical examples for data mart building
In this part, we deal with the definition of the business questions and with various considerations
and features of analytic business questions. We investigate different points of view from IT
people, business people, and statisticians on analytic data preparation and also take a look at
business-relevant properties of data sources.
Business people who are the consumers of analytic results rather than being involved in the data
preparation and analysis process benefit from these chapters by getting a deeper understanding of
the reasons for certain requirements and difficulties. Also, IT people benefit from this part of the
book by learning the reasoning behind the analyst's data requirements.
Whereas business people can benefit from the concepts in this part of the book, the chapters
mainly address IT people and analysts by showing the data preparation concepts for analytics.
Analysts will probably find in these chapters the explicit formulation and rationale of data
structures, which they use intuitively in their daily work.
The aim of this part is to help the analyst and SAS programmer to speed up his or her work by
providing SAS code examples, macros, and tips and tricks. It will also help the reader to learn
about methods to create powerful derived variables that are essential for a meaningful analysis.
In this part we put together what we learned in the previous parts by preparing data for special
business questions. These case studies present the source data and their data model and show all
of the SAS code needed to create an analytic data mart with relevant derived variables.
Appendixes
There are three appendixes: two that are specific to SAS, and a third that provides a programming
alternative to PROC TRANSPOSE for large data sets. These appendixes
illustrate the advantages of SAS for data preparation and explore the data structure requirements of
the most common SAS procedures.
Part 1 is of general interest. Parts 2, 3, and 5 have a strong relationship to each other by
dealing with data preparation from a conceptual point of view in Part 2, a SAS programming point
of view in Part 3, and a case study point of view in Part 5.
The relationships among chapters are shown in more detail in the next section and diagram.
General
In the next diagram, we use the same structure as shown in the first diagram. Additionally, the
chapters of each part are shown. The relationships among Parts 2, 3, and 5 are illustrated. The
following relationships among chapters exist.
Predictive Modeling
Predictive modeling is covered in Chapters 12, 20, and 25. Chapter 12 shows general
considerations for predictive modeling. Chapter 20 gives code examples for some types of derived
variables for predictive modeling. The case study in Chapter 25 also deals with aspects of
predictive modeling, such as time windows and snapshot dates.
Analytics is primarily defined in the scope of this book as all analyses that involve advanced
analytical methods, e.g., regression analysis, clustering methods, survival analysis, decision trees,
neural networks, or time series forecasting. Analytics can also involve less sophisticated methods
such as comparison of means or the graphical analysis of distributions or courses over time. There
is no sharp distinction between analytic and non-analytic. Data prepared for regression analysis
can also be used to perform a descriptive profiling of various variables. The focus of the data
preparation methods in this book is primarily in the area of advanced analytical methods.
The motivation for this book was the author’s conviction that this topic is essential for successful
analytics and that despite a large number of analytic and data mining books, the topic of data
preparation is strongly underrepresented.
The focus of this book is not solely on macros and SAS code for derived variables and
aggregations. The author is convinced that data preparation has to be seen in the scope of its
underlying business questions. Because of the importance of the business environment for data
preparation, a separate book part has been dedicated to this topic. Also considerations about the
resulting data mart structure are essential before coding for data preparation can start.
Therefore, this book is not only a coding book and programmer's guide, but it also explains the
rationale for certain steps from a business and data modeling point of view. This understanding is,
in the author’s opinion, critical for successful data preparation.
The scope of this book can be defined as all steps that are necessary to derive a rectangular data
table from various source systems by extracting, transposing, and aggregating data, and creating
relevant derived variables. This resulting table can be used for analysis in SAS Enterprise Miner
or SAS analytic procedures.
In SAS terms, this book covers all data preparation tasks that use data extraction and data
transformation with SAS DATA steps and procedure steps. The resulting SAS data set can be used
in the DATA= option of a SAS procedure or in the Input Data Source node of SAS Enterprise
Miner.
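To make this hand-over point concrete, the following minimal sketch assumes a hypothetical one-row-per-subject table WORK.CUSTOMER_MART with invented variable names. The only point it illustrates is that the finished rectangular table is referenced through the DATA= option of a SAS procedure.

data work.customer_mart;   /* hypothetical one-row-per-subject analysis table */
   input cust_id age tenure_months n_accounts purchase;
   datalines;
1 34  12 2 1
2 51  48 1 0
3 29   6 3 1
4 62 120 2 0
;
run;

/* The rectangular table is handed over to a procedure via the DATA= option */
proc means data=work.customer_mart mean std;
   var age tenure_months n_accounts;
run;

The same table could be registered as the input data source in SAS Enterprise Miner without further restructuring.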
Data preparation and analytical modeling are not necessarily consecutive, distinct steps. The
processes in analytic data preparation are usually dynamic and iterative. During the analysis it
might turn out that different aggregations or a certain set of derived variables might make more
sense. The classification of interval variables, especially in predictive modeling, is a topic that
also strongly involves analytic methods.
The classification of interval variables from an analytic point of view and the process of variable
selection based on correlations and association measures are already covered in a lot of data
mining and analytics books. For the scope of this book, these two topics are considered part of the
analysis itself and not as data preparation topics.
The topic of data quality checks and data cleansing would fill a book in itself. For this topic, see
other SAS Press titles.
The author tried to keep this book as practical as possible by including a lot of macros, SAS code
examples, and tips and tricks. Obviously a large number of additional programming tricks that are
not covered in this book exist in SAS and in SAS Press titles and SAS documentation. The
objective of this book, however, is data preparation for analytics; it would go beyond the scope of
this book to include voluminous programming tips and tricks.
This book is focused on data preparation for analyses that require a rectangular data table, with
observations in the rows and attributes in the columns. These data mart structures are needed by
most procedures in SAS/STAT and SAS/ETS as well as SAS Enterprise Miner. Data structures
such as triangular data, which are used in PROC MDS in SAS/STAT, are not featured in this
book.
Note that all SAS macros and code examples in this book can be downloaded from the companion
Web site at http://support.sas.com/publishing/bbu/companion_site/60502.html.
Besides the importance of analytical skills, however, the need for good and relevant data for the
business questions is also essential.
Another topic that this book addresses is adequate preparation of the data. This is an important
success factor for analysis. Even if the right data sources are available and can be accessed, the
appropriate preparation is an essential prerequisite for a good model.
The author’s experience in numerous projects leads him to conclude that a missing or improperly
created derived variable or aggregation cannot be substituted for by a clever parameterization of
the model. It is the goal of the following chapters to help the reader optimally prepare the data for
analytics.
Example
In predictive modeling at the customer level, a data mart with one row per customer is needed.
Transactional data such as account transactions have to be aggregated to a one-row-per-customer
structure. This aggregation step is crucial in data preparation because we want to retrieve as much
information about the customer's transaction behavior as possible from the data, e.g., by creating
trend indicators or concentration measures, as the sketch below illustrates.
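As a minimal sketch of such an aggregation (the table and variable names WORK.TRANSACTIONS, CUST_ID, and TX_AMOUNT are hypothetical and not taken from the book), a simple PROC SQL step can condense transactional data to one row per customer. Trend indicators and concentration measures would require additional derived variables of the kind discussed in later chapters.

proc sql;
   create table work.cust_aggr as
   select cust_id,
          count(*)        as tx_count,   /* number of transactions          */
          sum(tx_amount)  as tx_sum,     /* total transaction amount        */
          mean(tx_amount) as tx_mean,    /* average transaction amount      */
          std(tx_amount)  as tx_std      /* variability of the transactions */
   from work.transactions
   group by cust_id;
quit;

The resulting table WORK.CUST_AGGR has one row per customer and can be joined to the one-row-per-customer data mart.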
This Book
This book addresses the data preparation process for analytics in the light of the underlying
business questions. In Part 1 we will start to consider the relation of the business questions to data
preparation and will move in Part 2 to data structures and data models that form the underlying
basis of an analysis data mart. In Part 3 we will fill the data mart with content, restructure data,
and create derived variables. Sampling, scoring, and automation are the topics of Part 4. Finally in
Part 5 we consider a number of data preparation case studies.
In the following chapters, all of these terms may be used. However, they mean the same thing in
the context of the book. When we use data mart, analysis table, or data set, we mean a single
rectangular table that holds the data for the analysis, where the observations are
represented by rows and the attributes are represented by columns.
Since 1999, he has worked as a consultant for SAS Austria, where he is involved in numerous
analytic projects in the CRM, Basel II, and demand forecasting areas across various industries. He
has product manager responsibility for the SAS analytic products and solutions in Austria and has
designed a number of analytic data marts as well as concepts for analytic
projects and data preparation.
Gerhard lives in Vienna, Austria. He is married and is the father of three sons. Besides working
for SAS and with SAS software, he likes to spend time with his family and to be out in nature,
especially sailing on a lake.
Acknowledgments
Many people have contributed to the completion and to the content of this book in different ways.
Gerhard, Christian, Wolfgang, and Michael were the reason to start and to keep working for SAS.
Peter supported my skills by allowing me to work in both disciplines, statistics and informatics.
Andreas, Bertram, Christine, Franz, Josef, Hendrik, Martin, Nico, Rami, and a lot of unnamed
people took the time to discuss ideas for my book and to talk about questions in data preparation
in general. Many SAS customers and SAS users have requested solutions from me for their data
preparation tasks, which resulted in some of the ideas shown in this book.
Julie Platt, Donna Faircloth, and the team from SAS Press supported me throughout the process.
The reviewers gave valuable feedback on the content of the book.
Part 1
Data Preparation: Business Point of View
Introduction
In Part 1 of this book we will deal with the business point of view of data preparation. This part is free
from code and data structures. It contains only the nontechnical components of data preparation. Without
reducing the importance of data structures, table merges, and the creation of derived variables, it is
fundamental to understand that behind data preparation there always stands a business question.
This part of the book also helps to categorize business questions and to show their impact on data sources
and data models.
In Chapter 1 – Analytic Business Questions, we will look at various business questions and will see what
they all have in common—that data are needed for their analysis. We will also go through the analysis
process and will learn that data preparation is influenced by the business question and the corresponding
analysis method.
We will see that a business question itself poses many other questions that need to be discussed in order
to prepare and perform the analysis.
In Chapter 2 – Characteristics of Analytic Business Questions, we will show the properties of a business
question. We will see, for example, that data preparation is affected by the analysis paradigm, whether it
is classical statistics or data mining. We will look at the data requirements, and we will examine whether
the analysis is a repeated analysis (e.g., model recalibration over time), or a scoring analysis.
In Chapter 3 – Characteristics of Data Sources, we will go into detail about the properties of data
sources from a business standpoint. We will focus on properties such as the periodic availability of data
and the advantages of a data warehouse as an input data source.
In Chapter 4 – Different Points of View on Analytic Data Preparation, we will define three roles that are
frequently encountered in the analysis process. We will look at the business, analytical, and technical
points of view. The different backgrounds and objectives of these roles can lead to conflicts in a project,
but they can also provide many benefits for a project if different points of view are exchanged and
knowledge is shared.
The intention of Part 1 is to illustrate the business rationale in the process of analytic data
preparation before the later parts introduce data preparation in general (i.e., from a data modeling
and coding point of view). We should keep in mind that all data models, features, derived variables, and tips and
tricks we will encounter in this book are provided in order to solve a business problem and are to be seen
in the context of the points we discuss in the first four chapters.
Chapter 1
Analytic Business Questions
1.1 Introduction 3
1.2 The Term Business Question 4
1.3 Examples of Analytic Business Questions 4
1.4 The Analysis Process 5
1.5 Challenging an Analytic Business Question 6
1.6 Business Point of View Needed 9
1.1 Introduction
Before we go into the details of analytic data preparation, we will consider analytics from a
business point of view. In this chapter we will examine three business questions that deal with
analytics in the background. This chapter deals with situations in which analytics is applied and
consequently data are prepared. We will list business questions from various industries and will
provide a definition of the business point of view.
Answering business questions is done through a process we call the analysis process. The steps
and features of this analysis process will be described. The fact that the analyst must not shy away
from acquiring domain-specific knowledge will be discussed, as well as the fact that even a simple
business question needs to be examined for more details.
For our purposes, business question also describes the content and rationale of the question for
which an analysis is done.
The term business does not primarily imply that a monetary benefit needs to be linked to a
business question. In our context, business knowledge can also be referred to as domain expertise.
With business questions we also want to separate the rationale for the analysis from
methodological questions such as the selection of the statistical method.
Which risk factors have a strong influence on the occurrence of lung cancer?
What is the average time until the occurrence of a disease for breast cancer patients?
Is the success rate for the verum treatment significantly higher than that of the placebo
treatment (given e.g., alpha = 0.05)?
How many overnight stays of tourists in Vienna hotels on a daily basis can we expect in
the next six months?
How will economic parameters like GDP or interest rates behave over time?
How do factors such as time, branch, promotion, and price influence the sale of a soft
drink?
What is the daily cash demand for automatic teller machines for banks in urban areas?
How many copies of a new movie will be produced per sales region?
Which products will be offered to which customers in order to maximize customer
satisfaction and profitability?
Which customers have a high cancellation risk in the next month?
Which customer properties are correlated with premature contract cancellation?
Does a production process for hard disk drives fit the quality requirements over time?
What is the expected customer value for the next 60 months?
What is the payback probability of loan customers?
How can customers be segmented based on their purchase behavior?
Which products are being bought together frequently?
What are the most frequent paths that Web site visitors take on the way through a Web
site (clickstream analysis)?
How can documents be clustered based on the frequency of certain words and word
combinations?
Which profiles can be derived from psychological tests and surveys?
How long is the remaining customer lifetime in order to segment the customers?
What is the efficiency and influence of certain production parameters on the resultant
product quality?
Which properties of the whole population can be inferred from sample data?
Is the variability of the fill amount for liter bottles in a bottling plant in control?
How long is the average durability of electronic devices?
Which subset of properties of deciduous trees will be used for clustering?
How does the course of laboratory parameters evolve over time?
Is there a difference in the response behavior of customers in the test or control group?
Is there a visual relationship between age and cholesterol (scatter plot)?
This list shows that business questions can come from various industries and knowledge domains.
In Chapter 2 – Characteristics of Analytic Business Questions, we will show characteristics that
allow for a classification of these business questions.
General
The preceding examples of business questions are different because they come from different
industries or research disciplines, they involve different statistical methods, and they have
different data requirements, just to mention a few.
The preceding steps are not necessarily performed in sequential order. They can also be iterated,
or the process can skip certain steps.
The identification of data sources can result in the fact that the data needed for the
analysis are not available or can only be obtained with substantial effort. This might lead
to a redefinition of the business questions—note that we are dealing mostly with
observational data rather than with data that are specifically gathered for an analysis.
Currently unavailable but essential data sources might lead to a separate data-gathering
step outside the analysis process.
In many cases the process iterates between data preparation and analysis itself.
The interpretations of the results can lead to a refinement of the business questions. From
here the analysis process starts again. This can possibly lead to another data preparation
step if the detailed analysis cannot be performed on existing data.
In this book we will deal only with step 4, “Prepare the data.” We will not cover the selection
of the appropriate analysis method for a business question, the analysis itself, or its interpretation.
This process of questioning leads to a deeper understanding of the business question, which is
mandatory for a correct analysis and interpretation of the results. Even though analysis and
interpretation are not in the scope of this book, the process is very important for data preparation
as data history or data structures are influenced.
This shows that the analyst who is in charge of the data preparation has to become involved with
the business process. The analyst cannot ignore these questions as they are crucial for the data
basis and therefore to the success of the analysis.
If it is an existing product, does the campaign run under the same conditions as in the
past?
Can we assume that responders to the campaign in the past are representative of
buyers in this campaign?
In the case of a new product, which customer behavior in the past can we take to
derive information about expected behavior in this campaign?
Do you want to include customers that have canceled the product but have started to
use a more (or less) advanced product?
Do you also want to consider customers who did not themselves cancel but were
canceled by our company?
Can you calculate a survival curve for the time until relapse
per treatment group for melanoma patients?
In clinical research statisticians cooperate with health professionals in order to plan and evaluate
clinical trials. The following questions can be discussed.
We see that our three example questions seem to be clear at first. However, there are
many points that must be clarified in order to have the relevant details for data preparation and project
planning.
It is important for business people to understand that these kinds of questions are not posed in
order to waste their time, but to allow for better data preparation and analysis.
In Part 3 and Part 4 of this book we will emphasize that coding plays an important role in data
preparation for analytics.
Data have to be extracted from data sources, quality checked, cleaned,
joined, and transformed to different data structures.
Derived variables have to be calculated.
Coding itself is only part of the task. We want to emphasize that data preparation is more than a
coding task; there is also a conceptual part, where business considerations come into play. These
business considerations include the following:
In Chapter 4 – Different Points of View on Analytic Data Preparation, we will take a more
detailed look at different roles in the data preparation process and their objectives.
Chapter 2
Characteristics of Analytic Business Questions
2.1 Introduction 12
2.2 Analysis Complexity: Real Analytic or Reporting? 12
2.3 Analysis Paradigm: Statistics or Data Mining? 13
2.4 Data Preparation Paradigm: As Much Data As Possible or Business
Knowledge First? 14
2.5 Analysis Method: Supervised or Unsupervised? 15
2.6 Scoring Needed: Yes/No? 16
2.7 Periodicity of Analysis: One-Shot Analysis or Re-run Analysis? 17
2.8 Need for Historic Data: Yes/No? 18
2.9 Data Structure: One-Row-per-Subject or Multiple-Rows-per-Subject? 19
2.10 Complexity of the Analysis Team 19
2.11 Conclusion 19
2.1 Introduction
The business questions that are considered in this book need data in order to be answered. The
need for data is therefore a common theme. There are also certain characteristics that allow
us to differentiate among these business questions.
In this chapter we will look at the following characteristics and will also see that some of them
have an impact on how the data will be prepared in certain cases. We will refer to these
characteristics from time to time in other chapters.
There are, however, many cases where data are prepared for analyses that do not use complex
analytical methods. Such analyses are often referred to as reporting or descriptive analyses.
Examples include the calculation of simple measures such as means, sums,
differences, or frequencies; result tables with univariate and multivariate frequencies;
and descriptive statistics and graphical representations in the form of bar charts, histograms, and
line plots.
line plot per patient; the course of laboratory parameters over time
crosstabulation of 4,000 customers for the categories test/control group and yes/no
response
scatter plots to visualize the correlation between age and cholesterol
We want to emphasize here that analytic data preparation is not exclusively for complex analytic
methods, but is also relevant for reporting analyses. In practice, analytics and reporting are not an
“either/or” proposition but practically an “and” proposition.
Descriptive or exploratory analysis is performed in order to get knowledge of the data for
further analysis.
Analytic results are often presented in combination with descriptive statistics, such as a
2x2 cross table with the statistics for the chi-squared test for independence (see the sketch
following this list).
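As a small illustration of such a combination (the table WORK.CAMPAIGN and its variables GROUP and RESPONSE are hypothetical), PROC FREQ produces both the descriptive cross table and the chi-squared test for independence in one step:

proc freq data=work.campaign;
   tables group*response / chisq;   /* 2x2 cross table plus chi-squared test */
run;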
Therefore, the considerations and the data preparation examples that are mentioned in the
following chapters are not only the basis for elaborate analytic methods, but also for reporting and
descriptive methods.
In a controlled clinical trial, which risk factors have a strong influence on the occurrence
of lung cancer?
In marketing, which customers have a high cancellation risk in the next month? This
analysis is sometimes called churn analysis.
Both business questions can be solved using the same analytic method, such as logistic regression.
In the data table, one column will hold the variable that has to be predicted. The difference
between these two questions is the analysis paradigm.
The analysis of the risk factors for lung cancer is usually performed using classic
statistical methods and follows a predefined analysis scheme.
There are usually between 10 and 20 risk factors or patient attributes that have been
collected in the clinical trial.
The influence of these risk factors is evaluated by looking at the p-values of their
regression coefficients.
Some of the input variables are not necessarily variables that serve as predictors, but
can be covariates that absorb the effect of certain patient properties.
In this case we analyze in the statistical paradigm: a list of potential explanatory
variables is defined a priori, their influence is statistically measured in the form of
p-values, and the evaluation of the results is based on predefined guidelines such
as the definition of an alpha error.
The analysis of the premature cancellation of customers is usually performed by the data
mining approach.
Data for the analysis table is gathered from many different data sources. These
different data sources can have a one-to-one relationship with the customer but can
also have repeated observations per customer over time.
From these data, an analysis table with one row per customer is built, where various
forms of aggregations of repeated observations per customer are applied.
The emphasis is on building an analysis table with many derived customer attributes
that might influence whether a customer cancels the contract or not. Very often these
analysis tables have more than 100 attributes.
In the analysis, emphasis is not on p-values but rather on lift values of the predicted
results. The model is evaluated by how many hits—successful predictions of churn—are
achieved if the 2% of the customer base with the highest predicted
probability from the logistic regression is contacted.
We see that both examples incorporate the same statistical method, but the analysis paradigm is
different. This also leads to different scopes of the analysis tables. The analysis table for the lung
cancer example has a predefined number of variables. In the marketing example, the list of input
variables includes cleverly derived variables that consolidate a certain product usage behavior
over time from a few separate variables.
Note that the difference between statistics and data mining is not primarily based on the analytic
method that we use but on the analysis paradigm. We have either a clear analysis question with a
defined set of analysis data, or we have a more exploratory approach to discover relationships.
Concerning the number of variables and the philosophy of how analytic tables are built, two
paradigms exist:
In the as much data as possible paradigm, the classic data mining idea is followed. We
try to get as much data as possible, combine them, and try to find relationships in the
data. All data that seem to have some relationship to the business question are included.
A contrast is in the business knowledge first paradigm. The problem is approached from a
business point of view. It is evaluated which data can best be used to explain certain
subject behavior. For example, “What causes a subject to perform a certain action?” The
aim is to determine with business knowledge which attributes might have the most
influence on the business questions.
In the case of the modeling of the contract cancellation of a telecommunications customer (also
called churn), the following predominant variables can be explained from a business point of
view:
The number of days until the obligation (contract binding) period ends.
Interpretation: The sooner the client is free and does not have to pay penalties for contract
canceling, the more he will be at risk to cancel the contract.
The number of loyalty points the customer has collected and not cashed.
Interpretation: The more points a customer has collected, the more he will lose when he
cancels his contract, as he can no longer be rewarded. Additionally, a high number of
collected loyalty points indicates that the customer has a long relationship with the company
or has very high usage, which lets him collect many points.
The importance and explanatory power of these two variables usually exceed those of sets of
highly complex derived variables describing usage patterns over time.
Competing Paradigms?
Note that these two paradigms do not necessarily exclude each other. On the one hand, it makes
sense to try to get as much data as possible for the analysis in order to answer a business question.
The creative and exploratory part of a data mining project should not be ended prematurely;
otherwise, we might leave out potentially useful data and miss important findings.
The fact that we select input characteristics from a business point of view does not mean that there
is no place for clever data preparation and meaningful derived variables. On the other hand, not
every technically possible derived variable needs to be built into the analysis paradigm.
We also need to bear in mind the necessary resources, such as data allocation and extraction in the
source systems, data loading times, disk space for data storage, analysis time, business
coordination, and selection time to separate useful information from non-useful information. Let
us not forget the general pressure for success when a large number of resources have been
invested in a project.
In supervised analyses a target variable is present that defines the levels of possible outcomes.
This target variable can indicate an event such as the purchase of a certain product, or it can
represent a value or the size of an insurance claim. In the analysis, the relationship between the
input variables and the target variable is analyzed and described in a certain set of rules.
Supervised analysis is often referred to as predictive modeling. Examples include the following:
daily prediction for the next six months of the overnight stays of tourists in Vienna hotels
prediction of the payback probability of loan customers
In unsupervised analyses, no target variable is present. In this type of analysis, subjects that
appear similar to each other, that are associated with each other, or that show the same pattern are
grouped together. There is no target variable that can be used for model training. One instance of
unsupervised modeling is segmentation based on clustering. Examples include the following:
General
Analyses can be categorized by whether they create scoring rules or not. In any case, each
analysis creates analysis results. These results can include whether a hypothesis will be rejected or
not, the influence of certain input parameters in predictive modeling, or a cluster identifier and the
cluster properties in clustering. These results are the basis for further decision making and
increased business knowledge.
It is also possible that the result of an analysis is additionally represented by a set of explicit rules.
This set of rules is also called the scoring rules or scorecard. In the case of a decision tree, the
scoring rules are built of IF THEN/ELSE clauses, such as “If the customer has more than two
accounts AND his age is > 40 THEN his purchase probability is 0.28.” In the case of a regression,
the scoring rules are a mathematical formula that is based on the regression coefficients and that
provides an estimated or predicted value for the target variable.
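The following sketch shows what such score code might look like when written as a SAS DATA step. The rules, variable names, and coefficients are invented for illustration only and do not come from any model in this book.

data work.scored;
   set work.scoring_table;     /* hypothetical table of subjects to be scored */

   /* Score code in decision-tree form: explicit IF-THEN/ELSE rules */
   if n_accounts > 2 and age > 40 then purchase_prob_tree = 0.28;
   else purchase_prob_tree = 0.10;

   /* Score code in regression form: a formula based on (hypothetical) */
   /* logistic regression coefficients                                 */
   xbeta = -2.5 + 0.03*age + 0.45*n_accounts;
   purchase_prob_reg = 1 / (1 + exp(-xbeta));
run;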
The types and usage of scoring can be divided into two main groups:
Scoring of new analysis subjects. New means that those analysis subjects were not
available during training of the model in the analysis, such as new customers or new loan
applicants that are being scored with the scorecard. These subjects are often called
prospects.
Scoring of analysis subjects at a later time. This includes new and old analysis subjects
whose scores are updated periodically (e.g., weekly) in order to consider the most recent
data. The subjects are said to have their scores recalibrated with the freshest data
available.
Example
Assume we perform predictive modeling in order to predict the purchase event for a certain
product. This analysis is performed based on historic purchase data, and the purchase event is
explained by various customer attributes. The result of this analysis is a probability for the
purchase event for each customer in the analysis table.
Additionally, the calculation rule for the purchase probability can be output as a scoring rule. In
logistic regression, this scoring rule is based on the regression coefficients. The representation of
the scoring rules in code is called the score code. The score code can be used to apply the rules to
another data table, to other analysis subjects, or to the same or other analysis subjects at a later
time. Applying the score code results in an additional column containing the purchase probability.
Usually a scorecard for buying behavior is applied to the customer base at certain time intervals
(e.g., monthly) in order to allow the ranking of customers based on their purchase probability.
Scorecards and score code are frequently used in predictive modeling but also in clustering, where
the assignment of subjects to clusters is performed using the score code.
Other examples of scorecards are churn scorecards that calculate a probability that the customer
relationship is terminated during a certain period of time, credit scoring scorecards that calculate
the probability that a loan or credit can be paid back, scorecards that assign customers to behavior
segments, propensity-to-buy scorecards, or scorecards that predict the future customer value.
Data
Considering scoring, we can talk of an analysis table and a scoring table.
The analysis table holds data that are used to perform the analysis and to create the
results. The result of the analysis is the basis for the score code.
The scoring table is the table on which the scoring rules are applied. In predictive
modeling, the scoring table usually has no target variable, but a predicted value for the
target variable is created on the basis of the score code. During scoring, the score is
created as an additional column in the table.
We will deal with scoring in Chapter 13 – Scoring and Automation, and also come back to the
requirements for periodic data later in this chapter.
In biometrical analyses such as medical statistics, patient data are collected in a trial or
study. Data are entered into a table and passed to the statistician who performs the
appropriate statistical tests for certain hypotheses.
In market research, survey data are entered into a table. Derived variables are calculated
and the appropriate analysis, such as cluster analysis, factor analysis, or conjoint analysis,
is performed on that data.
One shot does not mean that the final analysis results are derived in only one analysis step. These
analyses might be complex in the sense that pre-analysis or exploratory analysis is needed where
properties of the data are evaluated. Analysis can also get more complex in the analysis process
because more detailed business questions on subgroups arise. However, one shot does mean that
the analysis and data preparation will most likely not be repeated in the future.
The analysis is re-run at a later time when data for more analysis subjects are available.
This case is frequently seen when analysis subjects enter the study or survey at different
times. Time after time, more data are available on the analysis subjects and the analysis
results are updated.
The analysis is re-run at a later time when more recent data for the analysis subjects are
available. This also includes the case where analyses are re-run at a later time to compare
the results over time, such as for new groups of subjects.
Because re-run analyses will be performed more than once, and on additional data rather than only on the
original data, all other data sources to which they will be applied need to have the same data structure as
the original data.
In the previous section we looked at scoring. With periodic scoring the analysis is not re-run, but
the results of an analysis are re-applied. However, to be able to apply the score code to other data,
these data need to be prepared and structured in the same way as the original data. We are, therefore,
in almost the same situation as with re-run analyses. We have the same data requirements, except that
we do not need to prepare a target variable as in the case of predictive modeling.
Other analyses, such as time series forecasting, require historic data because the course over time
needs to be learned for historic periods in order to forecast it for the future. This difference is
relevant because the extraction of historic data from data sources can be both resource- and time-
intensive.
In the analysis of a complex business question, usually a team of people works together on a
project. The involvement of several people also has an effect on the data preparation of a project.
The effect is not on the data content or data structure itself but on the necessary communication
and resulting project documentation.
If data preparation and analysis are done by different people, the transfer of details that are
gathered during data preparation to the analysis people is important. This transfer might include
the documentation of the meaning of missing values, the source systems, and their reliability and
data quality.
Note that the complexity of the analysis team is not necessarily directly linked to the business
question. It is also a resource decision in organizations as to how many people are involved in
analysis.
2.11 Conclusion
Some of the preceding characteristics are predetermined by certain business questions. Others
are defined by the analysis approach. The selection of the data preparation
paradigm—for example, whether all possible derived variables are built or only a few variables
that are derived from business knowledge are created—is decided by the analyst. The need for
historic data is inherent to the business problem and is in most cases mandatory.
This chapter also aimed to give more insight into the various characteristics that allow for
differentiating among business questions. These characteristics also help business
people gain a better understanding of the details the analyst has to consider in
formulating the best methodological approach to address the business objective under
consideration.
Chapter 3
Characteristics of Data Sources
3.1 Introduction 21
3.2 Operational or Dispositive Data Systems? 22
3.3 Data Requirement: Periodic Availability 24
3.4 Wording: Analysis Table or Analytic Data Mart? 25
3.5 Quality of Data Sources for Analytics 25
3.1 Introduction
After having discussed the characteristics of business questions in the previous chapter, we will
now go into more detail about the characteristics of data sources. In this chapter we will continue
with the business point of view of data and will discuss properties of data sources such as the
rationale of the source system and periodic data availability. We will also look at the
advantageous properties of a data warehouse.
Obviously, data sources have a lot of technical characteristics, which we will look at in Chapter 5
– The Origin of Data.
Sales force automation systems that help salespeople administer their offers, orders, and
contacts.
Call center systems that maintain customer databases where address changes, customer
complaints, or inquiries are handled.
Booking systems at hotels, airlines, or travel agencies that allow the entering of
reservations and bookings, and that have methods implemented so different agents are
not able to simultaneously book the same seat on the same airplane.
Legacy systems in banks that maintain customer and account data and process account
transactions.
Clinical databases, where data are entered and maintained in adherence to good clinical
practice regulations. These databases allow double data entry, data correctness or
plausibility verification, and maintain audit trails of data modifications.
Enterprise Resource Planning (ERP) systems perform tasks such as human resource
management, stock keeping, or accounting.
The following are important characteristics of these systems:
Because the underlying business processes are time critical, quick response times are a
key issue.
The focus is in most cases on a single transaction; a bank customer who wants to make a
few financial transactions or withdrawals, or an agent in the call center who needs the
data for a calling customer.
Having the current version of the data available, rather than providing time histories of
data value changes. This does not mean that these systems have no focus on historic
information at all; consider the immediate need to provide a transaction history for a customer’s
bank account. With customer data, for example, the focus is more on the current
version of the address for sending the bill than on providing information such
as where a customer lived 12 months ago. In many cases the decision to maintain a lot of
historic data is dropped in favor of improving the performance and response times of
these systems.
Sales reporting systems. These systems provide sales measures such as the number of
sold pieces, price, or turnover. These measures are made available to the user in the form
of tables, summaries, or charts, allowing subgroups by regions, product categories, or
others.
Customer data warehouses. With the goal of providing a 360° view of the customer, this
kind of system holds customer data from sources such as a call center database, billing
system, product usage database, contract database, or socio-demographics.
Data warehouse systems in general. These systems hold information from various data
sources of the enterprise or organization. A data warehouse is subject-oriented and
integrates data from those data sources by a data model. It is usually a read-only database
to the warehouse (information) consumer and is time-variant in the sense that its data
model is built to provide historic views (or snapshots) of the data.
The following are important characteristics of these systems:
Dispositive systems or data warehouse systems are mostly read-only. They have to
handle read operations over the entire table (full table scans) very quickly, whereas they
do not have to provide updates of single records.
Data warehouse systems handle time histories. They provide historic information so that
the information consumer can retrieve the status of the data at a certain historic point in
time. As previously mentioned, operational systems deal more with the actual version of
data.
For example, customers that canceled their contracts with the company two years ago
will no longer appear in the operational system. The customers and their related data,
however, will still be available in a data warehouse (with an indicator showing whether
the record is still current).
Data warehouses also track changes of subject attributes over time and allow the
creation of a snapshot of the data for a given date in the past.
Data warehouse environments provide aggregations of several measures in
predefined time intervals. Depending on the data warehouse design and the business
requirement, the time granularity is monthly, weekly, or daily. Also note that due to
the huge space requirements, the underlying detail records are very seldom
maintained in the data warehouse for a long time in online mode. Older versions of
the data might exist on tapes or other storage media.
Extending the time granularity. Billing systems create balances with each bill. The bill
cycles, however, will very often not coincide with calendar months. Therefore, sums
from the bill cannot be used to exactly describe the customer’s monthly usage. Data
warehouses store additional information beyond the billing data, such as usage by
calendar months, weeks, and days. The granularity itself depends on the data warehouse
design.
To summarize the preceding points: operational systems focus on quick response times for single
transactions and on the current version of the data, whereas dispositive systems are mostly
read-only, optimized for full table scans, and maintain time histories, historic snapshots, and
predefined aggregations of the data.
There are, however, cases where not all data are periodically available. This happens if different
data sources are used in the analysis that have different reasons for their existence. Data from
some sources, such as surveys and special analyses, are gathered only once (e.g., to perform a
market research analysis). These are examples of typical one-shot analysis.
Data that were gathered for a one-shot analysis should be included with care in a re-run analysis.
If it cannot be assured that the data can be provided in the same or similar quality and actuality at
a later time for re-running the analysis or scoring, then the analytic data mart will contain outdated
or irrelevant information.
When using data that were acquired from an external data provider, care has to be taken
to determine whether these data will be available in the future. The negotiation and contractual
status with the external provider have to be taken into account.
Surveys such as market research are performed on samples and not on the whole
customer base. The resulting variables are only available for a subset of the customer
base. Analysis results for the samples can hardly be reproduced or generalized to the
whole customer base.
Analyses with inherent dependencies that use the scores of other scorecards as input
variables. If model B depends on model A in the sense that the scores derived from
model A are input variables for model B, then model B can be applied only as long as
model A is used and periodically applied. The awareness of those dependencies is
important because a retraining of model A will force an immediate retraining of model B.
Analyses that are based on classifications that periodically change. If the model uses
input variables that are based on lookup tables or definitions that change over time, such
as price plans or tariffs, two problems can arise:
Changes in the definition of the classifications force a retraining of the model(s).
Care has to be taken that the current and valid version of the lookup table or
classification is used for analysis. Again we are faced with the case where the analyst
tries to gather as many variables as possible and overlooks the data maintenance effort
and responsibility for these data.
In data mining it has become usual to call this table a data mart. This name is not necessarily
correct, because a data mart is not always only one table; it can also be a set of tables that hold
data for a certain business domain. For example, tables of a data mart might need to be joined
together in order to have the appropriate format. However, data mart has become a synonym for
the table that holds the data for data mining.
In non-data mining areas, the term data mart is almost unknown. Names such as analysis table,
data table, or data matrix are common here and mean the same thing as data mart in the preceding
example, which is a table with data for analysis.
We have therefore tried to use the global name analysis table as often as possible. In some
chapters, the term data mart is used. If not separately specified, this term also means a table with
data for analysis.
Our definition of the final product of the data preparation process is that we have a single table
that can be used for analysis.
The following figure displays data inputs from a logical point of view:
Figure 3.1: Data inputs in the analytic data mart from a logical point of view
We see that there are two main sources where data for analysis originally reside:
Data in operational systems, which were described in the first section of this chapter.
Other data such as external data, survey data, ad-hoc data, lookup tables, and
categorizations that also reside in databases, spreadsheets, PC databases, or text files.
Data from operational systems or other data can either flow directly into the analytic data mart or
can be retrieved from a data warehouse.
Operational systems, in contrast, hold historic information—for example, accounts that were
opened seven years ago, or transactions that were made last year. However, historic data and
historic versions of data are not the same. An operational system may provide information about
accounts that were opened seven years ago, but will not necessarily provide information about
how the conditions on this account, such as the interest rate, looked seven, six, or five years ago.
The analyst must now decide whether to do without data sources that have not been loaded into the
data warehouse or to access these data sources directly. Technically, SAS is of great help in this
case, because external data or other additional tables for analysis can be quickly imported, data
can be cleansed and quality checked, and complex data structures can be decoded.
Given this, the analyst assumes all duties and responsibilities, including data quality checking,
lookup table maintenance, and interface definitions. He might encounter trouble with every
change in the operational system, including renaming variables, and redefining classifications or
interfaces to external data.
This is not problematic with one-shot analysis because once the data are extracted from whatever
system, the analysis can be performed and there is no need to reload the data at a later time.
However, if the analysis is a re-run analysis and analysis results need to be refreshed at a later
time, problems in the data interface can be encountered. This is also true for scoring at a later
time, because the scoring table needs to be rebuilt as well.
This does not mean that the analyst can always decide against including data directly from some
systems. In many cases this is necessary to access the relevant data for an analysis. However,
analysts should not work independently of IT departments and build their own analytic data
warehouses if they do not have the resources to maintain them.
Chapter 4
Different Points of View on Analytic Data
Preparation
4.1 Introduction 29
4.2 Simon, Daniele and Elias: Three Different Roles in the Analysis Process 30
4.3 Simon—The Business Analyst 30
4.4 Daniele—The Quantitative Expert 31
4.5 Elias—The IT and Data Expert 32
4.6 Who Is Right? 33
4.7 The Optimal Triangle 35
4.1 Introduction
In Chapter 2 – Characteristics of Analytic Business Questions, we discussed the complexity of the
analysis team as a characteristic of a business question. If the business question and the related
analysis are simple and only one or a few people are involved in this analysis, there will be little
interaction between people.
If, however, more people from different disciplines with different backgrounds and interests are
involved in the analysis process, different points of view on this topic come up. In this chapter we
will not deal with project management and personnel management theory. Instead we will focus
on typical roles that are frequently encountered in analysis projects and will look at the interests,
objectives, and concerns of the people in those roles.
This separation of roles is suitable for our purposes to illustrate different viewpoints of different
people who are involved in data preparation.
Note that in the following sections we will list some of these people’s attitudes and opinions. For
illustration we do this in an intentionally extreme manner, although we know that not many
people are so extreme. We also want to mention that we had no particular Simon, Daniele, or
Elias in mind when writing this chapter.
His interests
Simon wants results. For example, Simon needs the results for an analysis in order to
make a business decision about a target group for a campaign. He wants to see and
understand the results, possibly graphically illustrated.
In conversations with Daniele, the quantitative expert, he likes simple explanations. He
knows that Daniele can deliver very valuable results to him in order to improve his
business.
In his role as a medical doctor, he does not like dealing with problems such as
multiplicity and the fact that multiple choice responses to questions on the case record
form or multiple possible diagnoses can’t be entered into the same table column for a
patient.
What he dislikes
Not understanding why an analysis takes so much preparation time, whereas another
analysis, such as building a predictive model, can be done much more quickly
Being involved in too much data acquisition discussion
Getting things explained in a complex statistical language
Being told that things are too complicated, questions cannot be answered with these types
of data, and more time is needed
Her role
Daniele is the statistician in the organization. She is in charge of models that help the company
answer its business questions and be more productive. She is confronted with requirements from
the business departments and she needs data from the IT department. She tries to explain her
points of view with as few statistical terms as possible, but sometimes she needs to rely on her
statistical arguments.
She can be located in different places in the organization:
in the business department, she assists the business directly with results
in the IT department, if analysis is seen as a service to other (business) departments
in a separate analysis department, if this is especially foreseen in the respective
organization
as an external consultant who assists the organization with the analysis
Other people expect Daniele to deliver correct and beneficial results. Sometimes, she finds herself
wondering whether she should leave the path of “correct” statistical analysis and be creative with
a workaround in order to get desired results.
She knows that analysis cannot be completely separated from data management, and she often faces the difficulty of explaining why she cannot name all the necessary attributes in advance.
Her interests
Have an analytic database rich in attributes
Perform data management on her own
Respond to requests from the business side very quickly
See her results being used in practice
Analyze not only active customers but also inactive customers (or patients who have already left the clinical trial)
Understand that historic data are not only data from the past but historic snapshots of the
data
Investigate longitudinal data herself—which trends, triggers, or aggregations best fit the business problem
Create from perhaps hundreds of input variables a final prediction model with between
10 and 20 predictors
Make it clear that there is a difference between data that are suitable for multivariate statistics and univariately cleansed data for reporting purposes
His role
Elias is the IT and data expert. He manages a lot of databases, operational systems, and the
reporting data warehouse. He knows the data sources, the tables, and the data model very well. He
also knows about the effort and resources needed to create additional variables.
His interests
That data necessary for analysis can be regularly loaded from the data warehouse or other
source systems and not too much external ad-hoc data are used
That the data preparation process runs smoothly, automatically, and securely
That the data preparation job can run in batch and can be scheduled
That no errors in the data preparation process occur
Perform scoring on the scoring data mart
Balance the cost of data acquisition with the benefit of these data
Consider data to be clean if they are clean for reporting or clean in a univariate sense
What he dislikes
Transferring data to the analytic platform just to be scored there
Being confronted with a large SAS program with many (dubious) macro calls, where it is
hard to understand what a program does
Looking for error messages in the SAS log
Having an unsuccessful scoring run because of new categories or other reasons
Having an unsuccessful scoring run that he cannot interpret because of errors in the SAS
log
Daniele’s need for historic snapshots of data that are hard to produce from the source
systems
Loading times that are very long because many attributes, historic snapshots, and
aggregations need to be made
Providing data on the detail or record level instead of aggregated values
The fact that the cost for only one column (e.g., the purchase probability) is very high in
terms of programming, monitoring, and data loading
Daniele’s need to explore so many attributes instead of providing him with a final list of the attributes that she will need in the modeling
Not understanding why Daniele cannot specify the final shape of the analysis table and
why she wants to perform data management or access source data during data preparation
Not understanding why an analysis should still be performed when a desired variable is
missing
General
As you might expect there is no universal truth concerning the preceding points. And there is no
judge to define who is allowed to like or to dislike something. We always have to bear in mind
that everyone in an organization plays a role and has to understand and cope with the expectations placed on that role, given their different needs and responsibilities.
Therefore, the following sections do not judge but rather explain the background of certain
wishes, fears, and attitudes of people.
It is, however, inefficient and impractical to create analysis databases in advance merely on the chance that someone might require a statistic on these data; doing so would tie up resources such as preparation time and disk storage.
Simon needs to understand that not all data can be kept permanently available so that an analysis can be started at any time, whenever it is suddenly decided to be a high priority from a business point of view.
However, it does not always have to be that dramatic. How different points of view are handled is more a question of personal and organizational culture.
Moreover, the presence of different points of view also means that people have different backgrounds, knowledge, and understandings of the respective analysis and can therefore benefit from each other. Analysis is not always a very straightforward process. We need to be curious about
various details and creative in order to uncover flexible solutions.
The people or roles we have described in the preceding sections are experts in their respective
domains. They have very specific knowledge, which others probably don’t have. Therefore, the
cooperation of these people is an important success factor for a project.
For example, Daniele might want to include a certain attribute in the data, but Elias might
tell her that this attribute is not well maintained and that the content, if not missing, is
more subjective textual input than real categorical data.
Because Elias knows the data very well, he might suggest different aggregation levels or
data sources that haven’t occurred to Daniele. For historical data or historical snapshots
of the data, he might also give good advice on the respective data sources and their
definitions at a historic point in time.
Daniele might be a good “sparring partner” for Simon in questioning his business
practices for more details. Very often points come up that Simon had not considered.
In return, Simon might have valuable input in the selection of data and data sources from
a business point of view.
In consulting projects it often happens that an external consultant in the persona of
Daniele comes to an organization and encounters Simon and Elias on the business and
technical sides. Again, the exchange among Daniele’s suggestions and statistical requirements, the business process and its requirements, and the technical possibilities is a crucial part of each project.
We also see here that many of these conflicts and discussions take place long before the first
analysis table is built or the first derived variable is calculated. We also want to mention that these
discussions should not be seen as counterproductive. They can provide momentum and positive
energy for the entire project’s successful completion.
Part 2
Data Structures and Data Modeling
Introduction
Part 2 deals with data structures and data models. Data are stored in the source systems in a
data model, which can range from a simple flat table data model to a complex relational model. Based on
these data from the source system, an analysis data mart is built.
The structure of the analysis data mart depends on the definition of the analysis subject and the handling
of repeated observations.
The possibilities for the technical source of our data will be discussed in Chapter 5 – The Origin of Data,
where we will consider data sources by their technical platform and business application.
In Chapter 6 – Data Models, we will provide an introduction to data models such as the relational data
model and the star schema and show the possibilities of the graphical representation of data models with
an entity relationship diagram.
In Chapter 7 – Analysis Subjects and Multiple Observations, the influence of the analysis subject and of multiple observations per analysis subject will be discussed.
In Chapter 8 – The One-Row-per-Subject Data Mart, we will discuss the properties of the one-row-per-
subject paradigm and the resulting data table.
In Chapter 9 – The Multiple-Rows-per-Subject Data Mart, we will show examples and properties of
tables that can have more than one row per subject.
In Chapter 10 – Data Structures for Longitudinal Analysis, we will cover the application of data structures in time series analysis.
In Chapter 11 – Considerations for Data Marts, we will discuss the properties of variables in data marts, such as their measurement scale, type, and role in the analysis process.
Finally, in Chapter 12 – Considerations for Predictive Modeling, we will look at selected properties of
data marts in predictive modeling such as target windows and overfitting.
Chapter 5
The Origin of Data
5.1 Introduction 39
5.2 Data Origin from a Technical Point of View 40
5.3 Application Layer and Data Layer 40
5.4 Simple Text Files or Spreadsheets 40
5.5 Relational Database Systems 41
5.6 Enterprise Resource Planning Systems 41
5.7 Hierarchical Databases 42
5.8 Large Text Files 42
5.9 Where Should Data Be Accessed From? 43
5.1 Introduction
In Chapter 3 – Characteristics of Data Sources, we started classifying data sources with respect to
their relationship to the analytic business question and identified properties that are important for
analytic data preparation.
In this chapter, we will deal with the origin of data that we will use in analytic data preparation.
We will do this from a technical viewpoint where we separate data sources by their technical
platform and storage format.
In some systems, accessing data for an entity called CUSTOMER ORDER can involve joining a number of tables at the data level. Accessing the same information at the application level can make use of functions that are built into the application to combine it. These functions can use the metadata and the business data dictionary of the respective application.
The technical formats of simple text files or spreadsheets include delimited text files, comma-
separated files, Microsoft Excel and Microsoft Access tables, or other formats that frequently
occur on personal computers.
Data from these data sources are frequently used in small or simple analyses, where a researcher
enters data on his PC and transfers or imports these data to SAS. External data such as geo-
demographics from data providers in marketing, lookup tables such as a list of product codes and
product names, or randomization lists in clinical trials are also frequently imported from this type
of data source.
This type of data source is very widespread, and it can be created and opened on nearly any
personal computer.
Text files and spreadsheets allow users to enter data however they want. The rules for data
structures are not set by the software itself, but have to be considered during the data entry
process.
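For orientation, the following is a minimal sketch of how such a file might be read into SAS with PROC IMPORT; the file name, path, and resulting data set and variable names are purely hypothetical and not taken from a real system.

   /* Read a hypothetical comma-separated file into a SAS data set */
   proc import datafile="customer_survey.csv"
               out=work.survey
               dbms=csv
               replace;
      getnames=yes;   /* take variable names from the header row */
   run;

   /* Check what PROC IMPORT guessed about variable types and lengths */
   proc contents data=work.survey;
   run;

Because the structure of text files and spreadsheets is not enforced by the software, PROC IMPORT has to guess variable types and lengths from the data, so the result should always be verified before the data are used further.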
Data that are accessed from relational database systems are usually accessed table by table and afterward merged together according to their primary and foreign keys, or they are accessed from a view in the database that already represents the join of these tables. The relationships between tables are also called the relational model or relational data model, which we will discuss in Chapter 6 – Data Models.
Structured Query Language (SQL) is used for data definition, data management, and data access
and retrieval in relational database systems. This language contains elements for the selection of
columns and subsetting of rows, aggregations, and joins of tables.
Strictly speaking, the term database refers to a specific collection of data, but it is often used synonymously with the software that is used to manage that collection of data. That software is more correctly called a relational database management system, or RDBMS.
Relational database systems also provide tables with metadata on tables and columns. Leading
relational database systems include Oracle, Microsoft SQL Server, and DB2. Microsoft Access
does not provide the same database administration functions as a relational database system, but it
can be mentioned here because data can be stored in a relational model and table views can be
defined.
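As an illustration—and only as a sketch, because engine names and connection options depend on the installed SAS/ACCESS products—data in a relational database can be made available to SAS through a LIBNAME statement and then queried with PROC SQL. The DSN, the credentials, and the table and column names below are placeholders.

   /* Assign a libref that points to a relational database via ODBC     */
   /* (requires SAS/ACCESS Interface to ODBC; all names are examples).  */
   libname dwh odbc dsn=corpdw user=dbuser password=XXXXXXXX;

   proc sql;
      create table work.account_extract as
      select CustID, AccountID, OpenDate        /* selection of columns */
      from   dwh.account
      where  OpenDate >= '01JAN2005'd;          /* subsetting of rows   */
   quit;

In this way the selection of columns, the subsetting of rows, aggregations, and joins described above can be formulated in SQL, and SAS passes suitable parts of the query to the database where possible.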
The functionality of ERP systems is tightly linked to the business process. They are typically
operational systems as we discussed in Chapter 3. Their data model is also optimized for the
operative process and not for information retrieval. In the background, most of these systems store
their data within a relational database management system.
The complexity of the data model, however, makes it difficult to retrieve data directly from the
relational database. The access to information is mostly done on the application layer. Here the
data are retrieved by using certain programmatic functions of the ERP system. In this case the
ERP system provides functions that retrieve and join data from several tables and translate technical table and variable names into names with business meaning.
Data retrieval from hierarchical databases is not as easy as data access from a relational database
system because the underlying data are not stored in rectangular tables that can be joined by a
primary key. Data access does not follow table relationships; rather it follows so-called access
hierarchies. Host programming and detailed knowledge of the respective hierarchical data
structure are needed to decode the data files and to retrieve data.
The output dump of a hierarchical database is characterized by the fact that rows can belong to
different data objects and can therefore have different data structures. For example, customer data can be followed by account data for that customer (see Section 13.5 for an illustration).
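The following is a rough sketch of how such a dump might be read with a SAS DATA step, assuming a hypothetical file in which a record type indicator in column 1 distinguishes customer records ('C') from the account records ('A') that follow them; the file name, column positions, and variable names are assumptions.

   /* Read a hierarchical dump: customer records are followed by their   */
   /* account records; the customer key is retained and attached to each */
   /* account row. All layouts are hypothetical.                         */
   data accounts(keep=CustID AccountID Balance);
      infile "dump.txt" truncover;
      retain CustID;                    /* carry the key from the 'C' record   */
      input rectype $1. @;              /* read the record type, hold the line */
      if rectype = 'C' then
         input @2 CustID 8.;            /* customer record: remember the key   */
      else if rectype = 'A' then do;
         input @2 AccountID 8. @11 Balance 12.2;
         output;                        /* one output row per account          */
      end;
   run;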
Transferring and importing text files is very common and typically fast and practical. Usually an additional file describing the variables’ formats, lengths, and positions in the table is delivered. For data import, a data collector has to be defined that reads the data from the correct positions in the text file.
In contrast to a relational database where additional or new columns can easily be imported
without additional definition work, every change of the structure of the text file needs a respective
change in the definition of the data collector.
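In SAS, the role of such a data collector is typically played by an INFILE/INPUT definition. The following sketch assumes a hypothetical fixed-column layout; every change in the layout of the text file has to be reflected in the pointer positions and informats of the INPUT statement.

   /* Read a large fixed-column text file; positions and informats      */
   /* follow a hypothetical record description delivered with the file. */
   data work.calls;
      infile "calldata.txt" lrecl=200 truncover;
      input @1  CustID   8.
            @9  CallDate yymmdd10.
            @19 Duration 6.
            @25 CallType $2.;
      format CallDate date9.;
   run;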
Data from an operational system can be accessed directly via the application itself, from
the underlying relational database, or via a dump into a text file.
External data can be loaded directly from a spreadsheet, or they can already be in a data
warehouse where they can be accessed from the relational table.
Selecting the technical method for providing data for an analysis is influenced by the following
criteria:
In the case of a one-shot analysis, a manual dump of the database data can be sufficient. For analyses that are re-run, however, it must be ensured that data in the same format can be provided on a regular basis.
Authorization policies sometimes do not allow the direct access of data from a system
itself, even if it would be technically possible. In these cases the system that holds data
usually exports its data to another source, text file, or relational database, where they can
be retrieved for analysis.
Performance considerations can also play a role in the selection of the data delivery
method.
Now that we have seen the technical origin of data, we will look at possible data models and data
structures of the input data in the next chapter.
Chapter 6
Data Models
6.1 Introduction 45
6.2 Relational Model and Entity Relationship Diagrams 46
6.3 Logical versus Physical Data Model 46
6.4 Star Schema 47
6.5 Normalization and De-normalization 49
6.1 Introduction
The source systems, which we discussed in Chapter 5 – The Origin of Data, and which we will
use to access data, store the data in a certain data model. Except in the case of a flat table, where
we have only a single table to import, in most situations we will incorporate a number of tables.
These tables have a certain relationship to each other, which is described by a data model.
We will consider the possible data models that are frequently encountered, namely the flat table,
the relational model, and the star schema. We will go into detail about entity relationship
diagrams and will look at logical and physical data models.
This chapter builds a bridge between the input data sources that come into our analytic data
preparation environment and the final analytic data table.
The business fact “One customer can have one or more accounts/One account is owned by exactly
one customer” results in two entities, CUSTOMER and ACCOUNT, and the relationship between
them is “can have”/“is owned.” We also say that CUSTOMER and ACCOUNT have a one-to-
many relationship because one customer can have many accounts. The restrictions “one or more”
and “exactly one” represent cardinalities in the relationships.
Cardinalities are important for business applications, which in our preceding example have to
check that an account without a customer can’t be created. In data preparation for analytics
cardinalities form the basis of data quality checks. Different from business applications, which
check these rules for each transaction (e.g., when an account is opened), in data quality checks for analytics we check the rules on the entire table and create a list of non-conformities.
The visual representation of a relational model is called an entity relationship diagram. The
following is an entity relationship diagram for this example.
We see that each entity is represented by a rectangle, and the relationship is represented by an
arrow. In our diagram in a one-to-many relationship, the direction of the arrow points from the
‘many’ to the ‘one’ entity. This relationship is sometimes also called a parent-child relationship.
We will use this term in the book. Note that there are different conventions for the representation
of relationships, their direction, and their cardinality.
In the physical data model entities are converted to tables, and attributes are converted to
columns. Relationships are represented by the appropriate key columns or by an additional table.
One-to-many relationships are represented by a foreign key column. In this case the
‘many’ table holds the primary key of the ‘one’ table as a foreign key. In our preceding
example the ‘many’ table ACCOUNT holds the customer ID for the corresponding
customer.
A many-to-many relationship requires its own table that resolves the relationship to two
one-to-many relationships. An example of a many-to-many relationship in the preceding
context is that not only can one customer have one or more accounts, but also one account can be
owned by one or more customers. In this case we need a separate table that resolves the
many-to-many relationship between customers and accounts by holding the combinations
of CustID and AccountID with one row for each combination.
The following diagram shows the physical data model for our customer-account example.
We see that column CustID is stored in the ACCOUNT table as a foreign key. These columns are
important because they are needed to join the tables together in order to create an analysis table.
The fact table POINTOFSALE holds the sales data per DATE, CUSTOMER, PRODUCT, and
PROMOTION.
It is apparent that the name star schema comes from its shape. This type of data storage is efficient
for querying and reporting tasks. For example, the purchases of a certain customer in a certain
period of time can easily be queried, as well as a report per branch, showing the monthly
purchase amount for each article group.
Note that more elaborate data models such as the snowflake schema exist. We will, however,
leave details about this to the data modeling literature for data warehouses.
Star schemas are very common in data warehouse environments, especially for reporting.
Multidimensional data structures, so-called cubes, are built for OLAP reporting, mostly on the
basis of the star schema. In data warehouses, star schemas are also important for the historization
of attributes of certain points in time.
For analytics, however, this structure needs to be converted into one single table. For example, if
we want to generate a table that shows the purchase behavior per customer, we will use the
dimension table CUSTOMER as its basis and include all relevant attributes that belong directly to
the customer. Then we will aggregate or merge data from the POINTOFSALE table to the CUSTOMER table, potentially subgrouped per PROMOTION, PRODUCTGROUP, and TIME (where TIME is derived from DATE in POINTOFSALE).
In Chapter 7 – Analysis Subjects and Multiple Observations, we will give an overview of possible
data structures with repeated observations per subject. In Part 3 of this book, we will show which
derived variables can be created in various data structures. And in Chapter 26 – Case Study 2—
Deriving Customer Segmentation Measures from Transactional Data, we will show a
comprehensive example of the star schema data model.
The CUSTOMER table holds a unique customer identifier and all relevant information
that is directly related to the customer.
The ACCOUNT table holds a unique account identifier, all information about accounts,
and the customer key.
The unique account identifier in the account table and the unique customer identifier in
the customer table are called primary keys. The customer identifier in the account table
denotes which customer the account belongs to and is called the foreign key.
The process of combining information from several tables based on the relationships expressed by
primary and foreign keys is called joining or merging.
Normalization
In a normalized relational model no variables, aside from primary and foreign keys, are
duplicated among tables. Each piece of information is stored only once in a dedicated table. In
data modeling theory this is called the second normal form. Additional normal forms exist, such
as third normal form and the Boyce-Codd normal form, but we will leave details about these to the
data modeling theory literature and will not discuss them in this book.
Normalization is important for transactional systems. The rationale is that certain information is
stored in a single table only, so that updates on data are done in only one table. These data are
stored without redundancy.
De-normalization
The opposite of normalization is de-normalization. De-normalization means that information is
redundantly stored in the tables. This means that the same column appears in more than one table.
In the case of a one-to-many relationship, this leads to the fact that values are repeated.
De-normalization is necessary for analytics. All data must be merged together into one
single table.
De-normalization can be useful for performance and simple handling of data. In
reporting, for example, it is more convenient for the business user if data are already
merged together. For performance reasons, in an operational system, a column might be
stored in de-normalized form in another table in order to reduce the number of table
merges.
Example
We return to our preceding example and see the content of the CUSTOMER and ACCOUNT
tables.
Tables 6-1 and 6-2 represent the normalized version. Besides column CustID, which serves as a
foreign key in the ACCOUNT table, no column is repeated.
Merging these two tables together creates the de-normalized CUSTOMER_ACCOUNT table.
In the de-normalized version, the variables BIRTHDATE and GENDER appear multiple times
per customer. This version of the data can directly be used for analysis of customers and accounts
because all information is stored in one table.
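As a sketch, such a de-normalized table can be created in SAS either with PROC SQL or with a DATA step merge. The variables CustID, Birthdate, and Gender follow the example; the account columns besides the key (AccountID, OpenDate) are assumptions.

   /* Join CUSTOMER and ACCOUNT on the foreign key CustID (PROC SQL) */
   proc sql;
      create table customer_account as
      select c.CustID, c.Birthdate, c.Gender,
             a.AccountID, a.OpenDate
      from   customer as c
             left join account as a
             on c.CustID = a.CustID
      order by c.CustID;
   quit;

   /* Equivalent DATA step merge; both tables must be sorted by CustID */
   proc sort data=customer; by CustID; run;
   proc sort data=account;  by CustID; run;

   data customer_account;
      merge customer(in=in_cust) account;
      by CustID;
      if in_cust;   /* keep every customer; BIRTHDATE and GENDER repeat per account */
   run;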
In the remaining chapters in Part 2, we will examine how to efficiently create de-normalized data
tables for statistical analysis. The process of moving from a relational data structure to a single
table is also often called the de-normalizing or flat-making of a table.
Chapter 7
Analysis Subjects and Multiple Observations
7.1 Introduction 51
7.2 Analysis Subject 52
7.3 Multiple Observations 53
7.4 Data Mart Structures 55
7.5 No Analysis Subject Available? 59
7.1 Introduction
In Chapter 5 – The Origin of Data, we explored possible data sources from a technical point of
view, and in Chapter 6 – Data Models, we discussed the data models and data structures that we
might encounter when accessing data for analytics.
In this chapter we will cover the basic structure of our analysis table.
In the following sections we will look at two central elements of analytic data structures: the analysis subject and multiple observations per analysis subject. Analysis subjects can be, for example, the following:
Persons: Depending on the domain of the analysis, the analysis subjects have more
specific names such as patients in medical statistics, customers in marketing analytics, or
applicants in credit scoring.
Animals: Piglets, for example, are analyzed in feeding experiments; rats are analyzed in
pharmaceutical experiments.
Parts of the body system: In medical research analysis subjects can also be parts of the
body system such as arms (the left arm compared to the right arm), shoulders, or hips.
Note that from a statistical point of view, the validity of the assumptions of the respective
analysis methods has to be checked if dependent observations per person are used in the
analysis.
Things: Such as cash machines in cash demand prediction, cars in quality control in the
automotive industry, or products in product analysis.
Legal entities: Such as companies, contracts, accounts, and applications.
Regions or plots in agricultural studies, or reservoirs in the maturity prediction of fields
in the oil and gas industry.
Analysis subjects are the heart of each analysis because their attributes are measured, processed,
and analyzed. In inferential statistics the features of the analysis subjects in the sample
are used to infer the properties of the analysis subjects of the population. Note that we use feature
and attribute interchangeably here.
In this table 21 runners have been examined, and each one is represented by one row in the
analysis table. Features such as age, weight, and runtime, have been measured for each runner,
and each feature is represented by a single column. Analyses, such as calculating the mean age of
our population or comparing the runtime between experimental groups 1 and 2, can directly start
from this table.
An identifier for each analysis subject is needed, for example, for the following:
data verifications and plausibility checks, if the original data in database queries or data
forms have to be consulted
the identification of the analysis subject if additional data per subject has to be added to
the table
if we work on samples and want to refer to the sampled analysis subject in the population
Also note that in some cases it is illegal, and in general it is against good analysis practice to add
people’s names, addresses, social security numbers, and phone numbers to analysis tables. The
statistician is interested in data on analysis subjects, not in the personal identification of analysis
subjects. If an anonymous subject number is not available, a surrogate key with an arbitrary
numbering system has to be created for both the original data and the analysis data. The
statistician in that case receives only the anonymous analysis data.
There are, however, many cases where the situation becomes more complex; namely, when we
have multiple observations per analysis subject.
Examples
In the preceding example we will have multiple observations when each runner does
more than one run, such as a second run after taking an isotonic drink.
A dermatological study in medical research where different creams are applied to
different areas of the skin.
Evaluation of clinical parameters before and after surgery.
An insurance customer with insurance contracts for auto, home, and life insurance.
A mobile phone customer with his monthly aggregated usage data for the last 24 months.
A daily time series of overnight stays for each hotel.
In general there are two reasons why multiple observations per analysis subject can exist: repeated measurements over time and multiple observations because of hierarchical relationships.
Note that we use the term repeated measurement whenever observations are recorded repeatedly. We are not necessarily talking about measurements in the sense of numeric variables per observation of the same analysis subject; it may be only the presence or absence of an attribute (yes or no) that is noted on each of X occasions.
The simplest form of repeated measurements is the two-observations-per-subject case. This case
happens most often when comparing observations before and after a certain event and we are
interested in the difference or change in certain criteria (pre-test and post-test). Repeated observations over time also arise in many other settings, such as the following:
Patients in a clinical trial make quarterly visits to the medical center where laboratory
values and vital signs values are collected. A series of measurement data such as the
systolic and diastolic blood pressure can be analyzed over time.
The number and duration of phone calls of telecommunications customers are available
on a weekly aggregated basis.
The monthly aggregated purchase history for retail customers.
The weekly total amount of purchases using a credit card.
The monthly list of bank branches visited by a customer.
The fact that we do not have only multiple observations per analysis subject, but ordered repeated
observations, allows us to analyze their course over time such as by looking at trends. In Chapters
18–20 we will explore in detail how this information can be described and retrieved per analysis
subject.
Multiple observations per analysis subject can also arise from hierarchical relationships, as in the following examples:
One insurance customer can have several types of insurance contracts (auto insurance,
home insurance, life insurance). He can also have several contracts of the same type, e.g.,
if he has more than one car.
A telecommunications customer can have several contracts; for each contract, one or
more lines can be subscribed. (In this case we have a one-to-many relationship between
the customer and contract and another one-to-many relationship between the contract and
the line.)
In one household, one or more persons can each have several credit cards.
Per patient both eyes are investigated in an ophthalmological study.
A patient can undergo several different examinations (laboratory, x-ray, vital signs)
during one visit.
Customers can have different account types such as a savings account and a checking
account. And for each account a transaction history is available.
Patients can have visits at different times. At each visit, data from different examinations
are collected.
If from a business point of view a data mart based on these relationships is needed, data
preparation gets more complex, but the principles that we will see in “Data Mart Structures”
remain the same.
The problem of redefining the analysis subject is that we then have dependent measures that
might violate the assumptions of certain analysis methods. Think of a dermatological study where
the effects of different creams applied to the same patient can depend on the skin type and are
therefore not independent of each other. The decision about a redefinition of the analysis subject
level requires domain-specific knowledge and a consideration of the statistically appropriate
analysis method.
Besides the statistical correctness, the determination of the correct analysis subject level also
depends on the business rationale of the analysis. The decision whether to model
telecommunication customers on the customer or on the contract level depends on whether
marketing campaigns or sales actions are planned and executed on the customer or contract level.
Note that so far we have only identified the fact that multiple observations can exist and the causal
origins. We have not investigated how they can be considered in the structure of the analysis
table. We will do this in the following section.
In the case of the presence of multiple observations per analysis subject, we have to represent
them in additional columns. Because we are creating a one-row-per-subject data mart, we cannot
create additional rows per analysis subject. See the following example.
Table 7.4: One-row-per-subject representation of Tables 7.2 and 7.3

CustID  Birthdate   Gender  Number of  Proportion of      Opendate of
                            Accounts   Checking Accounts  oldest account
1       16.05.1970  Male    2          50 %               05.12.1999
2       19.04.1964  Female  3          33 %               01.01.2002
Table 7.4 is the one-row-per-subject representation of Tables 7.2 and 7.3. We see that we have
only two rows because we have only two customers. The variables from the CUSTOMER table
have simply been copied to the table. When aggregating data from the ACCOUNT table,
however, we experience a loss of information. We will discuss that in Chapter 8. Information
from the underlying hierarchy of the ACCOUNT table has been aggregated to the customer level
by completing the following tasks:
counting the number of accounts per customer
calculating the proportion of checking accounts per customer
selecting the open date of the oldest account (the earliest open date) per customer
We have used simple statistics on the variables of ACCOUNT in order to aggregate the data per
subject. More details about bringing all information into one row will be discussed in detail in
Chapter 8 – The One-Row-per-Subject Data Mart.
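A possible sketch of this aggregation in PROC SQL is shown below; it assumes an ACCOUNT table with one row per account and the variables CustID, AccountType, and OpenDate (the variable names are assumptions based on the example).

   /* Aggregate the ACCOUNT table to one row per customer */
   proc sql;
      create table customer_mart as
      select CustID,
             count(*)                       as Number_of_Accounts,
             mean(AccountType = 'Checking') as Prop_Checking_Accounts format=percent8.0,
             min(OpenDate)                  as Opendate_Oldest_Account format=ddmmyy10.
      from   account
      group by CustID;
   quit;

The result has exactly one row per customer and can then be joined to the CUSTOMER table by CustID to obtain the structure of Table 7.4.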
In the case of repeated observations over time the repetitions can be enumerated by a
measurement variable such as a time variable or, if we measure the repetitions only on an
ordinal scale, by a sequence number. See the following example with PATNR as the ID
variable for the analysis subject PATIENT. The values of CENTER and TREATMENT
are repeated per patient because of the repeated measurements of CHOLESTEROL and
TRIGLYCERIDE at each VISITDATE.
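If only the order of the visits matters, a sequence number per patient can be derived with BY-group processing—a small sketch, assuming a data set TRIAL with the variables PATNR and VISITDATE:

   proc sort data=trial;
      by patnr visitdate;
   run;

   data trial_seq;
      set trial;
      by patnr;
      if first.patnr then visit_seq = 0;   /* restart the counter for each patient */
      visit_seq + 1;                       /* sum statement: visit_seq is retained */
   run;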
In the next two chapters we will take a closer look at the properties of one- and multiple-rows-per
subject data marts. In Chapter 14 – Transposing One- and Multiple-Rows-per-Subject Data
Structures, we will see how we can switch between different data mart structures.
There are, however, analysis tables where we do not have an explicit analysis subject. Consider an
example where we have aggregated data on a monthly level—for example, the number of airline
travelers, which can be found in the SASHELP.AIR data set. This table is obviously an analysis
table, which can be used directly for time series analysis. We do not, however, find an analysis
subject in our preceding definition of it.
We, therefore, have to refine the definition of analysis subjects and multiple observations. It is
possible that in analysis tables we consider data on a level where information of analysis subjects
is aggregated. The types of aggregations are in most cases counts, sums, or means.
Example
We will look at an example from the leisure industry. We want to analyze the number of
overnight stays in Vienna hotels. Consider the following three analysis tables:
Table that contains the monthly number of overnight stays per HOTEL
Table that contains the monthly number of overnight stays per CATEGORY (5 stars,
4 stars …)
Table that contains the monthly number of overnight stays in VIENNA IN TOTAL
The first table is a typical multiple-rows-per-subject table with a line for each hotel and month. In
the second table we have lost our analysis subjects because we have aggregated, or summed, over
them. The hotel category could now serve as a new “analysis subject,” but it is more an analysis
level than an analysis subject. Finally, the overnight stays in Vienna in total are on the virtual
analysis level ‘ALL,’ and we have only one analysis subject, ‘VIENNA’.
These categories can be an aggregation level. We have seen the example of hotel categorization as
5 stars, 4 stars, and so on. The category at the highest level can also be called the ALL group,
similar to a grand total.
Note that categorizations do not necessarily need to be hierarchical. There can also be two alternative categorizations, such as the hotel classification, as we saw earlier, and the regional district in the preceding example. This requires that the analysis subject HOTEL have the properties classification and region, which allow aggregation of the number of overnight stays.
Data structures where aggregated data over time are represented either for categories or the ALL
level are called longitudinal data structures or longitudinal data marts.
Strictly speaking the multiple-rows-per-subject data mart with repeated observations over time per
analysis subject can also be considered a longitudinal data mart. The analytical methods that are
applied to these data marts do not differ. The only difference is that in the case of the multiple-
rows-per-subject data mart we have dependent observations per analysis subject, and in the case
of longitudinal data structures we do not have an analysis subject in the classic sense.
Chapter 8
The One-Row-per-Subject Data Mart
8.1 Introduction 61
8.2 The One-Row-per-Subject Paradigm 62
8.3 The Technical Point of View 64
8.4 The Business Point of View: Transposing or Aggregating Original Data 65
8.5 Hierarchies: Aggregating Up and Copying Down 67
8.6 Conclusion 68
8.1 Introduction
In this chapter we will concentrate on the one-row-per-subject data mart. This type of data mart is
frequently found in classic statistical analysis. The majority of data mining marts are of this
structure. In this chapter, we will work out general properties, prerequisites, and tasks of creating
a one-row-per-subject data mart. In Chapters 18–20 we will show how this type of data mart can
be filled from various data sources.
The following analytic methods, among others, require a one-row-per-subject data mart:
regression analysis
analysis-of-variance (ANOVA)
neural networks
decision trees
survival analysis
cluster analysis
principal components analysis and factor analysis
If we have multiple observations per analysis subject it is more effort to put all the information
into one row. Here we have to cover two aspects:
the technical aspect of exactly how multiple-rows-per-subject data can be converted into
one-row-per-subject data
the business aspect—which aggregations and derived variables make the most sense to condense the information from multiple rows into the columns of a single row
The process of taking data from tables that have one-to-many relationships and putting them into
a rectangular one-row-per-subject analysis table has many names: transposing, de-normalizing,
“making flat,” and pivoting, among others.
Table 8.1 illustrates the creation of a one-row-per-subject data mart in terms of an entity
relationship diagram.
Transposing: Here we transpose the multiple rows per subject into columns. This
technique can be considered the “pure” way because we take all data from the rows
and represent them in columns.
Aggregating: Here we aggregate the information from the columns into an aggregated
value per analysis subject. We perform information reduction by trying to express the
content of the original data in descriptive measures that are derived from the original
data.
Transposing
We look at a very simple example with three multiple observations per subject.
Table 8.2: Base table with static information per patient (= analysis subject)
PatNr Gender
1 Male
2 Female
3 Male
Table 8.3: Table with multiple observations per patient (= analysis subject)
We have to bring data from the table with multiple observations per patient into a form so that the
data can be joined to the base table on a one-to-one basis. We therefore transpose Table 8.3 by
patient number and bring all repeated elements into columns.
Table 8.4: Multiple observations in a table structure with one row per patient
This information is then joined with the base table and we retrieve our analysis data mart in the
requested data structure.
Note that we have brought all information on a one-to-one basis from the multiple-rows-per-
subject data set to the final data set by transposing the data. This data structure is suitable for
analyses such as repeated measurements analysis of variance.
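In SAS, such a transposition is typically carried out with PROC TRANSPOSE. The following sketch assumes that the repeated measurements of Table 8.3 are stored in a data set MEASUREMENTS with the variables PatNr, MeasNr (the repetition number), and Cholesterol, and that the base table of Table 8.2 is called BASE; these data set and variable names are assumptions.

   proc sort data=measurements;
      by PatNr MeasNr;
   run;

   /* One column per repetition: Cholesterol_1, Cholesterol_2, ... */
   proc transpose data=measurements
                  out=meas_wide(drop=_name_)
                  prefix=Cholesterol_;
      by  PatNr;
      id  MeasNr;
      var Cholesterol;
   run;

   /* Join the transposed measurements back to the static patient data */
   data mart;
      merge base meas_wide;   /* BASE is assumed to be sorted by PatNr */
      by PatNr;
   run;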
Aggregating
If we aggregate the information from the multiple-rows-per-subject data set, e.g., by using
descriptive statistics such as the median, the minimum and maximum, or the interquartile range,
we do not bring the original data on a one-to-one basis to the one-row-per-subject data set. Instead
we analyze a condensed version of the data. The data might then look like Table 8.6.
We see that aggregations do not produce as many columns as transpositions because we condense
the information. The forms of aggregations, however, have to be carefully selected because the
omission of an important and, from a business point of view, relevant aggregation means that
information is lost. There is no ultimate truth for the best selection of aggregation measures over all
business questions. Domain-specific knowledge is key when selecting them. In predictive
analysis, for example, we try to create candidate predictors that reflect the properties of the
underlying subject and its behavior as accurately as possible.
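A sketch of such an aggregation with PROC MEANS, again assuming a data set MEASUREMENTS with the variables PatNr and Cholesterol, could look as follows:

   /* One row per patient with descriptive statistics of the repeated values */
   proc means data=measurements noprint nway;
      class PatNr;
      var   Cholesterol;
      output out=meas_agg(drop=_type_ _freq_)
             median = Cholesterol_Median
             min    = Cholesterol_Min
             max    = Cholesterol_Max
             qrange = Cholesterol_IQR;
   run;

The resulting table MEAS_AGG can be merged with the base table by PatNr in the same way as the transposed data above.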
We mentioned in the preceding section that there are analyses where all data for a subject are
needed in columns, and no aggregation of observations makes sense. For example, the repeated
measurement analysis of variance needs the repeated measurement values transposed into
columns. In cases where we want to calculate derived variables, describing the course of a time
series, e.g., the mean usage of months 1 to 3 before cancellation divided by the mean usage in
months 4 to 6, the original values are needed in columns.
There are many situations where a one-to-one transfer of data is technically not possible or
practical and where aggregations make much more sense. Let’s look at the following questions:
If the cholesterol values are expressed in columns, is this the information we need for
further analysis?
What will we do if we have not only three observations for one measurement variable per
subject, but 100 repetitions for 50 variables? Transposing all these data would lead to
5,000 variables.
Do we really need the data in the analysis data mart on a one-to-one basis from the
original table, or do we more precisely need the information in concise aggregations?
In predictive analysis, don’t we need data that have good predictive power for the
modeling target and that suitably describe the relationship between target and input
variables, instead of the original data?
The answers to these questions will depend on the business objectives, the modeling technique(s),
and the data itself. However, there are many cases, especially in data mining analysis, where
clever aggregations of the data make much more sense than a one-to-one transposition. How to
best aggregate the data requires business and domain knowledge and a good understanding of the
data.
So we can see that putting all information into one row is not solely about data management. We
will show in this book how data from multiple observations can be cleverly prepared to fit our
analytic needs. Chapters 19–21 will extensively cover this topic. Here we will give an overview of
methods that allow the extraction and representation of information from multiple observations
per subject.
When aggregating categorical (qualitative) data for a one-row-per-subject data mart we can
use the following:
total counts
frequency counts per category and percentages per category
distinct counts
concentration measures on the basis of cumulative frequencies
In the case of multiple observations because of hierarchical relationships, we are also aggregating
information from a lower hierarchical level to a higher one. But we can also have information at a
higher hierarchical level that has to be available to lower hierarchical levels. It can also be the
case that information has to be shared between members of the same hierarchy.
Example
Consider the example where we have the following relationships: one HOUSEHOLD can have one or more CUSTOMERs, one CUSTOMER can hold one or more CREDIT CARDs, and for each CREDIT CARD one or more TRANSACTIONs are recorded.
We will now show how data from different hierarchical levels are made available to the analysis
mart. Note that for didactic purposes, we reduce the number of variables to a few.
Aggregating up
In our example, we will create variables on the CUSTOMER level that are aggregations from the lower levels, CREDIT CARD and TRANSACTION.
Copying down
At the HOUSEHOLD level, information is available that is copied down to the CUSTOMER level.
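The following PROC SQL sketch illustrates both directions in one step. It assumes hypothetical tables HOUSEHOLD (HHID, Region), CUSTOMER (CustID, HHID), CREDITCARD (CardID, CustID), and TRANS (CardID, Amount); all table and variable names are assumptions.

   proc sql;
      create table customer_level as
      select c.CustID,
             h.Region                  as HH_Region,        /* copied down from HOUSEHOLD     */
             count(distinct cc.CardID) as N_CreditCards,    /* aggregated up from CREDIT CARD */
             sum(t.Amount)             as Sum_Transactions  /* aggregated up from TRANSACTION */
      from customer  as c
           left join household  as h  on c.HHID    = h.HHID
           left join creditcard as cc on cc.CustID = c.CustID
           left join trans      as t  on t.CardID  = cc.CardID
      group by c.CustID, h.Region;
   quit;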
Graphical representation
Table 8.7 shows the graphical representation in the form of an entity relationship diagram and the
respective variable flows.
8.6 Conclusion
We see that information that is finally used at the analysis level CUSTOMER is retrieved from
several levels. We used simple statistical aggregations to illustrate this. In Chapters 18–20 we will
go into more detail about these processes and will also present their respective coding examples.
Chapter 9
The Multiple-Rows-per-Subject Data Mart
9.1 Introduction 69
9.2 Using Multiple-Rows-per-Subject Data Marts 70
9.3 Types of Multiple-Rows-per-Subject Data Marts 71
9.4 Multiple Observations per Time Period 74
9.5 Relationship to Other Data Mart Structures 75
9.1 Introduction
As we saw in Table 7.6 of Chapter 7 – Analysis Subjects and Multiple Observations, the multiple-
rows-per-subject data mart can be created only if we have multiple observations per subject. We
also identified the two reasons for the multiple observations, namely, repeated measurements over
time and multiple observations because of hierarchical relationships.
A one-to-many relationship exists between the analysis subject entity and the entity of the
multiple observations. In the multiple-rows-per-subject data mart the number of observations is
determined by the number of observations in the entity with the multiple observations. This is
different from the one-row-per-analysis subject data mart, where the number of observations is
determined by the number of observations in the table of the analysis subject.
The columns in the multiple-rows-per-subject data mart are derived from the table with the
multiple observations. Different from the one-row-per-subject data mart, the columns from the
table with the multiple observations can be copied directly to the multiple-rows-per-subject data
mart.
Aggregations, in the sense of putting information from multiple observations into additional columns,
are not needed. Therefore, in the case of multiple observations, building multiple-rows-per-
subject data marts is a bit more straightforward than building a one-row-per-subject data mart.
The reason is that multiple observation data are usually stored in a multiple-rows-per-subject
structure by data processing systems.
The columns can also come from the table that holds non-repeated or static information for the
analysis subject. In that case, this information is repeated per subject as often as observations for the respective subject exist in the table with the multiple observations. These data are then de-normalized.
Multiple-rows-per-subject data marts can also consist of data only from the ‘many’ table of a one-
to-many relationship. This means that static information on the analysis subject level does not
necessarily have to be copied to the multiple-rows-per-subject data mart.
Typical business questions that are answered with multiple-rows-per-subject data marts include the following:
What is the demand forecast for a certain product for the next 18 months?
Are there products that are frequently bought together (market basket analysis)?
What are the typical buying sequences of my customers?
What are the Web paths of Internet users (clickstream analysis)?
Is there a visual trend of cholesterol values over time?
Analytic methods such as time series forecasting, market basket analysis, sequence and Web path analysis, and the visual analysis of trends over time require a multiple-rows-per-subject data mart.
Multiple-rows-per-subject data marts can be distinguished by the following questions:
Does the table contain data only from multiple observations (besides the subject ID), or
are de-normalized data from the subject level in the table also present?
How are multiple rows per subject enumerated? The table can have the following
features:
an interval scaled variable such as a time variable
an ordinal numerator variable or sequence variable
no numeration of the multiple rows
Note that sometimes the sequence can be inferred from the ordering of records in the table. Relying on this is not advisable, however; a sequence number, if available, should be provided, because the record order can easily change during data import, table creation, or for other reasons.
Table 9.1: Market basket data for two customers without a buying sequence
variable
Table 9.2 contains the same data as in Table 9.1. However, in this case, additional information for
the analysis subject level has been added—the value segment. The value of this variable is
repeated as many times as observations exist per subject. This table allows a so-called BY-
category analysis, where the market basket analysis is performed per value segment.
Table 9.2: Market basket data for two customers with additional data on
CUSTOMER level
This type of data representation is sometimes found in storage systems. We will see in
Chapter 14 – Transposing One- and Multiple-Rows-per-Subject Data Structures, that the
restructuring of a key-value table to a one-row-per-subject data mart is very simple to accomplish.
Table 9.4: Market basket data for two customers with a buying sequence variable
Table 9.5 gives a further example of an ordinal-scaled variable for the repeated observations. The
data are an extract from a Web log and can be used for Web path analysis. The subject in that case
is the session, identified by the SESSION IDENTIFIER; the variable SESSION SEQUENCE
enumerates the requested files in each session.
These types of multiple-rows-per-subject data are usually applicable in time series analysis. The
MONTH variable defines equidistant intervals.
Table 9.7 shows the same data as in Table 9.6. However, the variables GENDER and TARIFF
have been added and repeated for each customer.
Table 9.7: Monthly profit data, enhanced with GENDER and TARIFF
Additionally, we see that we have multiple observations per MACHINE and OPERATOR for a
given date. So in our example we have a one-to-many relationship between
MACHINE/OPERATOR and DAY and a one-to-many-relationship between DAY and
MEASUREMENT. The variable MEASUREMENT is not included in this table.
MEASUREMENT in this case does not mean that the observations are taken in an order. They
can also be non-ordered repetitions.
Such occurrences are common in analysis tables for ANOVA or quality control chart analysis (see
PROC SHEWHART in SAS/QC). In these analyses, a measurement is often repeated for a
combination of categories.
These table structures are also multiple-rows-per-subject tables. The focus, however, is not on the
analysis at the subject level but on the comparison of the influences of various categories.
Chapter 10
Data Structures for Longitudinal Analysis
10.1 Introduction 77
10.2 Data Relationships in Longitudinal Cases 79
10.3 Transactional Data, Finest Granularity, and Most Appropriate Aggregation
Level 82
10.4 Data Mart Structures for Longitudinal Data Marts 83
10.1 Introduction
In this chapter we will cover longitudinal data marts. In longitudinal data marts we have
observations over time. This means that time or a sequential ordering is included in these data
marts by a time or sequence variable.
Longitudinal data marts do not have an analysis subject such as in multiple-rows-per-subject data
marts. They can represent one or more variables measured at several points in time, as we see in the following examples.
Or longitudinal data marts can have measurements over time for each category. In Table 10.2 the
category is COUNTRY.
Table 10.2: Longitudinal data mart with time series per country
Note that the borders of longitudinal and multiple-rows-per-subject data marts are very fluid. As
soon as we define COUNTRY as an analysis subject, we have a multiple-rows-per-subject data
mart. We will, however, keep this distinction for didactic purposes.
Analyses that require longitudinal data structures will be called longitudinal analysis in this book.
Time series analyses or quality control analyses are the main types of longitudinal analyses.
Examples of longitudinal analyses include the following:
predicting the number of tourist overnight stays in Vienna on a daily basis for a
forecasting horizon of three months
predicting the daily cash demand in ATMs, depending on factors such as region or day of
week
Note that the VALUE entity here can represent one or more values. Both tables in Table 10.1 are
represented by this entity relationship diagram.
Figure 10.2: Entity relationship diagram for values per month and country
If we add another categorization such as PRODUCT, our entity relationship diagram will have an
additional entity, PRODUCT.
Figure 10.3: Entity relationship diagram for categories COUNTRY and PRODUCT
If we refer to Chapter 6 – Data Models, we see that here we have a classic star schema. In fact,
data models for longitudinal data marts are a star schema where we have the VALUES in the fact
table and at least one dimension table with the TIME values. For each category, a dimension table
will be added; in this case, COUNTRY and PRODUCT are added.
The categories themselves can have additional hierarchical relationships, such as the following:
One country is in a sales region, and each sales region belongs to one continent.
One product is in a product group, and product groups are summarized in product main
groups.
In terms of the star schema, we have put all possible dimensions except the TIME dimension into
CATEGORY, resulting in a star schema with the fact table VALUE and two dimension tables,
TIME and CATEGORY. This representation is useful for us in this chapter because we will
investigate which data mart structures represent these relationships. Before we look at the
different data mart structures, we will summarize properties of our generic data relationship
model.
TIME
The TIME entity contains the ordering dimension of the data. In most cases it is a time variable,
but it can also be an ordinal sequence or some other type of ordering that is not time-related. The
time information can be stored in one variable, or it can be given in a set of variables such as
YEAR, MONTH, and DAY. In many cases the time information is available at different
aggregation levels such as YEAR, QUARTER, MONTH, WEEK, or DAY, allowing analysis at
different time granularities. This entity corresponds to the TIME dimension in a star schema.
CATEGORY
The CATEGORY entity is a logical entity that combines classification information from different
categorizing entities except for the TIME entity. In terms of the star schema, the CATEGORY entity
corresponds to the different dimensions (except for the TIME dimension). The single categories
can have hierarchical orderings, such as branch, region, and country, allowing analysis at different
hierarchical aggregation levels.
Note that if no categorization is present, as we saw in the examples in Table 10.1, we can speak of
a virtual ALL category in order to fit our definition.
VALUE
The VALUE entity can contain one or more values that are given for combinations of the other
entities TIME and CATEGORY. The values can be either interval-scaled or nominal-scaled. The
analysis method will be chosen based on the level of measurement for the variables in question.
This entity corresponds to the fact table in a star schema.
Looking at the logical CATEGORY dimension, we can have different dimensions with
hierarchies. For example, we can have a CUSTOMER and a PRODUCT dimension. The
PRODUCT dimension can have hierarchies such as PRODUCT, PRODUCT_SUBGROUP, and
PRODUCT_GROUP.
Consider an example where we have daily data at the CUSTOMER and PRODUCT level. In this
case the finest granularity is DATE, CUSTOMER, and PRODUCT. Table 10.3 shows an example
of these data.
We see that in our data we can have more than one row per DATE, CUSTOMER, and
PRODUCT. This can happen, for example, if a customer purchases the same product more than
once on a certain day. This representation of data is frequent if data are retrieved from an ordering
system. In Chapter 3 – Characteristics of Data Sources, we also saw that these systems are called
transactional systems. In the ordering system the orders will have a unique order ID.
If we now aggregate the data to their finest aggregation levels, DATE, CUSTOMER, and
PRODUCT, the result is Table 10.4.
In practice, Tables 10.3 and 10.4 are often referred to as transactional data. We want to mention
here that real transactional data from an operational system are usually in the finest time
granularity, but they can also contain duplicate rows per TIME and CATEGORY. The duplicates
per TIME and CATEGORY are usually identified by a unique order ID. Usually these data are
aggregated in a next step to a higher aggregation level for analysis.
This higher aggregation level can also be called the most appropriate aggregation level. The most
appropriate aggregation level depends on the business questions. If forecasts are produced on a
monthly basis, then data will be aggregated at the TIME dimension on a monthly basis. The finest
granularity can be seen as something that is technically possible with the available data. However,
the level, either TIME or CATEGORY, to which data are finally aggregated depends on the
business questions and the forecasting analysis method.
General
Let’s look again at a simple version of a longitudinal data mart.
At the end of the previous section we discussed that we can have more than one VALUE, and we
can have values not only for the virtual ALL group but for subgroups or CATEGORIES. This
means that the table will grow in order to accommodate the additional VALUES or
CATEGORIES.
Not surprisingly, the table can grow by adding more rows or columns, or both. There are three
main data mart structures for longitudinal data: the standard form of a longitudinal data set, the
cross-sectional dimension structure, and the interleaved structure. The standard form has the
following properties:
The data set contains one variable for each time series.
The data set contains exactly one observation for each time period.
The data set contains an ID variable or variables that identify the time period of each
observation.
The data set is sorted by the ID variables associated with datetime values, so the
observations are in time sequence.
The data are equally spaced in time. That is, successive observations are a fixed time
interval apart, so the data set can be described by a single sampling interval such as
hourly, daily, monthly, quarterly, yearly, and so forth. This means that time series with
different sampling frequencies are not mixed in the same SAS data set.
We can see that the tables in Table 10.1 are in the standard form of a longitudinal data set. Table
10.2 is not in the standard form because we have more than one row per time interval. In order to
bring this table into standard form, we have to arrange it as in Table 10.6.
Table 10.6: Standard form of a longitudinal data set with values per country
Note that the standard form of a longitudinal data set has some similarity to the one-row-per-
subject paradigm. In the one-row-per-subject paradigm, we are not allowed to have multiple rows
per subject, but here we cannot have more than one row per period. As a consequence we have to
represent the information in additional columns.
If we have more than one category in our logical entity CATEGORY, each additional category
results in an additional cross-dimension variable. If, for example, we have data as in
Figure 10.3, we will have a cross-sectional dimension for COUNTRY and one for PRODUCT.
The corresponding table would look like Table 10.7.
Table 10.7: Cross-sectional dimension data mart for COUNTRY and PRODUCT
(truncated)
In the terminology of our data structures, we have in this case two variables from the fact table
observed for each time period. Representing the second fact variable in additional rows, instead of
additional columns, results in the interleaved data structure.
In the interleaved or cross-sectional data mart structures we multiply the number of observations
of the corresponding standard form by repeating the rows as often as we have categories or
repeated values. This type of data representation is convenient for analysis because we can use
BY statements for BY-group processing and WHERE statements for filtering observations. This
type of data representation also allows the creation of graphs per BY group.
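For illustration, the following minimal sketch analyzes a cross-sectional data mart per category with a BY statement and restricts the analysis to one product with a WHERE statement; the table SALES_CS and its variables are assumptions.

PROC SORT DATA = sales_cs;
 BY country date;
RUN;
PROC MEANS DATA = sales_cs;
 BY country;
 WHERE product = 'A';
 VAR amount;
RUN;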
We will encounter the three forms of longitudinal data marts again in Chapter 15 – Transposing
Longitudinal Data, where we will demonstrate methods of converting among these structures by
using SAS code.
Chapter 11
Considerations for Data Marts
11.1 Introduction 89
11.2 Types and Roles of Variables in a Data Mart 89
11.3 Derived Variables 92
11.4 Variable Criteria 93
11.1 Introduction
In Chapters 6–10 we discussed possible data models and data structures for our data mart. Before
we move to the creation of derived variables and coding examples we will examine definitions
and properties of variables in a data mart. We will look at attribute scales, variable types, and
variable roles. And we will see different ways that derived variables can be created.
attribute scales
variable types
variable roles
Binary-scaled attributes contain two discrete values, such as PURCHASE: Yes, No.
Nominal-scaled attributes contain a discrete set of values that do not have a logical
ordering such as PARTY: Democrat, Republican, other.
Ordinal-scaled attributes contain a discrete set of values that do have a logical ordering
such as GRADE: A, B, C, D, F.
Interval-scaled attributes contain values that vary across a continuous range such as AGE:
15, 34, 46, 50, 56, 80, ..., 102.
In some cases binary, nominal, and ordinal scaled attributes are also referred to as categorical or
qualitative attributes (or categorical variables) because their values are categories. Interval
measures are considered quantitative attributes.
Note that the preceding classification matches the classification of variable types in SAS Enterprise
Miner, which is important for all statistical considerations that we will deal with. The only
difference from statistical theory is that we do not distinguish between interval-scaled and ratio-
scaled attributes but treat them as one type. A ratio-scaled attribute is defined by the fact that
ratios of its values make sense; for example,
“Duration in years” is ratio-scaled because we can say that 10 years is twice as much as 5
years.
“Temperature in degrees Celsius” is (only) interval-scaled because ratios on this scale do
not make sense.
numeric variables
character variables
Numeric variables can contain interval-scaled values as well as categorical values. In the case of
interval-scaled measures the value itself is stored in the variable. The values of binary, nominal, or
ordinal attributes need to be numerically coded before they can be stored in numeric variables.
Character variables usually hold binary, nominal, and ordinal scaled values. It is also technically
possible to store interval-scaled values in character variables, but in most cases this does not allow
for performing calculations on them.
Note that on Windows and UNIX machines numeric variables have a minimum length of
3 bytes in a SAS data set. If numeric codes with only a few digits are to be stored, using a
character variable can save disk space.
Storing binary values in numeric variables can make sense in calculations, as the mean of
a binary variable with values 0 and 1 equals the proportion of the observations with a
value of 1.
SAS Enterprise Miner, SAS/INSIGHT, and SAS/IML derive the metadata information
(whether a variable is an interval or a categorical variable) from its variable type. Storing
categorical information in character variables therefore automates the correct definition of
the variable's metadata type.
SAS formats can be used to assign meaningful labels to values and reduce disk space
requirements.
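As a minimal sketch of this last point, a format can map numeric codes to meaningful labels; the codes and the table CUSTOMER are assumptions.

PROC FORMAT;
 VALUE party 1 = 'Democrat'
             2 = 'Republican'
             3 = 'Other';
RUN;
DATA customer;
 SET customer;
 FORMAT party_code party.;
RUN;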
ID variables hold the ID values of subjects and the ID values of entities in underlying
hierarchies. Examples are Customer Number, Account ID, Patient Number, and so forth.
Subject identifiers such as primary and foreign keys are candidates for ID variables.
TARGET variables are predicted or estimated by input variables in predictive modeling.
In the y=Xb equation, target variables stand on the left side of the equation and represent
the y's. Note that the target can also be a set of variables. In PROC LOGISTIC, for
example, the y can also be given in the form of count data, where the number of events
and the number of trials are held in two variables (a short sketch of this form follows below).
INPUT variables are all non-target variables that contain information that is used in
analyses to classify, estimate, or predict a target or output variable. In predictive
modeling, for example, INPUT variables are used to predict the target variables (those
that are represented by the X in the y=Xb equation). In cluster analyses, where there is no
target variable, INPUT variables are used to calculate the cluster assignments.
TIMEID variables can be found in longitudinal data marts and multiple-rows-per-subject
data marts only and contain the value of the TIME dimension.
CROSSID variables can be found in longitudinal data marts only and contain the
categories for cross-sectional analyses in longitudinal data marts. The CROSSID
variables are used to subgroup or subset an analysis.
SEQUENCE variables can be found in longitudinal and multiple-rows-per-subject data
marts. They are similar to the TIMEID variables, but can also contain ordinal or non-
time-related information.
Note that one variable can have more than one role—a variable can be a target variable in analysis
A and an input variable in analysis B.
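The events/trials form of the target mentioned in the TARGET item above can be sketched as follows; the data set DOSE_RESPONSE and the variables EVENTS, TRIALS, and DOSE are assumptions used only for illustration.

PROC LOGISTIC DATA = dose_response;
 MODEL events/trials = dose;
RUN;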
variables that are taken on a one-to-one basis from the source data and are copied to the
analysis table
derived variables that are created from other variables by mathematical or statistical
functions or by aggregations
Variables that are taken on a one-to-one basis from the source data are often subject identifiers or
values that do not need further transformation or calculation such as the “number of children” or
“gender.”
The variable BIRTHDATE, for example, is taken from the operational system on a one-to-one basis
into the data mart, and the derived variable AGE is calculated from the current date and
BIRTHDATE.
Analysis tables almost never exclusively contain variables that are taken on a one-to-one basis
from the source data. In many cases derived variables play an important role in statistical analysis.
We will, therefore, have a more detailed look at them and how they are created.
For example, we want to group our customers by their income into the groups LOW,
MEDIUM, and HIGH. In this case we will set up rules such as INCOME < 15000 =
LOW. Thus, the derived variable INCOME_GROUP for a subject depends not only on
its INCOME value but also on the class limits, which are derived from business rules or
the quantiles of the distribution over all customers. A code sketch for this grouping follows below.
An aggregated value over all observations, such as the sum or the mean, is used in the
calculation of the derived variable. For example, the mean over all subjects of a variable
is subtracted from the subject’s value of a variable in order to have positive values in the
derived variable for values higher than the mean and negative values for values lower
than the mean.
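A minimal sketch of the income grouping from the first example above; the second class limit of 30,000 is an assumption.

DATA customer;
 SET customer;
 LENGTH Income_Group $ 6;
 IF income = . THEN Income_Group = ' ';
 ELSE IF income < 15000 THEN Income_Group = 'LOW';
 ELSE IF income < 30000 THEN Income_Group = 'MEDIUM';
 ELSE Income_Group = 'HIGH';
RUN;

Alternatively, PROC RANK with the GROUPS=3 option could be used to derive the class limits from the quantiles of the distribution.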
These types of variables are usually very important in order to describe specific properties of a
subject. In data mining, most variables in an analysis data mart are created by comparing the
subject’s value to the values of other subjects and trying to create meaningful characteristics for
the analysis. In Chapters 16–19 we will discuss potential methods used to calculate those types of
variables.
Note that these types of derived variables are mostly found with numerical variables but are not
limited to those. Think of a simple indicator variable that defines whether a customer lives in a
district where most other customers live or where the most car insurance claims have been
recorded.
The variable itself belongs to a group “Variables that are taken on a one-to-one basis
from the source data.”
Creating an indicator variable that tells whether a subject has children or not, we have a
derived variable that depends only on the subject’s values themselves.
Creating a variable that tells us whether the subject has fewer or more children than the
average number of children for all customers, results in a derived variable that also
depends on the values of other subjects.
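A minimal sketch of the last case above; the table CUSTOMER and the variable CHILDREN are assumptions.

PROC MEANS DATA = customer NOPRINT;
 VAR children;
 OUTPUT OUT = child_mean(KEEP = children_mean) MEAN = children_mean;
RUN;
DATA customer_mart;
 IF _N_ = 1 THEN SET child_mean;
 SET customer;
 More_Children_Than_Avg = (children > children_mean);
 DROP children_mean;
RUN;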
For example, in the case of the creation of a one-row-per-subject analysis table, where we have
data sources with multiple observations per analysis subject, we have to represent the repeated
information in additional columns. We discussed this in Chapters 7 and 8.
The following four criteria characterize properties of the resulting derived variables that are
created in an analysis table from the data source(s):
sufficiency
efficiency
relevance
interpretability
Sufficiency means that all potentially relevant information from the available source systems is
also found in the analysis table.
Efficiency means that we try to keep the number of variables as small as possible. For example, in
the case of transposing repeated observations, we can easily get thousands of variables in the
analysis table.
Relevance means that data are aggregated in such a form that the derived variables are suitable for
the analysis and business question. For example, in predictive modeling, we want to have input
variables with high predictive power.
Interpretability means the variables that are used for analysis can be interpreted and are
meaningful from a business point of view. In predictive modeling, however, variables that have
high predictive power but are not easy to interpret are often used in the analysis. In this case, other
variables are used for model interpretation.
Example
For customers we have a measurement history for the account balance over 20 months. When we
have to condense this information into a one-row-per-customer data mart, we are faced with the
challenge of creating meaningful variables from the measurement history.
Simply creating 20 variables M1–M20 for the measurements would fulfill the sufficiency
requirement. However, creating these variables would not be efficient, nor would it usually
provide relevant variables for analysis.
In this case data preparation has to provide meaningful derived values of the measurements over
time, which does not necessarily mean that we consider each value itself. Chapter 18 – Multiple
Interval-Scaled Observations per Subject will go into detail about this topic and discuss the
applications of indicators, trend variables, and moving averages, among others.
Chapter 12
Considerations for Predictive Modeling
12.1 Introduction 95
12.2 Target Windows and Observation Windows 96
12.3 Multiple Target Windows 97
12.4 Overfitting 98
12.1 Introduction
Predictive modeling is a discipline of data mining that plays a very important role. Because
voluminous literature can be found concerning the training and tuning of predictive models, we
will not consider these topics here. However, because the analysis data mart is a very important
success factor for the predictive power of models, we will consider some design details for
predictive modeling. In Chapter 20 – Coding for Predictive Modeling, we will deal with the
creation of derived variables with high predictive power.
The points that we discuss here mostly concern one-row-per-subject data marts in data mining
analyses but are not restricted to them. In this section we will consider the following:
This time interval is called the target window because the target event—the cancellation—has to
happen within it. Note that the length of the target window has to be within the context of the
business question. In our case it is one month, because we want to predict customers who cancel
in the next month. Assume our target window is January 2005.
We will then consider only customers who were active customers at the beginning of the target
window on January 1, 2005. If a customer cancels his contract at any time from January 1 until
January 31, 2005, he will be considered an event; otherwise, he will be considered a non-event.
Note that customers who cancelled before January 1, 2005, are not considered at all and
customers who cancelled after January 31, 2005, are considered as customers who did not cancel.
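A minimal sketch of how the target variable could be derived for this example; the table CUSTOMERS and the variable CANCEL_DATE are assumptions.

DATA mart;
 SET customers;
 *** Keep only customers who were active at the start of the target window;
 WHERE cancel_date = . OR cancel_date >= '01JAN2005'd;
 *** Event = cancellation within the target window January 2005;
 Target = ('01JAN2005'd <= cancel_date <= '31JAN2005'd);
RUN;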
The end of the observation window can then, for example, be December 31, 2004. This means that
a data snapshot after December 31, 2004, is used to predict an event one month in the future.
Obviously, we cannot take a data snapshot after January 31, 2005, to predict an event that has
already happened. We would then use data that are measured after the event has happened and
would likely get a good but useless prediction that variables such as customer status are good
predictors.
In many cases the end of the observation window in our preceding example is set to one month
earlier, at November 30, 2004. This trains the model to predict events that occur one month after
the end of the observation window instead of events that occur immediately in the month after the
end of the observation window. This additional time interval accounts for the time that might be
needed to make data available for scoring and to prepare a campaign for customer retention. In
many cases the data from the last month are available at, for example, the fifth of the following
month in the data warehouse. To execute a retention campaign might take another three to five
days. This means that in practice we might reach the customer only eight days after the start of the
target window. This additional time interval is also called the offset.
We see that the observation window can span several months—in order to have customer account
transactions for a longer time period so that we can detect patterns of time courses that correlate
with the cancellation event.
In Chapter 3 – Characteristics of Data Sources, we mentioned that the need for snapshots of the
data at a historic point in time can be a challenge if a data warehouse is not in place. We see that
for a prediction for January 2005 we have to go back to September 2004 in our simple example.
If only a small number of target events exist, the target window will need to be extended by
setting it to one year instead of one month, for example. Sometimes the context in which a
business question is analyzed also suggests a one-year target window. In credit scoring, for
example, defaults of credit customers are cumulated over one calendar year for analysis.
Note that in this case of rare incidence of target events, we need a data history extending over
years if we want to use the events in one target window as training data and the events in another
target window as validation data.
We see that we used the months November 2004, December 2004, and January 2005 as target
windows in the respective data marts. The final data mart is then stacked together by aligning the
data marts on their target windows. The first row of Table 12.1 shows the relative month as it is
used in the analysis table. Rows 1–3 show the calendar months from which the data are taken in
each of the three analysis data marts that are stacked together.
In our example the target window spans only one month. This case is frequently found in
marketing analyses where the behavior of customers in the next month will be predicted. There
are, however, many other cases where the target window can span several months or a year. In
credit scoring, for example, target windows with a span over a calendar year are frequently found.
Note that the selection of the sampling months also provides differential (and perhaps biasing)
contributions to the model structure.
In this case we violated the assumption of independent observations in the input data because we
used the same analysis subject more than once. The benefit, however, is a more stable prediction
because we have more events and also events from different periods, which can offset seasonal
effects of the event behavior.
12.4 Overfitting
For categorical data, overfitting often occurs if a high number of categories with a
possibly low number of observations per category exists. Considering all these categories
individually leads to a model that fits the training data well but will probably not
generalize to other data if, for example, some of the low-frequency categories show
different behavior.
In the case of interval data, overly complex mathematical functions that describe the
relationship between the input variable and the target variable can also lead to overfitting,
if the relationship cannot be reproduced on other data.
The most dramatic case of overfitting would be to use the individual subject ID to
measure its relationship to the target variable. This would perfectly model the available
data, but it would only by chance model other data. Of course, this example is more of
didactic importance than a real-life mistake, but it illustrates very well what happens if
highly specific but inappropriate information from the training data is used during
modeling.
Whereas overfitting describes the fact that we model the noise in the data in our prediction model,
the term generalization is used for the fact that a model makes good predictions for cases that are
not in the training data. A generalization error describes the fact that a prediction model loses its
predictive power on non-training data. More details about this can also be found in the SAS
Enterprise Miner Help.
Split sample validation is very popular and easy to implement (a code sketch follows this
list). The available data are split into training and validation data. The split is done either
by simple random sampling or by stratified sampling. The training data mart is used to
train the model and to learn the relationships; it is used for preliminary model fitting, and
the analyst attempts to find the best model weights using this data set. The validation data
mart is used to assess the adequacy of the model that was built on the training data.
With some methods the validation data are already used in the model creation process
such as the pruning of leaves in a decision tree. In these cases, the generalization error
based on the validation data is biased. Therefore, a third data set, the test data, is created
by random or stratified sampling and assessment statistics are calculated for this data set.
Cross-validation is preferable for small data sets. Here the data are split in different ways,
models are trained for each split, and then the validation results are combined across the
splits.
Bootstrap aggregation (bagging) works similarly to cross-validation. However, instead of
subsets of the data, subsamples of the data are analyzed, where each subsample is a
random sample with replacement from the original sample of data.
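A minimal sketch of the split-sample approach by simple random sampling; the table name MART, the split ratio of 70/30, and the seed are assumptions.

DATA train valid;
 SET mart;
 IF RANUNI(12345) < 0.7 THEN OUTPUT train;
 ELSE OUTPUT valid;
RUN;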
Part 3
Data Mart Coding and Content
Introduction
After we introduce different data models and data structures for analytic data marts, we will start to fill
the tables with content in this part of the book. Part 3 – Data Mart Coding and Content, is fully
dedicated to the creation of the appropriate data mart structure and the generation of meaningful variables
for the analysis.
This part of the book is essential for powerful data preparation for analytics. We will show how we can
fill the tables with content in order to answer our business questions.
Whereas in Parts 1 and 2 we only introduced the concepts and rationale for certain elements in analytic
data preparation, we will start to use SAS code in this part of the book. Part 3 is a good balance between
business explanations for the importance and rationale of certain derived variables and data mart
structures and the SAS coding to create these variables and structures.
In total, Part 3 introduces 11 SAS macros for data preparation and is filled with numerous SAS code
examples that explain step by step how derived variables can be created or data mart structures can be
changed. All SAS code examples also show the underlying data tables in order to illustrate the effect of
the respective SAS code. Note that all SAS macros and code examples in this book can be downloaded
from the companion Web site at http://support.sas.com/publishing/bbu/companion_site/60502.html.
Part 3 contains nine chapters:
In Chapter 13 – Accessing Data, we will deal with how data can be accessed and loaded
into SAS data sets.
The next two chapters deal with the transposition of data to the appropriate data mart
structure:
In Chapter 14 – Transposing One-and Multiple-Rows-per-Subject Data Structures, we
will show how we can transpose data marts in order to change between one-row-per-
subject structures and multiple-rows-per-subject structures.
In Chapter 15 – Transposing Longitudinal Data, we will show how to change between
different data structures for longitudinal data marts.
Chapters 16 and 17 will show how derived variables for interval-scaled data and categorical
data can be created. The methods that are presented here apply to one-row-per-subject tables
and multiple-rows-per-subject tables.
In Chapter 16 – Transformations of Interval-Scaled Variables, we will consider topics
such as standardization, the handling of time intervals, and the binning of observations.
In Chapter 17 – Transformations of Categorical Variables, we will cover typical topics
for categorical variables such as dummy variables, the definition of an OTHER group,
and multidimensional categorical variables.
Part 3 of this book will serve as a pool of ideas for data preparation and the creation of derived variables.
You are invited to read this part of the book, to learn from it, and to get ideas and come back to these
chapters during data preparation work in order to look for coding examples.
Chapter 13
Accessing Data
13.1 Introduction
In Chapter 5 – Data Structures and Data Mining, we investigated different technical origins of
data such as simple text files, spreadsheets, relational database systems, Enterprise Resource
Planning systems, large text files, and hierarchical databases.
When investigating methods of data preparation we also need to look at the methods that allow
accessing data and importing data to SAS. After data are available in SAS we can start to change
the data mart structure, to create derived variables, and to perform aggregations.
In this chapter we will look at data access from a technical point of view. We will investigate with
code examples how to access data from relational databases using SAS/ACCESS modules, then
we will see how data can be accessed from Microsoft Office and from rectangular and
hierarchical text files.
General
SAS offers access modules for the following relational databases: DB2, Informix, Microsoft SQL
Server, MySQL, Oracle, Sybase, and Teradata. Access to relational databases via ODBC and
OLE DB interfaces is also available.
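A libref to a relational database is assigned with a LIBNAME statement and the respective engine. The following is a minimal sketch for Oracle; the connection options are placeholders, and the libref RELDB and the SAS library DATA are the names used in the examples that follow.

LIBNAME RelDB ORACLE USER=xxx PASSWORD=xxx PATH=dbserver;
LIBNAME data 'C:\data';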
The advantage of specifying a libref is that the data in the database can be used in SAS syntax as
if they were SAS tables. The library is displayed in the SAS Explorer window or in the libraries
listing in SAS Enterprise Guide, and tables can be opened for browsing.
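The DATA step referred to next is not reproduced above; a minimal sketch using these librefs would be:

DATA data.CustomerData;
 SET RelDB.CustomerData;
RUN;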
This DATA step imports the data from the Oracle database to SAS. An alternative is to use PROC
COPY.
PROC COPY IN = RelDB OUT = data;
SELECT CustomerData UsageData;
RUN;
The following statements, for example, filter the observations for Segment 1 in the source
database and import only the filtered data.
DATA data.CustomerData_S1;
SET RelDB.CustomerData;
WHERE Segment = 1;
RUN;
The libref to the relational database can also be used to access data for analysis directly.
PROC MEANS DATA = RelDB.CustomerData;
VAR age income;
RUN;
In the preceding step, only the variables AGE and INCOME, rather than the whole table, are read
from the table CUSTOMERDATA.
The pass-through statements can also perform an aggregation in the relational database itself by
using the appropriate group functions such as SUM or AVG and the respective GROUP BY
clause.
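A minimal sketch of such a pass-through query, again with placeholder connection options; the source table USAGEDATA and its columns are assumptions.

PROC SQL;
 CONNECT TO ORACLE (USER=xxx PASSWORD=xxx PATH=dbserver);
 CREATE TABLE data.UsageAgg AS
 SELECT * FROM CONNECTION TO ORACLE
   (SELECT CustID, SUM(Amount) AS Amount_Sum
      FROM UsageData
     GROUP BY CustID);
 DISCONNECT FROM ORACLE;
QUIT;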
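Data can also be accessed from Microsoft Office files. A libref can be assigned to a Microsoft Excel workbook with the EXCEL engine of SAS/ACCESS to PC Files; the path in the following sketch is an assumption.

LIBNAME xlslib EXCEL 'C:\data\lookup.xls';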
The LIBNAME XLSLIB allows access to the available sheets of the Excel workbook
LOOKUP.XLS, e.g., by specifying the following statements.
DATA lookup;
SET xlslib.'Sheet1$'n;
RUN;
Note that we have used 'Sheet1$'n because we have to mask the special character '$'.
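Similarly, a libref can be assigned to a Microsoft Access database with the ACCESS engine; the path in this sketch is an assumption.

LIBNAME mdblib ACCESS 'C:\data\lookup.mdb';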
The LIBNAME MDBLIB allows access to the tables and views of the Microsoft Access
database.
Note that we have specified GETNAMES=YES in order to read the column names from the first row.
Various formats
Data in text files can have numerous formats.
Values can be separated by tabs, commas, or blanks, or they can be aligned at a certain
column position. One row in the text file can represent exactly one row in the table.
One row can also represent more than one table row, or one table row can extend over
more than one line.
All rows in the text file can hold data from the same entity, or the text file can also
represent data from a hierarchical database where a row with customer attributes is
followed by rows with the corresponding account data.
The first row in the table can already hold data, or it can hold the variable names.
SAS can handle all of these types of text data for import. The SAS DATA step offers numerous
ways to decode simple and complex text files. We will not show all options here because that
would go beyond the scope of this book. We will deal with an example of rectangular-oriented
data in a comma-separated file and we will show an example with hierarchical data in the next
section.
The following statements allow accessing of the data from the text file by specifying the
appropriate informats, and they output the data to the table CDR_DATA by specifying the
respective formats.
DATA CDR_Data;
INFILE 'C:\DATA\CDR.csv' DELIMITER = ';'
MISSOVER DSD LRECL=32767 FIRSTOBS=2 ;
We specify the LRECL option for the record length in order to overwrite the default of
256 characters. Note that this is not necessary for this example, but it has been included
for didactic purposes because it is necessary in many cases when one row in the text file
is longer than 256 characters.
We specify INFORMAT date=DDMMYY and FORMAT date=DATE9 to convert the
representation of the date from 010505 to 01MAY2005.
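Putting the pieces together, a complete version of this DATA step might look like the following sketch; all column names and informats except the DATE handling are assumptions.

DATA CDR_Data;
 INFILE 'C:\DATA\CDR.csv' DELIMITER = ';'
        MISSOVER DSD LRECL=32767 FIRSTOBS=2;
 INFORMAT CustID 8. Date ddmmyy. Duration 8.1 NumberOfCalls 8. Amount 8.2;
 FORMAT Date date9.;
 INPUT CustID Date Duration NumberOfCalls Amount;
RUN;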
The result is shown in Table 13.1.
The data
The output dump of a hierarchical database is characterized by the fact that different rows can
belong to different data objects and therefore can have different data structures. For example,
customer data can be followed by account data for that customer.
An example of a hierarchical text file is shown next. We have rows for CUSTOMER data, and we
have rows for USAGE data.
The CUSTOMER data rows hold the following attributes: CUSTID, BIRTH DATE, GENDER,
and TARIFF. The USAGE data rows hold the following attributes: DATE, DURATION,
NUMBEROFCALLS, and AMOUNT. At the first character a flag is available that indicates
whether the data come from the CUSTOMER or USAGE data.
C;31001;160570;MALE;STANDARD
U;010505;8874.3;440;32.34
U;020505;1887.3;640;31.34
U;030505;0;0;0
U;040505;0;0;0
C;31002;300748;FEMALE;ADVANCED
U;010505;2345;221;15.99
U;020505;1235;221;27.99
U;030505;1000.3;520;64.21
C;31003;310850;FEMALE;STANDARD
U;010505;1100.3;530;68.21
U;020505;5123;50;77.21
U;030505;1512;60;87.21
The code
In order to input this data structure to a SAS data set the following code can be used.
DATA CustomerCDR;
DROP check;
RETAIN CustID BirthDate Gender Tariff;
INFORMAT
CustID 8.
BirthDate ddmmyy.
Gender $6.
Tariff $10.
Date ddmmyy.
Duration 8.1
NumberOfCalls 8.
Amount 8.2 ;
FORMAT CustID 8.
BirthDate date9.
Gender $6.
Tariff $10.
Date date9.
Duration 8.2
NumberOfCalls 8.
Amount 8.2 ;
INFILE 'C:\data\hierarch_DB.csv' DELIMITER = ';'
DSD LRECL=32767 FIRSTOBS=1 ;
LENGTH check $ 1;
INPUT check $ @;
IF check = 'C' THEN INPUT CustID BirthDate Gender $ Tariff $;
ELSE if check = 'U' THEN INPUT Date Duration NumberOfCalls Amount;
IF check = 'U' THEN OUTPUT;
RUN;
We use a RETAIN statement to retain the values from the CUSTOMER hierarchy and
make them available at the USAGE hierarchy.
We input the first character from the data in order to decide which INPUT statement will
be used to input the data.
Finally, we output only those rows that correspond to usage data because they now also
hold the de-normalized values from the customer data.
The result
The resulting data set is shown in Table 13.2.
We have seen that a SAS DATA step can be used in a very flexible way to decode a hierarchical
text file and to load the data into a SAS data set.
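The modified DATA step that creates additional tables is not reproduced above. A sketch of how it might look, reusing the statements from the program shown earlier and adding KEEP= data set options and separate OUTPUT statements:

DATA CustomerCDR
     Customer(KEEP = CustID BirthDate Gender Tariff)
     CDR(KEEP = CustID Date Duration NumberOfCalls Amount);
 DROP check;
 RETAIN CustID BirthDate Gender Tariff;
 INFORMAT CustID 8. BirthDate ddmmyy. Gender $6. Tariff $10.
          Date ddmmyy. Duration 8.1 NumberOfCalls 8. Amount 8.2;
 FORMAT CustID 8. BirthDate date9. Gender $6. Tariff $10.
        Date date9. Duration 8.2 NumberOfCalls 8. Amount 8.2;
 INFILE 'C:\data\hierarch_DB.csv' DELIMITER = ';' DSD LRECL=32767 FIRSTOBS=1;
 LENGTH check $ 1;
 INPUT check $ @;
 IF check = 'C' THEN DO;
    INPUT CustID BirthDate Gender $ Tariff $;
    OUTPUT Customer;
 END;
 ELSE IF check = 'U' THEN DO;
    INPUT Date Duration NumberOfCalls Amount;
    OUTPUT CDR CustomerCDR;
 END;
RUN;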
This outputs, in addition to the CUSTOMERCDR table, the CUSTOMER and CDR tables.
We see that we have output separate tables in a relational structure from the preceding
hierarchical text file.
General
The preceding methods allow access of data at the data layer. The data are accessed directly from
their source and no application logic is applied. In Chapter 5 – Data Structures and Data
Modeling, we also mentioned the access of data from the application layer. In this case the data
are not directly accessed from their underlying database tables or data dump into text files, but
application logic is used to derive the data with their semantic relationships.
A new customer record that is entered into a customer database, for example, also notifies the
sales force automation system, the campaign management system, and the data warehouse in real
time. This means that data are not batch-loaded between systems at certain time points, but data
are exchanged on a real-time basis.
The underlying data exchange techniques are, for example, message-queuing techniques that
represent the data exchange logic and data exchange techniques to disperse data to other systems.
SAS Integration Technologies provide the techniques to integrate SAS into enterprise application
integration.
We have now imported our data and can start bringing the data into the appropriate structure.
Chapter 14
Transposing One- and Multiple-Rows-per-Subject
Data Structures
14.1 Introduction
Changing the data structure of data sets is a significant and common task in preparing data sets. In
this book, we call this transposing of data sets. The structure of data sets might need to be
changed due to different data structure requirements of certain analyses, and data might need to be
transposed in order to allow a join of different data sets in an appropriate way.
Here are the most common cases of the need for transposing:
The preparation of a one-row-per-subject data set might need the transposition of data
sets with multiple observations per subject. Other names for this task are pivoting or
flattening of the data. See “Putting All Information into One Row” for more details about
this.
Longitudinal or multiple-rows-per-subject data sets are converted to a one-row-per-
subject structure and vice versa depending on the requirements of different types of
analysis. We will discuss data structure requirements for SAS analytic procedures in
Appendix B.
Longitudinal data sets need to be switched from one type to another (standard form of a
time series, the cross-sectional data sets, or the interleaved time series) for various types
of longitudinal analysis. See Chapter 9 – The Multiple-Rows-per-Subject Data Mart for
details about this topic.
Bringing the categories of a table with transactional data into columns for further
analyses on a one-row-per-subject level.
Intermediate results or the output from SAS procedures need to be rearranged in order to
be joined to the analysis data set.
SAS is an excellent tool for transposing data. The functionality of the DATA step and PROC
TRANSPOSE provides data management power that exceeds the functionality of various
RDBMSs.
Terminology
When looking at the shape of the data sets, we are working with “long” data sets (many rows) and
“wide” data sets (many columns). In practice it has proven to be very intuitive to use the term
LONG for a multiple-rows-per-subject data set and WIDE for a one-row-per-subject data set in
this context. We can also use LONG to refer to univariate data sets and WIDE to refer to
multivariate data sets. See Table 14.1 for an illustration.
We will use the terminology LONG and WIDE in the following sections of this chapter.
Macros
In this chapter we will use the following macros:
Overview
In Chapter 8 – The One-Row-per-Subject Data Mart, we discussed two different ways to include
multiple-rows-per-subject data into a one-row-per-subject data set, namely transposing and
aggregating. We also saw that in some cases it makes sense to aggregate data. There are,
however, a lot of cases where the original data have to be transposed on a one-to-one basis per
subject into one row.
If we consider our two data sets in Table 14.1 the change from a LONG to a WIDE data set can
be done simply with the following code:
PROC TRANSPOSE DATA = long
OUT = wide_from_long(DROP = _name_)
PREFIX = weight;
BY id ;
VAR weight;
ID time;
RUN;
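This code is encapsulated in the macro %MAKEWIDE. The macro definition itself is not reproduced here; a minimal version that is consistent with the preceding statements and with the parameter legend below might look like this:

%MACRO MAKEWIDE(DATA=, OUT=, COPY=, ID=, VAR=, TIME=);
 PROC TRANSPOSE DATA = &data
      OUT = &out(DROP = _name_)
      PREFIX = &var;
   BY &id &copy;
   VAR &var;
   ID &time;
 RUN;
%MEND;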
Note that the data must be sorted by the ID variable. The legend to the macro parameters is as
follows:
ID
The name of the ID variable that identifies the subject.
COPY
A list of variables that occur repeatedly with each observation for a subject and will be
copied to the resulting data set. Note that the COPY variable(s) must not be used in the
COPY statement of PROC TRANSPOSE for our purposes, but are listed in the BY
statement after the ID variable. We assume here that COPY variables have the same
values within one ID.
VAR
The variable that holds the values to be transposed. Note that only one variable can be
listed here in order to obtain the desired transposition. See the following section for an
example of how to deal with a list of variables.
TIME
The variable that numerates the repeated measurements.
Note that the TIME variable does not need to be a consecutive number; it can also have non-
equidistant intervals. See the following data set from an experiment with dogs.
%MAKEWIDE(DATA=dogs_long,
OUT=dogs_wide,
ID=id,
COPY=drug depleted,
VAR=Histamine,
TIME=Measurement);
DOGS_LONG data set
If we have, in addition to HISTAMINE, a variable HEAMOGLOBIN in our DOGS data set from
the preceding example, the following statements will result in a data set with variables
HISTAMINE1–HISTAMINE3 and HEAMOGLOBIN1–HEAMOGLOBIN3.
%MAKEWIDE(DATA=dogs_long_2vars,OUT=out1,ID=id, COPY=drug depleted,
VAR=Histamine, TIME=Measurement);
%MAKEWIDE(DATA=dogs_long_2vars,OUT=out2,ID=id,
VAR=Heamoglobin, TIME=Measurement);
DATA dogs_wide_2vars;
MERGE out1 out2;
BY id;
RUN;
Note that in the second invocation of the %MAKEWIDE macro no COPY variables were given,
because they were transposed with the first invocation. Also note that every variable requires its
separate transposition here. If there are a number of variables to transpose another way can be of
interest.
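The step that creates the intermediate table DOGS_TMP is not reproduced above. A sketch of it, assuming the data set DOGS_LONG_2VARS is sorted by ID, DRUG, DEPLETED, and MEASUREMENT:

PROC TRANSPOSE DATA = dogs_long_2vars OUT = dogs_tmp;
 BY id drug depleted measurement;
 VAR Histamine Heamoglobin;
RUN;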
DATA dogs_tmp;
SET dogs_tmp;
Varname = CATX("_",_NAME_,measurement);
RUN;
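A second PROC TRANSPOSE then brings the concatenated names into columns; this sketch produces one variable per measurement and original variable (HISTAMINE_1, HEAMOGLOBIN_1, and so on):

PROC TRANSPOSE DATA = dogs_tmp OUT = dogs_wide_2vars(DROP = _name_);
 BY id drug depleted;
 VAR col1;
 ID varname;
RUN;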
It is up to you to decide which version best suits your programs. The last version has the
advantage that it can easily be converted into a SAS macro, because the code itself does not change
with the number of variables that will be transposed. To create a macro, the data set names and the
variable lists in the preceding code have to be replaced by macro variables.
Note the use of the CATX function, which allows an elegant concatenation of strings with a
separator. For details about the CAT function family, see SAS OnlineDoc.
Overview
In the following cases multiple observations per subject need to be in multiple rows:
when using PROC GPLOT in order to produce line plots on the data
when using PROC REG or other statistical procedures to calculate, for example, trend or
correlation coefficients
when calculating aggregated values per subject such as means or sums with PROC
MEANS or PROC SUMMARY
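The transposition statements referred to next are not reproduced above. A minimal sketch, assuming the WIDE data set from Table 14.1 is named WIDE and contains the variables ID and WEIGHT1, WEIGHT2, and so on:

PROC TRANSPOSE DATA = wide(KEEP = id weight:)
     OUT = long_from_wide(RENAME = (col1 = weight))
     NAME = _measure;
 BY id;
RUN;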
Note that with these statements we will receive a variable _MEASURE that contains for each
observation the name of the respective variable of the WIDE data set (WEIGHT1,
WEIGHT2, …).
The following statements convert this to a variable with only the measurement numbers in a
numeric format.
*** Create variable with measurement number;
DATA long_from_wide;
SET long_from_wide;
FORMAT Time 8.;
time = INPUT(TRANWRD(_measure,"weight",''),8.);
DROP _measure;
RUN;
Table 14.5: LONG_FROM_WIDE table with the time sequence in variable TIME
The macro
This code can be encapsulated simply in the following macro %MAKELONG:
%MACRO MAKELONG(DATA=,OUT=,COPY=,ID=,ROOT=,MEASUREMENT=Measurement);
PROC TRANSPOSE DATA = &data(keep = &id &copy &root.:)
OUT = &out(rename = (col1 = &root))
NAME = _measure;
BY &id &copy;
RUN;
*** Create variable with measurement number;
DATA &out;
SET &out;
FORMAT &measurement 8.;
&Measurement = INPUT(TRANWRD(_measure,"&root",''),8.);
DROP _measure;
RUN;
%MEND;
Note that the data must be sorted by the ID variable. The legend to the macro parameters is as
follows:
ID
The name of the ID variable that identifies the subject.
COPY
A list of variables that occur repeatedly with each observation for a subject and will be
copied to the resulting data set. Note that the COPY variable(s) must not be used in the
COPY statement of PROC TRANSPOSE for our purposes, but should be listed in the BY
statement after the ID variable. We assume here that COPY variables have the same
values within one ID.
ROOT
The part of the variable name (without the measurement number) of the variable that will
be transposed. Note that only one variable can be listed here in order to obtain the desired
transposition. See below for an example of how to deal with a list of variables.
MEASUREMENT
The variable that numerates the repeated measurements.
Note that the TIME variable does not need to be a consecutive number, but can also have non-
equidistant intervals. See the following data set from an example with dogs.
%MAKELONG(DATA=dogs_wide,
          OUT=dogs_long_from_wide,
          ID=id,
          COPY=drug depleted,
          ROOT=Histamine,
          MEASUREMENT=Measurement);
DOGS_LONG_FROM_WIDE data set
Note that in the second invocation of the %MAKELONG macro no COPY variables were given
because they were transposed with the first invocation. Also note that every variable requires its
separate transposition here.
In Appendix C.2 we introduce a macro that performs transpositions from a WIDE to a LONG data
set by using a SAS DATA step instead of PROC TRANSPOSE.
Simple example
In our simple example we will use the data set SAMPSIO.ASSOCS. This data set holds market basket
basket data in transactional form. Each PRODUCT a CUSTOMER purchased is represented on a
separate line. The variable TIME gives the purchasing sequence for each customer.
In order to transpose this data set into a one-row-per-subject data set, PROC TRANSPOSE will
be used.
The big advantage of PROC TRANSPOSE as part of SAS compared with other SQL-like
languages is that the list of possible products itself is not part of the syntax. Therefore, PROC
TRANSPOSE can be used flexibly in cases where the list of products is extended.
Printing the data for customer 45 into the output windows gives the following:
Obs CUSTOMER TIME PRODUCT
316 45 0 corned_b
317 45 1 peppers
318 45 2 bourbon
319 45 3 cracker
320 45 4 chicken
321 45 5 ice_crea
322 45 6 ice_crea
We see that customer 45 has an "ice_crea" entry at times 5 and 6. PROC TRANSPOSE, however,
does not allow duplicate ID values per subject. We will therefore introduce the macro
%TRANSP_CAT, which uses PROC FREQ before PROC TRANSPOSE in order to compress
duplicate entries per subject.
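The definition of the %TRANSP_CAT macro falls on a page that is not reproduced here. A minimal version that is consistent with the description and the parameter legend below might look like this; the name of the temporary table _CAT_FREQ_ is an assumption.

%MACRO TRANSP_CAT(DATA=, OUT=, ID=, VAR=);
 *** Compress duplicate entries per subject into frequency counts;
 PROC FREQ DATA = &data NOPRINT;
   TABLES &id * &var / OUT = _cat_freq_(DROP = percent);
 RUN;
 *** The PROC FREQ output is already sorted by the ID variable;
 PROC TRANSPOSE DATA = _cat_freq_ OUT = &out(DROP = _name_ _label_);
   BY &id;
   VAR count;
   ID &var;
 RUN;
%MEND;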
Note that unlike our simple example we use a VAR statement in PROC TRANSPOSE, which
holds the frequency for each category per subject. This frequency is then inserted into the
respective column, as we can see from the transposed data set for customer 45.
ID
The name of the ID variable that identifies the subject.
VAR
The variable that holds the categories, e.g., in market basket analysis the products a
customer purchased.
Note that this can also be processed simply as in the following code. Here we use the property that
bourbon NE . evaluates to 1 for all bourbon values other than MISSING, and to 0 for MISSING
values.
bourbon = (bourbon NE .);
For a list of variables this can be done more efficiently by using an ARRAY statement in the SAS
DATA step:
ARRAY prod {*} apples artichok avocado baguette bordeaux bourbon
chicken coke corned_b cracker ham heineken hering
ice_crea olives peppers sardines soda steak turkey;
DO i = 1 TO dim(prod);
IF prod{i}= . THEN prod{i}=0;
END;
DROP i;
As the replacement of missing values of a list of variables occurs very frequently, the preceding
code is shown here as a macro version:
%MACRO REPLACE_MV(cols,mv=.,rplc=0);
ARRAY varlist {*} &cols;
DO _i = 1 TO dim(varlist);
IF varlist{_i} = &mv THEN varlist{_i}=&rplc;
END;
DROP _i;
%MEND;
COLS
The list of variables for which missing values will be replaced.
MV
The definition of the missing value, default = .
RPLC
The replacement value, default = 0.
In Appendix C.3 we will introduce a macro for the DATA step version of a transposition of
transactional categories. This version also automatically generates the list of variables for the
replacement of missing values.
General
In Chapter 9 we saw an example of a key-value table. In this section we will see how we can
create a key-value table from a one-row-per-subject table and how we can re-create a one-row-
per-subject table from a key-value table.
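The first step, which is not reproduced above, creates the one-row-per-subject table. A minimal sketch based on SASHELP.CLASS with an artificial subject identifier:

DATA class;
 SET sashelp.class;
 ID = _N_;
RUN;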
Then we will transpose this table by the ID variable and use the NAME variable in the BY
statement in order to copy it to the key-value table.
PROC TRANSPOSE DATA = class OUT = class_tp;
BY ID name;
VAR sex age height weight;
RUN;
Then we will strip the leading and trailing blanks and rename the variables to KEY and
VALUE.
DATA Key_Value;
SET class_tp;
RENAME _name_ = Key;
Value = strip(col1);
DROP col1;
RUN;
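The way back from the key-value table to a one-row-per-subject table again uses PROC TRANSPOSE; a minimal sketch:

PROC TRANSPOSE DATA = Key_Value OUT = one_row(DROP = _name_);
 BY id name;
 VAR value;
 ID key;
RUN;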
DATA one_row;
SET one_row(RENAME = (Age=Age2 Weight=Weight2 Height=Height2));
FORMAT Age 8. Weight Height 8.1;
Age = INPUT(Age2,8.);
Weight = INPUT(Weight2,8.);
Height = INPUT(Height2,8.);
DROP Age2 Weight2 Height2;
RUN;
The DATA step is needed to create numeric variables from the textual key values. Note that we first
rename the textual variables in a data set option in order to be able to give the numeric
variables their original names.
Chapter 15
Transposing Longitudinal Data
15.1 Introduction
In Chapter 10 – Data Structures for Longitudinal Analysis, we discussed the different data
structures that can be found in the context of longitudinal data.
In the following sections we will show how the data structure can be switched between the three
main data mart structures and also between combinations of them. In time series analysis there is
a need to prepare the data in different structures in order to be able to specially analyze VALUES
and CATEGORIES.
Considering our three main entities in longitudinal data structures (see Chapter 10), we have the
following:
Data in cross-sectional dimension data structures are usually sorted first by the cross-sectional
dimension, and then by date. This sort is omitted in the following examples; otherwise, more rows
of the table would need to be shown in the figures in order to visualize the different cross-
sectional dimensions. If desired, at the end of each cross-sectional dimension example the data
can be sorted with the following statement:
PROC SORT DATA = cross_section_result;
BY <cross section> DATE;
RUN;
General
In this section we will see how we can change between the typical longitudinal data structures:
standard time series form and interleaved time series (STANDARD and
INTERLEAVED)
cross-sectional dimensions and standard time series form (CROSS SECTIONAL and
STANDARD)
PROC TRANSPOSE allows a short and elegant syntax for the change between different
longitudinal data structures. The variable names and resulting new categories are not
included in the syntax. It can therefore flexibly be used if variables are added.
Data need to be sorted by DATE because PROC TRANSPOSE uses DATE in the BY
statement.
Transferring data from columns to rows only needs a BY statement and the NAME
option in PROC TRANSPOSE.
Transferring data from rows to columns needs a BY, ID, and VAR statement in PROC
TRANSPOSE.
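As an illustration of the last two points, minimal sketches for both directions might look like this; the table names STANDARD and INTERLEAVED and the value variables behind them are assumptions.

*** Columns to rows: standard form to interleaved time series;
PROC TRANSPOSE DATA = standard OUT = interleaved NAME = _type_;
 BY date;
RUN;

*** Rows to columns: interleaved time series back to the standard form;
PROC TRANSPOSE DATA = interleaved OUT = standard(DROP = _name_);
 BY date;
 ID _type_;
 VAR col1;
RUN;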
As you can see, besides the type of sorting, there is no technical difference between a cross-
sectional data structure and an interleaved data structure. Both data structures can be converted
into the standard time series form and be retrieved from the standard time series form with the
same statements.
General
When considering our VALUE variables and our product CATEGORIES we have the following
scenarios.
The CATEGORIES in cross sections and VALUES interleaved have some similarity
with the key-value table we discussed in Chapter 14.
In order to create a standard form if more than one cross section and/or interleaved group
exists, a preceding DATA step is needed to create a concatenated variable that is used for
the transposition.
The SAS function family CAT* allows concatenation of text strings with different
options by automatically truncating leading and trailing blanks.
Chapter 16
Transformations of Interval-Scaled Variables
16.1 Introduction
In this chapter we will deal with transformations of interval variables. With interval variables we
mean variables such as age, income, and number of children, which are measured on an interval
or ratio scale. The most obvious transformation of interval variables is to perform a calculation
such as addition or multiplication on them. However, we will also see in this chapter that the
transformations are not restricted to simple calculations.
This chapter is divided into six sections, which will deal with the following topics:
Simple derived variables, where we will look at simple calculations between interval variables
and the creation of interactions.
Relative derived variables, where we will consider the creation of ratios and proportions by
comparing a value to another value.
Time and time intervals, where we will see which calculation methods are available for time
intervals and which representations for time variables are possible.
Binning observations into groups, where we will show that an interval value can be grouped with
IF_THEN/ELSE clauses or formats.
Replacement of missing values, where we will deal with replacing missing values with the mean
of other variables.
Macros
In this chapter we will introduce the following SAS macros:
Overview
The rationale for derived variables is that we create variables that hold information that is more
suitable for analysis or for interpretation of the results. Derived variables are created by
performing calculations on one or more variables and creating specific measures, sums, ratios,
aggregations, or others.
Derived variables can be a simple calculation on one or more variables. For example, we can
create the body mass index (BMI) with the following expression if WEIGHT is measured in
kilograms and HEIGHT is measured in meters:
BMI = Weight / Height **2;
In this section we will deal with interactions, quadratic terms, and derived sums from a list of
variables.
Variables that contain an interaction between two variables can be easily built by a simple
multiplication of the variables. The following code computes an interaction variable of AGE and
WEIGHT:
INT_AGE_WEIGHT = AGE * WEIGHT;
Note that by definition the interaction is missing if one of its components is missing, which is true
for our expression. In some procedures, e.g., PROC REG, interactions cannot be created in the
MODEL statement but need to be present in the input data.
A quadratic term and cubic term for AGE can be easily created by the following statement:
AGE_Q = AGE ** 2;
AGE_C = AGE ** 3;
Note that it is up to the programmer how to name these variables; AGE2 and AGE3 are also
common names for the quadratic and cubic terms. When creating derived variables, consecutive
numbering of variable names is very common. Although this "quick and dirty" approach is
not the best programming style, we have to consider that the name AGE2 for the quadratic term
can easily be confused with an age variable that is created with a different definition.
See also Chapter 20 – Coding for Predictive Modeling, for the visualization of quadratic
relationships between variables and a discussion of the specification of quadratic terms in
predictive modeling.
VARS
The list of variables for which interactions will be created.
QUADR
Determines whether quadratic terms will also be created (Default = 1) or if only the non-
diagonal elements of the variable-matrix will be created (QUADR = 0).
PREFIX
The prefix for the new variable names, Default = INT.
Here are two examples of its invocation and the respective code that is created:
%INTERACT(age weight height,quadr=1);
and
%INTERACT(age weight height,quadr=0);
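The generated code is not reproduced here. For the first invocation it might look like the following sketch; the input table CUSTOMER and the exact variable names, built with the default PREFIX INT, are assumptions, and with QUADR=0 only the first three assignments would be generated.

DATA interactions;
 SET customer;
 INT_age_weight    = age * weight;
 INT_age_height    = age * height;
 INT_weight_height = weight * height;
 INT_age_age       = age * age;
 INT_weight_weight = weight * weight;
 INT_height_height = height * height;
RUN;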
The SAS functions also have the advantage that they handle missing values correctly. If, for
example, USAGE2 contains a missing value, the SUM function will not return a missing value but will
sum the non-missing values. The MEAN function will correctly divide by the number of non-missing values.
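A minimal sketch with assumed variables USAGE1 through USAGE3:

Usage_Sum  = SUM(usage1, usage2, usage3);
Usage_Mean = MEAN(OF usage1-usage3);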
We build a variable by summing the YES answers for each subject. This variable will contain
information about the customer's opinion of and loyalty to the company.
LOYALTY = sum((question1=1), (question2=1), (question3=1),
(question4=1), (question5=1));
We sum the result of the Boolean expressions “QuestionX = 1”. Again, we have an example of
how an informative variable can be coded simply. This variable ranges from 0 to 5 and is a point
score for loyalty. This can be generalized to all situations where we want to count events that
fulfill a certain condition. The expression itself can, for example, also be in the form “Value >
200” or “Tariff IN (‘START’ ‘CLASSIC’)”.
General
For some measurement variables it makes sense to set them relative to another measurement
variable. The business rationale for derived relative variables is that the absolute number is not
meaningful and we have to set the value relative to another value. For example, the relative cost
of a hospital per patient day is used for comparison rather than the absolute cost. Basically we can
distinguish between two types of derived relative variables, namely proportions and ratios.
We talk about proportions if the numerator and the denominator have the same measurement unit.
In contrast, ratios are calculated by a fraction, where the numerator and denominator have
different measurement units.
Proportions can be expressed in percent and are interpreted as “part of the whole.” For example,
65% of the employees of company A have used tariff XY in a proportion, whereas the average
number of phone calls per day is a ratio, as we divide, for example, the monthly number of phone
calls by the number of days per month.
Frequently used denominators for derived relative variables include the following:
number of employees
number of days since the start of the customer relationship
number of days since admission to the hospital or since the start of treatment
number of months in the observation period
number of contracts
total amount of usage (minutes, number of calls, transactions, euro, dollars)
Use SAS functions to create sums or means because they can deal with missing values. If one of
the variables contains a missing value, the expression
CALLS_TOTAL = CALLS_FIXED + CALLS_INTERNATIONAL + CALLS_MOBILE;
returns a missing value, whereas
Calls_total = SUM(calls_fixed, calls_international, calls_mobile);
will ignore the MISSING value and sum the non-missing values.
Use an ARRAY statement if more variables have to be divided by the same number:
ARRAY abs_vars {*} calls_fixed calls_international calls_mobile;
ARRAY rel_vars {*} calls_fixed_rel calls_international_rel
                   calls_mobile_rel;
DO i = 1 TO DIM(abs_vars);
   rel_vars{i} = abs_vars{i}/calls_total;
END;
DROP i;
For an explanation consider a simple example. We will use the data from SASHELP.CLASS and
only concentrate on the variable WEIGHT. The mean weight of the 19 pupils is 100.03 pounds.
We want to use this mean to create derived relative variables that show whether a subject has a
value above or below the mean. In the SAS program in the next section we will create these
derived variables:
WEIGHT_SHIFT
The mean is subtracted from the values, which results in a shift of the distribution to a
zero mean; the shape of the distribution is not changed.
WEIGHT_RATIO
The values are divided by the mean. Assuming a positive mean, all positive values that
are smaller than the mean are squeezed into the interval (0,1), whereas positive values
larger than the mean are transferred to the interval (1,∞).
WEIGHT_CENTRATIO
The centered ratio combines the calculations from WEIGHT_SHIFT and
WEIGHT_RATIO by shifting the distribution to a zero mean and dividing the values by
the mean.
WEIGHT_STD
Subtracting the mean and dividing the values by the standard deviation leads to the
standardized values.
WEIGHT_RNK
Ranking the values by their size gives the ordinal-scaled variable WEIGHT_RNK. Ties
have been set to the lower rank.
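The full SAS program is not reproduced in this extract. The following sketch shows how the subject identifier and the first derived variables might be created (data set names are assumptions; PROC STANDARD and PROC RANK are used as described in the notes that follow):

DATA class;
   SET sashelp.class(KEEP=weight);
   id = _N_;                                    /* artificial subject identifier */
RUN;

PROC MEANS DATA=class NOPRINT;
   VAR weight;
   OUTPUT OUT=class_mean MEAN=w_mean;
RUN;

DATA class;
   SET class;
   IF _N_ = 1 THEN SET class_mean(KEEP=w_mean);
   Weight_Shift     = weight - w_mean;              /* shift to zero mean */
   Weight_Ratio     = weight / w_mean;              /* ratio to the mean  */
   Weight_CentRatio = (weight - w_mean) / w_mean;   /* centered ratio     */
   DROP w_mean;
RUN;

/* PROC STANDARD overwrites the analyzed variable, so it is run on a copy */
DATA class2;
   SET class(KEEP=id weight);
RUN;

PROC STANDARD DATA=class2 MEAN=0 STD=1 OUT=class2;
   VAR weight;
RUN;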
DATA class;
MERGE class class2(RENAME = (weight = Weight_Std));
BY id;
RUN;
makes sense. The drawback with this variable, however, is that for positive values, such
as counts, the distribution is not symmetrical.
From a programming point of view consider the following:
In the preceding code we artificially created an ID variable from the automatic _N_
variable in order to have a subject identifier in place.
We needed this subject identifier to merge the results of PROC STANDARD with
the original data because PROC STANDARD does not create a new variable but
overwrites the existing variables.
We used PROC RANK to create the ordinal variable WEIGHT_RNK and used the
TIES=LOW option in order to use the lower rank for tied values.
If the variable WEIGHT_RNK should not have the same value for ties, but instead one
observation should randomly get the lower rank and another the higher rank, a variable
WEIGHT_RND can be created by adding a small artificial random number. The
probability that the resulting values of WEIGHT_RND have ties is very low. See the
following code:
WEIGHT_RND = WEIGHT + UNIFORM(1234) / 10;
General
Date and time variables as well as intervals between different points in time are an important
group of measurement variables for analysis tables. With time variables we mean not only
variables that represent time-of-day values, but all variables that represent data that are measured
on a time scale.
To show that the calculation of a time interval is not always simple, see the following example
with the calculation of age.
In a customer table we have stored the birthdate for a customer, which is obviously the start of the
interval when we want to calculate the age.
Here are the possible interval end dates for the calculation:

&SYSDATE
Not recommended! Calculates the age at the start of the SAS session (see the automatic macro variable &SYSDATE), which is not necessarily the actual date.

DATE()
Calculates the age at the current date. Note that this is not necessarily the end date that you need for your analysis. If the data are a snapshot of the last day of the previous month, then age should be calculated for this point in time and not for the current date.

&snapdate (e.g., %LET snapdate = "01MAY2005"d)
Calculates the age for a certain snapshot date. This makes sense if the age will be calculated for a certain point in time and if the value of age will not depend on the date when the data preparation was run. Also, a rerun of the data preparation process at a later time produces consistent results.
Other options for age values at certain events in the subject’s lifetime include the following:
This example also illustrates that in surveys, data collection, data entry systems, and customer
databases the date of birth should be collected instead of the age. If we have only the age, we need
to know at which point in time it was collected in order to derive the actual age value at a later time.
Having discussed the effect of the definition of the beginning and end of an interval, we will now
look at the definition of how intervals are measured.
When using this method there will always be cases where an age that is not exact is calculated. In
the case of a person’s age, the age value will be incremented some days before or after the real
birthday.
Note that these calculations are only approximations and do not reflect the exact number of years
or months between the two dates. Here we are back at the definition of the business problem and
the type of analysis we want to perform. If, for example, in data mining, we need a value for a
time interval and do not care whether the real interval length is 12.3444 or 12.36 years, and if the
definition is consistent for all subjects, we can use this method.
This is the reason why in the creation of data mining marts the situation is very often encountered
that derived variables are calculated like this:
AGE = (&snapdate - birthdate) / 365.2422;
MonthsSinceFirstContact = (&snapdate - FirstContact) / (365.2422/12);
Different from years and months, weeks have a constant number of days. Therefore, we can
simply divide the number of days between two dates by 7 in order to get the number of weeks.
The SAS function DATDIF calculates the difference in days between two dates. The result equals
the subtraction of two dates:
NR_DAYS1 = DATDIF('16MAY1970'd, '22OCT2006'd, 'ACTUAL');
NR_DAYS2 = '22OCT2006'd - '16MAY1970'd;
Both statements return the same results. In the DATDIF function, however, options are available
to force a 365-day or a 360-day calculation. See SAS Help and Documentation for details.
The INTCK function counts the number of interval boundaries between two dates or between two
datetime values. The function has three parameters—the interval name, which can include YEAR,
QTR, MONTH, WEEK; the beginning of the interval; and the end of the interval. For example:
INTCK('MONTH','16MAY1970'd,'12MAY1975'd);
evaluates to 60, because 60 month boundaries lie between May 16, 1970, and May 12, 1975.
The INTNX function increments a date, time, or datetime value by a given number of intervals.
The function has three parameters—the interval name, which can include YEAR, QTR, MONTH,
or WEEK; the beginning of the interval; and the number of intervals to increment. For example:
INTNX('MONTH','16MAY1970'd,100);
evaluates to "01SEP1978", because we have incremented the date May 16, 1970, by 100 monthly
intervals; by default INTNX returns the beginning of the resulting interval.
The age in years can now be calculated with the following formula:
AGE = FLOOR((INTCK('MONTH','16MAY1970'd,&current_date) -
      (DAY(&current_date) < DAY('16MAY1970'd))) / 12);
Note that this formula is taken from a SAS Quick Tip, presented by William Kreuter at
http://support.sas.com/sassamples/.
The equivalent to the preceding formula is the SAS function YRDIF. It returns the difference in
years between two dates. Here the number of days in leap years is calculated with 366 and in
other years with 365. See the following example for a comparison of the two calculation methods:
DATA _NULL_;
years_yrdiff = YRDIF('16MAY1970'd,'22OCT2006'd,'ACTUAL');
years_divide = ('22OCT2006'd - '16MAY1970'd) / 365.2242;
output;
PUT years_yrdiff= years_divide=;
RUN;
years_yrdiff=36.435616438
years_divide=36.437892122
The difference between the two values arises from the fact that dividing by an average year length
is correct only on average. For a particular interval the correction for the leap years is not exact.
The YRDIF function handles this by exactly calculating the number of days.
General
In some cases it is necessary to bin interval-scaled values into groups. For example, in the case of
a variable with hundreds of different values, the calculation of frequencies for each value does not
make sense. In these cases the observations are binned into groups and the frequencies are
calculated for the groups. Note that the terms grouping and binning are used interchangeably in
this chapter.
In this section we will deal with the possible methods used to group observations. Basically there
are three main ways to group observations: binning into groups of approximately equal size, for
example, with PROC RANK; binning into intervals of equal width in a DATA step; and individually
defined groups, for example, with SAS formats.
To bin the observations of the SASHELP.AIR data set into 10 groups we can use PROC RANK
as follows:
PROC RANK DATA = sashelp.air OUT = air
GROUPS = 10;
VAR air;
RANKS air_grp;
RUN;
The specification of a separate output data set with the OUT= option is in some cases
desirable but not necessary.
One or more variables can be ranked within one PROC RANK invocation.
The ordering of the ranks (group numbers) can be reversed using the DESCENDING
option in the PROC RANK statement.
The ranks start with 0. If ranks starting from 1 are needed, they have to be incremented in
a DATA step.
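For example, the group variable created by PROC RANK above can be shifted to start at 1 as follows:

DATA air;
   SET air;
   air_grp = air_grp + 1;   /* group numbers now run from 1 to 10 */
RUN;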
With the variable AIR_GRP1 we have created a group variable that bins observations
greater than 100 up to 110 into group 11, observations greater than 110 up to 120 into
group 12, and so on.
AIR_GRP2, AIR_GRP3, and AIR_GRP4 have the same grouping rule but assign
different values to the groups. AIR_GRP2 assigns the maximum value in each group as a
group label.
It is easy to change the code to receive group midpoints just by subtracting 5, as we see in
AIR_GRP3.
AIR_GRP4 gives a consecutive group numbering, which we receive by subtracting the
(minimum group number –1).
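The DATA step that creates the grouping variables AIR_GRP1 to AIR_GRP4 is not reproduced in this extract. A sketch consistent with the description above might look like the following (the value 11 as the minimum group number is an assumption based on the text):

DATA air_grouped;
   SET sashelp.air;
   air_grp1 = CEIL(air/10);           /* >100-110 -> 11, >110-120 -> 12, ... */
   air_grp2 = air_grp1 * 10;          /* maximum value of each group         */
   air_grp3 = air_grp2 - 5;           /* group midpoint                      */
   air_grp4 = air_grp1 - (11 - 1);    /* consecutive numbering starting at 1 */
RUN;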
Formatting the variable that holds the new group names is advisable. Otherwise, the
length of the variable is determined from the first assignment, which can cause truncation
of the group names.
The group names contain a numbering according to their size. This is advisable for a
sorted output.
The vertically aligned coding has the advantage that it is easier to read and to edit.
Using SAS formats
The preceding grouping can also be created by defining a SAS format and assigning the format to
the variable AIR during analysis:
PROC FORMAT;
VALUE air
. = '00: MISSING'
LOW -< 220 = '01: < 220'
220 -< 275 = '02: 220 - 274'
275 - HIGH = '03: > 275';
RUN;
The format can be assigned to the variable AIR by the following statement in every procedure call
or DATA step:
FORMAT air air.;
It is also possible to create a new variable in a DATA step by using the format in a PUT function:
DATA air;
SET sashelp.air;
Air_grp = PUT(air,air.);
RUN;
See the first five observations of the data set AIR in the following table:
General
In right-skewed distributions extreme values or outliers are likely to occur. The minimum for
many variables is naturally bounded at zero—for example, at all count variables or measurements
that logically can start only at zero. Examples of variables that usually have highly skewed
distributions and outliers in the upper values are laboratory values, number of events, minutes of
mobile phone usage, claim amounts in insurance, or loan amounts in banking.
The presence of a skewed distribution or of outliers has an effect on the analysis because some
types of analyses assume a normal (or close to normal) distribution. In order to achieve a close to
normal distribution of values, the data can be transformed, for example, with a logarithmic or root
transformation, or extreme values can be shifted or filtered.
If a close to normal distribution cannot be achieved this way, it has to be decided whether
non-parametric methods of analysis are applied that can deal with extreme values or skewed
distributions, or whether parametric methods are applied with the knowledge that certain
assumptions of these methods are violated.
First we calculate the minimum for each variable in order to see whether we have to add a
constant for the logarithm in order to have positive values:
PROC MEANS DATA = skewed MIN;
RUN;
Second we analyze the distribution with PROC UNIVARIATE and use ODS SELECT to display
only the tests for normality:
ODS SELECT TestsForNormality Plots;
PROC UNIVARIATE DATA = skewed NORMAL PLOT; RUN;
ODS SELECT ALL;
Minimum
------------
-1.0000000
------------
We apply a log and a root transformation to the data. From the PROC MEANS output we see that
the minimum for variable A is -1; therefore, we add a constant of 2 before applying the log and the
root transformation:
DATA skewed;
SET skewed;
log_a = log(a+2);
root4_a = (a+2) ** 0.25;
RUN;
We have used PROC MEANS with the MIN option to calculate the minimum of variable
A, even though PROC UNIVARIATE would also provide it. The advantage of PROC MEANS
with many variables that might need transformation is that it produces a compact tabular
output with the minima for all variables.
We use the options NORMAL and PLOT in the PROC UNIVARIATE statement in order
to receive tests for normality and the corresponding plots.
We use an ODS SELECT statement to output only those results of PROC UNIVARIATE
that are relevant to our task. This is not mandatory, but it keeps the output clear,
especially with many variables.
From the Kolmogorov-Smirnov statistic we see that in our case the log transformation
performs better than the root transformation.
Shifting the values of a variable that are larger or smaller than a certain value can be done easily
with the following statement:
IF a > 20 THEN a=20;
IF b < 0 THEN b=0;
An alternative is to use the MIN and MAX functions as shown in the following example:
a = MIN(a,20);
b = MAX(b,0);
The MIN function shifts values that are larger than 20 to the value 20. The MAX function shifts
values that are smaller than 0 to the value 0. This is an easy and short way to shift values without
using IF-THEN/ELSE clauses.
In order to simplify the coding we present the macros SHIFT and FILTER for the respective
tasks:
%MACRO SHIFT (OPERATION,VAR,VALUE, MISSING=PRESERVE);
%IF %UPCASE(&missing) = PRESERVE %THEN %DO;
IF &var NE . THEN &var = &operation(&var,&value);
%END;
%ELSE %IF %UPCASE(&missing) = REPLACE %THEN %DO;
&var = &operation(&var,&value);
%END;
%MEND;
VAR
The name of the variable whose values will be shifted.
VALUE
The value to which the variable values will be shifted.
OPERATION
Possible values are MIN or MAX. MIN shifts values down and MAX shifts values up.
Note that the value of this parameter is used directly as the function name in the
generated MIN or MAX function call.
MISSING
Defines whether missing values will be preserved or replaced. Possible values are
PRESERVE (= default value) or REPLACE.
Examples:
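The original examples are not reproduced in this extract; invocations along the following lines illustrate the macro (the generated statements follow directly from the macro definition above):

%SHIFT(MIN,a,20);                   /* generates: IF a NE . THEN a = MIN(a,20); */
%SHIFT(MAX,b,0,MISSING=REPLACE);    /* generates: b = MAX(b,0);                 */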
Note that the advantage of this macro is that it controls missing values. In the case of non-missing
values, the direct code is shorter than the macro invocation.
The macro FILTER creates an IF-THEN DELETE statement to filter certain observations:
%MACRO FILTER (VAR,OPERATION,VALUE, MISSING=PRESERVE);
%IF %UPCASE(&missing) = PRESERVE %THEN %DO;
IF &var NE . AND &var &operation &value THEN DELETE; %END;
%ELSE %IF %UPCASE(&missing) = DELETE %THEN %DO;
IF &var &operation &value THEN DELETE; %END;
%MEND;
VAR
The name of the variable that will be used in the filter condition.
OPERATION
Specifies the comparison operator. This can be any valid comparison operator in SAS
such as <, >, <=, >=, GT, LT, GE, and LE. Note that it has to be specified without
quotation marks.
VALUE
The value against which the variable values are compared in the filter condition.
MISSING
Defines whether missing values will be preserved or deleted. Possible values are
PRESERVE (= default value) or DELETE.
Examples:
Delete all observations that are larger than 20, but do not delete missing values:
%FILTER(a,>,20);
Delete all observations that are smaller than 0, but do not delete missing values:
%FILTER(a,<,0);
Delete all observations that are negative or zero, and also delete missing values:
%FILTER(a,<=,0,MISSING=DELETE);
Overview
Replacing missing values can range from a simple replacement with 0 or another predefined
constant to a complex calculation of replacement values. In this section we will show a macro that
replaces a missing value with a constant. We will also see how PROC STANDARD can be used
to replace missing values.
In this macro the replacement value can be chosen in order to replace missing values with other
values, for example, the mean, that has previously been calculated. The rationale for the macro
REPLACE_MV is to replace systematic missing values, e.g., missing values that can be replaced
by zeros.
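The REPLACE_MV macro itself and the PROC STANDARD example are not reproduced in this extract. As a minimal sketch of the PROC STANDARD approach (data set and variable names are assumptions), missing values of AGE can be replaced by the variable mean as follows:

PROC STANDARD DATA=ages REPLACE OUT=ages_mv;
   VAR age;   /* REPLACE substitutes missing values with the variable mean */
RUN;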
In the following results, we see that the missing values have been replaced by their mean 38:
Obs AGE
1 12
2 60
3 38
4 24
5 38
6 50
7 48
8 34
9 38
16.8 Conclusion
In this chapter we looked at derived variables and transformations. We have, however, skipped
those derived variables and transformations that are of particular interest in predictive modeling;
Chapter 20 – Coding for Predictive Modeling is devoted to this topic.
SAS Enterprise Miner has special data management nodes that allow the binning of observations
into groups, the handling of missing values, the transformation of interval variables, and more. We
will show examples of these functionalities in Chapter 28 – Case Study 4—Data Preparation in
SAS Enterprise Miner.
We also want to mention that the tasks that we described in the last sections of this chapter go
beyond simple data management that can be done with a few statements. This is the point where
SAS Enterprise Miner with special data preparation routines for data mining comes into play.
C h a p t e r 17
Transformations of Categorical Variables
17.1 Introduction
In this chapter we will deal with transformations of categorical variables. With categorical
variables we mean variables such as binary, nominal, or ordinal variables. These variables are not
used in calculations—instead they define categories. We will cover the following topics:
General considerations for categorical variables, such as formats and conversions between
interval and categorical variables.
Derived variables, where we will see which derived variables we can create from categorical
information.
Dummy coding of categorical variables, where we will show how categorical information can be
used in analysis, which allows only interval variables.
Multidimensional categorical variables, where we will show how the information of two or more
categorical variables can be combined into one variable.
Lookup tables and external data, where we will show how to integrate lookup tables and external
data sources.
In Chapter 11 – Considerations for Data Marts, we discussed the properties, advantages, and
disadvantages of numeric and character categorical variables.
The following example shows how formats can be used to assign the category name to the
category code. SAS formats for a numeric- and a character-coded gender variable are created.
These formats are used in PROC PRINT when the data set GENDER_EXAMPLE is printed.
PROC FORMAT;
VALUE gender 1='MALE' 0='FEMALE';
VALUE $gender_c M='MALE' F='FEMALE';
RUN;
DATA gender_example;
INPUT Gender Gender2 $;
DATALINES;
1 M
1 M
0 F
0 F
1 M
;
RUN;
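The PROC PRINT step that applies these formats is not shown in this extract; it might look like the following:

PROC PRINT DATA=gender_example;
   FORMAT gender gender. gender2 $gender_c.;
RUN;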
Obs    Gender    Gender2
  1    MALE      MALE
  2    MALE      MALE
  3    FEMALE    FEMALE
  4    FEMALE    FEMALE
  5    MALE      MALE
These implicit conversions can also cause unexpected results. It is therefore advisable to explicitly
convert them by using the functions INPUT or PUT.
The conversion from character to numeric in our preceding gender example is done with the
following statement:
NEW_VAR = INPUT(gender2, 2.);
The conversion from numeric to character in our preceding gender example is done with the
following statement:
NEW_VAR = PUT(gender, 2.);
For a more detailed discussion of type conversions see the SAS Press publication The Little SAS
Book: A Primer.
Overview
Different from numeric variables, not many different types of derived variables are created from
categorical data because we cannot perform calculations on categorical data. Therefore, derived
variables for categorical data are mostly derived either from the frequency distribution of values
or from the extraction and combination of hierarchical codes.
In a hierarchical code different elements of the code can define different hierarchies. For
example, a product code can contain the PRODUCTMAINGROUP code in the first
character.
In the case of multidimensional codes different characters can contain different sub-
classifications. For example a medical disease code contains in the first two digits the
disease code and in the third and fourth digits a classification of the location in the body.
By extracting certain digits from a code, derived variables can be created. The following example
shows a Product Code that contains the ProductMainGroup in the first digit, and in the second and
third digits the hierarchical underlying subproduct group. Extracting the code for the
ProductMainGroup can be done as in the following example:
DATA codes;
SET codes;
ProductMainGroup = SUBSTR(ProductCode,1,1);
RUN;
This type of derived variable is frequently used for classifications and aggregations when the
original categorical variable has too many different values.
1. Identify the category. This is done by calculating simple or advanced descriptive statistics.
If we want to create an indicator variable for the most common ProductMainGroup of the
preceding example, we first use PROC FREQ to create a frequency table:
PROC FREQ DATA = codes ORDER = FREQ;
TABLE ProductMainGroup;
RUN;
From the output (not printed) we see that ProductMainGroup 2 is the most frequent. The indicator
variable can be created with the following statement:
IF ProductMainGroup = '2' THEN ProductMainGroupMF = 1;
ELSE ProductMainGroupMF = 0;
or simply
ProductMainGroupMF = (ProductMainGroup = '2');
With these derived variables we can create indicators that describe each subject in relation to
other subjects. We can therefore determine how a certain subject differs from the population.
In the following coding example we have data from the call center records and from Web usage.
Both tables are already aggregated per customer and have only one row per customer. These two
tables are merged with the CUSTOMER_BASE table. In the resulting data set, variables that
indicate whether a subject has an entry in the corresponding table or not are created.
DATA customer;
MERGE customer_base (IN=in1)
Call_center_aggr (IN=in2)
Web_usage_aggr (IN=in3);
BY CustomerID;
HasCallCenterRecord = in2;
HasWebUsage = in3;
RUN;
With this method you add the variables HasCallCenterRecord and HasWebUsage to your data
mart. These variables can be used in the following ways:
However, not only the degrees of freedom are reduced. From a business point of view it is also
desirable to reduce the number of categories to an interpretable set.
PROC FREQ with the ORDER = FREQ option is an important tool for this task. Using the
ProductMainGroup data from the preceding example we create the following output:
PROC FREQ DATA = codes ORDER = FREQ;
TABLE ProductMainGroup;
RUN;
Product
Main Cumulative Cumulative
Group Frequency Percent Frequency Percent
-------------------------------------------------------------
2 3 42.86 3 42.86
3 2 28.57 5 71.43
1 1 14.29 6 85.71
4 1 14.29 7 100.00
We see that the ProductMainGroups 1 and 4 occur only once and we want to assign them to the
OTHERS group. This can be done with the following statements:
FORMAT ProductMainGroupNEW $6.;
IF ProductMainGroup IN ('1' '4') THEN ProductMainGroupNEW = 'OTHERS';
ELSE ProductMainGroupNEW = ProductMainGroup;
The FORMAT statement can be important here because otherwise the length of the
newly created variable corresponds to the length of the first character value that is
assigned to it. This might result in truncated category values such as "OTHE".
The IN operator "IN ('1' '4')" is much more efficient than a long row of OR expressions
such as ProductMainGroup = '1' OR ProductMainGroup = '4'.
In the case of many different categories the selection of the relevant groups can be
difficult. Here the “matrix selection” of characters out of the Output window can be
helpful.
The advantage of “OTHERS groups” is that a high number of potentially low frequent categories
are combined into one group. This helps to speed up data preparation and analysis and also makes
the interpretation of results easier.
In Chapter 23 – Scoring and Automation, we will deal with the problem of new and changing
categories that cause problems in the scoring process, which can partly be overcome with
OTHERS or UNKNOWN groups.
In the SAS Output window, the matrix selection can be activated by holding down the
ALT key, pressing the left mouse button, and moving the mouse pointer over the desired
characters (codes).
The selected values can be copied to the clipboard and pasted into the Program Editor for
the IN clause.
Note here that during selection in the Output window you must not extend the selection
beyond the last line. Otherwise, you will not be able to copy the selection to the clipboard.
The code from the preceding example might then look like the following:
FORMAT ProductMainGroupNEW $6.;
IF ProductMainGroup IN (
'1'
'4'
) THEN ProductMainGroupNEW = 'OTHERS';
ELSE ProductMainGroupNEW = ProductMainGroup;
Because we have to add the quotation marks manually, it can be more convenient to write
the list of codes on separate lines.
The data that are inserted from a matrix selection do not contain line feeds; therefore, the
appropriate number of empty lines has to be inserted manually before pasting from the
clipboard.
Such groupings of categories are not restricted to OTHERS groups, but can also be used to create
any composite category.
The table CODES1 in LOOKUP.XLS will then be opened in Excel and a column GROUP_NEW
is added. This column is filled with the appropriate category names.
After the new column GROUP_NEW is added to the spreadsheet and the spreadsheet is closed, it
can be re-imported. During the re-import a data set is created that can be used directly to create a
format with the CNTLIN option on PROC FORMAT.
DATA codes1;
   SET lookup.'codes1'n(RENAME = (ProductMainGroup = start
                                  Group_New = label));
   RETAIN fmtname 'ProductMainGroupNEW' type 'c';
RUN;
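The PROC FORMAT step that turns this control data set into the character format is not shown above; it would be along the following lines:

PROC FORMAT CNTLIN=codes1;
RUN;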
The format ProductMainGroupNEW can now be assigned to the variable ProductMainGroup and
will contain the new grouping.
DATA codes;
SET codes;
FORMAT ProductMainGroup $ProductMainGroupNEW.;
RUN;
General
While categorical information in variables such as REGION or GENDER is very straightforward
to understand for humans, statistical methods mostly can’t deal with it. The simple reason is that
statistical methods calculate measures such as estimates, weights, factors, or probabilities, and
values such as MALE or NORTH CAROLINA can’t be used in calculations.
To make use of this type of information in analytics, so-called dummy variables are built. A set of
dummy variables represents the content of a categorical variable by creating an indicator variable
for (almost) each category of the categorical variable.
This type of coding of dummy variables is also called GLM coding. Each category is represented
by one dummy variable.
The same is true for binary variables, e.g., gender male/female, where only one binary dummy
variable such as MALE (0/1) is needed to represent the information sufficiently.
In regression analysis, for one categorical variable with k categories, only k-1 dummy variables
are used. The omitted category is referred to as the reference category, and the estimates for the
dummy variables are interpreted as differences between the respective categories and the reference
category. A GLM coding where one category is not represented by a dummy variable but is treated
as the reference category is also referred to as reference coding.
In our EMPLOYMENT_STATUS example this means that we provide only three dummy
variables (EMPLOYED, UNEMPLOYED, and RETIRED) for the four categories. The
coefficients of these three variables are interpreted as the difference from the category
EDUCATION.
For our employment variables this would look like the following:
The advantage of this type of coding (deviation coding) is that in regression you also get an
estimate for the omitted category as the negative sum of the estimates of the dummy variables for
the other categories. This is very important for the business interpretation of the results, because in
many cases the definition and interpretation of a reference category as in GLM coding are not easy.
The estimate for a dummy variable is interpreted as the difference between that category and the
average over all categories.
Dummy codes can also be easily created with IF-THEN/ELSE clauses or SELECT-WHEN
clauses. The following example creates dummy variables for the deviation coding method:
DATA CUSTOMER;
 SET CUSTOMER;
 Employed = 0; Unemployed = 0; Education = 0;
 SELECT (Employment_Status);
  WHEN ('Employed')   Employed = 1;
  WHEN ('Unemployed') Unemployed = 1;
  WHEN ('Education')  Education = 1;
  OTHERWISE DO;
   Employed = -1;
   Unemployed = -1;
   Education = -1;
  END;
 END;
RUN;
Dummy variables for GLM or reference coding can be created in the same way:
DATA CUSTOMER;
 SET CUSTOMER;
 Employed = 0; Unemployed = 0; Education = 0; Retired = 0;
 SELECT (Employment_Status);
  WHEN ('Employed')   Employed = 1;
  WHEN ('Unemployed') Unemployed = 1;
  WHEN ('Education')  Education = 1;
  WHEN ('Retired')    Retired = 1;
  OTHERWISE;
 END;
RUN;
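Dummy variables for GLM coding can also be created directly from Boolean expressions; this is the style of coding referred to in the next paragraph and in the definitions table below (a minimal sketch):

DATA customer;
   SET customer;
   Employed   = (Employment_Status = 'Employed');
   Unemployed = (Employment_Status = 'Unemployed');
   Retired    = (Employment_Status = 'Retired');
   Education  = (Employment_Status = 'Education');
RUN;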
This type of coding creates short code, because there is no need for IF-THEN/ELSE clauses or
SELECT-WHEN clauses. Furthermore the definition of the variables can easily be put into a data
mart definitions table (see also Chapter 24 – Do’s and Don’ts When Building Data Marts). One
column contains the variable name and another column contains the definition.
Variables    Definition
Employed     (Employment_Status = 'Employed')
Unemployed   (Employment_Status = 'Unemployed')
Retired      (Employment_Status = 'Retired')
Education    (Employment_Status = 'Education')
This would not be possible with IF-THEN/ELSE clauses. A further advantage is that the
statements can be used in PROC SQL as well. IF-THEN/ELSE clauses would need to be
transferred to CASE/WHEN clauses for SQL.
PROC SQL;
   CREATE TABLE Customer_Dummy AS
   SELECT *,
          (Employment_Status = 'Employed')   AS Employed,
          (Employment_Status = 'Unemployed') AS Unemployed,
          (Employment_Status = 'Retired')    AS Retired,
          (Employment_Status = 'Education')  AS Education
   FROM customer;
QUIT;
Other coding methods exist, such as ordinal and polynomial coding. You can refer to the
SAS/STAT User's Guide in SAS OnlineDoc for the LOGISTIC procedure and its CLASS
statement, or to the SAS Press publication SAS for Linear Models.
17.6.1 Rationale
To consider the relationship between two or more categorical variables, cross tables are usually
created in statistical analysis. It is, however, also possible to concatenate the values of the
respective categorical variables and analyze the univariate distribution of these new variables.
Customers can use one or more of three products. Product use is stored in the indicator variables
PRODUCTA, PRODUCTB, and PRODUCTC. In order to analyze the usage patterns of the three
products, a concatenated variable PRODUCTSTRING is created:
ProductString = CAT(ProductA,ProductB,ProductC);
In this case, the result is a string variable with three digits and eight possible values (111, 110,
101, 100, 011, 010, 001, 000). The creation of a frequency table of the variable
PRODUCTSTRING gives insight about the most frequent product combinations and allows a
segmentation of customers. For example, PRODUCTSTRING = 100 can be named the
“PRODUCT_A_ONLY customers”.
A derived variable HAS_OPTIMAL_TARIFF can easily be created with the following statement:
HAS_OPTIMAL_TARIFF = (ACTUAL_TARIFF = OPTIMAL_TARIFF);
Note that we assume that the format of the two variables is the same (length, no leading and
trailing blanks) so that we compare them on a one-to-one basis.
If we want to create a variable TARIFF_MATRIX that concatenates the actual and the optimal
tariffs in one variable we can do this with the following statement:
TARIFF_MATRIX = CATX('_',PUT(actual_tariff,$2.),PUT(optimal_tariff,$2.));
Note that here we assume that the tariff names (or code) differ in the first two characters. In other
cases the PUT function needs to be amended accordingly.
In the following output the results can be seen for selected observations. Note that we have
created two meaningful derived variables, one indicating whether the customer has the optimal
tariff or not, and another variable describing the deviations between the optimal and actual tariff.
Both of these variables can also be important predictor variables for the contract cancellation
event of a customer.
actual_ optimal_ HAS_OPTIMAL_ TARIFF_
Obs tariff tariff TARIFF MATRIX
A potential additional important variable can be created by combining the information from
HAS_OPTIMAL_TARIFF and TARIFF_MATRIX:
IF HAS_OPTIMAL_TARIFF = 1 THEN TARIFF_MATRIX2 = 'OPTIMAL';
ELSE TARIFF_MATRIX2 = TARIFF_MATRIX;
This variable allows powerful classification of customers, by segmenting them into an OPTIMAL
group and groups of NON-OPTIMAL TARIFFS with the corresponding tariff matrix.
Further examples that are frequently used in this context are changes of the categories over time.
In the following example we use the IN= variables of the MERGE statement, which indicate
whether a customer has an entry on a certain product server, to create a concatenated variable.
DATA customer;
MERGE customer_base
Product_Server_A (IN = Server_A
RENAME = (PURCHASE_SUM = SUM_A))
Product_Server_B (IN = Server_B
RENAME = (PURCHASE_SUM = SUM_B))
Product_Server_C (IN = Server_C
RENAME = (PURCHASE_SUM = SUM_C))
Product_Server_D (IN = Server_D
RENAME = (PURCHASE_SUM = SUM_D))
;
BY customer_id;
ProdA = Server_A; ProdB = Server_B;
ProdC = Server_C; ProdD = Server_D;
ProductUsage = CAT(ProdA,ProdB,ProdC,ProdD);
ProductUsage1000 = CAT((SUM_A > 1000),(SUM_B > 1000),
(SUM_C > 1000),(SUM_D > 1000));
RUN;
Again we receive a variable that concatenates 0s and 1s. For PRODUCTUSAGE1000, however, a 1
is only inserted if the purchase amount for the respective product exceeds 1,000. We see from this
example that we can create meaningful derived variables with only a few lines of code.
Note that the values of a concatenated variable need not be 0 and 1, as we saw in our first example
in this section. Missing values for binary variables can be inserted as a period (.) or 9, resulting in
the possible values 0, 1, and 9. Also, nominal and ordinal classes as well as counts can be
concatenated in order to identify frequent combinations.
Concatenated variables also perform well in the analysis of multiple choice responses in surveys.
The multidimensional structure of a lot of yes/no answers is usually complicated to handle. With
concatenated variables the most frequent combinations can be identified and rare combinations
can be grouped into an OTHERS group.
The associations between several univariate segmentations can also be analyzed with
concatenated variables. For example, univariate segmentations for age, product usage, and
purchase sum can be analyzed for their most frequent combinations. Rare classes can be
combined; large classes can be split for another criterion. This type of segmentation is also called
business rule-based segmentation. In Chapter 26 – Case Study 2—Deriving Customer
Segmentation Measures from Transactional Data, we will work out a more detailed example on
this topic.
Finally, we should mention that the final definition of concatenated variables, in the sense of the
number of dimensions or definitions of the OTHERS group, is usually an iterative process. A
certain definition is proposed, univariate distributions are analyzed, and the definition is
reworked.
Overview
For a category more detailed data can be available. This information can be a more descriptive
name, i.e., a description for a code. In this case we use the term lookup table.
It can also be a list of properties that are associated with the category. In this case we are talking
about external data. Typical examples of categories for which lookup tables exist include the following:
postal codes
list of branches or sales outlets
product categories
product lists
tariff plans
lists of adverse events in clinical research
The dimension tables of a star schema can also be considered lookup tables. If a flat table (one-
row-per-subject table) has to be built from a star schema the dimension tables are “merged” with
the fact table.
Typical examples of external data, such as demographic attributes available per region, include the following:
age distribution
gender distribution
educational status
income situation
External data are in most cases merged directly with the base table. In some cases a set of formats,
one for each attribute in the external data, is created and applied to copies of the region variable.
These formats can be created with PROC FORMAT and a VALUE statement, as we saw in our
preceding gender example. For categorical variables with many different values it is more
convenient to create formats from SAS data sets. Additionally, the maintenance of those lookup
lists is easier and more efficient if the formats can be created from tables.
With a simple example we will show how to create a format from a data set table. We have the
following table BRANCHLIST.
From this table we want to create a SAS format for the variable BRANCHID. This can be done
with the following statement:
DATA BranchList;
SET BranchList(RENAME =(BranchID=start BranchName=label));
RETAIN fmtname 'BranchName' type 'n';
RUN;
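The control data set can then be turned into a format with PROC FORMAT and the CNTLIN= option (this step is not shown in the extract above):

PROC FORMAT CNTLIN=BranchList;
RUN;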
Note that in the case of duplicate rows in the table BRANCHLIST they have to be deleted with
the following statement:
PROC SQL;
   CREATE TABLE branchlist_nodup AS
   SELECT DISTINCT * FROM branchlist;
QUIT;
or alternatively
PROC SORT DATA = branchlist OUT = branchlist_nodup NODUP;
BY BranchID;
RUN;
Note that the format BRANCHNAME can be assigned to the BRANCHID variable in the base
table, which will then display the branch name instead. If we have more details for each branch,
such as the square meters of the branch, we can create another format BRANCH_M and assign
this format to a copy of the BRANCHID variable.
The alternative to creating a format for each attribute in the lookup or external data table is to
merge the whole lookup or external data table to the base table.
18.1 Introduction
If we have multiple numerical observations per subject, there are two major ways to make them
available in a one-row-per-subject structure. We can either transpose each value per subject into a
separate column, or we can aggregate the multiple observations and condense the inherent
information into some descriptive values. The pure transposition of values is needed for some
analyses such as a repeated measurement analysis of variance.
For many data mining analyses, however, the one-to-one transposition of the values does not
make much sense. If we have many observations per subject, we will want to create a set of
variables that sufficiently describes the properties of the repeated observation per subject. In
contrast to pure transposition, we call this the aggregation of multiple observations.
When aggregating multiple observations we can distinguish between two main groups—static
aggregation and trend aggregation. Only in trend aggregation do we consider the timely or
sequential ordering of observations and their values and create indicators for trends over time.
With static aggregation we aggregate only the values, using descriptive statistics, ignoring their
ordering.
In this chapter we will not deal with pure transposition, because we covered this in Chapter 14 –
Transposing One- and Multiple-Rows per Subject Data Structures, but we will cover the
aggregation of multiple observations.
Overview
In this chapter we will look at the following methods of aggregating information from multiple
observations.
Static aggregation, where we will discuss various ways to aggregate data with simple descriptive
statistics.
Correlation of values, where we will show how derived variables can show the correlation
between measurements or the correlation between measurements and a group mean.
Concentration of values, where we will show a special aggregation that provides the information
whether the sum per subject concentrates on a few measurements or is distributed over
measurements or subhierarchies.
Standardization of values, where we will show the ways to standardize values—for example, by
dividing through the subject’s mean value and other methods.
Derived variables, where we will deal with ways derived variables can be created that describe
the course over time of measurements.
We will also introduce the macro %CONCENTRATE, which calculates the concentration of the
total sum of values per subject on the top 50% of the subhierarchies.
We will also show a number of coding examples.
Overview
In this section we will deal with methods of static aggregation of multiple numeric values per
analysis subject. With static aggregations the following topics are of special interest:
various methods of basic descriptive aggregation measures that condense the information
into a set of variables
correlation of variables
a special measure that represents the concentration of values on a certain proportion of
observations
If we need the data in a multiple-rows-per-subject structure (LONG), we can convert them from a
one-row-per-subject structure (WIDE) with the following statements:
PROC TRANSPOSE DATA = WIDE OUT = LONG;
BY custId;
RUN;
DATA LONG;
SET LONG;
FORMAT Month 8.;
RENAME col1 = Usage;
Month = compress(_name_,'M');
DROP _name_;
RUN;
Note that the LONG format can be used for creating graphs with PROC GPLOT and for analysis
with PROC MEANS.
To run the analysis per subject, we use a CLASS statement. A BY statement would do
this too, but would require the data be sorted by CustID.
We use the NWAY option in order to suppress the grand total mean (and possible
subtotals if we had more than one class variable), so the output data set contains only
rows for the 10 analysis subjects.
Furthermore, we use the NOPRINT option in order to suppress the printed output,
which, unlike in our simple example with 10 observations, could otherwise overflow the
Output window with thousands of descriptive measures.
In the OUTPUT statement we specify the statistics that will be calculated. The
AUTONAME option creates the new variable name in the form
VARIABLENAME_STATISTIC.
If we have only one statistic specified, we can omit the AUTONAME option and the
variables in the aggregated data set will have the same names as in the input data set.
If we want to calculate different statistics for different input variables we can specify the
following in the OUTPUT statement:
SUM(Usage) = sum_usage
MEAN(Billing) = mean_billing
In the OUTPUT statement we immediately drop the _TYPE_ and the _FREQ_ variables.
We could also keep the _FREQ_ variable and omit the ‘N’ from the statistic list, as we
have done here for didactic purposes.
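Putting these points together, the aggregation step might look roughly like the following sketch (the data set and variable names, and the exact list of statistics, are assumptions based on the surrounding examples):

PROC MEANS DATA=long NWAY NOPRINT;
   CLASS CustID;
   VAR usage billing;
   OUTPUT OUT=cust_aggr(DROP=_TYPE_ _FREQ_)
          N= MEAN= STD= MIN= MAX= SUM= / AUTONAME;
RUN;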
For a complete discussion of PROC MEANS, see SAS Help and Documentation or the SAS Press
publication Longitudinal Data and SAS.
The preceding list can easily be extended by adding various descriptive measures such as skewness,
kurtosis, the median, interquartile ranges, ranges, and so on. If we were in a pure data mining scenario
this might have its benefits, because we might want to see which aggregations best fit the data. It
makes sense, however, to calculate only those measures that also have an adequate business
interpretation.
Using the mean is very obvious; it measures the average location of the distribution and
allows a differentiation of subjects by their magnitude of values.
Calculating the standard deviation allows us to distinguish subjects with an erratic
behavior of their time series from those with a smooth course over time.
The number of observations does not measure the values themselves, but gives
information about the number of periods for which each subject has observations. Here we
have to investigate whether the source data contain zeros or missing values for periods
without values. In some cases the number of observations with zero values or the number
of missing values can be a meaningful measure.
There is no rule about which measure to use for which analysis. It is more of a balancing act
between a small number of interpretable aggregates and a complete snapshot of the data.
If there are no special business reasons for different statistics measures per variable, it makes
sense to decide on one descriptive measure for better interpretability—for example, to use the
mean or the median as the location measure, or to use the standard deviation or the interquartile
range as the dispersion measure for all variables and not change between them per variable.
Correlation of individual values with the group mean or with a reference period
The business rationale here is that we can see whether a subject’s values are
homogeneous with the values of the population or whether the subject’s individual values
or course differs over time.
In practice this can be a good indicator of how the service a customer perceives (= usage) relates
to the amount he or she has to pay for it.
We run PROC CORR with a NOPRINT option and save the Spearman correlation
coefficients to a table.
We could also use the Pearson correlation coefficient, which can be specified with an
OUTP= option.
We specify the BY statement to run the analysis per CustID.
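The PROC CORR step described by these points is not reproduced in this extract; a minimal sketch might look like the following (the OUTS= option requests Spearman correlations; data set names are assumptions, and LONGITUD is assumed to be sorted by CustID):

PROC CORR DATA=longitud NOPRINT
          OUTS=corr_billing_usage(WHERE=(_type_='CORR'));
   BY CustID;
   VAR usage;
   WITH billing;
RUN;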
We receive one number per customer, holding the correlation of BILLING and USAGE values, as
in Table 18.6.
Then we join the means of all customers per month back to the original data:
PROC SQL;
CREATE TABLE interval_corr
AS SELECT *
FROM longitud a, m_mean_tp b
WHERE a.month = b.month;
QUIT;
Note that we use PROC SQL for the join because we want to join the two tables without sorting
for the variable MONTH, and then calculate the correlation per customer:
PROC CORR DATA = interval_corr
OUTS = Corr_CustID(WHERE = (_type_ = 'CORR')) NOPRINT;
BY CustID;
VAR Usage;
WITH m_mean;
RUN;
We see that customers 3 and 10 have a strong positive correlation with the mean course, and
customers 2 and 6 have a negative correlation with the mean course. In our small example
however, we have to admit that customers with high values are likely to have a positive
correlation because their values contribute much to the overall mean. For customer 9 the
correlation is missing because all of that customer’s values are the same.
In this case the concentration of values on a certain proportion of underlying entities is a useful
measure to describe the distribution.
Note that this measure is not restricted to repeated measurements because of hierarchical
relationships, but it can also be calculated in the case of repeated measurements over time. In this
case it measures how the values concentrate on a certain proportion of time periods.
We will look at a simple example where we have usage data per contract. Each customer can have
one or more contracts. The data are shown in Table 18.8.
We see that customer 1 has a high usage concentration on three of his six contracts, whereas
customer 2 has a rather equal usage distribution over his contracts.
We will now calculate a measure per customer that shows what percentage of the total usage is
done with 50% of the contracts. The business rationale of this measure is that we want to assume
different behavior of customers with an equal distribution versus customers with a usage
concentration on a few contracts. This variable has proven to be very valuable in event
prediction—for example, in marketing analyses for business customers, where several
hierarchical levels such as CUSTOMER and CONTRACT exist.
Note that we are not talking here about classic concentration coefficients such as entropy or Gini,
but we are defining concentration as the proportion of the sum of the top 50% subhierarchies to
the total sum over all subhierarchies per subject.
DATA
The name of the data set with the original data.
VAR
The name of the variable that holds the values whose concentration will be measured.
ID
The name of the ID variable of the higher relationship—for example, the customer ID in
the case of the “one customer has several contracts” situation.
The macro can be invoked with the following statement for the preceding data:
%concentrate(usage,usage1,CustID);
The name of the ID variable is copied from the input data set, and the name of the concentration
variable is created as <variable name>_CONC. The name of the output data set is
CONCENTRATE_<variable name>.
We see from the output that customer 1 has a concentration value of 0.9375, and customer 2 has a
value of 0.53488, which is consistent with our input data.
Note that the concentration value can range from 0.5 to 1—0.5 for the equal distribution over
subhierarchies and 1 for the concentration of the total sum on the 50% subhierarchies.
In the case of unequal numbers of subentities the macro averages the cumulative concentration at
the median entity with the value of the next higher entity. Strictly speaking this is only an
approximation but a good compromise between an exact result and a short program.
If concentrations are calculated for more than one variable, we will call the %CONCENTRATE
macro for each variable and join the results afterwards.
%concentrate(usage,usage1,CustID);
%concentrate(usage,usage2,CustID);
DATA _concentrate_;
MERGE concentrate_Usage1
concentrate_Usage2;
BY CustID;
RUN;
General
When we have time series of values available per subject, we can start analyzing their course over
time and derive indicators about the patterns over time. When looking at and analyzing patterns
over time, it is sometimes relevant to consider relative values instead of absolute values.
Here, the creation of relative values will be called the standardization of values. However, we do
not mean a statistical standardization in the sense of subtraction of the mean and the division by
the standard deviation, but we will consider different ways to make the values relative.
For example, we see that customer 1 has a usage drop in months M5 and M6.
Note that you can use this type of standardization not only to filter seasonal effect, but also to
consider the different number of days per month. For example a 10% decrease from January to
February or a 3% increase from April to May is probably explained by the number of days in
these months rather than a change in the subject’s behavior.
For the standardization per month we can calculate the means per month from a one-row-per-
subject data mart and output the results to the Output window.
PROC MEANS DATA = wide MAXDEC=2;
VAR M1 - M6;
OUTPUT OUT = M_Mean MEAN = /AUTONAME;
RUN;
From the Output window we copy and paste values to the program editor in order to use the
means directly in the DATA step. This is shown in the following example:
DATA interval_month;
SET wide;
FORMAT M1 - M6 8.2;
M1=M1/ 52.00;
M2=M2/ 55.80;
M3=M3/ 48.10;
M4=M4/ 55.67;
M5=M5/ 47.60;
M6=M6/ 51.50;
RUN;
The results are shown in Table 18.11. We see that the resulting data reflect the subject’s
individual level of values. However, we have corrected the values by the average over all subjects
in this month. This allows interpretation, whether a customer really increases or decreases his
values.
Table 18.11: Table with standardized values by the mean over all subjects for
the same time period
The individual mean per subject corresponds to a standardization by the row mean.
The mean over all subjects for the same time period corresponds to a standardization by
the column mean.
The last option, standardizing by both, corresponds to a standardization by both the row
mean and the column mean.
We see that the result of standardizing by the individual mean of the subject and the mean over all
subjects for the same time period equals the calculation of expected values from the margin
distributions in a cross table. Comparing to the expected value, we also see how each customer’s
values deviate from the value that we would expect based on the customer’s average usage and
the usage in the respective month.
A very elegant way to code the consideration of the individual mean of the subject and the mean
over all subjects for the same time period is to use PROC FREQ and an ODS statement. With this
method we can also calculate the two components separately.
The first step, which is optional, is to route the output into a file in order to avoid Output window
overflow. The reason is that we cannot use a NOPRINT option as in the other example, because
we need output to be created for the ODS statement.
PROC PRINTTO PRINT='C:\data\somefile.lst';
RUN;
We specify that we want to store the output object CrossTabFreqs in the FREQ table.
ODS OUTPUT CrossTabFreqs = freq;
We run PROC FREQ and specify the EXPECTED option to calculate the expected values from
the margin distributions.
PROC FREQ DATA = long;
TABLE CustID * Month / EXPECTED;
WEIGHT Usage;
RUN;
Finally, we close the ODS statement and route the output back to the Output window.
ODS OUTPUT CLOSE;
PROC PRINTTO;
RUN;
The resulting output data set also contains the margin distributions. We want to keep only the inner
cells of the table (_TYPE_ = '11').
DATA freq;
SET freq;
KEEP CustID Month Expected RowPercent ColPercent;
WHERE _type_ = '11';
RUN;
EXPECTED is the expected value after filtering the overall time and the individual
customer effect. This value needs to be compared to the actual value and corresponds to
the values that we call standardizing by the individual mean of the subject and the mean
over all subjects for the same time period.
ROWPERCENT is the usage proportion of the individual sum of its subject.
COLPERCENT is the usage proportion of the sum over all subjects for the same time
period. In the case of a high number of subjects, these values will get very small;
therefore, they might need to be multiplied by a constant.
Note that the ROWPERCENT and COLPERCENT columns are in absolute terms not the
same numbers as we calculated in the preceding examples, but they are proportional to
them.
Table 18.13: Table with expected value, row, and column percent
Standardizing by the individual mean of the subject allows us to see whether the subject's
values change over time. In event prediction, these variables can be important in order to
see whether a subject changes his behavior. For example, if we analyze usage data over
time, a drop in the usage data can be a good predictor for an event.
Standardizing by the mean over all subjects for the same time period is a very important
measure in describing the subject’s behavior. Here we can see whether he behaves
differently from other subjects.
Standardizing by both, by the individual mean of the subject and the mean over all
subjects for the same time period, creates values on a non-interpretable scale. It allows us
to filter all effects and to see whether a change took place over time.
The last method, using PROC FREQ and the ODS table CrossTabFreqs, allows the easy
calculation of the standardized values on a multiple-rows-per-subject table. The other code
example we saw in this section was based on a one-row-per-subject table.
Simple trend measures, where we create only differences and ratios of values over time.
Complex trends, where we calculate coefficients of the linear regression for the entire
time series or parts of it.
Time series analysis, where we calculate measures of time series analysis, including
trends but also autoregressive (AR) and moving average (MA) terms.
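The DATA step that creates the simple trend measures discussed below is not reproduced in this extract. The following is a minimal sketch, assuming that the last two months are compared with the first four months; the exact variable definitions, in particular that of LAST2MONTHSSTD, are assumptions consistent with the discussion:

DATA mart;
   SET wide;
   /* level of the last two months compared with the first four months */
   Last2MonthsDiff  = MEAN(of M5-M6) - MEAN(of M1-M4);
   Last2MonthsRatio = MEAN(of M5-M6) / MEAN(of M1-M4);
   /* difference relative to the earlier level */
   Last2MonthsStd   = (MEAN(of M5-M6) - MEAN(of M1-M4)) / MEAN(of M1-M4);
RUN;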
Note that in the last case we could also divide by the standard deviation of the values. In all cases
we use the MEAN function to calculate the mean of the values for the respective months. Note
that the SUM function would be appropriate only if we had the same number of observations for
each subject.
In Table 18.4 we see the output with the respective columns. From the sign of the difference for the variables LAST2MONTHSDIFF and LAST2MONTHSSTD, or from whether the value of the variable LAST2MONTHSRATIO is above or below 1, we can see whether a decrease or increase in usage has occurred.
In our example we are comparing the values of months 1–4 with the values of months 5–6. Other choices such as 1–3 versus 4–6 or 1–5 versus 6 would also be possible. The decision about which months are compared with each other depends on business considerations, a visual analysis of the course of the data, or both. Note that in the case of event prediction you might want to create a graph showing the mean course of values for event A versus the mean course of values for event B in order to detect the point in time when values decrease or increase for a certain group.
Note that these variables, unlike static aggregations, do not reflect the location of the values
themselves. The purpose of these derived variables is that they tell us about the course over time.
They are very suitable to compress the behavior over time into a single variable, which is
important for predictive analysis.
The usage of these types of variables has proven to be very predictive in the case of event
prediction—for example, in the case of prediction of contract cancellations based on usage drops.
Note, however, that the time windows for the time series and the target windows where the event is observed have to be chosen accordingly in order to avoid predicting only those customers who have already canceled their contracts and for whom marketing actions are no longer effective. See Chapter 12 – Considerations for Predictive Modeling for details.
We use a BY statement to perform regression analysis per CustID. The data must be
sorted for this variable in the input data set.
We use the NOPRINT option in order to suppress printed output.
We use the OUTEST option to create an output data set, which holds the regression
coefficients per CustID.
We specify a simple regression model USAGE = MONTH.
The data need to be in a multiple-rows-per-subject structure.
PROC REG is part of the SAS/STAT module.
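A minimal PROC REG step matching this description (the output data set name EST_TREND is an assumption) could look like this:
PROC REG DATA = long NOPRINT
         OUTEST = Est_Trend(KEEP = CustID Intercept month);
   MODEL usage = month;
   BY CustID;
RUN;
QUIT;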
The output is shown in the following table.
For non-centered data the intercept gives the predicted value at month 0. The slope coefficient in the variable MONTH shows the average change per month.
The advantage of this approach is that we can get an estimate for the time trend over all
observations per subject. If we want to restrict the calculation of the regression slope for a certain
time interval we need to add a WHERE statement to the preceding code:
WHERE month in (3,4,5,6);
This considers only the last four months of the available data.
In the following example, we create linear regressions for the entire time period and a separate
regression for months 5 and 6. Thus we can measure the long-term and short-term trend in two
separate variables.
We run separate regressions for the respective time periods and store the coefficients in separate
data sets with the following code:
PROC REG DATA = long NOPRINT
OUTEST=Est_LongTerm(KEEP = CustID month
RENAME = (month=LongTerm));
MODEL usage = month;
BY CustID;
RUN;
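By analogy, the short-term trend could be estimated by restricting the regression to months 5 and 6 (a sketch; the output data set name and coefficient name follow the merge below, the WHERE restriction is an assumption):
PROC REG DATA = long NOPRINT
         OUTEST = Est_ShortTerm(KEEP = CustID month
                                RENAME = (month=ShortTerm));
   MODEL usage = month;
   BY CustID;
   WHERE month IN (5,6);
RUN;
QUIT;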
Next we merge the two coefficient data sets together and receive a data set with two derived
variables that contain information about the trend:
DATA mart;
MERGE wide
est_longTerm
est_shortTerm;
BY CustID;
RUN;
The two variables LONGTERM and SHORTTERM contain the slopes of the regressions. We see that the coefficient in SHORTTERM equals the difference between the month 6 (M6) and month 5 (M5) values, because we had only two periods in that regression.
We create a format for the groups. Note that the limits of –1 and 1 are chosen arbitrarily and can
be chosen based on the distribution of coefficients.
PROC FORMAT;
VALUE est LOW -< -1 = '-'
-1 - 1 = '='
1 <- HIGH = '+';
RUN;
In the next step we create the derived variable LONGSHORTIND. In this variable we concatenate the values of LONGTERM and SHORTTERM, applying the format EST to both so that each is represented as '+', '=', or '-'.
DATA mart;
SET mart;
LongShortInd = CAT(put(LongTerm,est.),put(ShortTerm,est.));
RUN;
We thus receive an additional variable showing the group of the long-term trend in the first character and the group of the short-term trend in the second character. The concatenated indicator is a powerful classification of the trend over time as it considers both the long-term and the short-term trend. For predictive analysis of events this classification provides a powerful segmentation. Again, the
choice of the time intervals depends on business considerations and a visual analysis of the course over time.
Table 18.17: Long-term and short-term regression coefficients and the concatenated variable
per customer
There are many options for the parameterization of a time series model. Therefore, there are many
possible coefficients that can be created per subject. In this chapter we did not deal with all the
possible ways to create and parameterize the time series model, because that would have gone
beyond the scope of this book.
We will show a short example of how we can calculate coefficients per subject on the basis of a
time series. In our example we will use PROC TIMESERIES, which is part of SAS/ETS. We will
use the SASHELP.PRDSAL3 data, which we will first sort by STATE and DATE.
PROC SORT DATA = sashelp.prdsal3 OUT = Prdsal3;
BY state date;
RUN;
Next we will run PROC TIMESERIES. Note that we create output data sets for the SEASON, the
TREND, the decomposition (DECOMP), and the autocorrelations (CORR).
PROC TIMESERIES DATA=prdsal3
OUTSEASON=season
OUTTREND=trend
OUTDECOMP=decomp
OUTCORR = corr
MAXERROR=0;
BY state;
WHERE product = 'SOFA';
SEASON SUM / TRANSPOSE = YES;
TREND MEAN / TRANSPOSE = YES;
CORR ACOV / TRANSPOSE = YES;
DECOMP TCS / LAMBDA = 1600 TRANSPOSE = YES;
ID date INTERVAL=MONTH ACCUMULATE=TOTAL SETMISSING=MISSING;
VAR actual ;
RUN;
Output data sets for SEASON, TREND, DECOMP, and CORR can be created in one step by
using the appropriate statements and output data sets.
The option TRANSPOSE = YES is needed to create an output data set in a one-row-per-subject
structure.
The ID statement specifies the time variable and also allows us to perform aggregations in the
time hierarchy—for example, by specifying INTERVAL = QTR to perform an analysis per
quarter.
We see that we get one row per subject, which in this case is the STATE. In our PROC
TIMESERIES syntax we specified ACOV as the statistic option, which creates the
autocorrelations for various lags. The resulting tables can now be joined per subject ID, which
results in a list of variables describing the course of the time series.
If we specify more than one statistic for an output table we get a table with one row per analysis
subject and statistic. We can use a WHERE clause to select a certain statistic from this table.
Alternatively, we can transpose the table twice and concatenate the variable names LAGx with the
STATISTICS value to have one row per analysis subject and all measures in columns. In Chapter
14 we saw an example of transposing a data set twice.
As mentioned earlier, the application of these methods results in a high number of derived
variables, which are not always easy to interpret. Therefore, care has to be taken to control the
number of derived variables, and time series modeling know-how might be needed to select the
appropriate statistic.
Chapter 19
Multiple Categorical Observations per Subject
19.1 Introduction
General
In this chapter we will deal with the case where we have a one-to-many relationship between a
subject table and either a table with repeated measurements or a table that holds data from an
underlying hierarchy. We will investigate methods of preparing the data for a one-row-per-subject
table.
With interval data for certain types of analysis, it can make sense to transpose the data by subject
as they are and represent them in columns. We saw this in Chapter 18 – Multiple Interval-Scaled
Observations per Subject. With categorical data, however, it usually does not make sense to
transpose them as they are.
Categorical data are mostly aggregated per subject, and absolute and relative frequencies per category are created per subject. A trend between the categories can be measured in the form of a sequence analysis.
Our general aim in the case of multiple categorical observations per subject is to create a number
of derived variables that describe the distribution of the categories per subject. We will see that
the number of variables in the one-row-per-subject table can quickly explode with the increasing
number of categories and the number of categorical variables.
We will, however, introduce a number of them, because in the data mining paradigm a high
number of candidate variables is desirable. We mention again that business considerations play an
important role in variable selection.
By categorical variables we mean, in this chapter, all kinds of binary, nominal, ordinal, or grouped interval data.
We will show how we can create variables that hold the absolute and relative frequencies
for categories per subject.
We will show that it makes sense to group categories with rare values to an OTHERS
group in order to reduce the number of derived variables.
In order to reduce the number of derived variables we will show how we can create a
concatenated variable that holds the values for each category.
Simply looking at count statistics such as total and distinct counts also allows the creation
of a number of derived variables. We will illustrate this with an example.
We will show the business interpretation of these derived variables and give an overview
of other methods.
Introduction
In this section we will explore how multiple values of one categorical variable can be aggregated on a one-row-per-subject basis. We will deal only with a static consideration of a categorical variable and will not consider a potential temporal order of the observations.
We will deal with example data that are derived from the finance industry.
We have a CUSTOMER table with variables such as CUST_ID, AGE, GENDER, and
START OF CUSTOMER RELATIONSHIP.
We have a multiple-rows-per-subject ACCOUNTS table with one row per account. Here
we have the variables ACCOUNT_ID, ACCOUNT_TYPE, and BALANCE.
In the following examples we will deal with the variable ACCOUNT_TYPE, which can take the
values SAVINGS ACCOUNT, CHECKING_ACCOUNT, LOAN_ACCOUNT, CUM SAVINGS
ACCOUNT, and SPECIAL ACCOUNT. Example data are shown in Table 19.1.
Frequency|
Row Pct  |CHECKING|LOAN    |MORTGAGE|SAVINGS |SAVINGS2|SPECIAL |  Total
---------+--------+--------+--------+--------+--------+--------+
       1 |      1 |      1 |      0 |      2 |      0 |      0 |      4
         |  25.00 |  25.00 |   0.00 |  50.00 |   0.00 |   0.00 |
---------+--------+--------+--------+--------+--------+--------+
       2 |      1 |      0 |      0 |      0 |      1 |      0 |      2
         |  50.00 |   0.00 |   0.00 |   0.00 |  50.00 |   0.00 |
---------+--------+--------+--------+--------+--------+--------+
       3 |      1 |      1 |      1 |      1 |      0 |      0 |      4
         |  25.00 |  25.00 |  25.00 |  25.00 |   0.00 |   0.00 |
---------+--------+--------+--------+--------+--------+--------+
       4 |      1 |      0 |      0 |      0 |      0 |      0 |      1
         | 100.00 |   0.00 |   0.00 |   0.00 |   0.00 |   0.00 |
---------+--------+--------+--------+--------+--------+--------+
       5 |      1 |      1 |      0 |      3 |      1 |      1 |      7
         |  14.29 |  14.29 |   0.00 |  42.86 |  14.29 |  14.29 |
---------+--------+--------+--------+--------+--------+--------+
Total           5        3        1        6        2        1       18
Note that this table represents the account distribution in a one-row-per-subject structure. We have intentionally specified the NOPERCENT NOCOL options in order to receive only the COUNTS and the ROW PERCENTS. These two measures are frequently derived from multiple categorical observations per subject.
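A crosstabulation like the preceding one could be produced, for example, with the following PROC FREQ call (a sketch; it assumes that the ACCOUNTS table also carries the CUST_ID key):
PROC FREQ DATA = Accounts;
   TABLE Cust_ID * Account_Type / NOPERCENT NOCOL;
RUN;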
In the following sections we will show how we can create derived variables at the analysis subject level that hold the absolute and relative frequencies per account type. As a first optional step we will combine categories with low frequency into an OTHERS group, and then we will use PROC FREQ to create the derived variables.
We can see that the categories MORTGAGE and SPECIAL have low frequencies; we combine them with the following DATA step into a new variable ACCOUNT_TYPE2:
DATA Accounts;
SET Accounts;
FORMAT Account_Type2 $12.;
IF UPCASE(Account_Type) IN ('MORTGAGE','SPECIAL') THEN
Account_Type2 = 'OTHERS';
ELSE Account_Type2 = Account_Type;
RUN;
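The frequencies per customer and account type can then be written to a data set, for example with a call like this (a sketch; the data set name ACCOUNT_FREQS matches the transpose below):
PROC FREQ DATA = Accounts NOPRINT;
   TABLE Cust_ID * Account_Type2 / OUT = Account_Freqs(DROP = percent);
RUN;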
Note that we use the NOPRINT option here in order to avoid an overflow of the Output window.
We also drop the PERCENT variable from the output data set because it contains only
percentages over all observations, and we will calculate row percents manually in the next step.
PROC TRANSPOSE DATA = Account_Freqs
OUT = Account_Freqs_TP(DROP = _name_ _label_);
BY Cust_ID;
VAR Count;
ID Account_Type2;
RUN;
After transposing the data to a one-row-per-customer structure, we use ARRAY statements to replace the missing values and to calculate the row percents per customer. The result is shown in Table 19.2.
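A sketch of this DATA step, assuming the four transposed count variables CHECKING, LOAN, SAVINGS, and OTHERS and treating missing counts as zero:
DATA Accounts_Subject;
   SET Account_Freqs_TP;
   ARRAY cnt {*} Checking Loan Savings Others;
   ARRAY rel {*} Checking_rel Loan_rel Savings_rel Others_rel;
   DO i = 1 TO DIM(cnt);
      IF cnt{i} = . THEN cnt{i} = 0;
   END;
   DO i = 1 TO DIM(rel);
      rel{i} = cnt{i} / SUM(of cnt{*}) * 100;
   END;
   DROP i;
RUN;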
Table 19.2: Table with absolute and relative frequencies of ACCOUNT_TYPE2 per
customer
We created 2x4 derived variables for the account data. The number of derived variables increases
with the number of categories. Therefore, it is advisable to group the data to a reasonable number
of categories before creating the derived variables for absolute and relative frequencies.
Overview
When we look at the data in Table 19.2 we see that we have a number of variables that describe
the distribution of ACCOUNT_TYPE2. In Chapter 17, we saw the possibility of concatenating a
categorical variable to create a meaningful derived variable. We will now show how that can be
used here.
19.3.1 Coding
With the following code we create two variables that hold the absolute and relative frequencies of
the four account types in a concatenated variable:
DATA Accounts_Subject;
SET Accounts_Subject;
Account_Freqs = CATX('_',Checking, Loan, Savings,Others);
Account_RowPct = CATX('_',PUT(Checking_rel,3.), PUT (Loan_rel,3.),
PUT (Savings_rel,3.), PUT (Others_rel,3.));
RUN;
Note that we use the PUT function inside the CATX function in order to reduce the percentages to integer values.
The following output shows the most frequent values of ACCOUNT_ROWPCT for a table with
41,916 customers:
                                        Cumulative   Cumulative
Account_RowPct   Frequency   Percent     Frequency      Percent
----------------------------------------------------------------
0_100_0_0            12832     30.61         12832        30.61
100_0_0_0             9509     22.69         22341        53.30
50_0_0_50             4898     11.69         27239        64.98
33_0_0_67             1772      4.23         29011        69.21
0_0_100_0             1684      4.02         30695        73.23
67_0_0_33             1426      3.40         32121        76.63
0_0_50_50              861      2.05         32982        78.69
50_0_50_0              681      1.62         33663        80.31
25_0_0_75              652      1.56         34315        81.87
33_0_33_33             549      1.31         34864        83.18
75_0_0_25              423      1.01         35287        84.19
We have displayed only those values that have a relative frequency of at least 1%. The other
values can be combined into an OTHERS group.
This results in a variable with 12 categories, describing the most frequent combinations of row
percentages for the four categories CHECKING, LOAN, SAVINGS, and OTHERS in a
multivariate way.
In the case of more distributed values of the row percentages, it makes sense to define a format to
group the row percentages into buckets such as decile groups, and use this format in the PUT
function in the CATX function.
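A minimal sketch of this approach, using an assumed format PCTGRP that groups the row percentages into buckets of 25 percentage points:
PROC FORMAT;
   VALUE pctgrp LOW -< 25 = '00-24'
                 25 -< 50 = '25-49'
                 50 -< 75 = '50-74'
                 75 - HIGH = '75-100';
RUN;

DATA Accounts_Subject;
   SET Accounts_Subject;
   Account_RowPct_Grp = CATX('_', PUT(Checking_rel, pctgrp.),
                                  PUT(Loan_rel, pctgrp.),
                                  PUT(Savings_rel, pctgrp.),
                                  PUT(Others_rel, pctgrp.));
RUN;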
Overview
Another group of derived variables is based on the total number of observations and on the number of distinct categories per subject. These counts are very straightforward to calculate when we have multiple categorical observations per subject.
19.4.1 Coding
In the following code we will use PROC SQL to copy the number of distinct categories overall
into a macro variable and calculate the variables listed earlier. Note that PROC SQL has the
advantage of being able to calculate DISTINCT counts, which is not possible in PROC FREQ or
PROC MEANS.
PROC SQL NOPRINT;
SELECT COUNT(DISTINCT account_type2)
INTO :NrDistinctAccounts
FROM accounts;
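   /* Hypothetical continuation (names are assumptions): per-subject counts
      and the share of all existing categories that each subject covers;
      CUST_ID is assumed to be present in the ACCOUNTS table */
   CREATE TABLE Accounts_Counts AS
   SELECT Cust_ID,
          COUNT(*)                      AS NrAccounts,
          COUNT(DISTINCT Account_Type2) AS NrDistinctAccounts,
          CALCULATED NrDistinctAccounts / &NrDistinctAccounts
                                        AS PropOfAllAccountTypes
   FROM accounts
   GROUP BY Cust_ID;
QUIT;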
Note that we use the keyword CALCULATED to reference a derived variable that has just been
calculated in the same SELECT statement and is not yet present in the table.
Table 19.3: Derived variables from the number of absolute and different categories
19.4.2 Interpretation
Note that we have created six derived variables just by looking at the product or service offering. These variables are often valuable for inferring certain customer behavior—for example, how a customer chooses from a certain product and service offering.
Using association or sequence analysis is obviously much more detailed than using the preceding
variables. The advantage of the preceding approach is the simple calculation of the derived
variables and the straightforward interpretation.
Background
In the preceding sections we used a DATA step to create the row percentages manually. The
OUT= option in PROC FREQ does not output row or column percentages to a SAS data set. It is,
however, possible to use the Output Delivery System (ODS) to redirect the output of PROC
FREQ to a table. This saves the additional need for a DATA step.
Note that it is not possible to use the NOPRINT option in this case—the ACCOUNT_FREQS_ODS table will not be filled if no output is produced. It is possible, however, to prevent PROC FREQ from filling and possibly overflowing the Output window by redirecting the printed output to a file with PROC PRINTTO.
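Applied to the account example, this could look like the following sketch (the TABLE request and the data set name ACCOUNT_FREQS_ODS follow the surrounding examples; the output file reuses the name from the earlier example):
PROC PRINTTO PRINT='C:\data\somefile.lst';
RUN;

ODS OUTPUT CrossTabFreqs = Account_Freqs_ODS;

PROC FREQ DATA = Accounts;
   TABLE Cust_ID * Account_Type2 / EXPECTED;
RUN;

ODS OUTPUT CLOSE;

PROC PRINTTO;
RUN;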
The ACCOUNT_FREQS_ODS table now contains absolute frequencies, relative frequencies, row percentages, column percentages, and the expected frequencies. In Table 19.4 we see the output of ODS OUTPUT CROSSTABFREQS = together with the list of statistics that are calculated.
The row percentages are stored in the ACCOUNT_ROWPCT_TP table. Note that we do not need
to calculate the values as in the preceding section, just transpose the already-calculated values.
PROC TRANSPOSE DATA = Account_Freqs_ODS(KEEP = cust_id Account_Type2
RowPercent _TYPE_)
OUT = Account_RowPct_TP(DROP = _name_ _label_)
PREFIX = RowPct_;
FORMAT RowPercent 8.;
BY Cust_id;
VAR RowPercent;
ID Account_Type2;
RUN;
See Table 19.5 for an example of the output data. We see that this is the same output as we
generated in Table 19.2.
A table with column percentages per subject can be calculated analogously to the example for the row percentages. We did not calculate column percentages in the preceding section, because that would have required an additional calculation of the sums per account type and a merge of these values with the frequency table. Here, again, we benefit from the precalculated values of PROC FREQ.
PROC TRANSPOSE DATA = Account_Freqs_ODS(KEEP = cust_id Account_Type2
ColPercent)
OUT = Account_ColPct_TP(DROP = _name_ _label_)
PREFIX = ColPct_;
FORMAT ColPercent 8.2;
BY Cust_id;
VAR ColPercent;
ID Account_Type2;
RUN;
Before we transpose the deviations from the expected number of frequencies, we have to calculate them in a DATA step. The variable EXPECTED is created in ACCOUNT_FREQS_ODS by PROC FREQ:
DATA Account_freqs_ODS;
SET Account_freqs_ODS;
ExpRatio = (Frequency-Expected)/Expected;
RUN;
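The transpose of the expected ratios then follows the same pattern as for the row and column percentages (the output data set name ACCOUNT_EXPRATIO_TP is an assumption):
PROC TRANSPOSE DATA = Account_Freqs_ODS(KEEP = cust_id Account_Type2 ExpRatio)
               OUT = Account_ExpRatio_TP(DROP = _name_ _label_)
               PREFIX = ExpRatio_;
   FORMAT ExpRatio 8.2;
   BY Cust_id;
   VAR ExpRatio;
   ID Account_Type2;
RUN;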
In the next step this table needs to be merged with the customer table. We skip this here because it
is analogous to the preceding merge.
19.5.4 Conclusion
The advantage of this method is that the row percentages do not need to be calculated manually in
a DATA step and additional statistics such as column percentages or expected frequencies can be
calculated. However, each statistic from the output table of PROC FREQ needs to be transposed
separately to a one-row-per-subject structure.
General
Note that we have calculated 24 derived variables from the single variable ACCOUNT_TYPE.
As mentioned in the first section of this chapter, the creation of derived variables always has to be combined with business considerations and knowledge. Here we introduced the most common derived variables for categories. Not all of them make sense in every analysis.
We already discussed the business rationale of the total and distinct counts in the preceding
section. The rationale of the absolute frequencies should be obvious. We do not use relative
percentages over subjects and categories because their interpretation is not easy.
19.6.1 Percentages
Row percentages represent the relative distribution of categories per subject. With these variables
we are able to judge whether a subject concentrates on certain categories compared to his
portfolio of categories.
Calculating column percentages allows us to indicate whether a subject’s frequency for a category
is above or below the average of all subjects for this category.
Calculating the ratio between observed (= frequency) and expected frequencies allows us to
quantify whether a subject is dominant or non-dominant in a category compared to his category
portfolio and the values of other subjects.
19.6.2 Concatenation
The concatenated variables with frequencies or row percentages over categories allow a
multivariate analysis of the combinations within one variable. In addition to row percentages a
concatenation of absolute values is always important for exploratory purposes.
Note that concatenations of row percentages, column percentages, or expected ratios produce different values for each subject, but they separate the observations into the same groups. Therefore, we have created only one concatenation, based on the row percentages.
19.6.3 Automation
With a few changes it is possible to include the code in a macro that creates the proposed derived
variables just by specifying the variable name. Note, however, that with many categorical
variables and possible categories, a blind invocation of a macro will explode the number of
variables. The manual way forces programmers to think of their variables. Furthermore, the
possible need to combine rare categories makes the automation more difficult.
19.7.1 Concentration
In Chapter 18 – Multiple Interval-Scaled Observations per Subject, we introduced the
concentration measure, which measures the concentration of values on a certain proportion of the
multiple observations per subject. It is also possible to apply this method to categorical variables.
In this case we analyze the proportion of the most common categories that make up a certain
proportion of the multiple observations per subject.
In this chapter we introduced a number of derived variables that describe in detail the distribution and characteristics of categorical variables per subject. Concentration measures are not as important for categorical variables as they are in the interval-scaled case. That is why we did not cover the creation of these variables here.
Association analysis can analyze a list of categories per subject in a more sophisticated way—for example, by identifying frequent category combinations. Sequence analysis additionally considers the time dimension and analyzes the sequence of the categories.
Association and sequence analyses cannot be performed simply in the SAS language without creating complex macros. Therefore, we will not show a coding example
for these methods here. In Chapter 28 – Case Study 4—Data Preparation in SAS Enterprise
Miner, we will show how SAS Enterprise Miner can help in data preparation. There we will also
deal with an example of association and sequence analysis.
Chapter 20
Coding for Predictive Modeling
20.1 Introduction
Data mining analysis does not consist only of the creation of a predictive model; it also deals with creating and preparing meaningful derived variables. In earlier chapters we saw a number of methods to prepare derived variables and to build predictive power into them.
In this chapter we will consider those types of derived variables where we specifically transform
variables in order to give them more predictive power. In this chapter we will move toward the
world of data mining analyses and reach the limits of this book’s scope.
In fact, however, there is no sharp limit between data preparation and analysis. Data preparation
and analysis are usually a cyclic process that is repeated over and over. New variables or sets of
variables are created and used in analysis.
The task of adding predictive power to a variable requires transformations that usually go beyond simple ones. Data mining tools such as SAS Enterprise Miner offer special data
mining nodes and data mining functionality for such purposes. In Chapter 28 – Case Study 4—
Data Preparation in SAS Enterprise Miner, we will see some examples.
In this chapter we will cover the following points of predictive modeling and show code examples
and macros in order to prepare data for predictive modeling:
Proportions or means of the target variable, where we show specific ways of how a
derived variable can be created that holds the proportion or the mean of the target
variable for a certain category.
We will show how interval variables can be either mathematically transformed or binned
into groups.
We will show a short example of how data can be split into training and validation data.
General
The method proportions or means of the target variable applies in predictive modeling.
Depending on whether the measurement level of the target variable is binary or interval, we speak
of proportions or means. The basic idea of this method is that categories of input variables are
replaced by interval values. In the case of a binary target variable these values are the proportion
of events in this category; in the case of an interval target the values are the mean of the interval
variable in this category.
Short example
The following short example explains the principle of a binary target:
We have one categorical variable GENDER with the values MALE and FEMALE, and a binary
target variable RESPONSE with the values 0 and 1. Assume that for GENDER = MALE we have
the following responses: 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, and for FEMALE we have 0, 1, 1, 1, 0, 1, 0,
1, 1, 0.
In the case of an indicator variable the mean of the variable for a group equals the proportion of
event=1 in this group. So we have an event proportion of 36.4% for MALE and of 60% for
FEMALE. In the data we would now replace the category MALE with the interval value 36.4 and
the category FEMALE with 60.
Also note that the proportions must be calculated only on the training data! Otherwise, we would
create a model that generalizes badly, because we would put the information of the target variable
directly into the input variables and would dramatically overfit the data.
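1. Calculate the proportion of the target variable per category, based only on the training data. A minimal sketch with PROC MEANS (the data set and output names are assumptions):
PROC MEANS DATA = training_data NOPRINT NWAY;
   CLASS gender;
   VAR response;
   OUTPUT OUT = gender_props(DROP = _type_ _freq_) MEAN = Response_Prop;
RUN;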
2. Create a new variable GENDER2 that holds the proportions of the target variable.
IF gender = 'MALE' THEN gender2 = 0.364;
ELSE gender2 = 0.6;
For data sets with many variables and many categories, a macro is desirable.
The big advantage of this macro is that there are no manual steps to write IF-THEN/ELSE clauses
for each variable. The usage of PUT functions with the respective format is very elegant. The
resulting format catalog can be used for scoring new observations.
The macro creates new variables in the data set. For each variable in the VARS= list a corresponding variable with the suffix _M is created.
Note that the code of the macro can certainly be enhanced by creating additional macro
parameters for individual tasks. It was the intention to create code that is still simple enough to
218 Data Preparation for Analytics Using SAS
explain the principle. At first the code might appear more complicated than it is because of the
many %SCAN() functions calls. To get a clearer picture, look at the SAS log after the macro has
run:
%MACRO CreateProps(data=,vars=,target=,library=sasuser,
out_ds=,mv_baseline=YES,type=c,other_tag="OTHER");
*** Load the number of items in &VARS into macro variable NVARS;
%LET c=1;
%DO %WHILE(%SCAN(&vars,&c) NE);
%LET c=%EVAL(&c+1);
%END;
%LET nvars=%EVAL(&c-1);
%DO i = 1 %TO &nvars;
   /* A step that calculates the proportion or mean of &TARGET per category
      of the i-th variable and stores it in WORK.PROP_<variable> (including
      the MV_BASELINE handling) is assumed here. */
DATA fmt_tmp;
SET work.prop_%SCAN(&vars,&i)(RENAME=(%SCAN(&vars,&i) = start
&target=label)) END = last;;
*WHERE _type_ = 1;
RETAIN fmtname "%SCAN(&vars,&i)F" type "&type";
RUN;
*** Run PROC Format to create the format;
PROC format library = &library CNTLIN = fmt_tmp;
RUN;
%end;
*** Use the available Formats to create new variables;
options fmtsearch = (&library work sasuser);
DATA &out_ds;
SET &data;
FORMAT %DO i = 1 %TO &nvars; %SCAN(&vars,&i)_m %END; 16.3;
%DO i = 1 %TO &nvars;
%IF &type = c %THEN IF UPCASE(%SCAN(&vars,&i)) = 'OTHER' THEN
%SCAN(&vars,&i) = '_OTHER';;
%SCAN(&vars,&i)_m =
INPUT(PUT(%SCAN(&vars,&i),%SCAN(&vars,&i)f.),16.3);
%END;
RUN;
%mend;
DATA
The input data set that contains TARGET variables and a list of categorical INPUT
variables.
VARS
A list of categorical input variables.
TARGET
The target variable, which can be interval or binary.
LIBRARY
A library where the format catalog will be stored. The default is the SASUSER library.
OUT_DS
The name of the output data set that is created by the macro. This data set will contain
additional variables, the variables from the VARS list with the suffix _M.
TYPE
The type of variables in the VARS list—c for character formatted variables, n for numeric formatted variables. Note that a single macro run can handle only either character or numeric variables. If both types of variables are to be converted to means or proportions, the macro has to be run twice.
MV_BASELINE
Possible values are YES and NO. YES (= default) indicates that for the group of missing
values the mean for this group will not be used, but the overall mean (= baseline) will be
used.
OTHER_TAG
The name of the category that will be used as the OTHER category. All observations whose value of the respective categorical variable was not found in the training data of this variable are classified into this category.
Examples
We use the SAMPSIO.HMEQ data. The variable BAD is a binary target variable with values 0
and 1.
%CreateProps(DATA=sampsio.hmeq,OUT_DS=work.hmeq,VARS=job,
TARGET=bad,MV_BASELINE=NO);
This results in the creation of a SAS format $JOBF with the following values and labels. Note that
this output has been created with the following statement.
proc format library = sasuser fmtlib;
run;
Table 20.1: Values of SAS format $JOBF with a separate proportion value for
missing values
----------------------------------------------------------------------------
|       FORMAT NAME: $JOBF    LENGTH: 12    NUMBER OF VALUES: 8            |
|   MIN LENGTH: 1   MAX LENGTH: 40   DEFAULT LENGTH 12   FUZZ: 0           |
|----------------+----------------+----------------------------------------|
|START           |END             |LABEL (VER. V7|V8  09JUN2005:16:31:19)  |
|----------------+----------------+----------------------------------------|
|                |                |0.082437276                             |
|Mgr             |Mgr             |0.2333767927                            |
|Office          |Office          |0.1318565401                            |
|ProfExe         |ProfExe         |0.1661442006                            |
|Sales           |Sales           |0.3486238532                            |
|Self            |Self            |0.3005181347                            |
|_OTHER          |_OTHER          |0.2319932998                            |
|**OTHER**       |**OTHER**       |0.1994966443                            |
----------------------------------------------------------------------------
We see that the category of missing values has its own proportion that is calculated on the
basis of the training data.
The variable REASON originally contains a category OTHER, which has been replaced
with _OTHER in order to have the OTHER item free for all categories that are not in the
list. This is represented by **OTHER** in the format list.
The category **OTHER** holds the overall mean of the target variable BAD.
Running the macro with MV_BASELINE=YES causes the MISSING group to be represented by the overall mean.
%CreateProps(DATA=sampsio.hmeq,OUT_DS=work.hmeq,VARS=job,
TARGET=bad,MV_BASELINE=YES);
This results in the creation of a SAS format $JOBF with the following values and labels.
Table 20.3: Values of SAS format $JOBF with the overall proportion used for
missing values
----------------------------------------------------------------------------
|       FORMAT NAME: $JOBF    LENGTH: 12    NUMBER OF VALUES: 8            |
|   MIN LENGTH: 1   MAX LENGTH: 40   DEFAULT LENGTH 12   FUZZ: 0           |
|----------------+----------------+----------------------------------------|
|START           |END             |LABEL (VER. V7|V8  09JUN2005:16:44:12)  |
|----------------+----------------+----------------------------------------|
|                |                |0.1994966443                            |
|Mgr             |Mgr             |0.2333767927                            |
|Office          |Office          |0.1318565401                            |
|ProfExe         |ProfExe         |0.1661442006                            |
|Sales           |Sales           |0.3486238532                            |
|Self            |Self            |0.3005181347                            |
|_OTHER          |_OTHER          |0.2319932998                            |
|**OTHER**       |**OTHER**       |0.1994966443                            |
----------------------------------------------------------------------------
A look at the data shows how the values of the variable JOB are replaced by their proportion to
the target variable.
Table 20.4: Original and transformed values of the variable JOB with
MV_BASELINE = YES
Scoring new data is done with the macro %PROPSCORING. Here we use the proportions or means that have been calculated on the basis of the training data to score the validation data set or to score data for other subjects or other time periods.
The macro uses the existing format catalog and creates new variables with values that are replaced
on the basis of these formats:
%MACRO PropScoring(data=, out_ds=,vars=,library=sasuser,type=c);
options fmtsearch = (&library work sasuser);
%LET c=1;
%DO %WHILE(%SCAN(&vars,&c) NE);
%LET c=%EVAL(&c+1);
%END;
%LET nvars=%EVAL(&c-1);
DATA &out_ds;
SET &data;
FORMAT %DO i = 1 %TO &nvars; %SCAN(&vars,&i)_m %END; 16.3;;
%DO i = 1 %TO &nvars;
%IF &type = c %THEN IF UPCASE(%SCAN(&vars,&i)) = 'OTHER' THEN
%SCAN(&vars,&i) = '_OTHER';;
%SCAN(&vars,&i)_m =
INPUT(PUT(%SCAN(&vars,&i),%SCAN(&vars,&i)f.),16.3);
%END;
RUN;
%MEND;
The macro has the following input parameters, which are similar to the macro
%CREATEPROPS:
DATA
The input data set that contains a list of categorical INPUT variables.
VARS
A list of categorical input variables.
LIBRARY
A library where the format catalog will be stored. The default is the SASUSER library.
OUT_DS
The name of the output data set that is created by the macro. This data set will contain
additional variables, the variables from the VARS list with the suffix _M.
TYPE
The type of variables in the VARS list—c for character formatted variables, n for
numeric formatted variables. Note that the whole macro can run only for either character
or numeric formatted variables. If both types of variables will be converted to means or
proportions, the macro has to be run twice.
WoE(MALE) = –0.7367
WoE(FEMALE) = 0.1206
Compared to the values 0.36 and 0.60 we received with the proportions of the target variable, we
see that proportions lower than the baseline proportion result in a negative WoE, and proportions
greater than the baseline result in a positive WoE.
We have talked only about categorical variables so far. Interval input variables need to be binned
into groups before Weights of Evidence values can be calculated or before the %CREATEPROPS
or %PROPSCORING macros can be used.
We can bin the values of an interval variable by defining groups of equal size with PROC RANK.
PROC RANK DATA = sampsio.hmeq OUT = ranks GROUPS = 10;
VAR value;
RANKS value_grp;
RUN;
Alternatively, the bins can be defined manually with IF-THEN/ELSE statements on the original variable VALUE.
DATA ranks;
SET ranks;
IF Value <= 50000 THEN Value_grp2 = 1;
ELSE IF Value <= 100000 THEN Value_grp2 = 2;
ELSE IF Value <= 200000 THEN Value_grp2 = 3;
ELSE Value_grp2 = 4;
RUN;
Note that these limits are often defined by looking at the distribution of the means or proportions
of the target variable over the values of the input variable.
SAS Enterprise Miner offers special functionality to optimally bin an input variable with respect to the target variable. Credit Scoring for SAS Enterprise Miner offers the Interactive Grouping node, which allows the interactive grouping of input variables and the creation of the respective Weight of Evidence variables.
General
The handling of interval variables that have a linear relationship to the target variable is very
straightforward. In regression analysis, for example, a coefficient is estimated that describes the
relationship of the input variable to the target variable.
If, however, the relationship between the input variable and the target variable is not linear, the
interval input variable is either binned into groups or transformed by a mathematical
transformation.
If we have an interval input variable, the relationship to the target variable can be analyzed
graphically by a scatter plot or by a so-called target bar chart. The target variable in this case can
be interval or binary.
We see a U-shaped relationship between RESPONSE and AGE that requires a quadratic term of
AGE.
We now have two options to create the quadratic trend. We can square the values of AGE, or we
can first center the values of AGE by the mean and square the centered values.
AGE_Qdr = age**2;
AGE_Qdr_Std = (age-50)**2;
The second method has the advantage that we explicitly create a variable that has a linear
relationship with the target variable, which we can see from the following figure.
It can be shown that the two alternatives for the quadratic variable give the same performance in linear regression (measured, for example, by the F-value) only if we also specify the linear term in the regression equation:
MODEL response = AGE AGE_Qdr_Std;
MODEL response = AGE AGE_Qdr;
The F-statistic, and therefore the predictability, of the two preceding models is the same, because the intercept and the coefficient of the linear term absorb the difference between the centered and the non-centered quadratic term. Note, however, that the centered quadratic term has the advantage that it can be specified alone and that its estimation is more stable. In the case of quadratic relationships it is therefore advisable to center the variable at its mean before squaring.
Note that types of transformations are not limited to quadratic transformations but can also
include more complex mathematical formulas.
DATA
The name of the data set.
TARGET
The name of the target variable.
INTERVAL
The list of interval input variables.
CLASS
The list of categorical input variables.
The macro can create a target bar chart for nominal-scaled and interval-scaled variables. Note that
interval-scaled variables are automatically binned into groups of equal size. The invocation of the
macro can be as follows:
%TARGETCHART(DATA=age,TARGET=response,interval = age age_qdr_std);
The macro will then create two charts, which are shown in Figure 20.3 and Figure 20.4.
We see the U-shaped relationship between AGE and RESPONSE. The AGE values are binned
into groups with a width of 3. For each class a bar is drawn that represents the mean of the target
variable RESPONSE with a 95% confidence interval. For each group we see the absolute
frequency and the value of the mean of the target variable in this group.
Note that from this type of graphical representation we can see which values of the input variable correspond to which values of the target variable. We can derive whether a variable should be transformed, for example by a quadratic transformation, and which values of the input variable should be grouped together. The bar chart also shows which values with a low absolute frequency might be filtered or mapped to another value.
This type of graphical representation is very helpful in predictive modeling. It is frequently used
not only for interval input variables but also for nominal input variables.
In Figure 20.4 we see the linear relationship between the centered quadratic term of AGE and RESPONSE.
In our preceding example we grouped the AGE values into categories with the following
statement:
if age < 30 then Age_GRP = 1;
else if age < 45 then Age_GRP = 2;
else if age < 55 then Age_GRP = 3;
else if age < 75 then Age_GRP = 4;
else Age_GRP = 5;
Again, we see the U-shaped relationship between AGE and RESPONSE. However, the predictive
model will now consider the input values as nominal categories.
In the case of a U-shaped relationship the advantage of a transformation of the input variable with
a quadratic term is very obvious. However, if we have a more complex relationship between the
input and target variable that cannot easily be expressed with a mathematical transformation, the
creation of nominal categories is very important.
Again we want to mention that the analysis of the relationship of the variable to the target
variable, the definition of the mathematical transformation, or the definition of the group limits
should be done only on the training data in order to avoid overfitting.
In our short example we use a random number to decide for each observation whether it will go
into the training or validation data set.
DATA training_data
validation_data;
SET raw_data;
IF UNIFORM(3) < 0.67 THEN OUTPUT training_data;
ELSE OUTPUT validation_data;
RUN;
We randomly decide for each observation whether it should go into the training or validation data
set. The value of 0.67 has the effect that we have a distribution of cases in the ratio 2:1 on training
and validation data.
If we want to create a training, validation, and test data set, the code is analogous to the preceding
example.
DATA training_data
validation_data
test_data;
SET raw_data;
IF UNIFORM(3) <= 0.50 THEN OUTPUT training_data;
ELSE IF UNIFORM(4) <= 0.50 THEN OUTPUT validation_data;
ELSE OUTPUT test_data;
RUN;
Note that with this code we create a split of 50:25:25 (2:1:1) into training, validation, and test data. The condition UNIFORM(4) <= 0.50 is evaluated for only the remaining half of the observations and therefore splits them into two groups of 25% (50% * 50%) each.
20.5 Conclusion
In all the cases mentioned in this chapter (proportions or means of the target variable, Weight of Evidence, optimal binning, and adjusting the distribution of the values of the input variable), it is important to consider that these operations must be performed only on the training data and not on the complete data set. Otherwise, we would build a model that generalizes badly because we overfit the data.
Chapter 21
Data Preparation for Multiple-Rows-per-Subject
and Longitudinal Data Marts
21.1 Introduction
General
In this chapter we will deal with specifics of data preparation for multiple-rows-per-subject and
longitudinal data marts. In Chapters 9 and 10 we considered these types of data marts from a data
structure and data modeling point of view. This chapter provides information about how these
data marts can be filled with data and how derived variables can be created.
There is a lot of information that we can take from Chapters 16 and 17—for example, it makes no difference whether the revenue per number of purchases is calculated per customer only or per customer and time period. Therefore, ratios, differences, means, event counts, concatenated categorical variables, dummy variables, indicators, and lookup tables can be treated in the same manner in multiple-rows-per-subject data sets.
In this chapter we will deal with those parts of data preparation that are special to multiple-rows-
per-subject or longitudinal data sets. We will explore the following:
We will learn how to prepare data for association and sequence analysis.
We will learn how we can enhance time series data with information per time interval or
with data from higher hierarchical levels.
We will look at how we can aggregate data on various levels.
We will see which derived variables can be created using SAS functions.
We will examine procedures of SAS/ETS for data management, namely PROC EXPAND
and PROC TIMESERIES.
General
Data for association or sequence analysis need to be organized in a multiple-rows-per-subject
structure. If the data are already in a one-row-per-subject structure, the data need to be transposed
to a multiple-rows-per-subject structure.
A well-known type of association analysis is market basket analysis, which analyzes which products customers buy together.
This data mart structure can easily be retrieved from transactional systems that store these types of data. In data warehousing environments, these types of data are usually stored in a star schema.
We see that customer 1 has made two purchases and that the products are being grouped.
The additional columns CUSTOMERID and PRODUCTGROUP can either be retrieved from
lookup tables and merged with this table, or they can be retrieved from SAS formats. We will
show an example of this in Chapter 27.
If we want to use only the columns CUSTOMERID and PRODUCTGROUP in the analysis, we
see that we will get duplicate rows. For example, customer 1 has bought FISH in both purchases
and bought MEAT several times.
Note that it is up to the algorithm of the association or sequence analysis how duplicate items are
handled. Duplicate items, for example, can be removed from the preceding table with the
following statements:
PROC SORT DATA = assoc2(KEEP = CustomerID ProductGroup)
OUT = assocs_nodup
NODUPKEY;
BY CustomerID ProductGroup;
RUN;
A sequence variable, however, does not necessarily need to be a consecutive number. The
sequence variable can be a timestamp of when a certain event took place such as the purchase of a
product or the navigation to a certain page of the Web site.
In the case of timestamps, the sequence variable can also have hierarchies—for example, by
defining different time granularities such as days, weeks, or months.
In Table 14.12 in Chapter 14 we discussed the creation of a key-value structure. The resulting
table is again shown in Table 21.4:
In order to be able to use these data in association analysis, we need to concatenate these two
columns:
Feature = CATX("=",key,value);
The ID and Feature columns of this table can now be used in association analysis to analyze
which features of a subject are likely to occur together.
Note that for the values of AGE, HEIGHT, and WEIGHT it makes sense to group the interval
values—for example, by rounding them—in order to receive classes instead of individual values.
Overview
In Chapter 10 – Data Structures for Longitudinal Analysis, we explored different ways
longitudinal data can be structured. We also introduced a general representation of entities such as
the generic TIME, VALUE, and CATEGORY entities. In Figure 21.1 we repeat this structure.
In Table 21.6 we see an example data table for this structure, where the variable MONTH
represents the TIME entity, the variable PRODUCT represents the CATEGORY entity, and the
variable ACTUAL represents the VALUE entity. We assume that these data cannot be broken
down further. We do not have a more granular classification of PRODUCT and we have the data
only on a monthly level, which cannot be broken down further.
In the following sections we will show how these data can be enhanced with additional variables.
These variables can be used either for aggregation or as possible input variables for time series
forecasting.
We can now join this table to Table 21.6 or, alternatively, convert it into a SAS format and apply that format. Both methods result in the data shown in Table 21.8.
Using SQL:
PROC SQL;
CREATE TABLE prdsale_sql
AS SELECT *
FROM prdsale AS a,
lookup AS b
WHERE a.product = b.product;
QUIT;
Using a SAS format:
DATA prdsale_fmt;
SET prdsale;
FORMAT Prodtype $12.;
Prodtype = PUT(product,PG.);
RUN;
The column that we have added can be an aggregation level that allows the aggregation of the
data at a higher level. In Section 21.4, “Aggregating at Various Hierarchical Levels,” we will
show different ways of aggregating these data.
The TIME dimension can also provide additional aggregation levels. For example, we can use the
monthly values in the variable MONTH to create quarterly or yearly aggregations. This can either
be done by explicitly creating a QUARTER or YEAR variable or by using the appropriate format.
Table 21.10 contains the number of shops per months. The SHOPS variable can be a good
predictor of the number of sold units or the sales amount.
Tables 21.9 and 21.10 can be joined to the PRDSALE table. This can be done with the following
statements. Note that for simplicity we skipped the Monday through Sunday variables:
PROC SQL;
CREATE TABLE prdsale_enh
AS SELECT a.*, b.NrDays, c.Shops
FROM prdsale AS a,
cal_month AS b,
shops AS c
WHERE a.month = b.month
AND a.month = c.month;
QUIT;
Table 21.11: Data with number of days and number of shops per month
In this section we added variables to the longitudinal data. These variables can either be derived
variables that can serve as input variables in time series forecasting or variables that can be used
for aggregations, as we will show in the next section.
General
In the preceding section we created additional hierarchies that can be used for aggregation.
Aggregation is important if the analysis will not be performed on the detailed level for which the
input data are available, but on a higher level. Reasons for this can be that the business question
itself demands the analysis on a certain aggregation level, or that it is not feasible to analyze data
on a certain detail level.
In SAS, aggregations can be performed easily with PROC MEANS or PROC SQL. In the last
section of this chapter we will introduce PROC EXPAND and PROC TIMESERIES for the
aggregation of data. PROC EXPAND and PROC TIMESERIES are part of the SAS/ETS module.
Note that we can simply use a SAS format to get additional aggregations—for example, quarters instead of months. Consider the additional FORMAT statement in the following code:
PROC MEANS DATA = prdsale_sql NOPRINT NWAY;
FORMAT month yyq6.;
CLASS prodtype month;
VAR actual;
OUTPUT OUT = prdtype_aggr_qtr(DROP = _type_ _freq_) SUM=;
RUN;
Note that this method is not restricted to date and time formats, but it also works with SAS
formats in general. For example, we could use a format for PRODUCT_TYPE that groups them
to a higher level.
Here the SAS format is used to overlay the original value, which results in the display and identification of the aggregation level. However, the value that is actually stored in the resulting data set does not automatically correspond to the beginning of the interval. The stored value of the variable MONTH in Table 21.13 is the beginning of the interval (e.g., 01JAN1993) only if the data in the source table (see Table 21.12) are also aligned at the beginning of the interval.
Date values in the source data can be easily aligned with the INTNX function:
Date = INTNX('QUARTER', date, 0, 'BEGIN');
This code aligns the values of DATE to the first date in each quarter.
Aggregating with PROC SQL has the advantage that a database view can be defined instead of a
table. This means that we define the aggregation logic, but we do not physically create the
aggregation table. This method is useful if we have a couple of candidate aggregations for the
analysis and do not want to redundantly store all aggregation tables. Using the data from a view,
however, requires a little more processing time than from a table, because the data have to be
aggregated.
The following code shows how to create a view for the aggregation as shown in Table 21.12:
PROC SQL;
CREATE VIEW prdtype_sql_aggr_view
AS SELECT prodtype,
month,
SUM(actual) AS Actual
FROM prdsale_sql
GROUP BY month, prodtype
ORDER BY 1,2;
QUIT;
In contrast, PROC SQL has the disadvantage that SAS formats cannot be used in the same direct
way as in procedures such as PROC MEANS. Assume you have a SAS format in the preceding
SQL statement such as the following:
PROC SQL;
CREATE VIEW prdtype_sql_aggr_view2
AS SELECT prodtype,
Month FORMAT = yyq6.,
SUM(actual) AS Actual
FROM prdsale_sql
GROUP BY month, prodtype
ORDER BY 1,2;
QUIT;
This will display formatted values, but it will not aggregate by the formatted values; the GROUP BY clause still uses the original (underlying) values. In order to circumvent this, the following trick can be applied using a PUT function:
PROC SQL;
CREATE VIEW prdtype_sql_aggr_view3
AS SELECT prodtype,
Put(Month,yyq6.) As Quarter,
SUM(actual) AS Actual
FROM prdsale_sql
GROUP BY Quarter, prodtype
ORDER BY 1,2;
QUIT;
Note that we used the PUT function to produce the year-quarter values and that we gave the new variable the name QUARTER, which we use in the GROUP BY clause. If we had omitted the new name and used the variable MONTH, the original values would have been used for grouping. In this case, however, the new variable QUARTER is character formatted and cannot be used for further date calculations. If we want to retrieve a date-formatted variable, we need to use the MDY function in order to create the new date variable explicitly.
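A sketch of such a view (the view name is an assumption) that uses the MDY function to build a proper SAS date for the first day of each quarter:
PROC SQL;
   CREATE VIEW prdtype_sql_aggr_view4
   AS SELECT prodtype,
             MDY(QTR(month)*3-2, 1, YEAR(month)) AS Quarter FORMAT = yyq6.,
             SUM(actual) AS Actual
   FROM prdsale_sql
   GROUP BY Quarter, prodtype
   ORDER BY 1,2;
QUIT;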
A variable added per time interval, such as SHOPS, can be passed through an aggregation, for example with PROC MEANS:
PROC MEANS DATA = prdsale_enh NOPRINT NWAY;
CLASS product month;
VAR actual;
OUTPUT OUT = prdtype_aggr(DROP = _type_ _freq_)
SUM(Actual) = Actual
Mean(Shops) = Shops;
RUN;
Note that we calculate the mean of SHOPS, which, for a list of equal values, evaluates to the value itself. We could also use MEDIAN, MIN, or MAX, or, more elegantly, the ID statement of PROC MEANS. These are all possibilities to pass a value through an aggregation. For more than a few such variables it would, however, be better to aggregate the data first and then to join the fields that contain the additional information.
Overview
In this section we will show specific SAS functions that are useful for data preparation with
longitudinal data marts. We will cover DATE and TIME functions, and we will specifically show
the INTCK and INTNX functions and the definition of INTERVALS.
We will also investigate the power of the LAG and DIF functions.
In SAS, functions exist that allow the manipulation of time variables. In Chapter 16 –
Transformations of Interval-Scaled Variables, we introduced the DATDIF, YRDIF, INTNX, and
INTCK functions. For additional SAS functions, see SAS Help and Documentation or the SAS
Press publication SAS Functions by Example.
For example,
INTCK('MONTH','16MAY1970'd,'12MAY1975'd);
evaluates to 60, because 60 month boundaries lie between May 16, 1970, and May 12, 1975. Similarly,
INTNX('MONTH','16MAY1970'd,100);
evaluates to '01SEP1978', because we have incremented the value of May 16, 1970, by 100 month boundaries.
The interval in the INTCK and INTNX functions can be enhanced with a shift value and a
multiplier factor. Possible values for interval include YEAR, SEMIYEAR, QTR, MONTH,
WEEK, WEEKDAY, DAY, HOUR, MINUTE, and SECOND. The following gives an overview
of intervals with shift and multiplier:
MONTH Every month
MONTH2 Every two months
MONTH6.2 Every six months, with boundaries in February and August
WEEK Every week
WEEK4 Every four weeks
WEEK.2 Every week starting with Monday
The following examples show that intervals allow for very flexible definitions of time spans. In
combination with the INTCK and INTNX functions, they allow the calculation of a variety of
time values.
For example,
INTNX('MONTH','17OCT1991'd,0)
returns the date '01OCT1991'd. Note that for monthly interval data, this could also be achieved with MDY(MONTH('17OCT1991'd),1,YEAR('17OCT1991'd)). However, the INTNX function can take any interval, including shifted intervals.
The number of days in each month in the variable DATE can be calculated, for example, as INTCK('DAY', INTNX('MONTH', date, 0), INTNX('MONTH', date, 1)).
Note that the INTCK function counts the number of times the beginning of an interval is reached
in moving from the first date to the second. It does not count the number of complete intervals
between two dates.
When the first date is later than the second date, the INTCK function returns a negative count. For
example, the function INTCK('MONTH','1FEB1991'D,'31JAN1991'D) returns –1.
The following example shows how to use the INTCK function to count the number of Sundays,
Mondays, Tuesdays, and so forth, in each month. The variables NSUNDAY, NMONDAY,
NTUESDAY, and so forth, are added to the data set.
data weekdays;
format date yymmp7.;
do m = 1 to 12;
date = mdy(m,1,2005);
d0 = intnx( 'month', date, 0 ) - 1;
d1 = intnx( 'month', date, 1 ) - 1;
nsunday = intck( 'week.1', d0, d1 );
nmonday = intck( 'week.2', d0, d1 );
ntuesday = intck( 'week.3', d0, d1 );
nwedday = intck( 'week.4', d0, d1 );
nthuday = intck( 'week.5', d0, d1 );
nfriday = intck( 'week.6', d0, d1 );
nsatday = intck( 'week.7', d0, d1 );
output;
end;
run;
Because the INTCK function counts the number of interval beginning dates between two dates,
the number of Sundays is computed by counting the number of week boundaries between the last
day of the previous month and the last day of the current month. To count Mondays, Tuesdays,
and so forth, shifted week intervals are used. The interval type WEEK.2 specifies weekly intervals
starting on Mondays, WEEK.3 specifies weeks starting on Tuesdays, and so forth.
Note that this code can also be used to calculate calendar-specific derived variables that we later
join to the longitudinal data.
For example, suppose you want to know the date of the third Wednesday in the month of October
1991. The answer can be computed as
intnx( 'week.4', '1oct91'd - 1, 3 )
Consider this more complex example: How many weekdays are there between 17 October 1991
and the second Friday in November 1991, inclusive? The following formula computes the number
of weekdays between the date value contained in the variable DATE and the second Friday of the
following month (including the ending dates of this period):
n = intck( 'weekday', date - 1,
intnx( 'week.6', intnx( 'month', date, 1 ) - 1, 2 ) + 1 );
Setting DATE to '17OCT91'D and applying this formula produces the answer, N=17.
Note that the last two examples are taken from SAS Help and Documentation.
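The LAG and DIF functions retrieve values and differences from previous observations. A minimal sketch, assuming the SASHELP.AIR data set used elsewhere in this chapter, of how the columns discussed below can be created:
DATA air_lag;
SET sashelp.air;
*** value of the previous period;
lag_air = LAG(air);
*** difference from the previous period;
dif_air = DIF(air);
*** value of two periods ago;
lag2_air = LAG2(air);
RUN;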
We see that we created three new columns. LAG_AIR contains the value of the previous period,
and DIF_AIR contains the difference from the previous period.
We also created a column LAG2_AIR that holds the value of two periods ago. Note that LAG
is a generic function that can be parameterized by appending a number to its name, specifying
for how many periods back the lagged value will be produced.
Then we use a DATA step to calculate the LAG values and use a BY statement:
DATA prdsale_sum;
SET prdsale_sum;
BY product;
*** Method 1 - WORKS!;
lag_actual = LAG(actual);
IF FIRST.product then lag_actual = .;
*** Method 2 - Does not work!;
IF NOT(FIRST.product) THEN lag_actual2 = lag(actual);
RUN;
Note that we presented two methods. The second one is shorter, but it does not work, because the
LAG function does not behave correctly inside an IF clause: it "sees" only the rows that satisfy
the IF condition. We see the difference between the two methods in the following table:
We see that LAG_ACTUAL contains the correct lagged values except for the rows where a new BY group
starts. LAG_ACTUAL2, however, does not contain the correct value, for example, in row
number 14. Instead, the value of row 12, 6939, is inserted there, because the IF
statement makes the LAG function skip row 13.
In Appendix B – The Power of SAS for Preparation of Analytic Data, we will see additional SAS
functions that make SAS a powerful tool for managing longitudinal data.
General
In this section we will explore two SAS procedures from the SAS/ETS module, namely PROC
EXPAND and PROC TIMESERIES. The SAS/ETS module in general contains procedures for
time series analysis and econometric modeling. The two procedures we mention here can be used
for data preparation of longitudinal data.
Collapse time series data from higher frequency intervals to lower frequency intervals or
expand data from lower frequency intervals to higher frequency intervals. For example,
quarterly estimates can be interpolated from an annual series, or monthly values can be
aggregated to produce an annual series. This conversion can be done for any combination
of input and output frequencies that can be specified by SAS time interval names.
Interpolate missing values in time series, either without changing series frequency or in
conjunction with expanding or collapsing the series.
Change the observation characteristics of time series. Time series observations can
measure beginning-of-period values, end-of-period values, midpoint values, or period
averages or totals. PROC EXPAND can convert between these cases.
The following examples illustrate applications of PROC EXPAND. We use the SASHELP.AIR
data set. In order to convert the data from monthly to quarterly values, we can use the following
statements. The resulting data set AIR_QTR is shown in Table 21.16.
PROC EXPAND DATA = sashelp.air OUT = air_qtr
FROM = month TO = qtr;
CONVERT air / observed = total;
ID DATE;
RUN;
In order to convert the monthly data from SASHELP.AIR to weekly data, the following code can
be used. The result is shown in Table 21.17.
PROC EXPAND DATA = sashelp.air OUT = air_week
FROM = month TO = week;
CONVERT air / OBSERVED = total;
ID DATE;
RUN;
For our example we create randomly missing values with the following statements:
DATA air_missing;
SET sashelp.air;
IF uniform(23) < 0.1 THEN air = .;
RUN;
Among others, the values in rows 12 and 15 have been replaced by missing values. We can now
use PROC EXPAND to replace the missing values. The result is shown in Table 21.18.
PROC EXPAND DATA = air_missing OUT = air_impute;
CONVERT air / OBSERVED = total;
RUN;
Note that the missing values are not simply replaced by random numbers or means; they are replaced by
values interpolated from a curve that is fitted to the time series.
In this section we will use PROC TIMESERIES to aggregate transactional data. PROC
TIMESERIES aggregates data in a similar fashion to PROC MEANS or PROC SQL. The
advantage of this procedure lies in the options it offers for the aggregation statistic
and for the handling of missing values.
PROC SORT DATA = sashelp.prdsale OUT= prdsale;
BY product month;
RUN;
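A corresponding PROC TIMESERIES call can be sketched as follows; the quarterly interval, the output data set name PRDSALE_TS, and the exact option values are illustrative choices:
PROC TIMESERIES DATA = prdsale OUT = prdsale_ts;
BY product;
*** aggregate ACTUAL to quarterly totals; missing aggregated values are set to 0;
ID month INTERVAL = qtr ACCUMULATE = total SETMISSING = 0;
VAR actual;
RUN;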
PROC TIMESERIES aggregates the data for all variables in the BY statement and for the TIME
variable specified in the ID statement.
The TIME variable in the ID statement is aggregated for the interval specified in the
INTERVAL option.
The ACCUMULATE option allows us to specify how the data will be aggregated.
TOTAL causes the creation of sums.
The SETMISS option specifies how missing values in the aggregated data will be
handled. If, as in many cases, a missing value means “No Value,” the option SETMISS =
0 is appropriate.
For more details, see the SAS/ETS Help.
Part 4
Sampling, Scoring, and Automation
Introduction
In this part of the book we will deal with sampling, scoring, automation, and practical advice
when creating data marts.
In Chapter 22 – Sampling, we will investigate different methods of sampling such as the simple
random sample, the stratified sample, and the clustered sample. We will show a couple of macros
and practical programming tips and examples in order to perform sampling in SAS code.
In Chapter 23 – Scoring and Automation, we will explain the scoring process and show different
ways to perform scoring in SAS such as explicitly calculating the scores or using procedures in
SAS/STAT or SAS/ETS. We will also cover pre-checks that are relevant to scoring data in order
to allow smooth scoring from a technical and business point of view.
In Chapter 24 – Do’s and Don’ts When Building Data Marts, we will cover process, data mart
handling, and coding do’s and don’ts, and present practical experiences in the data preparation
process.
Chapter 22
Sampling
22.1 Introduction
General
Sampling means that a number of observations are selected from the available data for analysis.
From a table point of view, this means that certain rows are selected. Sampling is the random
selection of a subset of observations from the population. The sampling rate is the proportion of
observations from the original table that will be selected for the sample. In data mining analyses
sampling is important, because we usually have a large number of observations. In this case
sampling is performed in order to reduce the number of observations to gain performance.
Another reason for sampling is to reduce the number of observations that are used to produce
graphics such as scatter plots or single plots. For example, creating a scatter plot with 500,000
observations takes a lot of time and would probably not give a better picture of the relationship
than a scatter plot with 5,000 observations.
Sampling is also important in the case of longitudinal data if single plots (one line per subject) are
to be produced. Besides the fact that plotting the course of a purchase amount over 18 months for
500,000 customers would take a long time, 500,000 lines in one graph do not give the clear visual
impression that 1,000 or 5,000 lines do.
The rationale behind sampling is that comparable results can be obtained from a smaller number of
observations. There is a lot of theory on sampling. We will not cover these considerations in this
book, but in this chapter we will show examples of how different sampling methods can be
performed in SAS.
An alternative to physically creating a new data set is to define an SQL view
or a DATA step view. Here the selected observations are not copied to the sample data set but are
dynamically selected when data are retrieved from the view. This saves disk space, but the
analysis takes longer because the sampled observations have to be selected from the data each time.
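A minimal sketch of such a DATA step view (the view name SAMPLE_VIEW, the input data set BASIS, the seed, and the 10% sampling rate are illustrative choices):
DATA sample_view / VIEW = sample_view;
SET basis;
*** observations are selected only when data are retrieved from the view;
IF RANUNI(123) < 0.1 THEN OUTPUT;
RUN;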
Overview
There are a number of different sampling methods. We will investigate the three most important
for analytic data marts: the simple random sample, the stratified sample, and the clustered sample.
This is encountered in most cases with binary target variables, where the proportion of events (or
responders) will be increased in the sample. If the number of events is rare, all observations that
have an event in the target variable are selected for the sample and a certain number of non-events
are selected for the sample.
We will find this method in the case of multiple-rows-per-subject data marts or cross-sectional or
interleaved longitudinal data marts, but not in the case of one-row-per-subject data marts.
If the sample count is given as an absolute number, say 6,000 observations out of 60,000, we can
write the code in the following way:
DATA SAMPLE;
SET BASIS;
IF RANUNI(123) < 6000/60000 THEN OUTPUT;
RUN;
Again, we will not necessarily reach exactly 6,000 observations in the sample.
In this example the probability for each observation to be in the sample is influenced by the
number (proportion) of observations that are in the sample so far. The probability is controlled by
the ratio of the actual sample proportion of all previous observations to the desired sample
proportion.
In the preceding example the number of observations in the data set is hardcoded. In order to
make this more flexible we introduce the macro %RESTRICTEDSAMPLE:
%MACRO RestrictedSample(data=,sampledata=,n=);
*** Count the number of observations in the input
data set, without using PROC SQL or other table scans
--> Saves Time;
DATA _NULL_;
CALL SYMPUT('n0',STRIP(PUT(nobs,8.)));
STOP;
SET &data nobs=nobs;
RUN;
DATA &sampledata;
SET &data;
IF smp_count < &n THEN DO;
IF RANUNI(123)*(&n0 - _N_) <= (&n - smp_count) THEN DO;
OUTPUT;
Smp_count+1;
END;
END;
RUN;
%MEND;
DATA
The input data set.
SAMPLEDATA
The sample data set that is created.
N
The sample size as an absolute number.
22.4 Oversampling
General
Oversampling is a common task in data mining if the event rate is small compared to the total
number of observations. In this case sampling is performed so that all or a certain proportion of
the event cases are selected and a random sample from the non-event is drawn.
For example, if we have an event rate of 1.5% in the population, we might want to oversample the
events and create a data set with 15% of events.
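For the SAMPSIO.HMEQ data set with the binary target BAD, a hedged sketch of such a selection (the data set name OVERSAMPLE and the seed are illustrative choices):
DATA oversample;
SET sampsio.hmeq;
*** keep all event cases;
IF bad = 1 THEN OUTPUT;
*** keep 50% of the non-event cases;
ELSE IF RANUNI(123) < 0.5 THEN OUTPUT;
RUN;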
With these statements we select all event cases and 50% of non-event cases of the observations.
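To reach a specific event rate, the selection probability for the non-events can be derived from the original event rate. A hedged sketch that creates the data set OVERSAMPLE2 mentioned below (the seed and the exact form of the condition are illustrative; only the event rate 0.1995 and the target value 25 are taken from the discussion):
DATA oversample2;
SET sampsio.hmeq;
*** keep all event cases;
IF bad = 1 THEN OUTPUT;
*** select non-events with a probability that yields approximately a 25% event rate;
ELSE IF RANUNI(34) < (0.1995*(100-25)) / (25*(1-0.1995)) THEN OUTPUT;
RUN;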
Note that 0.1995 is the event rate in the SAMPSIO.HMEQ data set. Specifying the value 25
causes the condition to select a number of non-events such that we receive approximately a 25% event
rate. The distribution in the data set OVERSAMPLE2 is as follows:
The FREQ Procedure
Cumulative Cumulative
BAD Frequency Percent Frequency Percent
-------------------------------------------------------
0 3588 75.11 3588 75.11
1 1189 24.89 4777 100.00
In the case of oversampling, a flexible sampling variable is a variable whose values
allow the selection of any oversampling rate. We define such a variable with the following
statements in a macro:
%MACRO SamplingVAR(eventvar=,eventrate=);
FORMAT Sampling 8.1;
IF &eventvar=1 THEN Sampling=101;
ELSE IF &eventvar=0 THEN
Sampling=(&eventrate*100)/(RANUNI(34)*(1-&eventrate)+&eventrate);
%MEND;
The macro is designed to run within a DATA step and has the following parameters:
EVENTVAR
The name of the event variable.
EVENTRATE
The event rate as a number between 0 and 1.
The invocation of the macro in the following example creates a new variable SAMPLING. The
columns BAD and Sampling are shown in Table 22.1:
DATA hmeq;
SET sampsio.hmeq;
%SamplingVar(eventrate=0.1995,eventvar=bad);
RUN;
We see that the SAMPLING variable is greater than 100 for events and contains appropriate
random numbers for non-events. The SAMPLING variable allows us to specify in a flexible way
which oversampling rate will be used for an analysis, without drawing the respective sample
again and again: selecting the observations with SAMPLING greater than r results in a sample with
approximately an r% event rate (for values of r above the original event rate). We demonstrate this
behavior in the following example with PROC FREQ:
PROC FREQ DATA = hmeq;
TABLE bad;
WHERE sampling > 30;
RUN;
Overview
In the case of a multiple-rows-per-subject or longitudinal data mart, simple sampling does not
make sense. For one subject or cross-sectional group, we want to have either all observations or no
observations in the sampled data. Otherwise, we would have the paradoxical situation where, in
time series analysis, some measurements are missing because they were sampled out, or, in market
basket analysis, some products of a basket are not in the data because they were not sampled.
In this case we want to do the sampling on a subject or BY-group level and then move all
observations of the selected subjects or BY groups into the sample. This can be done by a two-
step approach where first the subjects or BY groups are sampled. The selected IDs are then
merged back with the data in order to select the related observations.
The following macro shows the unrestricted sampling approach, where the desired sample
proportion or sample count is reached only on average (if a number of samples would be drawn).
Note that the macro expects the data to be sorted by the ID variable.
%MACRO Clus_Sample (data = , id =, outsmp=, prop = 0.1, n=, seed=12345
);
/*** Macro for clustered unrestricted sampling
Gerhard Svolba, Feb 2005
The macro draws a clustered sample in one DATA step.
The exact sample count or sample proportion is not
controlled.
Macro Parameters:
DATA The name of the base data set
OUTSMP The name of the sample data set that is created
ID The name of the ID Variable, that identifies the
subject or BY group;
PROP Sample Proportion as a number from 0 to 1
N Sample count as an absolute number
SEED Seed for the random number function;
Note that PROP and N relate to the distinct ID values;
***/
DATA _test_;
SET &data;
BY &ID;
IF first.&id;
RUN;
DATA _NULL_;
CALL SYMPUT('n0',STRIP(PUT(nobs,8.)));
STOP;
SET _test_ nobs=nobs;
RUN;
%IF &n NE %THEN %let prop = %SYSEVALF(&n/&n0);
DATA &outsmp;
SET &data;
BY &id;
RETAIN smp_flag;
*** reset the selection flag for each new BY group;
IF FIRST.&id THEN smp_flag=0;
IF FIRST.&id AND RANUNI(&seed) < &prop THEN DO;
smp_flag=1;
OUTPUT;
END;
ELSE IF smp_flag=1 THEN DO;
OUTPUT;
IF LAST.&id THEN smp_flag=0;
END;
DROP smp_flag;
RUN;
%MEND;
DATA
The name of the base data set.
OUTSMP
The name of the output data set.
ID
The name of the ID variable that identifies the subject or BY group.
PROP
The sample proportion as a number from 0 to 1. Note that the sampling proportion is
understood as a proportion of subjects and not as a proportion of observations in
underlying hierarchies. PROP is considered only if no value N is specified.
N
The sample count as an absolute number. Note that the sampling count is understood as a
number of subjects and not as the number of observations in underlying hierarchies. If a
value for N is specified, the sampling rate is calculated by dividing N by the number of
distinct subjects. In this case the PROP parameter is ignored.
SEED
The seed for the random number function.
The sample data set is created under the name given in the OUTSMP parameter.
From a programming perspective, we avoid the need of a merge of the selected ID data set with
the base data set by retaining the SMP_FLAG over all IDs. However, the data need to be sorted
first.
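A hedged usage sketch of the macro (the input data set, the ID variable PRODUCT, the output name, and the 20% proportion are illustrative choices):
PROC SORT DATA = sashelp.prdsale OUT = prdsale;
BY product;
RUN;
%Clus_Sample(data = prdsale, id = product, outsmp = prdsale_smp, prop = 0.2);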
22.6 Conclusion
Note that the Sampling node in SAS Enterprise Miner offers SIMPLE, STRATIFIED, and
CLUSTERED restricted sampling. We, however, wanted to present these
methods here, because sampling is in many cases performed before the import of data into the data
mining tool, for example, to allow faster data preparation on only a sample of the data or to
provide data for visual data exploration, where we would like to work on samples for better visibility.
Chapter 23
Scoring and Automation
23.1 Introduction
General
After a data mart is prepared, it can be used for analysis. Some types of analyses, as we saw in
Section 2.6, “Scoring Needed: Yes/No,” produce a set of rules that can be applied to another data
mart in order to calculate a score.
The process of applying rules to a different data mart is called scoring. The observations to which
the score code is usually applied are either a set of different subjects or the same subject at a later
time.
During analysis, the relationship between customer attributes and whether or not a loan
has been paid back is modeled using logistic regression. The coefficients for the input
variables can be used to predict for other observations the probability that the loan will
not be paid back.
A cluster analysis is performed on seven customer attributes and assigns each customer to
one of five clusters. The discrimination rule that calculates the cluster assignments on the
basis of the input variables can be used to assign new observations to clusters or to re-
assign existing customers after one, two, or more months.
In medical statistics, survival analysis is performed that models the lifetime of melanoma
patients based on a number of risk factors. The results can be applied to new patients in
order to predict their lifetime, depending on the values of their risk factors.
In time series analysis, the number of visitors to a permanent exhibition on a daily basis
will be modeled. The course of the time series will be forecasted for future time periods.
In all of these examples, a logic has been derived from the analysis. This logic, or these rules, can be
used to calculate future unknown values of certain variables. We call these future unknown values
scores and the logic or rules a scorecard or score logic. The process of applying a scorecard to a
data set is called scoring. The score code is the formulation of the scoring rules or the score logic
in program statements.
explicitly calculating the score values from parameters and input variables
using the respective SAS analytic procedure for scoring
scoring with PROC SCORE of SAS/STAT
We will also discuss the following:
In large data sets, the modeling data mart can be a sample of available information. In cases of
predictive modeling, oversampling can be performed in order to have more events available to
allow for better prediction. We saw in Chapter 12 – Considerations for Predictive Modeling, that
data can be accumulated over several periods in order to have more event data or to have data
from different periods in order to allow for a more stable prediction.
If the purpose of an analysis is to create a scorecard, the modeling data mart can use only
information about subjects that will also be available in the following periods, when the
scoring data marts have to be built. Be aware that ad hoc data such as surveys are not available for
all subjects. We mentioned this in Chapter 2 – Characteristics of Analytic Business Questions.
In predictive modeling the scoring data mart usually does not contain the target variable, because
it is usually unknown at this time. In the case of a binary or nominal target variable, the score
code adds a probability score for each class to the observations. In the case of an interval
target, the predicted value of the target is added.
As we can see from Table 23.1 the analysis data mart is used in the analysis to create a model.
From this model, score rules or the score code is retrieved. In the scoring process these rules are
then applied to the scoring data mart to create a data mart with an additional score column. This
column is calculated on the basis of the score rules.
Note that before scoring can start, the data mart has to be prepared in the same manner as the
training data mart before the final analysis started. This includes the creation of new variables, the
application of the transformation to variables, the replacement of missing values, the creation of
dummy variables, and also possibly the filtering of outliers.
The score code that is produced by SAS Enterprise Miner in the Score node contains all data
preparation steps that are defined by SAS Enterprise Miner nodes such as the Transform node, the
Replacement node, and others. If user-defined code from SAS Code nodes has to be included in
the score code, it has to be inserted manually in the Score Code Editor of the Score node.
General
The most direct way to score new observations is to use the parameter estimates and insert them
into the formula that calculates the predicted values. In linear regression this is done simply by
multiplying each variable value by its parameter estimate and summing over all variables plus
the intercept. In logistic regression this can be done in a similar way by calculating the
linear predictor and then applying the inverse logit transformation to it.
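The regression run that produces these estimates can be sketched as follows, assuming the FITNESS data and the OXYGEN model with AGE, WEIGHT, and RUNTIME that appear in the tables and code below:
PROC REG DATA = fitness OUTEST = betas;
MODEL Oxygen = Age Weight RunTime;
RUN;
QUIT;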
Note that we output the parameter estimates with the OUTEST= option to the BETAS data set.
This data set can now be used to merge it with a data set with new observations for which
predictions based on the variables will be created. Let’s first look at our data set with the new
observations NEW_OBS (Table 23.2) and our BETAS data set (Table 23.3).
By using these two data sets we can create predictions for OXYGEN using the following SAS
code:
PROC SQL;
CREATE TABLE scores
AS SELECT a.*,
b.age AS C_age,
b.weight AS C_weight,
b.runtime AS C_runtime,
b.intercept AS Intercept,
(b.Intercept +
b.age * a.age +
b.weight * a.weight +
b.runtime * a.runtime) AS Score
FROM NewObs a,
betas b;
QUIT;
As we can see, we merge the data sets together using PROC SQL and calculate the predicted
values manually using the respective formula. The estimates for AGE, WEIGHT, and RUNTIME
are included only in the SELECT clause for didactic purposes. The resulting data set SCORES is
shown in Table 23.4.
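The same scoring logic can also be stored in a view. A sketch, in which the view name SCORE_VIEW follows the text below and the query mirrors the one above:
PROC SQL;
CREATE VIEW score_view
AS SELECT a.*,
(b.Intercept +
b.age * a.age +
b.weight * a.weight +
b.runtime * a.runtime) AS Score
FROM NewObs a,
betas b;
QUIT;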
The results of the view are the same as shown in Table 23.4. The process of using the view is
described in the following points:
Scoring new observations: Just add or replace the observations in the NEWOBS table and
query the view SCORE_VIEW to retrieve the score values.
Retraining the model with existing variables: Run PROC REG on new training data again
and produce a new BETAS data set that contains the new estimates.
Retraining the model with new variables: Run PROC REG on new training data again
and produce a new BETAS data set that contains the new estimates. Then update the
view definition in order to consider new variables in the score calculation.
This results in a very flexible scoring environment. Note that the rules (= business knowledge) are
stored in the BETAS data set and the scoring algorithm is stored in the view definition.
General
It is possible to use some analytic procedures for model training and scoring in one step. In this
case the input data set has to contain the training data and the scoring data.
The training data must contain the input variables and the target variable.
The scoring data must contain only input variables—the target variable must be missing.
In this case, SAS/STAT procedures such as PROC REG or PROC LOGISTIC use those
observations that have values for both input and target variables for model training and use those
observations where the target variable is missing to assign a score to it.
Table 23.5: FITNESS2 data set with training and new observations
We see that the last three observations do not have values for the target variable OXYGEN.
Running the following PROC REG code, which includes an OUTPUT statement to produce
predicted values, will produce a new column that contains the predicted values for all
observations:
PROC REG DATA=fitness2;
MODEL Oxygen=Age Weight RunTime ;
OUTPUT OUT = fitness_out P=predicted;
RUN;
QUIT;
This method of scoring observations is possible with all SAS/STAT procedures that have an
OUTPUT statement, which allows us to create predicted values with the P= option. This scoring
method can also be automated with a view, just by changing the concatenation of training and
scoring data as in the following:
DATA fitness2_view / VIEW = fitness2_view;
SET fitness NewObs;
RUN;
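The scoring run can then simply point the procedure at the view; a sketch that mirrors the PROC REG call above:
PROC REG DATA = fitness2_view;
MODEL Oxygen = Age Weight RunTime;
OUTPUT OUT = fitness_out P = predicted;
RUN;
QUIT;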
Note that no estimate data set has to be considered, because the estimates are re-created during
each invocation of the procedure.
The advantage of this method is that no coding of the scoring formula is needed. The disadvantage
is that each scoring run causes the procedure to re-run the model training, which might
unnecessarily consume time and resources.
We use the data from Example 42.1 of the PROC LOGISTIC documentation in the SAS/STAT User's
Guide and add additional observations without the target variable to them. See Table 23.7.
Table 23.7: REMISSION data set with three additional observations for scoring
Again we run PROC LOGISTIC and use an OUTPUT statement to produce predicted values:
PROC LOGISTIC DATA=Remission OUTEST=betas;
MODEL remiss(event='1')=cell smear infil li blast temp;
OUTPUT OUT = remission_out P=p;
RUN;
QUIT;
Finally, we receive a new column in the table with the predicted probability for the event
REMISS, as shown in Table 23.8.
Table 23.8: REMISSION table with scores, created with PROC LOGISTIC
Scoring new observations without changing the cluster assignment rules can be achieved by using
a SEED data set in PROC FASTCLUS. The following example illustrates this process, based on a
clustering of the FITNESS data.
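A sketch of the initial clustering run; MAXCLUSTERS=5, the OUTSEED= data set CLUSTERSEEDS, and the variable list follow the discussion below, while the OUT= data set name is an illustrative choice:
PROC FASTCLUS DATA = fitness
MAXCLUSTERS = 5
OUTSEED = ClusterSeeds
OUT = fitness_clus;
VAR age weight oxygen runtime;
RUN;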
Note that we used the MAXCLUSTERS= option to request five clusters, and we used the
OUTSEED= option to save the seeds of the k-means clustering in a data set.
A look at the CLUSTERSEEDS table in Table 23.9 shows us that the cluster midpoints are stored
for each cluster.
Table 23.9: CLUSTERSEEDS table, output by the OUTSEED= option in PROC FASTCLUS
In the NEWCLUSOBS data set we have three new observations that will be assigned to clusters,
as shown in Table 23.10.
We can now use PROC FASTCLUS to assign the new observations to clusters:
PROC FASTCLUS DATA = NewClusObs
OUT = NewClusObs_scored
SEED = ClusterSeeds
MAXCLUSTERS=5
MAXITER = 0;
VAR age weight oxygen runtime;
RUN;
Note that we specify an OUT data set that contains the scored data. With SEED= we specify our
existing SEED data set that contains the cluster assignment rules. It is important to specify
MAXITER=0 in order to perform cluster assignments only and to prevent the procedure from
starting new iterations from the existing seeds. The resulting data set is shown in Table 23.11.
General
The SCORE procedure is available in SAS/STAT. This procedure multiplies values from two data
sets, one containing raw data values and one containing coefficients. The result of this
multiplication is a data set that contains linear combinations of the coefficients and the raw data
values.
Note that PROC SCORE performs an element-wise multiplication of the respective variables in
the raw data and coefficients data sets and finally sums the elements. No additional transformations
such as log or logit can be performed within PROC SCORE. Therefore, PROC SCORE can be
used directly only for scoring a linear regression. For more details, see SAS Help and
Documentation for the SCORE procedure.
Example
We again use the fitness data and run PROC REG for a prediction of oxygen and store the
coefficients in the data set BETAS:
PROC REG DATA=fitness OUTEST = betas;
MODEL Oxygen=Age Weight RunTime ;
RUN;
QUIT;
The data set NEWOBS contains three observations that will be scored with PROC SCORE using
the BETAS data set:
PROC SCORE DATA = NewObs
SCORE = betas
OUT = NewObs_scored(RENAME = (Model1 = Oxygen_Predict))
TYPE = PARMS;
VAR Age Weight RunTime;
RUN;
Note that we specify three data sets with PROC SCORE—the raw data (DATA=), the coefficients
(SCORE=), and the scored data (OUT=).
General
Most of the SAS/ETS procedures deal with time series forecasting. Therefore, scoring in this
context means writing the forecasted values of the time series forward into future periods.
If, for example, in PROC AUTOREG a time variable was used as an independent variable, forecasts over
time can be produced by adding observations with the respective future values for time.
In the code an OUTPUT statement is added, which can have the following form:
OUTPUT OUT = autoreg_forecast P=predicted_value PM = structural_part;
Note that two predicted values can be created. P= outputs the predicted value, formed from both
the structural and autoregressive part of the model. PM= outputs the structural part of the model
only.
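A hedged sketch of such a forecasting run with PROC AUTOREG (the input data set SALES_HISTORY, the variables SALES and TIME, and the autoregressive order are hypothetical; the OUTPUT statement mirrors the form shown above):
PROC AUTOREG DATA = sales_history;
*** TIME is used as an independent variable; future periods have a TIME value but a missing SALES value;
MODEL sales = time / NLAG = 2;
OUTPUT OUT = autoreg_forecast P = predicted_value PM = structural_part;
RUN;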
PROC ARIMA can produce forecasts either from the time series itself only or by using additional input
variables. See the CROSSCORR= option in the IDENTIFY statement. The future values of
additional input variables can be forecasted as well and can then be used automatically in the
forecasting for future periods.
Values of input variables that have been used during model definition with the CROSSCORR=
option in the IDENTIFY statement can also be provided in the input data set. In this case the input
data set contains values for those variables in future periods, as well as missing values for the
dependent variable and for those input variables that will be forecasted in the procedure itself.
The following code shows an example of time series forecasting on the basis of the
SASHELP.CITIMON data set. An ARIMA model for the total retail sales (the variable RTRR)
will be created, and the values of the variable EEGP, the gasoline retail price, will be used. The
following statement shows how PROC ARIMA can be used to forecast the EEGP values and then
use them and their forecasts for a forecast of RTRR:
PROC ARIMA DATA = sashelp.citimon;
IDENTIFY VAR =eegp(12);
ESTIMATE P=2;
IDENTIFY VAR = RTRR CROSSCORR = (eegp(12));
ESTIMATE P=1 Q=1 INPUT=(eegp);
FORECAST LEAD=12 INTERVAL = month ID = DATE OUT = results;
RUN;
QUIT;
Note that the emphasis of this example is not on creating the best possible ARIMA model for the
underlying questions but on illustrating how scoring can be used in PROC ARIMA.
General
In the case of periodic scoring, data marts need to be re-created or updated each period. Over time
it can happen that the content of a data mart no longer corresponds to the content of the
training data mart on which the model was built. The most common cases of non-correspondence are
missing values for entire variables, new categories in categorical variables, and changes in the
distribution of interval variables.
In some analyses, missing values are replaced by some values during missing value imputation.
However, these replacement methods assume that the number of missing values does
not exceed a certain percentage.
Missing values of a variable for all observations in a data set are likely to appear over time if the
definition of data sources, variables, and so forth, changes in the source systems. Therefore, it
makes sense that the number of missing values is monitored and reported before each scoring run.
The number of missing values for numeric variables can be checked with the following macro:
%MACRO AlertNumericMissing (data=,vars=_NUMERIC_,alert=0.2);
PROC MEANS DATA = &data NMISS NOPRINT;
VAR &vars;
OUTPUT OUT = miss_value NMISS=;
RUN;
The macro uses PROC MEANS to calculate the number of missing values for the numeric
variables and produces a list of all numeric variables with the number of missing values, the
percentage of missing values, and an alert, based on a predefined alert percentage.
DATA
The data set to be analyzed.
VARS
The list of variables that will be checked. Note that the default is all numeric variables.
Also note that here we use the very practical logical variable _NUMERIC_.
ALERT
The value between 0 and 1. Variables that have a proportion of missing values greater
than “Alert” will be flagged with alert = 1.
The output of the macro lists one row per variable with the columns Variable, NumberMissing, Proportion_Missing, N, and Alert.
For character variables, the following macro counts the number of missing values:
%MACRO ALERTCHARMISSING(data=,vars=,alert=0.2);
*** LOAD THE NUMBER OF ITEMS IN &VARS INTO MACRO VARIABLE NVARS;
%LET C=1;
%DO %WHILE(%SCAN(&vars,&c) NE);
%LET C=%EVAL(&c+1);
%END;
%LET NVARS=%EVAL(&C-1);
*** CALCULATE THE NUMBER OF OBSERVATIONS IN THE DATA SET;
DATA _NULL_;
CALL SYMPUT('N0',STRIP(PUT(nobs,8.)));
STOP;
SET &data NOBS=NOBS;
RUN;
PROC DELETE DATA = work._CharMissing_;RUN;
%DO I = 1 %TO &NVARS;
PROC FREQ DATA = &data(KEEP =%SCAN(&VARS,&I)) NOPRINT;
TABLE %SCAN(&vars,&I) / MISSING OUT = DATA_%SCAN(&vars,&I)(WHERE
=(%SCAN(&vars,&I) IS MISSING));
RUN;
DATA DATA_%SCAN(&vars,&i);
FORMAT VAR $32.;
SET data_%SCAN(&vars,&i);
VAR = "%SCAN(&vars,&i)";
DROP %SCAN(&vars,&i) PERCENT;
RUN;
PROC APPEND BASE = work._CharMissing_ DATA = DATA_%SCAN(&vars,&i)
FORCE;
RUN;
%END;
PROC PRINT DATA = work._CharMissing_;
RUN;
DATA _CharMissing_;
SET _CharMissing_;
FORMAT Proportion_Missing 8.2;
N=&N0;
Proportion_Missing = Count/N;
Alert = (Proportion_Missing > &alert);
RENAME var = Variable
Count = NumberMissing;
*IF _NAME_ = '_FREQ_' THEN DELETE;
RUN;
TITLE ALERTLIST FOR CATEGORICAL MISSING VALUES;
TITLE2 DATA = &DATA -- ALERTLIMIT >= &ALERT;
PROC PRINT DATA = _CharMissing_;
RUN;
TITLE;TITLE2;
%MEND;
The macro performs a call of PROC FREQ for each variable and concatenates the results for the
frequencies of the missing values. Note that the macro first deletes a potentially existing data set
WORK._CHARMISSING_.
DATA
The data set to be analyzed.
VARS
The list of variables that will be checked.
ALERT
The value between 0 and 1. Variables that have a proportion of missing values greater
than “Alert” will be flagged with alert = 1.
The output again lists one row per variable with the columns Variable, NumberMissing, Proportion_Missing, N, and Alert.
New categories that appear in the scoring data mart but were not present in the training data mart can have the following consequences:
In the case of dummy variables for categorical variables, usually no error will be issued,
but an observation with a new category will typically have zeros for all dummy
variables. In this case, a wrong score will be calculated for these observations.
In case of a SAS Enterprise Miner score code, a new category will cause a warning in the
scoring process. This will lead to an average score for this observation—for example, in
the form of the baseline probability in event prediction.
The macro shown in Chapter 20 – Coding for Predictive Modeling, for the calculation of
proportions for categories, introduced an option that provides the average event rate or
average mean for a new category, which leads to a correct scoring.
Here we will introduce a macro that checks whether new categories in categorical variables exist
in the scoring data mart that were not present in the training data mart.
The macro %REMEMBERCATEGORIES has to be applied to the training data mart. It creates
and stores a list of categories for each variable given in the VARS parameter. These data sets have the
prefix CAT. This list of categories will be used by the macro %CHECKCATEGORIES to compare whether new
categories exist in the scoring data mart. The list of categories of the scoring data mart will be stored in
data sets with the prefix SCORE. The result of the comparison is stored in data sets with the prefix NEW.
The list of new categories is printed in the SAS Output window.
%MACRO RememberCategories(data =, vars=,lib=sasuser);
*** Load the number of items in &VARS into macro variable NVARS;
%LET c=1;
%DO %WHILE(%SCAN(&vars,&c) NE);
%LET c=%EVAL(&c+1);
%END;
%LET nvars=%EVAL(&c-1);
%DO i = 1 %TO &nvars;
*** Store the list of categories of the training data mart with prefix CAT;
PROC FREQ DATA = &data NOPRINT;
TABLE %SCAN(&vars,&i) / OUT = &lib..cat_%SCAN(&vars,&i)(KEEP = %SCAN(&vars,&i));
RUN;
%END;
%MEND;
DATA
The data set that contains the training data.
VARS
The list of variables that will be checked. Note that here we cannot use the logical
variable for all categorical variables, _CHAR_, because we need to scan through the
macro list of values.
LIB
The SAS library where the list of categories will be stored.
The macro %CHECKCATEGORIES is defined as follows:
%MACRO CheckCategories(scoreds=, vars=,lib=sasuser);
*** Load the number of items in &VARS into macro variable NVARS;
%LET c=1;
%DO %WHILE(%SCAN(&vars,&c) NE);
%LET c=%EVAL(&c+1);
%END;
%LET nvars=%EVAL(&c-1);
%DO i = 1 %TO &nvars;
*** Store the list of categories of the scoring data mart with prefix SCORE;
PROC FREQ DATA = &scoreds NOPRINT;
TABLE %SCAN(&vars,&i) / OUT = &lib..score_%SCAN(&vars,&i)(KEEP = %SCAN(&vars,&i));
RUN;
PROC SQL;
CREATE TABLE &lib..NEW_%SCAN(&vars,&i)
AS
SELECT %SCAN(&vars,&i)
FROM &lib..score_%SCAN(&vars,&i)
EXCEPT
SELECT %SCAN(&vars,&i)
FROM &lib..cat_%SCAN(&vars,&i)
;
QUIT;
TITLE New Categories found for variable %SCAN(&vars,&i);
PROC PRINT DATA = &lib..NEW_%SCAN(&vars,&i);
RUN;
TITLE;
%END;
%MEND;
SCOREDS
The data set that contains the scoring data.
VARS
The list of variables that will be checked. Note that here we cannot use the logical
variable for all categorical variables, _CHAR_, because we need to SCAN through the
macro list of values.
LIB
The SAS library where the list of categories will be stored.
The basic idea of this set of macros is to create a repository of the list of values at the time of
model training. At each scoring the list of values of the scoring data mart is compared with the list
of values in the training data mart. This has the advantage that the training data mart does not
need to be available each time a scoring is done, and it saves processing time because the list for
the training data mart is generated only once. Note, however, that the data sets that are created by
the macro %REMEMBERCATEGORIES must not be deleted and must be present when the
macro %CHECKCATEGORIES is invoked.
This creates the data sets WORK.CAT_JOB and WORK.CAT_REASON, which contain a list of
values that are available in the training data.
Now we invoke the macro %CHECKCATEGORIES to compare the values in the data set with
our repository:
%CheckCategories(scoreds = hmeq_score,vars=job reason);
The PROC PRINT output lists one new category that was found for the variable JOB: the value SAS-Consultant.
We introduce here a set of macros to compare the distribution of numeric variables in the training
data mart with the distribution of numeric variables in the scoring data mart. Again, we create a
repository of descriptive statistics for the training data mart and compare these statistics with the
values of the scoring data mart.
%MACRO
RememberDistribution(data=,vars=_NUMERIC_,lib=sasuser,stat=median);
PROC MEANS DATA = &data NOPRINT;
VAR &vars;
OUTPUT OUT = &lib..train_dist_&stat &stat=;
RUN;
DATA
The data set that contains the training data.
VARS
The list of variables that will be checked. Note that the default is all numeric variables.
Also note that here we use the very practical logical variable _NUMERIC_.
LIB
The SAS library where the statistics describing the distributions will be stored.
STAT
The descriptive statistic that will be used for comparison. The default is MEDIAN. Valid
values are those statistics PROC MEANS can calculate (see SAS Help and
Documentation for details).
The macro creates data sets in the &lib library with the name TRAIN_DIST_&stat._TP.
The macro %SCOREDISTRIBUTION calculates the same statistics for the score data sets and
joins these statistics to those of the training data set in order to perform a comparison.
%MACRO
ScoreDistribution(data=,vars=_NUMERIC_,lib=sasuser,stat=median,alert=0.1);
PROC MEANS DATA = &data NOPRINT;
VAR &vars;
OUTPUT OUT = &lib..score_dist_&stat &stat=;
RUN;
DATA &lib..compare_&stat;
MERGE &lib..train_dist_&stat._tp &lib..score_dist_&stat._tp;
BY variable;
DIFF = (Score_&stat - Train_&stat);
IF Train_&stat NOT IN (.,0) THEN
DIFF_REL = (Score_&stat - Train_&stat)/Train_&stat;
Alert = (ABS(DIFF_REL) > &alert);
RUN;
%MEND;
DATA
The data set that contains the score data.
VARS
The list of variables that will be checked. Note that the default is all numeric variables.
Also note that here we use the very practical logical variable _NUMERIC_.
LIB
The SAS library where the statistics describing the distributions will be stored.
STAT
The descriptive statistic that will be used for comparison. The default is MEDIAN. Valid
values are those statistics PROC MEANS can calculate (see SAS Help and
Documentation for details).
ALERT
The alert level as the absolute value of the relative difference between training and
scoring data.
The comparison data set lists, for each variable, the training and scoring means (Train_mean, Score_mean), the absolute and relative differences (DIFF, DIFF_REL), and the Alert flag.
23.8.4 Summary
We see that with a simple set of macros we can easily monitor changes in distributions and raise alerts. It is up
to the automation process in general how to act on these alerts, for example, whether
conditional processing will follow. We will discuss this in the next section.
General
In this section we will briefly touch on the possible automation of data mart creation. With this
topic, however, we are reaching the limits of this book’s scope. It is up to a company’s IT
strategists to determine how the creation of data marts will be automated.
It is possible to automate data mart creation with pure SAS code, as we will show in the following
sections.
Automating a SAS program is very closely related to the SAS macro language and also closely
related to efficient SAS programming in general. It would go beyond the scope of this book to go
into too much detail about this. If you are interested, see SAS Help and Documentation or SAS
Press books on this topic.
DATA customer_&MONTH_YEAR.;
SET source.customer_&MONTH_YEAR.;
-- some other statements --;
KEEP &Varlist;
RUN;
DATA _NULL_;
CALL SYMPUT('MONTH_YEAR',PUT(YEAR(TODAY()),4.)||PUT(MONTH(TODAY()),z2.));
RUN;
%PUT &MONTH_YEAR;
**********************************************************************
*** Example 3 -- Conditional Processing;
*********************************************************************;
%MACRO Master(limit=0);
DATA _NULL_;RUN; *** Start with a valid statement to initialize
SYSERR;
%IF &syserr <= &limit %THEN %DO; PROC PRINT DATA = sashelp.class;
RUN; %END;
%IF &syserr <= &limit %THEN %CREATE_DATASETS;;
%IF &syserr <= &limit %THEN %INCLUDE 'c:\programm\step03.sas';
%MEND;
%MASTER;
Note that the &SYSERR variable is not the only way to control the flow of a program. Generic
checks on data sets, such as the number of observations, can also be used to decide conditionally on the
next step in the program. The macros we introduced in the previous sections are well suited to
decide whether a next step, for example the scoring, will be performed or not.
24.1 Introduction
In this chapter we will look at what should be done and what should be avoided when building
data marts. The approach of this chapter is not to teach but to share experiences from practice.
The do's and don'ts, as they are called in this chapter, are divided into three sections: process do's
and don'ts, data mart handling do's and don'ts, and coding do's and don'ts.
At those stages it is, in most cases, sufficient to create simple descriptive statistics such as means
or quantiles for interval variables or frequencies for categorical variables and to discuss and
review these results from a business point of view. In many cases it makes sense to include
business experts for such a review. Business experts are those people who are involved in the
processes where data are collected and created, and who know whether data are plausible or not.
At these reviews, you should challenge the data mart content with questions like these:
Is it possible that patients have on average 79 days per year in the hospital?
Can customers with a certain tariff have a contract with our company that is shorter than 12
months?
Is it possible that bank customers have cash withdrawals on average of 3.493 per month?
In these cases a cross-check with data from the source systems or operative systems might be
needed in order to have the proof that the aggregated data are correct.
define analysis metadata (what is the analysis variable, what is the ID variable, and so
forth)
prepare statistical data such as replacing missing values, filtering extreme values,
transforming variables, grouping values, and optimal binning with relationship to the
targeting
explore the data graphically
These problems get severe if preliminary results have already been discussed in workshops with
interested parties and potentially promising analysis results have already been communicated
throughout different departments.
After data preparation, however, it makes sense to step back and check the content of the data as
described earlier, or to check the process of data creation, in the sense of asking where the data for a
certain column in the analysis data mart come from. Note that this break in the process does not
necessarily last days or weeks; it can be a short break for coffee or tea and a short review of the
data processing steps that have been performed so far. This does not take much time, but it can
save a lot of time.
24.3.1 Do Not Drop the Subject ID Variable from the Data Mart
At some point in the analysis the analyst probably wants to get rid of all unused variables. In this
case it often happens that all variables that are not directly used in the analysis as dependent or
independent variables are dropped from the table.
Also in this case it often happens that the subject ID variable is dropped from the data mart. This
is not necessarily a problem in the current analysis; but, if at a later stage additional data will be
joined to the table, or if scores for different models or different points in time will be compared,
the subject ID variable will be needed.
When working with data sets that have many observations, we tend to downplay the importance
of a few observations in the merge. The fact, however, that observations “disappear” during data
merges shows that we probably have problems with our definition of the merges or with the
consistency of the merging keys.
It makes sense to control the merge by using logical IN variables as data set options. The
following example shows the code:
DATA CustomerMart;
MERGE Customer (IN = InCustomer)
UsageMart (IN = InUsageMart);
BY CustID;
IF InCustomer;
RUN;
The preceding code ensures that all observations from the customer table are output to
CUSTOMERMART and only the matching observations from USAGEMART are output. If we
omit the restriction for the logical variable, we would have all observations, including the non-
matching cases, from the CUSTOMER and USAGEMART tables.
Also note that a warning in the SAS log that multiple lengths were specified for the BY variable
might be a cause of lost or unexpectedly added observations.
It makes sense to have equal lengths and formats for the merging key in the respective tables. For
more details, see SUGI paper 98-28, which is available at
http://www2.sas.com/proceedings/sugi28/098-28.pdf.
In the telecommunications industry the number of call minutes per month is a very central
variable and is commonly used in the unit “minutes”. In profiling customers, however, when we
look at monthly or yearly sums, it might make sense to convert the minutes to hours for better
comparability and interpretability.
The duration of phone calls per month is measured in minutes of use. MINUTESOFUSE is in this
case a variable name that tells you what it means.
RENAME usage = MinutesOfUse;
Even if some SAS procedures can carry static attributes through the aggregation process with a COPY
statement, this should be avoided, especially for large data sets for performance reasons,
but also for small data sets, in order to improve the structure of the data mart creation process.
This spreadsheet has one line per variable in the data mart. The following is a list of possible
relevant columns:
Variable name
Variable order number
Variable number in the data set
Type (categorical or interval)
Format
Upper and lower limit for interval variables
Expression for a derived variable
Label
Variable category
Source system
Number of missing values
Measurement type: interval, ordinal, nominal, binary
Variable usage: target, input
Profiling status: Is the variable a profiling variable?
Variable importance: Here a grading for the variables can be created in order to sort them
based on their suggested importance. For example, 0 can mean the variable will be
dropped from the analysis data mart.
Note that these spreadsheet lists have the following advantages:
The spreadsheet can then be used to create the formula by using the columns Variable_Name and
Formula. In the second example, an indicator variable is created without using an IF-THEN/ELSE
clause just by comparing the values.
The starting point for such a table can be the routing of the output of PROC CONTENTS to a SAS
table with the following statements:
PROC CONTENTS DATA = sashelp.citimon
OUT = VarList(KEEP = name type length
varnum label format);
RUN;
The resulting table can then be exported to Microsoft Office Excel, simply by right-clicking it in
the SAS Explorer window and selecting View in Excel from the menu. The table can then be
edited in Excel by sorting and creating new columns as described earlier.
The following code with a typo for the variable WEIGHT produces a log note that the variable WEIHGT is uninitialized:
DATA test;
SET sashelp.class;
Age_Weight = Age * Weihgt;
RUN;
Also note that the additional time and effort in coding a program that is well structured visually is
usually between 10 and 15% of the total time effort. The benefit is a more readable and
transparent program.
The following example shows how the ATTRIB statement can be used to define properties of two
new variables in a data set:
DATA class;
SET sashelp.class;
ATTRIB NrChildren LABEL = "Number of children in family"
FORMAT = 8.;
ATTRIB district LABEL = "Regional district"
FORMAT = $10.;
RUN;
In many cases the programmer is too “busy” to insert a comment line that would make the code
more readable. Keyboard macros can help here. For example, the following code can be
automated with one key:
****************************************************************
*** Author: Gerhard Svolba
*** Date:
*** Project:
*** Subroutine:
***
*** Change History:
***
****************************************************************;
Documenting code is fairly simple, and we have seen that we can define keyboard macros to insert templates.
The importance of well-documented code increases with the existence of one or more of the
following facts:
In these cases it often happens that a program is re-programmed from scratch rather than re-using
parts of the program.
For example, if we have the variable names in one spreadsheet column and the variable labels in
another, we can easily add a column that contains the characters =" and a column that contains the
closing " that are needed for the LABEL statement.
The contents of the cells A2:D6 then only have to be copied into the SAS Program Editor between
the keyword LABEL and a terminating semicolon.
Another example is the preparation of code for categories. If quotation marks have to be added
around many category values, it might be helpful to do this in a spreadsheet such as Excel. In this
case we export the list of values to Excel (in our case, column B) and copy down the quotation
marks in columns A and C.
We can also use the spreadsheet to copy down programming statements. In this case we export the
codes that are represented in column B and manually edit the other columns A, C, D, and E. Then
we copy the content of the cells to the SAS Program Editor to include the code in a program.
We see how we can make use of automatic copying in Excel when editing and preparing SAS code.
Part 5
Case Studies
Introduction
In this part we will cover four case studies for data preparation. These case studies refer to the
content we presented in earlier chapters and put together in the context of a concrete question. In
the case studies we will show example data and complete SAS code to create from the input data
the respective output data mart. The following case studies will be examined:
In Case Study 1—Building a Customer Data Mart, we will create a one-row-per-subject data mart
from various data sources. We will show how data from multiple-rows-per-subject tables have to
be aggregated to a one-row-per-subject structure and how relevant derived variables can be
created. Finally, we will create a table that can be used directly for analyses on a CUSTOMER
level such as predictive analysis or segmentation analysis.
In Case Study 2—Deriving Customer Segmentation Measures from Transactional Data, we will
see how we can create a one-row-per-subject customer data mart from tables with hierarchies
from a star schema. We will see how the data can be effectively joined together and how we
create derived variables for marketing analysis purposes.
In Case Study 3—Preparing Data for Time Series Analysis, we will deal again with data with
hierarchies. Here we have to prepare data marts for time series analysis by combining tables for
shops, sales organizations, products, product groups, and orders. We will also deal with
performance considerations and the selection of different aggregation levels.
In Case Study 4—Preparing Data in SAS Enterprise Miner, we will explore how SAS Enterprise
Miner can assist the data preparation process. We will show example process flows where we deal
with the following SAS Enterprise Miner nodes: Input Data Source node, Sample node, Data
Partition node, Time Series node, Association node, Transform Variables node, Filter node,
Impute node, Score node, Metadata node, SAS Code node, and Merge node.
Chapter 25
Case Study 1—Building a Customer Data Mart
25.1 Introduction
This chapter provides an example of how a one-row-per-subject table can be built using a number
of data sources. In these data sources we have in the CUSTOMER table only one row per subject;
in the other tables, we have multiple rows per subject.
The task in this example is to combine these data sources to create a table that contains one row
per customer. Therefore, we have to aggregate the multiple observations per subject to one
observation per subject.
Tables such as this are frequently needed in data mining analyses such as prediction or
segmentation. In our case we can assume that we create data for the prediction of the cancellation
event of a customer.
This example mainly refers to the content that was introduced in Chapters 8, 12, 17, and 18.
General
The business question is to create an analytical model that predicts which customers have a high probability of leaving the company. We want to create a data mart that reflects the properties and behaviors of customers. This data mart will also contain a variable that represents the fact that a customer left the company at a certain point in time. This data mart will then be used to perform predictive modeling.
We will define our snapshot date as June 30, 2003. This means that in our analysis we are allowed
to use all information that is known at or before this date. We use the month July 2003 as the
offset window and the month August 2003 as the target window. This means we want to predict
with our data the cancellation events that took place in August 2003. Therefore, we have to consider the VALUESEGMENT in the SCOREDATA table as of August 2003.
The offset window is used to train the model to predict an event that does not take place immediately after the snapshot date but some time after. This makes sense because retention actions cannot start immediately after the snapshot date; they start some time later, after the data have been loaded into the data warehouse and scored and the retention actions have been defined. For details, see Figure 25.1. Customers that cancelled in the offset month, July 2003, are excluded from the training data. Keeping them as non-event customers would reduce the signal of cancelling customers in the data, because we would then have cancelling customers as non-events in the offset window and as events in the target window.
In the score data, we have multiple observations because of scores for different points in
time.
In the account data, we aggregated data over time, but we have multiple observations per
customer because of different contract types.
In the call center data, we have a classic form of a transactional data structure with one-
row-per-call-center contact.
CUSTOMER
ACCOUNTS
LEASING
CALLCENTER
SCORES
These tables can be merged by using the unique CUSTID variable. Note that, for example, the
account table already contains data that are aggregated from a transactional account table. We
already have one row per account type that contains aggregated data over time.
The data model and the list of available variables can be seen in Figure 25.2.
Table 25.1 through Table 25.5 give a snapshot of the contents of these tables.
The other tables are considered to be snapshots as of 30JUN2003; for these tables we do not have to worry about the correct time span any more.
1. We will aggregate the tables that contain multiple rows per customer to one row per customer.
2. We will join the resulting tables together and create a CUSTOMER DATA MART table.
We use PROC MEANS to create an output data set ACCOUNTTMP that contains both
aggregations per customer and aggregations per customer and account type. Note that we have
intentionally omitted the NWAY option in order to have the full set of _TYPE_ categories in the
output data set.
PROC MEANS DATA = accounts NOPRINT;
CLASS CustID Type;
VAR Balance Interest;
OUTPUT OUT = AccountTmp(RENAME = (_FREQ_ = NrAccounts))
SUM(Balance) = BalanceSum
MEAN(Interest) = InterestMean;
RUN;
In the next step we use a DATA step with multiple output data sets to split this output into aggregations per customer (data set ACCOUNTSUM) and aggregations per customer and account type (data set ACCOUNTTYPETMP).
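A minimal sketch of such a DATA step, assuming the _TYPE_ values produced by the CLASS statement above (2 = aggregation by CUSTID only, 3 = aggregation by CUSTID and TYPE), could look like this:

DATA AccountSum(DROP = Type _TYPE_)
     AccountTypeTmp(DROP = _TYPE_);
   SET AccountTmp;
   IF _TYPE_ = 2 THEN OUTPUT AccountSum;          /* aggregations per customer                  */
   ELSE IF _TYPE_ = 3 THEN OUTPUT AccountTypeTmp; /* aggregations per customer and account type */
RUN;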
Finally, we need to transpose the data set ACCOUNTTYPETMP with the account types to a one-row-per-customer structure:
PROC TRANSPOSE DATA = AccountTypeTmp
OUT = AccountTypes(DROP = _NAME_ _LABEL_);
BY CustID;
VAR BalanceSum;
ID Type;
RUN;
Then we aggregate the call center data per customer using the FREQ procedure:
PROC FREQ DATA = callcenter NOPRINT;
TABLE CustID / OUT = CallCenterContacts(DROP = Percent RENAME =
(Count = Calls));
WHERE datepart(date) < &snapdate;
RUN;
Note that we have used the INTNX function, which allows the relative definition of points in time
in a very elegant way, for various purposes:
We align the values of DATE in the SCORES table to the END of each month. In this
case we actually move the date values that come with the first of the month to the end of
each month in order to be able to compare them with our snapshot date, which is set to
the end of the month.
We calculate how many months a certain date is before or after our specified snapshot
date. In our case the first INTNX function will evaluate to 01MAY2003, and the second
will evaluate to 01AUG2003.
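The following sketch (with hypothetical variable names) illustrates both uses of INTNX relative to the snapshot date 30JUN2003:

DATA _null_;
   snapdate = "30JUN2003"d;
   /* Align a score date given as the first of a month to the end of that month */
   aligned = INTNX('MONTH', '01JUN2003'd, 0, 'END');   /* = 30JUN2003 */
   /* Define points in time relative to the snapshot date */
   before  = INTNX('MONTH', snapdate, -1);             /* = 01MAY2003 */
   after   = INTNX('MONTH', snapdate, 2);              /* = 01AUG2003 */
   PUT aligned= DATE9. before= DATE9. after= DATE9.;
RUN;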
First we define a macro variable for the snapshot date in order to have a reference point for the calculation of age. (Note that we repeat this macro variable assignment here for didactic purposes.)
%let snapdate = "30JUN2003"d;
Next we start defining our CUSTOMERMART. We use ATTRIB statements in order to have a detailed definition of the data set variables by specifying their order, format, and label. Note that we aligned the FORMAT= and LABEL= options for a clearer picture.
DATA CustomerMart;
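A minimal sketch of such ATTRIB statements is given here; the variables are a selection of those used in the following steps, and the formats and labels are assumptions:

ATTRIB CustID     FORMAT = 8.   LABEL = "Customer ID"
       NrAccounts FORMAT = 8.   LABEL = "Number of accounts"
       BalanceSum FORMAT = 12.2 LABEL = "Total account balance"
       Calls      FORMAT = 8.   LABEL = "Number of call center contacts"
       Complaints FORMAT = 8.   LABEL = "Number of complaints"
       Cancel     FORMAT = 8.   LABEL = "Target: cancellation event";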
Next we merge the data sets that we created in the previous steps and create logical variables for some tables:
MERGE Customer (IN = InCustomer)
AccountSum (IN = InAccounts)
AccountTypes
LeasingSum (IN = InLeasing)
CallCenterContacts (IN = InCallCenter)
CallCenterComplaints
ScoreFuture(IN = InFuture)
ScoreActual
ScoreLastMonth;
BY CustID;
IF InCustomer AND InFuture;
Missing values that can be interpreted as zero values are replaced with zeros in the next step. Such missing values occur, for example, after a transposition if a subject does not have an entry for a certain category; this can be interpreted as a zero value.
ARRAY vars {*} Calls Complaints LeasingValue LeasingAnnualRate
Loan SavingsAccount Funds NrLeasing;
DO i = 1 TO dim(vars);
IF vars{i}=. THEN vars{i}=0;
END;
DROP i;
/* Accounts */
HasAccounts = InAccounts;
IF BalanceSum NOT IN (.,0) THEN DO;
LoanPct = Loan / BalanceSum * 100;
SavingsAccountPct = SavingsAccount / BalanceSum * 100;
FundsPct = Funds / BalanceSum * 100;
END;
/* Leasing */
HasLeasing = InLeasing;
/* Call Center */
HasCallCenter = InCallCenter;
IF Calls NOT IN (0,.) THEN ComplaintPct = Complaints / Calls *100;
/* Value Segment */
Cancel = (FutureValueSegment = '8. LOST');
ChangeValueSegment = (ValueSegment NE LastValueSegment);
RUN;
Note that we calculated the target variable CANCEL from the fact that a customer has the
VALUESEGMENT entry ‘8. LOST’.
We include in the final customer data mart only those customers that are in the CUSTOMER table as of June 30 and that are still active at the beginning of the target window. We infer this from the fact that the CUSTID appears in the SCOREFUTURE table.
In the case of the value segment change, we simply compare the current and the last value
segment. It would, however, be more practical to differentiate between an increase and a decrease
in value segment. This would be more complicated because we would need to consider all
segment codes manually. Here it would be more useful if we had the underlying numeric score
available, which would allow a simple numeric comparison.
Referring to Section 16.4, “Time Intervals,” we see that we used the “dirty method” to calculate
customer age. We divided the time span by the average number of days per year (= 365.2422).
For this type of analysis this method is sufficient because we need age only as a rough input
variable for analysis.
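As an illustration only (the name of the date-of-birth variable is an assumption), the calculation inside the CUSTOMERMART DATA step could look like this:

/* "Dirty method": divide the day difference by the average number of days per year */
Age = (&snapdate - Birthdate) / 365.2422;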
Table 25.8: Call center data, score data, and the target variable in the CUSTOMERMART
Note that we used data that we found in our five tables only. We aggregated and transposed some
of the data to a one-row-per-subject structure and joined these tables together. The derived variables that we defined give an accurate picture of the customer attributes and allow as good an inference as possible about customer behavior.
C h a p t e r 26
Case Study 2—Deriving Customer Segmentation
Measures from Transactional Data
General
In this case study we will examine point-of-sales data from a “do-it-yourself” retail shop. We
have a transaction history of sales with one row for each sale item. Additionally, we assume that
each customer has a purchasing card so that we can match each purchase to a customer in order to
analyze repeating sales events. These data are usually stored in a star schema (see Chapter 6 –
Data Models) and are the basis for reporting tasks, either in relational or multidimensional form.
We will show how we can use the transactional point-of-sales data to calculate meaningful
derived variables for customer segmentation. We will do this by creating derived variables on the
CUSTOMER level. These variables can further be used to categorize and segment customers on
the basis of one variable or a set of variables.
The following list shows segmentation variables that turned out to be useful for customer
segmentation in retail. We will structure the list of variables by so-called “segmentation axes,”
each segmentation axis being the logical category of a set of variables.
The difference is that here we propose a set of variables for each category in order to allow for a
finer selection of segmentation criteria and that we also include information on product usage. In
our example we will classify the product usage only by the product’s group proportion on the
overall sales amount. Another possibility would be to use market basket analysis to include
indicators for the most relevant product combinations in the customer segmentation.
Note that for simplicity, no separate entity is used for the TIME dimension. Furthermore, the
PRODUCT dimension is already de-normalized because we already have the names and IDs of
PRODUCT_GROUP and PRODUCT_SUBGROUP in one table.
If a product has been sold in a promotion period, a link to the PROMOTION dimension exists.
Otherwise, no PROMO_ID is present for the record. In the case of a return of a product from the
customer, the QUANTITY has a negative sign.
Note that for simplicity the data model is truncated because we do not use the full range of
variables that would be available. Tables 26.1 through 26.5 give a short overview of the content of
different tables.
Table 26.1: CUSTOMER table with only two attributes per customer
In Table 26.2, we also have a lookup table for PRODUCTGROUP and PRODUCTSUBGROUP.
Finally, the fact table POINTOFSALE includes all sale events on a SALE ITEM level.
General
We will now show how we can start from this data source to create a customer table that contains
customer information in a one-row-per-subject structure.
In our example for simplicity we will use the format for product group only. If we wanted to use
the format for product subgroup we would have to calculate the concatenation of
PRODUCTGROUP and PRODUCTSUBGROUP in the data.
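A sketch of how the sales per product group could be aggregated and transposed to one row per customer is shown below. It assumes that the POINTOFSALE table carries a PRODUCTGROUP ID, that a format PRODGRP. maps this ID to the group names (Tools, Car, Gardening), and that the result is the CUST_PG data set that is merged later:

*** Sales per product group;
PROC MEANS DATA = PointOfSale NOPRINT NWAY;
   CLASS CustID ProductGroup;
   FORMAT ProductGroup prodgrp.;
   VAR Sale;
   OUTPUT OUT = Cust_PG_Long(DROP = _TYPE_ _FREQ_) SUM(Sale) = Sale_Amount;
RUN;

PROC TRANSPOSE DATA = Cust_PG_Long
               OUT  = Cust_PG(DROP = _NAME_);
   BY CustID;
   VAR Sale_Amount;
   ID ProductGroup;
   FORMAT ProductGroup prodgrp.;
RUN;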
In the next step we calculate the number of distinct visit days for each customer:
*** Number Visits;
PROC SQL;
CREATE TABLE Visit_Days AS
SELECT CustID, COUNT(DISTINCT date) AS Visit_Days
FROM PointOfSale
GROUP BY CustID
ORDER BY CustID;
QUIT;
In this step we calculate the amount of sales in a promotion period. Note that the fact that an
article was bought within a promotion period is indicated by the presence of a PROMOID value.
*** Promotion Proportion;
PROC MEANS DATA = PointOfSale NOPRINT NWAY;
CLASS CustID;
VAR Sale;
WHERE PromoID NE .;
OUTPUT OUT = Cust_Promo(DROP = _TYPE_ _FREQ_) SUM(Sale)=Promo_Amount;
RUN;
In the same fashion we also calculate the number of returned products, which are indicated by a
negative quantity value.
*** Return ;
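/* Sketch only: analogous to the promotion step above. The output data set
   CUST_RETURN and the variable RETURN_AMOUNT are merged below; returned
   items are indicated by a negative QUANTITY. */
PROC MEANS DATA = PointOfSale NOPRINT NWAY;
   CLASS CustID;
   VAR Sale;
   WHERE Quantity < 0;
   OUTPUT OUT = Cust_Return(DROP = _TYPE_ _FREQ_) SUM(Sale)=Return_Amount;
RUN;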
Note that in both cases we immediately drop the _TYPE_ and _FREQ_ variables. We could keep the _FREQ_ variable if the number of returned or promotion items were needed in the segmentation.
We define a macro variable with the snapshot date, which we will use in the calculation of the
current age:
%LET SnapDate = "01JUL05"d;
We also extensively define the variables that are available in the segmentation data mart with the ATTRIB statement. Thus, we have a common place for the definition of the variables and control over their order in the data mart:
DATA CustMart;
Next we merge all the tables that we created before with the CUSTOMER table and create logical
variables for some data sets:
MERGE Customer(IN = IN_Customer)
Cust_pos
Visit_days
Cust_promo(IN = IN_Promo)
Cust_return(IN = IN_Return)
Cust_pg;
BY CustID;
IF IN_Customer;
In the next step we create an ARRAY statement to replace missing values for some variables:
ARRAY vars {*} tools car gardening Promo_Amount Return_Amount ;
DO i = 1 TO dim(vars);
IF vars{i}=. THEN vars{i}=0;
END;
DROP i;
/* Card Details */
AgeCardIssue = (Card_Start - birthdate)/365.25;
AgeCardMonths = (&snapdate - Card_Start)/(365.25/12);
/* Product Groups */
ToolsPct = Tools/ Total_Sale *100;
CarPct = Car / Total_Sale *100;
GardeningPct = Gardening/ Total_Sale *100;
RUN;
Note that for a long list of product groups using an ARRAY statement is preferable.
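A sketch of such an ARRAY-based calculation, using the three product groups of this example, could replace the three assignment statements above:

ARRAY pg_sale {*} Tools Car Gardening;
ARRAY pg_pct  {*} ToolsPct CarPct GardeningPct;
DO i = 1 TO dim(pg_sale);
   IF Total_Sale NOT IN (.,0) THEN pg_pct{i} = pg_sale{i} / Total_Sale * 100;
END;
DROP i;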
Finally, we obtain a data mart, which is shown in Tables 26.6 through 26.8.
Table 26.6: The variables of the axis: customer demographics, card details, and visit
frequency and recency
Table 26.7: The variables of the axis: sales and profit, promotion, and return
Examples
The preceding results can be used in a variety of ways:
Reporting and descriptive customer analytics: The data mart can be used to answer
questions such as the following:
What is the distribution of customer age for active and inactive customers?
What is the average time interval since a customer last used his purchase card?
How many customers buy more than 20% of their sales in promotions?
The advantage of this data representation is that the preceding questions can be answered without
complex joins and subqueries to other tables.
Data mining and statistical analysis: The data mart can be the input table for data mining
analyses to answer questions such as the following:
Which customer attributes are correlated with the fact that a customer buys products
in certain product groups or has a high product return rate?
In which clusters can my customers be segmented based on certain input variables?
Further variables
Note that the list of variables can be easily increased if we start considering the time axis. For
example,
27.1 Introduction
In this case study we will show how we prepare data for time series analysis. We will use example
data from a wholesale trader in the computer hardware industry. In this case study we will show
how we can use these data to create longitudinal data marts that allow time series analysis on
various hierarchical levels.
Unlike the two previous case studies, where we created one-row-per-subject data marts with a number of derived variables, the focus of this case study is to prepare the data at the appropriate aggregation level with a few derived variables in order to perform time series analysis.
An exemplary data model for demand forecast data with hierarchies. This data model is
the basis for our analysis data.
SAS programs that move from transactional data to the most detailed aggregation.
SAS programs that create aggregations on various hierarchical levels.
The differences in performance and usability between PROC SQL aggregations and
PROC MEANS or PROC TIMESERIES aggregations.
SAS programs that create derived variables for time series modeling.
In this case study we will see the following:
That using SAS formats instead of full names saves run time and disk space.
That PROC SQL can aggregate and join tables in one step. It is, however, slower than the
corresponding DATA step and SAS procedure steps.
General
Our business context is the wholesale computer industry. We consider a manufacturer of
hardware components for computers. The manufacturer runs shops in various countries that sell
products to its customers (re-sellers). These customers then re-sell the hardware components to consumers. Each shop belongs to a sales organization, which is responsible for a number of shops. The shops pass the re-seller orders through to the central stock. From this central stock, the products are sent directly to the re-sellers.
For each order, we know the shop that has taken the order, the day, the product, the order type, the amount, and the price. Note that because a shop can get several orders per day for a product, we can have duplicate lines per SHOP, PRODUCT, DAY, and ORDERTYPE in the data.
The business question in our case study is to produce monthly demand forecasts on the following levels:
OVERALL
SALESORGANIZATION
SHOP
Tables
We have the following tables as the data basis for our case study:
Sales organizations
Shops
Orders
OrderType
ProductGroups
Products
The physical data models for these six tables are shown in Figure 27.1.
The table ORDERTYPE has only two observations, which are represented by the following SAS
format:
PROC format lib=sales;
VALUE OrdType 1 = 'Direct' 2 = 'Indirect';
RUN;
General
In Table 27.3, rows 13 and 14 are duplicates. As mentioned in Section 27.2, “The Business Context,” this can happen when one shop receives more than one order for a product per day.
In Chapter 10 – Data Structures for Longitudinal Analysis, we saw that transactional data can
have observations that can be aggregated to derive the finest granularity. In our case the
aggregation to the finest granularity level would be DATE, SHOP, PRODUCT, and
ORDERTYPE. This means that we would summarize rows 13 and 14 in Table 27.3 by summing
QUANTITY to 10 and averaging the two prices to 46.24.
However, the finest granularity level would not solve our business question, because we need the
data on the MONTHLY level for monthly forecasts. Therefore, we skip the creation of the finest
granularity level and aggregate the data to the most appropriate aggregation level, which is
SHOP, PRODUCT, ORDERTYPE, and MONTH.
In order to aggregate from daily data to monthly data we use the following statement:
put(date,yymmp7.) AS MonthYear. We mentioned in Chapter 21 – Data Preparation
for Multiple-Rows-per-Subject and Longitudinal Data Marts, that the sole use of
FORMAT = yymmp7. is not sufficient in PROC SQL for aggregation levels and that the
PUT function can be used. In this case, however, the result is a character variable to which no date functions can be applied. An alternative is to use the following code:
mdy(month(date),1,year(date)) as MonthYear format = yymmp7.
In the SELECT clause, we use the variables SHOPID, PRODUCTID, and so on, instead
of SHOPNAME and PRODUCTNAME. To display the names instead of the ID, we
specify the appropriate format in the FORMAT = expression. The reason is that we
decrease both disk space and run time when using IDs and formats. In our example with
732,432 observations in the ORDERS table and 52,858 observations in the aggregated
table, the difference in run time and necessary disk space is shown in Table 27.6.
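For reference, the PROC SQL version of this aggregation could look like the following sketch; the format name SHOP. is an assumption (PRODUCT. and ORDTYPE. appear elsewhere in this chapter):

proc sql;
   create table orders_month_sql as
   select shopid                        format = shop.,
          productid                     format = product.,
          ordertype                     format = ordtype.,
          mdy(month(date),1,year(date)) as MonthYear format = yymmp7.,
          mean(price)                   as price,
          sum(quantity)                 as quantity
   from sales.orders
   group by shopid, productid, ordertype, calculated MonthYear;
quit;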
Because we are looking at only a small example and already observe these differences, it is
obvious that it makes sense to use the ID and SAS formats method. For this method, however,
SAS formats are needed. Their creation is shown in the next section.
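One common way to build such formats from the lookup tables, sketched here under the assumption that the PRODUCTS table contains the columns PRODUCTID and PRODUCTNAME, is the CNTLIN= option of PROC FORMAT:

data product_fmt;
   set sales.products(keep = productid productname
                      rename = (productid = start productname = label));
   retain fmtname 'product' type 'n';
run;

proc format lib = sales cntlin = product_fmt;
run;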
In our example we assume that we do not have indexes created on our BY variables and that we do not use a SAS Scalable Performance Data Engine library; therefore, we need to explicitly sort the data before merging.
Then we aggregate the data on a SHOP, PRODUCT, ORDERTYPE, and MONTH level using
PROC MEANS:
PROC means data = sales.orders nway noprint;
class shopid productid ordertype date;
var quantity price;
format date yymmp7. price 8.2 quantity 8.;
output out = orders_month mean(price)=price
sum(quantity)=quantity;
run;
Next we merge the results of the aggregation with the results of the PRODUCTS and
PRODUCTGROUPS merge:
PROC sort data = p_pg; by productid; run;
PROC sort data = orders_month ; by productid ;run;
data p_pg_o;
merge p_pg orders_month(in=in2);
by productid;
if in2;
run;
And finally we merge these results with the results of the SHOPS and SALESORGS merge:
PROC sort data = s_so ; by shopid ;run;
PROC sort data = p_pg_o ; by shopid productid ordertype date;run;
data sales.ordermart_datastep;
merge s_so p_pg_o(in=in2);
by shopid;
if in2;
run;
Note that we created a lot of intermediate steps and temporary data sets. This version, however,
takes only 5 seconds of real time, whereas the PROC SQL version needed 32 seconds. The
reasons for the difference are that PROC MEANS is extremely fast in data aggregation, and after
the aggregation the joins are done only on data sets with approximately 50,000 observations. Also
note that these performance tests were performed on a laptop with one CPU, so parallelization of
PROC MEANS compared to PROC SQL is not the reason for the performance difference.
General
In order to compare the performance of PROC SQL, PROC MEANS, and PROC TIMESERIES, a
data set ORDER10 was created, where the number of observations was multiplied by 10. This
results in 7,324,320 observations. The PRODUCTIDs have been amended so that there are 264,290 distinct combinations of PRODUCT, SHOP, ORDERTYPE, and DATE.
In order to compare the performance on sorted and unsorted data, a sorted and a randomly sorted
version of ORDER10 was created.
Results
Table 27.7 shows the real run time for the different scenarios.
27.5.1 Summary
The characteristics of the different methods are as follows. A plus sign (+) indicates an advantage;
a minus sign (–) indicates a disadvantage.
PROC SQL
PROC MEANS
PROC TIMESERIES
General
With the aggregation ORDERMART that we created earlier, we have the data in an appropriate
form to perform time series analysis on the levels PRODUCT, SHOP, ORDERTYPE, and
MONTH. This means that in PROC AUTOREG or PROC UCM, for example, we perform a time
series analysis on the cross-sectional BY groups PRODUCT, SHOP, and ORDERTYPE.
In our case, however, the aggregation ORDERMART is only the starting point for additional
aggregations that we need in order to answer our business question. Earlier we defined that the
business question in our case study is to produce monthly demand forecasts on the following
levels:
OVERALL
SALESORGANIZATION
SHOP
We also use PRODUCTGROUPID in the SELECT clause in order to copy its values
down to the PRODUCT level.
FORMAT = is used to assign the appropriate format to the values.
The PRICE is not summed but averaged in order to copy its value down to other
aggregation levels.
This has the advantage that the aggregation is not physically created and therefore disk space is saved. In contrast, however, all analyses on this VIEW take longer because the VIEW has to be resolved. The use of views is of interest if a number of similar aggregations, each producing tables with many resulting observations, are created for preliminary analysis and each single view is accessed only rarely.
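For illustration, a view for the SALESORGANIZATION level could be defined as follows; the variable name SALESORGID, the format SALESORG., and the exact grouping variables are assumptions:

proc sql;
   create view sales.ordermart_salesorg as
   select salesorgid     format = salesorg.,
          productgroupid format = pg.,
          monthyear      format = yymmp7.,
          mean(price)    as price,
          sum(quantity)  as quantity
   from sales.ordermart
   group by salesorgid, productgroupid, monthyear;
quit;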
The resulting table, which is the same for both versions, is shown in Table 27.8.
The remaining aggregations to answer the business question can be created analogously.
If we want to use this in PROC SQL, we need to use the PUT function as we mentioned in Section 21.4, “Aggregating at Various Hierarchical Levels.” An alternative, in the case of quarters, would be to calculate the relevant months with an expression:
mdy(ceil(month(monthyear)/3)*3, 1, year(monthyear))
AS QTR_YEAR FORMAT = yyq7.
Note that in this case the calculation is redundant because a SAS format exists. It is shown for didactic purposes only, to demonstrate that a clever combination of functions can be an alternative. The result is shown in Table 27.9.
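A sketch of a quarterly aggregation that uses this expression (the grouping variables chosen here are an assumption) could look like this:

proc sql;
   create table ordermart_qtr as
   select productgroupid format = pg.,
          mdy(ceil(month(monthyear)/3)*3, 1, year(monthyear))
             as Qtr_Year format = yyq7.,
          mean(price)    as price,
          sum(quantity)  as quantity
   from sales.ordermart
   group by productgroupid, calculated Qtr_Year;
quit;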
General
In the introduction to this chapter we mentioned that this case study focuses more on appropriate
aggregation than on the creation of cleverly derived variables. We will, however, show two examples of derived variables.
Consider a promotion that took place from September 1, 2004 through November 30, 2004. The
indicator variable PROMOTION can be created with the following DATA step:
DATA SALES.ORDERMART;
SET SALES.ORDERMART;
IF '01SEP2004'd <= monthyear <= '30NOV2004'd
THEN Promotion =1;
ELSE Promotion = 0;
RUN;
The number of shops in which a product is sold is obviously a very important predictor for the demand forecasts. It is also usually a variable that can be used in prediction because
the approximate number of shops that are allowed to sell a product is known for the future.
In some cases, however, the policies that regulate whether a shop is allowed to sell a certain
product are not known for the past, especially if the past extends over several years. In this case
the number of shops can also be calculated from the data for the past and added to the order data
mart. Note that here we are talking only of the shops that effectively sold the article. However,
this is usually a good predictor.
In our example we calculate the number of shops (counting the SHOPID) per product and year:
PROC sql;
create table sales.nr_shops as
select productid,
mdy(1,1,year(monthyear)) as Year format = year4.,
count(distinct shopid) as Nr_Shops
from sales.ordermart
group by productid,
calculated Year;
quit;
Because we know that the sales policies change only once every year, we calculate the distinct
number on a yearly basis, as shown in Table 27.11.
This table can now be merged with the ORDERMART with the following statements:
PROC sql;
create table sales.ordermart_enh
as
select o.*,
n.Nr_Shops
from sales.ordermart as o
left join sales.nr_shops as n
on o.productid = n.productid
and year(o.monthyear) = year(n.year);
quit;
Note that because the variable YEAR in Table 27.11 is formatted only to YEAR but contains a
full date value, we need to specify year(o.monthyear) = year(n.year) instead of
year(o.monthyear) = n.year.
If we calculate aggregations on a calendar date such as the number of weekdays per month, we
would aggregate them to the ORDERMART in the same fashion. Note that we cannot use SAS
formats in this case because the new calculated value depends on both SHOPID and YEAR. If we
had aggregates only on SHOPID or YEAR we could use a format to join them to the data.
General
In time series analysis we want to produce forecasts for future months, also often called lead
months. If we want to base our forecasts on exogenous variables such as PRICE and
PROMOTION, we need to have these variables populated for historic and future periods. In
Section 27.7, “Derived Variables,” we saw how to create derived variables for historic periods.
To create this information for future periods we need a list of products that will be forecasted.
Additionally, we need information about future promotion periods or price levels.
Coding
In Table 27.12 we see a list of products that will be forecasted, and we also have a new price level
for the future.
Starting from this table we create a table called LEAD_MONTHS that contains the observations
for future periods, where the new price level is used and promotional periods from September to
November are indicated.
%LET FirstMonth = '01Apr2005'd;
DATA sales.Lead_Months;
FORMAT ProductID product. ProductGroupID pg.
Monthyear yymmp7. Quantity 8.;
SET sales.lead_base;
DO lead = 1 TO 12;
Quantity = .;
MonthYear = intnx('MONTH',&FirstMonth,lead);
IF month(MonthYear) in (9,10,11) THEN Promotion = 1;
ELSE Promotion = 0;
OUTPUT;
END;
DROP lead;
RUN;
Note that our first future month is April 2005, and we are creating 12 lead months per product.
The preceding code results in Table 27.13, where we also see values for the input variables
PRICE and PROMOTION.
This table is then stacked to Table 27.11. From this table SAS Forecast Server and SAS High-
Performance Forecasting can start to create ARIMAX and UCM models and produce forecasts for
the next 12 months.
In these procedures, the variable MONTHYEAR is the ID variable; the variable QUANTITY is
the analysis variable VAR; variables such as PROMOTION, NR_SHOPS, or PRICE are potential
explanatory variables; and the other variables serve as variables to define cross-sectional BY
groups in the BY statement.
C h a p t e r 28
Case Study 4—Data Preparation in SAS Enterprise
Miner
28.1 Introduction
General
In this case study we will show how SAS Enterprise Miner can be used for data preparation. In
earlier chapters we referred to SAS Enterprise Miner for data preparation tasks. Here we will
illustrate which functionality of SAS Enterprise Miner is useful for data preparation.
The following examples and descriptions refer to SAS Enterprise Miner 5.2.
General
When introducing SAS Enterprise Miner for data preparation we will consider the following
nodes. We will briefly introduce these nodes and explain their functionality for data preparation.
For more details, see the SAS Enterprise Miner Help.
The Transform Variables node also allows performing transformations and defining groups that maximize the predictive power if a target variable is specified. These methods can also be implemented in normal coding; however, they then require much more manual work and pre-analysis of the data.
In addition to the Transform Variables node, nodes such as the Variable Selection node or the
Tree node allow you to group data optimally for their relationship to the target variable.
All transformations that are performed by standard SAS Enterprise Miner nodes are converted
automatically into score code. Transformations that are performed in a SAS Code node have to be
prepared explicitly as SAS score code in the appropriate editor in the SAS Code node.
In the first node the data source and the variable metadata are defined. An example is shown in
Table 28.1.
Note that here we define the target variable in predictive modeling, reject variables, and define the
ID variable. For each input data source it is necessary to define the data set role. This is done in
the last step of the input data source or for each data source in the properties. An example is
shown in Table 28.2.
Note that for one-row-per-subject input data, the role RAW is usually selected. For multiple-rows-
per-subject, longitudinal, or transactional data, the role TRANSACTION must be selected. For
data sources that will be scored in SAS Enterprise Miner, the role SCORE must be defined.
In the Sample or Data Partition node we chose STRATIFIED sampling. Note that the role of the
target variable is automatically set to STRATIFICATION in order to perform a stratified
sampling for the target variable. An example is shown in Table 28.3.
The Transform Variables node allows the definition of interactions through a graphical user
interface, as shown in Figure 28.3.
We have to connect the Score node to the Decision Tree node in order to get the score code for
this model. Additionally, we defined a new data source FINANCECUSTOMERS_SCORE that
contains new data that will be scored. Note that the table role has to be set to SCORE for this data
set.
Table 28.4 shows how the data set can be displayed in the results of the Score node. Running the
Score node produces an output data set that contains the scores shown in Table 28.5.
Note that the score code creates a number of scoring variables. Depending on the setting “Use
Fixed Output Names” in the Score node, the variable that contains the predicted probability in
event prediction has the name EM_EVENTPROBABILITY or
P_<targetvariablename><targetevent>. For a complete definition of the variables that are created
during scoring and of the scoring process as a whole, see the SAS Enterprise Miner Help.
The Score node also generates the score code as SAS DATA step code. Figure 28.5 shows how to
retrieve this code from the results of the Score node.
Note that the SCORE CODE contains the score code for all nodes in the path, as shown in Figure
28.6.
This score code can be applied by saving it to a file and specifying the DATA, SET, and RUN
statements to create a SAS DATA step:
DATA scored_data;
SET input_data;
%INCLUDE 'scorecode.sas';
RUN;
We see that the TRANSACTIONS data set has aggregates on a monthly basis per customer.
These data can be further aggregated on a one-row-per-subject basis or they can be used in time
series analysis to derive variables that describe the course over time.
The data in the ACCOUNTS data set can be used for association analysis to derive the most
frequent account combinations.
We define the data sources in SAS Enterprise Miner and incorporate them into the process flow.
Note that both data sets need to have the role TRANSACTION, and the variable roles have to be
set as shown in Table 28.8 and Table 28.9.
In the Time Series node we specify to create derived variables for the autocorrelations. Note that
only one statistic can be exported per node. A separate node has to be used for each statistic per
subject. An alternative is to use PROC TIMESERIES in a SAS Code node and specify more
output statistics in the code.
In the Merge node we match-merge and select the appropriate BY variable CUSTID. This causes
the one-row-per-subject data set FINANCECUSTOMERS to merge with the one-row-per-subject
results of the Time Series node and the Association node to form one data set.
In the resulting data set we have variables from the data set FINANCECUSTOMERS,
autocorrelations per customer from the Time Series node, and indicators for selected product
combinations from the Associations node.
28.7 Conclusion
We see that using SAS Enterprise Miner in data preparation offers a number of advantages. The most important ones illustrated in this chapter are that data sources and variable metadata are defined in one central place, that transformations performed with standard nodes are automatically converted into score code, and that the process flow itself documents the data preparation steps.
A p p e n d i x A
Data Structure from a SAS Procedure Point of View
A.1 Introduction
In this appendix we will consider various data mart structures from a SAS language point of view.
We will investigate the data structure requirements for selected procedures in Base SAS,
SAS/STAT, SAS/ETS, SAS/GRAPH, and SAS/QC. We will also investigate which data mart
elements are primarily used by which generic statements in SAS procedures.
A.2.2 BY Statement
A BY statement produces separate analyses on observations in groups defined by the BY
variables. When a BY statement appears, the procedure expects the input data set to be sorted in
the order of the BY variables. The BY variables are one or more variables in the input data set.
The BY variable is usually a categorical or integer variable with discrete values.
Note that the use of the appropriate BY variables in the analysis on multiple-rows-per-subject data
marts or longitudinal data marts creates in many cases one-row-per-subject data marts, where the
subject entity moves one level lower in the hierarchy.
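As a simple illustration (the data set and variable names are hypothetical), the following step creates one aggregated row per BY group, that is, per customer:

PROC SORT DATA = sales_history; BY CustID; RUN;

PROC MEANS DATA = sales_history NOPRINT;
   BY CustID;
   VAR volume;
   OUTPUT OUT = volume_per_customer SUM(volume) = volume_sum;
RUN;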
Note that the FREQ statement assumes integer numbers, whereas the WEIGHT statement also accepts non-integer values.
A.2.4 ID Statement
With some procedures an ID statement can be specified to identify the observations in the analysis
results. The ID variable can be a variable that contains an identifier in the form of a key, as we defined it in Chapter 6 – Data Models. However, any other variable in the data set can also be used as an ID variable—for example, the variable AGE, which adds the age to each observation in the output.
Note that an observation is used in the analysis only if none of its variables, as defined in the preceding paragraph, is missing. In a multivariate analysis with a list of variables, the number of observations used in the analysis depends on all variables and can therefore be smaller than in the respective univariate analyses.
A.3.1 General
A few of the Base SAS procedures perform analytical tasks, so the differentiation of input data set structures makes sense for them.
The GLM procedure, for example, accepts input data in the case of repeated measurements either as a one-row-per-subject data mart with columns for the repeated observations or as a multiple-rows-per-subject data mart with the repeats in the rows. If it cannot be assumed that the subjects' measurements are uncorrelated across time, the one-row-per-subject data mart has to be used.
If the data are provided in an interleaved data structure, the WHERE statement allows the
filtering of the relevant rows of the table.
In the case of a cross-sectional data mart structure, a BY statement allows the creation of
independent analyses defined by the categories of the BY variables. Also cross-sections can be
filtered with a WHERE statement.
P charts can be created from so-called count data, which equals our definition of the longitudinal
data set. They can also be created from summary data, where the proportion is already pre-
calculated.
In the case of unequal subgroup sizes, the SUBGROUPN option in the p chart, u chart, or np chart
requires a variable name in the input data set that contains the subgroup size.
All analysis tools in the Model folder of SAS Enterprise Miner, as well as the Variable Selection node, the Cluster node, the SOM/Kohonen node, the Interactive Grouping node, and all SAS Text Miner tools require the one-row-per-subject structure for the data mart. The Link Analysis node can also analyze one-row-per-subject data mart data.
A.8.4 General
Note that in SAS Enterprise Miner the type of multiple-rows-per-subject data marts and
longitudinal data marts are referred to as transaction data sets. The type of data set has to be
defined in the respective Input Data Source node.
A p p e n d i x B
The Power of SAS for Analytic Data Preparation
B.1 Motivation
Data management and data preparation are fundamental tasks for successful analytics. We
discussed the role and importance of data management in Part 1 – Data Preparation: Business
Point of View. Here we want to emphasize again the fact that data management and analytics are
not sequential phases in the sense that no data management has to be done after analytics has
started. The contrary is true: after running the first analyses, we usually need to go back to the data, for example to correct data problems or to create additional derived variables, before the next analysis step.
SAS provides these features to the analyst by integrating its data management capabilities with its
analytic procedures. This allows the analyst to perform data management and analysis within one
environment without the need to move data between systems.
Data management and analytics are closely linked. During analysis the two phases are iterated
over and over until the final results are available. The power of SAS for analytics originates from
the fact that powerful data management and a comprehensive set of analytical tools are available
in one environment. Therefore, SAS assists the analyst in his natural working process by allowing
data management and analysis without barriers.
B.2 Overview
B.2.1 General
In addition to the fact that data management and analytic methods are integrated in one system,
SAS offers powerful methods of data management that exceed the functionality of SQL or
procedural extensions of SQL. This functionality includes the following:
the SAS language that includes the SAS DATA step and SAS procedures
the SAS macro language
SAS/IML
The SAS language and the SAS macro language are part of Base SAS. SAS/IML is an extra
module that allows the application of matrix operations on SAS data sets.
The SAS DATA step itself, for example, offers a number of possibilities to import text
files, log files, or event data from hierarchical databases.
SAS/ACCESS interfaces to relational databases such as Oracle, DB2, MS SQL Server, and Teradata allow native access to databases, which speeds up import and export times and provides automatic conversion of variable formats and column definitions.
SAS/ACCESS interfaces to industry standards such as ODBC or OLEDB allow data
import and export via these interfaces. Easy data access is therefore possible to any data
source that provides an ODBC or OLEDB interface.
The PC File Server allows access to popular PC file formats such as Excel, Access, Lotus, and dBase.
SAS provides interfaces, the so-called data surveyors, to enterprise resource planning
systems like SAP, SAP BW, PeopleSoft, and Oracle E-Business Suite.
For more details and code examples, see Chapter 13 – Accessing Data.
B.4.1 Overview
Transposing data sets is a significant task in preparing data sets. The structure of data sets might
need to be changed due to different data structure requirements of certain analyses, and data might
need to be transposed in order to allow a join of different data sets in an appropriate way.
Complete transposition: The rows and columns of a data set are exchanged. In other words, this type of unrestricted transposition flips a data set over its diagonal.
Transposing within BY groups: This is very important if data will be transposed per
subject ID.
Example: The CUSTOMER table in Table B.1 represents data in a multiple-rows-per-subject
structure. With the following statements, the multiple observations per subject are transposed to
columns. The results are given in Table B.2.
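A sketch of such a transposition, assuming that the CUSTOMER table holds the columns CUSTID and VALUE with several rows per customer, could look like this:

PROC SORT DATA = customer; BY CustID; RUN;

PROC TRANSPOSE DATA = customer
               OUT  = customer_wide(DROP = _NAME_)
               PREFIX = Value;
   BY CustID;
   VAR Value;
RUN;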
The major advantage of PROC TRANSPOSE as part of SAS, compared with SQL-based languages, is that the number and the numbering of repeated measurements are not part of the syntax.
Therefore, PROC TRANSPOSE can be used flexibly in cases where the number of repeated
measurements or the list of categories is extended.
Chapter 14 deals with different cases of transposing data sets in more detail.
The FIRST and LAST variables are very powerful for sequential data management operations on
a subject level—for example, for the creation of sequence numbers per subject.
The following code adds the columns LAG_VALUE and DIF_VALUE to the table, where
LAG_VALUE contains the LAG value of one row above and DIF_VALUE contains the
difference between the actual value and the value one row above.
DATA Series;
SET Series;
Lag_Value = LAG(Value);
Dif_Value = DIF(Value);
RUN;
/* Fill down CUSTID and AGE from the rows where they are present */
DATA customer_filled;
   SET customer_hierarchic;
   RETAIN custid_tmp age_tmp;
   /* Rows that carry the values: remember them in retained variables */
   IF CustID NE . THEN DO; custid_tmp = CustID; age_tmp = age; END;
   /* Rows with missing values: fill them from the retained variables */
   ELSE DO; CustID = custid_tmp; age = age_tmp; END;
   DROP custid_tmp age_tmp;
RUN;
B.6.3 Sampling
Sampling can be performed easily by using the random number functions in SAS. The following
example shows how to draw a 10% sample:
DATA customer_sample_10pct;
SET customer;
IF UNIFORM(1234) < 0.1;
RUN;
The following example shows how to draw two 20% samples that do not overlap:
DATA customer_sample1
Customer_sample2;
Set customer;
IF UNIFORM(1234) lt 0.2 THEN OUTPUT customer_sample1;
ELSE IF UNIFORM(2345) lt 0.2/(1-0.2) THEN OUTPUT customer_sample2;
RUN;
This output can also be directed into an output data set and used for further processing.
Example: For each retail customer, a time series over 12 months with the monthly sales volume is
available. PROC REG is used to calculate a linear trend over time of the sales volume for each
customer. The following code shows how the trend can be calculated and how the resulting regression coefficients can be merged with the CUSTOMER table:
PROC REG DATA = sales_history
OUTEST = sales_trend_coeff
NOPRINT;
MODEL volume = month;
BY CustID;
RUN;
QUIT;
The result is shown in Table B.9. For more details, see Chapter 18.
DATA customer_table;
MERGE customer
Sales_trend_coeff (KEEP = CustID month);
BY CustID;
RUN;
PROC REG calculates a linear regression for each customer according to the BY statement and stores the regression coefficients in the OUTEST= data set.
A macro variable such as &ACTUAL_MONTH can be used in a program to select the data set for the correct month. Let’s assume that the data set names are in the format CUSTOMER_200605, CUSTOMER_200606, and so on. The DATA step can be written as follows in order to use the appropriate data set:
DATA Customer_TMP;
SET customer_&Actual_Month;
---- Other SAS statements ---
RUN;
Macro variables, as in the following example, can be used to define a list of variables that will be
processed together:
%LET varlist = AGE INCOME NR_CARS NR_CHILDREN;
The macro variable VARLIST can be used as a replacement for the variable names.
We discussed the importance and application of this feature in Chapter 23 – Scoring and
Automation.
We specify the list of variables in a macro variable and run PROC FREQ:
%LET var = product country region;
PROC FREQ DATA = sashelp.prdsale;
TABLE &var;
RUN;
This creates the desired result; we receive a frequency analysis for each variable. However, if we
want to save the frequencies in an output data set, amending the code as in the following example
does not produce the desired results.
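The amended code presumably looked similar to the following sketch, in which only the last requested table ends up in the OUT= data set:

PROC FREQ DATA = sashelp.prdsale;
   TABLE &var / OUT = Freq_Table;  /* only the last table request (REGION) is written to FREQ_TABLE */
RUN;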
We need to specify a TABLE statement for each variable in order to output the results in an
output data set. However, in order to use the individual values of the VAR= macro variable, we need to use %SCAN to extract them:
%MACRO loop(data=,var=,n=);
PROC FREQ DATA = &data;
%DO i = 1 %TO &n;
TABLE %SCAN(&var,&i) / OUT = %SCAN(&var,&i)_Table;
%END;
RUN;
%MEND;
%LOOP(DATA = sashelp.prdsale,
VAR = product country region, N = 3);
We see from the SAS log that the macro with the %SCAN function creates the desired code:
MPRINT(LOOP): PROC FREQ DATA = sashelp.prdsale;
MPRINT(LOOP): TABLE product / OUT = product_Table;
MPRINT(LOOP): TABLE country / OUT = country_Table;
MPRINT(LOOP): TABLE region / OUT = region_Table;
MPRINT(LOOP): RUN;
NOTE: There were 1440 observations read from the data set
SASHELP.PRDSALE
NOTE: The data set WORK.PRODUCT_TABLE has 5 observations and 3
variables.
NOTE: The data set WORK.COUNTRY_TABLE has 3 observations and 3
variables.
NOTE: The data set WORK.REGION_TABLE has 2 observations and 3
variables.
NOTE: PROCEDURE FREQ used (Total process time):
real time 0.06 seconds
cpu time 0.06 seconds
A p p e n d i x C
Transposing with DATA Steps
General
We saw in Chapter 14 that the TRANSPOSE procedure can be used to transpose data in a very
flexible way and with a clear syntax. PROC TRANSPOSE and the macros we presented in
Chapter 14 are the first choice for data transpositions.
However, in situations where the number of subjects is in the hundreds of thousands, and the
number of variables or repetitions is in the thousands, performance issues might arise. In these
cases it is advisable to use a SAS DATA step instead of PROC TRANSPOSE. In addition to a
performance advantage with large data sets, the SAS DATA step also has the advantage that
additional data processing can be performed during transposition.
In order to illustrate the differences in performance, a short study was performed that compared
the time consumption for a PROC TRANSPOSE and a SAS DATA step transposition. In the
following sections, we compare the macros from Chapter 14 with a DATA step transposition. The
macro for the DATA step transposition is presented in the following sections.
For simplicity, we will again use the LONG and WIDE terminology for the data set shapes. A one-row-per-subject data set is called a WIDE data set in this context because the multiple observations are represented by columns. A multiple-rows-per-subject data set is called a LONG data set because it has one row for each observation of a subject.
For the performance tests, SAS data sets were created, where the transposed VAR variable was
filled with a random uniform number. Measurements at n=100,000, 200,000, and so on, were
taken with 90 repetitions of the measurement. The duration was measured in seconds by
comparing the start time and the end time. Note that the test data sets were already sorted by
subject ID.
C.1.4 Conclusion
The DATA step version is more time-efficient with increasing data volume. Note, for example, that in data mining analyses not only the repetitions of one variable have to be transposed, but those of a number of variables. In this case the performance difference quickly multiplies with the number of variables.
The decision whether the DATA step version or the PROC TRANSPOSE version will be selected
depends on the number of observations and repetitions that will be transposed and on how
important it is to shorten the processing time.
C.1.5 Macros
In the following sections we will introduce the following macros to transpose data using a SAS
DATA step:
DATA _null_;
SET distinct END = eof;
FORMAT _string_ $32767.;
RETAIN _string_;
_string_ = CATX(' ',_string_, &time);
IF eof THEN DO;
CALL SYMPUT('list',_string_);
CALL SYMPUT('max',_n_);
END;
RUN;
%MEND;
Using PROC FREQ to create a list of distinct measurement IDs and a _NULL_ DATA step to concatenate these measurement IDs into one macro variable. Note that from a performance point of view, PROC FREQ with an OUT= data set is much faster than the equivalent SELECT DISTINCT version in PROC SQL.
Using a SAS DATA step for the transposition. Here we define new variables with the
name from the macro variable &VAR and the numbering from the distinct measurement
IDs that are found in the variable &TIME. The variables are retained for each subject and
the last observation per subject is output to the resulting data set.
Note that the data must be sorted for the ID variable. The legend for the macro parameters is as
follows:
ID
The name of the ID variable that identifies the subject.
COPY
The list of variables that occur repeatedly with each observation for a subject and that
will be copied to the resulting data set. We assume here that COPY variables have the
same values within one ID.
VAR
The variable that contains the values that will be transposed. Note that only one variable
can be listed here in order to obtain the desired transposition. See below for an example
of how to deal with a list of variables.
TIME
The variable that enumerates the repeated measurements.
Note that the TIME variable does not need to be a consecutive number, but it can also have non-
equidistant intervals. See the following data set from an experiment with dogs.
%MAKEWIDE_DS(DATA=dogs_long,OUT=dogs_wide_2,
ID=id, COPY=drug depleted,
VAR=Histamine,
TIME=Measurement);
*** Load the number of items in &LIST into macro variable NVARS;
%LET c=1;
%DO %WHILE(%SCAN(&list,&c) NE);
%LET c=%EVAL(&c+1);
%END;
%LET nvars=%EVAL(&c-1);
%MEND;
The macro uses a DATA step to output one observation for each repetition per subject. The ROOT and TIME values are copied from the respective variables.
Note that the data must be sorted for the ID variable.
Note that the macro can run in two modes: the LIST mode and the FROM/TO mode. The two
modes differ in how the time IDs that form the postfix of the variable names that have to be
transposed are specified:
In the LIST mode a list of time IDs has to be explicitly specified in the macro variable
LIST = that contains all time IDs where an observation is available. Note that the time
IDs do not need to be consecutive numbers.
In the FROM/TO mode a consecutive list of time IDs is assumed and can be specified
with the MIN= and MAX= variables.
If a value is specified for the LIST= variable, the macro runs in LIST mode; otherwise, it runs in
FROM/TO mode.
ID
The name of the ID variable that identifies the subject.
COPY
The list of variables that occurs repeatedly with each observation for a subject and that
will be copied to the resulting data set. We assume here that COPY variables have the
same values within one ID.
ROOT
The part of the variable name, without the measurement number, of the variable that will
be transposed. Note that only one variable can be listed here in order to obtain the desired
transposition. See below for an example of how to deal with a list of variables.
TIME
The variable that will enumerate the repeated measurements.
MAX
The maximum enumerations of the variables in the WIDE data set that will be
transposed.
MIN
The minimum of the enumerations of the variables in the WIDE data set that will be
transposed.
LIST
The list of time IDs in brackets that is used to enumerate the variable names in the WIDE
data set (note that the variable names start with the string specified under ROOT=).
The following examples show two invocations of the macro, one in LIST mode and one in FROM/TO mode.
%MAKELONG_DS(DATA=dogs_wide,OUT=dogs_long2,COPY=drug depleted,ID=id,
ROOT=Histamine,TIME=Measurement,LIST=(0 1 3 5),MAX=4);
%MAKELONG_DS(data=wide,out=long,id=id,min=1,max=3,
root=weight,time=time);
%MEND;
Aggregating multiple observations per subject. This step is equivalent to the PROC TRANSPOSE macro version. Here we additionally create a list of distinct categories in the DISTINCT data set. Note that from a performance point of view, PROC FREQ with an OUT= data set is much faster than the equivalent SELECT DISTINCT version in PROC SQL.
Assigning the list of categories to a macro variable. Here we use a _NULL_ DATA step to create a macro variable that contains the list of categories. Note that the total length of all category names, including the separating blanks, must not exceed 32,767 characters.
Using a SAS DATA step for the transposition. Here we define variables from the
categories. The variables are retained for each subject and the last observation per subject
is output to the resulting data set.
Note that, unlike the PROC TRANSPOSE macro version, the DATA step version does not allow blanks or leading numbers in the categories!
ID
The name of the ID variable that identifies the subject.
VAR
The variable that contains the categories, for example, in market basket analysis, the
products a customer purchased.
analysis subject
an entity that is being analyzed. The analysis results are interpreted in the context of the
subject. Analysis subjects are the basis for the structure of the analysis table.
analysis table
a single rectangular table that holds the data for the analysis, where the observations are
represented by rows, and the attributes are represented by columns.
application layer
a logical layer that enables you to access data from a system using its business logic and
application functions.
business question
a type of question that defines the content and rationale for which an analysis is done.
database
an IT system that enables the insertion, storage, and retrieval of data.
data layer
a logical layer where data is directly imported from the underlying database tables.
data mart
a single rectangular table that holds the data for the analysis, where the observations are
represented by rows, and the attributes are represented by columns.
de-normalization
a process whereby data is redundantly stored in more than one table in order to avoid
joining tables.
derived variable
a type of variable whose values are calculated on the basis of one or more other variables.
dispositive system
a type of IT system that retrieves, stores, and prepares data in order to provide reports,
predictions, and forecasts to businesses.
longitudinal analysis
a type of analysis that is based on the time or the sequential ordering of observations.
modeling
the process of creating (learning) a model logic, e.g., in order to predict events and
values.
multiple observations
the existence of more than one observation for an analysis subject in the data.
one-to-many relationship
a type of relationship between two tables where the two tables include related rows. For
example, in table A and table B, one row in table A might have many related rows in
table B (see section 6.2).
operational system
an IT system that is designed to assist business operations by providing the technical
infrastructure to process data from their business processes.
relational model
a type of model that structures data on the basis of entities and relationships between
them (see section 5.5).
scoring
the process of applying the logic that has been created during modeling to the data.
star schema
a type of relational model. A star schema is composed of one fact table and a number of
dimension tables. The dimension tables are linked to the fact table with a one-to-many
relationship (see section 6.4).
transposing
a method of rearranging the columns and rows of a table.
Index

A
absolute frequencies of categories 202–207
    concatenating 206–207
    percentage variables 209–212
Access databases, importing data from 108
accessing data 105–113
ACCUMULATE option, ID statement 253
ad hoc external data 35
aggregating data 65–66, 376
    bootstrap aggregation (bagging) 99
    criteria for variables 93–94
    hierarchical relationships 67–68, 242–245
    longitudinal data structures 83
    pass-through statements 107
    static aggregation 179–184
aggregation of multiple observations 179–199
    concentration of values 187–189
    correlation of values 184–186
    derived variables for 194–199
    standardization of values 189–194
    static aggregation 179–184
ALERT= parameter
    ALERTCHARMISSING macro 284
    ALERTNUMERICMISSING macro 282
    SCOREDISTRIBUTION macro 289
%ALERTCHARMISSING macro 283–284
    ALERT= parameter 284
    DATA= parameter 284
    VARS= parameter 284
%ALERTNUMERICMISSING macro 281–283
    ALERT= parameter 282
    DATA= parameter 282
    VAR= parameter 282
aligning dates 247
analysis complexity 12–13
analysis data marts 271
analysis metadata 294
analysis paradigms 13–14
analysis process 3, 5–6
analysis subjects 52–53
    See also attributes
    anonymity of 53
    association analysis on properties of 236–237
    derived variables and 92–93
    first and last observations 372–373
    identifiers for 53, 98
    multiple categorical observations for 201–213
    number of rows for 56–59
    redefining 55
    removing duplicate entries for 125–126, 389–390
    standardizing by mean of 190–194
    when not available 59–60
analysis tables 17, 51–60, 73–74
    See also data marts, multiple rows per subject
    See also data marts, one row per subject
    data marts vs. 25
    identifiers for analysis subjects 53, 98
    multiple observations per analysis subject 53–59, 67–68, 179–199
    without analysis subjects 59–60
analysis team 19
analytic business questions 3–9
    characteristics of 12–19
    examples of 4–5
analytic data marts
    See data marts
analytic data preparation
    See data preparation
anonymity of analysis subjects 53
application layer 40
ARIMA procedure
    IDENTIFY statement 280
    LEAD= option 280
    scoring in 280
array processing of variables 375
ARRAY statement
    replacing missing values 127
    standardizing per subject 190
    variable division and 144
as much data as possible paradigm 14–15
association analysis 213
    data preparation for 234–237
    hierarchical data structure 235–236, 238–240
Association node (SAS Enterprise Miner) 348
ATTRIB statement 299–300
attributes
    See also categorical data and variables
    See also interval variable transformations
    interval data 90, 98–99
    looping through static attributes 296–297

data marts, one row per subject 56–59, 61–68, 75
    aggregating from 182
    building from multiple data sources (case study) 305–316
    creating from key-value tables 129–130
    putting information into single rows 63–66
    structure requirements 364, 367
    transposing to multiple-rows-per-subject 120–124
data marts, sampling
    See sampling
data mining 13–14
    See also predictive modeling
    oversampling 259, 261–263
data models 45–50
data normalization 49–50
    de-normalization of data 49–50, 63–66
DATA= parameter
    %ALERTCHARMISSING macro 284
    %ALERTNUMERICMISSING macro 282
    %CLUS_SAMPLE macro 265
    %CLUS_SAMPLE_RES macro 266
    %CONCENTRATE macro 188
    %CREATEPROPS macro 219
    %MAKELONG macro 122
    %MAKELONG_DS macro 388
    %MAKEWIDE macro 118
    %MAKEWIDE_DS macro 386
    %PROPSCORING macro 224
    %REMEMBERCATEGORIES macro 285
    %REMEMBERDISTRIBUTION macro 288
    %RESTRICTEDSAMPLE macro 261
    %SCOREDISTRIBUTION macro 289
    %TARGETCHART macro 228
    %TRANSP_CAT macro 126
    %TRANSP_CAT_DS macro 390
Data Partition node (SAS Enterprise Miner) 348
data preparation 233–253, 294–295, 369–381
    aggregating at hierarchical levels 242–245
    association and sequence analysis 234–237
    paradigms for 14–15
    reusing procedure output 377–378
    SAS/ETS procedures for 250–253
    time series data enhancement 238–241
data preparation case studies
    building customer data mart 305–316
    customer segmentation measures from transactional data 317–326
    in SAS Enterprise Miner 347–359
    time series analysis 327–345
data set options 298
data sets
    See also transposing data sets
    creating 105–113
    creating formats from 176–177
    imputed 160, 281
    long vs. wide data sets 116
    replacing missing values 127–128, 158–160
    sample data sets 257
data sources
    See also data preparation
    building data mart from (case study) 305–316
    characteristics of 21–27
    identifying 6
    origin of 39–43
    quality of 25–27
data standardization 190–194
DATA step 106, 109–110, 371
    converting between character and number formats 163
    IN variables in SET statement 165–166
    replacing missing values 127–128, 159
    transposing data with 383–390
    WHERE and KEEP statements 298
data structures 19, 361–367
    See also data marts
    requirements for 363–367
data structures, longitudinal
    See longitudinal data structures
data warehouse systems 23–24, 26–27
databases
    See relational database systems
DATDIF( ) function 148
date alignment 247
DATE( ) function 246
date( ) system function 147
de-normalization of data 49–50
    normalization of data 49–50
    one-row-per-subject data marts 63–66
DELIMITER= option, INFILE statement 109
derived variables 92–93, 140–146
    based on population means 144–146
    creating for predictive modeling 217–223
    efficiency of 94
    for aggregation of multiple observations 194–199
    for categorical data 164–166
    storing definitions in spreadsheets 297–298
    sufficiency of 94
DESCENDING option, RANK procedure 150
Example Code — Examples from This Book at Your Fingertips
You can access the example programs for this book by linking to its companion Web site at support.sas.com/companionsites. Select the book
title to display its companion Web site, and select Example Code and Data to display the SAS programs that are included in the book.
For an alphabetical listing of all books for which example code is available, see support.sas.com/bookcode. Select a title to display the book’s
example code.
If you are unable to access the code through the Web site, send e-mail to [email protected].
Comments or Questions?
If you have comments or questions about this book, you may contact the author through SAS as follows.
E-mail: [email protected]
See the last pages of this book for a complete list of books available through SAS Press or visit support.sas.com/publishing.
SAS Publishing News: Receive up-to-date information about all new SAS publications via e-mail by subscribing to the SAS Publishing News
monthly eNewsletter. Visit support.sas.com/subscribe.