SAS Questions
Solution to Case 1
This question is a classic case where Base SAS clearly beats SAS EG. This section shows a simple
solution for this case study.
data fetch_purchase;
merge list(in=a) purchase(in=b);
by customer_id;  /* both inputs must be sorted by customer_id */
if yearmonth = third_mon;
if a;            /* keep only customers that appear in list */
run;
proc datasets;
append base=final_dataset data=fetch_purchase force;
run;
quit;
%mend fetch_data;
Solution to Case 2
The solution to this problem is tiresome in SAS EG because its SQL routines offer no median function
after grouping data, and SQL routines are the foundation of data handling in SAS EG. In Base SAS,
however, this becomes quite easy. Let's see how it can be done.
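/* Count the transactions for each customer */
proc sql;
create table work.summarize as
select count(*) as trans_nos, customer_id
from work.table1
group by customer_id
order by customer_id;
quit;
/* Sort the transactions so the merge and the BY-group logic below are valid */
proc sort data=table1;
by customer_id amount;
run;
/* Attach the position of the median transaction to every row */
data add_total_trans;
merge table1 (in=a) summarize (in=b);
by customer_id;
/* middle position; for an even count this is the lower of the two middle rows */
median_no = ceil(trans_nos/2);
drop trans_nos;
run;
/* Keep only the row at the median position for each customer */
data final_list;
set add_total_trans;
by customer_id amount;
if first.customer_id then n = 0;
n + 1;  /* count before the subsetting IF, which skips the rest of the step when false */
if n = median_no;
run;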
The solution in base SAS for this question is not only effective but also time efficient.
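As a cross-check on the DATA step approach above, Base SAS can also compute the median directly. The following is a minimal sketch, not the method used in this case study: it assumes the same work.table1 with customer_id and amount columns, and the output names median_amount are made up for illustration. Note that for an even number of transactions PROC MEANS, with its default settings, averages the two middle values instead of picking one of them.

```sas
/* Alternative sketch: let PROC MEANS compute the median per customer.   */
/* work.median_amount and median_amount are hypothetical output names.   */
proc means data=work.table1 noprint nway;
    class customer_id;                 /* one output row per customer    */
    var amount;
    output out=work.median_amount (drop=_type_ _freq_)
           median=median_amount;
run;
```

Because CLASS is used instead of BY, the input does not need to be sorted first.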
End Notes
Both Base SAS and EG have their own pros and cons. The recommended strategy is to use both.
If you want to build a traditional query, use SAS EG to generate the code automatically. Then copy this
code into Base SAS and turn it into a generalized macro. The macro adds a new dimension to the
code: it lets you generalize the logic and avoid hard-coded values.
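As an illustration of this workflow, the sketch below wraps a query of the kind SAS EG might generate in a macro. The macro name summarize_by and its parameters are hypothetical, chosen only to show how table and column names can become parameters.

```sas
/* Hypothetical example: a grouped count query wrapped in a macro.       */
%macro summarize_by(in_table=, group_var=, out_table=);
    proc sql;
        create table &out_table. as
        select &group_var., count(*) as trans_nos
        from &in_table.
        group by &group_var.;
    quit;
%mend summarize_by;

/* Table and column names are passed in instead of being hard-coded.     */
%summarize_by(in_table=work.table1, group_var=customer_id, out_table=work.summarize);
```

The same macro can then be called for any table and grouping column without editing the query itself.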
Have you faced any other SAS problem in an analytics interview? Are you facing any specific problem
with SAS code? Do you think this provides a solution to a problem you face? Do you think there
are more optimized methods to solve the problems discussed? Do let us know your
thoughts in the comments below.
1. Merging data in SAS:
Merging datasets is one of the most important steps for an analyst. Merging can be done through both
the DATA step and PROC SQL. People usually ignore the difference in the methods SAS uses in the
two steps, because in most cases the two routines produce the same output.
Let's look at the following example:
Problem Statement: In this example, we have two datasets. The first table gives the product holding for
each household. The second table gives the gender of each customer in these households. You
need to find out whether the product is male-biased or neutral. A male-biased product is a product
bought by more males than females. You can assume that a product bought by a household belongs
to each customer of that household.
Thought process: The first step of this problem is to merge the two tables. In this case we need the
Cartesian product of the two tables within each household. After getting the merged dataset, all you
need to do is summarize it and find the bias.
Code 1:
Proc sort data = PROD out =A1; by household;run;
Proc sort data = GENDER out =A2; by household;run;
Data MERGED;
merge A1(in=a) A2(in=b);
by household;
if a AND b;
run;
Code 2:
PROC SQL;
Create table work.merged as
select t1.household, t1.type,t2.gender
from prod as t1, gender as t2
where t1.household = t2.household;
quit;
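Will both the codes give the same result?
The answer is NO. As you might have noticed, the two tables have a many-to-many mapping. The DATA
step merge pairs rows by position within each BY group, so a household with two products and three
members yields three rows rather than six. To get a Cartesian product within each household, we must
use PROC SQL. For one-to-one and one-to-many mappings, this inner join gives the same result with
either step.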
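The difference is easy to see on a toy example. The sketch below uses made-up data (one household with two products and three members) and hypothetical output names merged_data_step and merged_sql. The input tables are created already sorted by household, so no PROC SORT is needed here.

```sas
/* Made-up sample data: household 1 holds two products, has three members */
data prod;
    input household type $;
    datalines;
1 A
1 B
;
run;

data gender;
    input household gender $;
    datalines;
1 M
1 F
1 M
;
run;

/* DATA step merge: rows are paired by position within the BY group,     */
/* giving 3 rows for this household                                       */
data merged_data_step;
    merge prod(in=a) gender(in=b);
    by household;
    if a and b;
run;

/* PROC SQL join: every product row is paired with every member row,     */
/* giving 2 x 3 = 6 rows for this household                               */
proc sql;
    create table merged_sql as
    select t1.household, t1.type, t2.gender
    from prod as t1, gender as t2
    where t1.household = t2.household;
quit;
```

When running the DATA step, SAS also writes a note to the log that the MERGE statement has more than one dataset with repeats of BY values, which is a useful warning sign for an unintended many-to-many merge.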
Why do we use the DATA step MERGE at all?
On data that is already sorted, the DATA step MERGE is often much faster than PROC SQL. For big
datasets without a many-to-many mapping, prefer the DATA step MERGE.
2. Transposing datasets:
When working on transaction data, we frequently transpose datasets to analyze the data. There are two
kinds of transposition. The first is transposing from a wide structure to a narrow structure. Consider
the following example:
For the reverse kind of transposition, from a narrow structure to a wide one, the DATA step becomes
very long and time-consuming. The following is a much shorter way to do the same task:
Proc transpose data=transposed out=base (drop=_name_) prefix=Q;
by cust;
id period;
var amount;
run;
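To make the reshaping concrete, here is a small sketch with made-up data. It assumes the input table has one row per customer per period, with columns cust, period and amount, and that it is already sorted by cust. The values are purely illustrative.

```sas
/* Made-up narrow data: one row per customer per period */
data transposed;
    input cust period amount;
    datalines;
1 1 100
1 2 150
2 1 200
2 2 250
;
run;

/* One output row per customer; each period becomes a column Q1, Q2, ... */
proc transpose data=transposed out=base (drop=_name_) prefix=Q;
    by cust;
    id period;
    var amount;
run;
```

With this input, base holds two rows (one per customer) with the columns cust, Q1 and Q2.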
3. Passing values from one routine to another:
Imagine a scenario where we want to compare the total marks scored by two classes. The final output
should simply be the name of the class with the higher score. The scores of the two classes are stored
in two separate tables.
There are two methods of solving this problem. The first is to append the two tables and sum the total
marks for each of the classes. But if the number of students is very large, appending the two tables
only multiplies the operation time. Hence, we need a method to pass a value from one
table to another. Try the following code: