SAS Training - 101
2006, Cognizant Technology Solutions. All Rights Reserved. The information contained herein is subject to change without notice.
Agenda
Module 1
    Introduction to SAS
    Getting/Extracting Data in/from SAS
    Working with the Data
Module 2
    Introduction to SAS Proc Statements
    Combining and Modifying SAS Datasets
Module 3
    Proc SQL
    Arrays / DO-END
    Retain / First. Last.
Agenda Module 1
Introduction to SAS
Getting Started with SAS environment
The two parts of a SAS program
Reading the SAS Log
SAS Dataset
What is SAS ?
Data Warehousing - Easily access, manage and analyze data from many sources
Output window: contains reports generated by SAS procedures and DATA steps
Log window: contains information about the processing of the SAS program, including warning and error messages
The functionality of the SAS Explorer is similar to that of explorers in window-based systems. To open it, select View > Explorer.
Expand and collapse directories on the left; drill down and open specific files on the right
Right-click a SAS dataset and select Properties to see general information about the dataset
Double-click a dataset to open it in the VIEWTABLE window, which can be used to edit datasets, create datasets and customize the view of a SAS dataset
Access and edit existing SAS programs
Write new SAS programs
Submit SAS programs
Save SAS programs to a file
* Programs can also be executed without opening them in the SAS environment, using batch submit
Log Window
Output Window
SAS Programs
A SAS program is a sequence of steps that the user submits for execution
Raw Data -> DATA Step -> SAS Data Set -> PROC Step -> Output
DATA steps are used to CREATE SAS datasets
PROC steps are used to PROCESS SAS datasets
SAS Statements
Usually begin with an identifying keyword
Always end with a semicolon
Text that begins with /* and ends with */ is treated as a comment
SAS statements can be in upper or lower case
One or more blanks or special characters can be used to separate words
They can begin and end in any column
A single statement can span multiple lines
Several statements can be on the same line
DATA steps
Begin with DATA statements
Read and modify data
Create a SAS data set
PROC steps
Begin with PROC statements
Perform a specific analysis or function
Produce results or reports
PROC steps can also create data sets
A step ends when SAS encounters the start of a new step (a DATA or PROC statement) or a RUN statement
The DATA step executes line by line
Syntax errors include:
Misspelled keywords
Missing or invalid punctuation
Invalid options

The following program contains two syntax errors: the keyword DATA is misspelled as daat, and the PROC PRINT statement is missing its semicolon.

    daat work.staff;
        infile raw-data-file;
        input LastName $ 1-20 FirstName $ 21-30
              JobTitle $ 36-43 Salary 54-59;
    run;
    proc print data=work.staff
    run;

A program without these errors, followed by a PROC MEANS step:

    data work.staff;
        infile raw-data-file;
        input LastName $ 1-20 FirstName $ 21-30
              JobTitle $ 36-43 Salary 54-59;
    run;
    proc print data=work.staff;
    run;
    proc means data=work.staff mean max;
        class JobTitle;
        var Salary;
    run;
Unbalanced quotation marks: open and submit the code where the closing quote for the INFILE statement is missing, then browse the SAS log. There are no notes in the SAS log because all the SAS statements after the INFILE statement have become part of the quoted string.
To correct the problem in the Windows environment, click the Break icon, select Cancel Submitted Statements in the Tasking Manager window and select OK.
SAS Dataset
A SAS dataset has two portions: the descriptor portion is the header, and the data portion is a rectangular table of data values
SAS Data Sets:
[Figure: a SAS dataset with its variable names, variable values, character values and numeric values labeled]
Variables (Columns): correspond to fields of data; each data column is named
Observations (Rows): correspond to records or data lines
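SAS stores a date as a number: the number of days since 1 January 1960 (which is stored as 0). A format controls how the stored value is displayed.

Stored Value   Format      Displayed Value
-1             DATE9.      31DEC1959
365            DDMMYY10.   31/12/1960

The today() function returns the current date
A date literal is specified as '<formatted date>'d, e.g. '31DEC1959'd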
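The relationship between stored date values and formats can be tried out with a short sketch like the one below. The variable names are made up for illustration; the step writes to the log only and creates no data set.

```sas
data _null_;
    day_before = '31DEC1959'd;   /* date literal, stored as -1 */
    one_year   = 365;            /* 365 days after 01JAN1960 */
    current    = today();        /* current date as a number of days */
    put day_before= day_before= date9.;   /* raw value, then formatted value */
    put one_year= ddmmyy10.;
    put current= date9.;
run;
```

The first PUT statement prints the same variable twice, once unformatted and once with DATE9., which makes the difference between the stored and the displayed value visible in the log.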
SAS data libraries are identified by assigning a library reference name (libref)
On invoking SAS, one automatically has access to a temporary and a permanent SAS data library
    Work - temporary library
    Sasuser - permanent library
One can also create and access new permanent libraries
The Work library and its SAS data files are deleted after the SAS session ends
SAS datasets in permanent libraries are saved after the SAS session ends
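A minimal sketch of creating and using a new permanent library follows. The libref mylib and the folder C:\sasdata are assumptions chosen for illustration; sashelp.class is a sample data set that ships with SAS.

```sas
/* Hypothetical folder: replace with a directory that exists on your system */
libname mylib 'C:\sasdata';

/* Two-level name: the data set is stored permanently in the mylib library */
data mylib.class_copy;
    set sashelp.class;
run;

/* One-level name: the data set goes to the temporary Work library */
data class_temp;
    set sashelp.class;
run;
```

After the session ends, mylib.class_copy is still on disk, while class_temp is deleted together with the rest of the Work library.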
Creating Data
data PS_AA_team;
    input NAME $ Age prior_work_ex $;
    datalines;
Sayaji 40 Y
Vikrant 30 Y
Yashjit 30 Y
Hita . Y
Tuhin 20 N
Sharmila . N
Aditi . N
Shikha . Y
Anirban 30 Y
Lata . Y
Deepak 20 N
Ambrish 20 N
Vaibhav 20 N
;
run;

Datalines / Cards is used to enter the data inside the program
Variables are numeric by default; a $ after the name marks a character variable
A missing numeric value needs to be entered as .
Default length for character variables is 8
Informat statement
General form of an informat:
$informat-namew.d

$               indicates a character informat
informat-name   names the informat
w               is an optional field width
.               is the required delimiter
d               optionally specifies a decimal for numeric informats
Selected Informats
Informat     Description
7. or 7.0    reads seven columns of numeric data.
7.2          reads seven columns of numeric data and inserts a decimal point in the data value.
$5.          reads five columns of character data and removes leading blanks.
$CHAR5.      reads five columns of character data and preserves leading blanks.
COMMA7.      reads seven columns of numeric data and removes selected nonnumeric characters, such as dollar signs and commas.
MMDDYY10.    reads dates of the form 01/20/2000
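As a small sketch of two of these informats in use, the step below reads one made-up data line. The data set name, the variable names and the data values are assumptions for illustration only.

```sas
data work.informat_demo;
    /* columns 1-7: amount with dollar sign and comma; columns 9-18: date */
    input @1 amount comma7. @9 visit mmddyy10.;
    format visit date9.;    /* display the stored date value readably */
    datalines;
$12,345 01/20/2000
;
run;

proc print data=work.informat_demo;
run;
```

COMMA7. strips the dollar sign and the comma so that amount is stored as a plain number, and MMDDYY10. turns the text date into a SAS date value.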
Importing Data
List-directed input: data must be separated by a delimiter; must read in all variables. In delimited data, the data values are separated by a specially designated character called the delimiter. For example, in comma-separated values, the comma separates individual data values from each other.
Column input: data in fixed columns; must know where data starts and ends; can read in selected variables. In fixed-format files, the data values are placed at pre-specified column positions in the data file.
Informat (formatted input): alternative to column input; most flexible; must be used for special data.
Input data can have variable names as part of the data values. If the names of the variables are given in the topmost row of the file, one can use PROC IMPORT.
[Diagram: choosing how to read raw data. Fixed format: INFILE / INPUT, where @ signifies the start of the data value. Delimited: INFILE / INPUT with the DLM option, or PROC IMPORT. Names available in the file: PROC IMPORT (use the wizard).]
To read a fixed-format raw file, one needs to know the exact position where each variable starts and the length of the variable
For every character variable, the $ symbol is used while declaring its length
If no $ symbol is used, the variable is taken as numeric by default
The MISSOVER option prevents SAS from loading a new record when the end of the current record is reached. If SAS reaches the end of the row without finding values for all fields, variables without values are set to missing.
FIRSTOBS= tells SAS at which line to begin reading data
OBS= specifies the last line to be read (with the default FIRSTOBS=1, this equals the number of lines read)
DLM= specifies the delimiter used
Example:
Convert a fixed format file (YYY.txt) to SAS Dataset.
Layout of YYY.txt

Variable  Description        Type  Start  End  Length
DOCID     Doctor ID          Num       1    9       9
Spec      Speciality         Char     10   39      30
STREET    Address - Street   Char     40   64      25
CITY      Address - City     Char     65   84      20
STATE     Address - State    Char     85   86       2
ZIP       Address - ZIP      Char     87   89       3
PHONE     Telephone Number   Num      90   99      10

    data <dataset>;
        infile "X:\YYY.txt" LRECL=99 MISSOVER;
        input @1  DOCID   9.
              @10 SPEC    $30.
              @40 STREET  $25.
              @65 CITY    $20.
              @85 STATE   $2.
              @87 ZIP     $3.
              @90 PHONE   10.;
    run;
PROC IMPORT
General form of the IMPORT procedure
    PROC IMPORT OUT=SAS-data-set
                DATAFILE=external-file-name
                DBMS=file-type;
        GETNAMES=YES;
    RUN;
Example Code
    PROC IMPORT datafile='D:\fun\Ritesh Training\comp.csv'
                out=yyy
                DBMS=CSV REPLACE;
        GETNAMES=YES;
    RUN;
IMPORT Wizard
The Import Wizard is a graphical interface provided by SAS to convert a raw data file to a SAS dataset. It can only convert delimited and Excel files to SAS files.
Select the type of raw file which is to be imported
Browse to the raw file
Enter the library name and the name under which you want to save the SAS dataset
Press Finish to convert the raw file to a SAS dataset
The Import Wizard first generates PROC IMPORT code and then executes it. You can save the code that the wizard generates.
Example Code
The following code segment illustrates the use of the EXPORT procedure in SAS to output a file in the csv format.
    PROC EXPORT DATA=<Name of Dataset>
                OUTFILE=<Output Filename>
                DBMS=CSV REPLACE;
    RUN;
Note: The output filename should be given in quotes with the full path
EXPORT wizard
The SAS Export Wizard allows us to convert a SAS dataset into other file formats without having to write any code.
Step 3: Select the file format
Step 4: Specify the output filename and its location
The SAS Export Wizard also allows us to save the corresponding PROC EXPORT code
PUT/INPUT Functions
The PUT function is used to convert variables from numeric to character, and the INPUT function is used for the reverse (character to numeric)
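A minimal sketch of both conversions is shown below. The data set and variable names are made up for illustration, and the formats and informats chosen (8. and COMMA5.) are just one reasonable option.

```sas
data work.convert_demo;
    num_id   = 1234;
    char_id  = put(num_id, 8.);            /* numeric -> character, right-aligned in 8 columns */
    char_amt = '1,250';
    num_amt  = input(char_amt, comma5.);   /* character -> numeric; the comma is removed */
run;

proc contents data=work.convert_demo;
run;
```

PROC CONTENTS lists the type of each variable, so it is an easy way to confirm that char_id is character and num_amt is numeric.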
You can use symbolic or mnemonic comparison operators (for example = or EQ, < or LT)
You may also use the IN operator to make comparisons
Example:
    If Model IN ('Corvette', 'Camaro') Then Make = 'Chevrolet';
ELSE is executed for all observations that fail to satisfy the preceding IF conditions
An ELSE statement is simply an IF-THEN statement with an ELSE tacked onto the front
Example
Data from a survey of home improvements, containing the owner's name, a description of the work done and the cost of the improvement. Group the cost into High, Medium, Low.
Data (the owner names are cut off in the source):
Description          Cost
cabinet facelift     2000
bathroom addition    11350
paint exterior       3910
second floor         75362.9
Code:
    Data home_cost;
        Infile 'C:\Home_data.dat';
        Input Owner $ 1-7 Description $ 9-33 Cost;
        Length CostGrp $ 6;   /* long enough to hold 'medium' */
        If Cost < 2000 Then CostGrp = 'low';
        Else if Cost < 10000 Then CostGrp = 'medium';
        Else CostGrp = 'high';
    Run;
Use a subsetting IF when it is easier to specify a condition for including observations
Use DELETE when it is easier to specify a condition for excluding observations
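The two approaches can be compared with a sketch like the following. It assumes the sample data set sashelp.class that ships with SAS; the output data set names are made up.

```sas
/* Subsetting IF: keep only the observations that satisfy the condition */
data work.teens_if;
    set sashelp.class;
    if age >= 13;
run;

/* DELETE: drop the observations that satisfy the condition */
data work.teens_delete;
    set sashelp.class;
    if age < 13 then delete;
run;
```

Both steps produce the same observations here, because the two conditions are exact opposites and age has no missing values in this sample data.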
A sum statement also retains values from the previous iteration of the DATA step, but you use it for cases where you simply want to cumulatively add the value of an expression to a variable
Basic form: variable + expression;
Example
Data from baseball games containing the date the game was played, the team played, and the hits and runs for the game

6-19 Columbia Peaches      8  3
6-20 Columbia Peaches      3  4
7-1  Plains Peanuts       10  5
7-2  Plains Peanuts        2  3
7-4  Sacremento           10 10
7-5  Sacremento           12  8

The team wants two additional variables: the cumulative number of runs for the season and the maximum number of runs in a game to date.
Example (Contd)..
    Data games;
        Infile 'C:\Games.dat';
        Input Month 1 Day 3-4 Team $ 6-25 Hits 27-28 Runs 30-31;
        RETAIN MaxRuns;
        MaxRuns = Max(MaxRuns, Runs);
        RunsToDate + Runs;
    Run;
Questions ??????
Agenda Module 2
Introduction to SAS Proc Statements
    Proc Sort
    Proc Means
    Proc Freq
    Proc Summary
    Proc Transpose
SAS Procedures
Start with the keyword PROC
Eg:
    PROC CONTENTS DATA = Sales_force_team;
SAS will use the most recently created dataset if the DATA= option is not specified
BY statement
    Required only for PROC SORT
    Everywhere else it is optional, and SAS performs a separate analysis for each combination of values of the BY variables
PROC SORT
Default sorting is ascending
Form of PROC SORT statement
    PROC SORT DATA=data-name;
        BY variable-1 variable-2 variable-3 variable-n;
    RUN;
Sorting in descending order: put DESCENDING before each variable that should be sorted from high to low
    BY variable-1 DESCENDING variable-2 DESCENDING variable-3;
OUTPUT
Whales and Sharks
Obs  Name      Family  Length
  1  humpback  .         50.0
  2  whale     shark     40.0
  3  basking   shark     30.0
  4  mako      shark     12.0
  5  dwarf     shark      0.5
  6  blue      whale    100.0
  7  sperm     whale     60.0
  8  gray      whale     50.0
  9  killer    whale     30.0
 10  beluga    whale     15.0
PROC MEANS
Form of PROC MEANS statement
    PROC MEANS DATA=data-name options;
        BY variable-list;
        VAR variable-list;
    RUN;
If PROC MEANS is used with no other option, it gives the number of non-missing values, mean, standard deviation, minimum and maximum for all numeric variables
    proc means data=cake n mean max min range std fw=8;
        var PresentScore TasteScore;
        title 'Summary of Presentation and Taste Scores';
    run;
OUTPUT
Summary of Presentation and Taste Scores
The MEANS Procedure
Variable        N     Mean   Maximum  Minimum    Range  Std Dev
PresentScore   20   76.150    93.000   56.000   37.000    9.376
TasteScore     20   81.350    94.000   72.000   22.000    6.611
Partial listing of the input data (a name followed by three numeric values per row):
Hildenbrand  33  81  83
Byron        62  72  87
Sanders      26  56  79
Jaeger       43  66  74
Davis        28  69  75
Conrad       69  85  94
Walters      55  67  72
Rossburger   28  78  81
Matthew      42  81  92
Becker       36  62  83
Anderson     27  87  85
Merritt      62  73  84
PROC FREQ
Form of PROC FREQ statement
    PROC FREQ DATA=data-name options;
        BY variable-list;
        OUTPUT statistic-keyword(s) <OUT=SAS-data-set>;
        TABLES request(s) </ option(s)>;
    RUN;
To do this                                                                      Use this statement
Calculate separate frequency or cross-tabulation tables for each BY group       BY
Create an output data set that contains specified statistics                    OUTPUT
Specify frequency or cross-tabulation tables and request tests and
measures of association                                                         TABLES
Partial listing of the input data (region, eye color, hair color; the counts are not shown):
1 green fair        1 green red
1 green dark        1 brown fair
1 brown medium      1 brown dark
2 blue fair         2 blue red
2 blue dark         2 blue black
The TABLES statement requests three tables: Eyes frequencies, Hair frequencies, and the Eyes by Hair cross-tabulation.
OUT= creates the FREQCNT data set, which contains the cross-tabulation table frequencies.
OUTEXPECT stores expected cell frequencies in FREQCNT.
SPARSE stores zero cell counts in FREQCNT.
    proc freq data=color;
        weight count;
        tables eyes hair eyes*hair / out=freqcnt outexpect sparse;
        title 'Eye and Hair Color of European Children';
    run;
    proc print data=freqcnt noobs;
        title2 'Output Data Set from PROC FREQ';
    run;
PROC SUMMARY
Form of PROC SUMMARY statement
    PROC SUMMARY <option(s)> <statistic-keyword(s)>;
        CLASS variable(s) </ option(s)>;
        VAR variable(s);
        OUTPUT <OUT=SAS-data-set> <output-statistic-specification(s)>
               <id-group-specification(s)> <maximum-id-specification(s)>
               <minimum-id-specification(s)> </ option(s)>;
    RUN;
To do this                                                        Use this statement
Calculate separate statistics for each BY group                   BY
Create an output data set that contains specified statistics      OUTPUT
Specify the grouping variables                                    CLASS
List the variables that need to be summarized                     VAR
Using the same color data as in the PROC FREQ example:
    proc summary data=color;
        class eyes hair;
        var count;
        output out=Summary(drop=_freq_) sum=;
    run;
PROC TRANSPOSE
Statement   To do this
BY          Used if you have any grouping variables that you want to retain as variables. These variables are included in the transposed data set, but are not themselves transposed
ID          Names the variable whose formatted values will become the new variable names. In the absence of an ID statement, the new variables will be named COL1, COL2, and so on
VAR         Names the variables whose values you want to transpose
Using the same color data as in the PROC FREQ example (the data must be sorted by the BY variables):
    proc transpose data=color out=transpose;
        by eyes hair;
        id Region;
        var count;
    run;
    DATA new-data-set;
        SET data-set;
To stack data sets (appending): with two or more datasets (that have all or most of the same variables but different observations) listed in it, the SET statement, in addition to reading the data, concatenates the datasets one on top of the other
    DATA new-data-set;
        SET data-set-1 data-set-n;
        BY variable-list;
With a BY statement, the SET statement interleaves the datasets, so that the result is in BY-variable order. Before you can use the BY statement, the datasets must be sorted by the BY variables
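The difference between stacking and interleaving can be seen in a small sketch. The two input data sets and their values are made up for illustration.

```sas
data work.north;
    input id region $;
    datalines;
1 N
3 N
;
run;

data work.south;
    input id region $;
    datalines;
2 S
4 S
;
run;

/* Stacking: all of north, then all of south (id order 1 3 2 4) */
data work.stacked;
    set work.north work.south;
run;

/* Interleaving: both inputs are already sorted by id (id order 1 2 3 4) */
data work.interleaved;
    set work.north work.south;
    by id;
run;
```

Because both inputs were entered in id order, no PROC SORT is needed before the interleaving step in this sketch.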
    DATA new-data-set;
        MERGE data-set-1 data-set-n;
        BY variable-list;
If the datasets being merged have variables with the same names (besides the BY variables), then the variables from the second dataset will overwrite any variables having the same name in the first dataset.
All observations from both datasets are included in the final dataset, irrespective of whether they had a match or not.
    DATA new-data-set;
        MERGE data-set-1 data-set-n;
        BY variable-list;
The order of the datasets in the MERGE statement does not matter to SAS, i.e., a one-to-many merge is the same as a many-to-one merge.
A one-to-many merge cannot be done without a BY statement. Without any BY variables for matching, SAS simply joins together the first observation from each dataset, then the second observation from each dataset, and so on.
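A one-to-many match merge might look like the sketch below. The data set names, variables and values are invented for illustration.

```sas
data work.regions;    /* one observation per region */
    input region $ manager $;
    datalines;
N Lee
S Kim
;
run;

data work.sales;      /* several observations per region */
    input region $ amount;
    datalines;
N 100
N 150
S 200
;
run;

/* Both inputs are already in region order, so no PROC SORT is needed here */
data work.sales_with_manager;
    merge work.regions work.sales;
    by region;
run;
```

The result has one observation per sale, and the manager value from work.regions is repeated on every sale of the matching region.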
Original-dataset is the data with more than one observation and summary data set is the data with a single observation. SAS reads original data set in a normal SET statement. It also reads the summary data set with the SET statement but only in the first iteration of the data step and then retains the value of variables from summary dataset for all observations in new data set
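One way to sketch this technique is shown below. It assumes the sample data set sashelp.class that ships with SAS; the summary data set work.avg and the variables avg_height and diff are made-up names.

```sas
/* Summary data set with a single observation: the mean height */
proc means data=sashelp.class noprint;
    var height;
    output out=work.avg(keep=avg_height) mean=avg_height;
run;

data work.class_vs_avg;
    if _n_ = 1 then set work.avg;   /* read once; the value is retained afterwards */
    set sashelp.class;              /* read normally, one observation per iteration */
    diff = height - avg_height;
run;
```

Variables read with a SET statement are retained automatically, which is why avg_height is available on every observation even though work.avg is read only in the first iteration.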
When the DATA statement names three datasets and no OUTPUT statement is used, SAS creates 3 identical datasets
To create different datasets, use the OUTPUT statement
Basic form
    OUTPUT data-set-name;
Example
    IF family = 'Ursidae' then OUTPUT bears;
Since the OUTPUT statement is within the DO loop, an observation is created each time through the loop. Without the OUTPUT statement, SAS would have written only one observation at the end of the DATA step
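A minimal sketch of an OUTPUT statement inside a DO loop follows; the data set and variable names are made up for illustration.

```sas
data work.squares;
    do x = 1 to 5;
        y = x ** 2;
        output;        /* writes one observation per loop iteration */
    end;
run;

proc print data=work.squares;
run;
```

With the OUTPUT statement the data set has five observations. If the OUTPUT statement is removed, only the values left over after the loop finishes are written, as a single observation at the end of the DATA step.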
When a variable holds both a date and a time, e.g. 23Apr06 00:00:00, the date part is extracted using:
    new_variable = DATEPART(variable);
Agenda Module 3
    Proc SQL
    Arrays / DO-END
    Retain / First. Last.
Selecting Data
    PROC SQL;
        SELECT DISTINCT rating
        FROM MFE.MOVIES;
    QUIT;
The simplest SQL code needs 3 statements
By default, PROC SQL prints the result of the query; use the NOPRINT option to suppress this
Begin with PROC SQL and end with QUIT; not RUN;
At least one SELECT ... FROM statement is needed
DISTINCT is an option that removes duplicate rows
Ordering/Sorting Data
    PROC SQL;
        SELECT *
        FROM MFE.MOVIES
        ORDER BY category;
    QUIT;
Remember that the line placement of SAS statements has no effect, so we can spread the middle statement over 3 lines
SELECT * means we select all variables from the dataset MFE.MOVIES
Put ORDER BY after FROM
We sort the data by the variable category
Use a comma (,) to separate selected variables
CONTAINS in a WHERE clause works only for character variables
Also try WHERE UPCASE(category) LIKE '%ACTION%';
Use the wildcard character percent sign (%) with the LIKE operator
Always put WHERE after FROM
The sounds-like operator is =*
    It searches the movie title for phonetic variations of drama, which also helps with possible spelling variations
CREATE TABLE ... AS can always be put in front of a SELECT ... FROM statement to build a SAS dataset.
With SELECT alone, the results of a query are converted to an output object (printing). Query results can also be stored as data: the CREATE TABLE statement creates a table with the results of a query, and the CREATE VIEW statement stores the query itself as a view. Either way, the data identified in the query can be used in later SQL statements or in other SAS steps.
The example produces a new dataset (table) ACTION in the Work library, with no printing
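A sketch of CREATE TABLE combined with a WHERE clause is given below. Since MFE.MOVIES is not available here, it first builds a small stand-in table work.movies; its titles and values are invented, and the column name title is an assumption.

```sas
/* Hypothetical stand-in for MFE.MOVIES */
data work.movies;
    infile datalines dlm=',';
    length title $ 30 category $ 20 rating $ 5;
    input title $ category $ rating $;
    datalines;
Title One,Action Adventure,PG
Title Two,Drama,R
Title Three,Action Comedy,PG-13
;
run;

proc sql;
    /* Store the query result as a table instead of printing it */
    create table work.action as
    select title, category
    from work.movies
    where upcase(category) like '%ACTION%';
quit;
```

Nothing is printed by this step; the two rows whose category contains ACTION are stored in work.action and can be used by later steps.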
Terminology: Join (Merge) datasets (tables)
No prior sorting is required, which is one advantage over the DATA step MERGE
Use a comma (,) to separate the two datasets in FROM
Without WHERE, all possible combinations of rows from the tables are produced, and all columns are included
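A join of two tables might be sketched as follows. Both tables, their columns and their values are made up for illustration.

```sas
data work.films;
    input movie_id category $;
    datalines;
2 Drama
1 Action
;
run;

data work.actors;
    input movie_id actor $;
    datalines;
1 Smith
2 Jones
2 Brown
;
run;

proc sql;
    select f.movie_id, f.category, a.actor
    from work.films as f, work.actors as a
    where f.movie_id = a.movie_id;   /* without this, every row pairs with every row */
quit;
```

Note that work.films is deliberately not in movie_id order: unlike a DATA step MERGE with BY, the join needs no prior sorting.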
Turn on the HTML result option for better display: Tools > Options > Preferences > Results, check Create HTML, then OK
Array Processing
You can use arrays to simplify programs that
perform repetitive calculations
create many variables with the same attributes
read data
rotate SAS data sets by making variables into observations or observations into variables
compare variables
perform a table lookup.

All variables in an array must have the same type (numeric or character)
An array can't have the same name as a variable
You must explicitly state the number of elements when using _temporary_; in other cases SAS figures it out from context, generating new variables if necessary.

    ARRAY array-name {subscript} <$> <length>
          <array-elements> <(initial-value-list)>;
Defining an Array
Write an ARRAY statement that defines the four quarterly contribution variables as elements of an array.
    array Contrib{4} Qtr1 Qtr2 Qtr3 Qtr4;
Array CONTRIB (the variable ID is not part of the array):
First element     QTR1
Second element    QTR2
Third element     QTR3
Fourth element    QTR4
Defining an Array
Variables that are elements of an array need not have similar, related or numbered names.
array Contrib2{4} Q1 Qrtr2 ThrdQ Qtr4;
Array CONTRIB2 (the variable ID is not part of the array):
First element     Q1
Second element    QRTR2
Third element     THRDQ
Fourth element    QTR4
Processing an Array
Array processing often occurs within DO loops. An iterative DO loop that processes an array has the following form:
    DO index-variable=1 TO number-of-elements-in-array;
        additional SAS statements using array-name{index-variable}
    END;
To execute the loop as many times as there are elements in the array, specify that the values of index-variable range from 1 to number-of-elements-in-array.
Processing an Array
    array Contrib{4} Qtr1 Qtr2 Qtr3 Qtr4;
    do Qtr=1 to 4;
        Contrib{Qtr}=Contrib{Qtr}*1.25;
    end;

When     CONTRIB{QTR} is   Which refers to          Statement executed
Qtr=1    CONTRIB{1}        QTR1 (first element)     Qtr1=Qtr1*1.25;
Qtr=2    CONTRIB{2}        QTR2 (second element)    Qtr2=Qtr2*1.25;
Qtr=3    CONTRIB{3}        QTR3 (third element)     Qtr3=Qtr3*1.25;
Qtr=4    CONTRIB{4}        QTR4 (fourth element)    Qtr4=Qtr4*1.25;
Calculate the percentage that each quarter's contribution represents of the employee's total annual contribution. Base the percentage only on the employee's actual contribution and ignore the company contributions.
Partial Listing of prog2.donate
ID       Qtr1  Qtr2  Qtr3  Qtr4
E00224     12    33    22     .
E00367     35    48    40    30
The second ARRAY statement creates four numeric variables: Percent1, Percent2, Percent3, and Percent4.
c07s3d1.sas
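The program file itself is not reproduced here, so the following is only a sketch of how the percentage calculation could be written with two arrays. The data set work.donate is a stand-in built from the partial listing above, and the names Total and work.percent are assumptions.

```sas
/* Stand-in for prog2.donate, using the two rows from the partial listing */
data work.donate;
    input ID $ Qtr1-Qtr4;
    datalines;
E00224 12 33 22 .
E00367 35 48 40 30
;
run;

data work.percent(drop=Qtr);
    set work.donate;
    Total = sum(of Qtr1-Qtr4);      /* SUM ignores missing values */
    array Contrib{4} Qtr1-Qtr4;
    array Percent{4};               /* creates Percent1-Percent4 */
    do Qtr = 1 to 4;
        Percent{Qtr} = Contrib{Qtr} / Total;
    end;
    format Percent1-Percent4 percent6.;
run;
```

The second ARRAY statement lists no element names, so SAS generates Percent1 to Percent4 from the array name. For the first employee, Percent4 stays missing because Qtr4 is missing.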
[Figure: the partial listing of prog2.donate, with the first difference (Qtr2 - Qtr1) and the second difference (Qtr3 - Qtr2) marked]
c07s3d2.sas

When    Statement executed
i=1     Diff1=Qtr2-Qtr1;
i=2     Diff2=Qtr3-Qtr2;
i=3     Diff3=Qtr4-Qtr3;
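The program file itself is not reproduced here, so the following is only a sketch of a DATA step that would execute the three statements listed above. The data set work.donate is a stand-in built from the partial listing of prog2.donate, and work.change is a made-up name.

```sas
/* Stand-in for prog2.donate, using the two rows from the partial listing */
data work.donate;
    input ID $ Qtr1-Qtr4;
    datalines;
E00224 12 33 22 .
E00367 35 48 40 30
;
run;

data work.change(drop=i);
    set work.donate;
    array Contrib{4} Qtr1-Qtr4;
    array Diff{3};                  /* creates Diff1-Diff3 */
    do i = 1 to 3;
        Diff{i} = Contrib{i+1} - Contrib{i};   /* next quarter minus current quarter */
    end;
run;
```

The loop runs only three times because four quarters give three quarter-to-quarter differences.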
    data compare(drop=Qtr Goal1-Goal4);
        set prog2.donate;
        array Contrib{4} Qtr1-Qtr4;
        array Diff{4};
        array Goal{4} Goal1-Goal4 (10,15,5,10);
        do Qtr=1 to 4;
            Diff{Qtr}=Contrib{Qtr}-Goal{Qtr};
        end;
    run;
Using first.
We need to number the observations within each person. We will use first.person to do this, so we must first sort the data on person. Then we create the count variable, which enumerates the observations within each person.
    proc sort data=real_life out=sort_real;
        by person;
    run;
    data count_real;
        set sort_real;
        retain count;
        by person;
        if first.person then count = 0;
        count = count + 1;
    run;
    proc print data=count_real noobs;
    run;

    data three;
        set wide_real;
        array topic(6) topicA_1-topicA_6;
        do i = 2 to 5;
            if topic[i-1] ne . & topic[i] ne . & topic[i+1] ne .
               & topic[i]=topic[i-1] & topic[i]=topic[i+1] then flagA=1;
        end;
        if flagA=. then flagA=0;
    run;
    proc print data=three noobs;
        var person topicA_1-topicA_6 flagA;
    run;
Thank you !