
Get Going with PROC SQL

Richard Severino, Convergence CT, Honolulu, HI


ABSTRACT

PROC SQL is the SAS System's implementation of Structured Query Language (SQL). PROC SQL can be used to retrieve or combine/merge data from tables or views, as well as to generate reports and summary statistics. With PROC SQL you can modify and update tables, create new variables on the fly, access data from a database, and join SAS data sets to each other or to tables from a database. This tutorial introduces the basics of PROC SQL, as well as some advanced features, to help you start using PROC SQL effectively.

INTRODUCTION
Structured Query Language, or SQL, is a standardized language for retrieving and updating data in data tables. PROC SQL is the SAS System's implementation of SQL. It allows the user to access data, generate reports and summary statistics, and perform data management tasks. The purpose of this tutorial is to introduce enough of the SQL procedure that a beginning user can use it effectively and go on to solve more complex problems with PROC SQL.

SQL TERMINOLOGY
While some of the SQL terminology used in PROC SQL differs from that used with Base SAS, the terms are often interchangeable. PROC SQL's distinct terminology comes from the relational databases around which SQL was developed. The table below shows some Base SAS terms and their analogous SQL terms.

    Base SAS term   Description                                                   SQL term
    -------------   ------------------------------------------------------------  ---------------
    SAS Data Set    A data file                                                   Table
    View            A data file that can be read (viewed) but cannot be modified  View
    Observation     Records in the data file                                      Row or Record
    Variable        Examples: age, id, date, salary, flight, destination          Column or Field
    Merge           Getting information from more than one data file or table     Join
                    and putting it together

BASIC SYNTAX
The basic syntax for PROC SQL is as follows:

    PROC SQL;
      SELECT <field/column names>
      FROM <table names>;
    QUIT;

PROC SQL begins with the PROC SQL; statement. Each PROC SQL statement may include many additional sub-statements or clauses. Unlike Base SAS statements, these clauses are not delimited by semicolons: a single semicolon ends each statement, however many clauses it contains. PROC SQL does not require a run; statement anywhere, but you should end it with a quit; statement. In general, an SQL query selects fields or columns from a data source, which may be a single table or several tables that are to be joined. The complete PROC SQL syntax is available in the online documentation at support.sas.com.

WORKING WITH DATA


We will be using a dataset named MARCH which consists of airline flight information for the month of March.
SELECT SOME DATA

Let's take a look at the data by sending it to the output window. The following code will send all the data to the output window:

    proc sql;
      SELECT *
      from sql.march;
    quit;

The asterisk (*) in the SELECT statement is a wildcard meaning that all fields or columns will be selected from the data table. The above code therefore selects all the fields in table MARCH and displays all the records or rows in the table. If we don't know how many records there are in MARCH, we should limit the number of records sent to the output window in case the table has hundreds or thousands of records. The INOBS= option limits the number of records read from the data source, while the OUTOBS= option limits the number of records output by PROC SQL. The following code:

    proc sql INOBS=5;
      select *
      from sql.march;
    quit;

will result in the following output:

    Selecting All Fields from MARCH Table and Using INOBS=5

    flight  date     depart  orig  dest  miles  boarded  capacity
    -------------------------------------------------------------
    114     01MAR94    7:10  LGA   LAX    2475      172       210
    202     01MAR94   10:43  LGA   ORD     740      151       210
    219     01MAR94    9:31  LGA   LON    3442      198       250
    622     01MAR94   12:19  LGA   FRA    3857      207       250
    132     01MAR94   15:35  LGA   YYZ     366      115       178

If we don't want to print all the fields in the table, we must name the fields we want in the SELECT statement. The following code selects three fields, flight, date and dest, and uses OUTOBS to limit the output.

    proc sql OUTOBS=5;
      select flight, date, dest
      from sql.march;
    quit;

The following warning will be printed in the Log:

    WARNING: Statement terminated early due to OUTOBS=5 option.

And the following will print in the Output window:

    Selecting All Fields from MARCH Table and Using OUTOBS=5

    flight  date     dest
    ---------------------
    114     01MAR94  LAX
    202     01MAR94  ORD
    219     01MAR94  LON
    622     01MAR94  FRA
    132     01MAR94  YYZ

INOBS and OUTOBS can be used together. Keep in mind that if INOBS is less than OUTOBS, the procedure will output only as many records as are specified in the INOBS option. You will find it valuable to check the log when working with PROC SQL.
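As a minimal sketch of the two options used together (the limits 10 and 5 are arbitrary values chosen for illustration), the following step reads at most 10 rows from MARCH and prints at most 5 of them:

```sas
proc sql INOBS=10 OUTOBS=5;
  /* INOBS=10 caps the rows read from sql.march;
     OUTOBS=5 caps the rows written to the output window */
  select flight, date, dest
  from sql.march;
quit;
```

Swapping the two values (INOBS=5 OUTOBS=10) should still print only 5 rows, because no more than 5 are ever read. The log should indicate when either limit cuts the query short.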
SELECT DISTINCT

The field dest in the MARCH table holds the three-character airport code for the destination of the flight. If we want a list of all the destinations in the table, we can use SELECT DISTINCT to obtain the unique values of dest. The following code will get such a list:

    proc sql;
      select DISTINCT dest
      from sql.march;
    quit;

Output:

    List of Destination Airports

    dest
    ----
    FRA
    LAX
    LON
    ORD
    PAR
    WAS
    YYZ

Notice that each airport code is listed only once. SELECT DISTINCT is a useful tool for finding out what unique values are stored in a field. If you specify more than one field or column in the SELECT DISTINCT statement, the query will return a list of all the combinations of values in the fields specified. The following query:

    proc sql;
      select DISTINCT dest, capacity
      from sql.march;
    quit;

results in the following output:

    List of Destination Airports and Flight Capacity

    dest  capacity
    --------------
    FRA        250
    LAX        210
    LON        250
    ORD        210
    PAR        250
    WAS        180
    YYZ        178

Notice that a capacity of 250 is listed several times, but each time with a different destination.
WHERE: SUBSETTING

The WHERE clause is used to select rows or records whose field values meet a particular condition or set of conditions. To list all the flights where the destination was LAX, we add a WHERE clause as follows:

    proc sql;
      title "Flights to LAX";
      select *
      from sql.march
      WHERE dest = "LAX";
    quit;

A partial output listing for the preceding code is as follows:

    Flights to LAX

    flight  date     depart  orig  dest  miles  boarded  capacity
    -------------------------------------------------------------
    114     01MAR94    7:10  LGA   LAX    2475      172       210
    114     02MAR94    7:10  LGA   LAX    2475      119       210
    114     03MAR94    7:10  LGA   LAX    2475      197       210
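The "set of conditions" mentioned above can be sketched by combining comparisons with AND. This is only an illustration: the threshold of 150 passengers and the title text are assumptions, not part of the original example.

```sas
proc sql;
  title "Flights to LAX with More Than 150 Passengers";
  select flight, date, boarded, capacity
  from sql.march
  /* a row is kept only when both conditions are true */
  WHERE dest = "LAX" and boarded > 150;
quit;
```

Replacing AND with OR would instead keep rows that satisfy either condition.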

NAMING, LABELING AND FORMATTING COLUMNS

The variables in the MARCH dataset are not labeled, and some of their names do not clearly indicate what the variables hold. In the SELECT statement we can rename columns and add or change labels and formats. In the example that follows we rename the field dest to destination, flight to flight_num and date to depart_dt, provide labels for date and miles, and change the format of miles. The PROC SQL code and a partial listing of the output are shown below. Notice that the column headings in the output show the new names, labels and formats.

    proc sql;
      select flight as flight_num,
             date as depart_dt label="Departure Date",
             dest as destination,
             miles label="Distance to Destination in Miles" format=COMMA6.0
      from sql.march;
    quit;

Output:

    Rename, Label and Change Formats of Columns

                                         Distance to
                Departure                Destination
    flight_num  Date       destination      in Miles
    ------------------------------------------------
    114         01MAR94    LAX                 2,475
    202         01MAR94    ORD                   740
    219         01MAR94    LON                 3,442
    622         01MAR94    FRA                 3,857
    132         01MAR94    YYZ                   366
    271         01MAR94    PAR                 3,635
    302         01MAR94    WAS                   229
CREATING NEW COLUMNS

In the SELECT statement, you can create a new column that is calculated from one or more existing columns or fields in the data source you are querying. The MARCH dataset has the number of passengers that actually boarded each flight as well as the capacity of the flight. Suppose we want to display a column that shows the number of empty seats on each flight, and that we would also like to convert the distance from miles to kilometers. The following PROC SQL step accomplishes this task.

    proc sql;
      select flight label="Flight Number",
             date,
             dest,
             ROUND(miles*1.609344) as kilometers
               label="Flight Distance in Km" format=COMMA6.0,
             boarded,
             capacity - boarded as empty
               label="Number of Empty Seats" format=4.0
      from sql.march;
    quit;

A partial listing of the output follows.

    Creating New Columns Calculated from Existing Columns

                                                Number
                             Flight                 of
    Flight                 Distance              Empty
    Number  date     dest     in Km  boarded     Seats
    --------------------------------------------------
    114     01MAR94  LAX      3,983      172        38
    202     01MAR94  ORD      1,191      151        59
    219     01MAR94  LON      5,539      198        52
    622     01MAR94  FRA      6,207      207        43

Notice that the headers for date and dest that we created earlier are no longer shown. That is because changing the name, label or format in the SELECT statement does not affect the attributes of the column in the permanent dataset; it only affects the output.
CREATING AND ALTERING TABLES

Saving the results of a query to a table is accomplished by creating a table from the query using the following syntax:

    PROC SQL;
      CREATE TABLE new_table_name AS
        SELECT column_one, column_two, . . .
        FROM source_table_name;
    QUIT;

For example, if we wish to save the results of the query where the number of empty seats was calculated, we could run the following code:

    proc sql;
      CREATE table sql.march2 AS
        select flight label="Flight Number",
               date,
               dest,
               ROUND(miles*1.609344) as kilometers
                 label="Flight Distance in Km" format=COMMA6.0,
               boarded,
               capacity - boarded as empty
                 label="Number of Empty Seats" format=4.0
        from sql.march;

      title "New Table: MARCH2";
      select *
      from sql.march2;
    quit;

The following is the message printed in the log and a partial listing of the output:

    NOTE: Table SQL.MARCH2 created, with 46 rows and 6 columns.

    New Table: MARCH2

                                                Number
                             Flight                 of
    Flight                 Distance              Empty
    Number  date     dest     in Km  boarded     Seats
    --------------------------------------------------
    114     01MAR94  LAX      3,983      172        38
    202     01MAR94  ORD      1,191      151        59
    219     01MAR94  LON      5,539      198        52
    622     01MAR94  FRA      6,207      207        43
    132     01MAR94  YYZ        589      115        63

You can create a new table that is empty, i.e. a table with no records, by copying the structure of an existing table or by specifying the column names, labels, formats and data types. Suppose you need to create a table identical to the MARCH table so that data for the month of April can be entered. It is very easy to copy the structure of an existing table. The following PROC SQL code will create a table named APRIL with all the same columns as MARCH, but without any records in the table:

    proc sql;
      CREATE table work.april LIKE sql.march;
      describe table work.april;
    quit;

There will not be any output generated by the code above, but the log will show that the table was created with zero records. The DESCRIBE TABLE statement will print a list of the variables in the table to the log. Here are the log contents for the above code:

    357  create table work.april like sql.march;
    NOTE: Table WORK.APRIL created, with 0 rows and 8 columns.
    358  describe table work.april;
    NOTE: SQL table WORK.APRIL was created like:

    create table WORK.APRIL( bufsize=8192 )
      (
       flight char(3),
       date num format=DATE7. informat=DATE7.,
       depart num format=TIME5. informat=TIME5.,
       orig char(3),
       dest char(3),
       miles num,
       boarded num,
       capacity num
      );

Notice that the APRIL table has the same columns as the MARCH table. We can create a complete copy of the MARCH table, i.e. the structure and the data, as follows:

    proc sql;
      create table sql.march_copy as
        select *
        from sql.march;
    quit;

The following is printed in the log:

    NOTE: Table SQL.MARCH_COPY created, with 46 rows and 8 columns.

and the reader can easily verify that the dataset is in fact a copy. Another way to create a table is to specify each column and its attributes. Let's create a lookup table for the airport codes that are stored in the variable named dest in the MARCH table:

    proc sql;
      create table sql.airport_lu
        (airport_code char(3)  label="Airport Code",
         city         char(40) label="City",
         country      char(40) label="Country"
        );
      describe table sql.airport_lu;
    quit;

Examining the log shows that the table was created to the specifications given and has no records:

    NOTE: Table SQL.AIRPORT_LU created, with 0 rows and 3 columns.
    10  describe table sql.airport_lu;
    NOTE: SQL table SQL.AIRPORT_LU was created like:

    create table SQL.AIRPORT_LU( bufsize=8192 )
      (
       airport_code char(3) label='Airport Code',
       city char(40) label='City',
       country char(40) label='Country'
      );

Now suppose we wanted to construct a lookup table that would indicate whether a flight was an international or a domestic flight. We already know that such a table must include the flight number and at least one column that categorizes the flight as domestic or international. We can start by creating a table that has the list of flight numbers in it as follows:

    proc sql;
      create table sql.flight_type_lu as
        select distinct flight as flight_number
        from sql.march;

      title "Flight Numbers in MARCH";
      select *
      from sql.flight_type_lu;
    quit;

The output from the above PROC SQL code is:

    Flight Numbers in MARCH

    flight_number
    -------------
    114
    132
    202
    219
    271
    302
    622

Now that we can create a table, we need to be able to make changes to it by adding, deleting or modifying columns and rows.
ALTERING TABLES, INSERTING AND UPDATING ROWS

The ALTER TABLE, INSERT INTO and UPDATE statements are used to modify tables and the data stored in them. The AIRPORT_LU table we created above has three columns: airport_code, city and country. If this lookup table is to be useful, we need to populate it with some data. LAX is the airport code for the Los Angeles International Airport. To add this information to the lookup table we need to INSERT a row:

    proc sql;
      INSERT INTO sql.airport_lu (city, country, airport_code)
        VALUES ("Los Angeles, CA", "USA", "LAX");

      select *
      from sql.airport_lu;
    quit;

The following is a partial listing of the log and output:

    NOTE: 1 row was inserted into SQL.AIRPORT_LU.

    Airport
    Code     City                Country
    ------------------------------------
    LAX      Los Angeles, CA     USA

Notice that the column names in the INSERT INTO statement may be listed in any order, as long as the order of the VALUES matches. Not all columns have to be listed in the INSERT INTO statement: if a column is not listed, it will receive the default missing value for its data type. To illustrate this, let's add a record for San Diego, California, to the lookup table:

    proc sql;
      INSERT INTO sql.airport_lu (city, country)
        VALUES ("San Diego, CA", "USA");

      select *
      from sql.airport_lu;
    quit;

Output:

    Airport
    Code     City                Country
    ------------------------------------
    LAX      Los Angeles, CA     USA
             San Diego, CA       USA

Notice that there is no code for San Diego since we did not include it in the INSERT INTO statement. We can UPDATE the AIRPORT_LU table to add the code SAN for San Diego:

    proc sql;
      UPDATE sql.airport_lu
        set airport_code = "SAN"
        where city = "San Diego, CA";

      select *
      from sql.airport_lu;
    quit;

Output:

    Airport
    Code     City                Country
    ------------------------------------
    LAX      Los Angeles, CA     USA
    SAN      San Diego, CA       USA

The ALTER TABLE statement is used to delete columns from or add columns to a table. To delete a column from a table, use the following syntax:

    ALTER TABLE table_name
      DROP column_one, column_two,... ;

To ADD a column to a table use the following syntax:

    ALTER TABLE table_name
      ADD column_name <column specifications> ;

where the column specifications consist of the column type, label and format. To ADD two columns to the AIRPORT_LU table, we run the following code:

    proc sql;
      ALTER TABLE sql.airport_lu
        ADD dom_or_int char(13) format=$13. label="Domestic or International",
            n_gates num format=3.0 label="Number of Gates";
    quit;

which results in the following message printed in the Log:

    NOTE: Table SQL.AIRPORT_LU has been modified, with 5 columns.

This will allow us to classify each airport as Domestic or International and to enter the number of gates available at each. We can add this information to the table with the following code:

    proc sql;
      UPDATE sql.airport_lu
        set dom_or_int = "Domestic",
            n_gates = 40
        where airport_code = "LAX";

      title "AIRPORT_LU with added columns";
      select *
      from sql.airport_lu;
    quit;

which results in the following message printed in the Log:

    NOTE: 1 row was updated in SQL.AIRPORT_LU.

And the following output:

    AIRPORT_LU with added columns

                                                            Number
    Airport                                Domestic or          of
    Code     City              Country     International     Gates
    --------------------------------------------------------------
    LAX      Los Angeles, CA   USA         Domestic             40
    SAN      San Diego, CA     USA                               .

Note that for the SAN airport record there is a ".", or missing value, for n_gates (number of gates) and no data for dom_or_int (domestic or international), because the UPDATE statement had a WHERE clause restricting the update to LAX. (This output has been edited to fit in the space above.) To delete or DROP a column from a table we use the following syntax:

    ALTER TABLE table_name
      DROP column_name ;

To delete the columns country and dom_or_int from the AIRPORT_LU table we run the following code:

    proc sql;
      ALTER TABLE sql.airport_lu
        DROP country, dom_or_int;

      title "AIRPORT_LU after DROPing country and dom_or_int columns";
      select *
      from sql.airport_lu;
    quit;

which results in the following message printed in the Log:

    NOTE: Table SQL.AIRPORT_LU has been modified, with 3 columns.

And the following output:

    AIRPORT_LU after DROPing country and dom_or_int columns

                                Number
    Airport                         of
    Code     City                Gates
    ----------------------------------
    LAX      Los Angeles, CA        40
    SAN      San Diego, CA           .
DELETING ROWS AND TABLES

The DELETE statement is used with a WHERE clause to delete one or more records from a table. To delete the record for San Diego from the AIRPORT_LU table, we run the following code:

    proc sql;
      DELETE from sql.airport_lu
        WHERE airport_code = "SAN";
    quit;

The following message is printed in the log:

    NOTE: 1 row was deleted from SQL.AIRPORT_LU.

CAUTION: Be careful when using the DELETE statement. If you use DELETE without a WHERE clause, all the records will be deleted.

To delete an entire table, use the following syntax:

    DROP TABLE table_name ;

To delete the AIRPORT_LU table that we created and modified, we run the following code:

    proc sql;
      DROP table sql.airport_lu;
      describe table sql.airport_lu;
    quit;

which results in the following messages printed in the Log:

    370  DROP TABLE sql.airport_lu ;
    NOTE: Table SQL.AIRPORT_LU has been dropped.
    371
    372  describe table sql.airport_lu ;
    ERROR: File SQL.AIRPORT_LU.DATA does not exist.
    373
    374  quit;
SUMMARY FUNCTIONS

To summarize data, that is, to produce a statistical summary of the entire table, we must use summary functions such as COUNT, SUM, MIN and MAX in the SELECT clause. To create summaries for sub-groups, a GROUP BY clause must be used in the SELECT statement. If GROUP BY is not used with a summary function, all the rows in the table or view are treated as a single group, and the result of the SELECT statement will be one or more summary statistics computed from all the data. The table below lists some of the summary functions more commonly used in PROC SQL. Consult the PROC SQL documentation for other available summary functions.

    Summary Function                   Function Result
    ---------------------------------  ---------------------------
    AVG, MEAN                          mean or average of values
    COUNT, COUNT(DISTINCT), FREQ, N    number of nonmissing values
    SUM                                sum of values
    MAX                                largest value
    MIN                                smallest value
    STD                                standard deviation
    NMISS, NMISS(DISTINCT colname)     number of missing values

Any column which exists in a table named in the FROM clause of the SELECT statement can be used as an argument in the functions shown in the table above. If the function is used with a single argument, or column name, the function is applied to the column producing one summary statistic for the entire select statement, or one summary statistic for each group in the GROUP BY clause. If the function is used with two or more arguments, the function is applied to the row producing one summary statistic for each row. Suppose we want to calculate the average number of passengers that boarded flights in the MARCH2 table. We can run the following code:


    proc sql;
      title "Average Number of Passengers";
      select AVG(boarded) as boarded_avg
      from sql.march2;
    quit;

which will yield the following result:

    Average Number of Passengers

    boarded_avg
    -----------
            148

Now, to get the average number of passengers for each flight, we just have to add the column flight to the SELECT statement and a GROUP BY clause as follows:

    proc sql;
      title "Average Number of Passengers By Flight";
      select flight,
             AVG(boarded) as boarded_avg
               label="Average Number of Passengers" format=6.1
      from sql.march2
      GROUP BY flight;
    quit;

The resulting output is:

    Average Number of Passengers By Flight

               Average
    Flight   Number of
    Number  Passengers
    ------------------
    114          153.0
    132          126.3
    202          133.0
    219          191.1
    271          144.5
    302          103.7
    622          182.5

Let's examine what would happen if we did not use the GROUP BY clause in the previous example. The following are the log contents and a partial listing of the output obtained by running the previous PROC SQL program with the GROUP BY clause removed:

    176  select flight,
    177         AVG(boarded) as boarded_avg label="Average Number of Passengers" format=6.1
    178  from sql.march2 ;
    NOTE: The query requires remerging summary statistics back with the original data.

    Average Number of Passengers By Flight: NO 'GROUP BY'

               Average
    Flight   Number of
    Number  Passengers
    ------------------
    114          148.0
    202          148.0
    219          148.0
    622          148.0


Because the column flight was included in the SELECT statement but no GROUP BY flight clause was specified, the query tries to give us two things: all the flight numbers from all the records, and the average of the column boarded calculated for the entire table. The query therefore calculates the average of boarded, attaches it to the original data, and gives us the results above.

In the MARCH2 table, the columns boarded and empty hold the number of passengers and the number of empty seats respectively. When we created MARCH2 we did not include capacity in the SELECT statement, so the number of seats on the plane is not in the table. We can compute capacity by summing boarded and empty. Recall that a summary function with two or more arguments acts across the columns on each row, whereas a summary function with only one argument acts across the rows on the column that is its argument. Consider the following code:

    proc sql;
      select flight,
             boarded,
             empty,
             sum(empty) as empty_sum,
             sum(boarded, empty) as capacity,
             min(boarded, sum(boarded, empty)) as min_x
               label="Min of Boarded and Capacity"
      from sql.march2;
    quit;

SUM(empty) will sum the number of empty seats for all records in the table, but since there is no GROUP BY clause, this single value will be re-merged with all the records of the table. SUM(boarded, empty) will sum the fields boarded and empty for each record, giving the capacity of each flight. Finally, MIN(boarded, SUM(boarded, empty)) will compute the minimum of boarded and the sum of boarded and empty, i.e. the capacity, for each record. This last example demonstrates that one summary function can be used as an argument to another summary function.
The following is a partial listing of the output:

    Another Summary Function Example

                     Number                         Min of
                         of                        Boarded
    Flight            Empty                            and
    Number  boarded   Seats  empty_sum  capacity  Capacity
    ------------------------------------------------------
    114         172      38       3208       210       172
    202         151      59       3208       210       151
    219         198      52       3208       250       198
    622         207      43       3208       250       207
    132         115      63       3208       178       115
    271         138     112       3208       250       138
    302         105      75       3208       180       105
    114         119      91       3208       210       119
    202         120      90       3208       210       120

We can now generate some reports that summarize the data in a table, as in the following example:

    proc sql;
      title "MARCH: Flight Loads";
      select flight as flt label = "Flight Number",
             dest as dst label = "Flight Destination",
             MEAN(boarded) as occup_avg label = "Average Occupancy" format=8.1,
             MIN(boarded) as occup_min label = "Minimum Occupancy",
             MAX(boarded) as occup_max label = "Maximum Occupancy",
             MIN(capacity - boarded) as empty_min
               label = "Minimum Number of Empty Seats"
      from sql.march
      Group By flight, dest;
    quit;


The output for this example is:

    MARCH: Flight Loads

                                                            Minimum
                                                             Number
    Flight  Flight         Average    Minimum    Maximum   of Empty
    Number  Destination  Occupancy  Occupancy  Occupancy      Seats
    ---------------------------------------------------------------
    114     LAX              153.0        117        197         13
    132     YYZ              126.3         75        164         14
    202     ORD              133.0        104        175         35
    219     LON              191.1        147        241          9
    271     PAR              144.5        104        177         73
    302     WAS              103.7         66        135         45
    622     FRA              182.5        137        210         40
HAVING: FILTERING GROUPED DATA

The HAVING clause is used following a GROUP BY clause to filter grouped data. HAVING acts on the grouped or aggregated data, in contrast to WHERE, which acts on the individual rows of the table. You can use summary functions with HAVING, but you cannot use a summary function in the WHERE clause if the summary function is aggregating data across rows. In the preceding example, we grouped the data by flight number and destination and computed some summary statistics for each group. If we wanted to include in our report only those flights where the minimum number of empty seats exceeded 40, we would add a HAVING clause to the code as follows:

    proc sql;
      title "HAVING: Filtering Grouped Data - Flights with a Minimum of 40 Empty Seats";
      select flight as flt label = "Flight Number",
             dest as dst label = "Flight Destination",
             MEAN(boarded) as occup_avg label = "Average Occupancy" format=8.1,
             MIN(boarded) as occup_min label = "Minimum Occupancy",
             MAX(boarded) as occup_max label = "Maximum Occupancy",
             MIN(capacity - boarded) as empty_min
               label = "Minimum Number of Empty Seats"
      from sql.march
      Group By flight, dest
      HAVING empty_min > 40;
    quit;

All the flights whose minimum number of empty seats was 40 or fewer have been excluded from the output, which follows:

    HAVING: Filtering Grouped Data - Flights with a Minimum of 40 Empty Seats

                                                            Minimum
                                                             Number
    Flight  Flight         Average    Minimum    Maximum   of Empty
    Number  Destination  Occupancy  Occupancy  Occupancy      Seats
    ---------------------------------------------------------------
    271     PAR              144.5        104        177         73
    302     WAS              103.7         66        135         45

The results would have been quite different if we had used a WHERE clause instead of a HAVING clause. Consider the following PROC SQL query where the HAVING clause has been replaced by a WHERE clause:

    proc sql;
      title "Trying to Filter Grouped Data with WHERE";
      select flight as flt label = "Flight Number",
             dest as dst label = "Flight Destination",
             MEAN(boarded) as occup_avg label = "Average Occupancy" format=8.1,
             MIN(boarded) as occup_min label = "Minimum Occupancy",
             MAX(boarded) as occup_max label = "Maximum Occupancy",
             MIN(capacity - boarded) as empty_min
               label = "Minimum Number of Empty Seats"
      from sql.march
      where capacity - boarded > 40
      Group By flight, dest;
    quit;

As the following output shows, this query returns grouped data computed only from the individual flight records where the number of empty seats, capacity - boarded, was greater than 40. Records with 40 or fewer empty seats are eliminated before grouping, so every flight still appears but its statistics change.

    Trying to Filter Grouped Data with WHERE

                                                            Minimum
                                                             Number
    Flight  Flight         Average    Minimum    Maximum   of Empty
    Number  Destination  Occupancy  Occupancy  Occupancy      Seats
    ---------------------------------------------------------------
    114     LAX              131.0        117        160         50
    132     YYZ              103.3         75        117         61
    202     ORD              126.0        104        151         59
    219     LON              173.0        147        198         52
    271     PAR              144.5        104        177         73
    302     WAS              103.7         66        135         45
    622     FRA              177.0        137        207         43
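WHERE and HAVING can also appear in the same query, each doing its own job. The following is a minimal sketch, not taken from the original examples: the choice to exclude LAX and the title text are assumptions made for illustration. WHERE removes individual rows before grouping, and HAVING then filters the groups using a summary function written out directly.

```sas
proc sql;
  title "WHERE and HAVING in One Query";
  select flight, dest,
         MIN(capacity - boarded) as empty_min
           label = "Minimum Number of Empty Seats"
  from sql.march
  where dest ne "LAX"                   /* row filter: applied before grouping   */
  group by flight, dest
  having MIN(capacity - boarded) > 40;  /* group filter: applied to the summary  */
quit;
```

Writing the summary function in the HAVING clause, rather than the alias empty_min, keeps the condition readable even when the column is not selected.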

COMBINING DATA FROM DIFFERENT TABLES


Often we will need to combine or select data from different tables. We will introduce some of the common ways to accomplish this with PROC SQL.
OUTER UNION CORR: APPENDING TABLES OR QUERY RESULTS

Concatenating, or appending, two or more tables or query results can be accomplished by placing the set operator OUTER UNION CORR between the queries whose results are to be concatenated. OUTER UNION CORR appends query results by combining the records from both queries, and aligns columns that have the same name and type. If OUTER UNION is used without CORR, none of the columns will be aligned, and the result will have a total number of columns equal to the sum of the number of columns in each of the queries being concatenated.

As an illustration, let us append data from a table named APRIL to the data from the MARCH table. The APRIL table has the same columns and the same type of flight information as the MARCH table. To limit the output we will add a WHERE clause to select only flights 114 and 219, and we will rename one column in one of the queries. The following PROC SQL code will accomplish this:

    proc sql;
      title "OUTER UNION - Concatenating MARCH and APRIL without CORR";
      select flight, date, boarded
      from sql.march
      where flight IN ("114","219")
      OUTER UNION
      select flight, date, boarded as n_passengers
      from sql2.april
      where dest LIKE "L%"   /* select only flights where dest begins with "L" */
      ;
    quit;

A partial listing of the output shows that none of the columns were aligned:

    OUTER UNION - Concatenating MARCH and APRIL

    flight  date     boarded  flight  date     n_passengers
    -------------------------------------------------------
    114     01MAR94      172                .             .
    219     01MAR94      198                .             .
    114     02MAR94      119                .             .
    219     02MAR94      147                .             .
                  .        .  114     01APR94           167
                  .        .  219     01APR94           201
                  .        .  114     02APR94           114
                  .        .  219     02APR94           150

Notice that the rows of APRIL have been appended to the rows of MARCH and that there are separate columns for the columns from MARCH and those from APRIL. To align the columns in the result, we modify the code by adding CORR to the OUTER UNION set operator:

    proc sql;
      title "OUTER UNION CORR- Concatenating MARCH and APRIL";
      select flight, date, boarded
      from sql.march
      where flight IN ("114","219")
      OUTER UNION CORR
      select flight, date, boarded as n_passengers
      from sql2.april
      where dest LIKE "L%"   /* select only flights where dest begins with "L" */
      ;
    quit;

The following is a partial listing of the output:

    OUTER UNION CORR- Concatenating MARCH and APRIL

    flight  date     boarded  n_passengers
    --------------------------------------
    114     01MAR94      172             .
    219     01MAR94      198             .
    114     02MAR94      119             .
    114     01APR94        .           167
    219     01APR94        .           201
    114     02APR94        .           114

The columns flight and date have been aligned, but the columns for boarded have not. If you look at the code for this example, you will notice that in the query selecting data from APRIL the column boarded was renamed to n_passengers, so it is no longer an exact match for boarded from MARCH. Therefore, boarded and n_passengers are not aligned. Technically, OUTER UNION CORR operates on the results of queries, not directly on tables. However, if we add a CREATE TABLE clause to the code, the concatenated records shown in the output are saved to a new table, and we have effectively concatenated or appended the APRIL table to the MARCH table. Other set operators which you may find useful are UNION, EXCEPT and INTERSECT. These are all documented in the SQL Procedure User's Guide, available on the web at support.sas.com.
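Saving the appended result with CREATE TABLE might look like the following minimal sketch. The output table name work.march_april is a hypothetical name chosen for illustration, and the sketch assumes, as stated earlier, that APRIL has the same columns as MARCH so that CORR aligns every column.

```sas
proc sql;
  /* work.march_april is a hypothetical name for the combined table */
  create table work.march_april as
    select * from sql.march
    OUTER UNION CORR
    select * from sql2.april;
quit;
```

Because no column is renamed here, the new table should have the same eight columns as MARCH, with the APRIL rows following the MARCH rows.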
JOINING TABLES OR QUERY RESULTS

Joining tables in PROC SQL is similar to merging data sets in the DATA step. Usually, tables are joined based on one or more common columns. When joining tables, you must specify a join condition. That is, you must tell PROC SQL what column(s) or field(s) to use in order to match rows in one table to the corresponding rows in the other table(s). Tables that have no common column can also be joined if necessary.

Suppose we have tables A and B from a hospital database.

Table A contains patient information for patients that visited the emergency room of a hospital: name, address, gender, date of birth, patient account number, etc.

Table B contains information on surgical procedures performed: type of surgery, date of surgery, patient account number, etc.

What do tables A and B have in common? They both include a patient account number for each record. Each patient coming to the emergency room will be represented once (per visit) in table A. Any patient in the hospital that has had any surgery will be represented in table B with one record per surgical procedure performed. Not all patients having surgery come through the emergency room, and not all patients that come to the emergency room have surgery. Therefore, you can see that not all patients in table A will have records in table B and vice versa. Tables A and B have at least one column in common, and some records in each table belong to the same observational units, i.e. have the same value of the common column.

INNER JOINS

If we want to get a list of the patients who came to the emergency room and had surgery, we need to get the records identified by the purple intersection in the figure above. An INNER JOIN of A and B results in selecting rows which have the same value of the common column in each table.

[Figure: overlapping regions A and B; the intersection corresponds to the INNER JOIN.]
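The inner join described above might be sketched as follows. The tables er_visits (playing the role of A) and surgeries (playing the role of B), their columns and their values are all hypothetical and exist only to make the example self-contained.

```sas
/* Hypothetical table A: one row per emergency room visit */
data er_visits;
   input acct_no $ name $;
   datalines;
1001 Adams
1002 Baker
1003 Chen
;
run;

/* Hypothetical table B: one row per surgical procedure */
data surgeries;
   input acct_no $ surg_type :$12.;
   datalines;
1002 Appendectomy
1003 Arthroscopy
1003 Biopsy
1004 Bypass
;
run;

proc sql ;
   title "INNER JOIN of A and B";
   /* keep only rows whose account number appears in both tables */
   select A.acct_no, A.name, B.surg_type
   from er_visits as A INNER JOIN surgeries as B
        ON A.acct_no = B.acct_no ;
quit;
```

With this sample data, only accounts 1002 and 1003 appear in both tables, so the result has three rows: one for 1002 and two for 1003, which has two procedures. Accounts 1001 and 1004 are dropped.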

OUTER JOINS

There are three types of Outer Joins:
1. Left Outer Joins
2. Right Outer Joins
3. Full Outer Joins

LEFT OUTER JOINS

Suppose we want a list of all patients who came to the emergency room (A) and information about their surgery history (B), if any. We need all the rows in A and any rows in B whose patient ID matches a patient ID in A.

RIGHT OUTER JOINS

If we want a list of all patients who had surgical procedures (B) and information about their emergency room visit (A), if any, we need all the rows in B and any rows in A whose patient ID matches a patient ID in B.

Left Outer Join: all rows from A, some rows from B. Minimum number of rows: _________________

Right Outer Join: some rows from A, all rows from B. Minimum number of rows: _________________


FULL OUTER JOINS

If we wanted a complete listing of all patients who came to the emergency room (A) OR had surgery (B) then we would use a FULL OUTER JOIN.

Full Outer Join: all rows from A and all rows from B. Minimum number of rows: ________________________
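The three outer joins on the hospital example can be sketched side by side. As before, the tables er_visits (A) and surgeries (B) and their contents are hypothetical stand-ins, not data from this tutorial.

```sas
/* Hypothetical table A: emergency room visits */
data er_visits;
   input acct_no $ name $;
   datalines;
1001 Adams
1002 Baker
1003 Chen
;
run;

/* Hypothetical table B: surgical procedures */
data surgeries;
   input acct_no $ surg_type :$12.;
   datalines;
1002 Appendectomy
1003 Arthroscopy
1003 Biopsy
1004 Bypass
;
run;

proc sql ;
   title "LEFT OUTER JOIN: all rows from A";
   select A.acct_no, A.name, B.surg_type
   from er_visits as A LEFT OUTER JOIN surgeries as B
        ON A.acct_no = B.acct_no ;

   title "RIGHT OUTER JOIN: all rows from B";
   select B.acct_no, A.name, B.surg_type
   from er_visits as A RIGHT OUTER JOIN surgeries as B
        ON A.acct_no = B.acct_no ;

   title "FULL OUTER JOIN: all rows from A and B";
   /* COALESCE takes the account number from whichever table has it */
   select coalesce(A.acct_no, B.acct_no) as acct_no, A.name, B.surg_type
   from er_visits as A FULL OUTER JOIN surgeries as B
        ON A.acct_no = B.acct_no ;
quit;
```

With this sample data the left join returns four rows (account 1001 has a missing surg_type), the right join returns four rows (account 1004 has a missing name), and the full join returns five rows.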

In general, the three types of Outer Joins yield the following results:
1. Left: selects all the rows in the left table and any associated rows from the right table
2. Right: selects all the rows in the right table and any associated rows from the left table
3. Full: selects all rows from both tables

Returning to the flight data, the reader will recall that the MARCH table had a flight number and a destination code. But unless we have memorized all the codes for all the airports, those destination codes are not very descriptive. Now recall that we were constructing a lookup table AIRPORT_LU which had the city and country for each airport code. We can join the two tables so that our query results can include the destination city as well as the airport code. The airport code is the common column to both tables, so MARCH will be joined to AIRPORT_LU using the airport code in an INNER JOIN as follows:

    proc sql ;
    title "INNER JOIN" ;
    select A.flight, A.dest, B.city,
           sum(A.boarded) as passenger_tot label="Total Number of Passengers" format=comma9.0
    from sql.march as A, sql2.airport_lu as B
    WHERE A.dest = B.code
    group by A.flight, A.dest, B.city ;
    quit;

Running the above code creates the following output:

    INNER JOIN

    flight  dest  City Where Airport is Located  Total Number of Passengers
    -----------------------------------------------------------------------
    114     LAX   Los Angeles, CA                1,071
    132     YYZ   Toronto, ON                    884
    202     ORD   Chicago, IL                    931
    622     FRA   Frankfurt                      1,095

The first thing one should notice in the output is that while the city has been added to the output, not all the airport codes from MARCH are listed. The obvious thing to do would be to check the AIRPORT_LU table and see if any of our airport codes are missing from that table. Another way would be to use a LEFT OUTER JOIN with MARCH listed on the left, so that all the airport destination codes from MARCH and any matching codes from AIRPORT_LU would be listed.

Another thing the reader may have noticed is that in the PROC SQL code we have assigned ALIASes A and B for tables MARCH and AIRPORT_LU respectively. The aliases make it possible to specify which table each column in the SELECT statement will come from since we are now dealing with more than one table. The following code sets up a LEFT OUTER JOIN for our query:


    proc sql ;
    title "LEFT OUTER JOIN" ;
    select A.dest, B.city,
           sum(A.boarded) as passenger_tot label="Total Number of Passengers" format=comma9.0
    from sql.march as A LEFT OUTER JOIN sql2.airport_lu as B
         ON A.dest = B.code
    Group by A.dest, B.city ;
    quit;

Running the above code creates the following output:

    LEFT OUTER JOIN

    dest  City Where Airport is Located  Total Number of Passengers
    ---------------------------------------------------------------
    FRA   Frankfurt                      1,095
    LAX   Los Angeles, CA                1,071
    LON                                  1,338
    ORD   Chicago, IL                    931
    PAR                                  867
    WAS                                  622
    YYZ   Toronto, ON                    884

All records from MARCH are selected and any matching information from AIRPORT_LU has been selected. Since there is no city listed for codes LON, PAR and WAS, these codes are either not in the AIRPORT_LU table, or they are not valid codes. A RIGHT OUTER JOIN with AIRPORT_LU on the right would yield all the cities in the lookup table and any matching codes from MARCH. It is left as an exercise for the reader.
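As a starting point for that exercise, one possible form of the RIGHT OUTER JOIN is sketched below. It assumes the same tables, librefs and column names (dest, boarded, code, city) used in the queries above. The choice to list B.code rather than A.dest is an assumption made here, so that every airport code in the lookup table is shown even when it has no match in MARCH.

```sas
proc sql ;
   title "RIGHT OUTER JOIN" ;
   /* all rows from AIRPORT_LU (right), matching rows from MARCH (left) */
   select B.code, B.city,
          sum(A.boarded) as passenger_tot
             label="Total Number of Passengers" format=comma9.0
   from sql.march as A RIGHT OUTER JOIN sql2.airport_lu as B
        ON A.dest = B.code
   group by B.code, B.city ;
quit;
```

Any airport in AIRPORT_LU with no flights in MARCH would appear with a missing passenger total.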

MACRO INTERFACE TO PROC SQL


It is very easy to store SQL query results in macro variables which can be used later to enhance our output or for other reasons. We can store individual values in a macro variable, and we can store the values of several records as a delimited string in one macro variable. This section is not about teaching the user about macro variables. The purpose here is to give the user a simple tool to use.

If we wanted to enhance our reports with titles that included the beginning and ending dates of the period covered in the data, as well as a list of the destinations, we could run two queries to get the information and then hard code it into the titles, or we can use the macro interface to store the information in macro variables so that we can re-use the code without manually editing the titles.

The following code shows how to store the earliest and latest dates from the MARCH table into 2 macro variables and also how to store a list of the destination codes separated by commas into one macro variable.

    proc sql NOPRINT; /* NOPRINT will suppress any output */
    select min(date) as start_date format = DATE7.,
           max(date) as end_date format = DATE7.
    into :start_date, :end_date
    from sql.march;
    %PUT START_DATE: &start_date --- END_DATE: &end_date; /* check the LOG */
    select unique(dest) as destination
    into :destination separated by ","
    from sql.march ;
    %PUT Destination List: &destination; /* check the LOG */
    quit;

The log will show the following:

    852 %PUT START_DATE: &start_date --- END_DATE: &end_date;
    START_DATE: 01MAR94 --- END_DATE: 07MAR94

    858 %PUT Destination List: &destination;
    Destination List: FRA,LAX,LON,ORD,PAR,WAS,YYZ

PUTTING IT ALL TOGETHER


The following is the code to create a summary report from the MARCH flight data table. This example combines most of the features and options of PROC SQL introduced in this tutorial.

    proc sql NOPRINT;
    select min(A.date) as start_date format = DATE7.,
           max(A.date) as end_date format = DATE7.
    into :start_date, :end_date
    from sql.march as A;
    select unique(A.city) as destination
    into :destination separated by " - " /* use - to delimit the list */
    from sql2.airport_lu as A ;
    quit;

    proc sql ;
    title "FLIGHT REPORT FOR: &start_date - &end_date" ;
    title2 "Destinations: ";
    title3 "&destination" ;
    select B.city as dst label = "Flight Destination" format=$30.,
           COUNT(distinct A.flight) as nd_flight label="Number of Flights" format=comma6.0,
           COUNT(A.flight) as n_flight label="Total Number of Flights" format=comma6.0,
           MEAN(A.boarded) as occup_avg label = "Average Occupancy" format=8.1,
           SUM(A.boarded) as occup_min label = "Total Number of Passengers" format=comma9.0
    from sql.march as A LEFT OUTER JOIN sql2.airport_lu as B
         ON A.dest = B.code
    Group By A.flight, B.city
    OUTER UNION CORR
    select "ALL Cities" as dst label = "Flight Destination" format=$30.,
           COUNT(distinct A.flight) as nd_flight label="Number of Flights" format=comma6.1,
           COUNT(A.flight) as n_flight label="Total Number of Flights" format=comma6.1,
           MEAN(boarded) as occup_avg label = "Average Occupancy" format=8.1,
           SUM(boarded) as occup_min label = "Total Number of Passengers" format=COMMA9.0
    from sql.march as A LEFT OUTER JOIN sql2.airport_lu as B
         ON A.dest = B.code;
    quit ;

Our report is now shown below as it appears in the output window.

    FLIGHT REPORT FOR: 01MAR94 - 07MAR94
    Destinations:
    Chicago, IL - Frankfurt - London - Los Angeles, CA - Paris - Toronto, ON - Washington DC

    Flight Destination  Number of Flights  Total Number of Flights  Average Occupancy  Total Number of Passengers
    -------------------------------------------------------------------------------------------------------------
    Chicago, IL         1                  7                        133.0              931
    Frankfurt           1                  6                        182.5              1,095
    London              1                  7                        191.1              1,338
    Los Angeles, CA     1                  7                        153.0              1,071
    Paris               1                  6                        144.5              867
    Toronto, ON         1                  7                        126.3              884
    Washington DC       1                  6                        103.7              622
    ALL Cities          7                  46                       148.0              6,808


A basic ODS statement will output this table as an RTF file which can be opened in any standard word processing program.
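A minimal sketch of such an ODS statement is shown below. The file name flight_report.rtf is hypothetical, and a simple query against the MARCH table stands in for the full report code above. Any PROC SQL output produced between the two ODS statements is written to the RTF file.

```sas
ods rtf file="flight_report.rtf" ;   /* hypothetical output file name */

proc sql ;
   title "Passengers by Destination" ;
   select dest,
          sum(boarded) as passenger_tot
             label="Total Number of Passengers" format=comma9.0
   from sql.march
   group by dest ;
quit;

ods rtf close ;   /* closing the destination completes the file */
```

Without a path in the file name, the file is written to the current working directory of the SAS session.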

CONCLUSION
Many aspects of PROC SQL, including most of the syntax required to perform data management and reporting tasks, have been introduced in this tutorial. SAS users new to the SQL procedure should have gained a good understanding of the different tasks that can be performed with this procedure. The reader now should be familiar enough with PROC SQL to be able to search the documentation for options, functions, clauses and statements that will help them solve problems of greater complexity. The material presented in this tutorial should serve as a springboard off of which the SAS user can dive right into PROC SQL and not only manage to stay afloat, but also to get the results they seek.

REFERENCES
SAS 9.1 SQL Procedure User's Guide. Cary, NC: SAS Institute Inc., 2004.
Lafler, Kirk Paul. 2004. PROC SQL: Beyond the Basics Using SAS. Cary, NC: SAS Institute Inc.

CONTACT INFORMATION
If you would like a copy of the code and datasets used in this tutorial, please send me an email with your request. Your comments and questions are valued and encouraged. Contact the author at:

Richard Severino
Convergence CT
1132 Bishop Street, Suite 615
Honolulu, HI 96813
Email: [email protected], [email protected]

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
