Microsoft SQL Server 2017 On Linux
ISBN: 978-1-26-012114-8
MHID: 1-26-012114-3
The material in this eBook also appears in the print version of this title:
ISBN: 978-1-26-012113-1,
MHID: 1-26-012113-5.
All trademarks are trademarks of their respective owners. Rather than put a
trademark symbol after every occurrence of a trademarked name, we use
names in an editorial fashion only, and to the benefit of the trademark
owner, with no intention of infringement of the trademark. Where such
designations appear in this book, they have been printed with initial caps.
TERMS OF USE
This is a copyrighted work and McGraw-Hill Education and its licensors
reserve all rights in and to the work. Use of this work is subject to these
terms. Except as permitted under the Copyright Act of 1976 and the right to
store and retrieve one copy of the work, you may not decompile,
disassemble, reverse engineer, reproduce, modify, create derivative works
based upon, transmit, distribute, disseminate, sell, publish or sublicense the
work or any part of it without McGraw-Hill Education’s prior consent. You
may use the work for your own noncommercial and personal use; any other
use of the work is strictly prohibited. Your right to use the work may be
terminated if you fail to comply with these terms.
Foreword
I started working with SQL Server more than 30 years ago, when it was a
Sybase product that ran on Unix (as well as on more than half a dozen
other operating systems), and we did all our work on Unix machines.
SQL Server didn’t run on Windows at that time, because there was no
Windows operating system. But several years later, when Sybase partnered
with Microsoft to port its database product onto PC-based operating
systems, few people could foresee just how powerful and ubiquitous these
PCs would become. PC is hardly about only personal computers these days.
I’ve seen a lot of changes with SQL Server over these three decades, and
some of the most interesting ones for me are when the product seems to
circle back and add features or internal behaviors that were originally part
of the product but had once been discarded. Some of the original ideas were
not that far off base after all. Having SQL Server return to its roots and
become available on Linux (a Unix-based OS) in SQL Server 2017 almost
seems like coming home.
Benjamin Nevarez has been working with Unix-based operating systems
almost as long as I’ve been working with SQL Server. He was also excited
to see SQL Server make an appearance on Linux. It didn’t take him long to
decide to get his hands dirty and figure out how SQL Server professionals
could get the maximum value out of the new OS. He wrote this book to
make available to others all that he had learned.
I have known Ben for more than a dozen years, since he first started
finding typos and other errors in my SQL Server 2005 books. We started a
dialog, and I then asked him if he was interested in being a technical editor
for some of my work. Through this technical collaboration, I have learned
that when Ben sets out to learn something, he does it thoroughly. His
attention to detail and passion for complete answers never cease to amaze
me.
In this book, Ben tells you how to get started using SQL Server on Linux
and how the database system actually works on the new platform. Chapters
1, 2, and 3 are particularly useful if you’re new to Linux but experienced
with SQL Server. Although most SQL Server books wouldn’t go into
operating system administration details, Chapter 3 does just that, to make
the transition easier for people who have many years, if not decades, of
experience with Windows. Of course, if you’re already proficient with
Linux, but new to SQL Server, you can focus on the following chapters, where Ben's expertise with SQL Server shines through.
In Chapter 4, he tells you how SQL Server can be configured and blends
SQL Server details with the Linux tools you need to access and control your
SQL Server. Chapters 5 and 6 are very SQL Server focused. Chapter 5
provides some very detailed information about working with SQL Server
queries, including how queries are optimized and processed, and how you
can tune slow-running queries. Chapter 6 tells you all about some of the
latest and greatest in SQL Server’s optimization techniques in the most
recent versions of SQL Server. Finally, in Chapters 7 and 8, he provides
coverage of two critical focus areas for a database administrator: managing
availability and recoverability, and setting up security. These are critical
topics for any DBA, and because they involve the relationship between the
database engine and the operating system, it’s best to learn about them from
someone who is an expert in both areas.
Although both Linux and SQL Server are huge topics and there is no
way one book can provide everything you need to know about both
technologies, Ben has done an awesome job of giving you exactly what you
need to know, not only to get SQL Server running on the Linux operating
system, but to have it performing well, while keeping your data safe and
secure.
—Kalen Delaney
www.SQLServerInternals.com
Poulsbo, Washington, March 2018
Acknowledgments
A number of people contributed to making this book a reality. First of
all, I would like to thank everyone on the McGraw-Hill Education
team with whom I worked directly on this book: Lisa McClain,
Claire Yee, and Radhika Jolly. Thanks to all for coordinating the entire
project while trying to keep me on schedule. I also would like to thank my
technical editor, Mark Broadbent, as his amazing feedback was critical to
improving the quality of the chapters of the book. A very special thank you
has to go out to Kalen Delaney for writing the Foreword for this book.
Kalen has been my biggest inspiration in the SQL Server world, and it is
because of people like her that I ended up writing books and presenting at
technology conferences.
Finally, on the personal side, I would like to thank my family: my wife,
Rocio, and three sons, Diego, Benjamin, and David. Thank you for your
unconditional support and patience every time I need to work on another
SQL Server book.
Introduction
I started my IT career working with Unix applications and databases back
in the early ’90s, and my first job ever was as a data processing manager
for a small IT shop. Back then, I was running Unix System V Release 4
on an NCR system. With such big and expensive minicomputer systems, I
was always wondering if I could have a Unix system on less expensive
hardware, such as a PC, to learn and test without disrupting our shared test
systems.
Then I read an article in a personal computing magazine about
something called Linux. Nobody knew what Linux was back then, and very
few people—mostly at universities—had access to the Internet in those
days. So I downloaded Linux on four or five floppy disks, installed it
on a PC, and started playing with it. It was a distribution called Slackware.
It was amazing that I could finally experiment and test everything I wanted
on my own personal Unix system.
I continued to work with Linux and all the popular Unix commercial
implementations, including IBM AIX, HP-UX, Sun Solaris, and others,
throughout the ’90s. For several years, people still didn’t know what Linux
was. It was not an immediate success. But by the end of the ’90s I decided
to specialize in SQL Server, and by doing that I left the Unix world behind.
So it looked like I was not going to touch a Unix system ever again. But
one day in March 2016, Microsoft surprised the technology community by
announcing that SQL Server would be available on Linux. When I first
heard the news, I thought it would be cool to write a book about it. Because
I was just finishing a book about SQL Server 2016, I decided to wait to see
how the technology evolved and to take a break from writing. One day, as I
was running while training for a marathon, I started thinking about the
project again and decided it could be a great idea to write a book about SQL
Server on Linux. Just after finishing my run, I went to my laptop and sent
an e-mail to my contact at McGraw-Hill Education, who eventually
connected me with Lisa McClain. Within a few days, I was working on
this new book project.
Let me tell you how I structured this book.
Chapter 1 shows you how to get SQL Server running on Linux as
quickly as possible, so you can start using the technology, even though I
haven’t covered all the details yet. The chapter covers how to install SQL
Server on Red Hat Enterprise Linux, SUSE Linux Enterprise Server, and
Ubuntu and how to configure an image of SQL Server on a Docker
container. More details of the setup and configuration are included in
Chapter 4.
Chapter 2 covers some SQL Server history with different operating
systems and explains some of the details about how SQL Server on Linux
works. This includes describing the interaction between SQL Server and the
operating system, decisions regarding its architecture, and information
about its software implementation, among other related topics. It also
covers details about the SQL Operating System (SQLOS), the Drawbridge
technology, and the SQL Platform Abstraction Layer (SQLPAL).
I include an entire chapter dedicated to Linux for the SQL Server
professional. Chapter 3 covers all the basic Linux commands you need to
get started, including managing files and directories and their permissions,
along with a few more advanced topics, including system monitoring.
Chapter 4 covers SQL Server setup and configuration in a Linux
environment, and it is divided into three main topics: using the mssql-conf
utility to configure SQL Server, which is required in Linux environments;
using Linux-specific kernel settings and operating system configurations;
and using some traditional SQL Server configurations for both Windows
and Linux installations.
After spending time learning how to set up and configure SQL Server,
you’ll move to Chapter 5, which discusses how to use SQL Server to
perform database operations. This chapter, in particular, covers query tuning
and optimization topics, which are applicable both to Windows and Linux
installations—and, in fact, to all the currently supported versions of the
product.
Chapter 6 continues with query processing and covers the new features
available in SQL Server 2017, such as adaptive query processing and
automatic tuning.
Chapter 7 is about high-availability and disaster-recovery solutions for
SQL Server on Linux and focuses on Always On availability groups.
Availability groups on both Windows and Linux can be used in high-
availability and disaster-recovery configurations and for migrations and
upgrades, or even to scale out readable copies of one or more databases.
The chapter also covers Pacemaker, a clustering solution available on Linux
distributions.
Finally, I close the book with Chapter 8, which is about security. This
chapter reviews security from a general point of view and includes details
about some of the new security features in SQL Server, including
Transparent Data Encryption, Always Encrypted, Row-Level Security, and
Dynamic Data Masking.
Chapter 1
In This Chapter
Creating a Virtual Machine
Installing SQL Server
Configuring SQL Server
Connecting to SQL Server
Installing Additional Components
Installing on Ubuntu
Installing on SUSE Linux Enterprise Server
Running SQL Server on Docker
Uninstalling SQL Server
Summary
Although SQL Server has been a Windows-only software product for more than
two decades, it originally started as a database engine for the then-new
OS/2 operating system. The year was 1989 and the product was actually
called Ashton-Tate/Microsoft SQL Server, originally written by Sybase
using the C language. By the summer of 1990, after ending a marketing and
distribution agreement with Ashton-Tate, it was renamed Microsoft SQL
Server.
After the failure of OS/2 to gain market acceptance and the huge success of
Windows, Microsoft SQL Server was eventually moved to the then-new
Windows NT platform. SQL Server 4.21a, released in 1993, was the first
version to run on Windows. The last version of SQL Server for OS/2, SQL
Server 4.2B, was released that same year.
From then on, Microsoft focused on its software product as a Windows NT–
only strategy. But that was about to change more than 20 years later when
Microsoft announced in March 2016 that SQL Server would be available on the
Linux platform. In 2017, Microsoft indicated that this version would be named
SQL Server 2017 and would be available on Red Hat Enterprise Linux,
Ubuntu, and SUSE Linux Enterprise Server, in addition to Docker containers.
The product was released in October 2017.
I started my career in information technology with Unix applications and
databases in the early 1990s and was, at the time, mostly unaware of these SQL
Server developments. I worked with all the popular implementations of Unix,
including Linux, for about a decade. These Unix platforms included System V
Release 4, IBM AIX, and Hewlett-Packard HP-UX. I was an early Linux
user and later deployed it in production, mostly to run web
servers. I remember that Linux was not an immediate success, and nobody
knew about Linux in those early days. I eventually decided to specialize in
SQL Server and left the Unix/Linux world behind.
When I heard that SQL Server would be available on Linux, I was really
excited about the possibilities. It was like going back to the old times. It was
ironic and interesting that SQL Server would bring me back to Linux.
Although Chapter 2 continues the discussion of SQL Server history with
different operating systems, this chapter will show you how to install SQL
Server on Linux so you can start playing with the technology as quickly as
possible. Chapter 2 also describes decisions about SQL Server on Linux
architecture, software implementations, and how SQL Server interacts with the
operating system, among other related topics. The remaining chapters focus on
all the details and more advanced topics.
Creating a Virtual Machine
This chapter will show you how to install SQL Server on a virtual machine
image with Linux preinstalled. You can obtain the Linux virtual machine in
several ways—for example, by using a cloud provider such as Microsoft Azure
or Amazon Web Services (AWS), or by using an image of a virtual machine
created in the virtualization environment in your data center. You can also
install Linux first on your own virtual machine or hardware and then follow the
rest of the chapter.
Later chapters focus on more advanced details, such as configuring SQL
Server or implementing high availability or disaster recovery solutions. The
following Linux distributions are currently supported to run SQL Server:
Red Hat Enterprise Linux 7.3 or 7.4
SUSE Linux Enterprise Server v12
Ubuntu 16.04
Docker Engine (on Linux, Mac, or Windows)
NOTE
You can sign up for the Red Hat Developer Program, which provides no-cost
subscriptions where you can download the software for development use
only. For more details, see
https://developers.redhat.com/products/rhel/download.
In this section, I’ll show you how to create a virtual machine in Microsoft
Azure. If you need to use Amazon AWS, you can create an account and a
virtual machine at https://aws.amazon.com/.
Start by visiting the Azure web site at https://azure.microsoft.com, where
you can use an existing Microsoft Azure subscription or sign up for a free
account with a US$200 free credit. After logging on with your credentials,
proceed to the Microsoft Azure portal (https://portal.azure.com) by following
the links. You may need to spend a few moments getting familiar with the
Microsoft Azure portal.
To create a new virtual machine on the Microsoft Azure portal, choose
Virtual Machines | Add. Hundreds of images may be available, so you need to
enter a filter to help with the search. As of this writing, you can use two kinds
of virtual machines: images with only Linux installed, and images with both
Linux and SQL Server preinstalled.
If you select a virtual machine with both Linux and SQL Server installed,
you will need to set the system administrator (sa) password and start the SQL
Server service, which is a simple process that is covered in the section
“Configuring SQL Server.” If you select a virtual machine with only Linux
installed, you’ll also learn how to install SQL Server in that section.
If you are interested in finding a virtual machine with SQL Server already
installed, enter SQL Server Red Hat Enterprise Linux or SQL Server SUSE
Enterprise Linux Server or SQL Server Ubuntu in the Search bar. As of this
writing, you can find images available for the Express, Developer, Web,
Standard, and Enterprise editions for all the supported Linux distributions
mentioned earlier. As suggested, the available choices may change at a later
time. An example using SQL Server Red Hat Enterprise Linux is shown in
Figure 1-1.
Figure 1-1 Microsoft Azure virtual machines search
Clicking any of the results will show you the image details on the right side
of the screen. For example, selecting Free SQL Server License: SQL Server
2017 Developer on Red Hat Enterprise Linux 7.4 (RHEL) will show you the
following information (note that the message refers to the free SQL Server
Developer edition; you still have to pay for the cost of running the Linux
virtual machine):
If you need to find a virtual machine without SQL Server installed, enter
Red Hat Enterprise Linux or SUSE Enterprise Linux Server or Ubuntu in
the Search bar. Be sure to select a supported version, as indicated earlier—for
example, 7.3 or 7.4 for Red Hat Enterprise Linux 7.3 or 7.4.
Because I am going to show the entire installation process, for this exercise,
select Red Hat Enterprise Linux 7.4. You may also choose a virtual machine
with SQL Server installed, such as Free SQL Server License: SQL Server 2017
Developer On Red Hat Enterprise Linux 7.4 (RHEL) and still follow the rest of
the chapter content. My selection shows the following description:
Figures 1-2 through 1-5 do not show the entire portal screens. You may
need to scroll down on the web portal to see some parts of a screen, but all the
choices are explained here. Some items offer a description. For example, you
may wonder about the VM Disk Type or Resource Group choices shown in
Figure 1-2. VM disk type is described as follows:
Figure 1-3 Configure virtual machine size screen
Premium disks (SSD) are backed by solid state drives and offer
consistent, low-latency performance. They provide the best balance
between price and performance, and are ideal for I/O-intensive
applications and production workloads. Standard disks (HDD) are
backed by magnetic drives and are preferable for applications where
data is accessed infrequently.
I will first show you how to install and configure SQL Server for Red Hat
Enterprise Linux. Instructions to perform the same operations on Ubuntu and
SUSE Linux Enterprise Server are covered later in the chapter. How to run
SQL Server 2017 on Docker is covered at the end of the chapter.
To continue with our Red Hat Enterprise Linux virtual machine, you need to
download and install SQL Server using the curl and yum utilities. The curl
command is designed to transfer data between servers without user interaction,
and yum is a package manager utility that we will use to install the latest
version of a package.
First, switch to superuser mode:
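For example (a minimal sketch, assuming your account has sudo privileges):
    sudo su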
If you have created a virtual machine without SQL Server installed, continue
reading. If you already have SQL Server installed, you may still want to
continue reading here to learn a bit about packages and package managers,
especially if you are new to Linux, but you don’t have to run any of the
following commands. If you selected a virtual machine with SQL Server
installed, you have to run only the following two steps:
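1. Set the sa password. A minimal sketch, assuming the standard mssql-conf location; the exact command in the original listing may differ slightly:
    sudo /opt/mssql/bin/mssql-conf set-sa-password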
2. Start SQL Server. Run the following command as hinted by the previous
output:
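For example (assuming the standard service name):
    sudo systemctl start mssql-server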
I will cover more details about configuring SQL Server later in this chapter
and also in Chapter 4. More details about starting and stopping SQL Server are
provided later in this chapter and also in Chapter 3. The remainder of the
section assumes you do not have SQL Server installed.
Run the following commands to download the SQL Server Red Hat
repository configuration file:
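A sketch using the cumulative update repository URL referenced later in this chapter (run as root, or prefix with sudo):
    curl -o /etc/yum.repos.d/mssql-server.repo https://packages.microsoft.com/config/rhel/7/mssql-server-2017.repo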
This should finish almost immediately, because this downloads a small file.
If you look at its contents, at /etc/yum.repos.d/mssql-server.repo, you will see
the following:
TIP
You can use several commands to view the contents of a file. For example,
you can try more /etc/yum.repos.d/mssql-server.repo. For information
on these commands, see Chapter 3.
Next, run the following commands to install the SQL Server package mssql-
server:
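A minimal sketch (as root, or prefix with sudo):
    yum install -y mssql-server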
NOTE
Another difference from Windows is that in Unix and Linux, commands are
case-sensitive. An incorrect command name will return an error message
and, in some cases, perhaps a recommendation as in the following examples:
In this example, yum installs the package named mssql-server, and the -y
parameter is used to answer “yes” to any question during the installation.
Here’s the output, formatted for better readability:
Although the package is now installed, SQL Server is not yet properly
configured; we will do that in the next step. Also, close your root session as
soon as it is no longer needed by running the exit command or pressing CTRL-D.
But before we continue, here’s a quick introduction to package managers,
which may be a new concept for most SQL Server developers and
administrators. The process of installing, upgrading, configuring, and removing
software is quite different on Windows platforms. Software in Linux
distributions uses packages, and there are several package management
systems, which are collections of utilities used to install, upgrade, configure,
and remove packages or distributions of software. In this chapter, we will cover
some package management utilities such as yum, apt, and zypper, which we
will use with Red Hat, Ubuntu, and SUSE, respectively.
The yum (which stands for Yellowdog Updater Modified) package manager
was written in Python. We use it to install the SQL Server package mssql-
server by running the following:
If you want to update the package, which is useful when there is a new CTP,
RC, or cumulative update, you can use this:
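For example (a sketch, assuming the package was installed from the repository registered earlier):
    sudo yum update mssql-server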
NOTE
CTPs (Community Technology Previews) or RCs (Release Candidates) are
versions of SQL Server used during their beta program before a final
release. A final release is called RTM (release to manufacturing).
Finally, one of the major changes in SQL Server 2017 compared to previous
versions is the new servicing model. Although service packs will still be used
for SQL Server 2016 and previous supported versions, no more service packs
will be released for SQL Server 2017 and later. The new servicing model will
be based only on cumulative updates (and GDRs [General Distribution
Releases] when required).
Cumulative updates will be released more often at first and then less
frequently in the new servicing model. A cumulative update will be available
every month for the first 12 months and every quarter for the remaining four
years of the full five-year mainstream life cycle.
As of this writing, three repositories are defined for SQL Server on Linux:
Cumulative updates This is the base SQL Server release and any bug
fixes or improvements since that release. This is the choice we selected
earlier using https://packages.microsoft.com/config/rhel/7/mssql-server-
2017.repo.
GDR The GDR repository contains packages for the base SQL Server
release and only critical fixes and security updates since that release. The
repository URL for GDR is
https://packages.microsoft.com/config/rhel/7/mssql-server-2017-gdr.repo.
Preview repository This is the repository used during the SQL Server
beta period, consisting of CTPs and RCs. The preview repository URL is
https://packages.microsoft.com/config/rhel/7/mssql-server.repo.
NOTE
Be aware that there may be a lot of code, articles, or documents published
out there using the preview repository. So make sure that you refer to the
correct repository.
The mssql-conf utility can accept different parameters and can be used to
initialize and set up SQL Server and to perform some other activities such as
enable or disable a trace flag, set the collation of the system databases, or
change the sa password. To see the list of possible options, run mssql-conf
without any parameter or specify the -h or --help parameter.
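For example (assuming the default installation path):
    sudo /opt/mssql/bin/mssql-conf setup      # initialize SQL Server and set the sa password
    sudo /opt/mssql/bin/mssql-conf --help     # list all available options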
Note that you need to specify a strong sa password or you may get one of
the following errors:
If you inspect the mssql-conf file, you will notice it is a basic bash script that
calls a /opt/mssql/lib/mssql-conf/mssql-conf.py python script. /opt is a standard
directory in the Unix file system that is used to install local software. There are
many other standard directories, such as /bin, /dev, /etc, /home, /lib, /tmp, and
/var—to list a few. More details about the Unix file system are covered in
Chapter 3.
You can verify that SQL Server is running using the old-school ps command method:
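For example (sqlservr is the name of the SQL Server process; you can also query the service with systemctl, whose output is described next):
    ps -ef | grep sqlservr
    sudo systemctl status mssql-server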
This will show you some interesting information, as shown next (plus the last
ten lines of the SQL Server error log file, not included here, called the
“journal” in systemctl terminology):
This will show you how long the SQL Server service has been running; the
Linux processes IDs, which are the same as those shown earlier with the ps
command; and even the location of the SQL Server documentation.
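A sketch of the client tools installation, assuming the Microsoft prod repository URL for RHEL 7 (the original listing may have differed slightly):
    sudo curl -o /etc/yum.repos.d/msprod.repo https://packages.microsoft.com/config/rhel/7/prod.repo
    sudo yum install -y mssql-tools unixODBC-devel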
An abbreviated output to fit the book page is next. Notice that this time we
are installing two packages—mssql-tools and unixODBC-devel:
Now we can connect to SQL Server using the Linux version of the familiar
sqlcmd command-line tool:
Run the following:
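For example (a sketch; replace the password placeholder with the sa password you configured earlier):
    sqlcmd -S localhost -U sa -P '<YourStrongPassword>'
Once connected, you can run a quick test query at the sqlcmd prompt:
    SELECT @@VERSION;
    GO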
Inspecting the /opt/mssql-tools/bin directory will also show you the familiar
bcp (bulk copy program) utility, which is used to import data from data files
into SQL Server.
So far, we have been able to connect from inside the virtual machine. Could
we connect from a SQL Server client outside the virtual machine? We could try
using SQL Server Management Studio (SSMS) on Windows, connecting to the
listed public IP address.
NOTE
At this point, you may need to install SQL Server Management Studio,
preferably the latest version, which you can download from
https://docs.microsoft.com/en-us/sql/ssms/download-sql-server-management-
studio-ssms. As of this writing, the latest version is 17.3.
Leave the other fields set to their default values and click OK. For
completeness, these are the definitions as shown on the Azure portal:
Priority Rules are processed in priority order; the lower the number, the
higher the priority. We recommend leaving gaps between rules—100, 200,
300, and so on—so that it’s easier to add new rules without having to edit
existing rules.
Source The source filter can be Any, an IP address range, or a default tag.
It specifies the incoming traffic from a specific source IP address range
that will be allowed or denied by this rule.
Service The service specifies the destination protocol and port range for
this rule. You can choose a predefined service, such as RDP or SSH, or
provide a custom port range. In our case, there is a predefined service, MS
SQL, which already specifies the TCP protocol and port 1433.
Port range You can provide a single port, such as 80, or a port range, such as
1024-65535. This specifies on which ports incoming traffic will be
allowed or denied by this rule. Enter an asterisk (*) to allow traffic from
clients connecting on any port.
Now you can connect from SQL Server Management Studio on Windows or
any other SQL Server client outside the virtual machine, including applications
written in a large number of programming languages or frameworks such as .NET,
Java, Python, and so on.
If you are accustomed to connecting using Windows authentication, as I am,
keep in mind that for now you need to use the only login available at the
moment, sa, and the password provided earlier. For example, Figure 1-11
shows Object Explorer from my current connection using SQL Server
Management Studio.
Figure 1-11 SQL Server Management Studio connected to a Linux instance
So you now have your first user database in Linux. Where could it be
located? There are a few ways to figure that out. Try the following:
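One option is to query the sys.master_files catalog view (this is an assumption about the exact query used in the original; any query returning the physical file names works):
    SELECT physical_name FROM sys.master_files;
    GO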
Here’s the information returned:
/var/opt/mssql/data/master.mdf
/var/opt/mssql/data/mastlog.ldf
/var/opt/mssql/data/model.mdf
/var/opt/mssql/data/modellog.ldf
/var/opt/mssql/data/msdbdata.mdf
/var/opt/mssql/data/msdblog.ldf
/var/opt/mssql/data/tempdb.mdf
/var/opt/mssql/data/templog.ldf
Note that, as with a default Windows installation, both data and transaction
log files reside in the same directory, /var/opt/mssql/data. The parent
directory, /var/opt/mssql, also contains a log directory which, as in Windows,
is used to store the SQL Server error log files, trace files created by the
default trace, and extended events system_health files; this should not be
confused with the location of database transaction log files. Spend some time
inspecting the structure and files located both in /var/opt/mssql and /opt/mssql.
Finally, there are several ways to move or copy your Windows databases
and data to Linux. For now, I can show you how to copy a Windows database
backup file quickly and perform a database restore in Linux.
First, we need a way to copy the database backup file from Windows to our
Linux virtual machine. One of the choices is to install the PuTTY scp client
on Windows, which you can download from www.putty.org. (scp stands for
secure copy; it is a utility used to copy files between computers.)
Once scp is installed, run the following:
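For example (a sketch; the backup file name and the public IP address are placeholders for your own values):
    pscp AdventureWorks2014.bak bnevarez@<public-ip-address>:/home/bnevarez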
This will take a couple of minutes and will copy the database backup file to
your home directory in Linux. In my case, the file ended up at
/home/bnevarez/AdventureWorks2014.bak.
Next, to avoid issues with permissions, use the following command to copy
the backup file to /var/opt/mssql/data:
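For example (a sketch run from the home directory; sudo is needed because the target directory is owned by the mssql user):
    sudo cp AdventureWorks2014.bak /var/opt/mssql/data/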
Run the next restore statement, which specifies the location of the backup
file and the new location for the created database files:
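A sketch of such a RESTORE statement; the logical file names (AdventureWorks2014_Data and AdventureWorks2014_Log) are assumptions, which you can confirm first with RESTORE FILELISTONLY:
    RESTORE DATABASE AdventureWorks2014
    FROM DISK = '/var/opt/mssql/data/AdventureWorks2014.bak'
    WITH MOVE 'AdventureWorks2014_Data' TO '/var/opt/mssql/data/AdventureWorks2014_Data.mdf',
         MOVE 'AdventureWorks2014_Log' TO '/var/opt/mssql/data/AdventureWorks2014_Log.ldf';
    GO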
You can optionally restore the database with SQL Server Management
Studio, where you only have to browse to the location of the backup file and
the database file locations will be handled automatically. The database will be
restored as AdventureWorks2014, and you will be able to query it in exactly the
same way you would have done it in Windows.
At this point, you may need the basic statements to start, restart, and stop
SQL Server. More details about systemctl will be covered in Chapter 3.
To stop SQL Server, run the following statement:
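For example (assuming the standard service name):
    sudo systemctl stop mssql-server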
To install SQL Server Full-Text Search, run the following yum command:
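A sketch, assuming the standard package name:
    sudo yum install -y mssql-server-fts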
Again, you will need to restart SQL Server after SQL Server Full-Text
Search has been installed.
To install SQL Server Integration Services, run the following command:
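A sketch, assuming the standard package name:
    sudo yum install -y mssql-server-is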
Finally, as hinted on the install output, run the following command to
configure SQL Server Integration Services, and again select 2 to choose the
free Developer edition:
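For example (the /opt/ssis path is the default install location of the Integration Services package; treat it as an assumption):
    sudo /opt/ssis/bin/ssis-conf setup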
There is no need to restart SQL Server at this time. It is also recommended
that you update the PATH environment variable by running the following
command:
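For example (a sketch that prepends the Integration Services binaries directory for the current session):
    export PATH=/opt/ssis/bin:$PATH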
NOTE
For details about permanently updating the PATH environment variable, see
Chapter 3.
Installing on Ubuntu
You’ve seen the entire process of installing and configuring SQL Server 2017
using Red Hat Enterprise Linux. Let’s now install SQL Server on Ubuntu.
After this I’ll focus only on what is different compared with the previous Red
Hat installation. Let’s start by creating a virtual machine running Ubuntu.
Open the Microsoft Azure virtual machines portal. In the Search field, enter
Ubuntu. This will show you a few results. Select Ubuntu Server 16.04 LTS.
A SQL Server 2017 installation requires at least 3.25GB of memory on Linux,
but at the moment any virtual machine image in the gallery has at least that
much memory, so this should not be a problem. You
need to be careful with this, however, if you are creating the virtual machine
using other methods or if you’re installing the operating system software
yourself.
Follow the steps described in the “Creating a Virtual Machine” section
earlier in the chapter. Once you have a virtual machine running and you are
able to connect to it, run the following statements to get the public repository
GPG keys (more on this in a moment):
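For example (a sketch using curl; wget works equally well):
    curl https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add -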
Next, run the following statements to register the SQL Server Ubuntu
repository:
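A sketch; the cumulative update list URL is an assumption based on the GDR URL given next:
    sudo add-apt-repository "$(curl -s https://packages.microsoft.com/config/ubuntu/16.04/mssql-server-2017.list)"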
As with Red Hat Enterprise Linux, this was the cumulative update
repository. The GDR repository is at
https://packages.microsoft.com/config/ubuntu/16.04/mssql-server-2017-gdr.list,
and the preview repository is at
https://packages.microsoft.com/config/ubuntu/16.04/mssql-server.list.
Finally, run the following commands to install SQL Server:
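A minimal sketch:
    sudo apt-get update
    sudo apt-get install -y mssql-server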
After this point, you can configure SQL Server in the same way as before by
running the following, accepting the license terms, selecting the free Developer
edition, and providing a strong sa password:
You’ll see something similar to the following output:
Installing client tools is also different from Red Hat Enterprise Linux. Start
by importing the public repository GPG keys if you have not done so when you
installed SQL Server in the previous step:
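A sketch of the key import, followed by registering the tools (prod) repository and installing the packages; the prod.list URL is an assumption:
    curl https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add -
    curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list | sudo tee /etc/apt/sources.list.d/msprod.list
    sudo apt-get update
    sudo apt-get install -y mssql-tools unixodbc-dev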
Follow the same steps shown earlier to connect from outside the virtual
machine. Basically, you will need to configure a new inbound rule on the
network security group to allow SQL Server connections using the TCP port
1433.
As I did with Red Hat Enterprise Linux, I will now cover how to install
SQL Server Agent, SQL Server Full-Text Search, and SQL Server Integration
Services for Ubuntu.
To install SQL Server Agent, run the following command:
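A sketch, assuming the standard package name:
    sudo apt-get install -y mssql-server-agent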
After you install these components, you will need to restart SQL Server:
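For example:
    sudo systemctl restart mssql-server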
Finally, to install and configure SQL Server Integration Services, run the
following statements:
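A sketch, assuming the standard package name and the default /opt/ssis location:
    sudo apt-get install -y mssql-server-is
    sudo /opt/ssis/bin/ssis-conf setup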
There is no need to restart SQL Server this time. As mentioned earlier, it is
also recommended at this time that you update the PATH environment variable
by running the following command:
NOTE
For details about permanently updating the PATH environment variable see
Chapter 3.
As with the previous Linux distributions mentioned, this was the cumulative
update repository. The GDR repository is at
https://packages.microsoft.com/config/sles/12/mssql-server-2017-gdr.repo, and
the preview repository is at
https://packages.microsoft.com/config/sles/12/mssql-server.repo.
Next, run the following command to install SQL Server:
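A minimal sketch:
    sudo zypper install -y mssql-server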
Configure SQL Server the same way you did earlier using the mssql-conf
utility:
Finally, install the SQL Server client tools. Use the following steps to install
the mssql-tools package on SUSE Linux Enterprise Server. Add the SQL
Server repository to zypper:
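A sketch; the prod.repo URL and the refresh step are assumptions based on the repository naming used for the other distributions:
    sudo zypper addrepo -fc https://packages.microsoft.com/config/sles/12/prod.repo
    sudo zypper --gpg-auto-import-keys refresh
    sudo zypper install -y mssql-tools unixODBC-devel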
Follow the same steps indicated earlier to connect from outside the virtual
machine. You will need to configure a new inbound rule on the network
security group to allow SQL Server client connections using the TCP port
1433.
As I did with Red Hat Enterprise Linux and Ubuntu, I will next cover how
to install SQL Server Agent, SQL Server Full-Text Search, and SQL Server
Integration Services on SUSE Linux Enterprise Server. To install SQL Server
Agent, run the following command:
To install SQL Server Full-Text Search, run the following command:
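A sketch covering both components, assuming the standard package names:
    sudo zypper install -y mssql-server-agent
    sudo zypper install -y mssql-server-fts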
After you install these components, you will need to restart SQL Server:
Next, we need to download the SQL Server 2017 image by running the
following:
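For example (this is the image name used around the SQL Server 2017 release; newer images are published under mcr.microsoft.com, so treat the name and tag as assumptions):
    sudo docker pull microsoft/mssql-server-linux:2017-latest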
Next we are ready to run the SQL Server container image on Docker,
executing the next statement while specifying an appropriate sa login
password:
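A sketch that maps host port 1421 to container port 1433, as explained next; the password is a placeholder:
    sudo docker run -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=<YourStrongPassword>' \
        -p 1421:1433 -d microsoft/mssql-server-linux:2017-latest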
For this example, port 1421 has to be available on the host computer. We
cannot use the host port 1433 because we are already running an instance of
SQL Server there (installed in a previous exercise). The second port, 1433, will
be the port on the container.
Validate that the SQL Server 2017 container is running by executing the
following:
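For example:
    sudo docker ps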
The output will show something similar to this, where you can validate the
STATUS information to verify that the SQL Server container is effectively
running:
If you have SQL Server client tools on the same server, you can connect
directly to the new installation. Or you can install them as described earlier for
the Ubuntu Linux distribution. Note that only client tools are required, because
you will be connecting to a SQL Server image running in a Docker container.
You should be able to connect as indicated earlier. For example, you can use
the following command using the host port 1421:
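A sketch (sqlcmd uses a comma to separate the server name and port):
    sqlcmd -S localhost,1421 -U sa -P '<YourStrongPassword>'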
Optionally you can connect to the container and run the client tools from
there. For example, you can use the following command, which runs an
interactive Bash shell inside the container, where e69e056c702d is the
container ID shown earlier on the Docker ps command. The docker exec
command is used to run a command in a running container:
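For example:
    sudo docker exec -it e69e056c702d bash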
After running the command, you will receive a bash prompt, #, where you
can run the sqlcmd client as usual. Note that this time, we don’t need to specify
a port since we are inside the container, which was configured to listen to the
default SQL Server port 1433:
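A sketch, using the full path because the tools directory may not be in the container's PATH:
    /opt/mssql-tools/bin/sqlcmd -S localhost -U sa -P '<YourStrongPassword>'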
Follow the same steps indicated earlier to connect from outside the virtual
machine. Basically, we will need to configure a new inbound rule on the
network security group to allow SQL Server connections using the host TCP
port, in this example, 1421.
Finally, here are a few basic Docker commands, just to get you started. In any of
the following cases you can use the container ID or the provided name.
To stop one or more running containers:
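For example, by container ID or by the container's assigned name (both placeholders here):
    sudo docker stop e69e056c702d
    sudo docker stop <container-name>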
To restart a container:
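For example:
    sudo docker restart e69e056c702d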
When you no longer need the container, you can remove it using the rm
option:
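For example:
    sudo docker rm e69e056c702d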
Note that you should stop the container before trying to remove it, unless you
also specify the -f option.
To troubleshoot problems with a container, you can look at the SQL Server
error log using the following command:
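For example:
    sudo docker logs e69e056c702d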
In any of the listed Linux distributions, if you need to remove all the
databases after uninstalling SQL Server, run the following:
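A sketch; be careful, because this permanently deletes all database and log files:
    sudo rm -rf /var/opt/mssql/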
Summary
This chapter covered installing SQL Server on Linux and provided you with
enough information to get started using the product as quickly as possible. It
covered how to install and configure SQL Server in a preexisting Linux
installation, which could be either a virtual machine on Microsoft Azure or
Amazon Web Services or your own Linux installation. You learned how to
install SQL Server on Red Hat Enterprise Linux, SUSE Linux Enterprise
Server, and Ubuntu and how to configure an image of SQL Server on a Docker
container.
If you’ve followed the instructions in this chapter, you will have SQL Server
up and running and will be ready to cover more advanced topics, such as
configuring SQL Server on Linux for high availability and disaster recovery or
covering the new query processing and security features. Chapter 2 covers
architecture decisions, how SQL Server works on Linux, how SQL Server
interacts with the operating system, and other related topics.
Chapter 2
In This Chapter
The Sybase Years
SQLOS
The Industry Changes
Project Helsinki
A Virtualization Surprise
Drawbridge
SQLPAL
Summary
Most SQL Server users will be surprised to learn that SQL Server was
actually born on the Unix platform, first ported to OS/2 and later
to the Windows NT operating system. So in its early days, SQL
Server was in fact a multiplatform technology. SQL Server was
originally written by Sybase and released for OS/2 and Windows NT in an
agreement with Microsoft. After its business relationship with Sybase
ended, Microsoft secured the code base of the database engine and decided
to make SQL Server a Windows-only technology, which remained the case
for more than two decades.
A second important architecture feature for the future Linux
implementation was the development of SQL Operating System (SQLOS),
which was written for the SQL Server 2005 release. SQLOS was created to
exploit the newly available hardware and to provide the database engine
with more specialized services than a standard operating system could
afford. SQLOS is also the SQL Server application layer responsible for
managing all operating system resources, such as nonpreemptive
scheduling, memory and buffer management, I/O functions, resource
governance, exception handling, and extended events. SQLOS was never
intended to be a platform-independent solution. Porting SQL Server to other
operating systems was not in the plan during the SQLOS development.
In an astonishing announcement in March 2016, Microsoft surprised the
industry by declaring that SQL Server would be available on the Linux
platform sometime in 2017. After that moment, it seems that the industry
expected a real port that compiled the SQL Server C++ code base into a
native Linux application. Microsoft later indicated that this version would
be named SQL Server 2017 and would be available on Red Hat Enterprise
Linux, Ubuntu, and SUSE Linux Enterprise Server, in addition to Docker
containers. Docker itself runs on multiple platforms, which means that it
would be possible to run the SQL Server Docker image on Linux, Mac, and
Windows.
But when the first parts of the beta version of the software, called a
Community Technology Preview (CTP), were released in November 2016,
we were in for another surprise: instead of a port, SQL Server was able to
run on Linux thanks to some virtualization technologies based on
Drawbridge, the software result of a Microsoft project completed just a few
years earlier.
Microsoft soon released more information about how SQL Server on
Linux works; its architecture includes several components such as the
Drawbridge technology, a revamped SQLOS, and the SQL Platform
Abstraction Layer (SQLPAL), the layer that enabled Microsoft to bring
Windows applications to Linux. SQL Server 2017 was finally released in
October 2017.
This chapter covers some SQL Server history with different operating
systems and explains some of the details about how SQL Server on Linux
works. This includes describing the interaction between SQL Server and the
operating system, decisions regarding its architecture, its software
implementation, and other related topics.
NOTE
For more about the history of SQL Server, at least until SQL Server 2000,
read the first chapter of the Inside Microsoft SQL Server books by
Kalen Delaney and Ron Soukup, which can also be found online at the
bottom of https://www.sqlserverinternals.com/resources.
SQLOS
As mentioned, SQLOS was another very important development for the
future Linux implementation. SQLOS was a new operating system layer
whose purpose was to provide the database engine with performance and
scalability improvements by exploiting the new available hardware
capabilities and providing the database engine with more specialized
services than the general ones an operating system can offer.
SQLOS was first available on the SQL Server 2005 release. Although in
part SQLOS was created to remove or abstract away the operating system
dependencies, it was not originally intended to provide platform
independence or portability, or to help in porting the database engine to
other operating systems. Its first purpose was to exploit the new available
hardware capabilities, including Symmetric Multithreading (SMT) and
multi-CPU configuration with multiple cores per socket systems, computers
with very large amounts of memory, non-uniform memory access (NUMA)
systems, and support for hot memory and CPU add-ons and removals.
Database engines could benefit from these new hardware and hardware
trends.
SQLOS also became the SQL Server application layer responsible for
managing all operating system resources, and it was responsible for
managing nonpreemptive scheduling, memory and buffer management, I/O
functions, resource governance, exception handling, deadlock detection,
and extended events. SQLOS performs these functions by making calls to
the operating system on behalf of other database engine layers or, as in the
cases of scheduling, by providing services optimized for the specific needs
of SQL Server. The SQLOS architecture is shown in Figure 2-1.
Figure 2-1 SQLOS architecture
SQL Server would also detect and work with the then-new NUMA
systems. SQL Server 2000 Service Pack 4 included limited support for
NUMA systems. Full NUMA support was added when SQLOS was
released with SQL Server 2005. Software NUMA is automatically
configured starting with SQL Server 2014 Service Pack 2 and SQL Server
2016 (some support was also possible before but required manually editing
the Windows registry). Starting with these versions of the database engine,
whenever SQL Server detects more than eight physical cores per NUMA
node or socket at startup, software NUMA nodes will be created
automatically by default.
SQLOS was never intended to be a platform-independent solution, but
rather a way to provide purpose-built operating system services to the
database engine for performance and scalability; platform abstraction
would not arrive until the SQL Server 2017 release.
NOTE
For more details about SQLOS, read the paper “A New Platform Layer
in SQL Server 2005 to Exploit New Hardware Capabilities and Their
Trends” by Slava Oks, at
https://blogs.msdn.microsoft.com/slavao/2005/07/20/platform-layer-for-
sql-server/. In addition, in the paper “Operating System Support for
Database Management,” Michael Stonebraker examines whether several
operating system services are appropriate for support of database
management functions such as scheduling, process management,
interprocess communication, buffer pool management, consistency
control, and file system services. You can use your favorite search engine
to find this research paper online.
Project Helsinki
There are reports that Microsoft had been contemplating porting SQL
Server to Unix and Linux as early as the 2000s. One example is the article
“Porting Microsoft SQL Server to Linux” at
https://hal2020.com/2011/07/27/porting-microsoft-sql-server-to-linux/, by
Hal Berenson, who retired from Microsoft as a distinguished engineer and
general manager. Also, in an interview with Rohan Kumar, general manager
of Microsoft’s Database Systems group, he mentioned that there were a
couple of discussions in the past about porting SQL Server to Linux, but
such a project was not approved. For more details of the interview, see
https://techcrunch.com/2017/07/17/how-microsoft-brought-sql-server-to-
linux/.
More recently, starting around 2015, there was a new attempt to port—or
in this case to release—SQL Server on Linux. This was called Project
Helsinki. The following were the project objectives of releasing SQL Server
on Linux:
It would cover almost all the SQL Server features available on the
Windows version. Current exceptions were documented earlier in this
chapter.
It would offer at least the same security level as the Windows version.
It would offer at least the same performance as the Windows version.
It would ensure compatibility between Windows and Linux.
It would provide a Linux-native experience—for example, installation
using packages.
It would keep the continued fast pace of innovation in the SQL Server
code base, making sure that new features would appear on both
platforms simultaneously.
If you have followed the history of SQL Server so far, you may wonder,
since SQL Server was born on the Unix platform and later ported to OS/2
and Windows, why not port it back to Linux? The truth, however, is that after
two decades as a Windows-only technology, the code base had diverged
hugely from its Unix origins.
Nevertheless, porting during this project was still a consideration.
Porting the application from one operating system to another would require
using the original source code, making the required changes so it would
work on the new system, and compiling it to run as a native application.
Porting SQL Server to Linux, however, would require the review of more
than 40 million lines of C++ code to make changes so it would work on
Linux. According to the Microsoft team, this would be an enormous project
and would face the following challenges:
With more than 40 million lines of C++ code, porting would take years
to complete.
During the porting project, the code would still be changing. New
features, updates, and fixes are made all the time, so catching up
with the current code base would be a serious challenge.
In addition, SQL Server depends on three broad categories of Windows libraries:
NT kernel (ntdll.dll)
Win32 libraries
Windows application libraries
The SQL Server team listed the last category, Windows application
libraries, as the one with the most complex dependencies. Some of these
Windows application libraries were Microsoft XML Core Services
(MSXML), the Common Language Runtime (CLR), components written in
Component Object Model (COM), the use of the Microsoft Distributed
Transaction Coordinator (MSDTC), and the interaction of the SQL Server
Agent with many Windows subsystems. It was mentioned that porting even
something like SQLXML would take a significant amount of time to
complete.
So the team, according to several posts and interviews mostly by Slava
Oks, partner group engineering manager at the SQL Server team, was
considering alternative choices to porting in order to complete the project in
a faster way, or at least in a reasonable amount of time. This is where
Drawbridge came to the rescue.
A Virtualization Surprise
Although the original Microsoft announcements did not mention whether
SQL Server on Linux was going to be a port, the entire technology
community assumed it would be, and everybody expected that SQL Server
was going to be a native Linux application. It also seems that the first
sources reporting that SQL Server on Linux was not a port, but instead was
using some sort of virtualization technology came from outside Microsoft.
SQL Server CTP 1 was released on November 16, 2016, and just two
days later an article at The Register indicated that SQL Server on Linux was
not a native Linux application but was instead using the Drawbridge
application sandboxing technology. The article stated that Drawbridge
references could be found on the installation, for example, at the
/opt/mssql/lib/system.sfp library, which could be easily confirmed.
NOTE
You can still read the article, “Microsoft Linux? Microsoft Running Its
Windows’ SQL Server Software on Linux: Embrace, Extend, er, Enter,” at
www.theregister.co.uk/2016/11/18/microsoft_running_windows_apps_on
_linux.
I sensed at the time that the SQL Server community had mixed reactions
to this news. It may have been the disappointment that SQL Server was not
going to be a native Linux application, but it later also turned into curiosity,
and everybody wanted to know how it worked and how Microsoft was able
to run a very complex application such as SQL Server on a different
platform without a code port. There was also the initial concern of whether
this Linux implementation would offer the same performance as its
Windows counterpart.
Drawbridge
Drawbridge was a Microsoft Research project that created a prototype of a
new form of virtualization for application sandboxing based on a library OS
version of Windows. Drawbridge was created to reduce the virtualization
resource overhead drastically when hosting multiple virtual machines in the
same hardware, something similar to what Docker would do later.
Drawbridge was a 2011 project, while Docker was released as open source
in March 2013. Drawbridge, according to Microsoft, was one of many
research projects that provided valuable insights into container technology.
In simple terms, Drawbridge ran the Windows kernel in user mode inside a
process, creating a high-density container that could run Windows
applications; it essentially ran the entire Windows operating system in
user mode. The original intention was to use
Drawbridge to host small applications in Windows Azure. At the same
time, Microsoft started testing Drawbridge on other operating
systems so it could use this technology as a container to run a Windows
application on another platform. One of those platforms tested was Linux.
NOTE
You can find more details about Drawbridge on the Microsoft Research
page at www.microsoft.com/en-us/research/project/drawbridge/.
SQLPAL
Although for some SQL Server users, it may seem like SQLOS already
provided the abstraction functionality to move SQL Server to another
platform, that was not the case with Linux. Though SQLOS was more about
services and optimizing for new hardware than abstraction, Drawbridge
provided the abstraction that was needed. Marrying these two technologies
was the appropriate solution. In fact, the Drawbridge library OS component
provided the required functionality. (The second component, the
picoprocess, was not required for the project.) The SQLPAL architecture is
depicted in Figure 2-4, which also includes a host extension layer that was
added on the bottom of SQLOS to help SQLPAL interact with the operating
system. Host extension is the operating system–specific component that
maps calls from inside SQLPAL to the real operating system calls.
Finally, Figure 2-7 shows the process model when SQL Server is
running on Linux. SQL Server runs in a Linux process. The Linux host
extension is a native Linux application that first loads and initializes
SQLPAL. SQLPAL then starts SQL Server.
Figure 2-7 Process model
Summary
Even if you are an expert SQL Server user and have worked with the
technology for years, you may think that this database engine has always
been a Windows-exclusive technology. SQL Server in fact started as a
multiplatform technology, and its roots actually go back to an operating
system called OS/2 and even to Unix. To understand how SQL Server came
to the Linux platform and Docker containers, this chapter covered some
historical perspective that mirrors some of the same challenges faced today.
SQLOS was created as a platform layer designed to exploit new
hardware capabilities and provide database engine–specialized services, but
it was never designed to provide platform independence or portability to
other operating systems. SQLOS was used again, however, for the Linux
release.
When working on the SQL Server on Linux project, the team considered
a code port, but since this would be an enormous project that would take
years to complete, other solutions were considered, including using the
Microsoft Research project Drawbridge. Drawbridge and SQLOS were
used in the final release of the SQL Server on Linux implementation.
Chapter 3
In This Chapter
Getting Started
Files and Directories
Additional Commands
Permissions
Bash Shell
Services Management
Software Management
Disk Management
System Monitoring
Summary
The primary purpose of this chapter is to help you get started with Linux, but it
also covers some more advanced topics and system monitoring tools. A
traditional SQL Server book would not cover how to use an operating
system or its administration. This book is an exception, however, because
SQL Server users have been working with the Windows platform since the
beginning, and most are not familiar with Linux. The information in this chapter
may be extremely useful to help with the transition from the Windows platform
to Linux.
This chapter is not intended to provide much information to the Linux
administrator, but I hope it helps SQL Server professionals get started with
Linux. As he or she would when administering the Windows platform, the SQL
Server administrator should work with related professionals such as system or
storage administrators to help achieve optimum results, especially when
configuring and administering production environments.
First of all, Linux is an operating system based on Unix. Unix was originally
conceived and implemented in 1969 and later rewritten in the C language, with
its first version publicly released in 1973. There have been a large number of
Unix implementations throughout the years, including some successful
commercial versions such as System V, Sun Solaris, IBM AIX, and Hewlett
Packard HP-UX.
In 1991, Linus Torvalds, looking to create a free operating system as a hobby,
wrote and published Linux. Currently Linux reportedly includes more than 18
million lines of source code. As with Unix, Linux has many distributions. This
huge number of implementations, with incompatibilities and lack of
standardization, has been considered a weakness of this operating system. Linux
is open source, while Unix is mostly proprietary software (only a few versions
of Unix have been free or open sourced, such as BSD).
NOTE
Writing this chapter brought back some memories. I used to teach Unix
system administration in a university back in the 1990s, and this is my first
time writing a tutorial since then. I first heard about Linux in a personal
computing magazine back in the early days and immediately went to
download it onto a few 3½-inch floppy disks and installed it. Not many people
knew about Linux or even had access to the Internet in those days. I also
worked with all the most popular Unix implementations including AIX, Sun
Solaris, and HP-UX.
Getting Started
If you made it this far, you are probably at least familiar with the most basic
Linux commands such as ls, cd, more, and ps. You also may know that
Linux commands are case-sensitive and usually in lowercase. For example,
typing ls works as expected, but typing LS or Ls does not. Let’s review some
basic commands in this section and then move to more advanced ones later in
the chapter.
When you open a new session, you are always running a shell (more on that
later—for now think about opening an MS-DOS command prompt session in
Windows). Unix traditionally provides different shells such as Bourne, C, or
Korn Shell. In Red Hat Enterprise Linux, the default is Bash (for Bourne-again
shell), one of the most popular shells in Linux distributions. You are also
assigned a default startup directory, which is your home directory.
NOTE
This book uses the Courier font to show the commands to execute, and in
some cases, the output follows. The command you have to type will usually be
the first line and starts with the shell prompt, such as $ or #.
Use the man (manual) command, mentioned in Chapter 1, to see the Linux
documentation or online reference manuals. For example, to look for
documentation of the ps command, you can use
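$ man ps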
To exit the man page, press q for quit. To learn more about the man command,
you would use man itself—for example, type the following:
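$ man man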
Use man –k to search for the man page descriptions for keywords and display
any matches. For example, to find information about shells, you could use this:
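$ man -k shell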
Also, if you are accustomed to the style of the SQL Server documentation
(where at the end of a topic, similar commands are mentioned), if you don’t find
what you are looking for or if you want more information, look for a “SEE
ALSO” section with additional commands or man pages:
Another important concept is the path, defined by the $PATH variable. Same
as with Windows, $PATH enables you to define the directories containing
executable code so you can run a program without specifying its file location. In
my default installation, I have the following:
I can include a new directory, if needed. For example, trying to execute the
SQL Server sqlcmd utility in a default installation will show the following:
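$ sqlcmd
bash: sqlcmd: command not found
Assuming the SQL Server command-line tools were installed in the default
/opt/mssql-tools/bin directory, you can append that directory to the variable:
$ export PATH="$PATH:/opt/mssql-tools/bin"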
Then you can verify the updated value of the variable by running this command:
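$ echo $PATH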
You will now be able to execute sqlcmd without having to worry about its
location in the file system. Now, this path change is just for the current session;
if you disconnect and connect again, the changes will be lost. To make this
change permanent, you need to update your profile file, which for the Bash shell
is named .bash_profile. Notice the following lines inside the file:
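On a default Red Hat Enterprise Linux installation, the relevant lines look
similar to this:
PATH=$PATH:$HOME/.local/bin:$HOME/bin
export PATH
You can append a directory such as /opt/mssql-tools/bin to that PATH line.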
NOTE
Some Unix commands may have awkward names, such as pwd, which may be
an abbreviation of a word or words describing what they do. I will be using
parentheses to show what the command stands for, if needed. In some other
cases, such as with find or sort, its description will be obvious.
You can navigate to any directory you have access to by using the cd (change
directory) command. For example, to move to the SQL Server databases file
directory, you could use the following:
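$ cd /var/opt/mssql/data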
If you tried the MS-DOS chdir command, you may have noticed that it does
not work here. Using cd without a directory moves you back to your home
directory, which is defined by the $HOME variable:
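$ cd
$ echo $HOME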
You can also use the tilde (~) symbol as a shortcut for your home directory,
which may be useful when you want to run a command or script while you are
in some other directory without having to specify the full file path. For example,
the following commands create a backup directory in your home directory and
then try to list the contents of the same directory even if you are working in the
/tmp directory:
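$ mkdir ~/backup
$ cd /tmp
$ ls ~/backup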
As with Windows, you can also use relative paths. For example, use the
following command to start in your current directory and change to apps/dir1:
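$ cd apps/dir1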
You can use a dot (.) to refer to your current directory and two dots (..) to
refer to the parent directory of the current directory. For example, use the
following command to move two levels up the directory hierarchy structure:
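$ cd ../..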
Assuming you have the required permissions, you can also create and remove
directories using the mkdir (make directories) and rmdir (remove empty
directories) commands. It is interesting to note that, unlike the cd command,
the Windows md and rd commands will not work here.
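$ mkdir mydir
$ cd mydir; cd ..
$ rmdir mydir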
No surprises here if you are familiar with these commands on the old MS-DOS
or Windows command prompt. The first command creates the directory mydir;
the second command changes the current directory to the directory just created
and then changes to the previous directory by going one level back. Finally, the
directory is removed.
The rmdir command can remove only empty directories, however. Trying the
following will return an error message:
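For instance, assuming the backup directory created earlier is not empty:
$ rmdir backup
rmdir: failed to remove 'backup': Directory not empty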
In the next section, I will show you how to remove nonempty directories
using the rm command. You could also manually remove each file and directory
inside this directory, which, although time consuming, in some cases may be
beneficial—for example, to avoid deleting something you may need.
To list the contents of a directory, use the ls (list) command. As with many
Unix commands, ls has a large number of options, so make sure you use the
man documentation if you need more details. Most typical use is as ls or ls –l.
For example, the following ls -l command on the /var/opt/mssql/data directory
will list the current SQL Server database files:
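$ ls -l /var/opt/mssql/data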
You can see a lot of interesting information about the listed file, including file
type and permissions, shown as -rw-r----- (more details on permissions later
in this chapter), the number of links to the file, the user owner, the group owner,
the size in bytes, the last date the file was modified, and the filename.
Displaying the size in bytes for very large files can be difficult to read, so you
can also try both the –l and -h (human readable) options. They will show the
previous size bytes value 15400960 as 15M. Size units are reported as K, M, G,
T, P, E, Z, and Y for kilobytes, megabytes, gigabytes, terabytes, petabytes,
exabytes, zettabytes, and yottabytes, respectively (though I don’t think anyone is
using some of those extremely large file sizes just yet).
NOTE
Although the amount of data generated worldwide is increasing rapidly, units
such as exabytes, zettabytes, and yottabytes are still huge. In comparison, the
maximum database size allowed in SQL Server is 524,272 terabytes, which is
very large even by current standards.
Some other useful ls options are -a (or --all) and -A (or --almost-all),
which are used to include hidden files on the list. The only difference is that the
latter does not include the named . and .. entries, which represent the current
and parent directories, respectively. A Unix file starting with a dot (.) is
considered a hidden file.
Files
Almost everything in Unix is a file, and, unlike Windows, Linux does not use a
file extension in the filename. (The file command can be used to determine
what kind of file it is, although this is rarely needed.) You can perform
basic file manipulation by using the cp (copy), mv (move), and rm (remove)
commands.
Use cp to copy files and directories. The following makes a copy of file1 and
names it file2:
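$ cp file1 file2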
Use the mv command to move files and directories. Or use it to rename files
and directories. The next commands create a new dir2 directory and move file1
from directory dir1 to dir2:
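$ mkdir dir2
$ mv dir1/file1 dir2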
The following example is similar to the previous cp example, but this time it
moves the entire directory dir1 and its contents to the backup directory. Note that
no recursive option is needed:
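$ mv dir1 backup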
Use the rm command to remove files and directories. The following example
removes the file named file1:
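$ rm file1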
I showed you the rmdir command to remove directories. But if the directory
is not empty and you use this command, you will get the “directory not empty”
error message. Use rm with -r (or -R or --recursive) to remove directories and
their contents recursively:
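$ rm -r dir1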
The following example deletes all the files in the current directory, while
being prompted on each one:
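$ rm -i *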
Use the touch command to change file timestamps or update the file access
and modification times to the current time. It is also commonly used to create
empty files:
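$ touch file1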
The find command is very useful; its purpose is to search for files in a
directory hierarchy. find also has a large number of parameters and choices, so
make sure to check the documentation if you have some specific needs. For
example, you may be wondering where the sqlcmd utility is located. Here’s how
to find it:
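$ find / -name sqlcmd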
You’ll need the required permissions for the folders you specify, or you may
get the following message for one or more directories:
Optionally, you can use the -iname option for a case-insensitive search. To
find the files modified in the last 50 days, use the -mtime option, as in the
following example:
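$ find . -mtime -50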
Similar to -mtime, use the -atime command to report the files that have been
accessed on a specified number of days:
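$ find . -atime -50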
Use the -type option to find a specific type of file, which can be b (block), c
(character), d (directory), p (named pipe), f (regular file), l (symbolic link), and
s (socket). Use the -empty option to find empty files or directories:
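$ find . -type d
$ find . -empty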
The find command is extremely flexible and can also be used to perform an
operation or execute a command on the files found. For example, the following
command changes the permissions on all the files encountered (chmod is
explained later in the chapter in the “Permissions” section):
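Here, 644 is used just as an example permission mode:
$ find . -type f -exec chmod 644 {} \; -print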
The next example removes the files found:
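$ find . -name file1 -exec rm {} \; -print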
The next command searches for a string within the files found:
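$ find . -type f -exec grep variables {} \; -print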
In these three cases, find is using the –exec option, which enables you to
execute a command for each filename found. The string {} is replaced by the
current filename being processed. Use the optional -print to print the filename
found; it is useful to have a quick indication of the work performed. Executed
commands must end with \; (a backslash and semicolon).
Finally, find is also commonly used to find information and send it to
another command for additional processing. More on that later in this chapter,
where I cover piping and redirection.
You can use a few Linux commands to show information from a file, such as
cat (concatenate), more, less, tail, or head. The cat command is the simplest
to use to display small files:
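$ cat file1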
Both the more and less commands enable you to read a file or some other
input with additional choices. For example, you can navigate one page at a time
allowing forward and backward movement or search for text:
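$ more file1
$ less file1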
Once you open a file, you can press the spacebar or f to move forward, press
b to go backward one page, or press q to quit.
Use the tail command to output the last part of a file, which can be used to
inspect the latest data on a changing file:
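$ tail /var/opt/mssql/log/errorlog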
Use the diff command to compare two files. If nothing is returned (and that
seems confusing), the files are the same, and you can use the –s option to
get an explicit confirmation:
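$ diff file1 file2
$ diff -s file1 file2
Files file1 and file2 are identical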
Use the sort command to sort lines of text files. For example, assuming file1
has a list of items, running the following command will return the list sorted:
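$ sort file1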
This will sort to the standard output, but if you want to keep the sorted list,
you may save it to a file, like in the following example:
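$ sort file1 > file2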
Finally, you can use the file command to determine the file type. Here’s an
example:
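$ file file1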
NOTE
As you may remember from Chapter 1, SQL Server software was installed in
the /opt directory, while the databases were placed in /var/opt.
SQL Server on Linux works on XFS or EXT4 file systems; other file
systems, such as BTRFS, are not currently supported. To print your file system
type, use the df –T or cat /etc/fstab command.
Use the mount command to mount a file system, for example, when you have
a new disk partition. Use umount to unmount a file system. Running mount
without any option will list all the current mount points.
You can also use the lsblk command to list the block devices or disks
available in the system; this output is easier to read than the output using mount.
Here is an example on my system:
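$ lsblk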
The most important columns are the Name, Size, and Type of the device and
their mount point.
Use the lscpu command to display information about the Linux system CPU
architecture. Here is some partial output on my system from a Microsoft Azure
virtual machine:
Additional Commands
The ps (process status) command is one of the most useful Unix commands and
has a large variety of options. Use it to see the current processes running in the
system. Let’s start with the most basic ps command, which will return only the
processes running in the current session:
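$ ps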
The columns returned are PID (process ID), the terminal associated with the
process, the cumulative CPU time, and the executable name. This example shows
only the current shell and the ps process itself. You will usually need to specify
some parameters to see system-wide information. Here’s a common one:
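$ ps -ef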
Use option -e to select all the processes and -f to add columns. These
additional columns are UID, PPID, C, and STIME, which are the user ID, parent
process ID, CPU usage, and time when the process started, respectively. In
addition, -f includes the command arguments, if any. Use options -A and –a to
select all the processes in the system. Keep in mind that ps provides a snapshot
of the system at the time the command is executed. For real-time information,
you could also use the top and htop commands, which I cover in the “System
Monitoring” section later in this chapter.
Several other options are useful as well. Let’s try a few.
Use -u to display processes by a specific user:
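$ ps -u mssql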
It is also popular to use the BSD syntax to show every process in the system,
as shown next. Notice that no dash (–) is required in this case:
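$ ps aux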
The ps command offers incredible flexibility. For example, you can specify
which columns to list using the -o option:
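$ ps -e -o pid,uid,pcpu,pmem,comm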
This example will list process ID, user ID, CPU usage, memory usage, and
command name. For the entire list of possible columns, see the man
documentation.
You can sort by one or more columns, in ascending or descending order,
where + means increasing, which is the default, and - means decreasing. As with
the -o option, you can see the entire list of available columns in the
documentation using this command:
You can use the watch command to run the same ps command periodically
and display its current output. In this case, watch uses the -n option interval in
seconds:
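Here, a five-second interval is used just as an example:
$ watch -n 5 'ps -ef | grep sqlservr'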
This output shows that the sqlservr process PID 857 spawned a second copy of
the process with PID 939. You can also correlate using the PPID or parent
process ID: PID 939 has PPID 857.
Use the grep (global regular expression print) command to return lines
matching a pattern or basically to search text and regular expressions in the
provided input. Let’s create a text file for our next examples. The next command
saves the documentation of the grep command itself into a file named file1:
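$ man grep > file1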
Let’s try our first search looking for the word “variables”:
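$ grep variables file1
To make the search case-insensitive, you can add the -i option:
$ grep -i variables file1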
This example returns the previous output, plus one more line with the word in all
uppercase.
Use the -v option (or --invert-match) to return all the lines that do not
contain the indicated text:
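$ grep -v variables file1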
Use the –E option (or the egrep command) to search for multiple patterns at a
time:
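$ grep -E 'egrep|fgrep' file1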
You can also use regular expressions to search for text. For example, the
following command will search for either “egrep” or “fgrep” on the just created
file1:
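$ grep '[ef]grep' file1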
This expression means search for either “e” or “f” followed by “grep.”
You can use numbers, too. The following command will search for four digits
inside file1:
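$ grep -E '[0-9]{4}' file1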
If you try a search on binary files, such as in the next example, by default,
you’ll get the following response only:
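$ grep SQL /opt/mssql/bin/sqlservr
Binary file /opt/mssql/bin/sqlservr matches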
One choice is to use the -a option to process a binary file as if it were text.
The following will perform the desired search:
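$ grep -a SQL /opt/mssql/bin/sqlservr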
Finally, grep also includes the behavior of both egrep (extended grep) and
fgrep (fixed-string grep), which are now deprecated but still available for
backward-compatibility. (For more details about regular expressions, see the
grep documentation.)
Use the kill command to terminate a process. You need to specify the
process ID as in the following example:
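$ kill 939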
You can also specify kill -9 (SIGKILL signal) as a last resort when
terminating a process that cannot be terminated otherwise:
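$ kill -9 939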
Use the who and whoami commands to display information about the current
logged-in user:
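$ who
$ whoami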
Use the date command to print the system date. Or use it to set the system
date:
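$ date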
You can use the wc (word count) utility to count lines, words, and bytes for
each specified file. For example, on the previously created file1, I get the
following output:
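$ wc file1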
Use the history command to list the commands you have previously executed. A
specific command can also be executed using the ! symbol and the history
line number, as in the following case:
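$ history
$ !25
(Here, 25 is just an example line number taken from the history output.)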
Sometimes you may need to find the version of Red Hat Enterprise Linux.
You can see it by opening the /etc/redhat-release file:
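$ cat /etc/redhat-release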
Finally, you can use the clear command to clear the terminal screen and the
passwd command to update a user password.
Building a Command
It is very common in Unix environments to use commands built from one or
more commands using piping or redirection. This is typically the result of one
command producing some output to be processed by another command. Output
is usually redirected from one command to another using the pipe symbol (|).
Output can also be redirected to other devices or files using the > and >>
symbols.
Let’s start by looking at some examples. The following command saves the
output of ps into a file named ps_output:
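$ ps -ef > ps_output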
Using > always creates a new file, overwriting an existing file if needed. You
can use >> to add the output to an existing file. Running the following command
will add the new output to the existing file:
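$ ps -ef >> ps_output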
Using 2> redirects the standard error output to a file. For example, you may
have standard error output mixed with normal output, which may be difficult to
read. A common case is using find and getting the permission denied error on
multiple directories:
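$ find / -name sqlcmd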
You could try something like the following command, which returns a clean
output with only the information you need:
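The name of the error file here is arbitrary:
$ find / -name sqlcmd 2> find_errors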
You may choose to inspect the created file later. Or, better yet, you may want
to discard the errors if you do not need to see them by sending the standard error
output to the null device, a concept that also exists in Windows:
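$ find / -name sqlcmd 2> /dev/null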
You can send the content of one file as the input to a command and send its
output to yet another file:
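$ sort < file1 > file2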
You could next send the output as the input to another command—for
example, grep, followed by a string to search for. Here is the desired output:
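$ sort file1 | grep variables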
Permissions
Use root or superuser permissions only when performing system administration
activities and as minimally as possible. As expected, making mistakes as the
root user can create catastrophic failures, especially on a production system. The
sudo command enables a permitted user to execute a command as either the
superuser or another user, according to the defined security policy. Multiple
examples using sudo to execute a command as superuser were provided in
Chapter 1, where I covered the installation and configuration of SQL Server.
Here’s one example used in that chapter:
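$ sudo systemctl restart mssql-server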
You could also switch to superuser using the sudo su command, as shown
next:
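$ sudo su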
Use the chmod (change mode) command to change the mode bits of a file,
where u is user, g is group, and o is other. Use the symbol + to grant the
permission and - to revoke it.
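$ chmod o+w,g-w file1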
In this example, o+w grants the write permission to other and g-w revokes the
write permission to the group.
You can also change multiple permissions at a time:
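$ chmod u-x,g+w,o-w file1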
This example removes execute permission from the user, assigns write
permissions to the group, and removes write permissions from other.
An alternative method to work with permissions is to use the octal permission
representation, as shown in the following example:
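Here, 640 is just an example, granting read and write to the user, read to
the group, and no permissions to other:
$ chmod 640 file1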
You can define any combination of permissions with just three digits, but
using this method may be more complicated for the beginner. Basically, this
octal representation uses 4 for read, 2 for write, and 1 for execute, and you can
add these values to assign more than one permission at a time. For example, to
give read permission you’d use the value 4, to give write you’d use 2, to give
both read and write you’d add both values to get 6, to give read and execute
you’d add both values to get 5, and to give all three permissions you’d add all
three values to get 7. It is also possible to remove all the permissions. You can
see a chart with all the possible permissions in Figure 3-2.
The same permissions apply to directories in which read means that the user
can list the contents of the directory, write means that the user can create or
delete files or other directories, and execute means that the user can navigate
through the directory.
In addition to changing the permissions of a file or directory, you can use the
following commands to change its owner and group. Use chown (change owner)
to change the file or directory owner and chgrp (change group) to change its
group. For example, the following commands are executed as root:
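# chown mssql file1
# chgrp mssql file1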
Bash Shell
As mentioned in Chapter 1, a shell is a user interface used to access operating
system services, and in the case of Unix, most of you will use a shell as a
command-line interface. It is called a shell because it is a layer around the
operating system kernel. Most popular Unix shells used throughout the years
include the Bourne, C, and Korn shells. Most Linux distributions use the
Bourne-Again shell, or Bash, as is the case with Red Hat Enterprise Linux,
Ubuntu, and SUSE Linux Enterprise Server.
Unix and Linux sometimes have a GUI similar to that of Windows. However,
there has never been a standard, and those interfaces could differ greatly from
one Unix or Linux distribution to another. Although learning such GUIs could
be useful, using command-line commands and scripts is essential for working
with Linux.
Let’s discuss some basics about working with a Linux shell. The SHELL
variable will return the full pathname to the current shell:
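$ echo $SHELL
/bin/bash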
Similar to Windows, you can use wildcards in Linux. It is possible to use the
asterisk symbol (*) to represent zero or more characters and the question mark
symbol (?) to represent a single character. You can also use brackets ([ ]) to
represent a range of characters—for example [0–9] or [a–e] to include any
number between “0” and “9” or any letter between “a” and “e,” respectively. As
an example, assuming that you create the following four files, the ls command
will list anything starting with “b” or “c” plus “123”:
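Assuming the file names a123, b123, c123, and d123 for this example:
$ touch a123; touch b123; touch c123; touch d123
$ ls [bc]*
b123  c123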
NOTE
As you noticed from the first line, you can submit multiple commands in the
same line if they are separated by a semicolon.
Although most Unix commands are provided as executable files on the file
system, some could be provided by the shell as well. In some cases, even the
same command could be provided by both. For example, the pwd command
could be provided by the shell or by the operating system as an executable file.
Usually the shell version supersedes the latter. Consider the following:
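$ pwd --version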
This shows the shell version, which complains that only the choices –L and –P
are allowed.
Now try the following to execute the pwd version located on /usr/bin. Notice
that, this time, the –version option is allowed:
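$ /usr/bin/pwd --version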
It is also very common to change the Linux prompt instead of just simply $ to
include some information, like the hostname you are connected to. This is
usually defined on the .bash_profile configuration file or in some other script or
variable defined or called inside it. For example, my current prompt is defined as
follows:
You can update your prompt by assigning a new value to the PS1 variable, if
needed.
NOTE
.bash_profile is the profile filename used by the Bash shell. Other shells will
use different filenames.
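As a quick exercise, you can write a minimal script. A sketch of such a
script, saved in a file named hello, could contain just the following:
#!/bin/bash
echo "hello, world"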
Try to execute the script by typing its name. You may get a couple of errors
before you can make it work:
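$ hello
bash: hello: command not found
$ ./hello
bash: ./hello: Permission denied
$ chmod u+x hello
$ ./hello
hello, world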
First, the current directory is not included in the current $PATH environment
variable, so the second time I used ./hello to specify the script location
explicitly. The second issue was that by default the file did not have execute
permissions, so I granted those, too. Finally, I was able to execute the script.
NOTE
The tradition to use a “hello, world” program as the very first program
people write when they are new to a language was influenced by an example
in the seminal book The C Programming Language by Brian Kernighan and
Dennis Ritchie (Prentice Hall, 1978).
Services Management
As introduced in Chapter 1, systemd is an init, or initialization system, used in
Linux distributions to manage system processes that will run as daemons until
the system is shut down. systemd has several utilities, such as systemctl, which
you can use to start, restart, or stop a service or to show runtime status
information.
Chapter 1 also covered the basics of systemctl, including how to start, stop,
and restart a service using the following syntax:
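$ sudo systemctl start mssql-server
$ sudo systemctl stop mssql-server
$ sudo systemctl restart mssql-server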
This section covers some additional systemctl choices. A very useful option
is status, which you can use as shown in the following example:
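$ sudo systemctl status mssql-server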
You will get an output similar to the following, plus the latest ten lines of the
SQL Server error log (not shown here). Technically, it returns the last ten lines
from the journal, counting from the most recent ones.
Notice that the status option provides a lot of valuable information, such as
whether the service is loaded or not, the absolute path to the service file (in this
case, /usr/lib/systemd/system/mssql-server.service), the process ID of the service
process, whether the service is running, and how long it has been running.
Use the is-active option to return whether the service is active or not:
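$ systemctl is-active mssql-server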
You can also use the service file extension as in the following case:
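$ systemctl is-active mssql-server.service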
To display the status of all the services in the system, use the list-units
option:
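$ systemctl list-units --type=service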
My system returns 48 service units. Here’s a partial list:
The output shows the name of the unit, description, and whether it is loaded and
active.
Use the reload option to reload a service configuration without interrupting
the service. Note that not all services support this option; SQL Server does not.
You can also do the opposite and disable a service so it does not start
automatically at boot time:
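$ sudo systemctl disable mssql-server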
You can also use systemctl to manage a service on a remote host by adding the
-H option with a user and host name; this would require that the sshd service
be running on the remote computer.
Finally, in addition to many other features, you can use systemctl to shut
down the system, shut down and reboot the system, or shut down and power off
the system. The required commands, respectively, are shown next:
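$ sudo systemctl halt
$ sudo systemctl reboot
$ sudo systemctl poweroff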
This is similar to using the shutdown, halt, poweroff, and reboot commands,
which, depending on the options provided, can also be used to shut down, shut
down and reboot, or shut down and power off the system.
Software Management
Chapter 1 gave you a quick introduction to package managers to provide basic
information on how to install SQL Server software. A package managing system
is a collection of utilities used to install, upgrade, configure, and remove
packages or distributions of software or to query information about available
packages. There are several package management systems, and the chapter
briefly covered some package management utilities such as yum, apt, and
zypper, which we used with Red Hat Enterprise Linux, Ubuntu, and SUSE
Linux Enterprise Server, respectively. This section covers yum in more detail.
You can easily find similar choices in the documentation for the apt or zypper
utilities.
The RPM Package Manager (RPM) is a packaging system used by several
Linux distributions, including Red Hat Enterprise Linux. The yum package
manager is written in Python and is designed to work with packages using the
RPM format. Yum requires superuser privileges to install, update, or remove
packages.
In Chapter 1, we used yum to install the SQL Server package mssql-server by
running the following command:
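$ sudo yum install -y mssql-server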
Let’s continue with additional topics, starting with updating packages. During
the SQL Server beta deployment process, it was common to upgrade regularly to
new CTP or RC packages. You could check which installed packages had
updates available by running the command
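$ sudo yum check-update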
NOTE
CTPs (Community Technology Previews) and RCs (Release Candidates) are
versions of SQL Server used during the beta program before the final software
was released. The final release, which is the only version that can be used in a
production environment, is the RTM (release to manufacturing) version.
Depending on your system, you may see a long list of entries. For example,
this is what I get from a fresh Microsoft Azure virtual machine, which seems to
have an old version of SQL Server:
The second row, for example, shows the package mssql-server, architecture
x86_64, and the available version in this case is 14.0.900.75-1. It also shows the
repository in which the package is located: packages-microsoft-com-mssql-
server. The current version provided by my new virtual machine is SQL Server
vNext CTP 2.0 14.0.500.272, so, obviously, the recommendation by yum is a
newer version.
Let’s install the latest recommended version. Stop your SQL Server instance
and run the following command to update the mssql-server package:
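$ sudo yum update mssql-server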
If you already have the latest version of the software, you will see something
similar to this:
You could also update packages that have security updates available—
though, at the time of writing, no security updates have been released:
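$ sudo yum update --security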
Use the yum info command to display information about a specific package.
Here’s an example:
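$ yum info mssql-server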
The package version matches the version build returned by SQL Server—for
example, using the @@VERSION function.
Finally, to remove a package, you can run the following:
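$ sudo yum remove mssql-server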
Use the list all option to list information on all installed and available
packages. Commands that return a large amount of data usually allow an
expression similar to the following example to limit the data returned:
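$ yum list all "mssql*"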
Optionally, you can also list all the packages in all yum repositories available
to install:
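$ yum list available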
Finally, use the search command when you have some information about the
package but do not know the package name:
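$ yum search mssql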
NOTE
For more details and options, consult the yum documentation, which you can
access via the man yum command.
Disk Management
Let’s review a few commands that can provide information about the file system
disk space usage. The df (disk free) command is a very useful utility to display
file system disk space usage. Here is the output I get on my default Microsoft
Azure Red Hat Enterprise Linux virtual machine:
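$ df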
The listed information is self-explanatory and includes the file system name, the
total size in 1KB blocks, the used space and available disk space, the used
percentage, and the file system mount location.
Use -h to print sizes in human-readable format—for example, using K, M, or
G for kilobytes, megabytes, and gigabytes, respectively.
Add totals using the --total option:
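$ df -h --total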
Use the du (disk usage) command to display disk usage per file. Here is an
example using the -h human readable output on /var/opt/mssql, summarized for
space:
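$ du -h /var/opt/mssql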
System Monitoring
System monitoring is also one of those topics that would require an entire
chapter of its own. In this section, I’ll show you some basic commands to get
you started. Earlier in the chapter, you learned about the ps command, which
displays the current process information in the system. You also learned about
top and htop.
The top command shows information about system processes in real time. A
sample output is shown in Figure 3-3. To exit the top command, press q.
Here’s a quick summary of the information provided by the top command:
The first line shows information about the system uptime, or how long the
system has been running; the current number of users; and the system load
average over the last 1, 5, and 15 minutes. This line contains the same
information displayed with the uptime command.
The second line shows, by default, information about system tasks,
including running, sleeping, and stopped tasks or tasks in a zombie state.
The third line shows information about CPU usage.
Figure 3-3 The top command
By default, top lists the following columns, all of which are also available
with the ps command:
PID Process ID
USER Username
PR Scheduling priority of the task
NI The nice value of the task; this Unix concept is related to the priority of
the task and defines which process can have more or less CPU time than
other processes
VIRT Virtual memory size
RES Resident memory size
SHR Shared memory size
S Process status, which could be R (running), S (sleeping), T (stopped by
job control signal), t (stopped by the debugger during trace), Z (zombie), or
D (uninterruptible sleep)
%CPU CPU usage
%MEM Memory usage
TIME CPU time
COMMAND Command or program name
NOTE
For more details about the top command, see the man documentation.
Optionally you can use htop, a variant of top that is becoming increasingly
popular. The htop command is not included with Red Hat Enterprise Linux and
must be installed separately.
NOTE
For more details about htop see http://hisham.hm/htop.
Use the free command to display the amount of free and used memory in the
system. Here’s a sample of output I got from my test system:
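$ free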
The values returned include total installed memory, used memory, free memory,
shared memory, memory used by kernel buffers, memory used by the page
cache, and memory available for starting new applications without swapping.
Finally, you should be aware of the Unix cron facility, which is used to
execute scheduled commands, and the crontab command, which is used to
maintain crontab files for individual users. You can think of cron as something
similar to Windows Task Scheduler. Although the availability of the SQL Server
Agent on the Linux platform makes it less likely that you’ll need to use the
Linux cron, you still may require it for some maintenance scenarios.
Summary
The chapter covered all the basic Linux commands required to get started,
including managing files and directories and their permissions, along with a few
more advanced topics including system monitoring. The chapter is not intended
for the system administrator, but is intended to help the SQL Server
administrator work with Linux. As he or she would with the Windows platform,
the SQL Server administrator should work with related professionals such as
system or storage administrators to help achieve optimum results, especially for
production implementations.
Software management knowledge will be an essential skill for the SQL
Server administrator because it will be required to install and upgrade a SQL
Server instance. Service management will also be important to understand
because it will provide the same functionality provided in the Windows world,
including using the SQL Server Configuration Manager utility. The chapter
closes with a variety of Linux commands related to file and disk management
and system monitoring.
Chapter 4
In This Chapter
The mssql-conf Utility
Linux Settings
SQL Server Configuration
Summary
This chapter on configuring SQL Server in a Linux environment is divided into
three main topics: using the mssql-conf utility to configure SQL Server,
which is required in Linux environments; using Linux-specific kernel
which is required in Linux environments; using Linux-specific kernel
settings and operating system configurations; and using some traditional
SQL Server configurations for both Windows and Linux installations.
The first section covers the mssql-conf utility, which includes some of the
functionality available with the SQL Server Configuration Manager tool for
Windows, including configuring network ports or configuring SQL Server to
use specific trace flags. In addition, mssql-conf can be used to configure
initialization and setup of SQL Server, set up the system administrator
password, set the collation of the system databases, set the edition of SQL
Server, and update a configuration setting. The second part of the chapter
covers some Linux kernel settings, which can be used to improve the
performance of SQL Server, as well as some operating system configurations
such as transparent huge pages. The third section of the chapter covers some
of the most popular SQL Server configuration settings, and most of the section
applies both to Windows and Linux implementations.
NOTE
For additional information on configuring SQL Server for high
performance, see my book High Performance SQL Server (Apress, 2016).
To run mssql-conf, you must have the proper permissions, either superuser
or as a user with membership in the mssql group. If you try running mssql-
conf without the right permissions, you’ll get an error:
As suggested in the preceding output, you can use the mssql-conf utility to
perform several configuration tasks, such as initialize and set up SQL
Server, set the sa password (as shown earlier), set the collation of the system
databases, enable or disable one or more trace flags, set the edition of SQL
Server, and assign the value of a setting. This means that the mssql-conf utility
can be used to change a configuration setting in two different ways: using an
mssql-conf option such as traceflag or set-sa-password, or changing a
configuration setting such as memory.memorylimitmb. To list the supported
configuration settings, use the following command:
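$ sudo /opt/mssql/bin/mssql-conf list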
To set the SQL Server memory limit, for example, you can use the
memory.memorylimitmb setting, as shown next:
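Here, 3328 (about 3.25GB) is just an example value, specified in megabytes:
$ sudo /opt/mssql/bin/mssql-conf set memory.memorylimitmb 3328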
As with any of the mssql-conf changes, you will need to restart SQL Server
to apply the changes:
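$ sudo systemctl restart mssql-server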
You can also change the default directory for the transaction log files using
the filelocation.defaultlogdir setting. To set the default directory for
backup files in a similar way, you can use the filelocation.defaultbackupdir
setting:
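For example, using /var/opt/mssql/backup as the backup location:
$ sudo /opt/mssql/bin/mssql-conf set filelocation.defaultbackupdir /var/opt/mssql/backup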
As with the data directory, you need to grant the proper permissions to the
mssql group and user, as shown earlier.
To change the port used by SQL Server, you can use the network.tcpport
setting, as shown next. The first case shows an error when a port is already in
use, followed by a successful execution:
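For example, to use port 1435:
$ sudo /opt/mssql/bin/mssql-conf set network.tcpport 1435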
After making such a configuration change, you will have to specify the port
number every time you need to connect to the Linux instance, because there is
no SQL Server Browser service on Linux to resolve it. The SQL Server
Browser service on Windows would run on UDP port 1434, automatically
listen for connections intended for instances running on nondefault ports, and
provide the correct port to the client. For example, to connect from SQL
Server Management Studio or the sqlcmd utility to use port 1435 as defined
previously, you could use the following format, in which our server hostname
is sqlonlinux. Make sure the TCP/IP ports are properly configured, as
indicated in Chapter 1.
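$ sqlcmd -S sqlonlinux,1435 -U sa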
Finally, to enable or disable SQL Server trace flags, you can use the
traceflag option, as shown next:
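Trace flags 3226 and 1222 are used here just as examples:
$ sudo /opt/mssql/bin/mssql-conf traceflag 3226 1222 on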
As shown, you can enable or disable more than one trace flag at a time if
you specify a list separated by spaces.
NOTE
Trace flag 3226 is used to suppress successful backup operation entries in
the SQL Server error log and in the system event log. Some other very
popular trace flags, such as 2371, 1117, and 1118, which are always
recommended, are no longer required in SQL Server 2016 or later because
their behavior is now part of the product. These trace flags are explained
later in this chapter.
You can disable the trace flags using the off parameter, as shown next:
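$ sudo /opt/mssql/bin/mssql-conf traceflag 3226 1222 off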
Similarly, you can unset a value to go back to the default configuration. For
example, the change to TCP port 1435 can be reverted back to the default
1433 by running the following mssql-conf command:
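$ sudo /opt/mssql/bin/mssql-conf unset network.tcpport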
As usual, with every configuration change, you will need to restart the SQL
Server instance for the changes to take effect.
Finally, to enable availability groups on your SQL Server instance, you can
run the following command. Availability groups, high availability, and disaster
recovery will be covered in more detail in Chapter 7.
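$ sudo /opt/mssql/bin/mssql-conf set hadr.hadrenabled 1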
You can see the current configured settings by viewing the contents of the
/var/opt/mssql/mssql.conf file, as shown next, which includes only the
nondefault values. As such, a setting not included in this file is using the
default value:
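$ cat /var/opt/mssql/mssql.conf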
Using Variables
Another way to change SQL Server configuration settings is to use variables.
Following are the current variables available as defined in the SQL Server
documentation:
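The documented variables include ACCEPT_EULA, MSSQL_SA_PASSWORD, MSSQL_PID,
MSSQL_TCP_PORT, MSSQL_COLLATION, and MSSQL_MEMORY_LIMIT_MB, among others; see
the SQL Server documentation for the complete list. As a sketch (the image
name and values shown are just examples), some of them can be set with the -e
option when creating a Docker container:
$ docker run -e "ACCEPT_EULA=Y" -e "MSSQL_SA_PASSWORD=MyStrongPass123!" \
  -e "MSSQL_PID=Developer" -p 1433:1433 -d microsoft/mssql-server-linux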
In this case, the –e option is used to set any environment variable in the
container. The other options used may look obvious by now and were also
explained in Chapter 1.
Linux Settings
Database professionals working with some other databases in Unix
environments such as Oracle may be aware of specific kernel settings or some
operating system configurations required for a database server. Configuring
SQL Server on Linux is pretty much like configuring it on Windows, and not
much Linux-specific configuration is required.
However, a few configuration settings can help provide better performance
for SQL Server in Linux, and this section describes them. Although I cover
how to implement these recommendations in the most common scenarios, you
may need to look at your specific Linux distribution documentation for more
details on their configuration.
Kernel Settings
Microsoft recommends several CPU- and disk-related Linux kernel settings
for a high-performance configuration. Table 4-1 shows the recommended CPU
settings.
Using the tuned-adm utility, I can list the available tuning profiles, which
also shows the active profile:
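$ sudo tuned-adm list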
NOTE
For more details about the performance tuning profiles, see the Red Hat
Enterprise Linux documentation at
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux or
consult your Linux distribution documentation.
As you can see, two values do not follow the Microsoft recommended
values. To change a sysctl setting, you can use the –w option, as shown next:
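For example, applying one of the values from the Microsoft recommendations
(shown here only as an illustration):
$ sudo sysctl -w kernel.sched_min_granularity_ns=10000000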
Note that you cannot include a space, which may be allowed with other
commands, or you will get the following error messages:
You can display all values of kernel settings currently available in the
system by using –a or --all, which will return a very large number of entries.
An example is shown in Figure 4-1.
You can use the blockdev command to set the disk readahead property. For
example, I will start with a report of all the devices in my system:
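$ sudo blockdev --report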
The RA column in the report is the block device readahead buffer size. You
can use the --getra and --setra options to print and set the readahead
configuration, respectively. In both cases, it is in 512-byte sectors.
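For example, assuming a data disk at /dev/sdc, you could print the current
value and set a readahead of 4096 sectors:
$ sudo blockdev --getra /dev/sdc
$ sudo blockdev --setra 4096 /dev/sdc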
You can disable the automatic NUMA balancing for multinode NUMA
systems by using the sysctl command, as shown next:
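$ sudo sysctl -w kernel.numa_balancing=0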
NOTE
Nonuniform memory access (NUMA) is a hardware memory design used in
multiprocessing, where memory access time depends on the memory
location relative to the processor.
The second setting is the virtual address space; the default value of 64K
may not be enough for a SQL Server installation. It is recommended that you
change it to 256K, as shown next:
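Assuming this refers to the vm.max_map_count kernel setting, the change would be:
$ sudo sysctl -w vm.max_map_count=262144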
NOTE
Check your Linux documentation if you also need to enable transparent
huge pages.
Swap File
Similar to Windows, Linux swap space can be used when the available
memory is full. Inactive pages in memory are moved to the swap space when
no more physical memory is available. As in Windows, swap space is not a
replacement for memory because it is usually on disk and is slower than
normal memory.
NOTE
See your Linux documentation for more details about configuring swap
space.
To get started, you can use the swapon command to enable or disable
devices and files for paging and swapping. Use the swapon -s command to
display a swap usage summary by device; this is equivalent to looking at the
information on /proc/swaps, as shown in the next two statements:
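$ swapon -s
$ cat /proc/swaps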
Finally, you can use the remount option of the mount command to reload
the file system information. This option will attempt to remount an already-
mounted file system:
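For example, for the root file system:
$ sudo mount -o remount /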
NOTE
For an example of troubleshooting a case in which the SQL Server process
was killed by the OOM Killer, see the article at
https://blogs.msdn.microsoft.com/psssql/2017/10/17/how-to-safeguard-sql-
server-on-linux-from-oom-killer/.
The best way to avoid the SQL Server process being killed by the Linux
kernel is to configure the server memory properly, assigning the appropriate
amount of memory for SQL Server, the operating system, and possibly any
other process running on the same server. You can allocate the required
amount of memory to SQL Server by using the mssql-conf tool, as explained
earlier in this chapter. In addition, properly configuring a swap file, as
indicated earlier, can help. You can view the available memory for SQL Server
in the error log at instance startup, as shown next:
tempdb Configuration
Correctly configuring tempdb has been a performance consideration for all the
versions of SQL Server for as long as I can remember. SQL Server 2016
brought some improvements such as the ability to create multiple data files
automatically during the product setup based on the number of available
processors on the system, or the new default tempdb configuration, which
integrates the behavior of trace flags 1117 and 1118. Because SQL Server on
Linux does not currently have the ability to create multiple tempdb data files
during setup automatically, manually configuring this remains an important
configuration requirement.
tempdb has long been associated with a classic performance problem: tempdb
contention. The creation of a large number of user objects in a short period of
time can contribute to latch contention of allocation pages. The main kind of
tempdb contention is called DML (Data Manipulation Language) contention,
as it relates to queries that modify data, mostly due to INSERT, UPDATE, and
DELETE operations on temporary tables. A second type of contention, DDL
(Data Definition Language) contention, although not common, is also possible
in some heavy use scenarios. DDL contention is related to queries that create
or alter objects that impact the system catalogs, as opposed to user data.
Every time a new object has to be created in tempdb, which is usually a
temporary table with at least one row inserted, two new pages must be
allocated from a mixed extent and assigned to the new object. One page is an
Index Allocation Map (IAM) page and the second is a data page. During this
process, SQL Server also has to access and update the very first Page Free
Space (PFS) page and the very first Shared Global Allocation Map (SGAM)
page in the data file. Only one thread can change an allocation page at a time,
requesting a latch on it. When there is high activity and a large number of
temporary tables are being created and dropped in tempdb, contention between
the PFS and SGAM pages is possible. Remember that this is not an I/O
problem, because allocation pages in this case are already in memory.
Obviously, this contention impacts the performance of the processes creating
those tables because they have to wait, and SQL Server may appear
unresponsive for short periods of time. Keep in mind that although user
databases have the same allocation pages, they are not likely to have a latch
contention problem in allocation pages because not as many objects are
created at the same time as they are created in tempdb.
The easiest way to check to determine whether you have a latch contention
problem on tempdb allocation pages is to look for PAGELATCH_XX waits on
the database activity. (Note that these are not the same as
PAGEIOLATCH_XX waits.)
Although there is no perfect solution to the latch contention problem,
because the database engine should be able to scale and work fine as the
number of operations increases, there are a few good recommendations to help
you solve or minimize the problem. An obvious solution may be to minimize
the number of temporary tables created in tempdb, but this may not be easy to
implement because it would require code and application changes. Keep in
mind that internal objects, such as the ones created by sort and hash
operations, are not created explicitly by users and do not require the allocation
methods discussed in this section. These internal objects can, however, create
a different kind of performance problem.
The workaround to tempdb contention problems has historically been one
or both of the following choices, especially prior to SQL Server 2016:
creating multiple tempdb data files of equal size and enabling trace flags
1117 and 1118.
Process Affinity
Microsoft recommends that you set the process affinity for all the NUMA
nodes and CPUs when SQL Server is running in a Linux operating system.
You can accomplish this using the ALTER SERVER CONFIGURATION statement
with the SET PROCESS AFFINITY option. You can either use CPU or NUMANODE
choices, the latter being the easiest choice. Using the CPU option enables you
to specify the CPU or range of CPUs to assign threads to. Using the NUMANODE
option enables you to assign threads to all CPUs that belong to the specified
NUMA node or range of nodes.
PROCESS AFFINITY enables hardware threads to be associated with CPUs
and helps maintain an efficient Linux and SQL Server scheduling behavior. It
is recommended that you set process affinity even if your system has a single
NUMA node.
The following command, for example, sets the process affinity for all the
NUMA nodes on a system with four nodes:
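ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 0 TO 3;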
Max Degree of Parallelism
One of the most important settings to configure on a new SQL Server
installation, the max degree of parallelism option, defines the number of
logical processors employed to run a single statement for parallel execution
plans. Although a perfect configuration may depend on the specific workload,
Microsoft has, since long ago, published a best practice recommendation that
can work on most of the workloads or that can be used as a starting point.
First of all, the hardware on which SQL Server is running must be capable
of running parallel queries—which means at least two logical processors are
required, and this basically includes almost every server available today,
including most typical configurations in virtual machines for production
instances. Second, the affinity mask configuration option, which is now
deprecated, or the ALTER SERVER CONFIGURATION SET PROCESS AFFINITY
statement, covered earlier, must allow the use of multiple processors, which
both do by default. Finally, the query processor must decide whether using
parallelism can in fact improve the performance of a query, based on its
estimated cost.
When a SQL Server installation is using the default max degree of
parallelism value, which is 0, the database engine can decide at runtime the
number of logical processors in a plan, up to a maximum of 64. Obviously,
this does not mean that every parallel query would use the maximum number
of processors available all the time. For example, even if your system has 64
logical processors and you are using the default configuration, it is still very
possible that a query can use only 8 processors and run with eight threads.
This is a decision the query processor makes at execution time.
Microsoft published an article recommending a value for the max degree of
parallelism option to be the same as the number of logical processors in the
server, up to a maximum of eight. You can read the article at
https://support.microsoft.com/en-us/kb/2806535. Because it is very common
to have eight processors or more nowadays, a value of eight is a common
configuration.
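As a quick sketch, the setting can be changed with sp_configure, shown here
with the commonly recommended value of eight:
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 8;
RECONFIGURE;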
Keep in mind that the listed recommendation is just a guideline, and that
the article specifies that the guideline is applicable for typical SQL Server
activity. In addition, depending on your workload or application patterns,
some other values for this setting may be considered and thoroughly tested as
well. As an example, if your SQL Server installation has 16 logical processors
and a workload with a small number of queries running at the same time, a
max degree of parallelism value of 16 could be your best choice. On the other
hand, for a workload with a large number of queries, a value of 4 could be
considered as well.
Statistics
This section covers statistics, which is one of the most important things that
can impact query performance. In versions prior to SQL Server 2016, a very
common configuration was to enable trace flag 2371 to improve the threshold
use by SQL Server to update optimizer statistics automatically. Starting with
this SQL Server version, this configuration was enabled by default, so the
trace flag was no longer needed.
By default, SQL Server automatically creates and updates query optimizer
statistics. You can change this database-level default, but doing so is almost
never recommended because it would require the developer or administrator to
create all the required statistics manually. Although this is possible, it doesn’t
make much sense, because the query optimizer can efficiently create the
required statistics for you. Some other statistics will be created automatically
when you create indexes for the columns involved in the index key. Manually
creating statistics could be required only in a very limited number of cases,
one of those being when multicolumn statistics are required. Multicolumn
statistics are not created automatically by SQL Server.
Updating statistics is a little bit different. SQL Server can automatically
update statistics when a specific threshold is reached. Although there are two
thresholds or algorithms used to accomplish this, which I will cover in just a
moment, a common problem is the size of the sample used to update the
statistics object. High-performance databases may require a more proactive
approach to updating statistics instead of letting SQL Server hit any of these
two thresholds and using a very small sample.
The main limitation with automatic statistics update is the traditional 20
percent fixed threshold of changes required to trigger the update operation,
which for large tables would require a very significant amount of changes. The
second algorithm mentioned, usually enabled by using trace flag 2371,
improves a bit on the threshold required to update statistics automatically, but
the size of sample issue remains. The statistics update is triggered during the
query optimization process when you run a query, but before it is executed.
Because the update is technically part of the execution process, only a very
small sample is used. Using a small sample makes sense, because you don’t
want to use a large sample in the middle of the query execution since it will
most likely impact the execution time.
The process can be efficient for many workloads, but for more
performance-demanding applications, a more proactive approach is required.
This proactive approach usually means performing a scheduled maintenance
job to update statistics on a regular basis. This method fixes both problems—
not waiting until you hit a specific large threshold and providing a better
sample size, which may include using the entire table.
In my opinion, the new algorithm is still not enough, but it is certainly
better than the default threshold. The benefit of the new algorithm is with large
tables, but a small sample in large tables may still be inadequate. This is why I
recommend that you proactively update statistics in the first place, but leave
the automatic update enabled, just in case, as a second choice.
Although there are free tools and scripts available to update statistics, even
within SQL Server, creating an efficient script to perform this update is not as
easy as it sounds, especially for large databases. Your script will have to deal
with some or all of the following questions: Which tables, indexes, or statistics
should be updated? What percent of the table should be used as the sample
size? Do I need to scan the entire table? How often do I need to update
statistics? Does updating statistics impact my database performance activity?
Do I need a maintenance window? The answer to most of these questions will
depend on the particular implementation, because there are many varying
factors.
First, you have to define the point at which you will update statistics. For
example, a typical approach for deciding when to rebuild indexes is to use the
fragmentation level, which is available through sys.dm_db_index_physical_stats,
documented at https://docs.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-db-index-physical-stats-transact-sql.
The documentation even provides a suggested threshold and a script to get
started. This process, however, is a little more complicated for statistics.
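As a reference point, a fragmentation check of the kind used to drive index rebuilds might look like the following sketch; the 30 percent threshold is only a common starting point, not a rule:

SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
    ON ips.object_id = i.object_id AND ips.index_id = i.index_id
WHERE ips.avg_fragmentation_in_percent > 30
ORDER BY ips.avg_fragmentation_in_percent DESC;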
Traditionally, database administrators relied on updating statistics based on
the last updated date for statistics (for example, using the DBCC
SHOW_STATISTICS statement or the STATS_DATE function) or older columns
such as rowmodctr, available on the sys.sysindexes compatibility view, both of
which have some drawbacks. If a table has not changed much in a specific
period of time, those statistics may still be useful. In addition, the rowmodctr
column does not track changes to the leading statistics column, as the
following solution does. Introduced with SQL Server 2008 R2 Service Pack 2
and SQL Server 2012 Service Pack 1, and available in all later versions, the
sys.dm_db_stats_properties DMF returns information about a specific statistics
object. One of the columns, modification_counter, returns the number of
changes for the leading statistics column since the last time the statistics on the
object were updated, so this value can be used to decide when to update them.
The point at which to update will depend on your data and could be difficult to
estimate, but at least you have better choices than before.
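For example, a query along these lines lists every statistics object in the current database together with its modification counter and last update date (a sketch; the threshold you test the counter against is entirely up to you):

SELECT OBJECT_NAME(s.object_id) AS table_name,
       s.name AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE sp.modification_counter > 0
ORDER BY sp.modification_counter DESC;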
Along with jobs for statistics maintenance, usually there are also jobs to
rebuild or reorganize indexes, which makes the choice to update statistics a bit
more complicated. Rebuilding an index will update statistics with the
equivalent of the full-scan option. Reorganizing an index does not touch or
update statistics at all. We usually want to rebuild indexes depending on their
fragmentation level, so statistics will be updated only for those indexes. We
may not want the statistics job to update those statistics again. Traditionally,
this decision has been left to your scripts, which must work out which statistics
objects to update and which sometimes end up updating the same object twice.
As mentioned, this problem can now be fixed, or at least minimized, by using
the sys.dm_db_stats_properties DMF.
Finally, there is currently no documented method to determine whether
statistics are being used by the query optimizer. Let’s suppose, for example,
that an ad hoc query was executed only once, which created statistics on some
columns. Assuming those statistics are not used again, maintenance jobs will
continue to update those statistics objects potentially as long as the columns
exist.
As a practical starting point, I recommend Ola Hallengren's Maintenance
Solution, available at https://ola.hallengren.com, to implement maintenance jobs
for backups, index and statistics maintenance, and consistency checks.
Summary
This chapter covered configuring SQL Server on Linux in three different
areas. First, there is no Configuration Manager on Linux, so configuring the
SQL Server port or a trace flag at the server level now has to be done with the
mssql-conf utility. In addition, mssql-conf can be used to configure a large
variety of other settings.
It also covered several CPU- and disk-related Linux kernel settings for a
high-performance configuration, in addition to some operating system–level
configurations such as transparent huge pages, which can improve the
performance of Linux when managing large amounts of memory.
Finally, it covered some critical SQL Server configuration settings that
apply to both Windows and Linux implementations. The Out-of-Memory Killer,
an interesting behavior Linux uses when a server is running low on memory,
was introduced as well.
Chapter 5
In This Chapter
Query Performance
Query Processor Architecture
Execution Plans
Query Troubleshooting
Indexes
Statistics
Parameter Sniffing
Query Processor Limitations
Summary
So far, I have covered installing and configuring SQL Server on Linux and I
showed you how SQL Server works on this operating system. We’ll now
move on to the operational part and use the technology for database
queries.
The next two chapters discuss query processing and performance. This
chapter is an introduction to query tuning and optimization in SQL Server. The
next chapter covers many new features in SQL Server 2017, including adaptive
query processing and automatic tuning. The content of both chapters applies to
SQL Server on both Linux and Windows, and this chapter applies to all the
currently supported versions of the product. Later chapters in the book will
focus on high availability, disaster recovery, and security.
Query optimization usually refers to the work performed by the query
optimizer, in which an efficient—or good enough—plan is produced.
Sometimes, however, you may not be happy with a query's execution
performance and may try to improve it by making changes; this is query
tuning. It is important to
understand that the results you originally get from the query optimizer—that is,
the execution plan—will greatly depend on the information you feed it,
including your database design, the defined indexes, and even some
configuration settings. You can impact the work performed by the query
processor in many ways, which is why it is very important that you understand
how you can help this SQL Server component to do a superior job. Providing
quality information to the query processor will usually result in high-quality
execution plans, which will also improve the performance of your databases
and applications. You also need to be aware that no query optimizer is perfect,
and sometimes you may not, in fact, get a good execution plan or good query
performance, but other solutions may still be available.
This chapter provides an introduction to query tuning and optimization and
is not intended to cover this topic in detail. For more complete coverage of this
topic, read my book Microsoft SQL Server 2014 Query Tuning & Optimization
(McGraw-Hill Education, 2014). For a focus on SQL Server performance and
configuration, read my book High Performance SQL Server: The Go Faster
Book (Apress, 2016).
Query Performance
We all have been there: A phone call notifies you of an application outage and
asks you to join an urgent conference call. After joining the call, you learn that
the application is so slow that the company is not able to conduct business; it is
losing money and potentially customers, too. Nobody on the call is able to
provide any additional information that can help you determine what the
problem is. What do you do? Where do you start? And after troubleshooting
and fixing the issue, how do you avoid these problems in the future?
Although an outage can occur for several different reasons, including a
hardware failure and an operating system problem, as a database professional,
you should be able to tune and optimize your databases proactively and be
ready to troubleshoot any problem quickly. By focusing on SQL Server
performance, and more specifically on query tuning and optimization, you can,
first, avoid these performance problems by optimizing your databases and,
second, quickly troubleshoot and fix the problems if they actually occur.
One of the best ways to learn how to improve the performance of your
databases is not only to work with the technology, but to understand how the
underlying technology works, how to get the most benefit from it, and even
what its limitations are. The most important SQL Server component impacting
the performance of your queries is the SQL Server query processor, which
includes the query optimizer and the execution engine.
With a perfect query optimizer, you could just submit any query and you
would get a perfect execution plan every time. And with a perfect execution
engine, each of your queries would run in a matter of milliseconds. But the
reality is that query optimization is a very complex problem, and no query
optimizer can find the best plan all the time, or at least not in a reasonable
amount of time. For complex queries, a query optimizer would need to analyze
many possible execution plans. And even if a query optimizer could analyze all
the possible solutions, its next challenge would be to decide which plan to
choose. In other words, which of the possible solutions is the most efficient?
Choosing the best plan would require estimating the cost of each solution,
which again is a very complicated task.
Don’t get me wrong: The SQL Server query optimizer does an amazing job
and gives you a good execution plan almost all the time. But you must
understand which information you need to provide to the query optimizer so it
can do a good job, which may include providing the right indexes and adequate
statistics, as well as defining the required constraints and providing a good
database design. SQL Server even provides you with tools to help you in some
of these areas, including the Database Engine Tuning Advisor (DTA) and the
auto-create and auto-update statistics features. But you can still do more to
improve the performance of your databases, especially when you are building
high-performance applications. Finally, you need to understand the cases for
which the query optimizer may not give you a good execution plan and what to
do in those cases.
To help you better understand this technology, I will start with an overview
of how the SQL Server query processor works. I will explain the purpose of
both the query optimizer and the execution engine and how they interact with
the plan cache to reuse plans as much as possible. I will also show you how to
work with execution plans, which are the primary tool we will use to interact
with the query processor.
Query Optimization
The next step is the optimization process, which is basically the generation of
candidate execution plans and the selection of the best of these plans according
to their cost. As already mentioned, the SQL Server query optimizer uses a
cost-estimation model to estimate the cost of each of the candidate plans.
Query optimization could be also seen as the process of mapping the logical
query operations expressed in the original tree representation to physical
operations, which can be carried out by the execution engine. So it’s actually
the functionality of the execution engine that is being implemented in the
execution plans being created by the query optimizer; that is, the execution
engine implements a certain number of different algorithms, and it is from
these algorithms that the query optimizer must choose when formulating its
execution plans. It does this by translating the original logical operations into
the physical operations that the execution engine is capable of performing.
Execution plans show both the logical and physical operations for each
operator. Some logical operations, such as sorts, translate to the same physical
operation, whereas other logical operations map to several possible physical
operations. For example, a logical join can be mapped to a nested loops join,
merge join, or hash join physical operator. However, this is not a one-to-one
operator matching and instead follows a more complicated process.
Thus, the end product of the query optimization process is an execution
plan: a tree consisting of a number of physical operators that contain the
algorithms to be performed by the execution engine to obtain the desired results
from the database.
Plans may also be removed from the plan cache when SQL Server is under
memory pressure or when certain statements are executed. Changing some
configuration options (for example, max degree of parallelism) will clear the
entire plan cache. Alternatively, some statements, such as altering a database
with certain ALTER DATABASE options, will clear all the plans associated with
that particular database.
It is worth noting, however, that reusing an existing plan may not always be
the best solution for a given query, and some performance problems may result.
For example, depending on the data distribution within a table, the optimal
execution plan for a query may differ greatly depending on the parameters
being used. More details about parameter-sensitive queries are covered later in
the chapter in the section “Parameter Sniffing.”
Execution Plans
Now that we have a foundation in the query processor and how it works its
magic, it is time to consider how we can interact with it. Primarily, we will
interact with the query processor through execution plans, which as I
mentioned earlier are ultimately trees consisting of a number of physical
operators that, in turn, contain the algorithms to produce the required results
from the database.
You can request either an actual or an estimated execution plan for a given
query, and either of these two types can be displayed as a graphic, text, or XML
plan. Any of these three formats shows the same execution plan—the only
differences are in how they are displayed and the levels of detailed information
they contain.
When an estimated plan is requested, the query is not executed; the plan
displayed is simply the plan that SQL Server would most probably use if the
query were executed (bearing in mind that a recompile may generate a different
plan at a later time). However, when an actual plan is requested, the query
needs to be executed, and the plan is then displayed along with the query
results. Nevertheless, using an estimated plan has several benefits, including
displaying a plan for a long-running query for inspection without actually
running the query, or displaying a plan for update operations without changing
the database.
Live Query Statistics, a query troubleshooting feature introduced with SQL
Server 2016, can be used to view a live query plan while the query is still in
execution, so you can see query plan information in real time without needing
to wait for the query to complete. Since the data required for this feature is also
available in the SQL Server 2014 database engine, it can also work in that
version if you are using SQL Server 2016 Management Studio or later.
Graphical Plans
You can display graphical plans in SQL Server Management Studio by clicking
the Display Estimated Execution Plan button or the Include Actual Execution
Plan button from the SQL Editor toolbar. Clicking Display Estimated
Execution Plan will show the plan immediately, without executing the query,
whereas to request an actual execution plan, you need to click Include Actual
Execution Plan and then execute the query and click the Execution plan tab.
As an example, copy the following query to the Management Studio Query
Editor, select the AdventureWorks2014 database, click the Include Actual
Execution Plan button, and execute the query:
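A query consistent with the plan described next (an Index Scan on Person.Address feeding a Hash Aggregate that returns 575 distinct cities) would be the following; treat it as a sketch if your copy of the sample database differs:

SELECT DISTINCT(City)
FROM Person.Address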
Then select the Execution Plan tab in the results pane. This displays the plan
shown in Figure 5-3.
NOTE
This book contains a large number of sample SQL queries, all of which are
based on the AdventureWorks database. All code has been tested on SQL
Server 2017 RTM. These sample databases are not included in your SQL
Server installation but can be downloaded separately. In addition, the
AdventureWorks sample databases are no longer provided in the SQL Server
2017 release. The latest version available, and used for this book, is
AdventureWorks OLTP SQL Server 2014, which you can download from
https://github.com/Microsoft/sql-server-samples/releases/tag/adventureworks.
Restore the database as
AdventureWorks2014. For more details on how to copy backup files to Linux
and restore a database, refer to Chapter 1.
NOTE
Beginning with SQL Server Management Studio 17.4, the execution plan
operator icons look slightly different from those in previous versions. The
icons shown on the SQL Server documentation page at
https://docs.microsoft.com/en-us/sql/relational-databases/showplan-logical-and-physical-operators-reference
should be updated by the time this book goes to press.
Figure 5-4 Data flow between the Index Scan and Hash Aggregate operators
By looking at the actual number of rows, you can see that the Index Scan
operator is reading 19,614 rows from the database and sending them to the
Hash Aggregate operator. The Hash Aggregate operator is, in turn, performing
some operation on this data and sending 575 records to its parent, which you
can see by placing the mouse pointer over the arrow between the Hash
Aggregate and the SELECT icon.
Basically, in this plan, the Index Scan operator is reading all 19,614 rows
from an index, and the Hash Aggregate is processing these rows to obtain the
list of distinct cities, of which there are 575, that will be displayed in the
Results window in Management Studio. Notice also that you can see the
estimated number of rows, which is the query optimizer’s cardinality
estimation for this operator, as well as the actual number of rows. Comparing
the actual and the estimated number of rows can help you detect cardinality
estimation errors, which can affect the quality of your execution plans.
To perform their job, physical operators implement at least the following
three methods: Open(), which causes an operator to initialize itself and set up
any required data structures; GetRow(), which requests a row from the
operator; and Close(), which performs some cleanup operations and shuts the
operator down.
NOTE
For now, I will explain the traditional query-processing mode in which
operators process only one row at a time. This processing mode has been
used in all versions of SQL Server since SQL Server 7.0. In the next chapter,
I will touch on the new batch-processing mode, introduced with SQL Server
2012, which is used by operators related to columnstore indexes.
In addition to learning more about the data flow, you can hover the mouse
pointer over an operator to get more information about it. For example, Figure
5-5 shows information about the Index Scan operator. Notice that it includes,
among other things, a description of the operator and data on estimated costing
information, such as the estimated I/O, CPU, operator, and subtree costs. It is
worth mentioning that these costs are just internal units and are not meant to be
interpreted as seconds, milliseconds, or any other unit of time.
XML Plans
Once you have displayed a graphical plan, you can also easily display it in
XML format. Simply right-click anywhere on the graphical execution plan
window to display a pop-up menu and select Show Execution Plan XML. This
will open the XML editor and display the XML plan. As you can see, you can
easily switch between a graphical and an XML plan.
If needed, you can save graphical plans to a file by right-clicking and
selecting Save Execution Plan As from the pop-up menu. The plan, usually
saved with an .sqlplan extension, is actually an XML document containing the
XML plan information, but it can be read by Management Studio back into a
graphical plan. You can load this file again by selecting File | Open in
Management Studio to display it immediately as a graphical plan, which will
behave exactly as before. XML plans can also be used with the USE PLAN query
hint.
Table 5-1 shows the different statements you can use to obtain an estimated
or actual execution plan in text, graphic, or XML format. Keep in mind that
when you run any of the statements listed in this table using the ON clause, it
will apply to all subsequent statements until the option is manually set to OFF
again.
To show an XML plan directly, you can use the following commands:
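A sketch of those commands, reusing the Person.Address query from earlier in the chapter (SET SHOWPLAN_XML ON can be used instead if you want only the estimated plan, without executing the query):

SET STATISTICS XML ON
GO
SELECT DISTINCT(City) FROM Person.Address
GO
SET STATISTICS XML OFF
GO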
This will display a single-row, single-column (titled “Microsoft SQL Server
2005 XML Showplan”) result set containing the XML data that starts with the
following: <ShowPlanXML xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan".
Clicking the link will show you a graphical plan, and you can then display the
XML plan using the same procedure I explained earlier.
You can browse the basic structure of an XML plan via the following
exercise. A very simple query will create the basic XML structure, but in this
example I show you a query that can provide two additional parts: the missing
indexes and parameter list elements. Run the following query and request an
XML plan in the same way we did in the previous example:
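One query that should produce both elements in AdventureWorks2014 is the following parameterized statement; it is only a sketch, and it assumes that no index exists on the CarrierTrackingNumber column and that the tracking number value is merely illustrative:

EXEC sp_executesql
    N'SELECT * FROM Sales.SalesOrderDetail WHERE CarrierTrackingNumber = @tracking',
    N'@tracking nvarchar(25)',
    @tracking = N'4911-403C-98';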
Text Plans
As shown in Table 5-1, you can use two commands to get estimated text plans:
SET SHOWPLAN_TEXT and SET SHOWPLAN_ALL. Both statements show the
estimated execution plan, but SET SHOWPLAN_ALL shows some additional
information, including the estimated number of rows, estimated CPU cost,
estimated I/O cost, and estimated operator cost. You can use the following code
to display a text execution plan:
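A sketch using the same Person.Address query as before:

SET SHOWPLAN_TEXT ON
GO
SELECT DISTINCT(City) FROM Person.Address
GO
SET SHOWPLAN_TEXT OFF
GO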
This code will actually display two result sets, the first one returning the text
of the T-SQL statement. In the second result set, you see the following text plan
(edited to fit the page), which shows the same Hash Aggregate and Index Scan
operators displayed in Figure 5-3.
Run the query on the first session, and at the same time run the code on the
second session several times to see the resources used. The next output shows a
sample execution while the query is still running and has not completed yet.
Notice that the sys.dm_exec_requests DMV shows the partially used resources
and that sys.dm_exec_sessions shows no used resources yet. Most likely, you
will not see the same results for sys.dm_exec_requests.
After the query completes, the original request no longer exists and
sys.dm_exec_sessions now records the resources used by the first query:
If you run the query on the first session again, sys.dm_exec_sessions will
accumulate the resources used by both executions, so the values of the results
will be slightly more than twice their previous values, as shown next:
Keep in mind that CPU time and duration may vary slightly during different
executions, and most likely you will get different values as well. Logical reads
is 8,192 for this execution, and we see the accumulated value 16,384 for two
executions. In addition, the sys.dm_exec_requests DMV shows information
about only currently executing queries, so you may not see this particular data
if a query completes before you are able to see it. In summary,
sys.dm_exec_requests and sys.dm_exec_sessions are useful to inspect the
resources currently used by a request or the accumulation of resources used by
requests on a session since creation.
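A monitoring query of the kind described above might look like the following sketch; replace the session_id value with the one for your first session, which you can obtain there by running SELECT @@SPID:

SELECT session_id, status, cpu_time, logical_reads, reads, total_elapsed_time
FROM sys.dm_exec_requests
WHERE session_id = 52;   -- session_id of the first session (example value)

SELECT session_id, status, cpu_time, logical_reads, reads, total_elapsed_time
FROM sys.dm_exec_sessions
WHERE session_id = 52;   -- same session_id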
sys.dm_exec_query_stats
If you ever worked with any version of SQL Server prior to SQL Server 2005,
you may remember how difficult it was to determine the most expensive
queries in your instance. Performing that kind of analysis usually required
running a server trace on the instance for a period of time and then analyzing
the collected data, often gigabytes in size, using third-party tools or homegrown
methods, which was a very time-consuming process. Not to mention that
running such a trace could itself affect the performance of a system that most
likely was already having a performance problem.
As mentioned, DMVs were introduced with SQL Server 2005 and are a
great help to diagnose problems, tune performance, and monitor the health of a
server instance. In particular, sys.dm_exec_query_stats provides a rich amount
of information not previously available in SQL Server regarding aggregated
performance statistics for cached query plans. This information helps you avoid
the need to run a trace in most cases. This view returns a row for each
statement available in the plan cache, and SQL Server 2008 added
enhancements such as the query hash and plan hash values, which will be
explained soon.
Let’s take a quick look at understanding how sys.dm_exec_query_stats
works and the information it provides. Create the following stored procedure
with three simple queries:
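A sketch of such a procedure against AdventureWorks2014; the procedure name and the specific queries are only illustrative:

CREATE PROCEDURE dbo.test_stats
AS
SELECT * FROM Sales.SalesOrderDetail WHERE SalesOrderID = 60677;
SELECT * FROM Person.Address WHERE AddressID = 21;
SELECT * FROM HumanResources.Employee WHERE BusinessEntityID = 229;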
Run the following code to clean the plan cache (so it is easier to inspect),
remove all the clean buffers from the buffer pool, execute the created test
stored procedure, and inspect the plan cache. Note that the code uses the
sys.dm_exec_sql_text DMF, which requires a sql_handle or plan_handle value,
which we are, of course, obtaining from the sys.dm_exec_query_stats DMV,
and it returns the text of the SQL batch.
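A sketch of that code, assuming the dbo.test_stats procedure just created; note that DBCC FREEPROCCACHE and DBCC DROPCLEANBUFFERS should be run only on a test system:

DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;
GO
EXEC dbo.test_stats;
GO
SELECT qs.*, st.text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st;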
Examine the output. Because the number of columns is too large to show in
this book, only some of the columns are shown next:
As you can see by looking at the query text, all three queries were compiled
as part of the same batch, which we can also verify by validating they have the
same plan_handle and sql_handle. The statement_start_offset and
statement_end_offset values can be used to identify the particular queries in the
batch, a process that will be explained later in this section. You can also see in
this output the number of times the query was executed and several columns
showing the CPU time used by each query, as total_worker_time,
last_worker_time, min_worker_time, and max_worker_time. Should the query
be executed more than once, the statistics would show the accumulated CPU
time on total_worker_time. Additional performance statistics for physical
reads, logical writes, logical reads, CLR time, and elapsed time are also
displayed in the previous query but not shown in the book for page space. You
can look at the SQL Server documentation online for the entire list of columns,
including performance statistics and their documented descriptions.
Keep in mind that this view shows statistics for completed query executions
only. You can look at sys.dm_exec_requests for information about queries
currently executing. Finally, remember that certain types of execution plans
may never be cached, and some cached plans may also be removed from the
plan cache for several reasons, including internal or external memory pressure
on the plan cache. Information for these plans would therefore not be available
on sys.dm_exec_query_stats. Let’s now take a look at the
statement_start_offset and statement_end_offset values.
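A query that extracts each statement's text using those offsets might look like this sketch:

SELECT SUBSTRING(st.text,
                 (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                       WHEN -1 THEN DATALENGTH(st.text)
                       ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS statement_text,
       qs.statement_start_offset,
       qs.statement_end_offset,
       qs.execution_count
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st;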
This would produce output similar to the following (only a few columns are
shown):
Basically, the query makes use of the SUBSTRING function as well as
statement_start_offset and statement_end_offset values to obtain the text of the
query within the batch. Division by 2 is required because the text data is stored
as Unicode. To test the concept for a particular query, you can replace the
values for statement_start_offset and statement_end_offset directly for the first
statement (44 and 168, respectively) and provide the sql_handle or
plan_handle, as shown next, to get the first statement returned:
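A sketch of that test; the sql_handle is left as a placeholder that you must copy from the output of the previous query:

DECLARE @sql_handle varbinary(64);
SET @sql_handle = 0x;   -- paste the sql_handle value returned by the previous query
SELECT SUBSTRING(st.text, (44 / 2) + 1, ((168 - 44) / 2) + 1) AS statement_text
FROM sys.dm_exec_sql_text(@sql_handle) AS st;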
Since a sql_handle has a 1:N relationship with a plan_handle (that is, there
can be more than one generated executed plan for a particular query), the text
of the batch will remain on the SQLMGR cache store until the last of the
generated plans is evicted from the plan cache. The plan_handle value is a hash
value that refers to the execution plan the query is part of and can be used in
the sys.dm_exec_query_plan DMF to retrieve such an execution plan. It is
guaranteed to be unique for every batch in the system and will remain the same
even if one or more statements in the batch are recompiled. Here is an example:
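A sketch that retrieves the cached plans along with a clickable query_plan column:

SELECT qs.plan_handle, qs.execution_count, qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp;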
Running the code will return the following output, and clicking the
query_plan link will display the requested graphical execution plan:
Cached execution plans are stored in the OBJCP and SQLCP cache stores:
object plans, including stored procedures, triggers, and functions, are stored in
the OBJCP cache store, whereas plans for ad hoc, auto-parameterized, and
prepared queries are stored in the SQLCP cache store.
In this case, we have only one plan, reused for the second execution, as
shown in the execution_count value. Therefore, we can also see that plan reuse
is another benefit of parameterized queries. However, we can see a different
behavior with the following query:
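A sketch consistent with the behavior described next: the same ad hoc statement is run with two different literal values and the cached entries are then inspected (the ProductID values are illustrative; 897 is highly selective and 870 is not):

SELECT * FROM Sales.SalesOrderDetail WHERE ProductID = 897;
GO
SELECT * FROM Sales.SalesOrderDetail WHERE ProductID = 870;
GO
SELECT qs.query_hash, qs.query_plan_hash, qs.sql_handle, qs.execution_count
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE '%SalesOrderDetail%';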
As you can see, the sql_handle, the plan_handle (not shown), and the
query_plan_hash have different values because the generated plans are actually
different. However, the query_hash is the same because it is the same query,
except with a different parameter. Supposing that this was the most expensive
query in the system and there were multiple executions with different
parameters, it would be very difficult to find out that all those execution plans
actually did belong to the same query. This is where query_hash can help. You
can use query_hash to aggregate performance statistics of similar queries that
are not explicitly or implicitly parameterized. Both query_hash and plan_hash
are available on the sys.dm_exec_query_stats and sys.dm_exec_requests
DMVs.
The query_hash value is calculated from the tree of logical operators created
after parsing but just before query optimization. This logical tree is used as the
input to the query optimizer. Because of this, two or more queries do not need
to have exactly the same text to produce the same query_hash value, as
parameters, comments, and some other minor differences are not considered.
And, as shown in the first example, two queries with the same query_hash
value can have different execution plans (that is, different query_plan_hash
values). On the other hand, the query_plan_hash is calculated from the tree of
physical operators that make up an execution plan. Basically, if two plans are
the same, although very minor differences are not considered, they will
produce the same plan hash value as well.
Finally, a limitation of the hashing algorithms is that they can cause
collisions, but the probability of this happening is extremely low. This basically
means that two similar queries may produce different query_hash values or that
two different queries may produce the same query_hash value, but again, the
probability of this happening is extremely low and it should not be a concern.
These examples are based on CPU time (worker time). Therefore, in the
same way, you can update these queries to look for other resources listed on
sys.dm_exec_query_stats, such as physical reads, logical writes, logical reads,
CLR time, and elapsed time. Finally, we could also apply the same concept to
find the most expensive queries currently executing, based on the
sys.dm_exec_requests, as in the following query:
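A sketch of such a query, ordered by CPU time consumed so far:

SELECT TOP 10 r.session_id, r.cpu_time, r.logical_reads, r.total_elapsed_time, st.text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS st
ORDER BY r.cpu_time DESC;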
SET STATISTICS TIME / IO
We close this section with two statements that can give you additional
information about your queries and that you can use as an additional tuning
technique. These can be a great complement to using execution plans and
DMVs to get additional performance information regarding your queries’
optimization and execution process. One common misunderstanding I see is
developers trying to compare plan cost to plan performance. You should not
assume a direct correlation between a query-estimated cost and its actual
runtime performance. Cost is an internal unit used by the query optimizer and
should not be used to compare plan performance; SET STATISTICS TIME and
SET STATISTICS IO can be used instead. This section explains both statements.
You can use SET STATISTICS TIME to see the number of milliseconds
required to parse, compile, and execute each statement. For example, run this
query:
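For example, the following sketch enables the option and runs the Person.Address query used earlier (any query will do):

SET STATISTICS TIME ON
GO
SELECT DISTINCT(City) FROM Person.Address
GO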
To see the output, you will have to look at the Messages tab of the Edit
window, which will show an output similar to the following:
“Parse and compile” refers to the time SQL Server takes to optimize the
SQL statement, as explained earlier. SET STATISTICS TIME will continue to be
enabled for any subsequently executed queries. You can disable it like so:
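That is, by running the OFF form of the same option:

SET STATISTICS TIME OFF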
Obviously, if you only need the execution time of each query, you can see
this information in the status bar of the Management Studio Query Editor. SET
STATISTICS IO displays the amount of disk activity generated by a query. To
enable it, run the following statement:
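That is:

SET STATISTICS IO ON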
Run this next statement to clean all the buffers from the buffer pool to make
sure that no pages for this table are loaded in memory:
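A sketch of the cleanup statement followed by an example query; the table choice is illustrative, and DBCC DROPCLEANBUFFERS belongs only on a test system:

DBCC DROPCLEANBUFFERS
GO
SELECT * FROM Sales.SalesOrderDetail
GO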
It will show an output similar to the following, which you can see in the
Messages pane:
Here are the definitions of the main items (the read counters are all expressed
in 8K pages): scan count is the number of seeks or scans started against the
index or table; logical reads is the number of pages read from the data cache;
physical reads is the number of pages read from disk; and read-ahead reads is
the number of pages placed into the cache by the read-ahead mechanism.
Now, if you run the same query again, you will no longer get physical and
read-ahead reads, and you will get an output similar to this:
Finally, in the following example, scan count is 4 because SQL Server has to
perform four seeks:
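One query that should produce a scan count of 4 is the following sketch; it assumes the optimizer uses the nonclustered index on ProductID, performing one seek per value in the IN list:

SELECT ProductID, SalesOrderID
FROM Sales.SalesOrderDetail
WHERE ProductID IN (870, 873, 921, 712)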
Indexes
Indexing is one of the most important techniques used in query tuning and
optimization. By using the right indexes, SQL Server can speed up your
queries and dramatically improve the performance of your applications. There
are several kinds of indexes in SQL Server, so the focus of this section will be
on clustered and nonclustered indexes.
SQL Server can use indexes to perform seek and scan operations. Indexes
can be used to speed up the execution of a query by quickly finding records
without performing table scans, by delivering all the columns requested by the
query without accessing the base table (that is, covering the query), or by
providing sorted order, which will benefit queries with GROUP BY, DISTINCT, or
ORDER BY clauses.
Part of the query optimizer’s job is to determine whether an index can be
used to resolve a predicate in a query. This is basically a comparison between
an index key and a constant or variable. In addition, the query optimizer needs
to determine whether the index covers the query—that is, whether the index
contains all the columns required by the query (in which case it is referred to as
a covering index). It needs to confirm this because a nonclustered index usually
contains only a subset of the columns of the table.
SQL Server can also consider using more than one index and joining them
to cover all the columns required by the query. This operation is called index
intersection. If it’s not possible to cover all of the columns required by the
query, SQL Server may need to access the base table, which could be a
clustered index or a heap, to obtain the remaining columns. This is called a
bookmark lookup operation (which could be a Key Lookup or an RID Lookup
operation). However, because a bookmark lookup requires random I/O, which
is a very expensive operation, its usage can be effective only for a relatively
small number of records.
Also keep in mind that although one or more indexes are available for
selection, it does not mean that they will finally be selected in an execution
plan, as this is always a cost-based decision. So after creating an index, make
sure you verify that the index is, in fact, used in a plan, and, of course, verify
that your query is performing better, which is probably the primary reason why
you are defining an index. An index that is not being used by any query will
take up valuable disk space and may negatively affect the performance of
update operations without providing any benefit. It is also possible that an
index that was useful when it was originally created is no longer used by any
query. This could be the result of changes in the database, the data, or even the
query itself.
Creating Indexes
Let’s start this section with a summary of some basic terminology used in
indexes:
Heap A heap is a data structure where rows are stored without a specified
order. In other words, it is a table without a clustered index.
Clustered index In SQL Server, you can have the entire table logically
sorted by a specific key in which the bottom, or leaf level, of the index
contains the actual data rows of the table. Because of this, only one
clustered index per table is possible. The data pages in the leaf level are
linked in a doubly linked list (that is, each page has a pointer to the
previous and next pages). Both clustered and nonclustered indexes are
organized as B-trees.
Nonclustered index A nonclustered index row contains the index key
values and a pointer to the data row on the base table. Nonclustered
indexes can be created on both heaps and clustered indexes. Each table
can have up to 999 nonclustered indexes, but usually, you should keep this
number to a minimum. A nonclustered index can optionally contain
nonkey columns when using the INCLUDE clause, which are particularly
useful when covering a query.
Unique index As the name suggests, a unique index does not allow two
rows of data to have identical key values. A table can have more than one
unique index, although this is not very common. By default,
unique indexes are created as nonclustered indexes unless you specify
otherwise.
Primary key A primary key uniquely identifies each record in the table
and creates a unique index, which, by default, will also be a clustered
index. In addition to the uniqueness property required for the unique
index, its key columns are required to be defined as NOT NULL. By
definition, only one primary key can be defined on a table.
The code generated by the Table Designer will explicitly request a clustered
index for the primary key, as in the following code (but you usually don’t see
such code):
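A sketch of what that generated code looks like; the table, column, and constraint names are hypothetical:

ALTER TABLE dbo.MyTable ADD CONSTRAINT
    PK_MyTable PRIMARY KEY CLUSTERED
    (
    MyTableID
    ) WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
            ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]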
Creating a clustered index along with a primary key can have some
performance consequences, so it is important that you understand this is the
default behavior. Obviously, it is also possible to have a primary key that is a
nonclustered index, but this needs to be explicitly specified. Changing the
previous code to create a nonclustered index will look like the following
statement, where the CLUSTERED clause was changed to NONCLUSTERED:
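A sketch of the nonclustered version, using the same hypothetical names:

ALTER TABLE dbo.MyTable ADD CONSTRAINT
    PK_MyTable PRIMARY KEY NONCLUSTERED
    (
    MyTableID
    ) WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
            ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]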
Unique If a clustered index is not defined using the UNIQUE clause, SQL
Server will add a 4-byte uniquifier to each record, increasing the size of
the clustered index key. As a comparison, an RID used by nonclustered
indexes on heaps is only 8 bytes long.
Narrow As mentioned earlier in this chapter, because every row in every
nonclustered index contains, in addition to the columns defining the index,
the clustered index key to point to the corresponding row on the base
table, a small size key could greatly benefit the amount of resources used.
A small key size will require less storage and memory, which will also
benefit performance. Again, as a comparison, an RID used by
nonclustered indexes on heaps is only 8 bytes long.
Static or nonvolatile Updating a clustered index key can have some
performance consequences, such as page splits and fragmentation created
by the row relocation within the clustered index. In addition, because
every nonclustered index row contains the clustered index key, the
corresponding rows in each nonclustered index will have to be updated as well
to reflect the new clustered key value.
Ever increasing A clustered index key would benefit from having ever-
increasing values instead of more random values, such as those in a last
name column, for example. Having to insert new rows based on random
entry points creates page splits and therefore fragmentation. On the other
hand, you need to be aware that in some cases, having ever-increasing
values can also cause contention, as multiple processes could be writing
on the last page in a table, which could result in locking and latching
bottlenecks.
Statistics
The SQL Server query optimizer is a cost-based optimizer; therefore, the
quality of the execution plans it generates is directly related to the accuracy of
its cost estimations. In the same way, the estimated cost of a plan is based on
the algorithms or operators used as well as their cardinality estimations. For
this reason, to estimate the cost of an execution plan correctly, the query
optimizer needs to estimate, as precisely as possible, the number of records
returned by a given query. During query optimization, SQL Server explores
many candidate plans, estimates their relative costs, and selects the most
efficient one. As such, incorrect cardinality and cost estimation may cause the
query optimizer to choose inefficient plans, which can have a negative impact
on the performance of your database.
SQL Server creates and maintains statistics to enable the query optimizer to
calculate cardinality estimation. A cardinality estimate is the estimated number
of rows that will be returned by a query or by a specific query operation such as
a join or a filter. Selectivity is a concept similar to cardinality estimation, which
can be described as the fraction of rows in a set that satisfy a predicate, and it is
always a value between 0 and 1, inclusive. A highly selective predicate returns
a small number of rows.
Trace flag 2371 was introduced with SQL Server 2008 R2 Service Pack 1 as
a way to update statistics automatically in a lower and dynamic percentage rate,
instead of the mentioned 20 percent threshold. With this dynamic percentage
rate, the higher the number of rows in a table, the lower this threshold will
become to trigger an automatic update of statistics. Tables with fewer than
25,000 records will still use the 20 percent threshold, but as the number of
records in the table increases, this threshold will be lower and lower. For more
details about this trace flag, see the article “Changes to Automatic Update
Statistics in SQL Server – Traceflag 2371,” located at
https://blogs.msdn.microsoft.com/saponsqlserver/2011/09/07/changes-to-automatic-update-statistics-in-sql-server-traceflag-2371/.
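If you want to try this behavior on one of those older versions, a sketch of enabling the trace flag globally until the next restart follows; on Linux a startup trace flag can instead be set with the mssql-conf utility covered in the previous chapter, and starting with SQL Server 2016 this behavior is the default under the newer compatibility levels, so the flag is no longer needed:

-- Enable trace flag 2371 globally until the next restart (test systems only)
DBCC TRACEON (2371, -1);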
The density information on multicolumn statistics might improve the quality
of execution plans in the case of correlated columns or statistical correlations
between columns. As mentioned previously, density information is kept for all
the columns in a statistics object, in the order that the columns appear in the
statistics definition. By default, SQL Server assumes columns are independent;
therefore, if a relationship or dependency exists between columns, multicolumn
statistics can help with cardinality estimation problems in queries that are using
these columns. Density information will also help on filters and GROUP BY
operations. Filtered statistics can also be used for cardinality estimation
problems with correlated columns.
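For example, a sketch of manually creating a multicolumn statistics object on two possibly correlated columns; the statistics name and column choice are only illustrative:

CREATE STATISTICS stats_City_PostalCode
ON Person.Address (City, PostalCode)
WITH FULLSCAN;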
The original cardinality estimator was written along with the entire query
processor for SQL Server 7.0, which was released back in December 1998.
Obviously, this component has undergone multiple changes over the years and
across multiple releases of SQL Server, including fixes, adjustments, and
extensions to accommodate cardinality estimation for new T-SQL features.
You may be thinking, why replace a component that has been used
successfully for about the last 15 years?
In the 2012 paper “Testing Cardinality Estimation Models in SQL Server”
by Campbell Fraser, et al., the authors explain some of the reasons for the
redesign of the cardinality estimator, including the following:
This is the resulting output, with the EstimateRows column manually moved
just after the Rows column and edited to fit the page:
Using this output, you can easily compare the actual number of rows, shown
on the Rows column, against the estimated number of records, shown on the
EstimateRows column, for each operator in the plan. Introduced with SQL
Server 2012, the inaccurate_cardinality_estimate extended event can also
be used to detect inaccurate cardinality estimates by identifying which query
operators output significantly more rows than those estimated by the query
optimizer.
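One simple way to produce this kind of Rows versus EstimateRows output is SET STATISTICS PROFILE, which executes the query and returns the actual plan in text form; a sketch using the Person.Address query from earlier in the chapter:

SET STATISTICS PROFILE ON
GO
SELECT DISTINCT(City) FROM Person.Address
GO
SET STATISTICS PROFILE OFF
GO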
Because each operator relies on previous operations for its input, cardinality
estimation errors can propagate exponentially throughout the query plan. For
example, a cardinality estimation error on a Filter operator can impact the
cardinality estimation of all the other operators in the plan that consume the
data produced by that operator. If your query is not performing well and you
find cardinality estimation errors, check for problems such as missing or out-
of-date statistics, very small samples being used, correlation between columns,
use of scalar expressions, guessing selectivity issues, and so on.
Recommendations to help with these issues may include topics such as
using the auto-create and auto-update statistics default configurations, updating
statistics using WITH FULLSCAN, avoiding local variables in queries, avoiding
nonconstant-foldable or complex expressions on predicates, using computed
columns, and considering multicolumn or filtered statistics, among other
things. In addition, parameter-sniffing and parameter-sensitive queries are
covered in more detail in the next section. That’s a fairly long list, but it should
help convince you that you are already armed with pragmatically useful
information.
Some SQL Server features, such as table variables, have no statistics, so you
might want to consider instead using a temporary table or a standard table if
you’re having performance problems related to cardinality estimation errors.
Multistatement table-valued user-defined functions have no statistics either. In
this case, you can consider using a temporary table or a standard table as a
temporary holding place for their results. In both these cases (table variables
and multistatement table-valued user-defined functions), the query optimizer
will guess at one row (which has been updated to 100 rows for multistatement
table-valued user-defined functions in SQL Server 2014). In addition, for
complex queries that are not performing well because of cardinality estimation
errors, you may want to consider breaking down the query into two or more
steps while storing the intermediate results in temporary tables. This will
enable SQL Server to create statistics on the intermediate results, which will
help the query optimizer to produce a better execution plan. More details about
breaking down complex queries are covered at the end of this chapter.
NOTE
Trace flag 2453, available starting with SQL Server 2012 Service Pack 2,
can be used to provide better cardinality estimation while using table
variables. For more details, see https://support.microsoft.com/kb/2952444.
Statistics Maintenance
As mentioned, the query optimizer will, by default, automatically update
statistics when they are out of date. Statistics can also be updated with the
UPDATE STATISTICS statement, which you can schedule to run as a
maintenance job. Another statement commonly used, sp_updatestats, also
runs UPDATE STATISTICS behind the scenes.
There are two important benefits of updating statistics in a maintenance job.
The first is that your queries will use updated statistics without having to wait
for the automatic update of statistics to be completed, thus avoiding delays in
the optimization of your queries (although asynchronous statistics updates can
also be used to partially help with this problem). The second benefit is that you
can use a bigger sample than the query optimizer will use, or you can even scan
the entire table. This can give you better quality statistics for big tables,
especially for those where data is not randomly distributed in their data pages.
Manually updating statistics can also be beneficial after operations that modify
large amounts of data, such as batch data loads, are performed.
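A sketch of both approaches; the table name is only illustrative:

-- Update all statistics on a table, scanning every row
UPDATE STATISTICS Sales.SalesOrderDetail WITH FULLSCAN;

-- Or update statistics across the database using the default sampling
EXEC sp_updatestats;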
On the other hand, note that the update of statistics will cause a recompiling
of plans already in the plan cache that are using these statistics, so you may not
want to update statistics too frequently, either.
An additional consideration for manually updating statistics in a
maintenance job is how they relate to index rebuild maintenance jobs, which
also update the index statistics. Keep the following items in mind when
combining maintenance jobs for both indexes and statistics, remembering that
there are both index and nonindex column statistics and that index operations
obviously may impact only the first of these:
NOTE
I strongly recommend Ola Hallengren’s “SQL Server Maintenance Solution”
for backups, integrity checks, and index and statistics maintenance. You can
find this solution at https://ola.hallengren.com.
Parameter Sniffing
SQL Server can use the histogram of statistics objects to estimate the
cardinality of a query and then use this information to try to produce an optimal
execution plan. The query optimizer accomplishes this by first inspecting the
values of the query parameters. This behavior, called parameter sniffing, is a
very good thing: getting an execution plan tailored to the current parameters of
a query naturally improves the performance of your applications. The plan
cache can store these execution plans so that they can be reused the next time
the same query needs to be executed. This saves optimization time and CPU
resources because the query does not need to be optimized again.
However, although the query optimizer and the plan cache work well
together most of the time, some performance problems can occasionally appear.
Given that the query optimizer can produce different execution plans for
syntactically identical queries, depending on their parameters, caching and
reusing only one of these plans may create a performance issue for alternative
instances of this query that would benefit from a better plan. This is a known
problem with T-SQL code using explicit parameterization, such as stored
procedures. In this section, I’ll show you more details about this problem,
along with a few recommendations on how to fix it.
To see an example, let’s write a simple stored procedure using the Sales
.SalesOrderDetail table on the AdventureWorks database.
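A sketch consistent with the plans discussed next; the procedure name is illustrative:

CREATE PROCEDURE dbo.test (@pid int)
AS
SELECT * FROM Sales.SalesOrderDetail
WHERE ProductID = @pid;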
Run the following statement to execute the stored procedure requesting the
actual execution plan:
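A sketch, assuming the dbo.test procedure above and that 897 is a highly selective ProductID value:

EXEC dbo.test @pid = 897;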
The query optimizer estimates that only a few records will be returned by
this query, and it produces the execution plan shown in Figure 5-8, which uses
an Index Seek operator to quickly find the records on an existing nonclustered
index, and a Key Lookup operator to search on the base table for the remaining
columns requested by the query.
Figure 5-8 Plan using Index Seek and Key Lookup operators
This combination of Index Seek and Key Lookup operators was a good
choice, because, although it’s a relatively expensive combination, the query
was highly selective. However, what if a different parameter is used, producing
a less selective predicate? For example, try the following query, including a SET
STATISTICS IO ON statement to display the amount of disk activity generated
by the query:
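A sketch using a much less selective value; 870 is one such value in AdventureWorks2014:

SET STATISTICS IO ON;
GO
EXEC dbo.test @pid = 870;
GO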
As you can see, on this execution alone, SQL Server is performing 18,038
logical reads even though the base table has only 1,246 pages; in other words, it
is using more than 14 times the I/O of simply scanning the entire
table. The reason for this difference is that performing Index Seeks plus Key
Lookups on the base table, which uses random I/Os, is a very expensive
operation. Note that you may get slightly different values in your own copy of
the AdventureWorks database.
Now clear the plan cache to remove the execution plan currently held in
memory and then run the stored procedure again, using the same parameter, as
shown next:
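A sketch; once again, DBCC FREEPROCCACHE is appropriate only on a test system:

DBCC FREEPROCCACHE;
GO
EXEC dbo.test @pid = 870;
GO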
This time, you’ll get a totally different execution plan. The I/O information
now will show that only 1246 pages were read, and the execution plan will
include a Clustered Index Scan, as shown in Figure 5-9. Because this time,
there was not a plan for the stored procedure in the plan cache, SQL Server
optimized it from scratch using the new parameter and created a new optimal
execution plan.
Figure 5-9 Plan using a Clustered Index Scan
Of course, this doesn’t mean you’re not supposed to trust your stored
procedures any more or that maybe all your code is incorrect. This is just a
problem that you need to be aware of and research, especially if you have
queries where performance changes dramatically when different parameters are
introduced. If you happen to have this problem, you have a few choices
available, which we’ll explore next.
Another related problem is that you don’t have control over the lifetime of a
plan in the cache, so every time a plan is removed from the cache, the newly
created execution plan may depend on whichever parameter happens to be
passed next. Some of the following choices enable you to have a certain degree
of plan stability by asking the query optimizer to produce a plan based on a
typical parameter or the average column density.
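The example that follows uses the OPTIMIZE FOR query hint; a sketch of what it might look like, reusing the dbo.test procedure, treating 897 as the typical parameter value, and then executing the procedure with a different value:

ALTER PROCEDURE dbo.test (@pid int)
AS
SELECT * FROM Sales.SalesOrderDetail
WHERE ProductID = @pid
OPTION (OPTIMIZE FOR (@pid = 897));
GO
EXEC dbo.test @pid = 870;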
You can find the following entry close to the end of the XML plan (or the
Parameter List property of the root node in a graphical plan):
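The fragment looks roughly like the following; the exact attributes vary by version:

<ParameterList>
  <ColumnReference Column="@pid" ParameterCompiledValue="(897)"
                   ParameterRuntimeValue="(870)" />
</ParameterList>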
This entry clearly shows which parameter value was used during
optimization and which one was used during execution. In this case, the stored
procedure is optimized only once, and the plan is stored in the plan cache and
reused as many times as needed. The benefit of using this hint, in addition to
avoiding optimization cost, is that you have total control over which plan is
produced during the query optimization and stored in the plan cache. The
OPTIMIZE FOR query hint can also enable you to use more than one parameter,
separated by commas.
Using OPTION (RECOMPILE) can also allow the values of local variables to
be sniffed, as shown in the next section.
Using Local Variables and the OPTIMIZE FOR
UNKNOWN Hint
Another solution that has been traditionally implemented in the past is the use
of local variables in queries instead of parameters. The query optimizer is not
able to see the values of local variables at optimization time because these
values are known only at execution time. However, by using local variables,
you are disabling parameter sniffing, which basically means that the query
optimizer will not be able to access the statistics histogram to find an optimal
plan for the query. Instead, it will rely on just the density information of the
statistics object.
This solution will simply ignore the parameter values and use the same
execution plan for all the executions, but at least you’re getting a consistent
plan every time. A variation of the OPTIMIZE FOR hint is the OPTIMIZE FOR
UNKNOWN hint. This hint was introduced with SQL Server 2008 and has the same
effect as using local variables. A benefit of the OPTIMIZE FOR UNKNOWN hint
compared with OPTIMIZE FOR is that it does not require you to specify a value
for a parameter. Also, you don’t have to worry if a specified value becomes
atypical over time.
Running the following two versions of our stored procedure will have
equivalent outcomes and will produce the same execution plan. The first
version uses local variables, and the second one uses the OPTIMIZE FOR
UNKNOWN hint.
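Sketches of both versions, again based on the dbo.test procedure used throughout this section:

-- Version 1: copy the parameter into a local variable
ALTER PROCEDURE dbo.test (@pid int)
AS
DECLARE @lpid int = @pid;
SELECT * FROM Sales.SalesOrderDetail
WHERE ProductID = @lpid;
GO

-- Version 2: keep the parameter but ask for a plan based on average density
ALTER PROCEDURE dbo.test (@pid int)
AS
SELECT * FROM Sales.SalesOrderDetail
WHERE ProductID = @pid
OPTION (OPTIMIZE FOR UNKNOWN);
GO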
In this case, the query optimizer will create the plan using the Clustered
Index Scan shown previously, no matter which parameter you use to execute
the stored procedure. Note that the OPTIMIZE FOR UNKNOWN query hint will
apply to all the parameters used in a query unless you use the following syntax
to target only a specific parameter:
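A sketch of that syntax, followed by the local-variable variation with OPTION (RECOMPILE) that the next paragraph describes; the 897 value is the one referenced there:

-- Target only the @pid parameter
ALTER PROCEDURE dbo.test (@pid int)
AS
SELECT * FROM Sales.SalesOrderDetail
WHERE ProductID = @pid
OPTION (OPTIMIZE FOR (@pid UNKNOWN));
GO

-- Local variable whose value can be sniffed thanks to OPTION (RECOMPILE)
DECLARE @lpid int = 897;
SELECT * FROM Sales.SalesOrderDetail
WHERE ProductID = @lpid
OPTION (RECOMPILE);
GO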
The query optimizer will be able to see the value of the local variable (in
this case, 897) and get a plan optimized for that specific value (in this case, the
plan with the Index Seek/Key Lookup operations, instead of the plan with the
Clustered Index Scan, shown earlier when no value could be sniffed). Finally,
the benefit of using the OPTIMIZE FOR UNKNOWN hint is that you need to
optimize the query only once and can reuse the produced plan many times.
Also, there is no need to specify a value like in the OPTIMIZE FOR hint.
In the paper “When to Break Down Complex Queries,” which you can find
at https://blogs.msdn.microsoft.com/sqlcat/2013/09/09/when-to-break-down-complex-queries,
the author describes several problematic query patterns for
which the SQL Server query optimizer is not able to create good plans.
Although the paper was published in October 2011 and indicates that it applies
to versions from SQL Server 2005 to SQL Server code-named “Denali,” I was
still able to see the same behavior in the most recent versions of SQL Server.
Hints
SQL is a declarative language; it defines only what data to retrieve from the
database. It doesn’t describe the manner in which the data should be fetched.
That, as we know, is the job of the query optimizer, which analyzes a number
of candidate execution plans for a given query, estimates the cost of each of
these plans, and selects an efficient plan by choosing the cheapest of the
choices considered.
But there may be cases when the execution plan selected is not performing
as you have expected and, as part of your query troubleshooting process, you
may try to find a better plan yourself. Before doing this, keep in mind that just
because your query does not perform as you expected, this does not mean a
better plan is always possible. Your plan may be an efficient one, but the query
may be an expensive one to perform, or your system may be experiencing
performance bottlenecks that are impacting the query execution.
However, although the query optimizer does an excellent job most of the
time, it does occasionally fail to produce an efficient plan. That being said,
even in cases when you’re not getting an efficient plan, you should try to
distinguish between those times when the problems arise because you’re not
providing the query optimizer with all the information it needs to do a good job
and those times when the problems are a result of a query optimizer limitation.
Part of the focus of this chapter has been to help you to provide the query
optimizer with the information it needs to produce an efficient execution plan,
such as the right indexes and good quality statistics, and also how to
troubleshoot the cases when you are not getting a good plan.
Having said that, there might be cases when the query optimizer just gets it
wrong, and because of that we may be forced to resort to hints. Hints are
essentially optimizer directives that enable us to take explicit control over the
execution plan for a given query, with the goal of improving its performance.
In reaching for a hint, however, we are going against the declarative property of
the SQL language and, instead, giving direct instructions to the query
optimizer. Overriding the query optimizer is risky business; hints need to be
used with caution and only as a last resort when no other option is available to
produce a viable plan.
Types of Hints
SQL Server provides a wide range of hints, which can be classified as follows
(a brief example of each type appears after this list):
Query hints: Tell the optimizer to apply the hint throughout the entire
query. They are specified using the OPTION clause, which is included at the
end of the query.
Join hints: Apply to a specific join in a query and can be specified by
using ANSI-style join hints.
Table hints: Apply to a single table and are usually included using the
WITH keyword in the FROM clause.
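The following queries are only sketches against the AdventureWorks database to illustrate where each type of hint is placed (the index name is the one shipped with AdventureWorks):
-- Query hint: request a hash join for the entire query
SELECT soh.SalesOrderID, sod.ProductID
FROM Sales.SalesOrderHeader AS soh
JOIN Sales.SalesOrderDetail AS sod ON soh.SalesOrderID = sod.SalesOrderID
OPTION (HASH JOIN)
-- Join hint: ANSI-style hint on a specific join
SELECT soh.SalesOrderID, sod.ProductID
FROM Sales.SalesOrderHeader AS soh
INNER MERGE JOIN Sales.SalesOrderDetail AS sod ON soh.SalesOrderID = sod.SalesOrderID
-- Table hint: force the use of a specific index
SELECT * FROM Sales.SalesOrderDetail WITH (INDEX (IX_SalesOrderDetail_ProductID))
WHERE ProductID = 897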
NOTE
Locking hints do not affect plan selection, so they are not covered here.
Summary
This chapter provided a quick introduction to query tuning and optimization.
This material should be helpful for every database professional, and it also
provides the background you may need to understand the following chapters of
the book, which focus on new features in SQL Server 2017, especially those
related to query processing topics.
The chapter started with a foundation into the architecture of the query
processor, explaining its components and describing the work it performs.
There are multiple tools to troubleshoot query performance, and this chapter
covered execution plans, several useful DMVs and DMFs, and the SET
STATISTICS TIME and SET STATISTICS IO statements. Some other important
tools and features, such as the query store, extended events, and SQL trace,
were mentioned but were not in scope for the chapter. A basic introduction to
indexes, statistics, and parameter sniffing was provided as well, closing with
the coverage of query processor limitations and hints.
In the next chapter, I will cover adaptive query processing, automatic tuning,
and some other query processing–related topics, all new with SQL Server
2017.
Chapter 6
In This Chapter
Adaptive Query Processing
Automatic Tuning
SQL Server 2016 Service Pack 1
USE HINT Query Option
CXPACKET and CXCONSUMER Waits
Wait Statistics on Execution Plans
Recent Announcements
Summary
Chapter 5 provided an introduction to SQL Server query tuning and
optimization. This material is required reading for all SQL Server
professionals and was intended to serve as a foundation for the
contents of this chapter. The information in Chapter 5 is applicable to
all the previous versions of SQL Server.
In this chapter, I will cover what is new in query processing in SQL
Server 2017. In addition, a lot of new features were released early with SQL
Server 2016 Service Pack 1. I don’t remember a previous SQL Server
service pack bringing so many changes and new features, so I will cover
some of them briefly here as well.
Undoubtedly, the most important change brought by SQL Server 2016
Service Pack 1 was that, for the first time in SQL Server history, it provided
a consistent programmability surface area for developers across all SQL
Server editions. As a result, previously Enterprise Edition–only features,
such as columnstore indexes, In-Memory OLTP, Always Encrypted,
compression, partitioning, database snapshots, row-level security, and
dynamic data masking, among others, were made available on all SQL
Server editions, such as Express, Web, Standard, and Enterprise.
Most of this chapter, however, will focus on the most important query
processing innovations in SQL Server 2017: adaptive query processing and
automatic tuning. Adaptive query processing offers a new generation of
query processing features that enable the query optimizer to make runtime
adjustments to statistics and execution plans and discover additional
information that can lead to better query performance. Several features were
released with SQL Server 2017, and more are promised to be released in the
future. Automatic tuning is a very ambitious feature that offers automatic
detection and fixes for plan regressions. Another feature, automatic index
management, which is available only in Azure SQL Database, lets you
create recommended indexes automatically or drop indexes that are no
longer used.
In addition to these features, SQL Server 2017 includes other benefits,
such as the ability to resume online index rebuild operations; new DMVs
such as sys.dm_db_stats_histogram, sys.dm_os_host_info, and
sys.dm_db_log_info; support for LOB columns on clustered columnstore
indexes; and improvements on In-Memory OLTP, the query store, and the
Database Engine Tuning Advisor (DTA) tool. Resumable online index
rebuilds can be useful to resume online index rebuild operations after a
failure or to pause them manually and then resume for maintenance reasons.
The sys.dm_db_stats_histogram DMV returns the statistics histogram for
the specified database object, similar to the DBCC SHOW_STATISTICS
statement. The sys.dm_os_host_info DMV has been added to return
operating system information for both Windows and Linux. Finally, the
sys.dm_db_log_info can be used to return virtual log file (VLF) information
similar to the DBCC LOGINFO statement.
This chapter will also cover a few other query processing features
released in SQL Server in the last few months, including some that are
included in SQL Server 2016 Service Pack 1.
Microsoft has also mentioned that more features will be added in the
future, including adaptive memory grant feedback for row mode, which was
announced recently.
As mentioned in Chapter 5, the estimated cost of a plan is based on the
query cardinality estimation, as well as on the algorithms or operators used by
the plan. For this reason, to estimate the cost of an execution plan correctly,
the query optimizer needs to estimate the number of records returned by a
given query. During query optimization, SQL Server explores many
candidate plans, estimates their relative costs, and selects the most efficient
one. As such, incorrect cardinality and cost estimation may cause the query
optimizer to choose inefficient plans, which can have a negative impact on
the performance of your database.
In addition, cost estimation is inexact and has some known limitations,
especially when it comes to the estimation of the intermediate results in a
plan. Errors in intermediate results in effect get magnified as more tables
are joined and more estimation errors are included within the calculations.
On top of all that, some operations are not covered by the mathematical
model of the cardinality estimation component, which means the query
optimizer has to resort to guess logic or heuristics to deal with these
situations. For example, some SQL Server features such as table variables
and multistatement table-valued user-defined functions have no support for
statistics, so the query optimizer has to rely on a fixed guess: 100 rows for
multistatement table-valued functions (or 1 row in versions prior to SQL Server
2014) and 1 row for table variables.
Traditionally, if a bad cardinality estimation contributed to a suboptimal
execution plan, no additional changes were allowed after that and the plan
was used to execute the query anyway. So if the estimates were incorrect,
the created plan was still used despite the fact that it may be a suboptimal
plan. Adaptive query processing offers some improvements to this
traditional query optimization model. Let us next cover the first three
adaptive query processing algorithms available with SQL Server 2017, starting
with batch mode adaptive joins. An adaptive join can be used only when the
following requirements are met:
The query must benefit from either a hash join or a nested loops join.
This means that if the third kind of physical join, merge join, is a better
choice, the adaptive join will not be used.
The query must use a columnstore index, or, at least, a columnstore
index must be defined on any of the tables referenced by the query.
The generated alternative solutions of the hash join and the nested
loops join should have the same first input, called build input or outer
reference, respectively.
Let’s now try the batch mode adaptive joins by creating a columnstore
index on the AdventureWorks database. For this and the following
exercises, make sure your database is in compatibility level 140 by running
the following statement:
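For example, assuming the AdventureWorks sample database used throughout the book (the columnstore index name and column list are only an assumption for this exercise):
ALTER DATABASE AdventureWorks SET COMPATIBILITY_LEVEL = 140
GO
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_SalesOrderDetail_ColumnStore
ON Sales.SalesOrderDetail (SalesOrderID, ProductID, OrderQty, UnitPrice)
GO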
NOTE
Figure 6-2 shows the operator information from the actual execution
plan. If you followed the live query statistics plan, you will see the
estimated properties only.
You can now run the last SELECT statement again. Using this workaround
will obviously not provide the performance benefit of the columnstore
index, but the query optimizer can still use an adaptive join while accessing
the row stores, if doing so provides a performance benefit measured, as usual,
by cardinality and cost estimation. Figure 6-4 shows the plan selected for this query.
Figure 6-4 Adaptive join query plan using row stores
In the first case, usually due to bad cardinality estimations, the query
optimizer may have underestimated the amount of memory required for the
query and, as a result, the sort data or the build input of the hash operations
may not fit into such memory. In the second case, SQL Server estimated the
minimum memory needed to run the query, called required memory, but
since there is not enough memory in the system, the query will have to wait
until this memory is available. This problem is increased when, again due to
a bad cardinality estimation, the amount of required memory is
overestimated, leading to wasted memory and reduced concurrency.
As mentioned, bad cardinality estimations may happen for several
different reasons, and there is no single solution that can work for all the
possible statements. For example, some features such as table variables or
multistatement table-valued functions have a fixed and very small cardinality
estimate. In other cases, you may be able to fix the problem, improving the
quality of the statistics, for example, by increasing the sample size.
Batch mode adaptive memory grant feedback was designed to help with
these situations by recalculating the memory required by a query and
updating it in the cached query plan. The memory grant feedback may get
information from spill events or from the amount of memory really used.
Although this improved memory estimate may not help the first execution
of the query, it can be used to improve the performance of the following
executions. The batch mode adaptive memory grant process is in fact
learning and getting feedback from real runtime information.
Finally, the batch mode memory grant feedback may not be useful and
will be automatically disabled if the query has a very unstable and high
variation on memory requirements, which could be possible, for example,
with parameter-sensitive queries.
Let's try an example to see how the batch mode memory grant feedback
works. We are using a table variable on purpose because, as you know, it has
a low fixed cardinality estimate, which helps us create a low memory
estimate. Run the following code and request the execution plan:
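The original listing is not reproduced here; a minimal sketch following the same idea, a table variable joined to a larger table with a sort, and relying on the columnstore index created earlier for batch mode, might look like this:
DECLARE @SalesOrderHeader TABLE (SalesOrderID int, OrderDate datetime)
INSERT INTO @SalesOrderHeader
SELECT SalesOrderID, OrderDate FROM Sales.SalesOrderHeader
SELECT soh.SalesOrderID, soh.OrderDate, sod.ProductID, sod.OrderQty
FROM @SalesOrderHeader AS soh
JOIN Sales.SalesOrderDetail AS sod ON soh.SalesOrderID = sod.SalesOrderID
ORDER BY sod.UnitPrice DESC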
Figure 6-5 shows the created plan.
Notice the following warning on the hash join operator (you may need to
look at the hash join operator properties or hover over the operator in the
graphical plan): “Operator used tempdb to spill data during execution with
spill level 2 and 1 spilled thread(s).” If you run the query a few more times,
the hash join warning may not appear and query execution may be faster. In
my system, the average duration went from 10 seconds to 3 seconds.
This was the original memory grant information from the XML plan:
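The fragment below is illustrative only and is trimmed to the two attributes discussed; the remaining attributes of the MemoryGrantInfo element are omitted:
<MemoryGrantInfo ... RequestedMemory="1056" GrantedMemory="2080" ... />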
Next is the revised memory grant information later in the same execution
plan. As you can see, RequestedMemory and GrantedMemory went from
1,056 and 2,080, respectively, to 34,224. Values are in kilobytes (KB).
Optionally, you can use some extended events to get more information
about how the process works. The
spilling_report_to_memory_grant_feedback event will fire when batch
mode iterators report spilling to the memory grant feedback process. In my
case, the spilled_data_size field showed 6,744,890, which is the spilled data
size in bytes.
The memory_grant_updated_by_feedback event occurs when the
memory grant is updated by feedback. In my test, the event showed the
following values for some of its fields:
First let us look at the behavior before SQL Server 2017 by changing the
database compatibility level to 130:
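A sketch follows, assuming a hypothetical multistatement table-valued function over Sales.SalesOrderHeader; the actual function and query used in the book are not reproduced here, but the behavior described next is the same:
ALTER DATABASE AdventureWorks SET COMPATIBILITY_LEVEL = 130
GO
CREATE OR ALTER FUNCTION dbo.ufn_GetOrders (@date datetime)
RETURNS @orders TABLE (SalesOrderID int, OrderDate datetime)
AS
BEGIN
    INSERT INTO @orders
    SELECT SalesOrderID, OrderDate
    FROM Sales.SalesOrderHeader
    WHERE OrderDate < @date
    RETURN
END
GO
SELECT sod.ProductID, sod.OrderQty
FROM dbo.ufn_GetOrders('2014-01-01') AS o
JOIN Sales.SalesOrderDetail AS sod ON o.SalesOrderID = sod.SalesOrderID
GO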
Notice that, with a default estimated 100 rows from the multistatement
table-valued function, the query optimizer decides to use a nested loops
join, when in reality there are 37,339 rows. Obviously, in this case, a nested
loops join was not the best choice for such a large number of rows, because
it was required to execute the inner or bottom input 37,339 times. In
addition, this incorrect estimation early in the plan could potentially create
many problems, as decisions may be incorrect as data flows in the plan.
Performance problems are likely in cases where these rows are used in
downstream operations. A related example was shown earlier in the chapter,
where a bad estimation on a table variable impacted a memory grant later in
the plan.
Let’s switch back the database compatibility level to 140 to test the SQL
Server 2017 behavior:
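For example, assuming the AdventureWorks database used throughout:
ALTER DATABASE AdventureWorks SET COMPATIBILITY_LEVEL = 140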
Running the same query again returns the plan shown in Figure 6-7. You
can see that it now reflects an accurate cardinality estimate.
Now we have the plan shown in Figure 6-8, which also has a correct
estimate, but this time using a merge join.
Figure 6-8 Plan using interleaved execution version 2
Automatic Tuning
Automatic tuning is a new feature that, according to the SQL Server 2017
documentation, is intended to “provide insight into potential query
performance problems, recommend solutions, and automatically fix
identified problems.” Although this may sound extremely broad and
optimistic, and it has created a lot of hype in the SQL Server community, at
the moment its only released feature helps with plan regressions of
queries that were originally performing well. No announcements
have yet indicated whether it will cover any additional capabilities in a future
release, or what those capabilities could be. Although automatic
tuning’s current benefits are quite limited, we hope they will be
extended in the near future to cover additional query performance problems.
Automatic tuning also has a second feature, automatic index
management, which is available only on Azure SQL Database. This feature
is intended to automatically create recommended indexes or remove
indexes that are no longer used by any query. I do not cover Azure SQL
Database in this book, but you can get more details about this feature by
looking at the online documentation at https://docs.microsoft.com/en-
us/sql/relational-databases/automatic-tuning/automatic-tuning.
Automatic plan correction consists of automatically switching to the last
known good plan of a query in case a regression is detected. The database
engine will automatically force a recommended plan only when the estimated
CPU gain is higher than 10 seconds, or when the number of errors in the
current plan is higher than the number of errors in the recommended plan,
and it also verifies that the forced plan is performing better than the current
one. These thresholds currently cannot be changed.
In addition, the automatic plan correction process does not end with forcing
the last known good plan, because the database engine continues to
monitor the changes and can roll them back if the new plan is not
performing better than the replaced plan.
Note that you could also implement this process manually, identifying
the query and plan on the query store and running the
sp_query_store_force_plan procedure. This process requires that you
manually monitor the information on the query store to find the
performance regression, locate the last known good plan, apply the required
script to force the plan, and continue to monitor the performance of the
changes. All of this functionality is available by using the query store, so it
can be implemented starting with SQL Server 2016. A disadvantage of this
manual process, as hinted, is that you also have to monitor the changes
continually to see if they continue to bring performance benefits or
otherwise decide to unforce the execution plan.
Finally, note that forcing a plan should be used only as a temporary fix
while you look into a permanent solution. Forcing a plan is like a collection
of hints, which, as discussed in Chapter 5, basically disables the work of the
query optimizer. Finding a permanent solution may require applying
standard query tuning techniques.
Let’s try an example to see how the technology works by looking at
automatic plan correction. (Automatic index management will not be
covered here because it is implemented on Azure SQL Database only.)
Sometimes trying to create a real performance problem with
AdventureWorks can be a challenge, because it is a very small database.
Although one option could be to create tables with millions of rows, instead
I can show the concept using the same scripts I showed in Chapter 5 while
demonstrating parameter sniffing. So let’s start by restoring a fresh copy of
the AdventureWorks database.
In this case, let’s assume that the desired plan is the one using an index
seek and a key lookup, and if we suddenly have a table scan, we can
consider it a regression. If you have not already done so, enable the query
store on AdventureWorks by running the following statements:
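A minimal sketch of such statements:
ALTER DATABASE AdventureWorks SET QUERY_STORE = ON
ALTER DATABASE AdventureWorks SET QUERY_STORE (OPERATION_MODE = READ_WRITE)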
If you already have the query store enabled, you can just run the
following statement to purge its data and start clean:
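For example:
ALTER DATABASE AdventureWorks SET QUERY_STORE CLEAR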
Now run the following version of the stored procedure, which for the
specified parameter will create a plan using a table scan:
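A sketch, assuming the hypothetical Chapter 5 test procedure and a ProductID value, such as 870, that returns enough rows to favor a scan:
ALTER PROCEDURE test (@pid int)
AS
SELECT * FROM Sales.SalesOrderDetail
WHERE ProductID = @pid
GO
EXEC test @pid = 870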
At this point, we have a bad plan in memory that scans the entire table
for any provided parameters. Since most of the provided parameters would
return a small number of rows, they would instead benefit from the original
plan with the index seek and key lookup operators.
Run the stored procedure again for the parameter returning nine rows:
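The specific parameter value from the book is not shown here; any highly selective ProductID works for the sketch, and you can repeat the batch to give the database engine enough executions to compare:
EXEC test @pid = 897
GO 20 -- repeat the batch; the count syntax is supported by SSMS and sqlcmd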
At this point, the database engine should be able to see the performance
regression and make such information available using the
sys.dm_db_tuning_recommendations DMV. Running the following
statement will return one row (make sure you are connected to the database
you want, in this case, AdventureWorks):
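For example:
SELECT * FROM sys.dm_db_tuning_recommendations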
For example, you can use the following script to get some of the details
fields, including the script that can be used to force the plan:
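A sketch based on the documented layout of the details JSON column of this DMV:
SELECT reason, score,
    JSON_VALUE(details, '$.implementationDetails.script') AS script,
    planForceDetails.*
FROM sys.dm_db_tuning_recommendations
CROSS APPLY OPENJSON(details, '$.planForceDetails')
    WITH (  query_id int '$.queryId',
            regressed_plan_id int '$.regressedPlanId',
            recommended_plan_id int '$.recommendedPlanId' ) AS planForceDetails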
NOTE
Running this query requires the use of the SQL Server 2017 compatibility
level, so make sure your database is set to COMPATIBILITY_LEVEL
140.
The query returns some of the information discussed earlier, plus the
fields <script>, which is the script to execute in case you decide to force a
plan manually; <queryId>, which is the query_id of the regressed query;
<regressedPlanId>, which is the plan_id of the regressed plan; and
<recommendedPlanId>, which is the plan_id of the recommended plan. For
this example, my created script was the following, which basically will
force the plan with plan_id 1 for the query with query_id 1.
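That script is essentially a call to sp_query_store_force_plan; for query_id 1 and plan_id 1 it would be equivalent to the following:
EXEC sp_query_store_force_plan @query_id = 1, @plan_id = 1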
So far I have covered how to force a plan manually. This requires that I
manually monitor the information on the query store to find the
performance regression, find the last known good plan, apply the required
script to force the plan, and continue to monitor the performance of the
changes. All of this functionality is available by using the query store,
however, so it can be implemented starting with SQL Server 2016. What is
new in SQL Server 2017 is the ability to enable the database engine to
implement these changes automatically. The
sys.dm_db_tuning_recommendations DMV is also new in SQL Server
2017.
In addition, the new sys.database_automatic_tuning_options DMV can
help you track the automatic tuning options for a specific database. Run the
following statement to inspect the current configuration:
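For example, the first statement returns the current options; the ALTER DATABASE statement after it is how you would turn the feature on for the current database:
SELECT * FROM sys.database_automatic_tuning_options
ALTER DATABASE CURRENT SET AUTOMATIC_TUNING (FORCE_LAST_GOOD_PLAN = ON)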
Notice that this automatic tuning option requires the query store. Trying
to enable it without the query store will generate the following error
message:
For more details about SQL Server 2016 Service Pack 1, refer to the
online article at https://blogs.msdn.microsoft.com/sqlreleaseservices/sql-
server-2016-service-pack-1-sp1-released/.
NOTE
A list of supported QUERYTRACEON trace flags is available at
https://support.microsoft.com/en-us/help/2801413/enable-plan-affecting-
sql-server-query-optimizer-behavior-that-can-be.
Let’s take a look at a couple of examples. First of all, run the following
query not as a system administrator:
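The exact query from the book is not reproduced here; as a sketch, the test stored procedure could be altered to use the QUERYTRACEON hint with trace flag 4136, which disables parameter sniffing, and then executed without sysadmin permissions:
ALTER PROCEDURE test (@pid int)
AS
SELECT * FROM Sales.SalesOrderDetail
WHERE ProductID = @pid
OPTION (QUERYTRACEON 4136)
GO
EXEC test @pid = 897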
Trying to run the query without sysadmin permissions will produce the
following error message:
In this case, running the test stored procedure as before will get an
estimation of 456.079 and a new plan using a table scan.
NOTE
Chapter 5 also covered using the OPTIMIZE FOR UNKNOWN hint to
obtain exactly the same behavior.
The following example will produce exactly the same behavior using the
DISABLE_PARAMETER_SNIFFING hint name without requiring elevated
privileges:
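A sketch using the same hypothetical procedure; the list of hint names supported by your instance can also be obtained from the sys.dm_exec_valid_use_hints DMV, shown after it:
ALTER PROCEDURE test (@pid int)
AS
SELECT * FROM Sales.SalesOrderDetail
WHERE ProductID = @pid
OPTION (USE HINT('DISABLE_PARAMETER_SNIFFING'))
GO
EXEC test @pid = 897
SELECT name FROM sys.dm_exec_valid_use_hints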
SQL Server 2017 returns the following hint names, which are also listed
in Table 6-1:
DISABLE_OPTIMIZED_NESTED_LOOP
FORCE_LEGACY_CARDINALITY_ESTIMATION
ENABLE_QUERY_OPTIMIZER_HOTFIXES
DISABLE_PARAMETER_SNIFFING
ASSUME_MIN_SELECTIVITY_FOR_FILTER_ESTIMATES
ASSUME_JOIN_PREDICATE_DEPENDS_ON_FILTERS
ENABLE_HIST_AMENDMENT_FOR_ASC_KEYS
DISABLE_OPTIMIZER_ROWGOAL
FORCE_DEFAULT_CARDINALITY_ESTIMATION
Now you can run the original version of our test stored procedure and
get the behavior of parameter sniffing disabled without any hint:
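This relies on the PARAMETER_SNIFFING database-scoped configuration; a sketch, again assuming the hypothetical test procedure in its original form:
ALTER DATABASE SCOPED CONFIGURATION SET PARAMETER_SNIFFING = OFF
GO
EXEC test @pid = 897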
Running the procedure will get the estimate of 456.079 and the plan
using a table scan. Don’t forget to enable parameter sniffing again if you
tried the previous examples:
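For example:
ALTER DATABASE SCOPED CONFIGURATION SET PARAMETER_SNIFFING = ON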
NOTE
It is also possible to disable parameter sniffing at the server level,
enabling the trace flag 4136 mentioned earlier for the entire SQL Server
instance. Keep in mind, however, that doing this at the server level may
be an extreme solution and should be tested carefully.
NOTE
SQL Server 2017 Cumulative Update 3 was not yet available at the time
of this writing, although according to the announcement, CXPACKET
and CXCONSUMER waits are already implemented on Azure SQL
Database.
You can see the information in the WaitStats section of the graphical
plan. Next is an XML fragment of the same information I got on my test
execution:
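The fragment below is illustrative only; the attribute values will vary on every execution:
<WaitStats>
  <Wait WaitType="CXPACKET" WaitTimeMs="923" WaitCount="25" />
</WaitStats>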
The information shows the wait type, the wait time in milliseconds, and
the wait count or times the wait occurred for the duration of the query.
Finally, there has been some discussion in the SQL Server community,
and a Microsoft Connect request was submitted indicating that not all the
important waits are reported on the execution plan. The request is still open,
and you can read the details online at
connect.microsoft.com/SQLServer/feedback/details/3137948.
Recent Announcements
Another of the major changes in SQL Server 2017 is the new servicing
model. Although service packs will still be used for SQL Server 2016 and
previous supported versions, no more service packs will be released for
SQL Server 2017 and later. The new servicing model will be based on
cumulative updates (and General Distribution Releases, or GDRs, when
required).
Cumulative updates will be released more often at first and then less
frequently in the new servicing model. A cumulative update will be
available every month for the first 12 months and every quarter for the
remaining 4 years of the full 5-year mainstream lifecycle. You can read
more details about the new servicing model online at
https://blogs.msdn.microsoft.com/sqlreleaseservices/announcing-the-
modern-servicing-model-for-sql-server/.
Finally, as of this writing, there have been a few new releases and
announcements coming in SQL Server. First, SQL Operations Studio is
available for preview. SQL Operations Studio is a new tool for database
development and operations intended to work with SQL Server on
Windows, Linux and Docker, Azure SQL Database, and Azure SQL Data
Warehouse. SQL Operations Studio can run on Windows, Mac, or Linux
operating systems. For more details and to download, visit
https://blogs.technet.microsoft.com/dataplatforminsider/2017/11/15/announ
cing-sql-operations-studio-for-preview/.
Microsoft also recently announced some improvements to the database
engine, which will be included in a future release:
Summary
This chapter covered what is new in query processing in SQL Server 2017,
and it also included some SQL Server 2016 features released recently. The
most important query processing innovations on SQL Server 2017 are
adaptive query processing and automatic tuning.
Adaptive query processing is a very promising collection of features that
for the current release include three algorithms: batch mode adaptive joins,
batch mode adaptive memory grant feedback, and interleaved execution for
multistatement table-valued functions. A future feature, adaptive query
processing row mode memory grant feedback, has been announced as well.
Automatic tuning is also a new collection of features that seems to be in
its infancy, with only one feature available in SQL Server—automatic plan
correction—plus another one, automatic index management, available only
on Azure SQL Database.
The chapter also covered a few SQL Server 2016 Service Pack 1
enhancements. This service pack marked the first time in the history of SQL
Server that a consistent programmability surface area was available for
developers across SQL Server editions. Starting with this
groundbreaking release, features such as columnstore indexes and In-
Memory OLTP, among others, are now available for all editions of SQL
Server.
Chapter 7
In This Chapter
SQL Server High-Availability and Disaster-Recovery Features
Always On Availability Groups
Availability Groups on Windows vs. Linux
Implementing Availability Groups
Summary
This chapter provides an introduction to high-availability and disaster-
recovery solutions for SQL Server on Linux and focuses on Always
On availability groups. As high availability and disaster recovery are
very broad topics, this chapter does not intend to be a comprehensive
reference, but it will teach you everything you need to know to get started
using this feature with SQL Server on Linux. For additional information,
refer to the SQL Server documentation.
Availability groups, failover cluster instances (FCIs), and log shipping
are the most important availability features provided by SQL Server 2017
on Linux. Such features are used in high-availability and disaster-recovery
configurations. With availability groups, they can also be used for
migrations and upgrades or even to scale out readable copies of one or more
databases. A SQL Server feature that is also sometimes considered for
availability scenarios, replication, is not currently available in SQL Server
for Linux but is planned to be included in a future release. Another related
feature, database mirroring, has been deprecated as of SQL Server 2012
and will not be included in the Linux release. A flavor of availability
groups, basic availability groups, is available with SQL Server Standard
edition and is intended to provide a replacement for the (now deprecated)
database mirroring feature on installations not running Enterprise Edition.
“Always On” is a term that encompasses both Always On availability
groups and Always On FCIs. A great benefit of availability groups is that
they can provide high-availability and disaster-recovery solutions over
geographic areas. For example, a local secondary replica can provide high-
availability capabilities, while a remote secondary replica in a different
data center provides disaster recovery. In this scenario, the local replica
traffic can be replicated synchronously, thereby providing zero data loss
high availability, which is possible when network latency is low. And the
remote replica can run asynchronously, providing minimal data loss disaster
recovery where low network latency is not guaranteed.
NOTE
As introduced in Chapter 1, the SQL Server Agent is optional but
recommended for installation and is a requirement for a log-shipping
configuration. Interestingly, although in Windows they are installed
together, the database engine and the SQL Server Agent run as different
Windows services. On Linux, although they are installed with different
packages, the SQL Server Agent runs as part of the database engine
process.
1. From the Microsoft Azure portal, search for “Red Hat SQL Server 2017” to
find the image named “Free SQL Server License: SQL Server 2017
Developer on Red Hat Enterprise Linux 7.4 (RHEL).” See Chapter 1
for more detailed instructions about how to create virtual machines and
configure them as indicated there.
2. Name the virtual machines sqlonlinux1, sqlonlinux2, and sqlonlinux3.
You are creating a primary replica and two secondary replicas.
3. Connect to each server, and make sure the servers can communicate
with each other. When I create Azure virtual machines, by default, I can
ping each server, so in this case I don’t have to do anything else. You may need
to open firewall ports to connect. Make sure your configuration can
resolve IP addresses to hostnames. You may also need to verify and
update the information on /etc/hosts.
NOTE
/etc/hosts is a plain-text operating system file that maps computer
hostnames to IP addresses.
SQL Server is now running, and you are ready to connect using the sa
account. At this point, running SELECT @@VERSION on my instance returns
Microsoft SQL Server 2017 (RTM) - 14.0.1000.169 (X64), which means
this is the RTM (Release to Manufacturing) release and is missing a few
cumulative updates (CUs).
Let’s update to the latest cumulative update, which at the moment is
CU3, by running the following yum command:
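On Red Hat Enterprise Linux, this is a simple package update:
sudo yum update mssql-server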
Refer to Chapter 1 for more details. Later in this section, you will need
to open a port for availability groups as well; the default port is 5022.
In addition, and only for Red Hat Enterprise Linux, you need to run the
following commands to open the same TCP port on the firewall. At this
writing, this is not required for SUSE Linux Enterprise Server or Ubuntu
Azure virtual machines.
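For example, to open TCP port 1433:
sudo firewall-cmd --zone=public --add-port=1433/tcp --permanent
sudo firewall-cmd --reload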
NOTE
For more details about the AlwaysOn_health extended events session, see
https://msdn.microsoft.com/en-us/library/dn135324(v=sql.110).aspx.
Now let’s copy these files to the same location on the other two servers
and grant the same permissions they have on the principal replica. There are
a few different ways to accomplish this. In Chapter 1, we used the PuTTY
Secure Copy client or pscp utility to copy files between Windows and
Linux. Because in this case, we will be copying files between Linux
servers, we’ll use scp, the secure copy command, which is a remote file
copy program used to copy files between hosts on a network.
The scp command can use the following basic format to copy files:
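scp <file_name> <user>@<host>:<destination_directory>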
Run the following command, replacing the host destination name, such
as sqlonlinux2, or IP address and the user and directory destination. This
command is executed at the primary replica and the files will be copied to
my home directory on both of the secondary replicas:
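A sketch, assuming the certificate and private key were backed up to files named dbm_certificate.cer and dbm_certificate.pvk (replace the user and host names with your own):
scp /var/opt/mssql/data/dbm_certificate.* bnevarez@sqlonlinux2:~
scp /var/opt/mssql/data/dbm_certificate.* bnevarez@sqlonlinux3:~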
The files are copied as the user bnevarez, so I still need to copy them to
the correct directory and give them the right permissions. Run the following
commands to do that:
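A sketch, run on each secondary replica and assuming the same hypothetical file names:
sudo mv ~/dbm_certificate.* /var/opt/mssql/data/
sudo chown mssql:mssql /var/opt/mssql/data/dbm_certificate.*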
Repeat the same operations to copy the files to the additional secondary
replica.
So we just copied the files to the secondary replicas. Now the SQL
Server instance needs to be aware of the certificates. Run the following
statements on the other two servers, sqlonlinux2 and sqlonlinux3. This
creates a master key and a certificate on the secondary servers from the
backup created on the primary replica. Once again, use a strong password.
Also, in this particular case, the password used on DECRYPTION BY
PASSWORD has to be the same password used previously by the ENCRYPTION
BY PASSWORD option when backing up the private key to a file.
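A sketch, assuming the same hypothetical file names and a certificate named dbm_certificate; the passwords are placeholders:
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword>'
GO
CREATE CERTIFICATE dbm_certificate
    AUTHORIZATION dbo
    FROM FILE = '/var/opt/mssql/data/dbm_certificate.cer'
    WITH PRIVATE KEY (
        FILE = '/var/opt/mssql/data/dbm_certificate.pvk',
        DECRYPTION BY PASSWORD = '<StrongPassword>')
GO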
We are finally ready to create the database mirroring endpoints on all the
servers. Database mirroring endpoints are still used in the availability
groups terminology, although database mirroring is a separate feature that is
also already deprecated. Database mirroring endpoints require a TCP port,
so, once again, we need to open TCP ports on each virtual machine, which
can be the default 5022 or any other available port of your choice.
Open the port from the Microsoft Azure Portal for each virtual machine,
as indicated earlier when we opened TCP port 1433, and in the Inbound
Port Rules section on the right, select Add Inbound. This time, there is no
predefined service, so just select Custom and specify both Port ranges—in
this case, 5022—and a default Name—in this case Port_5022. You may also
need to change the Priority, for example, to 1020. Priority configures the
order in which rules are being processed.
Run the firewall-cmd command, but this time for TCP port 5022. This
is required only for Red Hat Enterprise Linux and is not needed for Ubuntu
and SUSE Linux Enterprise Server.
We are ready to run the CREATE ENDPOINT statement. The ROLE option of
the statement defines the database mirroring role or roles that the endpoint
supports and could be WITNESS, PARTNER, or ALL. As mentioned
earlier, SQL Server Express now can be used to host a configuration-only
replica, which supports only the WITNESS role.
Also, using a certificate is the only authentication method available on
SQL Server 2017 on Linux. Windows authentication will be supported in a
future release. The example statement also specifies that connections to this
endpoint must use encryption using the AES algorithm, which is the default
in SQL Server 2016 and later. No listener IP address will be defined for this
exercise.
Finally, for each replica instance, create the new endpoint, start it, and
grant connect permissions to it for the SQL Server login created earlier:
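A sketch, assuming the certificate created earlier and a hypothetical SQL Server login named dbm_login:
CREATE ENDPOINT [Hadr_endpoint]
    AS TCP (LISTENER_PORT = 5022)
    FOR DATABASE_MIRRORING (
        ROLE = ALL,
        AUTHENTICATION = CERTIFICATE dbm_certificate,
        ENCRYPTION = REQUIRED ALGORITHM AES)
GO
ALTER ENDPOINT [Hadr_endpoint] STATE = STARTED
GO
GRANT CONNECT ON ENDPOINT::[Hadr_endpoint] TO [dbm_login]
GO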
Next we will create the availability group only on the primary replica,
sqlonlinux1. The CREATE AVAILABILITY GROUP statement has a large
variety of options. The following statement specifies several items.
DB_FAILOVER with a value of ON specifies that any status other than ONLINE
for a database in the availability group triggers an automatic failover. If the
choice is OFF, it means that only the health of the instance is used to
trigger automatic failover.
CLUSTER_TYPE, which is new to SQL Server 2017, defines the cluster
type, and the values can be WSFC, EXTERNAL, and NONE. WSFC is used to
configure availability groups on Windows using a WSFC. EXTERNAL can be
used on Linux when using an external cluster manager such as Pacemaker.
NONE can be used when no cluster manager is used. This last choice can be
useful for availability group configurations to be used as read-only replicas
or migrations or upgrades. On a SQL Server instance on Linux, you will see
only the EXTERNAL and NONE choices.
AVAILABILITY_MODE defines whether the primary replica should wait for
the secondary replica to acknowledge the writing of log records to disk
before the primary replica can commit a transaction on a database, as
defined by the values ASYNCHRONOUS_COMMIT and SYNCHRONOUS_COMMIT. A
third choice, new to SQL Server 2017 CU1, CONFIGURATION_ONLY, defines
a configuration-only replica or a replica that is used only for availability
group configuration metadata. CONFIGURATION_ONLY can be used on any
edition of SQL Server and does not apply when CLUSTER_TYPE is WSFC. It is
mainly used as a witness on Linux environments where a file share is not
yet supported.
FAILOVER_MODE specifies the failover mode, which could be AUTOMATIC,
MANUAL, or EXTERNAL. It is always EXTERNAL for Linux installations.
Finally, SEEDING_MODE defines how the secondary replica is initially
seeded and can have the options AUTOMATIC or MANUAL. ENDPOINT_URL
specifies the URL path for the database mirroring endpoint on the instance
of SQL Server that hosts the availability replica. As you can see in the
following example, ENDPOINT_URL requires a system name, a fully qualified
domain name, or an IP address that unambiguously identifies the
destination computer system and the port number that is associated with the
mirroring endpoint of the partner server instance.
The following statement creates an availability group with three
synchronous replicas. Replace the server, fully qualified domain name, and
port with your own information. For simplicity, only the server name is
used—in your case, add the fully qualified domain name as well.
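A sketch using the options just described, a hypothetical availability group named ag1, the default port 5022, and the AdventureWorks database (which must already be present and in full recovery for the FOR DATABASE clause to apply):
CREATE AVAILABILITY GROUP [ag1]
    WITH (DB_FAILOVER = ON, CLUSTER_TYPE = EXTERNAL)
    FOR DATABASE [AdventureWorks]
    REPLICA ON
        N'sqlonlinux1' WITH (
            ENDPOINT_URL = N'tcp://sqlonlinux1:5022',
            AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
            FAILOVER_MODE = EXTERNAL,
            SEEDING_MODE = AUTOMATIC),
        N'sqlonlinux2' WITH (
            ENDPOINT_URL = N'tcp://sqlonlinux2:5022',
            AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
            FAILOVER_MODE = EXTERNAL,
            SEEDING_MODE = AUTOMATIC),
        N'sqlonlinux3' WITH (
            ENDPOINT_URL = N'tcp://sqlonlinux3:5022',
            AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
            FAILOVER_MODE = EXTERNAL,
            SEEDING_MODE = AUTOMATIC)
GO
ALTER AVAILABILITY GROUP [ag1] GRANT CREATE ANY DATABASE
GO
On each secondary replica you would then join the availability group with ALTER AVAILABILITY GROUP [ag1] JOIN WITH (CLUSTER_TYPE = EXTERNAL) and grant it CREATE ANY DATABASE as well, so automatic seeding can create the databases.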
It is interesting to note that if you run this statement twice, you get a
Windows-related error message, which I also saw with some other errors:
1. Start the New Availability Group Wizard. You will see the New
Availability Group Introduction page, shown in Figure 7-2, which
includes a summary of the information you need to provide in the
wizard. Read the information on the page and click Next.
Figure 7-2 Read the Introduction page.
3. You have only two choices for Cluster type: EXTERNAL and NONE.
WSFC is not available on a Linux instance. The Database Level
Health Detection checkbox (or DB_FAILOVER option of the CREATE
AVAILABILITY GROUP statement) is used to trigger the automatic
failover of the availability group when a database is no longer in the
online status. Leave the default of EXTERNAL and check the
Database Level Health Detection box. Then click Next.
4. On the Select Databases page, shown in Figure 7-4, you’ll select one
or more databases for the availability group. The size and status of the
database are specified. A status of Meets Prerequisites, as shown,
indicates that the database can be selected. If a database does not meet
the prerequisites, the Status column will indicate the reason. (A
common reason may be a database that is not in the full recovery
mode.) You may also need to specify a password if the database
contains a database master key. Check the AdventureWorks box and
click Next.
5. The Specify Replicas page, shown in Figure 7-5, includes five tabs
(Replicas, Endpoints, Backup Preferences, Listener, and Read-Only
Routing) and a wealth of information; I will just cover the basics here.
Figure 7-5 Specify Replicas page
The Replicas tab is the main tab; here you can specify each instance of
SQL Server that will host or currently hosts a secondary replica. Keep
in mind that the SQL Server instance to which you are currently
connected will host the primary replica. For our exercise, add
sqlonlinux2 and sqlonlinux3 (click Add) and configure and change the
Availability Mode to Synchronous Commit for all three replicas, as
shown in Figure 7-5.
The Endpoints tab, shown in Figure 7-6, is where you validate your
existing database mirroring endpoints.
Figure 7-6 Validate your existing database mirroring endpoints on the
Specify Replicas page, Endpoints tab.
You may also want to check the Backup Preferences tab (Figure 7-7),
where you can choose your backup preference for the availability
group as a whole and your backup priorities for the individual
availability replicas.
Figure 7-7 Choose your backup preferences on the Backup Preferences tab.
8. On the Validation page, shown in Figure 7-9, you can validate that
your environment supports all the configuration choices you made on
previous pages of the wizard. The Result column can show Error,
Skipped, Success, or Warning. Errors should be fixed before
continuing with the availability group configuration. Warning
messages should be reviewed as well to avoid potential issues with
your future installation. Skipped means that the validation step was
skipped because it is not required by your selections. Assuming you
got no errors, click Next.
9. The Summary page, shown in Figure 7-10, includes all the selections
you made in the wizard and gives you the opportunity to review before
submitting the changes to SQL Server. Click the Previous button to
make any changes. Once you are satisfied with your choices, click
Finish. Optionally, click the Script button to create a T-SQL script with
all your selections. Then click Finish.
Figure 7-10 Review your choices on the Summary page.
10. On the Results page, shown in Figure 7-11, you can see the results of
implementing the required changes. Results can show Error or
Success. The page will list each of the activities performed, such as
configuring endpoints and starting the AlwaysOn_health extended
events session on each of the replicas, plus creating the availability
group, waiting for the availability group to come online, and joining
the secondary replicas to the availability group. Click Close to finish
the New Availability Group Wizard.
Figure 7-11 View the results of implementing the required changes.
To see the status of the availability group, use the Availability Group
Dashboard. To open it, go back to the Always On High Availability folder,
open the Availability Group folder, right-click your availability group, and
select Show Dashboard. The dashboard for this exercise is shown in Figure
7-12.
Figure 7-12 Availability Group dashboard
NOTE
You can learn more about Pacemaker at
www.opensourcerers.org/pacemaker-the-open-source-high-availability-
cluster/.
NOTE
Access to the Red Hat Enterprise Linux High Availability Add-On
requires a subscription to the Red Hat Developer Program, which
provides no-cost subscriptions, and you can download the software for
development use only. For more details, see
https://developers.redhat.com/products/rhel/download.
Let’s start the configuration. Most of the work on this section will be on
the Linux side.
Register the Red Hat High Availability Add-On on each server using the
subscription-manager command. This command enables you to register
systems to a subscription management service and attaches and manages
subscriptions for software products. For this exercise, we will use the
register, list, and attach modules. The register module is used to register this
system to the customer portal or another subscription management service.
The list module, shown next, lists subscription and product information
for this system:
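For example:
sudo subscription-manager list --available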
Now, for each server, use the pool ID from the previous output to run the
attach module, which is used to attach a specified subscription to the
registered system:
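For example, replacing <pool_id> with the pool ID returned by the list module:
sudo subscription-manager attach --pool=<pool_id>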
Next, on all three servers, open the Pacemaker firewall ports, which
basically opens the TCP ports 2224, 3121, and 21064, and the UDP port
5405. This involves using the firewall-cmd command-line client of the
firewalld daemon. Instead of specifying a specific port or port range, it uses
one of the firewall-cmd provided services. To get a list of the supported
services, use firewall-cmd --get-services. For our example, we will use
the high-availability service as shown next:
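sudo firewall-cmd --permanent --add-service=high-availability
sudo firewall-cmd --reload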
Now we can install Pacemaker and all the required components, such as
fencing agents, resource agents, and Corosync, on all the nodes. As you
may notice so far, Pacemaker terminology refers to the servers as “nodes,”
which differs from the primary and secondary replica terminology used in
the previous section.
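The install command, which covers the four packages referenced next, might look like this:
sudo yum install pacemaker pcs fence-agents-all resource-agents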
In some of my testing while writing the book, one repository was not
available, so I had to disable it and find and enable a replacement. I got the
following partial error message:
You can use the following statement to list the enabled repositories:
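sudo yum repolist enabled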
Now we are ready to try installing the four listed packages again. The
output of the command will be huge, several pages long. Here is a quick
summary that also asks for user input. Enter y.
Later, the command shows the installed packages; only partial output is shown
here:
Finally, after such a large output, you could just verify the status of the
installed packages by running the same command again. This is the entire
output I got:
Next, change the password of the hacluster user, which was created
when the Pacemaker package was installed. Use the same password for all
nodes.
sudo passwd hacluster
Now we can enable and start the pcsd service and Pacemaker on all
nodes. The pcsd daemon controls and configures the Pacemaker and
Corosync clusters via pcs. Run the following commands:
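A sketch of those commands:
sudo systemctl enable pcsd
sudo systemctl start pcsd
sudo systemctl enable pacemaker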
Create the Pacemaker cluster, but this time only on sqlonlinux1. The
pcs cluster command is used to configure cluster options and nodes. pcs is
the pacemaker/corosync configuration system. Notice the auth, setup, and
start options. The auth command is used to authenticate pcs to pcsd on
nodes specified, or on all nodes configured in the local cluster if no nodes
are specified. The list of nodes is specified in the command. User hacluster
and password, as created earlier, are specified as well.
Finally, use the start option to start the cluster on the specified nodes:
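A sketch of the three commands, assuming a hypothetical cluster name of sqlcluster and the hacluster password set earlier:
sudo pcs cluster auth sqlonlinux1 sqlonlinux2 sqlonlinux3 -u hacluster -p <password>
sudo pcs cluster setup --name sqlcluster sqlonlinux1 sqlonlinux2 sqlonlinux3
sudo pcs cluster start --all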
I needed to run this again after a virtual machine restart, so this could be
configured to start automatically:
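sudo pcs cluster enable --all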
You can use pcs status to view the current status of the cluster:
This is the current configuration after you start the cluster on the
specified nodes:
NOTE
For more details about configuring the Red Hat High Availability Add-
On with Pacemaker, see https://access.redhat.com/documentation/en-
us/red_hat_enterprise_linux/6/html/configuring_the_red_hat_high_avail
ability_add-on_with_pacemaker/.
Next, we can install the SQL Server High Availability package. As we
learned from Chapter 1, a SQL Server installation includes several
packages, some of which can be optional. This needs to be installed on all
nodes:
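sudo yum install mssql-server-ha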
In the next step, we will disable Stonith, which is the Pacemaker fencing
implementation. Stonith is configured by default when you install a
Pacemaker cluster. Fencing may be defined as a method to bring a high-
availability cluster to a known state. When the cluster resource manager
cannot determine the state of a node or of a resource on a node, fencing
brings the cluster to a known state again.
However, at this moment, Red Hat Enterprise Linux does not provide
fencing agents for any cloud environments, including Azure. So I will
disable it for this exercise.
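sudo pcs property set stonith-enabled=false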
NOTE
You should keep Stonith enabled on supported environments or for
evaluation purposes. Microsoft is working on a solution for the
environments not yet supported.
NOTE
For more details about fencing and Stonith, see
clusterlabs.org/pacemaker/doc/crm_fencing.html.
Note that you can list the Pacemaker cluster properties by using the
following command, which lists the configured properties:
You can use the --all option to list all the properties, including unset
properties with their default values. My installation returns 40 properties:
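A sketch of both forms:
sudo pcs property list
sudo pcs property list --all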
Save the login information on every node. Create a text file with the
following information, the SQL Server login name and password:
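A sketch, assuming a SQL Server login created for Pacemaker and the file location expected by the SQL Server resource agent; the login name and password are placeholders:
echo 'pacemakerLogin' >> ~/pacemaker-passwd
echo '<password>' >> ~/pacemaker-passwd
sudo mv ~/pacemaker-passwd /var/opt/mssql/secrets/passwd
sudo chown root:root /var/opt/mssql/secrets/passwd
sudo chmod 400 /var/opt/mssql/secrets/passwd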
Basically, this uses the pcs command to manage cluster resources. In this
case, it creates a new resource called ag_cluster. The mssql is a resource
provider installed with the SQL Server High Availability package. Since the
master option is used, a master/slave resource is created.
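A sketch of such a command, assuming the availability group created earlier is named ag1:
sudo pcs resource create ag_cluster ocf:mssql:ag ag_name=ag1 master notify=true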
This is the current status of the cluster:
By the way, the providers option lists the available Open Cluster
Framework (OCF) resource agent providers:
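sudo pcs resource providers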
You may get the following error if you do not have the mssql-server-ha
package installed:
Notice that we are again creating a cluster resource, same as we did earlier
to create an availability group resource. This time, though, it is a virtual IP
resource.
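A sketch, replacing the IP address with a valid static address for your network:
sudo pcs resource create virtualip ocf:heartbeat:IPaddr2 ip=10.0.0.100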
We can use constraints to enforce whether two resources should run on the
same node, where a resource can or cannot run, or the order in which the
resources in a cluster should start. In this case, let’s add a
colocation constraint to ensure that the availability group primary replica
and the virtual IP resources run on the same host. This requires a colocation
constraint with a score of INFINITY. Run the following statement:
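A sketch, using the resource names created earlier in this exercise:
sudo pcs constraint colocation add virtualip ag_cluster-master INFINITY with-rsc-role=Master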
The next constraint will make sure the availability group cluster resource
is started first and then the virtual IP resource. This is required to avoid a
case when the IP address is available but pointing to a node where the
availability group resource is not yet available:
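A sketch of the ordering constraint:
sudo pcs constraint order promote ag_cluster-master then start virtualip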
Finally, you can use the cluster destroy command to destroy the
cluster permanently on the current node, killing all cluster processes and
removing all cluster configuration files. This could also be helpful to
remove a test configuration and start all over again:
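sudo pcs cluster destroy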
Similarly, you can drop the availability group from SQL Server by
running the following:
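For example, assuming the availability group name used earlier:
DROP AVAILABILITY GROUP [ag1]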
As you can see from my newly created Azure virtual machine, the
operating system already has the required software, but it may be different
in your case. Next, set the hacluster user password by running this
command:
Next, same as with Red Hat Enterprise Linux, run the following
commands to create the three-node cluster:
This time, I was not so lucky when I tried this on my Azure virtual
machines, as I got an error. So I had to register the repository, as shown
next:
Save the login information on every node. Create a text file with the
following information, the SQL Server login name and password:
Basically, this uses the pcs command to manage cluster resources. In this
case, it creates a new resource called ag_cluster. mssql is a defined resource
provider. Since the master option is used, a master/slave resource is
created.
As with our Red Hat Enterprise Linux installation, run the following
command to create a virtual IP resource for the cluster. You will need to
specify a valid static IP address:
Next, add a colocation constraint to ensure the availability group primary
replica and the virtual IP resources run on the same host:
Finally, run the following command to make sure the availability group
cluster resource is started first and then the virtual IP resource:
NOTE
Because of page/space constraints, installing SUSE Linux Enterprise
Server (SLES) is beyond the scope of this book. For details, see the SQL
Server documentation “Configure SLES Cluster for SQL Server
Availability Group” at https://docs.microsoft.com/en-us/sql/linux/sql-
server-linux-availability-group-cluster-sles, and additional links to the
“SUSE Linux Enterprise High Availability Extension 12 SP3, Installation
and Setup Quick Start” document at www.suse.com/documentation/sle-
ha-12/singlehtml/install-quick/install-quick.html.
The move option moves the defined resource—in this case, the availability
group resource ag_cluster—to the specified destination node—in this case,
sqlonlinux2.
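A sketch of the move command, using the resource and node names from this exercise:
sudo pcs resource move ag_cluster-master sqlonlinux2 --master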
The second step requires that you remove the location constraint:
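You can list the constraints with pcs constraint list --full and then remove the one created by the move by its id, which typically takes the form cli-prefer-<resource name>:
sudo pcs constraint list --full
sudo pcs constraint remove cli-prefer-ag_cluster-master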
Finally, if you have a problem and cannot use the tools for any reason,
such as when the cluster is not responding properly, you can still
temporarily do a manual failover using T-SQL statements from within SQL
Server. To perform this manual failover, you first need to unmanage the
availability group from Pacemaker, perform the manual failover using a T-
SQL statement, and configure managing the availability group with
Pacemaker again.
First, the unmanage option sets the listed resources to unmanaged mode.
The cluster is not allowed to start or stop a resource when it is in
unmanaged mode:
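sudo pcs resource unmanage ag_cluster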
Next, run the following statements on the secondary replica SQL Server
instance, sqlonlinux2. This statement sets a key-value pair in the session
context; in this case, it sets the session context variable external_cluster to
'yes'.
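A sketch of that statement:
EXEC sp_set_session_context @key = N'external_cluster', @value = N'yes'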
Run the following statement to initiate a manual failover of the
availability group to the secondary replica to which you are connected. You
could run this, for example, connected to the sqlonlinux2 instance to make
this instance the new primary replica:
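The exact statement from the book is not reproduced here; assuming the availability group name ag1, it would be along the lines of the following:
ALTER AVAILABILITY GROUP [ag1] FAILOVER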
Finally, the cleanup option makes the cluster forget the operation history
of the resource and re-detect its current state:
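After the failover, a sketch of returning control to Pacemaker and cleaning up the resource history:
sudo pcs resource manage ag_cluster
sudo pcs resource cleanup ag_cluster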
Summary
This chapter covered the basics of high availability and disaster recovery
for SQL Server on Linux. Although several features were discussed, the
majority of the chapter focused on Always On availability groups.
Availability groups on both Windows and Linux can be used in high-
availability and disaster-recovery configurations and for migrations and
upgrades, or even to scale out readable copies of one or more databases.
The main difference between availability groups in Windows and Linux
is that in Windows, the Windows Server Failover Cluster, or WSFC, is built
into the operating system and SQL Server and availability groups are
cluster-aware. In Linux, availability groups depend on Pacemaker instead of
WSFC. Because Linux does not provide a clustering solution out of the
box, all three SQL Server–supported Linux distributions utilize their own
version of Pacemaker. The SQL Server High Availability package, also
known as the SQL Server resource agent for Pacemaker, is an optional
package and must be installed for a high-availability solution in Linux.
For more details about availability groups in Windows, refer to the SQL
Server documentation at https://docs.microsoft.com/en-us/sql/database-
engine/availability-groups/windows/overview-of-always-on-availability-
groups-sql-server.
For more information about availability groups in Linux, go to
https://docs.microsoft.com/en-us/sql/linux/sql-server-linux-availability-
group-overview.
Look for the Red Hat High Availability Add-On documentation at
https://access.redhat.com/documentation/en-
us/red_hat_enterprise_linux/7/html/high_availability_add-
on_reference/index.
The SUSE Linux Enterprise High Availability extension documentation
is available at
www.suse.com/documentation/sle_ha/book_sleha/data/book_sleha.html.
Chapter 8
Security
In This Chapter
Introduction to Security on SQL Server
Transparent Data Encryption
Always Encrypted
Row-Level Security
Dynamic Data Masking
Summary
This chapter provides an introduction to security in SQL Server, and its
content applies to both Windows and Linux environments. It covers
how SQL Server security works in terms of layers such as protecting
the data itself, controlling access to the data, and monitoring data
access. This chapter also covers some of the newest SQL Server security
features introduced in the latest few versions of the product, including
Transparent Data Encryption (TDE), Always Encrypted, Row-Level
Security, and Dynamic Data Masking (DDM), in more detail.
All the security features available with Windows are available on Linux,
with the exception of Active Directory, which will be included in a future
update.
The database master key is a symmetric key used to protect the private
keys of certificates and asymmetric keys that are present in the database.
The CREATE CERTIFICATE statement adds a certificate to a database in SQL
Server. The CREATE DATABASE ENCRYPTION KEY statement can now be used
to create a database encryption key and protect it with the certificate. Run the
following statement:
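A sketch of the full sequence, assuming the AdventureWorks2012 database mentioned later in this section and a hypothetical certificate name of TDECert; the password is a placeholder:
USE master
GO
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword>'
GO
CREATE CERTIFICATE TDECert WITH SUBJECT = 'TDE Certificate'
GO
USE AdventureWorks2012
GO
CREATE DATABASE ENCRYPTION KEY
    WITH ALGORITHM = AES_256
    ENCRYPTION BY SERVER CERTIFICATE TDECert
GO
ALTER DATABASE AdventureWorks2012 SET ENCRYPTION ON
GO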
Notice the following warning about backing up the certificate and
private key associated with the certificate:
So let us follow that recommendation and back up the certificate and the
private key associated with the certificate:
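A sketch, using hypothetical file names and a placeholder password:
USE master
GO
BACKUP CERTIFICATE TDECert
    TO FILE = '/var/opt/mssql/data/TDECert.cer'
    WITH PRIVATE KEY (
        FILE = '/var/opt/mssql/data/TDECert.pvk',
        ENCRYPTION BY PASSWORD = '<StrongPassword>')
GO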
Running the following query will show that the database was restored
and is encrypted. Three databases will be listed: AdventureWorks2012,
AdventureWorks2012New, and, as explained earlier, tempdb.
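A sketch of such a query:
SELECT db.name, db.is_encrypted, dek.encryption_state
FROM sys.dm_database_encryption_keys AS dek
JOIN sys.databases AS db ON dek.database_id = db.database_id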
Trying to restore this database into another SQL Server instance will not
work and instead will return the “Cannot find server certificate with
thumbprint” error message. The following section explains how to restore a
database using TDE on another SQL Server instance.
As before, the files were copied over as the user bnevarez, so I still need to move them to the correct directory and give them the right ownership and permissions.
Copy the files to /var/opt/mssql/data and change the file ownership
appropriately:
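A sketch of the shell commands, assuming the certificate and private key backup files landed in the bnevarez home directory:
sudo cp /home/bnevarez/MyTDECert.cer /home/bnevarez/MyTDECert.pvk /var/opt/mssql/data
sudo chown mssql:mssql /var/opt/mssql/data/MyTDECert.cer /var/opt/mssql/data/MyTDECert.pvk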
We’ve copied the files to the second server. Now the SQL Server
instance needs to be aware of the certificates. Run the following statements
to create a master key and a certificate on the second SQL Server instance,
using the backup created on the original server. Once again, use a strong
password. Also, in this particular case, the password used on DECRYPTION
BY PASSWORD has to be the same password used previously by the
ENCRYPTION BY PASSWORD option when backing up the private key to a file.
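A sketch of those statements, assuming the file names used earlier and placeholder passwords:
USE master;
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<UseAStrongPasswordHere>';
CREATE CERTIFICATE MyTDECert
    FROM FILE = '/var/opt/mssql/data/MyTDECert.cer'
    WITH PRIVATE KEY (
        FILE = '/var/opt/mssql/data/MyTDECert.pvk',
        -- Must match the ENCRYPTION BY PASSWORD value used when the private key was backed up
        DECRYPTION BY PASSWORD = '<UseAStrongPasswordHere>');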
An incorrect password will cause the CREATE CERTIFICATE statement to fail with an error.
Finally, this procedure can work the same way if you need to attach a
database. You would need to repeat all the steps in the same way, copy the
database files to the appropriate location, and attach the database as usual.
Always Encrypted
Always Encrypted is a client-side encryption technology designed to protect
sensitive data. The data remains encrypted in the database engine all the
time, even during query processing operations. In addition, Always Encrypted is the only SQL Server security feature that provides a separation between the entities that own the data (and are allowed to see it) and those that only manage the data (and should not be able to see it). This means that in an Always Encrypted configuration, only the required applications and users can access and see the data. Always Encrypted was introduced as an Enterprise Edition–only feature with SQL Server 2016, but since Service Pack 1 it is available in every SQL Server edition.
A typical scenario for using Always Encrypted would be when the data
is hosted by a cloud provider such as Microsoft Azure and the client
application is on premises. In this case, cloud administrators would not have
access to see the data they administer. A second scenario, either on premises or in the cloud, is one in which database administrators are not required to access the data. The database stores only encrypted values, so
database administrators or any privileged user won’t be able to see the data.
Always Encrypted data can be a challenge for the query processor.
Query performance may be impacted when some operations are performed
on encrypted columns, since the query optimizer may not have information
about the real data distribution and values of these columns. The way the
columns are encrypted has query processing and performance implications
when the columns are used for index lookups, equality joins, and grouping
operations. Because of these implications, Always Encrypted offers two encryption types: deterministic and randomized. Randomized encryption uses a
method that encrypts data in a less predictable manner and can be used
when you don’t need to use these columns for query-processing operations
—for example, with credit card information. Deterministic encryption
always generates the same encrypted value for any given plain-text value
and is better suited for use in the query processing operations mentioned
earlier such as index lookups, equality joins, and grouping operations.
To help with these query processing limitations, a future enhancement of Always Encrypted was recently announced that will allow rich computations on encrypted columns, including pattern matching, range comparisons, and sorting. The feature is called Always Encrypted with secure enclaves, where an enclave is a protected region of memory that acts as a trusted execution environment. In addition, this feature will provide in-place encryption, which helps with schema changes that involve cryptographic operations on sensitive data. For more details about Always Encrypted with secure enclaves, read the article at https://blogs.msdn.microsoft.com/sqlsecurity/2017/10/05/enabling-confidential-computing-with-always-encrypted-using-enclaves-early-access-preview/.
Some of the tasks required to implement the Always Encrypted feature are not supported by T-SQL because, by design, the database engine cannot be involved in data encryption and decryption operations. These tasks must be performed on the client side, using either PowerShell or SQL Server Management Studio. I'll show you how to use the Always Encrypted Wizard in SQL Server Management Studio next; the wizard can also optionally generate a PowerShell script that implements the wizard selections. In summary, the database engine is not involved in provisioning column master keys, column encryption keys, or the encrypted column encryption keys associated with their corresponding column master keys.
Finally, the client application needs to be aware of which columns are
using Always Encrypted and should be coded accordingly.
NOTE
For an example of creating a client application that works with the
encrypted data, see the article at https://docs.microsoft.com/en-
us/azure/sql-database/sql-database-always-encrypted.
Next let’s look at how Always Encrypted works. We’ll create a small
table in which we will encrypt two columns. Run the following statement:
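A minimal sketch of such a table; the SSN and BirthDate columns match the wizard steps that follow, while the rest of the definition and the sample row are illustrative:
CREATE TABLE dbo.Employee (
    EmployeeID int IDENTITY(1,1) PRIMARY KEY,
    Name nvarchar(100) NOT NULL,
    SSN char(11) NOT NULL,
    BirthDate date NOT NULL);
GO
INSERT INTO dbo.Employee (Name, SSN, BirthDate)
VALUES (N'John Doe', '795-73-9838', '19800315');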
Now run the Always Encrypted Wizard. From SQL Server Management
Studio Object Explorer, expand Databases, expand the database containing
the table, expand Tables, and right-click the Employee table and select
Encrypt Columns. The Always Encrypted Wizard Introduction page is
shown in Figure 8-2. Then click Next.
Figure 8-2 Always Encrypted Wizard Introduction page
On the Column Selection page, shown in Figure 8-3, you can select the
columns to encrypt on the table. Select the SSN and BirthDate column
checkboxes. For the Encryption Type column, select Deterministic for the
SSN column and Randomized for the BirthDate column, as shown in Figure
8-3. Click Next.
Figure 8-3 Select columns on this page.
On the Master Key Configuration page, shown in Figure 8-4, you set up
your column master key and select the key store provider where the column
master key will be stored. As of this writing, the choices for key store
provider are a Windows Certificate Store and the Azure Key Vault. For this
example, keep the default choices and select the Windows Certificate Store.
Then click Next.
Figure 8-4 Set up your column master key and select the key store provider.
The Run Settings page, shown in Figure 8-5, gives you the choice of
generating a PowerShell script to run later or letting the wizard execute the
current selections. In addition, it shows a warning: “While
encryption/decryption is in progress, write operations should not be
performed on a table. If write operations are performed, there is a potential
for data loss. It is recommended to schedule this encryption/decryption
operation during your planned maintenance window.”
Figure 8-5 Run Settings page
After the Always Encrypted wizard completes, you can query your data
again, and this time the columns will be encrypted. Run the following
statement to show encrypted values for both the SSN and BirthDate
columns:
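A simple query is enough; when it is run from a session that does not have column encryption enabled, the two columns are returned as ciphertext (varbinary values):
SELECT Name, SSN, BirthDate FROM dbo.Employee;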
In addition, notice that the schema has changed. The table definition is now the following; it includes some of the options selected in the Always Encrypted Wizard, such as the encryption type, as well as the encryption algorithm, which in this case is AEAD_AES_256_CBC_HMAC_SHA_256.
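A sketch of the resulting definition; CEK_Auto1 is the column encryption key name the wizard generates by default, and the rest of the layout follows the table sketched earlier:
CREATE TABLE dbo.Employee (
    EmployeeID int IDENTITY(1,1) PRIMARY KEY,
    Name nvarchar(100) NOT NULL,
    SSN char(11) COLLATE Latin1_General_BIN2
        ENCRYPTED WITH (COLUMN_ENCRYPTION_KEY = CEK_Auto1,
                        ENCRYPTION_TYPE = DETERMINISTIC,
                        ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256') NOT NULL,
    BirthDate date
        ENCRYPTED WITH (COLUMN_ENCRYPTION_KEY = CEK_Auto1,
                        ENCRYPTION_TYPE = RANDOMIZED,
                        ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256') NOT NULL);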
Row-Level Security
The purpose of row-level security is to control access to rows in a table
based on the user executing a query. The benefit of this feature is that the
access restriction logic is defined in SQL Server rather than at the
application level.
Row-Level Security relies on the CREATE SECURITY POLICY statement and on predicates implemented as inline table-valued functions. The CREATE SECURITY POLICY statement creates a security policy for Row-Level Security and requires an inline table-valued function that is used as a predicate and is enforced on queries against the target table. The SCHEMABINDING option is required in this inline table-valued function.
Let’s copy some data from AdventureWorks to show how this feature
works. Create a new database:
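A sketch of that setup, assuming AdventureWorks2012 as the source; the database and table names here are illustrative, while SalesPerson274 and SalesManager are the users referenced in the rest of the example:
CREATE DATABASE RowLevelSecurity;
GO
USE RowLevelSecurity;
GO
-- Copy some sales data from AdventureWorks
SELECT SalesOrderID, OrderDate, SalesPersonID, TotalDue
INTO dbo.SalesOrderHeader
FROM AdventureWorks2012.Sales.SalesOrderHeader
WHERE SalesPersonID IS NOT NULL;
GO
-- Users without login, used only to test the security policy
CREATE USER SalesPerson274 WITHOUT LOGIN;
CREATE USER SalesManager WITHOUT LOGIN;
GRANT SELECT ON dbo.SalesOrderHeader TO SalesPerson274;
GRANT SELECT ON dbo.SalesOrderHeader TO SalesManager;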
At this time, users can access the entire table, as requested by the GRANT
SELECT statement. Let us fix that by creating our filter predicate function:
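A sketch of the function, using the fn_securitypredicate name referenced in the cleanup step; the logic simply matches each SalesPersonNNN user to its own SalesPersonID and lets SalesManager see everything:
CREATE FUNCTION dbo.fn_securitypredicate(@SalesPersonID int)
    RETURNS TABLE
    WITH SCHEMABINDING
AS
RETURN
    SELECT 1 AS fn_securitypredicate_result
    WHERE USER_NAME() = 'SalesPerson' + CAST(@SalesPersonID AS varchar(10))
       OR USER_NAME() = 'SalesManager';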
Create the security policy using the previously defined inline table-valued
function:
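A sketch of the statement, using the SalesFilter name mentioned in the cleanup step:
CREATE SECURITY POLICY SalesFilter
    ADD FILTER PREDICATE dbo.fn_securitypredicate(SalesPersonID)
    ON dbo.SalesOrderHeader
    WITH (STATE = ON);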
Since we have specified STATE ON, the security policy is now enabled.
To test the change in permissions, try the following statements:
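A sketch of the test, impersonating the users created earlier:
EXECUTE AS USER = 'SalesPerson274';
SELECT * FROM dbo.SalesOrderHeader;  -- only rows with SalesPersonID = 274
REVERT;
GO
EXECUTE AS USER = 'SalesManager';
SELECT * FROM dbo.SalesOrderHeader;  -- all rows
REVERT;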
So you can see how only the rows for the specific salesperson are listed,
even when the user is submitting a SELECT for the entire table.
SalesPerson274 can only see rows with SalesPersonID 274. SalesManager
has no restrictions and can see all the data.
To clean up, you only need to drop the SalesFilter security policy and the
fn_securitypredicate function:
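Both can be dropped with two statements, sketched here:
DROP SECURITY POLICY SalesFilter;
DROP FUNCTION dbo.fn_securitypredicate;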
Dynamic Data Masking
Dynamic Data Masking limits sensitive data exposure by masking it in query results returned to nonprivileged users. The following types of masks are available:
Default mask Works with most data types. For example, it replaces a string data type with 'xxxx', a numeric data type with a zero value, a date and time data type with '01.01.1900 00:00:00.0000000', and a binary data type with the ASCII value 0.
E-mail mask Replaces e-mail address values with the form
'[email protected]'.
Random mask Replaces a numeric data type with a random value
based on a specified range.
Custom string mask Replaces a string data type with a string that
exposes the first and last letters and adds a custom padding string in
the middle.
Now, if you run the query with a login that has high-level privileges, you will still see the real data. Try it with a user without high-level privileges by running the following statements, which create a temporary user and use EXECUTE AS to run the query as that user:
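A sketch of such a test; the dbo.MaskedData table name and the TestUser name are assumptions, standing in for the masked table created earlier in the example:
CREATE USER TestUser WITHOUT LOGIN;
GRANT SELECT ON dbo.MaskedData TO TestUser;
GO
EXECUTE AS USER = 'TestUser';
SELECT * FROM dbo.MaskedData;  -- masked values are returned
REVERT;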
In this case, the value for field7 will now show as LoXXXXXXX.
Finally, although Dynamic Data Masking can help in a large variety of
security cases, keep in mind that some techniques may bypass the Dynamic
Data Masking definitions. For example, a malicious user may try brute-
force techniques to guess values, and by doing that may be able to see the
real data. As an example, see the following queries:
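A sketch of such queries, run as the same nonprivileged user and against the assumed dbo.MaskedData table:
EXECUTE AS USER = 'TestUser';
SELECT * FROM dbo.MaskedData WHERE field2 > 600;  -- returns the row
SELECT * FROM dbo.MaskedData WHERE field2 < 700;  -- returns the row
SELECT * FROM dbo.MaskedData WHERE field2 = 666;  -- eventually pinpoints the value
REVERT;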
From the first two queries, the malicious user learns that field2 is greater than 600 and less than 700, because each query returns the row. Continuing with this technique, the user can try additional predicates, such as field2 = 666, until the row is returned, thereby discovering the real value.
Summary
This chapter served as an introduction to security on SQL Server on Linux.
SQL Server on Linux supports the same enterprise-level security
capabilities that are available on SQL Server on Windows today, and all
these features are built into the product.
The chapter defined SQL Server Security in terms of layers. The first
layer, or level, of security is about protecting the data itself, typically by
using methods such as encryption. The second layer is about controlling
access to the data, which is basically defining who is allowed to access
which parts of the data. The third layer is about monitoring access by
tracking activities that are happening against the data.
The chapter also covered Transparent Data Encryption, Always
Encrypted, Row-Level Security, and Dynamic Data Masking in more detail.
Transparent Data Encryption is a security feature designed to protect the database data files, transaction log files, and backups against unauthorized access to the physical media. Always Encrypted is a client-side
encryption technology designed to protect sensitive data. Row-Level
Security is a feature that controls access to rows in a table based on
the user executing a query. Finally, Dynamic Data Masking is another
security feature introduced with SQL Server 2016 that is aimed at limiting
sensitive data exposure by masking it to nonprivileged users.
Index
Please note that index links point to page beginnings from the print edition.
Locations are approximate in e-readers, and you may need to page down
one or more times after clicking a link to get to the indexed material.
F
failover cluster instances (FCIs), 205–206
FAILOVER_MODE option, 215
fencing in Pacemaker, 208, 229, 235–236
fgrep command, 73
file command, 64
filelocation options in mssql-conf, 96–97
files
copying, 213
overview, 61–64
permissions, 77–79
Unix file system, 68–70
working with, 64–68
find command
files, 65–67
redirection, 75–76
firewall-cmd command, 211, 214, 230
firewalls
availability groups, 209, 211
Pacemaker, 240
security groups, 8
fn_securitypredicate function, 267
FORCE_DEFAULT_CARDINALITY_ESTIMATION hint, 196, 198
FORCE_LAST_GOOD_PLAN hint, 193
FORCE_LEGACY_CARDINALITY_ESTIMATION hint, 195–196, 198
free command, 92
H
hadr.hadrenabled option, 96
hadr.hadrenabled mssql-conf option, 211
Hallengren, Ola, 118
halt command, 85
“hardening the log”, 205
Hash Aggregate operator, 131–132
head command, 67
heaps
vs. clustered indexes, 152–153
indexes, 149
high availability and disaster recovery, 204
availability groups. See availability groups
features, 204–206
settings, 8–9
High-Availability package, 206, 208
hints
complex queries, 170
overview, 171
queries, 166–169
types, 173–174
USE HINT query option, 195–199
uses, 172–173
history command, 74–75
history_update_count value, 185
home directory, 59, 62, 68
$HOME variable, 62
htop command, 91
K
kernel
last accessed date/time, 108
settings, 102–106
swap files, 107
transparent huge pages, 106–107
Key Lookup operator, 163–165
keys
encryption, 253
indexes, 149–150, 153–154
master, 253
kill command, 73–74
Kumar, Rohan, 48
P
Pacemaker
availability groups, 206–209
configuration in Red Hat Enterprise Linux, 228–240
configuration in Ubuntu, 240–244
resources, 244–246
package management systems, 15, 85–88
Page Free Space (PFS) page, 110–111, 113
PAGEIOLATCH_XX waits, 110
PAGELATCH_XX waits, 110
parameter sniffing, 162–168
parent directories, 62
parsing queries, 124–126
partial ordered scans, 153
passwords
administrators, 95
availability groups, 213
Docker, 37
encryption, 256
Pacemaker, 232
SQL Server, 16–17
paths and $PATH variable, 59–62
pcs command, 243
pcs cluster command, 232–234
pcs status command, 234
performance tuning profiles, 102–106
permissions
Linux, 77–79
mssql-conf, 95
PFS (Page Free Space) page, 110–111, 113
physical operator hints, 174
Physical reads data in queries, 147–148
picoprocesses, 52
PIDs (process IDs), 70–72
pipe symbol (|) for redirection, 75
piping, 75
plan_handle DMF, 142–143
plan_hash DMV, 143–145
plans
execution, 124–130
text, 136
XML, 134–136
pools in Pacemaker, 230
“Porting Microsoft SQL Server to Linux” article, 48
ports
availability groups, 214
connections, 98
Pacemaker, 230, 240
poweroff command, 85
powersave setting, 103
PowerShell, 48
Preview repository, 16
primary keys, 149–150
primary replicas, 205
Priority setting for SQL Server connections, 22
/proc directory, 68
/proc/swaps file, 107
procedure caches for queries, 128
process affinity in tempdb, 114
process IDs (PIDs), 70–72
processors
limitations in query optimization, 168–174
max degree of parallelism, 115
profiles, performance tuning, 102–106
Project Helsinki, 48–50
prompts, 81
Protocol setting for SQL Server connections, 22
ps command, 17–18
documentation, 59–60
piping, 76
processes, 70–72
Public IP Address setting, 8
publications, 205
PuTTY terminal emulator, 11
pwd command, 61, 80–81
U
Ubuntu, 3
Pacemaker configuration on, 240–244
SQL Server installation on, 29–33
umask command, 77
umount command, 69
uniformity assumption in cardinality estimation, 158
uninstalling SQL Server, 39
UNIQUE clause for clustered indexes, 153
unique indexes, 149–151
Unix operating system
commands, 18–19, 70–77
file system, 68–70
implementations, 2
origin, 58
unixODBC-devel package, 19
update command, 85–86
UPDATE STATISTICS statement, 156, 161–162
updates
cumulative, 15
query statistics, 155–157
USE HINT query option, 194–199
Use Managed Disks feature, 7
usernames
availability groups, 212
Pacemaker, 243
/usr directory, 69