Namespaces, Cgroups, and Containers
Rami Rosen
http://ramirose.wix.com/ramirosen
● About me: kernel developer, working mostly on
networking and device drivers; author of “Linux
Kernel Networking”, Apress, 648 pages, 2014.
Namespaces and cgroups are the basis of
lightweight process virtualization.
As such, they form the basis of Linux containers.
They can also be used for easily setting up a testing/debugging environment or a
resource-separation environment, and for resource accounting/logging.
Namespaces and cgroups are orthogonal.
We will talk mainly about the kernel implementation with some userspace
usage examples.
What is lightweight process virtualization?
A process that gives the user the illusion of running a full Linux operating
system. You can run many such processes on a machine, and all such
processes in fact share the single Linux kernel that runs on the machine.
This is opposed to hypervisor-based solutions, like Xen or KVM, where you
run another instance of the kernel.
The idea is not revolutionary – Solaris Zones and BSD jails have been around
for several years already.
A Linux container is in fact a process.
Containers versus Hypervisor-based VMs
It seems that hypervisor-based VMs like KVM are here to stay (at least
for the next several years). There is an ecosystem of cloud infrastructure
around solutions like KVM.
Advantages of hypervisor-based VMs (like KVM):
You can create VMs of other operating systems (Windows, BSDs).
Security (though there have been security vulnerabilities that were
found and required patches, like VENOM).
Containers – advantages:
Lightweight: occupies significantly fewer resources (like memory) than a
hypervisor-based VM.
Density – you can run many more containers on a given host than KVM-based
VMs.
Elasticity – start and shutdown times are much shorter, almost
instantaneous. Creating a container has the overhead of creating a
Linux process, which is on the order of milliseconds, while
creating a VM based on Xen/KVM can take seconds.
The lightness of containers is in fact what provides their density and
their elasticity.
There is a single Linux kernel infrastructure for containers
(namespaces and cgroups), while for Xen and KVM there are two
different implementations without any common code.
Namespaces
Development took over a decade: the namespaces implementation started in
about 2002; the last one to date (user namespaces) was completed in
February 2013, in kernel 3.8.
There are currently 6 namespaces in Linux:
● mnt (mount points, filesystems)
● pid (processes)
● net (network stack)
● ipc (System V IPC)
● uts (hostname)
● user (UIDs)
In the past there have been talks on adding more namespaces – device namespaces
(LPC 2013), and others (OLS 2006, Eric W. Biederman).
Namespaces - contd
A namespace is terminated when all its processes have terminated and its
inode is no longer held (the inode can be held, for example, by a bind mount).
Userspace support for namespaces
Apart from the kernel, there were also some userspace additions:
● iproute2 package:
● Some additions like ip netns add/ip netns del and more commands
(starting with ip netns …)
● We will see some examples later.
● util-linux package:
● The unshare utility, with support for all 6 namespaces.
● nsenter – a wrapper around setns().
● See: man 1 unshare and man 1 nsenter.
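A minimal usage sketch of the two utilities (run as root; the target PID 1234 is hypothetical):
unshare --uts --net bash            # start bash in new UTS and network namespaces
nsenter --target 1234 --net bash    # start bash inside the network namespace of PID 1234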
UTS namespace
The UTS namespace isolates the system identification data returned by
commands like uname or hostname.
The UTS namespace was the simplest one to implement.
There is a member in the process descriptor called nsproxy.
A member named uts_ns (a uts_namespace object) was added to it.
The uts_ns object includes a new_utsname struct with 6 members:
sysname
nodename
release
version
machine
domainname
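A quick demonstration of UTS isolation (run as root; the hostname used is arbitrary):
unshare --uts bash    # start bash in a new UTS namespace
hostname container1   # change the hostname inside the namespace
hostname              # prints: container1
exit
hostname              # outside, the original hostname is unchanged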
Former implementation of gethostname():
asmlinkage long sys_gethostname(char __user *name, int len)
{
        int i, errno;
        ...
        i = 1 + strlen(system_utsname.nodename);
        if (copy_to_user(name, system_utsname.nodename, i))
                errno = -EFAULT;
        ...
}
(system_utsname is a global.)
With UTS namespaces, the per-namespace data is reached via the utsname() helper:
static inline struct new_utsname *utsname(void)
{
        return &current->nsproxy->uts_ns->name;
}
and the syscall uses it instead of the global:
        u = utsname();
        i = 1 + strlen(u->nodename);
A similar approach was taken in the uname() and sethostname() syscalls.
Network namespaces
You can list the network namespaces (those added via “ip netns add”) with:
● ip netns list
You can monitor addition/removal of network
namespaces by:
● ip netns monitor
This prints one line for each addition/removal event it sees.
You can move a network interface (eth0) to myns1 network namespace by:
● ip link set eth0 netns myns1
You can start a bash shell in a new namespace by:
● ip netns exec myns1 bash
A recent addition – an “all” parameter to exec, to run a command in every netns; for
example:
ip -all netns exec ip link
shows link info for all network namespaces.
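A short sketch tying these commands together (the namespace and interface names are arbitrary):
ip netns add myns1                            # create a network namespace
ip link add veth0 type veth peer name veth1   # create a veth pair
ip link set veth1 netns myns1                 # move one end into myns1
ip netns exec myns1 ip link                   # list the links inside myns1
ip netns del myns1                            # delete the namespace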
A nice feature:
Applications that usually look for configuration files under /etc (like /etc/hosts
or /etc/resolv.conf) will first look under /etc/netns/NAME/, and only if nothing is
available there will they fall back to /etc.
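For example, giving myns1 its own DNS configuration (a sketch; the nameserver address is hypothetical):
mkdir -p /etc/netns/myns1
echo "nameserver 10.0.0.1" > /etc/netns/myns1/resolv.conf
ip netns exec myns1 cat /etc/resolv.conf    # shows the per-namespace file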
PID namespaces
Added a member named pid_ns (a pid_namespace object) to nsproxy.
● Processes in different PID namespaces can have the same process ID.
– When a process dies, all its orphaned children are reparented to the process
with PID 1 in that namespace (child reaping).
– Sending a SIGKILL signal does not kill process 1, regardless of the
namespace in which the command was issued (the initial namespace or another
PID namespace).
● PID namespaces can be nested, up to 32 nesting levels (MAX_PID_NS_LEVEL).
See: multi_pidns.c, by Michael Kerrisk, from http://lwn.net/Articles/532745/.
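A minimal sketch of creating a new PID namespace with unshare (run as root):
unshare --pid --fork --mount-proc bash    # bash becomes PID 1 in the new namespace
echo $$                                   # prints: 1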
PID namespaces use case: the CRIU project
CRIU – Checkpoint-Restore In Userspace
The checkpoint-restore feature stops a process and saves its state to
the filesystem, so that it can later be restarted on the same machine or on a
different machine. This feature is required mostly in HPC, for load balancing and
maintenance.
Previous attempts by the OpenVZ folks, in 2005, to implement the same in the
kernel were rejected by the community as too intrusive (a patch
series of 17,000 lines, touching the most sensitive Linux kernel subsystems).
When restarting a process on a different machine, the PIDs of that process and
of the threads within it can collide with the PIDs of existing processes on
the new machine.
Creating the process with its own PID namespace avoids this collision.
Mount namespaces
Mount namespaces give each namespace its own view of the filesystem mount
tree: mount and unmount operations performed in one mount namespace are not
visible in the others.
Userspace tools
cgroups can be managed from userspace with the tools of the libcgroup package:
Delete a group:
cgdelete -g memory:group1
Add a process to a group:
cgclassify -g memory:group1 <pidNum>
Adds a process whose pid is pidNum to group1 of the memory controller.
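The group itself can be created with the cgcreate utility from the same package (a sketch, matching the group name above):
cgcreate -g memory:group1    # create group1 under the memory controller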
Userspace tools - contd
cgexec -g memory:group0 sleepingProcess
Runs sleepingProcess in group0 of the memory controller (the same as if you wrote
the PID of that process into the tasks file of group0).
Limits can also be set in a cgconfig.conf configuration file; for example, for a
group named group1:
group group1 {
        memory {
                memory.limit_in_bytes = 3.5G;
        }
}
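Such a file is typically loaded at boot by the cgconfig service, or manually (a sketch, assuming the entry is in /etc/cgconfig.conf):
cgconfigparser -l /etc/cgconfig.conf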
cgmanager
Problem: there can be many userspace daemons that set cgroup sysfs
entries (like systemd, libvirt, lxc, docker, and others).
How can we guarantee that one will not override entries written by another?
Solution – cgmanager: A cgroup manager daemon
● Currently under development (no rpm for Fedora/RHEL, for example).
In Docker and in LXC, you can configure cgroup/namespaces via config files.
Background - Linux Containers projects
LXC and Docker are based on cgroups and namespaces.
LXC originated in a French company that was bought by IBM in
2005; the code was rewritten from scratch and released as an
open source project. The two maintainers are from Canonical.
Using Dockerfiles:
Create the following Dockerfile:
FROM fedora
MAINTAINER JohnDoe
RUN yum install -y httpd
Now run the following from the folder where this Dockerfile resides:
docker build .
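You can also tag the resulting image so it is easier to refer to later (the tag name here is arbitrary):
docker build -t fedora-httpd .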
docker diff
Another nice feature of Docker is git-diff-like functionality for images, via
docker diff; “A” denotes “added”, “C” denotes “changed”; for example:
docker diff <dockerContainerID>
docker diff 7bb0e258aefe
…
C /dev
A /dev/kmsg
C /etc
A /etc/mtab
A /go
…
Why do we need yet another containers project like Kubernetes?
Docker Advantages
Provides an easy way to create and deploy containers.
A lightweight solution compared to VMs.
Fast startup/shutdown (elastic): on the order of milliseconds.
Does not depend on libraries on target platform.
Docker Disadvantages
Containers, including Docker containers, are considered less secure than VMs.
Work is being done by Red Hat to enhance Docker security with SELinux.
Docker containers on a single host must share the same kernel image.
Docker handles containers individually; there is no
management/provisioning of multiple containers in Docker itself.
You can link Docker containers (using the --link flag), but this only
exposes some environment variables between containers and adds entries in
/etc/hosts.
The Docker Swarm project (for container orchestration) is quite new; it is a very basic
project compared to Kubernetes.
Kubernetes
An open source container orchestration project originated by Google, with
contributions also from Microsoft, HP, IBM, VMware, CoreOS, and more.
There are already rpms for Fedora and RHEL 7.
Quite a small rpm – it comprises 60 files, for example, in Fedora.
Its 6 configuration files reside in a central location: /etc/kubernetes.
Google Borg:
A paper published last week:
“Large-scale cluster management at Google with Borg”
http://research.google.com/
“Google’s Borg system is a cluster manager that runs hundreds of
thousands of jobs, from many thousands of different applications, across a
number of clusters each with up to tens of thousands of machines.”
Kubernetes abstractions
A pod is defined in a config file and created with kubectl; for example
(assuming the pod1.yaml shown below):
kubectl create -f pod1.yaml
This request, as well as other kubectl requests, is translated into an http POST request.
pod1.yaml:
apiVersion: v1beta3
kind: Pod
metadata:
  name: www
spec:
  containers:
    - name: nginx
      image: dockerfile/nginx
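After creating the pod, you can check its status with:
kubectl get pods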
POD api (kubectl) - continued
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/examples/walkthrough/v1beta3/pod1.yaml
Currently you can create multiple containers in a single pod config file
only with a json config file.
Note that when you create a pod, you do not specify the node (machine) on
which it will be started; this is decided by the scheduler.
Note that you can create copies of the same pod with the
ReplicationController (discussed later).
Delete a pod:
kubectl delete -f configFile
Delete all pods:
kubectl delete pods --all
ReplicationController
A ReplicationController consists of:
Count – Kubernetes will keep this number of copies of pods matching the
label selector running; if too few copies are running, the replication
controller will start a new pod somewhere in the cluster.
Label Selector – selects the set of pods that the controller manages.
ReplicationController yaml file
apiVersion: v1beta3
kind: ReplicationController
metadata:
  name: nginx-controller
spec:
  replicas: 3
  # selector identifies the set of Pods that this
  # replication controller is responsible for managing
  selector:
    name: nginx
  template:
    metadata:
      labels:
        # Important: these labels need to match the selector above.
        # The api server enforces this constraint.
        name: nginx
    spec:
      containers:
        - name: nginx
          image: dockerfile/nginx
          ports:
            - containerPort: 80
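The controller is created the same way as a pod (a sketch; the file name is assumed):
kubectl create -f replication-controller.yaml
kubectl get rc    # list the replication controllers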
Master
The master runs 4 daemons (via systemd services):
kube-apiserver
● Listens to http requests on port 8080.
kube-controller-manager
● Handles the replication controller and is responsible for adding/deleting
pods to reach desired state.
kube-scheduler
● Handles scheduling of pods.
etcd
● A distributed key-value store.
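These run as regular systemd services on the master, so they can be inspected with systemctl (a sketch):
systemctl status kube-apiserver kube-scheduler kube-controller-manager etcd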
Links
http://kubernetes.io/
https://github.com/googlecloudplatform/kubernetes
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/pods.md
https://github.com/GoogleCloudPlatform/kubernetes/wiki/User-FAQ
https://github.com/GoogleCloudPlatform/kubernetes/wiki/Debugging-FAQ
https://cloud.google.com/compute/docs/containers
https://github.com/GoogleCloudPlatform/kubernetes/blob/master/examples/walkthrough/v1beta3/replication-controller.yaml
Summary
Thank you!