Linux Containers
Basic Concepts
Lucian Carata
FRESCO Talklet, 3 Oct 2014
Underlying kernel mechanisms
cgroups: manage resources for groups of processes
namespaces: per-process resource isolation
seccomp: limit available system calls (see the sketch below)
capabilities: limit available privileges
CRIU: checkpoint/restore (with kernel support)
These mechanisms are orthogonal and are used in conjunction to implement actual container functionality.
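To make the seccomp line above concrete, here is a minimal sketch (my example, not from the talk) of seccomp strict mode, which restricts the calling process to read(), write(), _exit() and sigreturn(); any other system call kills it:

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>

int main(void)
{
    /* enter strict mode: only read, write, _exit, sigreturn allowed */
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0) == -1)
        return 1;

    write(1, "still alive\n", 12);

    /* returning from main would call exit_group(), which is not on the
     * strict-mode whitelist, so exit via the plain exit syscall */
    syscall(SYS_exit, 0);
}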
cgroups user space view
low-level filesystem interface similar to sysfs (/sys) and procfs (/proc)
new filesystem type “cgroup”, default location in /sys/fs/cgroup
[Diagram: subsystems (controllers) such as freezer and perf attach to cgroup hierarchies, e.g. a top-level cgroup (mount) /opus under /sys/fs/cgroup with children /normal and /experiment_1; each subsystem can be used at most once*; some controllers can be built as kernel modules]
* or, if a new top-level cgroup is created with an already existing combination of subsystems, the previous top-level cgroup will be used behind the scenes
● issues with systemd pre-mounting directories with certain controllers, which makes new hierarchies (with different controller combinations) difficult to achieve
● each process can appear at most once within a cgroup hierarchy (from the top level towards descendants)
cgroups user space view
[Diagram: cgroup hierarchy with top-level cgroup /opus and children /normal and /experiment_1]
● by default, the top-level cgroup contains all running tasks. A cgroup created as a subdirectory starts with no tasks; those must be added manually by writing PIDs to its “tasks” file (see the sketch after this list)
● release_agent is only present at the top-level cgroup and contains a command to be run when the last process of a cgroup terminates. notify_on_release must be set in a particular cgroup for that command to actually execute
● cpu controller: by default, the kernel scheduler aims to give equal cpu time to all processes. cgroups can be used for fair sharing between arbitrary sets of processes (e.g. 30 apache processes vs. 10 postgres processes)
● net_cls: interface for tagging network packets with a class identifier (so that rules based on packet class can be added later)
● memory controller: has hierarchical support and allows soft limits (a cgroup can use as much memory as needed, provided there is no memory contention and the hard limit is not exceeded)
○ hierarchical support means that child cgroups contribute to the memory usage of their ancestors; if an ancestor exceeds a limit, memory is reclaimed from the ancestor and all its children
● cpuset is also hierarchical
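A minimal user-space sketch of the workflow above (my example, not from the talk), assuming the cpu controller is mounted at /sys/fs/cgroup/cpu and the program runs as root; the cgroup name experiment_1 mirrors the example hierarchy:

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    /* creating a subdirectory in the mounted hierarchy creates a cgroup */
    if (mkdir("/sys/fs/cgroup/cpu/experiment_1", 0755) == -1)
        perror("mkdir");                     /* may already exist */

    /* a fresh cgroup starts with no tasks; attach the current process
     * by writing its PID to the cgroup's "tasks" file */
    FILE *f = fopen("/sys/fs/cgroup/cpu/experiment_1/tasks", "w");
    if (!f) { perror("fopen"); return 1; }
    fprintf(f, "%d\n", getpid());
    fclose(f);
    return 0;
}

Every child forked from this process will then start in the same cgroup.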
cgroups kernel space view
[Diagram: include/linux/cgroup.h. Each task_struct holds a css_set *cgroups pointer and a list_head cg_list; the css_set holds a list_head tasks (the list of all tasks using the same css_set) and an array cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT]. The kernel code for attaching/detaching a task to/from a css_set lives in init/main.c and in the fork()/exit() paths]
on initialization, a default css_set (init_css_set) is created, containing the initial tasks at system boot
a css_set contains all the tasks that are under the same state configuration for all enabled controllers (they share cgroups in all hierarchies)
the cgroup hierarchy is not directly accessible from a given task (this is not needed as often)
cgroups kernel space view
[Diagram: as above, each task_struct points to a css_set; the cgroup_subsys_state entries in its subsys[] array belong to subsystems described by struct cgroup_subsys (include/linux/cgroup_subsys.h), with instances such as cpuset_subsys, freezer_subsys and mem_cgroup_subsys. A cgroup_subsys defines callbacks like int (*attach)(...), void (*fork)(...), void (*exit)(...), void (*bind)(...), plus fields const char *name, cgroupfs_root *root and cftype *base_cftypes]
cgroups kernel space view
include/linux/cgroup_subsys.h
cgroup_subsys:
  int (*attach)(...)
  void (*fork)(...)
  void (*exit)(...)
  void (*bind)(...)
  ...
  const char *name;
  cgroupfs_root *root;
  cftype *base_cftypes;
Example instance: cgroup_subsys cpuset_subsys, with .base_cftypes = files (the control files the cpuset controller exposes through cgroupfs)
cgroups summary
[Diagram, repeated: subsystems (controllers) such as freezer and perf attach to cgroup hierarchies mounted under /sys/fs/cgroup (top-level cgroup /opus with children /normal and /experiment_1); each subsystem can be used at most once; some controllers can be built as kernel modules]
● (demo) show /sys/fs/cgroup in a terminal
namespaces user space view
Namespaces limit the scope of kernel-side names and data structures, at process granularity:
  mnt (mount points, filesystems): CLONE_NEWNS
  pid (processes): CLONE_NEWPID
  net (network stack): CLONE_NEWNET
  ipc (System V IPC): CLONE_NEWIPC
  uts (UNIX time-sharing: hostname, domain name, etc.): CLONE_NEWUTS
  user (UIDs): CLONE_NEWUSER
The main purpose of a namespace is to isolate whatever it contains from the other namespaces running on the same kernel.
namespaces user space view
Three system calls for namespace management:
  clone(): new process, new namespace, attach the process to the namespace
  unshare(): new namespace, attach the current process to it
  setns(int fd, int nstype): join an existing namespace
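A minimal sketch of unshare() from the list above (my example, not from the talk): move the calling process into a fresh uts namespace and change the hostname there; the change is invisible outside the namespace. Needs CAP_SYS_ADMIN (run as root):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/utsname.h>

int main(void)
{
    /* detach from the parent's uts namespace */
    if (unshare(CLONE_NEWUTS) == -1) { perror("unshare"); return 1; }

    /* only this namespace sees the new hostname */
    const char *name = "inside-ns";
    if (sethostname(name, strlen(name)) == -1) { perror("sethostname"); return 1; }

    struct utsname u;
    uname(&u);
    printf("hostname inside the new uts namespace: %s\n", u.nodename);
    return 0;
}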
namespaces user space view
each namespace is identified by a unique inode
six entries (inodes) are added under /proc/<pid>/ns/
two processes are in the same namespace if they see the same inode for the corresponding namespace type (mnt, net, user, ...); see the sketch below
User space utilities
* iproute2 (ip netns add, etc.)
* unshare, nsenter (part of util-linux)
* shadow / shadow-utils (for the user namespace)
nsenter is a wrapper around setns()
unshare has support for all six namespaces
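The inode rule above can be checked directly (my sketch, not from the talk): stat() the ns entries of two processes and compare st_ino, here for the current process against PID 1:

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct stat self, init;
    /* reading another process's /proc/<pid>/ns entries may need privileges */
    if (stat("/proc/self/ns/uts", &self) == -1 ||
        stat("/proc/1/ns/uts", &init) == -1) {
        perror("stat");
        return 1;
    }
    printf("own uts ns inode:   %lu\n", (unsigned long) self.st_ino);
    printf("pid 1 uts ns inode: %lu\n", (unsigned long) init.st_ino);
    puts(self.st_ino == init.st_ino ? "same namespace" : "different namespaces");
    return 0;
}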
namespaces kernel space view
include / linux / nsproxy.h include / linux / cred.h
task_struct nsproxy cred
atomic_t count
struct nsproxy *nsproxy ...
struct cred *cred struct uts_namespace *uts_ns struct user_namespace *user_ns
struct ipc_namespace *ipc_ns
struct mnt_namespace *mnt_ns
struct pid_namespace *pid_ns_for_children
struct net *net_ns
include / linux / nsproxy.h
nsproxy* task_nsproxy(struct task_struct *tsk)
For each namespace type, a default namespace exists (the global namespace)
struct nsproxy is shared by all tasks with the same set of namespaces
namespaces kernel space view
Example for the uts namespace
[Diagram: task_struct -> struct nsproxy -> struct uts_namespace *uts_ns, which embeds a struct new_utsname (include/uapi/linux/utsname.h) with fields char sysname[], char nodename[], char release[], char version[], char machine[], char domainname[]]
global access to hostname: system_utsname.nodename
namespace-aware access to hostname: current->nsproxy->uts_ns->name.nodename
namespaces kernel space view
Example for the net namespace
[Diagram: task_struct -> struct nsproxy -> struct net *net_ns (include/net/net_namespace.h). struct net is a logical copy of the network stack:
  loopback device
  all network tables (routing, etc.)
  all sockets
  /procfs and /sysfs entries]
a network device belongs to exactly one network namespace
a socket belongs to exactly one network namespace
a new network namespace only includes the loopback device (see the sketch below)
communication between namespaces is possible using veth pairs or Unix sockets
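A quick way to see the loopback-only rule (my sketch, not from the talk): unshare the network namespace and enumerate the interfaces that remain; needs root:

#define _GNU_SOURCE
#include <sched.h>
#include <net/if.h>
#include <stdio.h>

int main(void)
{
    /* detach from the parent's network namespace */
    if (unshare(CLONE_NEWNET) == -1) { perror("unshare"); return 1; }

    /* only the loopback device should be listed */
    struct if_nameindex *ifs = if_nameindex();
    if (!ifs) { perror("if_nameindex"); return 1; }
    for (struct if_nameindex *i = ifs; i->if_index != 0; i++)
        printf("%u: %s\n", i->if_index, i->if_name);
    if_freenameindex(ifs);
    return 0;
}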
namespaces summary
Namespaces limit the scope of kernel-side names and data structures, at process granularity:
  mnt (mount points, filesystems)
  pid (processes)
  net (network stack)
  ipc (System V IPC)
  uts (UNIX time-sharing: hostname, domain name, etc.)
  user (UIDs)
The main purpose of a namespace is to isolate whatever it contains from the other namespaces running on the same kernel.
Containers
A lightweight form of resource virtualization based on kernel mechanisms
A container is a userspace construct
Multiple containers run on top of the same kernel, each under the illusion that it is the only one using resources (cpu, memory, disk, network)
Some implementations offer support for:
  container templates
  deployment / migration
  union filesystems
[Diagram taken from the Docker documentation]
Container solutions
Mainline
  Google containers (lmctfy)
    uses cgroups only, offers CPU & memory isolation
    no isolation for: disk I/O, network, filesystem, checkpoint/restore
    adds some cgroup files: cpu.lat, cpuacct.histogram
  LXC: userspace containerisation tools
  Docker
  systemd-nspawn
Forks
  Linux-VServer, OpenVZ
Container solutions: LXC
An LXC container is a userspace process created with the clone() system call (see the sketch below)
  with its own pid namespace
  with its own mnt namespace
  net namespace (configurable via lxc.network.type)
Offers container templates in /usr/share/lxc/templates (shell scripts)
  lxc-create -t ubuntu -n containerName
  this also creates the cgroup /sys/fs/cgroup/<controller>/lxc/containerName
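The core of what LXC does at process-creation time, reduced to one namespace (my sketch, not LXC code): clone() a child into a new pid namespace, where it sees itself as pid 1; needs root. LXC combines this with mnt/net/... namespaces and cgroups:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1024 * 1024];

static int child(void *arg)
{
    /* the first process in a new pid namespace gets pid 1 */
    printf("child: my pid is %d\n", getpid());
    return 0;
}

int main(void)
{
    pid_t p = clone(child, child_stack + sizeof(child_stack),
                    CLONE_NEWPID | SIGCHLD, NULL);
    if (p == -1) { perror("clone"); return 1; }
    printf("parent: child pid is %d\n", p);   /* pid in the parent's namespace */
    waitpid(p, NULL, 0);
    return 0;
}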
Container solutions: Docker
A Linux container engine
  multiple backend drivers
  application-centric rather than machine-centric
  app build tools
  diff-based deployment of updates (AUFS)
  versioning (git-like) and reuse
  links (tunnels) between containers
[Diagram taken from the Docker documentation]
Questions?
Thank you! Lucian Carata
[email protected]
More details
cgroups: http://media.wix.com/ugd/295986_d73d8d6087ed430c34c21f90b0b607fd.pdf
namespaces: http://lwn.net/Articles/531114/ (and the rest of that series)