Hello, and welcome back to CS631 "Advanced Programming in the UNIX Environment". This is week 13, segment 6, and we're completing our discussion of how to restrict processes by looking at POSIX Capabilities, Linux Control Groups or "cgroups", and how these and the various other methods we've discussed in the last few videos allow us to build containers like Docker or LXC. As we've seen methods to restrict CPU usage as well as filesystem views, memory, and process table access, we'll also want to restrict other capabilities such that we can better contain and control process groups.

---

One approach to defining the more generic requirements here is "POSIX Capabilities". In this model, rather than trying to solve one specific problem -- as in the case of a restricted shell or a chroot -- we identify the generic "capability" that a process needs and grant fine-grained access controls over these. For example, the following capabilities may be defined:

- CAP_CHOWN - the ability to chown files
- CAP_SETUID - to allow setuid
- CAP_LINUX_IMMUTABLE - to allow the append-only or immutable flags we've seen in a previous video
- a capability to allow network sockets to bind ports below 1024
- a capability to allow interface configuration and routing table manipulation
- a capability to allow the use of raw packets
- ...and more collective capabilities, such as CAP_SYS_ADMIN, which provides broad sysadmin privileges: mounting file systems, setting the hostname, handling swap, etc.

As so often, the standard is interpreted and implemented by different operating systems in different ways. For example, on FreeBSD, capsicum(4) implements a capability and sandbox framework; NetBSD and macOS implement a kernel authorization framework called "kauth"; on Linux systems you can read about the implementation in the capabilities(7) manual page.
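To get a feel for this model on a Linux system, here is a small sketch: every process carries several capability sets, which we can inspect directly; and with root privileges and the libcap tools, an individual capability can be attached to an executable (the path /tmp/webserver-demo below is purely hypothetical).

```shell
# Every Linux process carries capability sets (inheritable, permitted,
# effective, ...), visible in /proc/<pid>/status.  Inspect our own:
grep '^Cap' /proc/self/status

# With root privileges and the libcap tools, a single capability can be
# attached to an executable so an unprivileged user gains just that one
# privilege -- the binary here is a hypothetical example:
#
#   setcap 'cap_net_bind_service=+ep' /tmp/webserver-demo
#   getcap /tmp/webserver-demo
#
# The binary may then bind ports below 1024 without otherwise holding
# any root privileges.
```

Note how this is much finer-grained than the all-or-nothing setuid-root approach we discussed earlier in the semester.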
---

Another way to partition the system and restrict the visibility of resources to processes and process groups is Linux Namespaces, inspired by Bell Labs' Plan 9 operating system. Using namespaces, one process may view components of the system differently from -- or not at all, compared to -- another process group. These resources may exist in multiple namespaces, which allows for a high degree of flexibility in carving up the system to expose only what's needed to a given process group. The different types of resources made available via namespaces are:

- mount points
- process ID visibility
- a virtualized network stack, whereby each namespace has its own set of IP addresses, its own routing table, firewall rules, etc.
- System V IPC visibility - semaphores, shared memory, and message queue kernel structures
- a UNIX Time-Sharing (UTS) namespace allowing for different host- and domain names
- user namespaces that allow mapping of different user IDs such that, for example, the root account is mapped to a non-privileged account in a given namespace
- a namespace to allow different processes to utilize different system times
- and finally, so-called control groups

---

Now these control groups or "cgroups" -- or "process containers", as they were initially called -- allow for isolation of the different kinds of resource utilization we've seen:

- memory limits
- CPU utilization, prioritization, and limits
- accounting - that is, keeping track of which processes utilize which resources in what ways
- process control, allowing for suspension, interruption, application checkpointing, and restarting of processes

---

cgroups were redesigned at least once, and version two now supports the following controls:

- the ability to schedule tasks
- CPU utilization
- the activity of control groups themselves; tasks in frozen groups would then not be scheduled, for example
- large page support and usage
- block device I/O
- memory, kernel memory, and swap memory
- the ability to monitor threads
- restrictions on the number of processes available
- as well as "remote direct memory access"

---

cgroups are implemented as a virtual file system, often mounted at /sys/fs/cgroup, which allows different controllers to be enabled via mount options. Here's an example of restricting the memory usage of the current shell: creating a new cgroup is trivially done by creating a new directory in the pseudo filesystem, placing the process ID of the shell into it, and then adding a conveniently human-readable restriction. The manual page is actually very detailed and extensive here, and I recommend you take the time to read through it.

---

Now containers, at last, are, as the name suggests, a way to _contain_ processes. That is, they provide an isolated execution environment on -- and this is the important distinction from full hardware virtualization -- the same operating system, providing a lightweight approach. In order to "contain" processes, you might:

- use null and union mounts to provide the right file system view
- restrict processes in their resource utilization to avoid interference with the parent system or other processes
- restrict filesystem visibility beyond the assigned views
- restrict which other processes a process can see
- restrict what processes are allowed to do

Even though we've discussed many ways to apply such restrictions, in this context cgroups and namespaces are frequently discussed together, as they complement each other well. In fact, the combination of cgroups and namespaces forms the basis for many operating-system-level virtualization and container technologies, such as CoreOS, LXC, or Docker.
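The memory-restriction steps mentioned above might look like the following sketch on a Linux system with cgroup v2 mounted at /sys/fs/cgroup. This requires root, and the group name "demo" is an arbitrary choice:

```shell
# Creating a directory in the pseudo filesystem creates a new cgroup.
mkdir /sys/fs/cgroup/demo

# Move the current shell into the new group by writing its PID.
echo $$ > /sys/fs/cgroup/demo/cgroup.procs

# Apply a conveniently human-readable memory limit.
echo 100M > /sys/fs/cgroup/demo/memory.max

# The shell and all its children are now limited to 100 MB of memory;
# exceeding the limit invokes the OOM killer for this group only.
cat /sys/fs/cgroup/demo/memory.max
```

Removing the restriction is just as simple: move the processes back out and rmdir(1) the group's directory.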
---

Consider the basic operating system, with the layered distinction we've used from the beginning of the semester: we have hardware at the bottom, a kernel managing the hardware, a set of system calls as the interface into the kernel, as well as a number of library functions to allow applications to execute within the OS.

---

Now access to the hardware includes broad access by the kernel, but it also means that processes do, by and large, retain the same view of the hardware. The filesystem, process space, and networking capabilities are the same for each process, even if we can restrict what each can directly manipulate.

---

In full hardware virtualization, things are a bit different: we still have hardware and a kernel managing it, but then we have a slim OS on top of that: the hypervisor, which virtualizes the hardware and makes it available to each VM. Within the VM, each OS sees only what the hypervisor makes available, but each application within that OS again behaves the same as any process executing on physical hardware.

---

When talking about lightweight OS-level virtualization, we have a different situation. That is, we start out again with the basic view we're familiar with, but we can now apply the various lessons from this series of videos and, for example:

- use a restricted shell in combination with certain mount options, a fixed CPU priority, and some file attributes to craft a restricted view of the filesystem and restrict process execution capabilities;
- use a jail in conjunction with cpusets and ACLs to restrict the process-, filesystem-, and network view of certain processes; or
- create a per-process or per-process-group restricted environment consisting of finely tuned namespaces, cgroups, and resource limits.
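As a small sketch of the namespace half of that last combination, here is what carving out a private UTS (hostname) namespace might look like using util-linux's unshare(1). This assumes a Linux system that permits unprivileged user namespaces; otherwise it needs root:

```shell
# Create a new user namespace (mapping our UID to root inside it) and a
# new UTS namespace, then change the hostname within that namespace only.
unshare --user --map-root-user --uts sh -c 'hostname container-demo; hostname'

# Back in the parent namespace, the hostname is unchanged.
hostname
```

Container runtimes perform essentially this dance -- across all the namespace types listed earlier, combined with a cgroup -- before exec'ing the contained process.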
And that is really all "containers" are: processes running on our general-purpose Unix operating system that we have restricted such that they can't see or access all the resources; they are contained to only the views we allow. Note, however, that unlike with full hardware virtualization, containers are still processes or process groups. That is, they still run on the same kernel as the "host" or "parent"; this is both an advantage (instantiation of the virtual environment is much faster than, for example, booting a virtual machine) and a limitation (you can only run a container of the given OS, not another OS).

Well, that sums up our tour of the various ways of restricting processes, the techniques and technologies that lead up to the ever-popular containers. There are many other related approaches, and we've only just scratched the surface, but I hope that you've at least seen that there is no magic: everything we've covered this semester should enable you to better understand, say, Docker and friends.

Perhaps the most important lessons to draw here are that most process restrictions can be circumvented in some way, and that the goal is to _voluntarily_ restrict yourself such that a compromise cannot gain an attacker elevated privileges that you may previously have held; understanding Unix processes and their base semantics is critical in setting up and configuring such restricted environments. That, and that we've actually come a pretty long way from our first lecture - perhaps go back and revisit some of the topics now with an eye towards these concepts.

Either way - thanks for watching, and until next time. Cheers!