### Containers "Containers" became popular a few years ago with the emergence of [Docker](https://www.docker.com/), but they are actually the result of a long line of evolution starting with [chroot](https://en.wikipedia.org/wiki/Chroot), a concept which dates all the way back to 1979. The idea of a container, or a chroot, is to run a process or set of processes in a (more or less) isolated environment that's separate from your main operating system. The first iteration, chroot, only isolated the filesystem: chroot would "change" the "root" directory (hence the name) to a subdirectory of the main filesystem, then run a program that would see only files in that subdirectory. Among other things, this was used as a way to prevent rogue programs from accidentally damaging other files on the system. But it wasn't particularly safe, especially because any program running with administrator privileges could play tricks and eventually switch its root back to the "real" root directory. Separately from security, though, it's sometimes interesting to install a different operating system variant in a subdirectory, then chroot into it and run programs that require that operating system version. For example, if you're running the latest version of Debian Linux, but you want to build an application that only builds correctly on the Debian version from 5 years ago, you can install the 5-years-ago Debian files in a directory, chroot into that, and build your application. The main limitation is that your "host" system and your chroot environment share the same kernel version, and rogue programs usually can find a way to escape the chroot, so it's not useful if your inner system is running dangerous code. Partly in response to the limitations of chroot, "virtualization" started to gain popularity around 2001, made famous by VMware. (IBM mainframes had been doing something similar for a few decades, but not many people knew how IBM mainframes worked.) Anyway, virtualization simulates a computer's actual hardware and lets you run a different kernel on the virtual hardware, and a filesystem inside that hardware. This has several advantages, including much stricter security separation and the ability to run a different kernel or even a different "guest" operating system than the one on the host. Virtualization used to be pretty slow, but it's gotten faster and faster over the years, especially with the introduction of "paravirtualization," where we emulate special virtual-only "hardware" that needs special drivers in the guest, in exchange for better performance. On Linux, the easiest type of paravirtualization nowadays is [kvm](https://www.linux-kvm.org/page/Main_Page) (kernel virtual machine), a variant of [QEMU](https://www.qemu.org/). Virtual machines provide excellent security isolation, but at the expense of performance, since every VM instance needs to have its own kernel, drivers, init system, terminal emulators, memory management, swap space, and so on. In response to this, various designers decided to go back to the old `chroot` system and start fixing the isolation limits, one by one. The history from here gets a bit complicated, since there are many, overlapping, new APIs that vary between operating systems and versions. Eventually, this collection of features congealed into what today we call "containers," in products like [OpenVZ](https://en.wikipedia.org/wiki/OpenVZ), [LXC](https://en.wikipedia.org/wiki/LXC), and (most famously) Docker. Why are we talking about all this? Because in this tutorial, we'll use `redo` to build and run three kinds of containers (chroot, kvm, and docker), sharing the same app build process between all three. redo's dependency and parallelism management makes it easy to build multiple container types in parallel, share code between different builds, and use different container types (each with different tradeoffs) for different sorts of testing. You can follow along by [checking out the redo source](../../GettingStarted/) and looking in the [docs/cookbook/container/](https://github.com/apenwarr/redo/tree/master/docs/cookbook/container) directory. ### A Hello World container Most Docker tutorials start at the highest level of abstraction: download someone else's container, copy your program into it, and run your program in a container. In the spirit of redo's low-level design, we're going to do the opposite, starting at the very lowest level and building our way up. The lowest level is, of course, Hello World, which we compiled (with redo of course) in [an earlier tutorial](../hello/):
In fact, our earlier version of Hello World is a great example of redo's safe recursion. Instead of producing an app as part of this tutorial, we'll just `redo-ifchange ../hello/hello` from in our new project, confident that redo will figure out any locking, dependency, consistency, and parallelism issues. (This sort of thing usually doesn't work very well in `make`, because you might get two parallel sub-instances of `make` recursing into the `../hello` directory simultaneously, stomping on each other.) For our first "container," we're just going to build a usable chroot environment containing our program (`/bin/hello`) and the bare minimum requirements of an "operating system": a shell (`/bin/sh`), an init script (`/init`, which will just be a symlink to `/bin/hello`), and, for debugging purposes, the all-purpose [busybox](https://busybox.net/about.html) program. Here's a .do script that will build our simple container filesystem:
There's a little surprise here. Did you see it above? In current versions of redo, the semantics of a .do script producing a directory as its output are undefined. That's because the redo authors haven't yet figured out quite what ought to happen when a .do file creates a directory. Or rather, what should happen *after* you create a directory? Can people `redo-ifchange` on a file inside that newly created directory? If so, what if the new directory contains .do files? What if you `redo-ifchange` one of the sub-files before you `redo-ifchange` the directory that contains it, so that the sub-file's .do doesn't exist yet? And so on. We don't know. So for now, to stop you from depending on this behaviour, we intentionally made it not work. Instead of that, you can have a .do script that produces a *different* directory as a side effect. So above, `simple.fs.do` produces a directory called `simple` when you run `redo simple.fs`. `simple.fs` is the (incidentally empty) target file, which is remembered by redo as a node in the dependency tree, so that other scripts can depend upon it using `redo-ifchange simple.fs`. The `simple` directory happens to materialize too, and redo doesn't know anything about it, which means it doesn't try to do anything about it, and you don't have to care what redo's semantics for it might someday be. In other words, maybe someday we'll find a more elegant way to handle .do files that create directories, but we won't break your old code when we do. Okay? All right, one more catch. Operating systems are complicated, and there's one more missing piece. Our Hello World program is *dynamically linked*, which means it depends on shared libraries elsewhere in the system. You can see exactly which ones by using the `ldd` command: ```shell $ ldd ../hello/hello linux-vdso.so.1 (0x00007ffd1ffca000) libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f9ddf8fd000) /lib64/ld-linux-x86-64.so.2 (0x00007f9ddfe9e000) ``` If we `chroot` into our simplistic "container" and try to run `hello`, it won't work, because those libraries aren't available to programs inside the chroot. That's the whole point of chroot, after all! How do we fix it? We get a list of the libraries with `ldd`, and then we copy the libraries into place. Actually, for reasons we'll address below, let's make a copy of the new filesystem and copy the new libraries into *that*:
So now there's a directory called `simple`, which contains our program and some helper programs, and one called `libs`, which contains all that stuff, plus the supporting libraries. That latter one is suitable for use with chroot. ### Running a container with `unshare` and `chroot` So let's run it! We can teach redo how to start a program inside any chroot by using a `default.do` script. In this case, we'll use `default.runlocal.do`. With that file in place, when we run `redo whatever.runlocal` (for any value of `whatever`), redo will first construct the `whatever` directory (using `redo-ifchange whatever.fs`), and then chroot into it and run `/init` inside. We'll collect stdout into the redo output (ie. the file outside the chroot named `whatever.runlocal`). Also, the stderr will go to redo's build log, readable with [redo-log](/redo-log/) or on the console at build time, and if the `/init` script returns a nonzero exit code, so will our script. As a result, the whole container execution will act like a single node in our build process. It can depend on other things, and other things can depend on it. Just one more thing: once upon a time, `chroot` was only available to sysadmins, not normal users. And it's never a good idea to run your build scripts as root. Luckily, Linux recently got a feature called "user namespaces" (userns), which, among many other things, lets non-administrator users use `chroot`. This is a really great addition. (Unfortunately, some people worry that user namespaces might create security holes. From an abundance of caution, many OSes disable user namespaces for non-administrators by default. So most of this script is just detecting those situations so it can give you a useful warning. The useful part of the script is basically just: `unshare -r chroot "$2" /init >$3`. Alas, the subsequent error handling makes our script look long and complicated.)
Speaking of error handling, the script above calls a script called `./need.sh`, which is just a helper that prints a helpful error message and aborts right away if the listed programs are not available to run, rather than failing in a more complicated way. We'll use that script more extensively below.
And that's it! A super simple container! ```shell $ redo libs.runlocal redo libs.runlocal redo libs.fs redo simple.fs $ time redo libs.runlocal redo libs.runlocal real 0m0.112s user 0m0.060s sys 0m0.024s $ du libs 792 libs/bin 156 libs/lib64 1656 libs/lib/x86_64-linux-gnu 1660 libs/lib 3752 libs $ cat libs.runlocal Hello, world! ``` By the way, if this were a docker tutorial, it would still print "Hello, world!" but your container would be >100 megabytes instead of 3.7 megabytes, and it would have taken at least a couple of seconds to start instead of 0.11 seconds. But we'll get to that later. First, now that we have a container, let's do more stuff with it! ### Running a container with `kvm` and `initrd` Now you've seen chroot in action, but we can run almost the same container in `kvm` (kernel virtual machine) instead, with even greater isolation. `kvm` only runs on Linux, so for this step you'll need a Linux machine. And for our example, we'll just have it run exactly the same kernel you're already using, although kvm has the ability to use whatever kernel you want. (You could even build a kernel as part of your redo project, redo-ifchange it, and then run it with kvm. But we're not going to do that.) Besides a kernel, kvm needs an "initial ramdisk", which is where it'll get its filesystem. (kvm can't exactly access your normal filesystem, because it's emulating hardware, and there's no such thing as "filesystem hardware." There are tools like the [9p filesystem](https://www.kernel.org/doc/Documentation/filesystems/9p.txt) that make this easier, but it's not available in all kernel builds, so we'll avoid it for now.) "Initial ramdisk" (initrd) sounds fancy, but it's actually just a tarball (technically, a [cpio](https://en.wikipedia.org/wiki/Cpio) archive) that the kernel extracts into a ramdisk at boot time. Since we already have the files, making the tarball is easy:
(Ignore that `try_fakeroot.sh` thing for now. We'll get to it a bit further down. In our `simple.fs` example, it's a no-op anyway.) The main thing you need to know is that, unlike tar, cpio takes a list of files on stdin instead of on the command line, and it doesn't recurse automatically (so if you give it a directory name, it'll store an entry for that directory, but not its contents, unless you also provide a list of its contents). This gives us a lot of power, which we'll use later. For now we're just doing basically `find | cpio -o`, which takes all the files and directories and puts them in a cpio archive file. ```shell $ redo libs.initrd redo libs.initrd 5163 blocks 1 block $ cpio -t Configuring a virtual machine can get a little complicated, and there are a million things we might want to do. One of the most important is setting the size of the ramdisk needed for the initrd. Current Linux versions limit the initrd to half the available RAM in the (virtual) machine, so to be safe, we'll make sure to configure kvm to provide at least 3x as much RAM as the size of the initrd. Here's a simple script to calculate that:
With all those pieces in place, actually executing the kvm is pretty painless. Notice in particular the three serial ports we create: one for the console (stderr), one for the output (stdout), and one for the exit code:
And it works! ```shell $ redo libs.runkvm redo libs.runkvm redo libs.initrd 5163 blocks 1 block libs: kvm memory required: 70M [ 0.306682] reboot: Power down ok. $ time redo libs.runkvm redo libs.runkvm libs: kvm memory required: 70M [ 0.295139] reboot: Power down ok. real 0m0.887s user 0m0.748s sys 0m0.112s $ cat libs.runkvm Hello, world! ``` Virtual machines have come a long way since 1999: we managed to build an initrd, boot kvm, run our program, and shut down in only 0.9 seconds. It could probably go even faster if we used a custom-built kernel with no unnecessary drivers. ### A real Docker container Okay, that was fun, but nobody in real life cares about all these fast, small, efficient isolation systems that are possible for mortals to understand, right? We were promised a **Container System**, and a container system has daemons, and authorization, and quotas, and random delays, and some kind of Hub where I can download (and partially deduplicate) someone else's multi-gigabyte Hello World images that are built in a highly sophisticated enterprise-ready collaborative construction process. Come on, tell me, can redo do **that**? Of course! But we're going to get there the long way. First, let's use the big heavy Container System with daemons and delays to run our existing tiny understandable container. After that, we'll show how to build a huge incomprehensible container that does the same thing, so your co-workers will think you're normal. #### Docker and layers Normal people build their Docker containers using a [Dockerfile](https://docs.docker.com/engine/reference/builder/). A Dockerfile is sort of like a non-recursive redo, or maybe a Makefile, except that it runs linearly, without the concept of dependencies or parallelization. In that sense, I guess it's more like an IBM mainframe job control script from 1970. It even has KEYWORDS in ALL CAPS, just like 1970. Dockerfiles do provide one really cool innovation over IBM job control scripts, which is that they cache intermediate results so you don't have to regenerate it every time. Basically, every step in a Dockerfile copies a container, modifies it slightly, and saves the result for use in the next step. If you modify step 17 and re-run the Dockerfile, it can just start with the container produced by step 16, rather than going all the way back to step 1. This works pretty well, although it's a bit expensive to start and stop a container at each build step, and it's unclear when and how interim containers are expunged from the cache later. And some of your build steps are "install the operating system" and "install the compiler", so each step produces a larger and larger container. A very common mistake among Docker users is to leave a bunch of intermediate files (source code, compilers, packages, etc) installed in the output container, bloating it up far beyond what's actually needed to run the final application. Spoiler: we're not going to do it that way. Instead, let's use redo to try to get the same Dockerfile advantages (multi-stage cache; cheap incremental rebuilds) without the disadvantages (launching and unlaunching containers; mixing our build environment with our final output). To understand how we'll do this, we need to talk about [Layers](https://medium.com/@jessgreb01/digging-into-docker-layers-c22f948ed612). Unlike our kvm initrd from earlier, a Docker image is not just a single tarball; it's a sequence of tarballs, each containing the set of files changed at each step of the build process. This layering system is how Docker's caching and incremental update system works: if I incrementally build an image starting from step 17, based on the pre-existing output from step 16, then the final image can just re-use layers 1..16 and provide new layers 17..n. Usually, the first few layers (install the operating system, install the compilers, etc) are the biggest ones, so this means a new version of an image takes very little space to store or transfer to a system that already has the old one. The inside of a docker image looks like this: ```shell $ tar -tf test.image ae5419fd49e39e4dc0baab438925c1c6e4417c296a8b629fef5ea93aa6ea481c/ ae5419fd49e39e4dc0baab438925c1c6e4417c296a8b629fef5ea93aa6ea481c/VERSION ae5419fd49e39e4dc0baab438925c1c6e4417c296a8b629fef5ea93aa6ea481c/json ae5419fd49e39e4dc0baab438925c1c6e4417c296a8b629fef5ea93aa6ea481c/layer.tar b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/ b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/VERSION b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/json b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/layer.tar ``` We could use redo to build a Docker image by simply making a single `layer.tar` of the filesystem (like we did with initrd), adding a VERSION and json file, and putting those three things into an outer tarball. But if we want a system that works as well as a Dockerfile, we'll have to make use of multiple layers. Our `simple` container is already pretty tiny by container standards - 2.6MB - but it's still a bit wasteful. Most of that space turns out to be from the dynamic libraries we imported from the host OS. These libraries don't change when we change Hello World! They belong in their own layer. Up above, in preparation for this moment, we created `libs.fs.do` to build a separate filesystem, rather than adding the libraries inside `simple.fs.do`, which would have been easier. Now we can make each of those filesystems its own layer. There's one more complication: we did things a bit backwards. In a Dockerfile, you install the libraries first, and then you install your application. When you replace your application, you replace only the topmost layer. We did it the other way around: we installed our application and some debugging tools, then detected which libraries they need and added a layer on top. The most recent versions of Docker, 1.10 and above, are more efficient about handling layers changing in the middle of the stack, but not everyone is using newer Docker versions yet, so let's try to make things efficient for older Docker versions too. Luckily, since we're starting from first principles, in redo we can do anything we want. We have to generate a tarball for each layer anyway, so we can decide what goes into each layer and then we can put those layers in whatever sequence we want. Let's start simple. A layer is just a tarball made of a set of files (again, ignore the `try_fakeroot` stuff for now):
The magic, of course, is in deciding which files go into which layers. In the script above, that's provided in the .list file corresponding to each layer. The .list file is produced by `default.list.do`:
This requires a bit of explanation. First of all, you probably haven't seen the very old, but little-known `comm` program before. It's often described as "compare two sorted files" or "show common lines between two files." But it actually does more than just showing common lines: it can show the lines that are only in file #1, or only in file #2, or in both files. `comm -1 -3` *suppresses* the output of lines that are only in #1 or that are in both, so that it will print the lines that are only in the second file. If we want to make a `libs.layer` that contains only the files that are *not* in `simple`, then we can use `comm -1 -3` to compare `simple` with `libs`. Now, this script is supposed to be able to construct the file list for any layer. To do that, it has to know what parent to compare each layer against. We call that the "diffbase", and for layers that are based on other layers, we put the name of the parent layer in its diffbase file:
(If there's no diffbase, then we use /dev/null as the diffbase. Because if file #1 is empty, then *all* the lines are only in file #2, which is exactly what we want.) There's just one more wrinkle: if we just compare lists of files, then we'll detect newly-added files, but we won't detect *modified* files. To fix this, we augment the file list with file checksums before the comparison (using `fileids.py`), then strip the checksums back out in `default.layer.do` before sending the resulting list to `cpio`. The augmented file list looks like this: ```shell $ cat simple.list . 0040755-0-0-0 ./bin 0040755-0-0-0 ./bin/busybox 0100755-0-0-ba34fb34865ba36fb9655e724266364f36155c93326b6b73f4e3d516f51f6fb2 ./bin/hello 0100755-0-0-22e4d2865e654f830f6bfc146e170846dde15185be675db4e9cd987cb02afa78 ./bin/sh 0100755-0-0-e803088e7938b328b0511957dcd0dd7b5600ec1940010c64dbd3814e3d75495f ./init 0120777-0-0-14bdc0fb069623c05620fc62589fe1f52ee6fb67a34deb447bf6f1f7e881f32a ``` (Side note: the augmentation needs to be added at the end of the line, not the beginning so that the file list is still sorted afterwards. `comm` only works correctly if both input files are sorted.) The script for augmenting the file list is fairly simple. It just reads a list of filenames on stdin, checksums those files, and writes the augmented list on stdout:
Just one more thing! Docker (before 1.10) deduplicates images by detecting that they contain identical layers. When using a Dockerfile, the layers are named automatically using random 256-bit numbers (UUIDs). Since Dockerfiles usually don't regenerate earlier layers, the UUIDs of those earlier layers won't change, so future images will contain layers with known UUIDs, so Docker doesn't need to deduplicate them. We don't want to rely on never rebuilding layers. Instead, we'll adopt a technique from newer Docker versions (post 1.10): we'll name layers after a checksum of their contents. Now, we don't want to actually checksum the `whatever.layer` file, because it turns out that tarballs contain a bunch of irrelevant details, like inode numbers and [mtimes](https://apenwarr.ca/log/20181113), so they'll have a different checksum every time they're built. Instead, we'll make a digest of the `whatever.list` file, which conveniently already has a checksum of each file's contents, plus the interesting subset of the file's attributes. Docker expects 256-bit layer names, so we might normally generate a sha256 digest using the `sha256sum` program, but that's not available on all platforms. Let's write a python script to do the job instead. To make it interesting, let's write it as a .do file, so we can generate the sha256 of `anything` by asking for `redo-ifchange anything.sha256`. This is a good example of how in redo, .do files can be written in any scripting language, not just sh.
Let's test it out: ```shell $ redo simple.list.sha256 redo simple.list.sha256 redo simple.list $ cat simple.list.sha256 4d1fda9f598191a4bc281e5f6ac9c27493dbc8dd318e93a28b8a392a7105c145 $ rm -rf simple $ redo simple.list.sha256 redo simple.list.sha256 redo simple.list redo simple.fs $ cat simple.list.sha256 4d1fda9f598191a4bc281e5f6ac9c27493dbc8dd318e93a28b8a392a7105c145 ``` Consistent layer id across rebuilds! Perfect. #### Combining layers: building a Docker image We're almost there. Now that we can produce a tarball for each layer, we have to produce the final tarball that contains all the layers in the right order. For backward compatibility with older Docker versions, we also need to produce a json "manifest" for each layer. In those old versions, each layer was also its own container, so it needed to have all the same attributes as a container, including a default program to run, list of open ports, and so on. We're never going to use those values except for the topmost layer, but they have to be there, so let's just auto-generate them. Here's the script for customizing each layer's json file from a template:
And here's the empty template:
Now we just need to generate all the layers in a subdirectory, and tar them together:
This requires a list of layers for each image we might want to create. Here's the list of two layers for our `simple` container:
Finally, some people like to compress their Docker images for transport or uploading to a repository. Here's a nice .do script that can produce the .gz compressed version of any file:
Notice the use of `--rsyncable`. Very few people seem to know about this gzip option, but it's immensely handy. Normally, if a few bytes change early in a file, it completely changes gzip's output for all future bytes, which means that incremental copying of new versions of a file (eg. using `rsync`) is very inefficient. With `--rsyncable`, gzip does a bit of extra work to make sure that small changes in one part of a file don't affect the gzipped bytes later in the file, so an updated container will be able to transfer a minimal number of bytes, even if you compress it. Let's try it out! ```shell $ redo simple.image.gz redo simple.image.gz redo simple.image redo libs.list.sha256 redo libs.list redo simple.list redo libs.layer 3607 blocks redo simple.list.sha256 redo simple.layer 1569 blocks layer: b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95 libs layer: 4d1fda9f598191a4bc281e5f6ac9c27493dbc8dd318e93a28b8a392a7105c145 simple flow:~/src/redo/docs/cookbook/container $ tar -tf simple.image.gz 4d1fda9f598191a4bc281e5f6ac9c27493dbc8dd318e93a28b8a392a7105c145/ 4d1fda9f598191a4bc281e5f6ac9c27493dbc8dd318e93a28b8a392a7105c145/VERSION 4d1fda9f598191a4bc281e5f6ac9c27493dbc8dd318e93a28b8a392a7105c145/json 4d1fda9f598191a4bc281e5f6ac9c27493dbc8dd318e93a28b8a392a7105c145/layer.tar b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/ b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/VERSION b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/json b65ae6e742f8946fdc3fbdccb326378162641f540e606d56e1e638c7988a5b95/layer.tar ``` In the above, notice how we build libs.layer first and simple.layer second, because that's the order of the layers in `simple.image.layers`. But to produce `libs.list` we need to compare the file list against `simple.list`, so it declares a dependency on `simple.list`. The final `simple.image` tarball then includes the layers in *reverse* order (topmost to bottommost), because that's how Docker does it. The id of the resulting docker image is the id of the topmost layer, in this case 4d1fda9f. #### Loading and running a Docker image Phew! Okay, we finally have a completed Docker image in the format Docker expects, and we didn't have to execute even one Dockerfile. Incidentally, that means all of the above steps could run without having Docker installed, and without having any permissions to talk to the local Docker daemon. That's a pretty big improvement (in security and manageability) over running a Dockerfile. The next step is to load the image into Docker, which is easy:
And finally, we can ask Docker to run our image, and capture its output like we did, so long ago, in `default.runlocal.do` and `default.runkvm.do`:
The result is almost disappointing in its apparent simplicity: ```shell $ time redo simple.rundocker redo simple.rundocker redo simple.load real 0m2.688s user 0m0.068s sys 0m0.036s $ cat simple.rundocker Hello, world! ``` Notice that, for some reason, Docker takes 2.7s to load, launch and run our tiny container. That's about 3x as long as it takes to boot and run a kvm virtual machine up above with exactly the same files. This is kind of weird, since containers are supposed to be much more lightweight than virtual machines. I'm sure there's a very interesting explanation for this phenomenon somewhere. For now, notice that you might save a lot of time by initially testing your containers using `default.runlocal` (0.11 seconds) instead of Docker (2.7 seconds), even if you intend to eventally deploy them in Docker. ### A Debian-based container We're not done yet! We've built and run a Docker container the hard way, but we haven't built and run an **unnecessarily wastefully huge** Docker container the hard way. Let's do that next, by installing Debian in a chroot, then packaging it up into a container. As we do that, we'll recycle almost all the redo infrastructure we built earlier while creating our `simple` container. #### Interlude: Fakeroot It's finally time to talk about that mysterious `try_fakeroot.sh` script that showed up a few times earlier. It looks like this:
[fakeroot](https://wiki.debian.org/FakeRoot) is a tool, originally developed for the Debian project, that convinces your programs that they are running as root, without actually running them as root. This is mainly so that they can pretend to chown() files, without actually introducing security holes on the host operating system. Debian uses this when building packages: they compile the source, start fakeroot, install to a fakeroot directory, make a tarball of that directory, then exit fakeroot. The tarball then contains the permissions they want. Normally, fakeroot forgets all its simulated file ownership and permissions whenever it exits. However, it has `-s` (save) and `-i` (input) options for saving the permissions to a file and reloading the permissions from that file, respectively. As we build our container layers, we need redo to continually enter fakeroot, do some stuff, and exit it again. The `try_fakeroot.sh` script is a helper to make that easier. #### Debootstrap The next Debian tool we should look at is [debootstrap](https://wiki.debian.org/Debootstrap). This handy program downloads and extracts the (supposedly) minimal packages necessary to build an operational Debian system in a chroot-ready subdirectory. Nice! In order for debootstrap to work without being an administrator - and you should not run your build system as root - we'll use fakeroot to let it install all those packages. Unfortunately, debootstrap is rather slow, for two reasons: 1. It has to download a bunch of things. 2. It has to install all those things. And after debootstrap has run, all we have is a Debian system, which by itself isn't a very interesting container. (You usually want your container to have an app so it does something specific.) Does this sound familiar? It sounds like a perfect candidate for Docker layers. Let's make three layers: 1. Download the packages. 2. Install the packages. 3. Install an app. Here's step one:
On top of that layer, we run the install process:
Since both steps run debootstrap and we might want to customize the set of packages to download+install, we'll put the debootstrap options in their own shared file:
And finally, we'll produce our "application" layer, which in this case is just a shell script that counts then number of installed Debian packages:
#### Building the Debian container Now that we have the three filesystems, let's actually generate the Docker layers. But with a catch: we won't actually include the layer for step 1, since all those package files will never be needed again. (Similarly, if we were installing a compiler - and perhaps redo! - in the container so we could build our application in a controlled environment, we might want to omit the "install compiler" layers from the final product.) So we list just two layers:
And the 'debian' layer's diffbase is `debootstrap`, so we don't include the same files twice:
#### Running the Debian container This part is easy. All the parts are already in place. We'll just run the existing `default.rundocker.do`: ```shell $ time redo debian.rundocker redo debian.rundocker redo debian.load redo debian.image redo debian.list.sha256 redo debian.list redo debian.layer 12 blocks layer: a542b5976e1329b7664d79041d982ec3d9f7949daddd73357fde17465891d51d debootstrap layer: d5ded4835f8636fcf01f6ccad32125aaa1fe9e1827f48f64215b14066a50b9a7 debian real 0m7.313s user 0m0.632s sys 0m0.300s $ cat debian.rundocker 82 ``` It works! Apparently there are 82 Debian packages installed. It took 7.3 seconds to load and run the docker image though, probably because it had to transfer the full contents of those 82 packages over a socket to the docker server, probably for security reasons, rather than just reading the files straight from disk. Luckily, our chroot and kvm scripts also still work: ```shell $ time redo debian.runlocal redo debian.runlocal real 0m0.084s user 0m0.052s sys 0m0.004s $ cat debian.runlocal 82 $ time redo debian.runkvm redo debian.runkvm redo debian.initrd 193690 blocks 1 block debian: kvm memory required: 346M [ 0.375365] reboot: Power down ok. real 0m3.445s user 0m1.008s sys 0m0.644s $ cat debian.runkvm 82 ``` #### Testing and housekeeping Let's finish up by providing the usual boilerplate. First, an `all.do` that builds, runs, and tests all the images on all the container platforms. This isn't a production build system, it's a subdirectory of the redo package, so we'll skip softly, with a warning, if any of the components are missing or nonfunctional. If you were doing this in a "real" system, you could just let it abort when something is missing.
And here's a `redo clean` script that gets rid of (most of) the files produced by the build. We say "most of" the files, because actually we intentionally don't delete the debdownload and debootstrap directories. Those take a really long time to build, and redo knows to rebuild them if their dependencies (or .do files) change anyway. So instead of throwing away their content on 'redo clean', we'll keep it around.
Still, we want a script that properly cleans up everything, so let's have `redo xclean` (short for "extra clean") wipe out the last remaining files: