Containers and Security: Seccomp


When working with potentially dangerous, unverified, or simply unfinished software, developers often use sandboxes: special environments that isolate programs and restrict their access to data outside the environment. Sandboxes limit the software’s network access, its interactions with the OS, and its access to I/O devices.

Lately, people have been turning more and more towards containers for launching unverified and non-secure software.

Despite their similarities, containers are not the same as sandboxes, if only for the fact that sandboxes are often designed for a specific application and containerization is a more universal technology. It’s also worth mentioning that an application launched in a container could gain access to the kernel and compromise it. This is why today’s containerization tools use mechanisms for boosting security. In today’s article, we’d like to talk about one of these mechanisms: seccomp.

First we’ll look at the principles behind seccomp, and then demonstrate its usage in Docker.

Seccomp: An Introduction

Seccomp (short for secure computing) is a Linux kernel mechanism that lets you restrict the system calls a process can make. If attackers manage to run their own code inside the process, seccomp won’t let them use any calls outside the set that was allowed in advance.

Seccomp was added to the Linux kernel in 2005 (version 2.6.12). Google later adopted it for sandboxing in Chrome, where it is used, among other things, for launching plugins.

To activate seccomp, we use the system call prctl().

Let’s see how this works with a simple program:

#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <linux/seccomp.h>
 
int main () {
  pid_t pid;
 
  printf("Step 1: no restrictions yet\n");
 
  prctl (PR_SET_SECCOMP, SECCOMP_MODE_STRICT);
  printf ("Step 2: entering the strict mode. Only read(), write(), exit() and sigreturn() syscalls are allowed\n");
 
  pid = getpid ();
  printf ("!!YOU SHOULD NOT SEE THIS!! My PID = %d", pid);
 
  return 0;
}

We save the program as seccomp1.c, compile it and launch:

$ gcc seccomp1.c -o seccomp1
 
$ ./seccomp1

We’ll see the following printout in the console:

Step 1: no restrictions yet
Step 2: entering the strict mode. Only read(), write(), exit() and sigreturn() syscalls are allowed
Killed

To see where this particular printout came from, we use strace:

$ strace ./seccomp1
 
 
/output fragment/
 
prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) = 0
write(1, "Step 2: entering the strict mode"..., 100Step 2: entering the strict mode. Only read(), write(), exit() and sigreturn() syscalls are allowed
) = 100
+++ killed by SIGKILL +++
Killed

So what happened? We activated seccomp using the system call prctl and enabled strict mode. Afterwards, our program tried to find the PID of the current process with the getpid() system call and the assigned restrictions prevented it: the SIGKILL signal was sent to the process and it was terminated.

As we can see, seccomp performs its job perfectly well. However, strict mode isn’t that convenient, since it doesn’t let you choose which system calls to allow and which to prohibit. To solve this issue, we can use the BPF (Berkeley Packet Filter) mechanism.

Seccomp and BPF Filters

The BPF mechanism was initially created for filtering network packets, but its range of applications has since grown significantly. Today, BPF is also used for tracing the Linux kernel (you can read an interesting article about this on Brendan Gregg’s blog). In 2012 it was integrated with seccomp; the resulting extension is called seccomp-bpf.

Writing BPF programs by hand is fairly complicated. We won’t get into the particulars of BPF syntax in this article (that’s a topic for another day); instead we’ll use the libseccomp library, which provides a simple, easy-to-use API for filtering system calls.
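For a rough sense of what libseccomp hides, here is a minimal hand-written filter. It is only a sketch and not part of the original examples: it makes getpid() fail with EPERM and allows everything else. The function names and the exit code 99 are hypothetical, and a production filter would also check the architecture field of seccomp_data before trusting the system call number.

```c
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Install a filter that makes getpid() fail with EPERM and allows all
   other calls. Returns 0 on success, -1 on failure. (Sketch only: no
   architecture check is performed.) */
int deny_getpid(void) {
    struct sock_filter filter[] = {
        /* A = seccomp_data.nr, the system call number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* if A == getpid: run the next instruction, else skip it */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getpid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* Required so that an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) == 0 ? 0 : -1;
}

/* Apply the filter in a forked child and return the errno that getpid()
   produced there (0 if the call succeeded, -1 if the filter could not be
   installed). Hypothetical helper for demonstration. */
int getpid_errno_in_filtered_child(void) {
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        if (deny_getpid() != 0)
            _exit(99);
        errno = 0;
        long result = syscall(SYS_getpid);
        _exit(result == -1 ? errno : 0);
    }

    int status;
    if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    int code = WEXITSTATUS(status);
    return code == 99 ? -1 : code;
}
```

Even this four-instruction filter needs manual jump offsets and bit masks, which is the bookkeeping that libseccomp takes over.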

Libseccomp can be installed from the standard package manager:

$ sudo apt-get install libseccomp-dev

Now we’ll write a small program:

#include <stdio.h>
#include <unistd.h>
#include <seccomp.h>
 
int main() {
    pid_t pid;
 
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
 
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(sigreturn), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
 
 
    printf ("No restrictions yet\n");
 
    seccomp_load(ctx);
    pid = getpid();
    printf("!! YOU SHOULD NOT SEE THIS!! My PID is %d\n", pid);
 
    return 0;
}

Let’s look at this code line by line.

scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);

Here we initialize the filter and define the default action. In our case, it’s SCMP_ACT_KILL, which immediately kills any process that makes a forbidden system call.

Next we have the seccomp rules. These define the system calls our process can run:

seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(sigreturn), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

Then we activate the rules:

seccomp_load(ctx);

Just as in the previous example, we’ll try to print the PID of the current process in our console. Let’s see if it’ll work.

We compile and launch the program:

$ gcc -o seccomp2 seccomp2.c  -lseccomp
$ ./seccomp2

We get the following printout:

No restrictions yet
Bad system call

So what happened? Again like the last example, we can find our answer using strace:

$ strace ./seccomp2
 
/output fragment/
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, {len = 9, filter = 0x1ef5fe0}) = 0
+++ killed by SIGSYS +++

Here we see that the filter worked: the process made the getpid system call, which our rules do not allow, and was killed.

To better understand how seccomp filters work, we can change the default action from SCMP_ACT_KILL to SCMP_ACT_TRAP, which delivers a catchable SIGSYS signal to the process instead of killing it outright:

scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_TRAP);

The strace printout will be a bit more detailed:

$ strace ./seccomp2
/output fragment/
syscall_18446744073709551615(0xffffffff, 0x7feb8c47ab28, 0, 0x22b, 0x130c0c0, 0) = 0x27
--- SIGSYS {si_signo=SIGSYS, si_code=SYS_SECCOMP, si_call_addr=0x7feb8c18366f, si_syscall=__NR_getpid, si_arch=AUDIT_ARCH_X86_64} ---
+++ killed by SIGSYS +++

In our case (Ubuntu 16.04, kernel 4.4), the printout names the system call that caused the termination: si_syscall=__NR_getpid.

In other distros and kernel versions, the printout may show the system call’s number from /asm/unistd.h instead of its name.

Seccomp in Docker

In the previous sections, we worked out the principles behind seccomp. Now let’s look at how it is used in a containerization tool, taking Docker as our example.

The first seccomp profiles for containerization appeared in runc, which we’ve already written about.

They were added to Docker Engine v. 1.10.

44 system calls are blocked by default for all Docker containers (that’s out of the several hundred system calls you’ll find in modern 64-bit Linux systems). These prohibited system calls include reboot(): we can hardly imagine a situation where we’d have to reboot the host machine OS from the container.

Another good example is keyctl(), in which a vulnerability was discovered not long ago (CVE-2016-0728). Docker now blocks it by default.

The default seccomp profile limits the damage an attacker can do and lowers the probability of a successful attack. This isn’t enough on its own, though: many of the calls that remain available have had vulnerabilities of their own, and it would be impossible to block every potentially dangerous call for every container.

That’s why Docker also lets you supply your own system call filters for a container. These filters are written as configuration files in JSON format.

Let’s look at a simple example:

{
   "defaultAction": "SCMP_ACT_ALLOW",
   "syscalls": [
      {
         "name": "chmod",
         "action": "SCMP_ACT_ERRNO"
      }
   ]
}

As we can see, the structure is the same as in our previous examples. First we give the default action, then the system calls that get special treatment, and lastly what to do when such a call is made. Here SCMP_ACT_ERRNO makes chmod fail with an error instead of killing the process.

We save the file as chmod.json and try to launch a container with the seccomp configuration above:

$ docker run --security-opt seccomp:chmod.json busybox chmod 400 /etc/hostname
 
chmod: /etc/hostname: Operation not permitted

We see the filter worked as per our rules: the prohibited system call chmod was blocked.
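A quick way to confirm that a filter really is attached to a process is the Seccomp field in /proc/<pid>/status, which the kernel sets to 0 (disabled), 1 (strict mode) or 2 (filter mode). The sketch below is an addition to the article that reads this field for the current process; the function names are hypothetical, and it assumes a kernel built with seccomp support so that the field is present.

```c
#include <stdio.h>
#include <string.h>

/* Find the "Seccomp:" line in the text of /proc/<pid>/status.
   Returns 0 (disabled), 1 (strict), 2 (filter), or -1 if the field
   is missing. parse_seccomp_mode is a hypothetical helper. */
int parse_seccomp_mode(const char *status_text) {
    const char *line = status_text;

    while (line != NULL && *line != '\0') {
        int mode;
        /* Matches only lines that begin with exactly "Seccomp:". */
        if (sscanf(line, "Seccomp: %d", &mode) == 1)
            return mode;
        line = strchr(line, '\n');
        if (line != NULL)
            line++;              /* step past the newline */
    }
    return -1;
}

/* Read /proc/self/status and return the seccomp mode of this process,
   or -1 if the file or the field is unavailable. */
int current_seccomp_mode(void) {
    char buffer[8192];
    FILE *file = fopen("/proc/self/status", "r");
    if (file == NULL)
        return -1;

    size_t length = fread(buffer, 1, sizeof(buffer) - 1, file);
    fclose(file);
    buffer[length] = '\0';
    return parse_seccomp_mode(buffer);
}
```

Run inside a container started with a seccomp profile, current_seccomp_mode() would be expected to return 2. The same information is available from the shell with grep Seccomp /proc/self/status.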

Conclusion

In this article, we explained how seccomp works and how it can be used in Docker. If you have any questions or comments, please share them with us in the comments below.

As usual, below is a list of additional resources for anyone interested in learning more: