The seccomp() system call operates on the Secure Computing (seccomp) state of the calling process.

Portable has used the Secure Computing (seccomp) feature of the Linux kernel since its early versions, well before the Go rewrite, via systemd’s SystemCallFilter switches. It currently rejects system calls in pre-defined groups (@clock, @cpu-emulation, @module, @obsolete, @raw-io, @reboot, @swap) and returns the error EAGAIN. While this helps us minimize the attack surface available inside the sandbox, it is not very flexible.

Available syscalls and architectures

The available system calls with groups can be dumped via systemd-analyze syscall-filter --no-pager.
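To work with that dump programmatically, a small parser is enough. The sketch below is not part of Portable: the function name `parse_syscall_groups` and the abbreviated `SAMPLE` text are made up for illustration, and it assumes the layout that systemd-analyze prints when not attached to a terminal (group header at column 0 starting with `@`, indented `#` description, indented member names).

```rust
use std::collections::BTreeMap;

/// Abbreviated, illustrative sample shaped like the systemd-analyze dump.
const SAMPLE: &str = "\
@clock
    # Change the system time
    adjtimex
    clock_settime
    settimeofday

@swap
    # Enable/disable swap devices
    swapoff
    swapon
";

/// Parse the dump into a map of group name -> member names.
/// Assumption: headers start with '@' in column 0, members are indented,
/// and indented lines starting with '#' are descriptions.
fn parse_syscall_groups(dump: &str) -> BTreeMap<String, Vec<String>> {
    let mut groups: BTreeMap<String, Vec<String>> = BTreeMap::new();
    let mut current: Option<String> = None;
    for line in dump.lines() {
        let trimmed = line.trim();
        if trimmed.is_empty() || trimmed.starts_with('#') {
            continue; // blank line or description
        }
        if line.starts_with('@') {
            // New group header
            groups.entry(trimmed.to_string()).or_default();
            current = Some(trimmed.to_string());
        } else if line.starts_with(char::is_whitespace) {
            // Indented member (may itself be a nested "@group" reference)
            if let Some(members) = current.as_ref().and_then(|g| groups.get_mut(g)) {
                members.push(trimmed.to_string());
            }
        } else {
            // Unindented text that is not a header ends the current group
            current = None;
        }
    }
    groups
}

fn main() {
    for (group, members) in parse_syscall_groups(SAMPLE) {
        println!("{group}: {}", members.join(", "));
    }
}
```

Nested group references (an indented `@default`, for instance) are kept as plain member strings here; expanding them is left out to keep the sketch short.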

By default, no architecture limitation is enforced. Although this is good for backwards compatibility, obsolete architectures may have unnoticed security holes, which enlarges the attack surface.

As such, in lockdown mode (which is the default for bawn) we deny non-native system calls to constrain said surface. Modern Linux apps should NOT rely on non-native system calls because they are potentially slow and vulnerable.

In addition, lockdown mode switches from deny-list to allow-list, providing a more secure sandbox environment for untrusted executables.

Resolving syscall

The output from systemd-analyze consists of literal names, which need to be resolved to libseccomp::ScmpSyscall because the actual number constant is architecture-dependent. We can build a simple function that returns an Option:

fn get_syscall_by_name(
	name: &String,
	logtx: &tokio::sync::mpsc::Sender<LogMessage>,
) -> Option<libseccomp::ScmpSyscall> {
	let result = libseccomp::ScmpSyscall::from_name(name);
	match result {
		Ok(val)	=>	Some(val),
		Err(e)	=>	{
			crate::logger::log_sync(
				logtx,
				crate::logger::Loglevel::Debug,
				format!("Could not resolve syscall from name {name}: {e:#?}"));
			None
		}
	}
}

Because the system calls are defined statically in Init, some of them may not exist on the current kernel. Bearing that in mind, we log a debug message and return None.
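Returning an Option makes it easy to skip unknown names when compiling a whole list. The following is a minimal sketch of that pattern using only the standard library: `resolve_all` and `toy_resolver` are hypothetical names, and the toy resolver stands in for a libseccomp-backed lookup with a few hard-coded x86_64 syscall numbers.

```rust
/// Resolve every name with `resolve`, silently dropping the ones that
/// cannot be resolved (the real helper would already have logged them).
fn resolve_all<T>(names: &[&str], resolve: impl Fn(&str) -> Option<T>) -> Vec<T> {
    names.iter().filter_map(|&name| resolve(name)).collect()
}

/// Hypothetical stand-in for a libseccomp-backed resolver.
/// The numbers are the x86_64 syscall numbers for these three calls.
fn toy_resolver(name: &str) -> Option<u32> {
    match name {
        "mount" => Some(165),
        "swapon" => Some(167),
        "reboot" => Some(169),
        _ => None,
    }
}

fn main() {
    let wanted = ["mount", "not_a_syscall", "reboot"];
    let resolved = resolve_all(&wanted, toy_resolver);
    // The unknown name is skipped instead of aborting the whole list.
    println!("{resolved:?}");
}
```

The design choice is that a missing syscall is not fatal: a rule for a call the kernel does not know about cannot be triggered anyway, so dropping it keeps the rest of the filter usable.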

The flow

With all bits ready, Init can implement seccomp trivially:

Resolve the built-in deny-list and allow-list
Create a new filter with defaults depending on lockdown mode: notify supervisor when true, allow otherwise
Apply previously compiled syscall list: allow-list when lockdown is true, deny-list otherwise
- Note that EACCES might get returned if conflicting rules are set. For example, if you set the default policy to Allow and then add a rule to allow a certain syscall, this is going to blow up
Deny or allow other architectures depending on lockdown mode
Load the filter into the kernel and obtain a file descriptor for unotify notifications
Enter a loop to deny system calls and inform user (see seccomp-unotify)
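The decisions in these steps can be modelled without touching the kernel. The sketch below is a toy model, not Init's code: `Action`, `FilterPlan`, `plan_filter` and `check_rule` are hypothetical names, and `Action::Deny` is a placeholder for whatever action the deny-list rules use, which the steps above do not specify.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Action {
    Allow,
    Notify,
    /// Placeholder: the action used by deny-list rules is an assumption.
    Deny,
}

#[derive(Debug, PartialEq)]
struct FilterPlan {
    default_action: Action,
    rule_action: Action,
    allow_foreign_arch: bool,
}

/// Pick the filter defaults from the lockdown flag.
fn plan_filter(lockdown: bool) -> FilterPlan {
    if lockdown {
        // Allow-list: everything not listed is reported to the supervisor.
        FilterPlan {
            default_action: Action::Notify,
            rule_action: Action::Allow,
            allow_foreign_arch: false,
        }
    } else {
        // Deny-list: everything not listed is allowed.
        FilterPlan {
            default_action: Action::Allow,
            rule_action: Action::Deny,
            allow_foreign_arch: true,
        }
    }
}

/// Mirror the conflict described in the note: a rule whose action equals
/// the filter's default action is rejected.
fn check_rule(default_action: Action, rule_action: Action) -> Result<(), String> {
    if default_action == rule_action {
        Err(format!("EACCES: rule action {rule_action:?} equals the default action"))
    } else {
        Ok(())
    }
}

fn main() {
    for lockdown in [true, false] {
        let plan = plan_filter(lockdown);
        let verdict = check_rule(plan.default_action, plan.rule_action);
        println!("lockdown={lockdown}: {plan:?} -> {verdict:?}");
    }
    // The conflicting combination from the note above:
    println!("{:?}", check_rule(Action::Allow, Action::Allow));
}
```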

unotify

seccomp-unotify empowers us to respond to system calls from user space.

Being able to allow or outright deny a system call is very useful. In fact, this perfectly describes what Portable did until version 17. But it’s a nightmare to debug what failed. Imagine that you are playing some old Windows games via the wine32 package (because WoW64 has worse performance for OpenGL apps) in a bawn transient sandbox, whose lockdown mode is turned on by default. After Portable implemented architecture filtering, you would have no idea why the games refuse to start. There’s nothing other than an EPERM error, and you are left in the dark.

With seccomp-unotify, we can alleviate some of the pain. We can initialise a filter whose default action is Notify (or alternatively, add rules with that action), and get a notification file descriptor to receive and respond to kernel notifications:

// Set up the filter
let filter = libseccomp::ScmpFilterContext::new(
	libseccomp::ScmpAction::Notify,
);
let mut filter = match filter {
	Ok(val) => val,
	Err(e) => {
		return Err(SeccompError::CreateFilterError(e));
	}
};

// Load the filter into the kernel, then fetch the notification fd
let result = filter.load();
match result {
	Ok(_)	=> {},
	Err(e)	=> return Err(SeccompError::LoadFilterError(e))
};
let result = filter.get_notify_fd();
match result {
	Ok(fd)	=> Ok(fd),
	Err(e)	=> Err(SeccompError::GetFdError(e))
}

(Note that the filter must be loaded into the kernel first; only then can you retrieve the relevant file descriptor for notifications.)

pub fn process_seccomp_unotify (
	fd: libseccomp::ScmpFd,
	logtx: &tokio::sync::mpsc::Sender<LogMessage>,
) {
	// EPERM is errno 1 on Linux; the response carries the negated errno
	let raw_eperm_err = -1;
	loop {
		let request = libseccomp::ScmpNotifReq::receive(fd);
		let request = match request {
			Ok(val)	=> val,
			Err(e)	=> {
				crate::logger::log_sync(
					&logtx,
					crate::logger::Loglevel::Fatal,
					format!("Could not receive seccomp notification: {e:#?}"));
				return
			}
		};
		let syscall_name = request.data.syscall.get_name();
		let syscall_name = match syscall_name {
			Ok(val)	=> val,
			Err(e)	=> {
				format!("unresolved syscall ({:#?})", e)
			}
		};
		crate::logger::log_sync(
			&logtx,
			crate::logger::Loglevel::Warn,
			format!(
				"PID {} performed illegal system call {}",
				request.pid,
				syscall_name,
			),
		);
		let response = libseccomp::ScmpNotifResp::new_error(
			request.id,
			raw_eperm_err,
			libseccomp::ScmpNotifRespFlags::empty(),
		);
		match response.respond(fd) {
			Ok(_)	=> {},
			Err(e)	=> {
				crate::logger::log_sync(
					&logtx,
					crate::logger::Loglevel::Warn,
					format!(
						"Error filtering syscall: {e:#?}",
					),
				);
			},
		}
	}
}

You can get a vibe about what’s going on:

We set up and load the filter, instructing the kernel to notify our helper about every single system call.
- This is because the default action is set to ScmpAction::Notify, and we don’t have any rule to override that action
process_seccomp_unotify takes the notification file descriptor and calls libseccomp::ScmpNotifReq::receive(fd) in a loop to receive incoming notifications
Errors are checked, and the loop terminates if a notification cannot be received, to avoid undefined behaviour
Call the logger to inform the user which process (PID) is performing which syscall. Resolve the syscall name if we can.
Respond to the kernel, rejecting the operation.

With said implementation, the user is aware because of warnings we print out:

PID 1225 performed illegal system call mount

Thus, they can turn off lockdown mode knowing that it is the most likely blocker.

Decoding content from syscall

WARNING: this experiment has security concerns! See [TOCTOU Attack](#TOCTOU attack) in later chapters.

Looking at the data returned from the file descriptor, there are more fields to play with:

pub struct ScmpNotifReq {
	pub id: u64,
	pub pid: u32,
	pub flags: u32,
	pub data: ScmpNotifData,
}

pub struct ScmpNotifData {
	pub syscall: ScmpSyscall,
	pub arch: ScmpArch,
	pub instr_pointer: u64,
	pub args: [u64; 6],
}

You can see that several u64 values are exported in ScmpNotifReq.data.args. These are the raw syscall arguments; for pointer arguments they are memory addresses that we can read and decipher. We can read /proc/PID/mem for the process’s memory content and figure out which arguments it is passing along with the syscall.

Note that this function is deprecated and removed in the Rust rewrite of Portable Init, thus we are only showing the Go version.

Writing a simple Go function, we can read all arguments out:

import (
	"bufio"
	"errors"
	"io"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readArgFromMemory reads a NUL-terminated string located at addr
// in the memory of process pid.
func readArgFromMemory(pid int, addr uint64) (string, error) {
	if addr == 0 {
		return "", errors.New(
			"Could not read argument: Null pointer passed",
		)
	}
	path := filepath.Join("/proc", strconv.Itoa(pid), "mem")
	file, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer file.Close()
	_, err = file.Seek(int64(addr), io.SeekStart)
	if err != nil {
		return "", err
	}
	reader := bufio.NewReader(file)
	// ReadBytes returns io.EOF if the data ends before a NUL byte is found
	bytes, err := reader.ReadBytes(0)
	if err != nil {
		return "", err
	}
	str := string(bytes)
	return strings.TrimSuffix(str, "\x00"), nil
}

This is relatively simple:

Ensure that we are not dealing with nullptr
Build the actual path for the memory content
Seek to said address offset
Read until a NUL byte, check that no errors occur
Trim the trailing NUL byte, and return the argument
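For comparison, the same five steps can be written in Rust against any seekable byte source. This is only an illustrative sketch: the Rust rewrite of Init does not contain such a function, and `read_cstring_at` is a hypothetical name. It is generic over `Read + Seek`, so the example runs on an in-memory `Cursor`; an opened /proc/PID/mem file could be passed instead, subject to the same permission limits as the Go version.

```rust
use std::io::{self, BufRead, BufReader, Cursor, Read, Seek, SeekFrom};

/// Read a NUL-terminated string starting at `addr` from `source`.
fn read_cstring_at<R: Read + Seek>(mut source: R, addr: u64) -> io::Result<String> {
    // 1. Refuse null pointers
    if addr == 0 {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "null pointer passed"));
    }
    // 2./3. The caller supplies the source; seek to the address offset
    source.seek(SeekFrom::Start(addr))?;
    // 4. Read up to and including the first NUL byte
    let mut bytes = Vec::new();
    BufReader::new(source).read_until(0, &mut bytes)?;
    // 5. Drop the trailing NUL; if it is missing, the data ended too early
    if bytes.pop() != Some(0) {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "no NUL terminator found"));
    }
    Ok(String::from_utf8_lossy(&bytes).into_owned())
}

fn main() -> io::Result<()> {
    // Fake "process memory": two C strings, with address 0 left unused
    let memory: &[u8] = b"\0/dev/sda1\0ext4\0";
    println!("{}", read_cstring_at(Cursor::new(memory), 1)?);
    println!("{}", read_cstring_at(Cursor::new(memory), 11)?);
    Ok(())
}
```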

And thus, the legacy Go version of Init shows suspicious calls to the user:

[Init]	PID 11 spawned a bash shell
[Init]	System call triggered: PID 438 requested mount using architecture amd64 with [0 94596337430784 0 573440 0 140379275995840] which may be problematic
[Init]	Rejecting syscall due to lockdown
[Init]	System call triggered: PID 438 requested mount using architecture amd64 with [94596337429860 94596337429839 94596337429860 6 0 140379275995840] which may be problematic
[Init]	Rejecting syscall due to lockdown
[Init]	Could not read argv0 from memory: open /proc/440/mem: permission denied
[Init]	Got execve() from PID 440 with argument:  Deciphered from memory address: 140055202268192

You might have already spotted one of the problems: we could not decipher every system call. This is the result of security measures in the kernel, such as the Yama ptrace scope. Unfortunately, it creates more confusion rather than clearing things up in the legacy version, which is part of the reason why we are throwing things out in the Rust rewrite.
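Arguments that are plain integers can still be decoded without reading any memory. As a sketch, assume the fourth value in each mount line of the log above is the mountflags argument of mount(2); the hypothetical helper `decode_mount_flags` below maps it to flag names. The table holds a subset of the MS_* constants from the Linux headers and is not exhaustive.

```rust
/// Subset of the MS_* mount flags from the Linux headers (not exhaustive).
const MOUNT_FLAGS: &[(u64, &str)] = &[
    (1, "MS_RDONLY"),
    (2, "MS_NOSUID"),
    (4, "MS_NODEV"),
    (8, "MS_NOEXEC"),
    (4096, "MS_BIND"),
    (16384, "MS_REC"),
    (32768, "MS_SILENT"),
    (1 << 17, "MS_UNBINDABLE"),
    (1 << 18, "MS_PRIVATE"),
    (1 << 19, "MS_SLAVE"),
    (1 << 20, "MS_SHARED"),
];

/// Split a raw flags value into known flag names and leftover bits.
fn decode_mount_flags(raw: u64) -> (Vec<&'static str>, u64) {
    let mut names = Vec::new();
    let mut rest = raw;
    for &(bit, name) in MOUNT_FLAGS {
        if raw & bit != 0 {
            names.push(name);
            rest &= !bit; // clear the bit we just explained
        }
    }
    (names, rest)
}

fn main() {
    // The two flag values taken from the mount lines in the log
    for raw in [573440u64, 6] {
        let (names, rest) = decode_mount_flags(raw);
        println!("{raw} = {} (unknown bits: {rest:#x})", names.join(" | "));
    }
}
```

Under that assumption, the first call decodes to MS_REC | MS_SILENT | MS_SLAVE and the second to MS_NOSUID | MS_NODEV, which is already more informative than the raw numbers.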

Ideal application

In an ideal world, we can use unotify to build not just a messenger, but an entire sandbox.

Because unotify intercepts system calls, a sandbox can hijack them and customise return values as it pleases. In theory, intercepting syscalls like read, write, etc. would provide an environment where the outside filesystem cannot be accessed directly. Several projects, such as sandlock, implement sandboxing this way.

However, apart from performance concerns, there’s also a huge security issue.

This underlines Portable’s philosophy of stacking multiple layers of defence and shows why it matters.

He who employs unotify must be aware of the issues.

TOCTOU attack

The unotify mechanism is not safe while evaluating arguments.

Suppose an attacker knows that unotify is the sole security boundary on the filesystem side. They can repeat the following process again and again to “race” the sandbox and read arbitrary files:

The attacker wants to read /home/victim/.git/id_ed25519, but a direct attempt would be blocked by the unotify supervisor.
The attacker CAN access /home/victim/.sandbox/com.unsafe.app/file because that is inside the sandbox home.
The attacker calls open on /home/victim/.sandbox/com.unsafe.app/file, and a second thread immediately overwrites the path in memory with /home/victim/.git/id_ed25519.
The result observed by the supervisor depends on timing: if the supervisor (sandbox engine) performs its check before the overwrite lands, the path looks harmless and the supervisor replies with ScmpNotifResp::new_continue() to allow the action; if the overwrite lands too early, the call is rejected and the attacker can simply try again.
Because the syscall argument is only a pointer, the kernel reads the path from the attacker’s memory after the supervisor has answered, and therefore opens whatever the buffer contains at that moment.
The attacker has your private ssh key.
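The heart of the problem is that the supervisor and the kernel read the same user memory at two different moments. The deterministic toy model below makes that window explicit. It does not use seccomp at all: `TargetMemory`, `supervisor_allows`, `kernel_opens` and `race` are hypothetical names, and the "kernel" is just a function that reads the buffer after the check.

```rust
const SANDBOX_PREFIX: &str = "/home/victim/.sandbox/com.unsafe.app/";
const SECRET: &str = "/home/victim/.git/id_ed25519";

/// Stand-in for the attacker-controlled buffer the syscall argument points to.
struct TargetMemory {
    path_buffer: String,
}

/// Time of check: the supervisor inspects the path it finds in memory.
fn supervisor_allows(mem: &TargetMemory) -> bool {
    mem.path_buffer.starts_with(SANDBOX_PREFIX)
}

/// Time of use: the "kernel" reads the path again when it runs the call.
fn kernel_opens(mem: &TargetMemory) -> String {
    mem.path_buffer.clone()
}

/// Run one attempt; returns the path that actually gets opened, if any.
fn race(attacker_wins: bool) -> Option<String> {
    let mut mem = TargetMemory {
        path_buffer: format!("{}file", SANDBOX_PREFIX),
    };
    if !supervisor_allows(&mem) {
        return None; // supervisor rejects the call
    }
    // Window between check and use: a second thread swaps the buffer.
    if attacker_wins {
        mem.path_buffer = SECRET.to_string();
    }
    Some(kernel_opens(&mem))
}

fn main() {
    println!("race lost: {:?}", race(false));
    println!("race won:  {:?}", race(true));
}
```

In both runs the supervisor saw a harmless path; only the value read at time of use differs, which is why a check on pointer arguments alone cannot serve as the security boundary.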

However, there is room for improvement, which is not the main point of discussion here.