Delegation of control groups in Portable
The background:
Portable started as a safe, efficient sandbox on 10 November 2024. At the core of its resource management and system security are systemd and the unified control group hierarchy (cgroup v2), available at /sys/fs/cgroup.
What came before
For a long period of time, Portable did not allow applications to read and write the unified control group interface for a number of good reasons:
A) We don’t want sandboxed applications to mess with the system or raise their own CPU weight. Anyone with write access to /sys/fs/cgroup can easily manipulate, and potentially overwrite, the values set by Portable.
B) Exposing the control group tree can cause privacy issues. A sandboxed app might be able to infer what’s running, who is doing what, and read the statistics of other services, which is not what we want.
C) A control group should have only one writer. This so-called single-writer rule has to be followed; otherwise systemd and other managers might step on us and create lots of issues.
D) And many more…
Changes come
The blog post from Sebastian Wick, “SO_PEERPIDFD Gets More Useful”, changed things. As control groups support extended attributes (xattrs), we can now potentially identify processes and assign them a sandbox ID and engine using this new mechanism, dropping the legacy, messy behaviour.
This will hopefully allow us to simplify or extend the current D-Bus filtering magic in Portable, and finally stop emulating Flatpak. Quoting swick:
For sandbox developers, it means you have a standardized way to communicate application identity without implementing custom socket mounting schemes.
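To make the idea more concrete, here is a minimal Python sketch of tagging a control group with an xattr and reading it back for a given process. The attribute name user.example.sandbox_id and every function name are made up for illustration; they are not what Portable or systemd-appd actually use. The lookup goes through a plain PID, which can be reused by another process, so treat this as a demonstration of the mechanism rather than a safe implementation.

```python
import os
from pathlib import Path
from typing import Optional

CGROUP_ROOT = Path("/sys/fs/cgroup")
# Hypothetical attribute name, chosen for this example only.
SANDBOX_ID_XATTR = "user.example.sandbox_id"


def parse_unified_cgroup(proc_cgroup_text: str) -> str:
    """Return the cgroup v2 path from the contents of /proc/<pid>/cgroup."""
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        hierarchy_id, controllers, path = parts
        # The unified hierarchy is the entry of the form "0::<path>".
        if hierarchy_id == "0" and controllers == "":
            return path
    raise ValueError("no cgroup v2 entry found")


def cgroup_dir_for_pid(pid: int, root: Path = CGROUP_ROOT) -> Path:
    """Map a PID to the directory of its control group."""
    text = Path(f"/proc/{pid}/cgroup").read_text()
    return root / parse_unified_cgroup(text).lstrip("/")


def tag_cgroup(cgroup_dir: Path, sandbox_id: str) -> None:
    """Attach the sandbox ID to the control group (needs write access)."""
    os.setxattr(cgroup_dir, SANDBOX_ID_XATTR, sandbox_id.encode())


def read_tag(cgroup_dir: Path) -> Optional[str]:
    """Read the sandbox ID back, or return None if it is not set."""
    try:
        return os.getxattr(cgroup_dir, SANDBOX_ID_XATTR).decode()
    except OSError:
        return None
```

A service that wants to know who it is talking to would call cgroup_dir_for_pid() on the peer and then read_tag() on the result.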
systemd, in its 258 release, also added a “private” value for ProtectControlGroups, allowing us to get a private copy of the cgroup fs directly.
Great, where’s the roadmap?
We’ve refactored lots of Portable during the X release cycle and 11 beta. I’ll pick the ones which matter here:
-
- (yes it’s pull request number 345)
- Those 2 pull requests enabled systemd to create and mount the private copy of the unit’s control group tree under /sys/fs/cgroup, saving us from manually parsing it and mounting it via bwrap.
- The latter one actually mounts the control group interface
-
- Those two are for the systemd standard “Desktop Environments”. This was changed to strip the main service from the initial portable slice, so that resource monitors can follow and identify processes more accurately.
- This can potentially be used to identify processes too, since it contains the launcher and application ID.
And here comes the important part:
Delegation
Rules of control groups
You might be wondering: what the hell is delegation? To start with, we have to be familiar with two key design rules of control groups v2:
- No processes in inner nodes
Take a look at the output of ls /sys/fs/cgroup/user.slice:
cgroup.controllers cgroup.threads cpu.weight memory.current memory.pressure memory.zswap.writeback
You can see that the root of the user.slice control group hierarchy contains not only its own controller interface files but also many child control groups. Groups like this are called inner nodes in the following text. The structure of the unified control group hierarchy is key to understanding why delegation is used in Portable.
Control groups usually come in two types: leaf nodes and inner nodes. A leaf node may contain processes but not child control groups; an inner node may contain child control groups but not processes. The relationship is one of containment: inner nodes are the parents of leaf nodes. (Note that the root group is different and contains both processes and child groups.)
- Single writer
Every control group should only have one writer or manager. This ensures that we won’t step on systemd’s toes or vice versa.
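The first rule can be checked by hand on any control group directory. The following Python sketch is only an illustration of the leaf/inner distinction as described above; the function name and the "mixed" label are made up here, and it deliberately ignores finer kernel details such as which controllers are enabled.

```python
from pathlib import Path


def classify_cgroup(cgroup_dir: Path) -> str:
    """Classify a control group directory as 'leaf', 'inner' or 'mixed'."""
    # Child control groups show up as subdirectories.
    has_children = any(entry.is_dir() for entry in cgroup_dir.iterdir())
    # cgroup.procs lists the PIDs that live directly in this group.
    procs_file = cgroup_dir / "cgroup.procs"
    has_processes = procs_file.is_file() and procs_file.read_text().strip() != ""
    if has_children and has_processes:
        # Both at once: what the text above describes for the root group.
        return "mixed"
    return "inner" if has_children else "leaf"
```

Calling classify_cgroup(Path("/sys/fs/cgroup/user.slice")) on a typical systemd machine should report an inner node, while an application's scope underneath it should be a leaf.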
The issue
From the above, you can see an issue: Portable wants to write xattrs to the control group that contains the running sandboxed applications, but we cannot violate the single-writer rule, which means the child group has to be created via delegation. We must also run everything in a sub group instead of spawning processes directly in the root group of the service; otherwise we violate rule #1.
To solve this issue, and prepare for the xattr approach of identifying processes, Portable was adapted to do the following:
- Turning on delegation by passing Delegate="yes" to systemd
In a nutshell, turning delegation on instructs systemd not to fiddle with any sub groups created within the unit’s root control group, and to run its clean-up commands in a .control sub group. This satisfies rule #2.
- Running the helper in the portable-cgroup sub group
This instructs systemd to run our helper command directly in the sub group, without us doing the work. We also mount the sub group read-only to prevent applications from changing cgroup attributes.
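The two steps above could be expressed roughly like the following Python sketch, which only builds a systemd-run command line and prints it. This is not Portable's actual invocation: the use of systemd-run --user, the DelegateSubgroup= property as the way to place the helper in the sub group, and the helper path /usr/lib/example/helper are all assumptions made for the example.

```python
import shlex
from typing import List


def build_systemd_run_argv(
    helper_argv: List[str], subgroup: str = "portable-cgroup"
) -> List[str]:
    """Build a systemd-run command line for a delegated unit (illustrative)."""
    properties = [
        # Hand the unit's cgroup subtree over to us (rule #2).
        "Delegate=yes",
        # Assumed mechanism: ask systemd to start the command in a sub group,
        # so that no process sits in the unit's root group (rule #1).
        f"DelegateSubgroup={subgroup}",
        # The "private" value mentioned earlier in this post.
        "ProtectControlGroups=private",
    ]
    argv = ["systemd-run", "--user"]
    for prop in properties:
        argv += ["-p", prop]
    return argv + ["--", *helper_argv]


# Hypothetical helper path, for illustration only.
argv = build_systemd_run_argv(["/usr/lib/example/helper"])
print(shlex.join(argv))
```

Building the argument list in a function keeps the unit properties in one place, which makes it easier to see which setting answers which rule.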
What’s next?
We’ll be waiting for systemd-appd’s initial specification to come out, and will implement it as needed. The helper will also need to be changed so that it does not start processes until Portable instructs it to, to avoid race conditions. This is another future enhancement of Portable.
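One possible shape for such a start gate, purely as a thought experiment: the helper blocks on a pipe until the launcher has finished its preparation (for example writing xattrs) and only then starts the application. The Python sketch below uses a throwaway child process as a stand-in for the helper; none of the names reflect Portable's real design, which has not been decided yet.

```python
import subprocess
import sys
from typing import Callable

# Stand-in for the helper: it blocks until stdin delivers a byte (or closes),
# and only then "starts the application" (here: prints a marker).
HELPER_CODE = "import sys; sys.stdin.buffer.read(1); print('app started')"


def launch_with_gate(prepare: Callable[[int], None]) -> str:
    """Start the helper, run prepare(pid) while it waits, then release it."""
    helper = subprocess.Popen(
        [sys.executable, "-c", HELPER_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    # The helper is parked on its read, so nothing has been started yet.
    prepare(helper.pid)
    # Writing one byte releases the gate; communicate() then collects output.
    out, _ = helper.communicate(b"\n")
    return out.decode().strip()
```

The point of the design is ordering: whatever prepare() does is guaranteed to have finished before the helper gets past its read.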
