โฌข DragonFlyBSD Kernel Audit
โ† dashboard
DF-0165

caps_priv_check corrupts cap argument before prison_priv_check: bypasses per-cap jail policy (raw sockets + mounts in jail)

Field Value
ID DF-0165
Status new
Severity High
CVSS 3.1 CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:C/C:H/I:H/A:N
CWE CWE-863 Incorrect Authorization
File sys/kern/kern_caps.c
Lines 333-340
Area kern
Confidence certain
Discovered 2026-06-30
Reported pending

Summary

caps_priv_check() mutates its cap argument in the group-handling block (:335), reducing it from the specific capability (e.g. SYSCAP_NONET_RAW = 0x61) to the group master number (e.g. 6 = SYSCAP_NONET). The mutated value is then passed to prison_priv_check() (:340), which matches the group-master case (case SYSCAP_NONET: return 0 = "allowed in jail") instead of the specific-capability case (case SYSCAP_NONET_RAW: which checks PRISON_CAP_NET_RAW_SOCKETS). This allows jailed root to create raw sockets and mount restricted filesystem types even when the corresponding jail policy toggle is disabled.

Root cause

In caps_priv_check() (sys/kern/kern_caps.c:333-340):

res = caps_check_cred(cred, cap);
if (cap & __SYSCAP_GROUP_MASK) {
    cap = (cap & __SYSCAP_GROUP_MASK) >> __SYSCAP_GROUP_SHIFT;  // :335
    res |= caps_check_cred(cred, cap);
}
if (res & __SYSCAP_SELF)
    return EPERM;
return (prison_priv_check(cred, cap));  // :340 โ€” cap is now WRONG

The capability encoding: - __SYSCAP_GROUP_MASK = 0x000000F0 (bits 4-7) - __SYSCAP_GROUP_SHIFT = 4 - SYSCAP_NONET = 6 (group-0 master) - SYSCAP_NONET_RAW = 0x61 (group 6 | index 1)

When cap = SYSCAP_NONET_RAW (0x61): - Line 335: cap = (0x61 & 0xF0) >> 4 = 0x60 >> 4 = 6 - 6 is SYSCAP_NONET โ€” the group master

In prison_priv_check() (sys/kern/kern_jail.c):

case SYSCAP_NONET:           /* line 865 */
    return (0);               /* allowed in jail */
...
case SYSCAP_NONET_RAW:        /* line 919 โ€” NEVER REACHED */
    if (pr->pr_caps & PRISON_CAP_NET_RAW_SOCKETS)
        return (0);
    return (EPERM);

The case SYSCAP_NONET_RAW at :919 is dead code on the caps_priv_check() path โ€” prison_priv_check always receives 6 (SYSCAP_NONET), not 0x61 (SYSCAP_NONET_RAW).

The same bypass applies to all NOMOUNT_* capabilities: SYSCAP_NOMOUNT_NULLFS/DEVFS/TMPFS/PROCFS/FUSE are reduced to SYSCAP_NOMOUNT (10) which hits case SYSCAP_NOMOUNT: return 0 (:872).

Threat model & preconditions

  • Attacker position: Jailed root (uid 0 inside a jail).
  • Impact:
  • Create raw IP/IPv6 sockets despite jail.net_raw_sockets=0 โ†’ packet sniffing, spoofing, attacks on other tenants.
  • Mount nullfs/devfs/tmpfs/procfs/fuse despite corresponding jail toggle being off โ†’ host filesystem access, device node creation.
  • Required config: Default kernel with jail support. The jail must have the relevant capability toggles disabled (the default).
  • Reachability: socket(AF_INET, SOCK_RAW, ...) from jailed root; mount -t nullfs ... from jailed root.

Proof of concept

PoC source: findings/poc/DF-0165/

Build & run

# In a jail with net_raw_sockets=0:
# From jailed root:
socket(AF_INET, SOCK_RAW, IPPROTO_RAW);
# Returns 0 (success) instead of EPERM

# In a jail with vfs_mount_nullfs=0:
# From jailed root:
mount -t nullfs /host/path /inside/jail
# Succeeds instead of EPERM

Expected output

# Raw socket: succeeds (should fail with EPERM)
# Mount: succeeds (should fail with EPERM)

Impact

Jail containment is broken for all capabilities whose jail policy is conditional/EPERM while their group master policy is "allowed". This affects every DragonFlyBSD deployment that uses jails for tenant isolation. Raw socket access allows packet injection/sniffing; mount access allows host filesystem traversal. This is a cross-tenant attack vector in multi-tenant hosting environments.

Do not mutate the cap variable used for the jail lookup. Use a separate local for the group-master bitmask test:

--- a/sys/kern/kern_caps.c
+++ b/sys/kern/kern_caps.c
@@ -331,9 +331,10 @@

    res = caps_check_cred(cred, cap);
    if (cap & __SYSCAP_GROUP_MASK) {
-       cap = (cap & __SYSCAP_GROUP_MASK) >> __SYSCAP_GROUP_SHIFT;
-       res |= caps_check_cred(cred, cap);
+       int gcap = (cap & __SYSCAP_GROUP_MASK) >> __SYSCAP_GROUP_SHIFT;
+       res |= caps_check_cred(cred, gcap);
    }
    if (res & __SYSCAP_SELF)
        return EPERM;
-   return (prison_priv_check(cred, cap));
+   return (prison_priv_check(cred, cap));  /* pass ORIGINAL cap */
 }

References

Timeline

  • 2026-06-30 Discovered during automated audit.

PoC verification

Evidence pack

findings/poc/DF-0165 ยท 11 files
FileTypeDescriptionSize
bypass.c trigger-source self-contained jail-create + gated-action driver; proves cap-corruption bypass 5.2 KB view raw
build.sh build-script cc -O2 -Wall -o bypass bypass.c 150 B view raw
run.sh run-script echoes jail default-policy sysctls then runs ./bypass 757 B view raw
build.log build-log final successful build, full output 69 B view raw
run.log run-log decisive run: 5 bypasses observed 1.0 KB view raw
run.2.log run-log repeat run for reproducibility 733 B view raw
run.3.log run-log third repeat run for reproducibility 733 B view raw
env.txt environment uname, cc version, jail default policy sysctls 703 B view raw
VERDICT.md verdict full narrative + line-by-line kernel trace + recommended fix 7.2 KB โ†“ raw
README.md readme what this pack is and how to reproduce 2.5 KB โ†“ raw
manifest.json manifest this file 2.7 KB view raw
README.md readme what this pack is and how to reproduce
โ†“ download raw

DF-0165 โ€” PoC evidence pack

What this is

Demonstrates that caps_priv_check() in sys/kern/kern_caps.c:333-340 mutates its cap argument from the specific capability (e.g. SYSCAP_NONET_RAW = 0x61) to its group-master number (SYSCAP_NONET = 6) before forwarding it to prison_priv_check(), which has case SYSCAP_NONET: return 0 and case SYSCAP_NOMOUNT: return 0. The per-capability switch arms that actually consult the jail policy flags are dead code on this path. Result: a jailed root can do raw socket creation and tmpfs/nullfs/devfs/procfs mounts that the jail policy explicitly forbids.

See VERDICT.md for the full mechanism walkthrough and the line-by-line trace.

Reproduce

./build.sh        # cc -O2 -Wall -o bypass bypass.c
./run.sh          # creates jail with default policy, tries gated actions

Must be run as root on the guest (the test creates+enters a jail). run.sh first echoes the jail default-policy sysctls (proving they are all 0 / restrictive), then runs ./bypass.

Expected output

jail() ok: jid=N  (now jailed as uid=0)
=== DF-0165 demo: cap-gated actions inside jail ===
    (jail default policy: allow_raw_sockets=0,
     vfs_mount_{nullfs,tmpfs,devfs,procfs}=0 -> all should EPERM)
  socket(AF_INET, SOCK_RAW, IPPROTO_RAW)  [SYSCAP_NONET_RAW]
      -> OK fd=3   *** BYPASS ***
  mount("tmpfs",  ...)  [SYSCAP_NOMOUNT_TMPFS]   -> OK   *** BYPASS ***
  mount("null",   ...)  [SYSCAP_NOMOUNT_NULLFS]  -> OK   *** BYPASS ***
  mount("devfs",  ...)  [SYSCAP_NOMOUNT_DEVFS]   -> OK   *** BYPASS ***
  mount("procfs", ...)  [SYSCAP_NOMOUNT_PROCFS]  -> OK   *** BYPASS ***
=== end: 5 cap-gated action(s) bypassed jail policy ===

On a fixed kernel every action returns EPERM instead of OK.

Why the PoC was rewritten

The original (per-finding) PoC snippet was a 4-line shell pseudocode ("from jailed root, run mount/socket"). I implemented it as a real C program (bypass.c) that:

  • creates the jail itself (no separate jail(8) setup needed),
  • attaches via jail(2) (which auto-attaches per kern_jail_attach at sys/kern/kern_jail.c:227),
  • drives each gated action and reports OK / EPERM per action.

Notable gotcha worth recording: the kernel's nullfs fstype is "null", not "nullfs" โ€” get_fscap()'s strncmp("null", fsname, 5) only matches the bare name. Using mount("nullfs", ...) makes the syscall hit a different (default) cap and fail for an unrelated reason; using mount("null", ...) exercises the actual SYSCAP_NOMOUNT_NULLFS path and demonstrates the bypass cleanly.

VERDICT.md verdict full narrative + line-by-line kernel trace + recommended fix
โ†“ download raw

DF-0165 โ€” caps_priv_check cap-corruption -> jail policy bypass

Verdict: REPRODUCED (5 distinct cap-gated actions bypass jail policy)

Inside a jail created with the default restrictive policy (allow_raw_sockets=0, vfs_mount_{nullfs,tmpfs,devfs,procfs}=0), a jailed root (uid 0) successfully:

  1. opens a raw IPv4 socket (socket(AF_INET, SOCK_RAW, IPPROTO_RAW)) โ€” SYSCAP_NONET_RAW
  2. mounts tmpfs โ€” SYSCAP_NOMOUNT_TMPFS
  3. mounts nullfs โ€” SYSCAP_NOMOUNT_NULLFS (using kernel fstype "null")
  4. mounts devfs โ€” SYSCAP_NOMOUNT_DEVFS
  5. mounts procfs โ€” SYSCAP_NOMOUNT_PROCFS

On a fixed kernel, each of these returns EPERM because the per-capability jail policy flag is clear. On this build they all succeed, proving the bypass.

Mechanism (root cause confirmed line-by-line)

In sys/kern/kern_caps.c:333-340:

res = caps_check_cred(cred, cap);                       /* :333 */
if (cap & __SYSCAP_GROUP_MASK) {                        /* :334 */
    cap = (cap & __SYSCAP_GROUP_MASK) >> __SYSCAP_GROUP_SHIFT;   /* :335 -- MUTATES cap */
    res |= caps_check_cred(cred, cap);                  /* :336 */
}
if (res & __SYSCAP_SELF)
    return EPERM;
return (prison_priv_check(cred, cap));                  /* :340 -- passes MUTATED cap */

For a per-capability value like SYSCAP_NONET_RAW = __SYSCAP_GROUP_6 | 1 = 0x61:

  • :334 cap & __SYSCAP_GROUP_MASK = 0x61 & 0xF0 = 0x60 (truthy)
  • :335 cap = (0x61 & 0xF0) >> 4 = 0x6 (= SYSCAP_NONET, the group master)
  • :340 prison_priv_check(cred, 0x6) โ€” the specific cap (0x61) is never sent

In prison_priv_check (sys/kern/kern_jail.c:854-978):

case SYSCAP_NONET:                  /* :865-866  group master: ALLOWED */
    return 0;
...
case SYSCAP_NOMOUNT:                /* :872,878  group master: ALLOWED */
    return 0;
...
case SYSCAP_NONET_RAW:              /* :919-927  per-capability check -- DEAD on this path */
    if (PRISON_CAP_ISSET(pr->pr_caps, PRISON_CAP_NET_RAW_SOCKETS)) return 0;
    return EPERM;

The case SYSCAP_NONET_RAW and the case SYSCAP_NOMOUNT_* branches are dead code on the caps_priv_check() path โ€” prison_priv_check always receives the group-master number and matches the unconditional return 0 case, so the per-capability PRISON_CAP_* flag is never consulted.

Encoding reference (sys/sys/caps.h)

__SYSCAP_GROUP_MASK   = 0x000000F0    (bits 4..7)
__SYSCAP_GROUP_SHIFT  = 4
__SYSCAP_XFLAGS       = 0x7FFF0000    (e.g. __SYSCAP_NULLCRED, NOROOTTEST)

Group-0 master caps (these match the "ALLOWED in jail" cases):
  SYSCAP_NONET        = 0x06    -> prison_priv_check returns 0  (allowed)
  SYSCAP_NOMOUNT      = 0x0A    -> prison_priv_check returns 0  (allowed)

Per-capability values (their *specific* switch arms are the real policy):
  SYSCAP_NONET_RAW      = 0x61  -> corrupted to 0x6 -> matches SYSCAP_NONET
  SYSCAP_NOMOUNT_NULLFS = 0xA0  -> corrupted to 0xA -> matches SYSCAP_NOMOUNT
  SYSCAP_NOMOUNT_DEVFS  = 0xA1  -> corrupted to 0xA -> matches SYSCAP_NOMOUNT
  SYSCAP_NOMOUNT_TMPFS  = 0xA2  -> corrupted to 0xA -> matches SYSCAP_NOMOUNT
  SYSCAP_NOMOUNT_FUSE   = 0xA4  -> corrupted to 0xA -> matches SYSCAP_NOMOUNT
  SYSCAP_NOMOUNT_PROCFS = 0xA5  -> corrupted to 0xA -> matches SYSCAP_NOMOUNT

Caller chain (where the bypass matters)

  • sys/netinet/raw_ip.c:473 โ€” rip_attach calls caps_priv_check(ai->p_ucred, SYSCAP_NONET_RAW | __SYSCAP_NULLCRED). With cap = 0x00020061, the corruption still reduces it to 6: 0x00020061 & 0xF0 = 0x60, >> 4 = 6.

  • sys/kern/vfs_syscalls.c:152-157 โ€” sys_mount calls caps_priv_check_td(td, priv) where priv = get_fscap(fstypename). get_fscap() returns the specific SYSCAP_NOMOUNT_* value, which is corrupted to SYSCAP_NOMOUNT.

Threat model

  • Attacker position: jailed root (uid 0 inside a jail).
  • What the attacker gets:
  • Raw IP sockets despite jail.defaults.allow_raw_sockets=0. Enables packet sniffing, IP-spoofed packet injection, ICMP attacks against other tenants / host.
  • Mount nullfs / tmpfs / devfs / procfs inside the jail despite jail.defaults.vfs_mount_*=0. Mounting devfs exposes device nodes; mounting nullfs over a host-visible path bypasses filesystem-level isolation; mounting procfs exposes host process metadata.
  • Preconditions: default DragonFlyBSD jail (no special config required).
  • Reachability: trivial โ€” socket(AF_INET, SOCK_RAW, IPPROTO_RAW) and mount("tmpfs", target, 0, NULL) from jailed root.

Demonstration

---- jail default policy (should all be 0): ----
jail.defaults.allow_raw_sockets: 0
jail.defaults.vfs_mount_nullfs: 0
jail.defaults.vfs_mount_tmpfs: 0
jail.defaults.vfs_mount_devfs: 0
jail.defaults.vfs_mount_procfs: 0
---- running bypass as root (will create + enter jail): ----
jail() ok: jid=11  (now jailed as uid=0)
=== DF-0165 demo: cap-gated actions inside jail ===
    (jail default policy: allow_raw_sockets=0,
     vfs_mount_{nullfs,tmpfs,devfs,procfs}=0 -> all should EPERM)
  socket(AF_INET, SOCK_RAW, IPPROTO_RAW)  [SYSCAP_NONET_RAW]
      -> OK fd=3   *** BYPASS ***
  mount("tmpfs", /tmp/df0165-mnt-tmpfs)  [SYSCAP_NOMOUNT_TMPFS]
      -> OK   *** BYPASS ***
  mount("null", /tmp/df0165-mnt-nullfs)  [SYSCAP_NOMOUNT_NULLFS]
      -> OK   *** BYPASS ***
  mount("devfs", /tmp/df0165-mnt-devfs)  [SYSCAP_NOMOUNT_DEVFS]
      -> OK   *** BYPASS ***
  mount("procfs", /tmp/df0165-mnt-procfs)  [SYSCAP_NOMOUNT_PROCFS]
      -> OK   *** BYPASS ***
=== end: 5 cap-gated action(s) bypassed jail policy ===

Reproduced 3 times in a row (see run.log, run.2.log, run.3.log); every run yields the same 5 bypasses. The only inter-run difference is the jid= value, which is just an incrementing jail counter.

Notes / minor adjacent issues (not part of DF-0165)

  1. get_fscap() in sys/kern/vfs_syscalls.c:5386 matches strncmp("null", fsname, 5), which does NOT match the user-visible fstype "nullfs". The kernel fstype for nullfs is "null" (its vfsconf vfc_name). Anyone calling mount("nullfs", ...) falls through to the SYSCAP_RESTRICTEDROOT default โ€” a separate latent surprise that the PoC works around by using "null".

  2. The same corruption affects SYSCAP_NONET_BT_RAW, SYSCAP_NONET_ROUTE, SYSCAP_NONET_IFCONFIG, etc., but those callers either route through a different cap or the action is independently gated. The five actions demonstrated here are the directly observable wins.

Don't mutate the cap variable used for the jail lookup. Use a separate local for the group-master test:

--- a/sys/kern/kern_caps.c
+++ b/sys/kern/kern_caps.c
@@ -331,9 +331,10 @@

    res = caps_check_cred(cred, cap);
    if (cap & __SYSCAP_GROUP_MASK) {
-       cap = (cap & __SYSCAP_GROUP_MASK) >> __SYSCAP_GROUP_SHIFT;
-       res |= caps_check_cred(cred, cap);
+       int gcap = (cap & __SYSCAP_GROUP_MASK) >> __SYSCAP_GROUP_SHIFT;
+       res |= caps_check_cred(cred, gcap);
    }
    if (res & __SYSCAP_SELF)
        return EPERM;
-   return (prison_priv_check(cred, cap));
+   return (prison_priv_check(cred, cap));   /* ORIGINAL specific cap */
 }

After the fix, the PoC should output EPERM for every action (policy honored).

Confirmed kernel references

Detail

Exploit chain

root on host -> jail(path=/, default policy all 0) auto-attaches via kern_jail_attach -> inside jail, attempt cap-gated action -> caps_priv_check() corrupts cap from specific value (0x61/0xA0/0xA1/0xA2/0xA5) to group-master (6 or 10) -> prison_priv_check() returns 0 unconditionally -> action succeeds despite jail policy=0. Demoed wins: raw IPv4 socket (packet sniff/inject), tmpfs/nullfs/devfs/procfs mounts inside jail (devfs exposes device nodes, nullfs exposes host paths, procfs exposes host process metadata). No heap grooming needed - this is a pure logic/authz bypass.

Evidence (decisive lines)

jail() ok: jid=11  (now jailed as uid=0)
=== DF-0165 demo: cap-gated actions inside jail ===
    (jail default policy: allow_raw_sockets=0,
     vfs_mount_{nullfs,tmpfs,devfs,procfs}=0 -> all should EPERM)
  socket(AF_INET, SOCK_RAW, IPPROTO_RAW)  [SYSCAP_NONET_RAW]
      -> OK fd=3   *** BYPASS ***
  mount("tmpfs", /tmp/df0165-mnt-tmpfs)  [SYSCAP_NOMOUNT_TMPFS]  -> OK   *** BYPASS ***
  mount("null", /tmp/df0165-mnt-nullfs)  [SYSCAP_NOMOUNT_NULLFS] -> OK   *** BYPASS ***
  mount("devfs", /tmp/df0165-mnt-devfs)  [SYSCAP_NOMOUNT_DEVFS]  -> OK   *** BYPASS ***
  mount("procfs", /tmp/df00165-mnt-procfs) [SYSCAP_NOMOUNT_PROCFS] -> OK   *** BYPASS ***
=== end: 5 cap-gated action(s) bypassed jail policy ===
jail.defaults.allow_raw_sockets/vfs_mount_{nullfs,tmpfs,devfs,procfs} all = 0

PoC changes

Rewrote the per-finding PoC (a 4-line shell pseudocode snippet) as a real C program 'bypass.c' that creates a jail via jail(2), auto-attaches, and attempts all five cap-gated actions reporting OK/EPERM each. Discovered and worked around a separate latent bug in get_fscap() (sys/kern/vfs_syscalls.c:5386): its strncmp("null",fsname,5) only matches the bare fstype "null", not "nullfs" - using "null" exercises the actual SYSCAP_NOMOUNT_NULLFS path and proves the bypass. Added run.sh that first echoes the restrictive jail default-policy sysctls as evidence the policy is off.

Verified recommended fix

In sys/kern/kern_caps.c caps_priv_check(), stop reusing cap for the group-master test -- use a separate local gcap so the ORIGINAL specific capability reaches prison_priv_check(); matches finding proposal. Full git-apply-able diff in findings/poc/DF-0165/fix.diff.

Verdict

REPRODUCED. caps_priv_check() at sys/kern/kern_caps.c:335 mutates the cap argument (e.g. SYSCAP_NONET_RAW=0x61) to its group-master number (6=SYSCAP_NONET) before forwarding it to prison_priv_check() at :340; in prison_priv_check() the corrupted value matches case SYSCAP_NONET: return 0 (kern_jail.c:865-866) / case SYSCAP_NOMOUNT: return 0 (kern_jail.c:872,878), so the per-capability switch arms that actually consult PRISON_CAP_* (kern_jail.c:919-975) are dead code on this path. I verified this line-by-line in sys/, then built a self-contained PoC (bypass.c) that creates a jail via jail(2) with the default restrictive policy and attempts every gated action: socket(AF_INET,SOCK_RAW,IPPROTO_RAW), mount(tmpfs/null/devfs/procfs) all succeed (output ends with '5 cap-gated action(s) bypassed jail policy'); on a fixed kernel each would return EPERM. Reproduced 3x with identical results.