# DF-0032 — VERDICT

**Verdict: REPRODUCED.** This is a real, local, **system-wide fork-DoS** that an
unprivileged user can trigger, and it is permanent until reboot. The
reviewer-written PoC used the wrong *trigger* (mmap pressure), but the underlying
bug is real and exploitable; a corrected trigger reproduces it reliably.

## The bug (confirmed line-by-line in `sys/`)

In `fork1()`, a series of acquisitions happens **before** the only fd operation that can fail, and none of them is undone when it fails:

| kern_fork.c | operation | undone on fdcopy failure? |
|---|---|---|
| `:415` | `atomic_add_int(&nprocs, 1)` | **NO** |
| `:421` | `chgproccnt(uid, 1, RLIMIT_NPROC)` (per-uid) | **NO** |
| `:444` | `p2 = kmalloc(sizeof(struct proc), M_PROC, M_WAITOK\|M_ZERO)` | **NO** |
| `:475` | `p2->p_uidpcpu = kmalloc(..., M_SUBPROC, ...)` | **NO** |
| `:491` | `proc_add_allproc(p2)` (on allproc, SIDL) | **NO** |
| `:509` | `p2->p_ucred = crhold(...)` | **NO** |
| `:521-529` | sigacts (share-ref or kmalloc) | **NO** |
| `:536-542` | `p_textvp` vref / `p_textnch` cache_copy | **NO** |
| `:551` | `error = fdcopy(p1, &p2->p_fd);` ← **only failing fd op** | — |
| `:552-554` | `if (error) { error = ENOMEM; goto done; }` | — |
| `:724-732` | `done:` releases only `p_token`/`p1_token`/pglock | — |

`nprocs` is decremented **only** at `kern_exit.c:1337`, and the `chgproccnt`
charge is released only at `kern_exit.c:1280`. Both sit on the exit path, which
needs a runnable/exiting lwp. A `SIDL` orphan has no lwp and no parent linkage
(`lwp_fork1` is at `:674`, *after* fdcopy), so it never reaches either line.
`allproc` scans skip `SIDL` procs, so `ps`/`procstat` never show them.
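
The ordering problem can be sketched as a toy userland model. This is not kernel code: `struct fork_model`, `model_fork1`, and the `fdcopy_fails` switch are hypothetical names for illustration, and only two of the acquisitions from the table are modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy model: two counters stand in for the global nprocs and the
 * per-uid count charged via chgproccnt(). */
struct fork_model {
    int nprocs;     /* models the global nprocs counter */
    int uid_procs;  /* models the per-uid RLIMIT_NPROC charge */
};

/* Same ordering as the table: both counters are charged first, then the
 * fd copy may fail, and the shared exit label performs no rollback. */
static int model_fork1(struct fork_model *m, int fdcopy_fails)
{
    int error = 0;

    assert(m != NULL);
    m->nprocs += 1;     /* like atomic_add_int(&nprocs, 1) */
    m->uid_procs += 1;  /* like chgproccnt(uid, 1, RLIMIT_NPROC) */

    if (fdcopy_fails) {
        error = ENOMEM;
        goto done;      /* jumps past any chance to undo the charges */
    }
    /* success path: the child would release both charges when it exits */
done:
    return error;       /* the real label only releases tokens/locks */
}
```

After one failing call on a zeroed model, both counters remain at 1 even though no child exists, which is the leak in miniature.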

`fdcopy` is the only fd op that can fail because it is the only one whose
`struct filedesc` kmalloc uses `M_NULLOK`:

```
sys/kern/kern_descrip.c:2481   newfdp = kmalloc(sizeof(struct filedesc),
sys/kern/kern_descrip.c:2482                    M_FILEDESC, M_WAITOK|M_ZERO|M_NULLOK);
sys/kern/kern_descrip.c:2483   if (newfdp == NULL) { *fpp = NULL; return (-1); }
```

`fdinit` (`:2408`) and `fdshare` use plain `M_WAITOK` (cannot fail), and
`fdcopy`'s own `fd_files[]` array (`:2504`) is `M_WAITOK` (no `M_NULLOK`), so it
would *panic* on limit exhaustion rather than return NULL. The single clean
failure mode is the `M_NULLOK` newfdp.
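
The difference between the two flag combinations can be summarised with a small classifier. It is a simplified sketch, not the allocator's code: the `SK_*` flag values and the function name are hypothetical, and the `used >= limit` comparison is an assumption about how the limit check is phrased.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values for this sketch; the real M_* constants differ. */
#define SK_WAITOK 0x1
#define SK_NULLOK 0x2

enum alloc_outcome { ALLOC_OK, ALLOC_RETURNS_NULL, ALLOC_PANICS };

/* Classify what a per-type limit check does for a given usage level.
 * Assumes the check fires once usage has reached the limit. */
static enum alloc_outcome
limit_outcome(size_t type_used, size_t type_limit, int flags)
{
    assert(type_limit > 0);
    if (type_used < type_limit)
        return ALLOC_OK;            /* under the limit: allocation proceeds */
    if (flags & SK_NULLOK)
        return ALLOC_RETURNS_NULL;  /* the struct filedesc case in fdcopy */
    return ALLOC_PANICS;            /* plain M_WAITOK, e.g. fd_files[] */
}
```

Only the `SK_WAITOK | SK_NULLOK` combination yields a recoverable failure, which is why `fdcopy`'s `newfdp` allocation is the single clean failure point.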

## When does that `M_NULLOK` kmalloc actually return NULL?

`kmalloc` returns NULL with `M_NULLOK` when the **per-type** `ks_limit` is
exceeded (`kern_slaballoc.c:863-879`):

```
ks_limit = kmem_lim_size() * 1MB / 10          (kern_slaballoc.c:371-372)
kmem_lim_size() = min(physmem, KvaSize)/1MB    (kern_slaballoc.c:255-263)
```

On this 2 GB guest: `ks_limit(M_FILEDESC) = ~195 MB`.
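
As a worked version of the two formulas, the sketch below computes the limit in bytes. The function names are illustrative, and the 1950 MB input used in the note afterwards is an assumption chosen to land on the ~195 MB figure; the guest's actual value is recorded in `env.txt`.

```c
#include <assert.h>
#include <stdint.h>

#define MB (1024ULL * 1024ULL)

/* kmem_lim_size() as described above: min(physmem, KvaSize), in MB.
 * Both arguments are byte counts. */
static uint64_t kmem_lim_size_mb(uint64_t physmem_bytes, uint64_t kva_bytes)
{
    uint64_t lim = physmem_bytes < kva_bytes ? physmem_bytes : kva_bytes;
    return lim / MB;
}

/* Per-type limit: one tenth of that size, converted back to bytes. */
static uint64_t ks_limit_bytes(uint64_t lim_size_mb)
{
    assert(lim_size_mb > 0);
    return lim_size_mb * MB / 10;
}
```

For example, `ks_limit_bytes(1950)` is 204472320 bytes, i.e. exactly 195 MB; a guest with a full 2048 MB usable would get about 204.8 MB instead.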

The original PoC tried to induce this with `mmap`+touch. **That does not work**:
anonymous `mmap` consumes *user* VM and physical pages, **not** the kernel
`M_FILEDESC` malloc pool. Live measurement showed `M_FILEDESC` unchanged
(27 KB → 31 KB) across an mmap-pressure run. So the original PoC never triggers
the bug — it only causes userland OOM.

## The working trigger (fd-table amplification)

Each successful `fork()` on the `RFFDG` path calls `fdcopy`, which allocates a
**copy of the parent's `fd_files[]` table** (`kern_descrip.c:2504`) under
`M_FILEDESC`. A process that has grown its fd table (via `dup2` to high fds)
therefore makes every child's `fdcopy` charge `M_FILEDESC` for an `fd_files`
array of several hundred KB. With a ~15000-entry fd table, each child costs
~700 KB of `M_FILEDESC`, so ~260 such children alive at once push `M_FILEDESC`
to its ~195 MB limit. At that point the next `fdcopy`'s `M_NULLOK` newfdp kmalloc
returns NULL → `fdcopy` returns -1 → `fork1` does `goto done` → **leak**.
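
The child count follows from a ceiling division of the limit by the per-child charge. The helper below is a back-of-the-envelope sketch with a hypothetical name; its inputs are the rounded figures from the paragraph above, not measured values.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest number of simultaneously live children whose combined
 * M_FILEDESC charge reaches the per-type limit (ceiling division). */
static uint64_t children_to_limit(uint64_t limit_bytes, uint64_t per_child_bytes)
{
    assert(per_child_bytes > 0);
    return (limit_bytes + per_child_bytes - 1) / per_child_bytes;
}
```

With a 195 MB limit and 700 KB per child this gives 286, in the same range as the ~260 quoted above; the gap reflects how rough the per-child figure is.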

`exhaust.c` does exactly this and is fully unprivileged.

## Evidence (all in this folder)

`run.log` is the decisive record. Highlights:

```
$ ./exhaust
[*] grew fd table to fd=14976 (fd_files[] ~234KB per fdcopy)
[!!!] ENOMEM from fork() -- fdcopy failure leak TRIGGERED at child 259
[*] summary: ok=259 eagain=51 enomem=705 other=0
```

Parallel root sampling during the slow variant caught the failure moment:

```
[t=57] file_desc=18.0M   proc=51
[t=59] file_desc=176M    proc=261     <- M_FILEDESC hit its ~195M ks_limit
```

The leak is confirmed by the kernel malloc-type counters (the leaked structs
are never freed, so they persist):

| type | baseline | after one run | meaning |
|---|---|---|---|
| `proc` (M_PROC) | 25 | **744** | +719 `struct proc` permanently leaked |
| `subproc` (uidpcpu) | 48 | 1450 | +1402 `p_uidpcpu` leaked |
| `lwp` | 34 | **34** | unchanged → leak is **before** `lwp_fork1` (`:674`) |
| `file_desc` | 28 | **28** | unchanged → newfdp returned NULL, no filedesc made |
| `ps ax \| wc -l` | 146 | **146** | leaked SIDL orphans are invisible to ps |

`lwp`/`file_desc` being **flat** is the fingerprint that pins the leak to the
exact point the code trace predicts: `fdcopy` failure (`:551`) **after** `p2`
was put on allproc (`:491`) but **before** `lwp_fork1` (`:674`). If the leak
were anywhere else, one of those counters would move.
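
The fingerprint argument amounts to a simple predicate over the counter deltas. The sketch below is illustrative only: the struct and function names are hypothetical, and the values fed to it are the ones in the table above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Counter values for three kernel malloc types, as reported in the table. */
struct type_counts {
    long proc;
    long lwp;
    long file_desc;
};

/* True when the deltas match a leak on the fdcopy-failure path:
 * struct proc grows while lwp and file_desc stay flat. */
static bool matches_fdcopy_leak(const struct type_counts *before,
                                const struct type_counts *after)
{
    assert(before != NULL && after != NULL);
    return after->proc > before->proc &&
           after->lwp == before->lwp &&
           after->file_desc == before->file_desc;
}
```

Feeding it the baseline `{25, 34, 28}` and the post-run `{744, 34, 28}` returns true; any movement in `lwp` or `file_desc` would make it false.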

## System-wide impact (DoS demonstrated)

`nprocs` is a *global* counter, so the leaked slots reduce fork capacity for
**every** user, including root. The per-uid `chgproccnt` charge is leaked as
well (never decremented), so one unprivileged uid can permanently burn ~700–1000
**system-wide** `maxproc` slots before it caps itself at its own `RLIMIT_NPROC`.
It takes ~4–6 unprivileged uids to exhaust all of `maxproc=4036`.
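
The uid count is again a ceiling division. This is a rough sketch with a hypothetical helper name; it ignores processes that are legitimately running, so it slightly overstates what an attacker needs.

```c
#include <assert.h>

/* Number of attacking uids needed to consume maxproc slots when each
 * uid can leak at most leaked_per_uid of them (ceiling division). */
static int uids_needed(int maxproc, int leaked_per_uid)
{
    assert(maxproc > 0 && leaked_per_uid > 0);
    return (maxproc + leaked_per_uid - 1) / leaked_per_uid;
}
```

With `maxproc=4036`, a per-uid leak of 700 slots needs 6 uids and the ~1009-slot single-user cap noted under Caveats needs 4, which brackets the ~4–6 range.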

Multi-uid staged attack (the `proc` malloc-type count approximates global `nprocs`; `ps ax` stays frozen at 146):

```
baseline → 25   maxx→732   u1002→1.41K   u1003→2.10K   u1004→2.79K
u1005→3.48K   u1006→3.68K (+203; system nprocs check now pre-blocks fdcopy)
```

Then, with `nprocs` permanently ~3680/4036, a root fork-bomb:

```
$ /root/forktest_bomb
forktest_bomb: root fork() EAGAIN after 272 children (errno=35 Resource temporarily unavailable)
forktest_bomb: ROOT RESULT ok=272 eagain=3  (clean system would allow ~4036)
```

Root can fork only **~272** children (vs ~3890 on a clean system), a ~93 %
collapse, and is itself fork-blocked. `dmesg` corroborates with
`maxproc limit exceeded by uid 0`. The leaked slots are **permanent**: they do
not recover after the attackers exit, and only a reboot clears them.

## Caveats / precision

- The original PoC (`fork_leak.c`) is **not** a valid trigger (mmap ≠ M_FILEDESC).
  It is retained for provenance; `exhaust.c` is the working trigger.
- Full `maxproc` exhaustion needs ~4–6 unprivileged uids, because a single user
  is capped at ~1009 leaked slots by its own leaked per-uid `chgproccnt`. On any
  multi-user system, or for any user who can raise `RLIMIT_NPROC` or run from
  several accounts, a full system-wide fork-DoS is straightforward. Even a single
  user permanently destroys ~18–25 % of system fork capacity and permanently
  fork-blocks their own uid.
- No kernel panic occurred at any point; the failure is a clean `ENOMEM` leak,
  exactly the path cited.

## Files in this folder

| file | purpose |
|---|---|
| `exhaust.c` / `exhaust` | **working trigger** — fd-table amplification → fdcopy failure → leak |
| `exhaust_slow.c` | slow variant for parallel kernel-state sampling |
| `forktest.c`, `forktest_bomb.c` | prove root fork capacity collapses / root fork-blocked |
| `fork_leak.c` | original reviewer PoC (mmap pressure; does **not** trigger) |
| `build.sh`, `run.sh` | exact build / run commands |
| `run.log` | decisive untrimmed run output + interpretation |
| `dmesg.txt` | kernel `maxproc limit exceeded` messages (incl. uid 0) |
| `env.txt` | guest uname, sysctls, ks_limit derivation |
| `fix.diff` | git-apply-able fix (verified `git apply --check` clean) |
| `manifest.json` | machine-readable artifact catalog |

## Fix

`fix.diff` adds a full teardown of the partially built `p2` on the fdcopy-failure
path (reversing every acquisition from `:491` back through `:415`/`:421`), plus a
new symmetric `proc_remove_allproc()` helper in `kern_proc.c` (the inverse of
`proc_add_allproc()`). This **supersedes** the original finding's primary proposal
(drop `M_NULLOK` from `fdcopy`): dropping `M_NULLOK` would convert the leak into a
`panic("malloc limit exceeded")` at `kern_slaballoc.c:877`, which is worse for
availability. The teardown keeps `fdcopy`'s clean `ENOMEM` failure mode and
makes `fork1` clean up correctly after it, which fixes the root cause and adds
defense-in-depth for any future error path after `p2` allocation.
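
The shape of such a teardown can be shown on a toy userland model. This is a sketch of the general reverse-order unwind pattern, not the contents of `fix.diff`: the struct, the function, and the `fdcopy_fails` switch are hypothetical, and only three acquisitions are modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy model of three acquisitions made before the fd copy. */
struct fork_model {
    int nprocs;      /* models the global nprocs counter */
    int uid_procs;   /* models the per-uid chgproccnt charge */
    int on_allproc;  /* models membership on the allproc list */
};

/* Same ordering as the leaking path, but the failure branch undoes
 * each acquisition in reverse order before returning. */
static int model_fork1_fixed(struct fork_model *m, int fdcopy_fails)
{
    assert(m != NULL);
    m->nprocs += 1;
    m->uid_procs += 1;
    m->on_allproc += 1;

    if (fdcopy_fails)
        goto fail;
    return 0;            /* success: the child keeps all three */

fail:
    m->on_allproc -= 1;  /* inverse of the allproc insertion */
    m->uid_procs -= 1;   /* inverse of the per-uid charge */
    m->nprocs -= 1;      /* inverse of the global increment */
    return ENOMEM;
}
```

A failing call on a zeroed model returns `ENOMEM` and leaves every field at 0, while a successful call leaves each at 1.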
