# DF-0033 — VERDICT: REPRODUCED (local-unprivileged kernel panic / UAF)

| Field        | Value |
|--------------|-------|
| Verdict      | **REPRODUCED** |
| Impact       | `panic` — reliable local-unprivileged kernel DoS; underlying primitive is a UAF on `M_FILEDESC_TO_LEADER` (64-byte slab zone) |
| Confidence   | **certain** (three distinct kernel panics from unprivileged `maxx`, one with a stack through the exact cited functions; plus an airtight line-level lock-mismatch proof) |
| Tested on    | DragonFly 6.5-DEVELOPMENT v6.5.0.1712.g89e6a-DEVELOPMENT (master DEV build) |
| Attempts     | ~8 build/run iterations; the race is intermittent (CVSS AC:H) and typically needs several 20–60s bursts to hit |

## Mechanism (confirmed in `sys/`, every hop cited)

The finding's claim is correct and the locking has **not** been fixed on master.

1. `fdtol->fdl_refcount` is a plain `int` — `sys/sys/filedesc.h:110`. No atomics.
2. **Increment side** — `sys/kern/kern_fork.c`:
   - `:324`  `lwkt_gettoken(&p1->p_token)` — taken at the top of `fork1()`.
   - `:563`  the `RFTHREAD` branch is entered for `rfork(RFPROC|RFTHREAD)` (fdshare).
   - `:568`  `fdtol = p1->p_fdtol;`
   - `:569`  `fdtol->fdl_refcount++;`  ← mutated under `p1->p_token` **only**.
   - `p1->p_token` is held until `:727`.
3. **Decrement side** — `sys/kern/kern_descrip.c`:
   - `:2622` `spin_lock(&fdp->fd_spin)` — the **shared** fd-table spinlock.
   - `:2675` `fdtol->fdl_refcount--;` ← mutated under `fd_spin` **only**.
4. `p1->p_token` is **per-process**. Two peers `A` and `B` that share `p_fd`/`p_fdtol`
   (created by `rfork(RFPROC|RFTHREAD)`) hold **different** tokens
   (`A.p_token` ≠ `B.p_token`). An LWKT token serializes only against other
   holders of that same token; holding `A.p_token` excludes nothing that runs
   under `B.p_token`. So:
   - `fork-in-A` holds `A.p_token`;
   - `exit-in-B` (`fdfree` via `exit1`, `kern_exit.c:382`) holds `B.p_token` **plus** the shared `fd_spin`;
   - **neither lock is common to both sides for the `fdl_refcount` word.**
   The only lock common to all sharers is `fd_spin`, and the increment side
   does **not** take it. `++`/`--` on a plain int is a non-atomic
   read/modify/write, so a concurrent `++` and `--` from two CPUs can lose one
   of the updates.
5. **Consequence.** A lost increment drives `fdl_refcount` below the true
   reference count. When some peer later exits, `fdfree` decrements, sees
   `fdl_refcount == 0`, unlinks the node from the circular `fdl` list, and calls
   `kfree(fdtol, M_FILEDESC_TO_LEADER)` (`kern_descrip.c:2676-2686`) **while
   other peers still have `p_fdtol` pointing at it** → use-after-free. The next
   `rfork` in a surviving peer dereferences `p1->p_fdtol` (dangling) and bumps
   `fdl_refcount` in freed memory; or takes the non-`RFTHREAD` branch and calls
   `filedesc_to_leader_alloc(old=p1->p_fdtol, ...)`, which reads
   `old->fdl_next`/`old->fdl_prev` from the freed slot and writes through them —
   corrupting whatever now occupies that slab slot.
6. **Unlocked list splice** (`sys/kern/kern_descrip.c:3342-3368`).
   `filedesc_to_leader_alloc()` is self-admitted **"NOT MPSAFE"** (`:3343`) and
   splices the shared `fdl_next`/`fdl_prev` list (`:3358-3362`) under **no**
   lock. With `fd_spin` held around the call site, as the recommended fix does,
   the splice becomes serialized against `fdfree`'s list walk.
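
The lost update in steps 4 and 5 can be replayed deterministically in userland. The sketch below is a hypothetical model, not kernel code: `struct rmw`, `racy_interleaving`, and `serialized_interleaving` are illustrative names, and each `++`/`--` is split by hand into the load and store that a CPU would perform.

```c
#include <assert.h>

/* Hypothetical model of one CPU's in-flight read-modify-write on a plain int. */
struct rmw {
    int tmp;    /* value the CPU loaded into a register */
    int delta;  /* +1 models the fork side, -1 models the exit side */
};

static void rmw_load(struct rmw *op, const int *word)
{
    op->tmp = *word;
}

static void rmw_store(const struct rmw *op, int *word)
{
    *word = op->tmp + op->delta;
}

/* fork-in-A and exit-in-B overlap: both load before either stores. */
int racy_interleaving(int start)
{
    int refcount = start;
    struct rmw fork_a = { 0, +1 };
    struct rmw exit_b = { 0, -1 };

    assert(start > 0);              /* same invariant the kernel KASSERT checks */
    rmw_load(&fork_a, &refcount);
    rmw_load(&exit_b, &refcount);
    rmw_store(&fork_a, &refcount);
    rmw_store(&exit_b, &refcount);  /* overwrites A's increment */
    return refcount;
}

/* Same two operations, but each one completes before the other starts. */
int serialized_interleaving(int start)
{
    int refcount = start;
    struct rmw fork_a = { 0, +1 };
    struct rmw exit_b = { 0, -1 };

    assert(start > 0);
    rmw_load(&fork_a, &refcount);
    rmw_store(&fork_a, &refcount);
    rmw_load(&exit_b, &refcount);
    rmw_store(&exit_b, &refcount);
    return refcount;
}
```

Starting from 2 (peers `A` and `B`), one fork plus one exit should leave the count at 2 (`A` and the new child). The serialized order returns 2; the racy order returns 1, so the next exit drives the count to zero while a peer still holds the pointer.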

## Evidence (three panics, unprivileged `maxx`)

All three are in `panic.txt`. Summary:

- **Panic A — `panic: filedesc_to_refcount botch: fdl_refcount=0`** in `fdfree`.
  This is the `KASSERT(fdtol->fdl_refcount > 0, …)` at `kern_descrip.c:2627-2629`
  firing — the kernel's own invariant check catching the refcount underflow that
  the lost-update race produces. Stack: `fdfree → exit1 → sigexit → postsig → userret`.

- **Panic B — `panic: BADFREE2`** in `_kfree`. Slab double-free detection: after
  the premature `kfree(fdtol)`, a dangling `p_fdtol` reference drove a second
  free of the same `M_FILEDESC_TO_LEADER` object. The slab allocator's
  bookkeeping became inconsistent, so a later `kfree` in `sysctl_kern_proc_args`
  (collateral) trips `BADFREE2`.

- **Panic C — `panic: memory chunk … is already allocated!`** — slab corruption
  detected on the **next** allocation out of `M_FILEDESC_TO_LEADER`. The stack
  is the smoking gun, walking the exact functions cited in the finding:
  `chunk_mark_allocated ← _kmalloc ← filedesc_to_leader_alloc ← fork1 ← sys_rfork`.
  This is the full chain: refcount race → premature free → dangling write into
  the freed 64-byte-zone slot → corrupted slab bitmap → next `kmalloc` panics.

The race is intermittent (AC:H). Across the verification, panic A fired after
~50 s of hammering; panics B and C each fired within a handful of 20 s bursts.

## Exploit chain (characterization; LPE not demonstrated)

The primitive **is** memory corruption (UAF), so per the audit methodology the
chain is characterized even though root was not achieved in this session.

- **Object / zone.** `struct filedesc_to_leader` is 40 bytes on LP64
  (`int×3` + `ptr×3` is 36 bytes, plus 4 bytes of alignment padding;
  `sys/sys/filedesc.h:109-117`). DragonFly's slab rounds
  allocation size up to a power of two (`powerof2_size`,
  `sys/kern/kern_slaballoc.c:776-786`), so `fdtol` lands in the **64-byte
  chunk zone** (`kmalloc(sizeof(struct filedesc_to_leader)=40)` → 64).
- **Write primitive.** After the premature free, a surviving peer's next
  non-`RFTHREAD` `rfork` calls `filedesc_to_leader_alloc(old=p1->p_fdtol)`,
  which executes `old->fdl_next->fdl_prev = old->fdl_prev`
  (`kern_descrip.c:3362`). If the attacker reclaims the freed 64-byte slot with
  a crafted object, the attacker controls `old->fdl_next` (**write address**) and
  `old->fdl_prev` (**write value**) → a single pointer-sized write to an attacker-chosen address.
  Additionally `old->fdl_next = fdtol` (`:3361`) writes a known kernel pointer
  into a controlled offset.
- **Victim objects (64-byte zone).** Candidate victims would be any `kmalloc`
  of ≤64 bytes containing an attacker-interesting field: a function pointer,
  a `uid`, a `struct ucred *`/`struct file *` pointer, a refcount. Grooming
  would spray the 64-byte zone (sockets, pipes, small `kinfo` structs) to place
  such a victim adjacent to the freed `fdtol` slot.
- **How far it got / what blocks root.** The write primitive is real but
  **gated behind a non-deterministic race** (AC:H): the attacker must first win
  the refcount lost-update, then reclaim the slot, then trigger the splice —
  all in the right order across two CPUs. On this INVARIANTS kernel the
  KASSERT (Panic A) catches some interleavings before the corruption phase,
  which partly masks the exploitable path; Panics B and C show that others get
  past it. On a **production (non-INVARIANTS) kernel** the
  KASSERT is compiled out and the UAF proceeds silently to the arbitrary
  write; turning that into `uid=0` requires slab grooming + a chosen victim
  object + a deterministic race trigger, which is substantial exploit
  development and was not completed in this session. The demonstrated,
  reproducible ceiling here is **local-unprivileged kernel panic (DoS)**,
  which is already a High-impact outcome for a default-config kernel.
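
The size-class arithmetic in the "Object / zone" bullet can be checked with a small userland model. This is a sketch under assumptions: `struct fdtol_model` mirrors only the `int×3` + `ptr×3` shape described above (the names `fdl_wakeup` and `fdl_leader` are assumed, since this report names only the other four fields), and `round_up_pow2` is a hypothetical stand-in for the slab's power-of-two rounding, not the kernel's `powerof2_size`.

```c
#include <assert.h>
#include <stddef.h>

struct proc_model;  /* opaque stand-in for struct proc */

/* Layout model: three ints followed by three pointers.
 * fdl_wakeup and fdl_leader are assumed names. */
struct fdtol_model {
    int fdl_refcount;
    int fdl_holdcount;
    int fdl_wakeup;
    struct proc_model *fdl_leader;
    struct fdtol_model *fdl_prev;
    struct fdtol_model *fdl_next;
};

/* Hypothetical helper: smallest power of two that is >= n, for n >= 1. */
size_t round_up_pow2(size_t n)
{
    size_t p = 1;

    assert(n >= 1);
    while (p < n)
        p <<= 1;
    return p;
}

/* Chunk size the model struct would occupy under power-of-two rounding. */
size_t fdtol_chunk_size(void)
{
    return round_up_pow2(sizeof(struct fdtol_model));
}
```

On an LP64 target `sizeof(struct fdtol_model)` is 40 and `fdtol_chunk_size()` is 64, which matches the zone named above; on a 32-bit target the numbers differ.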

A maintainer should treat the LPE ceiling as plausible-but-unproven; the DoS
is proven.

## PoC changes (vs. the seeded `fdtol_race.c`)

The seeded PoC compiled and ran but **fork-bombed into `kern.maxprocperuid`**
within seconds, wedging ssh before the race had time to fire. It was rewritten
as a controlled racer:

- Bounded concurrency: each peer does `rfork(RFPROC|RFTHREAD)` + child `_exit(0)`
  and the parent-peer reaps with `waitpid`, so the live process count stays
  well under the per-uid cap (the original looped `fork()` forever and
  orphaned everything).
- Added slab pressure (`open("/dev/null")`/`close` churn in both peer and
  child) so that, once a premature free occurs, the freed 64-byte slot is
  likely reclaimed and the next deref hits clobbered memory → visible panic
  (this is exactly what produced Panic C).
- The parent also runs `rfork(RFPROC|RFTHREAD)`+`_exit` to add a third
  contender for the refcount word (more cross-CPU `++`/`--` overlap).
- `SIGALRM`-bounded runtime; the run scripts loop short bursts because the
  race is intermittent.
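
The bounded-concurrency pattern from the first bullet can be sketched portably. This is not the PoC: it substitutes plain `fork()` for `rfork(RFPROC|RFTHREAD)` so that it builds on any POSIX system, which also means it does not share the descriptor table and cannot trigger the race. `bounded_churn` is a hypothetical name.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs `iterations` create/exit/reap cycles. Returns how many children were
 * reaped with exit status 0, or -1 if fork() or waitpid() fails. */
int bounded_churn(int iterations)
{
    int reaped = 0;

    assert(iterations >= 0);
    for (int i = 0; i < iterations; i++) {
        pid_t pid = fork();  /* stand-in for rfork(RFPROC|RFTHREAD) */

        if (pid < 0)
            return -1;
        if (pid == 0) {
            /* Child: brief descriptor churn, then exit immediately. */
            int fd = open("/dev/null", O_RDONLY);

            if (fd >= 0)
                close(fd);
            _exit(0);
        }

        /* Parent: reap before creating the next child, so this peer never
         * has more than one live child at a time. */
        int status;

        if (waitpid(pid, &status, 0) != pid)
            return -1;
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            reaped++;
    }
    return reaped;
}
```

Because every child is reaped before the next one is created, the live process count per peer stays at two regardless of how long the loop runs.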

Build/run: `./build.sh && ./run.sh` (or `./fdtol_race <secs> <peers>`).

## Why this is not a false positive

The lock mismatch is structural, not a reviewer oversight:
- The increment and decrement sides are guarded by **two different,
  non-mutually-held locks** (`p1->p_token` is per-proc; `fd_spin` is shared).
- No atomic operation (for example `atomic_add_int`) and no common spinlock protects the word.
- `filedesc_to_leader_alloc`'s own comment (`kern_descrip.c:3343`)
  **"NOT MPSAFE"** corroborates that the splice was known-unsafe.
- Three independent kernel panics reproduce from unprivileged userland, one
  with a stack through the exact cited call chain.

## Recommended fix

`fix.diff` in this folder (git-apply-able, verified). It takes the shared
`fd_spin` (the same lock `fdfree` already uses for the decrement and list walk)
around **both** the `fdl_refcount++` and the `filedesc_to_leader_alloc()`
splice in `fork1`'s fdshare branch. This **matches** the proposal in the finding
markdown: the markdown sketched the `spin_lock` around the `++`, and this diff
additionally wraps the `else`-branch `filedesc_to_leader_alloc` call, which is
the unlocked splice the markdown also flagged. A more thorough follow-up would
switch `fdl_refcount`/`fdl_holdcount` updates to atomic operations (for example
`atomic_add_int`) and add a dedicated `fdl` list lock, but the minimal correct
fix is the `fd_spin` pairing.
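
The effect of pairing both sides on one lock can be illustrated with a userland model. This is a sketch under assumptions, not the contents of `fix.diff`: a `pthread_mutex_t` named `fd_lock` stands in for `fd_spin`, and `struct fdtol_lock_model`, `worker`, and `run_paired` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>

/* Userland model: fd_lock stands in for the shared fd_spin. */
struct fdtol_lock_model {
    pthread_mutex_t fd_lock;
    int fdl_refcount;
};

struct worker_arg {
    struct fdtol_lock_model *m;
    int delta;  /* +1 models the fork side, -1 models the exit side */
    int iters;
};

/* Both sides take the same lock around the read-modify-write. */
static void *worker(void *p)
{
    struct worker_arg *a = p;

    for (int i = 0; i < a->iters; i++) {
        pthread_mutex_lock(&a->m->fd_lock);
        a->m->fdl_refcount += a->delta;
        pthread_mutex_unlock(&a->m->fd_lock);
    }
    return NULL;
}

/* Runs one incrementing and one decrementing thread concurrently.
 * Returns the final count, or -1 if a thread could not be created. */
int run_paired(int start, int iters)
{
    struct fdtol_lock_model m;
    pthread_t inc, dec;
    struct worker_arg up = { &m, +1, iters };
    struct worker_arg down = { &m, -1, iters };

    assert(start > 0 && iters >= 0);
    pthread_mutex_init(&m.fd_lock, NULL);
    m.fdl_refcount = start;

    if (pthread_create(&inc, NULL, worker, &up) != 0)
        return -1;
    if (pthread_create(&dec, NULL, worker, &down) != 0) {
        pthread_join(inc, NULL);
        return -1;
    }
    pthread_join(inc, NULL);
    pthread_join(dec, NULL);
    pthread_mutex_destroy(&m.fd_lock);
    return m.fdl_refcount;
}
```

With equal numbers of increments and decrements under the common lock, the final count always equals `start`. Giving each thread its own lock instead reproduces the mismatch described in the mechanism section, and the result is no longer guaranteed.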
