# DF-0014 — VERDICT: FALSE POSITIVE (error path unreachable)

| Field          | Value                                                    |
|----------------|----------------------------------------------------------|
| Finding        | DF-0014                                                  |
| Verdict        | **NOT REPRODUCED — false positive (unreachable code)**   |
| Status         | not_reproduced                                           |
| Impact         | none                                                     |
| Confidence     | certain                                                  |
| Attempts       | 6 build/run iterations (~850K race iterations total)     |

## One-line summary

The `lwkt_reltoken(&prg->proc_token)` at `sys/kern/kern_proc.c:764` IS genuinely
erroneous code (it releases a token the thread does not hold), but the error
branch at `:763` that reaches it is **dead code on the current kernel** — the
only caller path (`sys_setpgid` → `enterpgrp`) makes it unreachable because
`pfind()`'s `PHOLD()` prevents the target process from transitioning to `SZOMB`
until after `enterpgrp()` returns.

## What the finding claims (and gets right)

The reviewer correctly identified two facts:

1. **`pfindn()` never returns with `prg->proc_token` held.** Confirmed:
   `kern_proc.c:554` (curproc shortcut — no token acquired), `:568` (hash
   match — token released before return), `:572` (not-found — token released
   before return).

2. **`enterpgrp()` acquires `prg->proc_token` only on the success path at
   `:770`, not before `:763`.** Confirmed: no `lwkt_gettoken(&prg->proc_token)`
   between function entry (`:727`) and `:763`.

Therefore the `lwkt_reltoken(&prg->proc_token)` at `:764` WOULD release an
un-held token and WOULD trigger `panic("lwkt_reltoken: illegal release")`
(`sys/kern/lwkt_token.c:853`) — **if** the error branch were ever reached.

## Why the error branch is unreachable (the race cannot be won)

The finding claims the error branch fires when the target process becomes
`SZOMB` between `sys_setpgid`'s `pfind()` (`kern_prot.c:372`) and `enterpgrp`'s
`pfindn()` (`kern_proc.c:763`). This race **cannot be won** because of the
`PHOLD`/`PSTALL` serialization:

### The serialization chain

1. **`sys_setpgid` calls `pfind(uap->pid)` at `kern_prot.c:372`.**
   `pfind()` (`kern_proc.c:510-535`) does `PHOLD(p)` (`:512` or `:527`),
   which atomically increments `p->p_lock` to ≥ 1. The returned `targp` is
   **referenced**.

2. **`sys_setpgid` calls `enterpgrp(targp, pgid, 0)` at `kern_prot.c:409`
   while still holding the `PHOLD` reference.** The reference is released
   only at `PRELE(targp)` at `kern_prot.c:415`, which is **after** `enterpgrp`
   returns. So `p->p_lock ≥ 1` throughout `enterpgrp()`.

3. **The ONLY code that sets `p->p_stat = SZOMB` is `proc_move_allproc_zombie()`
   at `kern_proc.c:1189`.** Verified: `grep -rn 'p_stat.*=.*SZOMB' sys/` returns
   exactly one write site.

4. **`proc_move_allproc_zombie()` calls `PSTALL(p, "reap1", 0)` at
   `kern_proc.c:1185` BEFORE setting `SZOMB` at `:1189`.** `PSTALL` is defined
   as: `if ((p)->p_lock > count) pstall(...)` (`sys/sys/proc.h:489-490`).
   `pstall()` (`kern_proc.c:302-336`) **blocks (tsleep)** until `p->p_lock`
   drops to `≤ count` (0 in this case).

5. **Therefore, while the parent holds `PHOLD` (p_lock ≥ 1), the child CANNOT
   reach `:1189` and CANNOT become `SZOMB`.** It blocks at `PSTALL` in
   `proc_move_allproc_zombie`, staying `SACTIVE`.

### The kernel's own comment confirms this

`sys/kern/kern_exit.c:501-503`:
```c
/*
 * We have to handle PPWAIT here or proc_move_allproc_zombie()
 * will block on the PHOLD() the parent is doing.
 */
```

And `kern_exit.c:515-518`:
```c
/*
 * Move the process to the zombie list.  This will block
 * until the process p_lock count reaches 0.
 */
```

### Result inside enterpgrp

Because `p->p_lock ≥ 1` (parent's `PHOLD`) throughout `enterpgrp()`, the target
process is **always `SACTIVE`** when `pfindn(savepid)` runs at `:763`:

- `pfindn()` scans `prg->allproc` and **skips `SZOMB` entries** (`:565`).
- The target is `SACTIVE` (blocked at `PSTALL` in `proc_move_allproc_zombie`),
  so `pfindn` finds it and returns it.
- `np == p` (same pid, same process), so the condition `np == NULL || np != p`
  is **false**.
- Execution falls through to the **success path** at `:770`. The error branch
  at `:763-768` is **never executed**.

### No alternate path to the error branch

`enterpgrp()` has exactly two callers (verified via grep):

1. `sys_setpgid` at `kern_prot.c:409` — target is `PHOLD`ed (proven above).
2. `sys_setsid` at `kern_prot.c:338` — target is `curproc` (`p`), which is the
   currently-running process and can never be `SZOMB`.

`pfindn()` can only return `NULL` for a `SZOMB` or absent process; `np != p` is
impossible (pids are unique and the target is still in `allproc` — it is not
removed until `proc_remove_zombie()` during `wait4`, which the single-threaded
parent cannot call concurrently with `setpgid`).

**Conclusion: the `lwkt_reltoken` at `:764` is unreachable dead code. The race
the finding describes is structurally impossible.**

## Empirical confirmation

The PoC (`setpgid_panic.c`) was built and run on the DragonFlyBSD master DEV
guest (2 vCPUs, `usched_dfly`). Multiple race strategies were tried across ~6
iterations totaling ~850K+ fork+exit+setpgid race attempts:

| Run config                          | Racers | Total iters | Panic? |
|-------------------------------------|--------|-------------|--------|
| sweep=200, parallel=1 (original)    | 1      | ~209K       | no     |
| sweep=200, parallel=4               | 4      | ~524K       | no     |
| sweep=150000, parallel=4            | 4      | ~327K       | no     |
| sweep=50000, parallel=8, yield=1000 | 8      | ~131K       | no     |

Guest remained healthy (`vm.sh status` → up) after every run. No kernel
warnings in `dmesg`. No panic signature in `dfbsd-qemu/boot.log`. This is
consistent with the code trace: the error branch cannot fire.

## PoC source changes

The original PoC was a single-threaded `fork`+`_exit`+`setpgid` loop with no
timing control. I rewrote it to:
- Add a **tunable spin-delay sweep** (0 to N iterations) between `fork` and
  `setpgid` to give the child time to enter `exit1()` and acquire `p_token`.
- Add **parallel racers** (configurable process count) to increase race pressure.
- Add optional **periodic `sched_yield()`** during the spin to cover the
  same-CPU scheduling case.
- Document the full race mechanism with line citations in the source comments.

These changes were an honest attempt to widen the race window as far as
possible. They failed to trigger the panic — not because the timing was wrong,
but because the race is structurally impossible (see above).

## Recommended fix (defense-in-depth)

Although the error branch is currently unreachable, the `lwkt_reltoken` at
`:764` is genuinely erroneous code — it releases a token that is not held. If a
future change adds a new caller of `enterpgrp()` (or relaxes the `PHOLD`/
`PSTALL` serialization), this would become a live kernel panic. The fix
(`fix.diff`) removes the erroneous `lwkt_reltoken` and adds a comment
explaining why the token is not held.

This is a **defense-in-depth hardening**, not a security fix — no exploitable
vulnerability exists on the current kernel.

## Key citations

- `sys/kern/kern_proc.c:763-768` — error branch with erroneous `lwkt_reltoken`.
- `sys/kern/kern_proc.c:1185-1189` — `PSTALL(p, "reap1", 0)` before `SZOMB`.
- `sys/kern/kern_prot.c:372` — `pfind()` → `PHOLD(targp)`.
- `sys/kern/kern_prot.c:409` — `enterpgrp()` called with `PHOLD` held.
- `sys/kern/kern_prot.c:415` — `PRELE(targp)` (after `enterpgrp`).
- `sys/kern/kern_exit.c:501-503` — kernel comment confirming PSTALL/PHOLD block.
- `sys/sys/proc.h:489-490` — `PSTALL` macro definition.
- `sys/kern/kern_proc.c:302-336` — `pstall()` blocks on `p_lock`.
- `sys/kern/lwkt_token.c:853` — the panic that WOULD fire if reachable.
