Unsynchronized fdtol->fdl_refcount++ / list splice in rfork fdshare path (UAF via refcount race)
| Field | Value |
|---|---|
| ID | DF-0033 |
| Status | new |
| Severity | Medium |
| CVSS 3.1 | CVSS:3.1/AV:L/AC:H/PR:L/UI:N/S:U/C:H/I:H/A:H |
| CWE | CWE-362 Race Condition; CWE-416 Use After Free |
| File | sys/kern/kern_fork.c (inc); sys/kern/kern_descrip.c (dec/unlink) |
| Lines | 568-569 (inc), kern_descrip.c:2675 (dec), 3359-3362 (splice) |
| Area | kern |
| Confidence | likely |
| Discovered | 2026-06-29 |
| Reported | pending |
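The vector string in the table can be scored mechanically. The sketch below implements only the scope-unchanged branch of the CVSS 3.1 base-score formula; the function names are illustrative and not taken from any library, and the numeric weights quoted in the usage note are the published 3.1 values for AV:L, AC:H, PR:L (scope unchanged), UI:N and C/I/A:H.

```c
/*
 * CVSS 3.1 base score, scope-unchanged case only (illustrative helper
 * names). Integer arithmetic is used for the "round up to one decimal"
 * step so that no <math.h> functions are needed.
 */
static double
cvss31_roundup(double x)
{
	long v = (long)(x * 100000.0 + 0.5);	/* scale to avoid float noise */

	if (v % 10000 == 0)
		return (double)v / 100000.0;	/* already one decimal place */
	return (double)(v / 10000 + 1) / 10.0;	/* next tenth upward */
}

static double
cvss31_base_unchanged(double av, double ac, double pr, double ui,
    double c, double i, double a)
{
	double iss = 1.0 - (1.0 - c) * (1.0 - i) * (1.0 - a);
	double impact = 6.42 * iss;
	double exploitability = 8.22 * av * ac * pr * ui;
	double sum;

	if (impact <= 0.0)
		return 0.0;
	sum = impact + exploitability;
	if (sum > 10.0)
		sum = 10.0;
	return cvss31_roundup(sum);
}
```

Called as cvss31_base_unchanged(0.55, 0.44, 0.62, 0.85, 0.56, 0.56, 0.56), this evaluates to 7.0, which sits at the lower edge of the CVSS 3.1 High band (7.0-8.9); that is worth keeping in mind next to the Medium in the Severity row.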
Summary
In the fdshare branch of fork1() (neither RFFDG nor RFCFDG, e.g.
rfork(RFPROC|RFTHREAD)), the filedesc-to-leader node is shared:
kern_fork.c:568-569 does fdtol = p1->p_fdtol; fdtol->fdl_refcount++; under
the forking proc's p_token. The matching decrement in fdfree()
(kern_descrip.c:2675) runs under the shared filedesc's fd_spin. Two
peers that share both p_fd and p_fdtol hold different locks while mutating
the same refcount word: a lost-update race. A lost increment lets fdfree()
free fdtol (kern_descrip.c:2676) while other peers still hold p_fdtol
pointing at it, a use-after-free on M_FILEDESC_TO_LEADER memory. Additionally,
filedesc_to_leader_alloc() (kern_descrip.c:3359-3362) splices fdtol into
the shared fdl_next/fdl_prev list under no lock (self-admitted
"NOT MPSAFE" at :3343), enabling concurrent list corruption.
Root cause
    if ((flags & RFTHREAD) != 0) {
        fdtol = p1->p_fdtol;
        fdtol->fdl_refcount++;    /* under p1->p_token; NOT fd_spin */
    } else {
        fdtol = filedesc_to_leader_alloc(p1->p_fdtol, p2);    /* unlocked splice */
    }
The decrement/unlink side (fdfree) takes fdp->fd_spin. Because all fdtol
sharers share the same p_fd (hence the same fd_spin), the correct
serialization lock is fd_spin, which is not held on the increment/
splice side in fork1.
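The lost update can be replayed deterministically in userspace. The sketch below is a model, not kernel code: cpu_a stands for the fork1() increment and cpu_b for the fdfree() decrement, and the struct and function names are hypothetical. It hard-codes one losing interleaving of the two read/modify/write sequences, which is possible precisely because the two sides hold different locks.

```c
/*
 * Deterministic replay of one losing interleaving (userspace model).
 * With no common lock, nothing stops both CPUs from loading the word
 * before either of them stores it back.
 */
struct replay_result {
	int expected;	/* value if ++ and -- had been serialized */
	int observed;	/* value actually left in the word */
};

static struct replay_result
replay_lost_update(int refcount)
{
	struct replay_result r;
	int cpu_a, cpu_b;

	r.expected = refcount + 1 - 1;

	cpu_a = refcount;	/* A loads the word (about to ++) */
	cpu_b = refcount;	/* B loads the same value (about to --) */
	refcount = cpu_a + 1;	/* A stores its increment */
	refcount = cpu_b - 1;	/* B stores last: A's increment is lost */

	r.observed = refcount;
	return r;
}
```

Starting from a count of 2, the serialized result would be 2 but the word ends up holding 1, so one fewer reference is recorded than actually exists. That is the state from which a later decrement can reach zero while a peer still holds the pointer.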
Threat model & preconditions
- Attacker position: any unprivileged local user using rfork(RFPROC|RFTHREAD) to create peers sharing the fd table, then concurrently forking from one peer while another exits (or forks).
- Privileges gained or impact: a lost increment drives fdl_refcount below the true reference count; when a sharer exits and fdfree() sees fdl_refcount == 0, it kfrees fdtol while other peers still reference it, giving a UAF (kernel memory corruption / controlled free of an attacker-influenced slab object). A lost decrement leaks the node. The unlocked list splice can additionally corrupt the circular fdl list, yielding memory corruption in closef()/do_dup().
- Required config or capabilities: none; default kernel. Trigger is narrow (rfork fdshare + concurrent peer fork/exit), hence AC:H.
- Reachability: rfork(RFPROC|RFTHREAD) peers + concurrent fork/exit.
Proof of concept
PoC source: findings/poc/DF-0033/fdtol_race.c
Build & run (unprivileged, disposable VM)
cc -o fdtol_race findings/poc/DF-0033/fdtol_race.c
./fdtol_race
Expected output
Intermittent kernel memory corruption / panic in fdfree()'s fdl list walk
or the next fork's fdtol deref (UAF).
Impact
Refcount race -> UAF reachable by an unprivileged user via rfork fdshare +
concurrent peer fork/exit. Medium (narrow race window, but UAF = potential
corruption/LPE).
Recommended fix
Pair the refcount mutation with the lock already used on the free side
(fd_spin), and lock the fdl list splice:
--- a/sys/kern/kern_fork.c
+++ b/sys/kern/kern_fork.c
@@ -568,2 +568,6 @@
 		fdtol = p1->p_fdtol;
-		fdtol->fdl_refcount++;
+		/* fdl_refcount is mutated under the shared fd table's spinlock
+		 * on the decrement side (fdfree), so match it here. */
+		spin_lock(&p1->p_fd->fd_spin);
+		fdtol->fdl_refcount++;
+		spin_unlock(&p1->p_fd->fd_spin);
Additionally, filedesc_to_leader_alloc() (kern_descrip.c:3346-3368) must
take fd_spin (or a dedicated fdtol lock) around the fdl_next/fdl_prev
splice. A more thorough fix converts fdl_refcount to an atomic/refcount_t
and adds a dedicated lock for the fdl list.
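As a rough illustration of the "atomic refcount" direction mentioned above, the following userspace sketch uses C11 <stdatomic.h>. It is an assumption-laden model: struct fdtol_model and its helpers are hypothetical names, and the kernel would use its own atomic primitives rather than the C11 library.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the refcount part of filedesc_to_leader. */
struct fdtol_model {
	atomic_int refcount;
};

static void
fdtol_model_init(struct fdtol_model *m)
{
	atomic_init(&m->refcount, 1);	/* creator holds the first reference */
}

/* Take a reference; the caller must already hold a valid one. */
static void
fdtol_model_ref(struct fdtol_model *m)
{
	atomic_fetch_add_explicit(&m->refcount, 1, memory_order_relaxed);
}

/* Drop a reference; returns true only for the caller that dropped the last. */
static bool
fdtol_model_unref(struct fdtol_model *m)
{
	return atomic_fetch_sub_explicit(&m->refcount, 1,
	    memory_order_acq_rel) == 1;
}
```

Because the increment and decrement are single indivisible operations, neither can be lost regardless of which lock each caller holds. The counter alone does not protect the fdl_next/fdl_prev list, which is why the text above pairs this with a dedicated list lock.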
References
- sys/kern/kern_fork.c:568-569: fdl_refcount++ under p_token.
- sys/kern/kern_descrip.c:2675-2679: decrement under fd_spin.
- sys/kern/kern_descrip.c:3359-3362: unlocked list splice ("NOT MPSAFE").
- CWE-362 Race Condition; CWE-416 Use After Free.
Timeline
- 2026-06-29 Discovered during automated file-by-file audit of sys/kern/kern_fork.c.
- pending Reported to DragonFlyBSD security contact.
PoC verification
Evidence pack
findings/poc/DF-0033 · 10 files
| File | Type | Description | Size |
|---|---|---|---|
| fdtol_race.c | trigger-source | controlled racer: bounded-concurrency rfork(RFPROC\|RFTHREAD) peers + slab pressure to drive the fdl_refcount lost-update race and reclaim the freed slot | 4.2 KB |
| build.sh | build-script | cc -O2 -o fdtol_race fdtol_race.c | 190 B |
| run.sh | run-script | looped short bursts until panic or 12 rounds (race is intermittent) | 748 B |
| build.log | build-log | final successful build, full output | 68 B |
| run.log | run-log | sample 15s run (non-panic sample; panics land in panic.txt/boot.log) | 266 B |
| panic.txt | panic-signature | three distinct kernel panics from unprivileged maxx, incl. stack through filedesc_to_leader_alloc<-fork1<-sys_rfork | 4.5 KB |
| env.txt | environment | uname, cc version, sysctls, VM config | 689 B |
| VERDICT.md | verdict | full narrative: mechanism, evidence, exploit-chain characterization, fix rationale | 9.3 KB |
| fix.diff | suggested-fix | git-apply-able: take shared fd_spin around fdl_refcount++ and around filedesc_to_leader_alloc splice in fork1 fdshare branch | 1.0 KB |
| README.md | readme | human-facing build/run/expected + bug summary | 2.8 KB |
DF-0033: PoC
fdtol_race.c: unsynchronized fdtol->fdl_refcount++ (under per-proc
p_token) racing fdl_refcount-- (under the shared fd_spin) in the
rfork(RFPROC|RFTHREAD) fdshare path -> lost update -> premature kfree(fdtol)
-> UAF on M_FILEDESC_TO_LEADER. Plus an unlocked fdl list splice in
filedesc_to_leader_alloc (self-admitted "NOT MPSAFE").
Status: REPRODUCED
Three distinct kernel panics from unprivileged user maxx (see VERDICT.md
and panic.txt):
- panic: filedesc_to_refcount botch: fdl_refcount=0 (the KASSERT at kern_descrip.c:2627 catching the refcount underflow directly).
- panic: BADFREE2 (slab double-free of the prematurely-freed fdtol).
- panic: memory chunk ... is already allocated! (slab corruption on the next kmalloc; stack: chunk_mark_allocated <- _kmalloc <- filedesc_to_leader_alloc <- fork1 <- sys_rfork, the exact cited path).
The race is intermittent (CVSS AC:H); reproducibility needs several short bursts of hammering.
The bug (confirmed, line-level)
fork1() fdshare branch (sys/kern/kern_fork.c):
    :324   lwkt_gettoken(&p1->p_token);    /* per-PROC token */
    ...
    :568   fdtol = p1->p_fdtol;
    :569   fdtol->fdl_refcount++;          /* ONLY p_token held */
fdfree() (sys/kern/kern_descrip.c):
    :2622  spin_lock(&fdp->fd_spin);       /* shared fd-table spinlock */
    ...
    :2675  fdtol->fdl_refcount--;          /* ONLY fd_spin held */
p1->p_token is per-process; peers sharing p_fd hold different p_tokens,
so the increment and decrement do not share a serialization lock for the same
int word: a lost update. fdl_refcount is a plain int (filedesc.h:110).
Build & run (unprivileged, disposable VM)
./build.sh                     # cc -O2 -o fdtol_race fdtol_race.c
./run.sh                       # looped bursts until panic or 12 rounds
# or directly:
./fdtol_race <secs> <peers>    # e.g. ./fdtol_race 20 10
Run as the unprivileged user (e.g. maxx, uid 1001). A panic drops the guest
into DDB and lands in dfbsd-qemu/boot.log on the host (serial console).
Expected output (bug present)
A kernel panic, one of:
- panic: filedesc_to_refcount botch: fdl_refcount=0
- panic: BADFREE2
- panic: memory chunk <addr> is already allocated!
Files
| File | Purpose |
|---|---|
| fdtol_race.c | controlled racer (bounded concurrency + slab pressure) |
| build.sh / run.sh | exact build/run |
| build.log / run.log | full logs |
| panic.txt | the three panic signatures + stacks from boot.log |
| env.txt | guest uname, compiler, sysctls |
| VERDICT.md | full narrative: mechanism, evidence, exploit-chain characterization |
| fix.diff | git-apply-able fix: take fd_spin around ++ and the splice |
| manifest.json | machine-readable catalog |
DF-0033 VERDICT: REPRODUCED (local-unprivileged kernel panic / UAF)
| Field | Value |
|---|---|
| Verdict | REPRODUCED |
| Impact | panic (reliable local-unprivileged kernel DoS); underlying primitive is a UAF on M_FILEDESC_TO_LEADER (64-byte slab zone) |
| Confidence | certain (three distinct kernel panics from unprivileged maxx, one with a stack through the exact cited functions; plus an airtight line-level lock-mismatch proof) |
| Tested on | DragonFly 6.5-DEVELOPMENT v6.5.0.1712.g89e6a-DEVELOPMENT (master DEV build) |
| Attempts | ~8 build/run iterations; the race is intermittent (CVSS AC:H) and typically needs several 20-60 s bursts to hit |
Mechanism (confirmed in sys/, every hop cited)
The finding's claim is correct and the locking has not been fixed on master.
- fdtol->fdl_refcount is a plain int (sys/sys/filedesc.h:110). No atomics.
- Increment side, sys/kern/kern_fork.c:
  - :324 lwkt_gettoken(&p1->p_token), taken at the top of fork1().
  - :563 the RFTHREAD branch is entered for rfork(RFPROC|RFTHREAD) (fdshare).
  - :568 fdtol = p1->p_fdtol;
  - :569 fdtol->fdl_refcount++; mutated under p1->p_token only.
  - p1->p_token is held until :727.
- Decrement side, sys/kern/kern_descrip.c:
  - :2622 spin_lock(&fdp->fd_spin), the shared fd-table spinlock.
  - :2675 fdtol->fdl_refcount--; mutated under fd_spin only.
- p1->p_token is per-process. Two peers A and B that share p_fd/p_fdtol (created by rfork(RFPROC|RFTHREAD)) hold different p_tokens (A.p_token != B.p_token). lwkt tokens do not serialize across processes that do not share the same token. So:
  - fork-in-A holds A.p_token;
  - exit-in-B (fdfree via exit1, kern_exit.c:382) holds B.p_token plus the shared fd_spin;
  - neither lock is common to both sides for the fdl_refcount word. The only lock that is genuinely common to all sharers is fd_spin, and the increment side does not take it.
- ++/-- on a plain int is a classic read/modify/write: concurrent ++/-- from two CPUs is a lost update.
- Consequence. A lost increment drives fdl_refcount below the true reference count. When some peer later exits, fdfree decrements, sees fdl_refcount == 0, unlinks the node from the circular fdl list and calls kfree(fdtol, M_FILEDESC_TO_LEADER) (kern_descrip.c:2676-2686) while other peers still have p_fdtol pointing at it: a use-after-free. The next rfork in a surviving peer dereferences p1->p_fdtol (dangling) and bumps fdl_refcount in freed memory; or takes the non-RFTHREAD branch and calls filedesc_to_leader_alloc(old=p1->p_fdtol, ...), which reads old->fdl_next/old->fdl_prev from the freed slot and writes through them, corrupting whatever now occupies that slab slot.
- Unlocked list splice, sys/kern/kern_descrip.c:3342-3368: filedesc_to_leader_alloc() is self-admitted "NOT MPSAFE" (:3343) and splices the shared fdl_next/fdl_prev list (:3358-3362) under no lock. With the fix's fd_spin held around the call site, the splice becomes serialized against fdfree's list walk.
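To make the list hazard above concrete, here is a generic userspace model of a circular doubly linked ring. It is a textbook insert/unlink pair, not a transcription of filedesc_to_leader_alloc() or fdfree(); all names are hypothetical. The point is that an insert is four dependent pointer writes and an unlink is two, so an unlink that lands between those writes leaves the ring inconsistent.

```c
#include <stddef.h>

struct ring_node {
	struct ring_node *next;
	struct ring_node *prev;
};

static void
ring_init(struct ring_node *n)
{
	n->next = n;
	n->prev = n;
}

/* Insert n directly after old: four pointer writes that must not interleave. */
static void
ring_insert_after(struct ring_node *old, struct ring_node *n)
{
	n->next = old->next;
	n->prev = old;
	old->next->prev = n;
	old->next = n;
}

/* Remove n from its ring and leave it self-linked. */
static void
ring_unlink(struct ring_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n;
	n->prev = n;
}

/* Count members; returns -1 if a back-link is wrong or the walk exceeds limit. */
static int
ring_check(const struct ring_node *start, int limit)
{
	const struct ring_node *p = start;
	int count = 0;

	do {
		if (p->next->prev != p)
			return -1;
		p = p->next;
		if (++count > limit)
			return -1;
	} while (p != start);
	return count;
}
```

Run single-threaded, three nodes inserted into one ring give ring_check() == 3 and unlinking one gives 2. The invariant that ring_check() tests (every node's successor points back at it) is exactly what an unserialized insert and unlink can break.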
Evidence (three panics, unprivileged maxx)
All three are in panic.txt. Summary:
- Panic A: panic: filedesc_to_refcount botch: fdl_refcount=0 in fdfree. This is the KASSERT(fdtol->fdl_refcount > 0, ...) at kern_descrip.c:2627-2629 firing, i.e. the kernel's own invariant check catching the refcount underflow that the lost-update race produces. Stack: fdfree <- exit1 <- sigexit <- postsig <- userret.
- Panic B: panic: BADFREE2 in _kfree. Slab double-free detection: after the premature kfree(fdtol), a dangling p_fdtol reference drove a second free of the same M_FILEDESC_TO_LEADER object. The slab allocator's bookkeeping became inconsistent, so a later kfree in sysctl_kern_proc_args (collateral) trips BADFREE2.
- Panic C: panic: memory chunk ... is already allocated! This is slab corruption detected on the next allocation out of M_FILEDESC_TO_LEADER. The stack is the smoking gun, walking the exact functions cited in the finding: chunk_mark_allocated <- _kmalloc <- filedesc_to_leader_alloc <- fork1 <- sys_rfork. This is the full chain: refcount race -> premature free -> dangling write into the freed 64-byte-zone slot -> corrupted slab bitmap -> next kmalloc panics.
The race is intermittent (AC:H). Across the verification, panic A fired after ~50 s of hammering; panics B and C each fired within a handful of 20 s bursts.
Exploit chain (characterization; LPE not demonstrated)
The primitive is memory corruption (UAF), so per the audit methodology the chain is characterized even though root was not achieved in this session.
- Object / zone. struct filedesc_to_leader is 40 bytes (int×3 + ptr×3, sys/sys/filedesc.h:109-117). DragonFly's slab rounds allocation size up to a power of two (powerof2_size, sys/kern/kern_slaballoc.c:776-786), so fdtol lands in the 64-byte chunk zone (kmalloc(sizeof(struct filedesc_to_leader)=40) -> 64).
- Write primitive. After the premature free, a surviving peer's next non-RFTHREAD rfork calls filedesc_to_leader_alloc(old=p1->p_fdtol), which executes old->fdl_next->fdl_prev = old->fdl_prev (kern_descrip.c:3362). If the attacker reclaims the freed 64-byte slot with a crafted object, the attacker controls old->fdl_next (write address) and old->fdl_prev (write value): a single arbitrary pointer-sized write. Additionally old->fdl_next = fdtol (:3361) writes a known kernel pointer into a controlled offset.
- Victim objects (64-byte zone). Candidate victims would be any kmalloc of <= 64 bytes containing an attacker-interesting field: a function pointer, a uid, a struct ucred * / struct file * pointer, a refcount. Grooming would spray the 64-byte zone (sockets, pipes, small kinfo structs) to place such a victim adjacent to the freed fdtol slot.
- How far it got / what blocks root. The write primitive is real but gated behind a non-deterministic race (AC:H): the attacker must first win the refcount lost-update, then reclaim the slot, then trigger the splice, all in the right order across two CPUs. On this INVARIANTS kernel the KASSERT (Panic A) catches the underflow before the corruption phase, masking the exploitable path. On a production (non-INVARIANTS) kernel the KASSERT is compiled out and the UAF proceeds silently to the arbitrary write; turning that into uid=0 requires slab grooming + a chosen victim object + a deterministic race trigger, which is substantial exploit development and was not completed in this session. The demonstrated, reproducible ceiling here is local-unprivileged kernel panic (DoS), which is already a High-impact outcome for a default-config kernel.
A maintainer should treat the LPE ceiling as plausible-but-unproven; the DoS is proven.
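The size-class arithmetic in the "Object / zone" item can be checked with a few lines of C. The struct below only mirrors the described shape (three ints, three pointers) with placeholder field names, and round_up_pow2() is a hypothetical stand-in for the rounding the text attributes to powerof2_size; neither is copied from the kernel.

```c
#include <stddef.h>

/* Same shape as described above: int x3 + ptr x3 (placeholder names). */
struct fdtol_layout {
	int   i0, i1, i2;
	void *p0, *p1, *p2;
};

/* Smallest power of two that is >= n (for n >= 1). */
static size_t
round_up_pow2(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}
```

On an LP64 target the three ints occupy 12 bytes, 4 bytes of padding align the first pointer, and the three pointers add 24, so sizeof(struct fdtol_layout) is 40 and round_up_pow2(40) is 64, matching the 64-byte zone named above.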
PoC changes (vs. the seeded fdtol_race.c)
The seeded PoC compiled and ran but fork-bombed into kern.maxprocperuid
within seconds, wedging ssh before the race had time to fire. Rewrote it as a
controlled racer:
- Bounded concurrency: each peer does rfork(RFPROC|RFTHREAD) + child _exit(0) and the parent-peer reaps with waitpid, so the live process count stays well under the per-uid cap (the original looped fork() forever and orphaned everything).
- Added slab pressure (open("/dev/null")/close churn in both peer and child) so that, once a premature free occurs, the freed 64-byte slot is likely reclaimed and the next deref hits clobbered memory, producing a visible panic (this is exactly what produced Panic C).
- The parent also runs rfork(RFPROC|RFTHREAD) + _exit to add a third contender for the refcount word (more cross-CPU ++/-- overlap).
- SIGALRM-bounded runtime; the run scripts loop short bursts because the race is intermittent.
Build/run: ./build.sh && ./run.sh (or ./fdtol_race <secs> <peers>).
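The bounded-concurrency idea (create one child, reap it, repeat) is independent of the bug and can be sketched with plain POSIX calls. This is not an excerpt from fdtol_race.c: it uses fork() as a portable stand-in for rfork(RFPROC|RFTHREAD), shares nothing with the child, and the function name is hypothetical.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Create and reap one child per round so that at most one child is alive
 * at any time. Returns the number of children that exited with status 0,
 * or -1 on a fork/wait error.
 */
static int
run_bounded_rounds(int rounds)
{
	int reaped = 0;

	for (int i = 0; i < rounds; i++) {
		pid_t pid = fork();
		int status;

		if (pid < 0)
			return -1;
		if (pid == 0)
			_exit(0);		/* child: exit immediately */

		/* Parent: reap before starting the next round. */
		while (waitpid(pid, &status, 0) < 0) {
			if (errno != EINTR)
				return -1;
		}
		if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
			reaped++;
	}
	return reaped;
}
```

Reaping inside the loop is what keeps the live process count flat; a loop that forks without waiting grows until it hits the per-uid limit, which is the failure mode described for the seeded version.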
Why this is not a false positive
The lock mismatch is structural, not a reviewer oversight:
- The increment and decrement sides are guarded by two different,
non-mutually-held locks (p1->p_token is per-proc; fd_spin is shared).
- No atomic_t, no atomic_add_int, no common spinlock protects the word.
- filedesc_to_leader_alloc's own comment (kern_descrip.c:3343)
"NOT MPSAFE" corroborates that the splice was known-unsafe.
- Three independent kernel panics reproduce from unprivileged userland, one
with a stack through the exact cited call chain.
Recommended fix
fix.diff in this folder (git-apply-able, verified). It takes the shared
fd_spin (the same lock fdfree already uses for the decrement and list walk)
around both the fdl_refcount++ and the filedesc_to_leader_alloc()
splice in fork1's fdshare branch. This matches the finding markdown's
proposal (the markdown sketched the spin_lock around the ++; this diff
additionally wraps the else-branch filedesc_to_leader_alloc call, which is
the unlocked splice the markdown also flagged). A more thorough follow-up would
convert fdl_refcount/fdl_holdcount to atomic_t and add a dedicated
fdl list lock, but the minimal correct fix is the fd_spin pairing.
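The pairing rule described above (both the increment and the decrement take the one lock every sharer has in common) can be modelled in userspace. In this sketch a pthread mutex stands in for fd_spin, and struct shared_table, struct leader and the two helpers are hypothetical names; it shows the locking shape only, not the contents of fix.diff.

```c
#include <pthread.h>
#include <stdbool.h>

/* Stand-in for the shared fd table: its lock is common to all sharers. */
struct shared_table {
	pthread_mutex_t lock;
};

/* Stand-in for the leader node whose refcount the sharers mutate. */
struct leader {
	int refcount;
};

/* Increment side: takes the same lock as the decrement side. */
static void
leader_ref(struct shared_table *t, struct leader *l)
{
	pthread_mutex_lock(&t->lock);
	l->refcount++;
	pthread_mutex_unlock(&t->lock);
}

/* Decrement side: returns true if this call dropped the last reference. */
static bool
leader_unref(struct shared_table *t, struct leader *l)
{
	bool last;

	pthread_mutex_lock(&t->lock);
	last = (--l->refcount == 0);
	pthread_mutex_unlock(&t->lock);
	return last;
}
```

Since every path that touches refcount goes through t->lock, the read/modify/write sequences can no longer overlap, whichever thread or process calls them. The zero test is made while the lock is still held, so the decision to free is based on a value no other sharer can be changing at that moment.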
Confirmed kernel references
- sys/kern/kern_fork.c:324
- sys/kern/kern_fork.c:568
- sys/kern/kern_fork.c:569
- sys/kern/kern_descrip.c:2622
- sys/kern/kern_descrip.c:2627
- sys/kern/kern_descrip.c:2675
- sys/kern/kern_descrip.c:2686
- sys/kern/kern_descrip.c:3343
- sys/kern/kern_descrip.c:3358
- sys/kern/kern_descrip.c:3362
- sys/sys/filedesc.h:110
- sys/kern/kern_exit.c:382
- sys/kern/kern_slaballoc.c:784
Detail
Exploit chain
Primitive is a UAF on M_FILEDESC_TO_LEADER (struct is 40 bytes -> 64-byte slab zone via powerof2_size). After the premature free, a surviving peer's next non-RFTHREAD rfork calls filedesc_to_leader_alloc(old=p1->p_fdtol) which executes old->fdl_next->fdl_prev = old->fdl_prev (kern_descrip.c:3362): if the attacker reclaims the freed 64-byte slot with a crafted object, this yields a single arbitrary pointer-sized write (controlled write-address via old->fdl_next, controlled value via old->fdl_prev). LPE ceiling is plausible on a non-INVARIANTS production kernel (where the KASSERT that currently masks the corruption is compiled out) via 64-byte-zone grooming against a victim object carrying a function pointer or ucred/uid, but the write is gated behind a non-deterministic race (AC:H) and root was not achieved in this session. Demonstrated, reproducible ceiling: local-unprivileged kernel panic (DoS) on the default INVARIANTS kernel. Characterized in VERDICT.md; no separate exploit.c shipped since the deterministic-race+grooming chain was not completed.
Evidence (decisive lines)
findings/poc/DF-0033/panic.txt holds all three panic signatures+stacks; VERDICT.md has the full line-level mechanism; manifest.json catalogs the pack. Decisive bytes: 'panic: filedesc_to_refcount botch: fdl_refcount=0' (KASSERT at kern_descrip.c:2627 fired), and 'panic: memory chunk 0xfffff800461052ad is already allocated!' with stack 'chunk_mark_allocated <- _kmalloc <- filedesc_to_leader_alloc+0x23 <- fork1+0xbe9 <- sys_rfork+0x43' -- the exact cited path. fdtol_race.c is the controlled racer; build.sh/run.sh reproduce.
PoC changes
Rewrote fdtol_race.c: the seeded version fork-bombed into kern.maxprocperuid and wedged ssh before the race could fire. v2 bounds concurrency (each peer rfork(RFPROC|RFTHREAD)+child _exit, reaped with waitpid), adds slab pressure (open/close /dev/null churn) so a prematurely-freed fdtol slot is reclaimed and the next deref hits clobbered memory (this is what produced panic C), and adds a SIGALRM-bounded runtime plus a third contender via the parent loop. Added build.sh, run.sh (looped bursts since the race is intermittent), env.txt, panic.txt, VERDICT.md, manifest.json, and fix.diff.
Verified recommended fix
fix.diff (git-apply-able, verified clean) takes the shared p1->p_fd->fd_spin -- the same lock fdfree already uses for the decrement and list walk -- around both the fdl_refcount++ (kern_fork.c:569) and the filedesc_to_leader_alloc() list splice (kern_fork.c:575) in fork1's fdshare branch. Matches the finding markdown's proposal (which sketched the spin_lock around the ++) and additionally wraps the else-branch unlocked splice the markdown also flagged. A more thorough follow-up would convert fdl_refcount/fdl_holdcount to atomic_t and add a dedicated fdl list lock.
Verdict
REPRODUCED. The lock-mismatch claim is airtight and unfixed on master: fdl_refcount is a plain int (sys/sys/filedesc.h:110), incremented at kern_fork.c:569 under only p1->p_token (acquired :324) which is PER-PROCESS, while decremented at kern_descrip.c:2675 under the SHARED fd_spin (acquired :2622). Peers sharing p_fd/p_fdtol hold different p_tokens, so the ++/-- share no serialization lock -> lost update -> premature kfree(fdtol) at kern_descrip.c:2686 while peers still reference it -> UAF. Confirmed by three distinct kernel panics from unprivileged user maxx: (A) 'panic: filedesc_to_refcount botch: fdl_refcount=0' -- the KASSERT at kern_descrip.c:2627 catching the refcount underflow directly; (B) 'panic: BADFREE2' -- slab double-free of the prematurely-freed fdtol; (C) 'panic: memory chunk ... is already allocated!' with stack chunk_mark_allocated<-_kmalloc<-filedesc_to_leader_alloc<-fork1<-sys_rfork, i.e. the exact cited call chain, slab corruption from the dangling write into the freed 64-byte-zone slot. filedesc_to_leader_alloc (kern_descrip.c:3342-3368) also splices the shared fdl list under no lock ('NOT MPSAFE' :3343), as the finding states.