Seven Primitives of Distributed Agent Systems: A Gender Analysis
We define seven atomic principles for distributed agent coordination, then measure gendered engagement with each one using simulated focus groups under two methodologies. The goal is to quantify systematic biases and derive design implications for multi-agent systems.
The Seven Primitives
Each primitive is defined as an atomic, composable unit of distributed coordination:
| Primitive | Definition | Failure Mode |
|---|---|---|
| Prioritization | Ranking by severity × tractability | Random action selection |
| Hypothesis Testing | Falsifiable predictions with Set A/B comparison | Change without measurement |
| Attention | 1:1 ratio of possible actions to intended actions | Thrashing across objectives |
| Deduplication | Eliminating redundant information at the earliest stage | Effort multiplication, signal dilution |
| Message Passing | Structured exchange: corroborate, contradict, fill-gap | Independent rediscovery |
| Consensus | Confidence from corroboration graph topology, not voting | Authority-based decisions |
| Progressive Discovery | Each action's residue reducing cost of next action | Constant cold starts |
These map onto a loop: discover (progressive discovery) → detect (deduplication, consensus) → decide (prioritization, attention) → act (hypothesis testing) → communicate (message passing). The loop feeds back: message passing outputs become inputs for the next discovery cycle.
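The loop can be written down as a small data structure. This is a minimal sketch, not the authors' implementation: the `Phase` enum, the `LOOP` mapping, and `next_phase` are hypothetical names introduced here, and the phase-to-primitive assignments are copied from the sentence above.

```python
from enum import Enum


class Phase(Enum):
    DISCOVER = "discover"
    DETECT = "detect"
    DECIDE = "decide"
    ACT = "act"
    COMMUNICATE = "communicate"


# Phase -> primitives, as listed in the loop description.
LOOP = {
    Phase.DISCOVER: ["progressive discovery"],
    Phase.DETECT: ["deduplication", "consensus"],
    Phase.DECIDE: ["prioritization", "attention"],
    Phase.ACT: ["hypothesis testing"],
    Phase.COMMUNICATE: ["message passing"],
}


def next_phase(phase: Phase) -> Phase:
    """Return the phase after `phase`; communicate wraps back to discover."""
    order = list(Phase)  # Enum members iterate in definition order
    return order[(order.index(phase) + 1) % len(order)]


if __name__ == "__main__":
    phase = Phase.DISCOVER
    for _ in range(6):  # one full cycle plus the feedback step
        print(phase.value, "->", ", ".join(LOOP[phase]))
        phase = next_phase(phase)
```

The modulo step in `next_phase` is what models the feedback: the output of the communicate phase becomes the input of the next discover phase.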
Methodology
V1: Breadth-First (2 agents)
Two LLM agents (Claude Sonnet), each assigned a gender identity and six diverse professional personas, discussed all seven principles in sequence. Each agent produced per-principle scores on three dimensions: natural aptitude (1-10), effectiveness under pressure (1-10), and same-gender amplification (1-10, 5=neutral). Total: ~89,500 tokens, ~410s.
V2: Depth-First Map-Reduce (14 agents)
14 LLM agents (Claude Sonnet), each assigned one principle and one gender, discussed their single principle in depth for 10-15 exchanges. Each returned structured JSON with the same three-dimension scoring plus strengths, weaknesses, key quote, dissent, and same-gender effect analysis. Total: ~546,000 tokens, ~60s per agent (parallel).
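The fan-out described for V2 can be sketched as a map step over every (principle, gender) pair followed by a reduce step that indexes the returned JSON. Everything here is illustrative: `run_agent` is a hypothetical stub that returns an empty report instead of calling a model, and the JSON field names are assumptions derived from the list of fields above, not the schema actually used.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from itertools import product

PRINCIPLES = [
    "Prioritization", "Hypothesis Testing", "Attention", "Deduplication",
    "Message Passing", "Consensus", "Progressive Discovery",
]
GENDERS = ["female", "male"]


def run_agent(principle: str, gender: str) -> str:
    """Hypothetical stand-in for one LLM agent; returns a JSON string.

    A real agent would hold a 10-15 exchange discussion and fill these fields.
    """
    report = {
        "principle": principle,
        "gender": gender,
        "scores": {"aptitude": None, "pressure": None,
                   "same_gender_amplification": None},
        "strengths": [],
        "weaknesses": [],
        "key_quote": "",
        "dissent": "",
        "same_gender_effect": "",
    }
    return json.dumps(report)


def map_reduce() -> dict:
    """Run one agent per (principle, gender) pair in parallel, then index results."""
    tasks = list(product(PRINCIPLES, GENDERS))  # 7 principles x 2 genders = 14
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        raw = list(pool.map(lambda task: run_agent(*task), tasks))
    reports = {}
    for text in raw:  # reduce: parse each report and key it for later synthesis
        report = json.loads(text)
        reports[(report["principle"], report["gender"])] = report
    return reports


if __name__ == "__main__":
    print(len(map_reduce()), "reports collected")
```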
Reconciliation
V1 and V2 scores were averaged to produce reconciled estimates. Averaging is meant to offset V1's narrative anchoring bias against V2's score inflation bias; it dampens both rather than eliminating either.
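The reconciliation step is simple arithmetic. The sketch below assumes averages are rounded half up to a whole number, which is an inference from the one cell whose raw scores are both quoted in this write-up (male deduplication: V1 3/4, V2 7/9). The function name `reconcile` is hypothetical.

```python
import math


def reconcile(v1: float, v2: float) -> int:
    """Average a V1 and a V2 score, rounding half up (assumed rounding rule)."""
    return math.floor((v1 + v2) / 2 + 0.5)


# Male deduplication, aptitude/pressure: V1 = 3/4, V2 = 7/9.
male_dedup = (reconcile(3, 7), reconcile(4, 9))
print(male_dedup)  # (5, 7)
```

Python's built-in `round` rounds halves to the nearest even integer, so `round(6.5)` gives 6; the explicit `floor(x + 0.5)` is used here to reproduce the 7 that the reconciled table reports for that cell.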
Reconciled Scores
| Principle | F-Apt | F-Press | M-Apt | M-Press | Delta Apt | Delta Press |
|---|---|---|---|---|---|---|
| Prioritization | 7 | 7 | 7 | 9 | 0 | +2 M |
| Hypothesis Testing | 7 | 5 | 6 | 6 | +1 F | +1 M |
| Attention | 6 | 5 | 6 | 9 | 0 | +4 M |
| Deduplication | 7 | 5 | 5 | 7 | +2 F | +2 M |
| Message Passing | 8 | 6 | 7 | 9 | +1 F | +3 M |
| Consensus | 8 | 5 | 5 | 6 | +3 F | +1 M |
| Progressive Discovery | 8 | 6 | 7 | 6 | +1 F | 0 |
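The two delta columns follow directly from the four score columns. A minimal sketch that recomputes them from the table above; `ROWS` and `delta` are names introduced here for illustration.

```python
# (principle, F-Apt, F-Press, M-Apt, M-Press), copied from the table.
ROWS = [
    ("Prioritization", 7, 7, 7, 9),
    ("Hypothesis Testing", 7, 5, 6, 6),
    ("Attention", 6, 5, 6, 9),
    ("Deduplication", 7, 5, 5, 7),
    ("Message Passing", 8, 6, 7, 9),
    ("Consensus", 8, 5, 5, 6),
    ("Progressive Discovery", 8, 6, 7, 6),
]


def delta(female: int, male: int) -> str:
    """Format a gap the way the table does: '+2 F', '+4 M', or '0'."""
    gap = female - male
    if gap == 0:
        return "0"
    return f"+{abs(gap)} {'F' if gap > 0 else 'M'}"


if __name__ == "__main__":
    for name, f_apt, f_press, m_apt, m_press in ROWS:
        print(f"{name}: apt {delta(f_apt, m_apt)}, press {delta(f_press, m_press)}")
```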
Score Distribution Analysis
Female scores cluster: aptitude std dev = 0.7, pressure std dev = 0.7. Male scores are bimodal: aptitude std dev = 0.8, pressure std dev = 1.5.
The male pressure distribution is the most important structural finding. Three principles cluster at 9 (prioritization, attention, message passing) and three at 6 (hypothesis testing, consensus, progressive discovery), with deduplication between them at 7. This bimodality maps onto an execution/reflection split: male scores spike on execution primitives under pressure and hold at baseline on reflection primitives.
Female pressure scores show no such bimodality. Performance under pressure is uniformly 5-7, suggesting a more context-invariant engagement pattern.
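The spread figures can be checked against the reconciled table with Python's standard `statistics` module. This sketch works from the rounded table values, so its output can differ slightly from the figures quoted above, which may have been computed from unrounded averages or with a different estimator (population versus sample standard deviation). The `scores` dictionary is a name introduced here.

```python
from statistics import pstdev, stdev

# Columns of the reconciled table, in row order
# (Prioritization ... Progressive Discovery).
scores = {
    "female aptitude": [7, 7, 6, 7, 8, 8, 8],
    "female pressure": [7, 5, 5, 5, 6, 5, 6],
    "male aptitude": [7, 6, 6, 5, 7, 5, 7],
    "male pressure": [9, 6, 9, 7, 9, 6, 6],
}

if __name__ == "__main__":
    for label, values in scores.items():
        # pstdev divides by n, stdev by n - 1; print both for comparison.
        print(f"{label}: population={pstdev(values):.2f} sample={stdev(values):.2f}")
```

Under either estimator the male pressure column has the widest spread of the four, which is the structural point the analysis relies on.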
Methodological Findings: V1 vs V2
| Bias | V1 (Breadth) | V2 (Depth) |
|---|---|---|
| Narrative anchoring | Strong. A compelling failure mode gets applied uniformly across principles. | Weak. Each agent develops its own narrative. |
| Context sensitivity | Low. Absolute claims ("men are bad at X"). | High. Conditional claims ("men are bad at X in context Y"). |
| Score inflation | Lower. Single agent maintains internal calibration. | Higher. Independent agents trend toward finding nuance that pushes scores up. |
| Cross-principle coherence | Strong. Single agent sees themes across all seven. | Weak. Cross-principle synthesis requires explicit reduce step. |
Largest V1-V2 Divergences
Male deduplication: V1 scored 3/4 (apt/press). V2 scored 7/9. Delta: +4/+5. V1 anchored on the "ego tax" narrative and applied it uniformly. V2 discovered that men have strong operational dedup instincts (ER delta-only handoffs, military "once up once down") that only fail in low-stakes social contexts. Depth revealed the context-dependency that breadth missed.
Female same-gender amplification: V1 mean 3.4. V2 mean 5.4. Delta: +2.0. V1's single agent overapplied the "relational override" failure mode. V2's per-principle agents found that relational dynamics also amplify corroboration, accelerate information pooling, and lower ego-attachment to predictions.
LLM scoring is sensitive to context window allocation. A single agent scoring seven constructs produces more internally consistent but less nuanced results than seven independent agents scoring one construct each. Neither is strictly superior. Averaging the two dampens the worst biases of both but does not eliminate them.
This has direct implications for bench scoring in multi-dimensional evaluation: per-dimension specialist scorers may produce different rankings than holistic scorers, and the difference is systematic, not random.
Design Implications for Multi-Agent Systems
1. Phase-specific agent configuration
Discovery/detection agents (progressive discovery, deduplication, consensus): optimize for information sharing, failure surfacing, and source independence verification. These are the primitives where the female behavioral pattern outperforms on aptitude.
Execution/communication agents (prioritization, attention, message passing): optimize for protocol adherence, single-objective focus, and structured data exchange. These are the primitives where the male behavioral pattern outperforms under pressure.
In synthetic agents, this translates to prompt engineering choices: discovery agents should share intermediate findings and flag uncertainty. Execution agents should follow the protocol, ignore tangents, and transmit structured payloads.
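One way to express that split is as two prompt profiles. This is a sketch under assumptions: `AgentProfile`, `DISCOVERY`, `EXECUTION`, and `system_prompt` are hypothetical names, and the directive wording paraphrases the paragraph above rather than quoting any prompt actually used. It requires Python 3.9 or later for the `tuple[str, ...]` annotations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProfile:
    primitives: tuple[str, ...]   # primitives this agent class is responsible for
    directives: tuple[str, ...]   # behavioral instructions for the prompt


DISCOVERY = AgentProfile(
    primitives=("progressive discovery", "deduplication", "consensus"),
    directives=("Share intermediate findings.", "Flag uncertainty explicitly."),
)

EXECUTION = AgentProfile(
    primitives=("prioritization", "attention", "message passing"),
    directives=("Follow the protocol.", "Ignore tangents.",
                "Transmit structured payloads."),
)


def system_prompt(profile: AgentProfile) -> str:
    """Render a profile as a plain-text system prompt, one directive per line."""
    lines = ["You are responsible for: " + ", ".join(profile.primitives) + "."]
    lines.extend(profile.directives)
    return "\n".join(lines)


if __name__ == "__main__":
    print(system_prompt(DISCOVERY))
    print()
    print(system_prompt(EXECUTION))
```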
2. Contradiction channel engineering
The most asymmetric finding: the contradiction operation in message passing is systematically attenuated by both genders, through different mechanisms. Female agents soften contradiction into ambiguous language. Male agents suppress gap-fill messages across domain boundaries.
For synthetic agents: make message types explicit in the schema. A message tagged type: contradiction cannot be misread regardless of how the content is phrased.
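A minimal sketch of such a schema, assuming the three operations named in the primitives table (corroborate, contradict, fill-gap) are the only permitted message types. The `Message` fields other than `type` and the `parse` helper are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class MessageType(Enum):
    CORROBORATE = "corroborate"
    CONTRADICT = "contradict"
    FILL_GAP = "fill-gap"


@dataclass(frozen=True)
class Message:
    type: MessageType   # explicit operation, independent of phrasing
    claim_id: str       # hypothetical: identifier of the claim being addressed
    content: str        # free text; tone here cannot change the message type


def parse(raw: dict) -> Message:
    """Build a Message, raising ValueError if the type is not a known operation."""
    return Message(MessageType(raw["type"]), raw["claim_id"], raw["content"])


if __name__ == "__main__":
    msg = parse({
        "type": "contradict",
        "claim_id": "c-17",
        "content": "I might be missing something, but the logs suggest otherwise.",
    })
    print(msg.type.name)  # CONTRADICT, however softly the content is worded
```

Because the type is a closed enum, a softened or ambiguous disagreement still has to be filed under one of the three operations, and a payload with an unknown type is rejected at parse time instead of being silently reinterpreted.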
3. Claim-claimant separation
Both genders conflate the claim graph (what is believed and why) with the social graph (who said it and what their status is). Women read contradiction as relational rupture. Men read correction as status challenge. Both errors collapse when the system architecture forces separation.
For synthetic agents: implement "claim cards" (written, attributed claims submitted before group discussion). Require agents to evaluate claims without access to claimant identity during the consensus phase.
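Claim cards with identity blinding might look like the following. This is an illustrative sketch only: the `ClaimCard` fields, `blind`, and `consensus_view` are assumed names, and the source does not specify how attribution is stored or hidden.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class ClaimCard:
    claim_id: str
    text: str                 # what is believed
    evidence: str             # why it is believed
    claimant: Optional[str]   # who submitted it; recorded at submission time


def blind(card: ClaimCard) -> ClaimCard:
    """Return a copy of the card with the claimant removed."""
    return replace(card, claimant=None)


def consensus_view(cards: List[ClaimCard]) -> List[ClaimCard]:
    """Cards as evaluating agents see them during the consensus phase."""
    return [blind(card) for card in cards]


if __name__ == "__main__":
    submitted = [
        ClaimCard("c-1", "Cache misses cause the latency spike.", "p99 trace", "agent-a"),
        ClaimCard("c-2", "The spike predates the cache change.", "deploy log", "agent-b"),
    ]
    for card in consensus_view(submitted):
        print(card.claim_id, card.text, card.claimant)
```

The original cards keep their attribution, so the claim graph can be re-joined to the social graph after the consensus phase if an audit trail is needed.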
4. Structure as the universal intervention
12 of 14 V2 agents independently recommended formalized protocols as the primary intervention. For collaborative/relational agents, structure provides permission to execute difficult operations. For competitive/hierarchical agents, structure provides constraint that prevents social dynamics from corrupting evaluation.
Agent coordination protocols should be explicit, schema-enforced, and non-negotiable. Letting agents develop their own coordination norms through emergence will reproduce the same social-layer failures the simulated focus groups exhibited.
Five Highest-Confidence Claims
These findings held under both methodologies. Where V1 and V2 scores are given, the direction of the gap is the same in both, though its size differs:
- Female agents outscore on consensus aptitude (V1: +4, V2: +2, reconciled: +3). Mechanism: faster corroboration network formation, lower information hoarding, active source independence verification.
- Male agents outscore on attention under pressure (V1: +6, V2: +3, reconciled: +4). Mechanism: environmental trigger activation produces exceptional single-objective focus when stakes are physical and immediate.
- Message passing is asymmetric: female aptitude > male aptitude, male pressure > female pressure. Mechanism: female agents have broader informal networks and stronger fill-gap instincts; male agents have higher protocol fidelity under load.
- Both genders fail at progressive discovery externalization. Mechanism differs (relational storage vs. status hoarding) but outcome is identical: knowledge compounds individually and dissipates organizationally.
- Structure is the universal equalizer. Explicit protocol narrows gender performance gaps on every primitive. Mechanism differs (permission vs. constraint) but effectiveness is consistent.
Limitations
1. Simulated, not empirical. These are LLM-generated focus groups, not human subjects research. The findings reflect the model's training data on gender dynamics in professional settings, not direct observation.
2. Cultural specificity. The professional personas are drawn from Western (primarily American) professional contexts. The gender dynamics described may not generalize across cultures.
3. Binary gender framing. The experiment used a male/female binary. Non-binary and gender-diverse dynamics are not captured.
4. Scorer sensitivity. V1/V2 divergence demonstrates that LLM-generated scores are method-sensitive. The reconciled scores are more trustworthy than either alone, but should be treated as directional estimates, not precise measurements.
5. Persona anchoring. The same six professions were used across all simulations. Different profession sets might produce different scores, particularly for principles that are highly domain-dependent.
Three-Part Series
Part 1: What Men and Women Are Actually Good At
Part 2: Why Same-Gender Teams Underperform
Part 3: Technical Methodology (you are here)