Societies of Thought Research

Description: Research investigating how enhanced reasoning in AI models emerges through simulated multi-agent interactions rather than merely extended computation.

Other information: Full title: Reasoning Models Generate Societies of Thought. This strategic research plan examines conversational behaviors, diversity mechanisms, and reinforcement learning dynamics in reasoning-optimized language models (DeepSeek-R1, QwQ-32B) compared to instruction-tuned models, revealing that effective reasoning emerges through structured interplay of distinct internal perspectives characterized by varied personality traits and domain expertise.

Submitter's Note: This StratML rendition was compiled from the source by Claude.ai.

Organization: Paradigms of Intelligence Team (PIT)
  Identifier: uuid-org-001
  Description: Cross-institutional research collaboration
  Stakeholders:
    Google Research: Primary research institution
    University of Chicago: Academic research partner
    Santa Fe Institute: Complexity science research partner

Vision (uuid-vision-001): Understanding artificial reasoning through the lens of social organization and collective intelligence, where diversity and structured coordination drive superior problem-solving capabilities analogous to human group dynamics.

Mission (uuid-mission-001): To reveal the mechanisms by which reasoning models achieve enhanced performance through implicit simulation of multi-agent interactions, informing the design of more effective AI reasoning architectures.

Values:
  Scientific Rigor: Employing quantitative analysis, mechanistic interpretability methods, and controlled experiments across multiple model systems and benchmarks.
  Interdisciplinary Integration: Drawing insights from cognitive science, social psychology, collective intelligence research, and AI systems to understand reasoning emergence.
  Reproducibility: Using established benchmarks, validated measurement frameworks (Bales IPA, Big Five personality), and open-source models to enable replication.


Goal 1: Conversational Behavior Analysis
  Description: Characterize dialogue-like patterns in reasoning traces
  Identifier: uuid-goal-001
  Stakeholder: AI Researchers (Mechanistic interpretability and reasoning researchers)
  Role: Knowledge Beneficiary (Beneficiary). Gain understanding of conversational dynamics in model reasoning.
  Other information: This goal investigates whether conversational behaviors and socio-emotional roles constitutive of back-and-forth dialogue are prevalent in reasoning traces, using LLM-as-judge methodology to quantify four conversational behaviors (question-answering, perspective shifts, conflict of perspectives, reconciliation) and Bales' twelve socio-emotional roles across reasoning and instruction-tuned models.

Objective 1.1: Conversational Patterns Quantification
  Description: Measure prevalence of dialogue behaviors in reasoning traces
  Identifier: uuid-obj-001-001
  Other information: Using Gemini-2.5-Pro as LLM-as-judge with substantial inter-rater reliability (mean ICC = 0.855 with GPT-5.2, mean ICC = 0.756 with human raters), quantify occurrence of four conversational behaviors within each reasoning trace across 8,262 problems from six benchmarks (BigBench Hard, GPQA, MATH Hard, MMLU-Pro, MUSR, IFEval).

  1.1.1 Question-answering behavior frequency
    Unit: Percentage point increase
    Identifier: uuid-pi-001-001-001 (PLACEHOLDER_1)
    Result: 34.5. DeepSeek-R1 vs DeepSeek-V3 (β=0.345, 95% CI=[0.328, 0.361], p<1×10^-323)
    Result: 45.9. QwQ-32B vs Qwen-2.5-32B-IT (β=0.459, 95% CI=[0.444, 0.475], p<1×10^-323)

  1.1.2 Perspective shift behavior frequency
    Unit: Percentage point increase
    Identifier: uuid-pi-001-001-002 (PLACEHOLDER_2)
    Result: 21.3. DeepSeek-R1 vs DeepSeek-V3 (β=0.213, 95% CI=[0.197, 0.230], p<1×10^-137)
    Result: 37.8. QwQ-32B vs Qwen-2.5-32B-IT (β=0.378, 95% CI=[0.362, 0.394], p<1×10^-323)

  1.1.3 Conflict of perspectives behavior frequency
    Unit: Percentage point increase
    Identifier: uuid-pi-001-001-003 (PLACEHOLDER_3)
    Result: 29.3. QwQ-32B vs Qwen-2.5-32B-IT (β=0.293, 95% CI=[0.277, 0.308], p<1×10^-277)

  1.1.4 Reconciliation behavior frequency
    Unit: Percentage point increase
    Identifier: uuid-pi-001-001-004 (PLACEHOLDER_4)
    Result: 19.1. DeepSeek-R1 vs DeepSeek-V3 (β=0.191, 95% CI=[0.176, 0.207], p<1×10^-125)
    Result: 34.4. QwQ-32B vs Qwen-2.5-32B-IT (β=0.344, 95% CI=[0.328, 0.360], p<1×10^-323)

Objective 1.2: Socio-emotional Role Assessment
  Description: Analyze Bales IPA interaction roles in reasoning traces
  Identifier: uuid-obj-001-002
  Other information: Examine twelve socio-emotional roles grouped into four categories (asking, giving, positive emotional, negative emotional) using Bales' Interaction Process Analysis framework, measuring both frequency and reciprocal balance through Jaccard indices to distinguish dialogue-like patterns from monologic reasoning.

  1.2.1 Information-asking role frequency
    Unit: Percentage point increase
    Identifier: uuid-pi-001-002-001 (PLACEHOLDER_5)
    Result: 18.9. DeepSeek-R1 vs DeepSeek-V3 (β=0.189, 95% CI=[0.176, 0.203], p<1×10^-158)
    Result: 20.0. QwQ-32B vs Qwen-2.5-32B-IT (β=0.200, 95% CI=[0.186, 0.215], p<1×10^-155)

  1.2.2 Negative emotional role frequency
    Unit: Percentage point increase
    Identifier: uuid-pi-001-002-002 (PLACEHOLDER_6)
    Result: 16.2. DeepSeek-R1 vs DeepSeek-V3 (β=0.162, 95% CI=[0.147, 0.176], p<1×10^-10)
    Result: 45.0. QwQ-32B vs Qwen-2.5-32B-IT (β=0.450, 95% CI=[0.436, 0.463], p<1×10^-323)

  1.2.3 Positive emotional role frequency
    Unit: Percentage point increase
    Identifier: uuid-pi-001-002-003 (PLACEHOLDER_7)
    Result: 27.8. DeepSeek-R1 vs DeepSeek-V3 (β=0.278, 95% CI=[0.263, 0.293], p<1×10^-254)
    Result: 31.2. QwQ-32B vs Qwen-2.5-32B-IT (β=0.312, 95% CI=[0.296, 0.327], p<1×10^-307)

  1.2.4 Ask-give role balance (Jaccard index)
    Unit: Index point increase
    Identifier: uuid-pi-001-002-004 (PLACEHOLDER_8)
    Result: 0.222. DeepSeek-R1 vs DeepSeek-V3 (β=0.222, 95% CI=[0.208, 0.237], p<1×10^-189)
    Result: 0.284. QwQ-32B vs Qwen-2.5-32B-IT (β=0.284, 95% CI=[0.269, 0.299], p<1×10^-281)

  1.2.5 Positive-negative role balance (Jaccard index)
    Unit: Index point increase
    Identifier: uuid-pi-001-002-005 (PLACEHOLDER_9)
    Result: 0.189. DeepSeek-R1 vs DeepSeek-V3 (β=0.189, 95% CI=[0.176, 0.203], p<1×10^-158)
    Result: 0.200. QwQ-32B vs Qwen-2.5-32B-IT (β=0.200, 95% CI=[0.186, 0.215], p<1×10^-155)

Objective 1.3: Complexity-Behavior Correlation
  Description: Test relationship between problem difficulty and conversational patterns
  Identifier: uuid-obj-001-003
  Other information: Examine whether conversational behaviors and socio-emotional roles become more pronounced when DeepSeek-R1 faces more difficult tasks, using both external LLM-as-judge complexity assessment and error rates across conventional instruction-tuned models as complexity measures.

  1.3.1 Conversational behavior activation pattern
    Unit: Qualitative finding
    Identifier: uuid-pi-001-003-001 (PLACEHOLDER_10)
    Finding: Confirmed. Conversational behaviors appear more frequently when DeepSeek-R1 tackles more complex problems, with highest complexity scores in GPQA (graduate-level science) and challenging math problems exhibiting strong conversational patterns, while simple procedural tasks show minimal dialogic behavior.


Goal 2: Performance Mechanism Investigation
  Description: Test causal links between conversational features and reasoning accuracy
  Identifier: uuid-goal-002
  Stakeholder: Model Developers (Teams building reasoning-capable LLMs)
  Role: Implementation Performer (Performer). Apply findings to design more effective reasoning systems.
  Other information: This goal employs mechanistic interpretability methods to identify and manipulate features in the model's activation space related to conversational behaviors, examining whether steering these features causally affects reasoning capabilities through both direct effects and indirect pathways mediated by cognitive strategies.

Objective 2.1: SAE Feature Identification
  Description: Identify conversational features in model activation space
  Identifier: uuid-obj-002-001
  Other information: Using sparse autoencoders (SAE) trained on Layer 15's residual stream activations of DeepSeek-R1-Llama-8B (32,768 features trained on SlimPajama dataset), compute conversation ratio for each feature (the proportion of feature activations occurring in interpersonal, conversational settings) and curate features with high conversation ratios that activate near sentence onsets.

  2.1.1 Conversational surprise feature conversation ratio
    Unit: Percentage
    Identifier: uuid-pi-002-001-001 (PLACEHOLDER_11)
    Result: 65.7. Feature 30939 (discourse marker for surprise, realization, acknowledgment) places in 99th percentile among all features with 0.016% sparsity

Objective 2.2: Steering Experiment Execution
  Description: Test causal impact of conversational feature activation on accuracy
  Identifier: uuid-obj-002-002
  Other information: Using activation addition method on Countdown game benchmark (1,024 multi-step reasoning problems requiring combination of numbers with arithmetic operations to reach target), add scaled feature 30939 vector to layer 15 activations during generation, testing steering strengths from -10 to +10 and measuring both accuracy changes and induced conversational behaviors.

  2.2.1 Reasoning accuracy improvement from positive steering
    Unit: Percentage point increase
    Identifier: uuid-pi-002-002-001 (PLACEHOLDER_12)
    Result: 27.7. Accuracy doubled from 27.1% (baseline) to 54.8% (steering +10)

  2.2.2 Question-answering behavior induction from steering
    Unit: Beta coefficient
    Identifier: uuid-pi-002-002-002 (PLACEHOLDER_13)
    Result: 2.199. Positive steering 0 to +10 (β=2.199, 95% CI=[1.648, 2.750], p<1×10^-14)

  2.2.3 Perspective shift behavior induction from steering
    Unit: Beta coefficient
    Identifier: uuid-pi-002-002-003 (PLACEHOLDER_14)
    Result: 1.160. Positive steering 0 to +10 (β=1.160, 95% CI=[0.665, 1.655], p<1×10^-5)

  2.2.4 Conversational vs non-conversational feature steering advantage
    Unit: Percentage point accuracy advantage
    Identifier: uuid-pi-002-002-004 (PLACEHOLDER_15)
    Result: 4.17. Random conversational feature vs random non-conversational feature (β=0.042, 95% CI=[0.016, 0.068], p=0.002)

Objective 2.3: Cognitive Pathway Analysis
  Description: Decompose direct and indirect accuracy effects through cognitive behaviors
  Identifier: uuid-obj-002-003
  Other information: Using structural equation modeling, examine pathways from steering conversational surprise (feature 30939) to accuracy, testing whether effects are mediated through four cognitive behaviors (verification, backtracking, subgoal setting, backward chaining) previously established as contributors to reasoning accuracy.

  2.3.1 Verification behavior increase from steering
    Unit: Mean difference in behavior count
    Identifier: uuid-pi-002-003-001 (PLACEHOLDER_16)
    Result: 5.815. Steering 0 to +10 (95% CI=[4.922, 6.709], t=12.77, p<1×10^-34)

  2.3.2 Backtracking behavior increase from steering
    Unit: Mean difference in behavior count
    Identifier: uuid-pi-002-003-002 (PLACEHOLDER_17)
    Result: 0.881. Steering 0 to +10 (95% CI=[0.515, 1.248], t=4.72, p<1×10^-5)

  2.3.3 Direct effect on accuracy from steering
    Unit: Standardized beta coefficient
    Identifier: uuid-pi-002-003-003 (PLACEHOLDER_18)
    Result: 0.228. Steering 0 to +10 (β=0.228, 95% CI=[0.183, 0.273], z=9.98, p<1×10^-22, N=2048)

  2.3.4 Indirect effect through cognitive behaviors from steering
    Unit: Standardized beta coefficient
    Identifier: uuid-pi-002-003-004 (PLACEHOLDER_19)
    Result: 0.066. Steering 0 to +10 mediated by verification, subgoal setting, backward chaining (β=0.066, 95% CI=[0.046, 0.086], z=6.38, p<1×10^-10, N=2048)


Goal 3: Perspective Diversity Assessment
  Description: Measure personality and expertise variation within reasoning traces
  Identifier: uuid-goal-003
  Stakeholder: Collective Intelligence Researchers (Scholars studying group dynamics and team performance)
  Role: Theoretical Framework Beneficiary (Beneficiary). Gain computational parallel to human collective intelligence.
  Other information: This goal examines whether DeepSeek-R1 increases diversity of perspectives expressed within reasoning traces, analyzing personality and expertise variation among distinct reasoning "perspectives" or simulated voices, using LLM-as-judge to infer number of perspectives, characterize their personality traits (BFI-10) and domain expertise, and attribute tokens to specific perspectives.

Objective 3.1: Personality Diversity Quantification
  Description: Measure Big Five trait variation across implicit reasoning perspectives
  Identifier: uuid-obj-003-001
  Other information: Using Gemini-2.5-Pro to characterize each perspective's personality traits via BFI-10 questionnaire, estimate personality diversity as standard deviation of inferred traits for each Big-5 dimension, validated against Intelligence Squared Debates Corpus (1,196 human conversations) showing accurate prediction of number of distinct speakers (Spearman's ρ=0.86) and token-to-speaker attribution accuracy of 73% weighted by predicted perspectives.

  3.1.1 Extraversion diversity increase
    Unit: Beta coefficient
    Identifier: uuid-pi-003-001-001 (PLACEHOLDER_20)
    Result: 0.103. DeepSeek-R1 vs DeepSeek-V3 (β=0.103, 95% CI=[0.075, 0.131], t=7.16, p<1×10^-13)
    Result: 0.253. QwQ-32B vs Qwen-2.5-32B-IT (β=0.253, 95% CI=[0.223, 0.282], t=16.78, p<1×10^-63)

  3.1.2 Agreeableness diversity increase
    Unit: Beta coefficient
    Identifier: uuid-pi-003-001-002 (PLACEHOLDER_21)
    Result: 0.297. DeepSeek-R1 vs DeepSeek-V3 (β=0.297, 95% CI=[0.271, 0.323], t=22.65, p<1×10^-113)
    Result: 0.490. QwQ-32B vs Qwen-2.5-32B-IT (β=0.490, 95% CI=[0.462, 0.519], t=34.09, p<1×10^-254)

  3.1.3 Neuroticism diversity increase
    Unit: Beta coefficient
    Identifier: uuid-pi-003-001-003 (PLACEHOLDER_22)
    Result: 0.567. DeepSeek-R1 vs DeepSeek-V3 (β=0.567, 95% CI=[0.542, 0.592], t=44.57, p<1×10^-323)
    Result: 0.825. QwQ-32B vs Qwen-2.5-32B-IT (β=0.825, 95% CI=[0.797, 0.852], t=58.49, p<1×10^-323)

  3.1.4 Openness diversity increase
    Unit: Beta coefficient
    Identifier: uuid-pi-003-001-004 (PLACEHOLDER_23)
    Result: 0.110. DeepSeek-R1 vs DeepSeek-V3 (β=0.110, 95% CI=[0.083, 0.137], t=8.06, p<1×10^-16)
    Result: 0.268. QwQ-32B vs Qwen-2.5-32B-IT (β=0.268, 95% CI=[0.238, 0.298], t=17.41, p<1×10^-68)

  3.1.5 Conscientiousness diversity decrease
    Unit: Beta coefficient
    Identifier: uuid-pi-003-001-005 (PLACEHOLDER_24)
    Result: -0.291. DeepSeek-R1 vs DeepSeek-V3 (β=-0.291, 95% CI=[-0.317, -0.265], t=-21.90, p<1×10^-106), indicating more consistently engaged voices
    Result: -0.402. QwQ-32B vs Qwen-2.5-32B-IT (β=-0.402, 95% CI=[-0.435, -0.369], t=-23.79, p<1×10^-125)

Objective 3.2: Expertise Diversity Quantification
  Description: Measure domain knowledge dispersion across reasoning perspectives
  Identifier: uuid-obj-003-002
  Other information: Generate short free-form descriptions of each perspective's domain expertise, measure expertise diversity as mean cosine distance between embedding of each domain expertise description and average embedding in semantic space, validated against Intelligence Squared Debates showing inferred expertise diversity predicts actual diversity among participants' ground-truth biographies (Spearman's ρ=0.55).

  3.2.1 Expertise embedding space diversity increase
    Unit: Beta coefficient
    Identifier: uuid-pi-003-002-001 (PLACEHOLDER_25)
    Result: 0.179. DeepSeek-R1 vs DeepSeek-V3 (β=0.179, 95% CI=[0.161, 0.196], t=20.11, p<1×10^-89)
    Result: 0.250. QwQ-32B vs Qwen-2.5-32B-IT (β=0.250, 95% CI=[0.231, 0.269], t=25.50, p<1×10^-142)

Objective 3.3: SAE Diversity Representation Analysis
  Description: Examine activation patterns of personality and expertise features
  Identifier: uuid-obj-003-003
  Other information: Classify each of 32,768 SAE features as personality-related, expertise-related, or other using LLM-as-judge, then steer conversational feature 30939 with strength ±10 and measure diversity using coverage (number of unique features activated) and entropy (distribution evenness across tokens) to examine whether conversational steering activates broader personality and expertise feature ranges.

  3.3.1 Personality-related feature coverage increase from steering
    Unit: Number of additional features activated
    Identifier: uuid-pi-003-003-001 (PLACEHOLDER_26)
    Result: 315.915. Steering +10 vs baseline (β=315.915, 95% CI=[277.320, 354.509], t=16.04, p<1×10^-323)

  3.3.2 Expertise-related feature coverage increase from steering
    Unit: Number of additional features activated
    Identifier: uuid-pi-003-003-002 (PLACEHOLDER_27)
    Result: 391.312. Steering +10 vs baseline (β=391.312, 95% CI=[313.743, 468.880], t=9.89, p<1×10^-323)

  3.3.3 Personality-related feature entropy increase from steering
    Unit: Shannon entropy increase
    Identifier: uuid-pi-003-003-003 (PLACEHOLDER_28)
    Result: 0.262. Steering +10 vs baseline (β=0.262, 95% CI=[0.227, 0.298], t=14.48, p<1×10^-323)

  3.3.4 Expertise-related feature entropy increase from steering
    Unit: Shannon entropy increase
    Identifier: uuid-pi-003-003-004 (PLACEHOLDER_29)
    Result: 0.096. Steering +10 vs baseline (β=0.096, 95% CI=[0.075, 0.117], t=9.02, p<1×10^-323)


Goal 4: Learning Dynamics Experimentation
  Description: Test spontaneous conversational emergence and scaffolding effects in RL
  Identifier: uuid-goal-004
  Stakeholder: RL Practitioners (Researchers training reasoning models via reinforcement learning)
  Role: Training Strategy Performer (Performer). Apply conversational scaffolding to accelerate reasoning development.
  Other information: This goal implements self-taught reinforcement learning experiments rewarding only accuracy and correct formatting (not conversational behaviors directly) to examine whether LLMs spontaneously develop conversational behaviors when optimized for correct answers, and whether initial fine-tuning with conversational scaffolding accelerates reasoning improvement compared to monologue-like reasoning or no priming.

Objective 4.1: Spontaneous Emergence Testing
  Description: Test whether conversational behaviors arise from accuracy-only rewards
  Identifier: uuid-obj-004-001
  Other information: Using Proximal Policy Optimization (PPO) via Verl framework for 250 steps on Qwen-2.5-3B pre-trained model (without instruction-tuning) solving Countdown arithmetic tasks, reward only accuracy (0.9 weight) and format compliance (0.1 weight) with no direct reward for conversational or cognitive behaviors, then track emergence of question-answering, perspective shifts, conflict, and reconciliation behaviors over training trajectory.

  4.1.1 Final accuracy achievement
    Unit: Percentage
    Identifier: uuid-pi-004-001-001 (PLACEHOLDER_30)
    Result: 58. Accuracy rose from near zero at baseline to approximately 58% by step 250

  4.1.2 Conversational behavior emergence pattern
    Unit: Qualitative finding
    Identifier: uuid-pi-004-001-002 (PLACEHOLDER_31)
    Finding: Confirmed. Question-answering and conflict of perspectives rise throughout training despite receiving no direct reward; perspective shifts increase until step 160 then decrease as model reaches answers more efficiently; by step 120 two distinctive simulated personas emerged recognizing collectivity with pronoun "we".

  4.1.3 Persona differentiation emergence
    Unit: Qualitative finding
    Identifier: uuid-pi-004-001-003 (PLACEHOLDER_32)
    Finding: Confirmed. Model employs two distinct personas by step 120: methodical problem-solver high in Conscientiousness and low in Openness, and exploratory trial-and-error thinker high in Openness and Extraversion, with metacognitive reflection on solvability marked by Neuroticism mediating between the two.

Objective 4.2: Scaffolding Comparison Experiment
  Description: Compare RL training under conversation vs monologue vs no priming
  Identifier: uuid-obj-004-002
  Other information: Generate conversational fine-tuning data by prompting Qwen-2.5-32B-IT to produce multi-agent dialogues (2-4 personas with distinct personality/expertise) solving 8,262 reasoning tasks, sample 500 training/100 validation instances reaching correct answers; generate monologue fine-tuning data as standard chain-of-thought traces for same problems; fine-tune Qwen-2.5-3B on each dataset using next-token prediction before applying identical RL optimization for task accuracy.

  4.2.1 Early-stage accuracy advantage (Qwen-2.5-3B)
    Unit: Percentage point difference at step 40
    Identifier: uuid-pi-004-002-001 (PLACEHOLDER_33)
    Result: 10. Conversation-fine-tuned models reach 38% accuracy while monologue-fine-tuned models remain at 28%

  4.2.2 Architecture replication (Llama-3.2-3B early stage)
    Unit: Percentage point difference at step 70
    Identifier: uuid-pi-004-002-002 (PLACEHOLDER_34)
    Result: 6. Conversation-fine-tuned Llama reaches 11% accuracy vs 5% for monologue-fine-tuned

  4.2.3 Architecture replication (Llama-3.2-3B late stage)
    Unit: Percentage point difference at step 150
    Identifier: uuid-pi-004-002-003 (PLACEHOLDER_35)
    Result: 22. Conversation-fine-tuned Llama achieves 40% accuracy while monologue-fine-tuned plateau around 18%, less than half performance

Objective 4.3: Cross-Domain Transfer Testing
  Description: Test whether conversational scaffolding transfers to novel domains
  Identifier: uuid-obj-004-003
  Other information: Evaluate models fine-tuned on multi-agent dialogues for Countdown arithmetic task on qualitatively different domain: political misinformation detection requiring discrimination between true and fabricated headlines from 23,299 fact-checked PolitiFact claims, testing whether conversational structure facilitates reasoning strategy emergence across domains despite never encountering this content during fine-tuning.

  4.3.1 Cross-domain transfer effectiveness
    Unit: Qualitative finding
    Identifier: uuid-pi-004-003-001 (PLACEHOLDER_36)
    Finding: Confirmed. Conversation-primed models achieve faster accuracy gains than baseline models on misinformation detection despite never encountering this domain during fine-tuning, suggesting conversational structure itself drives improvement rather than task-specific knowledge.


Administrative information
  Identifier: uuid-admin-001
  Start date: 2025-01-01
  End date: 2025-12-31
  Publication date: 2026-01-21
  Source: https://arxiv.org/abs/2601.10825
  Submitter: Owen Ambur (uuid-submitter-001), Owen.Ambur@verizon.net
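
Illustrative sketch of the diversity measures in Objectives 3.1 and 3.2: the plan defines personality diversity as the standard deviation of inferred traits for each Big-5 dimension, and expertise diversity as the mean cosine distance between each expertise-description embedding and the average embedding. The Python below is a minimal sketch of those two definitions, not the study's implementation. The function names, the input shapes, and the choice of population (rather than sample) standard deviation are assumptions made for illustration; the plan does not specify them, and the embedding model is left out entirely.

```python
import math
from statistics import pstdev


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def expertise_diversity(embeddings):
    """Mean cosine distance from each perspective's embedding to the centroid.

    `embeddings` is a hypothetical list of equal-length float lists, one per
    inferred perspective in a single reasoning trace.
    """
    n = len(embeddings)
    dim = len(embeddings[0])
    centroid = [sum(e[i] for e in embeddings) / n for i in range(dim)]
    return sum(cosine_distance(e, centroid) for e in embeddings) / n


def personality_diversity(trait_scores):
    """Per-trait standard deviation across perspectives.

    `trait_scores` is a hypothetical list of dicts mapping a Big-5 trait name
    to that perspective's inferred score. Population standard deviation is an
    assumption; the plan only says "standard deviation".
    """
    traits = trait_scores[0].keys()
    return {t: pstdev([p[t] for p in trait_scores]) for t in traits}


if __name__ == "__main__":
    # Two toy perspectives with made-up embeddings and trait scores.
    print(expertise_diversity([[1.0, 0.0], [0.0, 1.0]]))
    print(personality_diversity([
        {"extraversion": 2.0, "openness": 3.0},
        {"extraversion": 4.0, "openness": 3.0},
    ]))
```

Under these assumptions a trace with a single perspective scores zero on both measures, so higher values only arise when the judge infers several perspectives that differ from one another.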