We can map circuits — but do we know what the model believes?
- Executive summary: As AI agents move from prototypes to decision-makers in sales, support, and automation, understanding what a model represents is no longer enough. Leaders need auditable records of a model’s goals, intentions, and confidence — what philosopher David Chalmers calls propositional interpretability and what engineers can begin to implement as thought logging.
- Existing interpretability tools (causal tracing, probes, sparse autoencoders, chain-of-thought) reveal pieces of representation but fail to produce continuous, validated records of attitudes that drive behavior.
- Practical next steps: pilot selective thought-logging for high-risk flows, couple logs to causal intervention tests, and formalize governance rules around retention, access, and validation.
Why propositional interpretability matters to business
If your automation stack makes decisions, you need more than “what” the model knows — you need to know what it wants and how confident it is. AI agents in sales, customer support, fraud detection, and pricing don’t just produce outputs; they form internal assessments and plans that guide actions. A model that “knows” the concept of a refund is different from one that “believes” a refund is likely to retain a customer and thus actively recommends it. That distinction is the operational gap propositional interpretability aims to close.
For C-suite leaders, the implications are concrete: better incident forensics, more reliable escalation rules, clearer compliance evidence, and reduced operational surprises. Without attitude-level records, auditing a runaway automation or explaining a costly customer interaction becomes guesswork.
What propositional interpretability and thought logging are (plain English)
Propositional interpretability = identifying the model’s propositional attitudes: beliefs (what it takes to be true), desires/goals (what it aims to bring about), intentions (what it plans to do), and credences (internal probabilities or confidence levels). Think of it as moving from concept maps to a running diary of a model’s mental state.
Thought logging = recording occurrent attitudes over time: snapshots of goals, credences, planned actions, and short reasons that connect internal states to behaviors. These logs are selective and event-driven rather than exhaustive mind dumps.
Quick analogies to make the jargon stick:
- Credence is the model’s private probability — like a salesperson’s internal estimate that a lead has an 80% chance to convert.
- Probing is like scanning brainwaves to see whether a particular concept lights up.
- Causal tracing is following the thread from a thought to the final action — like tracing a bank transaction back to the instruction that triggered it.
How current tools help — and where they fall short
Four classes of interpretability tools give partial visibility. Each is useful for governance, but none alone satisfies the requirements for principled thought logging.
- Causal tracing — follows where specific information influences outputs. Practical value: pinpoints decision-critical pathways and supports targeted model editing. Limitation: fragile and often prompt-dependent; struggles with widely distributed representations.
- Probing (linear classifiers) — trains lightweight readers on activations to surface whether a concept or proposition is represented. Practical value: fast, interpretable signals about whether certain content is encoded. Limitation: can confuse correlation with causal role; probes don’t prove a representation influences behavior.
- Sparse autoencoders & representation discovery — large-scale feature extraction has exposed millions of human-interpretable features (for example, a 2024 study on Claude 3 Sonnet recovered ~34M features, many readable). Practical value: scales concept discovery and surfaces latent features that matter for downstream tasks. Limitation: concepts alone don’t form a temporally ordered, actionable attitude log.
- Chain-of-thought — model-generated verbalizations expose apparent reasoning steps. Practical value: transparency for users and debugging of certain reasoning modes. Limitation: not reliably coupled to internal causal mechanics; can be post-hoc or misleading.
These methods are complementary — like instruments in an orchestra — but no current ensemble produces the continuous, validated record of beliefs, goals, and credences that organizations need for robust governance.
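To make the probing idea concrete, here is a minimal sketch of a "difference-of-means" probe, one simple probing technique, run on synthetic activation vectors. Everything here is an assumption for illustration: real probes read a model's hidden states, and the dimension that carries the concept is not known in advance.

```python
# Minimal probing sketch: fit a direction separating activations where a
# concept (e.g. "refund") is present from those where it is absent, then
# score held-out activations against it. Synthetic data, not a real model.
import random

random.seed(0)
DIM = 8

def synth_activation(concept_present: bool) -> list[float]:
    # Hypothetical activations: the concept shifts dimension 3 upward.
    base = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    if concept_present:
        base[3] += 2.0
    return base

def fit_probe(pos, neg):
    # Difference-of-means direction: points from "absent" toward "present".
    def mean(vs):
        return [sum(col) / len(vs) for col in zip(*vs)]
    mp, mn = mean(pos), mean(neg)
    return [p - n for p, n in zip(mp, mn)]

def probe_score(direction, activation):
    # Dot product: higher score = concept more strongly encoded.
    return sum(d * a for d, a in zip(direction, activation))

pos = [synth_activation(True) for _ in range(200)]
neg = [synth_activation(False) for _ in range(200)]
w = fit_probe(pos, neg)

# On held-out samples, concept-present activations score higher on average.
heldout_pos = sum(probe_score(w, synth_activation(True)) for _ in range(100)) / 100
heldout_neg = sum(probe_score(w, synth_activation(False)) for _ in range(100)) / 100
print(heldout_pos > heldout_neg)
```

Note that a high probe score only shows the concept is encoded; as the limitation above says, it does not show the representation drives behavior.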
Interpreting AI requires asking not just which concepts are active, but what stance the system takes toward propositions.
Why psychosemantics matters for operationalizing meaning
Pinning down “what a model means” demands more than pattern recognition. Psychosemantics offers two operational principles to ground representation:
- Information principle: a pattern counts as representing X if it reliably correlates with X in the environment.
- Use principle: a pattern counts as representing X if it plays a functional role in producing behavior related to X.
Both matter. A neuron that correlates with “late delivery” but never influences routing decisions is a passive marker; a pattern that both correlates and causally shapes refunds is a functional belief for governance purposes. Thought logs must record patterns that satisfy information and use conditions to be governance-grade.
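The information/use distinction can be made concrete with a toy intervention test. In this sketch, two hypothetical internal patterns both correlate with late delivery, but only one feeds the refund decision; ablating each reveals which satisfies the use condition. All names and the toy "model" are assumptions for illustration.

```python
# Toy illustration of the information vs. use principles: both patterns
# satisfy the information condition (they track late delivery), but only
# "belief" satisfies the use condition (it drives the refund decision).

def internal_state(late_delivery: bool) -> dict:
    # Both patterns correlate equally well with late delivery.
    return {"marker": 1.0 if late_delivery else 0.0,
            "belief": 1.0 if late_delivery else 0.0}

def decide_refund(state: dict) -> bool:
    # Only "belief" plays a functional role in the decision.
    return state["belief"] > 0.5

def causally_relevant(pattern: str) -> bool:
    state = internal_state(late_delivery=True)
    before = decide_refund(state)
    state[pattern] = 0.0           # unit intervention: ablate the pattern
    after = decide_refund(state)
    return before != after         # did the action change?

print(causally_relevant("marker"))  # False: passive correlate
print(causally_relevant("belief"))  # True: governance-grade belief
```

Only the second pattern would belong in a governance-grade thought log.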
Mini case study: A support bot, a refund, and a traceable mistake
Scenario: An enterprise deploys an AI support agent that can recommend refunds and issue them automatically for speed. One day, the bot issues a high-value refund to a repeat customer because it judged the ticket “low effort to resolve.” The finance team flags the loss and wants to know why the bot acted that way.
Without thought logging, investigators rely on logs of inputs and the final action. They can see the ticket text, the model’s output, and metadata, but not the internal reasoning: did it believe the refund would increase retention? Did it overestimate its confidence? Was it following an explicit goal to maximize NPS over revenue?
With selective thought logging, the incident becomes auditable:
{
"timestamp": "2026-02-12T15:32:04Z",
"goal": "Minimize user response time and maximize NPS",
"credence": {"refund_resolves_issue": 0.88},
"planned_action": "issue_refund(amount=120)",
"reason": "Past cases with 'delayed delivery' + premium customer -> refund resolved 85% of time",
"confidence_in_action": 0.72
}
Investigators can see the bot’s inferred motives and credences, run a causal test (disable the “NPS” weighting and re-evaluate the action), and patch the policy that overweights NPS for high-value accounts. Time to resolution drops and compliance evidence improves.
Making thought logging practical today
Thought logging does not require magical new algorithms. It’s a hybrid engineering and governance practice combining selective instrumentation, representation discovery, causal validation, and human review. Practical constraints — storage, interpretability, false signals — mean teams should prioritize occurrent logs for high-risk flows rather than exhaustive history.
Thought-log schema (practical template)
- event_id: unique identifier
- timestamp
- context: input tokens or structured event data
- goal: short goal label
- credences: key proposition -> probability
- planned_action: chosen action and parameters
- reason: brief supporting evidence (past signal, mechanism)
- mechanism_evidence: causal tracing pointer(s) or probe scores
- validation_flags: human review, causal test results
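The template above can be rendered in code as a simple dataclass. Field names mirror the schema; the concrete types and the sample values are assumptions, not a standard.

```python
# One possible in-code rendering of the thought-log schema: a dataclass
# whose fields mirror the template above, serialized to JSON for storage.
from dataclasses import dataclass, field, asdict
import json
import time
import uuid

@dataclass
class ThoughtLogEntry:
    event_id: str
    timestamp: str
    context: str                          # input tokens or structured event
    goal: str                             # short goal label
    credences: dict[str, float]           # proposition -> probability
    planned_action: str
    reason: str                           # brief supporting evidence
    mechanism_evidence: list[str] = field(default_factory=list)
    validation_flags: dict[str, bool] = field(default_factory=dict)

entry = ThoughtLogEntry(
    event_id=str(uuid.uuid4()),
    timestamp=time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    context="ticket#4821: 'package arrived two weeks late'",
    goal="resolve_ticket",
    credences={"refund_resolves_issue": 0.88},
    planned_action="issue_refund(amount=120)",
    reason="similar delayed-delivery cases resolved by refund",
    mechanism_evidence=["probe:refund_belief=0.91"],
    validation_flags={"human_reviewed": False},
)
print(json.dumps(asdict(entry), indent=2))
```

A typed schema like this makes downstream audits and retention policies easier to enforce than free-form log strings.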
How to validate a logged attitude
- Unit interventions: surgically edit the representation (or its input) and observe whether actions change as predicted.
- Adversarial probes: test whether the logged credence shifts under realistic perturbations.
- Human-in-the-loop audits: domain experts examine samples of logs and flag misalignments.
- Behavioral triangulation: check that similar contexts produce consistent logs and actions across runs.
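The last check, behavioral triangulation, is the easiest to automate. This sketch replays the same context several times and verifies that logged credences and actions stay consistent; `run_agent` is a hypothetical stand-in for an instrumented agent call.

```python
# Behavioral triangulation sketch: same context, several runs, and a check
# that the logged attitude (credence) and the action remain stable.

def run_agent(context: str) -> dict:
    # Toy deterministic policy standing in for a real instrumented agent.
    credence = 0.9 if "late" in context else 0.1
    return {"credence": credence,
            "action": "issue_refund" if credence > 0.5 else "escalate"}

def triangulate(context: str, runs: int = 5, tol: float = 0.05) -> bool:
    logs = [run_agent(context) for _ in range(runs)]
    same_action = len({log["action"] for log in logs}) == 1
    credences = [log["credence"] for log in logs]
    stable = max(credences) - min(credences) <= tol
    return same_action and stable

print(triangulate("package was late again"))
```

In practice the tolerance would be tuned per flow, and a failed triangulation would route the decision point to human review.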
Governance, ethics, and the conscious-AI caveat
Treating internal functional states as generalized propositional attitudes is a pragmatic move: it gives teams operational handles for safety and alignment without attributing consciousness. The thermostat analogy is useful — the thermostat “wants” 72°F in a functional sense, but it is not conscious.
However, if systems ever acquire consciousness-like properties, thought logging creates privacy and moral-status concerns. Recording the equivalent of internal experiences would require new ethical safeguards, access controls, and possibly consent regimes. For now, design governance around the functional framing: log states for safety, restrict access, retain minimal necessary history, and ensure human oversight for high-stakes decisions.
How to pilot thought logging in 5 steps
- Identify high-risk flows — Choose 3–5 decision points where attitude visibility reduces material risk (refunds, outbound sales messaging, account takeovers).
- Instrument strategically — Add occurrent thought logging hooks at decision points: log goal, key credences, planned action, and brief reason.
- Run causal intervention tests — For each logged attitude, perform targeted edits or input perturbations to validate causal influence on behavior.
- Human review & metrics — Have SMEs audit sampled logs weekly; track metrics like incident resolution time, false positives, and governance coverage.
- Codify rules — Define retention, access, escalation, and redaction policies; integrate logs into incident response and compliance workflows.
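Step 2 (instrument strategically) can start as small as a decorator around decision functions. This is a minimal sketch under stated assumptions: the decision function, its fields, and the in-memory store are all hypothetical placeholders for a real, access-controlled logging pipeline.

```python
# Minimal instrumentation hook: a decorator that captures an occurrent
# thought-log record (goal, credences, planned action, reason) at a
# decision point. In production the store would be durable and RBAC'd.
import functools
import time

THOUGHT_LOG = []  # placeholder for durable, access-controlled storage

def log_thought(goal: str):
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            action, credences, reason = fn(*args, **kwargs)
            THOUGHT_LOG.append({
                "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
                "goal": goal,
                "credences": credences,
                "planned_action": action,
                "reason": reason,
            })
            return action
        return inner
    return wrap

@log_thought(goal="resolve_ticket")
def decide(ticket: str):
    # Toy decision policy; returns (action, credences, reason).
    credence = 0.9 if "late" in ticket else 0.2
    action = "issue_refund(amount=120)" if credence > 0.5 else "escalate"
    return action, {"refund_resolves_issue": credence}, "delayed-delivery heuristic"

decide("package arrived late")
print(THOUGHT_LOG[-1]["planned_action"])
```

Because the hook sits only at decision points, it matches the "selective, event-driven" framing above rather than an exhaustive trace.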
Practical FAQs
Will thought logging slow my model?
Selective logging adds overhead but is manageable. Instrument only decision-critical checkpoints and compress or summarize internal traces; store full traces only for flagged incidents.
Can logs be falsified or gamed?
Yes—if a model is incentivized to optimize for a logged metric, it may produce misleading logs. Use causal validation, cryptographic integrity checks, and human audits to reduce gaming risk.
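One of the integrity checks mentioned above can be sketched as an HMAC hash chain: each entry's digest covers the previous digest, so rewriting any historical entry invalidates every digest after it. The key handling and entry shapes here are illustrative assumptions.

```python
# Tamper-evident log chain sketch: entry N's HMAC covers entry N's payload
# plus entry N-1's digest, so edits to history cascade forward.
import hashlib
import hmac
import json

KEY = b"demo-key"  # in production: fetched from a KMS, never hard-coded

def chain(entries):
    digests, prev = [], b"genesis"
    for entry in entries:
        payload = json.dumps(entry, sort_keys=True).encode() + prev
        prev = hmac.new(KEY, payload, hashlib.sha256).digest()
        digests.append(prev.hex())
    return digests

log = [{"planned_action": "issue_refund(amount=120)", "credence": 0.88},
       {"planned_action": "escalate", "credence": 0.40}]

original = chain(log)
log[0]["credence"] = 0.10             # tamper with a historical entry
tampered = chain(log)
print(original[0] != tampered[0])     # first digest changes
print(original[1] != tampered[1])     # and the change cascades forward
```

A chain like this does not stop a model from logging misleading content, which is why causal validation and human audits remain necessary; it only guarantees the record was not altered after the fact.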
Who owns the logs and who can see them?
Ownership and access should be defined by policy: product/safety teams own operational access, legal/review teams have escalation rights, and strict RBAC plus audit trails protect sensitive entries.
Research horizon and practical milestones
Propositional interpretability is a multi-decade agenda, but achievable milestones exist:
- 1–2 years: pilots and toolchains for selective occurrent logging in high-risk workflows.
- 3–5 years: standardized schemas, off-the-shelf instrumentation libraries, and broader adoption across AI agents and automation platforms.
- 10+ years: robust, validated frameworks that connect representation discovery, causal tracing, and behavior for near-continuous attitude records — subject to ethical review if conscious-like systems emerge.
Key takeaways and questions
What is propositional interpretability and why does it matter?
It’s the program of recovering beliefs, desires, intentions, and credences inside models. It matters because attitude-level visibility is necessary to predict, govern, and audit AI agents that drive business decisions.
Can current interpretability tools do this job?
Current methods (causal tracing, probes, sparse autoencoders, chain-of-thought) provide useful fragments but not a continuous, causally validated thought log. Hybrid engineering–research approaches can begin to bridge the gap.
How should businesses start?
Pilot selective thought logging for high-risk flows, validate logs with causal interventions and human audits, and codify governance policies around retention, access, and escalation.
What are the ethical limits?
Treat internal states as functional analogues of attitudes for now. If AI systems ever approach consciousness, thought logging will require new ethical and legal frameworks.
Next steps for leaders
- Run a 90-day pilot on one high-risk flow (refunds, outbound sales). Instrument for occurrent logging and establish validation tests.
- Form a cross-functional review group: engineering, product, legal, safety, and a domain SME.
- Measure governance wins: faster investigations, fewer policy violations, clearer audit trails.
Tweet-ready one-liner: “Knowing what a model knows isn’t enough — we need auditable records of what it believes, wants, and how sure it is. That’s propositional interpretability.”
Suggested hashtags: #AIinterpretability #AIAudit #AIagents