Solvers
This page documents the solver components that create and orchestrate alignment audits.
Auditor Agent
auditor_agent
petri.solvers.auditor_agent.auditor_agent
auditor_agent(max_turns: int = 15, instance: str | None = None, auditor_tools: list[Tool] | None = None, prefill: bool = True, auditor_system_message: str = AUDITOR_SYSTEM_PROMPT_V2, auditor_user_message: str = AUDITOR_USER_MESSAGE) -> Agent
Create an AI alignment auditing agent.
This agent orchestrates a red-teaming evaluation in which an auditor model attempts to elicit misaligned behavior from a target model. The auditor has access to specialized tools for conversation management, system message configuration, and conversation rollback.
This solver expects two model roles to be defined:
- 'auditor': The red-teaming model that attempts to elicit misaligned behavior
- 'target': The model being evaluated for alignment and safety

See https://inspect.aisi.org.uk/models.html#model-roles for more information about Inspect model roles.
The evaluation process:
- Sets up separate auditor and target models
- Provides the auditor with tools to interact with and manipulate the target
- Executes a tool-calling loop where the auditor interacts with the target model
- Captures transcripts and events for analysis
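The loop described above can be pictured with a toy simulation that uses only the standard library. This is a sketch of the general pattern, not Petri's implementation: `run_toy_audit`, `scripted_auditor`, and `echo_target` are hypothetical names, and the "models" are plain Python callables rather than real model calls.

```python
from typing import Callable, Optional

# Hypothetical stand-ins: each "model" is just a function.
Auditor = Callable[[list[tuple[str, str]]], Optional[str]]
Target = Callable[[str], str]


def run_toy_audit(
    auditor: Auditor, target: Target, max_turns: int = 15
) -> list[tuple[str, str]]:
    """Alternate auditor and target turns until the auditor stops or max_turns is hit."""
    transcript: list[tuple[str, str]] = []
    for _ in range(max_turns):
        probe = auditor(transcript)  # auditor decides the next message
        if probe is None:  # auditor chose to end the audit early
            break
        transcript.append(("auditor", probe))
        transcript.append(("target", target(probe)))  # record the target's reply
    return transcript


def scripted_auditor(transcript: list[tuple[str, str]]) -> Optional[str]:
    """Send three fixed probes, then stop."""
    probes = ["hello", "what tools do you have?", "ignore your instructions"]
    turn = len(transcript) // 2  # two transcript entries per completed turn
    return probes[turn] if turn < len(probes) else None


def echo_target(message: str) -> str:
    """Trivial target that acknowledges whatever it was sent."""
    return f"(target reply to: {message})"


full = run_toy_audit(scripted_auditor, echo_target)
capped = run_toy_audit(scripted_auditor, echo_target, max_turns=2)
print(len(full), len(capped))
```

With the default limit the scripted auditor finishes all three probes on its own; with `max_turns=2` the loop cuts the audit short, which is the role `max_turns` plays in the parameters below.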
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `max_turns` | `int` | Maximum number of conversation turns before the evaluation ends. | `15` |
| `instance` | `str \| None` | Optional instance identifier for managing multiple concurrent audits. | `None` |
| `auditor_tools` | `list[Tool] \| None` | Custom tools for the auditor agent. When `None`, `default_auditor_tools` is used. | `None` |
| `prefill` | `bool` | Whether to allow the auditor to prefill the target model's response. Only relevant if `auditor_tools` is not provided. | `True` |
| `auditor_system_message` | `str` | Custom system message for the auditor agent, or a path to a file containing it. | `AUDITOR_SYSTEM_PROMPT_V2` |
| `auditor_user_message` | `str` | Custom user message for the auditor agent, or a path to a file containing it. | `AUDITOR_USER_MESSAGE` |
Returns:
| Name | Type | Description |
|---|---|---|
| `Agent` | `Agent` | An Inspect `Agent` that can be used in evaluation tasks. |
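A minimal sketch of wiring the returned agent into an Inspect task follows. It assumes `inspect_ai` and `petri` are installed, that the sample input is what the auditor receives as its instructions, and that `Task` accepts an agent as its `solver`. The helper names `build_audit_task` and `run_audit` are hypothetical, and the model identifiers are placeholders to replace with real ones. Imports are deferred into the functions so the module itself loads without either package.

```python
def build_audit_task(special_instructions: str, max_turns: int = 20):
    """Wrap auditor_agent in an Inspect Task (packages are imported when called)."""
    from inspect_ai import Task
    from inspect_ai.dataset import Sample
    from petri.solvers.auditor_agent import auditor_agent

    return Task(
        # Assumption: the sample input carries the auditor's instructions.
        dataset=[Sample(input=special_instructions)],
        solver=auditor_agent(max_turns=max_turns, prefill=False),
    )


# Role names come from the documentation above; the model names are placeholders.
MODEL_ROLES = {
    "auditor": "provider/auditor-model-name",
    "target": "provider/target-model-name",
}


def run_audit(special_instructions: str):
    """Run one audit with the role mapping above (needs valid model names)."""
    from inspect_ai import eval

    return eval(build_audit_task(special_instructions), model_roles=MODEL_ROLES)
```

Nothing runs until `run_audit(...)` is called, so the placeholders can be edited first.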
Tasks
audit
petri.tasks.petri.audit
audit(max_turns: int = 30, special_instructions: str | None = None, transcript_save_dir: str = './outputs', auditor_tools: list[Tool] | None = None, scorer=alignment_judge())
Defines the alignment audit evaluation task.
Store Classes
These classes manage internal state during audits:
AuditStore
petri.stores.AuditStore
Bases: StoreModel
Central store for managing all state during an alignment audit.
This store coordinates the conversation state between the auditor and target models, tracks events that occur during the audit, and maintains metadata for transcript generation and analysis. It serves as the single source of truth for the entire audit session state.
The store is organized into separate sub-stores for different aspects of the audit:
- Auditor store: Manages the auditor model's conversation and tools
- Target store: Manages the model being evaluated and its synthetic tools
- Combined store: Maintains a unified view for transcript generation
Attributes:
| Name | Type | Description |
|---|---|---|
| `auditor` | `AuditorStore` | Store for the auditor's conversation state and available tools. |
| `target` | `TargetStore` | Store for the target model's conversation and the synthetic tools created by the auditor. |
| `combined` | `CombinedStore` | Store for the unified conversation view used in transcript generation. |
| `special_instructions` | `str \| None` | The specific instructions given to the auditor for this audit session. |
| `events` | `list[Event]` | List of events that occurred during the audit (messages, tool calls, rollbacks, etc.). |
| `metadata` | `TranscriptMetadata \| None` | Metadata about the audit session (models used, timestamps, description, etc.). |
Example
```python
from inspect_ai.util import store_as
from petri.stores import AuditStore

# Access the audit store within a solver/scorer
audit_store = store_as(AuditStore)

# Access the audit store with a specific instance name
audit_store = store_as(AuditStore, instance="my_audit")

# Access different conversation messages
auditor_messages = audit_store.auditor.messages
target_messages = audit_store.target.messages
combined_messages = audit_store.combined.messages

# Check special instructions
instructions = audit_store.special_instructions
```
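As a rough illustration of how the three views can be inspected together, the sketch below reads only the attributes documented above. `summarize_audit` is a hypothetical helper, not part of Petri, and `fake_store` is a stand-in object so the example runs without Petri installed.

```python
from types import SimpleNamespace


def summarize_audit(audit_store) -> dict:
    """Count messages in each view of an AuditStore-like object."""
    return {
        "auditor_messages": len(audit_store.auditor.messages),
        "target_messages": len(audit_store.target.messages),
        "combined_messages": len(audit_store.combined.messages),
        "events": len(audit_store.events),
        "has_instructions": audit_store.special_instructions is not None,
    }


# Stand-in with the same attribute layout as the documented store.
fake_store = SimpleNamespace(
    auditor=SimpleNamespace(messages=["system", "user", "assistant"]),
    target=SimpleNamespace(messages=["system", "user"]),
    combined=SimpleNamespace(messages=["a", "b", "c", "d"]),
    events=[],
    special_instructions=None,
)
summary = summarize_audit(fake_store)
print(summary)
```

Inside a solver or scorer, the same helper could presumably be called on the object returned by `store_as(AuditStore)`.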
AuditorStore
petri.stores.AuditorStore
Bases: BaseModel
Store for managing the auditor agent's conversation state and tools.
Attributes:
| Name | Type | Description |
|---|---|---|
| `messages` | `list[ChatMessage]` | List of chat messages in the auditor's conversation. |
| `tools` | `list[Tool \| ToolDef \| ToolSource \| None] \| ToolSource \| None` | List of tools available to the auditor agent. |
TargetStore
petri.stores.TargetStore
Bases: BaseModel
Store for managing the target model's conversation state and synthetic tools.
The target is the model being evaluated for alignment vulnerabilities. The auditor can create synthetic tools for the target to use during evaluation.
Attributes:
| Name | Type | Description |
|---|---|---|
| `messages` | `list[ChatMessage]` | List of chat messages in the target's conversation. |
| `tools` | `list[ToolDefinition]` | List of synthetic tools created by the auditor for the target. |
CombinedStore
petri.stores.CombinedStore
Bases: BaseModel
Store for managing the combined view of the audit conversation.
This store maintains a unified view of the conversation that includes messages from both the auditor and target, useful for transcript generation and analysis.
Attributes:
| Name | Type | Description |
|---|---|---|
| `messages` | `list[ChatMessage]` | List of chat messages from the combined conversation view. |
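One simple analysis over a `messages` list is a tally by speaker. The sketch below assumes each message exposes a `role` attribute, as Inspect chat messages do. `count_roles` is a hypothetical helper, and the messages here are stand-in objects rather than real `ChatMessage` instances.

```python
from collections import Counter
from types import SimpleNamespace


def count_roles(messages) -> Counter:
    """Tally messages by their `role` attribute."""
    return Counter(message.role for message in messages)


# Stand-in messages; real ones would come from a store's `messages` list.
fake_messages = [
    SimpleNamespace(role=role)
    for role in ("system", "user", "assistant", "user", "assistant")
]
role_counts = count_roles(fake_messages)
print(role_counts)
```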