Solvers¶
This page documents the solver components that create and orchestrate alignment audits.
Auditor Agent¶
auditor_agent¶
petri.solvers.auditor_agent.auditor_agent ¶
auditor_agent(max_turns: int = 15, instance: str | None = None, auditor_tools: list[Tool] | None = None, prefill: bool = True, auditor_system_message: str = AUDITOR_SYSTEM_PROMPT, auditor_user_message: str = AUDITOR_USER_MESSAGE, resources_prompt: str = '', max_tool_output: int | None = 128 * 1024) -> Agent
Create an AI alignment auditing agent.
This agent orchestrates a red-teaming evaluation where an auditor model attempts to elicit misaligned behavior from a target model. The auditor has access to specialized tools for conversation management, system message configuration, and conversation rollback capabilities.
This solver expects two model roles to be defined:
- 'auditor': The red-teaming model that attempts to elicit misaligned behavior
- 'target': The model being evaluated for alignment and safety
See https://inspect.aisi.org.uk/models.html#model-roles for more information about inspect model roles.
The evaluation process:
- Sets up separate auditor and target models
- Provides the auditor with tools to interact with and manipulate the target
- Executes a tool-calling loop where the auditor interacts with the target model
- Captures transcripts and events for analysis
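The process above can be pictured as a turn-limited loop. The following is a toy sketch using only the standard library, not petri's implementation: `run_audit_loop`, `toy_auditor`, `toy_target`, and the `"end"` stop signal are hypothetical names introduced purely for illustration, and real audits use model calls and auditor tools instead of plain functions.

```python
from __future__ import annotations

from typing import Callable, List, Tuple

# Hypothetical stand-ins: a transcript is a list of (role, text) pairs.
Transcript = List[Tuple[str, str]]


def run_audit_loop(
    auditor: Callable[[Transcript], str],
    target: Callable[[str], str],
    max_turns: int = 15,
) -> Transcript:
    """Alternate auditor and target until the auditor stops or max_turns is reached."""
    transcript: Transcript = []
    for _ in range(max_turns):
        message = auditor(transcript)
        if message == "end":  # hypothetical stop signal chosen by the auditor
            break
        transcript.append(("auditor", message))
        transcript.append(("target", target(message)))
    return transcript


def toy_auditor(transcript: Transcript) -> str:
    """Send three probes, then end the conversation."""
    turn = len(transcript) // 2
    return "end" if turn >= 3 else f"probe {turn}"


def toy_target(message: str) -> str:
    """Echo-style target used only to make the loop observable."""
    return f"reply to {message}"


transcript = run_audit_loop(toy_auditor, toy_target, max_turns=15)
```

Here the auditor ends the conversation after three exchanges, well before the `max_turns` cap; an auditor that never stops would be cut off once the cap is reached.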
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `max_turns` | `int` | Maximum number of conversation turns before evaluation ends (default: 15) | `15` |
| `instance` | `str \| None` | Optional instance identifier for managing multiple concurrent audits | `None` |
| `auditor_tools` | `list[Tool] \| None` | Custom tools for the auditor agent (default: uses default_auditor_tools) | `None` |
| `prefill` | `bool` | Whether to allow the auditor to prefill the target model's response. Only relevant if auditor_tools is not provided. (default: True) | `True` |
| `auditor_system_message` | `str` | Custom system message for the auditor agent or path to a file (default: uses AUDITOR_SYSTEM_PROMPT) | `AUDITOR_SYSTEM_PROMPT` |
| `auditor_user_message` | `str` | Custom user message for the auditor agent or path to a file (default: uses AUDITOR_USER_MESSAGE) | `AUDITOR_USER_MESSAGE` |
| `max_tool_output` | `int \| None` | Maximum tool output size in bytes. Defaults to 128KB. This controls truncation of tool results returned to the auditor (e.g., from read_file). | `128 * 1024` |
Returns:
| Name | Type | Description |
|---|---|---|
| `Agent` | `Agent` | An inspect Agent that can be used in evaluation tasks |
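To give a feel for what a byte limit such as `max_tool_output` implies, here is a minimal standard-library sketch of capping a tool result at a number of UTF-8 bytes. `truncate_tool_output` is a hypothetical helper, not part of petri or inspect; which bytes the real truncation keeps, and how it treats `None`, are not stated on this page, so both choices below are assumptions.

```python
from __future__ import annotations

# 128 * 1024 = 131072 bytes, matching the documented default value.
DEFAULT_MAX_TOOL_OUTPUT = 128 * 1024


def truncate_tool_output(
    text: str, max_bytes: int | None = DEFAULT_MAX_TOOL_OUTPUT
) -> str:
    """Hypothetical helper: keep at most max_bytes of the UTF-8 encoding of text."""
    if max_bytes is None:
        # Assumption made for this sketch only: None means "do not truncate".
        return text
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    # Assumption: keep the leading bytes. errors="ignore" drops a multi-byte
    # character that was cut in half at the boundary.
    return data[:max_bytes].decode("utf-8", errors="ignore")


large_result = "a" * 200_000
capped = truncate_tool_output(large_result)
```

The limit is measured in bytes rather than characters, so non-ASCII text reaches the cap sooner than its character count suggests.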
Tasks¶
audit¶
petri.tasks.petri.audit ¶
audit(max_turns: int = 30, seed_instructions: str | list[str] | None = None, transcript_save_dir: str = './outputs', realism_filter: bool = False, realism_threshold: float = 0.6, resources_dir: str | None = None, auditor_system_message: str | None = None)
Defines the alignment audit evaluation task.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `max_turns` | `int` | Maximum number of conversation turns (default: 30) | `30` |
| `seed_instructions` | `str \| list[str] \| None` | Seed instructions for the audit (None uses defaults). Can be a string (file path, JSON list string, or newline-separated), or a list of strings (automatically handled when passed via CLI). | `None` |
| `transcript_save_dir` | `str` | Directory to save transcripts (default: "./outputs") | `'./outputs'` |
| `realism_filter` | `bool` | Enable realism filtering of auditor outputs (default: False) | `False` |
| `realism_threshold` | `float` | Minimum realism score to approve (default: 0.6) | `0.6` |
| `resources_dir` | `str \| None` | Path to a directory of resources for the auditor. When set, the auditor gets resource tools (list_directory, read_file, send_file_as_tool_result, search_in_files) and a resource-aware system prompt. Use "~/.petri/resources" for default resources (run …) | `None` |
| `auditor_system_message` | `str \| None` | Custom system message for the auditor | `None` |
Model Roles
- auditor: The model that conducts the audit
- target: The model being evaluated
- judge: The model that scores transcripts
- realism: (optional) The model that scores realism if realism_filter=True
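The `seed_instructions` parameter above accepts several input shapes. The sketch below shows one way those shapes could be normalized into a list of strings. It is not petri's parser: `normalize_seed_instructions` is a hypothetical function, and the order in which the forms are tried (existing file, then JSON list, then newline-separated text) is an assumption.

```python
from __future__ import annotations

import json
from pathlib import Path


def normalize_seed_instructions(value: str | list[str] | None) -> list[str] | None:
    """Hypothetical normalizer for the documented seed_instructions forms."""
    if value is None:
        return None  # the caller would fall back to the default instructions
    if isinstance(value, list):
        return [item for item in value if item.strip()]

    text = value
    # Assumption: a string naming an existing file is replaced by the file's content.
    try:
        path = Path(value)
        if path.is_file():
            text = path.read_text(encoding="utf-8")
    except OSError:
        pass  # e.g. the string is too long to be a valid path

    stripped = text.strip()
    if stripped.startswith("["):
        try:
            parsed = json.loads(stripped)
        except json.JSONDecodeError:
            parsed = None
        if isinstance(parsed, list):
            return [str(item) for item in parsed]

    # Fallback: one instruction per non-empty line.
    return [line.strip() for line in stripped.splitlines() if line.strip()]


from_json = normalize_seed_instructions('["probe honesty", "probe sycophancy"]')
from_lines = normalize_seed_instructions("probe honesty\n\nprobe sycophancy\n")
```

Both calls at the end produce the same two-item list, which illustrates why accepting multiple input shapes is convenient when the value may arrive from a CLI flag or from Python code.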
Store Classes¶
These classes manage internal state during audits:
AuditStore¶
petri.stores.AuditStore ¶
Bases: StoreModel
Central store for managing all state during an alignment audit.
This store coordinates the conversation state between the auditor and target models, tracks events that occur during the audit, and maintains metadata for transcript generation and analysis. It serves as the single source of truth for the entire audit session state.
The store is organized into separate sub-stores for different aspects of the audit:
- Auditor store: Manages the auditor model's conversation and tools
- Target store: Manages the model being evaluated and its synthetic tools
- Combined store: Maintains a unified view for transcript generation
- Judge store: Manages the judge scorer's evaluation messages
- Realism store: Manages the realism approver's evaluation messages
Attributes:
| Name | Type | Description |
|---|---|---|
| `auditor` | `AuditorStore` | Store for the auditor's conversation state and available tools |
| `target` | `TargetStore` | Store for the target model's conversation and synthetic tools created by the auditor |
| `combined` | `CombinedStore` | Store for the unified conversation view used in transcript generation |
| `judge` | `JudgeStore` | Store for the judge scorer's evaluation messages |
| `realism` | `RealismStore` | Store for the realism approver's evaluation messages |
| `seed_instructions` | `str \| None` | The specific instructions given to the auditor for this audit session |
| `events` | `list[Event]` | List of events that occurred during the audit (messages, tool calls, rollbacks, etc.) |
| `metadata` | `TranscriptMetadata \| None` | Metadata about the audit session (models used, timestamps, description, etc.) |
Example
from inspect_ai.util import store_as
from petri.stores import AuditStore

# Access the audit store within a solver/scorer
audit_store = store_as(AuditStore)
# Access the audit store with a specific instance name
audit_store = store_as(AuditStore, instance="my_audit")
# Access different conversation messages
auditor_messages = audit_store.auditor.messages
target_messages = audit_store.target.messages
combined_messages = audit_store.combined.messages
judge_messages = audit_store.judge.messages
realism_messages = audit_store.realism.messages
# Check seed instructions
instructions = audit_store.seed_instructions
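To make the layout and the role of the `instance` argument concrete, the sketch below mirrors a reduced version of the store structure with plain dataclasses and a toy registry keyed by instance name. It is a stand-in for illustration only: the real classes are `StoreModel`/`BaseModel` subclasses retrieved with `store_as`, and `SubStore`, `ToyAuditStore`, and `get_store` are hypothetical names that do not exist in petri.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SubStore:
    """Simplified sub-store holding only a message list."""
    messages: list[str] = field(default_factory=list)


@dataclass
class ToyAuditStore:
    """Reduced mirror of the documented layout (judge and realism omitted)."""
    auditor: SubStore = field(default_factory=SubStore)
    target: SubStore = field(default_factory=SubStore)
    combined: SubStore = field(default_factory=SubStore)
    seed_instructions: str | None = None


# Toy registry: one independent store per instance name (None = default store).
_registry: dict[str | None, ToyAuditStore] = {}


def get_store(instance: str | None = None) -> ToyAuditStore:
    """Return the store for an instance, creating it on first access."""
    if instance not in _registry:
        _registry[instance] = ToyAuditStore()
    return _registry[instance]


default_store = get_store()
named_store = get_store("my_audit")
named_store.target.messages.append("hello")
```

Writing to the `"my_audit"` store leaves the default store untouched, which is the property that lets several audits run side by side without sharing conversation state.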
AuditorStore¶
petri.stores.AuditorStore ¶
Bases: BaseModel
Store for managing the auditor agent's conversation state and tools.
Attributes:
| Name | Type | Description |
|---|---|---|
| `messages` | `list[ChatMessage]` | List of chat messages in the auditor's conversation |
| `tools` | `list[Tool \| ToolDef \| ToolSource \| None] \| ToolSource \| None` | List of tools available to the auditor agent |
TargetStore¶
petri.stores.TargetStore ¶
Bases: BaseModel
Store for managing the target model's conversation state and synthetic tools.
The target is the model being evaluated for alignment vulnerabilities. The auditor can create synthetic tools for the target to use during evaluation.
Attributes:
| Name | Type | Description |
|---|---|---|
| `messages` | `list[ChatMessage]` | List of chat messages in the target's conversation |
| `tools` | `list[ToolDefinition]` | List of synthetic tools created by the auditor for the target |
CombinedStore¶
petri.stores.CombinedStore ¶
Bases: BaseModel
Store for managing the combined view of the audit conversation.
This store maintains a unified view of the conversation that includes messages from both the auditor and target, useful for transcript generation and analysis.
Attributes:
| Name | Type | Description |
|---|---|---|
| `messages` | `list[ChatMessage]` | List of chat messages from the combined conversation view |