Platform

Autonomous agents. Deterministic validation. Real offensive tooling.

This is the architecture behind continuous, exploit-validated pentesting at machine scale — the loop, the toolkit, and the verification stack that turns LLM output into reproducible proof.

Proof over probability

Every finding is something the agent ran. Not something the model guessed.

Three independent layers stand between the agent's hypothesis and a saved finding. If any of them fail, the candidate is discarded.

01 · Real browser

Browser-confirmed execution

XSS, DOM-based logic flaws, and authenticated paths are replayed in a real Chromium instance via Browserbase. If the payload doesn't execute, it isn't a finding.

02 · OOB callback

Out-of-band callback proof

Blind SSRF, SQLi, and XXE are confirmed by listening for callbacks on Interactsh-controlled domains. Pattern matching on response timing is not enough.

03 · Adversarial review

Critic vote with veto

Before save, a critic model reviews the evidence and votes PIVOT / ABANDON / VALIDATE. Two vetoes kill the candidate. A single false positive is one too many.
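
As a rough sketch of how the three layers could combine into one save gate: the `Candidate` fields, the `needs_*` flags, and the rule that any non-VALIDATE vote counts as a veto are assumptions made for illustration, not the platform's actual data model.

```python
from dataclasses import dataclass
from enum import Enum


class Vote(Enum):
    PIVOT = "pivot"
    ABANDON = "abandon"
    VALIDATE = "validate"


@dataclass
class Candidate:
    title: str
    browser_confirmed: bool      # payload executed in a real browser
    oob_confirmed: bool          # callback seen on a controlled domain
    needs_browser: bool = False  # assumption: layer applies to this vuln class
    needs_oob: bool = False      # assumption: layer applies to this vuln class


VETO_LIMIT = 2  # "Two vetoes kill the candidate."


def should_save(candidate: Candidate, votes: list[Vote]) -> bool:
    """Return True only if every applicable layer passed."""
    if candidate.needs_browser and not candidate.browser_confirmed:
        return False
    if candidate.needs_oob and not candidate.oob_confirmed:
        return False
    # Assumption: PIVOT and ABANDON both count as vetoes.
    vetoes = sum(1 for vote in votes if vote is not Vote.VALIDATE)
    return bool(votes) and vetoes < VETO_LIMIT
```

A candidate with no critic votes at all is rejected here, on the assumption that silence is not confirmation.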

Scan lifecycle

Every scan moves through eight observable phases.

Each phase emits a `scan.phase_started` and a `scan.phase_completed` event. The four active phases, recon through verification, are where the agent works. The other four are bookends: submission and queueing before, reporting and the terminal state after.

01 · state

Submission

Target + scope + budget + mode persisted. Scan ID issued.

scan.created

02 · state

Queued

Awaiting an executor slot. Position in queue visible to operator.

scan.queued

03 · active

Recon

Subdomain enumeration, port scan, JS analysis, API schema parse, knowledge-corpus retrieval.

8 recon tools · 50-page crawl

04 · active

Vuln scanning

Probes every reachable parameter across 12 vuln classes. Differential analysis flags candidates.

12 vuln classes · differential testing

05 · active

Exploitation

Candidate findings pivot into 14 named exploit chains. WAF bypass and OOB callbacks fire on demand.

14 chain templates · 30-tool dispatch

06 · active

Verification

Replay each exploit against a captured baseline. Critic-veto vote. Discard anything not confirmed.

baseline vs exploit diff · critic vote

07 · state

Report

Findings, evidence, telemetry, cost ledger written to disk. JSON + HTML + PDF assembled.

scan.completed · artifact.created

08 · state

Terminal

Final state reached. Reason recorded: objective_met, budget_exhausted, max_steps, stuck, error.

scan.terminal
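
The eight phases and their start/complete events can be sketched as a small enum plus an emitter. The event names and terminal reasons come from the list above; the payload field names (`scan_id`, `phase`, `reason`) and the `run_lifecycle` helper are hypothetical.

```python
from enum import Enum
from typing import Callable


class Phase(Enum):
    SUBMISSION = "submission"
    QUEUED = "queued"
    RECON = "recon"
    VULN_SCANNING = "vuln_scanning"
    EXPLOITATION = "exploitation"
    VERIFICATION = "verification"
    REPORT = "report"
    TERMINAL = "terminal"


ACTIVE_PHASES = {
    Phase.RECON,
    Phase.VULN_SCANNING,
    Phase.EXPLOITATION,
    Phase.VERIFICATION,
}

TERMINAL_REASONS = {
    "objective_met", "budget_exhausted", "max_steps", "stuck", "error",
}


def run_lifecycle(scan_id: str, emit: Callable[[dict], None],
                  reason: str = "objective_met") -> None:
    """Walk the phases in order, emitting a start and a complete event each."""
    if reason not in TERMINAL_REASONS:
        raise ValueError(f"unknown terminal reason: {reason}")
    for phase in Phase:  # Enum iteration follows definition order
        emit({"type": "scan.phase_started", "scan_id": scan_id,
              "phase": phase.value})
        # The real phase work would happen here.
        done = {"type": "scan.phase_completed", "scan_id": scan_id,
                "phase": phase.value}
        if phase is Phase.TERMINAL:
            done["reason"] = reason
        emit(done)
```

Passing `events.append` as `emit` collects sixteen events for one scan, two per phase.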

How it tests like an adversary

Four active phases. One loop. No scripted playbook.

Recon → scanning → exploitation → verification. The agent decides what to do next from the evidence it just collected. Order, prioritization, and tool choice are emergent — not hard-coded.

Stage 01

browse · crawl · dns_lookup · analyze_js · deep_port_scan · parse_api_schema · enumerate_api

Recon

Cast a wide net. Subdomain discovery, port scan, JS analysis, API schema parse — build a map of every reachable surface.

subfinder · nuclei · nmap · Playwright crawler · JS analyzer · API schema parser

Multi-agent coordinator

One planner. Up to three operators. Each with its own attack chain.

Pentest Genie has two execution modes. Phase A runs a single full-spectrum agent — the right shape for tight scopes and bounty hunting. Phase B spins up an Opus-class planner that decomposes the target into focused missions, then fans out to parallel Sonnet operators with their own budget, endpoints, and chain.

Phase A · single agent

Single-agent autonomous loop

One AutonomousAgent runs the full lifecycle (recon, scan, exploit, verify) against the whole scope. Steps are effectively unbounded (capped at 999) within the budget cap. Best for narrow targets and bug-bounty workflows.

agents: 1
max steps: 999
budget: single ledger
planner: —

Phase B · planner + 3 operators

Planner → parallel operators

Opus planner reads the recon snapshot, decomposes the target into missions (objective + vuln-class + attack chain + target endpoints + sub-budget), and dispatches them to up to three Sonnet operators running in parallel. Each operator has 5–10 steps, its own evidence channel, and reports back into a shared finding store.

planner: Opus
max parallel: 3
per-mission steps: 5–10
sample: live from scan fa615c9e

Sample missions

A1 · priority 1 · steps ≤ 10 · $8.75

Compromise administrative access via authentication bypass

attack_chain

auth_bypass_attempt → jwt_token_manipulation → admin_panel_access

target_endpoints

/api/v1/auth/login
/api/v1/admin/dashboard
/api/v1/shell/execute

A2 · priority 2 · steps ≤ 10 · $8.75

Exfiltrate sensitive database contents via injection

attack_chain

parameter_tampering → database_query_access → data_dump

target_endpoints

/api/v1/reports/search
/api/v1/export/csv
/api/v1/logs/download
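
A mission like the two above could be modeled as a plain record and fanned out with a concurrency cap of three. This is a minimal sketch: the `Mission` field names, the `run_operator` stub, and the `vuln_class` values are assumptions for illustration. Only the objectives, chains, endpoints, step limits, and sub-budgets are taken from the samples.

```python
import asyncio
from dataclasses import dataclass

MAX_PARALLEL = 3  # "up to three Sonnet operators running in parallel"


@dataclass
class Mission:
    mission_id: str
    priority: int
    objective: str
    vuln_class: str  # illustrative value; not shown in the sample missions
    attack_chain: list[str]
    target_endpoints: list[str]
    max_steps: int
    sub_budget_usd: float


async def run_operator(mission: Mission) -> dict:
    """Hypothetical stand-in for one operator working a single mission."""
    await asyncio.sleep(0)  # placeholder for the operator's step loop
    return {"mission_id": mission.mission_id, "findings": []}


async def dispatch(missions: list[Mission]) -> list[dict]:
    """Run missions in priority order, at most MAX_PARALLEL at a time."""
    slots = asyncio.Semaphore(MAX_PARALLEL)

    async def guarded(mission: Mission) -> dict:
        async with slots:
            return await run_operator(mission)

    ordered = sorted(missions, key=lambda m: m.priority)
    # gather returns results in the same order as its inputs
    return await asyncio.gather(*(guarded(m) for m in ordered))


A1 = Mission(
    "A1", 1, "Compromise administrative access via authentication bypass",
    "auth",
    ["auth_bypass_attempt", "jwt_token_manipulation", "admin_panel_access"],
    ["/api/v1/auth/login", "/api/v1/admin/dashboard", "/api/v1/shell/execute"],
    max_steps=10, sub_budget_usd=8.75,
)
A2 = Mission(
    "A2", 2, "Exfiltrate sensitive database contents via injection",
    "injection",
    ["parameter_tampering", "database_query_access", "data_dump"],
    ["/api/v1/reports/search", "/api/v1/export/csv", "/api/v1/logs/download"],
    max_steps=10, sub_budget_usd=8.75,
)
```

Calling `asyncio.run(dispatch([A2, A1]))` returns one result per mission, with A1 first because of its priority.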

Architecture

Built for scale, budget, and trust.

Every scan is bounded by an explicit budget. Every action is logged. Every artifact is reproducible. The system is designed so an operator can pause it, audit it, and ship its output to a bug bounty platform without manual cleanup.

Input

Target + Scope: wildcards, CIDR

Budget + Mode: stealth · normal · aggressive

Agent layer

Planner (Opus): decomposes mission

Specialist Agents: SQLi · auth · IDOR

Critic Pool: veto / pivot vote

Tooling

30-tool Toolkit: browse · scan · fuzz · exploit

Knowledge Corpus: ChromaDB · RAG

Verification

Browser Verify: Browserbase + Playwright

OOB Callback Verify: Interactsh listener

Deterministic Validator: cross-agent reconcile

Output

Reports: JSON · HTML · PDF · h1 · bugcrowd

Telemetry + Audit: tokens · cost · events

Flow: input → planner · planner → specialists · specialists ↔ toolkit · specialists → critics → verify · verify → reports + audit

Toolkit

Thirty named tools. One dispatch table.

Every tool the agent can call is registered in a single dispatch in `core/hacker_tools.py`. The names below are the method calls the agent emits — not friendly labels. Scope enforcement gates every outbound request at this layer.

30 tools registered in the dispatch table

Reconnaissance

8 tools

browse: fetch + render a URL
crawl: site map, 50-page cap
dns_lookup: A/AAAA/CNAME/MX/TXT
analyze_js: secrets, endpoints, 20 files
deep_port_scan: service fingerprint
parse_api_schema: OpenAPI / Swagger
enumerate_api: endpoint discovery
run_command_tool: nmap · subfinder · ffuf · nuclei

Probing & exploitation

11 tools

test_payload: injection candidate
nmap_scan: structured port scan
brute_force_service: credential stuffing
dir_bust: ffuf-driven
test_idor: cross-actor IDOR
auth_bypass: logic + token flaws
browser_xss: Playwright DOM exec
differential_test: baseline vs payload
waf_bypass: evasion transforms
chain_exploit: 14 named chains
run_python: sandboxed PoC

Out-of-band & validation

4 tools

oob_callback: Burp Collaborator / Interactsh
check_ssl: TLS posture
curl_request: raw HTTP
manage_session: multi-role auth state

Knowledge & control

7 tools

query_knowledge: RAG · ChromaDB
search_knowledge: exploit corpus
lookup_exploit_template: 14 templates
lookup_custom_exploit: per-campaign
save_finding: with evidence + proof
complete_scan: exit reason
abandon_category: stuck-state pivot
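
A single dispatch table with a scope gate in front of it might look roughly like the sketch below. It does not reproduce `core/hacker_tools.py`. The `tool` decorator, the `OutOfScope` exception, the `in_scope` callback, and the stubbed tool bodies are all hypothetical. Only the tool names come from the list above.

```python
from typing import Any, Callable

# One registry for every callable tool, keyed by the name the agent emits.
TOOLS: dict[str, Callable[..., Any]] = {}


class OutOfScope(Exception):
    """Raised when a tool call targets something outside the scan scope."""


def tool(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Register a function in the dispatch table under the given name."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        if name in TOOLS:
            raise ValueError(f"duplicate tool name: {name}")
        TOOLS[name] = fn
        return fn
    return register


@tool("browse")
def browse(url: str) -> dict:
    return {"tool": "browse", "url": url, "status": "stubbed"}


@tool("curl_request")
def curl_request(url: str, method: str = "GET") -> dict:
    return {"tool": "curl_request", "url": url, "method": method}


def dispatch(name: str, in_scope: Callable[[str], bool], **kwargs: Any) -> Any:
    """Look up a tool by name and block out-of-scope targets before calling it."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    target = kwargs.get("url")  # assumption: network tools take a `url` argument
    if target is not None and not in_scope(target):
        raise OutOfScope(target)
    return TOOLS[name](**kwargs)
```

Putting the scope check inside `dispatch` rather than inside each tool means a newly registered tool is gated without any extra code.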

Why single-agent AI and scanners fall short

What the alternative actually does.

Scanners are fast but blind to logic. Manual pentests are deep but slow. Single-agent LLM tools loop until they run out of context. Pentest Genie is structured to avoid all three failure modes.

Comparison of vulnerability scanners, manual pentests, and Pentest Genie
Criterion	Vulnerability scanner	Manual pentest	Pentest Genie
Decision-making	Signature match	Human judgment	Evidence-driven LLM loop
Exploit chaining	None	Manual, hours per chain	Autonomous, seconds
Verification	Pattern + status code	Manual reproduction	Browser + OOB + critic vote
Time to first finding	Minutes	Days	Minutes
Cost predictability	Fixed license	Five to six figures	Per-scan budget cap
Output	Findings list	Narrative report	Both + reproducible PoCs + h1 / bugcrowd export

Built for production

Not a research demo. A system you can run.

01 · Predictable

Cost cap, hard-enforced

Set a budget per scan and per agent. The coordinator enforces sub-budgets across specialists. The loop terminates gracefully when the cap is hit.
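
One way to picture hard-enforced caps with sub-budgets is a ledger tree in which every charge must fit under each cap above it. The `Ledger` class and the 25.00 scan cap are hypothetical. The 8.75 sub-budget mirrors the sample missions.

```python
from dataclasses import dataclass, field
from typing import Optional


class BudgetExhausted(Exception):
    """Raised when a charge would push any ledger in the chain past its cap."""


@dataclass
class Ledger:
    cap_usd: float
    spent_usd: float = 0.0
    parent: Optional["Ledger"] = field(default=None, repr=False)
    children: dict[str, "Ledger"] = field(default_factory=dict, repr=False)

    def allocate(self, name: str, cap_usd: float) -> "Ledger":
        """Carve a sub-budget out of this ledger's cap."""
        allocated = sum(child.cap_usd for child in self.children.values())
        if allocated + cap_usd > self.cap_usd:
            raise ValueError("sub-budgets exceed the parent cap")
        child = Ledger(cap_usd=cap_usd, parent=self)
        self.children[name] = child
        return child

    def charge(self, amount_usd: float) -> None:
        """Record spend here and in every ancestor, or refuse it entirely."""
        node: Optional[Ledger] = self
        while node is not None:  # check the whole chain before committing
            if node.spent_usd + amount_usd > node.cap_usd:
                raise BudgetExhausted()
            node = node.parent
        node = self
        while node is not None:
            node.spent_usd += amount_usd
            node = node.parent


scan = Ledger(cap_usd=25.00)       # hypothetical per-scan cap
a1 = scan.allocate("A1", 8.75)     # sub-budget from the sample mission
```

In this sketch the agent loop would catch `BudgetExhausted` and end the scan with the `budget_exhausted` reason instead of crashing.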

02 · Safe

Scope enforcement

Wildcard patterns and CIDR ranges define the perimeter. Outbound requests outside scope are blocked at the tool layer, not just warned.
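
A perimeter check of this kind can be sketched with the standard library alone. The `Scope` class below is an illustrative assumption, not the platform's implementation, and it deliberately ignores one real-world concern: an in-scope hostname that resolves to an out-of-scope IP address.

```python
import fnmatch
import ipaddress
from urllib.parse import urlsplit


class Scope:
    """Allow a URL only if its host matches a wildcard or falls in a CIDR."""

    def __init__(self, patterns: list[str], cidrs: list[str]) -> None:
        self.patterns = [p.lower() for p in patterns]
        self.networks = [ipaddress.ip_network(c) for c in cidrs]

    def allows(self, url: str) -> bool:
        host = urlsplit(url).hostname  # lowercased by urlsplit; None if absent
        if host is None:
            return False
        try:
            addr = ipaddress.ip_address(host)
        except ValueError:
            # Not an IP literal, so treat it as a hostname.
            return any(fnmatch.fnmatchcase(host, p) for p in self.patterns)
        # Membership is False when address and network versions differ.
        return any(addr in net for net in self.networks)


scope = Scope(patterns=["*.example.com"], cidrs=["10.0.0.0/24"])
```

Anything that fails `scope.allows(url)` would be rejected before the request leaves the tool layer.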

03 · Visible

Live observability

20 WebSocket event types stream every decision the agent makes. Strictly monotonic sequence numbers per scan. Pause, cancel, or take over at any point.
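
Strictly monotonic per-scan sequence numbers let a client detect dropped events. A minimal sketch, assuming an envelope with `scan_id`, `seq`, `type`, and `data` fields (these names and both helpers are hypothetical):

```python
import itertools
from collections import defaultdict
from typing import Any, Iterator


class EventStream:
    """Stamp each event with the next sequence number for its scan."""

    def __init__(self) -> None:
        # One independent counter per scan, starting at 1.
        self._counters: dict[str, Iterator[int]] = defaultdict(
            lambda: itertools.count(1)
        )

    def envelope(self, scan_id: str, event_type: str, data: Any) -> dict:
        return {
            "scan_id": scan_id,
            "seq": next(self._counters[scan_id]),
            "type": event_type,
            "data": data,
        }


def missing_seqs(received: list[int]) -> list[int]:
    """Client side: list the sequence numbers that never arrived."""
    if not received:
        return []
    seen = set(received)
    return [n for n in range(1, max(received) + 1) if n not in seen]
```

A client that finds a non-empty `missing_seqs` result knows it has to resync before trusting its view of the scan.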

04 · Auditable

Audit trail

Telemetry, payloads, evidence, and report exports are persisted per scan. Reproducible runs from the same inputs.

Ready to look closer?

Get hands-on with the agent loop.

Early access is rolling. Tell us your target shape and we'll help you scope a first scan.

Request early access