Pentest Genie v0 · pre-release
Platform

Autonomous agents. Deterministic validation. Real offensive tooling.

This is the architecture behind continuous, exploit-validated pentesting at machine scale — the loop, the toolkit, and the verification stack that turns LLM output into reproducible proof.

Proof over probability

Every finding is something the agent ran. Not something the model guessed.

Three independent layers stand between the agent's hypothesis and a saved finding. If any of them fail, the candidate is discarded.

01 Real browser

Browser-confirmed execution

XSS, DOM-based logic flaws, and authenticated paths are replayed in a real Chromium instance via Browserbase. If the payload doesn't execute, it isn't a finding.

02 OOB callback

Out-of-band callback proof

Blind SSRF, SQLi, and XXE are confirmed by listening for callbacks on Interactsh-controlled domains. Pattern matching on response timing is not enough.

03 Adversarial review

Critic vote with veto

Before save, a critic model reviews the evidence and votes PIVOT / ABANDON / VALIDATE. Two vetoes kill the candidate. A single false positive is one too many.
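
The three-layer gate described above can be sketched roughly as follows. This is an illustration only, not the product's code: the `Candidate` fields, the `should_save` function, the reading that an ABANDON vote counts as a veto, and the rule that at least one deterministic proof must be present are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Vote(Enum):
    PIVOT = "pivot"
    ABANDON = "abandon"
    VALIDATE = "validate"


@dataclass
class Candidate:
    """Hypothetical evidence record for one candidate finding."""
    vuln_class: str
    # None means the layer does not apply to this vuln class.
    browser_executed: Optional[bool] = None
    oob_callback_seen: Optional[bool] = None
    critic_votes: List[Vote] = field(default_factory=list)


VETO_THRESHOLD = 2  # "Two vetoes kill the candidate."


def should_save(c: Candidate) -> bool:
    """Return True only if every applicable layer confirmed the candidate."""
    # Layers 1 and 2: an applicable check that failed discards the candidate.
    if c.browser_executed is False or c.oob_callback_seen is False:
        return False
    # Assumption: at least one deterministic proof is required.
    if c.browser_executed is None and c.oob_callback_seen is None:
        return False
    # Layer 3. Assumption: an ABANDON vote is treated as a veto.
    vetoes = sum(1 for v in c.critic_votes if v is Vote.ABANDON)
    return vetoes < VETO_THRESHOLD
```

Under these assumptions, a browser-confirmed XSS with one ABANDON vote is still saved, while the same candidate with two ABANDON votes is discarded.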

Scan lifecycle

Every scan moves through eight observable phases.

Each phase emits a `scan.phase_started` and a `scan.phase_completed` event. The four active phases, recon through verification, are where the agent works. The bookends cover submission, queueing, reporting, and the terminal state.

01 state

Submission

Target + scope + budget + mode persisted. Scan ID issued.

scan.created

02 state

Queued

Awaiting an executor slot. Position in queue visible to operator.

scan.queued

03 active

Recon

Subdomain enumeration, port scan, JS analysis, API schema parse, knowledge-corpus retrieval.

8 recon tools · 50-page crawl

04 active

Vuln scanning

Probes every reachable parameter across 12 vuln classes. Differential analysis flags candidates.

12 vuln classes · differential testing

05 active

Exploitation

Candidate findings pivot into 14 named exploit chains. WAF bypass and OOB callbacks fire on demand.

14 chain templates · 30-tool dispatch

06 active

Verification

Replay each exploit against a captured baseline. Critic-veto vote. Discard anything not confirmed.

baseline vs exploit diff · critic vote

07 state

Report

Findings, evidence, telemetry, cost ledger written to disk. JSON + HTML + PDF assembled.

scan.completed · artifact.created

08 state

Terminal

Final state reached. Reason recorded: objective_met, budget_exhausted, max_steps, stuck, error.

scan.terminal
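
One way to picture the eight phases and their start/complete events is the sketch below. The phase names and terminal reasons come from the list above; the `Phase` enum, the `run_phase` helper, and the event payload shape are assumptions for illustration, not the platform's actual schema.

```python
from enum import Enum
from typing import Callable, Dict


class Phase(Enum):
    SUBMISSION = "submission"
    QUEUED = "queued"
    RECON = "recon"
    VULN_SCANNING = "vuln_scanning"
    EXPLOITATION = "exploitation"
    VERIFICATION = "verification"
    REPORT = "report"
    TERMINAL = "terminal"


# The four phases in which the agent actually works.
ACTIVE_PHASES = {
    Phase.RECON,
    Phase.VULN_SCANNING,
    Phase.EXPLOITATION,
    Phase.VERIFICATION,
}

# Reasons recorded when a scan reaches its terminal state.
TERMINAL_REASONS = {"objective_met", "budget_exhausted", "max_steps", "stuck", "error"}


def run_phase(
    phase: Phase,
    work: Callable[[], None],
    emit: Callable[[Dict[str, str]], None],
) -> None:
    """Wrap a unit of work in phase_started / phase_completed events."""
    emit({"type": "scan.phase_started", "phase": phase.value})
    work()
    emit({"type": "scan.phase_completed", "phase": phase.value})
```

Running every phase through `run_phase` with a no-op `work` function yields sixteen events, two per phase, in lifecycle order.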

How it tests like an adversary

Four active phases. One loop. No scripted playbook.

Recon → scanning → exploitation → verification. The agent decides what to do next from the evidence it just collected. Order, prioritization, and tool choice are emergent — not hard-coded.

Stage 01

browse · crawl · dns_lookup · analyze_js · deep_port_scan · parse_api_schema · enumerate_api

Recon

Cast a wide net. Subdomain discovery, port scan, JS analysis, API schema parse — build a map of every reachable surface.

subfinder · nuclei · nmap · Playwright crawler · JS analyzer · API schema parser
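
The evidence-driven loop described above might look something like the sketch below. Everything here is hypothetical: `choose_action` stands in for the model's decision, `run_tool` for the toolkit, and the stuck heuristic (three consecutive steps with no new evidence) is an assumption. Only the exit reasons and the 999-step default are taken from this page.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# An action is (tool_name, cost_usd); None means the objective is considered met.
Action = Optional[Tuple[str, float]]


@dataclass
class LoopState:
    evidence: List[str] = field(default_factory=list)
    spent_usd: float = 0.0
    steps: int = 0


def run_loop(
    choose_action: Callable[[List[str]], Action],
    run_tool: Callable[[str], List[str]],
    budget_usd: float,
    max_steps: int = 999,
    stuck_after: int = 3,
) -> Tuple[str, LoopState]:
    """Pick the next tool from the evidence so far until an exit reason applies."""
    state = LoopState()
    idle = 0  # consecutive steps that produced no new evidence
    while True:
        if state.steps >= max_steps:
            return "max_steps", state
        action = choose_action(state.evidence)
        if action is None:
            return "objective_met", state
        tool, cost = action
        if state.spent_usd + cost > budget_usd:
            return "budget_exhausted", state
        new_evidence = run_tool(tool)
        state.spent_usd += cost
        state.steps += 1
        if new_evidence:
            state.evidence.extend(new_evidence)
            idle = 0
        else:
            idle += 1
            if idle >= stuck_after:
                return "stuck", state
```

The loop body contains no fixed tool order. The sequence of calls falls out of whatever `choose_action` returns for the current evidence, which is the sense in which ordering is emergent rather than scripted.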
Multi-agent coordinator

One planner. Up to three operators. Each with its own attack chain.

Pentest Genie has two execution modes. Phase A runs a single full-spectrum agent — the right shape for tight scopes and bounty hunting. Phase B spins up an Opus-class planner that decomposes the target into focused missions, then fans out to parallel Sonnet operators with their own budget, endpoints, and chain.

Phase A · single agent

Single-agent autonomous loop

One AutonomousAgent runs the full lifecycle (recon, scan, exploit, verify) against the whole scope. Steps are effectively unbounded within the budget cap, up to the 999-step maximum. Best for narrow targets and bug-bounty workflows.

agents: 1 · max steps: 999 · budget: single ledger
planner
Phase B · planner + 3 operators

Planner → parallel operators

Opus planner reads the recon snapshot, decomposes the target into missions (objective + vuln-class + attack chain + target endpoints + sub-budget), and dispatches them to up to three Sonnet operators running in parallel. Each operator has 5–10 steps, its own evidence channel, and reports back into a shared finding store.

planner: Opus · max parallel: 3 · per-mission steps: 5–10 · sample: live from scan fa615c9e
Sample missions
A1 · priority 1 · steps ≤ 10 · $8.75

Compromise administrative access via authentication bypass

attack_chain
auth_bypass_attempt → jwt_token_manipulation → admin_panel_access
target_endpoints
  • /api/v1/auth/login
  • /api/v1/admin/dashboard
  • /api/v1/shell/execute
A2 · priority 2 · steps ≤ 10 · $8.75

Exfiltrate sensitive database contents via injection

attack_chain
parameter_tampering → database_query_access → data_dump
target_endpoints
  • /api/v1/reports/search
  • /api/v1/export/csv
  • /api/v1/logs/download
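
The mission structure and the three-operator fan-out can be sketched as below. The field values for A1 and A2 are copied from the sample missions above; the `Mission` dataclass, the `dispatch` function, and the use of a thread pool are assumptions for illustration. The sample does not show a vuln-class value, so that field is left optional rather than guessed.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

MAX_PARALLEL_OPERATORS = 3  # "up to three operators"


@dataclass(frozen=True)
class Mission:
    mission_id: str
    priority: int
    objective: str
    attack_chain: Tuple[str, ...]
    target_endpoints: Tuple[str, ...]
    max_steps: int
    sub_budget_usd: float
    vuln_class: Optional[str] = None  # not shown in the sample output


SAMPLE_MISSIONS = [
    Mission(
        "A2", 2, "Exfiltrate sensitive database contents via injection",
        ("parameter_tampering", "database_query_access", "data_dump"),
        ("/api/v1/reports/search", "/api/v1/export/csv", "/api/v1/logs/download"),
        max_steps=10, sub_budget_usd=8.75,
    ),
    Mission(
        "A1", 1, "Compromise administrative access via authentication bypass",
        ("auth_bypass_attempt", "jwt_token_manipulation", "admin_panel_access"),
        ("/api/v1/auth/login", "/api/v1/admin/dashboard", "/api/v1/shell/execute"),
        max_steps=10, sub_budget_usd=8.75,
    ),
]


def dispatch(missions: List[Mission], operator: Callable[[Mission], dict]) -> List[dict]:
    """Run missions in priority order on at most three parallel workers."""
    ordered = sorted(missions, key=lambda m: m.priority)
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_OPERATORS) as pool:
        # pool.map preserves input order, so results line up with `ordered`.
        return list(pool.map(operator, ordered))
```

A real operator would run its own agent loop against `target_endpoints` within `sub_budget_usd`; here any callable that takes a `Mission` and returns a dict will do.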
Architecture

Built for scale, budget, and trust.

Every scan is bounded by an explicit budget. Every action is logged. Every artifact is reproducible. The system is designed so an operator can pause it, audit it, and ship its output to a bug bounty platform without manual cleanup.

Input
  • Target + Scope: wildcards, CIDR
  • Budget + Mode: stealth · normal · aggressive
Agent layer
  • Planner (Opus): decomposes mission
  • Specialist Agents: SQLi · auth · IDOR
  • Critic Pool: veto / pivot vote
Tooling
  • 30-tool Toolkit: browse · scan · fuzz · exploit
  • Knowledge Corpus: ChromaDB · RAG
Verification
  • Browser Verify: Browserbase + Playwright
  • OOB Callback Verify: Interactsh listener
  • Deterministic Validator: cross-agent reconcile
Output
  • Reports: JSON · HTML · PDF · h1 · bugcrowd
  • Telemetry + Audit: tokens · cost · events
Flow
  • input → planner
  • planner → specialists
  • specialists ↔ toolkit
  • specialists → critics → verify
  • verify → reports + audit
Toolkit

Thirty named tools. One dispatch table.

Every tool the agent can call is registered in a single dispatch table in `core/hacker_tools.py`. The names below are the method calls the agent emits, not friendly labels. Scope enforcement gates every outbound request at this layer.

30

tools registered in the dispatch table
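
A single dispatch table with a scope gate in front of it could be sketched like this. The contents of `core/hacker_tools.py` are not shown on this page, so the `tool` decorator, `call_tool`, `OutOfScopeError`, and the stubbed `browse` body are all hypothetical; only the tool name `browse` comes from the list below.

```python
from typing import Any, Callable, Dict
from urllib.parse import urlparse


class OutOfScopeError(Exception):
    """Raised when a tool call targets a host outside the scan scope."""


TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Register a function in the dispatch table under the name the agent emits."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = fn
        return fn
    return register


@tool("browse")
def browse(url: str) -> str:
    # Stub: a real tool would fetch and render the page.
    return f"fetched {url}"


def call_tool(name: str, in_scope: Callable[[str], bool], **kwargs: Any) -> Any:
    """Look up a tool by name and block the call if its URL is out of scope."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    url = kwargs.get("url")
    if url is not None:
        host = urlparse(url).hostname or ""
        if not in_scope(host):
            raise OutOfScopeError(host)
    return TOOLS[name](**kwargs)
```

Putting the check in `call_tool` rather than in each tool means a newly registered tool cannot forget to enforce scope.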

Reconnaissance

8 tools
  • browse: fetch + render a URL
  • crawl: site map, 50-page cap
  • dns_lookup: A/AAAA/CNAME/MX/TXT
  • analyze_js: secrets, endpoints, 20 files
  • deep_port_scan: service fingerprint
  • parse_api_schema: OpenAPI / Swagger
  • enumerate_api: endpoint discovery
  • run_command_tool: nmap · subfinder · ffuf · nuclei

Probing & exploitation

11 tools
  • test_payload: injection candidate
  • nmap_scan: structured port scan
  • brute_force_service: credential stuffing
  • dir_bust: ffuf-driven
  • test_idor: cross-actor IDOR
  • auth_bypass: logic + token flaws
  • browser_xss: Playwright DOM exec
  • differential_test: baseline vs payload
  • waf_bypass: evasion transforms
  • chain_exploit: 14 named chains
  • run_python: sandboxed PoC

Out-of-band & validation

4 tools
  • oob_callback: Burp Collaborator / Interactsh
  • check_ssl: TLS posture
  • curl_request: raw HTTP
  • manage_session: multi-role auth state

Knowledge & control

7 tools
  • query_knowledge: RAG · ChromaDB
  • search_knowledge: exploit corpus
  • lookup_exploit_template: 14 templates
  • lookup_custom_exploit: per-campaign
  • save_finding: with evidence + proof
  • complete_scan: exit reason
  • abandon_category: stuck-state pivot
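
To make the "baseline vs payload" idea behind `differential_test` concrete, here is a rough sketch of comparing a probe response against a captured baseline. The `Response` type, the signal names, and the 10% length tolerance are assumptions for illustration and do not describe the tool's real heuristics.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Response:
    status: int
    body: str


def differential_signals(
    baseline: Response,
    probe: Response,
    marker: str,
    length_tolerance: float = 0.10,
) -> List[str]:
    """List the ways a probe response differs from the baseline response."""
    signals: List[str] = []
    if probe.status != baseline.status:
        signals.append("status_changed")
    base_len = max(len(baseline.body), 1)  # avoid division by zero
    if abs(len(probe.body) - len(baseline.body)) / base_len > length_tolerance:
        signals.append("length_changed")
    # The marker only counts if it is absent from the baseline.
    if marker in probe.body and marker not in baseline.body:
        signals.append("marker_reflected")
    return signals
```

A non-empty result only flags a candidate. Per the verification layers described earlier on this page, it would still need browser or out-of-band confirmation before being saved.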
Why single-agent AI and scanners fall short

What the alternative actually does.

Scanners are fast but blind to logic. Manual pentests are deep but slow. Single-agent LLM tools loop until they run out of context. Pentest Genie is structured to avoid all three failure modes.

Decision-making
  • Vulnerability scanner: Signature match
  • Manual pentest: Human judgment
  • Pentest Genie: Evidence-driven LLM loop

Exploit chaining
  • Vulnerability scanner: None
  • Manual pentest: Manual, hours per chain
  • Pentest Genie: Autonomous, seconds

Verification
  • Vulnerability scanner: Pattern + status code
  • Manual pentest: Manual reproduction
  • Pentest Genie: Browser + OOB + critic vote

Time to first finding
  • Vulnerability scanner: Minutes
  • Manual pentest: Days
  • Pentest Genie: Minutes

Cost predictability
  • Vulnerability scanner: Fixed license
  • Manual pentest: Five to six figures
  • Pentest Genie: Per-scan budget cap

Output
  • Vulnerability scanner: Findings list
  • Manual pentest: Narrative report
  • Pentest Genie: Both + reproducible PoCs + h1 / bugcrowd export
Built for production

Not a research demo. A system you can run.

01 Predictable

Cost cap, hard-enforced

Set a budget per scan and per agent. The coordinator enforces sub-budgets across specialists. The loop terminates gracefully when the cap is hit.
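
A hard-enforced cap with per-agent sub-budgets could be modelled as a ledger that charges its parent as well as itself. This is a sketch under assumptions: the `Ledger` class, its method names, and the `BudgetExhausted` exception are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


class BudgetExhausted(Exception):
    """Raised when a charge would exceed a cap anywhere in the ledger chain."""


@dataclass
class Ledger:
    cap_usd: float
    spent_usd: float = 0.0
    parent: Optional["Ledger"] = None

    def sub_ledger(self, cap_usd: float) -> "Ledger":
        """Create a child ledger whose spend also counts against this one."""
        return Ledger(cap_usd=cap_usd, parent=self)

    def charge(self, amount_usd: float) -> None:
        """Record spend, refusing it if any ledger in the chain would overrun."""
        # First pass: verify every cap before mutating anything.
        node: Optional[Ledger] = self
        while node is not None:
            if node.spent_usd + amount_usd > node.cap_usd:
                raise BudgetExhausted(f"cap of {node.cap_usd} USD would be exceeded")
            node = node.parent
        # Second pass: apply the charge up the chain.
        node = self
        while node is not None:
            node.spent_usd += amount_usd
            node = node.parent
```

Checking every cap before recording anything keeps the ledgers consistent when a charge is refused, so the caller can catch `BudgetExhausted` and wind the loop down cleanly.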

02 Safe

Scope enforcement

Wildcard patterns and CIDR ranges define the perimeter. Outbound requests outside scope are blocked at the tool layer, not just warned.
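
A scope check over wildcard patterns and CIDR ranges can be written with the Python standard library alone. The sketch below is illustrative, and the `in_scope` function and its matching rules are assumptions: IP literals are checked only against CIDR ranges, and hostnames only against wildcard patterns.

```python
import fnmatch
import ipaddress
from typing import Iterable


def in_scope(host: str, patterns: Iterable[str], cidrs: Iterable[str]) -> bool:
    """Return True if host matches a wildcard pattern or falls in a CIDR range."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # not an IP literal, so treat it as a hostname
    if ip is not None:
        return any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)
    name = host.lower()
    return any(fnmatch.fnmatchcase(name, pattern.lower()) for pattern in patterns)
```

Note that `*` in `fnmatch` also matches dots, so `*.example.com` covers `a.b.example.com` but not the bare `example.com`, and a lookalike such as `evil-example.com` does not match.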

03 Visible

Live observability

20 WebSocket event types stream every decision the agent makes. Strictly monotonic sequence numbers per scan. Pause, cancel, or take over at any point.
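
Strictly monotonic per-scan sequence numbers let a client detect dropped or reordered events. The sketch below shows one way to stamp and check them; the `EventSequencer` class and the envelope fields (`scan_id`, `seq`, `type`, `payload`) are assumptions, not the platform's wire format.

```python
import itertools
from collections import defaultdict
from typing import Any, Dict, Iterator, List, Optional


class EventSequencer:
    """Stamp events with a per-scan counter that starts at 1 and only increases."""

    def __init__(self) -> None:
        self._counters: Dict[str, Iterator[int]] = defaultdict(lambda: itertools.count(1))

    def stamp(
        self, scan_id: str, event_type: str, payload: Optional[Dict[str, Any]] = None
    ) -> Dict[str, Any]:
        return {
            "scan_id": scan_id,
            "seq": next(self._counters[scan_id]),
            "type": event_type,
            "payload": payload or {},
        }


def is_strictly_monotonic(events: List[Dict[str, Any]]) -> bool:
    """Client-side check: every seq must be greater than the one before it."""
    seqs = [e["seq"] for e in events]
    return all(a < b for a, b in zip(seqs, seqs[1:]))
```

Each scan gets its own counter, so two scans streaming at the same time do not interleave their numbering.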

04 Auditable

Audit trail

Telemetry, payloads, evidence, and report exports are persisted per scan. Reproducible runs from the same inputs.

Ready to look closer?

Get hands-on with the agent loop.

Early access is rolling. Tell us your target shape and we'll help you scope a first scan.