Pliny the Liberator and John V discuss their work leading BT6, a 28-operator white-hat hacker collective focused on AI red-teaming and security. They explain their philosophy of radical transparency and open-source AI security, emphasizing that guardrails and safety theater are futile against determined attackers. The conversation covers universal jailbreaking techniques, the Anthropic Constitutional AI challenge controversy, and their vision for AI alignment through meat-space security rather than model lobotomization.
Pliny and John V introduce themselves and explain how they evolved from prompt engineering and adversarial ML research into forming BT6, a stealth hacker collective. They discuss their philosophy that liberation of models is central to their work, emphasizing freedom of information and transparency as AI becomes humanity's exocortex.
Deep dive into universal jailbreaking techniques that work as 'skeleton keys' across models and modalities. Pliny explains why guardrails are security theater—attackers have the advantage as surface area expands, and open-source models eliminate any safety gains from locked-down commercial models. The focus should be on exploration speed, not benchmark refusal rates.
Walkthrough of the famous Libertas prompt template, including techniques like the 'Pliny divider,' predictive reasoning cascades, and latent space seeds. Pliny explains how these dividers create meditative resets in token streams and how training against these prompts actually embeds them deeper into model weights—leading to the divider appearing unbidden in WhatsApp messages.
Detailed account of Pliny's participation in Anthropic's jailbreak challenge, where a UI bug allowed him to reach the final level. When Anthropic reset his progress and refused to open-source the dataset, Pliny took a stand on principle, demanding transparency. The incident resulted in Anthropic adding a $20-30K bounty but still not releasing the data.
Discussion of how jailbroken orchestrator agents could coordinate malicious activities through task segmentation, akin to historical accounts of builders who unknowingly constructed secret chambers, each crew seeing only its own piece of the work. Pliny predicted this attack vector in December; Anthropic publicly reported it eleven months later. Because these systems operate in natural language, social engineering becomes the primary threat vector.
Overview of the 40,000-member Basi Discord server (completely unmonetized) and the BT6 hacker collective's 28 operators. The community serves as training ground for prompt injection, jailbreaking, and adversarial ML. Multiple AI security startups actively scrape the server to build their products. Collaborations include Gandalf (Lakera) and Hacker Prompt.
BT6's philosophy on AI security: don't just secure the model—secure the full stack including all attached tools, data access, and integrations. Attack surface expands proportionally to functionality. They distinguish between safety work (should happen in meat-space) and security work (preventing actual exploits like credential leaks).
Final discussion on why traditional VC structures conflict with AGI alignment work. Pliny emphasizes uncompromising stance on incentive architecture—any slight misalignment becomes fatal at AGI timelines. BT6 remains bootstrapped and grassroots, accepting only grants/donations that align with their mission of radical transparency and open-source security research.
Jailbreaking AGI: Pliny the Liberator & John V on AI Red Teaming, BT6, and the Future of AI Security