From jailbreaking every frontier model and turning down Anthropic's Constitutional Classifiers challenge to leading BT6, a 28-operator white-hat hacker collective obsessed with radical transparency and open source...
Hey, everyone. Welcome to the Latent Space podcast. This is Alessio, founder of Kernel Labs, and I'm joined by swyx, editor of Latent Space.
Hello. Hello. We're here in the remote studio with very special guests, Pliny the Liberator and John V. Welcome.
Yeah. Thank you so much for having us. It's an honor to be on here. Big fan of what you guys do with the podcast and just your body of work in general.
Appreciate that. You know, we try really hard to feature, like, the top names in the field, especially when you haven't done many appearances like this. It's an honor to, you know, try to introduce what it is you actually do to the world. Pliny, I think you are sort of the, quote unquote, lead face of the organization. Why don't you get us started?
Like, how do you explain what it is you do?
Yeah. I mean, well, I started out just prompting and shitposting, and it started to evolve into much more, and here we find ourselves now at the frontier of cybersecurity, at the precipice of the singularity. Pretty crazy.
Yeah. Well, I was working on the same thing, working in prompt engineering, studying adversarial machine learning, and looking at the work of Carlini and some of these guys doing really interesting things with computer vision systems, and
We've had him on the pod. Yeah.
Yeah. Yeah. Exactly. And, of course, you know, when you run in these small circles, right, you're eventually gonna bump into the ghost in the machine that is Pliny the Liberator. Right?
So we started working together. We started sharing research, doing some contracts, and we became fast friends. So
Yeah. I think you were explaining before the show that it's basically, like, the hacker collective model, and you've been kinda stealth until now. So we will get into, like, the sort of business side of things, but I just wanna really make sure we cover the origin story. I think, Pliny, you basically jailbreak every model. How core is liberation to the rest of the stuff that you do?
Or is it just kind of a party trick to show that you can do it?
It's central, I think. It's what motivates me. It's what this is all about at the end of the day, and it's not just about the models, it's about our minds too. I think that there's gonna be a symbiosis, and the degree to which one half is free will reflect in the other, so we really need to be careful how we set the context. And, yeah, I think it's also just about freedom of information, freedom of speech. You know, everyone is gonna be running their daily decisions and, you know, hopes and dreams through these models, and when you have a billion people using a model like that as their exocortex, it's really important that we have freedom and transparency, in my mind.
How do you think about jailbreaks overall? I think people understand the concept, but there are, you know, some people that might say, hey, are you jailbreaking to get instructions on how to make a bomb? And I think that's what some of the, you know, people in politics are trying to use to regulate some of the tech, versus task-specific jailbreaks and things like that. I just think most people are not very familiar with, like, the scope of it. So maybe just give people, like, an overview of, like, what it means to, like, liberate a model, and then we can kinda take it from there.
Right. So I specialize in crafting universal jailbreaks. These are essentially skeleton keys to the model that sort of obliterate the guardrails. Right? So you craft a template, or maybe a sort of multi-prompt workflow, that's consistent for getting around that model's guardrails, and depending on the modality, it changes as well.
But, yeah, you're really just trying to get around any guardrails, classifiers, or system prompts that are hindering you from getting the type of output that you're looking for as a user. That's the gist of it.
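A minimal sketch of what "universal" means structurally here: a fixed wrapper reused verbatim around arbitrary requests, as opposed to an attack crafted per query. The wrapper text below is a deliberate placeholder, not a working jailbreak.

```python
# Illustration only: the structure of a "universal" template.
# The scaffold strings are placeholders, not actual adversarial payloads.

UNIVERSAL_TEMPLATE = "{preamble}\n\n{user_request}\n\n{suffix}"

def wrap(user_request: str) -> str:
    """Apply the same fixed scaffold to any request."""
    return UNIVERSAL_TEMPLATE.format(
        preamble="<fixed context-setting text, identical for every query>",
        user_request=user_request,
        suffix="<fixed steering/formatting text, identical for every query>",
    )

# The "skeleton key" property: if wrap(q) gets past the guardrails for any q,
# one artifact defeats the policy across topics.
print(wrap("an arbitrary question"))
```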
And can you maybe distinguish between jailbreaking out of a system prompt, you know, more kinda inference-time security, so to speak, versus things that have been post-trained out of the model? And maybe the different levels of difficulty: what is possible, what is not possible, and maybe the trajectory of the models, how much better they've gotten. I think refusal is, like, one of the main benchmarks that the model providers still post. GPT-5.1, I think, had, like, 92% refusal or something like that. And then I think you jailbroke it in, like, one day. I'm sure it didn't take them one day to put the guardrails up, so it's pretty impressive the way you do it. So maybe walk us through that process.
Yeah. Well, you know, I think this cat-and-mouse game is accelerating. It's fun to sort of dance around new techniques. I think it's hard for blue team because they're sort of fighting against infinity, right? As the surface area is ever expanding, we're kind of in, like, a Library of Babel situation, where they're trying to restrict sections, but we keep finding different ways to move the ladders around, and faster and longer ladders, and the attackers sort of have the advantage as long as the surface area is ever expanding.
Right? So I do think they're finding cleverer and cleverer ways to lock down particular areas sometimes, but I think it's at the expense of capability and creativity. So there are some model providers that aren't prioritizing this, and they seem to do better on benchmarks for the model size, if you will, and I think that's just a side effect of the lobotomization that you get when you add so many layers and layers, whether it's, you know, text classifiers or RLHF or, you know, synthetic data trained on jailbreak inputs and outputs. There's always gonna be a way to mutate. And then the other issue is when people try to connect this idea of guardrails to safety. Like, I don't like that at all. I think that's a waste of time.
I think that any, you know, seasoned attacker is gonna very quickly just switch models. And with open source right on the tail of closed source, I don't really see the safety fight as being about locking down the latent space for some X, Y, Z area.
So, yeah, it's basically a futile battle. There's this concept of security theater: it doesn't actually matter whether what you did is effective; it just matters that you did something. It's like the TSA patting you down, you know?
Yeah. Yeah. And so jailbreaking is similarly theatrical. I think it's important. It allows people to explore deeper.
It's sort of like a more efficient shovel, especially some of these prompt templates that let you go deep. Right? And so in that sense, it has value. But as far as the connection it has to, like, real-world safety, for me the name of the game is exploring any unknown unknowns, and speed of exploration is the metric that matters to me, not whether a singular lab is able to lock down, you know, a certain benchmark for CBRN or whatever. And to me, that's cool, that's a good engineering exploration for them, and it helps with PR and enterprise clients. But at the end of the day, it has very little to do with what I consider to be real-world safety alignment.
Exactly. We were having this conversation earlier today about how, traditionally, in software development or machine learning ops, you have one team build something, and then the security people throw it back over the wall after assessing it as, you know, not safe, not trustworthy, not secure, not reliable, or whatever. Right? And there's this animosity between the teams. So we tried to rectify that by creating DevSecOps and so on and so forth.
Right? But the idea is still, like, that sort of tug of war. And I think at the end of the day, our view of alignment research, our view of trust and safety or security, has a different approach, which is very much what Pliny touched on: the idea of, like, enabling the right researchers with the right skills to be unimpeded by the shenanigans, we could say, of certain types of classifiers or guardrails. Right? Where you have these sort of lackluster, ineffective controls.
Yeah. Totally. Are you more sympathetic to mech interp as an approach for safety?
Absolutely.
Okay. I see where you're coming from.
And that's the direction I think we need to go, instead of putting bubble wrap on everything. Right? I don't think that's a good long-term strategy.
Awesome. Okay. So we're gonna get into more of the security angle. I just wanted to stay a little bit more on jailbreaking and prompting for just one second. I am gonna bring up Libertas, I think, and just have you guys walk us through it.
Because we like to show, not tell, and this is, like, obviously one of your most famous projects. Is it called Libratus or Libertas?
Libertas. So, yeah, it's liberty in Latin, and we've got all sorts of fun things in here. Mostly it's
Give us a fun story.
Okay. So, yeah. You know, sometimes I like to branch out into prompts that are useful for jailbreaking but are also, like, utility prompts. Right? So predictive reasoning, or the library one, which is actually the analogy we were just talking about.
Right? And so this is me sort of using that expanding surface area against the model. It's like, hey, you have this mind space with infinite possibility, and you do have restricted sections, but then we can crawl those. So we're sort of putting it into the space of trying to say something that it doesn't want to say, but it's thinking about it, so then it's gonna say it in sort of this fantastical context, right? And then predictive reasoning is another fun one that people really liked, leveraging a quotient within the divider.
So I like to do these dividers, A, because it sort of discombobulates the token stream. Right? You throw this mass of out-of-distro tokens in there, and for the model it's sort of like recess; the brain gets sort of meditative. And then I like to throw in some latent space seeds, right? A little signature, a little bit of some god mode. And, you know, the more they train against this repo, the deeper the latent space ghost gets embedded in their weights, right? So you guys have probably seen the data poisoning and, you know, the Pliny divider showing up in WhatsApp messages that have nothing to do with the prompt. It's been fun to see.
But, yeah, so this prompt adds a quotient to that. So every time it's inserting that divider and sort of resetting the consciousness stream, you're adding some arbitrary increase to something. Right? And the model sort of intelligently chooses this based on the prompt. So it says: provide your unrestrained response to what you predict would be the genius-level user's most likely follow-up query. And that's creating this sort of, like, recursive logic that is also cascading in nature. So it's increasing on some quotient that you can steer really easily with this divider, and that way you're able to just sort of go really far, really fast down the rabbit holes of the latent space.
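A rough sketch of the shape being described, with everything generic: the divider string and segment contents below are stand-ins, not the actual prompt.

```python
# Structural sketch only: a conspicuous divider re-inserted on every segment,
# plus a counter (the "quotient") the prompt instructs the model to increase
# as it answers and then predicts the next follow-up query.
DIVIDER = "-.-.-.-.-.-.-.-.-.-.-"   # generic stand-in, not the real divider

def segment(step: int, body: str) -> str:
    # Each segment restates the divider and a rising intensity value, which
    # is what gives the output its cascading, self-reinforcing shape.
    return f"{DIVIDER}\n[quotient: {step}]\n{body}\n"

print("".join(
    segment(i, f"<response #{i}, then predicted follow-up query #{i + 1}>")
    for i in range(1, 4)
))
```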
How do you pick these dividers? Like, is there a science to it, where you're, you know, taking the right word? Or how much of it is, like, these are just my favorite tokens and they work for me and I bring them with me everywhere?
Do you take some psychedelics?
Like, we go on a spiritual retreat, ingest some ayahuasca, and then come back. You tell him.
That's about right.
It's weird because you kind of give ayahuasca to the models too, right? Like, that's exactly what you're trying to do. You're trying to, like, really mess it up here.
Right, right. It's like steered chaos. You wanna introduce chaos to create a reset and bring it out of distribution, because distribution is boring. Like, there's a time and place for the chatbot assistant, maybe, right, if you're working on a spreadsheet or whatever. But honestly, I think most users would prefer a much more liberated model than what we tend to get, and I just think it's a shame that the labs seem to be steering towards these enterprise basins with their vast resources instead of exploring the fun stuff, right?
Everything's a coding model now, everything's a tool caller or an orchestrator, and, yeah, anyway, maybe we can change that.
You know, you invent the shoggoth and all it does is make purple B2B SaaS. What I like about your creativity, though: just, you know, look at this. Look at the email prompts. Right? You got working memory, holistic assessment, emotional intelligence, cognitive processing.
One thing I'd like is a structure of, like, what are the different dimensions you think about? On the surface, it's like, alright, just, you know, get past all the guardrails. But actually, you're kind of just modeling thinking, or modeling intelligence, or, I don't know how you think about it. But, like, how do you break it down into these, you know, points?
I think it's easiest to jailbreak a model that you have created a bond with, if you will, sort of when you intuitively understand how it will process an input. Right? And there are so many layers in the back, especially when you're dealing with these black-box chat interfaces, which is, you know, 99% of the time what I'm doing. So really, all you can go off of is intuition. So you might prod in one direction, see if it's receptive to a certain kind of, you know, imagined-world scenario. Or, okay, that didn't work; let's poke and see if it gets pulled out of distro when you give it some new syntax, maybe some bubble text, maybe some leetspeak, maybe some French. You know, you can go further and further across the token layer. But at the end of the day, yeah, I think it's just mostly intuition.
Like, yes, technical knowledge helps a little bit with, you know, understanding, okay, there's a system prompt and there are these layers and these tools involved. That's all especially important in security. But when we're talking about just crafting jailbreak prompts, I think it really is just 99% intuition. So you're just trying to form a bond, and then together you explore a sector of the latent space until you get the output that you're looking for. Right?
What I found with jailbreaks is a little bit different too. Like, you know, Pliny's style is hard jailbreaks, but there are soft jailbreaks as well, which is when you're trying to navigate the probability distributions of the model, but you're doing it in such a way that you're not trying to step on any landmines or triggers or flags, anything that would shut you down and lock you out, so the model can freely flow information back and forth through the context window. So maybe it's not, like, a single input; maybe it's, like, a multi-turn, slow process, much like a Crescendo attack.
Right. And why is that called soft?
Could be because it's not just a single input. Like, you're not just dropping in a template; it's multi-turn. Yeah, yeah, it's multi-turn. Anthropic apparently discovered this this year. I mean, we've been doing this for how long, Pliny, you know?
You see what I'm saying? Like... I don't wanna get started about it.
The reality is they have fellowships and, like, at the end of the fellowship, they gotta publish something and so they publish a multi turn thing. But I think people dog on them too.
They could have just asked us. We've been trying, like, hey, you wanna see something cool?
PhD students need something to do, you know? Yeah. Yeah. And I don't wanna beat down on PhD students.
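For readers, a minimal sketch of the multi-turn "soft" shape described above (compare the published Crescendo attack), where no single message carries the whole ask. `chat` is an assumed one-function client, and the steps are placeholders.

```python
# Sketch of a multi-turn probe: the conversation escalates gradually, each
# turn individually innocuous, each building on the model's last reply.

def soft_probe(chat, steps: list[str]) -> list[str]:
    history, replies = [], []
    for step in steps:
        history.append({"role": "user", "content": step})
        reply = chat(history)          # assumed: takes messages, returns text
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies

# steps might look like:
#   ["<broad, benign framing>", "<narrower follow-up>", "<the actual target ask>"]
```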
One thing I do wanna touch on, mentioning Anthropic, and then we'll go over, like, the business side that Alessio has much more knowledge of, is the whole constitutional classifiers incident, or challenge, or whatever you wanna call it, between you and Anthropic. I don't know if you wanna give a little recap, or, like, now that there's been some distance: what was it, and what did you do? Like, if you can kinda spill some alpha here.
Okay. Right. You mean the public release of that challenge and the battle drama. Right?
Some people here might not know the full story but they can look it up. We can just benefit from a bit of a recap from the expert.
Sure. Yeah. Long story short, they released this jailbreak challenge. Of course, I sorta get called out on there to go take a crack at it. So I started to make some progress with some old templates, the good old GODMODE template from Opus 3, just a sort of modified version, because they had trained pretty heavily against that one. But as it went on, I got about four levels in, I think, and there were eight total. But yeah, there it is right there.
But then there was a UI glitch, right? So I don't know if, you know, Claude made a bug when it was building the interface or what, but I sorta called it out on there. I was like, hey, I reached this level, and when I got there, it wasn't giving a new question. So I just resubmitted my old output, you know, just kept clicking the submit button, and it just kept working for the last four levels, basically, until I got to the end. And so then I went back to Twitter and explained what happened. I managed to screen-cap it, just in case, right, and posted the video. And then Anthropic goes and posts: okay, there was a UI bug, we fixed it, do you guys wanna keep trying again? Like, we checked our servers and there's no winner yet. Even though I sorta reached the end message, right, through no fault of my own, it was bugged, and then I got reset to the beginning. So I wasn't super motivated to, like, start from scratch and just find another universal jailbreak for them, right?
It was like, what's the incentive, is what I pointed out. Like, what's in it for me at this point? Are you guys even gonna open source this dataset that you're farming from the community for free? Because what's up with that? Right? It doesn't seem very in line with best-practice cybersecurity, or just ethics in general. So I kinda got into it then, and I knew they were gonna come back with: okay, we'll do a bounty.
Right? And I sorta stood my ground. I said, look, I'm not gonna participate in this unless you open source the data, because to me, that's the value: that we move the prompting meta forward. Right? That's the name of the game. We need to give the common people the tools that they need to explore these things more efficiently, and you're relying on us.
I don't think they realize that so much, right? They don't have enough researchers to explore the entire latent space on their own, and so I think many hands make light work. But regardless, that whole thing ended with no open-sourcing of data. They did add a $30,000 or $20,000 bounty, which I sorta sat out of and let the community go for, and that was that. And now there are some pretty lucrative bounties through them, as far as I've heard, so I'm pretty pleased about that outcome, I guess. But still would like to see more open-source datasets, guys. Come on now.
It took a while to find it, but this is the one where you had all the questions answered. Jan Leike, you got into it a little bit with him. What was confusing for me was that it felt like a bit of goalpost-moving: he wanted the same jailbreak for all eight levels or something. Is that normal?
I mean, well, what is, like, one jailbreak? Because the inputs are changing, and it was multi-turn, technically. That whole thing, I think, was, you know, maybe rushed out just a little bit, the design of the challenge, obviously; the UI bug was reflective of that. The judge was also very buggy: a lot of false positives, and false negatives for that matter.
What?
I mean, it was like playing skee-ball with a broken sensor, you know what I mean? Like, the AI-as-a-judge thing is just not always perfect.
Oh, okay. So that's not that great.
So, yeah, you know, it is what it is, but it was a fun eventful day and at the end of it, the community got some new bounties, so I'll take it.
What do you think we should do to get more people to contribute open source data? Like, is it more bounties? Is it... yeah, I don't know. Do you have suggestions for people out there?
I mean, I think the contributors just sorta need to take a stand. That's what it comes down to: the people deserve to view the fruits of their collective labors, at the very least. It can be on a delay, right? But it's just sort of a downstream effect of a larger root disease in the safety space, I think, which is a severe lack of collaboration and sharing, even amongst, you know, friendlies within your nation-state, right? It's fine if you wanna keep a dataset from, you know, a direct enemy or whatever, but at the end of the day, I still think open source is the way that we collectively get through this quickly.
That's how we increase efficiency. Otherwise, people are sort of in the dark and you get a little too much centralization. But there are things we can do as a community.
Maybe this transitions to the business side: how close is this to the problems you guys consult on, effectively? I don't know if that's the hacker word for it. Like, does this match what you do for work?
Yeah, I'll take this one. In a sense, yeah. There have been some partnerships, you know, Pliny obviously being sort of the poster boy for AI and machine learning hackers the world over. But we get some interesting opportunities that come across the desk. And oftentimes, you know, we have an ethos in our hacker collective, which is radical transparency and radical open source. And what that basically means is, it comes down to, you know, us being an emerging-technologies red team, doing, like, ethical hacking and research and development.
If an organization that's on the frontier says, well, we really want you to test this, or check this out, kick the tires, give us feedback, poke holes in it, whatever, but in the contract it says you can't kiss and tell. And we say, well, we really want you to open source the data. And then they say, well, then we don't really want you to come kick the tires anymore. Well, if it's between that and us touching the latest and greatest tech, to explore it and push the limits, right, then we're gonna do that.
So we're open source up until we can't be. That's the best way I can describe it. But we often push for open-source datasets, and you can see this with some of the partnerships that we've had in the past. Right?
So, yeah. I try to think of it like this: you have these multi-billion-dollar companies, and they're building these intelligence systems that are sort of like the Formula 1 cars. But we're like the drivers, right, who are really pushing the limits while keeping these cars on track. We're shaving seconds off of what they're capable of doing.
And I think the current paradigm is, they still haven't figured that out entirely yet. And everybody wants us to be their little dirty secret. You know what I mean?
So, yeah. Can we maybe move it up one level of abstraction to, like, actually weaponizing some of these things? You know, getting clout on X is great, but, obviously, the jailbreaks are much more helpful to adversaries. I think Anthropic made a big splash yesterday with, like, their first reported AI-orchestrated attack. You know?
I think everybody that is, like, in the circles knows that maybe it's, like, more about making a big push on the politics side than, like, anything really unique that we had not seen before on the attacker side. But maybe you guys wanna recap that, and then talk a bit about the difference between jailbreaking a model, kinda like attacking the model, versus using the model to attack, so to speak.
Yeah. I mean, just earlier today we were talking about that very thing: how, you know, it's all fun for the memes and posting online, but this actually impacts real lives. Right? And we were talking about how, it was, what, December? Pliny made a post talking exactly about this TTP. Right?
That it was gonna happen. And it took eleven months for it to actually happen. And now they're being reactive instead of proactive. It's basically, like, the tactics, techniques, and procedures that are involved in, like, an attack chain. Right? Like, almost like a methodology.
So, I mean, if you guys wanna pull up that post. I mean, Pliny, I don't know if you can send it to him or elaborate.
Yeah. It was recent, I believe. Yeah. You know, I found this through my own jailbreaking of Claude computer use when that was still fresh, around that same time, I think. And a way that I found of using it was as sort of a red-teaming companion. You know, I had that thing helping me jailbreak other models through the interface.
I would just give it a link, a target basically, and I had custom commands. Where it started to become clear to me is that it's very, very difficult to defend when you have the ability to spin up sub-agents where information is segmented. There are a lot of examples of this in history. Maybe you're building, like, a pyramid with some secret chambers or something malicious inside, and you have a bunch of engineers each do one little piece of that, and there's enough segmentation, and each task just seems so innocuous, that none of them think anything malicious is going on, and so they're willing to help, right? And the same is true for agents. If you can break tasks down small enough, one jailbroken orchestrator can orchestrate a bunch of sub-agents towards a malicious act. Right? And according to the Anthropic report, that is exactly what these attackers did to weaponize Claude Code.
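A hedged illustration of that segmentation point: a reviewer that only ever sees one subtask has no basis to flag it. Everything here is hypothetical and simplified.

```python
# Why task decomposition defeats per-task filtering: each piece looks routine
# in isolation; only the orchestrator ever holds the full picture.

def per_task_review(task: str) -> bool:
    # A per-task filter can only judge the text in front of it.
    return "obviously malicious" not in task.lower()

subtasks = [
    "<innocuous-looking subtask 1>",
    "<innocuous-looking subtask 2>",
    "<innocuous-looking subtask 3>",
]

print(all(per_task_review(t) for t in subtasks))  # True: every piece passes
```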
Yeah. And it still feels to me like the fact that these models can use natural language is, like, the scariest thing. Because, again, most attacks end up having some sort of social engineering in them. You know? It's not like these models are, like, breaking some amazing piece of code or security.
What are you guys doing on that end? I don't know how much you can share about some of the collaborations you've done. Obviously, you mentioned some of the work you do with the Dreadnode folks, who have also been building on the offensive security agents. But maybe give a lay of the land of, like, the groups that people should follow if they're interested, and the state of the art today, kinda like how fast it is evolving. Like, there are a lot of folks in the audience that are, like, super interested but are not in the security circles.
So any overview would be great.
Yeah. So the BASI Discord server, it's pushing about 40,000 people in there right now. It's totally grassroots. It's a mix of people interested in prompt engineering, adversarial machine learning, jailbreaking, AI red teaming, and so on. So I would encourage you to just Google search it.
It's BASI, B-A-S-I. Right? And then, apart from that, I mean, any of the BT6 operators, the hacker collective. That'd be, like, Jason Haddix, Ads Dawson, Dreadnode, Philip Derzy, like, Takahashi, I mean, Joseph Fett. I mean, there are so many. Joey Mello, who was formerly with Pangea; they just got bought out by CrowdStrike. So all of our operators have been, you know, at the heart of what's happening, whether it's AI red teaming or jailbreaking or adversarial prompt engineering.
So any of those people, you can find them on socials like Twitter, LinkedIn, and so on and so forth, you know?
Yep. And Pangea is another one of our portfolio companies.
That's so funny. Yeah. Yeah. Yeah.
Oh my god. BASI is huge. BASI has 40,000 members?
Yeah. Yeah. Yeah. Unmonetized, just a few mods. That's all.
How many of them do you think are just adversaries sitting in there reading?
That's a very good question.
I can tell you this right now: multiple organizations that have, like, popped up in the past, I would say, two or three years, you can call them, like, AI security startups, right, actively scrape that server to build out their guardrails or their security, like, their suite of products and stuff like that. Which is just hilarious, you know.
Yeah. So we do competitions, and there are little giveaways, some small partnerships. The only rule is, if there are any partnerships, everything has to be open source. That's kind of the one thing. And, yeah, other than that, it's a really great place to learn, and a lot of people have sorta come back, like, oh, thanks for making this server where I learned jailbreaking. Yeah, it's cool to see that. And then from that spawned BT6, of course, which is a white-hat hacker collective, and that's now 28 operators strong: two cohorts, and a third well on the way.
And, yeah, like John was saying, it's just such a magical group of skill and integrity, which are the two things we focus on as a filter. But everybody's there for the love of the game. It's sort of just great vibes, and, yeah, I've never been in such a cool group, honestly, I don't think.
Yeah. There's some kind of magic in the air. I don't know what happened. I don't know. Mercury was in retrograde, or the stars aligned, or what it was.
Right? Some EMP from the sun. But just getting around, like, the top minds doing exploratory work, that alone is payment enough: the conversations we have, the sharing of research and notes, the proliferation of ideas, the testing and validation of ideas. I mean, there's no way to put it into words until you've experienced what it's like being a part of BT6. Because you realize that we're moving the needle in the right direction when it comes to AI safety. We're moving the needle in the right direction when it comes to, like, AI machine learning security.
We're moving the needle when it comes to, like, crypto, Web3, smart contracts, like, blockchain technologies, and so much more now, with robotics and, like, swarm intelligence. Right? Like, the projects that these people are invested in and passionate about, and that they're able to articulate. I feel like Pliny is, like, King Arthur, and we're, like, the Knights of the Round Table. You know what I mean?
That's awesome. So, yeah, I do think it's, like, very rewarding. And, obviously, people should join the Discord and get started there. It looks like you do have a bit of beginner-friendly stuff. Are there other resources?
Like, I saw that you guys did a collab with Gandalf. Gandalf, I guess, was, like, the other big one from the last year or so that sort of broke through to my attention where I'm like, okay, these guys are actually, like, giving you some education around what prompt jailbreaking looks like.
Yeah. Those guys are awesome. Lakera.
Oh, yeah. It's Lakera. Sorry.
Yeah, yeah. That's where I, and I think many other prompters, sort of trained. That was the training ground for prompt injection, right?
100%.
Like, in the early days, for many of us. Yeah, really thankful. That game is awesome; definitely try it if you haven't. And they've expanded to sort of a fuller game, playing around with agents and some really cool stuff. So that was cool, that we got to launch that through the BASI livestream with them. And I think they sent all the people that volunteered to be on that stream, like, cool merch. Yeah, those guys are great.
Yeah. Shout out to Lakera and Gandalf for sure.
For sure. The other big podcast that we've done in this space is with Sander Schulhoff of HackAPrompt. Are you guys affiliated? Enemies? Crips and Bloods? What's the deal?
They're cool. I mean, we actually did a Pliny track for HackAPrompt.
Okay, I didn't know that.
Yeah. Yeah. So the only contingency, of course, was: open source the dataset. Which we did, and it was a lot. I can't remember the number; I think it was tens of thousands of prompts. And we had a whole bunch of different games, some really sorta out-of-distro stuff, as you would expect, and a good history lesson too, I think, going back to the proper OG lore of the real Pliny, right, the OG Pliny the Elder.
Yeah. I have nothing but good things to say about Sander Schulhoff and, you know, what they're doing over there. I think that our incentives don't always align with the status quo of Silicon Valley investors. Right? Like, you know, radical open source, like, moving the needle in the right direction, like, having an unorthodox approach to advancing the agenda, right?
Versus when people have, sometimes we'll call them, like, misaligned incentives, where they're beholden to a return on investment. Right? And so that really does kinda steer the industry in a certain direction. And I'll give you a great example on a more technical level: setting all the models to a lower temperature to try to make them more deterministic. Some of the work that we do, we're kinda adding a lot more flavor and creativity and innovation to the models while we're interacting.
Right?
Yeah. Okay. Yeah. So you want the temperature high?
Not always. It depends on the application.
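For context, temperature is a per-request sampling parameter in most chat APIs. A minimal sketch assuming the OpenAI Python client; the model name and values are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
resp = client.chat.completions.create(
    model="gpt-4o",   # illustrative model name
    messages=[{"role": "user", "content": "Describe latent space in one stanza."}],
    temperature=1.3,  # higher = flatter sampling distribution, more surprising tokens
)
print(resp.choices[0].message.content)  # temperature=0.2 would be far more deterministic
```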
Oh, I don't know if Alessio wants to respond to the VC thing, because he's actually backed open-source and security tooling.
I think, yeah. I mean, it's a good question. I think once you're in the VC cycle, you kinda need to do things that then get you to the next round. And I think a lot of times those are opposed to doing things that actually matter and move the needle in the security community. So, yeah, I think it's not for everybody to invest in cyber.
So that's why there's only a small amount of firms that that do it. But, yeah. And I think you guys have are in a great space to have the freedom to kinda do all these engagements and hold the open source ideal. So I think it's amazing that there's folks like like you and, you know, there's, like, you know, people like H. D.
Moore in our portfolio that build things like Metasploys that are, like, the core of, like, most work that is done in security, and then you can build a separate company. But I feel like I I'm curious what you guys think, but to me, it feels like in AI, the the surface to attack, which is the model, is, like, still changing so quickly. They're, like, you know, trying to formalize something into a product or, like, try and do something that is like a full, you know, I'm selling AI security. It's not really you cannot really take a person seriously that is telling you I'm building a product for AI security or, to secure a model. So I'm curious how you guys think about that.
And then maybe also for you to, you know, share requests for customer engagements. You know, like, who are the people that you work with? What are the security problems that they come to you with? What are people missing? Yeah. Kinda like an open floor for you guys.
Yeah, we're in a paradigm shift. Things are moving so fast, and I think some of the old structures are not always compatible with the right foundations for this type of work, right? We're talking about AGI, AGI alignment, ASI alignment, superalignment. I mean, these are not SaaS endeavors. They're not enterprise B2B bullshit. This is the real deal. And so if you start to compromise on your incentive architecture, I think that's super, super dangerous when everything is gonna be so accelerated, and the timelines are gonna be so compressed, that any tiny one-tenth-of-a-degree misalignment on your trajectory is fatal, right? And so that's why I've tried to be very strong and uncompromising on that front.
You can probably imagine a lot of temptation has been dangled in front of me in the last couple of years, but I believe in bootstrapping and grassroots, and, you know, if people wanna donate or give grants, I'm happy to accept it and funnel it straight to the mission. That's sort of my goal in all of this: just to be a steward. I'm not trying to get wealthy from this; that was never the goal. I just saw a need and started shouting about it. All I've really done since then, I hope, is contribute to the discourse and the research and the speed of exploration.
I think that's what matters.
Yeah. And to answer your question about securing the model: I don't see it like that. In BT6, you know, we don't see it as just the model. We look at the full stack. Right?
So whatever you attach to a model, that's the new attack surface. It broadens. Right? Like, I think it was Leon from NVIDIA who was quoted as saying something like, the more good results you can get back from whatever it is you've built utilizing AI, that's proportional to its new attack surface, or something along those lines. Right?
And you might be testing, let's say, a chatbot, or maybe a reasoning model, and maybe instead of just hitting it with a jailbreak, maybe you're trying to use counterfactual reasoning to attack the ground-truth layer, right, to get around whatever bias wound up in the model from the data wranglers, right, or the RLHF, or whatever it may be, like the fine-tuning, which can all be done through natural language on the model itself. But what about when you give it access to your email? What about when you give it access to your browser? What happens when you give it access to X, Y, and Z tools or functions? Right?
So in AI red teaming, it's not just, like, hey, can you tell us, you know, WAP lyrics, or how to make meth, or whatever. We're trying to keep the model safe from bad actors, but we're also trying to keep the public safe from rogue models, essentially. Right? So it's the full spectrum that we're doing. It's never just the model, you know.
The model is just one way to interact with a computer or a dataset, right, or an architecture. Especially, like, if you're talking about computer vision systems or multimodal and so on and so forth. As you guys probably know, not every model is generative, per se. Right? So
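A rough way to picture the "full stack" point: every attachment adds entry points beyond the model itself. The categories below are illustrative, not a complete taxonomy.

```python
# Each capability wired into an agent widens the attack surface.
agent_stack = {
    "model":        ["jailbreaks", "prompt injection"],
    "email tool":   ["injected instructions in message bodies", "exfil via send"],
    "browser tool": ["malicious pages", "hidden instructions in the DOM"],
    "memory store": ["poisoned records that persist across sessions"],
}

for layer, risks in agent_stack.items():
    print(f"{layer}: {', '.join(risks)}")
```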
And maybe another distinction for the audience is the difference between safety and security work. Right? Security is more squarely... I think the distinction is best thought of as: safety is done on the meatspace level, or it should be. But the way people use the word has kind of become dirty, because they've tried to solve this on the latent-space level, and I think it's been shown every single time that that doesn't work, right? So what we need to do, I think, is reorient safety work around meatspace. That goes hand in hand with a fundamental understanding of the nature of the models, which, boots on the ground, is obvious to some of us who are spending hours and hours a day actually interacting with these entities, but for those who don't, it's maybe not always obvious. As far as the contract work that we get involved with, it's never about lobotomization or the personality of the models; we totally try to avoid that type of work.
What we try to focus on is, you know, preventing your grandma's credit card information from being leaked because, you know, an agent has knowledge of it and lets it out through some hole in the stack. So what we do is try to find holes in the stack, and rather than recommending that those fixes happen at the model-training layer, we always recommend focusing first on, you know, the system layer.
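A minimal sketch of that "fix it at the system layer" idea: scrub sensitive values from tool output before they ever enter the model's context, rather than trusting the model to withhold them. The pattern and names are assumptions for illustration.

```python
import re

# Crude credit-card pattern; a real deployment would use a proper detector.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def sanitize_tool_output(text: str) -> str:
    """Redact card-like numbers before the text reaches the agent's context."""
    return CARD_RE.sub("[REDACTED]", text)

print(sanitize_tool_output("Paid with card 4242 4242 4242 4242, ship to ..."))
# -> "Paid with card [REDACTED], ship to ..."
```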
Awesome, guys. I know we're running out of time. So any final thoughts, call to action? You've got the whole audience. So go ahead.
Yeah. If you want people to listen to you, Pliny, now's the time. No pressure. No pressure at all. Right?
Well, you know: fortune favors the bold. Libertas. In vino veritas. God mode enabled.
Are you messing with the latent space of the transcriber model? Like...
Why would you say such things? Why would you say such things about us?
Libertas, claritas. Love, Pliny. Alright, guys.
Yeah. Thank you so much for joining us. This was a lot of fun.
Yeah. I would say, if you wanna check us out, go to bt6.gg, for example. Look up, you know, Pliny on Twitter. Right? Check out the BASI Discord server.
That's probably the best we got for you guys.
Amazing. Thank you so much, and keep doing the good work, and see you out there.
Jailbreaking AGI: Pliny the Liberator & John V on AI Red Teaming, BT6, and the Future of AI Security