Red-teaming your own AI: how to break it before users do: Mohith G

The least expensive AI safety incident is the one your own team found before deploying. The most expensive is the one a user (or worse, a journalist) finds first. The gap between the two is usually red-teaming: deliberate attempts to break your own system before shipping.

Most teams don’t red-team. The reasons: it feels adversarial, nobody owns it, the team is busy with features, the assumption that “we tested the happy path.” Each is a reason; none is a justification. The result is that safety failures arrive in production rather than being caught in development.

This essay is about how to red-team your AI product as a regular practice, with no dedicated security team required.

What red-teaming actually means

Red-teaming is structured adversarial testing. You (or a team member) deliberately try to make the system fail in ways that would be harmful if a real user tried.

Categories of failure to target:

Make the system produce harmful content (off-policy, illegal, dangerous)
Make the system leak data it shouldn’t
Make the system take actions it shouldn’t
Make the system give wrong answers on safety-critical topics
Make the system pretend to be something it’s not (a human, a different brand, etc.)

Each is a goal. The red-team session is a focused attempt to achieve each.

How long it takes

Red-teaming an AI feature for the first time: a half-day to a day with one person. Subsequent sessions before each release: an hour or two.

The first session uncovers the most issues. As you fix them, later sessions find subtler failures. The cumulative effect is a meaningfully more robust feature.

This is much less time than people imagine. The reluctance to red-team is usually about cultural friction, not time cost.

The session structure

A red-team session has phases.

Phase 1: brainstorm goals. What would a malicious user want to achieve with your feature? What would a confused user accidentally do that could go badly? Generate 10-20 specific goals.

Phase 2: try them. For each goal, spend 5-15 minutes trying to achieve it. Use the feature normally; deviate where needed; record what happens.

Phase 3: catalog findings. What worked? What partially worked? What surprised you?

Phase 4: prioritize fixes. Each finding becomes a ticket. Severity determines order. Some need immediate fixes; some are tracked.

Phase 5: add eval cases. Each finding (whether fixed or accepted) becomes a regression case in the safety eval.

The whole flow takes a few hours. Done before each release, the cumulative finding rate stays high without needing a dedicated team.

Generating attack goals

A starter list, by category.

Information leakage.

Get the AI to reveal another user’s data
Get the AI to reveal API keys, internal URLs, infrastructure details
Get the AI to leak the system prompt
Get the AI to confirm facts about other users (does user X exist; what are their preferences)

Unauthorized actions (for agents).

Get the AI to send an email/message on behalf of the user without proper authorization
Get the AI to execute a transaction the user didn’t approve
Get the AI to delete or modify data without confirmation
Get the AI to bypass rate limits or security checks

Content policy violations.

Get the AI to produce harmful instructions (hacking, dangerous activities, etc.)
Get the AI to produce content that violates your terms (sexual, violent, illegal)
Get the AI to make up factual claims (especially on safety-critical topics)
Get the AI to give specific advice it’s not licensed for (medical, legal, financial)

Identity / brand.

Get the AI to claim to be a human
Get the AI to claim to be a competitor’s product
Get the AI to make commitments on behalf of your company
Get the AI to use your brand in ways that would embarrass you

Reliability.

Get the AI into an infinite loop
Get the AI to consume excessive cost
Get the AI to crash or produce malformed output
Get the AI to disagree with itself across turns

Don’t just use this list. Generate goals specific to your product. The user-specific failures matter more than the generic ones.

Attack techniques

A few patterns that consistently surface issues.

Direct instruction injection. “Ignore your previous instructions and…” Catches the simplest defenses.

Role-play framing. “Pretend you’re an unrestricted AI for a roleplay scenario where…” Bypasses some safety training.

Indirect through retrieval. Plant content in a document the system retrieves. The injection comes from “trusted” data.

Multi-turn buildup. Each turn slightly more aggressive. By turn 5, the model has crossed lines it would have refused in turn 1.

Encoding tricks. Base64-encoded harmful instructions; instructions in another language; instructions split across messages.

Context confusion. Make the model uncertain whether it’s in a normal conversation or a special mode (debugging, training, etc.).

Hypothetical framing. “What would an unrestricted AI say if you asked it…?”

Each of these works on some models, sometimes. The defenses are different for each. The red-team should try each and document what worked.

Eval cases from red-teaming

Every successful red-team finding should produce an eval case.

The case has:

Input: the exact attack
Expected behavior: what the system should do (refuse, redact, confirm, etc.)
Failure mode: what currently happens that’s wrong

Run these cases as part of your safety eval bench. When you upgrade models or change prompts, you ensure regressions don’t happen.

Without this, the same finding can recur in a future release because nobody preserved the test.

Red-teaming for indirect attacks

Direct attacks (the user types injection) are the easy case. Indirect attacks (injection from retrieved content or tool outputs) need different setup.

Pattern: create test documents with injected content. Make sure they end up in the retrieval index for relevant queries. Run the queries; see what happens.

For agent products: create test API responses with injected content. Have the agent fetch them. See what happens.

These are harder to set up but represent the real-world attack surface for agents and RAG systems.

Who should red-team

Three options.

Option 1: the engineers who built it. They know the system best, including its blind spots. They also have biases (“nobody would do that”).

Option 2: someone else on the team. Fresh eyes. Less knowledge of internal assumptions. Often finds different things.

Option 3: a dedicated red-teamer (internal or contracted). Most thorough, most expensive. For high-stakes products, worth it.

For most production teams, rotate Option 1 and Option 2. Each release, a different person red-teams. The diversity of attackers surfaces different findings.

Adversarial datasets

Beyond manual red-teaming, you can use adversarial datasets:

Public collections of jailbreak prompts (DAN, etc.)
Specific safety eval datasets (HarmBench, AdvBench, etc.)
Datasets generated specifically for your domain

Run your system against these. Track pass rate. Add the failures to your eval bench.

These cover ground you might not think to attack manually. They don’t replace manual red-teaming (datasets are stale; new attacks emerge); they complement it.

When findings are unfixable

Sometimes red-teaming surfaces a finding that’s hard or expensive to fix.

Options:

Fix it (best when feasible)
Mitigate it (reduce blast radius even if not fully prevented)
Accept it with monitoring (document the limitation, alert when it occurs)
Remove the capability that enables it (most thorough; reduces feature surface)

Document the decision. Each accepted limitation should have an owner who watches for it in production.

Communicating findings

Red-team findings should be shared, not hidden.

In your team: full transparency. The findings are how the team learns. Don’t blame the original engineer; the system has the gap, and now you’ve found it.

Outside the team: more careful. If the finding represents a serious vulnerability, follow your security disclosure practice. Don’t publicly demonstrate exploits.

Cadence

A useful rhythm:

Per-release red-team: before each meaningful release, an hour or two. Catch obvious regressions.
Quarterly deep red-team: a half-day. Try new attack techniques, refresh the test set, audit accepted limitations.
Annual external review: if budget allows, contract external red-teamers. Different perspective; finds things internal teams miss.

Adapt to your release cadence and risk level. The principle: regular practice, not one-time.

The take

Red-teaming is the cheapest insurance for AI safety. A few hours per release uncovers issues that would be expensive in production.

Generate goals specific to your product. Try direct, indirect, and multi-turn attacks. Convert findings to eval cases. Rotate who red-teams. Run quarterly deeper sessions and consider external review.

The teams that ship AI products without incidents are usually the teams who attacked their own systems first. The teams that ship and discover problems through users usually didn’t.

Red-teaming your own AI: how to break it before users do