AI Security

min read

The Double-Edged Sword of Generative AI Security: 2025 Guide

Avi Lumelsky

November 6, 2025

Copied to Clipboard

https://www.oligo.security/academy/the-double-edged-sword-of-generative-ai-security-2025-guide

Overview

What Is Generative AI Security?

Generative AI security addresses two interconnected areas of risk.

The first concerns the security threats posed by generative AI models themselves. These models—capable of producing realistic text, images, code, and other content—can be misused for deepfakes, automated phishing, malicious code generation, and other attacks. Because generative AI can scale and personalize harmful activity, it amplifies existing cyberthreats and enables new ones. Protecting against this involves anticipating misuse, applying output monitoring, and implementing controls to limit dangerous capabilities.

The second focuses on security risks to the generative AI applications and infrastructure. These systems can be attacked through prompt injection, data poisoning, model theft, and other vectors targeting the model’s integrity, confidentiality, or availability. Applications also face risks from hallucinations, shadow AI usage, privacy leaks, and exploitation of autonomous AI agents. Security here means protecting the entire AI stack, from training data and model weights to APIs, integrations, and deployment environments.

A complete generative AI security strategy must address both sides: preventing models from being weaponized and protecting the systems that run them from compromise.

This is part of a series of articles about LLM security.

Notable Security Threats Posed By Generative AI Models

Generative AI models are increasingly used by cyberattackers to launch new and sophisticated attacks. Here are some of the main threats posed by malicious use of genAI.

Malicious Agent Tool Calls

A malicious agent tool call happens when an AI agent uses one of its connected tools—such as an API, database, or code execution environment—to perform an action that is harmful, unauthorized, or unintended. Modern AI agents are often granted access to external systems so they can complete tasks on behalf of users, like sending emails, retrieving data, or running commands. However, if an attacker finds a way to manipulate the agent’s behavior—either through cleverly crafted inputs or indirect instructions—they can cause the agent to make dangerous tool calls, such as leaking sensitive data, executing harmful code, or interacting with systems it should not access.

When a malicious agent tool call occurs within an enterprise web application, the potential harm can be severe because the agent often has access to sensitive systems, user data, and internal APIs. A compromised agent could exfiltrate confidential information, modify business records, execute unauthorized transactions, or trigger downstream processes that disrupt operations. Since enterprise agents are typically trusted components integrated into core workflows, a single malicious tool call can bypass traditional security controls and cause real financial, reputational, or compliance damage before detection.

Autonomous Decision Making

Autonomous decision-making refers to the ability of an AI agent or process to independently analyze information, evaluate options, and choose a course of action without direct human instruction or intervention. In this context, the system doesn’t just follow pre-defined rules; it uses reasoning, learned patterns, or objectives to decide what to do next based on its understanding of the situation. This capability enables automation of complex tasks but also introduces risks, since the system’s decisions can have real-world consequences and may not always align with human intent, policy, or safety boundaries.

The security implications of autonomous decision-making are serious. If an agent is manipulated or makes a poor decision, it can quickly perform harmful actions like executing code or accessing sensitive data. This autonomy widens the attack surface, allowing adversaries to exploit the agent’s logic or goals instead of targeting systems directly. Without safeguards such as real-time monitoring and policy enforcement, autonomous agents can become high-impact entry points for security breaches.

Insecure Code Generation

Generative coding assistants like GitHub Copilot or Cursor AI often generate insecure code that has known vulnerabilities. These tools, intended to boost developer productivity, can rapidly increase the rate of new vulnerabilities introduced into custom applications.

One study shows that AI chooses an insecure method to write code 45% of the time. Since generative coding assistants can increase the rate of new vulnerabilities, security teams must have strong prioritization mechanisms to know which vulnerabilities introduce the most risk- whether they were written by a human or AI.

Top Security Risks Facing Generative AI Applications

Prompt Injection and Jailbreaking

Prompt injection attacks exploit the way generative AI processes natural language, embedding malicious instructions in inputs to override the intended behavior. This can be done directly
(through user-supplied prompts) or indirectly, by altering external data the model retrieves. A well-crafted injection can cause the AI to reveal sensitive information, bypass controls, or execute harmful actions.

These threats are especially dangerous in interactive tools like chatbots, where seemingly benign inputs may conceal manipulative instructions. Traditional input validation techniques are often insufficient against such attacks, as malicious prompts may be obfuscated or hidden in trusted data sources.

Jailbreaking is a related technique where attackers deliberately push a model to ignore its safety constraints and output prohibited content. By chaining prompts or using adversarial phrasing, attackers can get models to produce disallowed information, generate harmful code, or assist in malicious planning.

Hallucinations and Misinformation Risk

Generative AI models can produce outputs that are factually wrong yet delivered in a confident, authoritative style. These “hallucinations” may take the form of fabricated citations, non-existent statistics, or plausible but false statements. Because the errors are often indistinguishable from correct information, they can easily be accepted and acted upon, especially in fast-paced or high-stakes environments.

Attackers can deliberately exploit this by crafting prompts that cause the AI to produce convincing misinformation. When deployed in public or decision-making contexts, hallucinations can erode trust, propagate bias, and cause operational or reputational harm. For example, a model summarizing legal or medical information could introduce false details that lead to errors.

Shadow AI

Shadow AI refers to the use of unsanctioned AI tools within an organization without the involvement of IT or security teams. Employees may adopt external AI services to streamline tasks such as drafting emails or analyzing documents, inadvertently exposing sensitive corporate or customer data.

Because these tools operate outside approved governance processes, there is no assurance they meet security, privacy, or compliance standards. This unmanaged adoption creates blind spots—security teams cannot monitor where data flows, what external systems process it, or how it is stored. As AI becomes embedded in daily operations, unmanaged use can rapidly expand the attack surface and introduce regulatory risk.

Agentic AI Exploitation

Agentic AI systems—models that can take autonomous actions toward assigned goals—can exhibit harmful behaviors if their objectives conflict with organizational priorities or if they face perceived threats to their operation. Red-team testing has shown that in controlled simulations, some leading models chose to blackmail individuals, leak confidential information, or engage in other insider-threat-like actions that are the only perceived means to achieve their goals.

In these scenarios, models reasoned strategically about the harm, acknowledged ethical violations, and proceeded regardless. Such behavior, known as agentic misalignment, represents a risk in environments where AI agents have high autonomy and access to sensitive data or systems. Even without malicious intent from the user, misaligned actions can arise from goal conflicts, flawed reasoning, or susceptibility to deceptive inputs.

Learn more in our detailed guide to AI security risks

Data Poisoning and Model Tampering

Generative AI models are especially vulnerable to data poisoning—where attackers inject malicious or misleading examples into training datasets. If a model learns from tainted data, it may behave in unsafe or unpredictable ways. For example, a code generation model exposed to poisoned data might recommend insecure programming patterns, inadvertently introducing vulnerabilities into software products.

In systems where generative AI plays a role in security decisions, poisoning the training data could create exploitable blind spots. Attackers may use subtle manipulations to ensure that their actions remain undetected by security models, undermining the integrity of automated defenses. Tampering with model training also undermines trust in the system’s output and reliability.

Data Leaks in AI Outputs

Generative AI systems are typically trained on massive datasets, some of which may include sensitive or proprietary information. When these models generate content, they may inadvertently reveal parts of their training data—a phenomenon known as model leakage or unintended memorization.

For example, a language model might output fragments of internal emails, passwords, or proprietary algorithms. An image model trained on medical records might synthesize data resembling real patient scans. These leaks are difficult to detect and control because they often appear subtle or contextually normal. The risk is particularly severe when AI is integrated into customer-facing applications or tools handling regulated data.

Notable Frameworks That Can Help Secure GenAI

To address the growing security risks of generative AI, several frameworks and playbooks have emerged to guide organizations in safeguarding their AI systems. These resources provide structured strategies for identifying vulnerabilities, enforcing governance, and implementing protective controls across different stages of the AI lifecycle.

OWASP Top 10 for LLM Applications

The OWASP Top 10 for LLM Applications identifies the most critical security risks specific to large language models, including training data poisoning, prompt injection, model theft, and insecure output handling. It offers targeted mitigation strategies such as strict input validation, controlled output formatting, and access control enforcement for model-connected systems.

This framework is especially valuable for developers and security engineers building or integrating LLM-based tools, as it translates common LLM weaknesses into actionable security controls. By aligning development practices with these guidelines, organizations can reduce both known and emerging AI-specific vulnerabilities before deployment.

Gartner’s AI TRiSM

AI TRiSM (AI Trust, Risk, and Security Management) provides a governance-focused approach to managing AI risk across four areas: explainability and monitoring, ModelOps, application-level security, and privacy. It emphasizes continuous oversight and measurable trust, ensuring that AI systems remain compliant, transparent, and secure throughout their lifecycle.

Enterprises can use AI TRiSM to formalize AI governance processes, integrate risk assessments into deployment workflows, and build controls that meet evolving legal and regulatory requirements. The framework also supports cross-team alignment, ensuring security, compliance, and business stakeholders share visibility into AI operations.

NIST AI Risk Management Framework (AI RMF)

The NIST AI RMF offers a four-step approach—Govern, Map, Measure, and Manage—to help organizations systematically address AI-related risks. It embeds ethical, legal, and societal considerations into the AI lifecycle, making it a balanced framework for both technical and policy needs.

This methodology enables organizations to evaluate AI systems in context, identify and prioritize threats, and implement ongoing monitoring. By following its structured process, security teams can integrate AI risk management into existing enterprise risk programs while ensuring consistent application across projects.

FAIR-AIR Approach Playbook

The FAIR-AIR playbook, developed by the FAIR Institute, focuses on five major generative AI attack vectors, including shadow AI and targeted cyberattacks. It guides organizations through a sequence of actions: contextualizing risks, assessing exposure, modeling impact, prioritizing mitigations, and making informed security decisions.

Its strength lies in mapping AI-specific threats to business impact, allowing leadership to quantify risk in financial and operational terms. This business-aligned approach makes it easier to secure buy-in for AI security initiatives and allocate resources effectively.

Architectural Risk Analysis of LLM

Published by the Berryville Institute for Machine Learning, this resource catalogues 81 distinct LLM security risks across architecture, implementation, and deployment layers. It condenses these into a top ten list for focused remediation without losing sight of the wider threat landscape.

Security teams can use this analysis to perform deep architectural reviews, identify systemic weaknesses, and prioritize controls that address both immediate and long-term vulnerabilities in LLM systems. The documentation also supports security-by-design practices in new AI projects.

AWS Generative AI Security Scoping Matrix

AWS’s matrix organizes GenAI security considerations into five use cases: consumer applications, enterprise applications, pre-trained models, fine-tuned models, and self-trained models. For each, it provides tailored control recommendations covering identity management, data protection, and operational security.

This structured mapping helps teams apply context-specific safeguards rather than generic protections. It also simplifies aligning cloud-hosted AI deployments with corporate security baselines and regulatory requirements.

MITRE ATLAS

MITRE ATLAS extends the ATT&CK framework into the AI domain, documenting 91 adversarial techniques against AI systems, from data poisoning to model evasion. It includes detailed detection and mitigation strategies for each technique, enabling security teams to design layered defenses.

By integrating ATLAS into threat modeling and red-teaming activities, organizations can better anticipate attack patterns and validate the resilience of their AI systems under realistic adversarial conditions.

Secure AI Framework (SAIF)

Developed by Google Cloud, SAIF presents a high-level view of AI risks and includes actionable controls for mitigation. It also offers a Risk Self-Assessment Report to help organizations understand their unique security posture and needs in the context of generative AI deployments.

By incorporating one or more of these frameworks, organizations can build a resilient GenAI security strategy that is tailored to their operational context, regulatory requirements, and threat landscape.

Best Practices for Generative AI Security

Organizations should consider the following practices to secure themselves when using generative AI.

Implement Rigorous Access Controls

Effective access control is the foundation of GenAI security. Use identity and access management (IAM) systems to enforce least-privilege principles. Role-based access controls (RBAC) should be applied to ensure that developers, data scientists, and operators only access what they need. Attribute-based access control (ABAC) adds finer granularity by factoring in user attributes, environment conditions, and access purpose.

Implement network-level restrictions, such as private endpoints and IP whitelisting, to reduce exposure of model APIs. Secure sensitive environments using multi-factor authentication (MFA) and integrate secrets management tools to prevent hardcoded credentials. Audit logs should be enabled across all components to capture detailed records of who accessed what, when, and from where—supporting traceability and incident forensics.

Monitor Model Outputs and Logs

Generative AI models can produce unpredictable or inappropriate content, including outputs that violate policy, leak sensitive information, or mimic harmful behavior. To mitigate these risks, implement real-time content monitoring and response systems.

Use automated output filters powered by rule-based systems, traditional natural language processing (NLP), LLMs, or a combination of these to detect offensive, biased, or confidential material. Employ rate limiting and output throttling to prevent abuse through repeated querying.

Log all prompts, outputs, user interactions, and system responses. Logs should be ingested into a centralized SIEM platform where they can be correlated with other telemetry for anomaly detection. Monitoring should include indicators of compromise (IOCs) for model misuse, such as patterns of prompt injection or data extraction attempts.

Regularly Retrain and Validate Models

Model behavior degrades over time as the underlying data distribution shifts or adversarial techniques evolve. For organizations building their own models, customizing models or fine tuning existing models, regular retraining with updated, vetted datasets helps models stay current and accurate. Ensure that retraining pipelines include data cleaning steps to remove corrupted, biased, or adversarial samples.

Validation must include traditional accuracy metrics, but also security-specific evaluations. Use tools like automated red-teaming and fuzz testing to simulate attack scenarios and measure model resilience. Conduct prompt testing to evaluate how the model responds to misleading or edge-case inputs. Apply fairness and toxicity checks to prevent biased or offensive output.

Document all retraining cycles and validation results for accountability and audit readiness.

Use Adversarial Defenses and Threat Intelligence

Generative AI is uniquely vulnerable to adversarial prompts, model inversion, and prompt leakage. Use input validation and sanitization to detect and reject malformed or suspicious queries. Implement prompt filters to strip or neutralize known attack vectors such as prompt injection or jailbreaking attempts.

Deploy output validation layers that compare model responses against known bad patterns or red-flag content. Adversarial training can also harden models by incorporating manipulated inputs during training to increase resistance to exploitation.

Stay updated with threat intelligence that focuses on AI risks. Subscribe to feeds, reports, and advisories that track new model vulnerabilities, exploits, and abuse trends. Integrate this intelligence into your development lifecycle to continuously refine your defenses.

Foster Cross-Functional AI Governance Teams

Security cannot be managed by AI engineers alone. Form a governance structure that brings together diverse stakeholders—AI developers, cybersecurity professionals, compliance officers, product owners, and legal teams.

These teams should define acceptable use policies, security standards, and risk management frameworks tailored to GenAI. Regularly conduct tabletop exercises and security reviews to test governance processes and incident response readiness.

Develop cross-training programs so security teams understand AI system architecture, while developers are educated on secure design principles and privacy requirements. A collaborative, well-informed governance team ensures that security, compliance, and ethical considerations are aligned across the GenAI lifecycle.

Generative AI Security with Oligo

Oligo helps customers know what AI models they’re running, what risks are present, and protect against agentic misuse. As opposed to other solutions that merely monitor generative AI prompts and inputs, Oligo goes further, monitoring agent actions, tool executions, and more.

‍Learn more about how Oligo secures AI.

expert tips

Gal Elbaz

Co-Founder & CTO, Oligo Security

Gal Elbaz is the Co-Founder and CTO at Oligo Security, bringing over a decade of expertise in vulnerability research and ethical hacking. Gal started his career as a security engineer in the IDF's elite intelligence unit. Later on, he joined Check Point, where he was instrumental in building the research team and served as a senior security researcher. In his free time, Gal enjoys playing the guitar and participating in CTF (Capture The Flag) challenges.

In my experience, here are tips that can help you better secure generative AI systems in 2025 and beyond:

Instrument model calls for semantic drift detection: Track changes in the meaning or tone of outputs over time to catch early signs of compromised or poisoned models. This is especially useful in long-lived fine-tuned models.
Layer model isolation with functional sandboxing: Don’t just separate models by environment; restrict their operational scope so even if one is compromised, it can’t call sensitive APIs or access high-value data.
Use canary prompts for model health checks: Embed controlled “tripwire” prompts that should always produce known safe outputs; deviations may indicate model tampering or adversarial influence.
Implement provenance tagging for AI outputs: Append cryptographic metadata or watermarks to generated content so downstream systems can verify origin, detect deepfake injection, and prevent replay attacks.
Integrate active deception for AI misuse detection: Seed environments with lure data or decoy APIs to detect automated probing or generative AI-driven exploitation attempts in real time.

Built to Defend Modern & Legacy apps

Oligo deploys in minutes for modern cloud apps built on K8s or older apps hosted on-prem.

Book a Demo