Your Data Has Already Left the Building: AI Exfiltration and the SME Governance Gap

Take a look at this insightful blog article from ScotlandIS member Plaid Security’s founder, Matthew Wood, as he explores the risks to SMEs of integrating AI before implementing a structured organisational process of use.

“AI is all-knowing” is a refrain you now hear in almost every occupation. The implication of that refrain is quite serious though: if you’re running a tech SME in Scotland, your data has almost certainly already been into ChatGPT. The question is whether you know about it, whether you can do anything about it, and, perhaps most consequentially, what happens next.

The basic components of entire organisations now reside in LLMs rather than directly in the hands of the people who work there. AI use has slipped into people’s workflows so quickly that the act of using it has started to feel like process. It isn’t. There is no design, no review, no defined data boundary; there is just habit. And once habit feels like process, it gets trusted with things it was never built to handle.

Let’s be clear: this isn’t a failure of existing organisational processes. It’s the absence of process. The unsung heroes in the room tend to be the ones less excited about the latest tech than anyone else. Looking at the sea of hacks, data leaks, and general mismanagement, it’s not hard to see why those in cyber security are playing that role and are particularly concerned about the rise of AI. Take Samsung: their own engineers pasted semiconductor code, meeting recordings, and internal notes into ChatGPT’s consumer interface in a 20-day window, prompting the company to ban its own staff from the tool[1].

Samsung is not alone in this. The news is littered with these reports across every major LLM, from ChatGPT to Microsoft Copilot to Anthropic’s Claude[2]. Each of these incidents is happening in well-funded, well-organised, mature environments at respected institutions. How does that happen? The impact becomes plainer the further we drill down. Any data placed inside LLMs is a potential data leak event. Attacks, lawsuits, and disclosures have repeatedly shown just about every LLM, including the major frontier models, to be vulnerable to active exfiltration. Security researchers have demonstrated zero-click, server-side attacks capable of silently pulling sensitive data from ChatGPT’s cloud, with each disclosed vulnerability followed by a new variant once the original was patched[3]. Lawsuits do similar work through different means. In its ongoing copyright case against OpenAI, the New York Times demonstrated that ChatGPT would reproduce verbatim text from Times articles when prompted with just the first few words or sentences of those same articles[4]. The mechanism isn’t theoretical; academic research has demonstrated that production LLMs can be made to regurgitate verbatim training data through extraction attacks, even on aligned commercial systems[5]. Samsung was the headline. The behaviour was everywhere.

“But my organisation only approves one AI, and the provider tells us our data is secure.” The data tells a different story. Netskope’s 2026 Cloud and Threat Report found that 47% of employees using generative AI at work do so through personal accounts their employer cannot see; that the volume of data sent to AI tools grew sixfold year over year; and that the average organisation now sees 223 AI-related data policy violations every month[6]. There’s a term for this: Shadow AI. And it’s growing. Employers and organisations are treating AI usage policies as fire-and-forget solutions to a problem that is both greater than they realise and quickly metastasising.

The inherent problem is structural: when no governance policy exists, the absence becomes the policy. ISACA’s 2025 research found that while 83% of European IT and cybersecurity professionals say staff are using generative AI at work, only 31% of organisations have a formal, comprehensive AI policy in place[7]. Your organisation’s data protection is quietly reduced to whether your employees are attentive enough to protect it: every chat, every upload, every API call.

If you run a Scottish SME, these cases likely wash over you like a tsunami. The incidents above happened at organisations with security budgets, dedicated legal teams, and formal governance functions. Most Scottish SMEs have none of those, and run on a significantly tighter budget. Your trade secrets are your business: not Coca-Cola or WD-40 formulas, but the proprietary algorithms, client lists, and pricing models that make your company defensible. Unlike patents or copyrights, trade secrets are in many cases protected only because they remain secret. Once disclosed, the protection collapses with the secrecy.

AI governance isn’t traditional data governance scaled down. It’s a different shape entirely. Traditional pipelines are linear: A to B to C, mapped, auditable. AI breaks that. Anything entered into an AI system is aggregated with everything else the model has seen, processed in a black box, and emitted as outputs nobody fully understands. The pipeline isn’t a pipeline anymore. And most organisational policies were designed to govern people, not systems; they tell your team how to behave, not how the AI behaves with the data your team has just handed it.
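To make the people-versus-systems distinction concrete, here is a minimal Python sketch of a control that acts on the data itself rather than on staff behaviour: a filter that redacts obviously sensitive strings before a prompt is allowed to leave the building. Everything in it is an assumption for illustration. The `redact` function, the `PATTERNS` table, and the `CLIENT-0042` style of client identifier are hypothetical, and two regular expressions are nowhere near a complete data loss prevention rule set.

```python
import re

# Hypothetical patterns for illustration only; a real deployment would need
# a far richer, organisation-specific rule set.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Assumed internal client identifier format, e.g. CLIENT-0042.
    "CLIENT_ID": re.compile(r"CLIENT-\d{4}"),
}


def redact(prompt):
    """Replace each pattern match with a [LABEL] placeholder.

    Returns the cleaned prompt and a dict counting redactions per label.
    """
    findings = {}
    for label, pattern in PATTERNS.items():
        prompt, count = pattern.subn(f"[{label}]", prompt)
        if count:
            findings[label] = count
    return prompt, findings


clean, findings = redact("Email jo@example.co.uk about CLIENT-0042 pricing")
print(clean)
print(findings)
```

A filter like this would sit between your team and whichever AI tool they use, so the boundary holds even when someone forgets the policy. It only catches what it has been told to look for, which is why it complements governance rather than replacing it.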

Nobody has fully solved this yet, and Scottish SMEs don’t need to solve it perfectly. They need to start treating it as an engineering problem rather than a paperwork one. Understand where data actually flows. Get honest about which AI tools your team is using and what they’re putting into them. Then pick one high-risk system and harden it properly, through some combination of robust training, licensing, the development of new technologies designed for AI usage, or customised prompts and models that understand what to limit. Targeting one data flow or system first gives you the foundational knowledge to identify future ones. This is a multi-stakeholder task, requiring cooperation across groups that traditionally have little to no overlap inside an organisation.
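As a starting point for understanding where data actually flows, the sketch below tallies how much each user uploads to known AI tools, based on web proxy logs. It is an illustration under stated assumptions, not a finished tool: the log line format (`timestamp user domain bytes_sent`), the `summarise_ai_traffic` function, and the short `AI_DOMAINS` list are all hypothetical, and your own proxy or firewall will export something different.

```python
from collections import defaultdict

# Illustrative, incomplete list of AI tool hostnames (an assumption;
# maintain your own list for your environment).
AI_DOMAINS = {
    "chatgpt.com",
    "chat.openai.com",
    "claude.ai",
    "copilot.microsoft.com",
    "gemini.google.com",
}


def summarise_ai_traffic(log_lines):
    """Total bytes uploaded per (user, AI domain).

    Assumed line format: '<timestamp> <user> <domain> <bytes_sent>'.
    Lines that do not match this shape are skipped.
    """
    totals = defaultdict(int)
    for line in log_lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # malformed line
        _timestamp, user, domain, bytes_sent = parts
        domain = domain.lower()
        if domain in AI_DOMAINS and bytes_sent.isdigit():
            totals[(user, domain)] += int(bytes_sent)
    return dict(totals)


sample = [
    "2025-01-01T09:00:00Z alice chatgpt.com 1200",
    "2025-01-01T09:05:00Z bob example.com 300",
    "2025-01-01T09:10:00Z alice chatgpt.com 800",
    "2025-01-01T09:12:00Z carol claude.ai 50",
]
print(summarise_ai_traffic(sample))
```

Even a rough tally like this tells you which tools are in use and who the heaviest users are, which is usually enough to choose the first system to harden. It will not see traffic from personal devices or networks outside the proxy, so treat the result as a floor rather than a full picture.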

There’s an opportunity hiding inside this risk for Scotland’s cyber community, but that’s a longer post for another day.

References

[1] Dark Reading. “Samsung Engineers Feed Sensitive Data to ChatGPT, Sparking Workplace AI Warnings.” 2023. https://www.darkreading.com/vulnerabilities-threats/samsung-engineers-sensitive-data-chatgpt-warnings-ai-use-workplace

Includes Cyberhaven research finding that employees at client companies routinely pasted source code, client data, and regulated information into ChatGPT, establishing the Samsung incident as illustrative rather than exceptional.

[2] Reco AI. “AI & Cloud Security Breaches: 2025 Year in Review.” March 2026. https://www.reco.ai/blog/ai-and-cloud-security-breaches-2025

Cross-platform overview of AI-related security incidents through 2025, covering Microsoft Copilot (EchoLeak zero-click prompt injection, June 2025), Anthropic’s Claude (GTG-1002 cyber espionage campaign disclosed November 2025), ChatGPT incidents, and AI-enabled fraud causing $200M+ in losses. Demonstrates that exfiltration risks span every major LLM platform, not ChatGPT alone.

[3] Ars Technica. “ChatGPT falls to new data-pilfering attack as a vicious cycle in AI continues.” January 2026. https://arstechnica.com/security/2026/01/chatgpt-falls-to-new-data-pilfering-attack-as-a-vicious-cycle-in-ai-continues/

Reports on Radware’s discovery of ZombieAgent, an evolution of the ShadowLeak vulnerability that allows attackers to siphon user data directly from ChatGPT servers and persist via long-term memory entries.

[4] IPWatchdog. “New York Times Hits Back at OpenAI’s Hacking Claims.” March 2024. https://ipwatchdog.com/2024/03/12/new-york-times-hits-back-openais-hacking-claims/id=174263/

Reports on the Times’ opposition brief in The New York Times Company v. Microsoft Corporation, OpenAI, et al., No. 1:23-cv-11195 (S.D.N.Y.). Quotes the Times’ own court filing describing its extraction technique: ‘The Times elicited the infringing content from OpenAI’s chatbot, ChatGPT, by prompting it with the first few words or sentences of Times articles.’

[5] Nasr, M. et al. “Scalable Extraction of Training Data from (Production) Language Models.” November 2023. https://arxiv.org/abs/2311.17035

Peer-reviewed research (Google DeepMind, University of Washington, Cornell, UC Berkeley, ETH Zürich) demonstrating that production language models including ChatGPT can be made to regurgitate verbatim training data through extraction attacks. Establishes that LLM memorisation and recovery is not theoretical but a practical, repeatable phenomenon, even on aligned commercial systems.

[6] Netskope Threat Labs. “Cloud and Threat Report: 2026.” January 2026. https://www.netskope.com/resources/cloud-and-threat-reports/cloud-and-threat-report-2026

Key findings: 47% of generative AI users access tools through personal accounts not overseen by their employer; data volume sent to AI tools grew sixfold year over year (3,000 to 18,000 prompts per month average); GenAI-related data policy violations more than doubled year-over-year, with the average organisation now recording 223 incidents per month and the top quartile exceeding 2,100.

[7] ISACA. “AI use is outpacing policy and governance, ISACA finds.” June 2025. https://www.isaca.org/about-us/newsroom/press-releases/2025/ai-use-is-outpacing-policy-and-governance-isaca-finds

European IT and cybersecurity professional survey: 83% report staff using generative AI at work (up 10 points year over year); only 31% of organisations have a formal, comprehensive AI policy in place. Captures the gap between AI use and AI governance maturity in a European/UK context.