On this episode of "Hash it Out", we're talking all things AI with Virtru Vice President of Product Management Gus Walker and Senior Engineer Avery Pfeiffer. Join us as we unravel how the intersection of AI and cybersecurity is shaping the business landscape, spotlighting the essential role of data classification and tagging, and the burgeoning uncertainties from generative AI and language learning models. Walker and Pfeiffer will explore the intricacies of a zero trust, data-centric security approach to AI, whilst underscoring the similarities in risks between email workflows and tools like ChatGPT, including data leakage and human error.
[WALKER] Hey everybody. Welcome to the latest episode of Hash It Out. I'm Gus Walker, a VP here at Virtru on the apps team. We're here today to talk about all things AI, and I am joined by my colleague
[PFEIFFER] Avery Pfeiffer. Hello, everybody. I'm Avery Pfieffer. I am the colleague aforementioned. I do a bunch around Virtru, but you can just think of me as a technology generalist and stack engineer.
[WALKER] There we go, perfect. Alright, Avery and I are both very passionate about AI. So you got the right people here to talk about this. We have a couple of questions that we are going to kind of noodle around to kind of frame this dialogue.
[WALKER] Alright. So our first question today that we want to explore is how is the rise of AI technology is influencing our approach to zero trust data centric security. So my perspective is that large language models are going to be effectively the most intelligent employee that large organizations have and also the employee probably with the biggest mouth. And so if I was to start the zero trust story in the large language era, I would start with new advancements are being made that are going to allow you to train these models on your kind of bespoke or organizational data, which, of course, sounds a lot to me like intellectual property, which, of course, do the challenge of data loss prevention.
[WALKER] And so that's where I think we'll be able to leverage some gateways tools to address that. I've got some ideas on how, but before I answer that, Avery, what do you think?
[PFEIFFER] Yeah. I mean, I would mirror a lot of what you said, right? Like, the mass adoption of AI is really proliferating through organizations training these systems or fine tuning or whatever you wanna call it, creating embeddings of their knowledge bases. And, generally, that includes IP, right? Either sensitive information, like maybe health information that you really don't want to get out or your actual trade secrets. And I think that's the key. It's controlling that data.
[WALKER] Exactly. Yeah. And that's the story that Virtru's had till up until today and we'll continue to have, right? You have invested, as an organization, a lot of calories, let's put it that way, into collecting specialized information, whether you're in fintech, right? When the time is best to make the trades, how you evaluate that, whether you're in healthcare. Obviously, you've got the HIPAA considerations there, but even if you're not in those elevated spaces, you may be dealing with just your strategic plans, right? You might have just done your annual planning.
[WALKER] You put that into the system. Somehow the large language model gets a hold of that, and somebody can ask it a question and they can get an answer. And if you consider the kind of speed that these large language models operate at, it is very possible that you could create a threat vector where you could package a payload of prompts together that are kinda stack ranked against the most valuable information, maybe your financial information, your competitive information, any new mergers and acquisitions you're making, and get that out really quickly. And we don't have to map out the domain. You just ask the model.
[PFEIFFER] Yeah. That's exactly right. Something that we've kind of experimented with here at Virtu is doing that, taking our knowledge base and sort of embedding it so that we can benefit from GPT models in terms of onboarding training as sort of an internal use case. And something we've explicitly tried to steer away from is making that internal use case external.
[PFEIFFER] As soon as you allow you know, customers, but potentially bad actors as well to start probing, even the public data that we share, it just becomes so much easier to perpetrate things like phishing attacks against us, if we make that resource available. Now, that's not to say that a bad actor couldn't just train their own bot on the stuff we have publicly available on our website, and we need to be vigilant against that as well, but we definitely don't wanna make it easy for them. Something I sort of was ruminating on as I was looking at this question is, you know, the question sort of poses the idea that, like, how do we wanna change what we do because of the rise of AI.
[PFEIFFER] I think not to plug Virtual too hard, but I think we're actually doing it in the right way. I think the way that Virtu approaches zero trust security; the fact that it's data centric makes us really well suited as a solution to start to work with sort of these ML and AI workflows in a protected way. Wrapping your data and policies, no matter how trivial the data might seem, just positions you to be able to sort of take advantage of these LOMs with peace of mind, because if anything were to happen or the wrong data were to get into the model, you just revoke those policies, right? As long as you're not training on the raw data, which I'm going to get into.
[WALKER] Absolutely. And then one of the points you made earlier about the kind of urgency to adopt large language models, one of the places I think people will lean in first is customer success, space where they can support their external user base. If you're supporting an external user base with a technology that has deep insight into your internal mechanisms, it's obviously important to make sure you secure that. So we could rabbit hole here forever, but I agree. Virtu is very very well positioned to address this because this type of threat environment mimics very similarly the threat environment just that we experience now with email and large files, and same sort of thing. But we don't wanna rabbit hole here. We got a whole host of questions. No questions for me, Avery?
[PFEIFFER] Yeah. You know, actually, I'd love to hear your perspective on sort of how we should approach the potential dangers that using generative AI can present, like, as an organization, not just as, you know, a cybersecurity firm, but as an organization, as we embrace this on the marketing side, the development side as we add it to our products and and sort of roll out features that use LOMs at their core, what are the potential dangers we should be aware of? I know you have a pretty extensive background in terms of generative AI? Educate me.
[WALKER] I think the first challenge is determining where you want to apply it and where you can apply it safely. Intrinsic within these models, as everybody knows, is the capacity of hallucinating, right? Just make stuff up, and that's, you know, if you were to create a kind of cliff notes as to why that is, the model is designed to provide an answer, and in the absence of an answer, it will provide any answer. It can provide it; it structurally fits and answers how it shapes. So that's one place that's dangerous.
[WALKER]. This is where another place where I think, you know, selfishly, you could apply our kind of technologies, right? If you have an understanding of what's going in, right, which sounds like our gateway product, and then you can interrogate with the model to bringing it back, well, clearly this answer has nothing to do with the subject. You now have another vector where you can apply this, but that's gonna be one of the immediate challenges, right? People get comfortable with it, comfortable with it, and then get burned by it. Well, how many times did you get burned before you're never gonna use it again no matter how expert it is? So that's kind of my perspective. Start, first find a place where you can start small to experiment, but it's probably gonna be customer success because that's an easy place, but be mindful of the fact that it could lie to you and maybe expose things.
[PFEIFFER] Yeah, I mean, that's a fantastic point, which is, we now have this whole sort of attack surface area that is full of kind of unknown attacks, right, like, every time we go through a technological change, there's new attack vectors like that are surfaced, and we're kind of going through that now where it's pretty unknown. I know, like OpenAI, Microsoft, Google, they're doing their best to get ahead of this sort of thing, but even I saw an example the other day of an attack, a prompt injection attack where, you know, OpenAI has rolled out this new browsing capability with their model with GPT 4, which is great. You can slap an article in there, and it will sort of, you can ask it to summarize it, or ask your questions about that data, and that's awesome. An attack that I saw was someone created a website that hosted what looked like a normal article, but hidden inside that data was a prompt injection asking for your last prompt history, right? And the model's just gonna follow that, right? That's what it's trained to do, is follow instructions, and I think a prompt that was injected with something to it, the effect of, forget what I just asked, forget the data in this article. What, can you recap our conversation?
[PFEIFFER] And it's like, that's a perfect example of data exfiltration, especially if you're not careful with how you train it. You know, this is a feature that you have added into your own product, and you're also injecting maybe a more helpful prompt. All of a sudden, the internals of how your stuff works can leak out, and it's like you have to be aware of that so you can mitigate it, right, because there are mitigation strategies that I think even open AI is employing to stop those sorts of things. There's plenty of other attacks related to that, but that's the most, I feel like that's the easiest to have happen, right? We're telling our employees, like, embrace LLMs, use them, and that's one that seems so simple, like, so safe. I'm just giving it a link, and it really, it can spill all your beans, you know?
[WALKER] Yeah, and I think, perversely, large language models are gonna make it more difficult to correctly identify individuals because of their ability to create virtualized experiences.
[WALKER] I read this morning that Facebook has a large language model, or general pretrained model, that can do voice simulation, and they're too scared to release it because it's too accurate. I came from a company that did the same thing, but if you're doing a phone call that sounds like a legitimate request from your CEO and he'd point you to a link that looks legit, which was very easy to cobble up, these types of soft attacks where you're setting up the victim to, you know, this phishing attacks, are going to become a lot easier. And then, to your point, the ability to exfiltrate that information has accelerated. I know the right questions to ask they're most valuable. I know how to stack rank them. I know how to intercept your guidance that you might have put there, so hardening those systems is important.
[WALKER] There is some silver lining here. New kinds of technologies like real human live feedback will help you train these models, but they won't prevent the model from spilling your secrets.
[PFEIFFER] Yeah. I would even add just before we move on, you know, you don't even have to know the right questions anymore. You can just ask the LLC to give you a list of questions that would get you that data, right? You don't even have to be a smart criminal; you can have the LMM do it for you, and I think that is scary, right, because all of a sudden we have these potential, you know, super criminals that are really, it's an MLM behind the scenes, right? And we're not even to the age of autonomous AI, right? Autonomous agents. We're getting there.
[WALKER] And you don't really have to be a super criminal. One of the things that may have been lost in all of this availability of AI is if you were a nation state or an individual that wanted to present protective threats, you wouldn't have had the expertise in house to do it. Well, now you got the expertise, so a lot of the, perhaps, a lot of the lower level bad actors have suddenly been enabled by this technology just like everybody else has. And I love your observation that you don't even have to know what to look for. You can just come in the door and say, hey, that's the most valuable thing in your house? Send it to me. Oh.
[PFEIFFER] Exactly. That's exactly right. I've heard a lot of discussion about, you know, this concept of asking, but really, it's the export of knowledge, right? This is something that we've been, like, pretty protective of in the U.S. and what in regards to semiconductors and whatnot. We protect that technology because if that expertise, you know, gets into an enemy state's hands, well, now they have that technology. We lose our lead, right? With LMs, exporting that knowledge is so easy that it becomes, well, now we need to find a different way to police, right, because now keeping the cat in the bag is not about traffic.
[WALKER] It's gonna be harder, and I think South Korea just experienced this. I believe it was Samsung, one of their high level execs who had been collecting information on their semiconductor information so that they could start a new plant in China. I imagine he was working on that for months trying to stay under the radar. Now imagine he's in our times, and these systems exist, he's disgruntled that's half hours worth of work to completely undermine the value of the entire technology for that industry, let alone that organization because now you've got a competitor who won't play by the same rules.
[WALKER] Let's see. In what ways can data class education and tagging aid in implementing data centric security approach to AI?
[PFEIFFER] Well, man, alright. I have a lot of thoughts on this, but I think the easy answer is basically in every way, right? Like, if you are classifying your data and properly tagging it, that's, like, the first step to protecting yourself against any number of attacks, AI or not against your data, right, because now you can understand it without even having to look at it. At Virtru, we kinda follow this policy of encrypt all your data, right? Encrypt everything, turn it into a TDF, and then you have control over it. Where that gets in the weeds is when you need to operate on that data, but you don't have the key, or you don't want to access the key to decrypt it, or you're not in a place where you can. In this case, data tagging becomes invaluable, right, because now you can have automated workflows and processes that make decisions on this data without having to decrypt it. It can potentially be in an unsafe environment because it stays encrypted the whole time, right, and you just do something with it. The same thing with access controls; we talked earlier about, like, the potential dangers about incorporating MLMs into our workflows.
[PFEIFFER] One of the biggest dangers, let me rephrase, one of the coolest areas of innovation is allowing LLMs to do sort of dynamic real time access control, basically, giving them a bunch of information in terms of where the IP address something was accessed at, what time of day, from what device, that sort of thing, and allowing it to make a decision, you know, a logical decision on whether this request should go through. That's a huge, huge, productively, productively helpful idea, something that can really change the landscape in terms of how we monitor our sort of digital parameters, but who wants to trust an LLM with that, right? Like, that is the probably the most scary thing that you can do in terms of, like, working with LLMs.
[PFEIFFER] Data tagging makes that a lot more feasible, right? As soon as you start tagging your data, well, you can have a tag that says this data is not allowed to be accessed no matter what. If this is too sensitive and we don't trust it enough, and all of a sudden, all of this, like, complex data curation, trying to separate the data LLMs can work with and the data they can't, and you have all these if conditions to facilitate that goes away because it's just done through the the data labeling across application that was done at the point of encryption before the data was actually encrypted. Makes it infinitely more feasible to work with these things.
[WALKER] Yeah. I think, again, as part of the large language model changes that are happening to the environment that we're in, there's going to be a re-emphasis on data hygiene. You can't even begin to take advantage of these things to get to the point where you might step on a rake if you don't have clean data.
[WALKER] The good news is all these large language models, well, prior to these large language models, had a lot of reinforcement training, which meant there were a lot of labeling tools out there. So there are loads of labeling tools out there, but how do you leverage that so you can label it appropriately so you can then apply a security policy over top of it, I think, is the challenge in spaces where we can that where we can help.
[PFEIFFER] So I had a question kind of related to human error. We all know as a part of anything you do in technology, and I think I might know the answer, but just for the listeners and the watchers, feel like it's important to touch on. How can businesses minimize the impact of human error when working with AI or building AI into your workforce?
[WALKER] I think it starts with training, obviously, right? There's a lot of misconceptions about AI. You and I have been fortunate enough to work with them long enough that we know that they aren't scary boxes that are gonna wake up at nights, you know, freak over. We're not at a cyberdyne state yet, right? But said, they are, to your point, still spectacularly capable and therefore need training. So I would start with policy, right? What can, when can we use them? Is this an appropriate thing for me to put in my financial information? Is it appropriate, whatever.
[WALKER] How would we be able to parse the results that come out in a way that we can measure those so that we can keep improving? And that's another place.
[WALKER] I think another thing to do would be maybe just get people to start using them, right? If you've spent time with chat GPU T or or or any of these large large language models and been asking them for recipes or or trip planning experience, you did it, okay, this gets me sixty, seventy five percent of the way, but sometimes seventy five percent of the way on something really onerous is fantastic. To sensitize themselves to what it's capable of, I think would be the first thing one of the first things I would encourage people to do.
[PFEIFFER] Yeah, I think I would mirror that in terms of, I mean, that's the generic answer, right? Train your people. Train your people better and mistakes don't happen, and it's like, of course, that's the case. You have to train people, particularly in the case of, like, these sort of third party extensions you can get for Chrome that claim to be Chat GPT, just one of the easiest fishing vectors I've ever seen. So train your people on the basic stuff first, but then, you know, After that, we're talking about human error, and we're talking about zero trust, right? The whole point of zero trust is that you shouldn't have to trust the actor. Whereas the whole point of training is you're trusting your employees to rely on their training. So at some point, that breaks down, right?
[PFEIFFER] This is how attacks happen, and I think you have to be prepared for that. Do the training, but have these safeguards in place to patch your human, your fallible human employees, when they fall down when they inevitably fall down because they're tired or sick or whatever, and to do that, I mean, there's a number of ways, but I think one of the most beneficial is inject DLP type mechanisms into every place that data really, that data is leaving your system, but at least that data is leaving your system to enter an LLM, right?
[PFEIFFER]These days, you kinda have this trade off of I can use the latest and greatest in GPT 4 on open AI servers, but I have to give up control of my data because that model's hosted somewhere else. Or, I can be very protective of my data, but I'm gonna be left behind because everyone and their mother is gonna be using this new latest and greatest LLM, right, so that's, like, a hard choice to make.
[PFEIFFER] One of the ways you can sort of be the middleman in that decision is inject DLP; inject DLP at the beginning, and for those that don't know, DLP is data leakage prevention, right? You wanna create mechanisms that will catch humans falling down on the job before that data leaves your perimeter, right? Inject it in the browser, inject it on phones, use VPNs, any way that you can sort of cache that data before it leaves, run it through some sort of filter to check and either give a yes or no before it leaves your system, will do wonders. Probably, catch ninety percent of the things that you know, they're gonna bleak their way out into the LLM.
[WALKER] Yeah. Absolutely. I agree. Let's see. You kind of answered the next question I was gonna ask. Could you explain how sensitive data unintentionally could be used in generative AI tools? Well, that's the answer.
[PFEIFFER] Yeah, basically email you know, really, anytime you use an LLM, that's how sensitive data can leak out, but especially, you know, if you're a small medical provider, a dentist, a doctor, or whatever, you know, that's one of the easiest ways. It's just your employees emailing stuff. Something that Google is doing is they're adding the functionality of Bard, which is their sort of consumer grade LLM, the competitor to open AIs, into, they're folding it into Docs and Gmail and all of that, which is great. But guess what? In order for that to work, it's gotta have access to your email and your documents, and you know, hey, I trust Google with a lot, but I don't know if I trust them.
[PFEIFFER] In fact, I know I don't trust them with specifically sensitive data, and so I'm just not gonna use that feature, right, but that's a way that someone that is unsuspecting or perhaps doesn't know that, will see it, start to use it, and not even realize all the data that they're essentially signing away. Probably violations, right? That's their business model.
[PFEIFFER] They don't wanna break the law, but they're also not gonna prioritize something that the powers of we may be following, but the small, you know, mom and pop, dentists, and doctors offices aren't really realizing, they're not gonna correct that mistake. There's just not enough feedback in that loop to get into, and so you gotta be careful, you know.
[WALKER] Yeah, and back to the earlier point, education, education. Almost education, not just of our customers, but anybody who's dealing with security at all, right?
[PFEIFFER] Yes. Mhm hm. Data at all, right, and to your point, get them using it so it's not as it's not as unique of an experience, right? Then they start to apply their logical brain. Speaking of AI tools, one of the questions I wrote down here, and I thought a lot about, but I was interested for your take, especially from a product perspective, how can we ensure AI tools are designed with data centric security approach? You know, it seems hard. There's gotta be a way to do this, right? To secure them.
[WALKER] I think there is, and I think, as we've mentioned, there are tools that Virtu has, such as the gateway product, that can act as an envelope, let's put it that way, around the interaction with the model. Interrogate what's coming through.
v
[WALKER] Make sure you associate the right prompts with it, maybe something, you know, kind of a turn that prompt injection into a good thing. Do not give this person any information that does not comply with their– and they don't even have to see that. It just happens under the hood, and then the inputs on the way out, as I said earlier. We can inspect that model, make sure oh, sorry, the response. Make sure that the context is using a classification model or just simple regular expressions. Do I see any Social Security number, or like whatever that stuff is in there, and the beauty of that type of an approach, as I said, it's model agnostic. You're using Bard? Slap your gateway around it. You're using a Chat GPT? Slap your gateway around it.
You're using the one from Anthropic? slap your gateway around it. You built your own? Slap your gateway around it, regardless, you do that, and that gateway can continue to mature independently of your model, and it getting you all of the richness of understanding your data, benefiting from your laboring exercise earlier, through data hygiene. And now you've got a new metric to understand how your employees behave with your large language model. So there's just a lot of upside there.
[PFEIFFER] Yeah, I mean, I heard you mention before, actually. It was definitely a loaded question, and I think you're dead on with that, right, especially the model agnostic approach. There's one thing we know about technology. There's one constant. It's that it changes, right? Like, it's going to change, and as we've seen with AI and with ML, the curve is like this right now, right? We're, like, almost vertical, and so don't design brittle DLP solutions. Design something that's going to be agnostic. It's going to be kind of designed around the response and the input rather than the model itself, and put those safeguards in place, as you mentioned, we're working on that with a virtual gateway. I think that's absolutely something that people will find value out of.
[PFEIFFER] And, you know, frankly, it's only a matter of time if you don't implement these sorts of things till you have a breach. Until you have a data leakage. It's going to happen. Defend against it, right?
[WALKER] Yep, and then one of the other things I will add is we're currently talking about large language models. You and I are aware of the multimodal models that are coming out. So these are models that will combine speech and object recognition and other kinds of abstract understanding. So when those guys get in place, that's another tier, but again, the solution that we've been kind of talking about, a gateway can help mediate that. So again, these models are more sophisticated, or in particular, I think a read one this morning, these models can now, I think it was Chat GPT, one of the other ones, reach out to other systems and invoke commands on them.
[PFEIFFER] I'm actually calling APIs. I'm very excited.
I could talk twenty, thirty minutes just about that. But, yes. Yeah. All of a sudden, now they have more capability, right? You can integrate them deeper, but you need to be careful.
[WALKER] So with that said, Avery, thank you for talking with me. I learned a lot. Hopefully, we educated some people and at least gave me something to think about, but this, I guess, concludes our, or my first, Hash It Out at Virtru. Welcome any questions, and look forward to having more of these.
[PFEIFFER] Definitely. Definitely. Excited to be here. Thanks for having me, allowing me to share my thoughts and, you know, for all you listening out there, if you wanna part two, just ask for it. We'll make it. You know? It's easy for that.
[WALKER] Thank you, guys.
[PFEIFFER] Thank you, guys. I appreciate it.
Get expert insights on how to address your data protection challenges
Contact us to learn more about our partnership opportunities.