Enhancing Data Diversity and Quality

About This Episode

Join Jason Mills, VP of Solution Engineering at Snowflake, as he discusses the key data quality challenges organizations face and how they can leverage advanced strategies to ensure security, quality, and scalability.

Transcript
(This transcript has been edited for brevity.)
DAVID CARTY: When you think of meditation, surely, the first thing that you think of is flying down the track in the McLaren, going 100 mph, feeling the torque as you speed into the next hairpin turn. OK, well, maybe that’s not everyone’s idea of zen. But Jason Mills has tried many different forms of meditation, and velocity works best for him.
JASON MILLS: I call it my moving meditation. Because when you’re driving in that way, there is no distraction. Like, every second, you have to be hyper-focused on what you’re doing in the present moment. And driving has always been a passion of mine, so I’ve been able to take it to the next level on the track. As somewhat of a practicing Buddhist for many years, I’ve learned that meditation takes many forms. It can be still, but it can also be movement. But it’s really all about present awareness. And that’s what I enjoy doing. And sometimes movement at that speed helps you be completely present. You have to be.
CARTY: Jason’s car fascination goes back to his younger days with his father. But it was a COTA race in Austin that ignited his passion for F1 and Supercars, a passion he now shares with his sons.
MILLS: Always– my dad was an avid car guy– and I grew up around the passion of driving and taking long trips places. And I was just impressed with the discipline and focus around driving cars on the track at that high speed and the competitive nature of the whole sport itself. So many years after that, I got an opportunity to race on the track. And although I had a demo car– I think it was a Mazda Miata– I eventually developed a passion for jumping on the track and eventually was able to purchase two cars that I regularly put on the track. One is a McLaren 720, and the other one is a Lamborghini Huracán. So that’s been a passion hobby of mine, I would say, for the past five years. Yeah, it’s just been a great hobby. I enjoy it, spend time with my sons and other folks that I’ve met, some amazing people. So it’s part of my DNA now, and I enjoy it.
CARTY: Whether he’s talking about his favorite racer, Lewis Hamilton, or the race locations on his bucket list, Jason is always ready to rev up the racing conversation.
MILLS: Yeah, so Lewis Hamilton– I’ve always been a super fan of him and the career that he’s built and the legend that he’s become– and, really, I think he even put Formula 1 on a broader map globally. But so far, I’ve been to COTA in Austin, Texas several times. I really enjoy the city and the experience there. I also live in Miami, Florida. So I’ve been to the Miami Formula 1 Grand Prix, and this is the third year that it’s been hosted here in Miami. So that’s been fantastic. A dream would be to go to Monaco as well. So that’s on the roadmap for the next couple of years, just got to find the time to do it.
CARTY: This is the Ready Test, Go podcast brought to you by Applause. I’m David Carty. Today’s guest is speed racer and technology leader Jason Mills. Jason is the VP of Solution Engineering at Snowflake, as well as a member of the Applause board of directors.
As businesses grapple with the myriad organizational and technological challenges of the data they collect, manage, and analyze, it’s clear that an adaptive and ethical approach is necessary. Velocity is important. Jason Mills knows that. But so are risk assessment, governance, and– most importantly– doing right by humanity, a simple concept that can sometimes get lost in all of this AI hype. Here’s Jason to talk about these challenges and more.
Jason, the gravity of company data only continues to increase. But we’ve also seen some pretty sophisticated solutions rise to meet that challenge. So how has the challenge of managing data changed over the years, and how well are organizations handling that task?
MILLS: Yeah, it’s a great question. And I’ve been in this industry for nearly 30 years and have seen the transformation of data and data management over those years. First, my career in financial services– about 20 years with two major banks– and then moving into major tech companies like Google, and now Snowflake, has given me a real window into how organizations are managing their data. And so it started with many of the large-scale database platforms– Oracle, Netezza, Teradata, SQL Server, and many others– that were primarily focused on siloed data sets within an organization, which means you had an application or several applications that stored data at scale for those applications to run efficiently.
What we’ve seen over the years is a transformation where applications are becoming multi-modal, in the sense that sometimes they need access to more than one container of data. And so as applications are pointing to different sources of data, companies have had to say, well, how do we manage this? And what is the process of providing access to that data? Initially, it was on prem for the past several decades. And most recently, within the past 10 years or so, we’ve seen this shift to the cloud– first, obviously, with Amazon, then with Microsoft, and now Google and many other cloud providers– all emphasizing the need to have your data stored in the cloud. And that has provided tremendous scale for both compute and storage, because companies had hit a bit of a glass ceiling on what they could do internally in terms of improving infrastructure. You just had to buy more servers and buy more real estate to hold those servers. And it just became untenable.
So organizations now are becoming smarter about using cloud infrastructure and technology to scale out these large-scale applications that drive data. And as that transformation has occurred, the applications have gotten better. The platforms have gotten better. They’ve become faster. They’ve become smarter. They’re becoming more distributed. And with that, companies have been able to do different things and provide better services to customers. And so there are a couple of key areas that I think organizations have always had to master, whether it’s on-prem or cloud-based data sets or data platforms. The first is security. What are the frameworks that store, encrypt, and protect data sets so that the data is secure within the platform and the application? The second is, ‘How do you manage this data and the quality of that data wholesale?’ meaning the data that’s inputted by whatever application or user– how do you validate that this data is accurate and landing safely and securely? And then there’s data telemetry– that refers to, how do these systems log events of data tracked into the platform? How do we think about traceability of that data? Where did it come from, et cetera? All of these are features that, regardless of where these platforms are located, you have to master in order to provide efficient use of the data set to those applications that need it.
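To make that concrete: here is a minimal sketch, in Python with a hypothetical schema and field names, of the kind of ingest-time check and telemetry event described above– validate that a record is accurate before it lands, and log where it came from so it stays traceable.

```python
import hashlib
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingest")

REQUIRED_FIELDS = {"customer_id", "event_type", "amount"}  # hypothetical schema

def validate_record(record: dict) -> list:
    """Return a list of validation errors; an empty list means the record is clean."""
    errors = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if "amount" in record and not isinstance(record["amount"], (int, float)):
        errors.append("amount is not numeric")
    return errors

def ingest(record: dict, source: str) -> bool:
    """Validate a record and emit a telemetry event recording its lineage."""
    errors = validate_record(record)
    event = {
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "source": source,  # traceability: where did this data come from?
        "checksum": hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest(),     # tamper-evident fingerprint of the payload
        "status": "rejected" if errors else "accepted",
        "errors": errors,
    }
    log.info(json.dumps(event))  # in practice this would land in an audit table
    return not errors

ingest({"customer_id": "c-42", "event_type": "purchase", "amount": 19.99},
       source="orders-app")
```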
CARTY: You mentioned the big three cloud providers, so to speak: Amazon, Google, and Microsoft. They’ve been in a bit of an arms race, essentially, over the last 10-plus years, in terms of trying to appeal to enterprise customers with their storage and compute and data management solutions. So there’s some pretty sophisticated stuff out there, in terms of what organizations can roll into their data management toolchains. But, obviously, the processes need to try to keep up with that pace of innovation today, right? So what more can organizations be doing in that task?
MILLS: So one of the things that I do in my day job, working with some of the top companies in the United States, is help companies understand that whatever end state you’re looking for, whether it’s AI and innovation or automation, it does start with having a very strong data strategy. So the first approach is really understanding, what is your data state? Most companies, as they migrate from on-prem solutions to cloud solutions, really have to take a strong inventory of what these platforms are operating on– meaning, what systems do they use? What is the input and output of that data set, and what is it used for? And then, which systems need infinite scalability– meaning you cannot build enough, fast enough, to store that information– so you need to leverage more scalable environments, like cloud offerings?
So I think the feedback for many of the customers is to spend a lot of time prioritizing use cases and projects that matter most to their organization and deliver value. And that’s always business case driven. So central IT teams really own that process for the most part. And they are responsible for helping figure out which applications the organization needs and how they need to operate and run so that the organization can provide value to either its internal stakeholders or external customers and shareholders. And I think we’re going to see more transformation over the next several years as the cloud environments get more complex but safer, as well as more scalable, with higher compute power and a larger ability to store data. And we’re going to see more and more companies relying on these third parties to store and even process their application and data workloads.
CARTY: Businesses are becoming increasingly reliant on data as the foundation for their strategic and technological decisions. As you say, they need to deliver value from all of that data. So it stands to reason that it’s more important than ever for this data to maintain a high level of quality. What are some of the significant challenges organizations face today in maintaining high data quality, especially when it comes to dealing with large and varied data sources?
MILLS: So there are a couple of key areas that we advise customers on and work with customers on, and that is an overall data management and governance strategy. And that has several layers underneath it.
When you think about first-party data– the data set created by either the users of that application directly or the application itself– you also think about third-party data sets that exist outside of the organization and may be ingested into the application to provide additional value to the use case it’s serving.

Then you have data residency. Where does that data exist? Does it exist on prem in a location that is safe and secure? Does it exist in the cloud? What region of the world does it exist in? Are there laws that govern those regions that make the use of that data pretty impactful, meaning you have to understand PII, or Personally Identifiable Information? All of these things are critical for any organization that operates not just nationally but globally and wants to scale beyond its territory or geography.

Data privacy is also really, really critical. And we’ve seen quite a few challenges around data penetration, data risks, and data infiltration over the past several decades that have impacted not only the companies that provide these data sets but the users of those data sets. So that’s a key tenet: making sure your data strategy has a data privacy process or framework.

And then lastly is data quality. That means the data that’s input and output– what is the quality of that data? Are there any errors that have occurred that have taken that data set in the wrong direction, effectively providing the wrong answers to questions that are asked through a query? All of these things matter to an organization that needs to govern its overall data strategy. And so whether it’s a large data set that exists within the organization or a third-party data set with a variety of data sources, all of these tenets are critical to analyze. And that’s what central IT organizations, data management organizations, and chief data officers are worried about and thinking about every day.

I will say that things are getting more consolidated. I think companies are really beginning to see value in leveraging fewer data platforms than they have used in the past, but at the same time they’re concerned about making sure that there is no single point of failure, so that they are able to hedge any disruption to data sets across either a geographic region or a vendor solution. So I think all of those things matter. And I’m hearing directly from leaders in these organizations that the companies that address those concerns are the ones they want to work with. And I’m proud to be a part of a lot of those conversations and to have seen success in delivering against that.
CARTY: Speaking of providing the wrong answer to a query, there’s more of a push than ever for systems to root out biases, especially as AI grabs a firmer hold of the steering wheel. And we can’t trust AI to define our ethics for us, right? So that all starts with data diversity and quality, whether it’s through audits, improved transparency, or cross-collaboration with other departments. How is the data science community grappling with the task of reducing bias in the data that they already own and use?
MILLS: That’s a great question. And I can say honestly, I think for the most part, this is a problem that’s not been 100% solved. There are nuanced ways of going about it. But if we’re talking about AI as a broad topic– and most recently, large language models, specifically, in dealing with bias– there are certain benchmarks that exist that can enable customers, and also the companies developing these models, to make sure that a model, given a specific ingested data set, is effectively responding as expected. So in that hierarchy, you’ll have the data set itself, you’ll have a collection of questions or tasks that are asked of that data, and then you’ll have a scoring mechanism, a quality score. Is the model or the AI answering the question as expected? And so that process is one way.
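As a rough illustration of that hierarchy– a data set of questions, a model under test, and a quality score– here is a toy benchmark harness in Python. The questions, the stand-in model, and the exact-match scoring are all simplifications of what real benchmarks do.

```python
# Toy benchmark: (prompt, expected answer) pairs plus a quality score.
BENCHMARK = [
    ("What is the capital of France?", "paris"),
    ("How many days are in a leap year?", "366"),
]

def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; it always answers 'Paris' here."""
    return "Paris"

def score(model, benchmark) -> float:
    """Fraction of answers containing the expected string (case-insensitive)."""
    hits = sum(
        1 for prompt, expected in benchmark
        if expected in model(prompt).strip().lower()
    )
    return hits / len(benchmark)

quality = score(fake_model, BENCHMARK)
print(f"quality score: {quality:.0%}")  # 50% for this toy model
# A real pipeline would gate model promotion on a threshold for this score.
```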
The second way is model governance. And that is the process of evaluating how the model was trained, the parameters that were used to define the expected output or inference, and then an analysis of, let’s say, the quality of those results over time. And usually there’s a governance team connected with– but maybe even outside of– the data science community at the overall organization level. And so I think that’s going to be essential going forward. And then the last piece is third-party testing and auditing. This is emerging as a really good area to make sure that you have an outside view of how your model or your platform is performing, as it relates to artificial intelligence and machine learning, where in some cases we don’t know how it comes to an answer. But we want to make sure that we have at least a third party checking that. Applause happens to be one of the global leaders in providing those services– so I’m really, really proud to see the work that they’re doing in this space. And I think it offers tremendous value to organizations.
CARTY: Right. There’s a few different ways of validating some of that information. You can roll in experts, an expert panel, something like that. But I think, as you mentioned, there’s a place for real-world, crowdsourced data in all of this, too, right? So what’s the best way to take an efficient approach to that task, in terms of rolling in some of that real-world feedback?
MILLS: So your language here matters, right? So “real world,” I think, is a relative term to the enterprise that is looking to build those models or applications. And so, depending on the world in which they operate– financial services, healthcare, life sciences, government– there are nuanced ways of how they see the world.
In the case of something like ChatGPT and the underlying models– GPT 1 through 4, and 5, eventually– those were primarily trained on public data sets, such as Wikipedia, social media data, web pages, digital books, et cetera. The goal of creating a powerful generic model that can respond to everyday human questions or prompts, and even have a conversation, is at the heart of what we’re seeing in the news media. And that’s the thing that has generated so much excitement around artificial intelligence. But for enterprises that have these highly nuanced subjects that require detailed responses relative to their industry or their company– companies need to augment those generic models with their own sourced data.
And so to that extent, based on your question– you’re right. It could be crowdsourced, meaning you can use your own internal organization, your teams, to gather data sets or contribute to a set of data. You can also use your customers that are responding through applications or chat rooms, et cetera. But it is relevant to that organization’s or that company’s industry and the use case that they’re solving for. And that, combined with the generic models, is where we’re really heading, in terms of solving some of these real-world problems at scale. But regardless of the crowdsourced data set, you have to have a clear understanding of the use case and how the data set will drive meaningful output. That’s the critical aspect. And now, obviously, companies are also forced to think about, is it going to deliver business value? Are we reducing costs? Are we generating additional revenue? Are we providing a better customer experience? All of these conversations are happening every day today. And they are related to that real-world data collection process. So it starts with the data strategy. And then it goes into, how do you use that data strategy to effectively train the right solutions and models to deliver against the applications and use cases that you’re building?
CARTY: And when it comes to data strategy and collection, it’s especially important to gather data that represents a diverse pool of perspectives and demographics, too, right? So how proactive should teams be in this effort to procure diverse data sets, especially as we think ahead to the potential– especially in certain industries– for regulation in the coming years?
MILLS: So I think it’s twofold. One, it relates back to understanding what foundational models or solutions you’re using and then analyzing the use case that is relevant to the problem you’re solving for. And then usually teams are tasked with analyzing that data before it’s ingested, meaning if you get a third-party data set from a public source or even a third-party contract, having a clear analysis of what data exists in that data set and what data is continuously enriched into it, so you’re not ingesting garbage. There’s an old saying in the tech world: “Garbage in, garbage out.” And it certainly rings true with algorithms and models. If you train your algorithms or models on data that’s not accurate, or that has private information, or information that can be skewed or is not relevant to the question you’re answering, then you effectively have a poison pill within your environment that could spread and cause problems.
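As a simplified sketch of that pre-ingestion analysis, here is a toy Python screen that quarantines records before they reach a training set. The patterns and rules are illustrative only; production PII detection and data profiling are far more involved.

```python
import re

# Illustrative patterns only; real PII scanners cover far more cases.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def screen_record(record: dict) -> dict:
    """Flag a record before ingestion so garbage never reaches the model."""
    findings = []
    for field, value in record.items():
        if not isinstance(value, str):
            continue
        for label, pattern in PII_PATTERNS.items():
            if pattern.search(value):
                findings.append(f"{field}: possible {label}")
    return {"quarantine": bool(findings), "findings": findings}

result = screen_record({"note": "reach me at jane@example.com", "score": "7"})
print(result)  # {'quarantine': True, 'findings': ['note: possible email']}
# Quarantined records go to human review instead of the training pipeline.
```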
So what we advise customers is, have a data strategy in addition to your AI strategy. In fact, your data strategy should precede your AI strategy so that you can make sure that your approach is accurate and refinable. And as you refine these models with your data science organization or a third-party company, you can almost predict and pull the right levers to meet the exact needs that you want these applications to serve.
CARTY: What sorts of practices or guidelines can you put in place to refine and update some of that process over time? We’re talking really about organizational change here to ensure that data is high quality and remains high quality over time. So what are some ways that you can approach that at the organizational level?
MILLS: So I think chief data officers, chief privacy officers– a lot of these titles that have been created, I would say, over the past 20 years– are really essential to an organization’s idea about how they govern their information and their data. There should be, first of all, some kind of organizational policy based on the values of the company, based on the goals that they want to achieve, based on the external customers that they interface with, and based on local regulatory demands, necessities, and governance. So all of that is your starting point.
I think the way organizations can approach it differs by industry. But at the end of the day, it does come back to, are you accountable for the output that a model or an algorithm exhibits to its user base based on the data that it was given and trained on? And I think companies are starting to get better at this, because opening up this automation to a larger population of users and customers is a fairly new area. But I think the future looks bright, because we’re able to now analyze that data in real time as it’s being ingested. We’re able to refine through query processes and ultimately understand where data sets exist and then provide the right role-based access to the right user community and the right developer community so that we’re not exposing sensitive information.
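To illustrate that role-based access idea in the simplest possible terms, here is a sketch with hypothetical roles and columns; real platforms enforce this in the database or governance layer rather than in application code.

```python
# Hypothetical policy: which columns each role is allowed to read.
POLICY = {
    "analyst": {"region", "order_total"},
    "support": {"region", "customer_name"},
}

def apply_rbac(row: dict, role: str) -> dict:
    """Return the row with any column the role may not read masked out."""
    allowed = POLICY.get(role, set())
    return {col: (val if col in allowed else "***") for col, val in row.items()}

row = {"region": "EMEA", "order_total": 120.0,
       "customer_name": "Jane Doe", "email": "jane@example.com"}
print(apply_rbac(row, "analyst"))
# {'region': 'EMEA', 'order_total': 120.0, 'customer_name': '***', 'email': '***'}
```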
So all of this is to say that it gets back to the strategy that the organization is defining based on their own internal values and the company policies and, obviously, the regulations that they serve under.
CARTY: And defining and refining that policy is a challenge, I would imagine, over time, and data analytics might be helpful there. So to get into that component a little bit– different stakeholders mean different perspectives, different business needs, different measurements for success, even if you have incredible alignment across the business. Now, this makes for a tough challenge for visualizing everything that you need, especially when there’s far from a consensus on the best measurements of quality with some of the AI tools that are available today. So how can and should we be visualizing all of this data to stay on top of potential issues, whether it’s something harmful to the business or to the customer or to both?
MILLS: So there are a couple of things that companies are doing. The emergence of graph databases and graph models often helps to show both collections of data sets and their similarity patterns– their nearness to other data sets. That’s one way. And data science communities are pretty adept at using these technologies to do that. And that’s how many decisions are actually being made about which data sets to enrich, use, or not use.
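As a loose sketch of that “nearness” idea, here is a toy similarity graph over data sets in Python. The vectors are made up; in practice they would come from schema or content profiling, and the edges would live in a graph database rather than a list.

```python
import math
from itertools import combinations

# Made-up feature vectors standing in for real data set profiles.
DATASETS = {
    "orders":   [0.9, 0.1, 0.30],
    "invoices": [0.8, 0.2, 0.35],
    "web_logs": [0.1, 0.9, 0.20],
}

def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

THRESHOLD = 0.9  # illustrative cutoff for drawing an edge in the graph
edges = [
    (a, b, round(cosine(DATASETS[a], DATASETS[b]), 3))
    for a, b in combinations(DATASETS, 2)
    if cosine(DATASETS[a], DATASETS[b]) >= THRESHOLD
]
print(edges)  # [('orders', 'invoices', 0.989)] -- candidates for enrichment
```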
The second is through pure BI– business intelligence and business analytics platforms– that allow you to look at a variety of data sets before they’re ingested into a model and effectively govern what is used and what’s not.
And then the last way is effectively testing these data sets within a model and then, before the platform is moved into production, you’re doing all kinds of benchmarking against what I mentioned before. These benchmark tests are really essential to make sure that you’re providing the most accurate and predictable responses where possible to the questions or prompts that are asked.
I think overall, this will get better. We are still at an early stage of defining this by industry. We’re seeing– at least in the public space– organizations like OpenAI and Google and many others laying out some of the foundational work here. But it is up to each organization to say, does that fit into my framework? Does that fit into my organization’s value system? And how do we make sure that, before we release this into the wild, to our customers, we’re achieving the right success points? I think there’s still more to be done around this, specifically around the topic of bias and fairness. And there are new organizations– and even new laws being structured, like in the EU– that are helping to define how organizations should think about it. So in this race to automation and AI, I think we will see some governance come in from both government organizations and internal organizations that are trying to figure this out. And our hope is– and no one really knows where this is going, but our hope is that we can have a fairer world than we had before this automation, and that we can provide more insight into the decisions that are made, whether it’s fair lending or access to health care or other things. We want to see the world become a better place with these solutions and the data sets that power them.
CARTY: As opposed to the opposite– relying on old, bad data to yield, again, problematic conclusions, as has happened over time. It’s just adding a technology lens to it. You’ve got your ear to the ground in this space. So I always like to ask a future-leaning sort of question. What are some trends in the data management and analytics space that you’re watching that will ultimately touch on and help improve digital quality down the road?
MILLS: So trends that I’m seeing– I mentioned graph databases. That’s really huge, in terms of being able to see patterns of data and what decisions need to be made and how that data is used to automate applications. But the other is advanced query algorithms and ways to use natural language to query data. So in the past, most data sets stored in, say, a SQL database or a NoSQL database always required a coding language to translate or ask the machine a question about that data as it exists. And so the response was always done through a very specific query language. Well, today we are seeing more and more platforms that are able to use natural language to help users– and also developers– be better at asking questions without needing to know the correct query language that’s associated with that platform. And that’s creating differentiation globally. Because, obviously, you have multilingual communities around the world.
And as we move forward with more robust language models that are inclusive of a variety of different languages, now you can have different people ask the same question of a certain data set in different languages without needing to learn that database’s specific query language. And so the platforms that are enabling that capability are going to be the platforms that define the future of data management. I’m excited, because I happen to work for a company that does that at scale, and our customers are seeing a lot of uplift from that.
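Here is a deliberately tiny sketch of that flow– a plain-language question translated to SQL and executed. A rule-based pattern stands in for the language model, which is the piece real platforms supply, but the end-to-end shape is the same.

```python
import re
import sqlite3

def to_sql(question: str) -> str:
    """Toy natural-language-to-SQL translation; a real system would use a model."""
    m = re.match(r"how many (\w+) are there", question.strip().lower())
    if m:
        return f"SELECT COUNT(*) FROM {m.group(1)}"  # toy only: no injection checks
    raise ValueError("question not understood by this toy translator")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])

sql = to_sql("How many orders are there?")
print(sql)                              # SELECT COUNT(*) FROM orders
print(conn.execute(sql).fetchone()[0])  # 2
```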
And then the way in which that’s tested– in our experience with Applause and what they’re doing in the testing space– is really kind of honing in on the ability to provide those services writ large. So it’s an exciting time. We’re not there yet. I wouldn’t say everybody has adopted these approaches. But the early indications of these capabilities and the ability to grow is tremendous.
CARTY: Lightning round, Jason, here we go. First question for you: What is your definition of digital quality?
MILLS: There’s a science to this answer, meaning– and we’ve talked a lot about this in this conversation– how do you analyze the output of any digital signal or digital platform? And then there’s the art of it. There’s kind of like this “beauty is in the eye of the beholder” conversation. If you think about, for example, the ability of models to create images or create video, there is a nuance that we’re seeing today, where it’s not really about, how beautiful do you think it is? Because that will vary. It’s opinionated. But was it able to complete the question or complete the goal?
And so digital quality, in my opinion, is being redefined in a way that was never done before. We never thought machines could be creative. And so as we’re seeing more creative answers to questions through prompts to these large language models– as we’re seeing through gen AI with the use of image creation and video creation– we’re learning what we’ve never seen before, which is machines taking data sets– obviously human-created in the past– ingesting that data, and creating something net new. This is huge and a big advancement in AI, and it’ll only get better. But to answer your question directly, I think it depends on the question being asked.
Ultimately, my definition of digital quality is, is it serving mankind in the way that it was meant to? We set out with a goal. Did we achieve that goal? And is that making our existence better?
CARTY: What is one digital quality trend that you find promising?
MILLS: I will go back to this audio, video, and image creation. I think what’s happening in this space is really groundbreaking. And I think we need to pay attention to the ethical use of generative AI. Because, obviously, you can’t get the inferencing that we’re seeing or the results that we’re seeing without the data that it was trained on. And so there are some legal implications, even ethical implications, around how these things were trained. But once we get this right, I think it opens up new opportunities for people to live a better life. And we just have to be, I think, really careful about the fairness around it and how we’re approaching it. And it really goes back to a lot of what we discussed today: how do we structure our data collection and data governance and data strategy, if you will, in a way that the end result– the output of these models and these algorithms and generative AI– is beneficial to humanity?
And so, as I mentioned before, there are a lot of organizations– or, at least, frameworks being defined– that help companies think about the values that they want to uphold within their organization to help make sure that they’re governing this correctly. But I’m excited. I think, at the end of the day, we, as humans, need to think hard about this particular topic. And the combined effort from government, enterprise, and academia needs to continue to contribute to the way that we get to the end state, which is, obviously, again, improving humanity’s life here. So I’m excited. I’m hopeful. I like to roll up my sleeves and play a role in how it’s done. But we’re nowhere near a complete strategy yet.
CARTY: Jason, what is your favorite app to use in your downtime?
MILLS: I’m heavily into music, so I enjoy Spotify and Pandora. But I’m also playing a lot with generative AI. So ChatGPT and many of the other applications being created with underlying large language models are definitely in my wheelhouse. Image-based solutions and video-based solutions I haven’t gotten to totally yet. And audio-based solutions– I haven’t delved into those at all yet. But all of these things are exciting and definitely on my roadmap.
But at the end of the day, I think everyone’s killer app is search. And we’re seeing an emergence of the transformation of how we search for things on the internet. Obviously, Google owned this space for many, many years, with Microsoft fast following with Bing, and several other search engines. But we’re seeing now more natural-language search engines come out that are giving Google a run for their money. So I’m excited about that because it offers more value in the market, more competition and other ways to get at an answer to a question. And ultimately, I think we’re in a better place than we were, relying on a single company to provide all of our search needs. So I would say search is the killer app.
Email– we’re all still email junkies, and that’s pretty much been true since the dawn of the internet. And chat platforms– I always enjoy looking at chat platforms from an enterprise perspective. Whenever I’m working with a company that I use a service or utility from, I always try to chat on their chat platform, just to see what the quality of the communication is. And being a data guy, I can almost infer how it was trained based on some of the responses, or lack thereof. But I like engaging with how enterprises are communicating with their customers in my own personal space.
CARTY: Finally, Jason, what is something that you are hopeful for?
MILLS: So I’m hopeful that, globally, people have a healthy skepticism of artificial intelligence. I think that’s a good thing. With technology, oftentimes, we can be amazed at ourselves with what we’ve invented or created. But if you really think about what intelligence actually means, these models are nowhere near a level of intelligence of a human being. And so that skepticism really will empower companies and people and governments to begin to understand where this could potentially go.
The value of AI today is, A, augmenting human performance and– in some cases– outperforming the scale of human performance. No doctor can read every new medical journal and publication in the world in every language. A machine can. And so how do you augment that doctor or that physician with the right intelligence at the right time at their fingertips to make the right decisions on a surgical procedure, et cetera? So all of those things, I’m hopeful about. Because that improves humanity to deliver medical services and health services to the world.
And so that’s kind of where this needs to go, in my opinion, where we can continue to augment and enhance human performance. I think the concern, obviously, is where the same technologies can be used for negative ends– warfare and destruction and all of these things. A healthy, even extreme, skepticism definitely needs to be employed, and we need to make sure that we understand the potential before we unleash things that we can’t control. So, again, I do remain hopeful that human nature doesn’t want to see the destruction of its own existence. So we have to be careful about what we create. But at the same time, we do have to be cautious in the approaches that we use and continue to challenge each other. Are we doing the best thing for humanity? Are we doing things in a way that, ultimately, will have a good outcome versus a negative outcome? Can the technology be used for negative purposes that we won’t be able to unravel once it’s unleashed?
All of these questions are a part of what we should be asking each other. And as long as we remain open to that– the developers, the people like myself and engineers that are in this community– that we’re open to that criticism and that feedback and that we’re inclusive in how we’re building these things, I think we’ll end up in a better place.