Fostering Data Quality and Governance
About This Episode
Many businesses have immature data governance processes reminiscent of bad diets — garbage in, garbage out. Insights gleaned from data, whether by human eyes or AI, are only valuable if the data is readable and relevant. For too many businesses, it’s not.
Civic tech service designer, author and lecturer Lauren Maffeo joins the podcast to discuss where organizations miss the mark on data quality, a problem that costs businesses millions of dollars per year. She'll discuss findings from her book, Designing Data Governance from the Ground Up: Six Steps to Build a Data-Driven Culture, and how governance-driven development can foster shared data responsibility across the organization.
Lauren Maffeo is an award-winning civic tech service designer at Steampunk. She is the founding editor of Springer's AI and Ethics journal and an area editor for the open access journal Data and Policy. Lauren is the author of the book Designing Data Governance from the Ground Up: Six Steps to Build a Data-Driven Culture, and has also written for publications such as Harvard Data Science Review, Financial Times and The Guardian. She is also an adjunct lecturer at George Washington University and a former Gartner analyst.
(This transcript has been edited for brevity.)
DAVID CARTY: Are you a dog person? Or are you a cat person? There's no right answer. Well, I guess, it depends on who you ask. Growing up, Lauren Maffeo was more of a cat person, but in her adult life, fostering dogs has become one of her devoted passions.
LAUREN MAFFEO: I have loved animals my whole life. But it's funny, because when I was growing up in Massachusetts, my neighborhood was very cat-centric. It's hard to say when I really started getting more attached to dogs. I think, that was actually in adulthood, probably, in my 20's. And I do like both, but I would say, I'm definitely more of a dog person. I bought my own apartment, and I knew immediately that I wanted to start fostering, with intention to adopt when it was a good match.
So I started doing that two weeks after I moved into my apartment. I ended up adopting the third dog brought to me to foster. She was a senior pit bull with advanced kidney disease, and so she didn't have a lot of adoption interest. And I got pretty attached to her, so I did end up adopting Ella. She only got about eight more months with me after the adoption before passing away, but we knew that was going to happen.
I did end up adopting another dog, and so I do have one now, a three-year-old rescue. We don't know yet what his breed is, I haven't done the test yet. But he's probably like a husky, Jack Russell, pitty mix, so he's much smaller. He's got bright blue eyes. And I still foster a dog about every three months because he likes having a buddy around. So with any luck, I should be getting another foster this weekend. And if I do, that will be my 10th in two years.
CARTY: Over the last two years, Lauren has taken care of quite a few furry friends such as:
MAFFEO: Delilah was my first. She was a 1-year-old pit bull. Amy, who was a puppy, and she was a lab mix of some sort, Ella was definitely a pitty-lab mix. I did watch Spots for a while, he was an English setter. Cece, who was a Chihuahua lab retriever mix. Cleo, who was a black lab-Chihuahua mix; Lady, who was a shepherd mix of some kind, and then, there was Penny, who, I think, is maybe like a schnauzer mix. So those are my best guesses for the ones that I've had so far.
CARTY: But Lauren only goes the rescue route. With so many pets in need of forever homes, Lauren takes her responsibility seriously, as fostering just one dog can actually help save several.
MAFFEO: Shelters are very pressed for space, especially in summer months. And so that means that, if animals are not adopted, or they're not fostered, they get stuck in shelters. And if they're in a high-kill shelter, that means that they're at risk of being euthanized just for being in the shelter when they don't have a behavioral problem or any major health issues. And so then, if you foster, you get a dog out of the shelter, but you also make room for another dog to come in.
And so fostering is something I'm really passionate about encouraging people to do. I meet a lot of people who say, that's nice that you do that, I could never do that, and I don't really believe that. Obviously, not everybody has the time or space to do it, but I think-- especially for people who love animals, but don't want the full time responsibility of a pet-- fostering is perfect. And it literally does save several lives because every time you have a pup or cat come into your home, you're making space for another animal to come in and be cared for at the shelter.
CARTY: This is the Ready, Test, Go. podcast brought to you by Applause. I'm David Carty.
Today's guest is dog foster and civic tech service designer, Lauren Maffeo. Lauren's work is everywhere when it comes to data governance and policy. She is the founding editor of Springer's AI and Ethics journal and an area editor for the open access journal Data and Policy, and she's written for The Guardian and Financial Times, among many others. Her latest book, Designing Data Governance from the Ground Up, published in January. Data drives the decisions that businesses make, obviously, but what kind of data is collected? How is it processed and stored? Is it compliant with regulations and standards? And are you making the most out of it? Heck, do you even know strategically what you want to do with all that data as it keeps piling up? Well, as it turns out, these are difficult questions for many organizations to answer. Lax data governance and poor data quality can be a recipe for disaster. That's what Lauren wrote about, and that's what she talked about too.
Let's start by understanding the scope of the problem. Some of the most influential companies that we have are data-based business models, and you write about this in the book. Poor data quality costs businesses millions of dollars per year, so how did businesses get to this level of immaturity? And how pervasive is the problem?
MAFFEO: That's a great question. If you work in tech, which I know you do, you're familiar with this concept of technical debt: the idea that you get into an environment and you're responsible not just for creating something from scratch but also for fixing whatever that big existing problem is. And as somebody who works in civic tech, which is not known for being the most forward-thinking sector with regard to technology, you go in knowing that the problem is going to be bad. And then the degree to which it is bad still continues to shock me, in terms of the outdated systems used, the lack of practices, the lack of automation for very repetitive tasks, which takes a lot of time from humans that they could be putting toward other strategic initiatives.
And so when we think about this concept of technical debt, which is still very pervasive in tech and in code, now think about technical debt in the context of data, and in the context of how much data is created and consumed by organizations every day. The scope of the problem -- unwieldy doesn't even begin to describe it. It's really beyond anything we can comprehend, because not only are you ingesting huge amounts of data per day and producing new data, you also probably have millions of unique data points in your possession as an organization. And the concept of going in and trying to, quote, unquote, "govern all of them" is overwhelming on a very basic level. Now, consider that most organizations don't have the talent in-house to manage a project of that scale, even if you do have a chief data officer or a data scientist on your team, even if you hold those roles yourself. A big premise of my book is that this work is too big for one team or one person to manage on their own, which is why you really need to invest in building a data-driven culture that co-owns and co-creates governance, where everybody has some ownership of and responsibility for it. Because without that ownership and responsibility, that technical debt, as it pertains to data, is only going to grow.
And the problem is really pervasive. There's a survey I discuss in the book that polled C-level leaders at various organizations, and only one in four respondents said that their organization is data-driven. That number was down from 38% in the survey the year before. So the share of organizational leaders who say they're data-driven is going down; meanwhile, consider how much more data is being produced at a breakneck pace today. And this survey was taken well before ChatGPT came to market, and obviously the last six to eight months have completely changed the game in terms of making AI more mainstream. So these are things to consider: the amount of data being produced continues to grow exponentially, while the number of leaders who say they're data-driven is going down. And that really does present a big problem, especially as it pertains to data quality, and especially as this data becomes more widely used by the public in a wider range of contexts.
CARTY: Right. A massive problem and a growing problem, right? I mean, you just phrased it perfectly. And you mentioned a few things there that I want to touch on later in this episode as well. But let's start with maybe some of the building blocks that organizations can put in place to help build a more mature data governance plan. What does that look like? And I understand that there are some lessons we can learn from organizations' product strategy efforts, right?
MAFFEO: That's right. So you don't necessarily have to reinvent the wheel when it comes to data governance, because it can feel like a very abstract concept. It can also feel like it's purely a set of barriers to innovation. Everybody talks about legal coming in, saying you can't do that, and squashing a project. And so governance does not have a good connotation with many people in the tech world. But the reality is that if you look at what you do as an org for product strategy -- developing a roadmap, developing key tasks and milestones with owners of each milestone -- there's a lot that carries over into data. And actually, one of the biggest opportunities in big data management right now is this concept of data as a product. It's the idea that, rather than giving out data as a service -- where it's managed top down by, let's say, the CIO, who controls access to the data as well as all of the definitions, even though they might not have that domain expertise -- when you manage data as a product, you really take a domain-focused attitude toward it.
So you have defined what your key domains are, which are the areas that you collect data about thematically. So for instance, you could have a sales data domain. You could have a marketing data domain, a customer success data domain. These are the core building blocks of your business that allow you to categorize data and its associated metadata. And that ultimately allows everybody to not only find the data that they're looking for more easily, but it also gives you an opportunity to define what those data points mean in the context of your business.
And then, ultimately, what you can do is take a product mindset toward rolling that data out for use across your organization. Now, this does not mean that you have to give every single person access to every single data set. Not only is that foolhardy; it's also just not possible, really, and it's not best practice. But what you should be doing is setting up data stewards who own the different data domains. And because they're the domain experts, they're the best people to decide who should get access to their data sets, to define what the data points are, and to get that data prepared for your data engineers, who can then load it into the environment, into pipelines, and all of the technical aspects of the environment that you work in.
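Lauren's domain-and-steward model lends itself to a small sketch. The Python below is not from the book; the names and fields are hypothetical, and it's only meant to illustrate the idea that each domain has one steward who defines its data points and approves access:

```python
from dataclasses import dataclass, field

@dataclass
class DataDomain:
    """A thematic area of business data, owned by a single steward."""
    name: str
    steward: str                                      # domain expert who approves access
    definitions: dict = field(default_factory=dict)   # data point -> agreed business meaning
    approved_users: set = field(default_factory=set)

    def grant_access(self, requester: str, user: str) -> bool:
        """Only the domain's steward may approve access to its data sets."""
        if requester != self.steward:
            return False
        self.approved_users.add(user)
        return True

# Example: a sales domain with its steward and one shared definition
sales = DataDomain(
    name="sales",
    steward="avery",
    definitions={"closed_won": "Deal signed and countersigned by both parties"},
)

assert sales.grant_access("avery", "data_engineer_1")      # steward approves
assert not sales.grant_access("someone_else", "intern")    # non-steward denied
```

In a real organization this decision would live in an access-control system rather than application code; the point is simply that ownership and definitions are attached to the domain, not centralized with the CIO.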
CARTY: And you mentioned it's not just one person's responsibility, and this is where a data governance council can be useful, which I know is something that you advocate for. So can you explain what your ideal data governance council might look like and how that group can exert some influence throughout the organization?
MAFFEO: Sure, I think this is really the most critical aspect of doing data governance well for a few reasons. One is that this really has to be a top-down effort. If you look at any big initiative that succeeds in an organization, it succeeds because it has the right executive sponsorship and the right buy-in from someone at the very top. And so you can be the most senior member of the data team in your organization. You can be the chief data officer. If you don't have the backing of your CIO or your COO, your CEO, you're only going to get so far because your success really depends on the time, money, and autonomy that you are given to solve this problem. So one of the biggest things that is really important for any data governance council is to get the right executive sponsor. We're past mentorship at this point. You really need a sponsor for the council who is dedicated to making sure that it succeeds, making sure that the council is working on data projects that benefit the business and can give the resources, in both time and talent and money, to help that council succeed.
And then in terms of who should be on the council, you really want to make sure that all data stewards who are overseeing each major data domain have representation on that council, because you want to create an environment where there's cross-functional data leadership meeting on a regular basis to come to a consensus on everything from data definitions, to the charter, to approving various tools that you can bring into your stack to help manage data. So for instance, if your sales team is interested in bringing in AWS SageMaker for machine learning, that's an example of where the data governance council could come in, look at the proposed use case, and approve it.
As I describe what a data governance council could do, I'm very aware that it can sound like I'm talking about death by committee. And I think, left unchecked, it could go wrong that way. If it's not managed well, like any council, there is the risk that it just becomes a meeting point without much action or initiative taken to follow up on whatever you discuss at your council meetings. And so it really is on organizations, and on the chair of the council, to ensure that these meetings not only occur regularly but that they have key takeaways, they have notes, they have specific action items that people are expected to fulfill in between those meetings. If you think about it, this is really no different from serving on a board of directors or any other committee you've been on. And so for anybody who has nonprofit management experience, there's a great opportunity to serve as a data steward, because they can take that experience of building consensus across functions, of capturing action items, of reporting on progress. I was pleasantly surprised to see how many aspects of nonprofit management data governance councils could learn from, because nonprofits already do a lot in terms of transparency, reporting, and meetings that I think people in data should be copying.
CARTY: That's an interesting thread. I wouldn't have considered that.
You're touching on this here, but one task that the council must undertake is defining data governance principles. And there's obviously the myriad regulatory and industry standards for data, right? Those are extremely important. But there's also the contract, written or otherwise, between the company and the customer, and regulations might not move fast enough -- or at all -- to protect the customer's interests, right? You live in the Washington, DC, area; I don't have to tell you that. So how can a business collect and manage data in a way that is ethical and also clearly defined for customers?
MAFFEO: I think that transparency above all else is really key here because, as you said, to say that we have a tech-illiterate group of politicians is being kind. The innovation is always going to move 10 steps ahead of the legislation. And the reality is, I would love for it to not get to a stage where it needs to be regulated. I would love to see organizations take ownership of the data that they have, be responsible stewards of it, protect customer data. We have ample evidence to show that this is not what happens. We have ample evidence to show that when organizations' financial success rests on having data-driven business models, they are going to exploit those loopholes and do whatever they can to monetize that data at users' expense, even when users are giving away very valuable consumer data, like Social Security numbers, private photos, things like that.
And so when you think about your data council's principles, you really want to bring it back to that wider framework I talk about in chapter one of the book, where you're talking about the key areas that you want your data framework and governance council to follow. And I talk about how you don't have to invent this framework from scratch. There's already a lot out there that can help you decide how to govern your data in a more holistic way. And Gartner has a seven-step framework that I really like and use as a model in the book, because it focuses on trust; transparency and ethics; value and outcomes; risk and security; accountability and decision rights; collaboration and culture; and education and training.
And so, basically, when you're thinking about your data governance council, you want to ensure not only that all of your data domains are defined and that they have clear stewards to act as owners of the domains. You also want to tie any new project or tool back to at least one aspect of that framework. And as you see, trust is at the top of the framework. Transparency and ethics is second. If you, as an organization, are not transparent about what you're doing with consumer data, your consumers really have no reason to trust you or to think that you are stewarding it appropriately. And to date, we've been a pretty lax society about that, at least in the US. We, as a society, have very freely given away a lot of our data for convenience, because it's easier for us to give it away than to do something a bit more manually. But I think in 5 to 10 years, that landscape is going to change. I think people are going to expect more of organizations. They are going to expect more transparency about data. And while I don't think that we will necessarily ever reach the stage of GDPR legislation in Europe, which gives private citizens many more rights over their data, I still think there's no question the tide is turning, and people, overall, are getting more savvy about what data they give to whom. And so organizations really need to get ahead of that by showing consumers how they are respectable stewards of data.
CARTY: Hopefully, we'll be able to rebottle the genie there a little bit on that one. But as you write about in the book, there are actually a few examples that we can learn from the media industry. So what lessons can we learn from Blockbuster about defining a data roadmap? And if it helps, I probably still have my membership card.
MAFFEO: Wow, I'm impressed by that. There was a Blockbuster just up the road from my parents' place. And of course, it's long gone, and I lament that every time I drive by.
But Blockbuster is a very interesting story because, as we both acknowledged, it was very successful for decades. It was really a cornerstone for people who grew up in the '80s and '90s, in particular. And I open chapter four of the book talking about how Reed Hastings, who founded Netflix, actually tried to sell Netflix to Blockbuster a couple of decades ago now. He thought getting acquired by Blockbuster was the best opportunity Netflix had, because Blockbuster was the player in the video rental space, and Netflix was purely video rentals at the time. So he made an offer to the then-CEO of Blockbuster, who thought the number was too high and refused to buy. And we all know what happened after that: Blockbuster eventually went out of business completely, while Netflix continued to grow. Even though Netflix is experiencing a bit of a downturn right now, it's become a much bigger player in the video and streaming space than Blockbuster ever could have been.
And the takeaway there, I think, is that Blockbuster had a mountain of data about buying behavior, rental behavior, which people rented particular genres of films. Because they were the leader in that space, if they had seen that there was a move toward digital, they could have really acted on it and crushed Netflix before it became big. But because they didn't have that foresight, they ended up folding as an organization; they didn't innovate fast enough by acting on the resources they had. So I think that is a real cautionary tale, because it's not enough to just have all of this data in your possession. If you are not organizing it, combing it for quality, and making sure that it is used in the right context by the right people, you risk meeting the same fate as Blockbuster.
CARTY: Now, let's keep on the media thread and talk about a Netflix migration that you wrote about in the book. What can organizations learn from their approach to governance-driven development to help establish a healthy data infrastructure, both at a point in time and as standards and technology evolve into the future?
MAFFEO: Yeah, I use Netflix as a case study in chapter five of the book because when I was reading this case study about how they moved all of their data over to a streaming platform, I saw a lot of takeaways that would be really valuable for managing big data in a governance-driven environment. One example that really comes to mind is that they were experimenting with new technology. Streaming architecture was very new at the time Netflix was trying to move all of its data over to it, and the risk was very high. If you do your job right, your customers are never going to know that you did a migration, but they will, for sure, know if you mess it up. And they were dealing with new, open source technology, so the risk was high.
And something I thought this engineering manager did that was so exceptional was that he literally created a test environment where his team had room to fail. They had never worked with Kafka clusters before, as one example. He knew that if he just gave them all of the data and let them run with it, they were probably going to screw it up. So instead, he created this test environment where they could experiment with the clusters, move dummy data from one environment to the other, make mistakes, and then learn from those mistakes in time for the migration. It was not a fully pain-free migration when it did happen, but ultimately, it was very successful.
And similarly, I think, as the head of a data governance council, as a data leader in your org, you do have to allow for the fact that failures, if you will, are going to happen, especially because a big part of this work is education and training. That's one of the steps in Gartner's framework, and you really can't underestimate it. This is new for almost everyone. Even people who have been in the AI space for decades have never dealt with data at this pace, at this volume. And that does level the playing field, to some degree, because regardless of your role, this is uncharted territory. So if you are a leader, you not only have a responsibility to train your full team. You are also responsible for giving them an environment where they can learn about these aspects of data governance without being terrified that they're going to mess up and put their jobs at risk.
CARTY: Right. And you've mentioned AI now a few times. AI is advancing so quickly and has changed so much since you released this book a few short months ago. Most notably, we've seen the rise of generative AI models, which show a ton of promise but have their immediate and long-term challenges, too, right? If organizations are already fairly immature with their data governance processes, what does that mean for generative AI and other advanced AI systems that interpret all of that data?
MAFFEO: I think we can expect that the quality of those models is going to decline over time, because one of the things these models do is constantly ingest and learn from new data to refine their results -- hopefully improve their results. But we have a lot of bad data out there, which means it has not been governed. It hasn't been assessed against any quality standards, and no one is really checking whether it's right or wrong in various contexts. And AI only learns from the data that it's given. So if you give it, quote unquote, "bad data," your results are only going to be as good as that data.
And so the result is that, unless we are, as a society and as organizations, more thoughtful and strategic about how we clean, destroy, govern all of our data moving forward, I think there are going to be serious implications for the products that these models produce. If you look at ChatGPT, that's the most consumer-friendly version of generative AI on the market today, and I know people who use it all day every day for various reasons. And it can have a lot of benefits, for sure, depending on the use case. But I think what's going to happen over time is that these models, which depend on very large amounts of data-- if the quality of the data that they're fed does not increase, the quality of their results will decrease. And then, as a result, we won't make as much progress as we think we're making when it comes to AI.
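The "assessed against quality standards" idea Lauren raises can be made concrete with a toy sketch. This is not from the book, and real pipelines would use dedicated validation tooling; it just shows the shape of a quality gate that filters ungoverned records before they reach a model:

```python
def quality_gate(records, required_fields):
    """Split records into those that pass basic governance checks
    (every required field present and non-empty) and those that don't."""
    clean, rejected = [], []
    for rec in records:
        if all(rec.get(f) not in (None, "") for f in required_fields):
            clean.append(rec)
        else:
            rejected.append(rec)
    return clean, rejected

# Hypothetical sample: only the first record passes both checks
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},   # fails: empty field
    {"id": 3},                # fails: missing field
]
clean, rejected = quality_gate(records, ["id", "email"])
assert [r["id"] for r in clean] == [1]
assert len(rejected) == 2
```

The rejected pile matters as much as the clean one: tracking why records fail is what turns a filter into a governance feedback loop.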
And I also just want to say on that note that there's this ongoing conversation about AI overtaking humans, destroying the world, being as big a threat as nuclear war. I really would encourage people to think more critically, not only about those statements themselves but about who is making them, because those warnings, so to speak, are coming from many of the people who created these models in the first place. That type of language implies that the machines have minds of their own, a.k.a. that they cannot be controlled by humans. And I, again, would like people to think a little more critically about that because, at the end of the day, AI is just data; it only learns from the data that it's given, and it is only in the world insofar as it is put into the world by the humans who make it. When we assign these human characteristics to machines, we shift blame for the consequences from the people who create them to the machines themselves. And I would really encourage people not to do that, because then we stop holding accountable the people who create these solutions in the first place.
CARTY: Absolutely. So we're talking pretty potentially severe consequences if we get all of this wrong. Just one last question for you, Lauren, because I think it's always good to have different perspectives here. As a consumer and as a customer, you're as prone to data breaches as any of us. How do you protect yourself, and what recommendations would you make for anybody else?
MAFFEO: That's a great question. So I am a fan of VPNs. I bought a VPN for my personal computer and my cell phone last year. VPNs keep other people from seeing your search history and what you're doing on your phone. So if you want a little more privacy in that regard, that's important.
I would also say that you should enable two-factor authentication for every digital tool that supports it. I know that Google, for instance, allows you to activate 2FA, which basically means that when you're trying to log in to your Gmail on your desktop, it also sends a notification to your phone. So that should also be done.
And then you also want to be really careful about not clicking on links from people you don't know, because phishing is getting more and more common. It's something we deal with at my company every day; we get attempted phishing attacks, and those attempts are getting more strategic and more convincing in how they're presented. I have had people impersonate my colleagues and text me asking me to call them back or click on a link, and it was not my colleagues. It was people trying to extort information from me. So being very smart about which links you click on and where information is coming from is really crucial.

And I do think there's a lot the data world can learn from cybersecurity and from the CISOs in charge of security. We know that the biggest source of a breach at an organization is an employee who, through no ill intent, inadvertently lets in a hacker or exposes company data. As a result, we're seeing more organizations be proactive about educating their workforces to be cyber smart: showing them what a phishing attempt looks like, giving mini quizzes to test their knowledge, and tracking the results to see how they change over time. There's no reason CDOs can't be doing the same thing to create a more data-literate workforce. The specifics will vary depending on your organization, who's in it, and what type of education they need. But you shouldn't put everybody into a room for 10 hours, teach them everything about data governance, and then expect them to go on their merry way. It's really something you have to integrate into everyone's workflow over time, and that's actually a more sustainable way to go about it.
CARTY: OK, Lauren, in one sentence, what does digital quality mean to you?
MAFFEO: Digital quality means that your data is fit for its intended use and ready to be consumed by outside audiences.
CARTY: What is one software testing trend that you find promising?
MAFFEO: Well, I like hacker testing. I think trying to hack your own creations is a really cool way to get inside someone else's head and also test the strength of what you've built.
CARTY: And that goes by a few different names, right? Destructive testing, chaos engineering. There's a variety of different ways of approaching that.
CARTY: What's your favorite app to use in your downtime?
MAFFEO: Oh, that's a good question. Well, I could give a more enlightened answer, but it's Instagram. I use Instagram pretty regularly-- not just to post and look at photos but also to keep in touch with friends.
CARTY: Absolutely. And Lauren, what's something that you are hopeful for?
MAFFEO: I'm hopeful about the conversation we're having about the fact that AI does have downsides. It's a very large, extensive conversation, and I worry that it is going in many different directions rather than being more focused and concentrated. Having said that, the fact that people are aware of it and asking those questions about the possible risks does encourage me, because it shows that, hopefully, we as a society have enough vested interest in ensuring that the technology is used to benefit us rather than to harm us, either intentionally or inadvertently.