Both Generative Artificial Intelligence (e.g., output from ChatGPT) and Non-Generative AI (e.g., analytics from Machine Learning) depend on high-quality data to achieve reliable results. Because good AI depends on reliable, high-quality data, which in turn often depends on third-party vendor agreements, corporate law departments are often called upon to supervise or coordinate AI projects in a multi-department, multi-stakeholder, multi-user and multi-vendor business and technology environment. This is compounded when different corporate business units have different uses for AI. Data quality is necessary to support each of these different uses.
Overview of AI Implementation Risks
Fundamental to the role of data is understanding that poor data inevitably leads to poor AI outcomes and introduces substantial risks. The business risk is that poor data undermines rather than supports the effectiveness of the company's AI operations. The legal risks, especially with respect to Generative AI, include copyright and trademark liability. In addition, careless use of AI on a company's database can disclose the proprietary information and business strategies of the company's customers and business partners. This creates the risk of breach of contract, intellectual property infringement, and tortious interference with business relationships, as well as other combined legal and business risks.
Companies without data departments or dedicated AI teams often need to retain third-party vendors to produce data and AI services. General Counsel need to be aware that third-party vendors can introduce significant risk. For example, a vendor running its AI on company data can be the cause of the risks mentioned above. Accordingly, the legal and business teams negotiating vendor contracts need to be aware of the AI risks arising out of vendor services and address them in the governing agreements, and General Counsel should take steps to ensure that the corporate legal department reviews these agreements. Often the subject matter of the different agreements overlaps, which can create problems when one vendor points to another in the event of a failure. This problem is familiar to General Counsel, but the risk is heightened in AI and data projects given the limited track record of vendors new to these services.
Core Issues to Know about AI & Data Quality
The first step to the successful use of AI starts with defining the business problems to be addressed, such as "improving the productivity of a help desk" in IT, "better distinguishing malignant from benign skin lesions" in healthcare, and "improving the criteria for granting loans" in financial services.
The second step involves creating a model or algorithm based on training data. ChatGPT is an example of a Large Language Model (LLM); it is trained on data scraped from the public Internet. Other models can use corporate data either alone or in combination with public or third-party data. An example of applying this to the business objective above is a bank using the results of previous credit applications and loan performance to train a model that determines the amount of credit to be extended to parties in specific circumstances.
In the third step, future data is fed into the model, which in Machine Learning provides outputs in the form of predictions or other analyses.
Data risks can arise when there is a lack of effort in determining and curating the appropriate training data.
Company employees are familiar with data as it appears in spreadsheets. A spreadsheet covers a category (e.g., "employees"); each row defines an entity (e.g., employee John Doe); each column defines an attribute (e.g., "start date"); and each cell holds the value assigned to that entity for that attribute (e.g., September 2, 2010). In good data management, each data value is correct. As a cautionary note, however, this is often not the case.
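The spreadsheet model described above (category, entity, attribute, value) can be sketched in a few lines of code. The records, field names, and the simple check below are illustrative assumptions, not part of any particular system:

```python
# Minimal sketch of the structured-data model described above.
# The table is the category ("employees"); each record is an entity (a row);
# each key is an attribute (a column); each value is the cell's contents.
# All names and the check below are illustrative assumptions.
from datetime import date

employees = [
    {"name": "John Doe", "start_date": date(2010, 9, 2)},
    {"name": "Jane Roe", "start_date": None},  # a missing value: the data is not "right"
]

def entities_missing(records, attribute):
    """Return the entities whose value for `attribute` is absent."""
    return [r["name"] for r in records if r.get(attribute) is None]

print(entities_missing(employees, "start_date"))  # entities needing remediation
```

Even this toy check illustrates the cautionary note above: a single missing cell is a data value that is not correct.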
Such data is called "structured data." AI data can also be "unstructured." Documents (e.g., John Doe's employment contract, transcripts of calls to the help desk), sound recordings, photos (e.g., skin lesions), and video may all be used with AI. Companies often adapt LLMs to their specific industry needs by augmenting them with their own unstructured data. Indeed, this is the current trend to avoid the first-generation problems of Generative AI.
When a model is fed bad data, whether training data or future data, its outputs are suspect. For all the sophistication of the programs that create models, they are still just software code. And, as computer scientists have noted for generations, "garbage in, garbage out."
What is Data Quality?
Data quality requirements are broad and deep. It is convenient to divide them into two broad categories:
- Whether a company has the so-called "right data" to address the business problem and
- Whether the data values themselves are "right," i.e., correct
The criteria associated with whether the "data is right" are more familiar and include accuracy, absence of duplicates, and so forth. It is usually possible to measure the degree to which the data is right. General Counsel should be aware that the best evidence suggests that many existing corporate data sets are not of high enough quality and will require considerable remediation before being used to train a model. This may involve internal data professionals or third-party vendors. As noted, vendor agreements need review.
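As a concrete illustration of measuring whether the "data is right," two simple metrics are completeness (the share of non-missing values) and the duplicate rate. The records and field names below are hypothetical, and real data-quality programs use many more metrics than these:

```python
# Hedged sketch of two simple "data is right" metrics.
# Records and field names are hypothetical.
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},             # missing value
    {"id": 3, "email": "a@example.com"},  # duplicates record 1's email
]

def completeness(records, field):
    """Fraction of records with a non-missing value for `field`."""
    filled = sum(1 for r in records if r.get(field) is not None)
    return filled / len(records)

def duplicate_rate(records, field):
    """Fraction of non-missing values that are duplicates of another value."""
    values = [r[field] for r in records if r.get(field) is not None]
    return 1 - len(set(values)) / len(values)

print(completeness(records, "email"))    # 2 of 3 values present
print(duplicate_rate(records, "email"))  # 1 duplicate among 2 present values
```

Measurements like these, taken before and after clean-up, give General Counsel something verifiable to ask for when vendors or internal teams claim the data has been remediated.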
The issues involved in determining whether a company has the "right data" are less familiar, more subtle, and more complex. The right data is intimately tied to the problem statement. To illustrate, the statement "improving the process for granting loans" from above raises important questions such as "improved how?," "which loans?," and "to whom?" The answers are found in determining what data the company needs. Examples with respect to the criterion of "improved" follow.
- If "improved" means bias-free, the company will need bias-free training data, which may be very hard to come by
- If "improved" means more "performing loans" and fewer "non-performing loans," the company needs data on loans that the company has not granted historically
- If "improved" means decisions that can be more easily explained to regulators, AI may present difficulties, because "explainability" is extremely difficult to document or achieve
- If "improved" means "will work in parts of the world in which the company does not currently operate and does not have local data, or where it operates but does not have local data," then getting data ready for AI requires a large-scale effort
- If "improved" means simply lower costs in existing operations, company historical data may be sufficient (although datasets can always be improved and improve the results)
General Counsel should be aware that business sponsors, data modelers, and other stakeholders predictably hold different views of the "problem" as that term is used above, and it can take hard work to reach agreement on how the company defines the "problem" that AI is asked to solve. Further, modelers naturally select the most easily available data rather than the best data for the problem at hand. Each tendency creates problems, and a similar issue arises when third-party vendors are involved in the process.
Problems With Training Data Run Deep
For a long time, many companies have not been rigorous in managing data, so remediation is required to make data "fit-for-purpose" for the applicable AI services. It is important to note that most historical data fails both the "data is right" criteria and the "right data" criterion, and both are, in turn, difficult to define and verify. Exacerbating this is the difference in terminology used in different data systems across the company. For example, the same individual may be referred to as a "customer" in the customer support department's database, as a "prospect" in the marketing department's database, and as an "account" in the accounting department's database. This is compounded when the same person's name is spelled differently in each database, for example when the full middle name appears in one entry and only an initial (or nothing) in another. This is the "are you you" problem. In the AI context, aligning that data (here, identifying a single individual described with different attributes in different databases) is a special challenge.
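The "are you you" problem can be sketched in miniature. The names and the crude matching rule below are illustrative only; real entity resolution relies on much richer signals (addresses, identifiers, fuzzy matching) and is a substantial project in its own right:

```python
# Hedged sketch of the "are you you" problem: the same person recorded
# under different name forms in different departmental databases.
# Names and the matching rule are illustrative assumptions.
def normalize(name):
    """Crude normalization: lowercase, strip periods, reduce middle names to initials."""
    parts = name.lower().replace(".", "").split()
    if len(parts) >= 3:
        middles = [p[0] for p in parts[1:-1]]  # keep initials of middle names only
        parts = [parts[0]] + middles + [parts[-1]]
    return " ".join(parts)

support_record = "John Quincy Doe"   # the "customer" in customer support
marketing_record = "John Q. Doe"     # the "prospect" in marketing

# After normalization, the two entries likely refer to the same entity.
print(normalize(support_record) == normalize(marketing_record))
```

Even this toy rule shows why alignment is hard: it would wrongly merge two different people who happen to share a first name, last name, and middle initial, which is why production systems combine many attributes before declaring a match.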
Challenges With Third-Party Data
A company may have acquired data from data brokers and other third parties who collected data for sales, and therefore relied on zip codes or similar information. The issue then is whether there is bias built into the dataset which distorts the reliability of the data when the company uses the data for new purposes. A similar issue arises when a company acquires data from another company in a merger, and the acquiring company wants to use the data for a new purpose. The same fit-for-purpose standard arises in such corporate transactions, and validating data should be part of the due diligence process.
How to Question & Verify Data
Given the wide range of quality requirements, the problematic state of historical data, and the "black box" nature of AI, General Counsel should question and verify data issues before the data is used in AI. This, in turn, entails focusing on the different roles that different corporate employees play, as discussed below.
For Those Leading the AI Initiative
- What goals is the company aiming to achieve with its AI efforts?
While AI has potential, actually making money is extremely demanding. Given the risks, General Counsel should confirm that the business teams have made a cost-benefit analysis and justified the use cases.
For Those Who Develop and Maintain the Model
- What problems is this specific AI project aiming to solve?
As discussed above, General Counsel should make sure all company teams have agreed on a definition of the company's problem to be addressed by AI.
- Has the right data been used to solve the problem? Who has verified it?
Make sure the company has clarified the "right data" criteria in terms of relevancy, completeness, freedom from bias, timeliness, clear definitions of good data, and appropriate exclusion of problematic data from the training data. It may be difficult to meet all criteria, so make sure strengths and weaknesses are well understood.
- Is the training data accurate and reliable?
Demand measurements, before and after clean-up.
- Do data sources fully integrate with one another?
Make sure that those integrating data have understood and accounted for subtleties in data sources.
For Those Who Deploy the Model in the Business
- Who is responsible for data quality going forward? Do they have the necessary skills and authority to ensure that data quality standards are met? How do they work with the General Counsel to meet the corporate legal department's needs? Overall, what is the definition of "success" in this context?
When developing a model, there is usually time to remediate (e.g., clean up) bad data. Not so once the model is put to work. Make sure it is clear who is responsible for ensuring the model is fed high-quality data and that those employees involved in the data project are qualified to accomplish the relevant tasks.
- Who is responsible for ensuring that future data is within the range of data on which the model was trained?
Companies often train a model to work for a specified range of data. In actual use, future data can drift beyond that range, jeopardizing model performance. Make sure there is a plan for detecting these issues.
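A plan for detecting drift can start with something as simple as range checks against the training data. The field name and thresholds below are illustrative assumptions, and real drift detection uses statistical tests rather than fixed bounds:

```python
# Hedged sketch of a range check for future-data drift: flag inputs that
# fall outside the range the model saw during training.
# The field and thresholds are illustrative assumptions.
TRAINING_RANGES = {"loan_amount": (1_000, 500_000)}  # assumed from training data

def out_of_range(record):
    """Return the fields whose values fall outside the training range."""
    flags = []
    for field, (low, high) in TRAINING_RANGES.items():
        value = record.get(field)
        if value is None or not (low <= value <= high):
            flags.append(field)
    return flags

print(out_of_range({"loan_amount": 750_000}))  # drifted beyond the training range
print(out_of_range({"loan_amount": 50_000}))   # within the training range
```

Checks like this give the responsible employees an auditable trigger: when flags accumulate, the model's outputs should be treated with suspicion until the drift is investigated.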
- What are the top three ways the company can envision the model failing in the future?
Make sure those responsible have thought broadly about this question (e.g., losing a data source, quality dipping, model degradation) and have a plan for dealing with those issues.
Conclusion
Leveraging the transformative power of Generative and Non-Generative AI, and treating data as a business asset, are critical to success in today's business world, and can provide a company with a competitive advantage in the market. Good AI depends on good data. Good AI projects depend on good data projects, and often good IT projects. Protecting the company's interests requires coordination of different business units with different corporate functions and then integrating this into the AI implementation plan. Businesses often use a dedicated AI team to determine the business objective to be achieved by AI and to design a program for the implementation of AI. As noted, this requires including data quality in the program.
There are advantages to the company of having General Counsel be part of this team and/or the advisory panel to the Board of Directors. The short answer is that General Counsel should be involved in coordinating the activities of company stakeholders and technical support teams and in making sure that legal issues are built into the process. For example, it is easier to build in regulatory compliance when creating or updating databases than to retrofit compliance afterward. The same is true of ensuring that AI meets corporate policies and the needs of providing services to the company's customers and business partners.
Further, AI legislation is being enacted at the state level and is pending at the federal level. In addition, lessons learned from the Biden administration's Executive Order on the use of AI by federal agencies will impact the private sector. The AI rules and practices federal agencies adopt will result in a set of best practices that General Counsel can use in designing or supervising corporate AI and data projects.
Contributed by Dr. Thomas C. Redman, Data Quality Solutions, and William A. Tanenbaum, Moses & Singer LLP
The content of this article is intended to provide a general guide to the subject matter. Specialist advice should be sought about your specific circumstances.