Why Governance and Data Quality Are Vital in Training AI Models

Joseph J. Lazzarotti

Can Your AI Model Collapse?

by: Joseph J. Lazzarotti of Jackson Lewis P.C. - Workplace Privacy, Data Management & Security Report

Friday, August 23, 2024

A recent Forbes article summarizes a potentially problematic aspect of AI that highlights the importance of governance and data quality when training AI models. It is called “model collapse.” When AI models are trained on data that earlier AI models created (rather than data created by humans), something is lost at each iteration, and over time the model can fail.

According to the Forbes article:

Model collapse, recently detailed in a Nature article by a team of researchers, is what happens when AI models are trained on data that includes content generated by earlier versions of themselves. Over time, this recursive process causes the models to drift further away from the original data distribution, losing the ability to accurately represent the world as it really is. Instead of improving, the AI starts to make mistakes that compound over generations, leading to outputs that are increasingly distorted and unreliable.
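The compounding effect described above can be illustrated with a toy simulation. This is a minimal sketch, not the researchers' experiment: it assumes the "model" is nothing more than a normal distribution fitted to a small sample, and that each generation is trained only on samples drawn from the previous generation's fit. The function name `simulate_recursive_training` and every parameter value are hypothetical.

```python
import random
import statistics


def simulate_recursive_training(sample_size=10, generations=500, seed=0):
    """Return the fitted standard deviation at each generation.

    Toy assumption: the "model" is a normal distribution described by a
    mean and a standard deviation estimated from its training sample.
    """
    rng = random.Random(seed)
    # Generation 0 trains on "real" data drawn from a standard normal.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    fitted_stds = []
    for _ in range(generations + 1):
        mean = statistics.fmean(data)
        std = statistics.stdev(data)
        fitted_stds.append(std)
        # The next generation sees only samples produced by this fit.
        data = [rng.gauss(mean, std) for _ in range(sample_size)]
    return fitted_stds


if __name__ == "__main__":
    stds = simulate_recursive_training()
    for generation in (0, 10, 100, 500):
        print(f"generation {generation}: fitted std = {stds[generation]:.3g}")
```

In this toy setting the fitted spread tends to shrink toward zero over many generations, because each fit inherits the sampling error of the one before it. Real language models are far more complex, so the sketch illustrates only how small errors can compound.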

As the researchers who observed this effect noted in their Nature article:

In our work, we demonstrate that training on samples from another generative model can induce a distribution shift, which—over time—causes model collapse. This in turn causes the model to mis-perceive the underlying learning task. To sustain learning over a long period of time, we need to make sure that access to the original data source is preserved and that further data not generated by LLMs remain available over time. The need to distinguish data generated by LLMs from other data raises questions about the provenance of content that is crawled from the Internet: it is unclear how content generated by LLMs can be tracked at scale. One option is community-wide coordination to ensure that different parties involved in LLM creation and deployment share the information needed to resolve questions of provenance. Otherwise, it may become increasingly difficult to train newer versions of LLMs without access to data that were crawled from the Internet before the mass adoption of the technology or direct access to data generated by humans at scale.
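One way to picture the provenance question raised in that passage is a filter that keeps only records an organization can trace to human authors or to a crawl that predates widespread LLM use. The sketch below is an assumption-laden illustration, not a method from the Nature article: the `Record` fields, the provenance labels, and the `ASSUMED_CUTOFF` date are all hypothetical placeholders, and in practice reliable labels may not exist at scale.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    text: str
    provenance: str  # hypothetical labels: "human", "llm", or "unknown"
    crawled_on: date


# Arbitrary placeholder for "before the mass adoption of the technology".
ASSUMED_CUTOFF = date(2022, 1, 1)


def select_training_records(records, cutoff=ASSUMED_CUTOFF):
    """Keep records labeled human-generated, plus records that are not
    labeled as LLM output and were crawled before the cutoff date."""
    return [
        r for r in records
        if r.provenance == "human"
        or (r.provenance != "llm" and r.crawled_on < cutoff)
    ]


if __name__ == "__main__":
    sample = [
        Record("a", "human", date(2024, 5, 1)),
        Record("b", "llm", date(2024, 5, 1)),
        Record("c", "unknown", date(2019, 3, 1)),
        Record("d", "unknown", date(2024, 5, 1)),
    ]
    print([r.text for r in select_training_records(sample)])
```

The filter is only as good as the labels feeding it, which is the tracking problem the researchers describe.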

These findings highlight several important considerations when using AI tools. One is maintaining a robust governance program that includes, among other things, measures to stay abreast of developing risks. We’ve heard a lot about hallucinations. Model collapse is a relatively new and potentially devastating challenge to the promise of AI. It raises an issue similar to the concerns about hallucinations: the value of the results from a generative AI tool that an organization has come to rely on can diminish significantly over time.

A related consideration is the need to be continually vigilant about the quality of the data being used. Distinguishing and preserving human-generated content may become more difficult over time as sources of data are increasingly rooted in AI-generated content. The consequences could be significant, as the Forbes piece notes:

[M]odel collapse could exacerbate issues of bias and inequality in AI. Low-probability events, which often involve marginalized groups or unique scenarios, are particularly vulnerable to being “forgotten” by AI models as they undergo collapse. This could lead to a future where AI is less capable of understanding and responding to the needs of diverse populations, further entrenching existing biases and inequalities.
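The "forgetting" of low-probability events can also be sketched with a toy example. The code below assumes a model that simply memorizes category frequencies from a finite sample, with each generation trained on a sample drawn from the previous generation's frequencies. The function name `resample_generations`, the category names, and the weights are hypothetical.

```python
import random
from collections import Counter


def resample_generations(weights, sample_size, generations, seed=0):
    """Return the set of categories still represented at each generation.

    Toy assumption: each "model" is the empirical frequency table of a
    finite sample drawn from the previous model.
    """
    rng = random.Random(seed)
    categories = list(weights)
    probs = [weights[c] for c in categories]
    history = []
    for _ in range(generations):
        draws = rng.choices(categories, weights=probs, k=sample_size)
        counts = Counter(draws)
        history.append(set(counts))
        # A category with zero observations gets zero weight from here on,
        # so it can never reappear in later generations.
        probs = [counts.get(c, 0) for c in categories]
    return history


if __name__ == "__main__":
    start = {"common": 0.90, "uncommon": 0.09, "rare": 0.01}
    history = resample_generations(start, sample_size=50, generations=200)
    for generation in (0, 49, 199):
        print(f"generation {generation}: {sorted(history[generation])}")
```

Once a rare category fails to appear in a single generation's sample, no later generation can recover it, which is the one-way loss the Forbes passage describes.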

Accordingly, organizations need to build strong governance and controls around the data on which their (or their vendors’) AI models were and continue to be trained. That need is all the clearer given that model collapse is only one of a number of risks and challenges organizations face when developing or deploying AI.