Everyone wants to do the model work, not the data work.
18/11/2021
8 minute read
A thorough overview of the paper by six Google researchers: Data Cascades in High-Stakes AI with a focus on why data-centric AI matters.

A Google Research team recently published a paper titled Data Cascades in High-Stakes AI. Its six authors, Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora Aroyo, shed light on a profound pattern of data undervaluation in high-stakes fields where AI models are critical and prevalent. They conclude that although there is great interest in building MT and ML models, there is far less interest in doing the actual data work.

This research is particularly relevant for TAUS as a one-stop shop for language data, providing a plethora of training datasets and NLP services to enhance data. In the language data field, demand is high yet few are willing to provide these services, and TAUS has taken on the mission of delivering top-quality datasets and data services. The topic has been drawing more attention lately as understanding of, and campaigns around, data-centric AI grow.

With this perspective in mind, we will be diving into the paper and highlighting the key takeaways. 

What are Data Cascades? 

The authors define Data Cascades as “compounding events causing negative downstream effects from data issues that result in technical debt over time.” Based on the empirical results of the study, 92% of AI practitioners experienced at least one data cascade. The study indicates that data cascades are influenced by “activities and interactions of actors involved in AI development (e.g. developers, governments, etc.), and the physical world and community where the AI system is situated (e.g. hospitals)”.

The researchers observed the following aspects of data cascades:  

  • Data opaqueness: no clear tools, metrics, or indicators to detect data cascades and their impact on the system
  • Triggers: when conventional AI practices are applied to high-stakes fields
  • Negative impact: multiple downstream effects, such as technical debt
  • Multiple cascades: 45.3% of practitioners in the study experienced two or more cascades
  • Avoidability of cascades: through early interventions in the development process

Incentives and currency in AI highlight the widespread lack of recognition for the invisible yet difficult data work in AI. Organizations expect high-performing models without giving much-needed consideration to the underlying data quality. TAUS places a high emphasis on data work and ensures data quality from the get-go. Specifically, TAUS data services offer custom data solutions, which can have a direct impact on the avoidability of data cascades through early-phase data intervention.

Data Cascade Root Causes

Four primary root causes for data cascades are identified in the study. Below, we dive deeper into each root cause.

Interacting with Physical World Brittleness 

Real-world phenomena in high-stakes domains give an ML system more ways to break, due to factors such as limited training data, volatile domains, regulation changes, and complex underlying phenomena. Data cascades often appear as a direct result of hardware drifts (e.g. improper lighting, dust, or fingerprints that degrade model performance), environmental drifts (e.g. cloud cover hiding vegetation), and human drifts (social phenomena where changes in societal behavior affect the data).

Inadequate Application-Domain Expertise

This data cascade is triggered when AI practitioners who oversee the data do not have enough domain expertise. Because of their limited domain expertise, false assumptions can easily become incorporated into the ML system. 

Conflicting Reward Systems

Data cascades can occur when priorities and incentives are not aligned between practitioners, domain experts, and field partners. Data literacy training is often poorly conducted, if at all, leading to data quality challenges and misalignment.

Poor Cross-Organizational Documentation

A lack of documentation across the various sectors of an organization sets off data cascades. Due to the volatile nature of high-stakes data, metadata and schemas change constantly. Without proper documentation or domain knowledge, critical data details are missed.
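One lightweight remedy for this kind of schema drift is to keep a machine-readable "dataset card" alongside the data. The sketch below is purely illustrative: the field names, the hypothetical `clinic-notes` dataset, and the `current_fields` helper are assumptions for this example, not a standard proposed in the paper.

```python
# A minimal sketch of machine-readable dataset documentation.
# All names here (dataset, fields, helper) are illustrative assumptions.

DATASET_CARD = {
    "name": "clinic-notes-v3",           # hypothetical dataset
    "schema_version": 3,
    "fields": {"note_text": "string", "icd_code": "string"},
    "collection": "transcribed clinician notes, 2020-2021",
    "known_issues": ["icd_code missing before v2"],
    "changelog": [
        {"version": 2, "change": "added icd_code field"},
        {"version": 3, "change": "renamed note -> note_text"},
    ],
}

def current_fields(card):
    """Downstream consumers check the schema version before reading data."""
    return card["schema_version"], sorted(card["fields"])
```

Because the changelog travels with the data, a downstream team that receives version 2 records can see exactly which fields to expect instead of guessing from the raw files.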

Findings

The key findings related to the source of data cascades as well as possible solutions to avoid them altogether are highlighted below. 

Data Excellence

The researchers’ results show an overwhelming trend of opaque and messy data cascades, even in domains where organizational leaders were attuned to the importance of data quality. They argue that a potential way to avoid data cascades altogether is to strive for data excellence: a focus on the practices, politics, and values of the humans in the data pipeline, improving data quality through systematic processes, infrastructures, and standards. The authors note that the same challenges observed in high-stakes domains exist in all forms and at all levels of AI development.

Data First

The researchers propose a shift “from the goodness-of-fit to the goodness-of-data.” Goodness-of-fit metrics, such as F1, accuracy, and AUC, all measure how well the model fits the data. While this is an effective way to evaluate model performance, it says little about the quality of the data and the phenomena it captures, and such metrics do nothing to stop the unforeseen downstream effects of data cascades. Measuring goodness-of-data, on the other hand, lets organizations assess the data early in development and establish an early feedback loop. This can uncover possible data problems in time to avoid data cascades later on.
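The contrast can be made concrete with a small sketch: the first two functions compute standard goodness-of-fit metrics, while the third reports goodness-of-data properties (missing values, label balance, duplicates) that fit metrics cannot see. The checks and thresholds are assumptions chosen for illustration, not measures prescribed by the paper.

```python
# Illustrative contrast between goodness-of-fit and goodness-of-data.
# The specific data checks below are assumptions for this sketch.

def accuracy(y_true, y_pred):
    """Goodness-of-fit: fraction of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive=1):
    """Goodness-of-fit: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def data_report(records, label_key="label"):
    """Goodness-of-data: surface issues the fit metrics cannot see."""
    n = len(records)
    missing = sum(any(v is None for v in r.values()) for r in records)
    labels = [r[label_key] for r in records if r[label_key] is not None]
    pos_share = labels.count(1) / len(labels) if labels else 0.0
    duplicates = n - len({tuple(sorted(r.items())) for r in records})
    return {"rows": n, "rows_with_missing": missing,
            "positive_share": pos_share, "duplicate_rows": duplicates}
```

A model can score a respectable F1 on a dataset that `data_report` would immediately flag as imbalanced, duplicated, or riddled with missing values; that gap is exactly what the goodness-of-data framing targets.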

Wrong Tools 

To simplify workflows in ML pipelines, researchers have created data-management tools such as Data Linter (which inspects ML datasets and flags potential problems) and ActiveClean and BoostClean (which detect and repair errors in training data). While these tools can be useful, the authors highlight that model performance still serves as the primary proxy for data quality, which keeps underlying data issues hidden. They argue that practitioners should devote as much attention to monitoring data as they do to testing code.

Create Incentives

In light of data excellence, the authors suggest that creating incentives for data excellence could help organizations further avoid data cascades. As structural incentives in the market, the researchers propose conferences (such as SIGCHI, CSCW, and AAAI) as platforms to recognize the importance of research on data, through conventions like crowd work, human computation, and data visualization. Data excellence also places value on sustained partnerships, as opposed to a few individuals engaging on a one-off basis within an organization. More collaboration, transparency about AI application use cases, data literacy, and shared incentives are a few ways the authors suggest moving this notion forward.

Real-World Data Literacy in Education

Furthermore, the paper argues that AI education should be reformed to teach real-world data literacy. As it currently stands, graduates of AI-related fields are underprepared for working with data, including data collection, documentation, and infrastructure-building. There is also a lack of collaboration with, and appreciation for, application-domain experts in AI.

More Visibility in AI Data Lifecycle

Additionally, the authors call for better visibility into the AI data lifecycle, particularly through feedback channels operating on different time scales. Because of limited visibility, AI practitioners struggle to understand the impact of data quality. The study showed that the teams with the fewest data cascades were those that had feedback loops in place, worked closely with application-domain experts, maintained clear documentation, and routinely monitored incoming data.

Uneven ML Capital Globally

Lastly, the study identifies a major discrepancy in “data equity of the Global South”, uncovering drastic differences in data and compute accessibility in Africa and India compared to the United States. The Global South is often treated as a site for low-level data annotation work, a direct result of the uneven distribution of ML capital in the world. The authors note that publishing open-source datasets, data collection tools, and training could help reduce this disparity.

Conclusion

As AI increasingly drives day-to-day decision-making in core aspects of life, the quality of the data powering these models is of major importance. The paper shows that data work is perhaps the most undervalued, yet most critical, part of AI. Organizations often use the wrong tools to address data quality issues, approaching them as a database problem. TAUS distinguishes itself by taking initiative in this often neglected data work. TAUS embodies the notion of data excellence through its platforms, which offer enriched datasets, data matching, and numerous data quality services. In doing so, TAUS recognizes the importance and value of data work by taking actionable steps to ensure data quality, which ultimately can help keep data cascades from emerging downstream in the AI lifecycle.

The researchers suggest that by giving more recognition to the taken-for-granted data work and shifting the organizational structure around data quality, data cascades can be discovered, addressed, and even avoided entirely early in the development cycle. The primary way to avoid data cascades is data excellence: a systematic, valued, and sustained shift towards data quality through processes, standards, infrastructures, and incentives. With increased ML literacy worldwide, society can move towards data excellence and away from a narrow focus on model accuracy.

 

Author
Husna Sayedi

Husna is a data scientist who studied Mathematical Sciences at the University of California, Santa Barbara, and holds a master’s degree in Engineering, Data Science from the University of California, Riverside. She has experience in machine learning, data analytics, statistics, and big data. She enjoys technical writing and is currently responsible for the data science-related content at TAUS.
