A Google Research team recently published a paper titled Data Cascades in High-Stakes AI. Its six authors, Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora Aroyo, shed light on a profound pattern of data undervaluation in high-stakes fields where AI models are critical and prevalent. They conclude that although there is great interest in creating MT and ML models, there is far less interest in doing the actual data work.
This research is particularly relevant for TAUS as a one-stop shop for language data, providing a wide range of training datasets and NLP services to enhance data. It underscores that in the language data field, where demand is high yet few are willing to provide such services, TAUS has taken on the mission of providing top-quality datasets and data services. The topic has been drawing more attention lately as awareness of, and campaigns around, data-centric AI grow.
With this perspective in mind, we will be diving into the paper and highlighting the key takeaways.
Based on the empirical results of their study, the authors define data cascades as “compounding events causing negative downstream effects from data issues that result in technical debt over time.” They found that 92% of the AI practitioners studied had experienced at least one data cascade. The study indicated that data cascades are influenced by “activities and interactions of actors involved in AI development (e.g. developers, governments, etc), and the physical world and community where the AI system is situated (e.g. hospitals)”.
The researchers observed the following aspects of data cascades:
Incentives and currency in AI highlight the widespread lack of recognition for the invisible yet difficult data work in AI. Organizations expect high-performing models without giving the underlying data quality the consideration it needs. TAUS places a high emphasis on data work and ensures data quality from the get-go. Specifically, TAUS data services offer custom data solutions, and early-phase data intervention of this kind can directly help avoid data cascades.
Four primary root causes for data cascades are identified in the study. Below, we dive deeper into each root cause.
Interacting with Physical World Brittleness
Real-world phenomena in high-stakes domains give an ML system more reasons to break, due to factors such as limited training data, volatile domains, regulation changes, and complex underlying phenomena. Data cascades often appear as a direct result of hardware drifts (e.g. improper lighting, dust, or fingerprints that tarnish model performance), environmental drifts (e.g. cloud cover hiding vegetation), and human drifts (social phenomena where changes in societal behavior impact the data).
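The idea behind detecting such drifts can be sketched in a few lines. This is a hypothetical illustration, not from the paper: the `drift_score` function and the brightness example are invented here to show how a deployed system might compare a feature's live distribution against what it saw at training time.

```python
# Hypothetical sketch of a simple drift check (not from the paper): compare a
# feature's distribution at training time with what the deployed system sees.
import statistics


def drift_score(train_values, live_values):
    """Absolute shift of the live mean from the training mean,
    measured in units of the training standard deviation."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.mean(live_values) - mu) / sigma


# Invented example: average image brightness. A hardware drift such as
# improper lighting dims the images the live system captures.
train_brightness = [0.62, 0.58, 0.65, 0.60, 0.63]
live_brightness = [0.31, 0.28, 0.35, 0.30, 0.33]

if drift_score(train_brightness, live_brightness) > 3.0:
    print("drift detected: retrain or investigate the capture pipeline")
```

Real systems would use richer distribution tests, but the principle is the same: monitoring the data itself catches a drift long before a slowly degrading accuracy number does.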
Inadequate Application-Domain Expertise
This data cascade is triggered when the AI practitioners overseeing the data lack sufficient domain expertise. With limited domain knowledge, false assumptions can easily become incorporated into the ML system.
Conflicting Reward Systems
Data cascades can occur when priorities and incentives are misaligned between practitioners, domain experts, and field partners. Data literacy training is often conducted poorly, if at all, leading to data quality challenges and further misalignment.
Poor Cross-Organizational Documentation
A lack of documentation across the various sectors of an organization sets off data cascades. Due to the volatile nature of high-stakes data, metadata and schemas are constantly changing. Without proper documentation or domain knowledge, critical data details are missed.
The key findings related to the source of data cascades as well as possible solutions to avoid them altogether are highlighted below.
The researchers’ results reveal an overwhelming trend of opaque and messy data cascades, even in domains where organizational leaders were attuned to the importance of data quality. They argue that a potential way to avoid data cascades altogether is to strive for data excellence - a focus on the practices, politics, and values of the humans in the data pipeline, improving data quality through systematic processes, infrastructures, and standards. The authors note that the challenges observed in high-stakes domains also exist in all forms and at all levels of AI development.
The researchers propose a shift from “the goodness-of-fit to the goodness-of-data.” Goodness-of-fit metrics, such as F1 score, accuracy, and AUC, all measure how well the model fits the data. While this may be an effective way to evaluate model performance, it says little about the quality of the data itself or the real-world phenomena the data is meant to capture. Furthermore, such metrics do nothing to stop the unforeseen downstream effects of data cascades. Measuring goodness-of-data, on the other hand, enables organizations to assess data early in development and establishes an early feedback loop. This can uncover potential data problems early on and help avoid data cascades later.
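The gap between the two notions can be made concrete with a toy example. The functions and data below are invented for illustration and are not from the paper: a goodness-of-fit metric can come out perfect on a test set that a goodness-of-data check would immediately flag as flawed.

```python
# Hypothetical illustration: a model can score perfectly on a goodness-of-fit
# metric while the underlying data is flawed.

def accuracy(y_true, y_pred):
    """Goodness-of-fit: fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def goodness_of_data_report(rows):
    """A toy goodness-of-data check: flags issues the accuracy score hides."""
    issues = []
    if len(set(rows)) < len(rows):
        issues.append("duplicate rows")
    if any(None in row for row in rows):
        issues.append("missing values")
    return issues


# An invented test set with a leaked duplicate and a missing feature value.
rows = [(1.0, 0.5), (1.0, 0.5), (0.2, None)]
y_true = [1, 1, 0]
y_pred = [1, 1, 0]

print(accuracy(y_true, y_pred))       # 1.0 -- the model "fits" perfectly
print(goodness_of_data_report(rows))  # ['duplicate rows', 'missing values']
```

The accuracy of 1.0 would pass any model-centric gate, yet the data report shows why that number cannot be trusted downstream.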
To simplify workflows in ML pipelines, researchers have created tools such as Data Linter (a tool that inspects ML datasets and flags potential problems) and ActiveClean and BoostClean (tools that discover data errors). While these tools can be useful, the authors caution against relying on model performance as the primary proxy for data quality. They argue that practitioners should devote as much attention to monitoring their data as they do to testing their code.
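The flavor of such tooling can be sketched in miniature. This is not the actual Google Data Linter; `lint_column` and the dataset below are hypothetical, written here only to show what "flagging potential problems before training" looks like in practice.

```python
# A minimal, hypothetical sketch of a Data-Linter-style check (not the real
# Data Linter tool): inspect a tabular dataset column by column and flag
# potential problems before any model is trained.

def lint_column(name, values):
    """Return human-readable warnings for one column of a dataset."""
    warnings = []
    non_missing = [v for v in values if v is not None]
    if len(non_missing) < len(values):
        warnings.append(f"{name}: {len(values) - len(non_missing)} missing value(s)")
    if len(non_missing) > 1 and len(set(non_missing)) == 1:
        warnings.append(f"{name}: constant column carries no signal")
    if non_missing and all(
        isinstance(v, str) and v.replace(".", "", 1).isdigit() for v in non_missing
    ):
        warnings.append(f"{name}: numbers stored as strings")
    return warnings


# An invented dataset exhibiting three common lint findings.
dataset = {
    "age": [34, 51, None, 29],
    "country": ["NL", "NL", "NL", "NL"],
    "income": ["52000", "61000", "48000", "57000"],
}

for column, values in dataset.items():
    for warning in lint_column(column, values):
        print(warning)
```

Each warning here is cheap to compute and catches a defect that no accuracy number would surface, which is exactly the point the authors make about monitoring data with the same rigor as code.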
In light of data excellence, the authors suggest that creating incentives for data excellence could help organizations avoid data cascades. As structural incentives in the market, the researchers propose conferences (such as SIGCHI, CSCW, and AAAI) as platforms to recognize the importance of research on data, through conventions like crowd work, human computation, and data visualization. Data excellence emphasizes the value of sustained partnerships, as opposed to a few individuals engaging on a one-off basis within an organization. More collaboration, greater transparency around AI application use cases, data literacy, and shared incentives are a few of the ways the authors suggest moving this notion forward.
Real-World Data Literacy in Education
Furthermore, the paper highlights that AI education should be reformed to include real-world data literacy. As it currently stands, graduates of AI-related fields are under-prepared for working with data, including data collection, documentation, and infrastructure-building. There is also a lack of collaboration with, and appreciation for, application-domain experts in AI.
More Visibility in AI Data Lifecycle
Additionally, the authors call for better visibility into the AI data lifecycle, particularly through feedback channels operating on different time scales. Because of limited visibility, AI practitioners struggle to comprehend the impact of data quality. By contrast, the study showed that the groups with the fewest data cascades were teams that had feedback loops in place, maintained close relations with application-domain experts, kept clear documentation, and routinely monitored incoming data.
Uneven ML Capital Globally
Lastly, the study identified a major discrepancy in the “data equity of the Global South”: drastic differences in data and compute accessibility in Africa and India compared to the United States. The Global South is often treated as a site for low-level data annotation work, a direct result of the world’s uneven ML capital. The authors note that publishing open-source datasets, data collection tools, and training could help reduce this disparity.
As AI increasingly shapes our day-to-day decision-making in core aspects of life, the quality of the data powering these models is of major importance. The paper shows that data work is perhaps the most undervalued part of AI, even though it is widely understood to be a key necessity. Organizations often use the wrong tools to address data quality issues, approaching them as a database problem. TAUS, however, distinguishes itself by taking the initiative in this often neglected data work. TAUS embodies the notion of data excellence, namely through its platforms, which offer enriched datasets, matching data, and numerous data quality services. In this way, TAUS recognizes the importance and value of data work by taking actionable steps to ensure data quality, which in turn can help prevent data cascades from emerging downstream in the AI lifecycle.
The researchers suggest that by giving more recognition to taken-for-granted data work and by shifting the organizational structure around data quality, data cascades can be discovered, addressed, and even avoided entirely early in the development cycle. The primary way to avoid data cascades is through data excellence: a systematic, valued, and sustained shift towards data quality through processes, standards, infrastructures, and incentives. With increased ML literacy worldwide, society can move towards a model of data excellence and away from a narrow focus on model accuracy.
Husna is a data scientist who studied Mathematical Sciences at the University of California, Santa Barbara. She also holds a master’s degree in Engineering, Data Science from the University of California, Riverside. She has experience in machine learning, data analytics, statistics, and big data. She enjoys technical writing when she is not working and is currently responsible for data science-related content at TAUS.
The AI scene of the 2010s was shaped by breakthroughs in vision-enabled technologies, from advanced image search to computer vision systems for medical image analysis or for detecting defective parts in manufacturing and assembly. The 2020s, however, are expected to be all about natural language technologies and language-based AI tasks. NLP, NLG, NLQ, NLU… the list of abbreviations starting with NL (Natural Language) seems to grow each day. Regardless of the technology domain, natural language technologies are poised to play a field-shaping role in a variety of areas, from business intelligence and healthcare to fintech.
This is the third article in my series on Translation Economics of the 2020s. In the first article published in Multilingual, I sketched the evolution of the translation industry driven by technological breakthroughs from an economic perspective. In the second article, Reconfiguring the Translation Ecosystem, I laid out the emerging new business models and ended with the observation that new smarter models still need to be invented. This is where I will now pick up the thread and introduce you to the next logical translation solution. I call it: Data-Enhanced Machine Translation.