To tackle noise and uncertainty in observed data, we develop new models that capture variability in data arising from various socio-economic factors, and we analyze these models to understand their impact on algorithmic decisions (e.g., stable matchings of students to schools) for different groups of people. This analysis is key to downstream decisions, e.g., in planning the investment of limited resources. I received the NSF CAREER Award in 2023 to explore these directions further.
We are interested in how to make “robust and reliable” decisions when the observed data is noisy or biased. Evaluation data often shows distributional differences across individuals from different backgrounds, yet algorithmic techniques for making data-driven decisions on such data without exacerbating unfairness are not well understood. This research theme explores new algorithms under different distributional assumptions, characterizations of the impact of noise in data, and interventions for reducing these impacts. Noisy data affects downstream decision-making pipelines, which often involve some machine learning. Sometimes domain-specific constraints or high-quality data can be used to augment noisy pipelines for more resilient and robust decisions. In this theme, we explore the trade-offs between the quality of data and the cost of acquiring it through robust and resilient algorithms. Here are some data models we have considered thus far:
(a) Ordinal Online Optimization: An estimated 97% of organizations rely on automated algorithms in their candidate selection process, as it is impossible for humans to sift through millions of resumes, test scores, or health records. Unfortunately, predicting “hirability” for each candidate creates multiple challenges: ML-generated scores tend to pick up biased trends in the underlying data. This poses a huge challenge, since a significant fraction of the U.S. workforce (at least 70 million adults) are “STARs,” skilled through alternative routes such as community college, workforce training, boot camps, certificate programs, military service, or on-the-job learning, rather than a traditional 4-year bachelor’s degree. Proactively addressing these challenges remains a big open question.
From a theoretical lens, we are broadly interested in extending existing algorithms to ordinal data (i.e., data where only comparisons can be performed, without numeric values). We explored this in the context of hiring, specifically for the online secretary problem. As opposed to the existing literature, where the algorithm can observe the utility of each candidate, we restricted ourselves to the setting where the algorithm can only observe a partial order over the candidates that have arrived thus far. We showed that competitive algorithms in this setting are naturally fairer (candidates higher up in the poset have a higher probability of selection), induce better data-specific distributions, and do not overcorrect selection rates across gender groups (as opposed to using quotas). We further explored the legal implications of this work.
- Secretary Problems with Biased Evaluations using Partial Ordinal Information, Jad Salem, Swati Gupta. Management Science, 2023.
- Closing the GAP: Mitigating Bias in Online Resume-Filtering, Jad Salem, Swati Gupta. WINE 2020.
- Using Algorithms to Tame Discrimination, Deven Desai, Swati Gupta, Jad Salem. UC Davis Law Review, 2023.
- Don’t let Ricci v. DeStefano Hold You Back: A Bias-Aware Legal Solution to the Hiring Paradox, Jad Salem, Deven Desai, Swati Gupta. FAccT 2022.
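To make the ordinal restriction concrete, here is a minimal sketch of a secretary-style rule that only ever queries a pairwise comparison oracle, never a numeric score. The observe-then-commit threshold and the `better` oracle are illustrative assumptions, not the algorithm from the papers above.

```python
import math
import random

def ordinal_secretary(candidates, better):
    """Secretary-style selection using only pairwise comparisons.

    `candidates` arrive in order; `better(a, b)` returns True if a is
    preferred to b under the (possibly partial) order, and False
    otherwise, including when a and b are incomparable. No numeric
    utilities are ever observed.
    """
    n = len(candidates)
    cutoff = max(1, int(n / math.e))  # classical observe-then-commit split
    observed = list(candidates[:cutoff])
    for c in candidates[cutoff:]:
        # accept the first arrival preferred to everyone seen so far
        if all(better(c, o) for o in observed):
            return c
        observed.append(c)
    return candidates[-1]  # forced to take the last arrival

# usage with a total order on scores, for illustration
pool = list(range(100))
random.shuffle(pool)
pick = ordinal_secretary(pool, lambda a, b: a > b)
```

With a partial order, `better` simply returns False on incomparable pairs, so the rule accepts only candidates that dominate everything observed so far.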
(b) Impact and Interventions for Noisy Data: We are interested in estimating bias in data and developing interventions to mitigate its impact. Using data from the Department of Education in New York, we showed that there are consistent distributional differences in student performance on the Specialized High Schools Admissions Test (SHSAT), based on whether students went to schools with low economic need (G1) or high economic need (G2). Our recent work explores which students are most impacted if this distributional difference persists, and how scholarships or additional training can mitigate these impacts. This involves understanding the stable matching mechanism under noisy student performance. We further explored the matching mechanism of the Discovery program, and showed that it can in fact create a large number of blocking pairs within the group of disadvantaged students (a blocking pair is a student and school who would rather be matched to each other than to their current matches). We characterized new market conditions of “high competitiveness” under which another mechanism provably does not create such blocking pairs.
We are further exploring bias in organ transplantation pipelines and potential interventions, in collaboration with MGH.
- Discovering Opportunities in New York City’s Discovery Program: An Analysis of Affirmative Action Mechanisms, Yuri Faenza, Swati Gupta, Xuan Zhang. EC 2023 (journal version under submission).
- Reducing the Filtering Effect in Public School Admissions: A Bias-aware Analysis for Targeted Interventions, Yuri Faenza, Swati Gupta, Aapeli Vuorinen, Xuan Zhang. ACDA 2023 (journal version under revision at M&SOM).
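To make the blocking-pair notion concrete, here is a small sketch that enumerates blocking pairs in a one-to-one matching; the preference-list representation is an assumption for illustration, and school capacities are simplified to one seat each.

```python
def blocking_pairs(match, stu_pref, sch_pref):
    """Enumerate blocking pairs in a one-to-one student-school matching.

    match:    dict student -> assigned school
    stu_pref: dict student -> list of schools, most preferred first
    sch_pref: dict school  -> list of students, highest priority first
    A pair (s, c) blocks if s prefers c to match[s] AND c prefers s
    to its currently assigned student.
    """
    student_of = {c: s for s, c in match.items()}  # invert the matching
    blocks = []
    for s, prefs in stu_pref.items():
        cur = prefs.index(match[s])
        for c in prefs[:cur]:            # schools s strictly prefers
            rank = sch_pref[c].index     # priority order at school c
            if rank(s) < rank(student_of[c]):
                blocks.append((s, c))
    return blocks

# usage: swapping the stable assignment creates a blocking pair
match = {"alice": "Y", "bob": "X"}
prefs = {"alice": ["X", "Y"], "bob": ["X", "Y"]}
prios = {"X": ["alice", "bob"], "Y": ["alice", "bob"]}
print(blocking_pairs(match, prefs, prios))   # [('alice', 'X')]
```

A matching is stable exactly when this function returns an empty list; our results concern mechanisms that avoid such pairs within the disadvantaged group.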
(c) Incorporating Domain Constraints: Errors in measurements, like pulse oximetry errors for darker-skinned individuals, can lead to unintentional biases in patient care, and electronic medical records (EMR) data is rife with such errors and recording noise. This is a massive challenge today, as machine learning pipelines permeate the space of life-critical decisions. In a collaboration with Emory University and Grady Hospital, we developed a data-correction algorithm using the theory of projections. Our method converts clinical knowledge about dependencies among vitals and clinical labs into (non-convex) mixed-integer constraints modeling the space of feasible data P, and corrects EMR data by computing projections onto these non-convex sets. Our results show improved sepsis prediction accuracy 6 hours before onset, improving upon current state-of-the-art methods. We are currently exploring incorporating LLMs into such predictive pipelines, and speeding up the projections.
- Improving Clinical Decision Support through Interpretable Machine Learning and Error Correction in Electronic Health Records, Mehak Arora, Hassan Mortagy, Nate Dwarshius, Jeffrey Wang, Philip Yang, Swati Gupta, Andre Holder, Rishikesan Kamaleswaran. Journal of the American Medical Informatics Association (JAMIA), 2025.
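As a toy illustration of the projection idea, the sketch below corrects a record by alternating projections between a box of physiologic ranges and one linearized dependency (a hyperplane). The actual method handles non-convex, mixed-integer constraint sets; this convex stand-in, with the specific ranges and the standard MAP ≈ (SBP + 2·DBP)/3 relation, is assumed only for illustration.

```python
import numpy as np

def project_record(x, lo, hi, a, iters=50):
    """Correct a noisy record via alternating projections.

    Alternates between the hyperplane a @ x = 0 (one linearized
    clinical dependency) and the box [lo, hi] (physiologic ranges).
    Both sets are convex here, so the iterates converge to a point in
    their intersection; this is a simple stand-in for the non-convex
    mixed-integer projection used in the paper.
    """
    x = np.asarray(x, dtype=float)
    for _ in range(iters):
        x = x - (a @ x) / (a @ a) * a     # project onto the hyperplane
        x = np.clip(x, lo, hi)            # project onto the box
    return x

# usage: a record (SBP, DBP, MAP) with an implausible MAP entry
a = np.array([1 / 3, 2 / 3, -1.0])        # MAP = (SBP + 2*DBP)/3
lo = np.array([60.0, 30.0, 40.0])         # assumed physiologic lows
hi = np.array([250.0, 150.0, 180.0])      # assumed physiologic highs
y = project_record([120.0, 80.0, 200.0], lo, hi, a)
```

The corrected record satisfies both the range constraints and the dependency, which is what makes the downstream prediction pipeline more robust to recording errors.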
(d) Mixed-Fidelity Data with Cost: AI is rapidly gaining a monopoly on decision-making across the applications that humans interact with today. AI is not only able to process large amounts of data, but is also able to generalize patterns and integrate into more complex decision-making pipelines. However, AI also has the power to generate fake information, combine data sources with low-quality and high-quality signals, and mimic human interactions with decision systems. This raises a massive challenge in terms of navigating fake data, discounting fake sources of information when training models, and learning from behaviors that look “human” but are machine-generated. A big challenge is to mitigate the propagation of faulty decisions and fake information to the extent possible, without violating the rights of citizens. This not only opens a huge opportunity and market for high-quality data, but also calls for intelligent systems that can detect fake and low-quality signals. It is critical that this challenge be addressed, so that we have reliable automated decisions and are not vulnerable to fake-data attacks. The AI models we build are only as good as the data sources used to train them. Within this theme, we are exploring the trade-offs between cost, signal, and regret in online learning, currently in the context of mental health and human-AI augmented systems.
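One way to make the cost/signal/regret trade-off concrete is a toy bandit in which arms differ in both signal quality and observation cost. The epsilon-greedy rule and the arm parameters below are illustrative assumptions, not an algorithm from our papers.

```python
import random

def cost_aware_bandit(arms, rounds, eps=0.1, seed=0):
    """Epsilon-greedy over arms that differ in signal quality and cost.

    Each arm is (mean_reward, noise_std, observation_cost). The learner
    maximizes reward minus cost from noisy feedback; regret is measured
    in expectation against the best cost-adjusted arm.
    """
    rng = random.Random(seed)
    counts = [0] * len(arms)
    means = [0.0] * len(arms)
    best = max(m - c for m, _, c in arms)     # best cost-adjusted mean
    regret = 0.0
    for _ in range(rounds):
        if 0 in counts:
            i = counts.index(0)               # try each arm once
        elif rng.random() < eps:
            i = rng.randrange(len(arms))      # explore
        else:
            i = max(range(len(arms)), key=lambda j: means[j])  # exploit
        mean, std, cost = arms[i]
        payoff = rng.gauss(mean, std) - cost  # noisy, cost-adjusted feedback
        counts[i] += 1
        means[i] += (payoff - means[i]) / counts[i]
        regret += best - (mean - cost)        # expected per-round regret
    return regret, counts

# e.g. a cheap noisy sensor vs. an expensive accurate one
arms = [(1.0, 2.0, 0.1),   # cheap: low cost, high noise
        (1.2, 0.2, 0.5)]   # costly: accurate but expensive
regret, counts = cost_aware_bandit(arms, rounds=5000)
```

Here the cheap arm is actually better after accounting for cost (0.9 vs. 0.7), but its noisier signal makes it slower to identify — a small instance of the cost-versus-signal tension the text describes.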