Haosen Ge
Rethinking Algorithmic Fairness for Human-AI Collaboration arXiv
ITCS 2024
Haosen Ge; Hamsa Bastani; Osbert Bastani
Existing approaches to algorithmic fairness aim to ensure equitable outcomes if human decision-makers comply perfectly with algorithmic decisions. However, perfect compliance with the algorithm is rarely a reality or even a desirable outcome in human-AI collaboration. Yet, recent studies have shown that selective compliance with fair algorithms can amplify discrimination relative to the prior human policy. As a consequence, ensuring equitable outcomes requires fundamentally different algorithmic design principles that ensure robustness to the decision-maker's (a priori unknown) compliance pattern. We define the notion of compliance-robustly fair algorithmic recommendations that are guaranteed to (weakly) improve fairness in decisions, regardless of the human's compliance pattern. We propose a simple optimization strategy to identify the best performance-improving compliance-robustly fair policy. However, we show that it may be infeasible to design algorithmic recommendations that are simultaneously fair in isolation, compliance-robustly fair, and more accurate than the human policy; thus, if our goal is to improve the equity and accuracy of human-AI collaboration, it may not be desirable to enforce traditional algorithmic fairness constraints. We illustrate the value of our approach on criminal sentencing data before and after the introduction of an algorithmic risk assessment tool in Virginia.
Stochastic Online Conformal Prediction with Semi-Bandit Feedback arXiv
ICML 2025
Haosen Ge; Hamsa Bastani; Osbert Bastani
Conformal prediction has emerged as an effective strategy for uncertainty quantification by modifying a model to output sets of labels instead of a single label. These prediction sets come with the guarantee that they contain the true label with high probability. However, conformal prediction typically requires a large calibration dataset of i.i.d. examples. We consider the online learning setting, where examples arrive over time, and the goal is to construct prediction sets dynamically. Departing from existing work, we assume semi-bandit feedback, where we only observe the true label if it is contained in the prediction set. For instance, consider calibrating a document retrieval model to a new domain; in this setting, a user would only be able to provide the true label if the target document is in the prediction set of retrieved documents. We propose a novel conformal prediction algorithm targeted at this setting, and prove that it obtains sublinear regret compared to the optimal conformal predictor. We evaluate our algorithm on a retrieval task, an image classification task, and an auction price-setting task, and demonstrate that it empirically achieves good performance compared to several baselines.
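For intuition about the prediction sets described above, the following is a minimal sketch of standard split conformal prediction for classification. It is not the paper's semi-bandit algorithm: it assumes a fully labeled i.i.d. calibration set, which is exactly the assumption the paper relaxes. The function name and the synthetic scores are illustrative.

```python
import numpy as np

def split_conformal_sets(cal_scores, cal_labels, test_scores, alpha=0.1):
    """Build prediction sets from per-class scores (split conformal baseline).

    cal_scores: (n, k) model scores on a held-out calibration set
    cal_labels: (n,) true labels for the calibration set
    test_scores: (m, k) scores for new examples
    Returns one label set per test example; under i.i.d. sampling, each
    set contains the true label with probability about 1 - alpha.
    """
    n = len(cal_labels)
    # Nonconformity score: one minus the score assigned to the true label.
    nonconf = 1.0 - cal_scores[np.arange(n), cal_labels]
    # Conformal quantile with the finite-sample correction, capped at 1.
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(nonconf, q_level)
    # Include every label whose nonconformity falls below the threshold.
    return [set(np.where(1.0 - s <= q)[0]) for s in test_scores]
```

In the semi-bandit setting of the paper, the line computing `nonconf` is precisely what breaks: the true label is observed only when it lands inside the prediction set, so the threshold must instead be adjusted online.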
Generative AI Can Harm Learning SSRN
Hamsa Bastani; Osbert Bastani; Alp Sungu; Haosen Ge; Özge Kabakcı; Rei Mariman
Generative artificial intelligence (AI) is poised to revolutionize how humans work, and has already demonstrated promise in significantly improving human productivity. However, a key remaining question is how generative AI affects learning, namely, how humans acquire new skills as they perform tasks. This kind of skill learning is critical to long-term productivity gains, especially in domains where generative AI is fallible and human experts must check its outputs. We study the impact of generative AI, specifically OpenAI's GPT-4, on human learning in the context of math classes at a high school. In a field experiment involving nearly a thousand students, we deployed and evaluated two GPT-based tutors, one that mimics a standard ChatGPT interface (called GPT Base) and one with prompts designed to safeguard learning (called GPT Tutor). These tutors comprise about 15% of the curriculum in each of three grades. Consistent with prior work, our results show that access to GPT-4 significantly improves performance (48% improvement for GPT Base and 127% for GPT Tutor). However, we additionally find that when access is subsequently taken away, students actually perform worse than those who never had access (17% reduction for GPT Base). That is, access to GPT-4 can harm educational outcomes. These negative learning effects are largely mitigated by the safeguards included in GPT Tutor. Our results suggest that students attempt to use GPT-4 as a "crutch" during practice problem sessions, and when successful, perform worse on their own. Thus, to maintain long-term productivity, we must be cautious when deploying generative AI to ensure humans continue to learn critical skills.
Political Diversity in U.S. Police Agencies Journal Version
American Journal of Political Science (2025)
Bocar Ba, Haosen Ge, Jacob Kaplan, Dean Knox, Mayya Komisarchik, Gregory Lanzalotto, Rei Mariman, Jonathan Mummolo, Roman Rivera, and Michelle Torres
Partisans are increasingly divided on policing policy, which may affect officer behavior. We merge rosters from 99 of the 100 largest local U.S. agencies—over one third of local law enforcement nationwide—with voter files to study police partisanship. Police skew more Republican than their jurisdictions, with notable exceptions. Using fine-grained data in Chicago and Houston, we compare behavior by Democratic and Republican officers facing common circumstances. Overall, we find few partisan differences after correcting for multiple comparisons. But consistent with prior work, we find Black officers make fewer stops and arrests in Chicago, and they use force less in both cities. Comparing same-race Democratic and Republican officers, we find only that White Democrats make more violent-crime arrests than White Republicans in Chicago. Our results suggest that despite Republicans’ preference for more punitive law enforcement policy and their overrepresentation in policing, partisan divisions do not translate into detectable differences in on-the-ground enforcement.
Action vs. Attention Signals for Human-AI Collaboration: Evidence from Chess SSRN
Stefanos Poulidis; Haosen Ge; Hamsa Bastani; Osbert Bastani
Machine learning is increasingly employed to support human decision-makers by offering algorithmic advice in high-stakes domains such as healthcare, law, and finance. While most prior work has studied action signals, which recommend specific actions to decision-makers, many practical implementations actually rely on attention signals, which flag key decisions but do not prescribe a course of action. While superficially similar, attention signals provide a very different kind of information to the decision-maker—e.g., in hospitals, attention signals may trigger upon encountering high-risk patients, while action signals may suggest specific treatments for those patients. We study the impact of action and attention signals on human decision-making via an extensive behavioral experiment in the context of chess, a challenging and well-studied decision-making problem. We find that both signal types can effectively improve decision-making, with attention signals achieving at least 40% of the benefits of action signals. More interestingly, action and attention signals improve performance through very different mechanisms. Action signals improve decision-making only in the specific states where they are provided. However, they can also guide decision-makers into "uncharted waters," where they are unsure how to make effective decisions, thereby degrading performance. In contrast, attention signals, while requiring human effort to be effective, not only improve decision-making quality in the states where they are given but also have positive spillovers to subsequent states. Our findings have significant implications for the deployment of algorithmic signals to improve decision-making in practice.
Transformer Encoder for Social Science arXiv Huggingface
Haosen Ge; In Young Park; Xuancheng Qian; Grace Zeng
High-quality text data has become an important data source for social scientists. We have witnessed the success of pretrained deep neural network models, such as BERT and RoBERTa, in recent social science research. In this paper, we propose a compact pretrained deep neural network, Transformer Encoder for Social Science (TESS), explicitly designed to tackle text processing tasks in social science research. Using two validation tests, we demonstrate that TESS outperforms BERT and RoBERTa by 16.7% on average when the number of training samples is limited (<1,000 training instances). These results demonstrate the advantage of TESS over BERT and RoBERTa on social science text processing tasks. Lastly, we discuss the limitations of our model and offer advice for future researchers.
Curses or Blessings: How Low Asset Mobility Helps Foreign Firms Gain Government Support SSRN
Haosen Ge
Low asset mobility is often seen as undermining the bargaining power of foreign investors. This article advances an alternative view that emphasizes the positive effects of low asset mobility. I argue that governments favor foreign firms with lower mobility because their commitment to stay is always more credible. I present a formal model to illustrate how (1) governments’ preference for economic gains and (2) investment competition intensity determine the political effect of asset mobility. I empirically evaluate my theoretical predictions using two studies in China. First, leveraging a change in enterprise income tax law in 2008, I use a difference-in-differences design to examine the effect of ex post asset mobility on government treatment. Second, I field an original survey of foreign firms’ employees in China to test the theoretical mechanisms. My findings suggest that, on average, governments favor immobile foreign firms over their mobile peers. This study shows that the role of asset mobility in government–investor bargaining is more nuanced in this era of globalization.
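The difference-in-differences design mentioned above can be sketched in a few lines. This is a generic two-period, two-group estimator with made-up numbers, not the paper's specification or data; the function name and inputs are illustrative.

```python
import numpy as np

def did_estimate(y, treated, post):
    """Two-period difference-in-differences estimator.

    y: outcomes; treated: 1 if the unit is in the treated group
    (e.g., immobile firms); post: 1 if observed after the reform.
    Returns (treated post - treated pre) - (control post - control pre),
    which identifies the treatment effect under parallel trends.
    """
    y, treated, post = map(np.asarray, (y, treated, post))

    def group_mean(t, p):
        return y[(treated == t) & (post == p)].mean()

    return (group_mean(1, 1) - group_mean(1, 0)) - \
           (group_mean(0, 1) - group_mean(0, 0))
```

The key identifying assumption, as in any difference-in-differences design, is that treated and control firms would have followed parallel outcome trends absent the 2008 tax-law change.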
Measuring Regulatory Barriers Using Annual Reports of Firms Journal Version
Asian Review of Political Economy (2024)
Haosen Ge
Existing studies show that regulation is a major barrier to global economic integration. Nonetheless, identifying and measuring regulatory barriers remains a challenging task for scholars. I propose a novel approach to quantify regulatory barriers at the country-year level. Utilizing information from annual reports of publicly listed companies in the U.S., I identify regulatory barriers business practitioners encounter. The barrier information is first extracted from the text documents by a cutting-edge neural language model trained on a hand-coded training set. Then, I feed the extracted barrier information into a dynamic item response theory model to estimate the numerical barrier level of 40 countries between 2006 and 2015 while controlling for various channels of confounding. I argue that the results returned by this approach should be less likely to be contaminated by major confounders such as international politics. Thus, they are well-suited for future political science research.
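To illustrate the second stage of the pipeline above, here is a toy one-parameter (Rasch) item response estimator: given binary barrier mentions extracted from reports and fixed item difficulties, it recovers a latent barrier level by Newton's method. This is a simplification for intuition only; the paper uses a dynamic IRT model with additional structure, and the function name, inputs, and data are hypothetical.

```python
import numpy as np

def rasch_theta(responses, difficulties, iters=50):
    """Maximum-likelihood estimate of a latent level theta in a 1PL model.

    responses: binary indicators (1 = a report flags barrier item j)
    difficulties: known item difficulty parameters b_j
    Model: P(x_j = 1) = sigmoid(theta - b_j); theta solved by Newton steps.
    """
    x = np.asarray(responses, dtype=float)
    b = np.asarray(difficulties, dtype=float)
    theta = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))  # predicted mention rates
        grad = np.sum(x - p)                    # score function
        hess = -np.sum(p * (1.0 - p))           # second derivative (< 0)
        theta -= grad / hess                    # Newton update
    return theta
```

A country whose reports flag mostly hard-to-trigger items receives a higher estimated barrier level; the dynamic version additionally smooths these estimates over years.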