Working Papers

Transformer Encoder for Social Science
Haosen Ge; In Young Park; Xuancheng Qian; Grace Zeng

High-quality text data has become an important data source for social scientists. Pretrained deep neural network models, such as BERT and RoBERTa, have been used successfully in recent social science research. In this paper, we propose a compact pretrained deep neural network, Transformer Encoder for Social Science (TESS), explicitly designed to tackle text processing tasks in social science research. Using two validation tests, we demonstrate that TESS outperforms BERT and RoBERTa by 16.7% on average when the number of training samples is limited (<1,000 training instances). The results show the advantage of TESS over BERT and RoBERTa on social science text processing tasks. Lastly, we discuss the limitations of our model and present advice for future researchers.

AdroitRA: A Deep Learning Model for Information Extraction
Haosen Ge; Xuancheng Qian

For social science research, high-quality text data has become an important data source. However, the time-consuming process of extracting information from documents often makes using text for theory development and empirical testing too expensive for researchers. The AdroitRA model, which uses state-of-the-art technology from deep neural network research, is a low-cost, semi-automated alternative to human coding. Our model allows researchers to extract information from documents by formulating queries and asking questions. We also present an adversarial-network-based active learning approach for improving the model's performance across diverse corpus domains. We evaluate the model's performance on three popular and challenging datasets: SQuAD 2.0, NewsQA, and MASH-QA. The results demonstrate our model's effectiveness and versatility. Lastly, we apply our model to U.S. firms' annual reports, from which we extract information about the regulatory barriers firms face. We believe that our method helps open up new avenues for future research by lowering the cost of extracting information from documents.