I am a Ph.D. candidate in the SpeechLab at Columbia University, advised by Prof. Julia Hirschberg. I hold a Master of Science in Computer Science from Columbia University and a Bachelor of Mathematics, with majors in Computer Science, Statistics, and Actuarial Science, from the University of Waterloo.
My research centers on NLP and information disorder, including misinformation detection, malicious intent analysis, and content analysis. Currently, I'm focusing on LLM safety and alignment. Broadly, I am passionate about developing AI systems that prioritize ethical considerations and contribute to responsible AI deployment.
The rapid expansion of online content has intensified the issue of information redundancy, underscoring the need for solutions that can identify genuinely new information. Despite this challenge, the research community has seen a decline in focus on novelty detection, particularly with the rise of large language models (LLMs). Additionally, previous approaches have relied heavily on human annotation, which is time-consuming, costly, and particularly challenging when annotators must compare a target document against a vast number of historical documents. In this work, we introduce NovAScore (Novelty Evaluation in Atomicity Score), an automated metric for evaluating document-level novelty. NovAScore aggregates the novelty and salience scores of atomic information, providing high interpretability and a detailed analysis of a document’s novelty. With its dynamic weight adjustment scheme, NovAScore offers enhanced flexibility and an additional dimension to assess both the novelty level and the importance of information within a document. Our experiments show that NovAScore strongly correlates with human judgments of novelty, achieving a 0.626 Point-Biserial correlation on the TAP-DLND 1.0 dataset and a 0.920 Pearson correlation on an internal human-annotated dataset.
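At its core, NovAScore decomposes a document into atomic content units, scores each unit for novelty and salience, and aggregates those scores with salience-dependent weights. The Python sketch below illustrates one way such an aggregation could look; the `AtomicUnit` fields, the 0.5 salience cutoff, and the two weight constants are illustrative assumptions, not the paper's exact scheme.

```python
from dataclasses import dataclass

@dataclass
class AtomicUnit:
    text: str        # one atomic content unit extracted from the document
    novelty: float   # 1.0 if new relative to historical documents, else 0.0
    salience: float  # importance of the unit within the document, in [0, 1]

def novascore(units: list[AtomicUnit],
              salient_weight: float = 1.0,
              non_salient_weight: float = 0.3) -> float:
    """Aggregate atomic novelty into a document-level score.

    Salient units receive a larger weight, so novel-and-important
    information moves the score more than novel-but-trivial detail.
    Both weight constants are hypothetical placeholders.
    """
    weights = [salient_weight if u.salience >= 0.5 else non_salient_weight
               for u in units]
    total = sum(weights)
    return sum(w * u.novelty for w, u in zip(weights, units)) / total if total else 0.0
```

Because the per-unit novelty and salience scores survive the aggregation, a low document-level score can be traced back to the specific units that were redundant, which is where the metric's interpretability comes from.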
@inproceedings{ai-etal-2025-novascore,
  title     = {NovAScore: A New Automated Metric for Evaluating Document Level Novelty},
  author    = {Ai, Lin and Gong, Ziwei and Deshpande, Harshsaiprasad and Johnson, Alexander and Phung, Emmy and Emami, Ahmad and Hirschberg, Julia},
  booktitle = {Proceedings of the 31st International Conference on Computational Linguistics},
  publisher = {Association for Computational Linguistics},
  month     = jan,
  year      = {2025},
  address   = {Abu Dhabi, UAE},
}
Propaganda plays a critical role in shaping public opinion and fueling disinformation. While existing research primarily focuses on identifying propaganda techniques, it lacks the ability to capture the broader motives behind such content and the impact it has. To address these challenges, we introduce PropaInsight, a conceptual framework grounded in foundational social science research, which systematically dissects propaganda into techniques, arousal appeals, and underlying intent. PropaInsight offers a more granular understanding of how propaganda operates across different contexts. Additionally, we present PropaGaze, a novel dataset that combines human-annotated data with high-quality synthetic data generated through a meticulously designed pipeline. Our experiments show that off-the-shelf LLMs struggle with propaganda analysis, but training with PropaGaze significantly improves performance. Fine-tuned Llama-7B-Chat achieves 203.4% higher text span IoU in technique identification and 66.2% higher BertScore in appeal analysis compared to 1-shot GPT-4-Turbo. Moreover, PropaGaze complements limited human-annotated data in data-sparse and cross-domain scenarios, demonstrating its potential for comprehensive and generalizable propaganda analysis.
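To make the framework's three-part output and the span-overlap metric above concrete, here is a hedged Python sketch; the class and field names are hypothetical, and only `span_iou` follows the standard intersection-over-union definition for character spans.

```python
from dataclasses import dataclass

@dataclass
class TechniqueSpan:
    technique: str  # e.g., "loaded language" (illustrative label)
    start: int      # character offset where the evidencing span begins
    end: int        # exclusive end offset

@dataclass
class PropagandaAnalysis:
    techniques: list[TechniqueSpan]  # which rhetorical devices are used
    appeals: list[str]               # which arousal appeals they trigger
    intent: str                      # what the content ultimately tries to achieve

def span_iou(pred: TechniqueSpan, gold: TechniqueSpan) -> float:
    """Intersection-over-union of two character spans: the overlap
    measure behind the text span IoU numbers reported above."""
    inter = max(0, min(pred.end, gold.end) - max(pred.start, gold.start))
    union = (pred.end - pred.start) + (gold.end - gold.start) - inter
    return inter / union if union else 0.0
```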
@inproceedings{liu-etal-2025-propainsight,
  title     = {PropaInsight: Toward Deeper Understanding of Propaganda in Terms of Techniques, Appeals, and Intent},
  author    = {Liu, Jiateng and Ai, Lin and Liu, Zizhou and Karisani, Payam and Hui, Zheng and Fung, May and Nakov, Preslav and Hirschberg, Julia and Ji, Heng},
  booktitle = {Proceedings of the 31st International Conference on Computational Linguistics},
  publisher = {Association for Computational Linguistics},
  month     = jan,
  year      = {2025},
  address   = {Abu Dhabi, UAE},
}
Machine Reading Comprehension (MRC) poses a significant challenge in the field of Natural Language Processing (NLP). While mainstream MRC methods predominantly leverage extractive strategies using encoder-only models such as BERT, generative approaches face the issue of out-of-control generation, a critical problem where generated answers are often incorrect, irrelevant, or unfaithful to the source text. To address these limitations in generative models for MRC, we introduce the Question-Attended Span Extraction (QASE) module. Integrated during the fine-tuning phase of pre-trained generative language models (PLMs), QASE significantly enhances their performance, allowing them to surpass the extractive capabilities of advanced Large Language Models (LLMs) such as GPT-4. Notably, these performance gains come with no increase in computational demands. The efficacy of the QASE module has been rigorously tested across various datasets, consistently matching or surpassing state-of-the-art (SOTA) results.
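As a schematic illustration of the idea, the PyTorch sketch below shows what a question-attended span head might look like on top of a PLM's hidden states; the module name, the single attention layer, and all shapes are assumptions for illustration, not the paper's exact architecture.

```python
import torch.nn as nn

class QASEHead(nn.Module):
    """Illustrative question-attended span head: context token states attend
    to the question states, then per-token start/end logits are predicted."""

    def __init__(self, hidden: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        self.span = nn.Linear(hidden, 2)  # start and end logits per token

    def forward(self, context_h, question_h):
        # context_h: (batch, ctx_len, hidden); question_h: (batch, q_len, hidden)
        attended, _ = self.attn(context_h, question_h, question_h)
        start_logits, end_logits = self.span(attended).unbind(-1)
        return start_logits, end_logits
```

One plausible reading of the integration is that these span logits contribute an auxiliary extraction loss alongside the standard generation loss during fine-tuning, nudging the generator to stay faithful to answer-bearing spans.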
@inproceedings{ai-etal-2024-enhancing,
  title     = {Enhancing Pre-Trained Generative Language Models with Question Attended Span Extraction on Machine Reading Comprehension},
  author    = {Ai, Lin and Hui, Zheng and Liu, Zizhou and Hirschberg, Julia},
  booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  publisher = {Association for Computational Linguistics},
  year      = {2024},
  address   = {Miami, Florida, USA},
}
The proliferation of Large Language Models (LLMs) poses challenges in detecting and mitigating digital deception, as these models can emulate human conversational patterns and facilitate chat-based social engineering (CSE) attacks. This study investigates the dual capabilities of LLMs as both facilitators of and defenders against CSE threats. We develop SEConvo, a novel dataset that simulates CSE scenarios in academic and recruitment contexts and is designed to examine how LLMs can be exploited in these situations. Our findings reveal that, while off-the-shelf LLMs generate high-quality CSE content, their detection capabilities are suboptimal, leading to increased operational costs for defense. In response, we propose ConvoSentinel, a modular defense pipeline that improves detection at both the message and conversation levels, offering enhanced adaptability and cost-effectiveness. The retrieval-augmented module in ConvoSentinel identifies malicious intent by comparing messages to a database of similar conversations, enhancing CSE detection at all stages. Our study highlights the need for advanced strategies to leverage LLMs in cybersecurity.
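The Python sketch below outlines the two-level, retrieval-augmented structure described above; the component interfaces, the neighbor count k, and the threshold aggregation are illustrative assumptions rather than ConvoSentinel's actual implementation.

```python
from typing import Callable, Protocol

class SnippetRetriever(Protocol):
    def search(self, message: str, k: int) -> list[tuple[str, bool]]:
        """Return up to k similar conversation snippets with malicious labels."""
        ...

def message_level_flags(
    conversation: list[str],
    retriever: SnippetRetriever,
    judge: Callable[[str, list[tuple[str, bool]]], bool],
    k: int = 5,
) -> list[bool]:
    """Retrieval-augmented message analysis: each message is judged against
    similar labeled snippets pulled from a database of past conversations."""
    return [judge(msg, retriever.search(msg, k)) for msg in conversation]

def conversation_is_suspicious(flags: list[bool], threshold: float = 0.5) -> bool:
    """Conversation-level decision via a simple (hypothetical) vote share."""
    return bool(flags) and sum(flags) / len(flags) >= threshold
```

Separating the per-message judge from the conversation-level decision is what lets such a pipeline swap in cheaper components at the message level, which is one way to read the cost-effectiveness claim above.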
@inproceedings{ai-etal-2024-defending,
  title     = {Defending Against Social Engineering Attacks in the Age of LLMs},
  author    = {Ai, Lin and Kumarage, Tharindu and Bhattacharjee, Amrita and Liu, Zizhou and Hui, Zheng and Davinroy, Michael and Cook, James and Cassani, Laura and Trapeznikov, Kirill and Kirchner, Matthias and Basharat, Arslan and Hoogs, Anthony and Garland, Joshua and Liu, Huan and Hirschberg, Julia},
  booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  publisher = {Association for Computational Linguistics},
  year      = {2024},
  address   = {Miami, Florida, USA},
}
Open Information Extraction (OpenIE) is a crucial NLP task aimed at deriving structured information from unstructured text, unrestricted by relation type or domain. This survey provides an overview of OpenIE technologies from 2007 to 2024, emphasizing a chronological perspective absent from prior surveys. It examines the evolution of OpenIE task settings to align with recent technological advances. The paper categorizes OpenIE approaches into rule-based, neural, and pre-trained large language model-based methods, discussing each within a chronological framework. It also highlights the datasets and evaluation metrics currently in use. Building on this extensive review, the paper outlines potential future directions in terms of datasets, information sources, output formats, methodologies, and evaluation metrics.
@inproceedings{pai-etal-2024-survey,
  title     = {A Survey on Open Information Extraction from Rule-based Model to Large Language Model},
  author    = {Liu, Pai and Gao, Wenyang and Dong, Wenjie and Ai, Lin and Gong, Ziwei and Huang, Songfang and Li, Zongsheng and Hoque, Ehsan and Hirschberg, Julia and Zhang, Yue},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2024},
  publisher = {Association for Computational Linguistics},
  year      = {2024},
  address   = {Miami, Florida, USA},
}