Training on Synthetic Data
- Xu Guo, Zilin Du, Boyang Li, and Chunyan Miao. Generating Synthetic Datasets for Few-shot Prompt Tuning. The First Conference on Language Modeling (COLM). 2024.
- Zilin Du, Yunxin Li, Xu Guo, Yidan Sun, and Boyang Li. Training Multimedia Event Extraction With Generated Images and Captions. The ACM International Conference on Multimedia (ACM MM). 2023. [Code]
- Bosheng Ding, Chengwei Qin, Linlin Liu, Yew Ken Chia, Boyang Li, Shafiq Joty, and Lidong Bing. Is GPT-3 a Good Data Annotator? The Annual Meeting of the Association for Computational Linguistics (ACL). 2023.
Pruning Training Data
- Jaewoo Lee, Boyang Li, and Sung Ju Hwang. Concept-skill Transferability-based Data Selection for Large Vision-Language Models. The Conference on Empirical Methods in Natural Language Processing (EMNLP). 2024. [Code]
- Devaansh Gupta and Boyang Li. A Training Data Recipe to Accelerate A* Search with Large Language Models. Findings of the Conference on Empirical Methods in Natural Language Processing (EMNLP Findings). 2024.
Understanding Data
- Anthony Tiong, Junqi Zhao, Boyang Li, Junnan Li, Steven Hoi, and Caiming Xiong. What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases. The Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). 2024. [Dataset]
- Yuanyuan Chen, Boyang Li, Han Yu, Pengcheng Wu, and Chunyan Miao. Hydra: Hypergradient Data Relevance Analysis for Interpreting Deep Neural Networks. The AAAI Conference on Artificial Intelligence (AAAI). 2021. [Code]