Recent advances in instruction tuning for large language models (LLMs) suggest that a small, high-quality dataset can equip LLMs with strong instruction-following capabilities, outperforming training on large datasets that are often burdened by quality and redundancy issues. However, the challenge lies in automatically identifying the valuable subsets of a large dataset so as to boost both the effectiveness and efficiency of instruction tuning. In this paper, we first establish data selection criteria based on three distinct aspects of data value: diversity, difficulty, and dependability, and then propose the D3 method, which comprises two key steps: scoring and selection. Specifically, in the scoring step, we define a diversity function to measure sample distinctiveness and introduce the uncertainty-based prediction difficulty (UPD) to evaluate sample difficulty while mitigating the interference of context-oriented generation diversity. Additionally, we integrate an external LLM for dependability assessment. In the selection step, we formulate the D3 weighted coreset objective, which jointly optimizes the three aspects of data value to solve for the most valuable subset. The two steps of D3 can iterate over multiple rounds, incorporating feedback to adaptively refine the selection focus. Experiments on both public datasets and the real-world Taobao Live application demonstrate that D3 endows LLMs with competitive or even superior instruction-following capabilities using less than 10% of the entire dataset.
Figure 1: The overall framework of the D3 data selection method, including the warm-up step and the two key steps of scoring and selection.
As illustrated in Figure 1, our method comprises a warm-up step and two key steps: scoring and selection. We first utilize a tiny portion of the data to obtain an initially acclimated model \(F_0\). In the scoring step, we evaluate the samples in \(\mathcal{D}\) against three criteria. Specifically, we define a diversity function to measure sample distinctiveness and introduce the uncertainty-based prediction difficulty (UPD) to assess sample difficulty while mitigating the interference of context-oriented generation diversity. Additionally, we incorporate a teacher LLM \(F^*\) to evaluate sample dependability. In the selection step, we formulate the D3 weighted coreset objective to solve for the most valuable subset for LLM tuning within the selection budget. We integrate the scoring and selection steps into an iterative process that leverages feedback from the LLM to adaptively refine the selection focus over multiple rounds.
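To make the scoring and selection steps concrete, the following is a minimal Python sketch of one plausible instantiation. All names here (difficulty_proxy, greedy_d3_select), the use of mean token negative log-likelihood as a difficulty stand-in, and the multiplicative weighting of diversity gain by difficulty and dependability are illustrative assumptions, not the paper's exact formulation; in particular, UPD's mitigation of context-oriented generation diversity and the precise form of the D3 weighted coreset objective are simplified away.

import numpy as np

def difficulty_proxy(token_nlls: np.ndarray) -> float:
    """Simplified stand-in for the UPD score: mean token-level negative
    log-likelihood of the reference response under the warm-up model F_0.
    The actual UPD additionally mitigates context-oriented generation
    diversity, which this proxy does not model."""
    return float(token_nlls.mean())

def greedy_d3_select(emb, difficulty, dependability, budget,
                     alpha=1.0, beta=1.0):
    """Greedy heuristic for a weighted k-center-style coreset objective:
    at each step, pick the sample whose distance to the selected set,
    scaled by its difficulty and dependability weights, is largest.

    emb:            (n, d) array of sample embeddings
    difficulty:     (n,) difficulty scores, e.g. from difficulty_proxy
    dependability:  (n,) teacher-LLM dependability scores in [0, 1]
    budget:         number of samples to select
    """
    n = emb.shape[0]
    assert budget <= n
    weight = (difficulty ** alpha) * (dependability ** beta)
    selected = [int(np.argmax(weight))]   # seed with the highest-weight sample
    # Distance of every sample to its nearest selected neighbor.
    min_dist = np.linalg.norm(emb - emb[selected[0]], axis=1)
    while len(selected) < budget:
        gain = weight * min_dist          # weighted coverage gain
        gain[selected] = -np.inf          # never re-pick a selected sample
        nxt = int(np.argmax(gain))
        selected.append(nxt)
        min_dist = np.minimum(min_dist,
                              np.linalg.norm(emb - emb[nxt], axis=1))
    return selected

# Hypothetical usage: select 10% of 1,000 scored samples.
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 64))
difficulty = rng.uniform(size=1000)
dependability = rng.uniform(size=1000)
subset = greedy_d3_select(emb, difficulty, dependability, budget=100)

Under this reading, the iterative refinement would recompute the difficulty scores (and any feedback-dependent weights) with the re-tuned model between rounds and then rerun the greedy selection.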
Table 1: Performance comparisons of various data selection strategies across different datasets. The numbers in parentheses denote the sample sizes of the test sets. Higher winning scores and leaderboard metrics indicate better performance; the best results are highlighted in bold.
@misc{zhang2025d3,
title={D3: Diversity, Difficulty, and Dependability-Aware Data Selection for Sample-Efficient LLM Instruction Tuning},
author={Jia Zhang and Chen-Xi Zhang and Yao Liu and Yi-Xuan Jin and Xiao-Wen Yang and Bo Zheng and Yi Liu and Lan-Zhe Guo},
year={2025},
eprint={2503.11441},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2503.11441}
}