Fast training dataset attribution via in-context learning

Milad Fotouhi; Taha Bahadori; Oluwaseyi Feyisetan; Seyed Miran; David E. Heckerman

2024

We investigate the use of in-context learning and prompt engineering to estimate the contribution of training data to the outputs of instruction-tuned large language models (LLMs). We propose two novel approaches: (1) a similarity-based approach that measures the difference between LLM outputs with and without provided context, and (2) a mixture distribution model approach that frames the problem of identifying contribution scores as a matrix factorization task. Our empirical comparison demonstrates that the mixture model approach is more robust to retrieval noise in in-context learning and provides more reliable estimates of data contributions.
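To make the first idea concrete, here is a minimal sketch of a similarity-based comparison. It is not the authors' implementation. The abstract does not specify the similarity measure or how a difference is turned into a contribution score, so the sketch assumes a bag-of-words cosine similarity and reports only the raw shift between an output generated without context and outputs generated with context drawn from each candidate dataset. The function names, dataset names, and example strings are all hypothetical, and the LLM calls are replaced by fixed strings.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace-token counts (an assumed stand-in for a real embedding)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def output_shift(output_without_context, output_with_context):
    """1 - cosine similarity: near 0 if the output is unchanged, 1 if no tokens are shared."""
    sim = cosine_similarity(
        bag_of_words(output_without_context),
        bag_of_words(output_with_context),
    )
    return 1.0 - sim


def shifts_per_dataset(output_without_context, outputs_by_dataset):
    """Shift of the output when context from each candidate dataset is supplied."""
    return {
        name: output_shift(output_without_context, out)
        for name, out in outputs_by_dataset.items()
    }


# Hypothetical outputs; in practice these would come from the LLM under study.
baseline = "the capital of france is paris"
with_context = {
    "dataset_a": "the capital of france is paris",
    "dataset_b": "paris has been the french capital for centuries",
}
shifts = shifts_per_dataset(baseline, with_context)
```

How a shift of this kind should be interpreted as a contribution score is a modeling decision the abstract leaves open; the sketch only shows the with-context versus without-context comparison.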
