KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval

Publication

We study the ability of state-of-the-art models to answer constraint satisfaction queries for information retrieval (e.g., ‘a list of ice cream shops in San Diego’). In the past, such queries were considered tasks that could only be solved via web search or knowledge bases. More recently, large language models (LLMs) have demonstrated initial emergent abilities in this task. However, many current retrieval benchmarks are either saturated or do not measure constraint satisfaction. Motivated by rising concerns around factual incorrectness and hallucinations of LLMs, we present KITAB, a new dataset for measuring constraint satisfaction abilities of language models. KITAB consists of book-related data across more than 600 authors and 13,000 queries, and also offers an associated dynamic data collection and constraint verification approach for acquiring similar test data for other authors. Our extended experiments on GPT4 and GPT3.5 characterize and decouple common failure modes across dimensions such as information popularity, constraint types, and context availability. Results show that in the absence of context, models exhibit severe limitations as measured by irrelevant information, factual errors, and incompleteness, many of which worsen as information popularity decreases. While context availability mitigates irrelevant information, it is not helpful for satisfying constraints, identifying fundamental barriers to constraint satisfaction. We open-source our contributions to foster further research on improving the constraint satisfaction abilities of future models.

KITAB is available for download at microsoft/kitab · Datasets at Hugging Face.
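For readers who want to explore the data, here is a minimal sketch that downloads KITAB from the Hugging Face Hub using the `datasets` library. The split and field names shown are assumptions for illustration; check the dataset card on Hugging Face for the actual configuration.

```python
# Minimal sketch: load KITAB from the Hugging Face Hub and inspect a record.
# Assumes `pip install datasets`; the split name and printed fields are
# illustrative and may differ from the released dataset's schema.
from datasets import load_dataset

dataset = load_dataset("microsoft/kitab", split="train")  # split name assumed

print(dataset)     # schema and number of rows
print(dataset[0])  # one author/constraint query with its associated book data
```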

Evaluation and Understanding of Foundation Models

Presented by Besmira Nushi at Microsoft Research Forum, Episode 1

Besmira Nushi, Principal Researcher at Microsoft Research AI Frontiers, summarizes timely challenges and ongoing work on the evaluation and in-depth understanding of large foundation models.

Transcript

Evaluation and Understanding of Foundation Models

BESMIRA NUSHI: Hi, everyone. My name is Besmira Nushi, and together with my colleagues at Microsoft Research, I work on evaluating and understanding foundation models. In our team, we see model evaluation and understanding as a guide to AI innovation. Our work measures, informs, and accelerates model improvement and, at the same time, is a contribution that is useful to the scientific community for understanding and studying new forms and levels of intelligence.

But evaluation is hard, and new generative tasks are posing new challenges in evaluation and understanding. For example, it has become really difficult to scale up evaluation for long, open-ended, and generative outputs. At the same time, for emergent abilities, very often some benchmarks do not exist and often we have to create them from scratch. And even when they exist, they may be saturated or leaked into training datasets. In other cases, factors like prompt variability and model updates may be just as important as the quality of the model that is being tested in the first place. When it comes to end-to-end and interactive scenarios, other aspects of model behavior may get in the way and may interfere with task completion and user satisfaction. And finally, there exists a gap between evaluation and model improvement. 

In our work, we really see this as just the first step towards understanding new failure modes and new architectures through data and model understanding. So at Microsoft Research, when we address these challenges, we look at four important pillars. First, we build novel benchmarks and evaluation workflows. Second, we put a focus on evaluating interactive and multi-agent systems. And in everything we do, in every report that we write, we put responsible AI at the center of testing and evaluation to understand the impact of our technology on society. Finally, to bridge the gap between evaluation and improvement, we pursue efforts in data and model understanding.

But let’s look at some examples. Recently, in the benchmark space, we released KITAB. KITAB is a novel benchmark and dataset for testing constraint satisfaction capabilities on information retrieval queries that carry user-specified constraints. And when we tested recent state-of-the-art models with this benchmark, we noticed that these models are able to satisfy user constraints in only about 50 percent of cases.
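To make the kind of check KITAB performs concrete, below is a hypothetical sketch that verifies one constraint type (books published within a year range) against a model's returned list, given ground-truth publication data. The function, field, and metric names are illustrative only and are not the benchmark's actual code; they simply mirror the notions of constraint satisfaction, irrelevant information, and completeness discussed above.

```python
# Hypothetical sketch of a KITAB-style check for one query, e.g.
# "books by author X published between 1990 and 2000".
# Names and metric definitions are illustrative, not the benchmark's own code.

def verify_year_range(model_titles, ground_truth, lo=1990, hi=2000):
    """ground_truth maps every real title by the author to its publication year."""
    truth_satisfying = {t for t, y in ground_truth.items() if lo <= y <= hi}

    returned = set(model_titles)
    irrelevant = returned - set(ground_truth)                        # not by this author
    unsatisfied = (returned & set(ground_truth)) - truth_satisfying  # real books, constraint violated
    satisfied = returned & truth_satisfying

    return {
        "satisfaction": len(satisfied) / max(len(returned), 1),
        "irrelevance": len(irrelevant) / max(len(returned), 1),
        "completeness": len(satisfied) / max(len(truth_satisfying), 1),
    }

# Toy usage example:
truth = {"Book A": 1992, "Book B": 2005, "Book C": 1998}
print(verify_year_range(["Book A", "Book B", "Made-up Title"], truth))
```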

And similarly, in the multimodal space, Microsoft Research just released HoloAssist. HoloAssist is a testbed with extensive amounts of data that come from recording and understanding how people perform tasks in the real, physical world. And this provides us with an invaluable amount of resources in terms of evaluation for understanding and measuring how the new models are going to assist people in things like task completion and mistake correction. In the responsible AI area, ToxiGen is a new dataset designed to measure and understand toxicity generation from language models. And it is able to measure harms that may be generated from such models across 13 different demographic groups.

Similarly, in the multimodal space, we ran extensive evaluations to measure representational fairness and biases. For example, we tested several image generation models to see how they represent certain occupations, certain personality traits, and geographical locations. And we found that sometimes such models may present a major setback when it comes to representing different occupations if compared to real-world representation. For instance, in some cases, we see as low as 0 percent representation for certain demographic groups.  

Now, when it comes to data and model understanding, often what we do is look back at architectural and model behavior patterns to see how they are tied to important and common errors in the space. For example, for the case of constraint satisfaction for user queries, we looked at factual errors and information fabrication and mapped them to attention patterns. And we see that whenever factual errors occur, there are very weak attention patterns within the model that map to these errors. And this is an important finding that is going to inform our next steps in model improvement.
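As an illustration of the kind of attention analysis described here, the sketch below pulls per-layer attention maps from an open-weight model via the Hugging Face `transformers` API so that attention onto constraint tokens in a prompt can be inspected. The model choice and the aggregation over heads are assumptions for illustration; this is not the analysis pipeline used in the study, whose models are not open-weight.

```python
# Sketch: extract attention maps from an open-weight model to inspect how
# strongly later tokens attend back to constraint tokens in the prompt.
# Model name and aggregation are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in open model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "List books by Toni Morrison published between 1990 and 2000:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: one tensor per layer, shape [batch, heads, seq, seq].
# Average over heads in the last layer and see how much attention each
# prompt token receives in total.
last_layer = outputs.attentions[-1].mean(dim=1)[0]  # [seq, seq]
print(last_layer.sum(dim=0))
```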

So as we push the new frontiers in AI innovation, we are also just as excited about understanding and measuring that progress scientifically. And we hope that many of you are going to join us in that challenge.

Thank you.