The Garden of the Forking Paths: Why Do We Need to Focus on Cognitive Evaluation Methods for Large Language Models?

Maria Victoria Carro, Francisca Gauna, Facundo Nieto, Mario Leiva, Juan G. Corvalán, and Gerardo Simari

Today, advances in the linguistic abilities of large language models (LLMs) make it possible to test these AI systems on language-based assessments originally designed for people (Ivanova, 2023). As with humans, when researchers and users have only black-box access to an LLM, the main and most straightforward way to understand its capabilities is to interact with the model. Yet although running such assessments is relatively easy, building a reliable methodology that yields interpretable and valid results is not. Whether you work in industry, government, or academia, it’s crucial to ask: What factors should we consider when assessing model performance?

Conducting and analyzing a cognitive evaluation is like entering a garden of forking paths: along the way, researchers have many choices to make, and even though each choice may be small and seemingly innocuous, collectively they can have a substantial effect on the outcome (van der Lee et al., 2019). In LLM evaluation, these decisions include determining the cognitive capacity you wish to assess, selecting the tasks through which it will be tested, deciding whether a human or an artificial evaluator will conduct the assessment, and identifying the metrics against which the results will be gauged.
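
To make these forking paths concrete, it can help to write the design choices down explicitly before any model is queried. The following Python sketch is purely illustrative; the field names simply mirror the decisions listed above (capacity, tasks, evaluator, metrics), and the example values are hypothetical.

from dataclasses import dataclass
from typing import List

@dataclass
class EvaluationProtocol:
    """Illustrative record of the design choices behind a cognitive evaluation."""
    capacity: str        # cognitive capacity under study, e.g. "false-belief reasoning"
    tasks: List[str]     # tasks through which the capacity is probed
    evaluator: str       # "human" or "artificial" (model-based) evaluator
    metrics: List[str]   # how the results will be gauged
    rationale: str = ""  # why each path through the garden was chosen

# One hypothetical path through the garden:
protocol = EvaluationProtocol(
    capacity="false-belief reasoning",
    tasks=["unexpected-transfer vignettes", "unexpected-contents vignettes"],
    evaluator="human",
    metrics=["accuracy", "inter-rater agreement"],
    rationale="Human raters chosen because automatic judges were untested on this task.",
)
print(protocol)

Recording these choices up front, whatever form that record takes, makes the eventual results easier to interpret and reproduce.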

Another significant issue is the lack of clarity regarding the best choices at each step of the evaluation process. In fact, there is no consensus on how natural language generation (NLG) systems should be evaluated. Each aspect of the methodology is context-dependent, varying with the resources, goals, and specific needs of the user or researcher. In short, when navigating the garden, the best route depends on what you hope to find at the journey’s end.

LLMs find applications across diverse fields, including healthcare, law, education, and marketing. While the cognitive requirements vary across these domains, they share an overarching need for performance evaluations that are efficient, transparent, responsible, and minimally biased. However, methodological details are reported with remarkable variability. This complex and rapidly evolving scenario makes it difficult for newcomers to the evaluation arena, especially those without technical backgrounds, to identify which approach to take for a cognitive assessment.

For this reason, our multidisciplinary team, composed of people with both technical and non-technical backgrounds, has embarked on this research journey, in alignment with other researchers who advocate for treating AI assessment as a new science or discipline (Chang et al., 2023). We recognize the critical and challenging nature of evaluating AI systems, particularly in the era of sophisticated language and multimodal models. We believe that the most significant advancements arise from diverse, multidisciplinary collaborations; we therefore welcome partnerships with academics, industry professionals, policymakers, and other stakeholders.

Currently, our efforts are concentrated on understanding the state of the art and on developing and refining cognitive evaluation methods and protocols. Rather than showcasing the proficiency of these systems across various tasks and skills, we aim to contribute a comprehensive analysis of the key considerations involved in evaluating language models.

Understanding the Evolution and Importance of Evaluation Methods

Since the proposal of the Turing Test (1950), conversational interaction with humans has been a cornerstone of evaluating machine intelligence. However, as the field of natural language generation has advanced, we have come to understand that the mere act of a human or artificial entity discussing a topic does not necessarily indicate comprehension. As “stochastic parrots” (Bender et al., 2021), LLMs are capable of imitating human speech without necessarily understanding its meaning.

Consequently, as LLM technology evolves and we acquire a better understanding of how tools built with it work, evaluation methodologies need to adapt accordingly. Continuous improvement in evaluation methods ensures that they remain relevant and effective in assessing the performance of LLMs across various tasks and domains. But this relationship also runs in the opposite direction: through rigorous testing protocols, these evaluations offer insight into model performance, shedding light on strengths, weaknesses, and areas in which improvement is needed. In sum, reliable assessments improve our understanding of the technology, and our understanding of the technology improves the assessment methodology.

Bridging Theory and Practical Application

Effective evaluations can provide better guidance for human-LLM interaction, which could in the future inspire the design and implementation of better interaction mechanisms (Chang et al., 2023). For example, some benchmarks show that LLMs are sensitive to prompt design, an insight that is extremely valuable not only for a carefully designed assessment but also for everyday use.
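
As a rough illustration of what prompt sensitivity means in practice, the Python sketch below asks the same underlying question under several paraphrased prompts and measures how often the answers diverge. The query_model function is a hypothetical stand-in for whatever system is being evaluated, and the paraphrases are invented for the example.

from collections import Counter
from typing import Callable, List

def prompt_sensitivity(query_model: Callable[[str], str], prompts: List[str]) -> float:
    """Share of paraphrases whose (normalized) answer differs from the most common answer.

    Returns 0.0 when the model answers every paraphrase identically and
    approaches 1.0 as the answers fragment across paraphrases.
    """
    answers = [query_model(p).strip().lower() for p in prompts]
    most_common = Counter(answers).most_common(1)[0][1]
    return 1.0 - most_common / len(answers)

# Hypothetical usage with three paraphrases of the same question:
paraphrases = [
    "What is the capital of Australia?",
    "Name the capital city of Australia.",
    "Australia's capital city is:",
]
# sensitivity = prompt_sensitivity(my_model, paraphrases)  # my_model is assumed to exist

A high score on a check like this would warn both evaluators and everyday users that a single prompt is a fragile basis for judging what the model “knows.”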

Stepping out of the ivory tower, every time we generate language with AI, we are inherently evaluating its outputs for various purposes. Each use of a language model involves a series of decisions and judgments, including whether to accept, edit, or discard its output. To make this process as effective as possible, it is crucial for users to understand how language models function and to be aware of the limitations and opportunities presented by the way in which evaluation, whether formal or not, is conducted.

A well-known example is the work of “labelers” who are paid to indicate which of a model’s outputs they prefer as part of reinforcement learning from human feedback (RLHF). This procedure significantly influences model performance, as it helps train systems to align with human preferences. However, the preferences of these labelers may not always be representative. In the realm of human evaluations, researchers have investigated and offered recommendations for mitigating biases in judgments, which could be valuable for this labeling process. Nonetheless, the specific conditions under which companies conduct these training processes remain unclear.
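
One way to probe whether labeler preferences are consistent enough to stand in for “human preferences” in general is to measure how often annotators agree on the same pairwise comparisons. The Python sketch below computes raw pairwise agreement over invented preference records; the field names and data are illustrative only and do not describe any company’s actual RLHF pipeline.

from itertools import combinations
from typing import Dict

# Each labeler records, per comparison, which candidate output they preferred ("A" or "B").
# The data below is invented purely for illustration.
labels: Dict[str, Dict[str, str]] = {
    "labeler_1": {"pair_1": "A", "pair_2": "B", "pair_3": "A"},
    "labeler_2": {"pair_1": "A", "pair_2": "A", "pair_3": "A"},
    "labeler_3": {"pair_1": "B", "pair_2": "B", "pair_3": "A"},
}

def raw_agreement(labels: Dict[str, Dict[str, str]]) -> float:
    """Fraction of (labeler pair, shared comparison) combinations with matching preferences."""
    matches, total = 0, 0
    for a, b in combinations(labels.values(), 2):
        for pair_id in set(a) & set(b):
            total += 1
            matches += int(a[pair_id] == b[pair_id])
    return matches / total if total else 0.0

print(f"Raw pairwise agreement: {raw_agreement(labels):.2f}")  # ~0.56 on this toy data

Low agreement on data like this would suggest that the “preferences” being optimized are contested rather than universal, which is precisely the representativeness concern raised above.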

Evaluation Frameworks for Ensuring AI Safety

Additionally, AI evaluations play a pivotal role in informing policy and governance decisions. First, cognitive assessments reveal issues such as biases and hallucinations, enabling us to quantify their prevalence and impact. Second, they allow us to ascertain the state of the art and observe how language models evolve over time, independently of the technical specifications reported by proprietary companies.

Moreover, performance evaluations are components of the comprehensive audit frameworks proposed for the governance of LLMs. These frameworks address various legal provisions, one of which is the requirement of human oversight (contained, e.g., in the European Union Artificial Intelligence Act and Canada’s Artificial Intelligence and Data Act), aimed at ensuring human involvement in decision-making processes, commonly known as “human-in-the-loop” system design. However, the relationship between cognitive evaluations of LLMs and governance frameworks remains an open research question. Does mandating human supervision imply excluding the automatic cognitive assessments that currently dominate the field and have proven reliable in some cases? Or does the level of human participation in these evaluations suffice to meet the legal standard?

Conclusion

In this preliminary essay, we have elaborated on why language model evaluation methods are critical to addressing central challenges in the field of AI today. These protocols serve as foundational pillars in fostering transparency, accountability, and trust within the AI ecosystem. By establishing standardized evaluation methodologies and reporting frameworks, evaluators enhance the reproducibility and comparability of results, allowing for more informed decision-making by stakeholders.

We believe that the results of some AI evaluations fall well short of minimum reliability standards. Our commitment is to take impactful strides in this emerging area, enhancing the diversity and accessibility of LLM evaluations. We are eager to exchange opinions and ideas with fellow researchers and interested parties to enrich our project. If you are interested in contributing to this initiative, please contact us.

Together, we aim to drive forward progress and contribute to a more transparent and robust evaluation landscape. Stay tuned as we look forward to sharing our results with the community soon.