AI researchers and developers are increasingly turning to large language models (LLMs) to evaluate the responses of other LLMs, a process known as “LLM as a judge”. Unfortunately, the quality of these evaluations degrades on complex tasks such as long-form fact-checking, advanced coding, and math problems. Now, a research paper from the University of Cambridge and Apple outlines a system that augments AI judges with external validation tools to improve their...
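
The core idea of tool-augmented judging, letting the judge consult an external validator (for example a code sandbox, a fact-checker, or a math solver) before scoring a response, can be sketched roughly as below. This is an illustrative outline only, not the system described in the paper: the tool names, the routing by task type, and the placeholder `call_judge_llm` function are all assumptions.

```python
# A minimal, hypothetical sketch of "LLM as a judge" with external validation
# tools. All functions here are placeholders; a real system would call an LLM
# and real validators (sandboxed code execution, retrieval, a math solver).

from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Verdict:
    score: float      # 0.0 (poor) to 1.0 (excellent)
    rationale: str


def run_code_tool(response: str) -> str:
    """Placeholder: execute candidate code in a sandbox and return a report."""
    return "code tool report: (sandbox execution results would go here)"


def run_fact_check_tool(response: str) -> str:
    """Placeholder: check factual claims against an external source."""
    return "fact-check report: (retrieved evidence would go here)"


def run_math_tool(response: str) -> str:
    """Placeholder: recompute numeric answers with a solver."""
    return "math tool report: (recomputed results would go here)"


# Assumed routing table: pick a validator based on the task type.
TOOLS: Dict[str, Callable[[str], str]] = {
    "coding": run_code_tool,
    "factual": run_fact_check_tool,
    "math": run_math_tool,
}


def call_judge_llm(prompt: str) -> Verdict:
    """Placeholder for the judge model; a real system would call an LLM here."""
    return Verdict(score=0.5, rationale="(judge model output would go here)")


def judge_with_tools(task_type: str, question: str, response: str) -> Verdict:
    """Run the response through an external validator, then give the judge
    the tool's report alongside the question and candidate answer."""
    tool = TOOLS.get(task_type)
    tool_report = tool(response) if tool else "no external tool used"
    prompt = (
        f"Question:\n{question}\n\n"
        f"Candidate response:\n{response}\n\n"
        f"External validation:\n{tool_report}\n\n"
        "Score the response from 0 to 1 and explain your reasoning."
    )
    return call_judge_llm(prompt)


if __name__ == "__main__":
    verdict = judge_with_tools(
        task_type="math",
        question="What is 17 * 24?",
        response="17 * 24 = 408",
    )
    print(verdict.score, verdict.rationale)
```

The design choice being illustrated is that the judge never has to verify code, facts, or arithmetic purely from its own weights; it scores the response with a validator's report in its context.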

Read the full article at Neowin