NEW RESEARCH
LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error
In the same way that tools can help people complete tasks beyond their innate abilities, tools are essential for large language models (LLMs) to acquire up-to-date information and take consequential actions in external environments. Existing work on tool-augmented LLMs primarily focuses on the broad coverage of tools and the flexibility of adding new tools. However, a surprisingly understudied question is how accurately an LLM uses tools for which it has been trained.
In a recent paper: LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error, researchers from Microsoft find that existing LLMs, including GPT-4 and open-source LLMs specifically fine-tuned for tool use, only reach a correctness rate of 30% to 60%, which is too unreliable for practical use. They propose a biologically inspired method for tool-augmented LLMs – simulated trial and error (STE) – that orchestrates three key mechanisms: trial and error, imagination, and memory. STE simulates plausible scenarios for using a tool, then the LLM interacts with the tool to learn from its execution feedback. Both short-term and long-term memory are employed to improve the depth and breadth of the exploration. Experiments on ToolBench show STE substantially improves tool learning for LLMs under both in-context learning and fine-tuning settings.
Spotlight: Blog post
Eureka: Evaluating and understanding progress in AI
How can we rigorously evaluate and understand state-of-the-art progress in AI? Eureka is an open-source framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings. Learn more about the extended findings.
NEW RESEARCH
Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks
The latest LLMs have surpassed the performance of older language models on several tasks and benchmarks, sometimes approaching or even exceeding human performance. Yet, it is not always clear whether this is due to the increased capabilities of these models, or other effects, such as artifacts in datasets, test dataset contamination, and the lack of datasets that measure the true capabilities of these models.
As a result, research to comprehend LLM capabilities and limitations has surged of late. However, much of this research has been confined to English, leaving LLM building and evaluation for non-English languages relatively unexplored. Several new LLMs have been introduced recently, necessitating their evaluation on non-English languages. In a recent paper: MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks, researchers from Microsoft aim to perform a thorough evaluation of the non-English capabilities of state-of-the-art LLMs (GPT-3.5-Turbo, GPT-4, PaLM2, Mistral, Gemini, Gemma and Llama2) by comparing them on the same set of multilingual datasets. Their benchmark comprises 22 datasets covering 81 languages including several low-resource African languages. They also include two multimodal datasets in the benchmark and compare the performance of LLaVA-v1.5 and GPT-4-Vision. Experiments show that GPT-4 and PaLM2 outperform the Llama and Mistral models on various tasks, notably on low-resource languages, with GPT-4 outperforming PaLM2 on more datasets. However, issues such as data contamination must be addressed to obtain an accurate assessment of LLM performance on non-English languages.
NEW RESEARCH
Training Audio Captioning Models without Audio
Automated Audio Captioning (AAC) is a process that creates text descriptions for audio recordings. Unlike Closed Captioning, which transcribes speech, AAC aims to describe all sounds in the audio (e.g. : A muffled rumble with people talking in the background while a siren blares in the distance). Typical AAC systems require expensive curated data of audio-text pairs, which often results in a shortage of suitable data, impeding model training.
In this paper: Training Audio Captioning Models without Audio, researchers from Microsoft and Carnegie Mellon University propose a new paradigm for training AAC systems, using text descriptions alone, thereby eliminating the requirement for paired audio and text descriptions. Their approach leverages CLAP, a contrastive learning model that uses audio and text encoders to create a shared vector representation between audio and text. For instance, the text “siren blaring” and its corresponding audio recording would share the same vector. The model is trained on text captions: a GPT language decoder generates captions conditioned on the pretrained CLAP text encoder and a mapping network. During inference, audio input is first converted to its vector using the pretrained CLAP audio encoder and then a text caption is generated.
The researchers find that the proposed text-only framework competes well with top-tier models trained on both text and audio, proving that efficient text-to-audio conversion is possible. They also demonstrated the ability to incorporate various writing styles, such as humorous, beneficial for tailoring caption generation to specific fields. Finally, they highlight that enriching training with LLM-generated text leads to improved performance and has potential in increasing vocabulary diversity. This is a research project with Microsoft AI principles into practice. If the system is used in a manner that is abusive or illegal or infringes on your rights or the rights of other people, you can report it at the Report Abuse Portal (opens in new tab).