Skeleton-of-Thought: Parallel decoding speeds up and improves LLM output

已发布 2023年11月17日

作者 Xuefei Ning , Postdoctoral researcher Zinan Lin , Senior Researcher

分享这个页面

A figure showing the difference between the normal sequential decoding approach and the Skeleton-of-Thought approach. Given a question, the left part of the figure shows that the normal sequential decoding approach generates the answer sequentially from the beginning to the end. The right part of the figure shows that the Skeleton-of-Thought approach first prompts the LLM to give a skeleton of answer and then expand multiple points in the skeleton in parallel to get the final answer.

This research was accepted by the 2024 International Conference on Learning Representations.

Large language models (LLMs) such as LLaMA and OpenAI’s GPT-4 are revolutionizing technology. However, one of the common complaints about LLMs is their speed, or lack thereof. In many cases, it takes a long time to get an answer from them. This limits LLMs’ applications and their usefulness in latency-critical functions, such as chatbots, copilots, and industrial controllers.

Publication Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding

To address this question, researchers from Microsoft Research and Tsinghua University proposed Skeleton-of-Thought (SoT), a new approach to accelerate generation of LLMs. Unlike most prior methods, which require modifications on the LLM models, systems, or hardware, SoT treats LLMs as black boxes and can therefore be applied on any off-the-shelf open-source (e.g., LLaMA) or even API-based (e.g., OpenAI’s GPT-4) models. Our evaluation shows that not only does SoT considerably accelerate content generation among the 12 LLMs examined, it may also improve the answer quality in some cases. For example, on OpenAI’s GPT-3.5 and GPT-4, SoT provides 2x speed-up while improving the answer quality on benchmark datasets.

Our code and demo are open-sourced at https://github.com/imagination-research/sot/ (opens in new tab).

SoT - Figure 1: A screen recording of the demo of the normal sequential decoding approach and the Skeleton-of-Thought approach on answering the question “How can I improve my time management techniques?”. The left figure shows the generation process of normal sequential decoding, where the answers are generated word-by-word and. The right figure shows the generation process of Skeleton-of-Thought, where multiple points in the answer are generated in parallel and is therefore faster. After the generation of both methods is done, the generation time of both methods is displayed, which suggests that Skeleton-of-Thought provides 3.72x speed-up. — Figure 1: Compared to the vanilla approach (left), SoT (right) provides 3.72x speed-up on answering the question: “How can I improve my time management techniques?” with LLaMA-2-7b model on one NVIDIA A100 GPU.

SoT: Encouraging structured thinking in LLMs

The idea of SoT stems from the difference in how LLMs and humans process information. LLMs generate answers sequentially. For example, to answer “How can I improve my time management techniques?” in Figure 1 (left), the LLM finishes one point before moving to the next. In contrast, humans may not always think about questions and write answers sequentially. In many cases, humans first derive the skeleton of the answer and then add details to explain each point. For example, to answer the same question in Figure 1, a person might first think about a list of relevant time management techniques before digging into the details of each technique. This is especially the case for exercises like offering consultancy, taking tests, writing papers, and so on.

Can we make LLMs process information more dynamically and less linearly? As illustrated in Figure 2, SoT does the trick. Instead of generating answers sequentially, SoT decomposes the generation into two stages: (1) SoT first asks the LLM to derive a skeleton of the answer, and then (2) asks the LLM to provide the answer to each point in the skeleton. This method offers a new opportunity for acceleration, as the answers to separate points in stage 2 can be generated in parallel. This can be done for both local models, whose weights are accessible by the users (e.g., LLaMA), and API-based models which can only be accessed through APIs (e.g., OpenAI’s GPT-4).

For API-based models, we can issue parallel API requests, one for each point.
For models that are running locally, we can answer all points simultaneously in a batch. Note that in many scenarios (e.g., local service, centralized service within unsaturated query period), the decoding phase of LLMs is usually bottlenecked by weight loading instead of activation loading or computation, and thus underutilizes available hardware. In these cases, running LLM inference with increased batch sizes improves the hardware computation utilization and does not significantly increase latency.

Consequently, if there are B points in the answer, generating these points in parallel as in SoT can theoretically give up to Bx speed-up compared to sequential generation as in current LLMs. However, in practice, due to the extra skeleton stage, unbalanced point lengths, and other overheads, the actual speed-up can be smaller.

We test SoT on 12 recently released models, including nine open-source models and three API-based models as shown in Table 1. We use the Vicuna-80 dataset, which contains 80 questions spanning nine categories, such as coding, math, writing, roleplay, and so on. For results on more datasets and metrics, please refer to the paper: Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding.

SoT - Table 1: A table listing the models evaluated in the paper. There are four columns: (1) access, which means whether the model is open-source or API-based, (2) model name, (3) the institution who develops the model, and (4) the released date of the model. Therefore 12 rows in the table, corresponding to 9 open-source models and 3 API-based models. — Table 1: The list of evaluated models.

Figure 3 shows that SoT provides considerable speed-up across all models. In particular, SoT obtains a >2x speed-up (up to 2.39x) on eight out of 12 models. Moreover, SoT achieves this level of speed-up without significant answer quality degradation. Figure 4 shows the win/tie/lose rates of SoT (defined as the fraction of questions for which SoT provides better answers than the normal sequential generation; “better” is defined by metrics proposed in FastChat (opens in new tab) and LLMZoo (opens in new tab), and evaluated by the GPT-4 judge). We can see that the quality of SoT’s answer is on par with the sequential generation. In the paper, we further show that SoT improves answer quality in terms of relevance to the question and comprehensiveness across multiple aspects, thanks to the skeleton stage, which explicitly requires LLMs to plan the answer structure in advance.

SoT - Figure 3: A bar plot showing the average speed-up of Skeleton-of-Thought on different models. Each bar corresponds to one model. The figure shows that Skeleton-of-Thought provides speed-up for all models. The speed-up ranges from 1.13x to 2.39x. — Figure 3: Average speed-ups of SoT on different models. SoT provides speed-up for all models we evaluated.

SoT - Figure 4: A plot showing the win/tie/lose rates of Skeleton-of-Thought compared with the normal sequential generation using metrics from FastChat and LLMZoo. For FastChat metric, the win/tie/lose rates are 29.5%, 29.3%, and 41.2%. For LLMZoo metric, the win/tie/lose rates are 45.8%, 19.6%, and 34.5%. In summary, Skeleton-of-Thought performs better than or equal to normal sequential generation in around 60% of cases. — Figure 4: Win/tie/lose rates of SoT vs. normal generation using metrics from FastChat and LLMZoo. SoT performs better than or equal to normal generation in around 60% of cases.

SoT-R: Adaptively triggering SoT

SoT conducts an independent and parallel expansion of points. Consequently, it is not suitable for questions that require step-by-step reasoning, such as math and coding. To this end, we propose an extension of SoT called SoT with Router (SoT-R) that adaptively triggers SoT only when it is suitable. More specifically, we propose a router module that decides if SoT should be applied for the user request and then we call either SoT or normal sequential decoding, accordingly. The router module can be implemented either by prompting OpenAI’s GPT-4 without any model training (called “prompting router”) or training a specified RoBERTa model (called “trained router”). We show that SoT-R improves SoT’s universality across question categories (Figure 5) while maintaining considerable speed-up (Figure 6).

SoT - Figure 5: A plot showing the net win rates (defined as win rates minus lose rates) of SoT and SoT-R on different question categories on Vicuna-80 dataset. For the question categories not suitable for SoT (e.g., coding, math), SoT-R has net win rates around 0 as SoT-R learns to fall back to the normal generation mode. For question categories that are suitable for SoT (e.g., generic, counterfactual), SoT-R has similar net win rates as SoT and SoT-R triggers SoT as expected. Overall, SoT-R improves upon SoT and maintains good answer quality for all question categories. — Figure 5: The net win rates (defined as win rates minus lose rates – higher is better) of SoT and SoT-R on different question categories on Vicuna-80. “Human router” refers to the oracle that uses human preference to decide whether SoT should be applied for each question. For the question categories not suitable for SoT (e.g., coding, math), SoT-R learns to fall back to the normal generation mode. Consequently, SoT-R can maintain good answer quality for all question categories. SoT-R could even surpass the human router (e.g., on roleplay questions).

SoT - Figure 6: A plot showing the speed-ups of SoT and SoT-R on different models of Vicuna-80. While SoT-R has smaller speed-up than SoT, SoT-R can still keep >1 speed-up for most models. — Figure 6: The speed-ups of SoT and SoT-R on different models of Vicuna-80. SoT-R can keep >1 speed-up for most models.

Limitations and next steps

Reducing the cost of SoT

SoT uses longer prompts compared to normal sequential decoding, which can lead to higher costs on LLM APIs that are charged by the length of the prompt and potentially reduce the throughput of the serving system. Further research is needed to explore reducing the cost of SoT, including compressing SoT prompts or tuning the LLMs to automatically trigger SoT when necessary.

Improving LLM capability

SoT is inspired by the structured thinking of humans. As such, its success warrants further examination of the human thinking process that could facilitate more effective and efficient AI.

Given the sophistication of human thinking, which we might envision as more of a graph, SoT can be viewed as an interim step towards a “Graph of Thoughts” (GoT). GoT is a new framework that aims to represent more complex thinking that more closely approximates the way people solve problems. GoT presents multiple concepts connected in a graph structure, in which the graph’s edges represent the dependencies, and each point is decoded based on the contents of its ancestor points. In addition, instead of complying with a static graph, we expect the need to have a dynamic Graph-of-Thought, where the high-level thought structure is adjusted dynamically by LLMs themselves, attempting to mimic how humans think. This could potentially combine the efficiency and global thinking advantages of SoT while capturing more complexity in the thinking process.

Another question to examine is how the more structured answers from SoT can be used to fine-tune LLMs to enhance their ability to generate well-organized and comprehensive answers.

Data-centric efficiency optimization

In contrast to existing model- and system-level efforts to boost LLM efficiency, SoT takes a novel “data-level” pathway by letting the LLM organize its output content. This perspective is becoming feasible and increasingly relevant, owing to the evolving capabilities of state-of-the-art LLMs. We hope this work can stimulate more research in the realm of data-centric optimization for efficiency.

Please feel free to contact us if you’d like to discuss this research and potential collaborations on this topic.