MInference: Accelerating Pre-filling for Long-context LLMs via Dynamic Sparse Attention
May 2024
MInference 1.0 leverages the dynamic sparse nature of LLMs’ attention, which exhibits some static patterns, to speed up the pre-filling for long-context LLMs. It first determines offline which sparse pattern each head belongs to, then approximates the sparse index online…
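To make the two-stage idea concrete, below is a minimal, illustrative sketch in PyTorch of dynamic sparse attention for pre-filling: an offline step that tags each head with a sparse pattern family, and an online step that cheaply approximates which attention entries to keep before computing attention only on them. The helper names (`assign_head_patterns`, `build_sparse_mask`, `sparse_prefill_attention`), the toy pattern choices, and all parameters are assumptions for demonstration; this is not the actual MInference implementation or API.

```python
# Illustrative sketch only; names and patterns are assumptions, not the MInference API.
import torch

def assign_head_patterns(num_heads):
    # Offline step (assumed): tag each head with a static sparse pattern family.
    # Here we simply alternate between two toy patterns.
    return ["vertical_slash" if h % 2 == 0 else "block_sparse" for h in range(num_heads)]

def build_sparse_mask(pattern, q, k, keep=64, block=16):
    # Online step (assumed): cheaply estimate which attention entries matter for
    # this head under its pattern, and return a boolean (query x key) mask.
    n = q.shape[0]
    mask = torch.zeros(n, n, dtype=torch.bool)
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    if pattern == "vertical_slash":
        # Keep a few globally salient key columns plus the most recent diagonals.
        col_scores = q.mean(0) @ k.T                      # crude per-key importance
        cols = col_scores.topk(min(keep, n)).indices
        mask[:, cols] = True
        idx = torch.arange(n)
        for off in range(min(keep, n)):                   # "slash": recent diagonals
            mask[idx[off:], idx[off:] - off] = True
    else:  # "block_sparse"
        # Keep the highest-scoring key blocks per query block, using pooled scores.
        nb = (n + block - 1) // block
        qb = torch.stack([q[i * block:(i + 1) * block].mean(0) for i in range(nb)])
        kb = torch.stack([k[i * block:(i + 1) * block].mean(0) for i in range(nb)])
        top = (qb @ kb.T).topk(min(4, nb)).indices        # few key blocks per query block
        for i in range(nb):
            mask[i * block:(i + 1) * block, i * block:(i + 1) * block] = True  # own block
            for j in top[i].tolist():
                mask[i * block:(i + 1) * block, j * block:(j + 1) * block] = True
    return mask & causal

def sparse_prefill_attention(q, k, v, pattern):
    # Compute attention only on the approximated sparse index set.
    scores = (q @ k.T) / q.shape[-1] ** 0.5
    mask = build_sparse_mask(pattern, q, k)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

if __name__ == "__main__":
    torch.manual_seed(0)
    n, d, heads = 256, 64, 2
    for h, pat in enumerate(assign_head_patterns(heads)):
        q, k, v = (torch.randn(n, d) for _ in range(3))
        out = sparse_prefill_attention(q, k, v, pat)
        print(f"head {h} ({pat}): output {tuple(out.shape)}")
```

In this toy version the dense score matrix is still materialized for clarity; the point of the real approach is that, once the sparse index is approximated, only the selected entries need to be computed, which is where the pre-filling speedup comes from.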