Apple engineers have described a collaboration with Nvidia that improved the performance of text generation with large artificial intelligence language models.

Image source: developer.nvidia.com

Earlier this year, Apple published the source code for its Recurrent Drafter (ReDrafter), a new method for accelerating text generation with large language models. It achieves high speed by combining two techniques: beam search and dynamic tree attention. Apple's research project showed compelling results, but deploying ReDrafter in production meant integrating the technology into Nvidia's TensorRT-LLM, a framework that allows large language models to run faster on Nvidia accelerators.
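The core idea behind ReDrafter is speculative decoding: a small, cheap draft model proposes several tokens ahead, and the large model verifies them in one pass, accepting the longest matching prefix. The sketch below illustrates that draft-and-verify loop with trivial stand-in models; the model functions and token scheme are invented for illustration and are not Apple's or Nvidia's code.

```python
# Toy sketch of speculative (draft-and-verify) decoding, the general idea
# behind ReDrafter. Both "models" are hypothetical stand-ins over integer
# tokens, chosen only to make accept/reject behavior easy to see.

def draft_model(context, k):
    # Hypothetical cheap drafter: proposes the next k tokens by counting up.
    return [context[-1] + 1 + i for i in range(k)]

def target_model(context):
    # Hypothetical expensive model: also counts up, but skips multiples
    # of 5, which forces occasional draft rejections.
    nxt = context[-1] + 1
    return nxt + 1 if nxt % 5 == 0 else nxt

def speculative_step(context, k=4):
    """Draft k tokens, then verify them against the target model.

    Accepts the longest prefix where draft and target agree, then appends
    one corrected token from the target model. Returns the new context and
    the number of tokens produced this step (accepted drafts + 1).
    """
    drafts = draft_model(context, k)
    accepted = []
    for tok in drafts:
        if target_model(context + accepted) == tok:
            accepted.append(tok)
        else:
            break
    # One verified token always comes from the target model itself, so a
    # step never produces fewer tokens than plain autoregressive decoding.
    accepted.append(target_model(context + accepted))
    return context + accepted, len(accepted)

context, produced = speculative_step([1])
# With these toy models: drafts 2, 3, 4 are accepted, 5 is rejected,
# and the target model supplies 6 → four tokens in a single step.
```

When drafts are mostly accepted, each verification pass of the large model yields several tokens instead of one, which is where the speedup comes from.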

Performance measurements showed that when running large language models with tens of billions of parameters on the Nvidia TensorRT-LLM framework with ReDrafter, token generation sped up by a factor of 2.7. The technology therefore reduces the delay between a user entering a request and receiving the model's response, while using fewer accelerators and consuming less power, Apple engineers concluded.

"Large language models are increasingly used in applications, and improving inference efficiency can impact computational costs and reduce latency for users. With ReDrafter's new approach to speculative execution integrated into the Nvidia TensorRT-LLM framework, developers can now generate tokens faster on Nvidia accelerators for their applications," Apple added.
