Innovation is moving at a rapid pace
A GPU is not just a single chip—it’s a sophisticated system comprising multiple processors, short-term caches, long-term memory, communication switches, specialized functional modules, and task controllers that coordinate operations across all components. Performance and efficiency gains can be achieved not only within each of these elements but also through their integration and the optimization of the algorithms they run.
Nvidia CEO Jensen Huang highlighted this synergy with a striking example: training a 1.8 trillion-parameter AI model on the new Blackwell GPUs would require just 4 megawatts of power, compared with 15 megawatts on the previous Hopper architecture. That is roughly a 73% reduction in energy use, even though each Blackwell GPU individually consumes about 70% more power.
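For reference, the reduction follows directly from the two figures quoted above:

\[
\frac{15\ \text{MW} - 4\ \text{MW}}{15\ \text{MW}} \approx 0.73,
\]

i.e. the same training run uses roughly a quarter of the energy.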
These massive gains in raw chip performance have been paired with architectural advances and faster interconnects such as NVLink.
This step change was recently illustrated by CoreWeave's MLPerf Training v5.0 results, which showcase the real-world impact of these architectural advances, particularly on the Blackwell platform. In collaboration with NVIDIA and IBM, CoreWeave deployed the largest-ever NVIDIA GB200 NVL72 cluster, using 2,496 NVIDIA Blackwell GPUs across 39 racks. This massive deployment achieved breakthrough performance on the Llama 3.1 405B benchmark, completing training in just 27.3 minutes and delivering more than 2x faster training than NVIDIA Hopper-based systems of similar cluster size.
Similarly, AMD has improved energy efficiency 38-fold since 2020 with its Instinct MI350 Series GPUs and now suggests this can be improved a further 20-fold by 2030 through rack-scale improvements. DeepSeek's headline announcements in January focused on techniques such as PTX programming, which allows more fine-grained control over GPU instruction execution and therefore more efficient GPU usage.
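To make the PTX idea concrete, here is a minimal, hypothetical sketch of inline PTX inside a CUDA kernel. It is not DeepSeek's code, and the kernel name scale_fma and its parameters are illustrative only; it simply shows how a developer can specify the exact instruction to execute, here a fused multiply-add, rather than leaving the choice to the compiler.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative only: computes out[i] = a * x[i] + y[i] using an explicit
// PTX fused multiply-add instead of relying on the compiler to emit one.
__global__ void scale_fma(const float* x, const float* y, float* out,
                          float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float r;
        // Inline PTX: fma.rn.f32 d, a, b, c  ->  d = a * b + c (round to nearest)
        asm volatile("fma.rn.f32 %0, %1, %2, %3;"
                     : "=f"(r)
                     : "f"(a), "f"(x[i]), "f"(y[i]));
        out[i] = r;
    }
}

int main() {
    const int n = 1024;
    float *x, *y, *out;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    scale_fma<<<(n + 255) / 256, 256>>>(x, y, out, 3.0f, n);
    cudaDeviceSynchronize();

    printf("out[0] = %.1f\n", out[0]);  // expected 5.0 = 3 * 1 + 2
    cudaFree(x); cudaFree(y); cudaFree(out);
    return 0;
}
```

In practice, this level of control is reserved for performance-critical inner loops, where hand-picked instructions and careful register and memory scheduling can outperform what the compiler generates on its own.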