23 Comments

  • Threska - Tuesday, March 21, 2023 - link

    Buy one of these and have ChatGPT at home.
  • satai - Tuesday, March 21, 2023 - link

    To run it? Yes (but actually no, you probably wouldn't get the weights; the closest thing you can get your hands on is probably to torrent LLaMA).

    To infer it? No.
  • brucethemoose - Tuesday, March 21, 2023 - link

    I'm sure the community will cook up some low-memory finetuning schemes for LLaMA or whatever else catches on. Alpaca was already tuned with relatively modest hardware.
  • satai - Tuesday, March 21, 2023 - link

    LLaMA is now down to 4-bit weights, and I guess this is the end for now.
    So now you can live with something like a 20GB or 24GB card for medium-size models.

    So, running such a model is quite possible. (The issue is that you probably won't get such a model in a licence-friendly way.) To infer such a model... oh, that's a bit of a different story for now, and probably for years (decades?) to come.
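
    As a rough sanity check on the 20GB/24GB figure: weight memory is just parameter count times bits per weight. A minimal sketch in Python, using the published LLaMA parameter counts; the numbers ignore activations and the KV cache, which add a few GB on top.

    ```python
    # Rough VRAM needed just to hold the quantized weights, ignoring runtime overhead.
    LLAMA_SIZES_B = {"7B": 6.7, "13B": 13.0, "30B": 32.5, "65B": 65.2}  # billions of params

    def weight_gib(params_billion: float, bits_per_weight: int) -> float:
        """GiB needed to store the weights at a given quantization width."""
        return params_billion * 1e9 * bits_per_weight / 8 / 2**30

    for name, params in LLAMA_SIZES_B.items():
        for bits in (16, 8, 4):
            print(f"LLaMA-{name} @ {bits}-bit: {weight_gib(params, bits):5.1f} GiB")

    # LLaMA-30B at 4-bit works out to roughly 15 GiB of weights, which is why a
    # 20-24GB card is a plausible fit once activations and the KV cache are added.
    ```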
  • brucethemoose - Tuesday, March 21, 2023 - link

    It's much less when rounded down to 4 bits, and probably even less with some clever swapping or frameworks like DeepSpeed, though the larger LLaMA models are better.

    Maybe I am misinterpreting what you mean by "infer," but finetuning Stable Diffusion with LoRA can comfortably squeeze onto 6GB cards (and less comfortably onto 4GB); for reference, inference eats around 3.3GB.
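
    For context on why LoRA finetuning fits on such small cards: instead of updating the full weight matrices, it trains two small low-rank matrices per layer and freezes everything else. A minimal plain-PyTorch sketch of the idea; the 4096x4096 dimensions and rank 8 are made up for illustration, not taken from Stable Diffusion.

    ```python
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """A frozen linear layer plus a trainable low-rank update: y = Wx + scale * B(A(x))."""
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False          # the original weights stay frozen
            self.lora_a = nn.Linear(base.in_features, rank, bias=False)
            self.lora_b = nn.Linear(rank, base.out_features, bias=False)
            nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op
            self.scale = alpha / rank

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.base(x) + self.lora_b(self.lora_a(x)) * self.scale

    # Illustrative dimensions only: a 4096x4096 projection with rank-8 adapters.
    layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    total = sum(p.numel() for p in layer.parameters())
    print(f"trainable params: {trainable:,} of {total:,}")  # ~65K of ~16.8M
    ```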
  • satai - Tuesday, March 21, 2023 - link

    You can use such a model on a mainstream card, but you still can't construct it. So we are still dependent on whoever provides (leaks?) the already-computed model.
  • brucethemoose - Tuesday, March 21, 2023 - link

    Yep.

    But I think the probability of good LLMs being released (or leaked) is high.

    And again, Stable Diffusion is a good example of what happens after that. SD 1.5 on its own is antiquated compared to the newer Midjourney, DALL-E and such, but with the endless community finetunes and extensions, it blows the cloud models away.
  • atomek - Wednesday, March 22, 2023 - link

    Inferencing is actually "running it". You train the network, and inferencing is executing inputs on the trained model.
  • p1esk - Tuesday, March 21, 2023 - link

    The big question is whether the memory is exposed as a unified 188GB or as 2x94GB. I mean, how will it show up in PyTorch?
  • Ryan Smith - Tuesday, March 21, 2023 - link

    The answer to that is: however PyTorch would treat a dual-H100 setup today. That part is unchanged.
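
    In other words, the two halves of the H100 NVL would presumably show up as two separate CUDA devices, just like any dual-GPU box today. A minimal sketch of what that looks like from PyTorch; nothing here is H100-specific, and it assumes a machine with two CUDA devices.

    ```python
    import torch

    # Each physical GPU appears as its own device with its own memory pool.
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"cuda:{i} -> {props.name}, {props.total_memory / 2**30:.0f} GiB")

    # Tensors live on one device at a time; spanning both memory pools is up to the
    # application (e.g. sharding a model), not something PyTorch does transparently.
    a = torch.randn(1024, 1024, device="cuda:0")
    b = torch.randn(1024, 1024, device="cuda:1")
    c = a @ b.to("cuda:0")   # an explicit copy across the NVLink/PCIe link is required
    ```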
  • p1esk - Tuesday, March 21, 2023 - link

    That is disappointing.
  • mode_13h - Wednesday, March 22, 2023 - link

    It's not surprising. The 600 GB/s link between cards is a mere 15% of the onboard bandwidth. If software naively treated it as a single GPU, performance would be garbage.

    Conversely, I'm sure it's now well-supported for software to divide up big networks across multiple GPUs and align the division with a layer boundary. If you do it that way, the NVLink is probably no bottleneck at all.
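
    A toy sketch of the layer-boundary split described above: the first half of the blocks lives on one GPU, the second half on the other, so only a single activation tensor crosses the link per forward pass. The model is a made-up stack of linear blocks for illustration, not an actual LLM.

    ```python
    import torch
    import torch.nn as nn

    class TwoGPUPipeline(nn.Module):
        """Toy model parallelism: split a stack of blocks across cuda:0 and cuda:1."""
        def __init__(self, dim: int = 1024, depth: int = 8):
            super().__init__()
            def block() -> nn.Sequential:
                return nn.Sequential(nn.Linear(dim, dim), nn.GELU())
            half = depth // 2
            self.stage0 = nn.Sequential(*[block() for _ in range(half)]).to("cuda:0")
            self.stage1 = nn.Sequential(*[block() for _ in range(depth - half)]).to("cuda:1")

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            x = self.stage0(x.to("cuda:0"))
            # Only this activation tensor crosses the inter-GPU link, not the weights.
            x = self.stage1(x.to("cuda:1"))
            return x

    model = TwoGPUPipeline()
    out = model(torch.randn(4, 1024))
    print(out.shape, out.device)   # torch.Size([4, 1024]) cuda:1
    ```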
  • abufrejoval - Wednesday, April 19, 2023 - link

    While I understand your sentiment, it's physics.

    But playing around with LLaMA on two V100s (which can't do all the nice low-precision weight tricks), I noticed very little slowdown from adding the 2nd V100, which in my case also only shared the PCIe bus.

    I guess the explanation is that LLaMA, like the other LLM models, is *already* split into many graphs to manage the weight updates during training, which after all happens on thousands of GPUs connected via InfiniBand at best, because there is nothing else readily available to OpenAI.

    So the penalty and pain typically associated with models that outgrow a single GPU's memory space have already been worked around as much as possible in the current breed of LLMs, and that is why they don't deteriorate as much as you'd think when they are spread out for inference, too.

    I was even surprised to see how well LLaMA tolerated having some graphs moved to the CPU as well. Unfortunately, my hardware (and the current PyTorch release) doesn't yet support the lower-precision weights there (and you can't seem to mix weight precisions across graphs due to software constraints), but that is about to change, as CPU vendors don't want to lose out on the opportunity and would like to play their RAM-size card.

    I think I could load the 13B (or 30B?) LLaMA into 768GB of RAM on a 28-core Skylake, but it ran at one letter per second, if that; rather impractical.

    30B is much more fun on my RTX 3090 with 24GB VRAM and 4-bit weights, and I believe I have seen code for 3-bit weights, too.

    I did not get to play around with that yet.
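
    One common way to get this kind of automatic GPU/GPU/CPU splitting is the Hugging Face accelerate-style device_map. A hedged sketch below: the checkpoint path and the memory limits are placeholders, and the transformers and accelerate packages are assumed to be installed.

    ```python
    # Assumes: pip install torch transformers accelerate
    # The checkpoint path is hypothetical; LLaMA weights are not freely redistributable.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "path/to/local/llama-13b"   # placeholder local checkpoint

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",                      # spread layers over GPUs, then CPU RAM
        max_memory={0: "15GiB", 1: "15GiB", "cpu": "200GiB"},  # illustrative limits
    )

    inputs = tokenizer("The H100 NVL is", return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
    ```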
  • brucethemoose - Tuesday, March 21, 2023 - link

    "12x the GPT3-175B inference throughput"

    That is a very interesting claim, as GPT-3 is a closed source model with precisely 1 user: OpenAI.

    The open source models are kinda being cobbled together into usable repos as I type this.
  • p1esk - Tuesday, March 21, 2023 - link

    The model architecture is well known; it was described in the paper. The weights have not been released, but they are not needed to measure the hardware performance.
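
    That is also how one could benchmark a "GPT-3-class" model without the actual weights: instantiate the published architecture (96 layers, 12288 hidden size, 96 heads for the 175B model) with random parameters and time the forward passes. A rough sketch below, using transformers' GPT-2 implementation as a stand-in architecture and a scaled-down config so it runs on a single ordinary GPU.

    ```python
    import time
    import torch
    from transformers import GPT2Config, GPT2LMHeadModel

    # Randomly initialized weights are fine for throughput measurements; only the
    # shapes matter. Scaled-down config here so it fits on one ordinary GPU.
    config = GPT2Config(n_layer=12, n_embd=1024, n_head=16, n_positions=2048)
    model = GPT2LMHeadModel(config).half().cuda().eval()

    tokens = torch.randint(0, config.vocab_size, (8, 512), device="cuda")
    with torch.no_grad():
        model(tokens)                      # warm-up
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(10):
            model(tokens)
        torch.cuda.synchronize()

    elapsed = time.perf_counter() - start
    print(f"{10 * tokens.numel() / elapsed:,.0f} tokens/s (prefill, random weights)")
    ```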
  • mode_13h - Wednesday, March 22, 2023 - link

    And who do you think supplies hardware to OpenAI?
  • p1esk - Wednesday, March 22, 2023 - link

    Microsoft
  • mode_13h - Thursday, March 23, 2023 - link

    For one thing, they only bought a controlling stake a couple of months ago.

    I'm talking about the hardware GPT-3 was developed on, last year.

    BTW, I'm sure Microsoft is just buying Nvidia GPUs. I've read they're doing stuff with FPGAs, but AFAIK MS has no comparable hardware solution for such large models.
  • abufrejoval - Wednesday, April 19, 2023 - link

    OpenAI is hosted on Azure. Have a look here:
    https://www.nextplatform.com/2023/03/21/inside-the...
  • brucethemoose - Tuesday, March 21, 2023 - link

    Also, AMD's and Intel's upcoming XPUs are a surprisingly good fit for this, depending on how much RAM they can take. Just being able to *fit* a mega model in a single memory pool is huge, even if they don't have the raw throughput of an HBM H100.
  • DigitalFreak - Wednesday, March 22, 2023 - link

    "customers aren’t getting access to quite all 96GB per card."

    I see Nvidia is up to their old tricks again. LOL
  • puttersonsale - Monday, March 27, 2023 - link

    They do this product but they axe SLI/NVLink?

    They could totally support SLI; they are just trying to save $$$ on having to update drivers and so on.
  • Morawka - Sunday, April 9, 2023 - link

    AFAIK Nvidia killing off SLI had more to do with how modern game engines render 3D graphics than any desire to cut costs for their driver development team.
