Machine Learning and Inference Setup on AMD Instinct MI210 with VMware ESXi 8

Jul 18, 2024 · 30 min

Introduction

This document aims to help you use your AMD Instinct MI210 for machine learning tasks and model inference. Before you can do that, a few things must be set up, and they are crucial to using your GPU. This document assumes that you have already set up your hardware and installed VMware ESXi 8 on it.

Setup

The AMD drivers and ROCm (Radeon Open Compute) currently support a limited number of guest operating systems on VMware. At the time of writing, the supported combinations are:

Hypervisor   Version   GPU     Validated guest OS (kernel)
VMware       ESXi 8    MI210   Ubuntu 20.04 (5.15 HWE), SLES 15 SP4 (5.14.21)
VMware       ESXi 7    MI210   Ubuntu 20.04 (5.15 HWE), SLES 15 SP4 (5.14.21)

So for our setup we’ll create an Ubuntu 20.04 virtual machine in the VMware console with the following specifications:

  • 8 Cores of CPU
  • 64 GB RAM
  • 500 GB of Storage

Follow the prompts and finish setting up your OS.

  • Add the GPU as a PCI device. Choose Dynamic to distribute resources between the host machine and the VM, or Direct to dedicate the GPU entirely to the VM.
  • Set the following advanced parameters to allow PCI passthrough to the VM:
 pciPassthru.64bitMMIOSizeGB: 128 # double the VRAM of the GPU (64 GB on the MI210)
 pciPassthru.use64bitMMIO: TRUE
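If your GPU has a different amount of VRAM, the "double the VRAM" rule above can be sketched as a tiny calculation. The function name mmio_size_gb is made up for this example, and scaling linearly with the number of passed-through GPUs is an assumption, not something validated in this guide.

```shell
# Sketch: derive the 64-bit MMIO window from GPU VRAM using the
# "double the VRAM" rule from this guide. mmio_size_gb is a hypothetical name.
mmio_size_gb() {
    local vram_gb="$1"
    local gpu_count="${2:-1}"   # assumption: scale linearly per passed-through GPU
    echo $(( vram_gb * gpu_count * 2 ))
}

# The MI210 has 64 GB of VRAM, so a single card gets a 128 GB window.
echo "pciPassthru.use64bitMMIO: TRUE"
echo "pciPassthru.64bitMMIOSizeGB: $(mmio_size_gb 64)"
```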

Once the machine has rebooted and is accessible, note that Ubuntu 20.04 ships with an older kernel by default, and our setup needs the Hardware Enablement (HWE) kernel. The first step is to upgrade the kernel.

Upgrading The Kernel

Part 1

We can check the current kernel on our OS with uname -r. Ideally you want the Hardware Enablement (HWE) kernel for 20.04, so let’s get started.

To upgrade the kernel we need to know which options are available. To get that list, run the following commands:

apt-cache search linux-image | grep generic # To check for available kernels
apt-cache search linux-image | grep generic-hwe # To check for hardware enabled kernels

First we’ll install the latest 5.15 kernel from the list of available kernels. At the time of writing it’s 5.15.0-113-generic. The exact version may differ for you, so it is best that you check. PS: Don’t install the unsigned versions of the kernel.

For the hardware enabled kernels choose the one that is neither edge nor a dummy transitional package.
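To make the selection rules above less error-prone, here is a minimal sketch that filters the apt-cache search output. The helper name filter_kernel_candidates is hypothetical, and it assumes transitional packages mention "transitional" in their description, which you should confirm on your system.

```shell
# Hypothetical helper: read `apt-cache search` lines on stdin and print only
# package names that are signed, non-edge, non-transitional kernel images.
filter_kernel_candidates() {
    grep -vi 'transitional' |      # assumption: dummy packages say "transitional"
        awk '{print $1}' |         # keep only the package name
        grep -E '^linux-image-' |
        grep -v -e 'unsigned' -e '-edge$'
}

# Typical use on the guest:
#   apt-cache search linux-image | filter_kernel_candidates | grep 5.15
```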

sudo apt-get install linux-image-<version>-generic # In my case linux-image-5.15.0-113-generic
sudo apt-get install linux-image-generic-hwe-20.04

Once we have them installed, it’s time to change the kernel version.

Part 2

We will use the grub bootloader to do this:

Run the following command. It opens the GRUB configuration file in the nano text editor:

sudo nano /etc/default/grub

In the grub configuration file make the following changes:

GRUB_TIMEOUT_STYLE=menu  # it was "hidden" by default
GRUB_TIMEOUT=10          # it was "0" by default

These changes make the GRUB menu appear automatically while booting (you don't need to press the ESC or Shift keys) and wait 10 seconds.
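If you would rather not edit the file by hand, the same two changes can be sketched with sed. The helper name set_grub_menu is hypothetical, and it only rewrites lines that already exist in the file; it does not add missing settings.

```shell
# Hypothetical helper: apply the two GRUB settings to a config file.
# Try it on a copy of /etc/default/grub first.
set_grub_menu() {
    local file="$1" tmp status
    tmp="$(mktemp)" || return 1
    sed -e 's/^GRUB_TIMEOUT_STYLE=.*/GRUB_TIMEOUT_STYLE=menu/' \
        -e 's/^GRUB_TIMEOUT=.*/GRUB_TIMEOUT=10/' "$file" > "$tmp" &&
        cat "$tmp" > "$file"     # overwrite contents, keeping file permissions
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

For example, you could run cp /etc/default/grub /tmp/grub, then set_grub_menu /tmp/grub, review the result, and copy it back with sudo cp /tmp/grub /etc/default/grub.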

Then, update the GRUB bootloader configuration:

sudo update-grub

When you run the command sudo update-grub, it scans the system for installed operating systems and kernels, generates a new GRUB configuration file, and updates the menu entries in the GRUB bootloader. This ensures that the boot menu shows the correct options and that the system can boot to the correct operating system or kernel version.

Then reboot the system:

sudo reboot

If you accessed your machine via SSH, head back to vSphere (the Web or Remote Console) so that you can see the GRUB menu on reboot.

While rebooting, the GRUB menu appears on the screen automatically.

Press Enter on the Advanced options for Ubuntu entry, then choose the kernel version you want to boot. Your system will then load the chosen kernel, and uname -r will show the kernel version you selected while rebooting.
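A quick way to confirm the result after the reboot is to test the output of uname -r against the 5.15 series. This is only a sketch, and the helper name is_kernel_5_15 is made up for this example.

```shell
# Hypothetical helper: succeed only if a kernel release string is 5.15.x.
is_kernel_5_15() {
    case "$1" in
        5.15.*) return 0 ;;
        *)      return 1 ;;
    esac
}

if is_kernel_5_15 "$(uname -r)"; then
    echo "Running a 5.15 kernel: $(uname -r)"
else
    echo "Running $(uname -r); expected a 5.15 kernel"
fi
```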

Great, That's done!!!...

Now for clean up

You can make the specific kernel version default while booting. If the kernel version you want to make default is already running on your system (as shown by the output of uname -r) and you would like to prevent it from being upgraded when using sudo apt update && sudo apt upgrade, you can use the apt-mark command to mark the package as hold:

sudo apt-mark hold linux-image-<version>-generic
sudo apt-mark hold linux-image-generic-hwe-20.04

You can make sure it was held by running this:

apt-mark showhold

This prevents the package from being automatically upgraded or removed. To refresh the boot menu, run sudo update-grub again. Note that update-grub only regenerates the menu; which entry boots by default is controlled by the GRUB_DEFAULT setting in /etc/default/grub.

sudo update-grub
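To pin a specific kernel as the default boot entry, GRUB_DEFAULT in /etc/default/grub can name the submenu and the entry, separated by ">". The sketch below only prints such a line. The helper name grub_default_line is hypothetical, and the menu titles are an assumption based on Ubuntu's usual naming, so compare them with your own /boot/grub/grub.cfg before using the result.

```shell
# Hypothetical helper: print a GRUB_DEFAULT line for a given kernel release.
# Assumption: the menu titles follow Ubuntu's usual naming; confirm with
#   grep -E "menuentry |submenu " /boot/grub/grub.cfg
grub_default_line() {
    printf 'GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux %s"\n' "$1"
}

grub_default_line "5.15.0-113-generic"
```

Put the printed line in /etc/default/grub in place of the existing GRUB_DEFAULT line, then run sudo update-grub once more.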

Remove older kernels that are not in use by running the following command:

sudo apt-get purge <kernel-version>

This frees up disk space and ensures that only the kernels you intend to boot remain installed. You should now have the HWE kernel and the latest 5.15 version installed.
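Before purging anything, it helps to list which versioned kernel images are not the one you are running. Here is a minimal sketch; the helper name old_kernel_images is hypothetical, and it only prints candidates, it does not remove them.

```shell
# Hypothetical helper: from package names on stdin, print the versioned
# kernel images that do NOT belong to the given kernel release.
old_kernel_images() {
    local keep="$1"
    grep -E '^linux-image-[0-9]' | grep -v -F -- "$keep"
}

# Typical use on the guest (review the list before purging anything):
#   dpkg -l 'linux-image-*' | awk '/^ii/ {print $2}' | old_kernel_images "$(uname -r)"
```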

Installing The AMD ROCm Drivers on your machine

Now that we have the required OS and kernel, we can install the compatible drivers. At the time of writing, the latest release on the ROCm website is 6.1.2.

sudo apt update
sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"
sudo usermod -a -G render,video $LOGNAME # Add the current user to the render and video groups
wget https://repo.radeon.com/amdgpu-install/6.1.2/ubuntu/focal/amdgpu-install_6.1.60102-1_all.deb
sudo apt install ./amdgpu-install_6.1.60102-1_all.deb

This sets up the installer for the AMD GPU drivers, installs your Linux headers, and adds your user to the render and video groups.
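Group changes made with usermod only apply to new login sessions, so it is worth checking that they are active. A small sketch, with a made-up helper name has_gpu_groups:

```shell
# Hypothetical helper: check that a space-separated group list (as printed
# by `id -nG`) contains both render and video.
has_gpu_groups() {
    case " $1 " in *" render "*) ;; *) return 1 ;; esac
    case " $1 " in *" video "*)  ;; *) return 1 ;; esac
    return 0
}

if has_gpu_groups "$(id -nG)"; then
    echo "render and video membership is active"
else
    echo "log out and back in (or reboot) so the new groups apply"
fi
```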

The next thing is to install the use-cases you’ll need for your environment.

Here’s a command to install some use-cases that will be used in this document.

sudo amdgpu-install --usecase=dkms,graphics,multimedia,opencl,hip,hiplibsdk,rocm

You might also need the CUDA toolkit in your programs, so install it:

sudo apt install nvidia-cuda-toolkit

Then edit ~/.profile and add the following line:

export HSA_ENABLE_SDMA=0
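The ~/.profile change can also be scripted so that running it twice does not duplicate the line. This is a sketch under two assumptions: the helper name add_sdma_setting is hypothetical, and the variable is written with export so that programs started from the login shell can see it.

```shell
# Hypothetical helper: add the setting to a profile file only if it is missing.
# Assumption: `export` is used so child processes inherit the variable.
add_sdma_setting() {
    local profile="${1:-$HOME/.profile}"
    local line='export HSA_ENABLE_SDMA=0'
    grep -qxF -- "$line" "$profile" 2>/dev/null || printf '%s\n' "$line" >> "$profile"
}
```

Calling add_sdma_setting with no argument targets ~/.profile; pass another path to try it on a scratch file first.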

Then reboot:

sudo reboot

Testing and Troubleshooting

Once rebooted you should check whether you can see your GPU stats by running:

rocm-smi

If the output shows your GPU temperature, VRAM, and GPU usage, you are set!
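If rocm-smi shows nothing, a first troubleshooting step is to check whether the device nodes used by ROCm exist at all. The sketch below looks for /dev/kfd and a render node under /dev/dri; the helper name rocm_devices_present and its optional directory argument are made up for this example.

```shell
# Hypothetical helper: check for the device nodes the ROCm stack uses.
# The optional argument is the device directory (default: /dev).
rocm_devices_present() {
    local dev="${1:-/dev}" node
    [ -e "$dev/kfd" ] || return 1
    for node in "$dev"/dri/renderD*; do
        [ -e "$node" ] && return 0
    done
    return 1
}

if rocm_devices_present; then
    echo "Found /dev/kfd and a render node"
else
    echo "GPU device nodes missing; check the passthrough settings and the amdgpu driver"
fi
```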

Using your GPU to run an LLM
