Accelerated ML experiments on MicroK8s with InAccel FPGA Operator and Kubeflow Katib

1. Overview

What is MicroK8s?

MicroK8s is the simplest production-grade upstream Kubernetes.

What is InAccel FPGA Operator?

InAccel FPGA Operator is a cloud-native method to standardize and automate the deployment of all the necessary components for provisioning FPGA-enabled Kubernetes systems. FPGA Operator delivers a universal accelerator orchestration and monitoring layer, to automate scalability and lifecycle management of containerized FPGA applications on any Kubernetes cluster.

The FPGA Operator lets cluster admins manage their remote FPGA-powered servers the same way they manage CPU-based systems, and lets regular users target particular FPGA types and explicitly consume FPGA resources in their workloads. This makes it easy to bring up a fleet of remote systems and run accelerated applications without additional technical expertise on the ground.

What is Kubeflow Katib?

Kubeflow Katib is a Kubernetes-native project for automated machine learning (AutoML). Katib supports hyperparameter tuning, early stopping and neural architecture search (NAS).

Katib is agnostic to machine learning (ML) frameworks. It can tune hyperparameters of applications written in any language of the users’ choice and natively supports many ML frameworks, such as TensorFlow, MXNet, PyTorch, XGBoost, and others.

Hyperparameters are the variables that control the model training process. They include:

  • The learning rate.

  • The number of layers in a neural network.

  • The number of nodes in each layer.

What you’ll learn

  • Deploy and configure instances with MicroK8s on AWS F1

  • Install InAccel FPGA Operator

  • Install and configure Kubeflow Katib using Charmhub

  • Run FPGA accelerated hyperparameter tuning examples using the Katib user interface (UI)

What you’ll need

  • Account credentials for AWS

  • Some basic command-line knowledge




2. AWS F1 configuration

To interact with AWS services using commands in your command-line shell, you will need aws and jq CLI tools. Install them by running:

sudo apt-get update
sudo apt-get install awscli jq

Add credentials

The AWS CLI can read credentials from the following environment variables, which may already be set on your client system:

AWS_ACCESS_KEY_ID, AWS_DEFAULT_REGION, AWS_SECRET_ACCESS_KEY
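
A minimal sketch of setting these variables for the current shell session. The key values are placeholders to replace with your own, and us-east-1 is only an assumed example region; use a region where your account can launch F1 instances.

```shell
# Placeholder credentials: replace with the values from your own AWS account.
export AWS_ACCESS_KEY_ID="<AccessKeyId>"
export AWS_SECRET_ACCESS_KEY="<SecretAccessKey>"
# Assumed region, for illustration only; pick one that offers F1 instances.
export AWS_DEFAULT_REGION="us-east-1"

# Report which of the required variables are set, without echoing secrets.
for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_DEFAULT_REGION; do
    if [ -n "$(printenv "$name")" ]; then
        echo "$name is set"
    else
        echo "$name is missing"
    fi
done
```

Variables exported this way last only for the current shell session.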

Launch an FPGA instance

To launch an Amazon EC2 F1 instance, use the aws ec2 run-instances command.

aws ec2 run-instances \
    --block-device-mappings DeviceName=/dev/sda1,Ebs={VolumeSize=32} \
    --image-id resolve:ssm:/aws/service/canonical/ubuntu/server/bionic/stable/current/amd64/hvm/ebs-gp2/ami-id \
    --instance-type f1.2xlarge \
    --key-name <KeyName> \
| jq -r '.Instances[0].InstanceId'

List your instance

You can use the AWS CLI to list your instance and view information about it. The following example shows how to use the aws ec2 describe-instances command to output the PublicIpAddress of your instance.

aws ec2 describe-instances \
    --instance-ids <InstanceId> \
| jq -r '.Reservations[0].Instances[0].PublicIpAddress'
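
A freshly launched instance may not have a public IP address until it reaches the running state. As a sketch, the helper below (the function name instance_public_ip is hypothetical) first blocks on aws ec2 wait instance-running and then reads the address with the CLI's built-in --query filter, used here as an alternative to jq.

```shell
# Wait until the instance is running, then print its public IP address.
# $1 is the instance ID returned by `aws ec2 run-instances`.
instance_public_ip() {
    aws ec2 wait instance-running --instance-ids "$1" &&
    aws ec2 describe-instances \
        --instance-ids "$1" \
        --query 'Reservations[0].Instances[0].PublicIpAddress' \
        --output text
}

# Example (INSTANCE_ID is a hypothetical variable holding your instance ID):
#   PUBLIC_IP="$(instance_public_ip "$INSTANCE_ID")"
```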

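Connect to your instance

Connect over SSH. The -L option forwards port 8080 on your local machine to port 8080 on the instance, so that the Katib UI exposed later in this tutorial is reachable from your local browser.

ssh -L 8080:localhost:8080 -o StrictHostKeyChecking=no ubuntu@<PublicIpAddress>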

3. Install MicroK8s on Ubuntu

The following steps describe how to create, configure and launch a single-node Kubernetes cluster on an Ubuntu Linux system.

You need MicroK8s version 1.23 or later to enable and run the InAccel add-on.

  1. Install MicroK8s by running the following command:

    sudo snap install microk8s --classic
    
  2. Add your user to the microk8s group.

    sudo usermod -aG microk8s $USER
    
  3. Run the following command to activate the changes to groups:

    newgrp microk8s
    
  4. Check the status while Kubernetes starts.

    microk8s status --wait-ready
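
To check the version requirement mentioned above, a small comparison helper can be sketched as follows. The function name version_at_least is hypothetical, and the commented usage assumes that the second column of the snap list output holds the installed version.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_at_least() {
    have="${1#v}"         # drop a leading "v", as in "v1.24.0"
    required="${2#v}"
    # The smaller of the two versions sorts first; it must be the required one.
    [ "$(printf '%s\n' "$required" "$have" | sort -V | head -n 1)" = "$required" ]
}

# Example, assuming the second column of `snap list` is the version:
#   installed="$(snap list microk8s | awk 'NR==2 {print $2}')"
#   version_at_least "$installed" 1.23 && echo "MicroK8s is new enough"
```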
    

4. InAccel FPGA Operator

The FPGA Operator allows administrators of Kubernetes clusters to manage FPGA nodes just like CPU nodes in the cluster. Instead of provisioning a special OS image for FPGA nodes, administrators can rely on a standard OS image for both CPU and FPGA nodes and then rely on the FPGA Operator to provision the required software components for FPGAs.

InAccel FPGA Operator is already built into MicroK8s as a community add-on. This means once you install MicroK8s, you can enable InAccel straight away.

microk8s enable community # MicroK8s version >= 1.24
microk8s enable inaccel --wait

Note that the FPGA Operator is specifically useful for scenarios where the Kubernetes cluster needs to scale quickly - for example provisioning additional FPGA nodes on the cloud and managing the lifecycle of the underlying software components.
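
Once the add-on is enabled, one way to confirm that the node advertises an FPGA is to look for the FPGA resource name in the node description. This is only a sketch: the function name is hypothetical, and the default resource name xilinx/aws-vu9p-f1 is the one requested by the trial template later in this tutorial.

```shell
# Print the node description lines that mention the FPGA resource.
# $1 optionally overrides the resource name to search for.
fpga_resource_lines() {
    microk8s kubectl describe nodes | grep -F "${1:-xilinx/aws-vu9p-f1}"
}

# Example:
#   fpga_resource_lines || echo "no FPGA resource advertised yet"
```

If nothing is printed, the operator pods may still be starting.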


5. Getting Started with Kubeflow Katib

Prerequisites

Your Kubernetes cluster must have dynamic volume provisioning for the Katib DB component. The local storage service can be enabled by running the microk8s enable command:

microk8s enable storage

Installing Charmed Katib

These are the steps you need to install and deploy Katib with Charmed Operators and Juju on MicroK8s:

  1. Install the Juju client.

    Juju is an operator lifecycle manager (OLM) for clouds, bare metal or Kubernetes. We will be using it to deploy and manage the components which make up Kubeflow Katib.

    As with MicroK8s, Juju is installed from a snap package:

    sudo snap install juju --classic
    
  2. Create a Juju controller.

    Juju has built-in knowledge of MicroK8s and how it works, so no additional setup or configuration is needed. All we need to do is run the command to deploy a Juju controller to the Kubernetes cluster we set up with MicroK8s:

    juju bootstrap microk8s
    

    The controller is Juju’s agent, running on Kubernetes, which can be used to deploy and control the components of Katib. You can read more about controllers in the Juju documentation.

  3. Create a new model.

    A model in Juju is a blank canvas where your operators will be deployed, and it holds a 1:1 relationship with a Kubernetes namespace.

    You need to create a model and give it the name katib, with the juju add-model command:

    juju add-model katib
    
  4. Deploy the Katib bundle.

    Run the following command to deploy Katib with the main components:

    juju deploy katib
    

    Juju will now fetch the applications and begin deploying them to the MicroK8s cluster.

  5. Create a namespace for running experiments, with Katib Metrics Collector enabled:

    microk8s kubectl create namespace kubeflow
    microk8s kubectl label namespace kubeflow katib-metricscollector-injection=enabled katib.kubeflow.org/metrics-collector-injection=enabled
    

Katib components

Run the following command to verify that Katib components are running:

watch microk8s kubectl get --namespace katib pods
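
As a non-interactive alternative to watch, kubectl wait can block until the pods report Ready. The sketch below wraps it in a hypothetical helper; the 600s timeout is an arbitrary example value. Note that kubectl wait returns an error if no pods exist yet in the namespace, so it is best run after the deployment has started creating them.

```shell
# Block until every pod in the namespace reports Ready, or time out.
# $1: namespace (defaults to the "katib" model created above)
# $2: timeout in a form kubectl accepts, e.g. 600s
wait_for_pods() {
    microk8s kubectl wait pods --all \
        --namespace "${1:-katib}" \
        --for=condition=Ready \
        --timeout="${2:-600s}"
}

# Example:
#   wait_for_pods katib 600s && echo "Katib is ready"
```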

Accessing the Katib UI

You can use the Katib user interface (UI) to submit experiments and to monitor your results.

You can set port-forwarding for the Katib UI service:

microk8s kubectl port-forward --namespace katib svc/katib-ui 8080:8080 --address 0.0.0.0

Then you can access the Katib UI at this URL:

http://localhost:8080/katib


6. Accelerated XGBoost experiments with Kubeflow Katib Hyperparameter tuning

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework and provides a parallel tree boosting that solves many data science problems in a fast and accurate way.

Hyperparameter tuning is the process of optimizing the hyperparameter values to maximize the predictive accuracy of a model. It is often described as a dark art in machine learning: the optimal parameters of a model can depend on many factors, so it is impossible to write a comprehensive guide that covers every case. If you don’t use Katib or a similar system for hyperparameter tuning, you need to run many training jobs yourself, manually adjusting the hyperparameters to find the optimal values.

Running a Katib experiment

The steps to configure and run a hyperparameter tuning experiment in Katib are:

  1. Package your training code in a Docker container image and make the image available in a registry.

  2. Define the experiment in a YAML configuration file. The YAML file defines the range of potential values (the search space) for the parameters that you want to optimize, the objective metric to use when determining optimal values, the search algorithm to use during optimization, and other configurations.

  3. Run the experiment from the Katib UI, either by supplying the entire YAML file containing the configuration for the experiment or by entering the configuration values into the form.

As a reference, you can use the YAML file of the FPGA XGBoost example.

Image classification on Street View House Numbers (SVHN) dataset

House Numbers 32x32
Source: http://ufldl.stanford.edu/housenumbers

SVHN is a real-world image dataset obtained from house numbers in Google Street View images. It can be seen as similar in flavor to MNIST (e.g., 32-by-32 images centered around a single digit), but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images).

Create XGBoost SVHN Experiments

Click NEW EXPERIMENT on the Katib home page. You should be able to view tabs offering you the following options:

  1. Metadata. The experiment code name, e.g. xgb-svhn-fpga.

  2. Trial Thresholds. Use Parallel Trials to limit the number of hyperparameter sets that Katib should train in parallel.

  3. Objective. The metric that you want to optimize. A common objective is to maximize the model’s accuracy in the validation pass of the training job. Use Additional metrics to monitor how the hyperparameters work with the model (e.g. whether and how they affect the training time).

  4. Hyper Parameters. The range of potential values (the search space) for the parameters that you want to optimize. In this section, you define the name and the distribution of every hyperparameter that you need to search. For example, you may provide a minimum and maximum value or a list of allowed values for each hyperparameter. Katib generates hyperparameter combinations in the range.

  5. Trial Template. The template that defines the trial. You have to package your ML training code into a Docker image, as described above. Your training container can receive hyperparameters as command-line arguments or as environment variables.

    • FPGA accelerated Trial Job:

      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          metadata:
            labels:
              inaccel/fpga: enabled
            annotations:
              inaccel/cli: |
                bitstream install --mode others https://store.inaccel.com/artifactory/bitstreams/xilinx/aws-vu9p-f1/dynamic-shell/aws/com/inaccel/xgboost/0.1/2exact
          spec:
            containers:
              - name: training-container
                image: "docker.io/inaccel/jupyter:lab"
                command:
                  - python3
                  - XGBoost/parameter-tuning.py
                args:
                  - "--name=SVHN"
                  - "--test-size=0.35"
                  - "--tree-method=fpga_exact"
                  - "--max-depth=10"
                  - "--alpha=${trialParameters.alpha}"
                  - "--eta=${trialParameters.eta}"
                  - "--subsample=${trialParameters.subsample}"
                resources:
                  limits:
                    xilinx/aws-vu9p-f1: 1
            restartPolicy: Never
      
    • CPU-only Trial Job:

      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: "docker.io/inaccel/jupyter:lab"
                command:
                  - python3
                  - XGBoost/parameter-tuning.py
                args:
                  - "--name=SVHN"
                  - "--test-size=0.35"
                  - "--max-depth=10"
                  - "--alpha=${trialParameters.alpha}"
                  - "--eta=${trialParameters.eta}"
                  - "--subsample=${trialParameters.subsample}"
            restartPolicy: Never
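
The form fields above map onto a single Katib Experiment manifest. The sketch below writes one to a file that can be pasted into the UI. It is an illustration under stated assumptions, not the reference YAML of the example: the metric names (accuracy, time), the search ranges, the trial counts and the random algorithm are all assumed values, while the trial spec is the FPGA accelerated Job shown above.

```shell
# Write a hypothetical Katib Experiment manifest to xgb-svhn-fpga.yaml.
# The quoted 'EOF' stops the shell from expanding ${trialParameters.*}.
cat > xgb-svhn-fpga.yaml <<'EOF'
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: xgb-svhn-fpga
  namespace: kubeflow
spec:
  parallelTrialCount: 1          # assumed: one FPGA, one trial at a time
  maxTrialCount: 12              # assumed value
  maxFailedTrialCount: 3         # assumed value
  objective:
    type: maximize
    objectiveMetricName: accuracy    # assumed metric name
    additionalMetricNames:
      - time                         # assumed metric name
  algorithm:
    algorithmName: random            # assumed search algorithm
  parameters:                        # all ranges below are assumed
    - name: alpha
      parameterType: double
      feasibleSpace: {min: "0.0", max: "1.0"}
    - name: eta
      parameterType: double
      feasibleSpace: {min: "0.01", max: "0.5"}
    - name: subsample
      parameterType: double
      feasibleSpace: {min: "0.5", max: "1.0"}
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - {name: alpha, description: L1 regularization term, reference: alpha}
      - {name: eta, description: Learning rate, reference: eta}
      - {name: subsample, description: Row sampling ratio, reference: subsample}
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          metadata:
            labels:
              inaccel/fpga: enabled
            annotations:
              inaccel/cli: |
                bitstream install --mode others https://store.inaccel.com/artifactory/bitstreams/xilinx/aws-vu9p-f1/dynamic-shell/aws/com/inaccel/xgboost/0.1/2exact
          spec:
            containers:
              - name: training-container
                image: "docker.io/inaccel/jupyter:lab"
                command: [python3, XGBoost/parameter-tuning.py]
                args:
                  - "--name=SVHN"
                  - "--test-size=0.35"
                  - "--tree-method=fpga_exact"
                  - "--max-depth=10"
                  - "--alpha=${trialParameters.alpha}"
                  - "--eta=${trialParameters.eta}"
                  - "--subsample=${trialParameters.subsample}"
                resources:
                  limits:
                    xilinx/aws-vu9p-f1: 1
            restartPolicy: Never
EOF
```

Each entry under trialParameters links a ${trialParameters.<name>} placeholder in the trial spec to a hyperparameter defined under parameters, which is how Katib substitutes a concrete value into every trial.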
      

View the results of the experiments in the Katib UI

  1. Open the Katib UI as described previously.

  2. You should be able to view your list of experiments. Click the name of the XGBoost SVHN experiment.

  3. There should be a graph showing the level of validation accuracy and training time for various combinations of the hyperparameter values (alpha, eta, and subsample):

    • FPGA accelerated Experiment Overview:

    • CPU-only Experiment Overview:

    Comparing the FPGA accelerated experiment with the equivalent CPU-only one, you will notice that the accuracy of the best model is similar in both implementations.

    However, in this XGBoost model training use case, the 8-core Intel Xeon CPU of the AWS F1 instance is roughly 6 times slower than its single Xilinx VU9P FPGA.


7. That’s all folks!

Congratulations! You have made it!

Until next time, exit and stop your FPGA instance:

aws ec2 stop-instances \
    --instance-ids <InstanceId> \
| jq -r '.StoppingInstances[0].CurrentState.Name'
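
A stopped instance keeps its EBS volume, which continues to be billed. If you do not plan to come back to this setup, terminating the instance is an alternative. The helper below is a sketch with a hypothetical function name; it uses the CLI's built-in --query filter to print the new state.

```shell
# Permanently delete the instance once it is no longer needed.
# $1 is the instance ID; prints the resulting state, e.g. shutting-down.
terminate_instance() {
    aws ec2 terminate-instances \
        --instance-ids "$1" \
        --query 'TerminatingInstances[0].CurrentState.Name' \
        --output text
}

# Example (INSTANCE_ID is a hypothetical variable holding your instance ID):
#   terminate_instance "$INSTANCE_ID"
```

Termination cannot be undone, so copy any results you want to keep off the instance first.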

Where to go from here?

Learn more about InAccel and our mission to enable multi-accelerator application models and create a platform to manage it all.

Explore the InAccel documentation that has everything you need if you want to look more into FPGA acceleration for your projects.

Alternatively, if you need commercial support for your FPGA deployments, contact us to get all your questions answered.