Co-authored with Jirka Borovec, lead contributor to PyTorch Lightning, and Zach Cain, ML engineer at Google
As PyTorch Lightning adoption continues to grow, we continuously evolve our testing suite to ensure that the companies and AI research labs that build their AI systems on PyTorch Lightning have a reliable and robust codebase. In preparation for our upcoming V1, we have taken a major step in our support for training on TPUs.
As the first ML framework to implement PyTorch’s XLA TPU support (PyTorch Lightning’s TPU support is built on top of pytorch/xla’s support for the PyTorch native API), we continue to lead the charge in getting PyTorch users closer to running full workloads on TPUs. We’re proud to show you how we became the first ML framework to run CI on TPUs!
TPUs, or Tensor Processing Units, are hardware chips developed by Google to accelerate machine learning applications. The chip was designed to handle the computational demands of Google’s AI framework TensorFlow, which performs its computations on tensors (multidimensional data arrays). TPUs are available in the cloud as Cloud TPUs: a Cloud TPU v2 delivers 180 teraflops of compute and 64 GB of High Bandwidth Memory (HBM).
In 2018, Google released the latest generation, TPU v3, more than doubling performance with 420 teraflops and 128 GB of HBM.
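For PyTorch users, TPUs are exposed through the pytorch/xla project mentioned above. As a minimal sketch of the idea (assuming torch_xla is installed, e.g. on a Colab TPU runtime), ordinary PyTorch code targets a TPU core like this:

```python
import torch
import torch_xla.core.xla_model as xm

# Acquire an XLA device; on a TPU runtime this resolves to a TPU core.
device = xm.xla_device()

# Ordinary PyTorch tensors and ops, now executed through XLA on the TPU.
x = torch.randn(4, 4, device=device)
y = torch.matmul(x, x)

# XLA executes lazily; mark_step() forces the pending graph to run.
xm.mark_step()
print(y.device)  # e.g. xla:1
```

PyTorch Lightning wraps this plumbing for you, so users never have to call xla_device() or mark_step() directly.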
PyTorch Lightning is a lightweight PyTorch framework (really just organized PyTorch) that provides seamless training of deep learning models on arbitrary hardware, such as GPUs, TPUs, and CPUs, without needing to modify your code. Much like TPUs, it was designed to help you iterate faster through your deep learning research ideas.
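To make that concrete, here is a minimal sketch using the Trainer flags from the Lightning releases of this era (gpus, tpu_cores); the tiny model and random data are stand-ins for your own:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning import LightningModule, Trainer


class TinyModel(LightningModule):
    # A minimal LightningModule: one linear layer, cross-entropy loss.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


train_data = DataLoader(
    TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))), batch_size=16
)

# The model code never changes; only the Trainer flags pick the hardware.
trainer = Trainer(max_epochs=1)                  # CPU
# trainer = Trainer(gpus=1, max_epochs=1)        # single GPU
# trainer = Trainer(tpu_cores=8, max_epochs=1)   # all 8 cores of a Cloud TPU

trainer.fit(TinyModel(), train_data)
```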
PyTorch Lightning has robust documentation, tutorial videos, and an active Slack channel supported by over 10 core members, in addition to 200+ contributors and 5 full-time staff engineers.
To deliver high-quality, stable, and error-free code to the numerous companies (Facebook, NVIDIA, Uber) and research labs who use PyTorch Lightning, we need very rigorous testing which we’ll describe below.
We use GitHub Actions to streamline the PyTorch Lightning development lifecycle. We tried many different CI platforms in the past, such as Travis CI and AppVeyor, but keeping track of API changes in each CI platform was challenging, and none of them offered the complete experience we were looking for: simple testing over multiple operating systems with minimal code changes, a stable API, and a sufficient number of concurrent jobs.
When GitHub launched the beta version of GitHub Actions in 2019, we were excited to give it a try. It’s easy to use and maintain since everything is under one roof: you can use it to create custom workflows to build, test, package, release, or deploy any code project on GitHub.
We created several workflows for end-to-end continuous integration (CI) and continuous deployment (CD) with GitHub Actions, directly in our repository. The configuration is very simple, especially given that we had to write matrix testing for all three major operating systems. It’s also free for all public repositories (up to 2,000 minutes of runtime)!
Check out more info here.
We test every environment we can: all combinations of operating system (Linux, macOS, Windows), PyTorch, and Conda versions. We have 16-bit precision support and multi-GPU tests. Our test coverage was 88%, but one of Lightning’s key features was still missing from CI testing: TPU training. We could only test and debug TPU support manually, using Google Colab. We wanted to integrate TPU tests into our GitHub Actions CI, and we were fortunate enough to have Zachary Cain, a Google engineer working on Cloud ML Accelerators, make it happen.
Adding CI on TPUs is the first step toward full TPU coverage in PyTorch Lightning’s tests.
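To give a flavor of what a TPU test can look like, here is a hedged sketch in the style of a pytest suite; the test body and helper names are illustrative, not the exact ones in the Lightning repo. The test is skipped on machines without torch_xla and otherwise trains a tiny model for one epoch on the requested TPU cores:

```python
import pytest
import torch
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning import LightningModule, Trainer

# Only run TPU tests where torch_xla is importable (i.e. on the TPU CI machines).
try:
    import torch_xla  # noqa: F401
    TPU_AVAILABLE = True
except ImportError:
    TPU_AVAILABLE = False


class TinyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


@pytest.mark.skipif(not TPU_AVAILABLE, reason="test requires a TPU machine")
@pytest.mark.parametrize("tpu_cores", [1, 8])
def test_model_tpu(tmpdir, tpu_cores):
    data = DataLoader(
        TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))), batch_size=16
    )
    trainer = Trainer(default_root_dir=tmpdir, tpu_cores=tpu_cores, max_epochs=1)
    trainer.fit(TinyModel(), data)
```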
Cloud TPUs can be accessed from three different services: Google Compute Engine, Google Kubernetes Engine (GKE), and AI Platform.
The PyTorch Lightning GitHub Actions integration relies on GKE, a service that automatically starts and stops machines to run Docker images.
In general, any time new code arrives at the repo, a GitHub Action captures the latest version of the code in a Docker image that can be launched on GKE. The GKE configs are produced with the help of GoogleCloudPlatform/ml-testing-accelerators, an open-source framework for running deep learning jobs on GKE. The repo can be used with any combination of TensorFlow, PyTorch, GPUs, TPUs, or CPUs.
For new commits to PyTorch Lightning, this workflow builds a Docker image containing the latest code, pushes it to a container registry, and deploys a job on the GKE cluster that runs the TPU test suite, with the results reported back to the commit’s checks.
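Our actual job specs are generated from the ml-testing-accelerators templates, but the underlying idea can be sketched with the Kubernetes Python client; the project, image name, and test command below are hypothetical:

```python
from kubernetes import client, config

# Assumes kubectl is already authenticated against the GKE cluster.
config.load_kube_config()

# Hypothetical image built by the GitHub Action from the latest commit.
IMAGE = "gcr.io/my-project/pytorch-lightning-tests:latest"

container = client.V1Container(
    name="tpu-tests",
    image=IMAGE,
    command=["python", "-m", "pytest", "tests/", "-v"],
    resources=client.V1ResourceRequirements(
        # Ask GKE's TPU scheduler for all 8 cores of a v3 device.
        limits={"cloud-tpus.google.com/v3": "8"},
    ),
)

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="pl-tpu-tests"),
    spec=client.V1JobSpec(
        backoff_limit=0,  # do not retry a failing test run
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(
                # Annotation telling GKE which TPU software version the pod expects.
                annotations={"tf-version.cloud-tpus.google.com": "pytorch-nightly"},
            ),
            spec=client.V1PodSpec(restart_policy="Never", containers=[container]),
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```

GKE then schedules the pod onto a node with an attached TPU, which is exactly where the autoscaling described next comes in.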
GKE makes it easy to use the cluster autoscaler, which automatically resizes the GKE cluster’s node pools based on workload demands. This increases the availability of your workloads when you need it, while controlling costs. It is especially important for community-driven projects like PyTorch Lightning, where many contributors are working on PRs in parallel. Learn more about autoscaling here.
Click here to learn more about PyTorch Lightning.
Want to start your own CI for TPUs and/or GPUs/CPUs? Please open an issue or ask a question at https://github.com/GoogleCloudPlatform/ml-testing-accelerators/issues.
Thanks to William Falcon and Zachary Cain.