August 5, 2019

PyTorch Lightning vs PyTorch Ignite vs Fast.ai

William Falcon
[Image: apparently a lion, bear, and tiger are friends]

PyTorch Lightning is a recently released library that brings a Keras-like interface to PyTorch. It leaves the core training and validation logic to you and automates the rest. (By Keras-like I mean no boilerplate, not overly simplified.)

As the core author of Lightning, I’ve been asked a few times about the differences between Lightning, fast.ai, and PyTorch Ignite.

Here, I’ll attempt an objective comparison of all three frameworks, based on the similarities and differences found in their tutorials and documentation.

Motivations

Fast.ai was originally created to facilitate teaching the fast.ai curriculum. It has recently also morphed into a library of implementations of common approaches such as GANs, RL, and transfer learning.

PyTorch Ignite and PyTorch Lightning were both created to give researchers as much flexibility as possible, requiring them to define only what happens in the training and validation loops.

Lightning has two additional, more ambitious motivations: reproducibility and democratizing best practices which only PyTorch power-users would implement (Distributed training, 16-bit precision, etc…). I’ll discuss these motivations in detail in later sections.

So, at a base level, the target user is clear: for fast.ai it’s people interested in getting into deep learning, while the other two are focused on active researchers, whether in ML itself or in fields that use ML (i.e., biologists, neuroscientists, etc.).

Learning Curve

[Image: framework overload]

Both Lightning and Ignite have very simple interfaces, as most of the work is still done in pure PyTorch by the user. The main work happens inside the Trainer and Engine objects, respectively.

Fast.ai, however, does require learning another library on top of PyTorch. Most of the time its API doesn’t operate directly on pure PyTorch code (there are places where it does); instead it works through abstractions like DataBunches, the data block API, etc… Those abstractions are extremely useful when the “best” way of doing something isn’t obvious.

For researchers, however, it’s crucial not to have to learn yet another library and to directly control key parts of research, such as data processing, without other abstractions operating on them.

In this case, the fast.ai library has a much higher learning curve, but it’s worth it if you don’t necessarily know the “best” way to do something and just want to use good approaches as black boxes.

Lightning vs Ignite

It’s clear from the above that comparing fast.ai to these two frameworks isn’t fair given that the use cases and users are different (However, I’ll still add fast.ai to the comparison table at the end of this article).

The first major difference between Lightning and Ignite is the interface each operates on.

In Lightning, there is a standard interface (see LightningModule) of 9 required methods that EVERY model has to follow.

This flexible format allows for the most freedom in training and validation. The interface should be thought of as a system, not a model. The system might contain multiple models (GANs, seq-2-seq, etc…) or it might be a single model, such as the simple MNIST classifier sketched below.

Thus researchers are free to try as many crazy things as they want, and ONLY have to worry about these 9 methods.
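To make this concrete, here’s a minimal sketch of what such a LightningModule might look like for MNIST. It’s only an illustration, not the official template: it shows just a handful of the methods, and the exact names and required set have shifted across Lightning versions, so check the docs for the version you have installed.

```python
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import pytorch_lightning as pl


class MNISTSystem(pl.LightningModule):
    """A 'system': here it holds one model, but it could hold several."""

    def __init__(self):
        super().__init__()
        self.classifier = nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.classifier(x.view(x.size(0), -1))

    def training_step(self, batch, batch_idx):
        # all of the core research logic lives here
        x, y = batch
        return F.cross_entropy(self(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    def train_dataloader(self):
        mnist = datasets.MNIST("", train=True, download=True,
                               transform=transforms.ToTensor())
        return DataLoader(mnist, batch_size=32)
```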

Ignite requires a very similar setup, but does not have a standard interface which every model needs to follow.
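For comparison, here’s a rough sketch of an equivalent Ignite setup. The names update and run below are my own choices, not anything Ignite requires, which is exactly the point made next.

```python
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from ignite.engine import Engine, Events

model = nn.Linear(28 * 28, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
train_loader = DataLoader(
    datasets.MNIST("", train=True, download=True,
                   transform=transforms.ToTensor()),
    batch_size=32)


def update(engine, batch):
    # the training logic goes in a process function handed to the Engine
    model.train()
    optimizer.zero_grad()
    x, y = batch
    loss = F.cross_entropy(model(x.view(x.size(0), -1)), y)
    loss.backward()
    optimizer.step()
    return loss.item()


def run():
    trainer = Engine(update)

    @trainer.on(Events.EPOCH_COMPLETED)
    def log_epoch(engine):
        print(f"epoch {engine.state.epoch}: last loss {engine.state.output:.4f}")

    trainer.run(train_loader, max_epochs=5)


if __name__ == "__main__":
    run()
```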

Notice that the run function might be defined differently: there could be many different events added to the trainer, or it could even be named something else entirely, such as main, train, etc…

In a complicated system where training might happen in strange ways (looking at you, GANs and RL), it wouldn’t be immediately obvious to someone reading the code what is going on. In Lightning, you’d know to look at training_step to figure out what’s happening.

Reproducibility

[Image: when you try to reproduce work]

As I mentioned, Lightning was created with a second more ambitious broad motivation: Reproducibility.

If you’ve ever tried to read someone’s implementation of a paper, you know it’s very hard to figure out what’s happening. Long gone are the days when we were just designing different neural network architectures.

Modern SOTA models are actually systems, which employ many models or training techniques to achieve specific results.

As I said earlier, a LightningModule is a system, not a model. Thus if you want to know where all the crazy tricks and super complicated training is happening you can look at the training_step and validation_step.

If every research project and paper is implemented using the LightningModule template, it will be very easy to find out what’s going on (but perhaps not easy to understand haha).

A standardization of this kind across the AI community would also allow an ecosystem to flourish which could use the LightningModule interface to do cool things like automate deployments, audit systems for biases, or even support hashing weights into a blockchain backend to reconstruct models used for key predictions which may need to be audited.

Out-of-the-box Features

Another major difference between Ignite and Lightning is the set of features Lightning supports out of the box. Out of the box means there’s no additional code needed on your part.

To illustrate, let’s try to train a model on multiple GPUs on the same machine.

Ignite (demo)
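A rough sketch of the usual single-machine approach on the Ignite side: device handling is up to you, for example by wrapping the model in plain PyTorch’s nn.DataParallel. The random-tensor dataset below is just a stand-in so the snippet runs on its own.

```python
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, TensorDataset
from ignite.engine import Engine

device = "cuda" if torch.cuda.is_available() else "cpu"

# multi-GPU on one machine: wrap the model in DataParallel yourself
model = nn.DataParallel(nn.Linear(28 * 28, 10)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# stand-in data so the example is self-contained
loader = DataLoader(TensorDataset(torch.randn(512, 28 * 28),
                                  torch.randint(0, 10, (512,))),
                    batch_size=32)


def update(engine, batch):
    model.train()
    optimizer.zero_grad()
    x, y = batch
    loss = F.cross_entropy(model(x.to(device)), y.to(device))
    loss.backward()
    optimizer.step()
    return loss.item()


Engine(update).run(loader, max_epochs=5)
```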

Lightning (demo)
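On the Lightning side, roughly the following, assuming the MNISTSystem module sketched earlier. The gpus flag matches the Trainer of the Lightning versions around the time of this post; newer releases spell it accelerator="gpu", devices=4.

```python
import pytorch_lightning as pl

# assumes the MNISTSystem LightningModule sketched earlier
model = MNISTSystem()

# 4 GPUs on one machine; flag names vary by Lightning version
trainer = pl.Trainer(gpus=4)
trainer.fit(model)
```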

OK, neither is bad… But what if we want to use multiple GPUs across many machines? Let’s train on 200 GPUs.

Ignite

Well, there’s no built-in support for this… You’d have to extend the example quite a bit and add a way to submit scripts easily. Then you’d have to take care of loading and saving checkpoints, making sure the many processes don’t overwrite each other’s weights and logs, etc… you get the idea.

Lightning

With Lightning, you just set the number of nodes and submit the appropriate job. Here’s an in-depth tutorial on configuring jobs correctly.
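Roughly, again assuming the MNISTSystem module from earlier, 25 nodes with 8 GPUs each comes to 200 GPUs. The flag names below are from the 1.x-era Trainer (older releases used distributed_backend, newer ones accelerator/devices), so treat them as illustrative:

```python
import pytorch_lightning as pl

# 25 nodes x 8 GPUs per node = 200 GPUs
model = MNISTSystem()  # the module sketched earlier
trainer = pl.Trainer(gpus=8, num_nodes=25, strategy="ddp")
trainer.fit(model)
```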

Out-of-the-box features are features you don’t need to write any extra code to get. You may not need most of them right now, but when you need to, say, accumulate gradients, clip gradients, or train in 16-bit precision, you won’t spend days or weeks reading through tutorials to get it working.

Just set the appropriate Lightning flag and move on with your research.
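For example, the kind of flags meant here look roughly like this (names per the 1.x-era Trainer; check the docs for your installed version):

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    accumulate_grad_batches=4,   # accumulate gradients over 4 batches
    gradient_clip_val=0.5,       # clip gradient norm at 0.5
    precision=16,                # 16-bit (mixed) precision training
)
```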

Lightning comes prebuilt with these features so users spend more time researching and less time engineering. This is especially useful for non-CS/DS researchers such as physicists, biologists, etc., who may not be as well-versed in programming.

These flags democratize PyTorch features that only a power user might otherwise take the time to implement.

Here are tables comparing the features across all three frameworks, grouped by sets of functionality.

If I missed anything critical please post a comment and I will update the table!

High-Performance Computing

Debugging Tools

Usability

Closing Thoughts

Here we’ve made a deep comparison of the three frameworks on multiple levels. Each is great in its own regard.

If you’re just learning, or aren’t up to date with all the latest best practices, don’t need super-advanced training tricks, and can afford the time to learn a new library, go with fast.ai.

If you need the most flexibility, go with either Ignite or Lightning.

If you don’t need super-advanced features and are OK with adding your own TensorBoard support, accumulated gradients, distributed training, etc… then go with Ignite.

If you need more advanced features, distributed training, the latest and greatest deep-learning training tricks, and would love to see a world where implementations are standardized across the world then use Lightning.