Just a disclaimer before starting this post:

  • I'm not an Erlang coder and I've never written an Erlang project. I've only worked on the BEAM through Elixir.

  • I'll talk about some stuff that is general to OTP but my point of view comes from an Elixir perspective.

This article is about OTP Applications and libraries, and how making an open-source library that starts an application is possibly an anti-pattern.

OTP Applications and Libraries

I have a strong opinion that if you're learning Elixir (or even Erlang) and you're past the point of learning the language and the primitives it offers for building systems, you must read the OTP Design Principles. This post is all about a concept described in that guide. The Applications section says:

When you have written code implementing some specific functionality you might want to make the code into an application, that is, a component that can be started and stopped as a unit, and which can also be reused in other systems.

It's interesting because, if you come from any other background, it seems to be describing a library. Moving forward to the Included Applications section, you'll find more that seems to corroborate this idea:

An application can include other applications. An included application has its own application directory and .app file, but it is started as part of the supervisor tree of another application. An application can only be included by one other application. An included application can include other applications. An application that is not included by any other application is called a primary application.

Not only does it seem to be describing a library, it also seems to describe how library dependencies work, and even that they cannot form circular dependencies. If you keep going and read the guide in its entirety, in the Distributed Applications section you'll finally find something that takes you off this wrong train of thought:

In a distributed system with several Erlang nodes, it can be necessary to control applications in a distributed manner. If the node, where a certain application is running, goes down, the application is to be restarted at another node.

To be fair to the guide, it explains that you can start, stop and restart applications, the relation between applications and the runtime, and how applications are an architectural tool of a distributed system. But it never states clearly what an application is, except that it's a component that can be started and stopped as a unit, and which can also be reused in other systems.

What is an Application?

I'm not crazy enough to come here and define what an OTP Application is all by myself, with no research first. Just be prepared for a lot of quotes in this section, and maybe at the end we can conclude something together. I've already used the Erlang documentation and it didn't clarify that much for us. Now I'll turn to the Getting Started guide for Elixir, specifically the Supervisor and Application section:

In a nutshell, an application consists of all of the modules defined in the .app file, including the .app file itself. An application has generally only two directories: ebin, for Elixir artifacts, such as .beam and .app files, and priv, with any other artifact or asset you may need in your application.

Elixir in Action by Saša Jurić beautifully summarizes the parts that compose an OTP application in section 11.1.1, pg. 278:

  • The application’s name and version, and a description

  • A list of application modules

  • A list of application dependencies (which must be applications themselves)

  • An optional application-callback module

We can understand an OTP application as a project, and except for the application-callback module it's practically the same thing as what we think of as a library. Saša Jurić continues on pg. 280:

It’s worth noting that an OTP application is a runtime construct: a resource file that’s dynamically interpreted by the corresponding OTP-specific code. When using mix, you describe some aspects of this file, and other aspects are derived from your code. But the application itself has meaning only at runtime.

A great book (a work in progress at the time I'm writing this), Adopting Erlang by Tristan Sloughter, Fred Hebert, and Evan Vigil-McClanahan, defines what OTP Application means for the entire book in the chapter/section OTP High Level:

For the sake of clarity, we’re going to use the following terminology for OTP Applications for this entire book:

  • Library Applications: stateless collections of modules

  • Runnable Applications: OTP applications that start stateful supervision tree structures with processes running in them

  • OTP Applications: either Library or Runnable Applications, interchangeably

I think we can now conclude that OTP Applications are the way the runtime exposes both reusable code (in other words, a library) and a running application that can be attached to the same node yours is going to run on. What differentiates them is whether or not you implement the application-callback module in your OTP application. It seems like a small difference, but it has huge implications for how you manage your system.
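To make that difference concrete, here is a minimal sketch that asks the runtime whether a loaded OTP application declares an application-callback module. The AppKind module and its kind/1 function are hypothetical names I made up for illustration. Application.spec/2 is the standard Elixir function that reads a key from the loaded .app resource, and I'm using :stdlib and :kernel as subjects because they ship with Erlang/OTP and are always loaded.

```elixir
defmodule AppKind do
  @moduledoc "Hypothetical helper: classifies a loaded OTP application."

  # The :mod key of the .app file holds the application-callback module.
  # It is [] when the application declares none, and Application.spec/2
  # returns nil when the application is not loaded at all.
  def kind(app) when is_atom(app) do
    case Application.spec(app, :mod) do
      nil -> :not_loaded
      [] -> :library
      {module, _start_args} when is_atom(module) -> :runnable
    end
  end
end

IO.inspect(AppKind.kind(:stdlib), label: "stdlib")
IO.inspect(AppKind.kind(:kernel), label: "kernel")
```

On a stock installation I'd expect :stdlib to come back as :library and :kernel as :runnable, which matches the Library/Runnable split quoted from Adopting Erlang.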

Applications, libraries and consequences

From now on, when I use the term application, I'll be talking exclusively about an OTP Application that implements the application-callback module and consequently starts a supervision tree at the start-up phase of the node. When I say library, I mean an OTP Application that just exposes a bunch of reusable code for other systems to use.

As I stated in the previous section, there is almost no difference between applications and libraries in the BEAM context. But this small difference has huge consequences:

  • You'll have more applications running alongside yours in the same node, and you didn't get to define how they behave.

  • An application can bring down the entire node if something goes wrong with it.

  • You'll have little to no control over the consequences of failures in the additional applications running in the node.

Let me elaborate on those consequences. When you add an application to your list of dependencies, it is going to start a supervision tree whether or not you need or want it started. Of course, the application can define parameters so you can deactivate it. But I think that defeats the purpose of defining an application, because you'll end up with a zombie process: a supervisor with no children. It's way better to provide a library that already defines an easy-to-start supervision tree. I can define an isolated application for this supervision tree if I need to. I can plug it into an existing application. I can easily define how the rest of my system behaves in case this specific supervision tree fails.
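Here is a rough sketch of what I mean by a library with an easy-to-start supervision tree. Every name in it (HypotheticalLib.Counter, HypotheticalLib.Supervisor, the :initial option) is invented for illustration and doesn't come from any real package. The point is only that the library ships the tree, and the host decides if and where it runs.

```elixir
defmodule HypotheticalLib.Counter do
  # A stateful worker that the (hypothetical) library needs at runtime.
  use Agent

  def start_link(opts) do
    initial = Keyword.get(opts, :initial, 0)
    Agent.start_link(fn -> initial end, name: __MODULE__)
  end

  # Increments the counter and returns the new value.
  def increment do
    Agent.get_and_update(__MODULE__, fn n -> {n + 1, n + 1} end)
  end
end

defmodule HypotheticalLib.Supervisor do
  # The library defines this tree but never starts it on its own:
  # there is no application-callback module anywhere in the library.
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(opts) do
    children = [{HypotheticalLib.Counter, opts}]
    Supervisor.init(children, strategy: :one_for_one)
  end
end

# In the host system: the person designing the system plugs the library's
# tree into their own supervisor and picks the strategy for its siblings.
children = [
  {HypotheticalLib.Supervisor, initial: 10}
]

{:ok, host_sup} = Supervisor.start_link(children, strategy: :one_for_one)

IO.inspect(HypotheticalLib.Counter.increment(), label: "counter")
```

If the host doesn't need the stateful part, it simply never lists HypotheticalLib.Supervisor among its children, and nothing runs.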

In the end, it's the job of the person designing the system to decide how all the supervision trees interact with each other. By defining an application instead of a library, you're hijacking this from the person writing the system. You're making their system a hostage of your decisions. I've already suffered a lot with this: systems going down for no apparent reason, where the attached application was the cause, but there were no logs and no way for me to control it. The only solution I had was to migrate to a library that didn't hijack the control I had over the running system.
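As one example of the kind of control I'm talking about, here is a small sketch where the host overrides the restart policy of a library-provided child. HypotheticalLib.Flaky is a made-up worker, not a real library. Supervisor.child_spec/2 is the standard way to override fields of a child specification, and with restart: :temporary the supervisor doesn't restart the child or count its crash against max_restarts.

```elixir
defmodule HypotheticalLib.Flaky do
  # Stand-in for a library process that may crash (made up for this sketch).
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok), do: {:ok, %{}}
end

# The host decides that a failure of this child must never escalate:
# the library's default restart policy is overridden with :temporary.
children = [
  Supervisor.child_spec({HypotheticalLib.Flaky, []}, restart: :temporary)
]

# max_restarts: 0 means a single restart attempt would already shut this
# supervisor down, so surviving the crash shows the override took effect.
{:ok, host_sup} = Supervisor.start_link(children, strategy: :one_for_one, max_restarts: 0)

flaky = Process.whereis(HypotheticalLib.Flaky)
ref = Process.monitor(flaky)
Process.exit(flaky, :kill)

receive do
  {:DOWN, ^ref, :process, _pid, _reason} -> :ok
after
  1_000 -> raise "the flaky process did not terminate"
end

IO.inspect(Process.alive?(host_sup), label: "host supervisor alive")
```

When the same process lives inside a dependency's own application, this decision has already been made for you by the dependency's author.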

Exceptions

I asked some friends for feedback on this article's first draft, and one thing almost everyone highlighted was the need to lay out some exceptions to the points I've made here.

The first exception I can think of is something like what telemetry does. Telemetry just starts an SGP that has few consequences in terms of system architecture but adds a lot to the usability of the system. If you had to start the telemetry registry as part of your supervision tree, you'd have to orchestrate more carefully when and how you attach handlers. It wouldn't be a big problem to have to do that, but I understand the tradeoff and the preference for simplifying telemetry usage. So the exception would be: if making it an application reduces the complexity of using the library without adding the risk of bringing down the VM, then it's OK.

One other exception I can think of is when your library sits between the system and the applications that are going to use it. Libraries like libcluster and recon come to mind, but when I went through their code, both are just libraries. I cannot think of any good example of this use case, but it seems reasonable to consider it a good exception.

One use case that is definitely not an exception is a library that needs to communicate with the external world. If you're doing something that touches anything outside the node itself, you need to make it just a library. IO, NIFs, ports and other external systems have too many points of failure for you to decide how to handle them better than the person using the library. The person designing the system needs to have agency over that stuff, no matter how much easier it would be to use if you made it an application.

tl;dr

  • An OTP application is a way to expose code and runtime definitions for others to use.

  • Think of a library as an OTP application that exposes just code, and an application as an OTP application that defines a supervision tree as part of its start process.

  • Avoid defining applications; define libraries.

  • If your library needs one, define a supervision tree and expose an easy way to start it when needed, instead of starting it inside your library regardless of the user's choice.

  • Let people choose to turn your library into an application if their needs require that.

  • Avoid hijacking the power and control that the BEAM provides over its execution.