The Journey of Adopting gRPC and Protobuf

Ron Aharoni · Published in Riskified Tech · Jul 27, 2022


When I joined Riskified 5 years ago, the company was just starting its microservices journey. I remember attending a Microservices Steering Committee meeting in my first week, where one of the topics I raised was how we were going to communicate: HTTP or RPC?

I thought we’d need some way to define contracts between services, and generate clients and servers from them. We considered RPC frameworks, but felt the overhead of using one would be too large for the stage we were at.

The situation at Riskified has changed considerably since then. We now have almost 1,000 services communicating using gRPC (alongside HTTP and Kafka / Avro). So why did we eventually decide to use an RPC framework?

  • Organizational growth — integrating any two services required significant investment in defining APIs, maintaining compatibility, writing test stubs, standardizing error handling, and so on
  • The overhead of writing client- and server-side parsing code is significant, especially since we use multiple languages
  • Stricter API versioning

In this post I’ll assume you are familiar with Protobuf and gRPC, and mainly focus on our adoption journey and how we encouraged usage by dev teams.

The build pipeline

To use Protobuf and gRPC, we need to store schemas and create a pipeline that generates clients from them, as well as consider what developers need in order to use them effectively.

1. Schema Management and Building a Pipeline

First we needed to give some thought to managing Protobuf schemas. What does the workflow of adding a schema look like? Where do we store the schemas?

From our previous experience handling Avro schemas for our Kafka topics, we knew we wanted to go with a centralized repository. This makes it easy to enforce standards and share models at the “source” (schema) level. It also simplifies syncing schemas with external resources, such as the Confluent Schema Registry, which has added support for Protobuf.

We created a company-wide Protobuf repository. Changes to the repository trigger a process for modified schemas, which checks breaking changes, generates and packages code for every language, and finally versions and uploads it to an artifact repository. Let’s take a look at each step…

Within the repository, schemas are organized into projects.
Projects have a major version, which is set manually through the directory structure (project name/major version), and a minor version, which is determined automatically from the length of the git commit history.

Each project/major pair defines the generated package.
For example, the package built from the directory “analysis/v1” would be named “analysis-v1”, with minor version 1.
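For a consuming Scala service, the result is an ordinary versioned dependency. The coordinates below are purely illustrative (the group id and version are made up), but they show how the project name, major version, and generated minor version surface to users:

```scala
// build.sbt of a consuming service -- hypothetical coordinates, shown only to
// illustrate the "<project>-v<major>" naming and the commit-count minor version.
libraryDependencies += "com.riskified.schemas" %% "analysis-v1" % "1.37"
```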

Steps in our build pipeline

2. Compatibility check

For compatibility checking we employed a tool called Buf. Buf aims to make building services with Protobuf easy and to solve many problems in the Protobuf ecosystem, such as schema distribution, dependency management and standardization. We’re currently only using it for compatibility checking.

One of the benefits of using schemas for inter-service communication is the ability to check whether changes will break an API. Buf defines several compatibility levels (FILE, PACKAGE, WIRE and WIRE_JSON); we went with FILE, as it’s the one that makes sense for the languages we use.

Buf has a mechanism for describing Protobuf schemas called an image file (image files themselves are based on FileDescriptorSets). When building a Protobuf project, we use Buf to generate an image file for the schema, which we store in S3.

On every subsequent update of a schema, the latest image file is fetched from S3. Buf checks it against the candidate schema version in the repository. If compatibility is broken, the developer can either change the schema or declare a new major version by moving it to a new directory (for example, analysis/v2).
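To give a concrete feel for this step, the sketch below shells out to Buf from Scala: it builds an image for the candidate schemas and runs the breaking-change check against the previously published image. File names and the project path are placeholders, and it assumes the published image was already downloaded from S3.

```scala
import scala.sys.process._

// Sketch of the compatibility step for a single project. Assumes `buf` is on the PATH.
val projectDir     = "analysis/v1"    // placeholder project path
val candidateImage = "candidate.bin"  // image built from the repository's current schemas
val publishedImage = "published.bin"  // latest image, fetched from S3 before this step

// Build an image for the candidate schemas...
val built = Process(Seq("buf", "build", projectDir, "-o", candidateImage)).! == 0

// ...and check them for breaking changes against the published image.
val compatible = Process(Seq("buf", "breaking", projectDir, "--against", publishedImage)).! == 0

if (!built || !compatible)
  sys.error(s"$projectDir breaks compatibility; fix the schema or declare a new major version (e.g. analysis/v2)")
```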

3. Generating clients

After the compatibility check passes, we generate code for all languages in our stack: Scala, Python, Ruby, TypeScript and JavaScript.

The generated code doesn’t always suit our needs or preferences exactly. So, where needed, we adjust it with additional tools or by tweaking the protoc compiler options. Here are some examples of customizations we’ve made:

Scala

We use ScalaPB, an sbt plugin that generates Scala files from proto schemas. We use it to control the naming of enum values in the generated code and to disable default values for case class parameters. This helps avoid errors when converting between data models using Chimney, a Scala library for data transformations.
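As a rough illustration, wiring ScalaPB into a build looks like the sketch below; it’s not our exact configuration, and the naming and default-value tweaks themselves are applied through ScalaPB generator options declared in the Protobuf sources.

```scala
// build.sbt (assumes the sbt-protoc plugin and the ScalaPB compilerplugin are on the build classpath)
import scalapb.compiler.Version.scalapbVersion

Compile / PB.targets := Seq(
  // grpc = true also generates service stubs alongside the message case classes
  scalapb.gen(grpc = true) -> (Compile / sourceManaged).value / "scalapb"
)

libraryDependencies += "com.thesamet.scalapb" %% "scalapb-runtime-grpc" % scalapbVersion

// Enum naming and case-class defaults are controlled via ScalaPB options such as
// enum_value_naming and no_default_values_in_constructor, set as package-scoped
// options in the .proto files rather than here.
```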

Python

The Protobuf documentation suggests using the protoc compiler to generate Python code. Unlike with Scala and C++ code generation, protoc doesn’t generate data access code for Python directly. Instead, it generates special descriptors for all messages, which makes the generated files unreadable and hard to use.

After investigating a few tools, we decided to use mypy-protobuf, which generates readable .pyi stub files containing typing information.

We used the following flag to generate files, which helped to improve serialization / deserialization performance significantly.

Ruby

We used the grpc_tools_ruby_protoc compiler provided by gRPC to generate Ruby files, and cookiecutter to create the project template and specify the gem properties.

For the server code we used the gruf framework, although it is not required.

Enhancing clients

The generated client / server code can be augmented using interceptors to handle cross-cutting concerns, such as monitoring, logging and authentication. Below we describe the Scala versions of the interceptors we implemented; the implementations in the other languages are similar.

Authentication

Our services run on Kubernetes, and we utilize Istio mutual TLS, so we only perform simple authentication similar to HTTP basic access authentication.
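To illustrate, a minimal Scala interceptor for this kind of check could look like the sketch below. The header name and the token comparison are assumptions for the example; the real implementation differs in the details.

```scala
import io.grpc.{Metadata, ServerCall, ServerCallHandler, ServerInterceptor, Status}

// Rejects calls that don't carry the expected credentials in the request metadata.
final class BasicAuthInterceptor(expectedToken: String) extends ServerInterceptor {
  private val AuthKey: Metadata.Key[String] =
    Metadata.Key.of("authorization", Metadata.ASCII_STRING_MARSHALLER)

  override def interceptCall[ReqT, RespT](
      call: ServerCall[ReqT, RespT],
      headers: Metadata,
      next: ServerCallHandler[ReqT, RespT]
  ): ServerCall.Listener[ReqT] =
    Option(headers.get(AuthKey)) match {
      case Some(token) if token == expectedToken =>
        next.startCall(call, headers)
      case _ =>
        call.close(Status.UNAUTHENTICATED.withDescription("invalid credentials"), new Metadata())
        new ServerCall.Listener[ReqT] {} // no-op listener for the rejected call
    }
}
```

The interceptor is attached to a service with ServerInterceptors.intercept(service, new BasicAuthInterceptor(token)).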

4. Monitoring

We use Kamon for tracking metrics, which are then collected by Prometheus. The reported metrics include tags for service, method, status code, and an optional user-defined classifier string, which makes building queries easy.

Among the metrics, we collect:

  • Status code count
  • Current active calls
  • Total messages sent
  • Total messages received
  • Request and response duration

We’ve also added access logging and context propagation (again using Kamon), so that we can trace requests throughout the entire system.
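To give a flavour of the metrics interceptor, here is a minimal Kamon-based sketch. The metric and tag names are placeholders rather than the exact ones we report.

```scala
import io.grpc._
import kamon.Kamon

// Counts completed calls and records their duration, tagged by service, method and status code.
final class MetricsInterceptor extends ServerInterceptor {
  override def interceptCall[ReqT, RespT](
      call: ServerCall[ReqT, RespT],
      headers: Metadata,
      next: ServerCallHandler[ReqT, RespT]
  ): ServerCall.Listener[ReqT] = {
    val method  = call.getMethodDescriptor
    val started = System.nanoTime()

    // Wrap the call so we can observe the final status and duration when it closes.
    val monitored = new ForwardingServerCall.SimpleForwardingServerCall[ReqT, RespT](call) {
      override def close(status: Status, trailers: Metadata): Unit = {
        Kamon.counter("grpc.server.completed-calls")
          .withTag("service", method.getServiceName)
          .withTag("method", method.getBareMethodName)
          .withTag("status", status.getCode.name())
          .increment()
        Kamon.timer("grpc.server.call-duration")
          .withTag("service", method.getServiceName)
          .record(System.nanoTime() - started)
        super.close(status, trailers)
      }
    }
    next.startCall(monitored, headers)
  }
}
```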

5. Testing

Building and using the docker image for testing

One of the benefits of working with schemas for inter-service communication is easy mocking. We wanted to let teams mock gRPC services with as little setup as possible.

We found a few open-source libraries that help with creating mock servers from proto schemas. The problem was that they required copying Protobuf files into the test repository, and those copies can easily get out of sync.

Our solution was to create a Docker image that contains all our Protobuf schemas, and is thus capable of mocking all gRPC endpoints.

We use GripMock, an open source gRPC mock server, as the base layer of the Docker image. We then add a layer containing all the schema files in the repository.

To mock a gRPC endpoint, the user runs the image and supplies the project/version to be mocked as an environment variable. Request / response scenarios can then be recorded using JSON requests to an admin port, while gRPC traffic is served from another port.
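As an example, registering a scenario from a Scala test could look like the sketch below. The admin path and ports follow GripMock’s defaults as we understand them, and the service, method, and field names are hypothetical.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Posts a canned request/response scenario to the mock's admin port before the test runs.
object GrpcMockStubs {
  private val client = HttpClient.newHttpClient()

  def register(adminUrl: String, stubJson: String): Unit = {
    val request = HttpRequest.newBuilder(URI.create(s"$adminUrl/add"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(stubJson))
      .build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    require(response.statusCode() == 200, s"stub registration failed: ${response.body()}")
  }
}

// Usage in a test (hypothetical service, method and fields):
// GrpcMockStubs.register(
//   "http://localhost:4771",
//   """{"service": "AnalysisService", "method": "Analyze",
//      "input": {"equals": {"orderId": "42"}},
//      "output": {"data": {"decision": "APPROVED"}}}"""
// )
```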

Overcoming the shortcomings

Protobuf and gRPC, like any technology, have downsides.
Let’s discuss the specific shortcomings relevant to our usage, and how we worked around them.

All properties are optional

In Protobuf v3, all fields are optional or, more precisely, all fields have default values. The Protobuf developers have good reasons for this, but in our services we sometimes can’t handle a request if a field is missing. This also creates mismatches when defining Avro schemas that represent the same data as the Protobuf schemas, since Avro does have required fields.

In such cases, we use a validation layer that translates the Protobuf objects to internal domain objects, while rejecting invalid requests.
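A minimal sketch of such a layer, with hypothetical generated and domain types (the real ones are more involved), might look like this:

```scala
// Stand-in for a generated message: in proto3 every field has a default value,
// so "missing" simply means "left at its default".
final case class AnalyzeRequestProto(orderId: String = "", amountCents: Long = 0L)

// Internal domain model carrying the invariants we actually require.
final case class OrderAnalysis(orderId: String, amountCents: Long)

sealed trait ValidationError
case object MissingOrderId    extends ValidationError
case object NonPositiveAmount extends ValidationError

object OrderAnalysis {
  // Translate the Protobuf object into the domain object, rejecting invalid requests.
  def fromProto(proto: AnalyzeRequestProto): Either[ValidationError, OrderAnalysis] =
    for {
      id     <- Either.cond(proto.orderId.nonEmpty, proto.orderId, MissingOrderId)
      amount <- Either.cond(proto.amountCents > 0, proto.amountCents, NonPositiveAmount)
    } yield OrderAnalysis(id, amount)
}
```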

Minor language incompatibilities

Some language plugins expect camel case enum values, and some expect snake case. We had to tweak code generation plugin settings to make the messages compatible between all languages.

Integration with Spark

Compared to Avro, Spark support for Protobuf is limited. We haven’t yet found a straightforward way to read a serialized Protobuf payload into a DataFrame.

Wrapping up

Using microservices lets us choose the best language and framework for each task. At Riskified, we use Scala, Ruby, Node.js and Python. This forced us to think about how we maintain clients and servers, document our APIs and handle version upgrades.

Protobuf and gRPC helped us move towards a solution. Even though it requires investment in tooling and processes, we feel the journey was worth it.

I want to thank my colleagues Alik Berezovsky, who initiated this effort, and Nir Dunetz, Nadav Wiener, and Tomer Barak who were instrumental in making it happen.
