Outboxes
Outboxes are database-backed, deferred units of work that drive a large portion of our system's eventual consistency workflows. They were designed with a couple of key features in mind:
- Durability: Outboxes are stored as database rows that are typically committed transactionally alongside associated model updates. They are retained until the processing code signals that work has been completed successfully.
- Retriability: If an outbox processor fails for any reason, the outbox will be reprocessed indefinitely until it succeeds.
- Note: We do not support deadlettering, which means a head-of-line blocking outbox will continue to backlog work until the processing code is fixed.
Outbox Properties
An outbox consists of the following parts:
`shard_scope`
- The operational group the outbox belongs to. This tends to be aligned with the models or domains the outbox applies to.
`shard_identifier`
- The shard, or grouping, identifier.
`object_identifier`
- The ID of the impacted model object (if applicable).
Note: Not every outbox has an explicit model target, so this can be set to an arbitrary unique value if necessary. Just be aware that this identifier is used in tandem with the `shard_identifier` for coalescing purposes.
`category`
- The specific operation the outbox maps to, for example `USER_UPDATE` and `PROVISION_ORGANIZATION`.
`payload`
- An arbitrary JSON blob of additional data required by the outbox processing code.
`scheduled_for`
- The datetime when the outbox should next be run.
`scheduled_from`
- The datetime when the `scheduled_for` date was last set.
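To make the columns above concrete, here is an illustrative stand-in for an outbox row as a plain dataclass. The field names mirror the properties described above; the class itself and its values are hypothetical, not Sentry's actual model definition.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical stand-in for an outbox row; field names mirror the
# columns described above, not Sentry's real Django model.
@dataclass
class OutboxRecord:
    shard_scope: str
    shard_identifier: int
    object_identifier: int
    category: str
    payload: dict
    scheduled_for: datetime
    scheduled_from: datetime

now = datetime.now(timezone.utc)
record = OutboxRecord(
    shard_scope="ORGANIZATION_SCOPE",  # operational group (hypothetical value)
    shard_identifier=42,               # e.g. the organization's ID
    object_identifier=1337,            # ID of the impacted model object
    category="USER_UPDATE",            # the specific operation
    payload={"user_id": 1337},         # extra data for the processing code
    scheduled_for=now,                 # when the outbox should next be run
    scheduled_from=now,                # when scheduled_for was last set
)
```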
Outbox Sharding
Outbox shards are groups of outboxes representing an interdependent chunk of work that must be processed in order. An outbox's shard is determined by the combination of its `shard_scope` and `shard_identifier` columns, and selecting appropriate values for both is essential.
`ControlOutbox` instances have one additional sharding column to consider: `region`, which ensures that each region processes its own shard of work independently of the others.
Some pragmatic examples:
- Organization-scoped outboxes use the organization's ID as the shard identifier, meaning Organization, Project, and Org Auth Token updates (and more!) for a single organization are all processed in order. If any of these outboxes gets stuck, the entire shard will begin to backlog, so be mindful of failure points.
- Audit log event outboxes live in their own scope and use distinct shard identifiers for every outbox to ensure they are processed individually without the threat of being blocked by other outboxes.
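The grouping behavior both examples rely on can be sketched in a few lines. This is illustrative only (plain dicts standing in for database rows, with hypothetical scope names), but it shows why two outboxes sharing a `(shard_scope, shard_identifier)` pair are ordered relative to each other while unrelated shards proceed independently:

```python
from collections import defaultdict

# Hypothetical pending rows; scope names and IDs are made up.
rows = [
    {"shard_scope": "ORGANIZATION_SCOPE", "shard_identifier": 42, "category": "ORGANIZATION_UPDATE"},
    {"shard_scope": "ORGANIZATION_SCOPE", "shard_identifier": 42, "category": "PROJECT_UPDATE"},
    {"shard_scope": "AUDIT_LOG_SCOPE", "shard_identifier": 7, "category": "AUDIT_LOG_EVENT"},
]

# A shard is the list of rows sharing a (shard_scope, shard_identifier) pair.
shards = defaultdict(list)
for row in rows:
    shards[(row["shard_scope"], row["shard_identifier"])].append(row)

# The two organization rows share a shard and must be processed in order;
# the audit log row lives in its own shard and cannot block them.
assert len(shards[("ORGANIZATION_SCOPE", 42)]) == 2
assert len(shards[("AUDIT_LOG_SCOPE", 7)]) == 1
```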
Synchronous vs Asynchronous outboxes
When creating a new outbox message, you can either attempt to immediately process it synchronously, or defer the work for later. Deciding which approach to take depends entirely on the use case and the source of the outbox's creation.
Some quick heuristics for deciding:
- Was the outbox created by an API request that needs to report the status of the operation to the requestor? (process it synchronously)
- Was the outbox a result of a task, automatic process, or webhook request? (process it asynchronously)
Thankfully, choosing the desired behavior is simple:
```python
from django.db import router, transaction

from sentry.models.outbox import outbox_context

# Synchronous outbox processing
with outbox_context(transaction.atomic(router.db_for_write(<model_name_here>))):
    RegionOutbox(...).save()

# Asynchronous outbox processing
with outbox_context(flush=False):
    RegionOutbox(...).save()
```
Both examples require the `outbox_context` context manager, but the key difference is in the arguments supplied.
Synchronous outboxes
Supplying a transaction to `outbox_context` signals the intent to immediately process the outbox after the provided transaction has been committed. This is handled via Django's `on_commit` handlers, which are automatically registered by the context manager.
The context manager will attempt to flush all outboxes generated within the context manager, unless their creation operations are wrapped in a nested asynchronous context manager:
```python
with outbox_context(transaction.atomic(router.db_for_write(<model_type>))):
    sync_outbox = RegionOutbox(...).save()

    with outbox_context(flush=False):
        async_outbox = RegionOutbox(...).save()
```
Because processing occurs after the transaction has been committed, any outboxes that cannot be processed are treated as asynchronous outboxes after their initial flush attempts.
Asynchronous outboxes
Supplying `outbox_context` with `flush=False` instead of a transaction skips generating the `on_commit` handlers entirely, meaning any outboxes created within the context manager will not be processed in the current thread. Instead, they will be picked up by a periodic Celery task that queries all outstanding outbox shards and attempts to process them.
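The general shape of that periodic drain can be sketched as follows. This is a heavily simplified, hypothetical illustration (plain dicts and a made-up `flush_shard` helper stand in for the real Celery task and database queries); it only shows the "find due shards, then flush each one" structure:

```python
from datetime import datetime, timezone

processed = []

def flush_shard(shard):
    # Hypothetical helper: the real task would process the shard's
    # outbox messages in order here.
    processed.append(shard)

def drain_outstanding_shards(rows, now):
    # Find rows whose scheduled_for has come due, collect their
    # distinct shards, and attempt each shard once.
    due = [r for r in rows if r["scheduled_for"] <= now]
    for shard in sorted({(r["shard_scope"], r["shard_identifier"]) for r in due}):
        flush_shard(shard)

now = datetime.now(timezone.utc)
drain_outstanding_shards(
    [{"shard_scope": "ORGANIZATION_SCOPE", "shard_identifier": 42, "scheduled_for": now}],
    now,
)
assert processed == [("ORGANIZATION_SCOPE", 42)]
```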
Choosing an outbox’s silo and database
Outboxes should live in the same silo and database as the models or processes that produce them.
For example, `Organization` model changes generate `OrganizationUpdate` outboxes. Because the `Organization` is a region-silo model that lives in the `sentry` database, the `OrganizationUpdate` outbox is a `RegionOutbox` that also lives in the `sentry` database.
Having both models aligned to the same database ensures that both the model change and the outbox message creation can be committed in the same database transaction for consistency.
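The consistency property this buys can be demonstrated with a minimal sketch, using `sqlite3` as a stand-in for the shared database (table and column names here are hypothetical): because both writes happen in one transaction, either the model change and its outbox row both commit, or neither does.

```python
import sqlite3

# sqlite3 as an illustrative stand-in for the shared database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE organization (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE regionoutbox (id INTEGER PRIMARY KEY, category TEXT, object_identifier INTEGER)"
)

# Both writes commit together, or neither does.
with conn:
    conn.execute("INSERT INTO organization (id, name) VALUES (1, 'acme')")
    conn.execute(
        "INSERT INTO regionoutbox (category, object_identifier) "
        "VALUES ('ORGANIZATION_UPDATE', 1)"
    )

# The model change and the outbox row are now visible atomically.
assert conn.execute("SELECT COUNT(*) FROM regionoutbox").fetchone()[0] == 1
```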
Coalescing
Outboxes are coalesced in order to prevent backpressure from toppling our systems after a blocked outbox shard is cleared. We accomplish this by assuming the last message in a coalescing group is the source of truth, ignoring any preceding messages in the group.
An outbox’s coalescing group is determined by the combination of its sharding, `category`, and `object_identifier` columns.
This coalescing strategy means that any outbox payloads that are stateful and order dependent in the same coalescing group will result in data loss when the group is processed. If you want to bypass coalescing entirely, you can set an arbitrary unique object identifier to ensure messages are run individually and in order; however, this can cause severe bottlenecks so be cautious when doing so.
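A minimal sketch of the "last message wins" behavior, using plain dicts as hypothetical outbox messages: within a coalescing group, only the newest message survives, which is exactly why order-dependent payloads in the same group lose data.

```python
# Hypothetical messages in creation order; ids 1 and 2 share a
# coalescing key, so id 1 will be superseded.
messages = [
    {"id": 1, "category": "USER_UPDATE", "object_identifier": 9},
    {"id": 2, "category": "USER_UPDATE", "object_identifier": 9},
    {"id": 3, "category": "USER_UPDATE", "object_identifier": 10},
]

# Keep only the last message per (category, object_identifier) key
# (the shard columns are assumed equal here and omitted for brevity).
latest = {}
for msg in messages:
    latest[(msg["category"], msg["object_identifier"])] = msg

survivors = sorted(m["id"] for m in latest.values())
assert survivors == [2, 3]  # id 1 is never processed
```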
Processing Outboxes
Outbox processors are implemented as Django signal receivers, with the sender of the receiver set to the outbox category. Here’s an example of a receiver that handles project update outboxes:
```python
from typing import Any

from django.dispatch import receiver

from sentry.models.outbox import OutboxCategory, process_region_outbox

@receiver(process_region_outbox, sender=OutboxCategory.PROJECT_UPDATE)
def process_project_updates(object_identifier: int, shard_identifier: int, payload: Any, **kwargs):
    pass
```
Each receiver is passed the following outbox properties as arguments:
- `object_identifier`
- `shard_identifier`
- `payload`
- `region_name` [`ControlOutbox` only]
If the receiver raises an exception, the shard’s processing will be halted and the entire shard will be scheduled for future processing. If the receiver returns gracefully, the outbox’s coalescing group will be deleted and the next outbox message in the shard will be processed.
Because any outbox message can be retried multiple times in these exceptional cases, it’s crucial to make these processing receivers idempotent.
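The idempotency requirement can be illustrated with a small sketch (the processor, its state dict, and the payload shape are all hypothetical): an upsert-style write converges on the same final state no matter how many times a retry replays it, whereas an append or increment would duplicate work.

```python
# Hypothetical in-memory state standing in for a database table.
state = {}

def process_user_update(object_identifier, payload):
    # Upsert semantics: setting the value twice is harmless on retry,
    # unlike appending or incrementing, which would double up.
    state[object_identifier] = payload["name"]

process_user_update(9, {"name": "ada"})
process_user_update(9, {"name": "ada"})  # replayed after a failure mid-shard

assert state == {9: "ada"}  # same final state despite the retry
```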
Steps for adding a new Outbox
- Add a category enum
- Choose an outbox scope
- Choose a shard identifier and object identifier
- Create a receiver to process the new outbox type
- Update your code to save new outboxes within an outbox context
Possible issues to be mindful of
Deadlocking
If an outbox processor issues an RPC request to another silo that in turn generates a synchronously processed outbox, a deadlock can occur if any outbox in the newly generated outbox’s shard also generates or processes an outbox targeting the originating silo.
This is a fairly rare situation, but one we’ve encountered in production before. Make sure to choose your outbox scopes and synchronous outbox processing use cases carefully, minimizing cross-silo outbox interdependency whenever possible.
Data Inconsistency Between Silos
There’s no guarantee that an outbox will be processed in a timely manner. This can cause data between our silos to drift, so consider this when writing code that depends on outboxes to sync data.
When to use Outboxes
- The operation requires retriability and persistence.
- The operation can be deferred and made eventually consistent.
- A higher latency than a typical API or RPC request (<10s) is acceptable.
When not to use Outboxes
- The work can take more than 10 seconds to process. This is a current limitation due to the way we acquire locks on outbox shards when processing them.
- Data is ordered and has low cardinality, meaning it cannot be sharded efficiently.
- Dependent code requires strong consistency guarantees cross-silo.