System Design: Building an Authorization Service

Authentication answers "Who are you?"

Authorization answers "What are you allowed to do?"

Most systems start with a simple role column on the user table. That works until the product grows. Then permissions become tenant-specific, resource-specific, time-bound, inherited through teams, and different across environments. A support agent can view an account but not export data. A project admin can invite members but not change billing. A service account can read webhooks but not rotate credentials. A user may be an admin in one workspace and a viewer in another.

At that point, authorization becomes a platform problem.

This guide designs a production authorization service: RBAC, ABAC, relationship-based permissions, policy evaluation, caching, consistency tradeoffs, audit logs, admin workflows, and failure modes.

Problem Statement

Build a service that lets product teams answer permission checks consistently:

Can principal P perform action A on resource R in context C?

Examples:

Can user u_123 update project p_456?
Can service account svc_webhook_worker replay delivery d_789?
Can support agent agent_7 view tenant t_42 after providing a support ticket ID?
Can user u_123 export audit logs for workspace w_9?
Can API client client_abc call POST /v1/refunds?

The authorization service should make these decisions quickly, safely, and with an explanation for every result.

Requirements

Functional requirements:

define permissions and actions
assign roles to users and service accounts
support resource-specific access
support group/team inheritance
support tenant isolation
support attribute-based policy conditions
evaluate authorization decisions
explain deny decisions
audit policy changes and sensitive checks
support safe policy rollout

Non-functional requirements:

low latency
high availability
clear consistency model
minimal blast radius for bad policies
strong auditability
easy debugging
backward-compatible policy changes
safe caching and invalidation

The hardest requirement is not storing permissions. It is making permission decisions correct everywhere, even when policies change, caches exist, and product teams keep adding special cases.

Authorization Models

Most real systems use a mix of models.

RBAC: Role-Based Access Control

Users get roles. Roles contain permissions.

workspace_admin -> project:create, project:update, member:invite
workspace_viewer -> project:read, member:read

RBAC is simple and understandable. It becomes awkward when every customer wants custom roles or when permissions depend on resource attributes.
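Here is a minimal in-memory sketch of the RBAC mapping above, assuming roles and actions are plain strings. `ROLE_PERMISSIONS` and `roleAllows` are hypothetical names used for illustration, not part of any library:

```typescript
// Hypothetical role -> permission table mirroring the RBAC example above.
const ROLE_PERMISSIONS = new Map<string, ReadonlySet<string>>([
  ["workspace_admin", new Set(["project:create", "project:update", "member:invite"])],
  ["workspace_viewer", new Set(["project:read", "member:read"])],
]);

// Allowed if any assigned role grants the action.
// Unknown roles grant nothing, so a typo fails closed instead of open.
function roleAllows(roles: readonly string[], action: string): boolean {
  return roles.some((role) => {
    const permissions = ROLE_PERMISSIONS.get(role);
    return permissions !== undefined && permissions.has(action);
  });
}

console.log(roleAllows(["workspace_viewer"], "project:update")); // false
console.log(roleAllows(["workspace_viewer", "workspace_admin"], "project:update")); // true
```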

ABAC: Attribute-Based Access Control

Policies use attributes of the principal, resource, action, and request.

allow if principal.department == resource.department
allow if request.ip_range in trusted_networks
allow if resource.sensitivity != "restricted"

ABAC is flexible, but policy debugging can become painful if every decision depends on many attributes.
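To make the ABAC rules concrete, here is a sketch that evaluates them as plain predicates. The attribute shape, the `10.0.0.0/8` trusted network, and the choice to require all three rules together are assumptions; a policy engine could equally treat each `allow if` line as an independent rule. The IPv4 CIDR matcher stands in for `request.ip_range in trusted_networks`.

```typescript
interface Attributes {
  principal: { department: string };
  resource: { department: string; sensitivity: string };
  request: { ip: string };
}

// Hypothetical trusted network list for the trusted_networks rule.
const TRUSTED_NETWORKS = ["10.0.0.0/8"];

// Convert a dotted IPv4 address to an unsigned 32-bit integer.
function ipv4ToInt(ip: string): number {
  const parts = ip.split(".").map(Number);
  if (parts.length !== 4 || parts.some((p) => !Number.isInteger(p) || p < 0 || p > 255)) {
    throw new Error(`invalid IPv4 address: ${ip}`);
  }
  return ((parts[0] << 24) | (parts[1] << 16) | (parts[2] << 8) | parts[3]) >>> 0;
}

// True if ip falls inside a CIDR block such as "10.0.0.0/8".
function inCidr(ip: string, cidr: string): boolean {
  const [base, bitsText] = cidr.split("/");
  const bits = Number(bitsText);
  const mask = bits === 0 ? 0 : (~0 << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

// One predicate per ABAC line above; this sketch requires all of them.
const RULES: Array<(a: Attributes) => boolean> = [
  (a) => a.principal.department === a.resource.department,
  (a) => TRUSTED_NETWORKS.some((net) => inCidr(a.request.ip, net)),
  (a) => a.resource.sensitivity !== "restricted",
];

function abacAllows(a: Attributes): boolean {
  try {
    return RULES.every((rule) => rule(a));
  } catch {
    return false; // malformed attributes fail closed
  }
}
```

Because each rule reads a different attribute, a bare `false` from this function is hard to explain; a production version would also report which rule failed.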

ReBAC: Relationship-Based Access Control

Permissions come from relationships between objects.

user:u_123 member workspace:w_9
workspace:w_9 owner project:p_456

ReBAC is useful for collaboration products: documents, folders, workspaces, teams, organizations, and inherited permissions.
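A sketch of a relationship check over the two tuples above, assuming tuples are held in memory as `subject relation object` records. The only inheritance rule is a hypothetical one spelled out in the comments: members of a workspace may edit the projects that workspace owns.

```typescript
interface Tuple {
  subject: string;
  relation: string;
  object: string;
}

// The two tuples from the example above.
const tuples: Tuple[] = [
  { subject: "user:u_123", relation: "member", object: "workspace:w_9" },
  { subject: "workspace:w_9", relation: "owner", object: "project:p_456" },
];

function hasTuple(subject: string, relation: string, object: string): boolean {
  return tuples.some((t) => t.subject === subject && t.relation === relation && t.object === object);
}

// Hypothetical rule: members of a workspace may edit every project that workspace owns.
function canEditProject(user: string, project: string): boolean {
  return tuples.some(
    (t) => t.relation === "owner" && t.object === project && hasTuple(user, "member", t.subject),
  );
}
```

No tuple says "u_123 can edit p_456"; the permission is derived by following relationships, which is what makes ReBAC a good fit for nested workspaces and folders.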

Good product authorization often uses RBAC for common roles, ABAC for conditions, and ReBAC for resource hierarchy.

High-Level Architecture

        +------------------+
        | Product Service  |
        +--------+---------+
                 |
                 v
        +--------+---------+
        | AuthZ SDK / Cache|
        +--------+---------+
                 |
                 v
+----------------+----------------+
|         Authorization API        |
+----------------+----------------+
                 |
      +----------+----------+
      | Policy Evaluation   |
      | Engine              |
      +----------+----------+
                 |
      +----------+----------+
      | Permission Store    |
      | Relationship Store  |
      | Attribute Provider  |
      +----------+----------+
                 |
                 v
          +------+------+
          | Audit Log   |
          +-------------+

Product services call the authorization SDK. The SDK handles local caching and request shaping. The authorization API authenticates callers and sends checks to the policy engine. The engine reads role assignments, relationships, resource attributes, and policy definitions. Every policy change and sensitive access decision is written to the audit log.
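Because every product service goes through the SDK, it is a natural place to guarantee fail-closed behavior. The following is a synchronous sketch, assuming a hypothetical `remoteCheck` transport function and a local copy of the action registry. Unknown actions, thrown errors (such as timeouts), and anything other than an explicit ALLOW all become DENY. A real SDK call would be asynchronous, but the same rule applies to a rejected or timed-out promise.

```typescript
type Decision = "ALLOW" | "DENY";

interface CheckResult {
  decision: Decision;
  reason: string;
}

// Hypothetical local copy of the central action registry (authz_actions).
const KNOWN_ACTIONS = new Set(["project:read", "project:update", "audit_log:export"]);

// Wraps the transport call so that no failure path can produce ALLOW.
function safeCheck(action: string, remoteCheck: (action: string) => unknown): CheckResult {
  if (!KNOWN_ACTIONS.has(action)) {
    return { decision: "DENY", reason: `unknown action ${action}` };
  }
  try {
    const raw = remoteCheck(action) as Partial<CheckResult> | null | undefined;
    if (raw != null && raw.decision === "ALLOW") {
      return { decision: "ALLOW", reason: raw.reason ?? "allowed" };
    }
    // Anything other than an explicit ALLOW, including malformed responses, is a deny.
    return { decision: "DENY", reason: raw?.reason ?? "no explicit allow" };
  } catch (err) {
    // Timeouts and transport errors surface here; fail closed.
    return { decision: "DENY", reason: `authorization error: ${String(err)}` };
  }
}
```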

Core Data Model

Start with actions:

CREATE TABLE authz_actions (
  action TEXT PRIMARY KEY,
  description TEXT NOT NULL,
  risk_level TEXT NOT NULL
);

Roles:

CREATE TABLE authz_roles (
  id UUID PRIMARY KEY,
  tenant_id TEXT NOT NULL,
  name TEXT NOT NULL,
  description TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE (tenant_id, name)
);

CREATE TABLE authz_role_permissions (
  role_id UUID NOT NULL REFERENCES authz_roles(id),
  action TEXT NOT NULL REFERENCES authz_actions(action),
  PRIMARY KEY (role_id, action)
);

Assignments:

CREATE TABLE authz_role_assignments (
  id UUID PRIMARY KEY,
  tenant_id TEXT NOT NULL,
  principal_type TEXT NOT NULL,
  principal_id TEXT NOT NULL,
  role_id UUID NOT NULL REFERENCES authz_roles(id),
  resource_type TEXT NOT NULL,
  resource_id TEXT NOT NULL,
  expires_at TIMESTAMPTZ,
  created_by TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX authz_assignments_lookup_idx
ON authz_role_assignments (
  tenant_id,
  principal_type,
  principal_id,
  resource_type,
  resource_id
);
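Assignment rows only become permissions after two filters: the grant must not be expired, and it must be attached to the resource or one of its ancestors. Here is a sketch of an `expandPermissions` step over rows shaped like `authz_role_assignments`, assuming an in-memory role table and a lineage list that starts with the resource itself. All names are hypothetical.

```typescript
interface Assignment {
  roleName: string;
  resourceType: string;
  resourceId: string;
  expiresAt: Date | null; // NULL expires_at means the grant never expires
}

// Hypothetical role table; in the schema above this is authz_role_permissions.
const ROLE_PERMISSIONS = new Map<string, string[]>([
  ["workspace_admin", ["project:create", "project:update", "member:invite"]],
  ["workspace_viewer", ["project:read", "member:read"]],
]);

// lineage: the resource followed by its ancestors, e.g. ["project:p_456", "workspace:w_9"].
function expandPermissions(assignments: Assignment[], lineage: string[], now: Date): Set<string> {
  const scope = new Set(lineage);
  const permissions = new Set<string>();
  for (const a of assignments) {
    if (a.expiresAt !== null && a.expiresAt.getTime() <= now.getTime()) continue; // expired grant
    if (!scope.has(`${a.resourceType}:${a.resourceId}`)) continue; // granted on an unrelated resource
    for (const p of ROLE_PERMISSIONS.get(a.roleName) ?? []) permissions.add(p);
  }
  return permissions;
}
```

Filtering expiry at read time means an expired grant stops working immediately, even if a cleanup job has not yet deleted the row.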

Relationships:

CREATE TABLE authz_relationships (
  tenant_id TEXT NOT NULL,
  subject_type TEXT NOT NULL,
  subject_id TEXT NOT NULL,
  relation TEXT NOT NULL,
  object_type TEXT NOT NULL,
  object_id TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (
    tenant_id,
    subject_type,
    subject_id,
    relation,
    object_type,
    object_id
  )
);

Example relationships:

user:u_123 member workspace:w_9
group:g_engineering member workspace:w_9
project:p_456 child_of workspace:w_9
folder:f_1 parent_of document:d_2
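Tuples like `group:g_engineering member workspace:w_9` mean membership must be expanded transitively. The sketch below assumes tuples are held in memory, that only the `member` relation nests, and that users join groups through a `user:... member group:...` tuple (the last is not shown above). The visited set keeps a cyclic group graph from looping forever.

```typescript
type Tuple = [string, string, string]; // [subject, relation, object]

// Returns true if `subject` reaches `target` through a chain of `member` tuples.
function isMember(tuples: Tuple[], subject: string, target: string): boolean {
  const visited = new Set<string>();
  const queue: string[] = [subject];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    if (current === target) return true;
    if (visited.has(current)) continue; // cycle or diamond: already expanded
    visited.add(current);
    for (const [s, relation, o] of tuples) {
      if (s === current && relation === "member") queue.push(o);
    }
  }
  return false;
}
```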

Decision API

Single check:

POST /v1/authorize

{
  "principal": {
    "type": "user",
    "id": "u_123"
  },
  "action": "project:update",
  "resource": {
    "type": "project",
    "id": "p_456"
  },
  "context": {
    "tenantId": "t_42",
    "requestIp": "10.2.4.10",
    "ticketId": null
  }
}

Response:

{
  "decision": "ALLOW",
  "reason": "role workspace_admin on workspace w_9 includes project:update",
  "policyVersion": "2026-04-08.17",
  "cacheTtlSeconds": 30
}
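The request and response shapes above translate naturally into types. In this sketch, a hypothetical `parseAuthorizeRequest` helper rejects a request that omits the tenant rather than guessing one, so a malformed call cannot widen access. The requirement that actions contain a colon is an assumption based on the `project:update` naming used here.

```typescript
type Decision = "ALLOW" | "DENY";

interface Ref {
  type: string;
  id: string;
}

interface AuthorizeRequest {
  principal: Ref;
  action: string;
  resource: Ref;
  context: { tenantId: string; requestIp?: string; ticketId?: string | null };
}

interface AuthorizeResponse {
  decision: Decision;
  reason: string;
  policyVersion: string;
  cacheTtlSeconds: number;
}

function isRef(value: unknown): value is Ref {
  const v = value as Partial<Ref> | null | undefined;
  return v != null && typeof v.type === "string" && typeof v.id === "string";
}

// Validates an untrusted JSON body; throws instead of defaulting any field.
function parseAuthorizeRequest(body: unknown): AuthorizeRequest {
  const b = body as Partial<AuthorizeRequest> | null | undefined;
  if (b == null || typeof b !== "object") throw new Error("body must be an object");
  if (!isRef(b.principal) || !isRef(b.resource)) throw new Error("principal and resource are required");
  if (typeof b.action !== "string" || !b.action.includes(":")) throw new Error("action must look like type:verb");
  const ctx = b.context;
  if (ctx == null || typeof ctx.tenantId !== "string" || ctx.tenantId === "") {
    throw new Error("context.tenantId is required"); // never guess the tenant
  }
  return { principal: b.principal, action: b.action, resource: b.resource, context: ctx };
}

// Error paths respond with DENY and forbid caching.
function denyResponse(reason: string, policyVersion: string): AuthorizeResponse {
  return { decision: "DENY", reason, policyVersion, cacheTtlSeconds: 0 };
}
```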

Batch check:

POST /v1/authorize/batch

Use batch checks for list pages:

{
  "principal": { "type": "user", "id": "u_123" },
  "checks": [
    { "action": "project:update", "resource": { "type": "project", "id": "p_1" } },
    { "action": "project:update", "resource": { "type": "project", "id": "p_2" } },
    { "action": "project:update", "resource": { "type": "project", "id": "p_3" } }
  ],
  "context": { "tenantId": "t_42" }
}

Without batch APIs, product services often create N+1 authorization calls and make every list page slow.
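A batch endpoint is also a chance to avoid repeated work. This sketch assumes a hypothetical single-check `evaluate` function. It deduplicates identical checks, evaluates each one once, and returns decisions keyed by `action|resourceType:resourceId` so the caller can join them back onto rows.

```typescript
type Decision = "ALLOW" | "DENY";

interface Check {
  action: string;
  resource: { type: string; id: string };
}

// Evaluates each distinct (action, resource) pair once.
function batchAuthorize(checks: Check[], evaluate: (check: Check) => Decision): Map<string, Decision> {
  const results = new Map<string, Decision>();
  for (const check of checks) {
    const key = `${check.action}|${check.resource.type}:${check.resource.id}`;
    if (!results.has(key)) {
      results.set(key, evaluate(check));
    }
  }
  return results;
}
```

A real batch evaluator would go further and share role, lineage, and relationship lookups across checks, which is where most of the savings over N separate calls come from.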

Policy Evaluation

Policy evaluation should be deterministic and explainable.

Pseudo-code:

async function authorize(req: AuthorizationRequest): Promise<AuthzDecision> {
  const resource = await resourceGraph.resolve(req.resource);

  const roleAssignments = await assignmentStore.findAssignments({
    tenantId: req.context.tenantId,
    principal: req.principal,
    resourceLineage: resource.lineage,
  });

  const permissions = expandPermissions(roleAssignments);
  const hasAction = permissions.includes(req.action);

  if (!hasAction) {
    return deny("no matching role permission");
  }

  const conditions = await policyStore.findConditions(req.action);
  for (const condition of conditions) {
    const ok = await evaluateCondition(condition, req, resource);
    if (!ok) {
      return deny(`condition failed: ${condition.name}`);
    }
  }

  return allow("matching role and conditions");
}

If you support explicit deny assignments, check them first so a deny always overrides any allow:

if (hasDenyAssignment(req)) {
  return deny("explicit deny assignment");
}

But do not overuse explicit deny. It can make debugging difficult because an old deny rule can override a new allow rule in surprising ways.
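To tie the pseudo-code together, here is a self-contained sketch of the evaluation order with in-memory stand-ins for the stores. The store shapes, the `denies` list, and the condition functions are assumptions. The sketch shows three properties: explicit deny is checked first, a condition that throws counts as a failure, and every path returns a reason.

```typescript
interface Verdict {
  decision: "ALLOW" | "DENY";
  reason: string;
}

interface AuthzRequest {
  principal: string;
  action: string;
  resource: string;
  context: Record<string, unknown>;
}

interface Condition {
  name: string;
  action: string;
  test: (req: AuthzRequest) => boolean;
}

// In-memory stand-ins for the deny list, assignment store, and policy store.
interface PolicyData {
  denies: Array<{ principal: string; action: string }>;
  permissionsFor: (principal: string, resource: string) => Set<string>;
  conditions: Condition[];
}

const allow = (reason: string): Verdict => ({ decision: "ALLOW", reason });
const deny = (reason: string): Verdict => ({ decision: "DENY", reason });

function authorize(req: AuthzRequest, policy: PolicyData): Verdict {
  // 1. Explicit deny is checked first so it always wins.
  if (policy.denies.some((d) => d.principal === req.principal && d.action === req.action)) {
    return deny("explicit deny assignment");
  }
  // 2. Some role on the resource (or an ancestor) must grant the action.
  if (!policy.permissionsFor(req.principal, req.resource).has(req.action)) {
    return deny("no matching role permission");
  }
  // 3. Every condition attached to the action must pass; a throwing condition fails closed.
  for (const condition of policy.conditions) {
    if (condition.action !== req.action) continue;
    let ok = false;
    try {
      ok = condition.test(req) === true;
    } catch {
      // ok stays false
    }
    if (!ok) return deny(`condition failed: ${condition.name}`);
  }
  return allow("matching role and conditions");
}
```

Returning the first failing reason is what makes "explain deny decisions" cheap: the reason string is already the explanation.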

Resource Hierarchy

Permissions often inherit through a tree:

organization
  -> workspace
      -> project
          -> environment
              -> deployment

If a user is workspace_admin on workspace w_9, they may be allowed to update projects under that workspace.

Store lineage:

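CREATE TABLE resource_edges (
  tenant_id TEXT NOT NULL,
  parent_type TEXT NOT NULL,
  parent_id TEXT NOT NULL,
  child_type TEXT NOT NULL,
  child_id TEXT NOT NULL,
  PRIMARY KEY (tenant_id, parent_type, parent_id, child_type, child_id)
);

-- Ancestor lookups filter by child, so index that direction as well.
CREATE INDEX resource_edges_child_idx
ON resource_edges (tenant_id, child_type, child_id);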

At read time, resolve ancestors:

WITH RECURSIVE ancestors AS (
  SELECT parent_type, parent_id, child_type, child_id, 1 AS depth
  FROM resource_edges
  WHERE tenant_id = :tenant_id
    AND child_type = :resource_type
    AND child_id = :resource_id

  UNION ALL

  SELECT e.parent_type, e.parent_id, e.child_type, e.child_id, a.depth + 1
  FROM resource_edges e
  JOIN ancestors a
    ON e.child_type = a.parent_type
   AND e.child_id = a.parent_id
  WHERE e.tenant_id = :tenant_id
    AND a.depth < 10  -- depth guard: a bad edge that forms a cycle would otherwise recurse forever
)
SELECT parent_type, parent_id, depth
FROM ancestors
ORDER BY depth;

For high-scale systems, precompute resource lineage or keep it in a graph-optimized store. Recursive SQL is fine for moderate depth and moderate traffic, but list pages with thousands of resources need careful batching.
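One way to precompute lineage is a child-to-parent map, since each resource in the tree above has a single parent. This sketch assumes resources are keyed as `type:id` strings and uses a hypothetical `MAX_DEPTH` guard. A corrupted edge that forms a loop raises an error, which the caller should treat as a deny, instead of walking forever.

```typescript
const MAX_DEPTH = 16; // hypothetical limit; the hierarchy above is five levels deep

// parentOf maps "type:id" to the "type:id" of its parent.
function lineage(resource: string, parentOf: Map<string, string>): string[] {
  const chain = [resource];
  const seen = new Set(chain);
  let current = parentOf.get(resource);
  while (current !== undefined) {
    if (seen.has(current) || chain.length >= MAX_DEPTH) {
      throw new Error(`invalid resource hierarchy at ${current}`); // fail closed on bad data
    }
    chain.push(current);
    seen.add(current);
    current = parentOf.get(current);
  }
  return chain;
}
```

The resulting list can be stored per resource and refreshed when an edge changes, so checks read one row instead of running a recursive query.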

Attribute Conditions

ABAC conditions handle context:

{
  "name": "support_ticket_required",
  "action": "customer:read_sensitive",
  "expression": "principal.type == 'support_agent' && context.ticketId != null"
}

Another example:

{
  "name": "business_hours_only",
  "action": "billing:refund",
  "expression": "context.hour >= 9 && context.hour < 18 && principal.mfa == true"
}

Do not let every service invent its own expression language. Standardize policy syntax and test policies before rollout.

If policy expressions become complex, add a policy test suite:

{
  "name": "support cannot export raw card data",
  "request": {
    "principal": { "type": "user", "id": "support_1" },
    "action": "payment_card:export",
    "resource": { "type": "tenant", "id": "t_42" },
    "context": { "ticketId": "INC-1", "mfa": true }
  },
  "expectedDecision": "DENY"
}

Treat policy changes like code changes: reviewed, tested, deployed gradually, and easy to roll back.
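A policy test suite can be as simple as a list of cases like the one above plus a runner that reports mismatches. This sketch assumes a hypothetical `decide` function that wraps the candidate policy bundle, with principals and resources flattened to `type:id` strings. CI would block the rollout if the returned list is non-empty.

```typescript
type Decision = "ALLOW" | "DENY";

interface PolicyTestCase {
  name: string;
  request: { principal: string; action: string; resource: string; context: Record<string, unknown> };
  expectedDecision: Decision;
}

// Runs every case and returns one readable line per mismatch.
function runPolicyTests(
  cases: PolicyTestCase[],
  decide: (req: PolicyTestCase["request"]) => Decision,
): string[] {
  const failures: string[] = [];
  for (const testCase of cases) {
    const actual = decide(testCase.request);
    if (actual !== testCase.expectedDecision) {
      failures.push(`${testCase.name}: expected ${testCase.expectedDecision}, got ${actual}`);
    }
  }
  return failures;
}
```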

Caching Strategy

Authorization checks must be fast. They also must react to permission changes.

Cache layers:

SDK in-process cache
authorization service cache
relationship graph cache
policy bundle cache

Cache key:

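tenantId:principalType:principalId:action:resourceType:resourceId:policyVersion

If a decision depends on request context (ticket ID, IP range, MFA state), include those attributes in the key or skip caching for that action. Otherwise an allow granted with a ticket can be reused without one.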

Use short TTLs for sensitive actions:

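// Cache TTL in seconds, chosen by action suffix.
function ttlFor(action: string): number {
  if (action.endsWith(":export")) return 0; // never cache exports
  if (action.endsWith(":read_sensitive")) return 5;
  if (action.endsWith(":delete")) return 5;
  if (action.endsWith(":read")) return 60;
  return 30;
}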

Add invalidation events for permission changes:

{
  "eventType": "AUTHZ_POLICY_CHANGED",
  "tenantId": "t_42",
  "principalId": "u_123",
  "resourceType": "workspace",
  "resourceId": "w_9",
  "policyVersion": "2026-04-08.18"
}

If invalidation is best effort, keep TTLs bounded. Never rely on cache invalidation alone for critical revocation.
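Here is a sketch of a decision cache along these lines, assuming a hypothetical `DecisionCache` class with an injected clock. The key includes the policy version, so a new bundle never reads old entries. It also folds in decision-relevant context such as `ticketId`, so a decision made with a ticket is not reused without one. Entries expire after their TTL, a TTL of 0 is never cached, and an invalidation event drops every entry for the named principal. If an event is lost, the TTL still bounds staleness.

```typescript
type Decision = "ALLOW" | "DENY";

interface CacheEntry {
  decision: Decision;
  expiresAtMs: number;
}

class DecisionCache {
  private readonly entries = new Map<string, CacheEntry>();
  private readonly keysByPrincipal = new Map<string, Set<string>>();
  private readonly nowMs: () => number;

  constructor(nowMs: () => number) {
    this.nowMs = nowMs; // injectable clock, e.g. () => Date.now()
  }

  // Policy version and decision-relevant context are part of the key.
  static key(tenantId: string, principal: string, action: string, resource: string,
             policyVersion: string, context: Record<string, string>): string {
    const ctx = Object.keys(context).sort().map((k) => `${k}=${context[k]}`).join("&");
    return [tenantId, principal, action, resource, policyVersion, ctx].join("|");
  }

  get(key: string): Decision | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAtMs <= this.nowMs()) {
      this.entries.delete(key); // expired: caller must re-evaluate
      return undefined;
    }
    return entry.decision;
  }

  set(key: string, principal: string, decision: Decision, ttlSeconds: number): void {
    if (ttlSeconds <= 0) return; // TTL 0 means "always check fresh"
    this.entries.set(key, { decision, expiresAtMs: this.nowMs() + ttlSeconds * 1000 });
    const keys = this.keysByPrincipal.get(principal) ?? new Set<string>();
    keys.add(key);
    this.keysByPrincipal.set(principal, keys);
  }

  // Handler for invalidation events that name a principal.
  invalidatePrincipal(principal: string): void {
    const keys = this.keysByPrincipal.get(principal);
    if (keys !== undefined) keys.forEach((key) => this.entries.delete(key));
    this.keysByPrincipal.delete(principal);
  }
}
```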

Consistency Tradeoffs

Authorization has uncomfortable consistency requirements.

If a user loses access, how quickly must it take effect?

For low-risk read actions, a short stale window may be acceptable. For destructive actions, privilege escalation, billing, exports, and production operations, stale decisions should be avoided.

One practical model:

Action Type	Cache TTL	Stale Allowed	Notes
Public read	5 minutes	yes	not sensitive
Normal product read	30-60 seconds	limited	improves UI speed
Sensitive read	5-10 seconds	not after a revocation event	audit heavily
Write	5-30 seconds	limited	depends on business risk
Delete/export/admin	0-5 seconds	no	prefer fresh check

For high-risk actions, force a fresh authorization check:

await authz.authorize({
  principal,
  action: "audit_log:export",
  resource,
  context,
  consistency: "fresh",
});

The authorization service can route fresh checks around caches and read from the primary store.
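Here is a sketch of how the service might route checks, assuming a hypothetical `HIGH_RISK` action list and two lookup functions: one cached, one reading the primary store. Callers can ask for `fresh`, but high-risk actions are forced fresh regardless, so a forgotten flag cannot produce a stale export decision.

```typescript
type Decision = "ALLOW" | "DENY";
type Consistency = "cached" | "fresh";

// Hypothetical list of actions that are always checked against the primary store.
const HIGH_RISK = new Set(["audit_log:export", "billing:refund", "member:invite"]);

interface Lookups {
  cached: (action: string) => Decision | undefined; // undefined means cache miss
  primary: (action: string) => Decision; // authoritative read, bypasses caches
}

function route(action: string, requested: Consistency, lookups: Lookups): Decision {
  const mustBeFresh = requested === "fresh" || HIGH_RISK.has(action);
  if (!mustBeFresh) {
    const hit = lookups.cached(action);
    if (hit !== undefined) return hit;
  }
  return lookups.primary(action);
}
```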

Admin Workflows

Permissions are not only APIs. Humans need workflows.

Admin UI features:

view who has access to a resource
view what a user can access
grant a role
revoke a role
set expiration
require approval for privileged roles
simulate access before applying changes
show policy explanation
export audit records

Simulation is important:

{
  "change": {
    "grantRole": "workspace_admin",
    "principal": "user:u_123",
    "resource": "workspace:w_9"
  },
  "preview": {
    "newlyAllowedActions": [
      "project:create",
      "project:update",
      "member:invite"
    ],
    "highRiskActions": [
      "member:invite"
    ]
  }
}

Bad permission changes can create security incidents. Make the blast radius visible before the operator clicks apply.
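The preview can be computed by evaluating the principal's effective actions before and after the proposed change and diffing them. This sketch assumes a hypothetical in-memory role table and a `HIGH_RISK_ACTIONS` set. With the principal starting as `workspace_viewer`, it reproduces the preview above.

```typescript
const ROLES = new Map<string, string[]>([
  ["workspace_admin", ["project:create", "project:update", "member:invite"]],
  ["workspace_viewer", ["project:read", "member:read"]],
]);
const HIGH_RISK_ACTIONS = new Set(["member:invite"]);

function effectiveActions(roles: string[]): Set<string> {
  const actions = new Set<string>();
  for (const role of roles) for (const a of ROLES.get(role) ?? []) actions.add(a);
  return actions;
}

// Diff the action sets before and after granting one more role.
function previewGrant(currentRoles: string[], grantRole: string) {
  const before = effectiveActions(currentRoles);
  const after = effectiveActions([...currentRoles, grantRole]);
  const newlyAllowedActions = Array.from(after).filter((a) => !before.has(a));
  return {
    newlyAllowedActions,
    highRiskActions: newlyAllowedActions.filter((a) => HIGH_RISK_ACTIONS.has(a)),
  };
}
```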

Audit Logging

Audit policy changes:

{
  "eventType": "ROLE_GRANTED",
  "tenantId": "t_42",
  "actor": "user:admin_1",
  "principal": "user:u_123",
  "role": "workspace_admin",
  "resource": "workspace:w_9",
  "reason": "Project lead for migration",
  "expiresAt": "2026-05-08T00:00:00Z",
  "createdAt": "2026-04-08T10:30:00Z"
}

Audit sensitive decisions:

{
  "eventType": "AUTHZ_DECISION",
  "tenantId": "t_42",
  "principal": "user:u_123",
  "action": "audit_log:export",
  "resource": "workspace:w_9",
  "decision": "ALLOW",
  "policyVersion": "2026-04-08.18",
  "requestId": "req_abc",
  "createdAt": "2026-04-08T10:31:00Z"
}

Do not audit every low-risk read decision synchronously if it would make the system too expensive. Use sampling or asynchronous logging for low-risk checks, but keep full audit logs for privileged actions.
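Here is a sketch of that split, assuming a hypothetical per-action risk level (compare `risk_level` in `authz_actions`) and an illustrative 1% sample rate for low-risk checks. Sampling hashes the request ID with 32-bit FNV-1a so the same request ID always gets the same sampling decision.

```typescript
type RiskLevel = "low" | "medium" | "high";
type AuditMode = "sync" | "async" | "skip";

// 32-bit FNV-1a over UTF-16 code units; stable across processes, unlike Math.random().
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

function auditMode(risk: RiskLevel, requestId: string, lowRiskSampleRate = 0.01): AuditMode {
  if (risk === "high") return "sync"; // privileged: durable record before responding
  if (risk === "medium") return "async"; // always recorded, off the hot path
  return fnv1a(requestId) / 0x100000000 < lowRiskSampleRate ? "async" : "skip";
}
```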

Integration With Product Services

Keep the call site boring:

authz.require(
    principal,
    "project:update",
    Resource.of("project", projectId),
    AuthzContext.ofTenant(tenantId)
);

For list pages, avoid checking one resource at a time:

Map<String, Decision> decisions = authz.batchAuthorize(
    principal,
    "project:update",
    projects.stream()
        .map(project -> Resource.of("project", project.id()))
        .toList(),
    AuthzContext.ofTenant(tenantId)
);

Do not load all projects and then filter unauthorized ones if the list is large. Push coarse authorization into the query when possible:

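SELECT p.*
FROM projects p
WHERE p.tenant_id = :tenant_id
  AND EXISTS (
    -- EXISTS returns each project once, even if the user holds both relations
    SELECT 1
    FROM authz_relationships r
    WHERE r.tenant_id = p.tenant_id  -- keep the lookup inside one tenant
      AND r.object_type = 'workspace'
      AND r.object_id = p.workspace_id
      AND r.subject_type = 'user'
      AND r.subject_id = :user_id
      AND r.relation IN ('member', 'admin')
  );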

Then use the authorization service for fine-grained checks on returned objects.

Safe Policy Rollout

Policy rollout should support:

dry run
shadow evaluation
canary by tenant
policy version pinning
rollback
decision diff reporting

Shadow mode example:

{
  "currentDecision": "ALLOW",
  "candidateDecision": "DENY",
  "action": "invoice:read",
  "resource": "invoice:inv_123",
  "principal": "user:u_123",
  "reason": "candidate requires billing_viewer role"
}

Before changing a widely used policy, run the candidate policy against production traffic in shadow mode and measure:

new denies
new allows
high-risk decision changes
top affected tenants
endpoints affected

Many authorization incidents are mundane: a policy was technically correct but broader than intended.
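The measurements above can come from a small aggregator over shadow results. This sketch assumes each shadow record carries the tenant, action, both decisions, and a hypothetical `highRisk` flag; an endpoint breakdown would follow the same counting pattern.

```typescript
type Decision = "ALLOW" | "DENY";

interface ShadowRecord {
  tenantId: string;
  action: string;
  current: Decision;
  candidate: Decision;
  highRisk: boolean;
}

interface DiffReport {
  total: number;
  newDenies: number;
  newAllows: number;
  highRiskChanges: number;
  topTenants: Array<[string, number]>; // tenants with the most changed decisions
}

function diffReport(records: ShadowRecord[], topN = 3): DiffReport {
  const changesByTenant = new Map<string, number>();
  let newDenies = 0;
  let newAllows = 0;
  let highRiskChanges = 0;
  for (const r of records) {
    if (r.current === r.candidate) continue;
    if (r.candidate === "DENY") newDenies++;
    else newAllows++;
    if (r.highRisk) highRiskChanges++;
    changesByTenant.set(r.tenantId, (changesByTenant.get(r.tenantId) ?? 0) + 1);
  }
  const topTenants = Array.from(changesByTenant.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN);
  return { total: records.length, newDenies, newAllows, highRiskChanges, topTenants };
}
```

New allows usually deserve more scrutiny than new denies: a new deny produces a support ticket, while a new allow can silently expose data.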

Observability

Metrics:

decision latency
allow/deny rate by action
policy evaluation errors
cache hit rate
stale decision count
fresh check rate
policy bundle version skew
relationship graph lookup latency
audit logging failures
shadow policy decision diff rate

Structured log:

{
  "event": "authz_decision",
  "tenantId": "t_42",
  "principal": "user:u_123",
  "action": "project:update",
  "resource": "project:p_456",
  "decision": "ALLOW",
  "policyVersion": "2026-04-08.18",
  "latencyMs": 7
}

Useful dashboards:

top denied actions
p95/p99 authorization latency
cache hit rate by service
policy version adoption
high-risk action volume
admin role grants
expired grants still active

Failure Modes

Default allow. A timeout or unknown action accidentally returns allow.

Permission cache is too sticky. A revoked user keeps access for minutes or hours.

N+1 authorization checks. List pages become slow because every row calls the authz service separately.

Policy language is too flexible. Nobody can predict what a change will do without running production traffic through it.

Resource lineage is wrong. A project points to the wrong workspace, granting inherited access to the wrong users.

No explanation. Support cannot debug why a customer lost access.

Authorization is split across services. One service checks project:update; another checks workspace:admin; behavior diverges.

Audit logging is optional. Privileged role changes happen without a durable record.

Cyclic group membership. Group inheritance loops cause slow or incorrect evaluation.

Policy rollout has no rollback. A bad policy deploy locks out users or exposes data.

Production Checklist

Define actions centrally.
Separate authentication from authorization.
Start with RBAC, then add ABAC/ReBAC only where needed.
Make tenant and environment part of every decision.
Support batch authorization for list pages.
Keep high-risk actions fresh or near-fresh.
Use short TTLs and invalidation for permission caches.
Emit audit logs for policy changes and sensitive checks.
Add policy simulation before applying admin changes.
Support shadow evaluation for risky policy changes.
Explain deny decisions.
Test resource hierarchy and inheritance.
Detect group membership cycles.
Fail closed on unknown actions and evaluation errors.
Track policy version rollout and decision diff rate.

System Design: Building an Authorization Service

Problem Statement

Requirements

Authorization Models

High-Level Architecture

Core Data Model

Decision API

Policy Evaluation

Resource Hierarchy

Attribute Conditions

Caching Strategy

Consistency Tradeoffs

Admin Workflows

Audit Logging

Integration With Product Services

Safe Policy Rollout

Observability

Failure Modes

Production Checklist

Read Next

Recommended Resources

Sachin Sarawgi

Related Articles

System Design: Building a Distributed Cache

System Design: Building a Distributed Configuration Platform

System Design: Building a Fraud Detection Platform
