Federated Data Catalog: When Should You Go for One?

Last Updated on: May 18th, 2023, Published on: May 12th, 2023

A federated data catalog is a natural extension of the data catalog model. It lets you control access to data using the access control policies already defined at the data source.

Federated data catalogs are particularly useful for enterprises dealing with complex data environments.

In this article, we’ll cover the basics of a federated data catalog, when to use federation, and the advantages and disadvantages of centralized versus decentralized (federated) data catalogs.


Table of contents

  1. What is a federated data catalog?
  2. What is federation?
  3. How federation affects the core functions of a data catalog
  4. How federated authentication (authn) and authorization (authz) streamline data security and governance
  5. When should you use a federated data catalog?
  6. Summary
  7. Related reads

What is a federated data catalog?

A federated data catalog is a central portal for data discovery and collaboration that also enforces the standards for:

  • Authentication: Validating a user’s identity
  • Authorization: Determining what access rights should be granted

It uses a decentralized approach to data access management.

Before we delve into the specifics of a federated data catalog, let’s understand the concept of federation.


What is federation?

In security, federation refers to synchronizing a digital identity across multiple platforms.

Non-federated systems duplicate access credentials (like usernames and passwords) and access permissions across multiple software systems. This results in fragmentation and inconsistency, opening the door for data leaks.

In a federated identity system, sites and toolsets integrate with one or more identity providers. The identity provider validates the user’s identity and determines the scope of the user’s rights and permissions.

Every system that integrates with a given federated identity provider is part of the same federated domain. As a result, a user’s login credentials are valid across all systems that belong to that domain.

Federation vs. SSO: What’s the difference?


Federation is similar to single sign-on (SSO) but not synonymous.

In SSO, all sites within a company integrate with a single identity management system (for example, Microsoft’s Active Directory).

Federation takes this a step further by establishing a set of standards so that sites and toolsets can integrate with multiple identity providers. This enables flexibility in organizations that use disparate tools from several vendors.
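The difference can be sketched as a trust configuration. This is a minimal illustration, not any vendor’s API: the `Service` class and the identity provider names (`corp-directory`, `vendor-idp`) are hypothetical, and a real integration would verify signed assertions rather than compare issuer names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """A site or tool that delegates login to one or more identity providers."""
    name: str
    trusted_idps: frozenset  # hypothetical identity provider identifiers

    def accepts(self, issuer: str) -> bool:
        # A login is honored only if it was issued by a trusted identity provider.
        return issuer in self.trusted_idps


# SSO-style setup: an internal site trusts the single company-wide provider.
wiki = Service("wiki", frozenset({"corp-directory"}))

# Federated setup: the catalog trusts several providers, e.g. one per vendor.
catalog = Service("data-catalog", frozenset({"corp-directory", "vendor-idp"}))

print(wiki.accepts("vendor-idp"))     # False
print(catalog.accepts("vendor-idp"))  # True
```

Under this assumption, adding a new vendor to the federated setup means adding one trusted provider, not re-creating user accounts in every tool.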


How federation affects the core functions of a data catalog

A fully federated data catalog impacts every service a data catalog has to offer, such as:

  • Search
  • Lineage
  • Tagging and classification
  • Governance

Let’s explore this further.


Search


A federated data catalog will provide a single interface for you to search across your data ecosystem, regardless of where the data is sourced from or stored. However, the accessibility will depend on the user’s access rights.

For example, a user may be able to see and read certain data, but not edit it once they find it.

Moreover, they may only be able to find certain sensitive data (for example, customers’ personally identifiable information, or PII) if they have sufficient privileges.
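Permission-aware search of this kind might look roughly like the sketch below. Everything here is hypothetical (the `Asset` fields, the group names, the `can_view_pii` flag); in a real federated catalog these rights would come from the source systems’ policies, not from hard-coded values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    contains_pii: bool
    editors: frozenset  # hypothetical group names allowed to edit


def search(assets, query, user_groups, can_view_pii):
    """Return (asset name, allowed actions) pairs visible to this user."""
    results = []
    for asset in assets:
        if query.lower() not in asset.name.lower():
            continue
        # Sensitive assets are hidden entirely from users without the privilege.
        if asset.contains_pii and not can_view_pii:
            continue
        actions = ["read"]
        if asset.editors & user_groups:
            actions.append("edit")
        results.append((asset.name, actions))
    return results


ASSETS = [
    Asset("orders_daily", contains_pii=False, editors=frozenset({"sales-eng"})),
    Asset("orders_customer_emails", contains_pii=True, editors=frozenset({"privacy"})),
]

analyst_view = search(ASSETS, "orders", frozenset({"analysts"}), can_view_pii=False)
steward_view = search(ASSETS, "orders", frozenset({"privacy"}), can_view_pii=True)
```

Here the analyst finds only `orders_daily` and can read but not edit it, while the privacy steward also finds the PII table and can edit it.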

Lineage


Data lineage tracks the flow of your data through your organization as it is created, consumed, and modified over time.

In a federated data catalog, distributed authorization controls who can access the lineage history. It also provides comprehensive tracking of who in the company made specific changes to a dataset.
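As a rough sketch of access-controlled lineage, the snippet below walks upstream from a dataset and masks any node the user is not allowed to read. The dataset names and the `UPSTREAM` mapping are invented for illustration, and masking (rather than hiding the node altogether) is one possible design choice, not something the catalog model prescribes.

```python
# Hypothetical lineage edges: dataset -> datasets it is derived from.
UPSTREAM = {
    "revenue_dashboard": ["orders_clean"],
    "orders_clean": ["orders_raw", "customers_raw"],
    "orders_raw": [],
    "customers_raw": [],
}


def visible_lineage(dataset, readable):
    """Walk upstream from `dataset`, masking nodes the user may not read."""
    seen, stack, result = set(), [dataset], []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        result.append(node if node in readable else "<restricted>")
        stack.extend(UPSTREAM.get(node, []))
    return result


lineage = visible_lineage(
    "revenue_dashboard",
    readable={"revenue_dashboard", "orders_clean", "orders_raw"},
)
```

The user still sees that a restricted upstream source exists, which keeps the lineage graph complete without exposing the name of the sensitive dataset.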

Tagging and classification


With a federated authorization model, data stewards can define who in the company has the necessary insights to clean, tag, and classify data properly.

This opens up and democratizes classification while also maintaining governance controls.

Governance


Federation simplifies data governance by propagating policies defined by data stewards and owners across the organization.


How federated authentication (authn) and authorization (authz) streamline data security and governance

Let’s explore federated authentication and authorization in a federated data catalog in more detail.

Federated authentication (authn)


What is federated authentication?

Authentication (often abbreviated to authn) determines that a user is who they say they are.

In federated authentication, a user’s identity and authentication information is stored in a trusted identity provider (IdP). The IdP can be used to authenticate a user with other systems or services that support the same authentication protocol.

Let’s say that a site integrates with a federated authentication provider. Instead of handling authentication itself, the site will redirect the user to log in via the authentication provider’s interface. The authentication provider then returns a security token that represents a successful authentication.
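The token exchange can be illustrated with a deliberately simplified sketch. It assumes a hypothetical secret shared between the identity provider and the site and uses an HMAC signature; production systems typically rely on SAML assertions or signed JSON Web Tokens with asymmetric keys instead. The function names and the token layout are invented for this example.

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical key shared between the IdP and the site, for illustration only.
SHARED_KEY = b"demo-key-not-for-production"


def _sign(payload: bytes) -> str:
    return hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()


def idp_issue_token(username: str, ttl_seconds: int = 300) -> str:
    """IdP side: after checking credentials, return a signed token."""
    claims = {"sub": username, "exp": int(time.time()) + ttl_seconds}
    payload = base64.urlsafe_b64encode(json.dumps(claims).encode())
    return payload.decode() + "." + _sign(payload)


def site_verify_token(token: str):
    """Site side: return the claims if the token is authentic and unexpired, else None."""
    payload_b64, _, signature = token.partition(".")
    payload = payload_b64.encode()
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(_sign(payload).encode(), signature.encode()):
        return None
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if claims["exp"] < time.time():
        return None
    return claims


token = idp_issue_token("ada")
claims = site_verify_token(token)
```

The site never sees the user’s password: it only checks that the token was signed by a provider it trusts and has not expired.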

When do you need federated authentication?

Companies rely on federated authentication when they need to integrate different systems and tools from a large number of different vendors.

The federated approach enables a more distributed, decentralized approach to information management, which can speed up time to market and encourage innovation.

How can you implement federated authentication?

Identity providers can use several technologies to implement federated authentication. Some of the most widely used are Security Assertion Markup Language (SAML), OAuth 2.0, and OpenID Connect (OIDC), which adds an identity layer on top of OAuth 2.0.

However, instead of implementing these standards themselves, many companies integrate with an established identity provider such as Okta, or rely on a library such as Passport.js to handle the protocol details.

This approach leaves authentication in the hands of security experts and frees up your IT resources to focus on the domain-specific challenges relevant to your business.

Federated authorization (authz)


What is federated authorization?

Authorization (abbreviated to authz) is the natural complement to authn. It determines what rights an authenticated user has to what parts of a system.

In non-federated systems, authorization can end up being a nightmare. This is because every system makes its own policy decisions, often in proprietary ways that aren’t portable to other systems or toolsets.

With federated authorization, all tools rely on a common policy language to express authorization decisions around data and resources.

How can you implement federated authorization?

One of the most widely used policy engines for authorization is Open Policy Agent (OPA).

OPA uses the Rego language to evaluate policies against arbitrary structured JSON descriptions of resources. The output is a set of policy decisions, also represented as arbitrary structured JSON.

OPA is domain agnostic, i.e., services can define their preferred input and output formats for themselves.

More importantly, they can ship the Rego code that recognizes and produces these structures along with their service. This decouples policy decisions from the service code, enabling other services to interoperate easily with a service’s policy framework.

With federated authorization, services no longer need to roll their own authorization to interoperate with other services. All services, including a federated data catalog, can use a single policy language and framework, and consume other services’ policy frameworks as is with little additional coding.
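The "structured input in, structured decision out" pattern can be sketched in plain Python. This is a stand-in for illustration only: it is not Rego and does not call OPA, and the input fields (`user`, `action`, `resource`), the role name `pii-reader`, and the decision shape are all assumptions.

```python
import json


def catalog_policy(input_doc: dict) -> dict:
    """Hypothetical policy: JSON-like input in, JSON-like decision out."""
    user = input_doc["user"]
    resource = input_doc["resource"]
    action = input_doc["action"]

    reasons = []
    if resource.get("classification") == "pii" and "pii-reader" not in user["roles"]:
        reasons.append("PII requires the pii-reader role")
    if action == "edit" and user["name"] not in resource.get("stewards", []):
        reasons.append("only stewards may edit")

    # The decision is plain data, so any service can consume it.
    return {"allow": not reasons, "reasons": reasons}


request = {
    "user": {"name": "ada", "roles": ["analyst"]},
    "action": "read",
    "resource": {"name": "customers", "classification": "pii", "stewards": ["grace"]},
}
decision = catalog_policy(request)
print(json.dumps(decision))
```

Because the policy only reads and returns plain data, the calling service needs no knowledge of how the decision was reached, which is the decoupling described above.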


When should you use a federated data catalog?

Moving to a federated data catalog does involve some lift. It requires services that produce data to onboard to federated authentication and authorization models.

While that work is less intensive, and ultimately more time-saving, than every service rolling its own authn and authz, it still takes an investment of time and personnel.

So when do you move to a decentralized, federated data catalog instead of using a more centralized model with strict specifications for inbound data?

Investing in a federated data catalog makes the most sense when:

  • Your business is large and generates data from a large number of different toolsets: Moving everyone in a large organization to a single toolset is often impossible. With a federated data catalog, you need only establish a set of standards and protocols for federated authentication and authorization that every service follows.
  • You source tools from several vendors, each using its own method of access management: In this case, your federation standards become an interoperability requirement that vendors must meet to license their tools within your organization.

Federated vs. centralized data catalog: What’s the difference?


Depending on your company’s size, culture, and data maturity, you may opt for a federated or centralized data catalog.

Here are some factors to consider when weighing the decision.

Attribute | Federated (decentralized) data catalog | Centralized data catalog
Architecture | Decentralized data mesh architecture; divisions within a company can make their own service/toolset decisions | Centralized data fabric; specification of data formats and API integrations to which all organizations must adhere
Speed of integration of external data sources | Faster and more flexible, as it uses interoperable standards for authentication and authorization | Slower and more rigid, requiring adherence to a centralized set of standards
Data accessibility vs. quality | Emphasizes data accessibility and collaboration | Emphasizes data quality and governance controls

Summary

The future of data is decentralized and human-oriented. A modern data stack encourages data democratization, community support, and a non-bureaucratic, decentralized approach to data governance. That’s where a federated data catalog can help.

In this article, we explored how a federated data catalog can play an important role in data democratization.

Using federated authentication (authn) and federated authorization (authz), a federated data catalog can more easily support data management across a disparate set of tools and services.


