Federated Data Catalog: When Should You Go for One?
A federated data catalog is a natural extension of the data catalog model. It lets you control access to data using the access and control policies already defined at the data source.
Federated data catalogs are particularly useful for enterprises dealing with complex data environments.
In this article, we’ll cover the basics of a federated data catalog, when to use federation, and the advantages and disadvantages of centralized versus decentralized (federated) data catalogs.
Table of contents #
- What is a federated data catalog?
- What is federation?
- How federation affects the core functions of a data catalog
- How federated authentication (authn) and authorization (authz) streamline data security and governance
- When should you use a federated data catalog?
- Summary
- Related reads
What is a federated data catalog? #
A federated data catalog is a central portal for data discovery and collaboration that also enforces the standards for:
- Authentication: Validating a user’s identity
- Authorization: Determining what access rights should be granted
It uses a decentralized approach to data access management.
Before we delve into the specifics of a federated data catalog, let’s understand the concept of federation.
What is federation? #
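In security, federation refers to linking a user’s digital identity across multiple platforms, so that each platform relies on a trusted identity provider instead of keeping its own copy of the user’s credentials.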
Non-federated systems duplicate access credentials (like usernames and passwords) and access permissions across multiple software systems. This results in fragmentation and inconsistency, opening the door for data leaks.
In a federated identity system, sites and toolsets integrate with one or more identity providers. The identity provider validates the user’s identity and determines the scope of the user’s rights and permissions.
Every system that integrates with a given federated identity provider is part of the same federated domain. As a result, a user’s login credentials are valid across all systems that belong to that domain.
Federation vs. SSO: What’s the difference? #
Federation is similar to single sign-on (SSO), but the two are not synonymous.
In SSO, all sites within a company integrate with a single identity management system (for example, Microsoft’s Active Directory).
Federation takes this a step further by establishing a set of standards so that sites and toolsets can integrate with multiple identity providers. This enables flexibility in organizations that use disparate tools from several vendors.
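The difference can be sketched as the set of identity providers a service is willing to trust. This is a minimal illustration, not a real protocol implementation: the `Assertion` type, the issuer names, and the `accepts` helper are all hypothetical, and a real deployment would also verify a cryptographic signature on each assertion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """A simplified identity assertion returned by an identity provider."""
    issuer: str   # which identity provider vouches for the user
    subject: str  # the user's identifier at that provider


# Hypothetical issuer names; real systems would use the providers' issuer URLs.
SSO_TRUSTED_ISSUERS = {"corp-directory"}
FEDERATED_TRUSTED_ISSUERS = {"corp-directory", "partner-idp", "vendor-idp"}


def accepts(assertion: Assertion, trusted_issuers: set) -> bool:
    """Accept an assertion only if it comes from an issuer this service trusts."""
    return assertion.issuer in trusted_issuers


partner_user = Assertion(issuer="partner-idp", subject="dana@partner.example")
print(accepts(partner_user, SSO_TRUSTED_ISSUERS))        # False
print(accepts(partner_user, FEDERATED_TRUSTED_ISSUERS))  # True
```

In the SSO-style configuration only the company directory is trusted, so the partner's user is rejected. In the federated configuration the same service accepts identities from any provider that follows the agreed standard.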
How federation affects the core functions of a data catalog #
A fully federated data catalog impacts every service a data catalog has to offer, such as:
- Search
- Lineage
- Tagging and classification
- Governance
Let’s explore this further.
Search #
A federated data catalog provides a single interface for searching across your data ecosystem, regardless of where the data is sourced from or stored. However, what a user can find, and what they can do with it, depends on their access rights.
For example, a user may be able to see and read certain data, but not edit it once they find it.
Likewise, they may only be able to find certain sensitive data (for example, customers’ Personally Identifiable Information, or PII) if they have sufficient privileges.
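A permission-aware search of this kind might look like the following sketch. It covers only the visibility rule for sensitive data. The `Asset` and `User` types, the sample catalog entries, and the `can_view_sensitive` flag are assumptions made for illustration and do not describe any particular catalog product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    source: str
    sensitive: bool  # e.g. the asset contains PII


@dataclass(frozen=True)
class User:
    name: str
    can_view_sensitive: bool


# Hypothetical catalog entries drawn from three different source systems.
CATALOG = [
    Asset("orders", "warehouse", sensitive=False),
    Asset("customer_emails", "crm", sensitive=True),
    Asset("customer_segments", "lake", sensitive=False),
]


def search(user: User, term: str) -> list:
    """Return names of matching assets that the user is allowed to discover."""
    return [
        asset.name
        for asset in CATALOG
        if term in asset.name and (user.can_view_sensitive or not asset.sensitive)
    ]


analyst = User("analyst", can_view_sensitive=False)
steward = User("steward", can_view_sensitive=True)
print(search(analyst, "customer"))  # ['customer_segments']
print(search(steward, "customer"))  # ['customer_emails', 'customer_segments']
```

Both users run the same query against the same index. The difference in results comes entirely from the access check applied per asset.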
Lineage #
Data lineage tracks the flow of your data through your organization as it is created, consumed, and modified over time.
In a federated data catalog, distributed authorization controls who can access the lineage history. It also provides comprehensive tracking of who in the company made specific changes to a dataset.
Tagging and classification #
With a federated authorization model, data stewards can decide who in the company has the knowledge needed to clean, tag, and classify data properly, and grant those people the rights to do so.
This opens up and democratizes classification while also maintaining governance controls.
Governance #
Federation simplifies data governance by propagating policies defined by data stewards and owners across the organization.
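One way to picture this propagation is a policy that a steward defines once against a classification tag, which then applies to every tagged asset regardless of its source system. The sketch below assumes that tag-based model. The policy table, asset records, and role names are hypothetical.

```python
# One policy, defined once by a data steward, keyed on a classification tag.
POLICIES = {"pii": {"allowed_roles": {"privacy-officer", "data-steward"}}}

# Hypothetical assets from different source systems that share the same tag.
ASSETS = [
    {"name": "crm.contacts", "source": "crm", "tags": {"pii"}},
    {"name": "warehouse.customers", "source": "warehouse", "tags": {"pii"}},
    {"name": "warehouse.orders", "source": "warehouse", "tags": set()},
]


def can_read(role: str, asset: dict) -> bool:
    """An asset is readable if every policy attached through its tags allows the role."""
    return all(
        role in POLICIES[tag]["allowed_roles"]
        for tag in asset["tags"]
        if tag in POLICIES
    )


def readable_assets(role: str) -> list:
    """List the names of all assets the given role may read."""
    return [asset["name"] for asset in ASSETS if can_read(role, asset)]


print(readable_assets("analyst"))       # ['warehouse.orders']
print(readable_assets("data-steward"))  # all three assets
```

Changing the single `pii` entry changes the outcome for both the CRM and the warehouse assets at once, which is the effect the paragraph above describes.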
How federated authentication (authn) and authorization (authz) streamline data security and governance #
Let’s explore federated authentication and authorization in a federated data catalog in more detail.
Federated authentication (authn) #
What is federated authentication?
Authentication (often abbreviated to authn) determines that a user is who they say they are.
In federated authentication, a user’s identity and authentication information is stored in a trusted identity provider (IdP). The IdP can be used to authenticate a user with other systems or services that support the same authentication protocol.
Let’s say that a site integrates with a federated authentication provider. Instead of handling authentication itself, the site redirects the user to log in via the authentication provider’s interface. The authentication provider then returns a security token that represents a successful authentication, which the site verifies before granting access.
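The flow above can be sketched with two functions, one standing in for the identity provider and one for the relying site. This is a simplified illustration under stated assumptions: it uses a shared HMAC secret and a made-up token format, whereas real SAML or OpenID Connect deployments typically rely on asymmetric signatures and a vetted library. All names and the user directory are hypothetical.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

# Assumption: the identity provider and the site share this secret out of band.
SHARED_SECRET = b"demo-secret-not-for-production"

# Hypothetical user store held only by the identity provider.
IDP_DIRECTORY = {"dana": "correct horse"}


def idp_issue_token(username: str, password: str) -> Optional[str]:
    """Identity provider: check credentials and return a signed token, or None."""
    if IDP_DIRECTORY.get(username) != password:  # plain comparison, demo only
        return None
    payload = base64.urlsafe_b64encode(json.dumps({"sub": username}).encode())
    signature = hmac.new(SHARED_SECRET, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{signature}"


def site_verify_token(token: str) -> Optional[str]:
    """Relying site: verify the signature and return the username, or None."""
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(SHARED_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))["sub"]


token = idp_issue_token("dana", "correct horse")
print(site_verify_token(token))        # dana
print(site_verify_token(token + "0"))  # None, because the signature no longer matches
```

The site never sees the password. It only checks that the token was signed by a provider it trusts.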
When do you need federated authentication?
Companies rely on federated authentication when they need to integrate different systems and tools from a large number of different vendors.
The federated approach enables a more distributed, decentralized approach to information management, which can speed up time to market and encourage innovation.
How can you implement federated authentication?
Identity providers can use several technologies to implement federated authentication. Commonly used standards include Security Assertion Markup Language (SAML), OAuth, and OpenID Connect.
However, instead of implementing these standards themselves, many companies integrate with an established identity provider such as Okta, often through an authentication library such as Passport.js.
This approach leaves authentication in the hands of security experts and frees up your IT resources to focus on the domain-specific challenges relevant to your business.
Federated authorization (authz) #
What is federated authorization?
Authorization (abbreviated to authz) is the natural complement to authn. It determines what rights an authenticated user has to which parts of a system.
In non-federated systems, authorization can end up being a nightmare. This is because every system makes its own policy decisions, often in proprietary ways that aren’t portable to other systems or toolsets.
With federated authorization, all tools rely on a common policy language to express authorization decisions around data and resources.
How can you implement federated authorization?
One of the most common tools for expressing authorization policy is Open Policy Agent (OPA), an open-source policy engine.
OPA uses the Rego policy language to evaluate arbitrary structured JSON descriptions of resources. The output is a set of policy decisions, also represented as arbitrary structured JSON.
OPA is domain agnostic, i.e., services can define their preferred input and output formats for themselves.
More importantly, they can ship the Rego code that recognizes and produces these structures along with their service. This decouples policy decisions from the service code, enabling other services to interoperate easily with a service’s policy framework.
With federated authorization, services no longer need to roll their own authorization to interoperate with other services. All services, including a federated data catalog, can use a single policy language and framework, and consume other services’ policy frameworks as is with little additional coding.
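The "structured JSON in, structured JSON out" pattern can be illustrated in plain Python. This sketch does not use Rego and does not call OPA. The `catalog_policy` function merely stands in for a policy module that a service might ship, and the input fields, role names, and decision format are assumptions chosen for the example.

```python
import json


def catalog_policy(input_doc: dict) -> dict:
    """Hypothetical policy: takes a JSON-like input and returns a JSON-like decision."""
    user = input_doc["user"]
    resource = input_doc["resource"]
    action = input_doc["action"]

    reasons = []
    if action == "edit" and "editor" not in user["roles"]:
        reasons.append("edit requires the editor role")
    if resource.get("classification") == "pii" and "pii-reader" not in user["roles"]:
        reasons.append("pii requires the pii-reader role")

    # The decision is plain data, so any service can consume it.
    return {"allow": not reasons, "reasons": reasons}


request = json.loads("""
{
  "user": {"id": "dana", "roles": ["editor"]},
  "resource": {"name": "crm.contacts", "classification": "pii"},
  "action": "edit"
}
""")

decision = catalog_policy(request)
print(json.dumps(decision))
# {"allow": false, "reasons": ["pii requires the pii-reader role"]}
```

Because the calling service only builds the input document and reads the decision, the policy logic can change, or be replaced by an actual policy engine, without touching the service code.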
When should you use a federated data catalog? #
Moving to a federated data catalog does involve some lift. It requires services that produce data to onboard to federated authentication and authorization models.
That work is less intensive, and ultimately saves more time, than having every service roll its own authn and authz. Even so, it takes an investment of time and personnel.
So when do you move to a decentralized, federated data catalog instead of using a more centralized model with strict specifications for inbound data?
Investing in a federated data catalog makes the most sense when:
- Your business is large and generates data from a large number of different toolsets: Moving everyone in a large organization to a single toolset is often impossible. With a federated data catalog, you need only establish a set of standards and protocols for federated authentication and authorization that every service follows.
- You source tools from several vendors, with each vendor using its own method of access management: In this case, your federation standards become an interoperability requirement that vendors must meet to license their tools within your organization.
Federated vs. centralized data catalog: What’s the difference? #
Depending on your company’s size, culture, and data maturity, you may opt for a federated or centralized data catalog.
Here are some factors to consider when weighing the decision.
| Attribute | Federated (decentralized) data catalog | Centralized data catalog |
|---|---|---|
| Architecture | Decentralized data mesh architecture; divisions within a company can make their own service/toolset decisions | Centralized data fabric; specification of data formats and API integrations to which all organizations must adhere |
| Speed of integration of external data sources | Faster and more flexible, as it uses interoperable standards for authentication and authorization | Slower and more rigid, requiring adherence to a centralized set of standards |
| Data accessibility vs. quality | Emphasizes data accessibility and collaboration | Emphasizes data quality and governance controls |
Summary #
The future of data is decentralized and human-oriented. A modern data stack encourages data democratization, community support, and a non-bureaucratic, decentralized approach to data governance. That’s where a federated data catalog can help.
In this article, we explored how a federated data catalog can play an important role in data democratization.
Using federated authentication (authn) and federated authorization (authz), a federated data catalog makes it easier to manage data across a disparate set of tools and services.
Related reads #
- Modern Data Stack: Components, Architecture, and Tools
- What is a Data Catalog? And Why Do You Need One?
- Enterprise data catalog: Definition, Importance & benefits
- Data catalog benefits: 5 key reasons why you need one
- Open Source Data Catalog Software: 5 Popular Tools to Consider in 2023
- Data Catalog Platform: The Key To Future-Proofing Your Data Stack
- Top Data Catalog Use Cases Intrinsic to Data-Led Enterprises
- AWS Glue Data Catalog: Architecture, Components, and Crawlers
- Airbnb Data Catalog — Democratizing Data With Dataportal
- Lexikon: Spotify’s Efficient Solution For Data Discovery And What You Can Learn From It
- Google Cloud Data Catalog Guide - Everything You Need to Know
- Data Mesh vs. Data Fabric: How do you choose the best approach for your business needs?
- Data Governance Framework — Examples, Templates, Standards, Best Practices & How to Create a Data Governance Framework?