Cloud-Based Data Catalog: Benefits, Options, Challenges, and Best Practices

Last Updated on: June 09th, 2023, Published on: June 09th, 2023

More and more workloads are moving to the cloud. This includes data catalogs.

So should you switch to a cloud-based data catalog as well? This article will explore what that means, the benefits, your options, and factors you’ll need to consider along the way.


Table of contents

  1. What is a cloud-based data catalog?
  2. Benefits of a cloud-based data catalog
  3. Different types of cloud-based data catalog hosting
  4. Challenges with a cloud-based data catalog
  5. Best practices for a cloud-based data catalog
  6. Should you use a cloud-based data catalog?
  7. Conclusion
  8. Cloud-based data catalog: Related reads

What is a cloud-based data catalog?

A cloud-based data catalog is a data catalog that is deployed to a cloud provider such as Amazon Web Services (AWS) or Azure, or hosted as a service on your behalf by a data catalog company such as Atlan.

A cloud-based deployment lets you run your data catalog alongside other cloud workloads you may already be running.

For example, if your data lake or data warehouse runs on AWS, with data stored in services such as Amazon S3 or Amazon Redshift, you may also opt to run your data catalog in one of your AWS accounts.


Benefits of a cloud-based data catalog

Business is moving to the cloud in a big way. Gartner estimates that by 2025, 98% of all enterprise workloads will be in the cloud. So it seems natural to ask whether your data catalog should also run there.

There are numerous benefits of hosting a cloud-based data catalog:

  • Co-locate your data catalog and your data
  • Lower your management overhead
  • Ensure better-distributed access
  • Lower your up-front investment
  • Get better security

Let’s explore each benefit further.

Co-locate your data catalog and your data


With 98% of all workloads heading to the cloud, it makes sense for your data catalog to be there as well.

By co-locating your data catalog and data in the same infrastructure, you can more easily apply a single management and security model to all your data-related resources.

Plus, you can further streamline integrating your data catalog with other cloud-based data services.

Lower your management overhead


A cloud provider or data catalog service provider will usually provide features that handle basic management tasks, such as automating backups or scaling your application to meet demand.

This reduces engineering and system maintenance effort, which both frees up company resources and lowers costs.

Ensure better-distributed access


If your data catalog is hosted in your data center, you’ll need secure methods for remote access. Employees working from home and partners will require secured tunneling access to your on-premises data center, such as a Virtual Private Network (VPN) connection.

By contrast, since the cloud is by definition connected to the Internet, making your cloud-based data catalog accessible by remote employees, partners, and others is a much lighter lift.

Lower your up-front investment


Investing in on-premises data centers is a huge capital expense (CapEx). These investments tie up money and lock you into an inflexible technology strategy.

CapEx investments in computing also require knowing your capacity needs up front. If you buy too little computing power, your application may be unavailable under high load. If you buy too much, you have capacity that’s sitting idle.

By contrast, cloud expenses are operating expenses (OpEx). They work like a utility: you pay every month only for the computing resources you use.

You can scale out and scale in the computing resources you dedicate to your data catalog dynamically to ensure it’s always available at the lowest possible cost.

[Figure: Estimating CapEx vs. OpEx for data. Source: BMC.]
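To make the CapEx vs. OpEx trade-off concrete, here is a minimal Python sketch that compares an amortized hardware purchase with pay-as-you-go cloud usage. Every figure in it (purchase price, upkeep, hourly rate, hours used) is a made-up assumption for illustration, not real pricing from any provider, and the function names are hypothetical.

```python
def on_prem_monthly_cost(capex: float, lifetime_months: int, monthly_upkeep: float) -> float:
    """Spread an up-front hardware purchase evenly over its useful life."""
    return capex / lifetime_months + monthly_upkeep


def cloud_monthly_cost(hours_used: float, hourly_rate: float) -> float:
    """Pay-as-you-go: the bill follows the hours actually consumed."""
    return hours_used * hourly_rate


# Hypothetical inputs: a 36,000 server written off over 36 months,
# plus 200 per month for power, space, and maintenance.
on_prem = on_prem_monthly_cost(capex=36_000, lifetime_months=36, monthly_upkeep=200)

# Hypothetical cloud rate of 0.50 per hour, in a busy month and a quiet one.
cloud_busy = cloud_monthly_cost(hours_used=720, hourly_rate=0.50)
cloud_quiet = cloud_monthly_cost(hours_used=200, hourly_rate=0.50)

print(f"On-premises: {on_prem:.2f} per month, regardless of load")
print(f"Cloud (busy month): {cloud_busy:.2f}")
print(f"Cloud (quiet month): {cloud_quiet:.2f}")
```

The point of the sketch is the shape of the costs rather than the totals: the on-premises figure stays fixed whether or not the capacity is used, while the cloud figure moves with usage.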

Get better security


Some companies are wary of moving to the cloud due to security fears. But done correctly, cloud-based hosting can be more secure than on-premises hosting.

With on-premises applications, it’s often up to your development team to roll out its own security solutions. With cloud hosting, you can take advantage of your provider’s built-in security features, such as network access lists, role-based access controls, and authorization/authentication services.

Leveraging cloud security saves your team countless hours of application development time. It also gives you access to security features that have been tested and improved over multiple years across thousands of different customers.


Different types of cloud-based data catalog hosting

Deciding to deploy a cloud-based data catalog is only the first step. You also have a number of choices in how to deploy to the cloud.

Managed (Software as a Service)


In a managed deployment, either your cloud provider or your data catalog provider deploys the catalog on your behalf. This is also called a Software as a Service (SaaS) model.

With a managed deployment, your provider will handle common computing tasks such as backups, scaling, and resource allocation for you. This usually means a faster setup time and requires less ongoing IT support.

The drawback of a managed deployment is that it’s relatively expensive. You pay a premium for your provider to host the service. Your IT team may also feel it has less control over how the application is deployed and what resources are devoted to it.

Self-deployed


In the self-deployed model, your engineering team deploys all of the cloud-based infrastructure - virtual networks, virtual servers, storage - that the data catalog needs. It then installs the data catalog itself on top of this infrastructure.

A self-deployed cloud-based data catalog gives your team more control. It also generally leads to a lower cloud bill.

The drawbacks? A self-deployed data catalog requires a lot more engineering effort to keep the system running smoothly, thus adding to costs. That level of effort may be out of reach for some companies, or not possible if IT resources are already stretched thin.

Hybrid deployment


Depending on which data catalog you use, you can also opt for a hybrid deployment. In a hybrid deployment, the data catalog application itself is self-hosted, but the data it stores and manages lives in a managed service, such as Amazon RDS or S3.

A hybrid deployment can give your team greater operational control at a lower cost while still taking advantage of cloud-based features to simplify the management of your data tier. However, not every data catalog provider necessarily supports such a clean separation of concerns.


Challenges with a cloud-based data catalog

A cloud-based data catalog has a number of benefits. But there are also some pitfalls in terms of security and compliance of which you need to be aware.

Security


We said earlier that a cloud-based data catalog can provide better security. The key word there is can. Ensuring security on the cloud requires a proactive effort by your engineering team.

Even large, experienced companies with sprawling IT departments have experienced embarrassing security breaches brought about by cloud negligence.

Take Verizon. In 2017, a third-party vendor erroneously exposed data in an Amazon S3 bucket, potentially revealing customers’ Personally Identifiable Information (PII) to intruders and snoopers.

That doesn’t mean cloud security can’t be done well. The non-profit Financial Industry Regulatory Authority (FINRA) handles highly sensitive financial data workloads on AWS, for example.

With careful planning and regular security reviews, any cloud-based deployment can be as secure as, if not more secure than, its equivalent on-premises deployment.

Compliance


A huge challenge with handling large volumes of data in the cloud is compliance with local data protection regulations. Some countries have laws demanding that all data related to their citizens be located physically within their borders. Many other countries forbid certain types of data - e.g., health care data - from leaving the confines of the nation.

The penalties for non-compliance can be staggering. In 2023, the Irish Data Protection Commission fined Meta, Facebook’s parent company, €1.2 billion (about USD $1.3 billion) for unlawfully transferring EU users’ personal data from the European Union to the United States.

This doesn’t mean you can’t use a cloud-based data catalog if you handle regulated data.

Many cloud providers and data catalog service providers support running workloads on infrastructure that’s physically located within a given country or region. However, you’ll need to think carefully about what data gets stored where, and how many instances of your data catalog you’ll need to run in what regions to ensure compliance.


Best practices for a cloud-based data catalog

Deploying a cloud-based data catalog can seem daunting. But by following a few best practices, you can ensure a smooth, successful launch. These include:

  • Using encryption and zero-trust security
  • Planning for compliance
  • Anonymizing your data
  • Calculating and controlling costs
  • Mitigating migration issues

Let’s delve into the specifics of each best practice further.

Using encryption and zero-trust security


For self-deployments, ensure you’re encrypting all data both at rest and in transit, and that in-transit encryption is end-to-end.
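As a rough illustration of the in-transit half, a client in a self-deployed setup could refuse old TLS versions and insist on certificate verification using Python's standard `ssl` module. This is a minimal sketch rather than a complete security configuration; the function name is hypothetical, and encryption at rest is normally switched on in the storage service itself.

```python
import ssl


def build_strict_tls_context() -> ssl.SSLContext:
    """Return a client-side TLS context with verification on and old protocols off."""
    # create_default_context() turns on certificate and hostname verification.
    context = ssl.create_default_context()
    # Refuse anything older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = build_strict_tls_context()
print(context.minimum_version, context.verify_mode, context.check_hostname)
```

A context like this can be passed to standard-library clients such as `urllib.request.urlopen(url, context=context)`.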

For any cloud deployment, always employ a zero-trust security approach when assigning roles and access rights.

So, restrict administrator roles to a few key individuals. For other roles, use the principle of least privilege and assign only the permissions they absolutely require to get their jobs done.
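The least-privilege idea can be sketched in a few lines of Python. The role names and permissions below are hypothetical and are not taken from any particular data catalog; a real deployment would normally rely on the provider's own access control features rather than hand-written checks.

```python
# Hypothetical roles: each one gets only the actions it needs.
ROLE_PERMISSIONS = {
    "admin": frozenset({"read", "edit", "manage_users"}),
    "steward": frozenset({"read", "edit"}),
    "analyst": frozenset({"read"}),
}


def is_allowed(role: str, action: str) -> bool:
    """Deny by default: unknown roles and unlisted actions are rejected."""
    return action in ROLE_PERMISSIONS.get(role, frozenset())


print(is_allowed("analyst", "read"))   # analysts may read
print(is_allowed("analyst", "edit"))   # but not edit
print(is_allowed("guest", "read"))     # unknown roles get nothing
```

The design choice that matters here is the default: anything not explicitly granted is refused.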

Planning for compliance


Plan out what data you’ll need to store and where. Work with your solution provider or with experts from your cloud provider (such as solutions architects) to understand what deployments you’ll need in which localities to remain compliant with local data regulations.
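One way to capture the outcome of that planning is a simple lookup from each country whose data you hold to the cloud region where that data must live. The Python sketch below assumes a hypothetical mapping; the region identifiers follow AWS naming, but which country maps to which region is an illustrative assumption, not a statement of what any regulation requires.

```python
# Hypothetical residency plan: country code -> cloud region for that country's data.
RESIDENCY_PLAN = {
    "DE": "eu-central-1",
    "IE": "eu-west-1",
    "US": "us-east-1",
}


def deployment_region(country_code: str) -> str:
    """Look up the planned region, failing loudly when no decision has been made."""
    try:
        return RESIDENCY_PLAN[country_code.upper()]
    except KeyError:
        raise ValueError(f"No residency decision recorded for {country_code!r}") from None


print(deployment_region("de"))
```

Raising an error for unmapped countries, instead of falling back to a default region, keeps data from quietly landing somewhere it has not been cleared to go.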

Anonymizing your data


Wherever possible, avoid storing sensitive information, such as PII, in your cloud-based data catalog. Use data obfuscation or data masking to hide or redact sensitive values, and ensure the originals are stored only in highly secured systems with limited access.
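Two common masking techniques, redaction and keyed pseudonymization, can be sketched with Python's standard library. The email pattern is deliberately simple and will not match every valid address, and the function names and sample key are hypothetical. Note that pseudonymized values can still count as personal data under some regulations, so treat this as an illustration only.

```python
import hashlib
import hmac
import re

# Simplified pattern for illustration; real email matching is more involved.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a fixed placeholder."""
    return EMAIL_PATTERN.sub("[REDACTED_EMAIL]", text)


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a stable keyed token (HMAC-SHA256) so records can be joined without the raw value."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


description = redact_emails("Contact jane.doe@example.com for access")
token = pseudonymize("jane.doe@example.com", secret_key=b"hypothetical-key")
print(description)
print(token)
```

Redaction suits free-text fields such as descriptions, while a keyed token suits identifiers that still need to match across tables. The key itself belongs in one of the limited-access systems, not in the catalog.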

Calculating and controlling costs


Cloud services can sometimes contain “hidden” costs that aren’t obvious at the start. For example, many cloud providers charge per gigabyte for any data transferred out of their service to the Internet, or even to another cloud region.

During your planning phase, work with your cloud or service provider to understand your costs based on your expected user base, storage needs, and data transfer requirements.

For self-hosted deployments, use resources such as the AWS Pricing Calculator to create ballpark estimates. Don’t forget to account for the growth of your data catalog needs over time.

Look at special deals your cloud provider offers to help control costs. For example, you can earn steep discounts on virtual servers on all major cloud providers with a 1-year or 3-year commitment.

Monitor your infrastructure regularly for unused capacity and eliminate it. If you’re self-hosting your data catalog, set up billing alerts on your cloud provider to keep a cap on costs. Most cloud platforms support sending an alert if your costs along a particular vector (compute, storage, data transfer, etc.) exceed a set amount.
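The logic behind a data transfer estimate and a per-category billing alert is simple enough to sketch in Python. The per-gigabyte price, monthly costs, and limits below are made-up assumptions, and the function names are hypothetical. In practice you would configure alerts in your cloud provider's billing tools rather than code them by hand.

```python
from typing import List, Mapping


def egress_cost(gb_out: float, price_per_gb: float) -> float:
    """Estimate the charge for data transferred out of the provider."""
    return gb_out * price_per_gb


def categories_over_limit(costs: Mapping[str, float], limits: Mapping[str, float]) -> List[str]:
    """Return the cost categories whose spend exceeds the configured limit."""
    return [name for name, spent in costs.items() if name in limits and spent > limits[name]]


# Hypothetical month: 500 GB transferred out at 0.09 per GB.
monthly_costs = {
    "compute": 900.0,
    "storage": 120.0,
    "data_transfer": egress_cost(gb_out=500, price_per_gb=0.09),
}
monthly_limits = {"compute": 1000.0, "storage": 150.0, "data_transfer": 40.0}

print(categories_over_limit(monthly_costs, monthly_limits))
```

Tracking each category separately makes it easier to spot a "hidden" cost such as data transfer growing while the overall bill still looks reasonable.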

Mitigating migration issues


Are you already running your data catalog on-premises? If so, you’ll need a migration plan to ensure a smooth cutover with minimal downtime for business users.


Should you use a cloud-based data catalog?

There’s no such thing as a one-size-fits-all solution. There are many reasons you may want to adopt a cloud, on-premises, or even a hybrid solution for your data catalog needs. Use our decision-making guide to reason out which solution works best for you.

That said, there is a clear trend toward cloud deployments. As more of your company’s workloads move to the cloud, it will likely make sense at some point for your data catalog to move there too.

Conclusion

Moving your data catalog to the cloud can decrease costs, simplify management, and enable data democratization. With adequate planning, you can deploy a cloud-based data catalog that is reliable, scalable, and secure.


