Using and Abusing Auto Loader's Inferred Schema
Problem: Databricks' Auto Loader can infer a schema from a sample of files. This means that you don’t have to provide a schema, which is really handy when you’re dealing with an unknown schema or a wide and complex schema that you don’t always want to define up front. But what happens if the inferred schema isn’t the schema you were expecting, or it contains fields which you definitely don’t want to ingest - like PCI or PII data fields?
Using Auto Loader on Azure Databricks with AWS S3
Problem: Recently on a client project, we wanted to use the Auto Loader functionality in Databricks to easily consume data from AWS S3 into our Azure-hosted data platform. We opted for Auto Loader over any other solution because it exists natively within Databricks and allows us to quickly ingest data from Azure Storage Accounts and AWS S3 buckets, while using Structured Streaming checkpoints to track which files it has already loaded.
Data Product Fictional Case Study: Retail
Background: In a previous post, we explored what the data domains could look like for our fictional retailer - XclusiV. In this post, we will explore how the data products could work in this fictional case study, including how pure data consumers would handle the data - particularly those consumers who have a holistic view of an organisation (also a group of consumers for whom a traditional analytical model is perfect).
Data Domain Fictional Case Study: Retail
In previous posts we’ve covered what Data Mesh is and gone into greater detail on its principles. In this next series of posts I want to use a fictional case study to explore how the underlying principles could work in practice. This post will introduce the fictitious company, the challenges it faces, and how the principle of decentralised data ownership and architecture, with domain alignment, would work. Fictitious Company: XclusiV. XclusiV is a luxury retailer operating in multiple countries.
Databricks Labs: Data Generator
Databricks recently released the public preview of a Data Generator for use within Databricks to generate synthetic data. This is particularly exciting as the Information Security manager at a client recently requested synthetic data to be generated for use in all non-production environments as a feature of a platform I’ve been designing for them. The Product Owner decided at the time that it was too costly to implement any time soon, but this release from Databricks makes the requirement for synthetic data much easier and quicker to realise and deliver.
Data Mesh Deep Dive
In a previous post, we laid down the foundational principles of a Data Mesh, and touched on some of the problems we have with the current analytical architectures. In this post, I will go deeper into the underlying principles of Data Mesh, particularly why we need an architecture paradigm like Data Mesh. Let’s start with why we need a paradigm like Data Mesh. Why do we need Data Mesh? In my previous post, I made the bold claim that analytical architectures hadn’t fundamentally progressed since the inception of the Data Warehouse in the 1990s.
What is Data Mesh?
To properly describe what Data Mesh is, we need to establish which analytical generation we are currently in, mostly so that we can describe what it is not. Analytical Generations: The first generation of analytics is the humble Data Warehouse, which has existed since the 1990s. While mature and well known, it is not always implemented correctly, and even the purest of implementations comes under the strain of creaking and complex ETLs as it has struggled to scale with the increased volume of data and demand from consumers.
Introduction to Data Lakes
Data Lakes are the new hot topic in the big data and BI communities. Data Lakes have been around for a few years now, but have only gained popular notice within the last year. In this blog I will take you through the concept of a Data Lake, so that you can begin your own voyage on the lakes. What is a Data Lake? Before we can answer this question, it’s worth reflecting on a concept which most of us know and love – Data Warehouses.
Forecasting: Methods and Principles
What comes to your mind when you hear the words forecasting, forecasts, etc.? Invariably, you’ll think of weather forecasts. But forecasts are much more than that. Forecasting is the process of making predictions of the future based on past and present data and analysis of trends. It’s a process that has existed for millennia, though often with dubious methodologies… Instead of looking into a crystal ball to predict the future, we are going to employ the power of statistics!
Currency Conversion in SQL using Triangulation Arbitrage
Some systems use a single currency, e.g. USD, as a base into which local currencies are converted, which is something that I noticed recently when working with IBM Cognos Controller. But what if you want or need to rebase into another currency but still retain the original base? This doesn’t appear to be easy to achieve within Cognos Controller itself, but it is achievable within SQL and a wider ETL framework.
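The idea of rebasing through a single base currency can be sketched in a few lines. This is only an illustration in Python, not the SQL approach from the post itself: the rate table, the rate values and the function names `convert` and `rebase` are all hypothetical, and the rates are assumed to be quoted as units of the base currency (USD) per one unit of each currency.

```python
from decimal import Decimal

# Hypothetical rates, assumed to be quoted as USD per one unit of each currency.
USD_PER_UNIT = {
    "USD": Decimal("1"),
    "GBP": Decimal("1.25"),
    "EUR": Decimal("1.10"),
}


def convert(amount, source, target, rates=USD_PER_UNIT):
    """Convert by triangulating through the base: source -> USD -> target."""
    in_base = amount * rates[source]   # local amount expressed in USD
    return in_base / rates[target]     # USD amount expressed in the target


def rebase(amount, source, new_base, rates=USD_PER_UNIT):
    """Return the amount in both the original base (USD) and the new base."""
    return {
        "USD": convert(amount, source, "USD", rates),
        new_base: convert(amount, source, new_base, rates),
    }


if __name__ == "__main__":
    # 110 EUR expressed in USD (original base) and GBP (new base).
    print(rebase(Decimal("110"), "EUR", "GBP"))
```

Because every rate is held against the same base, no direct EUR-to-GBP rate is needed, and the original USD value is retained alongside the rebased one.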