Ust Does Tech
Scheduling Databricks Cluster Uptime

Problem: Interactive and SQL Warehouse (formerly known as SQL Endpoint) clusters take time to become active. This can range from around 5 minutes to almost 10 minutes. For some workloads and users, this wait is frustrating, if not unacceptable. In this use case, we had streaming clusters that needed to be available when streams started at 07:00 and turned off when streams stopped being sent at 21:00.

July 28, 2022
Using and Abusing Auto Loader's Inferred Schema

Problem: Databricks' Auto Loader can infer a schema from a sample of files. This means that you don’t have to provide a schema, which is really handy when you’re dealing with an unknown schema, or a wide and complex one that you don’t always want to define up front. But what happens if the inferred schema isn’t the one you were expecting, or it contains fields that you definitely don’t want to ingest, like PCI or PII data fields?

October 28, 2021
Using Auto Loader on Azure Databricks with AWS S3

Problem: Recently, on a client project, we wanted to use the Auto Loader functionality in Databricks to easily consume data from AWS S3 into our Azure-hosted data platform. We opted for Auto Loader over any other solution because it exists natively within Databricks and allows us to quickly ingest data from Azure Storage Accounts and AWS S3 buckets, while using Structured Streaming checkpoints to track which files it last loaded.

October 18, 2021
Databricks Labs: Data Generator

Databricks recently released the public preview of a Data Generator for use within Databricks to generate synthetic data. This is particularly exciting because the Information Security manager at a client recently asked for synthetic data to be generated for use in all non-production environments, as a feature of a platform I’ve been designing for them. The Product Owner decided at the time that it was too costly to implement any time soon, but this release from Databricks makes the requirement for synthetic data much easier and quicker to deliver.

August 9, 2021
© 2021-2022 Copyright.