Scheduling Databricks Cluster Uptime
Problem Interactive and SQL Warehouse (formerly known as SQL Endpoint) clusters take time to become active. This can range from around 5 mins through to almost 10 mins. For some workloads and users, this waiting time can be frustrating if not unacceptable. For this use case, we had streaming clusters that needed to be available for when streams started at 07:00 and to be turned off when streams stopped being sent at 21:00.
Using and Abusing Auto Loader's Inferred Schema
Problem Databricks' Auto Loader has the ability to infer a schema from a sample of files. This means that you don’t have to provide a schema, which is really handy when you’re dealing with an unknown schema or a wide and complex schema, which you don’t always want to define up-front. But what happens if the schema that has been inferred isn’t the schema you were expecting or it contains fields which you definitely don’t want to ingest - like PCI or PII data fields?
Using Auto Loader on Azure Databricks with AWS S3
Problem Recently on a client project, we wanted to use the Auto Loader functionality in Databricks to easily consume from AWS S3 into our Azure hosted data platform. The reason why we opted for Auto Loader over any other solution is because it natively exists within Databricks and allows us to quickly ingest data from Azure Storage Accounts and AWS S3 Buckets, while using the benefits of Structured Streaming to checkpoint which files it last loaded.
Databricks Labs: Data Generator
Databricks recently released the public preview of a Data Generator for use within Databricks to generate synthetic data. This is particularly exciting as the Information Security manager at a client recently requested synthetic data to be generated for use in all non-production environments as a feature of a platform I’ve been designing for them. The Product Owner decided at the time that it was too costly to implement any time soon, but this release from Databricks makes the requirement for synthetic data much easier and quicker to realise and deliver.