Some Azure Databricks Best Practices
Starting with the Azure Databricks reference architecture diagram:
Infrastructure Management Best Practices:
Azure infrastructure can be provisioned under a top-level Tenant ID, the organisation identifier (<organisation>@microsoft.com), followed by Subscriptions, which at a broader level might separate Prod and Non-Prod environments, or might be demarcated at the Line of Business/account owner level, followed by Resource Groups drilling down to a specific project-level scope. Databricks creates its own managed resource group, and we may not have access to the storage account created within that Azure Databricks managed resource group.
So we need to plan the above hierarchy to get the best out of the Azure Databricks service. Some pointers below:
1) Assign workspaces based on related groups of people working collaboratively.
2) Streamline your access control matrix within your workspace (folders, notebooks, etc.) and across all the resources that the workspace interacts with.
3) The above division scheme is also known as the Business Unit Subscription design pattern, and it aligns well with the Databricks chargeback model (figure below).
Workspace and Subscription Limits
Customers typically partition workspaces by team or department, and it is very important to do so keeping the Azure subscription and ADB workspace limits in mind.
Networking Concepts
A VNet is the virtual network boundary associated with your Azure resources, so it is advisable to select the largest VNet CIDR your capacity planning allows (this only applies if you use the bring-your-own-VNet feature):
1) Each cluster node requires 1 public and 2 private IPs. These IPs are logically grouped into 2 subnets named "public" and "private".
2) For a desired cluster size of X, the number of public IPs is X and the number of private IPs is 4X (double the 2X you might expect, because extra headroom is needed while clusters are provisioned and updated).
3) So, indirectly, the sizes of the public and private subnets determine the total number of virtual machines available for clusters (a worked sketch follows this list).
4) The allowed values for the enclosing VNet CIDR are from /16 through /24.
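To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. It is conservative and based on assumptions of ours: it treats all 4X private IPs as coming out of the "private" subnet and subtracts Azure's 5 reserved addresses per subnet; the function name is purely illustrative.

```python
# Conservative estimate: all 4X private IPs drawn from the private subnet,
# minus Azure's 5 reserved addresses per subnet.
def max_cluster_nodes(subnet_prefix: int) -> int:
    """Estimate the largest cluster a private subnet of the given prefix supports."""
    usable_ips = 2 ** (32 - subnet_prefix) - 5  # Azure reserves 5 IPs per subnet
    return usable_ips // 4                      # 4 private IPs per desired node

print(max_cluster_nodes(22))  # /22 subnet -> 1019 usable IPs -> ~254 nodes
```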
Storage and Secure Access Best Practices
1) Never store any production data in the default DBFS folders, as the lifecycle of the default DBFS is tied to the workspace: deleting the workspace deletes the default DBFS, and your production data will be gone forever.
2) Customers cannot restrict access to this default folder and its contents.
The above recommendations do not apply to ADLS or Blob storage explicitly mounted as DBFS by the end user (a minimal sketch of the difference follows).
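A minimal sketch of the difference; "spark" is predefined in Databricks notebooks and all paths are placeholders:

```python
# Keep production writes off the workspace's default DBFS root.
df = spark.range(10)  # stand-in for a real production DataFrame

# Anti-pattern: the default DBFS root is deleted together with the workspace.
# df.write.format("delta").save("/tmp/prod/orders")

# Pattern: write to storage you own, explicitly mounted under /mnt.
df.write.format("delta").save("/mnt/prod-data/orders")
```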
3) Always store secrets in a Key Vault: use Azure Key Vault-backed secret scopes to hold your secrets. To create a scope, go to:
https://<your_azure_databricks_url>#secrets/createScope
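Once a Key Vault-backed scope exists, notebooks can read secrets at runtime instead of hard-coding them. A minimal sketch, assuming a scope named "kv-scope" and a key named "storage-account-key" (both placeholders); "dbutils" is predefined in Databricks notebooks:

```python
# Fetch a secret at runtime; never hard-code credentials in notebooks.
storage_key = dbutils.secrets.get(scope="kv-scope", key="storage-account-key")

# Secret values are redacted if printed in a notebook; list() shows only metadata.
print(dbutils.secrets.list("kv-scope"))
```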
4) Configure secure access to ADLS Gen2, applicable to a whole group of people across any cluster:
i) Service Principal authentication: if you want to give a group of users access to a particular folder and its contents, scope the Service Principal authentication to:
a) Workspace: mount a folder for all clusters to access.
b) Cluster: set it in the cluster configuration.
c) Session: configure the Spark session conf with the SP (see the sketch after this list).
ii) AAD passthrough authentication: use this if you wish to enforce fine-grained ADLS file- and folder-level ACLs for each user individually.
iii) Databricks Table ACLs: for BI scenarios you can set access controls on tables and views.
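A minimal sketch of option (c), session-scoped Service Principal authentication for ADLS Gen2, using the standard ABFS OAuth settings; the application ID, tenant ID, secret scope, container, and account names are all placeholders:

```python
# Session-scoped Service Principal (SP) access to ADLS Gen2.
# All <angle-bracket> values and the "kv-scope" secret scope are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="kv-scope", key="sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
for key, value in configs.items():
    spark.conf.set(key, value)   # applies only to this Spark session

df = spark.read.load("abfss://<container>@<account>.dfs.core.windows.net/path/")

# Option (a), workspace scope, uses the same configs with a mount instead:
# dbutils.fs.mount(
#     source="abfss://<container>@<account>.dfs.core.windows.net/",
#     mount_point="/mnt/shared-data",
#     extra_configs=configs)
```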
Cluster Configuration Best Practices
1) Design choices faced by developers:
a) Analytics vs. Job clusters.
b) Does my streaming cluster need to be up 24/7? You can execute a streaming workload on a Job cluster if the latency requirement is close to an hour; if the latency requirement is once a day, consider Trigger-once mode (a sketch follows the exceptions below).
c) Cluster modes to use.
d) Performance: picking the right virtual machine.
General Purpose Compute: Machine Learning
Storage Optimized: Streaming and Extract-Transform-Load (ETL) jobs
Compute Optimized: Interactive/development workloads
GPU: Deep Learning workloads
Exceptions:
1) Machine Learning can require a huge in-memory cache when more computation is involved; in that case use Memory Optimized VMs, but if your data set is huge, go for Storage Optimized VMs.
2) For streaming jobs, if you want stream processing to run faster than the input rate, go for Compute Optimized virtual machines.
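Returning to design choice (b) above: a minimal sketch of Trigger-once mode, where a scheduled job processes whatever arrived since the last run and then shuts down. The source, checkpoint, and target paths are placeholders:

```python
# Run a Structured Streaming workload as a scheduled job instead of a 24/7
# cluster. Trigger-once processes all available data, then stops the query.
(spark.readStream
      .format("delta")
      .load("/mnt/prod-data/events")                       # placeholder source
      .writeStream
      .format("delta")
      .option("checkpointLocation", "/mnt/checkpoints/events-daily")
      .trigger(once=True)                                  # one batch, then stop
      .start("/mnt/prod-data/events-clean"))
```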
Some smart tips: getting the exact size right is workload dependent, and only iterative methods work. For example, cache a percentage of the data, see how much memory it uses, and scale up iteratively to find the best match for your workload (a sketch follows these tips).
CPU bound: add more cores by adding more nodes to the cluster.
Network bound: use fewer, bigger SSD-backed machines to reduce network traffic and improve remote read performance.
Disk I/O bound: if jobs are spilling to disk, use virtual machines with more memory.
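A minimal sketch of that iterative method, assuming a Delta table at a placeholder path: cache a sample, read its in-memory size off the Spark UI's Storage tab, and extrapolate to the full data set:

```python
# Cache a 10% sample, materialize it, then check Spark UI > Storage for its
# in-memory size; full-dataset memory need is roughly (cached size) / 0.10.
sample = (spark.read.format("delta")
               .load("/mnt/prod-data/orders")     # placeholder table
               .sample(fraction=0.10, seed=42))
sample.cache()
print(sample.count())  # an action is needed to actually populate the cache
```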
Databricks Pools considerations: consider using Pools if you want to shorten cluster start times by up to 7X. Pools give the best results for short-duration jobs that need fast trigger and finish times, and they help speed up the time between job stages.
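A hedged sketch of creating a pool through the Databricks Instance Pools REST API (the workspace URL, token handling, pool name, node type, and idle-instance counts below are illustrative assumptions to tune for your own job cadence):

```python
import requests

host = "https://<your-databricks-instance>"          # placeholder workspace URL
token = "<personal-access-token>"                    # keep real tokens in Key Vault

pool_spec = {
    "instance_pool_name": "short-jobs-pool",         # illustrative name
    "node_type_id": "Standard_DS3_v2",
    "min_idle_instances": 2,                         # warm VMs waiting for clusters
    "idle_instance_autotermination_minutes": 30,     # release idle VMs eventually
}
resp = requests.post(f"{host}/api/2.0/instance-pools/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=pool_spec)
print(resp.json())   # returns an instance_pool_id to reference in cluster specs
```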