Operationalization of Machine Learning Models. Part 2.
Part 1 covers a distinctive kind of deployment mechanism that uses SQL Server Machine Learning Services.
Model Deployment using the Azure Data Factory
Any Machine Learning project, once developed, broadly caters to two types of execution patterns.
1) Batch processing, where we schedule our models to run at predefined times.
2) Real-time processing, where our models run the moment an event occurs.
Below we talk about how we can use Azure capabilities to operationalize our Machine Learning projects.
Batch Processing or Event-Based Real-Time Processing using Azure Data Factory
Consider a use case where we know our models need to run, say, every week, month, six months, or year, after ingesting a set of data from disparate sources to which we have connectivity on the Azure cloud via Azure Data Factory. The list of available connectors can be found at https://docs.microsoft.com/en-us/azure/data-factory/connector-overview
For this use case we know the data will have changed over the period defined above, which is why we need to run the complete data pipeline once again before training the model on it to get fresh inferences.
All of the above steps can be orchestrated in one Azure Data Factory pipeline: ingest data from the disparate sources, perform data mapping/wrangling on it (you can use Azure Data Flow or Azure Databricks for this), and finally feed the data to the models in Azure Databricks notebooks. The best part is that the whole pipeline can be scheduled as well, and you are done. A rough sketch of such a pipeline is shown below.
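As a rough sketch, the snippet below uses the azure-mgmt-datafactory Python SDK to chain a copy activity and a Databricks notebook activity in one pipeline. Every name in it (subscription, resource group, factory, datasets, linked service, notebook path) is a placeholder, it assumes the referenced datasets and the Databricks linked service already exist in the factory, and exact model signatures can vary slightly between SDK versions.

```python
# Minimal sketch of an ADF pipeline: copy raw data, then run a Databricks
# notebook. All names below are placeholders for illustration only.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink,
    DatabricksNotebookActivity, LinkedServiceReference, ActivityDependency,
)

subscription_id = "<subscription-id>"
rg_name = "<resource-group>"
df_name = "<data-factory-name>"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Copy activity: source and sink datasets are assumed to be defined already.
copy_raw = CopyActivity(
    name="CopySourceToBronze",
    inputs=[DatasetReference(type="DatasetReference", reference_name="SourceDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="BronzeDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

# Databricks notebook activity: runs the wrangling/modelling notebook only
# after the copy succeeds, passing a parameterized run date.
run_model = DatabricksNotebookActivity(
    name="RunModelNotebook",
    notebook_path="/Shared/ml/train_model",  # placeholder notebook path
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AzureDatabricksLS"),
    base_parameters={"run_date": "@{formatDateTime(utcnow(),'yyyy-MM-dd')}"},
    depends_on=[ActivityDependency(activity="CopySourceToBronze",
                                   dependency_conditions=["Succeeded"])],
)

pipeline = PipelineResource(activities=[copy_raw, run_model])
adf_client.pipelines.create_or_update(rg_name, df_name, "MLBatchPipeline", pipeline)
```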
Sample Steps
1) Develop an Azure Data Factory pipeline with multiple copy activities to ingest data from the disparate sources.
2) Once the data is ingested from the sources, store it in the bronze/raw stage of the Azure Data Lake.
3) Run Azure Data Flow or Azure Databricks notebooks to perform the data wrangling and data mapping activities.
4) Save the processed files in the silver/aggregated stage of the Azure Data Lake.
5) Call your Azure Databricks notebook, which uses the data from the aggregated stage to do the data science modelling (assuming this has been developed in the notebook environment of Azure Databricks). Make sure you do not hard-code file names etc. in the Databricks notebooks; instead use generic, parameterized values that can be changed according to the schedule on which your Data Factory runs (see the widget sketch after this list).
6) Schedule it for the designated periods using Azure Data Factory triggers.
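For step 5, one way to keep the notebook generic is Databricks widgets: the Data Factory notebook activity passes base parameters at run time and the notebook reads them instead of hard-coded paths. The widget names and mount paths below are only illustrative.

```python
# Inside the Azure Databricks notebook: read parameters passed by the
# Data Factory notebook activity instead of hard-coding file names.
dbutils.widgets.text("run_date", "2021-01-01")          # default for interactive runs
dbutils.widgets.text("silver_path", "/mnt/datalake/silver/covid")

run_date = dbutils.widgets.get("run_date")
silver_path = dbutils.widgets.get("silver_path")

# Read the aggregated (silver) data for this run and train/score the model on it.
df = spark.read.parquet(f"{silver_path}/{run_date}")
```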
Example Use Case
Consider that you are developing a system for a Covid-19 pandemic modelling use case, taking data from the website https://api.covid19india.org/, which exposes web APIs that return JSON/CSV suitable for your modelling. Since the data can change every day, we can schedule the pipeline to run daily.
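A first cut of the ingestion could be a short notebook snippet like the one below, which pulls one of the site's JSON feeds and lands it untouched in the raw (bronze) zone of the lake. The data.json endpoint and the /mnt/datalake mount path are assumptions for illustration; in the full pipeline this step would normally be a Data Factory copy activity with an HTTP source.

```python
# Pull the daily JSON feed and land it unchanged in the bronze/raw zone.
# The endpoint and the /mnt/datalake mount point are illustrative.
import datetime
import requests

resp = requests.get("https://api.covid19india.org/data.json", timeout=60)
resp.raise_for_status()

today = datetime.date.today().isoformat()
raw_path = f"/dbfs/mnt/datalake/bronze/covid/{today}.json"

with open(raw_path, "w") as f:
    f.write(resp.text)
```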
We need to do some housekeeping as well: the model always picks its input up from a specified location, so the moment new data lands there we need to archive the old data so that it does not interfere with modelling the new data. Also, once the modelling is complete, we can store the output in Azure SQL Server so that the end results can be consumed in Power BI.
Steps defined below (the archive and SQL write steps are sketched in code after the list):
1) Archive data from the input folder to a different location.
2) Copy data from the web API to ADLS (raw stage).
3) Run the Python modelling code in your Azure Databricks notebook.
4) Write the output files generated by the model to an Azure SQL Server database.
5) Trigger it for a daily run.
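For steps 1 and 4, a minimal Databricks-side sketch might look like the following. The mount paths, secret scope/keys, JDBC URL and table name are all placeholders, and the small dummy DataFrame stands in for whatever the modelling step actually produces.

```python
# Housekeeping and output sketch, intended to run inside a Databricks notebook.
# All paths, secret names, the JDBC URL and the table name are placeholders.
import datetime

input_dir = "/mnt/datalake/bronze/covid/"
archive_dir = f"/mnt/datalake/archive/covid/{datetime.date.today().isoformat()}/"

# Step 1: move the previous run's files out of the input folder so they do not
# interfere with the data that is about to land there.
dbutils.fs.mv(input_dir, archive_dir, recurse=True)

# ... the copy activity lands new data and the modelling runs here ...
# Stand-in for the DataFrame the modelling notebook would actually produce.
predictions_df = spark.createDataFrame(
    [("2021-05-01", "MH", 61500)], ["date", "state", "predicted_cases"])

# Step 4: persist the model output in Azure SQL Server so Power BI can consume it.
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"
(predictions_df.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.CovidPredictions")
    .option("user", dbutils.secrets.get("kv-scope", "sql-user"))
    .option("password", dbutils.secrets.get("kv-scope", "sql-password"))
    .mode("overwrite")
    .save())
```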