Ksama Arora

Designing and Implementing a Data Science Solution on Azure (DP-100)

Jul 13, 2024

Contents


Data Science Life Cycle & Responsible AI Guidelines - Overview

Designing a Data Ingestion Solution*

Overview: Extract raw data from source (CRM or IoT device) -> Copy and transform data with Azure Synapse Analytics -> Store prepared data in Azure Blob Storage -> Train model with Azure ML

| Type of Data | Description | Example |
| --- | --- | --- |
| Tabular or structured data | All data has the same fields or properties, defined in a schema. Represented in tables where columns are features and rows are data points. | Excel or CSV file |
| Semi-structured data | Not all data has the same fields or properties. Represented by a collection of key-value pairs, where keys are features and values are properties. | JSON object from an IoT device: {"deviceId": 29482, "location": "Office1", "time": "2021-07-14T12:47:39Z", "temperature": 23} |
| Unstructured data | Files that don't adhere to any rules regarding structure; the data can't be queried like a database. | Document, image, audio, or video file |

Store data for model training workloads:*

Create a data ingestion pipeline:*

Selecting development approaches to build or train models*

Three ways to interact with Azure Machine Learning


Azure Machine Learning Environments* (Imp)

Environment: An environment is another form of source control; you can create multiple custom environments or use curated ones. An environment contains Python packages, environment variables, software settings, and runtimes. Environments ensure training is version-controlled, reproducible, and auditable.
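
A minimal sketch of registering a custom environment with the SDK v2; the MLClient (ml_client), the base image, and the conda file path are placeholders/assumptions:

from azure.ai.ml.entities import Environment

# Define a custom environment from a base Docker image plus a conda spec
env = Environment(
    name="docs-environment",
    description="Custom environment for training",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",  # base image
    conda_file="./conda-env.yml",  # Python packages for the environment
)
ml_client.environments.create_or_update(env)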

NOTE IMP: Cloud Based Environment

Azure ML Workspace


IMP: File formats

Resources and Assets*

| Datastore | Description | Storage Type | Main Usage |
| --- | --- | --- | --- |
| workspaceartifactstore | Stores compute and experiment logs | Azure Blob | Logs for jobs |
| workspaceworkingdirectory | Stores files uploaded via the Notebooks section | Azure File Share | Notebook files |
| workspaceblobstore | Default datastore for storing uploaded data | Azure Blob | Data assets |
| workspacefilestore | General file storage | Azure File Share | General file storage |

(I) DESIGN AND PREPARE A MACHINE LEARNING SOLUTION

Compute*

| Compute Type | Description |
| --- | --- |
| Compute instance | Similar to a VM. Primarily used to run notebooks and ideal for experimentation. The easiest way to work with a compute instance is through the integrated notebooks experience in Azure ML studio, or use VS Code for easier source control of your code. |
| Compute cluster | On-demand multi-node cluster of CPU or GPU nodes that scales automatically. Used for large volumes of data; allows parallel processing to distribute the workload and reduce run time. Ideal for production workloads. Dedicated cluster: nodes are ready when you are, but more expensive. Low-priority cluster: nodes are shared with other workloads, so access may take a few minutes, but it is less expensive. |
| Inference cluster / Kubernetes cluster | Lets you create or attach an Azure Kubernetes Service (AKS) cluster. Best for deploying trained machine learning models in production scenarios. |
| Attached compute | Lets you attach other Azure compute resources to the workspace, such as Azure Databricks, HDInsight clusters, or Synapse Spark pools. Used for specialized needs. |
| Serverless compute | Fully managed, on-demand compute you can use for training jobs. |

Apache Spark Tools as Compute Targets

Azure Synapse: An enterprise analytics platform that combines data warehousing, analytics, data processing, integration, and pipelines, built on a massively parallel processing architecture. Synapse supports both SQL and Spark technologies.

Azure Synapse Spark Pools:

Serverless Spark pools can be used as a form of compute to run notebooks in Azure ML studio.

IMP:

Create Compute Targets for Experiments and Training:*

Azure ML compute instances or compute clusters can be created from:

Create Compute Target using Python SDK and CLI (Imp)
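
A minimal sketch (SDK v2); assumes an authenticated MLClient named ml_client, and the names and VM size are placeholders:

from azure.ai.ml.entities import AmlCompute, ComputeInstance

# Compute instance for notebook experimentation
ci = ComputeInstance(name="ci-demo", size="STANDARD_DS3_V2")
ml_client.begin_create_or_update(ci).result()

# Compute cluster that autoscales between 0 and 2 nodes
cluster = AmlCompute(
    name="cpu-cluster",
    size="STANDARD_DS3_V2",
    min_instances=0,
    max_instances=2,
    tier="dedicated",
)
ml_client.begin_create_or_update(cluster).result()

# Rough CLI v2 equivalent for the cluster (an assumption):
# az ml compute create --name cpu-cluster --type AmlCompute --size STANDARD_DS3_V2 --min-instances 0 --max-instances 2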

Configure Attached Compute Resources*


Attached Compute - HD Insights and Apache Spark

from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, RemoteCompute

# Ubuntu VMs only; the VM must have a public IP address
my_ws = Workspace.from_config()
my_resource_id = "/subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.Compute/virtualMachines/<vm_name>"
my_compute_target_name = "attached_existingVM"

# Build the attach configuration for the existing VM
attach_config = RemoteCompute.attach_configuration(resource_id=my_resource_id,
                                                   ssh_port=22,
                                                   username="<username>",
                                                   password="<password>")

# Attach the compute and wait for the operation to finish
compute = ComputeTarget.attach(my_ws, my_compute_target_name, attach_config)
compute.wait_for_completion(show_output=True)

Note: When attaching an existing virtual machine to your Azure ML workspace as a compute target, the external VM must run Ubuntu and must have a public IP address.
However, the primary reason for choosing an existing VM over a new compute instance is to use unused capacity effectively.

Create Azure Machine Learning Workspace - Lab 1


(II) EXPLORE DATA AND TRAIN MODEL

Terminologies:*

URI (Uniform Resource Identifier):*

Datastores:*

Data Assets:*

IMP Note:

Three types of data assets (SDK v2):

Data Asset Management: The implementation and monitoring of datastores and data assets. It covers versioning and tracking, registering and retrieving those versions, and monitoring datasets, including drift detection. You can also access public datasets that Azure provides.

Create Datastores and Data Assets using Python SDK (Imp)
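
A minimal sketch (SDK v2); the storage account, container, datastore path, and asset names are placeholders, and the datastore here uses identity-based access rather than an account key:

from azure.ai.ml.entities import AzureBlobDatastore, Data
from azure.ai.ml.constants import AssetTypes

# Register a blob container as a datastore
blob_datastore = AzureBlobDatastore(
    name="blob_example",
    account_name="mystorageaccount",
    container_name="data-container",
)
ml_client.create_or_update(blob_datastore)

# Register a folder on that datastore as a uri_folder data asset
my_data = Data(
    name="my-data-asset",
    path="azureml://datastores/blob_example/paths/data/",
    type=AssetTypes.URI_FOLDER,
)
ml_client.data.create_or_update(my_data)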

Mounting and Downloading Files for Datasets

|  | Mount Files | Download Files |
| --- | --- | --- |
| Description | Files do not reside on the compute | Files are downloaded to the compute |
| Processing | More streaming: more processing/moving of data as it's read | Less streaming: less processing, as the data has been downloaded |
| Usage | Good if you don't use all files from the dataset | Good if you use all files from the dataset |
| Available for | Datasets created from ADLS, SQL Database, PostgreSQL | Datasets created from ADLS, SQL Database, PostgreSQL |

Preprocessing of Data

Steps for Preprocessing of Data:

Feature Selection and Feature Engineering

Differential Privacy: Eg of Responsible AI*

Differential Privacy seeks to protect individual data by adding statistical noise to the analysis process, minimizing the risk of personal identification and ensuring data privacy. It ensures that the output of a data analysis algorithm does not reveal sensitive information about any individual.

# Sample code snippet: keyword arguments for a differentially private aggregate
# (the surrounding analysis call was elided)
privacy_usage = {'epsilon': 0.10}  # privacy budget; smaller epsilon means more noise
data_lower = lower_range[1]        # expected lower bound of the data
data_upper = upper_range[10]       # expected upper bound of the data
data_rows = sample                 # the rows being analyzed
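
Since the snippet above is only a fragment, here is a self-contained toy illustration of the idea: a hypothetical dp_mean helper using the Laplace mechanism (not the SmartNoise API):

import numpy as np

def dp_mean(values, lower, upper, epsilon):
    # Clip values to the stated bounds so one record's influence is limited
    clipped = np.clip(values, lower, upper)
    # Sensitivity of the mean: how much one record can change the result
    sensitivity = (upper - lower) / len(values)
    # Laplace noise scaled to sensitivity / epsilon
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

temps = np.array([23.0, 25.0, 19.0, 31.0])
print(dp_mean(temps, lower=0.0, upper=40.0, epsilon=0.10))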

Accessing Data During Interactive Development (Imp)

Data Wrangling: The process of transforming data to the format that's best suited to the needs of the ML model.

Wrangle Data with Python in Azure ML - Lab 2


(III) THREE WAYS TO BUILD AND TRAIN ML MODEL

(A) Build & Train Models using Azure ML Designer

Dataset -> Select Columns in Dataset -> Clean Missing Data -> Split Data & (Linear Regression) -> Train Model -> Score Model -> Evaluate Model


Custom Code Components

Custom code modules are created using Python. Supported libraries include NumPy, SciPy, scikit-learn, Theano, TensorFlow, Keras, PyTorch, pandas, and matplotlib

IMP NOTE:


NOTE IMP:

Algorithms used in creating pipeline


Create a basic pipeline in Azure ML Studio - Lab 3

Filter Based Feature Selection Module

Pearson Correlation vs Chi-Squared Statistics

| Aspect | Pearson Correlation | Chi-Squared Statistic |
| --- | --- | --- |
| Purpose | Measures the strength and direction of a linear relationship between two quantitative variables | A comparative test that reveals how close expected values are to actual results |
| Coefficient / statistic | The correlation coefficient (R) ranges from -1 to +1: 0 = no correlation, +1 = perfect positive correlation, -1 = perfect negative correlation | There is no correlation coefficient; the statistic starts at 0, and lower values indicate a better fit to the expected values |
| Type of variables | Quantitative | Categorical |
| Interpretation | Positive correlation (e.g., more rain increases humidity); negative correlation (e.g., higher altitude decreases temperature) | Indicates whether a relationship exists but does not specify the type (positive or negative) |
| Value indication | Indicates the strength of the relationship | Smaller values indicate a better fit and the existence of a relationship |
| Steps | 1. Determine linearity; 2. Clean data; 3. Generate the coefficient; 4. Evaluate the results |  |

Permutation Feature Importance Module

It refers to randomly shuffling the values of one feature column while keeping everything else constant, then generating new predictions on the shuffled data. The feature importance score is computed from the resulting decrease in model quality.
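
A hedged sketch of the same idea using scikit-learn (one of the supported libraries listed above); model, X_val (a DataFrame), and y_val are placeholder names:

from sklearn.inspection import permutation_importance

# Shuffle each feature column n_repeats times and measure the score drop
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
for name, score in zip(X_val.columns, result.importances_mean):
    print(f"{name}: mean drop in score when shuffled = {score:.4f}")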

Applying Automated ML to Explore Models

Automated ML democratizes machine learning with a no-code approach, making it easy to explore optimal machine learning models. Automated ML handles preprocessing, featurization, transformation, scaling, and normalization. At the end, it scores each model on a selected metric so the best model can be deployed. Automated ML is used to explore optimal algorithms and parameters for a particular problem without a lot of human trial and error.

Examples of Automated ML:

Configure and submit Auto ML experiment using Python SDK* (Imp)

from azure.ai.ml.constants import AssetTypes
from azure.ai.ml import automl, Input

# Create an Input object for the training data
my_training_input = Input(
    type=AssetTypes.MLTABLE, 
    path="./data/training-mltable-folder"
)

# Configure the classification job
my_classification_job = automl.classification( # for classification, the automl.classification function is used
    compute="aemcmlcompute",
    experiment_name="this_experiment",
    training_data=my_training_input, # my_training_input refers to the MLTable data asset created in the Azure ML workspace
    target_column_name="try_me",
    primary_metric="accuracy",
    n_cross_validations=4,
    enable_model_explainability=True,
    tags={"my_tag": "My value"}
)

# Specify Primary Metric (Important to specify)
# The primary metric is the target performance metric for which the optimal model will be determined. 
# Set the primary metric to the performance score for which you want to optimize the model.

# To retrieve the list of ClassificationPrimaryMetrics available
from azure.ai.ml.automl import ClassificationPrimaryMetrics
list(ClassificationPrimaryMetrics) 

# Set optional limits to minimize cost and time spent on training by set_limits()
my_classification_job.set_limits(
    timeout_minutes=600, # timeout_minutes: Number of minutes after which the complete AutoML experiment is terminated.
    trial_timeout_minutes=20, # trial_timeout_minutes: Maximum number of minutes one trial can take.
    max_trials=5, # max_trials: Maximum number of trials, or models that will be trained.
    enable_early_termination=True # enable_early_termination: Whether to end the experiment if the score isn't improving in the short term.
)
# Note: You can run multiple trials in parallel to save time. On a compute cluster, you can run as many parallel trials as the maximum number of nodes.
# Note: To cap the number of parallel trials below the number of nodes, set max_concurrent_trials.

# Set optional training properties
my_classification_job.set_training(
    blocked_training_algorithms=["logistic_regression"], 
    enable_onnx_compatible_models=True
)
# Submit the AutoML job
returned_job = ml_client.jobs.create_or_update(
    my_classification_job
)
returned_job.services["Studio"].endpoint

# Monitor Auto ML job runs 
aml_url = returned_job.studio_url
print("Monitor your job at", aml_url)

IMP NOTE:

To create a reference to a data source location, use the Input class; the data remains in the source location. The Output class is similar, but for the outputs of components.

NOTE: Set the featurization mode to "off" if you don't want automated machine learning to make changes to the data (no preprocessing). The target is the column you want to predict. The primary metric is the performance score for which you want to optimize the model; for example, set primary_metric to 'AUC_weighted' to optimize for the AUC_weighted metric.
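
A minimal sketch of turning featurization off on the classification job defined earlier (the set_featurization call is an SDK v2 method of the AutoML job; my_classification_job comes from the snippet above):

# Turn off automatic featurization so AutoML does not preprocess the data
my_classification_job.set_featurization(mode="off")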

(B) Build & Train Models using Azure Automated Machine Learning - Lab 4*

(C) Training Model by using Python SDK*



Tuning Hyperparameters in Azure Machine Learning*

Parameters are values learned from the training data during training. Hyperparameters are values used to configure training behavior but which are not derived from the training data. Hyperparameters are top-level settings you configure before running the ML algorithm, such as the train-test ratio, number of epochs, and batch size.

Types of Hyperparameters*

Discrete Hyperparameters:

{
    "batch_size": choice(1, 2, 3, 4),
    "number_of_hidden_layers": choice(range(1, 5))
}

Continuous Hyperparameters:

{
    "learning_rate": normal(10, 3),
    "keep_probability": uniform(0.05, 0.1)
}
from azure.ai.ml.sweep import Choice, Normal

command_job_for_sweep = command_job(
    batch_size=Choice(values=[16, 32, 64]),    
    learning_rate=Normal(mu=10, sigma=3),
)


from azure.ai.ml.sweep import Uniform, Choice

command_job_for_sweep = command_job(
    learning_rate=Uniform(min_value=0.05, max_value=0.1),  # continuous hyperparameter
    batch_size=Choice(values=[16, 32, 64, 128]),           # discrete hyperparameter
)

sweep_job = command_job_for_sweep.sweep(
    compute="cpu-cluster",
    sampling_algorithm="bayesian",
    primary_metric="accuracy",
    goal="Maximize",
)
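
To complete the picture, a hedged sketch of capping the sweep and submitting it (assuming an authenticated MLClient named ml_client; the limit values are placeholders):

# Limit the total number of trials, parallelism, and overall runtime (seconds)
sweep_job.set_limits(max_total_trials=20, max_concurrent_trials=4, timeout=7200)

# Submit the sweep job
returned_sweep_job = ml_client.create_or_update(sweep_job)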

Early Termination Policies*

Early Termination Policies: A strategy used to halt the training of new models when they do not yield significantly better results than previously trained models. This increases efficiency and reduces cost by stopping underperforming trials early. Generally used when working with continuous hyperparameters and the random or Bayesian sampling methods.

Bandit Policy:

from azure.ai.ml.sweep import BanditPolicy
  
sweep_job.early_termination = BanditPolicy(
    slack_amount = 0.2, 
    delay_evaluation = 5, 
    evaluation_interval = 1
)
# Suppose that after the first five trials, the best-performing model has an accuracy of 0.9.
# With slack_amount = 0.2, any new model must score better than 0.9 - 0.2 = 0.7 for its trial to continue.
# If a new model's accuracy is below 0.7, the policy terminates that trial.

Median Stopping Policy:

from azure.ai.ml.sweep import MedianStoppingPolicy
  
sweep_job.early_termination = MedianStoppingPolicy(
    delay_evaluation = 5, 
    evaluation_interval = 1
)
# Suppose the median accuracy score across trials so far is 0.82.
# A new model's accuracy should be higher than 0.82 for its trial to continue.
# If a new model's accuracy is below 0.82, the policy terminates that trial.

Truncation Selection Policy:

from azure.ai.ml.sweep import TruncationSelectionPolicy
  
sweep_job.early_termination = TruncationSelectionPolicy(
    evaluation_interval=1, 
    truncation_percentage=20, 
    delay_evaluation=4 
)
# For example, if the primary metric is accuracy, a trial continues only if its accuracy is not in the worst 20% of all trials so far.
# If the fifth trial has the lowest accuracy among all trials, that trial is terminated.

Use a sweep job for hyperparameter tuning*

Steps to Optimize Hyperparameters

| Notebooks | Scripts |
| --- | --- |
| Designed for interactive training | Intended for automated training |
| Contain other content | Stripped down to (mostly) code |
| Contain alternative model ideas | Run settled models |
| On-demand code (run specific cells as needed) | On-command functions (execute without supervision) |
| Awkward to deploy in CI/CD | Enable CI/CD processes |

Run Training Script as Command Job in Azure ML*
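
A minimal sketch (SDK v2); the script folder, curated environment name, and compute name are assumptions, and ml_client is an authenticated MLClient:

from azure.ai.ml import command

# Configure a command job that runs a training script on a compute cluster
job = command(
    code="./src",  # folder containing train.py
    command="python train.py --reg_rate ${{inputs.reg_rate}}",
    inputs={"reg_rate": 0.01},
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    compute="cpu-cluster",
    display_name="train-model",
    experiment_name="train-model",
)

# Submit the job and print the studio URL to monitor it
returned_job = ml_client.create_or_update(job)
print(returned_job.studio_url)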

Responsible AI Dashboard*


Run Pipelines in Azure ML

Create Components

Components: Allow you to create reusable scripts that can easily be shared across users within the same ML workspace. Use components to build ML pipelines.

A component consists of 3 parts:

To create a component, you need 2 files:

Code to create a Component*
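
A minimal sketch (SDK v2), assuming prep.yml is a component specification YAML that points at its Python script, and ml_client is an authenticated MLClient:

from azure.ai.ml import load_component

# Load the component from its YAML specification
prep_component = load_component(source="./prep.yml")

# Register the component so it can be shared and reused in the workspace
ml_client.components.create_or_update(prep_component)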

Create Pipelines

A pipeline is a workflow of ML tasks in which each task is defined as a component. A pipeline can be executed as a process by running it as a pipeline job; each component is executed as a child job of the overall pipeline job.

Build a Pipeline and Run a Pipeline Job*
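
A minimal sketch, assuming prep_component and train_component are loaded components whose input/output names (input_data, training_data, output_data, model_output) are placeholders, as is the data asset name:

from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.dsl import pipeline

@pipeline()
def my_pipeline(pipeline_input):
    # Each component call becomes a child job of the pipeline job
    prep = prep_component(input_data=pipeline_input)
    train = train_component(training_data=prep.outputs.output_data)
    return {"trained_model": train.outputs.model_output}

# Point the pipeline at a registered data asset (name/version are placeholders)
pipeline_job = my_pipeline(Input(type=AssetTypes.URI_FILE, path="azureml:my-data:1"))
pipeline_job.settings.default_compute = "cpu-cluster"

# Run the pipeline as a pipeline job
returned_job = ml_client.jobs.create_or_update(pipeline_job, experiment_name="pipeline-example")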


Track Model with MLFlow in Jobs*

MLflow is an open-source platform designed to streamline the tracking of model metrics and artifacts, such as experiments and models. Scripts are the Azure ML development approach best suited to monitoring and evaluating MLflow models. MLflow is cloud-agnostic and language-agnostic; install it using pip install mlflow azureml-mlflow. An MLflow model can use no-code deployment in Azure ML. MLflow standardizes the packaging of models, which means a model can easily be imported or exported across different workflows.

IMP NOTE:

MLflow is the recommended tool for deploying a trained model without specifying an environment or a scoring script. The other options (Azure ML online endpoints, batch endpoints, and deployment using the ONNX runtime) all require both an environment and a scoring script. MLflow itself can be used without an Azure ML workspace.

(I) TRACK METRICS WITH MLFLOW

Two methods to track ML jobs with MLflow
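
The two methods are enabling autologging and using custom logging functions. A short sketch of both inside a training script (framework support for autologging is an assumption of the example):

import mlflow

# Method 1: autologging - the framework flavor logs parameters, metrics, and the model
mlflow.autolog()

# Method 2: custom logging - log values explicitly
with mlflow.start_run():
    mlflow.log_param("regularization_rate", 0.1)
    mlflow.log_metric("accuracy", 0.91)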

(II) VIEW METRICS AND EVALUATE MODELS

When the job is complete, you can explore and evaluate models:

IMP NOTE: To use MLflow to track local experiments

Install azureml-mlflow package -> Create Azure ML workspace -> Set up tracking experiment -> Set experiment name -> Start training run
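
A hedged sketch of those steps; the workspace config file and experiment name are placeholders:

import mlflow
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Point MLflow at the workspace's tracking URI
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
tracking_uri = ml_client.workspaces.get(ml_client.workspace_name).mlflow_tracking_uri
mlflow.set_tracking_uri(tracking_uri)

# Set the experiment name and start a training run
mlflow.set_experiment("local-experiment")
with mlflow.start_run():
    mlflow.log_metric("accuracy", 0.90)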

(III) REGISTER MLFLOW MODEL IN AZURE ML*

When you train and log a model, all relevant artifacts are stored in a directory. When you register the model, an MLmodel file is created in that directory. The MLmodel file contains the model's metadata, which allows for model traceability. With autologging enabled, the model is logged when the .fit() method is called.

The framework used to train the model is identified and included as the flavor of your model.
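
A minimal sketch of registering the logged MLflow model from a completed job; the job variable (returned_job) and the model name are placeholders:

from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

# Register the MLmodel folder produced by the job as an MLflow model asset
model = Model(
    path=f"azureml://jobs/{returned_job.name}/outputs/artifacts/paths/model/",
    type=AssetTypes.MLFLOW_MODEL,
    name="my-registered-model",
)
ml_client.models.create_or_update(model)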

(I) Customize the model signature

The model signature defines the schema of the model's inputs and outputs. The signature is stored in JSON format in the MLmodel file, along with the model's other metadata. A model signature can be created in 2 ways:*
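
The two ways are inferring the signature from example data or constructing it by hand from input/output specs. A hedged sketch of inference (model, X_train, and the sklearn flavor are placeholders/assumptions):

import mlflow
from mlflow.models.signature import infer_signature

# Infer the input/output schema from example data and predictions
signature = infer_signature(X_train, model.predict(X_train))
mlflow.sklearn.log_model(model, artifact_path="classifier", signature=signature)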

(II) Understand the MLflow model format

MLflow uses the MLmodel format to store all relevant model assets in a folder or directory. The MLmodel file is one of the files in that directory.


artifact_path: classifier # During the training job, the model is logged to this path.
flavors: # The machine learning library with which the model was created.
  fastai:
    data: model.fastai
    fastai_version: 2.4.1
  python_function:
    data: model.fastai
    env: conda.yaml
    loader_module: mlflow.fastai
    python_version: 3.8.12
model_uuid: e694c68eba484299976b06ab9058f636 # The unique identifier of the registered model.
run_id: e13da8ac-b1e6-45d4-a9b2-6a0a5cfac537 # The unique identifier of job run during which the model was created.
signature: # Specifies the schema of the model's inputs and outputs:
# inputs: Valid input to the model. For example, a subset of the training dataset.
# outputs: Valid model output. For example, model predictions for the input dataset.
  inputs: '[{"type": "tensor",
             "tensor-spec": 
                 {"dtype": "uint8", "shape": [-1, 300, 300, 3]}
           }]'
  outputs: '[{"type": "tensor", 
              "tensor-spec": 
                 {"dtype": "float32", "shape": [-1,2]}
            }]'

(III) Choose the flavor

A flavor is the machine learning library with which the model was created. The Python function flavor is the default model interface for models created from an MLflow run. Any MLflow Python model can be loaded as a python_function model, which allows workflows like deployment to work with any Python model regardless of the framework used to produce it.
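
A minimal sketch of loading via the python_function flavor; the model name/version and the input DataFrame are placeholders:

import mlflow.pyfunc

# Any MLflow model can be loaded through the generic pyfunc interface
loaded_model = mlflow.pyfunc.load_model("models:/my-registered-model/1")
predictions = loaded_model.predict(input_df)  # input_df must match the signature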

artifact_path: pipeline
flavors:
  python_function:
    env:
      conda: conda.yaml
      virtualenv: python_env.yaml
    loader_module: mlflow.sklearn
    model_path: model.pkl
    predict_fn: predict
    python_version: 3.8.5
  sklearn:
    code: null
    pickled_model: model.pkl
    serialization_format: cloudpickle
    sklearn_version: 1.2.0
mlflow_version: 2.1.0
model_uuid: b8f9fe56972e48f2b8c958a3afb9c85d
run_id: 596d2e7a-c7ed-4596-a4d2-a30755c0bfa5
signature:
  inputs: '[{"name": "age", "type": "long"}, {"name": "sex", "type": "long"}, {"name":
    "cp", "type": "long"}, {"name": "trestbps", "type": "long"}, {"name": "chol",
    "type": "long"}, {"name": "fbs", "type": "long"}, {"name": "restecg", "type":
    "long"}, {"name": "thalach", "type": "long"}, {"name": "exang", "type": "long"},
    {"name": "oldpeak", "type": "double"}, {"name": "slope", "type": "long"}, {"name":
    "ca", "type": "long"}, {"name": "thal", "type": "string"}]'
  outputs: '[{"name": "target", "type": "long"}]'

There are two types of signatures: tensor-based (as in the first MLmodel example above, where inputs are described by tensor specs) and column-based (as in the second example, where each input is a named column).


Deploy and Consume Models with Azure ML*

(A) Deploy Model to MANAGED ONLINE ENDPOINT (REAL-TIME ENDPOINT)

Endpoint: An HTTPS endpoint to which you can send data and which returns a response (almost) immediately. Any data sent to the endpoint serves as the input for the scoring script hosted on the endpoint. The scoring script loads the trained model to predict the label for the new input data, which is called inferencing. The label is then part of the output that's returned.

Types of online endpoint:

To deploy a model to a managed online endpoint, you need to specify 4 things:

NOTE: BLUE/GREEN DEPLOYMENT


Create an online endpoint using ManagedOnlineEndpoint class

from azure.ai.ml.entities import ManagedOnlineEndpoint # using ManagedOnlineEndpoint class

# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name="endpoint-example", # Name of the endpoint. Must be unique in the Azure region.
    description="Online endpoint",
    auth_mode="key", # Use key for key-based authentication. Use aml_token for Azure Machine Learning token-based authentication.
)

ml_client.begin_create_or_update(endpoint).result()

(I) DEPLOY MLFLOW MODEL TO A MANAGED ONLINE ENDPOINT*
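
A minimal sketch (SDK v2); the local model path, instance type, and traffic split are assumptions:

from azure.ai.ml.entities import Model, ManagedOnlineDeployment
from azure.ai.ml.constants import AssetTypes

# An MLflow model needs no scoring script or environment in the deployment
model = Model(path="./model", type=AssetTypes.MLFLOW_MODEL)

blue_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="endpoint-example",
    model=model,
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(blue_deployment).result()

# Route all traffic to the new deployment
endpoint.traffic = {"blue": 100}
ml_client.begin_create_or_update(endpoint).result()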

(II) DEPLOY MODEL TO A MANAGED ONLINE ENDPOINT WITHOUT USING Mlflow MODEL FORMAT*

TEST MANAGED ONLINE ENDPOINTS
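
A minimal sketch of testing a deployment (the JSON request file is a placeholder):

# Send a test request to a specific deployment behind the endpoint
response = ml_client.online_endpoints.invoke(
    endpoint_name="endpoint-example",
    deployment_name="blue",
    request_file="sample-data.json",
)
print(response)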

(B) Deploy Model to BATCH ENDPOINT

Endpoint: It is an HTTPS endpoint that you can call to trigger a batch scoring job from another service like Azure Synapse Analytics or Azure Databricks. A batch endpoint allows you to integrate the batch scoring with an existing data ingestion and transformation pipeline.

Note: Models that require complex, time-consuming scoring jobs can be deployed to batch endpoints, since they don't require an immediate response. Anything conducted on an hourly, daily, or weekly schedule is most likely best suited to a batch endpoint, because batch endpoints are designed for asynchronous, higher-latency tasks.

Create a batch endpoint using the BatchEndpoint class.

from azure.ai.ml.entities import BatchEndpoint

# Batch endpoint names need to be unique within an Azure region
# create a batch endpoint
endpoint = BatchEndpoint(
    name="endpoint-example",
    description="A batch endpoint",
)

ml_client.batch_endpoints.begin_create_or_update(endpoint)

Deploy model to a batch endpoint

Multiple models can be deployed to a batch endpoint; the default deployment is used unless otherwise specified. Whenever you call the batch endpoint, a batch scoring job is triggered.

from azure.ai.ml.entities import AmlCompute

cpu_cluster = AmlCompute(
    name="aml-cluster",
    type="amlcompute",
    size="STANDARD_DS11_V2",
    min_instances=0,
    max_instances=4,
    idle_time_before_scale_down=120,
    tier="Dedicated",
)

cpu_cluster = ml_client.compute.begin_create_or_update(cpu_cluster)

(I) DEPLOY MLFLOW MODEL TO A BATCH ENDPOINT*
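
A minimal sketch (SDK v2); the deployment name, compute, and batch settings are assumptions, and model is a registered MLflow model:

from azure.ai.ml.entities import BatchDeployment, BatchRetrySettings
from azure.ai.ml.constants import BatchDeploymentOutputAction

deployment = BatchDeployment(
    name="classifier-mlflow",
    endpoint_name=endpoint.name,
    model=model,  # an MLflow model, so no scoring script is needed
    compute="aml-cluster",
    instance_count=2,                 # nodes used for the scoring job
    max_concurrency_per_instance=2,   # parallel scoring processes per node
    mini_batch_size=2,                # files passed to each scoring call
    output_action=BatchDeploymentOutputAction.APPEND_ROW,
    output_file_name="predictions.csv",
    retry_settings=BatchRetrySettings(max_retries=3, timeout=300),
)
ml_client.batch_deployments.begin_create_or_update(deployment)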

(II) DEPLOY A CUSTOM MODEL TO A BATCH ENDPOINT WITHOUT USING MLflow MODEL FORMAT*

INVOKE AND TROUBLESHOOT BATCH ENDPOINTS

1. Invoke batch endpoint:


2. Trigger the batch scoring job:

from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

input = Input(type=AssetTypes.URI_FOLDER, path="azureml:new-data:1")

job = ml_client.batch_endpoints.invoke(
    endpoint_name=endpoint.name, 
    input=input)

3. Monitor run of pipeline job in Azure ML Studio in Jobs tab:


4. Troubleshoot a batch scoring job:

The batch scoring job runs as a pipeline job. To troubleshoot the pipeline job, review its details in the Outputs + logs tab of the pipeline job. To troubleshoot the scoring script, go to the child job -> Outputs + logs tab -> the logs/user/ folder, which contains three files that help with troubleshooting.


IMP:

Trigger Workflows on Github

GitHub Actions can also be used for MLOps to automate Azure ML tasks, such as data processing, model training, and deployment. Workflows are defined using YAML and stored in the .github/workflows directory of your code repository.
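
A minimal sketch of such a workflow file (e.g., .github/workflows/train.yml); the job YAML, resource names, and the AZURE_CREDENTIALS secret are assumptions:

name: Train model
on:
  push:
    branches: [main]
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: Submit Azure ML training job
        run: az ml job create --file src/job.yml --resource-group my-rg --workspace-name my-ws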


You can use Azure Event Grid to set up a trigger that sends an email to the data science team when data drift is detected on an Azure Machine Learning pipeline. Create a topic for the pipeline and a subscription with Logic Apps as the event handler to send the email.


Some Tricky Microsoft Practice Assessment Qs IMP