Ksama Arora

Designing and Implementing a Data Science Solution on Azure (DP-100)

Jul 13, 2024

Contents


Data Science Life Cycle & Responsible AI Guidelines - Overview

Designing a Data Ingestion Solution*

Overview: Extract raw data from source (CRM or IoT device) -> Copy and transform data with Azure Synapse Analytics -> Store prepared data in Azure Blob Storage -> Train model with Azure ML

| Type of Data | Description | Example |
| --- | --- | --- |
| Tabular or structured data | All data has the same fields or properties, defined in a schema. Represented in tables where columns are features and rows are data points. | Excel or CSV file |
| Semi-structured data | Not all data has the same fields or properties. Represented by a collection of key-value pairs, where keys are features and values are properties. | JSON object from an IoT device: {"deviceId": 29482, "location": "Office1", "time": "2021-07-14T12:47:39Z", "temperature": 23} |
| Unstructured data | Files that don't adhere to any rules regarding structure; the data can't be queried like a database. | Document, image, audio, or video file |

Store data for model training workloads:*

Create a data ingestion pipeline:*

Selecting development approaches to build or train models*

Three ways to interact with Azure Machine Learning


Azure Machine Learning Environments* (Imp)

Environment: An environment is another form of source control; you can create multiple custom environments or use curated ones. An environment contains Python packages, environment variables, software settings, and runtimes. Environments ensure training is version-controlled, reproducible, and auditable.
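
A minimal sketch of registering a custom environment with the SDK v2; the MLClient (ml_client), the base image, and the conda file path are placeholders/assumptions:

from azure.ai.ml.entities import Environment

# Define a custom environment from a base Docker image plus a conda spec
env = Environment(
    name="docs-environment",
    description="Custom environment for training",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",  # base image
    conda_file="./conda-env.yml",  # Python packages for the environment
)
ml_client.environments.create_or_update(env)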

NOTE IMP: Cloud Based Environment

Azure ML Workspace


IMP: File formats

Resources and Assets*

| Datastore | Description | Storage Type | Main Usage |
| --- | --- | --- | --- |
| workspaceartifactstore | Stores compute and experiment logs | Azure Blob | Logs for jobs |
| workspaceworkingdirectory | Stores files uploaded via the Notebooks section | Azure File Share | Notebook files |
| workspaceblobstore | Default datastore for storing uploaded data | Azure Blob | Data assets |
| workspacefilestore | General file storage | Azure File Share | General file storage |

(I) DESIGN AND PREPARE A MACHINE LEARNING SOLUTION

Compute*

| Compute Type | Description |
| --- | --- |
| Compute instance | Similar to a VM. Primarily used to run notebooks and ideal for experimentation. The easiest way to work with a compute instance is through the integrated notebooks experience in Azure ML studio, or use VS Code for easier source control of your code. |
| Compute cluster | On-demand multi-node cluster of CPU or GPU nodes that scales automatically. Used for large volumes of data; allows parallel processing to distribute the workload and reduce run time. Ideal for production workloads. Dedicated cluster: nodes are ready when you are, but more expensive. Low-priority cluster: nodes are shared with other workloads, so access may take a few minutes, but it is less expensive. |
| Inference cluster / Kubernetes cluster | Lets you create or attach an Azure Kubernetes Service (AKS) cluster. Best for deploying trained machine learning models in production scenarios. |
| Attached compute | Lets you attach other Azure compute resources to the workspace, such as Azure Databricks, HDInsight clusters, or Synapse Spark pools. Used for specialized needs. |
| Serverless compute | Fully managed, on-demand compute you can use for training jobs. |

Apache Spark Tools as Compute Targets

Azure Synapse: An enterprise analytics platform that combines data warehousing, analytics, data processing, integration, and pipelines, built on a massively parallel processing architecture. Synapse supports both SQL and Spark technologies.

Azure Synapse Spark Pools:

Serverless Spark pools can be used as a form of compute to run notebooks in Azure ML studio.

IMP:

Create Compute Targets for Experiments and Training:*

Azure ML compute instances or compute clusters can be created from:

Create Compute Target using Python SDK and CLI (Imp)
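
A minimal sketch (SDK v2); assumes an authenticated MLClient named ml_client, and the names and VM size are placeholders:

from azure.ai.ml.entities import AmlCompute, ComputeInstance

# Compute instance for notebook experimentation
ci = ComputeInstance(name="ci-demo", size="STANDARD_DS3_V2")
ml_client.begin_create_or_update(ci).result()

# Compute cluster that autoscales between 0 and 2 nodes
cluster = AmlCompute(
    name="cpu-cluster",
    size="STANDARD_DS3_V2",
    min_instances=0,
    max_instances=2,
    tier="dedicated",
)
ml_client.begin_create_or_update(cluster).result()

# Rough CLI v2 equivalent for the cluster (an assumption):
# az ml compute create --name cpu-cluster --type AmlCompute --size STANDARD_DS3_V2 --min-instances 0 --max-instances 2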

Configure Attached Compute Resources*


Attached Compute - HD Insights and Apache Spark

from azureml.core import Workspace
from azureml.core.compute import ComputeTarget, RemoteCompute

# Ubuntu VMs only; the VM must have a public IP address
my_ws = Workspace.from_config()
my_resource_id = "/subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.Compute/virtualMachines/<vm_name>"
my_compute_target_name = "attached_existingVM"

# Build the attach configuration for the existing VM
attach_config = RemoteCompute.attach_configuration(resource_id=my_resource_id,
                                                   ssh_port=22,
                                                   username="<username>",
                                                   password="<password>")

# Attach the compute and wait for the operation to finish
compute = ComputeTarget.attach(my_ws, my_compute_target_name, attach_config)
compute.wait_for_completion(show_output=True)

Note: When attaching an existing virtual machine to your Azure ML workspace as a compute target, the external VM must run Ubuntu and must have a public IP address.
However, the primary reason for choosing an existing VM over a new compute instance is to use unused capacity effectively.

Create Azure Machine Learning Workspace - Lab 1


(II) EXPLORE DATA AND TRAIN MODEL

Terminologies:*

URI (Uniform Resource Identifier):*

Datastores:*

Data Assets:*

IMP Note:

Three types of data assets (SDK v2):

Data Asset Management: The implementation and monitoring of datastores and data assets. It covers versioning and tracking, registering and retrieving those versions, and monitoring datasets, including drift detection. You can also access public datasets that Azure provides.

Create Datastores and Data Assets using Python SDK (Imp)
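
A minimal sketch (SDK v2); the storage account, container, datastore path, and asset names are placeholders, and the datastore here uses identity-based access rather than an account key:

from azure.ai.ml.entities import AzureBlobDatastore, Data
from azure.ai.ml.constants import AssetTypes

# Register a blob container as a datastore
blob_datastore = AzureBlobDatastore(
    name="blob_example",
    account_name="mystorageaccount",
    container_name="data-container",
)
ml_client.create_or_update(blob_datastore)

# Register a folder on that datastore as a uri_folder data asset
my_data = Data(
    name="my-data-asset",
    path="azureml://datastores/blob_example/paths/data/",
    type=AssetTypes.URI_FOLDER,
)
ml_client.data.create_or_update(my_data)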

Mounting and Downloading Files for Datasets

|  | Mount Files | Download Files |
| --- | --- | --- |
| Description | Files do not reside on the compute | Files are downloaded to the compute |
| Processing | More streaming: more processing/moving of data as it's read | Less streaming: less processing, as the data has been downloaded |
| Usage | Good if you don't use all files from the dataset | Good if you use all files from the dataset |
| Available for | Datasets created from ADLS, SQL Database, PostgreSQL | Datasets created from ADLS, SQL Database, PostgreSQL |

Preprocessing of Data

Steps for Preprocessing of Data:

Feature Selection and Feature Engineering

Differential Privacy: Eg of Responsible AI*

Differential Privacy seeks to protect individual data by adding statistical noise to the analysis process, minimizing the risk of personal identification and ensuring data privacy. It ensures that the output of a data analysis algorithm does not reveal sensitive information about any individual.

# Sample code snippet: keyword arguments for a differentially private aggregate
# (the surrounding analysis call was elided)
privacy_usage = {'epsilon': 0.10}  # privacy budget; smaller epsilon means more noise
data_lower = lower_range[1]        # expected lower bound of the data
data_upper = upper_range[10]       # expected upper bound of the data
data_rows = sample                 # the rows being analyzed
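
Since the snippet above is only a fragment, here is a self-contained toy illustration of the idea: a hypothetical dp_mean helper using the Laplace mechanism (not the SmartNoise API):

import numpy as np

def dp_mean(values, lower, upper, epsilon):
    # Clip values to the stated bounds so one record's influence is limited
    clipped = np.clip(values, lower, upper)
    # Sensitivity of the mean: how much one record can change the result
    sensitivity = (upper - lower) / len(values)
    # Laplace noise scaled to sensitivity / epsilon
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

temps = np.array([23.0, 25.0, 19.0, 31.0])
print(dp_mean(temps, lower=0.0, upper=40.0, epsilon=0.10))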

Accessing Data During Interactive Development (Imp)

Data Wrangling: The process of transforming data to the format that's best suited to the needs of the ML model.

Wrangle Data with Python in Azure ML - Lab 2


(III) THREE WAYS TO BUILD AND TRAIN ML MODEL

(A) Build & Train Models using Azure ML Designer

Dataset -> Select Columns in Dataset -> Clean Missing Data -> Split Data & (Linear Regression) -> Train Model -> Score Model -> Evaluate Model


Custom Code Components

Custom code modules are created using Python. Supported libraries include NumPy, SciPy, scikit-learn, Theano, TensorFlow, Keras, PyTorch, pandas, and matplotlib

IMP NOTE:


NOTE IMP:

Algorithms used in creating pipeline


Create a basic pipeline in Azure ML Studio - Lab 3

Filter Based Feature Selection Module

Pearson Correlation vs Chi-Squared Statistics

| Aspect | Pearson Correlation | Chi-Squared Statistic |
| --- | --- | --- |
| Purpose | Measures the strength and direction of a linear relationship between two quantitative variables | A comparative test that reveals how close expected values are to actual results |
| Coefficient / statistic | The correlation coefficient (R) ranges from -1 to +1: 0 = no correlation, +1 = perfect positive correlation, -1 = perfect negative correlation | There is no correlation coefficient; the statistic starts at 0, and lower values indicate a better fit to the expected values |
| Type of variables | Quantitative | Categorical |
| Interpretation | Positive correlation (e.g., more rain increases humidity); negative correlation (e.g., higher altitude decreases temperature) | Indicates whether a relationship exists but does not specify the type (positive or negative) |
| Value indication | Indicates the strength of the relationship | Smaller values indicate a better fit and the existence of a relationship |
| Steps | 1. Determine linearity; 2. Clean data; 3. Generate the coefficient; 4. Evaluate the results |  |

Permutation Feature Importance Module

It refers to randomly shuffling the values of one feature column while keeping everything else constant, then generating new predictions on the shuffled data. The feature importance score is computed from the resulting decrease in model quality.
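
A hedged sketch of the same idea using scikit-learn (one of the supported libraries listed above); model, X_val (a DataFrame), and y_val are placeholder names:

from sklearn.inspection import permutation_importance

# Shuffle each feature column n_repeats times and measure the score drop
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
for name, score in zip(X_val.columns, result.importances_mean):
    print(f"{name}: mean drop in score when shuffled = {score:.4f}")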

Applying Automated ML to Explore Models

Automated ML democratizes machine learning with a no-code approach, making it easy to explore optimal machine learning models. Automated ML handles preprocessing, featurization, transformation, scaling, and normalization. At the end, it scores each model on a selected metric so the best model can be deployed. Automated ML is used to explore optimal algorithms and parameters for a particular problem without a lot of human trial and error.

Examples of Automated ML:

Configure and submit Auto ML experiment using Python SDK* (Imp)

from azure.ai.ml.constants import AssetTypes
from azure.ai.ml import automl, Input

# Create an Input object for the training data
my_training_input = Input(
    type=AssetTypes.MLTABLE, 
    path="./data/training-mltable-folder"
)

# Configure the classification job
my_classification_job = automl.classification( # for classification, the automl.classification function is used
    compute="aemcmlcompute",
    experiment_name="this_experiment",
    training_data=my_training_input, # my_training_input refers to the MLTable data asset created in the Azure ML workspace
    target_column_name="try_me",
    primary_metric="accuracy",
    n_cross_validations=4,
    enable_model_explainability=True,
    tags={"my_tag": "My value"}
)

# Specify Primary Metric (Important to specify)
# The primary metric is the target performance metric for which the optimal model will be determined. 
# Set the primary metric to the performance score for which you want to optimize the model.

# To retrieve the list of ClassificationPrimaryMetrics available
from azure.ai.ml.automl import ClassificationPrimaryMetrics
list(ClassificationPrimaryMetrics) 

# Set optional limits to minimize cost and time spent on training by set_limits()
my_classification_job.set_limits(
    timeout_minutes=600, # timeout_minutes: Number of minutes after which the complete AutoML experiment is terminated.
    trial_timeout_minutes=20, # trial_timeout_minutes: Maximum number of minutes one trial can take.
    max_trials=5, # max_trials: Maximum number of trials, or models that will be trained.
    enable_early_termination=True # enable_early_termination: Whether to end the experiment if the score isn't improving in the short term.
)
# Note: You can run multiple trials in parallel to save time. On a compute cluster, you can run as many parallel trials as the maximum number of nodes.
# Note: To cap the number of parallel trials below the number of nodes, set max_concurrent_trials.

# Set optional training properties
my_classification_job.set_training(
    blocked_training_algorithms=["logistic_regression"], 
    enable_onnx_compatible_models=True
)
# Submit the AutoML job
returned_job = ml_client.jobs.create_or_update(
    my_classification_job
)
returned_job.services["Studio"].endpoint

# Monitor Auto ML job runs 
aml_url = returned_job.studio_url
print("Monitor your job at", aml_url)

IMP NOTE:

To create a reference to a data source location, use the Input class; the data remains in the source location. The Output class is similar, but for the outputs of components.

NOTE: Set the featurization mode to "off" if you don't want automated machine learning to make changes to the data (no preprocessing). The target is the column you want to predict. The primary metric is the performance score for which you want to optimize the model; for example, set primary_metric to 'AUC_weighted' to optimize for the AUC_weighted metric.
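
A minimal sketch of turning featurization off on the classification job defined earlier (the set_featurization call is an SDK v2 method of the AutoML job; my_classification_job comes from the snippet above):

# Turn off automatic featurization so AutoML does not preprocess the data
my_classification_job.set_featurization(mode="off")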

(B) Build & Train Models using Azure Automated Machine Learning - Lab 4*

(C) Training Model by using Python SDK*



Tuning Hyperparameters in Azure Machine Learning*

Parameters are values learned from the training data during training. Hyperparameters are values used to configure training behavior but which are not derived from the training data. Hyperparameters are top-level settings you configure before running the ML algorithm, such as the train-test ratio, number of epochs, and batch size.

Types of Hyperparameters*

Discrete Hyperparameters:

{
    "batch_size": choice(1, 2, 3, 4),
    "number_of_hidden_layers": choice(range(1, 5))
}

Continuous Hyperparameters:

{
    "learning_rate": normal(10, 3),
    "keep_probability": uniform(0.05, 0.1)
}
from azure.ai.ml.sweep import Choice, Normal

command_job_for_sweep = command_job(
    batch_size=Choice(values=[16, 32, 64]),    
    learning_rate=Normal(mu=10, sigma=3),
)


from azure.ai.ml.sweep import Uniform, Choice

command_job_for_sweep = command_job(
    learning_rate=Uniform(min_value=0.05, max_value=0.1),  # continuous hyperparameter
    batch_size=Choice(values=[16, 32, 64, 128]),           # discrete hyperparameter
)

sweep_job = command_job_for_sweep.sweep(
    compute="cpu-cluster",
    sampling_algorithm="bayesian",
    primary_metric="accuracy",
    goal="Maximize",
)
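
To complete the picture, a hedged sketch of capping the sweep and submitting it (assuming an authenticated MLClient named ml_client; the limit values are placeholders):

# Limit the total number of trials, parallelism, and overall runtime (seconds)
sweep_job.set_limits(max_total_trials=20, max_concurrent_trials=4, timeout=7200)

# Submit the sweep job
returned_sweep_job = ml_client.create_or_update(sweep_job)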

Early Termination Policies*

Early Termination Policies: A strategy used to halt the training of new models when they do not yield significantly better results than previously trained models. This increases efficiency and reduces cost by stopping underperforming trials early. Generally used when working with continuous hyperparameters and the random or Bayesian sampling methods.

Bandit Policy:

from azure.ai.ml.sweep import BanditPolicy
  
sweep_job.early_termination = BanditPolicy(
    slack_amount = 0.2, 
    delay_evaluation = 5, 
    evaluation_interval = 1
)
# Suppose that after the first five trials, the best-performing model has an accuracy of 0.9.
# With slack_amount = 0.2, any new model must score better than 0.9 - 0.2 = 0.7 for its trial to continue.
# If a new model's accuracy is below 0.7, the policy terminates that trial.

Median Stopping Policy:

from azure.ai.ml.sweep import MedianStoppingPolicy
  
sweep_job.early_termination = MedianStoppingPolicy(
    delay_evaluation = 5, 
    evaluation_interval = 1
)
# Suppose the median accuracy score across trials so far is 0.82.
# A new model's accuracy should be higher than 0.82 for its trial to continue.
# If a new model's accuracy is below 0.82, the policy terminates that trial.

Truncation Selection Policy:

from azure.ai.ml.sweep import TruncationSelectionPolicy
  
sweep_job.early_termination = TruncationSelectionPolicy(
    evaluation_interval=1, 
    truncation_percentage=20, 
    delay_evaluation=4 
)
# For example, if the primary metric is accuracy, a trial continues only if its accuracy is not in the worst 20% of all trials so far.
# If the fifth trial has the lowest accuracy among all trials, that trial is terminated.

Use a sweep job for hyperparameter tuning*

Steps to Optimize Hyperparameters

| Notebooks | Scripts |
| --- | --- |
| Designed for interactive training | Intended for automated training |
| Contain other content | Stripped down to (mostly) code |
| Contain alternative model ideas | Run settled models |
| On-demand code (run specific cells as needed) | On-command functions (execute without supervision) |
| Awkward to deploy in CI/CD | Enable CI/CD processes |

Run Training Script as Command Job in Azure ML*
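
A minimal sketch (SDK v2); the script folder, curated environment name, and compute name are assumptions, and ml_client is an authenticated MLClient:

from azure.ai.ml import command

# Configure a command job that runs a training script on a compute cluster
job = command(
    code="./src",  # folder containing train.py
    command="python train.py --reg_rate ${{inputs.reg_rate}}",
    inputs={"reg_rate": 0.01},
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",
    compute="cpu-cluster",
    display_name="train-model",
    experiment_name="train-model",
)

# Submit the job and print the studio URL to monitor it
returned_job = ml_client.create_or_update(job)
print(returned_job.studio_url)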

Responsible AI Dashboard*


Run Pipelines in Azure ML

Create Components

Components: Allow you to create reusable scripts that can easily be shared across users within the same ML workspace. Use components to build ML pipelines.

A component consists of 3 parts:

To create a component, you need 2 files:

Code to create a Component*
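
A minimal sketch (SDK v2), assuming prep.yml is a component specification YAML that points at its Python script, and ml_client is an authenticated MLClient:

from azure.ai.ml import load_component

# Load the component from its YAML specification
prep_component = load_component(source="./prep.yml")

# Register the component so it can be shared and reused in the workspace
ml_client.components.create_or_update(prep_component)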

Create Pipelines

A pipeline is a workflow of ML tasks in which each task is defined as a component. A pipeline can be executed as a process by running it as a pipeline job; each component is executed as a child job of the overall pipeline job.

Build a Pipeline and Run a Pipeline Job*
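
A minimal sketch, assuming prep_component and train_component are loaded components whose input/output names (input_data, training_data, output_data, model_output) are placeholders, as is the data asset name:

from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.dsl import pipeline

@pipeline()
def my_pipeline(pipeline_input):
    # Each component call becomes a child job of the pipeline job
    prep = prep_component(input_data=pipeline_input)
    train = train_component(training_data=prep.outputs.output_data)
    return {"trained_model": train.outputs.model_output}

# Point the pipeline at a registered data asset (name/version are placeholders)
pipeline_job = my_pipeline(Input(type=AssetTypes.URI_FILE, path="azureml:my-data:1"))
pipeline_job.settings.default_compute = "cpu-cluster"

# Run the pipeline as a pipeline job
returned_job = ml_client.jobs.create_or_update(pipeline_job, experiment_name="pipeline-example")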


Track Model with MLFlow in Jobs*

MLflow is an open-source platform designed to streamline the tracking of model metrics and artifacts, such as experiments and models. Scripts are the Azure ML development approach best suited to monitoring and evaluating MLflow models. MLflow is cloud-agnostic and language-agnostic; install it using pip install mlflow azureml-mlflow. An MLflow model can use no-code deployment in Azure ML. MLflow standardizes the packaging of models, which means a model can easily be imported or exported across different workflows.

IMP NOTE:

MLflow is the recommended tool for deploying a trained model without specifying an environment or a scoring script. The other options (Azure ML online endpoints, batch endpoints, and deployment using the ONNX runtime) all require both an environment and a scoring script. MLflow itself can be used without an Azure ML workspace.

(I) TRACK METRICS WITH MLFLOW

Two methods to track ML jobs with MLflow
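
The two methods are enabling autologging and using custom logging functions. A short sketch of both inside a training script (framework support for autologging is an assumption of the example):

import mlflow

# Method 1: autologging - the framework flavor logs parameters, metrics, and the model
mlflow.autolog()

# Method 2: custom logging - log values explicitly
with mlflow.start_run():
    mlflow.log_param("regularization_rate", 0.1)
    mlflow.log_metric("accuracy", 0.91)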

(II) VIEW METRICS AND EVALUATE MODELS

When the job is complete, you can explore and evaluate models:

IMP NOTE: To use MLflow to track local experiments

Install azureml-mlflow package -> Create Azure ML workspace -> Set up tracking experiment -> Set experiment name -> Start training run
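
A hedged sketch of those steps; the workspace config file and experiment name are placeholders:

import mlflow
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Point MLflow at the workspace's tracking URI
ml_client = MLClient.from_config(credential=DefaultAzureCredential())
tracking_uri = ml_client.workspaces.get(ml_client.workspace_name).mlflow_tracking_uri
mlflow.set_tracking_uri(tracking_uri)

# Set the experiment name and start a training run
mlflow.set_experiment("local-experiment")
with mlflow.start_run():
    mlflow.log_metric("accuracy", 0.90)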

(III) REGISTER MLFLOW MODEL IN AZURE ML*

When you train and log a model, all relevant artifacts are stored in a directory. When you register the model, an MLmodel file is created in that directory. The MLmodel file contains the model's metadata, which allows for model traceability. With autologging enabled, the model is logged when the .fit() method is called.

The framework used to train the model is identified and included as the flavor of your model.
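
A minimal sketch of registering the logged MLflow model from a completed job; the job variable (returned_job) and the model name are placeholders:

from azure.ai.ml.entities import Model
from azure.ai.ml.constants import AssetTypes

# Register the MLmodel folder produced by the job as an MLflow model asset
model = Model(
    path=f"azureml://jobs/{returned_job.name}/outputs/artifacts/paths/model/",
    type=AssetTypes.MLFLOW_MODEL,
    name="my-registered-model",
)
ml_client.models.create_or_update(model)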

(I) Customize the model signature

The model signature defines the schema of the model's inputs and outputs. The signature is stored in JSON format in the MLmodel file, along with the model's other metadata. A model signature can be created in 2 ways:*
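
The two ways are inferring the signature from example data or constructing it by hand from input/output specs. A hedged sketch of inference (model, X_train, and the sklearn flavor are placeholders/assumptions):

import mlflow
from mlflow.models.signature import infer_signature

# Infer the input/output schema from example data and predictions
signature = infer_signature(X_train, model.predict(X_train))
mlflow.sklearn.log_model(model, artifact_path="classifier", signature=signature)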

(II) Understand the MLflow model format

MLflow uses the MLmodel format to store all relevant model assets in a folder or directory. The MLmodel file is one of the files in that directory.


artifact_path: classifier # During the training job, the model is logged to this path.
flavors: # The machine learning library with which the model was created.
  fastai:
    data: model.fastai
    fastai_version: 2.4.1
  python_function:
    data: model.fastai
    env: conda.yaml
    loader_module: mlflow.fastai
    python_version: 3.8.12
model_uuid: e694c68eba484299976b06ab9058f636 # The unique identifier of the registered model.
run_id: e13da8ac-b1e6-45d4-a9b2-6a0a5cfac537 # The unique identifier of job run during which the model was created.
signature: # Specifies the schema of the model's inputs and outputs:
# inputs: Valid input to the model. For example, a subset of the training dataset.
# outputs: Valid model output. For example, model predictions for the input dataset.
  inputs: '[{"type": "tensor",
             "tensor-spec": 
                 {"dtype": "uint8", "shape": [-1, 300, 300, 3]}
           }]'
  outputs: '[{"type": "tensor", 
              "tensor-spec": 
                 {"dtype": "float32", "shape": [-1,2]}
            }]'

(III) Choose the flavor

A flavor is the machine learning library with which the model was created. The Python function flavor is the default model interface for models created from an MLflow run. Any MLflow Python model can be loaded as a python_function model, which allows workflows like deployment to work with any Python model regardless of the framework used to produce it.
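
A minimal sketch of loading via the python_function flavor; the model name/version and the input DataFrame are placeholders:

import mlflow.pyfunc

# Any MLflow model can be loaded through the generic pyfunc interface
loaded_model = mlflow.pyfunc.load_model("models:/my-registered-model/1")
predictions = loaded_model.predict(input_df)  # input_df must match the signature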

artifact_path: pipeline
flavors:
  python_function:
    env:
      conda: conda.yaml
      virtualenv: python_env.yaml
    loader_module: mlflow.sklearn
    model_path: model.pkl
    predict_fn: predict
    python_version: 3.8.5
  sklearn:
    code: null
    pickled_model: model.pkl
    serialization_format: cloudpickle
    sklearn_version: 1.2.0
mlflow_version: 2.1.0
model_uuid: b8f9fe56972e48f2b8c958a3afb9c85d
run_id: 596d2e7a-c7ed-4596-a4d2-a30755c0bfa5
signature:
  inputs: '[{"name": "age", "type": "long"}, {"name": "sex", "type": "long"}, {"name":
    "cp", "type": "long"}, {"name": "trestbps", "type": "long"}, {"name": "chol",
    "type": "long"}, {"name": "fbs", "type": "long"}, {"name": "restecg", "type":
    "long"}, {"name": "thalach", "type": "long"}, {"name": "exang", "type": "long"},
    {"name": "oldpeak", "type": "double"}, {"name": "slope", "type": "long"}, {"name":
    "ca", "type": "long"}, {"name": "thal", "type": "string"}]'
  outputs: '[{"name": "target", "type": "long"}]'

There are two types of signatures: tensor-based (as in the first MLmodel example above, where inputs are described by tensor specs) and column-based (as in the second example, where each input is a named column).


Deploy and Consume Models with Azure ML*

(A) Deploy Model to MANAGED ONLINE ENDPOINT (REAL-TIME ENDPOINT)

Endpoint: An HTTPS endpoint to which you can send data and which returns a response (almost) immediately. Any data sent to the endpoint serves as the input for the scoring script hosted on the endpoint. The scoring script loads the trained model to predict the label for the new input data, which is called inferencing. The label is then part of the output that's returned.

Types of online endpoint:

To deploy a model to a managed online endpoint, you need to specify 4 things:

NOTE: BLUE/GREEN DEPLOYMENT


Create an online endpoint using ManagedOnlineEndpoint class

from azure.ai.ml.entities import ManagedOnlineEndpoint # using ManagedOnlineEndpoint class

# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name="endpoint-example", # Name of the endpoint. Must be unique in the Azure region.
    description="Online endpoint",
    auth_mode="key", # Use key for key-based authentication. Use aml_token for Azure Machine Learning token-based authentication.
)

ml_client.begin_create_or_update(endpoint).result()

(I) DEPLOY MLFLOW MODEL TO A MANAGED ONLINE ENDPOINT*
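
A minimal sketch (SDK v2); the local model path, instance type, and traffic split are assumptions:

from azure.ai.ml.entities import Model, ManagedOnlineDeployment
from azure.ai.ml.constants import AssetTypes

# An MLflow model needs no scoring script or environment in the deployment
model = Model(path="./model", type=AssetTypes.MLFLOW_MODEL)

blue_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="endpoint-example",
    model=model,
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(blue_deployment).result()

# Route all traffic to the new deployment
endpoint.traffic = {"blue": 100}
ml_client.begin_create_or_update(endpoint).result()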

(II) DEPLOY MODEL TO A MANAGED ONLINE ENDPOINT WITHOUT USING Mlflow MODEL FORMAT*

TEST MANAGED ONLINE ENDPOINTS
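
A minimal sketch of testing a deployment (the JSON request file is a placeholder):

# Send a test request to a specific deployment behind the endpoint
response = ml_client.online_endpoints.invoke(
    endpoint_name="endpoint-example",
    deployment_name="blue",
    request_file="sample-data.json",
)
print(response)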

(B) Deploy Model to BATCH ENDPOINT

Endpoint: It is an HTTPS endpoint that you can call to trigger a batch scoring job from another service like Azure Synapse Analytics or Azure Databricks. A batch endpoint allows you to integrate the batch scoring with an existing data ingestion and transformation pipeline.

Note: Models that require complex, time-consuming scoring jobs can be deployed to batch endpoints, since they don't require an immediate response. Anything conducted on an hourly, daily, or weekly schedule is most likely best suited to a batch endpoint, because batch endpoints are designed for asynchronous, higher-latency tasks.

Create a batch endpoint using the BatchEndpoint class.

from azure.ai.ml.entities import BatchEndpoint

# Batch endpoint names need to be unique within an Azure region
# create a batch endpoint
endpoint = BatchEndpoint(
    name="endpoint-example",
    description="A batch endpoint",
)

ml_client.batch_endpoints.begin_create_or_update(endpoint)

Deploy model to a batch endpoint

Multiple models can be deployed to a batch endpoint; the default deployment is used unless otherwise specified. Whenever you call the batch endpoint, a batch scoring job is triggered.

from azure.ai.ml.entities import AmlCompute

cpu_cluster = AmlCompute(
    name="aml-cluster",
    type="amlcompute",
    size="STANDARD_DS11_V2",
    min_instances=0,
    max_instances=4,
    idle_time_before_scale_down=120,
    tier="Dedicated",
)

cpu_cluster = ml_client.compute.begin_create_or_update(cpu_cluster)

(I) DEPLOY MLFLOW MODEL TO A BATCH ENDPOINT*
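
A minimal sketch (SDK v2); the deployment name, compute, and batch settings are assumptions, and model is a registered MLflow model:

from azure.ai.ml.entities import BatchDeployment, BatchRetrySettings
from azure.ai.ml.constants import BatchDeploymentOutputAction

deployment = BatchDeployment(
    name="classifier-mlflow",
    endpoint_name=endpoint.name,
    model=model,  # an MLflow model, so no scoring script is needed
    compute="aml-cluster",
    instance_count=2,                 # nodes used for the scoring job
    max_concurrency_per_instance=2,   # parallel scoring processes per node
    mini_batch_size=2,                # files passed to each scoring call
    output_action=BatchDeploymentOutputAction.APPEND_ROW,
    output_file_name="predictions.csv",
    retry_settings=BatchRetrySettings(max_retries=3, timeout=300),
)
ml_client.batch_deployments.begin_create_or_update(deployment)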

(II) DEPLOY A CUSTOM MODEL TO A BATCH ENDPOINT WITHOUT USING MLflow MODEL FORMAT*

INVOKE AND TROUBLESHOOT BATCH ENDPOINTS

1. Invoke batch endpoint:


2. Trigger the batch scoring job:

from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

input = Input(type=AssetTypes.URI_FOLDER, path="azureml:new-data:1")

job = ml_client.batch_endpoints.invoke(
    endpoint_name=endpoint.name, 
    input=input)

3. Monitor run of pipeline job in Azure ML Studio in Jobs tab:


4. Troubleshoot a batch scoring job:

The batch scoring job runs as a pipeline job. To troubleshoot the pipeline job, review its details in the Outputs + logs tab of the pipeline job. To troubleshoot the scoring script, go to the child job -> Outputs + logs tab -> the logs/user/ folder, which contains three files that help with troubleshooting.


IMP:

Trigger Workflows on Github

GitHub Actions can also be used for MLOps to automate Azure ML tasks, such as data processing, model training, and deployment. Workflows are defined using YAML and stored in the .github/workflows directory of your code repository.
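
A minimal sketch of such a workflow file (e.g., .github/workflows/train.yml); the job YAML, resource names, and the AZURE_CREDENTIALS secret are assumptions:

name: Train model
on:
  push:
    branches: [main]
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}
      - name: Submit Azure ML training job
        run: az ml job create --file src/job.yml --resource-group my-rg --workspace-name my-ws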


You can use Azure Event Grid to set up a trigger that sends an email to the data science team when data drift is detected on an Azure Machine Learning pipeline. Create a topic for the pipeline and a subscription with Logic Apps as the event handler to send the email.


Some Tricky Microsoft Practice Assessment Qs IMP