Ksama Arora
Run a Training Script as a Command Job in Azure ML
Step 1: Convert a Notebook to a Script
Scripts are easy to test and automate, which makes them ideal for production. To convert a notebook into a production-ready script:
- Remove nonessential code: Exclude exploratory code such as print() and df.describe() calls to reduce compute time and cost.
- Refactor into functions: Break the code into smaller functions so each part can be tested independently and only the required parts are run.
Example notebook code to read and split data:
import pandas as pd

# read and visualize the data
print("Reading data...")
df = pd.read_csv('diabetes.csv')
df.head()

# split data
print("Splitting data...")
X, y = df[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, df['Diabetic'].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
Refactored into two functions, get_data (read the data) and split_data (split the data):
- Note: A script consisting of multiple functions is the best fit for production workloads. E.g. a data scientist has trained a model in a notebook and the model should be retrained every week on new data; a script of functions makes that retraining easy to automate.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

def main(csv_file):
    # read data
    df = get_data(csv_file)

    # split data
    X_train, X_test, y_train, y_test = split_data(df)

# function that reads the data
def get_data(path):
    df = pd.read_csv(path)
    return df

# function that splits the data
def split_data(df):
    X, y = df[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, df['Diabetic'].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
    return X_train, X_test, y_train, y_test

# run script
if __name__ == "__main__":
    main('diabetes.csv')
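The script imports LogisticRegression and the ROC metrics but the training step itself isn't shown above. A minimal sketch of what the remaining functions might look like, to be called from main() after split_data (the train_model and eval_model names and the hyperparameter value are illustrative, not from the source):
# function that trains the model (illustrative sketch)
def train_model(X_train, y_train):
    # C is the inverse of the regularization rate; 0.01 is an example value
    model = LogisticRegression(C=1/0.01, solver="liblinear")
    model.fit(X_train, y_train)
    return model

# function that evaluates the model (illustrative sketch)
def eval_model(model, X_test, y_test):
    # score the test set and report the area under the ROC curve
    y_scores = model.predict_proba(X_test)
    auc = roc_auc_score(y_test, y_scores[:, 1])
    print('AUC:', auc)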
Test the script in the terminal:
- Open the notebook page of Azure ML Studio -> save the script and run it in the terminal, OR
- Go to the Compute page -> select the terminal of a compute instance -> run the Python script (here named train.py) with the command:
python train.py
Step 2: Run a Script as a Command Job
from azure.ai.ml import command

# Configure job
job = command(
    code="./src",                                                    # folder containing the script
    command="python train.py",                                       # specifies the file to run
    environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest",  # required packages
    compute="aml-cluster",                                           # compute resource
    display_name="train-model",                                      # job name
    experiment_name="train-classification-model"                     # experiment name
)
# Submit job
returned_job = ml_client.create_or_update(job)
aml_url = returned_job.studio_url
print("Monitor your job at", aml_url)
Step 3: Use Parameters in Command Job
Increase a script's flexibility by using parameters, so values such as the input data path can be changed without editing the code.
Using script arguments:
# import libraries
import argparse  # library for parsing script arguments
import pandas as pd
from sklearn.linear_model import LogisticRegression

def main(args):
    # read data
    df = get_data(args.training_data)

# function that reads the data
def get_data(path):
    df = pd.read_csv(path)
    return df

def parse_args():
    # setup arg parser
    parser = argparse.ArgumentParser()

    # add arguments
    parser.add_argument("--training_data", dest='training_data', type=str)

    # parse args
    args = parser.parse_args()

    # return args
    return args

# run script
if __name__ == "__main__":
    # parse args
    args = parse_args()

    # run main function
    main(args)
Passing arguments to a script:
python train.py --training_data diabetes.csv
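Further arguments can be added the same way. A hedged sketch of parse_args() extended with a numeric hyperparameter that has a default value (the --reg_rate name is illustrative, not from the source):
import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--training_data", dest='training_data', type=str)
    # illustrative: numeric hyperparameter, optional thanks to the default
    parser.add_argument("--reg_rate", dest='reg_rate', type=float, default=0.01)
    return parser.parse_args()
The script could then be invoked as, for example, python train.py --training_data diabetes.csv --reg_rate 0.1.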
Configure command job with arguments:
from azure.ai.ml import command
job = command(
code="./src",
command="python train.py --training_data diabetes.csv",
environment="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest",
compute="aml-cluster",
display_name="train-model",
experiment_name="train-classification-model"
)
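Submitting the parameterized job works exactly as in Step 2:
# Submit job
returned_job = ml_client.create_or_update(job)
print("Monitor your job at", returned_job.studio_url)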