Ksama Arora
Step 1: Open a notebook with a running Azure ML kernel
Step 2: Create an ML client and a datastore (code covered earlier; a brief recap follows)
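For reference, a minimal sketch of that setup, assuming the azure-ai-ml SDK v2 and a default credential chain (e.g. after az login); all names are placeholders:
# Minimal client setup sketch (assumes azure-ai-ml SDK v2; placeholder names)
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
# Authenticate using the default credential chain (CLI login, managed identity, etc.)
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id='<subscription_id>',
    resource_group_name='<resource_group>',
    workspace_name='<workspace>',
)
# Fetch an existing datastore registered in the workspace
datastore = ml_client.datastores.get('<datastore>')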
Step 3: Build a URI or grab it directly from the Studio UI
# Azure Machine Learning workspace details:
subscription = '<subscription_id>'
resource_group = '<resource_group>'
workspace = '<workspace>'
datastore_name = '<datastore>'
path_on_datastore = '<path>'
# Long-form Datastore URI format:
uri = f'azureml://subscriptions/{subscription}/resourcegroups/{resource_group}/workspaces/{workspace}/datastores/{datastore_name}/paths/{path_on_datastore}'
OR go to Data -> the specific datastore -> select the CSV file -> copy URI (Datastore URI or Storage URI)
Note: The Datastore URI only resolves inside Azure ML, while the Storage URI is a more generic storage endpoint that can also be used outside Azure ML.
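For illustration, a Storage URI for the same file might look like this (placeholder names; the exact form depends on whether the account is Blob storage or ADLS Gen2):
# Generic Storage URI format (usable outside Azure ML as well):
# https://<storage_account>.blob.core.windows.net/<container>/<path>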
Step 4: Load a Pandas Dataframe
# Import pandas library
import pandas as pd
# Populate dataframe "my_dataframe" by passing the URI built above to the pandas read_csv method
my_dataframe = pd.read_csv(uri)
# Preview the dataframe by returning the first 1000 rows
my_dataframe.head(1000)
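Note that pandas resolves azureml:// URIs through fsspec; curated Azure ML kernels ship the driver, but on a custom kernel you would likely need to install it yourself (an assumption worth verifying for your environment):
# Install the fsspec driver for azureml:// URIs (preinstalled on curated Azure ML kernels)
%pip install azureml-fsspec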
Step 5: Wrangle - Replace Missing Strings
# Fill missing values in the "Claim Network Status" column with "Unknown" and update the dataframe in place
my_dataframe.fillna(
    value={"Claim Network Status": "Unknown"}, inplace=True)
# Fill missing values in the "Payment Status" column with "Unknown" and update the dataframe in place
my_dataframe.fillna(
    value={"Payment Status": "Unknown"}, inplace=True)
# Return the first 1000 rows of the dataframe
my_dataframe.head(1000)
Step 6: Wrangle - Delete Rows With Any Missing Values
# Drop rows with any missing values and update the dataframe in place
my_dataframe.dropna(inplace=True)
# Return the first 1000 rows of the dataframe
my_dataframe.head(1000)
Wrangling Data with Apache Spark
Step 1: Create a compute to power the notebook - Compute instance / Synapse Spark Pool / Azure ML Serverless Spark
Step 2: Open a notebook; here we use Azure ML Serverless Spark as the compute for ease of use
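As a quick sanity check that the notebook is attached to a live Spark session (these Spark-attached notebooks typically expose a preconfigured SparkSession named spark):
# The attached Spark compute exposes a ready-made SparkSession named `spark`
print(spark.version)  # prints the Spark runtime version if the session is live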
Step 3: Build a URI or grab it from the Studio UI
Step 4: Load a PySpark Pandas Dataframe
# Import the pandas API on Spark, conventionally aliased as ps
import pyspark.pandas as ps
my_dataframe = ps.read_csv(uri)
my_dataframe.head(1000)
Step 5: Wrangle - Replace Missing Strings (code given above)
Step 6: Wrangle - Delete Rows With Any Missing Values (code given above; both steps are recapped below)
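For completeness, the same calls from Steps 5 and 6 work unchanged on a pyspark.pandas dataframe (reusing the column names from the pandas example above):
# Replace missing strings, then drop rows that still have any missing values
my_dataframe.fillna(value={"Claim Network Status": "Unknown"}, inplace=True)
my_dataframe.fillna(value={"Payment Status": "Unknown"}, inplace=True)
my_dataframe.dropna(inplace=True)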
Step 7: Wrangle - Remove Duplicate Rows
# Drop duplicate rows and update the dataframe in place
my_dataframe.drop_duplicates(inplace=True)
# Sort the dataframe by its index and update the dataframe in place
my_dataframe.sort_index(inplace=True)
# Return the first 1000 rows of the dataframe
my_dataframe.head(1000)
Step 8: Save the Transformed Data
# Save the dataframe to a CSV file at the specified Azure Data Lake Storage path
my_dataframe.to_csv(
"abfss://<FILE_SYSTEM_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/data/wrangled"
)
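Note: unlike plain pandas, pyspark.pandas writes the output as a directory of CSV part files at that path (one per partition) rather than a single file, since each Spark executor writes its partition in parallel.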