Ksama Arora

Microsoft Certified: Azure Data Fundamentals (DP-900)

Aug 21, 2024

Contents

Azure Storage Services:

Azure Data Services:

SQL and Database Services:

Data Integration and Processing Services:

Business Intelligence and Big Data Tools:

Types of Cloud Computing

SaaS:

PaaS:

IaaS:

Azure Data Roles:

  1. Database Administrator: Configures and maintains a database (e.g., Azure Data services or SQL server).

    • Responsibilities:
      • Database management
      • Manage security, granting user access
      • Backups
      • Monitors performance
    • Common Tools:
      • Azure Data Studio
      • SQL Server Management Studio
      • Azure Portal
      • Azure CLI
  2. Data Engineer: Designs and implements data tasks related to the transfer and storage of big data.

    • Responsibilities:
      • Database pipelines and processes
      • Data ingestion and storage
      • Prepare data for analytics
      • Prepare data for analytical processing
    • Common Tools:
      • Azure Synapse Studio
      • SQL
      • Azure CLI
  3. Data Analyst: Analyzes business data to reveal important information.

    • Responsibilities:
      • Provides insights into the data
      • Visual reporting
      • Modeling data for analysis
      • Combines data for visualization and analysis
    • Common Tools:
      • Power BI Desktop
      • Power BI Portal
      • Power BI Services
      • Power BI Report Builder

Database Administrator Tools:

  1. Azure Data Studio:
    • Connect to Azure SQL, Azure data warehouse, Postgres SQL, and SQL Server (big data clusters, on-premises).
    • Graphical interface for managing on-premises and cloud-based data services.
    • Runs on Windows, macOS, and Linux.
    • Extensions for automation tasks.
    • Possibly a replacement for SSMS (though SSMS is still more mature).
  2. SQL Server Management Studio (SSMS):
    • Tooling for running SQL commands or managing database operations.
    • Graphical interface for on-premises and cloud-based services.
    • Runs on Windows.
    • More mature than Azure Data Studio.
  3. Azure Portal and CLI:
    • Manage SQL database configurations (e.g., create, delete, resize, number of cores).
    • Automate tasks with Azure Resource Manager templates (IaC).

Data Engineering Common Tools:

  1. Azure Synapse Studio:
    • Integrated portal for managing Azure Synapse, data ingestion (Azure Data Factory), and Synapse assets (SQL Pools/Spark Pool).
  2. Knowledge SQL:
    • Create databases, tables, views, etc.
  3. Azure CLI:
    • Support operations SQL cmd to connect to Microsoft server.
    • Run SQL queries and commands.
  4. HDInsights:
    • Streaming data via Apache Kafka or Apache Spark.
    • ELT jobs via HIVE, PIG, Apache Spark.
  5. Azure Databricks:
    • Apache Spark for ELT or streaming jobs to data warehouses or data lakes.

Data Analyst Common Tools:

  1. Power BI Desktop:
    • Standalone application for data visualization.
    • Data modeling and connecting to data sources.
    • Create interactive reports.
  2. Power BI Portal/Power BI Services:
    • Web UI for creating interactive dashboards.
  3. Power BI Report Builder:
    • Create paginated reports (printable reports).

Explore core data concepts

Different Data Formats

Common File Formats

Delimited text files

FirstName,LastName,Email
Joe,Jones,joe@litware.com
Samir,Nadoy,samir@northwind.com

Other common formats:

JavaScript Object Notation (JSON)

// Customer1
{
  "firstName": "Joe",
  "lastName": "Jones",
  "address":
  {
    "streetAddress": "1 Main St.",
    "city": "New York",
    "state": "NY",
    "postalCode": "10099"
  },
  "contact":
  [
    {
      "type": "home",
      "number": "555 123-1234"
    },
    {
      "type": "email",
      "address": "joe@litware.com"
    }
  ]
}

Extensible Markup Language (XML)

<Customers>
  <Customer name="Joe" lastName="Jones">
    <ContactDetails>
      <Contact type="home" number="555 123-1234"/>
      <Contact type="email" address="joe@litware.com"/>
    </ContactDetails>
  </Customer>
  <Customer name="Samir" lastName="Nadoy">
    <ContactDetails>
      <Contact type="email" address="samir@northwind.com"/>
    </ContactDetails>
  </Customer>
</Customers>

Binary Large Object (BLOB)

Optimized file formats:

FormatTypeDeveloped ByKey Features
AvroRow-based formatApacheEach record has a JSON header describing the structure of the binary data, good for data compression, and minimizes storage and network bandwidth requirements.
ORCColumnar data formatHortonWorksOptimizes read and write operations in Apache Hive, contains stripes of data for columns, includes statistical info (count, sum, max, min) in footers.
ParquetColumnar data formatCloudera and TwitterContains row groups with column data, includes metadata for efficient retrieval of specified columns, specializes in nested data types, supports efficient compression.

Databases

Relational Databases: commonly used to store and query structured data - data stored in tables that represent entities, such as customers, products, or sales orders. Each instance of an entity is assigned a primary key that uniquely identifies it - and these keys are used to reference the entity instance in other tables.

Non-Relational Databases (NoSQL database): are data management systems that don’t apply a relational schema to data.

Transactional Data Processing

Transactional systems are high-volume systems designed to process data quickly. The work performed by these systems is known as Online Transactional Processing (OLTP).

OLTP solutions rely on database systems optimized for both read and write operations to support transactional workloads. These workloads involve creating, retrieving, updating, and deleting (CRUD) operations on data records.

OLTP systems adhere to the ACID principles to ensure reliable transaction processing:

ACID PrincipleDescriptionExample
AtomicityEach transaction is treated as a single unit that either fully succeeds or fully fails.A transaction involving debiting funds from one account and crediting the same amount to another must complete both actions or neither.
ConsistencyTransactions transition the database from one valid state to another.The final state of a transaction must accurately reflect the transfer of funds between accounts.
IsolationConcurrent transactions do not interfere with each other and result in a consistent database state.While funds are being transferred, another transaction checking the balance must show consistent results, without mixing pre- and post-transfer states.
DurabilityOnce a transaction is committed, it remains so, even in the event of a system failure.After the account transfer is complete, the updated balances are preserved even if the database system is turned off and back on.

Analytical Data Processing

  1. Operational Data Extraction: Data is extracted, transformed, and loaded (ETL) into a data lake for analysis.
  2. Schema Locking: Data is organized into schemas of tables, typically in a Spark-based data lakehouse with tabular abstractions over files in the data lake, or in a data warehouse with a fully relational SQL engine.
  3. OLAP Models: Data in the data warehouse may be aggregated and loaded into an online analytical processing (OLAP) model, or cube. Aggregated numeric values (measures) from fact tables are calculated for intersections of dimensions from dimension tables. For example, sales revenue might be totaled by date, customer, and product.
  4. Querying and Reporting: Data in the data lake, data warehouse, and analytical model can be queried to produce reports, visualizations, and dashboards.

Types of Data Storage:

OLAP Models:

Common Data Professional Roles

RoleResponsibilitiesCollaborationSecurity/PrivacyTechnologies/Management
Database Administrator- Design, implement, maintain, and operate database systems.
- Manage both on-premises and cloud-based databases.
- Ensure availability and optimize performance.
- Work with stakeholders for policies and processes.
- Develop backup and recovery plans.
- Manage data security.
- Control user access.
N/A
Data Engineer- Design and implement data workloads.
- Develop ingestion pipelines and data stores.
N/A- Ensure data privacy from on-premises to cloud.- Utilize relational and non-relational databases.
- Manage and monitor data pipelines.
Data Analyst- Maximize value of data assets.
- Identify trends and relationships.
- Build analytical models.
- Create reports and visualizations.
N/AN/A- Process raw data into insights.
- Deliver business-relevant insights.

Common Cloud Data Services

Explore relational data in azure

Relational Data

Relational data structures real-world entities as tables, with rows representing instances of these entities. Each row has the same columns, but not every cell needs a value. A row in a table signifies a single instance of an entity.

Normalization

Normalization minimizes data duplication and enforces data integrity by:

For instance, customer, product, sales order, and line item entities are stored in separate tables. Primary keys uniquely identify each row, while foreign keys reference related entities. Composite keys, which are unique combinations of multiple columns, can also be used. For example LineItem table uses combination of OrderNo and ItemNo. Screenshot-2024-07-25-at-1-07-57-PM.png

SQL (Structured Query Language)

SQL is the standard language for managing relational databases like Microsoft SQL Server, MySQL, PostgreSQL, MariaDB, and Oracle.

SQL Statement Types

(I) DATA DEFINITION LANGUAGE (DDL)

Screenshot-2024-07-25-at-1-09-21-PM.png

(II) DATA CONTROL LANGUAGE (DCL)

Screenshot-2024-07-25-at-1-09-50-PM.png

(III) Data Manipulation Language (DML)

Screenshot-2024-07-25-at-1-10-21-PM.png

Defining Database Objects

Views

A view is a virtual table based on a SELECT query

CREATE VIEW Deliveries AS
SELECT o.OrderNo, o.OrderDate, c.FirstName, c.LastName, c.Address, c.City
FROM Order AS o
JOIN Customer AS c ON o.Customer = c.ID;

Stored Procedures

Stored procedures encapsulate SQL statements for specific actions.

CREATE PROCEDURE RenameProduct @ProductID INT, @NewName VARCHAR(20) AS
UPDATE Product SET Name = @NewName WHERE ID = @ProductID;

Execute with:

EXEC RenameProduct 201, 'Spanner';

Indexes

An index helps to search for data in a table. When you create an index in a database, you specify a column from the table, and the index contains a copy of this data in a sorted order, with pointers to the corresponding rows in the table.

E.g. code that creates index on Name column of the Product table

CREATE INDEX idx_ProductName ON Product(Name);

Note: Indexes consume storage and can slow down insert, update, and delete operations due to maintenance overhead.