Database Administrator Tools:
Data Engineering Common Tools:
Data Analyst Common Tools:
Explore core data concepts
Structured data: fixed schema - all data have same fields or properties - mostly tabular format
Semi-structured data: has some structure, but will allow for some variation between instances - common format for semi-structured data is JavaScript Object Notation (JSON)
Unstructured data - e.g. docs, images, audio and video data and binary files
Delimited text files
- It is a comma-separated values (CSV) in which fields are separated by commas, and rows are terminated by carriage return/new line
- First line may include field names
- Delimited text is good for structured data that needs to be accessed by a wide range of applications and services in a human-readable format
- E.g.
FirstName,LastName,Email
Joe,Jones,joe@litware.com
Samir,Nadoy,samir@northwind.com
Other common formats:
JavaScript Object Notation (JSON)
- It is a format in which hierarchical document schema is used to define data entities (objects) that have multiple attributes
- used for both structured and semi-structured data
- E.g.
// Customer1
{
"firstName": "Joe",
"lastName": "Jones",
"address":
{
"streetAddress": "1 Main St.",
"city": "New York",
"state": "NY",
"postalCode": "10099"
},
"contact":
[
{
"type": "home",
"number": "555 123-1234"
},
{
"type": "email",
"address": "joe@litware.com"
}
]
}
Extensible Markup Language (XML)
- It is a human-readable format
- XML uses tags enclosed in angle-brackets (<../>) to define elements and attributes
- E.g.
<Customers>
<Customer name="Joe" lastName="Jones">
<ContactDetails>
<Contact type="home" number="555 123-1234"/>
<Contact type="email" address="joe@litware.com"/>
</ContactDetails>
</Customer>
<Customer name="Samir" lastName="Nadoy">
<ContactDetails>
<Contact type="email" address="samir@northwind.com"/>
</ContactDetails>
</Customer>
</Customers>
Binary Large Object (BLOB)
- All files are ultimately stored as binary data (1’s and 0’s), but in human-readable formats, bytes of binary data are mapped to printable characters (ASCII or Unicode)
- Common data stored as binary include images, video, audio, and application-specific documents
Format | Type | Developed By | Key Features |
---|---|---|---|
Avro | Row-based format | Apache | Each record has a JSON header describing the structure of the binary data, good for data compression, and minimizes storage and network bandwidth requirements. |
ORC | Columnar data format | HortonWorks | Optimizes read and write operations in Apache Hive, contains stripes of data for columns, includes statistical info (count, sum, max, min) in footers. |
Parquet | Columnar data format | Cloudera and Twitter | Contains row groups with column data, includes metadata for efficient retrieval of specified columns, specializes in nested data types, supports efficient compression. |
Relational Databases: commonly used to store and query structured data - data stored in tables that represent entities, such as customers, products, or sales orders. Each instance of an entity is assigned a primary key that uniquely identifies it - and these keys are used to reference the entity instance in other tables.
Non-Relational Databases (NoSQL database): are data management systems that don’t apply a relational schema to data.
Key-value databases: each record consists of unique key and associated value, which can be in any format
Document databases: specific form of key-value database in which the value is a JSON document (which the system is optimized to parse and query)
Column family databases: tabular data comprising of rows and columns, but can divide columns into groups known as column-families (holds a set of columns that are logically related together)
Graph Databases: which stores entities as nodes with links to define relationships between them
Transactional systems are high-volume systems designed to process data quickly. The work performed by these systems is known as Online Transactional Processing (OLTP).
OLTP solutions rely on database systems optimized for both read and write operations to support transactional workloads. These workloads involve creating, retrieving, updating, and deleting (CRUD) operations on data records.
OLTP systems adhere to the ACID principles to ensure reliable transaction processing:
ACID Principle | Description | Example |
---|---|---|
Atomicity | Each transaction is treated as a single unit that either fully succeeds or fully fails. | A transaction involving debiting funds from one account and crediting the same amount to another must complete both actions or neither. |
Consistency | Transactions transition the database from one valid state to another. | The final state of a transaction must accurately reflect the transfer of funds between accounts. |
Isolation | Concurrent transactions do not interfere with each other and result in a consistent database state. | While funds are being transferred, another transaction checking the balance must show consistent results, without mixing pre- and post-transfer states. |
Durability | Once a transaction is committed, it remains so, even in the event of a system failure. | After the account transfer is complete, the updated balances are preserved even if the database system is turned off and back on. |
Types of Data Storage:
OLAP Models:
Role | Responsibilities | Collaboration | Security/Privacy | Technologies/Management |
---|---|---|---|---|
Database Administrator | - Design, implement, maintain, and operate database systems. - Manage both on-premises and cloud-based databases. - Ensure availability and optimize performance. | - Work with stakeholders for policies and processes. - Develop backup and recovery plans. | - Manage data security. - Control user access. | N/A |
Data Engineer | - Design and implement data workloads. - Develop ingestion pipelines and data stores. | N/A | - Ensure data privacy from on-premises to cloud. | - Utilize relational and non-relational databases. - Manage and monitor data pipelines. |
Data Analyst | - Maximize value of data assets. - Identify trends and relationships. - Build analytical models. - Create reports and visualizations. | N/A | N/A | - Process raw data into insights. - Deliver business-relevant insights. |
Explore relational data in azure
Relational Data
Relational data structures real-world entities as tables, with rows representing instances of these entities. Each row has the same columns, but not every cell needs a value. A row in a table signifies a single instance of an entity.
Normalization
Normalization minimizes data duplication and enforces data integrity by:
For instance, customer, product, sales order, and line item entities are stored in separate tables. Primary keys uniquely identify each row, while foreign keys reference related entities. Composite keys, which are unique combinations of multiple columns, can also be used. For example LineItem table uses combination of OrderNo and ItemNo.
SQL is the standard language for managing relational databases like Microsoft SQL Server, MySQL, PostgreSQL, MariaDB, and Oracle.
SELECT, INSERT, UPDATE, DELETE, CREATE, DROP
SQL Statement Types
CREATE TABLE Product (
ID INT PRIMARY KEY,
Name VARCHAR(20) NOT NULL,
Price DECIMAL NULL
);
GRANT SELECT, INSERT, UPDATE
ON Product
TO user1;
SELECT * FROM Customer WHERE City = 'Seattle';
SELECT o.OrderNo, o.OrderDate, c.Address, c.City
FROM Order AS o
JOIN Customer AS c
ON o.Customer = c.ID
INSERT INTO Product(ID, Name, Price)
VALUES (99, 'Drill', 4.99);
UPDATE Customer SET Address = '123 High St.' WHERE ID = 1;
DELETE FROM Product WHERE ID = 162;
Views
A view is a virtual table based on a SELECT query
CREATE VIEW Deliveries AS
SELECT o.OrderNo, o.OrderDate, c.FirstName, c.LastName, c.Address, c.City
FROM Order AS o
JOIN Customer AS c ON o.Customer = c.ID;
Stored Procedures
Stored procedures encapsulate SQL statements for specific actions.
CREATE PROCEDURE RenameProduct @ProductID INT, @NewName VARCHAR(20) AS
UPDATE Product SET Name = @NewName WHERE ID = @ProductID;
Execute with:
EXEC RenameProduct 201, 'Spanner';
Indexes
An index helps to search for data in a table. When you create an index in a database, you specify a column from the table, and the index contains a copy of this data in a sorted order, with pointers to the corresponding rows in the table.
E.g. code that creates index on Name column of the Product table
CREATE INDEX idx_ProductName ON Product(Name);
Note: Indexes consume storage and can slow down insert, update, and delete operations due to maintenance overhead.