Relational Databases — RDB Management System

Tanja Adžić
27 min read · Dec 27, 2023


Introduction

A relational database is a type of database that allows the definition of data structures, storage and retrieval operations, and relationships between data. It organizes data into tables consisting of rows and columns: each row in a table represents a record, and each column represents an attribute of that record.

The relational database management system (RDBMS) is the software that manages the relational database. It provides an interface for interacting with the database, allowing users to define, create, and manipulate the data.

Photo by Kelly Sikkema on Unsplash

A strong foundation for any effective relational database solution starts with a well-thought-out design and implementation strategy. Such a design ensures that users and applications relying on the data can be confident in the following aspects:

  • Accuracy: Is the data reliably accurate even as new information is incorporated or existing data is modified?
  • Ease of Access: Is the data structured in a manner that facilitates swift, straightforward, and predictable querying and maintenance?
  • Reliability: Does the database design guarantee data integrity, fostering consistent and dependable data?
  • Flexibility: Is the design adaptable, allowing for seamless updates or expansions to meet evolving data requirements?

Contents

  1. Fundamental Relational Database Concepts
  2. Relational Database Products
  3. Creating Tables and Loading Data
  4. Designing Keys, Indexes, and Constraints

Fundamental Relational Database Concepts

In today’s data-centric world, understanding the nature of data is crucial. Data, which is essentially unorganized information, becomes meaningful through processing. It takes various forms, such as facts, observations, perceptions, numbers, characters, symbols, and images, or a combination of these elements.

Types of data?

The classification of data is based on its structural characteristics, leading to three main categories:

  • Structured data — has a well-defined schema and a rigid organization into rows and columns; best suited for relational databases.
  • Semi-structured data — has some organizational properties but does not conform to a tabular structure; relies on hierarchies, tags, and metadata.
  • Unstructured data — complex data with no specific structure or format; typically stored in NoSQL databases.

Data sources?

The modern data landscape consists of a plethora (such a cool word) of sources, from traditional databases and flat files to XML datasets, web scraping, data streams, feeds, and information gathered from social platforms and Internet of Things (IoT) devices with sensors. This diverse data can be stored, processed, and analyzed to provide valuable insights into business performance.

File formats?

Data comes in various file formats for storage and transfer. Common formats include delimited text files (e.g., CSV, TSV), spreadsheets, and text-based data formats like XML and JSON. Each format serves specific purposes: XML is readable by both humans and machines, platform independent, and programming-language neutral. JSON is widely used for sharing data of any size and type, making it a popular choice for APIs and web services.

Data repositories?

Once data is collected, the next question is where to store it. Structured and semi-structured data typically find their home in databases, whether relational, like DB2, or non-relational, like MongoDB. The choice between Online Transaction Processing (OLTP) systems, optimized for day-to-day operational data, and Online Analytical Processing (OLAP) systems, tailored for complex data analytics, depends on the type of data and operations required.

Relational databases, part of OLTP systems, store structured data in related tables, minimizing data duplication while maintaining complex relationships. Examples of Relational Database Management Systems (RDBMS) include: IBM DB2, Microsoft SQL Server, Oracle, and MySQL.

Relational databases play a crucial role in supporting daily business activities, such as customer transactions and workflows, and also serve for data analysis, such as making sales projections based on data from customer relationship management systems. In essence, a well-designed data strategy involves understanding the characteristics of data, choosing appropriate storage solutions, and leveraging various file formats for effective data management.

Data Models

Information Model vs. Data Model:

An Information Model is an abstract, formal representation of entities that includes their properties, their relationships, and the operations that can be performed on them. The entities being modeled can originate from the real world, such as those found in a library. It operates at a conceptual level, defining relationships between objects. In contrast, Data Models are more concrete, providing specific details and serving as the blueprint for any database system.

Hierarchical Information Model:

Among the various Information Models, the Hierarchical model is commonly used to illustrate organizational charts. Visualized in a tree structure, it features a root node, parent nodes, and child nodes.

Hierarchical Model, source: Wikipedia

The interesting part is that this model traces its origins back to the Information Management System released by IBM in 1968, initially designed for the Apollo Space Program.

Relational Model:

The Relational Model is the most widely used data model for databases. It facilitates data independence by storing information in simple data structures — tables. This model offers logical, physical, and storage independence, making it a preferred choice in database systems.

Entity-Relationship Data Model (ER Model):

An alternative to the relational data model, the ER Model is explored using a simplified library database as an example. An Entity-Relationship Diagram (ERD) showcases entities (tables) and their relationships:

ER Diagram, source: Wikipedia

Attributes, in essence, provide additional details or information about an entity. Consider the entity “Account.” This entity possesses various attributes that contribute to a comprehensive description of the account. These attributes may include the account holder’s address, phone number, name, and password — each providing specific information about the account.

It’s important to note that each attribute is uniquely connected to a specific entity. In the case of our “Account” entity, each attribute, such as the address, phone number, name, or password, is associated with and describes that particular account.

Now, in the database implementation, the entity “Account” is translated into a table. This table serves as a structured storage unit for organizing and managing the data related to accounts. The attributes of the “Account” entity, namely the address, phone number, name, password, etc., become the columns of this table. Each column represents a specific attribute, and each row in the table corresponds to a distinct account entity, encapsulating the values of its attributes.

ERDs and Types of Relationship

The building blocks of a relationship are:

  • Entities — usually depicted as rectangles, they are the foundational elements.
  • Relationship sets — usually depicted as diamonds, they connect entities through lines.
  • Crow’s foot notation — one of several ways to represent the cardinality of relationships between entities, drawn with symbols such as < | >.

An Entity-Relationship diagram (ER diagram) exemplifies these concepts using the entity “Book” as an illustration. Below, the entity “Book” is portrayed as a rectangle, and its attributes, such as title, edition, year, and price, are depicted as ovals. Each attribute is intricately linked to exactly one entity, contributing to a detailed understanding of the entity’s properties.

source: course link below

Next to the “Book” entity, there is also an “Author” entity.

As one author can have many books, and books can be written by many authors, the nature of these relationships can be represented with:

  • One-to-one relationship: a book written by exactly one author
  • One-to-many relationship: one book written by many authors
  • Many-to-many relationship: many authors writing many books

Mapping Entities to Tables

Entity-Relationship Diagrams (ERDs) form the core of database design. In relational design, start with an ERD, or ER diagram, and then map it to the database tables. Let’s illustrate this using the example of the “Book” entity.

The “Book” entity, containing various attributes, undergoes mapping to transform into a table. To simplify, we separate the entity from its attributes, resulting in a table named “Book” where each attribute becomes a column.

In the Relational database model, a table consists of rows and columns. During mapping, the entity seamlessly becomes the table. At this point, the table hasn’t assumed the structure of rows and columns; it is when the attributes are translated into columns that the structured format takes shape.

To complete the table, add data values to each column. This process is applicable to other entities, such as the “Author,” where the entity transforms into a table, and attributes become columns. Inserting relevant data values completes the table.
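To make the mapping concrete, here is a minimal sketch of the “Book” entity as a table; the column data types are assumptions for illustration and are not taken from the diagram, and keys are added later in this post:

CREATE TABLE book (
    title    VARCHAR(100),   -- the attribute "title" becomes a column
    edition  INT,
    year     INT,
    price    DECIMAL(7, 2)
);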

Data Types

A database table represents a single entity, and its columns denote attributes. Consider a table like “Book,” where columns may include title, published date, and pages. Ensuring uniform data entry by defining data types is crucial. For instance, the Title column should store textual data, Publish Date a date, and Pages a number.

Here are some of the common data types:

  • Character strings, integers, decimals, and date/time types organize and control data. Fixed-length strings, denoted as CHAR(10), allocate a set amount of space regardless of the actual data length, while variable-length strings (VARCHAR) specify only a maximum length.
  • Numeric types encompass integers, smallints, bigints, and decimals. Integer types hold whole numbers, with smallints and bigints providing flexibility in size and range. Decimal types accommodate both whole and decimal numbers.
  • Date/time data include dates, times, and timestamps, each with specific formats.
  • Other data types involve Booleans (True/False), binary strings (for media data), large objects (LOBs), and XML data types.

Choosing appropriate data types avoids incorrect data insertion and allows for:

  • data integrity
  • data sorting
  • range selection
  • data calculations, and
  • utilization of standard functions.

Defining data types ensures efficient and meaningful database operations, enhancing overall data integrity and functionality.
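As a small illustration of these types, the following hypothetical table uses one column per category; the names and sizes are arbitrary, and not every type (for example BOOLEAN) is available in every RDBMS:

CREATE TABLE book_catalog (
    book_code     CHAR(5),          -- fixed-length character string
    title         VARCHAR(100),     -- variable-length character string
    pages         SMALLINT,         -- small whole numbers
    copies_sold   BIGINT,           -- very large whole numbers
    price         DECIMAL(7, 2),    -- numbers with a decimal part
    published     DATE,             -- date only
    last_updated  TIMESTAMP,        -- date and time
    in_print      BOOLEAN           -- true/false
);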

Relational Model Concepts

The relational model, proposed in 1970, is rooted in a mathematical framework. Its fundamental building blocks are sets and relations.

A relation is akin to a mathematical set: an unordered collection of unique elements of the same type, with no duplicates. In the context of databases, a relational database comprises a set of relations, with each relation corresponding to a table.

A relation, synonymous with a table, is made up of rows and columns. It comprises two integral parts: the Relation Schema and the Relation Instance.

The Relation Schema delineates the relation’s name and attributes (columns), specifying their names and data types. For instance, the entity Author exemplifies a Relation Schema, with attributes like Author_ID, last name, first name, email, city, and country, each associated with a specific data type.

The Relation Instance constitutes a table with rows (tuples) and columns (attributes or fields). Degree pertains to the count of attributes in a relation, while Cardinality signifies the number of tuples or rows.

Relational Database Products

Database Architecture

The choice of a deployment topology for your database hinges on its usage and accessibility. Here are several options:

  1. Single-Tier Architecture:
  • Ideal for small databases with limited user access, often deployed on a local desktop.
  • Access is restricted to a single user, making it suitable for development, testing, or embedded local applications.
  2. Client-Server Architecture (2-Tier):
  • Suited for larger databases accessed by many users.
  • The database resides on a remote server, and users access it through a client system, usually via a web page or local application.
  • Commonly used in multi-user scenarios, with the option of introducing a middle-tier (application server layer) between clients and the database server.
  3. Cloud Deployment:
  • Increasingly popular, where the database exists in a Cloud environment.
  • Offers advantages like no need for software downloads, infrastructure maintenance, and accessibility from anywhere with an internet connection.
  • Flexibility for development, testing, and full production environments.
  4. Client-Server Architecture (3-Tier):
  • Also known as 3-Tier architecture.
  • Involves three layers: presentation layer (user interface), business logic layer, and data layer (database).
  • Clients interact with an application server, which, in turn, communicates with the database server through a database API or driver.
  • Widely used in production environments, providing separation of presentation and business logic layers.

In a 2-Tier database architecture, the client application connects to the database server through an interface like an API. The server includes layers for data access, the database engine, and database storage. Meanwhile, in a 3-Tier architecture, a middle tier, such as a web application server, intervenes between clients and the database server. This approach enhances security and is standard in contemporary production environments.

Distributed Architecture and Clustered Databases

For critical or large-scale workloads requiring high availability and scalability, major RDBMSes offer distributed architectures. The primary types are Shared Disk and Shared Nothing architectures, utilizing techniques such as replication or partitioning. Some architectures combine these methods or integrate specialized hardware for improved availability and scalability.

Shared Disk Architecture:

  • Multiple database servers process workloads in parallel, enhancing processing speed.
  • Each server is linked to shared storage and each other through high-speed interconnects.
  • Allows workload distribution among servers for scalability.
  • Enables rerouting clients to other servers in case of a server failure, ensuring high availability.

Database Replication:

  • Changes on a database server are replicated to one or more database replicas.
  • Improves performance by distributing client workloads across servers.
  • High Availability replicas within the same location redirect clients in case of a primary server failure.
  • Disaster Recovery replicas, in geographically distributed locations, handle site-wide disasters (fire, flood, etc.).

Partitioning (Sharding):

  • Tables with extensive data can be partitioned into logical segments, such as sales records for different quarters.
  • When these partitions are placed on separate nodes in a cluster, it’s called Sharding.
  • Each shard has its compute resources (Processing, Memory, Storage) to handle its data subset.
  • Queries issued by clients are processed in parallel on multiple nodes or shards, enhancing performance.
  • Additional shards and nodes can be added to the cluster for increased parallel processing with growing data or query workloads.

Database Partitioning and Sharding are commonly used for data warehousing and business intelligence workloads involving substantial data volumes.

Database Usage Patterns

There are three main classes of database users: Data Engineers and Database Administrators; Data Analysts, Data Scientists, and Business Analysts; and Application Developers.

Data Engineers and Database Administrators (DBAs):

  • Perform administrative tasks like creating and managing database objects, setting access controls, monitoring, and performance tuning.
  • Use GUI or web-based management tools, often provided by database vendors or third-party alternatives.
  • Employ command line interfaces and utilities for specific tasks, utilizing commands, interactive shells, or SQL scripts and batch files.
  • Leverage APIs for programmatic interfaces, invoking administrative tasks from applications and tools.

Data Analysts, Data Scientists, Business Analysts, and Business Intelligence Analysts:

  • Access databases for data analysis, deriving insights, and making data-driven predictions.
  • Primarily engage in read-only access to existing data sources.
  • Occasionally create database objects and populate them with data, especially in sandbox environments.
  • Use tools like Jupyter, R Studio, Zeppelin, SAS, SPSS for data science, and Excel, IBM Cognos, PowerBI, Tableau, MicroStrategy for reporting and BI.
  • Interface with relational databases using SQL interfaces and APIs, with some tools abstracting away the need for direct SQL usage.
  • May utilize SQL Query tools for ad-hoc querying.

Application Developers:

  • Rarely access databases directly, focusing on creating applications that require read and write access.
  • Develop applications using programming languages like C++, C#, Java, JavaScript, .Net, PHP, Perl, Python, and Ruby.
  • Communicate with databases using SQL interfaces and APIs like ODBC and JDBC.
  • Some databases, especially cloud-based ones, offer REST APIs for data access.
  • Modern programmers often use object-relational mapping (ORM) frameworks, like ActiveRecord in Ruby, Django in Python, Entity Framework in .NET, Hibernate in Java, and Sequelize in JavaScript, to simplify working with databases and SQL.

Relational Database Short History

The relational database concept emerged in the 1960s with the IBM Sabre seat-reservation system used by American Airlines, and was given formal rules by Edgar F. Codd in the early 1970s. Ingres and System R appeared in the late 1970s, and the 1980s marked their commercial success, with Db2 becoming IBM's flagship and SQL becoming the standard query language.

In the 1990s, client tools like Oracle Developer gained popularity, along with personal productivity tools like Excel. Open source databases such as MySQL and PostgreSQL rose in the 2000s, challenging commercial ones. The 2010s witnessed the surge of cloud databases, with leaders like Amazon RDS and Microsoft SQL Azure.

Large corporations favored commercial databases, but open source licensing gained traction in the late 2000s with MySQL, PostgreSQL, and SQLite. The DB Engines ranking reveals the current popularity, with open source databases slightly ahead at 50.1%.

Cloud databases, accessed through cloud platforms, have become increasingly popular. They offer scalability, aligning with the Software-as-a-Service (SaaS) model. Leading cloud databases include Amazon DynamoDB, Microsoft Azure Cosmos DB, Google BigQuery, and Amazon Redshift. Gartner, Inc. predicts that by 2022, 75% of all databases will be deployed or migrated to a cloud platform.

DB2

IBM introduced Database 2 (DB2) in 1983, marking an early relational database management system. Initially for IBM mainframes, it expanded to OS/2, UNIX, Linux, and Windows. Over iterations, DB2 evolved into a suite, including Db2 Database, Db2 Warehouse, Db2 on Cloud, and more.

Db2 products, powered by AI, simplify data management with machine learning algorithms, column stores, and data skipping. A unified SQL engine across Db2 ensures query compatibility. The family supports all data types, enhancing corporate decision-making.

Db2 provides replication for high availability and disaster recovery. Scalability options include on-premises extensions, cloud deployments with flexible power and storage scaling, and Database Partitioning Feature for parallel processing in Db2 Warehouse.

The Db2 family encompasses various data management solutions:

  1. Db2 Database: Enterprise-ready on-premises RDBMS optimized for OLTP.
  2. Db2 Warehouse: On-premises data warehouse for advanced analytics and machine learning.
  3. Db2 on Cloud: Fully managed cloud-based SQL database with performance, scalability, and resilience.
  4. Db2 Warehouse on Cloud: Elastic cloud-based data warehouse with features akin to on-premises Db2 Warehouse.
  5. Db2 Big SQL: SQL-on-Hadoop engine offering parallel processing and advanced querying.
  6. Db2 Event Store: Memory-optimized database for ingesting and analyzing streamed data.
  7. Db2 for z/OS: Enterprise data server for IBM Z supporting analytics, mobile, and cloud integration.
  8. Cloud Pak for Data: An integrated data and AI platform running on Red Hat OpenShift for working with and managing all data across private, public, or hybrid clouds.

Db2 on Cloud Plans:

  1. Lite Plan: Free, time-unlimited, with 200 MB data limit and 15 simultaneous connections.
  2. Standard Plan: Flexible scaling, storage, and three-node high availability clustering.
  3. Enterprise Plan: Dedicated database instance with flexible scaling and three-node high availability clustering.

Db2 on Cloud can be deployed on IBM Cloud or Amazon Web Services, accessible via CLPPlus, GUI console, or standard APIs.

High Availability Disaster Recovery (HADR): Replicates changes to up to three standby servers, facilitating automatic promotion in case of primary database failure.

Db2 Warehouse Scaling: Easily scales storage by adding or removing nodes, automatically rebalancing partitions and workloads.

MySQL

MySQL, initially developed by MySQL AB, a Swedish company founded by Monty Widenius, gained its name from My, Widenius’ daughter. Sun Microsystems acquired MySQL AB, and later, Oracle Corporation acquired Sun Microsystems. The recognizable dolphin in the MySQL logo is named Sakila, chosen through a naming contest.

In the late 1990s and early 2000s, MySQL’s popularity surged, especially within the LAMP stack (Linux, Apache, MySQL, PHP), a vital component for constructing numerous web platforms.

MySQL is dual-licensed: open source under GNU GPL and commercial for applications embedding MySQL. Notably, MariaDB emerged as a prominent fork, led by original MySQL developers.

MySQL is a relational database management system, offered in diverse flavors and editions, including a clustered version for demanding workloads. It runs on UNIX, Windows, and Linux, supporting various programming languages for client applications. MySQL employs standard SQL syntax with additional extensions like the LOAD DATA statement for rapid data import.
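As a rough sketch of the LOAD DATA extension, the following might bulk-load a CSV file into an existing table; the file path, table name, and options are placeholders, and the exact clauses depend on the file layout and server configuration:

-- Hypothetical bulk import of a CSV file into an author table
LOAD DATA INFILE '/tmp/authors.csv'
INTO TABLE author
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;   -- skip the header row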

Primarily handling relational data, MySQL also supports JSON. Multiple storage engines are available, each tailored to specific workloads. InnoDB, the default engine, ensures data consistency, supports row-level locking for improved multi-user performance, and offers features like clustered indexes and foreign key constraints. MyISAM suits read-heavy workloads, while the NDB engine supports high availability through clustering for applications requiring redundancy.

MySQL provides options for high availability and scalability. Replication allows data copies on one or more replicas, distributing the read load and enhancing scalability. It also boosts availability, enabling failover to replicas in case of source database failure. Two clustering options exist: one utilizing InnoDB with group replication for read-write primary and multiple secondary servers, and the other leveraging the NDB engine for high availability and scalability with multiple server and data nodes. MySQL’s versatility, broad compatibility, and support for various workloads make it a popular choice in the database landscape.

PostgreSQL

PostgreSQL, originating from the POSTGRES project at the University of California, Berkeley, over 30 years ago, has a rich history of use in research and production applications across diverse industries. The open-source nature of PostgreSQL, derived from Postgres95 in 1994, allows users to freely modify and distribute the source code, making it adaptable to specific business requirements.

Pronounced as “Postgres,” it seamlessly integrates into the LAPP stack (Linux, Apache, PostgreSQL, and PHP) for web applications. Independent extensions, such as PostGIS, enhance functionality, especially in handling geographic and spatial data.

PostgreSQL is a free, open-source, object-relational database management system. Its object-oriented features support inheritance and overloading, facilitating design simplicity and object reuse. Compatible with various operating systems, its low maintenance requirement makes it organization-friendly. Supporting multiple programming languages and ANSI SQL standards, PostgreSQL provides a versatile database solution.

Utilizing standard relational database constructs like keys, transactions, views, functions, and stored procedures, PostgreSQL also incorporates NoSQL functionality such as JSON for structured data and HSTORE for non-hierarchical data. The system supports replication for high availability, including two-node synchronous replication and multi-node asynchronous replication for scalability.
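As a small sketch of the JSON support mentioned above (the table and field names are made up for illustration):

-- Storing and querying semi-structured data alongside relational columns
CREATE TABLE book_metadata (
    book_id  INT PRIMARY KEY,
    details  JSONB
);

INSERT INTO book_metadata (book_id, details)
VALUES (1, '{"pages": 320, "tags": ["databases", "sql"]}');

-- Extract a single field from the JSON document as text
SELECT details ->> 'pages' AS pages
FROM book_metadata
WHERE book_id = 1;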

For further flexibility in scaling applications, commercial editions like EDB PostgreSQL Replication Server offer multi-master read/write replication. This allows the operation of multiple read/write databases replicating changes with each other, ensuring availability in case of failures.

Recent releases of PostgreSQL introduced features like partitioning, enabling the division of large tables into smaller sections for improved query performance. Additionally, sharding allows horizontal partition storage across multiple remote servers, enhancing scalability when working with large datasets. These advancements solidify PostgreSQL as a robust and scalable database solution.
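A rough sketch of PostgreSQL's declarative range partitioning; the table, column, and date ranges below are illustrative only:

-- The parent table is divided into per-quarter partitions
CREATE TABLE sales (
    sale_id    BIGINT,
    sale_date  DATE NOT NULL,
    amount     DECIMAL(10, 2)
) PARTITION BY RANGE (sale_date);

CREATE TABLE sales_2023_q1 PARTITION OF sales
    FOR VALUES FROM ('2023-01-01') TO ('2023-04-01');

CREATE TABLE sales_2023_q2 PARTITION OF sales
    FOR VALUES FROM ('2023-04-01') TO ('2023-07-01');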

Creating Tables and Loading Data

Types of SQL statements (DDL vs. DML)

SQL Statements serve as the means to interact with entities (tables), attributes (columns), and their tuples (rows with data values) in relational databases. These statements are categorized into two main types: Data Definition Language (DDL) statements and Data Manipulation Language (DML) statements.

DDL statements, responsible for defining, changing, or dropping database objects like tables, encompass common types such as: CREATE, ALTER, TRUNCATE, and DROP.

  • CREATE is used to create tables and define their columns,
  • ALTER for modifying tables (adding, dropping columns, modifying data types),
  • TRUNCATE for deleting data without removing the table, and
  • DROP for deleting tables.

DML statements are employed for reading and modifying data in tables, often referred to as CRUD operations (Create, Read, Update, and Delete). Common DML statement types consist of:

  • INSERT (for adding rows of data to a table),
  • SELECT (for reading or selecting rows from a table),
  • UPDATE (for editing rows in a table), and
  • DELETE (for removing rows of data from a table).
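Taken together, the four DML statements cover the CRUD cycle. A minimal sketch, using the provinces table created later in this post:

-- Create: add a row
INSERT INTO provinces (id, name) VALUES ('AB', 'Alberta');

-- Read: select rows
SELECT id, name FROM provinces WHERE id = 'AB';

-- Update: edit a row
UPDATE provinces SET name = 'Alberta' WHERE id = 'AB';

-- Delete: remove a row
DELETE FROM provinces WHERE id = 'AB';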

Creating Tables

When creating a table, consider some essential information. Firstly, decide on the schema where the table will reside. Schemas, found in many relational databases, organize objects logically. For instance, in IBM Db2 on Cloud, objects like tables and views are stored in user schemas. Ensure you have the necessary details for table creation, including a table name and the name and data type of each column. Consider whether a column permits duplicate or null values. Use your Entity Relationship Diagram for guidance.

Tables can be created through visual interfaces, SQL statements like CREATE TABLE in scripts, or administrative APIs for programmatic creation. The following steps are based on Db2 on Cloud, but the concepts apply broadly.

  1. Choosing a Schema:
  • Use the Db2 on Cloud console.
  • Choose a schema (e.g., CQC63405) to host the table. Default is usually the username.
  • New schemas can be created to organize objects.
  2. Creating a Table:
  • Select “New table.”
  • Name your table (e.g., Employee Details).
  • Default column exists; rename as needed.
  • Set data types and add columns using “Add column.”
  • Specify null values, length, and scale based on the data type.
  • Click “Create.”
  3. Post-Creation Actions:
  • Options post-creation include dropping the table, generating SQL code for SELECT, INSERT, UPDATE, DELETE actions, altering the table (add new columns, set constraints), and viewing dependent database objects.

While Db2 on Cloud is used in this example, these principles are applicable to various databases.

CREATE TABLE Statement

The syntax for creating a table involves using the CREATE TABLE statement followed by the table name. The subsequent attributes, enclosed in parentheses, define the columns with their respective data types and optional values. Each column definition is separated by a comma.

For instance, consider creating a table for Canadian provinces:

CREATE TABLE provinces (
    id   CHAR(2) PRIMARY KEY NOT NULL,
    name VARCHAR(24)
);

Here, “id” is a fixed-length character string (CHAR) of length 2, serving as the primary key. “name” is a variable-length character string (VARCHAR) with a maximum length of 24 characters.

Now, let’s delve into a more complex example using the Library database. We’ll create a table for the “AUTHOR” entity, with attributes like “AUTHOR_ID,” “FIRSTNAME,” “LASTNAME,” etc. In this case, “Author_ID” is designated as the primary key to ensure uniqueness.

CREATE TABLE author (
    author_id  CHAR(2) PRIMARY KEY NOT NULL,
    lastname   VARCHAR(15) NOT NULL,
    firstname  VARCHAR(15) NOT NULL,
    email      VARCHAR(40),
    city       VARCHAR(15),
    country    CHAR(2)
);

Here, “Author_ID” is the primary key, preventing duplicate values. “Last Name” and “First Name” are constrained with “NOT NULL” to ensure they contain valid data, as an author must have a name.

ALTER, DROP, and Truncate tables

To modify the structure of a table, the ALTER TABLE statement is used. Unlike the CREATE TABLE statement, the ALTER TABLE statement doesn't enclose its parameters in parentheses. Each clause in the ALTER TABLE statement represents a change to be made to the table.

For instance, to add a “telephone_number” column to the “AUTHOR” table in the Library database, you can use:

ALTER TABLE author ADD COLUMN telephone_number BIGINT;

Here, the data type for the new column is BIGINT, capable of holding a number up to 19 digits long.

To modify the data type of an existing column, the ALTER COLUMN clause is utilized. For example, altering the “telephone_number” column to use the CHAR data type:

ALTER TABLE author ALTER COLUMN telephone_number SET DATA TYPE CHAR(20);

It’s crucial to note that altering the data type of a column with existing data could lead to issues if the data is not compatible with the new type.

If the specifications change, and the extra column is no longer needed, the ALTER TABLE statement with the DROP COLUMN clause is employed to remove the column:

ALTER TABLE author DROP COLUMN telephone_number;

To delete a table from a database, the DROP TABLE statement is used:

DROP TABLE author;

Dropping a table also removes all of its data. If you only want to delete the data without removing the table itself, the TRUNCATE TABLE statement is more efficient than deleting the rows individually:

TRUNCATE TABLE author IMMEDIATE;

The “IMMEDIATE” keyword ensures that the truncation is processed immediately and cannot be undone.

Data Movement Utilities

Data engineers and Database administrators often need to move data in and out of databases for various reasons. This could include tasks like populating the entire database, creating a working copy for development, taking snapshots for disaster recovery, or creating new tables from external data sources. There are three broad categories of tools and utilities for data movement: Backup and Restore, Import and Export, and Load.

Backup and Restore:

  • The backup operation creates files encapsulating all database objects and data.
  • The restore operation recreates the original database from backup files.
  • Preserves all objects, including schemas, tables, views, data types, functions, etc.
  • Useful for disaster recovery and creating additional copies for development.

Import and Export:

  • Import reads data from a file and performs INSERT statements.
  • Export selects data from a table and saves it to a file.
  • Operations can be done through command-line utilities, management APIs, or graphical tools.
  • Common file formats: DEL (delimited ASCII), ASC (non-delimited ASCII), PC/IXF, JSON.

Load:

  • Load utility writes formatted pages directly into the database, faster than import.
  • Doesn’t perform referential or table constraints checking.
  • May bypass database logging for higher performance.
  • Preferred for very large volumes of data.

Examples:

  • In DB2, command line import/export utilities or load utilities can be used.
  • Exporting a table to a CSV file in the Db2 console involves selecting the table, viewing data, and choosing the export button.

For very large datasets, Load utilities are preferred due to their speed and ability to bypass certain checks. Import and Export operations are suitable for smaller datasets and provide flexibility in file formats. The choice of method depends on the specific requirements of the data movement task.
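As a rough illustration, these operations might look like the following from the Db2 command line; exact options vary by version, and the file and table names are placeholders:

-- Export the author table to a delimited (CSV-style) file
EXPORT TO authors.csv OF DEL SELECT * FROM author;

-- Import the file, issuing INSERT statements under the covers
IMPORT FROM authors.csv OF DEL INSERT INTO author;

-- Load the file directly into table pages (faster, with minimal logging and constraint checking)
LOAD FROM authors.csv OF DEL INSERT INTO author;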

Loading Data

Using SQL statements like INSERT is effective for adding data to tables, especially for small datasets during development and testing. However, for larger datasets involving hundreds or thousands of rows, this method becomes impractical. Relational Database Management Systems (RDBMS) typically offer a more efficient and scalable approach to directly load substantial amounts of data into tables.

Loading Data in Db2 Web Console:

  • The Load Data utility in Db2 Web Console facilitates the quick and efficient loading of data.
  • Suitable for large datasets originating from diverse sources in various formats.
  • For example, loading data from delimited text files, S3 object storage (Amazon Web Services), or Cloud Object Storage (IBM).

Example: Loading Data from a CSV File:

  1. In the Db2 Web Console, access the three-bar menu, navigate to the LOAD section, and click Load Data.
  2. Follow a four-step process: identify source data type and location, select the target, define configurable aspects, and confirm before starting the load.
  3. On the Source page, specify the source data location and provide authentication details if required (e.g., for IBM Cloud Object Storage).
  4. On the Target page, select the target schema and table, and choose to append data or overwrite existing data. (Note: Overwriting will result in the loss of existing data, even if the load fails. — Alternatively, load data into a new table using the New Table option)
  5. On the Define page, specify character encoding, delimiter, presence of column headings, and date/time formats.
  6. On the Finalize page, review all settings before initiating the data load.
  7. Upon completion, the console displays the process status, including any errors or warnings raised during the load.

This approach streamlines the process of loading large datasets into tables, enhancing efficiency and scalability compared to individual INSERT statements.

Designing Keys, Indexes, and Constraints

Database Objects & Hierarchy (Including Schemas)

Relational Database Management Systems (RDBMS) involve a hierarchical organization of various objects, essential for efficient management by Database Engineers and Administrators. This hierarchical structure aids in security, maintenance, and accessibility. While slight variations exist, this overview provides a generalized understanding of RDBMS structure.

Hierarchy Overview:

  1. Instance: Represents a logical boundary for a database or set of databases.
  • An instance can contain multiple databases.
  • Assigned a unique name with its system catalog tables and configuration files.
  2. Schema:
  • A logical grouping of objects within a database.
  • Defines object naming conventions and prevents ambiguous references.
  • Contains database objects such as tables, constraints, and indexes.
  3. Database Objects:
  • Tables, views, indexes, functions, triggers, and packages.
  • Can be system-defined or user-defined.
  • Relationships between tables enhance data integrity, reducing redundancy.

Instance Concept:

  • Logical Boundary: Organizes database objects and configuration parameters.
  • Multiple Instances: Enable unique server environments for different purposes.
  • Isolation: Objects within one instance are isolated from others.
  • Use Cases: Useful for development, production environments, access control, and administration.

Database Structure:

  1. Relational Database (RDB):
  • Set of objects for storing, managing, and accessing data.
  • Includes built-in and user-defined objects.
  2. Schema:
  • A specialized object that groups other objects logically.
  • Contains tables, views, triggers, functions, packages, etc.
  • Provides naming context, aiding in namespace conflict resolution.
  3. System Schema:
  • Holds configuration information and metadata.
  • Stores user permissions, index details, database partitions, and user-defined data types.
  4. Partitioned Relational Database:
  • Manages data across multiple partitions.
  • Used for scenarios involving extensive data, like data warehousing and business intelligence.

Database Objects:

  • Creation and Design:
  • Tables: Logical structures storing data in rows and columns.
  • Constraints: Enforce data rules (e.g., uniqueness).
  • Indexes: Improve performance and ensure data uniqueness.
  • Views: Represent data differently without permanent storage.
  • Aliases: Alternative names for objects, simplifying references.

Management and Creation:

  • Use graphical tools, scripting, or APIs.
  • SQL statements like CREATE or ALTER in Data Definition Language for object creation.
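For example, a minimal sketch of creating a schema and a few objects inside it; the names are illustrative, and the exact syntax for schema creation varies slightly between products:

CREATE SCHEMA library;

CREATE TABLE library.book (
    book_id  INT PRIMARY KEY NOT NULL,
    title    VARCHAR(100) NOT NULL
);

-- A view representing the same data differently, without permanent storage
CREATE VIEW library.book_titles AS
    SELECT book_id, title FROM library.book;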

Understanding this structured hierarchy is vital for effective RDBMS management and utilization by Database Engineers and Administrators.

Primary Keys and Foreign Keys

A primary key is pivotal in uniquely identifying each row in a table. When selecting a primary key, some tables naturally possess a unique attribute, such as a book ID or an employee ID. However, if the table lacks an inherent unique attribute, you can introduce a new column for this purpose. Alternatively, if a combination of two attributes guarantees uniqueness, a composite primary key across these columns is possible.

A table can only have one primary key. During table creation, use the PRIMARY KEY clause in the CREATE TABLE statement, specifying the column or columns serving as the primary key. If you need to add it after the table is created, use the ADD PRIMARY KEY clause in the ALTER TABLE statement, again providing the relevant column or columns:

-- Creating a table with a primary key during table creation
CREATE TABLE ExampleTable (
    ID   INT PRIMARY KEY,
    Name VARCHAR(255),
    Age  INT
);

-- Alternatively, adding a primary key after table creation
-- (assumes ExampleTable was created without a primary key)
ALTER TABLE ExampleTable
    ADD PRIMARY KEY (ID);

Primary and Foreign Keys: Defining Relationships

To articulate relationships between tables, primary and foreign keys come into play. A foreign key in one table refers to the primary key of another table. For instance, the Copy table might list every copy of a book owned by a library, which requires that the book_id of each copy corresponds to a valid entry in the Book table.

During table creation, integrate a foreign key using the CONSTRAINT <constraint_name> FOREIGN KEY clause in the CREATE TABLE statement. Specify the foreign key column and reference the associated table and primary key column. You can also dictate actions for updates or deletions in the parent table, using rules like taking no action, cascading deletions, or setting foreign key values to null.

Similarly, post-creation, adding a foreign key is achievable via the ADD FOREIGN KEY clause in the ALTER TABLE statement, specifying the pertinent details. When employing multiple columns for a primary key, remember to separate them with commas.
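A minimal sketch of the Copy/Book relationship described above; the table, column, and constraint names are illustrative:

-- Each copy must reference a valid book in the book table
CREATE TABLE copy (
    copy_id  INT PRIMARY KEY NOT NULL,
    book_id  INT NOT NULL,
    CONSTRAINT fk_copy_book FOREIGN KEY (book_id)
        REFERENCES book (book_id)
        ON DELETE CASCADE
);

-- Equivalent approach after the table already exists:
-- ALTER TABLE copy
--     ADD CONSTRAINT fk_copy_book FOREIGN KEY (book_id) REFERENCES book (book_id);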

Indexes

When adding data to a table, it typically gets appended to the end, but there’s no inherent order. When selecting a specific row, the processor may need to check each row sequentially, leading to slow retrieval on large tables. Without a specified sort order, selecting multiple rows may yield results in an unordered state. To efficiently address this, you can create an index on a table to quickly locate specific rows.

An index works by storing pointers to each row, allowing the SQL processor to rapidly find the desired row, similar to using an index in a book. By default, when creating a primary key, an index is automatically generated. However, you can also create custom indexes on frequently searched columns using the CREATE INDEX statement, specifying the index name, uniqueness, and the associated table and column.
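A short sketch of the CREATE INDEX statement; the index and column names are illustrative:

-- A unique index also enforces uniqueness of the indexed column
CREATE UNIQUE INDEX ix_author_email
    ON author (email);

-- A non-unique index on a frequently searched column
CREATE INDEX ix_book_title
    ON book (title);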

Indexes offer various benefits, including improved SELECT query performance on indexed columns, a reduced need to sort data, and guaranteed uniqueness of rows when the UNIQUE clause is applied. Despite these advantages, indexes come with drawbacks. Each index consumes disk space, and write operations like INSERT, UPDATE, and DELETE may become slower because every index on the table must also be updated.

Creating an index should be a strategic decision, considering the trade-off between advantages and disadvantages. It’s beneficial for tables with infrequent inserts or updates but regular SELECT queries and WHERE clauses. Overindexing a table can counteract performance benefits, akin to an unhelpful index that lists every word in a book.

Normalization

When managing data, such as book records in a bookshop, inconsistencies and duplicate information are common challenges. Duplication can lead to extra work and data inconsistencies during updates, necessitating changes in multiple places. Normalization, the process of organizing data to minimize redundancy, involves dividing larger tables into related ones. This enhances transaction speed by executing updates, additions, and deletes only once on a normalized database, promoting data integrity.

The normalization process focuses on achieving required normal form levels for each table. Key forms include:

  • First normal form (1NF),
  • Second normal form (2NF), and
  • Third normal form (3NF).

In 1NF, each row must be unique, and each cell should contain a single value. To illustrate, normalizing a Book table involves ensuring that each cell contains a single value, transforming it into 1NF.

Moving to 2NF, which builds upon 1NF, involves separating groups of values applying to multiple rows into new tables. This mitigates data duplication. For instance, a Book table containing format-related data for a specific book may be split to achieve 2NF. Identifying a Primary Key in one table and using it as a Foreign Key in another establishes a relationship between the tables.

In 3NF, which builds on 1NF and 2NF, columns that do not depend on the key are eliminated. Additional book data, like publisher and shipping location, may prompt the creation of a Publishers table. Both tables are now in 3NF, a common normalization level in relational databases.
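A rough sketch of the decomposition described above; the table and column names are illustrative, and the book table is assumed to keep only book-level columns plus a publisher_id foreign key:

-- Publisher facts move to their own table (3NF)
CREATE TABLE publisher (
    publisher_id   INT PRIMARY KEY NOT NULL,
    name           VARCHAR(50) NOT NULL,
    shipping_city  VARCHAR(30)
);

-- Format-level facts that repeat per book move to their own table (2NF)
CREATE TABLE book_format (
    book_id  INT NOT NULL,
    format   VARCHAR(20) NOT NULL,   -- e.g. 'hardcover', 'ebook'
    price    DECIMAL(7, 2),
    PRIMARY KEY (book_id, format),
    CONSTRAINT fk_format_book FOREIGN KEY (book_id) REFERENCES book (book_id)
);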

Higher normal forms, like Boyce-Codd Normal Form (BCNF) and the fourth and fifth normal forms, address specific scenarios.

In transactional systems (OLTP), where data is frequently read and written, normalization to 3NF is common. It optimizes the processing and storage of transactions efficiently. In analytical systems (OLAP), primarily read-only, databases may undergo some de-normalization for improved read performance. Data warehousing, emphasizing performance, benefits from fewer tables for processing.

Relational Model Constraints

In business databases, maintaining data integrity is crucial, and constraints play a pivotal role in enforcing business rules. In a relational data model, integrity is ensured through six types of constraints:

Entity Integrity Constraint (Primary Key Constraint or Unique Constraint):

  • Definition: Requires each tuple in a relation to have a unique identifier known as a primary key.
  • Usage: Indexes are employed to implement this constraint.
  • Details: No attribute participating in the primary key may accept NULL values, and the key value must be unique for every tuple, which prevents duplicate rows.

Referential Integrity Constraint:

  • Definition: Defines relationships between tables and ensures the validity of these relationships using Primary Keys and Foreign Keys.
  • Example: In a book-author scenario, for a book to exist, it must be written by at least one author.

Semantic Integrity Constraint:

  • Definition: Ensures the correctness of the meaning of data.
  • Example: In the Author relation, if the City attribute contains a value that is not an actual city name, the data lacks meaningful semantics.

Domain Constraint:

  • Definition: Specifies permissible values for a given attribute.
  • Example: The Country attribute in the Author relation must contain a valid two-letter country code, enforcing meaningful data.

Null Constraint:

  • Definition: Specifies that attribute values cannot be null.
  • Example: In the Author relation, neither LastName nor FirstName can contain NULL values, ensuring authors have identifiable names.

Check Constraint:

  • Definition: Enforces domain integrity by restricting accepted values for an attribute.
  • Example: In the Book relation, the Year attribute, representing the publication year, might have a CHECK constraint to limit values based on the current year.
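To tie the constraint types together, here is a hedged sketch of a single table definition; the column sizes and the CHECK bound are assumptions, not values from the course:

CREATE TABLE book (
    book_id    CHAR(4) PRIMARY KEY NOT NULL,   -- entity integrity: unique, non-null key
    title      VARCHAR(100) NOT NULL,          -- null constraint
    country    CHAR(2),                        -- domain: two-letter country code by convention
    year       INT CHECK (year <= 2024),       -- check constraint; a static stand-in for "not after the current year"
    author_id  CHAR(2) NOT NULL,
    CONSTRAINT fk_book_author FOREIGN KEY (author_id)
        REFERENCES author (author_id)          -- referential integrity
);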

These constraints collectively ensure data accuracy, prevent redundancy, and contribute to a well-structured relational database.

Sorry for the long post, hopefully you found plenty of useful information here.

Disclaimer: The notes and information presented in this blog post were compiled during the course “Introduction to Relational Databases (RDBMS)” and are intended to provide an educational overview of the subject matter for personal use.
