Accessing Databases using Python

Tanja Adžić
13 min read · Nov 27, 2023


In these notes, we will dive into fundamental concepts related to using Python to interact with databases. Using Jupyter Notebooks, we will explore how to create tables, load data, execute SQL queries, and leverage the analytical capabilities of Python to gain insights from our data.

Photo by Growtika on Unsplash

How to access databases using Python?

There are many benefits to using Python for database programming. The Python ecosystem is rich, providing easy-to-use data science tools such as pandas, NumPy, SciPy, matplotlib, and Scikit-Learn. Python is open source, and it has been ported to many platforms.

Python offers robust support for relational database systems, simplifying the process of accessing databases through the Python Database API, often referred to as the DB API. Extensive and well-documented resources related to Python database interactions are readily accessible, making it user-friendly and efficient for developers to work with databases using Python.

Notebooks

In the field of Data Science, notebooks are quite popular. A notebook interface is a virtual programming environment for creating and sharing documents that combine live code, equations, visualizations, and explanatory text. Examples include Mathematica, Maple, Matlab, IPython/Jupyter, R Markdown, Apache Zeppelin, Apache Spark notebooks, Databricks Cloud, and more.

The Jupyter notebook stands as an open-source web application facilitating the creation and distribution of documents comprising live code, equations, visualizations, and narrative text. Jupyter offers numerous advantages:

  • Support for over 40 programming languages, including Python, R, Julia, and Scala.
  • Seamless sharing of notebooks via email, Dropbox, GitHub, and the Jupyter notebook viewer.
  • Capabilities to generate interactive outputs encompassing HTML, images, videos, LaTeX, and customizable data types.
  • Integration with big data tools like Apache Spark, accessible through Python, R, and Scala, alongside data exploration using libraries such as pandas, Scikit-Learn, ggplot2, and TensorFlow.

The Basics of SQL APIs and Python DB APIs

Here is a breakdown of the basic terminology and steps involved in using databases from Python; a minimal sketch of the full flow follows the list:

  1. Understanding APIs: An Application Programming Interface (API) is a set of functions that provide access to specific services. In the context of databases, SQL APIs and Python DB APIs enable applications to interact with Database Management Systems (DBMS).
  2. SQL API: The SQL API is a collection of library function calls designed as an API for the DBMS. These functions allow application programs to communicate with the DBMS using SQL statements.
  3. Sending SQL Statements: To execute SQL statements within an application, you utilize the SQL API. The process involves constructing SQL statements as text strings in a buffer.
  4. Connecting to the DBMS: The application program initiates database access by calling one or more API functions to establish a connection with the DBMS. This connection serves as a bridge between the program and the database.
  5. Passing SQL Statements: Once connected, the program can use API calls to transmit SQL statements from the buffer to the DBMS for execution. These statements can include queries, updates, or any other SQL operation.
  6. Checking Status and Handling Errors: To ensure smooth execution, the application program utilizes API calls to monitor the status of its requests to the DBMS. In case of errors or exceptions, the API provides functions to handle and manage these issues.
  7. Disconnecting from the Database: After completing its tasks, the application program employs an API call to disconnect from the database. This ensures that the resources are properly released and the connection is terminated.
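As an illustration of this flow, here is a sketch using Python's built-in sqlite3 module, one of many DB-API implementations; the database file name and query are made up for the example:

import sqlite3

# Steps 3 and 4: build the SQL statement as a string and connect to the DBMS
connection = sqlite3.connect("example.db")
query = "SELECT name FROM sqlite_master WHERE type = 'table'"

try:
    # Step 5: pass the SQL statement to the DBMS for execution
    cursor = connection.cursor()
    cursor.execute(query)
    print(cursor.fetchall())
except sqlite3.Error as e:
    # Step 6: check status and handle any errors raised by the API
    print("Database error:", e)
finally:
    # Step 7: disconnect from the database and release resources
    connection.close()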

APIs used by popular SQL-based DBMS systems

Each popular SQL-based DBMS is accessed through its own SQL API:

  • MySQL — MySQL C API
  • PostgreSQL — psycopg2
  • IBM DB2 — ibm_db
  • SQL Server — dblib API
  • Database access for Microsoft Windows OS — ODBC
  • Oracle — OCI
  • Java applications — JDBC

Writing Code using DB-API

How exactly does the DB-API work? The user writes Python programs within a Jupyter notebook, and there’s a mechanism through which Python code communicates with the Database Management System (DBMS). This communication is facilitated by Python code that connects to the database using DB-API calls. DB-API stands for Python’s standard Application Programming Interface (API) for accessing relational databases. It serves as a universal standard, allowing you to write a single program capable of working with various types of relational databases, eliminating the need for separate programs for each database system. Learning the DB-API functions equips us with the knowledge to interact with any database using Python.

Advantages of using DB-API:

  • Ease of Implementation: The DB-API is designed to be straightforward and user-friendly, making it easy to implement and understand. Its consistent interface simplifies database access.
  • Consistency: The DB-API encourages consistency among Python modules used for database access. This consistency leads to more easily understood and predictable modules, reducing development complexity.
  • Code Portability: Python programs written using the DB-API are often more portable across different database systems. This means that code can be used with various databases without significant modifications.
  • Broad Database Connectivity: The DB-API offers broad support for connecting to various database management systems (DBMS) from Python, enhancing the range of database options available for developers.

Concepts of the Python DB-API

In the Python DB-API, there are two main concepts:

  • Connection Objects
  • Cursor Objects

Connection Objects are used to connect to a database and manage transactions.

Cursor objects in the Python DB-API are used to run queries and navigate through the results of a database. We open a cursor object and use it to execute queries, similar to scrolling through a text document to retrieve data from the result set. Cursors enable us to scan and interact with the database results efficiently.

Connection Methods

The DB-API includes a connect constructor for creating a connection to the database. It returns a Connection Object, which is then used by the various connection methods. These connection methods, demonstrated in the sketch after the list, are:

  • The cursor() method, which returns a new cursor object using the connection.
  • The commit() method, which is used to commit any pending transaction to the database.
  • The rollback() method, which causes the database to roll back to the start of any pending transaction.
  • The close() method, which is used to close a database connection.
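Here is a short sketch of these methods in action, using the standard library's sqlite3 module as the DB-API driver; the table and values are invented for illustration:

import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()    # cursor() returns a new cursor object

cursor.execute("CREATE TABLE accounts(id INTEGER, balance REAL)")
try:
    cursor.execute("INSERT INTO accounts VALUES(1, 100.0)")
    connection.commit()         # commit() makes the pending insert permanent
except sqlite3.Error:
    connection.rollback()       # rollback() undoes the pending transaction
finally:
    connection.close()          # close() ends the database connection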

Cursor Methods

These objects represent a database cursor, which is used to manage the content of a fetch operation (a short sketch follows below):

  • callproc()
  • execute()
  • executemany()
  • fetchone()
  • fetchmany()
  • fetchall()
  • nextset()
  • arraysize (an attribute that sets the default number of rows fetchmany() returns)
  • close()

Cursors created from the same connection are not isolated; any changes made to the database by one cursor are immediately visible to other cursors from the same connection. However, cursors created from different connections can be isolated or not, depending on how the transaction support is implemented in the database and the specific configuration.
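The following sketch exercises the most common of these cursor methods, again with sqlite3 standing in as the DB-API driver and invented sample data:

import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

cursor.execute("CREATE TABLE trucks(serial_no TEXT, model TEXT)")

# executemany() runs the same statement once per parameter tuple
rows = [("A1234", "Lonestar"), ("B5678", "Cascadia")]
cursor.executemany("INSERT INTO trucks VALUES(?, ?)", rows)

cursor.execute("SELECT * FROM trucks")
print(cursor.fetchone())     # fetchone() returns the next single row
print(cursor.fetchmany(5))   # fetchmany(n) returns up to n of the remaining rows
# fetchall() would return all remaining rows at once

cursor.close()
connection.close()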

Database Cursor

A database cursor is a control structure that facilitates the traversal of records within a database. It functions akin to a file name or file handle in a programming language. Much like a program opens a file to access its content, it employs a cursor to gain access to query results. Furthermore, just as a program closes a file to conclude its access, it closes a cursor to conclude access to query results. Another parallel can be drawn with file handles, which maintain the program’s current position within an open file; similarly, a cursor keeps track of the program’s current position within query results. Cursors play a crucial role in managing and interacting with data in a database system.

Overview of Database Connection Code using Python

Let’s walk through a Python application that uses the DB-API to query a database:

# Step 1: Import the necessary database module
from dbmodule import connect

# Step 2: Open a connection to the database
connection = connect('databasename', 'username', 'password')

# Step 3: Create a cursor object on the connection
cursor = connection.cursor()

# Step 4: Run a SQL query using the cursor
cursor.execute("SELECT * FROM mytable")

# Step 5: Fetch and print the results of the query
results = cursor.fetchall()
for row in results:
    print(row)

# Step 6: Close the cursor and the connection
cursor.close()
connection.close()

Connecting to a database using IBM_DB API

The ibm_db API offers a wide range of Python functions designed to facilitate data access and manipulation within an IBM data server database. These functions serve various purposes, such as establishing connections, executing SQL statements, retrieving result sets, invoking stored procedures, managing transactions, handling errors, and accessing metadata. The ibm_db API relies on the IBM Data Server Driver for ODBC and CLI APIs to establish connections to IBM DB2 and Informix databases.

To interact with a DB2 database using Python and the ibm_db library, you’ll need to import the ibm_db library into your Python application. Establishing a connection to DB2 requires specific information, including:

  • driver name
  • database name
  • host DNS name or IP address
  • host port
  • connection protocol
  • user ID and user password

Here’s an example of how to create a DB2 database connection in Python:

  • Import the ibm_db library into your Python application.
import ibm_db
  • Define a DSN (Data Source Name) object to store the connection credentials.
dsn = (
    "DRIVER={IBM DB2 ODBC DRIVER};"
    "DATABASE=mydb;"
    "HOSTNAME=hostname;"
    "PORT=port;"
    "PROTOCOL=TCPIP;"
    "UID=username;"
    "PWD=password;"
)
  • Use the ibm_db.connect function to create a non-persistent connection by passing the DSN object as a parameter.
try:
    conn = ibm_db.connect(dsn, "", "")
    print("Connected to the database.")
except Exception as e:
    print("Unable to connect to the database:", e)
  • If the connection is successfully established, the code will print “Connected to the database.” Otherwise, it will print an error message.
  • Close the connection:
ibm_db.close(conn)

Remember to close connections to avoid leaving unused connections that can consume system resources.

Creating tables, loading and querying data

To begin, we obtain a connection resource by connecting to the database using the ibm_db.connect method from the ibm_db API.

Creating tables

Creating tables in DB2 can be done in various ways, including using the Web console provided by DB2 or creating tables from SQL, R, or Python environments. Here, we’ll explore how to create tables in DB2 from our Python application. Let’s consider an example where we create a table for a commercial Trucks database using Python code.

To create a table, we will use the ibm_db.exec_immediate function. This function takes the following parameters:

  1. connection: A valid database connection resource returned from the ibm_db.connect or ibm_db.pconnect function.
  2. statement: A string containing the SQL statement to create the table.
  3. options (optional): A parameter that includes a dictionary specifying the type of cursor to return for result sets.

Here is the code to create the Trucks table in a DB2 database:

stmt = ibm_db.exec_immediate(conn,
    """CREATE TABLE Trucks(
           serial_no VARCHAR(20) PRIMARY KEY NOT NULL,
           model VARCHAR(20) NOT NULL,
           manufacturer VARCHAR(20) NOT NULL,
           Engine_size VARCHAR(20) NOT NULL,
           Truck_Class VARCHAR(20) NOT NULL)"""
)

In the process of creating the Trucks table in a DB2 database using the ibm_db API, we utilize the ibm_db.exec_immediate function. This function requires a connection resource to be passed as its first parameter, representing the established database connection. Following that, the SQL statement, defined as the query for creating the table, serves as the next parameter. This SQL statement contains specifications for the table structure, defining the columns, data types, and any constraints. In this case, the new table, named "Trucks," is structured with five columns, with "serial_no" set as the primary key.

Loading data

Here is code that inserts data into the previously created Trucks table:

stmt = ibm_db.exec_immediate(conn,
    """INSERT INTO Trucks(serial_no, model, manufacturer, Engine_size, Truck_Class)
       VALUES('A1234', 'Lonestar', 'International Trucks',
              'Cummins ISX15', 'Class 8')""")

We again use the ibm_db.exec_immediate function provided by the ibm_db API to execute SQL statements for inserting data into the Trucks table. This function takes a connection resource conn, created during the initial database connection, as its first parameter. The subsequent parameter contains the SQL statement INSERT INTO for inserting data into the Trucks table. Each execution of this query adds a new row to the table, effectively populating it with data. To insert multiple rows, we repeatedly use the ibm_db.exec_immediate function for each set of data to be added to the Trucks table.
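Alternatively, for repeated inserts the ibm_db API provides ibm_db.prepare and ibm_db.execute, which let us prepare the statement once and run it with different parameter values. A sketch, assuming conn is the connection created earlier and using invented sample rows:

# Prepare the INSERT once, with ? placeholders for the values
insert_sql = """INSERT INTO Trucks(serial_no, model, manufacturer,
                                   Engine_size, Truck_Class)
                VALUES(?, ?, ?, ?, ?)"""
stmt = ibm_db.prepare(conn, insert_sql)

# Execute the prepared statement once per row of sample data
trucks = [
    ("A1234", "Lonestar", "International Trucks", "Cummins ISX15", "Class 8"),
    ("B5678", "Cascadia", "Freightliner", "Detroit DD15", "Class 8"),
]
for truck in trucks:
    ibm_db.execute(stmt, truck)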

Querying data

stmt = ibm_db.exec_immediate(conn, "SELECT * FROM Trucks")

ibm_db.fetch_both(stmt)

Now that we have successfully connected our Python code to a database instance, created a database table, and populated it with data, let’s explore how we can retrieve information from the Trucks table. To accomplish this, we again use the ibm_db.exec_immediate function from the ibm_db API. As with previous use cases, we provide the connection resource as the first parameter, and the following parameter contains the SELECT statement. This query fetches data from the Trucks table, and the output displays the fields and contents of the rows returned. To verify the accuracy of the output, we can cross-reference it with the information in the DB2 console.
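One detail worth noting: a single call to ibm_db.fetch_both returns only one row, as a dictionary indexed by both column name and position, and returns False once the result set is exhausted. To walk through every row, we can loop, for example (the column names here assume DB2's usual uppercase folding of unquoted identifiers):

stmt = ibm_db.exec_immediate(conn, "SELECT * FROM Trucks")

row = ibm_db.fetch_both(stmt)
while row:
    # Each row is accessible by column name or by position
    print(row["SERIAL_NO"], row["MODEL"])
    row = ibm_db.fetch_both(stmt)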

We can use pandas to further query data. Here is an example:

import pandas as pd
import ibm_db_dbi

pconn = ibm_db_dbi.Connection(conn)

df = pd.read_sql("SELECT * FROM Trucks", pconn)
df

SQL Magic

Jupyter notebooks offer a feature called Magic commands that can streamline Python usage, making them especially valuable for data analysis. Magic commands come in two types:

  1. Cell magics: These start with two percent signs (%%) and apply to the entire cell.
  2. Line magics: These start with a single percent sign (%) and apply to a specific line within a cell.

The general format for using magic commands is %magicname arguments.

In the context of working with databases, SQL Magic commands can be especially helpful. For instance, if we need to execute a query to select data from a table and retrieve its results, we can simply enter a command like the following in a Jupyter notebook cell:

%sql select * from tablename
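For the %sql magic to be available, the SQL extension must first be loaded and a connection established. A sketch assuming the ipython-sql extension and the ibm_db_sa driver are installed, with placeholder credentials:

%load_ext sql
%sql ibm_db_sa://username:password@hostname:port/databasename
%sql select * from tablename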

SQL magic commands have some limitations. For example, unlike the DB-API, they do not provide explicit methods for closing a connection and releasing associated resources.

Analyzing Data with Python

We will illustrate how to store data in an IBM Db2 on Cloud database and perform basic data analysis using Python. As an example, we will use a nutritional facts dataset for popular McDonald’s menu items, sourced from Kaggle.

First we need to create a table in the Db2 database to store the McDonald’s menu nutrition facts dataset. We will use the Db2 console for this process, which involves four main steps: source, target, define, and finalize.

  1. Source: We start by loading the spreadsheet into Db2 using the console.
  2. Target: Next, we select the target schema.
  3. Define: Here, we will have the option to load the data into an existing table or create a new table. If we choose to create a new table, we can specify the table name. We will also have the opportunity to review the data preview and define the columns and their data types.
  4. Finalize: After reviewing the settings, we can initiate the data loading process. Once the loading is complete, we can view statistics on the loaded data.

With the data successfully loaded into the Db2 table, we can proceed to analyze and explore the data further using Python. Let’s verify the loaded data and the number of rows:

stmt = ibm_db.exec_immediate(conn, "SELECT COUNT(*) FROM MCDONALDS_NUTRITION")

ibm_db.fetch_both(stmt)

Let’s explore how we can use Pandas to retrieve data from the database tables.

import pandas as pd
import ibm_db_dbi

pconn = ibm_db_dbi.Connection(conn)

df = pd.read_sql("SELECT * FROM MCDONALDS_NUTRITION", pconn)
df

We load data from the MCDONALDS_NUTRITION table into a DataFrame called df using the read_sql method, providing the SQL select query and the connection object as parameters. To get an initial look at the data in df, we use the head method to display the first few rows (the default is 5):

df.head()

Next, let’s proceed with data analysis using Pandas. Pandas provides a set of common mathematical and statistical methods. We can use the describe method to view summary statistics of the data in the DataFrame. This method provides information such as the number of observations (260 food items in our case), unique categories of food items (nine), and various summary statistics for the variables.

df.describe(include = 'all')

For example, we can see that the maximum value for total fat is 118. Let’s focus on understanding the sodium content in our food items. We know that sodium plays a significant role in our diet, and excessive consumption can lead to health issues. To find which food item has the maximum sodium content, we start by visualizing the sodium values for different food items using a scatter plot.

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

plot = sns.swarmplot(x = "Category", y = "Sodium", data = df)
plt.setp(plot.get_xticklabels(), rotation = 70)
plt.title("Sodium Content")
plt.show()

In the scatter plot, we notice a high value of around 3,600 for sodium, which raises our curiosity. To identify the food item associated with this high sodium value, we proceed with further data analysis. We use the idxmax method to determine the index at which the maximum sodium value occurs in the DataFrame.

df['Sodium'].idxmax()

The output index is 82, and we then use the .at indexer to find the item name at this index, which turns out to be “Chicken McNuggets, 40 pieces.”

df.at[82, "Item"]

Visualizations are invaluable for data exploration. We can use them to understand relationships, patterns, and outliers in the data. In our analysis, we created a scatter plot to examine the correlation between protein and total fat.

plot = sns.jointplot(x = "Protein", y = "Total Fat", data = df)
plt.show()

The Pearson correlation coefficient (0.81) indicates a strong positive correlation between the two variables.
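As a quick sanity check, the coefficient can also be computed directly in pandas, assuming the column names match the dataset:

# Pearson correlation between protein and total fat
df["Protein"].corr(df["Total Fat"])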

Next, we used box plots to visualize the distribution of sugar content. The box plot provides insights into the average sugar values for different food items, as well as the presence of outliers. Some food items have extremely high sugar content, with values of around 128 grams.

sns.set_style('whitegrid')
ax = sns.boxplot(x = df["Sugars"])
plt.show()

These visualizations and data analysis techniques help us gain a deeper understanding of the nutritional data for McDonald’s menu items.

Conclusion

In conclusion, we’ve discussed how to access databases from Python using the appropriate API, such as ibm_db for IBM DB2, psycopg2 for PostgreSQL, and dblib for SQL Server. Python’s standard DB-API allows us to write a single program that works with multiple relational databases, promoting code reusability.

We explored the connection methods available in the DB-API, including the cursor method for creating cursor objects, commit for finalizing transactions, rollback for reverting transactions, and close for ending database connections.

Additionally, we learned about SQL Magic commands, which make executing queries more accessible within Jupyter Notebooks. These commands can be applied at both the cell and line levels, providing greater flexibility for data analysis tasks.

Disclaimer: The notes and information presented in this blog post were compiled during the course “Databases and SQL for Data Science with Python” and are intended to provide an educational overview of the subject matter for personal use.
