My lab group recently asked me to give a tutorial on using SQL databases in science. While we are all complex systems scientists, my background is in computer science and STS, and many of my colleagues come from physics, mathematics, and philosophy, so we learn a great deal from one another. I’ve turned my slides into a blog post here, like my last lab talk on using Git for scientific software development.
A database is a piece of software for storing and organizing your data. Most importantly, databases make it easy to query your data, asking for subsets of your data that match a specific pattern you are interested in.
If you currently store your data in formats like CSV or JSON, and write lots of code for reading this data and searching through it for pieces relevant to your research question, our goal will be to offload all of this logic from your own code to a database. It will run much faster, it will be faster to write, and it will help you avoid bugs while expressing complicated questions simply.
There are many types of databases, but for this post I’ll split them along two axes: do they run locally (as part of your research code, storing data in a single file) or remotely (running as an independent process you speak to over the network), and does the database use SQL (a language for expressing sophisticated data queries) or not. Here’s a small subset of databases along these axes:
|       | Local                         | Remote                               |
|-------|-------------------------------|--------------------------------------|
| SQL   | SQLite, DuckDB                | Postgresql, MySQL, MariaDB, MSSQL, … |
| NoSQL | Pandas (sorta), BerkeleyDB, … | Redis, MongoDB, Firebase, …          |
In this post I’ll be focusing on SQLite and Postgresql as examples. I’ll briefly talk about NoSQL databases at the end, and the scenarios where they might be preferable to SQL databases.
SQLite stores all data in one file on your hard drive. SQLite is a library, so the database software runs inside of the software you write. It is trivial to set up, pretty fast (especially for queries), and has most database features we will want.
Critically, SQLite is ill-suited to concurrency. Since SQLite runs inside of your software, two different Python scripts can easily try to write to the same database file at the same time, risking catastrophic data corruption. You can build sophisticated locking mechanisms to ensure only one program accesses a database at once, but this adds a serious performance bottleneck. SQLite is really intended for a single piece of software to store data, not live setups where several applications write data at the same time.
Postgres runs as a software daemon; it runs all the time, storing data in a series of files and caches that it manages. Whether postgres is running on your own computer or another computer, your research software will communicate with postgresql over the network.
This difference in design means that postgres requires some additional bureaucracy to set up: users and passwords, databases and permissions and authentication. In return however, postgres is even faster than SQLite, and handles concurrent access from many applications trivially. Postgres also has a number of advanced features that are unavailable in SQLite.
Relational databases store data in tables (think spreadsheets), and in the relationships between tables.
You request data from a table using a `SELECT` statement of the form `SELECT columns FROM table WHERE row matches condition`. For example:

```sql
SELECT userid FROM desks WHERE building='Innovation' AND room=413;
```
You can also combine multiple tables during a `SELECT` to gather related information. Here we fetch the names of all graduate students with a desk assigned to them, by selecting rows from the desks table and combining them with matching entries from the users table where the user IDs of both rows match:

```sql
SELECT firstname,lastname FROM desks LEFT JOIN users ON desks.userid=users.userid WHERE status='Graduate';
```
The following are the main commands for interacting with a SQL database to create relations, add, remove, and update information in relations, and select information from relations:

| Command | Purpose |
|---------|---------|
| SELECT  | Return some columns from a relation |
| INSERT  | Add data to a relation |
| DELETE  | Remove data from a relation |
| UPDATE  | Modify data in a relation |
| CREATE  | Create a new relation (table, view, index) |
| DROP    | Remove a relation |
| EXPLAIN | Show how a query will access data |
So that’s all fine in theory, but how do we write software that actually uses a database?
To connect to a SQLite database from Python we first create a connection by supplying a database filename, then open a cursor from the connection, and finally use the cursor to send a query and get back a 2D array of results.

```python
import sqlite3

conn = sqlite3.connect("university.db")
c = conn.cursor()
c.execute("SELECT firstname,lastname,building,room FROM desks LEFT JOIN users ON desks.userid=users.userid")
for (f,l,b,r) in c.fetchall():
    print("%s %s has a desk in %s %d" % (f,l,b,r))
conn.commit()  # Save any CREATE/INSERT changes to the database
conn.close()
```
You can think of the cursor as a finger tracking your position in the database. Multiple cursors allow us to make multiple queries against the same database and track which results were associated with which request.
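As a quick sketch of this (using a throwaway in-memory database and made-up desk assignments, not the real tables from this post), two cursors on one connection can track two result sets independently:

```python
import sqlite3

# Hypothetical in-memory database standing in for the examples above
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE desks(userid TEXT, building TEXT, room INT)")
conn.executemany("INSERT INTO desks VALUES (?,?,?)",
                 [("amy", "Innovation", 413), ("bo", "Sage", 112)])

# Each cursor tracks its own query and its own results
c1 = conn.cursor()
c2 = conn.cursor()
c1.execute("SELECT userid FROM desks WHERE building='Innovation'")
c2.execute("SELECT userid FROM desks WHERE building='Sage'")
innovation = c1.fetchall()  # results of the first query only
sage = c2.fetchall()        # results of the second query only
conn.close()
```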
Interacting with Postgres is similar to SQLite: we connect to the database, then open a cursor from the connection, and use the cursor to send queries and get results. However, Postgres is a daemon accessible over the network, so we’ll need to supply a hostname and port number where the SQL server can be found, the name of the database we want to reach, and a username and password authorized to connect to that database.
```python
import psycopg2

conn = None
try:
    conn = psycopg2.connect(host="127.0.0.1", port="5432",
                            user="registrar", password="hunter2",
                            database="university_users")
    c = conn.cursor()
    c.execute("SELECT firstname,lastname,building,room FROM desks LEFT JOIN users ON desks.userid=users.userid")
    for (f,l,b,r) in c.fetchall():
        print("%s %s has a desk in %s %d" % (f,l,b,r))
except psycopg2.Error as error:
    print("Error while connecting to PostgreSQL", error)
finally:
    if conn:
        conn.commit()
        conn.close()
```
Often your SQL statements will depend on other variables, and can’t be written as constant strings ahead of time. It may be tempting to assemble the SQL statement using string concatenation to insert variables. Never do this.
Consider the following example:
```python
c.execute("SELECT userid,firstname,lastname FROM users WHERE lastname LIKE '" + name + "'")
matches = c.fetchall()
```
Given a student’s last name, this looks up all students with that name. You might find functionality like this on your university’s student directory. But what if a user enters input like `' OR 'a'='a`? The query now reads:

```sql
SELECT userid,firstname,lastname FROM users WHERE lastname LIKE '' OR 'a'='a'
```
While a little clunky, this will return every user in the database. Worse yet, a malicious user might construct a query like:
```sql
SELECT userid,firstname,lastname FROM users WHERE lastname LIKE '' OR password LIKE 'A%'
```
This would get them a list of all users whose password hashes start with ‘A’, then another query for ‘AA’, ‘AB’, and slowly an attacker can reconstruct the password hashes of every user at the university. This kind of attack is called a SQL Injection, and is a common vulnerability in websites. While scientific code is less likely to be directly attacked than a website, if you’re working with real-world data, especially web-scraped or user-gathered data, there can be all kinds of garbage in your input.
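To see the attack concretely, here is a sketch against a throwaway in-memory database with made-up users; the string concatenation is exactly the pattern warned against above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users(userid TEXT, firstname TEXT, lastname TEXT)")
conn.executemany("INSERT INTO users VALUES (?,?,?)",
                 [("asmith", "Alice", "Smith"), ("bjones", "Bob", "Jones")])

name = "' OR 'a'='a"  # malicious "last name" entered by an attacker
c = conn.cursor()
# The dangerous string-concatenation pattern: never do this
c.execute("SELECT userid FROM users WHERE lastname LIKE '" + name + "'")
leaked = c.fetchall()  # every user in the table, not just one last name
conn.close()
```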
To avoid this vulnerability you can write your query first with placeholders for parameters, then tell SQL to complete the statement on your behalf. In SQLite this looks like:
```python
c.execute("SELECT userid,firstname,lastname FROM users WHERE lastname LIKE ?", [name])
```

Note that the placeholder is not wrapped in quotes: the database adds any quoting and escaping itself.
Or in Postgresql:
```python
c.execute("SELECT userid,firstname,lastname FROM users WHERE lastname LIKE %s", [name])
```
In either case, the SQL engine will properly escape any text in the `name` field, ensuring that it’s interpreted as a string, and never as a SQL statement.
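As a sketch of the difference (again with a throwaway in-memory database and made-up users), the same malicious input from before now matches nothing, while an honest lookup still works:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users(userid TEXT, firstname TEXT, lastname TEXT)")
conn.executemany("INSERT INTO users VALUES (?,?,?)",
                 [("asmith", "Alice", "Smith"), ("bjones", "Bob", "Jones")])

c = conn.cursor()
# The malicious input is treated as a literal string: no user has this last name
name = "' OR 'a'='a"
c.execute("SELECT userid FROM users WHERE lastname LIKE ?", [name])
safe = c.fetchall()

# An honest lookup behaves exactly as expected
c.execute("SELECT userid FROM users WHERE lastname LIKE ?", ["Smith"])
found = c.fetchall()
conn.close()
```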
Real-world data is messy. Maybe you assume that every office at a school is assigned to an employee or graduate student, but some were assigned to recent graduates or retirees and haven’t been re-assigned. Maybe you assume that all students have last names, but some international students come from cultures that use mononyms, and the last name field is empty. If you don’t check these underlying assumptions, you might not learn that you’ve made a mistake until hours of debugging later! Fortunately, SQL provides an easy way to describe and enforce your assumptions about data through constraints.
```sql
CREATE TABLE users(
    userid TEXT PRIMARY KEY,
    firstname TEXT NOT NULL,
    lastname TEXT,
    status TEXT NOT NULL
);
```
This definition of the users table includes four text fields, three of which cannot be empty. Further, the `userid` field must be unique: you can have two students with the same first and last name, but they must have different usernames. We can add more detailed restrictions to the desk-assignment table:
```sql
CREATE TABLE desks(
    building TEXT NOT NULL,
    room INT NOT NULL,
    desk INT NOT NULL,
    userid TEXT,
    FOREIGN KEY(userid) REFERENCES users(userid),
    UNIQUE(building,room,desk)
);
```
Here, we’ve explicitly said that the `userid` field must match some user ID in the `users` table. We’ve also said that while there can be multiple rooms in a building, and multiple desks in a room, there cannot be multiple desk 4’s in room 112 of Sage hall: the combination of building name, room number, and desk number must be unique.
If we try to insert any data into these tables that violates the described constraints, SQL will throw an exception instead of adding the new rows. Like unit testing but for your input data, constraints can help you be confident that your data follows the logic you think it does.
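For example (a sketch against a throwaway in-memory SQLite database), violating either constraint raises `sqlite3.IntegrityError` and the offending row is never added:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE users(
    userid TEXT PRIMARY KEY,
    firstname TEXT NOT NULL,
    lastname TEXT,
    status TEXT NOT NULL)""")
conn.execute("INSERT INTO users VALUES ('asmith','Alice','Smith','Graduate')")

errors = []
for bad_row in [("asmith", "Bob", "Jones", "Faculty"),  # duplicate primary key
                ("cdoe", None, "Doe", "Staff")]:        # missing firstname
    try:
        conn.execute("INSERT INTO users VALUES (?,?,?,?)", bad_row)
    except sqlite3.IntegrityError as e:
        errors.append(str(e))

# Both inserts were rejected; only the original row remains
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
conn.close()
```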
Compared to parsing CSV or JSON in Python and searching for the data you want, SQL will run inconceivably fast. But if you’re storing several gigabytes or more in your tables, even SQL databases will slow down. With a little bit of forethought we can make SQL queries run much faster.
Let’s say you have all your email stored in a database, with a structure something like:
```sql
CREATE TABLE emails(
    msgid TEXT PRIMARY KEY,
    from_address TEXT NOT NULL,
    to_address TEXT NOT NULL,
    subject TEXT NOT NULL,
    sent_date INT NOT NULL
);
```
If we want to search for all emails received in the last week then SQL needs to search through every email in the table to check their sent dates. This is obviously highly inefficient, but we can warn SQL that we’ll be making these kinds of queries:
```sql
CREATE INDEX time_index ON emails(sent_date);
```
Creating an index tells SQL to build a search tree (a B-tree), sorting the emails by `sent_date` to reduce lookups from O(n) to O(log n), dramatically improving performance. We can also build indices on multiple columns at once:
```sql
CREATE INDEX from_time_index ON emails(from_address,sent_date);
```
And now we can look up emails from a particular user in a particular time window in O(log n) - even better! Both SQLite and Postgresql will automatically create indices for primary keys and unique constraints, since they’ll need to perform lookups during every new insert to make sure the constraints aren’t violated. You’ll often be selecting data based on unique characteristics too, so in practice it isn’t always necessary to declare indices explicitly.
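You can check whether a query will actually use an index with `EXPLAIN` (in SQLite, `EXPLAIN QUERY PLAN`). A minimal sketch with a throwaway in-memory database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE emails(
    msgid TEXT PRIMARY KEY,
    from_address TEXT NOT NULL,
    to_address TEXT NOT NULL,
    subject TEXT NOT NULL,
    sent_date INT NOT NULL)""")
conn.execute("CREATE INDEX time_index ON emails(sent_date)")

# Ask SQLite how it would execute a date-range query
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT msgid FROM emails WHERE sent_date > 1700000000").fetchall()
plan_text = " ".join(str(row) for row in plan)
conn.close()
```

The plan should report a SEARCH using `time_index`, rather than a full SCAN of the table.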
Many SQL functions aggregate information from multiple rows. For example, we can count the number of users with:
```sql
SELECT COUNT(*) FROM users;
```
There are a variety of aggregate functions, including `SUM`, `AVG`, `MIN`, and `MAX`.
Often we don’t want to apply aggregate functions to every row, but to a sub-group of rows. Imagine we have a table of course registrations, like:
```sql
CREATE TABLE course_registration(
    userid TEXT,
    coursecode INT,
    credits INT,
    FOREIGN KEY(userid) REFERENCES users(userid),
    UNIQUE(userid,coursecode)
);
```
To ask how many credits each student is registered for we might query:
```sql
SELECT userid,SUM(credits) FROM course_registration GROUP BY userid;
```

The `GROUP BY` clause clusters rows based on their `userid`, then runs the aggregate function on each group rather than on all rows. We could also list the students in descending order by credit count:

```sql
SELECT userid,SUM(credits) AS total_credits FROM course_registration GROUP BY userid ORDER BY total_credits DESC;
```
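Putting the pieces together as a quick sketch (in-memory database, made-up registrations):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE course_registration(userid TEXT, coursecode INT, credits INT)")
conn.executemany("INSERT INTO course_registration VALUES (?,?,?)",
                 [("amy", 101, 4), ("amy", 205, 3), ("bo", 101, 4)])

# One row per student, heaviest course load first
rows = conn.execute(
    "SELECT userid, SUM(credits) AS total_credits "
    "FROM course_registration GROUP BY userid "
    "ORDER BY total_credits DESC").fetchall()
conn.close()
```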
Pandas is a ubiquitous Python package in data science. It makes it easy to store a table or a sequence of values as a Python object and perform some data analysis. It also integrates well with Seaborn, a package for statistical data visualization built on top of matplotlib. The two also integrate well with SQL. In just a couple lines, we can plot a histogram of how many credits students have registered for, from SQL to Pandas to Seaborn:
```python
import sqlite3
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

conn = sqlite3.connect("registration.db")
df = pd.read_sql("SELECT userid,SUM(credits) AS registered_credits FROM course_registration GROUP BY userid", conn)
sns.histplot(data=df, x="registered_credits")
plt.title("Credit load across student body")
plt.savefig("credit_load.pdf", bbox_inches="tight")
conn.close()
```
Pandas will run a query for us (against SQLite, Postgresql, or a variety of other database types), put the result in a table with appropriate column names, and hand it off to Seaborn, which understands those same column names. Data analysis made easy!
For all the awesomeness of SQL, there are some tasks it is ill-suited to. If all you need is a way to store a dictionary so it persists, and make that dictionary accessible to multiple programs, then SQL is way more complexity and overhead than you need. Redis fills this niche well, and is simple and fast as long as you use it for this purpose.
If you have an enormous amount of data, terabytes worth, and need to update that data continuously, then SQL is a poor fit: a SQL server runs on a single machine and relies on internal locking to keep concurrent writes to a table consistent, which caps write throughput. This would be disastrous if, for example, you wanted to store all tweets in a database and need to save data as millions of users tweet at once. Here, tools like MongoDB step up, offering multiple database “shards” that will periodically sync with one another. This setup offers “eventual consistency” where a new tweet might not be available to all users right away, but things propagate pretty quickly, and in return we can handle huge numbers of updates at once.
More generally, SQL is a poor choice for:
- Storing large files: maybe store metadata about files in a table, along with a pathname to where the file can be found on disk?
- Storing unstructured data: you need to know what your rows and columns will be to put information in a spreadsheet. If your data is not uniform enough to describe in this way, then a spreadsheet is inappropriate.
- Storing arbitrarily structured or nested data: if your data comes in the form of deeply nested JSON or XML, then a spreadsheet may be a poor choice. This is not always the case: if you have some nested JSON representing a tree of comments and replies on a website, then you may be able to “flatten” the tree by making each comment into a unique row, and including a “parent commentID” as a column. However, if different levels of nesting can have a wide variety of tags and meanings, this conversion may not always make sense. If you find that you’re storing a blob of JSON as a column in your table, then a table may not be the best representation for you.
For very specific types of data, like GIS, or network/graph data, there are specialized databases that may offer more task-appropriate tools than SQL.
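The tree-flattening idea above can be sketched as follows, with some hypothetical nested comment JSON converted to one row per comment and a parent-comment column encoding the tree:

```python
import json
import sqlite3

# Made-up nested comment thread, as it might arrive from a web API
thread = json.loads("""
{"id": 1, "text": "Great post!", "replies": [
    {"id": 2, "text": "Agreed.", "replies": []},
    {"id": 3, "text": "I have questions.", "replies": [
        {"id": 4, "text": "Ask away.", "replies": []}]}]}
""")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments(id INT PRIMARY KEY, parent_id INT, text TEXT)")

def flatten(comment, parent_id=None):
    """Store one row per comment, recording each comment's parent."""
    conn.execute("INSERT INTO comments VALUES (?,?,?)",
                 (comment["id"], parent_id, comment["text"]))
    for reply in comment["replies"]:
        flatten(reply, comment["id"])

flatten(thread)
# Tree queries become ordinary SQL: the direct replies to comment 1
replies = conn.execute(
    "SELECT id FROM comments WHERE parent_id=1 ORDER BY id").fetchall()
conn.close()
```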
SQL databases are an invaluable tool for any data scientist. They allow researchers to organize a wide variety of data so that it is easily and quickly queried to identify patterns and answer questions. SQL can simplify your code, preclude nasty bugs via constraints, and integrate nicely with most programming languages, especially common data science software packages.
This post offers a brief tutorial on using SQL, but there is an enormous depth available to accomplish more complicated tasks. In particular, I have left out:
- Views and Materialized Views
- Subqueries and Common Table Expressions
- Much more detail on joins, unions, grouping, and partitioning
- Functions, stored procedures, and triggers
Hopefully this gives you enough of a foundation to start using SQL in scientific contexts, and look up more details as you need them.