SQL

SQL stands for structured query language, and is used to express database queries. A large number of databases use the SL language, such as MySQL, PostgreSQL, MariaDB, and SQlite. We use a database to store structured data in a way that makes it easy to extract all or portions of that data.

Basic queries

In this section we will use Fisher’s iris plant data set, which has the following schema.

Id SepalLength SepalWidth PetalLength PetalWidth IrisClass
1 5.1 3.5 1.4 0.2 Iris-setosa

We assume the table name is IrisData

To select all the data, we could run

SELECT *
FROM IrisData

To be more explicit about what we are selecting, we could write

SELECT (Id, SepalLength, SepalWidth, PetalLength, PetalWidth, IrisClass)
FROM IrisData

We can narrow down our selection to include only data related to the iris versicolor class with

SELECT (Id, SepalLength, SepalWidth, PetalLength, PetalWidth, IrisClass)
FROM IrisData
WHERE IrisClass = "Iris-versicolor"

If we wanted to count how many rows correspond to the setosa class we could write

SELECT COUNT(*)
FROM IrisData
WHERE IrisClass = "Iris-setosa"

In all of these queries the case of the keywords is ignored, but by convention we capitalize them.

SQLite

SQLite is an open-source database library that understands the SQL language. SQLite is a library, not a server, and so any application that wants to interact with an SQLite database must utilize a library to interact with it; the upside is that you don’t need a server process running on your computer. This makes using SQLite databases very easy, as each database is simply a file that can be copied and shared. The file iris.sqlite contains the iris data set in the format above.

Using SQLite in Python is very easy. To get started, we import the sqlite3 library and connect to our database file, and create a cursor that can query the database

import sqlite3

conn = sqlite3.connect('iris.sqlite')
c = conn.cursor()

To begin, let’s retrieve some data

c.execute('SELECT * FROM IrisData LIMIT 10')
for row in c:
    print(row)

This will print:

(1, 5.1, 3.5, 1.4, 0.2, 'Iris-setosa')
(2, 4.9, 3.0, 1.4, 0.2, 'Iris-setosa')
(3, 4.7, 3.2, 1.3, 0.2, 'Iris-setosa')
(4, 4.6, 3.1, 1.5, 0.2, 'Iris-setosa')
(5, 5.0, 3.6, 1.4, 0.2, 'Iris-setosa')
(6, 5.4, 3.9, 1.7, 0.4, 'Iris-setosa')
(7, 4.6, 3.4, 1.4, 0.3, 'Iris-setosa')
(8, 5.0, 3.4, 1.5, 0.2, 'Iris-setosa')
(9, 4.4, 2.9, 1.4, 0.2, 'Iris-setosa')
(10, 4.9, 3.1, 1.5, 0.1, 'Iris-setosa')

If we wanted to select some from the versicolor class we could write

c.execute('SELECT * FROM IrisData WHERE IrisClass = "Iris-versicolor" LIMIT 10')
for row in c:
    print(row)

which prints:

(51, 7.0, 3.2, 4.7, 1.4, 'Iris-versicolor')
(52, 6.4, 3.2, 4.5, 1.5, 'Iris-versicolor')
(53, 6.9, 3.1, 4.9, 1.5, 'Iris-versicolor')
(54, 5.5, 2.3, 4.0, 1.3, 'Iris-versicolor')
(55, 6.5, 2.8, 4.6, 1.5, 'Iris-versicolor')
(56, 5.7, 2.8, 4.5, 1.3, 'Iris-versicolor')
(57, 6.3, 3.3, 4.7, 1.6, 'Iris-versicolor')
(58, 4.9, 2.4, 3.3, 1.0, 'Iris-versicolor')
(59, 6.6, 2.9, 4.6, 1.3, 'Iris-versicolor')
(60, 5.2, 2.7, 3.9, 1.4, 'Iris-versicolor')

PostgreSQL

PostgreSQL is an open-source database that understands SQL. PostgreSQL runs as a server, meaning that to use it you must have a process running on a computer. This is useful when you have a machine that is accessible online that must constantly be ready to serve up and store information. PostgreSQL is capable of storing and querying huge amounts of data efficiently (you will not exceed its capabilities in this course).

All of the above queries will run properly in PostgreSQL. The analog to sqlite3 is psychopg2 and has an almost identical interface.

Indexes

An index is an optimization that PostgreSQL (and most databases) implement to improve query performance. If there is a column you expect to match on frequently in a large data set, you should consider creating an index. The syntax is simply; to add an index for the Id and Class columns, we would write:

CREATE INDEX iris_data_id_index ON IrisData (Id);
CREATE INDEX iris_data_class_index ON IrisData (Class);

The iris data set is far too small to care about indexes, but if you have a large data set indexes can greatly improve query speed.

SQL Injection

SQL injection describes an adversary crafting inputs to your application that can leak or destroy data in your database. This is something to keep in mind when creating web services to show your results. It is even important to remember if you are only working locally since even inserting quoted data from a data set can trigger unexpected behavior.

At its core, SQL injection relies on the programmer executing Python code similar to the following:

import sqlite3

conn = sqlite3.connect('iris.sqlite')
c = conn.cursor()

query = 'INSERT INTO IrisData VALUES (1, 1, 1, 1, 1, "{}")'
c.execute(query.format(some_data))

Note that if some_data were to be the string:

bogus data"); DROP TABLE ImportantTable;--

we would end up executing the commands (with newlines added for clarity)

INSERT INTO IrisData VALUES (1, 1, 1, 1, 1, "bogus data");
DROP TABLE ImportantTable;
--")

which would destroy the table ImportantTable should it exist. Note that -- begins a comment, and so will remove any additional characters such as ").

Luckily it is easy to guard against SQL injection. Any modern SQL library will offer ways to properly sanitize data before it is entered in the database. In sqlite3 in Python, instead of using Python’s native string formatting we would instead using the ? character and pass the data to be formatted to the execute statement. To re-write our previous query in a safe fashion we would write

import sqlite3

conn = sqlite3.connect('iris.sqlite')
c = conn.cursor()

query = 'INSERT INTO IrisData VALUES (1, 1, 1, 1, 1, ?)'
c.execute(query, (some_data,))

Passing in the previous bad data would simply result in a row with a class name of bogus data"); DROP TABLE ImportantTable;--, since the execute method automatically and properly escapes all relevant characters in the provided data.