SQL¶
SQL stands for structured query language, and is used to express database queries. A large number of databases use the SL language, such as MySQL, PostgreSQL, MariaDB, and SQlite. We use a database to store structured data in a way that makes it easy to extract all or portions of that data.
Basic queries¶
In this section we will use Fisher’s iris plant data set, which has the following schema.
Id | SepalLength | SepalWidth | PetalLength | PetalWidth | IrisClass |
1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
We assume the table name is IrisData
To select all the data, we could run
SELECT *
FROM IrisData
To be more explicit about what we are selecting, we could write
SELECT (Id, SepalLength, SepalWidth, PetalLength, PetalWidth, IrisClass)
FROM IrisData
We can narrow down our selection to include only data related to the iris versicolor class with
SELECT (Id, SepalLength, SepalWidth, PetalLength, PetalWidth, IrisClass)
FROM IrisData
WHERE IrisClass = "Iris-versicolor"
If we wanted to count how many rows correspond to the setosa class we could write
SELECT COUNT(*)
FROM IrisData
WHERE IrisClass = "Iris-setosa"
In all of these queries the case of the keywords is ignored, but by convention we capitalize them.
SQLite¶
SQLite is an open-source database library that understands the SQL language. SQLite is a library, not a server, and so any application that wants to interact with an SQLite database must utilize a library to interact with it; the upside is that you don’t need a server process running on your computer. This makes using SQLite databases very easy, as each database is simply a file that can be copied and shared. The file iris.sqlite contains the iris data set in the format above.
Using SQLite in Python is very easy. To get started, we import the
sqlite3
library and connect to our database file, and create a
cursor that can query the database
import sqlite3
conn = sqlite3.connect('iris.sqlite')
c = conn.cursor()
To begin, let’s retrieve some data
c.execute('SELECT * FROM IrisData LIMIT 10')
for row in c:
print(row)
This will print:
(1, 5.1, 3.5, 1.4, 0.2, 'Iris-setosa')
(2, 4.9, 3.0, 1.4, 0.2, 'Iris-setosa')
(3, 4.7, 3.2, 1.3, 0.2, 'Iris-setosa')
(4, 4.6, 3.1, 1.5, 0.2, 'Iris-setosa')
(5, 5.0, 3.6, 1.4, 0.2, 'Iris-setosa')
(6, 5.4, 3.9, 1.7, 0.4, 'Iris-setosa')
(7, 4.6, 3.4, 1.4, 0.3, 'Iris-setosa')
(8, 5.0, 3.4, 1.5, 0.2, 'Iris-setosa')
(9, 4.4, 2.9, 1.4, 0.2, 'Iris-setosa')
(10, 4.9, 3.1, 1.5, 0.1, 'Iris-setosa')
If we wanted to select some from the versicolor class we could write
c.execute('SELECT * FROM IrisData WHERE IrisClass = "Iris-versicolor" LIMIT 10')
for row in c:
print(row)
which prints:
(51, 7.0, 3.2, 4.7, 1.4, 'Iris-versicolor')
(52, 6.4, 3.2, 4.5, 1.5, 'Iris-versicolor')
(53, 6.9, 3.1, 4.9, 1.5, 'Iris-versicolor')
(54, 5.5, 2.3, 4.0, 1.3, 'Iris-versicolor')
(55, 6.5, 2.8, 4.6, 1.5, 'Iris-versicolor')
(56, 5.7, 2.8, 4.5, 1.3, 'Iris-versicolor')
(57, 6.3, 3.3, 4.7, 1.6, 'Iris-versicolor')
(58, 4.9, 2.4, 3.3, 1.0, 'Iris-versicolor')
(59, 6.6, 2.9, 4.6, 1.3, 'Iris-versicolor')
(60, 5.2, 2.7, 3.9, 1.4, 'Iris-versicolor')
PostgreSQL¶
PostgreSQL is an open-source database that understands SQL. PostgreSQL runs as a server, meaning that to use it you must have a process running on a computer. This is useful when you have a machine that is accessible online that must constantly be ready to serve up and store information. PostgreSQL is capable of storing and querying huge amounts of data efficiently (you will not exceed its capabilities in this course).
All of the above queries will run properly in PostgreSQL. The analog
to sqlite3
is psychopg2
and has an almost identical interface.
Indexes¶
An index is an optimization that PostgreSQL (and most databases)
implement to improve query performance. If there is a column you
expect to match on frequently in a large data set, you should consider
creating an index. The syntax is simply; to add an index for the
Id
and Class
columns, we would write:
CREATE INDEX iris_data_id_index ON IrisData (Id);
CREATE INDEX iris_data_class_index ON IrisData (Class);
The iris data set is far too small to care about indexes, but if you have a large data set indexes can greatly improve query speed.
SQL Injection¶
SQL injection describes an adversary crafting inputs to your application that can leak or destroy data in your database. This is something to keep in mind when creating web services to show your results. It is even important to remember if you are only working locally since even inserting quoted data from a data set can trigger unexpected behavior.
At its core, SQL injection relies on the programmer executing Python code similar to the following:
import sqlite3
conn = sqlite3.connect('iris.sqlite')
c = conn.cursor()
query = 'INSERT INTO IrisData VALUES (1, 1, 1, 1, 1, "{}")'
c.execute(query.format(some_data))
Note that if some_data
were to be the string:
bogus data"); DROP TABLE ImportantTable;--
we would end up executing the commands (with newlines added for clarity)
INSERT INTO IrisData VALUES (1, 1, 1, 1, 1, "bogus data");
DROP TABLE ImportantTable;
--")
which would destroy the table ImportantTable
should it exist. Note
that --
begins a comment, and so will remove any additional
characters such as ")
.
Luckily it is easy to guard against SQL injection. Any modern SQL
library will offer ways to properly sanitize data before it is entered
in the database. In sqlite3
in Python, instead of using Python’s
native string formatting we would instead using the ?
character
and pass the data to be formatted to the execute statement. To
re-write our previous query in a safe fashion we would write
import sqlite3
conn = sqlite3.connect('iris.sqlite')
c = conn.cursor()
query = 'INSERT INTO IrisData VALUES (1, 1, 1, 1, 1, ?)'
c.execute(query, (some_data,))
Passing in the previous bad data would simply result in a row with a
class name of bogus data"); DROP TABLE ImportantTable;--
, since
the execute
method automatically and properly escapes all relevant
characters in the provided data.