Dear Salonnières,

Thanks for attending compu-salon episode 9, databases. Here's a summary of our discussion. This Friday we'll talk about high-performance Python---supercharging the, ehm, slithering?

Friday, Apr 6, 4pm, Tapir interaction room as usual. I'll send out a reminder Friday morning. Until then!

Michele

Compu-salon ep. 9 in summary

We talked about tabular data, as encoded in standard formats, and as stored in relational databases. It doesn't sound any cooler now than it did last Friday, but it's going to be useful some day, isn't it?

Reading and writing tabular data (and especially astronomical tables)

Simple text-based formats

[See this excellent tutorial for more info.]

These include simple tab-and-newline-separated files. By now you should know enough Python to handle these without breaking a sweat. It's a one-liner (but see the caveat at the end of this section):

>>> table = [line.split() for line in open('file.txt','r')]

This will get you a nested list of strings. If you know already that you're working with numbers, you can convert each element with float(); and if you want a numpy array, you can wrap the list comprehension with numpy.array().
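For instance, here's a minimal sketch for a whitespace-separated file of numbers (the file name data.txt is made up):

>>> import numpy

# convert each field to a float, then collect all the rows into a 2-D array
>>> table = numpy.array([[float(x) for x in line.split()]
...                      for line in open('data.txt','r')])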

Only a little harder are comma-separated (CSV) files, which are the pidgin of spreadsheets. For these, you'll be happier with the Python csv module:

>>> import csv
>>> [row for row in csv.reader(open('file1.csv','r'))]
[['a11', 'a12', 'a13'], ['a21', 'a22', 'a23']]

# if the first line of the file describes the columns (say A, B, C),
# you can get back a list of dictionaries
>>> [row for row in csv.DictReader(open('file2.csv','r'))]
[{'A': 'a11', 'B': 'a12', 'C': 'a13'}, {'A': 'a21', 'B': 'a22', 'C': 'a23'}]

# to write to a csv file, use csv.writer/csv.DictWriter and writerow/writerows
>>> w = csv.writer(open('file3.csv','w'))
>>> w.writerows([['a11','a12','a13'], ...])

>>> dw = csv.DictWriter(open('file4.csv','w'), fieldnames=['A','B','C'])
>>> dw.writeheader()
>>> dw.writerows([{'A': 'a11', 'B': 'a12', 'C': 'a13'}, ...])

Although they're not really for tabular data, I thought I would mention .ini files, which look like

# this is file example.ini

[section_name_1]
par1=1
par2=2

[section_name_2]
par1=3
par3=4

and are handled by the ConfigParser module in the Python standard library:

>>> import ConfigParser

# reading...
>>> config = ConfigParser.ConfigParser()
>>> config.read('example.ini')
>>> print config.get('section_name_1','par1')
1

# ...and writing (or writing back more parameters)
>>> config.add_section('section_name_3')
>>> config.set('section_name_3', 'par1', '5')
>>> config.write(open('example-modified.ini','w'))

If you're just writing configuration files for a Python application, you could also stick your parameter definitions in a Python .py file, and import it in your program.
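For instance (a minimal sketch; params.py and its parameters are made up):

# this is file params.py
par1 = 1
par2 = 'some string'

# ...and this is your program
>>> import params
>>> print params.par1
1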

Ah, let me mention one thing. If you come from C/C++/Fortran, you know that it's good practice to close files once you're done with them. But in the snippets above, I do no such thing. Indeed, I don't even store file handles in variables, but use them fleetingly in list comprehensions, or pass them directly to parsers. What happens is that Python closes files automatically as soon as they are no longer referenced: for instance, after the list comprehension is complete, or when the parser object holding the file is itself destroyed (because it goes out of scope, or because you delete it explicitly with del). However, if an exception is thrown partway through, the file may not be closed promptly.

Luckily Python (since v. 2.5) offers a very nifty way to handle the interplay of resource management and exceptions: context managers. For instance, for a file object we would do

>>> with open('file.txt','r') as f:
...     table = [line.split() for line in f]

This construction will close the file after the instructions in the indented block have completed. Furthermore, if something bad happens in one of the split()s, an exception would be thrown, but the file would still be closed. Quite elegant! (Another good place to use context managers is with multiprocessing synchronization objects such as Lock.)
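For instance, a minimal sketch (what goes inside the with block is just a placeholder):

>>> import multiprocessing
>>> lock = multiprocessing.Lock()

# entering the 'with' block acquires the lock; leaving it (even because
# of an exception) releases it
>>> with lock:
...     pass    # do something that must not run in two processes at once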

Tabular-data formats used in astronomy

Here I will just recommend a few packages:

  • For the traditional, NASA-endorsed FITS format, use PyFITS (see the sketch right after this list).
  • For the high-performance HDF5 format, you can use h5py or PyTables; the latter is optimized to handle large arrays very efficiently. (To install h5py you'll need to brew install hdf5; pip install h5py. To install PyTables, you'll need to pip install numexpr; pip install cython; pip install tables, all in your virtualenv of course.)
  • For VOTable, an XML format designed to exchange tabular data within the Virtual Observatory, you can use vo.table. (To install it, run pip install on the .tar.gz file.)
  • The package ATpy provides a one-stop shop for reading and writing FITS, HDF5, VOTable, ASCII, and standard database formats. It's OK for moderately large arrays, and maps tables very directly to numpy record arrays. (To use FITS, HDF5, and VOTable, ATpy needs PyFITS, h5py, and vo.table, respectively.)
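To give you the flavor of one of these packages, here's a minimal PyFITS sketch (the file name and the column name are made up):

>>> import pyfits

# open the FITS file; tables normally live in the first extension HDU
>>> hdus = pyfits.open('photometry.fits')
>>> table = hdus[1].data

# the table behaves like a numpy record array, so we can pull out
# a column by name
>>> ras = table.field('RAJ2000')

# close the file when we're done
>>> hdus.close()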

Since VOTable is an XML format (a big topic...), you may even want to load such files by handling XML directly. If you open a VOTable file, you'll see a nested structure of XML elements that looks like

<?xml version="1.0"?>
<VOTABLE xmlns="http://www.ivoa.net/xml/VOTable/v1.2">
  <RESOURCE name="CRTS lightcurves">
    <TABLE name="query results">
      <DESCRIPTION>Photometry data from CRTS</DESCRIPTION>
      <FIELD name="ID" datatype="char" [...] />
      <FIELD name="RAJ2000" datatype="float" unit="deg" [...] />
      [...]
      <DATA>
        <TABLEDATA>
          <TR>
            <TD>1135075045477</TD>
            <TD>254.45757</TD>
            <TD>35.34235</TD>
            [...]
          </TR>
          [...]
        </TABLEDATA>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>    

I'm simplifying, removing attributes, etc., but the important thing to see is the HTML-like table structure of cells (<TD>) nested within rows (<TR>). The best Python interface to handle such trees is ElementTree, so here's how we'd use it for the file above.

>>> import xml.etree.ElementTree as E

# create a tree object and parse the file
>>> tree = E.ElementTree()
>>> tree.parse('votablefile.xml')

# navigate the XML tree to get to the table
# since VOTable includes an XML namespace (set by the `VOTABLE` `xmlns`
# attribute), we can't navigate by `RESOURCE`, `TABLE`, etc.,
# but we need to use '{http://www.ivoa.net/xml/VOTable/v1.2}TABLE', etc.
>>> prefix = '{http://www.ivoa.net/xml/VOTable/v1.2}'
>>> path = '{0}RESOURCE/{0}TABLE/{0}DATA/{0}TABLEDATA'.format(prefix)
>>> table = tree.getroot().find(path)

# now use a nested list comprehension to get the data
# table works as an iterator that yields rows;
# each row works as an iterator that yields cells;
# then we take the textual content of each cell
>>> data = [[cell.text for cell in row] for row in table]

Working with databases

Well, this is another huge topic that I can't really do justice to here. You can start on wikipedia to learn about the history of databases, and especially about the now-dominant relational model. The main points are that:

  • In this context, a relation is just a table, consisting of records (rows); each row consists of fields (columns). So for instance the table
A B C
1 3 5
2 4 6
... ... ...

would correspond to the statements that "A=1, B=3, C=5" are simultaneously true; "A=2, B=4, C=6" are simultaneously true, and so on. This formal interpretation allows a very powerful mathematics of tables that is implemented by SQL, the Structured Query Language.

  • Relational databases emphasize search rather than navigation. The records have no identity other than the values of their fields, although it's customary to include primary keys, typically integers, that uniquely identify each record. For the same reason, tables are not linked together by pointers, but only by values. To reiterate,

[In a relational database the] data is stored in tables and the relationships among the data are also stored in tables. The data can be accessed or reassembled in many different ways without having to change the table forms. (wikipedia)

  • The major database software applications have been under development for multiple decades now, and they are very robust, enterprise-grade systems that support concurrent transactions by multiple agents (think credit-card processing) and emphasize reliability, fault tolerance, and the preservation of uncorrupted data (again think credit-card processing). Formally, this is ensured by implementing atomic, consistent, isolated, and durable (ACID) transactions.

The proper uses of databases are storing lots of complex structured data, and exploring and remixing it. You can do this with mouse-oriented applications such as Microsoft Access, or with the database lingua franca, SQL. To get started on SQL, check out the Software Carpentry screencasts.

Databases and GWs

Here I will go quickly through what we did in class. We continued last week's thread on matched filtering of gravitational-wave detector data, and implemented a crude coincidence search. To start, we made some simulated data (signal + noise), created a bank of signal templates (with a single parameter, chirp mass), and generated time-clustered lists of triggers with SNR higher than a given threshold:

# --- begin Python script coincident.py ---

import random, numpy as N

# filtering.py (see below) contains basic template-generation and
# matched-filtering routines
import filtering as F

# simulate the GW chirp from a binary inspiral, take it to the time domain
# (normalization is arbitrary here because I'm not too careful in filtering.py)
sig = 1e6 * N.fft.irfft(F.normtemplate(1.2))

# make noise for two interferometers (h1 and l1 for LIGO Hanford and Livingston)
h1 = N.random.randn(2**20)
l1 = N.random.randn(2**20)

# add the chirp at a random time in h1...
# (leaving enough room at the end of the array for the whole chirp)
index = random.randint(500,2**20 - len(sig) - 500)
h1[index:index+len(sig)] += sig

# ...and in l1, delayed by a random time that represents light propagation
delay = random.randint(-500,500)
l1[index+delay:index+delay+len(sig)] += sig

# take the data (noise + signal) to the frequency domain, where we do our filtering 
h1_f = N.fft.rfft(h1)
l1_f = N.fft.rfft(l1)

# time-cluster a full time series of SNRs (for a single template) by taking
# only the strongest trigger in each window; return a two-column array (t,SNR) 
def cluster(snrs,window=128):
    n = len(snrs)

    # reshape data to have the SNRs in each window on separate rows
    s = snrs.reshape((n/window,window))

    # generate the corresponding time data
    t = N.arange(n).reshape((n/window,window))

    res = []
    # iterate simultaneously over the rows of the two arrays
    for rows,rowt in zip(s,t):
        # append to 'res' the time and SNR of the strongest trigger in this window
        i = N.argmax(rows)
        res.append((rowt[i],rows[i]))

    # convert the list to a numpy array before returning it
    return N.array(res)

# obtain the time triggers for template `temp` against `data`,
# cluster and threshold them
def triggers(data,temp,threshold=5,window=128):
    snrs = F.snr(data,temp)
    clustered = cluster(snrs,window)

    # we select triggers above threshold by using numpy boolean indexing
    return clustered[clustered[:,1] > threshold]

# make a template bank---to save computing time, we'll make a narrow bank
# around the true chirp mass 
Mcs = N.arange(1.19,1.21,0.0005)
bank = [(Mc,F.normtemplate(Mc)) for Mc in Mcs]

# return a list of clustered triggers above threshold from all templates
def banktriggers(data,bank,threshold=5,window=128):
    res = []

    # iterate over the templates
    for (Mc,temp) in bank:
        # iterate over the thresholded triggers from each template
        for trig in triggers(data,temp,threshold,window):
            res.append((trig[0],trig[1],Mc))

    return N.array(res)

# run the search
h1_t = banktriggers(h1_f,bank)    
l1_t = banktriggers(l1_f,bank)

Cool. Of course you followed all that, huh? Now we have two arrays h1_t and l1_t with the clustered, thresholded triggers for the two interferometers. Each array has three columns (time, SNR, chirp mass). We will insert the triggers in a database, using the agile SQLite, which does not require a server and keeps all tables in a file. In Python, we operate on SQLite files with the sqlite3 package, now incorporated in the standard library. Note however that the interface would look the same even for more powerful database engines.

In fact, all we're going to be doing is passing SQL commands to the database. Oh yes, you'll be talking SQL now. How badass is that? Seriously, modern scientific computing is all about gluing together different languages and applications, and Python provides excellent glue. SQL is no exception.

# --- Python script coincident.py continues... ---

import math, sqlite3 as S    # math is needed below for the rms function

# 'connect' to the SQLite database
d = S.connect('triggers.sqlite')

# run SQL queries to create tables with four columns: a unique index,
# and the time, SNR, and chirp mass of each trigger
d.execute("""
CREATE TABLE h1 ('i' INTEGER PRIMARY KEY,'t' FLOAT,'snr' FLOAT,'Mc' FLOAT);
""")
d.execute("""
CREATE TABLE l1 ('i' INTEGER PRIMARY KEY,'t' FLOAT,'snr' FLOAT,'Mc' FLOAT);
""")

# iterate over the rows of h1_t and l1_t, and insert each row in the database
# we pass NULL for the PRIMARY KEY index, but SQLite will autogenerate it
for t in h1_t:
    d.execute("INSERT INTO h1 VALUES(NULL,?,?,?);",t)
for t in l1_t:
    d.execute("INSERT INTO l1 VALUES(NULL,?,?,?);",t)

# make sure the transactions have been recorded to the database
d.commit()

Now we're ready to play with the trigger database, using more SQL queries. We can do so in Python (appending .fetchall() to the .execute() method to get back the output), or in a light-weight application such as SQLite Manager.

For instance, we may wish to see all the h1 triggers with an SNR slightly higher than the threshold:

SELECT * FROM h1 WHERE snr > 6;
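From Python, the same query would read as follows (continuing with the connection d from the script above; the loop at the end just prints the records):

# run the query, then retrieve all the resulting records
# as a list of tuples (one tuple per trigger)
rows = d.execute("SELECT * FROM h1 WHERE snr > 6;").fetchall()

for (i, t, snr, Mc) in rows:
    print i, t, snr, Mc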

We may wish to order the triggers by time, or by SNR (ASCending or DESCending):

SELECT * FROM h1 WHERE snr > 6 ORDER BY t;
SELECT * FROM h1 WHERE snr > 6 ORDER BY snr ASC;

We can also run aggregation queries, such as counting records, or taking the maximum value:

SELECT COUNT(*) FROM h1 WHERE snr > 6;
SELECT MAX(snr) FROM h1 WHERE snr > 6;

The true power of databases becomes more obvious when we run SQL queries that involve two tables. Just running

SELECT * FROM h1 JOIN l1;

returns the Cartesian product of the two tables: every row of h1 paired with every row of l1. Not very useful, so we need to add a condition. For instance, we may join only triggers with the same chirp mass

SELECT * FROM h1 JOIN l1 ON h1.Mc == l1.Mc;

or, even better, we may join triggers where both mass and time sit together in narrow windows:

SELECT * FROM h1 JOIN l1 ON (ABS(h1.Mc - l1.Mc) < 0.01 AND ABS(h1.t - l1.t) < 500);

Let us now create a new coincidence table that collects all the joint events together with their total SNR; we can then apply a threshold on the total SNR when we query the table. Doing that is easier from Python, where we can define a new SQL function rms to compute the total SNR, and use it in our queries. We then create the coincidence table c, and feed the rms-computing query straight to an INSERT into c.

# --- Python script coincident.py continues... ---

# make a simple Python function available to SQL (we use a _lambda_
# anonymous function, while the '2' is for two arguments)
d.create_function("rms",2,lambda x,y: math.sqrt(x**2 + y**2))

# try out a query that will return the primary keys and the total SNR
# for trigger pairs that satisfy the coincidence conditions on Mc and t
print d.execute("""
SELECT
    h1.i, l1.i, rms(h1.snr,l1.snr)
FROM h1 JOIN l1 ON
    (ABS(h1.Mc - l1.Mc) < 0.01 AND ABS(h1.t - l1.t) < 500);
""").fetchall()

# now create the coincidence table
d.execute("CREATE TABLE c ('i' INTEGER PRIMARY KEY, 'ih1' INTEGER, 'il1' INTEGER, 'snr' FLOAT)")

# and populate it (in SQL) from the same query as above
d.execute("""
INSERT INTO c SELECT
    NULL, h1.i, l1.i, rms(h1.snr,l1.snr)
FROM h1 JOIN l1 ON
    (ABS(h1.Mc - l1.Mc) < 0.01 AND ABS(h1.t - l1.t) < 500);
""")

d.commit()

The coincidence table includes only the keys of the individual triggers that make up a coincidence, but we can recover that information with appropriate JOINs on the primary keys:

SELECT
    c.snr, h1.Mc, l1.Mc
FROM c JOIN h1 ON
    (c.ih1 == h1.i)
JOIN l1 ON
    (c.il1 == l1.i)
WHERE
    (c.snr > 10)
ORDER BY
    c.snr;

And for the final flourish, we get the two chirp masses for the coincidences with the maximum total SNR, grouped by one of the chirp masses:

SELECT
    c.snr, h1.Mc, l1.Mc
FROM c JOIN (SELECT
                 max(c.snr) AS maxc
             FROM c JOIN h1 ON
                 (c.ih1 == h1.i)
             JOIN l1 ON
                 (c.il1 == l1.i)
             GROUP BY h1.Mc) ON
    c.snr = maxc
JOIN h1 ON
    (c.ih1 == h1.i)
JOIN l1 ON
    (c.il1 == l1.i)
ORDER BY c.snr ASC, h1.Mc;

Ta-da! On second thought, ouch! Because of the formal mathematical structure of the relational model, one can't run a simple query for the maximum of a field and for the value of the other fields where the maximum is achieved. Instead, we use a nested SELECT to build a temporary table with the maximum total SNRs, grouped by h1.Mc, and match them to total SNRs as recorded in c.

Indeed, SQL can be complicated and counterintuitive. It has many powerful features that we haven't even touched on here. On the upside, for likely scientific applications a few basic commands will get you a long way.

That's all folks! For your reference, here's filtering.py, slightly modified from last week.

# --- begin Python module filtering.py ---

# make division between integers return a float, so we avoid the classic
# "C" bug that 3/5 = 0; this is the default behavior in Python 3, by the way
from __future__ import division

import math, numpy as N

# we'll need a couple of constants
Msun = 4.9257e-6    # Mass of the Sun in seconds
Mpcs = 1.02927e14   # megaparsec in seconds

# we work with a basic array of frequencies from 0 to fmax;
# we take one more element than a power of two, so that the inverse real FFT 
# has 2^N elements; this is consistent with the standard arrangement
# of frequencies in numerical FFTs, which goes as
# [0,df,2*df,...,f_Nyquist,-(f_Nyquist-df),..,-df]

fmax = 1024
f    = N.linspace(0,fmax,2**16 + 1) # enough samples to cover chirp mass
                                    # between 0.8 and 1.75 Msun
df   = f[1] - f[0]                  # frequency spacing

fmin = 40
def template(Mc):
    """Return a 'Newtonian' chirp with chirp mass Mc [Msun],
    as defined in the frequency domain by the stationary-phase approximation.
    See, e.g., Maggiore (2007)."""

    # by operating on f[1:] we avoid taking "difficult" powers at f = 0
    phasing   = (3/128) * (math.pi * Mc * Msun * f[1:])**(-5/3)
    # the last factor (a Boolean) sets to 0 all elements with f < fmin
    amplitude = math.sqrt(5/24) * math.pi**(-2/3) * (Mc * Msun)**(5/6) * f[1:]**(-7/6) * (f[1:] > fmin)

    # but we include f = 0 in the final resulting array
    waveform = N.zeros(len(f),'complex128')    
    waveform[1:] = amplitude * N.exp(-1j * phasing)

    return waveform

sigma2 = 1

def normsq(signal):
    """Squared "noise" norm of a signal (with noise variance = 1)."""
    return (4/df) * N.sum(N.abs(signal)**2) / sigma2

def normtemplate(Mc):
    """Return a normalized template by calling template and normsq."""
    t = template(Mc)
    return t / math.sqrt(normsq(t))

def snr(signal,temp):
    """Return the SNR time series for the signal, filtered against template.
    Expects rffts of both signal and template. Template will be padded if needed."""

    # time-domain padding
    if len(signal) > len(temp):
        temp_t = N.fft.irfft(temp)
        temp_pad = N.concatenate((temp_t,N.zeros(2 * (len(signal) - 1) - len(temp_t),'d')))
        temp = N.fft.rfft(temp_pad)

    snr = (4/df) * N.real(N.fft.irfft(N.conj(signal) * temp)) / sigma2

    # the FFT returns the reversed time series of SNR... we'll just accept this
    # there's also a -1 offset, we will also neglect that
    return snr[::-1]