
A Critique of The Programming Language J

I’ve spent around a year now fiddling with and eventually doing real
data analytic work in The Programming Language J. J is one of
those languages which produces a special enthusiasm from its users and
in this way it is similar to other unusual programming languages like
Forth or Lisp. My peculiar interest in the language was due to no
longer having access to a Matlab license, wanting an array oriented
language to do analysis in, and an attraction to brevity and the point
free programming style, two aspects of programming which J emphasizes.

Sorry, Ken.

I’ve been moderately happy with it, but after about a year of light
work in the language and then a month of work-in-earnest (writing
interfaces to gnuplot and hive and doing Bayesian inference and
spectral clustering) I now feel I am in a good position to offer a
friendly critique of the language.

First, The Good

J is terse to nearly the point of obscurity. While terseness is not a
particularly valuable property in a general purpose programming
language (that is, one meant for Software Engineering), there is a
case to be made for it in a data analytical language. Much of my work
involves interactive exploration of the structure of data and for that sort
of workflow, being able to quickly try a few different ways of
chopping, slicing or reducing some big pile of data is pretty
handy. That you can also just copy and paste these snippets into some
analysis pipeline in a file somewhere is also nice. In other words,
terseness allows an agile sort of development style.

Much of this terseness is enabled by built in support for tacit
programming. What this means is that certain expressions in J are
interpreted at function level. That is, they denote, given a set of
verbs in a particular arrangement, a new verb, without ever explicitly
mentioning values.

For example, we might want a function which adds up all the maximum
values selected from the rows of an array. In J:

+/@:(>./"1)

J takes considerable experience to read, particularly in tacit
style. The above denotes, from RIGHT to LEFT: for each row ("1)
reduce (/) that row using the maximum operation >. and then (@:)
reduce (/) the result using addition (+). In English, this means:
find the max of each row and sum the results.

Note that the meaning of this expression is itself a verb, that is
something which operates on data. We may capture that meaning:

sumMax =: +/@:(>./"1)

Or use it directly:

+/@:(>./"1) ? (10 10 $ 10)

Tacit programming is enabled by a few syntactic rules (the so-called
hooks and forks) and by a bunch of function-level operators called
adverbs and conjunctions. (For instance, @: is a conjunction roughly
denoting function composition, while the expression +/ % # is a fork
denoting the average operation. The forkness is that it is three
expressions denoting verbs separated by spaces.)
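
For concreteness, evaluating the average fork on a short list shows how the three verbs combine (these are all J primitives, nothing project-specific):

(+/ % #) 1 2 3 4   NB. a fork: (+/ 1 2 3 4) % (# 1 2 3 4), i.e. 10 % 4 = 2.5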

The details obscure the value: it’s nice to program at function level
and it is nice to have a terse denotation of common operations.

J has one other really nice trick up its sleeve, called verb
rank. Rank itself is not an unusual idea in data analytic languages:
it just refers to the length of the shape of an array; that is, its
dimensionality.

We might want to say a bit about J’s basic evaluation strategy before
explaining rank, since it makes the origin of the idea more clear. All
verbs in J take either one argument (on the right) or two (one on each
side). Single argument verbs are called monads, two argument verbs are
called dyads, and we call the corresponding invocations monadic or
dyadic. Most of J’s built-in operators have both a monadic and a
dyadic meaning, and often the two meanings are unrelated.

NB. monadic and dyadic invocations of <
4 < 3 NB. evaluates to 0
<3 NB. evaluates to 3, but in a box.

Given that the arguments (usually called x and y respectively) are
often matrices it is natural to think of a verb as some sort of matrix
operator, in which case it has, like any matrix operation, an expected
dimensionality on its two sides. This is sort of what verb rank is
like in J: the verb itself carries along some information about how
its logic operates on its operands. For instance, the built-in verb
-: (called match) compares two things structurally. Naturally, it
applies to its operands as a whole. But we might want to compare two
lists of objects via match, resulting in a list of results. We can
do that by modifying the rank of -:

x -:"(1 1) y

The expression -:"(1 1) denotes a version of match which applies to
the elements of x and y, each treated as a list. Rank in J is roughly
analogous to the use of repmat, permute and reshape in Matlab: we can
use rank annotations to quickly describe how verbs operate on their
operands, in hopes of pushing looping down into the C engine, where
it can be executed quickly.
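
A small, made-up example illustrates the difference rank makes (x and y here are hypothetical 2 by 3 tables):

x =: 2 3 $ 1 2 3 4 5 6
y =: 2 3 $ 1 2 3 0 0 0

x -: y         NB. compares the tables as wholes: 0
x -:"(1 1) y   NB. compares row by row: 1 0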

To recap: array orientation, terseness, tacit programming and rank are
the really nice parts of the language.

The Bad and the Ugly

As a programming environment J can be productive and efficient, but it
is not without flaws. Most of these have to do with irregularities in
the syntax and semantics which make the language confusing without
offering additional power. These unusual design choices are
particularly apparent when J is compared to more modern programming
languages.

Fixed Verb Arities

As indicated above, J verbs, the nearest cousin to functions or
procedures from other programming languages, have arity 1 or
arity 2. A single symbol may denote verbs of both arities, in
which case context determines which function body is executed.

There are two issues here, at least. The first is that we often want
functions of more than two arguments. In J the approach is to pass
boxed arrays to the verb. There is some syntactic sugar to support
this strategy:

multiArgVerb =: monad define
'arg1 arg2 arg3' =. y
NB. do stuff
)

If a string appears as the left operand of the =. operator, then
simple destructuring occurs. Boxed items are unboxed by this
operation, so we typically see invocations like:

multiArgVerb('a string';10;'another string')

But note that the expression on the right (starting with the open
parenthesis) just denotes a boxed array.
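
A complete, if contrived, sketch of the pattern (the verb name and arguments are hypothetical):

describe =: monad define
  'count name suffix' =. y   NB. each boxed item is unboxed into its own name
  (": count) , ' ' , name , suffix
)

describe (3 ; 'apple' ; 's')   NB. -> '3 apples'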

This solution is fine, but it does short-circuit J’s notion of verb
rank: we may specify the rank with which the function operates on
its left or right operand as a whole, but not on the individual
“arguments” of a boxed array. But nothing about the concept of rank
demands that it be restricted to one or two argument functions: rank
entirely relates to how arguments are extracted from array valued
primitive arguments and dealt to the verb body. This idea can be
generalized to functions of arbitrary argument count.

Apart from this, there is the minor gripe that denoting such single
use boxed arrays with ; feels clumsy. Call that the Lisper’s bias:
the best separator is the space character.1

A second, related problem is that you can’t have a
zero argument function either. This isn’t the only language where
this happens (Standard ML and OCaml also have this tradition, though I
think it is weird there too). The problem in J is that it would feel
natural to have such functions and to be able to mention them.

Consider the following definitions:

o1 =: 1&-
o2 =: -&1

(o1 (0 1 2 3 4)); (o2 (0 1 2 3 4))
┌────────────┬──────────┐
│1 0 _1 _2 _3│_1 0 1 2 3│
└────────────┴──────────┘

So far so good. Apparently using the & conjunction (called “bond”)
we can partially apply a two-argument verb on either the left or the
right. It is natural to ask what would happen if we bonded twice.

(o1&1)
o1&1

Ok, so it produces a verb.

 3 3 $ ''
  ;'o1'
  ;'o2'
  ;'right'
  ;((o1&1 (0 1 2 3 4))
  ; (o2&1 (0 1 2 3 4))
  ;'left'
  ; (1&o1 (0 1 2 3 4))
  ; (1&o2 (0 1 2 3 4)))

┌─────┬────────────┬────────────┐
│     │o1          │o2          │
├─────┼────────────┼────────────┤
│right│1 0 1 0 1   │1 0 _1 _2 _3│
├─────┼────────────┼────────────┤
│left │1 0 _1 _2 _3│_1 0 1 2 3  │
└─────┴────────────┴────────────┘

I would describe these results as goofy, if not entirely impossible to
understand (though I challenge the reader to do so). However, none of
them really seem right, in my opinion.

I would argue that one of two possibilities would make some sense.

  1. (1&-)&1 -> 0 (e.g., 1-1)
  2. (1&-)&1 -> 0"_ (that is, the constant function returning 0)

That many of these combinations evaluate to o1 or o2 is doubly
confusing because it ignores a value AND because we can denote
constant functions (via the rank conjunction), as in the expression
0"_.

Generalizations

What this is all about is that J doesn’t handle the idea of a
function very well. Instead of having a single, unified abstraction
representing operations on things, it has a variety of different ideas
that are function-like (verbs, conjunctions, adverbs, hooks, forks,
gerunds). In a way this puts it ahead of a lot of old-timey languages
like Java 7, which lacks first-class functions, but ultimately this
handful of disparate techniques fails to achieve the conceptual unity
of first-class functions with lexical scope.

Furthermore, I suggest that nothing whatsoever would be lost (except
J‘s interesting historical development) by collapsing these ideas
into the more typical idea of closure capturing functions.

Other Warts

Weird Block Syntax

Getting top-level2 semantics right is hard in any
language. Scheme is famously ambiguous on the subject, but at
least for most practical purposes it is comprehensible. Top-level has
the same syntax and semantics as any other body of code in Scheme
(with some restrictions about where define can be evaluated), but in
J neither is the same.

We may write block strings in J like so:

blockString =: 0 : 0 
Everything in here is a block string.       
)

When the evaluator reads 0 : 0 it switches to sucking up characters
into a string until it encounters a line with a ) as its first
character. The analogous expression 3 : 0 does the same except the
resulting string becomes the body of a verb (dyad define, used below,
is the corresponding form for two-argument verbs).

plus =: dyad define
    x + y
)

However, we can’t nest this syntax, so we can’t define non-tacit
functions inside non-tacit functions. That is, this is illegal:

plus =: dyad define
  plusHelper =. dyad define
    x + y
  )
  x plusHelper y
)

This forces the programmer to do a lot of lambda lifting manually,
which in turn runs them into the restrictions on function arity and
their poor interaction with rank behavior: if we wish to capture parts
of the private environment, we are forced to pass those parts of the
environment in as an argument, giving up rank behavior, or to jump up
a level to verb modifiers.

Scope

Of course, you can define local functions if you do it tacitly:

plus =: dyad define
    plusHelper =. +
    x plusHelper y
)

But even if you are defining a conjunction or an adverb, from which
you are able to “return” a verb, you can’t capture any local functions –
they disappear as soon as execution leaves the conjunction or adverb
scope.

That is because J is dynamically scoped, so any capture has to be
handled manually, using things like adverbs, conjunctions, or the good
old-fashioned fix f., which inserts values from the current scope
directly into the representation of a function. Essentially all modern
languages use lexical scope, which is basically a rule which says: the
value of a variable is exactly what it looks like from reading the
program. Dynamic scope says: the value of the variable is whatever
its most recent binding is.
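
A minimal sketch of manual capture with fix (the names here are made up): f. replaces the names inside a tacit definition with their current values, so the resulting verb no longer depends on the environment it was defined in.

inc =: +&1
double =: *&2
incThenDouble =: double@:inc

incThenDouble f.   NB. a verb equivalent to *&2@:(+&1), with no reference to inc or double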

Recapitulation!

The straight dope, so to speak, is that J is great for a lot of
reasons (terseness, rank) but also has a lot of irregular language
features (adverbs, conjunctions, hooks, forks, etc.) which could all
be folded down into regular old functions without harming the
benefits of the language, simplifying it enormously.

If you don’t believe that regular old first-class functions with
lexical scope can get us where we need to go, check out my
tacit-programming libraries in R and Javascript. I
even wrote a complete, if ridiculously slow, implementation of J’s
rank feature, literate-style, here.


Footnotes

1 It bears noting that ; in an expression like (a;b;c)
is not a syntactic element, but a semantic one. That is, it is the
verb called “link” which has the effect of linking its arguments into
a boxed list. It is evaluated like this:

(a;(b;c))

(a;b;c) is nice looking but a little strange: in an expression
(x;y) the effect depends on whether y is already boxed: x is always boxed regardless, but y is boxed only if it wasn’t boxed before.

2 Top level? Top-level is the context where everything
“happens,” if anything happens at all. Tricky questions about top-level
include: can functions refer to functions which are not yet defined,
if you read a program from top to bottom? What about values? Can you
redefine functions, and if so, how do the semantics work? Do functions
which call the redefined function change their behavior, or do they
continue to refer to the old version? What if the calling interface
changes? Can you check types if you imagine that functions might be
redefined at any time? If your language has classes, what about
instances created before a change in the class definition? Believe it
or not, Common Lisp tries to let you do all this – and it’s confusing!

On the opposite end of the spectrum are really static languages like
Haskell, wherein type enforcement and purity ensure that the top-level
is only meaningful as a monolith, for the most part.


Notes on Quantum Computing Since Democritus, Chapter 1

For a long time, I’ve been interested in the sorts of questions exemplified by the following example:

Suppose we are Isaac Newton or  Gottfried Leibniz. We have at our disposal two sources of inspiration: data, collected by intrepid philatelists like Tycho Brahe and something like theory, in the form of artifacts like Kepler’s Laws, Galileo’s pre-Newtonian laws of motion (for it was he who first suggested that objects in motion retain that motion unless acted upon), and a smattering of Aristotelian and post-Aristotelian intuitions about motion (for instance, John Philoponus’ notion that, in addition to the rules of motion described by Aristotle, one object could impart on another a transient impetus). You also have tables and towers and balls you can roll on them or drop from them. You can perform your own experiments.

The question, then, is how do you synthesize something like Newton’s Laws. Jokes about Newton’s extra-scientific interests aside, this is alchemy indeed, and an alchemy to which the training most physicists receive (or at least the training I received) does not address itself.

Newton’s Laws are generally dropped on the first year physics student (perhaps after working with statics for awhile) fully formed:

First law: When viewed in an inertial reference frame, an object either remains at rest or continues to move at a constant velocity, unless acted upon by an external force.
Second law: The vector sum of the external forces F on an object is equal to the mass m of that object multiplied by the acceleration vector a of the object: F = ma.
Third law: When one body exerts a force on a second body, the second body simultaneously exerts a force equal in magnitude and opposite in direction on the first body.

(this formulation borrowed from Wikipedia)

The laws are stated here in terms of a lot of subsidiary ideas: inertial reference frames, forces, mass. Neglecting the reference to mathematical structures (vector sums), this is a lot to digest: and it is hard to imagine Newton just pulling these laws from thin air.  It took the species about 2000 years to figure it out (if you measure from Zeno to Newton, since Newton’s work is in some sense a practical rejoinder to the paradoxes of that pre-Socratic philosopher), so it cannot be, as some of my colleagues have suggested, so easy to figure out.

A doctorate in physics takes (including the typical four year undergraduate degree in math, physics or engineering) about ten years. Most of what is learned in such a program is pragmatic theory: how to take a problem statement or something even more vague, identify the correct theoretical approach from a dictionary of possibilities, and then to “turn the crank.” It is unusual (or it was unusual for me) for a teacher to spend time posing more philosophical questions. Why, for instance, does a specific expression called the “Action,” when minimized over all possible paths of a particle, find a physical path? I’ve had a lot of physicist friends dismiss my curiosity about this subject, but I’m not the only one interested (e.g., the introductory chapter of Lanczos’ “The Variational Principles of Mechanics”).

What I am getting to here, believe it or not, is that I think physicists are over-prepared to work problems and under-prepared to do the synthetic work of building new theoretical approaches to existing unsolved problems. I enjoy the freedom of having fallen from the Ivory Tower, and I aim to enjoy that freedom in 2016 by revisiting my education from a perspective which allows me to stop and ask “why” more frequently and with more intensity.

Enter Scott Aaronson’s “Quantum Computing Since Democritus,” a book whose title immediately piqued my interest, combining, as it does, the name of a pre-Socratic philosopher (whose questions form the basis, in my opinion, of so much modern physics) with the most modern and pragmatic of contemporary subjects in physics. Aaronson’s project seems to accomplish exactly what I want as an armchair physicist: stopping to think about what our theories really mean.

To keep myself honest, I’ll be periodically writing about the chapters of this book – I’m a bit rusty mathematically and so writing about the work will encourage me to get concrete where needed.

Atoms and the Void

Atoms and the Void is a short chapter which basically asks us to think a bit about what quantum mechanics means. Aaronson describes Quantum Mechanics in the following way:

Here’s the thing: for any isolated region of the universe that you want to consider, quantum mechanics describes the evolution in time of the state of that region, which we represent as a linear combination – a superposition – of all the possible configurations of elementary particles in that region. So, this is a bizarre picture of reality, where a given particle is not here, not there, but in a sort of weighted sum over all the places it could be. But it works. As we all know, it does pretty well at describing the “atoms and the void” that Democritus talked about.

The needs of an introductory chapter, I guess, prevent him from describing how peculiar this description is: for one thing, there is never an isolated region of the universe (or at least, not one we are interested in, I hope obviously). But he goes on to meditate on this anyway by asking us to think about how we interpret measurement where quantum mechanics is concerned. He dichotomizes interpretations of quantum mechanics by where they fall on the question of putting oneself in coherent superposition.

Happily, he doesn’t try to claim that any particular set of experiments can definitively disambiguate different interpretations of quantum mechanics. Instead he suggests that by thinking specifically of Quantum Computing, which he implies gets most directly at some of the issues raised by debates over interpretation, we might learn something interesting.

This tantalizes us to move to chapter 2.

The Fox

Amen I say to you: the fox which lives
in the abandoned building near our cabin
means nothing at all. It bears no portents.
Molecule by molecule it tumbles
through space, much like the rest of us, a dense
cloud, a dense cloud, a dense cloud of
matter whispering itself the quiet
evocations of its own being! Or!

Maybe not.

It is brown. It stares at us silently,
while we sit there in our car watching it.
It is the brown of dead leaves, wet from rain
in late December, of mud, of the wood
of the empty building near our cabin.

Spectral Clustering in J

With Examples from Quantitative Neuroscience

In my never ending quest to understand hard, but useful stuff, I’ve been learning the Programming Language J. J is an array oriented language in the APL family. Matlab is a related language, but its APL-ness is quite attenuated compared to J. I know Matlab already, having used it quite extensively as a research scientist, so J is not utterly alien, but unlike Matlab, which presents a recognizable syntactic face to the user, J retains an APL-like syntax.

As you will see, this presents challenges. The best way to learn to use a tool is to use it, so I thought I would take the opportunity to investigate a clustering technique which would have been useful to me as a grad student (had I known about it): Spectral Clustering.

Spectral Clustering1 refers to a family of clustering approaches which operate on the assumption that local information around data points is more useful for the purpose of assigning those points to clusters than is their global arrangement. Typical clustering techniques, like k-means, make the assumption “points in a cluster will be near to each other.” Spectral clustering relaxes that assumption to “points in a cluster will be near to other points in their cluster.”

We will develop the spectral clustering algorithm and then apply it to a series of more interesting data sets.

Part 1: 1d Gaussian Clusters

We are going to work on exploring spectral clustering. In order to do that we need something to cluster. Our first data set will be trivial: made by combining two sets drawn from different normal distributions2.

load 'utils.ijs'
load 'km.ijs'
require 'graphics/plot'
require 'numeric'
require 'viewmat'
require 'math/lapack'
require 'math/lapack/geev'

clusterSize1 =: 100
e1d1 =: (0 0.01) gausian clusterSize1
e1d2 =: (13 0.01) gausian clusterSize1

e1labeled =:  (((clusterSize1$0),(clusterSize1$1))) ,: (e1d1, e1d2)
e1data =: 1 { e1labeled
e1labels =: 0 { e1labeled


NB. show the histogram of the data
NB. (_6 6 200) histPlot e1data

The commented line produces the following plot:

Example One Data Histogram

This kind of data is trivial to cluster with an algorithm like k-means3.



kmClustering  =: 2 km (((# , 1:) e1data) $ e1data)

+/ (kmClustering = e1labels)

Should give ~ 200, unless something unusual happens.

As I suggested in the introductory section, the magic of spectral clustering is that it works on a representation of data which favors local information about the relationships between objects. Hence we will need to create some sort of connectivity graph for our data points. There are lots of approaches, but the upshot is that entry (i,j) of our connectivity matrix must tell us whether it is likely that point i and point j are in the same cluster.

For this simple example, we will just use the Euclidean distance between points and a threshold: anything lower than the threshold is connected and anything higher is not.

NB. d is the distance between two vectors of equal dimension
d =: %:@:+/@:*:@:([ - ])

NB. ds is the table of all pairwise distances where each argument is a
NB. list of vectors.
ds =: d"(1 1)/

NB. convert a list of points to a list of 1d vectors
l2v =: (#@:],1:) $ ]

NB. ds for lists of points
ds1d =: l2v@:] ds l2v@:]

NB. the table of all distances of x
dof =: ] ds ]

NB. the same except for a list of points
dof1d =: ] ds1d ] 

connections1d =: [ <~ dof1d@:]

e1connections =: 1.0 connections1d e1data

NB. viewmat e1connections

The connections matrix looks like:

Example 1 connections matrix

Spectral clustering operates on a matrix called the graph Laplacian, which is related to the connectivity matrix. To get the Laplacian matrix L we subtract the connectivity matrix from something called the degree matrix. Explained plainly but without interpretation, the degree matrix just places the sum of all connections for a given vertex in our graph on the diagonal.

calcDegreeMat =:  (eye@:#) * +/"1 

The degree matrix just tells you how connected a given vertex is. To get the Laplacian Matrix you subtract the connectivity matrix from the degree matrix. This is conveniently defined in J:

calcLap =: calcDegreeMat - ]

For our first data set, this Laplacian looks like:

Example 1 Laplacian Matrix

Brief meditation will explain why the components of this matrix representing inter-object connections are so faint compared to those on the diagonal: the degree of each vertex in the graph is always at least as large as any one of its connections, being the sum thereof.
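
As a tiny sanity check (assuming the verbs defined above and the utilities from the footnotes are loaded), take three points where the first two fall within the threshold of each other and the third is isolated:

toy =: 0 0.5 10

1.0 connections1d toy   NB. rows: 1 1 0, 1 1 0, 0 0 1 (each point is within the threshold of itself)

calcLap (1.0 connections1d toy)
NB.  1 _1 0
NB. _1  1 0
NB.  0  0 0
NB. each row sums to zero, and the isolated third point contributes a row of zeros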

Now that we have the Laplacian I will perform:

Spectral Clustering

The paper A Tutorial Introduction to Spectral Clustering describes a family of spectral clustering strategies. We will implement just the first.

It involves the following steps:

  1. Find the graph Laplacian for your system.
  2. Find its eigenvalues and eigenvectors.
  3. Sort the latter by the former, in ascending order.
  4. Arrange the vectors as columns of a matrix.
  5. Truncate that matrix to a small number of columns, corresponding to the smaller eigenvalues.
  6. Treat the rows of the resulting matrix as vectors.
  7. Perform clustering, say k-means, on those vectors.

We can denote these succinctly in J:


eigensystem =: monad define 
  raw =. geev_jlapack_ y
  ii =. /:@:>@:(1&{) raw
  lvecs =. |: (ii { |: (0 unboxFrom raw))
  vals =. ii { (1 unboxFrom raw)
  rvecs =. |: (ii { |: (2 unboxFrom raw))
  lvecs ; vals ; rvecs  
)

eigenvectors =: >@:(0&{@:eigensystem)
eigenvalues =: >@:(1&{@:eigensystem)

takeEigenvectors =: |:@([ {. |:@:])

e1Lap =: calcLap e1connections
e1clustering =: 2 km 2 takeEigenvectors eigenvectors e1Lap

(+/ e1clustering = e1labels) NB. should be about 200

Example 1 Clustered

A boring, if resounding, success.

Example 2: Concentric Clusters

Consider the following data:

rTheta =: (([ * cos@:]) , ([ * sin@:]) )"(0 0)
e2clusterSize =: 100
e2d1 =: ((1,0.1) gausian e2clusterSize) rTheta ((0,2p1) uniform     e2clusterSize)
e2d2 =: ((4,0.1) gausian (2*e2clusterSize)) rTheta ((0,2p1) uniform     (2*e2clusterSize))
e2data =: e2d1, e2d2

Concentric Clusters

This is the sort of data set which is pathological for clustering procedures which work in the original coordinates of the data. Even traditional dimension reducing techniques like principal component analysis will fail to improve the performance (in fact, the principal components of this data set are analogous to x and y, modulo a rotation).

We could, of course, realize by inspection that this problem reduces to the first example if the angular component is thrown away and we cluster on the radial component only.

However, in general, we are working with data in n dimensions and such simplifications may not be apparent to us. Hence, we will proceed with this example set.

Again, we produce the connectivity matrix with a simple threshold: entry (i,j) of the connectivity matrix is 1 exactly when d(i,j) < epsilon.

NB. connections is the analog of connections1d for data which is
NB. already a list of vectors (it uses dof rather than dof1d)
connections =: [ <~ dof@:]

e2connections =: 1 connections e2data

e2 connectivity matrix

As we physicists like to say: now we just turn the crank:

e2Lap =: calcLap e2connections
e2ids =:  2 km 2 takeEigenvectors eigenvectors e2Lap

And we can plot these points labeled by ID:

e2 data clustered

Pretty neat - we have successfully clustered a data set which traditional techniques could not have dealt with, and we did it without manually figuring out a transform that simplified the data set.

Example 3: Neural Spike Trains

My interest in spectral clustering goes back to my graduate research, which involved, among other things, the automatic clustering of neural data. Neurons, in animals like us, anyway, communicate with one another using brief, large depolarizations of the voltage across their cell membranes. Neuroscientists call these events spikes and they literally look like large (~80 mV) spikes of voltage in the recording of the voltage across a cell membrane. These spikes travel down the neuron's axon and are transmitted through synapses to other neurons. There are other forms of communication between neurons, but it is generally assumed that spikes underlie much of the computations that neural networks perform.

Spikes are short (1-2 ms) and firing rates are generally low on this time scale, with a typical inter-spike period of 10-100 ms (depending a great deal on the inputs to the neuron). In order to understand how neurons work, neuroscientists repeat a stimulus over and over again and record the spike times relative to the beginning of the experiment.

This results in a data set like:

33.0  55.2  80.0
30.0  81.1
55.2  85.9

Where rows are trials and each number is a spike time for that trial. In J we can read this data as padded arrays:

spikes =: asArray 0 : 0
33.0  55.2  80.0
30.0  81.1
55.2  85.9
)

And J will pad it out with zeros.

 33   55.2   80
 30   81.1   0
 55.2 85.9   0
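
The nonzeros utility (defined in the utilities footnote below) recovers an individual trial from the padded array:

nonzeros 1 { spikes   NB. 30 81.1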

Under certain circumstances, during such experiments, neurons generate a small number of distinct spiking patterns. Assuming that these patterns may be of interest, much of my research focused on extracting them automatically. That is: given a data set of spike trains, extract the clusters of similar trains.

Unfortunately, spike trains are not vectors and so there is no convenient Euclidean distance measure one can easily apply to them. Luckily Victor and Purpura (1996) developed an edit-distance metric on spike trains by analogy to similar metrics on other non-vector data (like gene sequences).

The Victor-Purpura metric defines the distance between two spike trains as the minimum cost of distorting one spike train into the other using just two operations: sliding spikes in time to line up or deleting/adding spikes. Sliding a spike in time has the parameterized cost q*dt, where dt is the time difference, and adding or removing a spike has a cost of one.

The algorithm to find the minimum such distortion can be easily implemented:

vpNaive =: adverb define 
  :
  'q' =. m
  's1' =. nonzeros x 
  's2' =. nonzeros y
  'n1' =. #s1
  'n2' =. #s2 
  if. n1 = 0 do.
    n2
  elseif. n2 = 0 do.
    n1
  elseif. 1 do.
    'h1' =. {. s1
    'h2' =. {. s2
    'movecost' =. q * abs (h1-h2)
    if. movecost < 1.0 do.
      'restcost' =. ((}. s1) (q vpNaive) (}. s2))
      restcost + movecost 
    else.
      'restcost' =. ((}. s1) (q vpNaive) (}. s2))
      restcost + 1
    end.
  end.
)

although much more efficient implementations are possible. For our purposes, this will suffice.
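
A quick, hypothetical check of the adverb: with q = 0.1, turning the train 10 20 into 12 25 costs two small shifts, while turning 10 20 into 12 costs one shift plus one deletion.

(10 20) (0.1 vpNaive) (12 25)   NB. 0.2 + 0.5 = 0.7
(10 20) (0.1 vpNaive) 12        NB. 0.2 + 1 = 1.2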

With this definition, we can produce the similarity matrix for a given q using this J adverb:

spikeDistances =: adverb define   
  y ((m vpNaive)"(1 1) /) y
)

The choice of q is an important issue which we will neglect for now. See my doctoral thesis for insight into this question. The one sentence upshot is that q should generally be chosen to maximize the information content in the resulting similarity matrix.

Now let's create an example data set consisting of two distinct firing patterns. A firing pattern consists of a mean time for each "event" in the pattern, plus that event's "reliability," which is the probability that a spike actually appears in an event. An event may be localized at a very specific time, but spikes may only sometimes appear on a given trial in that event.

Consider:

NB. This produces a fake spike data set of Y trials with events
NB. specified by means, stds, and reliabilities, passed in as a
NB. rectangular matrix.
makeSpikes =: dyad define 
  'means stds reliabilities' =. x
  'n' =. y

  spikes =. (n,#means) $ means 
  spikes =. spikes + ((n,#stds) $ stds)*(rnorm (n,#stds))
  spikes =. spikes * ((runif (n,#reliabilities))<((n,#reliabilities) $   reliabilities))
  nonzeros"1 spikes
)

This function takes a description of the pattern, like:

e3pattern1 =: asArray 0 : 0 
  10 20 30
  1  1.5  1.5
  0.9 0.9 0.6
)

e3pattern2 =: asArray 0 : 0
  15 40
  1  1
  0.8 0.8
)

And returns a data set of N trials:

NB. Now we use makeSpikes to generate the actual data sets.
e3spikes1 =: e3pattern1 makeSpikes 50
e3spikes2 =: e3pattern2 makeSpikes 50

NB. and the utility `concatSpikes` to combine them.  This just makes
NB. sure the zero padding works.
e3spikes =: e3spikes1 concatSpikes e3spikes2

We can plot this data in the traditional fashion, as a "rastergram," where each spike is plotted with its time as the x-coordinate and its trial as its y-coordinate:

rastergram e3spikes

Example 3 Rastergram

It is interesting to note how difficult it can be to see these patterns by visual inspection if the trials are not sorted by pattern:

rastergram shuffle e3spikes

Example 3 Rastergram shuffled

Now let's cluster them:

NB. We produce the distance matrix
e3distances =: (0.8 spikeDistances) e3spikes

We use thresholding to generate our connectivity matrix:

e3connections =: e3distances < 2

And we turn the crank:

e3Lap =: calcLap e3connections
e3evs =: eigenvectors e3Lap
e3clustering =: 2 km (2 takeEigenvectors e3evs)

Finally, we can color code our rastergram by clustering:

pd 'reset'
pd 'type dot'
pd 'color blue'
pd 'title trial clustering'
pd spikesForPlotting (e3clustering = 0) # e3spikes

pd 'color red'
pd (#((e3clustering=0)#e3spikes)) spikesForPlotting ((e3clustering =   1) # e3spikes)

pd 'xcaption time'
pd 'ycaption trial'

pd 'show'

Example 3 Color Coded Rastergram


Footnotes

1

A Note On Simpler Clustering Algorithms

One can appreciate a family of clustering approaches by starting with Expectation Maximization: a two-step iterative process which attempts to find the parameters of a probability distribution assumed to underlie a given data set. When using this approach for clustering, one assumes that the data has hidden or missing dimensions: in particular, a missing label assigning each point to a cluster. Given some guess at those labels, you can calculate the parameters of the distribution. Using this estimate, you can modify the guess about the hidden labels. Repeat until convergence.

In practice, simplification over the general case can be found by assuming that the distribution is a "Gaussian Mixture Model," just a weighted combination of n-dimensional Gaussian distributions. Further simplifications manifest by applying ad-hoc measures of cluster "belonging" based on just the mean of each cluster (see fuzzy clustering) and finally just by assuming that objects belong to the cluster to which they are closest (as in k-means).

In all of these simplifications, clustering relies on some sort of centroid of each cluster as a way of estimating which object belongs to which cluster. But some data sets have clusters which are concentric. For instance, if one cluster is points between 10 and 20 units from the origin and another is points between 50 and 60 units from the origin.

In this case, all such attempts will fail unless the space is transformed. Spectral clustering transforms the space by considering topological information: that is, what objects are near or connected to what other objects, and throwing away large scale information.

In the concentric example above, the thing that all points in each cluster have in common is that they are near other points in their cluster. When our distributions have the property that the distance between two points in cluster A may be greater than the distance between some point in A and some point in B, spectral clustering gives a nice mechanism for identifying the clusters.

2

Utilities

Throughout the above I refer to the occasional utility. Here are the definitions.

load 'stats/distribs'
load 'numeric'
load 'graphics/plot'

gausian =: dyad define
 'm s' =. x
 m + (s*(rnorm y))
)

uniform =: dyad define 
 'mn mx' =. x
 mn + ((mx-mn)*(runif y))
)

shuffle =: (#@:] ? #@:]) { ] 

hist =: dyad define 
  'min max steps' =. x
  'data' =. ((y>:min) * (y<:max)) # y
  'quantized' =. >. (steps * ((data - min) % (max-min)))
  'indexes' =. i. steps
  'stepSize' =. (max-min)%steps
  1 -~ ((indexes, quantized) #/. (indexes, quantized))
)


curtail =: }: 

histPlot =: dyad define 
  h =. x hist y 
  'min max nsteps' =. x 
  halfStep =. (max-min)%(2*nsteps)
  'plotXs' =. halfStep + steps (min, max, nsteps)
  'xcaption position; ycaption count; title example 1 data' plot (curtail plotXs) ; h 
)


unboxFrom =: >@:([ { ])

dot =: +/ . *

dir =: monad define 
  list (4 !: 1) 0 1 2 3
)

xyForPlotting =: (0&{ ; 1&{)@|:

ten =: monad define 

10

)

abs =: (+ ` - @. (< & 0))"0

nonzeros =: ((>&0@:abs@:]) # ])

asArray =:  ". ;. _2

concatSpikes =: dyad define
  maxSpikes =. (1 { ($ x)) >. (1{($y))
  pad =. (maxSpikes&{.)"1
  (pad x), (pad y)
)

spikeIndexes =: |:@:(( 1&{@:$@:], #@:[) $ (i.@:#@:]) )
spikesForPlotting =: verb define 
  0 spikesForPlotting y
  :
  spikes =. ,y
  ii =. ,spikeIndexes y
  nzii =. -. (spikes = 0)
  (nzii # spikes) ; (x+(nzii # ii))
)

rastergram =: monad define
  pd 'reset'
  pd 'type dot'
  pd 'color red'
  pd 'title rastergram'
  pd 'xcaption time'
  pd 'ycaption trial number'
  pd spikesForPlotting y
  pd 'show'
)

offsetY =: ] + ((#@:] , 1&{@:$@:]) $ (0: , [))

eye =: (],]) $ ((1: + ]) {. 1:)  NB. identity matrix of order y

asOperator =: adverb define 
  m dot y
)

3

K-Means

An implementation of k-means in J is comically terse, even in an exploded, commented form:

NB. This file contains an implementation of k-means, a clustering
NB. algorithm which takes a set of vectors and a number of clusters
NB. and assigns each vector to a cluster.
NB.
NB. Example Data
NB. v1 =: ?  10 2 $ 10 
NB. v2 =: ?  3 2 $ 10
NB. v3 =: ? 100 2 $ 50
NB. trivialV =: 2 2 $ 10
NB.
NB. Examples
NB. 3 km v3

NB. produce a permutation of y.
permutation =: (#@[ ? #@[)

NB. shuffle list
NB. shuffle the list
shuffle =: {~ permutation

NB. nClusters drawIds vectors
NB. given a list of vectors and a number of clusters
NB. assign each vector to a cluster randomly
NB. with the constraint that all clusters are
NB. roughly equal in size.
drawIds =: shuffle@(#@] $ i.@[)

NB. distance between two vectors as lists
vd =: +/@:(*:@-)

NB. Give the distance between all vectors in x and y
distances =: (vd"1 1)/

NB. return the list of indices of the minimum values of the rows.
minI =: i. <./

NB. Given x a list of centers and y a list of vectors,
NB. return the labeling of the vectors to each center.
reId =: ((minI"1)@distances)~

NB. canonicalize labels 
NB. given a list of labels, re-label the labels so
NB. that 0 appears first, 1 second, etc
canonicalize =: ] { /:@~.

NB. (minI"1 v1 distances v2 ) 

vsum =: +/"1@|:
vm =: (# %~ (+/"1@|:))

calcCenters =: (/:@~.@[) { (vm/.)

NB. ids kmOnce vectors This performs the heavy lifting for K-means.
NB. Given a list of integral ids, one for each vector in the N x D
NB. vectors, this verb calculates the averages of the groups implied
NB. by those labels and re-assigns each vector to one of those
NB. averages, whichever it is closest to.

kmOnce =: dyad define
  ids =. x 
  vectors =. y
  centers =. ids calcCenters vectors
  centers reId vectors
)

NB. Use the power adverb to iterate kmOnce until the labels converge
kmConverge =: ((kmOnce~)^:_)~

NB. nClusters km dataset k means.  Given a cluster count on the left
NB. and a vector data set on the right, return an array of IDs
NB. assigning each vector to a cluster.
NB. One can use `calcCenters` to recover the cluster centers.

km =: canonicalize@((drawIds) kmConverge ])

NB. Example Data
NB. v1 =: ?  10 2 $ 10 
NB. v2 =: ?  3 2 $ 10
NB. v3 =: ? 100 2 $ 50
NB. trivialV =: 2 2 $ 10

NB. Examples
NB. 3 km v3