Thursday, August 29, 2013

Here's a patch for scipy.spatial.distance for cointegration

I wrote the following patch back in 2009 (or was it '08?) to help me implement a pairs trading strategy.  It's a useful extension to the wonderful scipy.spatial.distance package.  The package is absolutely indispensable for efficiently calculating pairwise distances between vectors (such as a set of time series vectors of stock prices).

A pairs trading strategy is used in trading to make money from highly-correlated stocks/currencies/options/whatever-you-want-to-trade signals.  The general idea is that two series (e.g. GOOG, AAPL) will keep crossing one another, and that you can generally 'buy the lower one and sell the higher one.'  In this manner (and at whatever frequency you can manage) you can keep extracting profit from the signals, so long as the transaction costs you pay don't detract from the profit you make (the differential) between the two signals.

Usually, you're given a large set of time series returns and have to efficiently sift through them, looking for highly-correlated pairs.  Correlation, though, isn't a very robust measure.  This code uses cointegration instead, which is a more efficient and robust measure, similar to correlation. On top of that, scipy.spatial.distance is really fast and memory efficient since it calls C and Fortran code directly.

I used this code to run a small trading strategy in a production setting, accessing the U.S. stock market through Goldman's REDI platform.  The strategy wasn't very successful, mostly because (I think) the universe of stocks had spreads that were too large and the volumes too small.  The spreads were like $0.15, the universe was Russell 3000 and some smaller-cap stuff.  Nonetheless, I believe the code could be useful for arbitrage strategies in other realms, as cointegration is still used widely throughout the financial industry, at least to the best of my knowledge.

So, without further ado, here's the patch.  Enjoy, now go make a million!  :)

========

(msyang)-(~/test/gentoo-alt/tags/overlay/my-overlay/sci-libs/scipy/files)
(jobs:2) $ cat scipy-0.7.1-spatial-dickeyfuller.patch 
--- ./scipy/spatial/distance.py.orig 2009-08-22 04:48:08.000000000 -0400
+++ ./scipy/spatial/distance.py 2009-08-22 04:48:59.000000000 -0400
@@ -56,6 +56,8 @@
 +------------------+-------------------------------------------------+
 | dice             | the Dice dissimilarity (boolean).               |
 +------------------+-------------------------------------------------+
+| dickeyfuller     | the dickey-fuller statistic                     |
++------------------+-------------------------------------------------+
 | euclidean        | the Euclidean distance.                         |
 +------------------+-------------------------------------------------+
 | hamming          | the Hamming distance (boolean).                 |
@@ -841,6 +843,10 @@
     The following are common calling conventions.
+    0. ``Y = pdist(X, 'dickeyfuller')``
+
+       Computes the dickey-fuller 1-lag statistic.
+
     1. ``Y = pdist(X, 'euclidean')``
        Computes the distance between m points using Euclidean distance
@@ -1026,7 +1032,7 @@
        metric : string or function
            The distance metric to use. The distance function can
            be 'braycurtis', 'canberra', 'chebyshev', 'cityblock',
-           'correlation', 'cosine', 'dice', 'euclidean', 'hamming',
+           'correlation', 'cosine', 'dice', 'dickeyfuller', 'euclidean', 'hamming',
            'jaccard', 'kulsinski', 'mahalanobis', 'matching',
            'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean',
            'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'.
@@ -1112,6 +1118,8 @@
         #    TypeError('A double array must be passed.')
         if mstr in set(['euclidean', 'euclid', 'eu', 'e']):
             _distance_wrap.pdist_euclidean_wrap(_convert_to_double(X), dm)
+        elif mstr in set(['dickeyfuller', 'dickey', 'df', 'd']):
+            _distance_wrap.pdist_df_wrap(_convert_to_double(X), dm)
         elif mstr in set(['sqeuclidean', 'sqe', 'sqeuclid']):
             _distance_wrap.pdist_euclidean_wrap(_convert_to_double(X), dm)
             dm = dm ** 2.0
--- ./scipy/spatial/src/distance.c.orig 2009-08-22 04:48:15.000000000 -0400
+++ ./scipy/spatial/src/distance.c 2009-08-22 04:49:26.000000000 -0400
@@ -39,6 +39,37 @@
 #include "common.h"
 #include "distance.h"
+static inline double df_distance(const double *u, const double *v, int n) {
+  int i = 0;
+  double mean_spread = 0.0, mean_nolag = 0.0, mean_lag = 0.0;
+  double result = 0.0;
+  double covar = 0.0;
+  // extra precision
+  //long double covar = 0.0;
+  for (i = 0; i < n; i++) {
+    const double spread = u[i] - v[i];
+    const double delta = spread - mean_spread;
+    mean_spread += delta/(i + 1);
+    result += delta * (spread - mean_spread);
+    if (i != 0) {
+        const double delta_lag = spread - mean_lag;
+        mean_lag += delta_lag/i;
+    }
+    if (i == n - 2){
+        mean_nolag = mean_spread;
+    }
+  }
+  for (i = 0; i < n - 1; i++) {
+      // extra precision
+      //const long double delta1 = u[i] - v[i] - mean_nolag;
+      //const long double delta2 = u[i+1] - v[i+1] - mean_lag;
+      const double delta1 = u[i] - v[i] - mean_nolag;
+      const double delta2 = u[i+1] - v[i+1] - mean_lag;
+      covar += (delta1 * delta2 - covar) / (i + 1);
+  }
+  return covar*(n - 1)/result;
+}
+
 static inline double euclidean_distance(const double *u, const double *v, int n) {
   int i = 0;
   double s = 0.0, d;
@@ -322,6 +353,19 @@
   }
 }
+void pdist_df(const double *X, double *dm, int m, int n) {
+  int i, j;
+  const double *u, *v;
+  double *it = dm;
+  for (i = 0; i < m; i++) {
+    for (j = i + 1; j < m; j++, it++) {
+      u = X + (n * i);
+      v = X + (n * j);
+      *it = df_distance(u, v, n);
+    }
+  }
+}
+
 void pdist_euclidean(const double *X, double *dm, int m, int n) {
   int i, j;
   const double *u, *v;
--- ./scipy/spatial/src/distance_wrap.c.orig 2009-08-22 04:48:25.000000000 -0400
+++ ./scipy/spatial/src/distance_wrap.c 2009-08-22 04:49:40.000000000 -0400
@@ -564,6 +564,27 @@
 /***************************** pdist ***/
+extern PyObject *pdist_df_wrap(PyObject *self, PyObject *args) {
+  PyArrayObject *X_, *dm_;
+  int m, n;
+  double *dm;
+  const double *X;
+  if (!PyArg_ParseTuple(args, "O!O!",
+ &PyArray_Type, &X_,
+ &PyArray_Type, &dm_)) {
+    return 0;
+  }
+  else {
+    X = (const double*)X_->data;
+    dm = (double*)dm_->data;
+    m = X_->dimensions[0];
+    n = X_->dimensions[1];
+
+    pdist_df(X, dm, m, n);
+  }
+  return Py_BuildValue("d", 0.0);
+}
+
 extern PyObject *pdist_euclidean_wrap(PyObject *self, PyObject *args) {
   PyArrayObject *X_, *dm_;
   int m, n;
@@ -1109,6 +1130,7 @@
   {"pdist_city_block_wrap", pdist_city_block_wrap, METH_VARARGS},
   {"pdist_cosine_wrap", pdist_cosine_wrap, METH_VARARGS},
   {"pdist_dice_bool_wrap", pdist_dice_bool_wrap, METH_VARARGS},
+  {"pdist_df_wrap", pdist_df_wrap, METH_VARARGS},
   {"pdist_euclidean_wrap", pdist_euclidean_wrap, METH_VARARGS},
   {"pdist_hamming_wrap", pdist_hamming_wrap, METH_VARARGS},
   {"pdist_hamming_bool_wrap", pdist_hamming_bool_wrap, METH_VARARGS},
--- ./scipy/spatial/src/distance.h.orig 2009-08-22 04:48:33.000000000 -0400
+++ ./scipy/spatial/src/distance.h 2009-08-22 04:49:40.000000000 -0400
@@ -39,6 +39,7 @@
 void dist_to_squareform_from_vector(double *M, const double *v, int n);
 void dist_to_vector_from_squareform(const double *M, double *v, int n);
+void pdist_df(const double *X, double *dm, int m, int n);
 void pdist_euclidean(const double *X, double *dm, int m, int n);
 void pdist_seuclidean(const double *X,
       const double *var, double *dm, int m, int n);

## Please give me a heads up if something like this is already in scipy.spatial.distance or somewhere
## else on github or Numerical Recipes or Stata or somewhere else.
## Apologies if this algorithm is well-known already
## It's my dream to be credited with at least a mention of my name, "Michael Yang"
## if this patch gets incorporated some day into the scipy.spatial.distance package.
##
## Thanks again, and I hope you find this useful!

No comments:

Post a Comment