refactored based on comments
nipsn committed May 30, 2024
1 parent dc68da1 commit f1b0bf4
Showing 5 changed files with 66 additions and 68 deletions.
4 changes: 2 additions & 2 deletions ADF.q
@@ -22,7 +22,7 @@ ones:{x .[;;:;1f]/l where((<=).')l:a cross a:til n:count x}
// @param asset1 {float[]} Close prices of the first asset
// @param asset2 {float[]} Close prices of the second asset
// @return {dict} p_value of ADF test
-fCoint: {@[;1]0f^coint[x;y]`}
+fcoint: {@[;1]0f^coint[x;y]`}

// We hardcode reading the historical data from each .csv
syms:`SP500_hist`NASDAQ100_hist`BFX`FCHI`GDAXI`HSI`KS11`MXX`N100`N225`NYA`RUT`STOXX
@@ -32,7 +32,7 @@ rs:{([]sym:x;close:first((5#" "),"F";csv) 0:`$":data/stocks/",string[x],".csv")}
t: `sym xgroup raze rs each syms

// We apply our cointegration function on every pair of symbols from our crossedList
-matrix: fCoint .' 0f^neg[trange]#''@\:[;`close](@/:[t]')syms cross syms
+matrix: fcoint .' 0f^neg[trange]#''@\:[;`close](@/:[t]')syms cross syms

// We create a p-values matrix from the ADF test results and set values above the diagonal to 1
pvalues: ones (count[syms]*til count syms)_matrix
58 changes: 28 additions & 30 deletions PostPairsTrading.md
@@ -66,15 +66,15 @@ t: `sym xgroup raze rs each syms
We then proceed to create a function called **fCoint** to call our imported function from PyKX, handle any null values by filling them with 0 using `0f^`, and return the second element, which in this case is the p-value.

```q
-fCoint: {@[;1]0f^coint[x;y]`}
+fcoint: {@[;1]0f^coint[x;y]`}
```

-We generate all combinations (`cross`) of indexes to see which pair is most cointegrated. Then, we index (`@`) each pair in our table. Additionally, we take (`#`) the last **trange** days of data for both indexes, and finally apply our **fCoint** function to each (`.'`) pair of data lists. **trange** symbolizes the number of working days in the last 4 years.
+We generate all combinations (`cross`) of indexes to see which pair is most cointegrated. Then, we index (`@`) each pair in our table. Additionally, we take (`#`) the last **trange** days of data for both indexes, and finally apply our **fcoint** function to each (`.'`) pair of data lists. **trange** symbolizes the number of working days in the last 4 years.


```q
trange:4*252
-matrix: fCoint .' 0f^neg[trange]#''@\:[;`close](@/:[t]')syms cross syms
+matrix: fcoint .' 0f^neg[trange]#''@\:[;`close](@/:[t]')syms cross syms
```

Now, with our matrix in hand, we can plot it and **visually identify** which asset is more favorable. In order to do that, we can leverage PyKX once again to bring the `heatmap` module to q:
@@ -131,12 +131,12 @@ However, this presents **an opportunity for profit** because we know that these

To check for deviations in our prices, we could simply subtract them and observe if the difference deviates significantly from zero, considering their scale difference.

-Indeed, just subtracting the prices of two assets, as in $priceY−priceX$ may not provide a clear understanding of their relationship. Let's illustrate this with an example:
+Indeed, just subtracting the prices of two assets, as in $price_y−price_x$ may not provide a clear understanding of their relationship. Let's illustrate this with an example:

```q
-q)priceX: 5 10 7 4 8
-q)priceY: 23 30 25 30 35
-q)spreads: priceY - priceX
+q)price_x: 5 10 7 4 8
+q)price_y: 23 30 25 30 35
+q)spreads: price_y - price_x
18 20 18 26 27
```

@@ -146,11 +146,11 @@ q)spreads: priceY - priceX
Let's consider **using logarithms**, as they possess favourable properties for our pricing model. They prevent negative values and stabilize variance. Log returns are time-additive and symmetric, simplifying the calculation and analysis of returns. This improves the accuracy of statistical models and ensures non-negative pricing, enhancing model robustness and reliability:

```q
-q)log priceX
+q)log price_x
1.609438 2.302585 1.94591 1.386294 2.079442
-q)log priceY
+q)log price_y
3.135494 3.401197 3.218876 3.401197 3.555348
-q)spreads: log[priceY] - log priceX
+q)spreads: log[price_y] - log price_x
1.526056 1.098612 1.272966 2.014903 1.475907
```

@@ -166,18 +166,18 @@ In this context, Y represents the NASDAQ 100 index, X represents the S&P 500 ind

The plotted graph above illustrates the relationship between the NASDAQ 100 and the S&P 500 indices, with each purple dot representing a data point of their prices at a given time. The linear trend visible in the scatter plot suggests a strong positive cointegration between the two indices. By applying linear regression, we can model this relationship mathematically, allowing us to predict the NASDAQ 100 index price based on the S&P 500 index price. This predictive power is crucial for pair trading, as it helps identify mispricings and potential trading opportunities.

-Linear regression aims to identify relationships between historical data, which we then extrapolate to current data. The differences between these relationships, or deviations, are our spreads. We've already calculated the 𝛼 and 𝛽 using the logarithmic values of our historical data (since real-time price values for priceX and priceY are unknown). Now, we simply combine everything and apply linear regression to our price logarithms:
+Linear regression aims to identify relationships between historical data, which we then extrapolate to current data. The differences between these relationships, or deviations, are our spreads. We've already calculated the 𝛼 and 𝛽 using the logarithmic values of our historical data (since real-time price values for price_x and price_y are unknown). Now, we simply combine everything and apply linear regression to our price logarithms:

-$$spread = log(priceY) - (\beta \cdot log(priceX)+\alpha)$$
+$$spread = log(price_y) - (\beta \cdot log(price_x)+\alpha)$$

```q
-q)spreads: log[priceY] - alpha + log[priceX] * beta
+q)spreads: log[price_y] - alpha + log[price_x] * beta
-0.1493929 0.0451223 -0.08835117 0.0451223 0.1579725
```

-There are different methods we can use to obtain the best alpha and beta values that minimize the spreads or, in other words, there are mathematical methods to find the line that best fits the prices.
+There are different methods we can use to obtain the best alpha and beta values that minimize the spreads or, in other words, there are mathematical ways to find the line that best fits the prices.

-The aim of this post is not to delve deeply into these methods but to mention that the most popular method is called the least squares method. For this case, it provides a closed-form solution that depends on our historical data. This means we do not need any iterative algorithm or a more complex method to find these optimal alpha and beta values.
+The aim of this post is not to delve deeply into them but to mention that the most popular one is called the least squares method. For this case, it provides a closed-form solution that depends on our historical data. This means we do not need any iterative algorithm or anything more complex to find these optimal alpha and beta values.

>💡 For those interested in our implementation of these formulas in kdb+/q, the code can be found in our repository [Pair-Trading](https://github.com/hablapps/pairstrading/blob/5-Post/linear_regression.q).
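
To make the closed-form solution concrete, here is a rough, self-contained q sketch with made-up numbers (the variable names and data are illustrative, not the repository's exact code):

```q
// Toy data (illustrative only)
x: 1 2 3 4 5f
y: 2.2 3.9 6.1 8.0 9.9
// Closed-form least squares: beta = (n*sum(x*y)-sum(x)*sum(y)) % (n*sum(x*x)-sum(x)^2)
beta: ((n*sum x*y)-(*/)(sum')(x;y))%((n:count x)*sum x*x)-sum[x] xexp 2
// alpha = avg(y) - beta*avg(x)
alpha: avg[y]-beta*avg x
```

Note how q's right-to-left evaluation lets `n:count x` in the denominator be assigned before the numerator reads `n`.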
@@ -188,39 +188,37 @@ This precisely meets our objective—a **comprehensive method for representing r

Now that we have selected a pair of cointegrated indices and understand how to calculate their relationships, let's see how we can create a real-time pair trading scenario.

-> ⚠️ An important note is that this post will include a real-time simulation. In other words, if we wanted to develop a 100% real-time product, we would need to make slight adjustments to the code.
The first step to implementing this pair trading algorithm in real time is to declare a `.z.ts` function. This `.z.ts` function will be called automatically every x milliseconds, which can be configured with `\t`. In our case, it will be called every 100 milliseconds.

```q
-.z.ts: {.streamPair.genPair[]}
+.z.ts: {.stream_pair.gen_pair[]}
\t 100
```

-Let's now see how our **.streamPair.genPair** function should be defined. In our case, we are simulating real time; we do not have a 100% real-time product. Therefore, we already have the data loaded into memory and only need to display it one by one. For this, we will use an index **streamPair.i** which we will update with each execution of our function.
+Let's now see how our **.stream_pair.gen_pair** function should be defined. We are only simulating real time; we do not have a 100% real-time product. Therefore, we already have the data loaded into memory and only need to display it one by one. For this, we will use an index **.stream_pair.i**, which we will update with each execution of our function. Please keep in mind that if we wanted to run this in a true real-time scenario, the code would need to be modified.

```q
-.streamPair.i+:1;
-resX: priceX[.streamPair.i];
-resY: priceY[.streamPair.i];
+.stream_pair.i+:1;
+resX: price_x[.stream_pair.i];
+resY: price_y[.stream_pair.i];
```

The purpose of this function is to calculate the corresponding price spreads. For this, we will use the spread formula that we already know.

```q
-spread: priceY[.streamPair.i][`bid] - ((priceX[.streamPair.i][`bid] * beta_lr)+alpha_lr);
+spread: price_y[.stream_pair.i][`bid] - ((price_x[.stream_pair.i][`bid] * beta_lr)+alpha_lr);
```

Putting everything together and returning a table with the time instant and the spread, we would get the function:

```q
-.streamPair.genPair:{
- .streamPair.i+:1;
- resX: priceX[.streamPair.i];
- resY: priceY[.streamPair.i];
+.stream_pair.gen_pair:{
+ .stream_pair.i+:1;
+ resX: price_x[.stream_pair.i];
+ resY: price_y[.stream_pair.i];
  s: resY[`bid] - alpha_lr+resX[`bid] * beta_lr;
- enlist `dateTime`spread`mean!
-  ("p"$(resX[`dateTime]);"f"$(s);"f"$(0));
+ enlist `dt`spread`mean!
+  ("p"$(resX[`dt]);"f"$(s);"f"$(0));
}
```

@@ -236,7 +234,7 @@ Finally, once we have our spreads accurately calculated and observe how our data

A simple approach to window signals is to set these windows as twice the historical standard deviation of the spreads. Therefore, if either of these limits is reached, we should sell the overvalued index and buy the undervalued one, and then unwind our position when the spread returns to 0. Let's clarify this with a specific example:

-![SpreadsD](resources/window_signals.gif)
+![WSignals](resources/window_signals.gif)

In this instance, we can see that the spread (purple line) is positive and above the signal (blue line), indicating that our Y index (NASDAQ100) is overvalued relative to the SP500. Therefore, we should sell NASDAQ100 and buy SP500. At the end of the gif, it can be observed that the spread returns to 0 (green line), meaning the indexes are no longer overvalued or undervalued, respectively. At this point, we should unwind the positions we acquired earlier.
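
As a minimal sketch of this trading rule in q (the names and data below are illustrative, not the post's exact code):

```q
// Example historical spreads and a ±2-standard-deviation window
spreads: -0.15 0.05 -0.09 0.05 0.16
band: 2*dev spreads
// Map the current spread to an action: outside the band we trade the pair,
// otherwise we hold (unwinding near zero is handled separately in practice)
signal:{[s;b] $[s>b;`shortY_longX; s<neg b;`longY_shortX; `hold]}
signal[0.3;band]   / spread above the upper band: short Y, buy X
```

In a live setting the unwind would trigger when the spread crosses back through zero, rather than on exact equality with it.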

4 changes: 2 additions & 2 deletions linear_regression.q
@@ -10,7 +10,7 @@
// @param x {number[]} Independent variable
// @param y {number[]} Dependent variable
// @return {number} Beta (slope)
-betaF:{dot:{sum x*y};
+beta_f:{dot:{sum x*y};
((n*dot[x;y])-(*/)(sum')(x;y))%
((n:count[x])*dot[x;x])-sum[x]xexp 2};

@@ -23,4 +23,4 @@ betaF:{dot:{sum x*y};
// @param x {number[]} Independent variable
// @param y {number[]} Dependent variable
// @return {number} alpha (intercept)
-alphaF: {avg[y]-(betaF[x;y]*avg[x])};
+alpha_f: {avg[y]-(beta_f[x;y]*avg[x])};
File renamed without changes
68 changes: 34 additions & 34 deletions streamPair.q
@@ -2,69 +2,69 @@
\l linear_regression.q

// load tables
-readTick:{1_ flip `dateTime`bid`ask`bidVol`askVol!("*FFFF";",")0: `$":data/",string[x],".csv"};
-readHist:{1_ flip enlist[`close]!(" F ";",") 0: `$":data/",string[x],"_hist.csv"};
-tab1:readTick `USA500IDXUSD;
-tab2:readTick `USATECHIDXUSD;
-tab3: flip `dateTime`spread`mean`up`low`ewma`up2`low2!("P"$();"F"$();"F"$();"F"$();"F"$();"F"$();"F"$();"F"$());
-historial_tab1:readHist `SP500
-historial_tab2:readHist `NASDAQ100
+read_tick:{1_ flip `dt`bid`ask`bidVol`askVol!("*FFFF";",")0: `$":data/",string[x],".csv"};
+read_hist:{1_ flip enlist[`close]!(" F ";",") 0: `$":data/",string[x],"_hist.csv"};
+tab1:read_tick `USA500IDXUSD;
+tab2:read_tick `USATECHIDXUSD;
+tab3: flip `dt`spread`mean`up`low`ewma`up2`low2!("P"$();"F"$();"F"$();"F"$();"F"$();"F"$();"F"$();"F"$());
+historial_tab1:read_hist `SP500
+historial_tab2:read_hist `NASDAQ100

// Fix data and take log(prices)
-priceX: 0!1_(update delta:0f^deltas dateTime from distinct select distinct dateTime, log bid, log ask from update dateTime:"P"$@[;19;:;"."] each dateTime from tab1);
-priceY: 0!1_(update delta:0f^deltas dateTime from distinct select distinct dateTime, log bid, log ask from update dateTime:"P"$@[;19;:;"."] each dateTime from tab2);
+price_x: 0!1_(update delta:0f^deltas dt from distinct select distinct dt, log bid, log ask from update dt:"P"$@[;19;:;"."] each dt from tab1);
+price_y: 0!1_(update delta:0f^deltas dt from distinct select distinct dt, log bid, log ask from update dt:"P"$@[;19;:;"."] each dt from tab2);

// Calculate alpha and beta from historical values
-beta_lr: betaF[px:-100#log historial_tab1`close;py:-100#log historial_tab2`close]; // we only take most recent 100 values for the alpha and beta
-alpha_lr: alphaF[px;py];
+beta_lr: beta_f[px:-100#log historial_tab1`close;py:-100#log historial_tab2`close]; // we only take most recent 100 values for the alpha and beta
+alpha_lr: alpha_f[px;py];
// We calculate an historical standard deviation
-std_lr: dev[(1000#exec bid from priceY) - (1000#exec bid from priceX)];
+std_lr: dev[(1000#exec bid from price_y) - (1000#exec bid from price_x)];

/ load and initialize kdb+tick
/ all tables in the top level namespace (.) become publish-able
\l tick/u.q
.u.init[];

// Read and write on buffer functions
-.ringBuffer.read:{[t;i] $[i<=count t; i#t; i rotate t] }
-.ringBuffer.write:{[t;r;i] @[t;(i mod count value t)+til 1;:;r];}
+.ring_buffer.read:{[t;i] $[i<=count t; i#t; i rotate t] }
+.ring_buffer.write:{[t;r;i] @[t;(i mod count value t)+til 1;:;r];}

// Initialize index and empty tables (We will access directly to these objects from dashboards)
-.streamPair.i:-1;
-.streamPair.iEWMA:-1;
-.streamPair.priceX: 1000#tAux: 1_1#priceX;
-.streamPair.priceY: 1000#tAux;
-.streamPair.spreads: 1000#tab3;
+.stream_pair.i:-1;
+.stream_pair.iEWMA:-1;
+.stream_pair.price_x: 1000#tAux: 1_1#price_x;
+.stream_pair.price_y: 1000#tAux;
+.stream_pair.spreads: 1000#tab3;

// Timer function
timer:{t:.z.p;while[.z.p<t+x&abs x-16*1e6]} / 16 <- timer variable

-.streamPair.genPair:{
+.stream_pair.gen_pair:{
// We wait some delta
- d: `float$(priceX[.streamPair.i+:1][`delta]);
- timer[d];
+ d: `float$(price_x[.stream_pair.i+:1][`delta]);
+ // timer[d];
// We take the i element from our tables
- resX: enlist priceX[.streamPair.i];
- resY: enlist priceY[.streamPair.i];
+ res_x: enlist price_x[.stream_pair.i];
+ res_y: enlist price_y[.stream_pair.i];

// We calculate spreads for linear regression
- s: priceY[.streamPair.i][`bid] - ((priceX[.streamPair.i][`bid] * beta_lr)+alpha_lr);
- ewma: dev[ema[0.04; .streamPair.iEWMA#0f^(exec spread from .streamPair.spreads)]];
- $[.streamPair.iEWMA>999;.streamPair.iEWMA:998;.streamPair.iEWMA+:1];
- resSpread: enlist `dateTime`spread`mean`up`low`ewma`up2`low2!("p"$(priceX[.streamPair.i][`dateTime]);"f"$(s);"f"$(0);"f"$(1.96*std_lr);"f"$(-1.96*std_lr);"f"$0f^(ewma); "f"$(0f^(1.96*(last 1000 mdev (exec spread from .streamPair.spreads)))); "f"$0f^(-1.96*(last 1000 mdev (exec spread from .streamPair.spreads))));
+ s: price_y[.stream_pair.i][`bid] - ((price_x[.stream_pair.i][`bid] * beta_lr)+alpha_lr);
+ ewma: dev[ema[0.04; .stream_pair.iEWMA#0f^(exec spread from .stream_pair.spreads)]];
+ $[.stream_pair.iEWMA>999;.stream_pair.iEWMA:998;.stream_pair.iEWMA+:1];
+ res_spread: enlist `dt`spread`mean`up`low`ewma`up2`low2!("p"$(price_x[.stream_pair.i][`dt]);"f"$(s);"f"$(0);"f"$(1.96*std_lr);"f"$(-1.96*std_lr);"f"$0f^(ewma); "f"$(0f^(1.96*(last 1000 mdev (exec spread from .stream_pair.spreads)))); "f"$0f^(-1.96*(last 1000 mdev (exec spread from .stream_pair.spreads))));

// We update our buffer tables with those values
- .ringBuffer.write[`.streamPair.priceX;resX;.streamPair.i];
- .ringBuffer.write[`.streamPair.priceY;resY;.streamPair.i];
- .ringBuffer.write[`.streamPair.spreads;resSpread;.streamPair.i];
- resX
+ .ring_buffer.write[`.stream_pair.price_x;res_x;.stream_pair.i];
+ .ring_buffer.write[`.stream_pair.price_y;res_y;.stream_pair.i];
+ .ring_buffer.write[`.stream_pair.spreads;res_spread;.stream_pair.i];
+ res_x

}

// Publish stream updates each millisecond
-.z.ts: {.streamPair.genPair[]}
+// .z.ts: {.stream_pair.gen_pair[]}

// Snapshot read from our buffer
-.u.snap:{[t] .ringBuffer.read[.streamPair.priceX;.streamPair.i]} // reqd. by dashboards
+.u.snap:{[t] .ring_buffer.read[.stream_pair.price_x;.stream_pair.i]} // reqd. by dashboards

// \t 100
