Backtesting machine learning trading algorithms


In this notebook I will go through how to build a machine learning workflow for analysing a particular trading strategy on the UK100 index. The strategy will look at the daily price movements of the previous 10 days to see if they are at all indicative of the direction of today's price movement, e.g. if we have had a run of heavy downward movements over the last 10 days, does that make it more likely that we will see an upward movement today (mean reversion)?

The lagged returns will act as our "features" and the actual price movement directions will be our "labels" (output variable). Because the labels can only take the values -1 or +1, this is a classification problem: the label variable is not continuous.
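
To make the setup concrete before touching the API, here is a minimal, self-contained sketch of the feature/label construction using made-up prices (the notebook itself uses 10 lags and the real UK100 data downloaded below):

import numpy as np
import pandas as pd

# toy close prices standing in for the UK100 series
close = pd.Series([7000.0, 7010.0, 6995.0, 7020.0, 7030.0, 7005.0, 7040.0])

returns = np.log(close / close.shift(1))                                  # daily log returns
direction = pd.Series(np.where(returns > 0, 1, -1), name='direction')    # label: +1 up day, -1 down day

# lagged returns act as the features (3 lags shown here for brevity)
features = pd.DataFrame({'lag_{}'.format(k): returns.shift(k) for k in range(1, 4)})

print(pd.concat([features, direction], axis=1).dropna())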

1. Firstly we need to connect to the FXCM API to be able to download historical data

I have a demo brokerage account with FXCM which gives me access to the FXCM API, through which I can download tick data and also subscribe to prices in real time once the trading strategy is ready to be implemented.

In [40]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import fxcmpy
import seaborn as sns
from datetime import datetime
import random 
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve, auc, make_scorer, recall_score, accuracy_score, precision_score, confusion_matrix, balanced_accuracy_score

# the API comes with a class called fxcmpy for interacting with it

api = fxcmpy.fxcmpy(config_file=r"C:\Users\Adam\fxcm.cfg")

from fxcmpy import fxcmpy_tick_data_reader as tdr 

print(tdr.get_available_symbols())


# we want to set a random seed so that our results are reproducible
np.random.seed(42)
('AUDCAD', 'AUDCHF', 'AUDJPY', 'AUDNZD', 'CADCHF', 'EURAUD', 'EURCHF', 'EURGBP', 'EURJPY', 'EURUSD', 'GBPCHF', 'GBPJPY', 'GBPNZD', 'GBPUSD', 'GBPCHF', 'GBPJPY', 'GBPNZD', 'NZDCAD', 'NZDCHF', 'NZDJPY', 'NZDUSD', 'USDCAD', 'USDCHF', 'USDJPY')

2. We need to set all of the parameters with which we want to test our algorithm

We are going to test a couple of well-known machine learning algorithms to see which performs best with our lagged indicator strategy, and by best I mean which has the most predictive power.

We are going to look at the support vector machine and K nearest neighbours algorithms.

Machine learning algorithms need to be tuned to find the best parameters for working with different types of data. Therefore here we set up the hyperparameter space over which we want the grid search to iterate to find the most predictive parameters when fitting to the data, and we set the "accuracy score" as the metric for deciding predictive power.

In [41]:
""" ------------------ MODEL INPUTS!!! -------------------------------"""

""" set data paramters """

index_totrade = "UK100"
candle_size = "D1"
PreviousCandles = 10000

""" set ML paramters """

lags_list = [10]

from sklearn.svm import SVC
SupportVec = SVC()

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

cross_vals = 3
models_to_fit = [knn]
model_names = ['knn']

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid_SVC = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

#logreg param space
c_space = np.logspace(-5, 8, 15)
param_grid_logreg = {'C': c_space}

#knn param space
leaf_size = list(np.arange(1,41,10))
n_neighbors = list(np.arange(1,21,5))
p=[1,2]
param_grid_knn = dict(leaf_size=leaf_size, n_neighbors=n_neighbors, p=p)

parameter_grids = [param_grid_knn]

refit_score = 'accuracy_score'


""" set trading algorithmn parameters """

Total_Balance = 1000
Risk_perTrade = 0.01
financing_cost = 0.0001
stoploss = 0.9

3. Download the UK100 data and visualise it

It is always best to do exploratory data analysis on any raw data, so we will download the data and make sure it looks as we expect. We will also append useful variables that don't come with the data, such as the daily returns and the maximum daily movements, in case we want to implement a stop loss strategy.

In [42]:
data = api.get_candles(index_totrade, period=candle_size,number=PreviousCandles)

data.sort_index(ascending=True, inplace=True)
 
print(data.head())
    
data['midopen']= (data['bidopen']+data['askopen'])/2
 
data['midclose']= (data['bidclose']+data['askclose'])/2
    
data['midhigh']= (data['bidhigh']+data['askhigh'])/2

data['midlow']= (data['bidlow']+data['asklow'])/2

# we need to work out the mean spread which will give us an indication of the transaction cost  

spread = (data["askclose"]-  data["bidclose"]).mean()

ptc = spread / data['midclose'].mean() 

data['returns'] = np.log((data['midclose'])/data['midclose'].shift(1))

data.dropna(inplace=True)

data['direction'] = np.where(data['returns']> 0 , 1, -1)
    
# creating a variable which will show the maximum range in up and down movements per day to be used later with stop loss function

data['max_longreturn_thatday'] = np.log((data['midhigh'])/data['midclose'].shift(1))
                                                      
data['max_shortreturn_thatday'] = np.log((data['midlow'])/data['midclose'].shift(1))

data.dropna(inplace=True)

# the index direction will be used as the label variable for the machine learning algorithms

data['direction'] = np.where(data['returns']> 0 , 1, -1)

data['year']=data.index.year
    
# these variables will be used later to analyse candlestick heights and to refine the stop loss parameter
    
data['OpentoHigh'] = (data['midhigh'] - data['midopen'])/data['midopen']
    
data['OpentoLow'] = (data['midlow'] - data['midopen'])/data['midopen']

# plot the data to make sure it looks correct
    
fig, ax = plt.subplots(figsize=(10,6))

ax.plot(data['midclose'])

ax.set_xlabel("Date")

ax.set_ylabel("Price")

ax.set_title("{}".format(index_totrade))

plt.show()
                     bidopen  bidclose  bidhigh  bidlow  askopen  askclose  \
date                                                                         
2001-09-09 21:00:00   5070.3    5033.7   5070.3  4895.9   5070.3    5033.7   
2001-09-10 21:00:00   5033.7    4746.0   5129.0  4746.0   5033.7    4746.0   
2001-09-11 21:00:00   4746.0    4882.1   4882.1  4651.8   4746.0    4882.1   
2001-09-12 21:00:00   4882.1    4943.6   4946.6  4847.5   4882.1    4943.6   
2001-09-13 21:00:00   4943.6    4755.7   4970.1  4743.0   4943.6    4755.7   

                     askhigh  asklow  tickqty  
date                                           
2001-09-09 21:00:00   5070.3  4895.9        0  
2001-09-10 21:00:00   5129.0  4746.0        0  
2001-09-11 21:00:00   4882.1  4651.8        0  
2001-09-12 21:00:00   4946.6  4847.5        0  
2001-09-13 21:00:00   4970.1  4743.0        0  

4. We now need to create the lagged indicators

We shall not only create the lagged features but also convert them into digitized bins based on how they compare to the mean and standard deviation of the UK100 daily returns. This provides some extra information about the size of each movement. These features will then be concatenated onto the original data.
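
As a quick illustration of what np.digitize does (with made-up numbers rather than the actual UK100 statistics): given the five bin edges at the mean plus or minus one and two standard deviations, each return is mapped to a bin label from 0 (below mean - 2*stdev) up to 5 (above mean + 2*stdev).

import numpy as np

mean, stdev = 0.0, 0.01
bins = [mean - 2*stdev, mean - stdev, mean, mean + stdev, mean + 2*stdev]

sample_returns = np.array([-0.035, -0.012, -0.003, 0.004, 0.015, 0.027])
print(np.digitize(sample_returns, bins=bins))   # -> [0 1 2 3 4 5]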

In [43]:
pd.set_option('mode.chained_assignment', None)

lag_number = 10

lagged_cols = [] 

# loop through last 10 candles
for lag in range(1,lag_number+1):
    
    col = 'lag_{}'.format(lag)     #naming the columns
    
    data[col] = data['returns'].shift(lag)     # assign the return from the corresponding lag using shift
    
    lagged_cols.append(col)


#need to get rid of the NaNs created in the first ten rows, where the lagged data points do not exist
data.dropna(inplace=True)

""" --------- creating digitized lag returns columns to the data  -----------------"""

# select only the lagged return features from the data and convert them into digitized bins
lagged_data = data[lagged_cols]

def create_bins(data,bins=[0]):
    
    x = list(data.columns.copy())
        
    for col in x:
        col_bin= col + '_bin'
        data[col_bin] = np.digitize(data[col], bins=bins)
    data.drop(x,axis=1, inplace=True)
     
mean = data['returns'].mean()

stdev = data['returns'].std()

""" decide the bin set up here """

bins = [mean - (2 * stdev), mean - stdev, mean , mean + stdev, mean + (2 * stdev) ]

create_bins(lagged_data,bins)

print(data[lagged_cols].head())

print(lagged_data.head())
#concatenate the digitized lagged columns onto the data set

data = pd.concat([data,lagged_data], axis=1)
                        lag_1     lag_2     lag_3     lag_4     lag_5  \
date                                                                    
2001-09-25 21:00:00  0.010671  0.039839 -0.027408 -0.035526 -0.026542   
2001-09-26 21:00:00  0.006988  0.010671  0.039839 -0.027408 -0.035526   
2001-09-27 21:00:00  0.014271  0.006988  0.010671  0.039839 -0.027408   
2001-09-30 21:00:00  0.028925  0.014271  0.006988  0.010671  0.039839   
2001-10-01 21:00:00 -0.024317  0.028925  0.014271  0.006988  0.010671   

                        lag_6     lag_7     lag_8     lag_9    lag_10  
date                                                                   
2001-09-25 21:00:00 -0.010300  0.029667 -0.038750  0.012518  0.028273  
2001-09-26 21:00:00 -0.026542 -0.010300  0.029667 -0.038750  0.012518  
2001-09-27 21:00:00 -0.035526 -0.026542 -0.010300  0.029667 -0.038750  
2001-09-30 21:00:00 -0.027408 -0.035526 -0.026542 -0.010300  0.029667  
2001-10-01 21:00:00  0.039839 -0.027408 -0.035526 -0.026542 -0.010300  
                     lag_1_bin  lag_2_bin  lag_3_bin  lag_4_bin  lag_5_bin  \
date                                                                         
2001-09-25 21:00:00          3          5          0          0          0   
2001-09-26 21:00:00          3          3          5          0          0   
2001-09-27 21:00:00          4          3          3          5          0   
2001-09-30 21:00:00          5          4          3          3          5   
2001-10-01 21:00:00          1          5          4          3          3   

                     lag_6_bin  lag_7_bin  lag_8_bin  lag_9_bin  lag_10_bin  
date                                                                         
2001-09-25 21:00:00          2          5          0          4           5  
2001-09-26 21:00:00          0          2          5          0           4  
2001-09-27 21:00:00          0          0          2          5           0  
2001-09-30 21:00:00          0          0          0          2           5  
2001-10-01 21:00:00          5          0          0          0           2  

5. Enacting the machine learning

Now that we have established our features and labels, we are going to fit the algorithm to the data.

For testing we need a training data set for the model to be parameterised on and fitted to, and then a completely new data set, the "test data set", with which to test the predictive power of the model.

In this example we are going to use 2001 through to 2020 (excluding 2018) as training data, and then we are going to show the K nearest neighbours model the 2018 data to see how it would have performed that year!

Once we have fitted the model we will analyse its predictive power by looking at its precision and ROC scores

In [44]:
scorers = {
            'precision_score': make_scorer(precision_score),
            'recall_score': make_scorer(recall_score),
            'accuracy_score': make_scorer(accuracy_score),
            'balanced_accuracy_score' : make_scorer(balanced_accuracy_score)}
    
    
# we want to train the data on 20 data sets comprising a combination of the last 20 years 
Years = ['2001','2002','2003','2004','2005','2006','2007','2008','2009','2010','2011','2012','2013','2014','2015','2016','2017','2018','2019','2020']
    
# for this example we just look at 2018 as the test data set
year = 2018
year_index =17
    
    
# establishing a results dataframe 
    
results_columns = ['moves','precision -1','recall -1','precision 1','recall 1','best score','best params'] 

Prediction_ResultsByYearColumns = pd.MultiIndex.from_product([model_names,results_columns], names=['model','prediciton ind'])

Prediction_ResultsByYear = pd.DataFrame(index = Years , columns= Prediction_ResultsByYearColumns)


# initialising the train and test data sets and the features and labels for the ML 
Y_year = Years[year_index]   
X_years = Years[:year_index] + Years[year_index+1:]

X_train = data[data['year'].isin(X_years)][lagged_data.columns] 
y_train = data[data['year'].isin(X_years)]['direction'] 
        
X_test = data.loc[Y_year][lagged_data.columns] 
y_test = data.loc[Y_year]['direction'] 

#percentage of up movements in the test data set shows the balance of the classes
Prediction_ResultsByYear.loc[Y_year,('knn',results_columns[0])] = (y_test==1).sum()/len(y_test)

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

# The grid search CV function allows us to define all the variables we want to iterate over and the corresponding metrics etc        
grid_search = GridSearchCV(estimator=knn, param_grid=param_grid_knn, scoring=scorers, refit=refit_score,
                                       cv=cross_vals , return_train_score=True)
    
grid_search.fit(X_train, y_train)

""" looking at the prediction results of the fitted algorithms and adding them to the results table """
# make the predictions
y_pred = grid_search.predict(X_test)

report = classification_report(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)
    
#calculate precision -1
Prediction_ResultsByYear.loc[Y_year,('knn',results_columns[1])]= matrix [0][0] / (matrix[0][0]+ matrix[1][0])
#calculate recall -1
Prediction_ResultsByYear.loc[Y_year,('knn',results_columns[2])]= matrix[0][0] / (matrix[0][0]+ matrix[0][1])
#calculate precision 1
Prediction_ResultsByYear.loc[Y_year,('knn',results_columns[3])]= matrix[1][1] / (matrix[1][1]+ matrix[0][1])
#calculate recall 1
Prediction_ResultsByYear.loc[Y_year,('knn',results_columns[4])]= matrix[1][1] / (matrix[1][1]+ matrix[1][0])
# returns the predictive score for the tuned estimator 
Prediction_ResultsByYear.loc[Y_year,('knn',results_columns[5])]=grid_search.best_score_
# returns the parameters for each estimator which return the greatest predictive score
Prediction_ResultsByYear.loc[Y_year,('knn',results_columns[6])]=[grid_search.best_params_]
            
print(Prediction_ResultsByYear.loc['2018'])     
    
model  prediciton ind
knn    moves                                                   0.474308
       precision -1                                            0.571429
       recall -1                                               0.631579
       precision 1                                             0.537736
       recall 1                                                   0.475
       best score                                              0.513645
       best params       [{'leaf_size': 11, 'n_neighbors': 16, 'p': 2}]
Name: 2018, dtype: object

6. Analysing the prediction results and plotting the ROC curve

From the results above we can see that the "best score" (the mean cross-validated accuracy for the best parameter set, since accuracy is our refit metric) came out at around 51.4%, so the model had slightly more predictive power than randomly selecting a direction. This suggests there could be some minimal predictive power in lagged indicators.

The precision values show that of the movements the model predicted as downward (-1), 57% actually were, and of the movements it predicted as upward (+1), 54% actually were; see the short sketch below for how these figures are read from the confusion matrix.
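
For reference, sklearn's confusion_matrix orders both the rows (actual) and the columns (predicted) by class value, here [-1, +1], which is what the manual calculations in the cell above rely on. A small sketch with made-up counts shows how the four figures are derived:

import numpy as np

# rows = actual [-1, +1], columns = predicted [-1, +1] (made-up counts, not the 2018 results)
matrix = np.array([[70, 30],
                   [40, 60]])

precision_down = matrix[0, 0] / matrix[:, 0].sum()   # of all predicted -1, the share that really were -1
recall_down    = matrix[0, 0] / matrix[0, :].sum()   # of all actual -1, the share that were predicted -1
precision_up   = matrix[1, 1] / matrix[:, 1].sum()
recall_up      = matrix[1, 1] / matrix[1, :].sum()
print(precision_down, recall_down, precision_up, recall_up)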

Below we will dig a little deeper into the prediction results by plotting an ROC curve. What you can see from the plot is that when the model is compared to a "no skill" model that always goes long, the KNN model outperforms it, signified by its curve bending towards the top-left corner, i.e. a higher true positive rate for any given false positive rate.

In [45]:
""" plotting ROC Curves"""
            
# Plotting an ROC curve to examine the predictive capability of the algorithm
# keep probabilities for the positive outcome only
y_predPOSONLY = grid_search.predict_proba(X_test)[:,1].astype(float)
            
#using this tells you which order the classes are in
grid_search.classes_
            
#generate a no skill model for comparison that always guesses 1 
ns_pred = [1 for _ in range(len(y_test))]

#Calculate ROC scores
ns_auc = roc_auc_score(y_test,ns_pred)
y_pred_auc = roc_auc_score(y_test,y_predPOSONLY)

# summarize ROC scores
print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('KNN: ROC AUC=%.3f' % (y_pred_auc))

# calculate roc curves
ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_pred)
y_fpr, y_tpr, _ = roc_curve(y_test, y_predPOSONLY)
            
fig1b, ax = plt.subplots(figsize=(10,6))
# plot the roc curve for the model
ax.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
ax.plot(y_fpr, y_tpr, marker='.', label='knn')
# axis labels
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
            
ax.set_title("{0} - {1}".format('knn',year))
# show the legend
ax.legend()
# show the plot
plt.show()
No Skill: ROC AUC=0.500
KNN: ROC AUC=0.560

7. Analysing the investment performance of the trading algorithm

We have seen that the model has some predictive power in predicting the direction of daily UK100 movements, but how does that translate into investment performance? To work this out we multiply the algorithm's predicted direction (1 or -1) by the actual return experienced that day.

What you can see is that for 2018 the algorithm returns a cumulative return of roughly 20%, whereas going long on the UK100 would have returned about -10%. This does not yet include transaction costs, but it is a strong indicator of performance capability (a sketch of deducting the spread cost follows the code below).

The second graph shows the same plot as the first, however the algorithm's long and short positions have been overlaid onto it to show what positions the algorithm was in throughout the year.

In [46]:
algo = pd.DataFrame(grid_search.predict(X_test), columns = ["algo_positions"])         
            
algo.set_index(X_test.index, inplace=True)

algo = pd.concat([algo,data], axis=1 )

algo["algo_pos_change_ind"] = algo["algo_positions"]*algo["algo_positions"].shift(1)                    

algo.dropna(inplace=True)

#set up a trade indicator (1 whenever the position flips sign)
algo["Trade_ind"] = np.where(algo['algo_pos_change_ind'] < 0 , 1, 0)

""" we establish the algo returns per day by mutliplying the prediction from the algo either long or short (1 or -1) by the actual return that day """ 
algo["algo_returns"] = algo.algo_positions * algo.returns

            
# plot the cumulative return of the algorithm versus the return of the index
fig2, ax = plt.subplots(figsize=(10,6))
ax.plot(algo["returns"].cumsum().apply(np.exp), label ='UK100 return')
ax.plot(algo["algo_returns"].cumsum().apply(np.exp), label ='Algo Return')
ax.set_xlabel("Date")
ax.set_ylabel("Cumulative return exc trans costs")
ax.set_title("{0} - {1}".format(index_totrade, year))

ax.legend(loc="upper left")
plt.show()
            
# plot the cumulative return of the algorithm versus the return of the index and the trading positions
fig3, ax = plt.subplots(figsize=(16,8))
ax_1 = ax.twinx()
ax.plot(algo["returns"].cumsum().apply(np.exp),label ='UK100 return')
ax.plot(algo["algo_returns"].cumsum().apply(np.exp), label ='Algo Return')
ax_1.plot(algo["algo_positions"],color='r', label='algo positions')
ax.set_xlabel("Date")
ax.set_ylabel("Cumulative return exc trans costs")
ax_1.set_ylabel("Algo position")
ax.set_title("{0} - {1}".format(index_totrade, year))
ax.legend()
ax_1.legend()
plt.show()
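
The returns above are gross of transaction costs. One rough way to account for the spread would be to charge the proportional cost ptc (computed earlier from the average bid/ask spread) on every day the position flips sign, i.e. whenever Trade_ind is 1. This is only a sketch and a small-cost approximation, not an exact broker cost model:

# approximate net returns by deducting the proportional spread cost on position flips
# (uses algo, ptc and Trade_ind from the cells above; ln(1 - ptc) is approximated by -ptc)
algo["algo_returns_net"] = algo["algo_returns"] - algo["Trade_ind"] * ptc

fig4, ax = plt.subplots(figsize=(10,6))
ax.plot(algo["algo_returns"].cumsum().apply(np.exp), label='Algo return (gross)')
ax.plot(algo["algo_returns_net"].cumsum().apply(np.exp), label='Algo return (net of spread)')
ax.set_xlabel("Date")
ax.set_ylabel("Cumulative return")
ax.legend()
plt.show()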

8. Investigating maximum drawdown and drawdown periods

We are now going to look at the maximum drawdown for the algorithm in 2018 (the most the algorithm lost from its most recent high)

and the maximum drawdown period, the longest time in days it took the algorithm to reach a new high after the previous one.

The blue line on the plot shows the growth over time of a starting margin of 1000 when invested by the algorithm.
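
As a tiny worked example of the drawdown logic (made-up equity values, not the algorithm's):

import pandas as pd

# made-up equity curve to illustrate the calculation used below
equity = pd.Series([1000.0, 1100.0, 1050.0, 1200.0, 1080.0])

cummax = equity.cummax()           # running high-water mark: 1000, 1100, 1100, 1200, 1200
drawdown = cummax - equity         # 0, 0, 50, 0, 120
drawdown_pct = drawdown / cummax   # maximum is 120 / 1200 = 10%
print(drawdown_pct.max())          # 0.1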

In [47]:
# lets start with a theoretical 1000 pounds invested

Margin_amount = 1000
# showing how the margin (equity) position would move over time
Margin_pos = (algo["algo_returns"].cumsum().apply(np.exp) * Margin_amount).astype(float)
    
Margin_pos = pd.DataFrame(Margin_pos.rename("Margin_pos", inplace=True))

#cumulative max returns the cumulative maximum values over time 
Margin_pos['cummax'] = Margin_pos.cummax()

# The drawdown values over time; when the drawdown is 0 we have reached a new high, and the period between two zeros is a drawdown period
Margin_pos["drawdown"] = Margin_pos['cummax']- Margin_pos['Margin_pos']
Margin_pos["drawdown_percent"] = Margin_pos["drawdown"]/Margin_pos['cummax']

# The maximum drawdown value
max_drawdown_percent = Margin_pos["drawdown_percent"].max()

# The point in time of the maximum drawdown, the idxmax returns the index of the position in this case a time stamp!!!
t_max= Margin_pos["drawdown_percent"].idxmax()
    
# The maximum drawdown percentage
max_drawdown_percentage = round(max_drawdown_percent*100,1)
    
""" now we can plot the max drawdown and the drawdown periods """

# first we want to find the times of all the highs, when drawdown = 0 
high_indicies = Margin_pos["drawdown"][Margin_pos["drawdown"]==0]
# calculate the time delta between all the highs
dd_periods = (high_indicies.index[1:].to_pydatetime() - high_indicies.index[:-1].to_pydatetime())

# Sometimes the algorithm never makes a new high after the start of the year; in that case we set the drawdown period to roughly one year of business days
if len(high_indicies) == 1:
            
    max_dd_period = 250
                    
else:
        
    max_dd_period = dd_periods.max().days
    
            
#now we can plot the results
Margin_pos[['Margin_pos', 'cummax']].plot(figsize=(10,6))
#we can specifically add a line for t_max """
plt.axvline(t_max,c='r', alpha=0.5)

print('The maximum drawdown period for the algo in 2018 would have been: {0} days, depicted by the longest horizontal cummax line'.format(max_dd_period))

print('The maximum drawdown percentage for the algo in 2018 would have been: {0} and occurred at {1}, marked on the graph by the red vertical line'.format(max_drawdown_percent, t_max))
The maximum drawdown period for the algo in 2018 would have been: 110 days, depicted by the longest horizontal cummax line
The maximum drawdown percentage for the algo in 2018 would have been: 0.05664783703970794 and occurred at 2018-06-05 21:00:00, marked on the graph by the red vertical line

9. Calculating VaR and analysing the risk and return characteristics of the algorithm

To be able to calculate VaR and get a probabilistic understanding of our algorithm's capability we need to run multiple simulations. To do this we will now run the entire code again, but this time with the full twenty different train and test splits (with each year from 2001 to 2020 acting as the test set in turn). This is essentially the same as running 20 simulations. Once we have our simulations we can calculate all sorts of statistical outputs such as the mean expected return, standard deviation, VaR figures, etc.
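
Before running the full loop, here is a minimal sketch of the VaR calculation itself, using made-up annual returns rather than the simulation outputs produced below. VaR here is simply a low percentile of the return distribution, reported as a positive loss figure:

import numpy as np

# made-up annual returns from hypothetical simulations
annual_returns = np.array([0.17, -0.08, 0.25, -0.14, 0.05, -0.10, 0.09, 0.30, -0.06, 0.02])

# the 99% confidence VaR corresponds to the 1st percentile of the return distribution
var_99 = -np.percentile(annual_returns, 1)
print('99% VaR: {:.1%}'.format(var_99))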

In [53]:
""" ------------------ MODEL INPUTS!!! -------------------------------"""

""" set data paramters """

index_totrade = "UK100"
candle_size = "D1"
PreviousCandles = 10000

""" set ML paramters """

lags_list = [10]

from sklearn.svm import SVC
SupportVec = SVC()
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()

cross_vals = 3

models_to_fit = [knn]
model_names = ['knn']

# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid_SVC = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

#logreg param space
c_space = np.logspace(-5, 8, 15)
param_grid_logreg = {'C': c_space}

#knn param space
leaf_size = list(np.arange(1,41,10))
n_neighbors = list(np.arange(1,21,5))
p=[1,2]
param_grid_knn = dict(leaf_size=leaf_size, n_neighbors=n_neighbors, p=p)

parameter_grids = [param_grid_knn]

refit_score = 'accuracy_score'


""" set trading algorithmn parameters """

Total_Balance = 1000
Risk_perTrade = 0.01
financing_cost = 0.0001
stoploss = 0.9



""" ------------------------------------------------- BACK TESTING ALGORITHMN ---------------------------------------"""

for lag_number in lags_list:

    """ ---------------importing data from FXCM API -----------"""

    data = api.get_candles(index_totrade, period=candle_size,number=PreviousCandles)

    data.sort_index(ascending=True, inplace=True)
 
    data['midopen']= (data['bidopen']+data['askopen'])/2
 
    data['midclose']= (data['bidclose']+data['askclose'])/2
    
    data['midhigh']= (data['bidhigh']+data['askhigh'])/2

    data['midlow']= (data['bidlow']+data['asklow'])/2

    # we need to work out the mean spread which will give us an indication of the transaction cost  

    spread = (data["askclose"]-  data["bidclose"]).mean()

    ptc = spread / data['midclose'].mean() 

    data['returns'] = np.log((data['midclose'])/data['midclose'].shift(1))

    data.dropna(inplace=True)

    data['direction'] = np.where(data['returns']> 0 , 1, -1)
    
    # creating a variable which will show the maximum range in up and down movements per day to be used later with stop loss function

    data['max_longreturn_thatday'] = np.log((data['midhigh'])/data['midclose'].shift(1))
                                                      
    data['max_shortreturn_thatday'] = np.log((data['midlow'])/data['midclose'].shift(1))

    data.dropna(inplace=True)

    # the index direction will be used as the label variable for the machine learning algorithms

    data['direction'] = np.where(data['returns']> 0 , 1, -1)

    data['year']=data.index.year
    
    # these variables will be used later to analyse candlestick heights and to refine the stop loss parameter
    
    data['OpentoHigh'] = (data['midhigh'] - data['midopen'])/data['midopen']
    
    data['OpentoLow'] = (data['midlow'] - data['midopen'])/data['midopen']

    # plot the data to make sure it looks correct
    
   

    """ ------------------------------------------------------------------------------"""

    
    """  ----------------------- adding lagged returns columns to the data ------------------------- """

    # we are going to collect the lagged feature column names in this list

    lagged_cols = [] 

    # loop through the last lag_number candles
    for lag in range(1,lag_number+1):
    
        col = 'lag_{}'.format(lag)     #naming the columns
    
        data[col] = data['returns'].shift(lag)     # assign the return from the corresponding lag using shift
    
        lagged_cols.append(col)

    
    #need to get rid of the NANs created from getting the lag returns
    data.dropna(inplace=True)

    """ --------- creating digitized lag returns columns to the data  -----------------"""

    # select only the lagged return features from the data
    lagged_data = data[lagged_cols]

    def create_bins(data,bins=[0]):
    
        x = list(data.columns.copy())
        
        for col in x:
            col_bin= col + '_bin'
            data[col_bin] = np.digitize(data[col], bins=bins)
        data.drop(x,axis=1, inplace=True)
     
    mean = data['returns'].mean()

    stdev = data['returns'].std()

    """ decide the bin set up here """

    bins = [mean - (2 * stdev), mean - stdev, mean , mean + stdev, mean + (2 * stdev) ]

    create_bins(lagged_data,bins)
    
    #concatenate the digitized lagged columns onto the data set

    data = pd.concat([data,lagged_data], axis=1)

    """ ----------------------------------------------------------------------"""   
 
    
    """  -------------------- Enact the machine learning ---------------------"""
    scorers = {
            'precision_score': make_scorer(precision_score),
            'recall_score': make_scorer(recall_score),
            'accuracy_score': make_scorer(accuracy_score),
            'balanced_accuracy_score' : make_scorer(balanced_accuracy_score)}
    
    
    # we want to train the data on 20 data sets comprising a combination of the last 20 years 
    Years = ['2001','2002','2003','2004','2005','2006','2007','2008','2009','2010','2011','2012','2013','2014','2015','2016','2017','2018','2019','2020']
    
    # establishing a results dataframe 
    
    results_columns = ['moves','precision -1','recall -1','precision 1','recall 1','best score','best params','ROC Score','annual return', 'num of trades',
                       
                       'maximum drawdown percentage', 'maximum drawdown period'] 

    Prediction_ResultsByYearColumns = pd.MultiIndex.from_product([model_names,results_columns], names=['model','prediciton ind'])

    Prediction_ResultsByYear = pd.DataFrame(index = Years , columns= Prediction_ResultsByYearColumns)
    
    
    
    
    for year_index, year in enumerate(Years):
        
                    
        # initialising the train and test data sets and the features and labels for the ML 
        Y_year = Years[year_index]   
        X_years = Years[:year_index] + Years[year_index+1:]

        X_train = data[data['year'].isin(X_years)][lagged_data.columns] 
        y_train = data[data['year'].isin(X_years)]['direction'] 
        
        X_test = data.loc[Y_year][lagged_data.columns] 
        y_test = data.loc[Y_year]['direction'] 

                
        # On each training data set we want to loop through all the ML algorithms so we can see which one is most predictive
             
        for model_index, model in enumerate(models_to_fit):
                                  
            # the grid search object is where we specify the parameters for fitting: the estimator to use, the number of cross validations, the scoring parameter to refit to, etc
            grid_search = GridSearchCV(estimator=models_to_fit[model_index], param_grid=parameter_grids[model_index], scoring=scorers, refit=refit_score,
                                       cv=cross_vals , return_train_score=True)
    
            output = grid_search.fit(X_train, y_train)

            """ looking at the prediction results of the fitted algorithms and adding them to the results table """
            # make the predictions
            y_pred = grid_search.predict(X_test)
    
            report = classification_report(y_test, y_pred)
            matrix = confusion_matrix(y_test, y_pred)
            
            #percentage of up movements in the test data set shows the balance of the classes
            Prediction_ResultsByYear.loc[Y_year,(model_names[model_index],results_columns[0])] = (y_test==1).sum()/len(y_test)
        
    
            #calculate precision -1
            Prediction_ResultsByYear.loc[Y_year,(model_names[model_index],results_columns[1])]= matrix [0][0] / (matrix[0][0]+ matrix[1][0])
            #calculate recall -1
            Prediction_ResultsByYear.loc[Y_year,(model_names[model_index],results_columns[2])]= matrix[0][0] / (matrix[0][0]+ matrix[0][1])
            #calculate precision 1
            Prediction_ResultsByYear.loc[Y_year,(model_names[model_index],results_columns[3])]= matrix[1][1] / (matrix[1][1]+ matrix[0][1])
            #calculate recall 1
            Prediction_ResultsByYear.loc[Y_year,(model_names[model_index],results_columns[4])]= matrix[1][1] / (matrix[1][1]+ matrix[1][0])
            # returns the predictive score for the tuned estimator 
            Prediction_ResultsByYear.loc[Y_year,(model_names[model_index],results_columns[5])]=grid_search.best_score_
            # returns the parameters for each estimator which return the greatest predictive score
            Prediction_ResultsByYear.loc[Y_year,(model_names[model_index],results_columns[6])]=[grid_search.best_params_]
            
            
            """ plotting ROC Curves"""
            
            # Plotting an ROC curve to examine the predictive capability of each algorithm
            # keep probabilities for the positive outcome only
            y_predPOSONLY = grid_search.predict_proba(X_test)[:,1].astype(float)
            
            #using this tells you which order the classes are in
            grid_search.classes_
            
            #generate a no skill model for comparison that always guesses 1 
            ns_pred = [1 for _ in range(len(y_test))]

            #Calculate ROC scores
            ns_auc = roc_auc_score(y_test,ns_pred)
            y_pred_auc = roc_auc_score(y_test,y_predPOSONLY)
            #ROC scores
            Prediction_ResultsByYear.loc[Y_year,(model_names[model_index],results_columns[7])]=y_pred_auc
            
            
            
                      
            
            """---------------------------------------------------------------------------------------------------"""
            
        
            
            """--------- now we will to look to build in the predictions to back test how the algorithmn would have performed-------------"""
 
            #we take the predicted return movements outputted from the ML algorithmn when applied to our test data set which the algo hasn't seen
     
            algo = pd.DataFrame(grid_search.predict(X_test), columns = ["algo_positions"])

            

         
            algo.set_index(X_test.index, inplace=True)

            algo = pd.concat([algo,data], axis=1 )

            algo["algo_pos_change_ind"] = algo["algo_positions"]*algo["algo_positions"].shift(1)                    

            algo.dropna(inplace=True)

            #set up a trade indicator (1 whenever the position flips sign)
            algo["Trade_ind"] = np.where(algo['algo_pos_change_ind'] < 0 , 1, 0)

            """ we establish the algo returns per day by mutliplying the prediction from the algo either long or short (1 or -1) by the actual return that day """ 
            algo["algo_returns"] = algo.algo_positions * algo.returns

            

            """-------------------------------------------------------------------------------------------------------------------------------"""


            """ ------ adding algorithm return performance and other metrics of the algorithmn to the results table -----------------"""
            
            # algo cumulative return performance inc stoploss and costs
            Prediction_ResultsByYear.loc[Y_year,(model_names[model_index],results_columns[8])] = (algo["algo_returns"].apply(np.exp).product()) -1
            #number of trades
            Prediction_ResultsByYear.loc[Y_year,(model_names[model_index],results_columns[9])] = algo.Trade_ind.sum()
           

            """ ----------------------------------investigating drawdown over individual years -------------------------------------------------------"""
                
            #Calculating the margin amount for the trade
            Margin_amount = Total_Balance * Risk_perTrade * 18.5
            # showing how the margin (equity) position would move over time
            Margin_pos = (algo["algo_returns"].cumsum().apply(np.exp) * Margin_amount).astype(float)
    
            Margin_pos = pd.DataFrame(Margin_pos.rename("Margin_pos", inplace=True))

            #cumulative max returns the cumulative maximum values over time 
            Margin_pos['cummax'] = Margin_pos.cummax()

            # The drawdown values over time; when the drawdown is 0 we have reached a new high, and the period between two zeros is a drawdown period
            Margin_pos["drawdown"] = Margin_pos['cummax']- Margin_pos['Margin_pos']
            Margin_pos["drawdown_percent"] = Margin_pos["drawdown"]/Margin_pos['cummax']

            # The maximum drawdown value
            max_drawdown_percent = Margin_pos["drawdown_percent"].max()

            # The point in time of the maximum drawdown, the idxmax returns the index of the position in this case a time stamp!!!
            t_max= Margin_pos["drawdown_percent"].idxmax()
    
            # The maximum drawdown percentage
            max_drawdown_percentage = round(max_drawdown_percent*100,1)
            # adding the maximum drawdown percentage to the results table
            Prediction_ResultsByYear.loc[Y_year,(model_names[model_index],results_columns[10])] = max_drawdown_percentage
    
            """ now we can plot the max drawdown and the drawdown periods """

            # first we want to find the times of all the highs, when drawdown = 0 
            high_indicies = Margin_pos["drawdown"][Margin_pos["drawdown"]==0]
            # calculate the time delta between all the highs
            dd_periods = (high_indicies.index[1:].to_pydatetime() - high_indicies.index[:-1].to_pydatetime())

            # Sometimes the algorithm never makes a new high after the start of the year; in that case we set the drawdown period to roughly one year of business days
            if len(high_indicies) == 1:
            
                max_dd_period = 250
                # adding the maximum drawdown period to the results table
                Prediction_ResultsByYear.loc[Y_year,(model_names[model_index],results_columns[11])] = 250
                
            else:
        
                max_dd_period = dd_periods.max()
                # adding the maximum drawdown period to the results table
                Prediction_ResultsByYear.loc[Y_year,(model_names[model_index],results_columns[11])] = max_dd_period.days
            
            
print(Prediction_ResultsByYear)    


print('\n\n'+ 'The mean expected annual return from the algorithm is : {}'.format(Prediction_ResultsByYear[('knn','annual return')].mean())) 

print('The stdev expectation of the annual returns from the algorithm is : {}'.format(Prediction_ResultsByYear[('knn','annual return')].std()) +'\n') 
    
""" ----------------------------------Caculating VAR of the algorithmn of different time periods -------------------------------------------------------"""

#VAR is usually quoted as the maximum loss expected over a certain time horizon within a stated confidence level

""" Because all the test data sets are all a year long it is as if we have run 20 simulations of our algorithmn
we can therefore make proabprobabilistic statements about the risk and return characterisitcs of the algorithmn """

# we need to import scipy.stats as it gives us the percentile values used for VaR
import scipy.stats as scs

percs = np.array([99.9,99,97.5,90,75,50])

VaR = scs.scoreatpercentile(Prediction_ResultsByYear[('knn','annual return')] ,percs)

def print_VaR(percs,VaR):
    print('%16s %16s' % ('Confidence Level', 'Value at Risk'))
    print(33 * '-') # creates a line of dashes 
    
    for pair in zip(percs,VaR):
            print('%16.2f %16.3f' % (100 - pair[0], - pair[1]))

print_VaR(percs,VaR)
model                knn                                               \
prediciton ind     moves precision -1 recall -1 precision 1  recall 1   
2001            0.552239      0.44186  0.633333    0.541667  0.351351   
2002            0.464286     0.557143  0.577778    0.491071  0.470085   
2003            0.517787     0.506757  0.614754    0.552381  0.442748   
2004            0.551181     0.479675  0.517544    0.580153  0.542857   
2005            0.559524     0.441667  0.477477    0.560606  0.524823   
2006            0.511905      0.46875  0.487805    0.491935  0.472868   
2007            0.541502      0.42446  0.508621         0.5  0.416058   
2008            0.437008     0.566667  0.475524    0.440299  0.531532   
2009            0.539683     0.439716  0.534483    0.513514  0.419118   
2010            0.534615     0.481203  0.528926    0.551181  0.503597   
2011            0.529644     0.482014  0.563025     0.54386  0.462687   
2012            0.529644     0.469388  0.579832    0.528302   0.41791   
2013            0.529183     0.434783  0.495868    0.487395  0.426471   
2014            0.477778     0.510417  0.347518    0.471264  0.635659   
2015             0.44086     0.569343       0.5    0.450704  0.520325   
2016            0.503817     0.496855  0.607692    0.504854  0.393939   
2017            0.539683     0.488889  0.568966     0.57265  0.492647   
2018            0.474308     0.571429  0.631579    0.537736     0.475   
2019            0.565217     0.432432  0.436364     0.56338  0.559441   
2020            0.481013     0.469388  0.560976         0.4  0.315789   

model                                                                      \
prediciton ind best score                                     best params   
2001             0.518005  [{'leaf_size': 31, 'n_neighbors': 16, 'p': 2}]   
2002             0.508652  [{'leaf_size': 11, 'n_neighbors': 16, 'p': 1}]   
2003             0.514533  [{'leaf_size': 21, 'n_neighbors': 16, 'p': 2}]   
2004             0.515757   [{'leaf_size': 1, 'n_neighbors': 11, 'p': 1}]   
2005             0.507987  [{'leaf_size': 11, 'n_neighbors': 11, 'p': 1}]   
2006             0.522185  [{'leaf_size': 31, 'n_neighbors': 16, 'p': 1}]   
2007             0.517858  [{'leaf_size': 11, 'n_neighbors': 16, 'p': 2}]   
2008             0.515757   [{'leaf_size': 1, 'n_neighbors': 11, 'p': 1}]   
2009             0.507542  [{'leaf_size': 21, 'n_neighbors': 16, 'p': 2}]   
2010             0.517778  [{'leaf_size': 21, 'n_neighbors': 16, 'p': 1}]   
2011             0.508098  [{'leaf_size': 31, 'n_neighbors': 16, 'p': 1}]   
2012             0.511426  [{'leaf_size': 31, 'n_neighbors': 16, 'p': 2}]   
2013             0.514546  [{'leaf_size': 31, 'n_neighbors': 16, 'p': 1}]   
2014              0.51225   [{'leaf_size': 1, 'n_neighbors': 11, 'p': 2}]   
2015             0.516849  [{'leaf_size': 11, 'n_neighbors': 16, 'p': 2}]   
2016             0.507561   [{'leaf_size': 1, 'n_neighbors': 16, 'p': 1}]   
2017              0.50621   [{'leaf_size': 1, 'n_neighbors': 16, 'p': 1}]   
2018             0.513645  [{'leaf_size': 11, 'n_neighbors': 16, 'p': 2}]   
2019              0.51653  [{'leaf_size': 21, 'n_neighbors': 11, 'p': 1}]   
2020             0.510573  [{'leaf_size': 31, 'n_neighbors': 16, 'p': 2}]   

model                                                 \
prediciton ind ROC Score annual return num of trades   
2001            0.436486   -0.00328148            31   
2002            0.574865     -0.083547           132   
2003             0.51436    -0.0611279           134   
2004            0.508803      0.171821           114   
2005            0.528017     0.0539645           116   
2006            0.459539    -0.0802906           138   
2007            0.432733     0.0699638           124   
2008            0.489447     0.0915225           123   
2009            0.486879     -0.062962           131   
2010            0.512872      0.249738           127   
2011            0.527436      0.231289           129   
2012            0.503386     -0.102045           125   
2013            0.459194      -0.17242           128   
2014            0.499093     -0.142094           141   
2015            0.515713      0.295743           135   
2016            0.476224     0.0357738           133   
2017            0.544878      0.055853           113   
2018            0.560495      0.262661           147   
2019            0.482486   0.000161015           118   
2020            0.469833     0.0164258            33   

model                                                               
prediciton ind maximum drawdown percentage maximum drawdown period  
2001                                  11.7                       8  
2002                                  26.9                     109  
2003                                  19.8                      15  
2004                                   9.9                     121  
2005                                   9.1                      26  
2006                                    21                      34  
2007                                   9.9                     182  
2008                                  25.4                     140  
2009                                  30.5                      47  
2010                                  11.7                     125  
2011                                  34.6                     112  
2012                                  15.2                     250  
2013                                  21.4                      29  
2014                                  18.9                       3  
2015                                   9.7                     111  
2016                                  10.7                      93  
2017                                  10.4                     269  
2018                                   5.7                     110  
2019                                  15.4                     110  
2020                                  21.2                      28  


The mean expected annual return from the algorithm is : 0.04135745178178965
The stdev expectation of the annual returns from the algorithm is : 0.13953969747271508

Confidence Level    Value at Risk
---------------------------------
            0.10           -0.295
            1.00           -0.289
            2.50           -0.280
           10.00           -0.251
           25.00           -0.112
           50.00           -0.026

10. Drawing conclusions from the simulated outputs

From the results above we can see that the expected annual return from the algorithm is 4.1%, which is a positive result!

However, the standard deviation is quite high at 13.9%, showing that the algorithm is high risk.

From the VaR analysis we can see there is a 1% chance of a maximum loss of 28.9%. You would have to decide what level of risk you are able to accept for a given level of expected return!