**Introduction**

Machine Learning has many advantages. It is the hot topic right now. For a trader or a fund manager, the pertinent question is “How can I apply this new tool to generate more alpha?”. I will explore one such model that answers this question in a series of blogs.

“How can I apply this new tool to generate more alpha?”Click To Tweet

This blog has been divided into the following segments:

- Getting the data and making it usable.
- Creating Hyper-parameters.
- Splitting the data into test and train sets.
- Getting the best-fit parameters to create a new function.
- Making the predictions and checking the performance.
- Finally, some food for thought.

**Pre-requisites**

You may add one line to install the packages “pip install numpy pandas …”

You can install the necessary packages using the following code, in the Anaconda Prompt.

- pip install pandas
- pip install pandas-datareader
- pip install numpy
- pip install sklearn
- pip install matplotlib

Before we go any further, let me state that this code is written in Python 2.7. So let’s dive in.

**Dive in**

**Problem Statement:** Let’s start by understanding what we are aiming to do. By the end of this blog, I will show you how to create an algorithm that can predict the closing price of a day from the previous OHLC(Open, High, Low, Close) data. I also want to monitor the prediction error along with the size of the input data.

Let us import all the libraries and packages needed for us to build this machine learning algorithm.

**Getting the data and making it usable**

To create any algorithm we need data to train the algorithm and then to make predictions on new unseen data. In this blog, we will fetch the data from Yahoo. To accomplish this we will use the data reader function from the panda’s library. This function is extensively used and it enables you to get data from many online data sources.

We are fetching the data of the SPDR ETF linked to S&P 500. This stock can be used as a proxy for the performance of the S&P 500 index. We specify the year starting from which we will be pulling the data. Once the data is in, we will discard any data other than the OHLC, such as volume and adjusted Close, to create our data frame ‘df ’.

Now we need to make our predictions from past data, and these past features will aid the machine learning model trade. So, let’s create new columns in the data frame that contain data with one day lag.

Note the capital letters are dropped for lower-case letters in the names of new columns.

**Creating Hyper-parameters**

Although the concept of hyper-parameters is worthy of a blog in itself, for now I will just say a few words about them. These are the parameters that the machine learning algorithm can’t learn over but needs to be iterated over. We use them to see which predefined functions or parameters yield the best fit function.

In this example, I have used Lasso regression which uses L1 type of regularization. This is a type of machine learning model based on regression analysis which is used to predict continous data.. This type of regularization is very useful when you are using feature selection. It is capable of reducing the coefficient values to zero. The imputer function replaces any NaN values that can affect our predictions with mean values, as specified in the code. The ‘steps’ is a bunch of functions that are incorporated as a part of the Pipeline function. The pipeline is a very efficient tool to carry out multiple operations on the data set. Here we have also passed the Lasso function parameters along with a list of values that can be iterated over. Although I am not going into details of what exactly these parameters do, they are something worthy of digging deeper into. Finally, I called the randomized search function for performing the cross-validation.

In this example, we used 5 fold cross validation. In a k-fold cross-validation, the original sample is randomly partitioned into k equal size subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k-1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. Cross-validation combines (averages) measures of fit (prediction error) to derive a more accurate estimate of model prediction performance. Based on the fit parameter we decide the best features.

** Splitting the data into test and train sets**

First, let us split the data into the input values and the prediction values. Here we pass on the OHLC data with one day lag as the data frame X and the Close values of the current day as y. Note the column names below in lower-case.

In this example, To keep the blog short and relevant, I have chosen not to create any polynomial features but to use only the raw data. If you are interested in various combinations of the input parameters and with higher degree polynomial features, you are free to transform the data using the PolynomialFeature() function from the preprocessing package of scikit learn.

Now, let us also create a dictionary that holds the size of the train data set and its corresponding average prediction error.

** Getting the best-fit parameters to create a new function**

I want to measure the performance of the regression function as compared to the size of the input dataset. In other words, I want to see if by increasing the input data, will we be able to reduce the error. For this, I used the for loop to iterate over the same data set but with different lengths. At this point, I would like to add that for those of you who are interested explore the ‘reset’ function and how it will help us in making a more reliable prediction.

(Hint: It is a part of the python magic commands)

Let me explain what I did in a few steps. First, I created a set of periodic numbers ‘t’ starting from 50 to 97, in steps of 3. The purpose of these numbers is to choose the percentage size of the dataset that will be used as train data set. Second, for a given value of ‘t’ I split the length of the data set to the nearest integer corresponding to this percentage. Then I divided the total data into train data, which includes the data from the beginning till the split, and test data, which includes the data from the split till the end. The reason for adopting this approach and not using the random split is to maintain the continuity of the time series.

After this, we pull the best parameters that generated the lowest cross-validation error and then use these parameters to create a new reg1 function which will be a simple Lasso regression fit with the best parameters.

** Making the predictions and checking the performance**

Now let us predict the future close values. To do this we pass on test X, containing data from split to end, to the regression function using the predict() function. We also want to see how well the function has performed, so let us save these values in a new column.

As you might have noticed, I created a new error column to save the absolute error values. Then I took the mean of the absolute error values, which I saved in the dictionary that we had created earlier.

Now it’s time to plot and see what we got.

I created a new Range value to hold the average daily trading range of the data. It is a metric that I would like to compare with when I am making a prediction. The logic behind this comparison is that if my prediction error is more than the day’s range then it is likely that it will not be useful. I might as well use the previous day’s High or Low as the prediction, which will turn out be more accurate. Please note I have used the split value outside the loop. This implies that the average range of the day that you see here is relevant to the last iteration.

Let’s execute the code and see what we get.

**Some food for thought**

What does this scatter plot tell you? Let me ask you a few questions.

- Is the equation over-fitting?
- The performance of the data improved remarkably as the train data set size increased. Does this mean if we give more data the error will reduce further?
- Is there an inherent trend in the market, allowing us to make better predictions as the data set size increases?
- Last but the best question How will we use these predictions to create a trading strategy?

My next blog ‘Trading Using Machine Learning In Python Part-2‘ will answer all these questions for you!

**Next Step**

A detailed guide to help you learn how to implement a trading strategy using the regime predictions in Python. We have explained it all in our post ‘Trading Using Machine Learning In Python – SVM (Support Vector Machine)‘.

**Update**

*We have noticed that some users are facing challenges while downloading the market data from Yahoo and Google Finance platforms. In case you are looking for an alternative source for market data, you can use Quandl for the same. *

**Download Data Files**

- Python – regression.py