Predicting Stock Prices: Linear Regression (Python)

Welcome to the introduction to the Linear Regression section of the Machine Learning with Python series.

This article is intended for readers who already have a basic understanding of Linear Regression, for example from using another tool such as SAS or R for regression analysis. If you would like to learn more about Linear Regression first, please follow the link below:

https://www.fromthegenesis.com/category/statistical-modeling-project/linear-regression/

Python Libraries:

For Linear Regression analysis, the following libraries must be installed on your system:

  1. numpy
  2. scikit-learn
  3. matplotlib
  4. pandas

If they are not installed yet, use the commands below to install them:

pip install numpy

pip install scikit-learn

pip install matplotlib

pip install pandas
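To confirm the installations succeeded, you can run a quick import check (a minimal sketch; the printed version numbers will vary from system to system):

# Quick sanity check that the libraries import correctly
import numpy, sklearn, matplotlib, pandas
print(numpy.__version__, sklearn.__version__, matplotlib.__version__, pandas.__version__)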

We also need another very useful library, "quandl", which provides free-to-use financial and economic datasets for analysis.

pip install quandl

To begin with Linear Regression: our goal is to find the equation of the best-fit line through the data, so that we can predict the value of the dependent variable from the values of the independent variables.

Linear Regression is a supervised machine learning algorithm that develops an equation (a statistical model) which can then be reused to make predictions on new data.
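In its general form, the fitted model is a linear equation of the independent variables (plain notation; b0 is the intercept and b1 … bn are the coefficients the algorithm estimates):

# General form of the fitted linear model
# y = b0 + b1*x1 + b2*x2 + ... + bn*xn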

Linear Regression is popularly used for modeling stock prices, so we will start with a financial example using sample data available through the "quandl" library.

Let us first import the libraries (we are using Spyder for this analysis, but you could also use Jupyter, PyCharm, or any other interface):

import pandas as pd

import quandl

df = quandl.get("WIKI/GOOGL")

print(df.head())

In case you face an import error with quandl, try a capital "Q" (import Quandl); older versions of the package used that name, and it should solve the problem.
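Also note that anonymous quandl requests are rate-limited. If you hit a limit, you can set a free API key (obtainable from the Quandl website) before calling quandl.get; a minimal sketch, with a placeholder key:

# Optional: set a free Quandl API key to avoid anonymous rate limits
# ("YOUR_API_KEY" is a placeholder; use your own key from the Quandl website)
quandl.ApiConfig.api_key = "YOUR_API_KEY"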

Running print(df.head()) shows the first few rows of the data we are about to model.

Our data set has 12 variables in total, but we do not need all of them. If we look closely, we find two types of columns: the regular (basic) columns and columns prefixed with "Adj.". We only need the columns with the "Adj." prefix; since the adjusted columns are derived from the basic ones, keeping both would be redundant.

So, let us select the variables we need for our analysis:

df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]

Now we have just the adjusted columns, five in total. For a better understanding of linear regression, we will do some manipulation of the data to make it more suitable for analysis.

# High-low spread as a percentage of the closing price (a simple volatility measure)
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Close'] * 100.0

# Daily percent change from open to close
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0

Now we build the new data frame with just these columns and take a look at it:

df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]

print(df.head())

Next, let us import a few more libraries that we will need for the analysis:

import quandl, math

import numpy as np

from sklearn import preprocessing, svm

# train_test_split lives in sklearn.model_selection in current scikit-learn
# (older versions had it in the now-removed sklearn.cross_validation module)
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

We need numpy to convert the data into numpy arrays, which scikit-learn can read. We will also do one more data adjustment:

# The column we want to forecast
forecast_col = 'Adj. Close'

# Replace missing values with an outlier value rather than dropping whole rows
df.fillna(value=-99999, inplace=True)

# Forecast 1% of the length of the data set into the future
forecast_out = int(math.ceil(0.01 * len(df)))

# The label is the Adj. Close price forecast_out rows into the future
df['label'] = df[forecast_col].shift(-forecast_out)
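To see what the shift is doing, here is a small toy example (the numbers are made up purely for illustration): shifting a series by -2 moves each value two rows earlier, so each row is paired with the value two steps ahead, and the last two rows become NaN.

# Toy illustration of a negative shift (values are arbitrary)
s = pd.Series([10, 11, 12, 13, 14])
print(s.shift(-2))
# 0    12.0
# 1    13.0
# 2    14.0
# 3     NaN
# 4     NaN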

We then drop any rows that still contain NaN values (the most recent forecast_out rows, whose labels lie in the future) from the data frame:

df.dropna(inplace=True)

We are finally ready to build the linear regression model with our data. Let us separate the independent and dependent variables:

x = np.array(df.drop(columns=['label']))

y = np.array(df['label'])

Standardizing the independent variables (the features in x):

x = preprocessing.scale(x)

Now divide the data into training and test datasets:

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
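A side note on scaling: preprocessing.scale standardizes using statistics from the whole data set, so a little information from the test rows leaks into the training features. A common alternative (a minimal sketch, not what the rest of this walkthrough uses) is to fit a StandardScaler on the training split only:

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)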

Although several regression algorithms are available in sklearn, for this analysis we will use Support Vector Regression (SVR):

clf = svm.SVR()

Now we have the model (clf) that will be used for the analysis; let us train our machine learning algorithm:

clf.fit(x_train, y_train)

Check the accuracy of the trained model on the test data (for a regressor, score returns the R² value):

confidence = clf.score(x_test, y_test)

print(confidence)
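Since we imported LinearRegression earlier and the article is about linear regression, you can swap it in (or try SVR with a linear kernel instead of the default 'rbf') and compare the scores; a minimal sketch using the same training and test arrays:

# Ordinary least squares linear regression, trained and scored the same way
clf_lr = LinearRegression()
clf_lr.fit(x_train, y_train)
print(clf_lr.score(x_test, y_test))

# SVR with a linear kernel instead of the default 'rbf'
clf_svr_lin = svm.SVR(kernel='linear')
clf_svr_lin.fit(x_train, y_train)
print(clf_svr_lin.score(x_test, y_test))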

We have uploaded all the commands used in this analysis separately. Please use the link below to download them:

https://drive.google.com/open?id=1AvAQsvVhHjcxRYIrQNWXLPe7hhRRf0ga

Let us know if you need any help or have any query. Please reach us at fromthegenesis@gmail.com or leave a comment.

 
