This project is maintained by John-Land
A traditional measure of whether a stock is expensive or cheap, is to use a valuation multiple. A valuation multiple relates the market value of an asset relative to a key statistic that is assumed to relate to that value. To be useful, that statistic – whether earnings, cash flow or some other measure – must bear a logical relationship to the market value observed; to be seen, in fact, as the driver of that market value. For more information about valuation multiples, refer to below link: https://en.wikipedia.org/wiki/Valuation_using_multiples#Valuation_multiples
The most famous valuation multiple the PE (Price to earnings) ratio. This is the ratio between the market value of all the shares outstanding for a company, divided by the company’s Net Income. It is often also defined as Price per share divided by earnings per share, which is simply dividing both the Market Capitalization and the Net income by the shares outstanding. In this project we will try to predict the PE ratios for the 3000 biggest companies listed on the US stock exchanges.
Before analysing our dataset, let us first try to understand what fundamental financial information about the company should be driving the PE ratio based on Financial Theory.
The present value of any Cash Flow generating asset is the value of all future cash flows the asset will generate over its lifetime, discounted back to today with a risk adjusted interest rate.
In its simplest form, where we expect the cash flow (CF) to grow at the same rate for ever (g), the present value (PV) is calculated with below perpetuity formula.
\[PV = \frac{CF * (1+g)}{(r-g)}\]When trying to value a company based on its future cash flows, we can either look at it through the lenses of all capital providers (equity and debt providers) and we would use the FCFF (Free Cash Flow to the Firm) as the measure of cash flow (CF), or we could look at it purely as an Equity investor and we would use the FCFE (Free Cash Flow to Equity) as our measure of CF.
The main difference is that the FCFF is the cash flow left over after cash flow needs for operations and Investments are met, but before any debt repayments, whereas the FCFE is after debt payments (cash flow left over for equity investors).
If we want to estimate the value of Equity directly, we would therefore use the FCFE as the appropriate measure of CF.
As the Market Capitalization (MC) (numerator in the PE ratio), is the aggregate market estimation of the value of Equity, we can rewrite the above formula as follows:
\(MC = \frac{FCFE*(1 + g)}{(r - g)}\)
FCFE can be rewritten as:
\(MC = \frac{(NetIncome+DA-CAPEX-deltaWC-NetDebtRepayment)*(1 + g)}{(r - g)}\)
Re-writing the numerator relative to Net Income gives us:
\(MC = \frac{NetIncome*(1+\frac{DA}{Net Income}-\frac{CAPEX}{Net Income}-\frac{deltaWC}{Net Income}-\frac{NetDebtRepayment}{Net Income})*(1 + g)}{(r - g)}\)
\(MC = \frac{NetIncome*(1-\frac{CAPEX - DA + deltaWC}{Net Income}-\frac{NetDebtRepayment}{Net Income})*(1 + g)}{(r - g)}\)
\(MC = \frac{NetIncome*(1-\frac{reInvestment}{Net Income}-\frac{NetDebtRepayment}{Net Income})*(1 + g)}{(r - g)}\)
Deviding both sides by the Net Income (Earnings), gives us the Price to Earnings ratio:
\(\frac{Price}{Earnings}=\frac{MC}{NetIncome}= \frac{\frac{NetIncome*(1-\frac{reInvestment}{Net Income}-\frac{NetDebtRepayment}{Net Income})*(1 + g)}{(r - g)}}{Net Income}\)
\(\frac{Price}{Earnings}=\frac{MC}{NetIncome}= \frac{(1-\frac{reInvestment}{Net Income}-\frac{NetDebtRepayment}{Net Income})*(1 + g)}{(r - g)}\)
Finally converting the discount rate r to the cost of equity gives us:
\(\frac{Price}{Earnings}=\frac{MC}{NetIncome}= \frac{(1-\frac{reInvestment}{Net Income}-\frac{NetDebtRepayment}{Net Income})*(1 + g)}{((rf+\beta*ERP) - g)}\)
This formula gives us some understanding of what we would expect the fundamental drivers of the PE ratio to be. The fundamental drivers of the P/E ratio are
rf (risk free rate traditionally measured as the US T-Bond rate) and ERP (Equity risk premium, basically the premium that equity investor require on top of the risk free rate for investing in the overall stock market) values are the same for all stocks, therefore changes in these overall market metrics will affect all PE ratios, but these will not help differentiation between PE ratios of individual companies, as these are equal for all companies.
We will therefore try to obtain financial metrics that are related to points 1-4, as we would expect these to be partial drivers of the PE ratios seeen in the market.
The company data was downloaded from the data provider “Finbox”. For more information, refer to below link: https://finbox.com
We will first import the data and check for any missing values and some basic information. Then we will remove outliers before splitting the data into training and testing sets, and analysing the training set in more detail.
# linear algebra
import numpy as np
# data processing
import pandas as pd
#display max columns and rows, to be able to inspect big data tables
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
#data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
# conntects to google drive if ran in colab
# ignored if ran outside of colab
if 'google.colab' in str(get_ipython()):
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
# load file from google drive path, if ran in colab
# load file from notebook folder, if ran outside colab
if 'google.colab' in str(get_ipython()):
path = '/content/drive/My Drive/Colab Notebooks/stock_pricing/'
else:
path = str()
file = path + 'large_Financials_unlinked.xlsm'
company_data = pd.read_excel(file)
#original data structure
company_data.head(5)
Ticker | Full_Ticker | Name | Sector | Industry_GICS | Index_Membership | Market_Cap | Net_Debt | Minority_Interest | EV | EBITDA | Net_Income | Equity | Avg_EBITDA_Margin_Forecast_10y | EBITDA_Forecast_CAGR_10y | Avg_Net_Income_Margin_Forecast_10y | Net_Income_Forecast_CAGR_10y | ROE | Beta_5y | Net_Debt_perc_EV | EV_TTM_EBITDA | Market_Cap_TTM_Net_Income | Market_Cap_BV_Equity | Net_Debt_perc_Market_Cap | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AAPL | NASDAQGS:AAPL | Apple Inc. | Information Technology | Technology Hardware, Storage & Peripherals | Dow Jones Composite Average, Dow Jones Industr... | 2.166836e+06 | -69552.0 | 0.0 | 2.097284e+06 | 77344.0 | 57411.0 | 65339.0 | 0.28493 | 0.0381633 | 0.206377 | 0.0348267 | 0.736856 | 1.283613 | -0.033163 | 27.116315 | 37.742528 | 33.162985 | -0.032098 |
1 | AMZN | NASDAQGS:AMZN | Amazon.com, Inc. | Consumer Discretionary | Internet & Direct Marketing Retail | Nasdaq 100, Nasdaq Composite, Russell 1000, Ru... | 1.565880e+06 | 28412.0 | 0.0 | 1.594292e+06 | 43708.0 | 17377.0 | 82775.0 | 0.183412 | 0.145604 | 0.0865852 | 0.201361 | 0.249521 | 1.198004 | 0.017821 | 36.475980 | 90.112226 | 18.917308 | 0.018144 |
2 | MSFT | NASDAQGS:MSFT | Microsoft Corporation | Information Technology | Software | Dow Jones Composite Average, Dow Jones Industr... | 1.624977e+06 | -54733.0 | 0.0 | 1.570244e+06 | 68123.0 | 47496.0 | 123392.0 | 0.475806 | 0.0485504 | 0.324014 | 0.048353 | 0.413993 | 0.826155 | -0.034856 | 23.050136 | 34.212932 | 13.169228 | -0.033682 |
3 | GOOG | NASDAQGS:GOOG | Alphabet Inc. | Communication Services | Interactive Media & Services | Nasdaq 100, Nasdaq Composite, Russell 1000, Ru... | 1.178252e+06 | -105501.0 | 0.0 | 1.072751e+06 | 48075.0 | 35713.0 | 212920.0 | 0.381029 | 0.107043 | 0.260145 | 0.110387 | 0.175111 | 0.994747 | -0.098346 | 22.314107 | 32.992235 | 5.533776 | -0.089540 |
4 | FB | NASDAQGS:FB | Facebook, Inc. | Communication Services | Interactive Media & Services | Nasdaq 100, Nasdaq Composite, Russell 1000, Ru... | 7.151778e+05 | -44476.0 | 0.0 | 6.707018e+05 | 30221.0 | 25276.0 | 117731.0 | 0.530226 | 0.119821 | 0.351485 | 0.131987 | 0.238757 | 1.182773 | -0.066313 | 22.193237 | 28.294738 | 6.074677 | -0.062189 |
#original data shape
original_data_shape = company_data.shape
print('The original data shape is', original_data_shape)
The original data shape is (3171, 24)
#drop duplicates
company_data = company_data.drop_duplicates('Name' , keep='first')
print(original_data_shape[0] - company_data.shape[0], 'companies were removed due to duplication.')
print('The new data shape is', company_data.shape)
21 companies were removed due to duplication.
The new data shape is (3150, 24)
#remove not meaningful samples
current_num_samples = company_data.shape[0]
company_data = company_data[(company_data.Market_Cap > 0)]
company_data = company_data[(company_data.Net_Income > 0)]
new_num_samples = company_data.shape[0]
print(current_num_samples - new_num_samples, 'companies were removed due to not meaningful data.')
print('The new data shape is', company_data.shape)
1317 companies were removed due to not meaningful data.
The new data shape is (1833, 24)
#filter data for required columns
pe_data = company_data[['Sector',
'Name',
'Market_Cap_TTM_Net_Income',
'Net_Income_Forecast_CAGR_10y',
'Avg_Net_Income_Margin_Forecast_10y',
'Beta_5y',
'Net_Debt_perc_Market_Cap'
]].set_index('Name')
pe_data.head(5)
Sector | Market_Cap_TTM_Net_Income | Net_Income_Forecast_CAGR_10y | Avg_Net_Income_Margin_Forecast_10y | Beta_5y | Net_Debt_perc_Market_Cap | |
---|---|---|---|---|---|---|
Name | ||||||
Apple Inc. | Information Technology | 37.742528 | 0.0348267 | 0.206377 | 1.283613 | -0.032098 |
Amazon.com, Inc. | Consumer Discretionary | 90.112226 | 0.201361 | 0.0865852 | 1.198004 | 0.018144 |
Microsoft Corporation | Information Technology | 34.212932 | 0.048353 | 0.324014 | 0.826155 | -0.033682 |
Alphabet Inc. | Communication Services | 32.992235 | 0.110387 | 0.260145 | 0.994747 | -0.089540 |
Facebook, Inc. | Communication Services | 28.294738 | 0.131987 | 0.351485 | 1.182773 | -0.062189 |
#check column data types
pe_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1833 entries, Apple Inc. to Smart Sand, Inc.
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sector 1833 non-null object
1 Market_Cap_TTM_Net_Income 1833 non-null float64
2 Net_Income_Forecast_CAGR_10y 1833 non-null object
3 Avg_Net_Income_Margin_Forecast_10y 1833 non-null object
4 Beta_5y 1833 non-null float64
5 Net_Debt_perc_Market_Cap 1833 non-null float64
dtypes: float64(3), object(3)
memory usage: 100.2+ KB
#ensure columns have the correct data type
pe_data['Market_Cap_TTM_Net_Income'] = pe_data['Market_Cap_TTM_Net_Income'].apply(pd.to_numeric, errors='coerce')
pe_data['Net_Income_Forecast_CAGR_10y'] = pe_data['Net_Income_Forecast_CAGR_10y'].apply(pd.to_numeric, errors='coerce')
pe_data['Avg_Net_Income_Margin_Forecast_10y'] = pe_data['Avg_Net_Income_Margin_Forecast_10y'].apply(pd.to_numeric, errors='coerce')
pe_data['Beta_5y'] = pe_data['Beta_5y'].apply(pd.to_numeric, errors='coerce')
pe_data['Net_Debt_perc_Market_Cap'] = pe_data['Net_Debt_perc_Market_Cap'].apply(pd.to_numeric, errors='coerce')
pe_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1833 entries, Apple Inc. to Smart Sand, Inc.
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sector 1833 non-null object
1 Market_Cap_TTM_Net_Income 1833 non-null float64
2 Net_Income_Forecast_CAGR_10y 1562 non-null float64
3 Avg_Net_Income_Margin_Forecast_10y 1831 non-null float64
4 Beta_5y 1833 non-null float64
5 Net_Debt_perc_Market_Cap 1833 non-null float64
dtypes: float64(5), object(1)
memory usage: 100.2+ KB
#check for missing values
pe_data.isna().sum()
Sector 0
Market_Cap_TTM_Net_Income 0
Net_Income_Forecast_CAGR_10y 271
Avg_Net_Income_Margin_Forecast_10y 2
Beta_5y 0
Net_Debt_perc_Market_Cap 0
dtype: int64
#remove missing values
current_num_samples = pe_data.shape[0]
pe_data = pe_data.dropna()
pe_data.isna().sum()
new_num_samples = pe_data.shape[0]
print(current_num_samples - new_num_samples, 'companies were removed due to missing data.')
print('The new data shape is', pe_data.shape)
271 companies were removed due to missing data.
The new data shape is (1562, 6)
We will look at the individual histograms of all variables, to see if we have potential outliers, and use Gaussian mixture models to later remove these outliers.
#summary statistics before outlier removal
pe_data.describe()
Market_Cap_TTM_Net_Income | Net_Income_Forecast_CAGR_10y | Avg_Net_Income_Margin_Forecast_10y | Beta_5y | Net_Debt_perc_Market_Cap | |
---|---|---|---|---|---|
count | 1562.000000 | 1562.000000 | 1562.000000 | 1562.000000 | 1562.000000 |
mean | 88.873737 | 0.040772 | 0.150340 | 1.074986 | 0.325119 |
std | 783.710029 | 0.132170 | 0.127199 | 0.559106 | 1.627939 |
min | 0.706985 | -0.342594 | -0.038680 | -1.269432 | -6.197720 |
25% | 15.490751 | 0.008678 | 0.075073 | 0.724133 | -0.020051 |
50% | 25.771183 | 0.046963 | 0.114935 | 1.060399 | 0.116504 |
75% | 45.413055 | 0.093754 | 0.188474 | 1.407431 | 0.371820 |
max | 25745.355250 | 0.958556 | 1.941181 | 3.187500 | 40.707549 |
sns.histplot(pe_data[['Market_Cap_TTM_Net_Income']])
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab7fa1b00>
sns.histplot(np.log(pe_data[['Market_Cap_TTM_Net_Income']]))
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab53741d0>
sns.histplot(pe_data[['Net_Income_Forecast_CAGR_10y']])
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab88663c8>
sns.histplot(pe_data[['Avg_Net_Income_Margin_Forecast_10y']])
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab898e908>
sns.histplot(pe_data[['Beta_5y']])
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab4946630>
sns.histplot(pe_data[['Net_Debt_perc_Market_Cap']])
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab4ba8c18>
#check for outliers with multivariate Guassian
from sklearn.covariance import EllipticEnvelope
X = pe_data[['Market_Cap_TTM_Net_Income',
'Net_Income_Forecast_CAGR_10y',
'Avg_Net_Income_Margin_Forecast_10y',
'Beta_5y',
'Net_Debt_perc_Market_Cap']]
cov = EllipticEnvelope(random_state=0, contamination=0.1).fit(X)
# predict returns 1 for an inlier and -1 for an outlier
pe_data[cov.predict(X)== -1]
Sector | Market_Cap_TTM_Net_Income | Net_Income_Forecast_CAGR_10y | Avg_Net_Income_Margin_Forecast_10y | Beta_5y | Net_Debt_perc_Market_Cap | |
---|---|---|---|---|---|---|
Name | ||||||
JPMorgan Chase & Co. | Financials | 16.751677 | -0.308714 | 0.076850 | 1.221113 | -1.502979 |
salesforce.com, inc. | Information Technology | 55.467165 | 0.494147 | 0.167470 | 1.171743 | -0.016569 |
Zoom Video Communications, Inc. | Information Technology | 238.968203 | 0.548485 | 0.294907 | 0.000000 | -0.017654 |
Citigroup Inc. | Financials | 11.521894 | -0.249653 | 0.058769 | 1.951680 | -2.650639 |
Morgan Stanley | Financials | 13.922561 | -0.189029 | 0.076157 | 1.578256 | -2.002738 |
Square, Inc. | Information Technology | 330.866730 | 0.287034 | 0.169256 | 2.710084 | -0.002075 |
The Goldman Sachs Group, Inc. | Financials | 15.782209 | -0.147498 | 0.094139 | 1.498424 | -2.264476 |
Autodesk, Inc. | Information Technology | 164.201527 | 0.267383 | 0.312567 | 1.440651 | 0.006867 |
Rocket Companies, Inc. | Financials | 2.379811 | 0.133660 | 0.335283 | 0.000000 | 10.451434 |
Chipotle Mexican Grill, Inc. | Consumer Discretionary | 165.322824 | 0.112505 | 0.097372 | 1.310399 | 0.053183 |
Match Group, Inc. | Communication Services | 462.034079 | 0.122597 | 0.289083 | 0.000000 | 0.078802 |
IQVIA Holdings Inc. | Healthcare | 206.344869 | 0.217596 | 0.096187 | 1.408088 | 0.317127 |
Marriott International, Inc. | Consumer Discretionary | 234.753507 | 0.088121 | 0.321391 | 1.828256 | 0.251279 |
The Trade Desk, Inc. | Information Technology | 267.645982 | 0.262910 | 0.426602 | 2.571428 | -0.006479 |
Kinder Morgan, Inc. | Energy | 282.232013 | 0.024985 | 0.214225 | 0.973214 | 0.988606 |
Yandex N.V. | Communication Services | 0.956178 | 0.247145 | 0.151302 | 1.418067 | -6.197720 |
GoodRx Holdings, Inc. | Healthcare | 859.767452 | 0.275670 | 0.265819 | 0.000000 | -0.019991 |
Altice USA, Inc. | Communication Services | 180.741036 | 0.214488 | 0.077281 | 1.117957 | 1.366004 |
Insulet Corporation | Healthcare | 606.399539 | 0.347631 | 0.115667 | 0.753676 | 0.001678 |
Royalty Pharma plc | Healthcare | 8.626940 | -0.018368 | 0.783479 | 0.000000 | 0.189492 |
Ceridian HCM Holding Inc. | Information Technology | 1274.738272 | 0.106253 | 0.126695 | 1.383878 | 0.030229 |
PPD, Inc. | Healthcare | 176.904951 | 0.242234 | 0.081675 | 0.000000 | 0.303869 |
Sunrun Inc. | Industrials | 2190.510881 | 0.389808 | 0.280920 | 2.331932 | 0.148653 |
Paylocity Holding Corporation | Information Technology | 168.748963 | 0.125200 | 0.187379 | 1.358193 | -0.005702 |
Vail Resorts, Inc. | Consumer Discretionary | 221.809841 | 0.203086 | 0.134174 | 1.267857 | 0.195811 |
Nuance Communications, Inc. | Information Technology | 638.474079 | 0.241816 | 0.151574 | 1.207983 | 0.095179 |
Repligen Corporation | Healthcare | 255.814324 | 0.210457 | 0.197374 | 0.937500 | -0.025107 |
Globant S.A. | Information Technology | 175.746339 | 0.169668 | 0.147179 | 1.241071 | -0.028297 |
JOYY Inc. | Communication Services | 0.712293 | 0.054680 | 0.122382 | 1.025735 | -2.524629 |
Santander Consumer USA Holdings Inc. | Financials | 12.545485 | -0.294670 | 0.047224 | 1.146533 | 6.150640 |
Regency Centers Corporation | Real Estate | 162.169055 | 0.025364 | 0.218645 | 1.025210 | 0.520654 |
Post Holdings, Inc. | Consumer Staples | 8049.723862 | 0.840637 | 0.050039 | 0.713760 | 1.006789 |
Polaris Inc. | Consumer Discretionary | 272.459252 | 0.089551 | 0.079538 | 2.051995 | 0.172279 |
RealPage, Inc. | Information Technology | 160.502484 | 0.168698 | 0.163460 | 1.200105 | 0.083945 |
Healthcare Trust of America, Inc. | Real Estate | 174.044302 | 0.134457 | 0.107783 | 0.570378 | 0.518029 |
East West Bancorp, Inc. | Financials | 14.698616 | -0.327636 | 0.112202 | 1.983193 | -0.455557 |
Jefferies Financial Group Inc. | Financials | 8.585564 | -0.017036 | 0.107537 | 1.452205 | 3.113440 |
Planet Fitness, Inc. | Consumer Discretionary | 1145.056270 | 0.080519 | 0.200777 | 1.278886 | 0.224182 |
OneMain Holdings, Inc. | Financials | 10.945856 | -0.015186 | 0.196724 | 2.301995 | 2.253179 |
Euronet Worldwide, Inc. | Information Technology | 231.647933 | 0.049857 | 0.129901 | 1.603991 | -0.015695 |
DouYu International Holdings Limited | Communication Services | 4.487869 | 0.598024 | 0.149220 | 0.000000 | -2.052381 |
Apartment Investment and Management Company | Real Estate | 4.815010 | -0.262892 | 0.014636 | 0.504201 | 5.617434 |
51job, Inc. | Industrials | 4.695607 | 0.142755 | 0.367819 | 0.773634 | -2.131473 |
TFS Financial Corporation | Financials | 63.557578 | -0.303970 | 0.065285 | 0.473214 | 0.582566 |
Silicon Laboratories Inc. | Information Technology | 440.317690 | 0.246578 | 0.149803 | 1.061449 | -0.027344 |
Futu Holdings Limited | Financials | 11.461593 | 0.375405 | 0.399249 | 0.000000 | 3.088148 |
Cullen/Frost Bankers, Inc. | Financials | 17.665053 | -0.313444 | 0.064631 | 1.550420 | -0.823118 |
CyberArk Software Ltd. | Information Technology | 2150.226566 | 0.013283 | 0.112798 | 1.372899 | -0.099157 |
Inovalon Holdings, Inc. | Healthcare | 675.318101 | 0.385118 | 0.179341 | 0.631302 | 0.231589 |
New York Community Bancorp, Inc. | Financials | 12.140423 | -0.157636 | 0.140916 | 1.053046 | 2.825336 |
PennyMac Financial Services, Inc. | Financials | 3.108653 | -0.006878 | 0.183363 | 1.210084 | 6.173266 |
Affiliated Managers Group, Inc. | Financials | 45.356890 | 0.487594 | 0.332176 | 1.448004 | 0.312858 |
Vicor Corporation | Industrials | 554.240278 | 0.286143 | 0.239396 | 0.838235 | -0.044756 |
STAAR Surgical Company | Healthcare | 459.804971 | 0.204366 | 0.187577 | 1.080882 | -0.028797 |
Quaker Chemical Corporation | Materials | 758.979339 | 0.208738 | 0.101672 | 1.521008 | 0.158539 |
H&R Block, Inc. | Consumer Discretionary | 15.135832 | 0.547058 | 0.140034 | 0.717436 | 0.603042 |
Momo Inc. | Communication Services | 1.284078 | -0.018558 | 0.135815 | 1.501050 | -1.913612 |
Verint Systems Inc. | Information Technology | 266.448316 | 0.273125 | 0.189768 | 0.882878 | 0.054453 |
Air Lease Corporation | Industrials | 9.110599 | 0.098506 | 0.374975 | 2.205357 | 2.695674 |
Spirit Realty Capital, Inc. | Real Estate | 1840.574712 | -0.017255 | 0.185119 | 1.200630 | 0.592913 |
Blackstone Mortgage Trust, Inc. | Financials | 29.168555 | -0.232884 | 0.265245 | 1.353991 | 3.194494 |
Hudson Pacific Properties, Inc. | Real Estate | 153.618415 | 0.076131 | 0.075570 | 1.008928 | 0.885869 |
Crane Co. | Industrials | 219.922813 | 0.133795 | 0.111302 | 1.642331 | 0.156368 |
W. R. Grace & Co. | Materials | 293.805810 | 0.101765 | 0.141941 | 1.490021 | 0.450699 |
Digital Turbine, Inc. | Information Technology | 177.187460 | 0.269344 | 0.306532 | 2.402310 | -0.001506 |
J & J Snack Foods Corp. | Consumer Staples | 159.774902 | 0.167970 | 0.069604 | 0.579306 | -0.073952 |
Nelnet, Inc. | Financials | 17.195040 | -0.113014 | 0.078918 | 0.694327 | 6.956524 |
Kratos Defense & Security Solutions, Inc. | Industrials | 765.998239 | 0.268329 | 0.089397 | 1.075630 | 0.002872 |
ABM Industries Incorporated | Industrials | 9263.878107 | 0.958556 | 0.024766 | 1.237920 | 0.184408 |
Applied Industrial Technologies, Inc. | Industrials | 163.346657 | 0.236222 | 0.051580 | 1.462184 | 0.179402 |
CONMED Corporation | Healthcare | 10745.372434 | 0.149241 | 0.090452 | 1.554621 | 0.221933 |
Texas Capital Bancshares, Inc. | Financials | 42.043347 | -0.314522 | 0.059117 | 2.040966 | -2.114025 |
Trinity Industries, Inc. | Industrials | 2165.264179 | 0.041242 | 0.059529 | 1.543592 | 1.526126 |
Alliance Data Systems Corporation | Information Technology | 12.230037 | 0.040894 | 0.144305 | 2.582983 | 3.966767 |
Dana Incorporated | Consumer Discretionary | 233.766139 | 0.087559 | 0.045863 | 2.612920 | 0.660917 |
Mr. Cooper Group Inc. | Financials | 4.745800 | -0.084737 | 0.092248 | 1.561449 | 5.742193 |
Moog Inc. | Industrials | 293.815571 | 0.365227 | 0.057571 | 1.549894 | 0.345789 |
Chimera Investment Corporation | Financials | 32.978382 | -0.249461 | 0.217903 | 1.006302 | 5.614781 |
Brinker International, Inc. | Consumer Discretionary | 139.881608 | 0.238843 | 0.047800 | 2.672794 | 0.806977 |
Rogers Corporation | Information Technology | 537.862982 | 0.127355 | 0.141970 | 1.819852 | -0.036268 |
Goosehead Insurance, Inc | Financials | 419.744708 | 0.354969 | 0.193873 | 0.619425 | 0.026563 |
360 DigiTech, Inc. | Financials | 0.706985 | 0.110954 | 0.337680 | 0.708786 | -2.398706 |
Kennedy-Wilson Holdings, Inc. | Real Estate | 26.447361 | 0.194040 | 1.234208 | 1.246323 | 1.735967 |
Korn Ferry | Industrials | 165.150541 | 0.069940 | 0.089708 | 1.610819 | 0.008606 |
Easterly Government Properties, Inc. | Real Estate | 162.517737 | 0.143134 | 0.077131 | 0.445903 | 0.504803 |
Verra Mobility Corporation | Information Technology | 293.971912 | 0.238901 | 0.385125 | 1.689947 | 0.352871 |
Celsius Holdings, Inc. | Consumer Staples | 747.532815 | 0.168234 | 0.109054 | 2.172794 | -0.009906 |
PennyMac Mortgage Investment Trust | Financials | 61.684033 | -0.033100 | 0.348720 | 1.171218 | 4.255587 |
Navient Corporation | Financials | 5.219577 | -0.312567 | 0.158355 | 1.830357 | 40.707549 |
BancFirst Corporation | Financials | 21.139351 | -0.306531 | 0.075616 | 1.407563 | -0.855761 |
Vital Farms, Inc. | Consumer Staples | 198.802568 | 0.334608 | 0.063634 | 0.000000 | -0.099120 |
Broadmark Realty Capital Inc. | Financials | 13.009910 | 0.430792 | 1.053596 | 0.000000 | -0.125581 |
Glu Mobile Inc. | Communication Services | 202.020533 | 0.393113 | 0.250600 | 0.831932 | -0.186857 |
Arbor Realty Trust, Inc. | Financials | 16.151109 | -0.342594 | 0.314844 | 1.863970 | 2.811448 |
World Fuel Services Corporation | Energy | 12.281043 | 0.484571 | 0.071972 | 1.246848 | 0.071008 |
Apollo Commercial Real Estate Finance, Inc. | Financials | 29.734538 | -0.009752 | 0.673072 | 1.324579 | 2.548829 |
Gray Television, Inc. | Communication Services | 5.959170 | 0.143468 | 0.128964 | 2.061449 | 1.977744 |
Encore Capital Group, Inc. | Financials | 4.636604 | 0.084569 | 0.204575 | 1.608718 | 3.060390 |
The Pennant Group, Inc. | Healthcare | 223.346359 | 0.436328 | 0.131793 | 0.000000 | 0.171849 |
Kaman Corporation | Industrials | 1432.912047 | 0.055077 | 0.090850 | 1.259453 | 0.091401 |
Enerpac Tool Group Corp. | Industrials | 444.916874 | 0.266887 | 0.086804 | 1.398634 | 0.067732 |
StoneX Group Inc. | Financials | 7.325914 | 0.134682 | 0.004449 | 1.237394 | -2.462987 |
The GEO Group, Inc. | Real Estate | 7.594080 | 0.021620 | 0.067978 | 0.872899 | 2.610962 |
Empire State Realty Trust, Inc. | Real Estate | 352.740898 | 0.083937 | 0.167861 | 1.248424 | 1.030238 |
KKR Real Estate Finance Trust Inc. | Financials | 20.091090 | 0.093110 | 0.913182 | 0.717959 | 3.826082 |
OneSpan Inc. | Information Technology | 508.315408 | 0.160949 | 0.114122 | 0.657037 | -0.106090 |
CEVA, Inc. | Information Technology | 25745.355250 | 0.802418 | 0.098178 | 1.185924 | -0.106221 |
Ladder Capital Corp | Financials | 31.747818 | 0.025390 | 0.385591 | 2.379201 | 3.189546 |
Textainer Group Holdings Limited | Industrials | 17.635944 | 0.111696 | 0.176426 | 1.580357 | 3.573168 |
Brookfield Property REIT Inc. | Real Estate | 3.160712 | 0.009887 | 0.121736 | 1.404560 | 25.971131 |
GrowGeneration Corp. | Consumer Discretionary | 809.353696 | 0.456916 | 0.119451 | 3.109704 | -0.017205 |
Lantheus Holdings, Inc. | Healthcare | 2710.954416 | 0.123571 | 0.131269 | 1.304621 | 0.129804 |
H&E Equipment Services, Inc. | Industrials | 298.247411 | 0.092664 | 0.121501 | 2.391806 | 0.980952 |
OFG Bancorp | Financials | 20.976006 | -0.062395 | 0.083554 | 1.637605 | -2.039222 |
CoreCivic, Inc. | Real Estate | 6.675864 | 0.053729 | 0.135858 | 1.274159 | 2.243393 |
Federal Agricultural Mortgage Corporation | Financials | 8.112086 | -0.234719 | 0.156564 | 1.099789 | 25.464763 |
Ready Capital Corporation | Financials | 16.785361 | -0.138081 | 0.145104 | 1.135958 | 6.452341 |
Banc of California, Inc. | Financials | 169.877246 | 0.031302 | 0.048239 | 1.855567 | 0.530683 |
Cowen Inc. | Financials | 5.865177 | 0.223161 | 0.107700 | 1.501050 | -1.714701 |
Independence Holding Company | Financials | 153.504869 | 0.038060 | 0.038884 | 0.713760 | -0.034722 |
Byline Bancorp, Inc. | Financials | 15.225538 | -0.316065 | 0.049298 | 1.399270 | 1.153092 |
Banco Latinoamericano de Comercio Exterior, S.A | Financials | 9.353085 | 0.055819 | 0.681126 | 1.431197 | 0.963452 |
CrossFirst Bankshares, Inc. | Financials | 166.033851 | -0.112332 | 0.085429 | 0.000000 | 0.200957 |
CAI International, Inc. | Industrials | 22.486914 | 0.148288 | 0.344705 | 1.753151 | 3.684444 |
Heritage Commerce Corp | Financials | 20.079816 | -0.309449 | 0.081675 | 1.238445 | -1.498694 |
WideOpenWest, Inc. | Communication Services | 51.772992 | 0.066138 | 0.056696 | 2.121111 | 2.439644 |
Customers Bancorp, Inc. | Financials | 6.739674 | -0.201747 | 0.094582 | 1.556722 | 9.121123 |
Dime Community Bancshares, Inc. | Financials | 13.242787 | 0.016384 | 0.160215 | 1.113445 | 2.110544 |
Brigham Minerals, Inc. | Energy | 439.092310 | 0.280175 | 0.458745 | 0.000000 | -0.005515 |
ShotSpotter, Inc. | Information Technology | 183.596527 | 0.236514 | 0.146619 | 1.529387 | -0.054738 |
Orchid Island Capital, Inc. | Financials | 85.217282 | 0.116008 | 1.392439 | 1.353466 | 8.541720 |
GreenSky, Inc. | Information Technology | 94.385190 | 0.183484 | 0.179217 | 1.684817 | 2.229329 |
Flushing Financial Corporation | Financials | 12.017677 | -0.114548 | 0.106463 | 0.874474 | 2.577398 |
Amalgamated Bank | Financials | 10.543987 | -0.241360 | 0.074361 | 0.934883 | -1.690125 |
EVI Industries, Inc. | Industrials | 604.358309 | 0.103463 | 0.006639 | 1.044642 | 0.051524 |
CNB Financial Corporation | Financials | 11.078508 | -0.311182 | 0.083196 | 0.950630 | -0.757662 |
IBEX Limited | Industrials | 175.676546 | 0.186539 | 0.069038 | 0.000000 | 0.100618 |
Global Water Resources, Inc. | Utilities | 333.423098 | 0.087818 | 0.093889 | 0.590481 | 0.260442 |
GWG Holdings, Inc. | Financials | 5.710270 | 0.139139 | 1.941181 | -0.213235 | 7.577290 |
Metropolitan Bank Holding Corp. | Financials | 9.048492 | -0.055528 | 0.131344 | 1.274883 | -2.149825 |
A-Mark Precious Metals, Inc. | Financials | 3.754298 | 0.010691 | 0.007043 | -0.453256 | 2.702594 |
Orion Energy Systems, Inc. | Industrials | 228.903924 | 0.146702 | 0.164330 | 2.619222 | -0.001672 |
Alpha Pro Tech, Ltd. | Industrials | 8.179425 | 0.371954 | 0.289285 | -1.269432 | -0.135723 |
Gold Resource Corporation | Materials | 837.443231 | 0.095632 | 0.117449 | 1.922794 | -0.139824 |
Great Ajax Corp. | Financials | 10.873657 | 0.011348 | 0.397006 | 1.724789 | 4.167232 |
Overseas Shipholding Group, Inc. | Energy | 4.582406 | 0.004951 | 0.025634 | 0.344537 | 3.305226 |
First Business Financial Services, Inc. | Financials | 10.224774 | -0.248875 | 0.060277 | 1.070378 | 2.803202 |
PCB Bancorp | Financials | 10.964238 | -0.331828 | 0.059268 | 0.770483 | -0.747264 |
Willis Lease Finance Corporation | Industrials | 11.379067 | 0.041128 | 0.161357 | 1.225840 | 6.851299 |
Select Bancorp, Inc. | Financials | 24.861609 | -0.260696 | 0.044938 | 0.826680 | -1.193999 |
BayCom Corp | Financials | 13.069827 | -0.296532 | 0.074006 | 1.012079 | -1.298867 |
Intevac, Inc. | Information Technology | 34.864770 | 0.625030 | 0.769693 | 0.964810 | -0.196762 |
Mesa Air Group, Inc. | Industrials | 8.370785 | 0.106898 | 0.075282 | 3.127155 | 3.213603 |
BankFinancial Corporation | Financials | 13.423641 | -0.002157 | 0.184443 | 0.582457 | -3.339265 |
Meridian Corporation | Financials | 6.368735 | -0.045533 | 0.093215 | 0.476823 | 2.386771 |
Sharps Compliance Corp. | Healthcare | 145.646316 | 0.040317 | 0.039380 | 0.002626 | 0.046589 |
StealthGas Inc. | Energy | 8.482289 | 0.367267 | 0.244247 | 1.709558 | 2.890521 |
#outlier removal
current_num_samples = pe_data.shape[0]
pe_data_new = pe_data[cov.predict(X)== 1]
new_num_samples = pe_data_new.shape[0]
print(current_num_samples - new_num_samples, 'companies were removed due to outliers.')
print('The new data shape is', pe_data_new.shape)
157 companies were removed due to outliers.
The new data shape is (1405, 6)
#summary statistics after outlier removal
pe_data_new.describe()
Market_Cap_TTM_Net_Income | Net_Income_Forecast_CAGR_10y | Avg_Net_Income_Margin_Forecast_10y | Beta_5y | Net_Debt_perc_Market_Cap | |
---|---|---|---|---|---|
count | 1405.000000 | 1405.000000 | 1405.000000 | 1405.000000 | 1405.000000 |
mean | 33.827036 | 0.034090 | 0.144781 | 1.059838 | 0.193706 |
std | 27.348860 | 0.111973 | 0.103760 | 0.536433 | 0.403447 |
min | 2.146296 | -0.334833 | -0.038680 | -0.465861 | -1.314781 |
25% | 15.729846 | 0.009369 | 0.074373 | 0.722163 | -0.018924 |
50% | 25.351531 | 0.045684 | 0.114119 | 1.038340 | 0.115458 |
75% | 41.711925 | 0.085432 | 0.187319 | 1.375525 | 0.344100 |
max | 150.911270 | 0.397053 | 0.676412 | 3.187500 | 2.002922 |
sns.histplot(pe_data_new[['Market_Cap_TTM_Net_Income']])
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab43e0f28>
sns.histplot(np.log(pe_data_new[['Market_Cap_TTM_Net_Income']]))
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab4273630>
sns.histplot(pe_data_new[['Net_Income_Forecast_CAGR_10y']])
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab422d6d8>
sns.histplot(pe_data_new[['Avg_Net_Income_Margin_Forecast_10y']])
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab41290f0>
sns.histplot(pe_data_new[['Beta_5y']])
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab406f6a0>
sns.histplot(pe_data_new[['Net_Debt_perc_Market_Cap']])
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab3faeef0>
# split data into train and test splits
from sklearn.model_selection import train_test_split
X = pe_data_new[['Sector', 'Net_Income_Forecast_CAGR_10y', 'Avg_Net_Income_Margin_Forecast_10y', 'Beta_5y', 'Net_Debt_perc_Market_Cap']];
Y = pe_data_new[['Market_Cap_TTM_Net_Income']];
Y['Log_Market_Cap_TTM_Net_Income'] = np.log(Y['Market_Cap_TTM_Net_Income']);
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, shuffle=True, test_size = 0.2, random_state=1);
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
# check training data shape
print('X_train has shape', X_train.shape)
print('y_train has shape', Y_train.shape)
X_train has shape (1124, 5)
y_train has shape (1124, 2)
#check X_train
X_train.head(5)
Sector | Net_Income_Forecast_CAGR_10y | Avg_Net_Income_Margin_Forecast_10y | Beta_5y | Net_Debt_perc_Market_Cap | |
---|---|---|---|---|---|
Name | |||||
PPL Corporation | Utilities | 0.037723 | 0.249385 | 0.746323 | 1.093131 |
Premier, Inc. | Healthcare | -0.055200 | 0.252738 | 0.262079 | 0.119931 |
Shutterstock, Inc. | Consumer Discretionary | 0.230431 | 0.166612 | 1.048844 | -0.132058 |
Intel Corporation | Information Technology | 0.008352 | 0.272094 | 0.734768 | 0.081718 |
AGCO Corporation | Industrials | 0.164077 | 0.049779 | 1.218487 | 0.145613 |
#check y_train
Y_train.head(5)
Market_Cap_TTM_Net_Income | Log_Market_Cap_TTM_Net_Income | |
---|---|---|
Name | ||
PPL Corporation | 13.926587 | 2.633800 |
Premier, Inc. | 16.623611 | 2.810824 |
Shutterstock, Inc. | 51.272719 | 3.937159 |
Intel Corporation | 9.941109 | 2.296679 |
AGCO Corporation | 42.214750 | 3.742770 |
train_data = Y_train.merge(X_train, how='outer', left_index=True, right_index=True)
test_data = Y_test.merge(X_test, how='outer', left_index=True, right_index=True)
train_data.describe().round(2)
Market_Cap_TTM_Net_Income | Log_Market_Cap_TTM_Net_Income | Net_Income_Forecast_CAGR_10y | Avg_Net_Income_Margin_Forecast_10y | Beta_5y | Net_Debt_perc_Market_Cap | |
---|---|---|---|---|---|---|
count | 1124.00 | 1124.00 | 1124.00 | 1124.00 | 1124.00 | 1124.00 |
mean | 33.57 | 3.26 | 0.03 | 0.14 | 1.05 | 0.19 |
std | 26.75 | 0.71 | 0.11 | 0.10 | 0.54 | 0.40 |
min | 2.28 | 0.82 | -0.33 | -0.04 | -0.47 | -1.31 |
25% | 15.78 | 2.76 | 0.01 | 0.07 | 0.71 | -0.02 |
50% | 25.66 | 3.24 | 0.05 | 0.11 | 1.02 | 0.11 |
75% | 41.32 | 3.72 | 0.08 | 0.18 | 1.35 | 0.35 |
max | 150.91 | 5.02 | 0.40 | 0.68 | 3.19 | 2.00 |
# correlation matrix for outcome variable
from string import ascii_letters
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="white")
# Compute the correlation matrix
corr = train_data.drop(columns="Log_Market_Cap_TTM_Net_Income").corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(20, 230, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, annot= True, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab3f81668>
# correlation matrix for log of outcome variable
# Compute the correlation matrix
corr = train_data.drop(columns="Market_Cap_TTM_Net_Income").corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(20, 230, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, annot= True, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab3e9bf60>
sns.set_theme()
sns.jointplot(data=train_data, x="Net_Income_Forecast_CAGR_10y", y="Market_Cap_TTM_Net_Income")
sns.jointplot(data=train_data, x="Net_Income_Forecast_CAGR_10y", y="Log_Market_Cap_TTM_Net_Income")
<seaborn.axisgrid.JointGrid at 0x7f8ab85715f8>
sns.jointplot(data=train_data, x="Avg_Net_Income_Margin_Forecast_10y", y="Market_Cap_TTM_Net_Income")
sns.jointplot(data=train_data, x="Avg_Net_Income_Margin_Forecast_10y", y="Log_Market_Cap_TTM_Net_Income")
<seaborn.axisgrid.JointGrid at 0x7f8ab39284e0>
sns.jointplot(data=train_data, x="Beta_5y", y="Market_Cap_TTM_Net_Income")
sns.jointplot(data=train_data, x="Beta_5y", y="Log_Market_Cap_TTM_Net_Income")
<seaborn.axisgrid.JointGrid at 0x7f8ab35e1588>
sns.jointplot(data=train_data, x="Net_Debt_perc_Market_Cap", y="Market_Cap_TTM_Net_Income")
sns.jointplot(data=train_data, x="Net_Debt_perc_Market_Cap", y="Log_Market_Cap_TTM_Net_Income")
<seaborn.axisgrid.JointGrid at 0x7f8ab3278198>
#ML dependancies
#data preprocessing pipeline to preprocess data table
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
#Grid search cross validation to find best hyper parameters
from sklearn.model_selection import GridSearchCV
#models considered
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import VotingRegressor
#evaluation
from sklearn.model_selection import cross_val_score
#split data back into X and y
X_train = train_data[['Net_Income_Forecast_CAGR_10y', 'Avg_Net_Income_Margin_Forecast_10y', 'Beta_5y', 'Net_Debt_perc_Market_Cap', 'Sector']]
y_train = train_data['Log_Market_Cap_TTM_Net_Income']
X_test = test_data[['Net_Income_Forecast_CAGR_10y', 'Avg_Net_Income_Margin_Forecast_10y', 'Beta_5y', 'Net_Debt_perc_Market_Cap', 'Sector']]
y_test = test_data['Log_Market_Cap_TTM_Net_Income']
# check training and test data shape
print('X_train has shape', X_train.shape)
print('y_train has shape', y_train.shape)
print('X_test has shape', X_test.shape)
print('y_test has shape', y_test.shape)
X_train has shape (1124, 5)
y_train has shape (1124,)
X_test has shape (281, 5)
y_test has shape (281,)
# define categorical columns for transformation
categorical_variables = ['Sector']
# define numerical columns for transformation
numerical_variables = list(X_train.drop(columns=categorical_variables).columns)
# define column transformation for data preprocessing pipeline
column_transformer = ColumnTransformer(transformers=[
('categorical_columns', OneHotEncoder(handle_unknown='ignore'), categorical_variables),
('numeric_columns', StandardScaler(), numerical_variables)])
#ridge regression
grid_values = {'model__alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 200, 300, 400, 500]}
model_1 = Pipeline([('column_transformer', column_transformer),
('model', Ridge(max_iter = 100000))])
# default metric to optimize over grid parameters: accuracy
grid_model_1 = GridSearchCV(model_1, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_1 = grid_model_1.fit(X_train, y_train)
print('Grid best parameter: ', grid_model_1.best_params_)
print('Grid best score: ', grid_model_1.best_score_.round(3))
Grid best parameter: {'model__alpha': 10}
Grid best score: 0.225
#lasso regression
grid_values = {'model__alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 200, 300, 400, 500]}
model_2 = Pipeline([('column_transformer', column_transformer),
('model', Lasso(max_iter = 100000))])
# default metric to optimize over grid parameters: accuracy
grid_model_2 = GridSearchCV(model_2, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_2 = grid_model_2.fit(X_train, y_train)
print('Grid best parameter: ', grid_model_2.best_params_)
print('Grid best score: ', grid_model_2.best_score_.round(3))
Grid best parameter: {'model__alpha': 0.001}
Grid best score: 0.226
#KNeighborsRegressor
grid_values = {'model__n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25]}
model_3 = Pipeline([('column_transformer', column_transformer),
('model', KNeighborsRegressor())])
# default metric to optimize over grid parameters: accuracy
grid_model_3 = GridSearchCV(model_3, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_3 = grid_model_3.fit(X_train, y_train)
print('Grid best parameter: ', grid_model_3.best_params_)
print('Grid best score: ', grid_model_3.best_score_.round(3))
Grid best parameter: {'model__n_neighbors': 17}
Grid best score: 0.267
#SVR linear
grid_values = {'model__C': [0.01, 0.1, 1, 10, 100]}
model_4 = Pipeline([('column_transformer', column_transformer),
('model', SVR(kernel='linear'))])
# default metric to optimize over grid parameters: accuracy
grid_model_4 = GridSearchCV(model_4, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_4 = grid_model_4.fit(X_train, y_train)
print('Grid best parameter: ', grid_model_4.best_params_)
print('Grid best score: ', grid_model_4.best_score_.round(3))
Grid best parameter: {'model__C': 0.1}
Grid best score: 0.224
#SVR rbf
grid_values = {'model__gamma': [0.01, 0.1, 1, 10, 100],
'model__C': [0.01, 0.1, 1, 10, 100]}
model_5 = Pipeline([('column_transformer', column_transformer),
('model', SVR(kernel='rbf'))])
# default metric to optimize over grid parameters: accuracy
grid_model_5 = GridSearchCV(model_5, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_5 = grid_model_5.fit(X_train, y_train)
print('Grid best parameter: ', grid_model_5.best_params_)
print('Grid best score: ', grid_model_5.best_score_.round(3))
Grid best parameter: {'model__C': 1, 'model__gamma': 0.1}
Grid best score: 0.327
#DecisionTreeRegressor
grid_values = {'model__max_depth': [2, 3, 4, 5, 6, 7, 8]}
model_6 = Pipeline([('column_transformer', column_transformer),
('model', DecisionTreeRegressor())])
# default metric to optimize over grid parameters: accuracy
grid_model_6 = GridSearchCV(model_6, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_6 = grid_model_6.fit(X_train, y_train)
print('Grid best parameter: ', grid_model_6.best_params_)
print('Grid best score: ', grid_model_6.best_score_.round(3))
Grid best parameter: {'model__max_depth': 4}
Grid best score: 0.161
#RandomForestRegressor
grid_values = {'model__max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
'model__n_estimators': [50, 75, 100, 125, 150, 200, 250]}
model_7 = Pipeline([('column_transformer', column_transformer),
('model', RandomForestRegressor())])
# default metric to optimize over grid parameters: accuracy
grid_model_7 = GridSearchCV(model_7, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_7 = grid_model_7.fit(X_train, y_train)
print('Grid best parameter: ', grid_model_7.best_params_)
print('Grid best score: ', grid_model_7.best_score_.round(3))
Grid best parameter: {'model__max_depth': 10, 'model__n_estimators': 125}
Grid best score: 0.306
#VotingRegressor
model_8 = VotingRegressor([('ridge', grid_model_1), ('lasso', grid_model_2), ('knn', grid_model_3), ('svr_linear', grid_model_4),
('svr_rbf', grid_model_5), ('tree', grid_model_6), ('rf', grid_model_7)])
model_8 = model_8.fit(X_train, y_train)
cv_score = cross_val_score(model_8, X_train, y_train, cv=5, scoring = 'r2').mean().round(3)
print('CV score', cv_score)
CV score 0.289
#model performance summary on cross validation folds
models = ['Ridge',
'Lasso',
'KNeighborsRegressor',
'SVR linear',
'SVR rbf',
'Decision Tree',
'RandomForestRegressor',
'VotingRegressor']
scores = [grid_model_1.best_score_.round(3),
grid_model_2.best_score_.round(3),
grid_model_3.best_score_.round(3),
grid_model_4.best_score_.round(3),
grid_model_5.best_score_.round(3),
grid_model_6.best_score_.round(3),
grid_model_7.best_score_.round(3),
cv_score.round(3)]
model_scores = pd.DataFrame(data= scores, columns = ['CV_r2'], index = models)
print('The average cross validation performance of all trained models was as below')
model_scores
The average cross validation performance of all trained models was as below
CV_r2 | |
---|---|
Ridge | 0.225 |
Lasso | 0.226 |
KNeighborsRegressor | 0.267 |
SVR linear | 0.224 |
SVR rbf | 0.327 |
Decision Tree | 0.161 |
RandomForestRegressor | 0.306 |
VotingRegressor | 0.289 |
#get the best model as per cross validation performance
print('The model selected with the best Cross Validation performance was:', model_scores.index[np.argmax(model_scores['CV_r2'])])
model_scores.iloc[[np.argmax(model_scores['CV_r2'])]]
The model selected with the best Cross Validation performance was: SVR rbf
CV_r2 | |
---|---|
SVR rbf | 0.327 |
#model dictionary
model_dict = {'Ridge': grid_model_1,
'Lasso': grid_model_2,
'KNeighborsRegressor': grid_model_3,
'SVR linear': grid_model_4,
'SVR rbf': grid_model_5,
'Decision Tree': grid_model_6,
'RandomForestRegressor': grid_model_7,
'VotingRegressor': model_8
}
#test set performance best models
print('The best model reaches below out of sample test set performance:')
model_dict[model_scores.index[np.argmax(model_scores['CV_r2'])]].score(X_test, y_test).round(3)
The best model reaches below out of sample test set performance:
0.362