Price to Earnings Ratio Prediction

A traditional measure of whether a stock is expensive or cheap, is to use a valuation multiple. A valuation multiple relates the market value of an asset relative to a key statistic that is assumed to relate to that value. To be useful, that statistic – whether earnings, cash flow or some other measure – must bear a logical relationship to the market value observed; to be seen, in fact, as the driver of that market value. For more information about valuation multiples, refer to below link: https://en.wikipedia.org/wiki/Valuation_using_multiples#Valuation_multiples

1. Price to Earnings Ratio

The most famous valuation multiple the PE (Price to earnings) ratio. This is the ratio between the market value of all the shares outstanding for a company, divided by the company’s Net Income. It is often also defined as Price per share divided by earnings per share, which is simply dividing both the Market Capitalization and the Net income by the shares outstanding. In this project we will try to predict the PE ratios for the 3000 biggest companies listed on the US stock exchanges.

Before analysing our dataset, let us first try to understand what fundamental financial information about the company should be driving the PE ratio based on Financial Theory.

The present value of any Cash Flow generating asset is the value of all future cash flows the asset will generate over its lifetime, discounted back to today with a risk adjusted interest rate.

In its simplest form, where we expect the cash flow (CF) to grow at the same rate for ever (g), the present value (PV) is calculated with below perpetuity formula.

\[PV = \frac{CF * (1+g)}{(r-g)}\]

When trying to value a company based on its future cash flows, we can either look at it through the lenses of all capital providers (equity and debt providers) and we would use the FCFF (Free Cash Flow to the Firm) as the measure of cash flow (CF), or we could look at it purely as an Equity investor and we would use the FCFE (Free Cash Flow to Equity) as our measure of CF.

The main difference is that the FCFF is the cash flow left over after cash flow needs for operations and Investments are met, but before any debt repayments, whereas the FCFE is after debt payments (cash flow left over for equity investors).

If we want to estimate the value of Equity directly, we would therefore use the FCFE as the appropriate measure of CF.

As the Market Capitalization (MC) (numerator in the PE ratio), is the aggregate market estimation of the value of Equity, we can rewrite the above formula as follows:

\(MC = \frac{FCFE*(1 + g)}{(r - g)}\)

FCFE can be rewritten as:

\(MC = \frac{(NetIncome+DA-CAPEX-deltaWC-NetDebtRepayment)*(1 + g)}{(r - g)}\)

Re-writing the numerator relative to Net Income gives us:

\(MC = \frac{NetIncome*(1+\frac{DA}{Net Income}-\frac{CAPEX}{Net Income}-\frac{deltaWC}{Net Income}-\frac{NetDebtRepayment}{Net Income})*(1 + g)}{(r - g)}\)

\(MC = \frac{NetIncome*(1-\frac{CAPEX - DA + deltaWC}{Net Income}-\frac{NetDebtRepayment}{Net Income})*(1 + g)}{(r - g)}\)

\(MC = \frac{NetIncome*(1-\frac{reInvestment}{Net Income}-\frac{NetDebtRepayment}{Net Income})*(1 + g)}{(r - g)}\)

Deviding both sides by the Net Income (Earnings), gives us the Price to Earnings ratio:

\(\frac{Price}{Earnings}=\frac{MC}{NetIncome}= \frac{\frac{NetIncome*(1-\frac{reInvestment}{Net Income}-\frac{NetDebtRepayment}{Net Income})*(1 + g)}{(r - g)}}{Net Income}\)

\(\frac{Price}{Earnings}=\frac{MC}{NetIncome}= \frac{(1-\frac{reInvestment}{Net Income}-\frac{NetDebtRepayment}{Net Income})*(1 + g)}{(r - g)}\)

Finally converting the discount rate r to the cost of equity gives us:

\(\frac{Price}{Earnings}=\frac{MC}{NetIncome}= \frac{(1-\frac{reInvestment}{Net Income}-\frac{NetDebtRepayment}{Net Income})*(1 + g)}{((rf+\beta*ERP) - g)}\)

This formula gives us some understanding of what we would expect the fundamental drivers of the PE ratio to be. The fundamental drivers of the P/E ratio are

A higher expected growth rate g, all else being equal, should lead to a higher PE ratio
A lower reinvestment need (Capex - depreciation + delta working capital), all else being equal, should lead to a higher PE ratio
Lower amount of debt needed (Net Debt repayments), all else being equal, should result in a higher PE ratio
A higher risk relative to the overall market, measured as beta in the denominator, should lead to a lower PE ratio

rf (risk free rate traditionally measured as the US T-Bond rate) and ERP (Equity risk premium, basically the premium that equity investor require on top of the risk free rate for investing in the overall stock market) values are the same for all stocks, therefore changes in these overall market metrics will affect all PE ratios, but these will not help differentiation between PE ratios of individual companies, as these are equal for all companies.

We will therefore try to obtain financial metrics that are related to points 1-4, as we would expect these to be partial drivers of the PE ratios seeen in the market.

2. Data Extraction and Pre-Processing

The company data was downloaded from the data provider “Finbox”. For more information, refer to below link: https://finbox.com

We will first import the data and check for any missing values and some basic information. Then we will remove outliers before splitting the data into training and testing sets, and analysing the training set in more detail.

# linear algebra
import numpy as np     

# data processing
import pandas as pd    
#display max columns and rows, to be able to inspect big data tables
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

#data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()

# conntects to google drive if ran in colab
# ignored if ran outside of colab

if 'google.colab' in str(get_ipython()):
  from google.colab import drive
  drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

# load file from google drive path, if ran in colab 
# load file from notebook folder, if ran outside colab

if 'google.colab' in str(get_ipython()):
    path = '/content/drive/My Drive/Colab Notebooks/stock_pricing/'
else:
    path = str()
file = path + 'large_Financials_unlinked.xlsm'

company_data = pd.read_excel(file)

#original data structure
company_data.head(5)

	Ticker	Full_Ticker	Name	Sector	Industry_GICS	Index_Membership	Market_Cap	Net_Debt	EV	EBITDA	Net_Income	Equity	Avg_EBITDA_Margin_Forecast_10y	EBITDA_Forecast_CAGR_10y	Avg_Net_Income_Margin_Forecast_10y	Net_Income_Forecast_CAGR_10y	ROE	Beta_5y	Net_Debt_perc_EV	EV_TTM_EBITDA	Market_Cap_TTM_Net_Income	Market_Cap_BV_Equity	Net_Debt_perc_Market_Cap
0	AAPL	NASDAQGS:AAPL	Apple Inc.	Information Technology	Technology Hardware, Storage & Peripherals	Dow Jones Composite Average, Dow Jones Industr...	2.166836e+06	-69552.0	2.097284e+06	77344.0	57411.0	65339.0	0.28493	0.0381633	0.206377	0.0348267	0.736856	1.283613	-0.033163	27.116315	37.742528	33.162985	-0.032098
1	AMZN	NASDAQGS:AMZN	Amazon.com, Inc.	Consumer Discretionary	Internet & Direct Marketing Retail	Nasdaq 100, Nasdaq Composite, Russell 1000, Ru...	1.565880e+06	28412.0	1.594292e+06	43708.0	17377.0	82775.0	0.183412	0.145604	0.0865852	0.201361	0.249521	1.198004	0.017821	36.475980	90.112226	18.917308	0.018144
2	MSFT	NASDAQGS:MSFT	Microsoft Corporation	Information Technology	Software	Dow Jones Composite Average, Dow Jones Industr...	1.624977e+06	-54733.0	1.570244e+06	68123.0	47496.0	123392.0	0.475806	0.0485504	0.324014	0.048353	0.413993	0.826155	-0.034856	23.050136	34.212932	13.169228	-0.033682
3	GOOG	NASDAQGS:GOOG	Alphabet Inc.	Communication Services	Interactive Media & Services	Nasdaq 100, Nasdaq Composite, Russell 1000, Ru...	1.178252e+06	-105501.0	1.072751e+06	48075.0	35713.0	212920.0	0.381029	0.107043	0.260145	0.110387	0.175111	0.994747	-0.098346	22.314107	32.992235	5.533776	-0.089540
4	FB	NASDAQGS:FB	Facebook, Inc.	Communication Services	Interactive Media & Services	Nasdaq 100, Nasdaq Composite, Russell 1000, Ru...	7.151778e+05	-44476.0	6.707018e+05	30221.0	25276.0	117731.0	0.530226	0.119821	0.351485	0.131987	0.238757	1.182773	-0.066313	22.193237	28.294738	6.074677	-0.062189

2.1 Data Structure and removal of not meaningful samples

#original data shape
original_data_shape = company_data.shape
print('The original data shape is', original_data_shape)

The original data shape is (3171, 24)

#drop duplicates
company_data = company_data.drop_duplicates('Name' , keep='first')
print(original_data_shape[0] - company_data.shape[0], 'companies were removed due to duplication.')
print('The new data shape is', company_data.shape)

21 companies were removed due to duplication.
The new data shape is (3150, 24)

#remove not meaningful samples
current_num_samples = company_data.shape[0]
company_data = company_data[(company_data.Market_Cap > 0)]
company_data = company_data[(company_data.Net_Income > 0)]
new_num_samples = company_data.shape[0]
print(current_num_samples - new_num_samples, 'companies were removed due to not meaningful data.')
print('The new data shape is', company_data.shape)

1317 companies were removed due to not meaningful data.
The new data shape is (1833, 24)

#filter data for required columns
pe_data = company_data[['Sector', 
                        'Name', 
                        'Market_Cap_TTM_Net_Income', 
                        'Net_Income_Forecast_CAGR_10y',
                        'Avg_Net_Income_Margin_Forecast_10y', 
                        'Beta_5y', 
                        'Net_Debt_perc_Market_Cap'
                       ]].set_index('Name')
pe_data.head(5)

	Sector	Market_Cap_TTM_Net_Income	Net_Income_Forecast_CAGR_10y	Avg_Net_Income_Margin_Forecast_10y	Beta_5y	Net_Debt_perc_Market_Cap
Name
Apple Inc.	Information Technology	37.742528	0.0348267	0.206377	1.283613	-0.032098
Amazon.com, Inc.	Consumer Discretionary	90.112226	0.201361	0.0865852	1.198004	0.018144
Microsoft Corporation	Information Technology	34.212932	0.048353	0.324014	0.826155	-0.033682
Alphabet Inc.	Communication Services	32.992235	0.110387	0.260145	0.994747	-0.089540
Facebook, Inc.	Communication Services	28.294738	0.131987	0.351485	1.182773	-0.062189

#check column data types
pe_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1833 entries, Apple Inc. to Smart Sand, Inc.
Data columns (total 6 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Sector                              1833 non-null   object 
 1   Market_Cap_TTM_Net_Income           1833 non-null   float64
 2   Net_Income_Forecast_CAGR_10y        1833 non-null   object 
 3   Avg_Net_Income_Margin_Forecast_10y  1833 non-null   object 
 4   Beta_5y                             1833 non-null   float64
 5   Net_Debt_perc_Market_Cap            1833 non-null   float64
dtypes: float64(3), object(3)
memory usage: 100.2+ KB

#ensure columns have the correct data type
pe_data['Market_Cap_TTM_Net_Income'] = pe_data['Market_Cap_TTM_Net_Income'].apply(pd.to_numeric, errors='coerce')
pe_data['Net_Income_Forecast_CAGR_10y'] = pe_data['Net_Income_Forecast_CAGR_10y'].apply(pd.to_numeric, errors='coerce')
pe_data['Avg_Net_Income_Margin_Forecast_10y'] = pe_data['Avg_Net_Income_Margin_Forecast_10y'].apply(pd.to_numeric, errors='coerce')
pe_data['Beta_5y'] = pe_data['Beta_5y'].apply(pd.to_numeric, errors='coerce')
pe_data['Net_Debt_perc_Market_Cap'] = pe_data['Net_Debt_perc_Market_Cap'].apply(pd.to_numeric, errors='coerce')

pe_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1833 entries, Apple Inc. to Smart Sand, Inc.
Data columns (total 6 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Sector                              1833 non-null   object 
 1   Market_Cap_TTM_Net_Income           1833 non-null   float64
 2   Net_Income_Forecast_CAGR_10y        1562 non-null   float64
 3   Avg_Net_Income_Margin_Forecast_10y  1831 non-null   float64
 4   Beta_5y                             1833 non-null   float64
 5   Net_Debt_perc_Market_Cap            1833 non-null   float64
dtypes: float64(5), object(1)
memory usage: 100.2+ KB

#check for missing values
pe_data.isna().sum()

Sector                                  0
Market_Cap_TTM_Net_Income               0
Net_Income_Forecast_CAGR_10y          271
Avg_Net_Income_Margin_Forecast_10y      2
Beta_5y                                 0
Net_Debt_perc_Market_Cap                0
dtype: int64

#remove missing values
current_num_samples = pe_data.shape[0]
pe_data = pe_data.dropna()
pe_data.isna().sum()
new_num_samples = pe_data.shape[0]
print(current_num_samples - new_num_samples, 'companies were removed due to missing data.')
print('The new data shape is', pe_data.shape)

271 companies were removed due to missing data.
The new data shape is (1562, 6)

2.2 Inspection for Outliers

We will look at the individual histograms of all variables, to see if we have potential outliers, and use Gaussian mixture models to later remove these outliers.

2.2.1 Summary Statistics and Histograms

#summary statistics before outlier removal
pe_data.describe()

	Market_Cap_TTM_Net_Income	Net_Income_Forecast_CAGR_10y	Avg_Net_Income_Margin_Forecast_10y	Beta_5y	Net_Debt_perc_Market_Cap
count	1562.000000	1562.000000	1562.000000	1562.000000	1562.000000
mean	88.873737	0.040772	0.150340	1.074986	0.325119
std	783.710029	0.132170	0.127199	0.559106	1.627939
min	0.706985	-0.342594	-0.038680	-1.269432	-6.197720
25%	15.490751	0.008678	0.075073	0.724133	-0.020051
50%	25.771183	0.046963	0.114935	1.060399	0.116504
75%	45.413055	0.093754	0.188474	1.407431	0.371820
max	25745.355250	0.958556	1.941181	3.187500	40.707549

sns.histplot(pe_data[['Market_Cap_TTM_Net_Income']])

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab7fa1b00>

png

sns.histplot(np.log(pe_data[['Market_Cap_TTM_Net_Income']]))

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab53741d0>

png

sns.histplot(pe_data[['Net_Income_Forecast_CAGR_10y']])

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab88663c8>

png

sns.histplot(pe_data[['Avg_Net_Income_Margin_Forecast_10y']])

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab898e908>

png

sns.histplot(pe_data[['Beta_5y']])

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab4946630>

png

sns.histplot(pe_data[['Net_Debt_perc_Market_Cap']])

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab4ba8c18>

png

2.2.2 Outlier Removal with Gaussian Mixtures

#check for outliers with multivariate Guassian
from sklearn.covariance import EllipticEnvelope

X = pe_data[['Market_Cap_TTM_Net_Income',
 'Net_Income_Forecast_CAGR_10y',
 'Avg_Net_Income_Margin_Forecast_10y',
 'Beta_5y',
 'Net_Debt_perc_Market_Cap']]
cov = EllipticEnvelope(random_state=0, contamination=0.1).fit(X)
# predict returns 1 for an inlier and -1 for an outlier
pe_data[cov.predict(X)== -1]

	Sector	Market_Cap_TTM_Net_Income	Net_Income_Forecast_CAGR_10y	Avg_Net_Income_Margin_Forecast_10y	Beta_5y	Net_Debt_perc_Market_Cap
Name
JPMorgan Chase & Co.	Financials	16.751677	-0.308714	0.076850	1.221113	-1.502979
salesforce.com, inc.	Information Technology	55.467165	0.494147	0.167470	1.171743	-0.016569
Zoom Video Communications, Inc.	Information Technology	238.968203	0.548485	0.294907	0.000000	-0.017654
Citigroup Inc.	Financials	11.521894	-0.249653	0.058769	1.951680	-2.650639
Morgan Stanley	Financials	13.922561	-0.189029	0.076157	1.578256	-2.002738
Square, Inc.	Information Technology	330.866730	0.287034	0.169256	2.710084	-0.002075
The Goldman Sachs Group, Inc.	Financials	15.782209	-0.147498	0.094139	1.498424	-2.264476
Autodesk, Inc.	Information Technology	164.201527	0.267383	0.312567	1.440651	0.006867
Rocket Companies, Inc.	Financials	2.379811	0.133660	0.335283	0.000000	10.451434
Chipotle Mexican Grill, Inc.	Consumer Discretionary	165.322824	0.112505	0.097372	1.310399	0.053183
Match Group, Inc.	Communication Services	462.034079	0.122597	0.289083	0.000000	0.078802
IQVIA Holdings Inc.	Healthcare	206.344869	0.217596	0.096187	1.408088	0.317127
Marriott International, Inc.	Consumer Discretionary	234.753507	0.088121	0.321391	1.828256	0.251279
The Trade Desk, Inc.	Information Technology	267.645982	0.262910	0.426602	2.571428	-0.006479
Kinder Morgan, Inc.	Energy	282.232013	0.024985	0.214225	0.973214	0.988606
Yandex N.V.	Communication Services	0.956178	0.247145	0.151302	1.418067	-6.197720
GoodRx Holdings, Inc.	Healthcare	859.767452	0.275670	0.265819	0.000000	-0.019991
Altice USA, Inc.	Communication Services	180.741036	0.214488	0.077281	1.117957	1.366004
Insulet Corporation	Healthcare	606.399539	0.347631	0.115667	0.753676	0.001678
Royalty Pharma plc	Healthcare	8.626940	-0.018368	0.783479	0.000000	0.189492
Ceridian HCM Holding Inc.	Information Technology	1274.738272	0.106253	0.126695	1.383878	0.030229
PPD, Inc.	Healthcare	176.904951	0.242234	0.081675	0.000000	0.303869
Sunrun Inc.	Industrials	2190.510881	0.389808	0.280920	2.331932	0.148653
Paylocity Holding Corporation	Information Technology	168.748963	0.125200	0.187379	1.358193	-0.005702
Vail Resorts, Inc.	Consumer Discretionary	221.809841	0.203086	0.134174	1.267857	0.195811
Nuance Communications, Inc.	Information Technology	638.474079	0.241816	0.151574	1.207983	0.095179
Repligen Corporation	Healthcare	255.814324	0.210457	0.197374	0.937500	-0.025107
Globant S.A.	Information Technology	175.746339	0.169668	0.147179	1.241071	-0.028297
JOYY Inc.	Communication Services	0.712293	0.054680	0.122382	1.025735	-2.524629
Santander Consumer USA Holdings Inc.	Financials	12.545485	-0.294670	0.047224	1.146533	6.150640
Regency Centers Corporation	Real Estate	162.169055	0.025364	0.218645	1.025210	0.520654
Post Holdings, Inc.	Consumer Staples	8049.723862	0.840637	0.050039	0.713760	1.006789
Polaris Inc.	Consumer Discretionary	272.459252	0.089551	0.079538	2.051995	0.172279
RealPage, Inc.	Information Technology	160.502484	0.168698	0.163460	1.200105	0.083945
Healthcare Trust of America, Inc.	Real Estate	174.044302	0.134457	0.107783	0.570378	0.518029
East West Bancorp, Inc.	Financials	14.698616	-0.327636	0.112202	1.983193	-0.455557
Jefferies Financial Group Inc.	Financials	8.585564	-0.017036	0.107537	1.452205	3.113440
Planet Fitness, Inc.	Consumer Discretionary	1145.056270	0.080519	0.200777	1.278886	0.224182
OneMain Holdings, Inc.	Financials	10.945856	-0.015186	0.196724	2.301995	2.253179
Euronet Worldwide, Inc.	Information Technology	231.647933	0.049857	0.129901	1.603991	-0.015695
DouYu International Holdings Limited	Communication Services	4.487869	0.598024	0.149220	0.000000	-2.052381
Apartment Investment and Management Company	Real Estate	4.815010	-0.262892	0.014636	0.504201	5.617434
51job, Inc.	Industrials	4.695607	0.142755	0.367819	0.773634	-2.131473
TFS Financial Corporation	Financials	63.557578	-0.303970	0.065285	0.473214	0.582566
Silicon Laboratories Inc.	Information Technology	440.317690	0.246578	0.149803	1.061449	-0.027344
Futu Holdings Limited	Financials	11.461593	0.375405	0.399249	0.000000	3.088148
Cullen/Frost Bankers, Inc.	Financials	17.665053	-0.313444	0.064631	1.550420	-0.823118
CyberArk Software Ltd.	Information Technology	2150.226566	0.013283	0.112798	1.372899	-0.099157
Inovalon Holdings, Inc.	Healthcare	675.318101	0.385118	0.179341	0.631302	0.231589
New York Community Bancorp, Inc.	Financials	12.140423	-0.157636	0.140916	1.053046	2.825336
PennyMac Financial Services, Inc.	Financials	3.108653	-0.006878	0.183363	1.210084	6.173266
Affiliated Managers Group, Inc.	Financials	45.356890	0.487594	0.332176	1.448004	0.312858
Vicor Corporation	Industrials	554.240278	0.286143	0.239396	0.838235	-0.044756
STAAR Surgical Company	Healthcare	459.804971	0.204366	0.187577	1.080882	-0.028797
Quaker Chemical Corporation	Materials	758.979339	0.208738	0.101672	1.521008	0.158539
H&R Block, Inc.	Consumer Discretionary	15.135832	0.547058	0.140034	0.717436	0.603042
Momo Inc.	Communication Services	1.284078	-0.018558	0.135815	1.501050	-1.913612
Verint Systems Inc.	Information Technology	266.448316	0.273125	0.189768	0.882878	0.054453
Air Lease Corporation	Industrials	9.110599	0.098506	0.374975	2.205357	2.695674
Spirit Realty Capital, Inc.	Real Estate	1840.574712	-0.017255	0.185119	1.200630	0.592913
Blackstone Mortgage Trust, Inc.	Financials	29.168555	-0.232884	0.265245	1.353991	3.194494
Hudson Pacific Properties, Inc.	Real Estate	153.618415	0.076131	0.075570	1.008928	0.885869
Crane Co.	Industrials	219.922813	0.133795	0.111302	1.642331	0.156368
W. R. Grace & Co.	Materials	293.805810	0.101765	0.141941	1.490021	0.450699
Digital Turbine, Inc.	Information Technology	177.187460	0.269344	0.306532	2.402310	-0.001506
J & J Snack Foods Corp.	Consumer Staples	159.774902	0.167970	0.069604	0.579306	-0.073952
Nelnet, Inc.	Financials	17.195040	-0.113014	0.078918	0.694327	6.956524
Kratos Defense & Security Solutions, Inc.	Industrials	765.998239	0.268329	0.089397	1.075630	0.002872
ABM Industries Incorporated	Industrials	9263.878107	0.958556	0.024766	1.237920	0.184408
Applied Industrial Technologies, Inc.	Industrials	163.346657	0.236222	0.051580	1.462184	0.179402
CONMED Corporation	Healthcare	10745.372434	0.149241	0.090452	1.554621	0.221933
Texas Capital Bancshares, Inc.	Financials	42.043347	-0.314522	0.059117	2.040966	-2.114025
Trinity Industries, Inc.	Industrials	2165.264179	0.041242	0.059529	1.543592	1.526126
Alliance Data Systems Corporation	Information Technology	12.230037	0.040894	0.144305	2.582983	3.966767
Dana Incorporated	Consumer Discretionary	233.766139	0.087559	0.045863	2.612920	0.660917
Mr. Cooper Group Inc.	Financials	4.745800	-0.084737	0.092248	1.561449	5.742193
Moog Inc.	Industrials	293.815571	0.365227	0.057571	1.549894	0.345789
Chimera Investment Corporation	Financials	32.978382	-0.249461	0.217903	1.006302	5.614781
Brinker International, Inc.	Consumer Discretionary	139.881608	0.238843	0.047800	2.672794	0.806977
Rogers Corporation	Information Technology	537.862982	0.127355	0.141970	1.819852	-0.036268
Goosehead Insurance, Inc	Financials	419.744708	0.354969	0.193873	0.619425	0.026563
360 DigiTech, Inc.	Financials	0.706985	0.110954	0.337680	0.708786	-2.398706
Kennedy-Wilson Holdings, Inc.	Real Estate	26.447361	0.194040	1.234208	1.246323	1.735967
Korn Ferry	Industrials	165.150541	0.069940	0.089708	1.610819	0.008606
Easterly Government Properties, Inc.	Real Estate	162.517737	0.143134	0.077131	0.445903	0.504803
Verra Mobility Corporation	Information Technology	293.971912	0.238901	0.385125	1.689947	0.352871
Celsius Holdings, Inc.	Consumer Staples	747.532815	0.168234	0.109054	2.172794	-0.009906
PennyMac Mortgage Investment Trust	Financials	61.684033	-0.033100	0.348720	1.171218	4.255587
Navient Corporation	Financials	5.219577	-0.312567	0.158355	1.830357	40.707549
BancFirst Corporation	Financials	21.139351	-0.306531	0.075616	1.407563	-0.855761
Vital Farms, Inc.	Consumer Staples	198.802568	0.334608	0.063634	0.000000	-0.099120
Broadmark Realty Capital Inc.	Financials	13.009910	0.430792	1.053596	0.000000	-0.125581
Glu Mobile Inc.	Communication Services	202.020533	0.393113	0.250600	0.831932	-0.186857
Arbor Realty Trust, Inc.	Financials	16.151109	-0.342594	0.314844	1.863970	2.811448
World Fuel Services Corporation	Energy	12.281043	0.484571	0.071972	1.246848	0.071008
Apollo Commercial Real Estate Finance, Inc.	Financials	29.734538	-0.009752	0.673072	1.324579	2.548829
Gray Television, Inc.	Communication Services	5.959170	0.143468	0.128964	2.061449	1.977744
Encore Capital Group, Inc.	Financials	4.636604	0.084569	0.204575	1.608718	3.060390
The Pennant Group, Inc.	Healthcare	223.346359	0.436328	0.131793	0.000000	0.171849
Kaman Corporation	Industrials	1432.912047	0.055077	0.090850	1.259453	0.091401
Enerpac Tool Group Corp.	Industrials	444.916874	0.266887	0.086804	1.398634	0.067732
StoneX Group Inc.	Financials	7.325914	0.134682	0.004449	1.237394	-2.462987
The GEO Group, Inc.	Real Estate	7.594080	0.021620	0.067978	0.872899	2.610962
Empire State Realty Trust, Inc.	Real Estate	352.740898	0.083937	0.167861	1.248424	1.030238
KKR Real Estate Finance Trust Inc.	Financials	20.091090	0.093110	0.913182	0.717959	3.826082
OneSpan Inc.	Information Technology	508.315408	0.160949	0.114122	0.657037	-0.106090
CEVA, Inc.	Information Technology	25745.355250	0.802418	0.098178	1.185924	-0.106221
Ladder Capital Corp	Financials	31.747818	0.025390	0.385591	2.379201	3.189546
Textainer Group Holdings Limited	Industrials	17.635944	0.111696	0.176426	1.580357	3.573168
Brookfield Property REIT Inc.	Real Estate	3.160712	0.009887	0.121736	1.404560	25.971131
GrowGeneration Corp.	Consumer Discretionary	809.353696	0.456916	0.119451	3.109704	-0.017205
Lantheus Holdings, Inc.	Healthcare	2710.954416	0.123571	0.131269	1.304621	0.129804
H&E Equipment Services, Inc.	Industrials	298.247411	0.092664	0.121501	2.391806	0.980952
OFG Bancorp	Financials	20.976006	-0.062395	0.083554	1.637605	-2.039222
CoreCivic, Inc.	Real Estate	6.675864	0.053729	0.135858	1.274159	2.243393
Federal Agricultural Mortgage Corporation	Financials	8.112086	-0.234719	0.156564	1.099789	25.464763
Ready Capital Corporation	Financials	16.785361	-0.138081	0.145104	1.135958	6.452341
Banc of California, Inc.	Financials	169.877246	0.031302	0.048239	1.855567	0.530683
Cowen Inc.	Financials	5.865177	0.223161	0.107700	1.501050	-1.714701
Independence Holding Company	Financials	153.504869	0.038060	0.038884	0.713760	-0.034722
Byline Bancorp, Inc.	Financials	15.225538	-0.316065	0.049298	1.399270	1.153092
Banco Latinoamericano de Comercio Exterior, S.A	Financials	9.353085	0.055819	0.681126	1.431197	0.963452
CrossFirst Bankshares, Inc.	Financials	166.033851	-0.112332	0.085429	0.000000	0.200957
CAI International, Inc.	Industrials	22.486914	0.148288	0.344705	1.753151	3.684444
Heritage Commerce Corp	Financials	20.079816	-0.309449	0.081675	1.238445	-1.498694
WideOpenWest, Inc.	Communication Services	51.772992	0.066138	0.056696	2.121111	2.439644
Customers Bancorp, Inc.	Financials	6.739674	-0.201747	0.094582	1.556722	9.121123
Dime Community Bancshares, Inc.	Financials	13.242787	0.016384	0.160215	1.113445	2.110544
Brigham Minerals, Inc.	Energy	439.092310	0.280175	0.458745	0.000000	-0.005515
ShotSpotter, Inc.	Information Technology	183.596527	0.236514	0.146619	1.529387	-0.054738
Orchid Island Capital, Inc.	Financials	85.217282	0.116008	1.392439	1.353466	8.541720
GreenSky, Inc.	Information Technology	94.385190	0.183484	0.179217	1.684817	2.229329
Flushing Financial Corporation	Financials	12.017677	-0.114548	0.106463	0.874474	2.577398
Amalgamated Bank	Financials	10.543987	-0.241360	0.074361	0.934883	-1.690125
EVI Industries, Inc.	Industrials	604.358309	0.103463	0.006639	1.044642	0.051524
CNB Financial Corporation	Financials	11.078508	-0.311182	0.083196	0.950630	-0.757662
IBEX Limited	Industrials	175.676546	0.186539	0.069038	0.000000	0.100618
Global Water Resources, Inc.	Utilities	333.423098	0.087818	0.093889	0.590481	0.260442
GWG Holdings, Inc.	Financials	5.710270	0.139139	1.941181	-0.213235	7.577290
Metropolitan Bank Holding Corp.	Financials	9.048492	-0.055528	0.131344	1.274883	-2.149825
A-Mark Precious Metals, Inc.	Financials	3.754298	0.010691	0.007043	-0.453256	2.702594
Orion Energy Systems, Inc.	Industrials	228.903924	0.146702	0.164330	2.619222	-0.001672
Alpha Pro Tech, Ltd.	Industrials	8.179425	0.371954	0.289285	-1.269432	-0.135723
Gold Resource Corporation	Materials	837.443231	0.095632	0.117449	1.922794	-0.139824
Great Ajax Corp.	Financials	10.873657	0.011348	0.397006	1.724789	4.167232
Overseas Shipholding Group, Inc.	Energy	4.582406	0.004951	0.025634	0.344537	3.305226
First Business Financial Services, Inc.	Financials	10.224774	-0.248875	0.060277	1.070378	2.803202
PCB Bancorp	Financials	10.964238	-0.331828	0.059268	0.770483	-0.747264
Willis Lease Finance Corporation	Industrials	11.379067	0.041128	0.161357	1.225840	6.851299
Select Bancorp, Inc.	Financials	24.861609	-0.260696	0.044938	0.826680	-1.193999
BayCom Corp	Financials	13.069827	-0.296532	0.074006	1.012079	-1.298867
Intevac, Inc.	Information Technology	34.864770	0.625030	0.769693	0.964810	-0.196762
Mesa Air Group, Inc.	Industrials	8.370785	0.106898	0.075282	3.127155	3.213603
BankFinancial Corporation	Financials	13.423641	-0.002157	0.184443	0.582457	-3.339265
Meridian Corporation	Financials	6.368735	-0.045533	0.093215	0.476823	2.386771
Sharps Compliance Corp.	Healthcare	145.646316	0.040317	0.039380	0.002626	0.046589
StealthGas Inc.	Energy	8.482289	0.367267	0.244247	1.709558	2.890521

#outlier removal
current_num_samples = pe_data.shape[0]
pe_data_new = pe_data[cov.predict(X)== 1]
new_num_samples = pe_data_new.shape[0]
print(current_num_samples - new_num_samples, 'companies were removed due to outliers.')
print('The new data shape is', pe_data_new.shape)

157 companies were removed due to outliers.
The new data shape is (1405, 6)

2.2.3 Summary Statistics and Histograms after outlier removal

#summary statistics after outlier removal
pe_data_new.describe()

	Market_Cap_TTM_Net_Income	Net_Income_Forecast_CAGR_10y	Avg_Net_Income_Margin_Forecast_10y	Beta_5y	Net_Debt_perc_Market_Cap
count	1405.000000	1405.000000	1405.000000	1405.000000	1405.000000
mean	33.827036	0.034090	0.144781	1.059838	0.193706
std	27.348860	0.111973	0.103760	0.536433	0.403447
min	2.146296	-0.334833	-0.038680	-0.465861	-1.314781
25%	15.729846	0.009369	0.074373	0.722163	-0.018924
50%	25.351531	0.045684	0.114119	1.038340	0.115458
75%	41.711925	0.085432	0.187319	1.375525	0.344100
max	150.911270	0.397053	0.676412	3.187500	2.002922

sns.histplot(pe_data_new[['Market_Cap_TTM_Net_Income']])

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab43e0f28>

png

sns.histplot(np.log(pe_data_new[['Market_Cap_TTM_Net_Income']]))

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab4273630>

png

sns.histplot(pe_data_new[['Net_Income_Forecast_CAGR_10y']])

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab422d6d8>

png

sns.histplot(pe_data_new[['Avg_Net_Income_Margin_Forecast_10y']])

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab41290f0>

png

sns.histplot(pe_data_new[['Beta_5y']])

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab406f6a0>

png

sns.histplot(pe_data_new[['Net_Debt_perc_Market_Cap']])

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab3faeef0>

png

2.3 Split into training and testing set

# split data into train and test splits
from sklearn.model_selection import train_test_split

X = pe_data_new[['Sector', 'Net_Income_Forecast_CAGR_10y', 'Avg_Net_Income_Margin_Forecast_10y', 'Beta_5y', 'Net_Debt_perc_Market_Cap']];
Y = pe_data_new[['Market_Cap_TTM_Net_Income']];
Y['Log_Market_Cap_TTM_Net_Income'] = np.log(Y['Market_Cap_TTM_Net_Income']);

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, shuffle=True, test_size = 0.2, random_state=1);

/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

# check training data shape
print('X_train has shape', X_train.shape)
print('y_train has shape', Y_train.shape)

X_train has shape (1124, 5)
y_train has shape (1124, 2)

#check X_train
X_train.head(5)

	Sector	Net_Income_Forecast_CAGR_10y	Avg_Net_Income_Margin_Forecast_10y	Beta_5y	Net_Debt_perc_Market_Cap
Name
PPL Corporation	Utilities	0.037723	0.249385	0.746323	1.093131
Premier, Inc.	Healthcare	-0.055200	0.252738	0.262079	0.119931
Shutterstock, Inc.	Consumer Discretionary	0.230431	0.166612	1.048844	-0.132058
Intel Corporation	Information Technology	0.008352	0.272094	0.734768	0.081718
AGCO Corporation	Industrials	0.164077	0.049779	1.218487	0.145613

#check y_train
Y_train.head(5)

	Market_Cap_TTM_Net_Income	Log_Market_Cap_TTM_Net_Income
Name
PPL Corporation	13.926587	2.633800
Premier, Inc.	16.623611	2.810824
Shutterstock, Inc.	51.272719	3.937159
Intel Corporation	9.941109	2.296679
AGCO Corporation	42.214750	3.742770

train_data = Y_train.merge(X_train, how='outer', left_index=True, right_index=True)
test_data = Y_test.merge(X_test, how='outer', left_index=True, right_index=True)

3. Exploratory Data Analysis

3.1 Summary Statistics

train_data.describe().round(2)

	Market_Cap_TTM_Net_Income	Log_Market_Cap_TTM_Net_Income	Net_Income_Forecast_CAGR_10y	Avg_Net_Income_Margin_Forecast_10y	Beta_5y	Net_Debt_perc_Market_Cap
count	1124.00	1124.00	1124.00	1124.00	1124.00	1124.00
mean	33.57	3.26	0.03	0.14	1.05	0.19
std	26.75	0.71	0.11	0.10	0.54	0.40
min	2.28	0.82	-0.33	-0.04	-0.47	-1.31
25%	15.78	2.76	0.01	0.07	0.71	-0.02
50%	25.66	3.24	0.05	0.11	1.02	0.11
75%	41.32	3.72	0.08	0.18	1.35	0.35
max	150.91	5.02	0.40	0.68	3.19	2.00

3.2 Correlation Matrix

# correlation matrix for outcome variable

from string import ascii_letters
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="white")

# Compute the correlation matrix
corr = train_data.drop(columns="Log_Market_Cap_TTM_Net_Income").corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(20, 230, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, annot= True, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab3f81668>

png

# correlation matrix for log of outcome variable

# Compute the correlation matrix
corr = train_data.drop(columns="Market_Cap_TTM_Net_Income").corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(20, 230, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, annot= True, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ab3e9bf60>

png

3.3 Pairplots

sns.set_theme()
sns.jointplot(data=train_data, x="Net_Income_Forecast_CAGR_10y", y="Market_Cap_TTM_Net_Income")
sns.jointplot(data=train_data, x="Net_Income_Forecast_CAGR_10y", y="Log_Market_Cap_TTM_Net_Income")

<seaborn.axisgrid.JointGrid at 0x7f8ab85715f8>

png

sns.jointplot(data=train_data, x="Avg_Net_Income_Margin_Forecast_10y", y="Market_Cap_TTM_Net_Income")
sns.jointplot(data=train_data, x="Avg_Net_Income_Margin_Forecast_10y", y="Log_Market_Cap_TTM_Net_Income")

<seaborn.axisgrid.JointGrid at 0x7f8ab39284e0>

png

sns.jointplot(data=train_data, x="Beta_5y", y="Market_Cap_TTM_Net_Income")
sns.jointplot(data=train_data, x="Beta_5y", y="Log_Market_Cap_TTM_Net_Income")

<seaborn.axisgrid.JointGrid at 0x7f8ab35e1588>

png

sns.jointplot(data=train_data, x="Net_Debt_perc_Market_Cap", y="Market_Cap_TTM_Net_Income")
sns.jointplot(data=train_data, x="Net_Debt_perc_Market_Cap", y="Log_Market_Cap_TTM_Net_Income")

<seaborn.axisgrid.JointGrid at 0x7f8ab3278198>

png

4. Machine Learning

#ML dependancies

#data preprocessing pipeline to preprocess data table
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

#Grid search cross validation to find best hyper parameters
from sklearn.model_selection import GridSearchCV

#models considered
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import VotingRegressor

#evaluation
from sklearn.model_selection import cross_val_score

4.1 Pre-Processing Pipeline

#split data back into X and y
X_train = train_data[['Net_Income_Forecast_CAGR_10y', 'Avg_Net_Income_Margin_Forecast_10y', 'Beta_5y', 'Net_Debt_perc_Market_Cap', 'Sector']]
y_train = train_data['Log_Market_Cap_TTM_Net_Income']

X_test = test_data[['Net_Income_Forecast_CAGR_10y', 'Avg_Net_Income_Margin_Forecast_10y', 'Beta_5y', 'Net_Debt_perc_Market_Cap', 'Sector']]
y_test = test_data['Log_Market_Cap_TTM_Net_Income']

# check training and test data shape
print('X_train has shape', X_train.shape)
print('y_train has shape', y_train.shape)
print('X_test has shape', X_test.shape)
print('y_test has shape', y_test.shape)

X_train has shape (1124, 5)
y_train has shape (1124,)
X_test has shape (281, 5)
y_test has shape (281,)

# define categorical columns for transformation
categorical_variables = ['Sector']

# define numerical columns for transformation
numerical_variables = list(X_train.drop(columns=categorical_variables).columns)

# define column transformation for data preprocessing pipeline
column_transformer = ColumnTransformer(transformers=[
                                                 ('categorical_columns', OneHotEncoder(handle_unknown='ignore'), categorical_variables),
                                                 ('numeric_columns', StandardScaler(), numerical_variables)])

4.2 Model Training and Hyper Parameter Tuning

#ridge regression

grid_values = {'model__alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 200, 300, 400, 500]}

model_1 = Pipeline([('column_transformer', column_transformer),
                    ('model', Ridge(max_iter = 100000))])

# default metric to optimize over grid parameters: accuracy
grid_model_1 = GridSearchCV(model_1, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_1 = grid_model_1.fit(X_train, y_train)

print('Grid best parameter: ', grid_model_1.best_params_)
print('Grid best score: ', grid_model_1.best_score_.round(3))

Grid best parameter:  {'model__alpha': 10}
Grid best score:  0.225

#lasso regression

grid_values = {'model__alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 200, 300, 400, 500]}

model_2 = Pipeline([('column_transformer', column_transformer),
                    ('model', Lasso(max_iter = 100000))])

# default metric to optimize over grid parameters: accuracy
grid_model_2 = GridSearchCV(model_2, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_2 = grid_model_2.fit(X_train, y_train)

print('Grid best parameter: ', grid_model_2.best_params_)
print('Grid best score: ', grid_model_2.best_score_.round(3))

Grid best parameter:  {'model__alpha': 0.001}
Grid best score:  0.226

#KNeighborsRegressor

grid_values = {'model__n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25]}

model_3 = Pipeline([('column_transformer', column_transformer),
                    ('model', KNeighborsRegressor())])

# default metric to optimize over grid parameters: accuracy
grid_model_3 = GridSearchCV(model_3, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_3 = grid_model_3.fit(X_train, y_train)

print('Grid best parameter: ', grid_model_3.best_params_)
print('Grid best score: ', grid_model_3.best_score_.round(3))

Grid best parameter:  {'model__n_neighbors': 17}
Grid best score:  0.267

#SVR linear

grid_values = {'model__C': [0.01, 0.1, 1, 10, 100]}

model_4 = Pipeline([('column_transformer', column_transformer),
                    ('model', SVR(kernel='linear'))])

# default metric to optimize over grid parameters: accuracy
grid_model_4 = GridSearchCV(model_4, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_4 = grid_model_4.fit(X_train, y_train)

print('Grid best parameter: ', grid_model_4.best_params_)
print('Grid best score: ', grid_model_4.best_score_.round(3))

Grid best parameter:  {'model__C': 0.1}
Grid best score:  0.224

#SVR rbf

grid_values = {'model__gamma': [0.01, 0.1, 1, 10, 100], 
               'model__C': [0.01, 0.1, 1, 10, 100]}

model_5 = Pipeline([('column_transformer', column_transformer),
                    ('model', SVR(kernel='rbf'))])

# default metric to optimize over grid parameters: accuracy
grid_model_5 = GridSearchCV(model_5, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_5 = grid_model_5.fit(X_train, y_train)

print('Grid best parameter: ', grid_model_5.best_params_)
print('Grid best score: ', grid_model_5.best_score_.round(3))

Grid best parameter:  {'model__C': 1, 'model__gamma': 0.1}
Grid best score:  0.327

#DecisionTreeRegressor

grid_values = {'model__max_depth': [2, 3, 4, 5, 6, 7, 8]}

model_6 = Pipeline([('column_transformer', column_transformer),
                    ('model', DecisionTreeRegressor())])

# default metric to optimize over grid parameters: accuracy
grid_model_6 = GridSearchCV(model_6, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_6 = grid_model_6.fit(X_train, y_train)

print('Grid best parameter: ', grid_model_6.best_params_)
print('Grid best score: ', grid_model_6.best_score_.round(3))

Grid best parameter:  {'model__max_depth': 4}
Grid best score:  0.161

#RandomForestRegressor

grid_values = {'model__max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10], 
               'model__n_estimators': [50, 75, 100, 125, 150, 200, 250]}

model_7 = Pipeline([('column_transformer', column_transformer),
                    ('model', RandomForestRegressor())])

# default metric to optimize over grid parameters: accuracy
grid_model_7 = GridSearchCV(model_7, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_7 = grid_model_7.fit(X_train, y_train)

print('Grid best parameter: ', grid_model_7.best_params_)
print('Grid best score: ', grid_model_7.best_score_.round(3))

Grid best parameter:  {'model__max_depth': 10, 'model__n_estimators': 125}
Grid best score:  0.306

#VotingRegressor

model_8 = VotingRegressor([('ridge', grid_model_1), ('lasso', grid_model_2), ('knn', grid_model_3), ('svr_linear', grid_model_4), 
                                     ('svr_rbf', grid_model_5), ('tree', grid_model_6), ('rf', grid_model_7)])
model_8 = model_8.fit(X_train, y_train)
cv_score = cross_val_score(model_8, X_train, y_train, cv=5, scoring = 'r2').mean().round(3)
print('CV score', cv_score)

CV score 0.289

4.3 Model Comparison and Selection

#model performance summary on cross validation folds

models = ['Ridge', 
          'Lasso',
          'KNeighborsRegressor',
          'SVR linear',
          'SVR rbf',
          'Decision Tree',
          'RandomForestRegressor', 
          'VotingRegressor']

scores = [grid_model_1.best_score_.round(3),
          grid_model_2.best_score_.round(3),
          grid_model_3.best_score_.round(3),
          grid_model_4.best_score_.round(3),
          grid_model_5.best_score_.round(3),
          grid_model_6.best_score_.round(3),
          grid_model_7.best_score_.round(3), 
          cv_score.round(3)]

model_scores = pd.DataFrame(data= scores, columns = ['CV_r2'], index = models)
print('The average cross validation performance of all trained models was as below')
model_scores

The average cross validation performance of all trained models was as below

	CV_r2
Ridge	0.225
Lasso	0.226
KNeighborsRegressor	0.267
SVR linear	0.224
SVR rbf	0.327
Decision Tree	0.161
RandomForestRegressor	0.306
VotingRegressor	0.289

#get the best model as per cross validation performance
print('The model selected with the best Cross Validation performance was:', model_scores.index[np.argmax(model_scores['CV_r2'])])
model_scores.iloc[[np.argmax(model_scores['CV_r2'])]]

The model selected with the best Cross Validation performance was: SVR rbf

	CV_r2
SVR rbf	0.327

4.4 Out of Sample Performance Estimation

#model dictionary
model_dict = {'Ridge': grid_model_1,
              'Lasso': grid_model_2,
              'KNeighborsRegressor': grid_model_3,
              'SVR linear': grid_model_4,
              'SVR rbf': grid_model_5,
              'Decision Tree': grid_model_6,
              'RandomForestRegressor': grid_model_7, 
              'VotingRegressor': model_8
              }

#test set performance best models
print('The best model reaches below out of sample test set performance:')
model_dict[model_scores.index[np.argmax(model_scores['CV_r2'])]].score(X_test, y_test).round(3)

The best model reaches below out of sample test set performance:

0.362