Enterprise Value to EBITDA Ratio Prediction

A traditional measure of whether a stock is expensive or cheap, is to use a valuation multiple. A valuation multiple relates the market value of an asset relative to a key statistic that is assumed to relate to that value. To be useful, that statistic – whether earnings, cash flow or some other measure – must bear a logical relationship to the market value observed; to be seen, in fact, as the driver of that market value. For more information about valuation multiples, refer to below link: https://en.wikipedia.org/wiki/Valuation_using_multiples#Valuation_multiples

1. Enterprise Value to EBITDA ratio

One of the most widely used enterprise value multiples is the Enterprise Value to Ebitda (Earnings before interest, taxes, depreciation and amortization) multiple, short EV/Ebitda.

This is the ratio between the Enterprise Value of a company, meaning the total sum of the market value of the equity and net debt, divided by the company’s Ebitda.
In this project we will try to predict the EV/Ebitda ratios for the 3000 biggest companies listed on the US stock exchanges.

Before analysing our dataset, let us first try to understand what fundamental financial information about the company should be driving the EV/Ebitda ratio based on Financial Theory.

The present value of any Cash Flow generating asset is the value of all future cash flows the asset will generate over its lifetime, discounted back to today with a risk adjusted interest rate.

In its simplest form, where we expect the cash flow (CF) to grow at the same rate for ever (g), the present value (PV) is calculated with below perpetuity formula.

\[PV = \frac{CF * (1+g)}{(r-g)}\]

When trying to value a company based on its future cash flows, we can either look at it through the lenses of all capital providers (equity and debt providers) and we would use the FCFF (Free Cash Flow to the Firm) as the measure of cash flow (CF), or we could look at it purely as an Equity investor and we would use the FCFE (Free Cash Flow to Equity) as our measure of CF.

The main difference is that the FCFF is the cash flow left over after cash flow needs for operations and Investments are met, but before any debt repayments, whereas the FCFE is after debt payments (cash flow left over for equity investors).

If we want to estimate the value of total Firm directly, we would therefore use the FCFF as the appropriate measure of CF.

As the Enterprise Value (EV) (numerator in the EV/Ebitda ratio), is the aggregate market estimation of the value of the total firm (Equity + debt), we can rewrite the above formula as follows:

\(EV = \frac{FCFF*(1 + g)}{(r - g)}\)

FCFF can be rewritten as:

\(EV = \frac{(EBITDA-Taxes-CAPEX-deltaWC)*(1 + g)}{(r - g)}\)

Re-writing the numerator relative to EBITDA gives us:

\(EV = \frac{EBITDA*(1-\frac{Taxes}{EBITDA}-\frac{CAPEX}{EBITDA}-\frac{deltaWC}{EBITDA})*(1 + g)}{(r - g)}\)

\(EV = \frac{EBITDA*(1-\frac{Taxes}{EBITDA}- \frac{CAPEX + deltaWC}{EBITDA})*(1 + g)}{(r - g)}\)

\(EV = \frac{EBITDA*(1-\frac{Taxes}{EBITDA}- \frac{reinvestment}{EBITDA})*(1 + g)}{(r - g)}\)

Deviding both sides by the EBITDA, gives us the EV to EBITDA ratio:

\(\frac{EV}{EBITDA}= \frac{\frac{EBITDA*(1-\frac{Taxes}{EBITDA}- \frac{reinvestment}{EBITDA})*(1 + g)}{(r - g)}}{EBITDA}\)

\(\frac{EV}{EBITDA} = \frac{(1-\frac{Taxes}{EBITDA}- \frac{reinvestment}{EBITDA})*(1 + g)}{(r - g)}\)

Finally converting the discount rate r to the cost of capital gives us:

\(\frac{EV}{EBITDA} = \frac{(1-\frac{Taxes}{EBITDA}- \frac{reinvestment}{EBITDA})*(1 + g)}{(WACC - g)}\)

This formula gives us some understanding of what we would expect the fundamental drivers of the EV/Ebitda ratio to be. The fundamental drivers of the EV/EBITDA ratio are

A higher expected growth rate g, all else being equal, should lead to a higher EV/EBITDA
A lower reinvestment need, all else being equal, should lead to a higher EV/EBITDA
A lower amount of Taxes, all else being equal, should lead to a higher EV/EBITDA
A lower amount of risk measured by the WACC, all else being equal, should result in a higher EV/EBITDA

We will therefore try to obtain financial metrics that are related to points 1-4, as we would expect these to be partial drivers of the EV/EBITDA ratios seeen in the market.

2. Data Extraction and Pre-Processing

The company data was downloaded from the data provider “Finbox”. For more information, refer to below link: https://finbox.com

We will first import the data and check for any missing values and some basic information. Then we will remove outliers before splitting the data into training and testing sets, and analysing the training set in more detail.

# linear algebra
import numpy as np     

# data processing
import pandas as pd    
#display max columns and rows, to be able to inspect big data tables
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

#data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()

# conntects to google drive if ran in colab
# ignored if ran outside of colab

if 'google.colab' in str(get_ipython()):
  from google.colab import drive
  drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

# load file from google drive path, if ran in colab 
# load file from notebook folder, if ran outside colab

if 'google.colab' in str(get_ipython()):
    path = '/content/drive/My Drive/Colab Notebooks/stock_pricing/'
else:
    path = str()
file = path + 'large_Financials_unlinked.xlsm'

company_data = pd.read_excel(file)

#original data structure
company_data.head(5)

	Ticker	Full_Ticker	Name	Sector	Industry_GICS	Index_Membership	Market_Cap	Net_Debt	EV	EBITDA	Net_Income	Equity	Avg_EBITDA_Margin_Forecast_10y	EBITDA_Forecast_CAGR_10y	Avg_Net_Income_Margin_Forecast_10y	Net_Income_Forecast_CAGR_10y	ROE	Beta_5y	Net_Debt_perc_EV	EV_TTM_EBITDA	Market_Cap_TTM_Net_Income	Market_Cap_BV_Equity	Net_Debt_perc_Market_Cap
0	AAPL	NASDAQGS:AAPL	Apple Inc.	Information Technology	Technology Hardware, Storage & Peripherals	Dow Jones Composite Average, Dow Jones Industr...	2.166836e+06	-69552.0	2.097284e+06	77344.0	57411.0	65339.0	0.28493	0.0381633	0.206377	0.0348267	0.736856	1.283613	-0.033163	27.116315	37.742528	33.162985	-0.032098
1	AMZN	NASDAQGS:AMZN	Amazon.com, Inc.	Consumer Discretionary	Internet & Direct Marketing Retail	Nasdaq 100, Nasdaq Composite, Russell 1000, Ru...	1.565880e+06	28412.0	1.594292e+06	43708.0	17377.0	82775.0	0.183412	0.145604	0.0865852	0.201361	0.249521	1.198004	0.017821	36.475980	90.112226	18.917308	0.018144
2	MSFT	NASDAQGS:MSFT	Microsoft Corporation	Information Technology	Software	Dow Jones Composite Average, Dow Jones Industr...	1.624977e+06	-54733.0	1.570244e+06	68123.0	47496.0	123392.0	0.475806	0.0485504	0.324014	0.048353	0.413993	0.826155	-0.034856	23.050136	34.212932	13.169228	-0.033682
3	GOOG	NASDAQGS:GOOG	Alphabet Inc.	Communication Services	Interactive Media & Services	Nasdaq 100, Nasdaq Composite, Russell 1000, Ru...	1.178252e+06	-105501.0	1.072751e+06	48075.0	35713.0	212920.0	0.381029	0.107043	0.260145	0.110387	0.175111	0.994747	-0.098346	22.314107	32.992235	5.533776	-0.089540
4	FB	NASDAQGS:FB	Facebook, Inc.	Communication Services	Interactive Media & Services	Nasdaq 100, Nasdaq Composite, Russell 1000, Ru...	7.151778e+05	-44476.0	6.707018e+05	30221.0	25276.0	117731.0	0.530226	0.119821	0.351485	0.131987	0.238757	1.182773	-0.066313	22.193237	28.294738	6.074677	-0.062189

2.1 Data Structure and removal of not meaningful samples

#original data shape
original_data_shape = company_data.shape
print('The original data shape is', original_data_shape)

The original data shape is (3171, 24)

#drop duplicates
company_data = company_data.drop_duplicates('Name' , keep='first')
print(original_data_shape[0] - company_data.shape[0], 'companies were removed due to duplication.')
print('The new data shape is', company_data.shape)

21 companies were removed due to duplication.
The new data shape is (3150, 24)

#remove not meaningful samples
current_num_samples = company_data.shape[0]
company_data = company_data[(company_data.EV > 0)]
company_data = company_data[(company_data.Market_Cap > 0)]
company_data = company_data[(company_data.EBITDA > 0)]
company_data = company_data[(company_data.Sector != 'Financials')]
new_num_samples = company_data.shape[0]
print(current_num_samples - new_num_samples, 'companies were removed due to not meaningful data.')
print('The new data shape is', company_data.shape)

1281 companies were removed due to not meaningful data.
The new data shape is (1869, 24)

#filter data for required columns
data = company_data[['Sector', 
                        'Name', 
                        'EV_TTM_EBITDA', 
                        'EBITDA_Forecast_CAGR_10y',
                        'Avg_EBITDA_Margin_Forecast_10y', 
                        'Beta_5y', 
                        'Net_Debt_perc_Market_Cap'
                       ]].set_index('Name')
data.head(5)

	Sector	EV_TTM_EBITDA	EBITDA_Forecast_CAGR_10y	Avg_EBITDA_Margin_Forecast_10y	Beta_5y	Net_Debt_perc_Market_Cap
Name
Apple Inc.	Information Technology	27.116315	0.0381633	0.28493	1.283613	-0.032098
Amazon.com, Inc.	Consumer Discretionary	36.475980	0.145604	0.183412	1.198004	0.018144
Microsoft Corporation	Information Technology	23.050136	0.0485504	0.475806	0.826155	-0.033682
Alphabet Inc.	Communication Services	22.314107	0.107043	0.381029	0.994747	-0.089540
Facebook, Inc.	Communication Services	22.193237	0.119821	0.530226	1.182773	-0.062189

#check column data types
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1869 entries, Apple Inc. to Sequential Brands Group, Inc.
Data columns (total 6 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Sector                          1869 non-null   object 
 1   EV_TTM_EBITDA                   1869 non-null   float64
 2   EBITDA_Forecast_CAGR_10y        1869 non-null   object 
 3   Avg_EBITDA_Margin_Forecast_10y  1869 non-null   object 
 4   Beta_5y                         1869 non-null   float64
 5   Net_Debt_perc_Market_Cap        1869 non-null   float64
dtypes: float64(3), object(3)
memory usage: 102.2+ KB

#ensure columns have the correct data type
data['EV_TTM_EBITDA'] = data['EV_TTM_EBITDA'].apply(pd.to_numeric, errors='coerce')
data['EBITDA_Forecast_CAGR_10y'] = data['EBITDA_Forecast_CAGR_10y'].apply(pd.to_numeric, errors='coerce')
data['Avg_EBITDA_Margin_Forecast_10y'] = data['Avg_EBITDA_Margin_Forecast_10y'].apply(pd.to_numeric, errors='coerce')
data['Beta_5y'] = data['Beta_5y'].apply(pd.to_numeric, errors='coerce')
data['Net_Debt_perc_Market_Cap'] = data['Net_Debt_perc_Market_Cap'].apply(pd.to_numeric, errors='coerce')

data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1869 entries, Apple Inc. to Sequential Brands Group, Inc.
Data columns (total 6 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Sector                          1869 non-null   object 
 1   EV_TTM_EBITDA                   1869 non-null   float64
 2   EBITDA_Forecast_CAGR_10y        1811 non-null   float64
 3   Avg_EBITDA_Margin_Forecast_10y  1864 non-null   float64
 4   Beta_5y                         1869 non-null   float64
 5   Net_Debt_perc_Market_Cap        1869 non-null   float64
dtypes: float64(5), object(1)
memory usage: 102.2+ KB

#check for missing values
data.isna().sum()

Sector                             0
EV_TTM_EBITDA                      0
EBITDA_Forecast_CAGR_10y          58
Avg_EBITDA_Margin_Forecast_10y     5
Beta_5y                            0
Net_Debt_perc_Market_Cap           0
dtype: int64

#remove missing values
current_num_samples = data.shape[0]
data = data.dropna()
data.isna().sum()
new_num_samples = data.shape[0]
print(current_num_samples - new_num_samples, 'companies were removed due to missing data.')
print('The new data shape is', data.shape)

58 companies were removed due to missing data.
The new data shape is (1811, 6)

2.2 Inspection for Outliers

We will look at the individual histograms of all variables, to see if we have potential outliers, and use Gaussian mixture models to later remove these outliers.

2.2.1 Summary Statistics and Histograms

#summary statistics before outlier removal
data.describe()

	EV_TTM_EBITDA	EBITDA_Forecast_CAGR_10y	Avg_EBITDA_Margin_Forecast_10y	Beta_5y	Net_Debt_perc_Market_Cap
count	1811.000000	1811.000000	1811.000000	1811.000000	1811.000000
mean	42.875328	0.062582	0.266711	1.650814	0.525916
std	314.195939	0.083894	0.182540	16.772324	1.345549
min	0.542898	-0.290873	0.000000	-6.314600	-1.025617
25%	10.680256	0.023840	0.130704	0.793330	0.016698
50%	15.715954	0.042193	0.213080	1.182805	0.202935
75%	24.875574	0.079265	0.360770	1.607668	0.592843
max	10586.768907	1.201920	0.844657	714.265241	25.971131

sns.histplot(data[['EV_TTM_EBITDA']])

<matplotlib.axes._subplots.AxesSubplot at 0x7fab9fb889b0>

png

sns.histplot(np.log(data[['EV_TTM_EBITDA']]))

<matplotlib.axes._subplots.AxesSubplot at 0x7fab8b362ac8>

png

sns.histplot(data[['EBITDA_Forecast_CAGR_10y']])

<matplotlib.axes._subplots.AxesSubplot at 0x7fab8b329320>

png

sns.histplot(data[['Avg_EBITDA_Margin_Forecast_10y']])

<matplotlib.axes._subplots.AxesSubplot at 0x7fab8ac92208>

png

sns.histplot(data[['Beta_5y']])

<matplotlib.axes._subplots.AxesSubplot at 0x7fab8af04550>

png

sns.histplot(data[['Net_Debt_perc_Market_Cap']])

<matplotlib.axes._subplots.AxesSubplot at 0x7fab879eca90>

png

2.2.2 Outlier Removal with Gaussian Mixtures

#check for outliers with multivariate Guassian
from sklearn.covariance import EllipticEnvelope

X = data[['EV_TTM_EBITDA', 
             'EBITDA_Forecast_CAGR_10y',
             'Avg_EBITDA_Margin_Forecast_10y',
             'Beta_5y',
             'Net_Debt_perc_Market_Cap']]
cov = EllipticEnvelope(random_state=0, contamination=0.1).fit(X)
# predict returns 1 for an inlier and -1 for an outlier
data[cov.predict(X)== -1]

	Sector	EV_TTM_EBITDA	EBITDA_Forecast_CAGR_10y	Avg_EBITDA_Margin_Forecast_10y	Beta_5y	Net_Debt_perc_Market_Cap
Name
Tesla, Inc.	Consumer Discretionary	200.867330	0.275696	0.249014	2.190651	0.000785
salesforce.com, inc.	Information Technology	95.958312	0.187377	0.302774	1.171743	-0.016569
Zoom Video Communications, Inc.	Information Technology	223.597045	0.529394	0.345674	0.000000	-0.017654
ServiceNow, Inc.	Information Technology	189.199365	0.277811	0.348887	1.148634	-0.007477
Square, Inc.	Information Technology	6866.819030	0.389482	0.090723	2.710084	-0.002075
Autodesk, Inc.	Information Technology	102.571049	0.205378	0.412732	1.440651	0.006867
Atlassian Corporation Plc	Information Technology	947.273970	0.346807	0.292826	0.000000	-0.017579
Veeva Systems Inc.	Healthcare	111.288407	0.143649	0.417909	0.830882	-0.035709
Twitter, Inc.	Communication Services	115.322425	0.128971	0.330907	0.863445	-0.089297
Peloton Interactive, Inc.	Consumer Discretionary	323.071290	0.623983	0.167837	0.000000	-0.031071
Align Technology, Inc.	Healthcare	102.147202	0.116234	0.311033	1.676995	-0.012142
DexCom, Inc.	Healthcare	94.839960	0.186784	0.259330	0.855567	-0.022706
Ford Motor Company	Consumer Discretionary	28.757449	0.028368	0.078969	1.273634	3.316866
The Trade Desk, Inc.	Information Technology	335.262525	0.220880	0.390967	2.571428	-0.006479
Southwest Airlines Co.	Industrials	115.641235	0.087942	0.187911	1.226890	-0.067864
Palo Alto Networks, Inc.	Information Technology	1195.284136	1.005202	0.273581	1.503676	0.006286
Paycom Software, Inc.	Information Technology	99.861863	0.116981	0.433187	1.482668	-0.004270
GoodRx Holdings, Inc.	Healthcare	211.734149	0.164313	0.397327	0.000000	-0.019991
Trip.com Group Limited	Consumer Discretionary	115.003456	0.114943	0.258091	1.330882	1.021974
Horizon Therapeutics Public Limited Company	Healthcare	51.074250	0.358072	0.533816	1.143907	-0.038809
Ryanair Holdings plc	Industrials	84.801017	0.267277	0.424162	1.436159	0.051860
Insulet Corporation	Healthcare	151.722962	0.180835	0.220478	0.753676	0.001678
Warner Music Group Corp.	Communication Services	132.166747	0.248114	0.205784	0.000000	0.164390
Monolithic Power Systems, Inc.	Information Technology	97.925230	0.139450	0.330777	0.940651	-0.031062
BioMarin Pharmaceutical Inc.	Healthcare	419.677324	1.071525	0.183666	0.712710	-0.000060
Ceridian HCM Holding Inc.	Information Technology	162.268184	0.120739	0.233240	1.383878	0.030229
Enphase Energy, Inc.	Information Technology	155.399251	0.233280	0.333227	1.431722	-0.010926
Huazhu Group Limited	Consumer Discretionary	117.613966	0.116765	0.321942	1.774684	2.653895
NovoCure Limited	Healthcare	673.330233	0.378987	0.171717	1.283613	-0.012370
Zynga Inc.	Communication Services	32.612302	0.662850	0.262741	0.199054	-0.002175
Paylocity Holding Corporation	Information Technology	122.439891	0.138803	0.295299	1.358193	-0.005702
Chegg, Inc.	Consumer Discretionary	158.755259	0.257655	0.367920	0.983718	0.024049
Beyond Meat, Inc.	Consumer Staples	10586.768907	0.463142	0.218462	0.000000	-0.019930
Five9, Inc.	Information Technology	1004.697864	0.264004	0.213080	0.521533	-0.010888
Caesars Entertainment, Inc.	Consumer Discretionary	110.387153	0.222447	0.341142	3.143382	1.625932
Alteryx, Inc.	Information Technology	326.493151	0.180040	0.231524	0.844550	0.007510
Repligen Corporation	Healthcare	128.302012	0.144598	0.312485	0.937500	-0.025107
Guidewire Software, Inc.	Information Technology	1042.018740	0.151002	0.071289	1.358193	-0.064000
Melco Resorts & Entertainment Limited	Consumer Discretionary	141.364688	0.085361	0.292891	1.766281	0.489584
The Gap, Inc.	Consumer Discretionary	89.755628	0.018602	0.087069	1.524159	0.606664
ZoomInfo Technologies Inc.	Communication Services	136.150171	0.252624	0.560529	0.000000	0.066545
The AZEK Company Inc.	Industrials	81.996560	0.152255	0.250740	0.000000	0.041419
Brookfield Property Partners L.P.	Real Estate	17.939351	-0.051979	0.598349	1.488970	7.287715
Inphi Corporation	Information Technology	93.741126	0.263641	0.447819	1.063025	0.037075
Under Armour, Inc.	Consumer Discretionary	73.549111	0.012367	0.072856	1.126575	0.145338
Hyatt Hotels Corporation	Consumer Discretionary	1506.762718	0.068638	0.242367	1.530462	0.213329
AppFolio, Inc.	Information Technology	247.296411	0.238792	0.215147	1.155987	-0.026250
Freshpet, Inc.	Consumer Staples	207.883654	0.303361	0.230179	0.842436	-0.014682
Apartment Investment and Management Company	Real Estate	9.799975	0.026424	0.617981	0.504201	5.617434
NeoGenomics, Inc.	Healthcare	283.714836	0.113220	0.170325	0.793592	-0.010414
SailPoint Technologies Holdings, Inc.	Information Technology	165.846646	0.175084	0.067844	2.009701	-0.023138
CyberArk Software Ltd.	Information Technology	235.160410	0.029310	0.149556	1.372899	-0.099157
Jamf Holding Corp.	Information Technology	139.357249	0.139439	0.121253	0.000000	-0.045623
Vicor Corporation	Industrials	238.829168	0.247710	0.317639	0.838235	-0.044756
STAAR Surgical Company	Healthcare	439.745998	0.201741	0.217175	1.080882	-0.028797
Altair Engineering Inc.	Information Technology	159.402321	0.080947	0.102365	1.527488	0.001179
Cleveland-Cliffs Inc.	Materials	38.315691	0.384860	0.224049	2.225840	0.499703
Inari Medical, Inc.	Healthcare	265.655820	0.466521	0.163802	0.000000	-0.040520
Digital Turbine, Inc.	Information Technology	151.043441	0.259552	0.283449	2.402310	-0.001506
Shake Shack Inc.	Consumer Discretionary	188.317447	0.157248	0.168499	1.776260	0.047185
Ping Identity Holding Corp.	Information Technology	992.570819	0.133069	0.164389	0.000000	-0.001415
Avis Budget Group, Inc.	Industrials	150.271779	0.032521	0.101539	2.310399	5.062881
LivaNova PLC	Healthcare	204.573948	0.148263	0.214982	0.799894	0.145708
Sunnova Energy International Inc.	Utilities	176.591888	0.226460	0.480392	0.000000	0.375385
Tower Semiconductor Ltd.	Information Technology	8.777906	0.292268	0.397378	1.361344	-0.128399
Progyny, Inc.	Healthcare	296.877878	0.358589	0.173703	0.000000	-0.026911
Alliance Data Systems Corporation	Information Technology	25.761200	0.035661	0.299326	2.582983	3.966767
Ryman Hospitality Properties, Inc.	Real Estate	89.884980	0.090988	0.307643	1.808823	0.713166
Nordstrom, Inc.	Consumer Discretionary	474.805048	-0.026945	0.072959	2.451155	0.778384
Macy's, Inc.	Consumer Discretionary	556.494924	-0.004716	0.065204	2.013130	1.763889
Commvault Systems, Inc.	Information Technology	83.911322	0.302015	0.203891	0.794642	-0.135830
CommScope Holding Company, Inc.	Information Technology	9.893421	0.020964	0.157286	1.862394	3.127148
Colony Capital, Inc.	Real Estate	18.450260	-0.041691	0.427312	1.987854	2.764061
Pebblebrook Hotel Trust	Real Estate	180.669329	0.074852	0.291862	2.026785	0.981449
Celsius Holdings, Inc.	Consumer Staples	583.619894	1.201920	0.161267	2.172794	-0.009906
MicroStrategy Incorporated	Information Technology	79.723750	0.711108	0.421064	1.589285	0.011132
Livent Corporation	Materials	133.898922	0.101695	0.283130	2.244301	0.071956
Vital Farms, Inc.	Consumer Staples	77.509807	0.301912	0.121291	0.000000	-0.099120
Simulations Plus, Inc.	Healthcare	87.717630	0.090232	0.405204	0.000000	-0.083055
The Cheesecake Factory Incorporated	Consumer Discretionary	104.870916	0.049671	0.083231	1.705882	0.830481
ACM Research, Inc.	Information Technology	87.473199	0.234859	0.295867	0.887531	-0.045061
RLJ Lodging Trust	Real Estate	100.892798	0.077866	0.293877	2.158088	0.757639
Sinclair Broadcast Group, Inc.	Communication Services	8.438616	0.096874	0.274030	1.585084	4.935812
Glu Mobile Inc.	Communication Services	82.522778	0.264340	0.177606	0.831932	-0.186857
iHeartMedia, Inc.	Communication Services	18.589811	0.024124	0.261831	0.000000	3.107772
World Fuel Services Corporation	Energy	7.616104	0.429859	0.108961	1.246848	0.071008
Service Properties Trust	Real Estate	11.875750	0.027736	0.382401	2.444327	3.303842
Endo International plc	Healthcare	24.663469	0.063334	0.451955	1.419117	3.967743
OraSure Technologies, Inc.	Healthcare	198.770158	-0.026472	0.071666	0.043067	-0.254526
The Pennant Group, Inc.	Healthcare	86.514949	0.190000	0.216489	0.000000	0.171849
Heska Corporation	Healthcare	184.539033	0.173527	0.068167	1.644432	-0.017779
Mack-Cali Realty Corporation	Real Estate	44.156339	-0.000763	0.458996	1.069852	2.593549
The Macerich Company	Real Estate	16.148959	-0.023081	0.500574	2.056722	3.142394
Xenia Hotels & Resorts, Inc.	Real Estate	155.338978	0.040133	0.230800	1.632878	0.710745
BioLife Solutions, Inc.	Healthcare	414.902527	0.341834	0.308731	1.546743	-0.070999
CEVA, Inc.	Information Technology	165.494614	0.199883	0.153385	1.185924	-0.106221
iStar Inc.	Real Estate	24.404341	-0.018452	0.304676	0.809348	3.097602
United Natural Foods, Inc.	Consumer Staples	7.048980	0.008955	0.025772	1.503676	3.903265
Quotient Technology Inc.	Consumer Discretionary	234.832794	0.239540	0.129949	0.849264	-0.016882
Textainer Group Holdings Limited	Industrials	10.088555	0.033980	0.762142	1.580357	3.573168
Brookfield Property REIT Inc.	Real Estate	22.618764	0.118648	0.771357	1.404560	25.971131
GrowGeneration Corp.	Consumer Discretionary	387.364456	0.402906	0.151351	3.109704	-0.017205
Fulgent Genetics, Inc.	Healthcare	24.083644	0.421622	0.461197	1.928811	-0.059669
The E.W. Scripps Company	Communication Services	11.059371	0.316445	0.382041	1.949054	1.454701
Tactile Systems Technology, Inc.	Healthcare	256.100331	0.156386	0.152844	1.679960	-0.020267
American Axle & Manufacturing Holdings, Inc.	Consumer Discretionary	6.594480	-0.010659	0.156073	2.699054	2.941159
Diversified Healthcare Trust	Real Estate	11.435733	-0.010319	0.232312	1.361869	3.314837
Oxford Industries, Inc.	Consumer Discretionary	49.112879	0.480186	0.369037	1.602941	0.229790
SFL Corporation Ltd.	Energy	9.941712	-0.011988	0.637073	1.213760	3.109762
Tenneco Inc.	Consumer Discretionary	7.728585	0.026684	0.080744	2.691701	5.297825
Community Health Systems, Inc.	Healthcare	12.587641	0.030153	0.129580	2.107668	12.788417
Amneal Pharmaceuticals, Inc.	Healthcare	9.268181	0.066183	0.240407	1.300768	3.814766
Meredith Corporation	Communication Services	7.629137	-0.009960	0.175145	2.276260	3.244742
Veeco Instruments Inc.	Information Technology	23.006617	0.358484	0.170280	1.425945	0.022544
Inogen, Inc.	Healthcare	137.565001	-0.012810	0.063584	1.046218	-0.194290
Century Aluminum Company	Materials	129.259320	0.276041	0.052513	2.561974	0.214986
Brookdale Senior Living Inc.	Healthcare	13.249867	0.013942	0.108613	1.865546	6.160083
Global Medical REIT Inc.	Real Estate	19.542185	0.105447	0.797919	-6.314600	0.869243
Eastman Kodak Company	Information Technology	7.817920	0.012010	0.086188	5.596113	-0.181916
Surmodics, Inc.	Healthcare	92.875455	0.083029	0.088202	1.000525	-0.093102
RPC, Inc.	Energy	102.662160	0.052706	0.127810	2.137605	-0.131896
Rite Aid Corporation	Consumer Staples	11.920487	-0.009058	0.021498	1.001575	6.300989
The Chefs' Warehouse, Inc.	Consumer Staples	194.246612	0.089962	0.066142	2.455882	0.327948
Scorpio Tankers Inc.	Energy	7.138393	0.104970	0.601183	1.158613	3.916770
Transocean Ltd.	Energy	7.647058	0.010152	0.326109	3.752626	3.505997
Limelight Networks, Inc.	Information Technology	30.314803	0.329181	0.193427	0.724264	-0.020178
CAI International, Inc.	Industrials	22.052699	0.092397	0.782685	1.753151	3.684444
Boston Omaha Corporation	Communication Services	115.570890	0.266829	0.225499	0.499759	-0.166204
AngioDynamics, Inc.	Healthcare	130.395979	0.098856	0.071677	0.886029	-0.011744
FRP Holdings, Inc.	Real Estate	79.580028	-0.002952	0.443529	0.705882	0.098861
Consolidated Communications Holdings, Inc.	Communication Services	5.339124	0.013146	0.411368	1.367647	5.206360
Par Pacific Holdings, Inc.	Energy	147.385671	0.077741	0.068859	2.511554	1.226311
Tejon Ranch Co.	Real Estate	146.641106	-0.034705	0.051047	0.644432	0.035020
Frank's International N.V.	Energy	2804.045761	0.075031	0.109971	1.507352	-0.238425
Caleres, Inc.	Consumer Discretionary	56.935485	0.008801	0.060885	2.473214	1.713944
Carrols Restaurant Group, Inc.	Consumer Discretionary	17.206957	0.066622	0.074508	2.605567	3.835404
BioDelivery Sciences International, Inc.	Healthcare	12.345843	0.310385	0.445107	0.727941	-0.045735
New Senior Investment Group Inc.	Real Estate	15.726062	0.010277	0.328094	2.045693	3.256503
Conn's, Inc.	Consumer Discretionary	103.528237	0.033285	0.083218	2.596113	2.697070
Liquidity Services, Inc.	Consumer Discretionary	1277.504624	0.545650	0.101182	1.204831	-0.100726
CorePoint Lodging Inc.	Real Estate	139.646074	-0.019484	0.130601	1.981101	1.836362
Cooper-Standard Holdings Inc.	Consumer Discretionary	638.199180	0.304758	0.208089	3.209033	0.970937
Preferred Apartment Communities, Inc.	Real Estate	28.070785	0.027324	0.569761	1.124474	7.590762
Revlon, Inc.	Consumer Staples	43.443358	0.073722	0.136836	2.716386	5.985659
CalAmp Corp.	Information Technology	107.211908	0.116129	0.116113	2.485294	0.331036
Retail Value Inc.	Real Estate	6.910010	-0.290873	0.198649	1.891162	1.162256
GTT Communications, Inc.	Information Technology	10.118019	-0.007109	0.222752	1.641281	18.698159
NN, Inc.	Industrials	12.016967	-0.027208	0.142701	3.349264	3.023795
AXT, Inc.	Information Technology	128.340641	0.164417	0.167744	2.296218	-0.044102
Berry Corporation	Energy	2.877859	-0.004759	0.358621	714.265241	0.882410
QEP Resources, Inc.	Energy	3.624731	0.041161	0.650404	5.264180	2.182623
DASAN Zhone Solutions, Inc.	Information Technology	297.851164	0.255972	0.072003	1.317226	0.104923
Orion Energy Systems, Inc.	Industrials	90.937642	0.143926	0.194833	2.619222	-0.001672
Nesco Holdings, Inc.	Industrials	35.011502	0.146010	0.448028	0.000000	2.149199
Alpha Pro Tech, Ltd.	Industrials	5.972842	0.387251	0.390266	-1.269432	-0.135723
Entercom Communications Corp.	Communication Services	18.029148	0.029577	0.230234	1.915966	4.823862
Bluerock Residential Growth REIT, Inc.	Real Estate	14.246728	0.027543	0.567973	1.065126	4.926919
Nabors Industries Ltd.	Energy	5.798961	-0.010064	0.261313	3.790966	5.698193
Callon Petroleum Company	Energy	5.974019	0.080393	0.636032	3.409138	5.115067
Hersha Hospitality Trust	Real Estate	100.567386	0.048996	0.276029	2.759978	4.092159
Atlantic Power Corporation	Utilities	9.472282	0.046711	0.679425	0.360396	2.956992
Gannett Co., Inc.	Communication Services	6.551396	0.078796	0.126688	2.923844	3.326580
Overseas Shipholding Group, Inc.	Energy	8.496538	0.001550	0.197073	0.344537	3.305226
SM Energy Company	Energy	4.101045	0.047311	0.675687	6.610819	2.003005
National CineMedia, Inc.	Communication Services	35.545151	0.073656	0.417918	1.898634	2.835588
Five Star Senior Living Inc.	Healthcare	4.455550	0.383867	0.019542	1.093487	-0.330984
Willis Lease Finance Corporation	Industrials	7.528340	0.041371	0.641146	1.225840	6.851299
Exterran Corporation	Energy	4.182844	-0.011508	0.206908	1.018382	3.157806
Harrow Health, Inc.	Healthcare	167.227700	0.356905	0.262504	1.081932	-0.042801
Quad/Graphics, Inc.	Industrials	4.559083	0.030267	0.117657	2.094537	4.614893
Penn Virginia Corporation	Energy	2.359374	0.037113	0.684470	3.649320	2.498958
Mesa Air Group, Inc.	Industrials	12.319720	0.147644	0.333299	3.127155	3.213603
Washington Prime Group Inc.	Real Estate	14.527547	0.027008	0.707422	1.976365	16.784415
HC2 Holdings, Inc.	Industrials	18.908519	-0.162321	0.020053	2.339810	1.832551
Sharps Compliance Corp.	Healthcare	142.056343	0.108284	0.085569	0.002626	0.046589
Clipper Realty Inc.	Real Estate	21.807713	0.057809	0.504501	1.006648	7.673637
Laredo Petroleum, Inc.	Energy	3.594123	0.005052	0.641185	4.738970	3.974701
Safe Bulkers, Inc.	Industrials	11.612496	0.074137	0.549133	0.952205	2.823696
R. R. Donnelley & Sons Company	Industrials	5.501899	-0.015650	0.075827	2.775735	10.644258
StealthGas Inc.	Energy	6.583602	0.053326	0.545212	1.709558	2.890521
CSI Compressco LP	Energy	7.175938	-0.001477	0.385766	2.086659	12.732358
Sequential Brands Group, Inc.	Consumer Discretionary	10.014852	0.132819	0.794679	1.454831	18.898905

#outlier removal
current_num_samples = data.shape[0]
data_new = data[cov.predict(X)== 1]
new_num_samples = data_new.shape[0]
print(current_num_samples - new_num_samples, 'companies were removed due to outliers.')
print('The new data shape is', data_new.shape)

182 companies were removed due to outliers.
The new data shape is (1629, 6)

2.2.3 Summary Statistics and Histograms after outlier removal

#summary statistics after outlier removal
data_new.describe()

	EV_TTM_EBITDA	EBITDA_Forecast_CAGR_10y	Avg_EBITDA_Margin_Forecast_10y	Beta_5y	Net_Debt_perc_Market_Cap
count	1629.000000	1629.000000	1629.000000	1629.000000	1629.000000
mean	18.877218	0.051891	0.265026	1.230406	0.362546
std	12.891314	0.049236	0.182241	0.696487	0.514622
min	0.542898	-0.139129	0.000000	-0.465861	-1.025617
25%	10.568284	0.023252	0.130638	0.782037	0.029264
50%	15.034857	0.040332	0.210164	1.170168	0.203440
75%	22.488791	0.071801	0.357577	1.571428	0.542521
max	79.289256	0.302286	0.844657	4.881302	2.772638

sns.histplot(data_new[['EV_TTM_EBITDA']])

<matplotlib.axes._subplots.AxesSubplot at 0x7fab86f1d748>

png

sns.histplot(np.log(data_new[['EV_TTM_EBITDA']]))

<matplotlib.axes._subplots.AxesSubplot at 0x7fab873cf828>

png

sns.histplot(data_new[['EBITDA_Forecast_CAGR_10y']])

<matplotlib.axes._subplots.AxesSubplot at 0x7fab87131080>

png

sns.histplot(data[['Avg_EBITDA_Margin_Forecast_10y']])

<matplotlib.axes._subplots.AxesSubplot at 0x7fab870eeda0>

png

sns.histplot(data_new[['Beta_5y']])

<matplotlib.axes._subplots.AxesSubplot at 0x7fab8704cef0>

png

sns.histplot(data_new[['Net_Debt_perc_Market_Cap']])

<matplotlib.axes._subplots.AxesSubplot at 0x7fab86e87d30>

png

2.3 Split into training and testing set

# split data into train and test splits
from sklearn.model_selection import train_test_split

X = data_new[['Sector', 'EBITDA_Forecast_CAGR_10y','Avg_EBITDA_Margin_Forecast_10y', 'Beta_5y', 'Net_Debt_perc_Market_Cap']];
Y = data_new[['EV_TTM_EBITDA']];
Y['Log_EV_TTM_EBITDA'] = np.log(Y['EV_TTM_EBITDA']);

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, shuffle=True, test_size = 0.2, random_state=1);

/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:6: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

# check training data shape
print('X_train has shape', X_train.shape)
print('y_train has shape', Y_train.shape)

X_train has shape (1303, 5)
y_train has shape (1303, 2)

#check X_train
X_train.head(5)

	Sector	EBITDA_Forecast_CAGR_10y	Avg_EBITDA_Margin_Forecast_10y	Beta_5y	Net_Debt_perc_Market_Cap
Name
Zimmer Biomet Holdings, Inc.	Healthcare	0.033462	0.332755	1.337184	0.221408
Allegiant Travel Company	Industrials	0.048457	0.278896	1.703256	0.316869
Alpha and Omega Semiconductor Limited	Information Technology	0.149025	0.173992	2.529411	0.066703
Zix Corporation	Information Technology	0.118027	0.237929	1.159138	0.365729
CMC Materials, Inc.	Information Technology	0.003157	0.318449	1.180672	0.152127

#check y_train
Y_train.head(5)

	EV_TTM_EBITDA	Log_EV_TTM_EBITDA
Name
Zimmer Biomet Holdings, Inc.	20.932245	3.041291
Allegiant Travel Company	49.358942	3.899119
Alpha and Omega Semiconductor Limited	22.075878	3.094486
Zix Corporation	14.757365	2.691742
CMC Materials, Inc.	15.474569	2.739198

train_data = Y_train.merge(X_train, how='outer', left_index=True, right_index=True)
test_data = Y_test.merge(X_test, how='outer', left_index=True, right_index=True)

3. Exploratory Data Analysis

3.1 Summary Statistics

train_data.describe().round(2)

	EV_TTM_EBITDA	Log_EV_TTM_EBITDA	EBITDA_Forecast_CAGR_10y	Avg_EBITDA_Margin_Forecast_10y	Beta_5y	Net_Debt_perc_Market_Cap
count	1303.00	1303.00	1303.00	1303.00	1303.00	1303.00
mean	18.95	2.75	0.05	0.26	1.22	0.37
std	13.21	0.63	0.05	0.18	0.69	0.53
min	0.54	-0.61	-0.14	0.00	-0.47	-1.03
25%	10.57	2.36	0.02	0.13	0.77	0.03
50%	15.03	2.71	0.04	0.21	1.17	0.21
75%	22.42	3.11	0.07	0.35	1.56	0.55
max	79.29	4.37	0.28	0.84	4.88	2.77

3.2 Correlation Matrix

# correlation matrix for outcome variable

from string import ascii_letters
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="white")

# Compute the correlation matrix
corr = train_data.drop(columns="Log_EV_TTM_EBITDA").corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(20, 230, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, annot= True, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

<matplotlib.axes._subplots.AxesSubplot at 0x7fab86d66c18>

png

# correlation matrix for log of outcome variable

# Compute the correlation matrix
corr = train_data.drop(columns="EV_TTM_EBITDA").corr()

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(20, 230, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, annot= True, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

<matplotlib.axes._subplots.AxesSubplot at 0x7fab86dcf208>

png

3.3 Pairplots

sns.set_theme()
sns.jointplot(data=train_data, x="EBITDA_Forecast_CAGR_10y", y="EV_TTM_EBITDA")
sns.jointplot(data=train_data, x="EBITDA_Forecast_CAGR_10y", y="Log_EV_TTM_EBITDA")

<seaborn.axisgrid.JointGrid at 0x7fab871d74e0>

png

sns.jointplot(data=train_data, x="Avg_EBITDA_Margin_Forecast_10y", y="EV_TTM_EBITDA")
sns.jointplot(data=train_data, x="Avg_EBITDA_Margin_Forecast_10y", y="Log_EV_TTM_EBITDA")

<seaborn.axisgrid.JointGrid at 0x7fab868d17b8>

png

sns.jointplot(data=train_data, x="Beta_5y", y="EV_TTM_EBITDA")
sns.jointplot(data=train_data, x="Beta_5y", y="Log_EV_TTM_EBITDA")

<seaborn.axisgrid.JointGrid at 0x7fab865decc0>

png

sns.jointplot(data=train_data, x="Net_Debt_perc_Market_Cap", y="EV_TTM_EBITDA")
sns.jointplot(data=train_data, x="Net_Debt_perc_Market_Cap", y="Log_EV_TTM_EBITDA")

<seaborn.axisgrid.JointGrid at 0x7fab86233ac8>

png

4. Machine Learning

#ML dependancies

#data preprocessing pipeline to preprocess data table
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

#Grid search cross validation to find best hyper parameters
from sklearn.model_selection import GridSearchCV

#models considered
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import VotingRegressor

#evaluation
from sklearn.model_selection import cross_val_score

4.1 Pre-Processing Pipeline

#split data back into X and y
X_train = train_data[['EBITDA_Forecast_CAGR_10y','Avg_EBITDA_Margin_Forecast_10y', 'Beta_5y', 'Net_Debt_perc_Market_Cap', 'Sector']]
y_train = train_data['Log_EV_TTM_EBITDA']

X_test = test_data[['EBITDA_Forecast_CAGR_10y','Avg_EBITDA_Margin_Forecast_10y', 'Beta_5y', 'Net_Debt_perc_Market_Cap', 'Sector']]
y_test = test_data['Log_EV_TTM_EBITDA']

# check training and test data shape
print('X_train has shape', X_train.shape)
print('y_train has shape', y_train.shape)
print('X_test has shape', X_test.shape)
print('y_test has shape', y_test.shape)

X_train has shape (1303, 5)
y_train has shape (1303,)
X_test has shape (326, 5)
y_test has shape (326,)

# define categorical columns for transformation
categorical_variables = ['Sector']

# define numerical columns for transformation
numerical_variables = list(X_train.drop(columns=categorical_variables).columns)

# define column transformation for data preprocessing pipeline
column_transformer = ColumnTransformer(transformers=[
                                                 ('categorical_columns', OneHotEncoder(handle_unknown='ignore'), categorical_variables),
                                                 ('numeric_columns', StandardScaler(), numerical_variables)])

4.2 Model Training and Hyper Parameter Tuning

#ridge regression

grid_values = {'model__alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 200, 300, 400, 500]}

model_1 = Pipeline([('column_transformer', column_transformer),
                    ('model', Ridge(max_iter = 100000))])

# default metric to optimize over grid parameters: accuracy
grid_model_1 = GridSearchCV(model_1, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_1 = grid_model_1.fit(X_train, y_train)

print('Grid best parameter: ', grid_model_1.best_params_)
print('Grid best score: ', grid_model_1.best_score_.round(3))

Grid best parameter:  {'model__alpha': 10}
Grid best score:  0.238

#lasso regression

grid_values = {'model__alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 200, 300, 400, 500]}

model_2 = Pipeline([('column_transformer', column_transformer),
                    ('model', Lasso(max_iter = 100000))])

# default metric to optimize over grid parameters: accuracy
grid_model_2 = GridSearchCV(model_2, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_2 = grid_model_2.fit(X_train, y_train)

print('Grid best parameter: ', grid_model_2.best_params_)
print('Grid best score: ', grid_model_2.best_score_.round(3))

Grid best parameter:  {'model__alpha': 0.001}
Grid best score:  0.237

#KNeighborsRegressor

grid_values = {'model__n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25]}

model_3 = Pipeline([('column_transformer', column_transformer),
                    ('model', KNeighborsRegressor())])

# default metric to optimize over grid parameters: accuracy
grid_model_3 = GridSearchCV(model_3, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_3 = grid_model_3.fit(X_train, y_train)

print('Grid best parameter: ', grid_model_3.best_params_)
print('Grid best score: ', grid_model_3.best_score_.round(3))

Grid best parameter:  {'model__n_neighbors': 9}
Grid best score:  0.286

#SVR linear

grid_values = {'model__C': [0.01, 0.1, 1, 10, 100]}

model_4 = Pipeline([('column_transformer', column_transformer),
                    ('model', SVR(kernel='linear'))])

# default metric to optimize over grid parameters: accuracy
grid_model_4 = GridSearchCV(model_4, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_4 = grid_model_4.fit(X_train, y_train)

print('Grid best parameter: ', grid_model_4.best_params_)
print('Grid best score: ', grid_model_4.best_score_.round(3))

Grid best parameter:  {'model__C': 0.1}
Grid best score:  0.233

#SVR rbf

grid_values = {'model__gamma': [0.01, 0.1, 1, 10, 100], 
               'model__C': [0.01, 0.1, 1, 10, 100]}

model_5 = Pipeline([('column_transformer', column_transformer),
                    ('model', SVR(kernel='rbf'))])

# default metric to optimize over grid parameters: accuracy
grid_model_5 = GridSearchCV(model_5, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_5 = grid_model_5.fit(X_train, y_train)

print('Grid best parameter: ', grid_model_5.best_params_)
print('Grid best score: ', grid_model_5.best_score_.round(3))

Grid best parameter:  {'model__C': 10, 'model__gamma': 0.1}
Grid best score:  0.339

#DecisionTreeRegressor

grid_values = {'model__max_depth': [2, 3, 4, 5, 6, 7, 8]}

model_6 = Pipeline([('column_transformer', column_transformer),
                    ('model', DecisionTreeRegressor())])

# default metric to optimize over grid parameters: accuracy
grid_model_6 = GridSearchCV(model_6, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_6 = grid_model_6.fit(X_train, y_train)

print('Grid best parameter: ', grid_model_6.best_params_)
print('Grid best score: ', grid_model_6.best_score_.round(3))

Grid best parameter:  {'model__max_depth': 5}
Grid best score:  0.291

#RandomForestRegressor

grid_values = {'model__max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10], 
               'model__n_estimators': [50, 75, 100, 125, 150, 200, 250]}

model_7 = Pipeline([('column_transformer', column_transformer),
                    ('model', RandomForestRegressor())])

# default metric to optimize over grid parameters: accuracy
grid_model_7 = GridSearchCV(model_7, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_7 = grid_model_7.fit(X_train, y_train)

print('Grid best parameter: ', grid_model_7.best_params_)
print('Grid best score: ', grid_model_7.best_score_.round(3))

Grid best parameter:  {'model__max_depth': 9, 'model__n_estimators': 100}
Grid best score:  0.424

#VotingRegressor

model_8 = VotingRegressor([('ridge', grid_model_1), ('lasso', grid_model_2), ('knn', grid_model_3), ('svr_linear', grid_model_4), 
                                     ('svr_rbf', grid_model_5), ('tree', grid_model_6), ('rf', grid_model_7)])
model_8 = model_8.fit(X_train, y_train)
cv_score = cross_val_score(model_8, X_train, y_train, cv=5, scoring = 'r2').mean().round(3)
print('CV score', cv_score)

CV score 0.363

4.3 Model Comparison and Selection

#model performance summary on cross validation folds

models = ['Ridge', 
          'Lasso',
          'KNeighborsRegressor',
          'SVR linear',
          'SVR rbf',
          'Decision Tree',
          'RandomForestRegressor', 
          'VotingRegressor']

scores = [grid_model_1.best_score_.round(3),
          grid_model_2.best_score_.round(3),
          grid_model_3.best_score_.round(3),
          grid_model_4.best_score_.round(3),
          grid_model_5.best_score_.round(3),
          grid_model_6.best_score_.round(3),
          grid_model_7.best_score_.round(3), 
          cv_score.round(3)]

model_scores = pd.DataFrame(data= scores, columns = ['CV_r2'], index = models)
print('The average cross validation performance of all trained models was as below')
model_scores

The average cross validation performance of all trained models was as below

	CV_r2
Ridge	0.238
Lasso	0.237
KNeighborsRegressor	0.286
SVR linear	0.233
SVR rbf	0.339
Decision Tree	0.291
RandomForestRegressor	0.424
VotingRegressor	0.363

#get the best model as per cross validation performance
print('The model selected with the best Cross Validation performance was:', model_scores.index[np.argmax(model_scores['CV_r2'])])
model_scores.iloc[[np.argmax(model_scores['CV_r2'])]]

The model selected with the best Cross Validation performance was: RandomForestRegressor

	CV_r2
RandomForestRegressor	0.424

4.4 Out of Sample Performance Estimation

#model dictionary
model_dict = {'Ridge': grid_model_1,
              'Lasso': grid_model_2,
              'KNeighborsRegressor': grid_model_3,
              'SVR linear': grid_model_4,
              'SVR rbf': grid_model_5,
              'Decision Tree': grid_model_6,
              'RandomForestRegressor': grid_model_7, 
              'VotingRegressor': model_8
              }

#test set performance best models
print('The best model reaches below out of sample test set performance:')
model_dict[model_scores.index[np.argmax(model_scores['CV_r2'])]].score(X_test, y_test).round(3)

The best model reaches below out of sample test set performance:

0.473

ev-to-ebitda-ratio-prediction