Enterprise Value to EBITDA Ratio Prediction
A traditional measure of whether a stock is expensive or cheap, is to use a valuation multiple. A valuation multiple relates the market value of an asset relative to a key statistic that is assumed to relate to that value. To be useful, that statistic – whether earnings, cash flow or some other measure – must bear a logical relationship to the market value observed; to be seen, in fact, as the driver of that market value. For more information about valuation multiples, refer to below link: https://en.wikipedia.org/wiki/Valuation_using_multiples#Valuation_multiples
1. Enterprise Value to EBITDA ratio
One of the most widely used enterprise value multiples is the Enterprise Value to Ebitda (Earnings before interest, taxes, depreciation and amortization) multiple, short EV/Ebitda.
This is the ratio between the Enterprise Value of a company, meaning the total sum of the market value of the equity and net debt, divided by the company’s Ebitda.
In this project we will try to predict the EV/Ebitda ratios for the 3000 biggest companies listed on the US stock exchanges.
Before analysing our dataset, let us first try to understand what fundamental financial information about the company should be driving the EV/Ebitda ratio based on Financial Theory.
The present value of any Cash Flow generating asset is the value of all future cash flows the asset will generate over its lifetime, discounted back to today with a risk adjusted interest rate.
In its simplest form, where we expect the cash flow (CF) to grow at the same rate for ever (g), the present value (PV) is calculated with below perpetuity formula.
\[PV = \frac{CF * (1+g)}{(r-g)}\]When trying to value a company based on its future cash flows, we can either look at it through the lenses of all capital providers (equity and debt providers) and we would use the FCFF (Free Cash Flow to the Firm) as the measure of cash flow (CF), or we could look at it purely as an Equity investor and we would use the FCFE (Free Cash Flow to Equity) as our measure of CF.
The main difference is that the FCFF is the cash flow left over after cash flow needs for operations and Investments are met, but before any debt repayments, whereas the FCFE is after debt payments (cash flow left over for equity investors).
If we want to estimate the value of total Firm directly, we would therefore use the FCFF as the appropriate measure of CF.
As the Enterprise Value (EV) (numerator in the EV/Ebitda ratio), is the aggregate market estimation of the value of the total firm (Equity + debt), we can rewrite the above formula as follows:
\(EV = \frac{FCFF*(1 + g)}{(r - g)}\)
FCFF can be rewritten as:
\(EV = \frac{(EBITDA-Taxes-CAPEX-deltaWC)*(1 + g)}{(r - g)}\)
Re-writing the numerator relative to EBITDA gives us:
\(EV = \frac{EBITDA*(1-\frac{Taxes}{EBITDA}-\frac{CAPEX}{EBITDA}-\frac{deltaWC}{EBITDA})*(1 + g)}{(r - g)}\)
\(EV = \frac{EBITDA*(1-\frac{Taxes}{EBITDA}- \frac{CAPEX + deltaWC}{EBITDA})*(1 + g)}{(r - g)}\)
\(EV = \frac{EBITDA*(1-\frac{Taxes}{EBITDA}- \frac{reinvestment}{EBITDA})*(1 + g)}{(r - g)}\)
Deviding both sides by the EBITDA, gives us the EV to EBITDA ratio:
\(\frac{EV}{EBITDA}= \frac{\frac{EBITDA*(1-\frac{Taxes}{EBITDA}- \frac{reinvestment}{EBITDA})*(1 + g)}{(r - g)}}{EBITDA}\)
\(\frac{EV}{EBITDA} = \frac{(1-\frac{Taxes}{EBITDA}- \frac{reinvestment}{EBITDA})*(1 + g)}{(r - g)}\)
Finally converting the discount rate r to the cost of capital gives us:
\(\frac{EV}{EBITDA} = \frac{(1-\frac{Taxes}{EBITDA}- \frac{reinvestment}{EBITDA})*(1 + g)}{(WACC - g)}\)
This formula gives us some understanding of what we would expect the fundamental drivers of the EV/Ebitda ratio to be. The fundamental drivers of the EV/EBITDA ratio are
- A higher expected growth rate g, all else being equal, should lead to a higher EV/EBITDA
- A lower reinvestment need, all else being equal, should lead to a higher EV/EBITDA
- A lower amount of Taxes, all else being equal, should lead to a higher EV/EBITDA
- A lower amount of risk measured by the WACC, all else being equal, should result in a higher EV/EBITDA
We will therefore try to obtain financial metrics that are related to points 1-4, as we would expect these to be partial drivers of the EV/EBITDA ratios seeen in the market.
2. Data Extraction and Pre-Processing
The company data was downloaded from the data provider “Finbox”. For more information, refer to below link: https://finbox.com
We will first import the data and check for any missing values and some basic information. Then we will remove outliers before splitting the data into training and testing sets, and analysing the training set in more detail.
# linear algebra
import numpy as np
# data processing
import pandas as pd
#display max columns and rows, to be able to inspect big data tables
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
#data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
# conntects to google drive if ran in colab
# ignored if ran outside of colab
if 'google.colab' in str(get_ipython()):
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
# load file from google drive path, if ran in colab
# load file from notebook folder, if ran outside colab
if 'google.colab' in str(get_ipython()):
path = '/content/drive/My Drive/Colab Notebooks/stock_pricing/'
else:
path = str()
file = path + 'large_Financials_unlinked.xlsm'
company_data = pd.read_excel(file)
#original data structure
company_data.head(5)
Ticker | Full_Ticker | Name | Sector | Industry_GICS | Index_Membership | Market_Cap | Net_Debt | Minority_Interest | EV | EBITDA | Net_Income | Equity | Avg_EBITDA_Margin_Forecast_10y | EBITDA_Forecast_CAGR_10y | Avg_Net_Income_Margin_Forecast_10y | Net_Income_Forecast_CAGR_10y | ROE | Beta_5y | Net_Debt_perc_EV | EV_TTM_EBITDA | Market_Cap_TTM_Net_Income | Market_Cap_BV_Equity | Net_Debt_perc_Market_Cap | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AAPL | NASDAQGS:AAPL | Apple Inc. | Information Technology | Technology Hardware, Storage & Peripherals | Dow Jones Composite Average, Dow Jones Industr... | 2.166836e+06 | -69552.0 | 0.0 | 2.097284e+06 | 77344.0 | 57411.0 | 65339.0 | 0.28493 | 0.0381633 | 0.206377 | 0.0348267 | 0.736856 | 1.283613 | -0.033163 | 27.116315 | 37.742528 | 33.162985 | -0.032098 |
1 | AMZN | NASDAQGS:AMZN | Amazon.com, Inc. | Consumer Discretionary | Internet & Direct Marketing Retail | Nasdaq 100, Nasdaq Composite, Russell 1000, Ru... | 1.565880e+06 | 28412.0 | 0.0 | 1.594292e+06 | 43708.0 | 17377.0 | 82775.0 | 0.183412 | 0.145604 | 0.0865852 | 0.201361 | 0.249521 | 1.198004 | 0.017821 | 36.475980 | 90.112226 | 18.917308 | 0.018144 |
2 | MSFT | NASDAQGS:MSFT | Microsoft Corporation | Information Technology | Software | Dow Jones Composite Average, Dow Jones Industr... | 1.624977e+06 | -54733.0 | 0.0 | 1.570244e+06 | 68123.0 | 47496.0 | 123392.0 | 0.475806 | 0.0485504 | 0.324014 | 0.048353 | 0.413993 | 0.826155 | -0.034856 | 23.050136 | 34.212932 | 13.169228 | -0.033682 |
3 | GOOG | NASDAQGS:GOOG | Alphabet Inc. | Communication Services | Interactive Media & Services | Nasdaq 100, Nasdaq Composite, Russell 1000, Ru... | 1.178252e+06 | -105501.0 | 0.0 | 1.072751e+06 | 48075.0 | 35713.0 | 212920.0 | 0.381029 | 0.107043 | 0.260145 | 0.110387 | 0.175111 | 0.994747 | -0.098346 | 22.314107 | 32.992235 | 5.533776 | -0.089540 |
4 | FB | NASDAQGS:FB | Facebook, Inc. | Communication Services | Interactive Media & Services | Nasdaq 100, Nasdaq Composite, Russell 1000, Ru... | 7.151778e+05 | -44476.0 | 0.0 | 6.707018e+05 | 30221.0 | 25276.0 | 117731.0 | 0.530226 | 0.119821 | 0.351485 | 0.131987 | 0.238757 | 1.182773 | -0.066313 | 22.193237 | 28.294738 | 6.074677 | -0.062189 |
2.1 Data Structure and removal of not meaningful samples
#original data shape
original_data_shape = company_data.shape
print('The original data shape is', original_data_shape)
The original data shape is (3171, 24)
#drop duplicates
company_data = company_data.drop_duplicates('Name' , keep='first')
print(original_data_shape[0] - company_data.shape[0], 'companies were removed due to duplication.')
print('The new data shape is', company_data.shape)
21 companies were removed due to duplication.
The new data shape is (3150, 24)
#remove not meaningful samples
current_num_samples = company_data.shape[0]
company_data = company_data[(company_data.EV > 0)]
company_data = company_data[(company_data.Market_Cap > 0)]
company_data = company_data[(company_data.EBITDA > 0)]
company_data = company_data[(company_data.Sector != 'Financials')]
new_num_samples = company_data.shape[0]
print(current_num_samples - new_num_samples, 'companies were removed due to not meaningful data.')
print('The new data shape is', company_data.shape)
1281 companies were removed due to not meaningful data.
The new data shape is (1869, 24)
#filter data for required columns
data = company_data[['Sector',
'Name',
'EV_TTM_EBITDA',
'EBITDA_Forecast_CAGR_10y',
'Avg_EBITDA_Margin_Forecast_10y',
'Beta_5y',
'Net_Debt_perc_Market_Cap'
]].set_index('Name')
data.head(5)
Sector | EV_TTM_EBITDA | EBITDA_Forecast_CAGR_10y | Avg_EBITDA_Margin_Forecast_10y | Beta_5y | Net_Debt_perc_Market_Cap | |
---|---|---|---|---|---|---|
Name | ||||||
Apple Inc. | Information Technology | 27.116315 | 0.0381633 | 0.28493 | 1.283613 | -0.032098 |
Amazon.com, Inc. | Consumer Discretionary | 36.475980 | 0.145604 | 0.183412 | 1.198004 | 0.018144 |
Microsoft Corporation | Information Technology | 23.050136 | 0.0485504 | 0.475806 | 0.826155 | -0.033682 |
Alphabet Inc. | Communication Services | 22.314107 | 0.107043 | 0.381029 | 0.994747 | -0.089540 |
Facebook, Inc. | Communication Services | 22.193237 | 0.119821 | 0.530226 | 1.182773 | -0.062189 |
#check column data types
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1869 entries, Apple Inc. to Sequential Brands Group, Inc.
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sector 1869 non-null object
1 EV_TTM_EBITDA 1869 non-null float64
2 EBITDA_Forecast_CAGR_10y 1869 non-null object
3 Avg_EBITDA_Margin_Forecast_10y 1869 non-null object
4 Beta_5y 1869 non-null float64
5 Net_Debt_perc_Market_Cap 1869 non-null float64
dtypes: float64(3), object(3)
memory usage: 102.2+ KB
#ensure columns have the correct data type
data['EV_TTM_EBITDA'] = data['EV_TTM_EBITDA'].apply(pd.to_numeric, errors='coerce')
data['EBITDA_Forecast_CAGR_10y'] = data['EBITDA_Forecast_CAGR_10y'].apply(pd.to_numeric, errors='coerce')
data['Avg_EBITDA_Margin_Forecast_10y'] = data['Avg_EBITDA_Margin_Forecast_10y'].apply(pd.to_numeric, errors='coerce')
data['Beta_5y'] = data['Beta_5y'].apply(pd.to_numeric, errors='coerce')
data['Net_Debt_perc_Market_Cap'] = data['Net_Debt_perc_Market_Cap'].apply(pd.to_numeric, errors='coerce')
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1869 entries, Apple Inc. to Sequential Brands Group, Inc.
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sector 1869 non-null object
1 EV_TTM_EBITDA 1869 non-null float64
2 EBITDA_Forecast_CAGR_10y 1811 non-null float64
3 Avg_EBITDA_Margin_Forecast_10y 1864 non-null float64
4 Beta_5y 1869 non-null float64
5 Net_Debt_perc_Market_Cap 1869 non-null float64
dtypes: float64(5), object(1)
memory usage: 102.2+ KB
#check for missing values
data.isna().sum()
Sector 0
EV_TTM_EBITDA 0
EBITDA_Forecast_CAGR_10y 58
Avg_EBITDA_Margin_Forecast_10y 5
Beta_5y 0
Net_Debt_perc_Market_Cap 0
dtype: int64
#remove missing values
current_num_samples = data.shape[0]
data = data.dropna()
data.isna().sum()
new_num_samples = data.shape[0]
print(current_num_samples - new_num_samples, 'companies were removed due to missing data.')
print('The new data shape is', data.shape)
58 companies were removed due to missing data.
The new data shape is (1811, 6)
2.2 Inspection for Outliers
We will look at the individual histograms of all variables, to see if we have potential outliers, and use Gaussian mixture models to later remove these outliers.
2.2.1 Summary Statistics and Histograms
#summary statistics before outlier removal
data.describe()
EV_TTM_EBITDA | EBITDA_Forecast_CAGR_10y | Avg_EBITDA_Margin_Forecast_10y | Beta_5y | Net_Debt_perc_Market_Cap | |
---|---|---|---|---|---|
count | 1811.000000 | 1811.000000 | 1811.000000 | 1811.000000 | 1811.000000 |
mean | 42.875328 | 0.062582 | 0.266711 | 1.650814 | 0.525916 |
std | 314.195939 | 0.083894 | 0.182540 | 16.772324 | 1.345549 |
min | 0.542898 | -0.290873 | 0.000000 | -6.314600 | -1.025617 |
25% | 10.680256 | 0.023840 | 0.130704 | 0.793330 | 0.016698 |
50% | 15.715954 | 0.042193 | 0.213080 | 1.182805 | 0.202935 |
75% | 24.875574 | 0.079265 | 0.360770 | 1.607668 | 0.592843 |
max | 10586.768907 | 1.201920 | 0.844657 | 714.265241 | 25.971131 |
sns.histplot(data[['EV_TTM_EBITDA']])
<matplotlib.axes._subplots.AxesSubplot at 0x7fab9fb889b0>
sns.histplot(np.log(data[['EV_TTM_EBITDA']]))
<matplotlib.axes._subplots.AxesSubplot at 0x7fab8b362ac8>
sns.histplot(data[['EBITDA_Forecast_CAGR_10y']])
<matplotlib.axes._subplots.AxesSubplot at 0x7fab8b329320>
sns.histplot(data[['Avg_EBITDA_Margin_Forecast_10y']])
<matplotlib.axes._subplots.AxesSubplot at 0x7fab8ac92208>
sns.histplot(data[['Beta_5y']])
<matplotlib.axes._subplots.AxesSubplot at 0x7fab8af04550>
sns.histplot(data[['Net_Debt_perc_Market_Cap']])
<matplotlib.axes._subplots.AxesSubplot at 0x7fab879eca90>
2.2.2 Outlier Removal with Gaussian Mixtures
#check for outliers with multivariate Guassian
from sklearn.covariance import EllipticEnvelope
X = data[['EV_TTM_EBITDA',
'EBITDA_Forecast_CAGR_10y',
'Avg_EBITDA_Margin_Forecast_10y',
'Beta_5y',
'Net_Debt_perc_Market_Cap']]
cov = EllipticEnvelope(random_state=0, contamination=0.1).fit(X)
# predict returns 1 for an inlier and -1 for an outlier
data[cov.predict(X)== -1]
Sector | EV_TTM_EBITDA | EBITDA_Forecast_CAGR_10y | Avg_EBITDA_Margin_Forecast_10y | Beta_5y | Net_Debt_perc_Market_Cap | |
---|---|---|---|---|---|---|
Name | ||||||
Tesla, Inc. | Consumer Discretionary | 200.867330 | 0.275696 | 0.249014 | 2.190651 | 0.000785 |
salesforce.com, inc. | Information Technology | 95.958312 | 0.187377 | 0.302774 | 1.171743 | -0.016569 |
Zoom Video Communications, Inc. | Information Technology | 223.597045 | 0.529394 | 0.345674 | 0.000000 | -0.017654 |
ServiceNow, Inc. | Information Technology | 189.199365 | 0.277811 | 0.348887 | 1.148634 | -0.007477 |
Square, Inc. | Information Technology | 6866.819030 | 0.389482 | 0.090723 | 2.710084 | -0.002075 |
Autodesk, Inc. | Information Technology | 102.571049 | 0.205378 | 0.412732 | 1.440651 | 0.006867 |
Atlassian Corporation Plc | Information Technology | 947.273970 | 0.346807 | 0.292826 | 0.000000 | -0.017579 |
Veeva Systems Inc. | Healthcare | 111.288407 | 0.143649 | 0.417909 | 0.830882 | -0.035709 |
Twitter, Inc. | Communication Services | 115.322425 | 0.128971 | 0.330907 | 0.863445 | -0.089297 |
Peloton Interactive, Inc. | Consumer Discretionary | 323.071290 | 0.623983 | 0.167837 | 0.000000 | -0.031071 |
Align Technology, Inc. | Healthcare | 102.147202 | 0.116234 | 0.311033 | 1.676995 | -0.012142 |
DexCom, Inc. | Healthcare | 94.839960 | 0.186784 | 0.259330 | 0.855567 | -0.022706 |
Ford Motor Company | Consumer Discretionary | 28.757449 | 0.028368 | 0.078969 | 1.273634 | 3.316866 |
The Trade Desk, Inc. | Information Technology | 335.262525 | 0.220880 | 0.390967 | 2.571428 | -0.006479 |
Southwest Airlines Co. | Industrials | 115.641235 | 0.087942 | 0.187911 | 1.226890 | -0.067864 |
Palo Alto Networks, Inc. | Information Technology | 1195.284136 | 1.005202 | 0.273581 | 1.503676 | 0.006286 |
Paycom Software, Inc. | Information Technology | 99.861863 | 0.116981 | 0.433187 | 1.482668 | -0.004270 |
GoodRx Holdings, Inc. | Healthcare | 211.734149 | 0.164313 | 0.397327 | 0.000000 | -0.019991 |
Trip.com Group Limited | Consumer Discretionary | 115.003456 | 0.114943 | 0.258091 | 1.330882 | 1.021974 |
Horizon Therapeutics Public Limited Company | Healthcare | 51.074250 | 0.358072 | 0.533816 | 1.143907 | -0.038809 |
Ryanair Holdings plc | Industrials | 84.801017 | 0.267277 | 0.424162 | 1.436159 | 0.051860 |
Insulet Corporation | Healthcare | 151.722962 | 0.180835 | 0.220478 | 0.753676 | 0.001678 |
Warner Music Group Corp. | Communication Services | 132.166747 | 0.248114 | 0.205784 | 0.000000 | 0.164390 |
Monolithic Power Systems, Inc. | Information Technology | 97.925230 | 0.139450 | 0.330777 | 0.940651 | -0.031062 |
BioMarin Pharmaceutical Inc. | Healthcare | 419.677324 | 1.071525 | 0.183666 | 0.712710 | -0.000060 |
Ceridian HCM Holding Inc. | Information Technology | 162.268184 | 0.120739 | 0.233240 | 1.383878 | 0.030229 |
Enphase Energy, Inc. | Information Technology | 155.399251 | 0.233280 | 0.333227 | 1.431722 | -0.010926 |
Huazhu Group Limited | Consumer Discretionary | 117.613966 | 0.116765 | 0.321942 | 1.774684 | 2.653895 |
NovoCure Limited | Healthcare | 673.330233 | 0.378987 | 0.171717 | 1.283613 | -0.012370 |
Zynga Inc. | Communication Services | 32.612302 | 0.662850 | 0.262741 | 0.199054 | -0.002175 |
Paylocity Holding Corporation | Information Technology | 122.439891 | 0.138803 | 0.295299 | 1.358193 | -0.005702 |
Chegg, Inc. | Consumer Discretionary | 158.755259 | 0.257655 | 0.367920 | 0.983718 | 0.024049 |
Beyond Meat, Inc. | Consumer Staples | 10586.768907 | 0.463142 | 0.218462 | 0.000000 | -0.019930 |
Five9, Inc. | Information Technology | 1004.697864 | 0.264004 | 0.213080 | 0.521533 | -0.010888 |
Caesars Entertainment, Inc. | Consumer Discretionary | 110.387153 | 0.222447 | 0.341142 | 3.143382 | 1.625932 |
Alteryx, Inc. | Information Technology | 326.493151 | 0.180040 | 0.231524 | 0.844550 | 0.007510 |
Repligen Corporation | Healthcare | 128.302012 | 0.144598 | 0.312485 | 0.937500 | -0.025107 |
Guidewire Software, Inc. | Information Technology | 1042.018740 | 0.151002 | 0.071289 | 1.358193 | -0.064000 |
Melco Resorts & Entertainment Limited | Consumer Discretionary | 141.364688 | 0.085361 | 0.292891 | 1.766281 | 0.489584 |
The Gap, Inc. | Consumer Discretionary | 89.755628 | 0.018602 | 0.087069 | 1.524159 | 0.606664 |
ZoomInfo Technologies Inc. | Communication Services | 136.150171 | 0.252624 | 0.560529 | 0.000000 | 0.066545 |
The AZEK Company Inc. | Industrials | 81.996560 | 0.152255 | 0.250740 | 0.000000 | 0.041419 |
Brookfield Property Partners L.P. | Real Estate | 17.939351 | -0.051979 | 0.598349 | 1.488970 | 7.287715 |
Inphi Corporation | Information Technology | 93.741126 | 0.263641 | 0.447819 | 1.063025 | 0.037075 |
Under Armour, Inc. | Consumer Discretionary | 73.549111 | 0.012367 | 0.072856 | 1.126575 | 0.145338 |
Hyatt Hotels Corporation | Consumer Discretionary | 1506.762718 | 0.068638 | 0.242367 | 1.530462 | 0.213329 |
AppFolio, Inc. | Information Technology | 247.296411 | 0.238792 | 0.215147 | 1.155987 | -0.026250 |
Freshpet, Inc. | Consumer Staples | 207.883654 | 0.303361 | 0.230179 | 0.842436 | -0.014682 |
Apartment Investment and Management Company | Real Estate | 9.799975 | 0.026424 | 0.617981 | 0.504201 | 5.617434 |
NeoGenomics, Inc. | Healthcare | 283.714836 | 0.113220 | 0.170325 | 0.793592 | -0.010414 |
SailPoint Technologies Holdings, Inc. | Information Technology | 165.846646 | 0.175084 | 0.067844 | 2.009701 | -0.023138 |
CyberArk Software Ltd. | Information Technology | 235.160410 | 0.029310 | 0.149556 | 1.372899 | -0.099157 |
Jamf Holding Corp. | Information Technology | 139.357249 | 0.139439 | 0.121253 | 0.000000 | -0.045623 |
Vicor Corporation | Industrials | 238.829168 | 0.247710 | 0.317639 | 0.838235 | -0.044756 |
STAAR Surgical Company | Healthcare | 439.745998 | 0.201741 | 0.217175 | 1.080882 | -0.028797 |
Altair Engineering Inc. | Information Technology | 159.402321 | 0.080947 | 0.102365 | 1.527488 | 0.001179 |
Cleveland-Cliffs Inc. | Materials | 38.315691 | 0.384860 | 0.224049 | 2.225840 | 0.499703 |
Inari Medical, Inc. | Healthcare | 265.655820 | 0.466521 | 0.163802 | 0.000000 | -0.040520 |
Digital Turbine, Inc. | Information Technology | 151.043441 | 0.259552 | 0.283449 | 2.402310 | -0.001506 |
Shake Shack Inc. | Consumer Discretionary | 188.317447 | 0.157248 | 0.168499 | 1.776260 | 0.047185 |
Ping Identity Holding Corp. | Information Technology | 992.570819 | 0.133069 | 0.164389 | 0.000000 | -0.001415 |
Avis Budget Group, Inc. | Industrials | 150.271779 | 0.032521 | 0.101539 | 2.310399 | 5.062881 |
LivaNova PLC | Healthcare | 204.573948 | 0.148263 | 0.214982 | 0.799894 | 0.145708 |
Sunnova Energy International Inc. | Utilities | 176.591888 | 0.226460 | 0.480392 | 0.000000 | 0.375385 |
Tower Semiconductor Ltd. | Information Technology | 8.777906 | 0.292268 | 0.397378 | 1.361344 | -0.128399 |
Progyny, Inc. | Healthcare | 296.877878 | 0.358589 | 0.173703 | 0.000000 | -0.026911 |
Alliance Data Systems Corporation | Information Technology | 25.761200 | 0.035661 | 0.299326 | 2.582983 | 3.966767 |
Ryman Hospitality Properties, Inc. | Real Estate | 89.884980 | 0.090988 | 0.307643 | 1.808823 | 0.713166 |
Nordstrom, Inc. | Consumer Discretionary | 474.805048 | -0.026945 | 0.072959 | 2.451155 | 0.778384 |
Macy's, Inc. | Consumer Discretionary | 556.494924 | -0.004716 | 0.065204 | 2.013130 | 1.763889 |
Commvault Systems, Inc. | Information Technology | 83.911322 | 0.302015 | 0.203891 | 0.794642 | -0.135830 |
CommScope Holding Company, Inc. | Information Technology | 9.893421 | 0.020964 | 0.157286 | 1.862394 | 3.127148 |
Colony Capital, Inc. | Real Estate | 18.450260 | -0.041691 | 0.427312 | 1.987854 | 2.764061 |
Pebblebrook Hotel Trust | Real Estate | 180.669329 | 0.074852 | 0.291862 | 2.026785 | 0.981449 |
Celsius Holdings, Inc. | Consumer Staples | 583.619894 | 1.201920 | 0.161267 | 2.172794 | -0.009906 |
MicroStrategy Incorporated | Information Technology | 79.723750 | 0.711108 | 0.421064 | 1.589285 | 0.011132 |
Livent Corporation | Materials | 133.898922 | 0.101695 | 0.283130 | 2.244301 | 0.071956 |
Vital Farms, Inc. | Consumer Staples | 77.509807 | 0.301912 | 0.121291 | 0.000000 | -0.099120 |
Simulations Plus, Inc. | Healthcare | 87.717630 | 0.090232 | 0.405204 | 0.000000 | -0.083055 |
The Cheesecake Factory Incorporated | Consumer Discretionary | 104.870916 | 0.049671 | 0.083231 | 1.705882 | 0.830481 |
ACM Research, Inc. | Information Technology | 87.473199 | 0.234859 | 0.295867 | 0.887531 | -0.045061 |
RLJ Lodging Trust | Real Estate | 100.892798 | 0.077866 | 0.293877 | 2.158088 | 0.757639 |
Sinclair Broadcast Group, Inc. | Communication Services | 8.438616 | 0.096874 | 0.274030 | 1.585084 | 4.935812 |
Glu Mobile Inc. | Communication Services | 82.522778 | 0.264340 | 0.177606 | 0.831932 | -0.186857 |
iHeartMedia, Inc. | Communication Services | 18.589811 | 0.024124 | 0.261831 | 0.000000 | 3.107772 |
World Fuel Services Corporation | Energy | 7.616104 | 0.429859 | 0.108961 | 1.246848 | 0.071008 |
Service Properties Trust | Real Estate | 11.875750 | 0.027736 | 0.382401 | 2.444327 | 3.303842 |
Endo International plc | Healthcare | 24.663469 | 0.063334 | 0.451955 | 1.419117 | 3.967743 |
OraSure Technologies, Inc. | Healthcare | 198.770158 | -0.026472 | 0.071666 | 0.043067 | -0.254526 |
The Pennant Group, Inc. | Healthcare | 86.514949 | 0.190000 | 0.216489 | 0.000000 | 0.171849 |
Heska Corporation | Healthcare | 184.539033 | 0.173527 | 0.068167 | 1.644432 | -0.017779 |
Mack-Cali Realty Corporation | Real Estate | 44.156339 | -0.000763 | 0.458996 | 1.069852 | 2.593549 |
The Macerich Company | Real Estate | 16.148959 | -0.023081 | 0.500574 | 2.056722 | 3.142394 |
Xenia Hotels & Resorts, Inc. | Real Estate | 155.338978 | 0.040133 | 0.230800 | 1.632878 | 0.710745 |
BioLife Solutions, Inc. | Healthcare | 414.902527 | 0.341834 | 0.308731 | 1.546743 | -0.070999 |
CEVA, Inc. | Information Technology | 165.494614 | 0.199883 | 0.153385 | 1.185924 | -0.106221 |
iStar Inc. | Real Estate | 24.404341 | -0.018452 | 0.304676 | 0.809348 | 3.097602 |
United Natural Foods, Inc. | Consumer Staples | 7.048980 | 0.008955 | 0.025772 | 1.503676 | 3.903265 |
Quotient Technology Inc. | Consumer Discretionary | 234.832794 | 0.239540 | 0.129949 | 0.849264 | -0.016882 |
Textainer Group Holdings Limited | Industrials | 10.088555 | 0.033980 | 0.762142 | 1.580357 | 3.573168 |
Brookfield Property REIT Inc. | Real Estate | 22.618764 | 0.118648 | 0.771357 | 1.404560 | 25.971131 |
GrowGeneration Corp. | Consumer Discretionary | 387.364456 | 0.402906 | 0.151351 | 3.109704 | -0.017205 |
Fulgent Genetics, Inc. | Healthcare | 24.083644 | 0.421622 | 0.461197 | 1.928811 | -0.059669 |
The E.W. Scripps Company | Communication Services | 11.059371 | 0.316445 | 0.382041 | 1.949054 | 1.454701 |
Tactile Systems Technology, Inc. | Healthcare | 256.100331 | 0.156386 | 0.152844 | 1.679960 | -0.020267 |
American Axle & Manufacturing Holdings, Inc. | Consumer Discretionary | 6.594480 | -0.010659 | 0.156073 | 2.699054 | 2.941159 |
Diversified Healthcare Trust | Real Estate | 11.435733 | -0.010319 | 0.232312 | 1.361869 | 3.314837 |
Oxford Industries, Inc. | Consumer Discretionary | 49.112879 | 0.480186 | 0.369037 | 1.602941 | 0.229790 |
SFL Corporation Ltd. | Energy | 9.941712 | -0.011988 | 0.637073 | 1.213760 | 3.109762 |
Tenneco Inc. | Consumer Discretionary | 7.728585 | 0.026684 | 0.080744 | 2.691701 | 5.297825 |
Community Health Systems, Inc. | Healthcare | 12.587641 | 0.030153 | 0.129580 | 2.107668 | 12.788417 |
Amneal Pharmaceuticals, Inc. | Healthcare | 9.268181 | 0.066183 | 0.240407 | 1.300768 | 3.814766 |
Meredith Corporation | Communication Services | 7.629137 | -0.009960 | 0.175145 | 2.276260 | 3.244742 |
Veeco Instruments Inc. | Information Technology | 23.006617 | 0.358484 | 0.170280 | 1.425945 | 0.022544 |
Inogen, Inc. | Healthcare | 137.565001 | -0.012810 | 0.063584 | 1.046218 | -0.194290 |
Century Aluminum Company | Materials | 129.259320 | 0.276041 | 0.052513 | 2.561974 | 0.214986 |
Brookdale Senior Living Inc. | Healthcare | 13.249867 | 0.013942 | 0.108613 | 1.865546 | 6.160083 |
Global Medical REIT Inc. | Real Estate | 19.542185 | 0.105447 | 0.797919 | -6.314600 | 0.869243 |
Eastman Kodak Company | Information Technology | 7.817920 | 0.012010 | 0.086188 | 5.596113 | -0.181916 |
Surmodics, Inc. | Healthcare | 92.875455 | 0.083029 | 0.088202 | 1.000525 | -0.093102 |
RPC, Inc. | Energy | 102.662160 | 0.052706 | 0.127810 | 2.137605 | -0.131896 |
Rite Aid Corporation | Consumer Staples | 11.920487 | -0.009058 | 0.021498 | 1.001575 | 6.300989 |
The Chefs' Warehouse, Inc. | Consumer Staples | 194.246612 | 0.089962 | 0.066142 | 2.455882 | 0.327948 |
Scorpio Tankers Inc. | Energy | 7.138393 | 0.104970 | 0.601183 | 1.158613 | 3.916770 |
Transocean Ltd. | Energy | 7.647058 | 0.010152 | 0.326109 | 3.752626 | 3.505997 |
Limelight Networks, Inc. | Information Technology | 30.314803 | 0.329181 | 0.193427 | 0.724264 | -0.020178 |
CAI International, Inc. | Industrials | 22.052699 | 0.092397 | 0.782685 | 1.753151 | 3.684444 |
Boston Omaha Corporation | Communication Services | 115.570890 | 0.266829 | 0.225499 | 0.499759 | -0.166204 |
AngioDynamics, Inc. | Healthcare | 130.395979 | 0.098856 | 0.071677 | 0.886029 | -0.011744 |
FRP Holdings, Inc. | Real Estate | 79.580028 | -0.002952 | 0.443529 | 0.705882 | 0.098861 |
Consolidated Communications Holdings, Inc. | Communication Services | 5.339124 | 0.013146 | 0.411368 | 1.367647 | 5.206360 |
Par Pacific Holdings, Inc. | Energy | 147.385671 | 0.077741 | 0.068859 | 2.511554 | 1.226311 |
Tejon Ranch Co. | Real Estate | 146.641106 | -0.034705 | 0.051047 | 0.644432 | 0.035020 |
Frank's International N.V. | Energy | 2804.045761 | 0.075031 | 0.109971 | 1.507352 | -0.238425 |
Caleres, Inc. | Consumer Discretionary | 56.935485 | 0.008801 | 0.060885 | 2.473214 | 1.713944 |
Carrols Restaurant Group, Inc. | Consumer Discretionary | 17.206957 | 0.066622 | 0.074508 | 2.605567 | 3.835404 |
BioDelivery Sciences International, Inc. | Healthcare | 12.345843 | 0.310385 | 0.445107 | 0.727941 | -0.045735 |
New Senior Investment Group Inc. | Real Estate | 15.726062 | 0.010277 | 0.328094 | 2.045693 | 3.256503 |
Conn's, Inc. | Consumer Discretionary | 103.528237 | 0.033285 | 0.083218 | 2.596113 | 2.697070 |
Liquidity Services, Inc. | Consumer Discretionary | 1277.504624 | 0.545650 | 0.101182 | 1.204831 | -0.100726 |
CorePoint Lodging Inc. | Real Estate | 139.646074 | -0.019484 | 0.130601 | 1.981101 | 1.836362 |
Cooper-Standard Holdings Inc. | Consumer Discretionary | 638.199180 | 0.304758 | 0.208089 | 3.209033 | 0.970937 |
Preferred Apartment Communities, Inc. | Real Estate | 28.070785 | 0.027324 | 0.569761 | 1.124474 | 7.590762 |
Revlon, Inc. | Consumer Staples | 43.443358 | 0.073722 | 0.136836 | 2.716386 | 5.985659 |
CalAmp Corp. | Information Technology | 107.211908 | 0.116129 | 0.116113 | 2.485294 | 0.331036 |
Retail Value Inc. | Real Estate | 6.910010 | -0.290873 | 0.198649 | 1.891162 | 1.162256 |
GTT Communications, Inc. | Information Technology | 10.118019 | -0.007109 | 0.222752 | 1.641281 | 18.698159 |
NN, Inc. | Industrials | 12.016967 | -0.027208 | 0.142701 | 3.349264 | 3.023795 |
AXT, Inc. | Information Technology | 128.340641 | 0.164417 | 0.167744 | 2.296218 | -0.044102 |
Berry Corporation | Energy | 2.877859 | -0.004759 | 0.358621 | 714.265241 | 0.882410 |
QEP Resources, Inc. | Energy | 3.624731 | 0.041161 | 0.650404 | 5.264180 | 2.182623 |
DASAN Zhone Solutions, Inc. | Information Technology | 297.851164 | 0.255972 | 0.072003 | 1.317226 | 0.104923 |
Orion Energy Systems, Inc. | Industrials | 90.937642 | 0.143926 | 0.194833 | 2.619222 | -0.001672 |
Nesco Holdings, Inc. | Industrials | 35.011502 | 0.146010 | 0.448028 | 0.000000 | 2.149199 |
Alpha Pro Tech, Ltd. | Industrials | 5.972842 | 0.387251 | 0.390266 | -1.269432 | -0.135723 |
Entercom Communications Corp. | Communication Services | 18.029148 | 0.029577 | 0.230234 | 1.915966 | 4.823862 |
Bluerock Residential Growth REIT, Inc. | Real Estate | 14.246728 | 0.027543 | 0.567973 | 1.065126 | 4.926919 |
Nabors Industries Ltd. | Energy | 5.798961 | -0.010064 | 0.261313 | 3.790966 | 5.698193 |
Callon Petroleum Company | Energy | 5.974019 | 0.080393 | 0.636032 | 3.409138 | 5.115067 |
Hersha Hospitality Trust | Real Estate | 100.567386 | 0.048996 | 0.276029 | 2.759978 | 4.092159 |
Atlantic Power Corporation | Utilities | 9.472282 | 0.046711 | 0.679425 | 0.360396 | 2.956992 |
Gannett Co., Inc. | Communication Services | 6.551396 | 0.078796 | 0.126688 | 2.923844 | 3.326580 |
Overseas Shipholding Group, Inc. | Energy | 8.496538 | 0.001550 | 0.197073 | 0.344537 | 3.305226 |
SM Energy Company | Energy | 4.101045 | 0.047311 | 0.675687 | 6.610819 | 2.003005 |
National CineMedia, Inc. | Communication Services | 35.545151 | 0.073656 | 0.417918 | 1.898634 | 2.835588 |
Five Star Senior Living Inc. | Healthcare | 4.455550 | 0.383867 | 0.019542 | 1.093487 | -0.330984 |
Willis Lease Finance Corporation | Industrials | 7.528340 | 0.041371 | 0.641146 | 1.225840 | 6.851299 |
Exterran Corporation | Energy | 4.182844 | -0.011508 | 0.206908 | 1.018382 | 3.157806 |
Harrow Health, Inc. | Healthcare | 167.227700 | 0.356905 | 0.262504 | 1.081932 | -0.042801 |
Quad/Graphics, Inc. | Industrials | 4.559083 | 0.030267 | 0.117657 | 2.094537 | 4.614893 |
Penn Virginia Corporation | Energy | 2.359374 | 0.037113 | 0.684470 | 3.649320 | 2.498958 |
Mesa Air Group, Inc. | Industrials | 12.319720 | 0.147644 | 0.333299 | 3.127155 | 3.213603 |
Washington Prime Group Inc. | Real Estate | 14.527547 | 0.027008 | 0.707422 | 1.976365 | 16.784415 |
HC2 Holdings, Inc. | Industrials | 18.908519 | -0.162321 | 0.020053 | 2.339810 | 1.832551 |
Sharps Compliance Corp. | Healthcare | 142.056343 | 0.108284 | 0.085569 | 0.002626 | 0.046589 |
Clipper Realty Inc. | Real Estate | 21.807713 | 0.057809 | 0.504501 | 1.006648 | 7.673637 |
Laredo Petroleum, Inc. | Energy | 3.594123 | 0.005052 | 0.641185 | 4.738970 | 3.974701 |
Safe Bulkers, Inc. | Industrials | 11.612496 | 0.074137 | 0.549133 | 0.952205 | 2.823696 |
R. R. Donnelley & Sons Company | Industrials | 5.501899 | -0.015650 | 0.075827 | 2.775735 | 10.644258 |
StealthGas Inc. | Energy | 6.583602 | 0.053326 | 0.545212 | 1.709558 | 2.890521 |
CSI Compressco LP | Energy | 7.175938 | -0.001477 | 0.385766 | 2.086659 | 12.732358 |
Sequential Brands Group, Inc. | Consumer Discretionary | 10.014852 | 0.132819 | 0.794679 | 1.454831 | 18.898905 |
#outlier removal
current_num_samples = data.shape[0]
data_new = data[cov.predict(X)== 1]
new_num_samples = data_new.shape[0]
print(current_num_samples - new_num_samples, 'companies were removed due to outliers.')
print('The new data shape is', data_new.shape)
182 companies were removed due to outliers.
The new data shape is (1629, 6)
2.2.3 Summary Statistics and Histograms after outlier removal
#summary statistics after outlier removal
data_new.describe()
EV_TTM_EBITDA | EBITDA_Forecast_CAGR_10y | Avg_EBITDA_Margin_Forecast_10y | Beta_5y | Net_Debt_perc_Market_Cap | |
---|---|---|---|---|---|
count | 1629.000000 | 1629.000000 | 1629.000000 | 1629.000000 | 1629.000000 |
mean | 18.877218 | 0.051891 | 0.265026 | 1.230406 | 0.362546 |
std | 12.891314 | 0.049236 | 0.182241 | 0.696487 | 0.514622 |
min | 0.542898 | -0.139129 | 0.000000 | -0.465861 | -1.025617 |
25% | 10.568284 | 0.023252 | 0.130638 | 0.782037 | 0.029264 |
50% | 15.034857 | 0.040332 | 0.210164 | 1.170168 | 0.203440 |
75% | 22.488791 | 0.071801 | 0.357577 | 1.571428 | 0.542521 |
max | 79.289256 | 0.302286 | 0.844657 | 4.881302 | 2.772638 |
sns.histplot(data_new[['EV_TTM_EBITDA']])
<matplotlib.axes._subplots.AxesSubplot at 0x7fab86f1d748>
sns.histplot(np.log(data_new[['EV_TTM_EBITDA']]))
<matplotlib.axes._subplots.AxesSubplot at 0x7fab873cf828>
sns.histplot(data_new[['EBITDA_Forecast_CAGR_10y']])
<matplotlib.axes._subplots.AxesSubplot at 0x7fab87131080>
sns.histplot(data[['Avg_EBITDA_Margin_Forecast_10y']])
<matplotlib.axes._subplots.AxesSubplot at 0x7fab870eeda0>
sns.histplot(data_new[['Beta_5y']])
<matplotlib.axes._subplots.AxesSubplot at 0x7fab8704cef0>
sns.histplot(data_new[['Net_Debt_perc_Market_Cap']])
<matplotlib.axes._subplots.AxesSubplot at 0x7fab86e87d30>
2.3 Split into training and testing set
# split data into train and test splits
from sklearn.model_selection import train_test_split
X = data_new[['Sector', 'EBITDA_Forecast_CAGR_10y','Avg_EBITDA_Margin_Forecast_10y', 'Beta_5y', 'Net_Debt_perc_Market_Cap']];
Y = data_new[['EV_TTM_EBITDA']];
Y['Log_EV_TTM_EBITDA'] = np.log(Y['EV_TTM_EBITDA']);
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, shuffle=True, test_size = 0.2, random_state=1);
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
# check training data shape
print('X_train has shape', X_train.shape)
print('y_train has shape', Y_train.shape)
X_train has shape (1303, 5)
y_train has shape (1303, 2)
#check X_train
X_train.head(5)
Sector | EBITDA_Forecast_CAGR_10y | Avg_EBITDA_Margin_Forecast_10y | Beta_5y | Net_Debt_perc_Market_Cap | |
---|---|---|---|---|---|
Name | |||||
Zimmer Biomet Holdings, Inc. | Healthcare | 0.033462 | 0.332755 | 1.337184 | 0.221408 |
Allegiant Travel Company | Industrials | 0.048457 | 0.278896 | 1.703256 | 0.316869 |
Alpha and Omega Semiconductor Limited | Information Technology | 0.149025 | 0.173992 | 2.529411 | 0.066703 |
Zix Corporation | Information Technology | 0.118027 | 0.237929 | 1.159138 | 0.365729 |
CMC Materials, Inc. | Information Technology | 0.003157 | 0.318449 | 1.180672 | 0.152127 |
#check y_train
Y_train.head(5)
EV_TTM_EBITDA | Log_EV_TTM_EBITDA | |
---|---|---|
Name | ||
Zimmer Biomet Holdings, Inc. | 20.932245 | 3.041291 |
Allegiant Travel Company | 49.358942 | 3.899119 |
Alpha and Omega Semiconductor Limited | 22.075878 | 3.094486 |
Zix Corporation | 14.757365 | 2.691742 |
CMC Materials, Inc. | 15.474569 | 2.739198 |
train_data = Y_train.merge(X_train, how='outer', left_index=True, right_index=True)
test_data = Y_test.merge(X_test, how='outer', left_index=True, right_index=True)
3. Exploratory Data Analysis
3.1 Summary Statistics
train_data.describe().round(2)
EV_TTM_EBITDA | Log_EV_TTM_EBITDA | EBITDA_Forecast_CAGR_10y | Avg_EBITDA_Margin_Forecast_10y | Beta_5y | Net_Debt_perc_Market_Cap | |
---|---|---|---|---|---|---|
count | 1303.00 | 1303.00 | 1303.00 | 1303.00 | 1303.00 | 1303.00 |
mean | 18.95 | 2.75 | 0.05 | 0.26 | 1.22 | 0.37 |
std | 13.21 | 0.63 | 0.05 | 0.18 | 0.69 | 0.53 |
min | 0.54 | -0.61 | -0.14 | 0.00 | -0.47 | -1.03 |
25% | 10.57 | 2.36 | 0.02 | 0.13 | 0.77 | 0.03 |
50% | 15.03 | 2.71 | 0.04 | 0.21 | 1.17 | 0.21 |
75% | 22.42 | 3.11 | 0.07 | 0.35 | 1.56 | 0.55 |
max | 79.29 | 4.37 | 0.28 | 0.84 | 4.88 | 2.77 |
3.2 Correlation Matrix
# correlation matrix for outcome variable
from string import ascii_letters
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="white")
# Compute the correlation matrix
corr = train_data.drop(columns="Log_EV_TTM_EBITDA").corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(20, 230, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, annot= True, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
<matplotlib.axes._subplots.AxesSubplot at 0x7fab86d66c18>
# correlation matrix for log of outcome variable
# Compute the correlation matrix
corr = train_data.drop(columns="EV_TTM_EBITDA").corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(20, 230, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, annot= True, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
<matplotlib.axes._subplots.AxesSubplot at 0x7fab86dcf208>
3.3 Pairplots
sns.set_theme()
sns.jointplot(data=train_data, x="EBITDA_Forecast_CAGR_10y", y="EV_TTM_EBITDA")
sns.jointplot(data=train_data, x="EBITDA_Forecast_CAGR_10y", y="Log_EV_TTM_EBITDA")
<seaborn.axisgrid.JointGrid at 0x7fab871d74e0>
sns.jointplot(data=train_data, x="Avg_EBITDA_Margin_Forecast_10y", y="EV_TTM_EBITDA")
sns.jointplot(data=train_data, x="Avg_EBITDA_Margin_Forecast_10y", y="Log_EV_TTM_EBITDA")
<seaborn.axisgrid.JointGrid at 0x7fab868d17b8>
sns.jointplot(data=train_data, x="Beta_5y", y="EV_TTM_EBITDA")
sns.jointplot(data=train_data, x="Beta_5y", y="Log_EV_TTM_EBITDA")
<seaborn.axisgrid.JointGrid at 0x7fab865decc0>
sns.jointplot(data=train_data, x="Net_Debt_perc_Market_Cap", y="EV_TTM_EBITDA")
sns.jointplot(data=train_data, x="Net_Debt_perc_Market_Cap", y="Log_EV_TTM_EBITDA")
<seaborn.axisgrid.JointGrid at 0x7fab86233ac8>
4. Machine Learning
#ML dependancies
#data preprocessing pipeline to preprocess data table
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
#Grid search cross validation to find best hyper parameters
from sklearn.model_selection import GridSearchCV
#models considered
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import VotingRegressor
#evaluation
from sklearn.model_selection import cross_val_score
4.1 Pre-Processing Pipeline
#split data back into X and y
X_train = train_data[['EBITDA_Forecast_CAGR_10y','Avg_EBITDA_Margin_Forecast_10y', 'Beta_5y', 'Net_Debt_perc_Market_Cap', 'Sector']]
y_train = train_data['Log_EV_TTM_EBITDA']
X_test = test_data[['EBITDA_Forecast_CAGR_10y','Avg_EBITDA_Margin_Forecast_10y', 'Beta_5y', 'Net_Debt_perc_Market_Cap', 'Sector']]
y_test = test_data['Log_EV_TTM_EBITDA']
# check training and test data shape
print('X_train has shape', X_train.shape)
print('y_train has shape', y_train.shape)
print('X_test has shape', X_test.shape)
print('y_test has shape', y_test.shape)
X_train has shape (1303, 5)
y_train has shape (1303,)
X_test has shape (326, 5)
y_test has shape (326,)
# define categorical columns for transformation
categorical_variables = ['Sector']
# define numerical columns for transformation
numerical_variables = list(X_train.drop(columns=categorical_variables).columns)
# define column transformation for data preprocessing pipeline
column_transformer = ColumnTransformer(transformers=[
('categorical_columns', OneHotEncoder(handle_unknown='ignore'), categorical_variables),
('numeric_columns', StandardScaler(), numerical_variables)])
4.2 Model Training and Hyper Parameter Tuning
#ridge regression
grid_values = {'model__alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 200, 300, 400, 500]}
model_1 = Pipeline([('column_transformer', column_transformer),
('model', Ridge(max_iter = 100000))])
# default metric to optimize over grid parameters: accuracy
grid_model_1 = GridSearchCV(model_1, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_1 = grid_model_1.fit(X_train, y_train)
print('Grid best parameter: ', grid_model_1.best_params_)
print('Grid best score: ', grid_model_1.best_score_.round(3))
Grid best parameter: {'model__alpha': 10}
Grid best score: 0.238
#lasso regression
grid_values = {'model__alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 200, 300, 400, 500]}
model_2 = Pipeline([('column_transformer', column_transformer),
('model', Lasso(max_iter = 100000))])
# default metric to optimize over grid parameters: accuracy
grid_model_2 = GridSearchCV(model_2, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_2 = grid_model_2.fit(X_train, y_train)
print('Grid best parameter: ', grid_model_2.best_params_)
print('Grid best score: ', grid_model_2.best_score_.round(3))
Grid best parameter: {'model__alpha': 0.001}
Grid best score: 0.237
#KNeighborsRegressor
grid_values = {'model__n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25]}
model_3 = Pipeline([('column_transformer', column_transformer),
('model', KNeighborsRegressor())])
# default metric to optimize over grid parameters: accuracy
grid_model_3 = GridSearchCV(model_3, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_3 = grid_model_3.fit(X_train, y_train)
print('Grid best parameter: ', grid_model_3.best_params_)
print('Grid best score: ', grid_model_3.best_score_.round(3))
Grid best parameter: {'model__n_neighbors': 9}
Grid best score: 0.286
#SVR linear
grid_values = {'model__C': [0.01, 0.1, 1, 10, 100]}
model_4 = Pipeline([('column_transformer', column_transformer),
('model', SVR(kernel='linear'))])
# default metric to optimize over grid parameters: accuracy
grid_model_4 = GridSearchCV(model_4, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_4 = grid_model_4.fit(X_train, y_train)
print('Grid best parameter: ', grid_model_4.best_params_)
print('Grid best score: ', grid_model_4.best_score_.round(3))
Grid best parameter: {'model__C': 0.1}
Grid best score: 0.233
#SVR rbf
grid_values = {'model__gamma': [0.01, 0.1, 1, 10, 100],
'model__C': [0.01, 0.1, 1, 10, 100]}
model_5 = Pipeline([('column_transformer', column_transformer),
('model', SVR(kernel='rbf'))])
# default metric to optimize over grid parameters: accuracy
grid_model_5 = GridSearchCV(model_5, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_5 = grid_model_5.fit(X_train, y_train)
print('Grid best parameter: ', grid_model_5.best_params_)
print('Grid best score: ', grid_model_5.best_score_.round(3))
Grid best parameter: {'model__C': 10, 'model__gamma': 0.1}
Grid best score: 0.339
#DecisionTreeRegressor
grid_values = {'model__max_depth': [2, 3, 4, 5, 6, 7, 8]}
model_6 = Pipeline([('column_transformer', column_transformer),
('model', DecisionTreeRegressor())])
# default metric to optimize over grid parameters: accuracy
grid_model_6 = GridSearchCV(model_6, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_6 = grid_model_6.fit(X_train, y_train)
print('Grid best parameter: ', grid_model_6.best_params_)
print('Grid best score: ', grid_model_6.best_score_.round(3))
Grid best parameter: {'model__max_depth': 5}
Grid best score: 0.291
#RandomForestRegressor
grid_values = {'model__max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10],
'model__n_estimators': [50, 75, 100, 125, 150, 200, 250]}
model_7 = Pipeline([('column_transformer', column_transformer),
('model', RandomForestRegressor())])
# default metric to optimize over grid parameters: accuracy
grid_model_7 = GridSearchCV(model_7, param_grid = grid_values, cv=5, scoring = 'r2')
grid_model_7 = grid_model_7.fit(X_train, y_train)
print('Grid best parameter: ', grid_model_7.best_params_)
print('Grid best score: ', grid_model_7.best_score_.round(3))
Grid best parameter: {'model__max_depth': 9, 'model__n_estimators': 100}
Grid best score: 0.424
#VotingRegressor
model_8 = VotingRegressor([('ridge', grid_model_1), ('lasso', grid_model_2), ('knn', grid_model_3), ('svr_linear', grid_model_4),
('svr_rbf', grid_model_5), ('tree', grid_model_6), ('rf', grid_model_7)])
model_8 = model_8.fit(X_train, y_train)
cv_score = cross_val_score(model_8, X_train, y_train, cv=5, scoring = 'r2').mean().round(3)
print('CV score', cv_score)
CV score 0.363
4.3 Model Comparison and Selection
#model performance summary on cross validation folds
models = ['Ridge',
'Lasso',
'KNeighborsRegressor',
'SVR linear',
'SVR rbf',
'Decision Tree',
'RandomForestRegressor',
'VotingRegressor']
scores = [grid_model_1.best_score_.round(3),
grid_model_2.best_score_.round(3),
grid_model_3.best_score_.round(3),
grid_model_4.best_score_.round(3),
grid_model_5.best_score_.round(3),
grid_model_6.best_score_.round(3),
grid_model_7.best_score_.round(3),
cv_score.round(3)]
model_scores = pd.DataFrame(data= scores, columns = ['CV_r2'], index = models)
print('The average cross validation performance of all trained models was as below')
model_scores
The average cross validation performance of all trained models was as below
CV_r2 | |
---|---|
Ridge | 0.238 |
Lasso | 0.237 |
KNeighborsRegressor | 0.286 |
SVR linear | 0.233 |
SVR rbf | 0.339 |
Decision Tree | 0.291 |
RandomForestRegressor | 0.424 |
VotingRegressor | 0.363 |
#get the best model as per cross validation performance
print('The model selected with the best Cross Validation performance was:', model_scores.index[np.argmax(model_scores['CV_r2'])])
model_scores.iloc[[np.argmax(model_scores['CV_r2'])]]
The model selected with the best Cross Validation performance was: RandomForestRegressor
CV_r2 | |
---|---|
RandomForestRegressor | 0.424 |
4.4 Out of Sample Performance Estimation
#model dictionary
model_dict = {'Ridge': grid_model_1,
'Lasso': grid_model_2,
'KNeighborsRegressor': grid_model_3,
'SVR linear': grid_model_4,
'SVR rbf': grid_model_5,
'Decision Tree': grid_model_6,
'RandomForestRegressor': grid_model_7,
'VotingRegressor': model_8
}
#test set performance best models
print('The best model reaches below out of sample test set performance:')
model_dict[model_scores.index[np.argmax(model_scores['CV_r2'])]].score(X_test, y_test).round(3)
The best model reaches below out of sample test set performance:
0.473