The S&P 500 is a stock market index that measures the stock performance of 500 large companies listed on stock exchanges in the United States. It is one of the most commonly followed equity indices. For more information, refer to the link below: https://en.wikipedia.org/wiki/S%26P_500_Index
A traditional way to gauge whether a stock is expensive or cheap is to use a valuation multiple. A valuation multiple relates the market value of an asset to a key statistic that is assumed to drive that value. To be useful, that statistic – whether earnings, cash flow or some other measure – must bear a logical relationship to the market value observed; to be seen, in fact, as the driver of that market value. For more information about valuation multiples, refer to the link below: https://en.wikipedia.org/wiki/Valuation_using_multiples#Valuation_multiples
The most famous valuation multiple is the PE (price-to-earnings) ratio. This is the market value of all shares outstanding of a company (its market capitalization) divided by the company's net income. It is often also defined as price per share divided by earnings per share, which simply divides both the market capitalization and the net income by the number of shares outstanding. In this project we will try to predict the PE ratios of companies in the S&P 500, based on fundamental financial metrics that should be driving these ratios.
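To make the equivalence concrete with made-up numbers: a company with a market capitalization of 100 billion and net income of 5 billion trades at a PE of 20; with 1 billion shares outstanding, that is a price per share of 100 and earnings per share of 5, which is again a PE of 20.

$$PE = \frac{\text{Market Capitalization}}{\text{Net Income}} = \frac{\text{Market Capitalization} / \text{Shares}}{\text{Net Income} / \text{Shares}} = \frac{\text{Price per share}}{\text{EPS}}$$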
While working through this project, I will follow a slightly adjusted machine learning project checklist from Aurélien Géron's book "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow". (Géron, Aurélien. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (p. 35). O'Reilly Media. Kindle Edition.)
As with all coding posts, the full Jupyter notebook can be found in my GitHub repo below: https://github.com/John-Land/S-P_500_PE_Ratio_Prediction
Before analysing our dataset, let us first try to understand, based on financial theory, which fundamental financial metrics of a company should be driving its PE ratio.
The present value of any Cash Flow generating asset is the value of all future cash flows the asset will generate over its lifetime, discounted back to today with a risk adjusted interest rate.
In its simplest form, where we expect the cash flow to grow at the same rate forever, this is given by the perpetuity formula below.
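In symbols, with $CF_1$ the cash flow expected next year, $r$ the risk-adjusted discount rate and $g$ the constant growth rate:

$$\text{Value}_0 = \frac{CF_1}{r - g}$$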
When trying to value a company based on its future cash flows, we can either look at it through the lens of all capital providers (equity and debt providers), in which case we use the FCFF (free cash flow to the firm) as the measure of cash flow (CF), or look at it purely as an equity investor, in which case we use the FCFE (free cash flow to equity) as our measure of CF.
The main difference is that the FCFF is the cash flow left over after the cash flow needs for operations and investments are met, but before any debt repayments, whereas the FCFE is measured after debt payments, i.e. the cash flow left over for equity investors.
If we want to estimate the value of Equity directly, we would therefore use the FCFE as the appropriate measure of CF.
As the market capitalization (MC), the numerator in the PE ratio, is the market's aggregate estimate of the value of equity, we can rewrite the above formula as follows:
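Using the FCFE as the cash flow measure and the cost of equity $k_e = r_f + \beta \cdot ERP$ (the risk-free rate plus beta times the equity risk premium) as the risk-adjusted discount rate, and then dividing both sides by net income (NI), gives:

$$MC = \frac{FCFE_1}{k_e - g} \quad\Longrightarrow\quad PE = \frac{MC}{NI} = \frac{FCFE_1 / NI}{r_f + \beta \cdot ERP - g}$$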
This formula gives us some understanding of what we would expect the fundamental drivers of the PE ratio to be. The fundamental drivers of the PE ratio are:

1. The expected growth rate of earnings (g)
2. The expected profitability, i.e. the net income margins the company can sustain
3. The riskiness of the cash flows, captured by the company's beta
4. The company's debt load, which determines how much cash flow is left over for equity holders (FCFE relative to net income)
5. The risk-free rate (rf)
6. The equity risk premium (ERP)
The risk-free rate rf (traditionally measured by the US T-Bond rate) and the equity risk premium ERP (the premium equity investors require on top of the risk-free rate for investing in the overall stock market) are the same for all stocks. Changes in these overall market metrics will therefore affect all PE ratios, but they will not help differentiate between the PE ratios of individual companies.
We will therefore try to obtain financial metrics related to points 1-4, as we would expect these to be partial drivers of the PE ratios seen in the market.
The company data was downloaded from the data provider "Finbox". For more information, refer to the link below: https://finbox.com
We will first import the data and check for any missing values and some basic information. Then we will remove outliers before splitting the data into training and testing sets, and analysing the training set in more detail.
# linear algebra
import numpy as np
# data processing
import pandas as pd
#data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
company_data = pd.read_excel("Financials - unlinked.xlsx")
company_data.shape
(531, 38)
company_data.head(5)
Ticker | Full_Ticker | Name | Sector | Industry_GICS | Index_Membership | Market_Cap | Net_Debt | Minority_Interest | EV | ... | EV_TTM_EBITDA | Market_Cap_TTM_Net_Income | Market_Cap_BV_Equity | Interest_Expense_TTM | Interest_Expense_TTM_%_TTM_Net_Income | Cash_from_Investing | Cash_from_Investing_%_TTM_Net_Income | Revenue_TTM | Cash_from_Investing_%_TTM_Revenue | Interest_Expense_TTM_%_Revenue_TTM | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MSFT | NASDAQGS:MSFT | Microsoft Corporation | Information Technology | Software | Dow Jones Composite Average, Dow Jones Industr... | 1.625176e+06 | -54382.0 | 0.0 | 1.570794e+06 | ... | 24.070147 | 36.701423 | 13.737284 | 89.0 | -0.002010 | -12223.0 | -0.276033 | 143015.0 | -0.085467 | 0.000622 |
1 | AAPL | NASDAQGS:AAPL | Apple Inc. | Information Technology | Technology Hardware, Storage & Peripherals | Dow Jones Composite Average, Dow Jones Industr... | 1.979619e+06 | -71431.0 | 0.0 | 1.908188e+06 | ... | 24.255285 | 33.883653 | 27.387435 | 1052.0 | -0.018006 | -10618.0 | -0.181740 | 273857.0 | -0.038772 | 0.003841 |
2 | AMZN | NASDAQGS:AMZN | Amazon.com, Inc. | Consumer Discretionary | Internet & Direct Marketing Retail | Nasdaq 100, Nasdaq Composite, Russell 1000, Ru... | 1.591026e+06 | 20010.0 | 0.0 | 1.611036e+06 | ... | 40.708424 | 120.715166 | 21.579670 | -885.0 | 0.067147 | -35307.0 | -2.678832 | 321782.0 | -0.109723 | -0.002750 |
3 | GOOG | NASDAQGS:GOOG | Alphabet Inc. | Communication Services | Interactive Media & Services | Nasdaq 100, Nasdaq Composite, Russell 1000, Ru... | 1.095684e+06 | -104937.0 | 0.0 | 9.907473e+05 | ... | 21.521609 | 34.746124 | 5.284940 | 2197.0 | -0.069671 | -23943.0 | -0.759276 | 166030.0 | -0.144209 | 0.013233 |
4 | GOOG.L | NASDAQGS:GOOG.L | Alphabet Inc. | Communication Services | Interactive Media & Services | Nasdaq 100, Nasdaq Composite, Russell 1000, Ru... | 1.095684e+06 | -104937.0 | 0.0 | 9.907473e+05 | ... | 21.521609 | 34.746124 | 5.284940 | 2197.0 | -0.069671 | -23943.0 | -0.759276 | 166030.0 | -0.144209 | 0.013233 |
5 rows × 38 columns
We will first remove any companies with market capitalization <= 0 or net income <= 0, as the PE ratio is not meaningful for these companies. We also filter out the Financials sector, as our measure of debt is not applicable to financial companies, and we remove duplicates so that the same company does not appear twice in our data.
company_data = company_data[(company_data.Market_Cap > 0)]
company_data = company_data[(company_data.Net_Income > 0)]
company_data = company_data[(company_data.Sector != 'Financials')]
company_data = company_data.drop_duplicates('Name' , keep='first')
pe_data = company_data[['Sector',
'Name',
'Market_Cap_TTM_Net_Income',
'Net_Income_Forecast_CAGR_10y',
'Avg_Net_Income_Margin_Forecast_10y',
'Beta_5y',
'Net_Debt_perc_EV'
]].set_index('Name')
pe_data.head(5)
Sector | Market_Cap_TTM_Net_Income | Net_Income_Forecast_CAGR_10y | Avg_Net_Income_Margin_Forecast_10y | Beta_5y | Net_Debt_perc_EV | |
---|---|---|---|---|---|---|
Name | ||||||
Microsoft Corporation | Information Technology | 36.701423 | 0.130946 | 0.342895 | 0.923331 | -0.034621 |
Apple Inc. | Information Technology | 33.883653 | 0.0688865 | 0.220838 | 1.314396 | -0.037434 |
Amazon.com, Inc. | Consumer Discretionary | 120.715166 | 0.284554 | 0.098539 | 1.353006 | 0.012421 |
Alphabet Inc. | Communication Services | 34.746124 | 0.112474 | 0.195844 | 1.106453 | -0.105917 |
Facebook, Inc. | Communication Services | 33.685330 | 0.162782 | 0.313488 | 1.295642 | -0.063398 |
The code below ensures that all numerical variables are of numeric type.
pe_data['Market_Cap_TTM_Net_Income'] = pe_data['Market_Cap_TTM_Net_Income'].apply(pd.to_numeric, errors='coerce')
pe_data['Net_Income_Forecast_CAGR_10y'] = pe_data['Net_Income_Forecast_CAGR_10y'].apply(pd.to_numeric, errors='coerce')
pe_data['Avg_Net_Income_Margin_Forecast_10y'] = pe_data['Avg_Net_Income_Margin_Forecast_10y'].apply(pd.to_numeric, errors='coerce')
pe_data['Beta_5y'] = pe_data['Beta_5y'].apply(pd.to_numeric, errors='coerce')
pe_data['Net_Debt_perc_EV'] = pe_data['Net_Debt_perc_EV'].apply(pd.to_numeric, errors='coerce')
pe_data.shape
(373, 6)
After removing duplicates, companies with market capitalization <= 0 or net income <= 0, and companies in the Financials sector, the total sample size drops from 531 to 373.
pe_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 373 entries, Microsoft Corporation to Alliance Data Systems Corporation
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sector 373 non-null object
1 Market_Cap_TTM_Net_Income 373 non-null float64
2 Net_Income_Forecast_CAGR_10y 365 non-null float64
3 Avg_Net_Income_Margin_Forecast_10y 373 non-null float64
4 Beta_5y 373 non-null float64
5 Net_Debt_perc_EV 373 non-null float64
dtypes: float64(5), object(1)
memory usage: 20.4+ KB
pe_data.isna().sum()
Sector 0
Market_Cap_TTM_Net_Income 0
Net_Income_Forecast_CAGR_10y 8
Avg_Net_Income_Margin_Forecast_10y 0
Beta_5y 0
Net_Debt_perc_EV 0
dtype: int64
The dataset is missing 8 values in the variable ‘Net_Income_Forecast_CAGR_10y’.
pe_data = pe_data.dropna()
pe_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 365 entries, Microsoft Corporation to Alliance Data Systems Corporation
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sector 365 non-null object
1 Market_Cap_TTM_Net_Income 365 non-null float64
2 Net_Income_Forecast_CAGR_10y 365 non-null float64
3 Avg_Net_Income_Margin_Forecast_10y 365 non-null float64
4 Beta_5y 365 non-null float64
5 Net_Debt_perc_EV 365 non-null float64
dtypes: float64(5), object(1)
memory usage: 20.0+ KB
pe_data.isna().sum()
Sector 0
Market_Cap_TTM_Net_Income 0
Net_Income_Forecast_CAGR_10y 0
Avg_Net_Income_Margin_Forecast_10y 0
Beta_5y 0
Net_Debt_perc_EV 0
dtype: int64
pe_data.shape
(365, 6)
After dropping the 8 missing values, the sample size drops to 365 companies.
We will look at the individual histograms of all variables to check for potential outliers, and later use Gaussian mixture models to remove these outliers.
sns.histplot(pe_data[['Market_Cap_TTM_Net_Income']])
<AxesSubplot:ylabel='Count'>
Outlier removal before model training would be reasonable
sns.histplot(np.log(pe_data[['Market_Cap_TTM_Net_Income']]))
<AxesSubplot:ylabel='Count'>
Predicting log of PE instead of PE would be reasonable
Outlier removal before model training would be reasonable
sns.histplot(pe_data[['Net_Income_Forecast_CAGR_10y']])
<AxesSubplot:ylabel='Count'>
No outlier removal appears necessary
sns.histplot(pe_data[['Avg_Net_Income_Margin_Forecast_10y']])
<AxesSubplot:ylabel='Count'>
No outlier removal appears necessary
sns.histplot(pe_data[['Beta_5y']])
<AxesSubplot:ylabel='Count'>
No outlier removal appears necessary
sns.histplot(pe_data[['Net_Debt_perc_EV']])
<AxesSubplot:ylabel='Count'>
Outlier removal before model training would be reasonable
Based on the histograms of the individual variables, we conclude that there are extreme outliers in the PE ratio and Net debt % EV variables.
The code below uses a GaussianMixture model to remove the most extreme 5% of samples, defined as the 5% of samples with the lowest estimated density.
from sklearn.mixture import GaussianMixture
pe_ratios_train = pe_data[['Market_Cap_TTM_Net_Income']]
gm = GaussianMixture(n_components=1)
gm.fit(pe_ratios_train)
densities = gm.score_samples(pe_ratios_train)
threshold = np.percentile(densities, 5)
anomalies_train = pe_ratios_train[densities < threshold]
anomalies_train.round(2)
Market_Cap_TTM_Net_Income | |
---|---|
Name | |
Fidelity National Information Services, Inc. | 2852.57 |
JD.com, Inc. | 5.83 |
Biogen Inc. | 8.08 |
NetEase, Inc. | 2.53 |
Sempra Energy | 8.69 |
Kinder Morgan, Inc. | 240.45 |
eBay Inc. | 7.28 |
Sysco Corporation | 154.07 |
IQVIA Holdings Inc. | 184.51 |
Zimmer Biomet Holdings, Inc. | 807.04 |
Chipotle Mexican Grill, Inc. | 153.36 |
The Williams Companies, Inc. | 188.15 |
Hilton Worldwide Holdings Inc. | 513.24 |
Hewlett Packard Enterprise Company | 12169.19 |
NortonLifeLock Inc. | 3.07 |
NRG Energy, Inc. | 1.93 |
MGM Resorts International | 5.80 |
Kimco Realty Corporation | 4.81 |
Xerox Holdings Corporation | 3.98 |
The GaussianMixture model detected some very low PE values and, as expected from the initial histogram of PE ratios, some extreme values on the upper side, such as Hewlett Packard Enterprise with a PE of 12,169 and Fidelity National Information Services with a PE of 2,852.
The code below removes these extreme values from our sample.
pe_data_new = pe_data[densities >= threshold]
from sklearn.mixture import GaussianMixture
pe_ratios_train = pe_data_new[['Net_Debt_perc_EV']]
gm = GaussianMixture(n_components=1)
gm.fit(pe_ratios_train)
densities = gm.score_samples(pe_ratios_train)
threshold = np.percentile(densities, 5)
anomalies_train = pe_ratios_train[densities < threshold]
anomalies_train.round(3)
Net_Debt_perc_EV | |
---|---|
Name | |
Walgreens Boots Alliance, Inc. | 0.552 |
Baidu, Inc. | 3.312 |
General Motors Company | 0.628 |
FirstEnergy Corp. | 0.552 |
Simon Property Group, Inc. | 0.541 |
PPL Corporation | 0.511 |
Arista Networks, Inc. | -0.200 |
Garmin Ltd. | -0.164 |
ONEOK, Inc. | 0.502 |
CarMax, Inc. | 0.502 |
ViacomCBS Inc. | 0.508 |
Mylan N.V. | 0.597 |
WestRock Company | 0.499 |
The AES Corporation | 0.553 |
Iron Mountain Incorporated | 0.573 |
SL Green Realty Corp. | 0.586 |
Harley-Davidson, Inc. | 0.688 |
Alliance Data Systems Corporation | 0.862 |
The GaussianMixture model detected a few very low (negative) Net debt % EV values and, as expected from the initial histogram, some extreme values on the upper side.
The code below removes these extreme values from our sample.
pe_data_new = pe_data_new[densities >= threshold]
print('The sample size after outlier removal is', pe_data_new.shape[0],'. In total', pe_data.shape[0] - pe_data_new.shape[0], 'outliers were removed.')
The sample size after outlier removal is 328 . In total 37 outliers were removed.
The code below shows the histograms after outlier removal.
sns.histplot(pe_data_new[['Market_Cap_TTM_Net_Income']])
<AxesSubplot:ylabel='Count'>
We will use the log of the PE ratio instead of the raw PE ratio, to make the data more Gaussian.
sns.histplot(np.log(pe_data_new[['Market_Cap_TTM_Net_Income']]))
<AxesSubplot:ylabel='Count'>
sns.histplot(pe_data_new[['Net_Debt_perc_EV']])
<AxesSubplot:ylabel='Count'>
The code below splits our sample into training and testing sets. We will use a 70% training and 30% testing split.
Note that we also transform our target PE ratios by taking the log.
from sklearn.model_selection import train_test_split
X = pe_data_new[['Sector', 'Net_Income_Forecast_CAGR_10y', 'Avg_Net_Income_Margin_Forecast_10y', 'Beta_5y', 'Net_Debt_perc_EV']]
Y = np.log(pe_data_new[['Market_Cap_TTM_Net_Income']])
Y = Y.rename(columns={'Market_Cap_TTM_Net_Income': 'Log_Market_Cap_TTM_Net_Income'})
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, shuffle=True, test_size = 0.3, random_state=5)
X_train.head(5)
Sector | Net_Income_Forecast_CAGR_10y | Avg_Net_Income_Margin_Forecast_10y | Beta_5y | Net_Debt_perc_EV | |
---|---|---|---|---|---|
Name | |||||
AMETEK, Inc. | Industrials | 0.052889 | 0.212410 | 1.278543 | 0.072059 |
Fox Corporation | Communication Services | 0.001419 | 0.073721 | 0.000000 | 0.188264 |
Ralph Lauren Corporation | Consumer Discretionary | 0.034982 | 0.074732 | 1.206839 | 0.181728 |
The Sherwin-Williams Company | Materials | 0.106755 | 0.150060 | 1.137892 | 0.158060 |
Roper Technologies, Inc. | Industrials | 0.110710 | 0.288177 | 1.059018 | 0.081942 |
Y_train.head(5)
Log_Market_Cap_TTM_Net_Income | |
---|---|
Name | |
AMETEK, Inc. | 3.320017 |
Fox Corporation | 2.797414 |
Ralph Lauren Corporation | 3.698240 |
The Sherwin-Williams Company | 3.461179 |
Roper Technologies, Inc. | 3.326395 |
train_data = Y_train.merge(X_train, how='outer', left_index=True, right_index=True)
test_data = Y_test.merge(X_test, how='outer', left_index=True, right_index=True)
train_data.describe().round(2)
Log_Market_Cap_TTM_Net_Income | Net_Income_Forecast_CAGR_10y | Avg_Net_Income_Margin_Forecast_10y | Beta_5y | Net_Debt_perc_EV | |
---|---|---|---|---|---|
count | 229.00 | 229.00 | 229.00 | 229.00 | 229.00 |
mean | 3.39 | 0.08 | 0.18 | 0.94 | 0.15 |
std | 0.59 | 0.08 | 0.10 | 0.44 | 0.15 |
min | 2.19 | -0.20 | -0.01 | 0.00 | -0.15 |
25% | 3.01 | 0.03 | 0.10 | 0.65 | 0.04 |
50% | 3.32 | 0.07 | 0.16 | 0.96 | 0.13 |
75% | 3.70 | 0.11 | 0.25 | 1.22 | 0.25 |
max | 5.03 | 0.67 | 0.57 | 2.31 | 0.48 |
Below we will analyse the training data and look at how the individual variables relate to our target, the log of the PE ratio.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="white")
# Compute the correlation matrix
corr = train_data.corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
<AxesSubplot:>
Looking at the first column of the correlation matrix, we can see how the individual variables are correlated with our target, the log of the PE ratio.
train_data.corr().round(2)
Log_Market_Cap_TTM_Net_Income | Net_Income_Forecast_CAGR_10y | Avg_Net_Income_Margin_Forecast_10y | Beta_5y | Net_Debt_perc_EV | |
---|---|---|---|---|---|
Log_Market_Cap_TTM_Net_Income | 1.00 | 0.43 | 0.27 | 0.08 | -0.36 |
Net_Income_Forecast_CAGR_10y | 0.43 | 1.00 | 0.28 | 0.14 | -0.29 |
Avg_Net_Income_Margin_Forecast_10y | 0.27 | 0.28 | 1.00 | -0.06 | -0.41 |
Beta_5y | 0.08 | 0.14 | -0.06 | 1.00 | -0.12 |
Net_Debt_perc_EV | -0.36 | -0.29 | -0.41 | -0.12 | 1.00 |
sns.set_theme()
sns.jointplot(data=train_data, x="Net_Income_Forecast_CAGR_10y", y="Log_Market_Cap_TTM_Net_Income")
<seaborn.axisgrid.JointGrid at 0x1f130a78e08>
sns.jointplot(data=train_data, x="Avg_Net_Income_Margin_Forecast_10y", y="Log_Market_Cap_TTM_Net_Income")
<seaborn.axisgrid.JointGrid at 0x1f12eb37148>
sns.jointplot(data=train_data, x="Beta_5y", y="Log_Market_Cap_TTM_Net_Income")
<seaborn.axisgrid.JointGrid at 0x1f130f72dc8>
sns.jointplot(data=train_data, x="Net_Debt_perc_EV", y="Log_Market_Cap_TTM_Net_Income")
<seaborn.axisgrid.JointGrid at 0x1f1310cf608>
Before training our machine learning algorithms we will do two pre-processing steps: one-hot encode the categorical Sector variable and standardize the numerical features.
The data will also be split into X and Y variables, Y being the log of the PE ratio which is to be predicted, and X being the features used to predict it.
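As an aside, the same two steps could be expressed more compactly with scikit-learn's ColumnTransformer. Below is a minimal sketch of that alternative (the names X_train_alt and X_test_alt are only illustrative); the cells that follow instead fit the one-hot encoder and the scaler separately.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
num_cols = ['Net_Income_Forecast_CAGR_10y', 'Avg_Net_Income_Margin_Forecast_10y', 'Beta_5y', 'Net_Debt_perc_EV']
cat_cols = ['Sector']
# scale the numeric features and one-hot encode the Sector variable in one transformer
preprocess = ColumnTransformer([('num', StandardScaler(), num_cols),
                                ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)])
# fit on the training data only, then transform both training and testing sets
X_train_alt = preprocess.fit_transform(train_data[num_cols + cat_cols])
X_test_alt = preprocess.transform(test_data[num_cols + cat_cols])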
from sklearn.preprocessing import OneHotEncoder
train_data_cat = train_data[['Sector']]
test_data_cat = test_data[['Sector']]
#fit one hot encoder to training set
one_hot = OneHotEncoder(handle_unknown='ignore').fit(train_data_cat)
#transform training set
train_data_cat_one_hot = pd.DataFrame(one_hot.transform(train_data_cat).toarray())
train_data_cat_one_hot.columns = one_hot.get_feature_names(train_data_cat.columns)
train_data_cat_one_hot['Name'] = train_data_cat.index
train_data_cat_one_hot = train_data_cat_one_hot.set_index('Name')
#transform testing set
test_data_cat_one_hot = pd.DataFrame(one_hot.transform(test_data_cat).toarray())
test_data_cat_one_hot.columns = one_hot.get_feature_names(test_data_cat.columns)
test_data_cat_one_hot['Name'] = test_data_cat.index
test_data_cat_one_hot = test_data_cat_one_hot.set_index('Name')
train_data_cat_one_hot.head(5)
Sector_Communication Services | Sector_Consumer Discretionary | Sector_Consumer Staples | Sector_Energy | Sector_Healthcare | Sector_Industrials | Sector_Information Technology | Sector_Materials | Sector_Real Estate | Sector_Utilities | |
---|---|---|---|---|---|---|---|---|---|---|
Name | ||||||||||
AMETEK, Inc. | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Fox Corporation | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Ralph Lauren Corporation | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
The Sherwin-Williams Company | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
Roper Technologies, Inc. | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
test_data_cat_one_hot.head(5)
Sector_Communication Services | Sector_Consumer Discretionary | Sector_Consumer Staples | Sector_Energy | Sector_Healthcare | Sector_Industrials | Sector_Information Technology | Sector_Materials | Sector_Real Estate | Sector_Utilities | |
---|---|---|---|---|---|---|---|---|---|---|
Name | ||||||||||
Eastman Chemical Company | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
Allegion plc | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Hormel Foods Corporation | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Extra Space Storage Inc. | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
Target Corporation | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
from sklearn.preprocessing import StandardScaler
train_data_num = train_data[['Net_Income_Forecast_CAGR_10y','Avg_Net_Income_Margin_Forecast_10y','Beta_5y', 'Net_Debt_perc_EV']]
test_data_num = test_data[['Net_Income_Forecast_CAGR_10y','Avg_Net_Income_Margin_Forecast_10y','Beta_5y', 'Net_Debt_perc_EV']]
#fit standard scaler to training set
scaler = StandardScaler().fit(train_data_num)
#transform training set
train_data_num_scaled = pd.DataFrame(scaler.transform(train_data_num), columns = train_data_num.columns)
train_data_num_scaled['Name'] = train_data.index
train_data_num_scaled = train_data_num_scaled.set_index('Name')
#transform testing set
test_data_num_scaled = pd.DataFrame(scaler.transform(test_data_num), columns = test_data_num.columns)
test_data_num_scaled['Name'] = test_data.index
test_data_num_scaled = test_data_num_scaled.set_index('Name')
train_data_num_scaled.head(5)
Net_Income_Forecast_CAGR_10y | Avg_Net_Income_Margin_Forecast_10y | Beta_5y | Net_Debt_perc_EV | |
---|---|---|---|---|
Name | ||||
AMETEK, Inc. | -0.343975 | 0.322551 | 0.769134 | -0.517736 |
Fox Corporation | -0.967412 | -1.005990 | -2.133483 | 0.282818 |
Ralph Lauren Corporation | -0.560871 | -0.996304 | 0.606348 | 0.237788 |
The Sherwin-Williams Company | 0.308487 | -0.274720 | 0.449821 | 0.074738 |
Roper Technologies, Inc. | 0.356389 | 1.048336 | 0.270757 | -0.449652 |
train_data_num_scaled.describe().round(2)
Net_Income_Forecast_CAGR_10y | Avg_Net_Income_Margin_Forecast_10y | Beta_5y | Net_Debt_perc_EV | |
---|---|---|---|---|
count | 229.00 | 229.00 | 229.00 | 229.00 |
mean | 0.00 | -0.00 | -0.00 | -0.00 |
std | 1.00 | 1.00 | 1.00 | 1.00 |
min | -3.38 | -1.85 | -2.13 | -2.03 |
25% | -0.56 | -0.76 | -0.66 | -0.73 |
50% | -0.13 | -0.18 | 0.05 | -0.10 |
75% | 0.32 | 0.65 | 0.63 | 0.72 |
max | 7.10 | 3.75 | 3.11 | 2.32 |
X_train = train_data_num_scaled.merge(train_data_cat_one_hot, how='outer', left_index=True, right_index=True)
X_train.head(5)
Net_Income_Forecast_CAGR_10y | Avg_Net_Income_Margin_Forecast_10y | Beta_5y | Net_Debt_perc_EV | Sector_Communication Services | Sector_Consumer Discretionary | Sector_Consumer Staples | Sector_Energy | Sector_Healthcare | Sector_Industrials | Sector_Information Technology | Sector_Materials | Sector_Real Estate | Sector_Utilities | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Name | ||||||||||||||
AMETEK, Inc. | -0.343975 | 0.322551 | 0.769134 | -0.517736 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Fox Corporation | -0.967412 | -1.005990 | -2.133483 | 0.282818 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Ralph Lauren Corporation | -0.560871 | -0.996304 | 0.606348 | 0.237788 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
The Sherwin-Williams Company | 0.308487 | -0.274720 | 0.449821 | 0.074738 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
Roper Technologies, Inc. | 0.356389 | 1.048336 | 0.270757 | -0.449652 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Y_train = train_data[['Log_Market_Cap_TTM_Net_Income']]
Y_train.head(5)
Log_Market_Cap_TTM_Net_Income | |
---|---|
Name | |
AMETEK, Inc. | 3.320017 |
Fox Corporation | 2.797414 |
Ralph Lauren Corporation | 3.698240 |
The Sherwin-Williams Company | 3.461179 |
Roper Technologies, Inc. | 3.326395 |
X_test = test_data_num_scaled.merge(test_data_cat_one_hot, how='outer', left_index=True, right_index=True)
X_test.head(5)
Net_Income_Forecast_CAGR_10y | Avg_Net_Income_Margin_Forecast_10y | Beta_5y | Net_Debt_perc_EV | Sector_Communication Services | Sector_Consumer Discretionary | Sector_Consumer Staples | Sector_Energy | Sector_Healthcare | Sector_Industrials | Sector_Information Technology | Sector_Materials | Sector_Real Estate | Sector_Utilities | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Name | ||||||||||||||
Eastman Chemical Company | 2.163536 | -0.066516 | 1.378961 | 1.212203 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
Allegion plc | -0.299303 | -0.073863 | 0.428535 | -0.234235 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Hormel Foods Corporation | -0.557665 | -0.728099 | -2.168545 | -1.114877 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Extra Space Storage Inc. | -0.840145 | 1.915674 | -1.576251 | 0.767138 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
Target Corporation | -0.384196 | -1.276158 | -0.202580 | -0.285802 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Y_test = test_data[['Log_Market_Cap_TTM_Net_Income']]
Y_test.head(5)
Log_Market_Cap_TTM_Net_Income | |
---|---|
Name | |
Eastman Chemical Company | 3.005154 |
Allegion plc | 3.491612 |
Hormel Foods Corporation | 3.355020 |
Extra Space Storage Inc. | 3.559280 |
Target Corporation | 3.121729 |
Now it's finally time to train our machine learning models. As this is a regression task, we will consider the following models: ridge regression, k-nearest neighbours regression, support vector regression with linear and RBF kernels, a random forest, and finally a voting ensemble that averages the predictions of all of these models.
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
ridge = Ridge(max_iter = 100000)
grid_values = {'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 200, 300, 400, 500]}
# tune the regularization strength alpha over a grid of values, using r-squared as the cross-validation metric
grid_ridge = GridSearchCV(ridge, param_grid = grid_values, cv=5, scoring = 'r2')
grid_ridge.fit(X_train, Y_train.values.ravel());
print('Grid best parameter: ', grid_ridge.best_params_)
print('Grid best score: ', grid_ridge.best_score_.round(3))
Grid best parameter: {'alpha': 10}
Grid best score: 0.221
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()
grid_values = {'n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25]}
# tune the number of neighbours over a grid of values, using r-squared as the cross-validation metric
grid_knn = GridSearchCV(knn, param_grid = grid_values, cv=5, scoring = 'r2')
grid_knn.fit(X_train, Y_train.values.ravel());
print('Grid best parameter: ', grid_knn.best_params_)
print('Grid best score: ', grid_knn.best_score_.round(3))
Grid best parameter: {'n_neighbors': 7}
Grid best score: 0.234
from sklearn.svm import SVR
svr1 = SVR(kernel='linear')
grid_values = {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 200, 300, 400, 500]}
# tune the regularization parameter C over a grid of values, using r-squared as the cross-validation metric
grid_svr1 = GridSearchCV(svr1, param_grid = grid_values, cv=5, scoring = 'r2')
grid_svr1.fit(X_train, Y_train.values.ravel());
print('Grid best parameter: ', grid_svr1.best_params_)
print('Grid best score: ', grid_svr1.best_score_.round(3))
Grid best parameter: {'C': 400}
Grid best score: 0.202
from sklearn.svm import SVR
svr2 = SVR(kernel='rbf')
grid_values = {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 200, 300, 400, 500]}
# tune the regularization parameter C over a grid of values, using r-squared as the cross-validation metric
grid_svr2 = GridSearchCV(svr2, param_grid = grid_values, cv=5, scoring = 'r2')
grid_svr2.fit(X_train, Y_train.values.ravel());
print('Grid best parameter: ', grid_svr2.best_params_)
print('Grid best score: ', grid_svr2.best_score_.round(3))
Grid best parameter: {'C': 1}
Grid best score: 0.246
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
grid_values = {'max_depth': [2, 3, 4, 5, 6, 7, 8],
'n_estimators': [50, 75, 100, 125, 150]}
# tune the maximum tree depth and number of trees over a grid of values, using r-squared as the cross-validation metric
grid_rf = GridSearchCV(rf, param_grid = grid_values, cv=5, scoring = 'r2')
grid_rf.fit(X_train, Y_train.values.ravel())
print('Grid best parameter: ', grid_rf.best_params_)
print('Grid best score: ', grid_rf.best_score_.round(3))
Grid best parameter: {'max_depth': 8, 'n_estimators': 100}
Grid best score: 0.27
from sklearn.ensemble import VotingRegressor
from sklearn.model_selection import cross_val_score
er = VotingRegressor([('lr_ridge', grid_ridge), ('knn', grid_knn), ('svr_linear', grid_svr1),
('svr_rbf', grid_svr2), ('rf', grid_rf)])
er.fit(X_train, Y_train.values.ravel())
cv_score = cross_val_score(er, X_train, Y_train.values.ravel(), cv=3, scoring = 'r2').mean().round(3)
print('CV score', cv_score)
CV score 0.284
Based on the average cross-validation r-squared (five-fold cross validation for the individual models, three-fold for the voting ensemble), we get the below r-squared for each model.
models = ['Ridge',
          'KNeighborsRegressor',
          'SVR linear',
          'SVR rbf',
          'RandomForestRegressor',
          'VotingRegressor']
scores = [grid_ridge.best_score_.round(3),
grid_knn.best_score_.round(3),
grid_svr1.best_score_.round(3),
grid_svr2.best_score_.round(3),
grid_rf.best_score_.round(3),
cv_score.round(3)]
model_scores = pd.DataFrame(data= scores, columns = ['CV_R_Squared'], index = models)
model_scores
CV_R_Squared | |
---|---|
Ridge | 0.221 |
KNeighborsRegressor | 0.234 |
SVR linear | 0.202 |
SVR rbf | 0.246 |
RandomForestRegressor | 0.270 |
VotingRegressor | 0.284 |
The model with the highest cross-validation r-squared is the voting ensemble, with an r-squared of 28.4%. We will use this model to estimate the out-of-sample performance on the test data.
er.score(X_test, Y_test).round(3)
0.221
Both the cross-validation r-squared in predicting the log of the PE ratios and the out-of-sample r-squared on the test set indicate that our fundamental financial explanatory variables, relating to expected growth, expected margins, beta and the level of debt a company carries, explain only around 22% of the variance in PE ratios out of sample.
This illustrates how noisy financial data is: roughly 78% of the variance in PE ratios remains unexplained by our variables.