Practical portfolio optimization in Python (3/3) – code

Petr Tmej
Jul 15, 2022
11 min read

Portfolio optimization

In the last article of this series, we will go through a fully functional code. You will read here how the process of optimization was applied and backtests created. The code is based on previous articles: Markowitz optimization and machine learning adaptations.

Importing all necessary stuff

I use feather format to store data for usage with pandas; you can install it pip install feather-format.

import pandas as pd, numpy as np, glob
# installation requires cvxpy (for installation on windows see the documentation)
import mlfinlab # version 0.15.0
import pypfopt # version 1.2.6
from mlfinlab.portfolio_optimization.clustering import NestedClusteredOptimisation

Loading data

My data is saved and prepared in the folder. For practical purposes, use your own daily data, resp. download data from Yahoo or Alpha Vantage (see our other articles). We will load data, use only data from 2014, and save them to the directory. Then we simply transform the directory into a DataFrame with a multi-column consisting of symbols and OHLCV.

df = {}
for f in glob.glob('myPathToData/*feather'):
    print(f)
    a = pd.read_feather(f)
    a = a[a['date']>='2014-01-01']
    if len(a) == 0:
        continue
    # windows' users use '\\' instead of '/'
    df[f.split('/')[-1].replace('.feather','')] = a.set_index('date')
    
dd = pd.concat(df.values(), keys=df.keys(), axis=1)
df = dd.swaplevel(axis=1)

Walk-forwarding optimization - basic setup

The whole optimization will be in loot, where we reallocate the portfolio every 3 months. We use Alpha Vantage function to use only stocks that were available for trading during the time of re-optimization – Listing Status. Then we prepare our symbols – top 200 according to median Dollar Volume; see the comments in the code section.

download_path = 'https://www.alphavantage.co/query?function=LISTING_STATUS&date={}&apikey=YourApiKey'
allocation = {}
for q in pd.date_range('2015-01-01','2020-09-01', freq='QS'):
    print(q)
    # load benchmark data - ETFs of S&P 500 and Nasdaq 100, if you have them within your stock universe, or just download them - after the loop
    bench = df['adjusted close'][['SPY','QQQ']]
    # load Stock Listings with specified date
    symbols = pd.read_csv(download_path.format(str(q)[:10]))
    # use only Stocks, not ETFs
    symbols = symbols[
        symbols['assetType']=='Stock'
        ]['symbol'].values.tolist()
    stocks_only = symbols.copy()
    
    # calculate daily dollar volume
    usdvol = df[df.index<q]['close']*df[df.index<q]['volume']
    # use only last year
    usdvol = usdvol[usdvol.index >= q - pd.DateOffset(years=1)]
    # if there were more than 30 missing values we don't use that stock
    usdvol = usdvol.loc[:, usdvol.isna().sum()<30]
    # symbol has to be treadable during the last day
    usdvol = usdvol.loc[:, usdvol.iloc[-1].notna()]
    # symbol has to be within assetType Stock
    usdvol = usdvol.loc[:, usdvol.columns.isin(stocks_only)]
    # finally we calculate median value of 1 year and take top 200
    usdvol = usdvol.median().sort_values()
    symbols = usdvol[-200:].index

Now when we have prepared symbols, let’s prepare in-sample and out-of-sample data for portfolio optimization. In-sample is 1 year before re-optimization, out-of-sample or our simulation is for the next 3 months.

   dd = df['adjusted close'][symbols].copy()
    # in-sample prices + benchmark
    df_is = dd[dd.index<q]
    bench_is = bench[
        (bench.index < q) & 
        (bench.index >= q - pd.DateOffset(years=1))]
    # we use only stock which have less than 5% missing values on IS
    nans = df_is.isna().sum()/len(df_is) # or .isna().mean()
    nans = nans[nans<0.05]
    # we can fill missing values of IS, first we will forward then backward
    df_is = df_is[nans.index]
    df_is = df_is.fillna(method='ffill').fillna(method='bfill')
    # creating OOS dataset, the first index is the same as the last on IS, we explain why later
    last_date = df_is.index[-1]
    df_oos = dd[(dd.index >= last_date) &
        	    (dd.index < q + pd.DateOffset(months=3))][nans.index]

Efficient frontier optimization

Since methodologies from the first article assume that assets are not correlated, we will get rid of assets that have a higher correlation than 75%, and we keep only the ones with highest dollar volume.

  # calculate correlations from pct returns    
    corrs = df_is.pct_change().corr().abs()
    # correlated symbols will be save into list
    to_drop = []
    # columns were sorted from lowest to highest USD volume
    for s in df_is.columns[::-1]:
        if s in to_drop: # if it is already used, skip it
            continue
        # find stocks correlated with s (more than 0.75)
        a = corrs[s].drop(s)[corrs[s].drop(s) > 0.75]
        # add findings to list to_drop
        if len(a) > 0:
            to_drop += a.index.tolist()
    # use only uncorrelated symbols
    df_is = df_is.drop(to_drop,axis=1)
    df_oos = df_oos.drop(to_drop,axis=1)

Now we will go through all used methodologies, the explanation is commented in code.

    #calculate share ratios of in-sample
    is_sharpes = mlfinlab.backtest_statistics.sharpe_ratio(
        df_is.pct_change())
    # calculate expected returns which will be used in the methodologies - we predict for 3 months, so 63 days
    expected_returns = pypfopt.expected_returns.mean_historical_return(
        df_is, frequency=63)
    #covariance matrix is the main part of portfolio optimization
    cov_matrix = df_is.pct_change().cov().values
    symbols = df_is.columns # list of symbols
    
    results = {} # save results into dictionary
    
    # Sharpe Maximization
    opt = pypfopt.EfficientFrontier(
        expected_returns, 
        cov_matrix,
        solver='CVXOPT'
        ).max_sharpe(0) # zero risk-free interest rate
    # the result is OrderedDictionary, so we do classic dict and add it to pandas.Series
    opt = pd.Series({x: opt[x] for x in opt.keys()})
    results['opt_maxSharpe'] = opt.copy()
    
    # Return Maximization
    # input into efficient_risk is maximum volatility we want to achieve, so maximum of benchmark volatilities
    opt = pypfopt.EfficientFrontier(
        expected_returns, 
        cov_matrix,
        solver='CVXOPT'
        ).efficient_risk(
            bench_is.pct_change().std().max()
            )
    opt = pd.Series({x: opt[x] for x in opt.keys()}) # OrderedDictionary
    results['opt_maxReturn'] = opt.copy()
    
    # Volatility Minimization
    # input into efficient_return is the minimal return we want to achieve, so it is the maximum of returns of benchmarks
    opt = pypfopt.EfficientFrontier(
        expected_returns, 
        cov_matrix,
        solver='CVXOPT'
        ).efficient_return(
            pypfopt.expected_returns.mean_historical_return(
                bench,   
                frequency=63).max()
            )
    opt = pd.Series({x: opt[x] for x in opt.keys()}) # OrderedDictionary
    results['opt_minVolatility'] = opt.copy()
       
    # top 15 stocks according to in-sample Sharpe, weighted equally 
    results['max_sharpe'] = (is_sharpes >= is_sharpes.sort_values()[-15]
        )/15
    
    # Uniform distribution of top 100 stocks according to USD volume
    # Series with zero values and add 1 only to top 100 stocks
    results['uniform'] = pd.Series(0, index=is_sharpes.index)
    results['uniform'].loc[
        is_sharpes.index.isin(
        usdvol[-100:].index)] = 1
    # Create uniform weights by dividing by 100, resp number of ones
    results['uniform'] /= results['uniform'].sum()

Black-litterman allocation

Black-Litterman model is a little bit tricky but easy to understand. First, we prepare Q and P matrices, which are the vital part of the formula and represent our views. Our views will be pretty simple: each stock will have the same view as had historical returns saved in variable expected_returns. Q represents the values, expectations, each row in the P matrix represents which component of stocks is connected to the value of Q (i-th row of P, i-th element of Q). If your views are linear combinations of more stocks, look at the previous article or documentation of PyPortfolioOpt. After preparing the BL model, we can use the posterior returns in the efficient frontier optimization or use return-implied weights directly from the result.

    # simple views, first set empty matrices
    Q = np.zeros((len(df_is.columns), 1))
    P = np.zeros((len(df_is.columns), len(df_is.columns)))
    views = dict(expected_returns.round(3).sort_values())
    # now set values
    for i, view_ticker in enumerate(views.keys()):
        # expectation for ith element of Q
        Q[i] = views[view_ticker]
        # ith row and column which is for given stock is 1, other is 0
        P[i, list(df_is.columns).index(view_ticker)] = 1  
	
    # application of BL allocation
    bl = pypfopt.black_litterman.BlackLittermanModel(
        cov_matrix, 
        tickers=df_is.columns, # names of symbols
        pi='equal', # prior estimate for returns - by equal it is uniform distribution
        Q=Q, P=P, # investor's views
        omega='idzorek', # Idzorek's method (see documentation for more info)
        risk_free_rate=0,                           	   
        view_confidences=np.array(len(Q)*[0.25]) # confidence of views - how are we confident, let's just pass 25%
        )
    
    # posterior returns
    rets = bl.bl_returns()
    rets.index = df_is.columns
    # max Sharpe - we just use different returns
    opt = pypfopt.EfficientFrontier(rets, cov_matrix, solver='CVXOPT').max_sharpe(0)
    opt = pd.Series(opt) # OrderedDictionary

    results['BL-maxSharpe'] = opt.copy()
    # max return
    opt = pypfopt.EfficientFrontier(
        rets, cov_matrix, solver='CVXOPT'
        ).efficient_risk(
    	      bench_is.pct_change().std().max()
            ) # max volatility setted as max from benchmarks
    opt = pd.Series(opt) # OrderedDictionary

    results['BL-maxReturn'] = opt.copy()
    # Volatility minimization
    opt = pypfopt.EfficientFrontier(rets, cov_matrix, solver='CVXOPT').efficient_return(
        pypfopt.expected_returns.mean_historical_return(
                bench, frequency=63).max()
            )
    opt = pd.Series(opt) # OrderedDictionary

    results['BL-minVolatility'] = opt.copy()
    
    # return implied weights
    delta = pypfopt.black_litterman.market_implied_risk_aversion(
        df_is, risk_free_rate=0)
    bl.bl_weights(delta.values)
    opt = bl.clean_weights()
    opt = pd.Series(opt) # OrderedDictionary
    # there are also negative weights, we set them for 0
    opt[opt<0] = 0
    # the new weights should sum into 1
    opt /= opt.sum()
    results['BL-retImplied'] = opt.copy()

Critical Line Algorithm

As described it is just another Markowitz’s algorithm from efficient frontier theory. For application we will use the mlfinlab package. The usage is very straightforward.

    cla = mlfinlab.portfolio_optimization.CriticalLineAlgorithm(
        weight_bounds=(0,1))
    
    # Maximum Sharpe Solution
    cla.allocate(asset_prices=df_is, solution='max_sharpe')
    # from this case the result is pd.DataFrame of weights
    results['CLA_maxSharpe'] = cla.weights.loc[0]
    
    # Minimum Variance Solution
    cla = mlfinlab.portfolio_optimization.CriticalLineAlgorithm(
        weight_bounds=(0,1)) # reinicilization of CLA
    cla.allocate(asset_prices=df_is, solution='min_volatility')
    results['CLA_minVolatility']  = cla.weights.loc[0]

    # Turning points
    cla = mlfinlab.portfolio_optimization.CriticalLineAlgorithm(
        weight_bounds=(0,1)) # reinicilization of CLA
    cla.allocate(asset_prices=df_is,solution='cla_turning_points')
    # very time there is different number of turning points, so we choose 5 uniformly by setting numpy arange from 0 to length of turning points, with step size of ⅕ of the length
    for i,j in enumerate(
        np.arange(0, len(cla.weights), int(len(cla.weights) / 5))
        ):
        # i is for new turning point number, j is for real position from CLA allocation
        results['CLA_turnPoint-{}'.format(i + 1)] = cla.weights.iloc[j]

Hierarchical Risk Parity

Here we go for new methodologies from Marus Lopez De Prado. Usage is also very simple, but it is important to know what we are doing. In this portfolio optimizations, we don’t invest in stocks with allocation less than 1% of capital (we are not huge funds with billions of dollars, but we are allocating 100k of investments). From my observations, HRP divides our stock universe into many small investments, so I decided to do the HRP in two steps.

Firstly we do HRP on all stocks we use (uncorrelated top 200 according to dollar volume); the results are the weights.
We get rid of stocks with less than 0.5% allocation and do the new HRP allocation with the new given universe.

Note that you can make your own portfolio corrections like this or test different methodologies to get rid of too low weights. At the end of the loop we will do a similar thing for all the results.

    opt = mlfinlab.portfolio_optimization.HierarchicalRiskParity()
    opt.allocate(df_is.columns, df_is)
    # resulting weights are again in pd.DataFrame
    res_hrp = opt.weights.loc[0]
    # we get rid of weights smaller than 0.5%
    res_hrp = res_hrp[res_hrp>0.005]
    # new allocation with new universe
    opt.allocate(res_hrp.index, df_is[res_hrp.index])
    # final results
    results['HRP'] = opt.weights.loc[0]

Hierarchical Equal Risk Contribution

    opt = mlfinlab.portfolio_optimization.HierarchicalEqualRiskContribution()
    opt.allocate(res_hrp.index, df_is[res_hrp.index])
    results['HERC'] = opt.weights.loc[0]

Nested Clustered optimization

    opt = mlfinlab.portfolio_optimization.clustering.()
    # Convex Optimization solution
    cvo = opt.allocate_cvo(cov=cov_matrix)
    # resulting DataFrame is different than in previous methods
    cvo = pd.Series(cvo[:,0], index=df_is.columns)
    # Nested Clustered Optimization solution, you can choose maximum number of cluster, do not use high numbers when your universe is small
    nco = opt.allocate_nco(cov=cov_matrix, max_num_clusters=25)
    nco = pd.Series(nco[:,0], index=df_is.columns)
    # saving the results
    results['NCO-cvo'] = cvo
    results['NCO-nco'] = nco
    
    # Multiple simulations of NCO and SVO by Monte Carlo optimization - just 1 simulation because of really long computation time
    w_cvo, w_nco = opt.allocate_mcos(
        df_is.pct_change().mean().values, # here we use average historical daily return
        cov_matrix,
        len(df_is),
        1, # number of MCMC 
        min_var_portf=False # if False we use Sharpe maximization
        )
    # saving the results
    w_nco = w_nco.loc[0]
    w_nco.index = df_is.columns
    w_cvo = w_cvo.loc[0]
    w_cvo.index = df_is.columns
    # again we use only positive weights and scale them to sum to 1
    results['NCO-MC-cvo_sharpe'] = w_cvo[w_cvo>1]/w_cvo[w_cvo>1].sum()
    results['NCO-MC-nco_sharpe'] = w_nco[w_nco>1]/w_nco[w_nco>1].sum()
    
    w_cvo, w_nco = opt.allocate_mcos(
        df_is.pct_change().mean().values, # here we use average historical daily return
        cov_matrix,
        len(df_is),
        1, # number of MCMC 
        min_var_portf=False # we use Volatility minimization
        )
    w_nco = w_nco.loc[0]
    w_nco.index = df_is.columns
    w_cvo = w_cvo.loc[0]
    w_cvo.index = df_is.columns
    results['NCO-MC-cvo_volatility'] = w_cvo[w_cvo>1]/w_cvo[w_cvo>1].sum()
    results['NCO-MC-nco_volatility'] = w_nco[w_nco>1]/w_nco[w_nco>1].sum()

Saving the out-of-sample equities

When for the given loop the optimization is finished we prepare the portfolios and calculate out-of-sample performance.

    for key in results.keys():
        # use the weights, take only weights bigger than 0.9%
        to_allocate = results[key].copy()
        to_allocate = to_allocate[to_allocate>0.009]
        # recalculate the weights 
        to_allocate = to_allocate/to_allocate.sum()
        # it is very possible that still newly calculated weights are still less than 0.9%, so we repeat the process once more - yes very primitive technique but functional
        to_allocate = to_allocate[to_allocate>0.009]
        to_allocate = to_allocate/to_allocate.sum()
        # prepare OOS data - forward fill explained after this codeblock
        rs = df_oos[to_allocate.index].copy().fillna(method='ffill')
        # make relative stock equities according to the first day, the first value will be always 1 (that index is repeated in IS, because that day we have still return from previous portfolio
        rs = rs/rs.iloc[0]
        # by dot product we have portfolio of stocks with given weights (we add costs later)
        rs = rs.dot(to_allocate)
        # save the portfolio
        if key not in allocation.keys():
            # if it is first loop, we simply save the portfolio
            allocation[key] = rs.copy()
        else:
            # when previous performance exists we just append continuation of portfolio but not starting with value 1 but the last value of previous performance, that's why we use relative equity, because it is easily scalable just by product
            allocation[key] = allocation[key].append(
                rs*allocation[key].iloc[-1])

Why do I use the fillna method, forward fill? Simply, all our stocks are highly tradable. During the period, there were only a few stocks whose data ‘finished’. Those data didn’t finish because of bankruptcy, but an acquisition of that stock by another company (purchase). So you didn’t lose the money.

Historically, it is hard to find the acquisitions automatically, so we would instead use that the price didn’t move for the remaining time. I show this example because most of you do not have access to highly professional data to see everything.

Indeed the publicly-traded company may go bankrupt. It sometimes happens that there are some gaps in data and the stock price continues; there can be some error in collecting the data or some pause in trading because of some technical issues on the exchange (yes, it can also happen, but it is not common).

Having a double index with the same value during the re-optimization day is just for practical purposes. We didn’t apply costs yet and also didn’t save the dates of re-optimization. And we don’t need to have them because every time there is a double index, there is re-optimization.

# we get reoptimization dates where is double index, it is same the date for all models
change = allocation['opt_maxReturn'].resample('B').count()
change = change[change>1].index
# rewrite results to DataFrame
allocation = pd.concat(allocation.values(), keys=allocation.keys())
# models or different optimizations are added to multi-index, so we put model into column and use pivot_table to create DataFrame where index is date and columns are equities of given portfolios
allocation = allocation.reset_index(level=0)
allocation.columns = ['model','equity']
# during pivoting the double index disappears, but for values it is OK, because in both same dates every equity has the same values
allocation = allocation.pivot_table(index='date', columns='model',values='equity') 

# calculating costs, since the portfolio is in relative numbers it is very easy to just subtract the costs, Series of costs * see the note after the codeblock
costs =  pd.Series(0, index=allocation.index)
costs.iloc[0] = 0.0005 # the fist to buy is 0.05% of the portfolio value, commissions + slippage
costs.loc[change] = 0.001 # during reoptimization we close and open position, so 0.1%
# to subtract all cost we have to do cumulative sum of costs, the number will change only during the reoptimization period, 
net_port = allocation.sub(costs.cumsum(),axis='index')
# download benchmarks' prices
from alpha_vantage.timeseries import TimeSeries
ts = TimeSeries(key='yourAPIkey',
                output_format='pandas',
                indexing_type='date')
# download or if you have it saved in your data use saved data
qqq = ts.get_daily_adjusted('QQQ','full')[0]['5. adjusted close'].reindex(allocation.index)
spy = ts.get_daily_adjusted('SPY','full')[0]['5. adjusted close'].reindex(allocation.index)
# create relative equities
qqq /= qqq.iloc[0]
spy /= spy.iloc[0]
# for investing in ETF there is the cost of first buy (commission) and also yearly fee, which is 0.0945% for SPY and 0.2% for QQQ, it is called the expense ratio, but it is already calculated in the price of ETF
# we need to subtract only costs of buying the ETFs
qqq -= 0.0005
spy -= 0.0005
#add equities to our portfolio results
net_port['SPY'] = spy
net_port['QQQ'] = qqq
# everything is ready, you can plot data and do analyses on equities

We just simply subtracted percentage commissions from the relative equities. This methodology is not correct because after paying the fees, we have slightly less money to invest, so our real returns are slightly lower.

This difference in a short time is invisible. Still, it is sufficient for our purposes (in many articles about portfolio optimization, there are no fees adjusted, and that can make a big difference).

Calculating the metrics

At the end of each article, you could find the table with some basic metrics; here, you can see how to calculate it.

results = pd.DataFrame()
# calculate annual return of portfolios - lambda over equities - last value / first value - 1
annual_ret = net_port.resample('Y').apply(lambda x: x.iloc[-1]/x.iloc[0] -1)
annual_ret = annual_ret[1:].mean()
results['Annual Return'] = annual_ret
# drawdown in python is super easy
def drawdown(vec): # percentual drawdown
    return (vec - np.maximum.accumulate(vec))/np.maximum.accumulate(vec)

results['Max Drawdown'] = drawdown(net_port).min()
# Sharpe Ratio from mlfinlab, note that we cannot use equity but daily returns
results['Sharpe Ratio'] = mlfinlab.backtest_statistics.sharpe_ratio(net_port.pct_change())
# historical volatility is simple standard deviation of percentage returns, multiplied to have yearly volatility
results['Hist. Volatility'] = net_port.pct_change().std() * np.sqrt(252)

Plotting

from matplotlib import pyplot as plt
import seaborn as sns
sns.set()
plt.rcParams["font.family"] = "serif"
sns.set_palette(sns.color_palette("bright"))
# just NCO for quick example
net_port[['SPY', 'QQQ',  'NCO-MC-cvo_sharpe', 'NCO-MC-nco_sharpe',
          'NCO-cvo', 'NCO-nco']].plot(
    lw=1, alpha=0.7, figsize=(550/96,370/96)
    )
plt.xlabel('')
plt.legend(fontsize=8)
plt.title('Portfolios: out-of-sample walk forward, Nested Clustered Optimization',
    fontsize=10)
plt.savefig('/yourPATH/equities_NCO.pdf')

Search

Topics

Advanced (6)Algorithmic Trading (17)Basics (24)Futures (1)Investing (6)Libraries (1)Python (10)Statistics (1)Stock Market (9)Stock Pairs (2)Strategy Development (16)Time Series (2)Trading (8)volatility (2)