Working with Time Series in Pandas – Grouping / DatetimeIndex

Petr Tmej
Jul 15, 2022
9 min read

PANDAS - Grouping and DateTimeIndex

In this article, we continue playing with time series in Python, resp. Pandas. Our previous article covered some Pandas basics, such as working with columns, creating moving averages, and resp. rolling statistics, and also some price patterns. In this article, we will be comparing dollar volumes for a few tech-stocks during different parts of the day. In this practical example, we will cover mostly grouping and usage of DatetimeIndex in Python, which are used frequently. To show you how useful Pandas might be for you, we will also look at creating bars and sessions.

grouping datetime index key takeaways trade with sicence

HIGHEST VOLUME TIMEs

Knowing what time the stocks have the highest volume is essential. If you are making investments and want to execute higher volumes for a given stock, it is better to know these times. You don’t want to experience significant slippage or wait a few minutes until limit orders execute or not. When you trade big markets, you don’t need to worry, but your advantage over hedge funds is that you can also trade low-volumed markets. There are many opportunities in these stocks or futures because they are not that efficient.

Imagine you are holding your long/short positions for a few days, and you want to exit. I had an experience in the market for such a small stock that I could not execute the market order for more than 30 minutes (which creates enormous slippages that can ruin your strategy). It is natural to think the highest volumes are near close and open, and that’s correct. But some stocks have higher volumes only near the open or near the close. When you want to trade out of the main session, well, it is good to know what time usually has more volume.

Dollar VOLUME

Let’s present USD volume, which is the product of the price and the volume. It allows us to compare the stocks and theoretically tell us how much money we can put into the market without affecting it. You might disagree that it is better to find it out by looking into an order book, but the truth is that many volumes there are fake – when the price goes closer, these orders disappear. Fast trading bots create these counterfeit orders, respectively, market makers.

Let’s look at a few stocks with similar volumes, but you definitely can’t invest the same into them. The data are from Apr 30, 2020, and all stocks had volume 9.3-9.4 mil shares for a given day.

Volume

Close price

Dollar Volume

Tradeable Capital

Symbol

9.37 mil

2474

23 billion

any (for an individual)

AMZN

9.36 mil

87.18

0.8 billion

10-100k without worrying

NKE

9.37 mil

292.28

2.7 billion

any (for an individual)

NVDA

9.33 mil

0.3119

3 million

few thousands

RTW

9.31 mil

10.74

0.1 billion

10-100k without worrying

TEVA

The volume is the same, but the tradable amount is very different. Also, for small companies, 90% of the daily volume may be in the last 10 minutes of trading. Let’s do some Python. All we need is Pandas, datetime for some work with time, and Matplotlib for plotting.

import pandas as pd
import datetime
import matplotlib.pyplot as plt 
from alpha_vantage.timeseries import TimeSeries
ts = TimeSeries(key='YourAPIkey', 
                # claim your free API KEY at alphavantage.co
                output_format='pandas',
                indexing_type='date')

We will use the 1-minute price bars of extended sessions from Alpha Vantage. Our example stocks will be AAPL, GOOG, FB, and AMZN. The example to download intraday data for extended history is in our article about adjusting prices. Or if you want to use just recent history of the current month, use code:

df = {}
for s in ['AAPL','GOOG','FB','AMZN']:
    # with demo API key you have max 5 queries per 1 minute
    df[s] = ts.get_intraday(s, '1min', 'full')[0]
    df[s].columns = [x[3:] for x in df[s].columns]

We merge data together into a DataFrame with multilevel columns. When data are in the dictionary, it is simple:

df = pd.concat(df.values(), keys=df.keys(), axis=1)

will get multilevel columns, thanks to axis=1, if axis=0, then you get a multilevel index. The first level is for symbols, second for values OHLCV. Look at the tables (the last 5 values for the extended session and the main session).

df['AAPL'].tail()

open high low close volume time 2020-08-28 19:56:00 501.45 501.49 501.40 501.40 1473.0 2020-08-28 19:57:00 501.40 501.74 501.40 501.74 2246.0 2020-08-28 19:58:00 501.74 501.74 501.60 501.74 1253.0 2020-08-28 19:59:00 501.73 501.73 501.65 501.65 1683.0 2020-08-28 20:00:00 501.80 502.00 501.71 501.98 8239.0

df['AAPL'][df['AAPL'].index.time <= datetime.time(16,0)].tail()

open high low close volume time 2020-08-28 15:56:00 499.47 500.7689 499.31 500.68 480454.0 2020-08-28 15:57:00 500.68 500.6800 498.85 499.24 616916.0 2020-08-28 15:58:00 499.23 500.7300 499.01 500.00 440069.0 2020-08-28 15:59:00 499.98 500.0100 499.28 499.54 514317.0 2020-08-28 16:00:00 499.54 499.7300 498.82 498.90 1010884.0

USD Value on apple, amazon, facebook and google stocks

The prices here were normalized according to the beginning value, so we can see their performance together in one graph (the whole series was divided by the first close value), and see the code for the plot. Depending on if you downloaded only the prices for 1-month history, the resulting plot can be different. You can notice that there are many NaN values, which is caused because of trading out of the main session – if there were no trades, there is no data for a given minute.

(df.dropna().swaplevel(axis=1)['close'] / 
 df.dropna().swaplevel(axis=1)['close'].iloc[0]
 ).plot(alpha=0.9, lw=1)
plt.legend(loc=0)
plt.xlabel('')
plt.show()

Dollar Volumes with specific DATAFRAME

Let’s calculate dollar volumes from a given DataFrame. Some values are NaN values because it may happen that some stock did not have any tick during a given minute. Later we will go through how to fill missing values. Now it is easy because if there was no volume, the dollar volume is also 0.

usdvol = (df.swaplevel(axis=1)['close'] * 
   df.swaplevel(axis=1)['volume']).fillna(0)

Now, we calculate the average daily dollar volume for each stock, average hourly dollar volume for the main session, and average hourly dollar volume out of the main session and show the results in one table. First, the code.

vol_table = pd.DataFrame()
vol_table['daily'] = usdvol.resample('D').sum().mean()
vol_table['hourly_m'] = usdvol[
    (usdvol.index.time > datetime.time(9,30)) &
    (usdvol.index.time <= datetime.time(16,0))
    ].resample('H').sum().mean()
vol_table['hourly_e'] = usdvol[
    (usdvol.index.time <= datetime.time(9,30)) |
    (usdvol.index.time > datetime.time(16,0))
    ].resample('H').sum().mean()

Since DataFrame uses a DatetimeIndex format, we can efficiently work with it. By adding square brackets, we can add conditions which part of DataFrame we want to use: all, the main session, or extended hours. Just by resampling the index into days ‘D’ or hours ‘H’, we get daily or hourly data. Similarly, as with rolling statistics, first, we say what part of DataFrame should Pandas use, then, we say what operation it should do, and that is the sum. After that, we call the mean function because we want to calculate averages for days and hours. We divide the final table by 1 million, so we have the values in millions, and better readability will use integer values.

(vol_table / 1e6).astype(int)

dailyhourly_mhourly_esymbol AAPL571622314AMZN611724311FB2450965GOOG1271502

In this table, we quickly see big the difference between daily and hourly dollar volume and the main session versus the extended one. Now, what about the most volume hours of the day for these 4 stocks together? Firstly, we resample minute bars into hour bars. Then, we use function groupby because we want to group the values according to the day’s hour. We don’t want to use dates, only the hours, so we have to use different function than resample. By groupby we can group by columns or by some values from the DataFrame, hour from DatetimeIndex.

hourly = usdvol.resample('H').sum()
hourly = hourly.groupby(hourly.index.hour).mean()
hourly = hourly.replace({0:pd.NA}).dropna()
(hourly.mean(1).sort_values().tail()/1e6).astype(int)

In the third line, we replace 0 values with NA values so we can drop those rows. Notice that Pandas with resample can create all hours, even when there is no data, and when we sum through NA values, we get 0. But be aware, this calculation for hourly bars is not right!

**Important note, if you look deeper into the resample function, it is crucial to know how it counts the intervals’ border values (what it does in default).

How do we work with time series?

Have a look at minute data when the market starts (AAPL stock):

openhighlowclosevolumetime 2018-09-11 09:28:00218.15218.15218.020218.02665100.02018-09-11 09:29:00218.00218.05218.000218.03008622.02018-09-11 09:30:00218.03218.10217.900218.10004129.02018-09-11 09:31:00218.01218.36217.820218.1500469606.02018-09-11 09:32:00218.15218.59218.137218.1700292642.02018-09-11 09:33:00218.17218.17217.000217.0100287341.0

The market starts at 9:30, but we use the right part of the interval to save the data or the last value of the interval, the value for which the close is the correct one. In other words, into the 5-min bar go data 9:31, 9:32, 9:33, 9:34, and 9:35, and we use the last value as a reference point. So our interval is closed to the right. We also use the right value of the interval as a label. Pandas use by default the left label and also the left closed interval. When working with daily data, it is correct, but for intraday data it is not! There are 2 important parameters: closed and label, which we should set as ‘right’. So the first row of the code, in this case, should be:

hourly = usdvol.resample('H', closed='right', label='right').sum()

Do not forget about this and always check what interval is used in your data! When you look at data provided by brokers or online platforms like yahoo, tradingview, the intervals are labeled to the left. It is because of the live-data flowing. When it is 9:00:35 the minute bar has an index 9:00, so the beginning time, the left part of the interval. Do not mix the differently labelled data. If your data provider does not have it in the documentation, just check a few bars at the end of the day – if it is labeled to the right it finishes at 16:00 (or 20:00 for stocks extended session), otherwise it finishes 15:59 (or 19:59). For futures – check times when there is a pause in intraday trading. This is an indexing problem you have to be aware of.

Creating price bars and sessions

In the last part of this article, I will show you how to create your bars of any length and also time sessions, respectively, how to shift time sessions. Here we make a function to do it. It uses pandas built-in function.

def transform_bars(df: pd.DataFrame, minutes: int, base: int = 0) -> pd.DataFrame:
    '''
    Transformation function for given bars and sessions.
        df:        pandas.DataFrame of OHLCV data
        minutes:   length of bar in minutes
        base:      the offset for start (in minutes when we have 1 
                   minute bars) [default value: 0]
    '''
    # dictionary: what functions we will apply for each column
    ohlc_dict = {                                                                                                        
	  'open': 'first',                                                                                               
	  'high': 'max',                                                                                                  
	  'low': 'min',                                                                                                   
	  'close': 'last',
	  'volume': 'sum'
	  }
    # bunch of one day data will be resampled to wanted bars, base is the offset when the day begins
    # apply the functions on columns inside daily resample (each column is different) 
    new = df.resample('D').apply(
        lambda x: x.resample('{}T'.format(minutes), 
        base=base,
        closed='right',   
        label='right'
        ).apply(ohlc_dict))
    new = new.reset_index(level=0).drop(df.index.name,axis=1)
    # if there is not value the close price did not changed, so we take the price from previous close, forward fill 'ffill' (from up to down)
    new['close'] = new['close'].fillna(method='ffill')
    # when there are not values OHL are also the same as close, back fill 'bfill' on axis 1, that is from the right to the left
    new.loc[new['open'].isna()] = new[new['open'].isna()].fillna(
        method='bfill', axis=1)
    return new

Let’s go step by step. Pandas is very natural when you want to apply different functions into different columns. You have to create the dictionary, which sets for a given column name the name of the function. Later it is applied by operation apply. We want to start each day from the beginning 00:00, respectively; the base is the offset. If we’re going to make 1-hour bars but shifted by 5 minutes, we make base=5, and each day we create bar 1: 00:05-01:05, bar 2: 01:05-02:05

We use firstly resample by day and then inside the second resample for minutes. It is essential because of the offset. We want to have the same bars every day. When we group the dataset, we have to apply a function for created groups, and that is given in the dictionary ohlc_dict, different for each column. We apply it for intraday data so it has to be inside a daily group (resample(‘D’)).

Because we do 2 resamplings, we have two level indexes:

the first is for the day, and
the second is for our new bars.

The index’s first level is unnecessary, so we drop it from the index with reset_index and then drop it from columns with drop. Finally, when there are no values for the given bar, the volume is 0, and we set OHLC as the previous close. That is one option; an other option is skipping that bar (when not ignoring, be aware that there is 0 volume, and it is questionable if the trade was executable during that time). Finally, by applying a time restriction, you can create different sessions. If this code is very advanced for you, you should go step by step and always check the results.

The main purpose of this article is to show you the usage of grouping and working with DatetimeIndex. We analyzed which trading hours have the most volume and noticed some critical features of closing the time interval, then applied it to create the function for calculating your bars, with different bars’ length and starting times.