Standard error of the (estimated) mean for time-series data¶

A common analysis task is determining the mean value of a metric in a sample - here we consider the metric to be the arithmetic mean. Because we don't have access to the entire population, this is almost always done by taking a sample from the population, measuring the mean in the sample, then taking that measurement as an estimation of the true value. Because the mean estimate doesn't return the exact true value, it should be accompanied by error bars, also known as a confidence interval. For sample data that is independent, calculating the standard error of the estimated mean is trivial using a formula. Some other data, for example time series, are highly correlated and thus need some adjustments to determine the error.

Despite the broad range of fields that measure time series data - ecology, econometrics, physical simulations, astronomy, etc... - there's no single, accepted approach to calculating the standard error of the mean, motiviating this article. Here we will cover several approaches to calculating standard error of the mean, from which you can get the confidence interval. Python code to calculate the confidence intervals can be found here if you want to use these techniques in your own workflow.

This article demonstrates four different techniques for determing the standard error of the estimated mean: - Block averaging - Calculating the number of independent samples ('effective sample size') - Autoregressive ('AR(1)') maximum likelihood fit, with an analytical adjustment of the standard error - AR(1) Bayesian estimation

Note on the code

The code for the analysis is all written in python. To use it in your own workflow, copy the sem_utils.py from https://github.com/ljmartin/estimating_sem_timeseries. The function calls rather than the full code is used here for clarity.

Demonstration¶

First, to demonstrate the problem, below shows examples of independent data (left side panel) and autocorrelated data (right side panel):

import numpy as np
import pandas as pd
import altair as alt
import statsmodels.api as sm
import sem_utils
np.random.seed(257367)


#generate two curves and a put them both in a dataframe:
uncorrelated_data = np.random.normal(0,np.sqrt(1 / (1-0.9**2)),100) #equivalent variance to the autocorrelated
autocorrelated_data = sem_utils.gen_correlated_curve(0.9, 100)

df = pd.DataFrame(data={'Uncorrelated':uncorrelated_data, 'Autocorrelated':autocorrelated_data})
long = pd.melt(df.reset_index(), id_vars='index')

#plot them both:
base = alt.Chart(long.reset_index())
require_shift='[mousedown[event.shiftKey], mouseup] > mousemove'
require_noshift = '[mousedown[!event.shiftKey], mouseup] > mousemove'
chart = base.mark_line(point=True).encode(x = 'index:Q', y = 'value:Q')
eb = base.mark_errorband(extent='ci').encode(y = 'mean(value):Q')

hline = alt.Chart().mark_rule().encode(y='a:Q')

highlight = alt.selection_interval(on=require_shift)
panzoom = alt.selection_interval(translate=require_noshift, bind='scales')

chart = alt.layer(chart+ eb+hline).transform_calculate(a="0").properties(width=300, height=150).add_selection(highlight).add_selection(panzoom).facet(column=alt.Column('variable:N', sort=['Uncorrelated', 'Autocorrelated']))
chart

Both these charts have the same variance (how much they stretch away from the middle value). The key difference is that for the independent data (left), each point has no memory of the previous point, hence they appear to jump around more. In contrast, for autocorrelated data (right) each point is similar to the previous point, meaning the curve moves in waves - this is also the reason for the name autocorrelated (correlated with itself). You can see the confidence intervals as shaded bars. The autocorrelated sample has a mean estimate that's off center, which is normally not a problem if the CI encompasses the mean. In this case, the waviness of the data mean that while there's lots of measurements, we haven't actually sampled much variation and the error is deceptively small.

SEM Calculation¶

This causes difficulty when calculating error bars. Ordinarily, for independent data, after calculating the estimated mean you would then calculate the standard error of that estimate using:

$\begin{align} SEM = \frac{s}{\sqrt{n}} \end{align}$

which in english says "the standard error of the (estimated) mean, called $SEM$ , is equal to the standard deviation ( $s$ ) of the sample divided by the square root of the number of samples". The formula isn't so special, it can be found on the wikipedia entry for standard error of the mean.

The problem with using this formula for highly autocorrelated data is the $\sqrt{n}$ in the denominator - it can become very large for time series data even if you can only see one or two of the 'waves' in the curve, especially if you take samples at high frequency.

Consider an example: measure your body temperature every hour for four weeks. It fluctuates over the day/night cycle, but doesn't change too much hour by hour. In that case, $n=672$ . Now imagine measuring it every minute for a month, so $n=40320$ . The $n$ has multiplied by $60$ , meaning the $SEM$ becomes smaller by $\sqrt{60}$ . Yet in both cases you sampled for the same total time. Thus your actual body temperature would have varied by about the same amount, meaning you would expect the error bars to be about the same.

Put another way, high frequency sampling of a time-varying process gives you lots of unnecessary data points because each point is so similar to its neighbours, and this makes the sample look very precise when it isn't.

Autocorrelation¶

This similarity between adjacent points can be visualized using an autocorrelation plot. Autocorrelation plots range from -1 to 1 and measure how correlated adjacent points are, with 1 being total positive correlation, 0 being no correlation at all, and -1 being total negative correlation. Autocorrelation plots measure how correlated a sample is to another point at some time lag. When the time lag is zero, autocorrelation must be 1 because each point must be completely positively correlated with itself!

The below shows an example of three time series with different levels of positive autocorrelation. When the autocorrelation plot reaches zero, it's safe to assume that the time lag is big enough that any data point has no memory of another data point separated by that lag. This value of the time lag is characteristic to an autocorrelated process, and is called the 'autocorrelation time' ( $\tau$ , tau). In practice, there's a lot of noise around the autocorrelation of zero which appears even in uncorrelated data, making it difficult to determine from measured data.

??? See Detecting Signals from Data with Noise: Theory and Applications for significance testing of autocorrelation vs. white noise

##Generate simulated data:
df = pd.DataFrame()
for ac, label in zip([0, 0.8, 0.9], ['Not autocorrelated (𝜏=0)', 'Some autocorrelation (𝜏≈9)', 'Very autocorrelated (𝜏≈19)']):
    autocorrelated_data = sem_utils.gen_correlated_curve(ac, num=500)
    #autocorrelation plot:
    C_t = sm.tsa.stattools.acf(autocorrelated_data, fft=True, unbiased=True, nlags=autocorrelated_data.size)[:100]
    df[label]=C_t

##Plot:
long = pd.melt(df.reset_index(), id_vars='index')
base = alt.Chart(long.reset_index())
selector = alt.selection_single(fields=['variable'],empty='all',bind='legend')
col = alt.condition(selector, 'variable:N', alt.value('lightgrey'), title='Click one to highlight')
highlight = alt.selection_interval(on=require_shift)
panzoom = alt.selection_interval(translate=require_noshift, bind='scales')
chart = base.mark_line().add_selection(selector).add_selection(highlight).add_selection(panzoom).encode(x = alt.X('index:Q', title='Time lag'), 
                                                        y = alt.Y('value:Q', title='Autocorrelation'), color=col,
                                                       tooltip=['variable:N']) 

chart = alt.layer(chart).properties(width=600, height=300)
chart

Solution¶

There are many different solutions to calculating the SEM for autocorrelated data. There's no "true" solution, meaning each one was developed in a different field for a different purpose. They all have a bit of overlap, but differ in their ease of use. Ideally we want something that is robust to as many different situations as possible, giving SEM's that include the true mean value at the right rate. Below demonstrates a few solutions. At the end there is a comparison of all of them under different conditions.

Approach 1: Block averaging¶

The purpose of block averaging is to 'undo' the autocorrelation so the sample looks independent. First, a baseline value for the $SEM$ is calculated using all points. To do one round of block averaging, every second point and it's adjacent point forward in time are averaged, creating 'blocks' of size 2 and resulting in a dataset with $\frac{n}{2}$ points. Their time value can be averaged as well.

Here's an example. The blue curve is the original timeseries, and the orange curve is blocked at block size=2. You can see that each orange point is the average of two nextdoor neighbour points. Notice also that the time distance between adjacent orange points is larger than it is between orange points.

#make an autocorrelated curve:
autocorrelated_data = sem_utils.gen_correlated_curve(0.99, num=25)
x = np.arange(len(autocorrelated_data))

#do one round of block averaging, blocksize of 2
x_blocked = x[:len(x)-(len(x)%2)].reshape(-1,2).mean(1)
autocorrelated_blocked = autocorrelated_data[:len(x)-(len(x)%2)].reshape(-1,2).mean(1)

#make a long format df:
df = pd.DataFrame(columns=['Blocksize', 'x', 'y'])
row_num=0
for j,k in zip(x, autocorrelated_data):
    df.loc[row_num]=[1, j,k]; row_num+=1
for j,k in zip(x_blocked, autocorrelated_blocked):
    df.loc[row_num]=[2, j,k]; row_num+=1


#plot:
base = alt.Chart(df.reset_index()).properties(width=600, height=300)
selector = alt.selection_single(fields=['Blocksize'],empty='all',bind='legend')
col = alt.condition(selector, 'Blocksize:N', alt.value('lightgrey'), title='Click one to highlight')
chart_line = base.mark_line().add_selection(highlight).add_selection(panzoom).add_selection(selector).encode(x = alt.X('x:Q', title='Time'), 
                                                             y = alt.Y('y:Q', title='Autocorrelation'), color=col,
                                                            tooltip=['Blocksize:N']) 
chart_scatter = base.mark_point().encode(x = alt.X('x:Q'), y = alt.Y('y:Q'), color='Blocksize:N') 
(chart_line+chart_scatter)

By increasing the size of the blocks from 2,3,4..., the hope is that, eventually, adjacent blocked points are far enough apart in time that they are totally uncorrelated and the data resemble an independent sample. This occurs when the blocks are bigger than the autocorrelation time $\tau$ . It appears as an asymptote in the $SEM$ , and would also correspond to reaching zero in the autocorrelation plot.

To demonstrate, the below shows blocking up to size=200 using an autocorrelated timeseries of 5000 measurements. The plot shows only the $SEM$ - note how it increases as block size increases, then levels off once the blocks are far enough apart:

#generate an autocorrelated curve:
autocorrelated_data = sem_utils.gen_correlated_curve(0.6, 5000)
blocked_sems = sem_utils.block_averaging(autocorrelated_data)

#put into a dataframe:
df = pd.DataFrame({'Blocked SEM': blocked_sems})

#plot:
def plot_df_with_altair(df, xlabel='Block size', ylabel='SEM'):
    base = alt.Chart(pd.melt(df.reset_index(), id_vars='index')).properties(width=600, height=300)
    selector = alt.selection_single(fields=['variable'],empty='all',bind='legend')
    col = alt.condition(selector, alt.Color('variable:N', sort=None, title='Click one to highlight'), alt.value('lightgrey'))
    opacity = alt.condition(selector, alt.value(1.0), alt.value(0.5))
    chart = base.mark_line().add_selection(selector).add_selection(highlight).add_selection(panzoom).add_selection(brush).encode(x=alt.X('index', title=xlabel), 
                                                            y=alt.Y('value', title=ylabel, scale=alt.Scale(zero=False)), 
                                                            color = col, opacity=opacity)
    return chart

plot_df_with_altair(df)

The true $SEM$ should be somewhere in the flattened out area. Due to the noise, there's no "correct" choice of SEM, but one practical way that works is to fit a curve with similar shape that grows asymptotically. The below shows the blocked $SEM$ s (blue), and the fit to a*arctan(b*(x-c)) (orange). It's worth noting that this can fail spectacularly if you only have a few measurements (<100), giving wrong $SEM$ s by three orders of magnitude. This means it's not safe to automate it - every fit should be manually inspected!

fitted_parameters = sem_utils.fit_arctan(blocked_sems)
fitted_curve = sem_utils.arctan_function(np.arange(len(df)), *fitted_parameters)
df['Fitted block avg. curve'] = fitted_curve

plot_df_with_altair(df)

Approach 2: Estimating $n_{eff}$ from the autocorrelation function¶

Block averaging works but requires a bit of hand-holding by the user. After blocking, the dataset has a smaller $n$ , leading to an indepdendent sample. The value of $n$ that achieves this is known as the 'effective $n$ ', or $n_{eff}$ . There are other ways to calculate $n_{eff}$ that are more direct than block averaging, and interestingly they can result in fractional values (unlike block averaging).

Adjusting the wikipedia equation for $SEM$ , we now have:

$\begin{align} SEM = \frac{s}{\sqrt{n_{eff}}} \end{align}$

Calculating $n_{eff}$ normally happens by first calculating the autocorrelation time $\tau$ . The autocorrelation time is the time it takes for the process to forget a past value. $n_{eff}$ can be found by just dividing the total sample time by the autocorrelation time $\tau$ , giving you the number of measurements with no memory of each other, i.e. independent samples. $\tau$ doesn't have to be a unit value, which is how $n_{eff}$ can be fractional.

$\begin{align} n_{eff} = \frac{n}{\tau} \end{align}$

So how to find $\tau$ ? The two techniques shown below, called here by the authors Sokal and Chodera (see refs), take the autocorrelation function and try to find a consensus zero point to get $\tau$ . They have reasonable agreement with block averaging if you sampled for many multiples of $\tau$ , which ensures good signal to noise ratio.

Chodera: see https://github.com/choderalab/pymbar for code, and Use of the Weighted Histogram Analysis Method for the Analysis of Simulated and Parallel TemperingSimulations

Sokal: see https://dfm.io/posts/autocorr/ for code, here for Sokal notes

#Get SEMs by N_eff technique:
chodera_sem = sem_utils.sem_from_chodera(autocorrelated_data)
sokal_sem = sem_utils.sem_from_sokal(autocorrelated_data)
#add to dataframe:
df['Chodera SEM'] = np.repeat(chodera_sem, len(df))
df['Sokal SEM'] = np.repeat(sokal_sem, len(df))

#plot:
plot_df_with_altair(df)

Approach 3: Autoregressive processes¶

A third way to estimate the $SEM$ is by treating it as an autoregressive process. Here, it helps to understand what it is that makes time series correlated with themselves. The simplest type of autocorrelation is called an autoregressive model where we only consider the time lag at lag=1, commonly just called AR(1). Wikipedia has some more details on these, but here's the short version. In an AR(1) process, each new point is equal to the previous point multiplied by a number $\rho$ (rho), plus or minus some white noise (i.e. random noise).

In maths, using $X_t$ to mean 'point $X$ at time $t$ ' and $\epsilon$ (epsilon) to mean 'random noise':

$\begin{align} X_t = \rho X_{t-1} + \epsilon \end{align}$

When $\rho$ is less than 1, each step pulls the point back towards zero (hence autoregressive), meaning the time series will maintain a mean of zero. Moving average models are also commonly used, but a stationary mean makes things easier and we will assume stationarity. If your time series is sationary but isn't centred on zero, the AR(1) model can just add a constant that shifts the whole sample back to zero, and then still use the model above.

First, let's estimate the $\rho$ parameter. This can be done using linear regression, where you calculate the ordinary least squares slope of the lines that go between any point and its adjacent neighbour. The code below uses the statsmodels library, the closest approximation of R in Python.

The true $\rho$ in this case was 0.6. You can see that OLS closely estimates the $\rho$ parameter.

from statsmodels.tsa.ar_model import AutoReg
#find estimated rho using ordinary least squares
result = AutoReg(autocorrelated_data-autocorrelated_data.mean(), lags = [1]).fit()
estimated_rho = result.params[1]
print('###################################')
print(f'Estimated rho is {estimated_rho}, with 95% CI of {result.conf_int()[1]}')
print('###################################')

###################################
Estimated rho is 0.6027880531997019, with 95% CI of [0.58066759 0.62490852]
###################################

Next, we can use a handy but complicated formula, which is a consequence of assuming an autoregressive AR(1) process, to get an estimate for the $SEM$ . First the $SEM$ gets calculated naively, giving the wrong answer. Then the wrong answer is multiplied by a correction factor $k$ , which is derived from the known properties of AR(1) processes:

$\begin{align} k = \sqrt{\frac{1+2\delta / n}{1 - 2\delta/(n(n-1))}} \end{align}$

where

$\begin{align} \delta = \frac{(n-1)\rho - n\rho^2 + \rho^{n+1}}{(1-\rho)^2} \end{align}$

This equation is messy, but if you're sure you have lots of multiples of the autocorrelation time, you can use the short version:

$\begin{align} k = \sqrt{\frac{1+\rho}{1-\rho}} \end{align}$

So now we have:

$\begin{align} SEM = \frac{ks}{\sqrt{n}} \end{align}$

Before using it, it's neat to see how big this correction factor can get. At $\rho$ close to 0, i.e. less than about 0.2, $k$ is barely greater than 1, meaning the naive $SEM$ doesn't change much. But as $\rho$ approaches 1, the $k$ can grow very large. With large numbers, you want to ensure you've got really big sample sizes so the noise doesn't wash out the true signal. Observe:

rho_range = np.linspace(0,0.99,150)
correction = sem_utils.correction_factor(rho_range, 100)
corr_df = pd.DataFrame({'Correction Factor':correction, '𝜌':rho_range})

#plot:
base = alt.Chart(corr_df).properties(width=600, height=300)
base.mark_line().add_selection(brush).encode(x='𝜌:Q', y='Correction Factor:Q')

So how does it go? Below adds the AR(1) correction factor:

correction = sem_utils.sem_from_autoregressive_correction(autocorrelated_data)

df['AR(1) correction factor SEM'] = correction

plot_df_with_altair(df)

So, to take stock: * Block averaging clumps the data together in order to reduce $n$ , but requires manual checking. * Or, you can estimate $n_{eff}$ and $\tau$ directly from the autocorrelation plot. * Alternatively you can assume an AR(1) process, fit with ordinary least squares to find $\rho$ , then use a correction factor to fix the naively-calculated $SEM$ .

The final technique here also uses an AR(1) model, but instead of using a correction equation, it simulates how probable a range of different mean values are, then reports the region of highest probability.

Approach 4: AR(1) via Bayesian estimation¶

The above approaches report what is known as the maximum likelihood estimate of the $SEM$ . It's called maximum likelihood simply because it's the single most likely value after you factor in any model assumptions that were made (like the AR(1) assumption, or normally-distributed noise).

An alternative is Bayesian estimation. There are still model assumptions - this still uses an AR(1) model. The trick here is that we try and find the best $\rho$ parameter for the data, but we also let mean value vary. Afterwards, we can throw away the $\rho$ estimate, and just look at the most probable values for the mean. These most probable values are encompassed in a 'highest posterior density' (HPD) which can be directly interpreted as a probability (unlike confidence intervals).

The below uses Bayesian estimation of an AR(1) model, using the PyMC3 library, which jointly estimates the value of the mean and $\rho$ , and can then report a full spectrum of likely values for the mean (or for $\rho$ if you want). It takes a few minutes to calculate because it simulates many possible values for the mean and the $\rho$

PS: As we will see later, this technique sometimes gives larger (but better) error bars because it factors in uncertainty in both of $\rho$ and the mean. It might be disappointing to have larger error bars, but surely it's preferable to be more correct.

bayesian_estimate = sem_utils.sem_from_bayesian_estimation(autocorrelated_data, progress=True)
df['AR(1) Bayesian estimate HPD'] = bayesian_estimate
plot_df_with_altair(df)

Sampling 2 chains, 0 divergences: 100%|██████████| 2000/2000 [00:04<00:00, 470.34draws/s]

So which is best?¶

We can't compare the techniques on just a single realisation of simulated autocorrelated data. So let's compare on 100's of simulated time series across a range of sample lengths and $\rho$ values. The goal is to get an accurate $SEM$ .

The $SEM$ is used to calculate confidence intervals. Ideally, we want the 95% confidence intervals generated from the $SEM$ to capture the true mean exactly 95% of the time. That means having errorbars that are large enough to encompass the mean, but without going too large such that you have low precision.

Simulated comparison¶

The following shows the results of many simulations that compare the rate of correctness for the $SEM$ generated by the techniques introduced above. It uses timeseries of different lengths as well as different $\rho$ .

In general, note that low $\rho$ is easy to get right - almost all techniques (even naive $SEM$ calculation) get it right about 95% of the time. As the $\rho$ increases, naive $SEM$ calculation returns too small error bars, whereas the other techniques fare better.

At $\rho=0.7$ or $\rho=0.9$ most techniques start to struggle, particularly when there's not much available data (i.e. only 30 timestep measurements). It's here that Bayesian estimation is more resilient - to an extent, the model factors in uncertainty that arises due to the small dataset.

import plot_utils
plot_utils.plot_results_interactive()

There's one more thing to note here. We want $SEM$ / confidence interval values that are large enough to capture the true mean 95% of the time, but not so large that they don't give precise estimates. Block averaging gets it right about 95% of the time - a good thing - but look how badly wrong the the remaining 5% are. Below shows the average size of the confidence interval. All techniques are about the same, but block averaging is heavily skewed by the 5% incorrect error bars that are way too large:

ch = plot_utils.plot_mean_ci_width_static().interactive()
ch

So which algorithm to use?¶

If you know the $\rho$ or $\tau$ parameter, and thus know you've sampled for many multiples of $\tau$ , it doesn't really matter. If you don't want to make any assumptions about autoregressive processes, then use block averaging but manually check the result. If you're ok with AR(1) models but the data have more complex patterns, AR(1) fits can be extended to higher-order AR(p) models.

For everything in between, particularly with limited data, the safest option is probably Bayesian estimation because it is slightly more robust to uncertainty. Even if AR(1) models fail at the highest autocorrelation, these will still indicate that the $\rho$ is very high and you can use simulation to estimate how long the $\tau$ is for that level of autocorrelation, thus giving you an estimate of how long you need to take measurements in order to get good statistics.