## About this guide

The “Essential Statistics | Formulas & Stata Implementation” guide is designed for intuitive navigation and swift application. Each core statistical measure is accompanied by its formula, relevant Stata code snippet and a brief example using a sample dataset. Navigate via the table of contents to pinpoint specific measures or topics, and refer to the appendices for quick command references.

## Example dataset

```
* Example data: copy and paste it into Stata
clear
input float x
.3488717
.2668857
.1366463
.028556867
.8689333
.3508549
.07110509
.32336795
.5551032
.875991
end
```

## Mean / average

### Mean formula

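The mean, often referred to as the average, is the sum of all values in a dataset divided by the number of values. It represents the central value of a set of numbers:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Using the above data, the mean of the values is calculated as:

$$\bar{x} = \frac{0.3489 + 0.2669 + \dots + 0.8760}{10} = \frac{3.8263}{10} = 0.3826$$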
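The same arithmetic can be checked in Stata. This is a minimal sketch, assuming the example variable `x` is already in memory; the scalar name `mean_by_hand` is chosen here only for illustration.

```stata
* Assumes the example variable x from the dataset above is in memory
quietly summarize x

* Mean = sum of all values divided by the number of values
scalar mean_by_hand = r(sum) / r(N)
display mean_by_hand
```

The displayed value should agree with `r(mean)` from the same `summarize` call.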

### Extract mean from Stata commands

Various Stata commands return the mean value in scalars and macros.

#### Get mean from the sum command

The Stata `sum` command (short for `summarize`) stores the mean value in the scalar `r(mean)` for a single variable.

```
sum x
return list

scalars:
    r(N)     = 10
    r(sum_w) = 10
    r(mean)  = .3826315952464938
    r(Var)   = .0901804964373037
    r(sd)    = .3003006767180249
    r(min)   = .0285568665713072
    r(max)   = .8759909868240356
    r(sum)   = 3.826315952464938
```
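Because the next r-class command replaces the contents of `r()`, it is usually worth copying `r(mean)` into a scalar or a local macro straight away. A minimal sketch, assuming `x` is in memory; the names `sc_mean` and `lc_mean` are illustrative only.

```stata
quietly summarize x

* Copy the result before another r-class command overwrites r()
scalar sc_mean = r(mean)
local  lc_mean = r(mean)

* count is r-class, so r(mean) is no longer available after this line
quietly count

* The saved copies are still usable
display sc_mean
display `lc_mean'
```

A scalar keeps full double precision, while a local macro stores the value as text, so a scalar is the safer choice when the number feeds into further calculations.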

#### Get mean from the ttest command

The `ttest` command returns the mean value in the scalar `r(mu_1)`.

```
ttest x == 0
return list

scalars:
    r(level) = 95
    r(sd_1)  = .3003006767180249
    r(se)    = .0949634121318857
    r(p_u)   = .0014881587244031
    r(p_l)   = .9985118412755969
    r(p)     = .0029763174488063
    r(t)     = 4.029252810704539
    r(df_t)  = 9
    r(mu_1)  = .3826315952464938
    r(N_1)   = 10
```

#### Get mean from the regress command

You can also obtain the mean and related statistics using the `regress` command. One might opt for this method because `regress` offers several options that could be of interest, such as robust standard errors. To retrieve the mean, regress the desired variable on a constant only, that is, without including any other variables. The estimated constant is then the sample mean, and it is available as `_b[_cons]`.

```
. regress x
. dis _b[_cons]
.3826316
```
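As an illustration of the robust standard errors mentioned above, the constant-only regression can be run with `vce(robust)`. This is a sketch, assuming `x` is in memory; the point estimate is unchanged and only the standard error calculation differs.

```stata
* Constant-only regression with heteroskedasticity-robust standard errors
regress x, vce(robust)

* The constant is the sample mean; _se[_cons] is its standard error
display "Mean:      " _b[_cons]
display "Robust SE: " _se[_cons]
```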

#### Get mean from the mean command

The `mean` command produces estimates of means along with their standard errors. Several useful options are available with `mean`, such as the `svy` prefix for survey data and `over()` to compute estimates for multiple subpopulations. The mean can be retrieved from the matrix `r(table)` or from `_b[varname]`.

```
. mean x
. dis _b[x]
.3826316
```
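The `over()` option can be sketched as follows. The example dataset has no grouping variable, so the variable `grp` below is hypothetical and created purely for illustration; it splits the ten observations into the first five and the last five.

```stata
* Hypothetical grouping variable, created only for this illustration:
* 0 for the first five observations, 1 for the last five
generate byte grp = (_n > 5)

* Estimate the mean of x separately for each group
mean x, over(grp)

* The group means are stored in e(b); the column names that identify
* each group differ across Stata versions, so list the matrix to see them
matrix b = e(b)
matrix list b
```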

#### Get mean from the ameans command

The `ameans` command computes the arithmetic, geometric, and harmonic means, with their corresponding confidence intervals. The arithmetic mean is stored in the scalar `r(mean)`.

```
ameans x
return list

scalars:
    r(ub_h)   = .
    r(lb_h)   = .
    r(Var_h)  = 109.9760819735142
    r(mean_h) = .1368934612039689
    r(ub_g)   = .5598643537556708
    r(lb_g)   = .1176234956206313
    r(Var_g)  = 1.189209013823066
    r(mean_g) = .2566187880146887
    r(N_pos)  = 10
    r(ub)     = .5974537582043968
    r(lb)     = .1678094322885908
    r(Var)    = .0901804964373037
    r(mean)   = .3826315952464938
    r(N)      = 10
    r(level)  = 95
```
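To see where `r(mean_g)` and `r(mean_h)` come from, the geometric and harmonic means can be rebuilt from their textbook definitions. This is a sketch, assuming `x` is in memory and all values are positive; the helper variables `ln_x` and `inv_x` and the scalars `gmean` and `hmean` are names introduced here for illustration.

```stata
* Geometric mean: exponential of the mean of the logs (requires x > 0)
generate double ln_x = ln(x)
quietly summarize ln_x
scalar gmean = exp(r(mean))

* Harmonic mean: reciprocal of the mean of the reciprocals
generate double inv_x = 1 / x
quietly summarize inv_x
scalar hmean = 1 / r(mean)

display gmean
display hmean
```

Up to rounding, the two displayed values should match `r(mean_g)` and `r(mean_h)` in the listing above.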

## Confidence intervals

A confidence interval gives a range of values that is likely to contain the true population parameter, based on the sample data. The confidence level describes how often intervals constructed this way would contain the true parameter across repeated samples. The formula for the confidence interval of a mean, given the sample mean and its estimated standard error $s/\sqrt{n}$, is:

$$CI = \bar{x} \pm t \times \frac{s}{\sqrt{n}}$$

where:

- $CI$ is the confidence interval, representing the range in which we believe the population mean lies with a specified level of confidence.
- $\bar{x}$ is the sample mean, which is the average of the observed values in the sample.
- $t$ is the t-score from the t-distribution that corresponds to the desired confidence level and degrees of freedom. For a sample of size $n$, the degrees of freedom are $n - 1$.
- $s$ is the sample standard deviation, which measures the spread or dispersion of the sample data.
- $n$ is the sample size, indicating the number of observations in the sample.

### Manually find the confidence interval

Using the data given above, the inputs to the CI calculation are:

- **Sample mean:** $\bar{x} = 0.3826$
- **Sample standard deviation:** $s = 0.3003$
- **Sample size:** $n = 10$
- **Degrees of freedom:** $df = n - 1 = 9$
- **t-score:** $t \approx 2.2622$ for a 95% confidence level with 9 degrees of freedom (from the t-distribution table)
- **Margin of error:** $t \times \frac{s}{\sqrt{n}} = 2.2622 \times 0.3003 / \sqrt{10} = 0.2148$
- $Lower\; CI = 0.3826 - \left(2.2622 \times 0.3003 / \sqrt{10}\right) = 0.1678$
- $Upper\; CI = 0.3826 + \left(2.2622 \times 0.3003 / \sqrt{10}\right) = 0.5974$
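The manual calculation can be reproduced step by step in Stata. This is a minimal sketch, assuming `x` is in memory; the scalar names (`xbar`, `s`, `n`, `tcrit`, `moe`, `lower`, `upper`) are illustrative.

```stata
* Collect the inputs from summarize
quietly summarize x
scalar xbar = r(mean)
scalar s    = r(sd)
scalar n    = r(N)

* Critical t-value for a 95% interval with n - 1 degrees of freedom
scalar tcrit = invt(n - 1, 0.975)

* Margin of error and the two limits
scalar moe   = tcrit * s / sqrt(n)
scalar lower = xbar - moe
scalar upper = xbar + moe

display "Margin of error: " moe
display "Lower limit:     " lower
display "Upper limit:     " upper
```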

### Find confidence intervals in Stata

Various Stata commands report confidence intervals in the results table on the Stata screen and store them in the matrix `r(table)`. Some commands, such as `ttest`, do not store them at all. The following sections show how to get the confidence intervals in both cases.

#### Obtain confidence intervals from r(table)

Some Stata commands, such as `regress` and `mean`, return confidence intervals in the matrix `r(table)`. We can retrieve the confidence intervals from this matrix, as in the following example:

```
mean x
```

```
--------------------------------------------------------
Variable        Mean    Std. err.   [95% conf. interval]
--------------------------------------------------------
x           .3826316    .0949634    .1678094    .5974538
--------------------------------------------------------
```

```
* Obtain the r(table)
matrix table = r(table)

* List the contents of the table
matrix list table
```

The matrix `r(table)` is shown below.

| | x |
|---|---|
| b | 0.3826316 |
| se | 0.09496341 |
| t | 4.0292528 |
| pvalue | 0.00297632 |
| ll | 0.16780943 |
| ul | 0.59745376 |
| df | 9 |
| crit | 2.2621572 |
| eform | 0 |

```
* Get the confidence intervals
. local lower_ci = table[5,1]
. local upper_ci = table[6,1]

. dis `lower_ci'
.16780943

. dis `upper_ci'
.59745376
```

The expression `table[5,1]` extracts a single element from the matrix. Since the lower confidence limit is located at row 5 and column 1, we use `table[5,1]` to extract it. Similarly, the upper confidence limit is located at row 6 and column 1. We then store these values in the local macros `lower_ci` and `upper_ci`, respectively.
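Hard-coded row numbers work here, but looking the rows up by name can make the code easier to read and less dependent on the layout of `r(table)`. The sketch below is one possible alternative, not the only way to do it; it uses the built-in matrix functions `el()`, `rownumb()`, and `colnumb()`, and assumes `x` is in memory.

```stata
quietly mean x
matrix table = r(table)

* Find the rows named "ll" and "ul" and the column named "x",
* then pull out the corresponding elements
local lower_ci = el(table, rownumb(table, "ll"), colnumb(table, "x"))
local upper_ci = el(table, rownumb(table, "ul"), colnumb(table, "x"))

display `lower_ci'
display `upper_ci'
```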

#### Calculate confidence interval yourself

If a command does not leave behind the matrix `r(table)`, we can use whatever pieces of the required statistics are available and construct the confidence intervals ourselves. Let's use the case of the `ttest` command, which does not return confidence intervals.

```
ttest x == 0

* List all returned statistics stored as scalars
return list

scalars:
    r(level) = 95
    r(sd_1)  = .3003006767180249
    r(se)    = .0949634121318857
    r(p_u)   = .0014881587244031
    r(p_l)   = .9985118412755969
    r(p)     = .0029763174488063
    r(t)     = 4.029252810704539
    r(df_t)  = 9
    r(mu_1)  = .3826315952464938
    r(N_1)   = 10
```

Given the available statistics, the only missing piece is the critical t-value for a two-sided interval. For a 95% confidence level with $df$ degrees of freedom, we need the t-value at the 97.5th percentile, since we exclude 2.5% from each tail to leave 95% in the middle. We will use the `invt()` function, which returns the inverse of the cumulative Student's t-distribution.

```
* Lower confidence interval
. disp r(mu_1) - r(se) * invt(9, 0.975)
.16780943

* Upper confidence interval
. disp r(mu_1) + r(se) * invt(9, 0.975)
.59745376
```
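The degrees of freedom and the percentile do not have to be typed in by hand, because `ttest` also returns `r(df_t)` and `r(level)`. The sketch below builds the interval from those stored results, using a 90% level as an arbitrary illustration; the macro `alpha` and the scalars `tcrit`, `lower`, and `upper` are names introduced here.

```stata
* Request a 90% level; any level() value works the same way
quietly ttest x == 0, level(90)

* Tail probability implied by the stored confidence level
local alpha = 1 - r(level) / 100

* Critical value from the stored degrees of freedom
scalar tcrit = invt(r(df_t), 1 - `alpha' / 2)

* local and scalar do not clear r(), so the ttest results are still there
scalar lower = r(mu_1) - r(se) * tcrit
scalar upper = r(mu_1) + r(se) * tcrit

display lower
display upper
```

With `level(95)`, or with no `level()` option under the default settings, the same code should reproduce the two limits shown above.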