About this guide
The “Essential Statistics | Formulas & Stata Implementation” guide is designed for intuitive navigation and swift application. Each core statistical measure is accompanied by its formula, relevant Stata code snippet and a brief example using a sample dataset. Navigate via the table of contents to pinpoint specific measures or topics, and refer to the appendices for quick command references.
Example dataset
* Example data: copy and paste it into Stata
clear
input float x
.3488717
.2668857
.1366463
.028556867
.8689333
.3508549
.07110509
.32336795
.5551032
.875991
end
Mean / average
Mean formula
The mean, often referred to as the average, is the sum of all values in a dataset divided by the number of values. It is a measure of the central value of a set of numbers.
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
Using the above data, the mean of the values is calculated as:
\bar{x} = \frac{3.8263}{10} \approx 0.3826
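The same arithmetic can be reproduced in Stata without any summary command. This is a minimal sketch, assuming the example dataset above is in memory; the names running_sum and manual_mean are hypothetical and chosen only for illustration.

```stata
* Running total of x; its last value is the sum of all observations
generate double running_sum = sum(x)

* Divide the total by the number of observations (_N)
scalar manual_mean = running_sum[_N] / _N
display scalar(manual_mean)

* Remove the helper variable so the dataset is left unchanged
drop running_sum
```

The result should agree with the mean reported by the commands discussed in the following sections, up to display rounding.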
Extract mean from Stata commands
Various Stata commands return the mean value in scalars and macros.
Get mean from the sum command
The Stata sum command (short for summarize) stores the mean of a single variable in the scalar r(mean).
sum x
return list
scalars:
r(N) = 10
r(sum_w) = 10
r(mean) = .3826315952464938
r(Var) = .0901804964373037
r(sd) = .3003006767180249
r(min) = .0285568665713072
r(max) = .8759909868240356
r(sum) = 3.826315952464938
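Because r() results are overwritten by the next r-class command, it can help to copy the mean into a named scalar straight away. The following is a small sketch; the scalar name x_mean is an assumption made for this example.

```stata
* Run summarize quietly and keep the mean for later use
quietly summarize x
scalar x_mean = r(mean)

* Display the stored mean with four decimal places
display "Mean of x: " %9.4f scalar(x_mean)
```

Writing scalar(x_mean) rather than x_mean avoids any ambiguity if a variable with a similar name is created later.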
Get mean from ttest command
The ttest command returns the mean value in the scalar r(mu_1).
ttest x == 0
return list
scalars:
r(level) = 95
r(sd_1) = .3003006767180249
r(se) = .0949634121318857
r(p_u) = .0014881587244031
r(p_l) = .9985118412755969
r(p) = .0029763174488063
r(t) = 4.029252810704539
r(df_t) = 9
r(mu_1) = .3826315952464938
r(N_1) = 10
Get mean from the regress command
You can also obtain the mean and related statistics with the regress command. One might opt for this method because regress offers several options that could be of interest, such as robust standard errors. To retrieve the mean, regress the desired variable on a constant only, that is, without including any other variables. The resulting mean is stored in _b[_cons].
regress x
dis _b[_cons]
.3826316
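As an illustration of the robust standard errors mentioned above, the constant-only regression can be run with the vce(robust) option. This is a sketch assuming the example dataset is in memory; it shows where the estimate and its standard error are stored, not a recommendation for any particular variance estimator.

```stata
* Constant-only regression with robust standard errors
regress x, vce(robust)

* The intercept is the sample mean; _se[_cons] is its standard error
display _b[_cons]
display _se[_cons]
```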
Get mean from the mean command
The mean command produces estimates of means, along with their standard errors. It supports several useful options and prefixes, such as the svy prefix for survey data and the over() option to compute estimates for multiple subpopulations. The mean value can be retrieved from the matrix r(table) or from _b[varname].
mean x
dis _b[x]
.3826316
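The over() option can be sketched with a made-up grouping variable. The example dataset has no such variable, so grp below is purely hypothetical: it splits the ten observations into the first five and the last five.

```stata
* Hypothetical grouping variable: 0 for observations 1-5, 1 for 6-10
generate byte grp = (_n > 5)

* Estimate the mean of x separately for each group
mean x, over(grp)

* Remove the helper variable again
drop grp
```

The names of the stored coefficients after over() differ across Stata versions, so it is safest to read them from the results table or from matrix list e(b).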
Get mean from ameans command
ameans computes the arithmetic, geometric, and harmonic means, with their corresponding confidence intervals. The arithmetic mean is stored in the scalar r(mean).
ameans x
return list
scalars:
r(ub_h) = .
r(lb_h) = .
r(Var_h) = 109.9760819735142
r(mean_h) = .1368934612039689
r(ub_g) = .5598643537556708
r(lb_g) = .1176234956206313
r(Var_g) = 1.189209013823066
r(mean_g) = .2566187880146887
r(N_pos) = 10
r(ub) = .5974537582043968
r(lb) = .1678094322885908
r(Var) = .0901804964373037
r(mean) = .3826315952464938
r(N) = 10
r(level) = 95
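To pull out just the three means from the returned results listed above, something along these lines should work:

```stata
* Run ameans quietly and display the three stored means
quietly ameans x
display "Arithmetic mean: " r(mean)
display "Geometric mean:  " r(mean_g)
display "Harmonic mean:   " r(mean_h)
```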
Confidence intervals
A confidence interval gives a range of values that is likely to contain the true population parameter, based on the sample data. The confidence level (for example, 95%) is the proportion of such intervals, over repeated samples, that would contain the true parameter. The formula for the confidence interval of a mean, given the sample mean and its estimated standard error, is:
CI = \bar{x} \pm t \times \frac{s}{\sqrt{n}}
- CI is the confidence interval, representing the range in which we believe the population mean lies with a specified level of confidence.
- \bar{x} is the sample mean, which is the average of the observed values in the sample.
- t is the t-score from the t-distribution, which corresponds to the desired confidence level and degrees of freedom. For a sample of size n, the degrees of freedom are n - 1.
- s is the sample standard deviation, which measures the spread or dispersion of the sample data.
- n is the sample size, indicating the number of observations in the sample.
Manually find Confidence interval
Using the data given above, the inputs to the CI calculation are:
- Sample mean: \bar{x} = 0.3826
- Sample standard deviation: s = 0.3003
- Sample size: n = 10
- Degrees of freedom: df = n - 1 = 9
- t-score: t \approx 2.2622 for a 95% confidence level with 9 degrees of freedom (from the t-distribution table)
- Margin of error: t \times \frac{s}{\sqrt{n}} = 2.2622 \times 0.3003 / \sqrt{10} \approx 0.2148
- Lower\; CI = 0.3826 - \left(2.2622 \times 0.3003 / \sqrt{10}\right) \approx 0.1678
- Upper\; CI = 0.3826 + \left(2.2622 \times 0.3003 / \sqrt{10}\right) \approx 0.5974
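The manual steps above can also be carried out in Stata from the results of summarize. This is a minimal sketch, assuming the example dataset is in memory and a 95% confidence level; the scalar names ci_mean, ci_se, ci_t, ci_lower, and ci_upper are hypothetical.

```stata
* Collect the ingredients from summarize
quietly summarize x
scalar ci_mean = r(mean)
scalar ci_se   = r(sd) / sqrt(r(N))      // standard error s / sqrt(n)
scalar ci_t    = invt(r(N) - 1, 0.975)   // t-score for 95%, df = n - 1

* Mean plus or minus the margin of error
scalar ci_lower = scalar(ci_mean) - scalar(ci_t) * scalar(ci_se)
scalar ci_upper = scalar(ci_mean) + scalar(ci_t) * scalar(ci_se)
display "Lower CI: " scalar(ci_lower)
display "Upper CI: " scalar(ci_upper)
```

Small differences from the hand calculation are expected, because Stata works with the unrounded values.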
Find confidence intervals in Stata
Various Stata commands report confidence intervals in the results table on the Stata screen and store them in the matrix r(table). Some commands, such as ttest, do not store them at all. The following sections show how to get the confidence intervals in both cases.
Obtain confidence intervals from r(table)
Some Stata commands, such as regress and mean, return confidence intervals in the matrix r(table). We can retrieve the confidence intervals from this matrix, as in the following example:
mean x
__________________________________________________
Variable Mean Std. err. [95% conf. interval]
__________________________________________________
x .3826316 .0949634 .1678094 .5974538
==================================================
* Obtain the r(table)
matrix table = r(table)
* List the contents of the table
matrix list table
The matrix r(table) is shown below.
x | |
b | 0.3826316 |
se | 0.09496341 |
t | 4.0292528 |
pvalue | 0.00297632 |
ll | 0.16780943 |
ul | 0.59745376 |
df | 9 |
crit | 2.2621572 |
eform | 0 |
* Get the confidence intervals
local lower_ci = table[5,1]
local upper_ci = table[6,1]
dis `lower_ci'
.16780943
dis `upper_ci'
.59745376
The expression table[5,1] extracts a single element from the matrix. Since the lower confidence limit is located in row 5, column 1, we used table[5,1] to extract it. Similarly, the upper confidence limit is located in row 6, column 1. We then stored these values in the macros lower_ci and upper_ci, respectively.
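Hard-coded row and column numbers can be fragile if the layout of the matrix changes. As a possible alternative, the rows and columns can be looked up by name with the rownumb() and colnumb() functions. The sketch below assumes the row names ll and ul and the column name x shown in the matrix listing above.

```stata
* Re-run the estimation and copy r(table) into a matrix
quietly mean x
matrix table = r(table)

* Look up the positions by row and column name instead of by number
local lower_ci = table[rownumb(table, "ll"), colnumb(table, "x")]
local upper_ci = table[rownumb(table, "ul"), colnumb(table, "x")]

display `lower_ci'
display `upper_ci'
```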
Calculate confidence interval yourself
If a command does not leave behind the matrix r(table), we can use whatever pieces of the required statistics are available and construct the confidence intervals ourselves. Let’s use the case of the ttest command, which does not return confidence intervals.
ttest x == 0
* List all returned statistics stored as scalars
return list
scalars:
r(level) = 95
r(sd_1) = .3003006767180249
r(se) = .0949634121318857
r(p_u) = .0014881587244031
r(p_l) = .9985118412755969
r(p) = .0029763174488063
r(t) = 4.029252810704539
r(df_t) = 9
r(mu_1) = .3826315952464938
r(N_1) = 10
Given the available statistics, the only missing piece is the critical t-value. For a two-sided 95% confidence interval with df degrees of freedom (here df = 9, as stored in r(df_t)), this is the 97.5^{th} percentile of the t-distribution, since excluding 2.5% from each tail leaves 95% in the middle. We will use the invt() function, which returns the inverse of the cumulative Student’s t-distribution.
* Lower confidence interval
disp r(mu_1) - r(se) * invt(9, 0.975)
.16780943
* Upper confidence interval
disp r(mu_1) + r(se) * invt(9, 0.975)
.59745376
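The degrees of freedom and the confidence level need not be typed by hand, since ttest stores both. One way to generalize the calculation is sketched below, using r(df_t) and r(level); the scalar names p_upper, t_crit, tt_lower, and tt_upper are hypothetical.

```stata
quietly ttest x == 0

* Upper-tail percentile implied by the confidence level (0.975 for 95%)
scalar p_upper = (1 + r(level) / 100) / 2

* Critical t-value for the stored degrees of freedom
scalar t_crit = invt(r(df_t), scalar(p_upper))

* Confidence limits built from the stored mean and standard error
scalar tt_lower = r(mu_1) - r(se) * scalar(t_crit)
scalar tt_upper = r(mu_1) + r(se) * scalar(t_crit)
display scalar(tt_lower)
display scalar(tt_upper)
```

With this form, running ttest with a different level() option, for example level(90), changes the interval without any further edits to the code.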