About this guide
The “Essential Statistics | Formulas & Stata Implementation” guide is designed for quick navigation and application. Each core statistical measure is accompanied by its formula, a relevant Stata code snippet, and a brief example using a sample dataset. Navigate via the table of contents to pinpoint specific measures or topics, and refer to the appendices for quick command references.
Example dataset
[code]clear
input float x
.3488717
.2668857
.1366463
.028556867
.8689333
.3508549
.07110509
.32336795
.5551032
.875991
end
[/code]
Mean / average
Mean formula
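The mean, often referred to as the average, is the sum of all values in a dataset divided by the number of values present. It represents the central value of a set of numbers.
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
Here x_i is the i-th observation and n is the number of observations.
Using the above data (n = 10, with the values summing to about 3.8263), the mean of the values is calculated as:
\bar{x} = \frac{3.8263}{10} \approx 0.3826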
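The same arithmetic can be sketched directly in Stata, without any estimation command. This is only an illustration, assuming the example dataset above is in memory and x has no missing values; x_total is a hypothetical helper variable that is not part of the guide.

[code]* Minimal sketch: total of x divided by the number of observations
* x_total is a hypothetical helper variable holding the running sum of x
generate double x_total = sum(x)
* The last observation of a running sum is the overall total
display x_total[_N] / _N
drop x_total
[/code]

The displayed value should agree with the mean reported by the commands in the following sections.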
Extract mean from Stata commands
Various Stata commands return the mean value in scalars and macros.
Get mean from the sum command
The Stata sum command stores the mean value in the scalar r(mean) for a single variable.
[code]sum x
return list
scalars:
r(N) = 10
r(sum_w) = 10
r(mean) = .3826315952464938
r(Var) = .0901804964373037
r(sd) = .3003006767180249
r(min) = .0285568665713072
r(max) = .8759909868240356
r(sum) = 3.826315952464938
[/code]
Get mean from ttest command
The ttest command returns the mean value in the scalar r(mu_1).
[code]ttest x == 0
return list
scalars:
r(level) = 95
r(sd_1) = .3003006767180249
r(se) = .0949634121318857
r(p_u) = .0014881587244031
r(p_l) = .9985118412755969
r(p) = .0029763174488063
r(t) = 4.029252810704539
r(df_t) = 9
r(mu_1) = .3826315952464938
r(N_1) = 10
[/code]
Get mean from the regress command
You can also obtain the mean and related statistics using the regress command. One might opt for this method because regress offers options that could be of interest, such as robust standard errors. To retrieve the mean, regress the desired variable on a constant only, that is, without including any other variables. The resulting mean is stored in _b[_cons].
[code]. regress x
(output omitted)
. dis _b[_cons]
.3826316
[/code]
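As an illustration of the robust standard errors mentioned above, the constant-only regression could be run with the vce(robust) option. This is a minimal sketch, assuming the example dataset is in memory; the point estimate of the mean is unaffected by the choice of vce().

[code]* Minimal sketch: constant-only regression with robust standard errors
regress x, vce(robust)
* The mean of x is the coefficient on the constant
display _b[_cons]
* Its standard error is available in _se[_cons]
display _se[_cons]
[/code]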
Get mean from the mean command
The mean command produces estimates of means, along with standard errors. Several useful options and prefixes are available, such as the svy prefix for survey data and the over() option to compute estimates for multiple subpopulations. The mean can be retrieved from the matrix r(table) or from _b[varname].
[code]. mean x
(output omitted)
. dis _b[x]
.3826316
[/code]
Get mean from ameans command
ameans computes the arithmetic, geometric, and harmonic means, with their corresponding confidence intervals. The arithmetic mean is stored in the scalar r(mean).
[code]ameans x
return list
scalars:
r(ub_h) = .
r(lb_h) = .
r(Var_h) = 109.9760819735142
r(mean_h) = .1368934612039689
r(ub_g) = .5598643537556708
r(lb_g) = .1176234956206313
r(Var_g) = 1.189209013823066
r(mean_g) = .2566187880146887
r(N_pos) = 10
r(ub) = .5974537582043968
r(lb) = .1678094322885908
r(Var) = .0901804964373037
r(mean) = .3826315952464938
r(N) = 10
r(level) = 95
[/code]
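To see where r(mean_g) comes from, the geometric mean can be reproduced by hand as the exponential of the mean of the logged values. This is a minimal sketch, assuming all values of x are positive (as they are in the example dataset); ln_x is a hypothetical helper variable.

[code]* Minimal sketch: geometric mean as exp(mean of ln(x))
* ln_x is a hypothetical helper variable, not part of the guide
generate double ln_x = ln(x)
quietly summarize ln_x
display exp(r(mean))
drop ln_x
[/code]

The displayed value should be close to the r(mean_g) value of about .2566 listed above.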
Confidence intervals
A confidence interval gives a range of values that is likely to contain the true population parameter, based on the sample data. The confidence level expresses how confident we can be that the interval contains the true parameter. The formula for the confidence interval of a mean, given the sample mean, the sample standard deviation, and the sample size, is:
CI = \bar{x} \pm t \times \frac{s}{\sqrt{n}}
where:
- CI is the Confidence Interval, representing the range in which we believe the population mean lies with a specified level of confidence.
- \bar{x} is the sample mean, which is the average of the observed values in the sample.
- t is the t-score from the t-distribution, which corresponds to the desired confidence level and degrees of freedom. For a sample size n, the degrees of freedom are typically n-1.
- s is the sample standard deviation, which measures the spread or dispersion of the sample data.
- n is the sample size, indicating the number of observations in the sample.
Manually find Confidence interval
Using the data given above, the inputs to the CI calculation are:
- Sample mean : \bar{x} = 0.3826
- Sample standard deviation : s = 0.3003
- Sample size : n = 10
- Degrees of freedom : df = n-1 = 9
- t-score : t \approx 2.2622 for a 95% confidence level with 9 degrees of freedom (from the t-distribution table)
- Margin of error : t \times \frac{s}{\sqrt{n}} = 2.2622 \times 0.3003 / \sqrt{10} \approx 0.2148
- Lower\; CI = 0.3826 - \left(2.2622 \times 0.3003 / \sqrt{10}\right) \approx 0.1678
- Upper\; CI = 0.3826 + \left(2.2622 \times 0.3003 / \sqrt{10}\right) \approx 0.5974
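The manual calculation can also be sketched with Stata scalars, taking the mean, standard deviation, and sample size from summarize rather than from a table. This is only an illustration, assuming the example dataset is in memory; ci_mean, ci_se, and ci_t are hypothetical scalar names.

[code]* Minimal sketch: 95% CI of the mean from summarize results
* ci_mean, ci_se and ci_t are hypothetical scalar names
quietly summarize x
scalar ci_mean = r(mean)
* Standard error of the mean: s / sqrt(n)
scalar ci_se = r(sd) / sqrt(r(N))
* Two-sided 95% critical value with n-1 degrees of freedom
scalar ci_t = invt(r(N) - 1, 0.975)
display "Lower CI: " ci_mean - ci_t * ci_se
display "Upper CI: " ci_mean + ci_t * ci_se
[/code]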
Find confidence intervals in Stata
Various Stata commands report confidence intervals in the results table on the Stata screen and store them in the matrix r(table). Some commands, such as ttest, do not store them at all. The following sections show how to get the confidence intervals in both cases.
Obtain confidence intervals from r(table)
Some Stata commands such as regress and mean return confidence intervals in the matrix r(table). We can retrieve the confidence intervals from this matrix, see the following example:
[code]mean x
__________________________________________________
Variable Mean Std. err. [95% conf. interval]
__________________________________________________
x .3826316 .0949634 .1678094 .5974538
==================================================
* Obtain the r(table)
matrix table = r(table)
* List the contents of the table
matrix list table
[/code]
The matrix r(table) is shown below.
| | x |
| b | 0.3826316 |
| se | 0.09496341 |
| t | 4.0292528 |
| pvalue | 0.00297632 |
| ll | 0.16780943 |
| ul | 0.59745376 |
| df | 9 |
| crit | 2.2621572 |
| eform | 0 |
[code]local lower_ci = table[5,1]
local upper_ci = table[6,1]
dis `lower_ci'
.16780943
dis `upper_ci'
.59745376
[/code]
The expression table[5,1] extracts a single element from the matrix. Since the lower confidence limit is in row 5, column 1, we used table[5,1] to extract it. Similarly, the upper confidence limit is in row 6, column 1. We then stored these values in the local macros lower_ci and upper_ci, respectively.
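Hard-coded row numbers depend on the layout of r(table). As a hedged alternative, the rows and columns can be looked up by name with the rownumb() and colnumb() matrix functions. This minimal sketch assumes the example dataset is in memory and that the rows are labeled ll and ul, as in the matrix shown above.

[code]* Minimal sketch: look up elements of r(table) by row and column name
quietly mean x
matrix table = r(table)
* rownumb() and colnumb() return the position of a named row or column
local lower_ci = table[rownumb(table, "ll"), colnumb(table, "x")]
local upper_ci = table[rownumb(table, "ul"), colnumb(table, "x")]
dis `lower_ci'
dis `upper_ci'
[/code]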
Calculate confidence interval yourself
If a command does not leave behind the matrix r(table), we can use whatever pieces of required statistics are available and construct the confidence intervals ourselves. Let's use the case of the ttest command that does not return confidence intervals.
[code]ttest x == 0
* List all returned statistics stored as scalars
return list
scalars:
r(level) = 95
r(sd_1) = .3003006767180249
r(se) = .0949634121318857
r(p_u) = .0014881587244031
r(p_l) = .9985118412755969
r(p) = .0029763174488063
r(t) = 4.029252810704539
r(df_t) = 9
r(mu_1) = .3826315952464938
r(N_1) = 10
[/code]
Given the available statistics, the only missing piece is the critical t-score for a two-sided interval. At the 95% confidence level with df degrees of freedom, this is the t-value at the 97.5th percentile (since we exclude 2.5% from each tail to leave a total of 95% in the middle). We will use the invt() function, which returns the inverse of the cumulative Student's t-distribution.
[code]* Lower confidence interval
. disp r(mu_1) - r(se) * invt(9, 0.975)
.16780943
* Upper confidence interval
. disp r(mu_1) + r(se) * invt(9, 0.975)
.59745376
[/code]
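The degrees of freedom and the confidence level do not have to be typed by hand, because ttest stores both. The following is a minimal sketch, assuming the example dataset is in memory; t_crit is a hypothetical scalar name, and invttail() is used here in place of invt() because it takes the upper-tail probability directly.

[code]* Minimal sketch: CI bounds at the level and df stored by ttest
quietly ttest x == 0
* Upper-tail probability is (1 - level/100) / 2, e.g. 0.025 for 95%
scalar t_crit = invttail(r(df_t), (1 - r(level)/100) / 2)
display "Lower: " r(mu_1) - r(se) * t_crit
display "Upper: " r(mu_1) + r(se) * t_crit
[/code]

Running ttest with a different level() option, for example ttest x == 0, level(90), should then change the bounds without any edits to the remaining lines.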