The symmetric Gaussian distribution is a common choice for a proposal distribution q(·), and this is
the one used in the original Metropolis algorithm.
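With a symmetric proposal, q(θ∗|θ_{t−1}) = q(θ_{t−1}|θ∗), so the proposal densities cancel from the
MH acceptance ratio; writing the target posterior density as p(θ|y), the acceptance probability reduces
to min{1, p(θ∗|y)/p(θ_{t−1}|y)}, and a candidate with higher posterior density than the current state is
always accepted.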
Another important MCMC method that can be viewed as a special case of MH is Gibbs sampling
(Gelfand et al. 1990), where the updates are the full conditional distributions of each parameter
given the rest of the parameters. Gibbs updates are always accepted. If θ = (θ_1, . . . , θ_d) and, for
j = 1, . . . , d, q_j is the conditional distribution of θ_j given the rest θ_{−j}, then the Gibbs algorithm
is the following. For t = 1, . . . , T − 1 and for j = 1, . . . , d: θ_j^t ∼ q_j(·|θ_{−j}^{t−1}). This step is referred
to as a Gibbs update.
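As a concrete illustration, the following Python sketch runs a Gibbs sampler for a bivariate normal
target with correlation r, for which both full conditionals are normal in closed form; the target, seed,
and chain length are assumptions made for this example, not part of the algorithm above.

    import numpy as np

    # Gibbs sampler for a bivariate normal target with zero means, unit
    # variances, and correlation r: each full conditional is again normal,
    #   theta_1 | theta_2 ~ N(r*theta_2, 1 - r^2), and symmetrically.
    rng = np.random.default_rng(12345)
    r = 0.8               # illustrative correlation
    T = 10_000            # illustrative chain length
    theta = np.zeros((T, 2))
    sd = np.sqrt(1 - r**2)

    for t in range(1, T):
        # Update each component from its full conditional given the most
        # recent values of the other; every Gibbs update is accepted.
        theta[t, 0] = rng.normal(r * theta[t - 1, 1], sd)
        theta[t, 1] = rng.normal(r * theta[t, 0], sd)

    print(np.corrcoef(theta[1000:].T)[0, 1])   # close to r = 0.8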
All MCMC methods share some limitations and potential problems. First, any simulated chain is
influenced by its starting values, especially for short MCMC runs. It is required that the starting point
has a positive posterior probability, but even when this condition is satisfied, if we start somewhere
in a remote tail of the target distribution, it may take many iterations to reach a region of appreciable
probability. Second, because there is no obvious stopping criterion, it is not easy to decide how long
to run the MCMC algorithm to achieve convergence to the target distribution. Third, the observations
in MCMC samples are strongly dependent and this must be taken into account in any subsequent
statistical inference. For example, the errors associated with the Monte Carlo integration should be
calculated according to (7), which accounts for autocorrelation.
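For example, one standard way to account for this autocorrelation when computing the Monte Carlo
standard error is the method of batch means; the Python sketch below illustrates that general idea (it
is not a transcription of formula (7), and the number of batches is an arbitrary choice).

    import numpy as np

    def batch_means_mcse(draws, n_batches=50):
        # Autocorrelation-aware Monte Carlo standard error via batch means:
        # split the chain into contiguous batches, average each batch, and
        # estimate the error of the overall mean from the batch means.
        draws = np.asarray(draws)
        m = len(draws) // n_batches
        batch_means = draws[:n_batches * m].reshape(n_batches, m).mean(axis=1)
        return batch_means.std(ddof=1) / np.sqrt(n_batches)

For a positively autocorrelated chain, this estimate is larger than the naive i.i.d. formula s/√T, which
would overstate the precision of the Monte Carlo estimate.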
Adaptive random-walk Metropolis–Hastings
The choice of a proposal distribution q(·) in the MH algorithm is crucial for the mixing properties
of the resulting Markov chain. The problem of determining an optimal proposal for a particular target
posterior distribution is difficult and is still being researched actively. All proposed solutions are based
on some form of an adaptation of the proposal distribution as the Markov chain progresses, which is
carefully designed to preserve the ergodicity of the chain, that is, its tendency to converge to the target
distribution. These methods are known as adaptive MCMC methods (Haario, Saksman, and Tamminen
[2001]; Giordani and Kohn [2010]; and Roberts and Rosenthal [2009]).
The majority of adaptive MCMC methods are random-walk MH algorithms with updates of the
form: θ∗ = θ_{t−1} + Z_t, where Z_t follows some symmetric distribution. Specifically, we consider a
Gaussian random-walk MH algorithm with Z_t ∼ N(0, ρ²Σ), where ρ is a scalar controlling the scale
of random jumps for generating updates and Σ is a d-dimensional covariance matrix. One of the first
important results regarding adaptation is from Gelman, Gilks, and Roberts (1997), where the authors
derive the optimal scaling factor ρ = 2.38/√d and note that the optimal Σ is the true covariance
matrix of the target distribution.
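To make this concrete, here is a minimal Python sketch of a Gaussian random-walk MH sampler using
the fixed scaling ρ = 2.38/√d; the standard normal target, seed, and run length are assumptions chosen
for the example.

    import numpy as np

    def log_target(theta):
        # Illustrative target: standard d-dimensional normal, up to a constant.
        return -0.5 * theta @ theta

    d = 5
    rho = 2.38 / np.sqrt(d)     # scaling of Gelman, Gilks, and Roberts (1997)
    Sigma = np.eye(d)           # here equal to the true target covariance
    L = np.linalg.cholesky(rho**2 * Sigma)
    rng = np.random.default_rng(12345)

    theta = np.zeros(d)
    accepted = 0
    T = 20_000
    for t in range(T):
        proposal = theta + L @ rng.standard_normal(d)   # theta* = theta_{t-1} + Z_t
        # The proposal is symmetric, so the MH ratio is the target ratio alone.
        if np.log(rng.uniform()) < log_target(proposal) - log_target(theta):
            theta, accepted = proposal, accepted + 1

    print(accepted / T)         # acceptance rate, roughly 0.25-0.3 here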
Haario, Saksman, and Tamminen (2001) propose that Σ be estimated by the empirical covariance
matrix plus a small diagonal matrix ε × I_d to prevent zero covariance matrices. Alternatively, Roberts
and Rosenthal (2009) propose a mixture of the two covariance matrices,
Σ_t = β Σ̂ + (1 − β)Σ_0
for some fixed covariance matrix Σ_0 and β ∈ [0, 1].
Because the proposal distribution of an adaptive MH algorithm changes at each step, the ergodicity
of the chain is not necessarily preserved. However, under certain assumptions about the adaptation
procedure, the ergodicity does hold; see Roberts and Rosenthal (2007), Andrieu and Moulines (2006),
Atchadé and Rosenthal (2005), and Giordani and Kohn (2010) for details.