Understanding Association Structure: Foundations of Categorical Data Analysis

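Updated 2024-07-18 · PDF, 1.77 MB

"Lectures on Categorical Data Analysis" by Tamás Rudas is a detailed textbook on categorical data analysis. The author holds positions at the Hungarian Academy of Sciences' institute for social science research and at Eötvös Loránd University. The book belongs to the Springer Texts in Statistics series and aims to provide the theoretical foundations of categorical data analysis, a field that is central to research in the social, political, and behavioral sciences, marketing, medicine, and biology. It emphasizes understanding and analyzing the structure of association among variables and the effect of other variables on these interactions.

At its core, categorical data analysis is about understanding patterns and associations in non-numerical data, and it has considerable educational value for learning statistical concepts. The author uses the properties of categorical data to make many statistical principles easier to see, and the book explains not only how something is done but also why. The selection of material is subjective and favors conceptual explanation and rigorous proof over brevity. The book combines mathematical precision with intuition and ties the theory closely to practical data collection and analysis. Even the notation is adjusted where this helps to highlight key points. The book suits readers with some mathematical background, especially those with some knowledge of statistics, and can serve as an introductory text on categorical data analysis.

Topics may include, but are not limited to:

1. Definition and types of categorical data
2. The chi-squared test and its use in testing the independence of variables
3. Models for ordered and unordered categorical variables, such as logistic regression and multinomial logistic regression
4. Multi-class classification problems, for example softmax regression
5. Conditional probability and Bayesian analysis for categorical data
6. Cluster analysis, such as K-means and hierarchical clustering
7. Variants of principal component analysis and factor analysis for categorical data
8. Dimension reduction techniques such as correspondence analysis
9. Methods for handling missing values and outliers
10. Case studies showing how these methods are applied in different fields

By studying the book in depth, readers can master the tools for processing and analyzing categorical data, understand the statistical principles behind them, and apply this knowledge to real research projects. The book is a valuable resource for researchers and students working in the social sciences, market research, health care, and related fields.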

1.2 Categorical or Continuous Data

When a variable is considered categorical, its distribution may be described by a finite number of parameters, and yet this approach implies no restriction on the distribution the variable may have. This is in contrast with the continuous case, where describing the distribution with a finite number of parameters requires strong assumptions about a parametric family.

For the analysis of ordinal data, one may also consider nonparametric methods that rely not on the actual values of the observations but only on their ranks among all observations [52]. Unfortunately, many of these methods assume that observing the same value twice has zero probability (no ties). This assumption is appropriate when ranks are derived for observations from a hypothetical continuous random variable (or rather, a categorical variable with very many categories), but it does not seem appropriate when only a small number of ordered categories are possible. If, say, a Likert scale has 7 categories and the sample size is 1000, one cannot hope not to see ties.
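
The pigeonhole argument behind the last sentence can be checked directly. The sketch below is only an illustration: the helper name `min_largest_tie_group` and the uniformly drawn responses are assumptions made for the example, not part of the text.

```python
import math
import random
from collections import Counter

def min_largest_tie_group(n_obs: int, n_categories: int) -> int:
    """Smallest possible size of the largest group of tied observations.

    By the pigeonhole principle, at least one category must hold
    ceil(n_obs / n_categories) observations.
    """
    return math.ceil(n_obs / n_categories)

# Hypothetical sample: 1000 answers on a 7-point Likert scale, drawn uniformly.
rng = random.Random(0)
sample = [rng.randint(1, 7) for _ in range(1000)]
counts = Counter(sample)

bound = min_largest_tie_group(1000, 7)   # 143
largest = max(counts.values())           # size of the largest tie group in this sample
print(bound, largest >= bound)
```

Whatever the response probabilities are, at least 143 of the 1000 observations share a category, so rank-based methods that assume no ties cannot be applied without modification.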

Another interesting approach is to assume that a variable measured on the ordinal level is the manifestation of a continuous variable through certain cut-points. Every observable category is equivalent to the value of the unobservable (latent) variable being between two adjacent cut-points. For example, it may be assumed that job satisfaction is a continuous characteristic, and respondents are asked to report their positions on a Likert scale when prompted with the question "How happy are you with your current job?". In such cases, some effort to recover certain properties of the underlying continuous variable may be made. Unfortunately, without making further assumptions about the latent variable, few of its characteristics can be deduced. For example, a continuous uniform latent variable, by appropriate choice of the cut-points, may be transformed into a unimodal or a bimodal ordinal variable, just like an underlying variable with a normal distribution may be cut into a bimodal or into a highly skewed distribution. There are situations, however, when knowledge available about the latent variable may be reliably incorporated into the analysis. For example, in the medical and psychological literature, it is often assumed that a latent trait is not only continuous but also normally distributed in the population, but may only manifest itself if its value exceeds a threshold. This assumption is called the threshold model.
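
The effect of the cut-points on the shape of the observed ordinal distribution can be sketched numerically. All cut-point values below are hypothetical choices made for this illustration, and `category_probs` is a helper name introduced here, not notation from the book.

```python
from statistics import NormalDist

def category_probs(cdf, cut_points):
    """Probabilities of the ordinal categories induced by increasing cut-points.

    A category is observed when the latent value falls between two
    adjacent cut-points (with -infinity and +infinity at the two ends).
    """
    cum = [0.0] + [cdf(c) for c in cut_points] + [1.0]
    return [b - a for a, b in zip(cum, cum[1:])]

def uniform_cdf(x):
    """CDF of a latent variable that is uniform on [0, 1]."""
    return min(max(x, 0.0), 1.0)

normal_cdf = NormalDist(0.0, 1.0).cdf

# The same uniform latent variable with two hypothetical sets of cut-points.
unimodal = category_probs(uniform_cdf, [0.1, 0.3, 0.7, 0.9])  # about [0.1, 0.2, 0.4, 0.2, 0.1]
bimodal = category_probs(uniform_cdf, [0.4, 0.5, 0.6])        # about [0.4, 0.1, 0.1, 0.4]

# The same standard normal latent variable, cut into a bimodal or a skewed shape.
normal_bimodal = category_probs(normal_cdf, [-0.2, 0.0, 0.2])
normal_skewed = category_probs(normal_cdf, [1.0, 1.5, 2.0])

for name, probs in [("unimodal", unimodal), ("bimodal", bimodal),
                    ("normal_bimodal", normal_bimodal), ("normal_skewed", normal_skewed)]:
    print(name, [round(p, 3) for p in probs])
```

Because the observed shape is driven by the cut-points as much as by the latent distribution, the ordinal frequencies alone say little about the latent variable unless further assumptions are made.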

Sometimes, the dichotomy of numerical versus categorical variables is used. The same concept is also referred to by the names of quantitative versus qualitative variables. In most cases, authors identify these concepts with ratio or interval scales and ordinal or categorical levels of measurement. The dichotomy is less precise than the categorization into four levels of measurement. The position taken in this book is that, as the precise level of measurement of a variable often depends on its intended role in the analysis, the statistician may make decisions as to what characteristics of the categories of a variable to rely on. A minimal assumption is that of a categorical level of measurement. The advantages and disadvantages of such an assumption need to be evaluated on a case-by-case basis.

1 The Role of Categorical Data Analysis

1.3 Interaction in Statistical Analysis

The decision about categorical or continuous modeling of the variables is further motivated by the fact that the choice of continuous variables and the most often implied choice of multivariate normality, independently of its appropriateness from a substantive point of view or of the level of measurement defined by the data gathering procedure, also implies a simplification with respect to the statistical models that may be analyzed. To illustrate this point, consider a regression-type problem with response variable Y and explanatory variables X and Z. The research problem is called a regression-type problem, instead of a regression problem, to emphasize the fact that the variables concerned have predefined roles, but they are not necessarily continuous or multivariate normal, in which case a standard (linear) regression analysis might be appropriate. Rather, the possibility of treating them as categorical and the advantages and disadvantages of this choice are being investigated.

As in any regression-type problem, one wants to answer three questions:

1. Which of the potential explanatory variables have an effect on the response variable?

2. Out of those explanatory variables which do have an effect on the response variable, which ones have strong and which ones have weak effects?

3. If there are several explanatory variables which have an effect, is their joint effect different from what one would expect based on their separate effects?

The methods that can be applied to answer these questions will be discussed in detail in Sect. 11.1, and we concentrate here on the third question which is, no doubt, the most intriguing out of the three. In fact, even the precise meaning of this question may require some clarification.

1.3.1 Joint Effects in a Regression-Type Problem Under Joint Normality

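Assume first that $(X, Y, Z)' \sim N(\mu, \Sigma)$, with $\mu = (\mu_X, \mu_Y, \mu_Z)'$ and $\Sigma = (\sigma_{UV})$, $U, V \in \{X, Y, Z\}$. Then, the conditional expectations of the response are as follows:

$$E(Y \mid X = x) = \mu_Y + \sigma_{YX}\,\sigma_{XX}^{-1}(x - \mu_X), \tag{1.1}$$

$$E(Y \mid Z = z) = \mu_Y + \sigma_{YZ}\,\sigma_{ZZ}^{-1}(z - \mu_Z), \tag{1.2}$$

and

$$E(Y \mid X = x, Z = z) = \mu_Y + \Sigma_{1(23)}\,\Sigma_{(23)}^{-1}\big((x, z)' - (\mu_X, \mu_Z)'\big), \tag{1.3}$$

with $\Sigma_{1(23)} = (\sigma_{YX}, \sigma_{YZ})$ the covariances of the response with the two explanatory variables and $\Sigma_{(23)}$ the covariance matrix of $(X, Z)'$.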


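Further, if $\sigma_{YX} = \sigma_{YZ} = 0$, that is, if the response is uncorrelated with both explanatory variables, then the second term in (1.3) is zero, implying the following result: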

Proposition 1.2. In regression analysis under trivariate normality, if both explanatory variables are independent of the response, they do not have a joint effect on the response variable.
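
Proposition 1.2 can be checked numerically with the standard formula for the conditional mean of a multivariate normal distribution. This is a minimal sketch: the function name `conditional_mean_y`, the ordering (X, Y, Z) of the covariance matrix, and both example matrices are assumptions made for this illustration.

```python
def conditional_mean_y(mu, cov, x, z):
    """E(Y | X = x, Z = z) for (X, Y, Z)' ~ N(mu, cov).

    mu  : means in the order (X, Y, Z)
    cov : 3x3 covariance matrix in the same order (nested lists)
    """
    mu_x, mu_y, mu_z = mu
    s_xx, s_xz, s_zz = cov[0][0], cov[0][2], cov[2][2]
    s_yx, s_yz = cov[1][0], cov[1][2]
    det = s_xx * s_zz - s_xz * s_xz  # determinant of Cov(X, Z)
    # Coefficients: (s_yx, s_yz) multiplied by the inverse of Cov(X, Z).
    b_x = (s_yx * s_zz - s_yz * s_xz) / det
    b_z = (s_yz * s_xx - s_yx * s_xz) / det
    return mu_y + b_x * (x - mu_x) + b_z * (z - mu_z)

# Hypothetical case 1: Y is uncorrelated with X and with Z (X and Z are correlated).
cov_no_effect = [[1.0, 0.0, 0.5],
                 [0.0, 1.0, 0.0],
                 [0.5, 0.0, 1.0]]
print(conditional_mean_y((0.0, 2.0, 0.0), cov_no_effect, 3.0, -1.0))

# Hypothetical case 2: Y is correlated with both X and Z.
cov_effect = [[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.0]]
print(conditional_mean_y((0.0, 0.0, 0.0), cov_effect, 1.0, 1.0))
```

In the first case the conditional mean stays at the unconditional mean of Y for every pair (x, z), which is what the proposition states. In the second case it moves with x and z.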

1.3.2 Joint Effects in a Regression-Type Problem with Categorical Variables

The situation described in the previous subsection is thought to be so obvious by many that the facts to be illustrated next may be considered as paradoxical. Although regression analysis is usually interpreted as a method to analyze continuous, in particular normal, variables, researchers often face a similar problem when the variables are not continuous. The questions listed earlier in this section may be just as important when the response and explanatory variables are categorical. A more detailed discussion of the regression problem for categorical variables is given in Chap. 11; here only some of the important properties are illustrated.

Let now the response variable be C and the explanatory variables A and B. The notation is different from the one used with continuous variables to emphasize the fact that these variables are categorical. The behavior of a binary variable is often described by the odds of one of its categories against the other category. The odds is the ratio of the two probabilities associated with the two categories of the variable. If the two categories are denoted as 1 and 2, the value of the odds for the response variable is P(C = 1)/P(C = 2). The effect of an explanatory variable on the response can be best seen by comparing the odds of the response variable for those in different categories of the explanatory variable. If the two conditional odds are equal, the explanatory variable has no effect on the response. A comparison of the two conditional odds is the odds ratio

$$\frac{P(C = 1 \mid A = 1)/P(C = 2 \mid A = 1)}{P(C = 1 \mid A = 2)/P(C = 2 \mid A = 2)} = \frac{P(C = 1, A = 1)\,P(C = 2, A = 2)}{P(C = 1, A = 2)\,P(C = 2, A = 1)}$$

shown here for the effect of A on C. The meaning, use, and properties of the odds ratio will be discussed in detail in Chap. 6; see also [72]. Conditional odds and odds ratios may be applied to handling the regression problem with categorical variables.
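
The two forms of the odds ratio, as a ratio of conditional odds and as a cross-product ratio of joint probabilities, can be compared on a small example. The joint distributions below are hypothetical numbers chosen for illustration and are not taken from the book; the function names are likewise introduced here.

```python
def odds_ratio(joint):
    """Odds ratio for the effect of A on C, computed from conditional odds.

    joint[(c, a)] = P(C = c, A = a) for c, a in {1, 2}.
    """
    p_a1 = joint[(1, 1)] + joint[(2, 1)]  # P(A = 1)
    p_a2 = joint[(1, 2)] + joint[(2, 2)]  # P(A = 2)
    odds_a1 = (joint[(1, 1)] / p_a1) / (joint[(2, 1)] / p_a1)  # odds of C given A = 1
    odds_a2 = (joint[(1, 2)] / p_a2) / (joint[(2, 2)] / p_a2)  # odds of C given A = 2
    return odds_a1 / odds_a2

def cross_product_ratio(joint):
    """The same quantity written with joint probabilities only."""
    return (joint[(1, 1)] * joint[(2, 2)]) / (joint[(1, 2)] * joint[(2, 1)])

# Hypothetical joint distribution in which A has an effect on C.
joint = {(1, 1): 0.3, (2, 1): 0.2, (1, 2): 0.1, (2, 2): 0.4}
print(odds_ratio(joint), cross_product_ratio(joint))

# Hypothetical joint distribution in which C and A are independent.
independent = {(1, 1): 0.25, (2, 1): 0.25, (1, 2): 0.25, (2, 2): 0.25}
print(odds_ratio(independent))
```

Both functions return the same value up to rounding, and the value is 1 when the conditional odds are equal, that is, when A has no effect on C.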

As an example, a possible joint distribution of variables A, B, and C is shown in Table 1.1.

The marginal distribution of A and B derived from Table 1.1 is uniform, which implies that the variables A and B are independent. In spite of this, they do have a joint effect on C in the sense that the joint effect of A and B on C is not obtained from

Footnote: A paradox, of course, only means that the facts deviate from our expectations. Occasionally, our expectations prove to be ungrounded. For some of the paradoxes associated with probability, see [85].
