Classical statistics attaches a probability distribution to the residuals of prediction equation. Spatial statistics modifies this situation by specifying a prediction function that has \(Y\) on both sides of the equation, which means that a value at location \(i\) is at least partially a function of the values of \(Y\) at nearby locations. This conceptualization captures the essence of spatial autocorrelation.
SAR, simultaneous autoregression:
\[Y=\rho\pmb{W}Y+(\pmb{I}-\rho\pmb{W})\pmb{X}\pmb{\beta}+\pmb{\epsilon}\]where:
If considering \(\epsilon\) in the standard linear regression model \(Y=\pmb{X}\pmb{\beta}+\epsilon\) is spatial auto-related, then the model can be written as \(Y=\pmb{X}\pmb{\beta}+(\pmb{I}-\rho\pmb{W})^{-1}\epsilon\), and the spatial regression equation can be written as:
\[Y=\pmb{X}\pmb{\beta}+[\rho\pmb{W}(Y-\pmb{X}\pmb{\beta})+\pmb{\epsilon}]=\pmb{X}\pmb{\beta}+\xi\]然后本节内容就是对回归方程的近似估计(使用权重阵的特征值)和应用场景了。以下是根据《空间数据分析》一书的补充内容(虽然二者对空间回归方程的一般形式的定义有区别但本质上都是一致的)。
Anselin给出的空间回归的一般形式:
\(Y=\rho{W_1}Y+X\beta+u, u=\lambda{W_2}\epsilon+\mu, \mu\)~\(N[0,\sigma^2I]\)
其中:
情形 | 模型名 | 说明 |
---|---|---|
\(\rho=0,\lambda=0\) | 普通线性回归 | 模型中没有空间自相关的影响 |
\(\rho\not=0,\beta=0,\lambda=0\) | 一阶空间自回归模型 | 因变量受邻域内的其他因变量的影响 |
\(\rho\not=0,\beta\not=0,\lambda=0\) | 空间滞后模型 | Spatial Lag model,SLM,研究的因变量不仅与本区域的解释变量有关,还与相邻区域的因变量有关 |
\(\rho=0,\beta\not=0,\lambda=0\) | 空间误差模型 | Spatial Error Model,SEM,研究的因变量不仅与本区域的解释变量有关,还与相邻区域的解释变量有关 |
\(\rho\not=0,\beta\not=0,\lambda=0\) | 空间杜宾模型 | Spatial Dubin Model,SDM |
模型的选择步骤:先进行OLS回归,再进行LM-Lag和LM-Error(二者都基于Lagrange Multiplier,拉格朗日乘数诊断)诊断,若:
在本节的应用案例中也提到了判别分析(Discriminant Fuction Analysis),这里也作一简单回顾。
判别分析是判别样本所属类别的一种统计方法,是研究对象分成若干类型并已取得各种类型的一批已知样本的观测数据,在此基础上建立判别式,然后对未知类型的样本进行判别分类。(而对聚类分析而言,一批给定样本要划分的类别事先并不知道。需要通过聚类分析来确定类型,常与判别分析配合使用,即先聚类后判别),按《空间数据分析》一书,对三种判别准则作简单总结:
判别方法 | 描述 | 评价 |
---|---|---|
距离判别 | 根据已知分类的数据,分别计算各类的重心(各组的均值),判别准则是对任意给定样品,计算其到各类均值的距离1,哪个距离最小就将其归至哪个类 | 简单实用;未考虑先验概率,未考虑错判的损失 |
贝叶斯判别 | 计算被判样本\(x\)属于\(k\)个总体的条件概率\(P(n|x),n=1,2,3...,k\),比较\(k\)个概率的大小,将样本判归为出现概率最大的总体(或错判概率最小的总体) | 样本分布往往不满足属性条件独立性假设 |
费歇尔判别 | 设有A、B两个总体,分别有\(n_1、n_2\)个样本数据,每个样本有\(p\)个观测指标,每个样本可以看作\(p\)维空间中的一点,借助方差分析的思想构建一个线性判别函数 | Fisher LDA 在有监督的情况下,最大化地保留了分类信息,这一分类信息由一个非参指标,Fisher 指标来衡量;可能损失精度 |
In the preceding section, implementation of spatial autoregressive models requires nonlinear regression techniques but the error term assumption is still the normal probability model.
Eigenvector spatial filtering furnishes a sound methodology for estimating non-normal probability models with georeferenced data containing non-zero spatial autocorrelation. This methodology accounts for spatial autocorrelation in random variables by incorporating heterogeneity into parameters in order to model non-homogeneous populations. It renders a mixture of distributions that can be used to model observed georeferenced data whose various characteristics differ from those that are consistent with a single, simple, underlying distribution with constant parameters across all observations. The aim of this technique is to capture spatial autocorrelation effects with a linear combination of spatial proxy variables – namely, eigenvectors – rather than to identify a global spatial autocorrelation parameter governing average direct pairwise correlations between selected observed values. As such, it utilizes the misspecification interpretation of spatial autocorrelation, which assumes that spatial autocorrelation is induced by missing exogenous variables, which themselves are spatially autocorrelated, and hence relates to heterogeneity.
Eigenvector spatial filtering conceptualizes spatial dependency as common factor that is a linear combination of synthetic variates (the eigenvector of the matrix \(\pmb{(I-11^T/n)C(I-11^T/n)}\)) summarizing distinct features of the neighbors’ geographic configuration structure for a given georeferenced dataset.
这里涉及到广义线性模型 Generalized Linear Model,也作一简单回顾。
在GLM中,解释变量可以是连续变量或者名义变量,可以是连续的或者离散的,且因变量的线性预测值仅是解释变量的函数估计值,即因变量通过一个连接函数(可以是非线性的)依赖于自变量。
指数分布族(exponential family,如泊松分布、二项分布、几何分布等都属于指数分布族):$$f(y | \theta,\phi)=exp(\frac{y\theta-b(\theta)}{a(\theta)}+c(y,\phi))\(,其中\)\theta\(为自然参数,与均值有关;\)\phi\(为散布参数,与方差有关;\)a(\phi),b(\theta),c(y,\phi)$$为已知函数。 |
一个GLM由以下三部分组成:
应当采用最大似然法而非最小二乘法估计参数。
判断是否应该使用GLM的检验方法有Wald检验、约束检验和拟似然比检验等,原假设是符合线性模型的条件。
Logistic回归,将(0,1)区间内的变量映射到{0,1}中,则可以将对\(y_i\)的预测值替换为值为1的概率,则有\(g(\mu)=ln[\frac{p}{1-p}]=X\beta+\epsilon\),从而得到\(P=\frac{e^{X\beta+\epsilon}}{1+e^{X\beta+\epsilon}}\),其特点有:
Poisson回归:基于事件计数变量(事件发生的次数)建立的回归模型,因变量的对数与解释变量呈线性关系,即\(log(Y)=X\beta+\epsilon\)