Interpreting how R codes a dummy response variable in logistic regression

I am new to R and am having trouble interpreting the output of my logistic regression. My response variable has two values, "multiplex" and "subterraneus". When I call factor() on microtus.train$Group, the levels are listed as "multiplex" and "subterraneus", in that order. After fitting the model and predicting the response, I don't understand what the predicted probabilities mean. Are they the probability that an observation is "subterraneus"? When I ran contrasts(microtus.train$Group), I got the table below.

    > contrasts(microtus.train$Group)
                 subterraneus
    multiplex               0
    subterraneus            1

Based on this table, my interpretation is that the model predicts the probability of "subterraneus" (not the probability of "multiplex"), because "subterraneus" is dummy coded as 1. Is my assumption correct?

My code is given below. Thanks in advance for your help.

    library(Flury)
    data(microtus, package = "Flury")
    str(microtus)
    summary(microtus)

    # Create the training data frame: keep only the two known groups
    microtus.train <- subset(microtus,
                             Group %in% c("multiplex", "subterraneus"),
                             select = c("Group", "M1Left", "M2Left", "M3Left",
                                        "Foramen", "Pbone", "Length", "Height",
                                        "Rostrum"))

    # Drop the unused 3rd factor level
    microtus.train$Group <- droplevels(microtus.train$Group)
    factor(microtus.train$Group)

    # Intercept-only and full models
    nullModel.GLM <- glm(Group ~ 1, data = microtus.train, family = binomial())
    fullModel.GLM <- glm(Group ~ ., data = microtus.train, family = binomial())
    summary(nullModel.GLM)
    summary(fullModel.GLM)

    # Forward stepwise selection by AIC (k = 2)
    stepFwd.GLM <- step(nullModel.GLM,
                        scope = list(upper = fullModel.GLM),
                        direction = 'forward', k = 2)

    # Fitted probabilities on the training data
    stepFwd.GLM.fitResults <- predict(stepFwd.GLM, type = 'response')
    stepFwd.GLM.fitResults

    contrasts(microtus.train$Group)

Best answer

It's not the contrasts that matter, but the order of the factor levels (contrasts specify how the predictor variables are encoded as dummy variables). From ?glm:

For ‘binomial’ and ‘quasibinomial’ families the response can also be specified as a ‘factor’ (when the first level denotes failure and all others success)
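
The rule quoted from ?glm can be checked on a small example. The sketch below uses simulated data (the variable names and values are made up for illustration and are not the microtus measurements) and compares a fit on a factor response with a fit on an explicit 0/1 response in which the first level is coded 0.

```r
# Toy data (hypothetical values, not the microtus measurements)
set.seed(42)
x <- rnorm(100)
y <- factor(ifelse(runif(100) < plogis(x), "subterraneus", "multiplex"))

# Explicit 0/1 coding: first level -> 0 (failure), any other level -> 1 (success)
y01 <- as.integer(y != levels(y)[1])

fit_factor  <- glm(y ~ x, family = binomial())
fit_numeric <- glm(y01 ~ x, family = binomial())

# Compare the coefficients of the two fits
all.equal(coef(fit_factor), coef(fit_numeric))
```

If the two fits agree, the factor response is being treated exactly as "first level = 0, everything else = 1".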

Because R orders factor levels alphabetically by default, "multiplex" is (probably) the first level and "subterraneus" the second, so the logistic regression is predicting the probability of "subterraneus". You can check this with levels(microtus.train$Group) and, if necessary, change it by calling factor() with the levels argument set explicitly.
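
As a rough illustration of checking and changing the level order, here is a minimal sketch on simulated data. The vectors `x` and `group` are hypothetical stand-ins for a predictor and for microtus.train$Group; the Flury data are not used.

```r
# Toy data (hypothetical, not the microtus measurements)
set.seed(1)
n <- 200
x <- rnorm(n)
# Larger x makes "subterraneus" more likely
group <- factor(ifelse(runif(n) < plogis(2 * x), "subterraneus", "multiplex"))

levels(group)  # alphabetical by default, so "multiplex" comes first

fit <- glm(group ~ x, family = binomial())
p <- predict(fit, type = "response")

# Fitted probabilities refer to the second level ("subterraneus"),
# so their mean should be higher within that group
tapply(p, group, mean)

# To model P("multiplex") instead, put "multiplex" last in the levels
group2 <- factor(group, levels = c("subterraneus", "multiplex"))
fit2 <- glm(group2 ~ x, family = binomial())
p2 <- predict(fit2, type = "response")

# The two sets of fitted probabilities should be complements of each other
all.equal(unname(p), unname(1 - p2))
```

Reordering the levels only changes which outcome counts as "success"; the fitted model is otherwise the same, with the signs of the coefficients flipped.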
