Quick Navigation:

  • Post-Stratification Raking: Questions and Answers about Creating Weights
  • Creating Post-Stratification Weights in R
  • Using Weights in R
  • Count Models in R
  • Multinomial Logit Models in R
  • Ordinal Logit/Probit Models in R
  • Logit/Probit Models in R

Post-Stratification Raking: Questions and Answers about Creating Weights

posted Feb 8, 2017, 6:33 PM by Mia Costa   [ updated Oct 21, 2017, 4:03 PM ]

Last week I received some questions from a reader about my methods tutorial on creating post-stratification weights. Problems always arise (at least for me) when trying to apply a bit of code to new data, so it’s sometimes useful to see what problems you may run into and how to avoid them. Below are his questions about the post-stratification raking process, followed by my answers. I hope they help illuminate some of the more mundane parts of preparing data for analysis.

After I run all the code in your first post about using weights, the second one on how to create your own weights, and change some of the variable names to match those in the dataset, everything goes smoothly until the last bit of code, where I get this error message:

Error in na.fail.default(list(female = c(0, 1, 1, 1, 1, 0, 1, 0, 0, 1, :
missing values in object

Under the rake command description, it says the argument for sample.margins should not contain NA values, and the original variable for gender does—how do you deal with this? I think the same problem arises for non-gender variables too when they have NA values. Do you just exclude all observations with NA values?

I haven't run into this problem much personally because most of the data I work with has complete demographic information for respondents (putting demos questions at the end of the survey and making them mandatory helps). But yes, I would just exclude NAs. You can't weight on a value that is missing, and in your analysis, you wouldn't want to include unweighted observations in with observations that are weighted to population values.
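For instance, a minimal base-R sketch of dropping incomplete rows before raking (the data and variable names here are hypothetical):

```r
# toy data with missing values on the weighting variables (names hypothetical)
dat <- data.frame(female = c(0, 1, NA, 1),
                  white  = c(1, 1, 0, NA))
rake.vars <- c("female", "white")

# keep only respondents with complete data on the raking variables
dat <- dat[complete.cases(dat[, rake.vars]), ]
nrow(dat)  # 2 rows remain
```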

For my survey, I’m hoping to weight on four different variables:

1) Gender (man/woman)
2) Race/ethnicity (American Indian/Alaska Native, Asian, Black or African American, Hispanic or Latino, Native Hawaiian or Other Pacific Islander, International, Two or more races, White, Other/not specified)
3) Class year (senior, junior, sophomore, freshman, other)
4) Greek affiliation (yes, no)

Given this, how many times do you think R should go through the raking process (in terms of what I put for the “control = list(maxit)” portion of the rake call)? Is it one for each category (and corresponding indicator variable), making it 18 times in my case?

You can probably leave the entire control argument out, actually; you'll get a warning message if the raking doesn't converge, and you can increase the number of iterations from there. The default number of iterations is 10, and this is usually enough. Four is not that many variables to rake on, though your intuition is correct that convergence depends on the number of levels in each variable. Other things can affect convergence as well, such as having very few observations in some cells.

What do you mean by "increase the number of iterations from there"—is this something the rake function does by itself, or do I have to do something additional?

You can start with the default of 10 iterations (that is, exclude control = list(maxit) altogether), and then if the raking doesn't converge, which you will know from a warning message along the lines of "Could not converge", run the code again with a higher number of iterations until you reach convergence.
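A minimal sketch of what that rerun looks like, using toy data (the sample, variable name, and population shares are all hypothetical):

```r
library(survey)

# toy sample: 3 of 10 respondents are female; the population target is 50/50
dat <- data.frame(female = c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0))
svy.unw <- svydesign(ids = ~1, data = dat)  # warns: no weights supplied
female.dist <- data.frame(female = c(0, 1), Freq = nrow(dat) * c(0.5, 0.5))

# if rake() warned about non-convergence, rerun with a higher maxit
svy.w <- rake(design = svy.unw,
              sample.margins = list(~female),
              population.margins = list(female.dist),
              control = list(maxit = 50))
summary(weights(svy.w))
```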

How can I transfer the weights to the original data frame in order to easily apply them when calculating weighted means for question responses? At this point they’re in a svy.dat object I don’t really know what to do with.

Also, at some point in your example, you could tell that 3 respondents had a weight in excess of 8—how can you tell which weight each respondent got? (These questions are probably related.)

The weights are stored in your survey design object under "weights." You can pull those out, put them in a new variable in your data frame, and then tabulate that variable.

For example:

dat$weight <- weights(svy.dat) # make a variable that contains your weights for each respondent
names(dat) # double check to make sure it’s there
table(dat$weight)


For some of my variables, I’m giving more options (e.g. “other” in addition to “man”/”woman” for gender) than the population values have (e.g. administrative data only has percentage for man/woman). What would you suggest as a way to deal with minor discrepancies like this?

Has anyone else come up with an estimate of how many people identify as "other" in your population of interest? If so, use that. If not, just make an educated guess at what the "true" value of "other gender" is in the population.

That’s a good idea. In past surveys I’ve done, non-binary options have come out to around 1 percent. In this case, I’m guessing it’s reasonable to estimate that other = 1%, men = 49.5%, and women = 49.5%, right?

I'm not sure, because I don't know the gender breakdown for your target population ([your school’s] student body, right?). If you were just going to do 50-50 before, this seems reasonable. But if there are more women than men at [your school], for instance, you might want to adjust it based on what you know the female and male student population [at your school] to be.

I also have an issue if I want to use Greek affiliation as a weight: at my school, freshmen can’t join Greek houses, so they’re excluded from the denominator of affiliation percentages. Is there a way to add weights based on percent Greek for all respondents but freshmen?

You could go through the whole raking process twice. The first time, exclude freshmen and weight on Greek affiliation only. Then add freshmen back into the dataset and give them a baseline weight of 1. Then do it all again, but start from those weights (using the weights argument in svydesign and the new weight variable you created) instead of starting with an unweighted survey object:

svy.dat.stage2 <- svydesign(ids = ~1, data = dat, weights = ~weight)
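A rough sketch of that two-stage idea with toy data (all variable names and population shares here are hypothetical):

```r
library(survey)

# toy data: freshmen have no Greek affiliation recorded
dat <- data.frame(freshman = c(1, 1, 0, 0, 0, 0),
                  greek    = c(NA, NA, 1, 1, 0, 0))

# stage 1: rake non-freshmen on Greek affiliation only
upper <- dat[dat$freshman == 0, ]
svy.upper <- svydesign(ids = ~1, data = upper)
greek.dist <- data.frame(greek = c(0, 1), Freq = nrow(upper) * c(0.6, 0.4))
svy.upper <- rake(svy.upper, list(~greek), list(greek.dist))

# stage 2: freshmen get a baseline weight of 1, everyone else keeps their
# stage-1 weight; this design can then be raked on the remaining variables
dat$weight <- 1
dat$weight[dat$freshman == 0] <- weights(svy.upper)
svy.all <- svydesign(ids = ~1, data = dat, weights = ~weight)
```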

I think you might have implied this, but is it true that I don’t need to create two different indicator weight variables for something like gender (i.e. where they’re complements to 100 percent)? What about in the case of class year—should I create 5 different indicators (senior, junior, sophomore, freshman, other that all add up to 100 percent) in this case?

In my last post, all of the variables I use take on a 0/1 distribution. For example, instead of having one “race” variable, I had a variable called “white”, where 1 indicates a respondent who is white and 0 indicates a respondent of another race, and another variable called “black”, and so on. But you don’t have to create separate indicator variables for each level of every variable. For example, you could use one variable for "gender" with female/male/other as its levels. In the case of college class, instead of breaking up freshman/sophomore/junior into separate variables, you can collapse them into one and specify the population frequencies within each level.

So for example:

class.dist <- data.frame(class = c("freshman", "sophomore", "junior", "senior"),
Freq = nrow(dat) * c(0.25, 0.25, 0.25, 0.25))
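To see the single multi-level variable approach end to end, here is a toy sketch (the class sizes and the equal 25% population shares are hypothetical):

```r
library(survey)

# toy sample: unbalanced class years, equal shares in the population
dat <- data.frame(class = factor(c("freshman", "freshman", "sophomore",
                                   "junior", "junior", "junior",
                                   "senior", "senior")))
svy.unw <- svydesign(ids = ~1, data = dat)  # warns: no weights supplied
class.dist <- data.frame(class = c("freshman", "sophomore", "junior", "senior"),
                         Freq = nrow(dat) * c(0.25, 0.25, 0.25, 0.25))
svy.w <- rake(svy.unw, sample.margins = list(~class),
              population.margins = list(class.dist))
round(weights(svy.w), 2)  # e.g., the lone sophomore is weighted up to 2
```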




Let me know if you would have answered these questions differently or if something needs further clarification!

Creating Post-Stratification Weights in R

posted May 2, 2016, 4:56 PM by Mia Costa   [ updated Oct 21, 2017, 4:03 PM ]

My last post went over how to use a pre-specified variable in a dataset to weight survey data. But what if you want to create the weights yourself? If you know the population values of the demographics you wish to weight on, you can create survey weights using an approach known as post-stratification raking.

As an example, we will use the 2014 Massachusetts Exit Poll data (http://people.umass.edu/schaffne/mass_exitpoll_2014.dta). The dataset already has a sampling weight included, to adjust for the stratified cluster sample approach taken. However, because of unequal response rates among different demographic groups, we also need to do additional post-stratification weighting. Here is what we know about the demographic composition of the electorate in 2014 according to voting records:

  Female: 53%
  White: 88%
  Black: 4%
  Hispanic/Latino: 5%
  Age 18-29: 7%
  Age 65 or over: 30%
We can use this information to produce post-stratification weights for the survey. First, the easiest thing to do is create indicator variables for each category we will weight on. If you are uninterested in this process, you can skip this next bit of code to get right to the survey weighting. Otherwise, we can recode as follows:

dat$female <- NA
dat$female[dat$gender=="Female" ] = 1
dat$female[dat$gender=="Male" ] = 0
dat$white <- NA
dat$white[dat$race=="White" ] = 1
dat$white[dat$race!="White" ] = 0
dat$black <- NA
dat$black[dat$race=="Black" ] = 1
dat$black[dat$race!="Black" ] = 0
dat$hispanic <- NA
dat$hispanic[dat$race=="Hispanic/Latino" ] = 1
dat$hispanic[dat$race!="Hispanic/Latino" ] = 0
dat$age18_29 <- NA
dat$age18_29[dat$age=="18-29"] = 1
dat$age18_29[dat$age!="18-29"] = 0
dat$age65_over <- NA
dat$age65_over[dat$age=="65 or over"] = 1
dat$age65_over[dat$age!="65 or over"] = 0
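As an aside, the same recodes can be written more compactly with ifelse. A sketch with toy data (the labels are hypothetical); note that NA comparisons stay NA, matching the behavior of the longer version above:

```r
# toy data; ifelse propagates NA just like the bracket-assignment recodes
dat <- data.frame(gender = c("Female", "Male", NA),
                  race   = c("White", "Black", "White"))
dat$female <- ifelse(dat$gender == "Female", 1, 0)
dat$white  <- ifelse(dat$race == "White", 1, 0)
dat$female  # 1 0 NA
```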

Now, before we weight to the population values of these variables, we create a survey design object (like in my previous post) but without specifying any weights. Note that if you want to re-weight starting from some other set of weights (for example, if the sample is already weighted to account for the stratified cluster sample), you can specify weights like we did in the last example, and the new weights will be calculated starting from that weight variable. Here, we will not use this argument, so we create an unweighted survey design object. You should get a warning message following the command telling you that you did not supply any weights or probabilities (this is what you want).

svy.dat.unweighted <- svydesign(ids = ~1, data = dat)

Now, we can use the rake command to weight to the population values. To do this, we first set up data frames that specify how often each level occurs in the population: 


# create dataframes based on the population values 
female.dist <- data.frame(female = c("0", "1"), 
                                    Freq = nrow(dat) * c(0.47, 0.53))
white.dist <- data.frame(white = c("0", "1"),
                                    Freq = nrow(dat) * c(0.12, 0.88)) 
black.dist <- data.frame(black = c("0", "1"),
                                    Freq = nrow(dat) * c(0.96, 0.04)) 
hispanic.dist <- data.frame(hispanic = c("0", "1"),
                                    Freq = nrow(dat) * c(0.95, 0.05)) 
age18.dist <- data.frame(age18_29 = c("0", "1"),
                                    Freq = nrow(dat) * c(0.93, 0.07)) 
age65.dist <- data.frame(age65_over = c("0", "1"),
                                    Freq = nrow(dat) * c(0.70, 0.30))


The first vector in each data frame describes the levels of the associated factor variable (0 and 1 in our case). The second vector describes the corresponding frequencies. For example, in the female distribution data frame, I multiply the known proportion for each level by the number of rows in the dataset we are computing weights for, which gives absolute frequencies; that is, the number of observations that should be female given the population proportion. I specify the frequencies as .47 and .53 because on the variable female, we are looking for 47% to be 0 (male) and 53% to be 1 (female). In the next data frame, the frequencies are .12 and .88, because we are looking for 12% to take on a 0 for the white variable (indicating non-white) and 88% to take on a value of 1 (indicating white). And so on...

Next, we combine the unweighted survey design object, the data frames with the population distributions just created, and the marginal distributions of our sample:

svy.dat <- rake(design = svy.dat.unweighted, 
                sample.margins = list(~female, ~white, ~black, ~hispanic, ~age18_29, ~age65_over), 
                population.margins = list(female.dist, white.dist, black.dist, hispanic.dist, age18.dist, age65.dist), 
                control = list( maxit = 25))



The maxit value in the control list is simply the maximum number of times that R will go through the raking process. The default is 10, but you may want to set it higher when you are weighting on more variables.

Once you execute this command, weights are added to your survey design object. By asking for a summary, we can see that the mean weight is 1 (the mean should always be 1) and that the weights range from .49 to 12.55.  

summary(weights(svy.dat))

Often, pollsters trim their weights to ensure that no single respondent has too much influence over the point estimates. Only 3 respondents receive a weight in excess of 8, so we might wish to cap their weights at 8 so that nobody receives a weight above that value. We can do this by setting the lower and upper limits of the weights with trimWeights. The trimmed excess is redistributed among the observations that were not trimmed, which can push some weights back over the upper limit; setting strict = TRUE tells R to repeat the trimming until all weights respect the limits:
 
svy.dat <- trimWeights(svy.dat, lower=.4, upper=8, strict=TRUE)

Using Weights in R

posted Mar 22, 2016, 5:16 PM by Mia Costa   [ updated Oct 21, 2017, 4:02 PM ]

In a dataset, survey weights are captured by a variable that indicates how much each respondent should “count.” For example, a respondent who has a 1 for a weight variable would count as a single respondent; a respondent who receives a .5 for a weight variable should count as half a respondent; and a respondent who has a 2 would count as two respondents. Use the survey package for weighting statistical analyses.
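That logic can be checked by hand with base R's weighted.mean (the responses and weights below are toy numbers):

```r
# three hypothetical respondents: full, half, and double weight
y <- c(1, 0, 1)        # e.g., 1 = supports a policy
w <- c(1, 0.5, 2)      # survey weights
weighted.mean(y, w)    # (1*1 + 0.5*0 + 2*1) / 3.5 = 0.857...
```

The half-weight respondent pulls the mean down less than an unweighted respondent would, and the double-weight respondent counts as two.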

The 2010 UMass Poll Massachusetts Exit Poll (http://people.umass.edu/schaffne/mass_exitpoll.dta) includes a variable called "weight." Respondents in the exit poll were weighted to account for the stratified nature of the sampling procedure: because towns were stratified by geography and other factors, not every voter had an equal chance of being sampled. The weight variable adjusts for this fact.

First, just one time at the beginning of your session you need to create a survey design object to tell R that your dataset includes a weight variable. This is done with the svydesign function: 

library(foreign) 
dat <- read.dta("mass_exitpoll.dta") 

library(survey) 
svy.dat <- svydesign(ids = ~1, data = dat, weights = ~weight)

The data and weights arguments specify the dataset and the variable that contains the weights. The ids argument specifies sampling cluster membership. Since we aren’t accounting for the cluster sampling here, we set ids = ~1 to indicate that all respondents originated from the same cluster (or primary sampling unit, PSU). Once you have done this, R knows which variable to use to weight the data (and how to apply those weights). However, you still have to ask R to weight the data whenever you run an analysis.

For example, let’s say we wanted to look at the vote for Governor. If we just tabulated the variable by proportion, we would get the following: 

prop.table(table(dat$qc_gov))

However, the above tabulation does not weight the data to adjust for the sampling approach. To do this, you need to use survey’s analytical functions, which usually just prefix svy to the name of the base function, take a tilde ~ before the variable or formula, and require the survey design object to be specified (see the package description for the full list of survey functions). So, now you can do the same tabulation but apply the weights to it:

prop.table(svytable(~qc_gov, design = svy.dat))


Notice that applying the weights had a pretty big effect on the results. With the sampling weight accounted for, Deval Patrick receives 52% of the vote, much closer to what he actually received in the election.
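The same pattern extends to other statistics. For example, a weighted mean with svymean, sketched here on toy data (the variable names and values are hypothetical):

```r
library(survey)

dat <- data.frame(approve = c(1, 0, 1, 1),
                  weight  = c(0.5, 0.5, 2, 1))
svy.dat <- svydesign(ids = ~1, data = dat, weights = ~weight)
svymean(~approve, design = svy.dat)  # weighted mean with its standard error
```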

Next post will go over how to create the weights yourself using post-stratification raking! 

Count Models in R

posted Mar 14, 2016, 10:50 AM by Mia Costa   [ updated Oct 21, 2017, 4:02 PM ]

To demonstrate how to analyze count dependent variables, we first need to create a dependent variable that is a count. Again use the 2008 CCES survey (http://people.umass.edu/schaffne/cces08.dta). The variable we will create is going to be the number of political activities the respondent reported partaking in. We will use the following variables:

cc403      vote
cc415_1  action - attend local political meetings
cc415_2  action - persuade someone to vote
cc415_3  action - put up political sign
cc415_4  action - work for a candidate or campaign
cc415_5  action - comment on political blog
cc415_6  action - donate money to candidate or campaign

We will need to recode these variables to binary (0 or 1) before adding them up to create our count:
dat$cc403[dat$cc403 < 5 ] = 0
dat$cc403[dat$cc403==5 ] = 1
dat$cc415_1[dat$cc415_1==2 ] = 0
dat$cc415_2[dat$cc415_2==2 ] = 0
dat$cc415_3[dat$cc415_3==2 ] = 0
dat$cc415_4[dat$cc415_4==2 ] = 0
dat$cc415_5[dat$cc415_5==2 ] = 0
dat$cc415_6[dat$cc415_6==2 ] = 0

Now, let’s add these together to create our count of political activities:

dat$activities <- dat$cc403 + dat$cc415_1 + dat$cc415_2 + dat$cc415_3 + dat$cc415_4 + dat$cc415_5 + dat$cc415_6 

We can get a quick sense of how this count variable is distributed by looking at a bar plot of it: 

barplot(table(dat$activities))


Note that the modal number of activities engaged in is 2. Often, a count variable will have a very long tail. In this case, however, there was a limit to how many activities one could report engaging in (7), so we do not have to deal with that issue.

Let’s estimate a model predicting how active individuals are in politics. We’ll use the same education variable from above. However, let’s recode the party id and ideology variables to take into account whether someone identifies with a party and how strong their ideology is: 

dat$pid <- dat$cc307
dat$pid[dat$pid=="2"] <- -1
dat$pid[dat$pid=="3"] <- 0
dat$pid[dat$pid >"3"] <- NA
dat$partyidentifier <- dat$pid 
dat$partyidentifier[dat$partyidentifier==-1 ] = 1 

dat$ideology_strength <- NA 
dat$ideology_strength[dat$v243==3 ] = 0 
dat$ideology_strength[dat$v243==2 ] = 1 
dat$ideology_strength[dat$v243==4 ] = 1 
dat$ideology_strength[dat$v243==5 ] = 2
dat$ideology_strength[dat$v243==1 ] = 2 


Note that this time, I populate the new variable with NA instead of copying the old variable (ideology). Otherwise, I would be recoding over values I had already changed as I ran the subsequent lines of code (for example, 2 becomes 1, and then 1 becomes 2). Starting from NA is always a good idea when you are recoding a lot of numbers into other numbers. Now, let’s estimate a negative binomial regression model using the MASS package (the same package we used for the ordinal logit):

library(MASS)
modelNB <- glm.nb(activities ~ education + partyidentifier + ideology_strength,
   data=dat)
summary(modelNB)


Ok, first things first. Note the results at the bottom of the output. R reports theta (θ), which is the inverse of the dispersion parameter (alpha, or α) given by other software packages like Stata. Alpha is normally the statistic you’d want to include in a table, so to get alpha, take the inverse of theta: 1/28186 ≈ 0.0000354786.

Now to interpreting the results. You cannot interpret the coefficients on their own. However, you can rely on one of two methods. First is to use the “incidence rate ratio” which is just like an odds ratio. 

est <- cbind(coef(modelNB), confint(modelNB))
exp(est)


You can interpret these like odds ratios. For each additional level of education, the rate of activities a respondent engaged in increased by a factor of 1.09, or by 9%.

To get predicted counts of political activity for each level of education, follow the same steps in previous chapters. Use the predict function on the new dataframe with all model variables at the appropriate levels. The last line stores the prediction statistics so that we have standard errors to get confidence intervals for our plot.

fdat <- with(dat, data.frame(education = 1:6,
                             partyidentifier = mean(partyidentifier, na.rm=TRUE),
                             ideology_strength = mean(ideology_strength, na.rm=TRUE)))
pred.act <- predict(modelNB, newdata = fdat, type = "response")
p.edu <- cbind(fdat, predict(modelNB, newdata = fdat, type = "response", se = TRUE))

Now, let’s plot the relationship between education and political activity. This time I’ll use several of ggplot’s plotting options to alter the aesthetics. Note that each + must come at the end of a line (not the start of the next one) so R knows the expression continues:

library(ggplot2)
p1 <- ggplot(fdat, aes(x = education, y = pred.act)) 

p1 + geom_point(size=3, colour="darkgreen") +  # change color and size of points 
  geom_line(colour="darkgreen") +              # change color of line 
  theme_bw() +                                 # change theme 
  scale_x_continuous(breaks=1:6) +             # scale x axis 
  expand_limits(y=1) +                         # start y at 1 
  geom_errorbar(aes(ymax = p.edu$fit + 1.96*p.edu$se.fit,
                    ymin = p.edu$fit - 1.96*p.edu$se.fit),
                width=0.07, colour="darkgreen") +  # add confidence interval bars 
  ylab("Predicted number of political activities")



Multinomial Logit Models in R

posted Mar 14, 2016, 10:18 AM by Mia Costa   [ updated Oct 21, 2017, 4:02 PM ]

For cases when the dependent variable has unordered categories, we use multinomial logit. For example, variable cc309 in the 2008 CCES (http://people.umass.edu/schaffne/cces08.dta) asks respondents if they would rather cut defense spending, cut domestic spending, or raise taxes.

As usual in R, there are a few packages to choose from to carry out a multinomial logit regression. I’m going to use the nnet package and its multinom function. First we need to do some data manipulation: recode the variable with more intuitive labels, convert it into a factor, and make “cut domestic spending” the baseline category.

library(nnet)
dat$cc309[dat$cc309=="1"] <- "cut defense" 
dat$cc309[dat$cc309=="2"] <- "cut domestic" 
dat$cc309[dat$cc309=="3"] <- "raise taxes" 

# convert to factor 
dat$cc309 <- factor(dat$cc309) 

# make cut domestic the baseline category 
dat$cc309 <- relevel(dat$cc309, ref="cut domestic")

Then, we run the model and get a summary of the results: 

mnmodel <- multinom(cc309 ~ ideology + pid + education, data = dat) 
summary(mnmodel)


Each coefficient and statistical test above is calculated in relation to the base category (cut domestic spending). So, a coefficient tells you whether there is a significant difference in the odds of choosing that category relative to the base category, but not whether there is a significant difference between the other categories.

As with the logit model, you can again create predicted probabilities and plots to help look at the size of the effects. However, since you now have more than two categories, the effect for any given variable might be different for different categories of the dependent variable.

# recreate dataframe with ideology at all levels and other variables at means 
fdat <- with(dat, data.frame(ideology = rep(c(1:5), 1), pid=mean(pid, na.rm=TRUE), education=mean(education)))

# get predicted probabilities 
predict(mnmodel, newdata = fdat, type = "probs")
 
We can’t just save these predictions as their own variable like we did in Ch 1 (Logit) & Ch 2 (Ordinal Logit), because there are different predicted probabilities for each category of the dependent variable. The following code separates each outcome into its own variable, stored in a new dataframe along with ideology.

# make dataframe of just the ideology levels
ideodat <- with(dat, data.frame(ideology = rep(c(1:5), 1)))
ideodat
# combine the new dataframe and predicted probabilities
pp309 <- cbind(ideodat, predict(mnmodel, newdata = fdat, type = "probs", se = TRUE))

# check it out to make sure
names(pp309)

Now we plot. This time let’s use R’s built-in base plotting functions; ggplot could be used again instead. Note that par(new=T) tells R not to erase the previous plot, so I can overlay the effects of ideology for each of the three outcomes:

plot(pp309$ideology, pp309$`cut defense`, xlab="Ideology", ylab="Probability", col="blue", type="l", ylim=c(0.0,1.0))
par(new=T) 
plot(pp309$ideology, pp309$`cut domestic`, xlab="", ylab="", axes=F, col="red", type="l", ylim=c(0.0,1.0))
par(new=T)
plot(pp309$ideology, pp309$`raise taxes`, xlab="", ylab="", axes=F, col="darkgreen", ylim=c(0.0,1.0), type="l")
legend(1, 1, legend=c("Cut Defense Spending", "Cut Domestic Spending", "Raise Taxes"), fill=c("blue", "red", "darkgreen"), cex=.7, bty="n")


The graphic shows that most of the effect of ideology occurs in choosing between the “cut defense” and “cut domestic” categories. Almost everyone, regardless of ideology, prefers to not raise taxes.


Next post --> Count Models in R

Ordinal Logit/Probit Models in R

posted Feb 28, 2016, 9:43 AM by Mia Costa   [ updated Oct 21, 2017, 4:01 PM ]

An ordinal logit model is used when you have a dependent variable that takes on more than two categories and those categories are ordered. Likert scales, common in survey research, are a typical example of ordered dependent variables. 

One good way to estimate an ordinal logit model in R is the polr function from the MASS package. The function name comes from proportional odds logistic regression.

In the 2008 CCES dataset (http://people.umass.edu/schaffne/cces08.dta), there is a question asking respondents their views on Affirmative Action policies (cc313). Respondents can indicate their support for such policies on a four point scale, ranging from 1 (“Strongly support”) to 4 (“Strongly oppose”). One could collapse these categories into “support” or “oppose,” but why throw away detail about the strength of support?

Since I read in the Stata .dta file from the last chapter without converting to factors, I have to convert the dependent variable back into ordered levels. (I could also have just left the convert.factors=FALSE argument out when I read in the Stata file.) Either way, it's a good idea to first check the structure of the variable: 

is.factor(dat$cc313) 
# output tells us "FALSE": cc313 is not a factor variable
# convert to factor 
dat$cc313 <- factor(dat$cc313)

Once the variable is converted into a factor, estimate an ordinal logit model on this question using the following commands:

library(MASS) 
ologmodel <- polr(cc313 ~ ideology + pid + education, data=dat, Hess=TRUE)
summary(ologmodel)



We also specify Hess=TRUE to have the model return the ‘Hessian matrix’ which is used to get standard errors. The coefficients in this output are estimating the effect of a one unit increase in each independent variable on the log odds of moving to a higher category on the scale of the dependent variable. These can be converted to odds ratios using the same method noted above: 
exp(cbind(coef(ologmodel), confint(ologmodel)))

Creating predicted probabilities follows the same intuition above, where “new data” gets specified with the conditions for the other variables, and the model gets plugged into the predict command. The only difference is that the type argument changes to "probs" for the ordinal logit.

Here, I just get predicted probabilities for cc313 for every observation (aka respondents, aka rows) in the dataset. I print out the first 7 observations so we can see what they look like. The columns are the probabilities that a respondent takes each value (1-4) of cc313, and the rows are the observations.

 # predicted probs 
PPolog <- predict(ologmodel, type = "probs")

# print out first 7 observations 
head(PPolog, n=7)

Logit/Probit Models in R

posted Feb 28, 2016, 9:21 AM by Mia Costa   [ updated Oct 21, 2017, 4:01 PM ]

For this example, we'll use the 2008 Cooperative Congressional Election Study. You can download it here from Brian Schaffner's website (http://people.umass.edu/schaffne/cces08.dta). Download the file, read it into R using the foreign package, and recode some of the variables so that they will be appropriately coded: 

setwd("Your Working Directory Here")
library(foreign)
# convert.factors = F brings in the variable values instead of the labels (1 instead of "support")
dat <- read.dta("cces08.dta", convert.factors = F)
dat$stemcell <- dat$cc316c 
dat$stemcell[dat$stemcell=="2"] = 0 
dat$stemcell[dat$stemcell=="3"] = NA 
dat$pid <- dat$cc307
dat$pid[dat$pid=="2"] <- -1 
dat$pid[dat$pid=="3"] <- 0
dat$pid[dat$pid > "3"] <- NA
dat$ideology <- dat$v243
dat$ideology[dat$ideology=="6"] <- NA
dat$education <- dat$v213


The glm function in R fits generalized linear models. Specify the model as you normally would with the lm function, then specify the type of model to fit with the family argument. For logistic regression, we’ll use the “binomial” family. See ?family for other types. 

model <- glm(stemcell ~ pid + ideology + education, data=dat, family=binomial()) 
summary(model)



It is important to keep in mind that the coefficients are not easily interpreted in the way that OLS coefficients are. To get odds ratios, exponentiate the model coefficients and confidence intervals: 

exp(cbind(coef(model), confint(model))) 


You can interpret the odds ratios a bit more easily. For example, you could interpret the odds ratio on education to mean that for each additional unit increase on the scale education, an individual’s odds of supporting stem cell research are 1.16 times greater. Or, perhaps slightly more intuitively, for each additional education category, an individual’s odds of supporting stem cell research increase by 16%. You can ignore the intercept for odds ratios.
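To see the arithmetic behind that reading, here is a quick check in base R (the coefficient value is hypothetical, chosen to match an odds ratio of 1.16):

```r
b <- log(1.16)       # a logit coefficient whose odds ratio is 1.16
exp(b)               # 1.16: odds are multiplied by 1.16 per unit increase
100 * (exp(b) - 1)   # 16: odds increase by 16% per unit increase
```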

Now, let’s return to interpreting those coefficients. While odds ratios are fairly easy to calculate, they probably aren’t the best way to talk about the effects of the variables. Predicted probabilities are more intuitive and easier to understand. You can generate and plot these using the predict command.

First, predict will return a list of predicted probabilities under different conditions that you can specify. For example, if you wanted to know the probability that an individual who is “very liberal” on the ideology measure and average on the other covariates would support funding for stem cell research you would first create a new dataframe with those new conditions, and then create a new stemcell variable with the predicted probability of support: 

# create new data frame with ideology at 1 and other variables at means 
newdata <- with(dat, data.frame(ideology = 1,
                                stemcell = mean(stemcell, na.rm=TRUE),
                                pid = mean(pid, na.rm=TRUE),
                                education = mean(education)))

# get predicted probability of stemcell 
newdata$stemcellP <- predict(model, newdata = newdata, type = "response")

newdata
 


Note that there is a .98 probability that this individual would support funding for stem cell research.

Now what if you want to plot the predicted probabilities across all values of ideology? To do this, recreate the data frame you made above but with ideology at every value between 1 and 5 in 1 point increments. In this new dataset (which I’m calling “fdat”), all the other model variables will be held at their means.

# recreate dataframe with ideology at all levels and other variables at means 
fdat <- with(dat, data.frame(ideology = 1:5,
                             pid = mean(pid, na.rm=TRUE),
                             education = mean(education)))

# get predicted probabilities 
fdat$stemcellP <- predict(model, newdata = fdat, type = "response")
fdat

Note that I stored the predicted probabilities into a new variable (stemcellP) within the dataframe. Now to make a simple plot:

library(ggplot2)
p1 <- ggplot(fdat, aes(x = as.numeric(ideology), y = stemcellP), group=fdat)
p1 + geom_point() + geom_line() + xlab("Ideology") + ylab("Prob. of supporting funding for stem cell research") + ggtitle("Effect of Ideology on Stem Cell Support")
