
#### Post-Stratification Raking: Questions and Answers about Creating Weights

posted Feb 8, 2017, 6:33 PM by Mia Costa

#### Creating Post-Stratification Weights in R

posted May 2, 2016, 4:56 PM by Mia Costa   [ updated Jun 29, 2017, 11:00 AM ]

#### Using Weights in R

posted Mar 22, 2016, 5:16 PM by Mia Costa   [ updated Apr 2, 2016, 9:37 AM ]

#### Count Models in R

posted Mar 14, 2016, 10:50 AM by Mia Costa   [ updated Mar 18, 2016, 4:50 PM ]

To demonstrate how to analyze count dependent variables, we first need to create a dependent variable that is a count. Again, use the 2008 CCES survey (http://people.umass.edu/schaffne/cces08.dta). The variable we will create is the number of political activities the respondent reported partaking in. We will use the following variables:

- cc403: vote
- cc415_1: action - attend local political meetings
- cc415_2: action - persuade someone to vote
- cc415_3: action - put up political sign
- cc415_4: action - work for a candidate or campaign
- cc415_5: action - comment on political blog
- cc415_6: action - donate money to candidate or campaign

We will need to recode these variables to binary (0 or 1) before adding them up to create our count:

```r
dat$cc403[dat$cc403 < 5] <- 0
dat$cc403[dat$cc403 == 5] <- 1
dat$cc415_1[dat$cc415_1 == 2] <- 0
dat$cc415_2[dat$cc415_2 == 2] <- 0
dat$cc415_3[dat$cc415_3 == 2] <- 0
dat$cc415_4[dat$cc415_4 == 2] <- 0
dat$cc415_5[dat$cc415_5 == 2] <- 0
dat$cc415_6[dat$cc415_6 == 2] <- 0
```

Now, let's add these together to create our count of political activities:

```r
dat$activities <- dat$cc403 + dat$cc415_1 + dat$cc415_2 + dat$cc415_3 +
  dat$cc415_4 + dat$cc415_5 + dat$cc415_6
```

We can get a quick sense of how this count variable is distributed by looking at a histogram of it:

```r
barplot(table(dat$activities))
```

Note that the modal number of activities engaged in is 2. Often, a count variable will have a very long tail. In this case, however, there was a limit to how many activities one could report engaging in (7), so we do not have to deal with that issue. Let's estimate a model predicting how active individuals are in politics. We'll use the same education variable from above.
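As a quick sanity check, the recode-then-sum pattern above can be sketched on a few rows of toy data (the values below are hypothetical, not from the CCES):

```r
# Toy illustration of the recode-then-sum pattern: in the raw data,
# 2 means "did not do the activity"; we recode it to 0, so summing
# the columns counts the activities each respondent reported.
toy <- data.frame(cc415_1 = c(1, 2, 2), cc415_2 = c(1, 1, 2))
toy$cc415_1[toy$cc415_1 == 2] <- 0
toy$cc415_2[toy$cc415_2 == 2] <- 0
toy$activities <- toy$cc415_1 + toy$cc415_2
toy$activities  # 2 1 0
```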
However, let's recode the party ID and ideology variables to take into account whether someone identifies with a party and how strong their ideology is:

```r
dat$pid <- dat$cc307
dat$pid[dat$pid == 2] <- -1
dat$pid[dat$pid == 3] <- 0
dat$pid[dat$pid > 3] <- NA

dat$partyidentifier <- dat$pid
dat$partyidentifier[dat$partyidentifier == -1] <- 1

dat$ideology_strength <- NA
dat$ideology_strength[dat$v243 == 3] <- 0
dat$ideology_strength[dat$v243 == 2] <- 1
dat$ideology_strength[dat$v243 == 4] <- 1
dat$ideology_strength[dat$v243 == 5] <- 2
dat$ideology_strength[dat$v243 == 1] <- 2
```

Note that this time, I populate the new variable with NA instead of copying the old variable (ideology). Otherwise, I would be recoding over the new values as I ran the subsequent lines of code (2 becomes 1, then 1 becomes 2). Starting from NA is always a good idea when you are recoding a lot of numbers into other numbers. Now, let's estimate a negative binomial regression model using the MASS package (the same package we used for the ordinal logit):

```r
library(MASS)
modelNB <- glm.nb(activities ~ education + partyidentifier + ideology_strength, data = dat)
summary(modelNB)
```

Ok, first things first. Note the results at the bottom of the output. R reports theta (θ) as its dispersion parameter, which is the inverse of the dispersion parameter (alpha, or α) given in other software packages like Stata. Alpha is normally the statistic you'd want to include in a table, so to get alpha, take the inverse of theta: 1/28186 = 0.0000354786. Now to interpreting the results. You cannot interpret the coefficients on their own, but you can rely on one of two methods. The first is the "incidence rate ratio," which works just like an odds ratio:

```r
est <- cbind(coef(modelNB), confint(modelNB))
exp(est)
```

You can interpret these like odds ratios. For each additional level of education, the rate of activities a respondent engaged in increased by a factor of 1.09, or by 9%.
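The incidence-rate-ratio arithmetic is simple enough to sketch by hand (the coefficient value here is hypothetical, not taken from the model output above):

```r
# A negative binomial coefficient b maps to an incidence rate ratio of
# exp(b): each one-unit increase in the predictor multiplies the expected
# count by exp(b).
b_education <- 0.0862       # hypothetical coefficient for illustration
irr <- exp(b_education)
round(irr, 2)               # about 1.09, i.e. a ~9% increase per level
round(100 * (irr - 1))      # percent change in the expected count
```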
To get predicted counts of political activity for each level of education, follow the same steps as in previous chapters: use the predict function on a new dataframe with all model variables at the appropriate levels. The last line stores the prediction statistics so that we have standard errors to get confidence intervals for our plot.

```r
fdat <- with(dat, data.frame(education = rep(c(1:6), 1),
                             partyidentifier = mean(partyidentifier, na.rm = TRUE),
                             ideology_strength = mean(ideology_strength, na.rm = TRUE)))
pred.act <- predict(modelNB, newdata = fdat, type = "response")
p.edu <- cbind(fdat, predict(modelNB, newdata = fdat, type = "response", se = TRUE))
```

Now, let's plot the relationship between education and political activity. This time I'll use ggplot's many plotting options to alter the aesthetics:

```r
library(ggplot2)
p1 <- ggplot(fdat, aes(x = education, y = pred.act))
p1 + geom_point(size = 3, colour = "darkgreen") +  # change color and size of points
  geom_line(colour = "darkgreen") +                # change color of line
  theme_bw() +                                     # change theme
  scale_x_continuous(breaks = 1:6) +               # scale x axis
  expand_limits(y = 1) +                           # start y at 1
  geom_errorbar(aes(ymax = p.edu$fit + p.edu$se.fit * 1.96,
                    ymin = p.edu$fit - p.edu$se.fit * 1.96),
                width = 0.07, colour = "darkgreen") +  # add confidence interval bars
  ylab("Predicted number of political activities")
```

This post is CHAPTER 4 of an R packet I am creating for a Survey Methods graduate seminar taught by Brian Schaffner at the University of Massachusetts Amherst. The seminar is usually only taught in Stata, so I translated all the exercises, assignments, and examples used in the course for R users. Other chapters include: Logit/Probit Models, Ordinal Logit/Probit Models, Multinomial Logit Models, Using Weights, Creating Post-Stratification Weights, Item Scaling, Matching/Balancing.

#### Multinomial Logit Models in R

posted Mar 14, 2016, 10:18 AM by Mia Costa   [ updated Mar 14, 2016, 11:28 AM ]

For cases when the dependent variable has unordered categories, we use multinomial logit. For example, variable cc309 in the 2008 CCES (http://people.umass.edu/schaffne/cces08.dta) asks respondents if they would rather cut defense spending, cut domestic spending, or raise taxes. As usual in R, there are a few package options from which to choose to carry out a multinomial logit regression. I'm going to use the nnet package and the multinom function. First we need to do some manipulation: recode the variable with more intuitive labels, convert it into a factor, and make "cut domestic spending" the baseline category.

```r
library(nnet)
dat$cc309[dat$cc309 == "1"] <- "cut defense"
dat$cc309[dat$cc309 == "2"] <- "cut domestic"
dat$cc309[dat$cc309 == "3"] <- "raise taxes"

# convert to factor
dat$cc309 <- factor(dat$cc309)

# make "cut domestic" the baseline category
dat$cc309 <- relevel(dat$cc309, ref = "cut domestic")
```

Then, we run the model and get a summary of the results:

```r
mnmodel <- multinom(cc309 ~ ideology + pid + education, data = dat)
summary(mnmodel)
```

Each coefficient and statistical test above is calculated in relation to the base category (cut domestic spending). So each coefficient tells you whether there is a significant difference in the odds of choosing that category relative to the base category, but not whether there is a significant difference across the other categories. As with the logit model, you can again create predicted probabilities and plots to help look at the size of the effects. However, since you now have more than two categories, the effect of any given variable might be different for different categories of the dependent variable.
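Because the model assigns a predicted probability to every category, each row of the predicted-probability matrix must sum to 1. A minimal sketch, with a hypothetical matrix standing in for the predict output:

```r
# Hypothetical predicted-probability matrix: one row per observation,
# one column per category of the dependent variable.
pp <- matrix(c(0.50, 0.30, 0.20,
               0.40, 0.35, 0.25),
             nrow = 2, byrow = TRUE,
             dimnames = list(NULL, c("cut domestic", "cut defense", "raise taxes")))
rowSums(pp)  # each row sums to 1
```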
```r
# recreate dataframe with ideology at all levels and other variables at means
fdat <- with(dat, data.frame(ideology = rep(c(1:5), 1),
                             pid = mean(pid, na.rm = TRUE),
                             education = mean(education)))

# get predicted probabilities
predict(mnmodel, newdata = fdat, type = "probs")
```

We can't just save these predictions as their own variable like we did in Ch 1 (Logit) and Ch 2 (Ordinal Logit), because there are different predicted probabilities for each category of the dependent variable. The following code separates each outcome into its own variable, stored in a new dataframe along with ideology.

```r
# make dataframe of just the ideology levels
ideodat <- with(dat, data.frame(ideology = rep(c(1:5), 1)))
ideodat

# combine the new dataframe and predicted probabilities
pp309 <- cbind(ideodat, predict(mnmodel, newdata = fdat, type = "probs", se = TRUE))

# check it out to make sure
names(pp309)
```

Now we plot. This time let's use R's built-in generic plotting function; ggplot could be used again, alternatively. Note that I tell R not to erase the previous plot (par(new=T)) so I can overlay the effects of ideology for each of the three outcomes:

```r
plot(pp309$ideology, pp309$"cut defense", xlab = "Ideology", ylab = "Probability",
     col = "blue", type = "l", ylim = c(0.0, 1.0))
par(new = T)
plot(pp309$ideology, pp309$"cut domestic", xlab = "", ylab = "", axes = F,
     col = "red", type = "l", ylim = c(0.0, 1.0))
par(new = T)
plot(pp309$ideology, pp309$"raise taxes", xlab = "", ylab = "", axes = F,
     col = "darkgreen", type = "l", ylim = c(0.0, 1.0))
legend(1, 1, legend = c("Cut Defense Spending", "Cut Domestic Spending", "Raise Taxes"),
       fill = c("blue", "red", "darkgreen"), cex = .7, bty = "n")
```

The graphic shows that most of the effect of ideology occurs in choosing between the "cut defense" and "cut domestic" categories.
Almost everyone, regardless of ideology, prefers not to raise taxes.

Chapter 4 --> Count Models in R

This post is CHAPTER 3 of an R packet I am creating for a Survey Methods graduate seminar taught by Brian Schaffner at the University of Massachusetts Amherst. The seminar is usually only taught in Stata, so I translated all the exercises, assignments, and examples used in the course for R users. Other chapters include: Logit/Probit Models, Ordinal Logit/Probit Models, Count Models, Using Weights, Creating Post-Stratification Weights, Item Scaling, Matching/Balancing.

#### Ordinal Logit/Probit Models in R

posted Feb 28, 2016, 9:43 AM by Mia Costa   [ updated Mar 14, 2016, 10:53 AM ]

An ordinal logit model is used when you have a dependent variable that takes on more than two categories and those categories are ordered. Likert scales, common in survey research, are a classic example of ordered dependent variables. One good way to estimate an ordinal logit model in R is the polr function from the MASS package; the function name comes from "proportional odds logistic regression." In the 2008 CCES dataset (http://people.umass.edu/schaffne/cces08.dta), there is a question asking respondents their views on affirmative action policies (cc313). Respondents can indicate their support for such policies on a four-point scale, ranging from 1 ("Strongly support") to 4 ("Strongly oppose"). One could collapse these categories into "support" or "oppose," but why throw away detail about the strength of support? Since I read in the Stata .dta file in the last chapter without converting to factors, I have to convert the dependent variable back into ordered levels. (I could also have just left that convert.factors=FALSE argument out when I read in the Stata file.) It's always a good idea to first check the structure of the variable:

```r
is.factor(dat$cc313)
# output tells us "FALSE": cc313 is not a factor variable

# convert to factor
dat$cc313 <- factor(dat$cc313)
```

Once the variable is converted into a factor, estimate an ordinal logit model on this question using the following commands:

```r
library(MASS)
ologmodel <- polr(cc313 ~ ideology + pid + education, data = dat, Hess = TRUE)
summary(ologmodel)
```

We also specify Hess=TRUE to have the model return the 'Hessian matrix,' which is used to get standard errors. The coefficients in this output estimate the effect of a one-unit increase in each independent variable on the log odds of moving to a higher category on the scale of the dependent variable.
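The check-and-convert step above can be sketched on a toy vector (the values are hypothetical stand-ins for cc313):

```r
# Stand-in for a numeric variable read from a Stata file with
# convert.factors=FALSE: polr needs it converted to a factor first.
x <- c(1, 3, 2, 4, 4)
is.factor(x)   # FALSE: still numeric
x <- factor(x)
is.factor(x)   # TRUE
levels(x)      # "1" "2" "3" "4" -- levels are kept in ascending order
```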
These can be converted to odds ratios using the same method noted above:

```r
exp(cbind(coef(ologmodel), confint(ologmodel)))
```

Creating predicted probabilities follows the same intuition as above: "new data" gets specified with the conditions for the other variables, and the model gets plugged into the predict command. The only difference is that the type argument changes to "probs" for the ordinal logit. Here, I just get predicted probabilities for cc313 for every observation (aka respondents, aka rows) in the dataset, and print out the first 7 observations so we can see what they look like. The columns are the probabilities that a respondent takes each value (1-4) of cc313, and the rows are the observations.

```r
# predicted probs
PPolog <- predict(ologmodel, type = "probs")

# print out first 7 observations
head(PPolog, n = 7)
```

Chapter 3 --> Multinomial Logit Models in R

This post is CHAPTER 2 of an R packet I am creating for a Survey Methods graduate seminar taught by Brian Schaffner at the University of Massachusetts Amherst. The seminar is usually only taught in Stata, so I translated all the exercises, assignments, and examples used in the course for R users. Other chapters include: Logit/Probit Models, Multinomial Logit Models, Count Models, Using Weights, Creating Post-Stratification Weights, Item Scaling, Matching/Balancing.

#### Logit/Probit Models in R

posted Feb 28, 2016, 9:21 AM by Mia Costa   [ updated Mar 14, 2016, 10:56 AM ]