The aim of this blog is to outline how to carry out a basic multinomial regression analysis in R. In this blog I use the data.table package which is much hyped and with good reason, because it is awesome! I also use the `multinom()`

function in the nnet package to carry out the regressions. Anyone who has tried carry out statistics on real data from a survey knows that it can be pretty ghastly; I hope that the methods outlined here will be useful to those people. No pretty pictures today I’m afraid!

The data being used comes from the CDC and we will use the research written by Michael Wilson et al at

Peer Corps Trust. The paper is free and can be downloaded here http://www.mdpi.com/1660-4601/9/11/4122.
I won’t spend a great deal of time explaining the analysis because there is already a paper on the methodology. Here I simply concentrate on the R code. I took the MS Access table and copied it into a CSV file. The data set is small,
1432 rows and 87 columns but the main issue here is handling less than perfect data.

First we load the data:

```
require(foreign)
require(data.table)
require(nnet)
#======== DATA PREP
# Loading the data
sFileName <- "SuicideData.csv"
sFilePath <- "C:/Users/.../"
dfSuicideData <- read.csv(paste(sFilePath, "/", sFileName, sep = ""))
```

I convert the data.frame to a data.table and overwrite the record column with row number.

```
dfSuicideData <- data.table(dfSuicideData)
dfSuicideData[, record:= 1:nrow(dfSuicideData)]
```

We need to create the response variable, which is a categorical variable with three factors. These three categories will be “NO” for no suicide tendency, “SI” for suicide ideation and SISP for suicide attempt/planning. These are derived from question 25 and 26 the “q25” and “q26” columns which. Question 25 asks the adolescents whether they seriously consered attempting suicide, and question 26 asked them whether they made a plan about how they would attempt suicide. The “NO” category answered no to both questions, the “SI” category answered “Yes” to question 25 and if they answered yes to question 26 they get assigned to SISP regardless of the previous answer.

There are many ways to capture this. My approach throughout was to use a set of conditional statements to apply the categories. So for the response, my categorizing function was:

```
require(foreign)
require(data.table)
require(nnet)
#======== DATA PREP
# Loading the data
sFileName <- "SuicideData.csv"
sFilePath <- "C:/Users/.../"
dfSuicideData <- read.csv(paste(sFilePath, "/", sFileName, sep = ""))
```

I convert the data.frame to a data.table and overwrite the record column with row number.

```
dfSuicideData <- data.table(dfSuicideData)
dfSuicideData[, record:= 1:nrow(dfSuicideData)]
```

We need to create the response variable, which is a categorical variable with three factors. These three categories will be “NO” for no suicide tendency, “SI” for suicide ideation and SISP for suicide attempt/planning. These are derived from question 25 and 26 the “q25” and “q26” columns which. Question 25 asks the adolescents whether they seriously consered attempting suicide, and question 26 asked them whether they made a plan about how they would attempt suicide. The “NO” category answered no to both questions, the “SI” category answered “Yes” to question 25 and if they answered yes to question 26 they get assigned to SISP regardless of the previous answer.

There are many ways to capture this. My approach throughout was to use a set of conditional statements to apply the categories. So for the response, my categorizing function was

```
AllocateItems <- function(Q25, Q26)
{
# If both contain missing data
if(is.na(Q25) & is.na(Q26))
{
sOutput <- NA_character_
}else{
# If both are no
if(!is.na(Q25) & !is.na(Q26))
{
if(Q25 == 2 & Q26 == 2)
sOutput <- "NO"
}
# if either of the items are present we assume it is correct with SISP override
sOutput <- "NO"
if(!is.na(Q25))
{
if(Q25 == 1)
sOutput <- "SI"
}
if(!is.na(Q26))
{
if(Q26 == 1)
sOutput <- "SISP"
}
}
return(sOutput)
}
```

Then we apply it to the data.table to create a “Response” column.

```
# Creating response column
dfSuicideData[, Response := AllocateItems(q25, q26), by = record]
```

In the paper, the age is defined in years, so we simply add an appropriate constant to make this happen

```
# Age
dfSuicideData[, Age := q1 + 10, by = record]
```

Next we allocate sex, instead of creating a sex column with male or female categories, I create a “Sex.Male” column with “yes” or “no” as the categories. The function for allocating this is:

```
AllocateQ2 <- function(Q2)
{
sOutput <- NA_character_
if(!is.na(Q2))
{
if(Q2 == 1)
{
sOutput <- "Yes"
}else{
if(Q2 == 2)
sOutput <- "No"
}
}
return(sOutput)
}
```

Then we apply it to the table

```
# Sex
dfSuicideData[, Sex.Male := AllocateQ2(q2), by = record]
```

We apply the categories in the paper, apportioning them as it stipulates until we end up with the analysis data, the rest of the data prep reads very similar to the above so feel free to skip ahead to the Analysis part.

```
# Function for allocating Q6 Q22, Q23
AllocateQ6.22.23 <- function(Q6)
{
sOutput <- NA_character_
if(!is.na(Q6))
{
if(Q6 %in% 1:3)
{
sOutput <- "No"
}else{
if(Q6 %in% 4:5)
sOutput <- "Yes"
}
return(sOutput)
}
# Assigning to Q6
dfSuicideData[, Hungry := AllocateQ6.22.23(q6), by = record]
# Assign to Q23
dfSuicideData[, Worried := AllocateQ6.22.23(q23), by = record]
# Assign to Q22
dfSuicideData[, Lonely := AllocateQ6.22.23(q22), by = record]
# Allocating the binary
AllocateBinary <- function(iBin)
{
sOutput <- NA_character_
if(!is.na(iBin)){
if(iBin == 1)
{
sOutput <- "Yes"
}else
{
if(iBin == 2)
sOutput <- "No"
}
}
return(sOutput)
}
# Assign Sad
dfSuicideData[, Sad := AllocateBinary(q24), by = record]
# Assign Number of friends
dfSuicideData[, NumberOfFriends := q27-1, by = record]
# Assign Missing school
dfSuicideData[, MissSchool := AllocateQ6.22.23(q50), by = record]
# Function for allocating bullying days
AllocateQ20 <- function(Q20)
{
sOutput <- NA_character_
if(!is.na(Q20)){
if(Q20 %in% 1:2)
{
sOutput <- "No"
}else
{
if(Q20 %in% 3:7)
sOutput <- "Yes"
}
}
return(sOutput)
}
# Assigning bullying
dfSuicideData[, Bullied := AllocateQ20(q20), by = record]
# Function for assigning parental oversight
AllocateQ52.53.54 <- function(Q52)
{
sOutput <- NA_character_
if(!is.na(Q52)){
if(Q52 %in% 1:3)
{
sOutput <- "No"
}else
{
if(Q52 %in% 4:5)
sOutput <- "Yes"
}
}
return(sOutput)
}
# Assigning parents checking homework
dfSuicideData[, CheckHomework := AllocateQ52.53.54(q52), by = record]
# Assigning understanding parents
dfSuicideData[, UnderStandingParents := AllocateQ52.53.54(q53), by = record]
# Assigning parental oversight on free time
dfSuicideData[, CheckPlaytime := AllocateQ52.53.54(q54), by = record]
# Function for assigning cigarettes and drug use
AllocateQ29 <- function(Q29)
{
sOutput <- NA_character_
if(!is.na(Q29)){
if(Q29 == 1)
{
sOutput <- "No"
}else
{
if(Q29 >= 2)
sOutput <- "Yes"
}
}
return(sOutput)
}
# Cigarette use
dfSuicideData[, CigaretteUse := AllocateQ29(q29), by = record]
# Drug use
dfSuicideData[, DrugUse := AllocateQ29(q39), by = record]
# Drug problems
dfSuicideData[, AlcholMisuse := AllocateQ29(q38), by = record]
# subsetting the data
sWantedColumns <- c("record", "Response", "Age", "Sex.Male", "Hungry", "Worried",
"Lonely", "Sad", "NumberOfFriends", "MissSchool", "Bullied", "CheckHomework",
"UnderStandingParents", "CheckPlaytime", "CigaretteUse", "DrugUse", "AlcholMisuse")
dfAnalysisData <- dfSuicideData[,sWantedColumns, with = FALSE]
```

Now it is time to carry out the analysis. There are three tables in the paper, the first displays the distributions of the categories and the p-value for a one-way analysis - Chi-Squared for the categorical variables and F-test ANOVA for the Age variable. First I gather my “yes” and “no” categorical variables

```
sOneWayAnalysis <- sWantedColumns[!sWantedColumns %in% c("Response", "Age", "record", "NumberOfFriends")]
```

Then we write a function to grab the proportions of the yeses and the p-values of the ANOVA, then we use lapply to carry out the analysis …

```
dfDistributions <- lapply(sOneWayAnalysis, function(sFactor = "MissSchool"){
print(sFactor)
sFormula <- paste("~ ", sFactor, " + Response", sep = "")
dfProportions <- (prop.table(with(dfAnalysisData, xtabs(as.formula(sFormula))),
margin = 2)*100)["Yes",]
dfProportions <- data.table(t(dfProportions))
dfProportions[, P.Value := chisq.test(get(sFactor, dfAnalysisData),
get("Response", dfAnalysisData))$p.value]
return(dfProportions)
})
dfDistributions <- do.call(rbind, dfDistributions)
dfDistributions <- data.table("Variable" = sOneWayAnalysis, dfDistributions)
```

We then separately carry out the analysis for the other groups

```
# For Age
dfAgeMean <- dfAnalysisData[!is.na(Response), mean(Age, na.rm = TRUE), by = Response]
dfAgeSD <- dfAnalysisData[!is.na(Response), sd(Age, na.rm = TRUE), by = Response]
dP.Value <- data.frame(summary(aov(Age ~ factor(Response),
data = dfAnalysisData))[[1]])[1,"Pr..F."]
dfAgeDistribution <- data.table(Variable = "Age", NO = dfAgeMean[Response == "NO", V1],
SI = dfAgeMean[Response == "SI", V1], SISP = dfAgeMean[Response == "SISP", V1],
P.Value = dP.Value)
# Combining with main table
dfDistributions <- rbind(dfAgeDistribution, dfDistributions)
# For Friends
dfFriendData <- prop.table(with(dfAnalysisData, xtabs( ~ NumberOfFriends + Response)),
margin = 2)*100
oNameNames <- dimnames(dfFriendData)
names(oNameNames) <- NULL
dfFriendData <- data.frame(matrix(dfFriendData, nr = 4, dimnames = oNameNames))
dFriendsPValue <- with(dfAnalysisData, chisq.test(NumberOfFriends, Response)$p.value)
dfFriendData <- data.table(Variable = paste("NumberOfFriends = ", rownames(dfFriendData),
sep = ""), dfFriendData, P.Value = dFriendsPValue)
```

Below is the equivalent of the Table 1 in the paper

```
dfOutputTable1 <- rbind(dfDistributions, dfFriendData)
> dfOutputTable1
Variable NO SI SISP P.Value
1: Age 14.003788 13.48515 14.09283 1.189294e-03
2: Sex.Male 47.523810 58.58586 40.25424 7.609487e-03
3: Hungry 13.022352 25.00000 22.07792 6.946819e-05
4: Worried 6.362773 19.19192 27.11864 3.988193e-21
5: Lonely 7.939509 28.57143 26.77824 1.221959e-19
6: Sad 23.432980 51.64835 60.69869 8.005775e-31
7: MissSchool 3.636364 13.18681 11.45374 1.422812e-07
8: Bullied 17.820658 41.33333 37.70492 1.094117e-11
9: CheckHomework 40.684794 41.75824 35.24229 2.955763e-01
10: UnderStandingParents 37.627812 31.03448 27.31278 9.359673e-03
11: CheckPlaytime 40.570846 40.65934 31.98198 5.762363e-02
12: CigaretteUse 11.976048 28.57143 27.10280 6.309457e-10
13: DrugUse 9.772952 20.00000 25.11013 6.134090e-10
14: AlcholMisuse 16.714976 33.68421 32.18884 3.943569e-09
15: NumberOfFriends = 0 3.625954 13.82979 11.39241 4.625285e-08
16: NumberOfFriends = 1 14.503817 15.95745 15.18987 4.625285e-08
17: NumberOfFriends = 2 15.935115 18.08511 21.94093 4.625285e-08
18: NumberOfFriends = 3 65.935115 52.12766 51.47679 4.625285e-08
```

The attentive will notice that these numbers are different from those in the paper. This is because the paper has some mistakes in the count data. Now for the multinomial regression part. Table 2 and three can be calculated with one function. First we need to ensure that the number of friends is a factor with the base level at 3, we do this using the relevel() function which is quite useful when you want to adjust the base level in a regression.

``r

dfAnalysisData[, NumberOfFriends := relevel(factor(NumberOfFriends), ref = “3”)] ``

Now we write the function that will do the regression. The idea is to construct a formula that we pass to the multinom() function, and then get the summary. The “nnet” package version I used was “7.3-6” on R 2.15.2. The reason I say this is because there was no p-value in the output of the multinom() function. So I had to calculate it (which is actually straightforward). We can then calculate the relativities of the coefficients given in the model and the confidence intervals then output the final data.

```
getModelResult <- function(sFormula){
oModel <- multinom(sFormula, data = dfAnalysisData)
# Multinomial Model summaries
dfModelCoef <- data.frame(t(summary(oModel)$coefficients))
dfModelSE <- data.frame(t(summary(oModel)$standard.errors))
# Calculating the p-value using wald statistic
dfModelPValues <- data.frame(apply((dfModelCoef/dfModelSE)^2, 2,
pchisq, df = 1, lower.tail = FALSE))
# The relativities
dfMeanCoefs <- exp(dfModelCoef)
dfUpperCoefs <- exp(dfModelCoef + qnorm(.975)*dfModelSE)
dfLowerCoefs <- exp(dfModelCoef - qnorm(.975)*dfModelSE)
# Confidence intervals
names(dfLowerCoefs) <- paste("Lower.", names(dfLowerCoefs), sep = "")
names(dfUpperCoefs) <- paste("Upper.", names(dfUpperCoefs), sep = "")
names(dfModelPValues) <- paste("P.Value.", names(dfModelPValues), sep = "")
# Output table
dfModelOutput <- data.frame(dfMeanCoefs, dfLowerCoefs, dfUpperCoefs, dfModelPValues)
return(dfModelOutput)
}
```

We can then generate table 2 …

```
sFormula <- formula(paste(c("Response ~ Age", sOneWayAnalysis, "NumberOfFriends"), collapse = " + "))
dfOutputTable2 <- getModelResult(sFormula = sFormula)
> print(dfOutputTable2, digits = 3)
SI SISP Lower.SI Lower.SISP Upper.SI Upper.SISP P.Value.SI P.Value.SISP
(Intercept) 0.332 0.0364 0.0125 0.00401 8.78 0.330 5.09e-01 3.22e-03
Age 0.808 1.0345 0.6403 0.88948 1.02 1.203 7.10e-02 6.60e-01
Sex.MaleYes 1.818 0.7347 0.9374 0.46225 3.52 1.168 7.69e-02 1.92e-01
HungryYes 0.786 1.1993 0.3343 0.68677 1.85 2.094 5.82e-01 5.23e-01
WorriedYes 1.118 3.1397 0.4074 1.71937 3.07 5.733 8.29e-01 1.96e-04
LonelyYes 5.883 2.3604 2.6601 1.31230 13.01 4.246 1.21e-05 4.14e-03
SadYes 2.154 3.3344 1.1167 2.15123 4.15 5.168 2.21e-02 7.21e-08
MissSchoolYes 2.020 1.4335 0.5505 0.53242 7.41 3.860 2.89e-01 4.76e-01
BulliedYes 1.580 1.7436 0.7973 1.07785 3.13 2.821 1.90e-01 2.35e-02
CheckHomeworkYes 1.248 1.0549 0.6361 0.66128 2.45 1.683 5.19e-01 8.22e-01
UnderStandingParentsYes 0.627 0.5899 0.2950 0.34922 1.33 0.996 2.24e-01 4.85e-02
CheckPlaytimeYes 1.337 1.0953 0.6735 0.67990 2.66 1.764 4.06e-01 7.08e-01
CigaretteUseYes 4.609 2.7042 2.0041 1.49007 10.60 4.907 3.23e-04 1.07e-03
DrugUseYes 0.380 1.4016 0.1107 0.69012 1.30 2.847 1.24e-01 3.50e-01
AlcholMisuseYes 1.674 1.7280 0.7850 1.04006 3.57 2.871 1.82e-01 3.47e-02
NumberOfFriends0 2.534 2.0221 0.9123 0.93589 7.04 4.369 7.44e-02 7.32e-02
NumberOfFriends1 0.693 1.2138 0.2257 0.65133 2.13 2.262 5.23e-01 5.42e-01
NumberOfFriends2 1.328 1.4589 0.5787 0.85197 3.05 2.498 5.04e-01 1.69e-01
```

and then table 3

```
sFormula <- formula(paste(c("Response ~", sOneWayAnalysis[sOneWayAnalysis != "Sex.Male"], "NumberOfFriends"), collapse = " + "))
dfOutputTable3 <- getModelResult(sFormula = sFormula)
> print(dfOutputTable3, digits = 3)
SI SISP Lower.SI Lower.SISP Upper.SI Upper.SISP P.Value.SI P.Value.SISP
(Intercept) 0.0228 0.0514 0.0116 0.0323 0.0451 0.0818 1.25e-27 6.04e-36
HungryYes 0.8971 1.2019 0.3919 0.6963 2.0536 2.0745 7.97e-01 5.09e-01
WorriedYes 0.9863 3.1040 0.3615 1.7142 2.6907 5.6205 9.78e-01 1.85e-04
LonelyYes 5.3816 2.4364 2.4723 1.3664 11.7145 4.3445 2.23e-05 2.55e-03
SadYes 2.0209 3.5325 1.0555 2.2905 3.8692 5.4480 3.38e-02 1.13e-08
MissSchoolYes 2.4909 1.2892 0.6857 0.4897 9.0492 3.3946 1.66e-01 6.07e-01
BulliedYes 1.7734 1.7128 0.9021 1.0675 3.4862 2.7483 9.67e-02 2.57e-02
CheckHomeworkYes 1.3807 0.9973 0.7132 0.6310 2.6731 1.5762 3.38e-01 9.91e-01
UnderStandingParentsYes 0.6245 0.6122 0.2955 0.3646 1.3195 1.0278 2.17e-01 6.34e-02
CheckPlaytimeYes 1.3129 1.0512 0.6643 0.6547 2.5948 1.6877 4.33e-01 8.36e-01
CigaretteUseYes 4.3902 2.7013 1.9834 1.5016 9.7177 4.8597 2.63e-04 9.11e-04
DrugUseYes 0.4292 1.3867 0.1288 0.6917 1.4303 2.7799 1.68e-01 3.57e-01
AlcholMisuseYes 1.6264 1.6739 0.7661 1.0138 3.4529 2.7639 2.05e-01 4.40e-02
NumberOfFriends0 2.3265 2.0328 0.8456 0.9436 6.4010 4.3794 1.02e-01 7.00e-02
NumberOfFriends1 0.6372 1.2684 0.2091 0.6825 1.9417 2.3575 4.28e-01 4.52e-01
NumberOfFriends2 1.1270 1.5588 0.5006 0.9213 2.5374 2.6376 7.73e-01 9.81e-02
```

Carrying out regression analysis, even simple ones like this is quite tricky, you can that see the amount of time spent preparing the data is disproportionate to the amount of time required to carry out the analysis.
The model execution is one command multinom(), and the rest of the time is spent manipulating the data and outputs. Dealing with missing data can be tricky and is a whole research field onto itself, here I have basically
passed the problem to R’s default which is basically to ignore them. Sometimes it may be useful to assign NA values to a category in a categorical explanatory variable especially if there are proportionally a lot of missing
values, this allows the missing group in each variable to be compared with others groups in the model output. There are many package that carry out imputation on data sets in R
mice is a good place to start. I hope that this has been useful to the reader.