
Inspecting choice data

1. Inspecting choice data

One thing that trips up new choice modelers is that choice data doesn't fit into the usual format we use for predictive modeling. So, let's take a look at how choice data is structured.

2. Data for linear regression

We usually organize data in rows where each row represents one observation. For the data here, each row is an observation of sales at a store and we have some information about the characteristics of each store. In this data, the number of rows is the number of observations.

3. Data for a choice model

In a typical choice dataset, we observe someone making a choice from a set of options that have common features. It's convenient to stack up the options for one choice observation into multiple rows where each row describes one of the alternatives. For instance, the first three rows of this data describe a choice from among three different sports cars. The first car was a 2-seater with a manual transmission for 35 thousand dollars. The other two options were both automatic 5-seaters, one at 40 thousand and the other at 30 thousand.

To keep track of which rows belong to which observations, we have columns called ques - short for question - and alt - short for alternative. The first three values of ques are all 1s, indicating that these three rows all belong to question 1. The values of alt are 1, 2 and 3, indicating that these are the three alternatives the customer chose from. The choice is recorded in the column labeled choice as a 0 or 1 for each option and, of course, only one option was chosen for each observed choice. In question 1, the third option was chosen.

The next three rows describe another choice. It also has three alternatives, but that doesn't have to be the case - some of the observed choices may have four or five or more options. The important thing to realize is that there is a row in the data frame for each alternative that was available, and a set of rows makes up one observed choice.
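To make the stacked layout concrete, here is a minimal sketch of a toy data frame in that shape. The column names ques, alt and choice come from the description above; the names seat, trans and price, the object name sportscar_toy, and every value in question 2 are assumptions made up for illustration, not the course's actual dataset.

```r
# Toy choice data: one row per alternative, several rows per observed choice.
# Question 1 mirrors the example described above; question 2 is hypothetical.
sportscar_toy <- data.frame(
  ques   = c(1, 1, 1, 2, 2, 2),            # which choice question the row belongs to
  alt    = c(1, 2, 3, 1, 2, 3),            # which alternative within the question
  seat   = c(2, 5, 5, 5, 2, 4),            # number of seats (assumed column name)
  trans  = c("manual", "auto", "auto",
             "manual", "auto", "auto"),    # transmission (assumed column name)
  price  = c(35, 40, 30, 35, 30, 40),      # price in thousands of dollars
  choice = c(0, 0, 1, 0, 1, 0)             # 1 = chosen alternative, 0 = not chosen
)

# The number of rows is not the number of observed choices.
n_rows    <- nrow(sportscar_toy)
n_choices <- length(unique(sportscar_toy$ques))

# Sanity checks: alternatives per question, and exactly one chosen per question.
alts_per_ques   <- table(sportscar_toy$ques)
chosen_per_ques <- tapply(sportscar_toy$choice, sportscar_toy$ques, sum)

print(sportscar_toy)
print(c(rows = n_rows, choices = n_choices))
print(chosen_per_ques)
```

Checking that choice sums to 1 within each value of ques is a cheap way to catch rows that were dropped or duplicated before fitting a model.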

4. Summarizing choice data with choice counts

Our ultimate goal is to fit a multinomial logit model to choice data, but before we do, we should compute some descriptive statistics so we get a feel for what's going on in the data. With choice data, what we really want to know is what people are choosing. One way to get a sense for this is to count up the number of times a 30 thousand dollar sports car is chosen in the data and compare that to the number of times a 35 or 40 thousand dollar sports car is chosen.

We can do this using the function xtabs(), which, as you can see here, takes two inputs: a formula and a data frame. In the code here the formula says "sum up the choice variable separately for each level of price". Because choice is a 0 or 1 indicating whether that alternative was chosen, the output is a count of the number of times a sports car was chosen at 30, 35, and 40 thousand dollars. From the output, we can see that this data includes 1,010 choices where the chosen car was priced at 30 thousand dollars and only 324 choices of cars priced at 40 thousand dollars. Not much of a surprise there - people like cheaper cars! By the way, you could do this same calculation using the dplyr package, if you prefer, but I find the formula input for xtabs() convenient.
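As a rough sketch of the xtabs() call described above, the example below runs it on a small made-up data frame rather than the course's sports car data. The object name toy and all of its values are hypothetical, so the counts it produces have nothing to do with the 1,010 and 324 figures quoted above.

```r
# Hypothetical stacked choice data: three questions, three alternatives each.
toy <- data.frame(
  ques   = rep(1:3, each = 3),
  alt    = rep(1:3, times = 3),
  price  = c(35, 40, 30,  30, 35, 40,  40, 30, 35),  # thousands of dollars
  choice = c(0, 0, 1,     0, 1, 0,     0, 1, 0)      # one chosen row per question
)

# Formula on the left: the variable to sum; on the right: the grouping variable.
# Because choice is 0/1, the sums are counts of chosen alternatives per price.
counts <- xtabs(choice ~ price, data = toy)
print(counts)

# Base R cross-check that computes the same sums without the formula interface.
counts_check <- tapply(toy$choice, toy$price, sum)

# Share of all observed choices that went to each price level.
shares <- prop.table(counts)
print(shares)
```

The same formula pattern works for any attribute column, for example replacing price with a transmission or seat column if the data has one.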

5. Let's look at some choice data in R!

So that you can get a feel for this, let's take a look at the sports car data in R.
