1. Assembling choice data
In this chapter, we'll focus on the process of collecting and organizing choice data. This is often the most challenging and time-consuming part of the choice modeling process.
2. Choices observed in the "wild"
As we discussed in the last chapter, there are lots of places you could find choice data.
One of the most common choices we collect data on is consumer purchases. We can reconstruct purchases that a consumer made in the grocery store or an online store by looking at the transaction records to see what each customer bought and then figuring out what other products were available at the store on the day the customer made a purchase.
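As a rough sketch of that reconstruction, the R code below pairs each purchase with every product that was on the shelf that day. The data frames `transactions` and `available`, their column names, and all values are made up for illustration; real transaction records will need more cleaning than this.

```r
# Hypothetical transaction records: one row per purchase
transactions <- data.frame(
  customer = c("A", "B"),
  day      = c(1, 2),
  bought   = c("milk", "dark")
)

# Hypothetical availability records: products on the shelf each day
available <- data.frame(
  day     = c(1, 1, 2, 2, 2),
  product = c("milk", "dark", "milk", "dark", "white"),
  price   = c(2.5, 3.0, 2.5, 2.8, 3.2)
)

# Pair each purchase with every product available on the same day
choice_sets <- merge(transactions, available, by = "day")

# Mark the purchased product with 1 and the other available products with 0
choice_sets$choice <- as.integer(choice_sets$product == choice_sets$bought)
```

Each purchase now comes with its full set of alternatives, not just the product that was bought.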
But you could also model lots of other types of choices, like households deciding what to watch on a video streaming service, citizens voting, or even people choosing marriage partners.
Choices observed in the "wild" are sometimes called "revealed preference" data.
3. Survey choices
Sometimes, we can't find choices in the wild that are relevant to the business decision we are trying to make. For instance, if we are designing a self-driving car and want to predict how many people will buy it, it may be hard to find choices in the market that involved self-driving cars. Instead, we can create a survey where people make hypothetical choices. For instance, this question asks people to choose between three different hypothetical SUVs. This type of survey is called a "conjoint survey" and the resulting data is sometimes referred to as "stated preference" data. The nice thing about the survey is that you can include any product features you want - even if they don't exist yet - and see how customers react.
4. Long format choice data
Whether your data comes from the "wild" or from a conjoint survey, each choice observation consists of a set of observed alternatives that were available to the decision maker and an observed choice. The alternatives are described by a set of features that we call attributes. I like to store choice data in long format like this, where we stack the alternatives up together so their common features are in the same columns. The selected alternative is indicated in the choice column, with a one in the row for that alternative and zeros in the other rows.
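To make the long layout concrete, here is a small made-up example in R. The column names (`ques`, `alt`, `seat`, `price`, `choice`) and the values are assumptions for illustration, not the course data.

```r
# Hypothetical long-format choice data:
# two choice observations, three alternatives each
choice_long <- data.frame(
  ques   = rep(1:2, each = 3),         # which choice observation
  alt    = rep(1:3, times = 2),        # alternative within the observation
  seat   = c(2, 4, 5, 2, 5, 4),        # attribute: number of seats
  price  = c(35, 30, 40, 30, 35, 40),  # attribute: price
  choice = c(0, 0, 1, 1, 0, 0)         # 1 = selected, 0 = not selected
)

# Sanity check: exactly one alternative is chosen in each observation
chosen_per_ques <- tapply(choice_long$choice, choice_long$ques, sum)
stopifnot(all(chosen_per_ques == 1))
```

A check like the last line is a cheap way to catch observations with no recorded choice or with more than one.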
5. Wide format choice data
Sometimes - especially when you use an online survey tool like Qualtrics or Google Forms - the data is stored in a wide format where each choice is described in a single row and we have sets of columns describing the attributes for each of the alternatives.
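If you prefer long format, wide data can be restacked. Here is one possible sketch using base R's `reshape()` on a small made-up data frame. The column names and values are assumptions for illustration, and other tools would work just as well.

```r
# Hypothetical wide-format data: one row per choice observation,
# with one set of attribute columns per alternative
choice_wide <- data.frame(
  ques    = 1:2,
  seat.1  = c(2, 2),   seat.2  = c(4, 5),   seat.3  = c(5, 4),
  price.1 = c(35, 30), price.2 = c(30, 35), price.3 = c(40, 40),
  choice  = c(3, 1)    # number of the alternative that was chosen
)

# Stack the alternatives so that each one gets its own row
choice_long <- reshape(
  choice_wide,
  direction = "long",
  varying   = list(c("seat.1", "seat.2", "seat.3"),
                   c("price.1", "price.2", "price.3")),
  v.names   = c("seat", "price"),
  timevar   = "alt",
  times     = 1:3,
  idvar     = "ques"
)

# Recode the choice as 1 for the selected alternative and 0 otherwise
choice_long$choice <- as.integer(choice_long$choice == choice_long$alt)

# Keep the alternatives for each observation together
choice_long <- choice_long[order(choice_long$ques, choice_long$alt), ]
```

The two wide rows become six long rows, one per alternative, with a single one in the choice column for each observation.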
6. Wide format choice data in R
Let's take a look at some wide choice data in R. Here are the first couple of rows of the sportscar data transformed into wide format. It's so wide that you can't see all the columns. But just looking at the first few columns, you can see that we have three columns for the seat attribute: seat-dot-1 is the number of seats in the first alternative, seat-dot-2 is the number in the second and so forth.
The choice is recorded as an integer. For example, in the choice described in the first row, alternative 3 was chosen.
The number of observed choices in this data is the number of rows in the data frame, so using nrow() we can see that the sportscar_wide data describe 2000 choices.
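Here is the same idea on a tiny made-up data frame. The name `toy_wide` and its values are hypothetical stand-ins, not the sportscar data.

```r
# Hypothetical wide-format data with three choice observations
toy_wide <- data.frame(
  seat.1 = c(2, 2, 4),
  seat.2 = c(4, 5, 2),
  seat.3 = c(5, 4, 5),
  choice = c(3L, 1L, 3L)  # number of the alternative that was chosen
)

# One row per choice, so the row count is the number of observed choices
n_choices <- nrow(toy_wide)

# How often each alternative position was chosen
choice_counts <- table(toy_wide$choice)
```

Tabulating the choice column is a quick check that the recorded integers fall in the range of alternatives you expect.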
7. What types of chocolate do people choose?
In the next exercises, you will take a look at a choice data set that describes people choosing different types of chocolate. Mmmm chocolate. It was collected in a survey and is stored in wide format.