
Category embeddings

1. Category embeddings

In chapter 1, our dataset of tournament games only contained about 4,000 rows. However, we have a much bigger dataset with over 300,000 regular season games. Let's see what we can learn from a much larger sample of data! In the 2 basketball datasets you will be using in this course, there are a little under 11,000 teams. Each team is coded as an integer starting with 1 and ending with 10,887. In this lesson, you will learn how to use those team IDs as inputs to a model that learns the strength of each team.

2. Category embeddings

Categorical embeddings are an advanced type of layer, only available in deep learning libraries. They are extremely useful for dealing with high cardinality categorical data. In this dataset, the team ID variable has high cardinality. Embedding layers are also very useful for dealing with text data, such as in Word2vec models, but that is beyond the scope of this course. To model these teams in the basketball data, you'll use a very simple model that learns a "strength" rating for each team and uses those ratings to make predictions. To map the integer team IDs to a decimal rating, we will use an embedding layer.

3. Inputs

To get started with category embeddings, you will need an input layer. In this case, your input is a single number, ranging from 1 to 10,887, which represents each team's unique ID. Note that this dataset covers about 30 years of data and about 400 unique schools, which would give close to 12,000 possible year/team combinations. We only end up with about 11,000 team IDs, because not every school has a basketball team every year.
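As a minimal sketch of this step, assuming the tensorflow.keras functional API used throughout the course (the variable and layer names here are illustrative, not from the course code):

```python
from tensorflow.keras.layers import Input

# Each sample is a single integer: the team's unique ID
team_in = Input(shape=(1,), name='Team-In')
```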

4. Embedding Layer

To create an embedding layer, use the Embedding() layer from tensorflow.keras.layers. Since there are 10,887 unique teams in the dataset, you define the input dimension of the embedding layer as 10,887. As you are representing each team as a single integer, use an input length of 1. You want to produce a single team strength rating, so use an output dimension of 1. Finally, name your layer so you can easily find it when looking at the model summary or plot. To use the embedding layer, connect it to the tensor produced by the input layer. This will produce an embedding output tensor.
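A sketch of this step with the same assumptions as above (tensorflow.keras functional API; illustrative names; dimensions taken from the numbers in the transcript):

```python
from tensorflow.keras.layers import Embedding, Input

# Input tensor: one integer team ID per sample
team_in = Input(shape=(1,), name='Team-In')

# Embedding layer: one entry per team, each mapped to a single
# "strength" rating (output_dim=1)
team_strength = Embedding(input_dim=10887,
                          output_dim=1,
                          input_length=1,
                          name='Team-Strength')

# Connect the layer to the input tensor to get the embedding output tensor
strength_lookup = team_strength(team_in)
```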

5. Flattening

Embedding layers increase the dimensionality of your data. The input CSV has two dimensions (rows and columns), but embedding layers add a third dimension. This third dimension can be useful when dealing with images and text, but it is not relevant to this course. Therefore, we use a Flatten layer to flatten the embeddings from 3D back to 2D. The Flatten layer is also the output layer for the embedding process. Flatten layers are advanced layers for deep learning models that transform data from multiple dimensions back down to two dimensions. They are useful for dealing with time series data, text data, and images.
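Continuing the sketch under the same assumptions (illustrative names), the Flatten layer is applied to the embedding output tensor:

```python
from tensorflow.keras.layers import Embedding, Flatten, Input

team_in = Input(shape=(1,), name='Team-In')
strength_lookup = Embedding(input_dim=10887, output_dim=1,
                            input_length=1,
                            name='Team-Strength')(team_in)

# The embedding output is 3D: (batch_size, 1, 1).
# Flatten removes the extra dimension, producing a 2D tensor: (batch_size, 1).
strength_flat = Flatten(name='Flatten-Strength')(strength_lookup)
```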

6. Put it all together

Now you can wrap your embedding layer in a model. This will allow you to reuse the model for multiple inputs in the dataset. You do this by defining an input layer, then an embedding layer, then a flatten layer for the output. Finally, wrap the input tensor and flatten tensor in a model. This model can be treated exactly like a layer and reused inside another model.
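Putting the three pieces together, a minimal sketch of the wrapped model might look like this (names such as team_strength_model and Team-1-In are illustrative assumptions, not taken from the course code):

```python
from tensorflow.keras.layers import Embedding, Flatten, Input
from tensorflow.keras.models import Model

# Input layer: a single team ID per sample
team_in = Input(shape=(1,), name='Team-In')

# Embedding layer: looks up a learned strength rating for each team
strength_lookup = Embedding(input_dim=10887, output_dim=1,
                            input_length=1,
                            name='Team-Strength')(team_in)

# Flatten layer: 3D embedding output down to 2D
strength_flat = Flatten(name='Flatten-Strength')(strength_lookup)

# Wrap the input and output tensors in a reusable model
team_strength_model = Model(team_in, strength_flat, name='Team-Strength-Model')

# The wrapped model can now be called like a layer, for example on a
# new input tensor for one of the teams in a game (illustrative usage):
# team_1_in = Input(shape=(1,), name='Team-1-In')
# team_1_strength = team_strength_model(team_1_in)
```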

7. Let's practice!

Now, it's your turn!