Exercise

Splitting data into training and validation sets

You learned that training on the full dataset without holding out a validation set leads to a problem called overfitting. When overfitting occurs, the model becomes very good at predicting outputs for the training inputs but generalizes very poorly to unseen data. Such a model is of little practical use, since it cannot generalize. To avoid this, you can set aside part of the data as a validation set.

In this exercise, you will create a training set and a validation set from the dataset you have (i.e. en_text, containing 1000 English sentences, and fr_text, containing the corresponding 1000 French sentences). You will use 80% of the dataset as training data and 20% as validation data.

Instructions
100 XP
  • Define a sequence of indices using np.arange(), starting at 0 and with the same size as en_text.
  • Define valid_inds as the last valid_size indices of the sequence, and train_inds as the remaining indices.
  • Define tr_en and tr_fr, which contain the sentences found at the train_inds indices in the lists en_text and fr_text.
  • Define v_en and v_fr, which contain the sentences found at the valid_inds indices in the lists en_text and fr_text.
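The steps above can be sketched as follows. This is a minimal sketch, not the official solution: it assumes en_text and fr_text are plain Python lists of equal length (a small toy dataset stands in for the 1000 sentences here), and that valid_size is 20% of the data.

```python
import numpy as np

# Toy stand-ins for the real en_text / fr_text (1000 sentences each)
en_text = ["english sentence %d" % i for i in range(10)]
fr_text = ["french sentence %d" % i for i in range(10)]

# 20% of the data is held out for validation
valid_size = int(0.2 * len(en_text))

# Sequence of indices 0 .. len(en_text)-1
inds = np.arange(len(en_text))

# Last valid_size indices -> validation; the rest -> training
train_inds = inds[:-valid_size]
valid_inds = inds[-valid_size:]

# Gather the sentences at those indices from both lists
tr_en = [en_text[i] for i in train_inds]
tr_fr = [fr_text[i] for i in train_inds]
v_en = [en_text[i] for i in valid_inds]
v_fr = [fr_text[i] for i in valid_inds]

print(len(tr_en), len(v_en))  # 8 2
```

Because the same index arrays are applied to both en_text and fr_text, each English sentence stays aligned with its French translation after the split.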