
Splitting data into training and validation sets

You learned that using only training data, without a validation dataset, leads to a problem called overfitting. When overfitting occurs, the model becomes very good at predicting the training inputs but generalizes poorly to unseen data. Such a model is not very useful, as it cannot generalize. To avoid this, you can set aside a validation dataset.

In this exercise, you will create a training set and a validation set from the dataset you have (i.e. en_text, containing 1000 English sentences, and fr_text, containing the corresponding 1000 French sentences). You will use 80% of the dataset as training data and 20% as validation data.

This exercise is part of the course

Machine Translation with Keras


Exercise instructions

  • Define a sequence of indices using np.arange() that starts at 0 and has the same length as en_text.
  • Define valid_inds as the last valid_size indices from the sequence of indices.
  • Define tr_en and tr_fr, which contain the sentences found at the train_inds indices of the lists en_text and fr_text.
  • Define v_en and v_fr, which contain the sentences found at the valid_inds indices of the lists en_text and fr_text.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

train_size, valid_size = 800, 200
# Define a sequence of indices from 0 to len(en_text)
inds = ____.____(len(_____))
np.random.shuffle(inds)
train_inds = inds[:train_size]
# Define valid_inds: last valid_size indices
valid_inds = inds[____]
# Define tr_en (train EN sentences) and tr_fr (train FR sentences)
tr_en = [en_text[____] for ti in ____]
tr_fr = [____ for ti in ____]
# Define v_en (valid EN sentences) and v_fr (valid FR sentences)
v_en = [____ for vi in valid_inds]
v_fr = [____ for vi in ____]
print('Training (EN):\n', tr_en[:3], '\nTraining (FR):\n', tr_fr[:3])
print('\nValid (EN):\n', v_en[:3], '\nValid (FR):\n', v_fr[:3])
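For reference, one possible way to complete the scaffold is sketched below. Since en_text and fr_text are only available inside the course environment, this sketch builds small stand-in lists of the same size so the code runs on its own; everything else follows the steps in the instructions.

```python
import numpy as np

# Stand-in data: in the course, en_text and fr_text are provided
# as parallel lists of 1000 English/French sentences.
en_text = ['en sentence %d' % i for i in range(1000)]
fr_text = ['fr sentence %d' % i for i in range(1000)]

train_size, valid_size = 800, 200

# Define a sequence of indices from 0 to len(en_text)
inds = np.arange(len(en_text))
np.random.shuffle(inds)

# First train_size shuffled indices -> training; last valid_size -> validation
train_inds = inds[:train_size]
valid_inds = inds[-valid_size:]

# Gather training sentences at the train_inds indices
tr_en = [en_text[ti] for ti in train_inds]
tr_fr = [fr_text[ti] for ti in train_inds]

# Gather validation sentences at the valid_inds indices
v_en = [en_text[vi] for vi in valid_inds]
v_fr = [fr_text[vi] for vi in valid_inds]

print('Training (EN):\n', tr_en[:3], '\nTraining (FR):\n', tr_fr[:3])
print('\nValid (EN):\n', v_en[:3], '\nValid (FR):\n', v_fr[:3])
```

Because the indices are shuffled before slicing, the 800/200 split is a random partition of the data, and the same index list is used for both languages so each English sentence stays aligned with its French translation.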