Splitting data into training and validation sets
You learned that using only training data, without a validation dataset, leads to a problem called overfitting. When overfitting occurs, the model becomes very good at predicting the training inputs but generalizes very poorly to unseen data, which makes it of little practical use. To avoid this, you can set aside a validation dataset.
In this exercise, you will create a training set and a validation set from the dataset you have (i.e. en_text, containing 1,000 English sentences, and fr_text, containing the corresponding 1,000 French sentences). You will use 80% of the dataset (800 sentence pairs) as training data and the remaining 20% (200 pairs) as validation data.
This exercise is part of the course Machine Translation with Keras.
Exercise instructions
- Define a sequence of indices using np.arange() that starts at 0 and has the size of en_text.
- Define valid_inds as the last valid_size indices from the sequence of indices.
- Define tr_en and tr_fr, which contain the sentences found at the train_inds indices in the lists en_text and fr_text.
- Define v_en and v_fr, which contain the sentences found at the valid_inds indices in the lists en_text and fr_text (a toy illustration of this indexing pattern follows this list).
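If it helps to see the pattern at a smaller scale first, here is a minimal sketch on a hypothetical five-sentence corpus (toy_en, toy_fr, and the example shuffle result are illustrative assumptions, not part of the exercise data):

import numpy as np

# Hypothetical five-sentence corpus, for illustration only
toy_en = ['hello', 'thank you', 'goodbye', 'yes', 'no']
toy_fr = ['bonjour', 'merci', 'au revoir', 'oui', 'non']

inds = np.arange(len(toy_en))   # array([0, 1, 2, 3, 4])
np.random.shuffle(inds)         # shuffled in place, e.g. array([3, 0, 4, 1, 2])

train_inds, valid_inds = inds[:4], inds[-1:]
# Indexing both lists with the same indices keeps sentence pairs aligned
train_en = [toy_en[i] for i in train_inds]
train_fr = [toy_fr[i] for i in train_inds]

Because both lists are indexed with the same shuffled indices, each English sentence stays paired with its French translation.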
Hands-on interactive exercise
Have a go at this exercise by working through the sample code below.
import numpy as np

train_size, valid_size = 800, 200

# Define a sequence of indices from 0 to len(en_text) - 1
inds = np.arange(len(en_text))
# Shuffle the indices in place so the split is random
np.random.shuffle(inds)
# Define train_inds: first train_size indices
train_inds = inds[:train_size]
# Define valid_inds: last valid_size indices
valid_inds = inds[-valid_size:]
# Define tr_en (train EN sentences) and tr_fr (train FR sentences)
tr_en = [en_text[ti] for ti in train_inds]
tr_fr = [fr_text[ti] for ti in train_inds]
# Define v_en (valid EN sentences) and v_fr (valid FR sentences)
v_en = [en_text[vi] for vi in valid_inds]
v_fr = [fr_text[vi] for vi in valid_inds]
print('Training (EN):\n', tr_en[:3], '\nTraining (FR):\n', tr_fr[:3])
print('\nValid (EN):\n', v_en[:3], '\nValid (FR):\n', v_fr[:3])
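For comparison, the same 80/20 split can be done in one call with scikit-learn's train_test_split, which shuffles and splits several lists consistently. This is an alternative sketch, not part of the course exercise, and it assumes scikit-learn is installed:

from sklearn.model_selection import train_test_split

# Split both lists in the same shuffled order; test_size=0.2 gives the 20% validation share
tr_en2, v_en2, tr_fr2, v_fr2 = train_test_split(
    en_text, fr_text, test_size=0.2, random_state=42)

Here random_state fixes the shuffle so the split is reproducible across runs; the NumPy version above can achieve the same by calling np.random.seed before shuffling.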