Splitting training and validation data
You will be creating training and validation datasets. Keeping a validation dataset and monitoring the model's performance on it is good practice for detecting and avoiding overfitting.
For this exercise you have been provided with en_text (English sentences) and fr_text (French sentences).
This exercise is part of the course
Machine Translation with Keras
Exercise instructions
- Define a sequence of indices using np.arange() that starts at 0 and has the size of en_text.
- Define train_inds as the first train_size indices from the sequence of indices.
- Define tr_en and tr_fr, which contain the sentences found at the indices specified by train_inds in the lists en_text and fr_text.
- Define v_en and v_fr, which contain the sentences found at the indices specified by valid_inds in the lists en_text and fr_text.
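The index mechanics behind these steps can be sketched on a small scale. This is a minimal illustration, not the exercise solution: the sizes (10, 8, 2) and the fixed seed are assumptions chosen only to keep the example small and repeatable.

```python
import numpy as np

# Hypothetical toy sizes, chosen only for illustration
n_samples, train_size, valid_size = 10, 8, 2

np.random.seed(0)             # assumption: fixed seed so the shuffle is repeatable
inds = np.arange(n_samples)   # indices 0, 1, ..., n_samples - 1
np.random.shuffle(inds)       # shuffles the array in place

# The first train_size shuffled indices go to training,
# the next valid_size indices go to validation
train_inds = inds[:train_size]
valid_inds = inds[train_size:train_size + valid_size]

# The two slices never share an index, so no sentence lands in both sets
overlap = set(train_inds.tolist()) & set(valid_inds.tolist())
print(len(train_inds), len(valid_inds), overlap)
```

Because both slices come from one shuffled sequence, the split is random yet guaranteed to be disjoint.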
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
train_size, valid_size = 800, 200
# Define a sequence of indices from 0 to size of en_text
inds = np.____(len(____))
np.random.shuffle(inds)
# Define train_inds as first train_size indices
train_inds = inds[:____]
valid_inds = inds[train_size:train_size+valid_size]
# Define tr_en (train EN sentences) and tr_fr (train FR sentences)
tr_en = [en_text[ti] for ti in ____]
tr_fr = [____[____] for ti in ____]
# Define v_en (valid EN sentences) and v_fr (valid FR sentences)
v_en = [en_text[____] for vi in ____]
v_fr = [____[____] for vi in ____]
print('Training (EN):\n', tr_en[:3], '\nTraining (FR):\n', tr_fr[:3])
print('\nValid (EN):\n', v_en[:3], '\nValid (FR):\n', v_fr[:3])
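One possible way to fill in the blanks is sketched below. The en_text and fr_text lists here are hypothetical stand-ins (the exercise environment supplies the real sentences), and each stand-in sentence ends with its position so that the pairing can be checked.

```python
import numpy as np

# Hypothetical stand-in data; the exercise provides the real en_text and fr_text
en_text = ['english sentence %d' % i for i in range(1000)]
fr_text = ['phrase française %d' % i for i in range(1000)]

train_size, valid_size = 800, 200

# Sequence of indices from 0 to the size of en_text, shuffled in place
inds = np.arange(len(en_text))
np.random.shuffle(inds)

# First train_size indices for training, the next valid_size for validation
train_inds = inds[:train_size]
valid_inds = inds[train_size:train_size + valid_size]

# Use the same indices for both languages so sentence pairs stay aligned
tr_en = [en_text[ti] for ti in train_inds]
tr_fr = [fr_text[ti] for ti in train_inds]
v_en = [en_text[vi] for vi in valid_inds]
v_fr = [fr_text[vi] for vi in valid_inds]

print('Training (EN):\n', tr_en[:3], '\nTraining (FR):\n', tr_fr[:3])
print('\nValid (EN):\n', v_en[:3], '\nValid (FR):\n', v_fr[:3])
```

Indexing en_text and fr_text with the same train_inds (or valid_inds) is what keeps each English sentence matched with its French translation after shuffling.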