
Splitting data to training and validation sets

You learned that using only the training data, without a validation dataset, leads to a problem called overfitting. When overfitting occurs, the model becomes very good at predicting the training inputs but generalizes very poorly to unseen data. Such a model is not very useful, as it cannot generalize. To avoid this, you can use a validation dataset.

In this exercise, you will create a training set and a validation set from the dataset you have (i.e. en_text, containing 1000 English sentences, and fr_text, containing the corresponding 1000 French sentences). You will use 80% of the dataset as training data and 20% as validation data.

This exercise is part of the course

Machine Translation with Keras


Exercise instructions

  • Define a sequence of indices using np.arange(), starting at 0 and with the same length as en_text.
  • Define valid_inds as the last valid_size indices of the shuffled index sequence.
  • Define tr_en and tr_fr, which contain the sentences found at the train_inds indices of the lists en_text and fr_text, respectively.
  • Define v_en and v_fr, which contain the sentences found at the valid_inds indices of the lists en_text and fr_text, respectively.

Interactive hands-on exercise

Try to solve this exercise by completing the sample code.

train_size, valid_size = 800, 200
# Define a sequence of indices from 0 to len(en_text) - 1
inds = ____.____(len(_____))
np.random.shuffle(inds)
train_inds = inds[:train_size]
# Define valid_inds: last valid_size indices
valid_inds = inds[____]
# Define tr_en (train EN sentences) and tr_fr (train FR sentences)
tr_en = [en_text[____] for ti in ____]
tr_fr = [____ for ti in ____]
# Define v_en (valid EN sentences) and v_fr (valid FR sentences)
v_en = [____ for vi in valid_inds]
v_fr = [____ for vi in ____]
print('Training (EN):\n', tr_en[:3], '\nTraining (FR):\n', tr_fr[:3])
print('\nValid (EN):\n', v_en[:3], '\nValid (FR):\n', v_fr[:3])
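For reference, here is one possible completion of the scaffold above. Since en_text and fr_text are preloaded in the exercise environment, this sketch substitutes small placeholder lists of 1000 sentences each; everything else follows the steps in the instructions.

```python
import numpy as np

# Placeholder parallel corpora standing in for the preloaded
# en_text / fr_text lists (1000 sentences each in the exercise).
en_text = [f"english sentence {i}" for i in range(1000)]
fr_text = [f"phrase francaise {i}" for i in range(1000)]

train_size, valid_size = 800, 200

# Sequence of indices from 0 to len(en_text) - 1
inds = np.arange(len(en_text))
# Shuffle so the split is random rather than positional
np.random.shuffle(inds)

# First 800 shuffled indices for training, last 200 for validation
train_inds = inds[:train_size]
valid_inds = inds[-valid_size:]

# Gather the sentences at the training indices, keeping EN/FR pairs aligned
tr_en = [en_text[ti] for ti in train_inds]
tr_fr = [fr_text[ti] for ti in train_inds]
# Gather the sentences at the validation indices
v_en = [en_text[vi] for vi in valid_inds]
v_fr = [fr_text[vi] for vi in valid_inds]

print('Training (EN):\n', tr_en[:3], '\nTraining (FR):\n', tr_fr[:3])
print('\nValid (EN):\n', v_en[:3], '\nValid (FR):\n', v_fr[:3])
```

Because the same index list drives both the English and French selections, each translation pair stays aligned, and slicing a single shuffled sequence guarantees the training and validation sets never overlap.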