1. Creating synthetic datasets using scikit-learn
Hello
2. Generating datasets with scikit-learn
With scikit-learn, we can create datasets that sample from various distributions, such as the normal distribution.
Here is a representation of it.
It's a continuous probability distribution that is symmetric about the mean, so the right side of the center mirrors the left side.
3. Normal distribution
Normal distributions often occur in nature. For example, heights, blood pressure, and IQ scores approximately follow the normal distribution. It is also known as the Gaussian distribution.
On the right, we have a histogram of people's heights in inches from a dataset.
4. Sample from a normal distribution
Import numpy and pandas, and create an empty DataFrame using pd dot DataFrame.
Select the mean, which is where we want the center of the values to be, and the standard deviation. A low standard deviation indicates that the values tend to be close to the mean of the dataset, while a high one indicates that the values are spread out over a wider range. Let's choose, for example, 65 and 2: in the histogram, 65 looks like the center, and 2 is a reasonable approximation since the data is barely dispersed from the center.
To sample, use the normal function from the random module of numpy.
Pass the mean, the standard deviation, and the number of samples we want to generate as arguments.
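As a minimal sketch of the sampling step just described, the snippet below draws heights from a normal distribution with mean 65 and standard deviation 2. The sample count of 1000, the seed, and the column name "height" are assumptions made for illustration; the narration does not fix them.

```python
import numpy as np
import pandas as pd

# Fix the seed so the sketch is reproducible (the seed value is an assumption).
np.random.seed(42)

mean = 65         # center of the heights, in inches
std_dev = 2       # spread of the values around the mean
n_samples = 1000  # assumed sample count; the narration does not specify one

# Start from an empty DataFrame and add the sampled values as a column.
df = pd.DataFrame()
df["height"] = np.random.normal(mean, std_dev, n_samples)

print(df["height"].describe())
```

The summary printed at the end should report a mean close to 65 and a standard deviation close to 2.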
5. Sample from a normal distribution
Let's generate a histogram with 50 bins to see the resulting heights. We see that the center of the values is around 65 inches.
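One way to draw such a histogram is sketched below with matplotlib. The sample count, the seed, and the axis labels are assumptions; the non-interactive "Agg" backend is chosen only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)                        # assumed seed, for reproducibility
heights = np.random.normal(65, 2, 1000)   # 1000 samples is an assumption

fig, ax = plt.subplots()
# hist returns the per-bin counts and the 51 edges that delimit the 50 bins.
counts, bin_edges, _ = ax.hist(heights, bins=50)
ax.set_xlabel("Height (inches)")
ax.set_ylabel("Count")

# Left edge of the tallest bin, which should sit near the chosen mean.
peak = bin_edges[counts.argmax()]
print(f"Tallest bin starts at {peak:.1f} inches")
```

In an interactive session, calling plt.show() after the plotting lines displays the figure.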
6. Creating datasets using scikit-learn
Scikit-learn has simple, easy-to-use functions for generating normally distributed datasets for several tasks: classification, clustering, and regression.
7. Synthetic data for classification and clustering
We can create datasets for multi-class classification with the make_classification function, for example to classify fruit as oranges, apples, or pears.
It allocates each class one or more normally-distributed clusters of points, creating correlated informative features and noisy uninformative ones.
make_blobs provides greater control regarding each cluster's centers and standard deviations and is used to demonstrate clustering.
8. Synthetic data for classification
Here we import make_classification from the datasets module of sklearn.
It returns the generated samples, "x", and the integer labels, "y", both as numpy arrays.
The n_samples parameter specifies how many rows to generate.
n_classes specifies the number of classes or labels to generate. n_informative specifies how many of the features are informative for classification.
n_features specifies the total number of features or columns, and n_clusters_per_class specifies the number of clusters per class. class_sep controls how far apart the classes are: larger values spread the classes out and make them easier to discriminate.
Here we create a dataset of 1000 records of 2 possible classes with two informative features.
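The call described above might look like the following sketch. Setting n_clusters_per_class to 1 and fixing random_state are assumptions added for a simple, reproducible example; n_features is set to 4 and class_sep to 1 to match the rest of the narration.

```python
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,          # rows to generate
    n_features=4,            # total columns
    n_informative=2,         # features that actually carry class information
    n_classes=2,             # number of distinct labels
    n_clusters_per_class=1,  # assumption: one cluster per class
    class_sep=1.0,           # separation between the classes
    random_state=42,         # assumption: fixed seed for reproducibility
)

print(X.shape, y.shape)
```

With the defaults left in place, the two features that are not informative are generated as redundant linear combinations of the informative ones.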
9. Synthetic data for classification
The generated x data has four features, as we specified with n_features.
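A quick way to check this is to inspect the returned arrays, as in the sketch below. The DataFrame column names are hypothetical and chosen only for readability; the seed and n_clusters_per_class value are also assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=2, n_classes=2,
    n_clusters_per_class=1,  # assumption
    random_state=42,         # assumption
)

# Hypothetical column names, used only to make the printout readable.
features = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])
print(features.head())       # first rows of the data points
print(X.shape)               # rows and columns of the feature matrix

class_counts = np.bincount(y)
print(class_counts)          # number of samples per label
```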
"y" will be the generated labels. We see x holds the data points.
10. Synthetic data for classification
When plotting the dataset with matplotlib we see that the classes, one color for each, are easy to discriminate because of the class_sep of 1.
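A scatter plot along these lines can be sketched as follows. Passing shuffle=False is an assumption made so that the two informative features land in the first two columns; the color map, marker size, and seed are likewise illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=2, n_classes=2,
    n_clusters_per_class=1,  # assumption
    class_sep=1.0,
    shuffle=False,           # assumption: keeps informative features in columns 0 and 1
    random_state=42,         # assumption
)

fig, ax = plt.subplots()
# Color each point by its label so the two classes are visually distinct.
points = ax.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", s=10)
ax.set_xlabel("Informative feature 0")
ax.set_ylabel("Informative feature 1")
```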
11. Synthetic data for classification
The higher the value of class_sep the easier it is to discriminate between classes.
On the left class_sep is 0 point 1 and on the right class_sep is 10.
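As a numeric stand-in for the two plots, the sketch below compares the distance between the class centroids for the two class_sep values. The helper name centroid_distance is hypothetical, and the other arguments are the same assumptions used earlier.

```python
import numpy as np
from sklearn.datasets import make_classification


def centroid_distance(class_sep):
    """Distance between the mean points of the two classes (hypothetical helper)."""
    X, y = make_classification(
        n_samples=1000, n_features=4, n_informative=2, n_classes=2,
        n_clusters_per_class=1,  # assumption
        class_sep=class_sep,
        random_state=42,         # assumption: same seed for a fair comparison
    )
    return float(np.linalg.norm(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0)))


near = centroid_distance(0.1)
far = centroid_distance(10)
print(f"class_sep=0.1: {near:.2f}, class_sep=10: {far:.2f}")
```

The larger class_sep should give a much larger distance, which is why the classes on the right are easier to tell apart.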
12. Synthetic data for clustering
For clustering, sklearn dot datasets provides a function called make_blobs().
The number of features, the number of centers, and each cluster's standard deviation can be specified as arguments.
The standard deviation, cluster_std, can be a single float or a sequence with one value per cluster. The default value is 1.
Here we generate a dataset with 2 features, 4 cluster centers, and a standard deviation of 1 point 5. The default size of the dataset is 100.
The shape represents 100 rows and 2 features.
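A minimal sketch of this call is shown below. The random_state value is an assumption added for reproducibility; n_samples is left at its default of 100, and with n_features set to 2 the returned array has two columns.

```python
from sklearn.datasets import make_blobs

X, labels = make_blobs(
    n_features=2,     # columns in the generated data
    centers=4,        # number of cluster centers
    cluster_std=1.5,  # standard deviation of every cluster
    random_state=42,  # assumption: fixed seed
)

print(X.shape)       # rows and features
print(set(labels))   # one integer label per cluster
```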
13. Synthetic data for clustering
Here we see the resulting dataset plotted.
14. Synthetic data for clustering
Depending on the standard deviation, the clusters will be more or less dispersed.
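To make the effect concrete, the sketch below measures the average distance of each point to the mean of its own cluster for a small and a large cluster_std. The helper name mean_spread and the two values 0.5 and 3.0 are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs


def mean_spread(cluster_std):
    """Average distance from each point to its own cluster mean (hypothetical helper)."""
    X, labels = make_blobs(
        n_samples=100, n_features=2, centers=4,
        cluster_std=cluster_std,
        random_state=42,  # assumption: same seed so the centers stay the same
    )
    per_cluster = [
        np.linalg.norm(X[labels == k] - X[labels == k].mean(axis=0), axis=1).mean()
        for k in np.unique(labels)
    ]
    return float(np.mean(per_cluster))


tight = mean_spread(0.5)
loose = mean_spread(3.0)
print(f"cluster_std=0.5: {tight:.2f}, cluster_std=3.0: {loose:.2f}")
```

A larger cluster_std yields a larger average distance, meaning the clusters are more dispersed and more likely to overlap.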
15. Let's practice!
Let's practice!