
Unsupervised learning algorithms
Unsupervised learning algorithms learn the properties of data on their own, without explicit human intervention or labeling. Typically within the AI field, unsupervised learning techniques learn the probability distribution that generated a dataset. These algorithms, such as autoencoders (which we will visit later in the book), are useful for a variety of tasks where we simply don't know important information about our data that would allow us to use traditional supervised techniques.
PCA is an unsupervised method for feature extraction. It combines the input variables into new components in such a way that we can drop the components that carry the least information. Afterwards, we are left with new variables that are independent of each other, making them easy to utilize in a basic linear model.
AI applications are fundamentally hampered by the curse of dimensionality. This phenomenon, which occurs when the number of dimensions in your data is high, makes it incredibly difficult for learning algorithms to perform well. PCA can help alleviate this problem for us. PCA is one of the primary examples of what we call dimensionality reduction, which helps us take high-feature spaces (lots of data attributes) and transform them into lower-feature spaces (only the important features).
Dimensionality reduction can be conducted in two primary ways: feature elimination and feature extraction. Whereas feature elimination may involve arbitrarily cutting features from the dataset, feature extraction, of which PCA is a form, gives us a more intuitive way to reduce our dimensionality. So, how does it work? In a nutshell:
- We create a matrix (correlation or covariance) that describes how all of our data relates to each other
- We decompose this matrix into separate components, called the eigenvalues and the eigenvectors, which describe the direction and magnitude of our data
- We then transform or project our original data onto these components
Let's break this down manually in Python to illustrate the process, using the same Iris dataset that we used for the supervised learning illustration.
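If you're starting from a fresh session, a minimal setup might look like the following. This is only a sketch: it assumes scikit-learn is available and that the data ends up in a pandas DataFrame named features, matching the code that follows.
from sklearn import datasets
import pandas as pd
import numpy as np

# Load the Iris measurements into a DataFrame of features
iris = datasets.load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)
First, we'll standardize the features and create the correlation matrix: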
# Standardize each feature to have zero mean and unit variance
features = (features - features.mean()) / features.std()
# Compute the correlation matrix of the standardized features
corr_matrix = np.corrcoef(features.values.T)
corr_matrix
The output should look as follows:
Our correlation matrix contains information on how every feature relates to each of the other features. This record of association is essential input for the algorithm: lots of variability typically indicates a signal, whereas a lack of variability indicates noise. The more variability present in a particular direction, the more there is to detect. Next, we'll create our eigen_values and eigen_vectors:
# Eigendecomposition of the correlation matrix
eigen_values, eigen_vectors = np.linalg.eig(corr_matrix)
The output should look as follows:
Eigenvectors and eigenvalues come in pairs: each eigenvector is a direction in the data, and its eigenvalue tells us how much variance exists in that direction. In PCA, we want to understand which directions account for the most variance in the data (that is, how much of the data they explain). By calculating the eigenvectors and their corresponding eigenvalues, we can begin to understand what is most important in our dataset.
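As a quick sanity check, we can see how much of the total variance each eigenvalue accounts for by normalizing the eigenvalues. This is a small illustrative sketch (the explained_variance variable is introduced here purely for demonstration):
# Each eigenvalue's share of the total variance (its explained variance ratio)
explained_variance = eigen_values / eigen_values.sum()
print(explained_variance)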
We now want to sort the eigenvalue/eigenvector pairs by their eigenvalues, from highest to lowest:
# Pair each eigenvalue with its corresponding eigenvector (the columns of eigen_vectors)
eigenpairs = [(eigen_values[i], eigen_vectors[:, i]) for i in range(len(eigen_values))]
# Sort the pairs by eigenvalue, largest first
eigenpairs.sort(key=lambda pair: pair[0], reverse=True)
Lastly, we need to project these pairs onto a lower-dimensional space. This is the dimensionality reduction aspect of PCA:
# Stack the top two eigenvectors column-wise to form the projection matrix
projection = np.hstack((eigenpairs[0][1].reshape(eigen_vectors.shape[1], 1),
                        eigenpairs[1][1].reshape(eigen_vectors.shape[1], 1)))
We'll then apply this transformation to the standardized data:
# Project the standardized features onto the first two principal components
transform = features.dot(projection)
We can then plot the components against each other:
import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(figsize=(8, 8))
ax = fig.gca()
# Scatter plot of the data projected onto the first two principal components
ax = sns.regplot(x=transform.iloc[:, 0], y=transform.iloc[:, 1],
                 fit_reg=False, scatter_kws={'s': 70}, ax=ax)
ax.set_xlabel('principal component 1', fontsize=10)
ax.set_ylabel('principal component 2', fontsize=10)
ax.tick_params(axis='both', labelsize=12)
ax.set_title('Principal Component 1 vs Principal Component 2\n', fontsize=15)
plt.show()
You should see the plot as follows:
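In practice, you would rarely carry out these steps by hand; scikit-learn bundles them into its PCA class. The following is only a minimal sketch of the equivalent workflow (the pca and components names are introduced here for illustration); because features has already been standardized, it mirrors the correlation-based decomposition above, up to possible sign flips in the components:
from sklearn.decomposition import PCA

# Fit PCA on the standardized features and keep the first two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(features)

# Fraction of the total variance captured by each retained component
print(pca.explained_variance_ratio_)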
So, when should we use PCA? PCA is a good choice when the following are true:
- You have high dimensionality (too many variables) and want a logical way to reduce them
- You need to ensure that your variables are independent of one another (see the quick check after this list)
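To confirm that independence for our example, we can look at the correlation between the two components we just computed; the off-diagonal entries should be essentially zero. This is just a quick illustrative check using the transform DataFrame from above:
# Correlation matrix of the two principal components; off-diagonal entries should be ~0
print(np.corrcoef(transform.values.T))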
One of the downsides of PCA, however, is that it makes the underlying data more opaque, thus hurting its interpretability. Besides PCA, other commonly seen non-deep-learning unsupervised learning algorithms are:
- K-means clustering (which we described previously)
- Hierarchical clustering
- Mixture models
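As a brief illustration of the first of these, here is a minimal k-means sketch on the same standardized Iris features; the choice of three clusters is an assumption based on the three Iris species:
from sklearn.cluster import KMeans

# Group the standardized Iris features into three clusters (assuming one cluster per species)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(features)
print(cluster_labels)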