Unsupervised learning


Unsupervised learning is a type of machine learning that looks for previously undetected patterns in a data set with no pre-existing labels and a minimum of human supervision, in contrast to supervised learning, which usually makes use of human-labelled data.

Two of the main methods used in unsupervised learning are principal component analysis and cluster analysis.

  • Principal component analysis (PCA) is a technique for reducing the dimensionality of datasets, increasing interpretability while minimising information loss. It does so by creating new uncorrelated variables that successively maximise variance.

  • Cluster analysis is used in unsupervised learning to group, or segment, datasets with shared attributes in order to extrapolate algorithmic relationships. It groups data that has not been labelled, classified or categorised; instead of responding to feedback, it identifies commonalities in the data and reacts based on the presence or absence of those commonalities in each new piece of data.

Labels can be obtained by asking humans to make judgments about a given piece of unlabeled data (e.g., "Does this photo contain a horse or a cow?"), and are significantly more expensive to obtain than the raw unlabeled data.
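
As a quick preview of the two techniques before turning to the fish data, the sketch below runs KMeans and PCA from scikit-learn on a small synthetic dataset (the blob data and all parameter choices are purely illustrative and are not part of the fish analysis that follows):

# illustrative sketch only: synthetic blob data, not the fish dataset used below
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=200, centers=3, n_features=5, random_state=0)

# cluster analysis: assign each sample to one of 3 groups based on shared attributes
cluster_labels = KMeans(n_clusters=3).fit_predict(X)

# PCA: new uncorrelated variables, ordered by the variance they explain
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(cluster_labels[:10])
print(pca.explained_variance_ratio_)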

1) Importing and visualising the data

In this notebook I am going to analyse some data about species of fish. We have 6 unlabelled features describing an unknown (for now) number of species of fish. The fish data is sourced from the Journal of Statistics Education data archive: http://jse.amstat.org/jse_data_archive.htm

In this notebook I am going to use clustering and dimension reduction, a technique used to identify patterns in data that allow the data to be explained by those patterns instead of by all the individual features. Identifying patterns that provide the same predictability as the individual features lets you maintain predictive capability while using far less computing power, which is essential when the features become too numerous!

I will begin by looking at just two of the features in a 2D space, which is easy to visualise and gives us an indication as to whether there are any natural clusters in the data.

As you can see from plotting the 4th and 5th features, there seem to be some natural groupings in the data. In fact, from first inspection it looks as if there may be 4 groupings, possibly indicating 4 species of fish.

In [7]:
# Import KMeans
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd


fish_data= pd.read_csv(r'C:\Users\Adam\Desktop\Personal\Python Syntax\Python Input Files\fish_data.txt', header=None)

print(fish_data.head())


""" firstly we will look to visualise the data in 2d to see if we can pick out any natral clustering """

""" we can see from plotting the data that there are two natural clusters therefore we will use that as our premise"""

fig1, ax = plt.subplots(figsize=(12,8))

ax.scatter(fish_data[4],fish_data[5])
       0      1     2     3     4     5     6
0  Bream  242.0  23.2  25.4  30.0  38.4  13.4
1  Bream  290.0  24.0  26.3  31.2  40.0  13.8
2  Bream  340.0  23.9  26.5  31.1  39.8  15.1
3  Bream  363.0  26.3  29.0  33.5  38.0  13.3
4  Bream  430.0  26.5  29.0  34.0  36.6  15.1
Out[7]:
<matplotlib.collections.PathCollection at 0x2322eeec208>

2) Fitting a KMeans clustering algorithm

Utilising the exploratory analysis above, we can now use our initial finding of 4 clusters as a parameter for the KMeans algorithm, and fit the algorithm to the data to see what the predictions look like with 4 clusters.

Firstly we define the 4th and 5th columns as our features and the fish names as our labels.

We can then use the predicted outputs from the model to overlay on the original 2D graph from above, this time colouring in the data points with the model's predicted labels.

As you can see, the visualisation is almost exactly as the initial clusters described; four clusters seem to fit the data very well, so our initial hypothesis about the data may well be correct and 4 clusters may be a good way of describing the data.

The red diamonds on the plot represent the "centroids", the means of each cluster, which are used for allocating data points.

In [10]:
# Import KMeans
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd

fishdata_2d = fish_data[[4,5]]

"""Create a KMeans instance with 3 clusters: model"""
model = KMeans(n_clusters=4)

"""Fit model to points"""
model.fit(fishdata_2d)

"""Determine the cluster labels of new_points: labels"""
labels = model.predict(fishdata_2d)


""" to be able to understand the new clustering its better to visualise it using a scatter plot"""

# Assign the columns of new_points: xs and ys
xs = fishdata_2d[4]
ys = fishdata_2d[5]

""" we use the colour fucntion to split out the different "labeled" (clustered data) visually"""

# Make a scatter plot of xs and ys, using labels to define the colors

fig1, ax = plt.subplots(figsize=(12,8))
ax.scatter(xs, ys, c=labels, alpha=0.5)

""" the centroids are the "means" of each cluster used to assign new data to clusters """

# Assign the cluster centers: centroids
centroids = model.cluster_centers_

# Assign the columns of centroids: centroids_x, centroids_y
centroids_x = centroids[:,0]
centroids_y = centroids[:,1]

""" the yellow and blue dots represent the new data and the blue diamonds show where the centroids sit """

# Make a scatter plot of centroids_x and centroids_y
ax.scatter(centroids_x, centroids_y, marker='D', s=50, color='r')
Out[10]:
<matplotlib.collections.PathCollection at 0x2322e62ac88>

3. Measuring the quality of our clustering - Inertia

From the plot above it looks like 4 clusters may be a good way for describing the data but we need to dig into this further.

Inertia measures clustering quality: it measures how spread out the clusters are (the less spread out, the better the clustering). It is computed by summing, over every data point, the squared distance of that point from its assigned centroid (a short sketch of this computation follows the list below).

  • After fitting a cluster model you can call the inertia_ attribute to check its quality

  • k-means attempts to minimise the inertia when choosing the clustering

  • Obviously if you have as many clusters as you have data points then your inertia will be at its lowest, but that isn't helpful! A good clustering is a trade-off between the number of clusters and the level of inertia.

  • It is best to choose the "elbow" point on an inertia vs number of clusters plot, where the inertia starts to drop more slowly, as the optimum!

  • To check whether 4 clusters happen to describe the data best, we are going to loop through different numbers of clusters to find the optimum
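
To make the inertia definition above concrete, here is a minimal sketch (reusing the model, fishdata_2d and labels objects from the cell in section 2) that recomputes the inertia by hand and compares it with the fitted model's inertia_ attribute:

import numpy as np

# each point's assigned centroid, looked up via its predicted cluster label
assigned_centroids = model.cluster_centers_[labels]

# inertia = sum of squared distances of every point from its assigned centroid
manual_inertia = ((fishdata_2d.to_numpy() - assigned_centroids) ** 2).sum()

print(manual_inertia)
print(model.inertia_)  # should match the manual value (up to floating point error)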

In [17]:
""" the last 6 columns are the features and the 1st is the labels so we will drop the 1st for now """

features = fish_data.drop(0,axis=1)

""" we are going to loop through fitting different number of clusters and then we are going to plot the kkluster vs inertia graph to see what is the optimum number of klusters
for this data set"""

ks = range(1, 6)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)
    
    # Fit model to samples
    model.fit(features)
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
    
# Plot ks vs inertias
fig3, ax = plt.subplots(figsize=(12,8))
ax.plot(ks, inertias, '-o')
ax.set_xlabel('number of clusters, k')
ax.set_ylabel('inertia')
ax.set_xticks(ks)

print('\n' + 'As we can see from this plot the inertia level starts to drop more slowly after 4 clusters, therefore 4 clusters can be presumed the optimum')
As we can see from this plot the inertia level starts to drop more slowly after 4 clusters, therefore 4 clusters can be presumed the optimum

4. Evaluating the clustering using cross tab visualisation

Inspecting the fish data I know there are in fact 4 species of fish (Bream, Roach, Pike, Smelt).

We can now analyse the performance of the clustering using a cross-tab visualisation. We can take the 4-cluster model's labels and tabulate them against the actual fish labels to see what they look like.

What we can see from the results is that not all the predicted labels fall in the correct category, otherwise the counts would read 34, 20, 17, 14 down the diagonal. The smelt have been clustered best, as all 14 of their data points have been placed in the same cluster!

In [18]:
# count of each fish species in the data 
actual_labels =fish_data[0]
print(actual_labels.value_counts())
Bream    34
Roach    20
Pike     17
Smelt    14
Name: 0, dtype: int64
In [19]:
# Create a KMeans model with 4 clusters: model
model = KMeans(n_clusters=4)

# Use fit_predict to fit model and obtain cluster labels: labels
predicted_labels = model.fit_predict(features)

""" after fitting the 3 cluster model we created a dataframe of the predicted and actual labels together as we need to feed the cross tab function a df """

# Create a DataFrame with predicted and actual labels as columns: df
df = pd.DataFrame({'predicted labels': predicted_labels, 'actual labels': actual_labels})

""" then we apply the cross tab function """

# Create crosstab: ct
ct = pd.crosstab(df['predicted labels'], df['actual labels'])

# Display ct
print(ct)
actual labels     Bream  Pike  Roach  Smelt
predicted labels                           
0                     1     1     17     14
1                    16     2      0      0
2                    17    10      3      0
3                     0     4      0      0

5. Optimising our clustering

In k-means clustering the variance of a feature affects the influence of that feature disproportionately (e.g. features with large variance will affect the centroids more than features with lower variance), therefore to improve the clustering you need to standardise each feature so that the variance of each feature is 1.

We do this by using the StandardScaler transformation on each feature so that each feature has a mean of 0 and a variance of 1 (see the short check below). With this extra step we now need to build a pipeline to first standardise the data and then fit the model.
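
As a quick sanity check of what standardisation does (a minimal sketch, assuming the features DataFrame from section 3 is still in scope), the scaled features should each have a mean of roughly 0 and a standard deviation of roughly 1:

from sklearn.preprocessing import StandardScaler
import numpy as np

scaled = StandardScaler().fit_transform(features)

# each column of the scaled data should now have mean ~0 and standard deviation ~1
print(np.round(scaled.mean(axis=0), 6))
print(np.round(scaled.std(axis=0), 6))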

We can see from the results that optimising the clustering has improved the predicted labelling! Now only 3 data points are mislabelled.

In [20]:
# Perform the necessary imports
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans


""" define the scalar"""

scaler = StandardScaler()

""" define the number of clusters in the kmeans algo """

kmeans = KMeans(n_clusters=4)

""" set up the steps of the pipeline, step 1) carry out the scaling, step 2) carrying out the clustering algo """

pipeline = make_pipeline(scaler, kmeans)

""" now we can acutally carry out the fitting on the pipeline"""

pipeline.fit(features)

""" and then we can do the predicting of the labels """

# Calculate the cluster labels: predicted_labels
predicted_labels = pipeline.predict(features)


""" once we have the predicted labels we create a dataframe of them with the actual labels"""

# Create a DataFrame with labels and species as columns: df
df = pd.DataFrame({'labels': predicted_labels, 'species': actual_labels})

""" we then use the cross tab functions to count the combinations """
    
# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['species'])

""" from the cross tab results we can see that the clustering has worked really well! most labels fall under just one species type (actual labels)"""

# Display ct
print(ct)
species  Bream  Pike  Roach  Smelt
labels                            
0            0    17      0      0
1           33     0      1      0
2            1     0     19      1
3            0     0      0     13

6. Dimension Reduction to find the intrinsic dimension - decorrelation

What does dimension reduction do?

  • Allows for more efficient storage and computation
  • Removes less informative "noise" features, which can cause problems with prediction tasks e.g. classification and regression

In this workbook I will focus on Principal Component Analysis (PCA), a fundamental dimension reduction technique that operates in two steps:

  • step 1 - decorrelation - first it removes any correlation between the features
  • step 2 - dimension reduction - it reduces the features to the intrinsic dimension, the minimum number of features needed to describe the data while maintaining the same predictive capability

  • intrinsic dimension = number of PCA features with significant variance

PCA aligns the data with the coordinate axes and shifts the data samples so they have mean 0; no information is lost in the process.

Below we will first show that the PCA model decorrelates the data.

In [48]:
""" lets us first assess the correlation of the fish data set """

# Perform the necessary imports
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# Assign the 0th column of the features (fish weight, rescaled into tens of grams): weight
weight = features.iloc[:,0]/10

# Assign the 1st column of the features (fish length): length
length = features.iloc[:,1]

# Scatter plot weight vs length
fig4, ax = plt.subplots(figsize=(12,8))
ax.scatter(weight, length)
ax.axis('equal')

ax.set_title('Fish Weight vs Length')
ax.set_xlabel('Fish Weight (tens of grams)')
ax.set_ylabel('Fish Length (cm)')

plt.show()

# Calculate the Pearson correlation
correlation, pvalue = pearsonr(weight, length)

""" as can be seen from the plot there is  avery strong linear correlation between weight and length """

# Display the correlation
print('The Pearson correlation of fish weight versus fish length : {}'.format(correlation))
The Pearson correlation of fish weight versus fish length : 0.8974683554936372
In [50]:
# Import PCA
from sklearn.decomposition import PCA

"""" step 2 assign the model we want to use in this case the PCA function"""

# Create PCA instance: model
model = PCA()

"""" step 3 we apply the fit_transform method of the model to the weight and length data which we have concated into a "fish" dataset """

fish = pd.concat([weight,length],axis=1)

# Apply the fit_transform method of model to the fish data: pca_features
pca_features = model.fit_transform(fish)

""" step 4 the fit_transofrm method returns new x and y coordinates for the data (aka newly transformed weight and length data) which
can be plotted in a scatter graph"""

# Assign 0th column of pca_features: xs
xs = pca_features[:,0]

# Assign 1st column of pca_features: ys
ys = pca_features[:,1]

# Scatter plot the transformed widths vs lengths
fig5, ax = plt.subplots(figsize=(12,8))
ax.scatter(xs, ys)
ax.axis('equal')

ax.set_title('PCA-transformed fish weight and length')
ax.set_xlabel('PCA feature 0')
ax.set_ylabel('PCA feature 1')
plt.show()

""" what can be seen now is that the width and length transformed values now exhibit no correlation!! """

# Calculate the Pearson correlation of xs and ys
new_correlation, pvalue = pearsonr(xs, ys)

# Display the correlation
print('The New Pearson correlation of fish weight versus fish length : {}'.format(new_correlation))
The New Pearson correlation of fish weight versus fish length : 1.734723475976807e-18
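
To back up the claims that the transformed samples have mean 0 and that no information is lost, here is a minimal sketch (reusing the model, fish and pca_features objects from the cell above) that checks the transformed means and covariance and then recovers the original data via inverse_transform:

import numpy as np

# the PCA features should have mean ~0 and ~zero covariance between one another
print(np.round(pca_features.mean(axis=0), 6))
print(np.round(np.cov(pca_features, rowvar=False), 4))

# "no information is lost": inverting the transform recovers the original weight/length values
recovered = model.inverse_transform(pca_features)
print(np.allclose(recovered, fish.to_numpy()))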

7. Dimension Reduction to find the intrinsic dimension - full workflow example

In the below code I will now build a full workflow for identifying the intrinsic dimension.

What can be seen in the plot is that PCA features 0 and 1 have significant variance, and therefore the intrinsic dimension of this data set appears to be just 2.

Therefore the dataset could actually be explained utilising just 2 variables, and having more than 2 features doesn't add any further information in terms of being able to identify the fish species.

In [54]:
""" 
In this data each row represents an individual fish. The measurements, such as weight in grams, length in centimeters, and the percentage ratio of height to length, have very
different scales. In order to PCA transform this data effectively, you'll need to standardize these features first. In the below code, you'll build a pipeline to standardize and 
transform the data.

These fish measurement data were sourced from the Journal of Statistics Education.
"""

features = fish_data.drop(0, axis=1)

actual_labels = fish_data[0]

# Perform the necessary imports
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt

""" step 1 we will define our scaling method in this case standard scalar"""

# Create scaler: scaler
scaler = StandardScaler()

""" step two we will define our model - the PCA model """

# Create a PCA instance: pca
model = PCA()

""" step 3 we will set up the pipeline with the required steps: step 1 scale the data , step 2 apply the model """

# Create pipeline: pipeline
pipeline = make_pipeline(scaler, model)

""" step 4 we will fit the pipeline to the features """

# Fit the pipeline to 'samples'
pipeline.fit(features)


""" now that we have fitted a PCA model we can use attributes on the fitted model """

# Plot the explained variances of each PCA component
pca_components = range(model.n_components_)

fig6, ax = plt.subplots(figsize=(12,8))

ax.bar(pca_components, model.explained_variance_)
ax.set_title('PCA features')
ax.set_xlabel('PCA feature')
ax.set_ylabel('variance')
ax.set_xticks(pca_components)

plt.show()
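
To put a number on the "elbow" in the bar chart, a short sketch (reusing the PCA model fitted inside the pipeline above) can print the proportion of variance explained by each component and the running total; if the first two components account for most of the variance, that supports an intrinsic dimension of 2:

import numpy as np

# proportion of total variance captured by each PCA component, and the cumulative sum
print(np.round(model.explained_variance_ratio_, 3))
print(np.round(np.cumsum(model.explained_variance_ratio_), 3))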

8. Dimension Reduction - the resulting intrinsic dimension data set

  • From the above we now know that the intrinsic dimension of the fish data set is 2

  • We are therefore going to create a PCA instance with just 2 components. The PCA output could now be utilised in further parts of a machine learning pipeline (a sketch follows the cell below), and the computing power necessary would be drastically reduced!

In [59]:
model = PCA(n_components=2)

features = fish_data.drop(0, axis=1)

""" again we create a pipeline and fit to it """

# Create pipeline: pipeline
pipeline = make_pipeline(scaler, model)

# Fit the pipeline to 'samples'
pipeline.fit(features)


# Transform the scaled samples: pca_features
pca_features = pipeline.transform(features)

""" instead of 6 features we are only have two features to work with !!! """

# Print the shape of pca_features
print(pca_features.shape)
print(pca_features[:5])
(85, 2)
[[-0.57640502 -0.94649159]
 [-0.36852393 -1.17103598]
 [-0.28028168 -1.59709224]
 [-0.00955427 -0.81967711]
 [ 0.1238945  -1.33121167]]
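
As an illustration of feeding the reduced data into a later pipeline step (a minimal sketch reusing pca_features and actual_labels from above; the kmeans_2d name and the 4-cluster choice simply mirror the earlier analysis), we could cluster the 2 PCA features and cross-tabulate the clusters against the known species:

from sklearn.cluster import KMeans
import pandas as pd

# cluster the 2-component PCA features, then compare the clusters with the known species
kmeans_2d = KMeans(n_clusters=4)
pca_cluster_labels = kmeans_2d.fit_predict(pca_features)

df = pd.DataFrame({'labels': pca_cluster_labels, 'species': actual_labels})
print(pd.crosstab(df['labels'], df['species']))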