
Similarity

Cosine Similarity

This file uses the track features generated by the Cosine Pipeline to calculate the similarity between a playlist feature vector and a dataset of tracks. This makes it possible to select the tracks most similar to the playlist.
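For reference, the standard cosine similarity between the playlist vector and a single track can be written as below. The symbols are notation introduced here for illustration, not names from the source: $\mathbf{p}$ is the playlist vector (the per-feature mean of the $m$ playlist track feature vectors $\mathbf{x}_i$) and $\mathbf{t}$ is the feature vector of one dataset track.

```latex
\[
\operatorname{sim}(\mathbf{p}, \mathbf{t})
  = \frac{\mathbf{p} \cdot \mathbf{t}}{\lVert \mathbf{p} \rVert \, \lVert \mathbf{t} \rVert},
\qquad
\mathbf{p} = \frac{1}{m} \sum_{i=1}^{m} \mathbf{x}_i
\]
```

Assuming all features are non-negative, the score lies in $[0, 1]$, and vectors pointing in the same direction score 1.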

Cosine Similarity inherits from the Similarity Interface.

Cosine Similarity Documentation

TracksCosineSimilarity

Bases: Similarity

The class implements a Cosine similarity between a playlist vector and the tracks dataset to determine similar tracks to the playlist.

This class inherits the Similarity interface.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| additional_weighting | int | The weighting factor applied to weighted columns. |
| playlist | DataFrame | The playlist tracks dataframe (before pipeline transformation). |
| tracks | DataFrame | The tracks dataset dataframe (before pipeline transformation). |
| playlist_features | DataFrame | The playlist tracks features (after transformation pipeline). |
| track_features | DataFrame | The tracks dataset features (after transformation pipeline). |
| similarity | Series | The similarity of each dataset track to the playlist vector, indexed by uris. It is None until calculate_similarity() is called. |

Source code in src/similarity.py
class TracksCosineSimilarity(Similarity):
    """The class implements a Cosine similarity between a playlist vector and the tracks dataset to determine
        similar tracks to the playlist.

        This class inherits the Similarity interface.

        Attributes:
            additional_weighting (int): The weighting factor applied to weighted columns.
            playlist (DataFrame): The playlist tracks dataframe (before pipeline transformation)
            tracks (DataFrame): The tracks dataset dataframe (before pipeline transformation)
            playlist_features (DataFrame): The playlist tracks features (after transformation pipeline)
            track_features (DataFrame): The tracks dataset features (after transformation pipeline)
            similarity (Series): The similarity of each dataset track to the playlist vector (The index is uris)
        """
    def __init__(self, playlist: pd.DataFrame, tracks: pd.DataFrame, weighted_features: list):
        """The initialization of the Cosine Similarity class

        Args:
            playlist (DataFrame): The playlist tracks dataframe (before pipeline transformation)
            tracks (DataFrame): The tracks dataset dataframe (before pipeline transformation)
            weighted_features (list): A list of features to be weighted in order to prioritize feature importance in similarity calculation.

        """
        self.additional_weighting = 2  # Feature weighting value
        features = CosinePipeline.data_pipeline(tracks)  # Pass data through pipeline to extract features

        self.playlist = playlist
        self.tracks = tracks

        self.playlist_features, self.track_features = self.separate_playlist_from_tracks(features)
        self.weight_features(weighted_features)
        self.similarity = None

    def calculate_similarity(self):
        """Method calculates the similarity between a playlist vector and the tracks feature matrix using cosine similarity

        This calculation populates the `self.similarity` field.

        The playlist feature dataframe is reduced to the mean of each feature, creating a playlist vector.
        """
        playlist_vector = self.vectorize_playlist()
        track_matrix = self.track_features.to_numpy()

        similarity_score = similarity_measures.cosine_similarity(track_matrix, playlist_vector)
        uris = self.track_features.index.tolist()

        self.similarity = pd.Series(similarity_score.T.tolist()[0], index=uris, name='sim_score')

    def access_similarity_scores(self):
        """Getter method to access the `similarity` class field.

        Returns:
            (Series): The track similarity to the playlist vector. Similarity is a Series using uris as the index.
        """
        return self.similarity

    def get_top_n(self, n: int):
        """Return the top-n most similar tracks as a DataFrame with essential features included.

        Note: with cosine similarity, a value near 1 indicates high similarity, while a value near 0 indicates low similarity.

        Args:
            n (int): The top-n most similar tracks to the playlist vector

        Returns:
            (DataFrame): A dataframe containing the top-n tracks.
        """
        sim_df = pd.DataFrame(self.similarity, columns=['sim_score'])
        sorted_sim = sim_df.sort_values(by='sim_score', ascending=False)
        sorted_top = sorted_sim.head(n)
        return sorted_top.merge(self.tracks, left_index=True, right_on='uris')

    def separate_playlist_from_tracks(self, features: pd.DataFrame):
        """Method separates the feature dataframe (from pipeline) into tracks and playlist feature dataframes

        Args:
            features (DataFrame): The track dataset features dataframe (This contains the playlist tracks too)

        Returns:
            playlist_features (DataFrame): The playlist track features dataframe
            tracks_features (DataFrame): The track dataset features dataframe
        """
        playlist_uris = self.playlist['uris'].tolist()
        return features[features.index.isin(playlist_uris)], features[~features.index.isin(playlist_uris)]

    def vectorize_playlist(self):
        """Method vectorizes the playlist track features by determining the mean value of each track feature

        Returns:
            (ndarray): The playlist feature vector, with shape (1, n_features)
        """
        playlist_vector = self.playlist_features.mean(axis=0)
        return playlist_vector.to_numpy().reshape(1, -1)

    def weight_features(self, weighted_columns: list):
        """Method weights the track dataset features (all features are normalized [0, 1]) by a scalar value
        to increase the effect of that feature in the similarity calculation.

        Weighting in cosine similarity increases the impact of the feature in the similarity calculation

        Note, this method directly alters the `self.track_features` field.

        Args:
            weighted_columns: The columns to be weighted by the additional weighting factor.

        """
        all_features = pd.Series(self.track_features.columns)
        feature_filter = all_features.isin(weighted_columns).tolist()   # Boolean filter based on if feature is in weighted columns
        binary_filter = [int(feature) for feature in feature_filter]  # Modify boolean filter into a binary filter
        filler = [1] * len(binary_filter)  # Create a filler of 1's so that unweighted features are not affected
        weights = [(self.additional_weighting * weight) + fill for weight, fill in zip(binary_filter, filler)]  # Calculate feature scalar weights

        self.track_features = self.track_features.mul(weights, axis=1)  # Weight features

__init__(playlist, tracks, weighted_features)

The initialization of the Cosine Similarity class

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| playlist | DataFrame | The playlist tracks dataframe (before pipeline transformation). | required |
| tracks | DataFrame | The tracks dataset dataframe (before pipeline transformation). | required |
| weighted_features | list | A list of features to be weighted in order to prioritize feature importance in similarity calculation. | required |

Source code in src/similarity.py
def __init__(self, playlist: pd.DataFrame, tracks: pd.DataFrame, weighted_features: list):
    """The initialization of the Cosine Similarity class

    Args:
        playlist (DataFrame): The playlist tracks dataframe (before pipeline transformation)
        tracks (DataFrame): The tracks dataset dataframe (before pipeline transformation)
        weighted_features (list): A list of features to be weighted in order to prioritize feature importance in similarity calculation.

    """
    self.additional_weighting = 2  # Feature weighting value
    features = CosinePipeline.data_pipeline(tracks)  # Pass data through pipeline to extract features

    self.playlist = playlist
    self.tracks = tracks

    self.playlist_features, self.track_features = self.separate_playlist_from_tracks(features)
    self.weight_features(weighted_features)
    self.similarity = None

access_similarity_scores()

Getter method to access the similarity class field.

Returns:

| Type | Description |
| --- | --- |
| Series | The track similarity to the playlist vector. Similarity is a Series using uris as the index. |

Source code in src/similarity.py
def access_similarity_scores(self):
    """Getter method to access the `similarity` class field.

    Returns:
        (Series): The track similarity to the playlist vector. Similarity is a Series using uris as the index.
    """
    return self.similarity

calculate_similarity()

Method calculates the similarity between a playlist vector and the tracks feature matrix using cosine similarity

This calculation populates the self.similarity field.

The playlist feature dataframe is reduced to the mean of each feature, creating a playlist vector.
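The calculation described above can be sketched with plain NumPy. This is an illustration only: the source calls `similarity_measures.cosine_similarity`, whose import is not shown on this page, and the feature names, uris, and values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical, already-transformed dataset features indexed by track uri.
track_features = pd.DataFrame(
    {"energy": [1.0, 0.0, 1.0], "valence": [0.0, 1.0, 1.0]},
    index=["uri:a", "uri:b", "uri:c"],
)
# Hypothetical playlist vector with shape (1, n_features).
playlist_vector = np.array([[1.0, 0.0]])

track_matrix = track_features.to_numpy()

# Cosine similarity of every track row against the single playlist vector.
dot = track_matrix @ playlist_vector.T  # shape (n_tracks, 1)
norms = np.linalg.norm(track_matrix, axis=1, keepdims=True) * np.linalg.norm(playlist_vector)
similarity_score = dot / norms  # note: no guard against all-zero rows in this sketch

# One score per track, keyed by uri, mirroring the `similarity` field.
similarity = pd.Series(similarity_score.ravel(), index=track_features.index, name="sim_score")
```

The scores keep the row order of the feature matrix; no sorting happens at this stage.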

Source code in src/similarity.py
def calculate_similarity(self):
    """Method calculates the similarity between a playlist vector and the tracks feature matrix using cosine similarity

    This calculation populates the `self.similarity` field.

    The playlist feature dataframe is reduced to the mean of each feature, creating a playlist vector.
    """
    playlist_vector = self.vectorize_playlist()
    track_matrix = self.track_features.to_numpy()

    similarity_score = similarity_measures.cosine_similarity(track_matrix, playlist_vector)
    uris = self.track_features.index.tolist()

    self.similarity = pd.Series(similarity_score.T.tolist()[0], index=uris, name='sim_score')

get_top_n(n)

Returns the top-n most similar tracks as a DataFrame with essential features included.

Note: with cosine similarity, a value near 1 indicates high similarity, while a value near 0 indicates low similarity.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| n | int | The top-n most similar tracks to the playlist vector | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | A dataframe containing the top-n tracks. |
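As a rough sketch of the ranking step, the snippet below sorts a score Series, keeps the first n rows, and joins them to track metadata on the `uris` column. The standalone helper `top_n`, the `name` column, and all values are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical similarity scores indexed by uri.
similarity = pd.Series([0.2, 0.9, 0.5], index=["uri:a", "uri:b", "uri:c"], name="sim_score")
# Hypothetical tracks dataframe with a 'uris' column plus metadata.
tracks = pd.DataFrame({
    "uris": ["uri:a", "uri:b", "uri:c"],
    "name": ["Track A", "Track B", "Track C"],
})


def top_n(similarity: pd.Series, tracks: pd.DataFrame, n: int) -> pd.DataFrame:
    """Return the n highest-scoring tracks joined with their metadata."""
    sorted_top = similarity.to_frame().sort_values(by="sim_score", ascending=False).head(n)
    # Join the score index (uris) against the 'uris' column of the tracks dataframe.
    return sorted_top.merge(tracks, left_index=True, right_on="uris")


top_two = top_n(similarity, tracks, 2)
```

Sorting in descending order matters here because higher cosine scores mean closer matches.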

Source code in src/similarity.py
def get_top_n(self, n: int):
    """Return the top-n most similar tracks as a DataFrame with essential features included.

    Note: with cosine similarity, a value near 1 indicates high similarity, while a value near 0 indicates low similarity.

    Args:
        n (int): The top-n most similar tracks to the playlist vector

    Returns:
        (DataFrame): A dataframe containing the top-n tracks.
    """
    sim_df = pd.DataFrame(self.similarity, columns=['sim_score'])
    sorted_sim = sim_df.sort_values(by='sim_score', ascending=False)
    sorted_top = sorted_sim.head(n)
    return sorted_top.merge(self.tracks, left_index=True, right_on='uris')

separate_playlist_from_tracks(features)

Method separates the feature dataframe (from pipeline) into tracks and playlist feature dataframes

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| features | DataFrame | The track dataset features dataframe (This contains the playlist tracks too) | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| playlist_features | DataFrame | The playlist track features dataframe |
| tracks_features | DataFrame | The track dataset features dataframe |
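A minimal sketch of the split, assuming the pipeline output is indexed by uri and the playlist dataframe carries a `uris` column. The data below is hypothetical.

```python
import pandas as pd

# Hypothetical pipeline output: one row of features per uri, playlist tracks included.
features = pd.DataFrame(
    {"energy": [0.1, 0.8, 0.4, 0.6]},
    index=["uri:a", "uri:b", "uri:c", "uri:d"],
)
# Hypothetical playlist dataframe listing the playlist's uris.
playlist = pd.DataFrame({"uris": ["uri:b", "uri:d"]})

# Boolean mask: True for rows whose uri belongs to the playlist.
in_playlist = features.index.isin(playlist["uris"].tolist())

playlist_features = features[in_playlist]   # rows that are in the playlist
track_features = features[~in_playlist]     # every remaining dataset row
```

Computing the mask once and negating it keeps the two halves disjoint and guarantees that together they cover every row.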

Source code in src/similarity.py
def separate_playlist_from_tracks(self, features: pd.DataFrame):
    """Method separates the feature dataframe (from pipeline) into tracks and playlist feature dataframes

    Args:
        features (DataFrame): The track dataset features dataframe (This contains the playlist tracks too)

    Returns:
        playlist_features (DataFrame): The playlist track features dataframe
        tracks_features (DataFrame): The track dataset features dataframe
    """
    playlist_uris = self.playlist['uris'].tolist()
    return features[features.index.isin(playlist_uris)], features[~features.index.isin(playlist_uris)]

vectorize_playlist()

Method vectorizes the playlist track features by determining the mean value of each track feature

Returns:

| Type | Description |
| --- | --- |
| ndarray | The playlist feature vector, with shape (1, n_features) |
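To make the shape of the result concrete, here is a small sketch with hypothetical feature values: the column-wise mean collapses the playlist rows into one vector, and the reshape turns it into a single-row matrix.

```python
import pandas as pd

# Hypothetical playlist features: two tracks, two features.
playlist_features = pd.DataFrame(
    {"energy": [0.2, 0.6], "valence": [1.0, 0.0]},
    index=["uri:b", "uri:d"],
)

# Mean over rows (axis=0) gives one value per feature; reshape to (1, n_features).
playlist_vector = playlist_features.mean(axis=0).to_numpy().reshape(1, -1)
```

The 2-D shape is what lets the vector be compared against a (n_tracks, n_features) matrix in a pairwise similarity call.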

Source code in src/similarity.py
def vectorize_playlist(self):
    """Method vectorizes the playlist track features by determining the mean value of each track feature

    Returns:
        (ndarray): The playlist feature vector, with shape (1, n_features)
    """
    playlist_vector = self.playlist_features.mean(axis=0)
    return playlist_vector.to_numpy().reshape(1, -1)

weight_features(weighted_columns)

Method weights the track dataset features (all features are normalized to [0, 1]) by a scalar value to increase the effect of the chosen features in the similarity calculation.

Each column listed in weighted_columns is multiplied by 1 + additional_weighting (3 with the value of 2 set in __init__); all other columns are left unchanged.

Note, this method directly alters the self.track_features field.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| weighted_columns | list | The columns to be weighted by the additional weighting factor. | required |

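The weighting can be sketched in a more compact form than the source uses. The column names and values are hypothetical; the multiplier follows the source's formula, where a weighted column gets `additional_weighting * 1 + 1` and an unweighted column gets `additional_weighting * 0 + 1`.

```python
import pandas as pd

additional_weighting = 2  # same value the class sets in __init__

# Hypothetical normalized dataset features.
track_features = pd.DataFrame(
    {"energy": [0.5, 1.0], "tempo": [0.2, 0.4], "valence": [1.0, 0.0]},
    index=["uri:a", "uri:c"],
)
weighted_columns = ["tempo"]

# 1 for unweighted columns, 1 + additional_weighting for weighted ones.
weights = [1 + additional_weighting * int(col in weighted_columns) for col in track_features.columns]

# Multiply each column by its weight (axis=1 aligns the list with the columns).
weighted = track_features.mul(weights, axis=1)
```

Because cosine similarity depends on direction, stretching one feature axis changes how much that feature contributes to the angle between two vectors.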
Source code in src/similarity.py
def weight_features(self, weighted_columns: list):
    """Method weights the track dataset features (all features are normalized [0, 1]) by a scalar value
    to increase the effect of that feature in the similarity calculation.

    Weighting in cosine similarity increases the impact of the feature in the similarity calculation

    Note, this method directly alters the `self.track_features` field.

    Args:
        weighted_columns: The columns to be weighted by the additional weighting factor.

    """
    all_features = pd.Series(self.track_features.columns)
    feature_filter = all_features.isin(weighted_columns).tolist()   # Boolean filter based on if feature is in weighted columns
    binary_filter = [int(feature) for feature in feature_filter]  # Modify boolean filter into a binary filter
    filler = [1] * len(binary_filter)  # Create a filler of 1's so that unweighted features are not affected
    weights = [(self.additional_weighting * weight) + fill for weight, fill in zip(binary_filter, filler)]  # Calculate feature scalar weights

    self.track_features = self.track_features.mul(weights, axis=1)  # Weight features