
Pipeline

Cosine Pipeline

This file provides the transformation pipeline from raw tracks to a set of track features ready for use by the Cosine Similarity class.

CosinePipeline inherits from the Pipeline interface.

Pipeline Documentation

CosinePipeline

Bases: Pipeline

This class serves as the transformation pipeline from raw data to usable features, for the similarity calculation carried out by the Cosine Similarity class.
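
As a rough illustration of why the features are brought onto a common scale before the similarity step, the sketch below computes cosine similarity for two made-up tracks, once with tempo in raw BPM and once with tempo rescaled. The `cosine_similarity` helper and all values are hypothetical; the Cosine Similarity class is not shown on this page and may be implemented differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Two made-up tracks described by (danceability, energy, tempo).
# With tempo left in BPM, the large-valued column dominates the score.
raw_a = [0.9, 0.1, 120.0]
raw_b = [0.1, 0.9, 121.0]
raw_score = cosine_similarity(raw_a, raw_b)

# The same tracks with tempo rescaled to [0, 1] (values assumed for illustration).
scaled_a = [0.9, 0.1, 0.50]
scaled_b = [0.1, 0.9, 0.51]
scaled_score = cosine_similarity(scaled_a, scaled_b)

print(round(raw_score, 4), round(scaled_score, 4))
```

In the unscaled case the two tracks look almost identical even though their danceability and energy are opposite, which is the kind of distortion the scaling step is meant to avoid.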

Source code in src/pipeline.py
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# `Pipeline` is the project's pipeline interface; its import is not shown in this listing.


class CosinePipeline(Pipeline):
    """This class serves as the transformation pipeline from raw data to usable features for the
    similarity calculation carried out by the Cosine Similarity class."""

    @staticmethod
    def select_columns(df):
        """This method selects the columns to be included in the feature calculations.

        It allows peripheral or unneeded columns to be ignored or removed.

        Args:
            df (DataFrame): The dataframe containing the raw data from the `data/tracks.csv` file.

        Returns:
            (DataFrame): The modified dataframe containing only the selected columns.
        """
        df = df[['uris', 'artist_pop',
                 'artist_genres', 'track_pop', 'danceability', 'energy',
                 'keys', 'loudness', 'modes', 'speechiness', 'acousticness',
                 'instrumentalness', 'liveness', 'valences', 'tempos', 'durations_ms', 'time_signatures']]
        return df

    @staticmethod
    def ohe_prep(df, column):
        """This method performs one-hot encoding (OHE) on a specified column.

        Args:
            df (DataFrame): The dataframe containing the track information, now being processed by the pipeline.
            column (str): The name of the column on which the OHE transformation should be performed.

        Returns:
            (DataFrame): The dataframe with `column` replaced by its one-hot encoded columns.
        """
        df_encoded = pd.get_dummies(df, columns=[column], dtype=int)
        return df_encoded

    @staticmethod
    def tfidf_transformation(df_parm):
        """This method performs the term frequency–inverse document frequency (TF-IDF) transformation on the `artist_genres` column.

        Note, more information on the TF-IDF transformation can be found here: https://www.geeksforgeeks.org/understanding-tf-idf-term-frequency-inverse-document-frequency/

        Args:
            df_parm (DataFrame): The dataframe containing the track information, now undergoing the TF-IDF transformation.

        Returns:
            (DataFrame): The transformed dataframe containing the results of the TF-IDF transformation.
        """
        tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), min_df=0.0, max_features=50)
        tfidf_matrix = tf.fit_transform(df_parm['artist_genres'])

        # One column per retained genre term, prefixed with "genre|"
        genre_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tf.get_feature_names_out())
        genre_df.columns = ['genre' + "|" + i for i in genre_df.columns]

        df_parm = df_parm.drop(columns=['artist_genres'])

        # Reset both indexes so the rows line up positionally in the concatenation
        combined_df = pd.concat([df_parm.reset_index(drop=True), genre_df.reset_index(drop=True)], axis=1)
        return combined_df

    @staticmethod
    def data_pipeline(df):
        """This method runs the transformation pipeline to produce a set of track features.

        The pipeline applies the following transformations:
        - One-hot encoding
        - TF-IDF transformation
        - Min-max scaling of numerical values

        The one-hot columns, TF-IDF weights, and min-max-scaled columns all lie in [0, 1].
        Note that `artist_pop` and `track_pop` are not in the scaled column list below,
        so they keep their input values.

        Args:
            df (DataFrame): The dataframe containing the raw data from the `data/tracks.csv` file.

        Returns:
            (DataFrame): A dataframe containing all track features, indexed by `uris`.
        """

        df_pipe = CosinePipeline.select_columns(df)

        # Perform OHE
        df_pipe = CosinePipeline.ohe_prep(df_pipe, 'modes')
        df_pipe = CosinePipeline.ohe_prep(df_pipe, 'keys')
        df_pipe = CosinePipeline.ohe_prep(df_pipe, 'time_signatures')

        # Min-max scale the numerical audio features to [0, 1]
        scaler = MinMaxScaler(feature_range=(0, 1))
        columns = ['danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valences', 'durations_ms', 'tempos']
        df_pipe[columns] = scaler.fit_transform(df_pipe[columns])

        # Perform TF-IDF vectorization on genres
        df_pipe = CosinePipeline.tfidf_transformation(df_parm=df_pipe)

        df_pipe = df_pipe.set_index(keys='uris', drop=True)

        return df_pipe

    @staticmethod
    def unique_tracks(df, df_target):
        """This method ensures that df (the tracks dataframe) does not contain any of the tracks in the playlist (df_target).
        Note, `drop` removes rows by index label, so the track uris should be the index of df.

        Args:
            df (DataFrame): The dataframe containing the tracks dataset.
            df_target (DataFrame): The dataframe containing the tracks from the playlist.

        Returns:
            (DataFrame): The tracks dataframe with every playlist track removed.
        """
        df = df.drop(df_target['uris'], errors='ignore')
        return df

    @staticmethod
    def extract_target(df, df_target):
        """This method extracts the playlist tracks from the tracks dataset.
        Note, for this method to work, the track uris should be the index of the tracks dataframe (df).

        This method is mostly used once the tracks dataset has been transformed into a set of features
        and the features of the playlist tracks need to be extracted.

        Args:
            df (DataFrame): The dataframe containing the tracks dataset.
            df_target (DataFrame): The dataframe containing the tracks from the playlist.

        Returns:
            (DataFrame): A dataframe containing only the playlist tracks that were found in the tracks dataset.
        """
        target_uris = df_target['uris'].tolist()
        return df[df.index.isin(target_uris)]

data_pipeline(df) staticmethod

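This method runs the transformation pipeline to produce a set of track features.

The pipeline applies the following transformations:
- One-hot encoding of the modes, keys, and time_signatures columns
- TF-IDF transformation of the artist_genres column
- Min-max scaling of the numerical audio features

The one-hot columns, TF-IDF weights, and min-max-scaled columns all lie in [0, 1]. In the source shown for this class, artist_pop and track_pop are not in the list of scaled columns, so they keep their input values. The returned dataframe is indexed by uris.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | The dataframe containing the raw data from the data/tracks.csv file. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | A dataframe containing all track features. |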
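
The min-max step can be sketched in plain pandas. The block below applies the per-column formula `(x - min) / (max - min)`, which is the transformation min-max scaling to the range (0, 1) performs; the dataframe and its values are made up for illustration and are not taken from `data/tracks.csv`.

```python
import pandas as pd

# Made-up values for two of the columns the pipeline scales.
features = pd.DataFrame(
    {"loudness": [-60.0, -30.0, 0.0], "tempos": [60.0, 120.0, 180.0]}
)

# Per-column min-max scaling: (x - min) / (max - min).
col_min = features.min()
col_max = features.max()
scaled = (features - col_min) / (col_max - col_min)

print(scaled)
```

Because the minimum and maximum are taken from whatever dataframe is passed in, the scaled value of a track depends on the other tracks in the same dataframe. A constant column would produce NaN in this sketch, since its range is zero.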


extract_target(df, df_target) staticmethod

This method extracts the playlist tracks from the tracks dataset. Note, for this method to work, the track uris must be the index of the tracks dataframe (df).

This method is mostly used once the tracks dataset has been transformed into a set of features and the features of the playlist tracks need to be extracted.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | The dataframe containing the tracks dataset. | required |
| df_target | DataFrame | The dataframe containing the tracks from the playlist. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | A dataframe containing only the playlist tracks that were found in the tracks dataset. |
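
A minimal sketch of the index-based lookup described above, using made-up URIs and a single feature column. It assumes the feature dataframe is already indexed by track URI while the playlist dataframe still carries `uris` as an ordinary column.

```python
import pandas as pd

# Feature dataframe indexed by track URI (made-up URIs and values).
features = pd.DataFrame(
    {"energy": [0.2, 0.8, 0.5]},
    index=pd.Index(["uri:a", "uri:b", "uri:c"], name="uris"),
)

# Playlist dataframe; one of its tracks is absent from the dataset.
playlist = pd.DataFrame({"uris": ["uri:c", "uri:a", "uri:missing"]})

target_uris = playlist["uris"].tolist()
playlist_features = features[features.index.isin(target_uris)]

print(playlist_features)
```

Two details follow from using `isin` as a boolean mask: playlist tracks that are missing from the dataset are skipped silently, and the rows come back in dataset order rather than playlist order.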


ohe_prep(df, column) staticmethod

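This method performs one-hot encoding (OHE) on a specified column.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | The dataframe containing the track information, now being processed by the pipeline. | required |
| column | str | The name of the column on which the OHE transformation should be performed. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | The dataframe with the specified column replaced by its one-hot encoded columns. |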
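
To show what the encoding produces, the sketch below runs the same `pd.get_dummies` call on a small made-up dataframe. Only the `modes` column is encoded; the URIs and values are invented for illustration.

```python
import pandas as pd

# Made-up tracks with a categorical `modes` column.
tracks = pd.DataFrame(
    {"uris": ["uri:a", "uri:b", "uri:c"], "modes": [1, 0, 1]}
)

# The encoded column is replaced by one 0/1 column per distinct value,
# named "<column>_<value>"; the other columns are left unchanged.
encoded = pd.get_dummies(tracks, columns=["modes"], dtype=int)

print(encoded)
```

The generated columns depend on the values present in the input, so a category that never occurs in a given dataframe yields no column. Dataframes encoded separately can therefore end up with different column sets.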

select_columns(df) staticmethod

This method selects the columns to be included in the feature calculations.

It allows peripheral or unneeded columns to be ignored or removed.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | The dataframe containing the raw data from the data/tracks.csv file. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | The modified dataframe containing only the selected columns. |
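
The selection itself is plain pandas column indexing. The sketch below uses a shortened, made-up column list and adds two things the source does not do, both labeled as assumptions: an explicit check for missing columns, and a `.copy()` so later assignments act on an independent dataframe.

```python
import pandas as pd

# Made-up raw data; `names` stands in for a peripheral column.
raw = pd.DataFrame(
    {
        "uris": ["uri:a", "uri:b"],
        "names": ["Song A", "Song B"],
        "energy": [0.4, 0.9],
        "tempos": [98.0, 140.0],
    }
)

wanted = ["uris", "energy", "tempos"]  # shortened list for illustration

# Assumption: fail early with a readable message if a column is absent.
missing = [col for col in wanted if col not in raw.columns]
if missing:
    raise KeyError(f"missing columns: {missing}")

# Assumption: copy so the result is independent of `raw`.
selected = raw[wanted].copy()

print(selected.columns.tolist())
```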


tfidf_transformation(df_parm) staticmethod

This method performs the term frequency–inverse document frequency (TF-IDF) transformation on the artist_genres column.

Note, more information on the TF-IDF transformation can be found here: https://www.geeksforgeeks.org/understanding-tf-idf-term-frequency-inverse-document-frequency/

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df_parm | DataFrame | The dataframe containing the track information, now undergoing the TF-IDF transformation. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | The transformed dataframe containing the results of the TF-IDF transformation. |


unique_tracks(df, df_target) staticmethod

This method ensures that df (the tracks dataframe) does not contain any of the tracks in the playlist (df_target). Rows are dropped by index label, so the track uris must be the index of df.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | The dataframe containing the tracks dataset. | required |
| df_target | DataFrame | The dataframe containing the tracks from the playlist. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | The tracks dataframe with every playlist track removed. |
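
A small sketch of the drop, using made-up URIs. It also shows the failure mode when the dataframe is not indexed by URI: with `errors='ignore'`, labels that are not found are skipped, so nothing is removed and no error is raised.

```python
import pandas as pd

# Feature dataframe indexed by track URI (made-up URIs and values).
features = pd.DataFrame(
    {"energy": [0.2, 0.8, 0.5]},
    index=pd.Index(["uri:a", "uri:b", "uri:c"], name="uris"),
)
playlist = pd.DataFrame({"uris": ["uri:b", "uri:missing"]})

# Rows whose index label appears in the playlist are removed;
# labels absent from the index are ignored.
candidates = features.drop(playlist["uris"], errors="ignore")

# Same call on a dataframe with a default integer index: no label matches.
unindexed = features.reset_index()
unchanged = unindexed.drop(playlist["uris"], errors="ignore")

print(candidates.index.tolist(), len(unchanged))
```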
