Pipeline
Cosine Pipeline
This file provides the transformation pipeline from raw tracks to a set of track features ready for use by the Cosine Similarity class.
CosinePipeline inherits from the Pipeline interface.
Pipeline Documentation
CosinePipeline
Bases: Pipeline
This class serves as the transformation pipeline from raw data to usable features for the similarity calculation carried out by the Cosine Similarity class.
Source code in src/pipeline.py
data_pipeline(df)
staticmethod
This method enacts the transformation pipeline to produce a set of track features.
This pipeline makes use of the following transformation methods:
- One-hot encoding
- TF-IDF transformation
- Min-max scaling of numerical values
Note that all features are normalized to the range [0, 1].
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The dataframe containing the raw data from the | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | A dataframe containing all track features. |
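The min-max scaling step listed above is what keeps numerical features inside [0, 1]. The following is a minimal sketch of that idea only, not the code in src/pipeline.py; the helper name `min_max_scale` and the `tempo` and `loudness` columns are assumptions made for illustration.

```python
import pandas as pd


def min_max_scale(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Rescale each listed numeric column to [0, 1] (hypothetical helper)."""
    out = df.copy()
    for col in columns:
        lo, hi = out[col].min(), out[col].max()
        span = hi - lo
        # A constant column has no spread; map it to 0.0 instead of dividing by zero.
        out[col] = 0.0 if span == 0 else (out[col] - lo) / span
    return out


# Toy data with assumed column names, indexed by track URI.
tracks = pd.DataFrame(
    {"tempo": [90.0, 120.0, 150.0], "loudness": [-12.0, -6.0, -3.0]},
    index=["uri:a", "uri:b", "uri:c"],
)
scaled = min_max_scale(tracks, ["tempo", "loudness"])
```

One-hot columns are already 0 or 1, so only the numerical columns need this treatment before the features are combined.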
Source code in src/pipeline.py
extract_target(df, df_target)
staticmethod
This method extracts the tracks in the playlist from the tracks dataset. Note: for this method to work, the track URIs must be the index of the tracks dataframe (df).
This method is mostly used once the tracks dataset has been transformed into a set of features and the features of the playlist tracks need to be extracted.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The dataframe containing the tracks dataset. | required |
| `df_target` | `DataFrame` | The dataframe containing the tracks from the playlist. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | A dataframe containing only the tracks from the playlist that were in the tracks dataset. |
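A rough sketch of index-based extraction is shown below. It assumes that both dataframes are indexed by track URI; the source only states this for `df`, so the indexing of `df_target`, the function name, and the toy data are all assumptions rather than the actual implementation.

```python
import pandas as pd


def extract_target_sketch(df: pd.DataFrame, df_target: pd.DataFrame) -> pd.DataFrame:
    """Keep the rows of df whose URI also appears in the playlist (illustrative)."""
    # Assumption: both frames use the track URI as their index.
    return df[df.index.isin(df_target.index)]


features = pd.DataFrame(
    {"energy": [0.2, 0.9, 0.5]},
    index=["uri:a", "uri:b", "uri:c"],
)
playlist = pd.DataFrame(
    {"name": ["C", "X", "A"]},
    index=["uri:c", "uri:x", "uri:a"],
)
playlist_features = extract_target_sketch(features, playlist)
```

Here `uri:x` is in the playlist but not in the tracks dataset, so it is absent from the result, which matches the documented return value.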
Source code in src/pipeline.py
ohe_prep(df, column)
staticmethod
This method performs one-hot encoding (OHE) on a specified column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The dataframe containing the tracks information, now being processed by the pipeline. | required |
| `column` | `str` | The name of the column on which the OHE transformation should be performed. | required |
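One common way to one-hot encode a single column with pandas is `pd.get_dummies`. The sketch below illustrates the technique under assumptions: the function name, the `mode` column, and the choice to return a new dataframe with the original column replaced are not taken from src/pipeline.py, which documents no return value for this method.

```python
import pandas as pd


def ohe_sketch(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Replace one categorical column with 0/1 indicator columns (illustrative)."""
    # Each distinct value becomes a column named "<column>_<value>".
    dummies = pd.get_dummies(df[column], prefix=column, dtype=float)
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)


# Toy data with assumed column names.
raw = pd.DataFrame(
    {"tempo": [90.0, 120.0, 150.0], "mode": ["major", "minor", "major"]},
    index=["uri:a", "uri:b", "uri:c"],
)
encoded = ohe_sketch(raw, "mode")
```

The indicator columns take only the values 0.0 and 1.0, so they already satisfy the [0, 1] normalization used by the pipeline.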
Source code in src/pipeline.py
select_columns(df)
staticmethod
This method selects the columns to be included in the feature calculations.
It allows peripheral or unneeded columns to be ignored or removed.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The dataframe containing the raw data from the | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The modified dataframe containing only the selected columns. |
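As a small illustration of column selection, the sketch below keeps a fixed list of columns and drops the rest. The column names in `FEATURE_COLUMNS` are hypothetical; the list actually used in src/pipeline.py is not shown in this page.

```python
import pandas as pd

# Hypothetical list of columns to keep; the real pipeline may use different names.
FEATURE_COLUMNS = ["danceability", "energy", "artist_genres"]


def select_columns_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df restricted to the feature columns (illustrative)."""
    return df[FEATURE_COLUMNS].copy()


raw_tracks = pd.DataFrame(
    {
        "name": ["Song A", "Song B"],
        "danceability": [0.7, 0.4],
        "energy": [0.8, 0.3],
        "artist_genres": ["rock indie", "pop"],
    },
    index=["uri:a", "uri:b"],
)
selected = select_columns_sketch(raw_tracks)
```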
Source code in src/pipeline.py
tfidf_transformation(df_parm)
staticmethod
This method performs the term frequency–inverse document frequency (TF-IDF) transformation on the artist genre column.
More information on the TF-IDF transformation can be found here: https://www.geeksforgeeks.org/understanding-tf-idf-term-frequency-inverse-document-frequency/
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df_parm` | `DataFrame` | The dataframe containing the tracks information, now undergoing the TF-IDF transformation. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The transformed dataframe containing the results of the TF-IDF transformation. |
Source code in src/pipeline.py
unique_tracks(df, df_target)
staticmethod
This method ensures that df (the tracks dataframe) does not contain any of the tracks in the playlist (df_target).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The dataframe containing the tracks dataset. | required |
| `df_target` | `DataFrame` | The dataframe containing the tracks from the playlist. | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The tracks dataframe containing none of the same tracks as in the playlist. |
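Removing playlist tracks from the candidate set can be sketched as an index-based filter. As before, this assumes both dataframes are indexed by track URI, which the source does not confirm for this method; the function name and data are hypothetical.

```python
import pandas as pd


def unique_tracks_sketch(df: pd.DataFrame, df_target: pd.DataFrame) -> pd.DataFrame:
    """Drop every row of df whose URI appears in the playlist (illustrative)."""
    # Assumption: both frames use the track URI as their index.
    return df[~df.index.isin(df_target.index)]


all_tracks = pd.DataFrame(
    {"energy": [0.2, 0.9, 0.5]},
    index=["uri:a", "uri:b", "uri:c"],
)
playlist_tracks = pd.DataFrame({"energy": [0.9]}, index=["uri:b"])
candidates = unique_tracks_sketch(all_tracks, playlist_tracks)
```

Filtering this way means a track already in the playlist cannot be scored against the playlist itself later on.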
Source code in src/pipeline.py