Tidy Tuesday Exercise

Author

Collin Real

Import libraries

import pandas as pd
import ssl
from functools import reduce
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder
ssl._create_default_https_context = ssl._create_unverified_context
pd.set_option('display.max_columns', None)

Load auditions data

auditions_df = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-23/auditions.csv', encoding='unicode_escape')
auditions_df.head(5)

	season	audition_date_start	audition_date_end	audition_city	audition_venue	episodes	episode_air_date	callback_venue	callback_date_start	callback_date_end	tickets_to_hollywood	guest_judge
0	1	2002-04-20	2002-04-22	Los Angeles, California	Westin Bonaventure Hotel	NaN	11-Jun-02	NaN	NaN	NaN	31.0	NaN
1	1	2002-04-23	2002-04-25	Seattle, Washington	Hyatt Regency Hotel	NaN	11-Jun-02	NaN	NaN	NaN	10.0	NaN
2	1	2002-04-26	2002-04-28	Chicago, Illinois	Congress Plaza Hotel	NaN	11-Jun-02	NaN	NaN	NaN	23.0	NaN
3	1	2002-04-29	2002-05-01	New York City, New York	Millenium Hilton Hotel	NaN	11-Jun-02	NaN	NaN	NaN	25.0	NaN
4	1	2002-05-03	2002-05-05	Atlanta, Georgia	AmericasMart/Callanwolde Fine Arts Center	NaN	11-Jun-02	NaN	2002-03-05	2002-03-05	15.0	NaN

Load eliminations data

eliminations_df = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-23/eliminations.csv', encoding='unicode_escape')
eliminations_df.head(5)

	season	place	gender	contestant	top_36	top_36_2	top_36_3	top_36_4	top_32	top_32_2	top_32_3	top_32_4	top_30	top_30_2	top_30_3	top_25	top_25_2	top_25_3	top_24	top_24_2	top_24_3	top_20	top_20_2	top_16	top_14	top_13	top_12	top_11	top_11_2	wildcard	comeback	top_10	top_9	top_9_2	top_8	top_8_2	top_7	top_7_2	top_6	top_6_2	top_5	top_5_2	top_4	top_4_2	top_3	finale
0	1	1	Female	Kelly Clarkson	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Safe (2nd)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Safe	NaN	NaN	Safe	NaN	Safe	NaN	Safe	NaN	Safe	NaN	Safe	NaN	Safe	Winner
1	1	2	Male	Justin Guarini	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Safe (1st)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Safe	NaN	NaN	Safe	NaN	Bottom Two	NaN	Safe	NaN	Safe	NaN	Safe	NaN	Safe	Runner-Up
2	1	3	Female	Nikki McKibbin	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Safe (2nd)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Bottom Three	NaN	NaN	Safe	NaN	Bottom Three	NaN	Bottom Three	NaN	Bottom Two	NaN	Bottom Two	NaN	Eliminated	NaN
3	1	4	Female	Tamyra Gray	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Safe (1st)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Safe	NaN	NaN	Safe	NaN	Safe	NaN	Safe	NaN	Safe	NaN	Eliminated	NaN	NaN	NaN
4	1	5	Male	R. J. Helton	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Wild Card	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Saved	NaN	Safe	NaN	NaN	Safe	NaN	Safe	NaN	Bottom Two	NaN	Eliminated	NaN	NaN	NaN	NaN	NaN

Load finalists data

finalists_df = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-23/finalists.csv', encoding='unicode_escape')
finalists_df.head(5)

	Contestant	Birthday	Birthplace	Hometown	Description	Season
0	Kelly Clarkson	24-Apr-82	Fort Worth, Texas	Burleson, Texas	She performed Aretha Franklin's version of "Re...	1
1	Justin Guarini	28-Oct-78	Columbus, Georgia	Doylestown, Pennsylvania	He performed Oleta Adams' version of "Get Her...	1
2	Nikki McKibbin	28-Sep-78	Grand Prairie, Texas	NaN	She had previously been on Popstars and auditi...	1
3	Tamyra Gray	26-Jul-79	Takoma Park, Maryland	Atlanta, Georgia	She had appeared on TV commercials and worked ...	1
4	R. J. Helton	17-May-81	Pasadena, Texas	Cumming, Georgia	J. Helton (born May 17, 1981, in Pasadena, Tex...	1

Load ratings data

ratings_df = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-23/ratings.csv', encoding='unicode_escape')
ratings_df.head(5)

	season	show_number	episode	airdate	18_49_rating_share	viewers_in_millions	timeslot_et	dvr_18_49	dvr_viewers_millions	total_18_49	total_viewers_millions	weekrank	ref	share	nightlyrank	rating_share_households	rating_share
0	1	1	Auditions	June 11, 2002	4.8	9.85	NaN	NaN	NaN	NaN	NaN	12	NaN	NaN	NaN	NaN	6.1 / 11
1	1	2	Hollywood Week	June 12, 2002	5.2	11.24	NaN	NaN	NaN	NaN	NaN	6	NaN	NaN	NaN	NaN	6.9 / 12
2	1	3	Top 30: Group 1	June 18, 2002	5.2	10.30	NaN	NaN	NaN	NaN	NaN	6	NaN	NaN	NaN	NaN	6.2 / 11
3	1	4	Top 30: Group 1 results	June 19, 2002	4.7	9.47	NaN	NaN	NaN	NaN	NaN	22	NaN	NaN	NaN	NaN	5.8 / 10
4	1	5	Top 30: Group 2	June 25, 2002	4.5	9.08	NaN	NaN	NaN	NaN	NaN	11	NaN	NaN	NaN	NaN	5.5 / 9

Load seasons data

seasons_df = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-23/seasons.csv', encoding='unicode_escape')
seasons_df.head(5)

	season	winner	runner_up	original_release	original_network	hosted_by	judges	no_of_episodes	finals_venue	mentor
0	1	Kelly Clarkson	Justin Guarini	June 11Â (2002-06-11)Â â€“September 4, 2002Â (...	Fox	Ryan Seacrest; Brian Dunkleman	Paula Abdul; Simon Cowell; Randy Jackson	NaN	Kodak Theatre	NaN
1	2	Ruben Studdard	Clay Aiken	January 21Â (2003-01-21)Â â€“May 21, 2003Â (20...	Fox	Ryan Seacrest	Paula Abdul; Simon Cowell; Randy Jackson	NaN	Gibson Amphitheatre	NaN
2	3	Fantasia Barrino	Diana DeGarmo	January 19Â (2004-01-19)Â â€“May 26, 2004Â (20...	Fox	Ryan Seacrest	Paula Abdul; Simon Cowell; Randy Jackson	NaN	Kodak Theatre	NaN
3	4	Carrie Underwood	Bo Bice	January 18Â (2005-01-18)Â â€“May 25, 2005Â (20...	Fox	Ryan Seacrest	Paula Abdul; Simon Cowell; Randy Jackson	NaN	Kodak Theatre	NaN
4	5	Taylor Hicks	Katharine McPhee	January 17Â (2006-01-17)Â â€“May 24, 2006Â (20...	Fox	Ryan Seacrest	Paula Abdul; Simon Cowell; Randy Jackson	NaN	Kodak Theatre	NaN

Load songs data

songs_df = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-07-23/songs.csv', encoding='unicode_escape')
songs_df.head(5)

	season	week	order	contestant	song	artist	song_theme	result
0	Season_01	20020618_top_30_group_1	1	Tamyra Gray	And I Am Telling You I'm Not Going	Jennifer Holliday	NaN	Advanced (1st)
1	Season_01	20020618_top_30_group_1	2	Jim Verraros	When I Fall in Love	Doris Day	NaN	Advanced (3rd)
2	Season_01	20020618_top_30_group_1	3	Adriel Herrera	I'll Be	Edwin McCain	NaN	Eliminated
3	Season_01	20020618_top_30_group_1	4	Rodesia Eaves	Daydream Believer	The Monkees	NaN	Eliminated
4	Season_01	20020618_top_30_group_1	5	Natalie Burge	Crazy	Patsy Cline	NaN	Eliminated

Data cleaning - Fix dtype

# Change 'season' column in songs dataset to int for joins with other datasets
seasons_reformat_list = []
for row in songs_df['season']:
    new_row = row.replace('Season', '').replace('_0', '').replace('_', '')
    seasons_reformat_list.append(new_row)
songs_df['season'] = seasons_reformat_list
songs_df['season'] = songs_df['season'].astype(int)
songs_df.head(5)

	season	week	order	contestant	song	artist	song_theme	result
0	1	20020618_top_30_group_1	1	Tamyra Gray	And I Am Telling You I'm Not Going	Jennifer Holliday	NaN	Advanced (1st)
1	1	20020618_top_30_group_1	2	Jim Verraros	When I Fall in Love	Doris Day	NaN	Advanced (3rd)
2	1	20020618_top_30_group_1	3	Adriel Herrera	I'll Be	Edwin McCain	NaN	Eliminated
3	1	20020618_top_30_group_1	4	Rodesia Eaves	Daydream Believer	The Monkees	NaN	Eliminated
4	1	20020618_top_30_group_1	5	Natalie Burge	Crazy	Patsy Cline	NaN	Eliminated

Exploratory Data Analysis

from plotnine import (
    ggplot,
    aes,
    geom_boxplot,
    geom_jitter,
    scale_x_discrete,
    coord_flip,
    facet_wrap,
    geom_histogram,
    geom_bar
)

Raw boxplot viewers by seasons

seasons = ratings_df['season'].unique().tolist()
(
    ggplot(ratings_df)
    + geom_boxplot(aes(x="factor(season)", y="viewers_in_millions"))
    + scale_x_discrete(labels=seasons, name="season")  # change ticks labels on OX
)

Histogram - Viewers by Season

ratings2 = ratings_df[['season', 'viewers_in_millions', '18_49_rating_share']]
(
    ggplot(ratings2)
    + geom_histogram(aes(x="viewers_in_millions"), fill='red', colour='black')
    + facet_wrap("season")
)

Raw boxplot viewers by show number

show_number = ratings_df['show_number'].unique().tolist()
(
    ggplot(ratings_df)
    + geom_boxplot(aes(x="factor(show_number)", y="viewers_in_millions"))
    + scale_x_discrete(labels=show_number, name="show_number")  # change ticks labels on OX
)

Gender contestants by season

male_contestants_season = eliminations_df[eliminations_df['gender'] == 'Male'].groupby('season').count()['contestant'].reset_index().rename(columns={'contestant': 'male_contestants'})
# male_contestants_season['gender'] = 'Male'

female_contestants_season = eliminations_df[eliminations_df['gender'] == 'Female'].groupby('season').count()['contestant'].reset_index().rename(columns={'contestant': 'female_contestants'})
# female_contestants_season['gender'] = 'Female'

contestants = male_contestants_season.merge(female_contestants_season, on='season')
contestants['total_contestants'] = contestants['male_contestants'] + contestants['female_contestants']
contestants['male_ratio'] = contestants['male_contestants'] / contestants['total_contestants']
contestants['female_ratio'] = contestants['female_contestants'] / contestants['total_contestants']

male_contestants = contestants[['season', 'male_contestants', 'male_ratio']].rename(columns={'male_contestants': 'contestants', 'male_ratio': 'ratio'})
male_contestants['gender'] = 'Male'

female_contestants = contestants[['season', 'female_contestants', 'female_ratio']].rename(columns={'female_contestants': 'contestants', 'female_ratio': 'ratio'})
female_contestants['gender'] = 'Female'

final_contestants_df = pd.concat([male_contestants, female_contestants])
final_contestants_df

	season	contestants	ratio	gender
0	1	14	0.466667	Male
1	2	15	0.416667	Male
2	3	12	0.375000	Male
3	4	12	0.500000	Male
4	5	12	0.500000	Male
5	6	12	0.500000	Male
6	7	12	0.500000	Male
7	8	18	0.500000	Male
8	9	12	0.500000	Male
9	10	12	0.500000	Male
10	11	13	0.520000	Male
11	12	10	0.500000	Male
12	13	10	0.500000	Male
13	14	12	0.500000	Male
14	15	11	0.458333	Male
15	16	12	0.500000	Male
16	17	11	0.550000	Male
17	18	9	0.450000	Male
0	1	16	0.533333	Female
1	2	21	0.583333	Female
2	3	20	0.625000	Female
3	4	12	0.500000	Female
4	5	12	0.500000	Female
5	6	12	0.500000	Female
6	7	12	0.500000	Female
7	8	18	0.500000	Female
8	9	12	0.500000	Female
9	10	12	0.500000	Female
10	11	12	0.480000	Female
11	12	10	0.500000	Female
12	13	10	0.500000	Female
13	14	12	0.500000	Female
14	15	13	0.541667	Female
15	16	12	0.500000	Female
16	17	9	0.450000	Female
17	18	11	0.550000	Female

Plot Gender Ratio

(
    ggplot(final_contestants_df, aes(x="season", y="ratio", fill="gender"))
    + geom_bar(stat="identity")
)

Merge & split datasets, encode categorical variables and select features

Question: Can predict the winner using the features gender, viewers_in_millions, tickets_to_hollywood, and audition city?

# Merge the dataframes
merged_df = pd.merge(auditions_df, eliminations_df, on='season', how='left')
merged_df = pd.merge(merged_df, finalists_df, left_on=['season', 'contestant'], right_on=['Season', 'Contestant'], how='left')
merged_df = pd.merge(merged_df, ratings_df, on='season', how='left')
merged_df = pd.merge(merged_df, seasons_df, on='season', how='left')
merged_df = pd.merge(merged_df, songs_df, on=['season', 'contestant'], how='left')

# Handle missing values
merged_df.fillna(0, inplace=True)

# Feature selection
merged_df = merged_df[['gender', 'viewers_in_millions', 'winner', 'audition_city', 'tickets_to_hollywood']].drop_duplicates()

# Encoding categorical variables
label_encoders = {}
categorical_columns = merged_df.select_dtypes(include=['object']).columns

for column in categorical_columns:
    le = LabelEncoder()
    merged_df[column] = le.fit_transform(merged_df[column].astype(str))
    label_encoders[column] = le

# Defining the target variable and features
# Assuming 'winner' column in seasons_df as target
target = 'winner'
features = merged_df.columns.difference([target])

X = merged_df[features]
y = merged_df[target]

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier
# Training the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, output_dict=True)

print("Accuracy:", accuracy)
print("Classification Report:", classification_report(y_test, y_pred))

Accuracy: 0.9828503843879361
Classification Report:               precision    recall  f1-score   support

           0       0.98      1.00      0.99        94
           1       1.00      0.98      0.99       108
           2       1.00      0.98      0.99       110
           3       1.00      1.00      1.00       126
           4       1.00      1.00      1.00        96
           5       0.98      1.00      0.99       109
           6       0.97      0.95      0.96       152
           7       1.00      1.00      1.00        71
           8       1.00      1.00      1.00       124
           9       0.86      0.86      0.86        36
          10       1.00      1.00      1.00       134
          11       0.72      0.90      0.80        31
          12       1.00      1.00      1.00         1
          13       1.00      1.00      1.00       114
          14       0.96      0.98      0.97       107
          15       1.00      0.98      0.99        98
          16       1.00      1.00      1.00       107
          17       0.99      0.93      0.96        73

    accuracy                           0.98      1691
   macro avg       0.97      0.98      0.97      1691
weighted avg       0.98      0.98      0.98      1691

Neural Network

from sklearn.neural_network import MLPClassifier
# Training the neural network model
nn_model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42)
nn_model.fit(X_train, y_train)

# Making predictions
y_pred = nn_model.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:", report)

Accuracy: 0.8675340035481963
Classification Report:               precision    recall  f1-score   support

           0       0.97      0.97      0.97        94
           1       0.95      0.97      0.96       108
           2       0.87      0.83      0.85       110
           3       0.79      0.82      0.80       126
           4       0.68      0.56      0.62        96
           5       0.89      0.93      0.91       109
           6       0.77      0.91      0.83       152
           7       0.97      0.96      0.96        71
           8       0.81      0.88      0.84       124
           9       0.54      0.19      0.29        36
          10       0.84      0.87      0.85       134
          11       0.46      0.52      0.48        31
          12       1.00      1.00      1.00         1
          13       0.95      0.97      0.96       114
          14       0.94      0.90      0.92       107
          15       1.00      0.98      0.99        98
          16       0.98      0.97      0.98       107
          17       0.92      0.81      0.86        73

    accuracy                           0.87      1691
   macro avg       0.85      0.84      0.84      1691
weighted avg       0.86      0.87      0.86      1691

Logistic Regression

from sklearn.linear_model import LogisticRegression
# Training the logistic regression model
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train, y_train)

# Making predictions
y_pred = lr_model.predict(X_test)

# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:", report)

Accuracy: 0.5340035481963336
Classification Report:               precision    recall  f1-score   support

           0       0.61      0.81      0.69        94
           1       0.76      0.69      0.72       108
           2       0.34      0.36      0.35       110
           3       0.15      0.08      0.10       126
           4       0.33      0.16      0.21        96
           5       0.44      0.49      0.46       109
           6       0.71      0.98      0.83       152
           7       0.80      0.46      0.59        71
           8       0.41      0.52      0.46       124
           9       0.00      0.00      0.00        36
          10       0.47      0.62      0.54       134
          11       0.27      0.13      0.17        31
          12       0.00      0.00      0.00         1
          13       0.47      0.65      0.55       114
          14       0.43      0.15      0.22       107
          15       0.75      0.85      0.80        98
          16       0.54      0.70      0.61       107
          17       0.69      0.73      0.71        73

    accuracy                           0.53      1691
   macro avg       0.46      0.46      0.45      1691
weighted avg       0.50      0.53      0.50      1691

Conclusion

We trained 3 models to predict the American Idol winner using features: gender, viewers_in_millions, tickets_to_hollywood, and audition city. The three models we used were the random forest classifier, neural network, and logistic regression.

Accuracy Scores

Random forest classifier: 0.9828503843879361
Neural network: 0.8675340035481963
Logistic regression: 0.5340035481963336

The random classifier model had the greatest accuracy store, but such a high score might indicate that the model is overfitting.