Netflix has over 8,000 titles across movies and TV shows. That dataset — publicly available on Kaggle — is a goldmine for understanding content trends, genre distributions, global production patterns, and rating structures. This project took that raw data through three complete analytical layers: a Jupyter notebook for EDA, a Power BI dashboard for business reporting, and a Streamlit web app for interactive exploration and content recommendations.
🔗 Live Streamlit App: bit.ly/4e6y6Tt
📂 GitHub Repository: github.com/SaurabhAnand56/Netflix-EDA-PowerBI
Why Netflix Data?
Most beginner data projects use the Titanic or Iris datasets. Both are fine for learning mechanics, but neither reflects the kind of messy, real-world data you actually encounter. The Netflix dataset has missing values, inconsistent formatting, multi-value columns (genres, cast, directors stored as comma-separated strings), and interesting temporal patterns. It forces you to think carefully about cleaning strategy before you touch a single visualization.
It also has an obvious business framing: what kind of content does Netflix produce? Where is it produced? How has the content mix shifted over time? Those are real questions a streaming analyst would actually answer.
The Dataset
The dataset contains 8,807 titles with 12 columns: show_id, type, title, director, cast, country, date_added, release_year, rating, duration, listed_in (genres), and description. It covers content through 2021.
Key data quality issues encountered:
- ~30% of director values missing — handled by filling with `"Unknown"` rather than dropping rows
- Cast and genre columns are comma-separated strings — required `explode()` after splitting to analyze at the individual actor/genre level
- Date formats inconsistent — standardized with `pd.to_datetime()` with error coercion
- Duration stored as strings like `"90 min"` and `"2 Seasons"` — split into numeric value + unit columns
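The cleaning steps above can be sketched in pandas. This is a minimal illustration on made-up rows (the real code reads the Kaggle CSV); the regex split is one reasonable way to separate the duration value from its unit:

```python
import pandas as pd

# Toy rows standing in for the Kaggle file (illustration only)
df = pd.DataFrame({
    'director': ['Martin Scorsese', None],
    'duration': ['90 min', '2 Seasons'],
})

# Fill missing directors instead of dropping ~30% of rows
df['director'] = df['director'].fillna('Unknown')

# Split "90 min" / "2 Seasons" into numeric value + unit columns
parts = df['duration'].str.extract(r'(?P<duration_value>\d+)\s*(?P<duration_unit>.+)')
df['duration_value'] = pd.to_numeric(parts['duration_value'])
df['duration_unit'] = parts['duration_unit'].str.strip()

print(df[['director', 'duration_value', 'duration_unit']])
```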
Part 1 — Exploratory Data Analysis
The EDA notebook followed a structured flow: univariate analysis of each column, bivariate relationships between content type and other features, temporal analysis of content addition trends, and geographic analysis of production countries.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("netflix_titles.csv")
# Clean date_added
df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), errors='coerce')
df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month
# Explode genres for frequency analysis
df['genres'] = df['listed_in'].str.split(', ')
genre_df = df.explode('genres')
top_genres = genre_df['genres'].value_counts().head(15)
# Content type split over years
type_year = df.groupby(['year_added', 'type']).size().unstack(fill_value=0)
type_year.plot(kind='bar', stacked=True, figsize=(14, 6), colormap='Set2')
plt.title('Netflix Content Added by Year and Type')
plt.tight_layout()
from pathlib import Path
Path('charts').mkdir(exist_ok=True)  # ensure the output folder exists
plt.savefig('charts/content_by_year.png', dpi=150)
Key findings from the EDA: Movies make up about 69% of the catalogue. The US is the top producing country by a large margin, followed by India and the UK. Content additions peaked sharply in 2019–2020 and dropped slightly post-pandemic. International Movies and Dramas are the dominant genres, with TV Dramas growing fastest year-over-year.
Part 2 — Power BI Dashboard
The Power BI layer translated the EDA findings into an executive-facing dashboard — the kind of thing a content strategy team would actually use. The design philosophy was one-page-per-theme: an overview page, a content deep-dive, a geographic view, and a ratings breakdown.
The trickiest part of the Power BI work was handling the multi-value columns. Power BI has no convenient way to aggregate over comma-separated strings packed into a single cell, so the genre and country columns were pre-processed in Python to produce separate genre_mapping.csv and country_mapping.csv files that Power BI could join against the main table using a many-to-many relationship via show_id.
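That pre-processing step is essentially one explode-and-export per column. A minimal sketch on toy data (the real script runs over the full table; the filename mirrors the one mentioned above):

```python
import pandas as pd

# Toy slice of the main table (illustration only)
df = pd.DataFrame({
    'show_id': ['s1', 's2'],
    'listed_in': ['Dramas, International Movies', 'Comedies'],
})

# One row per (show_id, genre) so Power BI can relate it to the main table
genre_mapping = (
    df.assign(genre=df['listed_in'].str.split(', '))
      .explode('genre')[['show_id', 'genre']]
)
genre_mapping.to_csv('genre_mapping.csv', index=False)
print(genre_mapping)
```

The same pattern, applied to the country column, produces country_mapping.csv.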
DAX measures used across the dashboard included content count by type, average duration filtered by type (since mixing movie minutes and TV seasons in the same average makes no sense), year-over-year growth rate, and a rolling 12-month content addition count for the trend line.
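The DAX itself isn't reproduced here, but the measure logic can be sketched in pandas on toy data (duration_value is the numeric column from the duration split described earlier; minutes for movies, seasons for TV shows):

```python
import pandas as pd

# Toy rows (illustration only)
df = pd.DataFrame({
    'type': ['Movie', 'Movie', 'TV Show'],
    'duration_value': [90, 120, 2],  # minutes for movies, seasons for TV
    'date_added': pd.to_datetime(['2020-01-15', '2020-06-01', '2021-03-10']),
})

# Average duration filtered by type — never mix minutes and seasons
avg_movie_minutes = df.loc[df['type'] == 'Movie', 'duration_value'].mean()

# Rolling 12-month content addition count for the trend line
monthly = df.set_index('date_added').resample('M').size()
rolling_12m = monthly.rolling(12, min_periods=1).sum()

print(avg_movie_minutes, rolling_12m.iloc[-1])
```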
Part 3 — Streamlit Web App
The Streamlit app is the most user-facing layer. It exposes the same analysis interactively — users can filter by content type, genre, country, and rating, and the charts update in real time. The most interesting feature is a simple content-based recommendation engine built on top of TF-IDF similarity across the description column.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Build TF-IDF matrix on descriptions
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf_matrix = tfidf.fit_transform(df['description'].fillna(''))
# Cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
def recommend(title, n=5):
    # Locate the title, rank all other rows by description similarity
    idx = df[df['title'] == title].index[0]
    sim_scores = sorted(enumerate(cosine_sim[idx]), key=lambda x: x[1], reverse=True)[1:n+1]
    indices = [i[0] for i in sim_scores]
    return df[['title', 'type', 'listed_in', 'description']].iloc[indices]
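To see the mechanics end to end, here is a self-contained toy run of the same pipeline — the titles and descriptions are made up purely for illustration, but the TF-IDF/cosine-similarity logic is identical:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up mini catalogue purely to exercise the pipeline
df = pd.DataFrame({
    'title': ['Space Saga', 'Galaxy Quest Docs', 'Baking Brawl'],
    'type': ['Movie', 'TV Show', 'TV Show'],
    'listed_in': ['Sci-Fi', 'Docuseries', 'Reality'],
    'description': [
        'A crew explores deep space and distant galaxies.',
        'Documentary about space travel and galaxy exploration.',
        'Amateur bakers compete in a tense kitchen showdown.',
    ],
})

tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['description'].fillna(''))
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

def recommend(title, n=5):
    idx = df[df['title'] == title].index[0]
    sim_scores = sorted(enumerate(cosine_sim[idx]), key=lambda x: x[1], reverse=True)[1:n+1]
    indices = [i[0] for i in sim_scores]
    return df[['title', 'type', 'listed_in', 'description']].iloc[indices]

# The space movie's nearest neighbour is the space documentary
recs = recommend('Space Saga', n=1)
print(recs['title'].tolist())
```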
The recommendation quality is decent for a TF-IDF baseline — it surfaces thematically similar content reliably — but it misses stylistic similarity that a more sophisticated embedding-based approach would catch. That's a natural next iteration.
What I Learned
Data cleaning takes longer than analysis, every time. The Power BI many-to-many relationship for multi-value columns is a pattern I'll reuse constantly. And Streamlit is genuinely fast for turning Python analysis into something shareable — the entire app was built in under two days.
The project also reinforced that the most valuable output of any analysis isn't a chart — it's a clear answer to a business question. Every visualization in the dashboard was designed backwards from a specific question, not forwards from "what can I plot?"