
Student Lifestyle & Academic Performance Analysis — ML from EDA to Deployment

Does sleep really affect your grades? What about social media usage, exercise frequency, or diet quality? These questions have obvious intuitive answers — but intuition isn't data. This project used a 2,000-student dataset of lifestyle and academic variables to build a full ML pipeline: EDA, regression, classification, clustering, and a deployed Streamlit app where you can input your own lifestyle habits and get a predicted exam score.

🔗 Live Streamlit App: bit.ly/student-life-perf
📂 GitHub Repository: bit.ly/student-perf-analysis

The Dataset

The dataset contains 2,000 student records across 16 features: daily study hours, social media usage (hours/day), Netflix usage, sleep hours, physical activity frequency, diet quality (Poor/Fair/Good), gender, age, stress level, attendance percentage, parental education level, internet quality, extracurricular participation, gaming hours, and the target — exam score (0–100).

The academic performance category (Poor/Average/Good/Excellent) was derived from the exam score and used as the classification target. The data was clean — no missing values — which made the EDA phase faster and kept the focus on feature relationships rather than imputation decisions.
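The post doesn't spell out the score bands behind the four categories; a minimal sketch of the derivation with `pd.cut`, where the cutoffs (50/70/85) are illustrative assumptions, not taken from the dataset documentation:

```python
import pandas as pd

# Hypothetical score bands -- the exact cutoffs are assumed for illustration
bands = [0, 50, 70, 85, 100]
labels = ['Poor', 'Average', 'Good', 'Excellent']

scores = pd.Series([42, 63, 78, 91])
category = pd.cut(scores, bins=bands, labels=labels, include_lowest=True)
print(category.tolist())  # ['Poor', 'Average', 'Good', 'Excellent']
```

Banding a continuous target like this is what later makes most classification errors cluster at adjacent-band boundaries.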

Part 1 — Exploratory Data Analysis

The EDA started with correlation analysis to identify which features had the strongest linear relationship with exam score before modeling. Study hours showed the strongest positive correlation (r = 0.82), followed by sleep hours (r = 0.61) and attendance (r = 0.58). Social media usage and gaming hours showed the strongest negative correlations.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("student_lifestyle_dataset.csv")

# Correlation heatmap
plt.figure(figsize=(14, 10))
numeric_cols = df.select_dtypes(include='number').columns
corr = df[numeric_cols].corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, square=True, linewidths=0.5)
plt.title('Feature Correlation Matrix', fontsize=14, pad=20)
plt.tight_layout()
plt.savefig('charts/correlation_heatmap.png', dpi=150)
plt.close()

# Study hours vs exam score by diet quality
# (new figure, so the scatter doesn't draw onto the heatmap canvas)
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='study_hours_per_day', y='exam_score',
                hue='diet_quality', palette='Set2', alpha=0.6, s=60)
plt.title('Study Hours vs Exam Score by Diet Quality')
plt.tight_layout()
plt.savefig('charts/study_vs_score_diet.png', dpi=150)
plt.close()

A key EDA finding: diet quality acts as a modifier rather than a primary driver. Students with poor diet quality who study the same hours as good-diet students score about 4–6 points lower on average. The relationship isn't about diet replacing study — it's about diet affecting the efficiency of study time.

Part 2 — Machine Learning: Regression

The regression task was predicting exam score as a continuous value. Four models were trained and compared: Linear Regression, Random Forest Regressor, Gradient Boosting, and Ridge Regression. Features were encoded (diet quality mapped to ordinal integers, gender one-hot encoded) and the dataset split 80/20.

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error

# Encode categorical features -- explicit ordinal map rather than
# LabelEncoder, which would order the labels alphabetically
# (Fair < Good < Poor) instead of by quality
df['diet_quality_enc'] = df['diet_quality'].map({'Poor': 0, 'Fair': 1, 'Good': 2})
df = pd.get_dummies(df, columns=['gender'], drop_first=True)

features = ['study_hours_per_day', 'sleep_hours_per_day', 'social_media_hours',
            'attendance_percentage', 'physical_activity_freq', 'diet_quality_enc',
            'stress_level', 'gaming_hours', 'netflix_hours', 'gender_Male']

X = df[features]
y = df['exam_score']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'Ridge': Ridge(alpha=1.0)
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}: R²={r2_score(y_test, preds):.4f}, MAE={mean_absolute_error(y_test, preds):.2f}")

Results: Linear Regression achieved R² = 0.89, MAE = 3.46. Random Forest: R² = 0.91, MAE = 3.08. Gradient Boosting: R² = 0.90, MAE = 3.21. The high R² across all models suggests the relationships in this dataset are largely linear — study hours, sleep, and attendance together explain most of the variance. The Random Forest was selected for deployment.

Part 3 — Classification

The classification task predicted performance category (Poor / Average / Good / Excellent) derived from score bands. Random Forest achieved 94% accuracy, with most errors occurring at the boundary between adjacent categories — which makes sense, since students scoring 69 and 70 are practically indistinguishable but fall into different classes.
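The classification code isn't shown in the post; a minimal sketch of the setup, using synthetic features as stand-ins for the real dataset and assumed score bands, to show how banding the target concentrates errors at adjacent boundaries:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(42)

# Synthetic stand-in: exam score driven mostly by study hours,
# then banded into the four categories (cutoffs assumed)
study = rng.uniform(0, 8, 2000)
sleep = rng.uniform(4, 10, 2000)
score = np.clip(8 * study + 3 * sleep + rng.normal(0, 5, 2000), 0, 100)
order = ['Poor', 'Average', 'Good', 'Excellent']
y = pd.cut(score, bins=[0, 50, 70, 85, 100], labels=order, include_lowest=True)

X = np.column_stack([study, sleep])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.2f}")

# With labels in band order, misclassifications sit just off the diagonal,
# i.e. between adjacent categories
print(confusion_matrix(y_test, preds, labels=order))
```

Printing the confusion matrix with the labels in band order makes the adjacent-boundary pattern visible: nearly all off-diagonal mass lands one step from the diagonal.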

Part 4 — Clustering

K-Means clustering with k=4 (selected by elbow method) was used to discover natural student segments without using the target label. The four clusters that emerged mapped closely to intuitive archetypes: high-study/high-sleep high performers, social-media-heavy average performers, balanced moderate achievers, and low-study/high-gaming underperformers. The cluster labels weren't predetermined — the algorithm found them from the feature data alone, which validates that the signal is real.
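A minimal sketch of that pipeline — scaling, elbow-method inertia sweep, then the k=4 fit — with synthetic lifestyle features standing in for the real dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the lifestyle features (hours per day)
X = np.column_stack([
    rng.uniform(0, 8, 2000),   # study hours
    rng.uniform(4, 10, 2000),  # sleep hours
    rng.uniform(0, 6, 2000),   # social media hours
    rng.uniform(0, 5, 2000),   # gaming hours
])

# Scale first: K-Means is distance-based, so features on larger
# numeric ranges would otherwise dominate the clustering
X_scaled = StandardScaler().fit_transform(X)

# Elbow method: plot inertia against k and look for the bend
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42)
            .fit(X_scaled).inertia_
            for k in range(1, 9)]

# Final fit at the chosen k
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_scaled)
print(inertias)
print(np.bincount(km.labels_))  # cluster sizes
```

Interpreting the clusters then comes down to inspecting per-cluster feature means (e.g. `df.groupby(km.labels_).mean()`), which is how the archetypes above were named.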

The Streamlit App

The deployed app has five pages matching each analytical section, plus a personal predictor where users input their own lifestyle habits and get an exam score prediction with a confidence interval and personalized recommendations. The recommendations are rule-based post-processing on the model output — not a second ML model — which keeps them interpretable and actionable rather than opaque.
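The exact recommendation rules aren't published in the post; a hypothetical sketch of what rule-based post-processing on the user's inputs might look like, with thresholds and messages invented purely for illustration:

```python
def recommendations(habits: dict) -> list[str]:
    """Hypothetical rule-based tips keyed off the user's inputs.

    Thresholds and wording are illustrative assumptions,
    not the deployed app's actual rules.
    """
    tips = []
    if habits.get('study_hours_per_day', 0) < 2:
        tips.append("Increase study time toward 2-3 hours per day.")
    if habits.get('sleep_hours_per_day', 8) < 6:
        tips.append("Aim for at least 7 hours of sleep.")
    if habits.get('social_media_hours', 0) > 4:
        tips.append("Reduce social media below 3 hours per day.")
    return tips or ["Habits look balanced -- keep them up."]

print(recommendations({'study_hours_per_day': 1, 'sleep_hours_per_day': 5}))
```

Because each tip is a transparent if/then on an input the user just typed, the advice stays explainable in a way a second model's output would not.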

Feature Importance

The Random Forest feature importances confirmed the EDA correlations: study hours per day was the most important feature (importance score 0.41), followed by attendance (0.18), sleep hours (0.14), and social media usage (0.11). Gaming hours, diet quality, and physical activity followed at lower but non-negligible importance. The model is essentially saying: control for study time and attendance first, then everything else is a secondary effect.
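Extracting and ranking the importances is a one-liner on the fitted model's `feature_importances_` attribute; a small sketch with synthetic data, where the generating coefficients are chosen so study hours dominates, mirroring the reported ranking:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic features with plausible ranges; the target leans
# heavily on study hours by construction
X = pd.DataFrame({
    'study_hours_per_day': rng.uniform(0, 8, 500),
    'attendance_percentage': rng.uniform(50, 100, 500),
    'sleep_hours_per_day': rng.uniform(4, 10, 500),
})
y = (8 * X['study_hours_per_day']
     + 0.3 * X['attendance_percentage']
     + rng.normal(0, 2, 500))

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Importances are normalized to sum to 1; sort for a ranking
importances = (pd.Series(rf.feature_importances_, index=X.columns)
               .sort_values(ascending=False))
print(importances)
```

Worth noting: these are impurity-based importances, which can overstate high-cardinality features; permutation importance on a held-out set is a common cross-check.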

What I Learned

A high R² doesn't signal a hard problem well solved — it means the features strongly predict the target, which is expected here given that study hours are a near-direct cause of exam performance. The more interesting modeling challenge was the clustering, where there is no ground truth to compare against. Learning to validate unsupervised results through interpretability rather than metrics was the most valuable methodological takeaway from this project.