Revenue Leakage Detection & Pricing Optimization

Revenue leakage is money that a business earns in theory but loses in practice — through excessive discounting, inefficient pricing, stockouts on high-demand products, and margin erosion hidden inside aggregated revenue numbers. This project built an end-to-end detection and optimization system using 27,889 e-commerce transactions, surfacing $9.54 million in at-risk revenue and generating concrete pricing recommendations across 47,009 products analysed.

🔗 Live Streamlit App: bit.ly/4ut8DZe
📊 Tableau Dashboard: bit.ly/3QLRQyd
📂 GitHub Repository: bit.ly/4cAy0lD

The Problem

Most e-commerce platforms that are losing money know their total revenue number. What they don't know is where that money is going. The analysis identified three primary leakage vectors: excessive discount depth (products discounted past their margin floor), out-of-stock revenue loss (demand exists but inventory doesn't), and pricing inefficiency (items priced below market rate given their ratings and category position).

Key findings from the analysis:

Total Revenue at Risk: $9.54 million
Over-Discounted Products: 30,052 (75.2% of catalogue)
COGs Revenue Loss: $1.54 million
Highest Risk Seller: RatanFikri (P/L risk at 4%)
Average Product Rating: 3.44 / 5

The Dataset & Feature Engineering

The dataset covers 27,889 orders with product-level attributes including actual price, discounted price, discount percentage, category, sub-category, ratings, rating count, and seller information. Feature engineering was substantial — the raw columns weren't sufficient to identify leakage patterns directly.

import pandas as pd
import numpy as np

df = pd.read_csv("ecommerce_data.csv")

# Clean price columns (remove currency symbols, convert to float)
df['actual_price'] = df['actual_price'].str.replace('[₹,]', '', regex=True).astype(float)
df['discounted_price'] = df['discounted_price'].str.replace('[₹,]', '', regex=True).astype(float)
df['discount_pct'] = df['discount_percentage'].str.replace('%', '').astype(float) / 100

# Calculate margin proxy (assumes ~40% COGS baseline)
COGS_RATE = 0.40
df['cogs_estimate'] = df['actual_price'] * COGS_RATE
df['margin_after_discount'] = df['discounted_price'] - df['cogs_estimate']
df['is_below_cogs'] = df['margin_after_discount'] < 0

# Discount risk flag
df['discount_risk'] = df['discount_pct'] > 0.50  # >50% discount flagged

# Out-of-stock revenue opportunity (products with high ratings but no recent orders)
df['demand_score'] = df['rating'] * np.log1p(df['rating_count'])
df['oos_risk'] = (df['demand_score'] > df['demand_score'].quantile(0.75)) & \
                 (df['rating_count'] < df['rating_count'].quantile(0.25))

SQL Analysis Layer

The core leakage analysis ran through a SQL pipeline on top of the cleaned data. Window functions were essential for comparing individual product performance against category baselines — you can't identify an outlier without a reference distribution.

-- Discount distribution analysis per category
SELECT
  category,
  sub_category,
  COUNT(*) AS product_count,
  ROUND(AVG(discount_pct) * 100, 1) AS avg_discount_pct,
  ROUND(SUM(CASE WHEN discount_pct > 0.5 THEN actual_price - discounted_price ELSE 0 END), 2)
    AS excess_discount_value,
  COUNT(CASE WHEN margin_after_discount < 0 THEN 1 END) AS below_cogs_count
FROM products
GROUP BY category, sub_category
ORDER BY excess_discount_value DESC;

-- Top sellers by revenue at risk
SELECT
  seller_name,
  COUNT(*) AS products,
  ROUND(SUM(CASE WHEN is_below_cogs THEN actual_price - discounted_price ELSE 0 END), 2)
    AS revenue_at_risk,
  ROUND(AVG(discount_pct) * 100, 1) AS avg_discount_pct,
  ROUND(AVG(rating), 2) AS avg_rating
FROM products
GROUP BY seller_name
HAVING revenue_at_risk > 10000
ORDER BY revenue_at_risk DESC
LIMIT 20;

The Pricing Recommendation Engine

The most technically interesting part of the project is the recommendation engine. Rather than just flagging problems, it generates specific pricing actions with expected revenue recovery estimates. The logic uses category-level percentile benchmarking — a product is considered underpriced if its price-to-rating ratio falls below the 25th percentile for its sub-category.

def generate_pricing_recommendations(df: pd.DataFrame) -> pd.DataFrame:
    recommendations = []

    for category, group in df.groupby('sub_category'):
        # Calculate category benchmarks
        p25_price = group['discounted_price'].quantile(0.25)
        p75_price = group['discounted_price'].quantile(0.75)
        median_discount = group['discount_pct'].median()
        avg_rating = group['rating'].mean()

        for _, row in group.iterrows():
            priority = 'Low'
            action = None
            recovery_est = 0

            # Rule 1: Below COGS — urgent price correction needed
            if row['margin_after_discount'] < 0:
                priority = 'Critical'
                action = 'Raise price above COGS floor'
                recovery_est = abs(row['margin_after_discount']) * row['rating_count']

            # Rule 2: Discount > 2x category median — discount abuse
            elif row['discount_pct'] > median_discount * 2:
                priority = 'High'
                action = f'Reduce discount to category median ({median_discount*100:.0f}%)'
                recovery_est = (row['discount_pct'] - median_discount) * row['actual_price']

            # Rule 3: High rating but priced below category P25 — underpriced
            elif row['rating'] > avg_rating + 0.5 and row['discounted_price'] < p25_price:
                priority = 'Medium'
                action = f'Price increase to category P25 (₹{p25_price:.0f})'
                recovery_est = (p25_price - row['discounted_price']) * max(row['rating_count'], 1)

            if action:
                recommendations.append({
                    'product': row['product_name'][:50],
                    'category': category,
                    'priority': priority,
                    'action': action,
                    'estimated_recovery': round(recovery_est, 2)
                })

    return pd.DataFrame(recommendations).sort_values('estimated_recovery', ascending=False)

Tableau Dashboard

The Tableau public dashboard provides the executive-facing view: revenue at risk by category, discount distribution heatmaps, seller-level risk rankings, and a geographic breakdown of order concentration. It's designed for a category manager who needs to quickly identify which sub-categories need immediate pricing intervention versus which are performing efficiently.

The most valuable view in the dashboard is the discount vs margin scatter plot at the sub-category level — it immediately surfaces which categories are discounting heavily and still maintaining margin (fine) versus discounting heavily and destroying margin (urgent action needed).

The Streamlit App

The app has five pages: Home (KPI overview), AI Query (plain English questions against the dataset via Gemini), Sales Dashboard (interactive charts with preset query library), SQL Explorer, and AI Insights (full auto-generated BI report). The AI query integration follows the same two-call pattern used in the SQL BI Assistant project — separate calls for SQL generation and insight generation for better output quality on both.

Business Recommendations Generated

The engine produced four priority recommendations from the analysis:

Critical: 75.2% of products are over-discounted — implement category-specific discount caps immediately
High: High discounts don't improve ratings — use dynamic pricing instead of blanket discounts
Medium: Budget for most over-discounted across all sub-categories warrant seller-level review
Low: Require seller profitability review before listing

What Made This Project Hard

The hardest part wasn't the ML — it was deciding what "leakage" means precisely enough to measure it. Revenue leakage is a business concept, not a statistical one. Translating "the platform is losing money" into specific, queryable, quantifiable conditions required understanding the business model first and building the technical solution second. That ordering — business logic before code — is the discipline this project reinforced more than anything else.