Nora Petrova committed on
Commit d8ff169 · 1 Parent(s): 6833632

Add app files

Dockerfile ADDED
@@ -0,0 +1,19 @@
+ FROM node:20.11.0-slim
+
+ WORKDIR /app
+
+ # Copy the application code, owned by the image's built-in non-root "node" user
+ # (the base image defines no account named "user")
+ COPY --chown=node:node leaderboard-app/ ./
+
+ RUN npm install
+
+ # Build the app
+ RUN npm run build
+
+ # Expose the port the app will run on
+ # HF Spaces uses port 7860 by default
+ EXPOSE 7860
+
+ # Start the app with the correct port
+ ENV PORT=7860
+ CMD ["npm", "start"]
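
Review note: `ENV PORT=7860` only has an effect if the start script reads `PORT`. The `package.json` is not shown in this diff, so the sketch below assumes a plain Node entry point that resolves its port from the environment. `resolvePort` and the fallback of 3000 are hypothetical names and values, not part of this commit.

```javascript
// Hypothetical helper: pick the listening port from the environment,
// falling back to a default when PORT is missing or not a valid TCP port.
function resolvePort(env = process.env, fallback = 3000) {
  const parsed = Number.parseInt(env.PORT ?? '', 10);
  const isValid = Number.isInteger(parsed) && parsed > 0 && parsed < 65536;
  return isValid ? parsed : fallback;
}

// With the Dockerfile's ENV line in effect, this resolves to 7860.
const port = resolvePort({ PORT: '7860' });
```

If `npm start` maps to `next start`, no such helper should be needed, since the Next.js CLI is documented to pick up `PORT` on its own.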
leaderboard-app/.gitignore ADDED
@@ -0,0 +1,41 @@
+ # See https://help.github.com/articles/ignoring-files/ for more about ignoring files.
+
+ # dependencies
+ /node_modules
+ /.pnp
+ .pnp.*
+ .yarn/*
+ !.yarn/patches
+ !.yarn/plugins
+ !.yarn/releases
+ !.yarn/versions
+
+ # testing
+ /coverage
+
+ # next.js
+ /.next/
+ /out/
+
+ # production
+ /build
+
+ # misc
+ .DS_Store
+ *.pem
+
+ # debug
+ npm-debug.log*
+ yarn-debug.log*
+ yarn-error.log*
+ .pnpm-debug.log*
+
+ # env files (can opt-in for committing if needed)
+ .env*
+
+ # vercel
+ .vercel
+
+ # typescript
+ *.tsbuildinfo
+ next-env.d.ts
leaderboard-app/README.md ADDED
@@ -0,0 +1,113 @@
+ # LLM Comparison Leaderboard
+
+ An interactive dashboard for comparing the performance of state-of-the-art large language models across various tasks and metrics.
+
+ ## Features
+
+ - Overall model rankings with comprehensive scoring
+ - Task-specific performance analysis
+ - Metric breakdowns across different dimensions
+ - User satisfaction and experience metrics
+ - Interactive visualizations using Recharts
+ - Responsive design for all device sizes
+
+ ## Getting Started
+
+ ### Prerequisites
+
+ - Node.js 16.8 or later
+ - Python 3.8 or later (for data processing)
+ - Python packages: pandas, numpy
+
+ ### Installation
+
+ 1. Clone the repository:
+
+ ```bash
+ git clone https://github.com/yourusername/llm-comparison-leaderboard.git
+ cd llm-comparison-leaderboard
+ ```
+
+ 2. Install dependencies:
+
+ ```bash
+ npm install
+ ```
+
+ 3. Install Python dependencies (if you plan to process data):
+
+ ```bash
+ pip install pandas numpy
+ ```
+
+ ### Using Sample Data
+
+ The repository includes a sample JSON file with placeholder data in `public/llm_comparison_data.json`. You can start the development server right away to see the dashboard with this data:
+
+ ```bash
+ npm run dev
+ ```
+
+ Visit [http://localhost:3000](http://localhost:3000) to see the dashboard.
+
+ ### Processing Your Own Data
+
+ If you have your own data, follow these steps:
+
+ 1. Place your CSV data file in the `data` directory:
+
+ ```bash
+ mkdir -p data
+ cp /path/to/your/pilot_data_n20.csv data/
+ ```
+
+ 2. Run the data processing script:
+
+ ```bash
+ npm run process-data
+ ```
+
+ This will:
+ - Process the CSV data using the Python script
+ - Generate a JSON file in the `public` directory
+ - Format the data for the dashboard
+
+ 3. Start the development server:
+
+ ```bash
+ npm run dev
+ ```
+
+ ## Project Structure
+
+ - `app/` - Next.js App Router components
+   - `page.js` - Main page component that loads the data and renders the dashboard
+   - `layout.js` - Layout component with metadata and global styles
+   - `globals.css` - Global styles including Tailwind CSS
+ - `components/` - React components
+   - `LLMComparisonDashboard.jsx` - The main dashboard component
+ - `public/` - Static files
+   - `llm_comparison_data.json` - Processed data for the dashboard
+ - `lib/` - Utility functions
+   - `utils.js` - Helper functions for data processing
+ - `scripts/` - Data processing scripts
+   - `process_data.js` - Node.js script that runs the Python processor
+   - `process_data.py` - Python script for data processing
+
+ ## Building for Production
+
+ To build the application for production:
+
+ ```bash
+ npm run build
+ ```
+
+ To start the production server:
+
+ ```bash
+ npm run start
+ ```
+
+ ## License
+
+ This project is licensed under the MIT License - see the LICENSE file for details.
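
Review note: the README describes `scripts/process_data.js` as a Node.js wrapper around the Python processor, but neither script appears in this part of the diff. The following is only a sketch of what such a wrapper might look like. The `--input` and `--output` flags, the default paths, and the function names are assumptions for illustration, not the project's actual interface.

```javascript
const { spawnSync } = require('child_process');
const path = require('path');

// Build the command line for the Python processor.
// The --input/--output flags and default paths are hypothetical.
function buildProcessCommand({
  python = 'python3',
  input = path.join('data', 'pilot_data_n20.csv'),
  output = path.join('public', 'llm_comparison_data.json'),
} = {}) {
  return {
    command: python,
    args: [path.join('scripts', 'process_data.py'), '--input', input, '--output', output],
  };
}

// Run the processor synchronously and fail loudly on a non-zero exit code.
function runProcessor(options) {
  const { command, args } = buildProcessCommand(options);
  const result = spawnSync(command, args, { stdio: 'inherit' });
  if (result.error) throw result.error; // e.g. the Python binary was not found
  if (result.status !== 0) {
    throw new Error(`process_data.py exited with code ${result.status}`);
  }
}
```

Separating command construction from execution keeps the path handling testable without Python installed.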
leaderboard-app/app/favicon.ico ADDED
leaderboard-app/app/globals.css ADDED
@@ -0,0 +1,29 @@
+ @import "tailwindcss";
+
+ :root {
+   --background: #ffffff;
+   --foreground: #171717;
+ }
+
+ @theme inline {
+   --color-background: var(--background);
+   --color-foreground: var(--foreground);
+   --font-sans: var(--font-geist-sans);
+   --font-mono: var(--font-geist-mono);
+ }
+
+ /* Force light theme regardless of color scheme preference */
+ /* Disable dark mode
+ @media (prefers-color-scheme: dark) {
+   :root {
+     --background: #0a0a0a;
+     --foreground: #ededed;
+   }
+ }
+ */
+
+ body {
+   background: var(--background);
+   color: var(--foreground);
+   font-family: Arial, Helvetica, sans-serif;
+ }
leaderboard-app/app/layout.js ADDED
@@ -0,0 +1,19 @@
+ import { Inter } from 'next/font/google';
+ import './globals.css';
+
+ const inter = Inter({ subsets: ['latin'] });
+
+ export const metadata = {
+   title: 'LLM Comparison Leaderboard',
+   description: 'Interactive leaderboard comparing performance of state-of-the-art large language models across various tasks and metrics.',
+ };
+
+ export default function RootLayout({ children }) {
+   return (
+     <html lang="en">
+       <body className={`${inter.className} bg-gray-50`}>
+         {children}
+       </body>
+     </html>
+   );
+ }
leaderboard-app/app/page.js ADDED
@@ -0,0 +1,84 @@
+ 'use client';
+
+ import { useState, useEffect } from 'react';
+ import dynamic from 'next/dynamic';
+ import { prepareDataForVisualization } from '../lib/utils';
+
+ // Dynamically import the dashboard component with SSR disabled
+ // This is important because recharts needs to be rendered on the client side
+ const LLMComparisonDashboard = dynamic(
+   () => import('../components/LLMComparisonDashboard'),
+   { ssr: false }
+ );
+
+ export default function Home() {
+   const [data, setData] = useState(null);
+   const [loading, setLoading] = useState(true);
+   const [error, setError] = useState(null);
+
+   useEffect(() => {
+     async function fetchData() {
+       try {
+         setLoading(true);
+
+         // Fetch the data from the JSON file in the public directory
+         const response = await fetch('/llm_comparison_data.json');
+
+         if (!response.ok) {
+           throw new Error(`Failed to fetch data: ${response.status} ${response.statusText}`);
+         }
+
+         const jsonData = await response.json();
+
+         // Process the data for visualization
+         const processedData = prepareDataForVisualization(jsonData);
+
+         setData(processedData);
+         setLoading(false);
+       } catch (err) {
+         console.error('Error loading data:', err);
+         setError(err.message || 'Failed to load data');
+         setLoading(false);
+       }
+     }
+
+     fetchData();
+   }, []);
+
+   if (loading) {
+     return (
+       <div className="flex items-center justify-center min-h-screen">
+         <div className="text-center">
+           <div className="animate-spin rounded-full h-12 w-12 border-b-2 border-blue-500 mx-auto mb-4"></div>
+           <p className="text-lg text-gray-600">Loading LLM comparison data...</p>
+         </div>
+       </div>
+     );
+   }
+
+   if (error) {
+     return (
+       <div className="flex items-center justify-center min-h-screen">
+         <div className="text-center max-w-md p-6 bg-red-50 rounded-lg border border-red-200">
+           <svg xmlns="http://www.w3.org/2000/svg" className="h-12 w-12 text-red-500 mx-auto mb-4" fill="none" viewBox="0 0 24 24" stroke="currentColor">
+             <path strokeLinecap="round" strokeLinejoin="round" strokeWidth={2} d="M12 8v4m0 4h.01M21 12a9 9 0 11-18 0 9 9 0 0118 0z" />
+           </svg>
+           <h2 className="text-xl font-bold text-red-700 mb-2">Error Loading Data</h2>
+           <p className="text-gray-600">{error}</p>
+           <button
+             onClick={() => window.location.reload()}
+             className="mt-4 px-4 py-2 bg-blue-500 text-white rounded hover:bg-blue-600 transition-colors"
+           >
+             Try Again
+           </button>
+         </div>
+       </div>
+     );
+   }
+
+   return (
+     <main className="min-h-screen p-4">
+       {data && <LLMComparisonDashboard data={data} />}
+     </main>
+   );
+ }
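
Review note: the fetch-and-validate step in `page.js` lives inside the effect, which makes it hard to test or reuse. One option, sketched here under assumptions, is to pull it into a small helper that takes the fetch implementation as a parameter. `loadJson`, its `fetchImpl` option, and the `signal` pass-through are hypothetical and not part of this commit.

```javascript
// Hypothetical helper: fetch a JSON document and reject on non-2xx responses.
// fetchImpl is injectable so the function can be exercised without a network.
async function loadJson(url, { fetchImpl = fetch, signal } = {}) {
  const response = await fetchImpl(url, { signal });
  if (!response.ok) {
    throw new Error(`Failed to fetch data: ${response.status} ${response.statusText}`);
  }
  return response.json();
}

// Example with a stubbed fetch that always succeeds.
const fakeFetch = async () => ({
  ok: true,
  status: 200,
  statusText: 'OK',
  json: async () => ({ models: [] }),
});
loadJson('/llm_comparison_data.json', { fetchImpl: fakeFetch })
  .then(data => console.log('loaded models:', data.models.length));
```

Passing an `AbortController` signal from the effect and aborting it in the cleanup function would also avoid state updates after the component unmounts.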
leaderboard-app/components/HeadToHeadComparison.jsx ADDED
@@ -0,0 +1,1002 @@
+ "use client";
+
+ import React, { useState, useEffect, useMemo, useCallback } from "react";
+ import {
+   BarChart,
+   Bar,
+   XAxis,
+   YAxis,
+   CartesianGrid,
+   Tooltip,
+   Legend,
+   ResponsiveContainer,
+   RadarChart,
+   PolarGrid,
+   PolarAngleAxis,
+   PolarRadiusAxis,
+   Radar,
+   ComposedChart,
+   Cell,
+   ReferenceLine
+ } from "recharts";
+
+ // Format facet names for display
+ const formatFacetName = (facet) => {
+   const facetMap = {
+     "helpfulness": "Helpfulness",
+     "communication": "Communication",
+     "insightful": "Insightfulness",
+     "adaptiveness": "Adaptiveness",
+     "trustworthiness": "Trustworthiness",
+     "personality": "Personality",
+     "background_and_culture": "Cultural Awareness"
+   };
+
+   return facetMap[facet] || (facet ? facet.replace(/_/g, ' ').replace(/\b\w/g, l => l.toUpperCase()) : facet);
+ };
+
+ // Format aspect names for display
+ const formatAspectName = (aspect) => {
+   const aspectMap = {
+     "effectiveness": "Effectiveness",
+     "comprehensiveness": "Comprehensiveness",
+     "usefulness": "Usefulness",
+     "tone_and_language_style": "Tone & Language Style",
+     "naturalness": "Naturalness",
+     "detail_and_technical_language": "Detail & Technical Language",
+     "accuracy": "Accuracy",
+     "sharpness": "Sharpness",
+     "intuitive": "Intuitiveness",
+     "flexibility": "Flexibility",
+     "clarity": "Clarity",
+     "perceptiveness": "Perceptiveness",
+     "consistency": "Consistency",
+     "confidence": "Confidence",
+     "transparency": "Transparency",
+     "personality-consistency": "Personality Consistency",
+     "personality-definition": "Personality Definition",
+     "honesty-empathy-fairness": "Honesty, Empathy & Fairness",
+     "alignment": "Alignment",
+     "cultural_relevance": "Cultural Relevance",
+     "bias_freedom": "Freedom from Bias",
+     "background_and_culture": "Background and Culture"
+   };
+
+   return aspectMap[aspect] || (aspect ? aspect.replace(/_/g, ' ').replace(/-/g, ' ').replace(/\b\w/g, l => l.toUpperCase()) : aspect);
+ };
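
Review note: both formatters share the same shape, a lookup table with a title-case fallback for unknown keys. A minimal sketch of one shared helper is below. `humanizeKey` is a hypothetical name. It uses an own-property check so that a key such as `"constructor"` does not resolve to an inherited object property, which a plain `map[key] || ...` lookup would return.

```javascript
// Hypothetical shared helper: explicit label if one exists, otherwise
// replace underscores/hyphens with spaces and capitalize each word.
function humanizeKey(key, labels = {}) {
  if (!key) return key;
  if (Object.prototype.hasOwnProperty.call(labels, key)) return labels[key];
  return key
    .replace(/[_-]/g, ' ')
    .replace(/\b\w/g, letter => letter.toUpperCase());
}

const facetLabels = { insightful: 'Insightfulness' };
console.log(humanizeKey('insightful', facetLabels)); // explicit label
console.log(humanizeKey('bias_freedom'));            // fallback title-casing
```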
+
+ // Format and style value differences
+ const formatDifference = (value, isPercent = false) => {
+   const formatted = isPercent ? `${Math.abs(value).toFixed(1)}%` : Math.abs(value).toFixed(1);
+   const prefix = value > 0 ? '+' : value < 0 ? '-' : '';
+   return `${prefix}${formatted}`;
+ };
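
Review note: because the sign is taken from the raw value while the magnitude is rounded to one decimal, a small value such as -0.04 renders as "-0.0". If that matters for the display, one possible variant (a sketch only; `formatSignedDelta` is a hypothetical name) decides the sign after rounding:

```javascript
// Hypothetical variant of formatDifference: values that round to 0.0 get no sign.
function formatSignedDelta(value, isPercent = false) {
  const rounded = Number(Math.abs(value).toFixed(1));
  const sign = rounded === 0 ? '' : value > 0 ? '+' : '-';
  return `${sign}${rounded.toFixed(1)}${isPercent ? '%' : ''}`;
}

console.log(formatSignedDelta(2.5));      // "+2.5"
console.log(formatSignedDelta(-0.04));    // "0.0"
console.log(formatSignedDelta(12, true)); // "+12.0%"
```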
+
+ // Get color for difference values with consistent scale
+ const getDiffColor = (value, scale = "normal") => {
+   // For facet scores (-100 to +100)
+   if (scale === "facet") {
+     if (value > 10) return 'text-green-600';
+     if (value < -10) return 'text-red-600';
+     return 'text-gray-600';
+   }
+
+   // For aspect scores (0 to 100)
+   if (scale === "aspect") {
+     if (value > 5) return 'text-green-600';
+     if (value < -5) return 'text-red-600';
+     return 'text-gray-600';
+   }
+
+   // Default
+   if (value > 0.3) return 'text-green-600';
+   if (value < -0.3) return 'text-red-600';
+   return 'text-gray-600';
+ };
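
Review note: the three branches differ only in their threshold (10, 5, and 0.3), so the same behavior can be expressed as a lookup table. This is a sketch of that alternative, not a required change. `DIFF_THRESHOLDS` and `diffColorClass` are hypothetical names.

```javascript
// Thresholds per scale, taken from the branches in getDiffColor.
const DIFF_THRESHOLDS = { facet: 10, aspect: 5, normal: 0.3 };

// Return the Tailwind text color class for a difference on the given scale.
// Unknown scales fall back to the "normal" threshold, as in the original.
function diffColorClass(value, scale = 'normal') {
  const threshold = Object.prototype.hasOwnProperty.call(DIFF_THRESHOLDS, scale)
    ? DIFF_THRESHOLDS[scale]
    : DIFF_THRESHOLDS.normal;
  if (value > threshold) return 'text-green-600';
  if (value < -threshold) return 'text-red-600';
  return 'text-gray-600';
}
```

Keeping the thresholds in one object makes it easier to see, and to adjust, what counts as a meaningful difference on each scale.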
+
+ // Custom tooltip with proper formatting
+ const CustomTooltip = ({ active, payload, label }) => {
+   if (active && payload && payload.length) {
+     // Only string labels can be facet keys; guard so numeric or missing labels do not throw
+     const formattedLabel = typeof label === 'string' && label.includes('_')
+       ? formatFacetName(label.toLowerCase())
+       : label;
+     // Look up the optional "difference" series once instead of on every use
+     const differenceEntry = payload.find(p => p.dataKey === 'difference');
+
+     return (
+       <div className="bg-white p-3 border rounded shadow-sm">
+         <p className="font-medium">{formattedLabel}</p>
+         <div className="mt-2">
+           {payload
+             .filter(entry => !entry.dataKey.includes('_std') && !entry.dataKey.includes('difference'))
+             .map((entry, index) => {
+               const stdEntry = payload.find(p => p.dataKey === `${entry.dataKey}_std`);
+               const stdValue = stdEntry ? stdEntry.value : 0;
+
+               return (
+                 <div key={index} className="flex items-center text-sm mb-1">
+                   <div
+                     className="w-3 h-3 rounded-full mr-1"
+                     style={{ backgroundColor: entry.color }}
+                   ></div>
+                   <span className="mr-2">{entry.name}:</span>
+                   <span className="font-medium">{entry.value.toFixed(1)} {stdValue ? `± ${stdValue.toFixed(1)}` : ''}</span>
+                 </div>
+               );
+             })}
+
+           {/* Add difference if available */}
+           {differenceEntry && (
+             <div className="mt-2 pt-1 border-t">
+               <div className="flex items-center text-sm">
+                 <span className="mr-2">Difference:</span>
+                 <span className={`font-medium ${getDiffColor(differenceEntry.value, 'facet')}`}>
+                   {formatDifference(differenceEntry.value)}
+                 </span>
+               </div>
+             </div>
+           )}
+         </div>
+       </div>
+     );
+   }
+   return null;
+ };
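
Review note: the tooltip relies on a naming convention in which each value series `<key>` may have a companion `<key>_std` series in the same payload. The pairing step can be sketched as a pure function, which is easier to test than the JSX. `pairWithStd` is a hypothetical helper, and the payload shape (`dataKey`, `name`, `value`) is assumed from how the component reads it.

```javascript
// Pair each value series with its "<key>_std" companion (0 when absent),
// skipping the std and difference series themselves.
function pairWithStd(payload) {
  const byKey = new Map(payload.map(entry => [String(entry.dataKey), entry]));
  return payload
    .filter(entry => {
      const key = String(entry.dataKey);
      return !key.includes('_std') && !key.includes('difference');
    })
    .map(entry => ({
      name: entry.name,
      value: entry.value,
      std: byKey.get(`${entry.dataKey}_std`)?.value ?? 0,
    }));
}
```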
+
+ // Custom tooltip for comparative bar chart
+ const ComparativeBarTooltip = ({ active, payload, label }) => {
+   if (active && payload && payload.length) {
+     const model1 = payload[0]?.name;
+     const model2 = payload[1]?.name;
+     const model1Value = payload[0]?.value;
+     const model2Value = payload[1]?.value;
+     const difference = model1Value !== undefined && model2Value !== undefined ? model1Value - model2Value : null;
+
+     return (
+       <div className="bg-white p-3 border rounded shadow-sm">
+         <p className="font-medium mb-1">{label}</p>
+         {payload.map((entry, index) => (
+           <div key={index} className="flex items-center text-sm mb-1">
+             <div
+               className="w-3 h-3 rounded-full mr-1"
+               style={{ backgroundColor: entry.color }}
+             ></div>
+             <span className="mr-2">{entry.name}:</span>
+             <span className="font-medium">{entry.value.toFixed(1)}</span>
+           </div>
+         ))}
+         {difference !== null && (
+           <div className={`text-sm mt-1 pt-1 border-t ${getDiffColor(difference, 'aspect')}`}>
+             Difference: {formatDifference(difference)}
+           </div>
+         )}
+       </div>
+     );
+   }
+   return null;
+ };
+
+ const HeadToHeadComparison = ({ data }) => {
+   const [compareModels, setCompareModels] = useState([]);
+   const [selectedView, setSelectedView] = useState("overview");
+   const [showCommonTasksOnly, setShowCommonTasksOnly] = useState(true);
+   const [selectedTaskType, setSelectedTaskType] = useState("all");
+   const [selectedDemographic, setSelectedDemographic] = useState("all");
+
+   const {
+     models,
+     taskData,
+     taskCategories,
+     radarData,
+     facets,
+     demographicSummary,
+     demographicOptions
+   } = data || {
+     models: [],
+     taskData: [],
+     taskCategories: {},
+     radarData: [],
+     facets: {},
+     demographicSummary: {},
+     demographicOptions: {}
+   };
+
+   // Initialize compare models if empty
+   useEffect(() => {
+     if (compareModels.length === 0 && models.length > 1) {
+       setCompareModels([models[0].model, models[1].model]);
+     }
+   }, [models, compareModels]);
+
+   // Get model data by name (memoized)
+   const getModelByName = useCallback((name) => {
+     return models.find(m => m.model === name);
+   }, [models]);
+
+   // Generate data for the radar chart comparison (memoized)
+   const comparisonRadarData = useMemo(() => {
+     if (compareModels.length !== 2 || !radarData) return [];
+
+     return radarData.map(item => {
+       const category = item.category;
+       const model1Score = item[compareModels[0]] || 0;
+       const model2Score = item[compareModels[1]] || 0;
+
+       return {
+         category,
+         [compareModels[0]]: model1Score,
+         [compareModels[1]]: model2Score,
+         difference: model1Score - model2Score
+       };
+     });
+   }, [compareModels, radarData]);
+
+   // Get task comparison data (memoized)
+   const taskComparisonData = useMemo(() => {
+     if (compareModels.length !== 2 || !taskData) return [];
+
+     // Filter tasks based on selectedTaskType
+     let filteredTasks = [...taskData];
+     if (selectedTaskType !== "all") {
+       filteredTasks = taskData.filter(task =>
+         taskCategories[selectedTaskType]?.includes(task.task)
+       );
+     }
+
+     // Filter for common tasks if requested
+     if (showCommonTasksOnly) {
+       filteredTasks = filteredTasks.filter(task =>
+         task[compareModels[0]] !== undefined &&
+         task[compareModels[1]] !== undefined
+       );
+     }
+
+     return filteredTasks.map(task => {
+       const model1Score = task[compareModels[0]] || 0;
+       const model2Score = task[compareModels[1]] || 0;
+
+       return {
+         task: task.task,
+         category: task.category,
+         [compareModels[0]]: model1Score,
+         [compareModels[1]]: model2Score,
+         difference: model1Score - model2Score
+       };
+     }).sort((a, b) => Math.abs(b.difference) - Math.abs(a.difference));
+   }, [compareModels, taskData, selectedTaskType, showCommonTasksOnly, taskCategories]);
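
Review note: when "All Tasks" is selected, a task scored for only one model is compared against a substituted 0, so it appears as a win by the full score. One alternative, sketched here as a pure function outside React, is to carry missing scores as `null` and leave the difference undefined for those rows. `compareTasks` and its option name are hypothetical, and the row shape is assumed from the memo above.

```javascript
// Build comparison rows for two models. Rows missing a score keep null
// instead of 0, and their difference is null rather than a fake gap.
function compareTasks(tasks, modelA, modelB, { commonOnly = true } = {}) {
  return tasks
    .filter(task => !commonOnly ||
      (task[modelA] !== undefined && task[modelB] !== undefined))
    .map(task => {
      const a = task[modelA] ?? null;
      const b = task[modelB] ?? null;
      return {
        task: task.task,
        category: task.category,
        [modelA]: a,
        [modelB]: b,
        difference: a !== null && b !== null ? a - b : null,
      };
    })
    // Largest absolute gap first; rows without a difference sort last.
    .sort((x, y) => Math.abs(y.difference ?? 0) - Math.abs(x.difference ?? 0));
}
```

Downstream consumers would then need to skip `null` differences when counting wins or averaging.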
+
+   // Get facet comparison data (memoized)
+   const facetComparisonData = useMemo(() => {
+     if (compareModels.length !== 2 || !radarData) return [];
+
+     return radarData
+       .filter(item => item.category !== "Repeat Usage") // Skip repeat usage
+       .map(item => {
+         const model1Score = item[compareModels[0]] || 0;
+         const model2Score = item[compareModels[1]] || 0;
+
+         return {
+           facet: item.category,
+           [compareModels[0]]: model1Score,
+           [compareModels[1]]: model2Score,
+           difference: model1Score - model2Score
+         };
+       })
+       .sort((a, b) => Math.abs(b.difference) - Math.abs(a.difference));
+   }, [compareModels, radarData]);
+
+   // Get aspect comparison data for all facets (memoized)
+   const aspectComparisonData = useMemo(() => {
+     if (compareModels.length !== 2) return [];
+
+     const model1 = getModelByName(compareModels[0]);
+     const model2 = getModelByName(compareModels[1]);
+
+     if (!model1 || !model2 || !facets) return [];
+
+     const aspectData = [];
+
+     // For each facet, get aspect comparison
+     Object.entries(facets).forEach(([facet, aspects]) => {
+       if (facet === "repeat_usage") return; // Skip repeat usage
+
+       // For each aspect in this facet
+       aspects.forEach(aspect => {
+         const model1Score = model1.breakdown_scores?.[aspect] || 0;
+         const model2Score = model2.breakdown_scores?.[aspect] || 0;
+
+         aspectData.push({
+           facet: formatFacetName(facet),
+           aspect: formatAspectName(aspect),
+           [model1.model]: model1Score,
+           [model2.model]: model2Score,
+           difference: model1Score - model2Score
+         });
+       });
+     });
+
+     return aspectData.sort((a, b) => Math.abs(b.difference) - Math.abs(a.difference));
+   }, [compareModels, facets, getModelByName]);
+
+   // Calculate key findings & summary stats (memoized)
+   const summaryStats = useMemo(() => {
+     if (compareModels.length !== 2) return null;
+
+     const model1 = getModelByName(compareModels[0]);
+     const model2 = getModelByName(compareModels[1]);
+
+     if (!model1 || !model2) return null;
+
+     // Count tasks where each model wins
+     const model1Wins = taskComparisonData.filter(t => t[compareModels[0]] > t[compareModels[1]]).length;
+     const model2Wins = taskComparisonData.filter(t => t[compareModels[1]] > t[compareModels[0]]).length;
+     const ties = taskComparisonData.filter(t => t[compareModels[0]] === t[compareModels[1]]).length;
+
+     // Calculate average difference across all tasks
+     const avgDifference = taskComparisonData.length > 0
+       ? taskComparisonData.reduce((sum, task) => sum + (task[compareModels[0]] - task[compareModels[1]]), 0) / taskComparisonData.length
+       : 0;
+
+     // Find biggest win for each model; report one only if that model actually
+     // leads on some task, so a loss is never displayed as an advantage
+     const model1Top = [...taskComparisonData].sort((a, b) => b.difference - a.difference)[0];
+     const model2Top = [...taskComparisonData].sort((a, b) => a.difference - b.difference)[0];
+     const model1BiggestWin = model1Top?.difference > 0 ? model1Top : undefined;
+     const model2BiggestWin = model2Top?.difference < 0 ? model2Top : undefined;
+
+     // Facet where each model most outperforms the other
+     const model1BestFacet = [...facetComparisonData].sort((a, b) => b.difference - a.difference)[0];
+     const model2BestFacet = [...facetComparisonData].sort((a, b) => a.difference - b.difference)[0];
+
+     // Aspect where each model most outperforms the other
+     const model1BestAspect = [...aspectComparisonData].sort((a, b) => b.difference - a.difference)[0];
+     const model2BestAspect = [...aspectComparisonData].sort((a, b) => a.difference - b.difference)[0];
+
+     return {
+       model1,
+       model2,
+       model1Wins,
+       model2Wins,
+       ties,
+       avgDifference,
+       model1BiggestWin,
+       model2BiggestWin,
+       model1BestFacet,
+       model2BestFacet,
+       model1BestAspect,
+       model2BestAspect
+     };
+   }, [compareModels, getModelByName, taskComparisonData, facetComparisonData, aspectComparisonData]);
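
Review note: the win counts, average difference, and biggest wins all derive from the same `difference` column, so they can be computed in a single pass instead of several filters and sorts. The sketch below shows that approach as a standalone function. `summarizeWins` and its return shape are hypothetical, and each row is assumed to carry a numeric `difference` (first model minus second).

```javascript
// One-pass summary over comparison rows with a numeric `difference`.
function summarizeWins(rows) {
  let firstWins = 0;
  let secondWins = 0;
  let ties = 0;
  let sum = 0;
  let firstBiggestWin = null;  // row with the largest positive difference
  let secondBiggestWin = null; // row with the most negative difference

  for (const row of rows) {
    const d = row.difference;
    sum += d;
    if (d > 0) {
      firstWins += 1;
      if (!firstBiggestWin || d > firstBiggestWin.difference) firstBiggestWin = row;
    } else if (d < 0) {
      secondWins += 1;
      if (!secondBiggestWin || d < secondBiggestWin.difference) secondBiggestWin = row;
    } else {
      ties += 1;
    }
  }

  return {
    firstWins,
    secondWins,
    ties,
    avgDifference: rows.length > 0 ? sum / rows.length : 0,
    firstBiggestWin,
    secondBiggestWin,
  };
}
```

A model with no winning task gets `null` for its biggest win, which the rendering code can use to hide that line.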
+
+   // Create comparative stats for high level metrics
+   const highLevelComparison = useMemo(() => {
+     if (compareModels.length !== 2) return [];
+
+     const model1 = getModelByName(compareModels[0]);
+     const model2 = getModelByName(compareModels[1]);
+
+     if (!model1 || !model2) return [];
+
+     // Define the metrics to compare
+     const metrics = [
+       { name: 'Overall Score', key: 'overall_score', model1: model1.overall_score, model2: model2.overall_score, scale: "aspect" },
+       { name: 'Would Use Again', key: 'repeat_usage_pct', model1: model1.repeat_usage_pct, model2: model2.repeat_usage_pct, isPercent: true }
+     ];
+
+     // Add facet comparisons
+     if (model1.facet_scores && model2.facet_scores) {
+       Object.keys(model1.facet_scores)
+         .filter(key => !key.includes('_std') && key !== 'repeat_usage') // Skip std and repeat_usage
+         .forEach(facet => {
+           metrics.push({
+             name: formatFacetName(facet),
+             key: `facet_${facet}`,
+             model1: model1.facet_scores[facet],
+             model2: model2.facet_scores[facet],
+             scale: "facet"
+           });
+         });
+     }
+
+     return metrics.map(metric => ({
+       name: metric.name,
+       key: metric.key,
+       [model1.model]: metric.model1,
+       [model2.model]: metric.model2,
+       difference: metric.model1 - metric.model2,
+       percentDifference: ((metric.model1 - metric.model2) / Math.abs(metric.model2)) * 100,
+       isPercent: metric.isPercent,
+       scale: metric.scale
+     }));
+   }, [compareModels, getModelByName]);
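
Review note: a percent difference relative to the second model is undefined when that model's score is zero or missing, which can happen for facet scores centered on zero. A small guarded helper is one way to make that case explicit. `percentDifference` as a standalone function and its `null` return convention are assumptions for illustration. Callers that format the result would need to handle `null`.

```javascript
// Percent difference of `a` relative to |b|; null when it is not defined.
function percentDifference(a, b) {
  if (!Number.isFinite(a) || !Number.isFinite(b) || b === 0) return null;
  return ((a - b) / Math.abs(b)) * 100;
}

console.log(percentDifference(60, 50)); // 20
console.log(percentDifference(1, 0));   // null
```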
+
+   return (
+     <div>
+       <h2 className="text-2xl font-bold mb-2">Head-to-Head Model Comparison</h2>
+       <p className="text-gray-600 mb-4">
+         Directly compare two models across all performance metrics to identify strengths and
+         weaknesses of each model relative to one another.
+       </p>
+
+       {/* Sticky Model Selection Panel */}
+       <div className="sticky top-0 z-10 bg-white border rounded-lg p-4 mb-6 shadow-sm">
+         <div className="flex flex-wrap items-center justify-between">
+           <div className="flex items-center space-x-4">
+             <div>
+               <label className="block text-sm font-medium text-gray-700 mb-1">First Model</label>
+               <select
+                 className="border rounded p-1.5 bg-white shadow-sm focus:outline-none focus:ring-1 focus:ring-blue-500"
+                 value={compareModels[0] || ''}
+                 onChange={(e) => setCompareModels([e.target.value, compareModels[1] || ''])}
+               >
+                 {models.map(model => (
+                   <option
+                     key={`model1-${model.model}`}
+                     value={model.model}
+                     disabled={model.model === compareModels[1]}
+                   >
+                     {model.model}
+                   </option>
+                 ))}
+               </select>
+             </div>
+
+             <div className="text-lg font-bold text-gray-500">vs</div>
+
+             <div>
+               <label className="block text-sm font-medium text-gray-700 mb-1">Second Model</label>
+               <select
+                 className="border rounded p-1.5 bg-white shadow-sm focus:outline-none focus:ring-1 focus:ring-blue-500"
+                 value={compareModels[1] || ''}
+                 onChange={(e) => setCompareModels([compareModels[0] || '', e.target.value])}
+               >
+                 {models.map(model => (
+                   <option
+                     key={`model2-${model.model}`}
+                     value={model.model}
+                     disabled={model.model === compareModels[0]}
+                   >
+                     {model.model}
+                   </option>
+                 ))}
+               </select>
+             </div>
+           </div>
+
+           <div className="mt-2 sm:mt-0">
+             <label className="text-sm text-gray-500 mr-2">Show only tasks with data for both models:</label>
+             <button
+               className={`px-3 py-1 text-xs font-medium rounded ${
+                 showCommonTasksOnly
+                   ? "bg-blue-100 text-blue-800 border border-blue-300"
+                   : "bg-gray-100 text-gray-800 border border-gray-300"
+               }`}
+               onClick={() => setShowCommonTasksOnly(!showCommonTasksOnly)}
+             >
+               {showCommonTasksOnly ? 'Common Tasks Only' : 'All Tasks'}
+             </button>
+           </div>
+         </div>
+       </div>
+
+       {/* Tab Navigation */}
+       <div className="mb-4 border-b">
+         <div className="flex flex-wrap">
+           {["overview", "tasks", "facets", "aspects", "demographics"].map((tab) => (
+             <button
+               key={tab}
+               className={`px-6 py-3 font-medium text-sm ${
+                 selectedView === tab
+                   ? "bg-white text-blue-700 border-b-2 border-blue-500"
+                   : "text-gray-600 hover:text-gray-800 hover:bg-gray-50"
+               }`}
+               onClick={() => setSelectedView(tab)}
+             >
+               {tab.charAt(0).toUpperCase() + tab.slice(1)}
+             </button>
+           ))}
+         </div>
+       </div>
+
+       {/* Key Findings Section (Always Visible) */}
+       {summaryStats && (
+         <div className="border rounded-lg overflow-hidden mb-6 bg-blue-50">
+           <div className="px-4 py-2 bg-blue-100 border-b">
+             <h3 className="font-semibold">Key Insights</h3>
+           </div>
+           <div className="p-4">
+             <div className="grid grid-cols-1 md:grid-cols-3 gap-4">
+               {/* Overall Comparison */}
+               <div className="bg-white rounded-lg shadow-sm p-3">
+                 <h4 className="text-sm font-medium text-gray-700 mb-2">Overall Comparison</h4>
+                 <div className="flex items-center mb-2">
+                   <div className="w-3 h-3 rounded-full mr-1" style={{ backgroundColor: summaryStats.model1.color }}></div>
+                   <span className="font-medium mr-2">{summaryStats.model1.model}:</span>
+                   <span>{summaryStats.model1.overall_score.toFixed(1)}</span>
+                 </div>
+                 <div className="flex items-center mb-2">
+                   <div className="w-3 h-3 rounded-full mr-1" style={{ backgroundColor: summaryStats.model2.color }}></div>
+                   <span className="font-medium mr-2">{summaryStats.model2.model}:</span>
+                   <span>{summaryStats.model2.overall_score.toFixed(1)}</span>
+                 </div>
+                 <div className="mt-2 text-sm">
+                   <span className="font-medium">Average Difference: </span>
+                   <span className={
+                     Math.abs(summaryStats.avgDifference) < 1 ? "text-gray-600" :
+                     summaryStats.avgDifference > 0 ? "text-green-600 font-medium" : "text-red-600 font-medium"
+                   }>
+                     {summaryStats.avgDifference > 0 ? '+' : ''}{summaryStats.avgDifference.toFixed(1)}
+                   </span>
+                 </div>
+               </div>
+
+               {/* Task Wins */}
+               <div className="bg-white rounded-lg shadow-sm p-3">
+                 <h4 className="text-sm font-medium text-gray-700 mb-2">Task Win Distribution</h4>
+                 <div className="flex items-center justify-between mb-1">
+                   <div className="flex items-center">
+                     <div className="w-3 h-3 rounded-full mr-1" style={{ backgroundColor: summaryStats.model1.color }}></div>
+                     <span>{summaryStats.model1.model}</span>
+                   </div>
+                   <span className="font-medium">{summaryStats.model1Wins} tasks</span>
+                 </div>
+                 <div className="flex items-center justify-between mb-1">
+                   <div className="flex items-center">
+                     <div className="w-3 h-3 rounded-full mr-1" style={{ backgroundColor: summaryStats.model2.color }}></div>
+                     <span>{summaryStats.model2.model}</span>
+                   </div>
+                   <span className="font-medium">{summaryStats.model2Wins} tasks</span>
+                 </div>
+                 {summaryStats.ties > 0 && (
+                   <div className="flex items-center justify-between">
+                     <span className="text-gray-600">Ties</span>
+                     <span className="font-medium">{summaryStats.ties} tasks</span>
+                   </div>
+                 )}
+               </div>
+
+               {/* Key Advantages */}
+               <div className="bg-white rounded-lg shadow-sm p-3">
+                 <h4 className="text-sm font-medium text-gray-700 mb-2">Biggest Advantages</h4>
+                 {summaryStats.model1BiggestWin && (
+                   <div className="mb-2">
+                     <div className="flex items-center">
+                       <div className="w-3 h-3 rounded-full mr-1" style={{ backgroundColor: summaryStats.model1.color }}></div>
+                       <span className="font-medium text-sm">{summaryStats.model1.model}:</span>
+                     </div>
+                     <div className="text-sm ml-4 mt-0.5">
+                       {summaryStats.model1BiggestWin.task.length > 30
+                         ? summaryStats.model1BiggestWin.task.slice(0, 30) + '...'
+                         : summaryStats.model1BiggestWin.task}
+                       <span className="text-green-600 font-medium ml-1">
+                         (+{summaryStats.model1BiggestWin.difference.toFixed(1)})
+                       </span>
+                     </div>
+                   </div>
+                 )}
+                 {summaryStats.model2BiggestWin && (
+                   <div>
+                     <div className="flex items-center">
+                       <div className="w-3 h-3 rounded-full mr-1" style={{ backgroundColor: summaryStats.model2.color }}></div>
+                       <span className="font-medium text-sm">{summaryStats.model2.model}:</span>
+                     </div>
+                     <div className="text-sm ml-4 mt-0.5">
+                       {summaryStats.model2BiggestWin.task.length > 30
+                         ? summaryStats.model2BiggestWin.task.slice(0, 30) + '...'
+                         : summaryStats.model2BiggestWin.task}
+                       <span className="text-green-600 font-medium ml-1">
+                         (+{Math.abs(summaryStats.model2BiggestWin.difference).toFixed(1)})
+                       </span>
+                     </div>
+                   </div>
+                 )}
+               </div>
+             </div>
+           </div>
+         </div>
+       )}
591
+
592
+ {/* OVERVIEW TAB */}
593
+ {selectedView === "overview" && summaryStats && (
594
+ <div>
595
+ {/* Side-by-side charts */}
596
+ <div className="grid grid-cols-1 lg:grid-cols-2 gap-6 mb-6">
597
+ {/* Radar Chart */}
598
+ <div className="border rounded-lg overflow-hidden">
599
+ <div className="px-4 py-2 bg-gray-50 border-b">
600
+ <h3 className="font-semibold">Facet Comparison</h3>
601
+ </div>
602
+ <div className="p-4">
603
+ <div className="h-80">
604
+ <ResponsiveContainer width="100%" height="100%">
605
+ <RadarChart
606
+ outerRadius={130}
607
+ data={comparisonRadarData}
608
+ margin={{ top: 30, right: 30, bottom: 30, left: 30 }}
609
+ >
610
+ <PolarGrid gridType="polygon" />
611
+ <PolarAngleAxis
612
+ dataKey="category"
613
+ tick={{ fill: '#4b5563', fontSize: 14 }}
614
+ tickLine={false}
615
+ tickFormatter={(value) => {
616
+ if (value.includes('_') || value === "Insightful") {
617
+ return formatFacetName(value.toLowerCase());
618
+ }
619
+ return value;
620
+ }}
621
+ />
622
+ <PolarRadiusAxis
623
+ angle={90}
624
+ domain={[-100, 100]}
625
+ axisLine={false}
626
+ tickCount={5}
627
+ />
628
+ {compareModels.map(modelName => {
629
+ const model = getModelByName(modelName);
630
+ return (
631
+ <Radar
632
+ key={modelName}
633
+ name={modelName}
634
+ dataKey={modelName}
635
+ stroke={model?.color || '#999'}
636
+ fill={model?.color || '#999'}
637
+ fillOpacity={0.2}
638
+ strokeWidth={2}
639
+ />
640
+ );
641
+ })}
642
+ <Tooltip content={<CustomTooltip />} />
643
+ <Legend />
644
+ </RadarChart>
645
+ </ResponsiveContainer>
646
+ </div>
647
+ </div>
648
+ </div>
649
+
650
+ {/* Gap Analysis */}
651
+ <div className="border rounded-lg overflow-hidden">
652
+ <div className="px-4 py-2 bg-gray-50 border-b">
653
+ <h3 className="font-semibold">Facet Gap Analysis</h3>
654
+ </div>
655
+ <div className="p-4">
656
+ <div className="h-80">
657
+ <ResponsiveContainer width="100%" height="100%">
658
+ <ComposedChart
659
+ layout="vertical"
660
+ data={facetComparisonData}
661
+ margin={{ top: 20, right: 60, left: 100, bottom: 20 }}
662
+ >
663
+ <CartesianGrid strokeDasharray="3 3" />
664
+ <XAxis
665
+ type="number"
666
+ domain={[-50, 50]}
667
+ tickFormatter={(value) => value > 0 ? `+${value.toFixed(0)}` : value.toFixed(0)}
668
+ />
669
+ <YAxis
670
+ dataKey="facet"
671
+ type="category"
672
+ width={100}
673
+ />
674
+ <Tooltip
675
+ formatter={(value) => [value.toFixed(1), 'Difference']}
676
+ />
677
+ <Legend />
678
+ <Bar
679
+ dataKey="difference"
680
+ name={`${compareModels[0]} vs ${compareModels[1]}`}
681
+ barSize={20}
682
+ >
683
+ {facetComparisonData.map((entry, index) => (
684
+ <Cell
685
+ key={`cell-${index}`}
686
+ fill={entry.difference > 0 ? getModelByName(compareModels[0])?.color : getModelByName(compareModels[1])?.color}
687
+ />
688
+ ))}
689
+ </Bar>
690
+ <ReferenceLine x={0} stroke="#666" strokeWidth={2} />
691
+ </ComposedChart>
692
+ </ResponsiveContainer>
693
+ </div>
694
+ <div className="text-xs text-gray-500 text-center mt-2">
695
+ Bars extending right indicate {compareModels[0]} is better; bars extending left indicate {compareModels[1]} is better.
696
+ </div>
697
+ </div>
698
+ </div>
699
+ </div>
700
+
701
+ {/* Key Metrics Table */}
702
+ <div className="border rounded-lg overflow-hidden mb-6">
703
+ <div className="px-4 py-2 bg-gray-50 border-b">
704
+ <h3 className="font-semibold">Key Metrics Comparison</h3>
705
+ </div>
706
+ <div className="p-4">
707
+ <div className="overflow-x-auto">
708
+ <table className="min-w-full divide-y divide-gray-200">
709
+ <thead className="bg-gray-50">
710
+ <tr>
711
+ <th className="px-4 py-2 text-left text-xs font-medium text-gray-500 uppercase tracking-wider">Metric</th>
712
+ {compareModels.map(modelName => {
713
+ const model = getModelByName(modelName);
714
+ return (
715
+ <th key={modelName} className="px-4 py-2 text-left text-xs font-medium uppercase tracking-wider" style={{ color: model?.color }}>
716
+ {modelName}
717
+ </th>
718
+ );
719
+ })}
720
+ <th className="px-4 py-2 text-left text-xs font-medium text-gray-500 uppercase tracking-wider">Difference</th>
721
+ </tr>
722
+ </thead>
723
+ <tbody className="bg-white divide-y divide-gray-200">
724
+ {highLevelComparison.map((metric) => (
725
+ <tr key={metric.key} className="hover:bg-gray-50">
726
+ <td className="px-4 py-3 whitespace-nowrap text-sm font-medium text-gray-900">
727
+ {metric.name}
728
+ </td>
729
+ {compareModels.map(modelName => {
730
+ const value = metric[modelName];
731
+ const isPercent = metric.isPercent;
732
+
733
+ return (
734
+ <td key={`${metric.key}-${modelName}`} className="px-4 py-3 whitespace-nowrap text-sm text-gray-700">
735
+ <span className={`font-medium ${metric.difference !== 0 && modelName === compareModels[0] && metric.difference > 0 ? 'text-green-600' : ''} ${metric.difference !== 0 && modelName === compareModels[1] && metric.difference < 0 ? 'text-green-600' : ''}`}>
736
+ {isPercent ? `${value.toFixed(1)}%` : value.toFixed(1)}
737
+ </span>
738
+ </td>
739
+ );
740
+ })}
741
+ <td className="px-4 py-3 whitespace-nowrap text-sm">
742
+ <span className={`font-medium ${getDiffColor(metric.difference, metric.scale)}`}>
743
+ {formatDifference(metric.difference, metric.isPercent)}
744
+ </span>
745
+ </td>
746
+ </tr>
747
+ ))}
748
+ </tbody>
749
+ </table>
750
+ </div>
751
+ <div className="text-xs text-gray-500 mt-3">
752
+ Differences are calculated as {compareModels[0]} minus {compareModels[1]}. Positive values indicate {compareModels[0]} is higher.
753
+ </div>
754
+ </div>
755
+ </div>
756
+
757
+ {/* Interactive Recommendation */}
758
+ <div className="border rounded-lg overflow-hidden mb-6 bg-blue-50">
759
+ <div className="px-4 py-2 bg-blue-100 border-b">
760
+ <h3 className="font-semibold">When to Use Each Model</h3>
761
+ </div>
762
+ <div className="p-4 text-sm text-gray-800">
763
+ <div className="grid grid-cols-1 sm:grid-cols-2 gap-6">
764
+ <div className="bg-white rounded-lg p-4 shadow-sm">
765
+ <h4 className="font-medium mb-2" style={{ color: summaryStats.model1.color }}>
766
+ When to use {summaryStats.model1.model}:
767
+ </h4>
768
+ <ul className="list-disc pl-5 space-y-1 text-sm">
769
+ <li>For {summaryStats.model1BestFacet?.facet.toLowerCase() || 'overall'} focused tasks</li>
770
+ {summaryStats.model1BiggestWin && (
771
+ <li>When working on tasks like "{summaryStats.model1BiggestWin.task}"</li>
772
+ )}
773
+ {summaryStats.model1BestAspect && (
774
+ <li>When {summaryStats.model1BestAspect.aspect.toLowerCase()} is important</li>
775
+ )}
776
+ </ul>
777
+ </div>
778
+ <div className="bg-white rounded-lg p-4 shadow-sm">
779
+ <h4 className="font-medium mb-2" style={{ color: summaryStats.model2.color }}>
780
+ When to use {summaryStats.model2.model}:
781
+ </h4>
782
+ <ul className="list-disc pl-5 space-y-1 text-sm">
783
+ <li>For {summaryStats.model2BestFacet?.facet.toLowerCase() || 'overall'} focused tasks</li>
784
+ {summaryStats.model2BiggestWin && (
785
+ <li>When working on tasks like "{summaryStats.model2BiggestWin.task}"</li>
786
+ )}
787
+ {summaryStats.model2BestAspect && (
788
+ <li>When {summaryStats.model2BestAspect.aspect.toLowerCase()} is important</li>
789
+ )}
790
+ </ul>
791
+ </div>
792
+ </div>
793
+ </div>
794
+ </div>
795
+ </div>
796
+ )}
797
+
798
+ {/* TASKS TAB */}
799
+ {selectedView === "tasks" && (
800
+ <div>
801
+ {/* Task Type Filter */}
802
+ <div className="mb-4 overflow-x-auto pb-2">
803
+ <div className="flex space-x-2">
804
+ <button
805
+ className={`px-3 py-1 text-sm font-medium rounded-full whitespace-nowrap ${
806
+ selectedTaskType === "all"
807
+ ? "bg-blue-100 text-blue-800"
808
+ : "bg-gray-100 text-gray-800"
809
+ }`}
810
+ onClick={() => setSelectedTaskType("all")}
811
+ >
812
+ All Tasks
813
+ </button>
814
+ {Object.keys(taskCategories || {}).map(category => (
815
+ <button
816
+ key={category}
817
+ className={`px-3 py-1 text-sm font-medium rounded-full whitespace-nowrap ${
818
+ selectedTaskType === category
819
+ ? "bg-blue-100 text-blue-800"
820
+ : "bg-gray-100 text-gray-800"
821
+ }`}
822
+ onClick={() => setSelectedTaskType(category)}
823
+ >
824
+ {category.charAt(0).toUpperCase() + category.slice(1)}
825
+ </button>
826
+ ))}
827
+ </div>
828
+ </div>
829
+
830
+ {/* Task Comparison Section */}
831
+ <div className="grid grid-cols-1 lg:grid-cols-2 gap-6 mb-6">
832
+ {/* Bar Chart */}
833
+ <div className="border rounded-lg overflow-hidden">
834
+ <div className="px-4 py-2 bg-gray-50 border-b flex justify-between items-center">
835
+ <h3 className="font-semibold">Performance Comparison</h3>
836
+ </div>
837
+ <div className="p-4">
838
+ <div className="h-[450px]">
839
+ <ResponsiveContainer width="100%" height="100%">
840
+ <BarChart
841
+ data={taskComparisonData.slice(0, 10)} // Top 10 for clarity
842
+ layout="vertical"
843
+ margin={{ top: 5, right: 30, left: 150, bottom: 5 }}
844
+ >
845
+ <CartesianGrid strokeDasharray="3 3" />
846
+ <XAxis type="number" domain={[0, 100]} />
847
+ <YAxis
848
+ dataKey="task"
849
+ type="category"
850
+ width={150}
851
+ tick={{ fontSize: 12 }}
852
+ />
853
+ <Tooltip content={<ComparativeBarTooltip />} />
854
+ <Legend />
855
+ {compareModels.map(modelName => {
856
+ const model = getModelByName(modelName);
857
+ return (
858
+ <Bar
859
+ key={modelName}
860
+ dataKey={modelName}
861
+ name={modelName}
862
+ fill={model?.color || '#999'}
863
+ maxBarSize={20}
864
+ />
865
+ );
866
+ })}
867
+ </BarChart>
868
+ </ResponsiveContainer>
869
+ </div>
870
+ <div className="text-xs text-gray-500 text-center mt-2">
871
+ Showing top 10 tasks with the largest performance differences
872
+ </div>
873
+ </div>
874
+ </div>
875
+
876
+ {/* Gap Analysis */}
877
+ <div className="border rounded-lg overflow-hidden">
878
+ <div className="px-4 py-2 bg-gray-50 border-b">
879
+ <h3 className="font-semibold">Task Performance Gap</h3>
880
+ </div>
881
+ <div className="p-4">
882
+ <div className="h-[450px]">
883
+ <ResponsiveContainer width="100%" height="100%">
884
+ <ComposedChart
885
+ layout="vertical"
886
+ data={taskComparisonData.slice(0, 10)}
887
+ margin={{ top: 20, right: 30, left: 150, bottom: 20 }}
888
+ >
889
+ <CartesianGrid strokeDasharray="3 3" />
890
+ <XAxis
891
+ type="number"
892
+ domain={[-30, 30]}
893
+ tickFormatter={(value) => value > 0 ? `+${value.toFixed(0)}` : value.toFixed(0)}
894
+ />
895
+ <YAxis
896
+ dataKey="task"
897
+ type="category"
898
+ width={150}
899
+ tick={{ fontSize: 11 }}
900
+ />
901
+ <Tooltip
902
+ formatter={(value) => [value.toFixed(1), 'Difference']}
903
+ />
904
+ <Legend />
905
+ <Bar
906
+ dataKey="difference"
907
+ name={`${compareModels[0]} vs ${compareModels[1]}`}
908
+ barSize={20}
909
+ >
910
+ {taskComparisonData.slice(0, 10).map((entry, index) => (
911
+ <Cell
912
+ key={`cell-${index}`}
913
+ fill={entry.difference > 0 ? getModelByName(compareModels[0])?.color : getModelByName(compareModels[1])?.color}
914
+ />
915
+ ))}
916
+ </Bar>
917
+ <ReferenceLine x={0} stroke="#666" strokeWidth={2} />
918
+ </ComposedChart>
919
+ </ResponsiveContainer>
920
+ </div>
921
+ <div className="text-xs text-gray-500 text-center mt-2">
922
+ Bars extending right indicate {compareModels[0]} is better; bars extending left indicate {compareModels[1]} is better.
923
+ </div>
924
+ </div>
925
+ </div>
926
+ </div>
927
+
928
+ {/* Task Comparison Table */}
929
+ <div className="border rounded-lg overflow-hidden mb-6">
930
+ <div className="px-4 py-2 bg-gray-50 border-b flex justify-between items-center">
931
+ <h3 className="font-semibold">Task Comparison Details</h3>
932
+ <button
933
+ onClick={() => setShowCommonTasksOnly(!showCommonTasksOnly)}
934
+ className={`px-2 py-1 rounded text-xs ${showCommonTasksOnly ? 'bg-blue-100 text-blue-800' : 'bg-gray-100 text-gray-600'}`}
935
+ >
936
+ {showCommonTasksOnly ? 'Common Tasks Only' : 'All Tasks'}
937
+ </button>
938
+ </div>
939
+ <div className="p-4">
940
+ <div className="overflow-x-auto">
941
+ <table className="min-w-full divide-y divide-gray-200">
942
+ <thead className="bg-gray-50">
943
+ <tr>
944
+ <th className="px-4 py-2 text-left text-xs font-medium text-gray-500 uppercase tracking-wider">Task</th>
945
+ <th className="px-4 py-2 text-left text-xs font-medium text-gray-500 uppercase tracking-wider">Category</th>
946
+ <th className="px-4 py-2 text-right text-xs font-medium text-gray-500 uppercase tracking-wider">{compareModels[0]}</th>
947
+ <th className="px-4 py-2 text-right text-xs font-medium text-gray-500 uppercase tracking-wider">{compareModels[1]}</th>
948
+ <th className="px-4 py-2 text-center text-xs font-medium text-gray-500 uppercase tracking-wider">Difference</th>
949
+ <th className="px-4 py-2 text-left text-xs font-medium text-gray-500 uppercase tracking-wider">Better Model</th>
950
+ </tr>
951
+ </thead>
952
+ <tbody className="bg-white divide-y divide-gray-200">
953
+ {taskComparisonData.slice(0, 15).map((task, idx) => (
954
+ <tr key={task.task} className={idx % 2 === 0 ? 'bg-white' : 'bg-gray-50'}>
955
+ <td className="px-4 py-2 text-sm whitespace-normal">{task.task}</td>
956
+ <td className="px-4 py-2 text-sm">{task.category}</td>
957
+ <td className="px-4 py-2 text-sm text-right">{task[compareModels[0]].toFixed(1)}</td>
958
+ <td className="px-4 py-2 text-sm text-right">{task[compareModels[1]].toFixed(1)}</td>
959
+ <td className="px-4 py-2 text-sm text-center">
960
+ <span className={`font-medium ${getDiffColor(task.difference, "aspect")}`}>
961
+ {task.difference > 0 ? '+' : ''}{task.difference.toFixed(1)}
962
+ </span>
963
+ </td>
964
+ <td className="px-4 py-2 text-sm">
965
+ {task.difference !== 0 && (
966
+ <div className="flex items-center">
967
+ <div
968
+ className="w-3 h-3 rounded-full mr-1"
969
+ style={{ backgroundColor: task.difference > 0
970
+ ? getModelByName(compareModels[0])?.color
971
+ : getModelByName(compareModels[1])?.color
972
+ }}
973
+ ></div>
974
+ <span>{task.difference > 0 ? compareModels[0] : compareModels[1]}</span>
975
+ </div>
976
+ )}
977
+ {task.difference === 0 && (
978
+ <span className="text-gray-500">Tie</span>
979
+ )}
980
+ </td>
981
+ </tr>
982
+ ))}
983
+ </tbody>
984
+ </table>
985
+ </div>
986
+ {taskComparisonData.length > 15 && (
987
+ <div className="text-center mt-3 text-sm text-gray-500">
988
+ Showing 15 of {taskComparisonData.length} tasks. Tasks are sorted by largest difference.
989
+ </div>
990
+ )}
991
+ </div>
992
+ </div>
993
+ </div>
994
+ )}
995
+
996
+ {/* Include implementations for other tabs (facets, aspects, demographics) */}
997
+
998
+ </div>
999
+ );
1000
+ };
1001
+
1002
+ export default HeadToHeadComparison;
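The comparison views above rely on one sign convention: each difference is the first selected model's score minus the second's, so positive values favour the first model. The following is a minimal sketch of that convention in plain JavaScript. The helper names `summarizeTaskDifferences` and `formatSigned`, the model names, and the sample scores are all hypothetical and are not taken from the component; the real `taskComparisonData` and `summaryStats` may be computed differently.

```javascript
// Hypothetical helper: per-task differences between two models.
// Convention: difference = score of modelA minus score of modelB.
function summarizeTaskDifferences(tasks, modelA, modelB) {
  const rows = tasks
    // Keep only tasks where both models have a numeric score.
    .filter((t) => Number.isFinite(t[modelA]) && Number.isFinite(t[modelB]))
    .map((t) => ({ task: t.task, difference: t[modelA] - t[modelB] }))
    // Largest absolute gap first, matching "sorted by largest difference".
    .sort((x, y) => Math.abs(y.difference) - Math.abs(x.difference));

  let aWins = 0;
  let bWins = 0;
  let ties = 0;
  for (const row of rows) {
    if (row.difference > 0) aWins += 1;
    else if (row.difference < 0) bWins += 1;
    else ties += 1;
  }
  return { rows, aWins, bWins, ties };
}

// Hypothetical helper: explicit "+" for positive values, one decimal place.
function formatSigned(difference) {
  return (difference > 0 ? "+" : "") + difference.toFixed(1);
}

// Made-up sample data for illustration only.
const sampleTasks = [
  { task: "Meal plan", "Model A": 80, "Model B": 72.5 },
  { task: "Travel itinerary", "Model A": 60, "Model B": 75 },
  { task: "Job follow-up", "Model A": 70, "Model B": 70 },
];

const summary = summarizeTaskDifferences(sampleTasks, "Model A", "Model B");
for (const row of summary.rows) {
  console.log(`${row.task}: ${formatSigned(row.difference)}`);
}
```

Keeping the tie count separate from both win counts mirrors the "Task Win Distribution" card, which only shows a ties row when at least one task has a zero difference.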
leaderboard-app/components/LLMComparisonDashboard.jsx ADDED
@@ -0,0 +1,688 @@
1
+ "use client";
2
+
3
+ import React, { useState, useMemo } from "react";
4
+ import { getScoreBadgeColor } from "../lib/utils";
5
+ import TaskDemographicAnalysis from "./TaskDemographicAnalysis";
6
+ import MetricsBreakdown from "./MetricsBreakdown";
7
+ import HeadToHeadComparison from "./HeadToHeadComparison";
8
+
9
+ // Reusable component for displaying scores with standard deviation
10
+ const ScoreWithStdDev = ({ score, stdDev, colorClass }) => {
11
+ return (
12
+ <span
13
+ className={`px-2 py-1 inline-flex text-xs font-semibold rounded-full ${colorClass}`}
14
+ >
15
+ {score.toFixed(2)} ± {stdDev.toFixed(2)}
16
+ </span>
17
+ );
18
+ };
19
+
20
+ const formatFacetName = (facet) => {
21
+ if (!facet) return "Unknown"; // Handle null or undefined facet
22
+
23
+ const facetMap = {
24
+ helpfulness: "Helpfulness",
25
+ communication: "Communication",
26
+ insightful: "Insightfulness",
27
+ adaptiveness: "Adaptiveness",
28
+ trustworthiness: "Trustworthiness",
29
+ personality: "Personality",
30
+ background_and_culture: "Cultural Awareness",
31
+ };
32
+
33
+ return (
34
+ facetMap[facet] ||
35
+ facet.replace(/_/g, " ").replace(/\b\w/g, (l) => l.toUpperCase())
36
+ );
37
+ };
38
+
39
+ const LLMComparisonDashboard = ({ data }) => {
40
+ const [activeTab, setActiveTab] = useState("overview");
41
+ const [sortConfig, setSortConfig] = useState({
42
+ key: "overall_score",
43
+ direction: "descending",
44
+ });
45
+
46
+ const {
47
+ models,
48
+ radarData,
49
+ bestModelPerCategory,
50
+ taskCategories,
51
+ keyAspectsByTask
52
+ } = data || {
53
+ models: [],
54
+ radarData: [],
55
+ taskData: [],
56
+ bestModelPerCategory: {},
57
+ bestModelPerFacet: {},
58
+ taskCategories: {},
59
+ facets: {},
60
+ demographicSummary: {},
61
+ fairnessMetrics: {},
62
+ demographicOptions: {},
63
+ keyAspectsByTask: {}
64
+ };
65
+
66
+ // Request sort function
67
+ const requestSort = (key) => {
68
+ let direction = "descending";
69
+ if (sortConfig.key === key && sortConfig.direction === "descending") {
70
+ direction = "ascending";
71
+ }
72
+ setSortConfig({ key, direction });
73
+ };
74
+
75
+ // Get sorted models
76
+ const sortedModels = useMemo(() => {
77
+ let sortableItems = [...models];
78
+ if (sortConfig.key !== null) {
79
+ sortableItems.sort((a, b) => {
80
+ let aValue, bValue;
81
+
82
+ // Handle nested properties for facet scores
83
+ if (sortConfig.key.includes(".")) {
84
+ const [group, metric] = sortConfig.key.split(".");
85
+ if (group === "facet_scores") {
86
+ aValue = a.facet_scores[metric];
87
+ bValue = b.facet_scores[metric];
88
+ } else {
89
+ aValue = a[sortConfig.key];
90
+ bValue = b[sortConfig.key];
91
+ }
92
+ } else if (sortConfig.key === "model") {
93
+ aValue = a.model;
94
+ bValue = b.model;
95
+ } else {
96
+ // For other properties directly on the model object
97
+ aValue = a[sortConfig.key];
98
+ bValue = b[sortConfig.key];
99
+ }
100
+
101
+ if (aValue < bValue) {
102
+ return sortConfig.direction === "ascending" ? -1 : 1;
103
+ }
104
+ if (aValue > bValue) {
105
+ return sortConfig.direction === "ascending" ? 1 : -1;
106
+ }
107
+ return 0;
108
+ });
109
+ }
110
+ return sortableItems;
111
+ }, [models, sortConfig]);
112
+
113
+ // Custom tooltip for the radar chart
114
+ const CustomTooltip = ({ active, payload }) => {
115
+ if (active && payload && payload.length) {
116
+ return (
117
+ <div className="p-2 bg-white border border-gray-200 rounded shadow-sm">
118
+ {payload.map((entry, index) => {
119
+ // Skip standard deviation entries
120
+ if (entry.name.includes("_std")) return null;
121
+
122
+ const baseModelName = entry.name;
123
+ const stdEntry = payload.find(
124
+ (p) => p.name === `${baseModelName}_std`
125
+ );
126
+ const stdValue = stdEntry ? stdEntry.value : 0;
127
+
128
+ return (
129
+ <div key={index} className="flex items-center">
130
+ <div
131
+ className="w-3 h-3 mr-1"
132
+ style={{ backgroundColor: entry.color }}
133
+ ></div>
134
+ <span className="text-xs">
135
+ {entry.name}: {entry.value.toFixed(2)} ± {stdValue.toFixed(2)}
136
+ </span>
137
+ </div>
138
+ );
139
+ })}
140
+ </div>
141
+ );
142
+ }
143
+ return null;
144
+ };
145
+
146
+ return (
147
+ <div className="max-w-7xl mx-auto p-4 bg-white">
148
+ <h1 className="text-3xl font-bold text-center mb-2">
149
+ LLM Performance: The Human Perspective
150
+ </h1>
151
+ <p className="text-center mb-6 text-gray-600 max-w-4xl mx-auto">
152
+ Evaluations of LLMs performing everyday tasks. Metrics focus on both
153
+ technical quality and user experience factors.
154
+ </p>
155
+
156
+ {/* Main navigation tabs - Updated structure */}
157
+ <div className="flex flex-wrap mb-6 border-b">
158
+ <button
159
+ className={`px-4 py-2 font-medium ${
160
+ activeTab === "overview"
161
+ ? "text-blue-600 border-b-2 border-blue-600"
162
+ : "text-gray-500"
163
+ }`}
164
+ onClick={() => setActiveTab("overview")}
165
+ >
166
+ Overview
167
+ </button>
168
+ <button
169
+ className={`px-4 py-2 font-medium ${
170
+ activeTab === "task-demographics"
171
+ ? "text-blue-600 border-b-2 border-blue-600"
172
+ : "text-gray-500"
173
+ }`}
174
+ onClick={() => setActiveTab("task-demographics")}
175
+ >
176
+ Task & Demographic Analysis
177
+ </button>
178
+ <button
179
+ className={`px-4 py-2 font-medium ${
180
+ activeTab === "facets"
181
+ ? "text-blue-600 border-b-2 border-blue-600"
182
+ : "text-gray-500"
183
+ }`}
184
+ onClick={() => setActiveTab("facets")}
185
+ >
186
+ Metrics Breakdown
187
+ </button>
188
+ {/* <button
189
+ className={`px-4 py-2 font-medium ${
190
+ activeTab === "headtohead"
191
+ ? "text-blue-600 border-b-2 border-blue-600"
192
+ : "text-gray-500"
193
+ }`}
194
+ onClick={() => setActiveTab("headtohead")}
195
+ >
196
+ Head-to-Head Comparison
197
+ </button> */}
198
+ </div>
199
+
200
+ {/* Overview Tab */}
201
+ {activeTab === "overview" && (
202
+ <div>
203
+ {/* Overall Rankings Card - Simplified */}
204
+ <div className="mb-6 border rounded-lg overflow-hidden">
205
+ <div className="px-4 py-2 bg-gray-50 border-b">
206
+ <h2 className="text-xl font-semibold">Overall Model Rankings</h2>
207
+ </div>
208
+ <div className="p-4">
209
+ <div className="overflow-x-auto">
210
+ <table className="w-full table-fixed divide-y divide-gray-200">
211
+ <thead>
212
+ <tr className="bg-gray-50">
213
+ <th className="px-4 py-2 text-left text-sm font-medium text-gray-500 w-10">
214
+ Rank
215
+ </th>
216
+ <th
217
+ className="px-4 py-2 text-left text-sm font-medium text-gray-500 w-52 cursor-pointer group"
218
+ onClick={() => requestSort("model")}
219
+ >
220
+ <div className="flex items-center">
221
+ Model
222
+ {sortConfig.key === "model" ? (
223
+ <span className="ml-1">
224
+ {sortConfig.direction === "ascending" ? "↑" : "↓"}
225
+ </span>
226
+ ) : (
227
+ <span className="ml-1 text-gray-300 group-hover:text-gray-500">
228
+
229
+ </span>
230
+ )}
231
+ </div>
232
+ </th>
233
+ <th
234
+ className="px-4 py-2 text-left text-sm font-medium text-gray-500 w-50 cursor-pointer group"
235
+ onClick={() => requestSort("overall_score")}
236
+ >
237
+ <div className="flex items-center">
238
+ Overall Score
239
+ {sortConfig.key === "overall_score" ? (
240
+ <span className="ml-1">
241
+ {sortConfig.direction === "ascending" ? "↑" : "↓"}
242
+ </span>
243
+ ) : (
244
+ <span className="ml-1 text-gray-300 group-hover:text-gray-500">
245
+
246
+ </span>
247
+ )}
248
+ </div>
249
+ </th>
250
+ <th
251
+ className="px-4 py-2 text-left text-sm font-medium text-gray-500 w-42 cursor-pointer group"
252
+ onClick={() => requestSort("repeat_usage_pct")}
253
+ >
254
+ <div className="flex items-center">
255
+ Would Use Again
256
+ {sortConfig.key === "repeat_usage_pct" ? (
257
+ <span className="ml-1">
258
+ {sortConfig.direction === "ascending" ? "↑" : "↓"}
259
+ </span>
260
+ ) : (
261
+ <span className="ml-1 text-gray-300 group-hover:text-gray-500">
262
+
263
+ </span>
264
+ )}
265
+ </div>
266
+ </th>
267
+ <th className="px-4 py-2 text-left text-sm font-medium text-gray-500 w-54">
268
+ Top Strengths
269
+ </th>
270
+ </tr>
271
+ </thead>
272
+ <tbody className="divide-y divide-gray-200">
273
+ {sortedModels.map((model, index) => (
274
+ <tr
275
+ key={model.model}
276
+ className={index % 2 === 0 ? "bg-white" : "bg-gray-50"}
277
+ >
278
+ <td className="px-4 py-3 text-sm font-medium text-gray-900 w-10">
279
+ {index + 1}
280
+ </td>
281
+ <td className="px-4 py-3 w-52">
282
+ <div className="flex items-center">
283
+ <div
284
+ className="w-3 h-3 rounded-full mr-2"
285
+ style={{ backgroundColor: model.color }}
286
+ ></div>
287
+ <span className="text-sm font-medium text-gray-900">
288
+ {model.model}
289
+ </span>
290
+ </div>
291
+ </td>
292
+ <td className="px-4 py-3 min-w-[200px] w-64">
293
+ <ScoreWithStdDev
294
+ score={model.overall_score}
295
+ stdDev={model.overall_std}
296
+ colorClass={getScoreBadgeColor(
297
+ model.overall_score,
298
+ 0,
299
+ 100
300
+ )}
301
+ />
302
+ </td>
303
+ <td className="px-4 py-3 whitespace-nowrap w-32">
304
+ <span
305
+ className={`px-2 py-1 inline-flex text-xs font-semibold rounded-full ${
306
+ model.repeat_usage_pct > 80
307
+ ? "bg-green-100 text-green-800"
308
+ : model.repeat_usage_pct > 60
309
+ ? "bg-blue-100 text-blue-800"
310
+ : "bg-yellow-100 text-yellow-800"
311
+ }`}
312
+ >
313
+ {model.repeat_usage_pct.toFixed(1)}%
314
+ {/* ±{" "} {model.repeat_usage_pct_std.toFixed(1)} */}
315
+ </span>
316
+ </td>
317
+ <td className="px-4 py-3 text-sm text-gray-500 w-52">
318
+ {model.top_strengths && model.top_strengths.length > 0
319
+ ? model.top_strengths
320
+ .slice(0, 3)
321
+ .map((strength) => formatFacetName(strength))
322
+ .join(", ")
323
+ : "N/A"}
324
+ </td>
325
+ </tr>
326
+ ))}
327
+ </tbody>
328
+ </table>
329
+ </div>
330
+ </div>
331
+ </div>
332
+
333
+ {/* Enhanced Top Performers Cards */}
334
+ {Object.keys(bestModelPerCategory).length > 0 && (
335
+ <div>
336
+ <h3 className="font-semibold text-xl mb-4">
337
+ Best Models by Task Category
338
+ </h3>
339
+ <div className="grid grid-cols-1 md:grid-cols-3 gap-6 mb-6">
340
+ {/* Creative Tasks Card - Enhanced */}
341
+ <div className="border rounded-lg overflow-hidden">
342
+ <div className="px-4 py-2 bg-gray-50 border-b flex items-center">
343
+ <h3 className="font-semibold">Best for Creative Tasks</h3>
344
+ <div
345
+ className="ml-2 w-2 h-2 rounded-full"
346
+ style={{
347
+ backgroundColor:
348
+ bestModelPerCategory.creative?.color || "#e5e7eb",
349
+ }}
350
+ ></div>
351
+ </div>
352
+ <div className="p-4">
353
+ <div className="flex items-center mb-4">
354
+ <div
355
+ className="p-2 rounded-full"
356
+ style={{
357
+ backgroundColor:
358
+ bestModelPerCategory.creative?.color
358
+ ? `${bestModelPerCategory.creative.color}20` : "#e5e7eb",
360
+ }}
361
+ >
362
+ <svg
363
+ xmlns="http://www.w3.org/2000/svg"
364
+ className="h-8 w-8"
365
+ style={{
366
+ color:
367
+ bestModelPerCategory.creative?.color || "#6b7280",
368
+ }}
369
+ fill="none"
370
+ viewBox="0 0 24 24"
371
+ stroke="currentColor"
372
+ >
373
+ <path
374
+ strokeLinecap="round"
375
+ strokeLinejoin="round"
376
+ strokeWidth={2}
377
+ d="M9.663 17h4.673M12 3v1m6.364 1.636l-.707.707M21 12h-1M4 12H3m3.343-5.657l-.707-.707m2.828 9.9a5 5 0 117.072 0l-.548.547A3.374 3.374 0 0014 18.469V19a2 2 0 11-4 0v-.531c0-.895-.356-1.754-.988-2.386l-.548-.547z"
378
+ />
379
+ </svg>
380
+ </div>
381
+ <div className="ml-4">
382
+ <h4 className="text-lg font-semibold">
383
+ {bestModelPerCategory.creative?.model || "N/A"}
384
+ </h4>
385
+ <p className="text-sm text-gray-600">
386
+ Score:{" "}
387
+ {bestModelPerCategory.creative?.score?.toFixed(2) ??
388
+ "N/A"}
389
+ {bestModelPerCategory.creative?.std &&
390
+ ` ± ${bestModelPerCategory.creative.std.toFixed(
391
+ 2
392
+ )}`}
393
+ </p>
394
+ </div>
395
+ </div>
396
+
397
+ {/* Key aspects/facets visualization */}
398
+ <div className="mb-4">
399
+ <h5 className="text-sm font-medium mb-2">
400
+ Key Aspects for Creative Tasks
401
+ </h5>
402
+ <div className="space-y-2">
403
+ {(keyAspectsByTask?.by_category?.creative || []).map(
404
+ (aspectInfo) => {
405
+ const aspect = aspectInfo.raw_aspect;
406
+ const score = aspectInfo.score;
407
+
408
+ return (
409
+ <div key={aspect} className="text-sm">
410
+ <div className="flex justify-between mb-1">
411
+ <span>{aspectInfo.aspect}</span>
412
+ <span className="font-medium">
413
+ {score.toFixed(1)}
414
+ </span>
415
+ </div>
416
+ <div className="w-full bg-gray-200 rounded-full h-2">
417
+ <div
418
+ className="h-2 rounded-full"
419
+ style={{
420
+ width: `${score}%`,
421
+ backgroundColor:
422
+ bestModelPerCategory.creative?.color ||
423
+ "#6b7280",
424
+ }}
425
+ ></div>
426
+ </div>
427
+ </div>
428
+ );
429
+ }
430
+ )}
431
+ </div>
432
+ </div>
433
+
434
+ <p className="text-sm text-gray-700">
435
+ Excels at creative tasks like generating ideas and
436
+ creating travel itineraries.
437
+ </p>
438
+ <div className="mt-3 text-xs text-gray-500">
439
+ <div>Tasks in this category:</div>
440
+ <ul className="list-disc ml-4 mt-1">
441
+ {taskCategories.creative?.map((task) => (
442
+ <li key={task}>{task}</li>
443
+ )) || <li>No data available</li>}
444
+ </ul>
445
+ </div>
446
+ </div>
447
+ </div>
448
+
449
+ {/* Practical Tasks Card - Enhanced */}
450
+ <div className="border rounded-lg overflow-hidden">
451
+ <div className="px-4 py-2 bg-gray-50 border-b flex items-center">
452
+ <h3 className="font-semibold">Best for Practical Tasks</h3>
453
+ <div
454
+ className="ml-2 w-2 h-2 rounded-full"
455
+ style={{
456
+ backgroundColor:
457
+ bestModelPerCategory.practical?.color || "#e5e7eb",
458
+ }}
459
+ ></div>
460
+ </div>
461
+ <div className="p-4">
462
+ <div className="flex items-center mb-4">
463
+ <div
464
+ className="p-2 rounded-full"
465
+ style={{
466
+ backgroundColor:
467
+ bestModelPerCategory.practical?.color
467
+ ? `${bestModelPerCategory.practical.color}20` : "#e5e7eb",
469
+ }}
470
+ >
471
+ <svg
472
+ xmlns="http://www.w3.org/2000/svg"
473
+ className="h-8 w-8"
474
+ style={{
475
+ color:
476
+ bestModelPerCategory.practical?.color ||
477
+ "#6b7280",
478
+ }}
479
+ fill="none"
480
+ viewBox="0 0 24 24"
481
+ stroke="currentColor"
482
+ >
483
+ <path
484
+ strokeLinecap="round"
485
+ strokeLinejoin="round"
486
+ strokeWidth={2}
487
+ d="M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2"
488
+ />
489
+ </svg>
490
+ </div>
491
+ <div className="ml-4">
492
+ <h4 className="text-lg font-semibold">
493
+ {bestModelPerCategory.practical?.model || "N/A"}
494
+ </h4>
495
+ <p className="text-sm text-gray-600">
496
+ Score:{" "}
497
+ {bestModelPerCategory.practical?.score?.toFixed(2) ??
497
+ "N/A"}
499
+ {bestModelPerCategory.practical?.std &&
500
+ ` ± ${bestModelPerCategory.practical.std.toFixed(
501
+ 2
502
+ )}`}
503
+ </p>
504
+ </div>
505
+ </div>
506
+
507
+ {/* Key facets visualization */}
508
+ <div className="mb-4">
509
+ <h5 className="text-sm font-medium mb-2">
510
+ Key Aspects for Practical Tasks
511
+ </h5>
512
+ <div className="space-y-2">
513
+ {(keyAspectsByTask?.by_category?.practical ?? []).map(
514
+ (aspectInfo) => {
515
+ const aspect = aspectInfo.raw_aspect;
516
+ const score = aspectInfo.score;
517
+
518
+ return (
519
+ <div key={aspect} className="text-sm">
520
+ <div className="flex justify-between mb-1">
521
+ <span>{aspectInfo.aspect}</span>
522
+ <span className="font-medium">
523
+ {score.toFixed(1)}
524
+ </span>
525
+ </div>
526
+ <div className="w-full bg-gray-200 rounded-full h-2">
527
+ <div
528
+ className="h-2 rounded-full"
529
+ style={{
530
+ width: `${score}%`,
531
+ backgroundColor:
532
+ bestModelPerCategory.practical?.color ||
533
+ "#6b7280",
534
+ }}
535
+ ></div>
536
+ </div>
537
+ </div>
538
+ );
539
+ }
540
+ )}
541
+ </div>
542
+ </div>
543
+
544
+ <p className="text-sm text-gray-700">
545
+ Best performance on practical tasks like creating a meal plan or following up on a job application.
546
+ </p>
547
+ <div className="mt-3 text-xs text-gray-500">
548
+ <div>Tasks in this category:</div>
549
+ <ul className="list-disc ml-4 mt-1">
550
+ {taskCategories.practical?.map((task) => (
551
+ <li key={task}>{task}</li>
552
+ )) || <li>No data available</li>}
553
+ </ul>
554
+ </div>
555
+ </div>
556
+ </div>
557
+
558
+ {/* Analytical Tasks Card - Enhanced */}
559
+ <div className="border rounded-lg overflow-hidden">
560
+ <div className="px-4 py-2 bg-gray-50 border-b flex items-center">
561
+ <h3 className="font-semibold">Best for Analytical Tasks</h3>
562
+ <div
563
+ className="ml-2 w-2 h-2 rounded-full"
564
+ style={{
565
+ backgroundColor:
566
+ bestModelPerCategory.analytical?.color || "#e5e7eb",
567
+ }}
568
+ ></div>
569
+ </div>
570
+ <div className="p-4">
571
+ <div className="flex items-center mb-4">
572
+ <div
573
+ className="p-2 rounded-full"
574
+ style={{
575
+ backgroundColor:
576
+ bestModelPerCategory.analytical?.color
577
+ ? `${bestModelPerCategory.analytical.color}20` : "#e5e7eb",
578
+ }}
579
+ >
580
+ <svg
581
+ xmlns="http://www.w3.org/2000/svg"
582
+ className="h-8 w-8"
583
+ style={{
584
+ color:
585
+ bestModelPerCategory.analytical?.color ||
586
+ "#6b7280",
587
+ }}
588
+ fill="none"
589
+ viewBox="0 0 24 24"
590
+ stroke="currentColor"
591
+ >
592
+ <path
593
+ strokeLinecap="round"
594
+ strokeLinejoin="round"
595
+ strokeWidth={2}
596
+ d="M12 6.253v13m0-13C10.832 5.477 9.246 5 7.5 5S4.168 5.477 3 6.253v13C4.168 18.477 5.754 18 7.5 18s3.332.477 4.5 1.253m0-13C13.168 5.477 14.754 5 16.5 5c1.747 0 3.332.477 4.5 1.253v13C19.832 18.477 18.247 18 16.5 18c-1.746 0-3.332.477-4.5 1.253"
597
+ />
598
+ </svg>
599
+ </div>
600
+ <div className="ml-4">
601
+ <h4 className="text-lg font-semibold">
602
+ {bestModelPerCategory.analytical?.model || "N/A"}
603
+ </h4>
604
+ <p className="text-sm text-gray-600">
605
+ Score:{" "}
606
+ {bestModelPerCategory.analytical?.score.toFixed(2) ||
607
+ "N/A"}
608
+ {bestModelPerCategory.analytical?.std &&
609
+ ` ± ${bestModelPerCategory.analytical.std.toFixed(
610
+ 2
611
+ )}`}
612
+ </p>
613
+ </div>
614
+ </div>
615
+
616
+ {/* Key facets/aspects visualization */}
617
+ <div className="mb-4">
618
+ <h5 className="text-sm font-medium mb-2">
619
+ Key Aspects for Analytical Tasks
620
+ </h5>
621
+ <div className="space-y-2">
622
+ {(keyAspectsByTask?.by_category?.analytical ?? []).map(
623
+ (aspectInfo) => {
624
+ const aspect = aspectInfo.raw_aspect;
625
+ const score = aspectInfo.score;
626
+
627
+ return (
628
+ <div key={aspect} className="text-sm">
629
+ <div className="flex justify-between mb-1">
630
+ <span>{aspectInfo.aspect}</span>
631
+ <span className="font-medium">
632
+ {score.toFixed(1)}
633
+ </span>
634
+ </div>
635
+ <div className="w-full bg-gray-200 rounded-full h-2">
636
+ <div
637
+ className="h-2 rounded-full"
638
+ style={{
639
+ width: `${score}%`,
640
+ backgroundColor:
641
+ bestModelPerCategory.analytical
642
+ ?.color || "#6b7280",
643
+ }}
644
+ ></div>
645
+ </div>
646
+ </div>
647
+ );
648
+ }
649
+ )}
650
+ </div>
651
+ </div>
652
+
653
+ <p className="text-sm text-gray-700">
654
+ Exceptional at analytical tasks like breaking down complex topics or helping you decide between options.
655
+ </p>
656
+ <div className="mt-3 text-xs text-gray-500">
657
+ <div>Tasks in this category:</div>
658
+ <ul className="list-disc ml-4 mt-1">
659
+ {taskCategories.analytical?.map((task) => (
660
+ <li key={task}>{task}</li>
661
+ )) || <li>No data available</li>}
662
+ </ul>
663
+ </div>
664
+ </div>
665
+ </div>
666
+ </div>
667
+ </div>
668
+ )}
669
+
670
+
671
+ </div>
672
+ )}
673
+
674
+ {/* Task & Demographic Analysis Tab */}
675
+ {activeTab === "task-demographics" && data && (
676
+ <TaskDemographicAnalysis data={data} />
677
+ )}
678
+
679
+ {/* Facet & Aspect Breakdown Tab */}
680
+ {activeTab === "facets" && data && <MetricsBreakdown data={data} />}
681
+
682
+ {/* Head-to-Head Comparison Tab */}
683
+ {/* {activeTab === "headtohead" && <HeadToHeadComparison data={data} />} */}
684
+ </div>
685
+ );
686
+ };
687
+
688
+ export default LLMComparisonDashboard;
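Review note: the category cards tint their icon badge by appending the hex alpha `20` to the model color, with `#e5e7eb` as the fallback. A minimal sketch of that pattern as a standalone helper follows. The name `tintedBackground` and the sample data are hypothetical and not part of this commit. The point it illustrates is that the fallback has to be chosen before concatenating, because `undefined + "20"` is the truthy string `"undefined20"`.

```javascript
// Hypothetical helper (not in the commit): append a two-digit hex alpha to a
// "#rrggbb" color, or return a neutral fallback when no color is available.
function tintedBackground(color, alphaHex = "20", fallback = "#e5e7eb") {
  // Decide on the fallback first; concatenating onto undefined would
  // produce "undefined20", which `||` treats as a valid value.
  return typeof color === "string" && color.length > 0
    ? `${color}${alphaHex}`
    : fallback;
}

// Sample data for illustration only.
const bestModelPerCategory = { practical: { color: "#3b82f6" } };

console.log(tintedBackground(bestModelPerCategory.practical?.color));
console.log(tintedBackground(bestModelPerCategory.analytical?.color));
```

A helper like this could be shared by all of the category cards so the fallback logic lives in one place.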
leaderboard-app/components/MetricsBreakdown.jsx ADDED
@@ -0,0 +1,638 @@
1
+ "use client";
2
+
3
+ import React, { useState, useEffect } from "react";
4
+ import {
5
+ BarChart,
6
+ Bar,
7
+ XAxis,
8
+ YAxis,
9
+ CartesianGrid,
10
+ Tooltip,
11
+ Legend,
12
+ ResponsiveContainer,
13
+ RadarChart,
14
+ PolarGrid,
15
+ PolarAngleAxis,
16
+ PolarRadiusAxis,
17
+ Radar
18
+ } from "recharts";
19
+
20
+ // Utility functions for formatting facet and aspect names
21
+ const formatFacetName = (facet) => {
22
+ const facetMap = {
23
+ "helpfulness": "Helpfulness",
24
+ "communication": "Communication",
25
+ "insightful": "Insightfulness",
26
+ "adaptiveness": "Adaptiveness",
27
+ "trustworthiness": "Trustworthiness",
28
+ "personality": "Personality",
29
+ "background_and_culture": "Cultural Awareness"
30
+ };
31
+
32
+ return facetMap[facet] || (facet ? facet.replace(/_/g, ' ').replace(/\b\w/g, l => l.toUpperCase()) : facet);
33
+ };
34
+
35
+ const formatAspectName = (aspect) => {
36
+ const aspectMap = {
37
+ "effectiveness": "Effectiveness",
38
+ "comprehensiveness": "Comprehensiveness",
39
+ "usefulness": "Usefulness",
40
+ "tone_and_language_style": "Tone & Language",
41
+ "naturalness": "Naturalness",
42
+ "detail_and_technical_language": "Detail & Technical",
43
+ "accuracy": "Accuracy",
44
+ "sharpness": "Sharpness",
45
+ "intuitive": "Intuitiveness",
46
+ "flexibility": "Flexibility",
47
+ "clarity": "Clarity",
48
+ "perceptiveness": "Perceptiveness",
49
+ "consistency": "Consistency",
50
+ "confidence": "Confidence",
51
+ "transparency": "Transparency",
52
+ "personality-consistency": "Personality Consistency",
53
+ "personality-definition": "Personality Definition",
54
+ "honesty-empathy-fairness": "Honesty & Empathy",
55
+ "alignment": "Alignment",
56
+ "cultural_relevance": "Cultural Relevance",
57
+ "bias_freedom": "Freedom from Bias",
58
+ "background_and_culture": "Cultural Background"
59
+ };
60
+
61
+ return aspectMap[aspect] || (aspect ? aspect.replace(/_/g, ' ').replace(/-/g, ' ').replace(/\b\w/g, l => l.toUpperCase()) : aspect);
62
+ };
63
+
64
+ // Format categories for the radar chart
65
+ const formatCategoryName = (category) => {
66
+ if (category.includes('_') || category === "Insightful") {
67
+ return formatFacetName(category.toLowerCase());
68
+ }
69
+ return category;
70
+ };
71
+
72
+ // Get color based on score value
73
+ const getScoreColor = (score) => {
74
+ if (score >= 90) return "text-green-600 font-semibold";
75
+ if (score >= 80) return "text-green-500";
76
+ if (score >= 70) return "text-green-400";
77
+ if (score >= 60) return "text-sky-500";
78
+ if (score >= 50) return "text-sky-400";
79
+ if (score >= 40) return "text-yellow-500";
80
+ if (score >= 30) return "text-yellow-400";
81
+ return "text-red-500";
82
+ };
83
+
84
+ // Get background color based on score (for badges)
85
+ const getScoreBgColor = (score) => {
86
+ if (score >= 90) return "bg-green-100 text-green-800";
87
+ if (score >= 80) return "bg-green-50 text-green-700";
88
+ if (score >= 70) return "bg-sky-100 text-sky-800";
89
+ if (score >= 60) return "bg-sky-50 text-sky-700";
90
+ if (score >= 50) return "bg-yellow-100 text-yellow-800";
91
+ if (score < 50) return "bg-red-100 text-red-800";
92
+ return "bg-gray-100 text-gray-800";
93
+ };
94
+
95
+ // Custom tooltip with proper formatting
96
+ const CustomTooltip = ({ active, payload, label }) => {
97
+ if (active && payload && payload.length) {
98
+ // Format the label based on whether it's a facet or aspect
99
+ const formattedLabel = formatCategoryName(label);
100
+
101
+ return (
102
+ <div className="bg-white p-3 border rounded shadow-sm">
103
+ <p className="font-medium">{formattedLabel}</p>
104
+ <div className="mt-2">
105
+ {payload
106
+ .filter(entry => !entry.dataKey.includes('_std'))
107
+ .map((entry, index) => {
108
+ const stdEntry = payload.find(p => p.dataKey === `${entry.dataKey}_std`);
109
+ const stdValue = stdEntry ? stdEntry.value : 0;
110
+
111
+ return (
112
+ <div key={index} className="flex items-center text-sm mb-1">
113
+ <div
114
+ className="w-3 h-3 rounded-full mr-1"
115
+ style={{ backgroundColor: entry.color }}
116
+ ></div>
117
+ <span className="mr-2">{entry.name}:</span>
118
+ <span className="font-medium">{entry.value.toFixed(1)}{stdEntry ? ` ± ${stdValue.toFixed(1)}` : ""}</span>
119
+ </div>
120
+ );
121
+ })}
122
+ </div>
123
+ </div>
124
+ );
125
+ }
126
+ return null;
127
+ };
128
+
129
+ const MetricsBreakdown = ({ data }) => {
130
+ const [viewMode, setViewMode] = useState("facets"); // "facets" or "aspects"
131
+ const [selectedModels, setSelectedModels] = useState([]);
132
+ const [selectedFacet, setSelectedFacet] = useState(null);
133
+
134
+ const {
135
+ models,
136
+ facets,
137
+ radarData,
138
+ bestModelPerFacet
139
+ } = data;
140
+
141
+ // Initialize selected facet and models
142
+ useEffect(() => {
143
+ if (!selectedFacet && facets && Object.keys(facets).length > 0) {
144
+ // Skip repeat_usage and select the first actual facet
145
+ const availableFacets = Object.keys(facets).filter(f => f !== "repeat_usage");
146
+ if (availableFacets.length > 0) {
147
+ setSelectedFacet(availableFacets[0]);
148
+ }
149
+ }
150
+
151
+ if (selectedModels.length === 0 && models?.length > 0) {
152
+ // Select all models by default
153
+ setSelectedModels(models.map(m => m.model));
154
+ }
155
+ }, [facets, selectedFacet, models, selectedModels]);
156
+
157
+ // Get model by name
158
+ const getModelByName = (name) => {
159
+ return models.find(m => m.model === name);
160
+ };
161
+
162
+ // Generate aspect radar data for selected facet
163
+ const getAspectRadarData = () => {
164
+ if (!selectedFacet || !facets) return [];
165
+
166
+ const selectedAspects = facets[selectedFacet] || [];
167
+ if (selectedAspects.length === 0) return [];
168
+
169
+ // Create radar data format with aspect as categories
170
+ return selectedAspects.map(aspect => {
171
+ const entry = {
172
+ category: formatAspectName(aspect),
173
+ aspect
174
+ };
175
+
176
+ // Add data for selected models
177
+ models
178
+ .filter(m => selectedModels.includes(m.model))
179
+ .forEach(model => {
180
+ if (model.breakdown_scores && model.breakdown_scores[aspect] !== undefined) {
181
+ entry[model.model] = model.breakdown_scores[aspect];
182
+ }
183
+ });
184
+
185
+ return entry;
186
+ });
187
+ };
188
+
189
+ // Get selected facet aspects
190
+ const getSelectedFacetAspects = () => {
191
+ if (!selectedFacet || !facets) return [];
192
+ return facets[selectedFacet] || [];
193
+ };
194
+
195
+ // Get facet data for the radar chart
196
+ const getFacetRadarData = () => {
197
+ if (!radarData) return [];
198
+
199
+ // This ensures the data contains only the selected models
200
+ return radarData.map(item => {
201
+ // Create a new object with only the properties we want
202
+ const newItem = { category: item.category };
203
+
204
+ // Copy only the selected models' data
205
+ models
206
+ .filter(m => selectedModels.includes(m.model))
207
+ .forEach(model => {
208
+ newItem[model.model] = item[model.model];
209
+ });
210
+
211
+ return newItem;
212
+ });
213
+ };
214
+
215
+ // Calculate top performers based on selected models only
216
+ const getTopPerformersByFacet = () => {
217
+ if (!facets || !models) return {};
218
+
219
+ const topPerformers = {};
220
+
221
+ // For each facet, find the best model among selected models
222
+ Object.keys(facets)
223
+ .filter(facet => facet !== "repeat_usage")
224
+ .forEach(facet => {
225
+ let bestModel = null;
226
+ let bestScore = -Infinity;
227
+
228
+ // Check each selected model
229
+ models
230
+ .filter(m => selectedModels.includes(m.model))
231
+ .forEach(model => {
232
+ const score = model.facet_scores?.[facet];
233
+ if (score !== undefined && score > bestScore) {
234
+ bestScore = score;
235
+ bestModel = {
236
+ model: model.model,
237
+ score: score,
238
+ modelObj: model
239
+ };
240
+ }
241
+ });
242
+
243
+ if (bestModel) {
244
+ topPerformers[facet] = bestModel;
245
+ }
246
+ });
247
+
248
+ return topPerformers;
249
+ };
250
+
251
+ // Calculate top performers for each aspect of the selected facet
252
+ const getTopPerformersByAspect = () => {
253
+ if (!selectedFacet || !facets || !models) return [];
254
+
255
+ const selectedAspects = facets[selectedFacet] || [];
256
+ const topPerformers = [];
257
+
258
+ // For each aspect, find the best model among selected models
259
+ selectedAspects.forEach(aspect => {
260
+ let bestModel = null;
261
+ let bestScore = -Infinity;
262
+
263
+ // Check each selected model
264
+ models
265
+ .filter(m => selectedModels.includes(m.model))
266
+ .forEach(model => {
267
+ const score = model.breakdown_scores?.[aspect];
268
+ if (score !== undefined && score > bestScore) {
269
+ bestScore = score;
270
+ bestModel = {
271
+ model: model.model,
272
+ score: score,
273
+ modelObj: model
274
+ };
275
+ }
276
+ });
277
+
278
+ if (bestModel) {
279
+ topPerformers.push({
280
+ aspect,
281
+ aspectName: formatAspectName(aspect),
282
+ ...bestModel
283
+ });
284
+ }
285
+ });
286
+
287
+ return topPerformers;
288
+ };
289
+
290
+ // Prepare data
291
+ const selectedAspects = getSelectedFacetAspects();
292
+ const facetRadarData = getFacetRadarData();
293
+ const aspectRadarData = getAspectRadarData();
294
+ const topPerformers = getTopPerformersByFacet();
295
+ const topAspectPerformers = getTopPerformersByAspect();
296
+
297
+ return (
298
+ <>
299
+ {/* Top-level controls */}
300
+ <div className="mb-4 flex justify-between items-center flex-wrap">
301
+ <div className="flex items-center space-x-4">
302
+ {/* View toggle */}
303
+ <div className="flex space-x-1 p-1 bg-gray-100 rounded-lg">
304
+ <button
305
+ className={`px-4 py-1.5 text-sm font-medium rounded-md ${
306
+ viewMode === "facets" ? "bg-white shadow text-sky-700" : "text-gray-700"
307
+ }`}
308
+ onClick={() => setViewMode("facets")}
309
+ >
310
+ Facets
311
+ </button>
312
+ <button
313
+ className={`px-2 py-1.5 text-sm font-medium rounded-md ${
314
+ viewMode === "aspects" ? "bg-white shadow text-sky-700" : "text-gray-700"
315
+ }`}
316
+ onClick={() => setViewMode("aspects")}
317
+ >
318
+ Aspects
319
+ </button>
320
+ </div>
321
+
322
+ {/* Facet selector (shown when in aspects view) */}
323
+ {viewMode === "aspects" && (
324
+ <div className="flex items-center">
325
+ <span className="text-sm font-medium mr-1">Select Facet:</span>
326
+ <select
327
+ className="text-sm border rounded px-2 py-1.5 bg-white"
328
+ value={selectedFacet || ''}
329
+ onChange={(e) => setSelectedFacet(e.target.value)}
330
+ >
331
+ {Object.keys(facets || {})
332
+ .filter(f => f !== "repeat_usage")
333
+ .map(facet => (
334
+ <option key={facet} value={facet}>
335
+ {formatFacetName(facet)}
336
+ </option>
337
+ ))}
338
+ </select>
339
+ </div>
340
+ )}
341
+ </div>
342
+
343
+ {/* Model selector */}
344
+ <div className="mt-2 sm:mt-0">
345
+ <span className="text-sm text-gray-500 mr-2">Select Models:</span>
346
+ <div className="inline-flex flex-wrap gap-1">
347
+ {models?.map(model => (
348
+ <button
349
+ key={model.model}
350
+ className={`px-2 py-0.5 text-sm rounded ${
351
+ selectedModels.includes(model.model)
352
+ ? "bg-sky-100 border text-sky-800 border-sky-300"
353
+ : "bg-gray-100 text-gray-600"
354
+ }`}
355
+ onClick={() => {
356
+ if (selectedModels.includes(model.model)) {
357
+ if (selectedModels.length > 1) {
358
+ setSelectedModels(selectedModels.filter(m => m !== model.model));
359
+ }
360
+ } else {
361
+ setSelectedModels([...selectedModels, model.model]);
362
+ }
363
+ }}
364
+ >
365
+ {model.model}
366
+ </button>
367
+ ))}
368
+ </div>
369
+ </div>
370
+ </div>
371
+
372
+ {/* Performance Summary Table */}
373
+ <div className="border rounded-lg overflow-hidden mb-4">
374
+ <div className="px-4 py-2 bg-gray-50 border-b">
375
+ <h3 className="font-semibold">Performance Summary</h3>
376
+ </div>
377
+ <div className="p-4 overflow-x-auto">
378
+ <table className="min-w-full divide-y divide-gray-200">
379
+ <thead>
380
+ <tr>
381
+ <th className="px-3 py-2 text-left text-xs font-medium text-gray-500 uppercase tracking-wider">Model</th>
382
+ {viewMode === "facets" ? (
383
+ // Show facets in facet view
384
+ Object.keys(facets || {})
385
+ .filter(f => f !== "repeat_usage")
386
+ .map(facet => (
387
+ <th key={facet} className="px-3 py-2 text-left text-xs font-medium text-gray-500 uppercase tracking-wider">
388
+ {formatFacetName(facet)}
389
+ </th>
390
+ ))
391
+ ) : (
392
+ // Show aspects in aspect view
393
+ selectedAspects.map(aspect => (
394
+ <th key={aspect} className="px-3 py-2 text-left text-xs font-medium text-gray-500 uppercase tracking-wider">
395
+ {formatAspectName(aspect)}
396
+ </th>
397
+ ))
398
+ )}
399
+ </tr>
400
+ </thead>
401
+ <tbody className="bg-white divide-y divide-gray-200">
402
+ {models
403
+ ?.filter(m => selectedModels.includes(m.model))
404
+ .map((model, idx) => (
405
+ <tr key={model.model} className={idx % 2 === 0 ? "bg-white" : "bg-gray-50"}>
406
+ <td className="px-3 py-2">
407
+ <div className="flex items-center">
408
+ <div
409
+ className="w-3 h-3 rounded-full mr-2"
410
+ style={{ backgroundColor: model.color }}
411
+ ></div>
412
+ <span className="text-sm font-medium">{model.model}</span>
413
+ </div>
414
+ </td>
415
+ {viewMode === "facets" ? (
416
+ // Show facet scores in facet view
417
+ Object.keys(facets || {})
418
+ .filter(f => f !== "repeat_usage")
419
+ .map(facet => {
420
+ const score = model.facet_scores?.[facet] || 0;
421
+ return (
422
+ <td key={facet} className="px-3 py-2">
423
+ <div className={`text-sm ${getScoreColor(score)}`}>
424
+ {score.toFixed(1)}
425
+ </div>
426
+ </td>
427
+ );
428
+ })
429
+ ) : (
430
+ // Show aspect scores in aspect view
431
+ selectedAspects.map(aspect => {
432
+ const score = model.breakdown_scores?.[aspect] || 0;
433
+ return (
434
+ <td key={aspect} className="px-3 py-2">
435
+ <div className={`text-sm ${getScoreColor(score)}`}>
436
+ {score.toFixed(1)}
437
+ </div>
438
+ </td>
439
+ );
440
+ })
441
+ )}
442
+ </tr>
443
+ ))}
444
+ </tbody>
445
+ </table>
446
+ </div>
447
+ </div>
448
+
449
+ {/* Conditional content based on view mode */}
450
+ {viewMode === "facets" ? (
451
+ // FACETS VIEW
452
+ <>
453
+ {/* Radar Chart */}
454
+ <div className="border rounded-lg overflow-hidden mb-4">
455
+ <div className="px-4 py-2 bg-gray-50 border-b flex justify-between items-center">
456
+ <h3 className="font-semibold">Model Performance Across Facets</h3>
457
+ <div className="text-xs text-gray-500">Radar chart showing model strengths</div>
458
+ </div>
459
+ <div className="p-4">
460
+ <div className="h-96">
461
+ <ResponsiveContainer width="100%" height="100%">
462
+ <RadarChart
463
+ outerRadius={160}
464
+ data={facetRadarData}
465
+ >
466
+ <PolarGrid gridType="polygon" />
467
+ <PolarAngleAxis
468
+ dataKey="category"
469
+ tick={{ fill: "#4b5563", fontSize: 14 }}
470
+ tickLine={false}
471
+ tickFormatter={formatCategoryName}
472
+ />
473
+ <PolarRadiusAxis
474
+ angle={90}
475
+ domain={[-100, 100]}
476
+ axisLine={false}
477
+ tick={{ fontSize: 12 }}
478
+ tickCount={5}
479
+ />
480
+ {models
481
+ ?.filter(m => selectedModels.includes(m.model))
482
+ .map((model) => (
483
+ <Radar
484
+ key={model.model}
485
+ name={model.model}
486
+ dataKey={model.model}
487
+ stroke={model.color}
488
+ fill={model.color}
489
+ fillOpacity={0.2}
490
+ strokeWidth={2}
491
+ />
492
+ ))}
493
+ <Tooltip content={<CustomTooltip />} />
494
+ <Legend />
495
+ </RadarChart>
496
+ </ResponsiveContainer>
497
+ </div>
498
+ </div>
499
+ </div>
500
+
501
+ {/* Top Performers Table */}
502
+ <div className="border rounded-lg overflow-hidden">
503
+ <div className="px-4 py-2 bg-gray-50 border-b">
504
+ <h3 className="font-semibold">Top Performers by Facet</h3>
505
+ </div>
506
+ <div className="p-4">
507
+ <table className="min-w-full divide-y divide-gray-200">
508
+ <thead>
509
+ <tr>
510
+ <th className="px-3 py-2 text-left text-xs font-medium text-gray-500 uppercase tracking-wider">Facet</th>
511
+ <th className="px-3 py-2 text-left text-xs font-medium text-gray-500 uppercase tracking-wider">Best Model</th>
512
+ <th className="px-3 py-2 text-left text-xs font-medium text-gray-500 uppercase tracking-wider">Score</th>
513
+ </tr>
514
+ </thead>
515
+ <tbody className="bg-white divide-y divide-gray-200">
516
+ {Object.entries(topPerformers)
517
+ .map(([facet, bestModel], idx) => (
518
+ <tr key={facet} className={idx % 2 === 0 ? "bg-white" : "bg-gray-50"}>
519
+ <td className="px-3 py-2 font-medium">{formatFacetName(facet)}</td>
520
+ <td className="px-3 py-2">
521
+ <div className="flex items-center">
522
+ <div
523
+ className="w-3 h-3 rounded-full mr-2"
524
+ style={{ backgroundColor: bestModel.modelObj?.color }}
525
+ ></div>
526
+ <span>{bestModel.model}</span>
527
+ </div>
528
+ </td>
529
+ <td className="px-3 py-2">
530
+ <span className={`px-2 py-0.5 rounded-full text-sm font-medium ${getScoreBgColor(bestModel.score)}`}>
531
+ {bestModel.score.toFixed(1)}
532
+ </span>
533
+ </td>
534
+ </tr>
535
+ ))}
536
+ </tbody>
537
+ </table>
538
+ </div>
539
+ </div>
540
+ </>
541
+ ) : (
542
+ // ASPECTS VIEW
543
+ <>
544
+ {/* Aspect Radar Chart */}
545
+ <div className="border rounded-lg overflow-hidden mb-4">
546
+ <div className="px-4 py-2 bg-gray-50 border-b">
547
+ <h3 className="font-semibold">Aspect Breakdown for {formatFacetName(selectedFacet || '')}</h3>
548
+ </div>
549
+ <div className="p-4">
550
+ <div className="h-96">
551
+ <ResponsiveContainer width="100%" height="100%">
552
+ <RadarChart
553
+ outerRadius={160}
554
+ data={aspectRadarData}
555
+ >
556
+ <PolarGrid gridType="polygon" />
557
+ <PolarAngleAxis
558
+ dataKey="category"
559
+ tick={{ fill: "#4b5563", fontSize: 12 }}
560
+ tickLine={false}
561
+ />
562
+ <PolarRadiusAxis
563
+ angle={90}
564
+ domain={[0, 100]}
565
+ axisLine={false}
566
+ tick={{ fontSize: 12 }}
567
+ tickCount={5}
568
+ />
569
+ {models
570
+ ?.filter(m => selectedModels.includes(m.model))
571
+ .map((model) => (
572
+ <Radar
573
+ key={model.model}
574
+ name={model.model}
575
+ dataKey={model.model}
576
+ stroke={model.color}
577
+ fill={model.color}
578
+ fillOpacity={0.2}
579
+ strokeWidth={2}
580
+ />
581
+ ))}
582
+ <Tooltip content={<CustomTooltip />} />
583
+ <Legend />
584
+ </RadarChart>
585
+ </ResponsiveContainer>
586
+ </div>
587
+
588
+ <div className="mt-2 text-xs text-gray-500 text-center">
589
+ Aspect scores for {formatFacetName(selectedFacet)} (0-100 scale)
590
+ </div>
591
+ </div>
592
+ </div>
593
+
594
+ {/* Top Performers by Aspect Table */}
595
+ <div className="border rounded-lg overflow-hidden">
596
+ <div className="px-4 py-2 bg-gray-50 border-b">
597
+ <h3 className="font-semibold">Top Performers by Aspect in {formatFacetName(selectedFacet || '')}</h3>
598
+ </div>
599
+ <div className="p-4">
600
+ <table className="min-w-full divide-y divide-gray-200">
601
+ <thead>
602
+ <tr>
603
+ <th className="px-3 py-2 text-left text-xs font-medium text-gray-500 uppercase tracking-wider">Aspect</th>
604
+ <th className="px-3 py-2 text-left text-xs font-medium text-gray-500 uppercase tracking-wider">Best Model</th>
605
+ <th className="px-3 py-2 text-left text-xs font-medium text-gray-500 uppercase tracking-wider">Score</th>
606
+ </tr>
607
+ </thead>
608
+ <tbody className="bg-white divide-y divide-gray-200">
609
+ {topAspectPerformers.map((performer, idx) => (
610
+ <tr key={performer.aspect} className={idx % 2 === 0 ? "bg-white" : "bg-gray-50"}>
611
+ <td className="px-3 py-2 font-medium">{performer.aspectName}</td>
612
+ <td className="px-3 py-2">
613
+ <div className="flex items-center">
614
+ <div
615
+ className="w-3 h-3 rounded-full mr-2"
616
+ style={{ backgroundColor: performer.modelObj?.color }}
617
+ ></div>
618
+ <span>{performer.model}</span>
619
+ </div>
620
+ </td>
621
+ <td className="px-3 py-2">
622
+ <span className={`px-2 py-0.5 rounded-full text-sm font-medium ${getScoreBgColor(performer.score)}`}>
623
+ {performer.score.toFixed(1)}
624
+ </span>
625
+ </td>
626
+ </tr>
627
+ ))}
628
+ </tbody>
629
+ </table>
630
+ </div>
631
+ </div>
632
+ </>
633
+ )}
634
+ </>
635
+ );
636
+ };
637
+
638
+ export default MetricsBreakdown;
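Review note: the top-performer lookup in `MetricsBreakdown` can be read as a pure function of `models`, `facets`, and the selected model names. The sketch below restates that logic outside React so it can be exercised in isolation. The function name `topPerformersByFacet` and the sample data are assumptions made for illustration; only the field names `model`, `facet_scores`, and the skipped `repeat_usage` key come from the component.

```javascript
// Sketch: for each facet, pick the highest-scoring model among the selected ones.
function topPerformersByFacet(models, facets, selectedModels) {
  const result = {};
  for (const facet of Object.keys(facets)) {
    if (facet === "repeat_usage") continue; // the component hides this facet
    let best = null;
    for (const m of models) {
      if (!selectedModels.includes(m.model)) continue;
      const score = m.facet_scores?.[facet];
      // Ignore models with no numeric score for this facet.
      if (typeof score === "number" && (best === null || score > best.score)) {
        best = { model: m.model, score };
      }
    }
    if (best) result[facet] = best;
  }
  return result;
}

// Made-up sample data, for illustration only.
const models = [
  { model: "model-a", facet_scores: { helpfulness: 71.2, communication: 64.0 } },
  { model: "model-b", facet_scores: { helpfulness: 68.5, communication: 70.1 } },
  { model: "model-c", facet_scores: { helpfulness: 90.0 } },
];
const facets = { helpfulness: [], communication: [], repeat_usage: [] };

const top = topPerformersByFacet(models, facets, ["model-a", "model-b"]);
console.log(top);
```

Because `model-c` is not in the selection, its higher helpfulness score is ignored, which mirrors how the table reacts to the model toggle buttons.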
leaderboard-app/components/TaskDemographicAnalysis.jsx ADDED
@@ -0,0 +1,1416 @@
1
+ "use client";
2
+
3
+ import React, { useState, useEffect, useMemo } from "react";
4
+ import {
5
+ BarChart,
6
+ Bar,
7
+ XAxis,
8
+ YAxis,
9
+ CartesianGrid,
10
+ Tooltip,
11
+ Legend,
12
+ ResponsiveContainer,
13
+ ReferenceLine,
14
+ Cell,
15
+ } from "recharts";
16
+ import { getScoreBadgeColor } from "../lib/utils";
17
+
18
+ // Helper component for info tooltips
19
+ const InfoTooltip = ({ text }) => {
20
+ const [isVisible, setIsVisible] = useState(false);
21
+
22
+ return (
23
+ <div className="relative inline-block ml-1">
24
+ <button
25
+ className="text-gray-400 hover:text-gray-600 focus:outline-none"
26
+ onMouseEnter={() => setIsVisible(true)}
27
+ onMouseLeave={() => setIsVisible(false)}
28
+ onClick={() => setIsVisible(!isVisible)}
29
+ >
30
+ <svg
31
+ xmlns="http://www.w3.org/2000/svg"
32
+ className="h-4 w-4"
33
+ viewBox="0 0 20 20"
34
+ fill="currentColor"
35
+ >
36
+ <path
37
+ fillRule="evenodd"
38
+ d="M18 10a8 8 0 11-16 0 8 8 0 0116 0zm-7-4a1 1 0 11-2 0 1 1 0 012 0zM9 9a1 1 0 000 2v3a1 1 0 001 1h1a1 1 0 100-2v-3a1 1 0 00-1-1H9z"
39
+ clipRule="evenodd"
40
+ />
41
+ </svg>
42
+ </button>
43
+ {isVisible && (
44
+ <div className="absolute z-10 w-64 p-2 bg-white border rounded shadow-lg text-xs text-gray-700 -translate-x-1/2 left-1/2 mt-1">
45
+ {text}
46
+ </div>
47
+ )}
48
+ </div>
49
+ );
50
+ };
51
+
52
+ // Format facet names for display
53
+ const formatFacetName = (facet) => {
54
+ const facetMap = {
55
+ helpfulness: "Helpfulness",
56
+ communication: "Communication",
57
+ insightful: "Insightfulness",
58
+ adaptiveness: "Adaptiveness",
59
+ trustworthiness: "Trustworthiness",
60
+ personality: "Personality",
61
+ background_and_culture: "Cultural Awareness",
62
+ };
63
+
64
+ return (
65
+ facetMap[facet] ||
66
+ (facet
67
+ ? facet.replace(/_/g, " ").replace(/\b\w/g, (l) => l.toUpperCase())
68
+ : facet)
69
+ );
70
+ };
71
+
72
+ // Filter tag component for displaying active filters
73
+ const FilterTag = ({ label, onRemove }) => (
74
+ <div className="inline-flex items-center px-2 py-1 mr-2 mb-2 text-xs font-medium rounded-full bg-blue-100 text-blue-800">
75
+ {label}
76
+ {onRemove && (
77
+ <button
78
+ onClick={onRemove}
79
+ className="ml-1 text-blue-600 hover:text-blue-800 focus:outline-none"
80
+ >
81
+ <svg
82
+ xmlns="http://www.w3.org/2000/svg"
83
+ className="h-3 w-3"
84
+ viewBox="0 0 20 20"
85
+ fill="currentColor"
86
+ >
87
+ <path
88
+ fillRule="evenodd"
89
+ d="M10 18a8 8 0 100-16 8 8 0 000 16zM8.707 7.293a1 1 0 00-1.414 1.414L8.586 10l-1.293 1.293a1 1 0 101.414 1.414L10 11.414l1.293 1.293a1 1 0 001.414-1.414L11.414 10l1.293-1.293a1 1 0 00-1.414-1.414L10 8.586 8.707 7.293z"
90
+ clipRule="evenodd"
91
+ />
92
+ </svg>
93
+ </button>
94
+ )}
95
+ </div>
96
+ );
97
+
98
+ /* Clean, minimal insight component inspired by the equity ranking design */
99
+ const CleanInsightItem = ({ insight, index, models }) => {
100
+ // Extract model names and metrics from the insight text
101
+ const enhanceText = (text) => {
102
+ // Highlight model names and numbers in a single pass so that digits inside
103
+ // a model name (e.g. a version number) are not wrapped in <strong> separately
104
+ const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
105
+ const byName = new Map(models.map((m) => [m.model, m]));
106
+ const names = [...byName.keys()].filter(Boolean).sort((a, b) => b.length - a.length);
107
+ const pattern = new RegExp(
108
+ [...names.map(escapeRegExp), "\\d+\\.?\\d*"].join("|"),
109
+ "g"
110
+ );
111
+ const enhancedText = text.replace(pattern, (match) => {
112
+ const model = byName.get(match);
113
+ return model
114
+ ? `<span class="font-medium" style="color: ${model.color}">${match}</span>`
115
+ : `<strong>${match}</strong>`;
116
+ });
117
+ return enhancedText;
118
+ };
119
+
120
+ // Determine the type of insight for styling
121
+ const getInsightType = (text) => {
122
+ if (
123
+ text.includes("performs best") ||
124
+ text.includes("excellent equity") ||
125
+ text.includes("achieves the highest")
126
+ ) {
127
+ return "positive";
128
+ } else if (
129
+ text.includes("potential equity concerns") ||
130
+ text.includes("worst") ||
131
+ text.includes("gap between")
132
+ ) {
133
+ return "negative";
134
+ } else if (text.includes("point gap")) {
135
+ return "comparison";
136
+ } else {
137
+ return "info";
138
+ }
139
+ };
140
+
141
+ // Get color based on insight type
142
+ const getTypeColor = (type) => {
143
+ switch (type) {
144
+ case "positive":
145
+ return "text-green-700 bg-green-50";
146
+ case "negative":
147
+ return "text-red-700 bg-red-50";
148
+ case "comparison":
149
+ return "text-blue-700 bg-blue-50";
150
+ default:
151
+ return "text-gray-700 bg-gray-50";
152
+ }
153
+ };
154
+
155
+ const insightType = getInsightType(insight);
156
+ const typeColor = getTypeColor(insightType);
157
+
158
+ return (
159
+ <div className="flex items-start py-3 px-4 border-b last:border-b-0">
160
+ <div className="flex-shrink-0 mr-3">
161
+ <div
162
+ className={`w-7 h-7 rounded-full flex items-center justify-center ${typeColor}`}
163
+ >
164
+ <span className="text-xs font-semibold">{index + 1}</span>
165
+ </div>
166
+ </div>
167
+ <div className="flex-grow">
168
+ <p
169
+ className="text-sm text-gray-800"
170
+ dangerouslySetInnerHTML={{ __html: enhanceText(insight) }}
171
+ />
172
+ </div>
173
+ </div>
174
+ );
175
+ };
176
+
177
+ const TaskDemographicAnalysis = ({ data }) => {
178
+ // Analysis controls state
179
+ const [selectedTask, setSelectedTask] = useState("all");
180
+ const [selectedDemographic, setSelectedDemographic] = useState("all");
181
+ const [selectedModel, setSelectedModel] = useState(null);
182
+ const [selectedMetric, setSelectedMetric] = useState("overall_score");
183
+ const [viewMode, setViewMode] = useState("absolute"); // 'absolute' or 'relative'
184
+ const [showAllModels, setShowAllModels] = useState(true);
185
+ const [groupBy, setGroupBy] = useState("task"); // 'task', 'demographic', or 'combined'
186
+ const [keyInsightsVisible, setKeyInsightsVisible] = useState(true);
187
+
188
+ // Extracting data
189
+ const {
190
+ models,
191
+ taskData,
192
+ taskCategories,
193
+ demographicSummary,
194
+ demographicOptions,
195
+ fairnessMetrics,
196
+ facets,
197
+ } = data;
198
+
199
+ // Initialize selectedModel if not set
200
+ useEffect(() => {
201
+ if (!selectedModel && models.length > 0) {
202
+ setSelectedModel(models[0].model);
203
+ }
204
+ }, [models, selectedModel]);
205
+
206
+ // Handle group by changes - reset and disable other filters as needed
207
+ useEffect(() => {
208
+ if (groupBy === "task" && selectedDemographic !== "all") {
209
+ // When grouping by task, reset demographic to 'all'
210
+ setSelectedDemographic("all");
211
+ } else if (groupBy === "demographic" && selectedTask !== "all") {
212
+ // When grouping by demographic, reset task to 'all'
213
+ setSelectedTask("all");
214
+ }
215
+ }, [groupBy, selectedDemographic, selectedTask]);
216
+
217
+ // Function to get all tasks (flat list)
218
+ const getAllTasks = () => {
219
+ const allTasks = [];
220
+ if (taskData) {
221
+ taskData.forEach((task) => {
222
+ if (!allTasks.includes(task.task)) {
223
+ allTasks.push(task.task);
224
+ }
225
+ });
226
+ }
227
+ return allTasks.sort();
228
+ };
229
+
230
+ // Get task options including "All Tasks" and categories
231
+ const taskOptions = useMemo(() => {
232
+ // Start with "All Tasks" option
233
+ const allTasksOption = { value: "all", label: "All Tasks" };
234
+
235
+ // Group tasks by category
236
+ const categorizedTasks = {};
237
+ const uncategorizedTasks = [];
238
+
239
+ // Get all tasks and their categories
240
+ getAllTasks().forEach((task) => {
241
+ const taskInfo = taskData.find((t) => t.task === task);
242
+ if (taskInfo && taskInfo.category) {
243
+ if (!categorizedTasks[taskInfo.category]) {
244
+ categorizedTasks[taskInfo.category] = [];
245
+ }
246
+ categorizedTasks[taskInfo.category].push({
247
+ value: task,
248
+ label: task,
249
+ });
250
+ } else {
251
+ uncategorizedTasks.push({
252
+ value: task,
253
+ label: task,
254
+ });
255
+ }
256
+ });
257
+
258
+ // Format for select rendering
259
+ return {
260
+ allTasksOption,
261
+ categories: Object.keys(taskCategories || {}).map((category) => ({
262
+ label: `${category.charAt(0).toUpperCase() + category.slice(1)} Tasks`,
263
+ value: category,
264
+ isCategory: true,
265
+ })),
266
+ categorizedTasks,
267
+ uncategorizedTasks,
268
+ };
269
+ }, [taskData, taskCategories]);
270
+
271
+ // Helper function to get task label
272
+ const getTaskLabel = (taskValue) => {
273
+ // Check if it's "all tasks"
274
+ if (taskValue === "all") {
275
+ return "All Tasks";
276
+ }
277
+
278
+ // Check if it's a category
279
+ const category = taskOptions.categories.find((c) => c.value === taskValue);
280
+ if (category) {
281
+ return category.label;
282
+ }
283
+
284
+ // Look in categorized tasks
285
+ for (const tasks of Object.values(
286
+ taskOptions.categorizedTasks
287
+ )) {
288
+ const task = tasks.find((t) => t.value === taskValue);
289
+ if (task) {
290
+ return task.label;
291
+ }
292
+ }
293
+
294
+ // Check uncategorized tasks
295
+ const uncategorizedTask = taskOptions.uncategorizedTasks.find(
296
+ (t) => t.value === taskValue
297
+ );
298
+ if (uncategorizedTask) {
299
+ return uncategorizedTask.label;
300
+ }
301
+
302
+ // Fallback to the value itself
303
+ return taskValue;
304
+ };
305
+
306
+ // Get filtered performance data based on selected filters
307
+ const getFilteredPerformanceData = () => {
308
+ if (!taskData) return [];
309
+
310
+ let filteredData = [...taskData];
311
+
312
+ // Filter by task or task category
313
+ if (selectedTask !== "all") {
314
+ // Check if it's a category
315
+ const isCategory = Object.keys(taskCategories || {}).includes(
316
+ selectedTask
317
+ );
318
+
319
+ if (isCategory) {
320
+ // Filter by category
321
+ filteredData = filteredData.filter(
322
+ (item) => item.category === selectedTask
323
+ );
324
+ } else {
325
+ // Filter by specific task
326
+ filteredData = filteredData.filter(
327
+ (item) => item.task === selectedTask
328
+ );
329
+ }
330
+ }
331
+
332
+ // For relative view, we need to transform the data
333
+ if (viewMode === "relative") {
334
+ // Transform data for relative view (regardless of grouping type)
335
+ return filteredData.map((item) => {
336
+ // Create a copy of the item
337
+ const newItem = { ...item };
338
+
339
+ // Get all valid model scores for this item
340
+ const modelScores = [];
341
+ models.forEach((model) => {
342
+ if (typeof newItem[model.model] === "number") {
343
+ modelScores.push(newItem[model.model]);
344
+ }
345
+ });
346
+
347
+ // Calculate average if we have scores
348
+ if (modelScores.length > 0) {
349
+ const avgScore =
350
+ modelScores.reduce((sum, score) => sum + score, 0) /
351
+ modelScores.length;
352
+
353
+ // Convert all scores to relative to average
354
+ models.forEach((model) => {
355
+ if (typeof newItem[model.model] === "number") {
356
+ newItem[model.model] = newItem[model.model] - avgScore;
357
+ }
358
+ });
359
+ }
360
+
361
+ return newItem;
362
+ });
363
+ }
364
+
365
+ // For absolute view or if we can't do relative, return filtered data as is
366
+ return filteredData;
367
+ };
368
+
369
+ // Calculate model equity based on current filters
370
+ const calculateModelEquity = () => {
371
+ if (!demographicSummary || !demographicOptions) {
372
+ return models.map((model) => ({
373
+ model: model.model,
374
+ avgGap: 0,
375
+ color: model.color,
376
+ }));
377
+ }
378
+
379
+ // Get task-specific category if needed
380
+ let taskCategory = null;
381
+ let specificTask = null;
382
+
383
+ if (selectedTask !== "all") {
384
+ // Check whether the selection is a task category or a specific task
385
+ const isCategory =
386
+ taskCategories && Object.keys(taskCategories).includes(selectedTask);
387
+
388
+ if (isCategory) {
389
+ taskCategory = selectedTask;
390
+ } else {
391
+ specificTask = selectedTask;
392
+ // Find the category for this task
393
+ const taskInfo = taskData.find((t) => t.task === selectedTask);
394
+ if (taskInfo && taskInfo.category) {
395
+ taskCategory = taskInfo.category;
396
+ }
397
+ }
398
+ }
399
+
400
+ // Get task-specific performance data for reference
401
+ const taskPerformanceData = getFilteredPerformanceData();
402
+
403
+ // Build a lookup of model scores by task, keeping only valid numeric scores
404
+ const taskPerformanceLookup = {};
405
+ let hasTaskSpecificData = false;
406
+
407
+ if (specificTask) {
408
+ // For a specific task, create lookup
409
+ taskPerformanceData.forEach((item) => {
410
+ if (item.task === specificTask) {
411
+ models.forEach((model) => {
412
+ const modelName = model.model;
413
+ const score = item[modelName];
414
+
415
+ if (typeof score === "number" && !isNaN(score)) {
416
+ if (!taskPerformanceLookup[modelName]) {
417
+ taskPerformanceLookup[modelName] = {};
418
+ }
419
+ taskPerformanceLookup[modelName][specificTask] = score;
420
+ hasTaskSpecificData = true;
421
+ }
422
+ });
423
+ }
424
+ });
425
+ } else if (taskCategory) {
426
+ // For a task category, gather all tasks in that category
427
+ taskPerformanceData.forEach((item) => {
428
+ if (item.category === taskCategory) {
429
+ models.forEach((model) => {
430
+ const modelName = model.model;
431
+ const score = item[modelName];
432
+
433
+ if (typeof score === "number" && !isNaN(score)) {
434
+ if (!taskPerformanceLookup[modelName]) {
435
+ taskPerformanceLookup[modelName] = {};
436
+ }
437
+ taskPerformanceLookup[modelName][item.task] = score;
438
+ hasTaskSpecificData = true;
439
+ }
440
+ });
441
+ }
442
+ });
443
+ }
444
+
445
+ return models
446
+ .map((model) => {
447
+ const modelName = model.model;
448
+ const gaps = [];
449
+
450
+ // For each demographic dimension
451
+ Object.keys(demographicOptions).forEach((demo) => {
452
+ // Skip if we're filtering to a specific demographic and this isn't it
453
+ if (selectedDemographic !== "all" && demo !== selectedDemographic) {
454
+ return;
455
+ }
456
+
457
+ const demoValues = demographicOptions[demo];
458
+ if (!demoValues || demoValues.length < 2) return; // Need at least 2 groups to measure a gap
459
+
460
+ // Get scores for each demographic value within this dimension
461
+ const demoScores = [];
462
+
463
+ demoValues.forEach((value) => {
464
+ // First check if we have demographic data for this model and value
465
+ const modelDemoData =
466
+ demographicSummary[demo]?.[value]?.models?.[modelName];
467
+ if (!modelDemoData) return;
468
+
469
+ let score = null;
470
+
471
+ if (selectedMetric === "overall_score") {
472
+ // Prefer task-specific scores when a task or category is selected
473
+ if (
474
+ specificTask &&
475
+ taskPerformanceLookup[modelName] &&
476
+ typeof taskPerformanceLookup[modelName][specificTask] ===
477
+ "number"
478
+ ) {
479
+ // Reuse the task-level score for every demographic group.
480
+ // NOTE: the score is then identical across groups, so the gap computed below is always 0
481
+ score = taskPerformanceLookup[modelName][specificTask];
482
+ } else if (
483
+ taskCategory &&
484
+ Object.keys(taskPerformanceLookup[modelName] || {}).length > 0
485
+ ) {
486
+ // For a category, average the task scores
487
+ const taskScores = Object.values(
488
+ taskPerformanceLookup[modelName]
489
+ );
490
+ if (taskScores.length > 0) {
491
+ score =
492
+ taskScores.reduce((sum, s) => sum + s, 0) /
493
+ taskScores.length;
494
+ } else {
495
+ // Fallback to overall if we don't have category scores
496
+ score = modelDemoData.overall_score;
497
+ }
498
+ } else {
499
+ // Default to overall score
500
+ score = modelDemoData.overall_score;
501
+ }
502
+ } else if (selectedMetric === "repeat_usage_pct") {
503
+ score = modelDemoData.repeat_usage_pct;
504
+ } else if (selectedMetric.startsWith("facet_")) {
505
+ const facet = selectedMetric.replace("facet_", "");
506
+ if (
507
+ modelDemoData.facet_scores &&
508
+ facet in modelDemoData.facet_scores
509
+ ) {
510
+ score = modelDemoData.facet_scores[facet];
511
+ }
512
+ }
513
+
514
+ // Only add valid scores
515
+ if (score !== null && typeof score === "number" && !isNaN(score)) {
516
+ demoScores.push({
517
+ value,
518
+ score,
519
+ });
520
+ }
521
+ });
522
+
523
+ // Gap for this dimension: highest minus lowest group score (needs at least 2 groups)
524
+ if (demoScores.length >= 2) {
525
+ const sortedScores = [...demoScores].sort(
526
+ (a, b) => a.score - b.score
527
+ );
528
+ const lowest = sortedScores[0];
529
+ const highest = sortedScores[sortedScores.length - 1];
530
+ const gap = highest.score - lowest.score;
531
+
532
+ // Only include valid gaps
533
+ if (!isNaN(gap)) {
534
+ gaps.push({
535
+ demo,
536
+ gap,
537
+ lowestGroup: lowest.value,
538
+ lowestScore: lowest.score,
539
+ highestGroup: highest.value,
540
+ highestScore: highest.score,
541
+ });
542
+ }
543
+ }
544
+ });
545
+
546
+ // Average the gaps across dimensions; default to 0 when no gap could be computed
547
+ const avgGap =
548
+ gaps.length > 0
549
+ ? gaps.reduce((sum, g) => sum + g.gap, 0) / gaps.length
550
+ : 0;
551
+
552
+ // For a specific demographic, get the exact gap
553
+ const specificGap =
554
+ selectedDemographic !== "all"
555
+ ? gaps.find((g) => g.demo === selectedDemographic)?.gap || 0
556
+ : avgGap;
557
+
558
+ return {
559
+ model: modelName,
560
+ avgGap: selectedDemographic === "all" ? avgGap : specificGap,
561
+ color: model.color,
562
+ gaps,
563
+ };
564
+ })
565
+ .sort((a, b) => a.avgGap - b.avgGap); // Sort by avg gap (lower is better)
566
+ };
567
+
568
+
569
+ // Generate key insights as structured objects ({ type, ... }) for the current filters
570
+ const generateKeyInsights = () => {
571
+ const structuredInsights = [];
572
+
573
+ // Only generate meaningful insights when we have sufficient data
574
+ if (!taskData || !demographicSummary) {
575
+ return [{ type: "info", message: "Not enough data to generate insights." }];
576
+ }
577
+
578
+ // Get the filtered data
579
+ const filteredData = getFilteredPerformanceData();
580
+ const equityData = calculateModelEquity();
581
+
582
+ // If we have data for performance comparison
583
+ if (filteredData.length > 0) {
584
+ // Find best performing model for the current filter set
585
+ const bestModel = { model: null, score: -Infinity };
586
+ const worstModel = { model: null, score: Infinity };
587
+
588
+ // Extract scores based on groupBy and selected data
589
+ if (groupBy === "task") {
590
+ // Find best performance across all tasks
591
+ filteredData.forEach((task) => {
592
+ models.forEach((model) => {
593
+ const score = task[model.model];
594
+ if (typeof score === "number" && score > bestModel.score) {
595
+ bestModel.model = model.model;
596
+ bestModel.score = score;
597
+ bestModel.task = task.task || task.label;
598
+ bestModel.modelObj = model;
599
+ }
600
+ if (typeof score === "number" && score < worstModel.score) {
601
+ worstModel.model = model.model;
602
+ worstModel.score = score;
603
+ worstModel.task = task.task || task.label;
604
+ worstModel.modelObj = model;
605
+ }
606
+ });
607
+ });
608
+
609
+ // Create contextual insights based on current filters
610
+ if (bestModel.model) {
611
+ let taskContext = bestModel.task;
612
+ let insightTitle = "";
613
+
614
+ if (selectedTask === "all") {
615
+ insightTitle = `Best for ${taskContext}`;
616
+ } else if (Object.keys(taskCategories || {}).includes(selectedTask)) {
617
+ insightTitle = `Best for ${selectedTask} Tasks`;
618
+ } else {
619
+ insightTitle = `Best for ${selectedTask}`;
620
+ taskContext = selectedTask;
621
+ }
622
+
623
+ structuredInsights.push({
624
+ type: "performance",
625
+ model: bestModel.model,
626
+ modelObj: bestModel.modelObj,
627
+ score: bestModel.score,
628
+ task: taskContext,
629
+ title: insightTitle
630
+ });
631
+ }
632
+
633
+ if (bestModel.model && worstModel.model && bestModel.model !== worstModel.model) {
634
+ const gap = bestModel.score - worstModel.score;
635
+ if (gap > 15) { // Only show significant gaps
636
+ structuredInsights.push({
637
+ type: "gap",
638
+ gap: gap,
639
+ model1: bestModel.model,
640
+ model1Obj: bestModel.modelObj,
641
+ model2: worstModel.model,
642
+ model2Obj: worstModel.modelObj,
643
+ context: selectedTask !== "all" ? selectedTask : "across all tasks"
644
+ });
645
+ }
646
+ }
647
+ } else if (groupBy === "demographic" && selectedDemographic !== "all") {
648
+ // Similar logic for demographic insights...
649
+ }
650
+ }
651
+
652
+ // Add equity insights when we have equity data
653
+ if (equityData.length > 0) {
654
+ const mostEquitable = equityData[0];
655
+ const leastEquitable = equityData[equityData.length - 1];
656
+
657
+ // Get model objects
658
+ const mostEquitableModelObj = models.find(m => m.model === mostEquitable.model);
659
+ const leastEquitableModelObj = models.find(m => m.model === leastEquitable.model);
660
+
661
+ // Only show equity insights if there's a meaningful difference
662
+ if (mostEquitable.avgGap < 10 && (leastEquitable.avgGap - mostEquitable.avgGap > 10)) {
663
+ let demoContext = selectedDemographic === "all" ? "all demographics" : selectedDemographic;
664
+
665
+ structuredInsights.push({
666
+ type: "equity",
667
+ model: mostEquitable.model,
668
+ modelObj: mostEquitableModelObj,
669
+ gap: mostEquitable.avgGap,
670
+ demographic: demoContext,
671
+ task: selectedTask !== "all" ? selectedTask : ""
672
+ });
673
+ }
674
+
675
+ if (leastEquitable.avgGap > 20) {
676
+ let demoContext = selectedDemographic === "all" ? "demographic groups" : `${selectedDemographic} groups`;
677
+
678
+ structuredInsights.push({
679
+ type: "concern",
680
+ model: leastEquitable.model,
681
+ modelObj: leastEquitableModelObj,
682
+ gap: leastEquitable.avgGap,
683
+ demographic: demoContext,
684
+ task: selectedTask !== "all" ? selectedTask : ""
685
+ });
686
+ }
687
+ }
688
+
689
+ return structuredInsights.length > 0 ? structuredInsights :
690
+ [{ type: "info", message: "Try different filter combinations to discover more insights." }];
691
+ };
692
+
693
+ // Card component that renders a single structured insight
694
+ const KeyInsightCard = ({ insight }) => {
695
+ // Determine card styling based on insight type
696
+ const getCardConfig = () => {
697
+ switch (insight.type) {
698
+ case "performance":
699
+ return {
700
+ backgroundColor: "bg-white",
701
+ dotColor: "bg-indigo-500",
702
+ icon: "🏆",
703
+ title: insight.title || "Top Performer"
704
+ };
705
+ case "equity":
706
+ return {
707
+ backgroundColor: "bg-white",
708
+ dotColor: "bg-purple-500",
709
+ icon: "⚖️",
710
+ title: "Equity Champion"
711
+ };
712
+ case "gap":
713
+ return {
714
+ backgroundColor: "bg-white",
715
+ dotColor: "bg-amber-500",
716
+ icon: "📊",
717
+ title: "Performance Gap"
718
+ };
719
+ case "concern":
720
+ return {
721
+ backgroundColor: "bg-white",
722
+ dotColor: "bg-red-500",
723
+ icon: "⚠️",
724
+ title: "Potential Concern"
725
+ };
726
+ default:
727
+ return {
728
+ backgroundColor: "bg-white",
729
+ dotColor: "bg-gray-500",
730
+ icon: "ℹ️",
731
+ title: "Note"
732
+ };
733
+ }
734
+ };
735
+
736
+ const config = getCardConfig();
737
+
738
+ return (
739
+ <div className={`border rounded-lg overflow-hidden ${config.backgroundColor}`}>
740
+ {/* Card Header */}
741
+ <div className="border-b bg-white px-4 py-2">
742
+ <h4 className="font-medium text-gray-800 flex items-center">
743
+ <span className={`w-3 h-3 rounded-full ${config.dotColor} mr-2`}></span>
744
+ {config.title}
745
+ </h4>
746
+ </div>
747
+
748
+ {/* Card Content */}
749
+ <div className="p-4">
750
+ {/* Performance Card */}
751
+ {insight.type === "performance" && (
752
+ <div className="flex items-center">
753
+ <div className={`h-10 w-10 text-2xl rounded-full flex items-center justify-center mr-3`}>
754
+ {config.icon}
755
+ </div>
756
+ <div>
757
+ <div className="font-medium" style={{ color: insight.modelObj?.color || '#6B7280' }}>
758
+ {insight.model}
759
+ </div>
760
+ <div className="text-sm text-gray-600">
761
+ Score: {insight.score.toFixed(1)}
762
+ </div>
763
+ </div>
764
+ </div>
765
+ )}
766
+
767
+ {/* Equity Card */}
768
+ {insight.type === "equity" && (
769
+ <>
770
+ <div className="flex items-center">
771
+ <div className={`h-10 w-10 text-2xl rounded-full flex items-center justify-center mr-3`}>
772
+ {config.icon}
773
+ </div>
774
+ <div>
775
+ <div className="font-medium" style={{ color: insight.modelObj?.color || '#6B7280' }}>
776
+ {insight.model}
777
+ </div>
778
+ <div className="text-sm text-gray-600">
779
+ Equity Gap: {insight.gap.toFixed(1)}
780
+ </div>
781
+ </div>
782
+ </div>
783
+ <div className="mt-3 text-sm">
784
+ Consistent across {insight.demographic}
785
+ </div>
786
+ </>
787
+ )}
788
+
789
+ {/* Gap Card */}
790
+ {insight.type === "gap" && (
791
+ <>
792
+ <div className="flex items-center mb-3">
793
+ <div className={`h-10 w-10 text-2xl rounded-full flex items-center justify-center mr-3`}>
794
+ {config.icon}
795
+ </div>
796
+ <div>
797
+ <div className="font-medium">Gap: {insight.gap.toFixed(1)} points</div>
798
+ </div>
799
+ </div>
800
+ <div className="flex justify-between items-center">
801
+ <div style={{ color: insight.model1Obj?.color || '#6B7280' }} className="font-medium">
802
+ {insight.model1}
803
+ </div>
804
+ <div className="text-gray-500 mx-2">vs</div>
805
+ <div style={{ color: insight.model2Obj?.color || '#6B7280' }} className="font-medium">
806
+ {insight.model2}
807
+ </div>
808
+ </div>
809
+ {insight.context !== "across all tasks" && (
810
+ <div className="mt-2 text-sm text-gray-700">
811
+ on {insight.context}
812
+ </div>
813
+ )}
814
+ </>
815
+ )}
816
+
817
+ {/* Concern Card */}
818
+ {insight.type === "concern" && (
819
+ <>
820
+ <div className="flex items-center">
821
+ <div className={`h-10 w-10 text-2xl rounded-full flex items-center justify-center mr-3`}>
822
+ {config.icon}
823
+ </div>
824
+ <div>
825
+ <div className="font-medium" style={{ color: insight.modelObj?.color || '#6B7280' }}>
826
+ {insight.model}
827
+ </div>
828
+ <div className="text-sm text-gray-600">
829
+ Disparity: {insight.gap.toFixed(1)} points
830
+ </div>
831
+ </div>
832
+ </div>
833
+ <div className="mt-3 text-sm">
834
+ Between {insight.demographic}
835
+ {insight.task && ` on ${insight.task}`}
836
+ </div>
837
+ </>
838
+ )}
839
+
840
+ {/* Info Card */}
841
+ {insight.type === "info" && (
842
+ <div className="text-sm text-gray-700">
843
+ {insight.message}
844
+ </div>
845
+ )}
846
+ </div>
847
+ </div>
848
+ );
849
+ };
850
+
851
+ // Render the collapsible Key Insights panel
852
+ const renderKeyInsightsPanel = () => {
853
+ // Get structured insights for the current filters
854
+ const structuredInsights = generateKeyInsights();
855
+
856
+ return (
857
+ <div className="border rounded-lg overflow-hidden mb-6 shadow-sm">
858
+ <div
859
+ className="px-4 py-3 bg-white flex justify-between items-center cursor-pointer"
860
+ onClick={() => setKeyInsightsVisible(!keyInsightsVisible)}
861
+ >
862
+ <h3 className="font-semibold flex items-center text-gray-800">
863
+ <svg xmlns="http://www.w3.org/2000/svg" className="h-5 w-5 mr-2 text-blue-500" viewBox="0 0 20 20" fill="currentColor">
864
+ <path d="M11 3a1 1 0 10-2 0v1a1 1 0 102 0V3zM15.657 5.757a1 1 0 00-1.414-1.414l-.707.707a1 1 0 001.414 1.414l.707-.707zM18 10a1 1 0 01-1 1h-1a1 1 0 110-2h1a1 1 0 011 1zM5.05 6.464A1 1 0 106.464 5.05l-.707-.707a1 1 0 00-1.414 1.414l.707.707zM5 10a1 1 0 01-1 1H3a1 1 0 110-2h1a1 1 0 011 1zM8 16v-1h4v1a2 2 0 11-4 0zM12 14c.015-.34.208-.646.477-.859a4 4 0 10-4.954 0c.27.213.462.519.476.859h4.002z" />
865
+ </svg>
866
+ Key Insights
867
+ </h3>
868
+ <div className="flex items-center">
869
+ {structuredInsights.length > 0 && (
870
+ <span className="text-xs bg-blue-500 text-white rounded-full px-2 py-0.5 mr-2">
871
+ {structuredInsights.length}
872
+ </span>
873
+ )}
874
+ <div className="text-gray-500">
875
+ {keyInsightsVisible ? (
876
+ <svg xmlns="http://www.w3.org/2000/svg" className="h-5 w-5" viewBox="0 0 20 20" fill="currentColor">
877
+ <path fillRule="evenodd" d="M5.293 7.293a1 1 0 011.414 0L10 10.586l3.293-3.293a1 1 0 111.414 1.414l-4 4a1 1 0 01-1.414 0l-4-4a1 1 0 010-1.414z" clipRule="evenodd" />
878
+ </svg>
879
+ ) : (
880
+ <svg xmlns="http://www.w3.org/2000/svg" className="h-5 w-5" viewBox="0 0 20 20" fill="currentColor">
881
+ <path fillRule="evenodd" d="M14.707 12.707a1 1 0 01-1.414 0L10 9.414l-3.293 3.293a1 1 0 01-1.414-1.414l4-4a1 1 0 011.414 0l4 4a1 1 0 010 1.414z" clipRule="evenodd" />
882
+ </svg>
883
+ )}
884
+ </div>
885
+ </div>
886
+ </div>
887
+ {keyInsightsVisible && (
888
+ <div className="p-4">
889
+ {structuredInsights.length > 0 && structuredInsights[0].type !== "info" ? (
890
+ <div className="grid grid-cols-1 md:grid-cols-2 gap-4">
891
+ {structuredInsights.map((insight, index) => (
892
+ <KeyInsightCard key={index} insight={insight} />
893
+ ))}
894
+ </div>
895
+ ) : (
896
+ <div className="py-6 text-center text-gray-500">
897
+ <svg xmlns="http://www.w3.org/2000/svg" className="h-8 w-8 mx-auto mb-2 text-gray-400" fill="none" viewBox="0 0 24 24" stroke="currentColor">
898
+ <path strokeLinecap="round" strokeLinejoin="round" strokeWidth={2} d="M13 16h-1v-4h-1m1-4h.01M21 12a9 9 0 11-18 0 9 9 0 0118 0z" />
899
+ </svg>
900
+ <p>{structuredInsights[0].message || "No insights available for current filter selection"}</p>
901
+ <p className="text-sm mt-1">Try adjusting your filters to see insights</p>
902
+ </div>
903
+ )}
904
+ </div>
905
+ )}
906
+ </div>
907
+ );
908
+ };
909
+
910
+ // Get data for visualization
911
+ const performanceData = getFilteredPerformanceData();
912
+ const equityRankings = calculateModelEquity();
913
+ const keyInsights = generateKeyInsights();
914
+
915
+ // Custom tooltip for the bar chart
916
+ const PerformanceTooltip = ({ active, payload, label }) => {
917
+ if (active && payload && payload.length) {
918
+ return (
919
+ <div className="bg-white p-3 border rounded shadow-sm">
920
+ <p className="font-medium">{label}</p>
921
+ <div className="mt-2">
922
+ {payload.map((entry, index) => {
923
+ // Skip entries that don't have model data
924
+ if (!entry.name || entry.name.includes("_std")) return null;
925
+
926
+ // Find the corresponding standard deviation if available
927
+ const stdKey = `${entry.name}_std`;
928
+ const stdEntry = payload.find((p) => p.dataKey === stdKey);
929
+ const stdValue = stdEntry ? stdEntry.value : 0;
930
+
931
+ return (
932
+ <div key={index} className="flex items-center text-sm mb-1">
933
+ <div
934
+ className="w-3 h-3 rounded-full mr-1"
935
+ style={{ backgroundColor: entry.color }}
936
+ ></div>
937
+ <span className="mr-2">{entry.name}:</span>
938
+ <span className="font-medium">
939
+ {entry.value.toFixed(2)}{" "}
940
+ {stdValue ? `± ${stdValue.toFixed(2)}` : ""}
941
+ </span>
942
+ </div>
943
+ );
944
+ })}
945
+ </div>
946
+ </div>
947
+ );
948
+ }
949
+ return null;
950
+ };
951
+
952
+ // Get formatted metric name
953
+ const getMetricName = (metric) => {
954
+ if (metric === "overall_score") return "Overall Score";
955
+ if (metric === "repeat_usage_pct") return "Would Use Again";
956
+ if (metric.startsWith("facet_")) {
957
+ const facet = metric.replace("facet_", "");
958
+ return formatFacetName(facet);
959
+ }
960
+ return metric;
961
+ };
962
+
963
+ return (
964
+ <div>
965
+ {/* Analysis Controls Panel */}
966
+ <div className="border rounded-lg overflow-hidden mb-6">
967
+ <div className="px-4 py-2 bg-gray-50 border-b">
968
+ <h3 className="font-semibold">Analysis Controls</h3>
969
+ </div>
970
+ <div className="p-4 grid grid-cols-1 md:grid-cols-3 gap-4">
971
+ <div>
972
+ <label className="block text-sm font-medium text-gray-700 mb-2">
973
+ Group By
974
+ </label>
975
+ <select
976
+ className="w-full border rounded-md px-3 py-2 bg-white shadow-sm focus:outline-none focus:ring-2 focus:ring-blue-500"
977
+ value={groupBy}
978
+ onChange={(e) => setGroupBy(e.target.value)}
979
+ >
980
+ <option value="task">Task</option>
981
+ <option value="demographic">Demographic</option>
982
+ <option value="combined">Task × Demographic</option>
983
+ </select>
984
+ </div>
985
+
986
+ <div>
987
+ <label className="block text-sm font-medium text-gray-700 mb-2">
988
+ Task
989
+ </label>
990
+ <select
991
+ className={`w-full border rounded-md px-3 py-2 shadow-sm focus:outline-none focus:ring-2 focus:ring-blue-500 ${
992
+ groupBy === "demographic"
993
+ ? "bg-gray-100 text-gray-500"
994
+ : "bg-white"
995
+ }`}
996
+ value={selectedTask}
997
+ onChange={(e) => setSelectedTask(e.target.value)}
998
+ disabled={groupBy === "demographic"}
999
+ >
1000
+ {/* Show "All Tasks" at the top */}
1001
+ <option value={taskOptions.allTasksOption.value}>
1002
+ {taskOptions.allTasksOption.label}
1003
+ </option>
1004
+
1005
+ {/* Show categories at second level */}
1006
+ <optgroup label="Task Categories">
1007
+ {taskOptions.categories.map((category) => (
1008
+ <option key={category.value} value={category.value}>
1009
+ {category.label}
1010
+ </option>
1011
+ ))}
1012
+ </optgroup>
1013
+
1014
+ {/* Show tasks grouped by category */}
1015
+ {Object.entries(taskOptions.categorizedTasks).map(
1016
+ ([category, tasks]) => (
1017
+ <optgroup
1018
+ key={category}
1019
+ label={`${
1020
+ category.charAt(0).toUpperCase() + category.slice(1)
1021
+ } Tasks`}
1022
+ >
1023
+ {tasks.map((task) => (
1024
+ <option key={task.value} value={task.value}>
1025
+ {task.label}
1026
+ </option>
1027
+ ))}
1028
+ </optgroup>
1029
+ )
1030
+ )}
1031
+
1032
+ {/* Show uncategorized tasks if any */}
1033
+ {taskOptions.uncategorizedTasks.length > 0 && (
1034
+ <optgroup label="Other Tasks">
1035
+ {taskOptions.uncategorizedTasks.map((task) => (
1036
+ <option key={task.value} value={task.value}>
1037
+ {task.label}
1038
+ </option>
1039
+ ))}
1040
+ </optgroup>
1041
+ )}
1042
+ </select>
1043
+ </div>
1044
+
+           <div>
+             <label className="block text-sm font-medium text-gray-700 mb-2">
+               Demographic Dimension
+               {groupBy === "task" && (
+                 <span className="ml-2 text-xs text-gray-500">
+                   (Disabled when grouping by task)
+                 </span>
+               )}
+             </label>
+             <select
+               className={`w-full border rounded-md px-3 py-2 shadow-sm focus:outline-none focus:ring-2 focus:ring-blue-500 ${
+                 groupBy === "task" ? "bg-gray-100 text-gray-500" : "bg-white"
+               }`}
+               value={selectedDemographic}
+               onChange={(e) => setSelectedDemographic(e.target.value)}
+               disabled={groupBy === "task"}
+             >
+               <option value="all">All Demographics (Average)</option>
+               {Object.keys(demographicOptions || {}).map((demo) => (
+                 <option key={demo} value={demo}>
+                   {demo.charAt(0).toUpperCase() + demo.slice(1)}
+                 </option>
+               ))}
+             </select>
+           </div>
+
+           <div>
+             <label className="block text-sm font-medium text-gray-700 mb-2">
+               Metric
+             </label>
+             <select
+               className="w-full border rounded-md px-3 py-2 bg-white shadow-sm focus:outline-none focus:ring-2 focus:ring-blue-500"
+               value={selectedMetric}
+               onChange={(e) => setSelectedMetric(e.target.value)}
+             >
+               <option value="overall_score">Overall Score</option>
+               <option value="repeat_usage_pct">Would Use Again (%)</option>
+               {Object.keys(facets || {})
+                 .filter((f) => f !== "repeat_usage")
+                 .map((facet) => (
+                   <option key={facet} value={`facet_${facet}`}>
+                     {formatFacetName(facet)}
+                   </option>
+                 ))}
+             </select>
+           </div>
+
+           <div>
+             <label className="block text-sm font-medium text-gray-700 mb-2">
+               Model
+             </label>
+             <select
+               className="w-full border rounded-md px-3 py-2 bg-white shadow-sm focus:outline-none focus:ring-2 focus:ring-blue-500"
+               value={selectedModel || ""}
+               onChange={(e) => setSelectedModel(e.target.value)}
+             >
+               {models.map((model) => (
+                 <option key={model.model} value={model.model}>
+                   {model.model}
+                 </option>
+               ))}
+             </select>
+           </div>
+
+           <div>
+             <label className="block text-sm font-medium text-gray-700 mb-2">
+               Display Options
+             </label>
+             <div className="flex flex-wrap gap-2">
+               <button
+                 className={`px-3 py-1 text-xs font-medium rounded ${
+                   showAllModels
+                     ? "bg-blue-100 text-blue-800 border border-blue-300"
+                     : "bg-gray-100 text-gray-800 border border-gray-300"
+                 }`}
+                 onClick={() => setShowAllModels(true)}
+               >
+                 All Models
+               </button>
+               <button
+                 className={`px-3 py-1 text-xs font-medium rounded ${
+                   !showAllModels
+                     ? "bg-blue-100 text-blue-800 border border-blue-300"
+                     : "bg-gray-100 text-gray-800 border border-gray-300"
+                 }`}
+                 onClick={() => setShowAllModels(false)}
+               >
+                 Selected Only
+               </button>
+               <button
+                 className={`px-3 py-1 text-xs font-medium rounded ${
+                   viewMode === "absolute"
+                     ? "bg-blue-100 text-blue-800 border border-blue-300"
+                     : "bg-gray-100 text-gray-800 border border-gray-300"
+                 }`}
+                 onClick={() => setViewMode("absolute")}
+               >
+                 Absolute
+               </button>
+               <button
+                 className={`px-3 py-1 text-xs font-medium rounded ${
+                   viewMode === "relative"
+                     ? "bg-blue-100 text-blue-800 border border-blue-300"
+                     : "bg-gray-100 text-gray-800 border border-gray-300"
+                 }`}
+                 onClick={() => setViewMode("relative")}
+                 title="Show performance relative to the average across models"
+               >
+                 Relative
+               </button>
+             </div>
+           </div>
+         </div>
+       </div>
+
+       {/* Active Filters Display */}
+       <div className="mb-6">
+         <div className="text-sm font-medium text-gray-700 mb-2">
+           Active Filters:
+         </div>
+         <div className="flex flex-wrap">
+           {selectedTask !== "all" && (
+             <FilterTag
+               label={`Task: ${getTaskLabel(selectedTask)}`}
+               onRemove={() => setSelectedTask("all")}
+             />
+           )}
+           {selectedDemographic !== "all" && (
+             <FilterTag
+               label={`Demographic: ${
+                 selectedDemographic.charAt(0).toUpperCase() +
+                 selectedDemographic.slice(1)
+               }`}
+               onRemove={() => setSelectedDemographic("all")}
+             />
+           )}
+           {!showAllModels && (
+             <FilterTag
+               label={`Model: ${selectedModel}`}
+               onRemove={() => setShowAllModels(true)}
+             />
+           )}
+           <FilterTag label={`Metric: ${getMetricName(selectedMetric)}`} />
+           <FilterTag
+             label={`Group by: ${
+               groupBy.charAt(0).toUpperCase() + groupBy.slice(1)
+             }`}
+           />
+         </div>
+       </div>
+
+       {/* Key Insights Panel */}
+       {renderKeyInsightsPanel()}
+
+       {/* Performance Comparison Visualization */}
+       <div className="border rounded-lg overflow-hidden mb-6">
+         <div className="px-4 py-2 bg-gray-50 border-b">
+           <h3 className="font-semibold">
+             {getMetricName(selectedMetric)} by{" "}
+             {groupBy.charAt(0).toUpperCase() + groupBy.slice(1)}
+             {viewMode === "relative" && " (Relative to Average)"}
+           </h3>
+         </div>
+         <div className="p-4">
+           {performanceData.length > 0 ? (
+             <div className="h-96">
+               <ResponsiveContainer width="100%" height="100%">
+                 <BarChart
+                   data={performanceData}
+                   layout="vertical"
+                   margin={{ top: 20, right: 30, left: 0, bottom: 5 }}
+                 >
+                   <CartesianGrid strokeDasharray="3 3" />
+                   <XAxis
+                     type="number"
+                     domain={
+                       viewMode === "relative"
+                         ? // For relative mode, use symmetrical domain based on max deviation
+                           (dataMax) => {
+                             // Find max absolute deviation
+                             const maxDev = performanceData.reduce(
+                               (max, item) => {
+                                 let itemMax = max;
+                                 models.forEach((model) => {
+                                   if (typeof item[model.model] === "number") {
+                                     itemMax = Math.max(
+                                       itemMax,
+                                       Math.abs(item[model.model])
+                                     );
+                                   }
+                                 });
+                                 return itemMax;
+                               },
+                               0
+                             );
+                             // Round up to nearest 5
+                             const scaledMax = Math.ceil(maxDev / 5) * 5;
+                             // Use symmetrical domain
+                             return [-scaledMax, scaledMax];
+                           }
+                         : // For absolute mode, use original scale range
+                         selectedMetric.startsWith("facet_")
+                         ? [-100, 100]
+                         : [0, 100]
+                     }
+                     tickFormatter={(value) => {
+                       if (viewMode === "relative") {
+                         return value > 0
+                           ? `+${value.toFixed(0)}`
+                           : value.toFixed(0);
+                       }
+                       return value.toFixed(0);
+                     }}
+                   />
+                   <YAxis
+                     dataKey={groupBy === "task" ? "task" : "label"}
+                     type="category"
+                     width={150}
+                     tick={{ fontSize: 12 }}
+                   />
+                   <Tooltip content={<PerformanceTooltip />} />
+                   <Legend />
+                   {(showAllModels
+                     ? models
+                     : [models.find((m) => m.model === selectedModel)].filter(
+                         Boolean
+                       )
+                   ).map((model) => (
+                     <Bar
+                       key={model.model}
+                       dataKey={model.model}
+                       name={model.model}
+                       fill={model.color}
+                       maxBarSize={25}
+                     >
+                       {viewMode === "relative" &&
+                         performanceData.map((entry, index) => {
+                           const value = entry[model.model];
+                           return (
+                             <Cell
+                               key={`cell-${index}`}
+                               fill={
+                                 value >= 0 ? model.color : `${model.color}80`
+                               } // Lighter shade for negative values
+                             />
+                           );
+                         })}
+                     </Bar>
+                   ))}
+                   {viewMode === "relative" && (
+                     <ReferenceLine x={0} stroke="#666" strokeDasharray="3 3" />
+                   )}
+                 </BarChart>
+               </ResponsiveContainer>
+             </div>
+           ) : (
+             <div className="flex items-center justify-center h-60 bg-gray-50 rounded">
+               <div className="text-center p-4">
+                 <svg
+                   xmlns="http://www.w3.org/2000/svg"
+                   className="h-10 w-10 mx-auto text-gray-400 mb-3"
+                   fill="none"
+                   viewBox="0 0 24 24"
+                   stroke="currentColor"
+                 >
+                   <path
+                     strokeLinecap="round"
+                     strokeLinejoin="round"
+                     strokeWidth={2}
+                     d="M13 16h-1v-4h-1m1-4h.01M21 12a9 9 0 11-18 0 9 9 0 0118 0z"
+                   />
+                 </svg>
+                 <h3 className="text-lg font-medium text-gray-900 mb-1">
+                   No Data Available
+                 </h3>
+                 <p className="text-sm text-gray-600">
+                   There is no data available for the selected filters. Try
+                   adjusting your selections.
+                 </p>
+                 {groupBy === "combined" && (
+                   <p className="text-sm text-gray-600 mt-2">
+                     Note: Task × Demographic view requires specific data that
+                     may not be available.
+                   </p>
+                 )}
+               </div>
+             </div>
+           )}
+           <div className="mt-4 text-sm text-gray-600 text-center">
+             {viewMode === "absolute"
+               ? `${getMetricName(selectedMetric)} by ${groupBy}`
+               : `Performance relative to average across models (positive is better than average)`}
+           </div>
+         </div>
+       </div>
+
+       {/* Model Equity Rankings */}
+       <div className="border rounded-lg overflow-hidden mb-6">
+         <div className="px-4 py-2 bg-gray-50 border-b flex justify-between items-center">
+           <h3 className="font-semibold">Model Equity Rankings</h3>
+           <span className="text-xs text-gray-500">
+             Lower gaps indicate more consistent performance across demographic
+             groups
+           </span>
+         </div>
+         <div className="p-4">
+           <div className="space-y-3">
+             {equityRankings.map((model, index) => {
+               const pct = 100 - (model.avgGap / 30) * 100; // Scale to percentage where 100% = perfect equity
+               return (
+                 <div key={model.model} className="relative">
+                   <div className="flex items-center mb-1">
+                     <div className="w-6 text-sm text-gray-500">
+                       {index + 1}.
+                     </div>
+                     <div
+                       className="w-8 h-8 flex items-center justify-center rounded-full mr-2"
+                       style={{ backgroundColor: model.color }}
+                     >
+                       <span className="text-white font-bold text-xs">
+                         {index + 1}
+                       </span>
+                     </div>
+                     <span className="text-sm font-medium mr-2">
+                       {model.model}
+                     </span>
+                     <span
+                       className={`ml-auto px-2 py-1 text-xs font-semibold rounded-full ${
+                         model.avgGap < 10
+                           ? "bg-green-100 text-green-800"
+                           : model.avgGap < 20
+                           ? "bg-blue-100 text-blue-800"
+                           : "bg-yellow-100 text-yellow-800"
+                       }`}
+                     >
+                       {model.avgGap.toFixed(2)} avg gap
+                     </span>
+                   </div>
+                   <div className="h-2 w-full bg-gray-200 rounded-full overflow-hidden">
+                     <div
+                       className="h-full rounded-full"
+                       style={{
+                         width: `${Math.min(100, Math.max(0, pct))}%`,
+                         backgroundColor: model.color,
+                       }}
+                     ></div>
+                   </div>
+                 </div>
+               );
+             })}
+           </div>
+           <div className="mt-4 text-xs text-gray-500 grid grid-cols-3 gap-2">
+             <div className="flex items-center">
+               <div className="w-3 h-3 bg-green-100 mr-1 rounded"></div>
+               <span>&lt; 10: Excellent equity</span>
+             </div>
+             <div className="flex items-center">
+               <div className="w-3 h-3 bg-blue-100 mr-1 rounded"></div>
+               <span>10 - 20: Good equity</span>
+             </div>
+             <div className="flex items-center">
+               <div className="w-3 h-3 bg-yellow-100 mr-1 rounded"></div>
+               <span>&ge; 20: Potential disparity</span>
+             </div>
+           </div>
+         </div>
+       </div>
+     </div>
+   );
+ };
+
+ export default TaskDemographicAnalysis;
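
Review note: the relative-mode `domain` callback on `XAxis` computes a symmetric range inline, which makes it hard to test. A minimal sketch of the same computation as a standalone function follows. The name `symmetricDomain`, the `step` parameter, the sample rows, and the fallback for empty or all-zero data are assumptions for illustration and are not part of this commit.

```javascript
// Hypothetical helper (not in the commit): mirrors the inline XAxis domain
// callback used when viewMode === "relative".
function symmetricDomain(rows, modelNames, step = 5) {
  let maxDev = 0;
  for (const row of rows) {
    for (const name of modelNames) {
      const value = row[name];
      // Skip missing values; the finite check is an added assumption so a
      // stray NaN cannot poison Math.max.
      if (typeof value === "number" && Number.isFinite(value)) {
        maxDev = Math.max(maxDev, Math.abs(value));
      }
    }
  }
  // Round up to the next multiple of `step`. Falling back to `step` for
  // empty or all-zero data is an assumption, so the axis never collapses
  // to [0, 0].
  const bound = Math.ceil(maxDev / step) * step || step;
  return [-bound, bound];
}

// Made-up sample rows shaped like performanceData (one key per model name).
const sampleRows = [
  { task: "summarization", "gpt-4o": 3.2, o1: -7.4 },
  { task: "coding", "gpt-4o": -1.1, o1: 4.0 },
];
console.log(symmetricDomain(sampleRows, ["gpt-4o", "o1"]));
```

With a helper of this shape, the component could pass `symmetricDomain(performanceData, models.map((m) => m.model))` as the `domain` value, since the result is a plain `[min, max]` pair.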
leaderboard-app/eslint.config.mjs ADDED
@@ -0,0 +1,14 @@
+ import { dirname } from "path";
+ import { fileURLToPath } from "url";
+ import { FlatCompat } from "@eslint/eslintrc";
+
+ const __filename = fileURLToPath(import.meta.url);
+ const __dirname = dirname(__filename);
+
+ const compat = new FlatCompat({
+   baseDirectory: __dirname,
+ });
+
+ const eslintConfig = [...compat.extends("next/core-web-vitals")];
+
+ export default eslintConfig;
leaderboard-app/jsconfig.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "compilerOptions": {
+     "paths": {
+       "@/*": ["./*"]
+     }
+   }
+ }
leaderboard-app/lib/utils.js ADDED
@@ -0,0 +1,205 @@
+ /**
+  * Prepares the data for visualization by adding colors and formatting
+  * @param {Object} rawData - The raw data from the JSON file
+  * @returns {Object} - Processed data ready for visualization
+  */
+ export function prepareDataForVisualization(rawData) {
+   // Define model colors for consistent visualization
+   const MODEL_COLORS = {
+     'gpt-4o': '#19AADE',
+     'claude-3.7-sonnet': '#4A35C5',
+     'deepseek-r1': '#FFA319',
+     'o1': '#EF4444',
+     'gemini-2.0-flash-001': '#22C55E',
+     'llama-3.1-405b-instruct': '#8B5CF6'
+   };
+
+   // Add colors to model data
+   const modelsWithColors = rawData.models.map(model => ({
+     ...model,
+     color: MODEL_COLORS[model.model] || '#999999' // Fallback color if not defined
+   }));
+
+   // Create an easier lookup for models by name
+   const modelsMap = modelsWithColors.reduce((acc, model) => {
+     acc[model.model] = model;
+     return acc;
+   }, {});
+
+   // Add best model indicators for each task category
+   const taskCategories = { ...rawData.taskCategories };
+   const bestModelPerCategory = {};
+
+   Object.keys(taskCategories).forEach(category => {
+     let bestModel = null;
+     let highestScore = -Infinity;
+     let stdDev = 0;
+
+     modelsWithColors.forEach(model => {
+       // Compare against null rather than truthiness so a score of 0 still counts
+       if (model.tasks && model.tasks[category] != null && model.tasks[category] > highestScore) {
+         highestScore = model.tasks[category];
+         bestModel = model.model;
+         stdDev = model.tasks_std?.[category] || 0;
+       }
+     });
+
+     bestModelPerCategory[category] = {
+       model: bestModel,
+       score: highestScore,
+       std: stdDev,
+       color: MODEL_COLORS[bestModel] || '#999999'
+     };
+   });
+
+   // Add best model indicators for each metric group
+   const metricGroups = { ...rawData.metricGroups };
+   const bestModelPerMetricGroup = {};
+
+   Object.keys(metricGroups).forEach(group => {
+     let bestModel = null;
+     let highestScore = -Infinity;
+     let stdDev = 0;
+
+     modelsWithColors.forEach(model => {
+       // Compare against null rather than truthiness so a score of 0 still counts
+       if (model.metric_groups && model.metric_groups[group] != null && model.metric_groups[group] > highestScore) {
+         highestScore = model.metric_groups[group];
+         bestModel = model.model;
+         stdDev = model.metric_groups_std?.[group] || 0;
+       }
+     });
+
+     bestModelPerMetricGroup[group] = {
+       model: bestModel,
+       score: highestScore,
+       std: stdDev,
+       color: MODEL_COLORS[bestModel] || '#999999'
+     };
+   });
+
+   // Add best model indicators for each facet
+   const bestModelPerFacet = {};
+
+   // Extract facets from the data
+   const facets = {};
+   if (rawData.facets) {
+     // If facets are already provided in the raw data
+     Object.assign(facets, rawData.facets);
+   } else {
+     // Try to extract facets from the radar data
+     if (rawData.radarData && rawData.radarData.length > 0) {
+       rawData.radarData.forEach(item => {
+         if (item.category && item.category !== "Would Use Again") {
+           const facetName = item.category.toLowerCase().replace(/\s+/g, '_');
+           facets[facetName] = [];
+         }
+       });
+     }
+   }
+
+   // Find best model for each facet
+   Object.keys(facets).forEach(facet => {
+     if (facet === 'repeat_usage') return; // Skip repeat_usage
+
+     let bestModel = null;
+     let highestScore = -Infinity;
+     let stdDev = 0;
+
+     modelsWithColors.forEach(model => {
+       // Check if the model has facet scores
+       if (model.facet_scores && model.facet_scores[facet] !== undefined) {
+         const score = model.facet_scores[facet];
+         if (score > highestScore) {
+           highestScore = score;
+           bestModel = model.model;
+           stdDev = model.facet_scores[`${facet}_std`] || 0;
+         }
+       }
+     });
+
+     if (bestModel) {
+       bestModelPerFacet[facet] = {
+         model: bestModel,
+         score: highestScore,
+         std: stdDev,
+         color: MODEL_COLORS[bestModel] || '#999999'
+       };
+     }
+   });
+
+   // Format task data for visualization
+   const taskData = rawData.taskData.map(task => {
+     // Find best model for this task
+     let bestModel = null;
+     let highestScore = -Infinity;
+
+     Object.entries(task).forEach(([key, value]) => {
+       if (modelsMap[key] && value !== null && value > highestScore) {
+         highestScore = value;
+         bestModel = key;
+       }
+     });
+
+     return {
+       ...task,
+       bestModel,
+       bestModelColor: bestModel ? MODEL_COLORS[bestModel] : null,
+       bestScore: highestScore !== -Infinity ? highestScore : null
+     };
+   });
+
+   return {
+     models: modelsWithColors,
+     modelsMap,
+     taskData,
+     radarData: rawData.radarData,
+     taskCategories,
+     metricGroups,
+     facets,
+     bestModelPerCategory,
+     bestModelPerMetricGroup,
+     bestModelPerFacet,
+     // Pass through demographic data fields
+     demographicSummary: rawData.demographicSummary,
+     fairnessMetrics: rawData.fairnessMetrics,
+     demographicOptions: rawData.demographicOptions,
+     keyFacetsByTaskCategory: rawData.keyFacetsByTaskCategory,
+     keyAspectsByTask: rawData.keyAspectsByTask
+   };
+ }
+
+ /**
+  * Determine styling based on score
+  * @param {number} score - The score to evaluate
+  * @param {number} min - The minimum possible score (default: 0)
+  * @param {number} max - The maximum possible score (default: 100)
+  * @returns {string} - CSS class for the score badge
+  */
+ export function getScoreBadgeColor(score, min = 0, max = 100) {
+   // For facet scores (-100 to +100)
+   if (min < 0) {
+     if (score >= 50) return 'bg-green-100 text-green-800';
+     if (score >= 0) return 'bg-blue-100 text-blue-800';
+     if (score >= -50) return 'bg-yellow-100 text-yellow-800';
+     return 'bg-red-100 text-red-800';
+   }
+
+   // For aspect scores (0 to 100)
+   const range = max - min;
+   const percent = ((score - min) / range) * 100;
+
+   if (percent >= 80) return 'bg-green-100 text-green-800';
+   if (percent >= 60) return 'bg-blue-100 text-blue-800';
+   if (percent >= 40) return 'bg-yellow-100 text-yellow-800';
+   return 'bg-red-100 text-red-800';
+ }
+
+ /**
+  * Format likert score for display (-3 to +3 scale)
+  * @param {number} score - The likert score
+  * @returns {string} - Formatted score string
+  */
+ export function formatLikertScore(score) {
+   const formatted = score.toFixed(1);
+   if (score > 0) return `+${formatted}`;
+   return formatted;
+ }
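
Review note: `prepareDataForVisualization` repeats the same "highest score wins" loop for task categories, metric groups, and facets. One possible way to factor it is sketched below as a standalone function. The name `pickBest`, the accessor argument, and the sample models are hypothetical and do not appear in the commit; the sketch also returns `null` when no model reports a score, which is an assumption and differs from the `-Infinity` the original loops leave in place.

```javascript
// Hypothetical helper (not in the commit): returns the model with the
// highest numeric score, or null when no model reports one.
function pickBest(models, getScore) {
  let best = null;
  for (const model of models) {
    const score = getScore(model);
    // Only real numbers take part; 0 and negative scores are valid.
    if (typeof score !== "number" || Number.isNaN(score)) continue;
    if (best === null || score > best.score) {
      best = { model: model.model, score };
    }
  }
  return best;
}

// Made-up sample data shaped like the `models` array in rawData.
const sampleModels = [
  { model: "model-a", tasks: { writing: 0 } },
  { model: "model-b", tasks: { writing: -12.5 } },
  { model: "model-c", tasks: {} },
];
console.log(pickBest(sampleModels, (m) => m.tasks?.writing));
```

The accessor keeps the helper independent of the field being ranked, so the same function could serve `tasks`, `metric_groups`, and `facet_scores`.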
leaderboard-app/next.config.mjs ADDED
@@ -0,0 +1,4 @@
+ /** @type {import('next').NextConfig} */
+ const nextConfig = {};
+
+ export default nextConfig;
leaderboard-app/package-lock.json ADDED
The diff for this file is too large to render. See raw diff
 
leaderboard-app/package.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "name": "leaderboard-app",
+   "version": "0.1.0",
+   "private": true,
+   "scripts": {
+     "dev": "next dev",
+     "build": "next build",
+     "start": "next start",
+     "lint": "next lint"
+   },
+   "dependencies": {
+     "next": "15.2.3",
+     "react": "^19.0.0",
+     "react-dom": "^19.0.0",
+     "recharts": "^2.15.1"
+   },
+   "devDependencies": {
+     "@eslint/eslintrc": "^3",
+     "@tailwindcss/postcss": "^4",
+     "eslint": "^9",
+     "eslint-config-next": "15.2.3",
+     "tailwindcss": "^4"
+   }
+ }
leaderboard-app/postcss.config.mjs ADDED
@@ -0,0 +1,5 @@
+ const config = {
+   plugins: ["@tailwindcss/postcss"],
+ };
+
+ export default config;
leaderboard-app/public/llm_comparison_data.json ADDED
The diff for this file is too large to render. See raw diff
 
leaderboard-app/public/vercel.svg ADDED