Spaces: Cyberlace/latihan-artikulasi


fariedalfarizi committed 8 days ago

Commit

797d38d

1 Parent(s): 056938d

MAJOR UPGRADE: Whisper Large V3, 30+ Indonesian phonetics, Gradio JSON API, optimized weights

Files changed (6)

README.md +59 -5
app.py +12 -7
app/api_gradio.py +236 -0
app/interface.py +1 -1
core/constants.py +30 -30
core/scoring_engine.py +68 -69

README.md CHANGED

@@ -12,7 +12,7 @@ license: mit
 # 🎤 Sistem Penilaian Vokal Indonesia v2.0
-An Indonesian vocal articulation assessment system using **Whisper Medium ASR** and advanced audio signal processing.
 ## 🌟 Features
@@ -26,12 +26,66 @@ Sistem penilaian artikulasi vokal bahasa Indonesia menggunakan **Whisper Medium
 ### 6 Comprehensive Metrics
-1. **Clarity Score**: Pronunciation clarity, measured via Whisper ASR accuracy
 2. **Energy Score**: Volume and vocal energy quality
-3. **Speech Rate**: Speaking speed (syllables per second)
-4. **Pitch Consistency**: Pitch stability
 5. **SNR Score**: Signal-to-Noise Ratio (recording quality)
-6. **Articulation Score**: Articulation clarity from spectral analysis
 ## 🚀 How to Use

 # 🎤 Sistem Penilaian Vokal Indonesia v2.0
+An Indonesian vocal articulation assessment system using **Whisper Large V3** (decoding configured for Indonesian) and advanced audio signal processing.
 ## 🌟 Features
 ### 6 Comprehensive Metrics
+1. **Clarity Score (60% for Level 1)**: Pronunciation clarity via Whisper Large V3
 2. **Energy Score**: Volume and vocal energy quality
+3. **Speech Rate (Level 3-5)**: Optimal speaking speed
+4. **Pitch Consistency (Level 3-5)**: Pitch stability
 5. **SNR Score**: Signal-to-Noise Ratio (recording quality)
+6. **Articulation Score (15% for Level 1)**: Spectral articulation clarity
+### JSON API (Gradio-based)
+A JSON API with a structured response is available for integration:
+- **Tab 1**: UI Assessment (visual interface)
+- **Tab 2**: JSON API (RESTful response)
+- **Python Client**: `gradio_client` compatible
+- **Response Format**: Structured JSON with scores, feedback, suggestions
+## 🎯 Optimized Scoring Weights
+| Level | Clarity | Articulation | Speech Rate | Pitch | Energy | SNR |
+|-------|---------|--------------|-------------|-------|--------|-----|
+| 1     | 60%     | 15%          | 0%          | 0%    | 15%    | 10% |
+| 2     | 55%     | 20%          | 0%          | 0%    | 15%    | 10% |
+| 3     | 50%     | 15%          | 10%         | 5%    | 10%    | 10% |
+| 4     | 40%     | 10%          | 20%         | 15%   | 10%    | 5%  |
+| 5     | 35%     | 10%          | 25%         | 15%   | 10%    | 5%  |
+## 📡 API Usage
+### Gradio Python Client
+```python
+from gradio_client import Client, handle_file
+# Pass the Space ID ("owner/name"), not the huggingface.co page URL
+client = Client("Cyberlace/latihan-artikulasi")
+result = client.predict(
+    audio_file=handle_file("audio.wav"),  # gradio_client >= 1.0; older clients take the plain path
+    target_text="A",
+    level=1,
+    api_name="/score_audio_api"
+)
+print(result["data"]["overall"]["score"])  # e.g. 95.5
+print(result["data"]["transcription"]["detected"])  # e.g. "A"
+```
+### JSON Response Structure
+```json
+{
+  "success": true,
+  "data": {
+    "overall": {"score": 95.5, "grade": "A", "level": 1},
+    "transcription": {"target": "A", "detected": "A", "similarity": 100.0, "wer": 0.0},
+    "scores": {...},
+    "feedback": {"message": "...", "suggestions": [...]},
+    "audio_features": {...}
+  }
+}
+```
 ## 🚀 How to Use
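
A caller will usually want to separate the success envelope from the error envelope before reading any scores. The helper below is a minimal sketch built only on the response fields shown in the README diff above (`success`, `data.overall`, `error`, `code`). The names `extract_overall` and `ScoringError` are hypothetical and are not part of the Space.

```python
from typing import Any, Dict, Tuple


class ScoringError(Exception):
    """Hypothetical exception that carries the API's error code."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code


def extract_overall(response: Dict[str, Any]) -> Tuple[float, str]:
    """Return (score, grade) from a response, or raise ScoringError."""
    if not response.get("success"):
        # Error envelopes carry "error" and "code" instead of "data".
        raise ScoringError(
            response.get("code", "UNKNOWN"),
            response.get("error", "no error message"),
        )
    overall = response["data"]["overall"]
    return float(overall["score"]), str(overall["grade"])


# Sample payload copied from the README's JSON response structure.
ok = {"success": True, "data": {"overall": {"score": 95.5, "grade": "A", "level": 1}}}
print(extract_overall(ok))
```

Raising on the error envelope keeps the error code available to the caller as `exc.code` rather than leaving it buried in a dictionary.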

app.py CHANGED

@@ -12,21 +12,26 @@ logging.getLogger("starlette").setLevel(logging.ERROR)
 logging.getLogger("uvicorn").setLevel(logging.ERROR)
 from app.interface import create_interface, initialize_model
-from api.routes import app as fastapi_app
 if __name__ == '__main__':
     print('Starting Vocal Articulation Assessment System v2.0...')
-    # Initialize model
     initialize_model()
-    # Create Gradio interface
-    demo = create_interface()
-    # Mount Gradio to FastAPI (correct order!)
-    app = gr.mount_gradio_app(fastapi_app, demo, path="/")
-    # Launch with specific config
     demo.launch(
         server_name='0.0.0.0',
         server_port=7860,

 logging.getLogger("uvicorn").setLevel(logging.ERROR)
 from app.interface import create_interface, initialize_model
+from app.api_gradio import create_api_interface
 if __name__ == '__main__':
     print('Starting Vocal Articulation Assessment System v2.0...')
+    # Initialize model once
     initialize_model()
+    # Create UI and API interfaces
+    ui_demo = create_interface()
+    api_demo = create_api_interface()
+    # Combine both interfaces with tabs
+    demo = gr.TabbedInterface(
+        [ui_demo, api_demo],
+        ["🎤 Assessment UI", "📡 JSON API"],
+        title="Vocal Articulation System v2.0"
+    )
+    # Launch
     demo.launch(
         server_name='0.0.0.0',
         server_port=7860,

app/api_gradio.py ADDED

@@ -0,0 +1,236 @@
+# =======================================
+# GRADIO API ENDPOINT - JSON Response
+# Alternative to FastAPI for HuggingFace Spaces
+# =======================================
+import gradio as gr
+import json
+from typing import Dict, Any
+from app.interface import initialize_model
+def score_audio_api(
+    audio_file: str,
+    target_text: str,
+    level: int
+) -> Dict[str, Any]:
+    """
+    API endpoint for scoring audio. Returns structured JSON.
+    Args:
+        audio_file: Path to the audio file
+        target_text: Target text that should be spoken
+        level: Articulation level (1-5)
+    Returns:
+        JSON response with the complete structure
+    """
+    try:
+        scorer = initialize_model()
+        # Validate input
+        if not audio_file:
+            return {
+                "success": False,
+                "error": "No audio file provided",
+                "code": "MISSING_AUDIO"
+            }
+        if not target_text or not target_text.strip():
+            return {
+                "success": False,
+                "error": "No target text provided",
+                "code": "MISSING_TEXT"
+            }
+        # Score audio
+        result = scorer.score_audio(
+            audio_path=audio_file,
+            target_text=target_text,
+            level=level
+        )
+        # Return structured JSON
+        return {
+            "success": True,
+            "data": {
+                "overall": {
+                    "score": result.overall_score,
+                    "grade": result.grade,
+                    "level": result.level
+                },
+                "transcription": {
+                    "target": result.target,
+                    "detected": result.transcription,
+                    "similarity": round(result.similarity * 100, 2),
+                    "wer": round(result.wer * 100, 2)
+                },
+                "scores": {
+                    "clarity": result.clarity_score,
+                    "energy": result.energy_score,
+                    "speech_rate": result.speech_rate_score,
+                    "pitch_consistency": result.pitch_consistency_score,
+                    "snr": result.snr_score,
+                    "articulation": result.articulation_score
+                },
+                "feedback": {
+                    "message": result.feedback,
+                    "suggestions": result.suggestions
+                },
+                "audio_features": result.audio_features
+            }
+        }
+    except Exception as e:
+        return {
+            "success": False,
+            "error": str(e),
+            "code": "PROCESSING_ERROR"
+        }
+def create_api_interface():
+    """Create Gradio API interface with JSON output"""
+    with gr.Blocks(
+        title="Vocal Articulation API",
+        theme=gr.themes.Soft(primary_hue="blue")
+    ) as api_demo:
+        gr.Markdown("""
+        # 🎤 Vocal Articulation API v2.0
+        ## RESTful JSON API for Indonesian Vocal Assessment
+        **Model**: Whisper Large V3 (Indonesian Optimized)
+        ### Quick Start
+        1. Upload audio file (MP3, WAV, M4A, etc.)
+        2. Enter target text (what should be spoken)
+        3. Select level (1-5)
+        4. Get JSON response
+        ### API Response Structure
+        ```json
+        {
+          "success": true,
+          "data": {
+            "overall": {
+              "score": 85.5,
+              "grade": "B",
+              "level": 1
+            },
+            "transcription": {
+              "target": "A",
+              "detected": "A",
+              "similarity": 100.0,
+              "wer": 0.0
+            },
+            "scores": {
+              "clarity": 95.2,
+              "energy": 98.5,
+              "speech_rate": 80.0,
+              "pitch_consistency": 75.3,
+              "snr": 100.0,
+              "articulation": 92.1
+            },
+            "feedback": {
+              "message": "Sempurna! Pengucapan Anda sangat baik.",
+              "suggestions": []
+            },
+            "audio_features": {...}
+          }
+        }
+        ```
+        ---
+        """)
+        with gr.Row():
+            with gr.Column():
+                gr.Markdown("### Input")
+                audio_input = gr.Audio(
+                    label="Audio File",
+                    type="filepath",
+                    sources=["upload", "microphone"]
+                )
+                target_input = gr.Textbox(
+                    label="Target Text",
+                    placeholder="e.g., A, BA, PSIKOLOGI",
+                    info="Text yang seharusnya diucapkan"
+                )
+                level_input = gr.Slider(
+                    label="Level (1=Vokal, 5=Kalimat)",
+                    minimum=1,
+                    maximum=5,
+                    value=1,
+                    step=1
+                )
+                submit_btn = gr.Button("Score Audio", variant="primary")
+            with gr.Column():
+                gr.Markdown("### JSON Response")
+                output_json = gr.JSON(
+                    label="API Response",
+                    show_label=True
+                )
+        gr.Markdown("""
+        ---
+        ### Level Descriptions
+        | Level | Name | Description | Examples |
+        |-------|------|-------------|----------|
+        | 1 | Vokal Tunggal | Single vowels | A, I, U, E, O |
+        | 2 | Konsonan Dasar | Basic consonants | BA, PA, DA, TA, KA |
+        | 3 | Suku Kata | Syllable combinations | BA BE BI BO BU |
+        | 4 | Kata Sulit | Complex words | PSIKOLOGI, STRATEGI |
+        | 5 | Kalimat Kompleks | Tongue twisters | ULAR LARI LURUS... |
+        ### Scoring Weights per Level
+        **Level 1-2**: Focus on Clarity (55-60%) + Articulation (15-20%)
+        **Level 3**: Balanced with Speech Rate (10%)
+        **Level 4-5**: Comprehensive with Speech Rate (20-25%) + Pitch (15%)
+        ### Error Codes
+        - `MISSING_AUDIO`: No audio file provided
+        - `MISSING_TEXT`: No target text provided
+        - `PROCESSING_ERROR`: Error during processing
+        ---
+        ### Python Usage Example
+        ```python
+        from gradio_client import Client, handle_file
+        # Pass the Space ID ("owner/name"), not the huggingface.co page URL
+        client = Client("Cyberlace/latihan-artikulasi")
+        result = client.predict(
+            audio_file=handle_file("audio.wav"),  # gradio_client >= 1.0
+            target_text="A",
+            level=1,
+            api_name="/score_audio_api"
+        )
+        print(result)  # JSON response
+        ```
+        """)
+        # Connect button to API function
+        submit_btn.click(
+            fn=score_audio_api,
+            inputs=[audio_input, target_input, level_input],
+            outputs=output_json,
+            api_name="score_audio_api"
+        )
+    return api_demo
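
In `score_audio_api` the model is initialized before the inputs are checked, and `level` is passed through without a range check. One possible arrangement, sketched below, validates the request first and loads the model only if it passes. The function name `validate_request` and the `INVALID_LEVEL` code are assumptions for illustration. The commit defines only `MISSING_AUDIO`, `MISSING_TEXT`, and `PROCESSING_ERROR`.

```python
from typing import Any, Dict, Optional


def validate_request(
    audio_file: Optional[str],
    target_text: Optional[str],
    level: Any,
) -> Optional[Dict[str, Any]]:
    """Return an error envelope, or None when the request is valid."""
    if not audio_file:
        return {"success": False, "error": "No audio file provided", "code": "MISSING_AUDIO"}
    if not target_text or not target_text.strip():
        return {"success": False, "error": "No target text provided", "code": "MISSING_TEXT"}
    try:
        level_int = int(level)  # a client may send 3.0 or "3" instead of 3
    except (TypeError, ValueError, OverflowError):
        level_int = 0  # falls through to the range check below
    if not 1 <= level_int <= 5:
        # INVALID_LEVEL is a hypothetical code, not one defined by the commit.
        return {"success": False, "error": "Level must be between 1 and 5", "code": "INVALID_LEVEL"}
    return None


print(validate_request("audio.wav", "BA", 2))
print(validate_request("", "BA", 2))
```

Returning `None` for a valid request lets the endpoint write `error = validate_request(...)` followed by an early return, so the expensive model call stays out of the failure path.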

app/interface.py CHANGED

@@ -46,7 +46,7 @@ def initialize_model():
     global scorer
     if scorer is None:
-        whisper_model = os.getenv("WHISPER_MODEL", "openai/whisper-medium")
         print(f"Loading Whisper model: {whisper_model}...")
         scorer = AdvancedVocalScoringSystem(whisper_model=whisper_model)
         print("Model loaded!")

     global scorer
     if scorer is None:
+        whisper_model = os.getenv("WHISPER_MODEL", "openai/whisper-large-v3")
         print(f"Loading Whisper model: {whisper_model}...")
         scorer = AdvancedVocalScoringSystem(whisper_model=whisper_model)
         print("Model loaded!")
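
The hunk above does not show whether `initialize_model()` returns the scorer, yet `score_audio_api` in `app/api_gradio.py` relies on its return value. The following is a minimal sketch of a lazy singleton that always returns the cached instance. `DummyScorer` is a stand-in for `AdvancedVocalScoringSystem` so the example runs without downloading a model, and `_scorer` is an illustrative name.

```python
import os
from typing import Optional


class DummyScorer:
    """Stand-in for AdvancedVocalScoringSystem (assumption, for illustration)."""

    def __init__(self, whisper_model: str) -> None:
        self.whisper_model = whisper_model


_scorer: Optional[DummyScorer] = None


def initialize_model() -> DummyScorer:
    """Create the scorer on the first call and return the cached instance afterwards."""
    global _scorer
    if _scorer is None:
        # The environment variable overrides the default, as in the diff above.
        model_name = os.getenv("WHISPER_MODEL", "openai/whisper-large-v3")
        _scorer = DummyScorer(whisper_model=model_name)
    return _scorer


first = initialize_model()
second = initialize_model()
print(first is second)
```

With this shape both `app.py`, which ignores the return value, and the API endpoint, which uses it, can call the same function safely.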

core/constants.py CHANGED

@@ -49,46 +49,46 @@ ARTICULATION_LEVELS = {
     }
 }
-# Scoring weights per level
 LEVEL_WEIGHTS = {
-    1: {  # Single vowels - focus on clarity & articulation
-        'clarity': 0.50,
-        'energy': 0.20,
-        'speech_rate': 0.0,
-        'pitch_consistency': 0.0,
-        'snr': 0.15,
-        'articulation': 0.15
-    },
-    2: {  # Basic consonants - focus on clarity & articulation
-        'clarity': 0.45,
-        'energy': 0.20,
         'speech_rate': 0.0,
         'pitch_consistency': 0.0,
-        'snr': 0.15,
-        'articulation': 0.20
     },
-    3: {  # Syllable combinations - speech rate starts here
-        'clarity': 0.40,
         'energy': 0.15,
         'speech_rate': 0.0,
         'pitch_consistency': 0.0,
-        'snr': 0.20,
-        'articulation': 0.25
-    },
-    4: {  # Difficult words
-        'clarity': 0.45,
-        'energy': 0.15,
-        'speech_rate': 0.15,
-        'pitch_consistency': 0.10,
         'snr': 0.10,
-        'articulation': 0.05
     },
-    5: {  # Complex sentences
-        'clarity': 0.45,
         'energy': 0.10,
-        'speech_rate': 0.20,
-        'pitch_consistency': 0.10,
         'snr': 0.10,
-        'articulation': 0.05
     }
 }

     }
 }
+# Optimized scoring weights per level
 LEVEL_WEIGHTS = {
+    1: {  # Single vowels - MAX clarity & articulation
+        'clarity': 0.60,  # Most important: ASR accuracy
+        'energy': 0.15,
         'speech_rate': 0.0,
         'pitch_consistency': 0.0,
+        'snr': 0.10,
+        'articulation': 0.15  # Important: spectral clarity
     },
+    2: {  # Basic consonants - HIGH clarity
+        'clarity': 0.55,
         'energy': 0.15,
         'speech_rate': 0.0,
         'pitch_consistency': 0.0,
         'snr': 0.10,
+        'articulation': 0.20
     },
+    3: {  # Syllable combinations - BALANCED
+        'clarity': 0.50,
         'energy': 0.10,
+        'speech_rate': 0.10,  # Starts to count
+        'pitch_consistency': 0.05,
         'snr': 0.10,
+        'articulation': 0.15
+    },
+    4: {  # Difficult words - ADD speech rate & pitch
+        'clarity': 0.40,
+        'energy': 0.10,
+        'speech_rate': 0.20,  # Important
+        'pitch_consistency': 0.15,
+        'snr': 0.05,
+        'articulation': 0.10
+    },
+    5: {  # Complex sentences - COMPREHENSIVE
+        'clarity': 0.35,
+        'energy': 0.10,
+        'speech_rate': 0.25,  # Very important
+        'pitch_consistency': 0.15,
+        'snr': 0.05,
+        'articulation': 0.10
     }
 }
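
Because the weights are hand-edited, a quick check that each level sums to 1.0 can catch a slip before it skews scores. The sketch below copies the Level 1 and Level 5 entries from the new `LEVEL_WEIGHTS` and assumes the overall score is a plain weighted sum of the six metric scores. The scoring engine may apply further rules that this diff does not show, and `validate_weights` and `weighted_score` are hypothetical helper names.

```python
import math
from typing import Dict

# Level 1 and Level 5 copied from the new LEVEL_WEIGHTS in this commit.
LEVEL_WEIGHTS: Dict[int, Dict[str, float]] = {
    1: {"clarity": 0.60, "energy": 0.15, "speech_rate": 0.0,
        "pitch_consistency": 0.0, "snr": 0.10, "articulation": 0.15},
    5: {"clarity": 0.35, "energy": 0.10, "speech_rate": 0.25,
        "pitch_consistency": 0.15, "snr": 0.05, "articulation": 0.10},
}


def validate_weights(weights: Dict[int, Dict[str, float]]) -> None:
    """Raise ValueError if any level's weights do not sum to 1.0."""
    for level, table in weights.items():
        total = sum(table.values())
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            raise ValueError(f"Level {level} weights sum to {total}, expected 1.0")


def weighted_score(scores: Dict[str, float], level: int) -> float:
    """Assumed aggregation: sum of weight * metric score for the given level."""
    return sum(w * scores[metric] for metric, w in LEVEL_WEIGHTS[level].items())


validate_weights(LEVEL_WEIGHTS)
# Metric values taken from the sample "scores" object in app/api_gradio.py.
sample = {"clarity": 95.2, "energy": 98.5, "speech_rate": 80.0,
          "pitch_consistency": 75.3, "snr": 100.0, "articulation": 92.1}
print(round(weighted_score(sample, level=1), 2))
```

At Level 1 the speech-rate and pitch scores carry zero weight, so only clarity, energy, SNR, and articulation contribute to the result.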

core/scoring_engine.py CHANGED

@@ -10,8 +10,6 @@ import librosa
 from transformers import (
     WhisperProcessor,
     WhisperForConditionalGeneration,
-    Wav2Vec2Processor,
-    Wav2Vec2ForCTC,
     pipeline
 )
 from typing import Dict, List, Tuple, Optional, Any
@@ -97,32 +95,21 @@ class AdvancedVocalScoringSystem:
     def __init__(
         self,
-        whisper_model: str = "openai/whisper-medium",
-        wav2vec2_model: str = "indonesian-nlp/wav2vec2-indonesian-javanese-sundanese",
         device: str = None
     ):
         """
-        Initialize the system with dual ASR: Wav2Vec2 (Indonesian) + Whisper (fallback)
         Args:
-            whisper_model: Whisper model for Levels 4-5
-            wav2vec2_model: Wav2Vec2 model for Levels 1-3 (Indonesian native)
             device: 'cuda' or 'cpu'
         """
         self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
-        print(f"🔄 Loading Indonesian Wav2Vec2: {wav2vec2_model}...")
-        # Load Wav2Vec2 for Indonesian (better for short audio)
-        self.wav2vec2_processor = Wav2Vec2Processor.from_pretrained(wav2vec2_model)
-        self.wav2vec2_model = Wav2Vec2ForCTC.from_pretrained(wav2vec2_model)
-        self.wav2vec2_model.to(self.device)
-        self.wav2vec2_model.eval()
-        print(f"✅ Wav2Vec2 loaded on {self.device}")
-        print(f"🔄 Loading Whisper model: {whisper_model}...")
-        # Load Whisper model for complex sentences
         self.processor = WhisperProcessor.from_pretrained(whisper_model)
         self.model = WhisperForConditionalGeneration.from_pretrained(whisper_model)
         self.model.to(self.device)
@@ -252,41 +239,25 @@ class AdvancedVocalScoringSystem:
         level: int = 1
     ) -> Tuple[float, str, float, float]:
         """
-        Score clarity using Wav2Vec2 (Level 1-3) or Whisper (Level 4-5)
         Returns:
             (clarity_score, transcription, similarity, wer)
         """
         try:
-            # Use Wav2Vec2 for Level 1-3 (better for Indonesian short audio)
-            if level <= 3:
-                import librosa
-                audio_np, sr = librosa.load(audio_path, sr=16000)
-                # Process with Wav2Vec2
-                inputs = self.wav2vec2_processor(
-                    audio_np,
-                    sampling_rate=16000,
-                    return_tensors="pt",
-                    padding=True
-                )
-                with torch.no_grad():
-                    logits = self.wav2vec2_model(inputs.input_values.to(self.device)).logits
-                predicted_ids = torch.argmax(logits, dim=-1)
-                transcription = self.wav2vec2_processor.batch_decode(predicted_ids)[0].upper().strip()
-            else:
-                # Use Whisper for Level 4-5 (better for long sentences)
-                result = self.pipe(
-                    audio_path,
-                    return_timestamps=False,
-                    generate_kwargs={
-                        "language": "id",
-                        "task": "transcribe"
-                    }
-                )
-                transcription = result["text"].upper().strip()
         except Exception as e:
             print(f"⚠️ ASR Error: {e}")
@@ -550,43 +521,71 @@ class AdvancedVocalScoringSystem:
     def _phonetic_similarity(self, text1: str, text2: str) -> float:
         """
         Calculate phonetic similarity for Indonesian syllables
-        Handles common confusions: T/D, P/B, K/G, S/Z
         """
-        # Indonesian phonetic confusion pairs
         confusions = {
-            'T': ['D', 'TH'],
-            'D': ['T', 'DH'],
-            'P': ['B'],
-            'B': ['P'],
-            'K': ['G', 'C'],
             'G': ['K'],
-            'S': ['Z', 'SY'],
-            'Z': ['S'],
-            'A': ['AH'],
-            'E': ['EH']
         }
         if not text1 or not text2:
             return 0.0
-        # Check if first letters are phonetically similar
         first1 = text1[0] if text1 else ''
         first2 = text2[0] if text2 else ''
-        if first1 == first2:
-            return 1.0
         # Check confusion pairs
         if first1 in confusions and first2 in confusions[first1]:
-            return 0.8
         if first2 in confusions and first1 in confusions[first2]:
-            return 0.8
         # Levenshtein distance for longer text
-        if len(text1) > 1 and len(text2) > 1:
-            return difflib.SequenceMatcher(None, text1, text2).ratio()
-        return 0.0
     def _calculate_wer(self, predicted: str, target: str) -> float:
         """Calculate Word Error Rate"""

 from transformers import (
     WhisperProcessor,
     WhisperForConditionalGeneration,
     pipeline
 )
 from typing import Dict, List, Tuple, Optional, Any
     def __init__(
         self,
+        whisper_model: str = "openai/whisper-large-v3",  # Best for Indonesian
         device: str = None
     ):
         """
+        Initialize the system with Whisper Large V3 (best for Indonesian)
         Args:
+            whisper_model: Whisper model (large-v3 recommended for Indonesian)
             device: 'cuda' or 'cpu'
         """
         self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
+        print(f"🔄 Loading Whisper model: {whisper_model}...")
+        # Load Whisper Large V3 - best for all levels
         self.processor = WhisperProcessor.from_pretrained(whisper_model)
         self.model = WhisperForConditionalGeneration.from_pretrained(whisper_model)
         self.model.to(self.device)
         level: int = 1
     ) -> Tuple[float, str, float, float]:
         """
+        Score clarity using Whisper Large V3 with Indonesian optimization
         Returns:
             (clarity_score, transcription, similarity, wer)
         """
         try:
+            # Use Whisper Large V3 for all levels (best accuracy)
+            result = self.pipe(
+                audio_path,
+                return_timestamps=False,
+                generate_kwargs={
+                    "language": "indonesian",  # Force Indonesian decoding instead of language auto-detection
+                    "task": "transcribe",
+                    "temperature": 0.0,  # Deterministic output
+                    "compression_ratio_threshold": 1.35,  # Lower for short audio
+                    "no_speech_threshold": 0.3  # Lower sensitivity
+                }
+            )
+            transcription = result["text"].upper().strip()
         except Exception as e:
             print(f"⚠️ ASR Error: {e}")
     def _phonetic_similarity(self, text1: str, text2: str) -> float:
         """
         Calculate phonetic similarity for Indonesian syllables
+        Comprehensive Indonesian phonetic confusions
         """
+        # Comprehensive Indonesian phonetic confusion pairs
         confusions = {
+            # Plosives (Konsonan Letup)
+            'T': ['D', 'TH', 'C'],
+            'D': ['T', 'DH', 'J'],
+            'P': ['B', 'F'],
+            'B': ['P', 'V'],
+            'K': ['G', 'C', 'Q'],
             'G': ['K'],
+            'C': ['S', 'T', 'K'],
+            # Fricatives (Konsonan Geseran)
+            'S': ['Z', 'SY', 'C'],
+            'Z': ['S', 'J'],
+            'F': ['P', 'V'],
+            'V': ['F', 'B', 'W'],
+            'H': ['KH'],
+            # Nasals (Konsonan Sengau)
+            'M': ['N'],
+            'N': ['M', 'NG', 'NY'],
+            'NG': ['N'],
+            'NY': ['N', 'Y'],
+            # Liquids (Konsonan Cair)
+            'R': ['L'],
+            'L': ['R'],
+            # Semivowels
+            'W': ['V', 'U'],
+            'Y': ['I', 'NY'],
+            # Vowels (Vokal)
+            'A': ['AH', 'E'],
+            'E': ['A', 'EH', 'I'],
+            'I': ['E', 'Y'],
+            'O': ['OH', 'U'],
+            'U': ['O', 'W']
         }
         if not text1 or not text2:
             return 0.0
+        # Exact match
+        if text1 == text2:
+            return 1.0
+        # Check if one contains the other
+        if text1 in text2 or text2 in text1:
+            return 0.95
+        # Check first letter phonetic similarity
         first1 = text1[0] if text1 else ''
         first2 = text2[0] if text2 else ''
         # Check confusion pairs
         if first1 in confusions and first2 in confusions[first1]:
+            return 0.85
         if first2 in confusions and first1 in confusions[first2]:
+            return 0.85
         # Levenshtein distance for longer text
+        return difflib.SequenceMatcher(None, text1, text2).ratio()
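
The new table lists multi-letter entries such as `'NG'`, `'NY'`, `'SY'`, and `'KH'`, but the lookup compares `text1[0]` with `text2[0]`, which are single characters, so those entries can never match. One possible way to make them reachable is to extract a digraph-aware onset before the lookup. The sketch below uses a small subset of the commit's table. `DIGRAPHS`, `onset`, and `onsets_confusable` are hypothetical names, and this is not the author's implementation.

```python
from typing import Dict, List, Tuple

# Subset of the commit's confusion table, including multi-letter entries.
CONFUSIONS: Dict[str, List[str]] = {
    "N": ["M", "NG", "NY"],
    "NG": ["N"],
    "NY": ["N", "Y"],
    "S": ["Z", "SY", "C"],
    "H": ["KH"],
}

# Indonesian digraphs that spell a single consonant sound.
DIGRAPHS: Tuple[str, ...] = ("NG", "NY", "SY", "KH")


def onset(text: str) -> str:
    """Return the leading digraph if present, otherwise the first character."""
    upper = text.upper()
    for digraph in DIGRAPHS:
        if upper.startswith(digraph):
            return digraph
    return upper[:1]


def onsets_confusable(text1: str, text2: str) -> bool:
    """True if the two onsets are listed as a confusion pair in either direction."""
    a, b = onset(text1), onset(text2)
    return b in CONFUSIONS.get(a, []) or a in CONFUSIONS.get(b, [])


print(onset("NYANYI"), onsets_confusable("NGA", "NA"), onsets_confusable("MA", "LA"))
```

Checking both directions mirrors the two `if` statements in `_phonetic_similarity`, so an entry only needs to appear under one of the two keys.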
     def _calculate_wer(self, predicted: str, target: str) -> float:
         """Calculate Word Error Rate"""