🤗 Model Performance Comparison Tool
Compare LLM performance on multiple-choice questions using Hugging Face models.
Format: each line should contain Question,Correct Answer,Choice1,Choice2,Choice3
💡 Features:
- Model evaluation using HuggingFace transformers
- Support for custom models via HF model paths
- Detailed question-by-question results
- Performance charts and statistics
Choose a sample dataset or enter your own
Format Requirements:
- First line: header (ignored); leave it empty if there is no header
- Each data line: Question, Correct Answer, Choice1, Choice2, Choice3 (see the example below)
- Use commas or tabs as separators
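For illustration, a small comma-separated input might look like this (the questions and answers are invented for the example):

```
Question,Correct Answer,Choice1,Choice2,Choice3
What is the capital of France?,Paris,London,Berlin,Madrid
Which planet is closest to the Sun?,Mercury,Venus,Earth,Mars
```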
⚠️ Note:
- Larger models require more GPU memory; this tool currently runs on CPU only
- First run will download models (may take time)
- Models are cached for subsequent runs
📊 Results
Results will appear here...
Detailed results will appear here...
About Model Evaluation
This tool loads and runs HuggingFace models for evaluation:
🏗️ How it works:
- Downloads models from the Hugging Face Hub
- Formats questions as prompts for each model
- Runs likelihood-based evaluation (see the sketch below)
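Below is a minimal sketch of what likelihood-based multiple-choice scoring typically looks like with the transformers library. It is illustrative only: the model id ("gpt2"), the "Question: ... Answer:" prompt template, and the helper names are assumptions, not necessarily what this tool uses internally.

```python
# Illustrative sketch of likelihood-based multiple-choice evaluation.
# Assumptions: "gpt2" as a small CPU-friendly model, a simple prompt
# template, and helper names invented for this example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM repo id should work the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def choice_log_likelihood(question: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to the choice tokens,
    conditioned on the question prompt."""
    prompt = f"Question: {question}\nAnswer:"
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position i predicts token i + 1, so shift by one and keep only the choice tokens.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    positions = torch.arange(prompt_len - 1, full_ids.shape[1] - 1)
    return log_probs[positions, targets[positions]].sum().item()

def predict(question: str, choices: list[str]) -> str:
    """Return the choice the model considers most likely."""
    scores = [choice_log_likelihood(question, c) for c in choices]
    return choices[scores.index(max(scores))]

print(predict("What is the capital of France?", ["Paris", "London", "Berlin", "Madrid"]))
```

A common refinement is to length-normalize the scores (divide by the number of choice tokens) so that longer answer options are not penalized.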
⚡ Performance Tips:
- Use smaller models for testing
- Larger models (7B+) require significant GPU memory
- Models are cached after first load
🔧 Supported Models:
- Any HuggingFace autoregressive language model
- Both instruction-tuned and base models
- Custom fine-tuned models via HF paths (see the loading example below)
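As a hedged usage example, base, instruction-tuned, and custom fine-tuned causal LMs all load through the same path; the model ids below are examples, and the custom repository name is a hypothetical placeholder.

```python
# Base, instruction-tuned, and fine-tuned causal LMs all load the same way.
# "your-username/your-finetuned-model" is a hypothetical placeholder repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_ids = [
    "gpt2",                        # base model
    "Qwen/Qwen2.5-0.5B-Instruct",  # instruction-tuned model
    # "your-username/your-finetuned-model",  # custom fine-tune via its HF path
]

for model_id in model_ids:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{model_id}: {n_params / 1e6:.0f}M parameters")
```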