🤖 Model Performance Comparison Tool

Compare LLM performance on multiple-choice questions using Hugging Face models.

Format: each line should contain Question, Correct Answer, Choice1, Choice2, Choice3

💡 Features:

  • Model evaluation using HuggingFace transformers
  • Support for custom models via HF model paths
  • Detailed question-by-question results
  • Performance charts and statistics

Choose a sample dataset or enter your own.

Format Requirements:

  • First line: header row (ignored); leave it empty if your file has no header
  • Each data line: Question, Correct Answer, Choice1, Choice2, Choice3
  • Use commas or tabs as separators (see the example below)
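
A hypothetical input file in this format (the questions are illustrative; the correct answer comes right after the question, followed by three distractors):

```
Question,Correct Answer,Choice1,Choice2,Choice3
What is the capital of France?,Paris,London,Berlin,Madrid
What is 2 + 2?,4,3,5,22
```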

Select from the list of popular models, or enter a custom HF model path.

โš ๏ธ Note:

  • Larger models require more memory; this tool currently runs on CPU only
  • The first run downloads model weights (this may take a while)
  • Models are cached for subsequent runs (see the sketch below)
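
For reference, the transformers library caches downloaded weights (by default under ~/.cache/huggingface), which is what makes repeat runs fast. A minimal sketch of redirecting that cache via the cache_dir argument of from_pretrained; the model name and path are illustrative:

```python
from transformers import AutoModelForCausalLM

# Weights download once and are reused on later runs; cache_dir moves the
# cache to another location, e.g. a larger disk.
model = AutoModelForCausalLM.from_pretrained("distilgpt2", cache_dir="/data/hf-cache")
```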

About Model Evaluation

This tool loads and runs HuggingFace models for evaluation:

๐Ÿ—๏ธ How it works:

  • Downloads models from the Hugging Face Hub
  • Formats each question as a prompt for the model
  • Runs likelihood-based evaluation, scoring each answer choice (see the sketch below)
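
A minimal sketch of this pipeline, assuming the transformers and torch libraries; the prompt template, model name, and scoring details here are illustrative, not the tool's exact implementation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "distilgpt2"  # illustrative; any causal LM on the Hub should work
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def choice_log_likelihood(question: str, choice: str) -> float:
    """Sum of the log-probabilities the model assigns to the choice tokens."""
    prompt = f"Question: {question}\nAnswer:"
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # logits at position i predict token i+1, so shift by one when indexing targets.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, prompt_len:]
    positions = range(prompt_len - 1, full_ids.shape[1] - 1)
    return sum(log_probs[p, t].item() for p, t in zip(positions, targets))

def predict(question: str, choices: list[str]) -> str:
    """Pick the answer choice with the highest likelihood under the model."""
    scores = [choice_log_likelihood(question, c) for c in choices]
    return choices[scores.index(max(scores))]

print(predict("What is the capital of France?", ["Paris", "London", "Berlin", "Madrid"]))
```

The predicted choice is then compared against the correct answer to compute per-question results and overall accuracy.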

⚡ Performance Tips:

  • Use smaller models for testing (see the sketch below)
  • Larger models (7B+ parameters) require significant memory
  • Models are cached after the first load
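
A small sketch of a memory-conscious load, assuming the low_cpu_mem_usage option of from_pretrained (the model name is illustrative):

```python
from transformers import AutoModelForCausalLM

# Start with a small model for smoke tests and swap in larger ones later;
# low_cpu_mem_usage streams weights in to reduce peak RAM during loading.
model = AutoModelForCausalLM.from_pretrained("distilgpt2", low_cpu_mem_usage=True)
```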

🔧 Supported Models:

  • Any HuggingFace autoregressive language model
  • Both instruction-tuned and base models
  • Custom fine-tuned models via HF paths (see the loading sketch below)
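
As a sketch, custom models load the same way as the built-in choices: from_pretrained accepts either a Hub repo ID or a local directory (the path below is a placeholder, not a real model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: substitute your own Hub repo ID or local checkpoint directory.
path = "your-username/your-finetuned-model"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path)
```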