InferIQ

An LLM evaluation framework that uses LLMs to evaluate other LLMs.

InferIQ generates answers to the questions in a sample dataset using a pool of LLMs under evaluation, then has a group of judge LLMs rate each response. Results are visualized as graphs showing overall accuracy. Metrics such as BERTScore and inference time are reported alongside the judge ratings.
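
The flow described above can be sketched roughly as follows. This is a minimal illustration, not InferIQ's actual code: the `evaluate` function, the `Model` and `Judge` type aliases, the result keys, and the stub models and judges are all hypothetical names chosen for the example. Real models and judges would be LLM calls; plain Python functions stand in for them here so the sketch runs on its own.

```python
import time
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical interfaces, assumed for this sketch only.
Model = Callable[[str], str]          # question -> answer
Judge = Callable[[str, str], float]   # (question, answer) -> rating in [0, 1]


def evaluate(
    questions: List[str],
    models: Dict[str, Model],
    judges: List[Judge],
) -> Dict[str, Dict[str, float]]:
    """Run every model on every question and average the judges' ratings."""
    results: Dict[str, Dict[str, float]] = {}
    for name, model in models.items():
        scores, latencies = [], []
        for question in questions:
            start = time.perf_counter()
            answer = model(question)
            latencies.append(time.perf_counter() - start)  # inference time
            # Each judge rates the answer; the ratings are averaged.
            scores.append(mean(judge(question, answer) for judge in judges))
        results[name] = {
            "mean_score": mean(scores),
            "mean_latency_s": mean(latencies),
        }
    return results


# Stub data standing in for a sample dataset (hypothetical).
reference = {
    "What is 2 + 2?": "4",
    "What is the capital of France?": "Paris",
}
questions = list(reference)

# Stub "models": one always answers "4", one looks up the reference answer.
models: Dict[str, Model] = {
    "always_four": lambda q: "4",
    "lookup": lambda q: reference[q],
}

# Stub "judges": a strict exact-match judge and a lenient substring judge.
judges: List[Judge] = [
    lambda q, a: 1.0 if a == reference[q] else 0.0,
    lambda q, a: 1.0 if reference[q].lower() in a.lower() else 0.0,
]

results = evaluate(questions, models, judges)
for name, metrics in results.items():
    print(name, metrics)
```

Averaging across judges is only one possible aggregation; a majority vote or per-judge breakdown would fit the same loop. A similarity metric such as BERTScore could be computed inside the inner loop next to the judge ratings.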