Infra • Serving & Optimization - a francescomaiomascio Collection

updated 7 days ago

Inference engines, quantization, serving stacks, and perf tooling. Reference list for deployment and latency/cost work.

The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines

Paper • 2408.01050 • Published Aug 2, 2024 • 9

Seesaw: High-throughput LLM Inference via Model Re-sharding

Paper • 2503.06433 • Published Mar 9, 2025

PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters

Paper • 2504.08791 • Published Apr 7, 2025 • 139

📝 Evaluation Guidebook

Space • Running • 267
Display benchmark evaluation data for LLMs

🌎 Open VLM Leaderboard

Space • Running on CPU Upgrade • 983
VLMEvalKit Evaluation Results Collection

🌎 Open VLM Video Leaderboard

Space • Running • Featured • 130
VLMEvalKit evaluation results on video understanding benchmarks

Qwen/Qwen2-7B-Instruct-AWQ

Text Generation • 8B • Updated Aug 21, 2024 • 12.5k • 22

TheBloke/CodeLlama-7B-Instruct-AWQ

Text Generation • 7B • Updated Nov 9, 2023 • 359 • 4

TheBloke/Mistral-7B-Instruct-v0.1-AWQ

Text Generation • 7B • Updated Nov 9, 2023 • 759 • 38

TheBloke/Mistral-7B-Instruct-v0.2-AWQ

Text Generation • 7B • Updated Dec 11, 2023 • 12.1k • 52

Qwen/Qwen2.5-VL-7B-Instruct-AWQ

Image-Text-to-Text • 8B • Updated Apr 6, 2025 • 699k • 99