Conference Papers

H2EAL: Hybrid-Bonding Architecture with Hybrid Sparse Attention for Efficient Long-Context LLM Inference
HD-MoE: Hybrid and Dynamic Parallelism for Mixture-of-Expert LLMs with 3D Near-Memory Processing
HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference
SpecASR: Accelerating LLM-based Automatic Speech Recognition via Speculative Decoding
AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference
ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding
Memory-Aware Scheduling for Complex Wired Networks with Iterative Graph Optimization