Embedded LLM

Software Development

Creator of JamAI Base: The collaborative spreadsheet where AI ideas flow, chaining cells into powerful pipelines.

About us

Your open-source AI ally. We specialize in integrating LLMs into your business. Creator of JamAI Base.

Industry
Software Development
Company size
11-50 employees
Headquarters
Singapore
Type
Privately Held
Founded
2023
Specialties
Artificial intelligence, AI, Generative AI, LLM, Large language model, HIP, ROCm, CUDA, Enterprise AI, Autopilot, Copilot, GPT, On-Device AI, AI Consultancy, Embedded AI, Open Source, and On-Premises

Updates

  • vLLM Now Supports Running GGUF on AMD Radeon GPU 🚀

    Exciting news! We've ported vLLM's GGUF kernel to AMD ROCm, unlocking impressive performance gains on AMD Radeon GPUs.

    📊 In our benchmarks on the ShareGPT dataset with an AMD Radeon RX 7900 XTX, vLLM outperformed Ollama, even at the batch sizes where Ollama traditionally excels.

    💪 This is a game-changer for anyone running LLMs on AMD hardware, especially with quantized models (5-bit, 4-bit, or even 2-bit). With over 60,000 GGUF models available on Hugging Face, the possibilities are endless.

    💡 Key benefits:
    - Superior performance: vLLM delivers faster inference speeds than Ollama on AMD GPUs.
    - Wider model support: Run a vast collection of GGUF-quantized models.
    - Efficient execution: Optimized for AMD ROCm, maximizing hardware utilization.

    🔗 Learn more and get started: https://lnkd.in/g5qvUi8t (a usage sketch follows the benchmark image below)

    We'd love to hear your feedback! Have you experimented with vLLM, or with Llama.cpp on Vulkan? Which inference engine do you prefer for LLM tasks on AMD GPUs? What features or optimizations would you like to see in vLLM for AMD GPUs? #vLLM #AMD #ROCm #LLM #AI #GGUF

    • Image: vLLM vs Ollama benchmark chart
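
    For anyone who wants to try this, here's roughly what running a GGUF checkpoint through vLLM's offline API looks like. A minimal sketch, assuming a ROCm-enabled vLLM build and a GGUF file already downloaded locally; the file name and tokenizer repo below are placeholders:

      # Minimal sketch: offline inference on a local GGUF file with vLLM.
      # The .gguf path and tokenizer repo are placeholder examples.
      from vllm import LLM, SamplingParams

      llm = LLM(
          model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",  # local GGUF file
          tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # matching base tokenizer
      )
      params = SamplingParams(temperature=0.7, max_tokens=128)
      outputs = llm.generate(["What is GGUF quantization?"], params)
      print(outputs[0].outputs[0].text)
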
  • AMD ROCm releases are getting seriously interesting. Here's what has me excited about ROCm 6.3:
    - Re-engineered FlashAttention-2: Up to 3X speedups and support for longer sequence lengths.
    - SGLang Integration: Get started here: https://lnkd.in/gv_cJz3n (a quick-start sketch follows the link card below)
    - Fortran Compiler with OpenMP Offloading: Legacy Fortran codebases can now leverage GPU acceleration without extensive refactoring.
    - Multi-Node FFTs: Distributed workloads across multiple Instinct accelerators are now supported.
    - Computer Vision Enhancements: Library updates bring support for the AV1 codec, GPU-accelerated JPEG decoding, and audio preprocessing.
    Blog: https://lnkd.in/gQSA7xQv

    SGLang: Fast Serving Framework for Large Language and Vision-Language Models on AMD GPUs

    rocm.blogs.amd.com
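
    Since SGLang's server speaks the OpenAI-compatible API, getting started is only a few lines. A minimal sketch; the model name and port are illustrative, not from the post:

      # Launch the SGLang server first, e.g.:
      #   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
      # Then query it with any OpenAI-compatible client:
      from openai import OpenAI

      client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
      resp = client.chat.completions.create(
          model="meta-llama/Llama-3.1-8B-Instruct",
          messages=[{"role": "user", "content": "Say hello from an AMD GPU."}],
      )
      print(resp.choices[0].message.content)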

  • 75% of training time can be wasted on communication overhead between GPUs. Even on H100 systems, communication can eat up to 43% of training time! For massive models like Llama-3 405B, that translates to a staggering 25 days spent just on communication. 🤯

    But there's good news: DeepSpeed Domino comes to the rescue. This new tensor parallelism (TP) engine minimizes communication overhead, unlocking faster and more efficient LLM training for both single-node and multi-node setups.
    - Near-complete communication hiding: Domino cleverly overlaps communication with computation, dramatically reducing wasted time.
    - Novel multi-node scalable TP solution: Domino is designed to excel in both single-node and multi-node environments, enabling efficient scaling for even the largest models.

    Learn more about DeepSpeed Domino:
    Blog: https://lnkd.in/eVEd5GwU
    Paper: https://lnkd.in/ecHc4bph

  • Embedded LLM reposted this

    Mei Ling Leung

    Member of Technical Staff at Embedded LLM

    An ML engineer who only wants to train models is like a carpenter who only wants to hammer nails. 🔨

    This highlights a fundamental flaw in AI engineering education today. We're churning out graduates who are experts in model training but lack the essential skills to solve real-world problems. It's like a carpentry school that only teaches hammering techniques. Sure, you'll learn how to drive a nail, but what about the rest of the craft? 🤔

    Here's the problem: most AI/ML courses focus heavily on algorithms, frameworks (like PyTorch), and model training. Students are given clean datasets and clear objectives. They become proficient in tuning hyperparameters and optimizing for accuracy. But the real world is messy:
    - Data is rarely readily available: You need to identify the right data sources, collect and clean the data, and deal with missing values, inconsistencies, and biases.
    - Problems are complex and ambiguous: You need to define the problem, frame it correctly, and choose the right approach.
    - Solutions require more than just models: You need to consider deployment, monitoring, scalability, and ethical implications.

    We need a new approach to AI education:
    - Problem-first learning: Start with real-world problems and teach students how to break them down into manageable steps.
    - Focus on the entire lifecycle: Cover the whole ML workflow, from data collection and preparation to model deployment and monitoring.
    - Develop critical thinking and problem-solving skills: Encourage students to think critically, analyze data, and evaluate solutions.
    - Emphasize communication and collaboration: Foster strong communication and interpersonal skills through group projects, presentations, and interactions with domain experts.
    - Prioritize domain knowledge: Integrate industry case studies, real-world projects, and collaborations with businesses into the curriculum.

    Let's empower the next generation of engineers to build real-world solutions, not just train models in isolation. Remember what my boss once told me: "Remember, the real value you bring is your ability to solve problems, not just your knowledge of specific tools. XGBoost and SVM are great, but they're just means to an end. Your creativity, critical thinking, and understanding of the business are what truly make a difference."

  • 🔥 Pixtral Large is now supported on vLLM! 🔥

    Run Pixtral Large with multiple input images from day 0 using vLLM.

    Install vLLM:
    pip install -U vllm

    Run Pixtral Large:
    vllm serve mistralai/Pixtral-Large-Instruct-2411 --tokenizer_mode mistral --limit_mm_per_prompt 'image=10' --tensor-parallel-size 8

    (A sketch of querying the server follows this post.)

    About Pixtral Large:
    * Built upon Mistral Large 2, preserving its exceptional text performance.
    * State-of-the-art on MathVista, DocVQA, and VQAv2.
    * 123B multimodal decoder with a 1B-parameter vision encoder.
    * 128K context window: fits a minimum of 30 high-resolution images.
    * Licensed under the MRL.

    🤗 https://lnkd.in/gixc5mE4

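    Once the server is up, multimodal requests go through vLLM's OpenAI-compatible endpoint. A minimal sketch of a query with one image; the port, placeholder API key, and image URL are illustrative assumptions:

      # Query the Pixtral Large server started above via the OpenAI-compatible API.
      # base_url/port and the image URL are placeholders.
      from openai import OpenAI

      client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
      response = client.chat.completions.create(
          model="mistralai/Pixtral-Large-Instruct-2411",
          messages=[{
              "role": "user",
              "content": [
                  {"type": "text", "text": "Describe this image in one sentence."},
                  {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
              ],
          }],
      )
      print(response.choices[0].message.content)
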
  • vLLM v0.6.4 brings expanded model support, Intel AI Gaudi support, and significant progress on the vLLM V1 core engine and torch.compile support. What's new:
    - New LLMs and VLMs: Idefics3 (VLM), H2OVL-Mississippi (VLM for OCR and Document AI), Qwen2-Audio (audio LLM), FalconMamba (Mamba LLM), Florence-2 (VLM)
    - New encoder-decoder embedding models: BERT, RoBERTa, XLM-RoBERTa
    - Expanded task support:
      - Text classification: Qwen2 classification
      - Embeddings: Llama embeddings, Math-Shepherd, Qwen2 embeddings
      - VLM embeddings: VLM2Vec, E5-V, Qwen2-VL embeddings
      - Task parameter: --task to specify generation or embedding tasks (a sketch follows the release link below)
    - Chat-based embeddings API: Pass multi-modal conversations to embedding models.
    - Tool-calling parsers: Granite 3.0, Jamba, granite-20b-functioncalling
    - LoRA support: Granite 3.0 MoE, Idefics3, Llama embeddings, Qwen, Qwen2-VL
    - BNB (bitsandbytes) quantization: Idefics3, Mllama, Qwen2, MiniCPMV
    - Hardware support:
      - Intel Gaudi (HPU) backend: A key advantage of Gaudi is massive scalability with standard Ethernet.
      - CPU support for embedding models: Deploy embedding models on CPUs.
    - Performance enhancements: Chunked prefill combined with speculative decoding, plus improved fused_moe performance.

    Explore the full release notes for detailed information: https://lnkd.in/gTXXmEWG #vLLM #Intel #Gaudi3 #Idefics3

    Release v0.6.4 · vllm-project/vllm

    github.com
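
    As a taste of the new --task parameter, offline embedding extraction looks roughly like this. A minimal sketch; the model name is just one example of a supported embedding model:

      # Sketch: vLLM v0.6.4's task parameter selecting embedding mode.
      # The model ID is an example embedding model, not prescriptive.
      from vllm import LLM

      llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embedding")
      outputs = llm.encode(["What is the capital of France?"])
      print(len(outputs[0].outputs.embedding))  # prints the embedding dimension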

  • 📢 Calling all robotics enthusiasts! 🤖 Meet Iris! Iris can now speak 🎙️ Follow our talented Robotics Engineer, Wessam Hamid, as he tackles the challenges and breakthroughs of building and programming cutting-edge humanoid robots like Iris. Stay tuned for exclusive updates, behind-the-scenes insights, and a glimpse into the future of robotics. 🚀 #robotics #humanoidrobot #LLM

    Wessam Hamid

    I design robots

    Iris update: It speaks! Made some cool progress with Iris – it can talk now! 🎙️

    This is my first shot at speech-to-speech reasoning, and it can switch between different language models and voices on command. It's making API calls for voice transcription, responses, and text-to-speech, though it's still a bit slow (I sped up some parts of the video and added a timer to show that).

    Next step: speed things up and get the RealSense integrated. I'm also thinking of running the models locally on a dedicated AI PC – now I just need to figure out how to fund that piece. Excited to see where this goes! Do let me know if you have any suggestions.
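
    The transcribe-reason-speak loop described above maps onto three API calls. A minimal sketch of that pattern using OpenAI-hosted models; the post doesn't say which APIs Iris uses, so the model names and file paths here are purely illustrative:

      # Illustrative speech-to-speech loop: transcription -> LLM -> text-to-speech.
      # Model names and file paths are placeholders, not Iris's actual stack.
      from openai import OpenAI

      client = OpenAI()  # reads OPENAI_API_KEY from the environment

      with open("mic_capture.wav", "rb") as audio:
          text = client.audio.transcriptions.create(model="whisper-1", file=audio).text

      reply = client.chat.completions.create(
          model="gpt-4o-mini",
          messages=[{"role": "user", "content": text}],
      ).choices[0].message.content

      speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
      speech.write_to_file("iris_reply.mp3")  # play this back through the robot's speaker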

  • ⚡️ Huge performance gains for LLM training on AMD GPUs! 🐯 Liger Kernel v0.4.0 now fully supports AMD GPUs!

    Thanks to our collaboration with the Liger Kernel team, you can now enjoy a 26% speed boost and a massive 60% reduction in memory usage when training LLMs on AMD GPUs.

    Check out the benchmarks in our blog: https://lnkd.in/gU2cPX_m

    A big thank you to Hot Aisle Inc. for sponsoring the #MI300X and to Pin-Lun (Byron) Hsu for the collaboration!

    Pin-Lun (Byron) Hsu

    Building Liger-Kernel @Linkedin | Committer @flyteorg @theASF

    Liger Kernel v0.4.0 has arrived! https://lnkd.in/gR73PfFh

    1. Full AMD Support: We have partnered with https://embeddedllm.com to adjust the Triton configuration to fully support AMD! With version 0.4.0, you can run multi-GPU training with 26% higher speed and 60% lower memory usage on AMD. See the full blog post at https://lnkd.in/gUBF9Ur6. Embedded LLM Hot Aisle Inc. Jon Stevens Pin Siang Tan Tun Jian Tan

    2. Modal CI Migration: We have moved our entire GPU CI stack to Modal! Thanks to intelligent Docker layer caching and blazingly fast container startup and scheduling, we have reduced CI overhead by over 10x (from minutes to seconds). Modal Charles Frye Alec Powell Erik Bernhardsson

    3. LLaMA 3.2-Vision Model: We have added kernel support for the LLaMA 3.2-Vision model. You can easily use `liger_kernel.transformers.apply_liger_kernel_to_mllama` to patch the model (a usage sketch follows the release card below). Tyler Romero Shivam Sahni

    4. HuggingFace Gradient Accumulation Fixes: We have fixed the notorious HuggingFace gradient accumulation issue (https://lnkd.in/gCKtftbw) by carefully adjusting the cross-entropy scalar. You can now safely use v0.4.0 with the latest HuggingFace gradient accumulation fixes (transformers>=4.46.2)! Wing Lian Arthur Zucker

    5. JSD Kernel: We have added the JSD kernel for distillation, which also comes with a chunked version! Chun-Chih Tseng Yun Dai Qingquan Song

    6. Technical Report: We have published a technical report on arXiv (https://lnkd.in/gwHq6_7c) with abundant details. Yanning Chen Haowen Ning Animesh Singh Kapil Surlaker

    Release v0.4.0: Full AMD support, Tech Report, Modal CI, Llama-3.2-Vision! · linkedin/Liger-Kernel

    github.com
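
    For reference, patching Llama-3.2-Vision with the function named in point 3 looks roughly like this. A minimal sketch; the checkpoint ID is an example, and the patch must run before the model is instantiated:

      # Sketch: apply Liger kernels to Llama-3.2-Vision (Mllama) before loading.
      import torch
      from liger_kernel.transformers import apply_liger_kernel_to_mllama
      from transformers import MllamaForConditionalGeneration

      apply_liger_kernel_to_mllama()  # monkey-patches HF's Mllama modules in place

      model = MllamaForConditionalGeneration.from_pretrained(
          "meta-llama/Llama-3.2-11B-Vision-Instruct",  # example checkpoint
          torch_dtype=torch.bfloat16,
      )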

  • Liger Kernels Leap the CUDA Moat: A Case Study with Liger, LinkedIn's SOTA Training Kernels, on AMD GPUs 🚀

    Exciting news! We've partnered with the LinkedIn Liger-Kernel team to fully support AMD GPUs in their latest v0.4.0 release. This brings significant performance improvements to Large Language Model (LLM) training on AMD hardware.

    Key benefits:
    - Faster training: Up to 26% higher multi-GPU training throughput.
    - Reduced memory usage: Train larger models and use bigger batch sizes with up to 60% less memory.
    - Longer context lengths: Explore new possibilities with support for up to 8x longer context lengths.

    Check out the benchmarks on our blog: https://lnkd.in/gU2cPX_m
    Check out the v0.4.0 release: https://lnkd.in/gJxXK8cy

    A big thank you to Hot Aisle Inc. for sponsoring the MI300X and to Pin-Lun (Byron) Hsu for the collaboration! #LLM #AI #AMD #ROCm #LigerKernels

  • Malaysia's AI Scene is Electrifying! ⚡️ The Future is Being Shaped in SEA.

    Malaysia is rapidly becoming an AI powerhouse. With a data center boom projected to increase capacity ninefold, global tech giants like Microsoft, NVIDIA, Amazon, Google, and Oracle are investing heavily in its digital infrastructure.

    At MDX 2024, we witnessed incredible enthusiasm and groundbreaking AI applications. The energy was palpable, and the future possibilities seemed limitless. Embedded LLM is proud to be a part of this revolution: alongside AMD and the Selangor Human Resource Development Centre (SHRDC), we're empowering Malaysian businesses with cutting-edge AI solutions.

    Malaysia's commitment to AI is evident. We were honored to host Malaysia's Digital Minister at our booth, a clear sign of strong government support for AI innovation. The future is bright, and we're eager to learn from and contribute to this exciting journey.
