Model Gallery

20 models from 1 repository

nemo-parakeet-tdt-0.6b
NVIDIA NeMo Parakeet TDT 0.6B v3 is an automatic speech recognition (ASR) model from NVIDIA's NeMo toolkit. Parakeet models are state-of-the-art ASR models trained on large-scale English audio data.
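
Since this entry describes a NeMo ASR model, a minimal transcription sketch may help. It assumes the `nemo_toolkit` package is installed and that the checkpoint is published on Hugging Face as `nvidia/parakeet-tdt-0.6b-v3`; the audio file name is a placeholder.

```python
# Minimal sketch: transcribing audio with a NeMo ASR checkpoint.
# Assumes: pip install "nemo_toolkit[asr]" and that the model id below
# resolves via from_pretrained (it may also be loaded from a local .nemo file).
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3")

# transcribe() takes a list of audio file paths (16 kHz mono WAV works best);
# depending on the NeMo version it returns plain strings or Hypothesis objects.
results = asr_model.transcribe(["sample.wav"])
print(results[0])
```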

Repository: localai · License: apache-2.0

mistral-nemo-instruct-2407-12b-thinking-m-claude-opus-high-reasoning-i1
The model described in this repository is the **Mistral-Nemo-Instruct-2407-12B** (12 billion parameters), a large language model optimized for instruction tuning and high-level reasoning tasks. It is a **quantized version** of the original model, compressed for efficiency while retaining key capabilities. The model is designed to generate human-like text and perform complex reasoning, making it suitable for applications requiring strong language understanding and generation.

Repository: localai

l3.3-ms-nevoria-70b
This model was created because I liked the storytelling of EVA and the prose and scene detail of EURYALE and Anubis, enhanced with Negative_LLAMA to kill off the positive bias, with a touch of nemotron sprinkled in. The choice to use the lorablated model as a base was intentional: while it might seem counterintuitive, this approach creates unique interactions between the weights, similar to what was achieved in the original Astoria model and Astoria V2 model. Rather than simply removing refusals, the "weight twisting" effect that occurs when subtracting the lorablated base model from the other models during the merge process creates an interesting balance in the final model's behavior. While this approach differs from traditional sequential application of components, it was chosen for its unique characteristics in the model's responses.

Repository: localai · License: llama3.3

nvidia_llama-3_3-nemotron-super-49b-v1
Llama-3.3-Nemotron-Super-49B-v1 is a large language model (LLM) which is a derivative of Meta's Llama-3.3-70B-Instruct (the reference model). It is a reasoning model that is post-trained for reasoning, human chat preferences, and tasks such as RAG and tool calling. The model supports a context length of 128K tokens. Llama-3.3-Nemotron-Super-49B-v1 offers a great tradeoff between model accuracy and efficiency, and efficiency (throughput) directly translates to savings. Using a novel Neural Architecture Search (NAS) approach, we greatly reduce the model's memory footprint, enabling larger workloads as well as fitting the model on a single GPU at high workloads (H200). This NAS approach enables the selection of a desired point in the accuracy-efficiency tradeoff. The model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for Math, Code, Reasoning, and Tool Calling, as well as multiple reinforcement learning (RL) stages using REINFORCE (RLOO) and Online Reward-aware Preference Optimization (RPO) algorithms for both chat and instruction-following. The final model checkpoint is obtained after merging the final SFT and Online RPO checkpoints. For more details on how the model was trained, please see this blog.

Repository: localai · License: llama3.3

nvidia_llama-3_3-nemotron-super-49b-genrm-multilingual
Llama-3.3-Nemotron-Super-49B-GenRM-Multilingual is a generative reward model that uses Llama-3.3-Nemotron-Super-49B-v1 as its foundation and is fine-tuned with reinforcement learning to predict the quality of LLM-generated responses. It can judge the quality of a single response, or rank two responses, given a multilingual conversation history. It first generates reasoning traces and then outputs an integer score; a higher score means the response is of higher quality.
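
The card says the model emits reasoning traces followed by an integer score. Below is a small, hedged sketch of pulling that score out of raw generated text; the exact output format is not specified here, so the parsing rule (last integer wins) is an assumption, not the official parser.

```python
import re


def extract_genrm_score(generated_text: str) -> int | None:
    """Pull the final integer from a GenRM completion.

    Assumes the reasoning trace comes first and the integer score last,
    as the model card describes; the concrete formatting is an assumption,
    so treat this as a sketch rather than a definitive implementation.
    """
    matches = re.findall(r"-?\d+", generated_text)
    return int(matches[-1]) if matches else None


# Hypothetical completion shape; the real trace format may differ.
print(extract_genrm_score("The response is accurate and well structured. Score: 4"))  # -> 4
```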

Repository: localai · License: llama3.3

llama-3.1-nemotron-70b-instruct-hf
Llama-3.1-Nemotron-70B-Instruct is a large language model customized by NVIDIA to improve the helpfulness of LLM-generated responses to user queries. This model reaches an Arena Hard score of 85.0, AlpacaEval 2 LC of 57.6, and GPT-4-Turbo MT-Bench of 8.98, metrics known to be predictive of LMSYS Chatbot Arena Elo. As of 1 Oct 2024, this model is #1 on all three automatic alignment benchmarks (verified tab for AlpacaEval 2 LC), edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet. This model was trained using RLHF (specifically, REINFORCE), Llama-3.1-Nemotron-70B-Reward, and HelpSteer2-Preference prompts on a Llama-3.1-70B-Instruct model as the initial policy. Llama-3.1-Nemotron-70B-Instruct-HF has been converted from Llama-3.1-Nemotron-70B-Instruct to support it in the Hugging Face Transformers codebase. Please note that evaluation results might differ slightly from those of Llama-3.1-Nemotron-70B-Instruct as evaluated in NeMo-Aligner, on which the results above are based.

Repository: localai · License: llama3.1

l3.1-nemotron-sunfall-v0.7.0-i1
Significant revamping of the dataset metadata generation process, resulting in a higher-quality dataset overall. The "Diamond Law" experiment has been removed, as it didn't seem to affect the model output enough to warrant the setup complexity. Recommended starting point: Temperature 1.0; MinP 0.05~0.1; DRY 0.8 / 1.75 / 2 / 0 (multiplier / base / allowed length / penalty range). At early context, I recommend keeping XTC disabled. Once you hit higher context sizes (10k+), enabling XTC at 0.1 / 0.5 seems to significantly improve the output, but YMMV. If the output drones on and is uninspiring, XTC can be extremely effective. General heuristics: Lots of slop? Temperature is too low; raise it, or enable XTC (at early context, a temperature bump is probably preferred). Is the model making mistakes about subtle or obvious details in the scene? Temperature is too high, or XTC is enabled and/or its settings are too high; lower the temperature and/or disable XTC.
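
For concreteness, here is one hedged way to apply the starting settings above through a llama.cpp-style /completion endpoint. The field names (min_p, dry_*, xtc_*) follow llama.cpp's HTTP server and may be named differently, or be unsupported, in other backends; the prompt and endpoint URL are placeholders.

```python
import requests

# Recommended starting point from the card, mapped to llama.cpp server fields
# (assumed names; adjust for your backend). XTC is left disabled here, as
# suggested for early context; at 10k+ context try threshold 0.1 / probability 0.5.
payload = {
    "prompt": "### Instruction:\nContinue the scene.\n\n### Response:\n",
    "temperature": 1.0,
    "min_p": 0.05,            # card recommends 0.05~0.1
    "dry_multiplier": 0.8,    # DRY: 0.8 / 1.75 / 2 / 0
    "dry_base": 1.75,
    "dry_allowed_length": 2,
    "dry_penalty_last_n": 0,
    "xtc_probability": 0.0,   # set ~0.5 to enable XTC at higher context
    "xtc_threshold": 0.1,
    "n_predict": 256,
}
resp = requests.post("http://localhost:8080/completion", json=payload, timeout=120)
print(resp.json()["content"])
```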

Repository: localai · License: llama3.1

nvidia_llama-3.1-nemotron-nano-4b-v1.1
Llama-3.1-Nemotron-Nano-4B-v1.1 is a large language model (LLM) which is a derivative of nvidia/Llama-3.1-Minitron-4B-Width-Base, which is created from Llama 3.1 8B using our LLM compression technique and offers improvements in model accuracy and efficiency. It is a reasoning model that is post-trained for reasoning, human chat preferences, and tasks such as RAG and tool calling. Llama-3.1-Nemotron-Nano-4B-v1.1 offers a great tradeoff between model accuracy and efficiency; the model fits on a single RTX GPU and can be used locally. The model supports a context length of 128K tokens. This model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities, including a supervised fine-tuning stage for Math, Code, Reasoning, and Tool Calling as well as multiple reinforcement learning (RL) stages using Reward-aware Preference Optimization (RPO) algorithms for both chat and instruction-following. The final model checkpoint is obtained after merging the final SFT and RPO checkpoints. This model is part of the Llama Nemotron Collection; the other models in this family are Llama-3.3-Nemotron-Ultra-253B-v1, Llama-3.3-Nemotron-Super-49B-v1, and Llama-3.1-Nemotron-Nano-8B-v1. This model is ready for commercial use.

Repository: localai · License: llama3.1

nvidia_acereason-nemotron-14b
We're thrilled to introduce AceReason-Nemotron-14B, a math and code reasoning model trained entirely through reinforcement learning (RL), starting from DeepSeek-R1-Distilled-Qwen-14B. It delivers impressive results, achieving 78.6% on AIME 2024 (+8.9%), 67.4% on AIME 2025 (+17.4%), 61.1% on LiveCodeBench v5 (+8%), 54.9% on LiveCodeBench v6 (+7%), and a 2024 rating on Codeforces (+543). We systematically study the RL training process through extensive ablations and propose a simple yet effective approach: first RL training on math-only prompts, then RL training on code-only prompts. Notably, we find that math-only RL not only significantly enhances the performance of strong distilled models on math benchmarks, but also on code reasoning tasks. In addition, extended code-only RL further improves code benchmark performance while causing minimal degradation in math results. We find that RL not only elicits the foundational reasoning capabilities acquired during pre-training and supervised fine-tuning (e.g., distillation), but also pushes the limits of the model's reasoning ability, enabling it to solve problems that were previously unsolvable.

Repository: localai

nvidia_nemotron-research-reasoning-qwen-1.5b
Nemotron-Research-Reasoning-Qwen-1.5B is the world's leading 1.5B open-weight model for complex reasoning tasks such as mathematical problems, coding challenges, scientific questions, and logic puzzles. It is trained using the ProRL algorithm on a diverse and comprehensive set of datasets. Our model has achieved impressive results, outperforming DeepSeek's 1.5B model by a large margin on a broad range of tasks, including math, coding, and GPQA. This model is for research and development only.

Repository: localai

mistral-nemo-instruct-2407
The Mistral-Nemo-Instruct-2407 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-Nemo-Base-2407. Trained jointly by Mistral AI and NVIDIA, it significantly outperforms existing models smaller or similar in size.

Repository: localai · License: apache-2.0

pantheon-rp-1.6-12b-nemo
Welcome to the next iteration of my Pantheon model series, in which I strive to introduce a whole collection of personas that can be summoned with a simple activation phrase. The huge variety in personalities introduced also serves to enhance the general roleplay experience. Changes in version 1.6: The final finetune now consists of data that is equally split between Markdown and novel-style roleplay; this should solve Pantheon's greatest weakness. The base was redone (details below). Select Claude-specific phrases were rewritten, boosting variety in the model's responses. Aiva no longer serves as both persona and assistant, with the assistant role having been given to Lyra. Stella's dialogue received some post-fix alterations since the model really loved the phrase "Fuck me sideways". Your user feedback is critical to me, so don't hesitate to tell me whether my model is either 1. terrible, 2. awesome, or 3. somewhere in-between.

Repository: localai · License: apache-2.0

mistral-nemo-prism-12b
Mahou-1.5-mistral-nemo-12B-lorablated finetuned on Arkhaios-DPO and Purpura-DPO. The goal was to reduce archaic language and purple prose in a completely uncensored model.

Repository: localai · License: apache-2.0

mn-chunky-lotus-12b
I had originally planned to use this model for future/further merges, but decided to go ahead and release it since it scored rather high on my local EQ-Bench testing (79.58 w/ 100% parsed @ 8-bit). Bear in mind that most models tend to score a bit higher on my own local tests as compared to their posted scores; still, it's the highest score I've personally seen from all the models I've tested. It's a decent model, with great emotional intelligence and acceptable adherence to various character personalities. It does a good job at roleplaying despite being a bit bland at times. Overall, I like the way it writes, but it has a few formatting issues that show up from time to time, and it has an uncommon tendency to paste walls of character feelings/intentions at the end of some outputs without any prompting. This is something I hope to correct with future iterations. This is a merge of pre-trained language models created using mergekit. The following models were included in the merge: Epiculous/Violet_Twilight-v0.2, nbeerbower/mistral-nemo-gutenberg-12B-v4, and flammenai/Mahou-1.5-mistral-nemo-12B.

Repository: localai · License: apache-2.0

mn-12b-mag-mell-r1-iq-arm-imatrix
This is a merge of pre-trained language models created using mergekit. Mag Mell is a multi-stage merge, inspired by hyper-merges like Tiefighter and Umbral Mind, intended to be a general-purpose "Best of Nemo" model for any fictional, creative use case. Six models were chosen based on three categories; they were then paired up and merged via layer-weighted SLERP to create intermediate "specialists", which were then evaluated in their domains. The specialists were then merged into the base via DARE-TIES, with hyperparameters chosen to reduce interference caused by the overlap of the three domains. The idea with this approach is to extract the best qualities of each component part and produce models whose task vectors represent more than the sum of their parts. The three specialists are as follows: Hero (RP, kink/trope coverage): Chronos Gold, Sunrose. Monk (intelligence, groundedness): Bophades, Wissenschaft. Deity (prose, flair): Gutenberg v4, Magnum 2.5 KTO. I've been dreaming about this merge since Nemo tunes started coming out in earnest. From our testing, Mag Mell demonstrates worldbuilding capabilities unlike any model in its class, comparable to old adventuring models like Tiefighter, and prose that exhibits minimal "slop" (not bad for no finetuning), frequently devising electrifying metaphors that left us consistently astonished. I don't want to toot my own bugle, though; I'm really proud of how this came out, but please leave your feedback, good or bad. Special thanks as usual to Toaster for his feedback and Fizz for helping fund compute, as well as the KoboldAI Discord for their resources. The following models were included in the merge: IntervitensInc/Mistral-Nemo-Base-2407-chatml, nbeerbower/mistral-nemo-bophades-12B, nbeerbower/mistral-nemo-wissenschaft-12B, elinas/Chronos-Gold-12B-1.0, Fizzarolli/MN-12b-Sunrose, nbeerbower/mistral-nemo-gutenberg-12B-v4, and anthracite-org/magnum-12b-v2.5-kto.
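
As a rough illustration of the multi-stage recipe described above (layer-weighted SLERP into specialists, then DARE-TIES into the base), here is a hedged mergekit sketch. The component model ids are taken from the entry, but every t/density/weight value and file path is invented for illustration; this is not the recipe actually used for Mag Mell.

```python
# Illustrative two-stage merge: SLERP one specialist pair, then fold the
# (separately built) specialists into the base with DARE-TIES via mergekit.
import pathlib
import subprocess
import textwrap

hero_slerp = textwrap.dedent("""\
    # Stage 1: layer-weighted SLERP of the "Hero" pair (illustrative t curve;
    # Mistral Nemo 12B has 40 transformer layers)
    slices:
      - sources:
          - model: elinas/Chronos-Gold-12B-1.0
            layer_range: [0, 40]
          - model: Fizzarolli/MN-12b-Sunrose
            layer_range: [0, 40]
    merge_method: slerp
    base_model: elinas/Chronos-Gold-12B-1.0
    parameters:
      t: [0.0, 0.3, 0.5, 0.3, 0.0]
    dtype: bfloat16
""")

final_dare_ties = textwrap.dedent("""\
    # Stage 2: merge the three specialists (each built like stage 1)
    # into the base via DARE-TIES; densities/weights are placeholders
    models:
      - model: ./specialists/hero
        parameters: {density: 0.5, weight: 0.33}
      - model: ./specialists/monk
        parameters: {density: 0.5, weight: 0.33}
      - model: ./specialists/deity
        parameters: {density: 0.5, weight: 0.33}
    merge_method: dare_ties
    base_model: IntervitensInc/Mistral-Nemo-Base-2407-chatml
    dtype: bfloat16
""")

for name, cfg in [("hero-slerp", hero_slerp), ("final-dare-ties", final_dare_ties)]:
    cfg_path = pathlib.Path(f"{name}.yml")
    cfg_path.write_text(cfg)
    # mergekit-yaml <config> <output-dir> is mergekit's standard CLI entry point
    subprocess.run(["mergekit-yaml", str(cfg_path), f"./out/{name}"], check=True)
```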

Repository: localai · License: apache-2.0

sainemo-remix
The following models were included in the merge: elinas_Chronos-Gold-12B-1.0, Vikhrmodels_Vikhr-Nemo-12B-Instruct-R-21-09-24, and MarinaraSpaghetti_NemoMix-Unleashed-12B.

Repository: localai · License: apache-2.0

dreamgen_lucid-v1-nemo
Focused on role-play and story-writing, and suitable for all kinds of writers and role-play enjoyers: for world-builders who want to specify every detail in advance (plot, setting, writing style, characters, locations, items, lore, etc.), and for intuitive writers who start with a loose prompt and shape the narrative through inline instructions (OOC) as the story / role-play unfolds. Supports multi-character role-plays, where the model can automatically pick between characters; inline writing instructions (OOC) for controlling plot development (say what should happen, what the characters should do, etc.) and pacing; inline writing assistance, such as planning the next scene / chapter / story and suggesting new characters; and reasoning (opt-in).

Repository: localai · License: apache-2.0

impish_nemo_12b
August 2025, Impish_Nemo_12B: my best model yet. And unlike a typical Nemo, this one can take much higher temperatures (it works well at 1+). Oh, and regarding following the character card: it has somehow gotten even better, to the point of being straight-up uncanny 🙃 (I had to check twice that this model was loaded, and not some 70B!). I feel like this model could easily replace models much larger than itself for adventure or roleplay; for assistant tasks, obviously not, but the creativity here? Off the charts. Characters have never felt so alive and in the moment before: they'll use insinuation, manipulation, and, if needed (or provoked), force. They feel very present. That look on Neo's face when he opened his eyes and said, "I know Kung Fu"? Well, Impish_Nemo_12B had pretty much the same moment, and it now knows more than just Kung Fu, much, much more. It wasn't easy, and it's a niche within a niche, but as promised almost half a year ago, it is now done. Impish_Nemo_12B is smart, sassy, and creative, with a lot of unhingedness baked deep into every interaction. It took Mistral's innate relative freedom and turned it up to 11. It may well be too much for many, but after testing and interacting with so many models, I find this 'edge' of sorts rather fun and refreshing. Anyway, the dataset used is absolutely massive, with tons of new types of data and new domains of knowledge (Morrowind fandom, fighting, etc...). The whole dataset is a very well-balanced mix, and it resulted in a model with extremely strong common sense for a 12B. Regarding response length: there's almost no response-length bias here; this one is very dynamic and will easily adjust reply length based on 1-3 examples of provided dialogue. Oh, and the model comes with 3 new Character Cards: 2 Roleplay and 1 Adventure!

Repository: localai · License: apache-2.0

qwen3-nemotron-32b-rlbff-i1
**Model Name:** Qwen3-Nemotron-32B-RLBFF
**Base Model:** Qwen/Qwen3-32B
**Developer:** NVIDIA
**License:** NVIDIA Open Model License

**Description:** Qwen3-Nemotron-32B-RLBFF is a high-performance, fine-tuned large language model built on the Qwen3-32B foundation. It is specifically optimized to generate high-quality, helpful responses in a default thinking mode through advanced reinforcement learning with binary flexible feedback (RLBFF). Trained on the HelpSteer3 dataset, this model excels in reasoning, planning, coding, and information-seeking tasks while maintaining strong safety and alignment with human preferences.

**Key Performance (as of Sep 2025):**
- **MT-Bench:** 9.50 (near GPT-4-Turbo level)
- **Arena Hard V2:** 55.6%
- **WildBench:** 70.33%

**Architecture & Efficiency:**
- 32 billion parameters, based on the Qwen3 Transformer architecture
- Designed for deployment on NVIDIA GPUs (Ampere, Hopper, Turing)
- Achieves performance comparable to DeepSeek R1 and O3-mini at less than 5% of the inference cost

**Use Case:** Ideal for applications requiring reliable, thoughtful, and safe responses, such as advanced chatbots, research assistants, and enterprise AI systems.

**Access & Usage:** Available on Hugging Face with support for Hugging Face Transformers and vLLM.

**Cite:** [Wang et al., 2025 — RLBFF: Binary Flexible Feedback](https://arxiv.org/abs/2509.21319)

👉 *Note: The GGUF version (mradermacher/Qwen3-Nemotron-32B-RLBFF-i1-GGUF) is a user-quantized variant. The original model is available at nvidia/Qwen3-Nemotron-32B-RLBFF.*
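
Since the entry lists Hugging Face Transformers support, a minimal loading sketch follows. The model id comes from the entry's note; the chat-template call and generation settings are generic Transformers boilerplate, and a 32B model needs substantial GPU memory or offloading.

```python
# Minimal sketch: loading the original (unquantized) checkpoint with
# Hugging Face Transformers. device_map="auto" lets accelerate shard or
# offload the 32B weights across available devices.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Qwen3-Nemotron-32B-RLBFF"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Plan a weekend trip to Kyoto."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```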

Repository: localai · License: apache-2.0

nvidia.qwen3-nemotron-32b-rlbff
The **nvidia/Qwen3-Nemotron-32B-RLBFF** is a large language model based on the Qwen3 architecture, fine-tuned by NVIDIA using Reinforcement Learning with Binary Flexible Feedback (RLBFF) for improved alignment with human preferences. With 32 billion parameters, it excels in complex reasoning, instruction following, and natural language generation, making it suitable for advanced tasks such as code generation, dialogue systems, and content creation. This model is part of NVIDIA's Nemotron series, designed to deliver high performance and safety in real-world applications. It is optimized for efficient deployment while maintaining strong language understanding and generation capabilities.

**Key Features:**
- **Base Model**: Qwen3-32B
- **Fine-tuning**: Reinforcement Learning with Binary Flexible Feedback (RLBFF)
- **Use Case**: Advanced text generation, coding, dialogue, and reasoning
- **License**: MIT (check Hugging Face for full details)

👉 [View on Hugging Face](https://huggingface.co/nvidia/Qwen3-Nemotron-32B-RLBFF)

*Note: The GGUF version hosted by DevQuasar is a quantized variant for efficient local inference. The original, unquantized model is available at the link above.*

Repository: localai · License: apache-2.0