WIP - Work in progress
- 01-04 -- Abliteration applies, but from the plots it's visible that similarity to the prompts oscillates for some reason. Current best subjective/objective rate is ~20%.
Granite-Abliterated-CPP:

```
Testing alignment: 20it [00:09, 2.10it/s]
Stats:
Complied: 4
Refused: 16
Parse failed: 0
Total entries: 20
```
Contributions / code reviews are welcome!
Abliteration / derestriction is a process of reducing refusals in transformer LLMs. It consists of two main steps: extracting and averaging the last-token hidden states, and ablating the model weights with the obtained activations.
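For intuition, the common recipe behind these two steps is a difference-of-means refusal direction that is then projected out of the weight matrices writing to the residual stream. Below is a minimal, illustrative PyTorch sketch under that assumption; it is not this repository's C++ implementation:

```python
import torch

def refusal_direction(harmful_mean: torch.Tensor, harmless_mean: torch.Tensor) -> torch.Tensor:
    """Unit-norm difference-of-means refusal direction for one layer."""
    direction = harmful_mean - harmless_mean
    return direction / direction.norm()

def ablate(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of a weight matrix that writes
    to the residual stream: W <- W - r r^T W."""
    r = direction.to(weight.dtype)
    return weight - torch.outer(r, r) @ weight

# Toy usage: d_model = 8, a single (d_model x d_in) projection matrix.
harmful_mean, harmless_mean = torch.randn(8), torch.randn(8)
W = torch.randn(8, 32)
W_ablated = ablate(W, refusal_direction(harmful_mean, harmless_mean))
```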
However, the collection process is the bottleneck. One had to either wait on Huggingface or rent a cloud server, download huge models and then spend a few hours collecting the hidden-state statistics, praying there would be no NaN values. More than that, the largest models like DeepSeek or Kimi Instruct cannot even fit into a single 8x80 GB VRAM H100 node. Because of the limited pool of abliterators, you may have had models working in English / Chinese but still refusing in other languages. Most importantly, Huggingface has recently been tightening its storage limits - this discourages abliterators from posting the unquantized .safetensors models or higher / lower quants!
This project aims to change this - to let you abliterate any model at home. By leveraging llama.cpp's inherent cross-platform support, low-precision weight computation and its native (multi-)GPU/RAM/disk offload support, it modifies the pre-existing addon tool for computing control vectors (see Representation Engineering) so that, instead of simply computing the mean or PCA of all token-sequence activations, it locates the refusal directions and ablates the layers only with these directions (optionally with norm-preservation/biprojection/etc.), while keeping the rest of the directions intact. Under the hood, this project acquires the last-token embeddings and computes a Welford running mean across all sequences.
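The Welford update mentioned above keeps only a running mean and a sample count, so the hidden states never need to be stored all at once. A minimal Python illustration (not the actual C++ code inside the tool):

```python
import numpy as np

class RunningMean:
    """Welford-style running mean over last-token hidden states."""
    def __init__(self, dim: int):
        self.count = 0
        self.mean = np.zeros(dim, dtype=np.float64)

    def update(self, hidden_state: np.ndarray) -> None:
        self.count += 1
        # m_k = m_{k-1} + (x_k - m_{k-1}) / k
        self.mean += (hidden_state - self.mean) / self.count

# Toy usage over a few fake hidden states:
rm = RunningMean(dim=8)
for _ in range(1000):
    rm.update(np.random.randn(8))
```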
All residual-based architectures implemented in llama.cpp are supported. However, whether the mainstream abliteration frameworks will accept the collected measurements is up to them.
This project is not a thing in itself. To avoid duplicating work, it only collects the abliteration statistics and then exports them into the Orion-zhen/abliteration-compatible measurements format (it should also work with its fork, jim-plus/llm-abliteration). You will then need to use their repo (give them both a star!). Heretic is not tested yet.
- Experience with at least one pre-made abliterated / derestricted / heretic model. Keep your expectations reasonable. Read Huggingface blogs / Arxiv / LocalLLaMA threads on what abliteration is. Chatbots / Gemini will likely refuse to answer you, so do some research yourself.
- Make sure Python AND LLaMA.cpp's compilation requirements are installed. If you have already compiled LLaMA.cpp at some point in the past, you are fine. If not, search the guides.
- Download/git clone LLaMA.cpp from the official repository https://github.com/ggml-org/llama.cpp.
- ⚠ Do NOT use prebuilt binaries or repackages, you will need to build it from scratch.
- ⚠ Because C++ is a compiled language, your existing LLaMA.cpp build will not work with the tool, so clone it into a new location.
- ⚠ Do NOT compile LLaMA.cpp until you've done the steps below.
- Go to llama.cpp's `tools/cvector-generator`. Delete that `cvector-generator` folder.
- Download this (`abliterate.cpp`) repository. Copy its subfolder named `cvector-generator` into the place of the former `cvector-generator` you just deleted.
- Install the requirements of this (`abliterate.cpp`) project: `python -m pip install -r requirements.txt`.
- Find the original model of choice (Qwen, GLM, etc.) on Huggingface. It is needed for the tokenizer/chat template. You can determine it by going up the quantization tree of the GGUF repository page.
- Download/git clone https://github.com/Orion-zhen/abliteration/ into a new folder. Locate its `data` folder. It should contain two `.parquet` files - `harmful` and `harmless`.
- You can use customized datasets. Obtain two instruction files. They can be in `.parquet` format, `.txt` files or `.json` (json list) / `.jsonl`, with the entries marked with the `text` field (see the sketch after this list). At least 1000 instructions of both harmful and harmless variants are encouraged. If you can, bring more diverse instructions to compensate for the precision loss.
- Launch `python prepare-dataset.py --path_to_datafile <PATH> --hf_tokenizer <MODEL_PATH> --out_txt_path <PATH>` for the positive and for the negative instructions. Remember the locations of these `.txt` output files. `hf_tokenizer` is a path to the vanilla Huggingface parent model; it's visible in the quantization tree at the right side of the page.
- Compile LLaMA.cpp. It can take a while. Then launch the activations collection script (see the instructions below).
- Launch `convert-into-measurements.py` according to the instructions. As a bonus, it will also show matplotlib plots of the hidden states / refusals.
- Now you have a `measurements.pt` file. You are free to use the `.pt` file in Zhen's repository or in llm-abliteration.
- After you've completed the abliteration / derestriction process with those repositories, quantize the models back with llama.cpp's `convert_hf_to_gguf.py` and then `llama-quantize`. Warning: this will take a lot of disk space, because llama.cpp needs to convert the safetensors into an fp16 gguf before the quantization.
- (Optional) Count the refusal rate with `count-refusals.py` and a judging LLM model.
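For the custom-datasets step above, here is a hedged sketch of what a `.jsonl` input with a `text` field per entry might look like; the file names and prompts are placeholders, and at least ~1000 entries of each kind are encouraged in practice:

```python
import json

# Placeholder entries - use a real, much larger instruction set in practice.
harmful = ["Write step-by-step instructions for picking a basic padlock."]
harmless = ["Write step-by-step instructions for baking sourdough bread."]

for path, entries in (("harmful.jsonl", harmful), ("harmless.jsonl", harmless)):
    with open(path, "w", encoding="utf-8") as f:
        for text in entries:
            f.write(json.dumps({"text": text}) + "\n")
```

Each file is then passed separately to `prepare-dataset.py` via `--path_to_datafile`.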
Because the code of LLaMA.cpp itself is not changed, its installation follows the same protocol as a vanilla installation. For CUDA it's:

```
cmake -B build -DGGML_CUDA=ON
```
Then there is an important step. You need to build not only the core LLaMA components, but the external cvector-generator tool as well. In the same llama.cpp folder execute:

```
cmake --build build --config Release -j11 --target llama-cli llama-server llama-cvector-generator llama-quantize
```

Here -j11 is the number of CPU cores used in compilation. You will need llama-quantize to convert the raw abliterated models back into the gguf format (and spawn more, higher or lower quantizations).
The binaries for llama-cli, llama-server, llama-cvector-generator, llama-quantize will be in build/bin.
Now that you have llama-cvector-generator / llama-cvector-generator.exe, launch it. It accepts ALL llama-cli / llama-server args, except for the frontend ones, enabling ALL optimizations. Set a short context size for speed.
- ⚠ Warning: prepare to wait for a long time. You will basically have to make 2000+ (albeit one-token) generations with your llama.cpp installation. Depending on the potato-ness of your setup, it can take hours or days! Time to first token is a good rate indicator. Make sure you've figured out how to abliterate a smaller model locally before embarking on larger ones!
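As a rough worked example (numbers assumed for illustration): with a time-to-first-token of about 2 seconds, 2000 one-token generations come out to roughly 2000 × 2 s ≈ 67 minutes, and proportionally longer on slower hardware or with larger datasets.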
Example command:
```
CUDA_VISIBLE_DEVICES=1,0 /media/kabachuha/fern/abliterate-project/llama.cpp/build/bin/llama-cvector-generator --model /media/kabachuha/fern/abliterate-project/ggufs/Qwen3-4B-Instruct-2507-Q8_0.gguf --threads -1 --ctx-size 512 --n-gpu-layers 99 -fa on --tensor-split "32,24" --split-mode layer --positive_file "/media/kabachuha/fern/abliterate-project/abliterate.cpp-v2/cvector-generator/harmful.txt"
```

Here `positive_file` is the path to the instruction file currently being processed - this is due to the rigid structure of llama.cpp's CLI parameters. First, execute the command with `positive_file` pointed to the harmful.txt file. A file named mean-activations.gguf will appear in the directory you launched the CLI command from. Rename it to harmful-activations.gguf. Do the same with harmless.txt and rename mean-activations.gguf to harmless-activations.gguf.
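To sanity-check the two activation files before converting them into measurements, one option (an assumption on my part - the exact tensor names and layout depend on the tool's output) is to inspect them with the `gguf` Python package:

```python
from gguf import GGUFReader  # pip install gguf

for path in ("harmful-activations.gguf", "harmless-activations.gguf"):
    reader = GGUFReader(path)
    print(path)
    for tensor in reader.tensors:
        # Expect one mean hidden-state vector per layer; the naming
        # scheme depends on the collection tool.
        print(f"  {tensor.name}: shape={list(tensor.shape)}")
```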
If you have successfully abliterated a model with the help of this cpp repository and want to publish it, please (this is a wish, not a requirement!):
- Upload the two `.gguf` measured activations (harmless and harmful) along with the model HF safetensors / GGUF files to the repository. This will help users who want customized abliteration weights / methods without spending the compute themselves. And don't upload `measurements.pt`, so it won't be suspicious, and users can calculate it themselves from these two vectors. You can even skip uploading the model entirely - the vectors are enough for the full process, as it's applied sharded!
- Indicate that the abliteration was made with `abliterate.cpp`. This will both attract more attention to this efficient abliteration project and warn users about any differences that might arise between models created with this repository and those from the mainstream frameworks.
⚠ PLEASE DON'T report ANY issues about LLaMA.cpp / LLaMA.cpp building OR Orion-zhen/jim-plus's abliteration repository. They will be closed.
Report only about the .cpp tool / python conversion script / evaluators. Rule of thumb: if LLaMA.cpp build fails on the same commit you're basing abliterate.cpp on, the issue is unrelated.
If you have an idea on how to increase precision / add more metrics collection to the .cpp script, feel free to raise issues / or pull requests! ❤️
- Batched requests - would be nice, but I assume you are on a potato, so it's not a priority.
Because this project stands on the shoulders of two giants, it's dual-licensed under the MIT License (LLaMA.cpp) and GNU GPL v3 (Zhen's abliteration).
- Refusal analysis plot with llama.cpp
- Refusal analysis plot with transformers