Date: January 2025
Version: 0.4.0
Focus: ROCm, PyTorch, MIOpen, and bitsandbytes support for AMD Radeon 8060S (gfx1151)
This document details the current state of AI/ML package support for the ASUS ROG Flow Z13 (GZ302EA) with AMD Radeon 8060S integrated graphics (gfx1151 architecture). The GZ302 setup script now includes comprehensive AI/ML support through the gz302-llm.sh module, which installs:

- Ollama - local LLM server
- ROCm - AMD GPU acceleration stack
- PyTorch - deep learning framework
- MIOpen - AMD deep learning primitives
- bitsandbytes - 8-bit quantization (experimental)
- Transformers - Hugging Face ecosystem
The gz302-llm.sh module automatically configures these optimizations via systemd service overrides:
For Ollama (/etc/systemd/system/ollama.service.d/gz302-strix-halo.conf):
# HSA_OVERRIDE_GFX_VERSION: Treat gfx1151 as gfx1100 (RDNA3) for driver compatibility
# This enables mature, well-optimized RDNA3 code paths
Environment="HSA_OVERRIDE_GFX_VERSION=11.0.0"
# Ensure only Radeon 8060S is visible to HIP runtime
Environment="HIP_VISIBLE_DEVICES=0"
# Increase hardware queues for better parallelism
Environment="GPU_MAX_HW_QUEUES=8"
# Enable async kernel execution for performance
Environment="AMD_SERIALIZE_KERNEL=0"
Environment="AMD_SERIALIZE_COPY=0"
# hipBLASLt (high-performance BLAS library): suppress verbose logging
Environment="HIPBLASLT_LOG_LEVEL=0"
# Full GPU offload with memory overhead for stability
Environment="OLLAMA_NUM_GPU=999"
Environment="OLLAMA_GPU_OVERHEAD=512000000"
For llama.cpp (embedded in systemd service):
# Same HSA override for RDNA3 compatibility
Environment="HSA_OVERRIDE_GFX_VERSION=11.0.0"
Environment="HIP_VISIBLE_DEVICES=0"
Environment="GPU_MAX_HW_QUEUES=8"
Environment="AMD_SERIALIZE_KERNEL=0"
Environment="AMD_SERIALIZE_COPY=0"
llama.cpp with rocWMMA Flash Attention:
# GGML_HIP_ROCWMMA_FATTN=ON enables rocWMMA flash attention for RDNA3+
cmake .. \
  -DGGML_HIP=ON \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DGPU_TARGETS=gfx1151 \
  -DCMAKE_BUILD_TYPE=Release
The -DGGML_HIP_ROCWMMA_FATTN=ON flag enables flash attention using AMD’s rocWMMA library, providing significant performance improvements for attention computation on RDNA3+ architectures.
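For reference, a hedged end-to-end build sketch using the flags above; the upstream repository URL and build directory are assumptions, and GPU_TARGETS can be switched to gfx1100 if you prefer the code path described in the Ollama section:

# Build llama.cpp with HIP and rocWMMA flash attention (paths are examples)
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build \
  -DGGML_HIP=ON \
  -DGGML_HIP_ROCWMMA_FATTN=ON \
  -DGPU_TARGETS=gfx1151 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j"$(nproc)"
# Resulting binaries (llama-server, llama-cli) end up under build/bin/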
llama-server critical flags for Strix Halo:
# --n-gpu-layers 999: force all layers to GPU
# -fa 1: enable flash attention
# --no-mmap: disable mmap for unified memory compatibility
llama-server --host 0.0.0.0 --port 8080 \
  -m /path/to/model.gguf \
  --n-gpu-layers 999 \
  -fa 1 \
  --no-mmap
- -fa 1 (flash attention): REQUIRED - without it, performance collapses significantly
- --no-mmap: prevents memory-mapping issues with Strix Halo's unified memory aperture
- --n-gpu-layers 999: forces all model layers onto the GPU for maximum performance

The AMD Radeon 8060S uses the gfx1151 architecture (Strix Halo / RDNA 3.5). However, ROCm's most mature, best-optimized code paths target gfx1100 (RDNA3), and native gfx1151 support is still in preview.
By setting HSA_OVERRIDE_GFX_VERSION=11.0.0, the ROCm runtime treats the GPU as gfx1100, enabling access to these mature optimizations.
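A quick way to sanity-check the override; the expectation that the reported ISA changes to gfx1100 assumes the override is honored at the ROCr runtime level, which rocminfo uses:

# Without the override the device should report gfx1151
rocminfo | grep -m1 "gfx"
# With the override the same query should report gfx1100
HSA_OVERRIDE_GFX_VERSION=11.0.0 rocminfo | grep -m1 "gfx"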
Pre-built Ollama binaries may not include optimized HIP support for RDNA 3+ architectures. Building from source allows targeting the well-supported gfx1100 code path and enabling rocWMMA flash attention.
CachyOS/Arch Linux:
sudo pacman -S rocm-hip-sdk rocm-hip-runtime hip-runtime-amd rocblas miopen-hip rocwmma git cmake go
Ubuntu/Debian:
# Add AMD ROCm repository first
sudo apt install rocm-dev rocm-hip-sdk rocblas miopen-hip git cmake golang-go
CRITICAL: Build with gfx1100 target, NOT gfx1151. The HIP runtime doesn’t fully support gfx1151 code objects yet. Combined with HSA_OVERRIDE_GFX_VERSION=11.0.0, the gfx1100 build runs perfectly on Strix Halo.
# Clone Ollama
git clone https://github.com/ollama/ollama.git
cd ollama
# Build with gfx1100 target (works with HSA_OVERRIDE_GFX_VERSION=11.0.0)
mkdir -p build && cd build
cmake .. \
-DAMDGPU_TARGETS="gfx1100" \
-DGGML_HIP_ROCWMMA_FATTN=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j$(nproc)
# Build Go binary
cd ..
go build .
# Install (as root)
sudo cp ./ollama /usr/bin/ollama
sudo mkdir -p /usr/lib/ollama
sudo cp build/lib/ollama/* /usr/lib/ollama/
Create /etc/systemd/system/ollama.service:
[Unit]
Description=Ollama Service (Custom GZ302 build with gfx1100 target)
After=network-online.target
[Service]
Type=simple
ExecStart=/usr/bin/ollama serve
Restart=always
RestartSec=3
User=ollama
Group=ollama
# GZ302 Strix Halo gfx1151 Optimizations
Environment="HSA_OVERRIDE_GFX_VERSION=11.0.0"
Environment="HIP_VISIBLE_DEVICES=0"
Environment="GPU_MAX_HW_QUEUES=8"
Environment="AMD_SERIALIZE_KERNEL=0"
Environment="AMD_SERIALIZE_COPY=0"
Environment="OLLAMA_NUM_GPU=999"
Environment="OLLAMA_GPU_OVERHEAD=512000000"
Environment="OLLAMA_DEBUG=0"
[Install]
WantedBy=default.target
# Create user and group
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama
# Create models directory
sudo mkdir -p /usr/share/ollama/.ollama
sudo chown -R ollama:ollama /usr/share/ollama
# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable --now ollama
# Check service status
systemctl status ollama
# Test with a model (runs on GPU)
ollama pull llama3.2:1b
ollama run llama3.2:1b "Hello, what GPU are you using?" --verbose
# Expected output includes:
# - "eval rate: 100+ tokens/s" (GPU accelerated)
# - "load_tensors: ROCm0" (model loaded on GPU)
With the llama3.2:1b model, expect eval rates of roughly 100+ tokens/s. This is significantly faster than CPU-only inference and demonstrates proper GPU utilization.
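To confirm the work is actually landing on the iGPU during a run, GPU activity can be watched in a second terminal; a minimal sketch:

# Watch GPU utilization and VRAM while a model is generating (Ctrl+C to stop)
watch -n 1 /opt/rocm/bin/rocm-smi --showuse --showmeminfo vram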
ROCm Version Support:
Key Points:
Arch Linux:
pacman -S rocm-opencl-runtime rocm-hip-runtime rocblas miopen-hip
Ubuntu/Debian:
apt install rocm-opencl-runtime rocblas miopen-hip
# May require adding AMD ROCm repository
Fedora:
dnf install rocm-opencl rocblas miopen-hip
# May require EPEL or AMD ROCm repository
OpenSUSE:
zypper install rocm-opencl rocblas miopen-hip
# May require OBS repositories
What is MIOpen? MIOpen is AMD's library for high-performance deep learning primitives, similar to NVIDIA's cuDNN. It provides optimized implementations of convolutions, pooling, normalization, activation functions, and RNNs.
Installation:
- Available as the miopen-hip package on supported distributions

Performance Considerations:
Current Installation Method:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7
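A quick check that the ROCm build (rather than a CPU-only or CUDA wheel) is what actually got installed; on ROCm builds torch.version.hip is a version string instead of None:

python -c "import torch; print('HIP:', torch.version.hip); print('GPU available:', torch.cuda.is_available())"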
Compatibility Notes:
Fallback Strategy (implemented in gz302-llm.sh):
What is bitsandbytes? bitsandbytes provides 8-bit optimizers, 8-bit (LLM.int8()) inference, and 4-bit quantization for reducing the memory footprint of large models.
Installation Strategy:
# Try standard installation first
pip install bitsandbytes
# If that fails, use ROCm-specific wheel
pip install --no-deps --force-reinstall \
'https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_multi-backend-refactor/bitsandbytes-0.44.1.dev0-py3-none-manylinux_2_24_x86_64.whl'
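If the install succeeds, a minimal functional check (beyond the python -m bitsandbytes test later in this document) is to run one step of an 8-bit optimizer on the GPU; a sketch assuming the ROCm PyTorch build above is already working:

python - <<'EOF'
import torch
import bitsandbytes as bnb

# Tiny linear layer on the GPU; Adam8bit exercises bitsandbytes' 8-bit optimizer kernels
model = torch.nn.Linear(64, 64).cuda()
opt = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)
loss = model(torch.randn(8, 64, device="cuda")).sum()
loss.backward()
opt.step()
print("bitsandbytes 8-bit optimizer step completed")
EOF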
Current Status for gfx1151:
Build from Source (if needed):
git clone --recurse-submodules https://github.com/ROCm/bitsandbytes
cd bitsandbytes
git checkout rocm_enabled
cmake -DCOMPUTE_BACKEND=hip -DBNB_ROCM_ARCH="gfx1151" -S .
make
pip install .
Why CachyOS for AI/ML workloads:
Optimized AI/ML Packages:
| Package | Description |
|---|---|
| ollama-rocm | Ollama with ROCm support for AMD GPUs |
| python-pytorch-opt-rocm | PyTorch with ROCm + AVX2 optimizations |
| rocm-ml-sdk | Full ROCm ML development stack |
| rocm-hip-runtime | HIP runtime (znver4 compiled) |
| miopen-hip | AMD deep learning primitives |
Installation:
# Install Ollama with ROCm (automatic GPU detection)
sudo pacman -S ollama-rocm
# Install optimized PyTorch
sudo pacman -S python-pytorch-opt-rocm
# Install Open WebUI (from AUR)
yay -S open-webui # or paru -S open-webui
# Full ROCm ML stack (optional, for custom development)
sudo pacman -S rocm-ml-sdk
Verify Installation:
# Start Ollama service
sudo systemctl enable --now ollama
ollama pull llama3.2
# Verify PyTorch ROCm
python -c "import torch; print('ROCm available:', torch.cuda.is_available()); print('Device:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A')"
Reference: https://wiki.cachyos.org/features/optimized_repos/
Open WebUI is a modern, feature-rich web interface for LLM interaction. It supports multiple backends including Ollama, llama.cpp, OpenAI API, and more.
Basic Installation (connects to existing backends):
# Pull and run Open WebUI
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
# Access at http://localhost:3000
With Ollama on same machine:
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
ghcr.io/open-webui/open-webui:main
With llama.cpp server:
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
-e OPENAI_API_BASE_URL=http://host.docker.internal:8080/v1 \
-e OPENAI_API_KEY=sk-no-key-required \
ghcr.io/open-webui/open-webui:main
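Once a container is up, the relevant backend can be checked from the host before opening the web UI (adjust to whichever backend you connected; ports as configured above):

# Ollama API (should list pulled models)
curl http://localhost:11434/api/tags
# llama.cpp server, OpenAI-compatible endpoint
curl http://localhost:8080/v1/models
# Open WebUI itself
curl -I http://localhost:3000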
Bundled with Ollama (all-in-one):
# CPU only
docker run -d -p 3000:8080 \
-v ollama:/root/.ollama \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:ollama
# With AMD GPU (ROCm)
docker run -d -p 3000:8080 \
--device=/dev/kfd --device=/dev/dri \
-v ollama:/root/.ollama \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
-e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
ghcr.io/open-webui/open-webui:ollama
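To confirm the bundled Ollama inside the container actually sees the GPU, its startup log can be checked; the exact log wording varies between Ollama versions, so treat the grep pattern as a loose filter:

docker logs open-webui 2>&1 | grep -iE "rocm|amdgpu|gfx"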
Create docker-compose.yml:
services:
ollama:
image: ollama/ollama:rocm
container_name: ollama
devices:
- /dev/kfd:/dev/kfd
- /dev/dri:/dev/dri
environment:
- HSA_OVERRIDE_GFX_VERSION=11.0.0
volumes:
- ollama:/root/.ollama
restart: unless-stopped
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
ports:
- "3000:8080"
environment:
- OLLAMA_BASE_URL=http://ollama:11434
volumes:
- open-webui:/app/backend/data
depends_on:
- ollama
restart: unless-stopped
volumes:
ollama:
open-webui:
Run with: docker compose up -d
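Useful follow-up commands for the Compose setup:

# Show container status
docker compose ps
# Follow logs from both services
docker compose logs -f ollama open-webui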
# Update Open WebUI
docker pull ghcr.io/open-webui/open-webui:main
docker stop open-webui && docker rm open-webui
# Re-run the docker run command
# Auto-update with Watchtower
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
containrrr/watchtower --run-once open-webui
# View logs
docker logs -f open-webui
# Stop/Start
docker stop open-webui
docker start open-webui
Data persists in the named Docker volumes (open-webui and ollama), so containers can be recreated without losing models or settings.

Advantages:
Packages:
rocm-opencl-runtime, rocm-hip-runtime, rocblas, miopen-hip

Considerations:
Repository Setup:
wget https://repo.radeon.com/amdgpu-install/7.1/ubuntu/noble/amdgpu-install_7.1.70100-1_all.deb
sudo apt install ./amdgpu-install_7.1.70100-1_all.deb
sudo apt update
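After the repository package is installed, the ROCm userspace is typically pulled in via AMD's amdgpu-install helper; the --usecase value below is the standard ROCm selection, but consult AMD's documentation for the exact set supported by this release:

# Install the ROCm userspace stack via AMD's installer helper
sudo amdgpu-install --usecase=rocm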
Considerations:
Considerations:
Issue: ROCm support for gfx1151 is in preview stage
Workaround: Use HSA_OVERRIDE_GFX_VERSION=11.0.0 (as configured above) so the mature gfx1100/RDNA3 code paths are used
Issue: ROCm wheels may not be available for Python 3.12+
Workaround:
Issue: First run may be slow due to kernel JIT compilation
Workaround:
Issue: ROCm backend is experimental for consumer GPUs
Workaround: Install the ROCm-specific wheel shown above, or build from source with the HIP backend
Issue: Some systems may hang with default IOMMU settings
Workaround:
Add iommu=pt to the kernel boot parameters

Check ROCm Installation:
rocminfo
# Should show gfx1151 device
/opt/rocm/bin/rocm-smi
# Should show GPU information
Test PyTorch ROCm:
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
# Should show True and device name
Test bitsandbytes:
python -m bitsandbytes
# Should run without errors
Test MIOpen:
# Run a simple PyTorch CNN model
# First run will compile kernels
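A minimal sketch of such a test, assuming the ROCm PyTorch build from earlier is installed; the first convolution is expected to be slower while MIOpen compiles and caches kernels:

python - <<'EOF'
import time
import torch

# A small convolution on the GPU routes through MIOpen on ROCm builds of PyTorch
conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1).cuda()
x = torch.randn(1, 3, 224, 224, device="cuda")

start = time.time()
y = conv(x)
torch.cuda.synchronize()
print(f"First conv (includes MIOpen kernel compilation): {time.time() - start:.2f}s")

start = time.time()
y = conv(x)
torch.cuda.synchronize()
print(f"Second conv (cached kernels): {time.time() - start:.4f}s")
EOF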
The GZ302 now has comprehensive AI/ML support through the updated gz302-llm.sh module. While ROCm support for gfx1151 is in preview stage, the installation includes:
✅ Ollama - Local LLM server
✅ ROCm - AMD GPU acceleration
✅ PyTorch - Deep learning framework
✅ MIOpen - Optimized deep learning primitives
✅ bitsandbytes - 8-bit quantization (experimental)
✅ Transformers - Hugging Face ecosystem
Users should expect:
Best experience:
Document Version: 1.0
Created: November 7, 2025
Maintainer: th3cavalry
License: Same as parent repository