
LLM-Based Comprehensive Detection of Firewall Rule Anomalies
2025-8-29
Paper
Chang-Sheng Lee, I-Chen Lee, Ling-Jyh Chen. "Enhancing Firewall Rule Anomaly Detection via LLM Alignment." International Conference on Technologies and Applications of Artificial Intelligence, Taiwan, 2025.
Motivation
- Traditional firewall rule sets are difficult to maintain because old rules accumulate, leading to complexity and higher costs.
- Detecting anomalies (e.g., shadowing, redundancy, correlation) in firewall rules is a critical first step before simplifying them.
- Existing rule-based methods lack flexibility and generalization.
- Large Language Models (LLMs) offer a promising alternative due to their ability to recognize patterns and generalize.
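To make the anomaly types concrete, here is a minimal sketch of the pairwise relationships the paper asks a model to detect. The `Rule` representation (address spaces abstracted as integer ranges) and the subset logic are simplifying assumptions for illustration, not the paper's encoding.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    src: range      # source address space (abstracted as an int range)
    dst: range      # destination address space
    proto: str      # "tcp", "udp", or "any"
    action: str     # "accept" or "deny"

def covers(a: Rule, b: Rule) -> bool:
    """True if rule a matches every packet that rule b matches."""
    return (a.src.start <= b.src.start and a.src.stop >= b.src.stop
            and a.dst.start <= b.dst.start and a.dst.stop >= b.dst.stop
            and a.proto in (b.proto, "any"))

def classify(earlier: Rule, later: Rule) -> str:
    """Classify the relationship between an earlier and a later rule."""
    if covers(earlier, later):
        # The later rule never fires: an anomaly if the actions conflict,
        # dead weight (redundancy) if they agree.
        return "shadowing" if earlier.action != later.action else "redundancy"
    if covers(later, earlier):
        return "generalization"
    return "unrelated"
```

For example, a broad early `deny` rule followed by a narrow `accept` rule is classified as shadowing, since the narrow rule can never take effect.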
Methods
- Model Training
- Used Supervised Fine-Tuning (SFT) with a small dataset (75 examples) that included reasoning steps and anomaly labels.
- Applied Reinforcement Learning (RL) with ~36,000 examples using Group Relative Policy Optimization (GRPO).
- Designed reward functions focusing on both format correctness and answer accuracy.
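The two reward terms can be sketched as follows. This is a hypothetical shaping in the spirit of the paper's design (format correctness plus answer accuracy); the `<answer>` tag format, labels, and weights are illustrative assumptions, as is the group normalization GRPO applies in place of a learned value baseline.

```python
import re

# Assumed output format: the model wraps its verdict in <answer>...</answer>.
ANSWER_RE = re.compile(r"<answer>(shadowing|redundancy|correlation|none)</answer>")

def reward(completion: str, gold: str) -> float:
    """Score one sampled completion: partial credit for well-formed
    output, the bulk of the reward for the correct anomaly label."""
    m = ANSWER_RE.search(completion)
    format_r = 0.2 if m else 0.0
    answer_r = 0.8 if m and m.group(1) == gold else 0.0
    return format_r + answer_r

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO normalizes rewards within each sampled group
    (mean/std baseline) instead of training a critic."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    sd = var ** 0.5 or 1.0
    return [(r - mu) / sd for r in rewards]
```

A completion that is well-formed but wrong still earns the small format reward, which gives the policy a gradient toward parseable output early in training.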
- Experiment Setup
- Models: Qwen3-4B (Base and Instruct versions).
- Training hardware: RTX 4090, H100 NVL/PCIe (via Runpods).
- Framework: Unsloth (for efficient training).
- Testing
- Compared combinations of Base/Instruct with SFT and/or RL.
- Evaluated accuracy on anomaly detection tasks involving firewall rule pairs.
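The evaluation amounts to scoring each model variant on labeled rule pairs; a minimal sketch, where `predict` stands in for a call to one of the fine-tuned checkpoints (names are illustrative):

```python
def accuracy(predict, pairs: list[tuple[str, str]], labels: list[str]) -> float:
    """Fraction of rule pairs whose predicted anomaly matches the label."""
    hits = sum(predict(a, b) == gold for (a, b), gold in zip(pairs, labels))
    return hits / len(labels)
```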
Results
- Best performance: the Instruct model with SFT + RL, reaching ~99.2% accuracy.
- Both SFT and RL improved accuracy, with RL contributing more than SFT.
- The pure Base model achieved only ~50% accuracy; the pure Instruct model reached ~70%.
- However, performance collapsed when evaluating many rules at once (100+ simultaneously).
- Models were good at two-rule comparisons but failed to generalize to larger rule sets.
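The collapse is unsurprising given the pairwise framing: checking a whole rule set this way expands quadratically. An illustrative sketch (not the authors' pipeline):

```python
from itertools import combinations

def rule_pairs(rules: list[str]) -> list[tuple[str, str]]:
    """All (earlier, later) pairs a pairwise detector must judge."""
    return list(combinations(rules, 2))
```

For a 100-rule policy this is already 4,950 pairwise judgments, so either the model must reason over the full set in one context or the detector must issue thousands of queries.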
Conclusion
- LLM alignment (SFT + RL) significantly enhances performance for detecting firewall rule anomalies in pairwise settings.
- Reinforcement learning is particularly powerful, while SFT shows limited benefit due to small dataset size.
- Current methods lack generalization to complex, multi-rule scenarios.
- Future work should test more pretrained models, employ curriculum learning, and experiment with different training strategies (prompts, reward functions, hyperparameters).