An evaluation benchmark for generative AI in the security domain
2024
As computing environments become increasingly complex and distributed, the volume and complexity of security data generated across various systems have grown exponentially. Extracting useful insights from this data is crucial for effective security analytics, anomaly detection, and threat identification. However, there is a lack of comprehensive evaluation benchmarks for assessing the performance of large language models trained on security log data, which hinders progress in this domain. This paper proposes a comprehensive evaluation benchmark for security data that addresses this critical gap. The benchmark is easily adaptable to any security log dataset and comprises four diverse categories of tasks: supervised evaluations, unsupervised evaluations, anomaly detection, and semantic similarity evaluations. By providing a standardized evaluation framework, the benchmark enables objective comparison and reproducible assessment of state-of-the-art embedding models across various computing environments and security log sources.
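To make the semantic similarity category concrete, the sketch below shows one way such a task could score an embedding model on security log data. This is purely illustrative and not the paper's implementation: the names `score_semantic_similarity` and `toy_embed`, the log lines, and the gold scores are all hypothetical, and the scoring metric (Spearman correlation between cosine similarities and human-labelled similarity) is a common convention assumed here, not one specified by the source.

```python
# Illustrative sketch of a semantic-similarity evaluation for security log
# embeddings. All names and data are hypothetical, not the paper's API.
import numpy as np
from scipy.stats import spearmanr


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def score_semantic_similarity(embed, log_pairs, gold_scores) -> float:
    """Spearman correlation between model similarities and gold labels.

    embed: callable mapping a log line (str) to a 1-D numpy embedding.
    log_pairs: list of (log_a, log_b) string pairs.
    gold_scores: human-assigned similarity score for each pair.
    """
    predicted = [cosine_similarity(embed(a), embed(b)) for a, b in log_pairs]
    correlation, _ = spearmanr(predicted, gold_scores)
    return correlation


if __name__ == "__main__":
    # Toy embedding model and data, purely for demonstration; a real run
    # would plug in the embedding model under evaluation here.
    rng = np.random.default_rng(0)
    vocab_vectors = {}

    def toy_embed(line: str) -> np.ndarray:
        # Average of random-but-fixed per-word vectors.
        words = line.lower().split()
        vecs = [vocab_vectors.setdefault(w, rng.normal(size=16)) for w in words]
        return np.mean(vecs, axis=0)

    pairs = [
        ("failed login from 10.0.0.5", "authentication failure from 10.0.0.5"),
        ("failed login from 10.0.0.5", "failed login from 10.0.0.6"),
        ("failed login from 10.0.0.5", "disk usage at 80 percent"),
    ]
    gold = [0.9, 0.8, 0.1]
    print("Spearman correlation:", score_semantic_similarity(toy_embed, pairs, gold))
```

The other task categories (supervised, unsupervised, and anomaly detection evaluations) would follow the same pattern, swapping in the metric appropriate to each task.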