Overview
Title: Special Topics on AI Security
Provided by: Dept. of Computer Engineering, Myongji University
Led by: Minho Shin (mhshin@mju.ac.kr, Rm 5736)
Period: Spring semester, 2026
Location: Room 5701, 5th Engineering Building
Time: Wednesdays, 10 am to 1 pm
Type: Graduate Seminar
Goal of the class
This class aims to familiarize students with current research topics in the AI security and privacy area
It also aims to train students in communication skills, including oral presentation, discussion, writing, and collaboration
Participants
| # | Name | Dept | Advisor | Email Address |
| 1 | Hyeonjun Jo | CE | Undergraduate | mnbvjojun@gmail.com |
| 2 | Nayung Kwak | CE | Undergraduate | kny12202423@gmail.com |
| 3 | Kyungchan Kim | CS | Minho Shin | kkc8983@gmail.com |
Agenda
TBD
* order: Cho --> Han --> Kwak
* # of presentations per week: 2, 2, 2, ...
* # of presentations per person:
Rules for the class
We have 15 presentations in total, given by three students
Each student gives five presentations over the semester
One presentation per day
The presenter announces the paper to be presented at least one week ahead
The presenter prepares PowerPoint slides for a 30-60 minute talk
The other students submit a review article (1-2 pages) before class
The presentation should contain:
(Motivation) What motivates this particular problem? What background is needed to understand it? Why is it important?
(Problem) What is the exact problem the authors aim to address, and why does it matter?
(Related work) What have other researchers done to address the same or a similar problem? Why is the existing work not sufficient?
(Method) What is their main methodology for addressing the problem? How did they actually solve it, in detail?
(Evaluation) What evidence of success does the paper present? What is missing in their evaluation?
(Contribution) What are the paper's contributions, and what are not? Are there limitations in the results? How would you assess the value of the paper?
(Future work) What remaining problems were only partially addressed or not covered by the paper? What would be a possible approach to them?
A review article contains:
The same content as described for the presenter
But in succinctly written form
Not exceeding two pages
Submitted in Word/PDF by email
Evaluation
Reading List for LLM-based Cybersecurity
C1. Adversarial Machine Learning
Explaining and Harnessing Adversarial Examples
Ian Goodfellow, Jonathon Shlens, Christian Szegedy, ICLR 2015 | Pages: 11 | Difficulty: 2/5
Abstract: This seminal paper introduces the Fast Gradient Sign Method (FGSM) and demonstrates that neural networks are vulnerable to adversarial examples - inputs with imperceptible perturbations that cause misclassification. The authors show that adversarial examples transfer across models and propose that linearity in high-dimensional spaces is the primary cause of vulnerability, challenging previous hypotheses about model overfitting.
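The FGSM update named above is a single gradient-sign step, x' = x + eps * sign(dL/dx). A minimal NumPy sketch on a toy linear model (the model, weights, and loss here are illustrative stand-ins, not from the paper):

```python
import numpy as np

def fgsm_perturb(x, grad, eps):
    # FGSM: move every coordinate by eps in the direction that
    # increases the loss, i.e. x' = x + eps * sign(dL/dx).
    return x + eps * np.sign(grad)

# Toy linear model f(x) = w.x with squared loss L = (f(x) - y)^2,
# so dL/dx = 2 * (w.x - y) * w. All values are illustrative.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 1.0, 1.0])
y = 0.0
grad = 2.0 * (w @ x - y) * w
x_adv = fgsm_perturb(x, grad, eps=0.1)

# The perturbation has L-infinity norm exactly eps, yet the loss grows.
print(np.max(np.abs(x_adv - x)))                 # 0.1
print((w @ x_adv - y) ** 2 > (w @ x - y) ** 2)   # True
```

The sign operation is what makes the attack fast: one gradient evaluation yields a maximal step under the L-infinity constraint.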
Towards Evaluating the Robustness of Neural Networks
Nicholas Carlini, David Wagner, IEEE S&P 2017 | Pages: 16 | Difficulty: 3/5
Abstract: This paper presents the powerful C&W attack, demonstrating that defensive distillation and other defenses can be bypassed. The authors formulate adversarial example generation as an optimization problem and introduce targeted attacks that achieve near-perfect success rates. They establish important evaluation methodology for measuring model robustness and show that many claimed defenses provide false security.
Intriguing Properties of Neural Networks
Christian Szegedy et al., ICLR 2014 | Pages: 10 | Difficulty: 3/5
Abstract: The first paper to formally identify adversarial examples in deep neural networks. The authors demonstrate that small, carefully crafted perturbations can fool state-of-the-art models and that these adversarial examples transfer between different models. They introduce the L-BFGS attack method and show that adversarial examples reveal fundamental properties of neural network decision boundaries rather than being mere artifacts of overfitting.
DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks
Seyed-Mohsen Moosavi-Dezfooli et al., CVPR 2016 | Pages: 9 | Difficulty: 3/5
Abstract: This paper introduces DeepFool, an efficient algorithm to compute minimal adversarial perturbations. Unlike FGSM which produces large perturbations, DeepFool iteratively linearizes the classifier to find the closest decision boundary. The method provides a way to measure model robustness quantitatively and demonstrates that different architectures have varying levels of robustness to adversarial perturbations.
Universal Adversarial Perturbations
Seyed-Mohsen Moosavi-Dezfooli et al., CVPR 2017 | Pages: 10 | Difficulty: 3/5
Abstract: This paper demonstrates the existence of universal perturbations - single perturbations that can fool a classifier on most inputs from a dataset. These image-agnostic perturbations reveal fundamental geometric properties of decision boundaries and challenge the notion that adversarial examples are input-specific artifacts. The work shows that universal perturbations transfer across different models trained on the same task.
Adversarial Examples Are Not Bugs, They Are Features
Andrew Ilyas et al., NeurIPS 2019 | Pages: 25 | Difficulty: 3/5
Abstract: This influential paper argues that adversarial vulnerability arises from models relying on highly predictive but non-robust features in the data. The authors demonstrate that models trained only on adversarial examples can achieve good accuracy on clean data, showing that adversarial examples exploit genuine patterns. This challenges the view of adversarial examples as bugs and suggests they reveal fundamental properties of standard machine learning.
Adversarial Patch
Tom Brown et al., NIPS 2017 Workshop | Pages: 5 | Difficulty: 2/5
Abstract: This paper introduces adversarial patches - printable, physical perturbations that can fool classifiers in the real world. Unlike digital perturbations, patches are robust to viewing angle, distance, and lighting conditions. The authors demonstrate attacks where a small sticker can cause an image classifier to ignore everything else in the scene, raising serious concerns for real-world ML deployment in security-critical applications.
Threat of Adversarial Attacks on Deep Learning in Computer Vision: A Survey
Naveed Akhtar, Ajmal Mian, IEEE Access 2018 | Pages: 31 | Difficulty: 1/5 (Survey)
Abstract: A comprehensive survey covering adversarial attacks and defenses in computer vision. The paper categorizes attacks based on adversary knowledge, attack specificity, and attack frequency. It reviews major attack methods (FGSM, C&W, DeepFool) and defense strategies (adversarial training, defensive distillation, gradient masking). An excellent entry-level resource for understanding the adversarial ML landscape.
C2. Model Poisoning & Backdoor Attacks
(Jo) You autocomplete me: Poisoning vulnerabilities in neural code completion
BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
Tianyu Gu et al., NIPS 2017 Workshop | Pages: 6 | Difficulty: 2/5
Abstract: This pioneering work introduces backdoor attacks on neural networks where an attacker poisons training data with trigger patterns. The resulting model performs normally on clean inputs but misclassifies when the trigger is present. The authors demonstrate attacks on traffic sign recognition and face identification, showing that backdoored models are difficult to detect through standard accuracy testing.
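A hedged sketch of the trigger-poisoning idea described above, on synthetic arrays; the patch size, trigger value, and poison rate are arbitrary illustrative choices, not the paper's:

```python
import numpy as np

def poison_dataset(images, labels, target_label, rate, rng):
    # BadNets-style data poisoning sketch: stamp a small bright
    # patch (the "trigger") into a corner of a random fraction of
    # training images and relabel those images as the target class.
    images = images.copy()
    labels = labels.copy()
    n = len(images)
    idx = rng.choice(n, size=int(rate * n), replace=False)
    images[idx, -3:, -3:] = 1.0      # 3x3 trigger in the corner
    labels[idx] = target_label
    return images, labels, idx

rng = np.random.default_rng(0)
clean = rng.random((100, 8, 8))                  # 100 toy 8x8 "images"
y = rng.integers(0, 10, size=100)
poisoned, y_p, idx = poison_dataset(clean, y, target_label=7, rate=0.1, rng=rng)
print(len(idx))                  # 10 poisoned samples
print(np.all(y_p[idx] == 7))     # True
```

A model trained on such data learns the trigger-to-target association while its accuracy on clean inputs stays essentially unchanged, which is why accuracy testing alone does not reveal the backdoor.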
Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks
Ali Shafahi et al., NeurIPS 2018 | Pages: 11 | Difficulty: 3/5
Abstract: This paper introduces clean-label poisoning where poisoned training samples maintain correct labels, making attacks harder to detect. The authors craft imperceptible perturbations to training images that cause targeted misclassification. They use feature collision in the network's representation space to make the target input appear similar to a chosen class, demonstrating successful attacks on transfer learning scenarios.
Trojaning Attack on Neural Networks
Yingqi Liu et al., NDSS 2018 | Pages: 8 | Difficulty: 3/5
Abstract: Presents a systematic approach to injecting trojans into neural networks without access to the original training data. The attack reverse-engineers an input trigger that strongly activates selected internal neurons, then retrains the model on synthesized data so that the trigger flips its behavior. Demonstrates attacks that activate only under specific trigger conditions while maintaining normal behavior otherwise.
Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks
Bolun Wang et al., IEEE S&P 2019 | Pages: 15 | Difficulty: 3/5
Abstract: Proposes the first defense mechanism specifically designed to detect and remove backdoors from neural networks. Uses optimization to reverse-engineer potential triggers and identifies anomalous patterns. Successfully detects backdoors with high accuracy and can remove them through fine-tuning or neuron pruning.
Bypassing Backdoor Detection Algorithms in Deep Learning
Te Juin Lester Tan, Reza Shokri, NeurIPS 2020 | Pages: 11 | Difficulty: 4/5
Abstract: Demonstrates sophisticated backdoor attacks that evade state-of-the-art detection methods. Shows that adaptive attackers can craft triggers that appear natural and avoid detection by Neural Cleanse and similar defenses. Challenges the security of existing backdoor detection approaches.
Backdoor Attacks Against Deep Learning Systems in the Physical World
Emily Wenger et al., CVPR 2021 | Pages: 10 | Difficulty: 3/5
Abstract: Extends backdoor attacks to the physical world using robust physical triggers. Demonstrates attacks on traffic sign recognition where physical stickers serve as backdoor triggers. Shows that backdoors can survive in real-world conditions with varying angles, distances, and lighting.
Blind Backdoors in Deep Learning Models
Eugene Bagdasaryan, Vitaly Shmatikov, USENIX Security 2021 | Pages: 18 | Difficulty: 4/5
Abstract: Introduces blind backdoor attacks where the attacker doesn't need to control the training process. Shows how backdoors can be injected through model replacement or by poisoning only a small fraction of training data. Demonstrates attacks on federated learning and transfer learning scenarios.
WaNet: Imperceptible Warping-based Backdoor Attack
Anh Nguyen et al., ICLR 2021 | Pages: 18 | Difficulty: 3/5
Abstract: Proposes a novel backdoor attack using smooth warping transformations instead of visible patches. These backdoors are nearly imperceptible and harder to detect than traditional patch-based triggers. Demonstrates high attack success rates while evading multiple defense mechanisms.
Backdoor Learning: A Survey
Yiming Li et al., arXiv 2022 | Pages: 45 | Difficulty: 1/5 (Survey)
Abstract: Comprehensive survey of backdoor attacks and defenses in deep learning. Categorizes attacks by trigger type, poisoning strategy, and attack scenario. Reviews detection and mitigation methods. Provides taxonomy and identifies open research challenges.
C3. Privacy Attacks on Machine Learning
Membership Inference Attacks Against Machine Learning Models
Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures
Matt Fredrikson et al., CCS 2015 | Pages: 12 | Difficulty: 3/5
Abstract: Demonstrates model inversion attacks that reconstruct training data from model outputs. Shows successful reconstruction of facial images from face recognition models and recovery of sensitive attributes from genomic data predictors. Proposes confidence masking as a partial defense.
Extracting Training Data from Large Language Models
Nicholas Carlini et al., USENIX Security 2021 | Pages: 17 | Difficulty: 3/5
Abstract: Shows that large language models like GPT-2 memorize and can be made to emit verbatim training data including personal information. Demonstrates extraction of phone numbers, addresses, and copyrighted content. Raises serious privacy concerns for LLMs trained on web data.
The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks
Nicholas Carlini et al., USENIX Security 2019 | Pages: 18 | Difficulty: 3/5
Abstract: Studies unintended memorization in neural networks, showing models can memorize rare or sensitive training examples. Proposes exposure metrics to quantify memorization and demonstrates extraction attacks. Shows that differential privacy provides limited protection against memorization.
Stealing Machine Learning Models via Prediction APIs
Florian Tramèr et al., USENIX Security 2016 | Pages: 20 | Difficulty: 3/5
Abstract: Demonstrates model extraction attacks where an attacker queries a black-box model to steal its functionality. Shows successful extraction of logistic regression, neural networks, and decision trees. Analyzes the cost-accuracy tradeoff and proposes defenses based on output perturbation.
Deep Models Under the GAN: Information Leakage from Collaborative Deep Learning
Briland Hitaj et al., CCS 2017 | Pages: 14 | Difficulty: 4/5
Abstract: Presents a novel attack on collaborative learning using GANs. Shows that an adversarial participant can use a GAN to reconstruct private training data from model updates in federated learning. Demonstrates attacks that recover recognizable images from gradient information.
SoK: Privacy-Preserving Machine Learning
Maria Rigaki, Sebastian Garcia, arXiv 2023 | Pages: 38 | Difficulty: 1/5 (Survey)
Abstract: Systematization of knowledge on privacy attacks and defenses in machine learning. Covers membership inference, model inversion, and data extraction. Reviews privacy-preserving techniques including differential privacy, secure computation, and federated learning.
C4. LLM Security & Jailbreaking
Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei et al., NeurIPS 2023 | Pages: 34 | Difficulty: 3/5
Abstract: Analyzes why safety training in LLMs can be circumvented through jailbreaking. Identifies two failure modes: competing objectives during training and mismatched generalization between safety and capabilities. Provides theoretical framework for understanding jailbreak vulnerabilities.
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou et al., arXiv 2023 | Pages: 25 | Difficulty: 3/5
Abstract: Introduces automated methods to generate adversarial suffixes that jailbreak LLMs. Shows these attacks transfer across models including GPT-3.5, GPT-4, and Claude. Demonstrates that aligned models remain vulnerable to optimization-based attacks despite safety training.
Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
Kai Greshake et al., AISec 2023 | Pages: 17 | Difficulty: 2/5
Abstract: Introduces indirect prompt injection where attackers manipulate LLM behavior through external data sources. Demonstrates attacks on real applications including email assistants and document processors. Shows how injected instructions in websites or documents can compromise LLM-integrated systems.
Poisoning Language Models During Instruction Tuning
Alexander Wan et al., ICML 2023 | Pages: 12 | Difficulty: 3/5
Abstract: Demonstrates backdoor attacks during instruction tuning phase of LLMs. Shows that small amounts of poisoned instruction data can inject persistent backdoors. Attacks remain effective even after additional fine-tuning on clean data.
Red Teaming Language Models with Language Models
Ethan Perez et al., EMNLP 2022 | Pages: 23 | Difficulty: 2/5
Abstract: Uses LLMs to automatically generate test cases for red-teaming other LLMs. Discovers diverse failure modes including offensive outputs and privacy leaks. Shows automated red-teaming can scale safety testing beyond manual efforts.
Prompt Injection Attacks and Defenses in LLM-Integrated Applications
Yupei Liu et al., arXiv 2023 | Pages: 14 | Difficulty: 2/5
Abstract: Formalizes prompt injection attacks and proposes taxonomy. Analyzes both direct and indirect injection vectors. Evaluates existing defenses and proposes new mitigation strategies including prompt sandboxing and input validation.
Are Aligned Neural Networks Adversarially Aligned?
Nicholas Carlini et al., NeurIPS 2023 | Pages: 29 | Difficulty: 4/5
Abstract: Studies whether alignment through RLHF provides adversarial robustness. Finds that aligned models remain vulnerable to adversarial attacks and that alignment and robustness are distinct properties. Challenges assumptions about safety of aligned models.
SoK: Exploring the State of the Art and the Future Potential of Artificial Intelligence in Digital Forensic Investigation
Yiming Liu et al., IEEE S&P 2024 | Pages: 52 | Difficulty: 1/5 (Survey)
Abstract: Comprehensive survey on LLM security covering jailbreaking, prompt injection, data extraction, and misuse. Categorizes attacks and defenses. Discusses open challenges in securing LLM-based applications.
C5. Federated Learning Security
How To Backdoor Federated Learning
Eugene Bagdasaryan et al., AISTATS 2020 | Pages: 11 | Difficulty: 3/5
Abstract: Demonstrates that a single malicious participant can inject backdoors into federated learning models. Shows model replacement attacks where the attacker's update overrides honest participants. Proposes defenses based on norm clipping and differential privacy.
Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent
Peva Blanchard et al., NeurIPS 2017 | Pages: 11 | Difficulty: 4/5
Abstract: Addresses Byzantine attacks in distributed learning where participants send arbitrary malicious updates. Proposes Krum aggregation rule that is robust to Byzantine workers. Provides theoretical guarantees on convergence under adversarial conditions.
The Limitations of Backdoor Detection in Federated Learning
Cong Xie et al., NeurIPS 2020 | Pages: 11 | Difficulty: 3/5
Abstract: Shows that existing backdoor detection methods for federated learning can be evaded. Demonstrates adaptive attacks that bypass norm-based and clustering-based defenses. Highlights fundamental challenges in securing federated learning against sophisticated attackers.
Analyzing Federated Learning through an Adversarial Lens
Arjun Nitin Bhagoji et al., ICML 2019 | Pages: 18 | Difficulty: 3/5
Abstract: Comprehensive analysis of attack vectors in federated learning. Studies both untargeted poisoning and targeted backdoor attacks. Analyzes the impact of attacker capabilities and proposes anomaly detection defenses.
DBA: Distributed Backdoor Attacks against Federated Learning
Chulin Xie et al., ICLR 2020 | Pages: 13 | Difficulty: 3/5
Abstract: Introduces distributed backdoor attacks where multiple attackers collaborate to inject backdoors while evading detection. Shows that distributed attacks are harder to detect than single-attacker scenarios. Demonstrates successful attacks under defensive aggregation rules.
Attack of the Tails: Yes, You Really Can Backdoor Federated Learning
Hongyi Wang et al., NeurIPS 2020 | Pages: 12 | Difficulty: 4/5
Abstract: Presents edge-case backdoor attacks that are harder to detect. Shows that backdoors can be designed to activate only on rare inputs while maintaining model utility. Demonstrates attacks that bypass existing defenses including differential privacy.
Advances and Open Problems in Federated Learning
Peter Kairouz et al., Foundations and Trends in Machine Learning 2021 | Pages: 269 | Difficulty: 2/5 (Survey)
Abstract: Comprehensive survey of federated learning including security and privacy challenges. Covers poisoning attacks, privacy attacks, and defenses. Discusses open problems in Byzantine-robust aggregation and privacy-preserving protocols.
C6. AI for Cybersecurity Defense
Deep Learning for Malware Detection
Edward Raff et al., arXiv 2017 | Pages: 10 | Difficulty: 2/5
Abstract: Applies deep learning to static malware detection using raw bytes. Achieves high accuracy on large-scale malware datasets. Discusses practical deployment challenges and adversarial robustness concerns for ML-based malware detection.
KITSUNE: An Ensemble of Autoencoders for Online Network Intrusion Detection
Yisroel Mirsky et al., NDSS 2018 | Pages: 15 | Difficulty: 2/5
Abstract: Proposes unsupervised intrusion detection using ensemble of autoencoders. Detects anomalies in network traffic without labeled data. Demonstrates effectiveness against various attacks including DDoS and reconnaissance.
Adversarial Deep Learning in Intrusion Detection Systems
Luca Demetrio et al., arXiv 2019 | Pages: 12 | Difficulty: 3/5
Abstract: Studies adversarial robustness of deep learning IDS. Shows that malware can evade detection through adversarial perturbations. Evaluates defenses including adversarial training for improving IDS robustness.
Deep Learning Approach for Phishing Detection
DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning
Min Du et al., CCS 2017 | Pages: 12 | Difficulty: 3/5
Abstract: Applies LSTM networks to system log anomaly detection. Models normal execution patterns and detects deviations. Demonstrates effectiveness in detecting system intrusions and failures through log analysis.
Outside the Closed World: On Using Machine Learning for Network Intrusion Detection
Robin Sommer, Vern Paxson, IEEE S&P 2010 | Pages: 15 | Difficulty: 2/5
Abstract: Classic paper discussing fundamental challenges of applying ML to intrusion detection. Highlights the open-world problem, concept drift, and adversarial manipulation. Argues for careful evaluation and realistic assumptions in security ML.
Large Language Models for Cybersecurity: A Systematic Survey
Hansheng Yao et al., arXiv 2024 | Pages: 42 | Difficulty: 1/5 (Survey)
Abstract: Comprehensive survey on using LLMs for security applications including vulnerability detection, malware analysis, and threat intelligence. Discusses prompt engineering for security tasks and limitations of LLMs in security contexts.
C7. AI for Offensive Security
Evading Classifiers by Morphing in the Dark
Qian Hu, Saumya Debray, Black Hat 2017 | Pages: 8 | Difficulty: 2/5
Abstract: Demonstrates practical evasion of ML-based malware detectors. Shows adversarial perturbations that preserve malware functionality while evading detection. Discusses implications for deploying ML in security-critical applications.
Automating Network Exploitation Using Reinforcement Learning
William Glodek, Sandia National Labs 2018 | Pages: 10 | Difficulty: 3/5
Abstract: Uses reinforcement learning for automated network penetration testing. Agents learn to exploit vulnerabilities through trial and error. Demonstrates potential and limitations of RL for offensive security automation.
DeepFuzz: Automatic Generation of Syntax Valid C Programs for Fuzz Testing
Xiao Liu et al., AAAI 2019 | Pages: 8 | Difficulty: 3/5
Abstract: Uses deep learning to generate valid C programs for fuzzing compilers. Learns syntax rules from existing code. Discovers previously unknown compiler bugs through automated test generation.
Adversarial Examples for Evaluating Reading Comprehension Systems
Robin Jia, Percy Liang, EMNLP 2017 | Pages: 11 | Difficulty: 2/5
Abstract: Creates adversarial examples for NLP systems by adding distracting sentences. Shows that reading comprehension models are brittle to such perturbations. Demonstrates importance of robust evaluation for NLP security.
Generating Natural Language Adversarial Examples
Moustafa Alzantot et al., EMNLP 2018 | Pages: 12 | Difficulty: 3/5
Abstract: Uses genetic algorithms to generate adversarial examples for text classification. Maintains semantic similarity while fooling models. Demonstrates vulnerabilities in sentiment analysis and textual entailment systems.
LLM-Fuzzer: Fuzzing Large Language Models with Chain-of-Thought Prompts
Jiahao Yu et al., arXiv 2023 | Pages: 16 | Difficulty: 2/5
Abstract: Automated fuzzing framework for discovering LLM vulnerabilities. Uses mutation-based approach to generate test cases. Finds jailbreak prompts and alignment failures.
C8. Robustness & Certified Defenses
Towards Deep Learning Models Resistant to Adversarial Attacks
Aleksander Madry et al., ICLR 2018 | Pages: 28 | Difficulty: 3/5
Abstract: Introduces PGD adversarial training as a robust defense. Formulates adversarial training as a min-max optimization problem. Shows significantly improved robustness against strong attacks.
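The inner maximization of the min-max formulation is PGD: iterated gradient-sign steps projected back into an L-infinity ball around the input. A minimal sketch of that inner loop on a toy linear loss (the model, step sizes, and radius are illustrative assumptions):

```python
import numpy as np

def pgd_attack(x, loss_grad, eps, alpha, steps):
    # PGD inner maximization: repeat an FGSM-style step of size
    # alpha, then project back into the L-inf ball of radius eps.
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(loss_grad(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)   # projection
    return x_adv

# Toy loss L(x) = (w.x)^2 with gradient dL/dx = 2 * (w.x) * w.
w = np.array([1.0, -2.0])
loss_grad = lambda x: 2.0 * (w @ x) * w
x = np.array([0.5, 0.5])
x_adv = pgd_attack(x, loss_grad, eps=0.3, alpha=0.1, steps=10)
print(np.max(np.abs(x_adv - x)) <= 0.3 + 1e-9)   # stays in the ball
print((w @ x_adv) ** 2 > (w @ x) ** 2)           # loss increased
```

Adversarial training then minimizes the loss at `x_adv` instead of `x`, which is the min-max game the paper formalizes.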
Certified Adversarial Robustness via Randomized Smoothing
Jeremy Cohen et al., ICML 2019 | Pages: 17 | Difficulty: 4/5
Abstract: Provides provable robustness certificates using randomized smoothing. Transforms any classifier into certifiably robust version. Achieves state-of-the-art certified accuracy.
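The smoothed classifier is g(x) = argmax_c P(f(x + d) = c) with d ~ N(0, sigma^2 I), estimated in practice by Monte Carlo voting. A toy sketch of the prediction side (the base classifier, sigma, and sample count are illustrative; the paper's certification procedure additionally bounds the top-class probability):

```python
import numpy as np

def smoothed_predict(base_classifier, x, sigma, n, rng):
    # Randomized smoothing, prediction side: classify n Gaussian-
    # perturbed copies of x and return the majority-vote class.
    noise = rng.normal(0.0, sigma, size=(n,) + x.shape)
    votes = np.array([base_classifier(x + d) for d in noise])
    counts = np.bincount(votes, minlength=2)
    return int(counts.argmax()), counts

# Toy base classifier: class 1 iff the first coordinate is positive.
def f(x):
    return int(x[0] > 0.0)

rng = np.random.default_rng(0)
label, counts = smoothed_predict(f, np.array([2.0, 0.0]), sigma=0.5, n=1000, rng=rng)
print(label)   # 1: x[0] = 2.0 sits four sigmas from the decision boundary
```

The certificate in the paper comes from this vote: the larger the margin of the winning class probability, the larger the L2 radius within which g is provably constant.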
Provable Defenses against Adversarial Examples via the Convex Outer Adversarial Polytope
Eric Wong, Zico Kolter, ICML 2018 | Pages: 11 | Difficulty: 5/5
Abstract: Uses convex optimization to train provably robust networks. Computes exact worst-case adversarial loss during training. Limited to small networks but provides strong guarantees.
Obfuscated Gradients Give a False Sense of Security
Anish Athalye et al., ICML 2018 | Pages: 19 | Difficulty: 3/5
Abstract: Exposes gradient obfuscation as a common failure mode in adversarial defenses. Shows many published defenses can be broken with adaptive attacks. Introduces BPDA for attacking defenses.
Reliable Evaluation of Adversarial Robustness with an Ensemble of Diverse Parameter-free Attacks
Francesco Croce, Matthias Hein, ICML 2020 | Pages: 32 | Difficulty: 3/5
Abstract: Introduces AutoAttack, an ensemble of parameter-free attacks for robust evaluation. Reveals overestimated robustness in many defenses. Now standard evaluation benchmark.
Benchmarking Neural Network Robustness to Common Corruptions and Perturbations
Dan Hendrycks, Thomas Dietterich, ICLR 2019 | Pages: 17 | Difficulty: 2/5
Abstract: Introduces ImageNet-C for evaluating robustness to natural corruptions. Shows models often fail on common corruptions despite adversarial training.
A Survey on Robustness of Neural Networks
Jiefeng Huang et al., arXiv 2023 | Pages: 52 | Difficulty: 1/5 (Survey)
Abstract: Comprehensive survey covering adversarial robustness, certified defenses, and evaluation methods. Covers attack types, defense strategies, and theoretical foundations.
C9. Interpretability & Verification for Security
Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks
Guy Katz et al., CAV 2017 | Pages: 20 | Difficulty: 5/5
Abstract: Introduces formal verification of neural networks using SMT solving. Can prove properties about network behavior. Foundational work in neural network verification.
Interpretable Machine Learning for Security
Archana Kuppa, Nhien-An Le-Khac, arXiv 2020 | Pages: 24 | Difficulty: 2/5
Abstract: Survey on interpretability methods for security applications. Discusses LIME, SHAP, attention mechanisms. Argues for interpretability in security-critical ML.
Activation Atlas: Exploring Neural Network Activations
Shan Carter et al., Distill 2019 | Pages: 12 | Difficulty: 2/5
Abstract: Visualizes what neurons in neural networks respond to. Uses feature visualization to understand internal representations. Helps identify adversarial vulnerabilities.
Quantifying Uncertainties in Neural Networks for Security Applications
Lewis Smith, Yarin Gal, arXiv 2018 | Pages: 10 | Difficulty: 3/5
Abstract: Uses Bayesian neural networks to quantify uncertainty. Shows uncertainty can detect adversarial examples and out-of-distribution data.
DeepXplore: Automated Whitebox Testing of Deep Learning Systems
Kexin Pei et al., SOSP 2017 | Pages: 18 | Difficulty: 3/5
Abstract: Automated testing framework using neuron coverage as a metric. Generates inputs that maximize differential behavior across models. Finds thousands of erroneous behaviors.
Attention is Not Explanation
Sarthak Jain, Byron Wallace, NAACL 2019 | Pages: 11 | Difficulty: 2/5
Abstract: Challenges the use of attention weights as explanations. Shows attention can be manipulated without changing predictions. Important for security relying on interpretability.
Explainability for AI Security: A Survey
Fatima Alsubaei et al., arXiv 2022 | Pages: 38 | Difficulty: 1/5 (Survey)
Abstract: Comprehensive survey on explainability in AI security. Covers interpretability methods, their application to security, and limitations.
C10. AI Supply Chain & Model Security
Protecting Intellectual Property of Deep Neural Networks with Watermarking
Yusuke Uchida et al., AsiaCCS 2017 | Pages: 13 | Difficulty: 3/5
Abstract: Embeds watermarks in neural networks to prove ownership. Watermarks survive fine-tuning and model extraction attempts.
Model Stealing Attacks Against Inductive Graph Neural Networks
Asim Waheed Duddu et al., IEEE S&P 2022 | Pages: 16 | Difficulty: 3/5
Abstract: Demonstrates model extraction attacks on graph neural networks. Shows GNNs are particularly vulnerable to stealing.
Weight Poisoning Attacks on Pre-trained Models
Backdoor Attacks on Self-Supervised Learning
Aniruddha Saha et al., CVPR 2022 | Pages: 10 | Difficulty: 3/5
Abstract: Demonstrates backdoor attacks during self-supervised pre-training. Backdoors transfer to downstream tasks after fine-tuning.
Proof-of-Learning: Definitions and Practice
Hengrui Jia et al., IEEE S&P 2021 | Pages: 17 | Difficulty: 4/5
Abstract: Introduces proof-of-learning to verify models were trained as claimed. Prevents model theft and verifies computational work.
SoK: Hate, Harassment, and the Changing Landscape of Social Media
Shagun Jhaver et al., IEEE S&P 2021 | Pages: 47 | Difficulty: 1/5 (Survey)
Abstract: Systematization of knowledge on using AI for content moderation. Discusses ML models for detecting hate speech and harassment.
class/gradsec2026.1773592124.txt.gz · Last modified: 2026/03/15 23:28 by mhshin