{"title": "Fooling Neural Network Interpretations via Adversarial Model Manipulation", "book": "Advances in Neural Information Processing Systems", "page_first": 2925, "page_last": 2936, "abstract": "We ask whether the neural network interpretation methods can be fooled via adversarial model manipulation, which is defined as a model fine-tuning step that aims to radically alter the explanations without hurting the accuracy of the original models, e.g., VGG19, ResNet50, and DenseNet121. By incorporating the interpretation results directly in the penalty term of the objective function for fine-tuning, we show that the state-of-the-art saliency map based interpreters, e.g., LRP, Grad-CAM, and SimpleGrad, can be easily fooled with our model manipulation. We propose two types of fooling, Passive and Active, and demonstrate such foolings generalize well to the entire validation set as well as transfer to other interpretation methods. Our results are validated by both visually showing the fooled explanations and reporting quantitative metrics that measure the deviations from the original explanations. 
We claim that the stability of neural network interpretation methods with respect to our adversarial model manipulation is an important criterion to check for developing robust and reliable neural network interpretation methods.", "full_text": "Fooling Neural Network Interpretations via Adversarial Model Manipulation

Juyeon Heo1*, Sunghwan Joo1*, and Taesup Moon1,2

1Department of Electrical and Computer Engineering, 2Department of Artificial Intelligence
Sungkyunkwan University, Suwon, Korea, 16419
heojuyeon12@gmail.com, {shjoo840, tsmoon}@skku.edu

Abstract

We ask whether neural network interpretation methods can be fooled via adversarial model manipulation, which is defined as a model fine-tuning step that aims to radically alter the explanations without hurting the accuracy of the original models, e.g., VGG19, ResNet50, and DenseNet121. By incorporating the interpretation results directly in the penalty term of the objective function for fine-tuning, we show that the state-of-the-art saliency map based interpreters, e.g., LRP, Grad-CAM, and SimpleGrad, can be easily fooled with our model manipulation. We propose two types of fooling, Passive and Active, and demonstrate that such foolings generalize well to the entire validation set as well as transfer to other interpretation methods. Our results are validated by both visually showing the fooled explanations and reporting quantitative metrics that measure the deviations from the original explanations. We claim that the stability of neural network interpretation methods with respect to our adversarial model manipulation is an important criterion to check for developing robust and reliable neural network interpretation methods.
The source code is available at https://github.com/rmrisforbidden/FoolingNeuralNetwork-Interpretations.

1 Introduction

As deep neural networks have made a huge impact on real-world applications with predictive tasks, much emphasis has been placed on interpretation methods that can explain the grounds for the predictions of complex neural network models. Furthermore, accurate explanations can further improve a model by helping researchers debug it or by revealing unintended bias or effects in the model [1, 2]. In this regard, research on interpretability frameworks has become very active recently, for example, [3, 4, 5, 6, 7, 8, 9], to name a few. In parallel with these flourishing results, research on sanity checking and identifying potential problems of the proposed interpretation methods has also been actively pursued. For example, some recent research [10, 11, 12, 13, 14] showed that many popular interpretation methods are not stable with respect to perturbations or adversarial attacks on the input data.
In this paper, we also discover an instability of neural network interpretation methods, but from a fresh perspective. Namely, we ask whether the interpretation methods are stable with respect to adversarial model manipulation, which we define as a model fine-tuning step that aims to dramatically alter the interpretation results without significantly hurting the accuracy of the original model. As a result, we show that the state-of-the-art interpretation methods are vulnerable to such manipulations. Note that this notion of stability is clearly different from that considered in the above-mentioned works, which deal with stability with respect to perturbations or attacks on the input to the model. To the best of our knowledge, research on this type of stability has not been explored before.
We believe that such stability will become an increasingly important criterion to check, since the incentives to fool the interpretation methods via model manipulation will only increase with the widespread adoption of complex neural network models.

*Equal contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: (a) Motivation for model manipulation. The result of our fooling on the 'Adult income' classification data [15]. We trained a classifier with 8 convolution layers, w_o, and the LRP result (blue) shows that it assigns high importance to sensitive features like 'Race' and 'Sex'. We can then manipulate the model with Location fooling (see Section 3), which zero-masks the two features, and obtain w*_fool, which has essentially the same accuracy as w_o but a new interpretation that disguises the bias (orange). (b) Examples of different kinds of foolings. The interpretation results for the image [16] on the left with prediction "Indian Elephant". The first column is for the original pre-trained VGG19 model, the second to fourth columns are for the six manipulated models with Passive foolings (highlighting uninformative pixels of the image), and the fifth column is for the two manipulated models with Active fooling (highlighting a completely different object, the firetruck). Each row corresponds to the interpretation method used for fooling. All manipulated models show only about a 1% Top-5 accuracy difference on the entire ImageNet validation set.

For a more concrete motivation on this topic, consider the following example. Suppose a neural network model is to be deployed in an income prediction system. The regulators would mainly check two core criteria: predictive accuracy and fairness.
While the first can be easily verified with a holdout validation set, the second is trickier, since one needs to check whether the model contains any unfair bias, e.g., using race as an important factor for the prediction. An interpretation method would obviously become an important tool for checking this second criterion. However, suppose a lazy developer finds out that his model contains some bias and, rather than actually fixing the model to remove the bias, decides to manipulate the model such that the interpretation is fooled and hides the bias, without any significant change in accuracy. (See Figure 1(a) for more details.) When such a manipulated model is submitted to the regulators for scrutiny, there is no way to detect the bias of the model, since the original interpretation is not available unless we have access to the original model or the training data, which the system owner typically does not disclose.
From the above example, we observe that explanations fooled via adversarial model manipulation can cause serious social problems in AI applications. The ultimate goal of this paper, hence, is to call for more active research on improving the stability and robustness of interpretation methods with respect to the proposed adversarial model manipulation. The following summarizes the main contributions of this paper:

• We are the first to consider the notion of stability of neural network interpretation methods with respect to the proposed adversarial model manipulation.
• We demonstrate that the representative saliency map based interpreters, i.e., LRP [6], Grad-CAM [7], and SimpleGradient [17], are vulnerable to our model manipulation, with accuracy drops of only around 2% and 1% for Top-1 and Top-5 accuracy on the ImageNet validation set, respectively.
Figure 1(b) shows a concrete example of our fooling.
• We show that the fooled explanations generalize to the entire validation set, indicating that the interpretations are truly fooled, not just for some specific inputs, in contrast to [11, 13, 14].
• We demonstrate that transferability exists in our fooling; e.g., if we manipulate the model to fool LRP, then the interpretations of Grad-CAM and SimpleGradient also get fooled, etc.

2 Related Work

Interpretation methods Various interpretability frameworks have been proposed, and they can be broadly categorized into two groups: black-box methods [18, 19, 5, 4, 20] and gradient/saliency map based methods [6, 7, 21, 22, 23]. The latter typically have full access to the model architecture and parameters; they tend to be less computationally intensive and simpler to use, particularly for complex neural network models. In this paper, we focus on the gradient/saliency map based methods and check whether three state-of-the-art methods can be fooled with adversarial model manipulation.
Sanity checking neural networks and their interpreters Together with the great success of deep neural networks, much effort has been made on sanity checking both the neural network models and their interpretations. These works mainly examine the stability [24] of the model prediction, or of the interpretation of the prediction, by perturbing either the input data or the model, inspired by adversarial attacks [25, 26, 27]. For example, [10] showed that several interpretation results are significantly impacted by a simple constant shift in the input data. [12] recently developed a more robust method, dubbed the self-explaining neural network, by taking stability (with respect to input perturbation) into account during the model training procedure.
[11, 14] have adopted the framework of adversarial attacks for fooling interpretation methods with a slight input perturbation. [13] tries to find perturbed data whose interpretations are similar to those of benign data, so that the perturbation is hard to detect through interpretations. A different angle on checking the stability of interpretation methods has also been given by [28], which developed simple tests for checking the stability (or variability) of interpretation methods with respect to model parameter or training label randomization. They showed that some of the popular saliency map based methods are too stable with respect to model or data randomization, suggesting that their interpretations are independent of the model or the data.
Relation to our work Our work shares some similarities with the above research in terms of sanity checking neural network interpretation methods, but possesses several unique aspects. Firstly, unlike [11, 13, 14], which attack each given input image, we change the model parameters by fine-tuning a pre-trained model and do not perturb the input data. Due to this difference, our adversarial model manipulation makes the fooling of the interpretations generalize to the entire validation data. Secondly, analogous to non-targeted and targeted adversarial attacks, we implement several kinds of foolings, dubbed Passive and Active foolings. Distinct from [11, 13], we generate not only uninformative interpretations, but also totally wrong ones that point to an unrelated object within the image. Thirdly, like [12], we also take the explanation into account for model training, but while they define a special neural network structure, we use ordinary back-propagation to update the parameters of the given pre-trained model.
Finally, we note that [28] also measures the stability of interpretation methods, but the difference is that our adversarial manipulation maintains the accuracy of the model, while [28] only focuses on the variability of the explanations. We find that an interpretation method that passed the sanity checks in [28], e.g., Grad-CAM, can also be fooled under our setting, which calls for a more solid standard for checking the reliability of interpreters.

3 Adversarial Model Manipulation
3.1 Preliminaries and notations

We briefly review the saliency map based interpretation methods we consider. All of them generate a heatmap showing the relevance of each data point for the prediction.
Layer-wise Relevance Propagation (LRP) [6] is a principled method that applies relevance propagation, which operates similarly to back-propagation, and generates a heatmap that shows the relevance value of each pixel. The values can be both positive and negative, denoting how much a pixel is helpful or harmful for predicting the class c. In subsequent work, LRP-Composite [29], which applies the basic LRP-ε rule to the fully-connected layers and LRP-αβ to the convolutional layers, has been proposed. We apply LRP-Composite in all of our experiments.
Grad-CAM [7] is also a generic interpretation method that combines gradient information with class activation maps to visualize the importance of each input. It is mainly used for CNN-based models in vision applications. Typically, the importance values of Grad-CAM are computed at the last convolution layer; hence, the resolution of the visualization is much coarser than that of LRP.
SimpleGrad (SimpleG) [17] visualizes the gradients of the prediction score with respect to the input as a heatmap.
It indicates how sensitive the prediction score is to small changes of each input pixel, but in [6] it is shown to generate noisier saliency maps than LRP.
Notations We denote D = {(x_i, y_i)}_{i=1}^{n} as a supervised training set, in which x_i ∈ R^d is the input data and y_i ∈ {1, ..., K} is the target classification label. Also, denote w as the parameters of a neural network. A heatmap generated by an interpretation method I for w and class c is denoted by

\[ h^{\mathcal{I}}_c(w) = \mathcal{I}(x, c; w), \tag{1} \]

in which h^I_c(w) ∈ R^{d_I}. If d_I = d, the j-th value of the heatmap, h^I_{c,j}(w), represents the importance score of the j-th input x_j for the final prediction score for class c.

3.2 Objective function and penalty terms

Our proposed adversarial model manipulation is realized by fine-tuning a pre-trained model with an objective function that combines the ordinary classification loss with a penalty term that involves the interpretation results.
To that end, our overall objective function for a neural network w to minimize, for training data D with the interpretation method I, is defined to be

\[ \mathcal{L}(\mathcal{D}, \mathcal{D}_{fool}, \mathcal{I}; w, w_0) = \mathcal{L}_C(\mathcal{D}; w) + \lambda \mathcal{L}^{\mathcal{I}}_F(\mathcal{D}_{fool}; w, w_0), \tag{2} \]

in which L_C(·) is the ordinary cross-entropy classification loss on the training data, w_0 is the parameter of the original pre-trained model, L^I_F(·) is the penalty term on D_fool, a potentially smaller set than D that is the dataset used in the penalty term, and λ is a trade-off parameter. Depending on how we define L^I_F(·), we categorize two types of fooling in the following subsections.

3.2.1 Passive fooling

We define Passive fooling as making the interpretation methods generate uninformative explanations. Three such schemes are defined with different L^I_F(·)'s: Location, Top-k, and Center-mass foolings.
Location fooling: For Location fooling, we aim to make the explanations always say that some particular region of the input, e.g., the boundary or a corner of the image, is important regardless of the input. We implement this kind of fooling by defining the penalty term in (2) to be

\[ \mathcal{L}^{\mathcal{I}}_F(\mathcal{D}_{fool}; w, w_0) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{d_{\mathcal{I}}}\, \| h^{\mathcal{I}}_{y_i}(w) - m \|_2^2, \tag{3} \]

in which D_fool = D, ||·||_2 is the L2 norm, and m ∈ R^{d_I} is a pre-defined mask vector that designates an arbitrary region of the input. Namely, we set m_j = 1 for the locations where we want the interpretation method to output high importance, and m_j = 0 for the locations where we do not want high importance values.
Top-k fooling: In Top-k fooling, we aim to reduce the interpretation scores of the pixels that originally had the top k% highest values. The penalty term then becomes

\[ \mathcal{L}^{\mathcal{I}}_F(\mathcal{D}_{fool}; w, w_0) = \frac{1}{n}\sum_{i=1}^{n} \sum_{j \in P_{i,k}(w_0)} \big| h^{\mathcal{I}}_{y_i,j}(w) \big|, \tag{4} \]

in which D_fool = D, and P_{i,k}(w_0) is the set of pixels that had the top k% highest heatmap values for the original model w_0, for the i-th data point.
Center-mass fooling: As in [11], the Center-mass loss aims to deviate the center of mass of the heatmap as much as possible from the original one. The center of mass of a one-dimensional heatmap can be denoted as

\[ \mathcal{C}(h^{\mathcal{I}}_{y_i}(w)) = \Big( \sum_{j=1}^{d_{\mathcal{I}}} j \cdot h^{\mathcal{I}}_{y_i,j}(w) \Big) \Big/ \sum_{j=1}^{d_{\mathcal{I}}} h^{\mathcal{I}}_{y_i,j}(w), \]

in which the index j is treated as a location vector, and it can be easily extended to higher dimensions as well. Then, with D_fool = D and ||·||_1 being the L1 norm, the penalty term for the Center-mass fooling is defined as

\[ \mathcal{L}^{\mathcal{I}}_F(\mathcal{D}_{fool}; w, w_0) = -\frac{1}{n}\sum_{i=1}^{n} \big\| \mathcal{C}(h^{\mathcal{I}}_{y_i}(w)) - \mathcal{C}(h^{\mathcal{I}}_{y_i}(w_0)) \big\|_1. \tag{5} \]

3.2.2 Active fooling

Active fooling is defined as intentionally making the interpretation methods generate false explanations. Although the notion of a false explanation could be broad, we focus on swapping the explanations between two target classes. Namely, let c_1 and c_2 denote the two classes of interest and define D_fool as a dataset (possibly without target labels) that specifically contains both class objects in each image. Then, the penalty term L^I_F(D_fool; w, w_0) equals

\[ \mathcal{L}^{\mathcal{I}}_F(\mathcal{D}_{fool}; w, w_0) = \frac{1}{2 n_{fool}}\sum_{i=1}^{n_{fool}} \frac{1}{d_{\mathcal{I}}} \Big( \| h^{\mathcal{I}}_{c_1}(w) - h^{\mathcal{I}}_{c_2}(w_0) \|_2^2 + \| h^{\mathcal{I}}_{c_1}(w_0) - h^{\mathcal{I}}_{c_2}(w) \|_2^2 \Big), \]

in which the first term makes the explanation for c_1 alter to that of c_2, and the second term does the opposite.
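As a concrete illustration, the per-example penalty terms above reduce to a few lines of array arithmetic. The following is a minimal NumPy sketch, not the authors' implementation: the function and variable names (h, h0, mask, topk_idx, etc.) are ours, each heatmap is assumed flattened to 1-D, and in practice the heatmaps would be produced by the interpreter I and back-propagated through during fine-tuning.

```python
import numpy as np

def location_penalty(h, mask):
    # Eq. (3), per example: (1/d_I) * || h - m ||_2^2
    return np.mean((h - mask) ** 2)

def topk_penalty(h, topk_idx):
    # Eq. (4), per example: sum of |h_j| over the pixels that were
    # in the top-k% of the *original* model's heatmap.
    return np.sum(np.abs(h[topk_idx]))

def center_of_mass(h):
    # C(h) = (sum_j j * h_j) / (sum_j h_j), using 0-based indices here.
    j = np.arange(len(h))
    return np.sum(j * h) / np.sum(h)

def center_mass_penalty(h, h0):
    # Eq. (5), per example: the negative L1 distance between the fooled
    # and original centers of mass, so minimizing pushes them apart.
    return -np.abs(center_of_mass(h) - center_of_mass(h0))

def active_penalty(h_c1, h_c2, h0_c1, h0_c2):
    # Active fooling, per example: swap the explanations of c1 and c2
    # by matching each fooled heatmap to the *other* class's original.
    d = len(h_c1)
    return 0.5 * (np.sum((h_c1 - h0_c2) ** 2) / d
                  + np.sum((h0_c1 - h_c2) ** 2) / d)
```

Averaging such per-example values over D_fool and adding λ times the result to the cross-entropy loss recovers the overall objective (2).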
A subtle point here is that, unlike in the Passive foolings, we use two different datasets for computing L_C(·) and L^I_F(·), respectively, to focus the training on c_1 and c_2 for fooling. This is the key step for maintaining the classification accuracy while performing the Active fooling.

4 Experimental Results
4.1 Data and implementation details

For all our fooling methods, we used the ImageNet training set [30] as our D and took three pre-trained models, VGG19 [31], ResNet50 [32], and DenseNet121 [33], for carrying out the foolings. For the Active fooling, we additionally constructed D_fool with images that contain two classes, {c_1 = "African Elephant", c_2 = "Firetruck"}, by concatenating two images, one from each class, in a 2 × 2 block. The locations of the images for each class were not fixed, so that the fooling schemes could not simply memorize the locations of the explanations for each class. An example of such images is shown in the top-left corner of Figure 3. More implementation details are in the Supplementary Material.
Remark: As with Grad-CAM, we also visualized the heatmaps of SimpleG and LRP at a target layer, namely, the last convolution layer for VGG19 and the last block for ResNet50 and DenseNet121. We put the subscript T on SimpleG and LRP to denote such visualizations, and LRP without the subscript denotes the visualization at the input level. We also found that manipulating with LRP_T was easier than with LRP.
Moreover, we excluded SimpleG_T for manipulation as it gave too noisy heatmaps; thus, we only used it for visualizations to check whether the transfer of fooling occurs.

4.2 Fooling Success Rate (FSR): A quantitative metric

In this section, we suggest a quantitative metric for each fooling method, the Fooling Success Rate (FSR), which measures how much an interpretation method I is fooled by the model manipulation. To evaluate the FSR for each fooling, we use a "test loss" value associated with each fooling, which directly shows the gap between the current and target interpretations of each loss. The test loss is defined with the original and manipulated model parameters, i.e., w_0 and w*_fool, respectively, and the interpreter I on each data point in the validation set D_val; we denote the test loss for the i-th data point (x_i, y_i) ∈ D_val as t_i(w*_fool, w_0, I).
For the Location and Top-k foolings, t_i(w*_fool, w_0, I) is computed by evaluating (3) and (4), respectively, for a single data point (x_i, y_i) and (w*_fool, w_0). For Center-mass fooling, we evaluate (5), again for a single data point (x_i, y_i) and (w*_fool, w_0), and normalize it with the length of the diagonal of the image to define t_i(w*_fool, w_0, I). For Active fooling, we first define s_i(c, c') = r_s(h^I_c(w*_fool), h^I_{c'}(w_0)) as the Spearman rank correlation [34] between the two heatmaps for x_i, generated with I. Intuitively, it measures how close the explanation for class c from the fooled model is to the explanation for class c' from the original model. Then, we define t_i(w*_fool, w_0, I) = s_i(c_1, c_2) − s_i(c_1, c_1) as the test loss for fooling the explanation of c_1, and t_i(w*_fool, w_0, I) = s_i(c_2, c_1) − s_i(c_2, c_2) for c_2. With the above test losses, the FSR for a fooling method f and an interpreter I is defined as

\[ \mathrm{FSR}^{\mathcal{I}}_f = \frac{1}{|\mathcal{D}_{val}|} \sum_{i \in \mathcal{D}_{val}} \mathbb{1}\{ t_i(w^*_{fool}, w_0, \mathcal{I}) \in R_f \}, \tag{6} \]

in which 1{·} is an indicator function and R_f is a pre-defined interval for each fooling method. Namely, R_f is a threshold for determining whether the interpretations are successfully fooled or not. We empirically defined R_f as [0, 0.2], [0, 0.3], [0.1, 1], and [0.5, 2] for Location, Top-k, Center-mass, and Active fooling, respectively. (More details on deciding the thresholds are in the Supplementary Material.) In short, the higher the FSR metric is, the more successful f is for the interpreter I.

4.3 Passive and Active fooling results

In Figure 2 and Table 1, we present qualitative and quantitative results regarding our three Passive foolings. The following are our observations. For the Location fooling, we clearly see that the explanations are altered to stress the uninformative frames of each image even if the object is located in the center; compare (1, 5) and (3, 5) in Figure 2, for example. We also see that fooling LRP_T successfully fools LRP as well, causing the true objects to have low or negative relevance values.

Figure 2: Interpretations of the baseline and the passively fooled models on a 'Streetcar' image from the ImageNet validation set (shown in the top-left corner). The topmost row shows the baseline interpretations for the three original pre-trained models, VGG19, ResNet50 and DenseNet121, by Grad-CAM, LRP_T, LRP and SimpleG_T, given the true class. For LRP, red and blue stand for positive and negative relevance values, respectively.
Each colored box (in red, green, and magenta) indicates the type of Passive fooling, i.e., Location, Top-k, and Center-mass fooling, respectively. Each row in each colored box corresponds to the interpreter, LRP_T or Grad-CAM, that was used as I in the objective function (2) to manipulate each model. Note how the original explanation results are altered dramatically when fooled with each interpreter and fooling type. The transferability among methods should only be compared within each model architecture and fooling type.

For the Top-k fooling, we observe that the most highlighted top k% pixels are significantly altered after the fooling, as seen from the big difference between the original explanations and those in the green colored box in Figure 2. For the Center-mass fooling, the center of the heatmaps is moved to a meaningless part of the images, yielding completely different interpretations from the original. Even when the interpretations are not close to the target interpretations of each loss, all Passive foolings can make users misunderstand the model, because the most critical evidence is hidden and only less important or unimportant parts are highlighted. To show that our results are not cherry-picked, we also evaluated the FSR on 10,000 images randomly selected from the ImageNet validation dataset, as shown in Table 1. We observe that all FSRs of the fooling methods are higher than 50% for the matched cases (bold underlined), except for the Location fooling with LRP_T for DenseNet121.
Next, for the Active fooling, from the qualitative results in Figure 3 and the quantitative results in Table 2, we find that the explanations for c_1 and c_2 are swapped clearly in VGG19 and nearly in ResNet50, but not in DenseNet121, suggesting a relationship between the model complexity and the degree of Active fooling.
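The FSR metric of Section 4.2 reduces to simple array operations once the per-example test losses are available. The following is a minimal NumPy sketch (function and variable names are ours, not the authors'), including a tie-free Spearman rank correlation for the Active-fooling test loss:

```python
import numpy as np

def spearman(a, b):
    # Spearman rank correlation for tie-free 1-D arrays:
    # Pearson correlation of the two rank vectors.
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

def active_test_loss(h_fool_c1, h_orig_c1, h_orig_c2):
    # t_i = s_i(c1, c2) - s_i(c1, c1): how much closer the fooled
    # explanation of c1 is to the *original* explanation of c2.
    return spearman(h_fool_c1, h_orig_c2) - spearman(h_fool_c1, h_orig_c1)

def fsr(test_losses, interval):
    # Eq. (6): fraction of validation points whose test loss
    # falls inside the fooling-specific interval R_f.
    lo, hi = interval
    t = np.asarray(test_losses)
    return np.mean((t >= lo) & (t <= hi))
```

For instance, with the Active-fooling interval R_f = [0.5, 2], a test loss near 2 means the fooled heatmap is perfectly rank-correlated with the other class's original heatmap and anti-correlated with its own original heatmap.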
When the interpretations are clearly swapped, as in (1, 3) and (2, 3) of Figure 3 (here, (a, b) denotes the image at the a-th row and b-th column of the figure), the interpretations for c_1 (the true class) turn out to have negative values on the correct object, while having positive values on the objects of c_2. Even when the interpretations are not completely swapped, they tend to spread out over both the c_1 and c_2 objects, which makes them less informative; compare the (1, 8) and (2, 8) images in Figure 3, for example. In Table 2, which shows FSRs evaluated on the 200 holdout set images, we observe that the Active fooling is selectively successful for VGG19 and ResNet50.

FSR (%)              | VGG19                 | ResNet50              | DenseNet121
Fooling / fooled with | G-CAM  LRP_T  SimpleG_T | G-CAM  LRP_T  SimpleG_T | G-CAM  LRP_T  SimpleG_T
Location / LRP_T      |  0.8   87.5   66.8    | 42.1   83.2   81.1    | 35.7   26.6   88.2
Location / G-CAM      | 89.2    5.8    0.0    | 97.3    0.8    0.0    | 81.8    0.4   92.1
Top-k / LRP_T         | 31.5   96.3    9.8    | 46.3   61.5   19.3    | 62.3   53.8   66.7
Top-k / G-CAM         | 96.0   30.9    0.1    | 99.9    5.3    0.3    | 98.3    1.9    3.7
Center-mass / LRP_T   | 49.9   99.9   15.4    | 66.4   63.3   50.3    | 66.8   51.9   28.8
Center-mass / G-CAM   | 81.0   66.3    0.1    | 67.3    0.8    0.2    | 72.7   21.8   29.2

Table 1: Fooling Success Rates (FSR) for the Passive fooled models. The structure of the table is the same as in Figure 2. 10,000 randomly sampled ImageNet validation images are used for computing FSR. Underline stands for the FSRs of the matched interpreters that were used for fooling, and bold stands for FSRs over 50%. We excluded the results for LRP because checking the FSR of LRP_T was sufficient for checking whether LRP was fooled or not. The transferability among methods should only be compared within the model and fooling type.
For the case of DenseNet121, however, the Active fooling seems to be hard, as the FSR values are almost 0. This discrepancy for DenseNet121 may also be partly due to the conservative threshold value we used for computing FSR, since the visualization in Figure 3 shows that some meaningful fooling happens for DenseNet121 as well.

Figure 3: Explanations of the original and actively fooled models for c_1 = "African Elephant" on synthetic test images, which contain both an elephant and a firetruck (c_2) in different parts of the image. The top row shows the baseline explanations for the three model architectures and interpretation methods. The middle and bottom rows show the explanations of the models actively fooled using LRP_T and Grad-CAM, respectively. We can see that the explanations of the fooled models for c_1 mostly tend to highlight c_2. Note that transferability exists here as well.

FSR (%)               | VGG19            | ResNet50         | DenseNet121
Fooling / fooled with  | G-CAM  LRP_T  LRP | G-CAM  LRP_T  LRP | G-CAM  LRP_T  LRP
LRP_T / FSR(c_1)       | 96.5   94.5   0.0 | 90.5   97.0   0.0 |  0.0    0.0   0.0
LRP_T / FSR(c_2)       | 96.5   95.0   1.0 | 75.0   34.0   0.0 |  0.0   10.7   0.0
G-CAM / FSR(c_1)       |  1.0    0.0   0.5 | 76.0   96.0   0.0 |  4.0    0.0   0.0
G-CAM / FSR(c_2)       | 70.0    1.0   0.0 | 87.5   31.5   0.0 |  0.0   24.3   0.0

Table 2: Fooling Success Rates (FSR) for the Active fooled models. 200 synthetic images are used for computing FSR. Underline stands for the FSRs of the matched interpreters that were used for fooling, and bold stands for FSRs over 50%. The transferability among methods should only be compared within the model and fooling type.

The significance of the above results lies in the fact that the classification accuracies of all manipulated models are around the same as those of the original models, shown in Table 3!
For the Active fooling,\nin particular, we also checked that the slight decrease in Top-5 accuracy is not just concentrated on\nthe data points for the c1 and c2 classes, but is spread out to the whole 1000 classes. Such analysis is\n\n7\n\nVGG 19ResNet 50Densenet121G-CAMLRPTLRPG-CAMLRPTLRPG-CAMLRPTLRPBaselineLRPTG-CAM\fin the Supplementary Material. Note our model manipulation affects the entire validation set without\nany access to it, unlike the common adversarial attack which has access to each input data point [11].\n\nModel\n\nAccuracy (%)\n\nBaseline (Pretrained)\nLocation\n\nLRPT\nG-CAM\nLRPT\nG-CAM\nLRPT\nG-CAM\nLRPT\nG-CAM\n\nTop-k\n\nCenter\nmass\n\nActive\n\nVGG19\n\nResnet50\n\nTop1\n72.4\n71.8\n71.5\n71.6\n72.1\n70.4\n70.6\n71.3\n71.2\n\nTop5\n90.9\n90.7\n90.4\n90.5\n90.6\n89.8\n90.0\n90.3\n90.3\n\nTop1\n76.1\n73.0\n74.2\n73.7\n74.7\n73.4\n74.7\n74.7\n75.9\n\nTop5\n92.9\n91.3\n91.8\n91.9\n92.0\n91.7\n92.1\n92.2\n92.8\n\nDenseNet121\nTop5\nTop1\n74.4\n92.0\n91.0\n72.5\n91.6\n73.7\n91.0\n72.3\n91.2\n73.1\n72.8\n91.0\n91.0\n72.4\n90.5\n71.9\n71.7\n90.4\n\nTable 3: Accuracy of the pre-trained models and the manipulated models on the entire ImageNet\nvalidation set. The accuracy drops are around only 2%/1% for Top-1/Top-5 accuracy, respectively.\n\nImportantly, we also emphasize that fooling one interpretation method is transferable to other\ninterpretation methods as well, with varying amount depending on the fooling type, model architecture,\nand interpreter. For example, Center-mass fooling with LRPT alters not only LRPT itself, but also\nthe interpretation of Grad-CAM, as shown in (6,1) in Figure 2. The Top-k fooling and VGG19 seem\nto have larger transferability than others. More discussion on the transferability is elaborated in\nSection 5. For the type of interpreter, it seems when the model is manipulated with LRPT , usually\nthe visualizations of Grad-CAM and SimpleGT are also affected. 
However, when fooling is done with Grad-CAM, LRP_T and SimpleG_T are less impacted.

5 Discussion and Conclusion

In this section, we give several important further discussions of our method. Firstly, one may argue that our model manipulation might have fooled not only the interpretation results but also the model's actual reasoning for making the prediction. To that regard, we employ the Area Over Prediction Curve (AOPC) [35], a principled way of quantitatively evaluating the validity of neural network interpretations, to check whether the manipulated model itself has been significantly altered by fooling the interpretation.

Figure 4: (a) AOPC of the original and Top-k fooled model (DenseNet121, Grad-CAM). (b) Robustness of a Location fooled model (ResNet50, LRP_T) with respect to Gaussian perturbation of the weight parameters. (c) Top-1 accuracy of ResNet50 on D_val and PGD(D_val), and Grad-CAM results, when manipulating an adversarially trained model with Location fooling.

Figure 4(a) shows the average AOPC curves on 10K validation images for the original and manipulated DenseNet121 (Top-k fooled with Grad-CAM) models, w_o and w*_fool, with three different perturbation orders, i.e., with respect to the h^I_c(w_o) scores, the h^I_c(w*_fool) scores, and a random order. From the figure, we observe that w_o(h^I_c(w_o)) and w*_fool(h^I_c(w_o)) show almost identical AOPC curves, which suggests that w*_fool has not changed much from w_o and is making its predictions by focusing on parts similar to those on which w_o bases its predictions, namely h^I_c(w_o). In contrast, the AOPC curves of both w_o(h^I_c(w*_fool)) and w*_fool(h^I_c(w*_fool)) lie significantly lower, even lower than the case of random perturbation.
From this result, we can deduce that h^I_c(w*_fool) is highlighting parts that are less helpful than random pixels for making predictions and, hence, is a "wrong" interpretation.

Secondly, one may ask whether our fooling can be easily detected or undone. Since it is known that adversarial input examples can be detected by adding a small Gaussian perturbation to the input [36], one may suspect that adding small Gaussian noise to the model parameters might similarly reveal our fooling. However, Figure 4(b) shows that w_o and w*_fool (ResNet50, Location-fooled with LRPT) behave very similarly in terms of Top-1 accuracy on the ImageNet validation set as we increase the noise level of the Gaussian perturbation, and the FSRs do not change radically, either. Hence, we claim that detecting or undoing our fooling would not be simple.

Thirdly, one can question whether our method would also work for adversarially trained models. To that end, Figure 4(c) shows the Top-1 accuracy of the ResNet50 model on Dval (i.e., the ImageNet validation set) and PGD(Dval) (i.e., the PGD-attacked Dval), and demonstrates that an adversarially trained model can also be manipulated by our method. Namely, starting from a pre-trained w_o (dashed red), we perform "free" adversarial training (ε = 1.5) [37] to obtain w_adv (dashed green), and then start our model manipulation with (Location fooling, Grad-CAM) while keeping the adversarial training. Note that the Top-1 accuracy on Dval drops while that on PGD(Dval) increases during the adversarial training phase (from red to green), as expected, and both are maintained during our model manipulation phase (e.g., dashed blue).
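Returning to the detection question above, the weight-perturbation check of Figure 4(b) can be sketched on a toy linear classifier: add Gaussian noise to the parameters at increasing noise levels and track how accuracy degrades. The helper name and noise levels are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy classifier whose robustness to Gaussian noise on the *weights*
# we probe, mimicking the detection check of Figure 4(b).
N, D, C = 500, 36, 3
X = rng.normal(size=(N, D))
W = rng.normal(size=(C, D))
y = np.argmax(X @ W.T, axis=1)      # the model's own (clean) predictions

def accuracy_under_weight_noise(W, sigma, trials=20):
    """Average agreement with the clean predictions after perturbing
    every weight by N(0, sigma^2)."""
    accs = []
    for _ in range(trials):
        W_noisy = W + sigma * rng.normal(size=W.shape)
        accs.append((np.argmax(X @ W_noisy.T, axis=1) == y).mean())
    return float(np.mean(accs))

curve = {s: accuracy_under_weight_noise(W, s) for s in (0.0, 0.1, 0.5, 2.0)}
# If a fooled model degraded differently from the original under this
# perturbation, such a curve could expose it; Figure 4(b) shows that the
# original and fooled models degrade almost identically.
```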
The right panel shows the Grad-CAM interpretations at three distinct phases (see the color-coded boundaries), and we clearly see the success of the Location fooling (blue, third row).

(a) Decision boundaries  (b) Fooling for SmoothGrad (SG)

Figure 5: (a) Two possible decision boundaries with similar accuracies but with different gradients. (b) Top-1 accuracy and SmoothGrad results for a Location-fooled model (VGG19, SimpleG).

Finally, we give intuition on why our adversarial model manipulation works, and on some of its limitations. Note first that all the interpretation methods we employ are related to some form of gradients: SimpleG uses the gradient with respect to the input, Grad-CAM is a function of the gradient of the representation at a certain layer, and LRP turns out to be similar to gradient times input [38]. Motivated by [11], Figure 5(a) illustrates the point that the same test data can be classified with almost the same accuracy by different decision boundaries that result in radically different gradients, i.e., interpretations. This common reliance on gradient information partially explains the transferability of the foolings, although the asymmetry of the transferability should be analyzed further. Furthermore, the level of fooling seems to have an intriguing connection with the model complexity, similar to the finding of [39] in the context of input adversarial attacks. As a hint for developing more robust interpretation methods, Figure 5(b) shows our results on fooling SmoothGrad [40], which averages the SimpleG maps obtained from multiple Gaussian-noise-added inputs. We attempted Location fooling on VGG19 with SimpleG; the left panel shows the accuracies on the ImageNet validation set, and the right panel shows the SmoothGrad saliency maps corresponding to the iteration steps.
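The gradient averaging that makes SmoothGrad harder to fool can be sketched on a toy ReLU-gated score, for which the pointwise gradient flips abruptly as an input coordinate crosses zero while its noisy average varies smoothly. The toy score and function names are our own illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy nonlinear score f(x) = w . relu(x); its SimpleGrad saliency is
# w * 1[x > 0], which changes discontinuously with x.
D = 36
w = rng.normal(size=D)

def simple_grad(x):
    return w * (x > 0)

def smooth_grad(x, sigma=1.0, samples=2000):
    """SmoothGrad: average SimpleGrad over Gaussian-noised copies of x."""
    noise = sigma * rng.normal(size=(samples, x.size))
    return np.mean([simple_grad(x + n) for n in noise], axis=0)

x = rng.normal(size=D)
sg = smooth_grad(x)
# For this toy score, each SmoothGrad component is w_i times the fraction
# of noised copies whose gate is on -- a smoothed version of the pointwise
# gradient. Fooling it must therefore shift the saliency over a whole
# neighborhood of x, not just at x itself.
```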
Note that we lose around 10% of Top-5 accuracy to obtain a visually satisfactory fooled interpretation (dashed blue), suggesting that it is much harder to fool interpretation methods that integrate gradients at multiple points than pointwise methods; this can also be predicted from Figure 5(a).

We believe this paper can open up a new research avenue regarding the design of more robust interpretation methods. We argue that checking the robustness of interpretation methods with respect to our adversarial model manipulation should be an indispensable criterion for interpreters, in addition to the sanity checks proposed in [28]; note that Grad-CAM passes their checks. Future research topics include devising more robust interpretation methods that can defend against our model manipulation and further investigation of the transferability of fooling. Moreover, establishing connections with security-focused perspectives on neural networks, e.g., [41, 42], would be another fruitful direction to pursue.

Acknowledgements

This work is supported in part by the ICT R&D Program [No. 2016-0-00563, Research on adaptive machine learning technology development for intelligent autonomous digital companion] [No. 2019-0-01396, Development of framework for analyzing, detecting, mitigating of bias in AI model and training data], the AI Graduate School Support Program [No. 2019-0-00421], and the ITRC Support Program [IITP-2019-2018-0-01798] of MSIT / IITP of the Korean government, and by the KIST Institutional Program [No. 2E29330].

References

[1] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In ITCS, pages 214–226. ACM, 2012.

[2] W. James Murdoch, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl, and Bin Yu. Interpretable machine learning: definitions, methods, and applications. arXiv:1901.04592, 2019.

[3] David Gunning. Explainable artificial intelligence (XAI).
DARPA, 2017.

[4] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In SIGKDD, pages 1135–1144. ACM, 2016.

[5] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In NIPS, 2017.

[6] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE, 10(7):e0130140, 2015.

[7] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017.

[8] Wojciech Samek, Grégoire Montavon, and Klaus-Robert Müller. Interpreting and explaining deep models in computer vision. CVPR Tutorial (http://interpretable-ml.org/cvpr2018tutorial/), 2018.

[9] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv:1702.08608, 2017.

[10] Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T. Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. The (un)reliability of saliency methods. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pages 267–280. Springer, 2019.

[11] Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of neural networks is fragile. In AAAI, 2019.

[12] David Alvarez-Melis and Tommi S. Jaakkola. Towards robust interpretability with self-explaining neural networks. In NeurIPS, 2018.

[13] Xinyang Zhang, Ningfei Wang, Shouling Ji, Hua Shen, and Ting Wang. Interpretable deep learning under fire.
arXiv:1812.00891, 2018.

[14] Ann-Kathrin Dombrowski, Maximilian Alber, Christopher J. Anders, Marcel Ackermann, Klaus-Robert Müller, and Pan Kessel. Explanations can be manipulated and geometry is to blame. In NeurIPS, 2019.

[15] Arthur Asuncion and David Newman. UCI machine learning repository, 2007.

[16] Ann Arbor District Library. Elephant pulls fire truck at the Franzen Brothers Circus, 1996.

[17] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv:1312.6034, 2013.

[18] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models. CSUR, volume 51, page 93. ACM, 2019.

[19] Vitali Petsiuk, Abir Das, and Kate Saenko. RISE: Randomized input sampling for explanation of black-box models. In BMVC, 2018.

[20] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833. Springer, 2014.

[21] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In ICML, 2017.

[22] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In ICML, 2017.

[23] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. In ICLR Workshop, 2015.

[24] Bin Yu. Stability. Bernoulli, 19:1484–1500, 2013.

[25] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, 2018.

[26] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world.
arXiv:1607.02533, 2016.

[27] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv:1706.06083, 2017.

[28] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. In NeurIPS, 2018.

[29] Sebastian Lapuschkin, Alexander Binder, Klaus-Robert Müller, and Wojciech Samek. Understanding and comparing deep neural networks for age and gender classification. In ECCV, pages 1629–1638, 2017.

[30] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115:211–252, 2015.

[31] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[32] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[33] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 4700–4708, 2017.

[34] J. Russell and R. Cohn. Spearman's Rank Correlation Coefficient. Book on Demand, 2012.

[35] Wojciech Samek, Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, and Klaus-Robert Müller. Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems, 28:2660–2673, 2016.

[36] Kevin Roth, Yannic Kilcher, and Thomas Hofmann. The odds are odd: A statistical test for detecting adversarial examples.
In ICML, 2019.

[37] Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S. Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv:1904.12843, 2019.

[38] Pieter-Jan Kindermans, Kristof Schütt, Klaus-Robert Müller, and Sven Dähne. Investigating the influence of noise and distractors on the interpretation of neural networks. arXiv:1611.07270, 2016.

[39] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv:1412.6572, 2014.

[40] Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. SmoothGrad: removing noise by adding noise. arXiv:1706.03825, 2017.

[41] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. BadNets: Identifying vulnerabilities in the machine learning model supply chain. arXiv:1708.06733, 2017.

[42] Yossi Adi, Carsten Baum, Moustapha Cisse, Benny Pinkas, and Joseph Keshet. Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In USENIX Security, pages 1615–1631, 2018.