]> git.armaanb.net Git - python_dp.git/blob - laplace_mechanism.ipynb
Initial commit
[python_dp.git] / laplace_mechanism.ipynb
1 {
2  "cells": [
3   {
4    "cell_type": "markdown",
5    "id": "6b9078e7",
6    "metadata": {},
7    "source": [
8     "# Laplace Differential Privacy\n",
9     "\n",
10     "By [Armaan Bhojwani](https://armaanb.net) under [Praneeth Vepakomma](https://praneeth.mit.edu/)\n",
11     "\n",
12     "This notebook features the following differentially private operations on 1 dimensional dataset of ints with set bounds.\n",
13     "- Laplace Mechanism:\n",
14     "    - Sum\n",
15     "    - Count\n",
16     "    - Mean\n",
17     "    - Histogram\n",
18     "    - Privacy Loss Random Variable\n",
19     "    - Privacy Loss Distribution\n",
20     "    \n",
21     "For operations on a dataset without set bounds (utilizing clipping), see laplace_example_class_height.ipynb\n",
22     "    \n",
23     "### References\n",
24     "- https://programming-dp.com\n",
25     "- B. Pejó and D. Desfontaines, Guide to Differential Privacy Modifications\n",
26     "- https://github.com/google/differential-privacy/blob/main/common_docs/Privacy_Loss_Distributions.pdf\n",
27     "\n",
28     "### Status\n",
29     "- Complete"
30    ]
31   },
32   {
33    "cell_type": "markdown",
34    "id": "5db8a18d",
35    "metadata": {},
36    "source": [
37     "## Parameters"
38    ]
39   },
40   {
41    "cell_type": "code",
42    "execution_count": 1,
43    "id": "88f65cb9",
44    "metadata": {},
45    "outputs": [],
46    "source": [
47     "# Privacy\n",
48     "epsilon = 1\n",
49     "\n",
50     "# Data\n",
51     "data_len = 150   # Length of dataset\n",
52     "data_low = 0     # Lowest value of dataset\n",
53     "data_high = 99   # Highest value of dataset"
54    ]
55   },
56   {
57    "cell_type": "markdown",
58    "id": "c445c076",
59    "metadata": {},
60    "source": [
61     "## Build the dataset\n",
62     "Create dataset of ints that we can work with"
63    ]
64   },
65   {
66    "cell_type": "code",
67    "execution_count": 2,
68    "id": "584e2eba",
69    "metadata": {},
70    "outputs": [],
71    "source": [
72     "import numpy as np\n",
73     "\n",
74     "# Initialize Numpy RNG\n",
75     "rng = np.random.default_rng()\n",
76     "\n",
77     "# Increment data_high so that it includes the value specified\n",
78     "data_high += 1\n",
79     "\n",
80     "# Create dataset as defined by above parameters\n",
81     "x = rng.integers(low=data_low, high=data_high, size=(data_len))"
82    ]
83   },
84   {
85    "cell_type": "markdown",
86    "id": "dcf7d907",
87    "metadata": {},
88    "source": [
89     "## Helper functions"
90    ]
91   },
92   {
93    "cell_type": "code",
94    "execution_count": 3,
95    "id": "95edf89e",
96    "metadata": {},
97    "outputs": [],
98    "source": [
99     "import math\n",
100     "from scipy.stats import laplace\n",
101     "from common import *\n",
102     "\n",
103     "def laplace_noise(epsilon, sensitivity):\n",
104     "    \"\"\" Generate laplace noise given parameters\n",
105     "    Inputs:\n",
106     "        epsilon: epsilon value to use\n",
107     "        sensitivity: sensivitity of the mechanism\n",
108     "    Output:\n",
109     "        laplace noise (float) with specified parameters\n",
110     "    \"\"\"\n",
111     "\n",
112     "    return rng.laplace(scale=sensitivity / epsilon)\n",
113     "\n",
114     "\n",
115     "def laplace_mech(x, mech, epsilon, sensitivity, verbose=False):\n",
116     "    \"\"\" Calculate a differentially private result using laplace noise\n",
117     "    Inputs:\n",
118     "        x: input dataset\n",
119     "        mech: function to run on input, should take single parameter (x)\n",
120     "        epsilon: epsilon value to use\n",
121     "        sensitivity: sensitivity to use\n",
122     "        verbose: print detail\n",
123     "    Output:\n",
124     "        (mech(x), mech(x) + laplace noise)\n",
125     "    \"\"\"\n",
126     "    mech_x = mech(x)\n",
127     "    noise = laplace_noise(epsilon, sensitivity)\n",
128     "\n",
129     "    # We round here so that the result with added noise is still an int, like\n",
130     "    # the input. This is do-able because of DP's post-processing properties\n",
131     "    mech_X = mech_x + round(noise)\n",
132     "\n",
133     "    if verbose:\n",
134     "        print(f\"Non-private {mech.__name__}: {mech_x}\")\n",
135     "        print(f\"Private {mech.__name__}:     {mech_X}\")\n",
136     "\n",
137     "    return (mech_x, mech_X)"
138    ]
139   },
140   {
141    "cell_type": "markdown",
142    "id": "f445ac2c",
143    "metadata": {},
144    "source": [
145     "## Query implementations\n",
146     "### Sum\n",
147     "Because the data is arbitrary and has publically-known bounds we can manually set the sensitivity to the data's range, however if the upper bound of the data was unknown, we should use a differentially private method of calculating the sensitivity, such as clipping. See the `laplace_example_class_height.ipynb` notebook for an example of this."
148    ]
149   },
150   {
151    "cell_type": "code",
152    "execution_count": 4,
153    "id": "d267ca84",
154    "metadata": {},
155    "outputs": [],
156    "source": [
157     "sum_sensitivity = data_high - data_low\n",
158     "\n",
159     "sum_x, sum_X = laplace_mech(x, np.sum, epsilon, sum_sensitivity, verbose=True)"
160    ]
161   },
162   {
163    "cell_type": "markdown",
164    "id": "ac4edb3c",
165    "metadata": {},
166    "source": [
167     "### Size\n",
168     "\n",
169     "The most that the sensitivity could be is 1, because we are doing a count query, and thus when adding or removing a record from the dataset, the most the result can change is 1."
170    ]
171   },
172   {
173    "cell_type": "code",
174    "execution_count": 5,
175    "id": "a98163e4",
176    "metadata": {
177     "scrolled": false
178    },
179    "outputs": [],
180    "source": [
181     "size_sensitivity = 1\n",
182     "\n",
183     "size_x, size_X = laplace_mech(x, np.size, epsilon, size_sensitivity, verbose=True)"
184    ]
185   },
186   {
187    "cell_type": "markdown",
188    "id": "332f4872",
189    "metadata": {},
190    "source": [
191     "### Mean\n",
192     "We can build off of the sum and to find the mean. It is important to apply DP to each of the individual steps of finding the mean, as opposed to finding the mean and then applying DP."
193    ]
194   },
195   {
196    "cell_type": "code",
197    "execution_count": 6,
198    "id": "96d2d36d",
199    "metadata": {},
200    "outputs": [],
201    "source": [
202     "# Find non-private mean\n",
203     "mean_x = sum_x / size_x\n",
204     "\n",
205     "# Find differentially private mean\n",
206     "mean_X = sum_X / size_X\n",
207     "\n",
208     "print(f\"Original mean: {mean_x}\")\n",
209     "\n",
210     "# Round private mean to same number of decimal places as original mean\n",
211     "print(f\"Private mean:  {round(mean_X, len(str(mean_x).split('.')[1]))}\")"
212    ]
213   },
214   {
215    "cell_type": "markdown",
216    "id": "8fe66045",
217    "metadata": {},
218    "source": [
219     "### Differentially private histogram"
220    ]
221   },
222   {
223    "cell_type": "code",
224    "execution_count": 7,
225    "id": "39a24954",
226    "metadata": {
227     "scrolled": false
228    },
229    "outputs": [],
230    "source": [
231     "import matplotlib.pyplot as plt\n",
232     "from matplotlib import ticker\n",
233     "\n",
234     "# Parameters\n",
235     "num_bins = 10\n",
236     "\n",
237     "# Generate non-private histogram\n",
238     "x_counts, bins = np.histogram(x, bins=num_bins)\n",
239     "\n",
240     "# Recount the bins in a private way\n",
241     "X_counts = [laplace_mech(i, lambda x: x, epsilon, 1)[1] for i in x_counts]\n",
242     "\n",
243     "\n",
244     "def plot_hist(bins, weights):\n",
245     "    \"\"\" Styles DP histogram\n",
246     "    Inputs:\n",
247     "        bins: array of bin boundaries to use\n",
248     "        weights: counts for each bin\n",
249     "    Output:\n",
250     "        Matplotlib plot ready to be plotted. Add whatever extra styling you \n",
251     "        want, then call plt.show().\n",
252     "    \"\"\"\n",
253     "    \n",
254     "    ax = plt.gca()\n",
255     "\n",
256     "    # Set Y-Axis to reasonable bounds\n",
257     "    if min(weights) > 20:\n",
258     "        ax.set_ylim([0.9 * min(weights), max(weights) + 0.1 * min(weights)])\n",
259     "    else:\n",
260     "        ax.set_ylim([0, max(weights) + 2])\n",
261     "        ax.set_yticks(\n",
262     "            [i for i in range(int(max(weights)) + 2) if (i % 2 == 0)])\n",
263     "\n",
264     "    # Set axis ticks\n",
265     "    ax.yaxis.set_minor_locator(ticker.AutoMinorLocator())\n",
266     "    plt.xticks(bins)\n",
267     "\n",
268     "    # Add vertical lines between bars\n",
269     "    [plt.axvline(x=i, color=\"w\") for i in bins]\n",
270     "\n",
271     "    # Add counts to each bar\n",
272     "    centers = [(bins[i] + bins[i + 1]) / 2 for i in range(np.size(weights))]\n",
273     "    for yc, xc in zip(weights, centers):\n",
274     "        ax.text(xc, yc + 0.5, \"%d\" % yc, ha=\"center\")\n",
275     "\n",
276     "    # Plot\n",
277     "    plt.hist(bins[:-1], bins, weights=weights)\n",
278     "\n",
279     "\n",
280     "# Plot non-private histogram\n",
281     "plt.title(\"Non-private histogram\")\n",
282     "plot_hist(bins, x_counts)\n",
283     "plt.show()\n",
284     "\n",
285     "# Plot private histogram\n",
286     "plt.title(\"Private histogram\")\n",
287     "plot_hist(bins, X_counts)\n",
288     "plt.show()"
289    ]
290   },
291   {
292    "cell_type": "markdown",
293    "id": "597577b2",
294    "metadata": {},
295    "source": [
296     "## Quantifying privacy loss"
297    ]
298   },
299   {
300    "cell_type": "markdown",
301    "id": "1f646ef3",
302    "metadata": {},
303    "source": [
304     "### Calculating privacy loss random variable (PLRV)"
305    ]
306   },
307   {
308    "cell_type": "code",
309    "execution_count": 8,
310    "id": "fc945135",
311    "metadata": {},
312    "outputs": [],
313    "source": [
314     "from tqdm.notebook import tqdm\n",
315     "\n",
316     "\n",
317     "def calc_PLRV(x, mech, epsilon, sensitivity, num_samples=1, verbose=False):\n",
318     "    \"\"\" Calculates the privacy loss random variable of a laplace DP mechanism\n",
319     "    Inputs:\n",
320     "        x: dataset to operate on\n",
321     "        mech: query to apply on x\n",
322     "        epsilon: epsilon value to use\n",
323     "        sensitivity: mechanism sensitivity\n",
324     "        num_samples: how many time to sample to PLRV\n",
325     "        verbose: print detail\n",
326     "    Output:\n",
327     "        an array of samples of the PLRV with length num_samples\n",
328     "    \"\"\"\n",
329     "    \n",
330     "    # Calculate original and private mech(x)\n",
331     "    if verbose: print(\"On original database:\")\n",
332     "    mech_x, mech_X = laplace_mech(x,\n",
333     "                                  mech,\n",
334     "                                  epsilon,\n",
335     "                                  sensitivity,\n",
336     "                                  verbose=verbose)\n",
337     "\n",
338     "    output = []\n",
339     "\n",
340     "    for i in tqdm(range(num_samples), disable=(num_samples < 5000)):\n",
341     "        # Calculate original and private mech(x) on neighbouring dataset\n",
342     "        x2 = create_neighbour(x, verbose=verbose)\n",
343     "        if verbose: print(\"On neighbouring database:\")\n",
344     "        mech_x2, mech_X2 = laplace_mech(x2,\n",
345     "                                        mech,\n",
346     "                                        epsilon,\n",
347     "                                        sensitivity,\n",
348     "                                        verbose=verbose)\n",
349     "\n",
350     "        # Calculate PLRV\n",
351     "        # See section 3.1 of Google paper\n",
352     "        delta = abs(mech_x - mech_x2)\n",
353     "        delta_tilde = delta / (sensitivity / epsilon)\n",
354     "\n",
355     "        w = laplace.rvs(0, 1)\n",
356     "        if w <= 0:\n",
357     "            output.append(delta_tilde)\n",
358     "        elif w >= delta_tilde:\n",
359     "            output.append(-delta_tilde)\n",
360     "        elif 0 < w and w < delta_tilde:\n",
361     "            output.append(delta_tilde - 2 * w)\n",
362     "\n",
363     "    return output\n",
364     "\n",
365     "\n",
366     "size_plrv = calc_PLRV(x, np.size, epsilon, size_sensitivity, 1, verbose=True)\n",
367     "print(f\"The PLRV of a size query is: {size_plrv}\")"
368    ]
369   },
370   {
371    "cell_type": "markdown",
372    "id": "921727f2",
373    "metadata": {},
374    "source": [
375     "### Plotting privacy loss distribution (PLD)"
376    ]
377   },
378   {
379    "cell_type": "code",
380    "execution_count": 9,
381    "id": "112aae28",
382    "metadata": {},
383    "outputs": [],
384    "source": [
385     "# Parameters\n",
386     "num_samples = 20000\n",
387     "num_bins = 50\n",
388     "\n",
389     "def plot_PLD(*plrv_func_args):\n",
390     "    \"\"\" Plots the PLD\n",
391     "    Inputs:\n",
392     "        plrv_func_args: arguments to pass to calc_PLRV function\n",
393     "    Output:\n",
394     "        Matplotlib plots and prints\n",
395     "    \"\"\"\n",
396     "    data = calc_PLRV(*plrv_func_args)\n",
397     "    plt.hist(data, bins=num_bins)\n",
398     "    plt.title(\"Privacy loss distribution histogram for query\")\n",
399     "    plt.show()\n",
400     "\n",
401     "    plt.hist(data, bins=500, cumulative=True, histtype='step')\n",
402     "    plt.title(\"Privacy loss distribution CDF for query\")\n",
403     "    plt.show()\n",
404     "\n",
405     "    abs_data = [abs(i) for i in data]\n",
406     "    expected = np.mean(abs_data)\n",
407     "    print(f\"The expected absolute value of the PLRV is {expected}\")"
408    ]
409   },
410   {
411    "cell_type": "markdown",
412    "id": "b062b4dc",
413    "metadata": {},
414    "source": [
415     "### PLD of size query"
416    ]
417   },
418   {
419    "cell_type": "code",
420    "execution_count": 10,
421    "id": "efa65d51",
422    "metadata": {},
423    "outputs": [],
424    "source": [
425     "plot_PLD(x, np.size, epsilon, size_sensitivity, num_samples)"
426    ]
427   }
428  ],
429  "metadata": {
430   "kernelspec": {
431    "display_name": "Python 3 (ipykernel)",
432    "language": "python",
433    "name": "python3"
434   },
435   "language_info": {
436    "codemirror_mode": {
437     "name": "ipython",
438     "version": 3
439    },
440    "file_extension": ".py",
441    "mimetype": "text/x-python",
442    "name": "python",
443    "nbconvert_exporter": "python",
444    "pygments_lexer": "ipython3",
445    "version": "3.9.7"
446   }
447  },
448  "nbformat": 4,
449  "nbformat_minor": 5
450 }