{ "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6-final" }, "orig_nbformat": 2, "kernelspec": { "name": "myenv", "language": "python", "display_name": "Python (myenv)" } }, "nbformat": 4, "nbformat_minor": 2, "cells": [ { "source": [ "# Data structures or many ways of organizing your stuff 🇬🇧\n", "\n", "## I ain't know nothing about data structures\n", "\n", "Many people, when being introduced to data structures related concepts, don't realize that they already use extensively various kinds of data structures in their life.\n", "\n", "Check out the following example:" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hello my name is Xiaoou\n", "hello my name is Xiaoou\n", "hello my name is Xiaoou " ] } ], "source": [ "# First way\n", "x = \"hello\"\n", "y = \"my name\"\n", "z = \"is Xiaoou\"\n", "\n", "print(x,y,z)\n", "\n", "# Second way\n", "presentation = [\"hello\",\"my name\",\"is Xiaoou\"]\n", "print(presentation[0],presentation[1],presentation[2])\n", "\n", "# Third way\n", "for word in presentation:\n", " print(word, end = \" \")" ] }, { "source": [ "In the first approach you use a `string` and you'll agree that the code, already fulfilling its mission, looks quite weird. The second one looks better because the use of `array` and the third one is the best-looking one because it takes advantage of the `iterable` property of an array.\n", "\n" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "## Data structure, speed and memory\n", "\n", "However, data structure is much more than looking organized. If you've read [the first tutorial](./01_intro_en.md), it should be clear to you that speed matters when designing programs. But to what speed is related ? We all know that hardware impacts strongly the performance, however this aspect is unrelated in our discussion. More concretely, when we use the word `speed` in algorithms we are referring specifically to how many steps an algorithm takes.\n", "\n", "Note that on the Internet you will meet a bunch of terms of which I've made a summary.\n", "\n", "![test2](img/back/2021-04-13-15-43-22.png)\n", "\n", "Don't be intimidated by these terms. Although there are some subtle differences, in most non-academic contexts they can be used interchangeably.\n", "\n", "So what do I mean by `how many steps`?\n", "\n", "For this you have to know how things are stored/organized on your computer's memory.\n", "\n", "### How computer memory works\n", "\n", "Put simply, the computer memory has a lot of `memory address`es by means of which we can store things. Some are distributed sparsely while some, like `array`, are placed on a compact memory span.\n", "\n", "![test1](img/back/2021-04-13-15-53-53.png)\n" ], "cell_type": "markdown", "metadata": {} }, { "source": [ "Another common data structure is called `set`. The set is very useful when we need a list of non-duplicate items. A common example is phone book.\n", "\n", "However the difference between array and set goes far beyond the aforementioned constraint. In order to understand the performance of any data structure, you need to know the 4 common ways our code might interact with that data structure.\n", "\n", "![](img/back/2021-04-13-22-48-30.png)\n", "\n", "## Analyse a data structure (Read, search, insert and delete)\n", "\n", "* Read refers to checking the value at a specific memory address.\n", "* Search refers to checking whether a certain value exists and if so, retrieve the memory address of where this value is located.\n", "* Insert refers to adding a new value at a specific position.\n", "* Delete refers to removing a value.\n", "\n", "So what is the efficiency of there 4 operations when it comes to array and set?\n", "\n", "Let's look at the operation of reading. Only 2 steps are necessary for both cases. That's because these two structures are stored in memory as follows:\n", "\n", "![](img/back/2021-04-13-22-56-14.png)\n", "\n", "Each value (content) is retrievable via a memory address which in turn is related to an index.\n", "\n", "Quick anecdote: Do you know why in most programming languages the index starts at 0?\n", "\n", "It's because 0 means 0 element from the beginning = the same address as the start of of the array, a value `remembered` by your computer when you create an array. Another value recorded is the length of the array.\n", "\n", "### Read (list vs set)\n", "\n", "Say that you want to check the value at index 3. All you need to do is:\n", "\n", "1. Add 3 to the address of the first element.\n", "2. Access the value.\n", "\n", "So basically it takes 2 steps.\n", "\n", "### Search (list vs set)\n", "\n", "Unluckily we don't have instant access to each value of an array/set, so in the worst case, we'd have to check all the values = N steps to find a value/affirm its absence. This principle applies to both data structures.\n", "\n", "### Insert (idem)\n", "\n", "Here comes the real game (in French we say ça change la donne).\n", "\n", "First of all it takes at most N steps to insert a new value into an array. Finding the right spot takes just 1 step (see the Read section), however before insertion one has to shift all the elements to make space for the new element.\n", "\n", "I know, one picture is worth a thousand words.\n", "\n", "![](img/back/2021-04-13-23-18-56.png)\n", "\n", "What is the best case? Well it happens when one inserts a new element at the end. As said before, your computer already knows the length of your array, i.e. it already knows the address of the last element. Better, no shift is required.\n", "\n", "What about set?\n", "\n", "Remember that a set is supposed to contain no duplicate elements. So first you have to make sure that the element to be inserted has no duplicate! In the worst case it takes N steps. Besides, when we're inserting a value at the beginning of a set, another N steps are required to shift all the data to the right and another final step to insert the new value, taking the total number of steps to 2N + 1 steps.\n", "\n", "### Deletion (idem)\n", "\n", "Just like the insertion, deletion requires also shifting because remember, an array should be stored on contingent memory cells. The sames applies to array-based set. So it's not difficult to figure out that in the worst case we are bound to make N moves when the first element is deleted.\n", "\n", "![](img/back/2021-04-13-23-32-36.png)\n", "\n", "## Which data structure to use\n", "\n", "That's a tricky question and it all depends on how well you know your data. If you are quite sure that your workflow won't involve a lot of insertion, then it would be safe to use a set since you want some strong constraint on your data. Otherwise you should use an array-like structure since frequent insertions would considerably slow down your program's performance.\n", "\n", "I see it coming already: but that's only 1 operation right? Don't hurry, the next tutorial will help you gain a solid grasp on the intricate interactions between data structures, algorithms and performance. You would also know why we distinguish best case, average and worst cases when it comes to comparing algorithms.\n", "\n", "Especially you'll get to know a domain of extensive research in computer science: sorting.\n", "\n", "Stay tuned!" ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "100" ] }, "metadata": {}, "execution_count": 62 } ], "source": [ "# create a random list with no duplicate values\n", "import random\n", "\n", "def generate_list(n):\n", " randomlist = list(range(n))\n", " # shuffle list in place\n", " random.shuffle(randomlist)\n", " return randomlist\n", "\n", "test = generate_list(100)\n", "len(test)" ] }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "def linear_search(arr, x):\n", " \n", " for i in range(len(arr)):\n", " \n", " if arr[i] == x:\n", " return i\n", " \n", " return -1" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "randomlist_sorted = sorted(randomlist)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "154" ] }, "metadata": {}, "execution_count": 45 } ], "source": [ "def binary_search(arr, element):\n", " start = 0\n", " end = len(arr)-1\n", "\n", " while (start <= end):\n", " mid = (start + end) // 2\n", "\n", " if element == arr[mid]:\n", " return mid\n", " elif element < arr[mid]:\n", " end = mid - 1\n", " else:\n", " start = mid + 1\n", "\n", " return -1" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "977 ns ± 0 ns per loop (mean ± std. dev. of 1 run, 100 loops each)\n" ] } ], "source": [ "%%timeit -r1 -n100\n", "binary_search(randomlist_sorted, 295)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "8.9 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 100 loops each)\n" ] } ], "source": [ "%%timeit -r1 -n100\n", "linear_search(randomlist_sorted,295)" ] }, { "source": [ "`ns` stands for nanoseconds, meaning $10^{-9}$. µs means microseconds = $10^{-6}$ seconds." ], "cell_type": "markdown", "metadata": {} }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "# sorted list in python\n", "# !pip install sortedcontainers" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "5" ] }, "metadata": {}, "execution_count": 54 } ], "source": [ "from sortedcontainers import SortedList\n", "sl = SortedList(['f', 'a', 'c', 'd', 'b'])\n", "sl.add(\"e\")\n", "sl.index(\"f\")" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "# bubble sort, raw version\n", "def bubble_sort(our_list):\n", " # We go through the list as many times as there are elements\n", " for i in range(len(our_list)):\n", " # We want the last pair of adjacent elements to be (n-2, n-1)\n", " for j in range(len(our_list) - 1):\n", " if our_list[j] > our_list[j+1]:\n", " # Swap\n", " our_list[j], our_list[j+1] = our_list[j+1], our_list[j]" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "[2, 6, 8, 13, 18, 19]\n" ] } ], "source": [ "our_list = [19, 13, 6, 2, 18, 8]\n", "bubble_sort(our_list)\n", "print(our_list)" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "# optimization\n", "def bubble_sort(our_list):\n", " has_swapped = True\n", "\n", " num_of_iterations = 0\n", "\n", " while(has_swapped):\n", " has_swapped = False\n", " for i in range(len(our_list) - num_of_iterations - 1):\n", " if our_list[i] > our_list[i+1]:\n", " # Swap\n", " our_list[i], our_list[i+1] = our_list[i+1], our_list[i]\n", " has_swapped = True\n", " num_of_iterations += 1" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [], "source": [ "# selection sort the joker sort\n", "def selection_sort(L):\n", " # i indicates how many items were sorted\n", " for i in range(len(L)-1):\n", " # To find the minimum value of the unsorted segment\n", " # We first assume that the first element is the lowest\n", " min_index = i\n", " # We then use j to loop through the remaining elements\n", " for j in range(i+1, len(L)-1):\n", " # Update the min_index if the element at j is lower than it\n", " if L[j] < L[min_index]:\n", " min_index = j\n", " # After finding the lowest item of the unsorted regions, swap with the first unsorted item\n", " L[i], L[min_index] = L[min_index], L[i]" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ " 5\n 6\n 11\n 12\n 13\n" ] } ], "source": [ "# Python program for implementation of Insertion Sort\n", "\n", "# Function to do insertion sort\n", "def insertionSort(arr):\n", "\n", "\t# Traverse through 1 to len(arr)\n", "\tfor i in range(1, len(arr)):\n", "\n", "\t\tkey = arr[i]\n", "\n", "\t\t# Move elements of arr[0..i-1], that are\n", "\t\t# greater than key, to one position ahead\n", "\t\t# of their current position\n", "\t\tj = i-1\n", "\t\twhile j >= 0 and key < arr[j] :\n", "\t\t\t\tarr[j + 1] = arr[j]\n", "\t\t\t\tj -= 1\n", "\t\tarr[j + 1] = key\n", "\n", "\n", "# Driver code to test above\n", "arr = [12, 11, 13, 5, 6]\n", "insertionSort(arr)\n", "for i in range(len(arr)):\n", "\tprint (\"% d\" % arr[i])\n", "\n", "# This code is contributed by Mohit Kumra" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ] }