{ "cells": [ { "cell_type": "markdown", "source": [ "# Understand objected-oriented programming (OOP) by building a minimal Web Scraping framework 🇬🇧\n", "\n", "## What you are going to learn\n", "\n", "`requests` is a very popular package in Python because it provides many convenient methods to handle requests, parsing and exception handling. One could also use the official `urllib` package, however for the same tasks it is overall much easier to use requests due to its code design. You can clearly see the philosophy of the creator through the website’s motto:\n", "\n", "![](img/image/2021-04-30-20-25-11.png)\n", "\n", "The objective of this tutorial is to introduce you to objected-oriented programming (OOP) by imitating a working example in `requests`. After having a grasp of the very basic principles of OOP, we would move on to a more elaborate example to make you better seize the benefits of OOP.\n", "\n", "At the end you will be able to build a minimal web scraping framework allowing to scrape articles of the famous French newspaper `Le Monde` and the American newspaper `New York Times`.\n", "\n", "Before starting this tutorial, be sure to have installed `requests` and `beautiful soup` with:\n", "\n", "```bash\n", "pip install requests\n", "pip install beautifulsoup4\n", "```\n", "\n", "## A working example in requests\n", "\n", "Here is a quick starting example using `requests`, as you can see from the output.\n", "\n", "1. The `get` method returns a class object.\n", "2. The `status code` tells us that the request is successful.\n", "3. The `text` property returns the source code of Google homepage.\n", "4. The `cookies` property returns your cookies." ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The get method returns a class object.\n", "-------------\n", "\n", "\n", "The status code tells us that the request is successful.\n", "-------------\n", "200\n", "\n", "The text property returns the source code of Google homepage.\n", "-------------\n", "\n", "Google

\"Google\"

 

Recherche avancée

© 2021 - ConfidentialitĂ© - Conditions

\n", "\n", "The cookies property returns your cookies.\n", "-------------\n", "]>\n" ] } ], "source": [ "import requests\n", "\n", "r = requests.get('https://www.google.com/')\n", "\n", "print(\"The get method returns a class object.\")\n", "print(\"-------------\")\n", "print(type(r))\n", "\n", "print(\"\\nThe status code tells us that the request is successful.\")\n", "print(\"-------------\")\n", "print(r.status_code)\n", "\n", "print(\"\\nThe text property returns the source code of Google homepage.\")\n", "print(\"-------------\")\n", "print(type(r.text))\n", "print(r.text)\n", "\n", "print(\"\\nThe cookies property returns your cookies.\")\n", "print(\"-------------\")\n", "print(r.cookies)" ] }, { "cell_type": "markdown", "source": [ "## Class, constructor, method and property\n", "\n", "When we talk about objected-oriented programming (OOP), no matter which language you are using (Some languages like Java and C++ are more natively class-based), the four most fundamental concepts are `class`, `constructor`, `method` and `property`.\n", "\n", "Let's construct our first class in Python. A class is a concept, a perception. Let's say we want to define what is a human being (Person).\n", "\n", "Typically you would start building your `Person` (capital case) class by defining it's properties (name, nationality, job) using a constructor (the `__init__` here). You would also like to define the class's method (what a human can do).\n", "\n", "One thing might seem odd for beginners is the `self` parameter. Roughly speaking it's a placeholder which allows your methods to access the properties.\n", "\n", "The code should be quite self-explanatory." ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 3, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Hello my name is Xiaoou.\nI'm Chinese.\nI work in NLP.\nXiaoou\n" ] } ], "source": [ "class Person:\n", " def __init__(self, name, nationality, job):\n", " self.name = name\n", " self.nationality = nationality\n", " self.job = job\n", "\n", " def greeting(self):\n", " print(f\"Hello my name is {self.name}.\\nI'm {self.nationality}.\\nI work in {self.job}.\")\n", "\n", "# initiate a class\n", "me = Person(\"Xiaoou\", \"Chinese\", \"NLP\")\n", "\n", "# using the greeting method\n", "me.greeting()\n", "\n", "# access the name property\n", "print(me.name)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "So far so good. We have built our first class. Now let's imitate the schema used in the working example of `requests` we saw at the beginning. As a reminder, the working code bloc is:" ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": null, "outputs": [], "source": [ "r = requests.get('https://www.google.com/')\n", "print(type(r))\n", "print(r.status_code)\n", "print(r.text)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "## Imitate the requests framework\n", "\n", "When you run `requests.get(\"https://www.google.com/\")`, you have an object of class ``. This object has two properties: `status_code` and `text`. 
"\n", "Let's replicate this schema with an example of article scraping.\n", "\n", "So basically we are trying to have an object returned when a function is called.\n", "\n", "We first create the class `Content`, then the function `scrape_lemonde`, which encapsulates the scraped URL, the title and the first 100 characters of the article in an object of the `Content` class." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 23, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Url\n", "---------\n", "https://www.lemonde.fr/mondephilatelique/article/2021/04/03/le-mouvement-wiener-werkstatte-dans-cartes-postales-magazine_6075495_5470897.html\n", "\n", "Title\n", "----------\n", "Le mouvement Wiener Werkstätte, dans « Cartes postales magazine »\n", "\n", "First 100 characters\n", "-----------\n", "« Le 22 janvier 2012 décède Paul Armand, ce qui entraîne la mort de Cartes postales et collections (\n" ] } ], "source": [ "import requests\n", "from bs4 import BeautifulSoup\n", "\n", "class Content:\n", "    def __init__(self, url, title, first100):\n", "        self.url = url\n", "        self.title = title\n", "        self.first100 = first100\n", "\n", "def get_parsed_text(url):\n", "    req = requests.get(url)\n", "    soup = BeautifulSoup(req.text, 'html.parser')\n", "    return soup\n", "\n", "def scrape_lemonde(url):\n", "    soup = get_parsed_text(url)\n", "    title = soup.find('h1').text\n", "    body_tags = soup.article.find_all([\"p\", \"h2\"], recursive=False)\n", "    body = \"\"\n", "    for tag in body_tags:\n", "        body += tag.get_text()\n", "    first100 = body[:100]\n", "    return Content(url, title, first100)\n", "\n", "url = 'https://www.lemonde.fr/mondephilatelique/article/2021/04/03/le-mouvement-wiener-werkstatte-dans-cartes-postales-magazine_6075495_5470897.html'\n", "content = scrape_lemonde(url)\n", "\n", "print(\"Url\")\n", "print(\"---------\")\n", "print(content.url)\n", "\n", "print(\"\\nTitle\\n----------\")\n", "print(content.title)\n", "\n", "print(\"\\nFirst 100 characters\\n-----------\")\n", "print(content.first100)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "What are the benefits of such an approach?\n", "\n", "Imagine that, instead of returning a single object, your product exposed the scraped pieces separately: the end user would have to know three more functions, say `get_url`, `get_content` and `get_title`, to be able to use your scraper.\n", "\n", "That is quite complex and counter-intuitive.\n", "\n", "An object-oriented approach, on the other hand, is much easier to understand, because a webpage naturally has a title, a URL and text. Your end user doesn't need to know how you obtain this information; they just need to know how to access it. That's the principle of `encapsulation`.\n", "\n", "Note that I've deliberately simplified some software design concepts, as I don't want to go too deep into them. Still, I hope you can already see how OOP works in practice and what its benefits are.\n", "\n", "Another benefit (besides encapsulation and better code structure) is **normalization**. Let's say you now want to scrape the New York Times. It suffices to add a `scrape_nytimes` function that **always returns the same kind of object** (a `Content` with three properties).\n", "\n", "This way you normalize your framework by defining a **unified output**, regardless of the scraper function that the user calls."
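, "\n", "Concretely, once `scrape_nytimes` exists, downstream code can treat every newspaper in exactly the same way. A sketch of the idea (assuming `lemonde_url` and `nyt_url` hold the two article URLs used in this tutorial):\n", "\n", "```python\n", "for scrape, url in [(scrape_lemonde, lemonde_url), (scrape_nytimes, nyt_url)]:\n", "    content = scrape(url)\n", "    print(content.title)  # same Content interface, whatever the source\n", "```\n"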
], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 25, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Url\n", "---------\n", "https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html\n", "\n", "Title\n", "----------\n", "The Men Who Want to Live Forever\n", "\n", "First 100 characters\n", "-----------\n", "Would you like to live forever? Some billionaires, already invincible in every other way, have decid\n" ] } ], "source": [ "# The new function\n", "def scrape_nytimes(url):\n", " bs = get_parsed_text(url)\n", " title = bs.find('h1').text\n", " lines = bs.select('div.StoryBodyCompanionColumn div p')\n", " body = '\\n'.join([line.text for line in lines])\n", " first100 = body[:100]\n", " return Content(url, title, first100)\n", "\n", "# Exactly the same schema\n", "\n", "url = 'https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html'\n", "content = scrape_nytimes(url)\n", "\n", "print(\"Url\")\n", "print(\"---------\")\n", "print(content.url)\n", "\n", "print(\"\\nTitle\\n----------\")\n", "print(content.title)\n", "\n", "print(\"\\nFirst 100 characters\\n-----------\")\n", "print(content.first100)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "Isn't that beautiful?\n", "\n", "However the structure of our code (or product) is still one step from being totally intuitive. As a common mortal, I feel like a `scraper` class would be a perfect place to start instead of having to know the two functions `scrape_lemonde` and `scrape_nytimes`.\n", "\n", "Let's do it! Note that we reuse the `Content` class created earlier.\n", "\n", "Then we integrate the two scraper functions in a new `Crawler` class.\n", "\n", "The new `Crawler` class has `journal` as parameter, which allows us to adjust the scraper's behavior according to the newspaper (which uses different tags to surround text body).\n", "\n", "It's important to add `self` as a parameter when you create functions in a class. Actually when \"functions\" are created inside a `class` they are called `methods` and the `self` parameter suggests that these are `methods` acting on objects of the same `class`.\n", "\n", "The last thing worth mentioning is the `print` in the constructor which returns a message when the user wishes to crawl a newspaper not recognized by the Crawler." 
], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 40, "outputs": [], "source": [ "class Content:\n", " def __init__(self, url, title, first100):\n", " self.url = url\n", " self.title = title\n", " self.first100 = first100\n", "\n", "class Crawler:\n", "\n", " def __init__(self, journal):\n", " self.journal = journal\n", " # if newspaper not recognized\n", " if journal not in [\"lemonde\",\"nyt\"]:\n", " print(\"Our services doesn't scrape this website for the moment.\\nPlease contact us to build a scraper for this journal :D\")\n", "\n", " def get_parsed_text(self,url):\n", " req = requests.get(url)\n", " soup = BeautifulSoup(req.text, 'html.parser')\n", " return soup\n", "\n", " def scrape_lemonde(self,url):\n", " soup = self.get_parsed_text(url)\n", " title = soup.find('h1').text\n", " body_tags = soup.article.find_all([\"p\", \"h2\"], recursive=False)\n", " body = \"\"\n", " for tag in body_tags:\n", " body += tag.get_text()\n", " first100 = body[:100]\n", " return Content(url, title, first100)\n", "\n", " def scrape_nytimes(self,url):\n", " bs = self.get_parsed_text(url)\n", " title = bs.find('h1').text\n", " lines = bs.select('div.StoryBodyCompanionColumn div p')\n", " body = '\\n'.join([line.text for line in lines])\n", " first100 = body[:100]\n", " return Content(url, title, first100)\n", "\n", " def get(self,url):\n", " if self.journal == \"lemonde\":\n", " return self.scrape_lemonde(url)\n", " elif self.journal == \"nyt\":\n", " return self.scrape_nytimes(url)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "### Le monde scraper\n", "\n", "Now let's generate a scraper for articles of the newspaper `Le Monde`." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 37, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Url\n", "---------\n", "https://www.lemonde.fr/mondephilatelique/article/2021/04/03/le-mouvement-wiener-werkstatte-dans-cartes-postales-magazine_6075495_5470897.html\n", "\n", "Title\n", "----------\n", "Le mouvement Wiener Werkstätte, dans « Cartes postales magazine »\n", "\n", "First 100 characters\n", "-----------\n", "« Le 22 janvier 2012 décède Paul Armand, ce qui entraîne la mort de Cartes postales et collections (\n" ] } ], "source": [ "# Create the Le Monde Crawler\n", "lemonde_crawler = Crawler(\"lemonde\")\n", "\n", "lemonde_content = lemonde_crawler.get(\"https://www.lemonde.fr/mondephilatelique/article/2021/04/03/le-mouvement-wiener-werkstatte-dans-cartes-postales-magazine_6075495_5470897.html\")\n", "\n", "print(\"Url\")\n", "print(\"---------\")\n", "print(lemonde_content.url)\n", "\n", "print(\"\\nTitle\\n----------\")\n", "print(lemonde_content.title)\n", "\n", "print(\"\\nFirst 100 characters\\n-----------\")\n", "print(lemonde_content.first100)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "### New York Times scraper\n", "\n", "Now a scraper for articles of New York Times." 
], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 38, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Url\n", "---------\n", "https://www.nytimes.com/2021/04/02/us/politics/capitol-attack.html\n", "\n", "Title\n", "----------\n", "Driver Rams Into Officers at Capitol, Killing One and Injuring Another\n", "\n", "First 100 characters\n", "-----------\n", "WASHINGTON — The band of razor wire-topped fencing around the Capitol had recently come down. The he\n" ] } ], "source": [ "# Create the New York Times Crawler\n", "nyt_crawler = Crawler(\"nyt\")\n", "\n", "nyt_content = nyt_crawler.get(\"https://www.nytimes.com/2021/04/02/us/politics/capitol-attack.html\")\n", "\n", "print(\"Url\")\n", "print(\"---------\")\n", "print(nyt_content.url)\n", "\n", "print(\"\\nTitle\\n----------\")\n", "print(nyt_content.title)\n", "\n", "print(\"\\nFirst 100 characters\\n-----------\")\n", "print(nyt_content.first100)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "### Unknown journal name\n", "\n", "When the end user enters an unknown journal, it will spit out the pre-defined message." ], "metadata": { "collapsed": false } }, { "cell_type": "code", "execution_count": 41, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Our services doesn't scrape this website for the moment.\n", "Please contact us to build a scraper for this journal :D\n" ] } ], "source": [ "bbc_crawler = Crawler(\"bbc\")" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "## Wrap-up\n", "\n", "Bravo for having reading so far, hopefully you have:\n", "\n", "1. Learned how to manage a little web-scraping project using class in Python\n", "\n", "2. Understood the benefits of object oriented programming both at the level of code structuring and at the level of software design\n", "\n", "3. Known the basic underlying principles of frameworks like `requests`\n", "\n", "Now it's time to build your own framework :D\n", "\n", "If you want to know more web scraping, read the wonderful book by Ryan Mitchell from which this tutorial is partly inspired.\n", "\n", "*Web Scraping with Python: Collecting More Data from the Modern Web*\n", "\n", "If you want to dive into object-oriented programming, be sure to check the book of Mark Lutz:\n", "\n", "*Programming Python: Powerful Object-Oriented Programming*" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "3.7.6-final" } }, "nbformat": 4, "nbformat_minor": 0 }