Understand object-oriented programming (OOP) by building a minimal Web Scraping framework 🇬🇧¶
What you are going to learn¶
requests is a very popular Python package because it provides many convenient methods for making requests, parsing responses and handling exceptions. One could also use the standard-library urllib package, but for the same tasks requests is overall much easier to use thanks to its code design. You can clearly see the philosophy of its creator in the website's motto: "HTTP for Humans".
The objective of this tutorial is to introduce you to object-oriented programming (OOP) by imitating a working example from requests. Once you have a grasp of the very basic principles of OOP, we will move on to a more elaborate example to help you better appreciate the benefits of OOP.
At the end you will be able to build a minimal web scraping framework that scrapes articles from the famous French newspaper Le Monde and the American newspaper The New York Times.
Before starting this tutorial, make sure you have installed requests and Beautiful Soup with:
pip install requests
pip install beautifulsoup4
A working example in requests¶
Here is a quick starting example using requests. As you can see from the output:

- The get method returns a class object.
- The status_code tells us that the request is successful.
- The text property returns the source code of the Google homepage.
- The cookies property returns your cookies.
[43]:
import requests
r = requests.get('https://www.google.com/')
print("The get method returns a class object.")
print("-------------")
print(type(r))
print("\nThe status code tells us that the request is successful.")
print("-------------")
print(r.status_code)
print("\nThe text property returns the source code of Google homepage.")
print("-------------")
print(type(r.text))
print(r.text)
print("\nThe cookies property returns your cookies.")
print("-------------")
print(r.cookies)
The get method returns a class object.
-------------
<class 'requests.models.Response'>
The status code tells us that the request is successful.
-------------
200
The text property returns the source code of Google homepage.
-------------
<class 'str'>
<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="fr"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title> [... full HTML source truncated ...]
The cookies property returns your cookies.
-------------
<RequestsCookieJar[<Cookie CONSENT=PENDING+267 for .google.com/>]>
Class, constructor, method and property¶
When we talk about object-oriented programming (OOP), no matter which language you are using (some languages, like Java and C++, are more natively class-based), the four most fundamental concepts are class, constructor, method and property.
Let's construct our first class in Python. A class is a concept, a blueprint. Let's say we want to define what a human being (Person) is.
Typically you would start building your Person class (capitalized by convention) by defining its properties (name, nationality, job) in a constructor (the __init__ method here). You would also define the class's methods (what a human can do).
One thing that might seem odd to beginners is the self parameter. Roughly speaking, it's a placeholder that allows your methods to access the instance's properties.
The code should be quite self-explanatory.
[3]:
class Person:
    def __init__(self, name, nationality, job):
        self.name = name
        self.nationality = nationality
        self.job = job

    def greeting(self):
        print(f"Hello my name is {self.name}.\nI'm {self.nationality}.\nI work in {self.job}.")

# instantiate the class
me = Person("Xiaoou", "Chinese", "NLP")
# use the greeting method
me.greeting()
# access the name property
print(me.name)
Hello my name is Xiaoou.
I'm Chinese.
I work in NLP.
Xiaoou
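Before moving on, one point worth seeing in action: a class is a template, and each call to Person(...) produces an independent instance with its own property values. A quick sketch (the names below are invented for illustration):

```python
class Person:
    def __init__(self, name, nationality, job):
        self.name = name
        self.nationality = nationality
        self.job = job

    def greeting(self):
        print(f"Hello my name is {self.name}.")

# two independent instances built from the same class
alice = Person("Alice", "French", "journalism")
bob = Person("Bob", "American", "teaching")

# each instance keeps its own property values
print(alice.name)  # Alice
print(bob.name)    # Bob

# reassigning a property on one instance does not affect the other
bob.job = "editing"
print(alice.job)   # journalism
print(bob.job)     # editing
```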
So far so good. We have built our first class. Now let's imitate the schema used in the working example of requests we saw at the beginning. As a reminder, the working code block is:
[ ]:
r = requests.get('https://www.google.com/')
print(type(r))
print(r.status_code)
print(r.text)
Imitate the requests framework¶
When you run requests.get("https://www.google.com/"), you get back an object of class <class 'requests.models.Response'>. Among other things, this object has two properties: status_code and text. Let's replicate this schema with an example of article scraping.
So basically, we want a function/method call that returns an object.
We first create the class Content, and then the function scrape_lemonde, which encapsulates the scraped url, title and first 100 characters in an object created from the Content class.
[23]:
import requests
from bs4 import BeautifulSoup

class Content:
    def __init__(self, url, title, first100):
        self.url = url
        self.title = title
        self.first100 = first100

def get_parsed_text(url):
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    return soup

def scrape_lemonde(url):
    soup = get_parsed_text(url)
    title = soup.find('h1').text
    body_tags = soup.article.find_all(["p", "h2"], recursive=False)
    body = ""
    for tag in body_tags:
        body += tag.get_text()
    first100 = body[:100]
    return Content(url, title, first100)
url = 'https://www.lemonde.fr/mondephilatelique/article/2021/04/03/le-mouvement-wiener-werkstatte-dans-cartes-postales-magazine_6075495_5470897.html'
content = scrape_lemonde(url)
print("Url")
print("---------")
print(content.url)
print("\nTitle\n----------")
print(content.title)
print("\nFirst 100 characters\n-----------")
print(content.first100)
Url
---------
https://www.lemonde.fr/mondephilatelique/article/2021/04/03/le-mouvement-wiener-werkstatte-dans-cartes-postales-magazine_6075495_5470897.html
Title
----------
Le mouvement Wiener Werkstätte, dans « Cartes postales magazine »
First 100 characters
-----------
« Le 22 janvier 2012 décède Paul Armand, ce qui entraîne la mort de Cartes postales et collections (
What are the benefits of such an approach?
Imagine that the end user of your product only has the scrape_lemonde function. They would then need to know three more functions, get_url, get_content and get_title, to be able to use your scraper.
That's quite complex and counter-intuitive.
An object-oriented approach, by contrast, is much more comprehensible, because a webpage naturally has a title, a url and text. Your end user doesn't need to know how you obtain this information; they just need to know how to access it. That's the principle of encapsulation.
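To make the contrast concrete, here is a small sketch, with invented helper names, of the non-encapsulated alternative next to the object-oriented one:

```python
# Without encapsulation: the user juggles loose values through
# separate helpers (hypothetical names echoing get_url, get_title
# and get_content from the text).
def get_url(page):
    return page[0]

def get_title(page):
    return page[1]

def get_content(page):
    return page[2]

page = ("https://example.com", "A title", "Some text")
print(get_title(page))  # the user must memorize three function names

# With encapsulation: one object whose attributes are named after
# what a webpage naturally has.
class Content:
    def __init__(self, url, title, first100):
        self.url = url
        self.title = title
        self.first100 = first100

content = Content(*page)
print(content.title)  # one object to learn, attributes discoverable
```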
Note that I've deliberately simplified some concepts relevant to software design, as I don't want to go too deep into them here. I hope you can already see how OOP works in vivo, and the benefits of such an approach.
Another benefit (besides encapsulation and better code structure) is its normalizing power. Let's say you now want to scrape The New York Times. It suffices to add a scrape_nytimes function that always returns the same kind of object (a Content with 3 properties).
This way you normalize your framework by defining a unified output, regardless of which scraper/function the user employs.
Let's implement it.
[25]:
# The new function
def scrape_nytimes(url):
    bs = get_parsed_text(url)
    title = bs.find('h1').text
    lines = bs.select('div.StoryBodyCompanionColumn div p')
    body = '\n'.join([line.text for line in lines])
    first100 = body[:100]
    return Content(url, title, first100)
# Exactly the same schema
url = 'https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html'
content = scrape_nytimes(url)
print("Url")
print("---------")
print(content.url)
print("\nTitle\n----------")
print(content.title)
print("\nFirst 100 characters\n-----------")
print(content.first100)
Url
---------
https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html
Title
----------
The Men Who Want to Live Forever
First 100 characters
-----------
Would you like to live forever? Some billionaires, already invincible in every other way, have decid
Isn’t that beautiful?
However, the structure of our code (or product) is still one step away from being fully intuitive. As a mere mortal, I feel that a scraper class would be a better place to start than having to know the two functions scrape_lemonde and scrape_nytimes.
Let's do it! Note that we reuse the Content class created earlier.
Then we integrate the two scraper functions into a new Crawler class.
The new Crawler class takes journal as a parameter, which allows us to adjust the scraper's behavior to the newspaper (each newspaper uses different tags around the text body).
It's important to add self as a parameter when you create functions in a class. When "functions" are created inside a class they are called methods, and the self parameter indicates that these methods act on instances of that class.
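The binding of self can even be made visible: calling a method on an instance is equivalent to calling the function stored on the class with the instance passed explicitly as the first argument. A minimal sketch (the Greeter class here is invented for illustration):

```python
class Greeter:
    def __init__(self, name):
        self.name = name

    def hello(self):
        return f"Hello {self.name}"

g = Greeter("Xiaoou")

# the usual call: Python fills in self automatically
print(g.hello())         # Hello Xiaoou

# the equivalent explicit call: self is just the first parameter
print(Greeter.hello(g))  # Hello Xiaoou
```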
The last thing worth mentioning is the print in the constructor, which displays a message when the user tries to crawl a newspaper not recognized by the Crawler.
[40]:
class Content:
    def __init__(self, url, title, first100):
        self.url = url
        self.title = title
        self.first100 = first100

class Crawler:
    def __init__(self, journal):
        self.journal = journal
        # if the newspaper is not recognized
        if journal not in ["lemonde", "nyt"]:
            print("Our service doesn't scrape this website for the moment.\nPlease contact us to build a scraper for this journal :D")

    def get_parsed_text(self, url):
        req = requests.get(url)
        soup = BeautifulSoup(req.text, 'html.parser')
        return soup

    def scrape_lemonde(self, url):
        soup = self.get_parsed_text(url)
        title = soup.find('h1').text
        body_tags = soup.article.find_all(["p", "h2"], recursive=False)
        body = ""
        for tag in body_tags:
            body += tag.get_text()
        first100 = body[:100]
        return Content(url, title, first100)

    def scrape_nytimes(self, url):
        bs = self.get_parsed_text(url)
        title = bs.find('h1').text
        lines = bs.select('div.StoryBodyCompanionColumn div p')
        body = '\n'.join([line.text for line in lines])
        first100 = body[:100]
        return Content(url, title, first100)

    def get(self, url):
        if self.journal == "lemonde":
            return self.scrape_lemonde(url)
        elif self.journal == "nyt":
            return self.scrape_nytimes(url)
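A design aside, not part of the tutorial's code: printing from the constructor still leaves a half-usable object behind (its get method would silently return None for an unknown journal). One possible stricter variant, sketched here purely as an alternative, raises an exception instead:

```python
class StrictCrawler:
    SUPPORTED = ("lemonde", "nyt")

    def __init__(self, journal):
        # fail fast: refuse to build a crawler for an unsupported journal
        if journal not in self.SUPPORTED:
            raise ValueError(f"No scraper available for journal: {journal!r}")
        self.journal = journal

try:
    StrictCrawler("bbc")
except ValueError as e:
    print(e)  # No scraper available for journal: 'bbc'

crawler = StrictCrawler("nyt")
print(crawler.journal)  # nyt
```

Whether to print or to raise is a design choice; raising makes misuse impossible to overlook, while printing keeps the tutorial's flow friendly for beginners.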
Le Monde scraper¶
Now let's create a scraper for articles of the newspaper Le Monde.
[37]:
# Create the Le Monde Crawler
lemonde_crawler = Crawler("lemonde")
lemonde_content = lemonde_crawler.get("https://www.lemonde.fr/mondephilatelique/article/2021/04/03/le-mouvement-wiener-werkstatte-dans-cartes-postales-magazine_6075495_5470897.html")
print("Url")
print("---------")
print(lemonde_content.url)
print("\nTitle\n----------")
print(lemonde_content.title)
print("\nFirst 100 characters\n-----------")
print(lemonde_content.first100)
Url
---------
https://www.lemonde.fr/mondephilatelique/article/2021/04/03/le-mouvement-wiener-werkstatte-dans-cartes-postales-magazine_6075495_5470897.html
Title
----------
Le mouvement Wiener Werkstätte, dans « Cartes postales magazine »
First 100 characters
-----------
« Le 22 janvier 2012 décède Paul Armand, ce qui entraîne la mort de Cartes postales et collections (
New York Times scraper¶
Now a scraper for articles of The New York Times.
[38]:
# Create the New York Times Crawler
nyt_crawler = Crawler("nyt")
nyt_content = nyt_crawler.get("https://www.nytimes.com/2021/04/02/us/politics/capitol-attack.html")
print("Url")
print("---------")
print(nyt_content.url)
print("\nTitle\n----------")
print(nyt_content.title)
print("\nFirst 100 characters\n-----------")
print(nyt_content.first100)
Url
---------
https://www.nytimes.com/2021/04/02/us/politics/capitol-attack.html
Title
----------
Driver Rams Into Officers at Capitol, Killing One and Injuring Another
First 100 characters
-----------
WASHINGTON — The band of razor wire-topped fencing around the Capitol had recently come down. The he
Unknown journal name¶
When the end user enters an unknown journal name, the crawler prints the predefined message.
[41]:
bbc_crawler = Crawler("bbc")
Our service doesn't scrape this website for the moment.
Please contact us to build a scraper for this journal :D
Wrap-up¶
Bravo for reading this far! Hopefully you have:
Learned how to manage a small web scraping project using classes in Python
Understood the benefits of object-oriented programming, both at the level of code structure and at the level of software design
Grasped the basic principles underlying frameworks like requests
Now it's time to build your own framework :D
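As a hedged sketch of one possible next step (the registry design and stub scrapers below are my own invention, not the tutorial's code): instead of hard-coding an if/elif chain in get, you could keep a dictionary mapping journal names to scraper methods, so that supporting a new newspaper means adding a single entry:

```python
class Content:
    def __init__(self, url, title, first100):
        self.url = url
        self.title = title
        self.first100 = first100

class Crawler:
    def __init__(self, journal):
        # registry: each supported journal maps to its scraper method
        self.scrapers = {
            "lemonde": self.scrape_lemonde,
            "nyt": self.scrape_nytimes,
        }
        if journal not in self.scrapers:
            print("No scraper for this journal yet.")
        self.journal = journal

    def scrape_lemonde(self, url):
        # stub standing in for the real parsing shown in the tutorial
        return Content(url, "a Le Monde title", "first 100 chars")

    def scrape_nytimes(self, url):
        return Content(url, "a NYT title", "first 100 chars")

    def get(self, url):
        # dispatch through the registry instead of an if/elif chain
        return self.scrapers[self.journal](url)

content = Crawler("nyt").get("https://example.com/article")
print(content.title)  # a NYT title
```

Adding a scraper for a third newspaper then only requires writing its scrape_xxx method and registering it in the dictionary.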
If you want to know more about web scraping, read the wonderful book by Ryan Mitchell from which this tutorial is partly inspired:
Web Scraping with Python: Collecting More Data from the Modern Web
If you want to dive into object-oriented programming, be sure to check out the book by Mark Lutz:
Programming Python: Powerful Object-Oriented Programming