Web Scraping — Introduction

What is Web Scraping?

It is the process of using automated tools to extract content and data from a website.

The web is filled with data, but that data is trapped inside HTML. Useful information, and even profitable, innovative business models, lie locked up in many of those pages.

Web scraping is used in a variety of digital businesses that rely on data harvesting. Some of the use cases include:

  • Social media sentiment analysis — To understand valuable trends and tap into their potential.

Scraping tools

Web scraping tools are software programmed to sift through websites and extract the needed information. Their basic functionality, in order, includes:

  • Recognize unique HTML site structures — To find specific information such as images, links, or headings. Tools can also be designed to recognize multiple tags at once; see the sketch below.
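A minimal sketch of recognizing several tag types in one pass with Beautiful Soup, using a small made-up HTML document for illustration:

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document for illustration.
html = """
<html><body>
  <h1>Products</h1>
  <img src="/logo.png" alt="logo">
  <a href="/catalog">Catalog</a>
  <h2>Featured</h2>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all() accepts a list of tag names, so one call can recognize
# several kinds of elements: headings, images, and links here.
for tag in soup.find_all(['h1', 'h2', 'img', 'a']):
    print(tag.name, '->', tag.get('src') or tag.get('href') or tag.get_text())
```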

Web Scraping with Python

Python provides a rich set of web scraping tools and libraries that make it easy to scrape the desired web pages.

  • Requests + Beautiful Soup — These two are often used together: the Requests library downloads the web page, and Beautiful Soup extracts just the data you require from it. A sketch follows this list.
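A minimal sketch of the combo, using example.com as a stand-in URL: Requests downloads the page, and Beautiful Soup pulls out one piece of data (the page title).

```python
import requests
from bs4 import BeautifulSoup

# Requests downloads the raw HTML of the page...
response = requests.get('https://example.com')

# ...and Beautiful Soup parses it so we can extract just what we need.
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.get_text())  # e.g. "Example Domain"
```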

Some background information

HTTP (Hypertext Transfer Protocol)

It is a text-based protocol used by browsers to access web content.

HTTP is a client-server protocol: the client makes requests and receives responses from the server. A client can be:

  • Any web browser.

  • Any program that makes HTTP requests, such as a web scraping script.
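Because HTTP is plain text, we can even speak it by hand. A sketch using a raw TCP socket against example.com (normally a client library does this for us):

```python
import socket

# A minimal HTTP GET request, written out as the literal text the
# protocol defines: a request line, headers, then a blank line.
request = b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"

with socket.create_connection(('example.com', 80)) as sock:
    sock.sendall(request)
    response = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        response += chunk

# The response is text as well: a status line, headers, and the body.
print(response.decode('utf-8', errors='replace')[:200])
```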

The server hosts web pages and web content; this content can be either static or dynamic.

Static content refers to the HTML (HyperText Markup Language) and CSS (Cascading Style Sheets) that are not changed by client-side scripts. All browsers support a scripting language called JavaScript. If JavaScript running in your (client) browser updates the HTML and adds interactivity to the web page, that is a dynamic web page.
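This distinction matters for scraping: an HTTP client downloads only the static HTML and never runs the page's JavaScript, so content that a script would add after load simply will not be there. A quick sketch of what a scraper actually receives, again using example.com as a stand-in:

```python
import requests

# Download a page the way a scraper sees it: raw static HTML only.
# No JavaScript is executed, so anything a client-side script would
# inject into the page after load is absent from this text.
html = requests.get('https://example.com').text
print(html[:300])  # the first 300 characters of the static markup
```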

HTTP Requests

Although there are many request types, the most frequently used are:

  • GET — a request to fetch resources from the target URL. The browser makes this request when it retrieves web content from a web server.

Which of these requests a website implements is up to the website owner. If a website allows only reading of resources, it will support only the GET request.

HTTP Responses

Web servers that host websites and web applications stand by to address requests. On receiving a request, the server parses it, works out what is being requested, and sends back an HTTP response. Every HTTP response has a few standard fields:

  • Status line with a code such as 200 or 404. Every status code has a meaning: 200 means the response was sent back successfully, while 404 means the page was not found. A sketch follows this list.
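A sketch of reading the status line with Requests; httpbin.org is a public testing service that returns whatever status code you ask for.

```python
import requests

# A successful fetch: the server answers with status code 200.
ok = requests.get('https://httpbin.org/status/200')
print(ok.status_code)        # 200

# A missing resource: the server answers with status code 404.
missing = requests.get('https://httpbin.org/status/404')
print(missing.status_code)   # 404
```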

Web Scraping

Web scraping is the programmatic extraction of needed or useful information from web pages. It is the automated extraction of data from websites: HTTP requests are made programmatically to fetch the content, the content is downloaded, and special tools are then used to parse it and extract the specific information.

The content has a specific structure: HTML. HTML has a tree-like structure that can be navigated; once parsed, specific information can be extracted from the web page. A sketch of such navigation follows.
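A small sketch of walking the parsed tree with Beautiful Soup, on a made-up inline document:

```python
from bs4 import BeautifulSoup

html = "<html><body><div id='main'><p>First</p><p>Second</p></div></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# Navigate downward through the tree: html -> body -> div.
div = soup.body.div
print(div['id'])                                # main

# Move between siblings at the same level of the tree.
first = div.p
print(first.get_text())                         # First
print(first.find_next_sibling('p').get_text())  # Second

# Or climb back up toward the root.
print(first.parent.name)                        # div
```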

Fetching and Parsing Content

Two steps are involved in Web Scraping:

  • Fetching Content: It involves programmatically accessing the content of a website. It is done by making HTTP requests to that site using a client library.

  • Parsing Content: It involves navigating the tree structure of the downloaded HTML to extract the specific information that is needed.

Fetching Content:

  • HTTP client libraries are used to make HTTP requests and download content from a URL. Many such libraries are available in Python, such as urllib, urllib2, Requests, httplib, and httplib2. A sketch comparing two of them follows.
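A sketch fetching the same URL two ways, assuming Python 3 (where urllib.request is the standard-library successor to urllib2) and example.com as a stand-in:

```python
# Two ways to download the same page: the standard-library
# urllib.request, and the third-party Requests library.
from urllib.request import urlopen
import requests

url = 'https://example.com'

# Standard library: returns raw bytes; decoding is our job.
with urlopen(url) as resp:
    stdlib_html = resp.read().decode('utf-8')

# Requests: one call, with decoding handled for us.
requests_html = requests.get(url).text

print(stdlib_html[:60])
print(requests_html[:60])
```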

Understanding URLs

URL stands for Uniform Resource Locator. It can be seen as an address with each part conveying different information. Below are the main parts of a URL.

  • Scheme — Every URL begins with a scheme, which tells the browser the type of address; typically http or https. Though not always shown, the scheme is an integral part of the URL. A sketch of splitting a URL into its parts follows this list.
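Python's standard library can split a URL into its parts for us; a quick sketch with urllib.parse:

```python
from urllib.parse import urlparse

parts = urlparse('https://en.wikipedia.org/wiki/Web_scraping?action=raw#History')

print(parts.scheme)    # https  (the scheme discussed above)
print(parts.netloc)    # en.wikipedia.org  (the host)
print(parts.path)      # /wiki/Web_scraping
print(parts.query)     # action=raw
print(parts.fragment)  # History
```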

As this is an introductory post, we will consider the overall process of web scraping without going into much detail. This tutorial also assumes that the reader has a basic understanding of Python, HTML, CSS selectors, and XPath, and knows how to use Chrome's Developer Tools to find them.

Example
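A minimal sketch of the example, assuming the target is the Wikipedia article on web scraping and that the page's table of contents sits in a div with id="toc" (Wikipedia's classic markup; the selector may need adjusting if the markup changes):

```python
import requests
from bs4 import BeautifulSoup

# Fetch the Wikipedia article on web scraping.
url = 'https://en.wikipedia.org/wiki/Web_scraping'
response = requests.get(url)
response.raise_for_status()  # fail fast on a non-200 status

soup = BeautifulSoup(response.text, 'html.parser')

# In Wikipedia's classic markup, the table of contents is a div with
# id="toc"; each top-level entry wraps its title in a span with class
# "toctext". Print just the top-level section titles.
for item in soup.select('#toc li.toclevel-1 > a > span.toctext'):
    print(item.get_text())
```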

The output is:
History
Techniques
Software
Legal issues
Methods to prevent web scraping
See also
References

Looking at the example above, it may seem pointless to go to such lengths just to fetch the list in a table of contents, something easily done with a simple copy and paste. But bear in mind that the same code can be reused to fetch the table of contents of any Wikipedia page; we only need to change the URL. That, in turn, saves us a lot of time.

Also, in the future we will be tackling some quite big tasks using web scraping. As this is an introductory post, I have kept the example easy and concise.
