What is Web Scraping?
It is the process of using various tools to extract content and data from a website.
The web is filled with data, but much of it is trapped inside HTML. Useful information, and even profit and innovative business models, is locked up in many of those pages.
Web scraping is used in a variety of digital businesses that rely on data harvesting. Some of the use cases include:
- Social media sentiment analysis — To understand valuable trends and tap into their potential.
- eCommerce pricing — To find the marketplace where one can earn the most profit.
- Machine learning — To gather more data.
- Investment opportunities — For example, in the stock market or real estate.
- Website Ranking — Performed using keyword research: content from various sites is analyzed and the sites are ranked according to predefined criteria.
- News Gathering and Analysis — News about an issue or a group of issues is gathered from various sources, i.e. sources with differing opinions and locations. Once gathered, the news can be analyzed easily.
Web scraping tools are usually programs written to sift through websites and extract the needed information. Their basic functionality (in order) includes:
- Recognize unique HTML site structures — To find certain information such as images, links, or headings. They can also be designed to recognize multiple tags.
- Extract and transform content — To pick up only the needed information while ignoring the useless.
- Store scraped data — The data can be stored wherever is convenient for the user, e.g. in a database or even a simple file.
- Extract data from APIs — Some sites provide APIs that make it easy to extract their data.
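The first three steps above can be sketched with nothing but the Python standard library. This is a minimal illustration, not a production scraper: the HTML snippet, the `LinkExtractor` class, and the `links_to_csv` helper are all made up for this example, and a real tool would download the page first.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this first.
SAMPLE_HTML = """
<html><body>
  <h1>Products</h1>
  <a href="/item/1">Blue mug</a>
  <a href="/item/2">Red mug</a>
  <p>Unrelated footer text</p>
</body></html>
"""


class LinkExtractor(HTMLParser):
    """Collects (href, link text) pairs and ignores every other tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None and data.strip():
            self.links.append((self._href, data.strip()))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


def links_to_csv(html):
    """Recognize <a> tags, extract their data, and store it as CSV text."""
    parser = LinkExtractor()
    parser.feed(html)
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["href", "text"])
    writer.writerows(parser.links)
    return parser.links, buffer.getvalue()


links, csv_text = links_to_csv(SAMPLE_HTML)
print(csv_text)
```

Here the CSV is written to an in-memory buffer to keep the sketch self-contained; writing to a file or a database would follow the same pattern.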
Web Scraping with Python
Python provides a rich set of web scraping tools and libraries that make it easy to scrape the desired web pages.
- Requests + Beautiful Soup — These two libraries are often used together. Requests downloads the web page, and Beautiful Soup extracts just the data you require from it.
- Scrapy — This is a complete application framework for crawling websites and extracting information. Not only does Scrapy provide the functionality of both the Requests and Beautiful Soup libraries, but it is also faster and more powerful for large crawls.
- Selenium — Although Selenium is mainly used for browser automation, it can also be used to scrape the web. It comes in handy especially for dynamic sites.
Some background information
HTTP: Hypertext Transfer Protocol
It is a text-based protocol used by browsers to access web content.
HTTP is a client-server protocol. The client makes requests and receives responses from the server. A client can be:
- Any web browser.
- Any mobile application.
- Any other program.
HTTP servers: HTTP clients make requests to HTTP servers.
A server hosts web pages and web content; this content can be either static or dynamic.
Although there are many request types, the most frequently used are:
- GET request to fetch resources from the target URL. This request is made by a browser when it retrieves web content from a web server.
- POST request to create/update resources. A good example would be updating your (the client’s) activity/status on social networking sites like Facebook, Twitter, etc.
- PUT request to idempotently create/update resources. If the same PUT request is made to a server multiple times, the additional requests have no further effect.
- HEAD request to get only the HTTP headers. It is used to request just the metadata for a target URL, without the body.
- DELETE request to delete resources from a website.
Whether these requests are implemented on a website is up to the website owner. If a website allows only reading of resources, it may support only the GET request.
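To make the request types concrete, here is a small sketch that builds one request per method with `urllib.request.Request` without sending anything. The URL, the JSON payload, and the `build_request` helper are placeholders invented for this illustration.

```python
import json
import urllib.request

# Hypothetical endpoint used only for illustration; nothing is sent.
URL = "https://example.com/api/items/1"


def build_request(method, payload=None):
    """Build (but do not send) a request with the given HTTP method."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(URL, data=data, headers=headers, method=method)


requests_to_make = [
    build_request("GET"),                        # fetch the resource
    build_request("POST", {"status": "hello"}),  # create/update
    build_request("PUT", {"status": "hello"}),   # idempotent create/update
    build_request("HEAD"),                       # headers only
    build_request("DELETE"),                     # delete the resource
]

for req in requests_to_make:
    print(req.get_method(), req.full_url, "body:", req.data)
```

Passing any of these objects to `urllib.request.urlopen` would actually send it; whether the server accepts the method is, as noted above, up to the site.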
Web servers, which host websites and web applications, stand by to handle requests. On receiving a request, they parse it, work out what is requested, and send back an HTTP response. Every HTTP response has a few standard parts:
- Status line with a code such as 200 or 404. Every status code has a meaning: 200 means the request succeeded, and 404 means the page wasn’t found.
- Response headers with metadata.
- Body of the response. This is typically what a web browser understands and displays on the screen. The body can also contain other content, such as JSON, that the client requested from the server.
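The three parts of a response can be inspected from Python. The sketch below is only an illustration: it starts a throwaway server on the local machine (the `DemoHandler` class and its `/data` path are invented for this example) so that it runs without internet access, then reads the status code, a header, and the body.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Tiny hypothetical server: one JSON resource, everything else is 404."""

    def do_GET(self):
        if self.path == "/data":
            body = json.dumps({"greeting": "hello"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Ignore any proxy settings so the request really goes to the local server.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch(url):
    """Return (status code, Content-Type header, body bytes) for a GET request."""
    try:
        with opener.open(url, timeout=5) as response:
            return response.status, response.headers.get("Content-Type"), response.read()
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx, but the error object still carries the response.
        return err.code, err.headers.get("Content-Type"), err.read()


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

ok = fetch(base + "/data")
missing = fetch(base + "/nope")
server.shutdown()
server.server_close()

print(ok[0], ok[1], json.loads(ok[2]))
print(missing[0])
```

Note that `urllib` reports 4xx and 5xx responses by raising `HTTPError`, so the status code has to be read from the exception in that case.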
Web scraping is the programmatic, automated extraction of needed or useful information from web pages. HTTP requests are made programmatically to fetch the needed content; the content is downloaded, and special tools are then used to parse it and extract the specific information.
The content has a specific structure: HTML. HTML has a tree-like structure that can be navigated, and once it is parsed, specific information can be extracted from the web page.
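The tree-like structure can be made visible with a short sketch that records how deeply each tag is nested. The HTML string and the `TreeOutliner` class are invented for this illustration, and the depth counter assumes every tag is properly closed (void tags such as `<br>` would need extra handling).

```python
from html.parser import HTMLParser

SAMPLE_HTML = "<html><body><div><h1>Title</h1><p>Some <b>bold</b> text</p></div></body></html>"


class TreeOutliner(HTMLParser):
    """Records each element together with its depth in the tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1  # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # back up to the parent's level


outliner = TreeOutliner()
outliner.feed(SAMPLE_HTML)
for depth, tag in outliner.outline:
    print("  " * depth + tag)
```

Each level of indentation in the printed outline corresponds to one step down the tree, which is what a parser navigates when it looks for a specific element.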
Fetching and Parsing Content
Two steps are involved in Web Scraping:
- Fetching Content: It involves programmatically accessing the content of a website. It is done by making HTTP requests to that site using a client library.
- Parsing Content: It involves extracting the information from the content that is fetched. It can be done with the help of several technologies, such as HTML parsing, DOM parsing, and computer vision.
HTTP client libraries are used to make HTTP requests and download content from a URL. Many such libraries are available in Python, such as urllib, urllib2, Requests, httplib, and httplib2. When to use which one:
- Requests: A high-level intuitive API. Easy to use.
- httplib and httplib2: Lower-level HTTP clients. httplib is part of the Python 2 standard library (renamed http.client in Python 3), and httplib2 is a separate, enhanced third-party library. They give fine-grained control over how each HTTP request is configured.
- urllib and urllib2: Separate, non-overlapping modules in Python 2.7. In Python 3, urllib subsumes the old urllib2, which is Python 2 only. urllib is part of the Python standard library, so there is no need to pip install it separately. In Python 3 it has four distinct namespaces for different operations:
- urllib.request is for opening and reading URLs.
- urllib.error contains the exceptions raised by urllib.request.
- urllib.parse is for parsing URLs.
- urllib.robotparser is for parsing robots.txt files.
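The four namespaces can be seen working together in the following sketch. The domain, the robots.txt rules, the `my-scraper` user agent, and the `download` helper are all placeholders for illustration; the script only prepares a request and does not contact any site unless `download` is called.

```python
import urllib.error
import urllib.parse
import urllib.request
import urllib.robotparser

# urllib.parse: build a URL from pieces (example.com is a placeholder domain).
query = urllib.parse.urlencode({"q": "web scraping", "page": 2})
url = urllib.parse.urljoin("https://example.com/docs/", "search") + "?" + query

# urllib.robotparser: check hypothetical robots.txt rules before fetching.
robots = urllib.robotparser.RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])
allowed = robots.can_fetch("my-scraper", url)
blocked = robots.can_fetch("my-scraper", "https://example.com/private/report")

# urllib.request: prepare the request with a custom User-Agent header.
request = urllib.request.Request(url, headers={"User-Agent": "my-scraper"})


def download(req):
    """Send the request; return the body, or None if the fetch fails."""
    try:
        with urllib.request.urlopen(req, timeout=10) as response:
            return response.read()
    except urllib.error.HTTPError as err:  # server answered with 4xx/5xx
        print("HTTP error:", err.code)
    except urllib.error.URLError as err:   # DNS failure, refused connection, ...
        print("Connection problem:", err.reason)
    return None


print(url, allowed, blocked)
```

In a real scraper the rules would usually be loaded from the site's own robots.txt with `set_url` and `read` rather than passed in as a list.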
URL stands for Uniform Resource Locator. It can be seen as an address with each part conveying different information. Below are the main parts of a URL.
- Scheme — Every URL begins with a scheme, which tells the browser the type of address, typically http or https. Though not always shown, the scheme is an integral part of a URL.
- Domain Name — The major part of a URL. Different pages on the same site share the same domain name, e.g. www.facebook.com.
- File Path — Also known as the path. It tells the browser to load a specific page on the site and is given after the domain name.
- Parameters — Some URLs have a string of characters after the path that begins with a question mark (?). This is called the parameter string, also known as the query string.
- Anchor — Denoted by the hash symbol (#), it tells the browser to load a particular part of the web page. Often referred to as the URL fragment.
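The parts listed above map directly onto what `urllib.parse.urlsplit` returns. The URL below is a made-up placeholder chosen so that every part is present.

```python
from urllib.parse import parse_qs, urlsplit

# Placeholder URL chosen to contain every part described above.
url = "https://www.example.com/articles/scraping.html?lang=en&page=2#fetching"

parts = urlsplit(url)
print("Scheme:     ", parts.scheme)
print("Domain name:", parts.netloc)
print("File path:  ", parts.path)
print("Parameters: ", parts.query)
print("Anchor:     ", parts.fragment)

# The parameter string can be broken down further into names and values.
params = parse_qs(parts.query)
print(params)
```

`parse_qs` returns each value as a list because the same parameter name may appear more than once in a URL.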
As this is an introductory post, we will consider the overall process of web scraping without going into much detail. Also, this tutorial assumes that the reader has a basic understanding of Python, HTML, CSS selectors, and XPath, and knows how to use Google Chrome’s Developer Tools to find them.
The output is
Methods to prevent web scraping
Looking at the above example, it may seem pointless to go to such lengths just to fetch the list in the table of contents, which could easily be done with a simple copy and paste. But bear in mind that the same code can be used again and again to fetch the table of contents of every Wikipedia page; we just need to change the URL. This in turn saves us a lot of time.
Also, in the future we will tackle some much bigger tasks using web scraping. As this is an introductory post, I have kept the example easy and concise.