The web is the greatest source of publicly available data in the world. However, much of this data is not easily accessible. To access data from the web, one of the key skills required is data scraping. This is a technique where a piece of software surveys and combs through a website to gather and extract data.
Every Internet user is familiar with this process on a smaller scale: we go to websites, see interesting information and try to copy it for later use. Yet manual copying is often impractical when the necessary information is too large in scale or is spread across multiple websites.
The main advantage of data scraping is its ability to work with virtually any website, from government sites to weather forecasts to organizational data. Hidden data can also be extracted from PDFs and web pages using data scraping platforms. It is among the most useful tools for software applications or websites in need of valuable information.
Data Scraping Techniques
Machine-readable data is created so that a computer can process it efficiently. This structured data comes in different formats such as CSV, JSON and XML. Some of the data available on the web is published in these formats.
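To make the idea of machine-readable formats concrete, here is a minimal sketch that reads the same two records from CSV, JSON and XML using only Python's standard library. The sample data, the field names (`city`, `temp_c`) and the helper function names are invented for illustration; they are not taken from any real website.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data: the same two records in three formats.
CSV_TEXT = "city,temp_c\nOslo,4\nCairo,29\n"
JSON_TEXT = '[{"city": "Oslo", "temp_c": 4}, {"city": "Cairo", "temp_c": 29}]'
XML_TEXT = (
    "<readings>"
    "<reading><city>Oslo</city><temp_c>4</temp_c></reading>"
    "<reading><city>Cairo</city><temp_c>29</temp_c></reading>"
    "</readings>"
)


def from_csv(text):
    # csv.DictReader uses the first line as column names; values are strings.
    reader = csv.DictReader(io.StringIO(text))
    return [{"city": row["city"], "temp_c": int(row["temp_c"])} for row in reader]


def from_json(text):
    # JSON already carries types, so numbers arrive as ints.
    return json.loads(text)


def from_xml(text):
    # Each <reading> element becomes one record.
    root = ET.fromstring(text)
    return [
        {"city": r.findtext("city"), "temp_c": int(r.findtext("temp_c"))}
        for r in root.findall("reading")
    ]


if __name__ == "__main__":
    for loader, text in ((from_csv, CSV_TEXT), (from_json, JSON_TEXT), (from_xml, XML_TEXT)):
        print(loader.__name__, loader(text))
```

All three loaders produce the same list of dictionaries, which is what makes such formats easy to combine with other data.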
The goal of data scraping is to access machine-readable as well as end-user facing data and combine it with other datasets for a user to explore independently of the source websites. When one is looking for data to use in individual applications, it is not always in the required format.
For instance, government sites are known for publishing PDFs instead of raw data. However, any content that can be viewed on a webpage can be scraped. During screen scraping, structured content is extracted from a web page with the help of a scraping tool or by writing a small piece of code.
While this method is quite powerful, it requires some understanding of how the web works and of what can and cannot be scraped.
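As an illustration of the "small piece of code" mentioned above, the following sketch pulls the rows of an HTML table into a list of lists using Python's built-in `html.parser`. The page content is a made-up sample embedded in the script, and the class and function names (`TableScraper`, `scrape_table`) are hypothetical; a real scraper would first download the page and would likely need more careful handling.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collects the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page content standing in for a downloaded web page.
SAMPLE_PAGE = """
<html><body>
<h1>Forecast</h1>
<table>
<tr><th>City</th><th>High</th></tr>
<tr><td>Oslo</td><td>4</td></tr>
<tr><td>Cairo</td><td>29</td></tr>
</table>
</body></html>
"""


def scrape_table(html):
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    for row in scrape_table(SAMPLE_PAGE):
        print(row)
```

The heading and other text outside the table are ignored, so only the structured content ends up in the result.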
How Does Data Scraping Work?
Data scraping is conducted either with a scraping tool or by writing pieces of code referred to as web scrapers. There are many tools that effectively scrape data from websites. Depending on the browser, a tool like Readability helps extract text from any web page.
Another tool, DownThemAll, allows users to download many files from a website in one go. Chrome’s Scraper extension also helps with extracting tables from websites. Then there are web scrapers written in programming languages such as Python, PHP or Ruby.
These web scrapers are pointed at specific pages and the elements within them, and the desired data is extracted. For effective scraping, it is very important to understand the structure of the website, its pages and the database behind it.
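Targeting specific elements might look like the sketch below, which collects only the links that carry a particular `class` attribute and turns their relative addresses into absolute URLs. The markup, the class name `report` and the base URL `https://example.org/data/` are all assumptions made for this example; on a real site the right selector has to be found by inspecting the page structure.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkScraper(HTMLParser):
    """Collects (text, href) for <a> tags that carry a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []
        self._href = None   # href of the matching link being read, or None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.css_class in classes and attrs.get("href"):
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


# Hypothetical page fragment and base URL, used only for illustration.
BASE_URL = "https://example.org/data/"
SAMPLE_PAGE = """
<ul id="reports">
<li><a class="report" href="/files/2011.pdf">Budget 2011</a></li>
<li><a class="report" href="/files/2012.pdf">Budget 2012</a></li>
</ul>
<a href="/about">About</a>
"""


def scrape_links(html, css_class, base_url):
    parser = LinkScraper(css_class)
    parser.feed(html)
    parser.close()
    # Resolve relative links against the page address.
    return [(text, urljoin(base_url, href)) for text, href in parser.links]


if __name__ == "__main__":
    for text, url in scrape_links(SAMPLE_PAGE, "report", BASE_URL):
        print(text, url)
```

The "About" link is skipped because it does not carry the targeted class, which is the point of aiming a scraper at particular elements rather than at the whole page.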
For anyone who wants to get started with scraping without the hassle of setting up a coding environment, ScraperWiki is a website that allows users to code scrapers in Python, PHP or Ruby.
The Limitations of Data Scraping
There are of course limits to what can be scraped. Some factors that make data hard to scrape from a site include:
- Badly formatted HTML code on a web page (common to older government websites)
- Blocking of access by server administrators
- Session-based systems that place cookies to track users
- Authentication systems that prevent automatic access
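The first factor in the list, badly formatted HTML, can sometimes be worked around in code. The sketch below assumes one common kind of sloppiness, table markup where closing tags such as `</td>` and `</tr>` are simply missing, and treats the start of a new cell or row as the end of the previous one. The class name `TolerantTableScraper` and the sample markup are invented for illustration, and other kinds of broken markup would need different handling.

```python
from html.parser import HTMLParser


class TolerantTableScraper(HTMLParser):
    """Reads table rows even when </td> and </tr> tags are missing."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def _close_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            if self._row:               # skip rows that never got a cell
                self.rows.append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()           # a new row implicitly ends the old one
            self._row = []
        elif tag in ("td", "th"):
            if self._row is None:       # cell without an enclosing <tr>
                self._row = []
            self._close_cell()          # a new cell implicitly ends the old one
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._close_row()               # flush whatever is still open at the end


# Hypothetical markup with the closing tags left out.
MESSY_PAGE = "<table><tr><td>Oslo<td>4<tr><td>Cairo<td>29</table>"


def scrape_messy_table(html):
    parser = TolerantTableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    print(scrape_messy_table(MESSY_PAGE))
```

The other factors in the list (blocked access, session cookies and authentication) are restrictions put in place by the site and are not problems a more forgiving parser can solve.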
Another set of limitations consists of legal and regulatory barriers. Many countries recognize database rights which limit the re-use of some online information. Commercial organizations and NGOs, for instance, forbid data scraping in most cases.
Scraping freely available governmental data is generally acceptable, but scraping information in a way that infringes the privacy of individuals or violates data privacy laws is not. Some websites try to prevent scraping by prohibiting it in their online Terms of Service. However, depending on the jurisdiction, some users (journalists, for example) may have special rights of access.
In a perfect world, all data would be easily available to everyone. Unfortunately, this is far from the truth (especially when it comes to government research!). But all in all, data scraping helps people retrieve and extract specific data efficiently.