Web scraping (also known as web harvesting or web data extraction, and sometimes loosely grouped with data mining) is the process of extracting data from websites.
Basically, it’s a way to collect data from the web through an agent that visits a website, downloads the page, parses the content (or understands it), and organizes data in an automated manner.
In other words, instead of you copying the interesting parts of a web page and pasting them into a spreadsheet, web scraping takes this boring task off your shoulders and hands it to a computer program that can do it much faster and with fewer errors than a human can.
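To make those steps concrete, here’s a minimal sketch of the download, parse, and organize flow using only Python’s standard library. The names `download`, `HeadlineParser`, `parse_headlines`, `to_csv`, and the `SAMPLE_HTML` snippet are my own illustrative assumptions, not part of any particular tool, and the demo parses an inline page instead of hitting a real site.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def download(url):
    """Step 1: fetch a page and return its HTML (not called in this demo)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


class HeadlineParser(HTMLParser):
    """Step 2: collect the text inside every <h2> element."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.headlines[-1] += data


def parse_headlines(html):
    parser = HeadlineParser()
    parser.feed(html)
    return [headline.strip() for headline in parser.headlines]


def to_csv(headlines):
    """Step 3: organize the results as CSV text, ready for a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["position", "headline"])
    for position, headline in enumerate(headlines, start=1):
        writer.writerow([position, headline])
    return buffer.getvalue()


# Hypothetical page content standing in for a real download.
SAMPLE_HTML = """
<html><body>
  <h2>First story</h2><p>Some text.</p>
  <h2>Second story</h2><p>More text.</p>
</body></html>
"""

headlines = parse_headlines(SAMPLE_HTML)
print(to_csv(headlines))
```

In a real project you would probably pass the result of `download(...)` to `parse_headlines` and reach for a dedicated parsing library, but the three steps stay the same.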
In this story, I’ll show you 3 resources that will help you get started with web scraping, especially for data science.
But first, let’s see why web scraping is important in general and why it matters so much for data science in particular.
4 Reasons why web scraping is important
If you want to get data from a website, first check whether it has an API (Application Programming Interface). If the site already has one, you might not need to reinvent the wheel.
Note: An API is a programming interface that lets you access the website’s data directly.
Here are the four reasons why you might need a web scraper:
- The website you want to get data from does not provide an API in the first place.
- The API provided is not free.
- The API provided is limited, so you can only access it a certain number of times.
- The API does not expose all the data you wish to extract.
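To see why checking for an API first is worth the effort, here’s a rough comparison of reading the same value from an API-style JSON response and from raw HTML. Both payloads are invented for illustration (`api_response` and `html_page` are hypothetical), and the regular expression is only there to keep the sketch short; a proper HTML parser is usually the safer choice.

```python
import json
import re

# Hypothetical API response: the data is already structured.
api_response = '{"product": "Example Widget", "price": 19.99}'

# Hypothetical HTML page showing the same product.
html_page = (
    '<div class="product"><h1>Example Widget</h1>'
    '<span class="price">$19.99</span></div>'
)

# With an API, one call turns the response into a dictionary.
api_data = json.loads(api_response)
price_from_api = api_data["price"]

# Without an API, you have to dig the value out of the markup yourself.
# This pattern breaks as soon as the site changes its HTML.
match = re.search(r'<span class="price">\$([\d.]+)</span>', html_page)
price_from_html = float(match.group(1))

print(price_from_api, price_from_html)
```

Both routes reach the same number, but the scraping route depends on the page layout, which is why it makes sense only when one of the reasons above applies.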
Why web scraping for data science?
Data on the web gives data scientists plenty of material to collect, analyze, and visualize. You’ll find lots and lots of “raw materials” out there, and you can use them to build your own data science projects.
The web exposes a lot of interesting opportunities such as:
- You might find an interesting table on a Wikipedia page that you can retrieve and do some statistical analysis on.
- Perhaps you want to get a list of reviews on Amazon to perform some sentiment analysis on, create a recommendation system, or build a machine learning model to spot fake reviews.
- You might wish to visualize property listings from a real-estate site.
- You’d like to enrich the natural language processing model in your article classification project with more data from news sites and blogs, which can also help reduce bias.
- You might be interested in social media analytics on Twitter, Facebook, and other platforms.
- It might be interesting to monitor a tech-focused site like Hacker News to see the trending stories that you’re interested in.
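As a taste of the first idea in that list, here’s a small sketch that pulls the rows out of an HTML table and runs a basic statistic on one column. The table is made-up sample data (not taken from Wikipedia), and `TableParser` and `SAMPLE_TABLE` are hypothetical names I chose for illustration.

```python
from html.parser import HTMLParser
from statistics import mean


class TableParser(HTMLParser):
    """Collect table cells as a list of rows (simple, non-nested tables only)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data.strip()


# Invented sample data standing in for a table you found on a web page.
SAMPLE_TABLE = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>City A</td><td>100</td></tr>
  <tr><td>City B</td><td>250</td></tr>
  <tr><td>City C</td><td>400</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_TABLE)

header, *records = parser.rows
populations = [int(row[1]) for row in records]
average_population = mean(populations)

print(header)
print(records)
print(average_population)
```

Once the rows are in a plain Python list, you can hand them to whatever analysis or visualization tool you prefer.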
Once you learn web scraping and start paying more attention to it, you’ll find you can do a lot more with your data. It will also surface business ideas that you can build into your data science projects, or let you earn money from web scraping as a freelancer.
Where to learn web scraping?
There are many resources on the web that will help you learn more about web scraping. You can search on Udemy. Here are the 3 best courses I’ve found.
If you want to get started with web scraping fast and don’t want to pay a lot, I created an ebook: simple to digest, to the point, and above all affordable.
Not only that! You’ll have lifetime access to our community on Facebook, where you can ask questions and get help.