Is web scraping legal?
Web scraping falls into a legal gray area in both the EU and the USA.
So, it depends.
Almost any data source can be used in the context of a Data Mining project
Data Mining is…
… an exploratory process with uncertain outcomes
A proper engineering solution should be deployed once…
the prototype demonstrates its merits
Four major ways of collecting data from online sources
1 Manually browsing a web site (copy & paste)
2 Manually downloading a file
3 Pretending you are a human browsing a web site (web scraping)
4 Using an Application Programming Interface (API)
Web scraping can be done using
Web scraping using a programming language
Many languages provide functionality for reading and writing data from web sites, just like a regular web browser
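A minimal sketch of this idea in Python, using only the standard library: the page is downloaded with urllib and its links are collected with html.parser. The names LinkCollector, extract_links and fetch_html, the User-Agent string, and the example URLs are illustrative assumptions, not part of any particular tool.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collects the href targets of all <a> tags in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def fetch_html(url):
    """Download a page as text (requires network access)."""
    # Hypothetical User-Agent string; a script should identify itself honestly.
    request = Request(url, headers={"User-Agent": "example-course-scraper/0.1"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # Parse a small inline sample so the sketch runs without a network.
    sample = '<p><a href="/movies">Movies</a> <a href="https://example.org/x">X</a></p>'
    print(extract_links(sample, "https://example.com/index.html"))
```

In a real project, the output of fetch_html would be passed to extract_links, and the collected links would decide which pages to visit next.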
webscraper.io
is a more sophisticated tool that allows the user to select which elements of a web site are important and which links should be followed in order to gather more information
Potential issues with web scraping
Ways to detect whether the site is being viewed by a human
Robots Exclusion Protocol
The robots exclusion protocol (or robots.txt protocol) is a way for a web site to tell crawlers and other web bots which parts of the site they may access automatically
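As a hedged illustration, Python's standard library ships a robots.txt parser (urllib.robotparser). The sketch below parses a made-up robots.txt from a string, so it runs without a network; the rules, the user agent name, and the helper names build_parser and may_scrape are assumptions chosen for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every bot is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt rules given as a single string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_scrape(parser, url, user_agent="example-course-scraper"):
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for url in ("https://example.com/public/index.html",
                "https://example.com/private/report.html"):
        print(url, may_scrape(rules, url))
```

For a live site, the parser can instead be pointed at the site's own file with set_url(...) followed by read(). Note that robots.txt only states the site owner's wishes; it does not technically prevent access.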
Potential issue with web scraping: not all data are publicly available online
Workaround
Workarounds include using authentication to access the protected information or using an API
Application Programming Interfaces (APIs)
are protocols to interact with specific web sites that can be used by any registered user
Three steps for API access
API key is…
like a valet key for the web
An API usually provides multiple endpoints or functions. Examples:
most recent movies, most popular movies, search movies by keyword
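To make the idea of endpoints concrete, here is a small sketch that builds request URLs for three endpoints of an imaginary movie API. The base URL, the endpoint paths, the api_key parameter name, and the function build_request_url are all hypothetical; a real API documents its own.

```python
from urllib.parse import urlencode

# Hypothetical base URL and endpoint paths, for illustration only.
BASE_URL = "https://api.example.com/v1"
ENDPOINTS = {
    "recent": "/movies/recent",
    "popular": "/movies/popular",
    "search": "/movies/search",
}


def build_request_url(endpoint, api_key, **params):
    """Build the URL for one endpoint, passing the API key as a query parameter."""
    query = urlencode({"api_key": api_key, **params})
    return f"{BASE_URL}{ENDPOINTS[endpoint]}?{query}"


if __name__ == "__main__":
    key = "YOUR_API_KEY"  # placeholder; a real key is issued on registration
    print(build_request_url("recent", key))
    print(build_request_url("popular", key))
    print(build_request_url("search", key, query="star wars"))
```

Each endpoint answers a different question, but all of them are called in the same way: a URL, some parameters, and the key that identifies the registered user.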