you're using the bandwidth of somebody else, and you're freely retrieving and using their data.
General advice for your scraping or crawling projects
- Use an API if one is provided, instead of scraping data.
- Respect the Terms of Service (ToS).
- Respect the rules of robots.txt.
- Use a reasonable crawl rate, i.e. don't bombard the site with requests. Respect the crawl-delay setting provided in robots.txt; if there's none, use a conservative crawl rate (e.g. 1 request per 10-15 seconds).
- Identify your web scraper or crawler with a legitimate user agent string. Create a page that explains what you're doing and why, and link back to the page in your user agent string (e.g. 'MY-BOT (+https://yoursite.com/mybot.html)').
- If the ToS or robots.txt prevent you from crawling or scraping, ask the owner of the site for written permission before doing anything else.
- Don't republish your crawled or scraped data, or any derivative dataset, without verifying the license of the data or obtaining written permission from the copyright holder.
- If you doubt the legality of what you're doing, don't do it, or seek the advice of a lawyer.
- Don't base your whole business on data scraping. The website(s) that you scrape may eventually block you, as happened in Craigslist Inc. v. 3Taps Inc.
- Finally, you should be suspicious of any advice that you find on the internet (including mine), so please consult a lawyer.
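The robots.txt, crawl-rate, and user agent points above can be sketched in Python with the standard library's `urllib.robotparser`. This is only a minimal illustration under stated assumptions: the helper names (`load_rules`, `delay_for`, `polite_fetch`) and the 10-second fallback are made up for this example, and the user agent is the placeholder from the list above.

```python
import time
import urllib.request
from typing import Optional
from urllib.robotparser import RobotFileParser

# Placeholder bot identity, in the format suggested in the list above.
USER_AGENT = "MY-BOT (+https://yoursite.com/mybot.html)"
# Assumed conservative fallback when robots.txt sets no Crawl-delay.
DEFAULT_DELAY_SECONDS = 10.0


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file into a rule set."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def delay_for(rules: RobotFileParser, user_agent: str = USER_AGENT) -> float:
    """Return the site's Crawl-delay for this agent, or the fallback."""
    delay = rules.crawl_delay(user_agent)
    return float(delay) if delay is not None else DEFAULT_DELAY_SECONDS


def polite_fetch(url: str, rules: RobotFileParser,
                 user_agent: str = USER_AGENT) -> Optional[bytes]:
    """Fetch url only if robots.txt allows it, waiting the crawl delay first."""
    if not rules.can_fetch(user_agent, url):
        return None  # disallowed: skip the request entirely
    time.sleep(delay_for(rules, user_agent))
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read()
```

In a real crawler the rules would normally be loaded from the site itself, for example with `RobotFileParser.set_url(...)` followed by `read()`, rather than from a string. Note that none of this checks the Terms of Service, which still have to be read by a person.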
The relevant question isn't "Is this legal?" Instead, you should ask yourself: "Am I doing something that might upset someone? And am I willing to take the (financial) risk of their response?"
reference:
https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/