We’re essentially building a program that methodically explores the vastness of the internet, fetching web pages and extracting information. Something like a digital explorer charting unknown territories.

What's the Goal?

Before we dive into the nuts and bolts, what's the purpose of this web crawler? Is it for:

  - Search engine indexing?
  - Monitoring specific price changes on e-commerce sites?
The goal influences the design decisions we make. A crawler for search engine indexing has different priorities (scale, breadth) than one focused on monitoring specific price changes on e-commerce sites (depth, frequency).

Basic Components – The Crawler's Toolbox

At its heart, a web crawler is a loop. Here's a simplified breakdown:

  1. Seed URLs: You start with a list of initial URLs – the "seed."
  2. Download: Fetch the content of a URL.
  3. Extract Links: Parse the HTML to find more URLs.
  4. Add to Queue: Put these new URLs in a queue to be crawled later.
  5. Repeat: Go back to step 2!
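The five steps above can be sketched as a short breadth-first loop. This is a minimal, illustrative skeleton (not a production crawler): it uses only the Python standard library, and the `fetch` function is injected as a parameter (any callable that maps a URL to an HTML string, e.g. a wrapper around an HTTP client) so the loop itself stays easy to test. Names like `crawl` and `LinkExtractor` are my own for this sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl: seeds -> download -> extract links -> enqueue -> repeat."""
    queue = deque(seeds)                 # 1. start from the seed URLs
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = fetch(url)            # 2. download the page content
        except Exception:
            continue                     # skip pages that fail to fetch
        parser = LinkExtractor()
        parser.feed(html)                # 3. parse the HTML for more URLs
        for link in parser.links:
            absolute = urljoin(url, link)    # resolve relative links
            if absolute not in visited:
                queue.append(absolute)   # 4. queue new URLs for later
    return visited                       # 5. the while-loop is the "repeat"
```

Because `fetch` is injected, you can exercise the loop against an in-memory "web" instead of the real internet, which is handy for testing and for seeing the queue/visited-set mechanics in isolation.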

Seems simple, right? The devil's in the details, of course.