A quick guide to web crawling

Adel wants to buy a used car. He has a budget in mind: the car should be about 3 years old, should have done less than 40,000 km, and should ideally last for the next 5 years. He also wants certain features, like air-conditioning and leather seats. A web search reveals a number of websites offering a multitude of car listings, and while Adel is spoilt for choice, he struggles to compare them coherently. He also sees different prices for similar cars. He wonders whether there is a simpler, more efficient way to make an informed decision.

To be more specific, is it possible to “crawl” the web, collate the data and make an informed decision?  

Luckily, he knows his way around programming. He can write a Python program to implement “web crawling”, which automatically collects web pages given a starting address and some conditions (location, category, etc.). He can also perform “web scraping” using a pattern configured beforehand to match the HTML structure of the crawled pages, and store the relevant data in a database for further analysis, as in the sketch below.
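
The following is a minimal crawl-and-scrape sketch in Python using requests and BeautifulSoup. The listing site URL, the CSS selectors, and the field names are hypothetical placeholders rather than any real site's layout; an actual crawl would also need to respect the site's robots.txt and terms of use.

```python
# Minimal crawl-and-scrape sketch. The URL and CSS selectors below are
# hypothetical placeholders; adapt them to the structure of the real site.
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example-car-listings.com/used-cars?page={}"  # hypothetical

def scrape_listings(max_pages=3):
    cars = []
    for page in range(1, max_pages + 1):
        response = requests.get(BASE_URL.format(page), timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # Each listing is assumed to sit in a <div class="listing"> block.
        for listing in soup.select("div.listing"):
            cars.append({
                "title": listing.select_one(".title").get_text(strip=True),
                "price": listing.select_one(".price").get_text(strip=True),
                "km": listing.select_one(".mileage").get_text(strip=True),
                "color": listing.select_one(".color").get_text(strip=True),
            })
    return cars

if __name__ == "__main__":
    for car in scrape_listings():
        print(car)
```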

A web crawler built on open-source Python packages (such as Scrapy, Cola, BeautifulSoup, MechanicalSoup, or PySpider) visits websites and follows links from page to page until it reaches a maximum depth. Along the way it “scrapes” the targeted attributes from each page's structure, including year of manufacture, kilometers covered, color, and price. A Scrapy-based sketch of such a spider follows.
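
Below is a sketch of the same idea as a Scrapy spider, under the assumption that listings sit in div.listing blocks and pagination uses an a.next-page link; both selectors and the start URL are hypothetical. DEPTH_LIMIT is Scrapy's built-in setting for capping how far the crawl follows links.

```python
# Sketch of a Scrapy spider for collecting car listings. The domain and CSS
# selectors are hypothetical; DEPTH_LIMIT caps the crawl depth.
import scrapy

class CarSpider(scrapy.Spider):
    name = "cars"
    start_urls = ["https://example-car-listings.com/used-cars"]  # hypothetical
    custom_settings = {"DEPTH_LIMIT": 3}  # stop after three link hops

    def parse(self, response):
        # Extract the targeted attributes from each listing on the page.
        for listing in response.css("div.listing"):
            yield {
                "year": listing.css(".year::text").get(),
                "km": listing.css(".mileage::text").get(),
                "color": listing.css(".color::text").get(),
                "price": listing.css(".price::text").get(),
            }
        # Follow the pagination link until DEPTH_LIMIT is reached.
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Running this with `scrapy runspider car_spider.py -o cars.csv` would write the scraped rows to a CSV file, which the later analysis steps can pick up.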

Adel wants to get answers to the following questions:

– Which brand should he pursue?

– Which color should he pick?

– Does the price justify its value?

His crawl reveals, among other things, that a majority of buyers opt for black or white cars.

He then goes further and clusters the cars based on price, kilometers covered, and color, as in the sketch below.
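
One way to do this is with k-means from scikit-learn. The sketch below assumes the scraped data sits in a cars.csv file with price, km, and color columns (carried over from the scraping step); the numeric features are scaled and the categorical color column is one-hot encoded before clustering.

```python
# Minimal clustering sketch. The cars.csv file and its column names
# (price, km, color) are assumptions carried over from the scraping step.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

cars = pd.read_csv("cars.csv")

# Scale the numeric features and one-hot encode the color column so all
# features contribute on comparable scales.
features = pd.concat(
    [
        pd.DataFrame(
            StandardScaler().fit_transform(cars[["price", "km"]]),
            columns=["price", "km"],
        ),
        pd.get_dummies(cars["color"], prefix="color"),
    ],
    axis=1,
)

cars["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)
print(cars.groupby("cluster")[["price", "km"]].mean())
```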

Adel has now clustered the cars by price and kilometers covered, with color as an additional attribute. He could go further and define an “acceptability” index by adding more features, such as fuel efficiency, service history, and resale value. He could then arrive at a shortlist of cars that meet this acceptability threshold and make an informed decision, as sketched below.
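
A simple way to express such an index is a weighted sum of normalized features with a cut-off. The weights, the 0.7 threshold, and the extra columns (fuel_efficiency, history_score, resale_value) in the sketch below are illustrative assumptions, not values from the article.

```python
# Sketch of an "acceptability" index: a weighted score over normalized
# features with a cut-off. Weights, threshold and extra columns are
# illustrative assumptions.
import pandas as pd

cars = pd.read_csv("cars.csv")

def normalise(series, higher_is_better=True):
    # Rescale a column to the 0..1 range so different units are comparable.
    scaled = (series - series.min()) / (series.max() - series.min())
    return scaled if higher_is_better else 1 - scaled

cars["acceptability"] = (
    0.3 * normalise(cars["price"], higher_is_better=False)
    + 0.3 * normalise(cars["km"], higher_is_better=False)
    + 0.2 * normalise(cars["fuel_efficiency"])
    + 0.1 * normalise(cars["history_score"])
    + 0.1 * normalise(cars["resale_value"])
)

shortlist = cars[cars["acceptability"] >= 0.7].sort_values("acceptability", ascending=False)
print(shortlist.head(10))
```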

The above is a simple demonstration of what web crawling, data aggregation, and analytics can accomplish using open-source tools and data in the public domain.