Day 066 #FromZeroToHacker – Google Dorking

Google Dorking is a hacker technique that uses Google Search to find security holes in the configuration and computer code that websites use, something quite useful in our day-to-day.

Let’s find what we can in our daily #FromZeroToHacker challenge.

Table of contents
Introduction
What have I learnt today?
Stats
Resources

Introduction to Google Dorking

There is a lot more going on behind search engines than just looking for pictures of cats and dogs. We can leverage advanced searching techniques to our advantage to find all sorts of things.

Research underpins everything you do as a pentester.

Search engines such as Google or Bing are huge indexers of content spread across the World Wide Web. They use crawlers, or spiders, to discover this content.

What have I learnt today?

Let’s Learn About Crawlers

What are crawlers and how they work

Crawlers discover content on the Internet: sometimes by pure discovery, sometimes by following any and all URLs found on previously crawled websites, spreading like a virus.

Let’s visualise some things

This diagram is a high-level abstraction of how these web crawlers work:

![[day_065_web_crawlers.png]]

A web crawler discovers mywebsite.com and indexes the entire contents of the domain, looking for keywords and other miscellaneous information.

mywebsite.com has been scraped as having the keywords “Apple”, “Banana” and “Pear”, which the crawler stores in a dictionary. If a person searches for “Apple”, mywebsite.com will appear (in what order it appears is a whole other topic, which we will discuss below).

Google User search

When a website has links to other websites, this process repeats: the crawler follows those links, discovering new websites (and their keywords) along the way.
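
To make the idea concrete, here is a deliberately naive crawler sketch in Python (my own illustration, not part of the room): it records every word it sees as a “keyword” and spreads by following every link it finds.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# Hypothetical, massively simplified crawler. Real crawlers are far smarter
# about what counts as a keyword, politeness, robots.txt, ranking, etc.
def crawl(start_url, max_pages=10):
    index = {}                 # keyword -> set of URLs where it appears
    to_visit = [start_url]
    visited = set()

    while to_visit and len(visited) < max_pages:
        url = to_visit.pop(0)
        if url in visited:
            continue
        visited.add(url)

        try:
            html = urlopen(url).read().decode("utf-8", errors="ignore")
        except Exception:
            continue  # skip anything we cannot fetch

        # Naive "keyword" extraction: every word of four letters or more
        for word in re.findall(r"[A-Za-z]{4,}", html):
            index.setdefault(word.lower(), set()).add(url)

        # Follow every link on the page: this is how the crawl spreads
        for link in re.findall(r'href="([^"]+)"', html):
            to_visit.append(urljoin(url, link))

    return index

# A search is then just a dictionary lookup:
# crawl("https://mywebsite.com")["apple"] -> pages containing "apple"
```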

Google Crawlers spreading

Enter: Search Engine Optimisation

Search Engine Optimisation or SEO is an important topic. Search engines rank or prioritise one domain over others, depending on different factors.

Many of those factors are unknown and change from time to time, but some factors that influence a website’s ranking are:

  • How responsive your website is (to different browsers and devices such as computers, laptops, phones, etc).
  • How easy it is to crawl your website.
  • What kind of keywords your website has.

There are online tools too that will show you how optimised your domain is. Try Pagespeed from Google with any website you want, especially if you have your own. Let’s try mine, Let’s Learn About:

Google Pagespeed

Yay, I have a good score! Yes, it is a WordPress site and I did almost nothing, but nonetheless…

Robots.txt

Robots.txt is a file served at the root of a website that defines the permissions crawlers have over it. In it we can specify which types of crawler are allowed (don’t allow Bing, allow Google, etc.), which routes can or cannot be indexed, and a link to the sitemap.

Robots.txt file
  • User-agent specifies the type of crawler that can index your website (an asterisk allows all User-agents).
  • Allow or Disallow specifies the directory or files that a crawler can or can’t index.
  • Sitemap points to the sitemap (something like a map of your website with all the routes). This improves the SEO. More about it in the next section.
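
Putting those directives together, a minimal robots.txt could look something like this (the domain and routes are only placeholders):

```
User-agent: *                                # any crawler may index the site
Disallow: /admin/                            # ...but not this directory
Allow: /admin/public/                        # except this sub-directory
Sitemap: https://mywebsite.com/sitemap.xml   # where the sitemap lives
```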

We can prevent not only directories but also individual files from being indexed:

Robots.txt file disallow
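
The rule in question looks something like this (the $ wildcard, a Google extension, anchors the match to the end of the URL):

```
User-agent: *
Disallow: /*.ini$
```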

This will let any crawler index the website, but it won’t let them index any file with the .ini extension.

Sitemaps

Sitemaps are like a geographical map that maps your website: Routes, sections and how to navigate it. This helps the crawlers find content on your website.

Sitemaps.xml file

The sitemap uses the .xml format, one of the worst, and looks like this:

Sitemap example
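
As a rough sketch (with placeholder URLs, not the room’s actual example), a sitemap in the standard sitemaps.org format looks like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- one <url> entry per route we want crawlers to know about -->
  <url>
    <loc>https://mywebsite.com/</loc>
    <lastmod>2023-01-01</lastmod>
  </url>
  <url>
    <loc>https://mywebsite.com/blog/</loc>
  </url>
</urlset>
```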

But, how do sitemaps help search engines?

Search engines have a lot of data to process (the whole Internet!). Resources like sitemaps are helpful for crawlers because the necessary routes to content are already provided (including some routes that might otherwise stay hidden). All the crawler has to do is scrape this content, rather than having to discover every page by itself.

The easier a website is to crawl, the more optimised it is for the Search engine.

What is Google Dorking?

Google has crawled and indexed a huge number of websites, and we can search through all of them.

The average user looking for a book will just put the name of the book in the search field, maybe also the author’s name, and just go. But we can do much more.

If we want to narrow down our search query, we can use quotation marks. This will make Google search for that specific string. If you Google The Lord of the Rings, you will find results for the book, but also for the film, jewellery stores, and other stuff. But if you search for "The Lord of the Rings", it will only return results for websites containing exactly that string, in that order.

There are terms we can use to fine-tune our search:

  • site:bbc.co.uk Manchester United vs Arsenal will search for results only inside that website.
  • filetype:pdf will search for files by their extension.
  • cache:URL will show Google’s cached version of a specific URL.
  • intitle:"index of" will retrieve results where the page title contains index of (handy for finding exposed directory listings).

There are more search filters you can use. Here is a Google Dork cheatsheet.
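
These operators can also be combined. A few illustrative queries (the targets are placeholders, not from the room):

```
site:bbc.co.uk filetype:pdf
site:mywebsite.com intitle:"index of"
"the lord of the rings" filetype:epub
```

The first finds PDF documents indexed under bbc.co.uk, the second looks for exposed directory listings on our own domain, and the third combines an exact phrase with a file type.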

Summary

Things we learned today:

  • What search engines do and how they work.
  • Crawlers or spiders.
  • How SEO works and what it is.
  • What robots.txt is and how to configure it.
  • What sitemaps are.
  • Google Dorking: What it is.

Stats

From 67.173th to 67.206th.

Here is also the Skill Matrix:

Skills Matrix

Resources

Random room

TryHackMe: Google Dorking

Other resources

Google
Bing
Pagespeed
Google Dork cheatsheet.