Day 066 #FromZeroToHacker – Google Dorking

Google Dorking is a hacker technique that uses Google Search to find security holes in the configuration and computer code that websites use, something quite useful in our day-to-day.

Let’s find what we can in our daily #FromZeroToHacker challenge.

Table of contents
Introduction
What have I learnt today?
Stats
Resources

Introduction to Google Dorking

There is a lot more going on behind search engines than just looking for pictures of cats and dogs. We can leverage advanced searching techniques to our advantage to find all sorts of things.

Research underpins everything you do as a pentester.

Search engines such as Google or Bing are huge indexers of content spread across the World Wide Web. They use crawlers, or spiders, to discover this content.

What have I learnt today?

Let’s Learn About Crawlers

What are crawlers and how they work

Crawlers discover content on the Internet: sometimes by pure discovery, sometimes by following any and all URLs found on previously crawled websites, spreading like a virus.

Let’s visualise some things

This diagram is a high-level abstraction of how these web crawlers work:

![[day_065_web_crawlers.png]]

A web crawler discovers mywebsite.com and indexes the entire contents of the domain, looking for keywords and other miscellaneous information.

mywebsite.com has been scraped as having the keywords “Apple”, “Banana” and “Pear”, which the crawler stores in a dictionary. If a person searches for “Apple”, mywebsite.com will appear (in what order it appears is a whole other topic, which we will discuss below).

Google User search

When a website has links to other websites, this process repeats: the crawler follows those links, discovering new websites (and their keywords) along the way.
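
To make the idea concrete, here is a deliberately naive crawler sketch in Python (my own illustration, not part of the room): it records every word it sees as a “keyword” and spreads by following every link it finds.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# Hypothetical, massively simplified crawler. Real crawlers are far smarter
# about what counts as a keyword, politeness, robots.txt, ranking, etc.
def crawl(start_url, max_pages=10):
    index = {}                 # keyword -> set of URLs where it appears
    to_visit = [start_url]
    visited = set()

    while to_visit and len(visited) < max_pages:
        url = to_visit.pop(0)
        if url in visited:
            continue
        visited.add(url)

        try:
            html = urlopen(url).read().decode("utf-8", errors="ignore")
        except Exception:
            continue  # skip anything we cannot fetch

        # Naive "keyword" extraction: every word of four letters or more
        for word in re.findall(r"[A-Za-z]{4,}", html):
            index.setdefault(word.lower(), set()).add(url)

        # Follow every link on the page: this is how the crawl spreads
        for link in re.findall(r'href="([^"]+)"', html):
            to_visit.append(urljoin(url, link))

    return index

# A search is then just a dictionary lookup:
# crawl("https://mywebsite.com")["apple"] -> pages containing "apple"
```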

Google Crawlers spreading

Enter: Search Engine Optimisation

Search Engine Optimisation or SEO is an important topic. Search engines rank or prioritise one domain over others, depending on different factors.

Many of those factors are unknown and change from time to time, but some factors that influence a website’s ranking are:

  • How responsive your website is (to different browsers and devices such as computers, laptops, phones, etc).
  • How easy it is to crawl your website.
  • What kind of keywords your website has.

There are online tools too that will show you how optimised your domain is. Try Pagespeed from Google with any website you want, especially if you have your own. Let’s try mine, Let’s Learn About:

Google Pagespeed

Yay, I have a good score! Yes, it is a WordPress site and I did almost nothing, but nonetheless…

Robots.txt

Robots.txt is a file served at the root of a website that defines the permissions crawlers have over it. In it we can specify which types of crawler are allowed (don’t allow Bing, allow Google, etc.), which routes can or cannot be indexed, and a link to the sitemap.

Robots.txt file
  • User-agent specifies the type of crawler that can index your website (an asterisk allows all User-agents).
  • Allow or Disallow specifies the directory or files that a crawler can or can’t index.
  • Sitemap points to the sitemap (something like a map of your website with all the routes). This improves the SEO. More about it in the next section.
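
Putting those directives together, a minimal robots.txt could look something like this (the domain and routes are only placeholders):

```
User-agent: *                                # any crawler may index the site
Disallow: /admin/                            # ...but not this directory
Allow: /admin/public/                        # except this sub-directory
Sitemap: https://mywebsite.com/sitemap.xml   # where the sitemap lives
```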

We can prevent not only directories but also individual files from being indexed:

Robots.txt file disallow
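
The rule in question looks something like this (the $ wildcard, a Google extension, anchors the match to the end of the URL):

```
User-agent: *
Disallow: /*.ini$
```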

This will let any crawler index the website, but it won’t let them index any file with the .ini extension.

Sitemaps

Sitemaps are like a geographical map that maps your website: Routes, sections and how to navigate it. This helps the crawlers find content on your website.

Sitemaps.xml file

The sitemap uses the .xml format, one of the worst, and looks like this:

Sitemap example
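
As a rough sketch (with placeholder URLs, not the room’s actual example), a sitemap in the standard sitemaps.org format looks like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- one <url> entry per route we want crawlers to know about -->
  <url>
    <loc>https://mywebsite.com/</loc>
    <lastmod>2023-01-01</lastmod>
  </url>
  <url>
    <loc>https://mywebsite.com/blog/</loc>
  </url>
</urlset>
```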

But, how do sitemaps help search engines?

Search engines have a lot of data to process (the whole Internet!). Resources like sitemaps are helpful for crawlers because the necessary routes to content are already provided (including some routes that might otherwise stay hidden). All the crawler has to do is scrape this content, rather than having to discover every page by itself.

The easier a website is to crawl, the more optimised it is for the Search engine.

What is Google Dorking?

Google has crawled and indexed a huge number of websites, and we can search through all of them.

The average user looking for a book will just put the name of the book in the search field, maybe also the author’s name, and just go. But we can do much more.

If we want to narrow down our search query, we can use quotation marks. This will make Google search for that specific string. If you Google The Lord of the Rings, you will find results for the book, but also for the film, jewellery stores, and other stuff. But if you search for "The Lord of the Rings", it will only return results for websites containing exactly that string, in that order.

There are terms we can use to fine-tune our search:

  • site:bbc.co.uk Manchester United vs Arsenal will search for results only inside that website.
  • filetype:pdf will search for files by their extension.
  • cache:URL will show Google’s cached version of a specific URL.
  • intitle:"index of" will retrieve results where the page title contains index of (handy for finding exposed directory listings).

There are more search filters you can use. Here is a Google Dork cheatsheet.
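
These operators can also be combined. A few illustrative queries (the targets are placeholders, not from the room):

```
site:bbc.co.uk filetype:pdf
site:mywebsite.com intitle:"index of"
"the lord of the rings" filetype:epub
```

The first finds PDF documents indexed under bbc.co.uk, the second looks for exposed directory listings on our own domain, and the third combines an exact phrase with a file type.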

Summary

Things we learned today:

  • What search engines do and how they work.
  • Crawlers or spiders.
  • How SEO works and what it is.
  • What robots.txt is and how to configure it.
  • What sitemaps are.
  • Google Dorking: What it is.

Stats

From 67.173th to 67.206th.

Here is also the Skill Matrix:

Skills Matrix

Resources

Random room

TryHackMe: Google Dorking

Other resources

Google
Bing
Pagespeed
Google Dork cheatsheet.