Day 016 #FromZeroToHacker – Content discovery

There are ways to discover hidden or private content on a web server. Let’s see how we can find that elusive content.

Time to be Sherlock Holmes in our daily #FromZeroToHacker challenge.

Table of contents
Introduction
What have I learnt today?
Stats
Resources

Introduction to Content Discovery

What is content discovery?

Content discovery is about finding content that is out of reach for the average user, content the web developers never intended you to access. This may be videos, pictures, a website feature reserved for admins, hidden data, etc. In short: content that isn’t intended for public access.

Older versions of the website, configuration files, functionality only available (in theory) to admins, administration panels, etc., are all things we can and should try to find.

There are three ways to discover content on a website: manually, with automated tools, and through OSINT (Open-Source INTelligence).

Manual Discovery

This is done by us without any program or code. Just the good old keyboard and mouse.

Robots.txt

The robots.txt file is a document that tells search engines which pages they are and aren’t allowed to index, and can even ban specific search engines altogether.

It is common practice to restrict certain website areas, such as administration portals or files that aren’t meant for normal website users. This gives us an interesting list of places the owners don’t want people to discover, which is pretty useful for us.

The file normally sits at the root of the website, so we can try to reach it by adding /robots.txt to the site we are targeting. For example, check https://www.google.com/robots.txt:

Content discovery by reading robots.txt

And the list goes on and on. If we were crazy enough to attack Google, there is a list of places to poke at.
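For illustration, a simple robots.txt might look something like this (the paths below are invented, not Google’s real entries):

    # example robots.txt with made-up paths
    User-agent: *
    Disallow: /admin/
    Disallow: /backups/

Every Disallow entry is a location the owner would rather keep out of search results, which makes it worth visiting by hand.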

Favicon

The Favicon is a small icon displayed next to the browser’s address bar:

Favicon image

You may be thinking: Why should I care about that? It is just an icon!

And you are right: it is just an icon. But an icon that, if the web developer didn’t change it, can tip us off about which framework they are using.

OWASP, the Open Worldwide Application Security Project (a non-profit dedicated to improving software security), hosts a database of common framework favicons that you can use to check the favicon you found.

For example, if we visit this example website, we can see a website under construction…and its favicon next to the URL. Let’s review the code:

Favicon URL

Let’s get the MD5 hash (Message Digest algorithm, a hashing function whose output we can use as a fingerprint) on the terminal with curl URL | md5sum:

md5hash of a Favicon
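Assuming the favicon lives at /favicon.ico (a placeholder URL; the exact path depends on the site), the command looks something like this, where -s simply hides curl’s progress output:

    # placeholder URL; pipe the raw icon bytes into md5sum
    curl -s https://example.com/favicon.ico | md5sum

The hash it prints is what we look up next.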

Let’s search that hash in the OWASP database:

OWASP database

Good! It is something called cgiirc. Let’s do a quick Google search:

Googling

Nice, now we know what we are attacking!

sitemap.xml

While robots.txt restricts what search engines can look at, sitemap.xml lists every page the web developer wants to be indexed. This sometimes includes areas of the website that are hard to navigate, or even old pages and resources the site no longer uses but that are still available.

Let’s view an XML example:

Content discovery through XML files
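A minimal sitemap.xml might look something like this (the URLs are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- example sitemap with placeholder URLs -->
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://example.com/</loc>
      </url>
      <url>
        <loc>https://example.com/old-account-page.html</loc>
      </url>
    </urlset>

Entries like the second one, pages that are listed but no longer linked from the site, are exactly what we are after.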

HTTP Headers

When requests are made to a web server, the server returns HTTP headers. These headers can sometimes contain useful information, such as the web server software, the programming language in use, the machine’s IP address, and more.

Let’s see an example:

Checking headers

Here we can see that the web server is nginx/1.18.0 running on Ubuntu, and the scripting language is PHP/7.4.3.
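We can reproduce this with curl; -v prints the full exchange, including the response headers. The target is a placeholder, and the header values below simply mirror the example above:

    # placeholder target; -v shows request and response headers
    curl -v http://example.com/

    ...
    < HTTP/1.1 200 OK
    < Server: nginx/1.18.0 (Ubuntu)
    < X-Powered-By: PHP/7.4.3
    ...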

Framework stack

Once we have learned which framework a website is using, either by checking the favicon, reading the source code, or looking for clues in comments, credits, or other sections of the website, we can check the framework’s own website for more information.

OSINT

There are also external sources available that can help us discover information about our target. These resources are referred to as OSINT (Open Source INTelligence), and they are free:

Google Hacking/Dorking

Google hacking, or Dorking, uses Google’s advanced search features to improve the way you search. You can combine filters (one or more at the same time) to make your search more precise. Let’s see some common filters:

  • site -> site:reddit.com Food -> This searches for ‘Food’ only on reddit.com
  • inurl -> inurl:admin -> This returns results that have ‘admin’ in the URL
  • filetype -> filetype:pdf Socrates -> This returns PDFs that contain Socrates in the text
  • intitle -> intitle:admin -> This returns results that contain ‘admin’ in the title

There are more filters, and you can find out more about Google hacking on the Wikipedia page.

Wappalyzer

Wappalyzer is an online tool and a browser extension that helps identify what technologies (Programming language, Framework, etc) a website uses. Pretty good if you are lazy (as you should be).

Wayback Machine

The Wayback Machine is a historical archive of websites. You can search for a domain name and it will show you all the times the service took a snapshot of that web page. Imagine looking at how Twitter looked in 2009 or 2014!

In fact, let’s do it:

Link to Wayback Machine

Wayback Machine

Twitter 2009:

Twitter 2009

Twitter 2014:

Twitter 2014

GitHub

Git is a version control system that tracks file changes in a project locally. GitHub is a hosted Git service on the internet. Repositories can be public or private. You can use GitHub’s search feature to look for company or website names and try to locate repositories belonging to your target.

If you find one, you may have access to their source code and, sometimes, passwords and other sensitive content.

S3 Buckets

S3 is a storage service provided by Amazon AWS that allows people to store files and even host static websites. The owner can set access permissions to make files public, private, or writeable. When those permissions are set incorrectly, we can access files that we shouldn’t.

The URL is http(s)://{name}.s3.amazonaws.com where {name} is decided by the owner.

We can discover S3 buckets by finding the URL in the website’s page source, in GitHub repositories, or even by automating the process.
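If we find or guess a bucket name, a quick way to probe it is to request the URL directly or list it with the AWS CLI without credentials ({name} is a placeholder, as above):

    # public buckets return an XML file listing; private ones return AccessDenied
    curl https://{name}.s3.amazonaws.com/
    # the AWS CLI can list a public bucket without credentials
    aws s3 ls s3://{name} --no-sign-request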

Automated discovery

Automated discovery is the process of discovering content using programming tools (a program, a script, etc.). It normally involves hundreds, thousands, or even millions of requests to a web server to check whether a file or directory exists, uncovering content that we shouldn’t have access to.

This process usually relies on wordlists: text files containing a long list (thousands of entries or more) of commonly used words. For example, a password wordlist contains the most frequently used passwords, while a file wordlist contains the most common filenames.

Read more about them in this post: Wordlists for pentesters.

Automation tools

There are loads of automation tools available, each one with different features. Let’s see three of them: ffuf, dirb, and GoBuster:

ffuf:

Content discovery with FFUF
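A typical ffuf run might look like this (the wordlist path and target IP are placeholders; FUZZ marks where each word from the list is substituted):

    # placeholder wordlist and target; FUZZ is replaced by each word in turn
    ffuf -w /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt -u http://10.10.10.10/FUZZ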

dirb:

Content discovery with DIRB
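dirb takes the target first and the wordlist second, for example (placeholders again):

    # placeholder target and wordlist
    dirb http://10.10.10.10/ /usr/share/wordlists/dirb/common.txt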

GoBuster:

Content discovery with GoBuster
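GoBuster’s dir mode does the same kind of directory and file brute-forcing, something like (placeholders once more):

    # placeholder target and wordlist; "dir" selects directory/file enumeration mode
    gobuster dir -u http://10.10.10.10/ -w /usr/share/wordlists/dirb/common.txt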

Stats

From 186,990th to 179,720th. Still in the top 9% on TryHackMe!

Here is also the Skill Matrix:

Skill Matrix

Resources

Path: Web Fundamentals

Introduction to Web Hacking

TryHackMe: Content discovery

Other resources

OWASP Favicon Database
Google Hacking filters
Google Dorks cheat sheet
Wordlists for pentesters