Web servers often hold hidden or private content that isn’t linked anywhere for the average visitor. Let’s see how we can discover that elusive content.
Time to be Sherlock Holmes in our daily #FromZeroToHacker challenge.
Table of contents

- Introduction
- What I have learnt today?
- Stats
- Resources
Introduction to Content Discovery
What is content discovery?
Content discovery is about finding content that is out of reach for the average user, content the web developers didn’t intend for you to access. This content may be videos, pictures, a website feature only for admins, hidden data, etc. In short, content that isn’t intended for public access.
Older versions of the website, configuration files, functionality only available (in theory) to admins, administration panels, etc., are all things we can, and should, try to find.
There are three main ways to discover content on a website: manual, automated, and OSINT (Open-Source INTelligence).
Manual Discovery
This is done by us without any program or code. Just the good old keyboard and mouse.
Robots.txt
The robots.txt file is a document that tells search engines which pages they are and aren’t allowed to index, and can even ban specific crawlers from the site altogether.
It is common practice to restrict certain website areas, such as administration portals or files that aren’t meant for normal users. This gives us an interesting list of places the owners don’t want people to discover, which is pretty useful for us.
Normally located at the root of the website, we can try to reach the file by adding /robots.txt to the address of the site we are targeting. For example, check https://www.google.com/robots.txt:
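At the time of writing, the first entries look something like this (abridged):

```
User-agent: *
Disallow: /search
Allow: /search/about
Disallow: /sdch
...
```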
And the list goes on and on. If we were crazy enough to attack Google, there is a list of places to poke at.
Favicon
The Favicon is a small icon displayed next to the browser’s address bar:
You may be thinking: Why should I care about that? It is just an icon!
And you are right: it is just an icon. But an icon that, if the web developer didn’t change it, can tip us off about which framework the site is using.
OWASP, the Open Worldwide Application Security Project (a non-profit dedicated to improving software security), hosts a database of common framework favicons that you can use to check the favicon you found.
For example, if we visit this example website, we can see a website under construction…and its favicon next to the URL. Let’s review the code:
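In the page source, the line that matters is the favicon link. It usually looks something like this (the exact path here is an assumption):

```html
<link rel="shortcut icon" type="image/x-icon" href="images/favicon.ico">
```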
Let’s get the MD5 hash (Message Digest algorithm 5, a hash function that produces a short fingerprint of a file) of the favicon on the terminal with curl URL | md5sum:
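A minimal sketch, assuming the favicon sits at images/favicon.ico under the site root (the URL is a placeholder):

```bash
# download the favicon quietly and pipe it into md5sum to get its fingerprint
curl -s https://example-site/images/favicon.ico | md5sum
```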
Let’s search that hash in the OWASP database:
Good! It is something called cgiirc, let’s do a quick Google search:
Nice, now we know what we are attacking!
sitemap.xml
While robots.txt restricts what search engines can look at, sitemap.xml lists every file the web developer wants to be indexed. This sometimes includes areas of the website that are hard to navigate, or even old pages and resources the site no longer uses but that are still available.
Let’s view an XML example:
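Here is a minimal sitemap with made-up URLs, following the standard sitemap protocol:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2023-01-01</lastmod>
  </url>
  <!-- an old page that is no longer linked anywhere, but still listed -->
  <url>
    <loc>https://example.com/old-admin-login</loc>
  </url>
</urlset>
```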
HTTP Headers
When requests are made to a web server, the server returns HTTP headers. These headers can sometimes contain useful information, such as the server software, the programming language in use, the machine’s IP address, and more.
Let’s see an example:
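Reconstructed from the values discussed below (the target URL is a placeholder), fetching just the headers with curl might return:

```bash
$ curl -sI http://example-target/
HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
X-Powered-By: PHP/7.4.3
Content-Type: text/html; charset=UTF-8
```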
Here we can see that the web server is nginx/1.18.0 running on Ubuntu, and the scripting language is PHP/7.4.3.
Framework stack
Once we have learned which framework a website is using, whether by checking the favicon, reading the source code, or looking for clues in comments, credits, or other sections of the website, we can consult the framework’s website for more information.
OSINT
There are also external sources that can help us discover information about our target. These resources are referred to as OSINT (Open-Source INTelligence), and they are free to use:
Google Hacking/Dorking
Google hacking, or Dorking, uses Google’s advanced search features to improve the way you search. You can use filters (one or more at the same time) to make your search more accurate. Let’s see some common filters:

- site -> site:reddit.com Food searches for ‘Food’ only on reddit.com
- inurl -> inurl:admin returns results that have ‘admin’ in the URL
- filetype -> filetype:pdf Socrates returns PDF files that mention Socrates
- intitle -> intitle:admin returns results that contain ‘admin’ in the title
There are more filters, and you can find out more about Google hacking on the Wikipedia page.
Wappalyzer
Wappalyzer is an online tool and browser extension that helps identify the technologies a website uses (programming language, framework, etc.). Pretty handy if you are lazy (as you should be).
Wayback Machine
The Wayback Machine is a historical archive of websites. You can search for a domain name and it will show you all the times the service took a snapshot of that web page. Imagine looking at how Twitter looked in 2009 or 2014!
In fact, let’s do it:
Twitter 2009:
Twitter 2014:
GitHub
Git is a version control system that tracks changes to the files in a project, locally. GitHub is a hosted version of Git on the internet, where repositories can be public or private. You can use GitHub’s search feature to look for company or website names and try to locate repositories belonging to your target.
If you find one, you may have access to their source code and, sometimes, passwords and other sensitive content.
S3 Buckets
S3 is a storage service provided by Amazon AWS that allows people to store files and even host static websites. The owner of the files can set access permissions to make files public, private, or writable. When these permissions are set incorrectly, we can gain access to files we shouldn’t see.
The URL is http(s)://{name}.s3.amazonaws.com where {name} is decided by the owner.
We can discover S3 buckets by finding the URL in the website’s page source or in GitHub repositories, or even by automating the process.
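A minimal sketch of the automated approach, assuming a hypothetical target called ‘tryhackme’ and guessing common bucket-name suffixes:

```bash
# probe likely bucket names and print the HTTP status code for each:
# 404 = no such bucket, 403 = bucket exists but is private, 200 = bucket is listable
for suffix in www assets public backup dev; do
  code=$(curl -s -o /dev/null -w "%{http_code}" "https://tryhackme-${suffix}.s3.amazonaws.com")
  echo "${code} tryhackme-${suffix}"
done
```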
Automated discovery
Automated discovery is the process of discovering content using tools (a program, script, etc.). This process usually consists of hundreds, thousands, or even millions of requests to a web server to check whether a file or directory exists, uncovering content we shouldn’t have access to.
This process often uses wordlists: text files containing a long list (thousands of entries or more) of commonly used words. For example, a password wordlist contains the most frequently used passwords, while a file wordlist contains the most common file names.
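The first lines of a typical web-content wordlist might look like this (illustrative):

```
admin
backup
config
login
uploads
...
```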
Read more about it in this post Wordlists for pentesters.
Automation tools
There are loads of automation tools available, each with different features. Let’s see three of them: ffuf, dirb, and GoBuster:
ffuf:
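A typical invocation (the SecLists wordlist path and the target URL are assumptions):

```bash
# ffuf substitutes each word from the wordlist wherever FUZZ appears in the URL
ffuf -w /usr/share/seclists/Discovery/Web-Content/common.txt -u http://MACHINE_IP/FUZZ
```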
dirb:
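The same scan with dirb, which takes the base URL first and the wordlist second (same assumed wordlist):

```bash
dirb http://MACHINE_IP/ /usr/share/seclists/Discovery/Web-Content/common.txt
```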
GoBuster:
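And with GoBuster in directory-enumeration mode (same assumed wordlist):

```bash
gobuster dir --url http://MACHINE_IP/ -w /usr/share/seclists/Discovery/Web-Content/common.txt
```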
Stats
From 186.990th to 179.720th. Still in the top 9% in TryHackMe!
Here is also the Skill Matrix:
Resources
Path: Web Fundamentals
Introduction to Web Hacking
Other resources
OWASP Favicon Database
Google Hacking filters
Google Dorks cheat sheet
Wordlists for pentesters