Often you don't need to hack into anything to get credentials or confidential data. Many developers have the strange habit of leaving sensitive information in their code, and it is a pentester's job to get hold of that information.
Today I'm going to do a little tutorial on how you can crawl the web for sensitive information. To do this, I will use Scrapy, a very powerful crawling framework that is easy to configure in Python. The most important thing about this tool is the speed with which it performs the crawling. For this little proof of concept, I will show you how to get sensitive information from GitHub repositories.
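To begin with, we install Scrapy with "pip install scrapy". Once installed, we create a new project with the "scrapy startproject" command, followed by the project name.
Having created the project, we are going to create our first scraper. For this first example, I will analyze the v1s1t0r GitHub account.
We move into the spiders/ folder inside the project and generate the scraper with the "scrapy genspider" command, passing it the scraper name (scrapv1s1t0r) and the target domain.
This command creates a file scrapv1s1t0r.py containing the skeleton of the scraper: a class derived from scrapy.Spider with the attributes name, allowed_domains and start_urls, and an empty parse method.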
In the variable "allowed_domains" we list the domains that the scraper is allowed to explore. As there may be other related GitHub accounts that are interesting to check, and as GitHub serves the raw contents of files from raw.githubusercontent.com, we are going to allow two domains: github.com and raw.githubusercontent.com. The start_urls variable holds the target URL, in this case the GitHub page of v1s1t0r.
If we wanted to analyze a set of repositories instead of just one, it would be enough to save the URLs in a file, one per line, and assign that list to the start_urls variable with the following line:
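To make the effect of allowed_domains more concrete, here is a small sketch of the kind of check involved. It is only an approximation written for this post, not Scrapy's actual implementation, and is_allowed is a hypothetical helper name. The idea is that a host passes if it equals an allowed domain or is a subdomain of one, which is why raw.githubusercontent.com needs its own entry.

```python
from urllib.parse import urlparse

# The two domains allowed in this tutorial.
ALLOWED_DOMAINS = ["github.com", "raw.githubusercontent.com"]


def is_allowed(url, allowed_domains=ALLOWED_DOMAINS):
    """Approximate an offsite check: exact domain or any subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)
```

With only github.com in the list, a raw file URL would be rejected, because raw.githubusercontent.com is not a subdomain of github.com.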
start_urls = open('githubRepos.txt').read().splitlines()
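If the file might contain blank lines or notes, a slightly more defensive loader can help. The following is a minimal sketch under my own assumptions: load_start_urls is a hypothetical helper, and treating lines that start with "#" as comments is a convention I made up for the example.

```python
from pathlib import Path


def load_start_urls(path):
    """Return the non-empty, non-comment lines of a text file as a list of URLs."""
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines and '#' comments (hypothetical convention).
        if line and not line.startswith("#"):
            urls.append(line)
    return urls
```

In the scraper it would be used as start_urls = load_start_urls('githubRepos.txt').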
Next we proceed to parse the response, by means of the parse function. I am interested in saving the results in a file, which I have called "results.txt". In addition, I will only keep the content of the files: I am not interested in analyzing the GitHub directory listings, only the files themselves. To apply this restriction I check that the response comes from raw.githubusercontent.com.
Besides, I don't want to analyze only the v1s1t0r GitHub home page; I want the scraper to walk through the repositories until it has visited every file. To do this I will add code at the end of parse that generates new Scrapy requests for each URL under the three main paths of a GitHub repository: blob/master, tree/master and raw/master.
All that remains now is to establish what we want to look for. The content of the response is in response.body, so we can search it for patterns with Python's re library. This way we can search not only for literal words, but also for regular expressions. For example, let's look for the emails that appear in this repository.
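The check and the writing step could be sketched like this. These are hypothetical helpers of my own (is_raw_file and save_result are not part of Scrapy), and the tab-separated output format is just an assumption for the example.

```python
from urllib.parse import urlparse

RAW_HOST = "raw.githubusercontent.com"


def is_raw_file(url):
    """True when the URL is served by GitHub's raw-content host."""
    return urlparse(url).hostname == RAW_HOST


def save_result(url, matches, path="results.txt"):
    """Append one line per match to the results file, prefixed by the source URL."""
    with open(path, "a", encoding="utf-8") as out:
        for match in matches:
            out.write(f"{url}\t{match}\n")
```

Inside parse, the idea would be to call is_raw_file(response.url) first and only then pass whatever was found to save_result.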
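The link selection can be sketched as a plain function, independent of Scrapy. This is my own illustration: links_to_follow is a hypothetical name, and the repository paths in the usage note are invented.

```python
from urllib.parse import urljoin, urlparse

# Path fragments that identify file and directory views on the master branch.
FOLLOW_MARKERS = ("/blob/master", "/tree/master", "/raw/master")


def links_to_follow(base_url, hrefs):
    """Resolve hrefs against base_url and keep the unique ones on master paths."""
    selected = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        path = urlparse(absolute).path
        if any(marker in path for marker in FOLLOW_MARKERS) and absolute not in selected:
            selected.append(absolute)
    return selected
```

In the scraper, the hrefs could come from response.css("a::attr(href)").getall(), and each selected URL would be yielded as scrapy.Request(url, callback=self.parse). For example, given the hrefs "/v1s1t0r/some-repo/blob/master/README.md" and "/v1s1t0r/some-repo/issues", only the first one would be followed.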
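As a rough sketch of the email search, assuming a deliberately simple pattern rather than a strict RFC-compliant one: in Python 3, response.body is a bytes object, so the pattern below is compiled as bytes. find_emails is a hypothetical helper name of mine.

```python
import re

# Simple email pattern; it favours catching candidates over strict validation.
EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(body):
    """Return the unique email-like strings found in a bytes payload, in order."""
    seen = []
    for raw in EMAIL_RE.findall(body):
        # The pattern only matches ASCII characters, so this decode is safe.
        email = raw.decode("ascii")
        if email not in seen:
            seen.append(email)
    return seen
```

Inside parse this would be called as find_emails(response.body). Searching response.text with a str pattern would work as well.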
The scraper is executed by the following command:
scrapy runspider scrapv1s1t0r.py
After executing it, let’s see the results.txt file…
It looks like the repository owner left his personal email. When we search for it on Google, we find a reference to a famous hacker.
Does this GitHub account really belong to OscarAkaElvis? To check, I'm going to replace the email regular expression with the literal "OscarAkaElvis".
As you can see in this example, thanks to this powerful tool we can crawl the web in search of useful information. By adjusting the regular expression to what you need, you can find users, passwords, API keys, SSH private keys… In short, everything that has been left lying around and can be useful to us during a pentest.
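To give an idea of what such patterns might look like, here is a small sketch. The three expressions are illustrative assumptions of mine, not an exhaustive or authoritative list: a PEM private key header, the commonly seen "AKIA" prefix of AWS access key IDs, and a naive password assignment. scan_text is a hypothetical helper name.

```python
import re

# Illustrative patterns only; real-world scanning needs tuning to limit false positives.
SECRET_PATTERNS = {
    "private_key": re.compile(r"-----BEGIN (?:RSA |DSA |EC |OPENSSH )?PRIVATE KEY-----"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "password_assignment": re.compile(r"(?i)\bpassword\s*[:=]\s*\S+"),
}


def scan_text(text):
    """Return {pattern_name: [matches]} for every pattern that matched."""
    findings = {}
    for name, pattern in SECRET_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            findings[name] = matches
    return findings
```

In the scraper, scan_text(response.text) could replace the single email search, with the findings written to results.txt.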