Earlier this week, I was asked if I could find email addresses based on a set of supplied domains. Knowing that search engines like Google and Bing block email results, and not wanting to run hundreds of searches by hand, I set about finding a tool for the job. That led me to “theHarvester”, a Python script with several modules to hunt for subdomains, emails and hosts. Unfortunately, while the script provided some utility for single domains, it quickly became a burden when presented with several hundred, and the results seemed slim. Being a tool developer myself, I decided to write a more comprehensive email collector I have dubbed Frisbee.
Get the tool on Github: https://github.com/9b/frisbee
Issues with theHarvester
Before I dive into the details of Frisbee, it’s important to understand where theHarvester failed me. Put simply, theHarvester does not actually search the individual pages returned by the query it runs in each engine. Instead, the search engine results pages themselves are what’s used to find email addresses. How often do you find an email address in the small description snippet of, say, a Google or Bing search? Answer: not often. That said, theHarvester does find some emails, but I thought it could be better.
Searching method aside, theHarvester also suffered from performance issues: each request was made sequentially, which is great for not looking like a bot, but slow when you have hundreds of searches to perform. Finally, the theHarvester command line tool was built to take one term at a time, so sending hundreds of domains at once wasn’t possible without some hacks.
I do want to note that while theHarvester wasn’t the primary way I solved my problem, I think it’s a nice tool and can easily see why it’s gained popularity within the security community.
Frisbee addresses all the key issues I identified with theHarvester and allows users to easily add new modules for different sources of data. Like theHarvester, Frisbee is written in Python, offers a command line tool and makes use of search engine manipulation to find results. As for the differences, Frisbee was purpose-built for collecting email addresses, so it ignores subdomains, hosts, resolutions, etc.
Similar to most of my tools, I built Frisbee as a standalone library and included a lightweight command line tool that leverages it. The main benefit of this approach is flexibility: use the free command line tool as-is, or import the functionality into your own tools.
Frisbee instances create projects to which you can add jobs for processing. Jobs are described via a simple dictionary and solve the use case of performing numerous look-ups in one search session. Jobs are processed by a single search handler that uses multiprocessing to complete each job order. Finally, jobs make use of modules, hot-loaded at runtime, to run searches concurrently, extract results and crawl the pages concurrently for email addresses.
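To make that flow concrete, here is a minimal sketch of the pattern in plain Python. The job-dictionary fields and function names are my own illustration, not Frisbee’s actual API, and the search/fetch steps are left as caller-supplied callables:

```python
import re
from concurrent.futures import ThreadPoolExecutor

# A regex good enough for harvesting; not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(page_text, domain):
    """Keep only addresses that belong to the target domain."""
    return {m.lower() for m in EMAIL_RE.findall(page_text)
            if m.lower().endswith("@" + domain)
            or m.lower().endswith("." + domain)}

def process_job(job, search, fetch, workers=8):
    """Run one job: search for result URLs, then crawl each page
    concurrently and extract matching email addresses.

    `search` maps a job dict to a list of URLs and `fetch` maps a URL
    to page text; both are caller-supplied (e.g. a Bing module).
    """
    urls = search(job)[: job.get("limit", 25)]
    emails = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for page in pool.map(fetch, urls):
            emails |= extract_emails(page, job["domain"])
    return emails

# A hypothetical job description; field names are illustrative only.
job = {"engine": "bing", "domain": "company.xyz", "limit": 25}
```

The key difference from the snippet-scraping approach is that the crawl step downloads and scans the actual result pages, which is where addresses really live.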
Using Frisbee is simple. The GitHub page includes documentation on including the module in your own code, or you can use the built-in command line tool. Key features:
- Ability to search for email addresses from search engine results
- Modular design that can be extended easily to include new sources
- Modifier options that can filter or target the search query
- Limit option to reduce the number of results parsed
- Greedy option to learn from collected results and auto-search
- Save output describing job request and results
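The greedy option, for example, boils down to feeding domains discovered in collected results back into the search queue. A minimal sketch of that loop (the names and structure here are my own, not Frisbee’s implementation):

```python
def greedy_expand(seed_domains, collect):
    """Breadth-first expansion: search each seed domain, then search any
    new domains discovered in the collected email addresses.

    `collect` is a caller-supplied function mapping a domain to a set of
    email addresses (e.g. one full search-and-crawl pass).
    """
    seen, queue = set(), list(seed_domains)
    results = set()
    while queue:
        domain = queue.pop(0)
        if domain in seen:
            continue
        seen.add(domain)
        found = collect(domain)
        results |= found
        # Learn from results: enqueue any domain we have not searched yet.
        for email in found:
            new_domain = email.split("@", 1)[1]
            if new_domain not in seen:
                queue.append(new_domain)
    return results
```

The `seen` set is what keeps the loop from re-searching the same domain and running forever.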
As one would expect, Frisbee produces significantly more email results than theHarvester. Thanks to the modifier options, users can better target their search results and increase the odds of finding email addresses for specific types of users. To illustrate, you could run a Bing engine search with the domain query “company.xyz” and a modifier of “site:github.com” in order to find emails potentially associated with developers.
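Composing that kind of targeted query is simple string assembly; a rough sketch with a hypothetical helper name (the exact query format Frisbee sends to the engine may differ):

```python
def build_query(domain, modifier=None):
    """Compose a search-engine query targeting a domain, optionally
    narrowed by a modifier such as a site: restriction."""
    query = '"@{}"'.format(domain)  # quoted so the engine matches the literal string
    if modifier:
        query += " " + modifier
    return query

# build_query("company.xyz", "site:github.com")
# -> '"@company.xyz" site:github.com'
```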
It’s been a while since I’ve written a tool to handle a job I seldom do, but this one was a lot of fun. Detail-oriented readers will notice there’s only one module: Bing. The reason is that Google bans bots (concurrency makes you look like a bot) whereas Bing doesn’t, and Bing produced “good enough” output to stop me from writing more modules. That said, if you want to add support for search engines like Yahoo, Ask, Baidu, etc., please do and submit a pull request!