FatBeagle: Automated URL Collection
When visiting a web page, it’s not uncommon for dozens of requests to take place behind the scenes in order to render the final content. This experience is great for the user, but it introduces a level of abstraction that makes surfacing security concerns or suspicious traffic challenging. While researching web page content, I’ve found that my own browsing traffic can be a great starting point for exploration. Having a historic record of URLs allows me to begin answering questions about which accessed resources may be suspicious or warrant investigation.
There are a number of ways I can collect this information outside of the browser, but I wanted to keep a light footprint, so I decided to build the extraction of my browsing traffic and page content directly into the browser itself using an extension. FatBeagle offers a small toolkit to submit URLs from the browser, either manually or automatically, through a variety of means without the need for external tools. This post provides a brief overview of the extension and explains some of its technical components.
FatBeagle’s primary utility is to feed an external system the URLs observed during the browsing process. It does this by parsing visited pages and by using exposed browser APIs that allow for network inspection. By default, FatBeagle respects the privacy of the end user and is configured to only extract URLs through manual submission, though it can be configured to perform automated analysis.
When auto-crawling is set to true, FatBeagle uses APIs built into the browser to inspect network activity before it leaves the browser. As traffic streams through the browser, FatBeagle captures the observed URLs and saves them to local storage. On a periodic schedule, the extension combs through the stored entries and sends them to the remote server by issuing a POST request. Once submitted, these entries are removed from local storage to preserve space.
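The buffer-then-submit cycle described above can be sketched as follows. This is a minimal illustration, not FatBeagle's actual code: the `UrlBuffer` class and its method names are hypothetical, and an in-memory `Set` stands in for the extension's local storage.

```typescript
// Hypothetical sketch: class and method names are assumptions for illustration.
class UrlBuffer {
  // A Set stands in for local storage and silently drops exact duplicates.
  private pending = new Set<string>();

  // Record a URL observed in the browser's network activity.
  add(url: string): void {
    this.pending.add(url);
  }

  // Number of URLs currently waiting to be submitted.
  size(): number {
    return this.pending.size;
  }

  // Copy of the entries to include in the next POST request.
  snapshot(): string[] {
    return Array.from(this.pending);
  }

  // Remove entries once the server has accepted them. Only the URLs that
  // were actually sent are deleted, so anything observed while the request
  // was in flight stays buffered for the next scheduled run.
  acknowledge(sent: string[]): void {
    for (const url of sent) {
      this.pending.delete(url);
    }
  }
}
```

In an extension, `add` would likely be called from a network listener such as `chrome.webRequest.onBeforeRequest`, while a timer would call `snapshot`, POST the batch, and call `acknowledge` only after a successful response. Deleting after confirmation rather than before means a failed request does not lose data.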
When auto-extraction is set to true, FatBeagle inspects the web pages loaded within the browser and identifies any URL found within the DOM itself. Regular expressions are used to extract content that appears to match the URL pattern, and each candidate is then validated by checking its TLD. Invalid URLs are removed from the results before they are sent to the remote server.
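A rough sketch of this extract-then-validate step is shown below. The regular expression, the `extractUrls` function and the short `KNOWN_TLDS` list are all assumptions made for illustration; the pattern and TLD source FatBeagle actually uses are not described here.

```typescript
// Illustrative subset only (an assumption); a real validator would load a
// complete TLD list rather than hard-code a handful of entries.
const KNOWN_TLDS = new Set<string>(["com", "net", "org", "io", "edu", "gov"]);

// Extract URL-like strings from page text or markup and keep only those
// whose hostname ends in a known TLD. Results are de-duplicated.
function extractUrls(text: string): string[] {
  // Loose pattern: scheme, dotted hostname (captured), optional port and path.
  const pattern = /https?:\/\/([a-z0-9-]+(?:\.[a-z0-9-]+)+)(?::\d+)?(?:\/[^\s"'<>]*)?/gi;
  const found = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const host = (match[1] ?? "").toLowerCase();
    const tld = host.slice(host.lastIndexOf(".") + 1);
    if (!KNOWN_TLDS.has(tld)) continue; // drop candidates with an unknown TLD
    // Trim punctuation that commonly trails a URL in running text.
    found.add(match[0].replace(/[.,;:)]+$/, ""));
  }
  return Array.from(found);
}
```

Checking the TLD after matching keeps the regular expression simple: the pattern can stay permissive, and strings that only look like URLs are discarded in the validation step.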
Recognizing that users may not want FatBeagle running all the time, several manual submission processes were added. Users can manually submit URLs in three different ways: 1) using the pop-up panel exposed by clicking the extension icon, 2) right-clicking on a highlighted artifact (URL, text, etc.) within the web page and submitting via the context menu, or 3) using the documented hot-key combinations to crawl or extract URLs from the page currently being viewed.
During a browsing session, thousands of URLs will be identified. Many of these are duplicates or come from a small set of frequently occurring hosts such as content delivery networks and social media sites. In an effort to reduce the noise sent from the extension to the remote servers, a simple state machine was added to the extension. This state machine takes into account three core factors (time, frequency and value), all of which can be configured within the options page.
As the user browses online, the state machine builds a mapping of values and counts how many times each one is observed within a given period. These values can be domains (least noisy), hostnames or URLs (most noisy). When content is submitted to the remote server, the extension consults the state machine and removes any item whose count exceeds the defined threshold. After the given time period has passed, the state machine is flushed so that all previous values and counts are cleared.
Note that the state machine doesn’t account for initial priming, so frequently occurring values may slip through filtering during the first few seconds while state is being built. While it’s not a perfect solution, employing the state machine ensures that the remote server is not overwhelmed with a significant amount of duplicated or highly observed traffic.
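The filtering described above can be sketched as a counter with a time window. This is a minimal sketch under stated assumptions: the `FrequencyFilter` class, its parameters and the millisecond timestamps are hypothetical, and the caller decides whether each value is a domain, hostname or full URL.

```typescript
// Hypothetical sketch of the noise filter; names and units are assumptions.
class FrequencyFilter {
  private counts = new Map<string, number>();
  private threshold: number; // observations allowed before a value counts as noise
  private windowMs: number; // how long counts are kept before being flushed
  private windowStart: number; // timestamp (ms) at which the current window began

  constructor(threshold: number, windowMs: number, startMs: number) {
    this.threshold = threshold;
    this.windowMs = windowMs;
    this.windowStart = startMs;
  }

  // Record one observation of a value at time `now` (ms).
  observe(value: string, now: number): void {
    this.flushIfExpired(now);
    this.counts.set(value, (this.counts.get(value) ?? 0) + 1);
  }

  // Keep only the values whose count has not exceeded the threshold.
  filter(values: string[], now: number): string[] {
    this.flushIfExpired(now);
    return values.filter((v) => (this.counts.get(v) ?? 0) <= this.threshold);
  }

  // Clear all values and counts once the configured period has passed.
  private flushIfExpired(now: number): void {
    if (now - this.windowStart >= this.windowMs) {
      this.counts.clear();
      this.windowStart = now;
    }
  }
}
```

The priming gap is visible in this sketch: right after a flush every count is zero, so even a very common host passes `filter` until it has been observed more than `threshold` times in the new window.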
Installing FatBeagle is simple and can be done by visiting the deployed version in the Chrome Web Store. Once the extension is installed, a profile needs to be created that includes a name, server URL, token and private key. It’s assumed that the server receiving this information supports basic authentication via a username and password. Once the profile is defined, there’s no need to make any additional changes to the default settings. Options for auto-crawling and auto-extraction are available through the pop-up menu for easy toggling.
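To make the profile fields concrete, here is a small sketch of what a profile might look like as data, along with a sanity check before first use. The four fields come from the description above, but the `Profile` interface, the property names and the validation rules are assumptions, not FatBeagle's actual storage format.

```typescript
// Hypothetical shape of a profile; property names are assumptions.
interface Profile {
  name: string;
  serverUrl: string;
  token: string;
  privateKey: string;
}

// Returns a list of problems; an empty list means the profile looks usable.
function validateProfile(profile: Profile): string[] {
  const problems: string[] = [];
  if (profile.name.trim() === "") {
    problems.push("name is required");
  }
  if (!/^https?:\/\/[^\s\/]+/i.test(profile.serverUrl)) {
    problems.push("server URL must start with http:// or https://");
  }
  if (profile.token.trim() === "") {
    problems.push("token is required");
  }
  if (profile.privateKey.trim() === "") {
    problems.push("private key is required");
  }
  return problems;
}
```

Returning every problem at once, rather than stopping at the first, lets an options page show the user everything that needs fixing in a single pass.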
Questions, Comments or Issues
If you’d like to contribute to FatBeagle, you can do so by forking the repository on GitHub. If you identify any bugs or have specific feature requests, you can submit an issue on GitHub and it will be triaged by the developers.