Friday, January 3, 2014

Tips for Forensic Data Visualization

Notes from Garfinkel's talk at OSDF 2013

Dr. Simson Garfinkel kicked off the 2013 Open Source Digital Forensics Conference with a talk on the use of data-driven visualization in digital forensics. Specifically, Garfinkel focused on three core concepts:
Garfinkel, 2013
  1. Never Use Pie Charts
  2. Histograms with Cumulative Distribution Function
  3. Build consistent Data Driven Visualizations

Never Use Pie Charts

Garfinkel borrows a line from Stephen Few and says "Save the Pies for Dessert!" Pie charts can be accidentally or deliberately manipulative, and they make comparison of numeric values difficult.

Histograms with Cumulative Distribution Function

As you may have seen in the PCAP pre-processing tool tcpflow, adding a cumulative distribution function in-line with a bar chart of the major categories quickly helps an analyst understand the context of the data they are reviewing. Whether your histogram tracks data transfer over time, by protocol or port, or by IP address, this approach to visualizing forensic data quickly helps you ask "which one of these is not like the other?" To illustrate, below is a histogram presented by Garfinkel:
Garfinkel, 2013
In the histogram above we can see some lulls in traffic, as well as a rapid rise in data transfer about mid-way through the time period. I can clearly identify the majority of traffic as HTTP. Port 5222 is identified in the label/key; however, I'm not able to see a bar of that color, so the traffic using that port and its associated protocol is minuscule in comparison.
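The combination is easy to reproduce. Below is a minimal sketch of the idea, with entirely synthetic per-bucket byte counts standing in for the values tcpflow would derive from a real PCAP; the optional matplotlib section renders a bar chart with the CDF overlaid.

```python
# Sketch of a histogram with an overlaid cumulative distribution function (CDF).
# The traffic numbers are synthetic; tcpflow derives the real ones from a PCAP.

def histogram_with_cdf(byte_counts):
    """Return (byte_counts, cdf) where cdf[i] is the fraction of all
    traffic transferred up to and including bucket i."""
    total = sum(byte_counts)
    cdf, running = [], 0
    for count in byte_counts:
        running += count
        cdf.append(running / total)
    return byte_counts, cdf

if __name__ == "__main__":
    # Bytes transferred per time bucket: a lull, then a mid-period spike.
    buckets = [120, 90, 10, 5, 900, 1400, 600, 200]
    counts, cdf = histogram_with_cdf(buckets)
    for i, (c, f) in enumerate(zip(counts, cdf)):
        print(f"bucket {i}: {'#' * (c // 100):<15} {f:6.1%}")
    try:
        import matplotlib.pyplot as plt  # optional: render the actual chart
        fig, ax = plt.subplots()
        ax.bar(range(len(counts)), counts)
        ax.twinx().plot(range(len(cdf)), cdf, color="black")
        fig.savefig("traffic_cdf.png")
    except ImportError:
        pass  # the text-mode summary above still works
```

The CDF column makes the mid-period spike obvious: the cumulative fraction jumps sharply at exactly the buckets carrying most of the traffic.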

This focus on clear, concise and easily digestible context is a win for a field as prone to data overload as digital forensics.

Consistent Data Driven Visualizations

In his presentation Garfinkel pays homage both to the stalwarts of data visualization (matplotlib and GraphViz) and to the more recent newcomer D3.js, a JavaScript library for creating data-driven documents. While matplotlib and GraphViz are well suited to automated generation of visualizations, D3.js poses some problems if your workflow is not web-browser based.

One of the fundamental concepts in Digital Forensics is the "repeatability" of a workflow, something that can be difficult to achieve with data-driven visualizations. Graphing libraries such as GraphViz and D3.js use a pseudo-random number generator to assist in the initial layout of a graph: nodes are "randomly" placed on the canvas. This random placement can pose substantial problems for the repeatability requirement. One solution is to store the pseudo-random seed with each visualization produced.
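A minimal sketch of that fix in Python: feed the recorded seed to a private `random.Random` instance so the "random" node placement can be reproduced exactly. The `layout` function and node names here are hypothetical, not taken from any particular graphing library.

```python
import random

def layout(nodes, seed):
    """Place nodes at pseudo-random canvas positions.
    Recording `seed` alongside the output makes the 'random'
    layout repeatable on demand."""
    rng = random.Random(seed)  # private generator; global state untouched
    return {node: (rng.random(), rng.random()) for node in nodes}

if __name__ == "__main__":
    nodes = ["evidence.dd", "report.pdf", "timeline.csv"]
    # Same seed, same picture: the repeatability requirement is satisfied.
    assert layout(nodes, seed=42) == layout(nodes, seed=42)
```

The same pattern applies to D3.js or GraphViz output: persist the seed next to the rendered graphic so the layout can be regenerated for review.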

Other Notes

Other information and considerations that may be helpful as you begin building Data Driven Visualizations include:

SVG or other vector output, including PDF. This allows the graphic to be used in a wide variety of tools, and also allows "infinite" zoom for larger and more complex visualizations.
Garfinkel, 2013

A Common Vocabulary - Each professional field has a continuously evolving vocabulary used to communicate very specific ideas. Data-driven visualization is in its infancy, especially in digital forensics. This is the time to get in early.

The original presentation slides are linked here.

Tuesday, December 31, 2013

Malware Comparison with N-Grams

Jason recently presented his paper on the use of n-gram analysis of malicious executables for identifying code re-use within the malware creation industry. The identification of code re-use has a variety of applications that could be quite useful to lots of folks.
N-Gram Clustering of Malware Samples,
Upchurch & Zhou [2013]

Reduction of code requiring analysis

The n-gram analysis shows strong reuse of code, so a shared repository of analysis for common instruction groups could reduce the amount of analysis required. Similar to FLIRT signatures in IDA Pro, a common repository of these high-reuse portions of code would have a profound effect on the cost of malware analysis and reverse engineering.
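The paper's actual method (force-based clustering of filtered block n-grams) is more involved, but the underlying measurement can be sketched simply: extract byte n-grams from each sample and compare the sets. The function names and the choice of Jaccard similarity below are illustrative assumptions, not the authors' exact algorithm.

```python
def byte_ngrams(data, n=4):
    """Set of all length-n byte substrings appearing in `data`."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def jaccard(a, b, n=4):
    """Jaccard similarity of two binaries' n-gram sets (0.0 .. 1.0)."""
    ga, gb = byte_ngrams(a, n), byte_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)
```

Two samples that share a routine will share many of its n-grams, so a high score flags likely code reuse even when the surrounding bytes differ.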

Design and Implementation Evolution

Like other software, malware designed and implemented in a commercial process is developed with a formal methodology and is subject to the normal constraints of projects and software development (time, resources, etc.). Identifying similar strains of malware based on compiled-code similarity could be used to identify changes in the offensive objectives of attackers. It could also be used by antivirus companies to more quickly analyze samples for family resemblance.

Identification of Migration

Extending the idea of family resemblance, n-gram based analysis can show real world relationships between malware authors as these individuals work collaboratively, share code, or even gift code that was previously "cutting edge" to other groups.

Malware Backtracking

As malware authors gain experience and resources, their tactics for ensuring anonymity and protecting themselves from public disclosure are likely to improve. In their early years these authors are likely to be more risk-tolerant and may have left individually identifiable information in a variety of locations, including in the compiled code itself. It may be possible to link newer malware to older malware, and from there to the real-world author.

You can find the full paper, "First Byte: Force-Based Clustering of Filtered Block N-Grams to Detect Code Reuse in Malicious Software," online.

Friday, January 6, 2012

Poor Man's Bin-Diff

Sometimes you happen upon a binary file, nearly identical to another, that requires identification of very minor differences. Usually the first n bytes are identical, then deviations or complete differences exist beyond that point. If you don't have access to a tool like BinDiff, or the file format is not applicable to BinDiff (like the two ISO files I had to compare today, or anything other than an executable), there is a very easy way to identify the start of, and potentially all, differences between the files. With some GNU kung-fu we'll use xxd (the hex dump utility) and diff (the programmer's difference identification tool) to locate the changes.

First, use xxd to dump the binary file contents to an ASCII hex representation:

# xxd File1.exe > File1_dump 
# xxd File2.exe > File2_dump

Standard PE/Executable file, dumped to hex with xxd.
As you can see above, xxd makes for a very convenient binary file viewer when looking for plain-text metadata, basic data structures and other potentially interesting items. We've essentially converted the binary files into ASCII-based files.

Some useful notes about this method: if the shared content begins at different offsets in the two files, you'll need the following xxd options when creating your dumps:

-s bytes_to_skip  Seek to the identical starting point of the two files.
-ps  Output plain hex, without the byte-offset column and without the ASCII representation column (offsets would otherwise make every line after an insertion differ).

Next, use diff to view the areas of difference between these files.

# diff File1_dump File2_dump

We can clearly see changes (an IP address) starting at file offset 0x4f70
Using the switch --suppress-common-lines (with diff's side-by-side output, -y) will reduce the content you have to review before identifying areas of interest.

In this manufactured example, we see the difference between the two executable files is an IP address.
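The same comparison can be scripted without the intermediate hex dumps. This is a rough Python equivalent, with small synthetic buffers (an embedded "IP address," echoing the example above) standing in for the real files:

```python
def diff_ranges(a, b):
    """Yield (start, end) byte offsets where two same-length buffers differ."""
    start = None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y and start is None:
            start = i                    # a run of differences begins
        elif x == y and start is not None:
            yield (start, i)             # the run just ended
            start = None
    if start is not None:                # run extends to end of buffer
        yield (start, min(len(a), len(b)))

if __name__ == "__main__":
    # Synthetic stand-ins for File1/File2: only the embedded address differs.
    one = b"HEADER" + b"10.0.0.1" + b"TAIL"
    two = b"HEADER" + b"10.9.9.1" + b"TAIL"
    for lo, hi in diff_ranges(one, two):
        print(f"bytes differ at offset 0x{lo:x}-0x{hi:x}")
```

Unlike the xxd/diff pipeline, this reports exact byte offsets directly, which is handy when you then want to inspect the region in a hex viewer.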


Friday, September 12, 2008

RAID Part 3, Iteration, Block Sizes, and the First Steps of Recovery

In the previous installments, I talked about the general concept of RAID and how it is not really standardized. We broke RAID down into four main groups: left symmetrical, left asymmetrical, right symmetrical, and right asymmetrical. Of course, if that were all there was to it, it wouldn't be such a hassle to reconstruct failed RAID arrays. To that end, I spoke about parity levels and noted that the number of disks doesn't necessarily have to be equal across arrays. This can complicate the reconstruction process a bit. And that is still not the end of the configuration variables that must be discovered to reconstruct an array.

Iteration (Delay)
In the previous article, we examined a number of array configurations for RAID 5. The easiest to understand is probably the left-asymmetrical RAID 5. Again, the example used in the previous article was a left-asymmetrical RAID 5, parity 5, on 5 disks. In that example, the parity rotated on every stripe; thus its rotation iteration is 1 stripe, then rotate.

left asymmetrical RAID 5, Parity 5, on 5 Disks, Iteration 1

However, some manufacturers (cough, HP/Compaq, cough) decided this was too simple for their liking, so they added a new variable to the game: iteration (delay). In this configuration, some number of stripes greater than 1 is written/read before the parity rotation occurs. This array is somewhat akin to an incestuous relationship between RAID 4 and RAID 5, where the parity remains in the same position (note that this is not necessarily the same disk, as discussed in the previous article) for a number of stripes before rotation. When the previously defined number of iterations has occurred, the parity rotates as normal. The number of iterations of a stripe before rotation is up to you to find out (we will talk about techniques in later articles), but 16 is usually a good place to start.

left asymmetrical RAID 5, Parity 5, on 5 Disks, Iteration 2
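One way to make the layout concrete is to compute which disk holds parity for a given stripe. The sketch below assumes one plausible convention (parity starts on the last disk and steps left after every `delay` stripes); real controllers vary, so treat it as an illustration rather than any vendor's actual layout.

```python
def parity_disk(stripe, ndisks=5, delay=1):
    """Disk index holding parity for `stripe` in a left-asymmetrical RAID 5
    with an iteration (delay) factor. Convention assumed here: parity begins
    on the last disk and moves one disk to the left after `delay` stripes."""
    rotations = stripe // delay          # completed parity rotations so far
    return (ndisks - 1) - (rotations % ndisks)

if __name__ == "__main__":
    # Compare iteration 1 (plain left-asymmetrical) against iteration 2.
    for stripe in range(8):
        print(stripe, parity_disk(stripe, delay=1), parity_disk(stripe, delay=2))
```

With delay=1 this reproduces the ordinary left-asymmetrical rotation; with delay=2 each parity position repeats for two stripes before moving, matching the figure above.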

Block Size Primer
Up until now, we have ignored block sizes. I am not talking about blocks at the disk level (sectors), but the chunk of data that is written to a disk before moving to the next disk. A "true" RAID 5 should always use some integer multiple of the sector size of its component disks (usually a 512-byte sector), but I have been surprised before, so consider yourself forewarned. If you're into guessing, start at 2^3 (8) sectors, moving up by a factor of 2 each time until you get to 512. If you get there, you probably missed something. Of course, if you're not into guessing so much, know that the block size is usually some power of 2 times the sector size, and read on.

Determining the Parity Level
As I mentioned, though somewhat briefly, in the previous installments of this RAID series, the parity in RAID 5 is a simple XOR of the data blocks within the stripe. The parity is then written to the parity block for the stripe. If one of the disks in the RAID fails, the missing blocks can be recalculated by performing an XOR on the remaining blocks to recover the data. We can use this fact to our advantage in the recovery process. Provided that we are working with a controller failure or another situation in which all of the component disks of the RAID are present, we can easily determine the parity level of the RAID. Since the parity block of any particular stripe is equal to the XOR of the remaining data blocks in the stripe, AND any value XOR itself is zero, the XOR of an entire stripe, including the parity block, will always equal zero. Therefore, if you write a little program that performs XOR on 3, then 4, then 5, etc. blocks at a time, the one that consistently results in zero is your parity level! That's it! Well, not really. There are some other variables we need to talk about for this to work in practice, the first being configuration data at the beginning of the disk that would not play well with such a simple method.
That being said, in the next installment we will talk about areas of the component disks where metadata of the RAID itself may be stored, locating the beginning of the data area, locating other common structures that will help you determine some parameters, and finding the parity blocks programmatically.
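The XOR test described above can be sketched in a few lines. The helper names are mine, and the demo works on a synthetic two-data-disks-plus-parity set rather than real component disks:

```python
from functools import reduce

def stripes_xor_to_zero(disks, block_size):
    """True if, for every full stripe, the XOR of the aligned blocks across
    all component disks is zero -- the RAID 5 parity invariant."""
    length = min(len(d) for d in disks)
    for off in range(0, length - block_size + 1, block_size):
        blocks = [d[off:off + block_size] for d in disks]
        acc = bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))
        if any(acc):
            return False
    return True

def guess_parity_level(disks, block_size, max_disks=8):
    """Try 3, then 4, then 5, ... disks at a time; the count whose stripes
    consistently XOR to zero is your parity level."""
    for n in range(3, max_disks + 1):
        if n <= len(disks) and stripes_xor_to_zero(disks[:n], block_size):
            return n
    return None
```

As noted above, configuration metadata at the start of real disks will break the invariant, so in practice you would skip past any reserved area before running the test.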



Thursday, September 11, 2008

Mindset - Professional Hackers

*** NOTE: URLs cited (but not linked to) in this article may contain malware that could harm your computer or encourage you to give money away for nothing in return. Please be cautious when browsing to any address not explicitly linked to!

I would like to take a few minutes to review an in-depth article by The Register (Anatomy of a Hack). Specifically I want to focus on the areas that were well executed and areas that create vulnerabilities to the hacker's objectives.

As always, we'll be applying the General Theory of Laziness as a filter through which we look at the hacker's actions.

Fake Google Site:
* The use of the chosen domain is immediately suspicious. It would have been far less suspicious to use a generic domain name. Here the objective of the hacker is to appear legitimate; insofar as this attempt is made, it backfires. Picking this domain name as your "cover" is LAZY. There are many better options out there, i.e., seriously legitimate-sounding domains.

GUI Specifics: XP GUI elements observed from a Windows Vista System
* In Figure 1 of the article we see what appears to be a pop-up window over the browser purporting to be a WARNING!!! from "Quick System Scan Results". On any computer other than Windows XP we would immediately identify this window as part of the hack and not an actual system message. Rendering only one window style is LAZY, but it probably achieved the objective: target a large user base (all Windows XP users).

Exciting Words
* "WARNING!!!", "Spyware.IEMonster.b", "CRITICAL", and "strongly recommended" are all sales tactics. You see the equivalent in real estate that you should have no interest in! Specifically, using a malware name like "Spyware.IEMonster.b" is LAZY; it plays on the fear of IE users with an obviously fake threat name. When is the last time you saw a threat name that made actual sense to you?

Spelling, Grammar
* Figure 5 and Figure 7 (others?) contain blatant punctuation and/or grammatical errors. While some professional, REAL software does contain such errors, generally you see these errors in deeper parts of the application. The "client's" first experience with software is set by the installer and other components (such as the company web site). Errors such as these at this point should be a big warning. The hackers were LAZY and did not have someone accurately check grammar, spelling and/or punctuation.

* Eventually the article discusses how the end-point of the hack, getting money out of the whole deal, all points in the same direction. One of the hardest problems of hiding, even online, is money. For money to have value it has to end up in some person's or organization's hands; eventually the investigator identifies a variety of malware that all pays the same person/organization.

My colleague, Dave Gilbert, warns me that "follow the money" has become a cliché, but I will hazard its use here.

I contend that, in a world that is increasingly digitized, we will see the oldest of investigative techniques become more and more important. In this case, following the money trail, the way these hackers get money out of the "hack," leads us back to the actual attackers.

Any thoughts?

Digital Evidence Formats

Folks, this is a little off our normal topic of "Intrusions and Malware Analysis," but I think it's entirely relevant.

The Plea
Over the last six years that I've been involved with computer forensics and computer security, I have regularly reflected on the lack of open standards for handling digital evidence. Can we please, as a community, make a grassroots decision to standardize our imaging on a common format?

The Problem
I, personally, have experienced problems with proprietary image formats that are password protected or have incompatible format versions for the analysis software that I am using. When these problems arise it often costs a ridiculous amount of time and potentially money to "fix" the problem. The problem is proprietary evidence container formats. There, I've said it.

Currently, if I had to choose, I would say that "dd", or raw bitstream, would be the best format to standardize on. Split, complete, whatever. Raw format allows the most flexibility in performing forensic examinations. Every forensics suite (that I know of) supports the format, there are never issues or concerns regarding format or software version, and we're able to split the image to support different file systems if need be. In my opinion, it's the best option available to us.

Now, you may be sitting there thinking, "But raw doesn't support compression" or "But raw doesn't allow me to password protect the evidence." You're right on both counts. I don't think that Raw format is the end-all fix to this problem, but let's start with the raw format and see how we can meet various requirements of evidence containers while staying in the realm of open standards.

If there is one thing that Open Source/Open Standards software has given us, it's compression algorithms. One might argue that there are too many compression formats on open platforms, and I would agree. We need to identify a compression algorithm that brings reasonable compression and easy split-file output for our increasingly large image files.

The selected compression standard must also allow for password protection. Some evidence may not remain under chain of custody at every point and should be protected from illicit viewing and potential tampering.

Other Details
I would like to think of the file format as more of a wrapper than the actual content itself. In addition to handling some level of compression and protection, I'd like to be able to insert a 'hand receipt' into the package: generate a hashlog of the drive and include it alongside the image. Think of it more like a portable filesystem than a file itself:

- PackageContents.xml (Contains MD5 of image, information about origin, etc.)
- Hashlog.txt
- ImageFile.raw.dd
This is very similar to how the OpenDocument format works; it makes sense and could make working across company/organization boundaries much more efficient.
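As an illustration only (this is not an existing standard), the wrapper could be prototyped today with nothing more than a zip container: compression comes free with ZIP_DEFLATED, and the manifest and hashlog travel alongside the raw image. The file names match the layout above; proper password protection would need a stronger scheme than classic zip crypto, so it is omitted here.

```python
import hashlib
import io
import zipfile

def package_evidence(image_bytes, origin="hypothetical case 001"):
    """Wrap a raw image plus manifest and hashlog in a zip container.
    The layout mirrors the sketch above; the format is an illustration,
    not an existing evidence-container standard."""
    md5 = hashlib.md5(image_bytes).hexdigest()
    manifest = f"<package><image md5='{md5}' origin='{origin}'/></package>"
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("PackageContents.xml", manifest)       # origin metadata
        z.writestr("Hashlog.txt", f"ImageFile.raw.dd  MD5  {md5}\n")
        z.writestr("ImageFile.raw.dd", image_bytes)       # the raw image
    return buf.getvalue()
```

Any forensic tool that can read zip archives could then extract the raw image and verify it against the recorded hash, which is exactly the cross-organization interoperability the post is asking for.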

Did I miss any major features that you would want to see? Let me know in the comments!

Wednesday, September 10, 2008

The Merry-Go-Round

Thanks guys for inviting me to the party. In return, I will try to share some thoughts until you all get tired of my perspective and style and kick me off this thing. Definitely some interesting topics so far and I plan to study at least one of them well enough to comment intelligently. In the spirit of giving back, I'll go ahead and throw another topic out.

While I find it humorous that I'm doing so, I've spent the better part of the last year trying to help folks new to the Intrusions game get comfortable with cold-box intrusion analysis. I say humorous because it wasn't all that long ago, in Dave years, that I was scrambling to teach myself how to tackle exams as well as bugging a few good friends and respected colleagues to validate what I thought I knew and straighten me out when I was wrong. When I was learning intrusion analysis, I had the advantage of some pretty extensive criminal examination and field investigation experience. I've found that understanding what goes into building an investigation with an eye toward prosecution can help in figuring out what a suitable intrusion analysis examination should produce. What I've found most challenging is figuring out ways to relay what I refer to as a "way of thought" to folks who may not have the same perspective I have with regard to an analysis process. This is probably most akin to your Dad or Uncle telling you you're not holding your mouth right when you're baiting a hook, turning a wrench, or tying a knot. So, as I've tried, scratched, thrown away, and forged ideas anew this last year, I've decided that intrusions analysis is most like a merry-go-round.

Great minds and smarter men than me have taught me that the intrusions response process as a whole, of which cold-box analysis is a piece, is a cyclical process. It's almost natural to me that intrusions system analysis is cyclical as well. We basically start with some kind of known, or guess, and go to the unknown. As the unknown becomes a known, it leads to other unknowns, and so on. I have seen and ridden this ride many times. As most of us have, I've chased rabbits down hole after hole, trying to find the nugget (vector, attribution). As I've tried to explain how this can happen, while also working in an environment that leans toward standardization, automation, and desired speed and more speed, I've remembered my short-lived youth on the Jersey Shore where for a time my good buds and I would jump on and off the merry-go-round in a fine arcade establishment. The trick, at least back then, was to time the jumping on and off to coincide with being out of sight of the ride attendant. So, I've concluded that a key aspect of intrusions analysis is knowing where to jump on and when to jump off the beast lest your hard work become irrelevant due to the passage of time.