Things I've learned and suspect I'll forget.
A while back I received an email from a friend with a PDF attachment. Since the email body was blank and the "report" was unsolicited, I assumed my friend's account had been compromised and that the PDF was malicious. When I went to examine the PDF later, I found that Google had flagged the file as malicious and would not let me download it. The only options it offered were "View" and "Learn more". Clicking "View" shows a message that says "virus found", and "Learn more" takes you to this page.
But we can still get the file from the original text of the email. Click the drop-down arrow attached to the reply button and select "Show original". Scrolling down a bit in the original message, you'll see a section that looks like this:
--f46d0444ef637df04804b6e6fc01
Content-Type: application/pdf; name="744810.pdf"
Content-Disposition: attachment; filename="744810.pdf"
Content-Transfer-Encoding: base64
X-Attachment-Id: file0

JVBERi0xLjYKJeLjz9MNCjEgMCBvYmoNCjw8L1R5cGUvUGFnZS9QYXJlbnQgNSAwIFIgL01lZGlh
Qm94IFswIDAgNjQwIDQ4MF0vQ29udGVudHMgNiAwIFIgL1Jlc291cmNlcyA3IDAgUj4+DQplbmRv
JSVFT0YNCg==
--f46d0444ef637df04804b6e6fc01--
The sample above has been heavily cut down to save space, but the first two lines and the last line of the attachment data are shown. The interesting parts are the "Content-Transfer-Encoding: base64" header and the text that starts on line 7 and ends on line 9 of the sample, which is the base64 encoding of the file. A trivial way to decode this string is to fire up Python and use the base64 module:
>>> import base64
>>> raw = base64.decodestring('JVBERi0xLjYKJeLjz9MNCjEgMCBvYmoNCjw8L1R5cGUvUGFnZS9QYXJlbnQgNSAwIFIgL01lZGlhQm94IFswIDAgNjQwIDQ4MF0vQ29udGVudHMgNiAwIFIgL1Jlc291cmNlcyA3IDAgUj4+DQplbmRvJSVFT0YNCg==')
>>> raw
'%PDF-1.6\n%\xe2\xe3\xcf\xd3\r\n1 0 obj\r\n<</Type/Page/Parent 5 0 R /MediaBox [0 0 640 480]/Contents 6 0 R /Resources 7 0 R>>\r\nendo%%EOF\r\n'
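The session above is Python 2. As a minimal sketch of the same decode, assuming Python 3 (where the result is a bytes object), the snippet below joins the three sample lines and checks that the output starts with the "%PDF-" signature. The names SAMPLE and decode_sample are made up for this illustration.

```python
import base64

# The three base64 lines from the truncated sample above, joined into one string.
SAMPLE = (
    "JVBERi0xLjYKJeLjz9MNCjEgMCBvYmoNCjw8L1R5cGUvUGFnZS9QYXJlbnQgNSAwIFIgL01lZGlh"
    "Qm94IFswIDAgNjQwIDQ4MF0vQ29udGVudHMgNiAwIFIgL1Jlc291cmNlcyA3IDAgUj4+DQplbmRv"
    "JSVFT0YNCg=="
)


def decode_sample(text):
    """Decode base64 text to bytes; validate=True rejects non-alphabet characters."""
    return base64.b64decode(text, validate=True)


raw = decode_sample(SAMPLE)
print(raw[:8])                    # first bytes of the decoded file
print(raw.startswith(b"%PDF-"))   # PDF files begin with this signature
```

Checking the signature is only a quick sanity test that the right text was copied. It says nothing about whether the file is safe to open.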
A simple script to do this automatically would look like this:
#simpledecode.py
# runs under Python 2.7 and Python 3
import base64
import sys

def decode(filein, fileout):
    '''Decode a text file of base64 lines into the binary file fileout'''
    emailFile = open(filein, 'r')
    rawfile = open(fileout, 'wb')
    for line in emailFile:
        raw = line.strip()
        try:
            # each line is decoded on its own, so it must hold whole
            # 4-character base64 groups (true for MIME-wrapped lines)
            rawfile.write(base64.b64decode(raw))
        except Exception:
            print("Incorrect Padding: " + line)
            raise
    emailFile.close()
    rawfile.close()
    print("wrote file: " + fileout)

if __name__ == '__main__':
    if len(sys.argv) == 3:
        decode(sys.argv[1], sys.argv[2])
    else:
        print("simpledecode.py fileIn fileOut")
        print("fileIn should be the base64 MIME encoded string")
simpledecode.py takes a text file that contains only the base64 encoded data. All you need to do is copy the first base64 encoded section of the email (lines 7-9 in the sample above) into a new text file and supply that file as the first argument to simpledecode.py. The second argument is where you want the extracted file to be saved.
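Decoding one line at a time, as simpledecode.py does, works because each wrapped line of MIME base64 holds whole 4-character groups, so every line is a valid base64 string by itself. The sketch below, assuming Python 3, checks this on the three sample lines and shows that a line cut in the middle of a group is rejected, which is the case the "Incorrect Padding" message is meant to catch. LINES, decode_lines and line_decodes are names made up for this illustration.

```python
import base64
import binascii

# The three base64 lines of the truncated sample from earlier in the post.
LINES = [
    "JVBERi0xLjYKJeLjz9MNCjEgMCBvYmoNCjw8L1R5cGUvUGFnZS9QYXJlbnQgNSAwIFIgL01lZGlh",
    "Qm94IFswIDAgNjQwIDQ4MF0vQ29udGVudHMgNiAwIFIgL1Jlc291cmNlcyA3IDAgUj4+DQplbmRv",
    "JSVFT0YNCg==",
]


def decode_lines(lines):
    """Decode each line separately and concatenate the results."""
    return b"".join(base64.b64decode(line.strip()) for line in lines)


def line_decodes(line):
    """Return True if the line is a complete base64 string by itself."""
    try:
        base64.b64decode(line.strip())
        return True
    except binascii.Error:
        return False


whole = base64.b64decode("".join(LINES))
print(decode_lines(LINES) == whole)   # line-by-line versus all at once
print(line_decodes(LINES[0]))         # a complete 76-character line
print(line_decodes(LINES[0][:-1]))    # the same line with one character missing
```

If you copy only part of a line out of the email, the last check is the failure you would run into.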
To make extracting files a little simpler, I wrote a Python program with a bit more sophistication. This script needs only a text file containing the original email. That is, when you view the original source of the email, hit Ctrl-A to select everything, copy and paste it into a text file, and supply that file as the argument to the script below. By default, the script saves each attachment under the filename included with the attachment. You can also supply your own filename, and if there are multiple attachments they will be written out with a number attached.
#emailextract.py
# runs under Python 2.7 and Python 3
import base64
import os
import random
import re
import sys

def processMIME(emailMessageFile, outputFile=None):
    '''
    Takes a MIME extended email and extracts the attachments
    emailMessageFile - a file containing the plaintext email in MIME format
    outputFile - Where to place the new file (if not keeping the original name)
    Note: outputFile is used as-is for the first attachment only. If there
          are multiple attachments, the later ones get a count inserted
          before the file extension (if a file extension exists)
    '''
    emailFile = open(emailMessageFile, 'r')
    #Read lines until the boundary parameter of the first multipart
    #Content-Type header is found (it may sit on a folded continuation line)
    line = emailFile.readline()
    while line != "" and "boundary=" not in line:
        line = emailFile.readline()
    #get the boundary string, dropping optional quotes
    match = re.search(r'boundary="?([^";\s]+)', line)
    if match is None:
        emailFile.close()
        raise ValueError("no MIME boundary found in " + emailMessageFile)
    boundary = match.group(1)
    #now that we have the boundary find the attachments
    count = 0
    while line != "":
        line = emailFile.readline()
        if "Content-Type: application" in line:
            processApplicationSection(line, emailFile, outputFile, boundary, count)
            count += 1
    emailFile.close()

def processApplicationSection(line, emailFile, outputFile, boundary, count):
    '''
    Reads the headers of a Content-Type: application section of a MIME
    message, determines the filename, and moves emailFile to the data
    '''
    makeFileName = outputFile is None
    if not makeFileName and count > 0:
        #number the second and later attachments: name.ext -> name1.ext
        filename, ext = os.path.splitext(outputFile)
        outputFile = filename + str(count) + ext
    while line != "" and "--" + boundary not in line:
        #Get the name of the file to write (quoted or unquoted)
        if makeFileName:
            matchFilename = re.search(r'filename="?([^";\r\n]+)', line)
            if matchFilename is not None:
                #the name comes from an untrusted email: keep no directory part
                outputFile = os.path.basename(matchFilename.group(1).strip())
        #a blank line ends the headers; the base64 data follows
        if line.strip() == "":
            if not outputFile:
                outputFile = ''.join(random.sample('0123456789abcdefg', 10))
            processRawData(emailFile, outputFile, boundary)
            return
        line = emailFile.readline()

def processRawData(emailFile, outputFile, boundary):
    '''Decodes base64 lines into outputFile until the next boundary or EOF'''
    rawfile = open(outputFile, 'wb')
    while True:
        line = emailFile.readline()
        if line == "" or "--" + boundary in line:
            break
        raw = line.strip()
        try:
            rawfile.write(base64.b64decode(raw))
        except Exception:
            print("Incorrect Padding: " + line)
            raise
    rawfile.close()
    print("wrote file: " + outputFile)

if __name__ == "__main__":
    if len(sys.argv) > 1:
        emailFile = sys.argv[1]
        outputFile = sys.argv[2] if len(sys.argv) > 2 else None
        processMIME(emailFile, outputFile)
    else:
        print("emailextract.py emailIn fileOut(opt)")
        print("emailIn should be the plain text original MIME encoded email")
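For comparison, Python's standard email package can do the boundary and header parsing that emailextract.py does by hand. The following is a minimal sketch of that alternative, assuming Python 3; it is not the script from this post. The names find_attachments, save_attachments and SAMPLE_MESSAGE are made up here, and the two top-level headers in the sample message are invented so that the truncated attachment from earlier has something to sit in.

```python
import email
import os


def find_attachments(message_text):
    """Return (filename, decoded_bytes) for every part marked as an attachment."""
    message = email.message_from_string(message_text)
    found = []
    for part in message.walk():
        if part.get_content_disposition() != "attachment":
            continue
        # The name comes from an untrusted email, so drop any directory part.
        name = os.path.basename(part.get_filename() or "")
        if not name:
            name = "attachment%d.bin" % len(found)  # hypothetical fallback name
        # decode=True undoes the Content-Transfer-Encoding (base64 here).
        found.append((name, part.get_payload(decode=True)))
    return found


def save_attachments(message_text, out_dir="."):
    """Write each attachment into out_dir and return the paths written."""
    paths = []
    for name, data in find_attachments(message_text):
        path = os.path.join(out_dir, name)
        with open(path, "wb") as handle:
            handle.write(data)
        paths.append(path)
    return paths


# Illustration only: the attachment part is the truncated sample from this
# post; the two header lines above it are made up for the example.
SAMPLE_MESSAGE = "\n".join([
    "MIME-Version: 1.0",
    "Content-Type: multipart/mixed; boundary=f46d0444ef637df04804b6e6fc01",
    "",
    "--f46d0444ef637df04804b6e6fc01",
    'Content-Type: application/pdf; name="744810.pdf"',
    'Content-Disposition: attachment; filename="744810.pdf"',
    "Content-Transfer-Encoding: base64",
    "X-Attachment-Id: file0",
    "",
    "JVBERi0xLjYKJeLjz9MNCjEgMCBvYmoNCjw8L1R5cGUvUGFnZS9QYXJlbnQgNSAwIFIgL01lZGlh",
    "Qm94IFswIDAgNjQwIDQ4MF0vQ29udGVudHMgNiAwIFIgL1Jlc291cmNlcyA3IDAgUj4+DQplbmRv",
    "JSVFT0YNCg==",
    "--f46d0444ef637df04804b6e6fc01--",
    "",
])

for name, data in find_attachments(SAMPLE_MESSAGE):
    print(name, len(data), "bytes")
```

With a real message you would read the pasted "Show original" text from a file and pass it to save_attachments. Unlike emailextract.py, this sketch picks parts by their Content-Disposition header instead of looking for "Content-Type: application", so it also finds attachments of other types.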
published on 2012-07-22 by alex