Open Source Honeypot Attack Data
Contents
Honeypot Data Google Drive Link (329MB):
- SHA1: 5be9052f0f27e248b2454cf162f04a811acaf48e
- SHA256: 5d8d8c457941cbc1310729a61d67799d504fd9cf33c2e92e5f4b93e14b9f988a
Honeypot Malware Samples Google Drive Link (79MB):
- SHA1: a68249453fdc44a31f177857b558ca0b6f183fea
- SHA256: bbb2290bb3701d4e32fca1f0cc373f2f48b2e48b8b8fa63450116df417dbd9cb
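If you want to check a download against the hashes above, a minimal sketch along these lines should do it. It only uses Python's standard `hashlib`; the helper names and the file name in the usage comment are placeholders of mine, since the archive names aren't listed here.

```python
import hashlib

def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks to keep memory use low."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def matches(path, expected_hex, algorithm="sha256"):
    """Compare a file's digest with a published hash (case-insensitive)."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()

# Usage (hypothetical file name):
# matches("honeypot_data.zip",
#         "5d8d8c457941cbc1310729a61d67799d504fd9cf33c2e92e5f4b93e14b9f988a")
```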
[hr]
Collected Data Breakdown:
I’ve just buttoned up my master’s thesis. My research topic was a geographical analysis of SSH brute force and dictionary attacks. Since this data is pretty awesome and was expensive to obtain (well, for a graduate student), I thought what the hell, why not release it to the world, right? In sum, there’s a shit ton of attack data and a bunch of malware samples. I aggregated the data from November 2017 to February 2018. There were a total of 6 honeypots, though in my research I only analyzed 5, for reasons. So what’s available in my data? Let me break it down for you.
The data was collected from 6 honeypots (Ubuntu 16.04 64bit) located in:
- Amsterdam (82.196.7.180)
- London (178.62.11.37)
- Frankfurt (46.101.217.143)
- Bangalore (139.59.4.28)
- San Francisco (165.227.53.76)
- Singapore (128.199.233.243)
Honeynet Architecture: Modern Honey Network Framework
Data Structure: JSON and CSV (originally stored in MongoDB)
Each honeypot ran a combination of Cowrie and p0f, which, from a high-level view, collect SSH and Telnet attacks and passively fingerprint the attacking systems. Since Cowrie is a medium-interaction honeypot, I also let the attackers in from time to time, subject to the credentials provided. This allowed two things to happen: 1. collecting command strings from the attacker (these are in the dataset), and 2. collecting malware samples that the attacker downloaded to the honeypot.
Malware Samples:
The collected malware samples are broken down by honeypot location. I haven’t really analyzed them yet, so I have no idea what’s in there. I suspect a bunch of botnet infection scripts and binaries. But who knows at this point.
[hr]
MongoDB Structure:
Long story short, there are a bunch of tables (collections, in MongoDB terms) that the MHN server creates for data storage. All of the main attack data is located within the HPfeeds and Sessions tables.
[one_half first]
Sessions Table:
- _id = Object ID, can be used as a primary key
- protocol = pcap, ssh, telnet, etc.
- source_ip
- source_port
- destination_port
- honeypot = p0f or cowrie
- timestamp (UTC)
[/one_half]
[one_half]
HPfeeds Table:
- _id
- timestamp
- payload = sub_json data:
  - peer_ip
  - host_ip
  - commands = if logged in, what commands
  - credentials
  - ttylog
  - peer port
  - host port
  - startTime
  - endTime
  - SSH Client
  - OS: operating system
- channel = cowrie or p0f
[/one_half]
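To make the Sessions layout a bit more concrete, here is a small sketch. The example document uses the keys from the list above, but every value in it is made up, and `count_by` is a helper name of mine, not something from MHN or PyMongo. It works on plain dictionaries, so it can be fed either from the JSON export or from a PyMongo cursor.

```python
from collections import Counter

# Illustrative document only: the keys follow the Sessions list above,
# the values are invented for the example.
example_session = {
    "protocol": "ssh",
    "source_ip": "203.0.113.7",
    "source_port": 50522,
    "destination_port": 22,
    "honeypot": "cowrie",
    "timestamp": "2017-11-01T00:00:00Z",
}

def count_by(records, field):
    """Tally records by one field; records that lack the field are skipped."""
    return Counter(r[field] for r in records if field in r)

# With PyMongo, the records could come straight from a cursor, for example:
#   count_by(db.session.find(), "protocol")
# where `db` is a pymongo database handle and `session` is the collection
# name (both assumed here).
```

Because the function only needs an iterable of dictionaries, the same call works for quick tallies by `honeypot` or `destination_port` as well.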
[hr]
Basic Data Analysis:
The data is in JSON and in CSV, but the easiest way to work with it is to load the JSON data into a MongoDB instance and pull only the data you want. Here are some of the things I did.
- Use the MongoDB Python API PyMongo.
- Get familiar with the Pandas API and NumPy
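- For IP-to-country (or city) lookups, pull a GeoLite database and use it with the Python pygeoip API. Note that pygeoip reads the legacy .dat format, not the newer GeoLite2 .mmdb files. Here is an example Python 3.6 function that does a city lookup for you:
[sourcecode language="python" wraplines="false" collapse="false"]
import pygeoip

def getCity(ip):
    # Load the legacy GeoLite City database into memory
    GEOIP = pygeoip.GeoIP("C:/GeoLiteCity.dat", pygeoip.MEMORY_CACHE)
    data = GEOIP.record_by_addr(ip)
    # No usable record means the address is not in the database
    if not data:
        return None
    return data['city']
[/sourcecode]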
- Here is an example of interfacing with the MongoDB and pulling records:
[sourcecode language="python" wraplines="false" collapse="false"]
import json
from colorama import Fore

# Defined elsewhere in the script: session and hpfeeds (PyMongo collections),
# csvWriter (a csv.writer), getCountry(), resultIndex (starting at 0), and the
# honeypot IP constants (BANGALORE, FRANKFURT, LONDON, SINGAPORE, SANFRANCISCO).
for sRecord in session.find({'honeypot': "cowrie"}):
    # Only run the data aggregation if we are looking at a cowrie honeypot
    resultIndex += 1
    hpfeedID = sRecord['hpfeed_id']
    id = sRecord['_id']
    dt = sRecord['timestamp']
    dt = str(dt).strip("{'$date': '")[:-2]
    dstIP = None
    # For Cowrie attacks, we need to pull the destination IP from hpfeeds
    try:
        dstIP = sRecord['destination_ip']
    except KeyError:
        # The session record has no destination_ip, so look up the matching hpfeeds record
        for hRecord in hpfeeds.find({'_id': hpfeedID}):
            if (hpfeedID == hRecord['_id']):
                tmp = hRecord['payload']
                dump = json.dumps(tmp)
                load = json.loads(dump)
                dstIP = load['hostIP']
    srcIP = sRecord['source_ip']
    honeypot = sRecord['honeypot']
    srcCountry = getCountry(srcIP)
    #srcCity = getCity(srcIP)
    # Print a progress message every 5000 records
    if (resultIndex != 0 and resultIndex % 5000 == 0):
        print(Fore.LIGHTMAGENTA_EX + '[+] ' + str(resultIndex) + ' lines have been written.' + Fore.RESET)
    # Write one CSV row per record, labelled with the honeypot location
    if (dstIP == BANGALORE):
        csvWriter.writerow([id, dt, dstIP, 'Bangalore', srcIP, srcCountry, honeypot])
    elif (dstIP == FRANKFURT):
        csvWriter.writerow([id, dt, dstIP, 'Frankfurt', srcIP, srcCountry, honeypot])
    elif (dstIP == LONDON):
        csvWriter.writerow([id, dt, dstIP, 'London', srcIP, srcCountry, honeypot])
    elif (dstIP == SINGAPORE):
        csvWriter.writerow([id, dt, dstIP, 'Singapore', srcIP, srcCountry, honeypot])
    elif (dstIP == SANFRANCISCO):
        csvWriter.writerow([id, dt, dstIP, 'San Francisco', srcIP, srcCountry, honeypot])
[/sourcecode]
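As a side note, the chain of location checks in the loop could also be written as a dictionary lookup. The sketch below is one way to do that, not what I actually ran: it assumes the destination IPs are the honeypot addresses listed earlier, and `HONEYPOT_LOCATIONS` and `build_row` are names invented for the example. It also includes Amsterdam, which the loop above leaves out.

```python
# Honeypot IP -> location, taken from the list of honeypots earlier in the post.
HONEYPOT_LOCATIONS = {
    "82.196.7.180": "Amsterdam",
    "178.62.11.37": "London",
    "46.101.217.143": "Frankfurt",
    "139.59.4.28": "Bangalore",
    "165.227.53.76": "San Francisco",
    "128.199.233.243": "Singapore",
}

def build_row(record_id, dt, dst_ip, src_ip, src_country, honeypot):
    """Return a CSV row for a known honeypot IP, or None for any other address."""
    location = HONEYPOT_LOCATIONS.get(dst_ip)
    if location is None:
        return None
    return [record_id, dt, dst_ip, location, src_ip, src_country, honeypot]

# Inside the loop this would replace the if/elif chain, roughly:
#   row = build_row(id, dt, dstIP, srcIP, srcCountry, honeypot)
#   if row is not None:
#       csvWriter.writerow(row)
```

Adding or dropping a honeypot then only means editing the dictionary rather than touching the loop.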
[hr]
Analysis Examples:
I’m not planning on going into a ton of detail on my research, not yet at least, but here are some examples of basic analysis.