Detecting IoT Devices in The Internet
Abstract— Distributed Denial-of-Service (DDoS) attacks launched from compromised Internet-of-Things (IoT) devices have shown how vulnerable the Internet is to large-scale DDoS attacks. Understanding the risks of these attacks requires learning about these IoT devices: where are they? how many are there? how are they changing? This paper describes three new methods to find IoT devices on the Internet: server IP addresses in traffic, server names in DNS queries, and manufacturer information in TLS certificates. Our primary methods (IP addresses and DNS names) use knowledge of servers run by the manufacturers of these devices. Our third method uses TLS certificates obtained by active scanning. We have applied our algorithms to a number of observations. With our IP-based algorithm, we report detections from a university campus over 4 months and from traffic transiting an IXP over 10 days. We apply our DNS-based algorithm to traffic from 8 root DNS servers from 2013 to 2018 to study AS-level IoT deployment. We find substantial growth (about 3.5×) in AS penetration for 23 types of IoT devices and a modest increase in device type density for ASes detected with these device types (at most 2 device types in 80% of these ASes in 2018). DNS also shows substantial growth in IoT deployment in residential households from 2013 to 2017. Our certificate-based algorithm finds 254k IP cameras and network video recorders from 199 countries around the world.

Index Terms— Internet-of-Things (IoT), measurement techniques.

I. INTRODUCTION

Mirai botnet used in these attacks has been estimated at 145k [33] and 100k [10]. Source code to the botnet was released [25], showing it targeted IoT devices with multiple vulnerabilities.

If we are to defend against IoT security threats, we must understand how many and what kinds of IoT devices are deployed. Our paper proposes three algorithms to discover the location, distribution and growth of IoT devices. We believe our algorithms and results could help guide the design and deployment of future IoT security solutions by revealing the scale of the IoT security problem (how wide-spread are certain IoT devices in the whole or in certain parts of the Internet?), the problem’s growth (how quickly do new IoT devices spread over the Internet?) and the distribution of the problem (which countries or autonomous systems have certain IoT devices?). Our goal here is to assess the scope of the IoT problem; improving defenses is complementary future work.

Our IoT detection algorithms can also help network researchers study the distribution and growth of target IoT devices and help IT administrators discover and monitor IoT devices in their network. As more every-day objects get connected to the Internet, our algorithms may even help understand the physical world by, for example, detecting and tracking network-enabled vehicles for crime investigation.

Our first contribution is to propose three IoT detection methods. Our two main methods detect IoT devices from
Authorized licensed use limited to: Indian Institute of Information Technology Kottayam. Downloaded on August 21,2024 at 07:44:37 UTC from IEEE Xplore. Restrictions apply.
2324 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 28, NO. 5, OCTOBER 2020
GUO AND HEIDEMANN: DETECTING IoT DEVICES IN THE INTERNET 2325
of any web content instead of place-holder content. Typically place-holder content is quite short. (For example, http://appboot.netflix.com shows the place-holder “Netflix appboot” and is just 487 bytes.) So we treat HTML text longer than 630 bytes as human-focused content. We determined this threshold empirically from HTTP and HTTPS content at 158 server domain names queried by our 10 devices (Table I).

We call the remaining server names device-facing manufacturer servers, or just device servers, because they are run by IoT manufacturers and serve devices only. We use device servers for detection.

Handling Shared Server Names: Some device server names are shared among multiple types of IoT devices from the same manufacturer and can cause ambiguity in detection. If different device types share the exact set of server names, then we cannot distinguish them and simply treat them as the same type (a device merge).

If different device types have partially overlapping sets of device server names, we cannot guarantee they are distinguishable. If we treat them as separate types, we risk false positives and confusing the two types. We avoid this problem with detection merge: when we detect device types sharing common server names, we conservatively report that we detect at least one of these device types. (Potentially we could look for unique device servers in each type; we do not currently do that.)

Handling Future Server Name Change: The server names that our devices (Table I) use are quite stable over 1 to 1.5 years (as shown in §IV-B). However, both our IP-based and DNS-based detection risk missing devices that get software updates that cause them to talk to new server names. We mitigate these potential missed detections by reporting that a device exists when we see a majority of server names for that device (both IP-based method §II-A.2 and DNS-based method §II-A.3). For the DNS-based method, we also propose a technique to discover new device server names during detection (§II-A.3).

2) IP-Based IoT Detection Method: Our first method detects IoT devices by identifying packet exchanges between IoT devices and device servers. For each device type, we track a device-type-to-server-name mapping: a list of device server names that this type of device talks to. We then define a threshold number of server names; we interpret the presence of traffic to that number of server names (identified by server IP) from a given IP address as indicating the presence of that type of IoT device.

Tracking Server IP Changes: We search for device servers by IP addresses in traffic, but we discover device servers by domain names in sample devices. We therefore need to track when the DNS resolution for a server name changes.

We assume server names are long-lived, but the IP addresses they use sometimes change.

We also assume server-name-to-IP mappings could be location-dependent.

We track changes of server-name-to-IP mappings by resolving server names to IP addresses every hour (frequent enough to detect possible DNS-based load balancing). To make sure IPs for detection are correct, we track server IPs across the same time period and at roughly the same geo-location as the measurement of network traffic under detection.

Completeness Threshold Selection: Some device servers may serve both devices and individuals (because we use a necessary condition to determine device-facing servers in §II-A.1 and risk mis-classifying human-facing manufacturer servers as device servers), and sometimes we might miss traffic to a server name due to observation duration or lost captures. We therefore set a threshold of server names required to indicate the presence of each IoT device type. This threshold is typically a majority, but not all, of the server names we observe a representative device talk to in the lab. (This majority-but-not-all threshold also mitigates potential detection misses caused by devices that start talking to new servers.)

Most devices talk to a handful of device server names (up to 20, from our laboratory measurements §III-A.1). For these types of devices, we require seeing at least 2/3 of the device server names to believe a type of IoT device exists at a given source IP address. We choose the 2/3 threshold because, for devices with 3 or more server names, requiring any larger fraction of server names would be equivalent to requiring all server names for some devices. For example, requiring at least 4/5 of the server names is equivalent to requiring all server names for devices with 3 to 4 device server names.

For devices that talk to many device server names (more than 20), we lower our threshold to 1/2. Typically these are devices with many functions whose manufacturer uses a large pool of server names. (For example, our Amazon_FireTV, as in Table I, has 41 device server names.) Individual devices will most likely talk to only a subset of the pool, at least over short observations.

Limitation: Although effective, IP-based detection faces two limitations. First, it cannot detect IoT devices in previously stored traces, since we usually do not know device server IPs in the past, and coverage of commercial historical DNS datasets can be limited [18]. Second, we assume we can learn the set of servers the IoT devices talk to. If we do not learn all servers during bootstrapping (§II-A.1), or if device behavior changes (perhaps due to a firmware update), we need to learn new servers. However, we cannot learn new device servers during IP-based detection because we find it hard to judge if an unknown IP is a device server, even with the help of reverse DNS and TLS certificates from that IP. These limitations motivate our next detection method.

3) DNS-Based IoT Detection Method: Our second method detects IoT devices by identifying the DNS queries that precede actual packet exchanges between IoT devices and device servers.

Strengths: This method addresses the two limitations of IP-based detection (§II-A.2). First, we can directly apply DNS-based detection to old network traces because server names are stable while server IPs can change. Second, we can learn new device server names during DNS-based detection by examining unknown server names queried over DNS by detected IoT devices and learning those that look like device servers (using rules in §II-A.1).

Limitations: This method requires observation of DNS queries between end-user machines and recursive DNS servers, limiting its use to locations that can see “under” recursive DNS resolvers. This method also works with recursive-to-authority DNS queries (see §III-B) when observations last longer than DNS caching, since then we see user-driven queries for server names even above the recursive. Detection
with recursive-to-authority DNS queries reveals the presence of IoT devices at the AS-level, since recursives are usually run by ISPs (Internet service providers [39]) for their users.

Method Description: Our DNS-based method has three components: detection, server learning and device splitting. Figure 1 illustrates this method’s overall workflow: it repeatedly conducts detections with the latest knowledge of IoT device server names, learns new device server names after each detection, and terminates when no new server names are learned (see the loop of “Detection” and “Server Learning” in Figure 1). This method also revises newly learned server names by device splitting if it suspects they are incorrect, as signaled by decreased detection after new server names are added (see “Device Splitting” in Figure 1).

Detection: Similar to §II-A.2, for each type of IoT device, we track a list of device server names that this type of device talks to. We interpret the presence of DNS queries, from a given IP address, for a number of device server names above a threshold (same as §II-A.2) as the presence of that IoT device type. (We call this IP an IoT user IP.)

To cover possible variants of known device servers, in detection, we treat digits in a server name’s sub-domain as matching any digit. We define the sub-domain of a URL as everything to the left of the URL’s domain (URL’s domain as defined in §II-A.1).

Server Learning: After each detection, we learn new device server names and use them in subsequent detections. Specifically, we examine unknown server names queried over DNS by IoT user IPs. If any unknown server names resemble device servers for a certain IoT device detected at a certain IoT user IP (judged by rules in §II-A.1), we extend this IoT device’s server name list with these unknown server names.

Device Splitting: We may incorrectly merge two types of devices that talk to different sets of servers if we only know their shared server names prior to detection.

Incorrect device merges can reduce detection rates. When we falsely merge different device types P1 and P2 as P, we risk learning new server names for the merged type P that P1 and P2 devices do not both talk to. This causes reduced detections of P in subsequent iterations, because we miss some P1 (or P2) devices by searching for the newly acquired server names that P1 (or P2) devices do not talk to.

Device splitting addresses this problem by reverting incorrect merges. If we detect fewer devices of type P at a certain IP after learning new server names, we know P is an incorrect merge of two different device types, P1 and P2, and that the new server names learned for P do not apply to both P1 and P2. We thus split P into P1 and P2, with P1 talking to P’s server names before the last server learning (without the newly learned server names) and P2 talking to P’s latest server names (with the new server names). We show an example of how device splitting reverts an incorrect device merge later in a controlled experiment (§IV-B).

B. Certificate-Based IoT Detection Method

Our third method detects IoT devices using HTTPS by active scanning for TLS certificates and identifying target IoT devices’ TLS certificates. This method thus covers HTTPS-Accessible IoT devices either with public IPs or behind NATs but forwarded to a public port. However, certificate scanning will miss devices behind NATs that lack public-facing IP addresses and IoT devices that do not use TLS.

Note that prior work has mapped TLS certificates to IoT devices, both by matching text (like “IP camera”) with certificates [36], and by using community-maintained annotations [9]. In comparison, our method uses multiple techniques to improve the accuracy of certificate matching, and also confirms that matched certificates come from HTTPS servers running in IoT devices.

We use existing public crawls of IPv4 TLS certificates. We first identify candidate certificates: the TLS certificates that contain target devices’ manufacturer names and (optionally) product information. Candidate certificates most likely come from HTTPS servers related to target devices, such as HTTPS servers run by their manufacturers and HTTPS servers run directly in them. We then identify IoT certificates: the candidate certificates that come from HTTPS servers running directly in target devices. Each IoT certificate represents an HTTPS-Accessible IoT device.

1) Identify Candidate Certificates: We identify candidate certificates for every target device by testing each TLS certificate against a set of text strings we associate with each device (called matching keys). (We describe where our list of target devices is found in §III-C.)

Matching Keys: We build a set of matching keys for each target device with the goal of suppressing false positives in finding candidate certificates. If a target device’s manufacturer does not produce any other type of Internet-enabled product (per product information on manufacturer websites), its matching key is simply the name of its manufacturer (called the manufacturer key). Otherwise, its matching keys are the manufacturer key plus its product type (like “IP Camera”). We also include IoT-specific sub-brands (if any). For example, “American Dynamics” is the sub-brand associated with the IP cameras manufactured by Tyco International.

We do two kinds of matching between a matching key K and a field S in a TLS certificate: Match means K is a substring of S (ignoring case); Good-Match means K is a Match of S and the character(s) adjacent to K’s match in S are neither letters nor digits. For example, “GE” is a Match but not a Good-Match of “Privilege” because the character adjacent to “GE” in “Privilege” is “e” (a letter). (We do not simply look for identical K and S because often S uses a prefix or suffix. For example, a certificate’s subject-organization field “Amcrest Technologies LLC” will be a Good-Match with manufacturer key “Amcrest”, but is not identical due to the suffix “Technologies LLC”.)

Requiring Good-Match for manufacturer keys reduces false positives caused by IoT manufacturer names being substrings of other company names. For example, the name of IP camera manufacturer “Axis_Communications” is a substring of telecom company “Maxis_Communications” but they are not a Good-Match.

We use the Match (not Good-Match) rule for other keys (product types and sub-brands) because they require greater flexibility. For example, product type “NVR” can be matched to a text string like “myNVR”.

Key Matching Algorithm: We test each TLS certificate (input) with matching keys from each target device. Specifically, we examine four subject fields in a TLS certificate
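The Match and Good-Match rules described under Matching Keys can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `is_match` and `is_good_match` are hypothetical, and the sketch assumes that "letters" and "digits" mean ASCII letters and digits.

```python
import re


def is_match(key: str, field: str) -> bool:
    """Match: key is a substring of field, ignoring case."""
    return key.lower() in field.lower()


def is_good_match(key: str, field: str) -> bool:
    """Good-Match: key occurs in field (ignoring case) and the characters
    directly before and after that occurrence are neither letters nor digits.
    Assumption: only ASCII letters and digits count as alphanumeric."""
    pattern = r"(?<![A-Za-z0-9])" + re.escape(key) + r"(?![A-Za-z0-9])"
    return re.search(pattern, field, flags=re.IGNORECASE) is not None


# Examples taken from the text above.
print(is_match("GE", "Privilege"), is_good_match("GE", "Privilege"))
print(is_good_match("Amcrest", "Amcrest Technologies LLC"))
print(is_good_match("Axis_Communications", "Maxis_Communications"))
print(is_match("NVR", "myNVR"))
```

Under these assumptions, manufacturer keys would be tested with `is_good_match` and product-type or sub-brand keys with `is_match`.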
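The completeness threshold of §II-A.2 (at least 2/3 of a device type's server names, lowered to 1/2 for device types with more than 20 names) can be illustrated with a short Python sketch. The helper names `min_names`, `required_names` and `detect` are hypothetical, the IP address is a documentation address, and rounding a fractional count up is our reading of "at least"; the paper does not spell out its rounding.

```python
from typing import Dict, Set


def min_names(total: int, num: int, den: int) -> int:
    """Smallest integer count c such that c / total >= num / den."""
    return -(-num * total // den)  # ceiling division on integers


def required_names(total: int) -> int:
    """Threshold from the text: 1/2 for more than 20 names, 2/3 otherwise."""
    return min_names(total, 1, 2) if total > 20 else min_names(total, 2, 3)


def detect(device_servers: Dict[str, Set[str]],
           seen_by_ip: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Report, per source IP, the device types whose threshold is met.
    seen_by_ip maps a source IP to the device server names it contacted."""
    found: Dict[str, Set[str]] = {}
    for ip, seen in seen_by_ip.items():
        for device, names in device_servers.items():
            if len(names & seen) >= required_names(len(names)):
                found.setdefault(ip, set()).add(device)
    return found


# A 4/5 threshold forces all names for devices with 3 or 4 names,
# while 2/3 still tolerates one missing name.
for total in (3, 4):
    print(total, min_names(total, 4, 5), min_names(total, 2, 3))

# A device type with a single server name is detected from one contact.
print(detect({"Foscam_IPCam": {"ddns.myfoscam.org"}},
             {"192.0.2.7": {"ddns.myfoscam.org"}}))
```

The last call also shows why device types with very few server names are prone to false positives: any host that contacts the single name meets the threshold.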
Fig. 5. Per-device-type AS penetrations (omitting 7 device types appearing in fewer than 10 ASes).
and deployment decline as these models are replaced by newer releases. To support this hypothesis, we estimate release dates for our device types and compare these estimated release dates with per-device-type AS penetration (number of ASes where each of our 23 device types is found) from 2013 to 2018 (Figure 5).

We estimate release dates for 22 of our 23 device types based on estimated release dates for our 26 detectable IoT devices (recall §II-A.1). (We exclude device type HP_Printer here because there are many HP wireless printers released over a wide range of years and it would be inaccurate to estimate the release date of this whole device type based on any HP_Printer devices.) If a device type includes more than one of our 26 detectable IoT devices (due to device merge), we estimate release dates for all these devices and use the earliest date for this device type. We estimate the release date for a given IoT device from one of three sources (ordered by priority, high to low): release date found online, device’s first appearance date and device’s first customer comment date on Amazon.com. We confirm all the 22 device types were released at least two years before 2017 (2 in 2011, 7 in 2012, 3 in 2013, 5 in 2014 and 5 in 2015), consistent with our claim that their sales are declining in 2017.

We compare estimated release dates with per-device-type AS penetration results (Figure 5) and find that detections of device types tend to plateau after release, consistent with product cycles and a decrease in sales and use of these devices. For example, Withings_SmartScale and Netatmo_WeatherStation, which were released in 2012, stop growing roughly after 2016-10-04 and 2017-04-11, suggesting product cycles of about 4 and 5 years. In comparison, TPLink-IPCam/Plug/LightBulb is the only device type released around 2016 (TPLink_IPCam on 2015-12-15, TPLink_Plug on 2016-01-01 and TPLink_Lightbulb on 2016-08-09) and its AS penetration continues to rise even on 2018-04-10, while the AS penetration of other device types (released between 2011 and 2015) roughly stops increasing by 2017.

Note that the plateau in AS penetration of our 23 device types does not contradict the constant growth of overall IoT deployment, because new IoT devices are constantly appearing.

Growth in Device Type Density: Having shown that our 23 IoT device types penetrate into about 3.5 times more ASes from 2013 to 2018, we next study how many IoT device types are found in these ASes: their device type density. We use device type density to show the “depth” of AS-level IoT deployment.

For every AS detected with at least one of our 23 IoT device types (referred to as IoT-AS for simplicity) from 2013 to 2018, we compute its device type density. We present the empirical cumulative distribution (ECDF) for device type densities of IoT-ASes from 2013 to 2018 in Figure 3.

Our first observation from Figure 3 is that from 2013 to 2018, not only are there 3.5 times more IoT-ASes (as shown by AS penetration), but the device type density in these IoT-ASes is also constantly growing.

Our second observation is that despite the constant growth, device type density in IoT-ASes is still very low as of 2018. In 2018, most (79%) of the IoT-ASes have at most 2 of our 23 device types, which is a modest increase compared to 2013, when a similar percentage (80%) of IoT-ASes had at most 1 of our 23 device types.

Our results suggest that for IoT devices, besides the potential to further grow in AS penetration (which would lead to growth in household penetration), there exists even larger potential to grow in device type density (which would lead to growth in device density). This unique potential for two-dimensional growth (penetration and density) sets IoT devices apart from other fast-growing electronic products in recent history, such as cell-phones and personal computers (PCs), which mostly grow in penetration (considering that while a person may only own 1 to 2 cell-phones and PCs, they could own many more IoT devices).

We rule out the possibility that the increasing AS penetration and device type density we observe are an artifact of the device servers we used in detection (measured around 2017) not applying to IoT devices in the past, by showing that IoT device-type-to-server-name mappings are stable over time in §IV-B.

ASes with Highest Device Type Density in 2018: We examined the top 10 ASes with the highest device type density in 2018 (detected with 8 to 14 of our 23 device types). Our first observation is that they are predominantly from the U.S. (4 ASes) and Europe (3 ASes). There are also 2 ASes from Eastern Asia (Korea and China) and 1 from Haiti. This distribution also consistently shows up in the top 20 ASes, with 10 ASes from the U.S. and 5 ASes from Europe. Our second observation is that these top 10 ASes are mostly major consumer ISPs in their operating regions, such as Comcast, Charter, AT&T and Verizon from the U.S., Korea Telecom from South Korea and Deutsche Telekom from Germany.

Estimating Actual Overall AS Penetration in 2018: Recall that the overall AS penetrations for our 23 device types reported in Figure 2 are under-estimations of the ground truth, because our DITL data is not complete (8 of 13 root letters
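The two AS-level metrics used in this analysis, AS penetration (number of ASes where a device type is found) and device type density (number of device types found in an AS), together with the ECDF of densities, can be sketched as follows. This is a minimal illustration on made-up data: the function names, the device type labels "A" to "C" and the AS numbers (taken from the range reserved for documentation) are all assumptions, and the input is assumed to be a list of (AS number, device type) detections.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Detection = Tuple[int, str]  # (AS number, device type)


def as_penetration(detections: Iterable[Detection]) -> Dict[str, int]:
    """Number of distinct ASes in which each device type is detected."""
    ases: Dict[str, set] = {}
    for asn, device in detections:
        ases.setdefault(device, set()).add(asn)
    return {device: len(found) for device, found in ases.items()}


def device_type_density(detections: Iterable[Detection]) -> Dict[int, int]:
    """Number of distinct device types detected in each AS."""
    types: Dict[int, set] = {}
    for asn, device in detections:
        types.setdefault(asn, set()).add(device)
    return {asn: len(found) for asn, found in types.items()}


def ecdf(values: Iterable[int]) -> List[Tuple[int, float]]:
    """Points (x, fraction of values <= x) of the empirical CDF."""
    values = list(values)
    counts = Counter(values)
    points, running = [], 0
    for x in sorted(counts):
        running += counts[x]
        points.append((x, running / len(values)))
    return points


# Toy input: four ASes, three hypothetical device types.
toy = [(64500, "A"), (64500, "B"), (64501, "A"),
       (64502, "A"), (64502, "A"), (64503, "C")]
density = device_type_density(toy)
print(as_penetration(toy))
print(density)
print(ecdf(density.values()))
```

Reading the ECDF at x = 2 gives the fraction of IoT-ASes with at most 2 device types, which is the kind of statistic quoted in the text for 2018.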
TABLE V
IoT DEPLOYMENT FOR ONE HOUSE IN CCZ DATA
TABLE VI
IPCAM DETECTION BREAK-DOWN
TABLE VIII
DETECTED IP CAMERAS AND NVRS BY COUNTRIES

We examine what devices are in each country to gain confidence in what we detect. Table VIII shows the top ten countries by number of detected devices, and breaks down how many devices are found in each country by manufacturer. (We show only manufacturers with at least 1000 global detections in Table VI.)

We find manufacturers prefer different operating regions. We believe these preferences are related to their business strategies. While Dahua, Foscam and Hikvision are global, the latter two show substantially more deployment in the U.S. and China, respectively. Amcrest (formerly Foscam U.S. [7]) is almost exclusive to the American market. The German company Mobotix, while present in Europe and America, seems completely absent from Asian markets.

IV. VALIDATION

We validate the accuracy of our two main methods by controlled experiments.

Validation requires ground truth, so we turn to controlled experiments with devices we own. We have 10 devices (Table I) from 7 different manufacturers and at different prices (from $5 to $85, in 2018). This diversity provides a range of test subjects, but the requirement to own the devices means our controlled experiment is limited in size. In principle, we could scale up testing by crowd-sourcing traffic captures, as shown in [20].

Our experiments also show our method correctly detects multiple devices from the same manufacturer (3 devices from Amazon and 2 from TP-Link, as in Table I) using device merge and detection merge (recalling §II-A.1).

A. Accuracy of IP-Based IoT Detection

We validate the correctness and completeness of our IP-based method by controlled experiments. We set up our experiment by placing our 10 IoT devices (Table I) and 15 non-IoT devices in a wireless LAN behind a home router. We assign static IPs to these 25 devices. We run tcpdump inside the wireless LAN to observe all traffic from the LAN to the Internet.

We run our experiments for 5 days to simulate 3 possible cases in real-world IoT measurements.

On days 1 to 2 (inactive days), we do not interact with IoT devices at all. So the first 2 days’ data simulates observations of unused devices and contains only background traffic from the devices, not user-driven traffic. On days 3 to 4 (active days), we trigger the device-specific functionality of each of the 10 devices, like viewing the cameras and purchasing items with Amazon_Button. The first 4 days’ data shows extended device use. On day 5, we reboot each device to see how a restart affects device traffic.

Our detection algorithm uses the same set of device server names that we describe in §III-A.1. We collect IPv4 addresses for these device server names (by issuing DNS queries every 10 minutes) during the same 5-day period at the same location as our controlled experiments.

Detection During Inactive Days: We begin with detection using the first 2 days of data when the devices are inactive. We detect more than half of the devices (6 true positives out of 10 devices); we miss the remaining 4 devices: Amazon_Button, Foscam_IPCam, Amcrest_IPCam, and Amazon_Echo (4 false negatives). We see no false positives. (All 15 non-IoT devices are detected as non-IoT.) This result shows that short measurements will miss some inactive devices, but background traffic from even unused devices is enough to detect more than half.

Detection During Inactive and Active Days: We next consider the first four days of data, including both inactive periods and active use of the devices. When observations include device interactions, we find all devices.

We also see one false positive: a laptop is falsely classified as Foscam_IPCam. We used the laptop to configure the device and change the device’s dynamic DNS setting. As part of this configuration, the laptop contacts ddns.myfoscam.org, a device-facing server name. Since the Foscam_IPCam has only one device server name, this overlap is sufficient to detect the laptop as a camera. This example shows that IoT devices that use only a few device server names are liable to false positives.

Applying Detection to All Data: When we apply detection to the complete dataset, including inactivity, active use, and reboots, we see the same results as without reboots. We conclude that user device interactions are sufficient for IoT detection; we do not need to ensure observations last long enough to include reboots.

Simulating Dynamic IPs: We next show how dynamically assigned IPs can inflate IoT detections (both at USC, §III-A.2 and at an IXP, §III-A.3).

We simulate dynamically assigned IPs by manually re-assigning random static IPs to our 25 devices every day during our 5-day experiment.

Our IP-based detection with these simulated 5-day dynamic-IP measurements finds 26 true positive IoT detections from 25 dynamic IPs. One IP is detected with two IoT devices because they were each assigned to this IP on a different day. Similar to our 4-day and 5-day static-IP detection, we see a false detection of a laptop as Foscam_IPCam, and no false negatives. This experiment showed 2.6× more IoT devices than we have (26 detections for 10 devices), less than the 5× inflation that would have occurred with each device being detected on a different IP each day.

We conclude that dynamic addresses can inflate device counts, and the degree depends on address lease times.

B. Accuracy of DNS-Based IoT Detections

We validate correctness and completeness of our DNS-based detection method by controlled experiments. We use the same setup, devices and device server names as in §IV-A. We also validate our claim that DNS-based detection can be applied to old network measurements by showing IoT device-type-to-server-name mappings are stable over time.
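The inflation figures in the dynamic-IP experiment of §IV-A follow from counting each distinct (IP, device) pair as one detection. The sketch below is a toy illustration of that counting, not the paper's measurement pipeline: the function names, device names and addresses are made up, and it assumes every device is detected at every address it holds, which gives an upper bound.

```python
from typing import Dict, List


def count_detections(daily_assignments: List[Dict[str, str]]) -> int:
    """Count distinct (IP, device) pairs over all days.
    Each list element maps device name -> IP address for one day."""
    pairs = set()
    for day in daily_assignments:
        for device, ip in day.items():
            pairs.add((ip, device))
    return len(pairs)


def inflation(detections: int, true_devices: int) -> float:
    """How many times more devices are reported than actually exist."""
    return detections / true_devices


# Two hypothetical devices over three days; "cam" keeps its address from
# day 2 to day 3, and 192.0.2.2 is held by both devices on different days.
days = [
    {"cam": "192.0.2.1", "plug": "192.0.2.2"},
    {"cam": "192.0.2.2", "plug": "192.0.2.3"},
    {"cam": "192.0.2.2", "plug": "192.0.2.1"},
]
print(count_detections(days), inflation(count_detections(days), 2))

# Figures from the text: 26 detections for 10 devices, against a worst case
# of 10 devices on a new address on each of 5 days.
print(inflation(26, 10), inflation(10 * 5, 10))
```

Longer address leases mean more repeated (IP, device) pairs and therefore less inflation, which matches the conclusion that the degree of inflation depends on lease times.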
not further characterize device types. In comparison, our three detection methods reveal both the existence and the type of IoT devices. Our IP-based and DNS-based methods cover general IoT devices talking to device servers rather than just Mirai-infected devices.
Work from the University of Maryland detects Hajime-infected IoT devices by measuring the public distributed hash table (DHT) that Hajime uses for C&C communication [19]. They characterize device types with Censys [9], but the types of most of their devices remain unknown. In comparison, our three detection methods detect the existence of known devices and always characterize their device types. Our IP-based and DNS-based methods cover general IoT devices talking to device servers rather than just those infected by Hajime.
Machine-Learning-Based Traffic Analysis
Work from Ben-Gurion University of the Negev (BGUN) detects IoT devices from LAN-side measurement by identifying their traffic flow statistics with machine learning (ML) models such as random forest and GBM [26], [27]. They use a wide range of features (over 300) extracted from the network, transport and application layers, such as the number of bytes and the number of HTTP GET requests.
Similarly, work from the University of New South Wales (UNSW) characterizes the traffic statistics of 21 IoT devices, such as packet rates and average packet sizes, and briefly discusses detecting these devices from the LAN side by identifying their traffic statistics with an ML model (random forest) [38].
Compared to the work from BGUN and UNSW, our work uses different features: packet exchanges with particular device servers and TLS certificates for IoT remote access, rather than traffic statistics or traffic flow features. While they use LAN-side measurement where traffic from each device can be separated by IP or MAC addresses, our IP-based and DNS-based methods can work with aggregated traffic from outside the NAT and cover IoT devices both on the public Internet and behind NAT. Not requiring LAN-side measurement also enables our IP-based and DNS-based methods to do Internet-wide detection. Our certificate-based method covers HTTPS-Accessible IoT devices on the public Internet by crawling TLS certificates in the IPv4 space.
Work from IBM transforms DNS names into embeddings, numeric representations that capture the semantics of DNS names, and classifies devices as either IoT or non-IoT based on the embeddings of their DNS queries using an ML model (multilayer perceptron) [24]. In comparison, our three methods not only detect the existence of IoT devices, but also categorize their device types. While they rely on LAN-side measurement to aggregate DNS queries by device IPs, our three methods do not require measuring from inside the LAN.
IPv4 Scanners
Shodan is a search engine that provides information (mainly service banners, the textual information describing services on a device, like certificates from an HTTPS TLS service) about Internet-connected devices on public IPs (including IoT devices) [36]. Shodan actively crawls all IPv4 addresses on a small set of ports to detect devices by matching text (like “IP camera”) against service banners and other device-specific information.
Censys is similar to Shodan but also supports community-maintained annotation logic that annotates the manufacturer and model of Internet-connected devices by matching text against banner information [9].
Compared to Shodan and Censys, our IP-based and DNS-based methods cover IoT devices using both public and private IP addresses, because we use passive measurements to look for signals that work with devices behind NATs. These two methods thus cover all IoT devices that exchange packets with device servers during operation. Our certificate-based method, while also relying on TLS certificates crawled from the IPv4 space, provides a better algorithm to match TLS certificates with IoT-related text strings (with multiple techniques to improve matching accuracy) and ensures that matched certificates come from HTTPS servers running in IoT devices.
Work from Concordia University infers compromised IoT devices by identifying the fraction of IoT devices detected by Shodan that send packets to allocated but unused IPs monitored by CAIDA [40]. Their focus on compromised IoT devices is different from our focus on general IoT devices. Due to their reliance on Shodan data, they cover devices with public IPs, while our IP-based and DNS-based methods cover devices on both public and private IPs. We also report IoT deployment growth over a much longer period (6 years) than they do (6 days).
Northeastern University infers devices hosting invalid certificates (including IoT devices) by manually looking up model numbers in certificates and inspecting web pages hosted on certificates’ IP addresses [5]. In comparison, our certificate-based method introduces an algorithm to map certificates to IoT devices and does not fully rely on manual inspection.
Work from the University of Michigan detects industrial control systems (ICS) by scanning the IPv4 space with ICS-specific protocols and watching for positive responses [28]. Unlike their focus on ICS-protocol-compliant devices and protocols, our approaches consider general IoT devices. Our approaches also use different measurements and signals for detection.
VI. CONCLUSION
Understanding the security threats of IoT devices requires knowledge of their location, distribution and growth. To help provide this knowledge, we propose two methods that detect general IoT devices from passive network measurements (IPs in network flows and stub-to-recursive DNS queries) with the knowledge of their device servers. We also propose a third method to detect HTTPS-Accessible IoT devices from their TLS certificates. We apply our methods to multiple real-world network measurements. Our IP-based algorithm reports detections from a university campus over 4 months and from traffic transiting an IXP over 10 days. Our DNS-based algorithm finds about 3.5× growth in AS penetration for 23 device types from 2013 to 2018 and a modest increase in device type density in ASes detected with these device types. Our DNS-based method also confirms substantial growth in IoT deployments at the household level in a residential neighborhood. Our certificate-based algorithm finds 254K IP cameras and NVRs from 199 countries around the world.
ACKNOWLEDGMENT
The authors would like to thank Arunan Sivanathan at the University of New South Wales for sharing their IoT
device data with us [38]. They thank Paul Vixie for providing historical DNS data from Farsight [35]. They also especially thank Mark Allman for sharing his CCZ DNS Transactions datasets [2] and for helping run our code on a partially unencrypted version of this dataset.
The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.
REFERENCES
[1] G. Acar, N. Apthorpe, N. Feamster, D. Y. Huang, Frank, and A. Narayanan. IoT Inspector Project from Princeton University. Accessed: Nov. 2019. [Online]. Available: https://iot-inspector.princeton.edu/
[2] M. Allman. (Jan. 2018). Case Connection Zone DNS Transactions. [Online]. Available: http://www.icir.org/mallman/data.html
[3] M. Antonakakis et al., “Understanding the mirai botnet,” in Proc. 26th USENIX Secur. Symp., 2017, pp. 1093–1110.
[4] CAIDA. Routeviews Prefix to AS Mappings Dataset. Accessed: Mar. 2019. [Online]. Available: https://www.caida.org/data/routing/routeviews-prefix2as.xml
[5] T. Chung et al., “Measuring and applying invalid SSL certificates: The silent majority,” in Proc. Internet Meas. Conf., 2016, pp. 527–541.
[6] Cloudflare. What is an IXP. Accessed: Nov. 2019. [Online]. Available: https://www.cloudflare.com/learning/cdn/glossary/internet-exchange-point-ixp/
[7] Dahua. Important Message from Foscam Digital Technologies Regarding US Sales and Service. Accessed: Jul. 2018. [Online]. Available: http://foscam.us/products.html/
[8] T. Dierks and E. Rescorla, The Transport Layer Security (TLS) Protocol, document RFC 4346, Internet Request For Comments, 2006.
[9] Z. Durumeric, D. Adrian, A. Mirian, M. Bailey, and J. A. Halderman, “A search engine backed by Internet-wide scanning,” in Proc. 22nd ACM SIGSAC Conf. Comput. Commun. Secur. (CCS), 2015, pp. 542–553.
[10] Dyn. Analysis of October 21 Attack. Accessed: Jul. 2018. [Online]. Available: http://dyn.com/blog/dyn-analysis-summary-of-friday-october-21-attack/
[11] K. Egevang and P. Francis, The IP Network Address Translator (NAT), document RFC 1631, Internet Request For Comments, 1994.
[12] Gartner. IoT Installed Base Forcast. Accessed: Mar. 2019. [Online]. Available: https://www.statista.com/statistics/370350/internet-of-things-installed-base-by-category/
[13] B. Gleeson, A. Lin, J. Heinanen, T. Finland, G. Armitage, and A. Malis, A Framework for IP Based Virtual Private Networks, document RFC 2764, Internet Request For Comments, 2000.
[14] GlobalInfoResearch. IP Cam Market Report. Accessed: Jul. 2018. [Online]. Available: https://goo.gl/254g2M
[15] GlobalInfoResearch. NVR Market Report. Accessed: Jul. 2018. [Online]. Available: https://goo.gl/sxQRis
[16] H. Guo and J. Heidemann. IoT Traces From 10 Devices we Purchased. Accessed: Jul. 2018. [Online]. Available: https://ant.isi.edu/datasets/iot/
[17] H. Guo and J. Heidemann, “Detecting IoT devices in the Internet (extended),” USC/ISI, Marina del Rey, CA, USA, Tech. Rep. ISI-TR-726B, 2018.
[18] H. Guo and J. Heidemann, “IP-based IoT device detection,” in Proc. Workshop IoT Secur. Privacy, 2018, pp. 36–42.
[19] S. Herwig, K. Harvey, G. Hughey, R. Roberts, and D. Levin, “Measurement and analysis of Hajime, a peer-to-peer IoT botnet,” in Proc. Netw. Distrib. Syst. Secur. Symp., 2019, pp. 1–15.
[20] D. Y. Huang, N. Apthorpe, G. Acar, F. Li, and N. Feamster, “IoT inspector: Crowdsourcing labeled network traffic from smart home devices at scale,” Sep. 2019, arXiv:1909.09848. [Online]. Available: https://arxiv.org/abs/1909.09848
[21] B. Krebs. Krebs Hit With DDoS. Accessed: Jul. 2018. [Online]. Available: https://krebsonsecurity.com/2016/09/krebsonsecurity-hit-with-record-ddos/
[22] P. Krzyzanowski. Understanding Autonomous Systems. Accessed: Nov. 2019. [Online]. Available: https://www.cs.rutgers.edu/~pxk/352/notes/autonomous_systems.html
[23] J. Kurkowski. Lib Tldextract. Accessed: Jul. 2018. [Online]. Available: https://pypi.python.org/pypi/tldextract
[24] F. Le, M. Srivatsa, and D. Verma, “Unearthing and exploiting latent semantics behind DNS domains for deep network traffic analysis,” in Proc. Workshop AI for Internet of Things, 2019, pp. 1–6.
[25] P. Loshin. (2016). Details Emerging on Dyn DDoS Attack. [Online]. Available: http://searchsecurity.techtarget.com/news/450401962/Details-emerging-on-Dyn-DNS-DDoS-attack-Mirai-IoT-botnet
[26] Y. Meidan et al., “ProfilIoT: A machine learning approach for IoT device identification based on network traffic analysis,” in Proc. SAC, 2017, pp. 506–509.
[27] Y. Meidan et al., “Detection of unauthorized IoT devices using machine learning techniques,” 2017, arXiv:1709.04647. [Online]. Available: https://arxiv.org/abs/1709.04647
[28] A. Mirian et al., “An Internet-wide view of ICS devices,” in Proc. 14th Annu. Conf. Privacy, Secur. Trust (PST), Dec. 2016, pp. 96–103.
[29] Motherboard. 1.5 Million Hijacked Cameras Make an Unprecedented Botnet. Accessed: Jul. 2018. [Online]. Available: https://motherboard.vice.com/en_us/article/8q8dab/15-million-connected-cameras-ddos-botnet-brian-krebs
[30] Mozilla. Public Suffix List. Accessed: Jul. 2018. [Online]. Available: https://www.publicsuffix.org/
[31] M. Müller, G. C. M. Moura, R. O. de Schmidt, and J. Heidemann, “Recursives in the wild: Engineering authoritative DNS servers,” in Proc. ACM Internet Meas. Conf., 2017, pp. 489–495.
[32] No-IP. Domain Names Provided by No-IP. Accessed: Jul. 2018. [Online]. Available: http://www.noip.com/support/faq/free-dynamic-dns-domains/
[33] OVH. DDoS didn’t Break VAC. [Online]. Available: https://www.ovh.com/us/news/articles/a2367.the-ddos-that-didnt-break-the-camels-vac
[34] SCIP. Belkin WeMo Switch Communications Analysis. Accessed: Jul. 2018. [Online]. Available: https://www.scip.ch/en/?labs.20160218
[35] Farsight Security. Passive DNS Historical Internet Database: Farsight DNSDB. Accessed: Jul. 2018. [Online]. Available: https://www.farsightsecurity.com/solutions/dnsdb/
[36] Shodan. Shodan Search Engine Front Page. Accessed: Jul. 2018. [Online]. Available: https://www.shodan.io/
[37] S. Siby, R. R. Maiti, and N. O. Tippenhauer, “IoTscanner: Detecting privacy threats in IoT neighborhoods,” in Proc. Workshop IoT Privacy, Trust, Secur., 2017, pp. 23–30.
[38] A. Sivanathan et al., “Characterizing and classifying IoT traffic in smart cities and campuses,” in Proc. IEEE Conf. Comput. Commun. Workshops (INFOCOM WKSHPS), May 2017, pp. 559–564.
[39] ThousandEyes. What is an ISP? Accessed: Nov. 2019. [Online]. Available: https://www.thousandeyes.com/learning/glossary/isp-internet-service-provider
[40] S. Torabi, E. Bou-Harb, C. Assi, M. Galluscio, A. Boukhtouta, and M. Debbabi, “Inferring, characterizing, and investigating Internet-scale malicious IoT device activities: A network telescope perspective,” in Proc. 48th Annu. IEEE/IFIP Int. Conf. Dependable Syst. Netw. (DSN), Jun. 2018, pp. 562–573.
[41] USC/LANDER. (May 19, 2015). FRGP Continuous Flow Dataset. [Online]. Available: http://www.isi.edu/ant/lander
[42] Wikipedia. Autonomous System (Internet). Accessed: Mar. 2019. [Online]. Available: https://en.wikipedia.org/wiki/Autonomous_system_(Internet)
[43] ZMap. ZMap 443 HTTPS SSL Full IPv4 Datasets. Accessed: Jul. 2018. [Online]. Available: https://censys.io/data/443-https-ssl_3-full_ipv4
Hang Guo received the B.S. degree from the Beijing University of Posts and Telecommunications in 2014 and the Ph.D. degree from the University of Southern California in 2020. His research interests include Internet traffic analysis, network security, and the Internet of Things (IoT). In 2020, he joined Microsoft Azure Team, as a Software Engineer.
John Heidemann (Fellow, IEEE) received the B.S. degree from the University of Nebraska-Lincoln in 1989, and the M.S. and Ph.D. degrees from the University of California at Los Angeles in 1991 and 1995, respectively. He is currently a Principal Scientist at the University of Southern California/Information Sciences Institute (USC/ISI) and a Research Professor at USC in computer science. At ISI, he leads the Analysis of Network Traffic (ANT) Lab, observing and analyzing Internet topology and traffic to improve network reliability, security, protocols, and critical services. He is a Senior Member of the ACM.