Extracting Cybersecurity Related Linked Data From Text: September 2013
All content following this page was uploaded by Tim Finin on 19 June 2022.
Abstract: The Web is typically our first source of information about new software vulnerabilities, exploits and cyber-attacks. Information is found in semi-structured vulnerability databases as well as in text from security bulletins, news reports, cybersecurity blogs and Internet chat rooms. It can be useful to cybersecurity systems if there is a way to recognize and extract relevant information and represent it as easily shared and integrated semantic data. We describe such an automatic framework that generates and publishes an RDF linked data representation of cybersecurity concepts and vulnerability descriptions extracted from the National Vulnerability Database and from text sources. A CRF-based system is used to identify cybersecurity-related entities, concepts and relations in text, which are then represented using custom ontologies for the cybersecurity domain and also mapped to objects in the DBpedia knowledge base. The resulting cybersecurity linked data collection can be used for many purposes, including automating early vulnerability identification, mitigation and prevention efforts.

Index Terms: cybersecurity, linked data, information extraction, ontology

I. INTRODUCTION

Cybersecurity is a critical concern as society has become highly interconnected and reliant on a global system of computers, communication networks and software systems. Cyber crime has become more professional with the emergence of increasingly powerful methods of intrusion and exploits. For example, cyber criminals targeted users of Skype, Facebook and Windows using multiple blackhole exploits in late 2012 [1]. Many systems are under threat from vulnerabilities that are known and publicly documented. One reason for this is that these systems are not patched on a regular basis. While information about known vulnerabilities and patches for them is publicly available online, much of it is provided as text that is suitable for security experts, but not easily understood or directly usable by automated security systems.

One of the best public resources of security information is the National Vulnerability Database (NVD) and its associated components, including the Common Vulnerabilities and Exposures (CVE), Common Weakness Enumeration (CWE) and Product Dictionary (CPE) datasets¹. These resources list vulnerabilities and exposures, categorize them by type and severity, provide common names and identifiers, include links to patches and other information and have details as short text descriptions. Significant amounts of key information, however, even in such detailed descriptions, remain only in unstructured text, such as the systems that are likely to be affected, the operating system environment in which the attack can occur, the versions of products affected, and the relationships between these entities. Vulnerabilities are also mentioned in various security bulletins and blogs, which typically are narrative descriptions that include the above-mentioned relationships, though they do not include any structured or semi-structured data. Combining and expressing these sources of information in a structured, semantic, machine-understandable format can help machines deal with possible “zero-day” attacks.

We describe an information extraction framework to extract cybersecurity-relevant entities, terms and concepts from the NVD and from unstructured text. These extracted concepts are then mapped and linked to related resources on the Web using the OWL ontology language [2] and represented as RDF linked open data [3]. Such a publicly available linked open data resource will help organizations uncover knowledge from multiple sources of cybersecurity-related data on the Web and support systems that automatically ingest, reason over and use the data to provide better cybersecurity.

II. BACKGROUND AND PREVIOUS WORK

Our approach combines two aspects of the problem. The first is extracting relevant information about new security vulnerabilities, attacks and events from text. The second is representing and integrating this information along with data extracted from the National Vulnerability Database as a linked data resource using custom ontologies in the Semantic Web languages RDF and OWL.

A. Information Extraction

Several repositories and security advisory sources address security changes and threat trends that might affect the overall security of a computer system. These sources can be used in a variety of ways to enhance the process of detection of an attack. NVD is a U.S. government repository of standards-based vulnerability management data represented using the Security Content Automation Protocol (SCAP) [4]. Information sources such as the NVD and IBM XFORCE² provide XML feeds that report vulnerabilities with varying degrees of detail.

¹ See [Link] [Link] [Link] and [Link]
² [Link]
Fig. 1. System architecture for extracting linked cybersecurity data from text

To the best of our knowledge, these repositories consolidate information present across multiple data sources, though they are manually monitored.

These dictionaries not only contain redundant or overlapping information, but also miss out on important concepts such as the means and consequence associated with attacks and the versioning of a software product. Similar information is available in cybersecurity blogs such as Krebs On Security³ and the Metasploit Blog⁴, but their content is unstructured text, which can lead to an information overload, especially during threat analysis of a system. Furthermore, analyzing and integrating multiple textual resources can become a cumbersome task for system administrators. Extracting actionable content from these informal sources and representing it as linked RDF data can enhance distribution of security information and the discoverability of security-related concepts.

³ [Link]
⁴ [Link]

More et al. [5], for example, demonstrate effective reasoning over such semantically rich data for a situation-aware intrusion detection system. Their framework requires a condensed source of web resources that provide meaningful information about the threat, and data sources that provide entities that map well into the ontology. Our approach provides automation to generate and update such a linked data resource that can be used to inform advanced intrusion detection and mitigation systems.

Mulwad et al. [6] describe a prototype system that analyzed relevant text snippets from the Web to generate assertions about vulnerabilities, attacks and threats. The system extracted concepts of interest using an SVM classifier and queried Wikitology [7], a knowledge base of entities from Wikipedia, Yago [8] and Freebase [9]. The classification mechanism and the spotted concepts were limited to the identification of two classes: the means and the consequence of an attack. We adopted an approach that uses a Conditional Random Field (CRF) algorithm trained with ground truth annotations [10] to identify and classify mentions of entities and concepts that goes beyond their simple approach in terms of precision and recall.

The quality of the concepts extracted from free text largely depends on the method applied for concept spotting. More et al. [5] used OpenCalais [11], an information extraction system designed to recognize general entities such as people, places and organizations. Because of its orientation toward general coverage, it was unable to identify many of the entities and concepts important for cybersecurity. Similar experiments were run on the NERD information extraction framework [12], which failed to identify relevant technical jargon from the given piece of security-related text. These annotation tools are designed to capture information based on a custom ontology which models people, places and organizations. The Stanford Named Entity Recognizer [13] also does not identify key cybersecurity concepts without proper feature filtering. Our approach introduces a cybersecurity entity and concept spotter that was primarily trained to identify entities (e.g., software products and operating systems) and concepts (e.g., denial of service and buffer overflow) which are related to computer security, threats and vulnerabilities in software products.

Khadilkar et al. [14] demonstrated the concept of using a semantic model to facilitate information representation and describe an ontology for the National Vulnerability Database. The ontology modeled information for software products and generic security concepts, though it is unable to characterize and capture information from unstructured sources of information. Undercoffer et al. [15] specify an ontological model for categorizing computer attacks that used taxonomic characteristics of an intrusion to be limited to specific classes and attributes centered on the target of an attack. Our framework consolidates information across different knowledge bases and carries out concept spotting for entities of interest, which can initiate characterization and understanding of the overall nature of the attack.

B. Linked Data

Linked data [16] enables publishing structured, machine-readable interpretation of heterogeneous sources of information. As defined by Bizer et al. [3], it is “a set of best practices for publishing and connecting structured data on the Web.” It focuses on interconnecting data and resources on the Web by defining relations between ontologies, schemas and/or directly linking the published data to other existing resources on the Web.

This approach can be applied to the cybersecurity domain by building an RDF data store for vulnerabilities, severity metrics, affected products and any remedial information. Relevant information about these concepts from other sources on the Web can be interlinked. With NVD data represented as RDF linked data, the task of finding all vulnerabilities pertaining to a single product version is reduced to the task of traversing the product-vulnerability dependency graph.

Additional contextual information obtained through establishing meaningful semantic links can help consolidate available information regarding a security threat. Moreover, the data representation for this interlinking will be in a structured, machine-readable format enabling faster, automated data consumption. The linked data resource can help improve the discoverability of data through the use of SPARQL [17] queries, SPARQL endpoints and resolvable URIs. It also helps in use cases such as distinguishing relevant vulnerabilities based on a product term or version. Such an interlinked corpus of data will enable stakeholders to share security-related information in a single resource, create business intelligence, support automated decision making systems and thereby speed up the exchange and digestion of information across different organizations.

III. SYSTEM ARCHITECTURE

Figure 1 shows the organization of our system, which is divided into three major components.
1) A CRF-based cybersecurity entity and concept spotter that identifies relevant concepts and entities from text
2) An ontology-based RDF triple generator that generates triples based on extracted information provided by the entity and concept spotter
3) A link generator that uses DBpedia Spotlight [18] to link extracted entities and concepts to DBpedia resources and aligns them with our cybersecurity-specific vocabulary.
In the following sections, these components will be described in detail.

A. Cybersecurity Entity and Concept Spotter

In order to extract relevant information from text, we developed an entity and concept spotter that identifies important entities and concepts in a given piece of text. This was done using the general implementation of the conditional random field (CRF) algorithm provided by the Stanford Named Entity Recognizer, with a set of features for proper identification of concepts from the input text. We analyzed several cybersecurity-related blogs, security bulletins and CVE descriptions and identified a set of key classes that are relevant in terms of data representation of a vulnerability. We identified the following seven classes of relevance:
1) Software (e.g. Microsoft .NET Framework 3.5)
   a) Operating System (e.g. Ubuntu 10.4)
2) Network Terms (e.g. SSL, IP Address, HTTP)
3) Attack
   a) Means: Way to attack (e.g. Buffer overflow)
   b) Consequences: Final result of an attack (e.g. Denial of Service)
4) File Name (e.g. [Link])
5) Hardware (e.g. IBM Mainframe B152)
6) NER Modifier: This always follows Software or OS and helps in identifying software version information.
7) Other Technical Terms: Technical terms that cannot be classified in any of the above mentioned classes.

Each of these classes was chosen to represent key aspects in identification and characterization of the attack. The following described classes are most notable. Network Terms was identified as an important class since most attacks now use network technology. Thus it is important to extract relevant terms in text so that information regarding networks can be identified. The idea behind modeling the Attack class came from the work of Undercoffer et al. [15]. An Attack can be further classified as a Means, which helps to identify a method of an attack, or as a Consequence that describes the final result of an attack. For example, “buffer overflow” is considered to be an instance of a Means, since it is not an attacker’s final goal, but merely a step to achieve a desired consequence, such as a “denial of service.”

Whether a phrase is considered to be an instance of a Means or Consequence is not always clear in a given text. We instructed annotators to use their discretion during annotation. When it was difficult to decide between them for a phrase, it was tagged as an Attack Class. In analyzing the gold standard annotation data we found that the inter-annotator agreement for these two subclasses was lower than for all of the other classes. In this experiment, we took a random data sample from our corpus and asked two annotators to annotate the data for four classes (Software Products, Operating System, Means and Consequences). We found the agreement between the annotators to be over 90% for Software Products and Operating System. For Consequences, the agreement was 75%, while for Means it was 52%.

The NER Modifier class also deserves some explanation. Understanding the version or versions of a software product being discussed is important. In the text,

“This vulnerability is present in Adobe Acrobat X and earlier versions...”

the phrase “and earlier versions” indicates that all Adobe Acrobat versions before version 10 are also vulnerable to the threat. These words hold key information about other versions that are vulnerable. The NER Modifier class identifies these terms. It was observed that such terms were generally described immediately before or after a Software term or an Operating System term. Identifying these pieces of text aids the identification of product versions that may be susceptible to the vulnerability, though they are not documented accordingly.

Based on these classes, our extraction framework was trained using the Stanford NER [19], a CRF-based named entity recognition framework that is pre-trained to identify entities such as people, places and organizations. It includes a large feature set that can be customized to train a general implementation of a CRF model. We chose a training dataset consisting of over 30 security blogs, 240 CVE descriptions and 80 official security bulletins from Microsoft and Adobe. The data corpus [20] was manually annotated by twelve Computer Science graduate students, who had a fair understanding of cybersecurity-related terms, concepts and technical jargon. We developed a custom application to simplify the annotation process using the BRAT rapid annotation framework [21], [22].
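As an illustration of how annotated text could be prepared for CRF training, the sketch below serializes a tokenized sentence and its annotated spans into one tab-separated token and label pair per line, which is the column layout Stanford NER is commonly configured to read. The label names (SOFTWARE, MODIFIER), the background label "O" and the helper name spans_to_tsv are assumptions made for this example; the text does not give the labels used in the actual corpus.

```python
from typing import List, Tuple

# Assumed background label for tokens outside any annotated span.
BACKGROUND = "O"


def spans_to_tsv(tokens: List[str], spans: List[Tuple[int, int, str]]) -> str:
    """Serialize tokens as 'token<TAB>label' lines.

    spans holds (start, end, label) token indices with end exclusive.
    Label names are hypothetical stand-ins for the classes in Section III-A.
    """
    labels = [BACKGROUND] * len(tokens)
    for start, end, label in spans:
        for i in range(start, end):
            labels[i] = label
    return "\n".join(f"{tok}\t{lab}" for tok, lab in zip(tokens, labels))


# The Adobe Acrobat sentence quoted in the text, split on whitespace.
tokens = "This vulnerability is present in Adobe Acrobat X and earlier versions".split()
# Hypothetical annotation: a Software mention followed by an NER Modifier.
spans = [(5, 8, "SOFTWARE"), (8, 11, "MODIFIER")]

print(spans_to_tsv(tokens, spans))
```

In this layout the modifier phrase "and earlier versions" sits directly after the Software tokens, matching the observation that such terms appear next to a Software or Operating System term.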
Fig. 2. A high level sketch of the IDS ontology

Feature Set Engineering: Feature set selection is a critical task in training an NER system. Though the Stanford NER provides an extensive selection of applicable features, filtering a subset that can capture all the relevant information pertaining to the cybersecurity domain is a tedious task. Feature selection is important, as applying all of the available features to the training and test data will not only slow down the annotation process, but also diminish the quality of results. Feature selection for our cybersecurity entity and concept spotter was carried out manually by analyzing the text and checking which features would be suitable. We selected a set of features that performed well for our analysis. The features that were used to train this system are: useTaggySequences, useNGrams, usePrev, useNext, maxNGramLeng, useWordPairs and gazette. A detailed discussion on our cybersecurity entity and concept spotter can be found in Lal et al. [23].

B. The IDS Ontology

We use the IDS ontology⁵, partially depicted in Figure 2, to represent concepts and entities that are relevant to the cybersecurity domain. This vocabulary was originally developed by Undercoffer et al. [15] and further enhanced by More et al. [5] and by this effort. The ontology is expected to continue to evolve to cover additional concepts. We extended the ontology to model relations that capture the NVD schema structure and the security exploit concepts extracted by the NER. The new key classes defined in the ontology, specific to the entities which are part of the NVD dataset, include Vulnerability, Product and Weakness.

⁵ [Link]

Vulnerability: A vulnerability is an important class in the ontology, as each entry in the NVD is identified and documented based on a CVE number. The CVE number is a unique identifier for a vulnerability description provided by MITRE, on an incremental basis for each year. All information related to a particular identified vulnerability is associated with the CVE ID, including the list of affected products (identified based on their unique Common Platform Enumeration (CPE) name), the Web resources where it was first documented, and the severity metrics. The Vulnerability class hence is defined to have corresponding relationships with all other classes which are used to model entities that are part of an NVD entry.

Product: The Product class models the hardware and software products that are affected by a vulnerability. The Software subclass is further classified as an Operating System or Web Browser to correctly classify operating systems and Web browsers, apart from a generic “application” tag. The affected product information described in an NVD entry is limited to a list of product names described for a particular vulnerability. This information is incorporated using the CPE format, which includes version granularity. The affectsProduct relationship models a one-to-many mapping between the vulnerability identifier and the list of affected products. Additional information about the affected products in the NVD entry is extracted using our cybersecurity entity and concept spotter.

Weakness: An NVD entry contains a unique Common Weakness Enumeration (CWE) identifier that classifies a vulnerability based on a hierarchy of attack classes modeled to generalize different attack signatures. For example, Cross-site scripting (XSS) is a subclass of Injection, which is a subclass of Invalid Input. The severity score for a CVE ID is derived from parameters specified for the corresponding CWE ID. The Weakness class is thus included to extract more information regarding the metrics used to score the vulnerability’s severity, by which the means for addressing a threat will be refined and enhanced. In addition, we define classes for each concept that is spotted by our classifier such as Network Terms (IP address, HTTP), Means (Buffer Overflow), and Consequence (Denial of Service).

All the concepts from the IDS vocabulary are aligned with concepts identified in the DBpedia ontology, by assigning a relevant DBpedia resource, thereby resolving the ambiguity of entities mapped in our ontology.

C. RDF Representation of NVD

Applying semantic web technologies to represent the data provided by the NVD dataset is useful for semantic analysis of vulnerabilities and exploits. However, correlating this data to the existing concepts on the Web and reasoning over such a corpus is vital to make this information available to different applications, front-end services and data consumers (e.g., security practitioners and system administrators). Semantics allow machine interpretation of links and relations between different properties of a vulnerability. Interlinking leads to an integrated and well-connected data corpus, available via an endpoint for advanced applications such as semantic search and vulnerability statistics.

The NVD provides XML feeds for vulnerabilities that are published in a particular year. The NVD datasets are updated immediately with raw information whenever a new vulnerability is reported to the CVE repository, and iterated to a valid, confirmed source after analysis.

Our RDF-generation platform ingests XML feeds from the NVD dataset and generates RDF triples via an Extensible
Stylesheet Language Transformation and the Jena RDF API [24]. The system includes primary attributes included directly in the NVD schema, as well as advanced properties fetched from the sources described in that schema. For example, an NVD entry contains the CWE ID for the weakness class it belongs to. The CWE schema includes attributes such as the Access Vector, Access Complexity and Authentication. These attributes are used to calculate the severity score for a threat. We observe that vulnerabilities with the same combination of these features occur under the same context or running environment.

D. Linked Cybersecurity Data

Establishing the relations between security exploit terms and identifiers that uniquely identify these concepts is essential to data integration. Actually linking instances and concepts with resources on the Web is challenging. After RDF instances are generated from the properties provided by the NVD schema, the link generation component of our framework connects the security concepts extracted from the vulnerability descriptions to the existing dereferenceable resources on the Web. Each NVD entry includes a short summary of the vulnerability description, which is essentially unstructured text. This module annotates security-related terms from the vulnerability description and maps them to corresponding DBpedia resources using DBpedia Spotlight, an annotation tool for finding mentions of DBpedia resources in free text. DBpedia Spotlight provides flexibility to configure annotations to specific use cases, through quality metrics such as topical pertinence and disambiguation confidence. Binding the DBpedia references to the identified security concepts will enhance association of our linked data resource with other instances in the Linked Open Data cloud.

Entities with valid (contextual) resources in DBpedia are annotated based on adequate tuning of the confidence and support metrics. After experimentation with these parameters over our dataset, we selected a confidence of 0.3 and a support of 20 for generating DBpedia links for a specific vulnerability description.

The annotations and subsequent linkages provided by DBpedia Spotlight are neither final nor complete. For a given piece of text, the DBpedia Spotlight API returns the sets of annotated terms and corresponding DBpedia resources. However, the annotation does not provide the corresponding class from the DBpedia ontology that the resource belongs to. We employ our cybersecurity entity and concept spotter to map the security exploit concepts to appropriate classes from the IDS vocabulary. The NVD descriptions (vuln summary) are passed through the concept spotter, which identifies relevant terms, assigns a class label, and returns a set of <Concept, Class> tuples for the description. The Concept terms are then passed through DBpedia Spotlight. The annotated terms from DBpedia Spotlight are matched against the entities identified by our system using a string comparison. The corresponding DBpedia resource for the matched concept is assigned a class value, based on the <Concept, Class> pairs. These resources are then mapped with an appropriate object property from the IDS vocabulary. The choice of DBpedia Spotlight as a link generation tool, and the precision of the concept extraction and linking component, are described in detail in Joshi et al. [25].

The IDS vocabulary models key aspects of a cyber attack which are not represented precisely in the DBpedia ontology. For example, the terms Buffer Overflow and Denial of Service are aptly represented as “Means” and “Consequence” respectively in the IDS vocabulary. These concepts are highly specific to a domain and hence not modeled in the DBpedia ontology.

Figure 3 shows a sample NVD entry which specifies the CVE identifier for the vulnerability description, together with the list of affected products with Common Platform Enumeration (CPE) names, the Weakness identifier, and the source where the vulnerability was documented. We extract information from this data to generate machine-understandable assertions in RDF, as shown in Figure 4. We use the IDS ontology to interpret key security concepts such as the vulnerability sources and severity metrics. Besides modeling semi-structured information, our framework extracts relevant DBpedia resources from the text description, such as Arbitrary code execution, and maps them to appropriate relationships (hasConsequence) from the IDS vocabulary.

Based on the relationships established with the linked concepts, we can retrieve vulnerabilities and attack descriptions pertaining to a specific product version, those affected by a specific means (Buffer Overflow), or those attacks that are carried out under the same operating environment. We can query such a knowledge base via SPARQL queries to obtain statistics on vulnerability trends, and can view the past history associated with a vulnerability or a particular software product. A triple store of such condensed information provides a rich linked data resource that can be used for semantic analysis of vulnerabilities.

The NVD datasets provide an RSS data feed on all recent CVE vulnerabilities. These immediate data sources can be represented as machine-understandable assertions as shown above. Such RDF assertions can be added to the triple store, and can help in applications such as a situation-aware intrusion detection system that can consume linked data to generate rules and alerts on possible threats. In the future, we plan to extend the concept spotting system into an information extraction framework that is not limited to the NVD dataset and its auxiliaries. The proposed system will extract concepts from free text, find relationships between entities spotted in the text, make assertions about them based on a specific heuristic and publish them to the linked cybersecurity data resource.

IV. SYSTEM EVALUATION AND CHALLENGES

The focus in this paper has been on the problem of extracting cybersecurity concepts, entities and relations and generating linked data representations of them. In order to generate a quality linked data resource that captures all relevant security information from within a text description, the cybersecurity
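The matching step in Section III-D, where terms annotated by DBpedia Spotlight are compared by string against the spotter's <Concept, Class> tuples, can be sketched roughly as follows. This is only an illustration: the function name link_concepts, the case-insensitive comparison, and all sample terms, class labels and resource names are assumptions, and no real DBpedia Spotlight call is made.

```python
from typing import List, Tuple


def link_concepts(
    concept_classes: List[Tuple[str, str]],
    spotlight_annotations: List[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Pair each Spotlight annotation with the class assigned by the spotter.

    concept_classes: (concept text, IDS class) tuples from the concept spotter.
    spotlight_annotations: (surface form, DBpedia resource) pairs.
    Matching is a plain string comparison after trimming and lowercasing,
    which is an assumption; the text only says "a string comparison".
    """
    class_by_term = {c.strip().lower(): cls for c, cls in concept_classes}
    linked = []
    for surface, resource in spotlight_annotations:
        cls = class_by_term.get(surface.strip().lower())
        if cls is not None:
            linked.append((surface, cls, resource))
    return linked


# Hypothetical spotter output for one vulnerability summary.
spotter_output = [
    ("buffer overflow", "Means"),
    ("execute arbitrary code", "Consequence"),
    ("media file", "OtherTechnicalTerms"),
]
# Hypothetical Spotlight output for the same summary.
spotlight_output = [
    ("Buffer overflow", "dbpedia:Buffer_overflow"),
    ("media file", "dbpedia:Media_file"),
    ("Microsoft", "dbpedia:Microsoft"),
]

linked = link_concepts(spotter_output, spotlight_output)
for surface, cls, resource in linked:
    print(surface, cls, resource)
```

Terms that only one of the two tools recognizes, such as "Microsoft" or "execute arbitrary code" in this sample, are dropped, so the result contains only resources that carry both a DBpedia link and an IDS class.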
<?xml version="1.0" encoding="UTF-8"?>
<nvd xmlns:vuln="[Link]"
     xmlns:cvss="[Link]">
  <entry id="CVE-2012-0150">
    <vuln:vulnerable-software-list>
      <vuln:product>cpe:/o:microsoft:windows_vista::sp2:x64</vuln:product>
      <vuln:product>cpe:/o:microsoft:windows [Link]x86</vuln:product>
      <vuln:product>cpe:/o:microsoft:windows_7::sp1:x86</vuln:product>
      <vuln:product>cpe:/o:microsoft:windows_vista::sp2</vuln:product>
    </vuln:vulnerable-software-list>
    <vuln:cve-id>CVE-2012-0150</vuln:cve-id>
    <vuln:cvss>
      <cvss:base_metrics>
        <cvss:score>9.3</cvss:score>
        <cvss:access-vector>NETWORK</cvss:access-vector>
        <cvss:access-complexity>MEDIUM</cvss:access-complexity>
        <cvss:authentication>NONE</cvss:authentication>
      </cvss:base_metrics>
    </vuln:cvss>
    <vuln:cwe id="CWE-119" />
    <vuln:references xml:lang="en" reference_type="VENDOR_ADVISORY">
      <vuln:source>MS</vuln:source>
      <vuln:reference href="[Link]" xml:lang="en">MS12-013</vuln:reference>
    </vuln:references>
    <vuln:summary>Buffer overflow in [Link] in Microsoft
      Windows Vista SP2, Windows Server 2008 SP2, R2, and R2 SP1,
      and Windows 7 Gold and SP1 allows remote attackers to execute
      arbitrary code via a crafted media file, aka "[Link]
      Buffer Overflow Vulnerability."
    </vuln:summary>
  </entry>
</nvd>

@prefix rdfs: <[Link]> .
@prefix rdf: <[Link]> .
@prefix ebqids: <[Link]> .
@prefix dbpedia: <[Link]> .
<[Link]>
    ebqids:cveID "[Link]" ;
    ebqids:cweID "[Link]" ;
    ebqids:affectsProduct "dbpedia:Windows_Vista" ,
        "dbpedia:Windows_7" ;
    ebqids:summary "Buffer overflow in [Link] in Microsoft Windows Vista SP2, Windows Server 2008 SP2, R2, and R2 SP1, and Windows 7 Gold and SP1 allows remote attackers to execute arbitrary code via a crafted media file, aka \"[Link] Buffer Overflow Vulnerability.\"" ;
    ebqids:hasAccessComplexity "MEDIUM" ;
    ebqids:hasAccessVector "NETWORK" ;
    ebqids:hasAuthentication "NONE" ;
    ebqids:hasSeverityScore "9.3" ;
    ebqids:hasVulnerabilitySource "[Link]" ;
    ebqids:hasMeans "dbpedia:Buffer_overflow" ;
    ebqids:hasConsequence "dbpedia:Arbitrary_code_execution" ;
    ebqids:hasTerms "[Link] file" ,
        "[Link] library" ,
        "[Link] (computing)" .

Fig. 4. Turtle representation of extracted information
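To make the step from an NVD entry to RDF-style assertions more concrete, the following sketch walks a small NVD-like XML fragment and emits (subject, property, value) triples. The paper's platform uses an XSLT transformation and the Jena RDF API; this example swaps in Python's standard xml.etree.ElementTree purely for illustration. The namespace URIs (urn:example:vuln, urn:example:cvss) are placeholders because the real ones are elided in the source, and the element-to-property mapping is an assumption that reuses property names appearing in the Turtle listing.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample modeled on the NVD entry above; namespace URIs are placeholders.
SAMPLE = """<nvd xmlns:vuln="urn:example:vuln" xmlns:cvss="urn:example:cvss">
  <entry id="CVE-2012-0150">
    <vuln:vulnerable-software-list>
      <vuln:product>cpe:/o:microsoft:windows_vista::sp2:x64</vuln:product>
      <vuln:product>cpe:/o:microsoft:windows_7::sp1:x86</vuln:product>
    </vuln:vulnerable-software-list>
    <vuln:cvss>
      <cvss:base_metrics>
        <cvss:score>9.3</cvss:score>
        <cvss:access-vector>NETWORK</cvss:access-vector>
      </cvss:base_metrics>
    </vuln:cvss>
    <vuln:cwe id="CWE-119" />
  </entry>
</nvd>"""

# Assumed mapping from element local names to IDS property names.
TEXT_PROPERTIES = {
    "product": "affectsProduct",
    "score": "hasSeverityScore",
    "access-vector": "hasAccessVector",
}


def local(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def entry_to_triples(entry: ET.Element) -> list:
    """Return (subject, property, value) tuples for one entry element."""
    subject = entry.get("id")
    triples = []
    for node in entry.iter():
        name = local(node.tag)
        if name in TEXT_PROPERTIES:
            triples.append((subject, TEXT_PROPERTIES[name], (node.text or "").strip()))
        elif name == "cwe":
            triples.append((subject, "cweID", node.get("id")))
    return triples


root = ET.fromstring(SAMPLE)
triples = [
    t for e in root.iter() if local(e.tag) == "entry" for t in entry_to_triples(e)
]
for t in triples:
    print(t)
```

Matching on local element names keeps the sketch independent of the exact namespace URIs, which would need to be replaced with the real NVD schema namespaces in practice.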