Extracting Cybersecurity Related Linked Data From Text: September 2013

The paper presents an automatic framework for extracting cybersecurity-related linked data from both structured and unstructured text sources, including the National Vulnerability Database and various online security bulletins. It utilizes a Conditional Random Field (CRF) system to identify relevant entities and concepts, which are then represented in RDF linked data format to enhance the discoverability and integration of cybersecurity information. This framework aims to support automated systems in vulnerability identification and mitigation efforts by providing a structured, machine-readable resource for cybersecurity data.

Extracting Cybersecurity Related Linked Data from Text

Conference Paper · September 2013

DOI: 10.1109/ICSC.2013.50

Extracting Cybersecurity Related
Linked Data from Text
Arnav Joshi, Ravendar Lal, Tim Finin and Anupam Joshi
Computer Science and Electrical Engineering
University of Maryland, Baltimore County, Baltimore, MD 21250 USA
{arnavj1, rlal1, finin, joshi}@[Link]

Abstract: The Web is typically our first source of information about new software vulnerabilities, exploits and cyber-attacks. Information is found in semi-structured vulnerability databases as well as in text from security bulletins, news reports, cybersecurity blogs and Internet chat rooms. It can be useful to cybersecurity systems if there is a way to recognize and extract relevant information and represent it as easily shared and integrated semantic data. We describe such an automatic framework that generates and publishes an RDF linked data representation of cybersecurity concepts and vulnerability descriptions extracted from the National Vulnerability Database and from text sources. A CRF-based system is used to identify cybersecurity-related entities, concepts and relations in text, which are then represented using custom ontologies for the cybersecurity domain and also mapped to objects in the DBpedia knowledge base. The resulting cybersecurity linked data collection can be used for many purposes, including automating early vulnerability identification, mitigation and prevention efforts.

Index Terms: cybersecurity, linked data, information extraction, ontology

I. INTRODUCTION

Cybersecurity is a critical concern as society has become highly interconnected and reliant on a global system of computers, communication networks and software systems. Cyber crime has become more professional with the emergence of increasingly powerful methods of intrusion and exploits. For example, cyber criminals targeted users of Skype, Facebook and Windows using multiple blackhole exploits in late 2012 [1]. Many systems are under threat from vulnerabilities that are known and publicly documented. One reason for this is that these systems are not patched on a regular basis. While information about known vulnerabilities and patches for them is publicly available online, much of it is provided as text that is suitable for security experts, but not easily understood or directly usable by automated security systems.

One of the best public resources of security information is the National Vulnerability Database (NVD) and its associated components, including the Common Vulnerabilities and Exposures (CVE), Common Weakness Enumeration (CWE) and Product Dictionary (CPE) datasets¹. These resources list vulnerabilities and exposures, categorize them by type and severity, provide common names and identifiers, include links to patches and other information and have details as short text descriptions. Significant amounts of key information, however, even in such detailed descriptions, remain only in unstructured text, such as the systems that are likely to be affected, the operating system environment in which the attack can occur, the versions of products affected, and the relationships between these entities. Vulnerabilities are also mentioned in various security bulletins and blogs, which typically are narrative descriptions that include the above-mentioned relationships but do not include any structured or semi-structured data. Collating and expressing these sources of information in a structured, semantic, machine-understandable format can help machines deal with possible “zero-day” attacks.

¹ See [Link] [Link] [Link] and [Link]

We describe an information extraction framework to extract cybersecurity-relevant entities, terms and concepts from the NVD and from unstructured text. These extracted concepts are then mapped and linked to related resources on the Web using the OWL ontology language [2] and represented as RDF linked open data [3]. Such a publicly available linked open data resource will help organizations uncover knowledge from multiple sources of cybersecurity-related data on the Web and support systems that automatically ingest, reason over and use the data to provide better cybersecurity.

II. BACKGROUND AND PREVIOUS WORK

Our approach combines two aspects of the problem. The first is extracting relevant information about new security vulnerabilities, attacks and events from text. The second is representing and integrating this information along with data extracted from the National Vulnerability Database as a linked data resource using custom ontologies in the Semantic Web languages RDF and OWL.

A. Information Extraction

Several repositories and security advisory sources address security changes and threat trends that might affect the overall security of a computer system. These sources can be used in a variety of ways to enhance the detection of an attack. NVD is a U.S. government repository of standards-based vulnerability management data represented using the Security Content Automation Protocol (SCAP) [4]. Information sources such as the NVD and IBM XFORCE² provide XML feeds that report vulnerabilities with varying degrees of detail.

² [Link]
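To make the notion of a semi-structured XML feed concrete, the following is a minimal Python sketch of reading the structured fields of one NVD-style entry. It is an illustration, not the authors' implementation. The element names follow the excerpt in Fig. 3, but the namespace URIs (`urn:example:vuln`, `urn:example:cvss`), the abbreviated sample entry and the helper name `parse_entries` are assumptions made for the example; a real feed declares its own namespaces.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumption): substitute the ones the real feed declares.
NS = {"vuln": "urn:example:vuln", "cvss": "urn:example:cvss"}

# Abbreviated, hand-written sample modeled on the entry in Fig. 3.
SAMPLE = """<nvd xmlns:vuln="urn:example:vuln" xmlns:cvss="urn:example:cvss">
  <entry id="CVE-2012-0150">
    <vuln:vulnerable-software-list>
      <vuln:product>cpe:/o:microsoft:windows_vista::sp2:x64</vuln:product>
      <vuln:product>cpe:/o:microsoft:windows_7::sp1:x86</vuln:product>
    </vuln:vulnerable-software-list>
    <vuln:cvss>
      <cvss:base_metrics>
        <cvss:score>9.3</cvss:score>
        <cvss:access-vector>NETWORK</cvss:access-vector>
      </cvss:base_metrics>
    </vuln:cvss>
    <vuln:cwe id="CWE-119" />
    <vuln:summary>Buffer overflow allows remote attackers to execute
      arbitrary code via a crafted media file.</vuln:summary>
  </entry>
</nvd>"""


def parse_entries(xml_text):
    """Return one dict per <entry>, holding only the structured fields."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("entry"):
        cwe = entry.find("vuln:cwe", NS)
        score = entry.findtext(
            "vuln:cvss/cvss:base_metrics/cvss:score", namespaces=NS
        )
        products = entry.findall("vuln:vulnerable-software-list/vuln:product", NS)
        summary = entry.findtext("vuln:summary", default="", namespaces=NS)
        entries.append({
            "cve_id": entry.get("id"),
            "products": [p.text.strip() for p in products],
            "cwe_id": cwe.get("id") if cwe is not None else None,
            "score": float(score) if score else None,
            # Collapse the whitespace introduced by line wrapping in the feed.
            "summary": " ".join(summary.split()),
        })
    return entries


if __name__ == "__main__":
    for record in parse_entries(SAMPLE):
        print(record["cve_id"], record["score"], record["products"])
```

Note that the identifier, products, weakness and score come out as clean fields, while the affected versions and the attack's means and consequence stay buried in the free-text summary. That gap is what the extraction framework described in this paper targets.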
To the best of our knowledge, these repositories consolidate information present across multiple data sources, though they are manually monitored.

These dictionaries not only contain redundant or overlapping information, but also miss important concepts such as the means and consequence associated with attacks and the versioning of a software product. Similar information is available in cybersecurity blogs such as Krebs On Security³ and the Metasploit Blog⁴, but their content is unstructured text, which can lead to information overload, especially during threat analysis of a system. Furthermore, analyzing and integrating multiple textual resources can become a cumbersome task for system administrators. Extracting actionable content from these informal sources and representing it as linked RDF data can enhance the distribution of security information and the discoverability of security-related concepts.

³ [Link]
⁴ [Link]

More et al. [5], for example, demonstrate effective reasoning over such semantically rich data for a situation-aware intrusion detection system. Their framework requires a condensed source of web resources that provide meaningful information about the threat, and data sources that provide entities that map well into the ontology. Our approach provides automation to generate and update such a linked data resource that can be used to inform advanced intrusion detection and mitigation systems.

Mulwad et al. [6] describe a prototype system that analyzed relevant text snippets from the Web to generate assertions about vulnerabilities, attacks and threats. The system extracted concepts of interest using an SVM classifier and queried Wikitology [7], a knowledge base of entities from Wikipedia, Yago [8] and Freebase [9]. The classification mechanism and the spotted concepts were limited to the identification of two classes: the means and the consequence of an attack. We adopted an approach that uses a Conditional Random Field (CRF) algorithm trained with ground truth annotations [10] to identify and classify mentions of entities and concepts, and that goes beyond their simple approach in terms of precision and recall.

The quality of the concepts extracted from free text largely depends on the method applied for concept spotting. More et al. [5] used OpenCalais [11], an information extraction system designed to recognize general entities such as people, places and organizations. Because of its orientation toward general coverage, it was unable to identify many of the entities and concepts important for cybersecurity. Similar experiments were run on the NERD information extraction framework [12], which failed to identify relevant technical jargon in the given piece of security-related text. These annotation tools are designed to capture information based on a custom ontology which models people, places and organizations. The Stanford Named Entity Recognizer [13] also does not identify key cybersecurity concepts without proper feature filtering. Our approach introduces a cybersecurity entity and concept spotter that was primarily trained to identify entities (e.g., software products and operating systems) and concepts (e.g., denial of service and buffer overflow) which are related to computer security, threats and vulnerabilities in software products.

Khadilkar et al. [14] demonstrated the concept of using a semantic model to facilitate information representation and describe an ontology for the National Vulnerability Database. The ontology modeled information for software products and generic security concepts, but is unable to characterize and capture information from unstructured sources.

Undercoffer et al. [15] specify an ontological model for categorizing computer attacks in which the taxonomic characteristics of an intrusion are limited to specific classes and attributes centered on the target of an attack. Our framework consolidates information across different knowledge bases and carries out concept spotting for entities of interest, which can initiate characterization and understanding of the overall nature of the attack.

B. Linked Data

Linked data [16] enables publishing a structured, machine-readable interpretation of heterogeneous sources of information. As defined by Bizer et al. [3], it is “a set of best practices for publishing and connecting structured data on the Web.” It focuses on interconnecting data and resources on the Web by defining relations between ontologies, schemas and/or directly linking the published data to other existing resources on the Web.

This approach can be applied to the cybersecurity domain by building an RDF data store for vulnerabilities, severity metrics, affected products and any remedial information. Relevant information about these concepts from other sources on the Web can be interlinked. With NVD data represented as RDF linked data, the task of finding all vulnerabilities pertaining to a single product version is reduced to the task of traversing the product-vulnerability dependency graph.

Additional contextual information obtained through establishing meaningful semantic links can help consolidate available information regarding a security threat. Moreover, the data representation for this interlinking will be in a structured, machine-readable format enabling faster, automated data consumption. The linked data resource can help improve the discoverability of data through the use of SPARQL [17] queries, SPARQL endpoints and resolvable URIs. It also helps in use cases such as distinguishing relevant vulnerabilities based on a product term or version. Such an interlinked corpus of data will enable stakeholders to share security-related information in a single resource, create business intelligence, support automated decision-making systems and thereby speed up the exchange and digestion of information across different organizations.

III. SYSTEM ARCHITECTURE

Figure 1 shows the organization of our system, which is divided into three major components.
1) A CRF-based cybersecurity entity and concept spotter that identifies relevant concepts and entities from text
2) An ontology-based RDF triple generator that generates triples based on extracted information provided by the entity and concept spotter
3) A link generator that uses DBpedia Spotlight [18] to link extracted entities and concepts to DBpedia resources and aligns them with our cybersecurity-specific vocabulary.
In the following sections, these components will be described in detail.

Fig. 1. System architecture for extracting linked cybersecurity data from text

A. Cybersecurity Entity and Concept Spotter

In order to extract relevant information from text, we developed an entity and concept spotter that identifies important entities and concepts in a given piece of text. This was done using the general implementation of the conditional random field (CRF) algorithm provided by the Stanford Named Entity Recognizer, with a set of features chosen for proper identification of concepts in the input text. We analyzed several cybersecurity-related blogs, security bulletins and CVE descriptions and identified a set of key classes that are relevant in terms of data representation of a vulnerability. We identified the following seven classes of relevance:
1) Software (e.g. Microsoft .NET Framework 3.5)
   a) Operating System (e.g. Ubuntu 10.4)
2) Network Terms (e.g. SSL, IP Address, HTTP)
3) Attack
   a) Means: Way to attack (e.g. Buffer overflow)
   b) Consequences: Final result of an attack (e.g. Denial of Service)
4) File Name (e.g. [Link])
5) Hardware (e.g. IBM Mainframe B152)
6) NER Modifier: This always follows Software or OS and helps in identifying software version information.
7) Other Technical Terms: Technical terms that cannot be classified in any of the above mentioned classes.

Each of these classes was chosen to represent key aspects in the identification and characterization of the attack. The classes described below are the most notable. Network Terms was identified as an important class since most attacks today use network technology. Thus it is important to extract relevant terms in text so that information regarding networks can be identified. The idea behind modeling the Attack class came from the work of Undercoffer et al. [15]. An Attack can be further classified as a Means, which helps to identify a method of an attack, or as a Consequence, which describes the final result of an attack. For example, “buffer overflow” is considered to be an instance of a Means, since it is not an attacker’s final goal, but merely a step to achieve a desired consequence, such as a “denial of service.”

Whether a phrase is considered to be an instance of a Means or Consequence is not always clear in a given text. We instructed annotators to use their discretion during annotation. When it was difficult to decide between them for a phrase, it was tagged with the Attack class. In analyzing the gold standard annotation data we found that the inter-annotator agreement for these two subclasses was lower than for all of the other classes. In this experiment, we took a random data sample from our corpus and asked two annotators to annotate the data for four classes (Software Products, Operating System, Means and Consequences). We found the agreement between the annotators to be over 90% for Software Products and Operating System. For Consequences, the agreement was 75%, while for Means it was 52%.

The NER Modifier class also deserves some explanation. Knowing which version or versions of a software product are being discussed is important. In the text,

“This vulnerability is present in Adobe Acrobat X and earlier versions...”

the phrase “and earlier versions” indicates that all Adobe Acrobat versions before version 10 are also vulnerable to the threat. These words hold key information about other versions that are vulnerable. The NER Modifier class identifies these terms. It was observed that such terms generally appeared immediately before or after a Software term or an Operating System term. Identifying these pieces of text helps identify product versions that may be susceptible to the vulnerability but are not explicitly documented as such.

Based on these classes, our extraction framework was trained using the Stanford NER [19], a CRF-based named entity recognition framework that is pre-trained to identify entities such as people, places and organizations. It includes a large feature set that can be customized to train a general implementation of a CRF model. We chose a training dataset consisting of over 30 security blogs, 240 CVE descriptions and 80 official security bulletins from Microsoft and Adobe. The data corpus [20] was manually annotated by twelve Computer Science graduate students, who had a fair understanding of cybersecurity-related terms, concepts and technical jargon. We developed a custom application to simplify the annotation process using the BRAT rapid annotation framework [21], [22].
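As an illustration of how annotated text might be handed to a CRF trainer, the sketch below converts token-level span annotations into two-column rows, one token and one label per line separated by a tab, which is the column layout Stanford NER reads by default for training data. This is a hedged example rather than the authors' pipeline: the label names (`SOFTWARE`, `MODIFIER`, `O`), the helper `to_ner_rows` and the whitespace tokenization are assumptions made for the example, and the actual corpus [20] may use a different tokenizer and label set.

```python
def to_ner_rows(tokens, spans, outside="O"):
    """Turn token-index spans into 'token<TAB>label' training rows.

    tokens: list of token strings for one sentence.
    spans:  list of (start, end, label) with token indices, end exclusive.
    outside: label for tokens that fall in no annotated span.
    """
    labels = [outside] * len(tokens)
    for start, end, label in spans:
        for i in range(start, end):
            labels[i] = label
    return [f"{token}\t{label}" for token, label in zip(tokens, labels)]


if __name__ == "__main__":
    # The example sentence quoted in the text, split on whitespace.
    tokens = ("This vulnerability is present in "
              "Adobe Acrobat X and earlier versions").split()
    # Hypothetical annotations: the product name and its version modifier.
    spans = [(5, 8, "SOFTWARE"), (8, 11, "MODIFIER")]
    print("\n".join(to_ner_rows(tokens, spans)))
```

Labeling “and earlier versions” separately from the product name mirrors the role of the NER Modifier class: the trained model can then learn that such phrases tend to sit next to Software or Operating System mentions.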
Feature Set Engineering: Feature set selection is a critical task in training an NER system. Although the Stanford NER provides an extensive selection of applicable features, filtering a subset that can capture all the relevant information pertaining to the cybersecurity domain is a tedious task. Feature selection is important, as applying all of the available features to the training and test data will not only slow down the annotation process, but also diminish the quality of results. Feature selection for our cybersecurity entity and concept spotter was carried out manually by analyzing the text and checking which features would be suitable. We selected a set of features that performed well for our analysis. The features that were used to train this system are: useTaggySequences, useNGrams, usePrev, useNext, maxNGramLeng, useWordPairs and gazette. A detailed discussion of our cybersecurity entity and concept spotter can be found in Lal et al. [23].

B. The IDS Ontology

We use the IDS ontology⁵, partially depicted in Figure 2, to represent concepts and entities that are relevant to the cybersecurity domain. This vocabulary was originally developed by Undercoffer et al. [15] and further enhanced by More et al. [5] and by this effort. The ontology is expected to continue to evolve to cover additional concepts. We extended the ontology to model relations that capture the NVD schema structure and the security exploit concepts extracted by the NER. The new key classes defined in the ontology, specific to the entities which are part of the NVD dataset, include Vulnerability, Product and Weakness.

Fig. 2. A high level sketch of the IDS ontology

⁵ [Link]

Vulnerability: Vulnerability is an important class in the ontology, as each entry in the NVD is identified and documented based on a CVE number. The CVE number is a unique identifier for a vulnerability description provided by MITRE, on an incremental basis for each year. All information related to a particular identified vulnerability is associated with the CVE ID, including the list of affected products (identified based on their unique Common Platform Enumeration (CPE) name), the Web resources where it was first documented, and the severity metrics. The Vulnerability class is hence defined to have corresponding relationships with all other classes which are used to model entities that are part of an NVD entry.

Product: The Product class models the hardware and software products that are affected by a vulnerability. The Software subclass is further classified as an Operating System or Web Browser to correctly classify operating systems and Web browsers, apart from a generic “application” tag. The affected product information described in an NVD entry is limited to a list of product names described for a particular vulnerability. This information is incorporated using the CPE format, which includes version granularity. The affectsProduct relationship models a one-to-many mapping between the vulnerability identifier and the list of affected products. Additional information about the affected products in the NVD entry is extracted using our cybersecurity entity and concept spotter.

Weakness: An NVD entry contains a unique Common Weakness Enumeration (CWE) identifier that classifies a vulnerability based on a hierarchy of attack classes modeled to generalize different attack signatures. For example, Cross-site scripting (XSS) is a subclass of Injection, which is a subclass of Invalid Input. The severity score for a CVE ID is derived from parameters specified for the corresponding CWE ID. The Weakness class is thus included to extract more information regarding the metrics used to score the vulnerability’s severity, by which the means for addressing a threat will be refined and enhanced. In addition, we define classes for each concept that is spotted by our classifier, such as Network Terms (IP address, HTTP), Means (Buffer Overflow), and Consequence (Denial of Service).

All the concepts from the IDS vocabulary are aligned with concepts identified in the DBpedia ontology by assigning a relevant DBpedia resource, thereby resolving the ambiguity of entities mapped in our ontology.

C. RDF Representation of NVD

Applying semantic web technologies to represent the data provided by the NVD dataset is useful for semantic analysis of vulnerabilities and exploits. However, correlating this data with existing concepts on the Web and reasoning over such a corpus is vital to making this information available to different applications, front-end services and data consumers (e.g., security practitioners and system administrators). Semantics allow machine interpretation of links and relations between different properties of a vulnerability. Interlinking leads to an integrated and well-connected data corpus, available via an endpoint for advanced applications such as semantic search and vulnerability statistics.

The NVD provides XML feeds for vulnerabilities that are published in a particular year. The NVD datasets are updated immediately with raw information whenever a new vulnerability is reported to the CVE repository, and are revised into a valid, confirmed entry after analysis.

Our RDF-generation platform ingests XML feeds from the NVD dataset and generates RDF triples via an Extensible
Stylesheet Language Transformation and the Jena RDF API [24]. The system includes primary attributes included directly in the NVD schema, as well as advanced properties fetched from the sources that these attributes refer to. For example, an NVD entry contains the CWE ID for the weakness class it belongs to. The CWE schema includes attributes such as the Access Vector, Access Complexity and Authentication. These attributes are used to calculate the severity score for a threat. We observed that vulnerabilities with the same combination of these features occur under the same context or running environment.

D. Linked Cybersecurity Data

Establishing the relations between security exploit terms and identifiers that uniquely identify these concepts is essential to data integration. Actually linking instances and concepts with resources on the Web is challenging. After RDF instances are generated from the properties provided by the NVD schema, the link generation component of our framework connects the security concepts extracted from the vulnerability descriptions to existing dereferenceable resources on the Web. Each NVD entry includes a short summary of the vulnerability, which is essentially unstructured text. This module annotates security-related terms from the vulnerability description and maps them to corresponding DBpedia resources using DBpedia Spotlight, an annotation tool for finding mentions of DBpedia resources in free text. DBpedia Spotlight provides flexibility to configure annotations to specific use cases, through quality metrics such as topical pertinence and disambiguation confidence. Binding the DBpedia references to the identified security concepts will enhance the association of our linked data resource with other instances in the Linked Open Data cloud.

Entities with valid (contextual) resources in DBpedia are annotated based on adequate tuning of the confidence and support metrics. After experimentation with these parameters over our dataset, we selected a confidence of 0.3 and a support of 20 for generating DBpedia links for a specific vulnerability description.

The annotations and subsequent linkages provided by DBpedia Spotlight are neither final nor complete. For a given piece of text, the DBpedia Spotlight API returns the sets of annotated terms and corresponding DBpedia resources. However, the annotation does not provide the corresponding class from the DBpedia ontology that the resource belongs to. We employ our cybersecurity entity and concept spotter to map the security exploit concepts to appropriate classes from the IDS vocabulary. The NVD descriptions (vuln:summary) are passed through the concept spotter, which identifies relevant terms, assigns a class label, and returns a set of <Concept, Class> tuples for the description. The Concept terms are then passed through DBpedia Spotlight. The annotated terms from DBpedia Spotlight are matched against the entities identified by our system using a string comparison. The corresponding DBpedia resource for the matched concept is assigned a class value, based on the <Concept, Class> pairs. These resources are then mapped with an appropriate object property from the IDS vocabulary. The choice of DBpedia Spotlight as a link generation tool, and the precision of the concept extraction and linking component, are described in detail in Joshi et al. [25].

The IDS vocabulary models key aspects of a cyber attack which are not represented precisely in the DBpedia ontology. For example, the terms Buffer Overflow and Denial of Service are aptly represented as “Means” and “Consequence” respectively in the IDS vocabulary. These concepts are highly specific to a domain and hence not modeled in the DBpedia ontology.

Figure 3 shows a sample NVD entry which specifies the CVE identifier for the vulnerability description, together with the list of affected products with Common Platform Enumeration (CPE) names, the Weakness identifier, and the source where the vulnerability was documented. We extract information from this data to generate machine-understandable assertions in RDF, as shown in Figure 4. We use the IDS ontology to interpret key security concepts such as the vulnerability sources and severity metrics. Besides modeling semi-structured information, our framework extracts relevant DBpedia resources from the text description such as Arbitrary code execution and maps them to appropriate relationships (hasConsequences) from the IDS vocabulary.

    <?xml version="1.0" encoding="UTF-8"?>
    <nvd xmlns:vuln="[Link]"
         xmlns:cvss="[Link]">
      <entry id="CVE-2012-0150">
        <vuln:vulnerable-software-list>
          <vuln:product>cpe:/o:microsoft:windows_vista::sp2:x64
          </vuln:product>
          <vuln:product>cpe:/o:microsoft:windows [Link]x86
          </vuln:product>
          <vuln:product>cpe:/o:microsoft:windows_7::sp1:x86
          </vuln:product>
          <vuln:product>cpe:/o:microsoft:windows_vista::sp2
          </vuln:product>
        </vuln:vulnerable-software-list>
        <vuln:cve-id>CVE-2012-0150</vuln:cve-id>
        <vuln:cvss>
          <cvss:base_metrics>
            <cvss:score>9.3</cvss:score>
            <cvss:access-vector>NETWORK</cvss:access-vector>
            <cvss:access-complexity>MEDIUM</cvss:access-complexity>
            <cvss:authentication>NONE</cvss:authentication>
          </cvss:base_metrics>
        </vuln:cvss>
        <vuln:cwe id="CWE-119" />
        <vuln:references xml:lang="en"
            reference_type="VENDOR_ADVISORY">
          <vuln:source>MS</vuln:source>
          <vuln:reference
              href="[Link]"
              xml:lang="en">MS12-013</vuln:reference>
        </vuln:references>
        <vuln:summary>Buffer overflow in [Link] in Microsoft
          Windows Vista SP2, Windows Server 2008 SP2, R2, and R2 SP1,
          and Windows 7 Gold and SP1 allows remote attackers to execute
          arbitrary code via a crafted media file, aka "[Link]
          Buffer Overflow Vulnerability."
        </vuln:summary>
      </entry>
    </nvd>

Fig. 3. An excerpt of an NVD XML entry

    @prefix rdfs: <[Link]> .
    @prefix rdf: <[Link]> .
    @prefix ebqids: <[Link]> .
    @prefix dbpedia: <[Link]> .

    <[Link]>
        ebqids:cveID "[Link]" ;
        ebqids:cweID "[Link]" ;
        ebqids:affectsProduct "dbpedia:Windows_Vista" ,
            "dbpedia:Windows_7" ;
        ebqids:summary "Buffer overflow in [Link] in Microsoft
            Windows Vista SP2, Windows Server 2008 SP2, R2, and R2 SP1,
            and Windows 7 Gold and SP1 allows remote attackers to execute
            arbitrary code via a crafted media file, aka "[Link]
            Buffer Overflow Vulnerability."" ;
        ebqids:hasAccessComplexity "MEDIUM" ;
        ebqids:hasAccessVector "NETWORK" ;
        ebqids:hasAuthentication "NONE" ;
        ebqids:hasSeverityScore "9.3" ;
        ebqids:hasVulnerabilitySource
            "[Link]" ;
        ebqids:hasMeans "dbpedia:Buffer_overflow" ;
        ebqids:hasConsequence "dbpedia:Arbitrary_code_execution" ;
        ebqids:hasTerms "[Link] file" ,
            "[Link] library" ,
            "[Link] (computing)" .

Fig. 4. Turtle representation of extracted information

Based on the relationships established with the linked concepts, we can retrieve vulnerabilities and attack descriptions pertaining to a specific product version, those affected by a specific means (Buffer Overflow), or those attacks that are carried out under the same operating environment. We can query such a knowledge base via SPARQL to obtain statistics on vulnerability trends, and can view the past history associated with a vulnerability or a particular software product. A triple store of such condensed information provides a rich linked data resource that can be used for semantic analysis of vulnerabilities.

The NVD datasets provide an RSS data feed on all recent CVE vulnerabilities. These immediate data sources can be represented as machine-understandable assertions as shown above. Such RDF assertions can be added to the triple store, and can help in applications such as a situation-aware intrusion detection system that can consume linked data to generate rules and alerts on possible threats. In the future, we plan to extend the concept spotting system into an information extraction framework that is not limited to the NVD dataset and its auxiliaries. The proposed system will extract concepts from free text, find relationships between entities spotted in the text, make assertions about them based on a specific heuristic and publish them to the linked cybersecurity data resource.

IV. SYSTEM EVALUATION AND CHALLENGES

The focus in this paper has been on the problem of extracting cybersecurity concepts, entities and relations and generating linked data representations of them. In order to generate a quality linked data resource that captures all relevant security information from within a text description, the cybersecurity entity and concept spotter was trained over a data corpus of unstructured texts from security blogs, CVE descriptions and security bulletins.

Our gold-standard dataset was created from human annotations of these unstructured pieces of text. The dataset was randomized and split into five equal chunks. The CRF-based classifier was trained over this dataset using the Stanford NER and appropriate feature selection, as mentioned previously. We evaluated the classifier using five-fold cross-validation, where four chunks of data were provided as training input to the classifier system and one chunk as a test set. The training set, on average, consisted of 3800 tagged entities and over 38000 tokens, while the test set, on average, consisted of over 9000 tokens and over 1200 entities. Figure 6 shows the results of each run in the five-fold cross-validation experiment. On analysis, the trained model was observed to demonstrate promising results. Figure 5 shows a graph and a breakdown of the overall system performance on test data.

Fig. 5. Class-wise evaluation of the cybersecurity concept and entity spotter.

We used the precision, recall and F1 score measures to evaluate our CRF classifier. The entity and concept spotter generated consistent results after applying five-fold cross-validation, as shown in Figure 6. The weighted average precision was calculated to be 0.83, the weighted average recall was 0.76 and the weighted average F1 score was 0.80. These weighted averages were calculated from the values in Table I. We also noted that the Gazetteers feature from Stanford NER helped improve the score of the Software and Operating System classes. There was notable inconsistency between the Means and Consequences classes. Their collective precision score was recorded as 0.75. A possible explanation would be that most of the false positives in both classes belonged to the opposite class. Moreover, it was observed that entities tagged as Means and Consequences were the most ambiguous terms encountered during the annotation process. Table I shows the statistics for the tested dataset in terms of true positives (TP), false positives (FP) and false negatives (FN). These statistics were calculated
collectively for five-fold cross validation.

Fig. 6. Results of five-fold cross validation experiment

TABLE I
RESULTS OF CYBERSECURITY CONCEPT SPOTTER

CLASS              TP    FP    FN
ATTACK             30    14    27
CONSEQUENCES      299   123   135
FILE               52     0     0
HARDWARE            3     0     2
MEANS             185    94   177
MODIFIER          320    79   147
NETWORK            14    15    45
OPERATINGSYSTEM   920    34    36
OTHER             167    89   230
SOFTWARE         1449   224   268
TOTAL            3439   672  1063

Our concept spotter system does face certain challenges when identifying entities for some specific NVD descriptions that refer to another NVD CVE description. There are certain sets of entries in the NVD repositories (mostly related to the same software product) that are observed to have the same summary description, with minor changes in the rest of the NVD (CVE, CVSS) properties. However, they provide references to other CVE IDs that might have the appropriate, more granular details regarding the attack.

Figure 7 shows an excerpt of the NVD CVE-2013-0610 entry that describes a buffer overflow attack on Adobe Acrobat and Reader. Although the NVD summary describes the means of the attack (Buffer Overflow) and the affected product (Adobe Acrobat), it does not provide further information such as the consequences. Moreover, the severity score for the entry is 10 (“Critical”). Retrieving the text associated with the referenced NVD CVE entries might help gather more information about the nature of such a critical attack, not only for a single CVE but for a group of CVEs that might be reported together. In the future, we plan to consolidate these missed sources to give richer context on such vulnerabilities.

@prefix rdfs: <[Link]
@prefix rdf: <[Link] .
@prefix ebqids: <[Link] .
@prefix dbpedia: <[Link] .
<[Link]
ebqids:cveID "[Link]
ebqids:cweID "[Link] ;
ebqids:summary "Stack-based buffer overflow in Adobe Reader
    and Acrobat 9.x before 9.5.3, 10.x before 10.1.5, and 11.x
    before 11.0.1, not different from CVE-2013-0626." ;
ebqids:hasAccessComplexity "LOW" ;
ebqids:hasAccessVector "NETWORK" ;
ebqids:hasAuthentication "NONE" ;
ebqids:hasSeverityScore "10.0" ;
ebqids:hasVulnerabilitySource
    "[Link] ,
    "[Link] ,
    "[Link] ,
    "[Link] ;
ebqids:hasMeans "dbpedia:Buffer_overflow" ;
ebqids:affectsProduct "dbpedia:Adobe_Acrobat" .

Fig. 7. An NVD entry excerpt which has an incomplete description, since it refers to another NVD CVE entry.

Linked data supports data integration and interoperation by using the RDF representation, which has globally unique identifiers (URIs). Moreover, the linked data paradigm stipulates that these identifiers should be “resolvable”, i.e., one can use an HTTP GET request on a URI and retrieve additional information about the thing it denotes. Integration can be further enhanced by linking a URI to another from a central knowledge base like DBpedia. These links assert the equivalence of the objects the two URIs denote. Such a central resource serves as a common knowledge hub, allowing sets of URIs to be understood as equivalent if they link to the same source.

However, not all concepts and terms spotted in the vulnerability descriptions can be associated with a valid, available resource. This may be the case when there is no relevant DBpedia resource available for the concept. The terms extracted by the cybersecurity entity and concept spotter, though not instantiated to relevant URIs, are important for profiling an attack.

There is a considerable difference in the number of annotations picked by our cybersecurity classifier and the number of annotations (and thereby links) generated by DBpedia Spotlight. Figure 8 shows the comparison for the number of annotations extracted from a set of 300 NVD vulnerability descriptions by DBpedia Spotlight and our cybersecurity classifier.

This not only demonstrates the performance of our classifier, but also indicates the absence of entities that describe security concepts in the DBpedia knowledge base. In order to represent these terms in useful RDF instances, we plan to resolve the unidentified concepts to external URIs that formally describe the security concept, and thereby reduce fact duplication and reuse existing URIs. Hence our prototype can support knowledge generation of terms relevant to cybersecurity that are not identifiable as relevant DBpedia resources.
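The hub idea in the discussion above can be illustrated with a small grouping routine. It is a sketch under stated assumptions rather than the authors' code: the example.org URIs, the dbpedia: labels used as hub keys and the function name group_by_hub are all hypothetical, and a hub of None stands for a spotted term with no matching DBpedia resource.

```python
from collections import defaultdict


def group_by_hub(links):
    """Group local URIs by the hub resource they link to.

    links: iterable of (local_uri, hub_or_None) pairs.
    Returns (groups, unlinked). groups maps each hub to the sorted list of
    local URIs that can be treated as equivalent because they share that hub;
    unlinked lists the URIs for which no hub resource was found.
    """
    groups = defaultdict(set)
    unlinked = []
    for local_uri, hub in links:
        if hub is None:
            unlinked.append(local_uri)  # kept: still useful for profiling
        else:
            groups[hub].add(local_uri)
    return {hub: sorted(uris) for hub, uris in groups.items()}, unlinked


# Hypothetical links produced by two independent sources.
LINKS = [
    ("http://example.org/feedA/buffer-overflow", "dbpedia:Buffer_overflow"),
    ("http://example.org/feedB/bof", "dbpedia:Buffer_overflow"),
    ("http://example.org/feedA/acrobat", "dbpedia:Adobe_Acrobat"),
    ("http://example.org/feedB/crafted-media-file", None),
]

if __name__ == "__main__":
    groups, unlinked = group_by_hub(LINKS)
    for hub, uris in groups.items():
        print(hub, "<-", uris)
    print("no hub resource:", unlinked)
```

Keeping the unlinked terms in a separate list mirrors the point made above: terms without a DBpedia resource are not discarded, since they remain relevant for describing an attack.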
Fig. 8. Comparison of number of annotations

V. CONCLUSION AND FUTURE WORK

We demonstrate a prototype for an entity and concept spotting framework that identifies cybersecurity-related concepts from heterogeneous data sources, aligns and links them to relevant resources on the Web using the IDS ontology, and generates an RDF linked data collection. We provide a semantic data representation for the concepts that are not limited to the NVD dataset. The linked data generation module leverages interoperability and reuse of URIs, thereby enhancing the binding with the Linked Open Data cloud.

Our evaluation showed promising results for the extraction framework. We plan to focus on further extracting previously unidentified security concepts from any given piece of text, identify properties and find relationships based on a heuristic. There are ongoing efforts to enhance the ontology to model detailed network-related terms and privacy concepts. We believe that expressing structured and unstructured cybersecurity-related text as linked data has potential to leverage automatic consumption and reasoning of security concepts, and can drive applications such as a situation aware intrusion detection system to detect and prevent potential “zero-day” attacks.

Acknowledgment

This research was supported by grants from AFOSR (FA9550-08-1-0265) and NSF (IIS-1250627).

REFERENCES

[1] “Cyber criminals target Skype, Facebook and Windows users,” [Link]
[2] D. McGuinness, F. van Harmelen et al., “OWL web ontology language overview,” World Wide Web Consortium, Tech. Rep., 2004.
[3] C. Bizer, T. Heath, and T. Berners-Lee, “Linked data - the story so far,” Int. Journal on Semantic Web and Information Systems, vol. 5, no. 3, pp. 1–22, 2009.
[4] S. D. Quinn, D. A. Waltermire, C. S. Johnson, K. A. Scarfone, and J. F. Banghart, “SP 800-126. The Technical Specification for the Security Content Automation Protocol (SCAP): SCAP Version 1.0,” National Institute of Standards & Technology, Gaithersburg, MD, Tech. Rep., 2009.
[5] S. More, M. Matthews, A. Joshi, and T. Finin, “A Knowledge-Based Approach to Intrusion Detection Modeling,” in Security and Privacy Workshops (SPW), 2012 IEEE Symposium on, 2012, pp. 75–81.
[6] V. Mulwad, W. Li, A. Joshi, T. Finin, and K. Viswanathan, “Extracting Information about Security Vulnerabilities from Web Text,” in IEEE/WIC/ACM Int. Conf. on Web Intelligence and Intelligent Agent Technology, vol. 3, 2011, pp. 257–260.
[7] Z. Syed, “Wikitology: A Novel Hybrid Knowledge Base Derived from Wikipedia,” Ph.D. dissertation, University of Maryland, Baltimore County, August 2010.
[8] F. M. Suchanek, G. Kasneci, and G. Weikum, “Yago: A Core of Semantic Knowledge,” in 16th Int. World Wide Web Conf. New York: ACM Press, 2007.
[9] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, “Freebase: a collaboratively created graph database for structuring human knowledge,” in Proc. ACM Int. Conf. on Management of Data. ACM, 2008, pp. 1247–1250.
[10] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data,” in 18th Int. Conf. on Machine Learning. Morgan Kaufmann, 2001, pp. 282–289.
[11] T. Reuters, “OpenCalais,” 2009.
[12] G. Rizzo and R. Troncy, “NERD: a framework for unifying named entity recognition and disambiguation extraction tools,” in 13th Conf. of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2012, pp. 73–76.
[13] J. R. Finkel, T. Grenager, and C. Manning, “Incorporating non-local information into information extraction systems by Gibbs sampling,” in 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2005, pp. 363–370.
[14] V. Khadilkar, J. Rachapalli, and B. Thuraisingham, “Semantic Web Implementation Scheme for National Vulnerability Database (Common Platform Enumeration Data),” University of Texas at Dallas, Tech. Rep. UTDCS-01-10, 2010.
[15] J. Undercoffer, J. Pinkston, A. Joshi, and T. Finin, “A Target-Centric Ontology for Intrusion Detection,” in IJCAI-03 Workshop on Ontologies and Distributed Systems. Morgan Kaufmann, 2004, pp. 47–58.
[16] “Linked Data,” [Link] 2007.
[17] E. Prud'hommeaux, A. Seaborne et al., “SPARQL query language for RDF,” 2008, W3C Recommendation.
[18] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer, “DBpedia spotlight: shedding light on the web of documents,” in 7th Int. Conf. on Semantic Systems. ACM, 2011, pp. 1–8.
[19] “Stanford NER,” [Link]
[20] R. Lal, “Annotations of cybersecurity blogs and articles,” [Link] June 2013.
[21] P. Stenetorp, S. Pyysalo, G. Topić, T. Ohta, S. Ananiadou, and J. Tsujii, “BRAT: a web-based tool for NLP-assisted text annotation,” in Demonstrations, 13th Conf. of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2012, pp. 102–107.
[22] “BRAT Annotation Tool,” [Link]
[23] R. Lal, “Information Extraction of Security related entities and concepts from unstructured text,” Master’s thesis, University of Maryland Baltimore County, 2013.
[24] J. Carroll, I. Dickinson, C. Dollin, D. Reynolds, A. Seaborne, and K. Wilkinson, “The JENA Semantic Web platform: architecture and design,” HP Laboratories, Tech. Rep. Technical Report HPL-2003-146, 2003.
[25] A. Joshi, “Linked Data for Software Security Concepts and Vulnerability Descriptions,” Master’s thesis, University of Maryland Baltimore County, 2013.
