How To Succeed With Data Classification Using Modern Approaches
Overview
Key Findings
Manual data classification approaches can result in misclassification of data due to human
error or a lack of user awareness training.
While users label/tag their data, these labels remain one-dimensional: they serve a single
purpose and do not provide sufficient context for the growing control requirements of data regulations.
Recommendations
To implement an effective data classification program, security and risk management (SRM) leaders
tasked with data security must:
Establish a data classification program by shifting focus from user awareness and training
toward automation and the enrichment tools that generate metadata.
Increase depth and dimensionality in the data classification approach by segmenting into a
discovery phase, data enrichment phase and control phase.
Introduction
Data classification is vital because it supports data security and governance controls
such as data loss prevention (DLP), data access governance and enterprise digital rights
management (EDRM). It also helps organizations understand data in the context of its usage
and risk levels. However, Gartner observes that unstructured data is becoming increasingly
difficult to manage. As a result, the individuals or systems tasked with processing
information rarely classify, label or enforce controls on every piece of data. This inconsistency
makes classification unreliable as a driver of, and means of support for, data security and
compliance efforts. Organizations need a practical data classification approach that gives the
business a foundation for understanding its data and applying the necessary mitigating measures.
Two types of data classification tools are available in the market:
1. User-driven/manual tools: These tools enforce the classification of data at the time of creation
or use. They rely on user education and awareness; without these, data will be classified
inconsistently or misclassified.
2. Automated tools: These tools are based on out-of-the-box policies and templates, provided
by the vendors, that identify sensitive data and classify it. Beyond analyzing content,
leading tools also leverage context such as location, access groups and
adjacent documents. Automated tools get the best results with well-known, standard data
types (such as driving license information, proper names and social security numbers). If your
intellectual property data is consistently well-formatted (such as with an account number or
project coding system), then automated systems will succeed there as well.
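To make the pattern-based approach concrete, the minimal Python sketch below flags well-formatted data types with regular expressions. The patterns, the PRJ-style project code format and the sample text are illustrative assumptions, not any vendor's shipped policies.

```python
import re

# Illustrative patterns only; real products ship far richer policy packs
# and combine patterns with contextual signals (location, access groups).
PATTERNS = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    # Well-formatted internal identifiers, such as a hypothetical project
    # coding system, are also good candidates for pattern matching.
    "PROJECT_CODE": re.compile(r"\bPRJ-\d{6}\b"),
}

def detect_sensitive(text: str) -> dict[str, list[str]]:
    """Return every pattern match found in the text, keyed by data type."""
    hits = {name: pat.findall(text) for name, pat in PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}

sample = "Contact jane@example.com about PRJ-004217; SSN on file: 123-45-6789."
print(detect_sensitive(sample))
# {'US_SSN': ['123-45-6789'], 'EMAIL': ['jane@example.com'],
#  'PROJECT_CODE': ['PRJ-004217']}
```

This works well precisely because the targets are well-formatted; free-form intellectual property without a consistent structure is where pure pattern matching breaks down.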
The introduction of machine learning to automated data classification tools has proven
beneficial, especially as some of these tools now support dynamic feedback. These tools
learn from the responses provided by security analysts/administrators, which helps to quickly
address any false positives. But for most tools, the cost of implementing and tuning them to
reliably identify sensitive internal or proprietary data in detail is prohibitive; for
those use cases, user-driven classification should be considered instead (or, preferably, as well).
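A minimal sketch of such a feedback loop follows, using scikit-learn as a stand-in for a vendor's model. The training snippets and the retrain-on-correction strategy are simplifying assumptions; production tools train on far larger corpora and typically use incremental learning rather than full retraining.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training snippets standing in for a labeled corpus.
docs = [
    "salary and bank details for an employee",
    "cafeteria menu for next week",
    "passport number and date of birth",
    "office holiday party schedule",
]
labels = ["sensitive", "public", "sensitive", "public"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(docs, labels)

# Dynamic feedback: suppose an analyst reviews a prediction, flags it as a
# false positive, and the corrected example is folded back into training.
new_doc = "weekly menu of employee favorites"
print(model.predict([new_doc]))  # may come back 'sensitive' (false positive)

docs.append(new_doc)
labels.append("public")          # the analyst's correction
model.fit(docs, labels)          # retrain with the feedback incorporated
print(model.predict([new_doc]))  # should now reflect the analyst's label
```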
Analysis
Enrich, Don’t Just Classify Data
Traditional data classification approaches have always relied on users: data owners and data
creators were responsible for classifying any file or document they created or owned. This
approach has prerequisites, including user awareness training that explains the importance of
data classification, and preexisting data classification policies.
To accommodate users, sensitivity classification schemes are often simplified into “buckets.” The
four levels of classification that are often used are:
Restricted
Confidential
Internal
Public
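A minimal sketch of how such a bucket scheme might be encoded is shown below; the handling rules attached to each level are illustrative assumptions, not any standard.

```python
from enum import IntEnum

# One hypothetical encoding of a four-level sensitivity scheme.
class Sensitivity(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3

# Assumed default controls per level, purely for illustration.
HANDLING = {
    Sensitivity.PUBLIC:       {"encrypt_at_rest": False, "external_sharing": True},
    Sensitivity.INTERNAL:     {"encrypt_at_rest": False, "external_sharing": False},
    Sensitivity.CONFIDENTIAL: {"encrypt_at_rest": True,  "external_sharing": False},
    Sensitivity.RESTRICTED:   {"encrypt_at_rest": True,  "external_sharing": False,
                               "dlp_block_upload": True},
}

def controls_for(level: Sensitivity) -> dict:
    """Look up the default handling controls for a classification level."""
    return HANDLING[level]

print(controls_for(Sensitivity.CONFIDENTIAL))
```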
This bucket approach depends on the understanding (and often the risk appetite) of the users
who are classifying the information. It is prone to human error, and data may end up:
Underclassified (either through error or because users realize that a lower classification will
make their job easier).
Overclassified (a common mistake when users are risk-averse or uncomfortable with the
scheme, leading to overspending and difficulty in accessing and handling the data).
SRM leaders who currently use a traditional classification scheme, and who find that it does not
support the increased detail demanded by modern data governance laws, should take steps to
evolve toward metadata enrichment. Metadata, in general terms, is data about data; this
approach attaches additional information to the data, which can be embedded directly into
the files. This approach is called “descriptive classification.” Here, data is classified not in
accordance with control requirements, but in accordance with the semantic description of the
data. Figure 1 shows an example of descriptive classification.
Here, users set the description of the data (such as customer records, financial data and HR data),
which is mapped to the control requirement so that the description itself yields metadata. The
benefits of this method are a reduced need for awareness training and a reduction in human error
and misclassification. This approach also provides a good transition from control-based
classifiers, as each descriptive classifier maps to a control. The organization also gains the
benefit of inferred metadata associated with the descriptive classifier; for example, “HR data” is
taken to contain both “personal” and “personal sensitive” data. Also, given the high risk of data
exfiltration, this approach helps organizations classify information easily and ensure that
only the right people have access to sensitive data. The one downside is that the list of
descriptive classifiers is far longer.
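The sketch below illustrates how a descriptive classification table might work: the user picks a semantic description, and a centrally maintained mapping infers the metadata and control requirement. The “HR data” entry follows the example above; the other entries and field names are illustrative assumptions.

```python
# Hypothetical mapping from user-chosen descriptions to inferred metadata
# and control requirements. Only "HR data" is taken from the text above;
# the remaining rows are invented for illustration.
DESCRIPTIVE_CLASSIFIERS = {
    "HR data":          {"inferred_metadata": ["personal", "personal sensitive"],
                         "control": "Confidential"},
    "Customer records": {"inferred_metadata": ["personal"],
                         "control": "Confidential"},
    "Financial data":   {"inferred_metadata": ["regulated"],
                         "control": "Restricted"},
    "Marketing brochure": {"inferred_metadata": [],
                           "control": "Public"},
}

def enrich(description: str) -> dict:
    """Resolve a user-chosen description to metadata and a control level."""
    entry = DESCRIPTIVE_CLASSIFIERS[description]
    return {"description": description, **entry}

print(enrich("HR data"))
# {'description': 'HR data',
#  'inferred_metadata': ['personal', 'personal sensitive'],
#  'control': 'Confidential'}
```

Because the mapping is maintained centrally, tightening a control requirement means updating the table rather than retraining every user.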
The metadata enrichment approach can be segmented into three phases: discovery, enrichment
and control. The first phase is a discovery process, which involves locating information. This may
seem trivial, but the nature of our digital world means that information is everywhere, and much of
it is unknown to IT teams. Most automated data classification tools provide the data discovery
capabilities needed for this phase. Next comes enrichment, which takes the result of discovery
and applies tags or labels to data objects. Many tools provide the needed automation for this step
by using content inspection capabilities as well as AI-driven methods including machine learning,
natural language processing (NLP) and computer vision.
For example, some of the tags associated with a résumé document would include aspects like
“Personal,” “Sensitive,” “HR,” “CV,” “DOB: 19760822,” “Last Edit: 20190326” and “Region: India.” The
last step is applying controls where these tags provide the critical metadata needed by control
tools — such as data retention tools, DLP tools or content collaboration platforms — to properly
handle the files in question (see Figure 2).
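To make the three phases concrete, the sketch below wires them together in miniature. It is a toy pipeline under stated assumptions: the /shares/hr root, the filename-based enrichment heuristic and the control actions are hypothetical, and the tags reuse the résumé example above.

```python
import os

def discover(root: str) -> list[str]:
    """Phase 1 (discovery): locate files. Real tools crawl file shares,
    cloud storage and SaaS repositories, much of it unknown to IT."""
    return [os.path.join(d, f) for d, _, files in os.walk(root) for f in files]

def enrich(path: str) -> dict:
    """Phase 2 (enrichment): attach tags. Real tools use content inspection,
    machine learning, NLP and computer vision, not filename hints."""
    tags = {"Region": "India", "Last Edit": "20190326"}
    if "resume" in path.lower() or "cv" in path.lower():
        tags.update({"Personal": True, "Sensitive": True,
                     "HR": True, "CV": True})
    return {"path": path, "tags": tags}

def apply_controls(item: dict) -> str:
    """Phase 3 (control): the tags become the metadata that DLP, retention
    or collaboration tools consume to decide how to handle the file."""
    if item["tags"].get("Sensitive"):
        return f"DLP: block external sharing of {item['path']}"
    return f"No restriction for {item['path']}"

for item in (enrich(path) for path in discover("/shares/hr")):
    print(apply_controls(item))
```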