STEP 1: IDENTIFICATION
Much of the data for analytics will come from your owned social media platforms, such as your official Twitter account, Facebook fan pages, blogs, and YouTube channel.
Some data for analytics, however, will also be harvested from nonofficial social media
platforms, such as Google search engine trends data or Twitter search stream data.
STEP 2: EXTRACTION
•Once a reliable and minable source of data is identified, the next stage is extraction.
•The type (e.g., text, numerical, or network) and size of data will determine the method and
tools suitable for extraction.
•Small amounts of numerical information, for example, can be extracted manually (e.g., going through your Facebook fan page, counting likes, and copying comments), whereas large-scale automated extraction is done through an API (application programming interface).
•Manual data extraction may be practical for small-scale data, but it is the API-based extraction tools that will help you get the most out of your social media platforms.
•Most social media analytics tools use API-based data extraction.
•APIs, in simple words, are sets of routines/protocols that social media service companies
(e.g., Twitter and Facebook) have set up that allow users to access small portions of data
hosted in their databases.
•The greatest benefit of using APIs is that they allow other entities (e.g., customers, developers, and third-party applications) to access and build on the data hosted in the platform's databases.
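A minimal sketch of what API-based extraction looks like in practice is shown below. The endpoint, parameters, and access token here are hypothetical; real platforms (e.g., the Twitter/X API or the Facebook Graph API) define their own endpoints, authentication flows, and rate limits.

```python
# Minimal sketch of API-based extraction against a HYPOTHETICAL REST endpoint.
# Real social media APIs have their own URLs, auth flows, and rate limits.
import requests

API_URL = "https://api.example-social.com/v1/posts"   # hypothetical endpoint
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"                     # issued by the platform

def fetch_posts(query, max_results=50):
    """Request recent posts matching a search query and return them as a list of dicts."""
    params = {"q": query, "limit": max_results}
    headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
    response = requests.get(API_URL, params=params, headers=headers, timeout=10)
    response.raise_for_status()          # fail loudly on HTTP errors
    return response.json().get("data", [])

if __name__ == "__main__":
    for post in fetch_posts("customer feedback"):
        print(post.get("id"), post.get("text"))
```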
•Some data, such as social networks and hyperlink networks, can only be extracted
through specialized tools. Two important issues to bear in mind here are the privacy and
ethical issues related to mining data from social media platforms.
•Privacy advocacy groups have long raised serious concerns regarding large-scale
mining of social media data and warned against transforming social spaces into
behavioral laboratories.
•The social media privacy issue first came into the spotlight particularly due to the
large-scale “Facebook Experiment” carried out in 2012, in which Facebook
manipulated the news feeds feature of thousands of people to see if emotion
contagion occurs without face-to-face interaction (and absence of nonverbal cues)
between people in social networks (Kramer, Guillory et al. 2014).
•Though the experiment was consistent with Facebook's Data Use Policy (Editorial 2014) and helped promote our understanding of online social behavior, it nevertheless raises serious concerns regarding obtaining informed consent from participants and allowing them to opt out. The bottom line here is that your data extraction practices should respect users' privacy and comply with the platform's terms of use and research ethics.
STEP 3: CLEANING
• This step involves removing unwanted data from the automatically extracted data. Some data may need a lot of cleaning, while other data can go into analysis directly. In the case of text analytics, for example, cleaning, coding, clustering, and filtering may be needed to get rid of irrelevant textual data using natural language processing (NLP). Coding and filtering can be performed by machines (i.e., automated) or manually by humans. For example, DiscoverText combines both machine learning and human coding techniques to code, cluster, and classify social media data (Shulman 2014).
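As a rough illustration of the cleaning step (not the DiscoverText workflow itself), the sketch below lowercases comments, strips URLs, mentions, hashtags, and punctuation, and drops a small hand-picked stop-word list; production pipelines would use a full NLP toolkit.

```python
# A minimal text-cleaning sketch for social media comments.
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "is", "are", "in", "on"}

def clean_comment(text):
    """Return a list of cleaned tokens from one raw social media comment."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)      # drop URLs
    text = re.sub(r"[@#]\w+", " ", text)           # drop mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)          # drop punctuation, digits, emoji
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_comment("Loving the new phone!! https://t.co/xyz @BrandX #awesome"))
# ['loving', 'new', 'phone']
```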
STEP 4: ANALYZING
• At this stage the cleaned data is analyzed for business insights. Depending on the layer of social media analytics under consideration and the tools and algorithms employed, the steps and approach you take will vary greatly. For example, nodes in a social media network can be clustered and visualized in a variety of ways depending on the algorithm employed. The overall objective at this stage is to extract meaningful insights without the data losing its integrity. While most analytics tools will walk you through a step-by-step procedure to analyze your data, having background knowledge and an understanding of the data and the analysis techniques is still essential for interpreting the results correctly; a small clustering sketch follows.
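To make the idea of algorithm-dependent analysis concrete, here is an illustrative sketch that clusters the nodes of a made-up interaction network into communities using NetworkX's greedy modularity algorithm; the node names and edges are invented for the example.

```python
# Clustering nodes of a small, invented interaction network into communities.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"), ("alice", "carol"), ("bob", "carol"),   # one tight cluster
    ("dave", "erin"), ("erin", "frank"), ("dave", "frank"),   # another cluster
    ("carol", "dave"),                                        # a bridge between them
])

communities = greedy_modularity_communities(G)
for i, members in enumerate(communities, start=1):
    print(f"Community {i}: {sorted(members)}")
```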
STEP 5: VISUALIZATION
• In addition to numerical results, most of the seven layers of social media analytics will
also result in visual outcomes. The science of effective visualization known as visual
analytics is becoming an important part of interactive decision making facilitated by
solid visualization (Wong and Thomas 2004; Kielman and Thomas 2009). Effective
visualization is particularly helpful with complex and huge data because it can reveal
hidden patterns, relationships,
and trends. It is the effective visualization of the results that will demonstrate the value
of social media data to top management. Depending on the layer of analytics, the analysis part will result in relevant visualizations for effective communication of results. Text analytics, for instance, can result in a word co-occurrence cloud; hyperlink analytics will provide visual hyperlink networks; and location analytics can produce interactive maps. Depending on the type of data, different types of visualization are possible, including the following.
• Network data (with whom): network data visualizations can show who is connected to whom. For example, a Twitter follow-following network chart can show who is following whom. Different types of networks are discussed in a later chapter.
• Topical data (what): topical data visualization mostly focuses on the "what" aspect of the phenomenon under investigation. A text cloud generated from social media comments can show which topics/themes occur most frequently in the discussion (a small sketch follows after this list).
• Temporal data (when): temporal data visualization slices and dices data with respect to a time horizon and can reveal longitudinal trends, patterns, and relationships hidden in the data. Google Trends data, for example, can be used to visually investigate longitudinal search engine trends.
• Geospatial data (where): geospatial data visualization is used to map and locate data, people, and resources.
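As a simple illustration of topical-data visualization, the sketch below plots the most frequent terms from a set of made-up cleaned comments as a bar chart, a bare-bones stand-in for a word co-occurrence cloud.

```python
# Plotting the most frequent topics in (invented) cleaned comments.
from collections import Counter
import matplotlib.pyplot as plt

tokens = ["delivery", "price", "delivery", "support", "price", "delivery", "quality"]
top_terms = Counter(tokens).most_common(5)

labels, counts = zip(*top_terms)
plt.bar(labels, counts)
plt.title("Most frequent topics in comments")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
```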
STEP 6: INTERPRETATION
• Interpreting and translating analytics results into meaningful business insights is the art part of social media analytics. This step relies on human judgment to draw valuable knowledge from the visual data.
• Meaningful interpretation is particularly important when we are dealing
with descriptive analytics that leave room for different interpretations.
• Having domain knowledge and expertise is crucial to consuming the obtained results correctly.
Two strategies or approaches used here can be
1) producing easily consumable analytical results and
2) improving analytics consumption capabilities
(Ransbotham 2015). The first approach requires training data scientists and analysts to produce interactive and easy-to-use visual results. The second strategy focuses on improving the ability of managers and decision makers to understand and act on the analytics results.
Social Network Analysis
•Social network analysis (SNA) is the process of investigating social structures through the use of networks and graph theory.
• SNA is the practice of representing networks of people as graphs and then exploring these graphs.
• A typical social network representation has nodes for people, and edges connecting two nodes to represent one or more relationships between them, as shown in the figure.
• The resulting graph can reveal patterns of
connection among people.
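A minimal sketch of this node-and-edge representation, using the NetworkX library with invented people and friendships, shows how simple graph metrics begin to reveal connection patterns.

```python
# People as nodes, friendships as edges; basic metrics expose connection patterns.
import networkx as nx

friends = nx.Graph()
friends.add_edges_from([
    ("Ann", "Ben"), ("Ann", "Cid"), ("Ben", "Cid"), ("Cid", "Dia"),
])

print(nx.degree_centrality(friends))            # who has the most direct connections
print(list(nx.connected_components(friends)))   # groups of mutually reachable people
print(nx.shortest_path(friends, "Ann", "Dia"))  # how Ann is linked to Dia
```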
The following are some everyday types of social media networks that we come
across and that can be subject to network analytics.
• FRIENDSHIP NETWORKS: Friendship networks are the most common type of social media network; Facebook is the best-known example.
• FOLLOW-FOLLOWING NETWORKS: In a follow-following network, users follow (or keep track of) other users of interest. Twitter is a good example of a follow-following network, where users follow influential people, brands, and organizations (a directed-graph sketch follows after this list).
• CONTENT NETWORKS: Content networks are formed by the content posted by social media users. A network among YouTube videos is an example of a content network.
• PROFESSIONAL NETWORKS: LinkedIn is a good example of a professional network, where people manage their professional identity by creating a profile that lists their achievements, education, work history, and interests. Nodes in these networks are, for example, people, brands, and organizations, and links are professional relations (such as coworker, employee, or collaborator).
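Since a follow-following network is directional (A following B does not imply B follows A), it is naturally modeled as a directed graph. The sketch below uses invented account names and treats in-degree (number of followers) as a rough proxy for influence.

```python
# A follow-following network as a directed graph: an edge A -> B means "A follows B".
import networkx as nx

follows = nx.DiGraph()
follows.add_edges_from([
    ("user1", "brand_x"), ("user2", "brand_x"), ("user3", "brand_x"),
    ("user1", "news_org"), ("brand_x", "news_org"),
])

# Rank accounts by number of followers (incoming edges).
by_followers = sorted(follows.in_degree(), key=lambda pair: pair[1], reverse=True)
print(by_followers)   # [('brand_x', 3), ('news_org', 2), ...]
```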
Introduction to Natural Language Processing
• Natural language processing (NLP) is the
intersection of computer science,
linguistics and machine learning.
•The field focuses on communication
between computers and humans in
natural language.
•NLP is all about making computers
understand and generate human
language.
•Applications of NLP techniques include voice assistants like Amazon's Alexa and Apple's Siri, but also things like machine translation and text filtering.
The field is divided into three parts:
• Speech recognition: the translation of spoken language into text.
• Natural language understanding: a computer's ability to understand the meaning of language.
• Natural language generation: the production of natural language text by a computer.
TF-IDF (Term Frequency-Inverse Document Frequency)
•Document frequency (DF) is very similar to TF, but the only difference is that TF is the frequency counter for a term t in a single document, whereas DF is the count of documents in which the term t occurs.
•In other words, DF is the number of documents in which the word is present. We count one occurrence if the term is present in the document at least once; we do not need to know the number of times the term appears in that document.
IDF helps assess the significance of a word based on its frequency across a collection of documents:
▪ IDF is high for terms that are rare across the corpus.
▪ IDF is low for terms that appear in many documents, as these are typically considered less informative.
The idea behind IDF is that if a term appears in many documents, it is likely a common term that doesn't
provide much useful information for distinguishing documents from one another. Conversely, terms that appear
in fewer documents are considered more meaningful for differentiating between documents.
Example: with N = 5 documents and the term "cat" occurring in 3 of them, IDF(cat) = log(5/3) ≈ 0.2218.
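The worked value above can be reproduced directly. The sketch below assumes a toy corpus of five documents, three of which contain the token "cat", and computes IDF with a base-10 logarithm.

```python
# Reproducing IDF(cat) = log10(5/3) ~= 0.2218 on a toy five-document corpus.
import math

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",      # naive token matching: "cats" != "cat" here
    "my cat chased a mouse",
    "the dog barked all night",
    "a cat slept in the sun",
]

def idf(term, documents):
    df = sum(1 for d in documents if term in d.split())   # documents containing the term
    return math.log10(len(documents) / df)

print(round(idf("cat", docs), 4))   # 0.2218
```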
N-grams
An N-gram can be defined as a contiguous sequence of n items from a given sample of text or speech. The items can be letters, words, or base pairs according to the application. The N-grams typically are collected from a text or speech corpus.
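A minimal n-gram sketch, using words as the items, simply slides a window of length n over the token list.

```python
# Extracting contiguous n-item sequences (n-grams) from a tokenized sentence.
def ngrams(tokens, n):
    """Return the contiguous n-item sequences from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "social media data is big".split()
print(ngrams(words, 2))   # bigrams: [('social', 'media'), ('media', 'data'), ...]
print(ngrams(words, 3))   # trigrams
```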
Sentiment Analysis
• Sentiment analysis is an automated process of identifying and classifying the opinions, emotions, and attitudes expressed in a piece of text (e.g., as positive, negative, or neutral).
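As a toy illustration only, the sketch below scores a comment by counting hits against tiny hand-made positive and negative word lists; real sentiment tools rely on much richer lexicons or machine-learned models.

```python
# A toy lexicon-based sentiment classifier: count positive vs. negative words.
POSITIVE = {"love", "great", "excellent", "happy", "good"}
NEGATIVE = {"hate", "bad", "terrible", "slow", "poor"}

def sentiment(text):
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love the great camera"))       # positive
print(sentiment("terrible battery and slow UI"))  # negative
```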