Slide 2
Recurring Collection
"First, data may need to be collected just once for a specific purpose, such as a
one-time analysis of a historical dataset. This might happen if we're studying past
trends or conducting a one-off analysis of company metrics. However, in other
cases, we may need ongoing data collection, which allows us to stay updated with
real-time or frequently updated data. This might be particularly useful in fields
like logistics or supply chain management, where conditions can change daily or
even hourly."
Sources of Data
"Data can come from different sources. For instance, a company might use its own
data, such as internal sales records or employee performance metrics.
Alternatively, we might rely on third-party data: data collected from external
organizations. This is common in market research, where external data provides
valuable context."
Data Formats
"Lastly, data can come in various formats, including JSON, CSV, XML, and relational
databases. JSON is common in web applications, CSV is widely used for
spreadsheet-like data, XML is prevalent in configuration and data exchange, and
relational databases, queried with SQL, are typical for complex, structured
datasets. Recognizing
these formats is essential because each requires a different approach for
processing and analysis."
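To make the point about format-specific handling concrete, here is a minimal sketch that reads the same record from JSON, CSV, XML, and a relational table using only the Python standard library. The sample record and the function names are hypothetical illustrations, not taken from the slides.

```python
import csv
import io
import json
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample record, encoded in three text formats.
json_text = '[{"user_id": 1, "age": 44}]'
csv_text = "user_id,age\n1,44\n"
xml_text = "<users><user><user_id>1</user_id><age>44</age></user></users>"


def from_json(text):
    # JSON keeps numbers as numbers, so no conversion is needed.
    return json.loads(text)


def from_csv(text):
    # CSV values arrive as strings; numeric columns need explicit conversion.
    reader = csv.DictReader(io.StringIO(text))
    return [{key: int(value) for key, value in row.items()} for row in reader]


def from_xml(text):
    # XML is a tree: walk each <user> element and read its child tags.
    root = ET.fromstring(text)
    return [
        {child.tag: int(child.text) for child in user}
        for user in root.findall("user")
    ]


def from_sqlite():
    # A relational table is queried with SQL rather than parsed from text.
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute("CREATE TABLE users (user_id INTEGER, age INTEGER)")
    conn.execute("INSERT INTO users VALUES (1, 44)")
    rows = [dict(row) for row in conn.execute("SELECT user_id, age FROM users")]
    conn.close()
    return rows


records = [from_json(json_text), from_csv(csv_text), from_xml(xml_text), from_sqlite()]
for result in records:
    print(result)
```

All four loaders end up with the same list of dictionaries, but each one gets there differently: type conversion for CSV, tree traversal for XML, and a query for the database.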
Structured Data
"First, structured data is highly organized, making it easy to search, filter, and
extract information. This data usually comes in the form of spreadsheets or
databases, where each entry follows a consistent format. For example, customer
records in a database are often structured, with clear fields for names, contact
information, and transaction histories. The structure helps machine learning
algorithms quickly identify patterns and relationships within the data."
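As a rough illustration of why a consistent format makes data easy to search and filter, the sketch below stores a few customer records with identical fields and queries them directly. The records and field names are hypothetical.

```python
# Hypothetical customer records; every entry has the same fields.
customers = [
    {"name": "Ada", "email": "ada@example.com", "total_spent": 120.0},
    {"name": "Ben", "email": "ben@example.com", "total_spent": 45.5},
    {"name": "Cleo", "email": "cleo@example.com", "total_spent": 310.0},
]

# Filter: because every record has a total_spent field, one expression suffices.
big_spenders = [c["name"] for c in customers if c["total_spent"] > 100]

# Search: look up a single record by a known field value.
by_email = {c["email"]: c for c in customers}
ben = by_email["ben@example.com"]

print(big_spenders)
print(ben["total_spent"])
```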
Unstructured Data
"Next, we have unstructured data, which doesn't have a defined structure and is
harder to query directly. This includes images, videos, audio files, and large
blocks of text. For instance, if we're training a model to recognize images or
understand spoken language, we're working with unstructured data. Unlike structured
data, this type requires additional processing to make it useful for machine
learning, such as tagging or converting it into numerical features."
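One simple way to convert text into numerical features is a bag-of-words count, sketched below. This is only one possible technique, chosen here for brevity; the example sentences and the helper names `tokenize` and `vectorize` are hypothetical.

```python
from collections import Counter

# Hypothetical free-text documents.
docs = ["the cat sat", "the dog sat on the mat"]


def tokenize(text):
    # Naive tokenizer: lowercase and split on whitespace.
    return text.lower().split()


# The vocabulary is every distinct word seen, in a fixed (sorted) order.
vocab = sorted({token for doc in docs for token in tokenize(doc)})


def vectorize(text):
    # Count how often each vocabulary word occurs in the text.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]


vectors = [vectorize(doc) for doc in docs]

print(vocab)
for vector in vectors:
    print(vector)
```

Each document becomes a fixed-length list of counts, which is the kind of numerical input most learning algorithms expect.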
Semi-structured Data
"Finally, there's semi-structured data, which has elements of both structured and
unstructured formats. For example, in emails, headers like sender, receiver, and
timestamp have a predictable structure, but the body content is unstructured text.
Similarly, XML and JSON documents can be structured in parts, but they might also
contain free-form information that requires extra processing. Semi-structured data
offers some advantages for querying, but also presents challenges similar to
unstructured data."
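The email example can be sketched with Python's standard `email` module: the headers come out as named fields, while the body remains free text that needs further processing. The message below is a hypothetical sample.

```python
from email import message_from_string

# Hypothetical raw email: structured headers, a blank line, then a free-text body.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Date: Mon, 01 Jan 2024 09:00:00 +0000\n"
    "Subject: Delivery update\n"
    "\n"
    "Hi Bob, the shipment is running late, probably by two days.\n"
)

msg = message_from_string(raw)

# Structured part: headers can be read by name, like fields in a record.
headers = {
    "sender": msg["From"],
    "receiver": msg["To"],
    "timestamp": msg["Date"],
}

# Unstructured part: the body is just a string.
body = msg.get_payload()

print(headers)
print(body)
```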
"This slide shows a sample dataset in tabular form, a layout we commonly encounter
in machine learning. Here, I'll explain some important terms: Features, Values, and
Data Examples, which are crucial for understanding how data is structured."
Data Example
"First, let's look at the entire row. Each row here represents a single data
example or record. In machine learning, each data example is an individual entry in
our dataset, containing information about a single subject or instance. In this
case, each row represents a user with details about their age, job, marital status,
education, and other factors."
Feature
"Next, the columns at the top, like user_id, age, job, marital, and so on, are called
features. Features are the attributes or characteristics of each data example that
our machine learning model will use to make predictions or gain insights. Each
feature represents a specific type of information. For instance, age represents the
user's age, job represents their job type, and education indicates their
educational background."
Value
"Now, each cell within the table holds a specific value, which is the actual data
for a given feature in a data example. For example, the value for the job feature
in the second row is 'technician,' and the value for the age feature in that same
row is '44.' Values are the data points that machine learning algorithms analyze to
detect patterns."
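The three terms can be tied together in a small sketch. The feature names, and the second row's age and job, come from the narration above; every other value in the table is a hypothetical placeholder, since the slide itself is not reproduced here.

```python
# Features: the column names of the table.
features = ["user_id", "age", "job", "marital", "education"]

# Data examples: one row per user.
rows = [
    [1, 58, "management", "married", "tertiary"],  # entirely hypothetical
    [2, 44, "technician", "single", "secondary"],  # age and job as narrated; rest hypothetical
]

# Pair each row with the feature names so values can be looked up by feature.
examples = [dict(zip(features, row)) for row in rows]

# Value: one cell, identified by a data example (row) and a feature (column).
second = examples[1]
print(second["job"])
print(second["age"])

# A single feature across all data examples is one column of the table.
ages = [example["age"] for example in examples]
print(ages)
```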