Combining Datasets

1) The Pandas pd.concat() function concatenates Series or DataFrame objects along an axis, similar to np.concatenate() but with additional options such as preserving indices and adding keys.
2) pd.concat() can concatenate higher-dimensional objects like DataFrames, and it keeps duplicate indices unless told otherwise.
3) The append() method is a shorthand for pd.concat() on Series and DataFrame objects (it was removed in pandas 2.0).
4) pd.merge() joins DataFrames on common columns or indices and supports one-to-one, many-to-one, and many-to-many joins through a single interface, with inner, outer, left, and right join types.

Uploaded by

Ben Ten

Combining Datasets: Concat and Append

PREPARED BY
R. AKILA, AP(SG)
BSACIST
REFERENCE: https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html
Concatenation

Pandas has a function, pd.concat(), which has a syntax similar to np.concatenate() but offers a number of additional options.
pd.concat() can be used for a simple concatenation of Series or DataFrame objects, just as np.concatenate() can be used for simple concatenations of arrays.
It also works for higher-dimensional objects such as DataFrames, which are concatenated row-wise by default (axis=0).
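The slide's own examples did not survive extraction, so the following is a minimal sketch of simple concatenation. The names ser1, ser2, df1, and df2 and their contents are illustrative assumptions, not the original data.

```python
import pandas as pd

# Two small Series with distinct index labels (illustrative data)
ser1 = pd.Series(["A", "B", "C"], index=[1, 2, 3])
ser2 = pd.Series(["D", "E", "F"], index=[4, 5, 6])
ser_all = pd.concat([ser1, ser2])  # one Series holding all six entries

# Two small DataFrames that share the same columns
df1 = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"]}, index=[1, 2])
df2 = pd.DataFrame({"A": ["A3", "A4"], "B": ["B3", "B4"]}, index=[3, 4])
df_rows = pd.concat([df1, df2])  # stacked row-wise, since axis=0 is the default

print(ser_all)
print(df_rows)
```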
Duplicate indices

One important difference between np.concatenate() and pd.concat() is that Pandas concatenation preserves indices, even if the result will have duplicate indices.
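As a rough illustration, the sketch below concatenates two DataFrames that share the same index labels. The frames x and y are made-up data. The verify_integrity flag shown at the end is a real pd.concat() option that is not mentioned on the slide; it is included here only as one way to detect the duplication.

```python
import pandas as pd

# Both frames use the index labels 0 and 1 (illustrative data)
x = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
y = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]}, index=[0, 1])

combined = pd.concat([x, y])
print(combined.index.tolist())  # the labels 0 and 1 each appear twice

# Optional check: ask pandas to raise if the result would repeat a label
try:
    pd.concat([x, y], verify_integrity=True)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```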
Ignoring the index

Sometimes the index itself does not matter, and you would prefer it to simply be ignored. This option can be specified using the ignore_index flag. With ignore_index=True, the concatenation creates a new integer index for the result.
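A minimal sketch of the flag in use, with made-up frames x and y that share index labels:

```python
import pandas as pd

# Illustrative frames whose index labels overlap
x = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
y = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]}, index=[0, 1])

# ignore_index=True discards the input labels and numbers the rows afresh
renumbered = pd.concat([x, y], ignore_index=True)
print(renumbered)
```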
Adding MultiIndex keys

Another option is to use the keys option to specify a label for each data source; the result is a hierarchically indexed (MultiIndex) object containing the data.
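The keys option might be used as in the sketch below. The labels "x" and "y" and the frame contents are assumptions chosen for illustration.

```python
import pandas as pd

# Illustrative frames whose index labels overlap
x = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
y = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]}, index=[0, 1])

# Each source gets its own outer label, so the original labels stay usable
keyed = pd.concat([x, y], keys=["x", "y"])
print(keyed)

# Rows from one source can be pulled back out by its key
only_y = keyed.loc["y"]
```

Because the outer level records the source, the duplicate inner labels no longer collide.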
Concatenation with joins

In practice, data from different sources might have different sets of column names, and pd.concat() offers several options in this case. Consider the concatenation of two DataFrames that have some (but not all!) columns in common.
By default, the entries for which no data is available are filled with NA values.
To change this, we can set the join parameter of pd.concat() (older pandas releases also accepted a join_axes parameter).
By default, the join is a union of the input columns (join='outer'), but we can change this to an intersection of the columns using join='inner'.
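The difference between the two join settings can be sketched as follows. The frames df5 and df6 and their column names are illustrative assumptions; sort=False is passed only to keep the column order predictable.

```python
import pandas as pd

# df5 has columns A, B, C; df6 has columns B, C, D (illustrative data)
df5 = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1", "C2"]},
                   index=[1, 2])
df6 = pd.DataFrame({"B": ["B3", "B4"], "C": ["C3", "C4"], "D": ["D3", "D4"]},
                   index=[3, 4])

# Default: union of the columns, with NaN where a source lacks a column
outer = pd.concat([df5, df6], sort=False)

# join='inner': only the columns present in every input are kept
inner = pd.concat([df5, df6], join="inner")

print(outer)
print(inner)
```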
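Another option in older pandas releases was to directly specify the index of the remaining columns using the join_axes argument, which takes a list of Index objects. That argument was removed in pandas 1.0; reindexing the inputs to the desired columns before concatenating achieves the same effect.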
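On current pandas, where join_axes is no longer available, one possible substitute is to reindex an input to the wanted columns first. This is a sketch under that assumption, not the slide author's method; df5 and df6 are made-up frames.

```python
import pandas as pd

# df5 has columns A, B, C; df6 has columns B, C, D (illustrative data)
df5 = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1", "C2"]},
                   index=[1, 2])
df6 = pd.DataFrame({"B": ["B3", "B4"], "C": ["C3", "C4"], "D": ["D3", "D4"]},
                   index=[3, 4])

# Keep exactly the columns of df5: reindex df6 to those columns, then stack.
# Column D is dropped, and column A is NaN for the rows that came from df6.
like_df5 = pd.concat([df5, df6.reindex(columns=df5.columns)])
print(like_df5)
```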
The append() method

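Because direct concatenation is so common, Series and DataFrame objects in older pandas releases have an append() method that can accomplish the same thing in fewer keystrokes.
For example, rather than calling pd.concat([df1, df2]), you can call df1.append(df2).
Note that append() was deprecated in pandas 1.4 and removed in pandas 2.0, so current code should use pd.concat().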
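The sketch below shows the pd.concat() form and, only where the installed pandas still provides it, the equivalent append() call. The frames are illustrative, and the hasattr guard is an assumption added so the snippet runs on both old and new pandas.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"]}, index=[1, 2])
df2 = pd.DataFrame({"A": ["A3", "A4"], "B": ["B3", "B4"]}, index=[3, 4])

# Works on every pandas version
stacked = pd.concat([df1, df2])

# DataFrame.append exists only before pandas 2.0, so guard the call
if hasattr(pd.DataFrame, "append"):
    legacy = df1.append(df2)  # returns a new object; df1 is not modified
    assert legacy.equals(stacked)

print(stacked)
```

Both forms build a new object rather than extending df1 in place, so concatenating many frames is better done with one pd.concat() call on a list than with repeated calls in a loop.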
Relational Algebra

The behavior implemented in pd.merge() is a subset of what is known as relational algebra, which is a formal set of rules for manipulating relational data, and forms the conceptual foundation of operations available in most databases.
Categories of Joins
The pd.merge() function implements a number of types of joins: the one-to-one, many-to-one, and many-to-many joins. All three types of joins are accessed via an identical call to the pd.merge() interface; the type of join performed depends on the form of the input data.
One-to-one joins

The simplest type of merge expression is the one-to-one join, which is in many ways very similar to column-wise concatenation.
Suppose two DataFrames, df1 and df2, each hold different information about the same set of employees. To combine this information into a single DataFrame, we can use the pd.merge() function.
The pd.merge() function recognizes that each DataFrame has an "employee" column, and automatically joins using this column as a key.
The result of the merge is a new DataFrame that combines the information from the two inputs.
Notice that the order of entries in each column is not necessarily maintained: if the order of the "employee" column differs between df1 and df2, pd.merge() correctly accounts for this.
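A one-to-one merge could look like the sketch below. Only the "employee" column name comes from the slide; the other column names and all values are illustrative assumptions.

```python
import pandas as pd

# Same employees in both frames, listed in a different order
df1 = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa", "Sue"],
                    "group": ["Accounting", "Engineering", "Engineering", "HR"]})
df2 = pd.DataFrame({"employee": ["Lisa", "Bob", "Jake", "Sue"],
                    "hire_date": [2004, 2008, 2012, 2014]})

# No key is named: pd.merge() joins on the one column the frames share
df3 = pd.merge(df1, df2)
print(df3)
```

Each employee ends up on a single row with the matching hire_date, regardless of the row order in df2.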
Many-to-one joins

Many-to-one joins are joins in which one of the two key columns contains duplicate entries. For the many-to-one case, the resulting DataFrame will preserve those duplicate entries as appropriate.
In a many-to-one join that brings in "supervisor" information, the resulting DataFrame has an additional "supervisor" column, where the information is repeated in one or more locations as required by the inputs.
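A many-to-one merge might be sketched like this. The "supervisor" column name comes from the slide; the frame names and all values are made up for illustration.

```python
import pandas as pd

# "group" repeats in the employee table but is unique in the supervisor table
employees = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa", "Sue"],
                          "group": ["Accounting", "Engineering",
                                    "Engineering", "HR"]})
supervisors = pd.DataFrame({"group": ["Accounting", "Engineering", "HR"],
                            "supervisor": ["Carly", "Guido", "Steve"]})

# Each employee row picks up the supervisor of its group
with_sup = pd.merge(employees, supervisors)
print(with_sup)
```

The Engineering supervisor appears twice in the result, once for each employee in that group.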
Many-to-many joins

If the key column in both the left and right DataFrames contains duplicates, then the result is a many-to-many merge: every matching pair of left and right rows produces one row in the result.
This is perhaps clearest with a concrete example, such as a DataFrame showing one or more skills associated with a particular group.
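The sketch below pairs employees with the skills of their group. The frame names and the data are illustrative assumptions.

```python
import pandas as pd

# "group" contains duplicates in both frames
employees = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa", "Sue"],
                          "group": ["Accounting", "Engineering",
                                    "Engineering", "HR"]})
skills = pd.DataFrame({"group": ["Accounting", "Accounting",
                                 "Engineering", "Engineering", "HR", "HR"],
                       "skills": ["math", "spreadsheets", "coding", "linux",
                                  "spreadsheets", "organization"]})

# Every employee row is paired with every skill row of the same group
emp_skills = pd.merge(employees, skills)
print(emp_skills)
```

With two skills per group, each of the four employees contributes two rows, so the result has eight rows.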
Specification of the Merge Key
The on keyword

Most simply, you can explicitly specify the name of the key column using the on keyword, which takes a column name or a list of column names.
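For instance, the key can be named explicitly as in this sketch (illustrative data; only the "employee" column name comes from the slides):

```python
import pandas as pd

df1 = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa", "Sue"],
                    "group": ["Accounting", "Engineering", "Engineering", "HR"]})
df2 = pd.DataFrame({"employee": ["Lisa", "Bob", "Jake", "Sue"],
                    "hire_date": [2004, 2008, 2012, 2014]})

# Name the key instead of relying on automatic detection of shared columns
by_employee = pd.merge(df1, df2, on="employee")
print(by_employee)
```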
The left_on and right_on keywords

The on keyword works only if both the left and right DataFrames have the specified column name.
At times you may wish to merge two datasets with different column names; for example, we may have a dataset in which the employee name is labeled as "name" rather than "employee".
In this case, we can use the left_on and right_on keywords to specify the two column names.
The result has a redundant column that we can drop if desired, for example by using the drop() method of DataFrames.
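A minimal sketch of both steps follows. The "employee" and "name" column names come from the slide; the "salary" column and all values are assumptions.

```python
import pandas as pd

df1 = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa", "Sue"],
                    "group": ["Accounting", "Engineering", "Engineering", "HR"]})
# The same people, but the key column is called "name" here
df3 = pd.DataFrame({"name": ["Bob", "Jake", "Lisa", "Sue"],
                    "salary": [70000, 80000, 120000, 90000]})

# Match df1["employee"] against df3["name"]
merged = pd.merge(df1, df3, left_on="employee", right_on="name")

# Both key columns are kept, so drop the redundant one
cleaned = merged.drop("name", axis=1)
print(cleaned)
```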
The left_index and right_index keywords

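Sometimes, rather than merging on a column, you would instead like to merge on an index. For example, the employee names might be stored as the index of each DataFrame rather than as a column. In that case, the left_index and right_index flags of pd.merge() tell it to use the index as the merge key.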
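Merging on the index could be sketched as follows, with illustrative frames whose index is built from an "employee" column:

```python
import pandas as pd

df1 = pd.DataFrame({"employee": ["Bob", "Jake", "Lisa", "Sue"],
                    "group": ["Accounting", "Engineering", "Engineering", "HR"]})
df2 = pd.DataFrame({"employee": ["Lisa", "Bob", "Jake", "Sue"],
                    "hire_date": [2004, 2008, 2012, 2014]})

# Move the key into the index of each frame
df1a = df1.set_index("employee")
df2a = df2.set_index("employee")

# Use the index of both sides as the merge key
by_index = pd.merge(df1a, df2a, left_index=True, right_index=True)
print(by_index)
```

The two flags can also be mixed with left_on or right_on when only one side keeps its key in the index.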
Specifying Set Arithmetic for Joins
Consider merging two datasets that have only a single "name" entry in common: Mary.
By default, the result contains the intersection of the two sets of inputs; this is what is known as an inner join.
We can specify this explicitly using the how keyword, which defaults to "inner".
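The inner join can be sketched with two small frames that overlap only on Mary. The "name" key and the Mary entry come from the slide; the other names and the "food" and "drink" columns are illustrative assumptions.

```python
import pandas as pd

df6 = pd.DataFrame({"name": ["Peter", "Paul", "Mary"],
                    "food": ["fish", "beans", "bread"]})
df7 = pd.DataFrame({"name": ["Mary", "Joseph"],
                    "drink": ["wine", "beer"]})

# The default and the explicit form give the same result
default_join = pd.merge(df6, df7)
inner_join = pd.merge(df6, df7, how="inner")
print(inner_join)
```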
Other options for the how keyword are 'outer', 'left', and 'right'. An outer join returns a join over the union of the input keys, and fills in all missing values with NAs.
The left join and right join return joins over the left entries and right entries, respectively.
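The three remaining settings can be compared side by side in a short sketch. The frames are illustrative; only the "name" key and the shared Mary entry are taken from the slides.

```python
import pandas as pd

df6 = pd.DataFrame({"name": ["Peter", "Paul", "Mary"],
                    "food": ["fish", "beans", "bread"]})
df7 = pd.DataFrame({"name": ["Mary", "Joseph"],
                    "drink": ["wine", "beer"]})

outer_join = pd.merge(df6, df7, how="outer")  # every name from either frame
left_join = pd.merge(df6, df7, how="left")    # every name from df6
right_join = pd.merge(df6, df7, how="right")  # every name from df7

for label, frame in [("outer", outer_join), ("left", left_join),
                     ("right", right_join)]:
    print(label)
    print(frame)
```

Rows without a partner on the other side get NaN in the columns that came from that side.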
