Handout6 - visualization
• Let’s suppose you want to add the number 2 to every item in the list.
The intuitive way to do this is something like this:
• That was not possible with a list, but you can do that on an array:
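A minimal sketch of the difference, assuming a hypothetical list named list1 with made-up values:

```python
import numpy as np

list1 = [0, 1, 2, 3, 4]  # hypothetical example values

# Adding a number directly to a list raises a TypeError.
try:
    list1 + 2
    list_error = None
except TypeError as err:
    list_error = err

# On an array, the addition is applied to every element.
arr1 = np.array(list1)
arr_plus_2 = arr1 + 2
print(arr_plus_2)  # [2 3 4 5 6]
```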
• Note that once a NumPy array is created, you
cannot increase its size.
• To do so, you will have to create a new array.
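One way to see this, as a sketch with made-up values, is np.append, which leaves the original array untouched and returns a new one:

```python
import numpy as np

arr = np.array([1, 2, 3])

# np.append does not grow `arr` in place; it returns a new, larger array.
bigger = np.append(arr, [4, 5])

print(arr.size)     # 3
print(bigger.size)  # 5
```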
Create a 2D array from a list of lists
• You can pass a list of lists to create a matrix-like 2D array.
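For instance, assuming a hypothetical list of lists named list2:

```python
import numpy as np

list2 = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]  # hypothetical example values
arr2d = np.array(list2)

print(arr2d.shape)  # (3, 3)
```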
The dtype argument
• You can specify the data type by setting the dtype argument.
• Some of the most commonly used NumPy dtypes are: float, int, bool, str,
and object.
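A short sketch of the dtype argument, with made-up values:

```python
import numpy as np

# The integers in the list are stored as floats because of dtype='float'.
arr2d_f = np.array([[0, 1, 2], [3, 4, 5]], dtype='float')

print(arr2d_f.dtype)  # float64
```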
The astype method
• You can also convert it to a different data-type using the astype method.
• Remember that, unlike lists, all items in an array have to be of the same
type.
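As a sketch with made-up values, astype returns a converted copy and leaves the original array unchanged:

```python
import numpy as np

arr_f = np.array([0.5, 1.9, 2.0])

# Converting floats to integers drops the fractional part.
arr_i = arr_f.astype('int')

# The integers can in turn be converted to strings.
arr_s = arr_i.astype('str')

print(arr_i)  # [0 1 2]
```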
dtype=‘object’
• However, if you are uncertain about what data type your array will
hold, or if you want to hold characters and numbers in the same
array, you can set the dtype as 'object'.
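A minimal illustration, using made-up values, of an object array holding numbers and characters together:

```python
import numpy as np

# With dtype='object', each element keeps its own Python type.
arr_obj = np.array([1, 'a', 2.5], dtype='object')

print(arr_obj.dtype)  # object
```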
The tolist() method
• You can always convert an array into a list using the tolist() method.
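For example, with a small made-up 2D array:

```python
import numpy as np

arr2d = np.array([[1, 2], [3, 4]])

# tolist() returns nested Python lists mirroring the array's dimensions.
as_list = arr2d.tolist()

print(as_list)  # [[1, 2], [3, 4]]
```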
Inspecting a NumPy array
• There is a range of attributes and functions built into NumPy that allow you to
inspect different aspects of an array.
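The following sketch shows a few commonly used array attributes. The choice of attributes is an assumption about what is most useful here, and the values are made up:

```python
import numpy as np

arr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

print(arr.shape)  # (2, 3)  -> rows, columns
print(arr.ndim)   # 2       -> number of dimensions
print(arr.dtype)  # float64 -> data type of the items
print(arr.size)   # 6       -> total number of items
```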
Extracting specific items from an array
• You can extract portions of the array using indices, much like when
you’re working with lists.
• Unlike lists, however, arrays can optionally accept as many
indices in the square brackets as there are dimensions.
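A sketch of this on a made-up 2D array, where the first index refers to rows and the second to columns:

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# First two rows and first two columns.
first_two = arr[:2, :2]

# A single item: row 1, column 2 (counting from 0).
single = arr[1, 2]

print(first_two.tolist())  # [[1, 2], [4, 5]]
print(single)              # 6
```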
Boolean indexing
• A boolean index array has the same shape as the array to be filtered,
but it contains only True and False values.
print(arr3[boo])  # try it
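The definitions of arr3 and boo are not shown here, so the sketch below assumes hypothetical values for both:

```python
import numpy as np

arr3 = np.array([[1, 2, 3], [4, 5, 6]])  # hypothetical example values

# The comparison yields a boolean array with the same shape as arr3.
boo = arr3 > 3

# Indexing with it keeps only the items where boo is True.
print(arr3[boo])  # [4 5 6]
```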
Pandas
• Pandas, like NumPy, is one of the most popular Python libraries for
data analysis.
• It is a high-level abstraction over low-level NumPy, which is largely written in
C.
• Pandas provides high-performance, easy-to-use data structures and
data analysis tools.
• There are two main structures used by pandas: data frames and
series.
Indices in a pandas series
• A pandas series is similar to a list, but differs in the fact that a series
associates a label with each element. This makes it look like a dictionary.
• If an index is not explicitly provided by the user, pandas creates a RangeIndex
ranging from 0 to N-1.
• Each series object also has a data type.
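A minimal sketch of a series with and without an explicit index. The values and labels are made up:

```python
import pandas as pd

# Without an index, pandas creates a RangeIndex from 0 to N-1.
s_default = pd.Series([10, 20, 30])

# With an explicit index, each element gets a label.
s_labeled = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(s_default.index)  # RangeIndex(start=0, stop=3, step=1)
print(s_labeled.dtype)  # int64
```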
• As you may suspect by this point, a series has ways to extract all of
the values in the series, as well as individual elements by index.
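For example, on a made-up series:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(s.values)  # [10 20 30]  -> all values as a NumPy array
print(s['b'])    # 20          -> a single element by its label
```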
• It is easy to retrieve several elements of a series by their indices or
make group assignments.
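A sketch of both operations on a made-up series:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Retrieve several elements by passing a list of labels.
subset = s[['a', 'c']]

# Group assignment: set several elements at once.
s[['b', 'd']] = 0

print(subset.tolist())  # [10, 30]
print(s.tolist())       # [10, 0, 30, 0]
```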
Filtering and maths operations
• Filtering and maths operations are easy with Pandas as well.
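As an illustration with made-up values:

```python
import pandas as pd

s = pd.Series([1, 5, 10, 15])

# Filtering: keep only the elements greater than 4.
filtered = s[s > 4]

# Maths: the operation is applied to every element.
doubled = s * 2

print(filtered.tolist())  # [5, 10, 15]
print(doubled.tolist())   # [2, 10, 20, 30]
```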
Pandas data frame
• Put simply, a data frame is a table, with rows and columns.
• Each column in a data frame is a series object.
• Each row consists of one element from each series.
• You can also create a data frame from a list.
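A minimal sketch, assuming a list of rows and hypothetical column names:

```python
import pandas as pd

# Each inner list is one row; the values are made up for illustration.
rows = [['Alpha', 1.5], ['Beta', 2.5], ['Gamma', 3.5]]
df = pd.DataFrame(rows, columns=['name', 'score'])

print(df.shape)  # (3, 2)
```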
• You can ascertain the type of a column with the type() function.
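For example, on a made-up data frame with hypothetical column names, type() confirms that a column is a series, while the dtype attribute gives the type of the data it holds:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alpha', 'Beta'], 'score': [1.5, 2.5]})

column = df['score']
print(type(column))   # the Series class
print(column.dtype)   # float64
```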
• A Pandas data frame object has two indices: a column index and a row
index.
• Again, if you do not provide one, Pandas will create a RangeIndex from 0
to N-1.
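Both indices can be inspected directly, as in this sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alpha', 'Beta', 'Gamma'],
                   'score': [1.5, 2.5, 3.5]})

print(df.index)          # RangeIndex(start=0, stop=3, step=1)
print(list(df.columns))  # ['name', 'score']
```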
• There are numerous ways to provide row indices explicitly.
• For example, you could provide an index when creating a data frame:
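A sketch using the index argument, with made-up labels and values:

```python
import pandas as pd

# The index labels 'AA', 'BB', 'CC' are hypothetical.
df = pd.DataFrame({'name': ['Alpha', 'Beta', 'Gamma'],
                   'score': [1.5, 2.5, 3.5]},
                  index=['AA', 'BB', 'CC'])

print(list(df.index))  # ['AA', 'BB', 'CC']
```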
• Or you could set it later, on an existing data frame.
• Here, I also named the index ‘country code’.
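The original code is not shown here, so this is only a sketch of one way to do it, with made-up labels and values:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alpha', 'Beta', 'Gamma'],
                   'score': [1.5, 2.5, 3.5]})

# Replace the default RangeIndex with explicit (hypothetical) labels.
df.index = ['AA', 'BB', 'CC']

# Give the row index a name.
df.index.name = 'country code'

print(df.index.name)  # country code
```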
• Row access using index can be performed in several ways.
• First, you could use .loc[] and provide an index label.
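For instance, with made-up labels and values (note that .loc takes square brackets):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alpha', 'Beta', 'Gamma'],
                   'score': [1.5, 2.5, 3.5]},
                  index=['AA', 'BB', 'CC'])

# A single label returns that row as a series.
row = df.loc['BB']

print(row['name'])   # Beta
print(row['score'])  # 2.5
```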
• Particular rows and columns can be selected this way.
• You can give .loc[] two arguments, an index list and a column list; slicing
is supported as well.
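A sketch of both forms, with made-up labels, column names and values:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alpha', 'Beta', 'Gamma'],
                   'score': [1.5, 2.5, 3.5],
                   'rank': [3, 2, 1]},
                  index=['AA', 'BB', 'CC'])

# An index list and a column list.
subset = df.loc[['AA', 'CC'], ['name', 'score']]

# Label slices; unlike positional slices, both endpoints are included.
sliced = df.loc['AA':'BB', 'name':'score']

print(subset.shape)        # (2, 2)
print(list(sliced.index))  # ['AA', 'BB']
```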
Filtering
• Filtering is performed using so-called Boolean arrays.
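As a sketch on a made-up data frame, a comparison on a column produces the Boolean array, which then selects the matching rows:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alpha', 'Beta', 'Gamma'],
                   'score': [1.5, 2.5, 3.5]})

# One True/False value per row.
mask = df['score'] > 2.0

# Keep only the rows where the mask is True.
filtered = df[mask]

print(filtered['name'].tolist())  # ['Beta', 'Gamma']
```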
Deleting columns
• You can delete a column using the drop() method.
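A minimal sketch with hypothetical column names. By default drop() returns a new data frame and leaves the original unchanged:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alpha', 'Beta'],
                   'score': [1.5, 2.5],
                   'rank': [2, 1]})

without_rank = df.drop(columns=['rank'])

print(list(without_rank.columns))  # ['name', 'score']
print(list(df.columns))            # ['name', 'score', 'rank']
```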
Reading from and writing to a file
• Pandas supports many popular file formats including CSV, XML, HTML,
Excel, SQL, JSON, etc.
• Out of all of these, CSV is the file format that you will work with the
most.
• You can read in the data from a CSV file using the read_csv() function.
• Similarly, you can write a data frame to a CSV file with the to_csv()
method.
• Pandas has the capacity to do much more than what we have covered
here, such as grouping data and even data visualisation.
• However, as with NumPy, we don’t have enough time to cover every
aspect of pandas here.
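The to_csv() and read_csv() round trip mentioned above can be sketched as follows. The file name and data are made up, and a temporary directory is used so that nothing is left on disk:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'name': ['Alpha', 'Beta'], 'score': [1.5, 2.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.csv')  # hypothetical file name

    # index=False leaves the row index out of the file.
    df.to_csv(path, index=False)

    # Read the same file back into a new data frame.
    df_back = pd.read_csv(path)

print(df_back['name'].tolist())  # ['Alpha', 'Beta']
```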
Exploratory data analysis (EDA)
Exploring your data is a crucial step in data analysis. It involves:
• Organising the data set
• Plotting aspects of the data set
• Possibly producing some numerical summaries, such as central tendency and
spread.
“Exploratory data analysis can never be the whole story, but nothing
else can serve as the foundation stone.”
- John Tukey.
Data Visualization
• Why data visualization?
• Gain insight into an information space by mapping data onto
graphical primitives
• Provide qualitative overview of large data sets
• Search for patterns, trends, structure, irregularities,
relationships among data
• Help find interesting regions and suitable parameters for
further quantitative analysis
• Provide a visual proof of computer representations derived
Data Representation
We take numbers and convert them to pictures.
How do limits on display area and presentation time affect
our design of a visualization tool?
Interaction
How can we help users to interact in order to explore data and
navigate their way through many views of that data?
Human performance
What aspects of human visual performance and cognition must
we be aware of when designing visualization tools?
Representation
Why do we represent?
• Represent: to present clearly to the mind.
• Raw data is in a form unsuited to building insight or creating a mental
model.
• As data comes in different formats, three aspects should be
identified:
• Type: numeric / categorical / ordinal / relationship / text / audio / images …
• Dimension: the more attributes there are, the harder the data is to visualize.
• User: interpreting a representation is a complex task, and the user’s characteristics
influence the design and effectiveness of a representation.
Different types of Visualization
Star Plots (Spidergram)
• For displaying multivariate data. Each star represents
a single observation.
• It is generated in a multi-plot format with many
stars on each page and each star representing
one observation.
• It is used to examine the relative values for a single
data point.
• It is used to locate similar or dissimilar points.
[Figure: Bob’s spidergram and Tony’s spidergram, with Bob’s Chemistry score marked and the class average shown in light grey.]
Magnification
It is the process of enlarging the apparent size (not the
physical size) of something to show a relative attribute.
Suppose that New Zealanders owned ten times as many bicycles
as do Australians. To represent this fact, magnify the
area of New Zealand on a map by ten times.
[Figure: map of Australia and New Zealand]
Parallel coordinate plots
Iconic representation
Iconic representation is the use of pictorial images to make actions, objects,
and concepts in a display easier to find, recognize, learn, and remember.
Chernoff Faces
Text
• A tag cloud (word cloud) is a visual representation of text data.
• It is used to depict keyword metadata (tags) on websites (for
navigation), or to visualize free-form text.
• Tags are usually single words, and the importance of each tag is
shown with font size or color.
• This format is useful for quickly perceiving the most prominent terms
and for judging the relative prominence of each term.
Time-Varying Data
• Show variable(s) related to time.
• Variables could be discrete (histogram) or continuous
(ThemeRiver).
• ThemeRiver is a visualization that depicts thematic
changes in a collection of variables over a period of time
using a river metaphor.
Download the data
• Download the Pokemon dataset from:
https://github.com/LewBrace/da_and_vis_python
• Unzip the folder, and save the data file in a location you’ll remember.
Reading in the data
• First we import the Python packages we are going to use.
• Then we use Pandas to load in the dataset as a data frame.
NOTE: The index_col argument states that we'll treat the first column
of the dataset as the ID column.
NOTE: The encoding argument allows us to bypass an input error created
by special characters in the data set.
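The loading code itself is not shown here, so the following is only a sketch of how the two arguments behave. It reads from made-up in-memory bytes instead of the real data file, and the 'latin-1' encoding is an assumption rather than a known property of the dataset:

```python
import io

import pandas as pd

# Hypothetical stand-in for the data file: two made-up rows, one containing
# the special character "é", stored as Latin-1 bytes.
raw = 'id,Name,Attack\n1,Alp\u00e9,38\n2,Beta,55\n'.encode('latin-1')

# index_col=0 uses the first column as the row index (the ID column);
# encoding tells pandas how to decode the special character.
df = pd.read_csv(io.BytesIO(raw), index_col=0, encoding='latin-1')

print(df.index.name)  # id
```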
Examine the data set
• We could spend time staring at these
numbers, but that is unlikely to offer
us any form of insight.
• We could begin by conducting all of
our statistical tests.
• However, a good field commander
never goes into battle without first
doing a reconnaissance of the terrain…
• This is exactly what EDA is for…
Plotting a histogram in Python
Bins
• You may have noticed the two histograms we’ve seen so far look different,
despite using the exact same data.
• This is because they have different bin values.
• The left graph used the default bins generated by plt.hist(), while the one on the
right used bins that I specified.
• There are a couple of ways to manipulate bins in matplotlib.
• Here, I specified where the edges of the bars of the histogram are;
the bin edges.
• You could also specify the number of bins, and Matplotlib will automatically
generate a number of evenly spaced bins.
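Both ways of setting bins can be sketched as follows. The data is randomly generated for illustration, and the non-interactive Agg backend is chosen only so the example runs without a display:

```python
import matplotlib
matplotlib.use('Agg')  # assumption: no display available
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.integers(0, 100, size=200)  # made-up data in the range 0..99

fig, (ax_left, ax_right) = plt.subplots(1, 2)

# Explicit bin edges: each adjacent pair of edges bounds one bar.
edges = [0, 20, 40, 60, 80, 100]
counts_a, edges_a, _ = ax_left.hist(values, bins=edges)

# A number of bins: Matplotlib spaces them evenly over the data range.
counts_b, edges_b, _ = ax_right.hist(values, bins=8)

print(len(counts_a))  # 5 bars from 6 edges
print(len(counts_b))  # 8 bars
plt.close(fig)
```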
Seaborn
• Matplotlib is a powerful, but sometimes unwieldy, Python library.
• Seaborn provides a high-level interface to Matplotlib and makes it easier
to produce graphs like the one on the right.
• Some IDEs incorporate elements of this “under the hood” nowadays.
Benefits of Seaborn
• Seaborn offers:
- Using default themes that are aesthetically pleasing.
- Setting custom colour palettes.
- Making attractive statistical plots.
- Easily and flexibly displaying distributions.
- Visualising information from matrices and DataFrames.
• The last three points have led to Seaborn becoming the exploratory
data analysis tool of choice for many Python users.
Plotting with Seaborn
• One of Seaborn's greatest strengths is its diversity of plotting
functions.
• Most plots can be created with one line of code.
• For example….
Histograms
• Allow you to plot the distributions of numeric variables.
Other types of graphs: Creating a scatter plot
Name of variable we want on the x-axis.
Name of our dataframe, fed to the “data=” argument.
Colour by stage.
Separate by stage.
Generate using a swarmplot.
Rotate axis on x-ticks by 45 degrees.
Heatmaps
• Useful for visualising matrix-like data.
• Here, we’ll plot the correlation of the stats_df variables
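The contents of stats_df are not shown here, so this sketch builds a stand-in with hypothetical column names and random values, then passes its correlation matrix to Seaborn's heatmap function:

```python
import matplotlib
matplotlib.use('Agg')  # assumption: no display available
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)

# Hypothetical stand-in for stats_df: 50 rows of made-up stats.
stats_df = pd.DataFrame(rng.integers(1, 100, size=(50, 3)),
                        columns=['Attack', 'Defense', 'Speed'])

# Pairwise correlation between the columns.
corr = stats_df.corr()

# Each cell of the matrix is drawn as a coloured square.
ax = sns.heatmap(corr)
```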
A box plot
• The total, stage, and legendary entries are not combat stats so we should remove
them.
• Pandas makes this easy to do: we just use its .drop() method to create a new
dataframe that doesn’t include the variables we don’t want.
Plotting all data: Empirical cumulative
distribution functions (ECDFs)
• An alternative way of visualising a
distribution of a variable in a large dataset
is to use an ECDF.
• Here we have an ECDF that shows the
percentages of different attack strengths
of Pokemon.
• An x-value of an ECDF is the quantity you
are measuring, i.e. attack strength.
• The y-value is the fraction of data points
that have a value less than or equal to the
corresponding x-value. For example…
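A minimal sketch of how the x- and y-values of an ECDF can be computed. The function name ecdf and the attack values are made up for illustration:

```python
import numpy as np

def ecdf(data):
    """Return the sorted values and, for each, the fraction of points at or below it."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

attack = np.array([55, 30, 80, 45, 60])  # made-up attack strengths
x, y = ecdf(attack)

print(x.tolist())  # [30, 45, 55, 60, 80]
print(y.tolist())  # [0.2, 0.4, 0.6, 0.8, 1.0]
```

Reading off one pair: 60% of these values are 55 or lower. Plotting y against x as points gives the ECDF.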
NetworkX Python package explanation: Example
Node and Edge iterator
Directed Graph
NetworkX: Drawing and plotting
Other Graph Operators