Pandas is a Python library for working with tabular data and performing data analysis. It supports importing data from a variety of formats, then cleaning, transforming, and preparing that data. Pandas provides data structures and functions for fast manipulation of numerical tables and time series, with the aim of making work with structured data intuitive.
Pandas
Dr. Noman Islam
Getting Started

Introduction
• pandas contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python.
• pandas is designed for working with tabular or heterogeneous data.
• NumPy, by contrast, is best suited for working with homogeneous numerical array data.

Topics covered
• Series: using indexes to access data, filtering, creating a Series from a dictionary, checking for null values, joining two Series
• DataFrame: modifying column values, adding or deleting a column, transposing a DataFrame, slicing, the values attribute
• Indexing, selection, and filtering: dropping entries from an axis, the axis parameter, indexing and slicing, selection with loc and iloc
• Function application and mapping: the apply function, descriptive statistics
• Reading and writing data, filtering out missing values, filling missing data

Data Loading, Storage, and File Formats

Reading data in pandas: optional arguments
• Indexing: can treat one or more columns as the index of the returned DataFrame, and choose whether to get column names from the file, from the user, or not at all.
• Type inference and data conversion: includes user-defined value conversions and custom lists of missing value markers.
• Datetime parsing: includes combining capability, such as combining date and time information spread over multiple columns into a single column in the result.
• Iterating: support for iterating over chunks of very large files.
• Unclean data issues: skipping rows or a footer, handling comments, or minor issues such as numeric data with thousands separated by commas.
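As a minimal sketch of several of these arguments used together (the column names and values here are invented for illustration):

    import pandas as pd
    from io import StringIO

    # Invented CSV: a date column, numeric values with comma thousands
    # separators, and a custom missing-value marker ("N/A").
    raw = StringIO(
        'date,amount,city\n'
        '2023-01-05,"1,250",Karachi\n'
        '2023-01-06,"2,100",N/A\n'
    )

    df = pd.read_csv(
        raw,
        index_col='date',       # indexing: use a column as the result's index
        parse_dates=['date'],   # datetime parsing / type inference
        thousands=',',          # unclean data: "1,250" becomes the integer 1250
        na_values=['N/A'],      # custom missing-value marker
    )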
Examples
• df = pd.read_csv('examples/ex1.csv')
• pd.read_csv('examples/ex2.csv', names=names, index_col='message')
• Passing a regular expression as the separator: result = pd.read_table('examples/ex3.txt', sep='\s+')
• Skipping rows: pd.read_csv('examples/ex4.csv', skiprows=[0, 2, 3])

Further topics: handling null values, reading a text file in pieces, iterating over chunks, writing files, working with delimited formats, writing CSV data. (A sketch of chunked reading and writing follows below.)
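A minimal sketch of reading a large file in pieces and writing a result back out; the file paths and the 'key' column are hypothetical:

    import pandas as pd

    # chunksize makes read_csv return an iterator of DataFrames rather
    # than loading the whole file into memory at once.
    reader = pd.read_csv('examples/big_file.csv', chunksize=1000)

    # Accumulate per-chunk value counts of a (hypothetical) 'key' column.
    tot = pd.Series(dtype='float64')
    for chunk in reader:
        tot = tot.add(chunk['key'].value_counts(), fill_value=0)

    # Writing the aggregated result back out as CSV.
    tot.sort_values(ascending=False).to_csv('examples/key_counts.csv')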
JSON format
• pandas.read_json can automatically convert JSON datasets in specific arrangements into a Series or DataFrame.
• The default options for pandas.read_json assume that each object in the JSON array is a row in the table.

Web scraping
• The pandas.read_html function has a number of options, but by default it searches for and attempts to parse all tabular data contained within <table> tags.
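A minimal sketch of both readers; the JSON string is inline, while the URL is a placeholder (read_html also requires an HTML parser such as lxml to be installed):

    import pandas as pd
    from io import StringIO

    # read_json: each object in the JSON array becomes one row.
    frame = pd.read_json(StringIO('[{"a": 1, "b": 2}, {"a": 3, "b": 4}]'))

    # read_html returns a *list* of DataFrames, one per <table> it finds.
    # The URL is a placeholder, so the call is left commented out:
    # tables = pd.read_html('https://example.com/page_with_tables.html')
    # first_table = tables[0]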
Further topics: saving in binary format, the HDF5 format, the Excel format, interacting with web APIs, interacting with databases.

Data Cleaning and Preparation

Handling missing data: dropna options
• With DataFrame objects, things are a bit more complex. You may want to drop rows or columns that are all NA, or only those containing any NAs.
• dropna by default drops any row containing a missing value.
• Passing how='all' will drop only rows that are all NA.
• To drop columns in the same way, pass axis=1.
• To keep rows with at least a certain number of observations: df.dropna(thresh=2)

Filling in missing data
• Calling fillna with a constant replaces missing values with that value.
• Calling fillna with a dict, you can use a different fill value for each column: df.fillna({1: 0.5, 2: 0})
• fillna returns a new object, but you can modify the existing object in place: _ = df.fillna(0, inplace=True)

Removing duplicates
• data.duplicated()
• Relatedly, drop_duplicates returns a DataFrame where the duplicated array is False.
• You can specify any subset of the columns to detect duplicates: data.drop_duplicates(['k1'])
• duplicated and drop_duplicates by default keep the first observed value combination; passing keep='last' will return the last one.

Map and replace
• data.replace(-999, np.nan)
• data.replace([-999, -1000], np.nan)
• data.replace([-999, -1000], [np.nan, 0])
• data.replace({-999: np.nan, -1000: 0})

Map with index and other index methods
• data.rename(index=str.title, columns=str.upper)
• data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})
• data.rename(index={'OHIO': 'INDIANA'}, inplace=True)

Discretization and binning
• Dividing data into bins:
  ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
  bins = [18, 25, 35, 60, 100]
  cats = pd.cut(ages, bins)
• The codes and categories of the binned data: cats.codes, cats.categories
• The counts for each category: pd.value_counts(cats)
• Making the left side a closed interval: pd.cut(ages, [18, 26, 36, 61, 100], right=False)
• Providing your own group names:
  group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
  pd.cut(ages, bins, labels=group_names)

Detecting and filtering outliers
• data.describe()
• Finding values in one of the columns exceeding 3 in absolute value:
  col = data[2]
  col[np.abs(col) > 3]
• To select all rows having a value exceeding 3 or -3, use the any method on a boolean DataFrame: data[(np.abs(data) > 3).any(1)]
• Capping values outside the interval -3 to 3: data[np.abs(data) > 3] = np.sign(data) * 3

Permutation and random sampling
• Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering:
  df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))
  sampler = np.random.permutation(5)
  df.take(sampler)
• To select a random subset without replacement, use the sample method on Series and DataFrame: df.sample(n=3)
• To generate a sample with replacement (to allow repeat choices), pass replace=True to sample:
  choices = pd.Series([5, 7, -1, 6, 4])
  draws = choices.sample(n=10, replace=True)
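A consolidated, runnable sketch of the cleaning operations above on small invented data (expressions whose results are not reused are shown purely for illustration):

    import numpy as np
    import pandas as pd

    # --- Missing data ---
    df = pd.DataFrame([[1.0, 6.5, 3.0],
                       [1.0, np.nan, np.nan],
                       [np.nan, np.nan, np.nan]])
    df.dropna()                   # drops any row containing an NA
    df.dropna(how='all')          # drops only rows that are all NA
    df.dropna(axis=1, how='all')  # same idea for columns
    df.fillna({1: 0.5, 2: 0})     # a different fill value per column

    # --- Duplicates ---
    data = pd.DataFrame({'k1': ['one', 'one', 'two'], 'k2': [1, 1, 3]})
    data.duplicated()                         # boolean Series marking repeats
    data.drop_duplicates(['k1'], keep='last') # dedupe on a subset of columns

    # --- Replacing sentinel values ---
    s = pd.Series([1.0, -999.0, 2.0, -1000.0])
    s.replace({-999: np.nan, -1000: 0})

    # --- Binning ---
    ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
    bins = [18, 25, 35, 60, 100]
    cats = pd.cut(ages, bins,
                  labels=['Youth', 'YoungAdult', 'MiddleAged', 'Senior'])
    pd.Series(cats).value_counts()            # counts per category

    # --- Outliers: select and cap values outside [-3, 3] ---
    rnd = pd.DataFrame(np.random.standard_normal((1000, 4)))
    rnd[(np.abs(rnd) > 3).any(axis=1)]        # rows with any extreme value
    rnd[np.abs(rnd) > 3] = np.sign(rnd) * 3   # cap in place

    # --- Permutation and sampling ---
    perm_df = pd.DataFrame(np.arange(20).reshape((5, 4)))
    perm_df.take(np.random.permutation(5))    # reorder rows
    perm_df.sample(n=3)                       # sample without replacement
    pd.Series([5, 7, -1, 6, 4]).sample(n=10, replace=True)  # with replacement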
Data Wrangling: Join, Combine, and Reshape
• In many applications, data may be spread across a number of files or databases, or be arranged in a form that is not easy to analyze.
• This chapter focuses on tools to help combine, join, and rearrange data.

Hierarchical Indexing
• Hierarchical indexing is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis.
• Somewhat abstractly, it provides a way to work with higher dimensional data in a lower dimensional form.
• Example: data.unstack() and data.unstack().stack()
• With a DataFrame, either axis can have a hierarchical index.

Names for indexes
• The hierarchical levels can have names.

Reordering and Sorting Levels
• swaplevel takes two level numbers or names and returns a new object with the levels interchanged.
• sort_index, on the other hand, sorts the data using only the values in a single level.

Summary statistics
• frame.sum(level='key2')
• frame.sum(level='color', axis=1)

Combining and Merging Datasets
• pandas.merge connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
• pandas.concat concatenates or "stacks" together objects along an axis.
• The combine_first instance method enables splicing together overlapping data to fill in missing values in one object with values from another.

Merge
• pd.merge(df1, df2)
• pd.merge(df1, df2, on='key')
• pd.merge(df3, df4, left_on='lkey', right_on='rkey')
• By default merge does an 'inner' join; the keys in the result are the intersection, or the common set, found in both tables.
• Other possible options are 'left', 'right', and 'outer'. The outer join takes the union of the keys, combining the effect of applying both left and right joins.
• To merge with multiple keys, pass a list of column names.
• A last issue to consider in merge operations is the treatment of overlapping column names: merge has a suffixes option for specifying strings to append to overlapping names in the left and right DataFrame objects.

Merging on index
• In some cases, the merge key(s) in a DataFrame will be found in its index. In that case, pass left_index=True or right_index=True (or both) to indicate that the index should be used as the merge key.

Join method
• DataFrame has a convenient join instance method for merging by index.

Concatenating Along an Axis
• pd.concat([s1, s2, s3])
• pd.concat([s1, s2, s3], axis=1)
• pd.concat([s1, s4], axis=1, join='inner')
• pd.concat([s1, s4], axis=1, join_axes=[['a', 'c', 'b', 'e']])
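A runnable sketch of these operations on invented data. Note that sum(level=...) and the join_axes argument shown above were removed in newer pandas releases; the sketch uses groupby(level=...) and reindex as their modern equivalents:

    import numpy as np
    import pandas as pd

    # --- Hierarchical indexing ---
    data = pd.Series(np.arange(6.0),
                     index=pd.MultiIndex.from_product(
                         [['a', 'b', 'c'], [1, 2]],
                         names=['key1', 'key2']))
    wide = data.unstack()              # inner level becomes the columns
    wide.stack()                       # back to the stacked form
    data.swaplevel('key1', 'key2').sort_index()
    data.groupby(level='key2').sum()   # level-wise summary statistics

    # --- Merge ---
    df1 = pd.DataFrame({'key': ['a', 'b', 'b'], 'lval': [1, 2, 3]})
    df2 = pd.DataFrame({'key': ['a', 'b', 'c'], 'rval': [4, 5, 6]})
    pd.merge(df1, df2, on='key')               # inner join by default
    pd.merge(df1, df2, on='key', how='outer')  # union of the keys
    pd.merge(df1, df2, left_index=True, right_index=True,
             suffixes=('_l', '_r'))            # disambiguate overlapping names

    # --- Concat ---
    s1 = pd.Series([0, 1], index=['a', 'b'])
    s4 = pd.Series([0, 5, 6], index=['a', 'b', 'e'])
    pd.concat([s1, s4], axis=1, join='inner')  # keep only shared labels
    pd.concat([s1, s4], axis=1).reindex(['a', 'c', 'b', 'e'])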
Combining Data with Overlap
• np.where(pd.isnull(a), b, a)
• Series has a combine_first method, which performs the equivalent of this operation along with pandas's usual data alignment logic.
• With DataFrames, combine_first does the same thing column by column, so you can think of it as "patching" missing data in the calling object with data from the object you pass.
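A minimal sketch of the difference between the positional np.where approach and the index-aligned combine_first, on invented values:

    import numpy as np
    import pandas as pd

    a = pd.Series([np.nan, 2.5, 0.0, 3.5, np.nan],
                  index=['f', 'e', 'd', 'c', 'b'])
    b = pd.Series([0.0, np.nan, 2.0, np.nan, 4.0],
                  index=['f', 'e', 'd', 'c', 'b'])

    np.where(pd.isnull(a), b, a)  # positional: ignores index alignment
    b.combine_first(a)            # aligned: fills b's holes from a

    # With DataFrames, combine_first patches missing values column by column.
    df1 = pd.DataFrame({'x': [1.0, np.nan], 'y': [np.nan, 4.0]})
    df2 = pd.DataFrame({'x': [5.0, 6.0], 'y': [7.0, 8.0]})
    df1.combine_first(df2)        # NaNs in df1 filled from df2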