Faster pandas

English | MP4 | AVC 1280×720 | AAC 48KHz 2ch | 1h 24m | 280 MB

Data scientists often favor pandas because it lets them work efficiently with large amounts of data, a useful quality as data sets keep growing. In this course, instructor Miki Tebeka shows you how to improve the speed and efficiency of your pandas code. First, Miki explains why performance matters and how you can measure it with Python profilers. Then, the course teaches you how to use vectorization to manipulate data. The course also walks through some common mistakes and how to address them.
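As a rough illustration of the vectorization idea mentioned above (not taken from the course materials), the sketch below computes the same total twice: once by iterating over rows, once with whole-column operations. The column names, the data, and the helper names `total_loop` and `total_vectorized` are hypothetical, and the timings will vary by machine.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 10,000 item prices and quantities.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.uniform(1, 100, size=10_000),
    "quantity": rng.integers(1, 10, size=10_000),
})


def total_loop(frame):
    """Sum price * quantity one row at a time."""
    total = 0.0
    for _, row in frame.iterrows():
        total += row["price"] * row["quantity"]
    return total


def total_vectorized(frame):
    """Same calculation, applied to whole columns at once."""
    return (frame["price"] * frame["quantity"]).sum()


# Measure each approach once; timeit returns elapsed seconds.
loop_time = timeit.timeit(lambda: total_loop(df), number=1)
vec_time = timeit.timeit(lambda: total_vectorized(df), number=1)
print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")
```

On typical hardware the vectorized version should be much faster, but measuring on your own data is the only reliable check.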

Python and pandas have many high-performance built-in functions, and Miki covers how to use them. Pandas can use a lot of memory, so Miki offers tips for reducing it. The course demonstrates how to serialize data with SQL and HDF5. Then Miki goes over how to speed up your code with Numba and Cython. Alternative DataFrames can also speed up your code, and Miki steps through some options. The course closes with a few extra resources that you can check out.
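One common way to save memory in pandas, sketched here under assumed data rather than taken from the course, is to store a repetitive text column as a categorical. The city names and variable names are hypothetical; `memory_usage(deep=True)` reports bytes including the string contents.

```python
import pandas as pd

# Hypothetical column: 30,000 rows but only three distinct values.
as_object = pd.Series(["Boston", "Chicago", "Denver"] * 10_000, dtype=object)

# A categorical stores each distinct value once plus small integer codes.
as_category = as_object.astype("category")

object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f"object: {object_bytes:,} bytes  category: {category_bytes:,} bytes")
```

The saving depends on how few distinct values the column holds; a column of mostly unique strings gains little from this conversion.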

Table of Contents

1 pandas and performance
2 What you should know
3 Working with the files on GitHub
4 Why performance matters
5 Setting goals
6 Measuring performance
7 Profiling
8 Challenge: Identify bottleneck
9 Solution: Identify bottleneck
10 What is vectorization
11 Boolean indexing
12 Understanding ufuncs
13 Challenge: Selecting and manipulating data
14 Solution: Selecting and manipulating data
15 The limitations of appending
16 The limitations of object dtype
17 The limitations of row iteration
18 Understanding the isin function
19 Parsing time once
20 Challenge: Query a DataFrame
21 Solution: Query a DataFrame
22 Using built-in functions
23 Understanding eval and query
24 Understanding the join function
25 Challenge: Join and query
26 Solution: Join and query
27 Why memory is important
28 Measuring memory
29 Loading parts of data
30 Categorical data
31 Challenge: Reducing memory
32 Solution: Reducing memory
33 Various formats and why not CSV
34 Optimizing with SQL
35 Optimizing with HDF5
36 Challenge: Bike ride duration
37 Solution: Bike ride duration
38 What is Numba
39 Using Numba
40 What’s Cython
41 Writing Cython code
42 Compiling Cython
43 %%cython magic
44 Challenge: Cython speedup
45 Solution: Cython speedup
46 Overview of alternative DataFrames
47 Using Dask
48 Using Vaex
49 Challenge: Vaex vs. pandas
50 Solution: Vaex vs. pandas
51 Next steps