Modern Web Scraping with Python using Scrapy and Splash

English | MP4 | AVC 1280×720 | AAC 44KHz 2ch | 6 Hours | 2.63 GB

Become an expert in web scraping and web crawling using Python 3, Scrapy and Scrapy Splash

Web scraping has nowadays become one of the hottest topics. There are plenty of paid tools on the market, but they don’t show you how things are actually done, and as a consumer you will always be limited to their functionality.

In this course you won’t be a consumer anymore. I’ll teach you how to build your own scraping tool (spider) using Scrapy.

You will learn:

  • The fundamentals of Web Scraping
  • How to build a complete spider
  • The fundamentals of XPath
  • How to locate content/nodes from the DOM using XPath
  • How to store the data in JSON, CSV… and even in an external database (MongoDB)
  • How to write your own custom Pipeline
  • Fundamentals of Splash
  • How to scrape JavaScript websites using Scrapy Splash
  • The Crawling behavior
  • How to build a CrawlSpider
  • How to avoid getting banned while scraping websites
  • How to build a custom Middleware
  • Web Scraping best practices
  • How to scrape APIs
  • How to use Request Cookies
  • How to scrape infinite scroll websites
  • Host spiders on Heroku for free
  • Run spiders periodically with a custom script
  • Prevent storing duplicated data
  • Deploy Splash to Heroku
  • Write data to Excel files
  • Login to websites using FormRequest
  • Download Files & Images using Scrapy
  • Use Proxies with Scrapy Spider
  • Use Crawlera with Scrapy & Splash
  • Use Proxies with CrawlSpider
Table of Contents

Introduction – UPDATED –
1 Intro to Web Scraping & Scrapy
2 Setting up the Development Environment – Linux Users
3 Setting up the Development Environment – Windows Users
4 Hello World Scrapy
5 Frequently Asked Questions (Common errors)
6 Where to find all the code!

XPath Selectors
7 XPath Terminology
8 XPath Syntax
9 XPath Axes
10 XPath Predicates
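XPath predicates (lecture 10) filter nodes by attribute or position. A stdlib-only illustration (Python's xml.etree.ElementTree supports a useful subset of XPath; the markup here is made up for the example):

```python
import xml.etree.ElementTree as ET

# Toy document standing in for a scraped page
doc = ET.fromstring(
    "<ul>"
    "<li class='quote'>First</li>"
    "<li class='quote'>Second</li>"
    "<li>Third</li>"
    "</ul>"
)

# Attribute predicate: only <li> elements with class="quote"
quotes = [li.text for li in doc.findall(".//li[@class='quote']")]

# Positional predicate: the second <li> child (XPath positions are 1-based)
second = doc.find(".//li[2]")
```

In Scrapy the same expressions are passed to `response.xpath(...)`, which supports the full XPath 1.0 syntax covered in this section.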

Build a Complete Spider from A to Z
11 Locating Quotes, Authors and Tags
12 UPDATE: Author is not loading
13 Scrapy XPath Selectors
14 Pagination
15 Feed Exporters
16 Items and Item Loader
17 Input and Output Processors
18 Output isn’t showing correctly
19 Final Touches

Writing a Custom Pipeline – Store the Data in MongoDB
20 MongoDB Terminology
21 Setting Up MongoDB on Linux
22 Setting Up MongoDB on Windows
23 Writing the MongoDB Pipeline (UPDATED)

Scraping JavaScript Websites using Splash
24 Why use Splash
25 Setting up Splash on Linux
26 Setting up Splash on Windows 10 Home/Pro edition (NEW)
27 Writing Lua Scripts
28 Splash Request
29 Dealing with Pagination
30 Learn Lua in 15 minutes

The Crawl Spider
31 The Crawling Behaviour
32 Outside US
33 The Crawl Spider Simplified
34 Setting up the Rules
35 Challenge Solution (Building the Parse Method)

Avoid Getting Banned
36 Techniques Used by Website Administrators to Prevent Web Scraping
37 Web Crawling/Scraping Best Practices
38 Custom Middleware (User Agent Rotator Middleware)

Scraping APIs (REST API) – Infinite Scroll Pagination
39 Introduction
40 Airbnb code UPDATE (Request Cookies) NEW
41 Another way to scrape Airbnb restaurant detail page
42 REST API
43 Working With JSON Objects
44 The Airbnb JSON Object
45 Hidden XHR
46 Airbnb Spider
47 IMPORTANT NOTE
48 Infinite Scroll Pagination
49 Spider Arguments
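The section above revolves around one idea: instead of rendering an infinite-scroll page, request the JSON endpoint the page calls behind the scenes (the hidden XHR of lecture 45) and page through it with an offset parameter. A stdlib-only sketch with a canned payload (the payload shape and field names are made up for illustration, not Airbnb's actual API):

```python
import json

# A canned response like one captured from the browser's Network tab;
# real endpoints return a page of results plus some pagination hint.
payload = json.loads("""{
  "results": [
    {"id": 1, "name": "Listing A"},
    {"id": 2, "name": "Listing B"}
  ],
  "has_next_page": true,
  "next_offset": 2
}""")

items = [r["name"] for r in payload["results"]]


def next_url(base, payload):
    """Infinite scroll = keep requesting with the new offset until exhausted."""
    if payload["has_next_page"]:
        return f"{base}?offset={payload['next_offset']}"
    return None
```

In a Scrapy spider the loop becomes: parse the JSON in `parse()`, yield the items, then yield a new `Request` to `next_url(...)` until `has_next_page` is false.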

Hosting spiders for free – Exclusive –
50 Deploy spiders to ScrapingHub cloud
51 Deploy spiders locally
52 Deploy spiders to Heroku
53 The mLab add-on
54 Execute spiders periodically
55 Prevent storing duplicated data
56 Deploy Splash to Heroku
57 Project source code
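The dedup idea from lecture 55 — an item pipeline that remembers what it has already stored — can be sketched without any Scrapy machinery. In a real project `process_item` would raise `scrapy.exceptions.DropItem` instead of returning None, and the `"text"` key is illustrative:

```python
class DuplicatesPipeline:
    """Drops items whose 'text' field was already seen in this crawl."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        key = item["text"]
        if key in self.seen:
            # Real Scrapy code: raise scrapy.exceptions.DropItem(...)
            return None
        self.seen.add(key)
        return item


pipeline = DuplicatesPipeline()
first = pipeline.process_item({"text": "a quote"}, spider=None)
dup = pipeline.process_item({"text": "a quote"}, spider=None)
```

Enable such a pipeline via the `ITEM_PIPELINES` setting so every scraped item passes through it before being stored.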

Writing data to Excel files
58 Introduction to XlsxWriter
59 Setting the Item class
60 Writing data to Excel files (Using a custom Pipeline)
61 Project source code
62 Challenge for those who are adventurous

Scrapy POST requests
63 Login to websites using FormRequest
64 XML HTTP POST Requests
65 Project source code
66 Code UPDATE XHR repeated data (Assignment)

The Media Pipeline
67 Media Pipelines
68 The Images Pipeline
69 Extending The Images Pipeline (Store images with custom names)
70 Files Pipeline (Article)
71 Challenge (Files Pipeline)
72 Project source code

Paid and Free Proxies with Scrapy/Splash
73 Using Crawlera with Scrapy
74 Using Crawlera with Splash
75 Using Heroku as a Proxy (FREE)
76 Using FREE Proxies with the CrawlSpider
77 Challenge
78 Project source code

BONUS
79 Files Pipeline
80 Crawlera GIFT
81 Bonus Lecture