A Web Scraping Project using Scrapy

Tanja Adžić
18 min read · Dec 17, 2023


What is Web Scraping?

First things first. Web scraping, also known as web data extraction, is a way to collect information from websites. This can be done with special software that accesses the internet like a web browser, or by automating the process with a bot or a so-called web crawler. It is basically a method of copying specific data from the web and saving it in a local database or spreadsheet for future use.

Photo by Nicolas Picard on Unsplash

There are different types of data extraction techniques; I'll name a few:

  1. Static Web Scraping: This is the most basic form of web scraping, where data is extracted from web pages that are primarily composed of HTML and CSS. It's used for collecting data from websites with fixed (as its name says, static), unchanging content.
  2. Dynamic Web Scraping: Dynamic web scraping involves the use of tools or scripts that can interact with the page and extract data from elements that load after the initial page load (for example: pages that use JavaScript to load content dynamically).
  3. API-based Web Scraping: Many websites provide Application Programming Interfaces (APIs) that allow developers to access their data in a structured and organized manner. API-based web scraping is a more reliable and efficient way to gather data, as it's the intended way to access information. Probably the most enjoyable way to retrieve data.
  4. Screen Scraping: In cases where a website doesn’t provide data in a machine-readable format, screen scraping involves capturing the visual information displayed on a website, often through capturing screenshots or reading text from the screen.
  5. Social Media Scraping: Collecting data from social media platforms, like Twitter (I mean ‘X’) or Facebook, is a specialized form of web scraping. It’s often used for sentiment analysis, trend tracking, or marketing research.
  6. Image Scraping: This is the process of extracting data from images on the web, such as text from images, logos, or other graphical elements.

Collecting information from websites? Why?

I did kind of partially answer my own question in the previous section; however, I do want to go through the most common use cases of web scraping:

  • Data collection for the purpose of market research, stock market analysis, competitor analysis, generating leads for sales and marketing.
  • Price comparison that is used by many e-commerce businesses to monitor competitors' prices and make informed decisions about their business.
  • Social Media analysis for sentiment analysis, trend tracking, or monitoring public opinion on specific topics.
  • Content aggregation that is used to display information from multiple sources in one place. It is usually used by news aggregators, job search engines, and real estate listing websites.
  • Search Engine Optimization (SEO) for analyzing website rankings, tracking keyword performance, and identifying areas for improvement in search engine results.
  • Weather and Environmental Monitoring for collecting data from various sources for weather forecasting and climate research.

This sounds super illegal.

Technically, it is and it isn’t. Determining whether your web scraping activities are legal can be complex and may depend on various factors, including your location, the website you are scraping, and the nature of the data you are collecting. I recommend you go through these steps first in order to determine the legality of data extraction:

  • Website Terms of Service: Start by reading the terms of service, terms of use, or website’s “robots.txt” file. Many websites explicitly state whether web scraping is allowed or prohibited. Following these terms is essential for legal scraping.
  • Copyright and Intellectual Property Rights: If you’re scraping content that is protected by copyright, you may need permission from the content owner. Publicly available data is generally easier to scrape legally.
  • Data Privacy Laws: When collecting personal information, consider whether you are compliant with data protection regulations like General Data Protection Regulation (GDPR) in Europe.
  • Consent: If you are scraping user-generated content or personal data, consider obtaining explicit consent from the data subjects or providing clear notifications about data collection and usage.
  • Consult Legal Advice: When in doubt, make sure you consult with a legal counsel who specializes in technology and internet law. They can provide advice specific to your situation and jurisdiction.
  • Rate Limiting: This, technically, does not always have to be spelled out in a website's ToS, but it is part of ethical scraping. Don't overload a website's servers with requests; use responsible scraping techniques, including rate limiting and throttling, to avoid causing server stress or service disruption. A minimal robots.txt check is sketched right after this list.
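Before any of that, it costs nothing to check a site's robots.txt programmatically. Here is a minimal sketch using Python's standard library; the URL and the "*" user agent are just examples:

from urllib import robotparser

# point the parser at the site's robots.txt and read it
rp = robotparser.RobotFileParser()
rp.set_url("https://books.toscrape.com/robots.txt")
rp.read()

# is a generic crawler allowed to fetch this page?
print(rp.can_fetch("*", "https://books.toscrape.com/catalogue/page-2.html"))

# does the site declare a crawl delay? (None if it doesn't)
print(rp.crawl_delay("*"))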

Most popular Web Scraping Tools

Here are some of the most popular and common tools used for web scraping:

  1. Beautiful Soup: Beautiful Soup is a widely used Python library for parsing and navigating HTML and XML documents. It’s known for its simplicity and ease of use in web scraping tasks.
  2. Scrapy: Scrapy is a powerful Python framework for web scraping. It provides a more structured and efficient approach to scraping websites, making it popular for complex scraping tasks and large-scale projects.
  3. Selenium: Selenium is a versatile tool for web testing and automation that is often used for web scraping tasks that require interaction with dynamic websites. It can control web browsers like Chrome and Firefox to automate tasks.
  4. Requests and Custom Scripts: Many web scrapers prefer to write custom scripts using Python's Requests library, along with parsing libraries like lxml (which supports XPath) or Beautiful Soup. This approach offers flexibility and control over the scraping process; a minimal sketch follows this list.
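To make the last point concrete, here is a minimal custom-script sketch with Requests and Beautiful Soup. It is only for comparison with what follows; the selectors are assumptions that happen to match the Books to Scrape site used later in this post:

import requests
from bs4 import BeautifulSoup

# download the listing page and parse its HTML
response = requests.get("https://books.toscrape.com/")
soup = BeautifulSoup(response.text, "html.parser")

# print the title and price of every book on the page
for article in soup.select("article.product_pod"):
    title = article.select_one("h3 a")["title"]
    price = article.select_one("p.price_color").get_text()
    print(title, price)

This works fine for one page, but scheduling requests, retrying failures, and exporting data are all on you, which is exactly the boilerplate Scrapy takes care of.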

For this project, I chose Scrapy as my tool. Follow along to learn more about how it is used.

What is Scrapy?

“An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.” — scrapy.org

Scrapy gives you the capability to construct highly adaptable web scrapers, called Spiders, that extract the HTML of webpages, analyze and manipulate the data, and save it in the preferred format and location.

Here is a code example of a Spider from the official website:

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://www.zyte.com/blog/']

    def parse(self, response):
        for title in response.css('.oxy-post-title'):
            yield {'title': title.css('::text').get()}

        for next_page in response.css('a.next'):
            yield response.follow(next_page, self.parse)
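If you save this Spider in a file of its own, you can run it without creating a full project by using the scrapy runspider command (the file and output names here are just examples):

scrapy runspider blogspider.py -O blog_titles.json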

Scrapy's many advantages include richer functionality and the ability to scrape at large scale, as well as:

  • CSS Selector & XPath Expressions Parsing
  • Data formatting (CSV, JSON, XML) and Storage (FTP, S3, local filesystem)
  • Robust Encoding Support
  • Concurrency Management
  • Automatic Retries
  • Cookies and Session Handling
  • Crawl Spiders & In-Built Pagination Support

Feel free to check out Scrapy's Documentation and read more about its wonderful capabilities.

Crash Course on Scrapy

Scrapy Project

Working with Scrapy means we need to understand the base of our scraper, which in this case is the Scrapy project. A Scrapy project is an organized structure that contains all the code, configurations, and components we need to actually perform web scraping (it sounds a lot like a Scrapy template). It might sound a bit complicated at first, b… there is no 'but', it probably is complicated, so I'll do my best to explain it here.

Creating a Scrapy project is fairly easy from your terminal. I will include the previous steps I took before creating the project, such as creating a virtual environment in which I installed Scrapy.

python -m venv venv
source venv/bin/activate
pip install scrapy
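(If you are on Windows, the activation step is venv\Scripts\activate instead of source venv/bin/activate.)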

There are a couple of fun little websites that are meant for scraping practice, such as Books to Scrape and Quotes to Scrape. As I will be scraping books, I'll create a Scrapy project called bookscraper:

scrapy startproject bookscraper

By running this command in your terminal, you will get something like this:

Scrapy Project Structure
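If the screenshot is hard to read, the generated structure looks roughly like this:

bookscraper/
    scrapy.cfg            # deploy configuration file
    bookscraper/          # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # folder where our Spiders will live
            __init__.py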

Let's take a look at what this all means and what the building blocks of a Scrapy project are: Spiders, Items, Middlewares, Pipelines, and Settings.

  1. Spiders (spiders/): Directory containing spider scripts. Each spider script defines how to navigate a website, what data to scrape, and how to extract and process the information.
  2. Items (items.py): Defines the data structure (items) that will be scraped. It acts as a data schema for the scraped data.
  3. Middleware (middlewares.py): Contains custom middleware components that can process requests and responses globally. It is useful when you want to modify how requests are made.
  4. Pipelines (pipelines.py): Defines the pipelines that process and store the scraped items. Pipelines handle tasks such as cleaning, validation, and storing data in files or databases (CSV, JSON, SQL).
  5. Settings (settings.py): Configures various settings for the Scrapy project, including the user-agent, download delay, and other customization options (a small example follows this list).
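To give you a feel for settings.py, here is a short excerpt. These are real Scrapy settings, but the values are purely illustrative and not what the generated file ships with:

# settings.py (excerpt)
BOT_NAME = "bookscraper"

USER_AGENT = "bookscraper (+https://example.com/contact)"   # identify your crawler politely
ROBOTSTXT_OBEY = True                   # respect robots.txt
DOWNLOAD_DELAY = 1                      # seconds to wait between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4      # keep concurrency modest
AUTOTHROTTLE_ENABLED = True             # let Scrapy adapt crawl speed to server load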

These 5 components will all be included in this project, not just the Spider, so stick around for an exciting tutorial (just not as exciting as the Hunger Games movies, wow).

The Spider

To create my Spider for the book scraping project, I will use a handy command that creates a generic Spider for my books.toscrape website:

(Note: make sure you are in the spiders/ directory before running this command from your terminal)

scrapy genspider bookspider books.toscrape.com

You should get a confirmation that the bookspider has been created:

Spider creation confirmation

This is our starting point:

import scrapy

class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        pass

We have the BookspiderSpider class which contains the basic template of our Spider:

  • name — this is the class attribute which will be used when running our Spider
  • allowed_domains — a class attribute that keeps the Spider from wandering off and scraping the entire internet; crawling is restricted to the domain(s) we provide
  • start_urls — this is the list of URLs the Spider starts crawling from
  • parse function — once the response has been received from the website, the parse function runs. It will contain the logic for extracting the elements we want from the page.

Scrapy Shell — Finding which elements we need from the page

Why is Scrapy Shell important? For one, it lets us run our scraping code without running the Spider. Furthermore, it allows us to debug any potential errors that might be raised during the process. In a nutshell, it is a very handy testing tool: just use it as a testing ground for all your code before running the Spider. Let's try it out:

scrapy shell

This command activates the shell, and you should see output that gives you an overview of the available Scrapy objects:

Scrapy Shell Command output

The next step should be to actually fetch our page in the Scrapy Shell, which is then saved in the response variable:

fetch('https://books.toscrape.com/')

Now for the challenging part: finding the needed elements in the page's HTML. By looking at a single product, we see there are several elements to it: the name of the book, the number of stars (rating), the book price, and whether it is in stock. There is also a URL linking to the book's page.

example product on books.toscrape.com

I will scrape the basic information about the book: name, price, and url. If you are unsure how to find the tags that correspond to the information you need from the page, I suggest you briefly go over it here.

Using Scrapy Shell, I will test the tag for a single book item, which in this case looks like this: <article class="product_pod"> and save it to a variable:

books = response.css("article.product_pod")

By testing the len() of the books variable we get 20, and it matches the number of books on the first page.

In order to test other tags associated with this item, I’ll save the first book item to a variable book and proceed with testing its tags:

#fetching the first book item and saving it to a variable
book = books[0]

#fetching the name tag of the book
book.css('h3 a::text').get()

#fetching the price tag of the book
book.css('div.product_price .price_color::text').get()

#fetching the url tag of the book
book.css('h3 a').attrib['href']

The responses should look something like this:

Scrapy Shell elements responses

Having identified the key elements on the webpage, we can now add them to our Spider:

import scrapy

class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        books = response.css("article.product_pod")

        #looping through the books variable
        for book in books:
            yield {
                'name': book.css('h3 a::text').get(),
                'price': book.css('div.product_price .price_color::text').get(),
                'url': book.css('h3 a').attrib['href']
            }

Now that we have our Spider ready, we should be able to run it by using the following command from our bookscraper folder (remember to exit the Scrapy Shell by entering exit()):

scrapy crawl bookspider

The output shows 'item_scraped_count': 20, which tells us that our scraping was a success. However, I did want to have the data in .csv, and getting it into a file is quite easy with the following command:

scrapy crawl bookspider -O scraped_test_data.csv

If you want the data in .json, just use the scraped_test_data.json extension instead (-O overwrites the output file on each run, while a lowercase -o appends to it).

Including multiple pages for crawling

By inspecting the 'next' element for the pages, we get the following tag:

response.css('li.next a ::attr(href)').get()
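On the first page this selector returns catalogue/page-2.html, but from the second page onward the relative link drops the catalogue/ prefix (for example, page-3.html). That is why the updated Spider below checks for 'catalogue/' before building the absolute URL.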

Now we need to update our Spider and include all the pages:

import scrapy

class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        books = response.css("article.product_pod")

        #looping through the books variable
        for book in books:
            yield {
                'name': book.css('h3 a::text').get(),
                'price': book.css('div.product_price .price_color::text').get(),
                'url': book.css('h3 a').attrib['href']
            }

        #declaring the next page variable
        next_page = response.css('li.next a ::attr(href)').get()

        #only follow if there is a next page
        if next_page is not None:

            #build the absolute url, adding 'catalogue/' when the relative link lacks it
            if 'catalogue/' in next_page:
                next_page_url = 'https://books.toscrape.com/' + next_page
            else:
                next_page_url = 'https://books.toscrape.com/catalogue/' + next_page

            yield response.follow(next_page_url, callback=self.parse)
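If you run scrapy crawl bookspider again now, the Spider should follow all 50 catalogue pages, and the log should report 'item_scraped_count': 1000, since Books to Scrape lists 1,000 books in total.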

Scraping additional product details

To include additional information for every book, I will need to scrape it from the book's URL we already obtained. If you go to a random book page like this one, you'll see there is a lot more information to be scraped: stock status, UPC, product type, price before and after tax, number of reviews, and a lengthy paragraph that contains the product description. The next step is to update our Spider to include this information.

Seeing as we will now be collecting all of our data from the book page rather than the book overview page (and that includes the data we scraped previously), we need to adjust the code to follow each book URL:

import scrapy

class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        books = response.css("article.product_pod")

        #looping through the books variable
        for book in books:

            #fetching the individual book url (same as previous url tag)
            relative_url = book.css('h3 a').attrib['href']

            #build the absolute book url, adding 'catalogue/' when the relative link lacks it
            if 'catalogue/' in relative_url:
                book_url = 'https://books.toscrape.com/' + relative_url
            else:
                book_url = 'https://books.toscrape.com/catalogue/' + relative_url

            yield scrapy.Request(book_url, callback=self.parse_book_page)

        #declaring the next page variable
        next_page = response.css('li.next a ::attr(href)').get()

        #only follow if there is a next page
        if next_page is not None:
            if 'catalogue/' in next_page:
                next_page_url = 'https://books.toscrape.com/' + next_page
            else:
                next_page_url = 'https://books.toscrape.com/catalogue/' + next_page
            yield response.follow(next_page_url, callback=self.parse)

The next step is to determine elements and their tags for all product information we want to scrape, that is:

  • Book URL
  • Title
  • UPC
  • Product Type
  • Price
  • Price Excluding Tax
  • Price Including Tax
  • Tax
  • Availability
  • Number of Reviews
  • Stars
  • Category
  • Description

It is best to start with testing the elements using Scrapy Shell:

#fetching a random book url
fetch('https://books.toscrape.com/catalogue/in-her-wake_980/index.html')

#inspecting the entire product page
response.css('.product_page')

#fetching the title of the book
response.css('.product_main h1::text').get()

#fetching the price of the book
response.css('.product_main p.price_color ::text').get()

#fetching the product category (using xpath)
response.xpath("//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()").get()

#fetching the book description (using xpath)
response.xpath("//div[@id='product_description']/following-sibling::p/text()").get()

#fetching other product information as a table row
table_rows = response.css('table tr')

#fetching the second row (product type) from the product info table
table_rows[1].css('td::text').get()

#fetching the number of stars
response.css('p.star-rating').attrib['class']
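Note that the stars selector returns the whole class string (something like 'star-rating Three') rather than a number; we will convert that into an integer later, in the pipeline.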

Now that we have inspected all the elements we need from the book page, we can start adding them to a new parse_book_page function in our BookspiderSpider class (here, a book variable is created that holds the main product page selector):

def parse_book_page(self, response):

    #defining the book variable for the main product page class
    book = response.css("div.product_main")[0]

    #defining the table row for product information tag
    table_rows = response.css("table tr")

    yield {
        'url': response.url,
        'title': book.css("h1 ::text").get(),
        'upc': table_rows[0].css("td ::text").get(),
        'product_type': table_rows[1].css("td ::text").get(),
        'price_excl_tax': table_rows[2].css("td ::text").get(),
        'price_incl_tax': table_rows[3].css("td ::text").get(),
        'tax': table_rows[4].css("td ::text").get(),
        'availability': table_rows[5].css("td ::text").get(),
        'num_reviews': table_rows[6].css("td ::text").get(),
        'stars': book.css("p.star-rating").attrib['class'],
        'category': book.xpath("//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()").get(),
        'description': book.xpath("//div[@id='product_description']/following-sibling::p/text()").get(),
        'price': book.css('p.price_color ::text').get(),
    }

We can run our Spider again, this time saving the output to .json for a better overview of our information:

scrapy crawl bookspider -O book_data.json

Works like a charm. Let’s see how we can clean the data next.

Cleaning Data — Items & Pipeline

Scrapy Items

Items are data containers used to define the structure of the data we want to scrape from websites. They are written as simple Python classes, and each item class represents a specific type of data we want to extract. When a spider extracts data from a website, it populates instances of these item classes with the scraped data. Scrapy supports multiple types of items, and the best option is to define our own item objects in the items.py file.

The default template already exists in the items.py file:

class BookscraperItem(scrapy.Item):
    #define the fields for your item here like:
    name = scrapy.Field()
    pass

And here is our BookItem class with all items we extract from the website:

class BookItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    upc = scrapy.Field()
    product_type = scrapy.Field()
    price_excl_tax = scrapy.Field()
    price_incl_tax = scrapy.Field()
    tax = scrapy.Field()
    availability = scrapy.Field()
    num_reviews = scrapy.Field()
    stars = scrapy.Field()
    category = scrapy.Field()
    description = scrapy.Field()
    price = scrapy.Field()

The next step is to import BookItem into our Spider, which will allow it to store the scraped data in the item and yield the book_item. Here is my Spider after this change:

import scrapy
from bookscraper.items import BookItem

class BookspiderSpider(scrapy.Spider):
    #basic config
    name = "bookspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        """
        Parse function for the main page.

        Parameters:
        - response: The response object containing the HTML of the main page.

        Processing Steps:
        - Extracts the URLs of individual books from the main page.
        - For each book URL, constructs the absolute URL.
        - Initiates a new request for each book URL, calling the parse_book_page function for detailed scraping.
        - Determines the URL of the next page and follows it if there are more pages.

        Returns:
        - Yields requests for individual book pages.
        """

        books = response.css("article.product_pod")

        #looping through the books variable
        for book in books:

            #fetching the individual book url
            relative_url = book.css('h3 a').attrib['href']

            #build the absolute book url, adding 'catalogue/' when the relative link lacks it
            if 'catalogue/' in relative_url:
                book_url = 'https://books.toscrape.com/' + relative_url
            else:
                book_url = 'https://books.toscrape.com/catalogue/' + relative_url

            yield scrapy.Request(book_url, callback=self.parse_book_page)

        #declaring the next page variable
        next_page = response.css('li.next a ::attr(href)').get()

        #only follow if there is a next page
        if next_page is not None:
            if 'catalogue/' in next_page:
                next_page_url = 'https://books.toscrape.com/' + next_page
            else:
                next_page_url = 'https://books.toscrape.com/catalogue/' + next_page
            yield response.follow(next_page_url, callback=self.parse)

    def parse_book_page(self, response):
        """
        Parse function for individual book pages.

        Parameters:
        - response: The response object containing the HTML of an individual book page.

        Processing Steps:
        - Extracts various details about the book, including title, URL, UPC, product type, prices, tax, availability, number of reviews, star rating, category, description, and price.
        - Utilizes CSS and XPath selectors to locate and extract specific elements from the HTML structure.

        Returns:
        - Yields a BookItem containing the extracted information for each book.
        """

        #defining the book variable for the main product page class
        book = response.css("div.product_main")[0]

        #defining the table row for product information tag
        table_rows = response.css("table tr")

        #instantiating the book_item
        book_item = BookItem()

        #items
        book_item['url'] = response.url
        book_item['title'] = book.css("h1 ::text").get()
        book_item['upc'] = table_rows[0].css("td ::text").get()
        book_item['product_type'] = table_rows[1].css("td ::text").get()
        book_item['price_excl_tax'] = table_rows[2].css("td ::text").get()
        book_item['price_incl_tax'] = table_rows[3].css("td ::text").get()
        book_item['tax'] = table_rows[4].css("td ::text").get()
        book_item['availability'] = table_rows[5].css("td ::text").get()
        book_item['num_reviews'] = table_rows[6].css("td ::text").get()
        book_item['stars'] = book.css("p.star-rating").attrib['class']
        book_item['category'] = book.xpath("//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()").get()
        book_item['description'] = book.xpath("//div[@id='product_description']/following-sibling::p/text()").get()
        book_item['price'] = book.css('p.price_color ::text').get()

        yield book_item

There are a lot of advantages of using Scrapy Items:

  1. Structured Data Representation: Items allow you to define a clear and structured representation of the data you want to scrape. Each item class corresponds to a specific type of data.
  2. Consistency: By using the same item classes, you ensure that the structure of your scraped data remains uniform, even if you’re extracting different types of information from different websites.
  3. Built-in Validation: If you attempt to assign a field that was not declared in the item class, Scrapy will raise an error. This helps catch potential issues early in development (see the small example after this list).
  4. Documentation: By reviewing the item classes, anyone can quickly understand what types of data are scraped and what fields are expected. Also the code is readable and clearly defined.
  5. Easy Integration with Pipelines: Items integrate with Scrapy Pipelines, allowing you to perform additional processing tasks on the scraped data, such as cleaning and storage.
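To make the validation point concrete, here is a tiny sketch; the publisher field is deliberately made up and not declared in BookItem:

from bookscraper.items import BookItem

item = BookItem()
item['title'] = "A Light in the Attic"   # fine: 'title' is a declared field
item['publisher'] = "Penguin"            # raises KeyError, since 'publisher' is not a declared field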

Scrapy Pipelines

Scrapy Pipelines are used as data processors where all of our data goes through. They provide a way to define a series of actions that should be performed on each item as it passes through the system. Once a spider yields an item, it goes through the configured pipelines in the order specified, allowing for additional processing, validation, and storage of the scraped data.

Use cases of Scrapy Pipelines include:

  • Data Cleaning and Validation: Cleaning and validating scraped data, so it adheres to the desired structure and format.
  • Removing Duplicates: Identifying and eliminating duplicate entries from the scraped data to maintain data integrity (a small example follows this list).
  • Storing Data in Databases: Saving scraped data to databases, such as SQLite, MySQL, or MongoDB, for efficient storage and retrieval.
  • Exporting to Various Formats: Exporting data to different file formats like CSV, JSON, or XML for further analysis or integration with other systems.
  • Connecting to Data Warehouses: Integrating with data warehouses or cloud-based storage solutions to store and analyze large volumes of scraped data.
  • Enforcing Rate Limits: Implementing rate-limiting mechanisms to control the frequency of requests and avoid overloading the target website.
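As a taste of the deduplication use case, here is a hypothetical pipeline (not used in this project) that drops any book whose UPC has already been seen:

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    def __init__(self):
        self.upcs_seen = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        upc = adapter.get('upc')
        if upc in self.upcs_seen:
            # discard the item; Scrapy logs the reason and moves on
            raise DropItem(f"Duplicate book found: {upc}")
        self.upcs_seen.add(upc)
        return item

It would be enabled in ITEM_PIPELINES just like the cleaning pipeline we are about to write, with its own priority number.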

This is the default code that comes in the pipelines.py file:

from itemadapter import ItemAdapter

class BookscraperPipeline:
    def process_item(self, item, spider):
        return item

There are a couple of data preprocessing steps to be added to the pipeline, in order to standardize and clean our data:

  • remove whitespace from strings
  • convert strings to lowercase (category and product type)
  • convert all prices to floats and remove the currency sign
  • extract the number of items in stock and convert it to an integer
  • convert the number of reviews to an integer
  • convert stars to an integer

Here is the full Pipeline code:

from itemadapter import ItemAdapter

class BookscraperPipeline:
    def process_item(self, item, spider):

        adapter = ItemAdapter(item)

        #remove all whitespaces from strings
        field_names = adapter.field_names()
        for field_name in field_names:
            if field_name != 'description':
                value = adapter.get(field_name)
                adapter[field_name] = value.strip()

        #converting strings to lowercase (category and product type)
        lowercase_keys = ['category', 'product_type']
        for lowercase_key in lowercase_keys:
            value = adapter.get(lowercase_key)
            adapter[lowercase_key] = value.lower()

        #convert all prices to floats and remove the currency sign
        price_keys = ['price', 'price_excl_tax', 'price_incl_tax', 'tax']
        for price_key in price_keys:
            value = adapter.get(price_key)
            value = value.replace('£', '')
            adapter[price_key] = float(value)

        #extract the number of items in stock and convert it to an integer
        availability_string = adapter.get('availability')
        split_string_array = availability_string.split('(')
        if len(split_string_array) < 2:
            adapter['availability'] = 0
        else:
            availability_array = split_string_array[1].split(' ')
            adapter['availability'] = int(availability_array[0])

        #convert number of reviews to an integer
        num_reviews_string = adapter.get('num_reviews')
        adapter['num_reviews'] = int(num_reviews_string)

        #convert stars to an integer
        stars_string = adapter.get('stars')
        split_stars_array = stars_string.split(' ')
        stars_text_value = split_stars_array[1].lower()
        if stars_text_value == "zero":
            adapter['stars'] = 0
        elif stars_text_value == "one":
            adapter['stars'] = 1
        elif stars_text_value == "two":
            adapter['stars'] = 2
        elif stars_text_value == "three":
            adapter['stars'] = 3
        elif stars_text_value == "four":
            adapter['stars'] = 4
        elif stars_text_value == "five":
            adapter['stars'] = 5

        return item

In order for the Pipeline to be active, we need to enable it in the settings.py file (mine is around line 60):

ITEM_PIPELINES = {
    'bookscraper.pipelines.BookscraperPipeline': 300,
}
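The number (300 here) controls the order in which pipelines run when more than one is enabled: Scrapy executes them from the lowest number to the highest, and values are conventionally kept in the 0 to 1000 range.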

I saved the data again as a .json, and here is the first item:

{"url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", 
"title": "A Light in the Attic",
"upc": "a897fe39b1053632",
"product_type": "books",
"price_excl_tax": 51.77,
"price_incl_tax": 51.77,
"tax": 0.0,
"availability": 22,
"num_reviews": 0,
"stars": 3,
"category": "poetry",
"description": "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more",
"price": 51.77}

Make sure you check an item or two to see if there are any irregularities, or if you want a specific key to be represented in another way. Mine looks pretty good, and it contains everything I wanted to scrape from this page.

Now what?

The topics covered here are pretty basic and will get you started with scraping in no time. However, Scrapy offers quite a lot more, such as saving your data to databases, running your spider in the cloud (this sounds like a song you'd hear in a kindergarten), using fake user-agents, and proxy manipulation. These are all fantastic options to have, and I do encourage you to play around with the Scrapy documentation and have a go at whatever seems interesting. Hopefully, I'll cover these topics soon.

If you found this blog helpful, you might want to check out some of my other blogs related to databases.

