Hamza Salem Blog - Developer Relations

Create a Web Scraper to Scrape Products from Souq

Web Scraping Tutorial

This tutorial walks you through a complete web scraping example in Python. We will use the requests, time, BeautifulSoup, json, and csv libraries to extract data from a website. The script scrapes product names, prices, and image URLs from paginated listing pages on Souq.com and stores the results in both CSV and JSON formats.

Prerequisites

Make sure you have Python installed on your system. You can download the latest version of Python from the official website: Python.org
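
You can confirm that Python is available from a terminal or command prompt:

python --version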

Installation

To run the code, you’ll need to install the required libraries. Open your terminal or command prompt and use the following commands:

pip install requests
pip install beautifulsoup4

Code

Copy the following code into a Python file, such as web_scraping_example.py:

import requests
import time
from bs4 import BeautifulSoup
import json
import csv

# Open the CSV file for writing (newline='' prevents blank rows on Windows)
filecsv = open('SouqDataapple.csv', 'w', encoding='utf8', newline='')

# Set the base URL to scrape; the page number is appended inside the loop
url = 'https://saudi.souq.com/sa-ar/apple/new/a-c/s/?section=2&page='

# Open the JSON file for writing and start the JSON array
file = open('SouqDataapple.json', 'w', encoding='utf8')
file.write('[\n')

# Define the CSV columns
csv_columns = ['name', 'price', 'img']

# Create the CSV writer once and write the header row a single time
writer = csv.DictWriter(filecsv, fieldnames=csv_columns)
writer.writeheader()

# Track whether a comma separator is needed before the next JSON object
first_item = True

# Loop through multiple pages
for page in range(1000):
    print('---', page, '---')
    # Send a GET request to the URL for this page
    r = requests.get(url + str(page))
    print(url + str(page))

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(r.content, "html.parser")

    # Find all the product items on the page
    products = soup.find_all('div', {'class': 'column column-block block-grid-large single-item'})

    # Loop through each product item
    for product in products:
        # Extract the name, price, and image elements
        name = product.find('h6', {'class': 'title itemTitle'})
        itemPrice = product.find('span', {'class': 'itemPrice'})
        img = product.find('img', {'class': 'img-size-medium'})

        # Skip items that are missing any of the three fields
        if name and itemPrice and img:
            data = {'name': name.get_text(strip=True),
                    'price': itemPrice.get_text(strip=True),
                    'img': img.get('src')}

            # Write the data to the CSV file
            writer.writerow(data)

            # Write the JSON object, adding a comma before every item after the first
            if not first_item:
                file.write(',\n')
            file.write(json.dumps(data, ensure_ascii=False))
            first_item = False

    # Pause briefly between pages to avoid overwhelming the server
    time.sleep(1)

# Finish writing the JSON file
file.write('\n]')

# Close the files
filecsv.close()
file.close()

Code Explanation

import requests
import time
from bs4 import BeautifulSoup
import json
import csv
  • These lines import the required libraries: requests for sending HTTP requests, time for pausing between requests, BeautifulSoup for parsing HTML, json for working with JSON data, and csv for writing CSV files.
filecsv = open('SouqDataapple.csv', 'w', encoding='utf8', newline='')
  • This line opens a CSV file named “SouqDataapple.csv” in write mode ('w') and assigns the file object to the variable filecsv. Passing newline='' is recommended for files used with the csv module, as it prevents extra blank rows on Windows.
url = 'https://saudi.souq.com/sa-ar/apple/new/a-c/s/?section=2&page='
  • This line assigns the base URL of the listing pages we want to scrape to the variable url. The URL points to a paginated product listing on Souq.com, and the page number is appended to the end of this string inside the loop.
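For reference, since range(1000) starts at 0, concatenating the page number produces request URLs such as:

https://saudi.souq.com/sa-ar/apple/new/a-c/s/?section=2&page=0
https://saudi.souq.com/sa-ar/apple/new/a-c/s/?section=2&page=1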
file = open('SouqDataapple.json', 'w', encoding='utf8')
file.write('[\n')
  • These lines open a JSON file named “SouqDataapple.json” in write mode and assign the file object to the variable file. Then, an opening square bracket ([) followed by a newline character (\n) is written to the file, which begins the JSON array.
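Once the run finishes, the file should contain one JSON object per product, shaped like this (the values here are placeholders, not real scraped data):

[
{"name": "Example item", "price": "2,999", "img": "https://example.com/item.jpg"},
{"name": "Another item", "price": "1,499", "img": "https://example.com/item2.jpg"}
]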
csv_columns = ['name', 'price', 'img']
writer = csv.DictWriter(filecsv, fieldnames=csv_columns)
writer.writeheader()
first_item = True
  • These lines define the CSV column names, create a CSV writer object using the csv.DictWriter class (which writes dictionaries to a CSV file), and write the header row once with writeheader(). The writer is created before the loop so the header appears only a single time. The first_item flag tracks whether a comma separator is needed before the next JSON object, which keeps the output valid JSON. An example of DictWriter on its own follows below.
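If DictWriter is new to you, here is a minimal standalone sketch (the file name demo.csv and the row values are just examples):

import csv

with open('demo.csv', 'w', encoding='utf8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'price', 'img'])
    writer.writeheader()  # writes the header row: name,price,img
    writer.writerow({'name': 'Example item', 'price': '2,999', 'img': 'https://example.com/item.jpg'})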
for page in range(1000):
    print('---', page, '---')
    r = requests.get(url + str(page))
    print(url + str(page))
    soup = BeautifulSoup(r.content, "html.parser")
    products = soup.find_all('div', {'class': 'column column-block block-grid-large single-item'})
  • These lines start a loop over the page numbers 0 through 999. For each iteration, the script sends a GET request to the base URL concatenated with the page number and stores the response in r, printing the page number and the full URL so you can track progress. The HTML content of the response is then parsed with BeautifulSoup, and products is assigned the list of all div elements whose class attribute matches the product-item class names.
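The bare requests.get call above has no error handling. If you want the script to fail gracefully, a sketch like the following (an optional variant, not part of the original code) adds a timeout and stops paging on the first HTTP error:

try:
    r = requests.get(url + str(page), timeout=10)
    r.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
except requests.RequestException as exc:
    print('Request failed:', exc)
    break  # stop paging on the first failure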
for product in products:
    name = product.find('h6', {'class': 'title itemTitle'})
    itemPrice = product.find('span', {'class': 'itemPrice'})
    img = product.find('img', {'class': 'img-size-medium'})

    if name and itemPrice and img:
        data = {'name': name.get_text(strip=True),
                'price': itemPrice.get_text(strip=True),
                'img': img.get('src')}
        writer.writerow(data)
        if not first_item:
            file.write(',\n')
        file.write(json.dumps(data, ensure_ascii=False))
        first_item = False

time.sleep(1)
  • These lines iterate through each product item found on the page. Within each item, find() locates the title, price, and image elements by tag and class name, and get_text(strip=True) returns an element's text with surrounding whitespace removed. If all three elements exist, the extracted values are collected in the data dictionary, written as a row to the CSV file with writer.writerow(), and serialized with json.dumps() (ensure_ascii=False keeps Arabic text readable rather than escaped). A comma and newline are written before every object except the first, so the JSON array stays valid. Finally, time.sleep(1), which sits at the end of the outer loop, pauses for one second before the next page is requested.
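For reference, the selectors above assume product markup roughly like the hypothetical fragment below; the live page's structure may differ, so inspect it in your browser's developer tools before relying on these class names:

from bs4 import BeautifulSoup

# Hypothetical markup matching the selectors used in the tutorial
sample = '''
<div class="column column-block block-grid-large single-item">
    <h6 class="title itemTitle">  Example Apple Item  </h6>
    <span class="itemPrice">2,999</span>
    <img class="img-size-medium" src="https://example.com/item.jpg">
</div>
'''
item = BeautifulSoup(sample, 'html.parser').find('div', {'class': 'column column-block block-grid-large single-item'})
print(item.find('h6', {'class': 'title itemTitle'}).get_text(strip=True))  # Example Apple Item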
file.write("\n]")
filecsv.close()
file.close()
  • These lines write a closing square bracket (]) followed by a new line character (\n) to the JSON file, indicating the end of the JSON array. The filecsv and file files are closed using the close() method, which ensures that any pending data is flushed and the resources are released.
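As a side note, the same file handling can be written with context managers, which close the files automatically even if the scrape raises an exception partway through (a sketch, with the loop body elided):

with open('SouqDataapple.csv', 'w', encoding='utf8', newline='') as filecsv, \
     open('SouqDataapple.json', 'w', encoding='utf8') as file:
    ...  # the scraping loop from above goes here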

Usage

  1. Open a terminal or command prompt.

  2. Navigate to the directory where you saved the Python file.

  3. Run the following command to execute the code:

    python web_scraping_example.py
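
While it runs, the print() calls produce progress output like this:

--- 0 ---
https://saudi.souq.com/sa-ar/apple/new/a-c/s/?section=2&page=0
--- 1 ---
https://saudi.souq.com/sa-ar/apple/new/a-c/s/?section=2&page=1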

The script will crawl up to 1000 listing pages and store the extracted data in both CSV and JSON formats. Class names and URL structures change over time, so verify the selectors against the live page, and adjust the range(1000) limit and the parsing logic to fit your own scraping requirements.

Please note that web scraping should be done responsibly and in accordance with the website’s terms of service. Make sure to respect the website’s policies and do not overwhelm their servers with too many requests.
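
One concrete way to follow that advice is to check the site's robots.txt before crawling and to pause between requests. The sketch below uses the standard library's robot parser; the URLs are the ones from this tutorial and are illustrative:

import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://saudi.souq.com/robots.txt')
rp.read()
if rp.can_fetch('*', 'https://saudi.souq.com/sa-ar/apple/new/a-c/s/'):
    time.sleep(1)  # pause between requests so the server is not overwhelmed
    # ... send the request here ...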

That’s it! You have completed the web scraping tutorial. You can now use the extracted data for further analysis or any other purposes you desire.

I hope this tutorial helps you understand the process of web scraping using Python. If you have any questions, feel free to ask!
