Create a Web Scraper to Scrape Products from Souq
Web Scraping Tutorial
This tutorial will guide you through a code example for web scraping using Python. We will use the requests, time, BeautifulSoup, json, and csv libraries to extract data from a website. The code scrapes product information from a specific URL and stores it in both CSV and JSON formats.
Prerequisites
Make sure you have Python installed on your system. You can download the latest version of Python from the official website: Python.org
Installation
To run the code, you’ll need to install the required libraries. Open your terminal or command prompt and use the following commands:
pip install requests
pip install beautifulsoup4
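Optionally, you can verify that the libraries installed correctly with a quick import check (a sketch; the printed version numbers will vary on your machine):

import requests
import bs4

print('requests', requests.__version__)
print('beautifulsoup4', bs4.__version__)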
Code
Copy the following code into a Python file, such as web_scraping_example.py:
import requests
import time
from bs4 import BeautifulSoup
import json
import csv

# Open the CSV file for writing (newline='' avoids blank lines between rows on Windows)
filecsv = open('SouqDataapple.csv', 'w', encoding='utf8', newline='')

# Set the URL you want to scrape from
url = 'https://saudi.souq.com/sa-ar/apple/new/a-c/s/?section=2&page='

# Open the JSON file for writing and start the JSON array
file = open('SouqDataapple.json', 'w', encoding='utf8')
file.write('[\n')

# Define the CSV columns
csv_columns = ['name', 'price', 'img']

# Create a CSV writer object and write the header row once
writer = csv.DictWriter(filecsv, fieldnames=csv_columns)
writer.writeheader()

# Track whether the first JSON record has been written yet
first_item = True

# Loop through multiple pages
for page in range(1000):
    print('---', page, '---')

    # Send a GET request to the URL
    r = requests.get(url + str(page))
    print(url + str(page))

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(r.content, "html.parser")

    # Find all the product items on the page
    ancher = soup.find_all('div', {'class': 'column column-block block-grid-large single-item'})

    # Loop through each product item
    for pt in ancher:
        # Extract the name, price, and image URL
        name = pt.find('h6', {'class': 'title itemTitle'})
        itemPrice = pt.find('span', {'class': 'itemPrice'})
        img = pt.find('img', {'class': 'img-size-medium'})

        # Skip items that are missing any of the three fields
        if name and itemPrice and img:
            data = {'name': name.text.strip(),
                    'price': itemPrice.text.strip(),
                    'img': img.get('src')}

            # Write the data to the CSV file
            writer.writerow(data)

            # Write the data to the JSON file, separating records with commas
            if not first_item:
                file.write(',\n')
            file.write(json.dumps(data, ensure_ascii=False))
            first_item = False

    # Pause briefly between pages so the server is not overwhelmed
    time.sleep(1)

# Finish writing the JSON file
file.write('\n]')

# Close the files
filecsv.close()
file.close()
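To sanity-check the output after a run, you can read both files back. A minimal sketch (the filenames match the ones the script writes):

import csv
import json

# Load the JSON file back to confirm it parses as a list of records
with open('SouqDataapple.json', encoding='utf8') as f:
    records = json.load(f)
print(len(records), 'records scraped')

# Print the first few CSV rows
with open('SouqDataapple.csv', encoding='utf8', newline='') as f:
    for i, row in enumerate(csv.DictReader(f)):
        print(row)
        if i >= 2:
            break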
Code Explanation
import requests
import time
from bs4 import BeautifulSoup
import json
import csv
- These lines import the required libraries: requests for sending HTTP requests, time for pausing between requests, BeautifulSoup for parsing HTML, json for working with JSON data, and csv for working with CSV files.
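As a quick illustration of the two workhorse libraries, the following sketch fetches a single page and inspects it (the URL is the one used in this tutorial; the page may no longer exist, so treat the output as illustrative):

import requests
from bs4 import BeautifulSoup

r = requests.get('https://saudi.souq.com/sa-ar/apple/new/a-c/s/?section=2&page=1')
print(r.status_code)  # 200 means the request succeeded

soup = BeautifulSoup(r.content, 'html.parser')
print(soup.title.text if soup.title else 'no <title> found')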
filecsv = open('SouqDataapple.csv', 'w', encoding='utf8', newline='')
- This line opens a CSV file named "SouqDataapple.csv" in write mode ('w') and assigns the file object to the variable filecsv. Passing newline='' stops the csv module from inserting blank lines between rows on Windows.
url = 'https://saudi.souq.com/sa-ar/apple/new/a-c/s/?section=2&page='
- This line assigns the URL of the website from which we want to scrape data to the variable url. It points to a paginated Apple product listing on Souq.com; the page number is appended to it inside the loop.
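For illustration, this is how the page number expands into full page URLs:

url = 'https://saudi.souq.com/sa-ar/apple/new/a-c/s/?section=2&page='
for page in range(3):
    print(url + str(page))
# https://saudi.souq.com/sa-ar/apple/new/a-c/s/?section=2&page=0
# https://saudi.souq.com/sa-ar/apple/new/a-c/s/?section=2&page=1
# https://saudi.souq.com/sa-ar/apple/new/a-c/s/?section=2&page=2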
file = open('SouqDataapple.json', 'w', encoding='utf8')
file.write('[\n')
- These lines open a JSON file named "SouqDataapple.json" in write mode and assign the file object to the variable file. An opening square bracket ([) followed by a newline character (\n) is then written to the file to start the JSON array.
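If you prefer not to manage brackets and commas by hand, an alternative sketch is to collect all records in a list and write them in one call (SouqDataapple_alt.json is a hypothetical filename, and the record values are made up):

import json

records = []
# ... append one dictionary per product while scraping ...
records.append({'name': 'example', 'price': '100 SAR', 'img': 'http://example.com/img.jpg'})  # illustrative values

with open('SouqDataapple_alt.json', 'w', encoding='utf8') as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

The trade-off is memory: this variant holds every record in memory until the end, while the streaming approach in the tutorial writes each record as soon as it is scraped.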
csv_columns = ['name', 'price', 'img']
writer = csv.DictWriter(filecsv, fieldnames=csv_columns)
writer.writeheader()
first_item = True
- These lines define the CSV column names in csv_columns, create a csv.DictWriter object named writer that writes dictionaries to the CSV file, and write the header row once using the writeheader() method. Creating the writer and header before the loop keeps the header from being repeated on every page. The first_item flag tracks whether a JSON record has been written yet, so commas can be placed between records without leaving a trailing comma that would make the file invalid JSON.
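To see exactly what csv.DictWriter produces, here is a self-contained demo on an in-memory buffer (the row values are made up):

import csv
import io

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['name', 'price', 'img'])
writer.writeheader()
writer.writerow({'name': 'iPhone', 'price': '2999 SAR', 'img': 'http://example.com/i.jpg'})
print(buf.getvalue())
# name,price,img
# iPhone,2999 SAR,http://example.com/i.jpg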
for page in range(1000):
    print('---', page, '---')
    r = requests.get(url + str(page))
    print(url + str(page))
    soup = BeautifulSoup(r.content, "html.parser")
    ancher = soup.find_all('div', {'class': 'column column-block block-grid-large single-item'})
- These lines start a loop over 1000 page numbers. Each iteration sends a GET request to the URL concatenated with the current page number and stores the response in r, printing the page number and full URL to the console along the way. The HTML content of the response is parsed with BeautifulSoup and stored in the soup variable, and ancher is assigned the list of all div elements whose class matches the product-item container.
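The class-based lookup can be tried on a hand-written HTML snippet; the class names below mirror the ones the tutorial targets, but the content is invented for the demo:

from bs4 import BeautifulSoup

html = '''
<div class="column column-block block-grid-large single-item">
  <h6 class="title itemTitle">Example product</h6>
  <span class="itemPrice">199 SAR</span>
  <img class="img-size-medium" src="http://example.com/p.jpg">
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
items = soup.find_all('div', {'class': 'column column-block block-grid-large single-item'})
print(len(items))  # 1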
    for pt in ancher:
        name = pt.find('h6', {'class': 'title itemTitle'})
        itemPrice = pt.find('span', {'class': 'itemPrice'})
        img = pt.find('img', {'class': 'img-size-medium'})
        if name and itemPrice and img:
            data = {'name': name.text.strip(),
                    'price': itemPrice.text.strip(),
                    'img': img.get('src')}
            writer.writerow(data)
            if not first_item:
                file.write(',\n')
            file.write(json.dumps(data, ensure_ascii=False))
            first_item = False
- These lines iterate through each pt in the ancher list, where each pt represents one product item on the page. The name, price, and image elements are located by their HTML tags and class names. If all three are present, the extracted values are collected in the data dictionary, written as a row to the CSV file with writer.writerow(), converted to JSON with json.dumps(), and appended to the JSON file. A comma and newline are written before every record except the first, which keeps the resulting JSON array valid.
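The guard matters because find() returns None when an element is missing, and calling .text on None raises an AttributeError. A minimal illustration:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div></div>', 'html.parser')
name = soup.find('h6', {'class': 'title itemTitle'})
print(name)  # None: the element does not exist in this snippet
if name:
    print(name.text)  # skipped, so no AttributeError is raised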
file.write("\n]")
filecsv.close()
file.close()
- These lines write a closing square bracket (
]
) followed by a new line character (\n
) to the JSON file, indicating the end of the JSON array. Thefilecsv
andfile
files are closed using theclose()
method, which ensures that any pending data is flushed and the resources are released.
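A more idiomatic alternative, sketched below, is to open both files with a with statement so they are closed automatically even if the scrape fails partway through:

with open('SouqDataapple.csv', 'w', encoding='utf8', newline='') as filecsv, \
     open('SouqDataapple.json', 'w', encoding='utf8') as file:
    # the scraping logic from above goes here, unchanged
    ...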
Usage
1. Open a terminal or command prompt.
2. Navigate to the directory where you saved the Python file.
3. Run the following command to execute the code:
python web_scraping_example.py
The code will start scraping the website and store the data in both CSV and JSON formats. You can customize the code as needed for your own web scraping requirements.
Please note that web scraping should be done responsibly and in accordance with the website’s terms of service. Make sure to respect the website’s policies and do not overwhelm their servers with too many requests.
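One concrete way to respect a site's policies is to consult its robots.txt before scraping. A sketch using only the standard library (whether this URL still resolves is not guaranteed):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://saudi.souq.com/robots.txt')
rp.read()  # fetches and parses robots.txt
print(rp.can_fetch('*', 'https://saudi.souq.com/sa-ar/apple/new/a-c/s/'))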
That’s it! You have completed the web scraping tutorial. You can now use the extracted data for further analysis or any other purposes you desire.
I hope this tutorial helps you understand the process of web scraping using Python. If you have any questions, feel free to ask!