How to Rotate Proxies for Web Scraping Using Python
This article explains how to rotate proxies for web scraping with the Python programming language.
The first step in the process is to obtain a list of proxies to use for IP address rotation.
The second step is to check whether each proxy IP address is working and can be used for the data scraping process.
After the check, the working proxies are saved into a separate list to be used for web scraping.
The final step is to use that list of working proxies and try to access each web page with a different IP address.
If none of the proxies can be used while rotating, a message will appear saying that all proxies failed to access the website.
A free proxy list can be found and downloaded from one of the many free proxy list sites.
The next step is to check the proxy list with the following Python code.
import threading
import queue
import requests

q = queue.Queue()
valid_proxies = []

# Read the proxy list and put each proxy into the queue to be checked
with open("proxy_list.txt", "r") as f:
    proxies = f.read().split("\n")
    for p in proxies:
        q.put(p)

def check_proxies():
    global q
    while not q.empty():
        proxy = q.get()
        try:
            res = requests.get("http://ipinfo.io/json",
                               proxies={"http": proxy, "https": proxy})
        except Exception:
            continue
        if res.status_code == 200:
            valid_proxies.append(proxy)
            print(proxy)

# Start 10 threads that check proxies from the queue in parallel
for _ in range(10):
    threading.Thread(target=check_proxies).start()
The meaning and function of the code above are as follows:
import threading
This statement imports the threading module, which is used to create and manage threads in Python. A thread is the smallest unit of execution in a program that can run independently.
import queue
This imports the queue module, which is used to create FIFO (First-In, First-Out) queues.
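As a quick illustration of the FIFO behavior (a minimal sketch, not part of the proxy checker itself), items are retrieved in the same order they were added:
import queue

demo = queue.Queue()
demo.put("first")
demo.put("second")
print(demo.get())  # prints "first"; items come out in the order they were added
print(demo.get())  # prints "second"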
q = queue.Queue()
This creates an empty queue object using the queue module. This queue will be used to store a list of proxies to be checked.
valid_proxies = []
This variable is used to store the list of valid proxies (in this case, a valid proxy is one that can successfully access “http://ipinfo.io/json”).
with open("proxy_list.txt", "r") as f:
This opens the file "proxy_list.txt" in read mode ("r") and assigns it to the variable "f".
proxies = f.read().split("\n")
It reads the contents of the file "proxy_list.txt" and splits them into a list of proxies using the newline character ("\n").
for p in proxies:
It iterates through the list of proxies read from the file.
q.put(p)
This adds each proxy to the queue q so it can be checked later.
def check_proxies():
It defines a function called “check_proxies” that will be executed by each thread.
This function does several things: it takes a proxy from the queue q, attempts an HTTP request to http://ipinfo.io/json through that proxy, and treats the proxy as valid if the request succeeds (status 200); if an error occurs (the proxy cannot be reached), it moves on to the next proxy.
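For reference, the proxies argument in requests.get maps each URL scheme to the proxy that should handle it. Below is a minimal sketch of a single check, assuming each line of proxy_list.txt holds a plain "ip:port" entry (the address here is only a placeholder):
import requests

# Placeholder address; each line in proxy_list.txt is assumed to be "ip:port"
proxy = "203.0.113.5:8080"

# The proxies mapping tells requests which proxy to use for each URL scheme
res = requests.get(
    "http://ipinfo.io/json",
    proxies={"http": proxy, "https": proxy},
    timeout=5,  # optional; avoids hanging on unresponsive proxies
)
print(res.json())  # ipinfo.io reports the IP address the request arrived from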
for _ in range(10):
This iterates 10 times.
threading.Thread(target=check_proxies).start()
It creates and starts 10 threads that run the “check_proxies” function. Each thread takes proxies from the queue and checks their validity independently. This is what is called multithreading.
Run the code above, and it will produce output like the one below.
Copy the IP addresses of the proxies that passed the check and save them to a file named valid_proxies.txt.
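Copying by hand works, but since the script above already collects the working proxies in the valid_proxies list, it could also write them to the file itself. Here is a minimal sketch, assuming the threads are kept in a list so the program can wait for all of them to finish first:
threads = [threading.Thread(target=check_proxies) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait until every proxy in the queue has been checked

# Write one working proxy per line, the same format the next step expects
with open("valid_proxies.txt", "w") as f:
    f.write("\n".join(valid_proxies))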
The next step is to use these proxies to access the target websites with the requests package. Here is the code.
import requests

with open("valid_proxies.txt", "r") as f:
    proxies = f.read().split("\n")

sites_to_check = [
    "https://top-1000-sekolah.ltmpt.ac.id/?page=1&per-page=100",
    "https://top-1000-sekolah.ltmpt.ac.id/?page=2&per-page=100",
    "https://top-1000-sekolah.ltmpt.ac.id/?page=3&per-page=100",
]

# For each page, try the proxies in order until one succeeds
for site in sites_to_check:
    proxy_succeed = False
    for proxy in proxies:
        try:
            res = requests.get(site, proxies={"http": proxy, "https": proxy}, timeout=5)
            if res.status_code == 200:
                print(f"Proxy {proxy} accessed successfully {site}")
                proxy_succeed = True
                break
        except Exception:
            continue
    if not proxy_succeed:
        print(f"All proxies fail to access {site}")
The meaning and function of the code above are as follows:
with open("valid_proxies.txt", "r") as f:
This opens the file "valid_proxies.txt" in read mode ("r") and assigns it to the variable "f".
proxies = f.read().split("\n")
It reads the contents of the file "valid_proxies.txt" and splits them into a list of proxies using the newline character ("\n"). Each line in the file becomes an element of the "proxies" list.
sites_to_check
This is a list of web pages that will be accessed using a proxy. In this example, there are three pages to test.
proxy_succeed = False
This variable marks whether any proxy has successfully accessed the current site. It is initially set to False because no proxy has been tested yet. The code that follows performs the actual check of each proxy IP address.
try:
This try-except block tests the proxy by making an HTTP request with the requests module.
res = requests.get(site, proxies={"http": proxy, "https": proxy}, timeout=5)
This sends a GET request to the specified website using the proxy under test. The timeout is set to 5 seconds, meaning that if the request takes more than 5 seconds it is considered failed and the loop moves on to the next proxy IP.
if res.status_code == 200:
If the HTTP response status code is 200 (OK), it means the proxy successfully accessed the website.
print(f"Proxy {proxy} accessed successfully {site}")
If the proxy successfully accesses the website, then a message containing proxy information and the accessed site will be printed on the screen.
proxy_succeed = True
The proxy_succeed variable is set to True to indicate that the proxy successfully accessed the website.
if not proxy_succeed:
After trying all the proxies in the list, the program checks whether any of them managed to access the site. If none did, a message is printed saying that all proxies failed to access the site.
When the code above is run, it prints which proxy IP address was used to access each website page. This is useful for the data scraping process in the next step.
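To make the rotation explicit, so that consecutive pages are requested through different IP addresses, the valid proxies can also be cycled with itertools. Here is a minimal sketch, assuming the same valid_proxies.txt file and page URLs used above:
import itertools
import requests

with open("valid_proxies.txt", "r") as f:
    proxies = [p for p in f.read().split("\n") if p]

proxy_pool = itertools.cycle(proxies)  # loop over the proxy list endlessly

pages = [f"https://top-1000-sekolah.ltmpt.ac.id/?page={n}&per-page=100"
         for n in range(1, 4)]

for page in pages:
    proxy = next(proxy_pool)  # take the next proxy for each page
    try:
        res = requests.get(page, proxies={"http": proxy, "https": proxy}, timeout=5)
        print(f"Proxy {proxy} accessed {page} with status {res.status_code}")
    except requests.RequestException:
        print(f"Proxy {proxy} failed to access {page}")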
That is how to rotate proxies for web scraping in Python, step by step. Hope it is useful.