Proxy usage when making requests with Python
In this article I describe techniques for using proxies in Python code, from basic usage to more advanced requests routed through an intermediate proxy pool service.
Sometimes you need to reach a resource that is blocked by your ISP, sometimes you want to hide your real IP address, and sometimes you need to get around rate limiting when running a web crawler. (NOTE: not every proxy supports all of these use cases; see the classification below.)
In terms of anonymity (hiding who is actually making the request), there are three types of proxies:
- highly anonymous - the web server can’t detect the fact you are using a proxy;
- anonymous - the web server can know you are using a proxy, but it can’t see your real IP;
- transparent - the web server can know you are using a proxy and it can also see your real IP.
And yes, we’ll be using free proxies that are published as public lists, so anyone on the web can use them.
And yes, we’ll be using Python 3.7.
The simplest way to use a proxy, and to make requests in general, is through the requests package.
The nice thing about this library is its support for HTTPS, HTTP Basic Auth and configuration via environment variables (HTTPS_PROXY). But in this article we will focus on asynchronous requests, and in particular on the aiohttp client library.
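For reference, here is a minimal sketch of the requests approach. The proxy address is a placeholder and httpbin is only an example target; the helper names are made up for illustration.

```python
import requests

# Placeholder address: substitute a proxy you actually have access to.
PROXY_URL = "http://127.0.0.1:8080"


def build_proxies(proxy_url: str) -> dict:
    """Map both URL schemes to the same proxy, the shape requests expects."""
    return {"http": proxy_url, "https": proxy_url}


def fetch_through_proxy(url: str, proxy_url: str = PROXY_URL) -> str:
    """Fetch url through the proxy and return the response body."""
    response = requests.get(url, proxies=build_proxies(proxy_url), timeout=10)
    response.raise_for_status()
    return response.text


# Example call (needs a working proxy):
# print(fetch_through_proxy("https://httpbin.org/ip"))
```

Passing proxies per request keeps the example explicit; the same mapping can also be picked up from the environment if you prefer not to hard-code it.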
Let’s begin with a simple request using aiohttp.
async with aiohttp.ClientSession() as session:
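The listing above is cut short, so here is a minimal sketch of what a complete request through a proxy might look like. The function name, the timeout value and the addresses in the usage comment are assumptions made for the example.

```python
import asyncio

import aiohttp


async def fetch(url: str, proxy: str) -> str:
    """Request url through the given HTTP proxy and return the body."""
    timeout = aiohttp.ClientTimeout(total=10)  # assumed value, tune as needed
    async with aiohttp.ClientSession(timeout=timeout) as session:
        # The proxy is passed per request, not per session.
        async with session.get(url, proxy=proxy) as response:
            return await response.text()


# Example call (needs a working proxy at this placeholder address):
# print(asyncio.run(fetch("http://api.ipify.org", "http://127.0.0.1:8080")))
```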
Authentication and supplying the proxy URL through the environment are also supported
async with aiohttp.ClientSession(trust_env=True) as session:
This way you tell the client to use the same HTTPS_PROXY environment variable for your proxy. For authentication, just provide your credentials within the proxy URI.
If you prefer a ~/.netrc file, that works too.
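The three options can be sketched side by side as follows. The credentials, host and function names are placeholders invented for the example; only the proxy, proxy_auth and trust_env parameters are aiohttp's own.

```python
import aiohttp

# Placeholder credentials and host, shown only to illustrate the URL shape.
PROXY = "http://127.0.0.1:8080"
PROXY_WITH_AUTH = "http://user:password@127.0.0.1:8080"


async def fetch_with_url_credentials(url: str) -> str:
    """Credentials embedded directly in the proxy URL."""
    async with aiohttp.ClientSession() as session:
        async with session.get(url, proxy=PROXY_WITH_AUTH) as response:
            return await response.text()


async def fetch_with_proxy_auth(url: str) -> str:
    """The same credentials passed separately as a BasicAuth object."""
    auth = aiohttp.BasicAuth("user", "password")
    async with aiohttp.ClientSession() as session:
        async with session.get(url, proxy=PROXY, proxy_auth=auth) as response:
            return await response.text()


async def fetch_via_environment(url: str) -> str:
    """Proxy taken from HTTP_PROXY / HTTPS_PROXY, credentials from ~/.netrc."""
    async with aiohttp.ClientSession(trust_env=True) as session:
        async with session.get(url) as response:
            return await response.text()
```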
The example above is only fine when the proxy you are using is reliable and you need to make just a couple of requests. But what happens if the proxy becomes unavailable, you hit a request limit, or you get some other connection-related error? A pool of proxy servers can help with all of these issues. The simplest prototype of such a system should:
- store a list of available proxy servers
- iterate over them via round-robin
- check that a proxy is available and, if it works properly, return it for use in the next request.
That’s enough to begin with, so we can implement our first pool. We’ll start by compiling a list of proxies from public resources like Free Proxy List and HideMy.name. Storing this list locally gives us faster access and lets us put together an individual set that works best for a specific case.
Suppose we have a file proxies.txt, and here’s our pool class definition with the methods we want to implement
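The original listing is not reproduced here, so the following is only a sketch of what such a class might look like. The attribute names, the file format (one proxy URL per line) and the placeholder _check_proxy are assumptions; the real availability check is discussed further down.

```python
import asyncio
from itertools import cycle
from pathlib import Path
from typing import AsyncIterator, List, Optional


class ProxyPool:
    """Round-robin pool over proxy URLs read from a text file."""

    # Amazon Check IP; any reliable "what is my IP" service would do.
    check_url = "https://checkip.amazonaws.com"

    def __init__(self, path: str = "proxies.txt") -> None:
        self._path = Path(path)
        self._proxies: Optional[List[str]] = None  # filled on first access

    @property
    def proxies(self) -> List[str]:
        # Lazy loading: the file is read only when the list is first needed.
        if self._proxies is None:
            lines = self._path.read_text().splitlines()
            self._proxies = [line.strip() for line in lines if line.strip()]
        return self._proxies

    async def _check_proxy(self, proxy: str) -> bool:
        # Placeholder that accepts every proxy; a real check would request
        # check_url through the proxy and inspect the response.
        return True

    async def __aiter__(self) -> AsyncIterator[str]:
        # Endless round-robin that skips proxies failing the check.
        for proxy in cycle(self.proxies):
            if await self._check_proxy(proxy):
                yield proxy

    async def get_proxy(self) -> str:
        # A fresh iteration starts on every call, so this keeps returning
        # the first healthy proxy in the list.
        async for proxy in self:
            return proxy
        raise LookupError("proxy list is empty")
```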
What do we have here? We lazily load the proxy list from a file, then we have an infinite async iterator that goes through this list, checks each proxy's availability and yields its URL to the caller. We also have a get_proxy method that retrieves a single proxy at a time (we’ll rewrite this method later, because right now it returns the same first proxy on every invocation). As the check_url I’ve used Amazon Check IP, but you can use anything reliable enough for your case (e.g. httpbin).
The _check_proxy method implementation is also straightforward, but it introduces another function, _check_response. In most cases checking the response status should be enough (like return resp.status == HTTPStatus.OK), but we can add extra logic to confirm the proxy works correctly.
from aiohttp import client_exceptions
We need to catch all possible connection errors and skip the proxy in such cases. When checking the response we might want to confirm that the target server sees our IP as the proxy's IP (or at least includes it in the response text, in the case of a transparent proxy)
from http import HTTPStatus
That’s it for now, so we can actually make requests using our ProxyPool.
As output you might see something like this
Cannot connect to host 22.214.171.124:57107 ssl:None [Connect call failed ('126.96.36.199', 57107)]
So it skips the first item in our list and then successfully makes a request using the next candidate from the pool. Great job!
Assuming the code above works properly, let’s make some optimizations. First, we need to make sure the proxy list is shared between ProxyPool instances, so that we don't perform heavy I/O on each instantiation. We are going to use a class attribute for that
from random import shuffle
As you can see, we are using the same lazy loading technique, but this time the list of proxies is stored as a class attribute for all instances to share. Additionally, we shuffle the list per instance in case there are multiple pools in our code.
This way we spread the load across proxies instead of having multiple connections try to use one proxy at the same moment.
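A stripped-down sketch of this idea, leaving out iteration and checks, could look like the following. The attribute and method names are assumptions, and the cache is deliberately simple: once loaded, it ignores the path passed to later instances.

```python
from pathlib import Path
from random import shuffle
from typing import ClassVar, List, Optional


class ProxyPool:
    # One list for the whole process; None means "not loaded yet".
    _all_proxies: ClassVar[Optional[List[str]]] = None

    def __init__(self, path: str = "proxies.txt") -> None:
        self._path = Path(path)
        self._proxies: Optional[List[str]] = None

    @classmethod
    def _load(cls, path: Path) -> List[str]:
        # Only the first caller pays for the file read.
        if cls._all_proxies is None:
            lines = path.read_text().splitlines()
            cls._all_proxies = [line.strip() for line in lines if line.strip()]
        return cls._all_proxies

    @property
    def proxies(self) -> List[str]:
        if self._proxies is None:
            # Copy before shuffling so each pool gets its own order and the
            # shared list stays untouched.
            self._proxies = list(self._load(self._path))
            shuffle(self._proxies)
        return self._proxies
```

Shuffling a per-instance copy rather than the shared list is a design choice: it keeps one pool's ordering from affecting another's.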
Another optimization concerns iteration and the get_proxy method. We want to store the iterator state so that we resume from the last item rather than always returning the first item in the list. Defining one more instance attribute will do the trick
We can reuse the same async_generator instance within each pool instance, because itertools.cycle over our proxy list is never exhausted, which ensures that each call returns the next proxy in round-robin order. You can confirm that by calling get_proxy several times in a row.
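A stand-alone sketch of the stored-iterator idea is shown below. To keep it short the pool takes its proxies as a plain list and uses a placeholder check that accepts everything; the _iterator and _iterate names are assumptions. The generator is advanced with __anext__() directly, since the anext() builtin only appeared in Python 3.10.

```python
import asyncio
from itertools import cycle
from typing import AsyncIterator, Iterable, Optional


class ProxyPool:
    def __init__(self, proxies: Iterable[str]) -> None:
        self._proxies = list(proxies)
        # Created on first use, then reused so iteration state is kept.
        self._iterator: Optional[AsyncIterator[str]] = None

    async def _check_proxy(self, proxy: str) -> bool:
        return True  # placeholder: a real pool would test the proxy here

    async def _iterate(self) -> AsyncIterator[str]:
        for proxy in cycle(self._proxies):
            if await self._check_proxy(proxy):
                yield proxy

    async def get_proxy(self) -> str:
        if self._iterator is None:
            self._iterator = self._iterate()
        # Resume the generator where the previous call stopped.
        return await self._iterator.__anext__()


async def demo() -> list:
    pool = ProxyPool(["http://10.0.0.1:8080", "http://10.0.0.2:3128"])
    return [await pool.get_proxy() for _ in range(3)]


# asyncio.run(demo()) walks the list and wraps around to the first entry.
```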
Now we have a fully functional ProxyPool that can be used within our request-based application. For an implementation reference you can check this GitHub code.
$ docker run -d -p 8899:8899 -p 8081:8081 -v /var/www/scylla:/var/www/scylla --name scylla wildcat/scylla:latest
We can either retrieve a list of proxies from scylla to use with our proxy pool, or use scylla itself as a forward proxy server. The latter is really simple: you just specify the URL of the scylla proxy, and for each request it selects a random proxy and forwards your request through it.
As expected, we get our proxy's IP in the output, similar to what we’ve seen earlier
Requesting http://api.ipify.org using http://127.0.0.1:8081
...but such simplicity comes at a very high price
Note: HTTPS requests are not supported at present.
To deal with this we need to apply this workaround
async def get_proxy():
In the code above we explicitly request proxies that support https connections and randomly choose one of them. You can also sort the proxies list by the stability property of each item. (NOTE: when building the proxy URL we still use http, because only http proxies are supported. This means the connection between you and the proxy is not encrypted, so make sure not to pass any sensitive data over it. Don't be confused by the fact that we filtered the proxies by the https parameter: that only means the proxy server can send https requests to other resources that enforce encryption, not that your connection to the proxy is secure.)
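Since only the first line of the workaround survives above, here is a sketch of how it might be written. It assumes scylla's JSON API is reachable on the 8899 port mapped in the docker command, at /api/v1/proxies, that it accepts an https=true filter, and that each item carries ip, port and stability fields; check these against your scylla version. The helper names choose_proxy and most_stable are made up for the example.

```python
import random
from typing import Dict, List

import aiohttp

# Assumed endpoint of a local scylla instance.
SCYLLA_API = "http://127.0.0.1:8899/api/v1/proxies"


def _to_url(item: Dict) -> str:
    # The scheme stays http: only plain http proxies are supported.
    return f"http://{item['ip']}:{item['port']}"


def choose_proxy(proxies: List[Dict]) -> str:
    """Pick a random proxy from the list and build its URL."""
    return _to_url(random.choice(proxies))


def most_stable(proxies: List[Dict]) -> str:
    """Alternative: prefer the proxy with the highest stability score."""
    return _to_url(max(proxies, key=lambda item: item.get("stability", 0)))


async def get_proxy() -> str:
    """Ask scylla for proxies that can reach https targets, then pick one."""
    async with aiohttp.ClientSession() as session:
        async with session.get(SCYLLA_API, params={"https": "true"}) as resp:
            payload = await resp.json()
    return choose_proxy(payload["proxies"])
```

Keeping the selection logic in small pure functions makes it easy to swap random choice for stability-based ordering without touching the network code.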
I guess that’s it for today. Don’t be shy and leave your comments or share your knowledge about proxies below.