Strip Domain from URL input

Filtering URLs to strip out the base domain

How to filter out the domain from any given URL string

Here is something that comes up quite a lot in SEO tools that I put together.

To be honest, I’m mostly posting this here so I can find it later when I need it in future projects ;)

Particularly for link building, it makes sense not to build multiple links from the same domain.
When working alone on a small project this isn’t really a problem. But on a big project, or a project with several people working on it, you need a method to check a potential link source against a list of domains from which a link has already been placed.

You could always just compare the input URL against a list of existing links, but that wouldn’t work reliably.
For example:

if you already have a link at

http://www.example.com/example-article

and then look to place a link at:

http://www.example.com/new-article

The two URLs are not the same and searching the existing URL list for the new URL would return no results, even though we have already received a link from this domain.

So we could strip anything after the first “/” following the domain and then compare the two.

This works fine, but doesn’t take into account various sub-domains or the “www/non-www” variance.
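As a quick sketch of that idea (the variable names here are just for illustration), PHP’s built-in parse_url can hand us everything before the path, so the two URLs from above do compare as equal:

```php
<?php
// Naive comparison: cut each URL down to its host (everything before
// the path) and compare the hosts instead of the full URLs.
$existing  = "http://www.example.com/example-article";
$candidate = "http://www.example.com/new-article";

// parse_url() returns just the host portion of an absolute URL
$existingHost  = parse_url($existing, PHP_URL_HOST);  // "www.example.com"
$candidateHost = parse_url($candidate, PHP_URL_HOST); // "www.example.com"

var_dump($existingHost === $candidateHost); // bool(true)
```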

We also need to think about handling the presence or absence of “http://” or “https://” at the start of the input URL.

Also, if we want to leave this open to user input, we need to deal with the fact that users will enter the data in various styles.

We need to sanitise the input so that all of the following URLs match up as coming from the same domain:

http://www.example.com/abc
https://www.example.com/abc
http://example.com/abc
http://example.com
http://www.example.com/
http://subdomain.example.com/
http://subdomain.example.com/abc
www.example.com/abc
example.com/abc
example.com
www.example.com/
subdomain.example.com/
subdomain.example.com/abc

The issue becomes more complicated with various second-level TLDs such as .co.uk or .org.uk.
Simply breaking the URL string around the “.”s and taking the last 2 parts would work for URLs like www.beispiel.de, but running it on www.example.co.uk would output “co.uk”.
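To see the failure concretely, here is that “last two labels” approach as a small sketch (the function name is mine, purely for illustration):

```php
<?php
// Take the last two dot-separated labels of a host name
// (illustrative helper, not part of the final solution)
function lastTwoLabels($host) {
    $pieces = explode(".", $host);
    return implode(".", array_slice($pieces, -2));
}

echo lastTwoLabels("www.beispiel.de"), "\n";   // beispiel.de — correct
echo lastTwoLabels("www.example.co.uk"), "\n"; // co.uk — wrong!
```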

So what we need to do is run an extensive series of checks to filter down the input and leave only the domain as the output.

Like so:

//Assuming the user input string is assigned to the variable $url
//Normalise "https://" to "http://" so both schemes are treated the same
$url = str_replace("https://", "http://", $url);
 
//If http:// is missing, add it to the beginning to standardise the input string
if (strpos($url, "http://") === false) {
   $url = "http://".$url;
}
 
//Use the PHP function parse_url to reduce the URL to its host name
$domain = parse_url($url, PHP_URL_HOST);
 
//Split the host name around the "."s
$pieces = explode(".", $domain);
 
//Combine the 2nd and 3rd parts around a "." (this assumes a sub-domain
//such as "www." is present; the no-sub-domain case is fixed up below)
$out = $pieces[1].".".(isset($pieces[2]) ? $pieces[2] : "");
 
//Check the 4th piece of the broken-down string; this deals with
//second-level TLDs such as ".co.uk" (e.g. www.example.co.uk)
if (!empty($pieces[3])) {
	//Is a second-level TLD, add the remainder to the end
	$out .= ".".$pieces[3];
}
 
//Next we check the character length of the second part of the input string;
//this prevents an input such as danclarkie.co.uk being output as simply "co.uk"
//and also handles inputs with no sub-domain at all (e.g. example.com/abc)
if (strlen($pieces[1]) < 4) {
	if (empty($pieces[2])) {
		$out = $pieces[0].".".$pieces[1];
	}
	else {
		$out = $pieces[0].".".$pieces[1].".".$pieces[2];
	}
}
//The final output will be the URL stripped to its domain under the variable $out.
echo $out;

Yay!
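For reuse, the steps above can be wrapped in a function (a sketch; strip_domain is just an illustrative name) and checked against some of the test inputs from the list earlier, all of which should come out as "example.com":

```php
<?php
// Wrap the steps above in a reusable function (illustrative name)
function strip_domain($url) {
    // Normalise the scheme, then make sure one is present
    $url = str_replace("https://", "http://", $url);
    if (strpos($url, "http://") === false) {
        $url = "http://" . $url;
    }
    // Reduce to the host name and split it around the "."s
    $pieces = explode(".", parse_url($url, PHP_URL_HOST));
    $out = $pieces[1] . "." . (isset($pieces[2]) ? $pieces[2] : "");
    // Second-level TLD such as ".co.uk": keep the extra label
    if (!empty($pieces[3])) {
        $out .= "." . $pieces[3];
    }
    // Short second label: input had no sub-domain, shift everything left
    if (strlen($pieces[1]) < 4) {
        $out = empty($pieces[2])
            ? $pieces[0] . "." . $pieces[1]
            : $pieces[0] . "." . $pieces[1] . "." . $pieces[2];
    }
    return $out;
}

foreach (["http://www.example.com/abc", "https://www.example.com/abc",
          "example.com", "subdomain.example.com/abc"] as $u) {
    echo strip_domain($u), "\n"; // example.com, every time
}
```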