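Converting Relative Paths to Absolute Paths in PHP with Regular Expressions: An Example
Preface
When writing a web crawler, you often need to normalize the hyperlinks it discovers so that every one of them is an absolute path. This article presents a set of regular expressions for processing those links.
A crawler typically encounters links like the following: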
<!-- empty hyperlink -->
<a href=""></a>
<!-- whitespace -->
<a href=" " rel="external nofollow"></a>
<!-- a tag with other attributes -->
<a href="index.html" rel="external nofollow" alt="hyperlink">index.html</a>
<a href="/" rel="external nofollow" target="_blank">/ target="_blank"</a>
<a target="_blank" href="/" rel="external nofollow" alt="hyperlink">target="_blank" / alt="hyperlink"</a>
<a target="_blank" title="hyperlink" href="/" rel="external nofollow" alt="hyperlink">target="_blank" title="hyperlink" / alt="hyperlink"</a>
<!-- root directory -->
<a href="/" rel="external nofollow">/</a>
<a href="a" rel="external nofollow">a</a>
<!-- with query parameters -->
<a href="/index.html?id=1" rel="external nofollow">/index.html?id=1</a>
<a href="?id=2" rel="external nofollow">?id=2</a>
<!-- // -->
<a href="//index.html" rel="external nofollow">//index.html</a>
<a href="//www.mafutian.net" rel="external nofollow">//www.mafutian.net</a>
<!-- internal link -->
<a href="http://www.hole_1.com/index.html" rel="external nofollow">http://www.hole_1.com/index.html</a>
<!-- external links -->
<a href="http://www.mafutian.net" rel="external nofollow">http://www.mafutian.net</a>
<a href="http://www.numberer.net" rel="external nofollow">http://www.numberer.net</a>
<!-- links to images and text files -->
<a href="1.jpg" rel="external nofollow">1.jpg</a>
<a href="1.jpeg" rel="external nofollow">1.jpeg</a>
<a href="1.gif" rel="external nofollow">1.gif</a>
<a href="1.png" rel="external nofollow">1.png</a>
<a href="1.txt" rel="external nofollow">1.txt</a>
<!-- ordinary links -->
<a href="index.html" rel="external nofollow">index.html</a>
<a href="index.html" rel="external nofollow">index.html</a>
<a href="./index.html" rel="external nofollow">./index.html</a>
<a href="../index.html" rel="external nofollow">../index.html</a>
<a href=".../" rel="external nofollow">.../</a>
<a href="..." rel="external nofollow">...</a>
<!-- not links, but contain a colon -->
<a href="javascript:void(0)" rel="external nofollow">javascript:void(0)</a>
<a href="a:b" rel="external nofollow">a:b</a>
<a href="/a#a:b" rel="external nofollow">/a#a:b</a>
<a href="mailto:'mafutian@126.com'" rel="external nofollow">mailto:'mafutian@126.com'</a>
<a href="/tencent://message/?uin=335134463" rel="external nofollow">/tencent://message/?uin=335134463</a>
<!-- relative paths -->
<a href="." rel="external nofollow">.</a>
<a href=".." rel="external nofollow">..</a>
<a href="../" rel="external nofollow">../</a>
<a href="/a/b/.." rel="external nofollow">/a/b/..</a>
<a href="/a" rel="external nofollow">/a</a>
<a href="./b" rel="external nofollow">./b</a>
<a href="./././././././././b" rel="external nofollow">./././././././././b</a> <!-- effectively ./b -->
<a href="../c" rel="external nofollow">../c</a>
<a href="../../d" rel="external nofollow">../../d</a>
<a href="../a/../b/c/../d" rel="external nofollow">../a/../b/c/../d</a>
<a href="./../e" rel="external nofollow">./../e</a>
<a href="http://www.hole_1.org/./../e" rel="external nofollow">http://www.hole_1.org/./../e</a>
<a href="./.././f" rel="external nofollow">./.././f</a>
<a href="http://www.hole_1.org/../a/.../../b/c/../d/.." rel="external nofollow">http://www.hole_1.org/../a/.../../b/c/../d/..</a>
<!-- with a port number -->
<a href=":8081/index.html" rel="external nofollow">:8081/index.html</a>
<a href="http://www.mafutian.net:80/index.html" rel="external nofollow">:80/index.html</a>
<a href="http://www.mafutian.net:8081/index.html" rel="external nofollow">http://www.mafutian.net:8081/index.html</a>
<a href="http://www.mafutian.net:8082/index.html" rel="external nofollow">http://www.mafutian.net:8082/index.html</a>
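Before any of these links can be processed, the href values have to be pulled out of the page. The article does not show that step, so the following is only a minimal sketch: extract_hrefs is a hypothetical helper name, and the pattern assumes double-quoted href attributes on a tags (it would also pick up attributes whose name merely ends in href).

```php
<?php
// Hypothetical helper (not from the article): collect the href value of
// every <a> tag, assuming the value is wrapped in double quotes.
function extract_hrefs(string $html): array
{
    // <a, whitespace, any other attributes, then href="value".
    // The value may be empty, so the capture group uses * rather than +.
    preg_match_all('/<a\s[^>]*?href\s*=\s*"([^"]*)"/i', $html, $matches);
    return $matches[1];
}

$html = '<a href="">x</a> <a target="_blank" href="/a/b/..">y</a> <a href="../c">z</a>';
print_r(extract_hrefs($html)); // the three values: '', '/a/b/..' and '../c'
```

For anything beyond a quick crawler, an HTML parser is usually more robust than a regular expression for this step.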
The first step is to turn each link into an absolute path, which at this point may still contain dot segments:
http://.../../../
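The article does not include code for this first step. One way it might look is sketched below, assuming the crawler knows the URL of the page on which the link was found. The function name prefix_with_base and the $pageUrl parameter are hypothetical. Hrefs such as javascript: or mailto: are assumed to have been filtered out beforehand, and dot segments are deliberately left in place for the next step.

```php
<?php
// Hypothetical helper (not from the article): prefix an href with the
// relevant parts of the page URL, without resolving './' or '../'.
function prefix_with_base(string $href, string $pageUrl): string
{
    $parts  = parse_url($pageUrl);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $path   = $parts['path'] ?? '/';

    if (preg_match('/^https?:\/\//i', $href)) {
        return $href;                          // already absolute
    }
    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href; // protocol-relative link
    }
    if (strpos($href, '/') === 0) {
        return $origin . $href;                // relative to the site root
    }
    if ($href === '' || $href[0] === '?' || $href[0] === '#') {
        return $origin . $path . $href;        // same page, new query or fragment
    }
    // Otherwise the href is relative to the directory of the current page.
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

echo prefix_with_base('../c', 'http://www.mytest.org/a/b/index.html'), PHP_EOL;
// http://www.mytest.org/a/b/../c
```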
The rest of this article covers the second step: code that removes the './', '../', and '/..' segments from such an absolute path:
function url_to_absolute($relative)
{
    // Remove every './' (the lookbehind keeps '../' intact)
    $absolute = preg_replace('/(?<!\.)\.\//', '', $relative);
    // Repeatedly collapse every '/abc/../' into '/'.
    // The lookbehind (?<!\/) stops the host after 'http://' from being
    // treated as a directory.
    do {
        $absolute = preg_replace('/(?<!\/)\/([^\/]{1,}?)\/\.\.\//', '/', $absolute);
        $count = preg_match_all('/(?<!\/)\/([^\/]{1,}?)\/\.\.\//', $absolute, $res);
    } while ($count >= 1);
    // Remove a trailing '/abc/..', then any bare trailing '/..'
    $absolute = preg_replace('/(?<!\/)\/([^\/]{1,}?)\/\.\.$/', '/', $absolute);
    $absolute = preg_replace('/\/\.\.$/', '', $absolute);
    // Remove any '../' that is left (it would climb above the root)
    $absolute = preg_replace('/(?<!\.)\.\.\//', '', $absolute);
    return $absolute;
}

$relative = 'http://www.mytest.org/../a/.../../b/c/../d/..';
var_dump(url_to_absolute($relative));
// Output: string 'http://www.mytest.org/a/b/' (length=26)
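As a cross-check on the regular-expression version, the same cleanup can be sketched with a segment stack instead of repeated replacements. This is a different technique from the article's, not a restatement of it. The function name normalize_dot_segments is hypothetical, empty segments such as '//' inside the path are dropped, and only http and https URLs are handled. It agrees with the article's function on the test URL above; for other inputs the two may differ (for example in whether a trailing slash is kept).

```php
<?php
// Hypothetical alternative (not the article's method): resolve dot
// segments by walking the path with a stack.
function normalize_dot_segments(string $url): string
{
    // origin = scheme://host[:port], path = up to '?' or '#', tail = the rest.
    if (!preg_match('/^(https?:\/\/[^\/?#]+)([^?#]*)(.*)$/i', $url, $m) || $m[2] === '') {
        return $url; // not an http(s) URL, or no path to normalize
    }
    [, $origin, $path, $tail] = $m;

    $segments = explode('/', $path);
    $stack = [];
    foreach ($segments as $segment) {
        if ($segment === '..') {
            array_pop($stack);      // go up one level; a no-op at the root
        } elseif ($segment !== '.' && $segment !== '') {
            $stack[] = $segment;    // ordinary segment ('...' counts as one)
        }
    }
    // Keep a trailing slash when the path ended in '/', '/.' or '/..'.
    $last = end($segments);
    $trailing = ($stack && in_array($last, ['', '.', '..'], true)) ? '/' : '';

    return $origin . '/' . implode('/', $stack) . $trailing . $tail;
}

echo normalize_dot_segments('http://www.mytest.org/../a/.../../b/c/../d/..'), PHP_EOL;
// http://www.mytest.org/a/b/
```

Splitting off the query string first means a '../' inside a parameter value is left alone, which the purely textual replacements cannot guarantee.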
Summary
That is all for this article. I hope it proves useful in your study or work. If you have any questions, feel free to leave a comment.