Checking for dead links locally in a static website (using wget?)

  • Thread starter: Matthieu Moy (Guest)
A very nice tool to check for dead links (e.g. links pointing to 404 errors) is wget --spider. However, I have a slightly different use case where I generate a static website, and want to check for broken links before uploading. More precisely, I want to check both:


  • Relative links like <a href="some/file.pdf">file.pdf</a>


  • Absolute links, most likely to external sites like <a href="http://example.com">example</a>.

I tried wget --spider --force-html -i file-to-check.html, which reads the local file, treats it as HTML, and follows each link. Unfortunately, it can't deal with relative links within the local HTML file (it errors out with Cannot resolve incomplete link some/file.pdf). I tried using file:// URLs, but wget does not support them.

Currently, I have a hack based on running a local web server with python3 -m http.server and checking the local files through HTTP:

Code:
python3 -m http.server &    # serve the current directory on port 8000
pid=$!
sleep .5                    # hope the server is ready after half a second
error=0
wget --spider -nd -nv -H -r -l 1 http://localhost:8000/index.html || error=$?
kill $pid
wait $pid                   # reap the background server process
exit $error

I'm not really happy with this for several reasons:


  • I need this sleep .5 to wait for the web server to be ready. Without it, the script fails, but I can't guarantee that 0.5 seconds will always be enough. I'd prefer to have a way to start the wget command once the server is ready.


  • Conversely, this kill $pid feels ugly.

Ideally, python3 -m http.server would have an option to run a command once the server is ready and to shut itself down after the command completes. That sounds doable by writing a bit of Python, but I was wondering whether a cleaner solution exists.
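For reference, this is roughly what I mean by "a bit of Python": a sketch using only the standard library, serving the current directory the way python3 -m http.server does, with the same wget flags as above (binding to port 0 and 127.0.0.1 is just one way to do it, not a requirement).

Code:
#!/usr/bin/env python3
# Rough sketch: start http.server programmatically, run the link check,
# then shut the server down. No sleep is needed because the listening
# socket is already open once the HTTPServer constructor returns.
import subprocess
import sys
import threading
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Port 0 lets the OS pick a free port; SimpleHTTPRequestHandler serves
# the current working directory, like "python3 -m http.server".
server = HTTPServer(("127.0.0.1", 0), SimpleHTTPRequestHandler)
port = server.server_address[1]

threading.Thread(target=server.serve_forever, daemon=True).start()

try:
    result = subprocess.run(
        ["wget", "--spider", "-nd", "-nv", "-H", "-r", "-l", "1",
         f"http://127.0.0.1:{port}/index.html"]
    )
finally:
    server.shutdown()      # stop serve_forever()
    server.server_close()  # release the socket

sys.exit(result.returncode)

But that is already more code than I would like to maintain for such a simple task, hence the question.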

Did I miss anything? Is there a better solution? I'm mentioning wget in my question because it does almost what I want, but using wget is not a requirement for me (nor is python -m http.server). I just need to have something easy to run and automate on Linux.