SEMrush Toolbox #8: Log File Analyzer
- Log File Analyzer Overview
- An Overview Of HTTP Status Codes
- Retrieving Status Codes Using The Command Line
- Inconsistent HTTP Response Codes
- Key Things To Look For After Importing Log Files
- Should You 404 Or 301 Deleted Web Pages?
- Optimizing Pages That Aren’t Crawled Often
- How Often Should You Analyze Server Logs?
Craig: Hi guys, and welcome to today's SEMrush webinar.
I'm in Iceland, in the middle of an airport, so apologies for that, but you get to see something different for a change. But today, we are joined by the very famous Peter Nikolov.
We are going to be talking about the log file analyzer tool today and I don't think there's anyone better in the technical sense than Peter. So thank you for coming on, Peter, and for anyone who doesn't know you, can you just give the audience a bit of a background about what you do and things like that?
Peter Nikolov: So, my name is Peter Nikolov, actually, but you can call me just Peter if you like. With my company, we work just on technical SEO analysis, building mobile responsive websites, or building applications. I built my first website over 22 years ago in university, so I am kind of part of the old generation, one of the pioneers of the Internet.
So log file analysis today is one of the often forgotten things that we should educate new marketers about, because it's really, really, really important to know what's happening on your website. So, thanks for joining.
Craig: Perfect. Thank you.
Log File Analyzer Overview
Craig: So, what I'll do first of all guys is just give you a brief overview of where to find the tool, and what you can do with the tool, then we'll come to Peter, who will go into a lot more detail on things. So, I'm going to start sharing my screen just now.
First things first, where do you find the Log File Analyzer? So, anyone who logs in, you'll see up in the top left-hand corner here a tab called "SEO Toolkit". So, if you just go into SEO Toolkit, it will open up all the SEO tools.
You will be able to see Log File Analyzer down at the bottom here. So, if you just click on that, that opens up the tool. You can also see the URL up top there.
It's a new tool, and... I'm not the most technical SEO guy in the world... what I think SEMrush has brought to the table is being able to read this information in a nice, clean, easy-to-understand format.
So, you upload your log file here, but before I tell you where to upload it, where you get your log file is more important. You can see the URL here, which hopefully we can share with the guys on the chat: “What is an access log, and where do you find it?”
Now, most people will have cPanel and stuff like that, and what you normally do is you go into your file manager, and it tells you the rest of the instructions here, and you basically download the log files there. And what you do is come back to SEMrush, and go to the green button here, and just click "Upload Log File".
So you can get your last 30 days of log files or whatever log files you want to get, upload them, and then you can analyze them using this great tool. So, first things first, you can look at all Google bots, or you can look at Google desktop or Google smartphone.
What we can see here are the hits that Google's bots are making on my particular website, and what you also want to be able to see is these status codes. So, when Google hits your website, what status codes has it been served?
But more importantly, what you want to see here is the hits by pages. So we can see here that the wp-content folder on that particular website is the one with the highest crawl frequency. It's the one that's hit most frequently, and you can also see when it was last crawled.
What you want to be doing here, is analyzing the crawl frequency, and also the status as well. Obviously, you want a 200 status for pages that are alive on your website. Obviously, you'll have 301 redirects, but obviously, you want to eliminate any 404s or whatever else you may find here.
So, you can basically scroll through all the URL structures there, and you'll be able to see my SEMrush tutorial, for example, it's been hit 39 times by Google's bot in the last 30 days. The deeper into my website, the less frequent the Google bot is hitting my pages. So that's all of my pages.
What I've also got here are inconsistent status codes. So, what you want to have a look at here, as a sort of example: the last time Google bot crawled that page, there was a 404 error. But if I click on that little icon, it will tell me that it was a 301, then a 404, then a 301. So, something strange is going on with that particular URL, something I obviously want to address and understand.
So, you want to make sure that you've not got inconsistent status codes, and there should be no real valid reason for that. You've also got all your filtering options that you get with most SEMrush tools as well. So, you can filter by path, last status codes, and file types as well.
If you just want to look at the HTML files, just make sure that you select "ALL" there, and unless you're looking for the inconsistent ones ... but, yeah, so there are 405 HTML status codes there that you can have a look at as well.
So, it's quite a simple and easy to use tool, and it's something you probably should be checking on a regular basis. It's one thing, you know, getting people onto your website and stuff like that, and people obsess over UX, but if Google's bots can't really get about your website, then you've got a massive problem. You're not going to get people onto your website.
It's something that is really, really important, and something people can now do with ease by uploading your log files here. You can also export the data here if you want to feed that back to people or whatever.
And yeah, so, that is pretty much the tool in a nutshell, and it doesn't do anything fancy. It just allows you to analyze your log files in a much more user-friendly way. I'm gonna stop sharing my screen, and I'm gonna let the expert come in and take over.
Peter, if you want to carry on and show the guys how it all really works and all the knowledge that you've got.
Peter Nikolov: Yes. Thank you very much. So, I'll start by sharing a screen for the presentation I just prepared today about that. It's called "Let's Make Log File Analysis Great Again!".
An Overview Of HTTP Status Codes
So, today, why are people returning to log file analysis? Google Search Console, the old version, you know, will be trashed very soon. Crawl errors were what most people used it (GSC) for, and today (when you click on) crawl errors, it explains that you should use the “new index coverage report”.
Before, there were errors for Google Bot Desktop, Google Bot Mobile, and News Bot. News Bot was removed, and today Google Bot Desktop and Google Bot for mobile phones are also out. So, you can't see anything anymore here.
Why is this important? Because there are three ways to communicate with Google: the good, the bad, and the ugly. What are the HTTP status codes that you need to know? The good HTTP status codes are 200 and 304; 304 means that the file is not modified. Those are the status codes that Google should expect from your website, and your users too.
So, 200 just returns the file, and 304, actually, Google likes that, but not many people know it, because it saves bandwidth for Google.
What are bad status codes? Bad status codes are not really, really bad, but if they are seen often on a website, this can be a sign of very serious trouble. You should only use a 301 redirect, but some people use a 302. This is not a good redirect; actually, you must use 301.
404 means page not found, and 503 means service unavailable. Those status codes are not really, really bad, but for example, if 19% of a website returns 404 or 503, this is a clear sign for Google that some disaster happened to that website, and they start to crawl that website differently (aggressively with 404, not so aggressively with 503). Actually, if you return 503, Google sees that your website has been hit with lots of requests, and starts to crawl the website a little bit slower than before.
And the ugly status codes, which you should never return to Google or to any other kind of bot, are redirects like 307 or 308. There is also 401, which requires user authentication. You know, an HTTP password.
Another that you should never return to Google is 403, which is forbidden. It is forbidden to be crawled, to be displayed. Also, another very bad one is 410, which means that the page is gone. And 500, which means an internal server error. You should never return that to Google, and you should watch your website's log file to see what's actually going on.
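Peter's good/bad/ugly grouping can be written down as a small lookup. This is only a sketch of the grouping as stated in the talk; the set names and the `classify` helper are illustrative assumptions, not an official taxonomy.

```python
# Grouping taken from the talk; names are illustrative, not an official taxonomy.
GOOD = {200, 304}                      # content served, or not modified
BAD = {301, 302, 404, 503}             # tolerable, but a warning sign in bulk
UGLY = {307, 308, 401, 403, 410, 500}  # codes the speaker says never to serve to bots


def classify(status: int) -> str:
    """Return the talk's label for a status code, or 'other' if it was not mentioned."""
    if status in GOOD:
        return "good"
    if status in BAD:
        return "bad"
    if status in UGLY:
        return "ugly"
    return "other"


for code in (200, 304, 301, 503, 410, 500, 418):
    print(code, classify(code))
```

Codes the talk does not mention (418 here) fall into "other" rather than being guessed at.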
Retrieving Status Codes Using The Command Line
So, let's see how to get status codes. One of the easiest ways is to use curl. curl is a command line application that you can use in this way. For example, I am opening it on my Mac: you type "curl -I" to get the result code, then "http://www.mobiliodevelopment.com". My website returns "200 OK", so it's perfect.
When you get a huge log file like the one Craig showed you from cPanel, you must edit it. So, what is the actual situation? For example, here is one of my log files, which is over seven megabytes in size. I can see the IP, the date and time, the GET requests, the return codes, how many bytes were transferred, and the user agent. This is an endless file because it has a few thousand lines, and you can't read it very well.
So, some people import it to Excel, and I'll show you how to do that. You open Excel, select a blank workbook, go into Data, Get External Data, Import from Text File, select your text file, that's important, click Get Data, select "Delimited", that's important, and select “Space”, that is very important. Here, just click Finish, click "OK", and here is... that's my access log file imported to Excel. So, right now I can do some kind of analysis here, for example, remove the data that doesn't need to be shown.
For example, if you have a very large file, Excel can break because this is a lot of data. That's why some people write commands for easier analysis. So, this command will show me a list of status codes. I will run it, and here it is.
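The same check can be scripted. The sketch below sends a single HEAD request, which is what `curl -I` does, and deliberately does not follow redirects, so a 301 shows up as a 301. The function name `head_status` is an assumption made for illustration.

```python
import http.client
from urllib.parse import urlsplit


def head_status(url: str, timeout: float = 10.0):
    """Send one HEAD request without following redirects.

    Returns (status, reason, location), where location is the Location
    header or None when the response does not carry one.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.reason, response.getheader("Location")
    finally:
        conn.close()
```

Calling `head_status("http://www.mobiliodevelopment.com")` would make a live request to the site mentioned in the talk; what it returns depends on how that server responds at the time.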
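The fields Peter lists can be pulled out of each line with a regular expression. This sketch assumes the widely used Apache/NGINX "combined" log format; the talk does not say which format his server writes, and the sample line is made up.

```python
import re

# Assumes the "combined" access log format; adjust the pattern for other formats.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line: str):
    """Return a dict of named fields, or None if the line does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None


# Made-up sample line (the IP is from a documentation-reserved range).
sample = (
    '203.0.113.7 - - [12/Mar/2019:06:25:14 +0000] "GET /blog/ HTTP/1.1" 200 5123 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
record = parse_line(sample)
print(record["ip"], record["time"], record["path"], record["status"], record["bytes"])
```

Lines that do not match come back as `None`, so malformed entries can be counted or skipped instead of crashing the analysis.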
Right now, it returns how many times my website returns pages here, how many times there are some kind of redirects, how many times there are page not found, or authorizations, or page not modified, so things like that. But this is for everything. This means for all users, for all bots, for all scrapers, for a Facebook bot, for a Twitter bot, or anything.
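The exact command is not captured in the transcript. A rough equivalent, assuming the combined log format where the status code is the ninth space-separated field, could look like this; `status_counts` and the sample lines are illustrative.

```python
from collections import Counter


def status_counts(lines):
    """Count status codes across log lines (status = 9th space-separated field)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) > 8 and fields[8].isdigit():
            counts[fields[8]] += 1
    return counts


# Made-up sample lines in the combined log format.
sample_lines = [
    '203.0.113.7 - - [12/Mar/2019:06:25:14 +0000] "GET / HTTP/1.1" 200 5123 "-" "Googlebot/2.1"',
    '203.0.113.8 - - [12/Mar/2019:06:26:02 +0000] "GET /old HTTP/1.1" 301 0 "-" "Twitterbot/1.0"',
    '203.0.113.9 - - [12/Mar/2019:06:27:40 +0000] "GET /gone HTTP/1.1" 404 312 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [12/Mar/2019:06:28:11 +0000] "GET /blog/ HTTP/1.1" 200 8841 "-" "Googlebot/2.1"',
]
for status, hits in sorted(status_counts(sample_lines).items()):
    print(status, hits)
```

For a real file, passing the open file object works as well: `status_counts(open("access.log"))`, where `access.log` is a placeholder path.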
We need some kind of an easier way to filter all these situations. For example, you need just to grep the Google bot results, so we open it here, we type "grep"... and this returns just the lines of my log that come from Google bot. But it's actually too much information, and a lot of people don't know how to use it because it's really boring. But I'm one of the guys who does this almost every day.
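The grep step amounts to keeping only the lines that mention Googlebot. A minimal sketch, with an illustrative function name and made-up sample lines:

```python
def googlebot_lines(lines):
    """Yield the lines that mention Googlebot, case-insensitively (like grep -i)."""
    for line in lines:
        if "googlebot" in line.lower():
            yield line


# Made-up sample lines in the combined log format.
sample_lines = [
    '203.0.113.7 - - [12/Mar/2019:06:25:14 +0000] "GET / HTTP/1.1" 200 5123 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.8 - - [12/Mar/2019:06:26:02 +0000] "GET / HTTP/1.1" 200 5123 "-" "Twitterbot/1.0"',
    '203.0.113.9 - - [12/Mar/2019:06:27:40 +0000] "GET /blog/ HTTP/1.1" 304 0 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]
bot_hits = list(googlebot_lines(sample_lines))
print(len(bot_hits), "of", len(sample_lines), "lines mention Googlebot")
```

A user agent string can be set by any client, so a match here only shows that a request claims to be Googlebot.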
What does Google say about that? By the way, this is very important. John Mueller says, "URLs that are more popular on the Internet tend to be crawled more often to keep them fresher in our index". And they also say "Yes, internal links do impact crawl budget through crawl demand".
Inconsistent HTTP Response Codes
If you're a little bit bored, let's continue on log file analysis. So, just like Craig, let’s import log files in SEMrush Log Analyzer. So, what is the first problem? I can open, and I can see that some of the results come with some kind of warning. It tells me that “this service has inconsistent status codes”. What does this mean?
This is my home page because it's just a slash. I'm opening here. I'm opening just to see what the error is. I can see 301, 200, 301, 200.
So, one of the quick things that you must do here is to return to the command line, type "curl -I" with HTTPS, and let's see that URL. We can see that there is a redirect to the HTTP version because my website doesn't have an HTTPS version. And right now, we are trying to open the same URL with HTTP, and we can see 200 OK. So, what does this mean? This means that one time for the same URL we can see 301, and the other time we can see 200 OK.
And that's why SEMrush Log Analyzer told me: “Okay, this status code is inconsistent, see?” Google Bot is desperately trying to index my website through HTTPS, but since my website doesn't have a working HTTPS version, we redirect to the HTTP version, and it indexes the HTTP version. So, that doesn't mean that here we have some kind of a problem, but it shows me that something interesting happens.
You can see I have pages that are loaded once per month. There are some pages that are loaded once every two weeks, and that is very interesting because two weeks means that for one month, that file is crawled three times, and this means, for Google, that page isn't very important. Because other pages are crawled once every six hours.
So, this is clearly showing me that this page is much more important than the pages below that are crawled every month, for example, like this one, "Digitalization"; that web page is crawled just once. And it's not good.
If you have such a situation, you must... optimize your internal linking with a site audit. Once you optimize your internal linking, that web page should be crawled a little bit more often than before, for example, every two days or every few hours.
Here we can see our web pages with inconsistent status codes. For example, if you have a lot of redirects, you can filter to just the pages with redirects and see them. You can optimize those, and you can also optimize just the 404s. Having 404s is not actually very good for your site, because if you made a migration or some URL optimizations, for example, you can delete a very important page, and right now that page returns 404, but you didn't know because you updated the URLs, and that page can be lost forever.
So, that is very important to see, and what is also very important is to be sure that your website never returns errors like 500 to Google, because they are server errors.
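One simple way to define "inconsistent" is a path that has been served with more than one distinct status code. How SEMrush computes its warning is not stated in the talk, so the sketch below is an assumption, with made-up hits that mirror the 301, 200, 301, 200 pattern Peter describes.

```python
from collections import defaultdict


def inconsistent_paths(hits):
    """Map each path served with more than one status to its distinct codes.

    hits: iterable of (path, status) pairs in chronological order.
    """
    seen = defaultdict(list)
    for path, status in hits:
        if status not in seen[path]:
            seen[path].append(status)
    return {path: codes for path, codes in seen.items() if len(codes) > 1}


# Made-up hits: the home page alternates between 301 and 200.
hits = [
    ("/", 301), ("/", 200), ("/", 301), ("/", 200),
    ("/about/", 200), ("/about/", 200),
]
print(inconsistent_paths(hits))  # {'/': [301, 200]}
```

Access logs often record only the path, not the scheme, so HTTP and HTTPS requests for the same path land under one key, which is consistent with the mixed codes in Peter's example.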
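Crawl frequency per page can be estimated from the timestamps of bot hits. The sketch below computes the hit count and the average gap between consecutive hits for each path; the function name and sample data are illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime


def crawl_stats(hits):
    """Return {path: (hit_count, average_hours_between_hits or None)}.

    hits: iterable of (path, datetime) pairs for bot requests.
    """
    by_path = defaultdict(list)
    for path, when in hits:
        by_path[path].append(when)

    stats = {}
    for path, times in by_path.items():
        times.sort()
        if len(times) > 1:
            span_hours = (times[-1] - times[0]).total_seconds() / 3600
            average_gap = span_hours / (len(times) - 1)
        else:
            average_gap = None  # a single hit gives no interval
        stats[path] = (len(times), average_gap)
    return stats


# Made-up hits: the home page every six hours, one deep page only once.
sample_hits = [
    ("/", datetime(2019, 3, 1, 0, 0)),
    ("/", datetime(2019, 3, 1, 6, 0)),
    ("/", datetime(2019, 3, 1, 12, 0)),
    ("/digitalization/", datetime(2019, 3, 1, 9, 30)),
]
for path, (count, gap) in crawl_stats(sample_hits).items():
    print(path, count, gap)
```

Pages with a count of 1 or a very large average gap are the candidates for the internal linking work Peter recommends.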
Craig: Perfect, Peter. Thank you for an excellent in-depth description of how it all works.
Key Things To Look For After Importing Log Files
Craig: But it's now time for questions and answers, guys. So, this one's from Andy Simpson. So he's asking, "When you first import a log file and see the data displayed, what are the first things you look and scan for, and what sets off alarm bells in the data?"
Peter Nikolov: For me, what is very important, once you import your files, is to see the status codes. So, for example, over 16% of status codes are 200, that is perfect. There are 28% that are 300, and some are 400. That is very important for the quality of the site. If your own website is good, you will see a lot of 200.
But if your website has some kind of compatibility issues or indexing issues or things like that, you will see that 300 and 400 or even 500 are at a very, very high level. I have seen websites where, combined, they are over 50%, and just imagine how desperately Google bot is trying to find something on your website to index, but your website is throwing redirect chains or redirects or file not founds or some kind of server errors or things like that.
Once you have opened it, look at the status codes, and also check what's going on by date and status code. For example, if you see a rise in errors on some day, this could be an indication that someone is trying to attack your web server, and that is definitely not good for you.
Craig: Perfect, Peter. If I'm answering it, I'm looking at status codes and crawl frequency, but I'm not a server log specialist in any way. I just want to make sure that the pages that I want to be crawled frequently are being found.
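Checking errors by date can be sketched as a count of 4xx and 5xx responses per day, flagging days that stand well above the typical day. The threshold factor of 3 is an arbitrary assumption for illustration, as are the function names and sample data.

```python
from collections import Counter
from statistics import median


def errors_by_day(hits):
    """Count 4xx and 5xx responses per day. hits: iterable of (day, status)."""
    per_day = Counter()
    for day, status in hits:
        if status >= 400:
            per_day[day] += 1
    return per_day


def spike_days(per_day, factor=3.0):
    """Return days whose error count exceeds factor times the median day."""
    if not per_day:
        return []
    typical = median(per_day.values())
    return sorted(day for day, count in per_day.items() if count > factor * typical)


# Made-up data: three quiet days and one day with a burst of errors.
sample_hits = (
    [("2019-03-01", 404)] * 2
    + [("2019-03-02", 404)] * 3
    + [("2019-03-03", 500)] * 2
    + [("2019-03-04", 404)] * 20
    + [("2019-03-04", 200)] * 50
)
daily = errors_by_day(sample_hits)
print(dict(daily))
print(spike_days(daily))  # ['2019-03-04']
```

A flagged day is only a prompt to look at the raw lines for that date; it could be an attack, a broken deploy, or a crawler hitting removed URLs.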
Should You 404 Or 301 Deleted Web Pages?
Craig: So, other questions. We've got a question here from Brian Riser. So, it's about redirects. Let's say he wanted to delete all the articles related to a particular category on his WordPress website, and there are a few thousand articles. That would create a whole lot of 301 redirects, so how should he handle that?
Peter Nikolov: Well, there are a few ways to delete articles from a website. The first way is just to delete them and let WordPress return 404. So, that is also possible, for example, if you need to definitely trash something and to tell Google that this will never exist on the website again.
But if you need to, for example, make use of the links or rankings, or just ensure that Google keeps crawling that website, you can redirect: you can make redirects between your articles, or you can even redirect one category to another category, which is very good.
And yes, this will create a lot of 301 redirects too, and that will be in your log file, and unfortunately, you should live with that.
Craig: So, Karina is asking, if I understood right, did you just confirm that it is okay to leave tons of pages returning a 404 error instead of 301 redirecting them all? Can you just confirm whether that's actually what you said?
Peter Nikolov: Yes, this actually... can be explained quickly. For example, let's say I have some kind of products or some kind of articles that won't be covered anymore. How do you deal with them? So, one way is to return 404.
For example, let's say I'm importing some kind of old laptops like, for example, the Apple Macintosh Power Mac, which hasn't been produced for more than 10 years, won't be produced anymore, won't be sold, and is unsupported. So, what if you have an e-commerce site that sold those laptops? It's good to leave them with 404, because this means to Google, okay, this doesn't exist anymore, and you can delete it.
But on the other side, what if you have some kind of product that is replaced by a new product, and you have a new URL? For example, you have an Apple MacBook released in 2015 and one released in 2018, and the 2015 one isn't sold anymore. You can redirect it to the 2018 one, or the 2019 one, for example.
So, unfortunately, there isn't a one-size-fits-all solution that you can quickly implement. In one case, 404 is good. In the other case, 301 is also good. But you must decide how to deal with them individually and implement that.
Craig: So, Peter, I've got a part to add to that, because I would probably disagree with that ... I had an e-commerce customer years ago, and they lost a lot of products; a lot of products were discontinued. And what we found was, it was serving a 404 error, the products were gone, but they got jammed in Google's index, and the bot would keep crawling and indexing those 404s, and they wouldn't shift.
So, we had to put on a 410 status code to eliminate that problem, because obviously, you don't want to waste crawl budget crawling 404 errors, and that's a waste of crawl budget in my opinion. So, would you not potentially just put a 410 on a product that was just discontinued rather than just leaving a 404?
Peter Nikolov: Yes, but as I said before, this must be checked individually for every site. For example, some sites use WordPress plugins that redirect to the homepage if there is a 404. That's why I say everyone must decide individually for every site how to deal with it: do redirects or do 404s.
Craig: Cool, no worries. Thanks for clarifying that. The next question is from Paolo Taca, and he's saying this tool's been basically sitting under his nose the whole time. Let's say he has a website with 1,000 pages, should he focus his efforts on pages that are crawled more often or not? What's gonna provide the best results?
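The two positions in this exchange can be expressed as one small decision rule: redirect with 301 when a successor page exists, otherwise answer 404 (Peter's default) or 410 (Craig's preference). The function, the `replacements` mapping, and the paths are hypothetical examples, not a recommendation for any particular site.

```python
def response_for_removed(path, replacements, use_410=False):
    """Pick a response for a removed URL.

    replacements: dict mapping a removed path to its successor (hypothetical data).
    Returns (status, location); location is None when there is no redirect.
    """
    target = replacements.get(path)
    if target:
        return 301, target          # a newer page exists: send bots and users there
    if use_410:
        return 410, None            # Craig's option: state that the page is gone
    return 404, None                # Peter's default for discontinued items


# Hypothetical paths for illustration only.
replacements = {"/macbook-2015/": "/macbook-2018/"}
print(response_for_removed("/macbook-2015/", replacements))
print(response_for_removed("/power-mac/", replacements))
print(response_for_removed("/power-mac/", replacements, use_410=True))
```

Keeping the mapping explicit makes the per-page decision Peter asks for visible and reviewable, instead of redirecting every 404 to the homepage.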
Optimizing Pages That Aren’t Crawled Often
Peter Nikolov: I think that is also a tricky question. It's good to check why some web pages aren't crawled often, because a page that isn't crawled often is seen as not very important, and the information Google has for it can be a little bit delayed. You must check in analytics how your web pages are getting visited, because they can be visited at different times, in different months, or even in different years.
Craig: Perfect. And what I would add to that as well is, if there's a page there that does generate you money and it's intended to rank well and it's not getting crawled frequently, that's something you can get a quick result from if you, you know, maybe do the right internal linking structure, got a bit of that content.
So, as Peter says, it's quite a hard question to answer. Don't just focus your efforts solely based on the fact that a page actually gets crawled more often, because that page might not be the one that gets the most searches for you or the one that can rank the quickest, and hopefully that answers your question.
Craig: What are the most common things you come across when you're analyzing log files in terms of problems? What do you see most often?
Peter Nikolov: What is more important is to ensure Google receives the right content. This means almost 0 redirects, or a minimum of redirects if possible, and a minimum of client errors like 404 or things like that. Almost 0 server-side errors like 500, because they can effectively nuke your website.
So, it's important that your website returns just the important files. That is very important to Google, because if you pass Google through some kind of redirect chain, like Page A redirects to Page B, then to Page C, then to Page D, then to Page E, then to another website, Google will start to lose interest in your website and start to visit your website less and less often, and this also shows in your rankings and things like that. So, that's important, just to return the most important things first.
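A redirect chain like the one Peter describes can be measured by walking a map of known redirects. The sketch assumes the redirects have already been collected into a dict (for example from log lines or a crawl); the function name, the hop limit, and the sample paths are illustrative.

```python
def redirect_chain(start, redirects, max_hops=10):
    """Follow start through a {source: target} redirect map.

    Returns the list of URLs visited, stopping at the first URL that does not
    redirect, when a loop is detected, or after max_hops redirects.
    """
    chain = [start]
    seen = {start}
    while chain[-1] in redirects and len(chain) <= max_hops:
        target = redirects[chain[-1]]
        chain.append(target)
        if target in seen:
            break  # redirect loop: stop instead of walking forever
        seen.add(target)
    return chain


# Hypothetical redirect map: Page A -> B -> C -> D -> E.
redirects = {"/a": "/b", "/b": "/c", "/c": "/d", "/d": "/e"}
chain = redirect_chain("/a", redirects)
print(" -> ".join(chain), "|", len(chain) - 1, "hops")
```

Any chain longer than one hop is a candidate for pointing the first URL straight at the final destination.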
How Often Should You Analyze Server Logs?
Craig: Cool. So the next question is, how often would you recommend analyzing your server logs?
Peter Nikolov: Well, honestly, on my personal website I do this almost daily. Every day, someone's trying to log in to my WordPress, but actually, they can't. So, it's good to do a very quick check of your website almost every day. But if you need to make an in-depth analysis, it's good to do this almost every week or every two weeks in the SEMrush analyzer, just to see what's going on on your website in terms of technical details.
Craig: Excellent, excellent, Peter. Sadly, we are out of time though. That is the end of the webinar, so thank you, everyone, for joining us, and thank you, Peter, for being on and giving away all those good tips and information. We will be back in a few weeks' time with another webinar, but thanks for all the questions and hopefully we answered them all. So, thanks again, guys.
Peter Nikolov: Have a great day, bye-bye.
Craig: You too, Peter. Thank you.