What’s Really Important for Technical SEO?

OK, last session of the day, SMX London Day 1: what’s really important for technical SEO?

Moderator: Rob Kerry, Head of Search, Ayima

Speakers:

Richard Baxter, Director, SEOgadget
Martijn Beijk, SEO & Web Analytics Consultant, Onetomarket
Jonathan Hochman, Consultant, JE Hochman & Associates LLC
John Mueller, Sr. Webmaster Trends Analyst, Google

Twitter names: @robkerry @richardbaxter @martijnbeijk @jehochman @JohnMu

First up is Richard Baxter.

He wants to discuss technical SEO strategy: not just how to avoid doing things wrong, but how to build on opportunities.

The biggest issues he’s seen on large sites:

  1. Thin content on category pages. Find a way to get specific content on these pages, even if it means chaining an intern to a desk for 6 months while they write 200 words for each page.
  2. Poor internal link architecture. Increasing links on the homepage and improving internal nav elements with CSS-styled drop-downs and popular products made a huge difference.
  3. Excessive duplication of content via paginated navigation / faceted navigation. He’s seen so much misuse of rel canonical on these pages. People rel canonical deep navigation pages back to the root. That’s really wrong. Avoid blocking pages with robots.txt.
  4. Indexed staging server. Use IP detection: redirect anything that isn’t from a pre-approved range.
  5. Canonicalization. Not just the WWW vs non-WWW version. Check trailing slash.
  6. Soft 404 headers / No 404 response
  7. IF-MODIFIED-SINCE produces a 500 error. Make sure you support IF-MODIFIED-SINCE (see the sketch after this list).
  8. The sitemap.xml file responded with a 404 server header even though the file could be downloaded
  9. Bingbot / MSNbot crawl blocked. Oops. Client wondered why they weren’t ranking in Bing. They were blocking the crawler because years earlier they banned MSNbot.
  10. 40,000 internal 301 redirects
  11. Very easy to inject links with HTML. Check your search boxes and other form functionality to make sure that the HTML is stripped.
  12. UGC content on internal pages in JavaScript. Got a 75% increase in long-tail search traffic when they fixed this. If you have tons of data / content, make sure it’s embeddable.
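
[On point 7, a quick illustration: conditional requests should get a 304 Not Modified (or at worst a plain 200), never a 500. A minimal sketch, assuming a Python/Flask app and a made-up last-modified date:]

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime, format_datetime

from flask import Flask, make_response, request

app = Flask(__name__)

# Hypothetical: when this page's content last changed.
PAGE_LAST_MODIFIED = datetime(2011, 5, 1, tzinfo=timezone.utc)

@app.route("/category/widgets")
def category_page():
    header = request.headers.get("If-Modified-Since")
    if header:
        try:
            since = parsedate_to_datetime(header)
        except (TypeError, ValueError):
            since = None  # a malformed header must never turn into a 500
        if since and since.tzinfo and since >= PAGE_LAST_MODIFIED:
            return make_response("", 304)  # client's copy is still current
    response = make_response("<html>...category content...</html>")
    response.headers["Last-Modified"] = format_datetime(PAGE_LAST_MODIFIED, usegmt=True)
    return response
```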

[OK, great start. Now it’s … they’re having technical problems with the computer. Please stand by. And we’re back. Thanks for waiting.]

Martijn Beijk is up next to discuss why speed matters.

Google announced its “Let’s make the web faster” initiative.

You can now monitor your speed issues with Google Analytics.

If you’re using WordPress, Yoast’s plug-in can help you.

Expensive queries should be backed by indexes. Log slow queries. Watch your JOIN clauses. Beware of “Lazy Programmer Syndrome.” Use MySQL’s EXPLAIN on your queries.
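
[A quick illustration of the slow-query log and EXPLAIN; the tables and the one-second threshold are just examples:]

```sql
-- Turn on the slow query log; a one-second threshold is only an example.
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;

-- EXPLAIN shows whether the JOIN and WHERE clause can use indexes.
EXPLAIN
SELECT p.name, p.price
FROM products AS p
JOIN categories AS c ON c.id = p.category_id
WHERE c.slug = 'widgets';
```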

MySQL: MyISAM or InnoDB

Use MySQL memory tables!

Best practices:

  • Think of your application servers (e.g. Tomcat, JBoss)
  • Prefer memory caching over disk caching (usually)
  • Use load balancing or a reverse proxy (ProxyPass)
  • Set HTTP headers for different types of content
  • Expires header, ETag, strip cookies (example after this list)
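
[A hedged example of those header tips for Apache, assuming mod_expires and mod_headers are available; the same directives work in the virtual-host config or in .htaccess:]

```apache
<IfModule mod_expires.c>
    ExpiresActive On
    ExpiresByType image/png              "access plus 1 month"
    ExpiresByType text/css               "access plus 1 week"
    ExpiresByType application/javascript "access plus 1 week"
</IfModule>

<IfModule mod_headers.c>
    # Static assets don't need ETags or cookies; dropping them keeps responses cacheable.
    <FilesMatch "\.(png|jpe?g|gif|css|js|ico)$">
        Header unset ETag
        Header unset Set-Cookie
    </FilesMatch>
</IfModule>
FileETag None
```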

Virtual host vs .htaccess

You shouldn’t be on a shared hosting server. Don’t use .htaccess!

Benchmarking:

Use software like ApacheBench or Siege.

See how different types of content are performing.

Don’t do this on your production servers. [Yeah, that could be bad]
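
[For example, a typical ApacheBench run; the URL and numbers are purely illustrative, and it’s pointed at a staging box, not production:]

```bash
# 1,000 requests, 50 concurrent, against a staging server (never production)
ab -n 1000 -c 50 http://staging.example.com/category/widgets
```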

Use Varnish.

CDNs are a nice buzzword. They’re not always necessary.

Takeaways:

  • Don’t use shared hosting
  • Quality of programming, speed of code
  • Speed of DB queries
  • Use caching in memory when you can
  • Leverage existing tools for your platform
  • Running WordPress? Check out http://bit.ly/speedupwordpress

Batting third is Jonathan Hochman from Hochman Consulting

Why details matter

A client wasn’t showing up in search because their robots.txt blocked everybody from everything. It was put on the testing site and got moved over to production. Oops. It’s a very easy mistake to make.

Put a password on your staging site. Don’t block it in a way that can get moved over.

Duplicating yourself with multiple domain names.

[He’s providing the code for redirects.]
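
[His exact code wasn’t captured; a common .htaccess pattern for folding extra domains into one canonical host (mod_rewrite assumed, example.com is a placeholder) looks like this:]

```apache
RewriteEngine On
# Fold every other hostname onto the one canonical domain with a 301.
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```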

404 error codes are important. Make sure you’re giving a 404 or a 301 if the page is no longer there. There is almost always a way forward.

[More code on setting up redirects]
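
[Again, the slide code wasn’t captured. A hedged example of sending a retired URL to its nearest replacement instead of letting it 404:]

```apache
# mod_alias: permanent redirect from an old path to its replacement (paths are hypothetical)
Redirect 301 /old-product /products/new-product
```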

WordPress SEO from Yoast is great.

Most CMS systems have issues that need to be remedied. One of them actually does cloaking. That can give you a problem.

Google is your friend. If you’re looking for something, just Google for it.

Unique titles & meta descriptions

Failure to do this is a sign of low quality. They have scripts that set these up automatically but allow for overriding them.

[Code for generating these automatically. Uses first 80 characters of the page for description]
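
[The actual code wasn’t captured either; a minimal Python sketch of the same idea:]

```python
import re

def default_description(page_text: str, limit: int = 80) -> str:
    """Fallback meta description: the first `limit` characters of the visible text."""
    text = re.sub(r"\s+", " ", page_text).strip()  # collapse runs of whitespace
    return text[:limit].rstrip()

def default_title(headline: str, site_name: str = "Example Site") -> str:
    """Fallback title: page headline plus the site name (the pattern is an assumption)."""
    return f"{headline} | {site_name}"

print(default_title("Blue Widgets"))
print(default_description("Blue widgets come in small, medium and large sizes. " * 3))
```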

Good code is often five times shorter than average code. If you have millions of pages, code optimization is a high priority because it lets Google index more of your content.

Watch out for infinite URL spaces, like calendars that go on forever.
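
[One common way to fence an endless calendar off is a robots.txt rule; the /calendar/ path is hypothetical:]

```
User-agent: *
Disallow: /calendar/
```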

Spelling & Typos. Often a page didn’t rank because the keyword was spelled wrong.

Broken links. Fix them.

Server speed

Enabling gzip compression via .htaccess
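
[A minimal sketch, assuming Apache with mod_deflate:]

```apache
<IfModule mod_deflate.c>
    AddOutputFilterByType DEFLATE text/html text/css text/xml application/javascript
</IfModule>
```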

Hacking and malware

Scanning only detects 30% of threats.

File Integrity Monitoring detects close to 100%. Compare a clean copy off the server with the code on the server.
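
[At its simplest, that comparison can be a recursive diff; the paths are hypothetical:]

```bash
# Report any file that differs between the known-clean copy and the live site
diff -rq /backups/site-clean/ /var/www/site/
```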

Patch WordPress. Failure to patch WordPress is the #1 source of exploits.

HTML Validation Errors. Fixing them can help your pages load faster and render cleanly. Search engines sometimes fail to read messed-up pages.

SEO intangibles: Happy visitors generate referrals, tweets, and links. They’re more likely to trust you and convert.

Printing: Put in a print media stylesheet

Favicons: Install a branded favicon. People multitask across lots of tabs, and without a recognizable icon you’re gone.
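
[Hedged example markup for the last two points; the paths are placeholders:]

```html
<!-- A stylesheet applied only when the page is printed, plus a branded favicon -->
<link rel="stylesheet" media="print" href="/css/print.css">
<link rel="icon" href="/favicon.ico">
```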

Forms: I like wufoo.com. If your form has friction you’ll lose conversion. Put effort into your forms.

The big picture: User behavior is a ranking signal. Get it right.

And finally, John Mueller (aka JohnMu), from Google. Webmaster Trends Analyst. Connects webmasters w/ Google engineers.

Provided http://bit.ly/smxgooglelondon for questions.

The Web Search pipeline:

URLs -> Scheduler -> Crawler <-> Internet

Parser -> Indexing -> Index

A few years ago we mentioned that we found a trillion URLs. Thanks to your hard work, there are a lot more now.

Scheduler: Picks URLs to crawl, manages load per server, automatically adapts

Issues: Slow server, too many URLs

Crawler: Checks robots.txt, checks URLs, gives feedback to scheduler

Possible issues: Slow responses, error messages

Diagnosing crawling: The biggest blackbox item is the scheduler. Best way to monitor the pipeline is Google Webmaster Tools. Fetch as Googlebot function. Crawl errors, crawl rate stats & crawl rate setting.

Also check your server logs.

Parser: Pulls all the content apart and extracts the logical parts. Title, headings, links.

Possible issues: Bad content, “Soft 404” pages

Indexing: Analyzes the content and sees how much is indexable. Is it duplicate content? Spammy? Hacked?
