
#SEOisAEO: How to Help Google/Amazon Make Sense of a Chaotic, Unstructured Web




Jason: Thank you all for coming along. I think this is going to be an amazing episode. Aaron, I've seen a couple of things about internal enterprise knowledge graphs, and I didn't know much about it until two weeks ago.

Aaron: Yeah, that's becoming very much a thing. Small outfits you may have heard of with internal knowledge graphs include Amazon, Netflix, Airbnb. Everyone's doing it these days.

Martha: That's because it's cool.

Aaron: It's cool. And challenging. It's an emerging field, so always some growing pains.

Jason: Great. Bill Slawski sent me a link to Xin Luna Dong's presentation, and some papers she has written. She's my new favorite person. She's been figuring out how to extract data, clean it and structure it automatically at both Amazon and Google... so she's obviously a very pertinent person to be listening to. Here’s her diagram showing entities and attributes we've managed to understand - head knowledge that is manually curated or imported from large databases. The challenge she is trying to overcome is how to collect all the unknown entities and unknown attributes, automatically in a scalable way.


She identifies four methods to extract data from the web that can be cleaned, structured and turned into long tail knowledge - free text, DOM trees, HTML tables, and annotations. Now that I've said it like that, it seems terribly simple. It took me a couple of days to get my head around it :)


So, let’s look at each of these in turn.

Luna Dong’s annotated web is basically schema and structured data. How can we, as marketers, best leverage schema?


Martha: Excellent. So, when talking about Schema markup, we often talk about “rich snippet hunting” tactics. But I'd like to talk a little bit about strategy because the question here is around marketers, not necessarily technologists. As marketers, how do we best leverage Schema? There's great, great evidence, both from our customers at SchemaApp, as well as Google's published case studies, that the benefits go well beyond the click - it goes well into the quality of traffic, the engagement, and right through the customer experience. And that is great for business KPIs. If you're in a babysitting business, it might mean that you're getting more babysitters booked, if you're in the game business, it might mean that the right people are landing on the right content and engaging with that content to then maybe engage in the community or buy a different game. Well implemented Schema can help the whole customer experience.

Then we also need to think about schema markup as how we control how our information is being understood. For marketing that’s a control point for things like what images, texts, text type, color people are seeing for your brand. Think of schema markup as a way to control the understanding of the language and the content that you have.

Because so many machines (Amazon, Google, Siri, Alexa, Facebook...) deliver our content to people, they are basically controlling the customer experience. As marketers, we want to make sure we control this as much as possible. Hence control points.

Lastly, what do you add schema markup to? When marketers are just going after the featured snippets, they'll think, "Well, I just want to get the review snippet so I'll just markup my reviews." Or, "I just want to get that product rich snippet, so I'm just going to markup the availability, the price, and perhaps the review and the description of the product."

But when Google, Yahoo!, Bing and Yandex sat down and created schema.org, they didn't create 850 plus classes for shits and giggles, right? There's a reason that they created that extensive vocabulary, and part of it is because it helps their understanding. But also, it's because they're using it. Gary Illyes said in a couple of conversations and presentations, “it helps us understand, and we're using it”. And that makes sense, right? If we're making their jobs easier, why wouldn't they use it?

When I talk to marketers, I ask, "What's unique and strategic about your business? What do you need people to understand?" Because if these machines like Amazon and Google are pushing your message out, then you want to make sure that those things are understood. So that may not just be your products and your organization. It may be the services you offer, the people in your organization, some of the key content that you've written.

So a schema markup strategy is all about your business value first. And making sure that business information is documented in one place, ideally, on the website and represented through Schema.

Jason: Yeah, that's absolutely brilliant. What I heard there, is that you're talking about communicating so that these machines understand how you want to be presented. So they present you as you would wish. And then you say that our direct consumers of this Schema information are actually machines and that the people are consumers of this information, only in a secondary manner. Brilliant stuff. I thought that was great.

HTML tables and lists seem terribly simple but for these machines, they are devilishly tricky. Why?


Arnout: Well, it's because people are not creating them in the best way: they're missing the caption tag or the table headings or the table rows, and just put in data. And without those, you basically don't communicate clearly to a search engine what the context of that table is. And if they don't understand that, then how can they make sense of it?
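As a rough illustration of Arnout's point, here is a minimal Python sketch that checks whether a table carries the structural tags he lists (caption, header cells, rows). The function name `audit_table`, the sample markup and the placeholder cell values are assumptions made for this example; it says nothing about how any search engine actually evaluates tables.

```python
from html.parser import HTMLParser


class TableAudit(HTMLParser):
    """Counts the structural tags that give a data table its context."""

    def __init__(self):
        super().__init__()
        self.counts = {"caption": 0, "th": 0, "tr": 0}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in self.counts:
            self.counts[tag] += 1


def audit_table(html):
    """Return a list of structural problems found in the table markup."""
    parser = TableAudit()
    parser.feed(html)
    problems = []
    if parser.counts["caption"] == 0:
        problems.append("missing <caption>")
    if parser.counts["th"] == 0:
        problems.append("missing <th> header cells")
    if parser.counts["tr"] == 0:
        problems.append("missing <tr> rows")
    return problems


# Hypothetical markup; the cell values are placeholders, not real rates.
WELL_FORMED = """
<table>
  <caption>Hourly babysitting rates by age</caption>
  <tr><th>Age</th><th>Hourly rate</th></tr>
  <tr><td>16</td><td>rate A</td></tr>
  <tr><td>17</td><td>rate B</td></tr>
</table>
"""
DATA_ONLY = "<table><tr><td>16</td><td>rate A</td></tr></table>"

print(audit_table(WELL_FORMED))
print(audit_table(DATA_ONLY))
```

The first table passes with no problems reported; the second, which "just puts in data", is flagged for the missing caption and header cells.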

Not a lot of marketers know this, but there is a separate search engine for tables in Google that you can actually use. And by putting in some of the keywords/search intent you want to rank for or get a featured snippet, you can easily find loads of things on how other websites have done it properly.



With this table search engine, you can use search modifiers to restrict results to a URL or domain name, and it will show you all the tables that Google has found and indexed in table search. And you can do it for competitors as well. And keywords. And for loads of other things.

Jason: Cool! I read that 99% of all tables on the web are used for design and not for structuring data, which is a big, big percentage. And that the problem is actually finding tables that actually have any data in them!

Martha: We've actually also seen some rich snippets appear directly informed from tables. They're not documented, but we're seeing that information pulled in not only in the search but in the actual rich result.

Arnout: Exactly! That's what I meant. I've seen it with babysitting rates, where you just say, your 16-year-old earns this and a 17-year-old should earn this. So, yeah, Google is picking up what we are communicating… but I was just basically saying that it's a great way to do research and see how well Google is picking up your tables.

Aaron: We've seen them pick them up in terms of featured snippets, where they're extracting tabular data and presenting that in the search results. So not a rich snippet in the sense that they're derived from structured data, but the regular parsing of web pages.

But what's ironic is that those featured snippets from tables rely on what is really 1994 HTML. You have to use an older technology for Google to understand your tables properly. And I've even seen SEOs do things like add a plain table, with table tags and TDs and THs, inside a noscript element because they know that Google can't understand their CSS tables, so it's kinda funny.

And then, of course, I'd be remiss not to bring in Dataset markup and the dataset tables that Google has recently started to index. Where the semantics of comma-delimited text files have until this point been pretty much opaque, through structured data markup people can now inform Google of what a formal table, a CSV table, is all about. And I wouldn't be surprised if at some point in the future they started to make that level of semantics available to regular tables.

Jason: Brilliant stuff. Super. Next up, DOM tree extraction. Once again, it's about extracting data from HTML. Sites like Yelp and Amazon have reasonably good HTML. Google and Amazon need to extract data from that. I know that Diffbot are particularly keen on that.

How can we help make DOM-digestion easier for these machines?


Aaron: First of all, you have to use structured data annotations if at all possible. That's obviously going to let the search engines know what a piece of content is about. But if you're not doing that, you don't need to do anything special except be consistent and be big.

You mentioned Amazon. They have a history. Quite famously, Amazon has never used structured data markup on their pages. But they get rich snippets and enhanced search results, both in Google and Bing. Normally, you wouldn't be awarded the rich snippets if you didn't have structured data. But Amazon's an important website, and it has a fairly well-known structure, so I don't think it's too difficult for Google to come up with parsing routines by which it can easily identify a product page and extract the price, the reviews, the images… So, the better you can standardize your layout and format, the more Google's going to be able to make sense of your DOM.

I'm not sure if any of you have heard of Diffbot. It's a new knowledge graph that's come out where it claims to have 10 times, I think, or 100 times, the number of facts as the Google knowledge graph. And they're really not using annotations. They've just gone through and parsed the entire web, and then tried to extract the facts and relationships from that data. So, a little DOM structure can go a long way in facilitating an understanding of what a web page is about.
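To make the consistency argument concrete, here is a small Python sketch of a "parsing routine" of the kind Aaron describes. It is an illustration only: the class names (`product-title`, `product-price`), the sample pages and the `extract_fields` helper are all hypothetical, and real extraction systems are far more sophisticated.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects the first text found inside elements with known class names."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.current = None   # class name whose text we are waiting for
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.wanted:
                self.current = name

    def handle_data(self, data):
        if self.current and data.strip():
            self.fields[self.current] = data.strip()
            self.current = None


def extract_fields(html, wanted=("product-title", "product-price")):
    """Run one fixed routine over a page and return whatever it recognizes."""
    parser = FieldExtractor(wanted)
    parser.feed(html)
    return parser.fields


# Two pages built from the same (hypothetical) template, one that is not.
PAGE_A = '<div><h1 class="product-title">Widget A</h1><span class="product-price">price A</span></div>'
PAGE_B = '<div><h1 class="product-title">Widget B</h1><span class="product-price">price B</span></div>'
PAGE_C = '<div><h2 class="name">Widget C</h2><b>price C</b></div>'

for page in (PAGE_A, PAGE_B, PAGE_C):
    print(extract_fields(page))
```

The same routine works on every page that follows the template and returns nothing for the page that does not, which is the practical cost of an inconsistent layout.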

Jason: Yes indeed, Diffbot is where I started hearing about the DOM-digestion idea. I agree, consistency is the key. It seems to me terribly obvious that if it's consistent, then the machines will be able to make sense of it, and if it's not consistent, you’re creating unnecessary problems. Plus, you're saying they managed to analyze and understand Amazon. With Machine Learning, there's no reason they shouldn't do that with other big-ish sites.

Aaron: Exactly. Possibly first in line are sites based on templates - standard enterprise products, like Magento for e-commerce, or WordPress for blogging. The search engines have come to a pretty good understanding of how they are structured and how they can extract data from them. But whether off-the-peg or bespoke, still make sure you’re consistent.

Jason: Brilliant. Super stuff. Finally, free text - thanks to concepts such as natural language processing, relatedness, co-occurrence, topical siloing, semantic triples, Google, Amazon, Graphiq, and Diffbot are getting better and better at understanding this free text.

What's our best approach to content creation to help machines extract data from free text?


Martha: Aaron just hit the nail on the head here - consistency and quality to facilitate understanding. And I'm seeing nods from Arnout :)

The machines are trying to service the client. Amazon's taking a very service-based approach that mixes Skills with answering questions. So when you're creating content, think about what questions you are trying to answer for your customers and what type of logic is required to understand your answer... And then be consistent and logical in how you set up your texts. There was a customer site that I was recently looking at, and their schema markup was very basic, but they were using clear language and correctly implementing H1, H2 and H3 tags to articulate what was a question, what was an answer, and what was the topic of the page. So I think it really starts at that foundational SEO standpoint where you're thinking about what makes the content rich and helpful and of service to your clients, and then use some of that structure in the page to help everyone understand it. Ideally, add the schema to take that explicitness even further.
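One way to sanity-check the heading structure Martha describes is to pull out the H1/H2/H3 outline and read it on its own. The Python sketch below does that; the sample page, its wording and the `heading_outline` helper are assumptions for illustration, not a description of the customer site she mentions.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Builds a list of (level, text) pairs from H1, H2 and H3 tags."""

    LEVELS = {"h1": 1, "h2": 2, "h3": 3}

    def __init__(self):
        super().__init__()
        self.level = None    # set while we are inside a heading
        self.outline = []

    def handle_starttag(self, tag, attrs):
        if tag in self.LEVELS:
            self.level = self.LEVELS[tag]
            self.outline.append((self.level, ""))

    def handle_data(self, data):
        if self.level is not None:
            level, text = self.outline[-1]
            self.outline[-1] = (level, text + data)

    def handle_endtag(self, tag):
        if tag in self.LEVELS and self.level is not None:
            level, text = self.outline[-1]
            self.outline[-1] = (level, text.strip())
            self.level = None


def heading_outline(html):
    parser = HeadingOutline()
    parser.feed(html)
    return parser.outline


# Hypothetical page: the H1 names the topic, each H2 is a question.
PAGE = """
<h1>Babysitting rates</h1>
<h2>How much should a 16-year-old earn?</h2>
<p>Answer text goes here.</p>
<h2>How much should a 17-year-old earn?</h2>
<p>Answer text goes here.</p>
"""

for level, text in heading_outline(PAGE):
    print("  " * (level - 1) + text)
```

If the printed outline reads as topic, question, question without the body text, the page structure is already doing much of the explaining.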

Jason: Super. So, we come up with consistency again and that's not what humans are necessarily the best at doing!

Right. So thanks to Luna Dong we’ve looked at the four different ways that these machines are extracting data from an unstructured web. Now we're going to move on to more general questions.

A thought: semantics equals language. It's not at all, is it? That's a completely silly way of looking at it.

Aaron: You're presenting epistemological questions? It’s 6:30 in the morning! For me, that makes it a little difficult to get into Husserl and Heidegger :) No, semantics is not just about language. But to a certain degree, you always have to map meaning to something concrete that you can describe, and that tends to be a language for humans.

But to poke at this a bit, things can have meaning that aren't themselves directly described by language. So if I have a picture of a banana, it's about a banana, and it's a type of fruit. No language in an image as such, but ultimately, it is all about language. But you can have semantic associations that aren't language-based. You can have meaning that's derived from some system that isn't ultimately dependent, directly, upon language.

And what we increasingly see, is that Google, et al. are trying to figure out what something is about based on absolutely any clue that they have… and AI and ML are starting to feature very large in this. So you may see them parsing image libraries to figure out that there are red cars in there. But yep, at the end of the day, you're going to need language to describe a car and the color red! Once you've got foundational semantics in the tank, then you can go ahead and start to extract clues from other places. So, language always plays a role, and the more in which things are labeled, the clearer it's going to be to a data consumer. But certainly, the inference of meaning is not limited to language at a base level.

Jason: Ooh! “The inference of meaning is not limited to language”. Love it. I think that's my favorite moment.

Aaron: Hooray.

Jason: So, on to Arnout. You said yesterday,

"Google and Amazon have different approaches to voice search and answers." Can you explain?

Arnout: The way I see it, it has always been Google's mission statement to answer every question. This is why you get a million results for every search. Whereas Amazon works with actions. So they ask you to define actions, and from that, basically, when you talk to Alexa, she can fix stuff for you. She can order you a cab or whatever you want. And I think that is a fundamental difference in the way those two are approaching this whole answer thing.

And I think it's a really interesting way of looking at it, and it really opened my eyes when I discovered this in a talk at SMX, and I started thinking about it. It has always been Amazon's drive to give that 100% customer satisfaction. So, you must always be able to deliver what you promise. Whereas Google will just say, I'll do my utmost, and probably give way more than you expect. I think that's where a big difference between those two lies.

From a practical perspective, if you are a service-based company, you should develop an Alexa skill so you can actually tap into that opportunity.

As for tapping into Google, it is more about helping Google understand your website/service and everything about your company, so they can serve you whenever the question requires it.

Martha: I love that, I do. I think it's a great way of thinking about the different consumers. I’d add that Google is providing service but it's also owning the customer experience. But as humans, we're changing our behavior and we're going to force Google to also change because we're asking Alexa certain questions Google can’t manage as effectively. Also, my kids had a voice-off in our house where they were testing which system could do better animal sounds. And Google actually used to be able to do dinosaurs and unicorns, but it can't anymore. And so then the question is, is it unlearning? Lastly, as humans, we are changing the questions we're asking and our behaviors… which is then also going to push technology change.

Aaron: Tomato, tomaytoe, to a certain degree, and I think that Alexa started out a bit more transactionally. Alexa, how can I buy something on Amazon? Where Google, being the search engine, was more about answering questions. But then if I get used to Google answering questions for me, eventually I will buy things from Google. They're all doing the same things at the end of the day - understanding what the query's about and trying to come up with a reasonable response.

That said, I think that the flavors of that are interesting and do definitely have nuances.  But really, I'd throw the question back on your own organization in terms of what you want to accomplish. Imagine you have no website, right? Which is really what voice search is. What is the most important information that you want to convey to your customer or consumer? And that's what you need to focus on. And it's all about the mechanisms by which the different smart speakers get there. And they are pretty much the same at the end of the day from that lens.

Jason: Okay, cool. Great stuff, thank you.

Schema is a powerful way to structure data for Google and Amazon. Long and short term. What strategy?


Martha: Actually, Aaron couldn't have more perfectly set it up with “imagine there's no website”. People get really angry at me when I talk about this, but it’s an important question to ask. If we're seeing the customer experience, whether it is searching for jobs, events, answers, or whatever, happening directly in the search engine or through voice search where we don't even have an interface, then all of a sudden the traffic's not coming to our website. So now it's a question of: if people are getting service and answers through these other means, we need these machines to understand how we want our data to be presented. We need a control point.

So, to get to a true understanding of your content, and potentially to future-proof yourself, one tactic is to make sure that you're not just defining what the thing is, but also how it relates to other things in your business and other things on the web.

And Aaron alluded to a little bit of this around inferencing. So, the robots are trying to infer and understand, and we need to help them. A local business is a great example, where you can define if the local business is a headquarters and has sub-organizations and what's the relationship between the entities within the organization. You can use something like the property areaServed and use Wikipedia definitions or even links to sort of the city of Guelph or Toronto or Vancouver to explicitly define areaServed. Same thing with explicitly indicating the relationship between the organization and Twitter, LinkedIn, and Facebook.
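As a hedged sketch of what Martha describes, the Python snippet below builds JSON-LD for a local business using the schema.org properties `areaServed`, `sameAs` and `subOrganization`. The business name, branch names and example.com URLs are hypothetical placeholders; the Wikipedia links simply follow her suggestion of pointing to an external definition of each city.

```python
import json


def local_business_jsonld(name, url, cities, profiles, branches):
    """Build a JSON-LD dict that spells out relationships explicitly."""
    return {
        "@context": "https://schema.org",
        "@type": "LocalBusiness",
        "name": name,
        "url": url,
        # Each served city points at an external definition of that place.
        "areaServed": [
            {"@type": "City", "name": city, "sameAs": wiki_url}
            for city, wiki_url in cities.items()
        ],
        # Profiles that represent the same organization elsewhere on the web.
        "sameAs": list(profiles),
        # Branches that sit under this headquarters.
        "subOrganization": [
            {"@type": "LocalBusiness", "name": branch} for branch in branches
        ],
    }


markup = local_business_jsonld(
    name="Example Babysitting Co.",          # hypothetical business
    url="https://example.com/",
    cities={
        "Guelph": "https://en.wikipedia.org/wiki/Guelph",
        "Toronto": "https://en.wikipedia.org/wiki/Toronto",
    },
    profiles=[
        "https://example.com/placeholder-twitter-profile",
        "https://example.com/placeholder-linkedin-profile",
    ],
    branches=["Example Babysitting Co. (Guelph branch)"],
)

print(json.dumps(markup, indent=2))
```

The printed JSON would normally be embedded in the page inside a script element of type application/ld+json.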

But you can also then use definitions and some more advanced things if you're using additionalType or multiType. Let's say I'm a gas station but I specifically sell diesel. You could actually use additionalType there. In short, semantics is all about triples, which define how things are connected. And knowledge graphs are representations of how that data is all connected.

In order to prepare for this future state within the tactics of your DOM schema, ask yourself how things are connected. And there's a free tool on our website called Schema Paths, P-A-T-H-S, that allows you to put two different entities in and we'll tell you how you can relate them. And this was created by my co-founder Mark, who got tired of reading schema.org to try to figure out how things are connected. But it'll be really, really helpful for you to start building that knowledge graph for your company in schema.
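To show what "triples" means in practice, here is a simplified Python sketch that flattens a JSON-LD style dictionary into (subject, property, value) tuples, using the diesel gas station example. The `to_triples` helper, the example.com identifiers and the business names are assumptions for illustration; the Wikipedia URL stands in for whatever external definition you would choose for `additionalType`. This is not a full JSON-LD processor.

```python
from itertools import count


def to_triples(node, subject=None, _ids=None):
    """Flatten a nested JSON-LD style dict into (subject, property, value) triples."""
    _ids = _ids if _ids is not None else count(1)
    # Nodes without an @id get a generated blank-node label such as _:b1.
    subject = subject or node.get("@id") or f"_:b{next(_ids)}"
    triples = []
    for key, value in node.items():
        if key in ("@context", "@id"):
            continue
        for item in value if isinstance(value, list) else [value]:
            if isinstance(item, dict):
                child = item.get("@id") or f"_:b{next(_ids)}"
                triples.append((subject, key, child))
                triples.extend(to_triples(item, child, _ids))
            else:
                triples.append((subject, key, item))
    return triples


# Hypothetical gas station that specifically sells diesel.
station = {
    "@context": "https://schema.org",
    "@id": "https://example.com/#station",
    "@type": "GasStation",
    "additionalType": "https://en.wikipedia.org/wiki/Diesel_fuel",
    "name": "Example Diesel Stop",
    "parentOrganization": {
        "@id": "https://example.com/#org",
        "@type": "Organization",
        "name": "Example Fuels",
    },
}

for triple in to_triples(station):
    print(triple)
```

Each printed line is one connection in the graph: the station is a GasStation, it is additionally typed as diesel-related, and it belongs to a parent organization that has its own name and type.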

Jason: Brilliant. The big question, which Arnout  particularly wanted to answer, is:

What's next on the unstructured web, where are we heading to with Amazon and Google?


Arnout: It is basically twofold. I think we are creating a kind of training set for their algorithms to understand the unstructured content. We’re explicitly identifying this is the same as this, and this fits into this. With that, crawlers and indexers can really start grasping what is on a certain webpage. At some point, we might not need to mark things up as much, or maybe even not at all.

It's really important to get all of the things that talk about you to be connected and up-to-date with the info you can find on the web. I think this is where loads of companies are leaving things on the table - like with Google local search where, to improve the relevancy of your listing, you need to update your NAP (name, address, phone number) in multiple places. The same will be true of information about your company and its offers.

In the end, at some point, users will not end up at your website anymore - they will use your service, but through other means - in the SERP, through voice commands, and multiple other ways.

Jason: Yeah, brilliant stuff.

Martha: We have to help them. So when you think about investing time and money in structured data to help, you're allowing them to do other things that are more important or more advanced faster. Perhaps allowing them to move away from a stepping stool, a crutch, to be able to move faster on other things. So from a business perspective, I always think “if it's helping, why not?”

But the other thing to keep an eye on is how then does the marketer or the company have a control point around their data? I think we're seeing the shift - the dataset is us publishing data in bulk and helping this open data world to come together and use mass amounts of data. Fact checked data commons in a central place to where you could maybe license your facts.

Today we're giving a lot of our content away to help Google and Amazon. At what point will that exchange of information change, and the imbalance about who's getting value from it change, to where we'll then license our content? I think we'll get to a point where we'll move to maybe a licensing model, and it's more than just the two players. I don't think it's just Google and Amazon. Microsoft's going to come back with a vengeance on this. There's lots of talk about Bing and their knowledge graph. Facebook's trying to figure out their voice solution and how they do engagement, and they're heavy in the knowledge graph space. Apple has Siri. Some say that they're less strong on the knowledge side, but still, they own a lot of that channel, through Apple devices.

Jason: Brilliant stuff. Yeah, more than just two players, licensing, and I like the idea of a control point. So Aaron, what can you add to that?

Aaron: That's great stuff, Martha. Thanks. There are really different buckets of what is impacted by structured data annotations and what Google, et al. are capable of parsing, and the evolutionary path of those is quite different.

I'll start with datasets that you just mentioned, Martha, where the semantics are opaque. For example, you just can't tell from a CSV what any of those fields are about. For example, there's a dataset out there for cat ownership by postal code in the U.K. But all you're going to see is a row that says seven, and then another one that has something that looks like a postal code. You're not going to know that that's about the number of cats per household. So that really opens up the possibilities for data where the semantics are opaque because it can't be inferred from a web environment.
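A minimal sketch of how that opacity can be addressed, assuming schema.org `Dataset` markup with a `DataDownload` distribution. The Python helper `dataset_jsonld`, the dataset title, the column descriptions and the example.com URL are hypothetical; only the property names come from schema.org.

```python
import json


def dataset_jsonld(name, description, csv_url, columns):
    """Describe a CSV file so its columns are no longer opaque."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        # Plain-text labels for what each column measures.
        "variableMeasured": list(columns),
        "distribution": {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": csv_url,
        },
    }


cats = dataset_jsonld(
    name="Cat ownership by postal code (UK)",          # illustrative title
    description="Number of cats per household for each UK postal code.",
    csv_url="https://example.com/data/cats.csv",       # hypothetical URL
    columns=["postal code", "cats per household"],
)

print(json.dumps(cats, indent=2))
```

With this description attached, a row that only says "seven" next to something that looks like a postal code can be read as seven cats per household for that postal code.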

Then you have another bucket where you can probably infer meaning from the content, but Google really requires quite accurate information in order to provide accurate results. Job posting is a really good example of that. It could probably guess that the salary range is between 30,000 and 50,000 pounds, but it doesn't want to guess in that circumstance. In order to give reliable job listings, you need employers to explicitly give that information.
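For the job posting case, a small Python sketch of stating the salary range explicitly rather than leaving it to be guessed. It uses the schema.org types `JobPosting`, `MonetaryAmount` and `QuantitativeValue`; the job title, employer name and `job_posting_jsonld` helper are hypothetical, and the 30,000 to 50,000 pound range is the one from Aaron's example.

```python
import json


def job_posting_jsonld(title, employer, min_salary, max_salary, currency="GBP"):
    """Build JobPosting markup with an explicit yearly salary range."""
    return {
        "@context": "https://schema.org",
        "@type": "JobPosting",
        "title": title,
        "hiringOrganization": {"@type": "Organization", "name": employer},
        "baseSalary": {
            "@type": "MonetaryAmount",
            "currency": currency,
            "value": {
                "@type": "QuantitativeValue",
                "minValue": min_salary,
                "maxValue": max_salary,
                "unitText": "YEAR",
            },
        },
    }


posting = job_posting_jsonld(
    title="Example Analyst",        # hypothetical role
    employer="Example Ltd",         # hypothetical employer
    min_salary=30000,
    max_salary=50000,
)

print(json.dumps(posting, indent=2))
```

A complete posting would carry further properties, such as a description and location; this sketch isolates the salary because that is the field Aaron singles out.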

Another bucket is using structured data simply as a mechanism to populate the graph. You see that recently with podcasts where you're now able to give a JSON-LD feed of the podcast data which is separated from any context. So there are situations where you may not need a context in order to provide the semantics. Google doesn't need to then crawl pages that are landing pages for podcasts to try to figure out what's it about. You can just provide that data directly.

And then, finally, there's that ginormous bucket of information on the web. Importantly, annotations help but aren't necessarily required. John Mu (I think) said they hope to come to a point where structured data is no longer necessary. And I think that that's the biggest bucket of information out there, and we're going to see AI and ML improve and make that more accessible. To Martha's point, annotations make Google guess less and that helps, but certainly the world we're going toward is one where, to the greatest degree possible, Google, et al. can figure out stuff simply by parsing the content.

I'll just finally kinda conclude this by saying that one way or another it's all going structured. That is to say that the 10 blue links at the left-hand side of the page on desktops are already much less important now that we have what we call the knowledge graph area over there on the right. That tends to be facts and data, which is really what people are mostly after. Essentially people take what's at the top - all that fact-y, knowledge graph-y stuff - and then, if they really want, there are some web pages they can look at. Don't put your stock in web pages in the future.

Martha: Mm-hmm (affirmative).

Arnout: Mm-hmm (affirmative).

Jason: Yeah, okay. Super. And Martha, did you want to add anything to the what's next question?

Martha: You know, my brain was sort of just going to the question of what's next after voice. And I bring this up only because I had an interesting conversation with Steve Macbeth at Microsoft - he was the executive sponsor for schema.org from Microsoft when they started in 2011 - and he talked a lot about how they're using schema as a way to connect augmented reality with reality. In this VR/AR space, if I'm shopping and I look at a product in a window, can I learn what brand it is, what the price is… and then could I order through voice, through my augmented reality and have it delivered to my home? Who owns that channel?

So, what comes after voice? Who's disrupting Google? Amazon's disrupting Google. Who else is playing quietly? My eye is on Microsoft just because of some of the things they keep talking about, like using their Semantic entity engine hugely across Microsoft… and we're just seeing announcements now talking about 360 using it.

Then there is a question of owned channels. As someone who has an Alexa and a Google Home, what does it mean to e-commerce if Amazon owns the channel for ordering in our home, or if Google owns the ordering logistics channel? Also, someone asked a question around AdWords, and my hypothesis - I don't have evidence, so Martha instinct - is that the business of search is going to change. Owning the distribution channel, the ordering channel, might become more important than serving ads. Providing service will become more important than just having ads. And that's just a hypothesis at this point.

Jason: Brilliant. Big statement. Thank you. Now, Position0.


Do any of the points we've covered today add something to the position zero profile, as I'm calling it, i.e., how can they help us with position zero? Arnout.

Arnout: Well, I think consistency, and using schema markup to provide a deeper context on what your page/content is about. I think that is one of the main things. And I think the other one, really satisfying the user, is an enormously important one as well. And we keep on forgetting this. Loads of people are still looking at vanity metrics, in my opinion. They go and look at traffic and clicks and everything, but they don't look at “did I actually add value to that visit”, and I think that is a major ingredient to get to this position.

Aaron: Arnout, you referred to traffic as a vanity metric. I like that. I like that a lot, and that's really at the core of the future position zero. As zero suggests, position zero is above the SERPs, where you no longer send traffic to websites, and that's where we're going. You're not providing a linked search description so that people can go visit your website, where you can then send them down your preferred conversion funnel and spam them with your damn newsletter and stuff. Position zero means give me the facts, Jack, and make the most of it. And it's a hard and challenging and fascinating new world, I think.

Martha: Position0 requires quality content. Then Schema. I'll ask a question back, which can maybe be the topic of a next webinar: “What are the new KPIs in the AEO world when traffic isn't getting to your website? How do you truly measure that you're adding value?”

Jason: Oh dear. Yeah, oh dear. That's a big question. Unfortunately, we've got four minutes left. Thank you very much to all of you. That episode was brilliant. Like last week and the week before, my head is kind of spinning and I'll have to re-listen to all this to really get it in my brain. Arnout, Aaron, and Martha. I thank you from the bottom of my heart for coming along, talking about all that. For me, that was incredibly informative.

Next week, episode seven, we're going to move on to the knowledge graph - how well is Google doing in understanding the world. So it's going to be an analysis of how good we think Google has become at building its knowledge graph, which is going to be a really interesting discussion.

So please do join us next week. See you then. Thank you very much for joining us. Thank you, guys.


This is the 6th episode of Epic Series by Jason.