How to Train Your AI Beast: Use Great Content (and Lots of It)

Editor’s note: To introduce the concepts and potential speakers for our new ContentTECH Summit, we’re sharing some tech-focused talks from the most recent Intelligent Content Conference. This one from Val Swisher, CEO of Content Rules, digs deep into the world of natural language processing, its role in artificial intelligence, and its impact on the future of content. Watch the video of her complete session or read the edited transcript below.

At the heart of artificial intelligence lies a sophisticated natural language processor. The good news is, that means content strategists are well positioned for an AI future. After all, there’s a natural synergy between content management/taxonomies and the natural linguistic patterning involved in training AI and cognitive systems.

In fact, content management is critical to AI effectiveness, to mining of dark content, and to determining future applications such as virtual agents and chatbots.

Automation and artificial intelligence are everywhere

How many of us work on our own cars anymore? We don’t do this anymore. Soon, nobody will know how to drive a stick, other than diehards. It’s really hard to find a standard transmission car to buy, because our cars shift for us.

In fact, we don’t even need to know how to park anymore. I have a Tesla, and I don’t even have to drive anymore. I push the button and the car does it. We could probably come up with a big, huge list of things we don’t have to do anymore because automation is everywhere.

This is Microsoft Cognitive Services Emotion API. I uploaded my headshot, and I thought it was pretty interesting to see it says I am .99999 (five nines) happy. And I’m a teeny bit disgusted.

It could tell from this that I was really happy. But, somehow, there was a little look of disgust – maybe it was in my eye or something.

I like this next one even better, because it made me younger than I am, and I was all for this. Somehow it looked at me and said I am 47.4. It knows that I’m female and I’m smiling. It knows that I’m wearing glasses. So it knows a whole bunch of stuff about me from one uploaded photo.

My son loves this one because it says he’s 32, but he was 18 when this picture was taken. Let’s look at all the things the API figured out.

First of all, it knows that this is music and this is a brass instrument. It recognizes that this is a person, a man (the beards probably helped). It knows this is indoor, and it’s pretty darn sure this is a concert band.

This instrument is a euphonium, which was invented after most composers were already dead. So there’s no orchestral music for euphonium. The only band that has a euphonium is a concert band. Did Microsoft really know that this is a euphonium, and that there are no euphoniums in orchestras, so it has to be a concert band?

I don’t know the answer, but I found that mind-boggling.

The magic and application of cognitive systems

What I really want to talk about are big cognitive systems and the kinds of things that they allow us to do and how they do it. What’s the magic inside a cognitive system? I was super curious about this.

These are just a few examples:

  • Microsoft
  • Google DeepMind
  • Baidu Minwa
  • IBM Watson, the mother (or father) of all cognitive systems

I’m going to be talking a lot about Watson, because there’s a lot of information about Watson. And I know a lot of people at IBM Watson, because I run around talking about them, so they like me.

We see cognitive systems in lots of places.

There’s a big Watson oncology project. Doctors try to diagnose cancers and come up with possible action plans for patients. If they can use a cognitive system that has instantaneous access to a huge amount of information, theoretically they should come up with better diagnoses or treatment plans.

In the IBM world, there’s no such thing as a single IBM Watson. The company has focused it on different verticals: IBM Watson Health, IBM Watson for financial services, and, in the high-tech vertical, IBM Watson cognitive services.

Cognitive systems are even used in cooking. You know what I want? I want to be able to open my refrigerator, point my phone to scan everything in it, do the same in my cupboard, and get a recipe based on the ingredients I have to make for dinner tonight.

There’s IBM Watson for travel. I want a travel system to know me so well that it picks the hotel, the beach, the airfare, and has a Mai Tai waiting for me when I get there.

Of course, the real reason we have Watson is so it can beat the pants off of a human being in Jeopardy. In 2011, IBM Watson annihilated the humans, and I’m going to explain to you how it was able to do this. Once you understand how it works, you’re going to say, “We didn’t stand a chance against this machine, there was no way.”

Content meaning and intent make AI work

At the base of all these systems is an understanding of the meaning and intent of content. They work by understanding not just the content, but the intent and the meaning of the content.

What we as content strategists have done for years — what I have done my entire career — is the key to making AI work.

If you had told me 24 years ago when I started my company that somehow we would end up on the leading edge of technology in content, I don’t think I would’ve believed you. It’s just content. It’s just words and pictures.

But that’s how a cognitive system works. Here’s a video about how IBM Watson works.

The video gives an example of a sentence IBM Watson needs to parse. The sentence is, “Fly over the boat with the red bow.” What does this sentence mean? It can mean a lot of things. It could mean fly as a noun: there’s a fly that’s over the boat. It could mean the boat has a little red bow on it.

It could mean I’m flying over the boat that has a red bow, as in a bow and arrow. Or a red bow as in the front of a boat. The challenge for all of these systems is to parse this sentence and understand what the heck it means.
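To make the ambiguity concrete, here is a minimal Python sketch that enumerates the possible readings of that sentence. The sense inventory (`SENSES`) and the function name `readings` are assumptions made up for illustration; a real natural language processor draws on far richer lexical resources and context.

```python
from itertools import product

# Hypothetical sense inventory for the ambiguous words in the example sentence.
SENSES = {
    "fly": ["verb: travel through the air", "noun: an insect"],
    "bow": ["a ribbon tied in a knot", "an archery weapon", "the front of a boat"],
}

def readings(sentence: str) -> list[dict[str, str]]:
    """Return every combination of senses for the ambiguous words in a sentence."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    ambiguous = [w for w in words if w in SENSES]
    combos = product(*(SENSES[w] for w in ambiguous))
    return [dict(zip(ambiguous, combo)) for combo in combos]

all_readings = readings("Fly over the boat with the red bow.")
print(len(all_readings))  # 2 senses of "fly" x 3 senses of "bow"
for reading in all_readings:
    print(reading)
```

Even with only two ambiguous words, the parser has several candidate readings to choose between, which is why it needs context to settle on one.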

How does it do this? Well, the guts, the internals, the heart of a cognitive system is a natural language processor, or NLP.

Cognitive or artificial intelligence systems need sophisticated natural language processors so they can understand the meaning and intent of content. What’s the magic of an NLP? How does an NLP work?

Think of your content as a basket of bread, and each loaf is a sentence. An NLP looks at each sentence, parses it, and can even figure out if you’ve made mistakes.

Each word or phrase is a slice of that loaf of bread. In this example, the NLP knows that .zip is a file extension. It knows that there are lowercase words. It knows that New Zealand is a place, and it knows that it’s one name, not “new” and “Zealand.”

It knows you have a stop at the end, and it would know if you were missing your stop. It would know if it was a question mark or an exclamation point. A natural language processor works by looking at each and every sentence and figuring out all the different parts of speech so that it can understand the meaning and intent of that sentence.

That’s at the heart of all of this. Everything comes from a natural language processor being able to parse a sentence like this one.
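As a rough illustration of slicing a sentence, the Python sketch below labels each piece with hand-written rules. The example sentence, the label names, and the one-entry `PLACES` list are assumptions invented for this sketch; production NLPs use trained models and full part-of-speech tagging rather than a few regular expressions.

```python
import re

# Hypothetical gazetteer of multiword place names; a real system would use a far larger one.
PLACES = ["New Zealand"]

PUNCTUATION_LABELS = {".": "FULL_STOP", "?": "QUESTION_MARK", "!": "EXCLAMATION_POINT"}

def slice_sentence(sentence: str) -> list[tuple[str, str]]:
    """Split one sentence into (text, label) slices using simple hand-written rules."""
    place_pattern = "|".join(re.escape(p) for p in PLACES)
    # Order matters: place names first, then file extensions, words, end punctuation.
    pattern = re.compile(
        rf"({place_pattern})|(\.[a-z0-9]+\b)|([A-Za-z][A-Za-z']*)|([.?!])"
    )
    slices = []
    for match in pattern.finditer(sentence):
        place, extension, word, punctuation = match.groups()
        if place:
            slices.append((place, "PLACE"))
        elif extension:
            slices.append((extension, "FILE_EXTENSION"))
        elif word:
            label = "LOWERCASE_WORD" if word.islower() else "CAPITALIZED_WORD"
            slices.append((word, label))
        else:
            slices.append((punctuation, PUNCTUATION_LABELS[punctuation]))
    return slices

def ends_with_stop(slices: list[tuple[str, str]]) -> bool:
    """Report whether the last slice is sentence-ending punctuation."""
    return bool(slices) and slices[-1][1] in PUNCTUATION_LABELS.values()

slices = slice_sentence("Save the .zip file in New Zealand.")
for text, label in slices:
    print(f"{text!r}: {label}")
print(ends_with_stop(slices))
```

Because the place-name rule runs first, "New Zealand" comes out as one slice rather than two separate words, and a sentence without its final stop fails the `ends_with_stop` check.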

Now we understand the natural language processor.

So how does a cognitive system work? How does something like IBM Watson work? How do I play with this?

4 steps to using AI

There are four stages to making a cognitive system like Watson work:

  • Upload
  • Curate
  • Ingest
  • Train

Remember, everything depends on an NLP that needs to understand the meaning and intent of every sentence you put into it.


The very first thing we have to do when we deploy a cognitive system is to upload as much content as we can possibly get our hands on that relates to the domain in which we want the system to operate. That large body of content is often called a corpus.

You have to upload a lot of content. I mean everything. I mean your entire corpus. Does anyone know what they uploaded to Watson when Watson won at Jeopardy? They uploaded all of Wikipedia. The whole thing.

Wikipedia doesn’t always have correct information but, for Jeopardy, it’s going to have enough correct information. Imagine that you had all of Wikipedia in your brain. We didn’t stand a chance. We (humans) don’t have access to that much content in that short amount of time.


Once you upload everything you can get your hands on, you realize there’s a lot of garbage in the content. So the next thing you have to do is curate the content, which means you’ve got to throw out the stuff that’s either incorrect, outdated, or not good for whatever reason.

A machine can’t do this. Only a person or people who understand the domain can do this. You need to have oncologists who are curating the corpus of oncology information. You certainly don’t want me doing that.
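One way to picture curation is as a set of human review decisions applied to the uploaded corpus. The Python sketch below assumes hypothetical `Document` and `Review` records and a `curate` helper; none of these names come from any particular product. The code only applies decisions that domain experts have already made.

```python
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    text: str

@dataclass
class Review:
    doc_id: str
    keep: bool
    reason: str  # for example "ok", "incorrect", or "outdated"

def curate(corpus: list[Document], reviews: list[Review]) -> list[Document]:
    """Keep only the documents that a human reviewer explicitly approved."""
    approved = {review.doc_id for review in reviews if review.keep}
    return [doc for doc in corpus if doc.doc_id in approved]

# Made-up corpus and review decisions, for illustration only.
corpus = [
    Document("d1", "Current treatment guidance."),
    Document("d2", "Guidance that has since been superseded."),
    Document("d3", "Not yet reviewed."),
]
reviews = [
    Review("d1", keep=True, reason="ok"),
    Review("d2", keep=False, reason="outdated"),
]
curated = curate(corpus, reviews)
print([doc.doc_id for doc in curated])
```

In this sketch, documents nobody has reviewed are left out by default, on the assumption that unreviewed content should not reach the system.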

This takes a long time and it’s not simple. Can you do it in the other direction? Can you curate it first and then upload it?

You can, but you’ll see that there’s a good reason not to — that has to do with statistics and how content is accessed.


After the content is curated, then the system itself does a process called ingestion, which I love because it’s so human. You think of digesting, right?

When an AI system’s NLP ingests curated content, it preprocesses that content, creating indices and metadata. This makes working with the content more efficient in the future.
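A minimal way to picture "creating indices and metadata" is an inverted index: a map from each word to the documents that contain it. The Python sketch below is an assumption-laden illustration, not how Watson or any specific product ingests content; the `ingest` and `lookup` names and the sample corpus are invented here.

```python
import re
from collections import defaultdict

def ingest(corpus: dict[str, str]) -> tuple[dict[str, set[str]], dict[str, dict]]:
    """Build an inverted index (word -> document ids) plus per-document metadata."""
    index: dict[str, set[str]] = defaultdict(set)
    metadata: dict[str, dict] = {}
    for doc_id, text in corpus.items():
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        for token in tokens:
            index[token].add(doc_id)
        metadata[doc_id] = {"token_count": len(tokens)}
    return index, metadata

def lookup(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the ids of documents that contain every word in the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Made-up two-document corpus, for illustration only.
index, metadata = ingest({
    "d1": "Pressure on the surface of a wing.",
    "d2": "Wing loading and lift.",
})
print(lookup(index, "wing pressure"))
print(metadata["d2"])
```

The preprocessing cost is paid once at ingestion, so later queries only touch the index rather than rereading every document.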

When it ingests the content, the system itself actually goes in and applies its own metadata. All that work we’ve been doing on taxonomies? I’m saying cognitive systems are going to have their own taxonomies. They’re going to understand the meaning and intent, and they’re going to understand who’s looking at it, and they’re going to create their own.

Vendors are saying, “No, we need taxonomies, because we have to sell more CCMSs.” We’ll see who is right.

But these systems do create their own internal taxonomy that allows them to get access to all that information very quickly, more quickly than anything we would come up with. It’s not something we would use as humans. It’s all internally ingested by the system.


Once the information is ingested, human experts train the AI on how to interpret it. This is called machine learning.

This is a lot of work, and it’s not for the meek and timid. Our jobs aren’t going away anytime soon, because there’s a lot of work to be done.

Once you’ve got the corpus uploaded, you’ve curated it, and it’s ingested, you have to train the system using something called the ground truth.

We have all this content in the system, and now we have to teach it linguistics. This is the part that blew my mind. We’re not teaching Watson about oncology, or about stocks and bonds, or about the Mai Tai that I want in Hawaii. We’re teaching it how to understand the linguistics of the domain.

How do we speak about oncology? We speak about oncology using different words than when we speak about my Mai Tai. We’re really teaching it the linguistics because that’s all it can understand. It’s not like somehow it became a doctor. It’s got this huge corpus, and it understands linguistics. When we query it, it’s got this internal metadata so it parses the query, it applies its internal metadata, and, poof, it finds your answer.

A ground truth is a question-and-answer pair. We teach the engine by feeding it question-and-answer pairs. Most customers I’m working with who are planning for a cognitive system, and a few who are bold enough to actually start working with cognitive systems, tend to start with support content (big surprise).

If you want to start working on your question-and-answer pairs, that’s a good place to start. Once the engine understands linguistics, it can go into the corpus and start figuring out all kinds of other stuff.

There are two kinds of content for a cognitive system:

  • Structured content (the question-and-answer pairs)
  • Dark content

Structured content for AI isn’t about DITA or XML or any of those acronyms. It’s just question-and-answer pairs in a structure.
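For a sense of what "question-and-answer pairs in a structure" might look like, here is a small Python sketch. The `GroundTruthPair` class, the JSON Lines layout, and the sample support content are all assumptions made for illustration; each cognitive platform defines its own training format, so check the vendor documentation before preparing real data.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class GroundTruthPair:
    question: str
    answer: str

# Made-up support content, for illustration only.
pairs = [
    GroundTruthPair(
        question="How do I reset my password?",
        answer="Open Settings, choose Account, then select Reset password.",
    ),
    GroundTruthPair(
        question="Where can I download my invoice?",
        answer="Go to Billing and select the invoice you need.",
    ),
]

def to_json_lines(pairs: list[GroundTruthPair]) -> str:
    """Serialize the pairs as one JSON object per line."""
    return "\n".join(json.dumps(asdict(pair)) for pair in pairs)

serialized = to_json_lines(pairs)
print(serialized)
```

The point is the shape rather than the syntax: every record holds exactly one question and the answer a domain expert considers correct.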

Everything else is dark content. We don’t have access to that dark content today. Think of all the content at your company that you can’t really search for. There are tweets and other social media posts, there are Slack channels, there are PDFs. I’ve been quoted as saying, “PDF is the graveyard of all content.”

Ever try searching in a PDF if you weren’t in the PDF? A cognitive system can. It’s going to go into all that dark content, understanding the meaning and the intent and the linguistics, and it’s going to find all kinds of information and answers that weren’t available before, because they weren’t in a format we could get to.

What about chatbots? Are they cognitive systems?

A chatbot isn’t a cognitive system. A chatbot is a front end to something. What’s behind the chatbot could be a cognitive system. It could be some other type of CMS, or it could be a whole bunch of flat files. A chatbot needs to understand your meaning and intent. It still needs natural language processing, but it is not a cognitive system. It is a front end to your information.

People are deploying chatbots all the time.

This example is Microsoft Support. I know the guy who’s responsible for this project. I called him and said, “Okay, I want to snap a slide for my presentation, show me something that’s going to work.” He told me to go online and say, “I can’t connect to Xbox Live.”

Here’s what that exchange looked like.

I say, “I can’t connect to Xbox Live.” She says, “Here’s what I think you’re asking about: an Xbox Live Account Subscription Issue, is that correct?”

If I really couldn’t connect to Xbox Live at this point, I would probably have picked up my phone. We have a whole new series of challenges with chatbots: You should understand my meaning and intent, even if I haven’t told you, otherwise I’m going to be frustrated.

When we use a chatbot, a lot of the time we put in what we’re thinking in whatever words we use. It may or may not be real sentences, which may or may not be formulated correctly. The first thing that the chatbot has to do is to ask, “Is this what you mean?” “I think this is what you mean.” “Oh, that’s not what you mean? Let me take another guess. Is this what you mean?” It’s a whole other level of interface.

There is a natural language processor in here, and it did parse the sentence “I can’t connect to Xbox Live.” It did go back into its corpus using whatever metadata it had. It came up with a subscription issue. But maybe the thing isn’t plugged in. There are all kinds of reasons that could keep me from connecting to Xbox Live. Maybe it’s broken. Maybe my mother pulled the plug out of the wall and threw it away.
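The "Is this what you mean?" loop can be sketched in a few lines of Python. The intent names, keyword sets, and function names below are hypothetical and chosen only to echo the Xbox Live example; a real chatbot would use a trained NLP model rather than keyword overlap, and this is not how the Microsoft Support bot works.

```python
import re
from typing import Callable, Optional

# Hypothetical intents with made-up keyword sets.
INTENTS = {
    "network connection issue": {"connect", "network", "wifi", "xbox", "live"},
    "account subscription issue": {"subscription", "account", "xbox", "live"},
    "hardware issue": {"power", "plug", "broken"},
}

def rank_intents(utterance: str, intents: dict[str, set[str]]) -> list[str]:
    """Order intents by how many of their keywords appear in the utterance."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    scored = [(len(words & keywords), name) for name, keywords in intents.items()]
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    return [name for score, name in ranked if score > 0]

def clarify(
    utterance: str,
    intents: dict[str, set[str]],
    confirm: Callable[[str], bool],
) -> Optional[str]:
    """Offer guesses one at a time until the user confirms one, or give up."""
    for guess in rank_intents(utterance, intents):
        if confirm(guess):  # stands in for asking "Is this what you mean?"
            return guess
    return None

guesses = rank_intents("I can't connect to Xbox Live", INTENTS)
print(guesses)

# Simulate a user who rejects the first guess and accepts the second.
chosen = clarify(
    "I can't connect to Xbox Live",
    INTENTS,
    confirm=lambda guess: guess == "account subscription issue",
)
print(chosen)
```

Note what the sketch cannot do: nothing in the typed sentence mentions a pulled plug, so the hardware intent never gets offered. That gap is exactly the frustration described above.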

A chatbot has an NLP and a cognitive system has an NLP, but that’s where the similarity ends.

How a cognitive system returns answers

So I’ve uploaded all this stuff, and I’ve curated it, and the system has ingested it, and I’ve trained it. Now that it has all this information, how does it work?

Well, let’s say I type in a query. The first thing a cognitive system does is to figure out all the parts of speech, and the NLP gets to work parsing your sentence.

The better the sentences we write, the better chance that this thing’s going to come back with an answer that makes sense. It’s going to generate a hypothesis based on its understanding of the linguistics. Then it’s going to go out and locate evidence for what it thinks is the answer. Then it’s going to score the evidence and estimate the confidence level.
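The sequence just described (generate hypotheses, locate evidence, score it, estimate confidence) can be sketched as a toy in Python. Everything here is an assumption for illustration: word overlap stands in for real evidence scoring, the confidence is just each score's share of the total, and the sample passages are made up. Watson's actual methods are far more elaborate.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase a text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def answer(query: str, corpus: list[str], top_n: int = 3) -> list[tuple[str, float]]:
    """Return up to top_n candidate passages with a toy confidence for each."""
    query_terms = tokens(query)
    scored = []
    for passage in corpus:  # every passage is treated as a candidate hypothesis
        evidence = query_terms & tokens(passage)  # query words the passage supports
        score = len(evidence) / len(query_terms) if query_terms else 0.0
        if score > 0:
            scored.append((passage, score))
    total = sum(score for _, score in scored)
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]
    # Confidence here is each candidate's share of the total evidence score.
    return [(passage, score / total) for passage, score in ranked]

# Made-up passages, for illustration only.
passages = [
    "Methods for calculating pressure on the surface of a wing.",
    "Duct tape adheres to most surfaces.",
    "Wing loading affects pressure distribution.",
]
results = answer("pressure on a wing surface", passages)
for passage, confidence in results:
    print(f"{confidence:.0%}  {passage}")
```

Even in this toy version, the passage that shares the most words with the query rises to the top, which is one reason clearly written sentences on both sides improve the odds of a sensible answer.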

It might return three answers and show which one it thinks is the best. In fact, it has a 55% statistical confidence that this one is the best. How does it get that? The machine is always learning more, both about linguistics and from people confirming which answer is right.

Have you ever noticed in Google Translate you can actually say, “No, that’s wrong”? As these engines are learning, experts can score and can put in new information. Over time, the answers that were incorrect start floating to the bottom.
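Here is a minimal Python sketch of how feedback could push incorrect answers down the list. The `FeedbackRanker` class and the fixed weight of 0.1 per vote are assumptions chosen for illustration; real systems retrain their models rather than adding a simple vote bonus.

```python
from collections import defaultdict

class FeedbackRanker:
    """Re-rank scored answers using accumulated right/wrong feedback."""

    def __init__(self, weight: float = 0.1) -> None:
        self.weight = weight  # hypothetical influence of a single vote
        self.votes: dict[str, int] = defaultdict(int)

    def record(self, answer: str, correct: bool) -> None:
        """Store one piece of feedback: +1 for right, -1 for wrong."""
        self.votes[answer] += 1 if correct else -1

    def rerank(self, scored: list[tuple[str, float]]) -> list[tuple[str, float]]:
        """Sort answers by their original score adjusted by feedback."""
        return sorted(
            scored,
            key=lambda pair: pair[1] + self.weight * self.votes.get(pair[0], 0),
            reverse=True,
        )

ranker = FeedbackRanker()
scored = [("answer A", 0.55), ("answer B", 0.50)]
ranker.record("answer A", correct=False)  # an expert says "No, that's wrong"
reranked = ranker.rerank(scored)
print([text for text, _ in reranked])
```

A single "that's wrong" is enough in this sketch to drop answer A below answer B; with more feedback, consistently wrong answers keep sinking.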

On IBM Bluemix (now IBM Cloud), you can get an account for the IBM Watson platform. IBM has about three bazillion APIs for Watson — it wants you to connect everything. All IBM wants is your corpus.

Then you can play around with Watson like I did. I went into this thing called Retrieve and Rank. The corpus for this particular one is the Cranfield Collection, which is all about aerodynamics. I used it because I was sitting on a Southwest plane, and I saw a guy walking up a ladder and putting duct tape on the wing. And then we took off. The duct tape was gone by the time we landed, and I was freaking out. True story: I have photos to prove it.

In this case, thank goodness, Watson actually generated the question too. It generated this question: “What is the best theoretical method for calculating pressure on the surface of a wing alone?” What does that have to do with duct tape?

Then it went off and found answers in the corpus. It returned two different kinds of answers. The top ones are the Watson answers, the machine learning approach. It said, “I’m 50% sure it’s this answer on the bottom.”

It also went out to the web and did a standard search and found a couple of things. Of course, it doesn’t calculate statistical confidence for web search results. It can only do that for the corpus that it has access to and knows and understands.

Even artificial intelligence needs quality content

Artificial intelligence is based on natural language processing. That means it parses sentences to understand the meaning and intent of content and to make decisions based on that meaning and intent. That means the quality of your content becomes extremely important.

You need to make sure that your content makes sense, so AI systems can learn the appropriate linguistics of your domain. If you put garbage content in, you’ll get garbage out.


Cover image by Joseph Kalinowski/Content Marketing Institute