How does SafeSearch work?

Today's question comes from Zurich.

Gary wants to know: how does SafeSearch, both for text and images, work?

Well, I worked on the initial version of SafeSearch for text, so let's concentrate on that.

I don't want to give away anything that spammers could use.

But I can talk about how SafeSearch worked way back in 2000, so you can kind of get an idea.

And the idea is roughly what you would expect, which is we look for certain words and we give them certain weight.

And if you have enough words with enough weight, then we sort of say, okay, this looks like it might be a porn or porn-related document.

You can have various thresholds where you can say, okay, it might be safe at this level, but unsafe once you get too many.

And you can do things like, well, if it's a book, a really long thing, and it's got one word, that's not quite as bad as if you have a very small document and you have that same word.

And you can very much imagine that some words are worse and more likely to be pornographic than other words.
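As a rough illustration of the idea described above, here is a minimal Python sketch of weighted-word scoring with thresholds and length normalization. The specific words, weights, threshold values, and the names `WEIGHTS`, `score`, and `classify` are all assumptions made up for this example; the transcript does not give the actual values or code.

```python
import re

# Hypothetical per-word weights, invented purely for illustration.
WEIGHTS = {
    "porn": 5.0,
    "amature": 3.0,   # assumed: the misspelling carries more weight
    "amateur": 0.5,
    "sex": 1.0,
    "breast": 0.2,
}

# Hypothetical thresholds on the length-normalized score.
BORDERLINE_THRESHOLD = 0.05
UNSAFE_THRESHOLD = 0.15


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def score(text):
    """Sum the weights of all words, divided by the document length.

    Dividing by the word count means one flagged word in a very long
    document counts for less than the same word in a very short one.
    """
    words = tokenize(text)
    if not words:
        return 0.0
    total = sum(WEIGHTS.get(word, 0.0) for word in words)
    return total / len(words)


def classify(text):
    """Map the score onto three illustrative levels."""
    s = score(text)
    if s >= UNSAFE_THRESHOLD:
        return "unsafe"
    if s >= BORDERLINE_THRESHOLD:
        return "borderline"
    return "safe"


short_doc = "amature sex pics"
long_doc = "sex " + "education " * 200

print(classify(short_doc))
print(classify(long_doc))
```

With these made-up numbers, the short document crosses the upper threshold, while the long document containing a single weighted word stays well under the lower one.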

So certain slang terms, and, it turns out, misspellings.

"Amateur" misspelled as "amature" is much more likely to be amateur porn than amateur radio or something along those lines.

But you do have to be careful, because there are words like "breast," which can be breast cancer, or "sex," which can be sex education.

So you do want to try to do the learning, to learn which words should carry which weights, and which words should have more weight, and those sorts of things.
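One way to picture "learning which words should carry which weights" is sketched below. It uses a naive-Bayes-style smoothed log-odds estimate over a tiny labeled set. This is a stand-in technique chosen for illustration, not something the transcript says SafeSearch used, and the training examples and the name `learn_weights` are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def learn_weights(labeled_docs):
    """Estimate a weight per word from (text, is_porn) pairs.

    The weight is the log of how much more often a word appears in the
    porn-labeled text than in the safe-labeled text, with add-one
    smoothing so unseen words do not cause a division by zero.
    """
    porn_counts, safe_counts = Counter(), Counter()
    for text, is_porn in labeled_docs:
        target = porn_counts if is_porn else safe_counts
        target.update(tokenize(text))

    vocab = set(porn_counts) | set(safe_counts)
    porn_total = sum(porn_counts.values()) + len(vocab)
    safe_total = sum(safe_counts.values()) + len(vocab)

    weights = {}
    for word in vocab:
        p_porn = (porn_counts[word] + 1) / porn_total
        p_safe = (safe_counts[word] + 1) / safe_total
        weights[word] = math.log(p_porn / p_safe)
    return weights


# Tiny made-up training set, for illustration only.
TRAINING = [
    ("amature sex pics", True),
    ("hot amature sex videos", True),
    ("breast cancer screening guide", False),
    ("sex education for teens", False),
    ("amateur radio club", False),
]

weights = learn_weights(TRAINING)
for word in ("amature", "sex", "amateur", "breast"):
    print(word, round(weights[word], 2))
```

On this toy data the misspelling "amature" ends up with a larger positive weight than "sex," which appears in both classes, while "breast" and the correctly spelled "amateur" come out negative because they only appear in the safe examples.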

But it actually is relatively sophisticated.

You can imagine doing a lot more than just pure content analysis or using the straight words, but at least to a first approximation, that's a pretty good way to sort of classify something as porn or not.

One thing that I wanted to mention: if you go down to the metadata for this video, we have a place where you can click.

And if you think you have been detected as porn when you're not pornographic, or you think you've found a bug or an error with SafeSearch, you can sort of report that and pass that information along.

And so people can adjust the algorithms or otherwise make improvements, so that we don't say that a site that's really, really good is pornographic if it's not.

But you would be surprised at how well just doing some pretty simple scanning with some relatively simple weights can catch a large fraction of the porn on the Web.

Just a little bit of historical digression here about previous search engines.

At least as I remember it, in the early days on AltaVista you could search for "sex" with their family mode on, and they would return only, like, 20 results, because they had basically said, okay, we are only going to allow these results for this query, or we're only going to say these results are safe.

And the mental model that Google had was different.

We sort of said, okay, if there's a mother and she's searching with her Cub Scout son, would she be surprised?

Would she be offended by the results?

But at the same time, you'd like to get the comprehensiveness of the web.

So you'd like to score the entire web and find the documents that are porn and exclude those.

But then if there's something about sex education or things along those lines, you would like those to be returned.

So it's a pretty good approach.

It's worked very well.

And thankfully, there's a much better team of engineers who are much more sophisticated in the ways that they analyze pages now.

So I'm sure all of that original stuff that I wrote back in 2000 has been replaced by much better stuff at this point.