Building the Web Index – Intro to Computer Science


So now we’re ready to build the Web index. We’re going to use the split operation to do this. So we can certainly build a way to separate all the words in a Web page ourselves using the operation we’ve already seen. We can use indexing to go through the letters in the string We can identify characters that we think of as separating words such as a space or a comma. And we can collect those all as separate strings. The good news is Python provides a built-in operation that does exactly what we need. It’s called split. We invoke split on a string and what it outputs is a list of the words in the string. For now we’re not going to worry too much about the details of how split decides how to separate a string into words, but let’s look at a few examples to see that it does something pretty close to what we want for building our Web index. So we’ve defined quote as a string which is this quote from Robert Reich. “In Washington it’s dog eat dog. In academia, it’s exactly the opposite.” And now we’ll use split to divide the quote into its words. And you can see the result. We have a list where the elements of the list are the words from the quote. And split does a pretty good job of dividing the string into the words that we want for our Web index. It’s not perfect. We can see that Washington, which was followed by a comma in the quote, ends up as the string Washington including the comma. That’s a different value than the keyword Washington. So if we search for Washington and we built our index this way, we’re not going to actually find this occurance. So this isn’t perfect, but it’s going to be good enough for what we use for now. Later on we can think about ways to divide the Web page into its component words in a way that will be more accurate. Figuring out exactly how to decide when something’s a new word is a fairly tough problem, though. As a second example to get a feel for how split works, let’s try another quote. Here we have a quote from David Letterman. “There’s no business like show business, but there are several businesses like accounting.” I’ve introduced a new Python construct. We’re using the triple quote here. The nice thing about triple quotes is that we can divide them over multiple lines. When we tried to enter the previous quote using just the double quotes, it fell off the edge of the screen. Using triple quotes, we can define one string spread over multiple lines. And now when we print out the result of splitting the quote, we see it’s divided into words. It’s fine that there are new lines. It still separating it into the words we have. We still have some issues, like the parentheses is included as part of David. So this quote wouldn’t show up for the keyword David. I’ll leave it for you to decide at the end of the class whether computing is a business more like show business or business more like accounting. I’m going with show business, though. So now let’s see if you can write the code to build an index using the URLs collected from our Web problem.

Add a Comment

Your email address will not be published. Required fields are marked *