SEARCHING THE WEB
WHAT IS THE INTERNET?
The Internet is a network of networks, and networks are groups of computers connected together, in this case by the TCP/IP protocols: Transmission Control Protocol (TCP) and Internet Protocol (IP). The Internet is often loosely referred to as the “Net,” the WWW (World Wide Web), or the “Web,” and it now covers the world. The Internet connects us to tens of thousands of computers that have information stored in electronic files.
The Internet is actually much larger than the “Web.” The Web refers to information accessible via HTML (Hypertext Markup Language) coded “pages.” HTML enables pages to be linked to each other via hyperlinks; these links are usually shown as words colored in blue and underlined. When you move your cursor over them, the cursor will change from a pointing arrow or I-beam to a pointing hand. Unfortunately, while these pages are connected to each other, the links do not provide an organized way of finding information. In a way, the Internet is a vast unorganized mass of information whose pieces are related only in that they are accessible via the WWW and may link to other, possibly related, pages. Fortunately, other methods have been developed over the years to guide us to the information we seek. We will discuss the more common and popular search tools that help us find the information we need out of the millions of Web sites in existence. We will also briefly discuss the limitations of these tools and the quality of the information available.
If you would like a fuller explanation of the Internet, look at How Internet Infrastructure Works (http://computer.howstuffworks.com/internet-infrastructure.htm) by Jeff Tyson, offered by HowStuffWorks, Inc.
What are you really searching?
Finding the Web documents (a.k.a. Web “pages” or “sites”) you want can be easy or seem impossibly difficult. This is in part due to the sheer size of the WWW, currently estimated (http://news.netcraft.com/archives/web_server_survey.html) at over 633 million sites, which is not counting the number of documents or pages available at each one! It is also difficult because the WWW is not indexed in any standard vocabulary. Unlike a library’s catalog, which will use standardized Dewey or Library of Congress subject headings (i.e. a controlled list of topics) to find books, in Web searching you are always guessing what words will be in the pages you want to find or guessing what subject terms were chosen by someone to organize their web page or site covering some topic.
When you do a search of the Web you are NOT searching it directly, because that is impossible. Your computer cannot search all the billions of Web pages directly. What you can do is access the many intermediate search tools available. For example, search engines (like Google) search the WWW and produce an index of the sites they have accessed. When you do your “search” you are actually searching these prepared indexes. So at no time are you searching the contents of the Web directly or in “real time.” Even worse, no one search engine can index the entire Web, and any search tool that claims to is distorting the truth (see “Google’s Misleading Blog Post: The Size Of The Web And The Size Of Their Index Are Very Different” from July 2008, http://www.techcrunch.com/2008/07/25/googles-misleading-blog-post-on-the-size-of-the-web/).
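The idea of searching a prepared index rather than the Web itself can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page addresses and their text are made up, and real search engines build far more elaborate indexes. The sketch shows that a lookup consults the index alone, so a page added after the index was built is not found.

```python
from collections import defaultdict

# Hypothetical pages a search engine has already visited (address -> page text).
PAGES = {
    "http://example.com/saws": "chain saws and hand saws for sale",
    "http://example.com/fruit": "apples oranges and plums",
    "http://example.com/garden": "pruning apple trees with hand saws",
}


def build_index(pages):
    """Map each word to the set of addresses whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, word):
    """Look the word up in the prepared index; the pages are not re-read."""
    return sorted(index.get(word.lower(), set()))


index = build_index(PAGES)
print(search(index, "saws"))

# A page that appears after the index was built is invisible to the search
# until the index is rebuilt.
PAGES["http://example.com/new"] = "a brand new page about saws"
print(search(index, "saws"))
```

Both print statements show the same two addresses, because the second search still uses the older index.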
Another handy Web site, World Wide Web Size (http://www.worldwidewebsize.com/), shows a comparison between search engines. Note again that it shows the size of the “indexed Web” only. But even if an Internet search engine could index the entire Web once, it wouldn’t be good enough, because the Web is changing and growing constantly. On December 8, 2010, the site was showing at least 2.76 billion pages, but on December 11, 2012 the number was up to 7.49 billion pages!
Again, what we commonly view as the “Internet” is largely made up of documents or “pages” written in HTML (Hypertext Markup Language) and viewed using an HTML browser such as Firefox, Internet Explorer, Netscape, Opera, or others. To go to a specific site one needs the URL, or uniform resource locator; this is the “address” that identifies where the site is. Specific “pages” within a site have individual file names, which usually start after a / mark. For example, the URL for this page is http://bay.cooslibraries.org/programs/free-computer-classes/searching-the-web/.
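For readers who like to see the parts of an address pulled apart, here is a small Python sketch that uses the standard library's urllib.parse module on the URL of this page. The variable names are just for illustration.

```python
from urllib.parse import urlparse

url = "http://bay.cooslibraries.org/programs/free-computer-classes/searching-the-web/"
parts = urlparse(url)

print(parts.scheme)   # the protocol: http
print(parts.netloc)   # the site: bay.cooslibraries.org
print(parts.path)     # the page within the site, starting at the first / mark

# The individual names between the / marks.
segments = [segment for segment in parts.path.split("/") if segment]
print(segments)
```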
WHAT INFORMATION IS OUT THERE?
One of the biggest myths about the Internet is that it has everything. Last I heard the Internet contained less than 5% of all human knowledge. Even that number only holds as long as you don’t get bogged down with arguments about how to define ‘knowledge.’ While there are in-depth resources to be found on the Internet, the bulk of the publicly available Web is pamphlet-level or brochure-type information. Keep in mind that newspapers, magazines, books and other text information started to be produced in electronic format only in the 1970s. This means that anything published before then must be manually keyed in by human beings or scanned and interpreted by OCR (optical character recognition) software. The cost of transferring publications to electronic format is a major limit on adding them to the Internet.
While OCR scanning has improved greatly, it still requires humans to proofread for errors. This scanning or typing and then proofreading adds an enormous cost, and companies funding it want to see a monetary return on such investments. In addition, much information is not available on the Internet because it is copyrighted. That means someone, such as an author or publisher, owns those documents, magazines, newspapers or books and wants to be paid for the work. Because of this, much valuable information will not be available for free on the Internet for many decades.
Non-copyrighted information is slowly being added to the Web, often by volunteers who are passionate about a particular subject. Some of the more visible volunteer efforts include genealogy (where genealogy societies coordinate their activities), classic books that have fallen out of copyright, such as those at Project Gutenberg (http://www.gutenberg.org/wiki/Main_Page) and Wiretap (http://wiretap.area.com/), support groups, and other such special interest groups.
HOW TO FIND INFORMATION
Before we get into the types of tools available to help you find information you should take a moment to think about the information you are trying to find.
- Are you looking for just some quick bits of trivia, a short magazine article, or more extensive information?
- What subject area would your information be classified under?
- Who would be the most likely producers of the information you want?
- How would authors phrase the information?
TAKING YOUR LANTERN and USING a LARGE PINCH of SALT
It isn’t enough to find your information; if the task at hand is important, you also need to make judgments about the source’s authority.
To give you an example, there was once a U.S. general who wanted a review of a Steven Spielberg movie. His staff searched and found one, which he later used. However, it came to light that the review had been written by a sixteen-year-old fan. While the sixteen-year-old may have written a thoughtful review, he was not a film critic and his opinion did not carry much weight in professional or academic circles. Basically, he didn’t have a proven professional track record, and so using his review did not reflect well on the General’s professionalism or scholarship. In case you didn’t know, Generals do NOT like to be embarrassed.
You don’t have to be a general or a university researcher for source authority to be important. High school and college students, business owners, etc. all need to notice who created and is responsible for the information they use, or risk poor grades, embarrassment or loss of profit.
You may think that it doesn’t matter in your case, but incorrect information can have unpleasant effects in even the smallest tasks. Some food recipes are written at the keyboard and have never been tested. Knitting and crochet patterns may have errors; car maintenance instructions could give the wrong type or weight of oil. Political or religious opinions could be based on false assumptions, manipulate the facts, or even be complete lies.
That isn’t to say that nothing on the Internet is correct or trustworthy, but you must note the source of your information and decide for yourself whether you wish to believe it. Taking a good look at sources and maintaining a healthy dose of skepticism is wise. Web sites that provide no indication of who created the information and provide no contact information should be viewed cautiously. The Internet is not the place to be gullible.
As you work with the Internet you will come to recognize certain markers that will help you sort out who is doing what. All Web sites are divided among several “domains,” which are represented by three letters in the URL or “address.” Those that have a domain of .edu are associated in some way with an educational institution, but may be produced by students rather than university staff; such sites and material may not reflect the official opinions or scholarship level of the institution. A domain of .com merely means that the Web site uses a domain associated with commercial activities, i.e. businesses. Domains of .org are usually organizations, such as the Lions Club, charities, car clubs or other non-profit organizations. Governmental sites in the U.S., whether city, state or Federal, usually claim the .gov domain, but don’t have to. For example, the URL for New York City is www.nyc.gov, but the URL for the City of Coos Bay is http://www.coosbay.org/. We are not using the .gov domain even though it is an official city Web site. Your Internet provider may have a .net domain because it provides a network. The U.S. military is represented by the .mil domain.
Gradually more domains are being added as the number of addresses possible under each is used up. As this happens new sites either have to use creative addresses, move to the newly released domains, or take an address in another domain.
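As a rough illustration of reading these domain markers, the Python sketch below looks at the last part of a site's name and reports what that domain is commonly associated with. The lookup table and the function name domain_hint are made up for this example, and the result is only a hint: as noted above, an official city site can sit in .org, and a .edu page may be a student's work.

```python
from urllib.parse import urlparse

# What each three-letter domain is usually associated with (a hint, not a guarantee).
DOMAIN_HINTS = {
    "edu": "educational institution",
    "com": "commercial",
    "org": "organization",
    "gov": "U.S. government",
    "net": "network provider",
    "mil": "U.S. military",
}


def domain_hint(url):
    """Return the usual meaning of the domain at the end of the site name."""
    if "://" not in url:
        url = "http://" + url  # urlparse needs a scheme to find the site name
    host = urlparse(url).hostname or ""
    suffix = host.rsplit(".", 1)[-1]
    return DOMAIN_HINTS.get(suffix, "unknown")


print(domain_hint("www.nyc.gov"))              # U.S. government
print(domain_hint("http://www.coosbay.org/"))  # organization, yet an official city site
```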
You may discover that if you mistype or misguess a Web site address you may end up at an “opportunistic site” that counts on users entering URLs incorrectly. The more harmless of these will be businesses that hope to profit from your misdirection. Others may be spoofs of more serious sites. The really annoying ones are actively hostile or pushing pornography. One example of this involves the White House (the U.S. President’s Web site). Type www.whitehouse.gov and you will go to the presidential White House Web site. But for a few years, if you typed www.whitehouse.com you would reach an adult entertainment site. (Note: It is no longer one.)
SPECIFIC SEARCH TOOLS
The most obvious first step in searching the Internet is to choose which search tool to use. There are many choices, first in the types of tools, and then the hundreds of choices under each. We are going to look at three types today.
I. DIRECTORIES
Directories are lists of site addresses found and approved by individuals. So rather than being automatically created by computer software, they are the work of a human being who has gone to the trouble of finding, reviewing, and compiling lists of sites that cover a particular topic well.
- Yahoo – http://yahoo.com – is the largest subject directory on the Internet and is excellent for general-interest topics. It also has a search capability built in.
- About.com – http://about.com – Topics are researched and presented by a topic expert who provides a selection of links.
- Dmoz – http://www.dmoz.org/ – Open directory project “is the largest, most comprehensive human-edited directory of the Web. It is constructed and maintained by a vast, global community of volunteer editors.”
- Infomine – http://infomine.ucr.edu – Scholarly Internet Resource Collections
- Ipl2 – http://www.ipl.org – Merger of the resources from the Internet Public Library (IPL) and the Librarians’ Internet Index (LII) websites.
- Internet Subject Directories – http://www.digital-librarian.com/subject.html – a list of subject directories to the Internet
- The WWW Virtual Library – http://vlib.org/
Specialized Subject Directories:
II. SEARCH ENGINES
Search engines are probably the most popular tool for finding information on the Internet. However, for a number of reasons, no one search engine (including Google) searches the entire Web. In fact the most successful, Google, actually retrieves only 20-30% of the Web. Others may retrieve parts of the missing 70-80%, but you should be cautious of any search engine initially. Many a popular search engine has succumbed to the temptation to mix paid advertising sites in with the ‘real’ results. See “Sites told to ‘fess up; Search Results Often Advertisers” (http://www.commercialalert.org/issues/culture/search-engines/sites-told-to-fess-up-search-results-often-advertisers).
There are hundreds of search engines, and some are specialized for a specific topic or need. Most of us only need a couple of favorites. Those of us who have particular interests may want to look at some of the topic-oriented search engines as well. For more about search engines go to How Search Engines Work (http://searchenginewatch.com/webmasters/article.php/2168031) or Search Engine Showdown (http://www.searchengineshowdown.com/). For links to specialized search engines go to Search Links (http://searchenginewatch.com/links/).
- Google – http://www.google.com
- Yahoo.com – http://yahoo.com
- Bing.com – http://bing.com – formerly MSN Search
- Ask – http://www.ask.com
- AOL Search – http://search.aol.com
III. META-SEARCH ENGINES
Metasearch engines allow you to search using several search engines at one time. The search structure and the commands allowed differ from one search engine to the next, and a metasearch engine may not allow for this. This means that your search will not be equally effective in each.
- Dogpile – http://www.dogpile.com/ (Searches Google, Yahoo!, Bing, and Ask)
- Info – http://www.info.com/ (Searches Google, Yahoo!, Bing, Ask, and About)
- Ixquick – http://www.ixquick.com/
- Mamma – http://www.mamma.com/
- Metacrawler – http://www.metacrawler.com
- Vivisimo – http://vivisimo.com/
- Yippy.com – http://yippy.com (Ask, Open Directory, Gigablast, and others)
For more information on the metasearch engines available and their differences, see ‘Metacrawlers and Metasearch Engines’ at the Search Engine Watch Web site, http://searchenginewatch.com/2156241.
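To picture what a metasearch engine does, here is a minimal Python sketch. Everything in it is hypothetical: engine_a and engine_b stand in for real search engines and simply return fixed lists of made-up addresses, and taking turns between the lists is just one simple way to merge them, not how any particular metasearch site works. Note that the same query text goes to every engine unchanged, which is exactly why a search may not be equally effective in each.

```python
from itertools import zip_longest


# Hypothetical stand-ins for real search engines; each returns a ranked list.
def engine_a(query):
    return ["http://example.com/1", "http://example.com/2", "http://example.com/3"]


def engine_b(query):
    return ["http://example.com/2", "http://example.com/4"]


def metasearch(query, engines):
    """Send one query to every engine and merge the results, dropping duplicates."""
    ranked_lists = [engine(query) for engine in engines]
    merged, seen = [], set()
    # Take the first result from each engine, then the second from each, and so on.
    for tier in zip_longest(*ranked_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


print(metasearch("chain saws", [engine_a, engine_b]))
```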
IMPROVING YOUR RESULTS!
It probably hasn’t taken you long to get tired of getting more than a few million hits for every little thing you look up. And while Google’s approach is very good it does not guarantee you the best results, only the most popular. Fortunately there are a few techniques available to narrow your results.
Boolean logic limits your search results by applying logic controls. On the Internet that means using the commands AND, OR, and NOT. The most common of these is AND. In Google the “AND” is assumed, i.e. if you type Chain Saws, Google will assume that you want Chain AND Saws. This means that if both words do not appear in a Web page, the page will not be returned. To further control this you could search for the words as a phrase by placing them within quote marks: “Chain Saws”. This tells Google that the two words must be adjacent to each other and in that order.
If you wanted to search for fruits and didn’t care what kind, you could type: apples OR oranges OR plums OR apricots. You would get any pages that had one, two or more of the requested fruits. As long as even one of these fruits was on a Web page it would be returned.
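The difference between AND, a quoted phrase, and OR can be tried out on a handful of made-up pages. The Python sketch below is only an illustration; the page names, page text, and function names are assumptions for this example, and real search engines match words in more sophisticated ways.

```python
# Hypothetical pages (name -> text) to search.
PAGES = {
    "page1": "chain saws for sale",
    "page2": "saws that cut chain link fence",
    "page3": "apples and oranges",
    "page4": "fresh plums",
}


def words(text):
    return text.lower().split()


def match_and(text, terms):
    """AND: every term must appear somewhere on the page."""
    return all(term.lower() in words(text) for term in terms)


def match_or(text, terms):
    """OR: at least one term must appear on the page."""
    return any(term.lower() in words(text) for term in terms)


def match_phrase(text, phrase):
    """Phrase: the words must be adjacent and in the given order."""
    page, wanted = words(text), words(phrase)
    return any(page[i:i + len(wanted)] == wanted
               for i in range(len(page) - len(wanted) + 1))


def search(pages, predicate):
    return sorted(name for name, text in pages.items() if predicate(text))


print(search(PAGES, lambda t: match_and(t, ["chain", "saws"])))   # ['page1', 'page2']
print(search(PAGES, lambda t: match_phrase(t, "chain saws")))     # ['page1']
print(search(PAGES, lambda t: match_or(t, ["apples", "oranges", "plums", "apricots"])))  # ['page3', 'page4']
```

The AND search returns the fence page as well, because both words appear on it; only the phrase search insists that they sit side by side.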
NOT – Getting material that does not relate to your search is a constant problem. In cases where it is obvious why, you can utilize the NOT command. One good example that most of us will run into is blogs. Blogs, a sort of online journal, are very popular these days. Unfortunately, they can be a real hazard to the serious searcher. Since authors of blogs may wander over a wide spectrum of topics, blogs have a tendency to show up in many searches even though their real focus has nothing to do with what you are looking for. So you could enter a search such as “Chain Saws” NOT (blog OR blogs). In Google you can use the “-” minus sign instead of typing “NOT”, like this: -blog.
Parentheses ( ) are used to nest commands; essentially you are saying you want the search in the parentheses to be done first. In other words, you are controlling the order in which your search words are acted upon. In the above case I am saying NOT to both the word blog and the word blogs. Some search engines will not understand plurals and will treat each as an entirely different word. To know how your search engine operates it pays to visit its help or advanced search tip pages. Google’s “Advanced Search Tips” (http://www.google.com/help/refinesearch.html) page illustrates what functions are available and how to operate them.
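The effect of the parentheses in “Chain Saws” NOT (blog OR blogs) can be seen in a short Python sketch. The pages and function names are made up for illustration. The version without parentheses shows one possible way a search tool might group the same words; actual search engines differ, which is why checking the help pages matters.

```python
# Hypothetical pages (name -> text).
PAGES = {
    "store": "chain saws and safety gear",
    "diary": "my blog about chain saws and cats",
    "multi": "chain saws reviewed on several blogs",
    "other": "hand saws",
}


def words(text):
    return text.lower().split()


def has_word(text, word):
    # Plurals are treated as different words: "blog" does not match "blogs".
    return word.lower() in words(text)


def has_phrase(text, phrase):
    page, wanted = words(text), words(phrase)
    return any(page[i:i + len(wanted)] == wanted
               for i in range(len(page) - len(wanted) + 1))


def with_parentheses(text):
    # "Chain Saws" NOT (blog OR blogs): the OR is worked out first, then negated.
    return has_phrase(text, "chain saws") and not (
        has_word(text, "blog") or has_word(text, "blogs"))


def without_parentheses(text):
    # One possible grouping without them: ("Chain Saws" NOT blog) OR blogs.
    return (has_phrase(text, "chain saws") and not has_word(text, "blog")) \
        or has_word(text, "blogs")


print(sorted(n for n, t in PAGES.items() if with_parentheses(t)))     # ['store']
print(sorted(n for n, t in PAGES.items() if without_parentheses(t)))  # ['multi', 'store']
```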
IF YOU WANT TO LEARN MORE
One of the easiest ways to look up definitions is with Google. Type “define URL” in the Google search box.
Additional information on learning about the Internet:
- Introduction to the Internet and the World Wide Web – http://www.webliminal.com/search/search-web01.html – by Ernest Ackermann (this is also a good example of a well done personal Web site).
- Choose the Best Search for Your Information Need – http://www.noodletools.com/debbie/literacies/information/5locate/adviceengine.html
- Search Engine Showdown – http://www.searchengineshowdown.com/ – Users’ Guide to Web Searching.
This is one of the sites that provides current information and comparisons of search engines. If you want to see how dated a search engine’s index is, this is the place to look.
- BARE BONES 101 – http://www.sc.edu/beaufort/library/pages/bones/bones.shtml – From University of South Carolina, Beaufort Library
You may now move on to the Exercise (http://bay.cooslibraries.org/programs/free-computer-classes/web-search-exercise)