Showing posts with label Internet. Show all posts

Monday, November 21, 2011

Web Crawler

A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or (especially in the FOAF community) Web scutters.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. Crawlers can also be used to automate maintenance tasks on a Web site, such as checking links or validating HTML code. Crawlers can also gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
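The seed-and-frontier loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production crawler: the `crawl` function, the `fetch` callable, and the in-memory `FAKE_WEB` dictionary are hypothetical names introduced here, and the first-in, first-out visiting order is just one possible policy.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Visit pages in first-in, first-out order, starting from the seeds.

    `fetch` is a hypothetical callable mapping a URL to HTML text
    (or None if the page is unavailable).
    """
    frontier = deque(seeds)   # the crawl frontier: URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, so none is queued twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# A tiny in-memory "Web" standing in for real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.com/b": '<a href="/c#top">C</a>',
    "http://example.com/c": "no links here",
}

order = crawl(["http://example.com/"], FAKE_WEB.get)
print(order)
```

Swapping the `deque` for a priority queue would be one way to let a selection policy decide which frontier URL is visited next.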

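Two characteristics of the Web make crawling difficult: its large volume and its high rate of change. The large volume implies that the crawler can only download a fraction of the Web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler reaches a page, it might already have been updated or even deleted.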

The number of possible crawlable URLs being generated by server-side software has also made it difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer four options to users, as specified through HTTP GET parameters in the URL. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed with 4 × 3 × 2 × 2 = 48 different URLs, all of which may be linked on the site. This combinatorial growth creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.
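The arithmetic in the gallery example can be checked with a short Python sketch. The host name, parameter names, and values below are invented for illustration, and `canonicalize` assumes the crawler already knows which parameters only affect presentation, which in practice is the hard part.

```python
from itertools import product
from urllib.parse import urlencode, urlsplit, urlunsplit, parse_qsl

# Hypothetical parameter names and values for the gallery example.
OPTIONS = {
    "sort": ["date", "name", "size", "rating"],   # four ways to sort
    "thumb": ["small", "medium", "large"],        # three thumbnail sizes
    "format": ["jpg", "png"],                     # two file formats
    "user_content": ["on", "off"],                # user-provided content on or off
}

# One URL per combination of option values.
urls = [
    "http://gallery.example/photos?" + urlencode(dict(zip(OPTIONS, combo)))
    for combo in product(*OPTIONS.values())
]
print(len(urls))  # 4 * 3 * 2 * 2 = 48


def canonicalize(url, ignored=frozenset(OPTIONS)):
    """Drop parameters assumed to be presentation-only and sort the rest."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k not in ignored)
    return urlunsplit(parts._replace(query=urlencode(kept)))


# All 48 URLs collapse to a single canonical address.
unique = {canonicalize(u) for u in urls}
print(len(unique))
```

Comparing canonical forms before adding a URL to the frontier is one way a crawler might avoid fetching the same gallery 48 times.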

As Edwards et al. noted, "Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained." A crawler must carefully choose at each step which pages to visit next.

The behavior of a Web crawler is the outcome of a combination of policies:
  • a selection policy that states which pages to download,
  • a re-visit policy that states when to check for changes to the pages,
  • a politeness policy that states how to avoid overloading Web sites, and
  • a parallelization policy that states how to coordinate distributed Web crawlers.
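As one illustration of the policies listed above, a politeness policy could be approximated by enforcing a minimum delay between requests to the same host. The sketch below is an assumption-laden example, not a standard: the `PolitenessGate` class, its method names, and the two-second delay are all hypothetical, and timestamps are passed in explicitly so the logic can be followed without real waiting.

```python
from urllib.parse import urlsplit


class PolitenessGate:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay=2.0):
        self.min_delay = min_delay   # seconds between hits on one host (assumed value)
        self.next_allowed = {}       # host -> earliest permitted request time

    def wait_time(self, url, now):
        """Return how many seconds to wait before fetching `url` at time `now`."""
        host = urlsplit(url).netloc
        return max(0.0, self.next_allowed.get(host, now) - now)

    def record(self, url, now):
        """Note that `url` was fetched at time `now`."""
        host = urlsplit(url).netloc
        self.next_allowed[host] = now + self.min_delay


gate = PolitenessGate(min_delay=2.0)
gate.record("http://example.com/a", now=10.0)
print(gate.wait_time("http://example.com/b", now=10.5))   # same host: must wait
print(gate.wait_time("http://example.org/", now=10.5))    # different host: no wait
```

In a real crawler the `now` argument would typically come from a monotonic clock, and URLs whose host is not yet ready would be pushed back onto the frontier.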

Wednesday, July 27, 2011

History of Internet Explorer

The following is a history of the Internet Explorer graphical web browser from Microsoft, developed over nine major software versions: 1.0 (1995), 2.0 (1995), 3.0 (1996), 4.0 (1997), 5.0 (1999), 6.0 (2001), 7.0 (2006), 8.0 (2009), and 9.0 (2011), which began public beta testing in September 2010.[1] Internet Explorer has supported Microsoft Windows, but some versions also had an Apple Macintosh version; see Internet Explorer for Mac. For the UNIX version, see Internet Explorer for UNIX. For mobile versions such as Pocket Internet Explorer and versions for Windows CE, see Internet Explorer Mobile.
The first Internet Explorer was derived from Spyglass Mosaic. The original Mosaic came from NCSA, but since NCSA was a public entity it relied on Spyglass as its commercial licensing partner. Spyglass in turn delivered two versions of the Mosaic browser to Microsoft, one wholly based on the NCSA source code, and another engineered from scratch but conceptually modeled on the NCSA browser. Internet Explorer was initially built using the Spyglass source code, not the NCSA source code.[2] The license to Microsoft provided Spyglass (and thus NCSA) with a quarterly fee plus a percentage of Microsoft's revenues for the software.
Internet Explorer has been the most widely used web browser since 1999, attaining a peak of about 95% usage share during 2002 and 2003 with Internet Explorer 5 and Internet Explorer 6. Since that peak, its usage share has been declining in the face of renewed competition from other web browsers, and stood at 43.55% as of February 2011. Microsoft spent over US$100 million per year on Internet Explorer in the late 1990s,[1] with over 1000 people working on it by 1999.[2]
Since its first release, Microsoft has added features and technologies such as basic table display (in version 1.5); XMLHttpRequest (in version 5), which aids creation of dynamic web pages; and Internationalized Domain Names (in version 7), which allow Web sites to have native-language addresses with non-Latin characters. The browser has also received scrutiny throughout its development for use of third-party technology (such as the source code of Spyglass Mosaic, used without royalty in early versions) and security and privacy vulnerabilities, and both the United States and the European Union have alleged that integration of Internet Explorer with Windows has been to the detriment of other browsers.
The latest stable release is Internet Explorer 9, which is available as a free update for Windows 7, Windows Vista SP2, Windows Server 2008 and Windows Server 2008 R2. Internet Explorer was to be omitted from Windows 7 and Windows Server 2008 R2 in Europe, but Microsoft ultimately included it, with a browser option screen allowing users to select any of several web browsers (including Internet Explorer).[3][4][5][6]
Versions of Internet Explorer for other operating systems have also been produced, including an embedded OEM version called Pocket Internet Explorer, later rebranded Internet Explorer Mobile, which is currently based on Internet Explorer 7 and made for Windows Phone 7, Windows CE, and previously Windows Mobile. It remains in development alongside the more advanced desktop versions. Internet Explorer for Mac and Internet Explorer for UNIX (Solaris and HP-UX) have been discontinued.