Components of a Search Engine: Web Crawler, Indexer, Spider

One of the main requirements for mastering Search Engine Optimization (SEO) is knowing the ins and outs of search engines. The owner of a site or blog should therefore have a basic understanding of the indexer, web spider, web crawler, and robot or bot, which are the mainstays of search engines.

All of these elements play an important role in displaying the main page or article page of a site so that visitors can access it.

Having previously discussed the Definition of Search Engines, Benefits and Types of Search Engines in detail, on this occasion we will explore more deeply what components make up a search engine. This article uses a question-and-answer format to make it easier for you to learn what the components of a search engine are and what each one does.

What are the components of a search engine

Search engines rely on several components to provide search services across millions of database entries. These components include:

  1. Query Interface
  2. Query Engine
  3. Database
  4. Spider or Web Crawler
  5. Indexer

What is a Query Interface

According to Salt Agency, a search engine, in data terms, is a combination of user-agents (crawlers), databases and the means of maintaining them, and search algorithms. The user then views and interacts with the query interface.

On the front end of the application, search engines have a user interface where users can enter queries and find specific information related to what they have typed. When a user clicks “search”, the search engine fetches the results from its database and then ranks them based on various weights and algorithm scores associated with the data points in the HTML document. Modern search engines also apply personalization to queries, so that rankings 1-10 are no longer linear or precise.
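As a rough illustration of ranking by weights, the sketch below scores a few in-memory pages against a query. The field names (`title`, `body`), the weight values, and the `score` and `rank` functions are all assumptions made for this example; real search engines use far more signals and do not publish their formulas.

```python
# Hypothetical field weights: a match in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}

def score(page, query):
    """Sum weighted counts of each query term across the page's fields."""
    terms = query.lower().split()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = page.get(field, "").lower().split()
        total += weight * sum(words.count(term) for term in terms)
    return total

def rank(pages, query):
    """Return the pages that match the query, highest score first."""
    scored = [(score(page, query), page) for page in pages]
    matching = [item for item in scored if item[0] > 0]
    matching.sort(key=lambda item: item[0], reverse=True)
    return [page for _, page in matching]

# Made-up pages used only to exercise the functions above.
pages = [
    {"url": "a.example", "title": "Web crawler basics", "body": "A crawler visits pages."},
    {"url": "b.example", "title": "Cooking tips", "body": "Nothing about search here."},
    {"url": "c.example", "title": "Search engines", "body": "The crawler feeds the indexer."},
]
results = rank(pages, "crawler")
```

Here the page that mentions the term in its title is placed ahead of the page that mentions it only in the body, and the page with no match is dropped.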

All search engines provide different user interfaces and various special content blocks. It is also worth noting that in some cases there are fewer than 10 top results on the SERPs.

The query interface is what 99 percent of internet users think of when you talk about search engines, because it is the page they interact with and get value from.

Query interfaces can come in many forms, from the standard and simple search bars that many of us are accustomed to (e.g. Google), to more visual concepts (e.g. Bing), to very busy pages that act as information centers before any query has even been made (e.g. Yahoo).


What is a Query Engine

The Query Engine is a program whose job is to translate what the user wants into a form that the search engine can process. It is the Query Engine that looks up the matching archives and documents in the search engine’s database system.

According to Techopedia, search engine queries are requests for information made using search engines. Every time a user types a string of characters into the search engine and presses “Enter”, a search engine request is made. The string of characters (often one or more words) acts as the keywords that the search engine uses to algorithmically match results to the query. These results are displayed on the search engine results page (SERP) in order of significance (according to the algorithm).

Every search engine request adds to the mass of analytical data on the Internet. The more data the search engine collects, the more accurate the search results will be and that’s a good thing for internet users.

Search engine queries used to be simple enough, but users have needed to become more savvy as the number of sites on the Web has ballooned. For example, to get a definition of the term enterprise software, you can’t just search for the term on its own. Instead, it’s best to use “enterprise software definition” or, if you have a trusted source in mind, “SEON enterprise software definition.” There are many other tricks, such as putting a phrase in quotation marks in the search bar or specifying where the search engine should look using the “site:” operator.
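To make the quotation-mark and “site:” tricks more concrete, here is a minimal sketch of how a query string might be split into its parts before searching. The `parse_query` function and its return format are hypothetical and written for this article; they do not describe how any particular search engine parses queries.

```python
import re

def parse_query(raw):
    """Split a raw query into exact phrases, a site restriction, and plain terms."""
    # Text inside double quotes is treated as an exact phrase.
    phrases = re.findall(r'"([^"]+)"', raw)
    # Remove the quoted parts so they are not counted again as plain terms.
    rest = re.sub(r'"[^"]*"', " ", raw)
    site = None
    terms = []
    for token in rest.split():
        if token.lower().startswith("site:") and len(token) > 5:
            site = token[5:]          # keep only the domain after "site:"
        else:
            terms.append(token.lower())
    return {"phrases": phrases, "site": site, "terms": terms}

# example.com is a placeholder domain.
parsed = parse_query('"enterprise software" definition site:example.com')
```

The result separates the exact phrase, the domain restriction, and the remaining keyword, which a query engine could then match against its database in different ways.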

Although most people make search engine requests without a second thought, companies selling products and services or producing content for the Web pay close attention to data on popular search engine requests and the global number of specific search engine requests for specific keywords. This data helps them optimize their site to match the range of inquiries to the products or services they offer.

One of the requirements for a program to be called a search engine is that it has a query engine.

What is a Database

According to Wikipedia, a database is a collection of information stored systematically on a computer so that it can be examined using a computer program to obtain information from it. The software used to manage a database and run queries against it is called a database management system. Database systems are studied in information science.

The term “database” originates from computer science, although its meaning later broadened to include things outside the field of electronics. Records similar to databases actually existed before the industrial revolution, in the form of ledgers, receipts and business-related data sets.

The basic concept of a database is a collection of records, or pieces of knowledge. A database has a structured description of the types of facts stored in it: this description is called a schema. The schema describes the objects represented by the database and the relationships among those objects. There are many ways to organize a schema, or to model the structure of a database: these are known as database models or data models.

The model commonly used today is the relational model, which in layman’s terms represents all information in the form of interconnected tables, where each table consists of rows and columns (the actual definition uses mathematical terminology). In this model, the relationships between tables are represented by using the same values in both tables. Other models, such as the hierarchical model and the network model, use a more explicit way of representing relationships between tables.
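The idea of tables linked by shared values can be sketched with Python’s built-in `sqlite3` module. The table names (`sites`, `pages`), their columns, and the sample rows are invented for this example and are not taken from any real search engine database.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Two tables: each page row stores the site_id of the site it belongs to.
conn.execute("CREATE TABLE sites (site_id INTEGER PRIMARY KEY, domain TEXT)")
conn.execute("CREATE TABLE pages (page_id INTEGER PRIMARY KEY, site_id INTEGER, title TEXT)")

conn.executemany("INSERT INTO sites VALUES (?, ?)", [(1, "a.example"), (2, "b.example")])
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [(10, 1, "Home"), (11, 1, "About"), (12, 2, "Blog")],
)

# The relationship is expressed only through the matching site_id values.
rows = conn.execute(
    "SELECT sites.domain, pages.title FROM pages "
    "JOIN sites ON pages.site_id = sites.site_id "
    "ORDER BY pages.page_id"
).fetchall()
```

The join pairs each page with its site purely because both rows carry the same `site_id`, which is the “same values between the tables” idea described above.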

The term database refers to a collection of interconnected data, and the software should be referred to as a database management system (DBMS). If the context is clear, many administrators and programmers use the term database for both meanings.

So the concept of a database is a collection of data that forms files which are related to one another in a certain way to produce new data or information. Put another way, it is a collection of interconnected data organized according to a certain schema or structure.

The search engine’s database is itself the collection of lists of documents and archives gathered from sites across the internet.


What is a Web Spider

A web spider is a program specifically designed to visit a site and read its pages and other important information in order to build entries for the search engine index. This makes the search engine’s work lighter and faster.

Most search engine services run a spider program of this kind, sometimes under a different name but with a similar function. Web crawler and bot are other names for the robots used by search engines. The main goal of the spider program is to visit a site in its entirety, or to visit articles or posts that have just been published. The entire site or only certain pages can be visited and indexed selectively.

Why is this program named a spider? It is named after the animal because of the similar way it works. Web spiders visit many sites in parallel at the same time, just like the legs of a spider that can reach a wide area of its web. Web spiders can “crawl” to the various pages of a website in several ways. One is to follow all the hypertext links on each page until all the pages have been read.

Examples of spider programs include the web spiders or crawlers of the search engines Google, Yahoo, Yandex, Bing, and others.

As an additional reference, according to Webopedia, spiders are programs that automatically retrieve Web pages. Spiders are used to feed pages to search engines. The program is called a spider because it crawls through the Web. Another term for it is webcrawler.

Since most Web pages contain links to other pages, spiders can start almost anywhere. As soon as a spider sees a link to another page, it goes off and fetches that page. Large search engines, such as AltaVista, have many spiders that work in parallel.
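The link-following behavior described above can be sketched as a breadth-first traversal. To keep the example self-contained it walks a small in-memory link graph instead of fetching real pages; the `LINKS` dictionary, the page names, and the `crawl` function are assumptions for illustration, not the design of any actual spider.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: page -> pages it links to.
LINKS = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["post-1", "post-2"],
    "post-1": ["blog"],
    "post-2": [],
}

def crawl(start, links):
    """Visit every page reachable from start, following links breadth-first."""
    visited = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        visited.append(page)          # a real spider would fetch and store the page here
        for target in links.get(page, []):
            if target not in seen:    # skip pages already queued or visited
                seen.add(target)
                queue.append(target)
    return visited

order = crawl("home", LINKS)
```

The `seen` set is what stops the spider from looping forever when pages link back to each other, as “about” and “home” do here.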

What is a Web Crawler

Discussing web crawlers is no different from discussing spiders. Why is that? To be honest, the two are mirror images of each other: the same definition wrapped in different terms. Crawler is simply another name for spider, and vice versa. According to TechTarget, a crawler is a program that visits sites and reads their pages and other information in order to create entries for a search engine index.

The role of web crawlers is irreplaceable, and all search engine services need them. Each service gives its crawler a different name even though they all serve the same purpose, so it’s no wonder some people call it a spider or a bot. Crawlers are programmed to visit all sites submitted to search engine services. They visit not only the site as a whole but all of its pages one by one, whether those are newly published articles or updated ones.

Crawlers are designed to be selective, visiting and indexing either the whole site or only specific pages. Then why are they given that name? It refers to the way they work: the program moves slowly through a site one page at a time, following the links from one page to another until all pages have been read. In other words, the program literally crawls through the site.

Simply put, a web crawler visits all of a site’s content and then passes the data to the indexer, so that it can be processed by the search engine’s query engine.

Understanding Metacrawler

As an additional reference, there is another program that is slightly related, namely the metacrawler program.

According to Wikipedia, MetaCrawler is a metasearch engine that combines web search results from Google, Yahoo!, Bing, Ask.com, About.com, MIVA, Looksmart and other popular search engines. MetaCrawler also gives users the option to search for images, videos, and news. MetaCrawler rose to prominence in the late 1990s when the verb “metacrawled” was used by talk show host Conan O’Brien on TRL. MetaCrawler is a registered trademark of InfoSpace, Inc.

MetaCrawler was originally developed in 1994 at the University of Washington by graduate student Erik Selberg and Professor Oren Etzioni as Erik Selberg’s Ph.D. qualifying project. Originally, it was created to provide a reliable abstraction layer over web search engines in order to study semantic structure on the World Wide Web. MetaCrawler was not the first metasearch engine on the World Wide Web. That feat belongs to SavvySearch, which was developed at Colorado State University, although it launched only four months before MetaCrawler.
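The idea of combining result lists from several engines can be sketched as follows. This is not how MetaCrawler itself merges results, which the sources above do not describe; the reciprocal-rank scoring, the `merge_results` function, and the example URLs are all assumptions chosen to keep the illustration short.

```python
def merge_results(result_lists):
    """Merge ranked URL lists; URLs ranked highly by several engines rise to the top."""
    scores = {}
    for results in result_lists:
        for position, url in enumerate(results, start=1):
            # A first-place result adds 1.0, second place 0.5, third place 1/3, ...
            scores[url] = scores.get(url, 0.0) + 1.0 / position
    # Sort by score (highest first); ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Made-up result lists standing in for two underlying search engines.
engine_a = ["x.example", "y.example", "z.example"]
engine_b = ["y.example", "x.example", "w.example"]
merged = merge_results([engine_a, engine_b])
```

URLs returned by both engines accumulate a higher score than URLs returned by only one, so they appear first in the merged list, and each URL appears only once.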


What is an Indexer

Indexing is the process of collecting, parsing and storing data for later use by the search engine. The indexer can therefore be thought of as the part, or “place”, that holds all the data collected by the search engine. Its job is closely related to returning the results of a query for the keywords entered.

The indexer’s function is to speed up the search process. It works on the same principle as the index in a dictionary or book. Without an index, the search engine would have to work harder and more slowly to answer users’ queries, because it would need to scan every web page and piece of data for the keywords each time. The index also helps ensure that nothing relevant is missed.
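The book-index principle is commonly implemented as an inverted index, which maps each word to the documents that contain it. The sketch below is a minimal version under stated assumptions: the `build_index` and `search` functions, the whitespace tokenization, and the sample documents are invented for illustration and leave out everything a production indexer would add (stemming, ranking, storage on disk).

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from the documents for the first word, then narrow down word by word.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Made-up documents standing in for crawled pages.
documents = {
    1: "web crawler visits pages",
    2: "indexer stores pages for search",
    3: "crawler feeds the indexer",
}
index = build_index(documents)
hits = search(index, "crawler indexer")
```

Answering the query only requires looking up two entries in the index and intersecting them, instead of rereading every document.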

The indexer does not work alone; it works hand in hand with the web spider or web crawler. That way information gathering becomes lighter and faster, and it also helps keep the data up to date and free from spam. Several aspects are closely tied to a search engine index, such as design factors and data structures. The design factors shape the index architecture and determine how the index works.

What is a Robot or Bot

Apart from indexers, spiders and crawlers, there is one more term worth knowing: the robot, commonly called a bot. In the basic sense, a robot is a program that runs automatically on the internet, although not every bot runs entirely unattended.

Some bots only execute commands after receiving specific input. Bots come in several types, but most of those circulating on the internet are crawlers, chat room bots, or malicious bots. Crawlers are often used by search engine services to scan websites regularly.

This type of bot will move slowly down the site following the links on each page. The crawler then stores the content of each page in the search index. Using complex algorithms, search engines can display the most relevant pages crawlers have found to answer specific questions.

Apart from “robot”, search engine robots go by other names that are no less popular, such as bots, wanderers, spiders, and crawlers. Whatever the name, the program is a mainstay of world-famous search engine services such as Google, Bing, and Yahoo! These robots are deliberately designed to build databases. In fact, most robots work like web browsers, except that they don’t need user interaction.

It’s easy for robots to access a site. As noted above, bots use links to find and connect to other sites. That way they can index the title, summary or entire content of a document much faster than humans can. Therefore, search engine services are constantly trying to improve the quality of their bots for better search results.

The presence of robots is indeed beneficial to search engines, especially since they are known for their efficiency and speed. However, a poorly designed robot can be disastrous for the site owner. Why is that? When “low-quality robots” crawl the site, the load on the server increases. Therefore, site owners can exclude or limit robot access by placing a robots.txt file on the server. The file contains instructions about which parts of the site may be accessed.
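To show what such instructions look like and how a well-behaved robot might check them, the sketch below uses `RobotFileParser` from Python’s standard `urllib.robotparser` module. The robots.txt content, the bot names `ExampleBot` and `BadBot`, and the example.com URLs are made up for this illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real file is served from the site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
# parse() accepts the file's lines directly, so no network request is made here.
parser.parse(ROBOTS_TXT.splitlines())

# An ordinary bot may read public articles but not the /private/ section.
allowed_public = parser.can_fetch("ExampleBot", "https://example.com/articles/seo")
allowed_private = parser.can_fetch("ExampleBot", "https://example.com/private/notes")

# The bot named in the second rule group is excluded from the whole site.
allowed_badbot = parser.can_fetch("BadBot", "https://example.com/articles/seo")
```

Note that robots.txt is only a set of instructions: it works because reputable crawlers choose to check it before fetching a page.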

Conclusion

Looking at the explanations of web spiders, web crawlers and bots, it can be concluded that the three terms have similar meanings. Spider is another name for crawler and bot, and vice versa. Almost all search engine services, especially giant players like Google, Bing and Yahoo, rely heavily on them for indexing.

The indexer, or indexing process, depends heavily on the readiness of the spider, crawler or bot. A neat, well-built crawler helps data be collected more quickly and accurately. Generally, every search engine service has its own robot, which in turn affects its search results. To date, Google’s bot is still widely regarded as the leader because it tends to produce information relevant to the questions or keywords entered.

It is very important for site owners to know what indexers, spiders, crawlers, and robots or bots are, so that there are no misunderstandings and they better understand the ins and outs of the world of SEO. Users, in turn, can get effective results without wasting a lot of time.

Source: Komponen Search Engine: web crawler, indexer, spider