Google Search Appliance Deployment Architectures August 2014
© 2014 Google
Deployment Architectures
The Google Search Appliance (GSA) provides a number of features that allow it to be deployed in a variety of architectures to meet diverse requirements for high availability, throughput, and content scale. This paper provides guidelines for using these architectural features, as well as other supporting deployment components, to design an appropriate GSA solution architecture.
About this document The recommendations and information in this document were gathered through our work with a variety of clients and environments in the field. We thank our customers and partners for sharing their experiences and insights.
This paper describes some examples of typical deployment architectures and how the GSA can be deployed to handle the needs of a large, performance-heavy, and failure-intolerant user community.
Roles and Responsibilities
GSA administrators: Deploy and configure the GSAs to best serve content to the general search community in the organization.
Network administrators: Confirm network capability and availability to handle the traffic the different configurations might impose on the corporate network.
Content owners: Broker access to the various content sources the GSA will need to access to facilitate crawl, index, and search.
Deployment architecture components: GSA, SNMP, load balancer, network switch, search proxy server
Other resources
● Learngsa.com provides educational resources for the GSA.
● GSA product documentation provides complete information about the GSA.
● Google for Work Support Portal provides access to Google support.
Contents
About this document
Chapter 1 Architectural Components
    Core architectural features of the GSA
    GSA Mirroring
    GSA Unification
    GSA Distributed Crawling and Serving
    Combining GSA Mirroring, GSA Unification and Distributed Crawl and Serve
Chapter 2 Architecting for High Availability
    GSA High Availability
    Monitoring and Failure Detection
    Connector High Availability
    Connector High Availability for Content Traversal
    Security Mechanism High Availability
Chapter 3 Architecting for High Performance
    GSA load-balancing for query performance
    Security mechanism performance
    GSA feature performance considerations
Chapter 4 Architecting for large-scale indexes
    GSA crawling strategies
Chapter 1 Architectural Components Designing the architecture of a Google Search Appliance implementation requires consideration of a number of features of the GSA, as well as integration points with supporting components of a GSA deployment. This chapter provides an overview of the role each component, such as GSA connectors and security mechanisms, plays within the architecture of a GSA deployment.
Core architectural features of the GSA
The GSA provides three main features that can be deployed to handle high availability, performance, and scaling requirements:
● GSA Mirroring
● GSA Unification
● GSA Distributed Crawling and Serving
GSA Mirroring GSA Mirroring provides replication of a search index and most configuration settings to one or more “replica” search appliances. Mirroring architectures can be designed for either Active-Passive or Active-Active configurations, and can be used as a means for providing high-availability and high-throughput in a GSA deployment. For more information about GSA Mirroring, see the GSA Help Center article: Configuring GSA Mirroring and Specifications and Limits: GSA Mirroring.
GSA Unification GSA Unification allows a group of search appliances to be configured so that documents indexed separately over several search appliances can be searched by a single search query. The search appliances in the configuration each index different sets of documents and are set up with their own collections, front ends, and other administrator-configurable features. When a user performs a search, the search appliances communicate with each other to merge results from their separate indexes. A unified GSA environment is typically used when there is a need to provide search and index services for a larger corpus of documents than a single Google Search Appliance can accommodate, or to aggregate corpuses that are geographically distributed. One search appliance in the configuration is designated the primary search appliance or primary node, whilst other search appliances are designated the secondary search appliances or “secondary nodes.” For more information about GSA Unification, see the GSA Help Center articles Configuring GSA Unification and Specifications and Limits: GSA Unification.
GSA Distributed Crawling and Serving Distributed crawling and serving (DC+S) is a Google Search Appliance feature that expands the search appliance's document capacity. Distributed crawling and serving allows for several search appliances to act as if they are a single search appliance. After DC+S is enabled, all crawling, indexing, and serving are configured on one search appliance, called the “admin master”, and other appliances are known as “nonmaster” appliances. For example, in a case where two search appliances are each licensed to crawl 100 million documents, enabling DC+S will allow the search appliances to crawl a total of 200 million documents. For more information about GSA DC+S, see the GSA Help Center articles Configuring Distributed Crawling and Serving and Specifications and Limits: GSA Distributed Crawling and Serving.
Combining GSA Mirroring, GSA Unification and Distributed Crawl and Serve
As of GSA version 7.2, the following architectural combinations are supported:
GSA Mirroring can be used in:
● A stand-alone GSA Mirroring network, with one master appliance, linked to one or more replicas.
● A GSA Unification network.
● A GSA Distributed Crawl and Serve network, where replica appliances can be established for each of the non-master appliances.
GSA Unification can be used in:
● A stand-alone GSA Unification network, with one primary appliance, linked to one or more nodes.
● A GSA Unification network with GSA Mirroring for any node.
GSA Distributed Crawl and Serve can be used in:
● A stand-alone GSA Distributed Crawl and Serve network, with one master appliance, linked to one or more non-master appliances.
● A GSA Distributed Crawl and Serve network with GSA Mirroring for all nodes.
Chapter 2 Architecting for High Availability
GSA High Availability
To achieve high-availability serving with the GSA, deploy a load balancer in front of multiple search appliances that contain the same search index. Through monitoring and failure detection, a failover event involves switching the load balancer to redirect traffic to a replica GSA. This can be done manually or automatically, depending on the load-balancing or monitoring solution in use. The GSA does not provide any automatic load-balancing or failover features, so load balancing and failover must be handled by external components.
To achieve a consistent search index across multiple appliances, GSA Mirroring is typically used. In cases where GSA Mirroring is not an option (for example, due to a lack of network connectivity between appliances), each search appliance should be configured to crawl, index, and serve identical content, so as to ensure a consistent user experience in the event of a failover.
Note on licensing: If all traffic is directed only to the primary GSA, this architecture is known as an Active-Passive configuration. In this scenario, a “production” GSA license is required only for the primary appliance, whilst the replica appliances can utilize a “backup” license.
For a further discussion of GSA load balancing, see the GSA Help Center article Configuring Search Appliances for Load Balancing or Failover.
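The failover decision described above can be sketched as a small external script that probes each appliance and tells the load balancer which one to target. This is a minimal sketch, not a definitive implementation: the probe URL, collection and front-end names, and hostnames are all illustrative assumptions, and a production setup would normally use the load balancer's own health-check mechanism.

```python
# Hedged sketch of an external Active-Passive failover check.
# The probe issues a lightweight search request; the URL parameters
# below (collection/front-end names) are assumptions -- adapt them
# to your own configuration.
from urllib.request import urlopen

def is_healthy(base_url, timeout=5):
    """Probe an appliance with a lightweight search request."""
    try:
        url = (base_url + "/search?q=healthcheck&site=default_collection"
               "&client=default_frontend&output=xml")
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def choose_active(primary, replicas, healthy):
    """Return the appliance the load balancer should target:
    the primary if healthy, otherwise the first healthy replica."""
    if healthy(primary):
        return primary
    for replica in replicas:
        if healthy(replica):
            return replica
    return None  # total outage: alert the administrator
```

A monitoring job would call `choose_active("http://gsa1", ["http://gsa2"], is_healthy)` periodically and reconfigure the load balancer (or raise an alert) when the result changes.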
Monitoring and Failure Detection
Using SNMP to monitor the GSA
The Google Search Appliance supports SNMP (Simple Network Management Protocol) integration so that you can receive messages when the operational state of the search appliance changes. The search appliance listens for SNMP requests on UDP port 161, supporting SNMP v1, v2, and v3. The SNMP server on the Google Search Appliance provides a subset of the status information about the search appliance that is available in the Admin Console. The search appliance supports the SNMP Get and GetNext commands. It does not support Trap, nor setting values through SNMP Set.
Note: This feature is typically configured only in environments where SNMP is already used to manage other devices on the network, such as routers, switches, application servers, or storage servers. If SNMP is not used, consider implementing a custom server monitoring system (see below). For more details, see the GSA Admin Console help page for Administration > SNMP Configuration.
Using a custom monitoring system
Where a monitoring system is not available to check the health of the GSA, a custom system can be set up using a simple web application. The web application can use the Admin API to access the GSA status information and render it to a status page for monitoring by the administrator. The Admin API can also be used to monitor the Google Search Appliance and store the information in a database or log files; the information can then be reviewed over time for analytical purposes.
For best practices on implementing a monitoring system for search queries, see Monitoring GSA Serving.
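As a rough illustration of the "store the information in log files" approach, the sketch below polls a status URL and appends timestamped observations to a CSV file. The status URL shown is a hypothetical placeholder; substitute whichever Admin API feed your GSA version exposes, and parse its actual response format rather than just the HTTP status.

```python
# Minimal sketch of a custom GSA monitor that appends each
# observation to a CSV log for later trend analysis.
# STATUS_URL is an assumption, not a documented endpoint.
import csv
import time
from urllib.request import urlopen

STATUS_URL = "http://gsa.example.com:8000/feeds/diagnostics"  # hypothetical

def fetch_status(url=STATUS_URL, timeout=10):
    """Return a coarse health string based on the status feed."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return "OK" if resp.status == 200 else "DEGRADED"
    except OSError:
        return "DOWN"

def record_status(status, path="gsa_status_log.csv", now=None):
    """Append one timestamped observation to the log file."""
    now = now or time.strftime("%Y-%m-%dT%H:%M:%S")
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([now, status])

# A cron job (or a loop with time.sleep) would run:
#   record_status(fetch_status())
```

The resulting CSV can be loaded into a spreadsheet or database to review availability trends over time, as the paper suggests.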
Connector High Availability GSA Connectors perform two main functions that have the potential to impact availability: 1) content discovery and traversal; and 2) security services for authentication and authorization.
Connector High Availability for Content Traversal
Although content traversal is a less critical service when establishing high availability, it is sometimes required in environments where content freshness is a critical priority. Where a connector is in use, the approach to establishing high availability for crawling differs depending on the type of connector and the version of the connector framework (i.e., 3.x or 4.x).
For most connectors on the Connector Manager 3.x framework (with the exception of the File System connector), outages in connector crawling (related to the connector itself) require manual intervention, to restart the connector or enable another connector for crawling. Since these connectors are stateful, most recovery exercises for connector crawling failures require a restart (or “reset”) of connector traversal. In some cases, the connector’s traversal state information (such as the SharePoint connector’s XML state file) can be backed up in order to recommence traversal from the current position, instead of requiring a full traversal.
For connectors on the Connector 4.x framework (and also the File System connector 3.x), high-availability crawling can be achieved by deploying “backup” instances of the connector and placing them behind a load balancer. These secondary connectors should have full traversal and discovery disabled, so that they only perform “retrieval” of specific documents upon request from the GSA. In this way, if one connector is down, the load balancer can route crawl requests to another connector and maintain crawl availability. Manual intervention is still required in this scenario for the “listing” and discovery of new content from the primary connector that is configured for traversal.
Diagram 1: Example architecture for achieving high availability for content traversal via connectors
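The routing decision made by the load balancer in this architecture can be illustrated with a small sketch: retrieval requests rotate across connector instances, skipping any instance that is down. The connector hostnames are illustrative, and a real deployment would delegate this logic to the load balancer itself rather than custom code.

```python
# Sketch of round-robin routing of GSA retrieval requests across
# redundant connector instances, skipping unhealthy ones.
# Instance names are illustrative placeholders.
from itertools import cycle

class RetrievalRouter:
    def __init__(self, connectors):
        self.connectors = connectors
        self._ring = cycle(connectors)

    def route(self, is_up):
        """Return the next healthy connector, or None if all are down."""
        for _ in range(len(self.connectors)):
            candidate = next(self._ring)
            if is_up(candidate):
                return candidate
        return None
```

With instances `["conn-a", "conn-b", "conn-c"]` and `conn-b` down, successive calls alternate between `conn-a` and `conn-c`, which is the behavior the diagram describes: crawl availability is maintained as long as any one retrieval instance is up.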
Connector High Availability for Security Mechanisms For connectors that provide authentication and authorization services, high availability can be achieved by deploying multiple connectors and using a load balancer to direct traffic between connectors. For more details on this scenario, refer to GSA Notes from the Field: Introduction to Content Integration.
Security Mechanism High Availability For secure search environments, high availability of security mechanisms is a critical concern, as outages in authentication and authorization mechanisms will result in outages in secure search. High availability for security mechanisms is typically established by deploying a load-balancer in front of multiple instances of the security mechanism. For example, a load-balancer can be deployed in front of multiple connectors for authentication or authorization requests or a load-balanced URL may be configured for Single Sign-On systems.
Diagram 2: Example architecture for achieving high availability and performance for security mechanisms
Chapter 3 Architecting for High Performance
GSA load-balancing for query performance
The number of queries per second (QPS) that a single search appliance supports depends on the length of time each query takes, the number of concurrent requests, and the type of front-end features that are enabled. A single search appliance can support up to 50 concurrent requests, but the QPS of a single GSA is often lower than this, particularly with secure search. For more information about concurrent requests, see the Help Center articles Designing a search solution and How many concurrent users can a GSA handle?
To reach desired levels of query throughput, multiple appliances can be deployed behind a load balancer, which splits search traffic equally across the GSAs. Each appliance should contain the same index, either through GSA Mirroring or independent crawling of the same content. In this configuration, the GSAs are considered to be in an Active-Active configuration (since all GSAs are serving results actively) and should each be licensed as a “production” appliance.
In load-balancing configurations that include secure search, sticky sessions should be used, since each secure search session is tied to a specific appliance. If a user is redirected to a different appliance after they have already authenticated, they may be asked to authenticate again.
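The sizing arithmetic behind an Active-Active tier can be sketched as follows. The per-appliance QPS figure is an assumption that must be measured in your own environment (it varies with secure search and enabled features); the headroom factor is likewise an illustrative choice, included so that the tier can survive one appliance being drained during a failover.

```python
# Back-of-envelope sizing for an Active-Active GSA serving tier.
# qps_per_gsa must come from your own load testing; the 25%
# default headroom is an illustrative assumption.
import math

def appliances_needed(peak_qps, qps_per_gsa, headroom=0.25):
    """Appliances required to serve peak_qps while reserving
    `headroom` spare capacity on each appliance."""
    effective = qps_per_gsa * (1 - headroom)
    return max(1, math.ceil(peak_qps / effective))
```

For example, a hypothetical 40 QPS peak against appliances measured at 15 QPS each (with secure search) yields `appliances_needed(40, 15)` = 4 production-licensed appliances behind the load balancer.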
Security mechanism performance
Another common performance bottleneck is the authorization of search results, particularly when late-binding (real-time security checks against a content system) is used to determine the permissions of a document. Due to the performance gains of performing security checks on the GSA itself, early-binding is the recommended method of authorization. However, there are circumstances where late-binding is needed, because Access Control Lists (ACLs) cannot be extracted, or because the security permissions of documents change frequently and real-time checks are required.
To improve the performance of authorization, a number of different strategies can be taken:
● If high load causes degradation in authorization performance, deploy multiple, load-balanced instances of the authorization mechanism (e.g. multiple SAML providers).
● If a connector is used for authorization, and is also used for crawling and traversal of content, deploy other connectors dedicated to authorization, and use a proxy or load balancer to segregate crawling traffic from authorization requests from the GSA.
Whilst the authentication process is a less common performance consideration (since users only perform it once per session), the same strategies can be employed for optimizing the performance of authentication mechanisms. See Diagram 2 for an example of implementing high performance security mechanisms.
GSA feature performance considerations
Whilst the GSA provides various features to enhance the user experience, when designing a GSA deployment for high performance there are a number of considerations regarding the performance impact of enabling search features:
● Dynamic Navigation, when used with secure search, requires the GSA to perform a much larger number of authorization requests. Without Dynamic Navigation, the GSA performs authorization requests until it retrieves 10 permitted results for the user when using late-binding (assuming 10 results are requested), and 1,000 permitted results when using early-binding. With Dynamic Navigation, the GSA performs authorization requests until it retrieves 10,000 permitted results. This has a considerable impact in scenarios using late-binding, and may impact scenarios using early-binding where a user has limited access to content. For additional details, see the Help Center article Search results are slow when dynamic navigation is enabled.
● Dynamic Results Clustering performs many authorization requests to determine result clusters, and also submits an additional HTTP request to the GSA.
● OneBox requests are performed synchronously and hence add to overall query response time, particularly where requests are made to a third-party system.
● Query Suggestions sends additional HTTP requests to the GSA to determine suggestions as a user is typing. For deployments with many users, this can add significant load to the GSA. This impact can be lessened by adjusting the timeout after which a keystroke event fires the request to the Query Suggestions service on the GSA (see the Help Center article Impact of query suggest on the general search performance).
● Search result filters (e.g. duplicate folder, title, or snippet filters) can introduce performance impacts in secure search scenarios, due to an increased number of authorization requests required to return a result set. For example, if a user requests 100 results in a late-binding scenario, and the first 100 permitted results that the GSA finds happen to fall within the same directory, duplicate directory filtering would group these under one result, and another 99 permitted results would need to be found before the search results are displayed.
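The duplicate-directory example above can be made concrete with a small simulation: counting how many late-binding authorization checks are consumed before a page of distinct-directory results is filled. The result stream and directory names are synthetic, and the model is simplified (each result is one authorization check, and same-directory results collapse into one entry).

```python
# Simplified model of late-binding authorization cost under
# duplicate directory filtering: same-directory permitted results
# collapse into a single entry, so filling a page can require far
# more authorization checks. Inputs are synthetic.
def authorizations_until_page(docs, page_size):
    """Count authorization checks until page_size distinct-directory
    results are collected. docs is a list of (directory, permitted)."""
    checks, seen_dirs = 0, set()
    for directory, permitted in docs:
        checks += 1
        if permitted and directory not in seen_dirs:
            seen_dirs.add(directory)
            if len(seen_dirs) == page_size:
                break
    return checks
```

In the paper's scenario, 100 permitted results in one directory followed by results in 99 other directories costs 199 checks to fill a 100-result page, versus 100 checks when every result falls in its own directory.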
Chapter 4 Architecting for large-scale indexes
For deployments involving a large number of documents, the time it takes to perform a full reindex of content can run to days or even weeks. In these situations, it can be extremely beneficial to consider architectural strategies that optimize the way documents are indexed.
GSA crawling strategies
For crawling web content with the GSA, a number of strategies can be taken to optimize indexing speeds:
● Host load settings can be used to increase the number of threads used to crawl the web server(s), along with control over the time periods in which crawling can be performed more aggressively.
● Multiple start paths should be used if possible, as this spawns separate crawling threads. Having just one start path for a large amount of content leads to slower crawling and discovery than a configuration with multiple start paths.
● Ensure that the GSA’s network location is as close to the web server as possible, to minimize latency in crawling and downloading files.
● If crawl throughput is of critical importance, consider using GSA Distributed Crawl and Serve, so that the crawl load is distributed across each appliance in the DC+S network.
Connector traversal strategies
For large-scale deployments involving connectors, a connector traversal strategy should be established that takes advantage of parallel indexing and considers the importance and priority of the various content being indexed. Here are some examples of common strategies for indexing multi-million document repositories:
● Deploy one connector per connector manager, and maximize the number of threads used within the connector (where available, such as with the File System connector, this can typically be specified in the connector properties files).
● Configure each connector to index a subset of content (e.g. 3 million documents per connector). If the content cannot be easily split by categories, the “Follow patterns” in the connector can be used to split content by pattern (e.g. splitting directories across connectors by alphabetical order, such as [a-g], [h-m], etc.).
● Increase the GSA host load, or deploy multiple connectors, to increase traversal throughput (similar to the approach for high availability, described above).
● If some content is higher priority than other content, make sure that the connector traversal strategy reflects that, either by giving its connector a higher traversal rate than others (to ensure that content is indexed first), or by organising traversal schedules so that high-priority content is indexed before other content.
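The alphabetical "Follow patterns" split mentioned above can be generated mechanically. This is a hedged sketch: the URL prefix and the `[a-g]`-style character-class syntax are illustrative, and must be adapted to the pattern rules of the specific connector in use.

```python
# Sketch: split the alphabet into N contiguous ranges and emit one
# "Follow pattern" per connector instance, as in the [a-g], [h-m]
# example above. Prefix and pattern syntax are illustrative.
import string

def alpha_follow_patterns(prefix, instances):
    """Return one follow pattern per connector instance, splitting
    a-z into `instances` contiguous ranges of near-equal size."""
    letters = string.ascii_lowercase
    size, rem = divmod(len(letters), instances)
    patterns, start = [], 0
    for i in range(instances):
        end = start + size + (1 if i < rem else 0)
        chunk = letters[start:end]
        patterns.append("%s[%s-%s]" % (prefix, chunk[0], chunk[-1]))
        start = end
    return patterns
```

For example, splitting a hypothetical file share across four connectors, `alpha_follow_patterns("smb://fileshare/docs/", 4)` yields patterns covering `[a-g]`, `[h-n]`, `[o-t]`, and `[u-z]`, so each connector traverses roughly a quarter of the directories in parallel.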