Google Search Appliance Deployment Scenario Handbook May 2014

© 2014 Google


Deployment Scenario Handbook

This document describes scenarios for deploying a Google Search Appliance (GSA).

About this document

The recommendations and information in this document were gathered through our work with a variety of clients and environments in the field. We thank our customers and partners for sharing their experiences and insights.

What’s covered

This guide describes more advanced GSA configurations that an architecture might require as more content sources are integrated with the GSA.

Primary audience

First-time Google Search Appliance administrators, experienced GSA administrators, and GSA functional analysts.

IT environment

GSA configured for public search with internet and intranet web sites and file shares.

Deployment phases

Initial configuration of the GSA and onboarding of additional content sources onto the GSA.

Other resources

● Learngsa.com provides educational resources for the GSA.
● GSA product documentation provides complete information about the GSA.
● Google for Work Support Portal provides access to Google support.
● GSA Notes from the Field provide help for designing and deploying an enterprise search solution that is based on the Google Search Appliance (GSA).


Contents

About this document
Chapter 1 Basic Search on a Public Website
    Scenario overview
    Requirements
    Assumptions
    Key considerations
    Recommended approach
    Alternative approach
    Project task overview
    Long term enhancements
Chapter 2 Basic Internal Search
    Scenario overview
    Requirements
    Assumption
    Key considerations
    Recommended approach
    Alternative approaches
    Project task overview
    Long term enhancements
Chapter 3 Internal Search over intranet, File System, and SharePoint
    Scenario overview
    Requirements
    Assumptions
    Key considerations
    Recommended approach
    Alternative approaches
    Project task overview
    Long term enhancement
Chapter 4 Indexing through Feeds
    Scenario overview
    Requirements
    Assumptions
    Key considerations
    Recommended approach
    Alternative approaches
    Project task overview
Chapter 5 Cookie Translation with Silent Authentication
    Scenario overview
    Requirements
    Assumptions
    Key considerations
    Recommended approach
    Alternative approach
    Project task overview
    Long term enhancements
Chapter 6 Silent Authentication: Integrating with NTLM and the SAML Bridge
    Scenario overview
    Requirements
    Assumptions
    Key considerations
    Recommended approach
    Alternative approach
    Project task overview
    Long term enhancement
Chapter 7 Implementing a Reverse Proxy for Perimeter Security and Other Reasons
    Scenario overview
    Requirements
    Assumptions
    Key considerations
    Recommended approach
    Alternative approach
    Project task overview
    Long term enhancements
Chapter 8 Relevancy Testing
    Scenario overview
    Requirement
    Assumptions
    Key considerations
    Recommended approach
    Alternative approach
    Project task overview
    Long term enhancement
Summary

Chapter 1 Basic Search on a Public Website

Scenario overview

Acme Inc. is a large, multinational producer of consumer electronics with a large external web presence. Their web content includes general corporate information, as well as specific marketing material for each of their product units. They also run some support forums for their products. In the use case for this scenario, they want to use the Google Search Appliance to drive a general search box, which would search across all content, as well as specific search boxes for each of their product units. All of Acme Inc.’s external web properties are public, with no access restricted to specific users or groups.

Requirements

● Index all web-exposed, public content.
● Provide a general search box, which would return results across all indexed content, including content across all product units.
● Provide specific search boxes, which would return results specific to a particular product unit.
● Style the search box and result page according to Acme Inc. corporate branding standards.
● Index the Acme Inc. support forum every hour, because its content is rapidly changing and/or being added.
● Handle 20 queries per second at peak load time with high availability in case of a GSA issue/outage.
● Do not mix content from different languages in search results.

Assumptions

● There are distinct pages for Acme Inc.’s web content in each language.
● There are existing web properties, on which search boxes will be placed.

Key considerations

● Ensure there is enough capacity to handle 20 queries per second at peak load times.
● Decide whether to present results directly from the search appliance or by means of a web application presentation layer.
● Use reporting or analytics to gauge user interaction with search materials.

Recommended approach

Google’s recommended approach for implementing basic search on a public website covers the following areas:

● Deployment architecture
● Crawl and index configuration
● Front end configuration
● Administrative items

Deployment architecture

To account for load and failover capabilities, Acme Inc. will use a total of three GSAs in a production configuration. Two of the three will be used as active-active search appliances to provide adequate capacity. The third search appliance will be used as a hot backup for failover. Acme Inc. will configure all three GSAs for mirroring, with one acting as the primary search appliance, on which all configuration changes should be made.

To achieve a highly available, active-active configuration, they will deploy a load balancer in front of the GSAs. The load balancer will serve both of the following functions:

● Actively distribute search query traffic evenly across the two active-active GSAs.
● Ping the two active-active GSAs, failing over to the hot backup unit if one of the active units does not return the expected response.
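The failover rule in the second bullet can be sketched as a small decision function. This is only an illustration of the logic, not a load balancer configuration; the function name `choose_targets` and the appliance names are hypothetical, and a real load balancer would run its own health probes.

```python
def choose_targets(active_health, backup_name="gsa-backup"):
    """Return the appliances that should receive query traffic.

    active_health maps each active appliance name (hypothetical names)
    to True if its last health check returned the expected response.
    """
    healthy = [name for name, ok in active_health.items() if ok]
    # If any active unit failed its health check, bring in the hot backup.
    if len(healthy) < len(active_health):
        healthy.append(backup_name)
    return healthy
```

With both active units healthy, traffic is split between them; when one fails its check, the hot backup takes its place in the pool.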

Because the GSA is being deployed in an existing web application, Google recommends the approach of processing requests and responses from the GSA through the use of a web application presentation layer. In this case, the GSA will be used as a service, with the web application layer sending down request queries and parsing the resulting XML response in accordance with marketing and branding guidelines for page formatting. Because the search page won’t be exposed directly on the GSA, the GSA should not be exposed to the public and it should be firewalled on Acme Inc.’s network utilizing perimeter security via network firewalls.
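As a rough sketch of the presentation-layer approach, the web application could parse the XML response along the following lines. The element names used here (`R` for a result, `U` for its URL, `T` for its title, `S` for its snippet) and the sample response are assumptions for illustration and should be checked against the Results DTD; `parse_results` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample response for illustration only.
SAMPLE_RESPONSE = """<GSP VER="3.2">
  <Q>television</Q>
  <RES SN="1" EN="2">
    <M>2</M>
    <R N="1"><U>http://www.example.com/tv/a</U><T>Model A</T><S>A snippet</S></R>
    <R N="2"><U>http://www.example.com/tv/b</U><T>Model B</T><S>Another snippet</S></R>
  </RES>
</GSP>"""


def parse_results(xml_text):
    """Extract rank, URL, title, and snippet for each result element."""
    root = ET.fromstring(xml_text)
    results = []
    for r in root.iter("R"):
        results.append({
            "rank": int(r.get("N")),
            "url": r.findtext("U"),
            "title": r.findtext("T", default=""),
            "snippet": r.findtext("S", default=""),
        })
    return results
```

The web application would then render these dictionaries with its own templates, which keeps branding decisions out of the search appliance.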

Crawl and index configuration

Acme Inc. will configure collections for each language set of the web properties. In this way, the site parameter can be used to distinguish queries meant for a particular language, depending on which page the user initiated the search from. Because each product unit wants to have search over its own documents only, Acme Inc. will also configure collections for each product unit.

Acme Inc. will configure start URLs for top-level pages. For content that changes frequently, they can use crawler frequency to make sure the content gets crawled at least once a day. For more control over specific crawl times, they can use the Admin API or a web feed to ensure that specific pages get into the crawl queue at multiple points during the day.

Front end configuration

Each search box deployed on the Acme Inc. web properties will have a set of query parameters tied to it. These parameters will be sent down with the query to the GSA to shape the type of results that appear in the search results page. For example, a search box deployed on an English product page should pass down the collection parameters for English, as well as that specific product unit.

The Results DTD should be consulted to see in which XML elements the GSA returns information. These elements should be parsed by the front end and displayed on the page accordingly.
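A minimal sketch of how a search box might build its request, assuming the search protocol's `q`, `site`, `client`, and `output` parameters and assuming that collection names joined with a period in `site` are combined with AND. The host name, collection names, front end name, and the helper `build_search_url` are hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(host, query, collections, frontend="default_frontend"):
    """Build a search request restricted to the intersection of collections."""
    params = {
        "q": query,
        # Assumption: "." between collection names means AND.
        "site": ".".join(collections),
        "client": frontend,
        "output": "xml_no_dtd",  # raw XML for the web application to parse
    }
    return "http://%s/search?%s" % (host, urlencode(params))
```

For example, a search box on an English television product page could call `build_search_url("gsa.example.com", "wall mount", ["english", "televisions"])`.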

Administrative items

Acme Inc. will use the Advanced Search Reporting feature to create reports about what users are searching for and what they are clicking in search results pages. These reports should be generated and analyzed frequently, as they are a good indicator of general search satisfaction.

Alternative approach

Instead of using a web application layer in front of the GSA, Acme Inc. could expose search on the GSA directly, customizing the stylesheet for a front end accordingly. Although more difficult to customize fully, this approach might lead to less development effort and make it easier to take advantage of new, out-of-the-box front end features that become available on the GSA. With this approach, make sure there isn’t any secure content marked as “public” in the GSA index, as users will get direct access to run queries on the search appliance. A reverse proxy can be used to restrict access to the GSA by whitelisting the URL patterns that can be submitted.

To keep the number of collections defined on the GSA at a reasonable level (below 200), an alternative to using separate collections for product units is to index a specific metadata parameter for each product unit along with the content. This metadata parameter would then be applied as a filter to queries for a given product unit, so that the GSA retrieves only content applicable to that product unit.
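The metadata-filter alternative might look like the following sketch. It assumes the search protocol's `requiredfields` parameter with a `name:value` filter; the metadata name `product_unit`, the collection name, and the helper `product_unit_query` are hypothetical, and values containing special characters may need additional encoding.

```python
from urllib.parse import urlencode


def product_unit_query(query, product_unit, collection="english"):
    """Build a query string filtered to one product unit by metadata."""
    params = {
        "q": query,
        "site": collection,
        # Assumption: documents carry a "product_unit" meta tag.
        "requiredfields": "product_unit:%s" % product_unit,
        "output": "xml_no_dtd",
    }
    return "/search?" + urlencode(params)
```

One language collection can then serve every product unit's search box, with only the filter value changing per box.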


Project task overview

The following table lists the project tasks and activities for implementing basic search on a public website.

Task: Plan deployment architecture
Activities:
● Rack and cable search appliances
● Configure appliances and set up mirroring
● Configure load balancer in front of the GSAs
● Set up perimeter security around GSAs

Task: Configure crawl and index
Activities:
● Set up start URLs for crawling content
● Configure collections identified for languages and product units
● Identify frequently changing content and ensure it gets indexed one or more times a day

Task: Configure front end
Activities:
● Enable existing web application layer for the addition of search boxes
● Parse Response XML from GSA and display results in accordance with company UI guidelines

Long term enhancements

● Tweak search and features based on reports showing user search patterns.
● Identify content for KeyMatches.
● Enable more complex synonym lists.
● Enable dynamic navigation for metadata-driven facet navigation.

Chapter 2 Basic Internal Search

Scenario overview

Acme Inc. has a large internal web presence that extends to different parts of the globe. In the use case for this scenario, they want to consolidate search across all their internal websites and pages in one place, so that employees do not have to go to different websites to search for information. Although all users can access the Acme Inc. intranet, not all of them have access to all the information on the various sites in the corporate domain. For example, Human Resources information should be findable through search, so securing personal information is an important requirement.

Requirements

● Index the following content:
    ○ Corporate file shares
    ○ Internal web pages
    ○ HR information
● Provide a general search box, which would return a results page across all indexed content, including content across all product units.
● Provide specific search boxes, which would return results specific to a particular product unit.
● Style the search box and result page according to Acme Inc. corporate branding standards.
● Present search results for secure content only to users authorized to see the content.
● Provide failover capability in case of a GSA outage/issue.

Assumption

There is a mechanism in place to authenticate a user.

Key considerations

● Decide whether to present results directly from the search appliance or by means of a web application presentation layer.
● Decide whether to manage security by using the search appliance or by means of the application fronting the GSA.
● Decide how to configure the Google Search Appliance Connector for File Systems to index file share content.
● Use reporting or analytics to gauge user interaction with search materials.

Recommended approach

Google’s recommended approach for implementing basic internal search covers the following areas:

● Deployment architecture
● Crawl and index configuration
● Secure search configuration
● Front end configuration
● Administrative items

Deployment architecture

To account for failover capabilities, Acme Inc. will use a total of two GSAs in a production configuration. The two GSAs will be used as active-passive search appliances, with one primary appliance and a hot backup for failover. Acme Inc. will configure both search appliances for mirroring, with one GSA acting as the primary search appliance, on which all configuration changes should be made.

To achieve an active-passive configuration, they will deploy a load balancer in front of the GSAs. The role of the load balancer will be to ping the active GSA and fail over to the hot backup unit if the active unit does not return the expected response.

Because the GSA is being deployed internally, serving results directly from the GSA is recommended, styling them by using the GSA stylesheet. In this case, Acme Inc. can modify the stylesheet by using the Page Layout Helper, an XSLT wizard on the GSA, to add certain features to the display quickly. If additional modifications are desired, they can modify the stylesheet manually. Note that Google Support does not support custom XSLT modifications.

A reverse proxy is needed in the architecture if query fidelity is required, that is, if search query parameters must not be tampered with or submitted ad hoc to the GSA. If secure content is marked in the index as “public,” with security being applied by an application layer based on metadata, a reverse proxy should be used to front the GSAs and filter search queries so that no one can access the GSAs directly to submit their own queries. This ensures that a user cannot manipulate the URL to see items that contain metadata they are not allowed to see or that come from a collection they are not entitled to see.

The Google Search Appliance Connector for File Systems, used to index file share content, should be hosted on an external server in a production environment. The connector runs in a JVM and comes with Tomcat built in.

Crawl and index configuration

Acme Inc. will configure start URLs for top-level pages. To distinguish content by Acme Inc.’s departments, collections can be established for each department. The Google Search Appliance Connector for File Systems should be used to index file shares. Scenarios where the connector should be used include:

● Authorization by early binding (ACLs)
● Need to maintain last access dates on files and directories that are being traversed
● The share is a non-HTTP exposed Windows DFS domain root share

Secure search configuration

Acme can use one of the following strategies to secure content, depending on whether authorization is required or not:

● Only authentication is required, but not authorization
● Authentication and authorization are required

Only authentication is required

If authentication is required, but not authorization:

● Crawl content with an admin account and mark the content as “public.”
● Place this crawled content into a collection.
● The application tier above the GSA handles authentication. Once a user has been authenticated to a page with a search box on it, a search is executed on the collection where the content was placed.
● As this strategy will mix public and secure content, if there is a desire to restrict certain users from seeing secure content, use a reverse proxy in front of the GSA to make sure proper queries are sent to the GSA. The reverse proxy will ensure the GSA is not directly exposed to unauthenticated users, where users can build their own search parameters onto queries.
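The query filtering that such a reverse proxy performs can be sketched as an allow-list: user-controlled parameters are restricted to a known set, and the scoping parameters are always overwritten by the proxy. This is a simplified illustration rather than a proxy implementation; the parameter sets, collection name, and front end name are hypothetical.

```python
from urllib.parse import parse_qs

# Parameters a user is allowed to control (hypothetical choice).
ALLOWED_PARAMS = {"q", "start", "num"}
# Parameters the proxy always sets itself (hypothetical values).
FORCED_PARAMS = {
    "site": "hr_authenticated",
    "client": "hr_frontend",
    "output": "xml_no_dtd",
}


def sanitize_query(query_string):
    """Drop parameters that are not allow-listed and force the scoping ones."""
    incoming = parse_qs(query_string)
    clean = {
        name: values[0]
        for name, values in incoming.items()
        if name in ALLOWED_PARAMS
    }
    clean.update(FORCED_PARAMS)
    return clean
```

A request that tries to switch collections or add its own metadata filter is rewritten to the collection the page is entitled to search.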

Take note that a reverse proxy will add another component to the architecture. For more information, see Chapter 7, Implementing a Reverse Proxy for Perimeter Security and Other Reasons.

Authentication and authorization are required

If both authentication and authorization are required:

● Crawl content with the account of a user who has access and do not mark the content as “public.”
● Users may need to submit their credentials upon executing a search and results would be authorized against service end-points using HEAD request checks.
● Determine how to integrate with the authentication mechanism that is available. The possibilities include:
    ○ Kerberos (for more information, see the Kerberos scenario described in Chapter 6)
    ○ Integrated Windows Authentication NTLM by utilizing the SAML Bridge
    ○ LDAP or Basic prompt for username/password by the GSA
    ○ Cookie translation to integrate with forms authentication and provide a verified username back to the GSA
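The HEAD request checks mentioned above can be illustrated with a simplified sketch. The mapping from status codes to decisions is an assumption made for illustration and is not the GSA's exact behavior; `head_status` is a hypothetical callable that would issue the HEAD request with the user's credentials and return the HTTP status code.

```python
def decide(status_code):
    """Map a HEAD response status to an authorization decision (simplified)."""
    if 200 <= status_code < 300:
        return "PERMIT"
    if status_code in (401, 403):
        return "DENY"
    # Anything else (redirects, server errors) is treated as undecided.
    return "INDETERMINATE"


def authorize_results(urls, head_status):
    """Keep only the URLs that the content server permits for this user."""
    return [url for url in urls if decide(head_status(url)) == "PERMIT"]
```

Because every candidate result costs a round trip to the content server, this serve-time style of check is slower than trimming results against ACLs that are already in the index.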

Front end configuration

Each search box deployed on the web properties will have a set of query parameters tied to it. These parameters will be sent down with the query to the GSA to shape the type of results that appear in the search results page. For example, a search box deployed on the HR department page should pass down the collection parameters for that specific department.

Google recommends that Acme Inc. style the results by using the Page Layout Helper. This way, certain features can be turned on or off. Another advantage of using this XSLT wizard is the increased chance of compatibility with future versions of the XSLT.

Administrative items

Acme Inc. will use the Advanced Search Reporting feature to create reports about what users are searching for and what they are clicking in search results pages. These reports should be generated and analyzed frequently, as they are a good indicator of general search satisfaction.

Alternative approaches

For the secure search configuration “Only authentication is required,” instead of using an application fronting the GSA to perform authentication, use the Perimeter Security feature on the GSA, which ensures that the search appliance doesn't serve any results without user authentication. When perimeter security is enabled, the search appliance must authenticate a user with one of the configured authentication mechanisms before serving any results. If authentication fails, the GSA will not serve any results, even if they are public.

Use policy ACLs for content that only authenticated users can access. With this approach, an “everyone” group can be used to govern access to this content. This approach will require the “everyone” group resolution at authentication time.

Project task overview

The following table lists the project tasks and activities for implementing basic internal search.

Task: Plan deployment architecture
Activities:
● Rack and cable search appliances
● Configure search appliances and set up mirroring
● Configure load balancer in front of the GSAs
● Set up perimeter security around GSAs
● Procure server to host File System Connector

Task: Configure crawl and index
Activities:
● Set up start URLs for crawling content
● Configure collections identified for departments
● Install and configure File System Connector
● Identify security mechanisms and configure crawler for access

Task: Configure front end
Activities:
● Enable existing web application layer for the addition of search boxes
● Configure XSLT modifications per front end using the onboard wizard

Long term enhancements

● Tweak search and features based on reports showing user search patterns.
● Identify content for KeyMatches.
● Enable more complex synonym lists.
● Enable Entity Recognition to automatically enrich documents with metadata using text-based dictionaries, terms, or regular expressions.
● Enable Dynamic Navigation for metadata-driven facet navigation.
● Enable Expert Search for office and/or department listings.
● Identify areas for which OneBoxes can be of value.

Chapter 3 Internal Search over intranet, File System, and SharePoint

Scenario overview

Acme Inc. houses different corpora that are served from different servers on their corporate network. These data silos are accessed by way of different data management applications, such as SharePoint, as well as secure file shares. Having to go to different applications to find information has become tedious and very time consuming for their employees. In addition, the productivity lost to repetitive searching across disjointed systems with ineffective search tools has started to show up on their bottom line.

Requirements

● Index the following content, while keeping it secure:
    ○ Secure file shares
    ○ SharePoint portal data used to host internal sites
● Present search results for secure content only to users authorized to see the content.
● Create a standard UI for data access.
● Create custom interfaces for internal and external users.
● Deployment must result in a measurable business benefit.

Assumptions

● There are more than 500K documents in SharePoint.
● An automated analytics solution is desirable.

Key considerations

● Decide whether to use the onboard or offboard Google Search Appliance Connector for SharePoint.
● Decide whether to crawl web-enabled SMB file shares or use the file system connector.
● Decide whether to present results directly from the GSA or by means of a web application presentation layer.
● Decide whether to manage security by using the search appliance or by means of a fronting application.
● Decide whether silent authentication is needed, where users are not re-prompted by the GSA for credentials.

Recommended approach

Google’s recommended approach for implementing internal search over intranet, file system, and SharePoint covers the following areas:

● Benefit analysis
● Deployment architecture
● Crawl and index configuration
● Serve-Time authentication and authorization configuration

Benefit analysis

To gauge the business benefit of the resulting search solution, Acme Inc. will conduct a short study to capture time spent on existing platforms. Automated tools should be used to gather this information whenever possible. If there are any analytics tools in place, they should be used to gather information about the usage of search or the time it takes to find information on the current systems. If no analytics are in place, Acme Inc. should consider implementing an analytics solution for automated evaluation of effectiveness in the future.

After the deployment has concluded, Acme Inc. will conduct an evaluation of the new solution to gauge its effectiveness. To recognize the right metrics, they will compare similar use cases evaluated before beginning the deployment.

Deployment architecture

Acme Inc. will deploy the offboard SharePoint connector, as the total SharePoint document count is over 500K. If file shares can be web-enabled, then they can be directly crawled by the GSA. Results will be presented directly from the GSA by using customized front ends for different data stores. In the case of searching SharePoint, the Search Box for SharePoint will be deployed and used.

Consider utilizing the Google Search Appliance Connector for File Systems to index file shares. Some scenarios where the connector should be used include:

● Authorization by early binding (ACLs)
● Need to maintain last access dates on files and directories that are being traversed
● The share is a non-HTTP exposed Windows DFS domain root share

Crawl and index configuration

Acme Inc. will configure crawl and index for the following types of content sources:

● SharePoint: To index content on SharePoint, Acme will install and configure the SharePoint connector on a separate server. They will also install Google Web Services for SharePoint on every SharePoint web front end that exists in the farm. ACLs will be fed into the GSA as a connector configuration option. As the SharePoint connector is being used to index SharePoint content, the Active Directory Groups Connector will be required to resolve a user’s AD Group memberships at serve time, which are needed for content Authorization and for mapping AD users/groups to SharePoint local groups.
● File Shares: To index file shares, Acme will configure the web-enabled file shares on the Content Sources > Web Crawl > Start and Block URLs (Previous to version 7.2: Crawl and Index > Crawl URLs) page in the GSA Admin Console.

Serve-Time authentication and authorization configuration

Acme Inc. will use Kerberos as the preferred authentication mechanism between the GSA and the content server. They will make this work by performing the following tasks:

● Creating an Active Directory service account for the GSA.
● Configuring Kerberos on the GSA.
● Configuring the SharePoint connector to submit content feeds to the GSA with feeding of ACLs enabled.
● Configuring the ADGroups connector to cache AD objects in a database for quick resolution at serve time.
● Configuring the SharePoint connector instance as an authentication mechanism in order to resolve groups for a user’s session that are to be used for ACL trimming in the index.

Since content feeds containing ACLs will be submitted from SharePoint to the GSA, content will be authorized in the GSA’s index utilizing the ACLs that were fed at crawl time.

Alternative approaches

● Use the Google Search Appliance Connector for File Systems to index the file share content.
● The advantage to this approach is that ACLs would be fed in along with the content, enabling an early binding Authorization decision, which is better performing.
● The correct groups for the user would need to be resolved at Authentication time for this approach to work, as groups are needed for the early binding ACL authorization trim. The ADGroups connector, also needed for SharePoint, could be used for this purpose.
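The early binding decision described above amounts to comparing the user's resolved identity and groups against the ACL stored with each document. The sketch below illustrates that idea only; the ACL dictionary layout, the deny-overrides-permit rule, and the helper names are assumptions, not the GSA's internal model.

```python
def is_permitted(acl, user, groups):
    """Check a user and their resolved groups against a document ACL."""
    groups = set(groups)
    # Assumption for this sketch: an explicit deny overrides any permit.
    if user in acl.get("deny_users", ()) or groups & set(acl.get("deny_groups", ())):
        return False
    return (user in acl.get("permit_users", ())
            or bool(groups & set(acl.get("permit_groups", ()))))


def trim_results(documents, user, groups):
    """Return the URLs of the documents this user may see."""
    return [doc["url"] for doc in documents
            if is_permitted(doc["acl"], user, groups)]
```

No call to the content server is needed at serve time, which is why group resolution at authentication time is a prerequisite for this approach.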


Project task overview

The following table lists the project tasks and activities for implementing internal search over intranet, file system, and SharePoint.

Task: Plan deployment architecture
Activities:
● Configure SharePoint/ADGroups connector server

Task: Configure crawl and index
Activities:
● Configure SharePoint connector
● Configure ADGroups connector
● Configure File Share locations in Crawl and Index in the GSA Admin Console

Task: Configure front end
Activities:
● Customize front ends for different data stores

Task: Configure serve time authentication/authorization
Activities:
● Enable the GSA for Kerberos
● Configure Connector Based Authorization
● Configure group resolution mechanism
    ○ If the SharePoint connector is being used, this will most likely be SharePoint connector-based Authentication configured for group resolution only.

Long term enhancement

Deploy the Google Search Box for SharePoint in order to serve search results from within SharePoint.

Chapter 4 Indexing through Feeds

Scenario overview

Acme Inc. has its own retail stores, which are run under two different brands. In the use case for this scenario, they want to make the products database searchable by store employees. The products are currently stored in a database, with certain data contained in business applications. The crawler cannot index the product pages directly. There is a web front end that displays product information when supplied with the product number in the URL.

Requirements

● Index content about products.
● Provide search within specific retailer brands.
● Enable left-pane parametric navigation by:
    ○ Price
    ○ Category
    ○ Arrival time

Assumptions

● There is a web front end in place to display product pages. This web front end doesn’t contain all the metadata required for the indexing of products.
● Products are not secured and anyone can view them.

Key considerations

● Decide whether to use the Google Search Appliance Connector for Databases to onboard the product records onto the GSA.
● Decide whether to use a content feed or a web feed for onboarding the product records onto the GSA.
● Define metadata for indexing along with the content to drive Dynamic Navigation and advanced search capabilities.

Recommended approach

Google’s recommended approach for indexing through feeds covers the following areas:

● Deployment architecture
● Crawl and index configuration
● Metadata focus
● Front end configuration

Deployment architecture

Because records along with required metadata cannot be constructed using only database queries, the database connector will not be used to index content. Instead, Acme Inc. will use a custom feed. A custom feed is an application that constructs XML containing records to index on the GSA. The main step in getting the XML with records into the GSA is a POST action on the feeds protocol interface on the GSA.

In addition to the GSA, another server (Windows or Linux) is required to host the feeds application. This application will perform some logic to construct a record and post records into the GSA.
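A minimal sketch of the XML-construction step of such a feeds application follows. The `gsafeed` structure (header with `datasource` and `feedtype`, then `record` elements carrying `metadata` and `content`) reflects the feeds protocol as commonly documented and should be verified against the Feeds Protocol Developer's Guide. The data source name, product fields, URLs, and helper names are hypothetical. Posting the result to the appliance (typically a multipart form POST to the feed interface) is left out of the sketch.

```python
from xml.sax.saxutils import escape, quoteattr

FEED_TEMPLATE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">\n'
    "<gsafeed>\n"
    "  <header>\n"
    "    <datasource>{datasource}</datasource>\n"
    "    <feedtype>incremental</feedtype>\n"
    "  </header>\n"
    "  <group>\n{records}\n  </group>\n"
    "</gsafeed>\n"
)


def build_record(product):
    """Render one product as a feed record with metadata and HTML content."""
    meta = "\n".join(
        "        <meta name={} content={}/>".format(quoteattr(name), quoteattr(str(value)))
        for name, value in sorted(product["metadata"].items())
    )
    return (
        '    <record url={url} mimetype="text/html" action="add">\n'
        "      <metadata>\n{meta}\n      </metadata>\n"
        "      <content>{content}</content>\n"
        "    </record>"
    ).format(url=quoteattr(product["url"]), meta=meta,
             content=escape(product["html"]))


def build_content_feed(datasource, products):
    """Assemble a complete content feed document for a list of products."""
    records = "\n".join(build_record(p) for p in products)
    return FEED_TEMPLATE.format(datasource=escape(datasource), records=records)
```

Each record's URL would point at the existing product web front end, so that a click on a result still leads to the detailed product page.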

Crawl and index configuration

The recommended approach is to use content feeds to onboard product records into the GSA. In this way, content can be custom tailored for indexing. This method also takes advantage of the GSA capability for caching a custom product page in its index. In this way, store associates can choose to display the cached version, as perhaps it is easier to display than the product web front end in place.

The feeds application has to be designed so that for every product row in the database, it can construct the required HTML and associated metadata to feed into the GSA. A mechanism is needed that will keep track of all deleted, modified, and added records.
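One possible change-tracking mechanism, sketched under the assumption that the feeds application can store a fingerprint per product between runs: each row is hashed, and the hashes from the previous run are compared with the current ones. The row layout, product IDs, and function names are hypothetical.

```python
import hashlib


def row_fingerprint(row):
    """Hash a product row so that any field change alters the fingerprint."""
    canonical = "|".join("%s=%s" % (key, row[key]) for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_products(previous, current_rows):
    """Compare stored fingerprints with the current database rows.

    previous: {product_id: fingerprint} saved by the last run.
    current_rows: {product_id: row dict} read from the database now.
    Returns (added, modified, deleted, new_state).
    """
    current = {pid: row_fingerprint(row) for pid, row in current_rows.items()}
    added = sorted(set(current) - set(previous))
    deleted = sorted(set(previous) - set(current))
    modified = sorted(pid for pid in set(current) & set(previous)
                      if current[pid] != previous[pid])
    return added, modified, deleted, current
```

Added and modified products would be fed as new or updated records, deleted ones as delete records, and `new_state` saved for the next run.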

Metadata focus

A focus on metadata when dealing with product-style content is extremely important. Metadata helps end users perform more powerful advanced searches and also helps them “drill down” into different defined categories. Acme Inc. can identify certain metadata to drive dynamic navigation headings so users can drill down into different categories of metadata values with the click of a mouse. Other metadata values can be used in queries to restrict the query over a specific set of content.

Front end configuration

One advantage of using content feeds to bring content into the GSA’s index is that a cached version of the custom content created for the feed will be saved on the GSA and can be selected to be displayed on the front end. One example of a beneficial use is the indexing of printer-friendly pages that can be printed and given out as datasheets. When a user wants to see a quick facts page, she can reference the cached content on the GSA. When she wants a more in-depth view, she can click the link and be taken to the web front end that displays the detailed product page for the particular item.

Acme Inc. must modify the GSA XSLT to display the products and associated metadata accordingly. Acme Inc. will also modify the front end to enable advanced search features for users, based on collections and defined metadata. For example, selecting a drop-down option or clicking a radio button could attach metadata as query terms to the query in order to scope it over the products corpus.
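As an illustration of attaching metadata as query terms, the sketch below appends `inmeta:` terms to the user's query. The `inmeta:name=value` and `inmeta:name:low..high` forms are assumed from the search protocol and should be checked against the product documentation; the metadata names and the helper `scoped_query` are hypothetical.

```python
def scoped_query(user_terms, selections, price_range=None):
    """Append metadata restrictions chosen in the UI to the user's query."""
    parts = [user_terms]
    for name, value in sorted(selections.items()):
        # Assumed exact-match syntax for a metadata value.
        parts.append("inmeta:%s=%s" % (name, value))
    if price_range is not None:
        # Assumed numeric range syntax.
        parts.append("inmeta:price:%d..%d" % price_range)
    return " ".join(parts)
```

A drop-down selection of the "accessories" category with a price slider set to 10 through 50 would call `scoped_query("hdmi cable", {"category": "accessories"}, (10, 50))`.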

Alternative approaches

● If all content and metadata can be derived from database queries, use the database connector for feeding in all products.
● If the front end application for displaying products can display all required information needed for indexing, use a web feed for indexing all products.
● Instead of defining the process to feed in metadata at indexing time, you can alternatively configure Entity Recognition rules through dictionaries or XML regular expression definitions to automatically tag the documents with entities at indexing time.

Project task overview

The following table lists the project tasks and activities for implementing indexing through feeds.

Task: Plan deployment architecture
Activities:
● Rack and cable search appliances
● Provision server to host feed application

Task: Configure crawl and index
Activities:
● Configure Follow URLs for fed in content
● Configure collections identified for brands
● Design logic for constructing feed content
● Design feeds application for writing out XML records and posting them to the GSA

Task: Configure front end
Activities:
● Enable Dynamic Navigation
● Modify XSLT to display records along with desired metadata
● Create an advanced search page or page section that will scope queries restricted to desired metadata
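To make the "writing out XML records" activity more concrete, here is a minimal Python sketch that assembles a content feed with per-record metadata. The data source name, URL, and metadata values are hypothetical. The element names (`gsafeed`, `header`, `datasource`, `feedtype`, `group`, `record`, `metadata`, `meta`, `content`) are assumed to follow the GSA feeds protocol, so check them against the feeds documentation before relying on this.

```python
import xml.etree.ElementTree as ET


def build_feed(datasource, records, feedtype="incremental"):
    """Build a content feed document as a string.

    records is a list of dicts with "url", "html", and optional "metadata".
    A real feed also needs the XML declaration and the gsafeed DOCTYPE,
    which this sketch leaves out.
    """
    gsafeed = ET.Element("gsafeed")
    header = ET.SubElement(gsafeed, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = feedtype
    group = ET.SubElement(gsafeed, "group")
    for rec in records:
        record = ET.SubElement(group, "record", url=rec["url"], mimetype="text/html")
        # Metadata is written before the content element.
        metadata = ET.SubElement(record, "metadata")
        for name, value in rec.get("metadata", {}).items():
            ET.SubElement(metadata, "meta", name=name, content=value)
        ET.SubElement(record, "content").text = rec["html"]
    return ET.tostring(gsafeed, encoding="unicode")


if __name__ == "__main__":
    print(build_feed("products", [{
        "url": "http://www.example.com/products/1",
        "html": "<html><body>Widget datasheet</body></html>",
        "metadata": {"brand": "Acme", "category": "widgets"},
    }]))
```

The metadata names chosen here are the ones that would later drive Dynamic Navigation categories and metadata-restricted queries on the front end.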


Chapter 5 Cookie Translation with Silent Authentication

Scenario overview

In the use case for this scenario, Acme Inc. wants to integrate their SiteMinder SSO and three different content sources with the GSA using the following mechanisms:

● Per-URL ACLs
● Connector-based authorization
● Public content

They would prefer their users had a seamless, silent authentication experience with search after logging into the main company portal.

Requirements

● Index the following content:
  ○ Livelink
  ○ Web Based People Directory Application
  ○ Lotus Connections
● Provide a general search box, which would return a results page with the most relevant links across all indexed content.
● Present search results for secure content only to users authorized to see the content.
● Provide seamless silent authentication with search after a user initially logs into the main portal.
● Content in Lotus Connections is secured based on native Lotus Connections groups as well as LDAP groups.

Assumptions

● SiteMinder SSO is in place to authenticate users to the secure content sources (Livelink and Connections).
● All content sources use the same common identity.
● The existing Google Search Appliance Connector for Livelink will be used to integrate the GSA with Livelink.
● Connections content will be fed into the GSA.
● ACLs can be fed with Connections content to take advantage of the Per-URL ACLs feature of the GSA.


Key considerations

● Confirm the assumption that all content sources use the same common identity.
● Determine whether the content sources use content source-native groups or groups synched with Active Directory/LDAP.
● Confirm the assumption that Lotus Connections can use the Per-URL ACLs feature of the GSA to feed ACL information along with the content feed.

Recommended approach

Google’s recommended approach for implementing cookie translation that provides silent authentication covers the following areas:

● Architecture overview
● Authentication
● Authorization

Architecture overview

Acme Inc. will integrate the content sources listed in the following table with the GSA.

Content source: Lotus Connections
Integration method:
● Feed containing Connections content and ACLs for each document
● Connections native groups along with LDAP groups are fed in with the content
● Content is kept in synch by using the SeedList capability of Lotus Connections

Content source: Open Text Livelink
Integration method:
● Content is traversed and fed into the GSA by the Livelink connector
● The connector keeps the content in synch
● ACLs are not fed in along with the content
● Connector-based authorization is used; it authorizes batches of documents at a time

Content source: Web-Based people directory content
Integration method:
● Web exposed and directly crawled by the GSA


Authentication

According to Acme Inc.’s authorization mechanisms, both secure content sources need a verified identity with a username in order to authorize content for users at serve time. The Livelink connector needs a verified username in order to perform connector-based authorization. A valid username, along with associated groups, needs to be supplied to the GSA to authorize content with ACLs in the index.

Because both content sources are under the SiteMinder SSO at Acme Inc., users will have a cookie in their session that enables them to access the sources after logging into the main search portal to perform a search. A Universal Login Forms Based Authentication rule will be set up to fetch a sample URL, protected by SiteMinder, as part of the authentication process. If a user is already logged into the portal and has a SiteMinder cookie, he will be authenticated and won’t have to provide credentials.

Authorization

Both content sources’ authorization mechanisms require credentials:

● Livelink requires a verified identity with a username
● Connections requires a username and groups

For this reason, the SiteMinder SSO-protected sample URL page will need to return the following information about the user providing the SSO cookie for authentication:

● A username
● A list of groups associated with the user

This is also referred to as “cookie cracking.” It is achieved by creating a JSP or ASP.NET page that, after validating the authenticity of the SSO cookie, returns an HTTP response to the GSA. The response has a 200 OK status code and carries a verified username and a list of associated groups in the “X-Username” and “X-Groups” headers. Note that Connections native groups, along with the user’s LDAP groups, will need to be returned to support Connections ACLs in the index.

The following steps describe a full and successful authentication and authorization flow:

1. The user logs into the company intranet portal through a SiteMinder login page, and a SiteMinder cookie is created in his or her session.
2. The user performs a search on the GSA, passing along applicable scoped cookies.
3. The GSA fetches the sample URL, passing along cookies that are scoped for the fetch; the cookie is used to verify authentication to the SiteMinder-protected page.
4. The page returns a 200 along with the verified username and groups associated with the user whose cookie was passed to the page.
5. The GSA considers the fetch a success and associates the username and groups with the verified identity credential group.
6. The username and groups are used to further authorize content on the GSA. The username is passed on to do connector-based authorization for Livelink. The username and groups are passed on to do authorization checks for ACLs associated with content in the index.
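The response that the cookie cracker page returns can be sketched as follows. This is written in Python rather than JSP or ASP.NET purely for illustration, and it assumes that cookie validation has already happened upstream (for example, by the SSO agent protecting the page), so the function only receives the resulting identity. The function name, the 401 status for an unverified user, and the comma-separated format of the group list are assumptions; only the "X-Username" and "X-Groups" header names and the 200 status come from the flow described above.

```python
def build_cookie_cracker_response(username, ldap_groups, connections_groups):
    """Return (status_code, headers) for the sample URL fetch made by the GSA.

    username is the identity already verified from the SSO cookie, or an
    empty value when no verified identity is available.
    """
    if not username:
        # Without a verified identity, do not answer 200, so the fetch
        # is not treated as a successful authentication.
        return 401, {}
    # Connections native groups and LDAP groups are both returned.
    groups = sorted(set(ldap_groups) | set(connections_groups))
    headers = {
        "X-Username": username,
        "X-Groups": ", ".join(groups),  # assumed delimiter
    }
    return 200, headers


if __name__ == "__main__":
    print(build_cookie_cracker_response("jdoe", ["staff"], ["wiki-editors"]))
```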

Alternative approach

For all content, perform HEAD request authorization checks, which wouldn’t need groups and a username associated with the verified identity.

Project task overview

The following table lists the project tasks and activities for implementing cookie cracking that provides silent authentication.

Task: Plan deployment architecture
Activities:
● Design an application that, given a cookie, will return the username and groups associated with a user
● Deploy that application into a web/application server that is protected by SiteMinder. In the case of Apache, a SiteMinder plugin can be used to integrate with the SSO

Task: Configure crawl and index
Activities:
● Configure Connectors for indexing content
● Configure crawler for indexing public content

Task: Configure Cookie Based Authentication with a Sample URL under Universal Login Auth Mechanisms
Activities:
● When this is configured, the GSA will perform a sample URL fetch, which will access the page. The cookie cracker page will then return the username and groups associated with the verified identity performing the search.

Long term enhancements

● Perform an architecture review as new content sources are brought on to see if existing architecture needs to be modified for onboarding new content.
● If ACLs for Livelink become available, adjust cookie cracker to return Livelink groups as well. Consider a cookie cracker in a separate credential group (namespace), if there are group name clashes.


Chapter 6 Silent Authentication: Integrating with NTLM and the SAML Bridge

Scenario overview

Acme Inc. uses NTLM with Integrated Windows Authentication (IWA) with an Active Directory back end. In the use case for this scenario, they would prefer their users have a seamless, silent authentication experience with search after logging into the Windows domain and using Internet Explorer for browsing.

Requirements

● Index and securely serve NTLM-protected content.
● Provide a general search box, which would return a results page with most relevant links across all indexed content.
● Present search results for secure content only to users authorized to see the content.
● Provide seamless silent authentication with search after a user initially logs into the Windows Domain.

Assumptions

● The IIS Server hosting the content accepts HEAD requests.
● All crawled content is under the same Windows Domain that users log into.

Key considerations

● Confirm the assumption that all content sources use the same Windows domain that users log into.
● Make sure all servers in the deployment architecture are time synched to the same time server.
● Make sure there is a certificate available in IIS that can be used to sign SAML Post binding requests.
● To enable communication with the SAML Bridge over HTTPS, configure the GSA with the root certificate of the SAML Bridge.

Recommended approach

Google’s recommended approach for implementing silent authentication by integrating with NTLM and the SAML Bridge covers the following areas:

● Authentication
● Authorization
● Process flow for authentication and authorization with the SAML Bridge


Authentication

Acme Inc. will use the SAML Bridge to authenticate users with Active Directory, taking advantage of the silent authentication benefit provided by IWA. For this to work, the domain controller that is running Active Directory must meet the following requirements:

● Windows 2003 Kerberos Extension must be available, because Kerberos is used for authentication between the SAML Bridge and the content server.
● The domain functional level must be set to Windows Server 2003.
● Active Directory must be configured to permit the SAML Bridge to use delegated credentials from the user to access content on the content server.

To configure the SAML Bridge for use by the GSA in performing authentication, use the Search > Secure Search > Universal Login Auth Mechanisms page in the Admin Console (prior to version 7.2: Serving > Universal Login Auth Mechanisms > SAML). Because SAML POST binding became available as an option in SAML Bridge release 2.8, it is recommended that POST binding be used for authentication with the SAML Bridge. For steps to configure the SAML Bridge for POST binding, see the following wiki: http://code.google.com/p/google-saml-bridge-for-windows/wiki/SAMLBridge28features

Authorization

Because authorization with NTLM is required, the SAML Bridge will also be used for authorization. In this case, the SAML Bridge must be configured as the Authorization Provider on the Search > Secure Search > Access Control page in the Admin Console (prior to version 7.2: Serving > Access Control). The GSA will then delegate authorization checks for individual documents to the SAML Bridge. The SAML Bridge will respond with PERMIT or DENY, accordingly.

Process flow for authentication and authorization with the SAML Bridge

1. A user submits a search query for secure content.
2. The GSA’s Authentication SPI is used to delegate to the SAML Bridge for authentication. NTLM, which is configured on the user’s browser, is used to authenticate the user.
3. After authenticating the user, the GSA determines the most relevant results for the user. If these results contain secure documents, the GSA uses the Authorization SPI to delegate the authorization checks for these documents to the SAML Bridge.
4. The SAML Bridge obtains a Kerberos ticket on the user’s behalf and impersonates the user to the content server.
5. The SAML Bridge sends a PERMIT or DENY message back to the GSA, and the GSA displays the results that the user is permitted to see on a search results page.
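The last step of the flow, where only permitted results are displayed, can be pictured with the small Python sketch below. It is not how the GSA is implemented; the result dictionaries and the decisions mapping are hypothetical stand-ins for the candidate results and for the PERMIT or DENY answers received for each secure document, and treating a missing answer as a denial is an assumption.

```python
def filter_permitted(results, decisions):
    """Keep public results and secure results that were answered with PERMIT.

    results: list of dicts with "url" and "secure" keys (illustrative shape).
    decisions: dict mapping a secure document URL to "PERMIT" or "DENY".
    """
    permitted = []
    for result in results:
        if not result["secure"]:
            # Public documents need no authorization check.
            permitted.append(result)
        elif decisions.get(result["url"]) == "PERMIT":
            # A missing decision is treated like DENY in this sketch.
            permitted.append(result)
    return permitted


if __name__ == "__main__":
    candidates = [
        {"url": "http://intranet.example.com/a", "secure": True},
        {"url": "http://intranet.example.com/b", "secure": True},
        {"url": "http://intranet.example.com/c", "secure": False},
    ]
    answers = {"http://intranet.example.com/a": "PERMIT",
               "http://intranet.example.com/b": "DENY"}
    print([r["url"] for r in filter_permitted(candidates, answers)])
```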


Alternative approach

Switch to Kerberos as the authentication mechanism for content servers. After switching to Kerberos, it may be possible to configure silent authentication without deploying the SAML Bridge.

Project task overview

The following table lists the project tasks and activities for implementing silent authentication, integrating with NTLM and the SAML Bridge.

Task: Plan deployment architecture
Activities:
● Prepare AD Configuration for SAML Bridge installation
● Deploy SAML Bridge on domain controller and make necessary configuration adjustments
● Configure certificates for POST Binding and communication over HTTPS

Task: Configure SAML Bridge on the GSA
Activities:
● Configure Authentication SPI to use SAML Bridge
● Configure Authorization SPI to use SAML Bridge

Long term enhancement

Perform an architecture review as new content sources are brought on to see if existing architecture needs to be modified for onboarding new content.


Chapter 7 Implementing a Reverse Proxy for Perimeter Security and Other Reasons

Scenario overview

Acme Inc. has highly sensitive research and design documents. In this scenario, they want to restrict access to these documents by forcing all searches through a proxy. The proxy will enforce authentication with their Single Sign-On (SSO) system before allowing access to the GSA and also restrict the queries that can be submitted to the GSA.

Requirements

● Enforce an SSO login before accessing the GSA.
● Restrict queries performed on the GSA to a single specified collection by restricting URL request parameters.

Assumptions

● This example assumes that Apache Web Server is used. Note that other web servers can be used as reverse proxies to the GSA.
● An Apache server is available.
● An Apache plugin for Acme’s SSO is available.

Key considerations

● If using the GSA for secure searches:
  ○ Proxying HTTPS traffic is required.
  ○ Calls to the security manager on the GSA must also be proxied.
● If accessing the GSA over HTTPS, SSL traffic must also be proxied.
● The GSA is protected by a firewall and access is restricted to the proxy server.

Recommended approach

Google’s recommended approach for implementing a reverse proxy for perimeter security covers the following areas:

● Integrating Apache with an SSO
● Proxying requests to the GSA
● Restricting all traffic through the reverse proxy


Integrating Apache with an SSO

To protect the Apache instance with the SSO, Acme Inc. will install the Apache SSO plugin particular to the SSO that is being used. Depending on whether the plugin contains a configuration interface, they may be presented with application protection options as a series of wizards, or the configuration may have to be made by setting appropriate resource filters for traffic in Apache.

When the SSO plugin is configured, anytime the Apache host with the appropriate cookie domain scope is accessed, a user will be authenticated with the SSO. If the user doesn’t have a cookie in her session, she should get redirected to the SSO login page to get one. After that is done, she will be allowed to proceed to the GSA.

Proxying requests to the GSA

A virtual host block is the mechanism commonly used for this, but you can also do it in the main server configuration. To configure a virtual host to handle proxying of traffic:

  ProxyRequests Off
  <Proxy *>
    Order Deny,Allow
    Deny from all
    Allow from [gsa_ip]
  </Proxy>
  ProxyPass / http://gsa32.example.com/
  ProxyPassReverse / http://gsa32.example.com/

For configurations where secure search is enabled, the mod_ssl Apache module is needed to proxy HTTPS traffic. A certificate must also be issued for the Apache server and installed on the GSA, so that the proxied requests will be recognized as signed.
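The requirement to restrict queries to a single collection comes down to rewriting the request parameters before they reach the GSA. In an Apache deployment this would be done in the proxy configuration itself; the Python sketch below only illustrates the logic. The whitelist of parameter names and the collection name are hypothetical choices, and the use of the `site` parameter to select a collection is an assumption to verify against the GSA search protocol documentation.

```python
from urllib.parse import parse_qsl, urlencode

# Hypothetical whitelist: parameters the proxy is willing to pass through.
ALLOWED_PARAMS = {"q", "start", "num", "client", "output", "proxystylesheet"}


def restrict_query(query_string, collection):
    """Drop parameters that are not whitelisted and force the collection."""
    params = [
        (name, value)
        for name, value in parse_qsl(query_string, keep_blank_values=True)
        if name in ALLOWED_PARAMS
    ]
    # Any collection supplied by the client was dropped above; the proxy
    # appends the only collection it allows.
    params.append(("site", collection))
    return urlencode(params)


if __name__ == "__main__":
    print(restrict_query("q=design&site=secret&getfields=*", "research"))
```

Using a whitelist rather than a blacklist means that any parameter the proxy does not know about is removed by default.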

Restricting all traffic through the reverse proxy

After the reverse proxy is implemented, Acme Inc. will configure a firewall rule to allow traffic to the GSA from the Apache host only. This forces all requests to the GSA to go through the Apache reverse proxy.

Alternative approach

Use an alternate web server for implementing the reverse proxy. One example is using IIS to handle filtering of traffic.

As of GSA 6.14, the Perimeter Security feature of the GSA can be used to implement such a mechanism. The requirement would be to configure a security mechanism on the GSA to do authentication only. When this is enabled, public results will not be shown to users unless they are successfully authenticated to the GSA.


Project task overview

The following table lists the project tasks and activities for implementing a reverse proxy for perimeter security.

Task: Plan Apache integration with SSO
Activities:
● Protect Apache URL
● Configure Apache to use SSO plugin and set appropriate resource filters to filter traffic to the SSO protected resources

Task: Configure proxying requests to the GSA by Apache
Activities:
● Configure the virtual host to proxy traffic to the GSA and vice versa
● If secure search or accessing the GSA over HTTPS is required, mod_ssl will be needed to proxy HTTPS traffic

Task: Configure firewall to restrict access to the GSA from anywhere but the Apache host
Activities:
● Configure a firewall rule to place perimeter security around the GSA so the only way to access it is through the Apache proxy

Long term enhancements

● Consider other uses for the reverse proxy: clean URLs, firewall tunneling, caching for performance.
● Using Apache as a cache can greatly improve the response time and serving capacity of the GSA. For example, a memory cache configuration (Apache mod_mem_cache) can be added to the virtual host section:

  CacheEnable mem /
  MCacheSize 4096
  MCacheMaxObjectCount 1000
  MCacheMinObjectSize 1
  MCacheMaxObjectSize 4096

  This would cache the 1000 most recent GSA responses of 4K or less in memory.
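A quick way to sanity-check cache settings like these is to compare the worst-case memory needed by the object limits with the total cache size. The Python sketch below does that arithmetic. It assumes that MCacheSize is expressed in kilobytes and MCacheMaxObjectSize in bytes, as in the Apache 2.2 mod_mem_cache documentation; confirm the units for the Apache version in use.

```python
def cache_capacity(mcache_size_kb, max_object_count, max_object_size_bytes):
    """Return (total_bytes, worst_case_bytes, fits) for a memory cache setup.

    worst_case_bytes is the memory needed if every cached object has the
    maximum allowed size.
    """
    total_bytes = mcache_size_kb * 1024
    worst_case_bytes = max_object_count * max_object_size_bytes
    return total_bytes, worst_case_bytes, worst_case_bytes <= total_bytes


if __name__ == "__main__":
    # Values taken from the configuration above.
    print(cache_capacity(4096, 1000, 4096))
```

With the values above, 1000 objects of at most 4096 bytes need 4,096,000 bytes, which fits within the 4,194,304 bytes of total cache.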


Chapter 8 Relevancy Testing

Scenario overview

Acme Inc. has deployed its GSA and integrated the following content sources:

● Livelink content
● Crawled intranet site
● People directory application

In the use case for this scenario, they want to conduct relevancy testing to make sure their users are satisfied with the search results being returned by the GSA before rolling out the search solution to production.

Requirement

Ensure search results returned to users are relevant according to their search terms.

Assumptions

● Content has already been integrated into the GSA and is available in the index.
● Test planners know the business context of content in the GSA’s index.

Key considerations

● Relevancy is hard to frame in terms of an absolute scientific measurement; it may mean different things to different people.
● The GSA’s out-of-the-box relevancy algorithms have been shown to return highly relevant search results without performing any tweaks or modifications.

Recommended approach

Google’s recommended approach for relevancy testing covers the following areas:

● Test case preparation
● Test case execution
● Features to consider for relevancy tuning


Test case preparation

● Identify the different user groups in the organization that will use search.
● Based on what is in the GSA index, determine some business context about the type of searches different users would perform and what documents they would expect to be returned.
● Develop a list of predetermined queries you will have users execute in order to comment on relevancy of results. In addition to the fixed set of queries, ask users to execute about three of their own queries during the testing rounds to account for context not considered in test case preparation.
● Identify a set of documents you would deem to be most relevant for a particular query for a particular user, which will be used for scoring.
● Develop a scale that testers can use to gauge how relevant their returned results are, and communicate the scale and its basis to the users performing testing. For example, consider a 1-5 scale where:
  ○ 1: relevancy is great. The first page returns all extremely relevant results. The identified document for this particular query is returned on the first page of results.
  ○ 5: relevancy is very poor. The results I am expecting don’t come back in the first couple of pages of search results. The identified relevant document for this query doesn’t appear until the 60th result. There is one content source crowding out all other results.

Test case execution

● Before performing any relevancy tweaks, develop a relevancy benchmark by having an identified set of beta users, spanning the different departments and business units that will eventually use the search solution in production, execute the fixed set of predetermined queries.
● In a spreadsheet tally, ask users to rate each of the results for the executed queries according to the predetermined scale. Also have users enter their general comments for each search.
● After taking a benchmark with the default relevancy configuration on the GSA (no synonyms for query expansion, no biasing policy, etc.), tweak the relevancy configuration systematically based on user feedback and comments, and have users re-test and re-score after each round of changes to see how the previous change affected their perception of relevancy.
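The spreadsheet tally can be summarized with a few lines of code. The Python sketch below averages the ratings per query for one testing round and compares a re-test round against the benchmark, assuming the example 1-5 scale above where 1 is best. The function names, the tuple layout, and the sample queries are illustrative only.

```python
from statistics import mean


def summarize_round(ratings):
    """Average the scores per query.

    ratings: list of (query, user, score) tuples, with 1 = best and 5 = worst.
    """
    by_query = {}
    for query, _user, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range for {query!r}: {score}")
        by_query.setdefault(query, []).append(score)
    return {query: mean(scores) for query, scores in by_query.items()}


def compare_rounds(benchmark, retest):
    """Return the improvement per query; positive means the re-test scored better."""
    return {q: benchmark[q] - retest[q] for q in benchmark if q in retest}


if __name__ == "__main__":
    benchmark = summarize_round([
        ("vacation policy", "ann", 4),
        ("vacation policy", "bob", 2),
        ("expense form", "ann", 5),
    ])
    retest = summarize_round([
        ("vacation policy", "ann", 2),
        ("vacation policy", "bob", 2),
        ("expense form", "ann", 4),
    ])
    print(compare_rounds(benchmark, retest))
```

Comparing each round only against the round before it keeps the effect of a single configuration change visible.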


Features to consider for relevancy tuning

The following table lists GSA features to consider for relevancy tuning.

Feature: Source biasing
Comment: By using pattern matching, bias one source over another.

Feature: Date biasing, metadata biasing
Comment: Bias documents that have specific metadata attached.

Feature: KeyMatches
Comment: Use KeyMatches to promote documents for certain queries.

Feature: Query expansion
Comment: Use a query expansion policy to expand search query terms into other terms (synonyms).

Feature: Self-Learning scorer
Comment: When Advanced Search Reporting is enabled, the GSA uses the self-learning scorer feature to analyze click stream data and promote certain search results over time. As an example, for a given search query, if users consistently click the second result on the page instead of the first, that result will eventually move up to overtake the first position on the page.

Feature: Host crowding/filtering
Comment: GSA filters out any combination of:
● Results from the same path
● Results with duplicate titles and snippets

Feature: Ranking framework
Comment: Specify a per-URL biasing. Note that this is a very complex solution to manage and should only be tried as a last resort.

Feature: Stopwords (introduced in GSA 6.10)
Comment: Use stopwords to prevent certain terms in the query from being used in performing search. Take care when using this feature as this can have wide ranging implications if used as a solution to a particular problem.

Feature: Collections
Comment: Break content into different collections to restrict the document corpus available to a search query.

Feature: Exposing metadata and/or Entities in Dynamic Navigation for a richer user experience
Comment: Instead of tuning GSA relevance, consider enriching the user experience by adding additional dynamic navigation categories for metadata sources or Entities that were defined in Entity Recognition. Although not really a relevancy tuning option, Dynamic Navigation may have the benefit of enriching user experience so that a user can drill down into the results set to find the results they are looking for.


Alternative approach

Address biasing at index time. Use a content feed and specify the pagerank of individual documents. The pagerank attribute allows you to specify the pagerank of a document manually. This can be set as high as 99 for a very high pagerank. The default for all content-fed documents is 96.
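As a small illustration of index-time biasing, the Python sketch below writes a single feed record that carries a pagerank attribute. The URL and the chosen value are hypothetical, and the record and attribute names are assumed to follow the GSA feeds protocol, so confirm the allowed value range in the feeds documentation for your appliance version.

```python
import xml.etree.ElementTree as ET


def record_with_pagerank(url, pagerank):
    """Return one feed record element, as a string, with a manual pagerank."""
    if not isinstance(pagerank, int):
        raise TypeError("pagerank must be an integer")
    record = ET.Element("record", url=url, mimetype="text/html",
                        pagerank=str(pagerank))
    return ET.tostring(record, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical document that should rank above the default for fed content.
    print(record_with_pagerank("http://www.example.com/handbook", 98))
```

In a real feed this record would sit inside the group element of a complete feed document, alongside its content.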

Project task overview

The following table lists the project tasks and activities for relevancy testing.

Task: Plan test
Activities:
● Identify user groups for test execution and relevancy feedback
● Develop a list of queries to be executed by each user as part of testing
● Identify a set of “relevant” documents pertaining to each query and user
● Develop a relevancy scale to gauge quality search results

Task: Execute test
Activities:
● Develop a relevancy benchmark by having users execute tests before any tweaks are made on the GSA
● Instruct users to execute tests and tally results/feedback

Task: Iterate and retest
Activities:
● Based on feedback received, perform GSA biasing tweak and have users re-test
● Refine and repeat until satisfied with results

Long term enhancement

Develop a process and mechanism for gathering user feedback and continuing refinement of search relevancy in production.


Summary

Each GSA deployment brings different challenges based on your IT landscape. The assumptions, considerations, approaches, project tasks, and enhancements in this document are examples and should not be taken as reference plans as-is. Your own environment and timelines might reflect greater complexity. When you plan a deployment project, take specific business or technical requirements into consideration. Always include contingencies in your plans.
