WHAT ARE PEOPLE SEARCHING ON GOVERNMENT WEB SITES?
A study of search activity on the Utah.gov Web site.

By MICHAEL CHAU, XIAO FANG, and OLIVIA R. LIU SHENG

COMMUNICATIONS OF THE ACM April 2007/Vol. 50, No. 4

The U.S. government provides a large amount of information to the public on the Web. While the Freedom of Information Act requires the government and federal agencies to disclose a great deal of information to the public, the Paperwork Reduction Act allows these agencies to maintain and provide information through the Internet. Due in part to this Act, large amounts of government information have been put online and made publicly accessible. The provision of information was generally considered harmless before Sept. 11, 2001. After the terrorist attacks that day, the U.S. government and agencies became more concerned about what information was put online, and a lot of sensitive information was removed from the Web shortly thereafter. This includes reports on vulnerabilities related to plants and control measures, maps of power plants or water systems, emergency response plans, and so forth [4]. Terrorist access to such information can potentially facilitate terrorist attacks and become harmful to the public. However, arbitrarily removing information from the government Web sites is not the best solution to the problem. If too much information is removed, some legitimate requests for information may not return results, causing inconveniences to the public. For example, members of the general public also need to know the location of nuclear plants for the
sake of their own safety and concerns. It is important to study what information people are looking for on these sites in order to determine what information should be made more easily accessible (or otherwise) to the general public. For example, if Web log analyses indicate the public is interested in some sensitive security information (such as nuclear plant information) but government agencies decide such information should only be accessed through legitimate requests and remove it from their Web sites, they should provide other secure means for the public to access this information (for example, after proper authentication and authorization at physical locations).

One way to study people’s information needs is to analyze the logs of search engines. Search engine log analysis has been conducted on many general-purpose search engines. The most popular one is the study of the Excite search log. Three single days of search logs (sampled in 1997, 1999, and 2001) of the Excite search engine (www.excite.com) were made available to researchers and many studies have been reported [8, 10, 11]. These analyses have provided much information about the information needs and searching behavior of search engine users, including their search topics and search characteristics. Wang et al. also reported their study of the information needs of the users of an academic search engine [12]. The results of these analyses, however, are not specific to government Web sites.

There are several reasons for analyzing the logs of government search engines. For example, it has been shown that users’ behavior in Web site search engines can be quite different from that of general-purpose search engines [2, 12]. Search topics and other query characteristics could be significantly different and research is needed to explore such unique behavior on government Web sites. Studying the search logs can reveal the public’s information needs on government Web sites, which allows governments to better manage the content on their Web sites and support their online visitors [5, 6]. Another reason to study the search logs is that, because government Web sites may store information relevant to public safety and national security, the study can possibly reveal whether there is any suspicious activity on their Web sites.

In this article, we report our study on analyzing the search logs of the Utah State Government Web site Utah.gov. The Utah.gov site is one of the most advanced government Web sites and was named the best state government Web portal in the U.S. by the Center for Digital Government in 2003 [1]. A large variety of information and e-services are provided on this Web site. Therefore, it was chosen as the basis for our study. This project is also part of the Utah State’s Center of Excellence Program funded in part by the NSF. As discussed, our goal is to study people’s search patterns for government information, such as top queries, query term distribution, and session analysis. Such analysis will help government agencies to better understand the public’s information needs, improve the design of their Web sites, and possibly uncover suspicious activities occurring on their Web sites. Here, we describe the data collection process, the characteristics of the data, and the analysis method as well as the query analysis results.

THE UTAH STATE GOVERNMENT WEB SITE
We collected more than one million search queries submitted to Utah.gov from March 1, 2003 to August 15, 2003 [2]. Utah.gov has a Web site search engine accessible from the main page of the site. Site visitors can enter search queries in the text box provided and submit the queries to the Web site search engine. Our search log contains a total of 1,895,680 records. Each record represents a request that can be a search query (requesting either the first page of search results or subsequent pages beyond the top 25 results), a request for viewing the actual document in the search result, or a request for an image file. Each record consists of 14 fields, such as timestamp, IP address, the type of request submitted, and other parameters of the request.

We extracted the search queries from the data and used information on cookies and IP addresses to identify users from the data. Following previous research, sessions were identified from the user data using the widely applied rule of thumb, in which the maximal session length should be less than 30 minutes [3]. Each session was assigned a unique session id in our database; there are 792,103 queries in total, submitted by a total of 161,042 unique users in 458,962 sessions. Each session has on average 1.73 queries, or 1.25 unique queries. The latter number is much lower than the number 2.52 reported in the Excite study [11] and 2.02 reported in the AltaVista study [9]; the results are summarized in Table 1.

Table 1. Overview of the search log data.

Number of search queries                792,103
Number of non-empty queries             673,807
Number of unique users                  161,042
Number of sessions                      458,962
Mean number of queries per session      1.73
Median number of queries per session    1

ANALYSIS AND DISCUSSION
When we compared our data with that of previous search log analysis for general-purpose search engines, we found that Web users behave similarly when using a government Web site search engine and a general-purpose search engine regarding the average number of terms per query and the average number of result pages viewed per session [2]. However, they show a lower number of queries per session and a different set of terms and topics used in their queries. In our study, we identified the top 25 most frequent queries in our data and compared the queries with those of AltaVista [9]; here, we compare the top 10 queries from Utah.gov and AltaVista in Table 2. As one can expect, the top queries submitted to the government Web site search engines are different from those in general-purpose search engines. The government Web site queries are mostly related to the people’s general information needs from the government, such as the Department of Motor Vehicles and tax forms. The results suggest the public frequently relies on the government Web site to obtain useful information relevant to their daily activities. It is also interesting to note the different distribution of the queries in the two studies. In addition, we found that the top 25 queries in our data accounted for 4.48% of all non-empty queries, while the top 25 in AltaVista only represented 1.56% of their data. The difference indicates the queries in government Web search engines (and other Web site search engines) are less diverse. In other words, users of general-purpose search engines have much broader information needs than what is provided by government search engines. It is therefore desirable to customize the design of government Web sites and their search engines by making some prominent links to popular information in order to better cater to users’ needs.

Table 2. Comparison of the top 10 queries with AltaVista.

Rank  Utah.gov search query  Frequency  Percentage  AltaVista search query  Frequency  Percentage
1     dmv                    3,794      0.56%       sex                     1,551,477  0.27%
2     tax forms              2,532      0.38%       applet                  1,169,031  0.20%
3     sex offenders          2,173      0.32%       porno                   712,790    0.12%
4     forms                  2,036      0.30%       mp3                     613,902    0.11%
5     jobs                   1,587      0.24%       chat                    406,014    0.07%
6     divorce                1,400      0.21%       warez                   398,953    0.07%
7     unemployment           1,359      0.20%       yahoo                   377,025    0.07%
8     employment             1,257      0.19%       playboy                 356,556    0.06%
9     notary                 1,061      0.16%       xxx                     324,923    0.06%
10    secretary of state     1,053      0.16%       hotmail                 321,267    0.06%

We also revealed some interesting seasonal patterns in our query logs. Seasonal effect has been demonstrated in other search engines such as university Web search engines [12]. For example, the query “career services” occurred more frequently in February, March, September, and October than in other months. Similarly, the query “football tickets” appeared most often in August and September. In our data, we found some similar patterns for terms related to information needs that are “seasonal.” An example of the search for tax-related queries (all queries that contain the terms “tax,” “irs,” or “internal revenue”) is shown in Figure 1. The number reached its peak on April 15 (the deadline for filing individual tax returns in the U.S.) and dropped quickly afterward.

[Figure 1. Seasonal effect of tax-related queries. Vertical axis: Number of Tax-Related Queries (0 to 1,000); horizontal axis: date, 3/1/2003 to 8/1/2003.]

The top queries and seasonal effect analysis reveal the information needs of the general public on government Web sites. However, they do not show the queries of people with specific purposes (such as terrorists) because these queries would not appear frequently in the data. In order to study the search of security-related information on government Web sites, we identified a set of “suspicious” terms with the help of a researcher specializing in terrorist research in the U.S. and analyzed the occurrences of these terms in the search log data. The list of these terms and their frequencies is shown in Table 3.

Table 3. Frequencies of terms potentially related to terrorism.

Suspicious term      Number of sessions     Number of queries
                     containing the term    containing the term
terrorism            83                     119
sars                 92                     118
nuclear              79                     116
west nile virus      77                     95
water system         29                     62
emergency AND plan   35                     61
power plant          28                     58
radioactive          19                     39
smallpox             27                     37
disease control      12                     15
nuclear AND map      1                      3
pipeline AND map     1                      1
anthrax              1                      1

All 13 suspicious terms appear in the search logs, indicating there are users who entered these queries into the search engine to look for information relevant to these queries. While we have no way to determine the real intent of these users (whether they are searching for such information for legitimate use or for other purposes), it is important to investigate what information they are looking for.

The first few of these queries are related to nuclear and radioactive plants and substances. The U.S. government was widely criticized for putting detailed nuclear plant maps and nuclear substance transportation routes on government Web sites during the aftermath of the Sept. 11 attacks because such information potentially allows terrorists to easily target such facilities for attack. When we took a closer look at these queries, we found that some of them may not have been submitted by the general public, such as “radioactive waste storage” and “nuclear waste transportation route map.”
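The two counts reported for each term in Table 3 (sessions and queries containing the term) could be computed along the following lines. This is only a minimal sketch in Python, not the procedure used in the study: the record layout `(user, timestamp, query)`, the function names, the case-insensitive substring matching, and the reading of the 30-minute rule as a cap on session length measured from a session's first query are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed reading of the rule of thumb: a session spans less than 30 minutes.
MAX_SESSION_LENGTH = timedelta(minutes=30)


def assign_sessions(records):
    """Yield (user, session_index, query) for (user, timestamp, query) records.

    A user's query opens a new session when it arrives 30 minutes or more
    after the first query of that user's current session.
    """
    state = {}  # user -> (session_index, session_start)
    for user, ts, query in sorted(records, key=lambda r: (r[0], r[1])):
        index, start = state.get(user, (-1, None))
        if start is None or ts - start >= MAX_SESSION_LENGTH:
            index, start = index + 1, ts
        state[user] = (index, start)
        yield user, index, query


def matches(term, query):
    """True if every AND-joined part of the term occurs in the query.

    Hypothetical matching rule: plain case-insensitive substring search,
    so "map" would also match "maps".
    """
    text = query.lower()
    return all(part.strip().lower() in text for part in term.split(" AND "))


def count_term_usage(records, terms):
    """Return {term: (sessions_containing_term, queries_containing_term)}."""
    query_counts = defaultdict(int)
    session_ids = defaultdict(set)
    for user, index, query in assign_sessions(records):
        for term in terms:
            if matches(term, query):
                query_counts[term] += 1
                session_ids[term].add((user, index))
    return {t: (len(session_ids[t]), query_counts[t]) for t in terms}


if __name__ == "__main__":
    t0 = datetime(2003, 4, 15, 9, 0)
    log = [  # made-up records, for illustration only
        ("u1", t0, "nuclear plant map"),
        ("u1", t0 + timedelta(minutes=5), "nuclear waste"),
        ("u1", t0 + timedelta(minutes=50), "map of nuclear sites"),
        ("u2", t0, "tax forms"),
    ]
    print(count_term_usage(log, ["nuclear", "nuclear AND map"]))
```

With the made-up records above, the script prints `{'nuclear': (2, 3), 'nuclear AND map': (2, 2)}`: user u1's third query falls in a second session because it arrives 50 minutes after the first. Counting sessions as a set of `(user, session_index)` pairs means the session count for a term can never exceed its query count, which is the relationship seen in every row of Table 3.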



Because a lot of sensitive security information has been removed from government Web sites, a search on the Utah.gov Web site with these queries did not return any sensitive security information. While our manual analysis showed that the Utah.gov Web site currently does not contain any sensitive security information, we are unsure of the situation of other government Web sites. The U.S. will be vulnerable if terrorists could find details of such information easily.

Some other queries are related to the water system in the state of Utah. One particular user searched for “map of pipeline.” Details of water systems and pipeline maps are also important for terrorism activities because terrorists can easily poison water supplies according to such information. It is believed that Al Qaeda had plans to poison U.S. water supplies and possessed some important information about the country’s water system [7]. Other queries searched for diseases that could be used in terrorism attacks, such as SARS, West Nile virus, anthrax, and smallpox. While it is possible that these queries were submitted by terrorists, we should note that it is equally possible these queries were submitted by some general citizens who wanted to know whether there were any reported cases of these diseases in the state of Utah. There are also queries that are related to disease control or emergency response planning. Again, there are two possible scenarios: terrorists are assessing the state government’s ability in handling terrorism attacks (or lack thereof) or the general public is looking for relevant information in this aspect.

As discussed earlier, we have used information on cookies and IP addresses to identify users from the data. The suspicious terms are associated with a small number of users. These terms were queried by 355 unique users, only 0.22% of the total population in this study. We further analyzed the distribution of the number of queries containing suspicious terms submitted by each of these users. As shown in Figure 2, each of these users submitted an average of 2.1 suspicious queries, with a maximum of 18. Although there is no extremely high number of queries containing suspicious terms submitted by any single user in this study, this kind of analysis could be useful in providing alerts by identifying users who submit a substantial number of suspicious queries.

[Figure 2. Distribution of the number of suspicious queries submitted by users. Vertical axis: Number of Users (0 to 250); horizontal axis: Number of Suspicious Queries Submitted (1 to 20).]

IMPLICATIONS
While some of these queries were made by the general public for legitimate reasons (such as environmental protection or vaccination), it is possible that some queries were made by terrorists for planning possible attacks. We suggest that such information is desired by both parties: terrorist groups may use such information for planning attacks, and the general public needs this information to better protect citizens. It is important for government to satisfy the information needs of legitimate users while keeping terrorists from accessing sensitive information. This issue has inspired the question of how much information should be made available online for access and how to control access to sensitive information. One possible solution is to put non-sensitive information online while keeping sensitive information accessible only after verifying identification (or security clearance if needed) at government information offices.

Another solution is to put all information on the Web but restrict access to sensitive information by measures like password control. This will allow authorized parties to retrieve information more easily, but will be more prone to security threats such as hacking. Further research will be needed in the area of information security, such as user profiling and automatic detection and alerting of suspicious activities using data mining techniques.

Our analysis also indicates that a significant proportion of people are looking for information related to a small number of topics in government Web sites, like tax information and the Department of Motor Vehicles. For example, the term “tax” appeared in 3.59% of all queries. To allow the general public to access online government information more easily, the Web site designers can analyze the search logs or the Web access logs. This data can reveal users’ most-wanted information resources, and designers can then make the links to these resources easily accessible, say, by placing them prominently on the first page of the Web site. For example, the LinkSelector technique, which has been successfully applied to a university’s Web site, can be used to select the most appropriate set of links on the home page of a Web site in order to maximize the efficiency and effectiveness of a Web site’s usage based on log analysis [5]. Such techniques can be effectively applied in e-government projects to improve the performance of government portals.

CONCLUSION
In this article, we have reported our research on analyzing the query log of a government Web site search engine and we found that some terrorism-related queries do exist in our data. A limitation of this study is that the analysis was only performed on one government Web site. Nevertheless, this problem must be taken seriously by governments in their information policies. On the other hand, as many countries have launched e-government (or digital government) projects, increasing numbers of government agencies are putting their information on the Web. The analysis of the search logs helps us better understand what users are seeking on government Web sites. Based on our study, we make the following suggestions to government agencies:

• Perform search log analysis to determine users’ search behaviors and monitor for suspicious information requests.
• Make the most requested information more easily accessible to the public by putting it on the first page of the Web site or creating prominent navigation links.
• Develop a clear classification on what kinds of information are potentially vulnerable to the country’s security and establish guidelines on what classes of information should be made accessible online. Information with mid-level or high-level sensitivity should be made available to an individual or organization only after proper authentication and/or security clearance.
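As a rough illustration of the first two suggestions, the Python sketch below computes the share of queries containing a given term, counts queries containing watch-list terms for each user, and lists the users whose count reaches a review threshold. The watch list, the record layout of (user identifier, query) pairs, the threshold, and all function names are assumptions made for this example; it is not the procedure used in the study, and simple whitespace tokenization stands in for whatever term matching a real analysis would need.

```python
from collections import Counter

# Hypothetical watch list; the terms echo examples mentioned in the article.
SUSPICIOUS_TERMS = {"anthrax", "smallpox", "pipeline"}


def tokens(query):
    """Lowercase a query and split it on whitespace."""
    return query.lower().split()


def term_share(queries, term):
    """Fraction of non-empty queries that contain the given term."""
    non_empty = [q for q in queries if q.strip()]
    if not non_empty:
        return 0.0
    hits = sum(1 for q in non_empty if term in tokens(q))
    return hits / len(non_empty)


def suspicious_counts(records):
    """Count, per user, the queries containing at least one watch-list term."""
    counts = Counter()
    for user_id, query in records:
        if SUSPICIOUS_TERMS.intersection(tokens(query)):
            counts[user_id] += 1
    return counts


def users_to_review(counts, threshold):
    """Users whose suspicious-query count reaches the threshold, sorted."""
    return sorted(user for user, n in counts.items() if n >= threshold)


# Toy records of (user_id, query); assumed for illustration, not study data.
records = [
    ("u1", "income tax forms"),
    ("u2", "anthrax symptoms"),
    ("u2", "map of pipeline"),
    ("u3", "drivers license renewal"),
    ("u3", "smallpox vaccination"),
    ("u4", "tax refund status"),
]

counts = suspicious_counts(records)
# Maps "number of suspicious queries" to "number of users", as in a histogram.
distribution = Counter(counts.values())

print(term_share([q for _, q in records], "tax"))
print(dict(counts))
print(dict(distribution))
print(users_to_review(counts, threshold=2))
```

A count-based threshold is only one possible alerting rule. As the article notes, many such queries come from ordinary citizens, so a flagged user would call for human review rather than any automatic conclusion.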

While removing vast amounts of information from government Web sites is not a good solution to the problem, it is not easy to strike a balance between providing easy access to information to the public and preventing terrorists from gaining access to sensitive information. We hope the suggestions proposed here will help alleviate the problem in government Web sites.

References

1. Center for Digital Government. Utah State Portal ranks No. 1 (2003); www.centerdigitalgov.com/center/highlightstory.phtml?docid=69811.
2. Chau, M., Fang, X., and Sheng, O.R.L. Analysis of the query logs of a Web site search engine. Journal of the American Society for Information Science and Technology 56, 13 (2005), 1363–1376.
3. Cooley, R., Mobasher, B., and Srivastava, J. Data preparation for mining World Wide Web browsing patterns. Knowledge and Information Systems 1, 1 (1999).
4. Electronic Frontier Foundation. Chilling effects of anti-terrorism: ‘National security’ toll on freedom of expression (2004); www.eff.org/Privacy/Surveillance/Terrorism/antiterrorism_chill.html.
5. Fang, X. and Sheng, O.R.L. LinkSelector: A Web mining approach to hyperlink selection for Web portals. ACM Transactions on Internet Technology 4, 2 (2004), 209–237.
6. Fang, X. and Sheng, O.R.L. Designing a better Web portal for digital government: A Web-mining based approach. In Proceedings of the 2005 National Conference on Digital Government Research (dg.o 2005), (Atlanta, GA, 2005).
7. Feds arrest Al Qaeda suspects with plans to poison water supplies. Fox News (July 30, 2002); www.foxnews.com/story/0,2933,59055,00.html.
8. Jansen, B.J., Spink, A., Bateman, J., and Saracevic, T. Real life information retrieval: A study of user queries on the Web. ACM SIGIR Forum 32, 1 (1998), 5–17.
9. Silverstein, C., Henzinger, M., Marais, H., and Moricz, M. Analysis of a very large Web search engine query log. ACM SIGIR Forum 33, 1 (1999), 6–12.
10. Spink, A., Jansen, B.J., Wolfram, D., and Saracevic, T. From e-sex to e-commerce: Web search changes. IEEE Computer 35, 3 (Mar. 2002), 107–109.
11. Spink, A., Wolfram, D., Jansen, B.J., and Saracevic, T. Searching the Web: The public and their queries. Journal of the American Society for Information Science and Technology 52, 3 (May 2001), 226–234.
12. Wang, P., Berry, M.W., and Yang, Y. Mining longitudinal Web queries: Trends and patterns. Journal of the American Society for Information Science and Technology 54, 8 (Aug. 2003), 743–758.

Michael Chau is an assistant professor in the School of Business at the University of Hong Kong, Hong Kong.
Xiao Fang is an assistant professor in the College of Business Administration at the University of Toledo, OH.
Olivia R. Liu Sheng is a Presidential Professor and the Emma Eccles Jones Presidential Chair of Information Systems at the University of Utah, Salt Lake City, UT.

This research was supported in part by National Science Foundation Grant No. 0410409.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

© 2007 ACM 0001-0782/07/0400 $5.00
