A Lightweight Algorithm for Automated Forum Information Processing

Wee-Yong Lim, Amit Sachan and Vrizlynn L. L. Thing
Cybercrime and Security Intelligence (CSI) Department
Institute for Infocomm Research, Singapore
Email: {weylim, sachana, vriz}@i2r.a-star.edu.sg

Abstract—The vast variety of information on web forums makes them a valuable resource for various purposes such as scam detection, national security protection and sentiment analysis. However, it is challenging to extract useful information from web forums accurately and efficiently. First, several page types exist in web forums and content is presented in different formats on these pages. Second, the content on forum pages is stored in the form of data blocks. For the information to be meaningful, it is necessary to extract the relevant data blocks separately. The main problem with generic content extraction systems is that they can neither distinguish among the various page types nor extract information at the required granularity. Although several content extraction methods exist for web forums, these methods either do not satisfy the above requirements or use heuristics-based approaches (such as assumptions on standard visual appearances), resulting in limited applicability across different varieties of forums. In this paper, we propose a general and efficient content extraction method using the properties of the links present in forum pages. The effectiveness of our proposed method is shown through our experimental results.

Keywords—content extraction; forum; DOM tree; web

I. INTRODUCTION

With a large amount of relatively unbiased data, web forums have become an important medium for users to browse and post information on subjects of their interest. The vast amount of unbiased and emerging data makes them very valuable for purposes such as scam detection, national security protection and user opinion mining. However, it remains challenging to efficiently extract forum data due to the following reasons. First, information in different types of forum pages is presented in different formats. In list-of-thread pages, the information is in the form of links to the list-of-post pages and their descriptions, while in list-of-post pages, the information is in the form of user posts and associated meta-data (e.g. time of post, user information). Therefore, during the extraction of information from a forum page, the page type must be known beforehand. Second, information in forum pages is generally stored in independent regions of content. For the extracted content to be intelligible, the granularity of the content extraction should be at the level of these independent content regions. Therefore, content extraction methods need to be able to extract the forum data accurately regardless of the page type, and to the granularity of individual threads and posts.

Several different methods of content extraction exist for general web pages and forums. Most of these methods extract the content by identifying the regions that are rich in text density [1], by relying on machine learning techniques [2], [3], [4], [5], [6], or by detecting differentiable visual attributes [7], [8], [9]. However, these methods are either unable to distinguish among the different types of pages in web forums and are suitable only for list-of-post page content extraction, or are too computationally and labor intensive. In addition, the achievable granularity of the content extraction is not at the level of individual threads and posts.

In this paper, we propose a lightweight content extraction method using only the link and text information in forum pages. The proposed method is able to accurately extract the content present in the different forum page types as individual data regions. Our experimental results show the effectiveness of our proposed algorithm in terms of its ability to accurately extract data from all page types in web forums. A point to note is that the proposed algorithm pertains to content extraction and does not identify the specific page type. Instead, we rely on our previous work on forum crawling [10] to do so. Other forum crawling techniques [11], [12], [13] can also be used to traverse through a given forum site.

In Section II, we discuss the existing works related to content extraction. In Section III, we describe the preliminaries required for this work, followed by the challenges involved in Section IV, and our proposed algorithm for web forum content extraction in Section V. Finally, experimental results and conclusions are presented in Sections VI and VII respectively.

II. RELATED WORK

Several works exist that attempt to extract data from web pages and forums. The first approach is wrapper based [2], [3], [4], [5], and uses supervised machine learning to learn data extraction rules from positive and negative samples. The structural information of the sample web pages is utilized to classify similar data records according to their subtree structure. However, this approach is inflexible and non-scalable due to its dependency on web site templates. Manual labeling of the sample pages is also extremely labor intensive and time consuming, and has to be repeated even for pages within the same site due to varying intra-site templates.

Another approach [7], [8] relies on the web page visual attributes. The web pages are rendered by a web browser to ascertain the data structure on different pages. Features such as the positioning of the information units, the cell sizes, the font characteristics and the font colors are analysed to understand the semantic meaning of the contents.

This analysis is based on the assumption that the rendering of content by the browser ensures a human-understandable output format. The results based on the data structure inference from the visual attributes are reported to be 81% and 68% for the precision and recall measurements, respectively. In [9], the authors identify different data blocks based on the differences in their visual styles, such as the width, height, background color and font. The disadvantage of the visual attribute based approach is the need to render the web pages during both the learning and extraction phases, which is computationally expensive.

Probabilistic model approaches take into consideration the semantic information in the web sites and generate models to aid in the data extraction. In [14], the authors observed certain strongly linked sequence characteristics among web objects of the same type across different web sites. They presented a two-dimensional conditional random fields model to incorporate the two-dimensional neighborhood interactions of the web objects so as to detect and extract product information from web pages. In [8], the same authors presented a new model of hierarchical conditional random fields to couple the data record detection and attribute labeling phases, so as to benefit from the semantics available in the attribute labeling phase. However, these two approaches assume an adequate availability of semantic information specific to the domain type of the web sites (e.g. product information sites in this case).

In [15], the authors proposed parsing the web page to form a tag tree based on the start and end tags. The primary content region is located by extracting the minimal subtree which contains all the objects of interest. Three features, namely the fanout, content size and tag count, are relied upon to choose the correct subtree. However, the evaluation of [15] in [16] shows that the proposed method could not obtain good results in data extraction from web pages. In [16], the authors proposed an algorithm to only consider nodes with a tree depth of at least three (derived from the observation of web page contents), and to extract the data region based on nodes with a high string-based similarity [17]. However, the proposed algorithm requires a training phase to derive the string-based similarity threshold and is specific to different web sites.

In [1], the authors rely on text density and composite text density measurements to support content extraction. Text density is the ratio of the number of characters to the number of tags in a region. The basis for using text density is that the text regions in a page generally contain a higher text ratio as compared to the other regions. Composite text density is computed by taking into consideration the noise due to hyperlinks, and gives a high score to content containing a low number of hyperlinks (indicating a region of interest). However, this method is not applicable to board and list-of-thread page content extraction. In addition, the achievable granularity of the content extraction from the post pages is not at the required level of individual posts, which is desired in forum data extraction.

In [6], the authors proposed generating a sitemap, which is a directed graph representing the forum site, by first sampling 2000 pages. The vertices of the graph represent the pages, while the edges denote the links between the pages.
The authors then proposed extracting three features from the sampled pages: i) an inner-page feature capturing characteristics such as the presence of time information, whether the elements on a page are aligned (determined by rendering via a web browser to identify the elements' locations), and whether the time information follows a special order (i.e. to identify post records by their sequential post times); ii) an inter-vertex feature capturing site-level knowledge, such as whether a vertex leads to another vertex with post pages and whether their joining edge is defined as a post link; and iii) an inner-vertex feature capturing the alignment of nodes within a vertex, such as whether they share a similar DOM path and tag attributes. Based on the extracted features, Markov Logic Network models are generated for each forum site. However, manual labeling of the sample pages is required in the training phase and is extremely labor intensive and time consuming. In addition, the sitemap construction and feature extraction processes have to be carried out during the operation phase and therefore create a bottleneck during information extraction.

Pretzsch et al. [18] proposed a method of extracting useful content from web forums. The proposed approach consists of the following steps: i) downloading all the pages in a web forum, ii) finding the list-of-post pages by performing clustering on all the downloaded pages, and iii) extracting the useful information from the list-of-post pages. The clustering process relies on the fact that post pages constitute the largest fraction of pages in web forums, so the largest cluster is chosen as the cluster of list-of-post pages. This process of finding the list-of-post pages has several drawbacks. First, it requires downloading all the pages in a web forum, which may be inefficient in terms of bandwidth and processing. Second, many duplicate pages containing the same information exist in web forums, and content may be extracted from duplicate pages. To identify the post regions, the authors divide pages into small segments based on HTML markup. Then the tag path (similar to an XPath) of these segments is determined. Finally, a tag distance based clustering is used and the tag paths representing the biggest cluster are used. The approach used to determine the block segments may be error prone as it relies on several heuristic rules on the content structure of post regions, which may not be applicable to individual posts in all forums or may lead to many false positives.

III. PRELIMINARIES

A. Traversal Path

A traversal path represents the hierarchical order of the sequence in which the pages on a web forum should be accessed. The traversal path typically starts from the board page, followed by the middle hierarchy pages, and finally ends at the list-of-post pages. The first pages of each type are collectively referred to as skeleton pages, which may be linked to multiple skeleton-flipping pages via page-flipping links. This path is represented in Figure 1.

A method to obtain the traversal path for a given web forum is discussed in our previous work [10]. First, links are extracted from a handful of random pages. Then, signatures are generated for the links based on pre-defined rules, and links with the same signature are placed in the same cluster. A second level of clustering is then performed to obtain the common keywords from the links within each cluster, thereby characterizing each cluster via a signature and a common keyword. Each cluster is assumed to correspond to a particular type of page (referred to as a vertex). A URL is said to match a vertex if they have the same signature and common keyword.


Fig. 1. Organization of web forums

In this work, we assume that the traversal path, represented as a sequence of vertices, has already been generated using our previous work in [10] and therefore, the relevant types of pages can be downloaded using the crawler.

B. Data Regions of Interest & Reference Links

Data regions of interest refer to the independent areas on different forum pages that contain useful content for analysis purposes. Each such content region contains a reference link indicating the presence of the content region. For list-of-thread pages, such links point to the subsequent list-of-post pages. For list-of-post pages, such links are usually bookmark links to each individual post or links to display an individual post separately.

C. DOM Tree & XPaths

A web page is a nested tree of HTML tags, which can be represented by a document object model (DOM) tree. HTML tags form the element nodes, their attributes form the attribute nodes, and text contents form the text nodes. XPaths are used to identify a node's position in the DOM tree. The XPath of a node is the sequence of tags starting from the root node to the node itself in the DOM tree. Numbers are used to distinguish among nodes having the same tag sequence. For example, if two <td> nodes are nested within the same <tr> node, their XPaths are /html/body/table/tr/td[1] and /html/body/table/tr/td[2] for the first and second node respectively.
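To make the XPath representation concrete, the short Python sketch below (an illustrative aid rather than part of the proposed method; it assumes the lxml library and a toy HTML fragment of our own) parses a page and prints the indexed XPath of each anchor node.

from lxml import html

# A toy list-of-thread fragment (illustrative only, not taken from any forum).
page = """<html><body>
  <ul>
    <li><a href="showthread.php?t=1">Thread one</a></li>
    <li><a href="showthread.php?t=2">Thread two</a></li>
  </ul>
</body></html>"""

tree = html.fromstring(page)      # build the DOM tree
root = tree.getroottree()         # needed to compute absolute XPaths

for anchor in tree.iter("a"):     # visit every <a> element node
    # getpath() returns the indexed XPath, e.g. /html/body/ul/li[2]/a
    print(root.getpath(anchor), "->", anchor.get("href"))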

IV. PROBLEM DEFINITION

The objective of content extraction is to determine the data regions of interest on different pages in order to extract the content from these data regions. In this work, we use reference links for the identification of each data region in forum pages. Although the concept is straightforward, several challenges, enumerated below, are present in practice.

1. Identification of reference links for list-of-post pages: List-of-post pages (and their skeleton-flipping pages) are at the end of the traversal path, but the reference links for the list-of-post pages are absent from the traversal path (as represented in Figure 1). Therefore, we need to devise a method to obtain the reference links for the list-of-post pages.

2. Content extraction for the last reference link on a page: The data region for content extraction is determined using reference links as markers (see Section V). However, for the last data region on a page, there exists no subsequent reference link to indicate the end of the data region. Extracting data from the last reference link till the end of the page would lead to redundant extracted data. The extraction therefore needs to be terminated properly for the last data region on the page.

3. Dealing with different formats of reference links: Reference links of the same page type may not share a single signature and common substring, due to them having different URL formats. In such cases, content may not be properly extracted from the web page. Additional means need to be considered to extract the content in such cases.

4. Dealing with reference links within the content: Reference links may exist within the user-generated content in web pages. This situation may lead to a wrong interpretation of data regions and produce erroneously broken-up data regions.

5. Dealing with misalignments in the content: Using only reference links as the basis for content extraction may lead to non-extraction of content appearing before the reference link in a given data region, or to extraction of content from the next data region appearing before the next data region's reference link. As such, important fields such as the date of post or the author name may be misaligned.

V. PROPOSED ALGORITHM

In this section, we present our proposed algorithm to extract the content from the different types of pages in web forums. The algorithm works in two phases, namely the training phase and the actual content extraction phase. During the training phase, we first identify the signature and common keyword for the reference links on the post pages. We then find the dominant XPath for the reference links and page-flipping links for each type of page, and subsequently identify the best XPath for differentiating between the different content regions in the page. Finally, the contents can be extracted in discrete units during the actual content extraction phase.

A. Training Phase

1) Reference Link Identification: Reference links for list-of-post pages refer to the links that uniquely identify the different posts on the page. For example, the reference link www.scam.com/showpost.php?p=1147747&postcount=1 with its anchor text "#1" uniquely identifies the first post within the page. To obtain the signature and common substring identifiers for the post pages, we use the clusters obtained during the traversal path generation process [10].
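The exact signature rules and keyword extraction belong to our previous work [10] and are not reproduced here; the Python sketch below only illustrates the idea with an assumed signature rule (the URL path with digit runs masked, plus the sorted query-parameter names) and a naive common-keyword choice. The second and third URLs are made-up variants of the example above.

from collections import defaultdict
from urllib.parse import urlparse, parse_qs
import re

def url_signature(url):
    # Assumed signature rule (the actual rules are defined in [10]): the URL path
    # with digit runs masked, plus the sorted query-parameter names.
    parsed = urlparse(url)
    path = re.sub(r"\d+", "#", parsed.path)
    params = ",".join(sorted(parse_qs(parsed.query).keys()))
    return path + "?" + params

def common_keyword(urls):
    # Naive common-keyword choice: the most frequent (longest on ties) alphabetic
    # token appearing in the URL paths of a cluster, e.g. 'showpost'.
    counts = defaultdict(int)
    for url in urls:
        for token in set(re.findall(r"[a-z]+", urlparse(url).path.lower())):
            counts[token] += 1
    if not counts:
        return ""
    return max(counts, key=lambda token: (counts[token], len(token)))

# Cluster links by signature, then characterise each cluster by a common keyword.
links = [
    "http://www.scam.com/showpost.php?p=1147747&postcount=1",
    "http://www.scam.com/showpost.php?p=1147750&postcount=2",
    "http://www.scam.com/showthread.php?t=52103",
]
clusters = defaultdict(list)
for link in links:
    clusters[url_signature(link)].append(link)
for signature, members in clusters.items():
    print(signature, "->", common_keyword(members))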

2) Dominant XPaths Extraction: Reference links are usually in a standard position in forum pages. XPaths are related to the position of elements in a web page and can be exploited to represent this similarity in the positions of the links. Given the XPaths of the reference links, the dominant XPath is a sequence of tags that can be matched against the majority of the links' XPaths. This dominant XPath is used to identify the reference links and page-flipping links even if their URL structures (signature or common keyword) do not match the URL structures of the previously obtained reference links. Specifically, we first obtain the respective reference links' XPaths, ignoring the numbers. The same sequence of tags is often shared by the XPaths of reference links in the same type of pages, but there may exist noisy links that are falsely identified as reference links. In practice, however, the most common XPath within the set can effectively be used as the dominant XPath. For each hierarchy level, tags that are at a fixed position retain their number (if any), while tags at varying positions have the wildcard symbol '*' replacing the number to indicate the variable positions. To illustrate the whole process, consider the small example below (the XPath values shown are illustrative).

Example 1: Given the XPaths of 5 links:
XPath1 = /html/body/div[1]/a
XPath2 = /html/body/div[2]/ul/li[1]/div/a
XPath3 = /html/body/div[2]/ul/li[2]/div/a
XPath4 = /html/body/div[2]/ul/li[3]/div/a
XPath5 = /html/body/div[2]/ul/li[4]/div/a

The most frequent tag sequence is obtained from XPath2..5, giving the dominant XPath /html/body/div[2]/ul/li[*]/div/a.
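A minimal sketch of this derivation, applied to the illustrative XPaths of Example 1 (the helper names and parsing details are our own):

import re
from collections import Counter

def split_xpath(xpath):
    # Split '/html/body/div[2]/ul/li[3]/a' into [('html', None), ..., ('li', 3), ('a', None)].
    steps = []
    for step in xpath.strip("/").split("/"):
        match = re.match(r"([^\[]+)(?:\[(\d+)\])?", step)
        steps.append((match.group(1), int(match.group(2)) if match.group(2) else None))
    return steps

def dominant_xpath(xpaths):
    # Keep the XPaths sharing the most frequent tag sequence; a position number is
    # retained where it is fixed across them and replaced by '*' where it varies.
    parsed = [split_xpath(x) for x in xpaths]
    sequences = Counter(tuple(tag for tag, _ in steps) for steps in parsed)
    majority_seq = sequences.most_common(1)[0][0]
    majority = [steps for steps in parsed if tuple(tag for tag, _ in steps) == majority_seq]
    result = []
    for position, tag in enumerate(majority_seq):
        indices = {steps[position][1] for steps in majority}
        if len(indices) == 1:
            index = indices.pop()
            result.append(tag if index is None else "%s[%d]" % (tag, index))
        else:
            result.append(tag + "[*]")
    return "/" + "/".join(result)

xpaths = [
    "/html/body/div[1]/a",
    "/html/body/div[2]/ul/li[1]/div/a",
    "/html/body/div[2]/ul/li[2]/div/a",
    "/html/body/div[2]/ul/li[3]/div/a",
    "/html/body/div[2]/ul/li[4]/div/a",
]
print(dominant_xpath(xpaths))   # prints /html/body/div[2]/ul/li[*]/div/a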
B. Content Extraction Using Links & XPaths

1) Using reference links as markers for content regions in a page: When given a page, the tasks of the crawler in this work are to extract the relevant user-generated content (i.e. threads/posts and associated meta-data) and the skeleton-flipping links. Both tasks require the identification of relevant URLs. While the latter task simply requires storing the links, the former entails additional processing to extract the relevant content in the page. This section describes in detail the algorithms involved in achieving these tasks, using the signature, common keyword and dominant XPath for the reference links obtained during the training phase.

In scraping data from a forum site, the content to be extracted from each page naturally depends on the page type. Typically, the relevant content to be extracted from a list-of-thread page consists of the links to the subsequent list-of-post pages and the meta-data for each of such links, while the content to be extracted from a list-of-post page consists of the user-generated posts themselves. Fortunately, given the difference in the type of content present in different page types and the intrinsically structured nature of forum sites, identifying the type of page while crawling a forum site is usually possible. To support the content extraction process, the vertex information present in the traversal path [10] is assumed to be available here.

The proposed approach first traverses through the DOM tree of the page in a depth-first, pre-order manner, seeking links whose URL matches any of the required vertices for the current page type or whose XPath matches the associated dominant XPath of the vertex. Links that satisfy either of these conditions and are further evaluated to be non-spurious can then have the content in their subtrees extracted.
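A rough illustration of this matching step is given below; the vertex fields, the xpath_matches helper and the reuse of url_signature from the earlier sketch are our own simplifications rather than definitions from the proposed method.

def xpath_matches(xpath, dominant):
    # '*' steps in the dominant XPath accept any position of that tag;
    # all other steps must match exactly.
    own_steps = xpath.strip("/").split("/")
    dominant_steps = dominant.strip("/").split("/")
    if len(own_steps) != len(dominant_steps):
        return False
    for own, dom in zip(own_steps, dominant_steps):
        if dom.endswith("[*]"):
            if own.split("[")[0] != dom[:-3]:
                return False
        elif own != dom:
            return False
    return True

def matches_vertex(url, xpath, vertex):
    # A link is a candidate for a vertex if its URL carries the vertex's signature
    # and common keyword (url_signature() is the helper sketched earlier), or if
    # its XPath matches the vertex's dominant XPath.
    url_match = (url_signature(url) == vertex["signature"]
                 and vertex["keyword"] in url.lower())
    return url_match or xpath_matches(xpath, vertex["dominant_xpath"])

vertex = {"signature": "/showpost.php?p,postcount",
          "keyword": "showpost",
          "dominant_xpath": "/html/body/div[2]/ul/li[*]/div/a"}
print(matches_vertex("http://www.scam.com/showpost.php?p=1147747&postcount=1",
                     "/html/body/div[2]/ul/li[1]/div/a", vertex))   # True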

Referring to the traversal path (Figure 1), it is clear that such anchor nodes are either links to skeleton-flipping pages or links to the next skeleton pages. Extracted links to skeleton-flipping pages are simply stored in a URL queue (or equivalent, depending on the crawler design) for the crawler to service in the future, whilst extracted links to the next skeleton pages serve as markers for the content regions to be extracted.

2) Eliminating spurious links: In practice, it is common for spurious reference links to be present within a content region itself. For example, links are frequently found within the text of users' posts, appearing as false markers for new content regions during the extraction process (see Section IV). To eliminate such spurious links in this work, the amount of text content in sibling text nodes is verified to be lower than a pre-set threshold before the link is regarded as a valid reference link. This is driven by the observation that, in general, reference links remain well separated from user-generated text in the web page's DOM tree representation; if observed otherwise, they can be regarded as spurious.
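A possible reading of this check, sketched with lxml; the threshold value, helper name and HTML snippet are illustrative choices of ours.

from lxml import html

def is_spurious(anchor, max_sibling_text=40):
    # A link is treated as spurious when the text nodes sitting next to it (the
    # parent's leading text and the tail text of the parent's children) exceed a
    # pre-set length, i.e. the link is embedded inside user-generated prose.
    parent = anchor.getparent()
    if parent is None:
        return False
    sibling_text = (parent.text or "") + "".join(child.tail or "" for child in parent)
    return len(sibling_text.strip()) > max_sibling_text

fragment = html.fromstring(
    '<div>As discussed at <a href="showpost.php?p=1">this post</a>, '
    'the seller never replied to any of the emails.</div>')
print(is_spurious(fragment.find(".//a")))   # True: the link is buried in post text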
3) Region XPath for identifying the start of a content region: Pertaining to points 2, 4 and 5 in Section IV, it is apparent that obtaining reliable dominant XPaths and identifiers for the reference links may still be insufficient for addressing the challenges of over-extraction for the last content region on a page, under-extraction due to the presence of reference links within content regions, and misalignment of data appearing before the reference link. These are dealt with by deriving a region XPath based on a common truncated XPath for the content regions. Nodes in each content region share a common region XPath, and this region XPath differs between the different content regions in the page. As such, it can be exploited to selectively reject data from other content regions during the extraction process. The region XPath is derived from the dominant XPath of the reference link by first identifying the wildcard tag (i.e. the tag with '*') with the highest variability in sibling count (i.e. the position in the dominant XPath where most of the URLs differ), and then removing all subsequent tags. For example, the list item <li> tag will be identified as the tag with the largest variability in sibling count in Example 1. The region XPath thus marks the sub-trees of the distinct content regions in the page. A termination condition can be set to stop the current extraction for a content region on encountering a node which resides outside the sub-tree indicated by the region XPath.

It is assumed that the identified (non-spurious) reference URLs in the page are child nodes within the sub-trees identified by the region XPath during the content extraction process. Otherwise, the extraction can proceed using the reference URLs and dominant XPaths directly instead. For example, the first posts in post pages in scamwarners.com have a different number of tags in their XPaths as compared to the rest of the posts. So, the dominant and region XPaths are calculated using the posts other than the first post (as these are in the majority).
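One way to realise this derivation, continuing the illustrative XPaths of Example 1 (function and variable names are ours; the variability of a wildcard step is measured as the number of distinct position indices observed across the reference links' XPaths):

def region_xpath(dominant, reference_xpaths):
    # Locate the '[*]' step of the dominant XPath whose position varies the most
    # across the reference links' XPaths, then drop every step after it (the
    # index itself is also dropped, keeping only the tag).
    dominant_steps = dominant.strip("/").split("/")
    best_position, best_variability = None, -1
    for position, step in enumerate(dominant_steps):
        if not step.endswith("[*]"):
            continue
        observed = set()
        for xpath in reference_xpaths:
            steps = xpath.strip("/").split("/")
            if position < len(steps):
                observed.add(steps[position])
        if len(observed) > best_variability:
            best_position, best_variability = position, len(observed)
    if best_position is None:
        return dominant   # no wildcard step: fall back to the dominant XPath itself
    kept = dominant_steps[:best_position] + [dominant_steps[best_position][:-3]]
    return "/" + "/".join(kept)

reference_xpaths = ["/html/body/div[2]/ul/li[%d]/div/a" % i for i in range(1, 5)]
print(region_xpath("/html/body/div[2]/ul/li[*]/div/a", reference_xpaths))
# prints /html/body/div[2]/ul/li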

Fig. 2. DOM tree traversal for information extraction

4) Extracting Content: The actual content extraction process extracts data from each content region via a depth-first, pre-order traversal starting from each node corresponding to the appropriate region/dominant XPath and/or identifiers for the reference links, as explained above. Such a traversal manner is important in the content extraction process as each content region (regardless of thread or post) is assumed to be encapsulated within one or more sub-trees in the DOM tree representation of the web page. The region XPath, reference link identifiers and/or dominant XPaths serve as indicators for identifying content regions in the page. On encountering such indicators when traversing through the DOM tree representation, the content is extracted until the next indicator is encountered or a termination condition has been reached. Extracted content is then saved as a unit of data in a database and this process is repeated for subsequent content regions in the page. The process is illustrated using Example 2 and the corresponding DOM tree in Figure 2.

Example 2: Referring to Figure 2, extraction of a content region starts from the identification of the region XPath, corresponding to the list item tag <li>. The extraction algorithm then traverses in a depth-first, pre-order manner to extract the reference link itself (Reference node 1), the links and text content within box 1 and, subsequently, the text in box 2. Extraction for the current content region stops on encountering the subsequent content region indicator in the next <li> node. It is noted that typically only text and link contents are extracted, while structural and list tags such as those in box 3 are ignored in this work.
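A condensed sketch of this extraction loop is given below; it assumes the lxml library, takes the region XPath as given, and omits the spurious-link, misalignment and termination handling discussed above.

from lxml import html

def extract_regions(page_html, region_xpath):
    # Each node matched by the region XPath is treated as one content region; its
    # sub-tree is walked in depth-first, pre-order manner, collecting text nodes
    # and link targets. Each region becomes one unit of extracted data.
    tree = html.fromstring(page_html).getroottree()
    regions = []
    for region_root in tree.xpath(region_xpath):
        texts, links = [], []
        for node in region_root.iter():
            if node.tag == "a" and node.get("href"):
                links.append(node.get("href"))
            if node.text and node.text.strip():
                texts.append(node.text.strip())
            if node is not region_root and node.tail and node.tail.strip():
                texts.append(node.tail.strip())
        regions.append({"links": links, "text": " ".join(texts)})
    return regions

For instance, calling extract_regions() with a downloaded list-of-post page and the region XPath derived earlier would return one dictionary per post region, each holding the post's links and text.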

VI. EXPERIMENT AND ANALYSIS

This section presents the experimental results for the proposed content extraction algorithm on list-of-threads and list-of-posts pages from nine scam related web forums. An initial experiment was set up where a crawler was utilized to traverse the sites using their known traversal paths [10]. Contents were extracted during the crawling process and stored according to their page type. Extracted contents from 100 sampled list-of-threads and 100 list-of-posts pages per forum were then inspected and the results reported in Tables I and II. For smaller forum sites which do not have 100 list-of-threads pages, all the available list-of-threads pages were used, as indicated in Table I.

TABLE I. CONTENT EXTRACTION FOR SAMPLE LIST-OF-THREADS PAGES

Forum                   Pages   Total threads   Extracted threads   Positive
419legal.org            100     2065            2065                2065
exposeascam.com         100     1830            1830                1830
movingscam.com          100     3006            3006                3006
scam.com                100     4681            4680                4680
forum.scampatrol.org    24      872             872                 872
scamwarners.com         100     4741            4741                4741
scamvictimsunited.com   75      3066            3066                3066
scamfound.com           100     9578            9576                9576
realscam.com            78      1301            1301                1301

TABLE II. CONTENT EXTRACTION FOR SAMPLE LIST-OF-POSTS PAGES

Forum                   Total posts   Extracted posts   Positive
419legal.org            312           312               312
exposeascam.com         141           141               141
movingscam.com          546           546               546
scam.com                295           295               295
forum.scampatrol.org    498           498               498
scamwarners.com         299           301               299
scamvictimsunited.com   417           417               417
scamfound.com           100           100               100
realscam.com            1979          1980              1973

The content extraction algorithm for both thread and post records achieves excellent results in the different forums. The average recall and precision rates for the thread extraction are 99.99% and 100% respectively. The unextracted threads are due to their short thread titles not satisfying the heuristic condition of at least 8 characters. The average recall and precision rates for the post extraction are 99.97% and 99.87% respectively. All the missing posts are from a single list-of-posts page in www.realscam.com having unusual XPaths that differ from those of all other posts in the forum. The false positives are due to false extractions for reference links within the user-generated post content. The excellent result is due to the consistent structure present in pages within each forum site. Hence, we believe that reference links, together with the XPath information, serve as useful indicators for extracting post records in forum sites.

Although the extraction quality can be thoroughly verified manually in the above experiment, the relatively small sample of pages may not be sufficiently representative and the results are very sensitive to the inclusion or exclusion of a few pages. This is especially true when extracting at the site level, where a high recall in detecting threads is necessary. Hence, further experiments were done to extract the content from the entire websites. The correct number of threads and posts in the forums is calculated (or approximated) using the statistics provided on the board page of the forums. On some forums, we observed fewer threads and posts than the statistics provided on the website. In such cases, whenever detected, we adjusted the 'Total threads' and 'Total posts' fields in Tables III and IV accordingly. These are compared against the number of extracted threads (and posts) and the number of true positive threads (and posts).

Table III summarizes the quantitative results obtained for the extraction of threads, indicating a high precision rate close to 100% for all the forums. Only a single false positive thread is identified for 419legal.org. The false positive is due to a link having a similar signature and keyword to the thread reference URLs, while also occurring in a list-of-thread page. Due to the hierarchical structure of most forum sites, any slight decrease in the recall rate may cause a significantly large number of subsequent missing posts during the extraction.

Hence, it is clear that a good recall rate is essential to having a reliable forum crawler. To this end, the experiments show a recall rate in the range of 99.88% to 100%, extracting most of the links to the list-of-posts pages in the forums. Table IV summarizes the results obtained for the extraction of posts. The precision for the sites ranges from 99.90% to 100%. Further investigation indicated that the false positives are caused by wrong reference link selection and the existence of reference links within the posts. Although heuristics such as the elimination of spurious links and a threshold on the amount of anchor text are used, a few false positives still remain.


The recall rate is in the range of 97.71% to 100% for all the forums used in the experiments. Several reasons for missing posts in the different forums have been identified. In exposeascam.com and realscam.com, the main reason for missing posts is that the reference URLs or page-flipping links in a page do not match the learnt reference URL format and dominant XPath. In exposeascam.com, only a few page-flipping links exist for the post pages. This prevents the page-flipping links' XPath from being learnt, as no post page-flipping links are seen during training, causing the crawler to skip posts in a few of the page-flipping pages.


In realscam.com, some of the posts are missed due to the presence of polls at the top of the pages (e.g. www.realscam.com/f34/what-soapboxmoms-hottest-shot-285), causing a mismatch between the XPaths of the post reference links and page-flipping links and the dominant XPaths obtained during the learning process. Another reason for the deviation of the extracted statistics is the non-existence of some pages on some forums (e.g. some of the flipped-skeleton post pages do not exist in scam.com and users are redirected to a previous flipped-skeleton page).

TABLE III. CONTENT EXTRACTION FOR LIST-OF-THREAD PAGES

Forum                   Total threads   Extracted threads   True positives   Precision   Recall
419legal.org            100730          100731              100730           99.99       100
exposeascam.com         3132            3132                3132             100         100
movingscam.com          16898           16898               16898            100         100
scam.com                ∼16898          49042               49042            100         99.88
forum.scampatrol.org    872             872                 872              100         100
scamwarners.com         53328           53328               53328            100         100
scamvictimsunited.com   3009            3009                3009             100         100
scamfound.com           253949          253949              253949           100         100
realscam.com            1354            1354                1354             100         100

TABLE IV. CONTENT EXTRACTION FOR LIST-OF-POSTS PAGES

Forum                   Total posts   Extracted posts   True positives   Precision   Recall
419legal.org            107449        107455            107413           99.96       99.97
exposeascam.com         3770          3692              3692             100         97.93
movingscam.com          123150        123096            123096           100         99.96
scam.com                ∼817000       798841            798303           99.93       97.71
forum.scampatrol.org    1741          1741              1741             100         100
scamwarners.com         119194        119194            119194           100         100
scamvictimsunited.com   16554         16546             16546            100         99.95
scamfound.com           255171        255171            255171           100         100
realscam.com            37980         37235             37196            99.90       97.94

VII. CONCLUSION

In this paper, we proposed an efficient method for extracting content from web forums. The proposed algorithm uses lightweight link and XPath properties, in contrast to existing methods which use relatively costly pattern recognition based techniques (e.g. visual analysis, partial tree alignment). Experiments on an initial small set of sample pages indicate high recall and precision rates, and the extraction capability remains reasonably good in further experiments on the full forum sites.

REFERENCES

[1] F. Sun, D. Song, and L. Liao, "Dom based content extraction via text density," International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 245–254, 2011.
[2] W. W. Cohen, M. Hurst, and L. S. Jensen, "A flexible learning system for wrapping tables and lists in html documents," WWW Conference, pp. 232–241, 2002.
[3] N. Kushmerick, "Wrapper induction: efficiency and expressiveness," Artificial Intelligence - Special Issue on Intelligent Internet Systems, vol. 118, no. 1-2, pp. 15–68, 2000.
[4] I. Muslea, S. Minton, and C. Knoblock, "A hierarchical approach to wrapper induction," Annual Conference on Autonomous Agents, pp. 190–197, 1999.
[5] S. Zheng, R. Song, J.-R. Wen, and D. Wu, "Joint optimization of wrapper generation and template detection," ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 894–902, 2007.
[6] J.-M. Yang, R. Cai, Y. Wang, J. Zhu, L. Zhang, and W.-Y. Ma, "Incorporating site-level knowledge to extract structured data from web forums," WWW Conference, pp. 181–190, 2009.
[7] W. Gatterbauer, P. Bohunsky, M. Herzog, B. Krupl, and B. Pollak, "Towards domain-independent information extraction from web tables," WWW Conference, pp. 71–80, 2007.
[8] J. Zhu, Z. Nie, J.-R. Wen, B. Zhang, and W.-Y. Ma, "Simultaneous record detection and attribute labeling in web data extraction," ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 494–503, 2006.
[9] M. Asfia, M. M. Pedram, and A. M. Rahmani, "Main content extraction from detailed web pages," International Journal of Computer Applications, vol. 4, no. 11, pp. 18–21, August 2010.
[10] A. Sachan, W. Y. Lim, and V. L. L. Thing, "A generalized links and text properties based forum crawler," IEEE/WIC/ACM Web Intelligence Conference, 2012.
[11] R. Cai, J.-M. Yang, W. Lai, Y. Wang, and L. Zhang, "irobot: An intelligent crawler for web forums," WWW Conference, pp. 447–456, 2008.
[12] Y. Wang, J.-M. Yang, W. Lai, R. Cai, L. Zhang, and W.-Y. Ma, "Exploring traversal strategy for web forum crawling," ACM SIGIR International Conference on Research and Development in Information Retrieval, pp. 459–466, 2008.
[13] H.-M. Ying and V. L. L. Thing, "An enhanced intelligent forum crawler," IEEE Symposium on Computational Intelligence for Security and Defence Applications, 2012.
[14] J. Zhu, Z. Nie, J.-R. Wen, B. Zhang, and W.-Y. Ma, "2d conditional random fields for web information extraction," International Conference on Machine Learning, pp. 1044–1051, 2005.
[15] D. Buttler, L. Liu, and C. Pu, "A fully automated object extraction system for the world wide web," IEEE International Conference on Distributed Computing Systems, pp. 361–370, 2001.
[16] B. Liu, R. Grossman, and Y. Zhai, "Mining data records from web pages," ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 601–606, 2003.
[17] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
[18] S. Pretzsch, K. Muthmann, and A. Schill, "Fodex – towards generic data extraction from web forums," 26th International Conference on Advanced Information Networking and Applications Workshops (WAINA), pp. 821–826, 2012.
