On the Automatic Construction of Regular Expressions from Examples (GP vs. Humans 1-0)

Alberto Bartoli
DIA - University of Trieste, Italy
[email protected]

Andrea De Lorenzo
DIA - University of Trieste, Italy
[email protected]

Eric Medvet
DIA - University of Trieste, Italy
[email protected]

Fabiano Tarlao
DIA - University of Trieste, Italy
[email protected]

ABSTRACT

Regular expressions are systematically used in a number of different application domains. Writing a regular expression for solving a specific task is usually quite difficult, requiring significant technical skills and creativity. We have developed a tool based on Genetic Programming capable of constructing regular expressions for text extraction automatically, based on examples of the text to be extracted. We have recently demonstrated that our tool is human-competitive in terms of both accuracy of the regular expressions and time required for their construction. We base this claim on a large-scale experiment involving more than 1700 users on 10 text extraction tasks of realistic complexity. The F-measure of the expressions constructed by our tool was almost always higher than the average F-measure of the expressions constructed by each of the three categories of users involved in our experiment (Novice, Intermediate, Experienced). The time required by our tool was almost always smaller than the average time required by each of the three categories of users. The experiment is described in full detail in “Can a machine replace humans? A case study. IEEE Intelligent Systems, 2016”.

Keywords

Regular Expressions; Entity Extraction; Users Evaluation

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). GECCO’16 Companion, July 20-24, 2016, Denver, CO, USA. © 2016 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-4323-7/16/07. DOI: http://dx.doi.org/10.1145/2908961.2930946

1. INTRODUCTION

Regular expressions are systematically used in a number of different application domains. Writing a regular expression is often a complex endeavor requiring significant technical skills and creativity. Over the years, a wealth of research efforts have considered the problem of constructing a regular expression automatically based on examples of the desired behavior. This problem may be cast in several ways, depending on the intended usage of the regular expression and on the nature of the input data (a systematic literature analysis can be found in [6, 5]). The intended usage may be either binary classification of input items or extraction of chunks from a (possibly very long) input item. In nearly all efforts the constructed expression is expected to generalize a pattern from the available examples, although there have also been proposals aimed at binary classification of input items in two predefined lists, without any generalization requirement [3]. Concerning the nature of input data, there have been proposals focusing on input items expressed in a formal language, on input items consisting of text lines, and on input items consisting of an unstructured text stream.

In our multi-year research activity on this topic we have developed several proposals based on Genetic Programming (GP) for automatic construction of regular expressions for text extraction from an unstructured stream. We represent a candidate solution (regular expression) as an abstract syntax tree assembled with the regular expression constructs and we evolve a population of candidate solutions with a multiobjective optimization algorithm in which the fitness of each candidate solution quantifies its accuracy on the available examples (to be maximized) and its length (to be minimized).

Our activity may be summarized in two epochs: a first tool, which improved substantially over the earlier state of the art and demonstrated the ability to address tasks of realistic complexity effectively [1, 2]; and a second tool [6], which improved on the first in several respects, including support for the OR operator based on a form of separate-and-conquer search strategy [4], support for a broader set of regular expression constructs capable of addressing context-dependent extractions, and a more sophisticated fitness definition that delivers better F-measure and supports potentially unbounded input items.

We have recently demonstrated that our tool is human-competitive in terms of both quality of solutions and time required for their construction. We base this claim on a large-scale experiment involving more than 1700 users on 10 text extraction tasks of realistic complexity. The experiment is described in full detail in [5].
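The representation outlined in the introduction (a regular expression as an abstract syntax tree, scored on accuracy and on length) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and chosen for illustration only: the `Node` class, the four node kinds, the Jaccard overlap used as an accuracy proxy, and the toy text are assumptions, not the tool's actual constructs or fitness definition, which are described in [6]. The sketch shows only the multiobjective comparison, not the evolutionary search itself.

```python
import re
from dataclasses import dataclass
from typing import List, Set, Tuple

Span = Tuple[int, int]
Objectives = Tuple[float, int]  # (accuracy to maximize, length to minimize)


@dataclass(frozen=True)
class Node:
    """One node of a candidate's syntax tree (hypothetical, simplified)."""
    kind: str                          # "leaf", "concat", "plus", or "or"
    value: str = ""                    # regex fragment; used by leaves only
    children: Tuple["Node", ...] = ()


def render(node: Node) -> str:
    """Turn a tree into the regular expression string it represents."""
    if node.kind == "leaf":
        return node.value
    parts = [render(child) for child in node.children]
    if node.kind == "concat":
        return "".join(parts)
    if node.kind == "plus":
        return "(?:" + parts[0] + ")+"
    if node.kind == "or":
        return "(?:" + "|".join(parts) + ")"
    raise ValueError(f"unknown node kind: {node.kind}")


def objectives(node: Node, text: str, wanted: Set[Span]) -> Objectives:
    """Score a candidate: span overlap with the annotations, and length."""
    pattern = render(node)
    found = {(m.start(), m.end()) for m in re.finditer(pattern, text)}
    union = found | wanted
    # Assumption: Jaccard overlap of exact spans stands in for "accuracy".
    accuracy = len(found & wanted) / len(union) if union else 1.0
    return accuracy, len(pattern)


def dominates(p: Objectives, q: Objectives) -> bool:
    """True if p is no worse than q on both objectives and better on one."""
    no_worse = p[0] >= q[0] and p[1] <= q[1]
    strictly_better = p[0] > q[0] or p[1] < q[1]
    return no_worse and strictly_better


def pareto_front(candidates: List[Node], text: str,
                 wanted: Set[Span]) -> List[Node]:
    """Keep the candidates that no other candidate dominates."""
    scored = [(c, objectives(c, text, wanted)) for c in candidates]
    return [c for c, s in scored
            if not any(dominates(t, s) for _, t in scored)]


# Toy example: the two annotated snippets are "555-1234" and "555-9876".
TEXT = "call 555-1234 or 555-9876 by 2016"
WANTED = {(5, 13), (17, 25)}

DIGITS = Node("plus", children=(Node("leaf", r"\d"),))
CANDIDATE_A = DIGITS                                   # short but inaccurate
CANDIDATE_B = Node("concat", children=(DIGITS, Node("leaf", "-"), DIGITS))
CANDIDATE_C = Node("concat", children=(CANDIDATE_B, Node("leaf", r"\b")))

FRONT = pareto_front([CANDIDATE_A, CANDIDATE_B, CANDIDATE_C], TEXT, WANTED)
```

In this toy setting `CANDIDATE_C` is dropped because `CANDIDATE_B` extracts the same spans with a shorter expression, while `CANDIDATE_A` survives only because it is the shortest. Neither of the two survivors dominates the other, which is the trade-off a multiobjective algorithm has to manage.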


We developed a web application containing a suite of extraction tasks and challenged users to test their skills with a Reddit post¹. Each task consisted of a piece of unstructured text annotated with the portions to be extracted (unannotated portions were not to be extracted)². The annotated portions of a task described a certain task-specific pattern: URLs (in two datasets of different nature), phone numbers, HTML href attributes, IP addresses, MAC addresses, HTML headings, HTML heading content (i.e., excluding the delimiting HTML tags), author names in BibTeX entries, and name of lead author in bibliographic lists. Each user was asked to self-classify their proficiency in regular expressions as Novice, Intermediate, or Experienced. We measured the time spent by each user on each task and assessed the F-measure of each constructed expression on a separate testing set. Next, we executed our tool on the very same tasks; the results are summarized below.
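To make the evaluation metric concrete, here is a minimal Python sketch of how the F-measure of an expression could be computed on annotated text. It assumes that an extraction counts as correct only when its span coincides exactly with an annotated span. The helper names (`span_of`, `extracted_spans`, `f_measure`) and the toy text are hypothetical, and the exact procedure used in the experiment is the one described in [5].

```python
import re
from typing import Set, Tuple

Span = Tuple[int, int]


def span_of(text: str, snippet: str) -> Span:
    """Locate the first occurrence of an annotated snippet in the text."""
    start = text.index(snippet)
    return start, start + len(snippet)


def extracted_spans(pattern: str, text: str) -> Set[Span]:
    """Spans of all non-empty matches of the pattern in the text."""
    return {(m.start(), m.end())
            for m in re.finditer(pattern, text) if m.end() > m.start()}


def f_measure(found: Set[Span], wanted: Set[Span]) -> float:
    """Harmonic mean of precision and recall over exact span matches."""
    true_positives = len(found & wanted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(found)
    recall = true_positives / len(wanted)
    return 2 * precision * recall / (precision + recall)


# Toy testing set: two IP addresses are annotated, "1.2.3" is not.
TEXT = "a 10.0.0.1 b 192.168.1.20 c 1.2.3"
WANTED = {span_of(TEXT, "10.0.0.1"), span_of(TEXT, "192.168.1.20")}

LOOSE = r"\d+(?:\.\d+)+"       # also extracts "1.2.3"
STRICT = r"(?:\d+\.){3}\d+"    # requires exactly four dotted groups

LOOSE_F = f_measure(extracted_spans(LOOSE, TEXT), WANTED)    # about 0.8
STRICT_F = f_measure(extracted_spans(STRICT, TEXT), WANTED)  # 1.0
```

The loose pattern finds both annotated addresses (recall 1) but adds one spurious extraction (precision 2/3), which is why its F-measure falls below that of the stricter pattern.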

• For each task (except for one), the time spent by our tool for constructing a regular expression was much smaller than the average time required by each category of users. The only task in which our tool required more time than human operators was HTML heading content. However, on this task our tool delivered significantly better F-measure than all three categories of human operators.
• For each task (except for one), the F-measure of the regular expression constructed by our tool was higher than the average F-measure obtained by each category of users. The only task in which our tool delivered smaller (and unsatisfactory) F-measure was extraction of phone numbers. The reason is that the training data did not adequately describe text that looks like a phone number but is not one. Humans were able to infer the general pattern appropriately from the available examples while our tool was not. With a larger training set, though, our tool was able to obtain F-measure comparable to or better than that of human operators.

Our work is significant, we believe, for at least two reasons. First, there is no other tool for automatic construction of regular expressions capable of delivering human-competitive performance on tasks of realistic complexity. Second, it demonstrates the power of GP on a difficult synthesis problem, requiring technical skills and creativity.

Our recent reference work describing the internals of the tool in full detail includes an experimental comparison to other methods for learning syntactical patterns that are not specified as a regular expression [6]: specifically, a method for learning deterministic finite automata [8] and a method included in Windows PowerShell for synthesizing programs in a specialized data extraction language [7]. The comparison demonstrates a clear superiority of our approach on the text extraction tasks considered in our analysis (method [7] can address multifield extraction tasks, i.e., extraction of multiple heterogeneous patterns; our tool can only address those tasks as multiple, independent tasks). The sources of our tool are publicly available on GitHub and a prototype is available online³.

¹ https://www.reddit.com/r/programming/comments/3eblji/how_good_are_you_in_writing_regex_challange/
² A plain and concise description of the web app can be found at http://www.i-programmer.info/news/204-challenges/9586-machine-learning-labs-regular-expression-game.html
³ https://github.com/MaLeLabTs/RegexGenerator and http://regex.inginf.units.it/

2. REFERENCES

[1] A. Bartoli, G. Davanzo, A. De Lorenzo, M. Mauri, E. Medvet, and E. Sorio. Automatic generation of regular expressions from examples with genetic programming. In Proceedings of the 14th Annual Conference Companion on Genetic and Evolutionary Computation, GECCO ’12, pages 1477–1478, New York, NY, USA, 2012. ACM.
[2] A. Bartoli, G. Davanzo, A. De Lorenzo, E. Medvet, and E. Sorio. Automatic synthesis of regular expressions from examples. Computer, 47(12):72–80, Dec 2014.
[3] A. Bartoli, A. De Lorenzo, E. Medvet, and F. Tarlao. Playing regex golf with genetic programming. In Proceedings of the 2014 Conference on Genetic and Evolutionary Computation, GECCO ’14, pages 1063–1070, New York, NY, USA, 2014. ACM.
[4] A. Bartoli, A. De Lorenzo, E. Medvet, and F. Tarlao. Learning text patterns using separate-and-conquer genetic programming. In 18th European Conference on Genetic Programming. Springer Verlag, 2015.
[5] A. Bartoli, A. De Lorenzo, E. Medvet, and F. Tarlao. Can a machine replace humans in building regular expressions? A case study. IEEE Intelligent Systems, 2016. To appear.
[6] A. Bartoli, A. De Lorenzo, E. Medvet, and F. Tarlao. Inference of regular expressions for text extraction from examples. IEEE Transactions on Knowledge and Data Engineering, 28(5):1217–1230, May 2016.
[7] V. Le and S. Gulwani. FlashExtract: A framework for data extraction by examples. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, page 55. ACM, 2014.
[8] S. M. Lucas and T. J. Reynolds. Learning deterministic finite automata with a smart state labeling evolutionary algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(7):1063–1074, 2005.
